Compare commits


499 Commits

Author SHA1 Message Date
21d2d758fa flag control mkldnn fusion static shape 2024-05-10 14:55:02 +08:00
fcbf2b61e6 Memoize local_scalar_dense calls, refactor all memos (#125623)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125623
Approved by: https://github.com/eellison
2024-05-10 01:52:55 +00:00
8be4104cf3 Update conda to latest version for Docker release builds (#125887)
Fixes https://github.com/pytorch/pytorch/issues/125879

Issue is somewhat similar to this issue: https://github.com/pytorch/pytorch/issues/106470
Running:
```
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch-nightly -c nvidia
```
pulls the CPU versions of pytorch, torchvision, and torchaudio here: https://github.com/pytorch/pytorch/actions/runs/9014006158/job/24795924934#step:11:6849
```
#16 37.21     mpmath-1.2.1               |          py311_0         1.2 MB  pytorch-nightly
#16 37.21     nettle-3.7.3               |       hbbd107a_1         809 KB
#16 37.21     networkx-3.1               |  py311h06a4308_0         3.3 MB
#16 37.21     openh264-2.1.1             |       h4ff587b_0         711 KB
#16 37.21     pillow-9.3.0               |  py311h3fd9d12_2         874 KB  pytorch-nightly
#16 37.21     pytorch-2.4.0.dev20240509  |     py3.11_cpu_0        87.1 MB  pytorch-nightly
#16 37.21     pytorch-cuda-12.1          |       ha16c6d3_6           7 KB  pytorch-nightly
#16 37.21     pytorch-mutex-1.0          |              cpu           3 KB  pytorch-nightly
#16 37.21     sympy-1.12                 |  py311h06a4308_0        14.4 MB
#16 37.21     torchaudio-2.2.0.dev20240509|        py311_cpu         5.1 MB  pytorch-nightly
#16 37.21     torchvision-0.19.0.dev20240509|        py311_cpu         7.3 MB  pytorch-nightly
```
Updating conda to latest and rebuilding solved this issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125887
Approved by: https://github.com/huydhn
2024-05-10 01:43:59 +00:00
d14d6127f6 [BE] Rename macos-12 to macos-13/macos- jobs (#125859)
As CI does not have any macOS 12 runners anymore.
Clean up misleading references to cross-compilation, as M1 builds have been done natively for quite some time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125859
Approved by: https://github.com/ZainRizvi
2024-05-10 01:30:29 +00:00
2ad794550a Support generic stream/event on XPU backend (#125751)
# Motivation
As described in [#123611](https://github.com/pytorch/pytorch/pull/123611), this adds support for generic stream/event on the XPU backend.

# Additional Context
New methods/attributes on `torch.Event` for XPU:
- torch.Event.event_id
- torch.Event.elapsed_time
- torch.Event.synchronize

New methods on `c10::Event` for the XPU backend:
- c10.Event.event_id
- c10.Event.elapsed_time
- c10.Event.synchronize
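
For reference, a hedged usage sketch of the new methods listed above. This is not taken from the PR; it assumes an XPU-enabled build with at least one device and that the generic `torch.Event` constructor accepts a device string plus `enable_timing`:
```python
import torch

# Hedged sketch, not from the PR: assumes an XPU-enabled build, at least one
# XPU device, and that torch.Event(device=..., enable_timing=...) is accepted.
if torch.xpu.is_available():
    start = torch.Event(device="xpu", enable_timing=True)
    end = torch.Event(device="xpu", enable_timing=True)
    x = torch.randn(1024, 1024, device="xpu")
    start.record()
    y = x @ x
    end.record()
    end.synchronize()                              # new: Event.synchronize
    print("event id:", end.event_id)               # new: Event.event_id
    print("elapsed ms:", start.elapsed_time(end))  # new: Event.elapsed_time
```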

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125751
Approved by: https://github.com/jgong5, https://github.com/albanD
2024-05-10 01:27:30 +00:00
d19d932183 update pointwise cat heuristics (#125772)
Fix for https://github.com/pytorch/pytorch/issues/122871. There are two cases where we emit pointwise cat:

- fusing into a pointwise use
- horizontally fusing copy_ kernels

The regression I looked into previously was due to being overly aggressive in the latter case. I've updated the logic there so that we only emit the horizontal fusion in the case that we would have to emit separate copy_ kernels anyway.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125772
Approved by: https://github.com/Chillee
2024-05-10 01:07:39 +00:00
978b572652 Add registration API for torch.compile-eager (#121387)
This PR is a follow-up of RFC https://github.com/pytorch/pytorch/issues/115545.

In this PR, we intend to provide a registration API dedicated to eager-through-torch.compile. The major workflow of this API will be as follows.

- Load cache
- Check cache according to the input tensors
  - Cache Hit: Run the cached kernel directly
  - Cache Miss: Run AOTI to produce the kernel and run it. If AOTI fails to produce the kernel, invoke the Python fallback function.

Currently, this PR always falls back to the Python kernel; the cache mechanism will be implemented in another PR - https://github.com/pytorch/pytorch/pull/116368
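
A self-contained, illustrative sketch of the cache-hit/miss workflow described above; the names and the AOTI stub below are hypothetical stand-ins, not the PR's actual API, and the stub mirrors the current always-fallback behavior:
```python
import torch

# Illustrative sketch only; names are hypothetical, not the PR's API.
_kernel_cache = {}

def _compile_with_aoti(op, tensors):
    # Placeholder for the AOTI compilation step; raising here mirrors the
    # current behavior of always falling back to the python kernel.
    raise NotImplementedError

def eager_through_compile(op, fallback_fn, *tensors):
    key = (op, tuple((t.dtype, tuple(t.shape)) for t in tensors))  # check cache by inputs
    kernel = _kernel_cache.get(key)
    if kernel is None:                            # cache miss
        try:
            kernel = _compile_with_aoti(op, tensors)
        except Exception:
            return fallback_fn(*tensors)          # python fallback
        _kernel_cache[key] = kernel
    return kernel(*tensors)                       # cache hit

print(eager_through_compile("add", torch.add, torch.ones(2), torch.ones(2)))
```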

Differential Revision: [D57164385](https://our.internmc.facebook.com/intern/diff/D57164385)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121387
Approved by: https://github.com/desertfire, https://github.com/jansel, https://github.com/zou3519, https://github.com/jgong5
2024-05-10 00:30:27 +00:00
c9a258e474 [export] handle constant aliasing for export (#125509)
Summary: Currently export will [error out](2b5ae2611e/torch/export/_trace.py (L477)) if a constant is aliased. This PR adds support for aliasing by modifying ConstantAttrMap to map each constant to a list of FQNs instead of a single FQN, and by populating the ExportedProgram constants dict with multiple entries for the same constant.
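
For illustration, a hypothetical module exhibiting the aliasing described above; the attribute names and the expectation on `ep.constants` are assumptions, not taken from the PR's test:
```python
import torch

# Hypothetical example: two attributes alias the same constant tensor, so
# export needs to record more than one FQN for a single constant.
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        const = torch.randn(2, 2)
        self.a = const
        self.b = const            # alias of the same constant
    def forward(self, x):
        return x + self.a + self.b

ep = torch.export.export(M(), (torch.randn(2, 2),))
# With this fix, the constants dict is expected to carry entries for both
# aliases rather than erroring out.
print(list(ep.constants.keys()))
```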

Test Plan: added test case in test_export.py

Differential Revision: D56955654

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125509
Approved by: https://github.com/angelayi, https://github.com/ydwu4
2024-05-10 00:14:37 +00:00
fd816bf630 Add script for removing Inductor dependencies from Inductor generated code (#125811)
Usage:
```bash
TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 python foo.py
TORCHINDUCTOR_DUMP_LAUNCH_PARAMS=1 python /tmp/torchinductor_chilli/js/cjsbczkf6fj36nhaxxypll6cy4fmwmkoauklrgrvuody2mn7oeef.py
python remove_inductor_deps.py /tmp/torchinductor_chilli/js/cjsbczkf6fj36nhaxxypll6cy4fmwmkoauklrgrvuody2mn7oeef.py
```

Example generated code: https://pastebin.com/m6Ae8heB

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125811
Approved by: https://github.com/chenyang78
2024-05-10 00:00:25 +00:00
3267814d53 [inductor] refactor: device dispatch inside do_bench (#125736)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125736
Approved by: https://github.com/shunting314
2024-05-09 23:50:02 +00:00
13545fe68a [export] Don't create a new fake mode if dynamo tracing (#125185)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125185
Approved by: https://github.com/mikekgfb
2024-05-09 23:43:08 +00:00
23e71ffd82 Remove unused caffe2 subdirs (#125818)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125818
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-05-09 22:57:55 +00:00
350a3ed82f Fix unused variable 'kEps' (#125870)
Summary:
> fbcode/caffe2/caffe2/utils/math_gpu_test.cc:227:17: error: unused variable 'kEps' [-Werror,-Wunused-const-variable]

See https://www.internalfb.com/intern/test/844425000398735?ref_report_id=0

Created from CodeHub with https://fburl.com/edit-in-codehub

Test Plan: Sandcastle run

Reviewed By: r-barnes

Differential Revision: D56731004

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125870
Approved by: https://github.com/seemethere, https://github.com/atalman
2024-05-09 22:57:37 +00:00
477612c0f6 [dynamo] Clear GenerationTracker on dynamo reset (#125855)
Fixes https://github.com/pytorch/pytorch/issues/125567

Not doing this causes modules to be unspecialized when tests run in sequence, and specialized when run alone.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125855
Approved by: https://github.com/jansel
2024-05-09 22:47:54 +00:00
52fad83335 [onnx.export] Avoid linear look up in env for exist_in_env (#124909)
This PR is part of a series of PRs to significantly speed up torch.onnx.export for models with many nodes (e.g. LLM). See #121422 for more analysis.

- As part of torch.onnx.export, a reverse look-up is made in env. This is done for each node, and the look-up cost is proportional to the graph size, which incurs an overall O(N^2) time complexity.
- A pragmatic solution is simply to keep a separate data structure to make this de facto constant time, so this introduces a set containing all the values of env. Open to other ideas. Ideally `exist_in_env` wouldn't be needed at all, but to preserve current behavior exactly I'm not sure how that can be done.
- Resolves (4) in #121422.
- This code change and the choice of py::set looks a bit more natural on top of #123063, where the env is changed from a std::unordered_map to a py::dict.

Partially fixes #121422
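
A minimal sketch of the set-alongside-the-dict idea described above; the names are illustrative, not the exporter's real data structures:
```python
# Illustrative sketch: keep a set of env's values so the reverse
# "does this value exist?" check is O(1) instead of a linear scan.
env = {}            # node -> value
env_values = set()  # mirrors env.values()

def add_to_env(node, value):
    env[node] = value
    env_values.add(value)

def exist_in_env(value):
    return value in env_values   # previously roughly an O(N) scan over env's values

add_to_env("node_0", "value_0")
print(exist_in_env("value_0"), exist_in_env("value_1"))  # True False
```
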
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124909
Approved by: https://github.com/srikris-sridhar, https://github.com/justinchuby
2024-05-09 22:38:00 +00:00
37d2ecd123 Only log toplevel torchscript calls. (#125714)
Summary: as title.

Test Plan: CI

Reviewed By: gmagogsfm

Differential Revision: D57069719

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125714
Approved by: https://github.com/SherlockNoMad
2024-05-09 22:29:53 +00:00
e43d656921 FakeTensor speedup: minor cleanups (#124224)
A few cleanup tasks that didn't really fit into the other diffs in this stack.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124224
Approved by: https://github.com/oulgen
ghstack dependencies: #122911, #124223
2024-05-09 22:11:51 +00:00
a08be4b705 FakeTensor speedup: Split cache_key so we only validate once (#124223)
When dispatching a fake tensor op we cache the result with `(op, args)` as the key. There are some args (such as ones with a dynamic output shape) where the output can't be cached. Instead of validating the args every time we compute the cache key, we only validate the args when we first see a new cache key.

18.3% FakeTensor perf win on the microbenchmark (21.7% cumulative)
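
An illustrative sketch (not FakeTensor's actual code) of the validate-once-per-key idea:
```python
# Validation runs only the first time a key is seen, not on every cache hit.
_cache = {}

def cached_dispatch(key, validate, compute):
    if key not in _cache:
        validate(key)              # may raise and refuse to cache, e.g. dynamic output shape
        _cache[key] = compute(key)
    return _cache[key]

print(cached_dispatch(("add", (2, 2)), lambda k: None, lambda k: "result"))
```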

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124223
Approved by: https://github.com/oulgen, https://github.com/masnesral
ghstack dependencies: #122911
2024-05-09 22:11:51 +00:00
6a8b1da18d FakeTensor speedup: Delay formatting stack trace until it's actually asked for. (#122911)
When constructing a `FakeTensorMode`, instead of immediately formatting a full stack trace, grab the traceback and only format it on demand.

4.2% FakeTensor perf win on the microbenchmark.

```
import time
import torch
import torch._dynamo as dynamo
from torch._subclasses.fake_tensor import FakeTensorMode
import numpy as np

def toy_example(a, b):
    x = a / (torch.abs(a) + 1)
    b = b * -1
    return x * b

def run_test1():
    dynamo.reset()
    j = [1, 2, 3]
    toy_example(torch.randn(j), torch.randn(j))

def run_test2():
    dynamo.reset()
    j = [1, 2, 3]
    with FakeTensorMode():
        toy_example(torch.randn(j), torch.randn(j))

ITERATIONS = 500000
FORMAT_STRING = "{name:12}: TOT: {tot:10.3f}, AVG: {avg:10.3f}, MIN: {min:10.3f}, P50: {p50:10.3f}, P90: {p90:10.3f}, P99: {p99:10.3f}"

def run_tests(name, step):
    step()
    timings = []
    start = time.time()
    for i in range(ITERATIONS):
        a = time.perf_counter_ns()
        step()
        b = time.perf_counter_ns()
        timings.append(b - a)
    end = time.time()
    fmt = {
        "best": min(timings),
        "tot": end - start,
        "avg": np.average(timings),
        "min": min(timings),
        "p50": np.percentile(timings, 50),
        "p90": np.percentile(timings, 90),
        "p99": np.percentile(timings, 99)
    }
    print(FORMAT_STRING.format(name=name, **fmt))
    return fmt

ts = run_tests("tensor", run_test1)
fs = run_tests("fake tensor", run_test2)
ratio = {k: a / b for ((k, a), (_, b)) in zip(fs.items(), ts.items())}
print(FORMAT_STRING.format(name="ratio", **ratio))
```
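
For reference, a minimal sketch of the lazy-formatting pattern described above (illustrative, not the actual `FakeTensorMode` code):
```python
import traceback

# Capture the raw frames cheaply up front; format the expensive string only
# when it is actually requested.
class LazyStackTrace:
    def __init__(self):
        self._frames = traceback.extract_stack()
        self._formatted = None

    @property
    def text(self):
        if self._formatted is None:                       # format on demand
            self._formatted = "".join(traceback.format_list(self._frames))
        return self._formatted

trace = LazyStackTrace()   # cheap to construct
print(trace.text[:80])     # formatting happens here
```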

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122911
Approved by: https://github.com/oulgen, https://github.com/eellison
2024-05-09 22:11:51 +00:00
eaaf0f3299 Print capture_pre_autograd_graph warning only once (#125848)
Summary: Print this warning only once to avoid flooding the logs of workflows where this is called frequently.

Test Plan: CI

Differential Revision: D57163341

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125848
Approved by: https://github.com/zhxchen17
2024-05-09 22:04:05 +00:00
20271f0a3b Drop caffe2-linux-jammy-py3_8-gcc11-build (#125857)
Removes more caffe2 testing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125857
Approved by: https://github.com/albanD, https://github.com/Skylion007
2024-05-09 21:52:27 +00:00
ae5e2ab92e [dynamo][fsdp] Use Tensor match for FSDP modules (#125827)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125827
Approved by: https://github.com/yf225, https://github.com/jansel
ghstack dependencies: #125828, #125805
2024-05-09 21:26:15 +00:00
0d4fdb0bb7 Revert "[ROCm] amdsmi library integration (#119182)"
This reverts commit 85447c41e32b1e43a025ea19ac812a0c7f88ff57.

Reverted https://github.com/pytorch/pytorch/pull/119182 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the ROCm failed test is legit 85447c41e3 ([comment](https://github.com/pytorch/pytorch/pull/119182#issuecomment-2103433197))
2024-05-09 21:18:21 +00:00
966ebd2e24 Add --warm-start-latency to benchmark harness (#125353)
Summary: This change introduces a new flag to perform a "warm start" test from the benchmark harness. The idea is to test a model twice: first with a fresh inductor cache (i.e., a "cold start"), and then a second run in a fresh process with the cache available (i.e. a "warm start"). We can later add this mode to CI runs to collect compile times for warm start.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125353
Approved by: https://github.com/eellison, https://github.com/desertfire
2024-05-09 21:12:15 +00:00
ee00349780 [dynamo][logs] move recompilation reason within compile_id scope (#125805)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125805
Approved by: https://github.com/ezyang
ghstack dependencies: #125828
2024-05-09 20:37:23 +00:00
a7575e8bd5 [dynamo] Use correct source for custom getattr (#125828)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125828
Approved by: https://github.com/williamwen42
2024-05-09 20:37:23 +00:00
7c00635125 [CI] Move gha artifact download before xml parsing for test stat uploads (#125609)
Move the GHA artifact download to before any XML parsing is done for upload-test-stats.

Do not download GHA artifacts during XML parsing, since they got uploaded to S3 in the step above and will be downloaded when all the artifacts are downloaded from S3.

The previous method resulted in duplicates if you run the script again.

TODO: write a deduper so we don't have to worry at all
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125609
Approved by: https://github.com/huydhn
2024-05-09 20:35:09 +00:00
1ecea513b6 Fix common_methods_invocations example inputs to _efficient_attention_forward (#125788)
Fixes #120693

This tries to fix the sample input in common_methods_invocations.py:
* I think the arange was intended to skip every other integer in the range. Previously, we'd have one length that was -1.
* The k, v tensors were too small - updated the sizes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125788
Approved by: https://github.com/drisspg, https://github.com/Aidyn-A
2024-05-09 20:08:49 +00:00
6fd745255e Revert "add uuid in cudaDeviceProperties (#125083)"
This reverts commit 3f36145db298f7305b3b4df6c82c9101025a049a.

Reverted https://github.com/pytorch/pytorch/pull/125083 on behalf of https://github.com/izaitsevfb due to Fails internal builds with: no member named 'uuid' in 'hipDeviceProp_t' ([comment](https://github.com/pytorch/pytorch/pull/125083#issuecomment-2103315320))
2024-05-09 19:52:45 +00:00
74a0ef8f8c Enable UFMT format on test/test_package.py test/test_per_overload_api.py (#125834)
Fixes some files in https://github.com/pytorch/pytorch/issues/123062

Run lintrunner on files:
test/test_package.py
test/test_per_overload_api.py

```bash
$ lintrunner -a --take UFMT --all-files
ok No lint issues.
Successfully applied all patches.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125834
Approved by: https://github.com/malfet
2024-05-09 19:48:22 +00:00
ed8a560845 Update Release Calendar for 2.3.1 and 2.4 releases (#125794)
As per:
- https://dev-discuss.pytorch.org/t/pytorch-release-2-4-0-call-for-features/2051
- https://dev-discuss.pytorch.org/t/pytorch-release-2-3-1-planning/2052

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125794
Approved by: https://github.com/malfet
2024-05-09 18:31:52 +00:00
85447c41e3 [ROCm] amdsmi library integration (#119182)
Adds monitoring support for ROCm using amdsmi in place of pynvml.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119182
Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/xw285cornell
2024-05-09 18:21:38 +00:00
0e419b9146 Fix graph partitioner and make runtime assertion work with submodules in export (#125793)
Summary: This fix does three things:

1. When we add inputs from the partitioner to the top-level graph module, we insert them in the order of the partitioner, which is not guaranteed to be the same as the original graph inputs. This PR fixes that.
2. When we replace autograd ops with HOP, we create new submodules and access their outputs via getitem calls. As a result, the previous node names associated with getitem get updated, resulting in the graph being different from the produced graph signature. So I just update the graph signature accordingly.
3. We run the runtime_assertion pass before the autograd HOP pass because otherwise the constraints won't be populated correctly.

Differential Revision: [D57130314](https://our.internmc.facebook.com/intern/diff/D57130314)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125793
Approved by: https://github.com/zhxchen17
2024-05-09 18:13:46 +00:00
98821b3d92 Disable various flaky tests in test_foreach (#125783)
* Similar to #125046
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125783
Approved by: https://github.com/huydhn
2024-05-09 18:08:39 +00:00
ae20f15941 [dynamo] trace through nn parametrize (#125771)
Fix https://github.com/pytorch/pytorch/issues/120914

Example dynamo output graph (from test_nn_parametrize):
```
V0508 11:16:26.687000 140092517021504 torch/_dynamo/output_graph.py:1272] [0/0] [__graph_code] TRACED GRAPH
V0508 11:16:26.687000 140092517021504 torch/_dynamo/output_graph.py:1272] [0/0] [__graph_code]  ===== __compiled_fn_1 =====
V0508 11:16:26.687000 140092517021504 torch/_dynamo/output_graph.py:1272] [0/0] [__graph_code]  /data/users/williamwen/pytorch2/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
V0508 11:16:26.687000 140092517021504 torch/_dynamo/output_graph.py:1272] [0/0] [__graph_code]     def forward(self, L_x_: "f32[10, 10]"):
V0508 11:16:26.687000 140092517021504 torch/_dynamo/output_graph.py:1272] [0/0] [__graph_code]         l_x_ = L_x_
V0508 11:16:26.687000 140092517021504 torch/_dynamo/output_graph.py:1272] [0/0] [__graph_code]
V0508 11:16:26.687000 140092517021504 torch/_dynamo/output_graph.py:1272] [0/0] [__graph_code]         # File: /data/users/williamwen/pytorch2/torch/nn/utils/parametrize.py:275 in forward, code: x = self[0](self.original)
V0508 11:16:26.687000 140092517021504 torch/_dynamo/output_graph.py:1272] [0/0] [__graph_code]         l__self___parametrizations__param___original: "f32[10, 10]" = self.L__self___parametrizations__param___original
V0508 11:16:26.687000 140092517021504 torch/_dynamo/output_graph.py:1272] [0/0] [__graph_code]
V0508 11:16:26.687000 140092517021504 torch/_dynamo/output_graph.py:1272] [0/0] [__graph_code]         # File: /data/users/williamwen/pytorch2/test/dynamo/test_repros.py:4759 in forward, code: return torch.sin(x)
V0508 11:16:26.687000 140092517021504 torch/_dynamo/output_graph.py:1272] [0/0] [__graph_code]         x: "f32[10, 10]" = torch.sin(l__self___parametrizations__param___original);  l__self___parametrizations__param___original = None
V0508 11:16:26.687000 140092517021504 torch/_dynamo/output_graph.py:1272] [0/0] [__graph_code]
V0508 11:16:26.687000 140092517021504 torch/_dynamo/output_graph.py:1272] [0/0] [__graph_code]         # File: /data/users/williamwen/pytorch2/test/dynamo/test_repros.py:4755 in forward, code: return self.param @ x
V0508 11:16:26.687000 140092517021504 torch/_dynamo/output_graph.py:1272] [0/0] [__graph_code]         matmul: "f32[10, 10]" = x @ l_x_;  x = l_x_ = None
V0508 11:16:26.687000 140092517021504 torch/_dynamo/output_graph.py:1272] [0/0] [__graph_code]         return (matmul,)
V0508 11:16:26.687000 140092517021504 torch/_dynamo/output_graph.py:1272] [0/0] [__graph_code]
```
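
For context, a minimal sketch (assumed, loosely following the referenced test) of the kind of parametrized module that produces a graph like the one above:
```python
import torch
import torch.nn as nn
import torch.nn.utils.parametrize as parametrize

class Sin(nn.Module):
    def forward(self, x):
        return torch.sin(x)

class M(nn.Module):
    def __init__(self):
        super().__init__()
        self.param = nn.Parameter(torch.randn(10, 10))
    def forward(self, x):
        return self.param @ x   # self.param is sin(original) once parametrized

m = M()
parametrize.register_parametrization(m, "param", Sin())
compiled = torch.compile(m, backend="eager")
print(compiled(torch.randn(10, 10)).shape)
```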

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125771
Approved by: https://github.com/jbschlosser
ghstack dependencies: #125710, #125724
2024-05-09 17:43:48 +00:00
6ea226b99c Fix DDP no_sync when find_unused_parameters is True (#124193)
Fixes #69031, #42793

This PR fixes the bug introduced in #54981 where parameters used within a `no_sync` scope are not respected when `find_unused_parameters` is set to `True`. The `local_used_map_` and `numGradHooksTriggeredMap_` variables should be updated regardless of the `no_sync` state.

Tested and verified with fairseq2 and wav2vec2 ASR finetuning recipe. All gradients are correctly synced across workers as expected after applying this fix.
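
A hedged usage sketch of the scenario being fixed; it assumes a process group has already been initialized (e.g. the script is launched with torchrun) and is not taken from the PR's tests:
```python
import contextlib
import torch
import torch.nn as nn

# Gradient accumulation under no_sync() with find_unused_parameters=True;
# assumes torch.distributed is already initialized by the launcher.
model = nn.parallel.DistributedDataParallel(
    nn.Linear(8, 8), find_unused_parameters=True
)
batches = [torch.randn(4, 8) for _ in range(4)]
for step, batch in enumerate(batches):
    ctx = model.no_sync() if step % 2 == 0 else contextlib.nullcontext()
    with ctx:
        model(batch).sum().backward()   # even steps: accumulate locally
    # On the other steps, gradients touched inside no_sync must still be
    # reduced across workers correctly, which is what this PR fixes.
```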

Co-authored-by: Kaushik Ram Sadagopan <kaushikram2811@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124193
Approved by: https://github.com/rohan-varma
2024-05-09 17:33:33 +00:00
8fb3ff2a4e Revert "[profiler] enable CUPTI range profiler in build (#125685)"
This reverts commit 2deea9e6e9faf5eacebefa2336861d129c598c99.

Reverted https://github.com/pytorch/pytorch/pull/125685 on behalf of https://github.com/atalman due to Broke nightly ([comment](https://github.com/pytorch/pytorch/pull/125685#issuecomment-2103093237))
2024-05-09 17:28:02 +00:00
26b942c4fc [C10D] Document destroy_process_group usage (#122358)
This API was not documented. It has already been a source of confusion,
but recently has become more urgent as improper destruction can lead to
hangs due to ncclCommAbort's requirement of being called collectively.
<img width="888" alt="image" src="https://github.com/pytorch/pytorch/assets/4984825/9e16342d-1108-4d7d-95c8-b8753661b8e9">

Fixes #48203
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122358
Approved by: https://github.com/shuqiangzhang
2024-05-09 16:51:31 +00:00
257d40ba2e Docker release - push nightly tags only for amd64 builds (#125845)
Fixes failure: https://github.com/pytorch/pytorch/actions/runs/9014006158/job/24765880791#step:12:43
```
Unable to find image 'ghcr.io/pytorch/pytorch-nightly:2.4.0.dev20240509-runtime' locally
2.4.0.dev20240509-runtime: Pulling from pytorch/pytorch-nightly
docker: no matching manifest for linux/amd64 in the manifest list entries.
```
This CPU image does not exist for amd64 and is not uploaded to Docker Hub, hence don't tag it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125845
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-05-09 16:42:15 +00:00
3ccf107f01 [export] remove upgrader. (#125625)
Summary: Talked to the ExecuTorch team; it seems we can remove this now.

Test Plan: CI

Differential Revision: D57013451

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125625
Approved by: https://github.com/larryliu0820
2024-05-09 16:30:12 +00:00
0241ed9331 Fix sparse fake tensors detach (#125679)
As in the title.

Fixes a bug reported in https://github.com/pytorch/pytorch/pull/117907#discussion_r1589581536

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125679
Approved by: https://github.com/amjames, https://github.com/lezcano
2024-05-09 15:40:57 +00:00
7e86a7c015 Lint: Update older-python test to 3.6 (#125843)
As python-3.5 can no longer connect to pypi after today's cert update
Fixes https://github.com/pytorch/pytorch/issues/125841
2024-05-09 07:23:59 -07:00
b8a706a321 [EZ][BE] Use untyped_storage in tests (#125838)
Gets rid of the following warning:
```
/Users/shenke/workspace/pytorch/test/test_mps.py:9229: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  if base.storage().data_ptr() != other.storage().data_ptr():
```

(noticed while looking at https://github.com/pytorch/pytorch/issues/96153#issuecomment-2101876484 )

Respective change to view ops was landed back in 2022, see https://github.com/pytorch/pytorch/pull/91414
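
The replacement pattern is essentially the following (illustrative):
```python
import torch

# Compare untyped storages instead of going through the deprecated
# TypedStorage accessor.
base, other = torch.randn(4), torch.randn(4)
same = base.untyped_storage().data_ptr() == other.untyped_storage().data_ptr()
print(same)
```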

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125838
Approved by: https://github.com/albanD
2024-05-09 14:04:21 +00:00
4e29e80bf0 Run MPS tests on MacOS Sonoma (#125801)
Those runners are on 14.4.1, so I wonder if they actually pass CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125801
Approved by: https://github.com/kit1980, https://github.com/huydhn
2024-05-09 13:43:12 +00:00
b9588101c4 [Inductor][Quant] Fix PT2E Dynamic Quant regression (#125207)
**Summary**
Fix 2 regression issues caused by previous refactor:

- Fix the issue in the dequant promotion pass with dynamic quant when the dequant node has the `tensor` overload.
- Fix a numerical issue in dynamic quant: with the previous implementation, the input was converted to the scales' dtype (which is `double`) to do the quant operation.

**TestPlan**
```
clear && python -u -m pytest -s -v test/inductor/test_mkldnn_pattern_matcher.py -k test_dynamic_qlinear_input_dim_exceeds_2
clear && python -u -m pytest -s -v test/inductor/test_mkldnn_pattern_matcher.py -k test_qlinear_dequant_promotion_dynamic_cpu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125207
Approved by: https://github.com/peterbell10, https://github.com/jgong5
ghstack dependencies: #124041, #124246
2024-05-09 08:47:24 +00:00
c337395cdb [Inductor][Quant] Change the QConv output scale name (#124246)
**Summary**
Change the name of QConv output scale from `inv_output_scale` to `output_scale` after we move the optimization of quant/dequant from decomposition to lowering phase.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124246
Approved by: https://github.com/jgong5, https://github.com/peterbell10
ghstack dependencies: #124041
2024-05-09 08:44:00 +00:00
d83ab88f81 [Inductor] [Quant] Enable lowering of quant per tensor and refactor quant pattern (#124041)
**Summary**
Per the discussion in https://github.com/pytorch/pytorch/pull/123444, the `decomposed quant/dequant` patterns changed after https://github.com/pytorch/pytorch/pull/123445, so we can move the optimization of `decomposed quant/dequant` from inductor decomposition into the lowering phase to avoid the changes. In this way, we can:

- Avoid the pattern matcher failure introduced in https://github.com/pytorch/pytorch/pull/123445
- Make the quantization pattern clearer in the pattern matcher phase, since the `quant/dequant` nodes have not been decomposed.

**Changes in this PR**

- Move optimization of `decomposed quant/dequant` from inductor decomposition into lowering phase.
- Corresponding changes in the quantization pattern matcher to ensure no bc-breaking.

**TestPlan**
```
python -u -m pytest -s -v test/inductor/test_mkldnn_pattern_matcher.py -k test_q
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124041
Approved by: https://github.com/peterbell10, https://github.com/jgong5
2024-05-09 08:40:44 +00:00
96c8447001 change error message to avoid failing when nn modules inlined (#125612)
Addresses https://github.com/pytorch/pytorch/issues/125605

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125612
Approved by: https://github.com/mlazos, https://github.com/anijain2305
2024-05-09 08:34:31 +00:00
da2f4bbc33 remove empty partition (#124920)
In some rare scenarios, the partitioner will produce an empty partition. It's a waste of time to compile an empty graph.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124920
Approved by: https://github.com/ezyang
2024-05-09 07:39:47 +00:00
e5766f02d0 [onnx.export] Avoid dict <-> unordered_map implicit copies (#123063)
This PR is part of an effort to speed up torch.onnx.export (#121422).

- Avoid [implicit copy](https://pybind11.readthedocs.io/en/stable/advanced/cast/stl.html#automatic-conversion) between `pybind11::dict` and `std::unordered_map` that
  happens for every node that gets processed. The copy scales with N
  (number of nodes), so this creates a quadratic time complexity.
  Solution is to always use `pybind11::dict`.
- This alone speeds up exports by x2 for large models.
- Resolves (1) in #121422.

(partial fix of #121422)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123063
Approved by: https://github.com/justinchuby
2024-05-09 07:34:47 +00:00
c59a2369be [fsdp2] Accomodate FSDP2 to accept parent mesh > 2 (#125778)
As titled, to support higher-dimensional parallelism.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125778
Approved by: https://github.com/weifengpy
2024-05-09 05:02:21 +00:00
aaa2f93a4f Add meta for _embedding_bag_dense_backward and _embedding_bag_per_sample_weights_backward (#125785)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125785
Approved by: https://github.com/albanD
2024-05-09 04:28:16 +00:00
ed48ea9997 [AOTI] Refine the C shim autogen mechanism (#125589)
Summary: Based on the discussions in https://github.com/pytorch/pytorch/pull/120513: instead of auto-generating C shim fallback ops for thousands of ops, we maintain a list of fallback ops based on torch/_inductor/lowering.py, and only generate C shim functions for those ops. At torchgen time, we will re-generate C shim files and compare the header file contents against the existing C shim headers. If there is any change, the compilation will fail with a prompt on how to proceed. This makes sure the ABI-compatible C shim layer is small enough to maintain in the long run.

Differential Revision: [D57004046](https://our.internmc.facebook.com/intern/diff/D57004046)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125589
Approved by: https://github.com/frank-wei, https://github.com/chenyang78, https://github.com/albanD, https://github.com/ezyang
2024-05-09 02:48:16 +00:00
0bde9c08ef Prevent rendezvous shutdown on worker restarts (#124819)
Fixes #123678

#### Summary
When a rank leaves and joins back, the workers are restarted, and while restarting, the rendezvous is shut down. This change prevents rendezvous shutdown during worker restarts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124819
Approved by: https://github.com/malfet, https://github.com/kurman, https://github.com/eqy
2024-05-09 02:40:31 +00:00
cyy
6c4f43f826 Decouple most Caffe2 components from the build systems (r-barnes) (#125711)
Copying #125392 here so I can edit it more easily.

Co-authored-by: cyy <cyyever@outlook.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125711
Approved by: https://github.com/malfet
2024-05-09 02:19:59 +00:00
fdff9920f6 [pytorch] fix blasLt on windows (#125792)
Summary:
It seems like the required functions are not available due to the `_MSC_VER` guard. Does anyone have more context on why this functionality has been disabled for Windows?

I'm also unsure how this currently compiles in OSS land on windows, as there doesn't seem to be any preprocessor protection around `scaled_gemm` getting pulled in.

Test Plan:
Fix compilation errors like this
```
C:\open\fbsource\xplat\caffe2\aten\src\ATen\cuda\tunable\TunableGemm.h(74): error C2039: 'scaled_gemm': is not a member of 'at::cuda::blas'
C:\open\fbsource\xplat\caffe2\aten\src\ATen\cuda\CUDABlas.h(19): note: see declaration of 'at::cuda::blas'
C:\open\fbsource\xplat\caffe2\aten\src\ATen\cuda\tunable\TunableGemm.h(74): note: the template instantiation context (the oldest one first) is
C:\open\fbsource\xplat\caffe2\aten\src\ATen\cuda\tunable\TunableGemm.h(71): note: while compiling class template 'at::cuda::tunable::DefaultScaledGemmOp'
Action failed: fbsource//xplat/caffe2:ATen_cuda_lib_ovrsource (cxx_compile aten/src/ATen/native/cuda/Blas.cpp)
```

Differential Revision: D57087985

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125792
Approved by: https://github.com/malfet, https://github.com/eqy
2024-05-09 01:54:25 +00:00
902a74c1d6 [caffe2] Lazily symbolize backtrace in c10::Error (#125787)
Summary:
The macros that build `c10::Error` compute the stack trace at the point of throwing, which is then returned as part of the `what()`. If `what()` is never called, which is the case for most exceptions (since logging is throttled), the cost of computing the stack trace was wasted.

By far, the most expensive part of computing the stack trace is its symbolization; just unwinding the stack and collecting the instruction addresses is comparatively cheap. We can thus defer the symbolization to first invocation of `what()`.

Test Plan:
Added unit tests exercising the lazy nature of `what()`.

Ran an adfinder canary: https://www.internalfb.com/intern/ads/canary/460118801509424346

We can see that the cost of symbolization is obliterated (meaning that `what()` is virtually never called, as expected).

Differential Revision: D57128632

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125787
Approved by: https://github.com/huydhn
2024-05-09 01:46:57 +00:00
ea3f625e32 Revert "[Inductor] [Quant] Enable lowering of quant per tensor and refactor quant pattern (#124041)"
This reverts commit 33e6791645b5950b0f39301f55b8a4a79c0ca847.

Reverted https://github.com/pytorch/pytorch/pull/124041 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think there is a land race with the change 33e6791645 ([comment](https://github.com/pytorch/pytorch/pull/124041#issuecomment-2101766558))
2024-05-09 01:34:19 +00:00
ca579c177b Revert "[Inductor][Quant] Change the QConv output scale name (#124246)"
This reverts commit 9ba9f7fa821af062ef3d1580b75e70f74ba05063.

Reverted https://github.com/pytorch/pytorch/pull/124246 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think there is a land race with the change 33e6791645 ([comment](https://github.com/pytorch/pytorch/pull/124041#issuecomment-2101766558))
2024-05-09 01:34:19 +00:00
97509c8eb2 Revert "[Inductor][Quant] Fix PT2E Dynamic Quant regression (#125207)"
This reverts commit 3da949b0fbe91e802d30e00165141d1390621d71.

Reverted https://github.com/pytorch/pytorch/pull/125207 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think there is a land race with the change 33e6791645 ([comment](https://github.com/pytorch/pytorch/pull/124041#issuecomment-2101766558))
2024-05-09 01:34:19 +00:00
19bab45e67 [Inductor] Add SDPA pattern for OOB GPT2 models (#125562)
Add SDPA pattern for 2 OOB models:
- token-classification+gpt2
- text-generation+gpt2

Note that these models have two masks: attention mask with float type and causal mask with bool type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125562
Approved by: https://github.com/jgong5, https://github.com/kadeng, https://github.com/jansel
2024-05-09 01:21:09 +00:00
3da949b0fb [Inductor][Quant] Fix PT2E Dynamic Quant regression (#125207)
**Summary**
Fix 2 regression issues caused by previous refactor:

- Fix the issue in the dequant promotion pass with dynamic quant when the dequant node has the `tensor` overload.
- Fix a numerical issue in dynamic quant: with the previous implementation, the input was converted to the scales' dtype (which is `double`) to do the quant operation.

**TestPlan**
```
clear && python -u -m pytest -s -v test/inductor/test_mkldnn_pattern_matcher.py -k test_dynamic_qlinear_input_dim_exceeds_2
clear && python -u -m pytest -s -v test/inductor/test_mkldnn_pattern_matcher.py -k test_qlinear_dequant_promotion_dynamic_cpu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125207
Approved by: https://github.com/peterbell10, https://github.com/jgong5
ghstack dependencies: #124041, #124246
2024-05-09 01:05:00 +00:00
d474d79420 [dynamo][disable] Move disable impl to its own __call__ method (#125486)
There were internal cases where calling disable in distributed caused trace_rules to be generated, which imports distributed and causes circular import errors.

The code has also grown bulky; I think it is time for the disable code to live separately.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125486
Approved by: https://github.com/yanboliang, https://github.com/williamwen42, https://github.com/jansel
2024-05-09 01:03:12 +00:00
9ba9f7fa82 [Inductor][Quant] Change the QConv output scale name (#124246)
**Summary**
Change the name of QConv output scale from `inv_output_scale` to `output_scale` after we move the optimization of quant/dequant from decomposition to lowering phase.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124246
Approved by: https://github.com/jgong5, https://github.com/peterbell10
ghstack dependencies: #124041
2024-05-09 00:57:10 +00:00
33e6791645 [Inductor] [Quant] Enable lowering of quant per tensor and refactor quant pattern (#124041)
**Summary**
Per the discussion in https://github.com/pytorch/pytorch/pull/123444, the `decomposed quant/dequant` patterns changed after https://github.com/pytorch/pytorch/pull/123445, so we can move the optimization of `decomposed quant/dequant` from inductor decomposition into the lowering phase to avoid the changes. In this way, we can:

- Avoid the pattern matcher failure introduced in https://github.com/pytorch/pytorch/pull/123445
- Make the quantization pattern clearer in the pattern matcher phase, since the `quant/dequant` nodes have not been decomposed.

**Changes in this PR**

- Move optimization of `decomposed quant/dequant` from inductor decomposition into lowering phase.
- Corresponding changes in the quantization pattern matcher to ensure no bc-breaking.

**TestPlan**
```
python -u -m pytest -s -v test/inductor/test_mkldnn_pattern_matcher.py -k test_q
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124041
Approved by: https://github.com/peterbell10, https://github.com/jgong5
2024-05-09 00:54:22 +00:00
1b1b18a7a4 Add LRScheduler Composability E2E Tests (#125653)
Adds tests to verify that the LRSchedulers correctly update the compiled optimizers without recompiles.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125653
Approved by: https://github.com/yanboliang
ghstack dependencies: #123751, #123752, #123753, #125383
2024-05-09 00:52:43 +00:00
8c9c169b48 LRScheduler composability kernel tests (#125383)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125383
Approved by: https://github.com/eellison
ghstack dependencies: #123751, #123752, #123753
2024-05-09 00:52:43 +00:00
69eeef0727 Update LRScheduler to handle tensor LR (#123753)
Enables LRScheduler to handle tensor LRs.

Note on test changes:
For the test modifications I just removed itertools.product and created two loops. This allows us to create a new set of optim_inputs on each iteration to prevent mutations on the tensor LR carrying over across iterations. Nothing else in those tests was modified.
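
A minimal sketch of a tensor LR flowing through a scheduler, assuming the optimizer accepts a tensor `lr` at this commit; the values and the StepLR choice are illustrative, not taken from the PR's tests:
```python
import torch

# Assumed sketch: lr is passed as a tensor and the scheduler updates it.
param = torch.nn.Parameter(torch.randn(4))
opt = torch.optim.SGD([param], lr=torch.tensor(0.1))
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=1, gamma=0.5)
for _ in range(3):
    param.grad = torch.zeros_like(param)
    opt.step()
    sched.step()
    # The lr stays a tensor; per the test note above, the scheduler mutates it.
    print(opt.param_groups[0]["lr"])
```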

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123753
Approved by: https://github.com/janeyx99
ghstack dependencies: #123751, #123752
2024-05-09 00:52:43 +00:00
7b36b4a765 Fix user warning for tensor LR (#123752)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123752
Approved by: https://github.com/janeyx99
ghstack dependencies: #123751
2024-05-09 00:52:43 +00:00
0ea6ffc613 Swap warning counter to flag in LRScheduler (#123751)
This was previously a counter; it should be a flag indicating whether or not the optimizer step has been called.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123751
Approved by: https://github.com/janeyx99
2024-05-09 00:52:43 +00:00
78a1693266 [Inductor Intel GPU backend Upstream] Reuse inductor test for Intel GPU (PART 1) (#122866)
Reuse Inductor test suite for Intel GPU including:
test_torchinductor.py
test_triton_wrapper.py
test_metrics.py
test_codecache.py
test_codegen_triton.py
test_kernel_benchmark.py
test_triton_heuristics.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122866
Approved by: https://github.com/EikanWang, https://github.com/jansel
2024-05-09 00:51:35 +00:00
4dd33a1c2b Better core binding in torch.backends.xeon.run_cpu when launced from torchrun with --nproc-per-node (#123711)
This PR fixes `torch.backends.xeon.run_cpu` behavior when it is launched from `torchrun` with the `--nproc-per-node` parameter.

As a CPU launcher, `run_cpu` would bind cores to each instance it launches using `numactl`, and assign cores to each instance evenly.

However, if we use `torchrun` to start `run_cpu` with `--nproc-per-node` to create multiple `run_cpu` processes, each `run_cpu` process would assume it can use all the CPU cores, causing the `run_cpu` processes to compete for CPU cores. This results in poor performance.

This PR recognizes the environment variables `LOCAL_WORLD_SIZE` and `LOCAL_RANK` set by `torchrun`, then uses this information to further shard the cores bound to each instance. With this PR, when launched by `torchrun --nproc-per-node ...`, different CPU cores will be bound to different workers, which maximizes CPU utilization and application performance.
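
An illustrative sketch of the core-sharding idea (not the actual `run_cpu` implementation):
```python
import os

# Split the visible cores evenly across the local workers that torchrun
# reports via LOCAL_WORLD_SIZE / LOCAL_RANK.
def cores_for_this_worker(all_cores):
    world = int(os.environ.get("LOCAL_WORLD_SIZE", "1"))
    rank = int(os.environ.get("LOCAL_RANK", "0"))
    per_worker = len(all_cores) // world
    return all_cores[rank * per_worker:(rank + 1) * per_worker]

print(cores_for_this_worker(list(range(32))))  # e.g. rank 1 of 4 -> cores 8..15
```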

The specific use case this PR enables is using TorchServe with DeepSpeed tensor parallel. In this case, TorchServe would run `torchrun --nproc-per-node <tp_size>` to start the tensor parallel workers it needs. When running TorchServe on a multi-socket CPU server with DeepSpeed tensor parallel, we need this PR to achieve the best performance.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123711
Approved by: https://github.com/jingxu10, https://github.com/ezyang
2024-05-09 00:32:11 +00:00
8def2e92f2 [inductor] autotune benchmark support for cpu (#125159)
This PR adds the autotune infrastructure for CPU. It generalizes and extends `BenchmarkRequest` with CPU support and a C++ module loader. A `do_bench_cpu` util function is added for benchmarking functions on CPU with warmups; it returns the median number from multiple trials.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125159
Approved by: https://github.com/jansel
2024-05-09 00:28:49 +00:00
96a5698408 Fix torch.profiler Schedule Function (Function Event only) (#125510)
Summary:
github issue: https://github.com/pytorch/pytorch/issues/73828

Whenever we transition from RECORD_AND_SAVE to WARMUP in the profiler schedule, we instantiate a new backend profiler, which wipes out the last cycle's information. This makes the `repeat` parameter of the schedule less useful, as you only get the contents of the last cycle/repeat. In this diff, we save the accumulated Function Events before setting the new ones and then merge the two EventLists after post processing/cleaning is done. This diff only fixes Function Events so that we can get statistics over each cycle within a schedule. A follow-up should be made to accumulate the Chrome traces as well if requested.
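
A usage sketch of a schedule with `repeat` whose per-cycle Function Events are being fixed here (illustrative; the workload and counts are arbitrary):
```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule

# With repeat=2, function events from the first cycle used to be dropped;
# this fix merges the events gathered across cycles.
sched = schedule(wait=1, warmup=1, active=2, repeat=2)
with profile(activities=[ProfilerActivity.CPU], schedule=sched) as prof:
    for _ in range(10):
        torch.randn(64, 64) @ torch.randn(64, 64)
        prof.step()
print(prof.key_averages().table(row_limit=5))
```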

Test Plan: Added functional python tests in test_profiler.py that test different schedules and their FunctionEvent counts

Differential Revision: D56956245

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125510
Approved by: https://github.com/aaronenyeshi
2024-05-08 23:32:50 +00:00
ff090c6937 [dynamo] support tracing nn.Module @property that accesses closure cells (#125724)
Fix https://github.com/pytorch/pytorch/issues/125702

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125724
Approved by: https://github.com/jansel, https://github.com/jbschlosser
ghstack dependencies: #125710
2024-05-08 23:25:39 +00:00
93f3d561f9 [dynamo] don't make nn parametrized Modules unspecialized (#125710)
Workaround for https://github.com/pytorch/pytorch/issues/125314 and https://github.com/pytorch/pytorch/issues/125478.

We no longer make parametrized nn.Modules unspecialized. Instead, when we are about to call a function from the `torch.nn.utils.parametrize` module, we skip the frame.

The script from https://github.com/pytorch/pytorch/issues/125314 now outputs
```
parametrize=True: 6587ms
parametrize=False: 1729ms
parametrize=True: 4497ms
parametrize=False: 1539ms
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125710
Approved by: https://github.com/jansel, https://github.com/jbschlosser
2024-05-08 23:25:39 +00:00
e71207b729 Fix infinite recursion in API BC test (#125706)
```
python test/test_fx.py -k test_public_api_surface
```
was failing with a complaint about infinite recursion. Fixed that and then marked the two API changes from #123681 as private (for `get_example_value`) and backward compatible (for `insert_deferred_runtime_asserts`).

Fixes #104012

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125706
Approved by: https://github.com/BoyuanFeng
2024-05-08 23:07:16 +00:00
04bf7713e8 [c10d] Reduce test time by reusing ProcessGroup (#125648)
## Problem this PR resolves
Today, most distributed tests are arranged like this:
```
def test_allreduce(self):
    pg = self._create_process_group_nccl(store, self.opts())
    pg.allreduce(tensor)
    ...
```
Thus, we are paying PG creation time **per test**. That's bad. But why were we doing that? Is there a constraint?

If we look deeper, we would find that most of our test cases inherit from `torch.testing._internal.common_distributed.MultiProcessTestCase`. From the name, nothing seems wrong, and it probably fits distributed well. But a "problem" exists in its `setUp()` and `tearDown()` methods, which basically do the following:
```
def setUp(self):
    self._spawn_processes()

def tearDown(self):
    for p in self.processes:
        p.terminate()
```
Since `setUp` and `tearDown` are **"test-scope fixtures"**, meaning they are called per test, each test will have brand new processes. Of course we'd have to recreate the ProcessGroup every time.

## How we are fixing it
First, obviously, we need to put a PG's lifetime into a longer scope. Python `unittest` provides such a helper, called **"class-scope fixtures."** It is embodied by a `setUpClass` method and a `tearDownClass` method (note the name difference), which are called only once for all tests in the same test class.  Therefore, we would do:
```
@classmethod
def setUpClass(self):
    dist.init_process_group(...)

@classmethod
def tearDownClass(self):
    dist.destroy_process_group()
```
**In this PR, we create a new test template for distributed: `MultiProcContinousTest`, to hold this class-scope fixture.**

Second, we'd need to avoid per-test process spawn and terminate. That's easy, we can either:
1. launch the whole test file with `torchrun --nproc-per-node=...` or
2. use `mp.spawn()` under `if __name__ == "__main__":`.

Point is, launch the processes only once.

## Result
We moved the "positive tests" from test_c10d_nccl.py to test_c10d_ops_nccl.py.
Before this PR:
```
$ python test_c10d_nccl.py -k ProcessGroupNCCLTest
Ran 24 tests in 174.457s
```
After this PR:
```
$ torchrun --nproc-per-node 2 test_c10d_ops_nccl.py
or
$ python test_c10d_ops_nccl.py
Ran 24 tests in 16.247s
```
10X speedup.

## Limitation
For tests intended to test destroy or abort of PGs, we'd need to go back to the old style. So it would make sense to divide our tests into two classes: one for positive tests where we would reuse the PGs, and the other one for abort/destroy and negative tests like watchdog timeout.

## Next step
Migrate the tests of distributed that would fit with this test style!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125648
Approved by: https://github.com/wconstab
2024-05-08 22:33:40 +00:00
8f27c7f181 [sparse] Fix type-dispatch errors (#124777)
I am building PyTorch with the Intel oneAPI 2024.0 compiler and without cuSparseLt, and encountered various type errors of the following forms:
```
[ 63%] Building CUDA object caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu.o
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/ComputeSparseTile.h(87): error: no operator "=" matches these operands
            operand types are: cutlass::uint2b_t = int
          detected during:
            instantiation of "at::native::Indices4x4 at::native::LargestValuesGreedy<Op>::operator()(Tile4x4Accessor) [with Op=at::native::IdentityOp, Tile4x4Accessor=at::native::KernelTypes<cutlass::half_t>::Tile4x4Accessor]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredPack.h(349): here
            instantiation of "void at::native::KernelTypes<Element_>::sparse_semi_structured_tile_kernel(at::native::KernelTypes<Element_>::Params, MetadataStore, Algorithm) [with Element_=cutlass::half_t, Algorithm=at::native::LargestValuesGreedy<at::native::IdentityOp>, MetadataStore=at::native::MetadataCutlass]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(201): here
            instantiation of "void at::native::sparse_semi_structured_tile_kernel<KT,Metadata,Algorithm>(KT::Params, Metadata, Algorithm) [with KT=at::native::KernelTypes<cutlass::half_t>, Metadata=at::native::MetadataCutlass, Algorithm=at::native::LargestValuesGreedy<at::native::IdentityOp>]"
(177): here
            instantiation of "void at::native::named_algorithms(T) [with T=lambda [](auto, const std::__cxx11::string &)->auto]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(265): here
            instantiation of "std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor> at::native::sparse_semi_structured_tile_typed<Element,MetadataFormat>(at::Tensor, std::__cxx11::string) [with Element=cutlass::half_t, MetadataFormat=at::native::MetadataCutlass]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(293): here

/tmp/pytorch/aten/src/ATen/native/sparse/cuda/ComputeSparseTile.h(88): error: no operator "=" matches these operands
            operand types are: cutlass::uint2b_t = int
          detected during:
            instantiation of "at::native::Indices4x4 at::native::LargestValuesGreedy<Op>::operator()(Tile4x4Accessor) [with Op=at::native::IdentityOp, Tile4x4Accessor=at::native::KernelTypes<cutlass::half_t>::Tile4x4Accessor]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredPack.h(349): here
            instantiation of "void at::native::KernelTypes<Element_>::sparse_semi_structured_tile_kernel(at::native::KernelTypes<Element_>::Params, MetadataStore, Algorithm) [with Element_=cutlass::half_t, Algorithm=at::native::LargestValuesGreedy<at::native::IdentityOp>, MetadataStore=at::native::MetadataCutlass]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(201): here
            instantiation of "void at::native::sparse_semi_structured_tile_kernel<KT,Metadata,Algorithm>(KT::Params, Metadata, Algorithm) [with KT=at::native::KernelTypes<cutlass::half_t>, Metadata=at::native::MetadataCutlass, Algorithm=at::native::LargestValuesGreedy<at::native::IdentityOp>]"
(177): here
            instantiation of "void at::native::named_algorithms(T) [with T=lambda [](auto, const std::__cxx11::string &)->auto]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(265): here
            instantiation of "std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor> at::native::sparse_semi_structured_tile_typed<Element,MetadataFormat>(at::Tensor, std::__cxx11::string) [with Element=cutlass::half_t, MetadataFormat=at::native::MetadataCutlass]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(293): here

/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredPack.h(238): error: function "lambda [](cutlass::uint2b_t, cutlass::uint2b_t)->void" cannot be called with the given argument list
            argument types are: (int, int)
            object type is: lambda [](cutlass::uint2b_t, cutlass::uint2b_t)->void
          detected during:
            instantiation of "at::native::KernelTypes<Element_>::Tile4x4Packed at::native::KernelTypes<Element_>::pack_4x4(at::native::Indices4x4, at::native::KernelTypes<Element_>::Tile4x4Accessor, uint32_t &, int, __nv_bool) [with Element_=cutlass::half_t]"
(354): here
            instantiation of "void at::native::KernelTypes<Element_>::sparse_semi_structured_tile_kernel(at::native::KernelTypes<Element_>::Params, MetadataStore, Algorithm) [with Element_=cutlass::half_t, Algorithm=at::native::LargestValuesGreedy<at::native::IdentityOp>, MetadataStore=at::native::MetadataCutlass]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(201): here
            instantiation of "void at::native::sparse_semi_structured_tile_kernel<KT,Metadata,Algorithm>(KT::Params, Metadata, Algorithm) [with KT=at::native::KernelTypes<cutlass::half_t>, Metadata=at::native::MetadataCutlass, Algorithm=at::native::LargestValuesGreedy<at::native::IdentityOp>]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/ComputeSparseTile.h(177): here
            instantiation of "void at::native::named_algorithms(T) [with T=lambda [](auto, const std::__cxx11::string &)->auto]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(265): here
            instantiation of "std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor> at::native::sparse_semi_structured_tile_typed<Element,MetadataFormat>(at::Tensor, std::__cxx11::string) [with Element=cutlass::half_t, MetadataFormat=at::native::MetadataCutlass]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(293): here

/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredPack.h(241): error: function "lambda [](cutlass::uint2b_t, cutlass::uint2b_t)->void" cannot be called with the given argument list
            argument types are: (int, int)
            object type is: lambda [](cutlass::uint2b_t, cutlass::uint2b_t)->void
          detected during:
            instantiation of "at::native::KernelTypes<Element_>::Tile4x4Packed at::native::KernelTypes<Element_>::pack_4x4(at::native::Indices4x4, at::native::KernelTypes<Element_>::Tile4x4Accessor, uint32_t &, int, __nv_bool) [with Element_=cutlass::half_t]"
(354): here
            instantiation of "void at::native::KernelTypes<Element_>::sparse_semi_structured_tile_kernel(at::native::KernelTypes<Element_>::Params, MetadataStore, Algorithm) [with Element_=cutlass::half_t, Algorithm=at::native::LargestValuesGreedy<at::native::IdentityOp>, MetadataStore=at::native::MetadataCutlass]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(201): here
            instantiation of "void at::native::sparse_semi_structured_tile_kernel<KT,Metadata,Algorithm>(KT::Params, Metadata, Algorithm) [with KT=at::native::KernelTypes<cutlass::half_t>, Metadata=at::native::MetadataCutlass, Algorithm=at::native::LargestValuesGreedy<at::native::IdentityOp>]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/ComputeSparseTile.h(177): here
            instantiation of "void at::native::named_algorithms(T) [with T=lambda [](auto, const std::__cxx11::string &)->auto]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(265): here
            instantiation of "std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor> at::native::sparse_semi_structured_tile_typed<Element,MetadataFormat>(at::Tensor, std::__cxx11::string) [with Element=cutlass::half_t, MetadataFormat=at::native::MetadataCutlass]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(293): here
```

The casts added by this PR get the build working again for me.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124777
Approved by: https://github.com/jcaip
2024-05-08 21:49:33 +00:00
1b8891a31d make torch._check understand Eq commutativity (#125629)
Summary:
Given `torch._check(a == b)` we can still get a data-dependent error needing `b == a`. Simple fix.

```
def forward(self, x1, x2, x3, y):
    z1 = x1.item()
    z2 = x2.item()
    z3 = x3.item()
    torch._check((z2 + z3) == z1)
    # torch._check(z1 == (z2 + z3)) didn't work, now does
    if z2 + z3 == z1:
        return y * 2
    else:
        return y + 3
```

Differential Revision: D57014730

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125629
Approved by: https://github.com/ezyang
2024-05-08 21:39:21 +00:00
346343e6b5 [DeviceMesh] Make _validate_tp_mesh_dim support 3D (#125763)
Currently a 3D mesh with a submesh sliced out for TP is going to fail
this check.

According to @wanchaol in [this
comment](https://github.com/pytorch/pytorch/pull/125250#discussion_r1586653669)
it should be OK to remove these checks.  Though I would appreciate a
more careful review here, since I'm not too sure if there are other edge
cases where these checks are important.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125763
Approved by: https://github.com/wz337, https://github.com/wanchaol
2024-05-08 21:22:11 +00:00
e457fdcd81 Revert "[caffe2] Lazily symbolize backtrace in c10::Error (#125682)"
This reverts commit 08f6ef0e1ccadf4626c0d7ecb15db96c01b8f418.

Reverted https://github.com/pytorch/pytorch/pull/125682 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/125682#issuecomment-2101477132))
2024-05-08 21:11:27 +00:00
7e0edafe86 [compiled autograd][dynamo] improve lifted autograd.Function.backward handling and fallback to pseudo-eager (#125661)
- `FakeContext` hides all fields other than `ctx.saved_tensors`; this causes dynamo errors when `autograd.Function.backward` uses other attributes on `ctx`, and it also doesn't allow a fallback to eager.
- If we remove it, we still can't fall back to eager: the node variables are already freed (`ctx.saved_tensors` throws).
- However, we can fall back to "pseudo-eager" by using a duck-typed ctx and routing `ctx.saved_tensors` to the lifted tensors (see the sketch below).
- Dynamo tries to inline `external_utils.call_backward` and treats `BackwardCFunction` as an `AutogradFunctionContextVariable` (only used up until we create the fake context: `FakeBackwardCFunction`).
- We `call_function` backward from the forward class `AutogradFunctionVariable`, and we still pass in the fake context as a `UserDefinedObjectVariable` (can later use `AutogradFunctionContextVariable` + HOO graph speculation).
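An illustrative sketch of the pseudo-eager fallback described above (the class and helper names follow this description, but the signatures are assumptions, not dynamo internals):

```python
from typing import Any, Callable, Sequence

class FakeBackwardCFunction:
    """Duck-typed ctx: carries only what backward needs."""
    def __init__(self, saved_tensors: Sequence[Any], other_attrs: dict):
        self.saved_tensors = tuple(saved_tensors)   # routed to the lifted tensors
        for name, value in other_attrs.items():     # expose remaining ctx attributes
            setattr(self, name, value)

def call_backward(backward_fn: Callable, fake_ctx: FakeBackwardCFunction, *grads):
    # A helper like this can be inlined, letting backward run against the fake ctx
    # instead of the already-freed autograd node.
    return backward_fn(fake_ctx, *grads)
```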

Fixes #125489  #124827

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125661
Approved by: https://github.com/jansel
2024-05-08 21:00:37 +00:00
de8ce3be20 [TD] Heuristic based on file path (#125477)
Get the folders of each changed file and attempt to map the folders to some tests.

The intention is to push up things like dynamo tests if someone changes a file in the dynamo folder

Please see the tests for examples of what should be matched together
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125477
Approved by: https://github.com/ZainRizvi, https://github.com/huydhn
2024-05-08 20:56:53 +00:00
17ab7f77c2 [Kineto] Update Kineto Submodule Hash (#125621)
Summary:
Update the Kineto submodule in PyTorch. The following diffs are included:
- Removed CUPTI overhead track in AMD traces
- Delay logging for CUDA stream wait event until the end
- Changed chrome trace time units to milliseconds for display, with data recorded in ns
- Refactored roctracer to include metadata and improved names.
- Lowered Kineto Stage log level, reducing noisy output
- Changed relative time of ts to quarterly interval for distributed trace alignment
- Fixed Non-risky deprecated use of 0/NULL
- Removed hardcoding of /opt/rocm
- Handling cuLaunchKernelEx better
- Fixed Non-risky missing field initializers and unused variables.

Test Plan: CI and this is running internally.

Differential Revision: D57011897

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125621
Approved by: https://github.com/sraikund16
2024-05-08 20:49:07 +00:00
255a3afbf1 [dynamo] don't LOAD_FAST local context variables in modified bytecode (#125719)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125719
Approved by: https://github.com/jansel
2024-05-08 20:39:06 +00:00
0feca5e341 Increase Python version for Docker builds (#125782)
Fixes https://github.com/pytorch/pytorch/issues/73714

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125782
Approved by: https://github.com/huydhn
2024-05-08 20:32:32 +00:00
19a9de114a Forbid subclassing _TensorBase directly (#125558)
As per title.
This ensures that all the places where we assume the methods defined in _tensor.py exist remain valid.

BC-Breaking: This is bc-breaking as the user cannot subclass this private class anymore.
You should replace any use of _TensorBase with Tensor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125558
Approved by: https://github.com/ezyang
2024-05-08 20:29:29 +00:00
afea237935 [minimizer] Create block traverse mode in minimizer for graph aware debugging (#125613)
Summary:
block traverse mode:
Assumption: the culprits form a block (start_idx, end_idx) in the topologically sorted graph, and the error goes away if that graph pattern is broken.

Reviewed By: junhanh

Differential Revision: D56799587

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125613
Approved by: https://github.com/jfix71
2024-05-08 20:21:21 +00:00
603d1e6049 [DTensor] allow numel 1 tensor operand to be implicitly replicate DTensor (#125073)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125073
Approved by: https://github.com/wanchaol
2024-05-08 19:47:47 +00:00
445a0c01da Retry: Low mem max_pool2d_with_indices (#122832)
Based on #105687

The low-memory path does not need to strictly return the int8 offsets; instead, the offset-to-index computation can be separated from the inner function of the max-pool lowering. The partitioner can then choose to move the offset-to-index computation into the backward pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122832
Approved by: https://github.com/peterbell10, https://github.com/eellison
2024-05-08 19:37:08 +00:00
005a12722d Remove duplicated nodes in dfs_iter_find_cycle (#125585)
In case the `dfs_iter_find_cycle` function receives duplicated node entries in the `all_user_nodes` argument, it will still process each one of them. This commit changes the `all_user_nodes` list into a set, so each element is unique, resulting in a shorter execution time of the `propose_partitions` function.
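A minimal, generic sketch of the idea (the function and attribute names here are illustrative, not the real partitioner code): collecting user nodes into a set instead of a list means each node is handed to the DFS at most once.

```python
def collect_user_nodes(nodes):
    all_user_nodes = set()   # previously a list; duplicate entries were re-visited
    for node in nodes:
        # fx.Node-style `.users` mapping assumed; iterating yields the user nodes
        for user in getattr(node, "users", ()):
            all_user_nodes.add(user)
    return all_user_nodes
```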

Fixes #125584

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125585
Approved by: https://github.com/Skylion007
2024-05-08 19:21:15 +00:00
3f36145db2 add uuid in cudaDeviceProperties (#125083)
Replaces #99967.

Fixes #99903.
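A hedged usage sketch (the attribute name `uuid` on the device properties object is assumed from the title of this change):

```python
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    # The UUID uniquely identifies the physical GPU across processes/reorderings.
    print(props.name, props.uuid)
```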

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125083
Approved by: https://github.com/pruthvistony, https://github.com/albanD, https://github.com/eqy
2024-05-08 19:15:55 +00:00
f4b2d50fd7 [export] disable_forced_specializations (#124949)
Summary:
By default, some inferred dynamic shapes guards/constraints that are not expressible with the current dynamic shapes language will lead to specialization to the concrete input values provided. If disable_forced_specializations is set to True, we will not specialize, and will not perform runtime checks on such produced guards. Instead, we allow the user to specify arbitrary shapes, and fail during runtime if the inputs are invalid. Constraints expressible with the language (e.g. ranges, linear derived dims) will still be enforced, and behavior for all other guards remains the same.

Cases where we typically specialize are reshapes:
```
x: [4, 6]  # [s0, s1]
x = x.reshape([x.shape[0] - 1, -1])
# this emits a guard Mod(s0*s1, s0-1) = 0, we specialize on s0=4, s1=6

x: [4, 6], y: [24]  # [s0, s1], [s2]
x = x.reshape([-1]) + y
# this emits a guard s0*s1 = s2, we specialize on s0=4, s1=6, s2=24
```

For now only applicable for non-strict mode (need to figure out how to pass this flag into dynamo's call of produce_guards).

Test Plan: Added test case that checks compilation, runtime, and suggested fixes behavior.

Differential Revision: D56361177

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124949
Approved by: https://github.com/avikchaudhuri
2024-05-08 18:42:39 +00:00
74b1674860 Use nvidia/cuda:CUDA_VERSION-devel-ubuntu22.04 as base for official Docker release (#125770)
We don't need to install cuDNN system-wide; we already install it with the pytorch install during this step:
```
[conda-installs 2/3] RUN case linux/amd64 in          "linux/arm64")  pip install --extra-index-url https://download.pytorch.org/whl/cpu/ torch torchvision torchaudio ;;          *)              /opt/conda/bin/conda install -c "pytorch-nightly" -c "nvidia" -y "python=3.10" pytorch torchvision torchaudio "pytorch-cuda=$(echo 12.1.1 | cut -d'.' -f 1-2)"  ;;     esac &&     /opt/conda/bin/conda clean -ya
```
Ref: https://github.com/pytorch/pytorch/actions/runs/8998055687/job/24717424912

Validate via: https://github.com/pytorch/builder/actions/workflows/validate_docker_images.yml
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125770
Approved by: https://github.com/nWEIdia, https://github.com/seemethere
2024-05-08 18:41:30 +00:00
faf0015052 [dtensor] run transformer sdpa in dtensor (#122997)
Now that efficient attention is supported in dtensor, we can modify the transformer test to use dtensor in SDPA and get rid of the manual num_head adjustments.

Caveat: Efficient attention is supported only with bf16/fp32 (not fp64) and has other constraints. If any of the constraints are not satisfied, the SDPA would fall back to the math decomposed attention, which will break as it does not fully work with dtensor (it creates a `torch.Tensor` mask in the middle). I considered adding some checks like in P1202254918 but that needs to be added everywhere this Transformer is used. Is it necessary if the current CI machines can run efficient attention?

Test files containing this Transformer:
- `test/distributed/tensor/parallel/test_tp_examples.py`
- `test/distributed/_composable/fsdp/test_fully_shard_training.py`
- `test/distributed/_composable/fsdp/test_fully_shard_clip_grad_norm_.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122997
Approved by: https://github.com/XilunWu
ghstack dependencies: #122995, #122996
2024-05-08 17:08:47 +00:00
efece3f142 [dtensor] add op support for memory efficient attention (#122996)
This is a followup to flash attention. On cuda, flash attention is supported only for fp16/bf16, whereas memory efficient attention is supported for fp32 (but not fp64). With this PR, one can run SDPA and in general Transformer completely in dtensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122996
Approved by: https://github.com/XilunWu, https://github.com/wanchaol
ghstack dependencies: #122995
2024-05-08 17:08:27 +00:00
08be8ec8a9 [dtensor] improve new factory strategy (#122995)
Previously, the new tensors produced by the "new factory" ops all became replicated.
With this PR, if the new tensor has the same shape as the old tensor **and** the shape can be evenly sharded, then the old spec is inherited and preferred.

To accommodate this when the old tensor has sharded placements, the input args for local computation (size, stride) need to be adjusted.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122995
Approved by: https://github.com/wanchaol
2024-05-08 17:05:07 +00:00
affd7a9789 Get PT2 Cutlass backend working under fbcode [take 2] (#125688)
Differential Revision: D57051232

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125688
Approved by: https://github.com/chenyang78
2024-05-08 16:44:49 +00:00
87f86fd586 Fix multi template debug trace (#125703)
Fix for https://github.com/pytorch/pytorch/issues/125642

We were trying to render the template of a multi-template kernel before it had been finalized.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125703
Approved by: https://github.com/shunting314
2024-05-08 16:31:18 +00:00
e28d9947a1 AsyncCollectiveTensor: prevent wait_tensor() calls on graph inputs from getting DCEd (#125677)
@wanchaol was seeing the loss eventually become NaN when compiling individual transformer blocks in torchtitan - with this patch I no longer see the NaN loss.

The problem is the following:

(1) It is possible to have graph inputs to a compiled region that are AsyncCollectiveTensors. In particular: when we compile individual transformer blocks in the llama model, the first layer (embedding layer) is run in eager mode, and it outputs an AsyncCollectiveTensor that is fed to the first transformer block

(2) ideally, we would like that AsyncCollectiveTensor graph input to desugar into a `wait_tensor()` op that shows up at the beginning of the graph.

(3) the way this is supposed to happen is: AOTAutograd traces through the __torch_dispatch__ of AsyncCollectiveTensor, tracing out a `wait_tensor()` call before dispatching to any of the other ops in the function we are tracing

(4) however: `trigger_wait()` was getting called in a way where we would ignore its output (and return `self.elem` directly), which would cause the `wait_tensor` ops to get DCE'd.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125677
Approved by: https://github.com/wanchaol, https://github.com/yifuwang
ghstack dependencies: #125676
2024-05-08 15:54:01 +00:00
5d97c22845 AOTAutograd: use info not debug logging for ViewAndMutationMeta (#125676)
Before, the AOTAutograd metadata would not get logged when running with `TORCH_LOGS="aot"`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125676
Approved by: https://github.com/albanD
2024-05-08 15:54:01 +00:00
6f619cc727 [ez] functorch/test_vmap and test_dataloader to run in parallel (#125597)
Also mark test_svd serial in linalg to see if it helps with the flakiness
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125597
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2024-05-08 15:37:29 +00:00
bd2635578b [vision hash update] update the pinned vision hash (#125521)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125521
Approved by: https://github.com/pytorchbot
2024-05-08 15:34:36 +00:00
2e237fcd70 Revert "[inductor] add cpp builder code. (#124045)"
This reverts commit 469383755fe416eb1c41fa724762ad3eaecdff07.

Reverted https://github.com/pytorch/pytorch/pull/124045 on behalf of https://github.com/clee2000 due to broke inductor/test_codecache and inductor/test_max_autotune 469383755f https://github.com/pytorch/pytorch/actions/runs/8996772350/job/24724775182 ([comment](https://github.com/pytorch/pytorch/pull/124045#issuecomment-2100851419))
2024-05-08 15:33:20 +00:00
c5b6c696c1 Start refactoring runtime wrappers (#125595)
This is the first PR in a series where I try to organize our runtime wrappers a bit: specifically, I'd like to separate wrappers into objects that have (up to) 2 methods:
- A **pre-compile** function, which takes in flat_fn and flat_args (inputs to the compiler) and wraps/modifies them
- A **post-compile** function, which takes in a compiled_fn and runtime args and wraps the compiled_function.

Extra metadata necessary to run the compile functions can be stored on the attributes of the class. This way, when we think about caching, the set of attributes on the class should be the exact set of metadata that we need to serialize and save in the cache (along with common data, like fw_metadata)
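A minimal illustrative sketch of that shape (these are not the actual AOTAutograd classes; the names and signatures are made up for illustration):

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple

@dataclass
class ExampleWrapper:
    # Attributes hold exactly the metadata a cache would need to serialize.
    meta: Dict[str, Any] = field(default_factory=dict)

    def pre_compile(
        self, flat_fn: Callable, flat_args: List[Any]
    ) -> Tuple[Callable, List[Any]]:
        # Inspect/modify the compiler inputs and remember what was done.
        self.meta["num_args"] = len(flat_args)
        return flat_fn, flat_args

    def post_compile(self, compiled_fn: Callable) -> Callable:
        def runtime_wrapper(*runtime_args: Any) -> Any:
            # self.meta is available here to adjust runtime behavior if needed.
            return compiled_fn(*runtime_args)
        return runtime_wrapper
```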
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125595
Approved by: https://github.com/bdhirsh
2024-05-08 15:20:36 +00:00
13462ecd27 Update preserve_node_meta to reset torch.fx.traceback.current_meta (#125500)
Fixes https://github.com/pytorch/pytorch/issues/122766

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125500
Approved by: https://github.com/xmfan, https://github.com/ezyang
2024-05-08 14:30:34 +00:00
8cad88e1f3 [BE]: Improve exception typing. Remove NOQAs (#125535)
Improve some exception typing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125535
Approved by: https://github.com/albanD
2024-05-08 14:07:13 +00:00
82b7b59d2a [inductor] Check if n is the input tensor of conv_pointwise (#125119)
Fix https://github.com/pytorch/pytorch/issues/124837.
Check whether n is the input tensor of convolution_pointwise or qconv2d_pointwise; if so, freeze the layout to channels_last.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125119
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2024-05-08 13:25:49 +00:00
d17be10df1 make torch.amp.autocast more generic (#125103)
# Motivation
As discussed in [#124479](https://github.com/pytorch/pytorch/pull/124479), `torch.amp.autocast` could NOT be completely equivalent to `torch.cuda.amp.autocast` and `torch.cpu.amp.autocast`, since `torch.amp.autocast` did NOT apply the default `dtype` for CPU (`torch.bfloat16`) and CUDA (`torch.float16`) respectively. We would like `torch.amp.autocast` to be more generic to help developers/customers write device-agnostic code, because there are not enough reasons to add a device-specific autocast `torch.xxx.amp.autocast` for each device backend.

# Solution
When `None` is passed as `dtype`, we use `torch.get_autocast_dtype` to get the default dtype for each backend. Meanwhile, `torch.get_autocast_dtype` also needs to be supported in the JIT path for BC.

# Additional Context
With this PR, `torch.amp.autocast(device_type='cuda')` is equivalent to `torch.cuda.amp.autocast`.
Add two new UTs to cover this change in the eager and JIT paths, respectively.
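A minimal device-agnostic usage sketch of what this enables (the shapes here are arbitrary; the point is that leaving `dtype` unset resolves to each backend's default autocast dtype):

```python
import torch

device_type = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(8, 8, device=device_type)
w = torch.randn(8, 8, device=device_type)

# No dtype argument: resolves to float16 on CUDA and bfloat16 on CPU.
with torch.amp.autocast(device_type=device_type):
    y = x @ w

print(y.dtype)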

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125103
Approved by: https://github.com/albanD, https://github.com/jgong5, https://github.com/gujinghui
2024-05-08 12:13:26 +00:00
320af5eaa6 Compute bounds for the variables created during codegen (#123100)
Before we would just bail out on these bounds for all variables that did
not come from the FX graph. Now we propagate the bounds whenever we have
a rule for that op.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123100
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-05-08 08:14:06 +00:00
15a9770225 [DSD] Implement broadcast_from_rank0 option for optim state_dict (#125339)
Summary:
This is useful if users would like to avoid CPU memory OOM when loading from a full state_dict.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125339
Approved by: https://github.com/weifengpy
ghstack dependencies: #125708, #125338
2024-05-08 07:22:20 +00:00
0542fd485f [DSD] Implement broadcast_from_rank0 option for model state_dict (#125338)
Summary:
This is useful if users would like to avoid CPU memory OOM when loading from a full state_dict.
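A hedged sketch of how the option might be used (the field and function names follow the distributed state_dict APIs and this PR's description; treat them as assumptions rather than a verified recipe, and note it requires an initialized process group):

```python
import torch
import torch.distributed as dist
from torch.distributed.checkpoint.state_dict import (
    StateDictOptions,
    set_model_state_dict,
)

def load_full_checkpoint(model: torch.nn.Module, path: str) -> None:
    # Only rank 0 materializes the full state_dict on CPU; the new
    # broadcast_from_rank0 option broadcasts it to the other ranks.
    options = StateDictOptions(full_state_dict=True, broadcast_from_rank0=True)
    full_sd = torch.load(path, map_location="cpu") if dist.get_rank() == 0 else {}
    set_model_state_dict(model, full_sd, options=options)
```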

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125338
Approved by: https://github.com/weifengpy
ghstack dependencies: #125708
2024-05-08 07:11:18 +00:00
88fbe79550 [DSD] Fix set_optimizer_state_dict() changes the parameters with some optimizers (#125708)
Summary:
Some optimizers, like AdamW, change the parameters even if gradients are zero. So `set_optimizer_state_dict()` may affect the parameters values with these optimizers. This PR fixes the issue.

This PR also fixes https://github.com/pytorch/pytorch/issues/121186.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125708
Approved by: https://github.com/wz337
2024-05-08 06:57:20 +00:00
469383755f [inductor] add cpp builder code. (#124045)
The previous full PR https://github.com/pytorch/pytorch/pull/115248 failed to merge because fb_code is hard to debug.
I also tried to submit it as two pieces, https://github.com/pytorch/pytorch/pull/118514 and https://github.com/pytorch/pytorch/pull/118515, and they passed PreCI at that time.

Now I have split https://github.com/pytorch/pytorch/pull/115248 into smaller pieces; this is the first step of RFC https://github.com/pytorch/pytorch/issues/124245.
Changes:
1. Add cpp builder code; the new cpp_builder supports Windows.
2. Add a CPU ISA checker which is cross-OS and exported from the cpuinfo backend.
3. Switch the compiler ISA checker to the new cpp builder.
4. Make CppCodeCache use the new ISA checker.
5. Add a temporary `test_new_cpp_build_logical` UT to help with the transfer to the new code.
<img width="1853" alt="Image" src="https://github.com/pytorch/pytorch/assets/8433590/ce6519ab-ba92-4204-b1d6-7d15d2ba2cbe">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124045
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-05-08 05:27:15 +00:00
08f6ef0e1c [caffe2] Lazily symbolize backtrace in c10::Error (#125682)
Summary:
The macros that build `c10::Error` compute the stack trace at the point of throwing, which is then returned as part of the `what()`. If `what()` is never called, which is the case for most exceptions (since logging is throttled), the cost of computing the stack trace was wasted.

By far, the most expensive part of computing the stack trace is its symbolization; just unwinding the stack and collecting the instruction addresses is comparatively cheap. We can thus defer the symbolization to first invocation of `what()`.

Test Plan:
Added unit tests exercising the lazy nature of `what()`.

Ran an adfinder canary: https://www.internalfb.com/intern/ads/canary/460118801509424346

We can see that the cost of symbolization is obliterated (meaning that `what()` is virtually never called, as expected):
 {F1496627896}

Reviewed By: ezyang

Differential Revision: D56586844

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125682
Approved by: https://github.com/ezyang
2024-05-08 04:57:59 +00:00
a1a22a22d5 [ROCm] Parameterize the triton build dir (#125420)
- Removes hard coding and helps in internal builds

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125420
Approved by: https://github.com/malfet
2024-05-08 04:46:52 +00:00
50073127b5 [tp] add some test for shard output layouts for rowwise parallel (#125713)
as titled. This is to make sure everything works as expected if we
configure the RowwiseParallel output layouts to be sharded.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125713
Approved by: https://github.com/XilunWu
ghstack dependencies: #125693, #125695
2024-05-08 03:45:34 +00:00
9a2375b6b7 [dtensor] improve some pretty print in op schema (#125695)
as titled. When I debugged
https://github.com/pytorch/pytorch/pull/125369, I found this would be a
quality-of-life improvement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125695
Approved by: https://github.com/yifuwang, https://github.com/XilunWu
ghstack dependencies: #125693
2024-05-08 03:45:34 +00:00
65fec7bbbf [dtensor] make sure meta tensor random op does not alternate rng state (#125693)
as titled, for meta tensor ops, we should avoid calling the RNGTracker,
which could potentially alter the current RNG state. Meta tensor ops
should be no-ops; only the post-`to_empty` init should actually alter the
RNG state.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125693
Approved by: https://github.com/XilunWu
2024-05-08 03:45:29 +00:00
38baa02a40 Meta kernel for _pack_padded_sequence (#124794)
Summary: Op implementation: 8cf54929e3/aten/src/ATen/native/PackedSequence.cpp (L34)

Fixes https://fb.workplace.com/groups/pytorch.edge.users/permalink/1499571650913123/

I'm not entirely sure how to test this meta kernel.

Differential Revision: D56478332

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124794
Approved by: https://github.com/ezyang
2024-05-08 03:11:22 +00:00
2deea9e6e9 [profiler] enable CUPTI range profiler in build (#125685)
Fixes #125272
## About
(This is a re-spin of PR #106617)

Kineto introduced a new profiler to read performance counters from NVIDIA GPUs (CUPTI Range Profiler API) added in PR[75616](https://github.com/pytorch/pytorch/pull/75616).  Support for the range profiler mode was disabled as we had to link with a NV PerfWorks library (`libnvperf_host.so`). This PR adds that link.

The change includes-
* Updates cmake build files to find `libnvperf_host.so` and set `CUDA_nvperf_host_LIBRARY`
* WIP use the above cmake variable in kineto, will update this PR after kineto PR has landed
See https://github.com/pytorch/kineto/pull/724

## Example usage of CUPTI profiler
The code snippet below shows how to configure the pytorch profiler in CUPTI Profiler mode. Any code included in the profiling window will be profiled by CUPTI/Kineto. Note how the `_ExperimentalConfig` struct is used to configure profiler metrics
```
    with torch.profiler.profile(
        activities=[torch.profiler.ProfilerActivity.CUDA],
        record_shapes=True,
        on_trace_ready=trace_handler,
        experimental_config=torch.profiler._ExperimentalConfig(
            profiler_metrics=[
                "kineto__tensor_core_insts",
                "dram__bytes_read.sum",
                "dram__bytes_write.sum"],
            profiler_measure_per_kernel=False),
    ) as prof:
        res = train_batch(modeldef)
        prof.step()
```
For a full example see this [xor.py](https://gist.github.com/briancoutinho/b1ec7919d8ea2bf1f019b4f4cd50ea80) gist.

### Details of how to configure the CUPTI profiler
The `_ExperimentalConfig` structure can be used to pass metrics to the profiler
```
   profiler_metrics : a list of CUPTI profiler metrics used
       to measure GPU performance events. Any metric supported by CUPTI can be used, see here=
       https://docs.nvidia.com/cupti/r_main.html#r_profiler
       There are two special alias metrics `kineto__tensor_core_insts` and `kineto__cuda_core_flops` for FLOPS counting.
   profiler_measure_per_kernel (bool) : whether to profile metrics per kernel
      or for the entire measurement duration.
```

## Testing
Built from source with kineto [PR](https://github.com/pytorch/kineto/pull/724)
```
$> USE_CUDA=1 python setup.py install
--   CUDA_cupti_LIBRARY = /public/apps/cuda/11.6/extras/CUPTI/lib64/libcupti.so
--   CUDA_nvperf_host_LIBRARY = /public/apps/cuda/11.6/extras/CUPTI/lib64/libnvperf_host.so
```

Then run the example [xor.py](https://gist.github.com/briancoutinho/b1ec7919d8ea2bf1f019b4f4cd50ea80). This only works on V100+ GPUs. Adding logs for debugging etc.
```
>$ export KINETO_LOG_LEVEL=1
>$ python xor.py
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiActivityProfiler.cpp:167] CUDA versions. CUPTI: 16; Runtime: 11060; Driver: 11040
  Log file: /tmp/libkineto_activities_1683060.json
  Trace start time: 2023-02-11 19:11:47  Trace duration: 500ms
  Warmup duration: 0s
  Max GPU buffer size: 128MB
  Enabled activities: cuda_profiler_range
Cupti Profiler metrics : kineto__tensor_core_insts, dram__bytes_read.sum, dram__bytes_write.sum
Cupti Profiler measure per kernel : 0
Cupti Profiler max ranges : 10
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiActivityProfiler.cpp:638] Enabling GPU tracing
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiActivityProfiler.cpp:567] Running child profiler CuptiRangeProfiler for 500 ms
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiRangeProfiler.cpp:104] Configuring 3 CUPTI metrics
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiRangeProfiler.cpp:109]    sm__inst_executed_pipe_tensor.sum
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiRangeProfiler.cpp:109]    dram__bytes_read.sum
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiRangeProfiler.cpp:109]    dram__bytes_write.sum
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiActivityProfiler.cpp:575] Running child profiler CuptiRangeProfiler for 500 ms
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiActivityProfiler.cpp:672] Tracing starting in 9s
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiActivityProfiler.cpp:677] Tracing will end in 10s
STAGE:2023-02-11 19:11:37 1683060:1683060 ActivityProfilerController.cpp:310] Completed Stage: Warm Up
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiActivityProfiler.cpp:693] Starting child profiler session
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125685
Approved by: https://github.com/sraikund16
2024-05-08 02:34:31 +00:00
9fedf41b60 Dockerfile should set the syntax directive to v1 (#125632)
Fixes #125526 [#1811](https://github.com/pytorch/builder/issues/1811)

Adopt syntax=docker/dockerfile:1, which has been stable since 2018 and is still best practice to declare in 2024.
- Syntax features dependent upon the [syntax directive version are documented here](https://hub.docker.com/r/docker/dockerfile).
- While you can set a fixed minor version, [Docker officially advises to only pin the major version](https://docs.docker.com/build/dockerfile/frontend/#stable-channel):

```
We recommend using docker/dockerfile:1, which always points to the latest stable release of the version 1 syntax, and receives both "minor" and "patch" updates for the version 1 release cycle.
BuildKit automatically checks for updates of the syntax when performing a build, making sure you are using the most current version.
```

**Support for building with Docker prior to v23 (released on Feb 2023)**
NOTE: 18.06 may not be the accurate minimum version for using docker/dockerfile:1; according to the [DockerHub tag history](https://hub.docker.com/layers/docker/dockerfile/1.0/images/sha256-92f5351b2fca8f7e2f452aa9aec1c34213cdd2702ca92414eee6466fab21814a?context=explore), 1.0 of the syntax seems to be from Dec 2018, which is probably why docker/dockerfile:experimental was paired with it in this file.

Personally, I'd favor only supporting builds with Docker v23. This is only relevant for someone building this Dockerfile locally; the user could still extend the already built and published image from a registry on older versions of Docker without any concern for this directive, which only applies to building this Dockerfile, not images that extend it.

However, if you're reluctant, you may want to refer others to [this Docker docs page](https://docs.docker.com/build/buildkit/#getting-started), where they should only need ENV DOCKER_BUILDKIT=1; presumably the requirement for experimental was dropped with syntax=docker/dockerfile:1 in Docker releases since Dec 2018. Affected users can often quite easily install a newer version of Docker on their OS, as per Docker's official guidance (usually by adding an additional repo to the package manager).

**Reference links**
Since one of these was already included in the inline note (now a broken link), I've included relevant links mentioned above. You could alternatively rely on git blame with a commit message referencing the links or this PR for more information.

Feel free to remove any of the reference links; they're mostly only relevant for maintainers to be aware of (which this PR itself has detailed adequately above).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125632
Approved by: https://github.com/malfet
2024-05-08 01:52:56 +00:00
58e045d03c [MPS] Fix strided ELU op (#125692)
Fixes https://github.com/pytorch/pytorch/issues/124834

Summary of changes:

In case of non-contiguous input, the output would be non-contiguous too. At the moment it's not supported to save the result to a non-contiguous buffer, thus we need two steps: one to allocate a contiguous buffer, and a second one to scatter the result back to the original output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125692
Approved by: https://github.com/kulinseth
2024-05-08 01:34:40 +00:00
21aaac47e7 [torchelastic] add timing events to different stages of rendezvous (#125636)
Summary: as title

Test Plan: unit tests. Launched a test job and observed scuba results: {F1506543300}

Differential Revision: D57018103

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125636
Approved by: https://github.com/d4l3k
2024-05-08 01:14:23 +00:00
a3d97f6ce4 [ONNX] Benchmark onnx export w/ ort fusions (#125700)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125700
Approved by: https://github.com/thiagocrepaldi
2024-05-08 01:10:05 +00:00
baf36f6d11 Pad bandwidth bound split k kernels on a100 (#125650)
For
```
import torch
import triton

dtype = torch.bfloat16

t1 = torch.empty([2, 20569856], dtype=dtype, device="cuda")
t2 = torch.empty([20569856, 13], dtype=dtype, device="cuda")

@torch.compile()
def benchmark(t1, t2):
    return torch.ops.aten.mm(t1, t2)

print(triton.testing.do_bench(lambda: benchmark(t1, t2)))
```

Improves perf from 449ms  -> 1.2578779458999634ms

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125650
Approved by: https://github.com/bertmaher
2024-05-08 01:04:35 +00:00
ba27548679 [MPS] Remove in place views (causes too many crashes) (#124895)
Fixes https://github.com/pytorch/pytorch/issues/96153

Remove in place views as they are a general cause for many crashes.
Proper fix to handle views without copies will come in a different PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124895
Approved by: https://github.com/kulinseth
2024-05-08 01:00:37 +00:00
3fb53bb6a7 [MPS] Fix strided mse_loss (#125696)
Fixes https://github.com/pytorch/pytorch/issues/124621

Summary of changes:
- In case of non-contiguous input, the output would be non-contiguous too. At the moment it's not supported to save the result to a non-contiguous buffer, thus we need two steps: one to allocate a contiguous buffer, and a second one to scatter the result back to the original output.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125696
Approved by: https://github.com/kulinseth
2024-05-08 00:52:26 +00:00
939b701d3a SymInt-ify mem-efficient attention forward op signature (#125418)
Need this for dynamic shapes! Before this PR, guards on constant min / max seq len values are introduced when SDPA calls mem-efficient attention.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125418
Approved by: https://github.com/soulitzer
2024-05-07 23:59:28 +00:00
bb6ba31250 [DCP] Adds storage metadata, and passes it during the save path (#124772)
This PR seeks to increase observability of save/load requests. This is accomplished with two main changes:

1. The creation of save_id and load_id:
    - a save_id and a load_id are added to the filesystem writer; `save_id` is re-generated on every save call, and `load_id` is re-generated on every load call.
    - both these ID's are stored in a new `StorageMeta` class, and saved as part of Metadata. (`load_id` is None when we save, and only set during load)

2. A new mechanism is implemented in the save path which gives the SavePlanner a chance to inspect the `storage_meta` object. The mechanism mirrors the same metadata exchange in the load path. In the load path, `storage_meta` is added to `metadata` such that the LoadPlanner can also access `storage_meta` before we begin loading.

*If users now wish to access the checkpoint_id in the SavePlanner, they simply need to access the value in `storage_meta` from the `set_up_planner` call*

*Additionally, users now have a generic way of passing data to the SavePlanner from the StorageWriter at the start of the save path, similar to the load path*
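A hedged sketch of that usage (the `set_up_planner` signature and the `save_id` field name are assumed from the description above, not verified against the code):

```python
from torch.distributed.checkpoint.default_planner import DefaultSavePlanner

class IdAwareSavePlanner(DefaultSavePlanner):
    def set_up_planner(self, state_dict, storage_meta=None, is_coordinator=False):
        if storage_meta is not None:
            # storage_meta carries data handed over by the StorageWriter, e.g. the save id.
            print("starting save:", storage_meta.save_id)
        super().set_up_planner(state_dict, storage_meta, is_coordinator)
```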

This PR has been tested for backwards compatibility -- meaning any checkpoints saved before this PR can continue being loaded after this PR.

One major consideration is that there is limited forwards compatibility. If a checkpoint is generated _past_ this PR, there is no support for loading it using older torch versions. This brings up a fairly important point: since we expect the metadata object (which is saved to the disk) to continue evolving, and we want to support forwards compatibility, we explore patching `pickle` so we can at least add new members to `metadata` and maintain fwd compat.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124772
Approved by: https://github.com/fegin
2024-05-07 23:53:53 +00:00
244d93039d Remove fbobjc_configs from xplat (#125586)
Summary: Pull out the configs attributes in xplat targets, these no longer do anything.

Test Plan:
```
$ buck2 uquery //xplat/... > /dev/null
```

Reviewed By: d16r

Differential Revision: D56855974

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125586
Approved by: https://github.com/malfet
2024-05-07 23:48:20 +00:00
8b4d62009d [EZ] Update jinja2 to 3.1.4 (#125698)
To address https://cwe.mitre.org/data/definitions/79.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125698
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2024-05-07 23:40:54 +00:00
8be4c1bc2f [export] Add metadata for nodes insert_deferred_runtime_asserts (#125414)
Fixes [internal error](https://fb.workplace.com/groups/1075192433118967/permalink/1416709435633930/).

The issue is that the asserting nodes added in the `insert_deferred_runtime_assertion` pass do not contain the metadata that the ExportedProgram requires the graph to have. One solution is to retrace the entire module; another is to manually add back this metadata.

This diff implements the latter solution (manually adding back the metadata) by hooking into fx.Graph's `create_node` function and adding export-specific metadata for every node that is created. The reason I did this is so that `insert_deferred_runtime_assertion` does not have to know what metadata export wants.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125414
Approved by: https://github.com/zhxchen17, https://github.com/BoyuanFeng
2024-05-07 23:15:21 +00:00
8024e72326 [export] Warn on capture_pre_autograd_graph. (#125602)
Summary: capture_pre_autograd_graph is deprecated and torch.export won't be able to provide timely fixes for this API. To reduce some confusion around this, we should explicitly give users clear warnings.

Test Plan: eyes

Reviewed By: tarun292

Differential Revision: D56955202

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125602
Approved by: https://github.com/angelayi
2024-05-07 22:51:17 +00:00
021ff7fd77 [BE] Explicitly handle all types c10::isSigned (#125637)
By defining `CASE_ISSIGNED` macros that just return `std::numeric_limits<dtype>::is_signed` for the types where it makes sense, and explicitly coding some types where it does not

Remove `default:` case from the switch to avoid regressions like the one reported in https://github.com/pytorch/pytorch/issues/125124 , as [`-Wswitch-enum`](https://clang.llvm.org/docs/DiagnosticsReference.html#wswitch-enum) in combination with `-Werror` will raise an error in case of a missing entry, for example:
```
/Users/nshulga/git/pytorch/pytorch/c10/core/ScalarType.h:518:11: warning: enumeration value 'QInt32' not handled in switch [-Wswitch]
  switch (t) {
          ^
/Users/nshulga/git/pytorch/pytorch/c10/core/ScalarType.h:518:11: note: add missing switch cases
  switch (t) {
          ^
1 warning generated.
```

Fixes https://github.com/pytorch/pytorch/issues/125124

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125637
Approved by: https://github.com/albanD
2024-05-07 22:51:03 +00:00
51f25c08f4 Fix 'Could not infer dtype of SymBool' on torch.tensor call (#125656)
Internal xref:
https://fb.workplace.com/groups/469587837192818/posts/1638909336927323/

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125656
Approved by: https://github.com/albanD
2024-05-07 22:41:49 +00:00
e3d5afc60a Enable dynamo'd test for 116499 (#123469)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123469
Approved by: https://github.com/janeyx99
ghstack dependencies: #123619
2024-05-07 22:17:01 +00:00
0f02e0aa39 Disable dynamo on functional optims if capturable=False (#123619)
This resolves a bug in eager where, if an old state dict is loaded (without the capturable flag) but the original dict had the capturable flag, then state_steps would be on cuda but we would take the non-capturable path. We now fall back to eager if capturable=False.

Current design doc and discussion: https://docs.google.com/document/d/1DmmbiaSp16CDZtGw1qzXKHFTY_0gqc0xpnBdviXq0vk/edit#heading=h.871u7bvwz7ze

Note on the actual fallback logic - there was an issue with torchscript originally not handling *args, **kwargs properly; after rectifying that by using `functools.wraps`, there was an additional bug with scoping which required the single tensor implementation to be in the global scope at the time the fallback closure was created. I pass the single tensor function into the `_disable_dynamo_if_unsupported` decorator to work around this bug.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123619
Approved by: https://github.com/janeyx99
2024-05-07 22:17:01 +00:00
0fd1fc17c3 [MPS] Fix abs for complex types (#125662)
By calling `realPartOfTensor:` if the input type is complex on Sonoma, and falling back to the `at::view_as_real` trick on Ventura.

Split `unary_op` template into `unary_op` and `unary_op_noresize`, which skips resize and empty checks

Marked `abs`, `isclose` and `nn.functional.softsign` OpInfo tests as supported by complex types

Fixes https://github.com/pytorch/pytorch/issues/125135

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125662
Approved by: https://github.com/kulinseth
2024-05-07 22:15:20 +00:00
2163956208 [TGIF][HHC][Sharding] add device_ordinal to Subgraph (#125616)
Summary: Add a new field device_ordinal to Subgraph class so we can store device ordinal during splitting

Test Plan: See test plan for D56535827

Differential Revision: D57010103

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125616
Approved by: https://github.com/jiayisuse
2024-05-07 21:48:59 +00:00
b356a0de86 Add support for multiple flexattention calls in a single compile (#125516)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125516
Approved by: https://github.com/yanboliang, https://github.com/drisspg
2024-05-07 21:37:37 +00:00
d4225c55d9 [fx] Prioritize runtime assertions ops (#124213)
Summary:
We want to prioritize operators involved in data-dependent runtime assertions when legalizing the graph. For example, in the following piece of code, the `assert_scalar` and `assert_async` calls need to occur before the `slice_copy` for the program to run correctly with fake tensors. Otherwise we will run into a data-dependent error.

```
        _local_scalar_dense: "Sym(u113)" = torch.ops.aten._local_scalar_dense.default(aten_minimum_default);  aten_minimum_default = None

        ge_1: "Sym(u113 >= 2)" = _local_scalar_dense >= 2
        aten_scalar_tensor_default_3: "f32[]" = executorch_exir_dialects_edge__ops_aten_scalar_tensor_default(ge_1);  ge_1 = None
        aten__assert_async_msg_2 = executorch_exir_dialects_edge__ops_aten__assert_async_msg(aten_scalar_tensor_default_3, '_local_scalar_dense is outside of inline constraint [2, 1000].');  aten_scalar_tensor_default_3 = None
        le_1: "Sym(u113 <= 1000)" = _local_scalar_dense <= 1000
        aten_scalar_tensor_default_4: "f32[]" = executorch_exir_dialects_edge__ops_aten_scalar_tensor_default(le_1);  le_1 = None
        aten__assert_async_msg_3 = executorch_exir_dialects_edge__ops_aten__assert_async_msg(aten_scalar_tensor_default_4, '_local_scalar_dense is outside of inline constraint [2, 1000].');  aten_scalar_tensor_default_4 = None

        mul: "Sym(-u112)" = -1 * sym_size;  sym_size = None
        add: "Sym(-u112 + u113)" = _local_scalar_dense + mul;  mul = None
        lt: "Sym(-u112 + u113 < 0)" = add < 0;  add = None
        aten__assert_scalar_default = executorch_exir_dialects_edge__ops_aten__assert_scalar_default(lt, 'Deferred runtime assertion failed -u0 + u1 < 0');  lt = None

        aten_slice_copy_tensor_3: "f32[u113]" = executorch_exir_dialects_edge__ops_aten_slice_copy_Tensor(getitem, 0, 0, _local_scalar_dense);  getitem = None
```

Test Plan: test case

Differential Revision: D56201450

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124213
Approved by: https://github.com/SherlockNoMad
2024-05-07 21:31:10 +00:00
2f79a18324 Revert "[inductor] add cpp builder code. (#124045)"
This reverts commit 7864d287a1e56685aa754285cc2d3c31ff055f62.

Reverted https://github.com/pytorch/pytorch/pull/124045 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing trunk jobs 7864d287a1 including lint ([comment](https://github.com/pytorch/pytorch/pull/124045#issuecomment-2099306071))
2024-05-07 21:04:49 +00:00
c5e04a4479 More accurate is_bw and prompt parents cleanup for ModuleTracker utils (#125634)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125634
Approved by: https://github.com/soulitzer, https://github.com/Chillee
2024-05-07 20:57:36 +00:00
fdfef759a6 Add userbase library dir to windows dll search path (#125684)
Fixes https://github.com/pytorch/pytorch/issues/125109 which is a regression introduced by https://github.com/pytorch/builder/pull/1467 that adds a dynamic dependency on mkl, which, if installed in the user dir, is placed into `sysconfig.get_config_var("userbase") / "Library" / "bin"`

Fix this by adding the `userbase` library folder to the DLL search path
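A hedged sketch of the idea (not the exact torch/__init__.py code): compute the user-site "Library/bin" directory and register it for DLL resolution on Windows before loading native libraries.

```python
import os
import sysconfig

userbase = sysconfig.get_config_var("userbase")
if os.name == "nt" and userbase:
    lib_bin = os.path.join(userbase, "Library", "bin")
    if os.path.isdir(lib_bin):
        # os.add_dll_directory is available on Windows with Python 3.8+.
        os.add_dll_directory(lib_bin)
```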

Testing before this fix:
```
Python 3.12.3 (tags/v3.12.3:f6650f9, Apr  9 2024, 14:05:25) [MSC v.1938 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Administrator\AppData\Roaming\Python\Python312\site-packages\torch\__init__.py", line 141, in <module>
    raise err
OSError: [WinError 126] The specified module could not be found. Error loading "C:\Users\Administrator\AppData\Roaming\Python\Python312\site-packages\torch\lib\shm.dll" or one of its dependencies.
>>> exit()
```

After:
```
c:\Program Files\Python312>python
Python 3.12.3 (tags/v3.12.3:f6650f9, Apr  9 2024, 14:05:25) [MSC v.1938 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> exit()
```
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125684
Approved by: https://github.com/malfet
2024-05-07 20:43:40 +00:00
7864d287a1 [inductor] add cpp builder code. (#124045)
The previous full PR https://github.com/pytorch/pytorch/pull/115248 failed to merge because fb_code is hard to debug.
I also tried to submit it as two pieces, https://github.com/pytorch/pytorch/pull/118514 and https://github.com/pytorch/pytorch/pull/118515, and they passed PreCI at that time.

Now I have split https://github.com/pytorch/pytorch/pull/115248 into smaller pieces; this is the first step of RFC https://github.com/pytorch/pytorch/issues/124245.
Changes:
1. Add cpp builder code; the new cpp_builder supports Windows.
2. Add a CPU ISA checker which is cross-OS and exported from the cpuinfo backend.
3. Switch the compiler ISA checker to the new cpp builder.
4. Make CppCodeCache use the new ISA checker.
5. Add a temporary `test_new_cpp_build_logical` UT to help with the transfer to the new code.
<img width="1853" alt="Image" src="https://github.com/pytorch/pytorch/assets/8433590/ce6519ab-ba92-4204-b1d6-7d15d2ba2cbe">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124045
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-05-07 20:07:41 +00:00
b23b6e7108 Ensure that vmap is restored properly if an exception is thrown during frame eval (#122074)
We save and restore the DynamicLayerStack during frame eval, but since an fx graph has no way to express a try/finally, we just assume the restore will happen. If we throw an exception between the push and the pop to the stack, we're left in a state that affects the following operations poorly. Make sure that if the stack is in a bad state, we restore it after frame eval.
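A generic sketch of the shape of the fix, using a plain Python list as a stand-in for the DynamicLayerStack (illustrative only, not dynamo internals):

```python
import contextlib
from typing import Any, Iterator, List

@contextlib.contextmanager
def preserve_stack(stack: List[Any]) -> Iterator[None]:
    saved = list(stack)      # snapshot before frame eval
    try:
        yield
    finally:
        stack[:] = saved     # restore even if frame eval raised

layer_stack = ["vmap"]       # stand-in for the DynamicLayerStack
try:
    with preserve_stack(layer_stack):
        layer_stack.append("grad")             # mutation made during frame eval
        raise RuntimeError("frame eval failed")
except RuntimeError:
    pass
assert layer_stack == ["vmap"]  # the bad state did not leak into later operations
```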

Repro:
before:
```
$ rm test/dynamo_skips/TestSparseCPU.test_log1p_cpu_uint8
$ rm test/dynamo_expected_failures/FuncTorchHigherOrderOpTests.test_vmap_free_tensor
$ PYTORCH_TEST_WITH_DYNAMO=1 pytest test/jit/test_sparse.py test/dynamo/test_dynamic_shapes.py test/inductor/test_torchinductor_dynamic_shapes.py test/test_sparse.py -k 'test_log1p_cpu_uint8'
============= 1 passed, 8588 deselected in 9.75s =============
$ PYTORCH_TEST_WITH_DYNAMO=1 pytest test/jit/test_sparse.py test/dynamo/test_dynamic_shapes.py test/inductor/test_torchinductor_dynamic_shapes.py test/test_sparse.py -k
'test_vmap_free_tensor_dynamic_shapes or test_log1p_cpu_uint8'
================== short test summary info ===================
FAILED [0.0632s] test/test_sparse.py::TestSparseCPU::test_log1p_cpu_uint8 - AssertionError: "only Tensors of floating point dtype can require gradients"
does not match "You are attempting to call Tensor.requires_grad_() (or perhaps using torch.autograd.functional.* APIs) inside of a function ...
======= 1 failed, 1 skipped, 8587 deselected in 10.99s =======
```
(Note that adding test_vmap_free_tensor_dynamic_shapes to the run causes test_log1p_cpu_uint8 to fail)
after:
```
$ rm test/dynamo_skips/TestSparseCPU.test_log1p_cpu_uint8
$ rm test/dynamo_expected_failures/FuncTorchHigherOrderOpTests.test_vmap_free_tensor
$ PYTORCH_TEST_WITH_DYNAMO=1 pytest test/jit/test_sparse.py test/dynamo/test_dynamic_shapes.py test/inductor/test_torchinductor_dynamic_shapes.py test/test_sparse.py -k 'test_log1p_cpu_uint8'
============= 1 passed, 8588 deselected in 9.89s =============
$ PYTORCH_TEST_WITH_DYNAMO=1 pytest test/jit/test_sparse.py test/dynamo/test_dynamic_shapes.py test/inductor/test_torchinductor_dynamic_shapes.py test/test_sparse.py -k
'test_vmap_free_tensor_dynamic_shapes or test_log1p_cpu_uint8'
======= 1 passed, 1 skipped, 8587 deselected in 11.34s =======
```
(test_vmap_free_tensor_dynamic_shapes passes either way)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122074
Approved by: https://github.com/oulgen
2024-05-07 19:36:52 +00:00
196a0b1722 Add Inductor micro benchmark workflow (#125450)
Fixes #ISSUE_NUMBER

Co-authored-by: Huy Do <huydhn@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125450
Approved by: https://github.com/huydhn
2024-05-07 18:56:01 +00:00
5fd0b6e5f7 Revert "add uuid in cudaDeviceProperties (#125083)"
This reverts commit f35fe4eaf1e9fa2e631f6bf1a3eb6e5fbf14183b.

Reverted https://github.com/pytorch/pytorch/pull/125083 on behalf of https://github.com/clee2000 due to test_uuid is flaky.  ex https://github.com/pytorch/pytorch/actions/runs/8988855916/job/24692369523 https://hud.pytorch.org/flakytest?name=test_uuid&suite=TestCuda&file=%25&limit=300 ([comment](https://github.com/pytorch/pytorch/pull/125083#issuecomment-2099029993))
2024-05-07 18:16:27 +00:00
f7d48302b6 [DSD] Fix to remove non_persistent buffer in distributed state dict (#125337)
Summary:
Fixes #122792

state_dict includes only persistent buffers, while named_buffers() would
include non_persistent buffers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125337
Approved by: https://github.com/awgu
ghstack dependencies: #125333, #125501, #125334, #125335, #125336
2024-05-07 17:57:34 +00:00
a89177936c [DSD] Correctly handle _extra_state (#125336)
Summary:
distributed_state_dict should not try to use `getattr` to get `_extra_state` as this is not well-defined.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125336
Approved by: https://github.com/LucasLLC
ghstack dependencies: #125333, #125501, #125334, #125335
2024-05-07 17:31:33 +00:00
9f1d3eebf5 Update PyTorch ONNX Exporter maintainers (#125630)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125630
Approved by: https://github.com/BowenBao, https://github.com/kit1980
2024-05-07 17:29:05 +00:00
6f1e3a6bf7 [DCP] Always flatten mapping even if no tensors present (#125335)
Summary:
 Right now DCP only flattens a mapping (e.g., a dict) if that mapping has tensor objects. This behavior is odd, as users may save different non-tensor objects on different ranks. Without flattening the mappings, we may lose these non-tensor objects. One use case is the dataloader state_dict.

 We may also want to do so for a list/tuple. But this will cause extra pickles. So we don't do this for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125335
Approved by: https://github.com/LucasLLC, https://github.com/wz337
ghstack dependencies: #125333, #125501, #125334
2024-05-07 17:08:49 +00:00
790f43c315 Run test_inductor_distributed with run_test (#125647)
CI features like retrying and disabling flaky tests won't be available otherwise

### Testing

https://github.com/pytorch/pytorch/actions/runs/8977431927/job/24659532123#step:15:1688 looks correct now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125647
Approved by: https://github.com/clee2000
2024-05-07 17:07:36 +00:00
22767e4791 [DCP] Always create requests for non-tensor objects (#125334)
Summary:
If objects only exist on certain non-coordinator ranks, we still need to save them; otherwise, we lose these objects. If they are duplicated, DCP will deduplicate them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125334
Approved by: https://github.com/wz337, https://github.com/LucasLLC
ghstack dependencies: #125333, #125501
2024-05-07 17:04:36 +00:00
9782439277 [Profiler] Do not emit a warning when using CPU profiler (#125654)
This fixes a logic regression introduced by https://github.com/pytorch/pytorch/pull/123247 where
```python
if self.use_device and self.use_device != _get_privateuse1_backend_name():
```
was replaced with
```python
        VALID_DEVICE_OPTIONS = ["cuda", "xpu", "privateuseone"]
        if self.use_device not in VALID_DEVICE_OPTIONS:
```

That triggers a warning every time code is invoked with `self.use_device` set to None

This change also skips all of these checks, which are useless if `use_device` is None to begin with
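A hedged sketch of the corrected guard (the names approximate the profiler code rather than quoting it):

```python
import warnings

_VALID_DEVICE_OPTIONS = ["cuda", "xpu", "privateuseone"]

def _check_use_device(use_device):
    # None means CPU-only profiling: skip all device-specific checks.
    if use_device is None:
        return
    if use_device not in _VALID_DEVICE_OPTIONS:
        warnings.warn(f"{use_device} is not a valid device option")
```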
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125654
Approved by: https://github.com/aaronenyeshi
2024-05-07 16:56:17 +00:00
7863e04615 Back out "Get cutlass_library import working under fbcode" (#125606)
Summary: Original commit changeset: de79f6bfe348

Differential Revision: D57002294

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125606
Approved by: https://github.com/chenyang78
2024-05-07 16:55:11 +00:00
71dc15742c [DSD] Improve the performance of distributed state_dict (#125501)
Summary:
1. Remove gc.collect(), which is not necessary.
2. Use lru_cache to cache _get_fqns

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125501
Approved by: https://github.com/wz337, https://github.com/LucasLLC
ghstack dependencies: #125333
2024-05-07 16:55:05 +00:00
0e57bbb6d7 Set timeout for C++ tests (#125517)
Looking at the unrelated Windows timeout failure on https://github.com/pytorch/pytorch/pull/125199, it looks like we don't have a timeout value set for C++ tests atm.  In this case, a C++ test on Windows timed out after 2+ hours.

```
2024-05-02T23:35:34.0639067Z Running cpp/c10_TypeList_test 1/1 ... [2024-05-02 23:35:34.059021]
2024-05-02T23:35:34.0641108Z Executing ['pytest', 'C:\\actions-runner\\_work\\pytorch\\pytorch\\build\\win_tmp\\build\\torch\\test\\c10_TypeList_test.exe', '-m', 'not serial', '-v', '-vv', '-rfEX', '-n', '2', '--junit-xml-reruns', 'test-reports\\python-pytest\\test\\run_test\\test\\run_test-c898ddeff8f33cbf.xml', '-x', '--reruns=2'] ... [2024-05-02 23:35:34.062137]
2024-05-03T02:45:33.7862004Z Process SpawnPoolWorker-2:
2024-05-03T02:45:33.7927201Z Traceback (most recent call last):
2024-05-03T02:45:33.7928032Z   File "C:\Jenkins\Miniconda3\lib\multiprocessing\process.py", line 315, in _bootstrap
2024-05-03T02:45:33.7928722Z     self.run()
2024-05-03T02:45:33.7929722Z   File "C:\Jenkins\Miniconda3\lib\multiprocessing\process.py", line 108, in run
2024-05-03T02:45:33.7931639Z     self._target(*self._args, **self._kwargs)
2024-05-03T02:45:33.7932435Z   File "C:\Jenkins\Miniconda3\lib\multiprocessing\pool.py", line 114, in worker
2024-05-03T02:45:33.7933338Z     task = get()
2024-05-03T02:45:33.7933946Z   File "C:\Jenkins\Miniconda3\lib\multiprocessing\queues.py", line 365, in get
2024-05-03T02:45:33.7935219Z     res = self._reader.recv_bytes()
2024-05-03T02:45:33.7935897Z   File "C:\Jenkins\Miniconda3\lib\multiprocessing\connection.py", line 221, in recv_bytes
2024-05-03T02:45:33.7936609Z     buf = self._recv_bytes(maxlength)
2024-05-03T02:45:33.7937302Z   File "C:\Jenkins\Miniconda3\lib\multiprocessing\connection.py", line 310, in _recv_bytes
2024-05-03T02:45:33.7938316Z     waitres = _winapi.WaitForMultipleObjects(
2024-05-03T02:45:33.7938766Z KeyboardInterrupt
```

Retrying was working, but it was already too late to finish the job.  I'm setting the same default `THRESHOLD * 3` timeout value here for C++ tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125517
Approved by: https://github.com/clee2000
2024-05-07 16:41:38 +00:00
1b396d69cb Revert "[CUDNN] Remove defunct cuDNN V8 API build flag (#120006)"
This reverts commit ee4cafa098ede2d9546016223cbc1a522ea3630a.

Reverted https://github.com/pytorch/pytorch/pull/120006 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing ROCm jobs in trunk ee4cafa098 ([comment](https://github.com/pytorch/pytorch/pull/120006#issuecomment-2098849813))
2024-05-07 16:28:04 +00:00
848fce35b5 [CI][ez] Don't retry when it says don't retry (#125643)
default arg for retry_shell is retries=1
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125643
Approved by: https://github.com/huydhn
2024-05-07 16:20:00 +00:00
0de9ce9bb3 [export] Fix serialization of empty torch artifact (#125542)
A previous PR added support for serializing/deserializing example inputs, but this fails when `example_inputs` is None.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125542
Approved by: https://github.com/pianpwk, https://github.com/BoyuanFeng, https://github.com/ydwu4
2024-05-07 15:54:45 +00:00
b37bef9b13 Use triton_key instead of triton.__version__ for hash (#125624)
Using `triton.__version__` is not correct, as the version is not always updated with each code change, so we should use the proper hash function provided by the Triton library.
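
A hedged sketch of the idea; the `triton_key` import location below is an assumption on my part, not something stated in this PR:

```python
try:
    from triton.compiler.compiler import triton_key  # assumed import location
except ImportError:  # older Triton builds may not expose it
    triton_key = None

import triton

def triton_cache_component() -> str:
    # triton.__version__ can stay the same across source changes; triton_key() hashes
    # the compiler sources, so it changes whenever the backend actually changes.
    return triton_key() if triton_key is not None else triton.__version__
```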

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125624
Approved by: https://github.com/eellison, https://github.com/masnesral, https://github.com/jansel
2024-05-07 15:43:50 +00:00
8573d9551a Fix to preserve tensor wrapper subclass dtype through multiprocessing serialization (#125615)
Fixes #125583

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125615
Approved by: https://github.com/albanD
2024-05-07 14:35:48 +00:00
b29d77b54f Separate arm64 and amd64 docker builds (#125617)
Fixes https://github.com/pytorch/pytorch/issues/125094

Please note: the Docker CUDA 12.4 failure is an existing issue, related to the Docker image not being available on GitLab:
```
docker.io/nvidia/cuda:12.4.0-cudnn8-devel-ubuntu22.04: docker.io/nvidia/cuda:12.4.0-cudnn8-devel-ubuntu22.04: not found
```
 https://github.com/pytorch/pytorch/actions/runs/8974959068/job/24648540236?pr=125617

Here is the reference issue: https://gitlab.com/nvidia/container-images/cuda/-/issues/225

Tracked on our side: https://github.com/pytorch/builder/issues/1811
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125617
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-05-07 11:50:54 +00:00
5dee46266a Fix & optimize open device registration test. (#125572)
1. Fix the incorrect tests about lazy init for the PrivateUse1 backend named foo
2. Refactor the tests and make them more flexible
3. Disable two tests temporarily:
     - test_open_device_faketensor
     - test_open_device_scalar_type_fallback
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125572
Approved by: https://github.com/albanD
2024-05-07 08:30:01 +00:00
f0c6d6100b Enable dynamo-traced optimizer peak memory tests (#124543)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124543
Approved by: https://github.com/yf225, https://github.com/janeyx99
2024-05-07 08:21:50 +00:00
5033d3ba6d Disable fb_memcache for MTIA (#125658)
Differential Revision: [D57035819](https://our.internmc.facebook.com/intern/diff/D57035819/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125658
Approved by: https://github.com/jamesjwu
2024-05-07 07:00:26 +00:00
e72936c27c [PT2D] Fix the circular import issue (#125618)
As title

Differential Revision: [D57011394](https://our.internmc.facebook.com/intern/diff/D57011394/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125618
Approved by: https://github.com/wz337
2024-05-07 05:10:18 +00:00
acafabaa29 Rename TorchDynamo -> Dynamo in the dynamo tutorial doc (#123431)
Less verbose, and it aligns with the Dynamo deep dive
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123431
Approved by: https://github.com/peterbell10
2024-05-07 05:07:00 +00:00
058e28108f [inductor][cpp] support int64 vertical vec reduction (fix #124821) (#125563)
Fix #124821

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125563
Approved by: https://github.com/desertfire
2024-05-07 03:56:22 +00:00
a60fa960e5 refactor: extract get_lr warning (#125545)
Extract the `_get_lr_called_within_step` check from the `get_lr()` of every LRScheduler.
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125545
Approved by: https://github.com/janeyx99
2024-05-07 03:15:58 +00:00
461ffaaaf3 [dynamo] support torchbind object input (#124978)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124978
Approved by: https://github.com/jansel
2024-05-07 03:02:00 +00:00
c165a8e71d Enable UFMT on test_decomp.py, test_expanded_weights.py and some files (#125117)
Part of: #123062

Ran lintrunner on:

- test/test_decomp.py
- test/test_deploy.py
- test/test_determination.py
- test/test_dlpack.py
- test/test_dynamic_shapes.py
- test/test_expanded_weights.py

Detail:

```bash
$ lintrunner -a --take UFMT --all-files
ok No lint issues.
Successfully applied all patches.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125117
Approved by: https://github.com/jansel
2024-05-07 02:36:40 +00:00
48b6c8dbc3 [Inductor] log fusion failure due to index mismatch (#124986)
The scheduler searches for fusion opportunities by looking for common memory accesses. Two memory accesses are considered common not only when the buffer names match; it also requires that
- the index formulas match
- the var_ranges match

In this PR, I want to log all the fusion failures due to mismatched index formulas or var_ranges. I also want to further categorize the failures. Right now I found the following failure categories:
- rand_seed: the index for the rand seed access is an integer, and different accesses use different integer offsets
- different numel: this happens for cat operations
- broadcast: e.g. kernel A writes a buffer which is broadcast and read by kernel B
- different loop orders: the major category we want inductor to be able to fuse
- different offset: happens when using a concatenated linear layer to project Q/K/V and then splitting the result. Each split will point to the same buffer with a different offset.
- unknown

My hope is to make sure that, for the models I tested, there is no fusion failure falling in the unknown category, so that all the failures are well understood and categorized. Right now this is true for BertForMaskedLM ( https://gist.github.com/shunting314/6dc2c903629d342fa63ba731a171adc2  ), DistillGPT2 ( https://gist.github.com/shunting314/145176f2e850103c7fad4ad72f0e200e ) and llm.c ( https://gist.github.com/shunting314/cfc64a326312a889ba55f79bd47b2082 )

For BertForMaskedLM, we found 82 instances of fusion failures, and the majority of them are due to different loop orders! Studying the log a bit more can help us figure out where all these loop-order mismatches come from in real models.
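
To make the "different loop orders" category concrete, here is a purely hypothetical illustration (the dict layout below is not Inductor's actual data structure): two reads of the same buffer whose index formulas and var_ranges differ only by loop order, so a plain formula comparison reports no common access and the fusion is rejected.

```python
import sympy

i0, i1 = sympy.symbols("i0 i1")

# Both kernels touch the same 128x64 buffer, but iterate it in different loop orders.
access_a = {"buffer": "buf0", "index": 64 * i0 + i1, "var_ranges": {i0: 128, i1: 64}}
access_b = {"buffer": "buf0", "index": i0 + 64 * i1, "var_ranges": {i0: 64, i1: 128}}

print(access_a["buffer"] == access_b["buffer"])          # True: same buffer name
print(access_a["index"] == access_b["index"])            # False -> "different loop orders"
print(access_a["var_ranges"] == access_b["var_ranges"])  # False
```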

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124986
Approved by: https://github.com/eellison, https://github.com/jansel
2024-05-07 02:29:00 +00:00
f35fe4eaf1 add uuid in cudaDeviceProperties (#125083)
Replaces #99967.

Fixes #99903.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125083
Approved by: https://github.com/pruthvistony, https://github.com/albanD, https://github.com/eqy
2024-05-07 01:26:01 +00:00
4332fc4095 [export] Allow constant attr mutation (#125424)
Test Plan: CI

Differential Revision: D56893728

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125424
Approved by: https://github.com/pianpwk
2024-05-07 00:34:57 +00:00
c0c2f6156a Updated docs to add the error case for torch.multinomial Issue#125388 (#125495)
Summary: Updated docs to add the error condition for torch.multinomial

Test Plan: No change in code

Reviewers: @drisspg

Subscribers: @drisspg

Tasks:

Tags:

Fixes #125388

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125495
Approved by: https://github.com/drisspg
2024-05-07 00:26:27 +00:00
3407899ba1 DTensor Fused ADAM (#125369)
Fixes https://github.com/pytorch/pytorch/issues/124633 https://github.com/pytorch/ao/issues/205

```
(pt) [marksaroufim@devvm17057.vll0 ~/pytorch (dfusedadam)]$ pytest test/distributed/_tensor/test_optimizers.py -s -k adamw_1d_sharding
===================================================================================== test session starts ======================================================================================
platform linux -- Python 3.9.19, pytest-7.4.0, pluggy-1.5.0
rootdir: /home/marksaroufim/pytorch
configfile: pytest.ini
plugins: hypothesis-6.100.2
collected 10 items / 9 deselected / 1 selected
Running 1 items in this shard

test/distributed/_tensor/test_optimizers.py .

=============================================================================== 1 passed, 9 deselected in 5.95s ================================================================================
(pt) [marksaroufim@devvm17057.vll0 ~/pytorch (dfusedadam)]$ pytest test/distributed/_tensor/test_optimizers.py -s -k adam_1d_sharding
===================================================================================== test session starts ======================================================================================
platform linux -- Python 3.9.19, pytest-7.4.0, pluggy-1.5.0
rootdir: /home/marksaroufim/pytorch
configfile: pytest.ini
plugins: hypothesis-6.100.2
collected 10 items / 7 deselected / 3 selected
Running 3 items in this shard

test/distributed/_tensor/test_optimizers.py ...

=============================================================================== 3 passed, 7 deselected in 10.79s ===============================================================================
(pt) [marksaroufim@devvm17057.vll0 ~/pytorch (dfusedadam)]$
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125369
Approved by: https://github.com/wanchaol
2024-05-07 00:08:09 +00:00
65fc3c31bc [BE] Delete unused AT_FORALL_SCALAR_TYPES_AND[456] (#125607)
Check that they are not used by running the following
```
% grep -h "AT_FORALL_SCALAR_TYPES_AND" . -R|grep -v #define|cut -d\( -f1|sort|uniq
              AT_FORALL_SCALAR_TYPES_AND3
        AT_FORALL_SCALAR_TYPES_AND3
      AT_FORALL_SCALAR_TYPES_AND
      AT_FORALL_SCALAR_TYPES_AND2
      AT_FORALL_SCALAR_TYPES_AND3
      AT_FORALL_SCALAR_TYPES_AND7
    AT_FORALL_SCALAR_TYPES_AND2
    AT_FORALL_SCALAR_TYPES_AND3
    AT_FORALL_SCALAR_TYPES_AND7
  AT_FORALL_SCALAR_TYPES_AND2
  AT_FORALL_SCALAR_TYPES_AND3
  AT_FORALL_SCALAR_TYPES_AND7
// AT_FORALL_SCALAR_TYPES / AT_FORALL_SCALAR_TYPES_AND macros below, which are
AT_FORALL_SCALAR_TYPES_AND
AT_FORALL_SCALAR_TYPES_AND2
AT_FORALL_SCALAR_TYPES_AND3
AT_FORALL_SCALAR_TYPES_AND7
using at::Half; // for AT_FORALL_SCALAR_TYPES_AND3
```
or by checking online using https://github.com/search?type=code&q=AT_FORALL_SCALAR_TYPES_AND4+repo%3Apytorch%2Fpytorch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125607
Approved by: https://github.com/albanD
2024-05-07 00:01:35 +00:00
3411d54811 fix loading optimizer options from archive (#125215)
This PR makes libtorch behave the same as PyTorch when loading optimizer state from an archive. With PyTorch, options of parameter groups are loaded from the archive; this is currently missing in libtorch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125215
Approved by: https://github.com/janeyx99
2024-05-06 23:58:40 +00:00
eqy
ee4cafa098 [CUDNN] Remove defunct cuDNN V8 API build flag (#120006)
The flag basically does nothing following #95722

Let's see if the quantization tests break

CC @malfet @atalmanagement

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120006
Approved by: https://github.com/malfet
2024-05-06 23:13:58 +00:00
b98c689261 Better repro command: include test class + fix paths for py3.8 (#125498)
Fixes #117850

This PR:
* Adds the class name in the repro command
* Fixes the path to the test file for python 3.8 jobs (apparently `inspect.getfile(class_type)` returns a relative path in this older python version)

Before (in python 3.8):
```sh
PYTORCH_TEST_WITH_DYNAMO=1 python test_autograd.py -k test_foo
```

After:
```sh
PYTORCH_TEST_WITH_DYNAMO=1 python test/test_autograd.py -k TestAutograd.test_foo
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125498
Approved by: https://github.com/huydhn, https://github.com/janeyx99
2024-05-06 22:19:12 +00:00
22bcfc25ef Initial implementation of Inductor FX Graph Remote Cache (#124669)
This diff implements a remote caching strategy (Memcache for internal use and Redis for external use) for caching the mapping from an Inductor FX graph to the Inductor-generated wrapper file.

It uses the same idea as the autotuning result cache that is currently live.

This will land turned off; before turning it on by default, I will do more testing, including looking at the dynamic-shape guards added by Inductor.
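
As a generic sketch of the idea only (these class and key names are hypothetical, not Inductor's actual cache implementation), the remote cache maps a hash of the FX graph plus relevant config to the generated wrapper source:

```python
import hashlib
import redis  # external Redis backend, as described above

class RemoteGraphCache:
    """Maps a hash of the FX graph (plus config) to the generated wrapper source."""

    def __init__(self, host: str = "localhost", port: int = 6379):
        self._client = redis.Redis(host=host, port=port)

    @staticmethod
    def key(graph_repr: str, config_repr: str) -> str:
        return hashlib.sha256((graph_repr + config_repr).encode()).hexdigest()

    def get(self, key: str):
        return self._client.get(f"inductor:fxgraph:{key}")  # None on a cache miss

    def put(self, key: str, wrapper_source: str) -> None:
        self._client.set(f"inductor:fxgraph:{key}", wrapper_source.encode())
```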

Differential Revision: [D56441624](https://our.internmc.facebook.com/intern/diff/D56441624/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124669
Approved by: https://github.com/jansel, https://github.com/eellison
2024-05-06 22:10:27 +00:00
05bd7fe3eb Nested Tensor + AOTI test (#125513)
Since we expect AOTI to be important for serving NJT in the future, I'm adding a test demonstrating that AOTI currently works with NJT when NJT is entirely in the graph (no NJTs going in or out), both to document current support and to prevent regressions.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125513
Approved by: https://github.com/angelayi, https://github.com/desertfire
2024-05-06 22:07:22 +00:00
1b3fd83ab2 [TD] Enable TD on AVX related configs (#125482)
On test configs `nogpu_AVX512` and `nogpu_NO_AVX2`, which are the next longest jobs on trunk after windows
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125482
Approved by: https://github.com/huydhn
2024-05-06 22:02:16 +00:00
8c74162074 Reduce the number of layers for mixtral moe model to adapt CI memory limitation (#125608)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125608
Approved by: https://github.com/Chillee, https://github.com/huydhn
2024-05-06 21:52:25 +00:00
7ddf57e9f5 xfail codegen dynamic if the test is xfailed (#125573)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125573
Approved by: https://github.com/peterbell10
2024-05-06 20:55:33 +00:00
373a00df9a [dynamo] better file open method in funcname_cache (#125435)
Fix https://github.com/pytorch/pytorch/issues/124960?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125435
Approved by: https://github.com/ezyang
2024-05-06 20:55:15 +00:00
cbb3791891 [pipelining] Add tests for tracing frontend (#125449)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125449
Approved by: https://github.com/wconstab
ghstack dependencies: #125273, #125448
2024-05-06 20:44:56 +00:00
bdaa7bbd7d [dynamo] fix potentially missing _torchdynamo_inline from ScriptFunction (#125447)
Fix https://github.com/pytorch/pytorch/issues/119747

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125447
Approved by: https://github.com/jansel
2024-05-06 20:36:56 +00:00
ad9a27f3e5 Move autocast op list to autocast_mode.h to make sure other backends can reuse it. (#125114)
This PR refactors the op list added in #124051 so that other backends can reuse it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125114
Approved by: https://github.com/albanD
2024-05-06 20:31:15 +00:00
2a42c40791 Revert "Compute bounds for the variables created during codegen (#123100)"
This reverts commit bb668c6468dd4adf7737a069e7af4c3f612cfc81.

Reverted https://github.com/pytorch/pytorch/pull/123100 on behalf of https://github.com/huydhn due to Sorry for reverting you change but it is failing inductor tests bb668c6468 ([comment](https://github.com/pytorch/pytorch/pull/123100#issuecomment-2096837821))
2024-05-06 20:23:39 +00:00
9cd4bcb2c4 [FSDP] mark pre_backward_hook unserializable (#125464)
Saw a warning like this:

```
/opt/conda/lib/python3.10/site-packages/torch/utils/hooks.py:86: UserWarning: backward hook functools.partial(<function _pre_backward_hook at 0x7f9a3940fac0>, FullyShardedDataParallel(

....

), <torch.distributed.fsdp.flat_param.FlatParamHandle object at 0x7f25202a9720>) on tensor will not be serialized.  If this is expected, you can decorate the function with @torch.utils.hooks.unserializable_hook to suppress this warning
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125464
Approved by: https://github.com/ezyang
2024-05-06 20:20:31 +00:00
7d10b06e1a Allow building for sm90a (#125523)
# Summary
Fixes: #125413
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125523
Approved by: https://github.com/Skylion007
2024-05-06 20:03:12 +00:00
ee0c47349c Revert "Upgrade submodule oneDNN to v3.4 (#122472)"
This reverts commit dbcf123105a3f11d02f04067ca0cb377ed09e88c.

Reverted https://github.com/pytorch/pytorch/pull/122472 on behalf of https://github.com/atalman due to broke aarch64 builds and tests ([comment](https://github.com/pytorch/pytorch/pull/122472#issuecomment-2096750000))
2024-05-06 19:28:20 +00:00
af144139df Remove some pre-c++17 cruft (#125590)
Summary: C++20 has [eliminated](https://en.cppreference.com/w/cpp/types/result_of) `result_of` in favour of `invoke_result`. It's mysterious that this code even still works, but, nevertheless, I'm fixing it.

Test Plan: Sandcastle

Differential Revision: D56987418

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125590
Approved by: https://github.com/Skylion007
2024-05-06 19:19:28 +00:00
daf1eb44bc try to fix the warning in distribute_tensor (#125476)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125476
Approved by: https://github.com/albanD, https://github.com/awgu
ghstack dependencies: #125475
2024-05-06 18:59:47 +00:00
7ffa5558ee Revert "[FX] Update type hints in torch.fx._compatibility.py (#125469)"
This reverts commit 235b4d6ec22ddac35b2e47b7e871ef10538d4aee.

Reverted https://github.com/pytorch/pytorch/pull/125469 on behalf of https://github.com/izaitsevfb due to breaks pyre in dependent projects (internal: see D56986361) ([comment](https://github.com/pytorch/pytorch/pull/125469#issuecomment-2096665396))
2024-05-06 18:36:43 +00:00
bb668c6468 Compute bounds for the variables created during codegen (#123100)
Before we would just bail out on these bounds for all variables that did
not come from the FX graph. Now we propagate the bounds whenever we have
a rule for that op.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123100
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-05-06 18:12:15 +00:00
3827810453 [export] suggest constant dim values in dynamic shapes fixes (#125458)
[https://www.internalfb.com/diff/D54924742](https://github.com/pytorch/pytorch/pull/121860) allowed specifying integer values for static dims in dynamic shapes. This change updates the suggested fixes to propose the actual value instead of the current "None".
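
For context, a hedged sketch of the kind of spec this refers to (the module, shapes, and dim names are made up for illustration): with the linked change, a dim that never varies can be pinned to its concrete value instead of None, and the suggested fixes now print that value.

```python
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

x = torch.randn(8, 1024)
batch = Dim("batch", min=2, max=64)

# Previously one would write {1: None} for the static dim; the suggested fix
# now proposes the concrete value instead:
ep = export(M(), (x,), dynamic_shapes={"x": {0: batch, 1: 1024}})
```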

Test Plan: existing export tests cover this

Differential Revision: D56921142

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125458
Approved by: https://github.com/avikchaudhuri
2024-05-06 17:44:19 +00:00
6ebec38453 Add ciflow/linux-aarch64 to auto labeler on mkldnn PR's (#125599)
Trigger Aarch64 CI on oneDNN changes, to detect issues like this: https://github.com/pytorch/pytorch/issues/125548

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125599
Approved by: https://github.com/malfet, https://github.com/snadampal
2024-05-06 17:26:17 +00:00
e30e6d321f [MPS][BE] Introduce MetalShaderLibary class (#125550)
That factors out a repeated pattern of creating a library/fetching a func from source

Typical use case:
```cpp
static MetalShaderLibrary lib(SHADER_SOURCE);
...

id<MTLComputePipelineState> cplState = lib.getPipelineStateForFunc("kernel_name")
```
- Make it possible to use with templated sources
- Add `scalarToMetalTypeString(const Tensor&)` variant to avoid repeated `scalarToMetalTypeString(t.scalar_type())` calls in the code

I.e. it makes no functional changes, but reduces MPS codebase size by 365 lines
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125550
Approved by: https://github.com/kulinseth
2024-05-06 17:15:47 +00:00
7bf6ed01ac [inductor] Remove symbol exports in C shim for Windows (#125472)
Summary:
This shim exports symbols on Windows, which can lead to symbol clashes at link time in the following scenario:
1. A DLL imports libtorch
2. A binary imports libtorch, and also depends on the DLL in (1)

Under that scenario, the symbols exported from `shim.h` can clash at link time.

Given that AOTInductor only works for PyTorch2, and PyTorch2 doesn't currently work for Windows, we can work around this problem by simply removing the symbols export on Windows. In the long term, this will need to be figured out when Windows support is added & tested for PyTorch2.

Differential Revision: D56936696

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125472
Approved by: https://github.com/desertfire
2024-05-06 14:43:11 +00:00
b6bcd09173 Get rid of tabular and sizes, beef up verbosity of output graph (#125507)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125507
Approved by: https://github.com/Chillee, https://github.com/jansel
ghstack dependencies: #125505
2024-05-06 13:41:58 +00:00
71bec453b1 [xla hash update] update the pinned xla hash (#124599)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124599
Approved by: https://github.com/pytorchbot
2024-05-06 12:19:42 +00:00
60efb1060a Make codegen dynamic test faster (#125569)
Let's early exit + avoid an unnecessary split.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125569
Approved by: https://github.com/kadeng
2024-05-06 12:09:19 +00:00
24b64fc482 [HOP][inductor] Support pytrees as associative_scan input (#122137)
This allows `associative_scan` to take an arbitrary pytree of tensors,
which is flattened to its leaves before calling the `associative_scan`
higher-order operator.

I also add support in inductor to generate code for scanning over sequences
of tensors.
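
A hedged usage sketch; the import path, the argument order, and the need to run under torch.compile are assumptions rather than facts from this PR:

```python
import torch
from torch._higher_order_ops.associative_scan import associative_scan  # assumed path

def combine(a, b):
    # associative combine applied leaf-wise to the (sum, product) pytree
    return (a[0] + b[0], a[1] * b[1])

@torch.compile
def scan_pair(xs):
    return associative_scan(combine, xs, 0)  # scan along dim 0

xs = (torch.randn(8, 4, device="cuda"), torch.rand(8, 4, device="cuda"))
running_sum, running_prod = scan_pair(xs)
```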

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122137
Approved by: https://github.com/lezcano, https://github.com/Chillee
ghstack dependencies: #119430
2024-05-06 11:29:28 +00:00
68a1f787c8 [inductor][cpp] move some common cpp utils to cpp_utils.py (#125152)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125152
Approved by: https://github.com/desertfire, https://github.com/jansel
2024-05-06 04:30:30 +00:00
fc183f0bde [Inductor] Properly package target info for triton.compile (#125553)
Triton updated the interface for `triton.compile` 5162346487

The `target` argument to compile needs to be wrapped in a `GPUTarget` object. Without proper wrapping, we hit an assert in `compile`. If that assert is removed, Triton attempts to read device info from Torch while inside a torch thread, which hits an "in bad fork" assert. This change is required for compatibility with the latest commits in Triton. The implementation is backwards compatible, so existing versions of Triton that work now continue to work.

Re-submitting this after https://github.com/pytorch/pytorch/pull/125241 was reverted due to an unrelated CI issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125553
Approved by: https://github.com/huydhn
2024-05-06 01:36:36 +00:00
1dd42e42c4 [BE]: Try TCH autofixes on torch/ (#125536)
Tries TCH autofixes and see what breaks

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125536
Approved by: https://github.com/ezyang
2024-05-05 23:13:59 +00:00
ccbac091d2 Revert "Add write_record_metadata to PyTorchFileWriter (#125184)"
This reverts commit dd92637f445d2787f83829079276f71b1ad1fc7c.

Reverted https://github.com/pytorch/pytorch/pull/125184 on behalf of https://github.com/izaitsevfb due to breaks internal builds, see D56962076 ([comment](https://github.com/pytorch/pytorch/pull/125184#issuecomment-2094976897))
2024-05-05 22:40:00 +00:00
1b1d593c8c Don't call item() into torch.scalar_tensor uselessly (#125373)
Fixes https://github.com/pytorch/pytorch/issues/125368

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125373
Approved by: https://github.com/Skylion007
2024-05-05 22:38:16 +00:00
ecd62746e3 Also pull size/stride info from example_value (#125505)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125505
Approved by: https://github.com/jansel
2024-05-05 22:27:46 +00:00
d1a3271a55 [ez]2->3 shards for asan slow (#125499)
One of the shards has been timing out recently
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125499
Approved by: https://github.com/huydhn
2024-05-05 21:02:44 +00:00
94c4855e75 [Inductor max autotune] Make autotune_select_algorithm more robust (#124928)
This diff makes sure that a custom exception is thrown when no valid
choices remain during autotuning. This allows gracefully falling back
to a default choice, even if that default choice has not been passed to
autotune_select_algorithm.

Additionally, this diff handles RuntimeErrors during autotuning gracefully: the corresponding choice is ignored, and a problematic choice does not cause compilation of the entire model to fail (an error is logged, though).
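
The control flow described above amounts to something like the following generic sketch (the exception and function names here are hypothetical, not Inductor's actual identifiers):

```python
class NoValidChoicesError(RuntimeError):
    """Raised when every autotuning candidate failed or was filtered out."""

def select_best_choice(choices, benchmark, fallback=None):
    timings = {}
    for choice in choices:
        try:
            timings[choice] = benchmark(choice)
        except RuntimeError as exc:
            # A broken candidate is skipped (and logged) instead of failing compilation.
            print(f"skipping {choice!r}: {exc}")
    if not timings:
        if fallback is not None:
            return fallback
        raise NoValidChoicesError("no valid autotune choices remain")
    return min(timings, key=timings.get)
```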

Test Plan:
CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124928
Approved by: https://github.com/int3
ghstack dependencies: #125406
2024-05-05 20:10:10 +00:00
58d8388ed3 Remove Inductor IRs for legacy functional collectives (#124992)
This PR completely removes the Inductor IR for legacy functional collectives:
- Removed the `CollectiveKernel` hierarchy and `Wait`, as well as the corresponding lowerings. These IRs are target (i.e. Python) specific and don't model node dependencies properly (e.g. they rely on `never_reuse_buffers` for correct behavior). They've been superseded by `ir._CollectiveKernel`.
- Removed `InPlaceHint` and the scheduler logic for handling it. `InPlaceHint` is a codegen-time buffer reuse mechanism controlled by the IR's codegen. It's a bit hacky and overlaps with the default buffer reuse mechanism. Removing it since it is only used by legacy functional collectives.
- Removed `OutputBuffer` and `MultiOutputNoSizeAssert` which are designed for and only used by legacy functional collectives.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124992
Approved by: https://github.com/Chillee, https://github.com/wanchaol
2024-05-05 19:49:58 +00:00
235b4d6ec2 [FX] Update type hints in torch.fx._compatibility.py (#125469)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125469
Approved by: https://github.com/Skylion007
ghstack dependencies: #125468
2024-05-05 19:30:22 +00:00
30c9fd96f6 [FX] Add missing forbidden mutation methods in immutable collections (#125468)
Add `list.sort`, `list.reverse`, `dict.__ior__`, and `dict.setdefault`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125468
Approved by: https://github.com/Skylion007
2024-05-05 19:30:22 +00:00
7c59720ba7 [comm] Ensure ncclComm is not aborted before checking exception (#124466)
Differential Revision: D56347560

More details in this pytorch issue: https://github.com/pytorch/pytorch/issues/124468

It seems there is a race in the ProcessGroupNCCL shutdown logic. The code is quite simple:
```
for i in range(100):
    dist.all_to_all_single(tensor_out, tensor_in)
dist.destroy_process_group()
```

What can happen is this:

1. dist.destroy_process_group() calls into shutdown() and then calls into abort: b2f6cfd9c0/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (L1095)
2. It'll call ncclCommAbort (not graceful afaict), and also set the ncclAsyncErr_ = ncclSystemError; b2f6cfd9c0/torch/csrc/distributed/c10d/NCCLUtils.hpp (L388).
3. The ncclWatchdog thread may not have woken up while all this shutdown process happens, and in shutdown we're not waiting for the watchdog thread.
4. The ProcessGroupNCCL dtor is called. It'll wait for the watchdog thread to join.
5. The watchdog will check the work's isCompleted(), which then calls checkAndSetException(). Because ncclAsyncErr_ was set to ncclSystemError, it'll error out and make you think it's an NCCL error.

So we can mitigate this issue by checking if the comm was aborted during work.isCompleted/isStarted

Some more longer term discussion in the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124466
Approved by: https://github.com/shuqiangzhang, https://github.com/yoyoyocmu, https://github.com/kwen2501
2024-05-05 18:55:48 +00:00
99e4909677 Remove assertion for cat target_func (#125540)
Summary:
We remove the assertion that target_func is cat.
The reason is that we have multiple flavors of concat, such as
cat/cat.default/cat_slice/cat_slice_cat/...
The assertion here was causing repeated false positives.

Test Plan: Removing assertion code only.

Differential Revision: D56971387

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125540
Approved by: https://github.com/hl475
2024-05-05 18:17:47 +00:00
650a248d3e Rename is_unspecialized to pass_arg_as_tensor, add comment (#125496)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125496
Approved by: https://github.com/lezcano
ghstack dependencies: #125395, #125419, #125483, #125494
2024-05-05 16:57:50 +00:00
12da7ee58f Don't use wrap_fx_proxy_cls for wrap_symint (#125494)
We use very little of the code in wrap_fx_proxy_cls, so dupe it out.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125494
Approved by: https://github.com/lezcano
ghstack dependencies: #125395, #125419, #125483
2024-05-05 16:57:50 +00:00
617e473da5 Split wrap_symint out of wrap_unspecialized_primitive (#125483)
While there are some similarities, they are also quite different (one
handles NumPy numbers while the other handles ints). I am also going to
add a wrap_symfloat soon, which will behave even more differently.
So split these out for clarity.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125483
Approved by: https://github.com/lezcano
ghstack dependencies: #125395, #125419
2024-05-05 16:57:50 +00:00
10f673541e [Inductor cutlass backend] Enabled nonzero workspace and Cutlass StreamK (#125406)
Enable nonzero workspace and Cutlass StreamK for Inductor Cutlass GEMM ops.

This is a simpler rewrite of my original version of #119005 using @peterbell10 's workspace allocation mechanism from #117992

Test Plan:
 - Additional unit test in test_cutlass_backend.py which specifically tests StreamK GEMM with workspace requirement
 - CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125406
Approved by: https://github.com/jansel
2024-05-05 15:28:45 +00:00
f70bd71a48 [FSDP2] Computed grad divide factors at runtime (#125484)
**Context**
We are interested in supporting the case where HSDP reduce-scatters but does not all-reduce in a microbatch backward. This saves communication while still saving memory. Only on the last microbatch do we need to both reduce-scatter and all-reduce. This is not implemented yet and will hopefully come in a future PR.

There is one notable part of doing this. On the last microbatch, we need to perform an accumulation step after reduce-scatter and before all-reduce. If not, then the preceding microbatch's gradients will not be contributed across the replica group. (In other words, we cannot simply accumulate _after_ all-reduce.)

Consider 32 GPUs with 4-way replication and 8-way sharding and 2 microbatches, and focus on global rank 0.
- After the first microbatch, rank 0 will have its shard of $\frac{1}{8} \sum_{i \in S(0)} g_i^{(1)}$, where we define $S(0) = \{0, 1, \dots, 7\}$ to be the ranks in its shard group and we define the $(1)$ superscript to denote the first microbatch.
- Upon the second microbatch, rank 0 after its reduce-scatter will additionally have its shard of $\frac{1}{8} \sum_{i \in S(0)} g_i^{(2)}$. If we only all-reduce this, then this second microbatch's gradients become $\frac{1}{32} \sum_{i=0, 1, \dots, 31} g_i^{(2)}$, so in total, rank 0 has $\frac{1}{8} \sum_{i \in S(0)} g_i^{(1)} + \frac{1}{32} \sum_{i=0, 1, \dots, 31} g_i^{(2)}$, which is wrong.
- Importantly, we must accumulate $\frac{1}{8} \sum_{i \in S(0)} g_i^{(1)}  + \frac{1}{8} \sum_{i \in S(0)} g_i^{(2)} = \frac{1}{8}\sum_{i \in S(0)} (g_i^{(1)} + g_i^{(2)})$ first before all-reducing to get $\frac{1}{32} \sum_{i=0, 1, \dots, 31} (g_i^{(1)} + g_i^{(2)})$.

Now, note how under this approach, we want a factor of $\frac{1}{8}$ only (i.e. reciprocal of the shard group size), not $\frac{1}{32}$, for the first microbatch's gradients.
- For bf16/fp32, since we use `ReduceOp.AVG` and we only reduce-scatter on the first microbatch, we correctly have a factor of $\frac{1}{8}$ on the first microbatch.
- For fp16, since we precompute the gradient divide factors at init time assuming always reducing over both shard and replica groups, we incorrectly have a factor of $\frac{1}{32}$ on the first microbatch, deviating from the bf16/fp32 case.

We can address this issue by matching the bf16/fp32 vs. fp16 semantics by computing the divide factors at runtime based on which process groups were passed into the reduction function (`foreach_reduce`).

**Additional Notes**
How to implement the HSDP reduce-scatter but no all-reduce is not entirely clear yet. (What is the cleanest way to do this?) We need to store the partial reduce-scatter output and check for it upon the next backward. We should also be sure to error if the set of parameters receiving gradients changes, in which case we cannot support this easily. Anyway, we will implement this in a follow-up.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125484
Approved by: https://github.com/wanchaol
ghstack dependencies: #125431, #125479
2024-05-05 14:11:33 +00:00
dba689bbfd Revert "[FSDP2] Computed grad divide factors at runtime (#125484)"
This reverts commit 9aa7699185e4ec39077e3046dfd63244dffa9ddb.

Reverted https://github.com/pytorch/pytorch/pull/125484 on behalf of https://github.com/huydhn due to Sorry for reverting your change, I am trying to restore ROCm distributed failures in trunk 9aa7699185 ([comment](https://github.com/pytorch/pytorch/pull/125484#issuecomment-2094646996))
2024-05-05 06:12:01 +00:00
cyy
8a0529e986 [2/2] Remove Caffe2 db and distributed code (#125533)
This PR follows #125092 to remove caffe2/db/* and caffe2/distributed/* .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125533
Approved by: https://github.com/kit1980
2024-05-05 05:10:17 +00:00
7f0c5eb023 Added some more flex attention tests (#125487)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125487
Approved by: https://github.com/yanboliang
2024-05-04 23:42:40 +00:00
6d30803d64 Revert "[Inductor] Properly package target info for triton.compile (#125241)"
This reverts commit 8a1af95b0979d85c4fe32a75e797323ad81f298d.

Reverted https://github.com/pytorch/pytorch/pull/125241 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing inductor tests on ROCm 8a1af95b09 ([comment](https://github.com/pytorch/pytorch/pull/125241#issuecomment-2094472886))
2024-05-04 22:28:16 +00:00
084d818e71 Revert "try to fix the warning in distribute_tensor (#125476)"
This reverts commit 2b41e1d6fc05428008875e3cfe8be17184e57491.

Reverted https://github.com/pytorch/pytorch/pull/125476 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but there are real failures on the PR that sneak in during the log classifier outage ([comment](https://github.com/pytorch/pytorch/pull/125476#issuecomment-2094468740))
2024-05-04 22:25:32 +00:00
a32ad828dc Revert "Don't call item() into torch.scalar_tensor uselessly (#125373)"
This reverts commit 2b4fe183db00db88749f8524f3b4a69ca80da0ec.

Reverted https://github.com/pytorch/pytorch/pull/125373 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but there are real failures on the PR that sneak in during the log classifier outage ([comment](https://github.com/pytorch/pytorch/pull/125373#issuecomment-2094464241))
2024-05-04 22:22:36 +00:00
f04c8471a4 [dynamo][prepare for nn module guards] Guard nn modules for a few benchmarks (#125324)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125324
Approved by: https://github.com/jansel
ghstack dependencies: #125439, #125421, #124522
2024-05-04 22:08:56 +00:00
5ba777f46e [guards][cpp-guards] Optimize NN module getattr guards (#124522)
Reduces the guard overhead of the MobileBert model with nn module guards from 92000 units to 20000 units.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124522
Approved by: https://github.com/jansel
ghstack dependencies: #125439, #125421
2024-05-04 22:08:56 +00:00
76a26a885d Add module tracker (#125352)
This does a few things that were originally a few PRs, but I am on a new machine and don't have ghstack.
If it is too problematic to review, I can re-split, just let me know.
This does:
- Clean up context manager use in test_flop_counter
- Remove the need for the mod argument in FlopCounterMode, warning about it
- Re-implement a module tracker from scratch using a global forward-module hook and multi_grad_hook (we cannot use the global backward-module hook because it doesn't look for nested Tensors and it is custom-Function based instead of multi_grad_hook based); a minimal usage sketch of the new tracker follows this list
- Update FlopCounterMode to use the new ModuleTracker. All of the existing test suite passes as-is (the only changes there are new tests and the refactoring mentioned above)
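
A minimal usage sketch of the new tracker; the class location (`torch.utils.module_tracker.ModuleTracker`) and the `parents` attribute are assumptions on my part:

```python
import torch
from torch.utils.module_tracker import ModuleTracker  # assumed location

class Net(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(8, 8)

    def forward(self, x):
        print(tracker.parents)  # which modules are currently "active" during this forward
        return self.fc(x)

tracker = ModuleTracker()
with tracker:
    Net()(torch.randn(2, 8))
```
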
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125352
Approved by: https://github.com/mikaylagawarecki
2024-05-04 18:33:35 +00:00
1a20b4ef3f [dynamo] handle inactive nullcontexts across graph breaks (#125518)
whoops

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125518
Approved by: https://github.com/yanboliang
2024-05-04 12:52:20 +00:00
6f70d22277 Extend torch.utils._sympy.symbol for more Inductor symbols (#125419)
I'm still missing a few, cdzq at least

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125419
Approved by: https://github.com/lezcano
ghstack dependencies: #125395
2024-05-04 09:05:00 +00:00
5cd7c75bd9 [pipelining] Add tracing frontend (#125448)
This PR allows the user to transform a model into a pipeline representation with split stages, according to a split spec.
```
def pipeline(
    module: torch.nn.Module,
    num_chunks: int,
    example_args: Tuple[Any, ...],
    example_kwargs: Optional[Dict[str, Any]] = None,
    split_spec: Optional[Dict[str, SplitPoint]] = None,
    split_policy: Optional[Callable[[fx.GraphModule], fx.GraphModule]] = None,
) -> Pipe:
```
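
A hedged usage sketch; the import path and the `SplitPoint` spelling are assumptions based on the surrounding pipelining work rather than guarantees from this PR:

```python
import torch
from torch.distributed.pipelining import SplitPoint, pipeline  # assumed import path

class Toy(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = torch.nn.Sequential(*[torch.nn.Linear(64, 64) for _ in range(8)])

    def forward(self, x):
        return self.layers(x)

x = torch.randn(32, 64)
pipe = pipeline(
    Toy(),
    num_chunks=4,                                    # number of microbatches
    example_args=(x,),
    split_spec={"layers.4": SplitPoint.BEGINNING},   # cut the model before layers.4
)
print(pipe)  # a Pipe holding the split stage submodules
```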

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125448
Approved by: https://github.com/H-Huang
ghstack dependencies: #125273
2024-05-04 09:00:25 +00:00
2b4fe183db Don't call item() into torch.scalar_tensor uselessly (#125373)
Fixes https://github.com/pytorch/pytorch/issues/125368

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125373
Approved by: https://github.com/Skylion007
2024-05-04 08:07:13 +00:00
5ef50d75f8 Don't short circuit if shape is same (#125188)
This is more unbacked-SymInt friendly.  If this does not work, my backup
plan is to short-circuit only if it is statically known equal.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125188
Approved by: https://github.com/albanD
2024-05-04 07:11:08 +00:00
cyy
83845a7c78 [1/2] Remove caffe2 db and distributed from build system (#125092)
This PR tries to decompose https://github.com/pytorch/pytorch/pull/122527 into a smaller one. Caffe2 db, distributed and some binaries have been removed.
Of note, this was inspired by and co-developed with @r-barnes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125092
Approved by: https://github.com/malfet
2024-05-04 06:48:46 +00:00
2b41e1d6fc try to fix the warning in distribute_tensor (#125476)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125476
Approved by: https://github.com/albanD, https://github.com/awgu
ghstack dependencies: #125475
2024-05-04 05:25:13 +00:00
b62e89c1b8 [dynamo] Do not turn on record relay with TORCH_COMPILE_DEBUG (#125488)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125488
Approved by: https://github.com/yanboliang, https://github.com/mlazos
2024-05-04 05:10:31 +00:00
ff061baa94 [comm_mode] adding some initial c10d ops to CommDebugMode (#125475)
looks like we can make it work :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125475
Approved by: https://github.com/awgu
2024-05-04 04:20:46 +00:00
d4727fd4eb [TD][ez] Better check for is pr or not (#125485)
You can trigger ciflow tags on main-branch commits, so we should be more conservative when checking whether a workflow is for a PR or on the main branch.

get_pr_number checks for the PR number based on the PR_NUMBER env var or a tag of the form `ciflow/workflow/pr number`.

If we fail to find something like this, then assume it is on the main branch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125485
Approved by: https://github.com/huydhn
2024-05-04 03:08:44 +00:00
0302dc68bf [Reland] Fakify script object inputs and attributes for non-strict ex… (#125490)
A re-land of #124239.

This PR fakifies ScriptObject inputs and attributes in export non-strict mode by default.

The basic idea is to only fakify the script object during tracing (i.e. aot_export). After we get the traced graph module, eagerly executing, serializing, or running more passes will use the real script objects. This is essentially treating the script object as a constant tensor.

Concretely, we:

1. fakify all the script object inputs and module attributes (gathered by constant_attrs),
2. patch the module's attributes with the fakified script objects,
3. right after aot_export, remove the patching (to avoid changing the original module), then modify the exported graph module's attributes to the real script objects.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125490
Approved by: https://github.com/angelayi
2024-05-04 02:39:42 +00:00
bfd5bb0c44 [c10d] only PG0 should dump when monitoring thread timed out (#125356)
Summary:
We found that some dumps are missing when the monitoring thread times out.
This is likely because multiple PGs could still dump the same records
at the same time, so we should allow only PG0 to actually dump.
Test Plan:
 unit test
python test/run_test.py --cpp --verbose -i cpp/ProcessGroupNCCLErrorsTest
Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125356
Approved by: https://github.com/c-p-i-o
2024-05-04 00:43:20 +00:00
eqy
d325c55896 Add CUDA paths to CODEOWNERS (#125409)
CC @ptrblck @albanD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125409
Approved by: https://github.com/albanD
2024-05-04 00:29:39 +00:00
8a1af95b09 [Inductor] Properly package target info for triton.compile (#125241)
Triton updated the interface for `triton.compile` 5162346487

The `target` argument to compile needs to be wrapped in a `GPUTarget` object. Without proper wrapping, we hit an assert in `compile`. If that assert is removed, Triton attempts to read device info from Torch while inside a torch thread, which hits an "in bad fork" assert. This change is required for compatibility with the latest commits in Triton. The implementation is backwards compatible, so existing versions of Triton that work now continue to work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125241
Approved by: https://github.com/jansel
2024-05-04 00:10:53 +00:00
9aa7699185 [FSDP2] Computed grad divide factors at runtime (#125484)
**Context**
We are interested in supporting the case where HSDP reduce-scatters but does not all-reduce in a microbatch backward. This saves communication while still saving memory. Only on the last microbatch do we need to both reduce-scatter and all-reduce. This is not implemented yet and will hopefully come in a future PR.

There is one notable part of doing this. On the last microbatch, we need to perform an accumulation step after reduce-scatter and before all-reduce. If not, then the preceding microbatch's gradients will not be contributed across the replica group. (In other words, we cannot simply accumulate _after_ all-reduce.)

Consider 32 GPUs with 4-way replication and 8-way sharding and 2 microbatches, and focus on global rank 0.
- After the first microbatch, rank 0 will have its shard of $\frac{1}{8} \sum_{i \in S(0)} g_i^{(1)}$, where we define $S(0) = \{0, 1, \dots, 7\}$ to be the ranks in its shard group and we define the $(1)$ superscript to denote the first microbatch.
- Upon the second microbatch, rank 0 after its reduce-scatter will additionally have its shard of $\frac{1}{8} \sum_{i \in S(0)} g_i^{(2)}$. If we only all-reduce this, then this second microbatch's gradients become $\frac{1}{32} \sum_{i=0, 1, \dots, 31} g_i^{(2)}$, so in total, rank 0 has $\frac{1}{8} \sum_{i \in S(0)} g_i^{(1)} + \frac{1}{32} \sum_{i=0, 1, \dots, 31} g_i^{(2)}$, which is wrong.
- Importantly, we must accumulate $\frac{1}{8} \sum_{i \in S(0)} g_i^{(1)}  + \frac{1}{8} \sum_{i \in S(0)} g_i^{(2)} = \frac{1}{8}\sum_{i \in S(0)} (g_i^{(1)} + g_i^{(2)})$ first before all-reducing to get $\frac{1}{32} \sum_{i=0, 1, \dots, 31} (g_i^{(1)} + g_i^{(2)})$.

Now, note how under this approach, we want a factor of $\frac{1}{8}$ only (i.e. reciprocal of the shard group size), not $\frac{1}{32}$, for the first microbatch's gradients.
- For bf16/fp32, since we use `ReduceOp.AVG` and we only reduce-scatter on the first microbatch, we correctly have a factor of $\frac{1}{8}$ on the first microbatch.
- For fp16, since we precompute the gradient divide factors at init time assuming always reducing over both shard and replica groups, we incorrectly have a factor of $\frac{1}{32}$ on the first microbatch, deviating from the bf16/fp32 case.

We can address this issue by matching the bf16/fp32 vs. fp16 semantics by computing the divide factors at runtime based on which process groups were passed into the reduction function (`foreach_reduce`).

**Additional Notes**
How to implement the HSDP reduce-scatter but no all-reduce is not entirely clear yet. (What is the cleanest way to do this?) We need to store the partial reduce-scatter output and check for it upon the next backward. We should also be sure to error if the set of parameters receiving gradients changes, in which case we cannot support this easily. Anyway, we will implement this in a follow-up.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125484
Approved by: https://github.com/wanchaol
ghstack dependencies: #125431, #125479
2024-05-03 23:44:05 +00:00
996bb74077 [FSDP2] Added HSDP grad acc tests and some minor changes (#125479)
This adds HSDP to the existing gradient accumulation tests and includes some minor changes to simplify things a tiny bit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125479
Approved by: https://github.com/wanchaol
ghstack dependencies: #125431
2024-05-03 23:44:05 +00:00
b96b1e8cff [Distributed] Add P2P versions of *object_list operations (#124379)
This PR adds `send_object_list` and `recv_object_list` to `distributed_c10d.py`. This extends functionality already present in PyTorch (`broadcast_object_list`) that I noticed was missing and decided to upstream.

With this change, sending and receiving arbitrary picklable python objects is possible.

Relevant issue: https://github.com/pytorch/pytorch/issues/3473
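
A minimal sketch of the point-to-point flavor (it assumes a two-rank group already initialized with `init_process_group`; the pre-sized receive buffer mirrors how `broadcast_object_list` works):

```python
import torch.distributed as dist

def exchange_config():
    if dist.get_rank() == 0:
        objects = [{"step": 10, "lr": 1e-3}]   # any picklable Python objects
        dist.send_object_list(objects, dst=1)
    else:
        received = [None]                       # pre-sized buffer, filled in place
        dist.recv_object_list(received, src=0)
        print(received[0])
```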

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124379
Approved by: https://github.com/kwen2501, https://github.com/wconstab
2024-05-03 23:22:58 +00:00
f2ab96a57e [dynamo] fix crash when context manager is passed to a function (#125321)
Fix https://github.com/pytorch/pytorch/issues/125274. Main change was to reconstruct `ContextWrappingVariables` as objects in general, but we can replace them with the class on the caller side when generating the resume function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125321
Approved by: https://github.com/jansel
2024-05-03 23:01:30 +00:00
59abd1dccb Fix lint after PR 122611 (#125512)
Fix lint after https://github.com/pytorch/pytorch/pull/122611
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125512
Approved by: https://github.com/clee2000
2024-05-03 22:58:20 +00:00
4abcf36dde Make c10::Error empty backtrace as an optional argument (#122611)
Summary: Split from the main diff in the stack.

Test Plan: Build validation should be enough.

Reviewed By: ezyang

Differential Revision: D55313410

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122611
Approved by: https://github.com/ezyang
2024-05-03 22:50:00 +00:00
a783fef990 [AOTI] Add a missing mypy ignore (#125508)
Summary: Caused by https://github.com/pytorch/pytorch/pull/125397, but somehow was not caught by CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125508
Approved by: https://github.com/izaitsevfb
2024-05-03 22:32:31 +00:00
2b5ae2611e s390x: use runtime detection for vectorization support (#123936)
s390x: use runtime detection for vectorization support

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123936
Approved by: https://github.com/malfet, https://github.com/jansel, https://github.com/xuhancn
2024-05-03 21:34:37 +00:00
5503c29357 Introduce torch.utils._sympy.symbol (#125395)
This provides utilities for creating and querying properties on
sympy.Symbol.  I want to use this refactor to get a better handle on how
the 's' prefix is being used in Inductor.  To start, I only do
symbolic_shapes code because that's what I'm familiar with.
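
A hedged sketch of how such a helper module might be used; the exact enum and function names (`SymT`, `make_symbol`, `symbol_is_type`) are assumptions about its API:

```python
import sympy
from torch.utils._sympy.symbol import SymT, make_symbol, symbol_is_type  # assumed names

s = make_symbol(SymT.SIZE, 0)           # a sympy.Symbol carrying the size prefix, e.g. "s0"
assert isinstance(s, sympy.Symbol)
print(s, symbol_is_type(s, SymT.SIZE))  # query the property back from the prefix
```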

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125395
Approved by: https://github.com/Skylion007
2024-05-03 21:24:23 +00:00
1a578df57c [FSDP2] Added test to show rank 0 broadcast for HSDP replicas (#125431)
This PR shows a simple utility to broadcast the parameters across replicas for HSDP:
```
replicate_group = mesh.get_group("replicate")
for param in model.parameters():
    # E.g. for mesh [[0, 1, 2, 3], [4, 5, 6, 7]] sharding on dim-1 and
    # replicating on dim-0, broadcast with sources 0, 1, 2, 3
    src_rank = dist.get_process_group_ranks(replicate_group)[0]
    torch.distributed.broadcast(
        param.to_local(), src=src_rank, group=replicate_group
    )
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125431
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
2024-05-03 21:17:35 +00:00
c941fee7ea [CPP extension] Baton lock is called regardless of the code version (#125404)
Greetings!

Fixes #125403

Please assist me with the testing, as it is possible for my reproducer to miss the error in the code. Several (at least two) threads should enter the same part of the code at the same time to check that the file lock is actually working.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125404
Approved by: https://github.com/ezyang
2024-05-03 21:10:39 +00:00
645baef05d s390x: remove workaround for sleef issue (#124730)
This workaround is no longer needed since sleef was updated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124730
Approved by: https://github.com/soulitzer
2024-05-03 20:52:05 +00:00
b1a7455b99 [Inductor cutlass backend] Fix cutlass_utils.get_max_alignment() for strided layouts. (#124930)
Fixes cutlass_utils.get_max_alignment(), which so far was not checking the alignment properly. Basically,
the method assumed that the passed layout is contiguous and row-major, which does not have to be true.

Test Plan:
CI - test_cutlass_backend.py to prevent regressions
Added unit test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124930
Approved by: https://github.com/int3
ghstack dependencies: #124929
2024-05-03 20:50:26 +00:00
a988b4ed76 [AOTI] Generate mul_Scalar instead of mul_Tensor (#125397)
Summary: Fix https://github.com/pytorch/pytorch/issues/117365. When the second argument to aten.mul.Tensor is a scalar (e.g. scale factor), the cpp wrapper expects to generate a call to mul_Scalar when fallback happens (e.g. Complex dtype).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125397
Approved by: https://github.com/chenyang78
ghstack dependencies: #125329
2024-05-03 18:35:42 +00:00
e84a5b6cc0 [AOTI] Add missing std::move for constant args (#125329)
Summary: fix https://github.com/pytorch/pytorch/issues/123187

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125329
Approved by: https://github.com/angelayi, https://github.com/chenyang78
2024-05-03 18:35:42 +00:00
d6052a35d4 [RFC][FSDP2] Added register_fsdp_forward_method for user fwd methods (#125394)
FSDP only runs its pre/post-forward hooks on `nn.Module.forward`. This means that if the user runs a custom method meant as a forward pass, then FSDP will not all-gather the parameters. Examples include HuggingFace models' `generate()` (https://github.com/pytorch/pytorch/issues/123962, https://github.com/pytorch/pytorch/issues/100069) or others (https://github.com/pytorch/pytorch/issues/109385).

This PR adds a monkey patching API `register_fsdp_forward_method(module: nn.Module, method_name: str)` to allow FSDP pre/post-forward hooks to run on the method. The function is a no-op if the passed-in `module` is not an FSDP module so that the register function can be called even if the FSDP wrapping changes.
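
A hedged usage sketch; the import locations are assumptions (the API itself comes from the description above), and actually running it requires an initialized process group:

```python
import torch
# assumed import locations for the composable FSDP API and the new helper
from torch.distributed._composable.fsdp import fully_shard, register_fsdp_forward_method

class TinyLM(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(16, 16)

    def forward(self, x):
        return self.proj(x)

    def generate(self, x, steps: int = 3):
        # a custom forward-like method that FSDP would not normally hook
        for _ in range(steps):
            x = self.proj(x)
        return x

model = TinyLM()
fully_shard(model)                               # requires torch.distributed to be initialized
register_fsdp_forward_method(model, "generate")  # hooks now also run around generate()
out = model.generate(torch.randn(2, 16))         # parameters are unsharded for this call too
```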

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125394
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
2024-05-03 18:31:28 +00:00
52f9128a0d [AMD] Fix cutlass path in inductor (#125463)
Summary: Trunk is broken because fbcode triton-amd doesn't have cutlass path

Test Plan: It now runs.

Differential Revision: D56923833

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125463
Approved by: https://github.com/Skylion007
2024-05-03 18:02:58 +00:00
e10b2ba357 Script for compiling count + time of test at file granularity (#125322)
Adds a script for compiling the number of tests and total time taken at file granularity
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125322
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2024-05-03 17:35:44 +00:00
12a69afa6d [export] Fix deserializer node meta handling. (#125454)
Summary: The code seems unneeded because the serializer shouldn't make any meaningful decisions about what goes into node metadata.

Test Plan: CI

Differential Revision: D56918543

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125454
Approved by: https://github.com/angelayi
2024-05-03 16:51:08 +00:00
30610251ec [MPS] Add naive quantized intmm and .gputrace capture hooks (#125163)
- Implement a very straightforward Metal copy of CPU int4mm kernel
- Implement int8mm kernel by constructing a graph consisting of upcast, transpose and mm
- Add `isCapturing`, `isCaptureEnabled`, `startCapture` and `stopCapture` methods to `MPSProfile` which can be used to help one debug/profile Metal kernels by wrapping the calls with the following
  ```cpp
   if (getMPSProfiler().isCaptureEnabled()) {
     getMPSProfiler().startCapture(__func__, mpsStream);
   }
   ...
   if (getMPSProfiler().isCapturing()) {
     getMPSProfiler().stopCapture(mpsStream);
   }
  ```
  that, if invoked with the `MTL_CAPTURE_ENABLED` environment variable set to one, will produce .gputrace files in the current working directory, which can later be loaded and used to debug or profile the kernel
<img width="1093" alt="image" src="https://github.com/pytorch/pytorch/assets/2453524/a2bf27e8-df8a-442c-a525-1df67b8a376a">

- Added `test_int4mm` to TestLinalgMPS, which is mostly copy-n-paste of the test from `test_linalg`

TODOs:
 - Add weight pack
 - Perf-tune both kernels
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125163
Approved by: https://github.com/mikekgfb
2024-05-03 15:20:39 +00:00
a99ada5b27 call super().__post_init__ in ForeachFuncInfo.__post_init__ (#125457)
Obviously, the current main branch's `ForeachFuncInfo.__post_init__` does not call `super().__post_init__()`, which does some setup, including setting `dtypesIfCUDA` and `dtypesIfROCM`.

Fixes #125295
related: #125001

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125457
Approved by: https://github.com/janeyx99
2024-05-03 14:54:15 +00:00
79af814369 [FSDP] Added private _unshard API (#124304)
Some toy example:
<img width="998" alt="Screenshot 2024-04-17 at 2 00 05 PM" src="https://github.com/pytorch/pytorch/assets/31054793/b5665a63-beb0-4ca1-92c6-c57a052812fd">

We define `FullyShardedDataParallel._unshard(async_op: bool = False)` that can be used to prefetch all-gathers. The user should make sure:
1. Run lazy init before the first `_unshard` call of training. For example, this can hackily be done via `root_module.check_is_root()` on the root FSDP module `root_module`.
2. Call `root_module._wait_unshard_streams_on_current_stream()` before the first `_unshard` call of the current iteration (just need to call it once after last optimizer step and before first `_unshard` of this iteration).
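
A hedged sketch of the two steps above (the data loader and optimizer are placeholders):
```python
# root_module: the root FullyShardedDataParallel instance; loader/optim are placeholders.
root_module.check_is_root()  # 1. hack to force lazy init before the first _unshard

for batch in loader:
    # 2. once per iteration, before the first _unshard of that iteration
    root_module._wait_unshard_streams_on_current_stream()
    root_module._unshard(async_op=True)   # kick off the all-gather early (prefetch)
    loss = root_module(batch).sum()
    loss.backward()
    optim.step()
    optim.zero_grad()
```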

Differential Revision: [D56262876](https://our.internmc.facebook.com/intern/diff/D56262876)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124304
Approved by: https://github.com/wanchaol
2024-05-03 13:14:15 +00:00
ca98c2a932 inductor: Add Conv3d support (#124361)
This PR adds Conv3d support in inductor. It basically reuses and extends the Conv2d logic and unit tests to cover Conv3d.
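
A minimal sketch (shapes are arbitrary) of a Conv3d module going through `torch.compile` with the default inductor backend:
```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU()).eval()
compiled = torch.compile(model)  # inductor is the default backend

with torch.no_grad():
    x = torch.randn(1, 3, 8, 32, 32)  # (N, C, D, H, W)
    y = compiled(x)
print(y.shape)  # torch.Size([1, 16, 8, 32, 32])
```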

Conv3d inductor support will improve the performance of C2D_R50, I3D_R50, I3D_R101, Slow and SlowFast-R50 from OOB models.

  | C2D_R50 | I3D_R50 | I3D_R101 | Slow | SlowFast-R50
-- | -- | -- | -- | -- | --
eager | 15.805 | 13.909 | 11.639 | 12.101 | 6.606
Compile w/o conv3d | 17.244 | 14.893 | 12.109 | 13.015 | 6.603
Compile w/ conv3d | 21.212 | 17.707 | 14.974 | 16.130 | 8.537

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124361
Approved by: https://github.com/leslie-fang-intel, https://github.com/CaoE, https://github.com/jgong5, https://github.com/jansel
2024-05-03 10:24:14 +00:00
489b4586e9 [optim]fix ut and sgd kernel (#124904)
- The original `test_grad_scaling_autocast_fused_optimizers` does not work since there is no "fused" in `optim_inputs`.
 - We should use different `grad_scaler`s; they should not share one `scale`. No issue is exposed here because the default `_growth_interval` is 2000, so the scale will not grow, and no inf is found, so it will not be reduced either. The one in `test_cuda.py` should also have this issue.
 - I set a manual seed for reproducibility in case of any numerical failure.
 - I use a tensor tracker here because this UT fails in the dynamo case; the generated cpp code is not exactly the same for the fused and non-fused kernels.
 - I make it check both `cuda` and `cpu`.
 - I found an SGD numerical issue with `clang` and fixed it by using `fmadd` instead of `add/mul` in the fused SGD vec kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124904
Approved by: https://github.com/jgong5, https://github.com/janeyx99
2024-05-03 09:13:24 +00:00
bebefcf845 Driver folder check (#117548)
Added an extra check for drive folders for LibTorch, as the stat struct does not recognize drive folders; with this, torch.save works for them as well (e.g., saving model.pt directly under C:\).

Fixes [#111121](https://github.com/pytorch/pytorch/issues/111121) and #105488

Co-authored-by: Ozan Aydin <148207261+ozanMSFT@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117548
Approved by: https://github.com/malfet
2024-05-03 09:10:11 +00:00
e5cc7ada67 skip triton template precompilation in 3.11.0-3.11.7 to work around a 3.11 CPython bug (#125446)
Fix for https://github.com/pytorch/pytorch/issues/125374. We don't have CI for these specific versions, but I verified locally. There is a CPython bug from 3.11.0->3.11.7 where the AST parsing state is global and errors with multiple threads. When the dust settles a little around the new process-based compilation we can look into migrating.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125446
Approved by: https://github.com/Chillee
ghstack dependencies: #125289
2024-05-03 08:28:32 +00:00
dd92637f44 Add write_record_metadata to PyTorchFileWriter (#125184)
Add `PyTorchFileWriter.write_record_metadata(record_name, num_bytes)` that
- writes the zipfile header/end of central directory metadata for an entry*
- reserves `num_bytes` in the zipfile for the payload.

*Since the payload is not provided, the CRC32 computation is skipped and 0s are written in the corresponding entry of the zipfile header
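
A rough usage sketch (file name, record name, and size are made up):
```python
import torch

writer = torch._C.PyTorchFileWriter("archive.pt")
# Write the zip header for "data/payload" and reserve 1024 bytes for a payload
# that will be filled in later; the CRC32 field is left as zeros since no bytes are given.
writer.write_record_metadata("data/payload", 1024)
writer.write_end_of_file()
```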

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125184
Approved by: https://github.com/albanD
2024-05-03 07:29:52 +00:00
4c84789743 [vision hash update] update the pinned vision hash (#123227)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123227
Approved by: https://github.com/pytorchbot
2024-05-03 05:55:29 +00:00
071ee40793 [dynamo][nn module] Check for duplicate tensors in register_attr_or_module (#125421)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125421
Approved by: https://github.com/mlazos
ghstack dependencies: #125439
2024-05-03 05:08:09 +00:00
ef757a5c00 [export] use tree_map for _flatten_dynamic_shapes (#125415)
Summary:
Fixing the implementation of `_flatten_dynamic_shapes()` to follow how `_process_dynamic_shapes()` does it. The previous implementation would misinterpret some nested dynamic shapes specs, causing it to miss some shape specs, for example with nested inputs/constant input tuples:

```
inputs = (
    (2, 1),
    (
        torch.randn(2, 1),
        torch.randn(2, 2),
        torch.randn(2, 3),
    )
)

dynamic_shapes = (
    (None, None),
    (
        None,
        None,
        None,
    )
)
```
This would get incorrectly interpreted as 2 shape specs, for 2d and 3d tensors. The fix prevents this.

Test Plan: Existing export tests

Differential Revision: D56894923

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125415
Approved by: https://github.com/angelayi
2024-05-03 04:59:17 +00:00
394ec2da30 Remove GPU Check from Basic Chrome Trace test (#125430)
Summary: Remove the check to make sure all GPU labels are enumerated when CUDA is available. There are some systems where CUDA is available but we do not print any GPU labels (because GPU is not available).

Test Plan: Test in regression with ciflow/periodic label

Differential Revision: D56906893

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125430
Approved by: https://github.com/izaitsevfb
2024-05-03 04:51:10 +00:00
8706da2bad [dynamo][cpp-guards] Improve recompilation reason logic for NO_TENSOR_ALIASING guard (#125439)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125439
Approved by: https://github.com/williamwen42
2024-05-03 04:49:41 +00:00
d156cb2e12 Fix mem size mismatch from split/chunk in const folding (#125199)
Summary:
The chunk/split ops on the weights/constants are folded in an FX pass, and each output tensor has the same storage size as the original tensor (which is 3x its actual size in the case of chunk(3)). However, the backend calculates the mem size on device from the tensor shape/stride/dtype. This causes a mismatch when copying weights/constants to device, as the allocated mem on device is always smaller than the size of the weights/constants, and results in a runtime error when loading the weight/constant (T172125529).

This diff fixes the issue by cloning the tensors after const folding so that the tensors have the correct storage size.
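
A small illustration of the mismatch (shapes are hypothetical but chosen to match the 18432-byte figure in the test plan):
```python
import torch

w = torch.randn(48 * 3, 64, dtype=torch.float16)     # 3 stacked 48x64 fp16 constants
parts = w.chunk(3, dim=0)
# Each chunk is a view sharing the full original storage: 144 * 64 * 2 = 18432 bytes.
print(parts[0].untyped_storage().nbytes())            # 18432
# Cloning after const folding gives each tensor its own, correctly sized storage.
print(parts[0].clone().untyped_storage().nbytes())    # 6144
```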

Test Plan:
Before this change: (18432 = 48 * 64 * 2 * 3)
 ```
RuntimeError: Failed to load constant getitem_idx0 split (remaining=18432) at fbcode/caffe2/torch/fb/acc_runtime/afg/afg_bindings.cpp:3422: Request failed because an invalid parameter
```

```
buck2 run mode/opt //caffe2/torch/fb/acc_runtime/afg/tests:test_operators-artemis -- -r test_mem_size_mismatch
```
```
Ran 1 test in 7.048s

OK
```

Reviewed By: jfix71

Differential Revision: D56663931

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125199
Approved by: https://github.com/jfix71
2024-05-03 04:42:38 +00:00
a40d6df448 [MPS] Native nonzero implementation (#125355)
Fixes https://github.com/pytorch/pytorch/issues/124850

Replace the previous MPSGraph nonzero construction with the native nonzero op. For older OSes, fall back to CPU (the previous implementation was not reliable and was comparable to CPU in speed).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125355
Approved by: https://github.com/kulinseth
2024-05-03 03:50:58 +00:00
e15da7856c [MPS] Fix overflow in cumsum when dtype is bool (#125318)
`cumsum` and `cumprod` were (are?) buggy for MPS: c8d2a55273/aten/src/ATen/native/mps/operations/UnaryOps.mm (L435-L436)

A workaround casts the input to int32 prior to performing the op to prevent overflow for certain numeric types.

It turns out this issue also affects boolean types:

```python
import torch
print(torch.ones(128, dtype=torch.bool, device="mps").cumsum(0)[-1])
# tensor(-128, device='mps:0')
```

In this PR I'm adding logic to also cast bool dtypes to int32 prior to `cumsum` and `cumprod`, although the output is guaranteed not to overflow for the latter with bools. I'm also adding a test to prevent regressions.
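
The same cast written out from the user side, as a sketch of what the kernel now does internally:
```python
import torch

x = torch.ones(128, dtype=torch.bool, device="mps")
# Casting to int32 before the scan avoids the overflow shown above.
print(x.to(torch.int32).cumsum(0)[-1])  # tensor(128, device='mps:0')
```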

Fixes #96614 #106112 #109166

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125318
Approved by: https://github.com/malfet
2024-05-03 01:19:24 +00:00
acac7aa70f [CI] Unskip Linalg tests on ARM (#125377)
Removes obscure "Issue with numpy version on arm" added by https://github.com/pytorch/pytorch/pull/82213
And replaces it with 4 targeted skips:
- test_addmv for `float16`
- test_vector_norm for `float16`, `bfloat16` and `float32`

Followups to fix them are tracked in https://github.com/pytorch/pytorch/issues/125438
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125377
Approved by: https://github.com/kit1980
2024-05-03 01:18:52 +00:00
d18a6f46d0 Adding Compare in torch.utils.benchmark documentation (#125009)
`torch.utils.benchmark.Compare` is not directly exposed in torch.utils.benchmark documentation.

I think this is a valuable resource to add since it can help people embrace the torch benchmark way of doing things, and help people build documentation around it.
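
For reference, a small sketch of the `Timer`/`Compare` workflow being documented:
```python
import torch
from torch.utils import benchmark

results = []
for n in (64, 256):
    timer = benchmark.Timer(
        stmt="a @ a",
        setup=f"a = torch.randn({n}, {n})",
        label="matmul",
        description=f"{n}x{n}",
    )
    results.append(timer.blocked_autorange(min_run_time=0.2))

benchmark.Compare(results).print()  # renders a formatted comparison table
```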

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125009
Approved by: https://github.com/mikaylagawarecki
2024-05-03 00:50:54 +00:00
4440d0755a Support custom layout call under torch dispatch mode (#125379)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125379
Approved by: https://github.com/jbschlosser
2024-05-02 23:44:12 +00:00
7551755cec Update tolerance for flex fp32 (#125444)
# Summary
Updates the tolerances to account for internal failure

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125444
Approved by: https://github.com/kit1980
2024-05-02 23:34:18 +00:00
3b5f6b10ad [Inductor] default block size for head_dim = 256 for flex attention (#125380)
## H100
### torch.bfloat16
No major change, as expected.
```
| Type    |   Speedup |   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod   | dtype          |
|---------|-----------|--------------|-------------|-------------|-------------|------------|-------------|----------------|
| Average |     1.122 |              |             |             |             |            |             |                |
| Max     |     1.437 |            1 |          16 |         512 |         512 |        128 | head_bias   | torch.bfloat16 |
| Min     |     0.895 |            1 |          16 |        1024 |        1024 |         64 | head_bias   | torch.bfloat16 |
```
### torch.float32
Before: OOM when ```head_dim``` = 256
After:
```
| Type    |   Speedup |   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod   | dtype         |
|---------|-----------|--------------|-------------|-------------|-------------|------------|-------------|---------------|
| Average |     2.231 |              |             |             |             |            |             |               |
| Max     |     3.760 |           16 |          16 |        4096 |        4096 |         64 | noop        | torch.float32 |
| Min     |     1.532 |            1 |          16 |         512 |         512 |        256 | causal_mask | torch.float32 |
```

## A100
### torch.bfloat16
Before:
```
| Type    |   Speedup |   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod     | dtype          |
|---------|-----------|--------------|-------------|-------------|-------------|------------|---------------|----------------|
| Average |     0.587 |              |             |             |             |            |               |                |
| Max     |     0.960 |            1 |          16 |         512 |         512 |         64 | noop          | torch.bfloat16 |
| Min     |     0.017 |            8 |          16 |        4096 |        4096 |        256 | relative_bias | torch.bfloat16 |
```
After:
```
| Type    |   Speedup |   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod   | dtype          |
|---------|-----------|--------------|-------------|-------------|-------------|------------|-------------|----------------|
| Average |     0.756 |              |             |             |             |            |             |                |
| Max     |     0.931 |            1 |          16 |         512 |         512 |         64 | noop        | torch.bfloat16 |
| Min     |     0.467 |           16 |          16 |        1024 |        1024 |        256 | noop        | torch.bfloat16 |
```

### torch.float32
Before: OOM when ```head_dim``` = 256
After:
```
| Type    |   Speedup |   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod   | dtype         |
|---------|-----------|--------------|-------------|-------------|-------------|------------|-------------|---------------|
| Average |     2.386 |              |             |             |             |            |             |               |
| Max     |     7.584 |           16 |          16 |         512 |         512 |         64 | noop        | torch.float32 |
| Min     |     0.948 |            1 |          16 |         512 |         512 |        256 | causal_mask | torch.float32 |
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125380
Approved by: https://github.com/drisspg
2024-05-02 22:51:07 +00:00
5c7b71dccf [DCP] Adds strict option to DefaultPlanner (#123869)
~Users may have custom use cases for the `strict` parameter in load. In my mind, if we automatically call `state_dict` and `load_state_dict` in save/load, we need to support the same functionality in `nn.Modules`.~

It turns out this is actually not related to nn.Module's strict param. Since `state_dict` is called inside `dcp.load`, it's actually impossible to create a model such that the following would raise an error:
```
state_dict = module.state_dict()
module.load_state_dict(state_dict, strict=True)
```

The issue actually arises only when there are elements in `state_dict` which do not exist in the checkpoint. This PR adds the ability to configure this behavior through the DefaultLoadPlanner (see tests).

Concretely, if module has extra attributes not present in the checkpoint, we will only raise an error if `DefaultLoadPlanner.allow_partial_load==False`
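
A hedged sketch of opting into partial loads (the checkpoint path and model are placeholders; the constructor kwarg is assumed from the description above):
```python
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.default_planner import DefaultLoadPlanner

state_dict = model.state_dict()   # `model` may have extra keys not in the checkpoint
dcp.load(
    state_dict,
    checkpoint_id="/tmp/ckpt",    # placeholder path
    planner=DefaultLoadPlanner(allow_partial_load=True),  # don't error on missing keys
)
```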

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123869
Approved by: https://github.com/fegin
2024-05-02 22:50:32 +00:00
2c8237c6aa [ATen-VK] Resolve compiler_flags to allow Mac build (#125361)
Summary:
## `-Wmissing-prototypes`

In ATen-Vulkan, we often define functions in `.cpp` files without declaring them in `.h` files or hiding them in an anonymous namespace.

Example: [`Packing.cpp`'s channel_image_repacking()](f1f142c44f/aten/src/ATen/native/vulkan/impl/Packing.cpp (L299-L348))

On Mac, this results in a `-Wmissing-prototypes` warning, which is disabled in this change.

## `-Wshadow`

In `Adapter.cpp`, we overwrite a variable called `properties`, which we fix in this change as opposed to disabling the warning.

Test Plan: CI

Differential Revision: D56850324

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125361
Approved by: https://github.com/SS-JIA
2024-05-02 22:26:39 +00:00
55c705b602 [dynamo] add trace_bytecode logging artifact (#125360)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125360
Approved by: https://github.com/ezyang
2024-05-02 22:01:00 +00:00
a0e2f62edd Revert "Include support for the scatter gather cuda kernels to allow for comp… (#124809)"
This reverts commit 9e24c263f998819f849bb8293323213101e9aefc.

Reverted https://github.com/pytorch/pytorch/pull/124809 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/124809#issuecomment-2091751002))
2024-05-02 21:36:18 +00:00
b1b03992d0 Merge the pyi files into py files of optimizer (#125153)
Merge the interfaces in pyi files into py files in `torch/optim`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125153
Approved by: https://github.com/janeyx99
2024-05-02 21:29:31 +00:00
edad82fc90 Add private helper for determining which version of FA2 closest matches kernel version (#123653)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123653
Approved by: https://github.com/mikaylagawarecki
2024-05-02 21:28:23 +00:00
0199ce8d6c [pipelining] Add microbatch split and merge utils (#125273)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125273
Approved by: https://github.com/H-Huang
ghstack dependencies: #124776, #124875, #124958
2024-05-02 21:09:47 +00:00
1657f7e262 [Doc] Update docstrings for torch/random.py (#125265)
Updates the docstrings for torch/random.py to clarify what device / RNG each function operates on.

While trying to understand the difference between
```
state = torch.random.get_rng_state()
some_code
torch.random.set_rng_state(state)
```
and
```
with torch.random.fork_rng():
    some_code
```
I found out that there was a note about this in the docstring that wasn't being rendered on the website. I fixed that note and added additional clarifications on other functions in this file.

Test Plan:
Built the docs and verified that everything renders correctly.

<img width="911" alt="Screenshot 2024-04-30 at 2 22 08 PM" src="https://github.com/pytorch/pytorch/assets/9263852/f219bc35-89bd-4f5b-ba60-255b089499a4">

<img width="901" alt="Screenshot 2024-04-30 at 2 22 13 PM" src="https://github.com/pytorch/pytorch/assets/9263852/c141e7fa-afc9-4c66-b460-96668ce35606">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125265
Approved by: https://github.com/Balandat, https://github.com/lezcano
2024-05-02 20:55:23 +00:00
fc76764a56 Always pass down kernel_file and grid as string (#125384)
From my tests with an Ads production workload, I found that sometimes kernel_file is None and grid is a tuple. This crashes since ExecutionTraceObserver expects strings for both kernel_file and grid. This PR makes sure kernel_file and grid are always passed down as strings. We still need to find the root cause of why kernel_file is None.

Unit test:
     buck test  @mode/dev-nosan caffe2/test:profiler -- test_execution_trace_with_pt2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125384
Approved by: https://github.com/davidberard98, https://github.com/sraikund16
2024-05-02 20:43:20 +00:00
dae574c713 Don't make replacements for i variables (#125398)
This was introduced in https://github.com/pytorch/pytorch/pull/110262
but actually it looks like they were trying to hit unbacked SymInt.
Now that unbacked SymInt is renamed to u, this code is no longer
necessary

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125398
Approved by: https://github.com/lezcano, https://github.com/Skylion007
2024-05-02 20:38:09 +00:00
4f62494bf9 [DCP] Move async logic into filesystem for better encapsulation (#124944)
This logic is specific to FilesystemWriter, and now has a better place to live due to the new AsyncStager class

Differential Revision: [D56578436](https://our.internmc.facebook.com/intern/diff/D56578436/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124944
Approved by: https://github.com/fegin
ghstack dependencies: #122965, #124939
2024-05-02 20:31:33 +00:00
2bbfb70831 ignore unsupported module from flop counter (#125346)
Summary:
Torchscript modules do not support forward hooks and thus can't work with the flop_counter context manager's hierarchical output (obtained by passing a module to FlopCounterMode on construction).

Currently, any module that includes a script module causes an exception to be thrown, so this adds a try/catch to ignore script modules when registering forward hooks.

Test Plan: CI Signals

Differential Revision: D56850661

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125346
Approved by: https://github.com/842974287
2024-05-02 20:30:52 +00:00
799f1460af [DCP] Provides default AsyncStager (#124939)
Differential Revision: [D56575987](https://our.internmc.facebook.com/intern/diff/D56575987/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124939
Approved by: https://github.com/fegin
ghstack dependencies: #122965
2024-05-02 19:48:54 +00:00
3741fb3680 [DCP] Introduce async staging extension points (#122965)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

* #124944
* #124939
* __->__ #122965

Differential Revision: [D55493240](https://our.internmc.facebook.com/intern/diff/D55493240/)

*This PR is now ready for merge and is not an RFC*

Major choices are:
- the introduction of the AsyncStager protocol
- the removal of `executor` from the parameters
- leaving async as a separate method (for now)

This proposal seeks to add extension points to dcp.async_save, allowing users to:
- Specify a specific staging method when calling async_save
- Make the staging method itself async, to allow for cases where we may want to overlap it with the training loop (e.g., overlap the D2H copy and only synchronize at the optim.step)
- Potentially specify the execution method for doing async_save in parallel. For example, some users may prefer a subprocess over a thread to avoid GIL issues.

A totally reasonable alternative to this entire proposal is to expect users who want this level of customization
to write their own custom async save methods. Here's an example which addresses the issues mentioned
in PR comments.
```
def custom_async_save(...):
    # this step accomplishes staging and includes the usual 'planning' calls (issue 1)
    buffered_writer = CpuBufferedWriter()  # this is stateful, contains a copy of state_dict
    dcp.save(state_dict, storage_writer=buffered_writer)

    final_storage_writer = FileSystemWriter()
    mp.spawn(  # issue 2 is gone, do whatever you want here
        dcp.save,  # or some custom sub-process method which calls dcp.save under the hood
        buffered_writer.state_dict,  # lots of ways to do this, not really the most important part
        checkpoint_id=checkpoint_id,
        storage_writer=final_storage_writer,
        planner=planner,
        process_group=process_group,  # this actually wouldn't work, but again not the point
    )
    # leaving out the rest of the details for managing your extra special subprocess.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122965
Approved by: https://github.com/daulet-askarov
2024-05-02 19:01:55 +00:00
da991fac22 [ROCm][CI] upgrade CI to ROCm 6.1 (#124300)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124300
Approved by: https://github.com/malfet
2024-05-02 17:16:02 +00:00
1eb7b8eb60 [PT2D] Ensure the trace rules are correct with distributed (#125333)
Summary:
1. Avoid using `torch._dynamo.disable`.
2. Clear the LRU cache of the trace rules. This won't do anything if the rules are not evaluated before PG initialization.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125333
Approved by: https://github.com/yanboliang
2024-05-02 16:28:38 +00:00
e93b57a570 Add propagate_real_tensors mode for unbacked (#125115)
A common complaint when working with data-dependent code in PyTorch is that it's hard to tell how far you are from the finish line: every time a GuardOnDataDependentSymNode error is hit, you have to somehow fix or workaround it to see the next one.

This PR adds a new mode `torch._functorch.config.fake_tensor_propagate_real_tensors` which modifies fake tensors to also propagate real tensors. This means that when we try to guard on a data-dependent SymNode, we can actually produce a real result. We also produce a warning which you should consult to figure out what the crux points are.
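
A minimal sketch of flipping the config on (the toy function is illustrative only):
```python
import torch
import torch._functorch.config as functorch_config

functorch_config.fake_tensor_propagate_real_tensors = True

@torch.compile
def f(x):
    n = int(x.sum())        # data-dependent value that would normally need a guard
    return torch.zeros(n)

f(torch.tensor([2, 3]))     # guards on n are resolved using the real value, with warnings
```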

I ran this on vision_maskrcnn. In the baseline (without this mode), the model has 27 graph breaks, resulting in 40 graphs. With this mode on, the model has only 11 graph breaks, resulting in 15 graphs (the remaining graph breaks are due to missing functionality for item() on float tensor and some other Dynamo missing features.) You get a list of things that would have errored like this:

```
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u1) < 2) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u1), 1)) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u0) < 2) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u0), 1)) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u1) < 2) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u1), 1)) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u0) < 2) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u0), 1)) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u1) < 2) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u1), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u0) < 2) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> False
```

Potential later follow ups:

* Improve the warning messages (in particular, should provide user frames)
* GC real tensors when they are no longer needed by tracing. Right now, this will use A LOT of memory, as much as if your GC were broken and every intermediate tensor were kept live

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125115
Approved by: https://github.com/IvanKobzarev
2024-05-02 15:28:26 +00:00
fb1bfe1156 Get cutlass_library import working under fbcode (#125257)
Differential Revision: D56764089

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125257
Approved by: https://github.com/chenyang78
2024-05-02 15:17:10 +00:00
8046de3512 [Inductor cutlass backend] Remove epilogue nodes from Kernel call (#124929)
Minor refactoring:
Remove unused "fused epilogue node" arguments from some Kernel method call signatures.

Test Plan:
Covered by current tests in test_cutlass_backend.py - no functional change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124929
Approved by: https://github.com/eellison
2024-05-02 13:02:31 +00:00
a13a0a2479 [dynamo][easy] Simple fixes to prepare for nn module guards (#125316)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125316
Approved by: https://github.com/williamwen42
ghstack dependencies: #125275
2024-05-02 12:08:11 +00:00
0b70026d3b Do not pass none to has_pending_mutation (#125359)
Fixes https://github.com/pytorch/pytorch/issues/125315

Several failures when inlining nn modules is enabled are due to passing None to has_pending_mutation.
From the previous code, it sounds like the variable is expected to be None when not found; in that case we should skip it and not call has_pending_mutation.
This is tested in https://github.com/pytorch/pytorch/pull/125354

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125359
Approved by: https://github.com/mlazos
2024-05-02 09:08:22 +00:00
aa7be72cc5 Convert ForeachFuncInfo to dataclass (#125001)
- `ForeachFuncInfo` to `dataclass` for smaller diff from `OpInfo`
- `skips` to `decorators` and `skip` to `xfail`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125001
Approved by: https://github.com/janeyx99, https://github.com/jeffdaily
2024-05-02 04:19:09 +00:00
da5d2d9b3e Hotfix: restore CPP guard string in structured trace (#125303)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125303
Approved by: https://github.com/albanD
2024-05-02 03:57:19 +00:00
fff7a31800 fix torchdeploy issue on sharddim_alltoall op (#125344)
Summary: fix torchdeploy issues when registering the distributed op, similar to what functional collective did

Differential Revision: D56850434

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125344
Approved by: https://github.com/XilunWu, https://github.com/fegin
2024-05-02 03:38:34 +00:00
f59ce798f9 [ROCm] TunableOp for scaled_mm (#123987)
Adds a new ScaledGemmTunableOp implementation using hipblaslt.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123987
Approved by: https://github.com/jianyuh
2024-05-02 03:06:31 +00:00
5ea54839c9 Make min(stride, strides[idx]) in collapse_view_helper size oblivious (#125301)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125301
Approved by: https://github.com/albanD
2024-05-02 02:39:58 +00:00
b119e1bcc2 Fix refcount handling for dtype, layout and memory format (#125271)
Finish fixing https://github.com/pytorch/pytorch/issues/124868
re-use our wrap() utils as much as possible and NewRef in other places.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125271
Approved by: https://github.com/colesbury
2024-05-02 02:34:34 +00:00
4731130ea8 Add a code comment about torch._check_is_size in tensor_split (#125292)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125292
Approved by: https://github.com/albanD
2024-05-02 02:25:38 +00:00
a9309502af Revert "Refactoring to remove unused variable (#125252)"
This reverts commit b094622bc954e179ddb8649652b87d2a81d7d500.

Reverted https://github.com/pytorch/pytorch/pull/125252 on behalf of https://github.com/drisspg due to going to land codev ([comment](https://github.com/pytorch/pytorch/pull/125252#issuecomment-2089394606))
2024-05-02 01:49:57 +00:00
b03fb49ed8 Revert "[dynamo] use lazy disable dynamo for manual seed (#125196)"
This reverts commit 8320b770fd9dc4671bc9eb0d535e14173e95cf45.

Reverted https://github.com/pytorch/pytorch/pull/125196 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/125196#issuecomment-2089355842))
2024-05-02 00:57:39 +00:00
9e24c263f9 Include support for the scatter gather cuda kernels to allow for comp… (#124809)
Fixes #121965

This PR hopes to add support for complex numbers in the scatter/gather related kernels. For brevity, I will only include `complex<float>` for now, as `complex<double>`, for example, will be more complicated.

C++ unit tests are currently passing alongside tests in `test_scatter_gather_ops.py`. Python test suites also seem to be passing.

Please keep the following in mind:
1) I think this is my first time using Pytorch.
2) This is my first contribution to Pytorch.

Environment:
3080 & WSL 2. `nvcc` is at 12.4.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124809
Approved by: https://github.com/mikaylagawarecki
2024-05-01 23:58:35 +00:00
f1f142c44f Revert "Fakify script object inputs and attributes for non-strict export (#124239)"
This reverts commit ecc2e034f7e55bf9ff7f4e5df4e9086a5c92caaa.

Reverted https://github.com/pytorch/pytorch/pull/124239 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/124239#issuecomment-2089305447))
2024-05-01 23:56:00 +00:00
9022f131b5 [inductor] switch assume_aligned_inputs to False (#124336)
In #123319, we guard some behavior behind the `assume_aligned_inputs` config option. If we set this to `False`, then the behavior added in #123319 becomes the default behavior. See the referenced PR for more details about the behavior affected.

Side effects:
* It's possible that this will hurt performance in some scenarios. For example, if an unaligned input is used in a matmul, it might be better to perform the clone to align it first.
* This will occasionally cause recompiles. Specifically: the check we perform (`(storage_offset * get_dtype_size(dtype)) % ALIGNMENT == 0`) can be guarded on if the storage_offset becomes dynamic. storage_offset becomes dynamic during automatic_dynamic_shapes after a shape or stride changes. Previously, this was increasing graph breaks in cpu inductor torchbench tests (but is fixed by more carefully guarding checks on alignment, so that we don't run them and generate guards unless actually needed).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124336
Approved by: https://github.com/eellison
2024-05-01 23:49:27 +00:00
c281d3a0cb Enable UFMT on test_indexing&test_view_ops (#125112)
Part of https://github.com/pytorch/pytorch/issues/123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125112
Approved by: https://github.com/ezyang
2024-05-01 23:44:53 +00:00
9043ccafdf Require nnz==0 in sparse meta tensors (#125221)
As in the title and per discussion starting at https://github.com/pytorch/pytorch/pull/117907#issuecomment-2082426468

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125221
Approved by: https://github.com/amjames, https://github.com/ezyang
2024-05-01 23:41:49 +00:00
46f326eff5 explicitly reset stderr/stdout in precompilation (#125289)
I was seeing a weird bug where, after running max-autotune, my stdout would be misdirected. Other people have not been able to repro this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125289
Approved by: https://github.com/shunting314, https://github.com/mlazos
2024-05-01 23:41:36 +00:00
6f5f405b05 [ncclx] Rename NCCL-EXP to NCCLX (#125238)
Reviewed By: kryanchun

Differential Revision: D56534548

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125238
Approved by: https://github.com/kwen2501
2024-05-01 23:29:55 +00:00
6cfb55dd5d Add a variable for some testcases. (#124708)
Some test cases can use 'TEST_PRIVATEUSE1_DEVICE_TYPE' to make adapting them to other devices more convenient.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124708
Approved by: https://github.com/albanD
2024-05-01 23:19:12 +00:00
c451d108da Implemented isin_Tensor_Tensor_out for MPS backend (#124896)
Addresses issue #124518, adds isin_Tensor_Tensor_out.

Tests added to test_mps.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124896
Approved by: https://github.com/malfet, https://github.com/kulinseth
2024-05-01 23:14:05 +00:00
506eda538b Fix windows build error not propagating (#125306)
* Fixes https://github.com/pytorch/pytorch/issues/124886
* Kind of similar to https://github.com/pytorch/pytorch/pull/109393

I think what happens is that `exit` and `exit /b` both propagate the errorlevel correctly, but `exit /b` only exits the currently running batch script and not the entire cmd.exe (or whatever program is running the batch script), so `exit /b` exits with errorlevel 1, but the parent cmd exits with 0, and bash sees cmd's 0.

I think `goto fail` and `exit` are the same thing when the batch script is run from a bash script, so either would work in this case? But the `goto fail` method might be better if someone happens to run the script on the command line.

I assumed that anywhere anyone was exiting after checking the error code, they did want to exit completely, and I'm pretty sure that being inside a parenthesis counts as being a different script, so I changed everything to `goto fail` just in case. This might be too aggressive?

Logs after this change for a build failure on cuda:
https://github.com/pytorch/pytorch/actions/runs/8912185834/job/24475087535?pr=125306
```
2 errors detected in the compilation of "C:/actions-runner/_work/pytorch/pytorch/aten/src/ATen/native/cuda/AdaptiveMaxPooling3d.cu".
AdaptiveMaxPooling3d.cu
[7599/8420] Linking CXX shared library bin\torch_cpu.dll
ninja: build stopped: subcommand failed.
-- Building version 2.4.0a0+git3171c11
cmake -GNinja -DBUILD_ENVIRONMENT=win-vs2019-cuda11.8-py3 -DBUILD_PYTHON=True -DBUILD_TEST=True -DBUILD_TYPE=release -DBUILD_WHEEL=1 -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_COMPILER=C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.8/bin/nvcc.exe -DCMAKE_CUDA_COMPILER_LAUNCHER=C:/actions-runner/_work/pytorch/pytorch/build/win_tmp/bin/randomtemp.exe;C:/actions-runner/_work/pytorch/pytorch/build/win_tmp\bin\sccache.exe -DCMAKE_CXX_COMPILER_LAUNCHER=sccache -DCMAKE_C_COMPILER_LAUNCHER=sccache -DCMAKE_GENERATOR=Ninja -DCMAKE_INSTALL_PREFIX=C:\actions-runner\_work\pytorch\pytorch\torch -DCMAKE_PREFIX_PATH=C:\Jenkins\Miniconda3\Lib\site-packages -DCUDA_NVCC_EXECUTABLE=C:/actions-runner/_work/pytorch/pytorch/build/win_tmp/bin/nvcc.bat -DCUDNN_LIBRARY=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\lib\x64 -DNUMPY_INCLUDE_DIR=C:\Jenkins\Miniconda3\lib\site-packages\numpy\core\include -DPYTHON_EXECUTABLE=C:\Jenkins\Miniconda3\python.exe -DPYTHON_INCLUDE_DIR=C:\Jenkins\Miniconda3\Include -DPYTHON_LIBRARY=C:\Jenkins\Miniconda3/libs/python39.lib -DTORCH_BUILD_VERSION=2.4.0a0+git3171c11 -DTORCH_CUDA_ARCH_LIST=8.6 -DUSE_CUDA=1 -DUSE_NUMPY=True C:\actions-runner\_work\pytorch\pytorch
cmake --build . --target install --config Release -- -j 8

(base) C:\actions-runner\_work\pytorch\pytorch>if errorlevel 1 goto fail

(base) C:\actions-runner\_work\pytorch\pytorch>exit /b 1
Error: Process completed with exit code 1.
```

vs original
https://github.com/pytorch/pytorch/actions/runs/8910674030/job/24470387612
```
2 errors detected in the compilation of "C:/actions-runner/_work/pytorch/pytorch/aten/src/ATen/native/cuda/AdaptiveMaxPooling3d.cu".
AdaptiveMaxPooling3d.cu
[7604/8420] Linking CXX shared library bin\torch_cpu.dll
ninja: build stopped: subcommand failed.
-- Building version 2.4.0a0+gite09f98c
cmake -GNinja -DBUILD_ENVIRONMENT=win-vs2019-cuda11.8-py3 -DBUILD_PYTHON=True -DBUILD_TEST=True -DBUILD_TYPE=release -DBUILD_WHEEL=1 -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_COMPILER=C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.8/bin/nvcc.exe -DCMAKE_CUDA_COMPILER_LAUNCHER=C:/actions-runner/_work/pytorch/pytorch/build/win_tmp/bin/randomtemp.exe;C:/actions-runner/_work/pytorch/pytorch/build/win_tmp\bin\sccache.exe -DCMAKE_CXX_COMPILER_LAUNCHER=sccache -DCMAKE_C_COMPILER_LAUNCHER=sccache -DCMAKE_GENERATOR=Ninja -DCMAKE_INSTALL_PREFIX=C:\actions-runner\_work\pytorch\pytorch\torch -DCMAKE_PREFIX_PATH=C:\Jenkins\Miniconda3\Lib\site-packages -DCUDA_NVCC_EXECUTABLE=C:/actions-runner/_work/pytorch/pytorch/build/win_tmp/bin/nvcc.bat -DCUDNN_LIBRARY=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\lib\x64 -DNUMPY_INCLUDE_DIR=C:\Jenkins\Miniconda3\lib\site-packages\numpy\core\include -DPYTHON_EXECUTABLE=C:\Jenkins\Miniconda3\python.exe -DPYTHON_INCLUDE_DIR=C:\Jenkins\Miniconda3\Include -DPYTHON_LIBRARY=C:\Jenkins\Miniconda3/libs/python39.lib -DTORCH_BUILD_VERSION=2.4.0a0+gite09f98c -DTORCH_CUDA_ARCH_LIST=8.6 -DUSE_CUDA=1 -DUSE_NUMPY=True C:\actions-runner\_work\pytorch\pytorch
cmake --build . --target install --config Release -- -j 8

(base) C:\actions-runner\_work\pytorch\pytorch>if errorlevel 1 exit /b
+ assert_git_not_dirty
+ [[ win-vs2019-cuda11.8-py3 != *rocm* ]]
+ [[ win-vs2019-cuda11.8-py3 != *xla* ]]
++ git status --porcelain
++ grep -v '?? third_party'
++ true
+ git_status=
+ [[ -n '' ]]
+ echo 'BUILD PASSED'
BUILD PASSED
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125306
Approved by: https://github.com/ZainRizvi, https://github.com/huydhn, https://github.com/atalman
2024-05-01 22:06:47 +00:00
599a2e25f1 Reland "make sure dynamo doesn't inline DTensor __new__ or __torch_dispatch__ (#123347)" (#125288)
Re-land of https://github.com/pytorch/pytorch/pull/123347.

The original PR broke internal builds because of a circular import caused by importing dynamo in the DTensor code. The new version uses `torch._dynamo_disable` to work around this.

This reverts commit 9d88339b535f57cd0e2926c9ac4c2542e4490aac.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125288
Approved by: https://github.com/ezyang, https://github.com/yanboliang, https://github.com/yoyoyocmu, https://github.com/anijain2305, https://github.com/fegin
ghstack dependencies: #124398, #124399, #124400
2024-05-01 21:56:01 +00:00
9e9ba61fde AOTAutograd: force tangents to be contiguous when subclass inner tensor is noncontiguous (#124400)
Fixes https://github.com/pytorch/pytorch/issues/124397

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124400
Approved by: https://github.com/ezyang, https://github.com/yoyoyocmu
ghstack dependencies: #124398, #124399
2024-05-01 21:56:01 +00:00
5173cbe260 fix FakeTensor creation on noncontiguous subclasses (#124399)
Fixes https://github.com/pytorch/pytorch/issues/125287

Fixes https://github.com/pytorch/pytorch/issues/124090, context on the issue

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124399
Approved by: https://github.com/soulitzer
ghstack dependencies: #124398
2024-05-01 21:56:01 +00:00
7058563078 support as_python_constant on PlacementClassVariable (#124398)
Fixes an error for torchtitan + internal

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124398
Approved by: https://github.com/ezyang, https://github.com/wanchaol, https://github.com/yoyoyocmu
2024-05-01 21:56:01 +00:00
2d794bcb8a Delete NegateSource handling, I think it's dead (#125311)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125311
Approved by: https://github.com/Skylion007
2024-05-01 21:36:50 +00:00
746da8755c switch tests from constrain_as* to torch._check* (#125253)
To fix data-dependent errors we want to recommend that people use `torch._check*` APIs. The `constrain_as*` APIs should be fully subsumed by them, and in the future we should kill them entirely.
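
A short sketch of the recommended replacements:
```python
import torch

def take_prefix(x: torch.Tensor, idx: torch.Tensor) -> torch.Tensor:
    n = idx.item()
    torch._check_is_size(n)        # instead of constrain_as_size(n)
    torch._check(n <= x.size(0))   # instead of a constrain_as_value range
    return x[:n]
```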

Differential Revision: D56774333

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125253
Approved by: https://github.com/ezyang
2024-05-01 21:01:27 +00:00
dbcf123105 Upgrade submodule oneDNN to v3.4 (#122472)
## Improvements
This upgrade fixes the following issues:
- https://github.com/pytorch/pytorch/issues/120982

This upgrade brings the following new features:
- Introduced memory descriptor serialization API. This API is needed to support freezing on CPU in AOTInductor (https://github.com/pytorch/pytorch/issues/114450)

## Validation results on CPU
No regression was found.

1. NLP models accuracy/inference/training

Model Name | Mode | Precision | New | Baseline | New/Baseline
-- | -- | -- | -- | -- | --
bert-large | accuracy | fp32 | 93.15325 | 93.15325 | 100.00%
bert-large | accuracy | bf16 | 93.20125 | 93.20125 | 100.00%
bert-large | accuracy | int8 | 92.66641 | 92.66641 | 100.00%
LCM | accuracy | fp32 | 44.11152 | 44.11154 | 100.00%
LCM | accuracy | bf16 | 43.57667 | 43.65096 | 100.17%
ViT | accuracy | fp32 | 0.8033 | 0.8033 | 100.00%
ViT | accuracy | bf16 | 0.8031 | 0.8031 | 100.00%
ViT | accuracy | int8 | 0.7985 | 0.7985 | 100.00%
yolov7 | accuracy | fp32 | 0.512 | 0.512 | 100.00%
yolov7 | accuracy | bf16 | 0.504 | 0.504 | 100.00%
yolov7 | accuracy | int8 | 0.507 | 0.507 | 100.00%
bert-large | realtime | fp32 | 37.433 | 39.136 | 95.65%
bert-large | realtime | bf16 | 166.592 | 160.134 | 104.03%
bert-large | realtime | int8 | 230.876 | 222.594 | 103.72%
ViT | realtime | fp32 | 288.19 | 282.05 | 102.18%
ViT | realtime | bf16 | 755.42 | 741.1 | 101.93%
ViT | realtime | int8 | 1060.94 | 1092.47 | 97.11%
yolov7 | realtime | fp32 | 17.06927 | 16.47995 | 103.58%
yolov7 | realtime | bf16 | 54.68561 | 54.00723 | 101.26%
yolov7 | realtime | int8 | 78.38271 | 77.63214 | 100.97%
bert-large | throughput | fp32 | 47.142 | 47.341 | 99.58%
bert-large | throughput | bf16 | 200.365 | 200.806 | 99.78%
bert-large | throughput | int8 | 144.999 | 145.295 | 99.80%
LCM | throughput | fp32 | 0.54913 | 0.54897 | 100.03%
LCM | throughput | bf16 | 1.062417 | 1.07772 | 98.58%
stable-diffusion | throughput | fp32 | 0.03301 | 0.0331 | 99.73%
stable-diffusion | throughput | bf16 | 0.08773 | 0.08849 | 99.14%
stable-diffusion | throughput | int8 | 0.0491 | 0.05024 | 97.73%
ViT | throughput | fp32 | 342.55 | 346.47 | 98.87%
ViT | throughput | bf16 | 1263.4 | 1268.32 | 99.61%
ViT | throughput | int8 | 1331.3 | 1345.32 | 98.96%
yolov7 | throughput | fp32 | 115.313 | 115.612 | 99.74%
yolov7 | throughput | bf16 | 323.364 | 323.747 | 99.88%
yolov7 | throughput | int8 | 388.137 | 384.236 | 101.02%
bert-large | train_phase1 | fp32 | 34.223 | 34.309 | 99.75%
bert-large | train_phase1 | bf16 | 90.372 | 88.453 | 102.17%
bert-large | train_phase2 | fp32 | 7.307 | 7.318 | 99.85%

Data Type | Geomean
-- | --
fp32 | 99.88%
bf16 | 100.70%
int8 | 99.88%
all | 100.16%

2. Torchbench cpu userbenchmark inference & training

Test suite | Geomean Ratio (New/baseline)
-- | --
eager_throughtput_bf16_infer | 1.00x
eager_throughtput_fp32_infer | 1.00x
jit_llga_throughtput_amp_bf16 | 0.99x
jit_llga_throughtput_fp32 | 1.01x
eager_throughtput_fx_int8 | 1.00x
eager_throughtput_bf16_train | 1.00x
eager_throughtput_fp32_train | 1.00x

3. Inductor quantization (static & dynamic) accuracy & performance

Config | Performance geomean ratio (New/baseline) | Accuracy ratio (New/baseline)
-- | -- | --
Static quant PTQ | 0.99x | 1.00x
Static quant PTQ_CPP_WRAPPER | 0.98x | 1.00x
Static quant QAT | 0.99x | 1.00x
Dynamic quant PTQ | 1.00x | 1.00x

4. Dynamo benchmarks

Precision | Shape | Wrapper | Thread | Eager (ratio old/new geomean) | Inductor (ratio old/new geomean)
-- | -- | -- | -- | -- | --
Float32 | Static | Default | Multiple | 0.998776 | 1.002091
Float32 | Static | Default | Single | 1.014086 | 1.01054
Float32 | Dynamic | Default | Multiple | 1.00386 | 1.005975
Float32 | Dynamic | Default | Single | 1.011036 | 1.008317
AMP | Static | Default | Multiple | 0.996965 | 1.005117
AMP | Static | Default | Single | 1.00092 | 0.995666
AMP | Dynamic | Default | Multiple | 0.9959 | 0.995048
AMP | Dynamic | Default | Single | 1.002569 | 0.994085

---

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122472
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/atalman
2024-05-01 20:59:17 +00:00
c99617706e Add lintrunner as dev dependency (#125304)
As per title. We expect people to use it before pushing any PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125304
Approved by: https://github.com/Skylion007
2024-05-01 20:08:03 +00:00
197612c84c ProcessGroupWrapper support custom backend (#124447)
Fixes #ISSUE_NUMBER
In the current code, ProcessGroupWrapper works only for `GLOO, NCCL, UCC` when `TORCH_DISTRIBUTED_DEBUG=DETAIL`.
Reading the ProcessGroupWrapper code, I found that a communication op in ProcessGroupWrapper is just the communication op of the original backend plus runCollectiveChecks from gloo, e.g. allreduce:
82e0153487/torch/csrc/distributed/c10d/ProcessGroupWrapper.cpp (L406-L411)

`runCollectiveChecks` is used to compute a collective fingerprint for the tensors and to run gloo's `monitoredBarrier`.
82e0153487/torch/csrc/distributed/c10d/ProcessGroupWrapper.cpp (L586-L590)
I don't know why ProcessGroupWrapper doesn't work for all backends, but I think custom backends can support it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124447
Approved by: https://github.com/kwen2501
2024-05-01 19:59:55 +00:00
b4ccc615cd Do exact type match on int so we don't pick up bool here too (#125305)
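A quick illustration of why the exact type match matters:
```python
x = True
print(isinstance(x, int))  # True  -- bool is a subclass of int, so bools slip through
print(type(x) is int)      # False -- the exact type check used here excludes bool
```
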
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125305
Approved by: https://github.com/Skylion007
2024-05-01 19:46:36 +00:00
a216d87c6b [export] Fix for unflattening modules with duplicate tensors (#125192)
In the given test case, we have a ModuleList of 3 modules (`norm.0`, `norm.1`, `norm.2`) which share the same `weight` and `bias` tensors. However, when we trace, they all end up pointing to one state-dict name (e.g. `norm.2`).
```
graph():
    %p_norms_0_weight : [num_users=0] = placeholder[target=p_norms_0_weight]
    %p_norms_0_bias : [num_users=0] = placeholder[target=p_norms_0_bias]
    %p_norms_1_weight : [num_users=0] = placeholder[target=p_norms_1_weight]
    %p_norms_1_bias : [num_users=0] = placeholder[target=p_norms_1_bias]
    %p_norms_2_weight : [num_users=3] = placeholder[target=p_norms_2_weight]
    %p_norms_2_bias : [num_users=3] = placeholder[target=p_norms_2_bias]
    %input_ : [num_users=1] = placeholder[target=input_]
    %native_layer_norm : [num_users=1] = call_function[target=torch.ops.aten.native_layer_norm.default](args = (%input_, [2, 2, 3], %p_norms_2_weight, %p_norms_2_bias, 1e-05), kwargs = {})
    %getitem : [num_users=1] = call_function[target=operator.getitem](args = (%native_layer_norm, 0), kwargs = {})
    %native_layer_norm_1 : [num_users=1] = call_function[target=torch.ops.aten.native_layer_norm.default](args = (%getitem, [2, 2, 3], %p_norms_2_weight, %p_norms_2_bias, 1e-05), kwargs = {})
    %getitem_3 : [num_users=1] = call_function[target=operator.getitem](args = (%native_layer_norm_1, 0), kwargs = {})
    %native_layer_norm_2 : [num_users=1] = call_function[target=torch.ops.aten.native_layer_norm.default](args = (%getitem_3, [2, 2, 3], %p_norms_2_weight, %p_norms_2_bias, 1e-05), kwargs = {})
    %getitem_6 : [num_users=1] = call_function[target=operator.getitem](args = (%native_layer_norm_2, 0), kwargs = {})
    return (getitem_6,)
```
This causes an error in the unflattener: after constructing the submodule for `norm.0`, its graph still points to `norm.2.weight` and `norm.2.bias`:
```
graph():
    %p_norms_2_bias : [num_users=1] = placeholder[target=p_norms_2_bias]
    %p_norms_2_weight : [num_users=1] = placeholder[target=p_norms_2_weight]
    %input_ : [num_users=1] = placeholder[target=input_]
    %native_layer_norm : [num_users=1] = call_function[target=torch.ops.aten.native_layer_norm.default](args = (%input_, [2, 2, 3], %p_norms_2_weight, %p_norms_2_bias, 1e-05), kwargs = {})
    %getitem : [num_users=1] = call_function[target=operator.getitem](args = (%native_layer_norm, 0), kwargs = {})
    return getitem
```
Since the attributes are not within the same scope of the graph (`norm.0` vs. `norm.2`), they will not be added to the subgraph, causing an error.

So this PR handles the duplicate state dict attributes by modifying the `inputs_to_state` dict to map from node names to a list of possible state dict target names.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125192
Approved by: https://github.com/zhxchen17
2024-05-01 19:12:50 +00:00
af67704dcc [privateuse1] _refs.masked_fill support privateuse1 when value.device.type is cpu (#124835)
_refs.masked_fill supports privateuse1 when value.device.type is cpu.

1. Maybe I should consider whether this modification meets the expectations of other privateuse1 devices.
2. Add a TestCase.

Fixes #124693

Co-authored-by: albanD <desmaison.alban@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124835
Approved by: https://github.com/albanD
2024-05-01 18:57:14 +00:00
07422fd0b9 add missing space to first cmake append (#125294)
The first append not having a space incorrectly merges it with any previous argument, like `-allow-unsupported-compiler` in my case, which results in a silly error: `unrecognized command-line option '-allow-unsupported-compiler-DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS'`

full log:
```
python setup.py develop
Building wheel torch-2.4.0a0+git75fa54a
-- Building version 2.4.0a0+git75fa54a
cmake3 -GNinja -DBUILD_PYTHON=True -DBUILD_TEST=True -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/code/pytorch/torch -DCMAKE_PREFIX_PATH=/code/pytorch/.venv/lib/python3.12/site-packages;/code/spack/opt/spack/linux-fedora40-zen2/gcc-14.0.1/gcc-13.2.0-noa2f4oqalxzqvsebhuntndewgt4gq4h:/code/spack/opt/spack/linux-fedora40-zen2/gcc-14.0.1/zstd-1.5.6-z3guwm4l5rmmsv4g4wvkej3ri3bppeja:/code/spack/opt/spack/linux-fedora40-zen2/gcc-14.0.1/zlib-ng-2.1.6-kwi4ljobodjgv5eetnga4bow6crdlacl:/code/spack/opt/spack/linux-fedora40-zen2/gcc-14.0.1/mpc-1.3.1-nuwa2snyzm265lsupa2dkmxxyhiqcv7e:/code/spack/opt/spack/linux-fedora40-zen2/gcc-14.0.1/mpfr-4.2.1-wepuwobwttxbtz3nguimxa2mlljjozsi:/code/spack/opt/spack/linux-fedora40-zen2/gcc-14.0.1/gmp-6.2.1-ashy6kiitonxv2f365f4q3beggzf3646:/code/spack/opt/spack/linux-fedora40-zen2/gcc-14.0.1/gcc-runtime-14.0.1-wmogkqrzn7t57dogaake2hmhjbod27gs -DNUMPY_INCLUDE_DIR=/code/pytorch/.venv/lib64/python3.12/site-packages/numpy/core/include -DPYTHON_EXECUTABLE=/code/pytorch/.venv/bin/python -DPYTHON_INCLUDE_DIR=/usr/include/python3.12 -DPYTHON_LIBRARY=/usr/lib64/libpython3.12.so.1.0 -DTORCH_BUILD_VERSION=2.4.0a0+git75fa54a -DUSE_NUMPY=True /code/pytorch
-- /usr/lib64/ccache/c++ /code/pytorch/torch/abi-check.cpp -o /code/pytorch/build/abi-check
-- Determined _GLIBCXX_USE_CXX11_ABI=1
-- Current compiler supports avx2 extension. Will build perfkernels.
-- Current compiler supports avx512f extension. Will build fbgemm.
-- The CUDA compiler identification is NVIDIA 12.4.131
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - failed
-- Check for working CUDA compiler: /usr/local/cuda-12/bin/nvcc
-- Check for working CUDA compiler: /usr/local/cuda-12/bin/nvcc - broken
CMake Error at /usr/share/cmake/Modules/CMakeTestCUDACompiler.cmake:59 (message):
  The CUDA compiler

    "/usr/local/cuda-12/bin/nvcc"

  is not able to compile a simple test program.
  It fails with the following output:
    Change Dir: '/code/pytorch/build/CMakeFiles/CMakeScratch/TryCompile-mSGoFl'

    Run Build Command(s): /code/pytorch/.venv/bin/ninja -v cmTC_ee207
    [1/2] /usr/local/cuda-12/bin/nvcc -forward-unknown-to-host-compiler   -allow-unsupported-compiler-DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all  "--generate-code=arch=compute_52,code=[compute_52,sm_52]" -MD -MT CMakeFiles/cmTC_ee207.dir/main.cu.o -MF CMakeFiles/cmTC_ee207.dir/main.cu.o.d -x cu -c /code/pytorch/build/CMakeFiles/CMakeScratch/TryCompile-mSGoFl/main.cu -o CMakeFiles/cmTC_ee207.dir/main.cu.o
    FAILED: CMakeFiles/cmTC_ee207.dir/main.cu.o
    /usr/local/cuda-12/bin/nvcc -forward-unknown-to-host-compiler   -allow-unsupported-compiler-DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all  "--generate-code=arch=compute_52,code=[compute_52,sm_52]" -MD -MT CMakeFiles/cmTC_ee207.dir/main.cu.o -MF CMakeFiles/cmTC_ee207.dir/main.cu.o.d -x cu -c /code/pytorch/build/CMakeFiles/CMakeScratch/TryCompile-mSGoFl/main.cu -o CMakeFiles/cmTC_ee207.dir/main.cu.o
    gcc: error: unrecognized command-line option '-allow-unsupported-compiler-DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS'
    ninja: build stopped: subcommand failed.

  CMake will not be able to correctly generate this project.
Call Stack (most recent call first):
  cmake/public/cuda.cmake:47 (enable_language)
  cmake/Dependencies.cmake:44 (include)
  CMakeLists.txt:758 (include)

-- Configuring incomplete, errors occurred!
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125294
Approved by: https://github.com/albanD
2024-05-01 18:35:54 +00:00
bf6acf9add [ROCm] Add extra cuda_to_hip_mappings.py (#125108)
Adding extra mappings discovered when hipifying the backward CUDA kernel of the Mamba model (https://github.com/state-spaces/mamba/).

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125108
Approved by: https://github.com/Skylion007, https://github.com/jeffdaily
2024-05-01 18:31:02 +00:00
c8d2a55273 Intel GPU: specify the tolerance for torchbench models (#125213)
We encountered some model accuracy failures because the tolerance is critical here. In general, we follow CUDA practice, so this PR adjusts the tolerance for TorchBench models in training mode on Intel GPU devices to align with CUDA.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125213
Approved by: https://github.com/desertfire
2024-05-01 17:45:15 +00:00
e3627d05e7 [CMake] Add NVPL BLAS/LAPACK option (#125268)
This PR adds an [NVPL](https://docs.nvidia.com/nvpl/introduction.html) BLAS/LAPACK option to CMake for `aarch64` (ARM) machines.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125268
Approved by: https://github.com/albanD
2024-05-01 17:26:28 +00:00
39eb5d4fa4 Add Sanity Testing to Pytorch Profiler (#124773)
Summary: In recent weeks, we have encountered bugs in both the normal synchronous trace and on-demand tracing. This diff adds sanity checks that the profiler does not produce spans extending past the boundaries we expect, and it verifies some basic properties of the traces to make sure they are not completely broken. Requests/suggestions for other properties are welcome.

Test Plan: Run the tests in OSS and Buck

Reviewed By: aaronenyeshi

Differential Revision: D56374298

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124773
Approved by: https://github.com/aaronenyeshi
2024-05-01 16:59:35 +00:00
4d410155b2 Revert "Include support for the scatter gather cuda kernels to allow for comp… (#124809)"
This reverts commit e09f98c705e4851414cd8ddf21949177af2b13aa.

Reverted https://github.com/pytorch/pytorch/pull/124809 on behalf of https://github.com/clee2000 due to the windows build failure being real; https://github.com/pytorch/pytorch/actions/runs/8910674030/job/24470387612#step:11:11236 is the correct failure line. Ignore the statement saying the build passed; batch error codes aren't propagating again ([comment](https://github.com/pytorch/pytorch/pull/124809#issuecomment-2088680371))
2024-05-01 16:02:02 +00:00
e16f1ee4cc [ez][CI] Move test_modules and test_schema_check off CI_SERIAL_LIST (#125193)
* Related https://github.com/pytorch/pytorch/pull/124085

As in title, move test_modules and test_schema_check off CI_SERIAL_LIST
If things fail, they can get the serialTest decorator instead
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125193
Approved by: https://github.com/huydhn
2024-05-01 15:48:48 +00:00
8fde9a988c CI: Extending unit test coverage for aarch64 linux (#125255)
Adding core, dynamo and inductor unit tests for aarch64 linux CI runs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125255
Approved by: https://github.com/malfet, https://github.com/atalman
2024-05-01 15:37:52 +00:00
b094622bc9 Refactoring to remove unused variable (#125252)
Summary: Removed unused variable for running encoder

Test Plan: buck test //caffe2/test:transformers

Differential Revision: D56771972

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125252
Approved by: https://github.com/drisspg
2024-05-01 15:17:45 +00:00
e09f98c705 Include support for the scatter gather cuda kernels to allow for comp… (#124809)
Fixes #121965

This PR hopes to add support for complex numbers in the scatter/gather-related kernels. For brevity, I will only include `complex<float>` for now, as `complex<double>`, for example, will be more complicated.

C++ unit tests are currently passing alongside tests in `test_scatter_gather_ops.py`. Python test suites also seem to be passing.

Please keep the following in mind:
1) I think this is my first time using Pytorch.
2) This is my first contribution to Pytorch.

Environment:
3080 & WSL 2. `nvcc` is at 12.4.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124809
Approved by: https://github.com/eqy, https://github.com/mikaylagawarecki
2024-05-01 14:31:31 +00:00
e421f1b4a8 docs: torch.nn.utils.rnn: docs improve (#123559)
docs: `torch.nn.utils.rnn`: docs improve
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123559
Approved by: https://github.com/mikaylagawarecki
2024-05-01 14:27:37 +00:00
a2715144c3 Add NEON-accelerated int8mm for bfloat16 (#125290)
As apparently `vshlq_u32` is faster than `vcvt_f32_f16`

Refactor NEON `tinygemm_kernel` to rely on `load_as_float32x4` and `load_as_float32x4x2` and implement them for float16 (using vcvt), bfloat16 (using left shift) and plain float32 (not using anything)

As a result, stories110M runs at 60 tokens/sec with f16, but at 66 tokens/sec with bf16 and 75 tokens/sec with f32, though higher bandwidth demand starts to favor reduced floating-point types as model size gets bigger.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125290
Approved by: https://github.com/mikekgfb
2024-05-01 14:04:49 +00:00
9fbb4dfc12 Fix AttributeError when doing mock patch for FileTimerServerTest.test_expired_timers (#125144)
Fix the patch failure; we should patch the function where it is used, not where it is defined.
Failure info:
```bash
root@cambricon-PowerEdge-C4140:/workspace# python file_based_timer_test.py -k test_expired_timers
/opt/conda/lib/python3.10/site-packages/torch/_custom_ops.py:253: DeprecationWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
  return torch.library.impl_abstract(qualname, func, _stacklevel=2)
E
======================================================================
ERROR: test_expired_timers (__main__.FileTimerServerTest)
tests that a single expired timer on a process should terminate
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2757, in wrapper
    method(*args, **kwargs)
  File "/opt/conda/lib/python3.10/unittest/mock.py", line 1376, in patched
    with self.decoration_helper(patched,
  File "/opt/conda/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/opt/conda/lib/python3.10/unittest/mock.py", line 1358, in decoration_helper
    arg = exit_stack.enter_context(patching)
  File "/opt/conda/lib/python3.10/contextlib.py", line 492, in enter_context
    result = _cm_type.__enter__(cm)
  File "/opt/conda/lib/python3.10/unittest/mock.py", line 1447, in __enter__
    original, local = self.get_original()
  File "/opt/conda/lib/python3.10/unittest/mock.py", line 1420, in get_original
    raise AttributeError(
AttributeError: <module 'torch.distributed.elastic.timer' from '/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/timer/__init__.py'> does not have the attribute 'log_debug_info_for_expired_timers'

To execute this test, run the following from the base repo dir:
     python file_based_timer_test.py -k test_expired_timers

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

----------------------------------------------------------------------
Ran 1 test in 0.792s

FAILED (errors=1)
```
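The rule at play here (patch the name where it is looked up, not where it is defined) can be illustrated with a small, self-contained sketch using hypothetical module names, not the actual elastic timer code:
```python
import sys
import types
from unittest import mock

# Two tiny stand-in modules: `util` defines a helper, `worker` imports it
# by name at import time (i.e. `from util import log_debug_info`).
util = types.ModuleType("util")
util.log_debug_info = lambda msg: f"real:{msg}"

worker = types.ModuleType("worker")
worker.log_debug_info = util.log_debug_info
worker.run = lambda: worker.log_debug_info("expired")

sys.modules["util"] = util
sys.modules["worker"] = worker

# Patching where the helper is *defined* does not affect the caller,
# because `worker` holds its own binding to the original function.
with mock.patch("util.log_debug_info", return_value="fake"):
    assert worker.run() == "real:expired"

# Patching where the helper is *used* works as intended.
with mock.patch("worker.log_debug_info", return_value="fake"):
    assert worker.run() == "fake"
```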

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125144
Approved by: https://github.com/gag1jain
2024-05-01 12:08:04 +00:00
47ba7a76e2 [ATen][CUDA][AMP] Fix dtype mismatch in linalg_vector_norm (#125175)
Fixes #125174

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125175
Approved by: https://github.com/eqy, https://github.com/lezcano
2024-05-01 10:57:12 +00:00
c59cce38a9 [MacOS][CPUInductor] Fix includes to system Python (#125285)
On MacOS 14.4, system Python is configured to point to a non-existing include dir
```
% /usr/bin/python3 -c "import sysconfig;print(sysconfig.get_path('include'))"
/Library/Python/3.9/include
```

Work around the issue by composing the path to the include folder from the `stdlib` config, which points to
```
% /usr/bin/python3 -c "import sysconfig;print(sysconfig.get_path('stdlib'))"
/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9
```
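A minimal sketch of that workaround in plain Python (an assumed shape for illustration, not the actual Inductor change):
```python
import os
import sysconfig

def python_include_dir() -> str:
    """Prefer sysconfig's answer; if it points at a directory that does not
    exist (as on stock macOS 14.4), derive the include dir from the stdlib
    path instead. Assumption: 'include/pythonX.Y' sits next to 'lib/pythonX.Y'
    under the framework prefix."""
    include = sysconfig.get_path("include")
    if include and os.path.isdir(include):
        return include
    stdlib = sysconfig.get_path("stdlib")  # .../Versions/3.9/lib/python3.9
    prefix = os.path.dirname(os.path.dirname(stdlib))
    return os.path.join(prefix, "include", f"python{sysconfig.get_python_version()}")

print(python_include_dir())
```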

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125285
Approved by: https://github.com/kit1980
2024-05-01 10:39:13 +00:00
52142192d4 [pipelining] Add stage backward function (#124958)
This is a helper function which:
1. computes the gradients for the stage inputs, and
2. accumulates gradients for the stage module's parameters.

A unit test for this function is also added.
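A rough sketch of what such a helper does, assuming the stage's inputs are leaves of the stage-local autograd graph (hypothetical name and signature, not the actual pipelining API):
```python
from typing import List, Optional

import torch

def stage_backward(
    stage_outputs: List[torch.Tensor],
    output_grads: List[torch.Tensor],
    stage_inputs: List[torch.Tensor],
) -> List[Optional[torch.Tensor]]:
    # Running backward on the stage outputs (1) populates .grad on the stage
    # inputs, which are leaves of this stage's local graph, and (2) accumulates
    # .grad on the stage module's parameters as a side effect.
    torch.autograd.backward(stage_outputs, grad_tensors=output_grads)
    return [t.grad if t.requires_grad else None for t in stage_inputs]

# Toy usage: one "stage" made of a single linear layer.
lin = torch.nn.Linear(4, 4)
x = torch.randn(2, 4, requires_grad=True)  # stands in for activations from the previous stage
out = lin(x)
input_grads = stage_backward([out], [torch.ones_like(out)], [x])
print(input_grads[0].shape, lin.weight.grad.shape)
```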

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124958
Approved by: https://github.com/wconstab
ghstack dependencies: #124776, #124875
2024-05-01 07:56:58 +00:00
aead440c62 [Inductor] Further tune block size for templated attention on H100 (#125286)
Run a script to enumerate and get the best default block size for templated attention.

A100 -> no change, check numbers at #125139
H100
## torch.bfloat16

Before:
```
| Type    |   Speedup |   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod     | dtype          |
|---------|-----------|--------------|-------------|-------------|-------------|------------|---------------|----------------|
| Average |     1.103 |              |             |             |             |            |               |                |
| Max     |     1.322 |            8 |          16 |         512 |         512 |         64 | noop          | torch.bfloat16 |
| Min     |     0.829 |            1 |          16 |        1024 |        1024 |        128 | relative_bias | torch.bfloat16 |

```
After:
```
| Type    |   Speedup |   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod     | dtype          |
|---------|-----------|--------------|-------------|-------------|-------------|------------|---------------|----------------|
| Average |     1.137 |              |             |             |             |            |               |                |
| Max     |     1.442 |            1 |          16 |         512 |         512 |        128 | relative_bias | torch.bfloat16 |
| Min     |     0.913 |            1 |          16 |        1024 |        1024 |         64 | head_bias     | torch.bfloat16 |
```

## torch.float32
Before:
```
| Type    |   Speedup |   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod     | dtype         |
|---------|-----------|--------------|-------------|-------------|-------------|------------|---------------|---------------|
| Average |     2.269 |              |             |             |             |            |               |               |
| Max     |     3.740 |           16 |          16 |        1024 |        1024 |         64 | noop          | torch.float32 |
| Min     |     0.761 |            1 |          16 |         512 |         512 |        128 | relative_bias | torch.float32 |
```
After:
```
| Type    |   Speedup |   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod   | dtype         |
|---------|-----------|--------------|-------------|-------------|-------------|------------|-------------|---------------|
| Average |     2.489 |              |             |             |             |            |             |               |
| Max     |     3.755 |           16 |          16 |        4096 |        4096 |         64 | noop        | torch.float32 |
| Min     |     1.609 |            1 |          16 |         512 |         512 |         64 | head_bias   | torch.float32 |
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125286
Approved by: https://github.com/Chillee
2024-05-01 07:34:08 +00:00
c511aed27f [Meta Tensor] fix meta inplace set storage (#123880)
Fixes #123879

Co-authored-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123880
Approved by: https://github.com/ezyang
2024-05-01 06:53:49 +00:00
c3c4465f50 Add has_guarded_code to CompilationMetrics (#125279)
While studying some tlparse output, I noticed that CompilationMetrics was reporting that there was no error for frames that have no nodes. I'm pretty sure we don't actually install a frame in this situation. has_guarded_code will tell us if that's the case, because it indicates whether the GuardedCode object is None.

Actually, while working on this, I was wondering if we can ever trigger the "skip this frame entirely, do not trace it ever again" codepath; as best I could tell, it's impossible for this to happen by the time we get to the compilation metrics block.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125279
Approved by: https://github.com/yanboliang
2024-05-01 06:12:05 +00:00
081f41a920 Use BFloat16 in distributed quantization when supported by NCCL (#125113)
This PR enables BFloat16 in torch/csrc/distributed/c10d/quantization/quantization_gpu.cu.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125113
Approved by: https://github.com/kwen2501
2024-05-01 05:43:35 +00:00
14857e71c2 Export torch.jit.interface from torch.jit package (#125209)
Seems like this symbol was overlooked when other symbols were exported from `torch.jit`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125209
Approved by: https://github.com/ezyang
2024-05-01 05:38:05 +00:00
75a8e9ee77 [inductor] better cache clearing in fx graph cache tests (#125280)
Summary: There's a shortcoming in the FX graph cache tests in that they don't fully clear all inductor in-memory caches when testing the cache-hit path: we were previously accessing the FX graph cache correctly, but when loading the source object using the PyCodeCache.load_by_key_path() method, _that_ path was serving entries from memory. To better mimic what happens during warm start (i.e., a new process), we should clear all in-memory caches.

Test Plan: updated the unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125280
Approved by: https://github.com/eellison
2024-05-01 04:47:46 +00:00
787afc5180 Add LR as tensor tests (#123750)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123750
Approved by: https://github.com/janeyx99
2024-05-01 04:46:49 +00:00
1c905f1be3 [EZ][BE] Don't import pathlib twice (#125260)
It was imported once as `import pathlib` and a second time as `from pathlib import Path`.

Stick to the second flavor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125260
Approved by: https://github.com/kit1980
2024-05-01 04:08:16 +00:00
abaa717350 [FSDP2] Removed logic to save and remove pre-backward hook handles (#125269)
1. This PR removes the logic for saving and removing the pre-backward hook handles (which is registered via `register_multi_grad_hook(mode="any")`).
2. This PR removes the logic for _trying_ to guard against mistargeted prefetches that relies on querying if the engine will execute the module output tensors' `grad_fn`s. (See https://github.com/pytorch/pytorch/pull/118118 for original motivation.)

For 1, the logic was error prone since it relied on `set_is_last_backward(False)` being set correctly or else pre-backward hooks could be de-registered too early. We would prefer to match the hook lifetimes with that of the autograd graph. This solves a bug with a 1f1b interleaved schedule.

If we directly remove the manual saving/removing hook handle logic, then we have a ref cycle where the tensors' `grad_fn`s are passed to the hook function. We decide to simply remove this `grad_fn` logic since (1) it cannot perfectly prevent mistargeted prefetches and (2) it introduces undesired complexity. In the future, we may prefer a different mechanism to override the prefetching for more complex/dynamic use cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125269
Approved by: https://github.com/weifengpy
ghstack dependencies: #125190, #125191
2024-05-01 03:51:30 +00:00
37c993546d [dynamo][guards] Bug fix for set_export_info (#125275)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125275
Approved by: https://github.com/yanboliang
2024-05-01 03:46:26 +00:00
4d5f8070c4 add a decomposition for select_scatter (#124426)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124426
Approved by: https://github.com/peterbell10
2024-05-01 03:23:18 +00:00
e9ce23985f [TorchScript] attach target function to OSError when source can't be found (#125248)
Before, it would be hard to figure out which function/module in particular was causing the OSError. Now we'll try to print the function/module string.

Differential Revision: [D56768365](https://our.internmc.facebook.com/intern/diff/D56768365)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125248
Approved by: https://github.com/eellison
2024-05-01 03:18:55 +00:00
8f31988088 [C10D] Document 'tag' limitation for nccl send/recv (#125278)
Existing documentation on isend/irecv also applies to send/recv. This PR
copies the doc/warning to send/recv ops as well.

Note: tag may be supplied, but will be ignored when used with nccl
backend.
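For illustration, a two-rank sketch (it assumes a process group initialized with the NCCL backend, e.g. via torchrun, and one GPU per rank):
```python
import torch
import torch.distributed as dist

def exchange() -> None:
    rank = dist.get_rank()
    t = torch.ones(4, device=f"cuda:{rank}")
    if rank == 0:
        dist.send(t, dst=1, tag=7)   # tag is accepted...
    else:
        dist.recv(t, src=0, tag=42)  # ...but ignored by NCCL, so this still matches the send
```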

Fixes #94819 #125079

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125278
Approved by: https://github.com/kwen2501
2024-05-01 02:53:30 +00:00
74e8817311 [inductor] Minor fixes to various tests before enabling fx graph caching in OSS by default (#125258)
Summary: Discovered breakages by enabling codecache by default and doing a CI run. I'll commit these fixes first and eventually enabling caching by default will (hopefully) be a one-liner.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125258
Approved by: https://github.com/eellison
2024-05-01 02:34:01 +00:00
0506e95433 [dynamo] support inactive context managers across graph breaks (#125203)
Fix https://github.com/pytorch/pytorch/issues/124900.

When we reconstruct `ContextWrappingVariables`s, we only reconstruct the context class, not the object. Normally, contexts are active (via `with ctx:`) and we initialize the context object in the resume function. But for the case of inactive contexts (contexts declared ahead of time before the `with` block), we do not reconstruct them properly in the optimized bytecode or resume function. So this PR adds initialization for inactive contexts in the resume function.
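A minimal sketch of the inactive-context shape described above (an assumed repro, not the exact test from the PR):
```python
import torch
import torch._dynamo

def fn(x):
    ctx = torch.no_grad()        # context object created before the `with` block
    y = x + 1
    torch._dynamo.graph_break()  # force a graph break while `ctx` is still inactive
    with ctx:                    # the resume function must now initialize `ctx` properly
        return y * 2

print(torch.compile(fn)(torch.randn(3)))
```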

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125203
Approved by: https://github.com/jansel
2024-05-01 01:49:09 +00:00
1b9d353e4f [Torch] Add more mm kernel choices (#125000)
Differential Revision: D56616836

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125000
Approved by: https://github.com/htyu
2024-05-01 01:40:24 +00:00
a59dc14877 Keep node.meta when fusing subgraph (#125261)
Summary: When CapabilityBasedPartitioner creates the fused subgraph as a call_module node, it did not populate the node.meta["val"] field.

Test Plan: OSS CI

Differential Revision: D56789259

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125261
Approved by: https://github.com/zhxchen17
2024-05-01 01:38:28 +00:00
0ee5c14163 [PT2][Optimus] Read the patterns from the config instead of hard-code passes (#125136)
Summary: Due to the compatibility issue, we hard-coded the passes that do the pattern optimization. Here, we revisit that approach, since the changes have been in production packages for a while. We instead read from the config to decide whether to run a specific pattern optimization, which makes adding follow-up patterns easier.

Differential Revision: D56659934

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125136
Approved by: https://github.com/jackiexu1992
2024-05-01 01:35:30 +00:00
25691558d9 Change templated_attention -> flex_attention (#125251)
# Summary

Change all the names

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125251
Approved by: https://github.com/Chillee, https://github.com/yanboliang
2024-05-01 01:08:48 +00:00
a7023b89f8 Use torch._check for safety assert in _reshape_view_helper (#125187)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125187
Approved by: https://github.com/albanD
2024-05-01 00:40:31 +00:00
1bcbc9158f Add CUDA 12.4 workflows (#121684)
Reference: https://github.com/pytorch/pytorch/pull/98492

Co-authored-by: Andrey Talman <atalman@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121684
Approved by: https://github.com/atalman
2024-04-30 23:03:24 +00:00
d6c713884a [dynamo, 3.12] xfail refleaking tests due to buggy getattr_static (#125062)
For tracking https://github.com/pytorch/pytorch/issues/124302 so that we can re-enable the test once 3.12 updates with the bug fix for https://github.com/python/cpython/issues/118013.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125062
Approved by: https://github.com/anijain2305, https://github.com/jansel
2024-04-30 22:40:47 +00:00
c12c85e919 Revert "[benchmark][cudagraph] Explicitly call aten.div with CUDA denominator for cudagraphs (#119729)" (#125246)
This reverts commit 62b5738a8bf325d79468b839b8412b87cb9951c1.

https://github.com/pytorch/pytorch/pull/119729/ regresses cudagraph dashboard. Moving the one-time per iteration loss from CPU to CUDA is somehow causing a lot of copies:

current (top) vs with revert (bottom)
![image](https://github.com/pytorch/pytorch/assets/9547562/62dfbf66-7edc-4a3c-ba7f-1ec057fba950)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125246
Approved by: https://github.com/eellison
2024-04-30 22:39:53 +00:00
9fec26e231 Fix typo under torch/_inductor directory (#119658)
This PR fixes typos in comments and messages under the `torch/_inductor` directory, and also updates the corresponding test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119658
Approved by: https://github.com/colesbury
2024-04-30 22:28:56 +00:00
ca0f070065 Revert "Add registration API for torch.compile-eager (#121387)"
This reverts commit 61e937f3d6b904d6706594c1b3cfd7d0e56f9663.

Reverted https://github.com/pytorch/pytorch/pull/121387 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/121387#issuecomment-2087541956))
2024-04-30 22:13:04 +00:00
00dd4d55e3 Refactored _remove_auto_functionalization_from_graph_helper (#125180)
Summary:
Refactored the function to avoid repeated list slicing and to clean up an unused variable.

Test Plan:
python test/run_test.py

Reviewers: @drisspg

Subscribers:

Tasks: [T187526123](https://www.internalfb.com/intern/tasks/?t=187526123) [T93492332](https://www.internalfb.com/intern/tasks/?t=93492332)

Tags: @pytorchbot merge -r viable/strict
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125180
Approved by: https://github.com/drisspg
2024-04-30 21:44:07 +00:00
ea347fa6ce Revert "Fix & optimze open device registration test. (#124712)"
This reverts commit f03cf9d4dc8ebe85552f450678988cac4e959da3.

Reverted https://github.com/pytorch/pytorch/pull/124712 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/124712#issuecomment-2086971499))
2024-04-30 20:00:37 +00:00
c1a3fcfa47 [pipelining] Add util and debug facilities (#124875)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124875
Approved by: https://github.com/H-Huang
ghstack dependencies: #124776
2024-04-30 19:41:41 +00:00
75fa54a9d1 Revert "Convert ForeachFuncInfo to dataclass (#125001)"
This reverts commit 9466335ae4cb049efd3f4c2b32b2115ba00694f3.

Reverted https://github.com/pytorch/pytorch/pull/125001 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it is breaking on ROCm 9466335ae4 ([comment](https://github.com/pytorch/pytorch/pull/125001#issuecomment-2086640674))
2024-04-30 19:05:53 +00:00
56e4cbc69d Fixes two build problems on ROCM 6.1 + Ubuntu 22.04 (#118216)
Fixes two build problems on ROCM 6.1 + Ubuntu 22.04

### Inconsistency value of CMAKE_PREFIX_PATH between `.ci/pytorch/build.sh` and Build Instructions

Current `CMAKE_PREFIX_PATH` points to the conda base environment (commonly `/opt/conda`). However, the conda environment used in CI should be `/opt/conda/envs/py_<VERSION>`, which is supplied by `$CONDA_PREFIX`.

This divergence may cause libstdc++ version conflicts, because the base conda environment may ship a different libstdc++ than the `py_<VERSION>` environment and/or the system default environment. One notable issue: on our internal CI system, this script failed to build the AOTriton library on Ubuntu 22.04 due to libstdc++ version conflicts between the HIP compiler and the conda base environment.

This PR fixes this and makes sure the CI script follows the official build instructions.

### Incorrect `tinfo` was linked on Ubuntu 22.04 due to flaws in parsing of `os-release`

The code that parses /etc/os-release was incorrect and the distribution info was parsed as `PRETTY_Ubuntu` instead of `Ubuntu`. Due to this flaw, `libtinfo` was not linked into the binary, so the cpp unit tests failed to build because of missing symbols from `libtinfo`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118216
Approved by: https://github.com/jeffdaily, https://github.com/jithunnair-amd, https://github.com/pruthvistony, https://github.com/malfet, https://github.com/atalman
2024-04-30 18:58:48 +00:00
90258e8369 forward fix preferred blas backend and windows CI (#125080)
PR #122106 broke windows tests. The feature should have been disabled for Windows but was not disabled correctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125080
Approved by: https://github.com/clee2000
2024-04-30 18:38:31 +00:00
04a241947a [dtensor] delete the old unused mesh_alltoall (#124879)
As titled: since we have a dedicated comm op, this is not needed anymore.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124879
Approved by: https://github.com/XilunWu, https://github.com/wz337
ghstack dependencies: #124871, #124872
2024-04-30 18:30:34 +00:00
00df0d3e94 [dtensor] implement shard dim change with alltoall (#124872)
As titled: we implement a dedicated communication op to allow efficient
sharding-dimension changes using alltoall, replacing our previous
allgather + local chunk.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124872
Approved by: https://github.com/XilunWu, https://github.com/yifuwang
ghstack dependencies: #124871
2024-04-30 18:30:34 +00:00
02e7800b3f [Torch][Timer] Skip expired timer logging for empty expired timers (#125039)
Summary: same as title

Test Plan: unit tests

Differential Revision: D56636566

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125039
Approved by: https://github.com/kurman
2024-04-30 18:28:49 +00:00
3946fa1c12 Fix bug in get_update_constraint (#125194)
Summary: Title

Test Plan: CI

Differential Revision: D56726321

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125194
Approved by: https://github.com/pianpwk
2024-04-30 18:21:29 +00:00
07958c538c Setup initial testing harness and cache key generation for AOTAutograd Cache (#124642)
This doesn't introduce any new behavior, but sets up a basic cache key generation mechanism that I can test. From here I will:

- Add checks on the ops in an input FXGraph to make sure they are safe to cache. We'll be conservative in the first version here.
- Add serialization for FX graphs
- Save these FX graphs to disk in the cache
- Support graphs with more complicated ops like higher order ops and specialized nn modules
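A minimal, hypothetical sketch of what graph-based cache key generation looks like in general (not the actual AOTAutogradCache implementation): hash a textual form of the FX graph together with the configuration that affects compilation.
```python
import hashlib

import torch
import torch.fx as fx

def cache_key(gm: fx.GraphModule, config: dict) -> str:
    # Any difference in the graph or in the relevant config yields a new key.
    payload = str(gm.graph) + repr(sorted(config.items()))
    return hashlib.sha256(payload.encode()).hexdigest()

def f(x):
    return torch.sin(x) + 1

gm = fx.symbolic_trace(f)
print(cache_key(gm, {"dynamic_shapes": False}))  # "dynamic_shapes" is a made-up config knob
```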

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124642
Approved by: https://github.com/aorenste
2024-04-30 18:17:38 +00:00
8242fb62a7 [quant][pt2e] Fix conv-bn weight + bias per channel QAT (#125208)
Summary: This commit fixes the pattern matching for conv-bn
during QAT fusion where both weight and bias are quantized per
channel. Previously this failed because weights and biases used
the same example kwargs for their scales and zero points,
causing these qparams to be tied during pattern matching.

Test Plan:
python test/test_quantization.py TestQuantizePT2EQAT_ConvBn1d.test_qat_conv_bn_per_channel_weight_bias
python test/test_quantization.py TestQuantizePT2EQAT_ConvBn2d.test_qat_conv_bn_per_channel_weight_bias

Reviewers: jerryzh168, angelayi

Subscribers: jerryzh168, angelayi, supriyar

Differential Revision: [D56740694](https://our.internmc.facebook.com/intern/diff/D56740694)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125208
Approved by: https://github.com/angelayi
2024-04-30 18:12:25 +00:00
05be0fb62d [minimizer] Add exclusion function to minimizer base (#124504)
Summary:
Add exclusion list to minimizer:
1. Some operations cannot be lowered when constructing subgraphs; this usually happens when they are isolated from their operation group.
2. Exclude them from search strategies for automation.

Reviewed By: jimone1

Differential Revision: D56327289

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124504
Approved by: https://github.com/jfix71
2024-04-30 18:02:46 +00:00
80046c315b Add templated attention BLOCK_M & BLOCK_N default size for different head_dim (#125139)
Run different head_dims [64, 128], which are the most popular ones across major GPT models.
Enumerate different `BLOCK_M` and `BLOCK_N` candidates [16, 32, 64, 128], and pick the best config as the default.

## Before
```
| Type    |   Speedup |   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod   | dtype          |
|---------|-----------|--------------|-------------|-------------|-------------|------------|-------------|----------------|
| Average |     0.704 |              |             |             |             |            |             |                |
| Max     |     0.953 |            1 |          16 |         512 |         512 |         64 | noop        | torch.bfloat16 |
| Min     |     0.482 |            1 |          16 |        4096 |        4096 |        128 | causal_mask | torch.bfloat16 |
```
## After
```
| Type    |   Speedup |   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod   | dtype          |
|---------|-----------|--------------|-------------|-------------|-------------|------------|-------------|----------------|
| Average |     0.823 |              |             |             |             |            |             |                |
| Max     |     0.926 |            1 |          16 |         512 |         512 |         64 | noop        | torch.bfloat16 |
| Min     |     0.723 |            1 |          16 |         512 |         512 |        128 | causal_mask | torch.bfloat16 |
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125139
Approved by: https://github.com/Chillee
2024-04-30 17:40:44 +00:00
04c6424fbf Remove caffe2 image and video (#125045)
This PR tries to decompose https://github.com/pytorch/pytorch/pull/122527 into a smaller one. Caffe2 image and video folders are removed along with the related CMake code.
Of note, this was inspired by and co-developed with @r-barnes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125045
Approved by: https://github.com/eqy, https://github.com/albanD
2024-04-30 17:31:57 +00:00
a03b9a2189 fix: typo (#125226)
Fixes spelling error: spacial is an incorrect spelling of spatial

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125226
Approved by: https://github.com/Skylion007
2024-04-30 16:57:39 +00:00
d699ade0cb [dynamo] Refactor into torch/_inductor/runtime/compile_tasks.py (#124681)
Differential Revision: [D56723769](https://our.internmc.facebook.com/intern/diff/D56723769)
Co-authored-by: Sam Larsen <slarsen@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124681
Approved by: https://github.com/masnesral
ghstack dependencies: #124592
2024-04-30 16:54:16 +00:00
254128c16e [inductor] Remove usage of device_interface from _inductor.runtime (#124592)
Differential Revision: [D56723770](https://our.internmc.facebook.com/intern/diff/D56723770)
Co-authored-by: Sam Larsen <slarsen@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124592
Approved by: https://github.com/masnesral
2024-04-30 16:54:16 +00:00
5f4c6d9b49 Upgrade nightly wheels to rocm6.1 (#124811)
Follow-up to https://github.com/pytorch/builder/pull/1789

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124811
Approved by: https://github.com/malfet
2024-04-30 16:30:19 +00:00
9466335ae4 Convert ForeachFuncInfo to dataclass (#125001)
- `ForeachFuncInfo` to `dataclass` for smaller diff from `OpInfo`
- `skips` to `decorators` and `skip` to `xfail`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125001
Approved by: https://github.com/janeyx99
2024-04-30 16:19:42 +00:00
ecc2e034f7 Fakify script object inputs and attributes for non-strict export (#124239)
This PR fakifies ScriptObject inputs and attributes in export non-strict mode by default.

The basic idea is to `only fakify the script object during tracing (i.e. aot_export)`. After we get the traced graph module, eager execution, serialization, or running more passes will use the real script objects. This essentially treats the script object as a constant tensor.

Concretely, we
1. fakify all the script object inputs, and module attributes (gathered by constant_attrs).
2. patch the module's attributes with fakified script object
3. right after aot_export, remove the patching (to avoid changing the original module), then modify the exported graph module's attributes to the real script objects.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124239
Approved by: https://github.com/zou3519
2024-04-30 15:57:25 +00:00
ab80a59677 CI: add opt-in aarch64 linux workflow (#121284)
Triggered by `ciflow/linux-aarch64` and runs only `test_modules`, `test_mkldnn`, `test_mkldnn_fusion` and `test_openmp` as test for now.
TODOS:
 - Enable sccache for fast CI
 - Extend to a more reasonable test coverage

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121284
Approved by: https://github.com/atalman, https://github.com/malfet
2024-04-30 15:10:56 +00:00
b7d67e476d upload pt2 cprofile stats to manifold (#125162)
Summary:
https://fb.workplace.com/groups/257735836456307/permalink/657458576484029/

upload cprofile to manifold

D56696397 has a script to convert profiler stats to dot graphs (see its test plan)

Test Plan:
non-MAST
`TORCH_COMPILE_CPROFILE=1 buck2 run mode/opt mode/inplace //pytorch/benchmark:run -- ads_mc_igctr_mc3_v0 -d cuda -t train --torchdynamo inductor --profile --profile-export-chrome-trace`

https://www.internalfb.com/manifold/explorer/pyper_traces/tree/compilation_cprofile/test/20240428_234002_7562397568

MAST
`buck2 run mode/opt aps_models/ads/icvr:icvr_launcher -- mode=mast_ctr_cvr_cmf_rep launcher.fbl_entitlement=ai_infra_training_rnd_tc features=ctr_cvr_conso_cmf_pipeline_features_455876776_3teach model=ctr_cvr_cmf_when_rep_config_msmn_3teach model_name=ctr_cvr_when model.when_arch.use_extended_residual_contexts=True optimizers.dense_default.lr_schedule.0.max_iters=20000 training.planner.storage_reservation_policy=FixedPercentage training.planner.storage_reservation_percentage=0.72 data_loader.dataset.batch_size=2048 trainer.garbage_collection.garbage_collection_interval=100 model.when_arch.layer_norm_init_weight=0.3 optimizers.dense_default.lr_schedule.0.value=0.001 model.when_arch.customized_mlp_init_scale=0.3 launcher.num_workers=128 launcher.max_retries=10 launcher.data_project=oncall_ads_model_platform launcher.hardware=ZIONEX_80G data_loader.dataset.table_ds="[2024-01-01]" launcher.job_name=test_inductor_logging`

https://www.internalfb.com/manifold/explorer/pyper_traces/tree/compilation_cprofile/aps-test_inductor_logging-745febb51a

Generating dotty files from D56696397
```
Generating dot file from cprofile stats /home/daohang/aps-test_inductor_logging-745febb51a/0/0/_compile1.profile ...
P1225733598: https://www.internalfb.com/intern/paste/P1225733598/
Dotty: https://www.internalfb.com/intern/graphviz/?paste=1225733598
Generating dot file from cprofile stats /home/daohang/aps-test_inductor_logging-745febb51a/0/0/_compile10.profile ...
P1225733629: https://www.internalfb.com/intern/paste/P1225733629/
Dotty: https://www.internalfb.com/intern/graphviz/?paste=1225733629
Generating dot file from cprofile stats /home/daohang/aps-test_inductor_logging-745febb51a/0/0/_compile0.profile ...
P1225733649: https://www.internalfb.com/intern/paste/P1225733649/
Dotty: https://www.internalfb.com/intern/graphviz/?paste=1225733649
```

Differential Revision: D56679561

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125162
Approved by: https://github.com/anijain2305
2024-04-30 15:05:01 +00:00
2480e8b8a1 Add MAP_SHARED option for torch.load(mmap=True) (#124889)
Fixes #124528

Going over the options for our MapAllocator and what they do, I don't think any of the others need to be piped through to `torch.load`.

4f29103749/aten/src/ATen/MapAllocator.h (L8-L16)

~However, I wonder if this `MmapVisibility(Enum)` is a good way to represent "or-ing" together of `mmap` flags if we want to extend it in the future. I looked over the flags for [`mmap(2)`](https://man7.org/linux/man-pages/man2/mmap.2.html), and could not immediately see how most of them would be useful for `torch.load` (would maybe `MAP_LOCKED` (like `mlock`) or `MAP_HUGE` ever be worthwhile?)~

Using the flags provided by the python `mmap` library so that we can extend the allowed flags and pipe them down to the cpp `mmap` call if there is a need for other flags in the future
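A hedged sketch of the intended usage; the option-setting entry point below is assumed from this PR's description, so check the torch.serialization docs for the final spelling:
```python
import mmap

import torch

torch.save(torch.arange(4, dtype=torch.float32), "weights.pt")

# Assumed API added by this PR: pipe an mmap flag from Python's `mmap`
# module down to the underlying C++ mmap call.
torch.serialization.set_default_mmap_options(mmap.MAP_SHARED)
t = torch.load("weights.pt", mmap=True)  # backing mapping now uses MAP_SHARED
```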

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124889
Approved by: https://github.com/albanD
2024-04-30 15:02:19 +00:00
761a7b84ba [Dynamo] Fix alias issue with respect to wrapped numbers (#124731) (#124774)
This PR fixes an issue where calling `aten.alias(int)` raises a TypeError.

```python
import torch
import torch.autograd.forward_ad as fwAD

def f(x):
    return 4312491 * x

device = "cpu"

with torch._subclasses.fake_tensor.FakeTensorMode():
    with fwAD.dual_level():
        x = torch.randn(3, device=device)
        y = torch.ones_like(x)
        dual = fwAD.make_dual(x, y)
        f(dual)
```

The test case above illustrates this bug.
1) `4312491` turns into a tensor that is a wrapped number
2) Forward mode AD calls `aten::alias` internally
3) The wrapped number (`4312491`) becomes a python integer
4) `aten.alias(int)` raises a `TypeError`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124774
Approved by: https://github.com/albanD, https://github.com/zou3519
2024-04-30 14:11:46 +00:00
9aed5dcfe6 Clarify wording in docstring for CosineAnnealingWarmRestarts within lr_scheduler.py (#125161)
- Clarifies wording in the docstring for `CosineAnnealingWarmRestarts` within `lr_scheduler.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125161
Approved by: https://github.com/janeyx99
2024-04-30 14:01:22 +00:00
e3db465029 Re-enable nightly testing for linux and macos binaries (#123390)
Related to: https://github.com/pytorch/pytorch/issues/123225

The skip tests logic lives here:
https://github.com/pytorch/builder/blob/main/run_tests.sh#L19

Linux builds are using check_binary:
https://github.com/pytorch/pytorch/actions/runs/8627625694/job/23649245546#step:16:339
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123390
Approved by: https://github.com/ZainRizvi
2024-04-30 12:53:40 +00:00
07d3af8e6a Added ARC test jobs to all build jobs in the unstable bucket (#125142)
Added ARC test jobs to all build jobs in the unstable bucket
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125142
Approved by: https://github.com/ZainRizvi, https://github.com/seemethere
2024-04-30 09:32:22 +00:00
dc514df2af [inductor] add triton code to SchedulerNode.debug_str (#125091)
Here is an example print: https://gist.github.com/shunting314/75c161368a833a535bd0d240b8099d7e

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125091
Approved by: https://github.com/jansel
ghstack dependencies: #125090
2024-04-30 08:27:53 +00:00
a587a93f4c [inductor][easy] add buffer layout to SchedulerNode.debug_str (#125090)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125090
Approved by: https://github.com/jansel
2024-04-30 08:27:53 +00:00
e0d2c24de1 Fix device type issue in _get_device_handle (#124390)
Fix #124327

`device_type`, the first arg of [init_device_mesh()](a0466061e1/torch/distributed/device_mesh.py (L503)), does not support device types with an index, such as `cuda:0`.
If `cuda:0` is used as the parameter, `_get_device_handle()` will not correctly return `torch.cuda`.
So the exception should be thrown before creating the DeviceMesh object.

> See https://github.com/pytorch/pytorch/issues/124327#issuecomment-2062551161,

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124390
Approved by: https://github.com/wz337, https://github.com/wanchaol
2024-04-30 06:59:56 +00:00
5e5f890273 [dynamo][source] Remove inspect getattr_static from AttrSource (#125200)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125200
Approved by: https://github.com/jansel
2024-04-30 06:44:25 +00:00
8320b770fd [dynamo] use lazy disable dynamo for manual seed (#125196)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125196
Approved by: https://github.com/fegin, https://github.com/yanboliang
2024-04-30 06:04:22 +00:00
e7846447e0 dynamic shapes builder API (#124898)
This PR introduces a new way of building `dynamic_shapes` for export. The idea is to build up a mapping from input tensors to the dynamic shapes that should be assigned to their corresponding fake tensors.

This mapping is automatically converted to the current form of `dynamic_shapes`, which must exactly match the structure of inputs. We do this by using pytree utils.

With the current `dynamic_shapes`, we had to be careful about user-defined classes that are registered with pytree, since  such classes are not necessarily polymorphic containers; they may be fine containing tensors, but not dynamic shapes. Thus we had decided to allow input instances of such classes to be associated with dynamic shapes in flattened form. This decision needs to be mirrored in this PR as well. To make it easier to keep these code paths in sync, we refactor the current recursive procedure for associating inputs with dynamic shapes to use the same pytree utils. This needs minor fixes to a few tests where `dynamic_shapes` were not exactly matching the structure of inputs.

Differential Revision: D56551992

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124898
Approved by: https://github.com/zhxchen17
2024-04-30 03:59:49 +00:00
31801918e9 Add pooling support for 3d channels last (#116305)
Part of a multi-PR work to improve #59168.

Meant to complete:
- Write native kernels for AvgPool3d
- Write native kernels for MaxPool3d
- Write native kernels for AdaptiveAvgPool3d
- Write native kernels for AdaptiveMaxPool3d

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116305
Approved by: https://github.com/ezyang
2024-04-30 03:51:49 +00:00
16e8431963 Fix hybrid sparse COO tensor conversion to meta tensor (#125120)
As in the title.

Addresses a bug reported in https://github.com/pytorch/pytorch/pull/117907#issuecomment-2080035379

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125120
Approved by: https://github.com/ezyang, https://github.com/amjames
2024-04-30 03:43:42 +00:00
74b7c56517 [Autotune] Use half the number of warps for reduction tuning on AMD. (#125084)
For a reduction kernel with a given block size on AMDGPU, I was seeing that a thread's 16-byte vectorization bandwidth was not fully leveraged, while this was not a problem on NVGPU. Each thread got less data to process because a whole row was processed by more threads, so the number of elements per thread was not enough to saturate full vectorization. On AMDGPU a warp has 64 lanes, compared to 32 on the NV side, therefore I'm tuning down the default number of warps (8 for NV) for AMD. I'm seeing a 10% speedup on an internal benchmark.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125084
Approved by: https://github.com/shunting314
2024-04-30 02:38:34 +00:00
0969f01d73 [FSDP2] Accumulated in reduce_dtype if not syncing grads (#125191)
For microbatching use cases (e.g. PP), we may use fp32 reduce-scatter (i.e. `MixedPrecisionPolicy(reduce_dtype=torch.float32)`), where we want to accumulate the unsharded gradients in fp32 across microbatches until reduce-scattering in fp32 upon the last microbatch.
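For reference, a minimal sketch of the policy described above, using the composable FSDP2 APIs as of this commit (constructing the policy needs no process group; applying `fully_shard` does):
```python
import torch
from torch.distributed._composable.fsdp import MixedPrecisionPolicy, fully_shard

# bf16 compute, fp32 gradient reduction: unsharded grads are accumulated in
# fp32 across microbatches until the last microbatch triggers reduce-scatter.
policy = MixedPrecisionPolicy(param_dtype=torch.bfloat16, reduce_dtype=torch.float32)

model = torch.nn.Linear(8, 8)
# fully_shard(model, mp_policy=policy)  # assumes an initialized device mesh / process group
```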

Note that the `unsharded_param` is in bf16, so we must save the fp32 accumulated gradient to an attribute different from `.grad`. Moreover, saving a new attribute on the `torch.Tensor` leads to some annoying type checking issues (where the attribute may not be defined), so this PR prefers to save the attribute on the `FSDPParam` class instead.

One could argue that this behavior should be configurable, but since I think for large-scale training, everyone is leaning toward fp32 accumulation across microbatches, let us avoid adding another argument for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125191
Approved by: https://github.com/weifengpy
ghstack dependencies: #125190
2024-04-30 02:19:13 +00:00
631d2b87f1 [FSDP2] Fixed fp32 param dtype/bf16 reduce dtype test (#125190)
The unit test for fp32 `param_dtype` and bf16 `reduce_dtype` was disabled. This PR debugs the issue and identifies the root cause as numeric differences between NCCL bf16 all-reduce vs. bf16 reduce-scatter. We address this by having the baseline use reduce-scatter -> all-gather to implement all-reduce.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125190
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
2024-04-30 02:15:33 +00:00
2369ee49cc Update torch-xpu-ops pin (ATen XPU implementation) (#125011)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125011
Approved by: https://github.com/EikanWang
2024-04-30 01:18:19 +00:00
724c7491d0 Revert " [Distributed] [7/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#124987)"
This reverts commit b3fd94d15ef49c99ffa32a8226d1f00b0cc26f68.

Reverted https://github.com/pytorch/pytorch/pull/124987 on behalf of https://github.com/ezyang due to broke downstream extensions ([comment](https://github.com/pytorch/pytorch/pull/124987#issuecomment-2083956511))
2024-04-30 00:37:53 +00:00
e7631d6eae Revert "CI: add aarch64 linux workflow (#121284)"
This reverts commit 32cf04cb7f7aa14aff4d1cf40517d5de797550e7.

Reverted https://github.com/pytorch/pytorch/pull/121284 on behalf of https://github.com/malfet due to Test only changes has not been reverted ([comment](https://github.com/pytorch/pytorch/pull/121284#issuecomment-2083925890))
2024-04-30 00:24:11 +00:00
744f341aa4 Fix ref leak in dtype.to_complex()/to_real() (#125154)
By using `Py_NewRef`

Also, wrap `THPDtype_to_real`/`THPDtype_to_complex` calls with `HANDLE_TH_ERRORS`

Add regression test for the above issues, by calling to_complex for integral dtypes, that raises an exception and by preserving reference count to the same to_complex/to_real call to detect if leak is happeneing.

Replace
```cpp
auto dtype = (PyObject*)torch::getTHPDtype(current_dtype);
Py_INCREF(dtype);
return dtype;
```
with a more compact/streamlined equivalent
```cpp
return Py_NewRef(torch::getTHPDtype(current_dtype));
```

Fixes https://github.com/pytorch/pytorch/issues/124868
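A rough sketch of the kind of regression check described above (an assumed shape, not the actual test added by the PR):
```python
import sys

import torch

target = torch.float32.to_complex()  # torch.complex64
baseline = sys.getrefcount(target)
for _ in range(1000):
    torch.float32.to_complex()
# Without the fix, each call leaked a reference to the returned dtype object.
assert sys.getrefcount(target) - baseline < 5, "refcount grew: likely a leak"

# Integral dtypes have no complex counterpart; with HANDLE_TH_ERRORS this now
# raises a Python exception instead of misbehaving.
try:
    torch.int32.to_complex()
except (RuntimeError, TypeError):
    pass
```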

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125154
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-04-29 23:59:27 +00:00
4d717cd7c3 [TD] Enable td on cpu windows (#125049)
yolo

Also
* Ensure that at least 1 test always gets run (`//` does truncation which results in 0 if you have too few tests discovered)
* Don't run test removal on slow tests - I'm not touching that yet

I am avoiding everything other than pull + trunk workflows, so not doing this on windows CUDA, which runs on periodic
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125049
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2024-04-29 23:39:54 +00:00
8ee6105f84 Fix edge case in cudagraph pool detection (#124981)
When we do cudagraph warmup, we record which outputs are in the cudagraph pool, so subsequently when we invoke a cudagraph and need to reclaim its memory we can free the prior run's outputs and make them error on access.

In warmup, we detect this by ignoring outputs which are an alias of an input that is not a prior output. We did this by checking data pointer. In very rare situations, a data pointer of a non cudagraph input might get reallocated to a cudagraph pool and causes us to ignore it.

This was happening with a gpt-fast error on gemma 2 when coordinate_descent_tuning was set to False.

This update checks aliasing with non-cudagraph inputs by looking at the storage pointer instead.

Unrelated: saw very weird behavior where an output had the same data pointer as a supposedly live input but not the same cdata 🤔  I would think that is not possible.

```
out[0]._cdata in  [ref()._cdata for ab in non_cudagraph_inps_storage_refs] # False
out[0].data_ptr() in  [ref().data_ptr() for ab in non_cudagraph_inps_storage_refs] # True
```
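To illustrate the distinction the fix relies on (this is just a standalone illustration of data pointers vs. storage pointers, not the cudagraph trees code itself): a view can share storage with its base while reporting a different data pointer, so comparing `untyped_storage().data_ptr()` identifies the underlying allocation rather than the element offset.
```python
import torch

base = torch.arange(8)
view = base[4:]  # same storage, offset element pointer

print(view.data_ptr() == base.data_ptr())  # False: data_ptr includes the offset
print(view.untyped_storage().data_ptr() == base.untyped_storage().data_ptr())  # True: same allocation
```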

Differential Revision: [D56607721](https://our.internmc.facebook.com/intern/diff/D56607721)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124981
Approved by: https://github.com/ezyang
2024-04-29 23:37:34 +00:00
e1e6ef753b [dtensor] use str for reduce_op (#125172)
This PR uses str for reduce_op directly instead of the c10d enum. Since
our functional collectives already use str, there's no reason that we
need the c10d enum anymore, as that requires a conversion.

Also, the str hash + eq performance is actually significantly faster than
the c10d type, so this somewhat improves the CPU overhead too.

Some local cpu benchmarks on `1000000` hash operations:

```
Hash performance for string type: 0.039897 seconds
Hash performance for integer type: 0.304665 seconds
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125172
Approved by: https://github.com/awgu, https://github.com/XilunWu, https://github.com/tianyu-l
2024-04-29 23:30:24 +00:00
ccaf03fd89 Fix: nn.Parameter return type identified as Tensor instead of nn.Parameter (#125106)
Fixes #125105

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125106
Approved by: https://github.com/ezyang, https://github.com/albanD
2024-04-29 23:25:23 +00:00
26f8d96cab Fix typo in compile docstring regarding default cache_size_limit (#125145)
The docstring of `torch.compile` specifies that the default `torch._dynamo.config.cache_size_limit` equals `64`, while the value is `8` in the corresponding .py file.
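For reference, the knob in question (the value of 8 is what the commit reports for the tree at this point):
```python
import torch._dynamo.config as dynamo_config

print(dynamo_config.cache_size_limit)  # 8 at the time of this commit, not 64
dynamo_config.cache_size_limit = 16    # can be raised when frequent recompiles are expected
```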

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125145
Approved by: https://github.com/kit1980
2024-04-29 22:47:43 +00:00
8c219251c5 Add backwards support to FlexAttention (#123902)
# Summary
This is part one of adding backwards support to FlexAttention.

This PR focuses on the eager implementation and wiring up enough of the templated_attention_backward(name change soon 😉) to get through aot_eager.

Notably this does not actually wire up the triton template just yet in order to make this PR easier to review. That will be the next follow up PR.

#### Structure
We pass both the forward and backward graph to the backwardsHOP since these are both needed to be inlined into the calculation for backwards:
- the forward graph is needed in order to re-compute the scores
- the joint graph is needed in order to construct the correct gradients  post softmax_grad calc

### Attached AOT Graph
https://gist.github.com/drisspg/ce4c041f8df8a5a7983c5174705cf2b5

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123902
Approved by: https://github.com/Chillee
2024-04-29 22:34:22 +00:00
720e5f306d Update CODEOWNERS - Dataloader (#125181)
Fixes #124473

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125181
Approved by: https://github.com/gokulavasan, https://github.com/albanD
2024-04-29 21:37:18 +00:00
faee0e5ee8 [ez][CI] Move test_linalg and test_sparse_csr off CI_SERIAL_LIST (#125068)
* https://github.com/pytorch/pytorch/pull/124649 for context

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125068
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2024-04-29 21:22:35 +00:00
946e202c07 [export] Restore user input names to unlifted graph modules (#124765)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/122842

Currently, calling ep.module() on an ExportedProgram leads to a GraphModule with a default forward signature (e.g. arg_0, arg_1, ...). This causes the original placeholder names to disappear for retracing/re-exporting.

Fixing this issue by creating a forward_arg_names field (renaming suggestions welcome) that stores the positional & keyword arg names that are used. These names aren't present in the currently stored call_spec, and this requires a major version bump for the ExportedProgram schema.

Test Plan: Tests exist for export, but names are now changed from generic (e.g. arg_0, arg_1) to follow user inputs (e.g. x, y)

Differential Revision: D56484994

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124765
Approved by: https://github.com/zhxchen17
2024-04-29 20:58:17 +00:00
f1d1e3246f Revert "[dtensor] implement shard dim change with alltoall (#124872)"
This reverts commit 6b79469d2437531fa506b48d42488be512a87f4d.

Reverted https://github.com/pytorch/pytorch/pull/124872 on behalf of https://github.com/clee2000 due to broke distributed/tensor/parallel/test_tp_examples.py::DistTensorParallelExampleTest::test_transformer_training_is_seq_parallel_True https://github.com/pytorch/pytorch/actions/runs/8882762411/job/24389191482 f7f018a0ed.  Bad TD ([comment](https://github.com/pytorch/pytorch/pull/124872#issuecomment-2083599445))
2024-04-29 20:26:16 +00:00
3bd67dab32 Revert "[dtensor] delete the old unused mesh_alltoall (#124879)"
This reverts commit f7f018a0ed442f92eb5270150ced7b6117773368.

Reverted https://github.com/pytorch/pytorch/pull/124879 on behalf of https://github.com/clee2000 due to broke distributed/tensor/parallel/test_tp_examples.py::DistTensorParallelExampleTest::test_transformer_training_is_seq_parallel_True https://github.com/pytorch/pytorch/actions/runs/8882762411/job/24389191482 f7f018a0ed.  Bad TD ([comment](https://github.com/pytorch/pytorch/pull/124872#issuecomment-2083599445))
2024-04-29 20:26:15 +00:00
3d1dd79b80 make sure to stopTrace() on exception (#125131)
If there's an exception during collection it can result in the profiler never being stopped properly. As a result all subsequent tests that use profiling will also fail - even if they pass in isolation.

I'm hoping this fixes the flakiness in #124253, #124220, #82720, #119346, #119364, #119490, #119526, #119537 (and the currently closed #82864).

Before:
```
(py312) $ PYTORCH_TEST_WITH_DYNAMO=1 pytest test/profiler/test_profiler.py
===================================================================================================================== FAILURES =====================================================================================================================
============================================================================================================= short test summary info ==============================================================================================================
FAILED test/profiler/test_profiler.py::TestExecutionTrace::test_execution_trace_with_kineto - AssertionError: Element counts were not equal:
FAILED test/profiler/test_profiler.py::TestExperimentalUtils::test_profiler_conv2d_bias_followed_by_batchnorm2d_pattern - RuntimeError: Can't disable Kineto profiler when it's not running
FAILED test/profiler/test_profiler.py::TestExperimentalUtils::test_profiler_extra_cuda_copy_pattern - RuntimeError: Can't disable Kineto profiler when it's not running
FAILED test/profiler/test_profiler.py::TestExperimentalUtils::test_profiler_extra_cuda_copy_pattern_benchmark - AttributeError: 'NoneType' object has no attribute 'profiler'
FAILED test/profiler/test_profiler.py::TestExperimentalUtils::test_profiler_fp32_matmul_pattern - AttributeError: 'NoneType' object has no attribute 'profiler'
FAILED test/profiler/test_profiler.py::TestExperimentalUtils::test_profiler_matmul_dim_fp16_pattern - RuntimeError: Can't disable Kineto profiler when it's not running
FAILED test/profiler/test_profiler.py::TestProfiler::test_kineto_multigpu - torch._dynamo.exc.InternalTorchDynamoError: 'NoneType' object has no attribute 'events'
FAILED test/profiler/test_profiler.py::TestProfiler::test_oom_tracing - AssertionError: RuntimeError not raised
FAILED test/profiler/test_profiler.py::TestProfiler::test_source_multithreaded_basic_work_in_main_thread_False - RuntimeError: Can't disable Kineto profiler when it's not running
FAILED test/profiler/test_profiler.py::TestProfiler::test_source_multithreaded_close_in_scope_work_in_main_thread_False - RuntimeError: Can't disable Kineto profiler when it's not running
FAILED test/profiler/test_profiler.py::TestProfiler::test_source_multithreaded_complex_work_in_main_thread_False - RuntimeError: Can't disable Kineto profiler when it's not running
FAILED test/profiler/test_profiler.py::TestProfiler::test_source_multithreaded_multiple_preexisting_work_in_main_thread_False - RuntimeError: Can't disable Kineto profiler when it's not running
FAILED test/profiler/test_profiler.py::TestProfiler::test_source_multithreaded_open_in_scope_work_in_main_thread_False - RuntimeError: Can't disable Kineto profiler when it's not running
FAILED test/profiler/test_profiler.py::TestTorchTidyProfiler::test_optimizer_parameters_sgd - RuntimeError: Can't disable Kineto profiler when it's not running
FAILED test/profiler/test_profiler.py::TestTorchTidyProfiler::test_refcounts - RuntimeError: Can't disable Kineto profiler when it's not running
FAILED test/profiler/test_profiler.py::TestTorchTidyProfiler::test_sparse_tensors - RuntimeError: Can't disable Kineto profiler when it's not running
==================================================================================================== 16 failed, 26 passed, 53 skipped in 25.51s ====================================================================================================
```
After:
```
(py312) $ PYTORCH_TEST_WITH_DYNAMO=1 pytest test/profiler/test_profiler.py
===================================================================================================================== FAILURES =====================================================================================================================
============================================================================================================= short test summary info ==============================================================================================================
FAILED test/profiler/test_profiler.py::TestExecutionTrace::test_execution_trace_with_kineto - AssertionError: Element counts were not equal:
FAILED test/profiler/test_profiler.py::TestExperimentalUtils::test_profiler_extra_cuda_copy_pattern - RuntimeError: !stack.empty() INTERNAL ASSERT FAILED at "/data/users/aorenste/pytorch/torch/csrc/autograd/profiler_python.cpp":969...
FAILED test/profiler/test_profiler.py::TestExperimentalUtils::test_profiler_extra_cuda_copy_pattern_benchmark - AttributeError: 'NoneType' object has no attribute 'profiler'
FAILED test/profiler/test_profiler.py::TestExperimentalUtils::test_profiler_fp32_matmul_pattern - AttributeError: 'NoneType' object has no attribute 'profiler'
FAILED test/profiler/test_profiler.py::TestProfiler::test_kineto_multigpu - torch._dynamo.exc.InternalTorchDynamoError: 'NoneType' object has no attribute 'events'
FAILED test/profiler/test_profiler.py::TestProfiler::test_oom_tracing - AssertionError: RuntimeError not raised
FAILED test/profiler/test_profiler.py::TestTorchTidyProfiler::test_optimizer_parameters_sgd - RuntimeError: !stack.empty() INTERNAL ASSERT FAILED at "/data/users/aorenste/pytorch/torch/csrc/autograd/profiler_python.cpp":969, please...
==================================================================================================== 7 failed, 35 passed, 53 skipped in 31.51s =====================================================================================================
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125131
Approved by: https://github.com/Skylion007, https://github.com/aaronenyeshi
2024-04-29 19:07:37 +00:00
a434d1487b Fix EtcdServer leak in etcd_server_test.py file (#125121)
As stated in the title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125121
Approved by: https://github.com/Skylion007
2024-04-29 18:59:05 +00:00
fab5bd5359 [checkpoint] Improve error message when use_reentrant=True is used with .grad() (#125155)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125155
Approved by: https://github.com/albanD
2024-04-29 18:57:35 +00:00
f03cf9d4dc Fix & optimze open device registration test. (#124712)
Fixes #100152

1. Fix the incorrect tests for lazy init of the PrivateUse1 backend named foo
2. Fix the broken backend meta registry mechanism when compiling with clang++ (compiling with g++ works fine); the bug was introduced by a static variable in an inline function
3. Refactor the tests and make them more flexible
4. Disable the two tests temporarily
    - test_open_device_storage_pin_memory
    - test_compile_autograd_function_aliasing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124712
Approved by: https://github.com/albanD, https://github.com/malfet
2024-04-29 18:55:38 +00:00
32cf04cb7f CI: add aarch64 linux workflow (#121284)
The aarch64 Linux workflow is triggered by ciflow/aarch64 tags.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121284
Approved by: https://github.com/atalman, https://github.com/malfet
2024-04-29 18:25:40 +00:00
ae13c7e593 Revert "[Meta Tensor] fix meta inplace set storage (#123880)"
This reverts commit cccae9355191a807040fb40a65178c4d7fe3f084.

Reverted https://github.com/pytorch/pytorch/pull/123880 on behalf of https://github.com/izaitsevfb due to breaks cpu_inductor_torchbench (detectron2_fasterrcnn) ([comment](https://github.com/pytorch/pytorch/pull/123880#issuecomment-2083366385))
2024-04-29 18:19:42 +00:00
96cc73dc13 [oss][torch.package] fix multiple error messages within PackageExporter (#124943)
Summary:
fixes two issues:
- when exporting with debug=True, the list of error-causing modules and a dependency path to them is not printed correctly: a missing newline after the path puts the name of the module for the next error on the wrong line, which makes the output confusing to read
- when a pickled object directly references more than one mocked module, the error message incorrectly repeats the same information, claiming the referenced attribute is present in several different libraries. This happens because the if condition references the last-seen module name while walking the pickle ops, not the module name from the enclosing block `for module_name in all_dependencies:`. As a result, one error prints as O(all_dependencies) errors, all with different module names but the same attribute name

Differential Revision: D56578035

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124943
Approved by: https://github.com/JonAmazon, https://github.com/houseroad
2024-04-29 18:11:28 +00:00
f7f018a0ed [dtensor] delete the old unused mesh_alltoall (#124879)
as titled, as we have a dedicated comm op, this is not needed anymore

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124879
Approved by: https://github.com/XilunWu, https://github.com/wz337
ghstack dependencies: #124871, #124872
2024-04-29 17:22:30 +00:00
6b79469d24 [dtensor] implement shard dim change with alltoall (#124872)
as titled, we implement a dedicated communication op to allow efficient
sharding dimension change using alltoall, to replace our previous
allgather + local chunk

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124872
Approved by: https://github.com/XilunWu, https://github.com/yifuwang
ghstack dependencies: #124871
2024-04-29 17:22:30 +00:00
8d46ab4104 [dtensor] move pad/unpad_tensor to separate utils (#124871)
as titled, 1. pad/unpad is a general util, not specific to the Shard placement; 2. for the purpose of the next PR, move these two out of the Shard placement itself and add an additional pad_dim argument

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124871
Approved by: https://github.com/awgu, https://github.com/wz337, https://github.com/XilunWu
2024-04-29 17:22:25 +00:00
935a946241 [RFC][FSDP2] Renamed FSDP to FSDPModule (#124955)
This PR renames the `FSDP` class to `FSDPModule`. This is a BC breaking change. The rationale is that `FSDPModule` is more descriptive since `fully_shard` is a module-level API (applied to a `module` arg), so the `FSDP` class will always correspond to a module.

Also, users commonly import `FullyShardedDataParallel` as `FSDP`, so this can help avoid some name conflict in some cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124955
Approved by: https://github.com/wanchaol, https://github.com/wconstab
ghstack dependencies: #124651, #124741, #124767, #124768, #124780, #124787
2024-04-29 16:33:18 +00:00
da44d2f7fb split out flop counting its own method (#125061)
Summary: Modularizing code for reuse by splitting __torch_dispatch__ to move flop counting to its own method.

Test Plan: unit tests

Differential Revision: D56644523

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125061
Approved by: https://github.com/842974287
2024-04-29 14:13:44 +00:00
e5e623af4b Codegen runtime asserts in Inductor (#124874)
This completely subsumes https://github.com/pytorch/pytorch/pull/120816

This makes use of the unbacked binding machinery to teach Inductor how to generate deferred runtime asserts directly. There is some back story about why I did it this way, let me explain.

Previously, our strategy for generating runtime asserts was that Dynamo would insert them into the FX graph after finishing tracing, and we would attempt to code generate them based on the FX graph. This is a good strategy for export, where we immediately export the graph. However, this strategy was afflicted by problems in eager, where we reuse the same ShapeEnv as before. In particular, on subsequent graph passes, we would immediately turn all of these assertions into noops, because when we evaluated their expressions, we would see that because we had a deferred runtime assert in the ShapeEnv, we know "oh, of course this expression is True" already. Oops!

So, with this PR, we take the attitude that as long as the ShapeEnv sticks around, the ShapeEnv's list of deferred runtime asserts is the source of truth, and we don't put anything in the graph. So we just need to decide when to actually generate asserts, and the place I picked was Inductor lowering, since we already have an AssertScalar buffer concept, and so I just need to insert them at this point. AssertScalar also uses raw sympy.Expr rather than SymInt/Bool, so it is easier to prevent unrestricted simplification at this point.

There are a few things jumbled together in this PR. I can split them if you want, but some of the changes are before I changed my strategy, but they're useful changes anyway.

**torch/_dynamo/output_graph.py** and **torch/_inductor/lowering.py** - Here, we stop putting deferred runtime asserts in the graph. I also have to make sure we don't DCE unused symbol arguments; we're going to get some goofy graph arguments this way, will be good to restore that optimization eventually. We also just disable codegen for `_assert_scalar`  entirely; we assume that ShapeEnv will be good enough to capture all of these.

**torch/_inductor/codegen/wrapper.py** and **torch/_inductor/ir.py** - Add a way to codegen sizevars without forcing simplification

**torch/_inductor/graph.py** - The main logic. Our strategy is to interpose in the same place we are testing that unbacked SymInts are properly showing up in lowered code. The logic is directly analogous to the logic in the existing insert deferred runtime asserts FX pass, but it's simpler because sympy expressions can be directly stored on inductor IR nodes.

**torch/fx/experimental/symbolic_shapes.py** - For extra safety, we have a way of freezing runtime asserts, so that if you try to add more we error. This prevents us from adding runtime asserts after we've done lowering. There's a funny interaction with backwards which there's a comment for in graph.py

**torch/fx/passes/runtime_assert.py** - This is not really needed in this PR, but I rewrote the runtime assert logic to use unbacked_bindings rather than inferring it by looking for unbacked SymInts. Now, keypaths are translated into FX node accessors. Unfortunately, I couldn't delete the old inference code, because you still need it to find backed SymInts from arguments (as this pass may be used on graphs which don't explicitly bind all their shape variables as arguments). There are some new tests exercising this.

TODO: I think we need to generate asserts for replacements too. This is a preexisting problem that the old FX pass had too.
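
For illustration, a hedged sketch of the kind of data-dependent shape that produces a deferred runtime assert (standard torch APIs; the toy function is not from this PR):
```
import torch
import torch._dynamo

torch._dynamo.config.capture_scalar_outputs = True  # let .item() stay in the graph

@torch.compile(fullgraph=True)
def f(x):
    n = x.to(torch.int64).sum().item()  # unbacked SymInt
    torch._check(n >= 0)                # recorded in the ShapeEnv as a deferred runtime assert
    return torch.zeros(n)

print(f(torch.tensor([2, 3])).shape)    # torch.Size([5])
```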

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124874
Approved by: https://github.com/jansel
ghstack dependencies: #124864
2024-04-29 10:19:29 +00:00
e498e28b2f Remove API that allows for extra deferred runtime asserts during lowering (#124864)
I want to generate runtime assert nodes during lowering, which means
that I need a finalized list of asserts by the time I start lowering.
This means this runtime assert introduced in
https://github.com/pytorch/pytorch/pull/113839 must go.  Fortunately,
this runtime assert was never exercisable, apparently, and the test
still "passes" without it.  I replace it with a compile time test.  We
can revisit if this assert fails in practice.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124864
Approved by: https://github.com/jansel
2024-04-29 10:19:29 +00:00
303880e16b Update gen.py aoti_fm install dir (#125087)
Summary: make it consistent with all the other install dirs

Test Plan: Sandcastle

Differential Revision: D56660301

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125087
Approved by: https://github.com/frank-wei
2024-04-29 08:25:16 +00:00
cyy
5585138db9 Remove caffe2 contrib and experiments (#125038)
This PR tries to decompose #122527 into a smaller one.
To be noted, this was inspired by and is co-developed with @r-barnes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125038
Approved by: https://github.com/malfet
2024-04-29 06:27:13 +00:00
555f1aeb02 Fix module buffer mutation (#124586)
Fixes #124583

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124586
Approved by: https://github.com/leslie-fang-intel, https://github.com/desertfire
2024-04-29 06:05:12 +00:00
06b845dedc Make metadata serialization more strict (#124411)
Summary: When I was debugging an issue, this silent error made the debugging harder. It is better to error out earlier with a more descriptive error message.

Test Plan: None

Differential Revision: D56312433

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124411
Approved by: https://github.com/zhxchen17
2024-04-29 02:11:40 +00:00
cc06c00a56 Don't run auto grad safe mode when predispatch is on (#125066)
Summary: Title

Test Plan: CI

Differential Revision: D56646678

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125066
Approved by: https://github.com/zhxchen17
2024-04-29 01:53:23 +00:00
e3b9b71684 [BE]: Ruff - TRY401 - Avoid verbose exception logging (#125126)
Don't bother logging the exception object explicitly with the logger; it's captured anyway and would generate verbose output.
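
Illustrative example of the pattern TRY401 flags (not code from this PR):
```
import logging

logging.basicConfig()
logger = logging.getLogger(__name__)

try:
    1 / 0
except ZeroDivisionError as exc:
    # TRY401: passing `exc` is redundant; logger.exception already records
    # the active exception and its traceback.
    logger.exception("division failed: %s", exc)
    # Preferred: logger.exception("division failed")
```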

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125126
Approved by: https://github.com/ezyang
2024-04-28 21:44:33 +00:00
3e1fb96964 [BE]: RUF018 - ban assignment in assert (#125125)
Ban assignment inside of assert. Python code should ideally not break with assertions disabled. Adds a ruff lint rule to enforce this.
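
Illustrative example of the pattern RUF018 bans (not code from this PR):
```
def parse_bad(data):
    # Assignment inside assert: under `python -O` the whole assert statement,
    # including the walrus assignment, is stripped, so `value` is never defined.
    assert (value := data.get("key")) is not None
    return value

def parse_good(data):
    value = data.get("key")
    assert value is not None
    return value

print(parse_good({"key": 1}))
```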
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125125
Approved by: https://github.com/ezyang
2024-04-28 21:41:36 +00:00
a05b2ae302 Enable UFMT on test/test_dataloader.py (#124710)
Part of: #123062

Ran lintrunner on:

- test/test_custom_op_testing.py (already deleted)
- test/test_dataloader.py

Detail:

```bash
$ lintrunner -a --take UFMT --all-files
ok No lint issues.
Successfully applied all patches.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124710
Approved by: https://github.com/soulitzer
2024-04-28 21:21:51 +00:00
hun
518ab48e85 Enable UFMT on test/test_functionalization.py (#123926)
Part of  #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123926
Approved by: https://github.com/ezyang, https://github.com/statelesshz
2024-04-28 17:02:34 +00:00
cccae93551 [Meta Tensor] fix meta inplace set storage (#123880)
Fixes #123879

Co-authored-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123880
Approved by: https://github.com/ezyang
2024-04-28 17:01:12 +00:00
6761b49551 Ensure autocast device_type is a string + Unit test (#125014)
Reviving #124873 (already approved) to resolve CLA issues

Fixes #124738

(Marked as draft until I get local unit tests to run)

Edit: Tests passing
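
A hedged sketch of the behavior being enforced (CPU autocast is used so it runs anywhere; the exact exception type/message for a non-string device_type is assumed, not quoted from the PR):
```
import torch

# OK: device_type is a string
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    pass

# Passing a torch.device instead of a string is the misuse being guarded against.
try:
    with torch.autocast(device_type=torch.device("cpu"), dtype=torch.bfloat16):
        pass
except (TypeError, ValueError) as e:
    print("rejected non-string device_type:", e)
```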

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125014
Approved by: https://github.com/mikaylagawarecki, https://github.com/soulitzer
2024-04-28 16:27:30 +00:00
1a0b247762 [dynamo] Bug fix for LOAD_GLOBAL and STORE_GLOBAL (#125002)
Earlier, globals of inlined functions from other files were not handled correctly: we were not tracking mutations on them, and they were colliding with globals of the same name in the parent function, etc. This PR overrides LOAD_GLOBAL/STORE_GLOBAL for the inlining instruction translator (inline tx) and tracks mutations on them separately.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125002
Approved by: https://github.com/jansel
ghstack dependencies: #125097, #125107
2024-04-28 15:24:17 +00:00
0f139b04b3 [dynamo] Fix test (#125107)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125107
Approved by: https://github.com/jansel
ghstack dependencies: #125097
2024-04-28 15:24:17 +00:00
49ca2b3429 [BE]: Apply RUF025 perf fixups (#125104)
Uses `dict.fromkeys()` for more efficient dict construction. Automatically generated by RUF025 (prev).
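
Illustrative example of the rewrite RUF025 suggests (not code from this PR):
```
keys = ["a", "b", "c"]

d_comprehension = {k: None for k in keys}  # dict comprehension with a constant value
d_fromkeys = dict.fromkeys(keys, None)     # preferred: more efficient and clearer

assert d_comprehension == d_fromkeys
```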

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125104
Approved by: https://github.com/ezyang
2024-04-28 15:09:21 +00:00
94b328ee45 add likely/unlikely macro for unsupport c++20 compiler. (#124997)
# Issue:
The Intel validation team found that some older gcc versions that do not support C++20 hit the issue below:
```cmd
[2024-04-13T08:03:25.142Z] g++ /tmp/torchinductor_root/vd/cvdytwwwlhi63ofh3pwzqfpjga4w4xe7bjfdoavpblbo5khzf3b2.cpp -shared -fPIC -Wall -std=c++17 -Wno-unused-variable -Wno-unknown-pragmas -D_GLIBCXX_USE_CXX11_ABI=0 -I/root/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/include -I/root/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/root/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/include/TH -I/root/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/include/THC -I/root/anaconda3/envs/pytorch/include/python3.8 -L/root/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/lib -L/root/anaconda3/envs/pytorch/lib -L/root/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/lib -ltorch -ltorch_cpu -lgomp -ltorch_python -lc10 -mavx2 -mfma -DCPU_CAPABILITY_AVX2 -O3 -DNDEBUG -ffast-math -fno-finite-math-only -fno-unsafe-math-optimizations -ffp-contract=off -march=native -fopenmp -D C10_USING_CUSTOM_GENERATED_MACROS -o /tmp/torchinductor_root/vd/cvdytwwwlhi63ofh3pwzqfpjga4w4xe7bjfdoavpblbo5khzf3b2.so
[2024-04-13T08:03:25.142Z]
[2024-04-13T08:03:25.142Z] Output:
[2024-04-13T08:03:25.142Z] /tmp/torchinductor_root/vd/cvdytwwwlhi63ofh3pwzqfpjga4w4xe7bjfdoavpblbo5khzf3b2.cpp: In function ‘T parse_arg(PyObject*, size_t) [with T = long int; PyObject = _object; size_t = long unsigned int]’:
[2024-04-13T08:03:25.142Z] /tmp/torchinductor_root/vd/cvdytwwwlhi63ofh3pwzqfpjga4w4xe7bjfdoavpblbo5khzf3b2.cpp:117:10: error: expected identifier before ‘[’ token
[2024-04-13T08:03:25.142Z] [[unlikely]] throw std::runtime_error("expected int arg");
[2024-04-13T08:03:25.142Z] ^
```

The reason is that the `[[unlikely]]` attribute requires C++20; ref: https://en.cppreference.com/w/cpp/language/attributes/likely

# Solution:
Add a macro to support GNU compilers that do not have the C++20 attribute.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124997
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-04-28 07:03:12 +00:00
42a192db0f Fix Conv BN folding with deadcode (#124808)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/124286
The TorchBenchmark includes a method called `run_n_iterations` which runs the model multiple times.
43f4e71daa/benchmarks/dynamo/common.py (L2272-L2276)

https://github.com/pytorch/pytorch/pull/123399 enables tracing into a `UserDefinedObjectVariable` that's an instance method. As a result, the model is traced into an FX graph multiple times within `run_n_iterations`. Then, in Inductor, module-level `Conv-BN folding` fuses the same Conv-BN module multiple times in this case, which leads to accuracy failures. This PR addresses the issue by ensuring that each Conv-BN module is fused only once.

**TestPlan**
```
python -u -m pytest -s -v test/inductor/test_inductor_freezing.py -k test_folded_conv_bn_with_module_sharing
python -u -m pytest -s -v test/inductor/test_inductor_freezing.py -k test_folded_conv_functional_bn_with_module_sharing
python -u -m pytest -s -v test/inductor/test_inductor_freezing.py -k test_conv_bn_with_multi_bn_share_conv
python -u -m pytest -s -v test/inductor/test_inductor_freezing.py -k test_conv_functional_bn_with_multi_bn_share_conv
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124808
Approved by: https://github.com/jansel, https://github.com/jgong5
2024-04-28 06:29:40 +00:00
c1e0dea023 Delete unused param 'OP' in KERNEL_PRIVATEUSEONE (#125008)
Parameter 'OP' is unused but occupies a position, which causes the length of \_\_VA_ARGS\__ to be less than expected.
Missed this diff in https://github.com/pytorch/pytorch/pull/124050.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125008
Approved by: https://github.com/FFFrog, https://github.com/leslie-fang-intel
2024-04-28 06:17:16 +00:00
5f7c4181b5 Correcting valid device name of privateuse1 (#125018)
"privateuseone" is an invalid string for privateuse1 backend, the correct one should be returned from _get_privateuse1_backend_name().
Fixes #ISSUE_NUMBER
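
A hedged sketch of the naming convention being corrected (the backend name "foo" is just an example; a real custom backend would also register its device module):
```
import torch

torch.utils.rename_privateuse1_backend("foo")

# The device-name string to use is the one returned here,
# not the literal "privateuseone".
print(torch._C._get_privateuse1_backend_name())  # -> "foo"
```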

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125018
Approved by: https://github.com/aaronenyeshi
2024-04-28 06:04:34 +00:00
c5b1a4c269 [inductor] share more cse cache during swap buffer (#124921)
`swap_buffer` prevents the `cse_cache` from being shared between the inside and the outside of the lambda function scope.
For example,

```
auto tmp8 = -std::numeric_limits<float>::infinity();
auto tmp9 = [&]
{
    auto tmp12 = -std::numeric_limits<float>::infinity();
    return tmp12;
}
```
`tmp12` should not be created since it is the same as `tmp8`.

We make the `cse_cache` a read-only cache inside the scope (it is unsafe to expose the cache created inside the scope, since the outside scope cannot use it).

**Test Plan**
```
python test/inductor/test_torchinductor.py -k test_AllenaiLongformerBase_repro_cpu
```
The `static_cast<int>(256)` occurs only once after this PR, since the inner scope can share the cse cache of the outer scope.

Before this PR,
```
cpp_fused_copy_full_like_0 = async_compile.cpp_pybinding(['const float*', 'float*'], '''
#include "/tmp/torchinductor_root/ub/cub6x5nmhqhp7xapkb3dlgjxef3t2bnkx7y7n4z2f4z5obnecxpy.h"
extern "C" void kernel(const float* in_ptr0,
                       float* out_ptr1)
{
    #pragma omp parallel num_threads(128)
    {
        int tid = omp_get_thread_num();
        {
            #pragma omp for collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(4L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(1024L); x1+=static_cast<long>(1L))
                {
                    #pragma GCC ivdep
                    for(long x2=static_cast<long>(0L); x2<static_cast<long>(12L); x2+=static_cast<long>(1L))
                    {
                        for(long x3=static_cast<long>(0L); x3<static_cast<long>(512L); x3+=static_cast<long>(16L))
                        {
                            auto tmp0 = c10::convert<int>(x1);
                            auto tmp1 = static_cast<int>(256);
                            auto tmp2 = tmp0 < tmp1;
                            auto tmp3 = [&]
                            {
                                auto tmp4 = c10::convert<int>(x3);
                                auto tmp5 = at::vec::Vectorized<int>::arange(tmp4, 1);
                                auto tmp6 = static_cast<int>(257);
                                auto tmp7 = at::vec::Vectorized<int>(tmp6);
                                auto tmp8 = at::vec::VecMask<int,1>(tmp5 < tmp7);
                                auto tmp10 = at::vec::VecMask<float,1>::from(tmp2);
                                auto tmp11 = tmp8 & tmp10;
                                auto tmp9 = [&]
                                {
                                    auto tmp12 = -std::numeric_limits<float>::infinity();
                                    return tmp12;
                                }
                                ;
                                auto tmp13 =
                                [&]
                                {
                                    if (tmp11.all_zero())
                                    {
                                        return at::vec::Vectorized<float>(static_cast<float>(0.0));
                                    }
                                    else
                                    {
                                        return decltype(at::vec::Vectorized<float>(tmp9()))::blendv(at::vec::Vectorized<float>(static_cast<float>(0.0)), at::vec::Vectorized<float>(tmp9()), tmp11.template cast<float,1>());
                                    }
                                }
                                ()
                                ;
                                auto tmp14 = c10::convert<int>(c10::div_floor_integer(x1, 256L));
                                auto tmp15 = static_cast<int>(3);
                                auto tmp16 = tmp14 < tmp15;
                                auto tmp18 = tmp16 & tmp2;
                                auto tmp17 = [&]
                                {
                                    auto tmp19 = c10::convert<int>(x3);
                                    auto tmp20 = at::vec::Vectorized<int>::arange(tmp19, 1);
                                    auto tmp21 = static_cast<int>(256);
                                    auto tmp22 = at::vec::Vectorized<int>(tmp21);
                                    auto tmp23 = at::vec::VecMask<int,1>(tmp20 >= tmp22);
                                    auto tmp25 = at::vec::VecMask<float,1>::from(tmp18);
                                    auto tmp26 = tmp23 & tmp25;
                                    auto tmp24 = [&]
                                    {
                                        auto tmp27 = tmp26.template cast<float,1>().template loadu<float,1>(in_ptr0 + static_cast<long>((-256L) + x3 + (513L*(static_cast<long>(x1) % static_cast<long>(256L))) + (262656L*(c10::div_floor_integer(x1, 256L))) + (787968L*x2) + (9455616L*x0)));
                                        return tmp27;
                                    }
                                    ;
                                    auto tmp28 =
                                    [&]
                                    {
                                        if (tmp26.all_zero())
                                        {
                                            return at::vec::Vectorized<float>(static_cast<float>(0.0));
                                        }
                                        else
                                        {
                                            return decltype(tmp24())::blendv(at::vec::Vectorized<float>(static_cast<float>(0.0)), tmp24(), tmp26.template cast<float,1>());
                                        }
                                    }
                                    ()
                                    ;
                                    auto tmp29 = static_cast<float>(0.0);
                                    auto tmp30 = at::vec::Vectorized<float>(tmp29);
                                    auto tmp31 = decltype(tmp28)::blendv(tmp30, tmp28, tmp23.template cast<float,1>());
                                    return tmp31;
                                }
                                ;
                                auto tmp32 = tmp16 ? tmp17() : at::vec::Vectorized<float>(static_cast<float>(0.0));
                                auto tmp33 = static_cast<float>(0.0);
                                auto tmp34 = at::vec::VecMask<float,1>::from(tmp16);
                                auto tmp35 = at::vec::Vectorized<float>(tmp33);
                                auto tmp36 = decltype(tmp32)::blendv(tmp35, tmp32, tmp34.template cast<float,1>());
                                auto tmp37 = decltype(tmp13)::blendv(tmp36, tmp13, tmp8.template cast<float,1>());
                                return tmp37;
                            }
                            ;
                            auto tmp38 = tmp2 ? tmp3() : at::vec::Vectorized<float>(static_cast<float>(0.0));
                            auto tmp39 = c10::convert<int>(c10::div_floor_integer(x1, 256L));
                            auto tmp40 = static_cast<int>(3);
                            auto tmp41 = tmp39 < tmp40;
                            auto tmp42 = [&]
                            {
                                auto tmp43 = c10::convert<int>(x3);
                                auto tmp44 = at::vec::Vectorized<int>::arange(tmp43, 1);
                                auto tmp45 = static_cast<int>(256);
                                auto tmp46 = at::vec::Vectorized<int>(tmp45);
                                auto tmp47 = at::vec::VecMask<int,1>(tmp44 >= tmp46);
                                auto tmp49 = at::vec::VecMask<float,1>::from(tmp41);
                                auto tmp50 = tmp47 & tmp49;
                                auto tmp48 = [&]
                                {
                                    auto tmp51 = tmp50.template cast<float,1>().template loadu<float,1>(in_ptr0 + static_cast<long>((-256L) + x3 + (513L*(static_cast<long>(x1) % static_cast<long>(256L))) + (262656L*(c10::div_floor_integer(x1, 256L))) + (787968L*x2) + (9455616L*x0)));
                                    return tmp51;
                                }
                                ;
                                auto tmp52 =
                                [&]
                                {
                                    if (tmp50.all_zero())
                                    {
                                        return at::vec::Vectorized<float>(static_cast<float>(0.0));
                                    }
                                    else
                                    {
                                        return decltype(tmp48())::blendv(at::vec::Vectorized<float>(static_cast<float>(0.0)), tmp48(), tmp50.template cast<float,1>());
                                    }
                                }
                                ()
                                ;
                                auto tmp53 = static_cast<float>(0.0);
                                auto tmp54 = at::vec::Vectorized<float>(tmp53);
                                auto tmp55 = decltype(tmp52)::blendv(tmp54, tmp52, tmp47.template cast<float,1>());
                                return tmp55;
                            }
                            ;
                            auto tmp56 = tmp41 ? tmp42() : at::vec::Vectorized<float>(static_cast<float>(0.0));
                            auto tmp57 = static_cast<float>(0.0);
                            auto tmp58 = at::vec::VecMask<float,1>::from(tmp41);
                            auto tmp59 = at::vec::Vectorized<float>(tmp57);
                            auto tmp60 = decltype(tmp56)::blendv(tmp59, tmp56, tmp58.template cast<float,1>());
                            auto tmp61 = at::vec::VecMask<float,1>::from(tmp2);
                            auto tmp62 = decltype(tmp38)::blendv(tmp60, tmp38, tmp61.template cast<float,1>());
                            tmp62.store(out_ptr1 + static_cast<long>(x3 + (513L*x1) + (525312L*x2) + (6303744L*x0)));
                        }
                        #pragma omp simd simdlen(8)
                        for(long x3=static_cast<long>(512L); x3<static_cast<long>(513L); x3+=static_cast<long>(1L))
                        {
                            auto tmp0 = c10::convert<int64_t>(x1);
                            auto tmp1 = static_cast<int64_t>(256);
                            auto tmp2 = tmp0 < tmp1;
                            auto tmp3 = [&]
                            {
                                auto tmp4 = c10::convert<int64_t>(x3);
                                auto tmp5 = static_cast<int64_t>(257);
                                auto tmp6 = tmp4 < tmp5;
                                auto tmp7 = [&]
                                {
                                    auto tmp8 = -std::numeric_limits<float>::infinity();
                                    return tmp8;
                                }
                                ;
                                auto tmp9 = tmp6 ? tmp7() : static_cast<decltype(tmp7())>(0.0);
                                auto tmp10 = c10::convert<int64_t>(c10::div_floor_integer(x1, 256L));
                                auto tmp11 = static_cast<int64_t>(3);
                                auto tmp12 = tmp10 < tmp11;
                                auto tmp13 = [&]
                                {
                                    auto tmp14 = c10::convert<int64_t>(x3);
                                    auto tmp15 = static_cast<int64_t>(256);
                                    auto tmp16 = tmp14 >= tmp15;
                                    auto tmp17 = [&]
                                    {
                                        auto tmp18 = in_ptr0[static_cast<long>((-256L) + x3 + (513L*(static_cast<long>(x1) % static_cast<long>(256L))) + (262656L*(c10::div_floor_integer(x1, 256L))) + (787968L*x2) + (9455616L*x0))];
                                        return tmp18;
                                    }
                                    ;
                                    auto tmp19 = tmp16 ? tmp17() : static_cast<decltype(tmp17())>(0.0);
                                    auto tmp20 = static_cast<float>(0.0);
                                    auto tmp21 = tmp16 ? tmp19 : tmp20;
                                    return tmp21;
                                }
                                ;
                                auto tmp22 = tmp12 ? tmp13() : static_cast<decltype(tmp13())>(0.0);
                                auto tmp23 = static_cast<float>(0.0);
                                auto tmp24 = tmp12 ? tmp22 : tmp23;
                                auto tmp25 = tmp6 ? tmp9 : tmp24;
                                return tmp25;
                            }
                            ;
                            auto tmp26 = tmp2 ? tmp3() : static_cast<decltype(tmp3())>(0.0);
                            auto tmp27 = c10::convert<int64_t>(c10::div_floor_integer(x1, 256L));
                            auto tmp28 = static_cast<int64_t>(3);
                            auto tmp29 = tmp27 < tmp28;
                            auto tmp30 = [&]
                            {
                                auto tmp31 = c10::convert<int64_t>(x3);
                                auto tmp32 = static_cast<int64_t>(256);
                                auto tmp33 = tmp31 >= tmp32;
                                auto tmp34 = [&]
                                {
                                    auto tmp35 = in_ptr0[static_cast<long>((-256L) + x3 + (513L*(static_cast<long>(x1) % static_cast<long>(256L))) + (262656L*(c10::div_floor_integer(x1, 256L))) + (787968L*x2) + (9455616L*x0))];
                                    return tmp35;
                                }
                                ;
                                auto tmp36 = tmp33 ? tmp34() : static_cast<decltype(tmp34())>(0.0);
                                auto tmp37 = static_cast<float>(0.0);
                                auto tmp38 = tmp33 ? tmp36 : tmp37;
                                return tmp38;
                            }
                            ;
                            auto tmp39 = tmp29 ? tmp30() : static_cast<decltype(tmp30())>(0.0);
                            auto tmp40 = static_cast<float>(0.0);
                            auto tmp41 = tmp29 ? tmp39 : tmp40;
                            auto tmp42 = tmp2 ? tmp26 : tmp41;
                            out_ptr1[static_cast<long>(x3 + (513L*x1) + (525312L*x2) + (6303744L*x0))] = tmp42;
                        }
                    }
                }
            }
        }
    }
}
''')
```
After this PR,
```
cpp_fused_copy_full_like_0 = async_compile.cpp_pybinding(['const float*', 'float*'], '''
#include "/tmp/torchinductor_root/ub/cub6x5nmhqhp7xapkb3dlgjxef3t2bnkx7y7n4z2f4z5obnecxpy.h"
extern "C" void kernel(const float* in_ptr0,
                       float* out_ptr1)
{
    #pragma omp parallel num_threads(128)
    {
        int tid = omp_get_thread_num();
        {
            #pragma omp for collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(4L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(1024L); x1+=static_cast<long>(1L))
                {
                    #pragma GCC ivdep
                    for(long x2=static_cast<long>(0L); x2<static_cast<long>(12L); x2+=static_cast<long>(1L))
                    {
                        for(long x3=static_cast<long>(0L); x3<static_cast<long>(512L); x3+=static_cast<long>(16L))
                        {
                            auto tmp0 = c10::convert<int>(x1);
                            auto tmp1 = static_cast<int>(256);
                            auto tmp2 = tmp0 < tmp1;
                            auto tmp3 = [&]
                            {
                                auto tmp4 = c10::convert<int>(x3);
                                auto tmp5 = at::vec::Vectorized<int>::arange(tmp4, 1);
                                auto tmp6 = static_cast<int>(257);
                                auto tmp7 = at::vec::Vectorized<int>(tmp6);
                                auto tmp8 = at::vec::VecMask<int,1>(tmp5 < tmp7);
                                auto tmp10 = at::vec::VecMask<float,1>::from(tmp2);
                                auto tmp11 = tmp8 & tmp10;
                                auto tmp9 = [&]
                                {
                                    auto tmp12 = -std::numeric_limits<float>::infinity();
                                    return tmp12;
                                }
                                ;
                                auto tmp13 =
                                [&]
                                {
                                    if (tmp11.all_zero())
                                    {
                                        return at::vec::Vectorized<float>(static_cast<float>(0.0));
                                    }
                                    else
                                    {
                                        return decltype(at::vec::Vectorized<float>(tmp9()))::blendv(at::vec::Vectorized<float>(static_cast<float>(0.0)), at::vec::Vectorized<float>(tmp9()), tmp11.template cast<float,1>());
                                    }
                                }
                                ()
                                ;
                                auto tmp14 = c10::convert<int>(c10::div_floor_integer(x1, 256L));
                                auto tmp15 = static_cast<int>(3);
                                auto tmp16 = tmp14 < tmp15;
                                auto tmp18 = tmp16 & tmp2;
                                auto tmp17 = [&]
                                {
                                    auto tmp19 = at::vec::Vectorized<int>(tmp1);
                                    auto tmp20 = at::vec::VecMask<int,1>(tmp5 >= tmp19);
                                    auto tmp22 = at::vec::VecMask<float,1>::from(tmp18);
                                    auto tmp23 = tmp20 & tmp22;
                                    auto tmp21 = [&]
                                    {
                                        auto tmp24 = tmp23.template cast<float,1>().template loadu<float,1>(in_ptr0 + static_cast<long>((-256L) + x3 + (513L*(static_cast<long>(x1) % static_cast<long>(256L))) + (262656L*(c10::div_floor_integer(x1, 256L))) + (787968L*x2) + (9455616L*x0)));
                                        return tmp24;
                                    }
                                    ;
                                    auto tmp25 =
                                    [&]
                                    {
                                        if (tmp23.all_zero())
                                        {
                                            return at::vec::Vectorized<float>(static_cast<float>(0.0));
                                        }
                                        else
                                        {
                                            return decltype(tmp21())::blendv(at::vec::Vectorized<float>(static_cast<float>(0.0)), tmp21(), tmp23.template cast<float,1>());
                                        }
                                    }
                                    ()
                                    ;
                                    auto tmp26 = static_cast<float>(0.0);
                                    auto tmp27 = at::vec::Vectorized<float>(tmp26);
                                    auto tmp28 = decltype(tmp25)::blendv(tmp27, tmp25, tmp20.template cast<float,1>());
                                    return tmp28;
                                }
                                ;
                                auto tmp29 = tmp16 ? tmp17() : at::vec::Vectorized<float>(static_cast<float>(0.0));
                                auto tmp30 = static_cast<float>(0.0);
                                auto tmp31 = at::vec::VecMask<float,1>::from(tmp16);
                                auto tmp32 = at::vec::Vectorized<float>(tmp30);
                                auto tmp33 = decltype(tmp29)::blendv(tmp32, tmp29, tmp31.template cast<float,1>());
                                auto tmp34 = decltype(tmp13)::blendv(tmp33, tmp13, tmp8.template cast<float,1>());
                                return tmp34;
                            }
                            ;
                            auto tmp35 = tmp2 ? tmp3() : at::vec::Vectorized<float>(static_cast<float>(0.0));
                            auto tmp36 = c10::convert<int>(c10::div_floor_integer(x1, 256L));
                            auto tmp37 = static_cast<int>(3);
                            auto tmp38 = tmp36 < tmp37;
                            auto tmp39 = [&]
                            {
                                auto tmp40 = c10::convert<int>(x3);
                                auto tmp41 = at::vec::Vectorized<int>::arange(tmp40, 1);
                                auto tmp42 = at::vec::Vectorized<int>(tmp1);
                                auto tmp43 = at::vec::VecMask<int,1>(tmp41 >= tmp42);
                                auto tmp45 = at::vec::VecMask<float,1>::from(tmp38);
                                auto tmp46 = tmp43 & tmp45;
                                auto tmp44 = [&]
                                {
                                    auto tmp47 = tmp46.template cast<float,1>().template loadu<float,1>(in_ptr0 + static_cast<long>((-256L) + x3 + (513L*(static_cast<long>(x1) % static_cast<long>(256L))) + (262656L*(c10::div_floor_integer(x1, 256L))) + (787968L*x2) + (9455616L*x0)));
                                    return tmp47;
                                }
                                ;
                                auto tmp48 =
                                [&]
                                {
                                    if (tmp46.all_zero())
                                    {
                                        return at::vec::Vectorized<float>(static_cast<float>(0.0));
                                    }
                                    else
                                    {
                                        return decltype(tmp44())::blendv(at::vec::Vectorized<float>(static_cast<float>(0.0)), tmp44(), tmp46.template cast<float,1>());
                                    }
                                }
                                ()
                                ;
                                auto tmp49 = static_cast<float>(0.0);
                                auto tmp50 = at::vec::Vectorized<float>(tmp49);
                                auto tmp51 = decltype(tmp48)::blendv(tmp50, tmp48, tmp43.template cast<float,1>());
                                return tmp51;
                            }
                            ;
                            auto tmp52 = tmp38 ? tmp39() : at::vec::Vectorized<float>(static_cast<float>(0.0));
                            auto tmp53 = static_cast<float>(0.0);
                            auto tmp54 = at::vec::VecMask<float,1>::from(tmp38);
                            auto tmp55 = at::vec::Vectorized<float>(tmp53);
                            auto tmp56 = decltype(tmp52)::blendv(tmp55, tmp52, tmp54.template cast<float,1>());
                            auto tmp57 = at::vec::VecMask<float,1>::from(tmp2);
                            auto tmp58 = decltype(tmp35)::blendv(tmp56, tmp35, tmp57.template cast<float,1>());
                            tmp58.store(out_ptr1 + static_cast<long>(x3 + (513L*x1) + (525312L*x2) + (6303744L*x0)));
                        }
                        #pragma omp simd simdlen(8)
                        for(long x3=static_cast<long>(512L); x3<static_cast<long>(513L); x3+=static_cast<long>(1L))
                        {
                            auto tmp0 = c10::convert<int64_t>(x1);
                            auto tmp1 = static_cast<int64_t>(256);
                            auto tmp2 = tmp0 < tmp1;
                            auto tmp3 = [&]
                            {
                                auto tmp4 = c10::convert<int64_t>(x3);
                                auto tmp5 = static_cast<int64_t>(257);
                                auto tmp6 = tmp4 < tmp5;
                                auto tmp7 = [&]
                                {
                                    auto tmp8 = -std::numeric_limits<float>::infinity();
                                    return tmp8;
                                }
                                ;
                                auto tmp9 = tmp6 ? tmp7() : static_cast<decltype(tmp7())>(0.0);
                                auto tmp10 = c10::convert<int64_t>(c10::div_floor_integer(x1, 256L));
                                auto tmp11 = static_cast<int64_t>(3);
                                auto tmp12 = tmp10 < tmp11;
                                auto tmp13 = [&]
                                {
                                    auto tmp14 = tmp4 >= tmp1;
                                    auto tmp15 = [&]
                                    {
                                        auto tmp16 = in_ptr0[static_cast<long>((-256L) + x3 + (513L*(static_cast<long>(x1) % static_cast<long>(256L))) + (262656L*(c10::div_floor_integer(x1, 256L))) + (787968L*x2) + (9455616L*x0))];
                                        return tmp16;
                                    }
                                    ;
                                    auto tmp17 = tmp14 ? tmp15() : static_cast<decltype(tmp15())>(0.0);
                                    auto tmp18 = static_cast<float>(0.0);
                                    auto tmp19 = tmp14 ? tmp17 : tmp18;
                                    return tmp19;
                                }
                                ;
                                auto tmp20 = tmp12 ? tmp13() : static_cast<decltype(tmp13())>(0.0);
                                auto tmp21 = static_cast<float>(0.0);
                                auto tmp22 = tmp12 ? tmp20 : tmp21;
                                auto tmp23 = tmp6 ? tmp9 : tmp22;
                                return tmp23;
                            }
                            ;
                            auto tmp24 = tmp2 ? tmp3() : static_cast<decltype(tmp3())>(0.0);
                            auto tmp25 = c10::convert<int64_t>(c10::div_floor_integer(x1, 256L));
                            auto tmp26 = static_cast<int64_t>(3);
                            auto tmp27 = tmp25 < tmp26;
                            auto tmp28 = [&]
                            {
                                auto tmp29 = c10::convert<int64_t>(x3);
                                auto tmp30 = tmp29 >= tmp1;
                                auto tmp31 = [&]
                                {
                                    auto tmp32 = in_ptr0[static_cast<long>((-256L) + x3 + (513L*(static_cast<long>(x1) % static_cast<long>(256L))) + (262656L*(c10::div_floor_integer(x1, 256L))) + (787968L*x2) + (9455616L*x0))];
                                    return tmp32;
                                }
                                ;
                                auto tmp33 = tmp30 ? tmp31() : static_cast<decltype(tmp31())>(0.0);
                                auto tmp34 = static_cast<float>(0.0);
                                auto tmp35 = tmp30 ? tmp33 : tmp34;
                                return tmp35;
                            }
                            ;
                            auto tmp36 = tmp27 ? tmp28() : static_cast<decltype(tmp28())>(0.0);
                            auto tmp37 = static_cast<float>(0.0);
                            auto tmp38 = tmp27 ? tmp36 : tmp37;
                            auto tmp39 = tmp2 ? tmp24 : tmp38;
                            out_ptr1[static_cast<long>(x3 + (513L*x1) + (525312L*x2) + (6303744L*x0))] = tmp39;
                        }
                    }
                }
            }
        }
    }
}
''')
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124921
Approved by: https://github.com/jgong5, https://github.com/jansel
ghstack dependencies: #124597
2024-04-28 04:33:25 +00:00
57790fd088 [inductor] share cse cache during vectorized indirect load (#124597)
Fix https://github.com/pytorch/pytorch/issues/123502

`swap_buffer` is not needed in the vectorized indirect load; remove it to share the cse buffer.
```
auto tmp8 =
[&]
{
    __at_align__ std::array<int64_t, 16> tmpbuf;
    tmp7.store(tmpbuf.data());
    return tmpbuf;
}
()
;
//
// other codes
//
// also store tmp7 here (redundant tmp16)
auto tmp16 =
[&]
{
    __at_align__ std::array<int64_t, 16> tmpbuf;
    tmp7.store(tmpbuf.data());
    return tmpbuf;
}
()
;
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124597
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-04-28 01:02:48 +00:00
7478b7f1ca Add common used score_mod functions for templated attention (#124670)
Fixes #ISSUE_NUMBER
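
For illustration, a hedged sketch of what a score_mod callable looks like (the (score, batch, head, q_idx, kv_idx) signature follows the templated-attention design; the causal example below is illustrative and not necessarily one of the functions added here):
```
import torch

def causal_score_mod(score, batch, head, q_idx, kv_idx):
    # Keep the score where the query may attend to the key, mask the rest with -inf.
    return torch.where(q_idx >= kv_idx, score, torch.full_like(score, float("-inf")))

print(causal_score_mod(torch.tensor(1.0), 0, 0, torch.tensor(2), torch.tensor(5)))
```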

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124670
Approved by: https://github.com/Chillee
2024-04-27 21:04:52 +00:00
df08140de2 [dynamo] Collect cell_and_freevars correctly (#125097)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125097
Approved by: https://github.com/Skylion007
2024-04-27 20:39:54 +00:00
7aa6bd7fa0 Refactor all top level usages of record_shapeenv_event to ShapeEnv class (#123735)
This ensures that first argument to record_shapeenv_event is a ShapeEnv
so we can appropriately short circuit when recording is not in progress.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123735
Approved by: https://github.com/ysiraichi, https://github.com/zou3519, https://github.com/albanD
2024-04-27 20:36:40 +00:00
9ce58542ba Ignore torch/distributed/_tensor/_collective_utils.py for TOR901 (#125082)
Fixes https://github.com/pytorch/pytorch/issues/125050

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125082
Approved by: https://github.com/malfet, https://github.com/Skylion007
2024-04-27 20:14:02 +00:00
b4a008209a Expose tensor check from guard for reusing (#124836)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124836
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/desertfire
2024-04-27 18:35:35 +00:00
f0a5a0d298 OSS: Capture triton kernel in ET (#124775)
This DIFF is to capture triton kernels in execution trace

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124775
Approved by: https://github.com/briancoutinho, https://github.com/aaronenyeshi
2024-04-27 18:01:18 +00:00
8246f42864 Export torch.newaxis=None for Python Array API/Numpy consistency (#125026)
Fixes #65307

For consistency with Python Array API (https://data-apis.org/array-api/latest/API_specification/constants.html) and NumPy  (https://numpy.org/devdocs/reference/constants.html), I added `torch.newaxis = None`.

Note that this consistency is also directly mentioned in `__init__.py`, right above the added export.

The `torch.newaxis` is also mentioned in #110636.
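
Quick illustration of the added alias (assumes a build that includes this change):
```
import torch

x = torch.arange(3)
assert torch.newaxis is None
y = x[:, torch.newaxis]   # same as x[:, None] or x.unsqueeze(1)
print(y.shape)            # torch.Size([3, 1])
```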

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125026
Approved by: https://github.com/lezcano
2024-04-27 16:40:51 +00:00
9bf53b128c [codemod] Remove unused variables in caffe2/aten/src/ATen/test/scalar_test.cpp (#125041)
Summary:
LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, its associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, e.g., in an `assert` statement that isn't present in production code.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: palmje

Differential Revision: D56587751

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125041
Approved by: https://github.com/Skylion007
2024-04-27 15:53:16 +00:00
905318818d [codemod] Fix missing field initializer in caffe2/torch/lib/libshm/core.cpp +2 (#125047)
Summary:
The LLVM warning `-Wmissing-field-initializers` has found one or more structs in this diff's files which were missing field initializers.

This can be unintended such as:
```
my_struct s1 = {0}; // Initializes *only* the first field to zero; others to default values
my_struct s2 = {}; // Initializes *all* fields to default values (often zero)
```
or it may be because only some of the members of a struct are initialized, perhaps because the items were added to the struct but not every instance of it was updated.

To fix the problem, I've either used `{}` to initialize all fields to default or added appropriate default initializations to the missing fields.

Test Plan: Sandcastle

Reviewed By: palmje

Differential Revision: D56614179

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125047
Approved by: https://github.com/Skylion007
2024-04-27 15:52:56 +00:00
61e937f3d6 Add registration API for torch.compile-eager (#121387)
This PR is a follow-up of RFC https://github.com/pytorch/pytorch/issues/115545.

In this PR, we intend to provide a registration API dedicated to eager-through-torch.compile. The major workflow of this API will be as follows.

- Load cache
- Check cache according to the input tensors
  - Cache Hit: Run the cached kernel directly
  - Cache Miss: Run AOTI to produce the kernel and run the produced kernel. If AOTI fails to produce the kernel, invoke the Python fallback function.

Currently, this PR always falls back to the Python kernel; the cache mechanism will be implemented in another PR - https://github.com/pytorch/pytorch/pull/116368
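
A rough sketch of the cache-lookup-with-fallback flow described above. All names here (`run_with_cache`, the cache layout, the injected `compile_with_aoti` callable) are illustrative placeholders, not the API added by this PR:

```python
from typing import Callable, Dict, Tuple

import torch

# Hypothetical kernel cache keyed on input-tensor metadata.
_kernel_cache: Dict[Tuple, Callable] = {}

def run_with_cache(op_key: str,
                   compile_with_aoti: Callable,
                   python_fallback: Callable,
                   *tensors: torch.Tensor):
    # Check the cache according to the input tensors.
    cache_key = (op_key,) + tuple(
        (t.dtype, t.device.type, tuple(t.shape)) for t in tensors
    )

    kernel = _kernel_cache.get(cache_key)
    if kernel is not None:
        # Cache hit: run the cached kernel directly.
        return kernel(*tensors)

    try:
        # Cache miss: ask AOTInductor (via the injected callable) for a kernel.
        kernel = compile_with_aoti(op_key, tensors)
        _kernel_cache[cache_key] = kernel
        return kernel(*tensors)
    except Exception:
        # If AOTI fails to produce the kernel, invoke the Python fallback.
        return python_fallback(*tensors)
```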

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121387
Approved by: https://github.com/desertfire, https://github.com/jansel, https://github.com/zou3519, https://github.com/jgong5
2024-04-27 12:49:58 +00:00
620d808da0 [Pytorch 2] Forward fix for broken test (#125065)
Summary:
This is a forward hotfix for T186742340.

Some recent changes in PyTorch / Inductor (D56458606) led to aten.addmm operators being inserted twice into the list of choices to select from during autotuning. This appears to have triggered a test failure in fbcode.

This fix prevents the aten operators from being added twice to the list of choices for autotuning.

Test Plan:
* Pytorch CI
 * CUDA_LAUNCH_BLOCKING=1 buck2 test 'fbcode//mode/opt' fbcode//accelerators/pytorch/lib/pt2_utils/tests:compile_pt2_test -- --exact 'accelerators/pytorch/lib/pt2_utils/tests:compile_pt2_test - test_compile_pt2 (accelerators.pytorch.lib.pt2_utils.tests.compile_pt2_test.TestCompilePT2)'

Differential Revision: D56642879

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125065
Approved by: https://github.com/eellison
2024-04-27 10:27:44 +00:00
d4a1b3e093 Make c10d_functional ops call into _c10d_functional ops (#124979)
This PR removes the legacy impls of c10d_functional ops which are now irrelevant. For backward compatibility purpose, c10d_functional ops now call into _c10d_functional ops.

We also changed c10d_functional ops to be CompositeExplicitAutograd, so that when traced, only _c10d_functional ops appear in the graph. After this, we'll be able to remove the Inductor IR for the legacy functional collectives.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124979
Approved by: https://github.com/wanchaol
2024-04-27 08:08:02 +00:00
91a4740e72 Disable the CUDA fast path for split_with_sizes_copy when capturing (#125052)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125052
Approved by: https://github.com/awgu, https://github.com/eellison, https://github.com/eqy
2024-04-27 07:59:39 +00:00
cyy
b3fd94d15e [Distributed] [7/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#124987)
This PR continues cleaning up clang-tidy warnings in torch/csrc/distributed/c10d, following #124701. In addition, a libfmt dependency is added in the CMake code so it can be used in the headers. libfmt has to be added as a private dependency of torch_cuda and torch_hip because they include torch/csrc/distributed/c10d/Utils.hpp, which uses libfmt.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124987
Approved by: https://github.com/malfet
2024-04-27 07:22:27 +00:00
ce503c1b40 Dynamo x autograd.Function supports setup_context (#124802)
Fixes part of #118397

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124802
Approved by: https://github.com/zou3519
2024-04-27 04:57:13 +00:00
eqy
a866bfff45 [cuDNN] cuDNN SDPA (Flash Attention) Backward (#122510)
#113713
Currently passing trivial smoke tests, but I mostly pattern-matched bits and pieces of the existing autograd defs.

Will also collect benchmark data.

CC @drisspg
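
For reference, a minimal forward/backward call through PyTorch's SDPA entry point (a sketch that assumes a CUDA device with fp16 support; which fused backend actually runs is left to PyTorch's dispatcher, so this is not tied to the cuDNN path specifically):

```python
import torch
import torch.nn.functional as F

q = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16, requires_grad=True)
k = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16, requires_grad=True)
v = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16, requires_grad=True)

# Forward through the fused SDPA entry point; the backend
# (flash, memory-efficient, cuDNN, math) is picked by the dispatcher.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Backward pass: the piece this PR adds a cuDNN implementation for.
out.sum().backward()
print(q.grad.shape)  # torch.Size([2, 8, 128, 64])
```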

Co-authored-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122510
Approved by: https://github.com/drisspg
2024-04-27 04:15:49 +00:00
5944a53555 [MPS] Fix nextafter for negative values (#125029)
By changing the logic on older macOS to:
```cpp
bits += ((input > 0) ^ (input > other)) ? 1 : -1;
```
and by using the native `nextafter` on macOS Sonoma (i.e., when Metal 3.1 is available).
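
A quick check of the intended behavior from the Python side (a sketch; it assumes an Apple-silicon machine with the MPS backend available):

```python
import torch

x = torch.tensor([-1.0, 0.0, 1.0], device="mps")
y = torch.tensor([0.0, -1.0, 2.0], device="mps")

# nextafter steps each element of x by one representable float toward y,
# so a negative input stepping toward zero must increase slightly.
print(torch.nextafter(x, y).cpu())
print((torch.nextafter(x, y) > x).cpu())  # expected: tensor([ True, False,  True])
```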

TODO:
  - Add tests for infs and denorms

Fixes https://github.com/pytorch/pytorch/issues/124985

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125029
Approved by: https://github.com/Skylion007
2024-04-27 02:58:05 +00:00
35b332882b [Quant][PT2E] Enable linear-binary(-unary) post-op recipe for X86Inductor quantizer (#122387)
As the title says.
**Test plan**
`python test/test_quantization.py -k test_linear_binary`

Differential Revision: [D56288440](https://our.internmc.facebook.com/intern/diff/D56288440)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122387
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #123240
2024-04-27 02:40:57 +00:00
dc4c75ba72 elastic/rendezvous: make barrier and rank assignment operations O(n) instead of O(n^2) (#124982)
Summary:
This makes barrier and rank operations linear instead of quadratic with the number of workers. This drastically improves performance for rendezvous when running with over 1000 hosts.

This uses 2 approaches for different areas:

* local rank assignment: each worker does 1 set and 1 get; local ranks are assigned on the rank 0 host in an O(n) operation, which reduces total store operations to be linear with the number of workers.
* exit_barrier: use a counter and a final flag so each worker has to do at most 1 set, 1 get, and 1 add (see the sketch below).

At 4000 hosts, torchelastic is able to run in as little as 10 seconds, down from 373 seconds.
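
A toy sketch of the counter-plus-flag exit barrier described above, written against the `torch.distributed.Store` interface (`add`/`set`/blocking `get`). The real torchelastic implementation differs; only the per-worker store traffic is the point here:

```python
import torch.distributed as dist

def exit_barrier(store: "dist.Store", rank: int, world_size: int) -> None:
    # Each worker does at most 1 add, 1 set, and 1 (blocking) get.
    arrived = store.add("exit_barrier/count", 1)   # atomic increment, returns new count
    if arrived == world_size:
        store.set("exit_barrier/done", "1")        # last worker flips the final flag
    store.get("exit_barrier/done")                 # blocks until the flag exists

# Example wiring (hypothetical host/port), one store shared by all workers:
# store = dist.TCPStore("master-host", 29500, world_size, is_master=(rank == 0))
# exit_barrier(store, rank, world_size)
```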

Test Plan:
This was tested using many small tests running on a remote cluster.

{D56549942}

```
torchx run --scheduler mast -- --image=torchelastic_benchmark --j=4000x1
```

Differential Revision: D56605193

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124982
Approved by: https://github.com/kiukchung, https://github.com/kurman
2024-04-27 02:21:44 +00:00
1a6fef15ef [compiled autograd] verbose logs for debugging cache misses (#124980)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124980
Approved by: https://github.com/jansel
ghstack dependencies: #124954
2024-04-27 01:10:37 +00:00
43a7ab2a21 [compiled autograd] introduce verbose logs, add autograd node info to graph (#124954)
- Sets the node info as a fake stack trace, since we don't have a generic comment feature.
- When verbose is disabled, this still adds a context manager and flag checks; the alternative is to use macros, but that wouldn't be usable with TORCH_LOGS.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124954
Approved by: https://github.com/jansel
2024-04-27 01:10:37 +00:00
e592a609fd [Quant][ONEDNN] improve performance of qconv by reducing integration overhead (#123240)
## Description
Framework overhead is found to be significant for the onednn qconv op (used for quantization with the PT2E X86Inductor backend). This PR reduces the integration overhead by modifying the implementation of qconv.
## performance results
Running quantized Resnet50 on an Intel(R) Xeon(R) Platinum 8490H machine
Before
```
Average latency: 8.378 ms.
-------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                     Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
-------------------------  ------------  ------------  ------------  ------------  ------------  ------------
onednn::qconv2d_pointwise        86.54%       6.954ms        87.42%       7.025ms     132.547us            53
```
After
```
Average latency: 6.255 ms.
-------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                     Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
-------------------------  ------------  ------------  ------------  ------------  ------------  ------------
onednn::qconv2d_pointwise        85.05%       6.381ms        85.98%       6.451ms     121.717us            53
```
Test script:
```python
import torch
import torchvision
import time
import copy
import numpy as np
from torch._export import capture_pre_autograd_graph
from torch.ao.quantization.quantize_pt2e import (
    prepare_pt2e,
    convert_pt2e,
)
import torch.ao.quantization.quantizer.x86_inductor_quantizer as xiq
from torch.ao.quantization.quantizer.x86_inductor_quantizer import X86InductorQuantizer

torch._inductor.config.cpp.enable_kernel_profile=True
torch._inductor.config.profiler_mark_wrapper_call = True
torch._inductor.config.freezing = True
torch._inductor.config.cpp_wrapper = True

def bench_model(model, inputs):
    times =[]
    with torch.no_grad():
        for _ in range(5): # warm-up
            output = model(inputs)
        for _ in range(20):
            start_time = time.time()
            output = model(inputs)
            end_time = time.time()
            times.append(end_time - start_time)
        print ('Average latency: %0.3f ms.' % (np.median(times) * 1000.0))

        with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CPU]) as p:
            out_ipex = model(inputs)
        print(p.key_averages().table(sort_by="self_cpu_time_total", row_limit=-1))

def pt2e_ptq(m, example_inputs):

    m = m.eval()

    exported_model = capture_pre_autograd_graph(m, example_inputs)
    quantizer = X86InductorQuantizer()
    quantizer.set_global(xiq.get_default_x86_inductor_quantization_config())
    prepared_model = prepare_pt2e(exported_model, quantizer)

    _ = prepared_model(*example_inputs)

    converted_model = convert_pt2e(prepared_model)
    torch.ao.quantization.move_exported_model_to_eval(converted_model)
    with torch.no_grad():
        optimized_model = torch.compile(converted_model)
        _ = optimized_model(*example_inputs)
        _ = optimized_model(*example_inputs)

    bench_model(optimized_model, *example_inputs)

    return optimized_model

if __name__ == "__main__":

    data = torch.randn(16, 3, 224, 224)
    model_fp = torchvision.models.resnet50(weights=torchvision.models.ResNet50_Weights.DEFAULT)
    pt2e_ptq(copy.deepcopy(model_fp), (data,))
```

Differential Revision: [D56288440](https://our.internmc.facebook.com/intern/diff/D56288440)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123240
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168
2024-04-27 00:52:45 +00:00
368f5212fa [cpu] [inductor] decompose bmm for memory bound in lowering (#124826)
Fixes #124697. Resolves the large regression in GPT-FAST MoE with `coordinate_descent_tuning` disabled.

To get better performance in the memory-bound case, we decompose bmm in lowering.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124826
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-04-27 00:19:10 +00:00
ebb8905e0c [cpu] add VecConvert between 8bits and 16bits (#124828)
The perf benefit was found in https://github.com/pytorch/pytorch/issues/124697#issuecomment-2071658300.

The PR adds intrinsic specializations between int8/uint8 and bf16/fp16.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124828
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-04-27 00:17:44 +00:00
fd24d8c05a [dynamo][nn module] Use correct sources for _call_impl (#124970)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124970
Approved by: https://github.com/jansel
ghstack dependencies: #124779, #124627
2024-04-26 23:18:30 +00:00
43069c460e Correct check for Boolean list input type (#124899)
Summary:
This diff fixes a bug in PyTorch where creating a tensor from a list of booleans threw an error.

All credit goes to swolchok for identifying the root cause of the issue and suggesting this fix.
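
The input involved is simply a Python list of bools; after the fix it should behave like any other list input (a minimal sketch of the expected behavior; the exact failing code path isn't shown in the commit message):

```python
import torch

flags = [True, False, True, True]

# A list of Python bools should produce a torch.bool tensor, the same way
# lists of ints or floats produce integer/float tensors.
t = torch.tensor(flags)
print(t.dtype)     # torch.bool
print(t.tolist())  # [True, False, True, True]
```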

Test Plan: Running our model end to end works as expected and no error occurs.

Differential Revision: D55990810

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124899
Approved by: https://github.com/zhxchen17
2024-04-26 22:25:43 +00:00
be2c09725a [dtensor][experimental] local_map (#123676)
**Summary**
This PR is an attempt to land an experimental feature designed in #103686. `local_map` is designed to allow users to apply a function written for `torch.Tensor` to `DTensor` objects.

As a function, `local_map` takes 2 required arguments (`func` and `out_placements`) and 3 optional arguments (`device_mesh`, `in_placements`, `redistribute_inputs`). `func` is the function to be applied to each local shard of the input `DTensor`s. `out_placements` is the sharding specification of the output `DTensor`.

`local_map` returns a new function that does the following:

1. Infer `device_mesh` and `in_placements` from the `DTensor` inputs if they're not provided. If `device_mesh` is provided, it must be identical to the device mesh of every `DTensor` input. If `in_placements` is provided, it serves as the required sharding specification of the corresponding `DTensor` input before its local shard is fed into `func`. If it differs from the `DTensor`'s current sharding specification, an exception is raised when `redistribute_inputs=False`; otherwise the input is redistributed to the required sharding.
2. Call `func` with the arguments passed in, along with `device_mesh`, but with each `DTensor` replaced by its local shard. `func` may include collectives.
3. For each output of `func` that has a valid (i.e. not `None`) sharding specification in `out_placements`, construct a new `DTensor` from the output and the specification, and use this `DTensor` as the output.
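
A hedged usage sketch based on the argument names in the summary above. The import path (`torch.distributed._tensor.experimental`), the exact container types accepted for the placement arguments, and the mesh setup are assumptions for illustration, not confirmed by this commit message:

```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._tensor import Shard, Replicate, distribute_tensor
# Assumed import path for the experimental API introduced by this PR:
from torch.distributed._tensor.experimental import local_map

def local_mm(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Written against plain torch.Tensor; receives each DTensor's local shard.
    return torch.mm(a, b)

# Assumes torch.distributed is already initialized across 4 ranks.
mesh = init_device_mesh("cuda", (4,))
A = distribute_tensor(torch.randn(8, 16), mesh, [Shard(0)])      # row-sharded
B = distribute_tensor(torch.randn(16, 32), mesh, [Replicate()])  # replicated

sharded_mm = local_map(
    local_mm,
    out_placements=[Shard(0)],                   # sharding of the output DTensor
    in_placements=([Shard(0)], [Replicate()]),   # required input shardings
    device_mesh=mesh,
    redistribute_inputs=True,                    # reshard inputs if they don't match
)
C = sharded_mm(A, B)  # DTensor sharded on dim 0, global shape (8, 32)
```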

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123676
Approved by: https://github.com/wanchaol
2024-04-26 22:23:59 +00:00
83e7b9d25f [Inductor] Support fusion of chained reductions even if keepdims=True (#124843)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124843
Approved by: https://github.com/shunting314
2024-04-26 21:50:52 +00:00
2329 changed files with 41076 additions and 315517 deletions

View File

@ -204,7 +204,7 @@ case "$image" in
PROTOBUF=yes
DB=yes
VISION=yes
ROCM_VERSION=5.7
ROCM_VERSION=6.0
NINJA_VERSION=1.9.0
CONDA_CMAKE=yes
TRITON=yes
@ -215,7 +215,7 @@ case "$image" in
PROTOBUF=yes
DB=yes
VISION=yes
ROCM_VERSION=6.0
ROCM_VERSION=6.1
NINJA_VERSION=1.9.0
CONDA_CMAKE=yes
TRITON=yes
@ -306,6 +306,12 @@ case "$image" in
DB=yes
VISION=yes
CONDA_CMAKE=yes
# snadampal: skipping sccache due to the following issue
# https://github.com/pytorch/pytorch/issues/121559
SKIP_SCCACHE_INSTALL=yes
# snadampal: skipping llvm src build install because the current version
# from pytorch/llvm:9.0.1 is x86 specific
SKIP_LLVM_SRC_BUILD_INSTALL=yes
;;
*)
# Catch-all for builds that are not hardcoded.
@ -360,7 +366,7 @@ if [[ "$image" == *cuda* && ${OS} == "ubuntu" ]]; then
fi
# Build image
DOCKER_BUILDKIT=1 docker build \
docker build \
--no-cache \
--progress=plain \
--build-arg "BUILD_ENVIRONMENT=${image}" \
@ -399,6 +405,8 @@ DOCKER_BUILDKIT=1 docker build \
--build-arg "EXECUTORCH=${EXECUTORCH}" \
--build-arg "BASEKIT_VERSION=${BASEKIT_VERSION}" \
--build-arg "ACL=${ACL:-}" \
--build-arg "SKIP_SCCACHE_INSTALL=${SKIP_SCCACHE_INSTALL:-}" \
--build-arg "SKIP_LLVM_SRC_BUILD_INSTALL=${SKIP_LLVM_SRC_BUILD_INSTALL:-}" \
-f $(dirname ${DOCKERFILE})/Dockerfile \
-t "$tmp_tag" \
"$@" \

View File

@ -113,7 +113,6 @@ install_centos() {
glibc-devel \
glibc-headers \
glog-devel \
hiredis-devel \
libstdc++-devel \
libsndfile-devel \
make \

View File

@ -4,11 +4,6 @@ set -ex
install_ubuntu() {
apt-get update
apt-get install -y --no-install-recommends \
libhiredis-dev \
libleveldb-dev \
liblmdb-dev \
libsnappy-dev
# Cleanup
apt-get autoclean && apt-get clean
@ -20,12 +15,6 @@ install_centos() {
# See http://fedoraproject.org/wiki/EPEL
yum --enablerepo=extras install -y epel-release
yum install -y \
hiredis-devel \
leveldb-devel \
lmdb-devel \
snappy-devel
# Cleanup
yum clean all
rm -rf /var/cache/yum

View File

@ -61,6 +61,10 @@ install_ubuntu() {
rocprofiler-dev \
roctracer-dev
if [[ $(ver $ROCM_VERSION) -ge $(ver 6.1) ]]; then
DEBIAN_FRONTEND=noninteractive apt-get install -y --allow-unauthenticated rocm-llvm-dev
fi
# precompiled miopen kernels added in ROCm 3.5, renamed in ROCm 5.5
# search for all unversioned packages
# if search fails it will abort this script; use true to avoid case where search fails

View File

@ -263,10 +263,10 @@ unittest-xml-reporting<=3.2.0,>=2.0.0
#Pinned versions:
#test that import:
#wheel not found on aarch64, and source build requires rust
lintrunner==0.10.7 ; platform_machine == "x86_64"
#lintrunner is supported on aarch64-linux only from 0.12.4 version
lintrunner==0.12.5
#Description: all about linters!
#Pinned versions: 0.10.7
#Pinned versions: 0.12.5
#test that import:
rockset==1.0.3
@ -279,9 +279,9 @@ ghstack==0.8.0
#Pinned versions: 0.8.0
#test that import:
jinja2==3.1.3
jinja2==3.1.4
#Description: jinja2 template engine
#Pinned versions: 3.1.3
#Pinned versions: 3.1.4
#test that import:
pytest-cpp==2.3.0

View File

@ -169,9 +169,11 @@ RUN rm install_acl.sh
ENV INSTALLED_ACL ${ACL}
# Install ccache/sccache (do this last, so we get priority in PATH)
ARG SKIP_SCCACHE_INSTALL
COPY ./common/install_cache.sh install_cache.sh
ENV PATH /opt/cache/bin:$PATH
RUN bash ./install_cache.sh && rm install_cache.sh
RUN if [ -z "${SKIP_SCCACHE_INSTALL}" ]; then bash ./install_cache.sh; fi
RUN rm install_cache.sh
# Add jni.h for java host build
COPY ./common/install_jni.sh install_jni.sh
@ -188,7 +190,9 @@ ARG BUILD_ENVIRONMENT
ENV BUILD_ENVIRONMENT ${BUILD_ENVIRONMENT}
# Install LLVM dev version (Defined in the pytorch/builder github repository)
ARG SKIP_LLVM_SRC_BUILD_INSTALL
COPY --from=pytorch/llvm:9.0.1 /opt/llvm /opt/llvm
RUN if [ -n "${SKIP_LLVM_SRC_BUILD_INSTALL}" ]; then set -eu; rm -rf /opt/llvm; fi
# AWS specific CUDA build guidance
ENV TORCH_CUDA_ARCH_LIST Maxwell

View File

@ -81,7 +81,22 @@ if ! which conda; then
export USE_MKLDNN=0
fi
else
export CMAKE_PREFIX_PATH=/opt/conda
# CMAKE_PREFIX_PATH precedences
# 1. $CONDA_PREFIX, if defined. This follows the pytorch official build instructions.
# 2. /opt/conda/envs/py_${ANACONDA_PYTHON_VERSION}, if ANACONDA_PYTHON_VERSION defined.
# This is for CI, which defines ANACONDA_PYTHON_VERSION but not CONDA_PREFIX.
# 3. $(conda info --base). The fallback value of pytorch official build
# instructions actually refers to this.
# Commonly this is /opt/conda/
if [[ -v CONDA_PREFIX ]]; then
export CMAKE_PREFIX_PATH=${CONDA_PREFIX}
elif [[ -v ANACONDA_PYTHON_VERSION ]]; then
export CMAKE_PREFIX_PATH="/opt/conda/envs/py_${ANACONDA_PYTHON_VERSION}"
else
# already checked by `! which conda`
CMAKE_PREFIX_PATH="$(conda info --base)"
export CMAKE_PREFIX_PATH
fi
# Workaround required for MKL library linkage
# https://github.com/pytorch/pytorch/issues/119557
@ -376,4 +391,8 @@ if [[ "$BUILD_ENVIRONMENT" != *libtorch* && "$BUILD_ENVIRONMENT" != *bazel* ]];
python tools/stats/export_test_times.py
fi
print_sccache_stats
# snadampal: skipping it till sccache support added for aarch64
# https://github.com/pytorch/pytorch/issues/121559
if [[ "$BUILD_ENVIRONMENT" != *aarch64* ]]; then
print_sccache_stats
fi

View File

@ -45,7 +45,10 @@ time python test/run_test.py --verbose -i distributed/test_device_mesh
time python test/run_test.py --verbose -i distributed/tensor/parallel/test_ddp_2d_parallel
time python test/run_test.py --verbose -i distributed/tensor/parallel/test_fsdp_2d_parallel
time python test/run_test.py --verbose -i distributed/tensor/parallel/test_tp_examples
time python test/run_test.py --verbose -i distributed/tensor/parallel/test_tp_random_state.py
time python test/run_test.py --verbose -i distributed/tensor/parallel/test_tp_random_state
# FSDP2 tests
time python test/run_test.py --verbose -i distributed/_composable/fsdp/test_fully_shard_training -- -k test_2d_mlp_with_nd_mesh
# Other tests
time python test/run_test.py --verbose -i test_cuda_primary_ctx

View File

@ -181,6 +181,11 @@ if [[ "$BUILD_ENVIRONMENT" != *-bazel-* ]] ; then
export PATH="$HOME/.local/bin:$PATH"
fi
if [[ "$BUILD_ENVIRONMENT" == *aarch64* ]]; then
# TODO: revisit this once the CI is stabilized on aarch64 linux
export VALGRIND=OFF
fi
install_tlparse
# DANGER WILL ROBINSON. The LD_PRELOAD here could cause you problems
@ -305,22 +310,23 @@ test_dynamo_shard() {
test_inductor_distributed() {
# Smuggle a few multi-gpu tests here so that we don't have to request another large node
echo "Testing multi_gpu tests in test_torchinductor"
pytest test/inductor/test_torchinductor.py -k test_multi_gpu
pytest test/inductor/test_aot_inductor.py -k test_non_default_cuda_device
pytest test/inductor/test_aot_inductor.py -k test_replicate_on_devices
pytest test/distributed/test_c10d_functional_native.py
pytest test/distributed/_tensor/test_dtensor_compile.py
pytest test/distributed/tensor/parallel/test_fsdp_2d_parallel.py
pytest test/distributed/_composable/fsdp/test_fully_shard_comm.py
pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_multi_group
pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_with_activation_checkpointing
pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_2d_mlp
pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_hsdp
pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_2d_transformer_checkpoint_resume
pytest test/distributed/_composable/fsdp/test_fully_shard_frozen.py
pytest test/distributed/_composable/fsdp/test_fully_shard_mixed_precision.py -k test_compute_dtype
pytest test/distributed/_composable/fsdp/test_fully_shard_mixed_precision.py -k test_reduce_dtype
pytest test/distributed/fsdp/test_fsdp_tp_integration.py -k test_fsdp_tp_integration
python test/run_test.py -i inductor/test_torchinductor.py -k test_multi_gpu --verbose
python test/run_test.py -i inductor/test_aot_inductor.py -k test_non_default_cuda_device --verbose
python test/run_test.py -i inductor/test_aot_inductor.py -k test_replicate_on_devices --verbose
python test/run_test.py -i distributed/test_c10d_functional_native.py --verbose
python test/run_test.py -i distributed/_tensor/test_dtensor_compile.py --verbose
python test/run_test.py -i distributed/tensor/parallel/test_fsdp_2d_parallel.py --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_comm.py --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_multi_group --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_with_activation_checkpointing --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_2d_mlp --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_hsdp --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_2d_transformer_checkpoint_resume --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_gradient_accumulation --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_frozen.py --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_mixed_precision.py -k test_compute_dtype --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_mixed_precision.py -k test_reduce_dtype --verbose
python test/run_test.py -i distributed/fsdp/test_fsdp_tp_integration.py -k test_fsdp_tp_integration --verbose
# this runs on both single-gpu and multi-gpu instance. It should be smart about skipping tests that aren't supported
# with if required # gpus aren't available
@ -516,6 +522,11 @@ test_single_dynamo_benchmark() {
fi
}
test_inductor_micro_benchmark() {
TEST_REPORTS_DIR=$(pwd)/test/test-micro-reports
python benchmarks/gpt_fast/benchmark.py
}
test_dynamo_benchmark() {
# Usage: test_dynamo_benchmark huggingface 0
TEST_REPORTS_DIR=$(pwd)/test/test-reports
@ -1152,11 +1163,33 @@ test_executorch() {
assert_git_not_dirty
}
test_linux_aarch64(){
python test/run_test.py --include test_modules test_mkldnn test_mkldnn_fusion test_openmp test_torch test_dynamic_shapes \
test_transformers test_multiprocessing test_numpy_interop --verbose
# Dynamo tests
python test/run_test.py --include dynamo/test_compile dynamo/test_backends dynamo/test_comptime dynamo/test_config \
dynamo/test_functions dynamo/test_fx_passes_pre_grad dynamo/test_interop dynamo/test_model_output dynamo/test_modules \
dynamo/test_optimizers dynamo/test_recompile_ux dynamo/test_recompiles --verbose
# Inductor tests
python test/run_test.py --include inductor/test_torchinductor inductor/test_benchmark_fusion inductor/test_codecache \
inductor/test_config inductor/test_control_flow inductor/test_coordinate_descent_tuner inductor/test_fx_fusion \
inductor/test_group_batch_fusion inductor/test_inductor_freezing inductor/test_inductor_utils \
inductor/test_inplacing_pass inductor/test_kernel_benchmark inductor/test_layout_optim \
inductor/test_max_autotune inductor/test_memory_planning inductor/test_metrics inductor/test_multi_kernel inductor/test_pad_mm \
inductor/test_pattern_matcher inductor/test_perf inductor/test_profiler inductor/test_select_algorithm inductor/test_smoke \
inductor/test_split_cat_fx_passes inductor/test_standalone_compile inductor/test_torchinductor \
inductor/test_torchinductor_codegen_dynamic_shapes inductor/test_torchinductor_dynamic_shapes --verbose
}
if ! [[ "${BUILD_ENVIRONMENT}" == *libtorch* || "${BUILD_ENVIRONMENT}" == *-bazel-* ]]; then
(cd test && python -c "import torch; print(torch.__config__.show())")
(cd test && python -c "import torch; print(torch.__config__.parallel_info())")
fi
if [[ "${TEST_CONFIG}" == *backward* ]]; then
if [[ "$BUILD_ENVIRONMENT" == *aarch64* ]]; then
test_linux_aarch64
elif [[ "${TEST_CONFIG}" == *backward* ]]; then
test_forward_backward_compatibility
# Do NOT add tests after bc check tests, see its comment.
elif [[ "${TEST_CONFIG}" == *xla* ]]; then
@ -1181,6 +1214,8 @@ elif [[ "$TEST_CONFIG" == deploy ]]; then
test_torch_deploy
elif [[ "${TEST_CONFIG}" == *inductor_distributed* ]]; then
test_inductor_distributed
elif [[ "${TEST_CONFIG}" == *inductor-micro-benchmark* ]]; then
test_inductor_micro_benchmark
elif [[ "${TEST_CONFIG}" == *huggingface* ]]; then
install_torchvision
id=$((SHARD_NUMBER-1))

View File

@ -17,22 +17,22 @@ set PATH=C:\Program Files\CMake\bin;C:\Program Files\7-Zip;C:\ProgramData\chocol
set INSTALLER_DIR=%SCRIPT_HELPERS_DIR%\installation-helpers
call %INSTALLER_DIR%\install_magma.bat
if errorlevel 1 exit /b
if not errorlevel 0 exit /b
if errorlevel 1 goto fail
if not errorlevel 0 goto fail
call %INSTALLER_DIR%\install_sccache.bat
if errorlevel 1 exit /b
if not errorlevel 0 exit /b
if errorlevel 1 goto fail
if not errorlevel 0 goto fail
:: Miniconda has been installed as part of the Windows AMI with all the dependencies.
:: We just need to activate it here
call %INSTALLER_DIR%\activate_miniconda3.bat
if errorlevel 1 exit /b
if not errorlevel 0 exit /b
if errorlevel 1 goto fail
if not errorlevel 0 goto fail
call pip install mkl-include==2021.4.0 mkl-devel==2021.4.0
if errorlevel 1 exit /b
if not errorlevel 0 exit /b
if errorlevel 1 goto fail
if not errorlevel 0 goto fail
:: Override VS env here
pushd .
@ -41,8 +41,8 @@ if "%VC_VERSION%" == "" (
) else (
call "C:\Program Files (x86)\Microsoft Visual Studio\%VC_YEAR%\%VC_PRODUCT%\VC\Auxiliary\Build\vcvarsall.bat" x64 -vcvars_ver=%VC_VERSION%
)
if errorlevel 1 exit /b
if not errorlevel 0 exit /b
if errorlevel 1 goto fail
if not errorlevel 0 goto fail
@echo on
popd
@ -52,12 +52,12 @@ set CUDA_PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v%CUDA_VERSION%
if x%CUDA_VERSION:.=%==x%CUDA_VERSION% (
echo CUDA version %CUDA_VERSION% format isn't correct, which doesn't contain '.'
exit /b 1
goto fail
)
rem version transformer, for example 10.1 to 10_1.
if x%CUDA_VERSION:.=%==x%CUDA_VERSION% (
echo CUDA version %CUDA_VERSION% format isn't correct, which doesn't contain '.'
exit /b 1
goto fail
)
set VERSION_SUFFIX=%CUDA_VERSION:.=_%
set CUDA_PATH_V%VERSION_SUFFIX%=%CUDA_PATH%
@ -101,8 +101,8 @@ if "%USE_CUDA%"=="1" (
:: CMake requires a single command as CUDA_NVCC_EXECUTABLE, so we push the wrappers
:: randomtemp.exe and sccache.exe into a batch file which CMake invokes.
curl -kL https://github.com/peterjc123/randomtemp-rust/releases/download/v0.4/randomtemp.exe --output %TMP_DIR_WIN%\bin\randomtemp.exe
if errorlevel 1 exit /b
if not errorlevel 0 exit /b
if errorlevel 1 goto fail
if not errorlevel 0 goto fail
echo @"%TMP_DIR_WIN%\bin\randomtemp.exe" "%TMP_DIR_WIN%\bin\sccache.exe" "%CUDA_PATH%\bin\nvcc.exe" %%* > "%TMP_DIR%/bin/nvcc.bat"
cat %TMP_DIR%/bin/nvcc.bat
set CUDA_NVCC_EXECUTABLE=%TMP_DIR%/bin/nvcc.bat
@ -114,8 +114,8 @@ if "%USE_CUDA%"=="1" (
set
python setup.py bdist_wheel
if errorlevel 1 exit /b
if not errorlevel 0 exit /b
if errorlevel 1 goto fail
if not errorlevel 0 goto fail
sccache --show-stats
python -c "import os, glob; os.system('python -mpip install --no-index --no-deps ' + glob.glob('dist/*.whl')[0])"
(
@ -135,3 +135,8 @@ python -c "import os, glob; os.system('python -mpip install --no-index --no-deps
sccache --show-stats --stats-format json | jq .stats > sccache-stats-%BUILD_ENVIRONMENT%-%OUR_GITHUB_JOB_ID%.json
sccache --stop-server
exit /b 0
:fail
exit /b 1

View File

@ -54,6 +54,7 @@ per-file-ignores =
torch/ao/quantization/fx/_decomposed.py: TOR901
torch/distributed/_functional_collectives.py: TOR901
torch/distributed/_spmd/data_parallel.py: TOR901
torch/distributed/_tensor/_collective_utils.py: TOR901
optional-ascii-coding = True
exclude =
./.git,

View File

@ -21,6 +21,7 @@ self-hosted-runner:
- linux.rocm.gpu
- macos-m1-stable
- macos-m1-13
- macos-m1-14
- macos-12-xl
- macos-12
- macos12.3-m1

View File

@ -1 +1 @@
2c4665ffbb64f03f5d18016d3398af4ac4da5f03
d23a6e1664d20707c11781299611436e1f0c104f

View File

@ -1 +1 @@
58a412cb271a3f98ae2e01fd1d24bdbb66645d4e
e3fc03314dab5f44e3ed9ccbba6c15fbca3285cd

.github/labeler.yml vendored (11 changes)
View File

@ -58,6 +58,17 @@
- third_party/mkl-dnn.BUILD
- torch/csrc/jit/codegen/onednn/**
- test/test_jit_llga_fuser.py
- test/test_mkldnn.py
"ciflow/linux-aarch64":
- third_party/ideep
- caffe2/ideep/**
- caffe2/python/ideep/**
- cmake/Modules/FindMKLDNN.cmake
- third_party/mkl-dnn.BUILD
- torch/csrc/jit/codegen/onednn/**
- test/test_jit_llga_fuser.py
- test/test_mkldnn.py
"module: amp (automated mixed precision)":
- torch/amp/**

View File

@ -29,10 +29,12 @@
approved_by:
- BowenBao
- justinchuby
- liqunfu
- shubhambhokare1
- thiagocrepaldi
- titaiwangms
- wschin
- xadupre
mandatory_checks_name:
- EasyCLA
- Lint

View File

@ -8,6 +8,8 @@ ciflow_push_tags:
- ciflow/binaries_wheel
- ciflow/inductor
- ciflow/inductor-perf-compare
- ciflow/inductor-micro-benchmark
- ciflow/linux-aarch64
- ciflow/mps
- ciflow/nightly
- ciflow/periodic

View File

@ -5,7 +5,7 @@
# functorch/docs/requirements.txt
# .ci/docker/requirements-ci.txt
boto3==1.19.12
jinja2==3.1.3
jinja2==3.1.4
lintrunner==0.10.7
ninja==1.10.0.post1
nvidia-ml-py==11.525.84

View File

@ -1,7 +1,11 @@
#!/bin/bash
set -x
WHEELHOUSE_DIR=/artifacts
if [ -z "$1" ]; then
echo "Need wheel location argument" && exit 1
fi
WHEELHOUSE_DIR=$1
PATCHELF_BIN=patchelf
ROCM_LIB=backends/amd/lib
ROCM_LD=backends/amd/llvm/bin

View File

@ -157,10 +157,10 @@ def build_triton(
if build_rocm:
check_call(
[f"{SCRIPT_DIR}/amd/patch_triton_wheel.sh"],
[f"{SCRIPT_DIR}/amd/patch_triton_wheel.sh", Path.cwd()],
cwd=triton_basedir,
shell=True,
)
return Path.cwd() / whl_path.name

View File

@ -13,16 +13,16 @@ architectures:
import os
from typing import Dict, List, Optional, Tuple
CUDA_ARCHES = ["11.8", "12.1"]
CUDA_ARCHES = ["11.8", "12.1", "12.4"]
CUDA_ARCHES_FULL_VERSION = {"11.8": "11.8.0", "12.1": "12.1.1"}
CUDA_ARCHES_FULL_VERSION = {"11.8": "11.8.0", "12.1": "12.1.1", "12.4": "12.4.0"}
CUDA_ARCHES_CUDNN_VERSION = {"11.8": "8", "12.1": "8"}
CUDA_ARCHES_CUDNN_VERSION = {"11.8": "8", "12.1": "8", "12.4": "8"}
ROCM_ARCHES = ["5.7", "6.0"]
ROCM_ARCHES = ["6.0", "6.1"]
CPU_CXX11_ABI_ARCH = ["cpu-cxx11-abi"]
@ -58,6 +58,20 @@ PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {
"nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'"
),
"12.4": (
"nvidia-cuda-nvrtc-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-runtime-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-cupti-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cudnn-cu12==8.9.7.29; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cublas-cu12==12.4.2.65; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cufft-cu12==11.2.0.44; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-curand-cu12==10.3.5.119; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusolver-cu12==11.6.0.99; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparse-cu12==12.3.0.142; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvtx-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvjitlink-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64'"
),
}
@ -324,7 +338,7 @@ def generate_wheels_matrix(
)
# 12.1 linux wheels require PYTORCH_EXTRA_INSTALL_REQUIREMENTS to install
if arch_version in ["12.1", "11.8"] and os == "linux":
if arch_version in ["12.4", "12.1", "11.8"] and os == "linux":
ret.append(
{
"python_version": python_version,
@ -367,5 +381,6 @@ def generate_wheels_matrix(
return ret
validate_nccl_dep_consistency("12.4")
validate_nccl_dep_consistency("12.1")
validate_nccl_dep_consistency("11.8")

View File

@ -21,6 +21,8 @@ DOCKER_IMAGE_TYPES = ["runtime", "devel"]
def generate_docker_matrix() -> Dict[str, List[Dict[str, str]]]:
ret: List[Dict[str, str]] = []
# CUDA amd64 Docker images are available as both runtime and devel while
# CPU arm64 image is only available as runtime.
for cuda, version in generate_binary_build_matrix.CUDA_ARCHES_FULL_VERSION.items():
for image in DOCKER_IMAGE_TYPES:
ret.append(
@ -31,9 +33,19 @@ def generate_docker_matrix() -> Dict[str, List[Dict[str, str]]]:
cuda
],
"image_type": image,
"platform": "linux/arm64,linux/amd64",
"platform": "linux/amd64",
}
)
ret.append(
{
"cuda": "cpu",
"cuda_full_version": "",
"cudnn_version": "",
"image_type": "runtime",
"platform": "linux/arm64",
}
)
return {"include": ret}

View File

@ -46,7 +46,7 @@ env:
PYTORCH_FINAL_PACKAGE_DIR: /artifacts
PYTORCH_ROOT: /pytorch
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
SKIP_ALL_TESTS: 1
SKIP_ALL_TESTS: 0
!{{ common.concurrency(build_environment) }}
jobs:

View File

@ -48,7 +48,7 @@ env:
BUILD_ENVIRONMENT: !{{ build_environment }}
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
PR_NUMBER: ${{ github.event.pull_request.number }}
SKIP_ALL_TESTS: 1
SKIP_ALL_TESTS: 0
{%- if cross_compile_arm64 %}
CROSS_COMPILE_ARM64: 1
{% endif %}

View File

@ -24,11 +24,6 @@ on:
default: "3.8"
description: |
The python version to be used. Will be 3.8 by default
arch:
required: true
type: string
description: |
Contains the architecture to run the tests with
timeout-minutes:
required: false
type: number
@ -44,7 +39,7 @@ jobs:
# Also ensure that we always run with the right architecture
defaults:
run:
shell: arch -arch ${{ inputs.arch }} bash -e -l {0}
shell: bash -e -l {0}
strategy:
matrix: ${{ fromJSON(inputs.test-matrix) }}
fail-fast: false
@ -133,12 +128,6 @@ jobs:
test-matrix: ${{ inputs.test-matrix }}
job-name: ${{ steps.get-job-id.outputs.job-name }}
- name: Pre-process arm64 wheels
if: inputs.build-environment == 'macos-12-py3-arm64'
run: |
# As wheels are cross-compiled they are reported as x86_64 ones
ORIG_WHLNAME=$(ls -1 dist/*.whl); ARM_WHLNAME=${ORIG_WHLNAME/x86_64/arm64}; mv "${ORIG_WHLNAME}" "${ARM_WHLNAME}"
- name: Set Test step time
id: test-timeout
shell: bash

View File

@ -37,7 +37,7 @@ jobs:
device: ["cuda", "rocm"]
include:
- device: "rocm"
rocm_version: "6.0"
rocm_version: "6.1"
- device: "cuda"
rocm_version: ""
timeout-minutes: 40

View File

@ -7,6 +7,7 @@ on:
- Dockerfile
- docker.Makefile
- .github/workflows/docker-release.yml
- .github/scripts/generate_docker_release_matrix.py
push:
branches:
- nightly
@ -126,20 +127,25 @@ jobs:
run: |
make -f docker.Makefile "${BUILD_IMAGE_TYPE}-image"
- name: Push nightly tags
if: ${{ github.event.ref == 'refs/heads/nightly' && matrix.image_type == 'runtime' }}
if: ${{ github.event.ref == 'refs/heads/nightly' && matrix.image_type == 'runtime' && matrix.build_platforms == 'linux/amd64' }}
run: |
PYTORCH_DOCKER_TAG="${PYTORCH_VERSION}-cuda${CUDA_VERSION_SHORT}-cudnn${CUDNN_VERSION}-runtime"
CUDA_SUFFIX="-cu${CUDA_VERSION}"
PYTORCH_NIGHTLY_COMMIT=$(docker run ghcr.io/pytorch/pytorch-nightly:"${PYTORCH_DOCKER_TAG}" \
python -c 'import torch; print(torch.version.git_version[:7],end="")')
docker tag ghcr.io/pytorch/pytorch-nightly:"${PYTORCH_DOCKER_TAG}" \
ghcr.io/pytorch/pytorch-nightly:"${PYTORCH_NIGHTLY_COMMIT}-cu${CUDA_VERSION}"
docker push ghcr.io/pytorch/pytorch-nightly:"${PYTORCH_NIGHTLY_COMMIT}-cu${CUDA_VERSION}"
ghcr.io/pytorch/pytorch-nightly:"${PYTORCH_NIGHTLY_COMMIT}${CUDA_SUFFIX}"
docker push ghcr.io/pytorch/pytorch-nightly:"${PYTORCH_NIGHTLY_COMMIT}${CUDA_SUFFIX}"
# Please note, here we need to pin a specific version of CUDA for the latest label
if [[ ${CUDA_VERSION_SHORT} == "12.1" ]]; then
docker tag ghcr.io/pytorch/pytorch-nightly:"${PYTORCH_NIGHTLY_COMMIT}${CUDA_SUFFIX}" \
ghcr.io/pytorch/pytorch-nightly:latest
docker push ghcr.io/pytorch/pytorch-nightly:latest
fi
docker tag ghcr.io/pytorch/pytorch-nightly:"${PYTORCH_NIGHTLY_COMMIT}-cu${CUDA_VERSION}" \
ghcr.io/pytorch/pytorch-nightly:latest
docker push ghcr.io/pytorch/pytorch-nightly:latest
- name: Teardown Linux
uses: pytorch/test-infra/.github/actions/teardown-linux@main
if: always()

View File

@ -31,7 +31,7 @@ env:
PYTORCH_FINAL_PACKAGE_DIR: /artifacts
PYTORCH_ROOT: /pytorch
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
SKIP_ALL_TESTS: 1
SKIP_ALL_TESTS: 0
concurrency:
group: linux-aarch64-binary-manywheel-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true

View File

@ -31,7 +31,7 @@ env:
PYTORCH_FINAL_PACKAGE_DIR: /artifacts
PYTORCH_ROOT: /pytorch
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
SKIP_ALL_TESTS: 1
SKIP_ALL_TESTS: 0
concurrency:
group: linux-binary-conda-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true
@ -222,6 +222,69 @@ jobs:
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
conda-py3_8-cuda12_4-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/conda-builder:cuda12.4-main
DESIRED_PYTHON: "3.8"
runs_on: linux.24xlarge
build_name: conda-py3_8-cuda12_4
build_environment: linux-binary-conda
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-py3_8-cuda12_4-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: conda-py3_8-cuda12_4-build
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/conda-builder:cuda12.4-main
DESIRED_PYTHON: "3.8"
build_name: conda-py3_8-cuda12_4
build_environment: linux-binary-conda
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-py3_8-cuda12_4-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: conda-py3_8-cuda12_4-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/conda-builder:cuda12.4-main
DESIRED_PYTHON: "3.8"
build_name: conda-py3_8-cuda12_4
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
conda-py3_9-cpu-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@ -407,6 +470,69 @@ jobs:
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
conda-py3_9-cuda12_4-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/conda-builder:cuda12.4-main
DESIRED_PYTHON: "3.9"
runs_on: linux.24xlarge
build_name: conda-py3_9-cuda12_4
build_environment: linux-binary-conda
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-py3_9-cuda12_4-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: conda-py3_9-cuda12_4-build
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/conda-builder:cuda12.4-main
DESIRED_PYTHON: "3.9"
build_name: conda-py3_9-cuda12_4
build_environment: linux-binary-conda
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-py3_9-cuda12_4-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: conda-py3_9-cuda12_4-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/conda-builder:cuda12.4-main
DESIRED_PYTHON: "3.9"
build_name: conda-py3_9-cuda12_4
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
conda-py3_10-cpu-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@ -592,6 +718,69 @@ jobs:
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
conda-py3_10-cuda12_4-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/conda-builder:cuda12.4-main
DESIRED_PYTHON: "3.10"
runs_on: linux.24xlarge
build_name: conda-py3_10-cuda12_4
build_environment: linux-binary-conda
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-py3_10-cuda12_4-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: conda-py3_10-cuda12_4-build
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/conda-builder:cuda12.4-main
DESIRED_PYTHON: "3.10"
build_name: conda-py3_10-cuda12_4
build_environment: linux-binary-conda
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-py3_10-cuda12_4-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: conda-py3_10-cuda12_4-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/conda-builder:cuda12.4-main
DESIRED_PYTHON: "3.10"
build_name: conda-py3_10-cuda12_4
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
conda-py3_11-cpu-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@ -777,6 +966,69 @@ jobs:
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
conda-py3_11-cuda12_4-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/conda-builder:cuda12.4-main
DESIRED_PYTHON: "3.11"
runs_on: linux.24xlarge
build_name: conda-py3_11-cuda12_4
build_environment: linux-binary-conda
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-py3_11-cuda12_4-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: conda-py3_11-cuda12_4-build
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/conda-builder:cuda12.4-main
DESIRED_PYTHON: "3.11"
build_name: conda-py3_11-cuda12_4
build_environment: linux-binary-conda
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-py3_11-cuda12_4-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: conda-py3_11-cuda12_4-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/conda-builder:cuda12.4-main
DESIRED_PYTHON: "3.11"
build_name: conda-py3_11-cuda12_4
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
conda-py3_12-cpu-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@ -961,3 +1213,66 @@ jobs:
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
conda-py3_12-cuda12_4-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/conda-builder:cuda12.4-main
DESIRED_PYTHON: "3.12"
runs_on: linux.24xlarge
build_name: conda-py3_12-cuda12_4
build_environment: linux-binary-conda
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-py3_12-cuda12_4-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: conda-py3_12-cuda12_4-build
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/conda-builder:cuda12.4-main
DESIRED_PYTHON: "3.12"
build_name: conda-py3_12-cuda12_4
build_environment: linux-binary-conda
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-py3_12-cuda12_4-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: conda-py3_12-cuda12_4-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/conda-builder:cuda12.4-main
DESIRED_PYTHON: "3.12"
build_name: conda-py3_12-cuda12_4
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml

View File

@ -26,7 +26,7 @@ env:
PYTORCH_FINAL_PACKAGE_DIR: /artifacts
PYTORCH_ROOT: /pytorch
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
SKIP_ALL_TESTS: 1
SKIP_ALL_TESTS: 0
concurrency:
group: linux-binary-libtorch-cxx11-abi-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true

View File

@ -31,7 +31,7 @@ env:
PYTORCH_FINAL_PACKAGE_DIR: /artifacts
PYTORCH_ROOT: /pytorch
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
SKIP_ALL_TESTS: 1
SKIP_ALL_TESTS: 0
concurrency:
group: linux-binary-libtorch-cxx11-abi-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true
@ -229,7 +229,7 @@ jobs:
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
libtorch-rocm5_7-shared-with-deps-cxx11-abi-build:
libtorch-cuda12_4-shared-with-deps-cxx11-abi-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
with:
@ -238,97 +238,56 @@ jobs:
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm5.7
GPU_ARCH_VERSION: 5.7
GPU_ARCH_TYPE: rocm
DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm5.7-main
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda12.4-main
LIBTORCH_VARIANT: shared-with-deps
DESIRED_DEVTOOLSET: cxx11-abi
build_name: libtorch-rocm5_7-shared-with-deps-cxx11-abi
build_name: libtorch-cuda12_4-shared-with-deps-cxx11-abi
build_environment: linux-binary-libtorch-cxx11-abi
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
libtorch-rocm5_7-shared-with-deps-cxx11-abi-test: # Testing
libtorch-cuda12_4-shared-with-deps-cxx11-abi-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: libtorch-rocm5_7-shared-with-deps-cxx11-abi-build
runs-on: linux.rocm.gpu
timeout-minutes: 240
env:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm5.7
GPU_ARCH_VERSION: 5.7
GPU_ARCH_TYPE: rocm
SKIP_ALL_TESTS: 1
DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm5.7-main
LIBTORCH_VARIANT: shared-with-deps
DESIRED_DEVTOOLSET: cxx11-abi
steps:
- name: Setup ROCm
uses: ./.github/actions/setup-rocm
- uses: actions/download-artifact@v3
name: Download Build Artifacts
with:
name: libtorch-rocm5_7-shared-with-deps-cxx11-abi
path: "${{ runner.temp }}/artifacts/"
- name: Checkout PyTorch
uses: malfet/checkout@silent-checkout
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
quiet-checkout: true
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: Checkout pytorch/builder
uses: malfet/checkout@silent-checkout
with:
ref: main
submodules: recursive
repository: pytorch/builder
path: builder
quiet-checkout: true
- name: Clean pytorch/builder checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: builder
- name: ROCm set GPU_FLAG
run: |
echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}"
- name: Pull Docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
with:
docker-image: pytorch/libtorch-cxx11-builder:rocm5.7-main
- name: Test Pytorch binary
uses: ./pytorch/.github/actions/test-pytorch-binary
- name: Teardown ROCm
uses: ./.github/actions/teardown-rocm
libtorch-rocm5_7-shared-with-deps-cxx11-abi-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: libtorch-rocm5_7-shared-with-deps-cxx11-abi-test
needs: libtorch-cuda12_4-shared-with-deps-cxx11-abi-build
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm5.7
GPU_ARCH_VERSION: 5.7
GPU_ARCH_TYPE: rocm
DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm5.7-main
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda12.4-main
LIBTORCH_VARIANT: shared-with-deps
DESIRED_DEVTOOLSET: cxx11-abi
build_name: libtorch-rocm5_7-shared-with-deps-cxx11-abi
build_name: libtorch-cuda12_4-shared-with-deps-cxx11-abi
build_environment: linux-binary-libtorch-cxx11-abi
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
libtorch-cuda12_4-shared-with-deps-cxx11-abi-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: libtorch-cuda12_4-shared-with-deps-cxx11-abi-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda12.4-main
LIBTORCH_VARIANT: shared-with-deps
DESIRED_DEVTOOLSET: cxx11-abi
build_name: libtorch-cuda12_4-shared-with-deps-cxx11-abi
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
@ -440,3 +399,109 @@ jobs:
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
libtorch-rocm6_1-shared-with-deps-cxx11-abi-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm6.1
GPU_ARCH_VERSION: 6.1
GPU_ARCH_TYPE: rocm
DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm6.1-main
LIBTORCH_VARIANT: shared-with-deps
DESIRED_DEVTOOLSET: cxx11-abi
build_name: libtorch-rocm6_1-shared-with-deps-cxx11-abi
build_environment: linux-binary-libtorch-cxx11-abi
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
libtorch-rocm6_1-shared-with-deps-cxx11-abi-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: libtorch-rocm6_1-shared-with-deps-cxx11-abi-build
runs-on: linux.rocm.gpu
timeout-minutes: 240
env:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm6.1
GPU_ARCH_VERSION: 6.1
GPU_ARCH_TYPE: rocm
SKIP_ALL_TESTS: 1
DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm6.1-main
LIBTORCH_VARIANT: shared-with-deps
DESIRED_DEVTOOLSET: cxx11-abi
steps:
- name: Setup ROCm
uses: ./.github/actions/setup-rocm
- uses: actions/download-artifact@v3
name: Download Build Artifacts
with:
name: libtorch-rocm6_1-shared-with-deps-cxx11-abi
path: "${{ runner.temp }}/artifacts/"
- name: Checkout PyTorch
uses: malfet/checkout@silent-checkout
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
quiet-checkout: true
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: Checkout pytorch/builder
uses: malfet/checkout@silent-checkout
with:
ref: main
submodules: recursive
repository: pytorch/builder
path: builder
quiet-checkout: true
- name: Clean pytorch/builder checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: builder
- name: ROCm set GPU_FLAG
run: |
echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}"
- name: Pull Docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
with:
docker-image: pytorch/libtorch-cxx11-builder:rocm6.1-main
- name: Test Pytorch binary
uses: ./pytorch/.github/actions/test-pytorch-binary
- name: Teardown ROCm
uses: ./.github/actions/teardown-rocm
libtorch-rocm6_1-shared-with-deps-cxx11-abi-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: libtorch-rocm6_1-shared-with-deps-cxx11-abi-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm6.1
GPU_ARCH_VERSION: 6.1
GPU_ARCH_TYPE: rocm
DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm6.1-main
LIBTORCH_VARIANT: shared-with-deps
DESIRED_DEVTOOLSET: cxx11-abi
build_name: libtorch-rocm6_1-shared-with-deps-cxx11-abi
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
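
The rocm6.1 jobs added above follow the same three-stage pattern as every other binary configuration in this workflow: a build job that calls a reusable build workflow, a test job gated on the build, and an upload job gated on the test. A minimal sketch of that chaining, with placeholder workflow names and inputs rather than the real generated ones:
```
name: example-binary-pipeline
on: workflow_dispatch
jobs:
  example-build:
    uses: ./.github/workflows/_example-binary-build.yml   # hypothetical reusable workflow
    with:
      build_name: example-libtorch
  example-test:
    needs: example-build            # runs only after the build artifact exists
    uses: ./.github/workflows/_example-binary-test.yml
    with:
      build_name: example-libtorch
  example-upload:
    needs: example-test             # uploads only binaries that passed the smoke test
    uses: ./.github/workflows/_example-binary-upload.yml
    secrets:
      github-token: ${{ secrets.GITHUB_TOKEN }}
```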

@ -26,7 +26,7 @@ env:
PYTORCH_FINAL_PACKAGE_DIR: /artifacts
PYTORCH_ROOT: /pytorch
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
SKIP_ALL_TESTS: 1
SKIP_ALL_TESTS: 0
concurrency:
group: linux-binary-libtorch-pre-cxx11-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true
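
Two small but consequential knobs appear in this hunk and in the similar hunks that follow: SKIP_ALL_TESTS flips from 1 to 0, which, as the name suggests, should stop these binary workflows from skipping their smoke tests, and the concurrency block keys each run to the PR number or branch so that a newer push cancels the older run. A self-contained illustration of the concurrency behaviour (workflow name and group key are made up):
```
name: example-concurrency
on: [push, workflow_dispatch]
concurrency:
  # One slot per PR (or branch); a second run with the same key cancels the first.
  group: example-${{ github.event.pull_request.number || github.ref_name }}
  cancel-in-progress: true
jobs:
  demo:
    runs-on: ubuntu-latest
    steps:
      - run: echo "only the newest run for this group keeps going"
```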

@ -31,7 +31,7 @@ env:
PYTORCH_FINAL_PACKAGE_DIR: /artifacts
PYTORCH_ROOT: /pytorch
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
SKIP_ALL_TESTS: 1
SKIP_ALL_TESTS: 0
concurrency:
group: linux-binary-libtorch-pre-cxx11-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true
@ -229,7 +229,7 @@ jobs:
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
libtorch-rocm5_7-shared-with-deps-pre-cxx11-build:
libtorch-cuda12_4-shared-with-deps-pre-cxx11-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
with:
@ -238,97 +238,56 @@ jobs:
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm5.7
GPU_ARCH_VERSION: 5.7
GPU_ARCH_TYPE: rocm
DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.7-main
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.4-main
LIBTORCH_VARIANT: shared-with-deps
DESIRED_DEVTOOLSET: pre-cxx11
build_name: libtorch-rocm5_7-shared-with-deps-pre-cxx11
build_name: libtorch-cuda12_4-shared-with-deps-pre-cxx11
build_environment: linux-binary-libtorch-pre-cxx11
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
libtorch-rocm5_7-shared-with-deps-pre-cxx11-test: # Testing
libtorch-cuda12_4-shared-with-deps-pre-cxx11-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: libtorch-rocm5_7-shared-with-deps-pre-cxx11-build
runs-on: linux.rocm.gpu
timeout-minutes: 240
env:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm5.7
GPU_ARCH_VERSION: 5.7
GPU_ARCH_TYPE: rocm
SKIP_ALL_TESTS: 1
DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.7-main
LIBTORCH_VARIANT: shared-with-deps
DESIRED_DEVTOOLSET: pre-cxx11
steps:
- name: Setup ROCm
uses: ./.github/actions/setup-rocm
- uses: actions/download-artifact@v3
name: Download Build Artifacts
with:
name: libtorch-rocm5_7-shared-with-deps-pre-cxx11
path: "${{ runner.temp }}/artifacts/"
- name: Checkout PyTorch
uses: malfet/checkout@silent-checkout
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
quiet-checkout: true
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: Checkout pytorch/builder
uses: malfet/checkout@silent-checkout
with:
ref: main
submodules: recursive
repository: pytorch/builder
path: builder
quiet-checkout: true
- name: Clean pytorch/builder checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: builder
- name: ROCm set GPU_FLAG
run: |
echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}"
- name: Pull Docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
with:
docker-image: pytorch/manylinux-builder:rocm5.7-main
- name: Test Pytorch binary
uses: ./pytorch/.github/actions/test-pytorch-binary
- name: Teardown ROCm
uses: ./.github/actions/teardown-rocm
libtorch-rocm5_7-shared-with-deps-pre-cxx11-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: libtorch-rocm5_7-shared-with-deps-pre-cxx11-test
needs: libtorch-cuda12_4-shared-with-deps-pre-cxx11-build
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm5.7
GPU_ARCH_VERSION: 5.7
GPU_ARCH_TYPE: rocm
DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.7-main
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.4-main
LIBTORCH_VARIANT: shared-with-deps
DESIRED_DEVTOOLSET: pre-cxx11
build_name: libtorch-rocm5_7-shared-with-deps-pre-cxx11
build_name: libtorch-cuda12_4-shared-with-deps-pre-cxx11
build_environment: linux-binary-libtorch-pre-cxx11
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
libtorch-cuda12_4-shared-with-deps-pre-cxx11-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: libtorch-cuda12_4-shared-with-deps-pre-cxx11-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.4-main
LIBTORCH_VARIANT: shared-with-deps
DESIRED_DEVTOOLSET: pre-cxx11
build_name: libtorch-cuda12_4-shared-with-deps-pre-cxx11
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
@ -440,3 +399,109 @@ jobs:
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
libtorch-rocm6_1-shared-with-deps-pre-cxx11-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm6.1
GPU_ARCH_VERSION: 6.1
GPU_ARCH_TYPE: rocm
DOCKER_IMAGE: pytorch/manylinux-builder:rocm6.1-main
LIBTORCH_VARIANT: shared-with-deps
DESIRED_DEVTOOLSET: pre-cxx11
build_name: libtorch-rocm6_1-shared-with-deps-pre-cxx11
build_environment: linux-binary-libtorch-pre-cxx11
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
libtorch-rocm6_1-shared-with-deps-pre-cxx11-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: libtorch-rocm6_1-shared-with-deps-pre-cxx11-build
runs-on: linux.rocm.gpu
timeout-minutes: 240
env:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm6.1
GPU_ARCH_VERSION: 6.1
GPU_ARCH_TYPE: rocm
SKIP_ALL_TESTS: 1
DOCKER_IMAGE: pytorch/manylinux-builder:rocm6.1-main
LIBTORCH_VARIANT: shared-with-deps
DESIRED_DEVTOOLSET: pre-cxx11
steps:
- name: Setup ROCm
uses: ./.github/actions/setup-rocm
- uses: actions/download-artifact@v3
name: Download Build Artifacts
with:
name: libtorch-rocm6_1-shared-with-deps-pre-cxx11
path: "${{ runner.temp }}/artifacts/"
- name: Checkout PyTorch
uses: malfet/checkout@silent-checkout
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
quiet-checkout: true
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: Checkout pytorch/builder
uses: malfet/checkout@silent-checkout
with:
ref: main
submodules: recursive
repository: pytorch/builder
path: builder
quiet-checkout: true
- name: Clean pytorch/builder checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: builder
- name: ROCm set GPU_FLAG
run: |
echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}"
- name: Pull Docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
with:
docker-image: pytorch/manylinux-builder:rocm6.1-main
- name: Test Pytorch binary
uses: ./pytorch/.github/actions/test-pytorch-binary
- name: Teardown ROCm
uses: ./.github/actions/teardown-rocm
libtorch-rocm6_1-shared-with-deps-pre-cxx11-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: libtorch-rocm6_1-shared-with-deps-pre-cxx11-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm6.1
GPU_ARCH_VERSION: 6.1
GPU_ARCH_TYPE: rocm
DOCKER_IMAGE: pytorch/manylinux-builder:rocm6.1-main
LIBTORCH_VARIANT: shared-with-deps
DESIRED_DEVTOOLSET: pre-cxx11
build_name: libtorch-rocm6_1-shared-with-deps-pre-cxx11
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml

@ -26,7 +26,7 @@ env:
PYTORCH_FINAL_PACKAGE_DIR: /artifacts
PYTORCH_ROOT: /pytorch
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
SKIP_ALL_TESTS: 1
SKIP_ALL_TESTS: 0
concurrency:
group: linux-binary-manywheel-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true

File diff suppressed because it is too large.

@ -26,7 +26,7 @@ env:
BUILD_ENVIRONMENT: macos-arm64-binary-conda
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
PR_NUMBER: ${{ github.event.pull_request.number }}
SKIP_ALL_TESTS: 1
SKIP_ALL_TESTS: 0
concurrency:
group: macos-arm64-binary-conda-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true

@ -26,7 +26,7 @@ env:
BUILD_ENVIRONMENT: macos-arm64-binary-libtorch-cxx11-abi
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
PR_NUMBER: ${{ github.event.pull_request.number }}
SKIP_ALL_TESTS: 1
SKIP_ALL_TESTS: 0
concurrency:
group: macos-arm64-binary-libtorch-cxx11-abi-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true

@ -26,7 +26,7 @@ env:
BUILD_ENVIRONMENT: macos-arm64-binary-wheel
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
PR_NUMBER: ${{ github.event.pull_request.number }}
SKIP_ALL_TESTS: 1
SKIP_ALL_TESTS: 0
concurrency:
group: macos-arm64-binary-wheel-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true

File diff suppressed because it is too large.

@ -800,3 +800,260 @@ jobs:
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
libtorch-cuda12_4-shared-with-deps-debug-build:
if: ${{ github.repository_owner == 'pytorch' }}
runs-on: windows.4xlarge.nonephemeral
timeout-minutes: 240
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
BUILDER_ROOT: ${{ github.workspace }}/builder
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
LIBTORCH_CONFIG: debug
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.8"
steps:
- name: Display EC2 information
shell: bash
run: |
set -euo pipefail
function get_ec2_metadata() {
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
echo "instance-type: $(get_ec2_metadata instance-type)"
echo "system info $(uname -a)"
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
uses: pytorch/test-infra/.github/actions/setup-ssh@main
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
run: |
Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1
# Since it's just a defensive command, the workflow should continue even the command fails. This step can be
# removed once Windows Defender is removed from the AMI
- name: Disables Windows Defender scheduled and real-time scanning for files in directories used by PyTorch
continue-on-error: true
shell: powershell
run: |
Add-MpPreference -ExclusionPath $(Get-Location).tostring(),$Env:TEMP -ErrorAction Ignore
# Let's both exclude the path and disable Windows Defender completely just to be sure
# that it doesn't interfere
Set-MpPreference -DisableRealtimeMonitoring $True -ErrorAction Ignore
# NOTE: These environment variables are put here so that they can be applied on every job equally
# They are also here because setting them at a workflow level doesn't give us access to the
# runner.temp variable, which we need.
- name: Populate binary env
shell: bash
run: |
echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}"
- name: Checkout PyTorch
uses: malfet/checkout@silent-checkout
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
quiet-checkout: true
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: Checkout pytorch/builder
uses: malfet/checkout@silent-checkout
with:
ref: main
submodules: recursive
repository: pytorch/builder
path: builder
quiet-checkout: true
- name: Clean pytorch/builder checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: builder
- name: Populate binary env
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh"
- name: Build PyTorch binary
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh"
- uses: actions/upload-artifact@v3
if: always()
with:
name: libtorch-cuda12_4-shared-with-deps-debug
retention-days: 14
if-no-files-found: error
path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"
- name: Wait until all sessions have drained
shell: powershell
working-directory: pytorch
if: always()
timeout-minutes: 120
run: |
.github\scripts\wait_for_ssh_to_drain.ps1
- name: Kill active ssh sessions if still around (Useful if workflow was cancelled)
shell: powershell
working-directory: pytorch
if: always()
run: |
.github\scripts\kill_active_ssh_sessions.ps1
libtorch-cuda12_4-shared-with-deps-debug-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: libtorch-cuda12_4-shared-with-deps-debug-build
runs-on: windows.8xlarge.nvidia.gpu
timeout-minutes: 240
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
BUILDER_ROOT: ${{ github.workspace }}/builder
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
LIBTORCH_CONFIG: debug
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.8"
steps:
- name: Display EC2 information
shell: bash
run: |
set -euo pipefail
function get_ec2_metadata() {
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
echo "instance-type: $(get_ec2_metadata instance-type)"
echo "system info $(uname -a)"
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
uses: pytorch/test-infra/.github/actions/setup-ssh@main
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
run: |
Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1
# Since it's just a defensive command, the workflow should continue even the command fails. This step can be
# removed once Windows Defender is removed from the AMI
- name: Disables Windows Defender scheduled and real-time scanning for files in directories used by PyTorch
continue-on-error: true
shell: powershell
run: |
Add-MpPreference -ExclusionPath $(Get-Location).tostring(),$Env:TEMP -ErrorAction Ignore
# Let's both exclude the path and disable Windows Defender completely just to be sure
# that it doesn't interfere
Set-MpPreference -DisableRealtimeMonitoring $True -ErrorAction Ignore
# NOTE: These environment variables are put here so that they can be applied on every job equally
# They are also here because setting them at a workflow level doesn't give us access to the
# runner.temp variable, which we need.
- name: Populate binary env
shell: bash
run: |
echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}"
- uses: actions/download-artifact@v3
name: Download Build Artifacts
with:
name: libtorch-cuda12_4-shared-with-deps-debug
path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"
- name: Checkout PyTorch
uses: malfet/checkout@silent-checkout
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
quiet-checkout: true
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: Checkout pytorch/builder
uses: malfet/checkout@silent-checkout
with:
ref: main
submodules: recursive
repository: pytorch/builder
path: builder
quiet-checkout: true
- name: Clean pytorch/builder checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: builder
- name: Populate binary env
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh"
- name: Test PyTorch binary
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh"
- name: Wait until all sessions have drained
shell: powershell
working-directory: pytorch
if: always()
timeout-minutes: 120
run: |
.github\scripts\wait_for_ssh_to_drain.ps1
- name: Kill active ssh sessions if still around (Useful if workflow was cancelled)
shell: powershell
working-directory: pytorch
if: always()
run: |
.github\scripts\kill_active_ssh_sessions.ps1
libtorch-cuda12_4-shared-with-deps-debug-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: libtorch-cuda12_4-shared-with-deps-debug-test
with:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
BUILDER_ROOT: ${{ github.workspace }}/builder
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
LIBTORCH_CONFIG: debug
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.8"
build_name: libtorch-cuda12_4-shared-with-deps-debug
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
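
A recurring idiom in the Windows jobs above is the "Populate binary env" step: it appends KEY=value lines to the file that GITHUB_ENV points at, and GitHub Actions then exposes those values as environment variables to every later step in the same job. A minimal, self-contained example of the mechanism (the variable name is invented):
```
name: example-github-env
on: workflow_dispatch
jobs:
  demo:
    runs-on: ubuntu-latest
    steps:
      - name: Export a value for later steps
        shell: bash
        run: echo "EXAMPLE_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
      - name: Use the exported value
        shell: bash
        run: echo "artifacts will be written to ${EXAMPLE_PACKAGE_DIR}"
```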

@ -800,3 +800,260 @@ jobs:
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
libtorch-cuda12_4-shared-with-deps-release-build:
if: ${{ github.repository_owner == 'pytorch' }}
runs-on: windows.4xlarge.nonephemeral
timeout-minutes: 240
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
BUILDER_ROOT: ${{ github.workspace }}/builder
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.8"
steps:
- name: Display EC2 information
shell: bash
run: |
set -euo pipefail
function get_ec2_metadata() {
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
echo "instance-type: $(get_ec2_metadata instance-type)"
echo "system info $(uname -a)"
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
uses: pytorch/test-infra/.github/actions/setup-ssh@main
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
run: |
Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1
# Since it's just a defensive command, the workflow should continue even the command fails. This step can be
# removed once Windows Defender is removed from the AMI
- name: Disables Windows Defender scheduled and real-time scanning for files in directories used by PyTorch
continue-on-error: true
shell: powershell
run: |
Add-MpPreference -ExclusionPath $(Get-Location).tostring(),$Env:TEMP -ErrorAction Ignore
# Let's both exclude the path and disable Windows Defender completely just to be sure
# that it doesn't interfere
Set-MpPreference -DisableRealtimeMonitoring $True -ErrorAction Ignore
# NOTE: These environment variables are put here so that they can be applied on every job equally
# They are also here because setting them at a workflow level doesn't give us access to the
# runner.temp variable, which we need.
- name: Populate binary env
shell: bash
run: |
echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}"
- name: Checkout PyTorch
uses: malfet/checkout@silent-checkout
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
quiet-checkout: true
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: Checkout pytorch/builder
uses: malfet/checkout@silent-checkout
with:
ref: main
submodules: recursive
repository: pytorch/builder
path: builder
quiet-checkout: true
- name: Clean pytorch/builder checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: builder
- name: Populate binary env
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh"
- name: Build PyTorch binary
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh"
- uses: actions/upload-artifact@v3
if: always()
with:
name: libtorch-cuda12_4-shared-with-deps-release
retention-days: 14
if-no-files-found: error
path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"
- name: Wait until all sessions have drained
shell: powershell
working-directory: pytorch
if: always()
timeout-minutes: 120
run: |
.github\scripts\wait_for_ssh_to_drain.ps1
- name: Kill active ssh sessions if still around (Useful if workflow was cancelled)
shell: powershell
working-directory: pytorch
if: always()
run: |
.github\scripts\kill_active_ssh_sessions.ps1
libtorch-cuda12_4-shared-with-deps-release-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: libtorch-cuda12_4-shared-with-deps-release-build
runs-on: windows.8xlarge.nvidia.gpu
timeout-minutes: 240
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
BUILDER_ROOT: ${{ github.workspace }}/builder
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.8"
steps:
- name: Display EC2 information
shell: bash
run: |
set -euo pipefail
function get_ec2_metadata() {
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
echo "instance-type: $(get_ec2_metadata instance-type)"
echo "system info $(uname -a)"
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
uses: pytorch/test-infra/.github/actions/setup-ssh@main
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
run: |
Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1
# Since it's just a defensive command, the workflow should continue even the command fails. This step can be
# removed once Windows Defender is removed from the AMI
- name: Disables Windows Defender scheduled and real-time scanning for files in directories used by PyTorch
continue-on-error: true
shell: powershell
run: |
Add-MpPreference -ExclusionPath $(Get-Location).tostring(),$Env:TEMP -ErrorAction Ignore
# Let's both exclude the path and disable Windows Defender completely just to be sure
# that it doesn't interfere
Set-MpPreference -DisableRealtimeMonitoring $True -ErrorAction Ignore
# NOTE: These environment variables are put here so that they can be applied on every job equally
# They are also here because setting them at a workflow level doesn't give us access to the
# runner.temp variable, which we need.
- name: Populate binary env
shell: bash
run: |
echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}"
- uses: actions/download-artifact@v3
name: Download Build Artifacts
with:
name: libtorch-cuda12_4-shared-with-deps-release
path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"
- name: Checkout PyTorch
uses: malfet/checkout@silent-checkout
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
quiet-checkout: true
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: Checkout pytorch/builder
uses: malfet/checkout@silent-checkout
with:
ref: main
submodules: recursive
repository: pytorch/builder
path: builder
quiet-checkout: true
- name: Clean pytorch/builder checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: builder
- name: Populate binary env
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh"
- name: Test PyTorch binary
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh"
- name: Wait until all sessions have drained
shell: powershell
working-directory: pytorch
if: always()
timeout-minutes: 120
run: |
.github\scripts\wait_for_ssh_to_drain.ps1
- name: Kill active ssh sessions if still around (Useful if workflow was cancelled)
shell: powershell
working-directory: pytorch
if: always()
run: |
.github\scripts\kill_active_ssh_sessions.ps1
libtorch-cuda12_4-shared-with-deps-release-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: libtorch-cuda12_4-shared-with-deps-release-test
with:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
BUILDER_ROOT: ${{ github.workspace }}/builder
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.8"
build_name: libtorch-cuda12_4-shared-with-deps-release
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
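
The split between the -build and -test jobs above relies on the standard artifact hand-off: the build job publishes the package directory with actions/upload-artifact, and the test job pulls it back under the same artifact name before running its checks. A stripped-down sketch of that hand-off (job names, file names, and paths are illustrative):
```
name: example-artifact-handoff
on: workflow_dispatch
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - run: mkdir -p out && echo demo > out/libexample.txt
      - uses: actions/upload-artifact@v3
        with:
          name: example-binary          # the key the test job downloads by
          if-no-files-found: error
          path: out
  test:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/download-artifact@v3
        with:
          name: example-binary
          path: ${{ runner.temp }}/artifacts
      - run: ls -R "${{ runner.temp }}/artifacts"
```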

File diff suppressed because it is too large.

@ -0,0 +1,40 @@
name: inductor-micro-benchmark
on:
schedule:
- cron: 0 7 * * *
push:
tags:
- ciflow/inductor-micro-benchmark/*
workflow_dispatch:
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }}
cancel-in-progress: true
permissions: read-all
jobs:
linux-focal-cuda12_1-py3_10-gcc9-inductor-micro-benchmark-build:
name: cuda12.1-py3.10-gcc9-sm80
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-focal-cuda12.1-py3.10-gcc9-sm80
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9-inductor-benchmarks
cuda-arch-list: '8.0'
test-matrix: |
{ include: [
{ config: "inductor-micro-benchmark", shard: 1, num_shards: 1, runner: "linux.gcp.a100" },
]}
linux-focal-cuda12_1-py3_10-gcc9-inductor-micro-benchmark-test:
name: cuda12.1-py3.10-gcc9-sm80
uses: ./.github/workflows/_linux-test.yml
needs: linux-focal-cuda12_1-py3_10-gcc9-inductor-micro-benchmark-build
with:
build-environment: linux-focal-cuda12.1-py3.10-gcc9-sm80
docker-image: ${{ needs.linux-focal-cuda12_1-py3_10-gcc9-inductor-micro-benchmark-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_1-py3_10-gcc9-inductor-micro-benchmark-build.outputs.test-matrix }}
use-gha: anything-non-empty-to-use-gha
timeout-minutes: 720
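
The new inductor-micro-benchmark workflow above can start in three ways: on the nightly cron, when a tag matching ciflow/inductor-micro-benchmark/* is pushed, or manually through workflow_dispatch (the "Run workflow" button or gh workflow run). A generic sketch of the same trigger block with the cron field decoded; the tag prefix here is a placeholder:
```
name: example-triggers
on:
  schedule:
    - cron: '0 7 * * *'           # minute hour day-of-month month weekday -> daily at 07:00 UTC
  push:
    tags:
      - ciflow/example-label/*    # pushing a matching tag starts a run
  workflow_dispatch:              # manual trigger from the Actions UI or the gh CLI
jobs:
  demo:
    runs-on: ubuntu-latest
    steps:
      - run: echo "started by ${{ github.event_name }}"
```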

@ -16,28 +16,28 @@ concurrency:
permissions: read-all
jobs:
linux-focal-rocm6_0-py3_8-inductor-build:
name: rocm6.0-py3.8-inductor
linux-focal-rocm6_1-py3_8-inductor-build:
name: rocm6.1-py3.8-inductor
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-focal-rocm6.0-py3.8
build-environment: linux-focal-rocm6.1-py3.8
docker-image-name: pytorch-linux-focal-rocm-n-py3
test-matrix: |
{ include: [
{ config: "inductor", shard: 1, num_shards: 1, runner: "linux.rocm.gpu.2" },
]}
linux-focal-rocm6_0-py3_8-inductor-test:
linux-focal-rocm6_1-py3_8-inductor-test:
permissions:
id-token: write
contents: read
name: rocm6.0-py3.8-inductor
name: rocm6.1-py3.8-inductor
uses: ./.github/workflows/_rocm-test.yml
needs: linux-focal-rocm6_0-py3_8-inductor-build
needs: linux-focal-rocm6_1-py3_8-inductor-build
with:
build-environment: linux-focal-rocm6.0-py3.8
docker-image: ${{ needs.linux-focal-rocm6_0-py3_8-inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-rocm6_0-py3_8-inductor-build.outputs.test-matrix }}
build-environment: linux-focal-rocm6.1-py3.8
docker-image: ${{ needs.linux-focal-rocm6_1-py3_8-inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-rocm6_1-py3_8-inductor-build.outputs.test-matrix }}
linux-focal-cuda12_1-py3_10-gcc9-inductor-build:
name: cuda12.1-py3.10-gcc9-sm86

@ -230,11 +230,11 @@ jobs:
with:
submodules: false
fetch-depth: 1
- name: Setup Python 3.5
- name: Setup Python 3.6
if: matrix.test_type == 'older_python_version'
uses: actions/setup-python@v4
with:
python-version: '3.5'
python-version: '3.6'
architecture: x64
check-latest: false
cache: pip

.github/workflows/linux-aarch64.yml (new file)

@ -0,0 +1,39 @@
name: linux-aarch64
on:
push:
tags:
- ciflow/linux-aarch64/*
workflow_dispatch:
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }}
cancel-in-progress: true
jobs:
linux-jammy-aarch64-py3_10-build:
name: linux-jammy-aarch64-py3.10
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-jammy-aarch64-py3.10
docker-image-name: pytorch-linux-jammy-aarch64-py3.10-gcc11
runner: linux.arm64.2xlarge
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 4, runner: "linux.arm64.2xlarge" },
{ config: "default", shard: 2, num_shards: 4, runner: "linux.arm64.2xlarge" },
{ config: "default", shard: 3, num_shards: 4, runner: "linux.arm64.2xlarge" },
{ config: "default", shard: 4, num_shards: 4, runner: "linux.arm64.2xlarge" },
]}
linux-jammy-aarch64-py3_10-test:
name: linux-jammy-aarch64-py3.10
uses: ./.github/workflows/_linux-test.yml
needs: linux-jammy-aarch64-py3_10-build
permissions:
id-token: write
contents: read
with:
build-environment: linux-jammy-aarch64-py3.10
docker-image: ${{ needs.linux-jammy-aarch64-py3_10-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-aarch64-py3_10-build.outputs.test-matrix }}
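
The test-matrix values above are plain JSON strings that the reusable test workflow expands into one job per entry. The sketch below shows the standard GitHub Actions way to do that with fromJSON; it is a generic illustration, not a copy of _linux-test.yml:
```
name: example-test-matrix
on:
  workflow_call:
    inputs:
      test-matrix:
        required: true
        type: string    # JSON with the same {"include": [...]} shape as above
jobs:
  test:
    strategy:
      fail-fast: false
      matrix: ${{ fromJSON(inputs.test-matrix) }}   # one job per include entry
    runs-on: ${{ matrix.runner }}
    steps:
      - run: echo "config=${{ matrix.config }} shard=${{ matrix.shard }} of ${{ matrix.num_shards }}"
```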

@ -13,33 +13,29 @@ concurrency:
permissions: read-all
jobs:
macos-12-py3-arm64-build:
name: macos-12-py3-arm64
macos-13-py3-arm64-build:
name: macos-13-py3-arm64
uses: ./.github/workflows/_mac-build.yml
with:
sync-tag: macos-12-py3-arm64-build
build-environment: macos-12-py3-arm64
sync-tag: macos-py3-arm64-build
build-environment: macos-13-py3-arm64
runner-type: macos-m1-stable
build-generates-artifacts: true
# To match the one pre-installed in the m1 runners
python-version: 3.9.12
# We need to set the environment file here instead of trying to detect it automatically because
# MacOS arm64 is cross-compiled from x86-64. Specifically, it means that arm64 conda environment
# is needed when building PyTorch MacOS arm64 from x86-64
environment-file: .github/requirements/conda-env-macOS-ARM64
test-matrix: |
{ include: [
{ config: "mps", shard: 1, num_shards: 1, runner: "macos-m1-stable" },
{ config: "mps", shard: 1, num_shards: 1, runner: "macos-m2-14" },
]}
macos-12-py3-arm64-mps-test:
name: macos-12-py3-arm64-mps
macos-py3-arm64-mps-test:
name: macos-py3-arm64-mps
uses: ./.github/workflows/_mac-test-mps.yml
needs: macos-12-py3-arm64-build
needs: macos-13-py3-arm64-build
with:
sync-tag: macos-12-py3-arm64-mps-test
build-environment: macos-12-py3-arm64
sync-tag: macos-py3-arm64-mps-test
build-environment: macos-13-py3-arm64
# Same as the build job
python-version: 3.9.12
test-matrix: ${{ needs.macos-12-py3-arm64-build.outputs.test-matrix }}
test-matrix: ${{ needs.macos-13-py3-arm64-build.outputs.test-matrix }}

@ -217,11 +217,11 @@ jobs:
docker-image: ${{ needs.linux-vulkan-focal-py3_11-clang10-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-vulkan-focal-py3_11-clang10-build.outputs.test-matrix }}
linux-focal-rocm6_0-py3_8-build:
name: linux-focal-rocm6.0-py3.8
linux-focal-rocm6_1-py3_8-build:
name: linux-focal-rocm6.1-py3.8
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-focal-rocm6.0-py3.8
build-environment: linux-focal-rocm6.1-py3.8
docker-image-name: pytorch-linux-focal-rocm-n-py3
test-matrix: |
{ include: [
@ -229,16 +229,16 @@ jobs:
{ config: "distributed", shard: 2, num_shards: 2, runner: "linux.rocm.gpu" },
]}
linux-focal-rocm6_0-py3_8-test:
linux-focal-rocm6_1-py3_8-test:
permissions:
id-token: write
contents: read
name: linux-focal-rocm6.0-py3.8
name: linux-focal-rocm6.1-py3.8
uses: ./.github/workflows/_rocm-test.yml
needs:
- linux-focal-rocm6_0-py3_8-build
- linux-focal-rocm6_1-py3_8-build
- target-determination
with:
build-environment: linux-focal-rocm6.0-py3.8
docker-image: ${{ needs.linux-focal-rocm6_0-py3_8-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-rocm6_0-py3_8-build.outputs.test-matrix }}
build-environment: linux-focal-rocm6.1-py3.8
docker-image: ${{ needs.linux-focal-rocm6_1-py3_8-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-rocm6_1-py3_8-build.outputs.test-matrix }}

@ -414,13 +414,13 @@ jobs:
{ config: "default", shard: 1, num_shards: 1 },
]}
linux-focal-rocm6_0-py3_8-build:
linux-focal-rocm6_1-py3_8-build:
# don't run build twice on main
if: github.event_name == 'pull_request'
name: linux-focal-rocm6.0-py3.8
name: linux-focal-rocm6.1-py3.8
uses: ./.github/workflows/_linux-build-label.yml
with:
build-environment: linux-focal-rocm6.0-py3.8
build-environment: linux-focal-rocm6.1-py3.8
docker-image-name: pytorch-linux-focal-rocm-n-py3
sync-tag: rocm-build
test-matrix: |

@ -25,11 +25,11 @@ jobs:
id-token: write
contents: read
linux-focal-rocm6_0-py3_8-build:
name: linux-focal-rocm6.0-py3.8
linux-focal-rocm6_1-py3_8-build:
name: linux-focal-rocm6.1-py3.8
uses: ./.github/workflows/_linux-build-label.yml
with:
build-environment: linux-focal-rocm6.0-py3.8
build-environment: linux-focal-rocm6.1-py3.8
docker-image-name: pytorch-linux-focal-rocm-n-py3
sync-tag: rocm-build
test-matrix: |
@ -42,16 +42,16 @@ jobs:
{ config: "default", shard: 6, num_shards: 6, runner: "linux.rocm.gpu.2" },
]}
linux-focal-rocm6_0-py3_8-test:
linux-focal-rocm6_1-py3_8-test:
permissions:
id-token: write
contents: read
name: linux-focal-rocm6.0-py3.8
name: linux-focal-rocm6.1-py3.8
uses: ./.github/workflows/_rocm-test.yml
needs:
- linux-focal-rocm6_0-py3_8-build
- linux-focal-rocm6_1-py3_8-build
- target-determination
with:
build-environment: linux-focal-rocm6.0-py3.8
docker-image: ${{ needs.linux-focal-rocm6_0-py3_8-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-rocm6_0-py3_8-build.outputs.test-matrix }}
build-environment: linux-focal-rocm6.1-py3.8
docker-image: ${{ needs.linux-focal-rocm6_1-py3_8-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-rocm6_1-py3_8-build.outputs.test-matrix }}

@ -111,30 +111,30 @@ jobs:
docker-image: ${{ needs.linux-focal-py3_8-clang10-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-py3_8-clang10-build.outputs.test-matrix }}
linux-focal-rocm6_0-py3_8-build:
name: linux-focal-rocm6.0-py3.8
linux-focal-rocm6_1-py3_8-build:
name: linux-focal-rocm6.1-py3.8
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-focal-rocm6.0-py3.8
build-environment: linux-focal-rocm6.1-py3.8
docker-image-name: pytorch-linux-focal-rocm-n-py3
test-matrix: |
{ include: [
{ config: "slow", shard: 1, num_shards: 1, runner: "linux.rocm.gpu" },
]}
linux-focal-rocm6_0-py3_8-test:
linux-focal-rocm6_1-py3_8-test:
permissions:
id-token: write
contents: read
name: linux-focal-rocm6.0-py3.8
name: linux-focal-rocm6.1-py3.8
uses: ./.github/workflows/_rocm-test.yml
needs:
- linux-focal-rocm6_0-py3_8-build
- linux-focal-rocm6_1-py3_8-build
- target-determination
with:
build-environment: linux-focal-rocm6.0-py3.8
docker-image: ${{ needs.linux-focal-rocm6_0-py3_8-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-rocm6_0-py3_8-build.outputs.test-matrix }}
build-environment: linux-focal-rocm6.1-py3.8
docker-image: ${{ needs.linux-focal-rocm6_1-py3_8-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-rocm6_1-py3_8-build.outputs.test-matrix }}
linux-jammy-py3_10-clang15-asan-build:
name: linux-jammy-py3.10-clang15-asan
@ -144,8 +144,9 @@ jobs:
docker-image-name: pytorch-linux-jammy-py3-clang15-asan
test-matrix: |
{ include: [
{ config: "slow", shard: 1, num_shards: 2, runner: "linux.4xlarge" },
{ config: "slow", shard: 2, num_shards: 2, runner: "linux.4xlarge" },
{ config: "slow", shard: 1, num_shards: 3, runner: "linux.4xlarge" },
{ config: "slow", shard: 2, num_shards: 3, runner: "linux.4xlarge" },
{ config: "slow", shard: 3, num_shards: 3, runner: "linux.4xlarge" },
]}
sync-tag: asan-build

@ -34,18 +34,6 @@ jobs:
id-token: write
contents: read
# Build PyTorch with BUILD_CAFFE2=ON
caffe2-linux-jammy-py3_8-gcc11-build:
name: caffe2-linux-jammy-py3.8-gcc11
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: caffe2-linux-jammy-py3.8-gcc11
docker-image-name: pytorch-linux-jammy-py3.8-gcc11
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 1 },
]}
linux-focal-cuda12_1-py3_10-gcc9-build:
name: linux-focal-cuda12.1-py3.10-gcc9
uses: ./.github/workflows/_linux-build.yml
@ -106,20 +94,16 @@ jobs:
{ config: "default", shard: 1, num_shards: 1, runner: "linux.2xlarge" },
]}
macos-12-py3-arm64-build:
name: macos-12-py3-arm64
macos-13-py3-arm64-build:
name: macos-13-py3-arm64
uses: ./.github/workflows/_mac-build.yml
with:
sync-tag: macos-12-py3-arm64-build
build-environment: macos-12-py3-arm64
sync-tag: macos-py3-arm64-build
build-environment: macos-13-py3-arm64
runner-type: macos-m1-stable
build-generates-artifacts: true
# To match the one pre-installed in the m1 runners
python-version: 3.9.12
# We need to set the environment file here instead of trying to detect it automatically because
# MacOS arm64 is cross-compiled from x86-64. Specifically, it means that arm64 conda environment
# is needed when building PyTorch MacOS arm64 from x86-64
environment-file: .github/requirements/conda-env-macOS-ARM64
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 3, runner: "macos-m1-stable" },
@ -127,33 +111,34 @@ jobs:
{ config: "default", shard: 3, num_shards: 3, runner: "macos-m1-stable" },
]}
macos-12-py3-arm64-mps-test:
name: macos-12-py3-arm64-mps
macos-py3-arm64-mps-test:
name: macos-py3-arm64-mps
uses: ./.github/workflows/_mac-test-mps.yml
needs: macos-12-py3-arm64-build
if: needs.macos-12-py3-arm64-build.outputs.build-outcome == 'success'
needs: macos-13-py3-arm64-build
if: needs.macos-13-py3-arm64-build.outputs.build-outcome == 'success'
with:
sync-tag: macos-12-py3-arm64-mps-test
build-environment: macos-12-py3-arm64
sync-tag: macos-py3-arm64-mps-test
build-environment: macos-13-py3-arm64
# Same as the build job
python-version: 3.9.12
test-matrix: |
{ include: [
{ config: "mps", shard: 1, num_shards: 1, runner: "macos-m1-stable" },
{ config: "mps", shard: 1, num_shards: 1, runner: "macos-m1-14" },
]}
macos-12-py3-arm64-test:
name: macos-12-py3-arm64
macos-13-py3-arm64-test:
name: macos-13-py3-arm64
uses: ./.github/workflows/_mac-test.yml
needs:
- macos-12-py3-arm64-build
- macos-13-py3-arm64-build
- target-determination
with:
build-environment: macos-12-py3-arm64
build-environment: macos-13-py3-arm64
# Same as the build job
python-version: 3.9.12
test-matrix: ${{ needs.macos-12-py3-arm64-build.outputs.test-matrix }}
arch: arm64
test-matrix: ${{ needs.macos-13-py3-arm64-build.outputs.test-matrix }}
win-vs2019-cpu-py3-build:
name: win-vs2019-cpu-py3
@ -198,11 +183,11 @@ jobs:
{ config: "force_on_cpu", shard: 1, num_shards: 1, runner: "windows.4xlarge.nonephemeral" },
]}
linux-focal-rocm6_0-py3_8-build:
name: linux-focal-rocm6.0-py3.8
linux-focal-rocm6_1-py3_8-build:
name: linux-focal-rocm6.1-py3.8
uses: ./.github/workflows/_linux-build-label.yml
with:
build-environment: linux-focal-rocm6.0-py3.8
build-environment: linux-focal-rocm6.1-py3.8
docker-image-name: pytorch-linux-focal-rocm-n-py3
sync-tag: rocm-build
test-matrix: |
@ -210,17 +195,17 @@ jobs:
{ config: "default", shard: 1, num_shards: 1, runner: "linux.rocm.gpu" },
]}
linux-focal-rocm6_0-py3_8-test:
linux-focal-rocm6_1-py3_8-test:
permissions:
id-token: write
contents: read
name: linux-focal-rocm6.0-py3.8
name: linux-focal-rocm6.1-py3.8
uses: ./.github/workflows/_rocm-test.yml
needs:
- linux-focal-rocm6_0-py3_8-build
- linux-focal-rocm6_1-py3_8-build
- target-determination
with:
build-environment: linux-focal-rocm6.0-py3.8
docker-image: ${{ needs.linux-focal-rocm6_0-py3_8-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-rocm6_0-py3_8-build.outputs.test-matrix }}
build-environment: linux-focal-rocm6.1-py3.8
docker-image: ${{ needs.linux-focal-rocm6_1-py3_8-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-rocm6_1-py3_8-build.outputs.test-matrix }}
tests-to-include: "test_nn test_torch test_cuda test_ops test_unary_ufuncs test_binary_ufuncs test_autograd inductor/test_torchinductor"

@ -36,6 +36,20 @@ jobs:
#
# Experimental ARC jobs
#
llm-td:
name: before-test
uses: ./.github/workflows/llm_td_retrieval.yml
permissions:
id-token: write
contents: read
target-determination:
name: before-test
uses: ./.github/workflows/target_determination.yml
needs: llm-td
permissions:
id-token: write
contents: read
linux-jammy-py3_8-gcc11-build:
name: linux-jammy-py3.8-gcc11
@ -45,16 +59,26 @@ jobs:
docker-image-name: pytorch-linux-jammy-py3.8-gcc11
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 3, runner: "linux.2xlarge" },
{ config: "default", shard: 2, num_shards: 3, runner: "linux.2xlarge" },
{ config: "default", shard: 3, num_shards: 3, runner: "linux.2xlarge" },
{ config: "docs_test", shard: 1, num_shards: 1, runner: "linux.2xlarge" },
{ config: "jit_legacy", shard: 1, num_shards: 1, runner: "linux.2xlarge" },
{ config: "backwards_compat", shard: 1, num_shards: 1, runner: "linux.2xlarge" },
{ config: "distributed", shard: 1, num_shards: 2, runner: "linux.2xlarge" },
{ config: "distributed", shard: 2, num_shards: 2, runner: "linux.2xlarge" },
{ config: "default", shard: 1, num_shards: 3, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "default", shard: 2, num_shards: 3, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "default", shard: 3, num_shards: 3, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "docs_test", shard: 1, num_shards: 1, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "jit_legacy", shard: 1, num_shards: 1, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "backwards_compat", shard: 1, num_shards: 1, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "distributed", shard: 1, num_shards: 2, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "distributed", shard: 2, num_shards: 2, runner: "arc-lf-linux.2xlarge.avx512" },
]}
linux-jammy-py3_8-gcc11-test:
name: linux-jammy-py3.8-gcc11
uses: ./.github/workflows/_linux-test-rg.yml
needs:
- linux-jammy-py3_8-gcc11-build
- target-determination
with:
build-environment: linux-jammy-py3.8-gcc11
docker-image: ${{ needs.linux-jammy-py3_8-gcc11-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-py3_8-gcc11-build.outputs.test-matrix }}
linux-jammy-py3_8-gcc11-no-ops:
name: linux-jammy-py3.8-gcc11-no-ops
@ -86,10 +110,21 @@ jobs:
docker-image-name: pytorch-linux-focal-py3-clang10-onnx
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 2, runner: "linux.2xlarge" },
{ config: "default", shard: 2, num_shards: 2, runner: "linux.2xlarge" },
{ config: "default", shard: 1, num_shards: 2, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "default", shard: 2, num_shards: 2, runner: "arc-lf-linux.2xlarge.avx512" },
]}
linux-focal-py3_8-clang10-onnx-test:
name: linux-focal-py3.8-clang10-onnx
uses: ./.github/workflows/_linux-test-rg.yml
needs:
- linux-focal-py3_8-clang10-onnx-build
- target-determination
with:
build-environment: linux-focal-py3.8-clang10-onnx
docker-image: ${{ needs.linux-focal-py3_8-clang10-onnx-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-py3_8-clang10-onnx-build.outputs.test-matrix }}
linux-jammy-py3_10-clang15-asan-build:
name: linux-jammy-py3.10-clang15-asan
uses: ./.github/workflows/_linux-build-rg.yml
@ -115,16 +150,27 @@ jobs:
docker-image-name: pytorch-linux-focal-py3.8-clang10
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 3, runner: "linux.2xlarge" },
{ config: "default", shard: 2, num_shards: 3, runner: "linux.2xlarge" },
{ config: "default", shard: 3, num_shards: 3, runner: "linux.2xlarge" },
{ config: "crossref", shard: 1, num_shards: 2, runner: "linux.2xlarge" },
{ config: "crossref", shard: 2, num_shards: 2, runner: "linux.2xlarge" },
{ config: "dynamo", shard: 1, num_shards: 3, runner: "linux.2xlarge" },
{ config: "dynamo", shard: 2, num_shards: 3, runner: "linux.2xlarge" },
{ config: "dynamo", shard: 3, num_shards: 3, runner: "linux.2xlarge" },
{ config: "default", shard: 1, num_shards: 3, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "default", shard: 2, num_shards: 3, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "default", shard: 3, num_shards: 3, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "crossref", shard: 1, num_shards: 2, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "crossref", shard: 2, num_shards: 2, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "dynamo", shard: 1, num_shards: 3, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "dynamo", shard: 2, num_shards: 3, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "dynamo", shard: 3, num_shards: 3, runner: "arc-lf-linux.2xlarge.avx512" },
]}
linux-focal-py3_8-clang10-test:
name: linux-focal-py3.8-clang10
uses: ./.github/workflows/_linux-test-rg.yml
needs:
- linux-focal-py3_8-clang10-build
- target-determination
with:
build-environment: linux-focal-py3.8-clang10
docker-image: ${{ needs.linux-focal-py3_8-clang10-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-py3_8-clang10-build.outputs.test-matrix }}
linux-focal-py3_11-clang10-build:
name: linux-focal-py3.11-clang10
uses: ./.github/workflows/_linux-build-rg.yml
@ -133,16 +179,27 @@ jobs:
docker-image-name: pytorch-linux-focal-py3.11-clang10
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 3, runner: "linux.2xlarge" },
{ config: "default", shard: 2, num_shards: 3, runner: "linux.2xlarge" },
{ config: "default", shard: 3, num_shards: 3, runner: "linux.2xlarge" },
{ config: "crossref", shard: 1, num_shards: 2, runner: "linux.2xlarge" },
{ config: "crossref", shard: 2, num_shards: 2, runner: "linux.2xlarge" },
{ config: "dynamo", shard: 1, num_shards: 3, runner: "linux.2xlarge" },
{ config: "dynamo", shard: 2, num_shards: 3, runner: "linux.2xlarge" },
{ config: "dynamo", shard: 3, num_shards: 3, runner: "linux.2xlarge" },
{ config: "default", shard: 1, num_shards: 3, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "default", shard: 2, num_shards: 3, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "default", shard: 3, num_shards: 3, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "crossref", shard: 1, num_shards: 2, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "crossref", shard: 2, num_shards: 2, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "dynamo", shard: 1, num_shards: 3, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "dynamo", shard: 2, num_shards: 3, runner: "arc-lf-linux.2xlarge.avx512" },
{ config: "dynamo", shard: 3, num_shards: 3, runner: "arc-lf-linux.2xlarge.avx512" },
]}
linux-focal-py3_11-clang10-test:
name: linux-focal-py3.11-clang10
uses: ./.github/workflows/_linux-test-rg.yml
needs:
- linux-focal-py3_11-clang10-build
- target-determination
with:
build-environment: linux-focal-py3.11-clang10
docker-image: ${{ needs.linux-focal-py3_11-clang10-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-py3_11-clang10-build.outputs.test-matrix }}
#
# End of Experimental ARC jobs
#
#
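
Note that the experimental ARC hunks above change very little beyond scheduling: the runner field in each test-matrix entry moves from linux.2xlarge to arc-lf-linux.2xlarge.avx512, and new -test jobs are wired up through the -rg reusable workflows and target-determination. Which pool picks up a job is decided purely by that label, as in this tiny sketch (the job body is made up):
```
name: example-runner-label
on: workflow_dispatch
jobs:
  demo:
    runs-on: arc-lf-linux.2xlarge.avx512   # label reused from the matrices above
    steps:
      - run: nproc && (grep -m1 -o avx512f /proc/cpuinfo || true)
```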

@ -49,22 +49,6 @@ jobs:
- run: |
pip3 install requests==2.26 rockset==1.0.3 boto3==1.19.12
- name: Upload test stats
env:
ROCKSET_API_KEY: ${{ secrets.ROCKSET_API_KEY }}
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
WORKFLOW_RUN_ID: ${{ github.event.workflow_run.id }}
WORKFLOW_RUN_ATTEMPT: ${{ github.event.workflow_run.run_attempt }}
WORKFLOW_URL: ${{ github.event.workflow_run.html_url }}
HEAD_REPOSITORY: ${{ github.event.workflow_run.head_repository.full_name }}
HEAD_BRANCH: ${{ github.event.workflow_run.head_branch }}
run: |
echo "${WORKFLOW_URL}"
python3 -m tools.stats.upload_test_stats --workflow-run-id "${WORKFLOW_RUN_ID}" --workflow-run-attempt "${WORKFLOW_RUN_ATTEMPT}" --head-branch "${HEAD_BRANCH}" --head-repository "${HEAD_REPOSITORY}"
python3 -m tools.stats.upload_sccache_stats --workflow-run-id "${WORKFLOW_RUN_ID}" --workflow-run-attempt "${WORKFLOW_RUN_ATTEMPT}"
- name: Upload test artifacts
env:
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
@ -81,6 +65,22 @@ jobs:
# anything on GitHub to upload. The command should return right away
python3 -m tools.stats.upload_artifacts --workflow-run-id "${WORKFLOW_RUN_ID}" --workflow-run-attempt "${WORKFLOW_RUN_ATTEMPT}" --repo "${REPO_FULLNAME}"
- name: Upload test stats
env:
ROCKSET_API_KEY: ${{ secrets.ROCKSET_API_KEY }}
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
WORKFLOW_RUN_ID: ${{ github.event.workflow_run.id }}
WORKFLOW_RUN_ATTEMPT: ${{ github.event.workflow_run.run_attempt }}
WORKFLOW_URL: ${{ github.event.workflow_run.html_url }}
HEAD_REPOSITORY: ${{ github.event.workflow_run.head_repository.full_name }}
HEAD_BRANCH: ${{ github.event.workflow_run.head_branch }}
run: |
echo "${WORKFLOW_URL}"
python3 -m tools.stats.upload_test_stats --workflow-run-id "${WORKFLOW_RUN_ID}" --workflow-run-attempt "${WORKFLOW_RUN_ATTEMPT}" --head-branch "${HEAD_BRANCH}" --head-repository "${HEAD_REPOSITORY}"
python3 -m tools.stats.upload_sccache_stats --workflow-run-id "${WORKFLOW_RUN_ID}" --workflow-run-attempt "${WORKFLOW_RUN_ATTEMPT}"
- name: Analyze disabled tests rerun
env:
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}

2
.gitignore vendored
View File

@ -87,7 +87,7 @@ torch/csrc/api/include/torch/version.h
torch/csrc/cudnn/cuDNN.cpp
torch/csrc/generated
torch/csrc/generic/TensorMethods.cpp
torch/csrc/inductor/aoti_torch/generated/*
torch/csrc/inductor/aoti_torch/generated/*.cpp
torch/csrc/jit/generated/*
torch/csrc/jit/fuser/config.h
torch/csrc/nn/THCUNN.cpp

View File

@ -78,6 +78,7 @@ exclude_patterns = [
'aten/src/ATen/native/vulkan/api/vk_mem_alloc.h',
'c10/util/strong_type.h',
'**/fb/**',
'torch/csrc/inductor/aoti_torch/generated/**',
'torch/csrc/jit/serialization/mobile_bytecode_generated.h',
'torch/csrc/utils/pythoncapi_compat.h',
'aten/src/ATen/dlpack.h',
@ -1051,21 +1052,12 @@ exclude_patterns = [
'test/quantization/fx/test_numeric_suite_fx.py',
'test/quantization/fx/test_quantize_fx.py',
'test/quantization/fx/test_subgraph_rewriter.py',
'test/test_custom_op_testing.py',
'test/test_dataloader.py',
'test/test_datapipe.py',
'test/test_decomp.py',
'test/test_deploy.py',
'test/test_determination.py',
'test/test_dlpack.py',
'test/test_dynamic_shapes.py',
'test/test_expanded_weights.py',
'test/test_fake_tensor.py',
'test/test_flop_counter.py',
'test/test_function_schema.py',
'test/test_functional_autograd_benchmark.py',
'test/test_functional_optim.py',
'test/test_functionalization.py',
'test/test_functionalization_of_rng_ops.py',
'test/test_futures.py',
'test/test_fx.py',
@ -1074,7 +1066,6 @@ exclude_patterns = [
'test/test_fx_reinplace_pass.py',
'test/test_hub.py',
'test/test_import_stats.py',
'test/test_indexing.py',
'test/test_itt.py',
'test/test_jit.py',
'test/test_jit_autocast.py',
@ -1123,8 +1114,6 @@ exclude_patterns = [
'test/test_optim.py',
'test/test_out_dtype_op.py',
'test/test_overrides.py',
'test/test_package.py',
'test/test_per_overload_api.py',
'test/test_prims.py',
'test/test_proxy_tensor.py',
'test/test_pruning_op.py',
@ -1160,7 +1149,6 @@ exclude_patterns = [
'test/test_type_promotion.py',
'test/test_unary_ufuncs.py',
'test/test_utils.py',
'test/test_view_ops.py',
'test/test_vulkan.py',
'test/test_xnnpack_integration.py',
'test/torch_np/numpy_test/**/*.py',

View File

@ -446,7 +446,6 @@ cu_library(
# caffe2
CAFFE2_COPTS = COMMON_COPTS + [
"-Dcaffe2_EXPORTS",
"-DCAFFE2_USE_GLOO",
"-DCAFFE2_USE_CUDNN",
"-DCAFFE2_BUILD_MAIN_LIB",
"-fvisibility-inlines-hidden",
@ -454,22 +453,6 @@ CAFFE2_COPTS = COMMON_COPTS + [
"-fno-trapping-math",
]
filegroup(
name = "caffe2_contrib_srcs",
srcs = [
"caffe2/contrib/aten/aten_op.cc",
"caffe2/contrib/gloo/allgather_ops.cc",
"caffe2/contrib/gloo/allreduce_ops.cc",
"caffe2/contrib/gloo/barrier_ops.cc",
"caffe2/contrib/gloo/broadcast_ops.cc",
"caffe2/contrib/gloo/common.cc",
"caffe2/contrib/gloo/common_world_ops.cc",
"caffe2/contrib/gloo/context.cc",
"caffe2/contrib/gloo/reduce_scatter_ops.cc",
"caffe2/contrib/gloo/store_handler.cc",
],
)
filegroup(
name = "caffe2_core_srcs",
srcs = [
@ -520,363 +503,6 @@ filegroup(
],
)
filegroup(
name = "caffe2_distributed_srcs",
srcs = [
"caffe2/distributed/file_store_handler.cc",
"caffe2/distributed/file_store_handler_op.cc",
"caffe2/distributed/store_handler.cc",
"caffe2/distributed/store_ops.cc",
],
)
filegroup(
name = "caffe2_ideep_srcs",
srcs = [
"caffe2/ideep/operators/adam_op.cc",
"caffe2/ideep/operators/channel_shuffle_op.cc",
"caffe2/ideep/operators/concat_split_op.cc",
"caffe2/ideep/operators/conv_op.cc",
"caffe2/ideep/operators/conv_transpose_op.cc",
"caffe2/ideep/operators/dropout_op.cc",
"caffe2/ideep/operators/elementwise_sum_op.cc",
"caffe2/ideep/operators/expand_squeeze_dims_op.cc",
"caffe2/ideep/operators/fully_connected_op.cc",
"caffe2/ideep/operators/local_response_normalization_op.cc",
"caffe2/ideep/operators/momentum_sgd_op.cc",
"caffe2/ideep/operators/operator_fallback_ideep.cc",
"caffe2/ideep/operators/order_switch_ops.cc",
"caffe2/ideep/operators/pool_op.cc",
"caffe2/ideep/operators/quantization/int8_add_op.cc",
"caffe2/ideep/operators/quantization/int8_conv_op.cc",
"caffe2/ideep/operators/quantization/int8_dequantize_op.cc",
"caffe2/ideep/operators/quantization/int8_fully_connected_op.cc",
"caffe2/ideep/operators/quantization/int8_given_tensor_fill_op.cc",
"caffe2/ideep/operators/quantization/int8_pool_op.cc",
"caffe2/ideep/operators/quantization/int8_quantize_op.cc",
"caffe2/ideep/operators/quantization/int8_relu_op.cc",
"caffe2/ideep/operators/queue_ops.cc",
"caffe2/ideep/operators/relu_op.cc",
"caffe2/ideep/operators/reshape_op.cc",
"caffe2/ideep/operators/shape_op.cc",
"caffe2/ideep/operators/sigmoid_op.cc",
"caffe2/ideep/operators/spatial_batch_norm_op.cc",
"caffe2/ideep/operators/transpose_op.cc",
"caffe2/ideep/operators/utility_ops.cc",
"caffe2/ideep/utils/ideep_register.cc",
],
)
filegroup(
name = "caffe2_onnx_srcs",
srcs = [
"caffe2/onnx/backend.cc",
"caffe2/onnx/backend_rep.cc",
"caffe2/onnx/device.cc",
"caffe2/onnx/helper.cc",
"caffe2/onnx/offline_tensor.cc",
"caffe2/onnx/onnx_exporter.cc",
"caffe2/onnx/onnxifi_graph_info.cc",
"caffe2/onnx/onnxifi_init.cc",
],
)
filegroup(
name = "caffe2_operators_srcs",
srcs = [
"caffe2/operators/abs_op.cc",
"caffe2/operators/accumulate_op.cc",
"caffe2/operators/accuracy_op.cc",
"caffe2/operators/acos_op.cc",
"caffe2/operators/affine_channel_op.cc",
"caffe2/operators/alias_with_name.cc",
"caffe2/operators/apmeter_op.cc",
"caffe2/operators/arg_ops.cc",
"caffe2/operators/asin_op.cc",
"caffe2/operators/assert_op.cc",
"caffe2/operators/atan_op.cc",
"caffe2/operators/atomic_ops.cc",
"caffe2/operators/batch_box_cox_op.cc",
"caffe2/operators/batch_bucketize_op.cc",
"caffe2/operators/batch_gather_ops.cc",
"caffe2/operators/batch_matmul_op.cc",
"caffe2/operators/batch_moments_op.cc",
"caffe2/operators/batch_permutation_op.cc",
"caffe2/operators/batch_sparse_to_dense_op.cc",
"caffe2/operators/bbox_transform_op.cc",
"caffe2/operators/bisect_percentile_op.cc",
"caffe2/operators/boolean_mask_ops.cc",
"caffe2/operators/boolean_unmask_ops.cc",
"caffe2/operators/box_with_nms_limit_op.cc",
"caffe2/operators/bucketize_op.cc",
"caffe2/operators/byte_weight_dequant_op.cc",
"caffe2/operators/cast_op.cc",
"caffe2/operators/cbrt_op.cc",
"caffe2/operators/cc_bmm_bg_op.cc",
"caffe2/operators/ceil_op.cc",
"caffe2/operators/channel_backprop_stats_op.cc",
"caffe2/operators/channel_shuffle_op.cc",
"caffe2/operators/channel_stats_op.cc",
"caffe2/operators/clip_op.cc",
"caffe2/operators/collect_and_distribute_fpn_rpn_proposals_op.cc",
"caffe2/operators/communicator_op.cc",
"caffe2/operators/concat_split_op.cc",
"caffe2/operators/conditional_op.cc",
"caffe2/operators/conv_gradient_op.cc",
"caffe2/operators/conv_op.cc",
"caffe2/operators/conv_op_eigen.cc",
"caffe2/operators/conv_op_shared.cc",
"caffe2/operators/conv_transpose_gradient_op.cc",
"caffe2/operators/conv_transpose_op.cc",
"caffe2/operators/conv_transpose_op_mobile.cc",
"caffe2/operators/copy_op.cc",
"caffe2/operators/copy_rows_to_tensor_op.cc",
"caffe2/operators/cos_op.cc",
"caffe2/operators/cosh_op.cc",
"caffe2/operators/cosine_embedding_criterion_op.cc",
"caffe2/operators/counter_ops.cc",
"caffe2/operators/crash_op.cc",
"caffe2/operators/create_scope_op.cc",
"caffe2/operators/crf_viterbi_op.cc",
"caffe2/operators/cross_entropy_op.cc",
"caffe2/operators/ctc_beam_search_decoder_op.cc",
"caffe2/operators/ctc_greedy_decoder_op.cc",
"caffe2/operators/cube_op.cc",
"caffe2/operators/data_couple.cc",
"caffe2/operators/dataset_ops.cc",
"caffe2/operators/deform_conv_gradient_op.cc",
"caffe2/operators/deform_conv_op.cc",
"caffe2/operators/dense_vector_to_id_list_op.cc",
"caffe2/operators/distance_op.cc",
"caffe2/operators/do_op.cc",
"caffe2/operators/dropout_op.cc",
"caffe2/operators/elementwise_add_gradient_op.cc",
"caffe2/operators/elementwise_add_op.cc",
"caffe2/operators/elementwise_div_gradient_op.cc",
"caffe2/operators/elementwise_div_op.cc",
"caffe2/operators/elementwise_linear_op.cc",
"caffe2/operators/elementwise_logical_ops.cc",
"caffe2/operators/elementwise_mul_gradient_op.cc",
"caffe2/operators/elementwise_mul_op.cc",
"caffe2/operators/elementwise_ops.cc",
"caffe2/operators/elementwise_ops_schema.cc",
"caffe2/operators/elementwise_ops_utils.cc",
"caffe2/operators/elementwise_sub_gradient_op.cc",
"caffe2/operators/elementwise_sub_op.cc",
"caffe2/operators/elementwise_sum_op.cc",
"caffe2/operators/elu_op.cc",
"caffe2/operators/enforce_finite_op.cc",
"caffe2/operators/ensure_clipped_op.cc",
"caffe2/operators/ensure_cpu_output_op.cc",
"caffe2/operators/erf_op.cc",
"caffe2/operators/exp_op.cc",
"caffe2/operators/expand_op.cc",
"caffe2/operators/expand_squeeze_dims_op.cc",
"caffe2/operators/fc_inference.cc",
"caffe2/operators/feature_maps_ops.cc",
"caffe2/operators/feed_blob_op.cc",
"caffe2/operators/filler_op.cc",
"caffe2/operators/find_duplicate_elements_op.cc",
"caffe2/operators/find_op.cc",
"caffe2/operators/flatten_op.cc",
"caffe2/operators/flexible_top_k.cc",
"caffe2/operators/floor_op.cc",
"caffe2/operators/free_op.cc",
"caffe2/operators/fully_connected_op.cc",
"caffe2/operators/fused_rowwise_8bit_conversion_ops.cc",
"caffe2/operators/fused_rowwise_random_quantization_ops.cc",
"caffe2/operators/gather_fused_8bit_rowwise_op.cc",
"caffe2/operators/gather_op.cc",
"caffe2/operators/gather_ranges_to_dense_op.cc",
"caffe2/operators/gelu_op.cc",
"caffe2/operators/generate_proposals_op.cc",
"caffe2/operators/given_tensor_byte_string_to_uint8_fill_op.cc",
"caffe2/operators/given_tensor_fill_op.cc",
"caffe2/operators/glu_op.cc",
"caffe2/operators/group_norm_op.cc",
"caffe2/operators/gru_unit_op.cc",
"caffe2/operators/h_softmax_op.cc",
"caffe2/operators/half_float_ops.cc",
"caffe2/operators/hard_sigmoid_op.cc",
"caffe2/operators/heatmap_max_keypoint_op.cc",
"caffe2/operators/if_op.cc",
"caffe2/operators/im2col_op.cc",
"caffe2/operators/index_hash_ops.cc",
"caffe2/operators/index_ops.cc",
"caffe2/operators/inference_lstm_op.cc",
"caffe2/operators/instance_norm_gradient_op.cc",
"caffe2/operators/instance_norm_op.cc",
"caffe2/operators/integral_image_op.cc",
"caffe2/operators/is_empty_op.cc",
"caffe2/operators/jsd_op.cc",
"caffe2/operators/key_split_ops.cc",
"caffe2/operators/last_n_window_collector.cc",
"caffe2/operators/layer_norm_op.cc",
"caffe2/operators/leaky_relu_op.cc",
"caffe2/operators/length_split_op.cc",
"caffe2/operators/lengths_pad_op.cc",
"caffe2/operators/lengths_reducer_fused_8bit_rowwise_ops.cc",
"caffe2/operators/lengths_reducer_ops.cc",
"caffe2/operators/lengths_reducer_rowwise_8bit_ops.cc",
"caffe2/operators/lengths_tile_op.cc",
"caffe2/operators/lengths_top_k_op.cc",
"caffe2/operators/listwise_l2r_op.cc",
"caffe2/operators/load_save_op.cc",
"caffe2/operators/load_save_op_util.cc",
"caffe2/operators/local_response_normalization_op.cc",
"caffe2/operators/locally_connected_op.cc",
"caffe2/operators/locally_connected_op_util.cc",
"caffe2/operators/log_op.cc",
"caffe2/operators/logit_op.cc",
"caffe2/operators/loss_op.cc",
"caffe2/operators/lp_pool_op.cc",
"caffe2/operators/lpnorm_op.cc",
"caffe2/operators/lstm_unit_op.cc",
"caffe2/operators/map_ops.cc",
"caffe2/operators/margin_ranking_criterion_op.cc",
"caffe2/operators/matmul_op.cc",
"caffe2/operators/mean_op.cc",
"caffe2/operators/merge_id_lists_op.cc",
"caffe2/operators/minmax_gradient_ops.cc",
"caffe2/operators/minmax_ops.cc",
"caffe2/operators/mod_op.cc",
"caffe2/operators/moments_op.cc",
"caffe2/operators/multi_class_accuracy_op.cc",
"caffe2/operators/negate_gradient_op.cc",
"caffe2/operators/negative_op.cc",
"caffe2/operators/ngram_ops.cc",
"caffe2/operators/norm_planar_yuv_op.cc",
"caffe2/operators/normalize_l1_op.cc",
"caffe2/operators/normalize_op.cc",
"caffe2/operators/numpy_tile_op.cc",
"caffe2/operators/one_hot_ops.cc",
"caffe2/operators/onnx_while_op.cc",
"caffe2/operators/order_switch_ops.cc",
"caffe2/operators/pack_rnn_sequence_op.cc",
"caffe2/operators/pack_segments.cc",
"caffe2/operators/pad_op.cc",
"caffe2/operators/partition_ops.cc",
"caffe2/operators/percentile_op.cc",
"caffe2/operators/perplexity_op.cc",
"caffe2/operators/piecewise_linear_transform_op.cc",
"caffe2/operators/pool_gradient_op.cc",
"caffe2/operators/pool_op.cc",
"caffe2/operators/pool_op_util.cc",
"caffe2/operators/pow_op.cc",
"caffe2/operators/prelu_op.cc",
"caffe2/operators/prepend_dim_op.cc",
"caffe2/operators/quant_decode_op.cc",
"caffe2/operators/rank_loss_op.cc",
"caffe2/operators/reciprocal_gradient_op.cc",
"caffe2/operators/reciprocal_op.cc",
"caffe2/operators/reduce_front_back_max_ops.cc",
"caffe2/operators/reduce_front_back_mean_ops.cc",
"caffe2/operators/reduce_front_back_sum_ops.cc",
"caffe2/operators/reduce_ops.cc",
"caffe2/operators/reduction_ops.cc",
"caffe2/operators/relu_n_op.cc",
"caffe2/operators/relu_op.cc",
"caffe2/operators/remove_data_blocks_op.cc",
"caffe2/operators/replace_nan_op.cc",
"caffe2/operators/reservoir_sampling.cc",
"caffe2/operators/reshape_op.cc",
"caffe2/operators/resize_3d_op.cc",
"caffe2/operators/resize_op.cc",
"caffe2/operators/reverse_packed_segs_op.cc",
"caffe2/operators/rmac_regions_op.cc",
"caffe2/operators/rnn/recurrent_network_blob_fetcher_op.cc",
"caffe2/operators/rnn/recurrent_network_executor.cc",
"caffe2/operators/rnn/recurrent_network_op.cc",
"caffe2/operators/roi_align_gradient_op.cc",
"caffe2/operators/roi_align_op.cc",
"caffe2/operators/roi_align_rotated_gradient_op.cc",
"caffe2/operators/roi_align_rotated_op.cc",
"caffe2/operators/roi_pool_op.cc",
"caffe2/operators/rowmul_op.cc",
"caffe2/operators/rsqrt_op.cc",
"caffe2/operators/scale_blobs_op.cc",
"caffe2/operators/scale_op.cc",
"caffe2/operators/segment_reduction_op.cc",
"caffe2/operators/selu_op.cc",
"caffe2/operators/sequence_ops.cc",
"caffe2/operators/shape_op.cc",
"caffe2/operators/sigmoid_gradient_op.cc",
"caffe2/operators/sigmoid_op.cc",
"caffe2/operators/sin_op.cc",
"caffe2/operators/sinh_op.cc",
"caffe2/operators/sinusoid_position_encoding_op.cc",
"caffe2/operators/slice_op.cc",
"caffe2/operators/softmax_op.cc",
"caffe2/operators/softmax_utils.cc",
"caffe2/operators/softmax_with_loss_op.cc",
"caffe2/operators/softplus_op.cc",
"caffe2/operators/softsign_op.cc",
"caffe2/operators/space_batch_op.cc",
"caffe2/operators/sparse_dropout_with_replacement_op.cc",
"caffe2/operators/sparse_normalize_op.cc",
"caffe2/operators/sparse_to_dense_mask_op.cc",
"caffe2/operators/sparse_to_dense_op.cc",
"caffe2/operators/spatial_batch_norm_gradient_op.cc",
"caffe2/operators/spatial_batch_norm_op.cc",
"caffe2/operators/spatial_softmax_with_loss_op.cc",
"caffe2/operators/sqr_op.cc",
"caffe2/operators/sqrt_op.cc",
"caffe2/operators/square_root_divide_op.cc",
"caffe2/operators/stats_ops.cc",
"caffe2/operators/stats_put_ops.cc",
"caffe2/operators/stop_gradient.cc",
"caffe2/operators/string_ops.cc",
"caffe2/operators/stump_func_op.cc",
"caffe2/operators/stylizer_ops.cc",
"caffe2/operators/summarize_op.cc",
"caffe2/operators/swish_op.cc",
"caffe2/operators/tan_op.cc",
"caffe2/operators/tanh_gradient_op.cc",
"caffe2/operators/tanh_op.cc",
"caffe2/operators/tensor_protos_db_input.cc",
"caffe2/operators/text_file_reader.cc",
"caffe2/operators/text_file_reader_utils.cc",
"caffe2/operators/thresholded_relu_op.cc",
"caffe2/operators/tile_op.cc",
"caffe2/operators/top_k.cc",
"caffe2/operators/transpose_op.cc",
"caffe2/operators/tt_linear_op.cc",
"caffe2/operators/unique_ops.cc",
"caffe2/operators/upsample_op.cc",
"caffe2/operators/utility_ops.cc",
"caffe2/operators/variable_length_sequence_padding.cc",
"caffe2/operators/weighted_multi_sampling_op.cc",
"caffe2/operators/weighted_sample_op.cc",
"caffe2/operators/while_op.cc",
"caffe2/operators/workspace_ops.cc",
"caffe2/operators/zero_gradient_op.cc",
],
)
filegroup(
name = "caffe2_opt_srcs",
srcs = [
"caffe2/opt/annotations.cc",
"caffe2/opt/backend_cutting.cc",
"caffe2/opt/backend_transformer_base.cc",
"caffe2/opt/bound_shape_inferencer.cc",
"caffe2/opt/converter.cc",
"caffe2/opt/dead_code_elim.cc",
"caffe2/opt/device.cc",
"caffe2/opt/distributed.cc",
"caffe2/opt/distributed_converter.cc",
"caffe2/opt/fusion.cc",
"caffe2/opt/mobile.cc",
"caffe2/opt/onnxifi_op.cc",
"caffe2/opt/onnxifi_transformer.cc",
"caffe2/opt/optimize_ideep.cc",
"caffe2/opt/optimizer.cc",
"caffe2/opt/passes.cc",
"caffe2/opt/shape_info.cc",
"caffe2/opt/tvm_transformer.cc",
],
)
filegroup(
name = "caffe2_perfkernels_srcs",
srcs = [
@ -892,70 +518,6 @@ filegroup(
],
)
filegroup(
name = "caffe2_predictor_srcs",
srcs = [
"caffe2/predictor/emulator/data_filler.cc",
"caffe2/predictor/emulator/data_filler.h",
"caffe2/predictor/predictor.cc",
"caffe2/predictor/predictor_config.cc",
"caffe2/predictor/predictor_utils.cc",
],
)
filegroup(
name = "caffe2_quantization_srcs",
srcs = [
"caffe2/quantization/server/activation_distribution_observer.cc",
"caffe2/quantization/server/batch_matmul_dnnlowp_op.cc",
"caffe2/quantization/server/caffe2_dnnlowp_utils.cc",
"caffe2/quantization/server/channel_shuffle_dnnlowp_op.cc",
"caffe2/quantization/server/concat_dnnlowp_op.cc",
"caffe2/quantization/server/conv_dnnlowp_acc16_op.cc",
"caffe2/quantization/server/conv_dnnlowp_op.cc",
"caffe2/quantization/server/conv_relu_op.cc",
"caffe2/quantization/server/dequantize_dnnlowp_op.cc",
"caffe2/quantization/server/dnnlowp.cc",
"caffe2/quantization/server/dnnlowp_partition.cc",
"caffe2/quantization/server/dynamic_histogram.cc",
"caffe2/quantization/server/elementwise_add_dnnlowp_op.cc",
"caffe2/quantization/server/elementwise_linear_dnnlowp_op.cc",
"caffe2/quantization/server/elementwise_mul_dnnlowp_op.cc",
"caffe2/quantization/server/elementwise_sum_dnnlowp_op.cc",
"caffe2/quantization/server/elementwise_sum_relu_op.cc",
"caffe2/quantization/server/fbgemm_pack_matrix_cache.cc",
"caffe2/quantization/server/fbgemm_pack_op.cc",
"caffe2/quantization/server/fully_connected_dnnlowp_acc16_op.cc",
"caffe2/quantization/server/fully_connected_dnnlowp_op.cc",
"caffe2/quantization/server/fully_connected_fake_lowp_op.cc",
"caffe2/quantization/server/group_norm_dnnlowp_op.cc",
"caffe2/quantization/server/int8_gen_quant_params.cc",
"caffe2/quantization/server/kl_minimization.cc",
"caffe2/quantization/server/lstm_unit_dnnlowp_op.cc",
"caffe2/quantization/server/norm_minimization.cc",
"caffe2/quantization/server/p99.cc",
"caffe2/quantization/server/pool_dnnlowp_op.cc",
"caffe2/quantization/server/quantize_dnnlowp_op.cc",
"caffe2/quantization/server/relu_dnnlowp_op.cc",
"caffe2/quantization/server/sigmoid.cc",
"caffe2/quantization/server/sigmoid_dnnlowp_op.cc",
"caffe2/quantization/server/spatial_batch_norm_dnnlowp_op.cc",
"caffe2/quantization/server/tanh.cc",
"caffe2/quantization/server/tanh_dnnlowp_op.cc",
"caffe2/quantization/server/utility_dnnlowp_ops.cc",
],
)
filegroup(
name = "caffe2_queue_srcs",
srcs = [
"caffe2/queue/blobs_queue.cc",
"caffe2/queue/blobs_queue_db.cc",
"caffe2/queue/queue_ops.cc",
"caffe2/queue/rebatching_queue.cc",
"caffe2/queue/rebatching_queue_ops.cc",
],
)
filegroup(
name = "caffe2_serialize_srcs",
@ -967,36 +529,6 @@ filegroup(
],
)
filegroup(
name = "caffe2_sgd_srcs",
srcs = [
"caffe2/sgd/adadelta_op.cc",
"caffe2/sgd/adagrad_op.cc",
"caffe2/sgd/adam_op.cc",
"caffe2/sgd/clip_tensor_op.cc",
"caffe2/sgd/ftrl_op.cc",
"caffe2/sgd/gftrl_op.cc",
"caffe2/sgd/iter_op.cc",
"caffe2/sgd/lars_op.cc",
"caffe2/sgd/learning_rate_adaption_op.cc",
"caffe2/sgd/learning_rate_op.cc",
"caffe2/sgd/momentum_sgd_op.cc",
"caffe2/sgd/rmsprop_op.cc",
"caffe2/sgd/wngrad_op.cc",
"caffe2/sgd/yellowfin_op.cc",
],
)
filegroup(
name = "caffe2_transforms_srcs",
srcs = [
"caffe2/transforms/common_subexpression_elimination.cc",
"caffe2/transforms/conv_to_nnpack_transform.cc",
"caffe2/transforms/pattern_net_transform.cc",
"caffe2/transforms/single_op_transform.cc",
],
)
filegroup(
name = "caffe2_utils_srcs",
srcs = [
@ -1021,228 +553,6 @@ filegroup(
],
)
filegroup(
name = "caffe2_cuda_cpp_srcs",
srcs = [
"caffe2/contrib/aten/aten_op_gpu.cc",
"caffe2/contrib/gloo/allreduce_ops_gpu.cc",
"caffe2/contrib/gloo/broadcast_ops_gpu.cc",
"caffe2/contrib/gloo/common_world_ops_gpu.cc",
"caffe2/core/blob_serialization_gpu.cc",
"caffe2/core/common_cudnn.cc",
"caffe2/core/common_gpu.cc",
"caffe2/core/event_gpu.cc",
"caffe2/db/create_db_op_gpu.cc",
"caffe2/distributed/file_store_handler_op_gpu.cc",
"caffe2/operators/communicator_op_gpu.cc",
"caffe2/operators/concat_split_op_gpu.cc",
"caffe2/operators/conv_op_cache_cudnn.cc",
"caffe2/operators/conv_op_cudnn.cc",
"caffe2/operators/conv_op_gpu.cc",
"caffe2/operators/conv_op_shared_gpu.cc",
"caffe2/operators/conv_transpose_op_cudnn.cc",
"caffe2/operators/conv_transpose_op_gpu.cc",
"caffe2/operators/counter_ops_gpu.cc",
"caffe2/operators/do_op_gpu.cc",
"caffe2/operators/dropout_op_cudnn.cc",
"caffe2/operators/elementwise_add_op_gpu.cc",
"caffe2/operators/elementwise_sub_op_gpu.cc",
"caffe2/operators/elu_op_cudnn.cc",
"caffe2/operators/exp_op_gpu.cc",
"caffe2/operators/expand_op_gpu.cc",
"caffe2/operators/expand_squeeze_dims_op_gpu.cc",
"caffe2/operators/free_op_gpu.cc",
"caffe2/operators/fully_connected_op_gpu.cc",
"caffe2/operators/if_op_gpu.cc",
"caffe2/operators/im2col_op_gpu.cc",
"caffe2/operators/load_save_op_gpu.cc",
"caffe2/operators/local_response_normalization_op_cudnn.cc",
"caffe2/operators/locally_connected_op_gpu.cc",
"caffe2/operators/log_op_gpu.cc",
"caffe2/operators/matmul_op_gpu.cc",
"caffe2/operators/negate_gradient_op_gpu.cc",
"caffe2/operators/negative_op_gpu.cc",
"caffe2/operators/order_switch_ops_cudnn.cc",
"caffe2/operators/order_switch_ops_gpu.cc",
"caffe2/operators/pool_op_cudnn.cc",
"caffe2/operators/prepend_dim_op_gpu.cc",
"caffe2/operators/reshape_op_gpu.cc",
"caffe2/operators/rnn/recurrent_network_blob_fetcher_op_gpu.cc",
"caffe2/operators/rnn/recurrent_network_executor_gpu.cc",
"caffe2/operators/rnn/recurrent_op_cudnn.cc",
"caffe2/operators/scale_op_gpu.cc",
"caffe2/operators/shape_op_gpu.cc",
"caffe2/operators/sigmoid_op_cudnn.cc",
"caffe2/operators/softmax_op_cudnn.cc",
"caffe2/operators/sqr_op_gpu.cc",
"caffe2/operators/sqrt_op_gpu.cc",
"caffe2/operators/stop_gradient_gpu.cc",
"caffe2/operators/tanh_op_cudnn.cc",
"caffe2/operators/tensor_protos_db_input_gpu.cc",
"caffe2/operators/transpose_op_cudnn.cc",
"caffe2/operators/while_op_gpu.cc",
"caffe2/operators/zero_gradient_op_gpu.cc",
"caffe2/queue/queue_ops_gpu.cc",
"caffe2/sgd/iter_op_gpu.cc",
"caffe2/sgd/learning_rate_op_gpu.cc",
],
)
filegroup(
name = "caffe2_cu_srcs",
srcs = [
"caffe2/core/context_gpu.cu",
"caffe2/operators/abs_op.cu",
"caffe2/operators/accumulate_op.cu",
"caffe2/operators/accuracy_op.cu",
"caffe2/operators/acos_op.cu",
"caffe2/operators/affine_channel_op.cu",
"caffe2/operators/alias_with_name.cu",
"caffe2/operators/arg_ops.cu",
"caffe2/operators/asin_op.cu",
"caffe2/operators/assert_op.cu",
"caffe2/operators/atan_op.cu",
"caffe2/operators/batch_gather_ops.cu",
"caffe2/operators/batch_matmul_op.cu",
"caffe2/operators/batch_moments_op.cu",
"caffe2/operators/batch_permutation_op.cu",
"caffe2/operators/batch_sparse_to_dense_op.cu",
"caffe2/operators/boolean_mask_ops.cu",
"caffe2/operators/boolean_unmask_ops.cu",
"caffe2/operators/bucketize_op.cu",
"caffe2/operators/cast_op.cu",
"caffe2/operators/cbrt_op.cu",
"caffe2/operators/ceil_op.cu",
"caffe2/operators/channel_backprop_stats_op.cu",
"caffe2/operators/channel_shuffle_op.cu",
"caffe2/operators/channel_stats_op.cu",
"caffe2/operators/channelwise_conv3d_op_cudnn.cu",
"caffe2/operators/clip_op.cu",
"caffe2/operators/copy_op.cu",
"caffe2/operators/cos_op.cu",
"caffe2/operators/cosh_op.cu",
"caffe2/operators/cosine_embedding_criterion_op.cu",
"caffe2/operators/cross_entropy_op.cu",
"caffe2/operators/cube_op.cu",
"caffe2/operators/data_couple_gpu.cu",
"caffe2/operators/deform_conv_op.cu",
"caffe2/operators/depthwise_3x3_conv_op_cudnn.cu",
"caffe2/operators/distance_op.cu",
"caffe2/operators/dropout_op.cu",
"caffe2/operators/elementwise_div_op.cu",
"caffe2/operators/elementwise_linear_op.cu",
"caffe2/operators/elementwise_mul_op.cu",
"caffe2/operators/elementwise_ops.cu",
"caffe2/operators/elu_op.cu",
"caffe2/operators/enforce_finite_op.cu",
"caffe2/operators/ensure_cpu_output_op.cu",
"caffe2/operators/erf_op.cu",
"caffe2/operators/filler_op.cu",
"caffe2/operators/find_op.cu",
"caffe2/operators/floor_op.cu",
"caffe2/operators/gather_op.cu",
"caffe2/operators/gelu_op.cu",
"caffe2/operators/generate_proposals_op.cu",
"caffe2/operators/generate_proposals_op_util_nms_gpu.cu",
"caffe2/operators/given_tensor_byte_string_to_uint8_fill_op.cu",
"caffe2/operators/given_tensor_fill_op.cu",
"caffe2/operators/glu_op.cu",
"caffe2/operators/group_norm_op.cu",
"caffe2/operators/gru_unit_op_gpu.cu",
"caffe2/operators/half_float_ops.cu",
"caffe2/operators/hard_sigmoid_op.cu",
"caffe2/operators/instance_norm_op.cu",
"caffe2/operators/integral_image_op.cu",
"caffe2/operators/layer_norm_op.cu",
"caffe2/operators/leaky_relu_op.cu",
"caffe2/operators/lengths_pad_op.cu",
"caffe2/operators/lengths_tile_op.cu",
"caffe2/operators/local_response_normalization_op.cu",
"caffe2/operators/logit_op.cu",
"caffe2/operators/loss_op.cu",
"caffe2/operators/lp_pool_op.cu",
"caffe2/operators/lstm_unit_op_gpu.cu",
"caffe2/operators/margin_ranking_criterion_op.cu",
"caffe2/operators/max_pool_with_index.cu",
"caffe2/operators/mean_op.cu",
"caffe2/operators/mem_query_op.cu",
"caffe2/operators/minmax_ops.cu",
"caffe2/operators/moments_op.cu",
"caffe2/operators/multi_class_accuracy_op.cu",
"caffe2/operators/normalize_ops.cu",
"caffe2/operators/one_hot_ops.cu",
"caffe2/operators/pack_segments.cu",
"caffe2/operators/pad_op_gpu.cu",
"caffe2/operators/perplexity_op.cu",
"caffe2/operators/piecewise_linear_transform_op.cu",
"caffe2/operators/pool_op.cu",
"caffe2/operators/pow_op.cu",
"caffe2/operators/prelu_op.cu",
"caffe2/operators/reciprocal_op.cu",
"caffe2/operators/reduce_front_back_max_ops.cu",
"caffe2/operators/reduce_front_back_sum_mean_ops.cu",
"caffe2/operators/reduce_ops.cu",
"caffe2/operators/reduction_ops.cu",
"caffe2/operators/relu_n_op.cu",
"caffe2/operators/relu_op.cu",
"caffe2/operators/replace_nan_op.cu",
"caffe2/operators/resize_3d_op.cu",
"caffe2/operators/resize_op.cu",
"caffe2/operators/reverse_packed_segs_op.cu",
"caffe2/operators/rmac_regions_op.cu",
"caffe2/operators/rnn/recurrent_network_op_gpu.cu",
"caffe2/operators/roi_align_gradient_op.cu",
"caffe2/operators/roi_align_op.cu",
"caffe2/operators/roi_align_rotated_gradient_op.cu",
"caffe2/operators/roi_align_rotated_op.cu",
"caffe2/operators/roi_pool_op.cu",
"caffe2/operators/rsqrt_op.cu",
"caffe2/operators/scale_blobs_op.cu",
"caffe2/operators/segment_reduction_op_gpu.cu",
"caffe2/operators/selu_op.cu",
"caffe2/operators/sequence_ops.cu",
"caffe2/operators/sigmoid_op.cu",
"caffe2/operators/sin_op.cu",
"caffe2/operators/sinh_op.cu",
"caffe2/operators/slice_op.cu",
"caffe2/operators/softmax_ops.cu",
"caffe2/operators/softplus_op.cu",
"caffe2/operators/softsign_op.cu",
"caffe2/operators/space_batch_op_gpu.cu",
"caffe2/operators/sparse_normalize_op_gpu.cu",
"caffe2/operators/sparse_to_dense_op.cu",
"caffe2/operators/spatial_batch_norm_op.cu",
"caffe2/operators/spatial_batch_norm_op_cudnn.cu",
"caffe2/operators/stump_func_op.cu",
"caffe2/operators/summarize_op.cu",
"caffe2/operators/swish_op.cu",
"caffe2/operators/tan_op.cu",
"caffe2/operators/tanh_op.cu",
"caffe2/operators/thresholded_relu_op.cu",
"caffe2/operators/tile_op.cu",
"caffe2/operators/top_k.cu",
"caffe2/operators/transpose_op.cu",
"caffe2/operators/unique_ops.cu",
"caffe2/operators/upsample_op.cu",
"caffe2/operators/utility_ops.cu",
"caffe2/operators/weighted_sample_op.cu",
"caffe2/sgd/adadelta_op_gpu.cu",
"caffe2/sgd/adagrad_op_gpu.cu",
"caffe2/sgd/adam_op_gpu.cu",
"caffe2/sgd/fp16_momentum_sgd_op.cu",
"caffe2/sgd/fp32_momentum_sgd_op.cu",
"caffe2/sgd/lars_op_gpu.cu",
"caffe2/sgd/momentum_sgd_op_gpu.cu",
"caffe2/sgd/rmsprop_op_gpu.cu",
"caffe2/sgd/yellowfin_op_gpu.cu",
"caffe2/utils/math/broadcast.cu",
"caffe2/utils/math/elementwise.cu",
"caffe2/utils/math/reduce.cu",
"caffe2/utils/math/transpose.cu",
"caffe2/utils/math_gpu.cu",
],
)
# To achieve finer granularity and make debugging easier, caffe2 is split into three libraries:
# ATen, caffe2 and caffe2_for_aten_headers. The ATen lib groups source code under the
# aten/ directory and caffe2 contains most files under the `caffe2/` directory. Since the
@ -1271,35 +581,10 @@ cc_library(
],
)
py_binary(
name = "gen_op",
srcs = ["caffe2/contrib/aten/gen_op.py"],
deps = ["//torchgen"],
)
genrule(
name = "generated_caffe2_aten_op_headers",
srcs = [
"caffe2/contrib/aten/aten_op_template.h",
"aten/src/ATen/Declarations.yaml",
],
outs = ["caffe2/caffe2/contrib/aten/gen_aten_op.h"],
cmd = """
$(location :gen_op) \
--output_prefix gen_ \
--install_dir $(@D) \
--aten_root `dirname $(location aten/src/ATen/Declarations.yaml)`/../.. \
--template_dir `dirname $(location caffe2/contrib/aten/aten_op_template.h)` \
--yaml_dir `dirname $(location aten/src/ATen/Declarations.yaml)`""",
tools = [":gen_op"],
)
cc_library(
name = "caffe2_headers",
hdrs = glob(
[
"caffe2/contrib/aten/*.h",
"caffe2/contrib/gloo/*.h",
"caffe2/core/*.h",
"caffe2/core/nomnigraph/include/nomnigraph/Converters/*.h",
"caffe2/core/nomnigraph/include/nomnigraph/Generated/*.h",
@ -1308,25 +593,8 @@ cc_library(
"caffe2/core/nomnigraph/include/nomnigraph/Support/*.h",
"caffe2/core/nomnigraph/include/nomnigraph/Transformations/*.h",
"caffe2/core/nomnigraph/tests/*.h",
"caffe2/db/*.h",
"caffe2/distributed/*.h",
"caffe2/ideep/*.h",
"caffe2/ideep/operators/*.h",
"caffe2/ideep/operators/quantization/*.h",
"caffe2/ideep/utils/*.h",
"caffe2/onnx/*.h",
"caffe2/operators/*.h",
"caffe2/operators/rnn/*.h",
"caffe2/opt/*.h",
"caffe2/perfkernels/*.h",
"caffe2/predictor/*.h",
"caffe2/predictor/emulator/*.h",
"caffe2/quantization/server/*.h",
"caffe2/queue/*.h",
"caffe2/serialize/*.h",
"caffe2/sgd/*.h",
"caffe2/share/contrib/depthwise/*.h",
"caffe2/transforms/*.h",
"caffe2/utils/*.h",
"caffe2/utils/math/*.h",
"caffe2/utils/threadpool/*.h",
@ -1338,10 +606,9 @@ cc_library(
) + if_cuda(glob([
"caffe2/**/*.cuh",
"caffe2/image/*.h",
])) + [":generated_caffe2_aten_op_headers"],
])),
copts = CAFFE2_COPTS,
includes = [
"caffe2/contrib/aten",
"caffe2/core/nomnigraph/include",
],
visibility = ["//visibility:public"],
@ -1353,52 +620,12 @@ cc_library(
],
)
cc_library(
name = "caffe2_dnnlowp_avx2_ops",
srcs = [
"caffe2/quantization/server/elementwise_sum_dnnlowp_op_avx2.cc",
"caffe2/quantization/server/fully_connected_fake_lowp_op_avx2.cc",
"caffe2/quantization/server/group_norm_dnnlowp_op_avx2.cc",
"caffe2/quantization/server/norm_minimization_avx2.cc",
"caffe2/quantization/server/pool_dnnlowp_op_avx2.cc",
"caffe2/quantization/server/relu_dnnlowp_op_avx2.cc",
"caffe2/quantization/server/spatial_batch_norm_dnnlowp_op_avx2.cc",
"caffe2/quantization/server/transpose.cc",
],
copts = CAFFE2_COPTS + [
"-mf16c",
"-mavx2",
"-mfma",
"-mxsave",
],
visibility = ["//visibility:public"],
deps = [
":caffe2_headers",
"@fbgemm",
],
alwayslink = True,
)
cc_library(
name = "caffe2",
srcs = [
"caffe2/db/create_db_op.cc",
"caffe2/db/protodb.cc",
"caffe2/share/contrib/depthwise/depthwise3x3_conv_op.cc",
":caffe2_contrib_srcs",
":caffe2_core_srcs",
":caffe2_distributed_srcs",
":caffe2_ideep_srcs",
":caffe2_onnx_srcs",
":caffe2_operators_srcs",
":caffe2_opt_srcs",
":caffe2_perfkernels_srcs",
":caffe2_predictor_srcs",
":caffe2_quantization_srcs",
":caffe2_queue_srcs",
":caffe2_serialize_srcs",
":caffe2_sgd_srcs",
":caffe2_transforms_srcs",
":caffe2_utils_srcs",
],
copts = CAFFE2_COPTS + ["-mf16c"],
@ -1406,7 +633,6 @@ cc_library(
visibility = ["//visibility:public"],
deps = [
":caffe2_core_macros",
":caffe2_dnnlowp_avx2_ops",
":caffe2_headers",
":caffe2_perfkernels_avx",
":caffe2_perfkernels_avx2",
@ -1419,11 +645,9 @@ cc_library(
"@fbgemm//:fbgemm_src_headers",
"@fmt",
"@foxi",
"@gloo",
"@onnx",
] + if_cuda(
[
":caffe2_cuda_cpp",
":aten_cuda",
"@tensorpipe//:tensorpipe_cuda",
],
@ -1435,39 +659,20 @@ cc_library(
alwayslink = True,
)
cc_library(
name = "caffe2_cuda_cpp",
srcs = [":caffe2_cuda_cpp_srcs"],
copts = CAFFE2_COPTS,
visibility = ["//visibility:public"],
deps = [
":caffe2_cuda",
":caffe2_headers",
],
alwayslink = True,
)
cu_library(
name = "caffe2_cuda",
# one may think that `quantization_gpu.cu` could be a separate kernel,
# however that leads to de-registration problem that's described in
# https://github.com/pytorch/pytorch/issues/79236
# To solve it we add it into the `caffe2_cuda`,
# this is also aligned with the CMake build.
srcs = [":caffe2_cu_srcs"] + [
name = "torch_cuda",
srcs = [
"torch/csrc/distributed/c10d/intra_node_comm.cu",
"torch/csrc/distributed/c10d/quantization/quantization_gpu.cu",
],
copts = CAFFE2_COPTS + torch_cuda_half_options,
copts = torch_cuda_half_options,
visibility = ["//visibility:public"],
deps = [
":aten",
":caffe2_headers",
"@cuda//:cublas",
"@cuda//:curand",
"@cudnn",
"@eigen",
"@gloo",
"@tensorpipe//:tensorpipe_cuda",
],
alwayslink = True,
@ -1641,6 +846,7 @@ cc_library(
] + if_cuda([
"@cuda//:nvToolsExt",
"@cutlass",
":torch_cuda",
]),
alwayslink = True,
)

View File

@ -56,7 +56,7 @@ endif()
# This define is needed to preserve behavior given anticipated changes to cccl/thrust
# https://nvidia.github.io/libcudacxx/standard_api/numerics_library/complex.html
string(APPEND CMAKE_CUDA_FLAGS "-DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS")
string(APPEND CMAKE_CUDA_FLAGS " -DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS")
if(LINUX)
include(cmake/CheckAbi.cmake)
@ -228,14 +228,10 @@ option(USE_FBGEMM "Use FBGEMM (quantized 8-bit server operators)" ON)
option(USE_KINETO "Use Kineto profiling library" ON)
option(USE_CUPTI_SO "Use CUPTI as a shared library" ON)
option(USE_FAKELOWP "Use FakeLowp operators" OFF)
option(USE_FFMPEG "Use ffmpeg" OFF)
option(USE_GFLAGS "Use GFLAGS" OFF)
option(USE_GLOG "Use GLOG" OFF)
option(USE_LEVELDB "Use LEVELDB" OFF)
option(USE_LITE_PROTO "Use lite protobuf instead of full." OFF)
option(USE_LMDB "Use LMDB" OFF)
option(USE_MAGMA "Use MAGMA" ON)
option(USE_METAL "Use Metal for Caffe2 iOS build" ON)
option(USE_PYTORCH_METAL "Use Metal for PyTorch iOS build" OFF)
option(USE_PYTORCH_METAL_EXPORT "Export Metal models on MacOSX desktop" OFF)
option(USE_NATIVE_ARCH "Use -march=native" OFF)
@ -264,15 +260,12 @@ cmake_dependent_option(
option(USE_NUMPY "Use NumPy" ON)
option(USE_OBSERVERS "Use observers module." OFF)
option(USE_OPENCL "Use OpenCL" OFF)
option(USE_OPENCV "Use OpenCV" OFF)
option(USE_OPENMP "Use OpenMP for parallel code" ON)
option(USE_PRECOMPILED_HEADERS "Use pre-compiled headers to accelerate build." OFF)
option(USE_PROF "Use profiling" OFF)
option(USE_QNNPACK "Use QNNPACK (quantized 8-bit operators)" ON)
option(USE_PYTORCH_QNNPACK "Use ATen/QNNPACK (quantized 8-bit operators)" ON)
option(USE_REDIS "Use Redis" OFF)
option(USE_ROCKSDB "Use RocksDB" OFF)
option(USE_SNPE "Use Qualcomm's SNPE library" OFF)
option(USE_SYSTEM_EIGEN_INSTALL
"Use system Eigen instead of the one under third_party" OFF)
@ -294,7 +287,6 @@ option(USE_VULKAN_FP16_INFERENCE "Vulkan - Use fp16 inference" OFF)
option(USE_VULKAN_RELAXED_PRECISION "Vulkan - Use relaxed precision math in the kernels (mediump)" OFF)
# option USE_XNNPACK: try to enable xnnpack by default.
option(USE_XNNPACK "Use XNNPACK" ON)
option(USE_ZMQ "Use ZMQ" OFF)
option(USE_ZSTD "Use ZSTD" OFF)
option(USE_ROCM_KERNEL_ASSERT "Use Kernel Assert for ROCm" OFF)
# Ensure that an ITT build is the default for x86 CPUs

View File

@ -116,7 +116,7 @@ torch/profiler/ @aaronenyeshi
test/functorch/test_aotdispatch.py @ezyang @Chillee
# Dataloader
torch/utils/data/ @ejguan
torch/utils/data/ @andrewkho @gokulavasan
# hipify
torch/utils/hipify/ @jeffdaily @jithunnair-amd
@ -144,3 +144,14 @@ caffe2/utils/hip @jeffdaily @jithunnair-amd
/torch/csrc/Storage* @mikaylagawarecki
# subscribing for PyTorchFileWriter/PyTorchFileReader changes
/torch/csrc/jit/python/init.cpp @mikaylagawarecki
# CUDA and CUDA math libraries
aten/src/ATen/cuda/ @eqy
aten/src/ATen/cudnn/ @eqy
aten/src/ATen/native/cuda/ @eqy
aten/src/ATen/native/cudnn/ @eqy
c10/cuda @eqy
torch/cuda/ @eqy
torch/csrc/cuda/ @eqy
torch/backends/cuda/ @eqy
torch/backends/cudnn/ @eqy

View File

@ -1,12 +1,10 @@
# syntax = docker/dockerfile:experimental
# syntax=docker/dockerfile:1
# NOTE: Building this image requires docker version >= 23.0.
#
# NOTE: To build this you will need a docker version > 18.06 with
# experimental enabled and DOCKER_BUILDKIT=1
#
# If you do not use buildkit you are not going to have a good time
#
# For reference:
# https://docs.docker.com/develop/develop-images/build_enhancements/
# For reference:
# - https://docs.docker.com/build/dockerfile/frontend/#stable-channel
ARG BASE_IMAGE=ubuntu:22.04
ARG PYTHON_VERSION=3.11
@ -67,8 +65,9 @@ ARG CUDA_VERSION=12.1
ARG CUDA_CHANNEL=nvidia
ARG INSTALL_CHANNEL=pytorch-nightly
# Automatically set by buildx
# Note conda needs to be pinned to 23.5.2 see: https://github.com/pytorch/pytorch/issues/106470
RUN /opt/conda/bin/conda install -c "${INSTALL_CHANNEL}" -y python=${PYTHON_VERSION} conda=23.5.2
RUN /opt/conda/bin/conda update -y -n base -c defaults conda
RUN /opt/conda/bin/conda install -c "${INSTALL_CHANNEL}" -y python=${PYTHON_VERSION}
ARG TARGETPLATFORM
# On arm64 we can only install wheel packages.

View File

@ -65,8 +65,8 @@ Following is the release cadence for year 2023/2024. All dates below are tentati
| --- | --- | --- | --- | --- |
| 2.1 | Aug 2023 | Oct 2023 | Nov 2023 | Dec 2023 |
| 2.2 | Dec 2023 | Jan 2024 | Feb 2024 | Mar 2024 |
| 2.3 | Mar 2024 | Apr 2024 | May 2024 | Jun 2024 |
| 2.4 | May 2024 | Jul 2024 | Aug 2024 | Sep 2024 |
| 2.3 | Mar 2024 | Apr 2024 | Jun 2024 | Not planned |
| 2.4 | Jun 2024 | Jul 2024 | Aug 2024 | Sep 2024 |
| 2.5 | Aug 2024 | Oct 2024 | Nov 2024 | Dec 2024 |
## General Overview

View File

@ -268,6 +268,12 @@ at::BlasBackend Context::blasPreferredBackend() const {
}
void Context::setBlasPreferredBackend(at::BlasBackend b) {
#ifdef _MSC_VER
TORCH_WARN_ONCE(
"torch.backends.cuda.preferred_blas_library is an experimental feature. "
"It is not supported on Windows."
);
#else
TORCH_CHECK((b != at::BlasBackend::Cublaslt) || hasCuBLASLt(),
"Cannot set preferred backend to cuBLASLt if PyTorch has not been compiled with cuBLASLt.");
if (b != at::BlasBackend::Cublas) {
@ -278,6 +284,7 @@ void Context::setBlasPreferredBackend(at::BlasBackend b) {
);
}
blas_preferred_backend = b;
#endif
}
bool Context::allowFP16ReductionCuBLAS() const {

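On the setBlasPreferredBackend change above: a minimal Python-level sketch of the knob it backs, assuming the public torch.backends.cuda.preferred_blas_library API and a CUDA-enabled build. Per the new guard, on Windows the setter now only emits the warning instead of switching backends.

```python
import torch

# Query the current preference (returns a torch._C._BlasBackend value).
print(torch.backends.cuda.preferred_blas_library())

# Ask for cuBLASLt to be preferred for GEMMs. On Windows this call now only
# triggers the TORCH_WARN_ONCE added above and leaves the backend unchanged.
torch.backends.cuda.preferred_blas_library("cublaslt")

# Subsequent matmuls dispatch through the preferred BLAS backend
# (assumes a CUDA device is available).
a = torch.randn(128, 128, device="cuda")
b = torch.randn(128, 128, device="cuda")
c = a @ b
```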
View File

@ -57,6 +57,8 @@ SparseCsrTensorImpl::SparseCsrTensorImpl(
TORCH_INTERNAL_ASSERT(((key_set.has(DispatchKey::SparseCsrCPU) && device().type() == kCPU)
|| (key_set.has(DispatchKey::SparseCsrCUDA) && device().type() == kCUDA)
|| (key_set.has(DispatchKey::SparseCsrMeta) && device().type() == kMeta)
|| (key_set.has(DispatchKey::SparseCsrCPU) && device().type() == kMeta) // fake tensor
|| (key_set.has(DispatchKey::SparseCsrCUDA) && device().type() == kMeta) // fake tensor
|| (key_set.has(DispatchKey::SparseCsrPrivateUse1) && device().type() == kPrivateUse1)),
"Inconsistent key_set (=", key_set, ") and device (=", device(), ")");

View File

@ -2,6 +2,7 @@
#include <ATen/Tensor.h>
#include <c10/core/TensorImpl.h>
#include <c10/core/impl/TorchDispatchModeTLS.h>
#include <c10/util/Exception.h>
namespace at {
@ -107,6 +108,39 @@ struct TORCH_API SparseCsrTensorImpl : public TensorImpl {
}
}
template <typename VariableVersion>
c10::intrusive_ptr<TensorImpl> shallow_copy_and_detach_core(
VariableVersion&& version_counter,
bool allow_tensor_metadata_change) const {
const auto mode_stack_len = c10::impl::TorchDispatchModeTLS::stack_len();
c10::impl::PyInterpreter&& interpreter = nullptr;
if (mode_stack_len > 0 &&
!c10::impl::tls_is_dispatch_key_excluded(DispatchKey::Python)) {
const auto& cur_torch_dispatch_mode_state =
c10::impl::TorchDispatchModeTLS::get_stack_at(mode_stack_len - 1);
interpreter = cur_torch_dispatch_mode_state->pyinterpreter();
} else if (
key_set_.has(DispatchKey::Python) &&
!c10::impl::tls_is_dispatch_key_excluded(DispatchKey::Python)) {
interpreter = pyobj_slot_.load_pyobj_interpreter();
} else {
// otherwise just copy the SparseTensorImpl and not the PyObject.
auto impl = c10::make_intrusive<SparseCsrTensorImpl>(
key_set(), device(), layout_impl(), dtype());
copy_tensor_metadata(
/*src_sparse_impl=*/this,
/*dest_sparse_impl=*/impl.get(),
/*version_counter=*/version_counter,
/*allow_tensor_metadata_change=*/allow_tensor_metadata_change);
impl->refresh_numel();
return impl;
}
auto r = interpreter->detach(this);
r->set_version_counter(std::forward<VariableVersion>(version_counter));
r->set_allow_tensor_metadata_change(allow_tensor_metadata_change);
return r;
}
/**
* Return a TensorImpl that is a shallow-copy of this TensorImpl.
*
@ -116,15 +150,8 @@ struct TORCH_API SparseCsrTensorImpl : public TensorImpl {
c10::intrusive_ptr<TensorImpl> shallow_copy_and_detach(
const c10::VariableVersion& version_counter,
bool allow_tensor_metadata_change) const override {
auto impl = c10::make_intrusive<SparseCsrTensorImpl>(
key_set(), device(), layout_impl(), dtype());
copy_tensor_metadata(
/*src_sparse_impl=*/this,
/*dest_sparse_impl=*/impl.get(),
/*version_counter=*/version_counter,
/*allow_tensor_metadata_change=*/allow_tensor_metadata_change);
impl->refresh_numel();
return impl;
return shallow_copy_and_detach_core(
version_counter, allow_tensor_metadata_change);
}
/**
@ -136,15 +163,8 @@ struct TORCH_API SparseCsrTensorImpl : public TensorImpl {
c10::intrusive_ptr<TensorImpl> shallow_copy_and_detach(
c10::VariableVersion&& version_counter,
bool allow_tensor_metadata_change) const override {
auto impl = c10::make_intrusive<SparseCsrTensorImpl>(
key_set(), device(), layout_impl(), dtype());
copy_tensor_metadata(
/*src_sparse_impl=*/this,
/*dest_sparse_impl=*/impl.get(),
/*version_counter=*/std::move(version_counter),
/*allow_tensor_metadata_change=*/allow_tensor_metadata_change);
impl->refresh_numel();
return impl;
return shallow_copy_and_detach_core(
std::move(version_counter), allow_tensor_metadata_change);
}
private:

View File

@ -2,6 +2,7 @@
#include <ATen/Tensor.h>
#include <c10/core/TensorImpl.h>
#include <c10/core/impl/TorchDispatchModeTLS.h>
#include <c10/util/Exception.h>
#include <c10/util/irange.h>
@ -306,6 +307,38 @@ struct TORCH_API SparseTensorImpl : public TensorImpl {
const Tensor& indices,
const Tensor& values);
template <typename VariableVersion>
c10::intrusive_ptr<TensorImpl> shallow_copy_and_detach_core(
VariableVersion&& version_counter,
bool allow_tensor_metadata_change) const {
const auto mode_stack_len = c10::impl::TorchDispatchModeTLS::stack_len();
c10::impl::PyInterpreter&& interpreter = nullptr;
if (mode_stack_len > 0 &&
!c10::impl::tls_is_dispatch_key_excluded(DispatchKey::Python)) {
const auto& cur_torch_dispatch_mode_state =
c10::impl::TorchDispatchModeTLS::get_stack_at(mode_stack_len - 1);
interpreter = cur_torch_dispatch_mode_state->pyinterpreter();
} else if (
key_set_.has(DispatchKey::Python) &&
!c10::impl::tls_is_dispatch_key_excluded(DispatchKey::Python)) {
interpreter = pyobj_slot_.load_pyobj_interpreter();
} else {
// otherwise just copy the SparseTensorImpl and not the PyObject.
auto impl = c10::make_intrusive<SparseTensorImpl>(key_set(), dtype());
copy_tensor_metadata(
/*src_sparse_impl=*/this,
/*dest_sparse_impl=*/impl.get(),
/*version_counter=*/version_counter,
/*allow_tensor_metadata_change=*/allow_tensor_metadata_change);
impl->refresh_numel();
return impl;
}
auto r = interpreter->detach(this);
r->set_version_counter(std::forward<VariableVersion>(version_counter));
r->set_allow_tensor_metadata_change(allow_tensor_metadata_change);
return r;
}
/**
* Return a TensorImpl that is a shallow-copy of this TensorImpl.
*
@ -315,14 +348,8 @@ struct TORCH_API SparseTensorImpl : public TensorImpl {
c10::intrusive_ptr<TensorImpl> shallow_copy_and_detach(
const c10::VariableVersion& version_counter,
bool allow_tensor_metadata_change) const override {
auto impl = c10::make_intrusive<SparseTensorImpl>(key_set(), dtype());
copy_tensor_metadata(
/*src_sparse_impl=*/this,
/*dest_sparse_impl=*/impl.get(),
/*version_counter=*/version_counter,
/*allow_tensor_metadata_change=*/allow_tensor_metadata_change);
impl->refresh_numel();
return impl;
return shallow_copy_and_detach_core(
version_counter, allow_tensor_metadata_change);
}
/**
@ -334,14 +361,8 @@ struct TORCH_API SparseTensorImpl : public TensorImpl {
c10::intrusive_ptr<TensorImpl> shallow_copy_and_detach(
c10::VariableVersion&& version_counter,
bool allow_tensor_metadata_change) const override {
auto impl = c10::make_intrusive<SparseTensorImpl>(key_set(), dtype());
copy_tensor_metadata(
/*src_sparse_impl=*/this,
/*dest_sparse_impl=*/impl.get(),
/*version_counter=*/std::move(version_counter),
/*allow_tensor_metadata_change=*/allow_tensor_metadata_change);
impl->refresh_numel();
return impl;
return shallow_copy_and_detach_core(
std::move(version_counter), allow_tensor_metadata_change);
}
/**

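A hedged sketch of what the shallow_copy_and_detach_core refactor in the two sparse impls above is aiming at: when a TorchDispatchMode is active (or the tensor carries the Python dispatch key), detach is routed through the Python interpreter rather than copying only the C++ impl. The snippet uses the standard torch.utils._python_dispatch API; exactly which aten ops show up depends on the dispatcher path taken.

```python
import torch
from torch.utils._python_dispatch import TorchDispatchMode

class LogOps(TorchDispatchMode):
    # Log every aten op that reaches this mode, then redispatch it unchanged.
    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        print("dispatched:", func)
        return func(*args, **(kwargs or {}))

crow = torch.tensor([0, 2, 4])
col = torch.tensor([0, 1, 0, 1])
vals = torch.tensor([1.0, 2.0, 3.0, 4.0])
csr = torch.sparse_csr_tensor(crow, col, vals, (2, 4))

with LogOps():
    csr.detach()  # with the refactor, detaching the sparse impl is visible to the mode
```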
View File

@ -158,159 +158,6 @@ namespace {
Explicit registration for out-of-place ops
*****************************************/
#define AT_FORALL_LOWER_PRECISION_FP(_) \
_(_convolution, deprecated) \
_(_convolution) \
_(conv1d) \
_(conv2d) \
_(conv3d) \
_(conv_tbc) \
_(conv_transpose1d) \
_(conv_transpose2d, input) \
_(conv_transpose3d, input) \
_(convolution) \
_(prelu) \
_(addmm) \
_(addmv) \
_(addr) \
_(matmul) \
_(einsum) \
_(mm) \
_(mv) \
_(linalg_vecdot) \
_(linear) \
_(addbmm) \
_(baddbmm) \
_(bmm) \
_(chain_matmul) \
_(linalg_multi_dot) \
_(_thnn_fused_lstm_cell) \
_(_thnn_fused_gru_cell) \
_(lstm_cell) \
_(gru_cell) \
_(rnn_tanh_cell) \
_(rnn_relu_cell) \
_(_scaled_dot_product_flash_attention) \
_(scaled_dot_product_attention)
#define AT_FORALL_FP32(_) \
_(acos) \
_(asin) \
_(cosh) \
_(erfinv) \
_(exp) \
_(expm1) \
_(log) \
_(log10) \
_(log2) \
_(log1p) \
_(reciprocal) \
_(rsqrt) \
_(sinh) \
_(tan) \
_(pow, Tensor_Scalar) \
_(pow, Tensor_Tensor) \
_(pow, Scalar) \
_(softplus) \
_(layer_norm) \
_(native_layer_norm) \
_(group_norm) \
_(frobenius_norm, dim) \
_(nuclear_norm) \
_(nuclear_norm, dim) \
_(cosine_similarity) \
_(poisson_nll_loss) \
_(cosine_embedding_loss) \
_(nll_loss) \
_(nll_loss2d) \
_(hinge_embedding_loss) \
_(kl_div) \
_(l1_loss) \
_(smooth_l1_loss) \
_(huber_loss) \
_(mse_loss) \
_(margin_ranking_loss) \
_(multilabel_margin_loss) \
_(soft_margin_loss) \
_(triplet_margin_loss) \
_(multi_margin_loss) \
_(binary_cross_entropy_with_logits) \
_(dist) \
_(pdist) \
_(cdist) \
_(renorm) \
_(logsumexp) \
_(upsample_nearest1d) \
_(_upsample_nearest_exact1d) \
_(upsample_nearest2d) \
_(_upsample_nearest_exact2d) \
_(upsample_nearest3d) \
_(_upsample_nearest_exact3d) \
_(upsample_linear1d) \
_(upsample_bilinear2d) \
_(_upsample_bilinear2d_aa) \
_(upsample_trilinear3d) \
_(upsample_bicubic2d) \
_(_upsample_bicubic2d_aa)
#define AT_FORALL_FP32_SET_OPT_DTYPE(_) \
_(prod) \
_(prod, dim_int) \
_(prod, dim_Dimname) \
_(softmax, int) \
_(softmax, Dimname) \
_(log_softmax, int) \
_(log_softmax, Dimname) \
_(cumprod) \
_(cumprod, dimname) \
_(cumsum) \
_(cumsum, dimname) \
_(linalg_vector_norm) \
_(linalg_matrix_norm) \
_(linalg_matrix_norm, str_ord) \
_(sum) \
_(sum, dim_IntList) \
_(sum, dim_DimnameList)
#define AT_FORALL_DIFFERENT_REDISPATCH_SIGNATURE(_) \
_(ADD_NS(norm), \
"norm.Scalar", \
Tensor(const Tensor&, const Scalar&), \
Tensor(const Tensor&, const c10::optional<Scalar>&, ScalarType), \
fp32_append_dtype) \
_(ADD_NS(norm), \
"norm.ScalarOpt_dim", \
Tensor(const Tensor&, const c10::optional<Scalar>&, IntArrayRef, bool), \
Tensor( \
const Tensor&, \
const c10::optional<Scalar>&, \
IntArrayRef, \
bool, \
ScalarType), \
fp32_append_dtype) \
_(ADD_NS(norm), \
"norm.names_ScalarOpt_dim", \
Tensor(const Tensor&, const c10::optional<Scalar>&, DimnameList, bool), \
Tensor( \
const Tensor&, \
const c10::optional<Scalar>&, \
DimnameList, \
bool, \
ScalarType), \
fp32_append_dtype)
#define AT_FORALL_PROMOTE(_) \
_(addcdiv) \
_(addcmul) \
_(atan2) \
_(bilinear) \
_(cross) \
_(dot) \
_(grid_sampler) \
_(index_put) \
_(tensordot) \
_(scatter_add)
TORCH_LIBRARY_IMPL(_, Autocast, m) {
m.fallback(torch::CppFunction::makeFallthrough());
}

View File

@ -728,7 +728,7 @@ copy pasted in from VariableTypeEverything.cpp with appropriate substitutions.
// KERNEL_PRIVATEUSEONE/KERNEL_DIFFERENT_REDISPATCH_SIGNATURE_PRIVATEUSEONE
// registration (OP, POLICY) or (OP, OVERLOAD, POLICY) for AutocastPrivateUse1
#define KERNEL_PRIVATEUSEONE(OP, ...) \
#define KERNEL_PRIVATEUSEONE(...) \
KERNEL(c10::DeviceType::PrivateUse1, __VA_ARGS__)
#define KERNEL_DIFFERENT_REDISPATCH_SIGNATURE_PRIVATEUSEONE( \
@ -744,3 +744,158 @@ copy pasted in from VariableTypeEverything.cpp with appropriate substitutions.
REGISTER_SIGNATURE, \
REDISPATCH_SIGNATURE, \
POLICY)
// Op lists for different policies.
// To make sure other backends can reuse the policy op list.
#define AT_FORALL_LOWER_PRECISION_FP(_) \
_(_convolution, deprecated) \
_(_convolution) \
_(conv1d) \
_(conv2d) \
_(conv3d) \
_(conv_tbc) \
_(conv_transpose1d) \
_(conv_transpose2d, input) \
_(conv_transpose3d, input) \
_(convolution) \
_(prelu) \
_(addmm) \
_(addmv) \
_(addr) \
_(matmul) \
_(einsum) \
_(mm) \
_(mv) \
_(linalg_vecdot) \
_(linear) \
_(addbmm) \
_(baddbmm) \
_(bmm) \
_(chain_matmul) \
_(linalg_multi_dot) \
_(_thnn_fused_lstm_cell) \
_(_thnn_fused_gru_cell) \
_(lstm_cell) \
_(gru_cell) \
_(rnn_tanh_cell) \
_(rnn_relu_cell) \
_(_scaled_dot_product_flash_attention) \
_(scaled_dot_product_attention)
#define AT_FORALL_FP32(_) \
_(acos) \
_(asin) \
_(cosh) \
_(erfinv) \
_(exp) \
_(expm1) \
_(log) \
_(log10) \
_(log2) \
_(log1p) \
_(reciprocal) \
_(rsqrt) \
_(sinh) \
_(tan) \
_(pow, Tensor_Scalar) \
_(pow, Tensor_Tensor) \
_(pow, Scalar) \
_(softplus) \
_(layer_norm) \
_(native_layer_norm) \
_(group_norm) \
_(frobenius_norm, dim) \
_(nuclear_norm) \
_(nuclear_norm, dim) \
_(cosine_similarity) \
_(poisson_nll_loss) \
_(cosine_embedding_loss) \
_(nll_loss) \
_(nll_loss2d) \
_(hinge_embedding_loss) \
_(kl_div) \
_(l1_loss) \
_(smooth_l1_loss) \
_(huber_loss) \
_(mse_loss) \
_(margin_ranking_loss) \
_(multilabel_margin_loss) \
_(soft_margin_loss) \
_(triplet_margin_loss) \
_(multi_margin_loss) \
_(binary_cross_entropy_with_logits) \
_(dist) \
_(pdist) \
_(cdist) \
_(renorm) \
_(logsumexp) \
_(upsample_nearest1d) \
_(_upsample_nearest_exact1d) \
_(upsample_nearest2d) \
_(_upsample_nearest_exact2d) \
_(upsample_nearest3d) \
_(_upsample_nearest_exact3d) \
_(upsample_linear1d) \
_(upsample_bilinear2d) \
_(_upsample_bilinear2d_aa) \
_(upsample_trilinear3d) \
_(upsample_bicubic2d) \
_(_upsample_bicubic2d_aa)
#define AT_FORALL_FP32_SET_OPT_DTYPE(_) \
_(prod) \
_(prod, dim_int) \
_(prod, dim_Dimname) \
_(softmax, int) \
_(softmax, Dimname) \
_(log_softmax, int) \
_(log_softmax, Dimname) \
_(cumprod) \
_(cumprod, dimname) \
_(cumsum) \
_(cumsum, dimname) \
_(linalg_vector_norm) \
_(linalg_matrix_norm) \
_(linalg_matrix_norm, str_ord) \
_(sum) \
_(sum, dim_IntList) \
_(sum, dim_DimnameList)
#define AT_FORALL_DIFFERENT_REDISPATCH_SIGNATURE(_) \
_(ADD_NS(norm), \
"norm.Scalar", \
Tensor(const Tensor&, const Scalar&), \
Tensor(const Tensor&, const c10::optional<Scalar>&, ScalarType), \
fp32_append_dtype) \
_(ADD_NS(norm), \
"norm.ScalarOpt_dim", \
Tensor(const Tensor&, const c10::optional<Scalar>&, IntArrayRef, bool), \
Tensor( \
const Tensor&, \
const c10::optional<Scalar>&, \
IntArrayRef, \
bool, \
ScalarType), \
fp32_append_dtype) \
_(ADD_NS(norm), \
"norm.names_ScalarOpt_dim", \
Tensor(const Tensor&, const c10::optional<Scalar>&, DimnameList, bool), \
Tensor( \
const Tensor&, \
const c10::optional<Scalar>&, \
DimnameList, \
bool, \
ScalarType), \
fp32_append_dtype)
#define AT_FORALL_PROMOTE(_) \
_(addcdiv) \
_(addcmul) \
_(atan2) \
_(bilinear) \
_(cross) \
_(dot) \
_(grid_sampler) \
_(index_put) \
_(tensordot) \
_(scatter_add)
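To ground what these shared policy lists mean at the user level, a small sketch with the standard torch.autocast API (device and dtype choices here are illustrative): ops on the LOWER_PRECISION_FP list such as mm/matmul run in the reduced dtype, while ops on the FP32 list stay in float32.

```python
import torch

a = torch.randn(8, 8)
b = torch.randn(8, 8)

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    lowp = a @ b                    # matmul: LOWER_PRECISION_FP -> bfloat16
    full = torch.log(a.abs() + 1)   # log: FP32 policy -> stays float32

print(lowp.dtype, full.dtype)  # torch.bfloat16 torch.float32
```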

View File

@ -9,7 +9,7 @@ namespace c10 {
// const reference (const T&); taking T by non-const reference
// will result in an error like:
//
// error: no type named 'type' in 'class std::result_of<foobar::__lambda(T)>'
// error: no type named 'type' in 'class std::invoke_result<foobar::__lambda, T>'
//
// No explicit template parameters are required.

View File

@ -227,6 +227,7 @@ namespace c10 {
_(aten, is_autocast_enabled) \
_(aten, is_autocast_cpu_enabled) \
_(aten, is_autocast_xla_enabled) \
_(aten, get_autocast_dtype) \
FORALL_ATEN_BASE_SYMBOLS(_) \
_(onnx, Add) \
_(onnx, Concat) \

View File

@ -1034,11 +1034,9 @@ struct C10_EXPORT ivalue::Future final : c10::intrusive_ptr_target {
*/
template <typename T>
void addCallback(T callback, bool uses_future = true) {
#if __cpp_lib_is_invocable >= 201703
static_assert(
std::is_invocable_r<void, T, Future&>::value,
"The callback must have signature void(Future&)");
#endif
std::unique_lock<std::mutex> lock(mutex_);
if (completed()) {
@ -1057,14 +1055,13 @@ struct C10_EXPORT ivalue::Future final : c10::intrusive_ptr_target {
template <typename T>
c10::intrusive_ptr<Future> then(T callback, TypePtr type) {
using IValueWithStorages = std::tuple<IValue, std::vector<WeakStorage>>;
#if __cpp_lib_is_invocable >= 201703
static_assert(
std::disjunction<
std::is_invocable_r<IValue, T, Future&>,
std::is_invocable_r<IValueWithStorages, T, Future&>>::value,
"The callback must have signature IValue(Future&) or "
"std::tuple<IValue, std::vector<Storage>>(Future&)");
#endif
auto childFut = createInstance(::std::move(type));
addCallback([childFut,
cb = std::move(callback)](Future& parentFut) mutable {
@ -1084,11 +1081,10 @@ struct C10_EXPORT ivalue::Future final : c10::intrusive_ptr_target {
template <typename T>
c10::intrusive_ptr<Future> thenAsync(T callback, TypePtr type) {
#if __cpp_lib_is_invocable >= 201703
static_assert(
std::is_invocable_r<c10::intrusive_ptr<Future>, T, Future&>::value,
"The callback must have signature c10::intrusive_ptr<Future>(Future&)");
#endif
auto childFut = createInstance(std::move(type));
addCallback(
[childFut, cb = std::move(callback)](Future& parentFut) mutable {
@ -1165,11 +1161,9 @@ struct C10_EXPORT ivalue::Future final : c10::intrusive_ptr_target {
// synchronize them with the value, and so on (if needed).
template<typename T>
void invokeCallback(T callback, bool uses_future) {
#if __cpp_lib_is_invocable >= 201703
static_assert(
std::is_invocable_r<void, T, Future&>::value,
"The callback must have signature void(Future&)");
#endif
// The synchronization performed below shouldn't be needed when the future
// is not used by the callback.
@ -2321,8 +2315,7 @@ IValue::IValue(c10::intrusive_ptr<T> custom_class) : tag(Tag::Object) {
} catch (const c10::Error&) {
throw c10::Error(
"Trying to instantiate a class that isn't a registered custom class: " +
std::string(c10::util::get_fully_qualified_type_name<T>()),
"");
std::string(c10::util::get_fully_qualified_type_name<T>()));
}
}();
auto ivalue_obj = c10::ivalue::Object::create(std::move(classType), /* numSlots */1);

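The static_asserts above are now unconditional after dropping the __cpp_lib_is_invocable guards; they spell out the callback contracts for addCallback and then. The user-facing mirror of this contract is torch.futures.Future, sketched below with the standard then / add_done_callback / set_result API.

```python
import torch

fut = torch.futures.Future()

# then(): the callback receives the completed Future and returns a value,
# which becomes the result of a new chained Future.
chained = fut.then(lambda f: f.value() + 1)

# add_done_callback(): the callback receives the Future; its return value is ignored.
fut.add_done_callback(lambda f: print("done with", f.value()))

fut.set_result(41)
print(chained.wait())  # 42
```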
View File

@ -126,32 +126,44 @@ struct VecConvert<int32_t, 1, uint8_t, 1> {
}
};
template <typename dst_t, typename src_t>
struct VecConvert<
dst_t,
1,
src_t,
1,
typename std::enable_if_t<
(is_reduced_floating_point_v<dst_t> && is_8bit_integer_v<src_t>) ||
(is_reduced_floating_point_v<src_t> && is_8bit_integer_v<dst_t>),
void>> {
static inline VectorizedN<dst_t, 1> apply(const VectorizedN<src_t, 1>& src) {
VectorizedN<float, 1> tmp_fp32 = VecConvert<float, 1, src_t, 1>::apply(src);
return VecConvert<dst_t, 1, float, 1>::apply(tmp_fp32);
}
};
template <typename dst_t>
struct VecConvert<
dst_t,
1,
float,
1,
typename std::enable_if_t<
std::is_same_v<dst_t, unsigned char> || std::is_same_v<dst_t, signed char>,
void>> {
static inline VectorizedN<dst_t, 1> apply(
const VectorizedN<float, 1>& src) {
dst_t,
1,
float,
1,
typename std::enable_if_t<is_8bit_integer_v<dst_t>,
void>> {
static inline VectorizedN<dst_t, 1> apply(const VectorizedN<float, 1>& src) {
return convert_float_to_int8<dst_t>(src[0]);
}
};
template <typename src_t>
struct VecConvert<
float,
1,
src_t,
1,
typename std::enable_if_t<
std::is_same_v<src_t, unsigned char> || std::is_same_v<src_t, signed char>,
void>> {
static inline VectorizedN<float, 1> apply(
const VectorizedN<src_t, 1>& src) {
float,
1,
src_t,
1,
typename std::enable_if_t<is_8bit_integer_v<src_t>,
void>> {
static inline VectorizedN<float, 1> apply(const VectorizedN<src_t, 1>& src) {
return convert_int8_to_float<src_t>(src[0]);
}
};
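The merged specialization above routes any 8-bit-integer/reduced-floating-point conversion through an fp32 intermediate instead of spelling out each pairing. A rough in-tree sketch of that two-hop path; the header path and the `at::vec` spelling are assumptions (these templates normally sit behind the `CPU_CAPABILITY` inline namespace):
```
// Hedged sketch only: uint8 lanes are widened to fp32, then narrowed to BFloat16,
// which is what the combined VecConvert specialization above does internally.
#include <cstdint>
#include <ATen/cpu/vec/vec.h>

at::vec::VectorizedN<at::BFloat16, 1> u8_to_bf16(
    const at::vec::VectorizedN<uint8_t, 1>& src) {
  // equivalent to VecConvert<BFloat16, 1, uint8_t, 1>::apply(src)
  auto tmp_fp32 = at::vec::VecConvert<float, 1, uint8_t, 1>::apply(src);
  return at::vec::VecConvert<at::BFloat16, 1, float, 1>::apply(tmp_fp32);
}
```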

View File

@ -13,8 +13,6 @@
#include <ATen/cpu/vec/vec_base.h>
#include <c10/util/complex.h>
#define SLEEF_MEMORY_WORKAROUND
namespace at {
namespace vec {
@ -1148,32 +1146,20 @@ struct Vectorized<T, std::enable_if_t<is_zarch_implemented<T>()>> {
}
Vectorized<T> sin() const {
#ifndef SLEEF_MEMORY_WORKAROUND
return mapSleef(Sleef_sinf4_u10, Sleef_sind2_u10);
#else
return mapOrdinary(std::sin);
#endif
}
Vectorized<T> sinh() const {
return mapSleef(Sleef_sinhf4_u10, Sleef_sinhd2_u10);
}
Vectorized<T> cos() const {
#ifndef SLEEF_MEMORY_WORKAROUND
return mapSleef(Sleef_cosf4_u10, Sleef_cosd2_u10);
#else
return mapOrdinary(std::cos);
#endif
}
Vectorized<T> cosh() const {
return mapSleef(Sleef_coshf4_u10, Sleef_coshd2_u10);
}
Vectorized<T> tan() const {
#ifndef SLEEF_MEMORY_WORKAROUND
return mapSleef(Sleef_tanf4_u10, Sleef_tand2_u10);
#else
return mapOrdinary(std::tan);
#endif
}
Vectorized<T> tanh() const {
return mapSleef(Sleef_tanhf4_u10, Sleef_tanhd2_u10);

View File

@ -117,32 +117,44 @@ struct VecConvert<int32_t, 1, uint8_t, 1> {
}
};
template <typename dst_t, typename src_t>
struct VecConvert<
dst_t,
1,
src_t,
1,
typename std::enable_if_t<
(is_reduced_floating_point_v<dst_t> && is_8bit_integer_v<src_t>) ||
(is_reduced_floating_point_v<src_t> && is_8bit_integer_v<dst_t>),
void>> {
static inline VectorizedN<dst_t, 1> apply(const VectorizedN<src_t, 1>& src) {
VectorizedN<float, 1> tmp_fp32 = VecConvert<float, 1, src_t, 1>::apply(src);
return VecConvert<dst_t, 1, float, 1>::apply(tmp_fp32);
}
};
template <typename dst_t>
struct VecConvert<
dst_t,
1,
float,
1,
typename std::enable_if_t<
std::is_same_v<dst_t, unsigned char> || std::is_same_v<dst_t, signed char>,
void>> {
static inline VectorizedN<dst_t, 1> apply(
const VectorizedN<float, 1>& src) {
dst_t,
1,
float,
1,
typename std::enable_if_t<is_8bit_integer_v<dst_t>,
void>> {
static inline VectorizedN<dst_t, 1> apply(const VectorizedN<float, 1>& src) {
return convert_float_to_int8<dst_t>(src[0]);
}
};
template <typename src_t>
struct VecConvert<
float,
1,
src_t,
1,
typename std::enable_if_t<
std::is_same_v<src_t, unsigned char> || std::is_same_v<src_t, signed char>,
void>> {
static inline VectorizedN<float, 1> apply(
const VectorizedN<src_t, 1>& src) {
float,
1,
src_t,
1,
typename std::enable_if_t<is_8bit_integer_v<src_t>,
void>> {
static inline VectorizedN<float, 1> apply(const VectorizedN<src_t, 1>& src) {
return convert_int8_to_float<src_t>(src[0]);
}
};

View File

@ -90,6 +90,16 @@ struct is_reduced_floating_point:
template <typename T>
constexpr bool is_reduced_floating_point_v = is_reduced_floating_point<T>::value;
template <typename T>
struct is_8bit_integer:
std::integral_constant<bool,
std::is_same_v<T, unsigned char> ||
std::is_same_v<T, signed char>> {
};
template <typename T>
constexpr bool is_8bit_integer_v = is_8bit_integer<T>::value;
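A standalone restatement of the trait added above, with a few static_asserts showing which types it accepts; plain `char` and `bool` are distinct types from `signed char`/`unsigned char`, so they are excluded:
```
// Self-contained restatement of the new trait for illustration (C++17).
#include <type_traits>

template <typename T>
struct is_8bit_integer
    : std::integral_constant<bool,
          std::is_same_v<T, unsigned char> || std::is_same_v<T, signed char>> {};

template <typename T>
constexpr bool is_8bit_integer_v = is_8bit_integer<T>::value;

static_assert(is_8bit_integer_v<unsigned char>);
static_assert(is_8bit_integer_v<signed char>);
static_assert(!is_8bit_integer_v<char>);   // plain char is a distinct type
static_assert(!is_8bit_integer_v<bool>);
```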
template<size_t n> struct int_of_size;
#define DEFINE_INT_OF_SIZE(int_t) \

View File

@ -236,7 +236,7 @@ namespace at::cuda::blas {
CUDABLAS_NONNEGINT_CHECK(bgemm<Dtype>, num_batches); \
} while (0)
#if (!defined(USE_ROCM) && !defined(_MSC_VER)) || (defined(USE_ROCM) && ROCM_VERSION >= 50700)
#if !defined(USE_ROCM) || (defined(USE_ROCM) && ROCM_VERSION >= 50700)
#if defined(USE_ROCM) && ROCM_VERSION >= 50700 && ROCM_VERSION < 60000
// only for rocm 5.7 where we first supported hipblaslt, it was difficult
@ -375,7 +375,7 @@ class CuBlasLtMatmulPreference : public CuBlasLtDescriptor<
template <typename Dtype>
inline void bgemm_internal_cublaslt(CUDABLAS_BGEMM_ARGTYPES(Dtype)) {
#if (!defined(USE_ROCM) && !defined(_MSC_VER)) || (defined(USE_ROCM) && ROCM_VERSION >= 60000)
#if !defined(USE_ROCM) || (defined(USE_ROCM) && ROCM_VERSION >= 60000)
cudaDataType_t abcType = CUDA_R_32F;
cublasComputeType_t computeType = CUBLAS_COMPUTE_32F;
cudaDataType_t scaleType = CUDA_R_32F;
@ -1235,7 +1235,7 @@ void gemm<at::BFloat16>(CUDABLAS_GEMM_ARGTYPES(at::BFloat16)) {
}
}
#if (!defined(USE_ROCM) && !defined(_MSC_VER)) || (defined(USE_ROCM) && ROCM_VERSION >= 50700)
#if !defined(USE_ROCM) || (defined(USE_ROCM) && ROCM_VERSION >= 50700)
template <typename Dtype>
void gemm_and_bias(
@ -1745,7 +1745,7 @@ void int8_gemm(
TORCH_CHECK(false, "int8_gemm is only supported for ROCm 6.0 and above");
#endif // !defined(USE_ROCM) || (defined(USE_ROCM) && ROCM_VERSION >= 60000)
}
#endif // (!defined(USE_ROCM) && !defined(_MSC_VER)) || (defined(USE_ROCM) && ROCM_VERSION >= 50700)
#endif // !defined(USE_ROCM) || (defined(USE_ROCM) && ROCM_VERSION >= 50700)
// ROCm 5.6 hipblas matches the const Dtype *A API, but prior hipblas does not.
#if defined(USE_ROCM) && ROCM_VERSION < 50600

View File

@ -82,7 +82,7 @@ void gemm_internal<at::Half>(CUDABLAS_GEMM_ARGTYPES(at::Half));
template <>
void gemm_internal<at::BFloat16>(CUDABLAS_GEMM_ARGTYPES(at::BFloat16));
#if (!defined(USE_ROCM) && !defined(_MSC_VER)) || (defined(USE_ROCM) && ROCM_VERSION >= 50700)
#if !defined(USE_ROCM) || (defined(USE_ROCM) && ROCM_VERSION >= 50700)
enum GEMMAndBiasActivationEpilogue {
None,
RELU,

View File

@ -9,7 +9,7 @@
// cublasLT was introduced in CUDA 10.1 but we enable only for 11.1 that also
// added bf16 support
#if (!defined(USE_ROCM) && !defined(_MSC_VER)) || (defined(USE_ROCM) && ROCM_VERSION >= 50700)
#if !defined(USE_ROCM) || (defined(USE_ROCM) && ROCM_VERSION >= 50700)
#include <cublasLt.h>
#endif
@ -82,7 +82,7 @@ TORCH_CUDA_CPP_API c10::Allocator* getCUDADeviceAllocator();
/* Handles */
TORCH_CUDA_CPP_API cusparseHandle_t getCurrentCUDASparseHandle();
TORCH_CUDA_CPP_API cublasHandle_t getCurrentCUDABlasHandle();
#if (!defined(USE_ROCM) && !defined(_MSC_VER)) || (defined(USE_ROCM) && ROCM_VERSION >= 50700)
#if !defined(USE_ROCM) || (defined(USE_ROCM) && ROCM_VERSION >= 50700)
TORCH_CUDA_CPP_API cublasLtHandle_t getCurrentCUDABlasLtHandle();
#endif

View File

@ -191,7 +191,7 @@ cublasHandle_t getCurrentCUDABlasHandle() {
return handle;
}
#if (!defined(USE_ROCM) && !defined(_MSC_VER)) || (defined(USE_ROCM) && ROCM_VERSION >= 50700)
#if !defined(USE_ROCM) || (defined(USE_ROCM) && ROCM_VERSION >= 50700)
cublasLtHandle_t getCurrentCUDABlasLtHandle() {
#ifdef USE_ROCM
c10::DeviceIndex device = 0;

View File

@ -15,6 +15,14 @@
#include <ATen/cuda/Exceptions.h>
#include <c10/util/StringUtil.h>
#ifndef AT_PER_OPERATOR_HEADERS
#include <ATen/Functions.h>
#include <ATen/NativeFunctions.h>
#else
#include <ATen/ops/allclose.h>
#include <ATen/ops/from_blob.h>
#endif
namespace at::cuda::tunable {
enum class BlasOp {
@ -33,6 +41,39 @@ inline std::string BlasOpToString(BlasOp op) {
return "N";
}
namespace detail {
static bool NumericalCheck(ScalarType dtype, void* c, void* other_c, int64_t size) {
auto options = at::TensorOptions().dtype(dtype).device(at::kCUDA);
// comparison done as 1D tensor
at::Tensor ref = at::from_blob(c, {size}, options);
at::Tensor oth = at::from_blob(other_c, {size}, options);
at::Tensor ref_float = ref.to(at::kFloat);
at::Tensor oth_float = oth.to(at::kFloat);
std::vector<double> atols{1e-1, 1e-2, 1e-3, 1e-4, 1e-5};
std::vector<double> rtols{1e-1, 1e-2, 1e-3, 1e-4, 1e-5};
double last_succeed_atol = 1;
double last_succeed_rtol = 1;
for (auto& atol : atols) {
for (auto& rtol : rtols) {
if (at::allclose(ref_float, oth_float, rtol, atol)) {
last_succeed_atol = atol;
last_succeed_rtol = rtol;
}
}
}
if (last_succeed_atol == 1) {
return false;
}
else {
TUNABLE_LOG("├──verify numerics: atol=", last_succeed_atol, ", rtol=", last_succeed_rtol);
}
return true;
}
}
template <typename T>
struct GemmParams : OpParams {
std::string Signature() const override {
@ -57,32 +98,8 @@ struct GemmParams : OpParams {
}
TuningStatus NumericalCheck(GemmParams<T> *other) {
auto options = at::TensorOptions().dtype(c10::CppTypeToScalarType<T>::value).device(at::kCUDA);
// comparison done as 1D tensor
at::Tensor ref = at::from_blob(c, {m*n}, options);
at::Tensor oth = at::from_blob(other->c, {m*n}, options);
at::Tensor ref_float = ref.to(at::kFloat);
at::Tensor oth_float = oth.to(at::kFloat);
std::vector<double> atols{1e-1, 1e-2, 1e-3, 1e-4, 1e-5};
std::vector<double> rtols{1e-1, 1e-2, 1e-3, 1e-4, 1e-5};
double last_succeed_atol = 1;
double last_succeed_rtol = 1;
for (auto& atol : atols) {
for (auto& rtol : rtols) {
if (at::allclose(ref_float, oth_float, rtol, atol)) {
last_succeed_atol = atol;
last_succeed_rtol = rtol;
}
}
}
if (last_succeed_atol == 1) {
return FAIL;
}
else {
TUNABLE_LOG("├──verify numerics: atol=", last_succeed_atol, ", rtol=", last_succeed_rtol);
}
return OK;
auto c_dtype = c10::CppTypeToScalarType<T>::value;
return detail::NumericalCheck(c_dtype, c, other->c, m*n) ? OK : FAIL;
}
char transa;
@ -124,32 +141,8 @@ struct GemmStridedBatchedParams : OpParams {
}
TuningStatus NumericalCheck(GemmStridedBatchedParams<T> *other) {
auto options = at::TensorOptions().dtype(c10::CppTypeToScalarType<T>::value).device(at::kCUDA);
// comparison done as 1D tensor
at::Tensor ref = at::from_blob(c, {batch*stride_c}, options);
at::Tensor oth = at::from_blob(other->c, {batch*stride_c}, options);
at::Tensor ref_float = ref.to(at::kFloat);
at::Tensor oth_float = oth.to(at::kFloat);
std::vector<double> atols{1e-1, 1e-2, 1e-3, 1e-4, 1e-5};
std::vector<double> rtols{1e-1, 1e-2, 1e-3, 1e-4, 1e-5};
double last_succeed_atol = 1;
double last_succeed_rtol = 1;
for (auto& atol : atols) {
for (auto& rtol : rtols) {
if (at::allclose(ref_float, oth_float, rtol, atol)) {
last_succeed_atol = atol;
last_succeed_rtol = rtol;
}
}
}
if (last_succeed_atol == 1) {
return FAIL;
}
else {
TUNABLE_LOG("├──verify numerics: atol=", last_succeed_atol, ", rtol=", last_succeed_rtol);
}
return OK;
auto c_dtype = c10::CppTypeToScalarType<T>::value;
return detail::NumericalCheck(c_dtype, c, other->c, batch*stride_c) ? OK : FAIL;
}
char transa;
@ -171,4 +164,54 @@ struct GemmStridedBatchedParams : OpParams {
int64_t batch;
};
template <typename T>
struct ScaledGemmParams : OpParams {
std::string Signature() const override {
return c10::str(transa, transb, "_", m, "_", n, "_", k);
}
ScaledGemmParams* DeepCopy() const {
ScaledGemmParams* copy = new ScaledGemmParams;
*copy = *this;
c10::DeviceIndex device = 0;
AT_CUDA_CHECK(c10::cuda::GetDevice(&device));
size_t c_size = m * n * sizeof(T);
copy->c = c10::cuda::CUDACachingAllocator::raw_alloc(c_size);
AT_CUDA_CHECK(c10::cuda::CUDACachingAllocator::memcpyAsync(
copy->c, device, c, device, c_size, getCurrentCUDAStream(device), true));
return copy;
}
// only call on object returned by DeepCopy
void Delete() {
c10::cuda::CUDACachingAllocator::raw_delete(c);
}
TuningStatus NumericalCheck(ScaledGemmParams<T> *other) {
return detail::NumericalCheck(c_dtype, c, other->c, m*n) ? OK : FAIL;
}
char transa;
char transb;
int64_t m;
int64_t n;
int64_t k;
const void* a;
const void* a_scale_ptr;
int64_t lda;
ScalarType a_dtype;
const void* b;
const void* b_scale_ptr;
int64_t ldb;
ScalarType b_dtype;
const void* bias_ptr;
ScalarType bias_dtype;
void* c;
const void* c_scale_ptr;
int64_t ldc;
ScalarType c_dtype;
void* amax_ptr;
bool use_fast_accum;
};
} // namespace at::cuda::tunable
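The refactor above hoists the duplicated tolerance sweep out of the per-params `NumericalCheck` methods into `detail::NumericalCheck`. A self-contained sketch of the same idea using plain vectors instead of CUDA tensors (all names below are illustrative, not part of the PR):
```
// Hedged sketch of the tolerance sweep: try atol/rtol pairs from loose to tight
// and report the tightest pair that still passes; fail if even 1e-1 does not.
#include <cmath>
#include <cstdio>
#include <vector>

static bool allclose(const std::vector<float>& a, const std::vector<float>& b,
                     double rtol, double atol) {
  // assumes a and b have the same length
  for (size_t i = 0; i < a.size(); ++i) {
    if (std::abs(a[i] - b[i]) > atol + rtol * std::abs(b[i])) return false;
  }
  return true;
}

bool numerical_check(const std::vector<float>& ref, const std::vector<float>& other) {
  const double tols[] = {1e-1, 1e-2, 1e-3, 1e-4, 1e-5};
  double last_atol = 1, last_rtol = 1;
  for (double atol : tols) {
    for (double rtol : tols) {
      if (allclose(ref, other, rtol, atol)) {  // remember the last pair that passed
        last_atol = atol;
        last_rtol = rtol;
      }
    }
  }
  if (last_atol == 1) return false;  // nothing passed, even at the loosest setting
  std::printf("verify numerics: atol=%g, rtol=%g\n", last_atol, last_rtol);
  return true;
}
```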

View File

@ -4,6 +4,7 @@
#pragma once
#include <ATen/cuda/CUDAContext.h>
#include <ATen/cuda/CUDADataType.h>
#include <ATen/cuda/tunable/TunableOp.h>
#include <ATen/cuda/tunable/GemmCommon.h>
#include <c10/cuda/CUDACachingAllocator.h>
@ -67,6 +68,16 @@ constexpr hipblasltDatatype_t HipBlasDataTypeFor<double>() {
return HIPBLASLT_R_64F;
}
template <>
constexpr hipblasltDatatype_t HipBlasDataTypeFor<c10::Float8_e4m3fnuz>() {
return HIPBLASLT_R_8F_E4M3;
}
template <>
constexpr hipblasltDatatype_t HipBlasDataTypeFor<c10::Float8_e5m2fnuz>() {
return HIPBLASLT_R_8F_E5M3;
}
#define DATA_TYPE_R_32 HIPBLASLT_R_32F
#else
@ -94,6 +105,16 @@ constexpr hipblasDatatype_t HipBlasDataTypeFor<double>() {
return HIPBLAS_R_64F;
}
template <>
constexpr hipblasDatatype_t HipBlasDataTypeFor<c10::Float8_e4m3fnuz>() {
return HIP_R_8F_E4M3_FNUZ;
}
template <>
constexpr hipblasDatatype_t HipBlasDataTypeFor<c10::Float8_e5m2fnuz>() {
return HIP_R_8F_E5M2_FNUZ;
}
#ifdef HIPBLAS_V2
#define DATA_TYPE_R_32 HIP_R_32F
#else
@ -102,8 +123,8 @@ constexpr hipblasDatatype_t HipBlasDataTypeFor<double>() {
#endif
template <typename T, typename ParamsT>
int GetBatchFromParams(const ParamsT* params) {
template <typename T>
int GetBatchFromParams(const GemmParams<T>* params) {
return 1;
}
@ -112,8 +133,13 @@ int GetBatchFromParams(const GemmStridedBatchedParams<T>* params) {
return params->batch;
}
template <typename T, typename ParamsT>
int GetStrideAFromParams(const ParamsT* params) {
template <typename T>
int GetBatchFromParams(const ScaledGemmParams<T>* params) {
return 1;
}
template <typename T>
int GetStrideAFromParams(const GemmParams<T>* params) {
return 1;
}
@ -122,8 +148,13 @@ int GetStrideAFromParams(const GemmStridedBatchedParams<T>* params) {
return params->stride_a;
}
template <typename T, typename ParamsT>
int GetStrideBFromParams(const ParamsT* params) {
template <typename T>
int GetStrideAFromParams(const ScaledGemmParams<T>* params) {
return 1;
}
template <typename T>
int GetStrideBFromParams(const GemmParams<T>* params) {
return 1;
}
@ -132,8 +163,13 @@ int GetStrideBFromParams(const GemmStridedBatchedParams<T>* params) {
return params->stride_b;
}
template <typename T, typename ParamsT>
int GetStrideCFromParams(const ParamsT* params) {
template <typename T>
int GetStrideBFromParams(const ScaledGemmParams<T>* params) {
return 1;
}
template <typename T>
int GetStrideCFromParams(const GemmParams<T>* params) {
return 1;
}
@ -142,6 +178,116 @@ int GetStrideCFromParams(const GemmStridedBatchedParams<T>* params) {
return params->stride_c;
}
template <typename T>
int GetStrideCFromParams(const ScaledGemmParams<T>* params) {
return 1;
}
template <typename T>
float GetAlphaFromParams(const GemmParams<T>* params) {
return params->alpha;
}
template <typename T>
float GetAlphaFromParams(const GemmStridedBatchedParams<T>* params) {
return params->alpha;
}
template <typename T>
float GetAlphaFromParams(const ScaledGemmParams<T>* params) {
return 1.0;
}
template <typename T>
float GetBetaFromParams(const GemmParams<T>* params) {
return params->beta;
}
template <typename T>
float GetBetaFromParams(const GemmStridedBatchedParams<T>* params) {
return params->beta;
}
template <typename T>
float GetBetaFromParams(const ScaledGemmParams<T>* params) {
return 0.0;
}
template <typename T>
const void* GetAScalePointerFromParams(const GemmParams<T>* params) {
return nullptr;
}
template <typename T>
const void* GetAScalePointerFromParams(const GemmStridedBatchedParams<T>* params) {
return nullptr;
}
template <typename T>
const void* GetAScalePointerFromParams(const ScaledGemmParams<T>* params) {
return params->a_scale_ptr;
}
template <typename T>
const void* GetBScalePointerFromParams(const GemmParams<T>* params) {
return nullptr;
}
template <typename T>
const void* GetBScalePointerFromParams(const GemmStridedBatchedParams<T>* params) {
return nullptr;
}
template <typename T>
const void* GetBScalePointerFromParams(const ScaledGemmParams<T>* params) {
return params->b_scale_ptr;
}
template <typename T>
const void* GetDScalePointerFromParams(const GemmParams<T>* params) {
return nullptr;
}
template <typename T>
const void* GetDScalePointerFromParams(const GemmStridedBatchedParams<T>* params) {
return nullptr;
}
template <typename T>
const void* GetDScalePointerFromParams(const ScaledGemmParams<T>* params) {
return params->c_scale_ptr;
}
template <typename T>
const void* GetBiasPointerFromParams(const GemmParams<T>* params) {
return nullptr;
}
template <typename T>
const void* GetBiasPointerFromParams(const GemmStridedBatchedParams<T>* params) {
return nullptr;
}
template <typename T>
const void* GetBiasPointerFromParams(const ScaledGemmParams<T>* params) {
return params->bias_ptr;
}
template <typename T>
hipDataType GetBiasTypeFromParams(const GemmParams<T>* params) {
return HIP_R_32F;
}
template <typename T>
hipDataType GetBiasTypeFromParams(const GemmStridedBatchedParams<T>* params) {
return HIP_R_32F;
}
template <typename T>
hipDataType GetBiasTypeFromParams(const ScaledGemmParams<T>* params) {
return at::cuda::ScalarTypeToCudaDataType(params->bias_dtype);
}
static hipblasOperation_t _hipblasOpFromChar(char op) {
switch (op) {
case 'n':
@ -198,7 +344,48 @@ static size_t GetHipblasltWorkspaceSize() {
return workspace_size;
}
template <typename T, BlasOp ALayout, BlasOp BLayout, typename ParamsT>
template <typename T, cublasStatus_t (*destructor)(T*)>
struct HipBlasLtDeleter {
void operator()(T* x) {
if (x != nullptr) {
TORCH_CUDABLAS_CHECK(destructor(x));
}
}
};
template <typename T, hipblasStatus_t (*destructor)(T*)>
class HipBlasLtDescriptor {
public:
T* descriptor() const {
return descriptor_.get();
}
T* descriptor() {
return descriptor_.get();
}
protected:
std::unique_ptr<T, HipBlasLtDeleter<T, destructor>> descriptor_;
};
class HipBlasLtMatmulDescriptor : public HipBlasLtDescriptor<
hipblasLtMatmulDescOpaque_t,
&hipblasLtMatmulDescDestroy> {
public:
HipBlasLtMatmulDescriptor(
hipblasComputeType_t compute_type,
hipDataType scale_type) {
hipblasLtMatmulDesc_t raw_descriptor = nullptr;
TORCH_HIPBLASLT_CHECK(
hipblasLtMatmulDescCreate(&raw_descriptor, compute_type, scale_type));
descriptor_.reset(raw_descriptor);
}
template <typename T>
inline void setAttribute(hipblasLtMatmulDescAttributes_t attr, const T value) {
TORCH_HIPBLASLT_CHECK(::hipblasLtMatmulDescSetAttribute(descriptor(), attr, &value, sizeof(T)));
}
};
template <typename AT, typename BT, typename CT, BlasOp ALayout, BlasOp BLayout, typename ParamsT>
class HipblasltGemmOp : public Callable<ParamsT> {
public:
HipblasltGemmOp(hipblasLtMatmulAlgo_t algo) : algo_{algo} {}
@ -206,37 +393,38 @@ class HipblasltGemmOp : public Callable<ParamsT> {
TuningStatus Call(const ParamsT* params) override {
hipblasOperation_t transa_outer = MapLayoutToHipBlasLt(ALayout);
hipblasOperation_t transb_outer = MapLayoutToHipBlasLt(BLayout);
auto in_out_datatype = HipBlasDataTypeFor<T>();
auto a_datatype = HipBlasDataTypeFor<AT>();
auto b_datatype = HipBlasDataTypeFor<BT>();
auto in_out_datatype = HipBlasDataTypeFor<CT>();
auto opa = _hipblasOpFromChar(params->transa);
auto opb = _hipblasOpFromChar(params->transb);
TORCH_CHECK(transa_outer == opa && transb_outer == opb, "trans mismatch, shouldn't happen");
float alpha = static_cast<float>(params->alpha);
float beta = static_cast<float>(params->beta);
float alpha = GetAlphaFromParams<CT>(params);
float beta = GetBetaFromParams<CT>(params);
hipblasLtMatrixLayout_t mat_a, mat_b, mat_c;
hipblasLtMatmulDesc_t matmul;
if (opa == HIPBLAS_OP_N) {
TORCH_HIPBLASLT_CHECK(hipblasLtMatrixLayoutCreate(&mat_a, in_out_datatype, params->m, params->k, params->lda));
TORCH_HIPBLASLT_CHECK(hipblasLtMatrixLayoutCreate(&mat_a, a_datatype, params->m, params->k, params->lda));
}
else {
TORCH_HIPBLASLT_CHECK(hipblasLtMatrixLayoutCreate(&mat_a, in_out_datatype, params->k, params->m, params->lda));
TORCH_HIPBLASLT_CHECK(hipblasLtMatrixLayoutCreate(&mat_a, a_datatype, params->k, params->m, params->lda));
}
if (opb == HIPBLAS_OP_N) {
TORCH_HIPBLASLT_CHECK(hipblasLtMatrixLayoutCreate(&mat_b, in_out_datatype, params->k, params->n, params->ldb));
TORCH_HIPBLASLT_CHECK(hipblasLtMatrixLayoutCreate(&mat_b, b_datatype, params->k, params->n, params->ldb));
}
else {
TORCH_HIPBLASLT_CHECK(hipblasLtMatrixLayoutCreate(&mat_b, in_out_datatype, params->n, params->k, params->ldb));
TORCH_HIPBLASLT_CHECK(hipblasLtMatrixLayoutCreate(&mat_b, b_datatype, params->n, params->k, params->ldb));
}
TORCH_HIPBLASLT_CHECK(hipblasLtMatrixLayoutCreate(&mat_c, in_out_datatype, params->m, params->n, params->ldc));
TORCH_HIPBLASLT_CHECK(hipblasLtMatmulDescCreate(&matmul, COMPUTE_TYPE_32, DATA_TYPE_R_32));
int batch = GetBatchFromParams<T>(params);
// specific to batched gemm
int batch = GetBatchFromParams<CT>(params);
if (batch > 1) {
int64_t stride_a = GetStrideAFromParams<T>(params);
int64_t stride_b = GetStrideBFromParams<T>(params);
int64_t stride_c = GetStrideCFromParams<T>(params);
int64_t stride_a = GetStrideAFromParams<CT>(params);
int64_t stride_b = GetStrideBFromParams<CT>(params);
int64_t stride_c = GetStrideCFromParams<CT>(params);
TORCH_HIPBLASLT_CHECK(hipblasLtMatrixLayoutSetAttribute(
mat_a, HIPBLASLT_MATRIX_LAYOUT_BATCH_COUNT, &batch, sizeof(batch)));
TORCH_HIPBLASLT_CHECK(hipblasLtMatrixLayoutSetAttribute(
@ -251,10 +439,27 @@ class HipblasltGemmOp : public Callable<ParamsT> {
mat_c, HIPBLASLT_MATRIX_LAYOUT_STRIDED_BATCH_OFFSET, &stride_c, sizeof(stride_c)));
}
TORCH_HIPBLASLT_CHECK(hipblasLtMatmulDescSetAttribute(
matmul, HIPBLASLT_MATMUL_DESC_TRANSA, &opa, sizeof(int32_t)));
TORCH_HIPBLASLT_CHECK(hipblasLtMatmulDescSetAttribute(
matmul, HIPBLASLT_MATMUL_DESC_TRANSB, &opb, sizeof(int32_t)));
HipBlasLtMatmulDescriptor matmul(COMPUTE_TYPE_32, DATA_TYPE_R_32);
matmul.setAttribute(HIPBLASLT_MATMUL_DESC_TRANSA, opa);
matmul.setAttribute(HIPBLASLT_MATMUL_DESC_TRANSB, opb);
// specific to scaled gemm
const void* mat1_scale_ptr = GetAScalePointerFromParams<CT>(params);
const void* mat2_scale_ptr = GetBScalePointerFromParams<CT>(params);
const void* result_scale_ptr = GetDScalePointerFromParams<CT>(params);
if (mat1_scale_ptr && mat2_scale_ptr && result_scale_ptr) {
matmul.setAttribute(HIPBLASLT_MATMUL_DESC_A_SCALE_POINTER, mat1_scale_ptr);
matmul.setAttribute(HIPBLASLT_MATMUL_DESC_B_SCALE_POINTER, mat2_scale_ptr);
matmul.setAttribute(HIPBLASLT_MATMUL_DESC_D_SCALE_POINTER, result_scale_ptr);
const void* bias_ptr = GetBiasPointerFromParams<CT>(params);
auto bias_datatype = GetBiasTypeFromParams<CT>(params);
if (bias_ptr) {
matmul.setAttribute(HIPBLASLT_MATMUL_DESC_BIAS_POINTER, bias_ptr);
matmul.setAttribute(HIPBLASLT_MATMUL_DESC_EPILOGUE, HIPBLASLT_EPILOGUE_BIAS);
matmul.setAttribute(HIPBLASLT_MATMUL_DESC_BIAS_DATA_TYPE, bias_datatype);
}
}
size_t workspace_size = GetHipblasltWorkspaceSize();
@ -262,7 +467,7 @@ class HipblasltGemmOp : public Callable<ParamsT> {
size_t ret_workspace_size = 0;
auto status = hipblaslt_ext::matmulIsAlgoSupported(op_handle,
matmul,
matmul.descriptor(),
&alpha,
mat_a,
mat_b,
@ -289,7 +494,7 @@ class HipblasltGemmOp : public Callable<ParamsT> {
}
TORCH_HIPBLASLT_CHECK(hipblasLtMatmul(op_handle,
matmul,
matmul.descriptor(),
&alpha,
params->a,
mat_a,
@ -305,7 +510,7 @@ class HipblasltGemmOp : public Callable<ParamsT> {
workspace_size,
at::cuda::getCurrentCUDAStream()));
TORCH_HIPBLASLT_CHECK(hipblasLtMatmulDescDestroy(matmul));
//TORCH_HIPBLASLT_CHECK(hipblasLtMatmulDescDestroy(matmul));
TORCH_HIPBLASLT_CHECK(hipblasLtMatrixLayoutDestroy(mat_a));
TORCH_HIPBLASLT_CHECK(hipblasLtMatrixLayoutDestroy(mat_b));
TORCH_HIPBLASLT_CHECK(hipblasLtMatrixLayoutDestroy(mat_c));
@ -319,11 +524,13 @@ class HipblasltGemmOp : public Callable<ParamsT> {
hipblasLtMatmulAlgo_t algo_;
};
template <typename T, BlasOp ALayout, BlasOp BLayout, typename ParamsT>
template <typename AT, typename BT, typename CT, BlasOp ALayout, BlasOp BLayout, typename ParamsT>
auto GetHipBlasLtTypeStringAndOps() {
hipblasOperation_t transa_outer = MapLayoutToHipBlasLt(ALayout);
hipblasOperation_t transb_outer = MapLayoutToHipBlasLt(BLayout);
auto in_out_datatype = HipBlasDataTypeFor<T>();
auto a_datatype = HipBlasDataTypeFor<AT>();
auto b_datatype = HipBlasDataTypeFor<BT>();
auto in_out_datatype = HipBlasDataTypeFor<CT>();
std::vector<hipblasLtMatmulHeuristicResult_t> heuristic_result;
hipblasLtHandle_t handle;
@ -332,8 +539,8 @@ auto GetHipBlasLtTypeStringAndOps() {
hipblaslt_ext::GemmType::HIPBLASLT_GEMM,
transa_outer,
transb_outer,
in_out_datatype,
in_out_datatype,
a_datatype,
b_datatype,
in_out_datatype,
in_out_datatype,
COMPUTE_TYPE_32,
@ -352,7 +559,7 @@ auto GetHipBlasLtTypeStringAndOps() {
for (int i = 0; i < returned_algo_count; i++) {
auto algo = heuristic_result[i].algo;
int algo_index = GETINDEXFROMALGO(algo);
auto callable = std::make_unique<HipblasltGemmOp<T, ALayout, BLayout, ParamsT>>(algo);
auto callable = std::make_unique<HipblasltGemmOp<AT, BT, CT, ALayout, BLayout, ParamsT>>(algo);
std::string type_string = c10::str(
"Gemm_Hipblaslt_", _charFromhipblasOp(transa_outer), _charFromhipblasOp(transb_outer), "_", algo_index);
ret.emplace_back(type_string, std::move(callable));
@ -363,12 +570,17 @@ auto GetHipBlasLtTypeStringAndOps() {
template <typename T, BlasOp ALayout, BlasOp BLayout>
auto GetHipBlasLtGemmTypeStringAndOps() {
return GetHipBlasLtTypeStringAndOps<T, ALayout, BLayout, GemmParams<T>>();
return GetHipBlasLtTypeStringAndOps<T, T, T, ALayout, BLayout, GemmParams<T>>();
}
template <typename T, BlasOp ALayout, BlasOp BLayout>
auto GetHipBlasLtGemmStridedBatchedTypeStringAndOps() {
return GetHipBlasLtTypeStringAndOps<T, ALayout, BLayout, GemmStridedBatchedParams<T>>();
return GetHipBlasLtTypeStringAndOps<T, T, T, ALayout, BLayout, GemmStridedBatchedParams<T>>();
}
template <typename AT, typename BT, typename CT, BlasOp ALayout, BlasOp BLayout>
auto GetHipBlasLtScaledGemmTypeStringAndOps() {
return GetHipBlasLtTypeStringAndOps<AT, BT, CT, ALayout, BLayout, ScaledGemmParams<CT>>();
}
#undef TORCH_HIPBLASLT_CHECK
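The new `HipBlasLtMatmulDescriptor` wrapper above replaces the manual create/destroy pair with RAII. A standalone sketch of that pattern, with made-up `fake_*` stand-ins for the hipblasLt create/destroy calls:
```
// Hedged, standalone sketch: the handle is owned by a unique_ptr whose deleter
// calls the library's destroy routine, so cleanup runs on every exit path.
#include <memory>

struct fake_desc { int dummy = 0; };                 // stand-in for an opaque handle
fake_desc* fake_create() { return new fake_desc(); } // stand-in for xxxCreate
void fake_destroy(fake_desc* d) { delete d; }        // stand-in for xxxDestroy

struct FakeDeleter {
  void operator()(fake_desc* d) const {
    if (d != nullptr) {
      fake_destroy(d);
    }
  }
};

class FakeDescriptor {
 public:
  FakeDescriptor() : descriptor_(fake_create()) {}
  fake_desc* descriptor() const { return descriptor_.get(); }

 private:
  std::unique_ptr<fake_desc, FakeDeleter> descriptor_;
};

int main() {
  FakeDescriptor desc;         // created here
  (void)desc.descriptor();
  return 0;                    // fake_destroy runs automatically
}
```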

View File

@ -19,6 +19,10 @@
#include <ATen/cuda/tunable/StreamTimer.h>
#include <ATen/cuda/tunable/TunableOp.h>
#include <c10/cuda/CUDACachingAllocator.h>
#include <c10/util/Float8_e4m3fn.h>
#include <c10/util/Float8_e4m3fnuz.h>
#include <c10/util/Float8_e5m2.h>
#include <c10/util/Float8_e5m2fnuz.h>
#include <c10/util/StringUtil.h>
#ifdef USE_ROCM
@ -64,62 +68,112 @@ class DefaultGemmStridedBatchedOp : public Callable<GemmStridedBatchedParams<T>>
};
template <typename T>
bool IsZero(T v) {
class DefaultScaledGemmOp : public Callable<ScaledGemmParams<T>> {
public:
TuningStatus Call(const ScaledGemmParams<T>* params) override {
at::cuda::blas::scaled_gemm(
params->transa,
params->transb,
params->m,
params->n,
params->k,
params->a,
params->a_scale_ptr,
params->lda,
params->a_dtype,
params->b,
params->b_scale_ptr,
params->ldb,
params->b_dtype,
params->bias_ptr,
params->bias_dtype,
params->c,
params->c_scale_ptr,
params->ldc,
params->c_dtype,
params->amax_ptr,
params->use_fast_accum);
return OK;
}
};
template <typename T>
inline bool IsZero(T v) {
return v == 0.0f;
}
template <>
bool IsZero(BFloat16 v) {
inline bool IsZero(BFloat16 v) {
return v.x == 0;
}
template <>
bool IsZero(Half v) {
inline bool IsZero(Half v) {
return float(v) == 0.0f;
}
template <>
bool IsZero(c10::complex<double> v) {
inline bool IsZero(c10::complex<double> v) {
return v == 0.0;
}
template <>
bool IsZero(c10::complex<float> v) {
inline bool IsZero(c10::complex<float> v) {
return v == 0.0f;
}
template <typename T>
std::string TypeName(T v) {
inline std::string TypeName(T v) {
return "unknown";
}
template <>
std::string TypeName(float v) {
inline std::string TypeName(float v) {
return "float";
}
template <>
std::string TypeName(double v) {
inline std::string TypeName(double v) {
return "double";
}
template <>
std::string TypeName(BFloat16 v) {
inline std::string TypeName(BFloat16 v) {
return "BFloat16";
}
template <>
std::string TypeName(Half v) {
inline std::string TypeName(Half v) {
return "Half";
}
template <>
std::string TypeName(c10::complex<double> v) {
inline std::string TypeName(Float8_e4m3fn v) {
return "Float8_e4m3fn";
}
template <>
inline std::string TypeName(Float8_e5m2 v) {
return "Float8_e5m2";
}
template <>
inline std::string TypeName(Float8_e4m3fnuz v) {
return "Float8_e4m3fnuz";
}
template <>
inline std::string TypeName(Float8_e5m2fnuz v) {
return "Float8_e5m2fnuz";
}
template <>
inline std::string TypeName(c10::complex<double> v) {
return "c10::complex<double>";
}
template <>
std::string TypeName(c10::complex<float> v) {
inline std::string TypeName(c10::complex<float> v) {
return "c10::complex<float>";
}
@ -272,6 +326,42 @@ class GemmStridedBatchedTunableOp : public TunableOp<GemmStridedBatchedParams<T>
}
};
template <typename AT, typename BT, typename CT, BlasOp ALayout, BlasOp BLayout>
class ScaledGemmTunableOp : public TunableOp<ScaledGemmParams<CT>, StreamTimer> {
public:
ScaledGemmTunableOp() {
this->RegisterOp(std::string("Default"), std::make_unique<DefaultScaledGemmOp<CT>>());
auto validators = getTuningContext()->GetTuningResultsValidator().GetAllValidators();
#if defined(USE_ROCM) && ROCM_VERSION >= 50700
for (auto&& [name, op] : GetHipBlasLtScaledGemmTypeStringAndOps<AT, BT, CT, ALayout, BLayout>()) {
this->RegisterOp(std::move(name), std::move(op));
}
if (validators.find("HIPBLASLT_VERSION") == validators.end()) {
std::string hipblaslt_version = c10::str(
XSTRINGIFY(HIPBLASLT_VERSION_MAJOR), ".",
XSTRINGIFY(HIPBLASLT_VERSION_MINOR), ".",
XSTRINGIFY(HIPBLASLT_VERSION_PATCH), "-",
XSTRINGIFY(HIPBLASLT_VERSION_TWEAK));
getTuningContext()->GetTuningResultsValidator().RegisterValidator(
"HIPBLASLT_VERSION",
[hipblaslt_version]() { return hipblaslt_version; },
[hipblaslt_version](auto&& k) { return hipblaslt_version == k ? OK : FAIL; });
}
#endif
}
std::string Signature() override {
return c10::str("ScaledGemmTunableOp",
"_", TypeName<AT>(AT{}),
"_", TypeName<BT>(BT{}),
"_", TypeName<CT>(CT{}),
"_", BlasOpToString(ALayout), BlasOpToString(BLayout));
}
};
#undef XSTRINGIFY
#undef STRINGIFY
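For reference, a small standalone sketch of the signature string that `ScaledGemmTunableOp::Signature()` above assembles; the concrete type names are just an example combination, not taken from the PR:
```
// Hedged illustration of the "ScaledGemmTunableOp_<AT>_<BT>_<CT>_<layouts>" format.
#include <iostream>
#include <string>

std::string signature(const std::string& a, const std::string& b,
                      const std::string& c, const std::string& layouts) {
  return "ScaledGemmTunableOp_" + a + "_" + b + "_" + c + "_" + layouts;
}

int main() {
  // e.g. fp8 inputs producing BFloat16, A transposed and B not ("TN")
  std::cout << signature("Float8_e4m3fnuz", "Float8_e4m3fnuz", "BFloat16", "TN")
            << "\n";
  return 0;
}
```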

View File

@ -81,6 +81,11 @@ static Tensor unsafeMakeTensorWrapper(
auto result = at::detail::make_tensor<TensorWrapper>(
key_set, tensor, level, life_handle, is_immutable);
TORCH_INTERNAL_ASSERT(result.key_set().has(DispatchKey::FuncTorchGradWrapper));
if (tensor.unsafeGetTensorImpl()->is_wrapped_number()) {
result.unsafeGetTensorImpl()->set_wrapped_number(true);
}
return result;
}

View File

@ -299,6 +299,12 @@ public:
void StartTrace(const std::string& mode, bool waitUntilCompleted);
void StopTrace();
// Abstractions for GPU trace capturing
bool isCaptureEnabled() const;
bool isCapturing() const;
void startCapture(const std::string& name, MPSStream* stream = nullptr);
void stopCapture(MPSStream* stream = nullptr);
// convenience functions to indicate whether signpost tracing or
// logging are enabled for the SignpostTypes
bool isOperationProfilingEnabled() const {
@ -356,6 +362,9 @@ public:
// a short list that contains copy stats
std::unordered_map<CopyInfo::Kind, std::unique_ptr<CopyStat>> m_copy_stat_list{};
mutable MTLCaptureManager *captureManager = nil;
unsigned captureCount = 0;
void initialize();
void beginProfileExecution(BaseInfo& info, bool cpuExecution = false);
void endProfileExecution(BaseInfo& info, os_signpost_id_t event_signpost_id,

View File

@ -765,6 +765,41 @@ void MPSProfiler::handleIntSignal(int signal) {
struct sigaction MPSProfiler::currentSigint {};
struct sigaction MPSProfiler::previousSigint {};
bool MPSProfiler::isCapturing() const {
return [captureManager isCapturing];
}
bool MPSProfiler::isCaptureEnabled() const {
if (captureManager == nil) {
captureManager = [MTLCaptureManager sharedCaptureManager];
}
static bool isEnabled = [this]() {
return [captureManager supportsDestination:MTLCaptureDestinationGPUTraceDocument];
}();
return isEnabled;
}
void MPSProfiler::startCapture(const std::string& name, MPSStream* stream) {
if (captureManager == nil) {
captureManager = [MTLCaptureManager sharedCaptureManager];
}
NSError* err = nil;
NSString* fname = [NSString stringWithFormat:@"%04d-%s.gputrace", captureCount++, name.c_str()];
MTLCaptureDescriptor* captureDescriptor = [MTLCaptureDescriptor new];
captureDescriptor.captureObject = stream ? (id)stream->commandQueue() : (id)MPSDevice::getInstance()->device();
captureDescriptor.destination = MTLCaptureDestinationGPUTraceDocument;
captureDescriptor.outputURL = [NSURL fileURLWithPath:fname];
auto rc = [captureManager startCaptureWithDescriptor:captureDescriptor error:&err];
TORCH_CHECK(rc, "Failed to start capture of ", [fname UTF8String], " error ", [[err description] UTF8String]);
}
void MPSProfiler::stopCapture(MPSStream* stream) {
if (stream) {
stream->synchronize(SyncType::COMMIT);
}
[captureManager stopCapture];
}
} // namespace Profiler
Profiler::MPSProfiler& getMPSProfiler() {
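A hedged usage sketch of the capture API declared and implemented above; the header path and the `at::mps` namespace are assumptions based on the surrounding diff:
```
// Sketch only: bracketing a region of MPS work with the new GPU trace capture API.
#include <ATen/mps/MPSProfiler.h>

void capture_region() {
  auto& profiler = at::mps::getMPSProfiler();
  if (profiler.isCaptureEnabled()) {
    profiler.startCapture("my_kernel");  // first capture lands in 0000-my_kernel.gputrace
    // ... encode and commit the MPS work to be captured ...
    profiler.stopCapture();
  }
}
```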

View File

@ -22,7 +22,7 @@ MPSStream::MPSStream(Stream stream) : _stream(stream) {
_compilationDescriptor = [MPSGraphCompilationDescriptor new];
// disable commitAndContinue if Signpost tracing is enabled
if (getMPSProfiler().isSignpostTracingEnabled()) {
if (getMPSProfiler().isSignpostTracingEnabled() || getMPSProfiler().isCaptureEnabled()) {
_enableCommitAndContinue = false;
}
_executionDescriptor.enableCommitAndContinue = _enableCommitAndContinue;

View File

@ -317,6 +317,12 @@ Tensor adaptive_avg_pool3d_symint(Tensor const& input, SymIntArrayRef output_siz
// in this case, adaptive pooling is just computing mean over hw
// dimensions, which can be done more efficiently
Tensor out = input.mean({-1, -2, -3}, /* keepdim = */ true);
if (input.suggest_memory_format() == at::MemoryFormat::ChannelsLast3d) {
// assert ndim == 5, since ndim = 4 doesn't give channels_last
const auto n = input.sym_size(0);
const auto c = input.sym_size(1);
out.as_strided__symint({n, c, 1, 1, 1}, {c, 1, c, c, c});
}
return out;
} else {
return _adaptive_avg_pool3d_symint(input, output_size);

View File

@ -8,15 +8,25 @@
namespace at::native {
using adaptive_avg_pooling_fn = void(*)(Tensor& output, const Tensor& input, IntArrayRef output_size);
using adaptive_avg_pooling_backward_fn = void(*)(Tensor& grad_input, const Tensor& grad_output);
DECLARE_DISPATCH(adaptive_avg_pooling_fn, adaptive_avg_pool2d_kernel);
DECLARE_DISPATCH(adaptive_avg_pooling_backward_fn, adaptive_avg_pool2d_backward_kernel);
using adaptive_avg_pooling2d_fn = void(*)(Tensor& output, const Tensor& input, IntArrayRef output_size);
using adaptive_avg_pooling2d_backward_fn = void(*)(Tensor& grad_input, const Tensor& grad_output);
DECLARE_DISPATCH(adaptive_avg_pooling2d_fn, adaptive_avg_pool2d_kernel);
DECLARE_DISPATCH(adaptive_avg_pooling2d_backward_fn, adaptive_avg_pool2d_backward_kernel);
using adaptive_max_pooling_fn = void(*)(const Tensor& output, const Tensor& indices, const Tensor& input, IntArrayRef output_size);
using adaptive_max_pooling_backward_fn = void(*)(const Tensor& grad_input, const Tensor& grad_output, const Tensor& indices);
DECLARE_DISPATCH(adaptive_max_pooling_fn, adaptive_max_pool2d_kernel);
DECLARE_DISPATCH(adaptive_max_pooling_backward_fn, adaptive_max_pool2d_backward_kernel);
using adaptive_max_pooling2d_fn = void(*)(const Tensor& output, const Tensor& indices, const Tensor& input, IntArrayRef output_size);
using adaptive_max_pooling2d_backward_fn = void(*)(const Tensor& grad_input, const Tensor& grad_output, const Tensor& indices);
DECLARE_DISPATCH(adaptive_max_pooling2d_fn, adaptive_max_pool2d_kernel);
DECLARE_DISPATCH(adaptive_max_pooling2d_backward_fn, adaptive_max_pool2d_backward_kernel);
using adaptive_avg_pooling3d_fn = void(*)(Tensor& output, const Tensor& input, IntArrayRef output_size);
using adaptive_avg_pooling3d_backward_fn = void(*)(Tensor& grad_input, const Tensor& grad_output);
DECLARE_DISPATCH(adaptive_avg_pooling3d_fn, adaptive_avg_pool3d_kernel);
DECLARE_DISPATCH(adaptive_avg_pooling3d_backward_fn, adaptive_avg_pool3d_backward_kernel);
using adaptive_max_pooling3d_fn = void(*)(const Tensor& output, const Tensor& indices, const Tensor& input, IntArrayRef output_size);
using adaptive_max_pooling3d_backward_fn = void(*)(const Tensor& grad_input, const Tensor& grad_output, const Tensor& indices);
DECLARE_DISPATCH(adaptive_max_pooling3d_fn, adaptive_max_pool3d_kernel);
DECLARE_DISPATCH(adaptive_max_pooling3d_backward_fn, adaptive_max_pool3d_backward_kernel);
static inline int64_t start_index(int64_t a, int64_t b, int64_t c) {
return (a / b) * c + ((a % b) * c) / b;
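The 3d kernels registered below lean on the same `start_index`/`end_index` bin arithmetic shown above. A standalone sketch of how those bins partition an input extent; `end_index` is reproduced here on the assumption that it is unchanged from the existing header:
```
// Hedged sketch: adaptive pooling splits `input` positions into `output`
// nearly even, possibly overlapping bins.
#include <cstdint>
#include <cstdio>

static int64_t start_index(int64_t a, int64_t b, int64_t c) {
  return (a / b) * c + ((a % b) * c) / b;
}
static int64_t end_index(int64_t a, int64_t b, int64_t c) {
  return 1 + ((a + 1) * c - 1) / b;
}

int main() {
  const int64_t input = 7, output = 3;
  for (int64_t o = 0; o < output; ++o) {
    // prints bin 0: [0, 3), bin 1: [2, 5), bin 2: [4, 7)
    std::printf("bin %lld: [%lld, %lld)\n", (long long)o,
                (long long)start_index(o, output, input),
                (long long)end_index(o, output, input));
  }
  return 0;
}
```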

View File

@ -10,8 +10,19 @@
#include <cstdlib>
#include <cstring>
#ifdef HAVE_ZVECTOR_CPU_DEFINITION
#include <sys/auxv.h>
#endif
namespace at::native {
#ifdef HAVE_ZVECTOR_CPU_DEFINITION
static inline bool cpu_has_vxe()
{
return (getauxval(AT_HWCAP) & HWCAP_S390_VXE);
}
#endif
static CPUCapability compute_cpu_capability() {
auto envar = std::getenv("ATEN_CPU_CAPABILITY");
if (envar) {
@ -60,10 +71,16 @@ static CPUCapability compute_cpu_capability() {
#endif
}
#endif
#ifdef HAVE_ZVECTOR_CPU_DEFINITION
// vxe is needed for fp32 vector instructions
if (cpu_has_vxe()) {
return CPUCapability::ZVECTOR;
}
#endif
#ifdef HAVE_VSX_CPU_DEFINITION
return CPUCapability::VSX;
#elif HAVE_ZVECTOR_CPU_DEFINITION
return CPUCapability::ZVECTOR;
#else
return CPUCapability::DEFAULT;
#endif
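A standalone sketch of the s390x check added above; it only builds on Linux-on-Z, where `HWCAP_S390_VXE` is defined:
```
// Hedged sketch: query the auxiliary vector and report whether the
// vector-enhancements facility (VXE), needed for fp32 vector instructions,
// is present on this machine.
#include <cstdio>
#include <sys/auxv.h>

int main() {
  const bool has_vxe = (getauxval(AT_HWCAP) & HWCAP_S390_VXE) != 0;
  std::printf("VXE available: %s\n", has_vxe ? "yes" : "no");
  return 0;
}
```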

View File

@ -2839,10 +2839,16 @@ TORCH_IMPL_FUNC(linalg_vector_norm_out)(const Tensor& self, const Scalar& scalar
}
if (is_reduce_over_1D_vector) {
if (ord != 0.0) {
keepdim ? at::abs_outf(self, const_cast<Tensor&>(result)) : at::abs_outf(self.squeeze(reduce_dim), const_cast<Tensor&>(result));
Tensor self_;
if (opt_dtype.has_value()) {
self_ = self.to(*opt_dtype);
} else {
keepdim ? at::ne_outf(self, 0, const_cast<Tensor&>(result)) : at::ne_outf(self.squeeze(reduce_dim), 0, const_cast<Tensor&>(result));
self_ = self;
}
if (ord != 0.0) {
keepdim ? at::abs_outf(self_, const_cast<Tensor&>(result)) : at::abs_outf(self_.squeeze(reduce_dim), const_cast<Tensor&>(result));
} else {
keepdim ? at::ne_outf(self_, 0, const_cast<Tensor&>(result)) : at::ne_outf(self_.squeeze(reduce_dim), 0, const_cast<Tensor&>(result));
}
return;
}

View File

@ -26,6 +26,19 @@ using avg_pool2d_backward_fn = void(*)(const Tensor& output, const Tensor& input
DECLARE_DISPATCH(avg_pool2d_fn, avg_pool2d_kernel);
DECLARE_DISPATCH(avg_pool2d_backward_fn, avg_pool2d_backward_kernel);
// average pooling has the same signature for forward and backward
using avg_pool3d_fn = void(*)(const Tensor& output, const Tensor& input,
int64_t kW, int64_t kH, int64_t kD, int64_t dW, int64_t dH, int64_t dD,
int64_t padW, int64_t padH, int64_t padD, bool count_include_pad,
c10::optional<int64_t> divisor_override);
using avg_pool3d_backward_fn = void(*)(const Tensor& output, const Tensor& input,
int kW, int kH, int kD, int dW, int dH, int dD,
int padW, int padH, int padD, bool count_include_pad,
c10::optional<int64_t> divisor_override);
DECLARE_DISPATCH(avg_pool3d_fn, avg_pool3d_kernel);
DECLARE_DISPATCH(avg_pool3d_backward_fn, avg_pool3d_backward_kernel);
using max_pool3d_fn = void(*)(Tensor& output, Tensor& indices, const Tensor& input,
int kW, int kH, int kD, int dW, int dH, int dD, int pW, int pH, int pD, int dilationW, int dilationH, int dilationD);
using max_pool3d_backward_fn = void(*)(Tensor& grad_input, const Tensor& grad_output, const Tensor& indices);

View File

@ -254,13 +254,50 @@ Tensor _to_copy(
// TODO: Use the dispatcher for this.
// Currently there are unenumerated extensibility issues preventing this.
if (at::sparse_csr::is_sparse_compressed(self)) {
if (self.layout() == kSparse) {
TORCH_CHECK(
memory_format == MemoryFormat::Preserve,
"to(options): COO only supports memory format Preserve, but got ", memory_format,
" instead.");
if (options.device().is_meta()) {
return zeros_like(self, options);
}
auto indices = self._indices();
const auto new_indices = at::native::to(
indices,
indices.scalar_type(),
c10::kStrided,
device,
pin_memory,
non_blocking,
true, // force copy since we are in _to_copy
memory_format);
const auto new_values = at::native::to(
self._values(),
dtype,
c10::kStrided,
device,
pin_memory,
non_blocking,
true, // force copy since we are in _to_copy
memory_format);
return at::_sparse_coo_tensor_unsafe(
new_indices,
new_values,
self.sizes(),
options, self.is_coalesced());
} else if (at::sparse_csr::is_sparse_compressed(self)) {
TORCH_CHECK(
memory_format == MemoryFormat::Preserve,
"to(options): ", at::sparse_csr::layoutToString(self.layout()),
" only supports memory format Preserve, but got ", memory_format,
" instead.");
if (options.device().is_meta()) {
return zeros_like(self, options);
}
auto [compressed_indices, plain_indices] = at::sparse_csr::getCompressedPlainIndices(self);
const auto new_values = at::native::to(

View File

@ -421,9 +421,19 @@ Tensor& set_storage_meta__symint(Tensor& result, Storage storage, c10::SymInt st
// it. TODO: Actually this might not quite be correct if we use special
// pointers to track whether or not fake cuda tensors are pinned or not
const auto itemsize = result.dtype().itemsize();
c10::SymInt size_bytes = at::detail::computeStorageNbytes(
c10::SymInt new_size_bytes = at::detail::computeStorageNbytes(
size, stride, itemsize, std::move(storage_offset));
storage.set_nbytes(std::move(size_bytes));
// TODO: When there are unbacked SymInts, we unconditionally skip the
// setter. This is technically wrong, but we cannot conveniently test
// the real condition in many cases, because a lot of people are using
// set_ just to swizzle metadata on a tensor, they didn't actually want
// to see if they need to resize the storage.
//
// The old behavior was to unconditionally set_nbytes, but I think not
// setting it is more safe.
if (new_size_bytes.has_hint() && storage.sym_nbytes().has_hint() && TORCH_GUARD_SIZE_OBLIVIOUS(new_size_bytes.sym_gt(storage.sym_nbytes()))) {
storage.set_nbytes(std::move(new_size_bytes));
}
}
return result;
}
@ -4072,11 +4082,13 @@ void unbind_copy_int_out(const at::Tensor & self, int64_t dim, at::TensorList o
}
}
int64_t sparse_dim_strided(const at::Tensor& self) {
int64_t sparse_dim_default(const Tensor& self) {
TORCH_CHECK(self.layout() == kStrided, "sparse_dim expected sparse or strided tensor layout but got ", self.layout());
return 0;
}
int64_t dense_dim_strided(const at::Tensor& self) {
int64_t dense_dim_default(const Tensor& self) {
TORCH_CHECK(self.layout() == kStrided, "dense_dim expected sparse or strided tensor layout but got ", self.layout());
return self.dim();
}

View File

@ -15,7 +15,7 @@ namespace at::native {
namespace {
template <typename scalar_t, typename accscalar_t>
void cpu_adaptive_avg_pool(
void cpu_adaptive_avg_pool2d(
Tensor& output_,
const Tensor& input_,
IntArrayRef output_size) {
@ -69,7 +69,7 @@ void cpu_adaptive_avg_pool(
template <typename scalar_t>
typename std::enable_if_t<std::is_same_v<scalar_t, at::opmath_type<scalar_t>>, void>
cpu_adaptive_avg_pool_channels_last(
cpu_adaptive_avg_pool2d_channels_last(
Tensor& output_,
const Tensor& input_,
IntArrayRef output_size) {
@ -156,7 +156,7 @@ cpu_adaptive_avg_pool_channels_last(
template <typename scalar_t>
typename std::enable_if_t<!std::is_same_v<scalar_t, at::opmath_type<scalar_t>>, void>
cpu_adaptive_avg_pool_channels_last(
cpu_adaptive_avg_pool2d_channels_last(
Tensor& output_,
const Tensor& input_,
IntArrayRef output_size) {
@ -255,7 +255,7 @@ cpu_adaptive_avg_pool_channels_last(
}
template <typename scalar_t>
void cpu_adaptive_avg_pool_backward(
void cpu_adaptive_avg_pool2d_backward(
Tensor& grad_input_,
const Tensor& grad_output_) {
auto grad_output = grad_output_.contiguous();
@ -305,7 +305,7 @@ void cpu_adaptive_avg_pool_backward(
}
template <typename scalar_t>
void cpu_adaptive_avg_pool_backward_channels_last(
void cpu_adaptive_avg_pool2d_backward_channels_last(
Tensor& grad_input_,
const Tensor& grad_output_) {
auto memory_format = at::MemoryFormat::ChannelsLast;
@ -373,13 +373,13 @@ void adaptive_avg_pool2d_kernel_impl(
case at::MemoryFormat::Contiguous: {
AT_DISPATCH_FLOATING_TYPES_AND2(ScalarType::BFloat16, ScalarType::Half, input.scalar_type(), "adaptive_avg_pool2d", [&] {
using param_t = at::opmath_type<scalar_t>;
cpu_adaptive_avg_pool<scalar_t, /*accscalar_t*/param_t>(output, input, output_size);
cpu_adaptive_avg_pool2d<scalar_t, /*accscalar_t*/param_t>(output, input, output_size);
});
break;
}
case at::MemoryFormat::ChannelsLast: {
AT_DISPATCH_FLOATING_TYPES_AND2(ScalarType::BFloat16, ScalarType::Half, input.scalar_type(), "adaptive_avg_pool2d_channels_last", [&]{
cpu_adaptive_avg_pool_channels_last<scalar_t>(output, input, output_size);
cpu_adaptive_avg_pool2d_channels_last<scalar_t>(output, input, output_size);
});
break;
}
@ -394,13 +394,458 @@ void adapative_avg_pool2d_backward_kernel_impl(
switch (grad_output.suggest_memory_format()) {
case at::MemoryFormat::Contiguous: {
AT_DISPATCH_FLOATING_TYPES_AND2(ScalarType::BFloat16, ScalarType::Half, grad_output.scalar_type(), "adaptive_avg_pool2d_backward", [&] {
cpu_adaptive_avg_pool_backward<scalar_t>(grad_input, grad_output);
cpu_adaptive_avg_pool2d_backward<scalar_t>(grad_input, grad_output);
});
break;
}
case at::MemoryFormat::ChannelsLast: {
AT_DISPATCH_FLOATING_TYPES_AND2(ScalarType::BFloat16, ScalarType::Half, grad_output.scalar_type(), "adaptive_avg_pool2d_backward_channels_last", [&]{
cpu_adaptive_avg_pool_backward_channels_last<scalar_t>(grad_input, grad_output);
cpu_adaptive_avg_pool2d_backward_channels_last<scalar_t>(grad_input, grad_output);
});
break;
}
default:
TORCH_CHECK(false, "Unsupported memory format. Supports only ChannelsLast, Contiguous");
}
}
template <typename scalar_t, typename accscalar_t>
void cpu_adaptive_avg_pool3d(
Tensor& output_,
const Tensor& input_,
IntArrayRef output_size) {
auto input = input_.contiguous();
auto output = output_.contiguous();
auto input_data = input.data_ptr<scalar_t>();
auto output_data = output.data_ptr<scalar_t>();
int64_t ndim = input.ndimension();
// treat batch size and channels as one dimension
int64_t channels = ndim == 4 ? input.size(0) : input.size(0) * input.size(1);
int64_t input_depth = input.size(-3);
int64_t input_height = input.size(-2);
int64_t input_width = input.size(-1);
int64_t output_depth = output_size[0];
int64_t output_height = output_size[1];
int64_t output_width = output_size[2];
// parallel on dim of N, C
at::parallel_for(0, channels, 0, [&](int64_t begin, int64_t end) {
for (const auto c : c10::irange(begin, end)) {
scalar_t* input_ptr = input_data + c * input_depth * input_height * input_width;
scalar_t* output_ptr = output_data + c * output_depth * output_height * output_width;
for (const auto od : c10::irange(output_depth)) {
int64_t id0 = start_index(od, output_depth, input_depth);
int64_t id1 = end_index(od, output_depth, input_depth);
int64_t kd = id1 - id0;
for (const auto oh : c10::irange(output_height)) {
int64_t ih0 = start_index(oh, output_height, input_height);
int64_t ih1 = end_index(oh, output_height, input_height);
int64_t kh = ih1 - ih0;
for (const auto ow : c10::irange(output_width)) {
int64_t iw0 = start_index(ow, output_width, input_width);
int64_t iw1 = end_index(ow, output_width, input_width);
int64_t kw = iw1 - iw0;
// compute local average
accscalar_t sum = 0;
for (const auto id : c10::irange(id0, id1)) {
for (const auto ih : c10::irange(ih0, ih1)) {
for (const auto iw : c10::irange(iw0, iw1)) {
sum += accscalar_t(input_ptr[id * input_height * input_width + ih * input_width + iw]);
}
}
}
output_ptr[od * output_height * output_width + oh * output_width + ow] = scalar_t(sum / kd / kh / kw);
}
}
}
}
});
if (!output_.is_contiguous()) {
output_.copy_(output);
}
}
template <typename scalar_t>
typename std::enable_if_t<std::is_same_v<scalar_t, at::opmath_type<scalar_t>>, void>
cpu_adaptive_avg_pool3d_channels_last(
Tensor& output_,
const Tensor& input_,
IntArrayRef output_size) {
auto memory_format = at::MemoryFormat::ChannelsLast3d;
auto input = input_.contiguous(memory_format);
auto output = output_.contiguous(memory_format);
auto input_data = input.data_ptr<scalar_t>();
auto output_data = output.data_ptr<scalar_t>();
int64_t nbatch = input.size(0);
int64_t channels = input.size(1);
int64_t input_depth = input.size(2);
int64_t input_height = input.size(3);
int64_t input_width = input.size(4);
int64_t output_depth = output_size[0];
int64_t output_height = output_size[1];
int64_t output_width = output_size[2];
using Vec = vec::Vectorized<scalar_t>;
// parallel on dim N, H, W
at::parallel_for(0, nbatch * output_depth * output_height * output_width, 0, [&](int64_t begin, int64_t end) {
int64_t n = 0;
int64_t od = 0;
int64_t oh = 0;
int64_t ow = 0;
data_index_init(begin, n, nbatch, od, output_depth, oh, output_height, ow, output_width);
for (const auto i : c10::irange(begin, end)) {
int64_t id0 = start_index(od, output_depth, input_depth);
int64_t id1 = end_index(od, output_depth, input_depth);
int64_t kd = id1 - id0;
int64_t ih0 = start_index(oh, output_height, input_height);
int64_t ih1 = end_index(oh, output_height, input_height);
int64_t kh = ih1 - ih0;
int64_t iw0 = start_index(ow, output_width, input_width);
int64_t iw1 = end_index(ow, output_width, input_width);
int64_t kw = iw1 - iw0;
scalar_t* out = output_data + i * channels;
int64_t size = channels;
// Note: For ordinary usage scenarios, each out lane should
// fit in L1 cache; otherwise consider block dim C.
// Pass I: zero the out lane
int64_t d1 = 0;
for (; d1 < size - (size % Vec::size()); d1 += Vec::size()) {
Vec out_vec = Vec(scalar_t(0));
out_vec.store(out + d1);
}
for (; d1 < size; d1++) {
out[d1] = scalar_t(0);
}
// Pass II: compute local sum
for (const auto id : c10::irange(id0, id1)) {
for (const auto ih : c10::irange(ih0, ih1)) {
for (const auto iw : c10::irange(iw0, iw1)) {
scalar_t* in = input_data + n * input_depth * input_height * input_width * channels +
id * input_height * input_width * channels + ih * input_width * channels + iw * channels;
int64_t d2 = 0;
for (; d2 < size - (size % Vec::size()); d2 += Vec::size()) {
Vec out_vec = Vec::loadu(out + d2) + Vec::loadu(in + d2);
out_vec.store(out + d2);
}
for (; d2 < size; d2++) {
out[d2] += in[d2];
}
}
}
}
// Pass III: compute local average
int64_t d3 = 0;
for (; d3 < size - (size % Vec::size()); d3 += Vec::size()) {
Vec out_vec = Vec::loadu(out + d3) / Vec(scalar_t(kd * kh * kw));
out_vec.store(out + d3);
}
for (; d3 < size; d3++) {
out[d3] = out[d3] / kd / kh / kw;
}
// move on to next output index
data_index_step(n, nbatch, od, output_depth, oh, output_height, ow, output_width);
}
});
if (!output_.is_contiguous(memory_format)) {
output_.copy_(output);
}
}
template <typename scalar_t>
typename std::enable_if_t<!std::is_same_v<scalar_t, at::opmath_type<scalar_t>>, void>
cpu_adaptive_avg_pool3d_channels_last(
Tensor& output_,
const Tensor& input_,
IntArrayRef output_size) {
auto memory_format = at::MemoryFormat::ChannelsLast3d;
auto input = input_.contiguous(memory_format);
auto output = output_.contiguous(memory_format);
auto input_data = input.data_ptr<scalar_t>();
auto output_data = output.data_ptr<scalar_t>();
int64_t nbatch = input.size(0);
int64_t channels = input.size(1);
int64_t input_depth = input.size(2);
int64_t input_height = input.size(3);
int64_t input_width = input.size(4);
int64_t output_depth = output_size[0];
int64_t output_height = output_size[1];
int64_t output_width = output_size[2];
using bVec = vec::Vectorized<scalar_t>;
using fVec = vec::Vectorized<float>;
// parallel on dim N,D, H, W
at::parallel_for(0, nbatch * output_depth * output_height * output_width, 0, [&](int64_t begin, int64_t end) {
int64_t n = 0;
int64_t oh = 0;
int64_t ow = 0;
int64_t od = 0;
data_index_init(begin, n, nbatch, od, output_depth, oh, output_height, ow, output_width);
// temp buffer for sum, use float as accumulation type
// can't reuse output buffer to store sum since it is BFloat16/Half
auto sum_arr = std::make_unique<float []>(channels);
float* sum = sum_arr.get();
for (const auto i : c10::irange(begin, end)) {
int64_t id0 = start_index(od, output_depth, input_depth);
int64_t id1 = end_index(od, output_depth, input_depth);
int64_t kd = id1 - id0;
int64_t ih0 = start_index(oh, output_height, input_height);
int64_t ih1 = end_index(oh, output_height, input_height);
int64_t kh = ih1 - ih0;
int64_t iw0 = start_index(ow, output_width, input_width);
int64_t iw1 = end_index(ow, output_width, input_width);
int64_t kw = iw1 - iw0;
scalar_t* out = output_data + i * channels;
int64_t size = channels;
// Pass I: zero the out lane
int64_t d1 = 0;
for (; d1 < size - (size % fVec::size()); d1 += fVec::size()) {
fVec sum_fvec = fVec(float(0));
sum_fvec.store(sum + d1);
}
for (; d1 < size; d1++) {
sum[d1] = float(0);
}
// Pass II: compute local sum
for (const auto id : c10::irange(id0, id1)) {
for (const auto ih : c10::irange(ih0, ih1)) {
for (const auto iw : c10::irange(iw0, iw1)) {
scalar_t* in = input_data + n * input_depth * input_height * input_width * channels +
id * input_height * input_width * channels +
ih * input_width * channels + iw * channels;
int64_t d2 = 0;
for (; d2 < size - (size % bVec::size()); d2 += bVec::size()) {
bVec data_bvec = bVec::loadu(in + d2);
fVec data_fvec0, data_fvec1;
std::tie(data_fvec0, data_fvec1) = convert_to_float<scalar_t>(data_bvec);
fVec sum_fvec0 = fVec::loadu(sum + d2) + data_fvec0;
fVec sum_fvec1 = fVec::loadu(sum + d2 + fVec::size()) + data_fvec1;
sum_fvec0.store(sum + d2);
sum_fvec1.store(sum + d2 + fVec::size());
}
for (; d2 < size; d2++) {
sum[d2] += float(in[d2]);
}
}
}
}
// Pass III: compute local average
int64_t d3 = 0;
for (; d3 < size - (size % bVec::size()); d3 += bVec::size()) {
fVec out_fvec0 = fVec::loadu(sum + d3) / fVec(float(kd * kh * kw));
fVec out_fvec1 = fVec::loadu(sum + d3 + fVec::size()) / fVec(float(kd * kh * kw));
bVec out_bvec = convert_from_float<scalar_t>(out_fvec0, out_fvec1);
out_bvec.store(out + d3);
}
for (; d3 < size; d3++) {
out[d3] = scalar_t(sum[d3] / kd / kh / kw);
}
// move on to next output index
data_index_step(n, nbatch, od, output_depth, oh, output_height, ow, output_width);
}
});
if (!output_.is_contiguous(memory_format)) {
output_.copy_(output);
}
}
template <typename scalar_t>
void cpu_adaptive_avg_pool3d_backward(
Tensor& grad_input_,
const Tensor& grad_output_) {
auto grad_output = grad_output_.contiguous();
auto grad_input = grad_input_.contiguous();
auto grad_output_data = grad_output.data_ptr<scalar_t>();
auto grad_input_data = grad_input.mutable_data_ptr<scalar_t>();
int64_t ndim = grad_output.ndimension();
// treat batch size and channels as one dimension
int64_t channels = ndim == 4 ? grad_output.size(0) : grad_output.size(0) * grad_output.size(1);
int64_t input_depth = grad_input.size(-3);
int64_t input_height = grad_input.size(-2);
int64_t input_width = grad_input.size(-1);
int64_t output_depth = grad_output.size(-3);
int64_t output_height = grad_output.size(-2);
int64_t output_width = grad_output.size(-1);
// parallel on dim of N, C
at::parallel_for(0, channels, 0, [&](int64_t begin, int64_t end) {
for (const auto c : c10::irange(begin, end)) {
scalar_t* grad_input_ptr = grad_input_data + c * input_depth * input_height * input_width;
scalar_t* grad_output_ptr = grad_output_data + c * output_depth * output_height * output_width;
for (const auto od : c10::irange(output_depth)) {
int64_t id0 = start_index(od, output_depth, input_depth);
int64_t id1 = end_index(od, output_depth, input_depth);
int64_t kd = id1 - id0;
for (const auto oh : c10::irange(output_height)) {
int64_t ih0 = start_index(oh, output_height, input_height);
int64_t ih1 = end_index(oh, output_height, input_height);
int64_t kh = ih1 - ih0;
for (const auto ow : c10::irange(output_width)) {
int64_t iw0 = start_index(ow, output_width, input_width);
int64_t iw1 = end_index(ow, output_width, input_width);
int64_t kw = iw1 - iw0;
scalar_t grad_delta = grad_output_ptr[od * output_width * output_height + oh * output_width + ow] / kd / kh / kw;
for (const auto id : c10::irange(id0, id1)) {
for (const auto ih : c10::irange(ih0, ih1)) {
for (const auto iw : c10::irange(iw0, iw1)) {
grad_input_ptr[id * input_height * input_width + ih * input_width + iw] += grad_delta;
}
}
}
}
}
}
}
});
if (!grad_input_.is_contiguous()) {
grad_input_.copy_(grad_input);
}
}
template <typename scalar_t>
void cpu_adaptive_avg_pool3d_backward_channels_last(
Tensor& grad_input_,
const Tensor& grad_output_) {
auto memory_format = at::MemoryFormat::ChannelsLast3d;
auto grad_input = grad_input_.contiguous(memory_format);
auto grad_output = grad_output_.contiguous(memory_format);
auto grad_input_data = grad_input.mutable_data_ptr<scalar_t>();
auto grad_output_data = grad_output.data_ptr<scalar_t>();
int64_t nbatch = grad_input.size(0);
int64_t channels = grad_input.size(1);
int64_t input_depth = grad_input.size(2);
int64_t input_height = grad_input.size(3);
int64_t input_width = grad_input.size(4);
int64_t output_depth = grad_output.size(2);
int64_t output_height = grad_output.size(3);
int64_t output_width = grad_output.size(4);
using Vec = vec::Vectorized<scalar_t>;
// parallel on dim N
at::parallel_for(0, nbatch, 0, [&](int64_t begin, int64_t end) {
for (const auto n : c10::irange(begin, end)) {
scalar_t* grad_input_ptr = grad_input_data + n * input_depth * input_height * input_width * channels;
scalar_t* grad_output_ptr = grad_output_data + n * output_depth * output_height * output_width * channels;
for (const auto od : c10::irange(output_depth)) {
int64_t id0 = start_index(od, output_depth, input_depth);
int64_t id1 = end_index(od, output_depth, input_depth);
int64_t kd = id1 - id0;
for (const auto oh : c10::irange(output_height)) {
int64_t ih0 = start_index(oh, output_height, input_height);
int64_t ih1 = end_index(oh, output_height, input_height);
int64_t kh = ih1 - ih0;
for (const auto ow : c10::irange(output_width)) {
int64_t iw0 = start_index(ow, output_width, input_width);
int64_t iw1 = end_index(ow, output_width, input_width);
int64_t kw = iw1 - iw0;
scalar_t* gout = grad_output_ptr + od * output_height * output_width * channels + oh * output_width * channels + ow * channels;
int64_t size = channels;
for (const auto id : c10::irange(id0, id1)) {
for (const auto ih : c10::irange(ih0, ih1)) {
for (const auto iw : c10::irange(iw0, iw1)) {
scalar_t* gin = grad_input_ptr + id * input_width * input_height * channels + ih * input_width * channels + iw * channels;
int64_t d = 0;
for (; d < size - (size % Vec::size()); d += Vec::size()) {
Vec gin_vec = Vec::loadu(gin + d) + Vec::loadu(gout + d) / Vec(scalar_t(kd * kh * kw));
gin_vec.store(gin + d);
}
for (; d < size; d++) {
gin[d] += gout[d] / kd / kh / kw;
}
}
}
}
}
}
}
}
});
if (!grad_input_.is_contiguous(memory_format)) {
grad_input_.copy_(grad_input);
}
}
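The channels-last variants differ from the contiguous ones only in the pointer arithmetic: channels become the innermost (stride-1) dimension, so a whole channel lane can be processed with one vector op. A small sketch of the two offset formulas assumed by the kernels above:
```
#include <cstdint>

// Element offset in a contiguous (NCDHW) 5-D tensor.
constexpr int64_t offset_ncdhw(int64_t n, int64_t c, int64_t d, int64_t h, int64_t w,
                               int64_t C, int64_t D, int64_t H, int64_t W) {
  return (((n * C + c) * D + d) * H + h) * W + w;
}

// Element offset in a channels-last-3d (NDHWC) tensor: channels are innermost.
constexpr int64_t offset_ndhwc(int64_t n, int64_t c, int64_t d, int64_t h, int64_t w,
                               int64_t C, int64_t D, int64_t H, int64_t W) {
  return (((n * D + d) * H + h) * W + w) * C + c;
}

// Adjacent channels are 1 apart in NDHWC but D*H*W apart in NCDHW.
static_assert(offset_ndhwc(0, 1, 2, 3, 4, 8, 5, 6, 7) -
              offset_ndhwc(0, 0, 2, 3, 4, 8, 5, 6, 7) == 1, "");
static_assert(offset_ncdhw(0, 1, 2, 3, 4, 8, 5, 6, 7) -
              offset_ncdhw(0, 0, 2, 3, 4, 8, 5, 6, 7) == 5 * 6 * 7, "");

int main() { return 0; }
```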
void adaptive_avg_pool3d_kernel_impl(
Tensor& output,
const Tensor& input,
IntArrayRef output_size) {
switch (input.suggest_memory_format()) {
case at::MemoryFormat::Contiguous: {
AT_DISPATCH_FLOATING_TYPES_AND2(ScalarType::BFloat16, ScalarType::Half, input.scalar_type(), "adaptive_avg_pool3d", [&] {
using param_t = at::opmath_type<scalar_t>;
cpu_adaptive_avg_pool3d<scalar_t, /*accscalar_t*/param_t>(output, input, output_size);
});
break;
}
case at::MemoryFormat::ChannelsLast3d: {
AT_DISPATCH_FLOATING_TYPES_AND2(ScalarType::BFloat16, ScalarType::Half, input.scalar_type(), "adaptive_avg_pool3d_channels_last", [&]{
cpu_adaptive_avg_pool3d_channels_last<scalar_t>(output, input, output_size);
});
break;
}
default:
TORCH_CHECK(false, "Unsupported memory format. Supports only ChannelsLast, Contiguous");
}
}
void adapative_avg_pool3d_backward_kernel_impl(
Tensor& grad_input,
const Tensor& grad_output) {
switch (grad_output.suggest_memory_format()) {
case at::MemoryFormat::Contiguous: {
AT_DISPATCH_FLOATING_TYPES_AND2(ScalarType::BFloat16, ScalarType::Half, grad_output.scalar_type(), "adaptive_avg_pool3d_backward", [&] {
cpu_adaptive_avg_pool3d_backward<scalar_t>(grad_input, grad_output);
});
break;
}
case at::MemoryFormat::ChannelsLast3d: {
AT_DISPATCH_FLOATING_TYPES_AND2(ScalarType::BFloat16, ScalarType::Half, grad_output.scalar_type(), "adaptive_avg_pool3d_backward_channels_last", [&]{
cpu_adaptive_avg_pool3d_backward_channels_last<scalar_t>(grad_input, grad_output);
});
break;
}
@ -413,5 +858,7 @@ void adapative_avg_pool2d_backward_kernel_impl(
REGISTER_DISPATCH(adaptive_avg_pool2d_kernel, &adaptive_avg_pool2d_kernel_impl);
REGISTER_DISPATCH(adaptive_avg_pool2d_backward_kernel, &adapative_avg_pool2d_backward_kernel_impl);
REGISTER_DISPATCH(adaptive_avg_pool3d_kernel, &adaptive_avg_pool3d_kernel_impl);
REGISTER_DISPATCH(adaptive_avg_pool3d_backward_kernel, &adapative_avg_pool3d_backward_kernel_impl);
} // at::native
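A hypothetical libtorch-side driver (not part of this diff) showing how `suggest_memory_format()` selects between the two code paths registered above; the two layouts should agree numerically:
```
#include <torch/torch.h>
#include <iostream>

int main() {
  auto x = torch::randn({2, 3, 8, 9, 10});                     // NCDHW, contiguous
  auto x_cl = x.contiguous(at::MemoryFormat::ChannelsLast3d);  // NDHWC strides

  auto y    = at::adaptive_avg_pool3d(x,    {4, 3, 5});   // Contiguous kernel
  auto y_cl = at::adaptive_avg_pool3d(x_cl, {4, 3, 5});   // ChannelsLast3d kernel

  std::cout << (y - y_cl).abs().max().item<float>() << std::endl;  // ~0
  return 0;
}
```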

View File

@ -15,7 +15,7 @@ namespace at::native {
namespace {
template <typename scalar_t, typename accscalar_t>
void cpu_adaptive_max_pool(
void cpu_adaptive_max_pool2d(
const Tensor& output_,
const Tensor& indices_,
const Tensor& input_,
@ -83,13 +83,13 @@ void cpu_adaptive_max_pool(
template <typename scalar_t>
typename std::enable_if_t<std::is_same_v<scalar_t, at::opmath_type<scalar_t>>, void>
cpu_adaptive_max_pool_channels_last(
cpu_adaptive_max_pool2d_channels_last(
const Tensor& output_,
const Tensor& indices_,
const Tensor& input_,
IntArrayRef output_size) {
TORCH_CHECK(input_.ndimension() == 4,
"adaptive max pooling with channels last format supports tensors with 4 dims");
"2d adaptive max pooling with channels last format supports tensors with 4 dims");
auto memory_format = at::MemoryFormat::ChannelsLast;
auto input = input_.contiguous(memory_format);
auto output = output_.contiguous(memory_format);
@ -200,13 +200,13 @@ cpu_adaptive_max_pool_channels_last(
template <typename scalar_t>
typename std::enable_if_t<!std::is_same_v<scalar_t, at::opmath_type<scalar_t>>, void>
cpu_adaptive_max_pool_channels_last(
cpu_adaptive_max_pool2d_channels_last(
const Tensor& output_,
const Tensor& indices_,
const Tensor& input_,
IntArrayRef output_size) {
TORCH_CHECK(input_.ndimension() == 4,
"adaptive max pooling with channels last format supports tensors with 4 dims");
"2d adaptive max pooling with channels last format supports tensors with 4 dims");
auto memory_format = at::MemoryFormat::ChannelsLast;
auto input = input_.contiguous(memory_format);
auto output = output_.contiguous(memory_format);
@ -340,7 +340,7 @@ cpu_adaptive_max_pool_channels_last(
}
template <typename scalar_t>
void cpu_adaptive_max_pool_backward(
void cpu_adaptive_max_pool2d_backward(
const Tensor& grad_input_,
const Tensor& grad_output_,
const Tensor& indices_) {
@ -386,12 +386,12 @@ void cpu_adaptive_max_pool_backward(
}
template <typename scalar_t>
void cpu_adaptive_max_pool_backward_channels_last(
void cpu_adaptive_max_pool2d_backward_channels_last(
const Tensor& grad_input_,
const Tensor& grad_output_,
const Tensor& indices_) {
TORCH_CHECK(grad_output_.ndimension() == 4,
"adaptive max pooling backward with channels last format supports tensors with 4 dims.");
"2d adaptive max pooling backward with channels last format supports tensors with 4 dims.");
auto memory_format = at::MemoryFormat::ChannelsLast;
auto grad_input = grad_input_.contiguous(memory_format);
auto grad_output = grad_output_.contiguous(memory_format);
@ -443,13 +443,13 @@ void adaptive_max_pool2d_kernel_impl(
case at::MemoryFormat::Contiguous: {
AT_DISPATCH_FLOATING_TYPES_AND2(ScalarType::BFloat16, ScalarType::Half, input.scalar_type(), "adaptive_max_pool2d", [&] {
using param_t = at::opmath_type<scalar_t>;
cpu_adaptive_max_pool<scalar_t, /*accscalar_t*/param_t>(output, indices, input, output_size);
cpu_adaptive_max_pool2d<scalar_t, /*accscalar_t*/param_t>(output, indices, input, output_size);
});
break;
}
case at::MemoryFormat::ChannelsLast: {
AT_DISPATCH_FLOATING_TYPES_AND2(ScalarType::BFloat16, ScalarType::Half, input.scalar_type(), "adaptive_max_pool2d_channels_last", [&]{
cpu_adaptive_max_pool_channels_last<scalar_t>(output, indices, input, output_size);
cpu_adaptive_max_pool2d_channels_last<scalar_t>(output, indices, input, output_size);
});
break;
}
@ -466,13 +466,512 @@ void adaptive_max_pool2d_backward_kernel_impl(
switch (grad_input.suggest_memory_format()) {
case at::MemoryFormat::Contiguous: {
AT_DISPATCH_FLOATING_TYPES_AND2(ScalarType::BFloat16, ScalarType::Half, grad_output.scalar_type(), "adaptive_max_pool2d_backward", [&] {
cpu_adaptive_max_pool_backward<scalar_t>(grad_input, grad_output, indices);
cpu_adaptive_max_pool2d_backward<scalar_t>(grad_input, grad_output, indices);
});
break;
}
case at::MemoryFormat::ChannelsLast: {
AT_DISPATCH_FLOATING_TYPES_AND2(ScalarType::BFloat16, ScalarType::Half, grad_output.scalar_type(), "adaptive_max_pool2d_backward_channels_last", [&]{
cpu_adaptive_max_pool_backward_channels_last<scalar_t>(grad_input, grad_output, indices);
cpu_adaptive_max_pool2d_backward_channels_last<scalar_t>(grad_input, grad_output, indices);
});
break;
}
default:
TORCH_CHECK(false, "Unsupported memory format. Supports only ChannelsLast, Contiguous");
}
}
template <typename scalar_t, typename accscalar_t>
void cpu_adaptive_max_pool3d(
const Tensor& output_,
const Tensor& indices_,
const Tensor& input_,
IntArrayRef output_size) {
auto input = input_.contiguous();
auto output = output_.contiguous();
auto indices = indices_.contiguous();
auto input_data = input.data_ptr<scalar_t>();
auto output_data = output.data_ptr<scalar_t>();
auto indices_data = indices.data_ptr<int64_t>();
int64_t ndim = input.ndimension();
// treat batch size and channels as one dimension
int64_t channels = ndim == 4 ? input.size(0) : input.size(0) * input.size(1);
int64_t input_depth = input.size(-3);
int64_t input_height = input.size(-2);
int64_t input_width = input.size(-1);
int64_t output_depth = output_size[0];
int64_t output_height = output_size[1];
int64_t output_width = output_size[2];
// parallel on dim of N, C
at::parallel_for(0, channels, 0, [&](int64_t begin, int64_t end) {
for (const auto c : c10::irange(begin, end)) {
scalar_t* input_ptr = input_data + c * input_depth * input_height * input_width;
scalar_t* output_ptr = output_data + c * output_depth * output_height * output_width;
int64_t* indices_ptr = indices_data + c * output_depth * output_height * output_width;
for (const auto od : c10::irange(output_depth)) {
int64_t id0 = start_index(od, output_depth, input_depth);
int64_t id1 = end_index(od, output_depth, input_depth);
for (const auto oh : c10::irange(output_height)) {
int64_t ih0 = start_index(oh, output_height, input_height);
int64_t ih1 = end_index(oh, output_height, input_height);
for (const auto ow : c10::irange(output_width)) {
int64_t iw0 = start_index(ow, output_width, input_width);
int64_t iw1 = end_index(ow, output_width, input_width);
// compute local max
int64_t maxindex = id0 * input_height * input_width + ih0 * input_width + iw0;
accscalar_t maxval = -std::numeric_limits<accscalar_t>::infinity();
for (int64_t id = id0; id < id1; id ++) {
for (int64_t ih = ih0; ih < ih1; ih ++) {
for (int64_t iw = iw0; iw < iw1; iw ++) {
int64_t index = id * input_height * input_width + ih * input_width + iw;
scalar_t val = input_ptr[index];
if ((val > maxval) || std::isnan(val)) {
maxval = val;
maxindex = index;
}
}
}
}
// set output to local max and store location of max
output_ptr[od * output_height * output_width + oh * output_width + ow] = maxval;
indices_ptr[od * output_height * output_width + oh * output_width + ow] = maxindex;
}
}
}
}
});
if (!output_.is_contiguous()) {
output_.copy_(output);
}
if (!indices_.is_contiguous()) {
indices_.copy_(indices);
}
}
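The `(val > maxval) || std::isnan(val)` test above makes NaN sticky: any NaN inside a window wins the comparison and propagates to the output, matching the 2d kernels. A scalar illustration of that selection rule (standalone sketch, not ATen code):
```
#include <cmath>
#include <cstdint>
#include <cstdio>

int main() {
  const float window[] = {1.0f, 3.0f, NAN, 2.0f};
  float maxval = -INFINITY;
  int64_t maxindex = 0;
  for (int64_t i = 0; i < 4; i++) {
    // NaN compares false with everything, so it is accepted explicitly;
    // once maxval is NaN, no later value can displace it.
    if (window[i] > maxval || std::isnan(window[i])) {
      maxval = window[i];
      maxindex = i;
    }
  }
  std::printf("max=%f at index %lld\n", maxval, (long long)maxindex);  // nan at 2
  return 0;
}
```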
template <typename scalar_t>
typename std::enable_if_t<std::is_same_v<scalar_t, at::opmath_type<scalar_t>>, void>
cpu_adaptive_max_pool3d_channels_last(
const Tensor& output_,
const Tensor& indices_,
const Tensor& input_,
IntArrayRef output_size) {
TORCH_CHECK(input_.ndimension() == 5,
"3d adaptive max pooling with channels last format supports tensors with 5 dims");
auto memory_format = at::MemoryFormat::ChannelsLast3d;
auto input = input_.contiguous(memory_format);
auto output = output_.contiguous(memory_format);
auto indices = indices_.contiguous(memory_format);
auto input_data = input.data_ptr<scalar_t>();
auto output_data = output.data_ptr<scalar_t>();
auto indices_data = indices.data_ptr<int64_t>();
int64_t nbatch = input.size(0);
int64_t channels = input.size(1);
int64_t input_depth = input.size(2);
int64_t input_height = input.size(3);
int64_t input_width = input.size(4);
int64_t output_depth = output_size[0];
int64_t output_height = output_size[1];
int64_t output_width = output_size[2];
using Vec = vec::Vectorized<scalar_t>;
using integer_t = vec::int_same_size_t<scalar_t>;
using iVec = vec::Vectorized<integer_t>;
// for the convenience of vectorization, use an integer of the same size as scalar_t,
// e.g. int32_t for float, int64_t for double
// need to make sure the flattened index doesn't overflow integer_t
TORCH_CHECK(input_depth * input_height * input_width <= std::numeric_limits<integer_t>::max());
// parallel on dim of N, H, W
at::parallel_for(0, nbatch * output_depth * output_height * output_width, 0, [&](int64_t begin, int64_t end) {
int64_t n = 0;
int64_t od = 0;
int64_t oh = 0;
int64_t ow = 0;
data_index_init(begin, n, nbatch, od, output_depth, oh, output_height, ow, output_width);
int64_t size = channels;
int64_t len = size - (size % Vec::size());
// temp buffer holding index with integer_t
auto index_buffer = std::make_unique<integer_t []>(len);
for (const auto i : c10::irange(begin, end)) {
int64_t id0 = start_index(od, output_depth, input_depth);
int64_t id1 = end_index(od, output_depth, input_depth);
int64_t ih0 = start_index(oh, output_height, input_height);
int64_t ih1 = end_index(oh, output_height, input_height);
int64_t iw0 = start_index(ow, output_width, input_width);
int64_t iw1 = end_index(ow, output_width, input_width);
scalar_t* out = output_data + i * channels;
int64_t* ind = indices_data + i * channels;
// Pass I: init out lane
iVec index0_vec = iVec(id0 * input_height * input_width + ih0 * input_width + iw0);
Vec out_vec = Vec(-std::numeric_limits<scalar_t>::infinity());
int64_t d1 = 0;
for (; d1 < len; d1 += Vec::size()) {
index0_vec.store(index_buffer.get() + d1);
out_vec.store(out + d1);
}
for (; d1 < size; d1++) {
ind[d1] = id0 * input_height * input_width + ih0 * input_width + iw0;
out[d1] = -std::numeric_limits<scalar_t>::infinity();
}
// Pass II: compute local max
for (int64_t id = id0; id < id1; id ++) {
for (int64_t ih = ih0; ih < ih1; ih ++) {
for (int64_t iw = iw0; iw < iw1; iw ++) {
scalar_t* in = input_data + n * input_depth * input_height * input_width * channels +
id * input_height * input_width * channels + ih * input_width * channels + iw * channels;
int64_t d2 = 0;
for (; d2 < len; d2 += Vec::size()) {
iVec index_vec = iVec(id * input_height * input_width + ih * input_width + iw);
Vec val_vec = Vec::loadu(in + d2);
iVec maxindex_vec = iVec::loadu(index_buffer.get() + d2);
Vec maxval_vec = Vec::loadu(out + d2);
// true = all ones, false = all zeros
Vec mask = (val_vec > maxval_vec) | val_vec.isnan();
iVec imask = vec::cast<integer_t>(mask);
Vec out_vec = Vec::blendv(maxval_vec, val_vec, mask);
iVec ind_vec = iVec::blendv(maxindex_vec, index_vec, imask);
out_vec.store(out + d2);
ind_vec.store(index_buffer.get() + d2);
}
for (; d2 < size; d2++) {
int64_t index = id * input_height * input_width + ih * input_width + iw;
scalar_t val = in[d2];
int64_t maxindex = ind[d2];
scalar_t maxval = out[d2];
bool mask = (val > maxval) || std::isnan(val);
out[d2] = mask ? val : maxval;
ind[d2] = mask ? index : maxindex;
}
}
}
}
// convert index buffer from integer_t to int64_t
vec::convert<integer_t, int64_t>(index_buffer.get(), ind, len);
// move on to next output index
data_index_step(n, nbatch, od, output_depth, oh, output_height, ow, output_width);
}
});
if (!output_.is_contiguous(memory_format)) {
output_.copy_(output);
}
if (!indices_.is_contiguous(memory_format)) {
indices_.copy_(indices);
}
}
template <typename scalar_t>
typename std::enable_if_t<!std::is_same_v<scalar_t, at::opmath_type<scalar_t>>, void>
cpu_adaptive_max_pool3d_channels_last(
const Tensor& output_,
const Tensor& indices_,
const Tensor& input_,
IntArrayRef output_size) {
TORCH_CHECK(input_.ndimension() == 5,
"3d adaptive max pooling with channels last format supports tensors with 5 dims");
auto memory_format = at::MemoryFormat::ChannelsLast3d;
auto input = input_.contiguous(memory_format);
auto output = output_.contiguous(memory_format);
auto indices = indices_.contiguous(memory_format);
auto input_data = input.data_ptr<BFloat16>();
auto output_data = output.data_ptr<BFloat16>();
auto indices_data = indices.data_ptr<int64_t>();
int64_t nbatch = input.size(0);
int64_t channels = input.size(1);
int64_t input_depth = input.size(2);
int64_t input_height = input.size(3);
int64_t input_width = input.size(4);
int64_t output_depth = output_size[0];
int64_t output_height = output_size[1];
int64_t output_width = output_size[2];
using bVec = vec::Vectorized<BFloat16>;
using fVec = vec::Vectorized<float>;
using iVec = vec::Vectorized<int32_t>;
// need to make sure the flattened index doesn't overflow int32_t
TORCH_CHECK(input_depth * input_height * input_width <= std::numeric_limits<int32_t>::max());
// parallel on dim of N, H, W
at::parallel_for(0, nbatch * output_depth * output_height * output_width, 0, [&](int64_t begin, int64_t end) {
int64_t n = 0;
int64_t od = 0;
int64_t oh = 0;
int64_t ow = 0;
data_index_init(begin, n, nbatch, od, output_depth, oh, output_height, ow, output_width);
int64_t size = channels;
int64_t len = size - (size % bVec::size());
// temp buffer holding index with int32_t
auto index_buffer = std::make_unique<int32_t []>(len);
// temp buffer holding max value with float
auto max_arr = std::make_unique<float []>(size);
float* max = max_arr.get();
for (const auto i : c10::irange(begin, end)) {
int64_t id0 = start_index(od, output_depth, input_depth);
int64_t id1 = end_index(od, output_depth, input_depth);
int64_t ih0 = start_index(oh, output_height, input_height);
int64_t ih1 = end_index(oh, output_height, input_height);
int64_t iw0 = start_index(ow, output_width, input_width);
int64_t iw1 = end_index(ow, output_width, input_width);
BFloat16* out = output_data + i * channels;
int64_t* ind = indices_data + i * channels;
// Pass I: init out lane
iVec index0_ivec = iVec(id0 * input_height * input_width + ih0 * input_width + iw0);
fVec max_fvec = fVec(-std::numeric_limits<float>::infinity());
int64_t d1 = 0;
for (; d1 < len; d1 += fVec::size()) {
index0_ivec.store(index_buffer.get() + d1);
max_fvec.store(max + d1);
}
for (; d1 < size; d1++) {
ind[d1] = id0 * input_height * input_width + ih0 * input_width + iw0;
max[d1] = -std::numeric_limits<float>::infinity();
}
// Pass II: compute local max
for (int64_t id = id0; id < id1; id ++) {
for (int64_t ih = ih0; ih < ih1; ih ++) {
for (int64_t iw = iw0; iw < iw1; iw ++) {
BFloat16* in = input_data + n * input_depth * input_height * input_width * channels +
id * input_height * input_width * channels + ih * input_width * channels + iw * channels;
int64_t d2 = 0;
for (; d2 < len; d2 += bVec::size()) {
iVec index_ivec = iVec(id * input_height * input_width + ih * input_width + iw);
bVec val_bvec = bVec::loadu(in + d2);
fVec val_fvec0, val_fvec1;
std::tie(val_fvec0, val_fvec1) = convert_bfloat16_float(val_bvec);
iVec maxindex_ivec0 = iVec::loadu(index_buffer.get() + d2);
iVec maxindex_ivec1 = iVec::loadu(index_buffer.get() + d2 + iVec::size());
fVec maxval_fvec0 = fVec::loadu(max + d2);
fVec maxval_fvec1 = fVec::loadu(max + d2 + fVec::size());
// true = all ones, false = all zeros
fVec mask0 = (val_fvec0 > maxval_fvec0) | val_fvec0.isnan();
fVec mask1 = (val_fvec1 > maxval_fvec1) | val_fvec1.isnan();
iVec imask0 = vec::cast<int32_t>(mask0);
iVec imask1 = vec::cast<int32_t>(mask1);
fVec max_fvec0 = fVec::blendv(maxval_fvec0, val_fvec0, mask0);
fVec max_fvec1 = fVec::blendv(maxval_fvec1, val_fvec1, mask1);
iVec ind_ivec0 = iVec::blendv(maxindex_ivec0, index_ivec, imask0);
iVec ind_ivec1 = iVec::blendv(maxindex_ivec1, index_ivec, imask1);
max_fvec0.store(max + d2);
max_fvec1.store(max + d2 + fVec::size());
ind_ivec0.store(index_buffer.get() + d2);
ind_ivec1.store(index_buffer.get() + d2 + iVec::size());
}
for (; d2 < size; d2++) {
int64_t index = id * input_height * input_width + ih * input_width + iw;
float val = float(in[d2]);
int64_t maxindex = ind[d2];
float maxval = max[d2];
bool mask = (val > maxval) || std::isnan(val);
max[d2] = mask ? val : maxval;
ind[d2] = mask ? index : maxindex;
}
}
}
}
// Pass III: convert max values from float to bfloat16
int64_t d3 = 0;
for (; d3 < len; d3 += bVec::size()) {
fVec max_fvec0 = fVec::loadu(max + d3);
fVec max_fvec1 = fVec::loadu(max + d3 + fVec::size());
bVec max_bvec = convert_float_bfloat16(max_fvec0, max_fvec1);
max_bvec.store(out + d3);
}
for (; d3 < size; d3++) {
out[d3] = BFloat16(max[d3]);
}
// convert index buffer from int32_t to int64_t
vec::convert<int32_t, int64_t>(index_buffer.get(), ind, len);
// move on to next output index
data_index_step(n, nbatch, od, output_depth, oh, output_height, ow, output_width);
}
});
if (!output_.is_contiguous(memory_format)) {
output_.copy_(output);
}
if (!indices_.is_contiguous(memory_format)) {
indices_.copy_(indices);
}
}
template <typename scalar_t>
void cpu_adaptive_max_pool3d_backward(
const Tensor& grad_input_,
const Tensor& grad_output_,
const Tensor& indices_) {
auto grad_output = grad_output_.contiguous();
auto indices = indices_.contiguous();
auto grad_input = grad_input_.contiguous();
auto grad_output_data = grad_output.data_ptr<scalar_t>();
auto indices_data = indices.data_ptr<int64_t>();
auto grad_input_data = grad_input.mutable_data_ptr<scalar_t>();
int64_t ndim = grad_output.ndimension();
// treat batch size and channels as one dimension
int64_t channels = ndim == 3 ? grad_output.size(0) : grad_output.size(0) * grad_output.size(1);
int64_t input_depth = grad_input.size(-3);
int64_t input_height = grad_input.size(-2);
int64_t input_width = grad_input.size(-1);
int64_t output_depth = grad_output.size(-3);
int64_t output_height = grad_output.size(-2);
int64_t output_width = grad_output.size(-1);
// parallel on dim of N, C
at::parallel_for(0, channels, 0, [&](int64_t begin, int64_t end) {
for (const auto c : c10::irange(begin, end)) {
scalar_t* grad_input_ptr = grad_input_data + c * input_depth * input_height * input_width;
scalar_t* grad_output_ptr = grad_output_data + c * output_depth * output_height * output_width;
int64_t* indices_ptr = indices_data + c * output_depth * output_height * output_width;
for (const auto od : c10::irange(output_depth)) {
for (const auto oh : c10::irange(output_height)) {
for (const auto ow : c10::irange(output_width)) {
// retrieve position of max
int64_t index = od * output_height * output_width + oh * output_width + ow;
int64_t maxindex = indices_ptr[index];
// update gradient
grad_input_ptr[maxindex] += grad_output_ptr[index];
}
}
}
}
});
if (!grad_input_.is_contiguous()) {
grad_input_.copy_(grad_input);
}
}
template <typename scalar_t>
void cpu_adaptive_max_pool3d_backward_channels_last(
const Tensor& grad_input_,
const Tensor& grad_output_,
const Tensor& indices_) {
TORCH_CHECK(grad_output_.ndimension() == 5,
"3d adaptive max pooling backward with channels last format supports tensors with 5 dims.");
auto memory_format = at::MemoryFormat::ChannelsLast3d;
auto grad_input = grad_input_.contiguous(memory_format);
auto grad_output = grad_output_.contiguous(memory_format);
auto indices = indices_.contiguous(memory_format);
auto grad_input_data = grad_input.mutable_data_ptr<scalar_t>();
auto grad_output_data = grad_output.data_ptr<scalar_t>();
auto indices_data = indices.data_ptr<int64_t>();
int64_t nbatch = grad_input.size(0);
int64_t channels = grad_input.size(1);
int64_t input_depth = grad_input.size(2);
int64_t input_height = grad_input.size(3);
int64_t input_width = grad_input.size(4);
int64_t output_depth = grad_output.size(2);
int64_t output_height = grad_output.size(3);
int64_t output_width = grad_output.size(4);
// parallel on dim N
at::parallel_for(0, nbatch, 0, [&](int64_t begin, int64_t end) {
for (const auto n : c10::irange(begin, end)) {
scalar_t* grad_input_ptr = grad_input_data + n * input_depth * input_height * input_width * channels;
scalar_t* grad_output_ptr = grad_output_data + n * output_depth * output_height * output_width * channels;
int64_t* indices_ptr = indices_data + n * output_depth * output_height * output_width * channels;
for (const auto od : c10::irange(output_depth)) {
for (const auto oh : c10::irange(output_height)) {
for (const auto ow : c10::irange(output_width)) {
scalar_t* gout = grad_output_ptr + od * output_height * output_width * channels + oh * output_width * channels + ow * channels;
int64_t* ind = indices_ptr + od * output_height * output_width * channels + oh * output_width * channels + ow * channels;
// TODO: gcc vectorization
for (const auto c : c10::irange(channels)) {
int64_t maxindex = ind[c];
grad_input_ptr[maxindex * channels + c] += gout[c];
}
}
}
}
}
});
if (!grad_input_.is_contiguous(memory_format)) {
grad_input_.copy_(grad_input);
}
}
void adaptive_max_pool3d_kernel_impl(
const Tensor& output,
const Tensor& indices,
const Tensor& input,
IntArrayRef output_size) {
switch (input.suggest_memory_format()) {
case at::MemoryFormat::Contiguous: {
AT_DISPATCH_FLOATING_TYPES_AND2(ScalarType::BFloat16, ScalarType::Half, input.scalar_type(), "adaptive_max_pool3d", [&] {
using param_t = at::opmath_type<scalar_t>;
cpu_adaptive_max_pool3d<scalar_t, /*accscalar_t*/param_t>(output, indices, input, output_size);
});
break;
}
case at::MemoryFormat::ChannelsLast3d: {
AT_DISPATCH_FLOATING_TYPES_AND2(ScalarType::BFloat16, ScalarType::Half, input.scalar_type(), "adaptive_max_pool3d_channels_last", [&]{
cpu_adaptive_max_pool3d_channels_last<scalar_t>(output, indices, input, output_size);
});
break;
}
default:
TORCH_CHECK(false, "Unsupported memory format. Supports only ChannelsLast, Contiguous");
}
}
void adaptive_max_pool3d_backward_kernel_impl(
const Tensor& grad_input,
const Tensor& grad_output,
const Tensor& indices) {
// can't use grad_output memory format to switch here since grad_output might be NC11
switch (grad_input.suggest_memory_format()) {
case at::MemoryFormat::Contiguous: {
AT_DISPATCH_FLOATING_TYPES_AND2(ScalarType::BFloat16, ScalarType::Half, grad_output.scalar_type(), "adaptive_max_pool3d_backward", [&] {
cpu_adaptive_max_pool3d_backward<scalar_t>(grad_input, grad_output, indices);
});
break;
}
case at::MemoryFormat::ChannelsLast3d: {
AT_DISPATCH_FLOATING_TYPES_AND2(ScalarType::BFloat16, ScalarType::Half, grad_output.scalar_type(), "adaptive_max_pool3d_backward_channels_last", [&]{
cpu_adaptive_max_pool3d_backward_channels_last<scalar_t>(grad_input, grad_output, indices);
});
break;
}
@ -485,5 +984,7 @@ void adaptive_max_pool2d_backward_kernel_impl(
REGISTER_DISPATCH(adaptive_max_pool2d_kernel, &adaptive_max_pool2d_kernel_impl);
REGISTER_DISPATCH(adaptive_max_pool2d_backward_kernel, &adaptive_max_pool2d_backward_kernel_impl);
REGISTER_DISPATCH(adaptive_max_pool3d_kernel, &adaptive_max_pool3d_kernel_impl);
REGISTER_DISPATCH(adaptive_max_pool3d_backward_kernel, &adaptive_max_pool3d_backward_kernel_impl);
} // at::native
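A hypothetical usage sketch (not part of this diff) exercising the newly registered 3d kernels: the forward returns values plus flat `D*H*W` indices, and the backward kernels above scatter `grad_output` back to those indices:
```
#include <torch/torch.h>
#include <iostream>

int main() {
  auto x = torch::randn({2, 4, 6, 6, 6})
               .contiguous(at::MemoryFormat::ChannelsLast3d)
               .requires_grad_(true);

  auto out = at::adaptive_max_pool3d(x, {2, 2, 2});   // (values, indices)
  std::get<0>(out).sum().backward();                  // routes through the backward kernel

  std::cout << std::get<1>(out).sizes() << " "        // [2, 4, 2, 2, 2]
            << x.grad().abs().sum().item<float>() << std::endl;
  return 0;
}
```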

View File

@ -14,7 +14,7 @@ namespace at::native {
namespace {
template <typename scalar_t>
void cpu_avg_pool(
void cpu_avg_pool2d(
const Tensor& output_,
const Tensor& input_,
int64_t kW, int64_t kH,
@ -101,7 +101,7 @@ void cpu_avg_pool(
template <typename scalar_t,
typename std::enable_if<!is_reduced_floating_point<scalar_t>::value, int>::type = 0>
void cpu_avg_pool_channels_last(
void cpu_avg_pool2d_channels_last(
const Tensor& output_,
const Tensor& input_,
int64_t kW, int64_t kH,
@ -110,7 +110,7 @@ void cpu_avg_pool_channels_last(
bool count_include_pad,
c10::optional<int64_t> divisor_override) {
TORCH_CHECK(input_.ndimension() == 4,
"average pooling with channels last format supports tensors with 4 dims");
"2d average pooling with channels last format supports tensors with 4 dims");
auto memory_format = at::MemoryFormat::ChannelsLast;
auto input = input_.contiguous(memory_format);
auto output = output_.contiguous(memory_format);
@ -215,7 +215,7 @@ void cpu_avg_pool_channels_last(
template <typename scalar_t,
typename std::enable_if<is_reduced_floating_point<scalar_t>::value, int>::type = 0>
void cpu_avg_pool_channels_last(
void cpu_avg_pool2d_channels_last(
const Tensor& output_,
const Tensor& input_,
int64_t kW, int64_t kH,
@ -224,7 +224,7 @@ void cpu_avg_pool_channels_last(
bool count_include_pad,
c10::optional<int64_t> divisor_override) {
TORCH_CHECK(input_.ndimension() == 4,
"average pooling with channels last format supports tensors with 4 dims");
"2d average pooling with channels last format supports tensors with 4 dims");
auto memory_format = at::MemoryFormat::ChannelsLast;
auto input = input_.contiguous(memory_format);
auto output = output_.contiguous(memory_format);
@ -347,7 +347,7 @@ void cpu_avg_pool_channels_last(
}
template <typename scalar_t>
void cpu_avg_pool_backward(
void cpu_avg_pool2d_backward(
const Tensor& grad_input_,
const Tensor& grad_output_,
int kW, int kH,
@ -415,7 +415,7 @@ void cpu_avg_pool_backward(
}
template <typename scalar_t>
void cpu_avg_pool_backward_channels_last(
void cpu_avg_pool2d_backward_channels_last(
const Tensor& grad_input_,
const Tensor& grad_output_,
int kW, int kH,
@ -463,7 +463,7 @@ void cpu_avg_pool_backward_channels_last(
if(count_include_pad) {
divide_factor = pool_size;
} else {
divide_factor = (ih1 - ih0) * (iw1 - iw0);
divide_factor = (ih1 - ih0) * (iw1 - iw0);
}
}
@ -505,13 +505,13 @@ void avg_pool2d_kernel_impl(
switch (input.suggest_memory_format()) {
case at::MemoryFormat::Contiguous: {
AT_DISPATCH_FLOATING_TYPES_AND3(kLong, kBFloat16, kHalf, input.scalar_type(), "avg_pool2d", [&] {
cpu_avg_pool<scalar_t>(output, input, kW, kH, dW, dH, padW, padH, count_include_pad, divisor_override);
cpu_avg_pool2d<scalar_t>(output, input, kW, kH, dW, dH, padW, padH, count_include_pad, divisor_override);
});
break;
}
case at::MemoryFormat::ChannelsLast: {
AT_DISPATCH_FLOATING_TYPES_AND3(kLong, kBFloat16, kHalf, input.scalar_type(), "avg_pool2d_channels_last", [&] {
cpu_avg_pool_channels_last<scalar_t>(output, input, kW, kH, dW, dH, padW, padH, count_include_pad, divisor_override);
cpu_avg_pool2d_channels_last<scalar_t>(output, input, kW, kH, dW, dH, padW, padH, count_include_pad, divisor_override);
});
break;
}
@ -531,13 +531,13 @@ void avg_pool2d_backward_kernel_impl(
switch (grad_output.suggest_memory_format()) {
case at::MemoryFormat::Contiguous: {
AT_DISPATCH_FLOATING_TYPES_AND3(kLong, kBFloat16, kHalf, grad_output.scalar_type(), "avg_pool2d_backward", [&] {
cpu_avg_pool_backward<scalar_t>(grad_input, grad_output, kW, kH, dW, dH, padW, padH, count_include_pad, divisor_override);
cpu_avg_pool2d_backward<scalar_t>(grad_input, grad_output, kW, kH, dW, dH, padW, padH, count_include_pad, divisor_override);
});
break;
}
case at::MemoryFormat::ChannelsLast: {
AT_DISPATCH_FLOATING_TYPES_AND3(kLong, kBFloat16, kHalf, grad_output.scalar_type(), "avg_pool2d_backward_channels_last", [&] {
cpu_avg_pool_backward_channels_last<scalar_t>(grad_input, grad_output, kW, kH, dW, dH, padW, padH, count_include_pad, divisor_override);
cpu_avg_pool2d_backward_channels_last<scalar_t>(grad_input, grad_output, kW, kH, dW, dH, padW, padH, count_include_pad, divisor_override);
});
break;
}
@ -546,9 +546,595 @@ void avg_pool2d_backward_kernel_impl(
}
}
template <typename scalar_t>
void cpu_avg_pool3d(
const Tensor& output_,
const Tensor& input_,
int64_t kW, int64_t kH, int64_t kD,
int64_t dW, int64_t dH, int64_t dD,
int64_t padW, int64_t padH, int64_t padD,
bool count_include_pad,
c10::optional<int64_t> divisor_override) {
using acc_t = at::opmath_type<scalar_t>;
auto input = input_.contiguous();
auto output = output_.contiguous();
auto input_data = input.data_ptr<scalar_t>();
auto output_data = output.data_ptr<scalar_t>();
int64_t numel = output.numel();
int64_t ndim = input.ndimension();
// treat batch size and channels as one dimension
int64_t channels = ndim == 4 ? input.size(0) : input.size(0) * input.size(1);
int64_t input_depth = input.size(-3);
int64_t input_height = input.size(-2);
int64_t input_width = input.size(-1);
int64_t output_depth = output.size(-3);
int64_t output_height = output.size(-2);
int64_t output_width = output.size(-1);
// parallel on dim N, C, D, H, W
at::parallel_for(0, numel, 0, [&](int64_t begin, int64_t end) {
int64_t c = 0;
int64_t od = 0;
int64_t oh = 0;
int64_t ow = 0;
data_index_init(begin, c, channels, od, output_depth, oh, output_height, ow, output_width);
for (const auto i : c10::irange(begin, end)) {
output_data[i] = static_cast<scalar_t>(0);
// local pointers
scalar_t* input_ptr = input_data + c * input_depth * input_height * input_width;
// compute the mean of the input image...
int64_t id0 = od * dD - padD;
int64_t ih0 = oh * dH - padH;
int64_t iw0 = ow * dW - padW;
int64_t id1 = std::min(id0 + kD, input_depth + padD);
int64_t ih1 = std::min(ih0 + kH, input_height + padH);
int64_t iw1 = std::min(iw0 + kW, input_width + padW);
int64_t pool_size = (id1 - id0) * (ih1 - ih0) * (iw1 - iw0);
id0 = std::max(id0, (int64_t) 0);
ih0 = std::max(ih0, (int64_t) 0);
iw0 = std::max(iw0, (int64_t) 0);
id1 = std::min(id1, input_depth);
ih1 = std::min(ih1, input_height);
iw1 = std::min(iw1, input_width);
if (id0 >= id1 || ih0 >= ih1 || iw0 >= iw1) {
// move on to next output index
data_index_step(c, channels, od, output_depth, oh, output_height, ow, output_width);
continue;
}
acc_t sum = 0;
int64_t divide_factor;
if (divisor_override.has_value()) {
divide_factor = divisor_override.value();
} else {
if(count_include_pad) {
divide_factor = pool_size;
} else {
divide_factor = (id1 - id0) * (ih1 - ih0) * (iw1 - iw0);
}
}
for (const auto id : c10::irange(id0, id1)) {
for (const auto ih : c10::irange(ih0, ih1)) {
for (const auto iw : c10::irange(iw0, iw1)) {
sum += input_ptr[id * input_height * input_width + ih * input_width + iw];
}
}
}
output_data[i] += scalar_t(sum / divide_factor);
// move on to next output index
data_index_step(c, channels, od, output_depth, oh, output_height, ow, output_width);
}
});
if (!output_.is_contiguous()) {
output_.copy_(output);
}
}
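The divisor logic above computes `pool_size` before clamping the window to the input, so `count_include_pad=true` divides by the full (padded) kernel volume while `false` divides by the in-bounds volume; `divisor_override` bypasses both. A 1-D sketch of that selection (hypothetical helper; std::optional stands in for c10::optional):
```
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <optional>

int64_t divide_factor_1d(int64_t ow, int64_t kW, int64_t dW, int64_t padW,
                         int64_t input_width, bool count_include_pad,
                         std::optional<int64_t> divisor_override) {
  int64_t iw0 = ow * dW - padW;
  int64_t iw1 = std::min(iw0 + kW, input_width + padW);
  int64_t pool_size = iw1 - iw0;        // counts padded positions
  iw0 = std::max(iw0, (int64_t)0);      // clamp to the valid input range
  iw1 = std::min(iw1, input_width);
  if (divisor_override.has_value()) return divisor_override.value();
  return count_include_pad ? pool_size : (iw1 - iw0);
}

int main() {
  // First output of an avg pool with kernel 3, stride 2, padding 1 over 8 inputs:
  std::printf("count_include_pad=true : %lld\n",
              (long long)divide_factor_1d(0, 3, 2, 1, 8, true, std::nullopt));   // 3
  std::printf("count_include_pad=false: %lld\n",
              (long long)divide_factor_1d(0, 3, 2, 1, 8, false, std::nullopt));  // 2
  return 0;
}
```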
template <typename scalar_t,
typename std::enable_if<!is_reduced_floating_point<scalar_t>::value, int>::type = 0>
void cpu_avg_pool3d_channels_last(
const Tensor& output_,
const Tensor& input_,
int64_t kW, int64_t kH, int64_t kD,
int64_t dW, int64_t dH, int64_t dD,
int64_t padW, int64_t padH, int64_t padD,
bool count_include_pad,
c10::optional<int64_t> divisor_override) {
TORCH_CHECK(input_.ndimension() == 5,
"3d average pooling with channels last format supports tensors with 5 dims");
auto memory_format = at::MemoryFormat::ChannelsLast3d;
auto input = input_.contiguous(memory_format);
auto output = output_.contiguous(memory_format);
auto input_data = input.data_ptr<scalar_t>();
auto output_data = output.data_ptr<scalar_t>();
int64_t nbatch = input.size(0);
int64_t channels = input.size(1);
int64_t input_depth = input.size(2);
int64_t input_height = input.size(3);
int64_t input_width = input.size(4);
int64_t output_depth = output.size(2);
int64_t output_height = output.size(3);
int64_t output_width = output.size(4);
using Vec = vec::Vectorized<scalar_t>;
// parallel on dim N, H, W
at::parallel_for(0, nbatch * output_depth * output_height * output_width, 0, [&](int64_t begin, int64_t end) {
int64_t n = 0;
int64_t od = 0;
int64_t oh = 0;
int64_t ow = 0;
data_index_init(begin, n, nbatch, od, output_depth, oh, output_height, ow, output_width);
int64_t size = channels;
int64_t len = size - (size % Vec::size());
for (const auto i : c10::irange(begin, end)) {
// compute the mean of the input image...
int64_t id0 = od * dD - padD;
int64_t ih0 = oh * dH - padH;
int64_t iw0 = ow * dW - padW;
int64_t id1 = std::min(id0 + kD, input_depth + padD);
int64_t ih1 = std::min(ih0 + kH, input_height + padH);
int64_t iw1 = std::min(iw0 + kW, input_width + padW);
int64_t pool_size = (id1 - id0) * (ih1 - ih0) * (iw1 - iw0);
id0 = std::max(id0, (int64_t) 0);
ih0 = std::max(ih0, (int64_t) 0);
iw0 = std::max(iw0, (int64_t) 0);
id1 = std::min(id1, input_depth);
ih1 = std::min(ih1, input_height);
iw1 = std::min(iw1, input_width);
int64_t divide_factor;
if (divisor_override.has_value()) {
divide_factor = divisor_override.value();
} else {
if(count_include_pad) {
divide_factor = pool_size;
} else {
divide_factor = (id1 - id0) * (ih1 - ih0) * (iw1 - iw0);
}
}
scalar_t* out = output_data + i * channels;
// Pass I: zero the out lane
int64_t d1 = 0;
for (; d1 < len; d1 += Vec::size()) {
Vec out_vec = Vec(scalar_t(0));
out_vec.store(out + d1);
}
for (; d1 < size; d1++) {
out[d1] = scalar_t(0);
}
if (id0 >= id1 || ih0 >= ih1 || iw0 >= iw1) {
// move on to next output index
data_index_step(n, nbatch, od, output_depth, oh, output_height, ow, output_width);
continue;
}
// Pass II: compute local sum
for (const auto id : c10::irange(id0, id1)) {
for (const auto ih : c10::irange(ih0, ih1)) {
for (const auto iw : c10::irange(iw0, iw1)) {
scalar_t* in = input_data + n * input_depth * input_height * input_width * channels +
id * input_height * input_width * channels + ih * input_width * channels + iw * channels;
int64_t d2 = 0;
for (; d2 < len; d2 += Vec::size()) {
Vec out_vec = Vec::loadu(out + d2) + Vec::loadu(in + d2);
out_vec.store(out + d2);
}
for (; d2 < size; d2++) {
out[d2] += in[d2];
}
}
}
}
// Pass III: compute local average
int64_t d3 = 0;
for (; d3 < len; d3 += Vec::size()) {
Vec out_vec = Vec::loadu(out + d3) / Vec(scalar_t(divide_factor));
out_vec.store(out + d3);
}
for (; d3 < size; d3++) {
out[d3] = out[d3] / divide_factor;
}
// move on to next output index
data_index_step(n, nbatch, od, output_depth, oh, output_height, ow, output_width);
}
});
if (!output_.is_contiguous(memory_format)) {
output_.copy_(output);
}
}
template <typename scalar_t,
typename std::enable_if<is_reduced_floating_point<scalar_t>::value, int>::type = 0>
void cpu_avg_pool3d_channels_last(
const Tensor& output_,
const Tensor& input_,
int64_t kW, int64_t kH, int64_t kD,
int64_t dW, int64_t dH, int64_t dD,
int64_t padW, int64_t padH, int64_t padD,
bool count_include_pad,
c10::optional<int64_t> divisor_override) {
TORCH_CHECK(input_.ndimension() == 5,
"3d average pooling with channels last format supports tensors with 5 dims");
auto memory_format = at::MemoryFormat::ChannelsLast3d;
auto input = input_.contiguous(memory_format);
auto output = output_.contiguous(memory_format);
auto input_data = input.data_ptr<BFloat16>();
auto output_data = output.data_ptr<BFloat16>();
int64_t nbatch = input.size(0);
int64_t channels = input.size(1);
int64_t input_depth = input.size(2);
int64_t input_height = input.size(3);
int64_t input_width = input.size(4);
int64_t output_depth = output.size(2);
int64_t output_height = output.size(3);
int64_t output_width = output.size(4);
using bVec = vec::Vectorized<BFloat16>;
using fVec = vec::Vectorized<float>;
// parallel on dim N, H, W
at::parallel_for(0, nbatch * output_depth * output_height * output_width, 0, [&](int64_t begin, int64_t end) {
int64_t n = 0;
int64_t od = 0;
int64_t oh = 0;
int64_t ow = 0;
data_index_init(begin, n, nbatch, od, output_depth, oh, output_height, ow, output_width);
// temp buffer for sum, use float as accumulation type
// can't reuse output buffer to store sum since it is BFloat16
auto sum_arr = std::make_unique<float []>(channels);
float* sum = sum_arr.get();
int64_t size = channels;
for (const auto i : c10::irange(begin, end)) {
// compute the mean of the input image...
int64_t id0 = od * dD - padD;
int64_t ih0 = oh * dH - padH;
int64_t iw0 = ow * dW - padW;
int64_t id1 = std::min(id0 + kD, input_depth + padD);
int64_t ih1 = std::min(ih0 + kH, input_height + padH);
int64_t iw1 = std::min(iw0 + kW, input_width + padW);
int64_t pool_size = (id1 - id0) * (ih1 - ih0) * (iw1 - iw0);
id0 = std::max(id0, (int64_t) 0);
ih0 = std::max(ih0, (int64_t) 0);
iw0 = std::max(iw0, (int64_t) 0);
id1 = std::min(id1, input_depth);
ih1 = std::min(ih1, input_height);
iw1 = std::min(iw1, input_width);
int64_t divide_factor;
if (divisor_override.has_value()) {
divide_factor = divisor_override.value();
} else {
if(count_include_pad) {
divide_factor = pool_size;
} else {
divide_factor = (id1 - id0) * (ih1 - ih0) * (iw1 - iw0);
}
}
BFloat16* out = output_data + i * channels;
// Pass I: zero the float sum buffer
int64_t d1 = 0;
for (; d1 < size - (size % fVec::size()); d1 += fVec::size()) {
fVec sum_fvec = fVec(float(0));
sum_fvec.store(sum + d1);
}
for (; d1 < size; d1++) {
sum[d1] = float(0);
}
if (id0 >= id1 || ih0 >= ih1 || iw0 >= iw1) {
// since the output is not used as the accumulation buffer here,
// it has to be zeroed explicitly when the kernel window is fully out of range.
for (int64_t k = 0; k < size; k++) {
out[k] = 0;
}
// move on to next output index
data_index_step(n, nbatch, od, output_depth, oh, output_height, ow, output_width);
continue;
}
// Pass II: compute local sum
for (const auto id : c10::irange(id0, id1)) {
for (const auto ih : c10::irange(ih0, ih1)) {
for (const auto iw : c10::irange(iw0, iw1)) {
BFloat16* in = input_data + n * input_depth * input_height * input_width * channels +
id * input_height * input_width * channels + ih * input_width * channels + iw * channels;
int64_t d2 = 0;
for (; d2 < size - (size % bVec::size()); d2 += bVec::size()) {
bVec data_bvec = bVec::loadu(in + d2);
fVec data_fvec0, data_fvec1;
std::tie(data_fvec0, data_fvec1) = convert_bfloat16_float(data_bvec);
fVec sum_fvec0 = fVec::loadu(sum + d2) + data_fvec0;
fVec sum_fvec1 = fVec::loadu(sum + d2 + fVec::size()) + data_fvec1;
sum_fvec0.store(sum + d2);
sum_fvec1.store(sum + d2 + fVec::size());
}
for (; d2 < size; d2++) {
sum[d2] += float(in[d2]);
}
}
}
}
// Pass III: compute local average
int64_t d3 = 0;
for (; d3 < size - (size % bVec::size()); d3 += bVec::size()) {
fVec out_fvec0 = fVec::loadu(sum + d3) / fVec(float(divide_factor));
fVec out_fvec1 = fVec::loadu(sum + d3 + fVec::size()) / fVec(float(divide_factor));
bVec out_bvec = convert_float_bfloat16(out_fvec0, out_fvec1);
out_bvec.store(out + d3);
}
for (; d3 < size; d3++) {
out[d3] = BFloat16(sum[d3] / divide_factor);
}
// move on to next output index
data_index_step(n, nbatch, od, output_depth, oh, output_height, ow, output_width);
}
});
if (!output_.is_contiguous(memory_format)) {
output_.copy_(output);
}
}
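For reduced-precision inputs the kernel above sums into a separate `float` buffer and only converts back to BFloat16 at the end; accumulating directly in bf16 would drop small addends once the running sum grows. A standalone illustration (bit truncation used as a stand-in for bf16 rounding):
```
#include <cstdint>
#include <cstdio>
#include <cstring>

// Truncate a float to bfloat16 precision (keep the top 16 bits of the encoding).
float to_bf16(float x) {
  uint32_t bits;
  std::memcpy(&bits, &x, sizeof(bits));
  bits &= 0xFFFF0000u;
  float y;
  std::memcpy(&y, &bits, sizeof(y));
  return y;
}

int main() {
  float acc_bf16 = 0.f, acc_f32 = 0.f;
  for (int i = 0; i < 10000; i++) {
    acc_bf16 = to_bf16(acc_bf16 + to_bf16(0.01f));  // bf16 accumulator stalls around 2.0
    acc_f32 += 0.01f;                               // float accumulator reaches ~100
  }
  std::printf("bf16: %g  float: %g\n", acc_bf16, acc_f32);
  return 0;
}
```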
template <typename scalar_t>
void cpu_avg_pool3d_backward(
const Tensor& grad_input_,
const Tensor& grad_output_,
int kW, int kH, int kD,
int dW, int dH, int dD,
int padW, int padH, int padD,
bool count_include_pad,
c10::optional<int64_t> divisor_override) {
auto grad_output = grad_output_.contiguous();
auto grad_input = grad_input_.contiguous();
auto grad_output_data = grad_output.data_ptr<scalar_t>();
auto grad_input_data = grad_input.mutable_data_ptr<scalar_t>();
int64_t ndim = grad_output.ndimension();
// treat batch size and channels as one dimension
int64_t channels = ndim == 4 ? grad_output.size(0) : grad_output.size(0) * grad_output.size(1);
int64_t input_depth = grad_input.size(-3);
int64_t input_height = grad_input.size(-2);
int64_t input_width = grad_input.size(-1);
int64_t output_depth = grad_output.size(-3);
int64_t output_height = grad_output.size(-2);
int64_t output_width = grad_output.size(-1);
// parallel on dim of N, C
at::parallel_for(0, channels, 0, [&](int64_t begin, int64_t end) {
for (const auto c : c10::irange(begin, end)) {
scalar_t* grad_input_ptr = grad_input_data + c * input_depth * input_height * input_width;
scalar_t* grad_output_ptr = grad_output_data + c * output_depth * output_height * output_width;
for (const auto od : c10::irange(output_depth)) {
for (const auto oh : c10::irange(output_height)) {
for (const auto ow : c10::irange(output_width)) {
int64_t id0 = od * dD - padD;
int64_t ih0 = oh * dH - padH;
int64_t iw0 = ow * dW - padW;
int64_t id1 = std::min(id0 + kD, input_depth + padD);
int64_t ih1 = std::min(ih0 + kH, input_height + padH);
int64_t iw1 = std::min(iw0 + kW, input_width + padW);
int64_t pool_size = (id1 - id0) * (ih1 - ih0) * (iw1 - iw0);
id0 = std::max(id0, (int64_t) 0);
ih0 = std::max(ih0, (int64_t) 0);
iw0 = std::max(iw0, (int64_t) 0);
id1 = std::min(id1, input_depth);
ih1 = std::min(ih1, input_height);
iw1 = std::min(iw1, input_width);
int64_t divide_factor;
if (divisor_override.has_value()) {
divide_factor = divisor_override.value();
} else {
if(count_include_pad) {
divide_factor = pool_size;
} else {
divide_factor = (id1 - id0) * (ih1 - ih0) * (iw1 - iw0);
}
}
scalar_t grad_delta = grad_output_ptr[od * output_height * output_width + oh * output_width + ow] / divide_factor;
for (const auto id : c10::irange(id0, id1)) {
for (const auto ih : c10::irange(ih0, ih1)) {
for (const auto iw : c10::irange(iw0, iw1)) {
grad_input_ptr[id * input_height * input_width + ih * input_width + iw] += grad_delta;
}
}
}
}
}
}
}
});
if (!grad_input_.is_contiguous()) {
grad_input_.copy_(grad_input);
}
}
template <typename scalar_t>
void cpu_avg_pool3d_backward_channels_last(
const Tensor& grad_input_,
const Tensor& grad_output_,
int kW, int kH, int kD,
int dW, int dH, int dD,
int padW, int padH, int padD,
bool count_include_pad,
c10::optional<int64_t> divisor_override) {
auto memory_format = at::MemoryFormat::ChannelsLast3d;
auto grad_input = grad_input_.contiguous(memory_format);
auto grad_output = grad_output_.contiguous(memory_format);
auto grad_input_data = grad_input.mutable_data_ptr<scalar_t>();
auto grad_output_data = grad_output.data_ptr<scalar_t>();
int64_t nbatch = grad_input.size(0);
int64_t channels = grad_input.size(1);
int64_t input_depth = grad_input.size(2);
int64_t input_height = grad_input.size(3);
int64_t input_width = grad_input.size(4);
int64_t output_depth = grad_output.size(2);
int64_t output_height = grad_output.size(3);
int64_t output_width = grad_output.size(4);
using Vec = vec::Vectorized<scalar_t>;
// parallel on dim N
at::parallel_for(0, nbatch, 0, [&](int64_t begin, int64_t end) {
for (const auto n : c10::irange(begin, end)) {
scalar_t* grad_input_ptr = grad_input_data + n * input_depth * input_height * input_width * channels;
scalar_t* grad_output_ptr = grad_output_data + n * output_depth * output_height * output_width * channels;
for (const auto od : c10::irange(output_depth)) {
for (const auto oh : c10::irange(output_height)) {
for (const auto ow : c10::irange(output_width)) {
int64_t id0 = od * dD - padD;
int64_t ih0 = oh * dH - padH;
int64_t iw0 = ow * dW - padW;
int64_t id1 = std::min(id0 + kD, input_depth + padD);
int64_t ih1 = std::min(ih0 + kH, input_height + padH);
int64_t iw1 = std::min(iw0 + kW, input_width + padW);
int64_t pool_size = (id1 - id0) * (ih1 - ih0) * (iw1 - iw0);
id0 = std::max(id0, (int64_t) 0);
ih0 = std::max(ih0, (int64_t) 0);
iw0 = std::max(iw0, (int64_t) 0);
id1 = std::min(id1, input_depth);
ih1 = std::min(ih1, input_height);
iw1 = std::min(iw1, input_width);
int64_t divide_factor;
if (divisor_override.has_value()) {
divide_factor = divisor_override.value();
} else {
if(count_include_pad) {
divide_factor = pool_size;
} else {
divide_factor = (id1 - id0) * (ih1 - ih0) * (iw1 - iw0);
}
}
scalar_t* gout = grad_output_ptr + od * output_height * output_width * channels + oh * output_width * channels + ow * channels;
int64_t size = channels;
int64_t len = size - (size % Vec::size());
for (const auto id : c10::irange(id0, id1)) {
for (const auto ih : c10::irange(ih0, ih1)) {
for (const auto iw : c10::irange(iw0, iw1)) {
scalar_t* gin = grad_input_ptr + id * input_height * input_width * channels + ih * input_width * channels + iw * channels;
int64_t d = 0;
for (; d < len; d += Vec::size()) {
Vec gin_vec = Vec::loadu(gin + d) + Vec::loadu(gout + d) / Vec(scalar_t(divide_factor));
gin_vec.store(gin + d);
}
for (; d < size; d++) {
gin[d] += gout[d] / divide_factor;
}
}
}
}
}
}
}
}
});
if (!grad_input_.is_contiguous(memory_format)) {
grad_input_.copy_(grad_input);
}
}
void avg_pool3d_kernel_impl(
const Tensor& output,
const Tensor& input,
int64_t kW, int64_t kH, int64_t kD,
int64_t dW, int64_t dH, int64_t dD,
int64_t padW, int64_t padH, int64_t padD,
bool count_include_pad,
c10::optional<int64_t> divisor_override) {
switch (input.suggest_memory_format()) {
case at::MemoryFormat::Contiguous: {
AT_DISPATCH_FLOATING_TYPES_AND3(kLong, kBFloat16, kHalf, input.scalar_type(), "avg_pool3d", [&] {
cpu_avg_pool3d<scalar_t>(output, input, kW, kH, kD, dW, dH, dD, padW, padH, padD, count_include_pad, divisor_override);
});
break;
}
case at::MemoryFormat::ChannelsLast3d: {
AT_DISPATCH_FLOATING_TYPES_AND3(kLong, kBFloat16, kHalf, input.scalar_type(), "avg_pool3d_channels_last", [&] {
cpu_avg_pool3d_channels_last<scalar_t>(output, input, kW, kH, kD, dW, dH, dD, padW, padH, padD, count_include_pad, divisor_override);
});
break;
}
default:
TORCH_CHECK(false, "Unsupported memory format. Supports only ChannelsLast, Contiguous");
}
}
void avg_pool3d_backward_kernel_impl(
const Tensor& grad_input,
const Tensor& grad_output,
int kW, int kH, int kD,
int dW, int dH, int dD,
int padW, int padH, int padD,
bool count_include_pad,
c10::optional<int64_t> divisor_override) {
switch (grad_output.suggest_memory_format()) {
case at::MemoryFormat::Contiguous: {
AT_DISPATCH_FLOATING_TYPES_AND3(kLong, kBFloat16, kHalf, grad_output.scalar_type(), "avg_pool3d_backward", [&] {
cpu_avg_pool3d_backward<scalar_t>(grad_input, grad_output, kW, kH, kD, dW, dH, dD, padW, padH, padD, count_include_pad, divisor_override);
});
break;
}
case at::MemoryFormat::ChannelsLast3d: {
AT_DISPATCH_FLOATING_TYPES_AND3(kLong, kBFloat16, kHalf, grad_output.scalar_type(), "avg_pool3d_backward_channels_last", [&] {
cpu_avg_pool3d_backward_channels_last<scalar_t>(grad_input, grad_output, kW, kH, kD, dW, dH, dD, padW, padH, padD, count_include_pad, divisor_override);
});
break;
}
default:
TORCH_CHECK(false, "Unsupported memory format. Supports only ChannelsLast, Contiguous");
}
}
} // anonymous namespace
REGISTER_DISPATCH(avg_pool2d_kernel, &avg_pool2d_kernel_impl);
REGISTER_DISPATCH(avg_pool2d_backward_kernel, &avg_pool2d_backward_kernel_impl);
REGISTER_DISPATCH(avg_pool3d_kernel, &avg_pool3d_kernel_impl);
REGISTER_DISPATCH(avg_pool3d_backward_kernel, &avg_pool3d_backward_kernel_impl);
} // at::native

View File

@ -52,8 +52,8 @@ typename std::enable_if<
grad_vec2 = grad_vec2 * fVec(opmath_t(-1.0));
}
if (weight_decay != 0.0){
grad_vec1 += param_vec1 * fVec(scalar_t(weight_decay));
grad_vec2 += param_vec2 * fVec(scalar_t(weight_decay));
grad_vec1 = vec::fmadd(param_vec1, fVec(scalar_t(weight_decay)), grad_vec1);
grad_vec2 = vec::fmadd(param_vec2, fVec(scalar_t(weight_decay)), grad_vec2);
}
if (momentum != 0.0) {
fVec momentum_vec1, momentum_vec2;
@ -61,17 +61,16 @@ typename std::enable_if<
momentum_vec1 = grad_vec1;
momentum_vec2 = grad_vec2;
} else {
momentum_vec1 =
fVec::loadu(momentum_buf_ptr + d) * fVec(scalar_t(momentum)) +
grad_vec1 * fVec(scalar_t(1 - dampening));
momentum_vec2 =
fVec::loadu(momentum_buf_ptr + d + fVec::size()) * fVec(scalar_t(momentum)) +
grad_vec2 * fVec(scalar_t(1 - dampening));
momentum_vec1 = fVec::loadu(momentum_buf_ptr + d) * fVec(scalar_t(momentum));
momentum_vec2 = fVec::loadu(momentum_buf_ptr + d + fVec::size()) * fVec(scalar_t(momentum));
momentum_vec1 = vec::fmadd(fVec(scalar_t(1 - dampening)), grad_vec1, momentum_vec1);
momentum_vec2 = vec::fmadd(fVec(scalar_t(1 - dampening)), grad_vec2, momentum_vec2);
}
vec::convert_from_float<scalar_t>(momentum_vec1, momentum_vec2).store(momentum_buf_ptr + d);
if (nesterov) {
grad_vec1 += momentum_vec1 * fVec(scalar_t(momentum));
grad_vec2 += momentum_vec2 * fVec(scalar_t(momentum));
grad_vec1 = vec::fmadd(momentum_vec1, fVec(scalar_t(momentum)), grad_vec1);
grad_vec2 = vec::fmadd(momentum_vec2, fVec(scalar_t(momentum)), grad_vec2);
} else {
grad_vec1 = momentum_vec1;
grad_vec2 = momentum_vec2;
@ -142,7 +141,7 @@ typename std::enable_if<
}
if (maximize) grad_vec = grad_vec * Vec(scalar_t(-1.0));
if (weight_decay != 0.0){
grad_vec += param_vec * Vec(scalar_t(weight_decay));
grad_vec = vec::fmadd(param_vec, Vec(scalar_t(weight_decay)), grad_vec);
}
if (momentum != 0.0) {
Vec momentum_vec;
@ -150,12 +149,12 @@ typename std::enable_if<
momentum_vec = grad_vec;
} else {
momentum_vec =
Vec::loadu(momentum_buf_ptr + d) * Vec(scalar_t(momentum)) +
grad_vec * Vec(scalar_t(1 - dampening));
Vec::loadu(momentum_buf_ptr + d) * Vec(scalar_t(momentum));
momentum_vec = vec::fmadd(Vec(scalar_t(1 - dampening)), grad_vec, momentum_vec);
}
momentum_vec.store(momentum_buf_ptr + d);
if (nesterov) {
grad_vec += momentum_vec * Vec(scalar_t(momentum));
grad_vec = vec::fmadd(momentum_vec, Vec(scalar_t(momentum)), grad_vec);
} else {
grad_vec = momentum_vec;
}
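The hunks above replace separate multiply/add pairs with `vec::fmadd`, but the update they compute is unchanged. A scalar reference of that fused SGD step (hypothetical standalone sketch; the learning-rate application happens outside the shown hunks and is assumed here):
```
#include <cstdio>

void sgd_step(double& param, double grad, double& momentum_buf, bool buf_initialized,
              double lr, double momentum, double dampening, double weight_decay,
              bool nesterov, bool maximize) {
  if (maximize) grad = -grad;
  if (weight_decay != 0.0) grad += weight_decay * param;  // fmadd(param, wd, grad)
  if (momentum != 0.0) {
    if (!buf_initialized) {
      momentum_buf = grad;
    } else {
      // buf = momentum * buf + (1 - dampening) * grad
      momentum_buf = momentum * momentum_buf + (1.0 - dampening) * grad;
    }
    // Nesterov looks ahead along the momentum direction.
    grad = nesterov ? grad + momentum * momentum_buf : momentum_buf;
  }
  param -= lr * grad;  // assumed final step, outside the shown hunks
}

int main() {
  double p = 1.0, buf = 0.0;
  sgd_step(p, 0.5, buf, false, 0.1, 0.9, 0.0, 0.0, false, false);
  sgd_step(p, 0.5, buf, true,  0.1, 0.9, 0.0, 0.0, false, false);
  std::printf("param=%f momentum_buf=%f\n", p, buf);  // 0.855000, 0.950000
  return 0;
}
```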

View File

@ -185,11 +185,78 @@ inline void tinygemm_kernel(
#if !defined(C10_MOBILE) && defined(__aarch64__)
#include <arm_neon.h>
static inline float reduce(float32x4_t x) {
inline float reduce(float32x4_t x) {
auto sum = vpaddq_f32(x, x);
return vgetq_lane_f32(vpaddq_f32(sum, sum), 0);
}
inline float32x4x2_t load_as_float32x4x2(const Half* ptr) {
float16x8_t f16_val = vld1q_f16(reinterpret_cast<const float16_t *>(ptr));
auto val_low = vcvt_f32_f16(vget_low_f16(f16_val));
auto val_high = vcvt_f32_f16(vget_high_f16(f16_val));
return {val_low, val_high};
}
inline float32x4_t load_as_float32x4(const Half* ptr) {
return vcvt_f32_f16(vld1_f16(reinterpret_cast<const float16_t *>(ptr)));
}
inline float32x4x2_t load_as_float32x4x2(const BFloat16* ptr) {
int32x4_t shift = vdupq_n_s32(16);
uint16x8_t u16_val = vld1q_u16(reinterpret_cast<const uint16_t *>(ptr));
uint32x4_t int_low = vmovl_u16(vget_low_u16(u16_val));
uint32x4_t int_high = vmovl_u16(vget_high_u16(u16_val));
return {vreinterpretq_f32_u32(vshlq_u32(int_low, shift)), vreinterpretq_f32_u32(vshlq_u32(int_high, shift))};
}
inline float32x4_t load_as_float32x4(const BFloat16* ptr) {
int32x4_t shift = vdupq_n_s32(16);
uint32x4_t as_int = vmovl_u16(vld1_u16(reinterpret_cast<const uint16_t *>(ptr)));
return vreinterpretq_f32_u32(vshlq_u32(as_int, shift));
}
inline float32x4_t load_as_float32x4(const float* ptr) {
return vld1q_f32(ptr);
}
inline float32x4x2_t load_as_float32x4x2(const float* ptr) {
return {vld1q_f32(ptr), vld1q_f32(ptr + 4)};
}
template <int BLOCK_M, int BLOCK_N, typename T>
inline void tinygemm_kernel_(
const T* RESTRICT A,
const int8_t* RESTRICT B,
const T* RESTRICT scales,
T* RESTRICT C,
int lda,
int ldb,
int ldc,
int K) {
for (const auto m : c10::irange(BLOCK_M)) {
float32x4_t c_val[BLOCK_N];
c10::ForcedUnroll<BLOCK_N>{}([&](auto i) {
c_val[i] = vdupq_n_f32(0.0);
});
for (int k = 0; k < K; k += 8) {
auto a_val = load_as_float32x4x2(A + m * lda + k);
c10::ForcedUnroll<BLOCK_N>{}([&](auto i) {
int16x8_t b_val = vmovl_s8(vld1_s8(B + i * ldb + k));
auto b_val_low = vcvtq_f32_s32(vmovl_s16(vget_low_s16(b_val)));
auto b_val_high = vcvtq_f32_s32(vmovl_s16(vget_high_s16(b_val)));
c_val[i] = vfmaq_f32(c_val[i], a_val.val[1], b_val_high);
c_val[i] = vfmaq_f32(c_val[i], a_val.val[0], b_val_low);
});
}
float32x4_t scale_val = load_as_float32x4(scales);
c10::ForcedUnroll<BLOCK_N>{}([&](auto i) {
C[m * ldc + i] = reduce(c_val[i]) * vgetq_lane_f32(scale_val, i);
});
}
}
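A scalar reference of what `tinygemm_kernel_` computes (semantics inferred from the NEON code above, so treat it as an assumption): each output is a K-length dot product of a row of A with a row of int8 weights B, scaled by a per-output-channel scale.
```
#include <cstdint>

// C[m][n] = scales[n] * sum_k A[m*lda + k] * B[n*ldb + k]
// B stores its reduction (K) dimension innermost, as in the NEON kernel.
template <int BLOCK_M, int BLOCK_N>
void tinygemm_reference(const float* A, const int8_t* B, const float* scales,
                        float* C, int lda, int ldb, int ldc, int K) {
  for (int m = 0; m < BLOCK_M; m++) {
    for (int n = 0; n < BLOCK_N; n++) {
      float acc = 0.f;
      for (int k = 0; k < K; k++) {
        acc += A[m * lda + k] * float(B[n * ldb + k]);
      }
      C[m * ldc + n] = acc * scales[n];
    }
  }
}

int main() {
  const float A[4] = {1, 2, 3, 4};          // 1 x 4, lda = 4
  const int8_t B[8] = {1, 1, 1, 1,          // 2 x 4, ldb = 4
                       2, 0, 2, 0};
  const float scales[2] = {0.5f, 1.0f};
  float C[2] = {};
  tinygemm_reference<1, 2>(A, B, scales, C, 4, 4, 2, 4);
  // Expected: C = {0.5 * 10, 1.0 * 8} = {5, 8}
  return (C[0] == 5.f && C[1] == 8.f) ? 0 : 1;
}
```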
template <int BLOCK_M, int BLOCK_N>
inline void tinygemm_kernel(
const Half* RESTRICT A,
@ -200,30 +267,33 @@ inline void tinygemm_kernel(
int ldb,
int ldc,
int K) {
tinygemm_kernel_<BLOCK_M, BLOCK_N>(A, B, scales, C, lda, ldb, ldc, K);
}
for (const auto m : c10::irange(BLOCK_M)) {
float32x4_t c_val[BLOCK_N];
c10::ForcedUnroll<BLOCK_N>{}([&](auto i) {
c_val[i] = vdupq_n_f32(0.0);
});
for (int k = 0; k < K; k += 8) {
float16x8_t a_val = vld1q_f16(reinterpret_cast<const float16_t *>(A) + m * lda + k);
auto a_val_low = vcvt_f32_f16(vget_low_f16(a_val));
auto a_val_high = vcvt_f32_f16(vget_high_f16(a_val));
c10::ForcedUnroll<BLOCK_N>{}([&](auto i) {
int16x8_t b_val = vmovl_s8(vld1_s8(B + i * ldb + k));
auto b_val_low = vcvtq_f32_s32(vmovl_s16(vget_low_s16(b_val)));
auto b_val_high = vcvtq_f32_s32(vmovl_s16(vget_high_s16(b_val)));
c_val[i] = vfmaq_f32(c_val[i], a_val_high, b_val_high);
c_val[i] = vfmaq_f32(c_val[i], a_val_low, b_val_low);
});
}
template <int BLOCK_M, int BLOCK_N>
inline void tinygemm_kernel(
const BFloat16* RESTRICT A,
const int8_t* RESTRICT B,
const BFloat16* RESTRICT scales,
BFloat16* RESTRICT C,
int lda,
int ldb,
int ldc,
int K) {
tinygemm_kernel_<BLOCK_M, BLOCK_N>(A, B, scales, C, lda, ldb, ldc, K);
}
float32x4_t scale_val = vcvt_f32_f16(vld1_f16(reinterpret_cast<const float16_t *>(scales)));
c10::ForcedUnroll<BLOCK_N>{}([&](auto i) {
C[m * ldc + i] = reduce(c_val[i]) * vgetq_lane_f32(scale_val, i);
});
}
template <int BLOCK_M, int BLOCK_N>
inline void tinygemm_kernel(
const float* RESTRICT A,
const int8_t* RESTRICT B,
const float* RESTRICT scales,
float* RESTRICT C,
int lda,
int ldb,
int ldc,
int K) {
tinygemm_kernel_<BLOCK_M, BLOCK_N>(A, B, scales, C, lda, ldb, ldc, K);
}
#endif
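The BFloat16 overloads of `load_as_float32x4x2` above widen by shifting the 16-bit payload into the top half of a 32-bit lane, because bf16 is exactly the upper 16 bits of an IEEE-754 float. A portable scalar equivalent of that trick:
```
#include <cstdint>
#include <cstdio>
#include <cstring>

float bf16_bits_to_float(uint16_t b) {
  uint32_t bits = uint32_t(b) << 16;   // same as the per-lane vshlq_u32(..., 16)
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}

int main() {
  // 0x3FC0 is the bf16 encoding of 1.5 (sign 0, exponent 127, mantissa 0x40).
  std::printf("%f\n", bf16_bits_to_float(0x3FC0));  // 1.500000
  return 0;
}
```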

View File

@ -7,6 +7,7 @@
#include <ATen/TensorUtils.h>
#include <ATen/cuda/CUDABlas.h>
#include <ATen/cuda/tunable/Tunable.h>
#include <ATen/cuda/tunable/TunableGemm.h>
#include <ATen/native/Resize.h>
#include <c10/util/MaybeOwned.h>
@ -156,7 +157,7 @@ enum class Activation {
GELU,
};
#if (!defined(USE_ROCM) && !defined(_MSC_VER)) || (defined(USE_ROCM) && ROCM_VERSION >= 50700)
#if !defined(USE_ROCM) || (defined(USE_ROCM) && ROCM_VERSION >= 50700)
cuda::blas::GEMMAndBiasActivationEpilogue activation_to_gemm_and_blas_arg(Activation a) {
switch (a) {
case Activation::None:
@ -235,7 +236,7 @@ Tensor& addmm_out_cuda_impl(Tensor& result, const Tensor& self, const Tensor& ma
at::ScalarType scalar_type = self.scalar_type();
c10::MaybeOwned<Tensor> self_;
if (&result != &self) {
#if (defined(CUDA_VERSION) && CUDA_VERSION >= 11040 && !defined(_MSC_VER)) || defined(USE_ROCM) && ROCM_VERSION >= 50700
#if (defined(CUDA_VERSION) && (CUDA_VERSION >= 11040)) || (defined(USE_ROCM) && (ROCM_VERSION >= 50700))
// Strangely, if mat2 has only 1 row or column, we get
// CUBLAS_STATUS_INVALID_VALUE error from cublasLtMatmulAlgoGetHeuristic.
// self.dim() == 1 && result.dim() == 2 && self.sizes()[0] == mat2_sizes[1]
@ -333,8 +334,9 @@ Tensor& addmm_out_cuda_impl(Tensor& result, const Tensor& self, const Tensor& ma
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(!args.result->is_conj());
#if (!defined(USE_ROCM) && !defined(_MSC_VER)) || (defined(USE_ROCM) && ROCM_VERSION >= 50700)
#if !defined(USE_ROCM) || (defined(USE_ROCM) && (ROCM_VERSION >= 50700))
if (useLtInterface) {
#if defined(USE_ROCM)
AT_DISPATCH_FLOATING_TYPES_AND2(
at::ScalarType::Half,
at::ScalarType::BFloat16,
@ -352,28 +354,49 @@ Tensor& addmm_out_cuda_impl(Tensor& result, const Tensor& self, const Tensor& ma
args.lda,
args.matb->const_data_ptr<scalar_t>(),
args.ldb,
#if defined(USE_ROCM)
// This condition is needed for the mm case on the ROCm hipblasLt path.
// Pass the bias ptr as null to avoid accuracy issues in the mm case.
(&result != &self) ? self.const_data_ptr<scalar_t>() : nullptr,
#else
self.const_data_ptr<scalar_t>(),
#endif
args.result->data_ptr<scalar_t>(),
args.result_ld,
#if (defined(CUDA_VERSION) && CUDA_VERSION >= 11080) || defined(USE_ROCM)
activation_to_gemm_and_blas_arg(activation)
#else
// GELU is not supported (and does not compile!) prior
// to CUDA 11.4. Have observed accuracy issues with
// GELU epilogue in 11.4; disabling the GELU epilogue
// path for CUDA version < 11.8.
activation != Activation::GELU
? activation_to_gemm_and_blas_arg(activation)
: cuda::blas::GEMMAndBiasActivationEpilogue::None
#endif
);
});
#else
auto activation_epilogue = activation_to_gemm_and_blas_arg(activation);
#if (defined(CUDA_VERSION) && (CUDA_VERSION < 11080))
// GELU is not supported (and does not compile!) prior
// to CUDA 11.4. We have observed accuracy issues with the
// GELU epilogue in 11.4, so the GELU epilogue path is
// disabled for CUDA versions < 11.8.
if (activation == Activation::GELU)
activation_epilogue = cuda::blas::GEMMAndBiasActivationEpilogue::None;
#endif
AT_DISPATCH_FLOATING_TYPES_AND2(
at::ScalarType::Half,
at::ScalarType::BFloat16,
scalar_type,
"addmm_cuda_lt",
[&] {
at::cuda::blas::gemm_and_bias<scalar_t>(
args.transa == 't',
args.transb == 't',
args.m,
args.n,
args.k,
alpha.to<at::opmath_type<scalar_t>>(),
args.mata->const_data_ptr<scalar_t>(),
args.lda,
args.matb->const_data_ptr<scalar_t>(),
args.ldb,
self.const_data_ptr<scalar_t>(),
args.result->data_ptr<scalar_t>(),
args.result_ld,
activation_epilogue
);
});
#endif
} else
#endif
{
@ -747,7 +770,7 @@ Tensor& _int_mm_out_cuda(const Tensor& self, const Tensor& mat2, Tensor& result)
TORCH_CHECK(result.is_contiguous(), "Expected result to be contiguous.");
#if (!defined(USE_ROCM) && !defined(_MSC_VER) && defined(CUDA_VERSION) && CUDA_VERSION >= 11070) || (defined(USE_ROCM) && ROCM_VERSION >= 60000)
#if (!defined(USE_ROCM) && defined(CUDA_VERSION) && (CUDA_VERSION >= 11070)) || (defined(USE_ROCM) && (ROCM_VERSION >= 60000))
cublasCommonArgs args(self, mat2, result);
at::cuda::blas::int8_gemm(
@ -767,7 +790,7 @@ Tensor& _int_mm_out_cuda(const Tensor& self, const Tensor& mat2, Tensor& result)
result.copy_(*args.result);
}
#else
#if !defined(USE_ROCM) && !defined(_MSC_VER) && defined(CUDA_VERSION)
#if !defined(USE_ROCM) && defined(CUDA_VERSION)
TORCH_CHECK(false, "_int_mm_out_cuda not compiled for CUDA ", CUDA_VERSION);
#else
TORCH_CHECK(false, "_int_mm_out_cuda not compiled for this platform.");
@ -887,32 +910,112 @@ _scaled_mm_out_cuda(const Tensor& mat1, const Tensor& mat2,
at::native::resize_output(out, {mat1_sizes[0], mat2_sizes[1]});
at::native::resize_output(amax, {});
#if !defined(USE_ROCM) && !defined(_MSC_VER) || (defined(USE_ROCM) && ROCM_VERSION >= 60000)
#if !defined(USE_ROCM) || (defined(USE_ROCM) && (ROCM_VERSION >= 60000))
cublasCommonArgs args(mat1, mat2, out);
const auto out_dtype_ = args.result->scalar_type();
TORCH_CHECK(args.transa == 't' && args.transb == 'n', "Only multiplication of row-major and column-major matrices is supported by cuBLASLt");
at::cuda::blas::scaled_gemm(
args.transa,
args.transb,
args.m,
args.n,
args.k,
args.mata->data_ptr(),
scale_a ? scale_a->data_ptr() : nullptr,
args.lda,
args.mata->scalar_type(),
args.matb->data_ptr(),
scale_b ? scale_b->data_ptr() : nullptr,
args.ldb,
args.matb->scalar_type(),
bias ? bias->data_ptr(): nullptr,
bias ? bias->scalar_type() : isFloat8Type(out_dtype_) ? at::ScalarType::Half : out_dtype_,
args.result->data_ptr(),
scale_result ? scale_result->data_ptr() : nullptr,
args.result_ld,
out_dtype_,
amax.data_ptr(),
use_fast_accum);
#ifdef USE_ROCM
auto tuning_ctx = at::cuda::tunable::getTuningContext();
if (tuning_ctx->IsTunableOpEnabled()) {
#define TUNABLE_DISPATCH(BLASOP_A, BLASOP_B) \
if (mat1.scalar_type() == ScalarType::Float8_e4m3fnuz) { \
if (mat2.scalar_type() == ScalarType::Float8_e4m3fnuz) { \
static at::cuda::tunable::ScaledGemmTunableOp< \
at::Float8_e4m3fnuz, at::Float8_e4m3fnuz, scalar_t, \
BLASOP_A, BLASOP_B> scaledgemm{}; \
scaledgemm(&params); \
} \
else if (mat2.scalar_type() == ScalarType::Float8_e5m2fnuz) { \
static at::cuda::tunable::ScaledGemmTunableOp< \
at::Float8_e4m3fnuz, at::Float8_e5m2fnuz, scalar_t, \
BLASOP_A, BLASOP_B> scaledgemm{}; \
scaledgemm(&params); \
} \
} \
else if (mat1.scalar_type() == ScalarType::Float8_e5m2fnuz) { \
if (mat2.scalar_type() == ScalarType::Float8_e4m3fnuz) { \
static at::cuda::tunable::ScaledGemmTunableOp< \
at::Float8_e5m2fnuz, at::Float8_e4m3fnuz, scalar_t, \
BLASOP_A, BLASOP_B> scaledgemm{}; \
scaledgemm(&params); \
} \
else if (mat2.scalar_type() == ScalarType::Float8_e5m2fnuz) { \
static at::cuda::tunable::ScaledGemmTunableOp< \
at::Float8_e5m2fnuz, at::Float8_e5m2fnuz, scalar_t, \
BLASOP_A, BLASOP_B> scaledgemm{}; \
scaledgemm(&params); \
} \
}
AT_DISPATCH_V2(out_dtype_, "_tunable_scaled_gemm", AT_WRAP([&] {
bool transa_ = ((args.transa != 'n') && (args.transa != 'N'));
bool transb_ = ((args.transb != 'n') && (args.transb != 'N'));
at::cuda::tunable::ScaledGemmParams<scalar_t> params;
params.transa = args.transa;
params.transb = args.transb;
params.m = args.m;
params.n = args.n;
params.k = args.k;
params.a = args.mata->data_ptr();
params.a_scale_ptr = scale_a ? scale_a->data_ptr() : nullptr;
params.lda = args.lda;
params.a_dtype = args.mata->scalar_type();
params.b = args.matb->data_ptr();
params.b_scale_ptr = scale_b ? scale_b->data_ptr() : nullptr;
params.ldb = args.ldb;
params.b_dtype = args.matb->scalar_type();
params.bias_ptr = bias ? bias->data_ptr(): nullptr;
params.bias_dtype = bias ? bias->scalar_type() : isFloat8Type(out_dtype_) ? at::ScalarType::Half : out_dtype_;
params.c = args.result->data_ptr();
params.c_scale_ptr = scale_result ? scale_result->data_ptr() : nullptr;
params.ldc = args.result_ld;
params.c_dtype = out_dtype_;
params.amax_ptr = amax.data_ptr();
params.use_fast_accum = use_fast_accum;
if (transa_ && transb_) {
TUNABLE_DISPATCH(at::cuda::tunable::BlasOp::T, at::cuda::tunable::BlasOp::T)
}
else if (transa_ && !transb_) {
TUNABLE_DISPATCH(at::cuda::tunable::BlasOp::T, at::cuda::tunable::BlasOp::N)
}
else if (!transa_ && transb_) {
TUNABLE_DISPATCH(at::cuda::tunable::BlasOp::N, at::cuda::tunable::BlasOp::T)
}
else if (!transa_ && !transb_) {
TUNABLE_DISPATCH(at::cuda::tunable::BlasOp::N, at::cuda::tunable::BlasOp::N)
}
else {
TORCH_CHECK(false, "unreachable");
}
}),
kHalf, kBFloat16, kFloat8_e4m3fnuz, kFloat8_e5m2fnuz, AT_EXPAND(AT_FLOATING_TYPES));
#undef TUNABLE_DISPATCH
}
else
#endif
{
at::cuda::blas::scaled_gemm(
args.transa,
args.transb,
args.m,
args.n,
args.k,
args.mata->data_ptr(),
scale_a ? scale_a->data_ptr() : nullptr,
args.lda,
args.mata->scalar_type(),
args.matb->data_ptr(),
scale_b ? scale_b->data_ptr() : nullptr,
args.ldb,
args.matb->scalar_type(),
bias ? bias->data_ptr(): nullptr,
bias ? bias->scalar_type() : isFloat8Type(out_dtype_) ? at::ScalarType::Half : out_dtype_,
args.result->data_ptr(),
scale_result ? scale_result->data_ptr() : nullptr,
args.result_ld,
out_dtype_,
amax.data_ptr(),
use_fast_accum);
}
#else
TORCH_CHECK(false, "_scaled_mm_out_cuda is not compiled for this platform.");
#endif
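
For context, one plausible scalar reading of the scaled_gemm call above follows. The precise ordering of bias addition, amax recording, and result scaling is an assumption here rather than something stated in the patch, and fp8 rounding, transposes, and fast-accum are ignored; all names are illustrative.

#include <algorithm>
#include <cmath>

// Hedged sketch: D = scale_result * ((scale_a * A) @ (scale_b * B) + bias), with amax
// recording the largest absolute value of the unscaled result. A: m x k, B: k x n,
// bias: optional per-column vector (may be null), D: m x n, all row-major.
void scaled_gemm_sketch(int m, int n, int k,
                        const float* A, float scale_a,
                        const float* B, float scale_b,
                        const float* bias,
                        float* D, float scale_result,
                        float* amax) {
  float local_amax = 0.f;
  for (int i = 0; i < m; ++i) {
    for (int j = 0; j < n; ++j) {
      float acc = 0.f;
      for (int p = 0; p < k; ++p) {
        acc += (scale_a * A[i * k + p]) * (scale_b * B[p * n + j]);
      }
      if (bias != nullptr) {
        acc += bias[j];  // bias handling is an assumption; see the lead-in above
      }
      local_amax = std::max(local_amax, std::fabs(acc));
      D[i * n + j] = scale_result * acc;
    }
  }
  if (amax != nullptr) {
    *amax = local_amax;
  }
}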


@ -4,6 +4,7 @@
#include <ATen/cuda/CUDAContext.h>
#include <ATen/native/Resize.h>
#include <ATen/native/TensorShape.h>
#include <c10/cuda/CUDAGraphsC10Utils.h>
#include <c10/util/TypeCast.h>
#ifndef AT_PER_OPERATOR_HEADERS
@ -703,12 +704,15 @@ void split_with_sizes_copy_out_cuda(
IntArrayRef split_sizes,
int64_t dim,
TensorList out) {
const bool is_capturing = at::cuda::currentStreamCaptureStatusMayInitCtx() !=
at::cuda::CaptureStatus::None;
bool contiguous_no_cast = self.is_non_overlapping_and_dense();
for (const auto& t : out) {
contiguous_no_cast &= t.is_non_overlapping_and_dense();
contiguous_no_cast &= (t.dtype() == self.dtype());
}
if (contiguous_no_cast) {
// TODO(yifu): make the fast path work for CUDA graph
if (!is_capturing && contiguous_no_cast) {
// Perform equivalent checks performed by the composite impl
if (dim < 0) {
dim = at::maybe_wrap_dim(dim, self.dim());
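
The new is_capturing check above makes split_with_sizes_copy_out_cuda fall back to the composite path while a CUDA graph is being captured, since (per the TODO) the fast path does not yet work under capture. A minimal sketch of that guard as a standalone helper, using only the calls that appear in the hunk (the helper name is illustrative, and the sketch assumes the same headers as the surrounding file, which provide at::cuda::currentStreamCaptureStatusMayInitCtx and at::cuda::CaptureStatus):

// Returns true only when no CUDA graph capture is in progress and every output
// tensor is dense, non-overlapping, and of the same dtype as the input.
static bool can_use_fast_split_copy(const at::Tensor& self, at::TensorList out) {
  const bool is_capturing = at::cuda::currentStreamCaptureStatusMayInitCtx() !=
      at::cuda::CaptureStatus::None;
  bool contiguous_no_cast = self.is_non_overlapping_and_dense();
  for (const auto& t : out) {
    contiguous_no_cast &= t.is_non_overlapping_and_dense();
    contiguous_no_cast &= (t.dtype() == self.dtype());
  }
  return !is_capturing && contiguous_no_cast;
}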


@ -29,6 +29,30 @@ void run_cudnn_SDP_fprop(
false, "PyTorch was not compiled with cuDNN Flash Attention enabled!");
}
void run_cudnn_SDP_bprop(
int64_t b,
int64_t h,
int64_t s_q,
int64_t s_kv,
int64_t d,
float scaling_factor,
bool is_causal,
float dropout_probability,
const Tensor& q,
const Tensor& k,
const Tensor& v,
const Tensor& o,
const Tensor& dO,
const Tensor& softmaxstats,
Tensor& dQ,
Tensor& dK,
Tensor& dV,
const Tensor& dropoutseed,
const Tensor& dropoutoffset) {
TORCH_CHECK(
false, "PyTorch was not compiled with cuDNN Flash Attention enabled!");
}
} // namespace native
} // namespace at
@ -73,6 +97,22 @@ using graph_and_tensors = std::tuple<
std::shared_ptr<fe::graph::Tensor_attributes> // Stats
>;
using graph_and_tensors_backward = std::tuple<
std::shared_ptr<fe::graph::Graph>,
std::shared_ptr<fe::graph::Tensor_attributes>, // Q,
std::shared_ptr<fe::graph::Tensor_attributes>, // K,
std::shared_ptr<fe::graph::Tensor_attributes>, // V,
std::shared_ptr<fe::graph::Tensor_attributes>, // Attn_scale
std::shared_ptr<fe::graph::Tensor_attributes>, // Seed,
std::shared_ptr<fe::graph::Tensor_attributes>, // Offset,
std::shared_ptr<fe::graph::Tensor_attributes>, // O,
std::shared_ptr<fe::graph::Tensor_attributes>, // dO,
std::shared_ptr<fe::graph::Tensor_attributes>, // stats,
std::shared_ptr<fe::graph::Tensor_attributes>, // dQ,
std::shared_ptr<fe::graph::Tensor_attributes>, // dK,
std::shared_ptr<fe::graph::Tensor_attributes> // dV,
>;
#define MAX_MHA_DIM 4
struct MHAParams {
@ -178,8 +218,7 @@ struct MHACacheKeyWrapper : ParamsWrapper<MHAParams> {
template <typename T, typename KeyType>
struct MHAGraphCache {
std::unordered_map<KeyType, graph_and_tensors, ParamsWrapperHash<KeyType>>
engine_cache;
std::unordered_map<KeyType, T, ParamsWrapperHash<KeyType>> engine_cache;
// no mutexes here as caches are now thread local for v8, can also return a
// pointer to the Execution Plan if we know it will not be invalidated by
@ -202,6 +241,8 @@ struct MHAGraphCache {
// be thread safe across all engines see Limitations in
// https://docs.nvidia.com/deeplearning/cudnn/release-notes/index.html
thread_local MHAGraphCache<graph_and_tensors, MHACacheKeyWrapper> mhagraphcache;
thread_local MHAGraphCache<graph_and_tensors_backward, MHACacheKeyWrapper>
mhagraphbackwardcache;
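
The forward and backward caches above are consumed with the same find-or-build-then-memoize pattern used in run_cudnn_SDP_fprop and run_cudnn_SDP_bprop further down. A generic, hedged sketch of that pattern (all names illustrative; the real code updates the cache after executing the graph rather than immediately on build):

#include <utility>

// Cache is expected to expose find(key) -> Value* (null on miss) and update(key, value);
// Builder builds the cuDNN graph bundle for a given problem shape. Thread safety comes
// from the caches being thread_local, as noted in the comment above.
template <typename Cache, typename Key, typename Builder>
auto find_or_build(Cache& cache, const Key& key, Builder&& build) {
  if (auto* hit = cache.find(key)) {
    return *hit;                                 // reuse the graph built earlier on this thread
  }
  auto value = std::forward<Builder>(build)();   // cache miss: build graph + tensor handles once
  cache.update(key, value);                      // memoize for subsequent calls
  return value;
}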
auto build_graph_and_tensors(
int64_t b,
@ -227,10 +268,12 @@ auto build_graph_and_tensors(
dtype = fe::DataType_t::BFLOAT16;
}
auto mha_graph = std::make_shared<fe::graph::Graph>();
// We're baking in float accumulation and scale types;
// in theory the graph may support other types, but they
// have not been tested.
mha_graph->set_io_data_type(dtype)
.set_intermediate_data_type(fe::DataType_t::FLOAT)
.set_compute_data_type(fe::DataType_t::FLOAT);
auto Q = mha_graph->tensor(
fe::graph::Tensor_attributes()
.set_name("Q")
@ -254,7 +297,7 @@ auto build_graph_and_tensors(
params.v_stride.begin(), params.v_stride.end())));
auto attn_scale =
mha_graph->tensor(fe::graph::Tensor_attributes()
.set_name("attn_scale")
.set_name("Attn_scale")
.set_dim({1, 1, 1, 1})
.set_stride({1, 1, 1, 1})
.set_is_pass_by_value(true)
@ -276,7 +319,7 @@ auto build_graph_and_tensors(
.set_data_type(fe::DataType_t::INT32));
auto scaled_dot_product_flash_attention_options =
fe::graph::SDPA_attributes()
.set_name("flash_attention")
.set_name("CUDNN_SDPA")
.set_is_inference(return_softmaxstats == false)
.set_causal_mask(is_causal)
.set_attn_scale(attn_scale)
@ -287,12 +330,12 @@ auto build_graph_and_tensors(
}
auto seq_q = mha_graph->tensor(fe::graph::Tensor_attributes()
.set_name("seq_q")
.set_name("Seq_q")
.set_dim({b, 1, 1, 1})
.set_stride({1, 1, 1, 1})
.set_data_type(fe::DataType_t::INT32));
auto seq_kv = mha_graph->tensor(fe::graph::Tensor_attributes()
.set_name("seq_kv")
.set_name("Seq_kv")
.set_dim({b, 1, 1, 1})
.set_stride({1, 1, 1, 1})
.set_data_type(fe::DataType_t::INT32));
@ -324,7 +367,146 @@ auto build_graph_and_tensors(
AT_CUDNN_FRONTEND_CHECK(mha_graph->build_plans(handle));
return std::make_tuple(
mha_graph, Q, K, V, attn_scale, seed, offset, O, Stats);
std::move(mha_graph),
std::move(Q),
std::move(K),
std::move(V),
std::move(attn_scale),
std::move(seed),
std::move(offset),
std::move(O),
std::move(Stats));
}
auto build_graph_and_tensors_backward(
int64_t b,
int64_t h,
int64_t s_q,
int64_t s_kv,
int64_t d,
float scaling_factor,
bool is_causal,
float dropout_probability,
const Tensor& q,
const Tensor& k,
const Tensor& v,
const Tensor& o,
const Tensor& dO,
const Tensor& softmaxstats,
Tensor& dQ,
Tensor& dK,
Tensor& dV,
const Tensor& dropoutseed,
const Tensor& dropoutoffset,
cudnnHandle_t& handle,
MHAParams& params) {
auto dtype = fe::DataType_t::HALF;
if (q.scalar_type() == kBFloat16) {
dtype = fe::DataType_t::BFLOAT16;
}
auto mha_graph = std::make_shared<fe::graph::Graph>();
// We're baking in float accumulation and scale types;
// in theory the graph may support other types, but they
// have not been tested.
mha_graph->set_io_data_type(dtype)
.set_intermediate_data_type(fe::DataType_t::FLOAT)
.set_compute_data_type(fe::DataType_t::FLOAT);
auto Q = mha_graph->tensor(
fe::graph::Tensor_attributes()
.set_name("Q")
.set_dim(std::vector<int64_t>(q.sizes().begin(), q.sizes().end()))
.set_stride(
std::vector<int64_t>(q.strides().begin(), q.strides().end())));
auto K = mha_graph->tensor(
fe::graph::Tensor_attributes()
.set_name("K")
.set_dim(std::vector<int64_t>(k.sizes().begin(), k.sizes().end()))
.set_stride(
std::vector<int64_t>(k.strides().begin(), k.strides().end())));
auto V = mha_graph->tensor(
fe::graph::Tensor_attributes()
.set_name("V")
.set_dim(std::vector<int64_t>(v.sizes().begin(), v.sizes().end()))
.set_stride(
std::vector<int64_t>(v.strides().begin(), v.strides().end())));
auto attn_scale =
mha_graph->tensor(fe::graph::Tensor_attributes()
.set_name("Attn_scale")
.set_dim({1, 1, 1, 1})
.set_stride({1, 1, 1, 1})
.set_is_pass_by_value(true)
.set_data_type(fe::DataType_t::FLOAT));
auto Seed = mha_graph->tensor(fe::graph::Tensor_attributes()
.set_name("Seed")
.set_dim({1, 1, 1, 1})
.set_stride({1, 1, 1, 1})
.set_data_type(fe::DataType_t::INT32));
auto Offset = mha_graph->tensor(fe::graph::Tensor_attributes()
.set_name("Offset")
.set_dim({1, 1, 1, 1})
.set_stride({1, 1, 1, 1})
.set_data_type(fe::DataType_t::INT32));
auto O = mha_graph->tensor(
fe::graph::Tensor_attributes()
.set_name("O")
.set_dim(std::vector<int64_t>(o.sizes().begin(), o.sizes().end()))
.set_stride(
std::vector<int64_t>(o.strides().begin(), o.strides().end())));
auto STATS = mha_graph->tensor(
fe::graph::Tensor_attributes()
.set_name("Stats")
.set_dim(std::vector<int64_t>(
softmaxstats.sizes().begin(), softmaxstats.sizes().end()))
.set_stride(std::vector<int64_t>(
softmaxstats.strides().begin(), softmaxstats.strides().end()))
.set_data_type(fe::DataType_t::FLOAT));
auto DO = mha_graph->tensor(
fe::graph::Tensor_attributes()
.set_name("DO")
.set_dim(std::vector<int64_t>(dO.sizes().begin(), dO.sizes().end()))
.set_stride(
std::vector<int64_t>(dO.strides().begin(), dO.strides().end())));
auto sdpa_backward_options = fe::graph::SDPA_backward_attributes()
.set_name("CUDNN_SDPA_BACKWARD")
.set_causal_mask(is_causal)
.set_attn_scale(attn_scale);
if (dropout_probability != 0.0f) {
sdpa_backward_options.set_dropout(dropout_probability, Seed, Offset);
}
auto [DQ, DK, DV] =
mha_graph->sdpa_backward(Q, K, V, O, DO, STATS, sdpa_backward_options);
DQ->set_output(true)
.set_dim(std::vector<int64_t>(dQ.sizes().begin(), dQ.sizes().end()))
.set_stride(
std::vector<int64_t>(dQ.strides().begin(), dQ.strides().end()));
DK->set_output(true)
.set_dim(std::vector<int64_t>(dK.sizes().begin(), dK.sizes().end()))
.set_stride(
std::vector<int64_t>(dK.strides().begin(), dK.strides().end()));
DV->set_output(true)
.set_dim(std::vector<int64_t>(dV.sizes().begin(), dV.sizes().end()))
.set_stride(
std::vector<int64_t>(dV.strides().begin(), dV.strides().end()));
AT_CUDNN_FRONTEND_CHECK(mha_graph->validate());
AT_CUDNN_FRONTEND_CHECK(mha_graph->build_operation_graph(handle));
AT_CUDNN_FRONTEND_CHECK(
mha_graph->create_execution_plans({fe::HeurMode_t::A}));
AT_CUDNN_FRONTEND_CHECK(mha_graph->check_support(handle));
AT_CUDNN_FRONTEND_CHECK(mha_graph->build_plans(handle));
return std::make_tuple(
std::move(mha_graph),
std::move(Q),
std::move(K),
std::move(V),
std::move(attn_scale),
std::move(Seed),
std::move(Offset),
std::move(O),
std::move(DO),
std::move(STATS),
std::move(DQ),
std::move(DK),
std::move(DV));
}
void run_cudnn_SDP_fprop(
@ -407,11 +589,92 @@ void run_cudnn_SDP_fprop(
auto workspace_size = mha_graph->get_workspace_size();
auto workspace_ptr =
c10::cuda::CUDACachingAllocator::get()->allocate(workspace_size);
TORCH_INTERNAL_ASSERT(
TORCH_CHECK(
mha_graph->execute(handle, variant_pack, workspace_ptr.get()).is_good());
mhagraphcache.update(key, graph_and_tensors_values);
}
void run_cudnn_SDP_bprop(
int64_t b,
int64_t h,
int64_t s_q,
int64_t s_kv,
int64_t d,
float scaling_factor,
bool is_causal,
float dropout_probability,
const Tensor& q,
const Tensor& k,
const Tensor& v,
const Tensor& o,
const Tensor& dO,
const Tensor& softmaxstats,
Tensor& dQ,
Tensor& dK,
Tensor& dV,
const Tensor& dropoutseed,
const Tensor& dropoutoffset) {
cudnnHandle_t handle = getCudnnHandle();
auto key = MHACacheKeyWrapper(
b, h, s_q, s_kv, d, q, k, v, dropout_probability, is_causal, true);
auto graph_and_tensors_backward_ptr = mhagraphbackwardcache.find(key);
graph_and_tensors_backward graph_and_tensors_backward_values;
if (graph_and_tensors_backward_ptr) {
graph_and_tensors_backward_values = *graph_and_tensors_backward_ptr;
} else {
graph_and_tensors_backward_values = build_graph_and_tensors_backward(
b,
h,
s_q,
s_kv,
d,
scaling_factor,
is_causal,
dropout_probability,
q,
k,
v,
o,
dO,
softmaxstats,
dQ,
dK,
dV,
dropoutseed,
dropoutoffset,
handle,
key.pod);
}
auto
[mha_graph, Q, K, V, attn_scale, Seed, Offset, O, Do, Stats, Dq, Dk, Dv] =
graph_and_tensors_backward_values;
std::unordered_map<std::shared_ptr<fe::graph::Tensor_attributes>, void*>
variant_pack = {// inputs
{Q, q.data_ptr()},
{K, k.data_ptr()},
{V, v.data_ptr()},
{O, o.data_ptr()},
{Do, dO.data_ptr()},
{Stats, softmaxstats.data_ptr()},
// outputs
{Dq, dQ.data_ptr()},
{Dk, dK.data_ptr()},
{Dv, dV.data_ptr()},
// pass by value
{attn_scale, &scaling_factor}};
if (dropout_probability != 0.0f) {
variant_pack[Seed] = dropoutseed.data_ptr();
variant_pack[Offset] = dropoutoffset.data_ptr();
}
auto workspace_size = mha_graph->get_workspace_size();
auto workspace_ptr =
c10::cuda::CUDACachingAllocator::get()->allocate(workspace_size);
TORCH_CHECK(!workspace_size || workspace_ptr.get());
TORCH_CHECK(
mha_graph->execute(handle, variant_pack, workspace_ptr.get()).is_good());
mhagraphbackwardcache.update(key, graph_and_tensors_backward_values);
}
} // namespace native
} // namespace at

Some files were not shown because too many files have changed in this diff.