Compare commits

...

158 Commits

Author SHA1 Message Date
ac3dabf652 [Dynamo] Remove ignored modes from torch function mode stack guard
ghstack-source-id: c3398f28b58561ba6241279c6a7cf404aabfa8c7
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135503
2024-09-11 14:02:23 -07:00
54ab06fc07 [Dynamo] Remove ignored modes workaround
ghstack-source-id: 1ec9a6b4c31d310659b4a116abf5bfb1de393b12
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135502
2024-09-11 14:02:22 -07:00
32542724be [Dynamo] Trace enter/exit of TorchFunctionModes
ghstack-source-id: 8f0811c156177e2b54b3aea97835e1b15044080b
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135422
2024-09-11 14:02:22 -07:00
dfbb990dc4 [Dynamo] Simplify torch function mode stack guard
ghstack-source-id: 5fad7e6481132b96b594e6755b0fdb394aa9d56f
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135444
2024-09-09 16:02:11 -07:00
194d46e91c [Dynamo] Support thread local setattr
ghstack-source-id: d7ca565f27a57ba0aed030f74b90b3ce8faa59bd
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135443
2024-09-09 16:02:11 -07:00
9094fb5c7c [Dynamo] Trace torch function modes
ghstack-source-id: 188be474d3f4685d45141153c6425a0a2684715d
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133137
2024-09-09 16:02:11 -07:00
ec6b49eed9 [Dynamo] Disable metadata tf mode when tracing cond
ghstack-source-id: a8d524089ad362b08adb98b6851de5490815fd38
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134732
2024-09-07 23:50:38 -07:00
042f2f7746 [ONNX] Re-raise the exception if the dynamic shapes cannot be refined (#135418)
Improve error reporting. Otherwise, users will most of the time only see that the dynamic shapes could not be refined, without the underlying reason.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135418
Approved by: https://github.com/titaiwangms
2024-09-08 05:30:34 +00:00
fd494dd426 Change wrapped_linear_prepack and wrapped_quantized_linear_prepacked to private by adding _ as prefix (#135401)
Summary: In https://github.com/pytorch/pytorch/pull/134232, we added two new ops, wrapped_linear_prepack and wrapped_quantized_linear_prepacked. Following the review comments and offline discussion, we are changing them to private by adding `_` as a prefix.

Differential Revision: D62325142

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135401
Approved by: https://github.com/houseroad
2024-09-08 04:16:24 +00:00
8334cb2fb9 remove commented out breakpoints (#135363)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135363
Approved by: https://github.com/oulgen
2024-09-08 02:15:45 +00:00
e72ed4717e [Dynamo] Fix Huggingface PretrainedConfig get non const attr (#135413)
Fixes #135329

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135413
Approved by: https://github.com/anijain2305
2024-09-07 19:16:29 +00:00
3bebc09be9 [FlexAttention] Align the matmul tensorcore usage (#135168)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135168
Approved by: https://github.com/Chillee
2024-09-07 16:33:41 +00:00
a2db22e6bb [inductor] Catch BrokenProcessPool and print a more helpful message. (#135120)
Summary: BrokenProcessPool means a parallel-compile subprocess exited, which we never expect. It's likely due to a crash, so print a more meaningful error message and a note that it's probably easier to debug by turning off parallel compilation. Output looks like:
```
...
  File "/data/users/slarsen/pytorch/torch/_inductor/runtime/compile_tasks.py", line 45, in _reload_python_module
    exec(code, mod.__dict__, mod.__dict__)
  File "/tmp/torchinductor_slarsen/4q/c4qw7xk5lbb7whg5txnk4hwbc7z6kepak3o666tr3d64gcad5r5b.py", line 815, in <module>
    async_compile.wait(globals())
  File "/data/users/slarsen/pytorch/torch/_inductor/async_compile.py", line 265, in wait
    raise RuntimeError(
RuntimeError: A compilation subprocess exited unexpectedly. This is likely due to a crash. To facilitate debugging, you can re-run with TORCHINDUCTOR_COMPILE_THREADS=1 to cause compilation to occur in the main process.
```
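As a rough illustration of the general pattern (standard library only; the helper here is hypothetical, not the actual Inductor internals):
```python
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool

def wait_for_results(futures):
    try:
        return [f.result() for f in futures]
    except BrokenProcessPool as exc:
        # A worker process died unexpectedly (likely a crash); re-raise with a clearer message.
        raise RuntimeError(
            "A compilation subprocess exited unexpectedly. This is likely due to a crash. "
            "Re-run with TORCHINDUCTOR_COMPILE_THREADS=1 to compile in the main process."
        ) from exc

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(pow, 2, n) for n in range(8)]
        print(wait_for_results(futures))
```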

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135120
Approved by: https://github.com/Chillee
2024-09-07 16:33:37 +00:00
eac5e12548 [inductor] Move LoopBody to its own file (#135257)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135257
Approved by: https://github.com/oulgen
2024-09-07 16:29:15 +00:00
18479c5f70 [Doc] update max-autotune for CPU (#134986)
The current doc for `max-autotune` is applicable only for GPU. This PR adds the corresponding content for CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134986
Approved by: https://github.com/jgong5, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-09-07 13:42:40 +00:00
f7c0c06692 Add oneDNN BRGEMM support on CPU (#131878)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131878
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-09-07 13:22:30 +00:00
b53d97c7be [Intel GPU] Add XPU memory-related APIs (#129919)
# Motivation
According to https://github.com/pytorch/pytorch/issues/116322, we will help unify the device allocator. So we first introduce a simple XPU device allocator with only the key functionality, and expect to add memory statistics-related functionality after the unification.
However, some memory statistics-related APIs listed in https://github.com/pytorch/pytorch/issues/127929 have been requested, and unifying the device allocator will take more time. To improve the user experience, we support these memory statistics-related APIs before the unification.
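A hedged sketch of the kind of memory-statistics usage this enables (requires an XPU-enabled build; the names mirror the CUDA-style allocator APIs):
```python
import torch

# Hedged sketch; requires a PyTorch build with XPU support.
if torch.xpu.is_available():
    x = torch.randn(1024, 1024, device="xpu")
    print("allocated:", torch.xpu.memory_allocated())  # bytes currently held by tensors
    print("reserved:", torch.xpu.memory_reserved())    # bytes reserved by the caching allocator
    del x
    torch.xpu.empty_cache()
```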

# Additional Context
Fixes: #127929

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129919
Approved by: https://github.com/dvrogozh, https://github.com/abhilash1910, https://github.com/gujinghui, https://github.com/EikanWang, https://github.com/albanD
ghstack dependencies: #130923
2024-09-07 11:15:17 +00:00
6c1da66407 [Reland] Refactor caching device allocator utils (#130923)
# Motivation
Following [[RFC] Intel GPU Runtime Upstreaming for Allocator ](https://github.com/pytorch/pytorch/issues/116322), this PR aims to refactor the caching device allocator utils to improve code reuse.
This is the first PR; follow-up PRs will continue refactoring the device caching allocator.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130923
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD, https://github.com/eqy
2024-09-07 11:14:17 +00:00
d7c97e7245 [inductor][cpp][gemm] cache blocking config for dynamic shapes (#133538)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133538
Approved by: https://github.com/leslie-fang-intel
ghstack dependencies: #135277, #133447

Co-authored-by: Wu, Chunyuan <chunyuan.wu@intel.com>
2024-09-07 11:09:30 +00:00
be9f4ffe88 [inductor][cpp][gemm] enable dynamic M for k-slicing (#133447)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133447
Approved by: https://github.com/leslie-fang-intel
ghstack dependencies: #135277

Co-authored-by: Wu, Chunyuan <chunyuan.wu@intel.com>
2024-09-07 11:09:30 +00:00
692faa9bc6 [inductor][cpp][gemm] reduce memory alloc overhead by allocating local acc once per thread (#135277)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135277
Approved by: https://github.com/leslie-fang-intel

Co-authored-by: Wu, Chunyuan <chunyuan.wu@intel.com>
2024-09-07 11:09:25 +00:00
32f3af72b7 [ONNX] Support FakeTensor in ONNXProgram (#135399)
Sync with https://github.com/justinchuby/torch-onnx/compare/v0.1.20...v0.1.21 to support FakeTensors in ONNXProgram. Specifically, this PR implements the `apply_weights` method to allow users to supply a dictionary of concrete tensors to replace FakeTensors in the exported model weights.

An error is raised when users try to serialize a FakeTensor to avoid segfaults.

Also fixed a bug in `.save()` when `keep_initializers_as_inputs` is True and `include_initializers` is False.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135399
Approved by: https://github.com/titaiwangms
2024-09-07 04:48:18 +00:00
ebab5c85c4 [FlexAttention] Skip very small block size unit tests on H100 due to Triton bug (#135393)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135393
Approved by: https://github.com/BoyuanFeng
2024-09-07 04:35:22 +00:00
3d734d837b [ONNX] Handle mixed sequence inputs properly (#135378)
Previously, when an input contains a mixture of `Value` and python constants like `[SymbolicTensor('sym_size_int_3', type=Tensor(INT64), shape=[], producer=node_Shape_0, index=0), 512]`, we get errors like

```pytb
Traceback (most recent call last):
  File "/Users/justinc/Documents/GitHub/torch-onnx/src/torch_onnx/_building.py", line 367, in _call_op
    converted_named_inputs = _process_python_constants_and_sequences(
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/justinc/Documents/GitHub/torch-onnx/src/torch_onnx/_building.py", line 275, in _process_python_constants_and_sequences
    raise TypeError(
TypeError: Constant input '[SymbolicTensor('sym_size_int_3', type=Tensor(INT64), shape=[], producer=node_Shape_0, index=0), 512]' of type '<class 'list'>' is not supported
```

This PR updates Sequence handling to support this case, as well as variadic inputs and ONNX Sequence inputs.

Synced from https://github.com/justinchuby/torch-onnx/pull/187
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135378
Approved by: https://github.com/titaiwangms
2024-09-07 03:07:39 +00:00
c92227c41a [quant][pt2e] fix placeholder typo and related quantization tests (#135379)
A previous typo in "placeholder" and the related quantization tests are fixed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135379
Approved by: https://github.com/jerryzh168
2024-09-07 02:31:43 +00:00
e6a0221fc6 [Inductor] Optionally allow padding on non-GPU devices (#135280)
This is the OSS component of a larger MTIA diff.

Currently, Inductor disables padding for non-GPU devices. We need to change this behavior to enable padding on MTIA.

This PR adds a config option to enable padding on the CPU, or any other non-GPU device. In the future, we might want to enable padding on all devices by default. However, that might require supporting device-dependent padding defaults, since CPUs will likely use different settings than H100 GPUs.

Differential Revision: D61038114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135280
Approved by: https://github.com/jfix71, https://github.com/shunting314
2024-09-07 02:19:14 +00:00
a6b9d444fb [ONNX] Refactor exporter errors (#135180)
Refactor exporter errors to combine old errors and new errors for API consistency.

This PR also

1. Removes the `_C._check_onnx_proto(proto)` call in the old exporter. We don't need the ONNX checker because it is limited.
2. Removes the `OnnxExporterError` defined in the dynamo module. This class unnecessarily stores the onnx program object, making it very bulky. Instead, we revert to using the plain `OnnxExporterError` defined in the `errors` module and use it as the base class for all errors.
3. Continues to expose `OnnxExporterError` in `torch.onnx` and the rest of the errors in `torch.onnx.errors`.
4. Removes the `CheckerError` and `InvalidExportOptionsError` from `torch.onnx`. This is BC breaking but should have low impact.
5. I did not rename existing errors out of compatibility considerations, even though `ExporterError` would have been more succinct.

Fixes https://github.com/pytorch/pytorch/issues/135125
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135180
Approved by: https://github.com/titaiwangms
2024-09-07 00:50:15 +00:00
d42b0c8f22 Add release matrix for 2.5 (#135383)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135383
Approved by: https://github.com/huydhn
2024-09-07 00:49:53 +00:00
941d094dd1 [Dynamo][DTensor] Fixes SymNodeVariable() is not a constant error in Compiled DDP + TP unit test (#135315)
Before the fix, the unit test fails during forward Dynamo tracing:
```
  File "/data/users/willfeng/pytorch/test/distributed/_composable/test_replicate_with_compiler.py", line 415, in test_ddp_tp
    loss = compiled_replicate_model(data).sum()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...
torch._dynamo.exc.InternalTorchDynamoError: SymNodeVariable() is not a constant

from user code:
   File "/data/users/willfeng/pytorch/torch/distributed/tensor/parallel/_data_parallel_utils.py", line 34, in _unflatten_tensor
    result = DTensor.from_local(
```
After the fix, the compilation fails at a later step (Compiled Autograd tracing), due to needing "pre-dispatch tracing of backward graph" feature (see details at https://github.com/pytorch/pytorch/issues/127797#issuecomment-2291695474).

I believe this PR is a net improvement, because it should also fix the 1D Traceable FSDP2 failure case on internal models (https://github.com/pytorch/pytorch/issues/130978#issuecomment-2319476690), which is much harder to build a minimal unit test for.

Fixes https://github.com/pytorch/pytorch/issues/130978.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135315
Approved by: https://github.com/bdhirsh
2024-09-07 00:11:25 +00:00
b1a934741e Change test_constant_prop_preserve_metadata (#135268)
Summary: In new export_for_training, "stack_trace" does not exist in node meta anymore.

Test Plan:
```
buck run fbcode//mode/dev-nosan fbcode//caffe2/test:quantization_pt2e -- -r test_constant_prop_preserve_metadata
```

Reviewed By: angelayi

Differential Revision: D62219974

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135268
Approved by: https://github.com/angelayi
2024-09-07 00:02:35 +00:00
0c661f3e1a [Split Build] Refactor split build binary builds into their own workflows and move split build binary builds to periodic (#134624)
As we need to move the split build binary tests from trunk to periodic, this PR refactors those jobs into their own workflow.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134624
Approved by: https://github.com/malfet
2024-09-06 23:57:56 +00:00
2c7e314803 [Inductor][CPP] Fix the issue of view dtype (#135301)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/135160, it's a regression introduced by https://github.com/pytorch/pytorch/pull/134569, where the dtype of `to_dtype_bitcast` was incorrectly handled when using the scalarize implementation.

**TestPlan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_view_dtype
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135301
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-09-06 23:36:44 +00:00
ead4407f57 [inductor] Fix loop split optimization (#135303)
Fix https://github.com/pytorch/pytorch/issues/135274.

Improve the check of whether the div expr matches: add a check that `split_var` is in `original_body.iter_vars`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135303
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-09-06 23:06:25 +00:00
2f5b40c099 [aoti test] Disable FP8 funz dtypes in fp8 runtime check test (#135373)
Fixing https://github.com/pytorch/pytorch/issues/126734

The key point is that the fnuz FP8 types are for AMD only.

source: https://github.com/openxla/stablehlo/blob/main/rfcs/20230321-fp8_fnuz.md
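For reference, the four FP8 dtypes involved; the fnuz variants (no infinities, no negative zero) are the AMD-oriented formats referenced above:
```python
import torch

# Standard FP8 formats vs. the fnuz variants used on AMD.
standard_fp8 = [torch.float8_e4m3fn, torch.float8_e5m2]
fnuz_fp8 = [torch.float8_e4m3fnuz, torch.float8_e5m2fnuz]
print(standard_fp8, fnuz_fp8)
```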

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135373
Approved by: https://github.com/chenyang78
2024-09-06 23:05:47 +00:00
993b5647ab [export] fix placeholder name collision tests by removing map call (#135366)
The current test is failing because of the current unstable state of map: torch.compile and non-strict export are taking two separate routes, unlike cond and while_loop. This PR fixes the test itself. We'll fix map in follow-up PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135366
Approved by: https://github.com/angelayi
2024-09-06 22:02:50 +00:00
2ab26806f1 Require tlparse for failing tests in test_structured_trace.py (#135376)
Summary: These tests are currently failing internally. Per discussion, skip if tlparse is unavailable

Test Plan:
```
feature remove tlparse
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --run-disabled --regex test_structured_trace.py
feature install tlparse
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --run-disabled --regex test_structured_trace.py
```

Differential Revision: D62310342

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135376
Approved by: https://github.com/ezyang
2024-09-06 21:53:41 +00:00
b1612569f6 [BE] Clarify defaulting behavior in optimizer (#135384)
Fixes #135340

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135384
Approved by: https://github.com/drisspg, https://github.com/jainapurva
2024-09-06 21:52:55 +00:00
dc0e818738 [FR] Automatically infer a common filename prefix (#135158)
Save the annoyance of specifying this on the command line each time
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135158
Approved by: https://github.com/fduwjj, https://github.com/c-p-i-o
ghstack dependencies: #135157
2024-09-06 21:44:27 +00:00
06e414d7fe [FR] Make trace_dir a required argument (#135157)
Ensures users get a clean error if they forget to specify the dir, and
improves the help message.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135157
Approved by: https://github.com/c-p-i-o, https://github.com/fduwjj
2024-09-06 21:44:27 +00:00
a681260caf Revert "[ONNX] Refactor exporter errors (#135180)"
This reverts commit 5eebd9315a72422d59b6f8d8ca8e4e573e231d5c.

Reverted https://github.com/pytorch/pytorch/pull/135180 on behalf of https://github.com/clee2000 due to I think this broke test_public_bindings.py::TestPublicBindings::test_correct_module_names [GH job link](https://github.com/pytorch/pytorch/actions/runs/10743909338/job/29800779403) [HUD commit link](5eebd9315a), possibly a landrace with the PR that landed before it ([comment](https://github.com/pytorch/pytorch/pull/135180#issuecomment-2334844191))
2024-09-06 21:39:18 +00:00
95e976a63f [dynamo] recursively skip frames when Dynamo cache limit is hit (#135144)
Fixes https://github.com/pytorch/pytorch/pull/135144 and [T197117723](https://www.internalfb.com/intern/tasks/?t=197117723).

In general, adds `SkipCodeRecursiveException` to Dynamo - when raised in Dynamo, convert_frame will return a `skip_code_recursive_flag` back to C Dynamo, signaling it to skip the current frame and all recursive calls.
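A hedged sketch of how the cache (recompile) limit can be hit, which is the situation this skip path handles:
```python
# Hedged sketch: force recompiles until the Dynamo cache limit is hit.
import torch
import torch._dynamo as dynamo

dynamo.config.cache_size_limit = 2  # tiny limit so the fallback triggers quickly

@torch.compile
def f(x, n):
    return x + n  # `n` is a Python int, so each new value specializes and recompiles

x = torch.randn(4)
for n in range(8):
    f(x, n)
# Once the limit is exceeded, Dynamo falls back to eager for this frame; with this PR,
# the frame and its recursive calls are skipped instead of being re-attempted.
```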

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135144
Approved by: https://github.com/jansel, https://github.com/anijain2305
2024-09-06 21:38:53 +00:00
306ac44eaa [ez][TD] Fix request for issue body returns None (#135389)
I assumed it would be an empty string if the body is empty, but it's just None.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135389
Approved by: https://github.com/malfet
2024-09-06 21:02:01 +00:00
a7643baceb Revert expectFailureIf condition on tests with torch.compile on Windows (#134759)
Fixes #134716

This PR reverts some changes introduced in 6eae569546 (#133987)

torch.compile is not available on Windows, so the tests should be expected to fail.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134759
Approved by: https://github.com/malfet
2024-09-06 20:51:55 +00:00
a4030e37be [dynamo] reland map/zip iterator related changes (#135074)
Differential Revision: [D62211019](https://our.internmc.facebook.com/intern/diff/D62211019)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135074
Approved by: https://github.com/jansel, https://github.com/anijain2305, https://github.com/mlazos
2024-09-06 20:38:02 +00:00
22e1fb6faa [test][easy] Add debug utils for cpu select algorithm test (#135038)
Summary: Add debug utils to debug a flaky test in fbcode ci.

Some context: https://github.com/pytorch/pytorch/pull/126545

Test Plan: ci

Differential Revision: D62005445

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135038
Approved by: https://github.com/jgong5, https://github.com/XuehaiPan
2024-09-06 20:30:49 +00:00
2a4890e315 [ONNX] Clean up the missed lines from previous PRs (#135368)
Delete some lines that were missed in previous PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135368
Approved by: https://github.com/justinchuby
2024-09-06 20:27:52 +00:00
3ce433aef2 [TCPStore] use wait counters (#135283)
This replaces the existing TCPStore counters with the new shared wait counters. There are no users of the TCPStore counters, so they should be completely safe to remove.

Test plan:

Existing tests + build

There is no OSS backend for wait counters, so we can't write any tests with them currently.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135283
Approved by: https://github.com/c-p-i-o
2024-09-06 19:54:25 +00:00
7f2d20e687 Run all autograd node post hooks (#134728)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134728
Approved by: https://github.com/albanD, https://github.com/soulitzer
2024-09-06 19:44:28 +00:00
32fd29c1ea [ONNX] Properly handle Attributes in traceable functions (#135367)
Previously, the attributes were sent in as Attr objects even when the function was called as a plain Python function. This PR turns them into Python objects.

From https://github.com/justinchuby/torch-onnx/pull/186
Related https://github.com/microsoft/onnxscript/issues/1846

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135367
Approved by: https://github.com/justinchuby
2024-09-06 19:35:22 +00:00
5eebd9315a [ONNX] Refactor exporter errors (#135180)
Refactor exporter errors to combine old errors and new errors for API consistency.

This PR also

1. Removes the `_C._check_onnx_proto(proto)` call in the old exporter. We don't need the ONNX checker because it is limited.
2. Removes the `OnnxExporterError` defined in the dynamo module. This class unnecessarily stores the onnx program object, making it very bulky. Instead, we revert to using the plain `OnnxExporterError` defined in the `errors` module and use it as the base class for all errors.
3. Continues to expose `OnnxExporterError` in `torch.onnx` and the rest of the errors in `torch.onnx.errors`.
4. Removes the `CheckerError` and `InvalidExportOptionsError` from `torch.onnx`. This is BC breaking but should have low impact.
5. I did not rename existing errors out of compatibility considerations, even though `ExporterError` would have been more succinct.

Fixes https://github.com/pytorch/pytorch/issues/135125
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135180
Approved by: https://github.com/titaiwangms
2024-09-06 19:10:56 +00:00
a15aabc975 Add MaskedTensor passthrough: unfold, F.Unfold, F.Fold, stack (#125262)
Hi,
I noticed the `unfold` operator was missing on MaskedTensor.

I tested that my change works when calling unfold and backward on a `MaskedTensor`, but I didn't find the tests for the dispatch of such an operation. Where are they?
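A minimal sketch of the new passthrough (MaskedTensor is a prototype feature, so details may change):
```python
import torch
from torch.masked import masked_tensor

data = torch.arange(6.0)
mask = torch.tensor([True, False, True, True, False, True])
mt = masked_tensor(data, mask)

# unfold now passes through to the underlying data and mask instead of erroring
windows = mt.unfold(0, 2, 2)
print(windows)
```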
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125262
Approved by: https://github.com/cpuhrsch
2024-09-06 19:06:23 +00:00
b143426db3 [Inductor] Use argument names as the key for the constants dict and the signature dict (#135170)
Referencing how triton constructs these dictionaries

ca3fb5f6fa/python/triton/runtime/jit.py (L639)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135170
Approved by: https://github.com/htyu
2024-09-06 19:05:00 +00:00
13ba0a2e5c Run bypassed graph compile outside the except block to avoid chaining of exceptions (#135175)
Fixes #135172

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135175
Approved by: https://github.com/masnesral, https://github.com/ezyang
2024-09-06 19:03:57 +00:00
8520ce5f78 Fix incorrect trace of post-accumulate grad hook on tensor with zero dims (#135226)
Fix incorrect trace of post-accumulate grad hook on tensor with zero dimensions

Fixes #135207
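For context, a minimal eager example of a post-accumulate grad hook on a zero-dim tensor (the traced/compiled path of this pattern is what the fix addresses):
```python
import torch

p = torch.tensor(1.0, requires_grad=True)  # zero-dim leaf tensor

def hook(t):
    print("accumulated grad:", t.grad)

p.register_post_accumulate_grad_hook(hook)
(p * p).sum().backward()  # prints the accumulated gradient (2.0)
```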

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135226
Approved by: https://github.com/xmfan
2024-09-06 18:19:54 +00:00
196748d491 [elastic] support local_addr across all rendezvous impls (#135262)
Summary:
There was a regression introduced in https://github.com/pytorch/pytorch/pull/125743 that made `local_addr` no longer used. This fixes that by passing `local_addr` to `RendezvousStoreInfo.build` everywhere it's used.

This also fixes a number of tests, allowing them to be run in parallel, which hugely sped up the testing cycle since this change touches many different rendezvous implementations. This required a few fixes in unrelated tests.

Test Plan:
Added tests for the common rendezvous implementations that exercise `local_addr`, to prevent future regressions.

```
buck2 test @//mode/dev-nosan fbcode//caffe2/test/distributed/elastic/... fbcode//caffe2/torch/distributed/elastic/... -- --stress-runs 3
```

To vet the parallelism changes I also ran with 3 stress runs each to identify flakiness caused by parallelism.

Differential Revision: D62256407

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135262
Approved by: https://github.com/fduwjj, https://github.com/wz337
2024-09-06 17:55:43 +00:00
177e4f4218 remove _check call on item() for torch.istft (#135234)
Fixes #135014

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135234
Approved by: https://github.com/tugsbayasgalan
2024-09-06 17:31:25 +00:00
3988b3468b [aoti][easy] remove breakpoint() in wrapper.py (#134807)
Differential Revision: D61687146

Remove an unintended breakpoint in code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134807
Approved by: https://github.com/YUNQIUGUO
2024-09-06 17:25:05 +00:00
04118d8617 [export] Record the global torch version in serialization. (#135243)
Summary: In general I think it will be useful to also record the global torch version in the EP, so that we can track them in the logging in addition to the schema version.

Test Plan: CI

Reviewed By: henryoier

Differential Revision: D62252626

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135243
Approved by: https://github.com/yushangdi
2024-09-06 17:02:06 +00:00
24482e5c68 [torch][fx] Set maximum warning count during fx.Graph.lint (#135069)
Summary:
resnet152 spent about 15 minutes writing warning messages in _unlift
during `to_executorch` because they're all written to unbuffered stderr
by the `warnings` module.

These warnings are almost always about get_attr nodes referencing a
non-existent name:
```lang=py
warnings.warn(f'Node {node} target {node.target} {atom} of {seen_qualname} does '
  'not reference an nn.Module, nn.Parameter, or buffer, which is '
  'what \'get_attr\' Nodes typically target'
)
```
I'm not aware of a way to configure the warnings module to write this out
at most once, so I'm just going to disable the lint for now.

Test Plan:
Re-ran resnet152 with Executorch and the XNNPackBackend, it is much faster now

Differential Revision: D62156090

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135069
Approved by: https://github.com/yushangdi
2024-09-06 16:41:59 +00:00
c0ec599f27 Update submodule ideep to include aarch64 change (#134897)
This PR is per ARM request, which is in https://github.com/intel/ideep/issues/334.

Context for the request: the Arm team has upstreamed the dynamic quantization changes and all the PRs were merged (torch, ideep, oneDNN), but without this ideep submodule update the feature will not work. The change is isolated to the matmul operator and the quantization path alone.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134897
Approved by: https://github.com/jgong5, https://github.com/atalman, https://github.com/snadampal
2024-09-06 16:40:26 +00:00
7074de43c0 Porting to GCC 15 (#135188)
uint8_t is found in the `<cstdint>` header

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135188
Approved by: https://github.com/Skylion007
2024-09-06 16:16:53 +00:00
771dcce11d [AOTI][Tooling][6/n] Fix long dtype input tensors calling mean() in aoti_torch_print_tensor_handle (#135072)
Differential Revision: D61635232

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135072
Approved by: https://github.com/hl475, https://github.com/ColinPeppler
2024-09-06 15:59:32 +00:00
de74aafff4 error on exporting ScriptModule (#135302)
Test Plan: added test

Differential Revision: D62279179

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135302
Approved by: https://github.com/yushangdi
2024-09-06 15:12:40 +00:00
ad29a2c0dc Add Inductor config for default stride behavior (#135238)
By default, Inductor is allowed to manipulate the layout
(strides+storage offset) of input tensors to custom operators.

We want to change it so that the default is that Inductor should respect
the stride order of input tensors to custom operators.

This PR adds a config to toggle the behavior, in the next PR up we'll
change the default. We also make the following changes:
- We add a new operator Tag (flexible_layout), which means that
inductor is allowed to manipulate the layout. When we flip the default,
users can specify they want the old behavior by using this tag.

This is a reland of https://github.com/pytorch/pytorch/pull/126986,
which was previously reverted due to silent incorrectness. We've since
fixed the silent incorrectness
(https://github.com/pytorch/pytorch/pull/133639)

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135238
Approved by: https://github.com/albanD
2024-09-06 14:48:24 +00:00
3a9e33dca8 [torchelastic] Don't do signal handling when off the main thread (#135088)
Summary:
In multiprocessing, signal handling is not possible if the thread is not the main thread. This resulted in the following error:
> "ValueError('signal only works in main thread of the main interpreter')"

To address this issue, the diff checks whether the thread is the main thread and, if not, skips signal handling.
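A minimal sketch of the guard described above (standard library only; the helper name is hypothetical):
```python
import signal
import threading

def maybe_register_signal_handlers(handler):
    # Signal handlers can only be installed from the main thread of the main interpreter.
    if threading.current_thread() is threading.main_thread():
        signal.signal(signal.SIGTERM, handler)
        signal.signal(signal.SIGINT, handler)
    # Otherwise, skip signal handling entirely.
```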

Test Plan:
Before this change, MAST job failed:
https://fburl.com/mlhub/iq2m10v8

With this change, MAST job succeeded:
https://fburl.com/mlhub/q6kb8343

Differential Revision: D62166943

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135088
Approved by: https://github.com/d4l3k
2024-09-06 14:47:03 +00:00
a086882d72 [inductor][triton] mark workspace args as mutated (#134648)
SplitScan makes use of a workspace arg that needs to be zeroed before it is used - then, it is used to communicate between thread blocks during the triton kernel implementation. It is mutated during the execution of the kernel, so it should be marked as such.

Before this PR, it is not marked as mutated; AFAIK this is fine during normal execution, but during autotuning it causes problems. The workspace starts off zeroed (as expected), but during autotuning the kernel will be executed multiple times and the workspace does not get re-set between executions, resulting in incorrect data. If the data is used for indexing, then you can fail device-side asserts (and the results after the initial run (with autotuning) could be wrong). The test added in this PR repros the issue when the fix is removed.

When we mark the arg as mutated, then the arg gets cloned before autotuning, so that the arg passed to the kernel during autotuning will always be zeroed as expected.
804852c1f9/torch/_inductor/runtime/triton_heuristics.py (L685-L689)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134648
Approved by: https://github.com/peterbell10, https://github.com/jansel
2024-09-06 14:23:37 +00:00
84ae6b7d6b AOTDispatcher: limit cases when we detach() graph inputs to non-leaves (#134193)
This PR is slightly a revival / update to the discussion from https://github.com/pytorch/pytorch/pull/98960:

Part of FSDP2's tracing strategy right now is that:

(1) it is painful/difficult to handle the case where we have multiple graph input tensors that are aliased to each other and at least one of them is duplicated

(2) we already have long-standing logic to remove duplicate input tensors from the graph in dynamo. Morally, FSDP2 gives us duplicate input tensors in the backward graph for every `unsharded_param`, because we have (a) the `unsharded_param` being closed over by the backward hook to resize/allgather, and (b) the same `unsharded_param` being saved for backward by autograd (we now guarantee in the partitioner that we will always save the base tensor for backward and recompute views)

(3) However, we were still seeing cases where the `unsharded_param` showed up twice in the backward graph inputs, as distinct tensor objects (with different python ids) instead of being true duplicates that dynamo can de-dup.

It turns out that this was because we were `.detach()`ing the `unsharded_param` in AOTDispatcher before plumbing it through the compiled forward (and so autograd would save a detach'd version of the `unsharded_param`). This is precisely because of the logic from https://github.com/pytorch/pytorch/pull/98960.

However, re-reading the detailed comments, it seems unnecessary to do a detach() on a graph input that is a (leaf) `nn.Parameter`, even if it happens to get no gradients in the backward. Since it is a leaf, we don't have to worry about the autograd engine "continuing to backprop through the graph beyond the current tensor" (the leaf has no other grad_fn for autograd to backprop through).

So this PR makes us a bit less aggressive about calling detach() on inputs: we only do it when:

(1) our graph input statically will get a `None` gradient (and also has no metadata mutations, the existing state)

(2) **and** our graph input is a non-leaf tensor (so detach()ing is actually required to prevent autograd from incorrectly backpropping past the non-leaf).
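A rough sketch of the resulting condition (hypothetical names; not the actual AOTDispatcher code):
```python
def should_detach_input(inp, statically_gets_no_grad, has_metadata_mutation):
    # Detach only when the input will get no gradient, has no metadata mutations,
    # and is a non-leaf tensor (a leaf has no grad_fn for autograd to backprop through).
    return statically_gets_no_grad and not has_metadata_mutation and not inp.is_leaf
```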

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134193
Approved by: https://github.com/yf225

Co-authored-by: Will Feng <yf225@cornell.edu>
2024-09-06 14:06:48 +00:00
60a097a071 [CD] Update binary_linux_test.sh to include calling builder smoke test (#133869)
Run smoke test

Fixes #1969

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133869
Approved by: https://github.com/atalman

Co-authored-by: Andrey Talman <atalman@fb.com>
2024-09-06 13:27:24 +00:00
13bae39e22 [inductor] [cpp] improve cache blocking for is_dynamic_M (#131306)
## Performance
Models with >= 3% performance speedup are listed below:

### AMP single-thread dynamic shape (measured on CPU with AMX support)
No regressions

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| torchbench | soft_actor_critic | 3% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131306
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel
ghstack dependencies: #135275

Co-authored-by: Jiong Gong <jiong.gong@intel.com>
2024-09-06 13:21:24 +00:00
4ef6c05f65 [inductor][cpp][gemm] fix autotune runtime error from linear_binary fusion (#135275)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135275
Approved by: https://github.com/leslie-fang-intel
2024-09-06 13:21:23 +00:00
d6b9bd3e60 Also handle compiler collective when input variable doesn't exist on all ranks (#135147)
Internal xref:
https://fb.workplace.com/groups/3095840833991792/permalink/3810738595835342/

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135147
Approved by: https://github.com/jansel
2024-09-06 13:18:36 +00:00
d0591f4658 Ignore fresh unbacked when doing recursive make_fx inside HOPs (#135053)
Internal xref: https://fb.workplace.com/groups/6829516587176185/posts/7705964779531357/

This now also incorporates a test from https://github.com/pytorch/pytorch/pull/133585 (which it fixes) and the prep PR https://github.com/pytorch/pytorch/pull/134407 Including the PR desc from that:

I am trying to fix a problem reported by user in [fb.workplace.com/groups/6829516587176185/permalink/7705964779531357](https://fb.workplace.com/groups/6829516587176185/permalink/7705964779531357/) The summary of this problem is that when we do collect metadata analysis in AOTAutograd, we accumulate pending unbacked symbols which are going to be discarded at the end of the trace. However, if we do a recursive make_fx inside tracing, as occurs with torch.cond, we end up seeing that there are pending unbacked symbols that aren't associated with a binding, even though it's spurious (they've leaked into the inner make_fx call from the outer AOTAutograd analysis).

In https://github.com/pytorch/pytorch/pull/133588 I tried to just prevent adding the symbols to the pending list at all in the first place. But this itself caused some problems which were fixed in https://github.com/pytorch/pytorch/pull/124785 . The problem fixed in that PR is that when we allocate tangents that have unbacked size, something prevented them from having correct unbacked SymInts when ignore fresh unbacked SymInts was enabled. So I had patched it at the time by just not suppressing pending symbols and clearing them out some other way.

I think... I was wrong in that PR? That is to say, it was OK to avoid putting the fresh unbacked symbols in the pending list; the real problem was suppressing unbacked renamings. But there doesn't seem to be a good reason to suppress these; this PR shows that it doesn't actually fail any tests if you do these anyway. Intuitively, this makes sense, because you can't trigger renamings unless you're actually adding unbacked symbols to the pending set.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135053
Approved by: https://github.com/ydwu4
2024-09-06 13:13:15 +00:00
b5dea061c8 check compilation status before query cudnn version in conv (#135332)
This PR fixes https://github.com/pytorch/pytorch/issues/135322. The cuDNN compilation status should be checked first before querying the version; otherwise, conv may trigger a RuntimeError before any check in other non-CUDA backends.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135332
Approved by: https://github.com/EikanWang, https://github.com/atalman
2024-09-06 12:50:04 +00:00
041960a1ce [Dynamo] Automatically in-graph traceable tensor subclass ctors (#135151)
Fixes https://github.com/pytorch/pytorch/issues/114389

Previously, Dynamo would attempt to trace through the `__init__` of traceable tensor subclasses. Since their constructors are AOT dispatcher traceable by definition, Dynamo should automatically put these in the graph, as we do for any other tensors; tracing through them instead is difficult because Dynamo would need to apply mutations after tensor subclass creation in the graph.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135151
Approved by: https://github.com/bdhirsh
2024-09-06 12:23:38 +00:00
67c7924ea1 [inductor] Fix gen_transposed_tile_load_store (#135307)
A recent PR (https://github.com/pytorch/pytorch/pull/131745) brought new VLA logic into the cpp codegen, which raises a build failure on MSVC with error code `Compiler Error C2131`: https://learn.microsoft.com/en-us/cpp/error-messages/compiler-errors-1/compiler-error-c2131?view=msvc-170

reproduce UT:
```cmd
pytest test\inductor\test_torchinductor_dynamic_shapes.py -v -k test_large_block_sizes_dynamic_shapes_cpu
```

Original generated code:
```c++
alignas(16) float tmp1[static_cast<int64_t>(((-256LL)*(c10::div_floor_integer(static_cast<int64_t>(ks1), static_cast<int64_t>(16LL)))) + (16LL*ks1))];
```

Changes:
allocate a large-enough fixed-sized buffer.

New generated code:
```c++
alignas(16) float tmp1[16*16];
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135307
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-09-06 10:44:08 +00:00
217ba7b2ab [Docs] Update FileCheck doc (#135199)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135199
Approved by: https://github.com/soulitzer
2024-09-06 08:18:38 +00:00
758d515d98 [Inductor][CPP] Select tiling factor for lower precision data types (#133830)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133830
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-09-06 08:12:37 +00:00
60d98b4cfb Update torch-xpu-ops pin (ATen XPU implementation) (#135300)
Release cycle for PyTorch 2.5
1. Bugfixing: correct reduction logic in cdist kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135300
Approved by: https://github.com/EikanWang
2024-09-06 07:30:09 +00:00
590a3e9f8a [export][training ir migration] quantized_decomposed.quantize_per_tensor decomposition (#134525)
Summary:
In the graph of the TestXNNPACKQuantizer.test_dynamic_linear_with_conv test, some quantized_decomposed.quantize_per_tensor.default ops are becoming quantized_decomposed.dequantize_per_tensor.tensor ops when using the new training IR.

This is because we lift params/buffers before calling make_fx. So previously, for the graph that's passed to make_fx, `graph.L__self___linear1.weight` is a tensor; now, in the training IR, `graph.L__self___linear1.weight` is a FakeTensor. This caused the node overload to be different.

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_dynamic_linear_with_conv
```

Differential Revision: D61364547

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134525
Approved by: https://github.com/tugsbayasgalan, https://github.com/jerryzh168
2024-09-06 07:06:06 +00:00
764ee6e3f9 [FlexAttention] Specify padding_value for boundary checked loads (#134573)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134573
Approved by: https://github.com/Chillee
2024-09-06 06:47:26 +00:00
67f98a99a4 [DeviceMesh][Easy] Make RuntimeError a bit more descriptive by including the actual world_size (#135271)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135271
Approved by: https://github.com/fduwjj
2024-09-06 06:23:20 +00:00
e020a8755a [Fix][FR][ez] Remove debugging logs (#135308)
Removing the print added during the debugging process.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135308
Approved by: https://github.com/wz337
2024-09-06 06:14:33 +00:00
7ffb3b201c [inductor] Remove LoopBody.reads,writes,other (#135256)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135256
Approved by: https://github.com/oulgen
ghstack dependencies: #135070, #135076, #135082, #135084, #135079, #135235
2024-09-06 06:11:55 +00:00
f946bf88c4 [inductor] Skip retracing an existing LoopBody (#135235)
This is roughly a 7% speedup in inductor compile time for hf_Bert_large.  The time spent in `LoopBody.__init__` improves from 15% to 8% of `fx_codegen_and_compile`.

Before
![image](https://github.com/user-attachments/assets/7de0f28e-35bd-472f-b4be-b52733d2a85c)

After
![image](https://github.com/user-attachments/assets/5f0cf11a-43c5-43ae-b13c-f32383a75a7f)

Overall
![image](https://github.com/user-attachments/assets/6a369d8c-fb5e-4ad2-9504-0fc745ad6568)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135235
Approved by: https://github.com/oulgen
ghstack dependencies: #135070, #135076, #135082, #135084, #135079
2024-09-06 06:11:55 +00:00
66da3b3b2a [fx] Bypass custom __setattr__ in Node.__init__ (#135079)
Before:
![image](https://github.com/user-attachments/assets/5f0a6ae6-6049-44d0-b5f2-a549a23ad97f)

After:
![image](https://github.com/user-attachments/assets/51c9f91b-f8a0-4043-8362-65813feec823)
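An illustrative sketch of the technique (bypassing a custom `__setattr__` during initialization); not the actual `fx.Node` code:
```python
class Node:
    def __setattr__(self, name, value):
        # Imagine expensive bookkeeping that runs on every attribute write.
        super().__setattr__(name, value)

    def __init__(self, graph, name):
        # Hot path: use the default object.__setattr__ to skip the custom one.
        object.__setattr__(self, "graph", graph)
        object.__setattr__(self, "name", name)

n = Node("my_graph", "relu_1")
print(n.graph, n.name)
```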

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135079
Approved by: https://github.com/oulgen
ghstack dependencies: #135070, #135076, #135082, #135084
2024-09-06 06:11:46 +00:00
41e653456e [RDP] Fix "No module named 'libfb’" (#135244)
Summary:
D62215095 introduced an import error in arvr pipelines, as the is_fbcode() function does not work as intended.

This changes is_fbcode() to be a much stricter check.

Test Plan:
```
buck2 run arvr/mode/platform010/opt-stripped //arvr/libraries/depthlink/clients/mr_replay:pipeline_runner -c bolt.use_eva3_sim=True -- --config_file arvr/libraries/depthlink/clients/mr_replay/configs/runner_config.yaml --features DEPTH
```

Differential Revision: D62237502

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135244
Approved by: https://github.com/aorenste
2024-09-06 04:52:31 +00:00
e40a0a9359 Add randomness checking for sdpa vmap (#135176)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135176
Approved by: https://github.com/zou3519
2024-09-06 04:50:49 +00:00
c05a7adb36 [inductor][debug] fix draw_buffers (#135266)
**Before:**
![image](https://github.com/user-attachments/assets/aac756f3-1349-4647-9da3-87cf105cf647)

**After:**
<img width="791" alt="image" src="https://github.com/user-attachments/assets/d72c663c-e598-42fa-ac40-9e58956f1ec1">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135266
Approved by: https://github.com/yf225
2024-09-06 04:12:41 +00:00
5f57be7571 [Distributed] Change function call in test to non-deprecated to eliminate warning (#134938)
Migrate function calls in the test to eliminate the warning message below and reduce the chance of test failures when the deprecated methods are removed.

-  from deprecated `save_state_dict` change to `save`
-  from deprecated `load_state_dict` change to `load`

Warning message:
```bash
pytorch/test/distributed/checkpoint/test_fsdp_model_state.py:37: FutureWarning: `save_state_dict` is deprecated and will be removed in future versions.Please use `save` instead.

```
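A hedged sketch of the migration (single-process usage with a local checkpoint directory; exact writer/reader setup may differ):
```python
import torch
import torch.distributed.checkpoint as dcp

model = torch.nn.Linear(4, 4)
state_dict = {"model": model.state_dict()}

# Old (deprecated):
# dcp.save_state_dict(state_dict=state_dict, storage_writer=dcp.FileSystemWriter("ckpt"))
# dcp.load_state_dict(state_dict=state_dict, storage_reader=dcp.FileSystemReader("ckpt"))

# New:
dcp.save(state_dict, checkpoint_id="ckpt")
dcp.load(state_dict, checkpoint_id="ckpt")
```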

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134938
Approved by: https://github.com/wz337, https://github.com/fegin
2024-09-06 03:25:09 +00:00
29d72c1100 [inductor] check intel compiler minimal version (#135209)
On Windows, early versions of icx have a `-print-file-name` issue and can't preload correctly for Inductor. Add a minimal version check for the Intel compiler.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135209
Approved by: https://github.com/ezyang
2024-09-06 03:21:07 +00:00
3b1a334c0f [Inductor][CPP] Avoid mistake wgt tensor delete (#135100)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/134998. Previously, we only checked whether the `get_attr` FX node for the weight had a single user node. However, two `get_attr` nodes may share the same tensor, and the tensor should not be deleted in such cases. In this PR, we add the count of users of the tensor, along with the number of users of the nodes, to decide whether the tensor can be deleted.

**TestPlan**
```
 python test/inductor/test_cpu_select_algorithm.py -k test_linear_wgt_multi_users
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135100
Approved by: https://github.com/jgong5
2024-09-06 03:13:36 +00:00
07689a38bf [Inductor] Fix AOT weight alignment issue on CPU (#135205)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/135027. On CPU, the `consts_size` used to generate `_binary_constants_bin_start` is not padded to `ALIGN_BYTES`, while `serialized_weights` is, causing a failure in the 16K alignment check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135205
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-09-06 03:06:51 +00:00
06a7dc21c1 Remove dead expect_rational (#135105)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135105
Approved by: https://github.com/malfet
2024-09-06 02:57:27 +00:00
d9a18173fa Report qualname of exception type rather than <class 'RuntimeError'> (#135146)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135146
Approved by: https://github.com/Skylion007, https://github.com/albanD, https://github.com/yanboliang
ghstack dependencies: #135148, #135145
2024-09-06 02:56:50 +00:00
d8543e3162 Include exception type qualname when rewrapping InternalTorchDynamoError (#135145)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135145
Approved by: https://github.com/drisspg, https://github.com/anijain2305
ghstack dependencies: #135148
2024-09-06 02:56:50 +00:00
ad01fc194d Consolidate raise and rewrap raise error branches (#135148)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135148
Approved by: https://github.com/anijain2305, https://github.com/albanD, https://github.com/yanboliang, https://github.com/malfet
2024-09-06 02:56:46 +00:00
e162414963 add instrumentation of CCA stats for reserved and allocated memory size (#135231)
As titled
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135231
Approved by: https://github.com/c-p-i-o
2024-09-06 02:48:56 +00:00
9e5a797771 Improve test_public_bindings import module error reporting (#135258)
Error was hard to understand without message. Render it now. See https://github.com/pytorch/pytorch/pull/135259 for it in action.

Example failure:

```
2024-09-05T20:04:45.3022000Z FAILED [5.9524s] test_public_bindings.py::TestPublicBindings::test_modules_can_be_imported - AssertionError: String comparison failed: '' != "torch._logging.scribe failed to import w[112 chars].py)"
2024-09-05T20:04:45.3025413Z + torch._logging.scribe failed to import with error ImportError: cannot import name 'TypeAlias' from 'typing' (/opt/conda/envs/py_3.9/lib/python3.9/typing.py)
2024-09-05T20:04:45.3026990Z
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135258
Approved by: https://github.com/albanD
2024-09-06 02:40:03 +00:00
b46a1b9e2d Use Python 3.9 on all libtorch jobs (#135245)
Part of the migration py3.8->3.9

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135245
Approved by: https://github.com/izaitsevfb
2024-09-06 02:27:22 +00:00
9688014820 aarch64: extend matmul heuristic checks to all neoverse platforms (#134548)
For aarch64 Neoverse platforms there are two GEMM backends available for the matmul operator in PyTorch: (1) Arm Compute Library and (2) OpenBLAS. While Arm Compute Library provides better performance than OpenBLAS, it has kernel launch time overhead, and hence we use OpenBLAS for smaller tensor compute. The heuristic was originally implemented for neoverse_v1. This commit extends the heuristic to other Neoverse platforms.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134548
Approved by: https://github.com/malfet
2024-09-06 01:40:50 +00:00
8f6e73f068 [ONNX] Enable experimental exporter logic to dynamo_export and support refine dynamic_shapes (#134976)
(1) Enable experimental exporter logic to dynamo_export
(2) Refine dynamic shapes and retry export in export strategies
(3) Delete `torch_export_graph_extractor` and use the new export logic
(4) Disable ExportedProgram test in `test_fx_onnx_with_onnxruntime.py`, as ONNXProgram is different now.

Fixes https://github.com/pytorch/pytorch/issues/126479
Fixes #135183
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134976
Approved by: https://github.com/justinchuby
2024-09-06 01:29:56 +00:00
1e57ef08fa [AOTI] Support MKLDNN qconv ops in cpp wrapper (#134795)
Summary: Similar to https://github.com/pytorch/pytorch/pull/134475, support qconv in the ABI-compatible mode for cpp-wrapper Inductor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134795
Approved by: https://github.com/leslie-fang-intel, https://github.com/chunyuan-w, https://github.com/angelayi
ghstack dependencies: #134475, #134783
2024-09-06 01:01:53 +00:00
614b86d602 [AOTI] Support MKLDNN qlinear ops in cpp wrapper (#134783)
Summary: Similar to https://github.com/pytorch/pytorch/pull/134475, support qlinear in the ABI-compatible mode for cpp-wrapper Inductor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134783
Approved by: https://github.com/leslie-fang-intel, https://github.com/chunyuan-w, https://github.com/angelayi
ghstack dependencies: #134475
2024-09-06 01:01:53 +00:00
0b96dfb736 [AOTI] Support MKLDNN conv ops in cpp wrapper (#134475)
Summary: Partially fix https://github.com/pytorch/pytorch/issues/123040. In the ABI-compatible mode, MKLDNN fallback ops do not have C shim implementations and thus need to go through the custom ops launch path. Other MLKDNN ops will be fixed in following PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134475
Approved by: https://github.com/leslie-fang-intel, https://github.com/chunyuan-w, https://github.com/angelayi
2024-09-06 01:01:53 +00:00
62b221d5cc Add Percentages to Function Events (#135155)
Summary: Users have recently asked that the profiler add self/total CPU and device percentages to FunctionEvents so that teams can process the data procedurally. Some of this could be done mathematically via subroutines, but since we already have the information in `_build_table`, let's build it there.

Test Plan: Check that we produce the same table as before, and also check that the parameters in question have the expected values.

Differential Revision: D62210351

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135155
Approved by: https://github.com/shanw-meta, https://github.com/kit1980
2024-09-06 00:39:11 +00:00
66dd4577b1 Track base of FunctionalTensor in inference mode. (#135141)
The idea behind the tracking is the following: whenever we see a tensor, if the tensor is a root tensor (does not have any view metas), we consider it as the base of all the tensors that share its storage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135141
Approved by: https://github.com/zou3519
2024-09-06 00:10:25 +00:00
cyy
cc28634172 [Submodule] Bump pybind11 to v2.13.5 (#135202)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135202
Approved by: https://github.com/Skylion007
2024-09-06 00:09:00 +00:00
c83cdf068b [DTensor] Fix view op replicating on tensor dim when the size of the tensor dim = 1 (#135054)
We found a corner case: when a tensor dimension is 1, calling `view(1)` would result in an unexpected replication (see case 1 below). When the tensor dimension to shard is not 1, no matter whether the tensor dimension is evenly shardable across the mesh dimension, it won't cause an implicit replication behind the scenes as long as view doesn't change the size of the given tensor dimension (see cases 2 and 3).

When the tensor dimension to shard is of size 1, it is not being added to shardable_dims here:
https://github.com/pytorch/pytorch/blob/main/torch/distributed/_tensor/ops/_view_ops.py#L518

```
import torch
from torch.distributed._tensor import Shard, distribute_tensor
from torch.distributed.device_mesh import init_device_mesh

# uneven case where the size of the tensor dimension to shard is 1
p = torch.randn(1, 2)
mesh = init_device_mesh("cuda", (2,))
dtensor = distribute_tensor(p, mesh, [Shard(0)])
t = dtensor.view(1, 2)
# this would result in replication, meaning t is now replicated across all ranks.

# uneven case where the size of the tensor dimension to shard is not 1
p = torch.randn(3, 2)
mesh = init_device_mesh("cuda", (2,))
dtensor = distribute_tensor(p, mesh, [Shard(0)])
t = dtensor.view(3, 2)
# this would not result in replication, meaning t stays sharded.

# even case
p = torch.randn(2, 2)
dtensor = distribute_tensor(p, mesh, [Shard(0)])
t = dtensor.view(2, 2)
# this would not result in replication, meaning t stays sharded.
```

Differential Revision: [D62155606](https://our.internmc.facebook.com/intern/diff/D62155606)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135054
Approved by: https://github.com/tianyu-l, https://github.com/wanchaol
2024-09-06 00:03:54 +00:00
28ccfba248 [ONNX] Delete ONNXProgramSerializer (#135261)
Fixes #135182

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135261
Approved by: https://github.com/justinchuby
2024-09-05 23:52:51 +00:00
b2386bdca1 [debug] Add helper to run cProfile on a function (#135084)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135084
Approved by: https://github.com/oulgen
ghstack dependencies: #135070, #135076, #135082
2024-09-05 23:41:30 +00:00
bdfc8d9f96 [fx] Don't use generators in map_aggregate (#135082)
While the generators avoid a copy, they are slow.
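An illustrative micro-comparison of the tradeoff (not the fx code itself):
```python
import timeit

data = list(range(32))
gen = timeit.timeit(lambda: tuple(x + 1 for x in data), number=100_000)
lst = timeit.timeit(lambda: tuple([x + 1 for x in data]), number=100_000)
print(f"generator expr: {gen:.3f}s  list comprehension: {lst:.3f}s")
```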

Before:
![image](https://github.com/user-attachments/assets/70a55a9a-0595-4105-b0ab-22cf77c7409c)

After:
![image](https://github.com/user-attachments/assets/cecb9c59-ae36-47de-8b08-cab2c7cb3d57)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135082
Approved by: https://github.com/oulgen
ghstack dependencies: #135070, #135076
2024-09-05 23:41:30 +00:00
70779dded8 [fx] Compile time optimization in Node.__update_args_kwargs (#135076)
Before this we took two passes over all of the args.

Before:
![image](https://github.com/user-attachments/assets/24ce5628-03f4-4983-9f2d-5ddf0ca5816e)

After:
![image](https://github.com/user-attachments/assets/c9681aa2-32f0-4f6b-a598-fc6f90ffafb5)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135076
Approved by: https://github.com/Chillee
ghstack dependencies: #135070
2024-09-05 23:41:30 +00:00
ea231300d1 [inductor] Improve compile time regression from MemoryDep.normalize (#135070)
Possible fix for #135056

Before
![image](https://github.com/user-attachments/assets/3962cb85-e808-4fd4-991f-471ff5ef7eae)

After
![image](https://github.com/user-attachments/assets/2322d48d-6518-4518-baca-336027b5cda8)

Measured based on:
```
python benchmarks/dynamo/torchbench.py --ci --accuracy --timing --explain --inductor --device cuda --training --only hf_Bert_large --stats -n1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135070
Approved by: https://github.com/Chillee
2024-09-05 23:41:30 +00:00
8f66995459 Revert "Support rolling over a percentage of workflows (#134816)"
This reverts commit fc890b55b51098437b6149abf1026a8b2aaee389.

Reverted https://github.com/pytorch/pytorch/pull/134816 on behalf of https://github.com/malfet due to Causes lint to intermittently fail ([comment](https://github.com/pytorch/pytorch/pull/134816#issuecomment-2332902609))
2024-09-05 23:39:41 +00:00
144fde4fd2 [MPS] Add support for autocast in MPS (#99272)
Fixes https://github.com/pytorch/pytorch/issues/88415

Need to run inductor/test_cpu_select_algorithm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99272
Approved by: https://github.com/malfet

Co-authored-by: Siddharth Kotapati <skotapati@apple.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Co-authored-by: Roy Hvaara <roy@lightyear.no>
2024-09-05 23:23:17 +00:00
43f4947d44 fix fake tensor tolist implementation (#135131)
Summary:
When exporting for training with `tolist`, we do not hit `FunctionalTensor.tolist` since we do not functionalize. Unfortunately, this means we hit `FakeTensor.tolist`, which creates unbacked symints that are not backed by proxies.

Rather than trying to patch up this low-level implementation, we replace it with essentially what `FunctionalTensor.tolist` does, which is higher-level: we desugar to `item()` calls and let them take care of unbacked symints.
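
A minimal sketch of that desugaring (illustrative, not the exact code):

```python
import torch

def tolist_via_item(t: torch.Tensor):
    # Recurse over the leading dimension and bottom out in item() calls, so
    # data-dependent values go through the usual item()/unbacked-symint handling.
    if t.dim() == 0:
        return t.item()
    return [tolist_via_item(t[i]) for i in range(t.shape[0])]

print(tolist_via_item(torch.tensor([[1, 2], [3, 4]])))  # [[1, 2], [3, 4]]
```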

Test Plan:
Some expected failures are gone now.
Also found a test for `tolist` that was written when `FunctionalTensor.tolist` was implemented but not really doing much; repurposed it now to exercise more modes.

Differential Revision: D62197742

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135131
Approved by: https://github.com/ezyang
2024-09-05 23:20:31 +00:00
65e1c34061 [rfc] scuba for flight recorder (#134794)
Summary: Record flight recorder status in a scuba table.

Test Plan: Tested by timing out a job. Will post results soon.

Differential Revision: D61729221

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134794
Approved by: https://github.com/fduwjj
2024-09-05 23:18:10 +00:00
830247c355 [Intel Triton] Update Intel Triton to release/2.5.0 (#134074)
This PR relands https://github.com/pytorch/pytorch/pull/134053

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134074
Approved by: https://github.com/EikanWang
2024-09-05 22:46:31 +00:00
4262755b5a [cond] fix typo in cond codegen (#134708)
As titled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134708
Approved by: https://github.com/jansel
2024-09-05 22:38:24 +00:00
3825607144 Add torch._logging.scribe (#135224)
See https://github.com/pytorch/pytorch/pull/135138 for a usage example. Meta-only; see https://docs.google.com/document/d/1JpbAQvRhTmuxjnKKjT7qq57dsnV84nxSLpWJo1abJuE/edit#heading=h.9wi46k7np6xw for context

fbscribelogger is a library that allows us to write to scribe, which is Meta's logging infrastructure, when you have an appropriate access token (this token is available for jobs running on main, as well as authorized jobs with the ci-scribe label). The resulting data is accessible via Scuba (a real-time in-memory database) and Hive (a more traditional SQL-persisted database).

Here's the motivating use case. Suppose there is somewhere in PyTorch's codebase where you'd like to log an event, and then you'd like to find all the situations where this log is called. If PyTorch is rolled out to our internal users, we have some FB-oriented APIs (like torch._utils_internal.signpost_event) with which you can do this. But you have to actually land your PR to main, wait for it to be ingested into fbcode, and then wait for us to actually roll out this version, before you get any data. But what if you want the results within the next few hours? Instead, you can use torch._logging.scribe to directly write to our logging infrastructure *from inside CI jobs.* The most convenient approach is to log unstructured JSON blobs to `open_source_signpost` (added in this PR; you can also add your own dedicated table as described in the GDoc above). After adding logging code, you can push your PR to CI, add the 'ci-scribe' label, and in a few hours view the results in Scuba, e.g., (Meta-only) https://fburl.com/scuba/torch_open_source_signpost/z2mq8o4l If you want continuous logging for all commits on main, you can land your PR and it will continuously get logged for all CI runs that happen on main.
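
A hypothetical usage sketch of the flow described above; the exact signature of `open_source_signpost` (subsystem / name / JSON-string parameters) is an assumption here, not a documented contract:

```python
import json
from torch._logging import scribe

def log_my_event(duration_s: float) -> None:
    # Assumed signature: subsystem, event name, and an unstructured JSON blob.
    scribe.open_source_signpost(
        subsystem="my_experiment",       # hypothetical subsystem name
        name="compile_finished",         # hypothetical event name
        parameters=json.dumps({"duration_s": duration_s}),
    )
```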

Eventually, if your dataset is important enough, you can consider collaborating with PyTorch Dev Infra to get the data collected in our public AWS cloud so that OSS users can view it without Meta-internal access. But this facility is really good for prototyping / one-off experiments. It's entirely self-serve: just add your logging, run your PR CI with ci-scribe, get results, and do the analysis in Scuba.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135224
Approved by: https://github.com/Skylion007
2024-09-05 22:37:13 +00:00
eqy
3c8f71ff93 [cuDNN][64-bit indexing] cuDNN v9.3+ supports non-batch-splittable convolutions with > 2**31 elements (#134890)
For longstanding issues such as #95024

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134890
Approved by: https://github.com/Skylion007
2024-09-05 22:22:45 +00:00
fc890b55b5 Support rolling over a percentage of workflows (#134816)
In order to support adding a rollover percentage, this ended up being a complete rewrite of runner_determinator.py.

Details of the new format are in the comments up top.

On the plus side, this now includes some unit tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134816
Approved by: https://github.com/PaliC, https://github.com/zxiiro
2024-09-05 22:21:45 +00:00
058a69d91a [fbcode][dynamo] Turn on guard_nn_modules using justknobs_check (#134928)
As Title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134928
Approved by: https://github.com/ezyang
2024-09-05 22:05:54 +00:00
6c5920d515 Tune int8 AMX WoQ micro-kernel for CPU (#134832)
This patch prevents a performance regression against the default ATen implementation for the LLaMA 3.1 int8 GPTQ WoQ workload.

Uses the AMX micro-kernel only if `M` >= `block_m`
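
A minimal sketch of that dispatch condition (the helper name is hypothetical):

```python
def use_amx_microkernel(M: int, block_m: int) -> bool:
    # For M < block_m the AMX tiles would be mostly idle, so fall back to the
    # default ATen path instead.
    return M >= block_m
```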

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134832
Approved by: https://github.com/jgong5
2024-09-05 22:01:14 +00:00
116fd474da [export] Expand coverage to more copied sym ops for unflattener. (#135119)
Test Plan:
buck2 test 'fbcode//mode/opt' fbcode//torchrec/ir/tests:test_serializer -- --run-disabled

```
File changed: fbcode//caffe2/torch/export/unflatten.py
Buck UI: https://www.internalfb.com/buck2/2e0377e7-e2b6-4bd0-8133-a787245165a0
Test UI: https://www.internalfb.com/intern/testinfra/testrun/5066549824883887
Network: Up: 0B  Down: 0B
Jobs completed: 16. Time elapsed: 10.2s.
Tests finished: Pass 6. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Differential Revision: D62190172

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135119
Approved by: https://github.com/yushangdi
2024-09-05 21:58:20 +00:00
a5d70cf545 [PyTorch] Add isfinite to BFloat16-math.h (#135052)
Missing function from <cmath>.

Differential Revision: [D62148884](https://our.internmc.facebook.com/intern/diff/D62148884/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135052
Approved by: https://github.com/PaliC, https://github.com/albanD
ghstack dependencies: #135031
2024-09-05 21:50:36 +00:00
7fe819d917 [PyTorch] Fix -Wshadow -Werror build in BFloat16-inl.h (#135031)
`float_t` is required to exist in C99 math.h, which causes -Wshadow to fire. We don't need the alias, fortunately.

Differential Revision: [D62135908](https://our.internmc.facebook.com/intern/diff/D62135908/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135031
Approved by: https://github.com/albanD
2024-09-05 21:48:21 +00:00
f63571060c Revert "Use actions/upload-artifact@v4.4.0 for rest of workflows (#135264)"
This reverts commit 9c0b03020b7204ca5d5dbe18174bab005f79c47b.

Reverted https://github.com/pytorch/pytorch/pull/135264 on behalf of https://github.com/atalman due to broke CI ([comment](https://github.com/pytorch/pytorch/pull/135264#issuecomment-2332674607))
2024-09-05 21:43:05 +00:00
38fead8f7c [hop] preserve metadata in re-tracing hop subgraph by running with interpreter (#135159)
This way, interpreter.run can preserve the current metadata of the subgraphs correctly when tracing them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135159
Approved by: https://github.com/tugsbayasgalan
2024-09-05 21:36:56 +00:00
24a223c49d Run inductor micro benchmark on x86 metal runner (#135042)
This enables the inductor micro benchmark on CPU (x86):

* Running on an AWS metal runner for a more accurate benchmark
* I add a new `arch` column, which will be either x86_64 or arm64 for CPU, or the GPU name for GPU. We can use this later to differentiate between different setups, e.g. cuda (a100) vs cuda (a10g), or cpu (x86_64) vs cpu (arm64).

The next step would be to run this on cpu (arm64) and cuda (a10g).

### Testing
Here is the CSV results from my test run https://github.com/pytorch/pytorch/actions/runs/10709344180

```
name,metric,target,actual,dtype,device,arch,is_model
mlp_layer_norm_gelu,flops_utilization,0.8,17.36,bfloat16,cpu,x86_64,False
gather_gemv,memory_bandwidth(GB/s),990,170.80,int8,cpu,x86_64,False
gather_gemv,memory_bandwidth(GB/s),1060,204.78,bfloat16,cpu,x86_64,False
Mixtral-8x7B-v0.1,token_per_sec,175,26.68,int8,cpu,x86_64,True
Mixtral-8x7B-v0.1,memory_bandwidth(GB/s),1130,171.91,int8,cpu,x86_64,True
Mixtral-8x7B-v0.1,compilation_time(s),162,47.36,int8,cpu,x86_64,True
gemv,memory_bandwidth(GB/s),870,236.36,int8,cpu,x86_64,False
gemv,memory_bandwidth(GB/s),990,305.71,bfloat16,cpu,x86_64,False
Llama-2-7b-chat-hf,token_per_sec,94,14.01,bfloat16,cpu,x86_64,True
Llama-2-7b-chat-hf,memory_bandwidth(GB/s),1253,185.18,bfloat16,cpu,x86_64,True
Llama-2-7b-chat-hf,compilation_time(s),162,74.99,bfloat16,cpu,x86_64,True
Llama-2-7b-chat-hf,token_per_sec,144,25.09,int8,cpu,x86_64,True
Llama-2-7b-chat-hf,memory_bandwidth(GB/s),957,165.83,int8,cpu,x86_64,True
Llama-2-7b-chat-hf,compilation_time(s),172,70.69,int8,cpu,x86_64,True
layer_norm,memory_bandwidth(GB/s),950,172.03,bfloat16,cpu,x86_64,False
```
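Not part of the PR itself, but a hypothetical sketch of how the new `arch` column can be derived:

```python
import platform

import torch

def benchmark_arch(device: str) -> str:
    # CPU rows report the machine architecture (x86_64/arm64);
    # GPU rows report the device name (e.g. A100).
    if device == "cuda":
        return torch.cuda.get_device_name(0)
    return platform.machine()

print(benchmark_arch("cpu"))  # e.g. x86_64
```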
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135042
Approved by: https://github.com/yanboliang
2024-09-05 21:31:36 +00:00
e4920a1364 [Traceable FSDP2][Dynamo] allow tracing through auto_functionalized HOP (#135169)
If an `auto_functionalized` HOP is included in the backward graph due to activation checkpointing, we will run into a scenario where Compiled Autograd Dynamo tracing needs to trace through the `auto_functionalized` HOP. This PR adds support for it.

Test commands:
- `pytest -rA test/inductor/test_compiled_autograd.py::TestCompiledAutograd::test_trace_auto_functionalized`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135169
Approved by: https://github.com/zou3519
2024-09-05 21:22:45 +00:00
bc5ecf83d7 [training ir migration] Fix quantization tests (#135184)
Summary:
Fixed some quantization tests for the new training IR:

Fix the batch norm node pattern matcher. In the training IR, we have an `aten.batch_norm` node instead of `aten._native_batch_norm_legit` and `aten._native_batch_norm_legit_no_training`.
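
A minimal sketch (assumed helper name) of matching batch-norm nodes across both IRs:

```python
import torch

_BN_TARGETS = {
    torch.ops.aten.batch_norm.default,                            # new training IR
    torch.ops.aten._native_batch_norm_legit.default,              # old export IR
    torch.ops.aten._native_batch_norm_legit_no_training.default,  # old export IR (eval)
}

def is_batch_norm_node(node: torch.fx.Node) -> bool:
    return node.op == "call_function" and node.target in _BN_TARGETS
```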

Test Plan:
```
buck run fbcode//mode/dev-nosan fbcode//caffe2/test:quantization_pt2e
```

Reviewed By: tugsbayasgalan

Differential Revision: D62209819

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135184
Approved by: https://github.com/tugsbayasgalan
2024-09-05 21:19:28 +00:00
e55c0f59e5 Revert "[Reland] Refactor caching device allocator utils (#130923)"
This reverts commit 9809080b9ed657a8c0ea0383be7cbdce3a26e05e.

Reverted https://github.com/pytorch/pytorch/pull/130923 on behalf of https://github.com/kit1980 due to breaking internal builds - Error: Relocation overflow has occured ([comment](https://github.com/pytorch/pytorch/pull/130923#issuecomment-2332640961))
2024-09-05 21:16:14 +00:00
a4cf9653ee Revert "Remove Caffe2 code from tool scripts (#134941)"
This reverts commit c818ecd1698a28d9fadf4a81453a89914b18374a.

Reverted https://github.com/pytorch/pytorch/pull/134941 on behalf of https://github.com/kit1980 due to breaking internal builds - The path `caffe2/operators/hip/gather_op.cuh` does not exist ([comment](https://github.com/pytorch/pytorch/pull/134941#issuecomment-2332636624))
2024-09-05 21:12:54 +00:00
9c0b03020b Use actions/upload-artifact@v4.4.0 for rest of workflows (#135264)
To be consistent with https://github.com/pytorch/pytorch/pull/135263 and rest of workflows. Use v4.4.0.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135264
Approved by: https://github.com/kit1980, https://github.com/malfet
2024-09-05 21:05:06 +00:00
034717a029 [ROCm] remove triton-rocm commit pin and merge pins with triton.txt (#133438)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133438
Approved by: https://github.com/jithunnair-amd, https://github.com/malfet

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
2024-09-05 20:36:45 +00:00
9c38b00999 [export] Add ability to run eagerly on UnflattenedModule (#133996)
Summary:
Added the contextmanager, `_disable_interpreter`, which is meant to be put around a call to `unflatten`. This will generate an UnflattenedModule and sub-InterpreterModules which will not use torch.fx.Interpreter to run eagerly. We want to have this as a state of the module instead of a contextmanager around running the module because it's not clear where we are calling the unflattened module.

This seems to improve the performance: https://fb.workplace.com/groups/1075192433118967/posts/1473590629945810/?comment_id=1473621763276030

Test Plan: CI

Differential Revision: D60939034

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133996
Approved by: https://github.com/pianpwk
2024-09-05 20:28:42 +00:00
8efe547046 Use actions/upload-artifact@v4.4.0 for triton builds (#135263)
Same as: https://github.com/pytorch/pytorch/pull/135139
Fixes upload failure: https://github.com/pytorch/pytorch/actions/runs/10722567217/job/29748125015
fix regression introduced by https://github.com/pytorch/pytorch/pull/135068

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135263
Approved by: https://github.com/kit1980, https://github.com/huydhn
2024-09-05 20:03:39 +00:00
82d00acfee Allow cross-device copies for cpu scalars in refs (#135140)
This copies our eager-mode behavior where someone can do torch.add(a, b, out=c)
with a and b being CPU scalar tensors and c a CUDA tensor.
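
A small usage example of the eager-mode behavior being mirrored (requires a CUDA device):

```python
import torch

a = torch.tensor(2.0)               # CPU scalar tensor
b = torch.tensor(3.0)               # CPU scalar tensor
c = torch.empty((), device="cuda")  # CUDA output tensor
torch.add(a, b, out=c)              # cross-device write is allowed for CPU scalars
print(c)                            # tensor(5., device='cuda:0')
```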

Fixes https://github.com/pytorch/pytorch/issues/121619 by side effect (we get into a situation where we're writing a CPU scalar into a FakeTensor that is actually a meta tensor)

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135140
Approved by: https://github.com/williamwen42, https://github.com/yanboliang
2024-09-05 19:08:48 +00:00
098431a29d Update Resize.cpp with new device type (#135117)
Update Resize.cpp with new device type

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135117
Approved by: https://github.com/egienvalue
2024-09-05 18:53:13 +00:00
be660ea2d3 [PT2] Directly set meta.val in group_batch_fusion_aten (#135078)
Summary: set meta.val directly in the pass, instead of using FakeTensorProp after the pass
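
A minimal sketch of the idea for a fused add (the helper is hypothetical, not the pass itself):

```python
import torch

def set_fused_node_meta(new_node: torch.fx.Node, lhs: torch.fx.Node, rhs: torch.fx.Node) -> None:
    # Compute the fake value for the node the fusion pass just created from its
    # inputs' meta["val"], instead of re-running FakeTensorProp over the whole graph.
    fake_lhs, fake_rhs = lhs.meta["val"], rhs.meta["val"]
    with fake_lhs.fake_mode:
        new_node.meta["val"] = torch.add(fake_lhs, fake_rhs)
```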

Differential Revision: D62162640

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135078
Approved by: https://github.com/frank-wei
2024-09-05 18:17:06 +00:00
52c7c89ea4 [Inductor][CPP] Leverage full bits for BF16/FP16 vectorization (#126502)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126502
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-09-05 17:17:46 +00:00
1efd341d15 [fake_tensor] Move unrecognized_type NotImplemented before ConstProp (#135033)
We should not try to do ConstProp on unrecognized types (e.g. subclasses).
For those types, throwing NotImplemented will jump to the next torch_dispatch.
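
For reference, a minimal sketch of the deferral mechanism being relied on (not the fake-tensor code itself):

```python
import torch

class UnrecognizedTensor(torch.Tensor):
    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        # Returning NotImplemented hands the op to the next torch_dispatch
        # (subclass or mode) instead of handling it here.
        return NotImplemented
```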

Test:
```
 python test/functorch/test_aotdispatch.py -k test_aot_test_subclasses_with_tensor_factories
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135033
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
2024-09-05 17:09:41 +00:00
a096f2899d Add torch.serialization.skip_data context manager (#134504)
## Semantics

The semantics are as follows:
(1) By default `torch.serialization.skip_data(materialize_fake_tensors=False)` will make `torch.save` skip writing storages (but reserve space for them in the checkpoint).

```python
import torch
import torch.nn as nn

sd = nn.Linear(3, 5).state_dict()
with torch.serialization.skip_data():
    torch.save(sd, 'foo.pt')
print(torch.load('foo.pt', weights_only=True))
```

(2) With `torch.serialization.skip_data(materialize_fake_tensors=True)`, if a FakeTensor is passed to `torch.save`, the pickler will treat these FakeTensors as being "materialized": space will be reserved in the checkpoint for the associated storage bytes, and when loading, the type will be Tensor instead of FakeTensor.

```python
import torch
import torch.nn as nn
from torch._subclasses.fake_tensor import FakeTensorMode

with FakeTensorMode():
    m = nn.Linear(3, 5, dtype=torch.float16, device='cuda')

sd = m.state_dict()
with torch.serialization.skip_data(materialize_fake_tensors=True):
    torch.save(sd, 'bla.pt')
print(torch.load('bla.pt', weights_only=True))
# OrderedDict([('weight', tensor([[0., 0., 0.],
#        [0., 0., 0.],
#        [0., 0., 0.],
#        [0., 0., 0.],
#        [0., 0., 0.]], device='cuda:0', dtype=torch.float16)), ('bias', tensor([0., 0., 0., 0., 0.], device='cuda:0', dtype=torch.float16))])

```

## Follow Ups

- [ ] `torch.load` semantic for skip_data context manager
- [ ] Mechanism for getting offsets of storages saved via this method (for writing in a separate pass)

Differential Revision: [D62238610](https://our.internmc.facebook.com/intern/diff/D62238610)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134504
Approved by: https://github.com/albanD
2024-09-05 16:53:39 +00:00
dbeb8a1691 Render log filepaths that are not anchored in torch's directory in a reasonable way (#135165)
For example, if I do TORCH_LOGS=fbscribelogger I'll get:

```
I0904 17:59:07.567000 3672513 fbscribelogger/__init__.py:161] stop
```

instead of

```
I0904 12:46:15.332000 2930287 ../../../../../home/ezyang/local/a/pytorch-env/lib/python3.10/site-packages/fbscribelogger/__init__.py:161] stop
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135165
Approved by: https://github.com/Skylion007
2024-09-05 16:48:09 +00:00
b1f72e2984 Gradient scaler for DTensor (#132816)
Solves the request [here](https://github.com/pytorch/pytorch/issues/120003#issuecomment-2248805798).
Enables DTensor input in the gradient scaler's APIs, especially `.unscale_()`.
A related dispatch strategy is added to accept DTensor input.

To let found_inf be reduced across devices, we add an allreduce over the args at dispatch, after the dispatch strategy and kernel.
Since `aten._amp_foreach_non_finite_check_and_unscale_.default` is an in-place op, grad_scale as arg[0] will be modified in place, so redesigning the strategy or refactoring the kernel would not help.

The test files cover the following under the 1-d (dp) and 2-d (dp, tp) cases:
1. whether the non-inf values are unscaled
2. whether all DTensors on each device can find inf, even if it did not occur on their device
3. if inf is not found, whether new parameters are generated
4. if inf is found, whether the scale is updated

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132816
Approved by: https://github.com/XilunWu, https://github.com/weifengpy, https://github.com/wanchaol
2024-09-05 16:44:32 +00:00
bb3c2408f4 [inductor][test] in test_unbacked_symints, replace inductor's skipCUDAIf with common device type's skipcudaif (#133936)
Differential Revision: D61506212

Use `skipCUDAIf` from `torch.testing._internal.common_device_type` if we create the test class with `instantiate_device_type_tests`.

`instantiate_device_type_tests` makes sure the class has the attr `device_type`, which works with `skipCUDAIf` from `torch.testing._internal.common_device_type`.

Also skipping test_vertical_pointwise_reduction_fusion for the cpu test class, since the test expects cuda.

FAILED [0.0026s] test/inductor/test_unbacked_symints.py::TestUnbackedSymintsCPU::test_vertical_pointwise_reduction_fusion_cpu - AttributeError: 'TestUnbackedSymintsCPU' object has no attribute 'device'

repro:
```
CUDA_VISIBLE_DEVICES="" pytest test/inductor/test_unbacked_symints.py -k cpu -v
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133936
Approved by: https://github.com/ColinPeppler, https://github.com/desertfire
2024-09-05 16:40:14 +00:00
2c99f17a32 Implement VariableTracker.python_type() (#134215)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134215
Approved by: https://github.com/amjames, https://github.com/jansel
2024-09-05 16:35:47 +00:00
0043dcd79e Switch torch pt2e xnnpack tests to use export_for_training (#134788)
Migrate all the callsites inside the pt2e XNNPACK tests to use export_for_training.

Differential Revision: D61994553

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134788
Approved by: https://github.com/mergennachin
2024-09-05 16:11:18 +00:00
2e2fb668fa Upgrade expecttest to 0.2.1 (#135136)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135136
Approved by: https://github.com/albanD, https://github.com/atalman, https://github.com/Skylion007
2024-09-05 16:05:35 +00:00
9d24f945ba [CI] Use larger instance for building triton whl (#135201)
When running the "Build Triton Wheels" CI jobs, they failed due to a lack of resources. This PR uses a larger runner to avoid these issues.

The failure message looks like:

```
Process completed with exit code 137.
```

Related running actions:
Failed actions: https://github.com/pytorch/pytorch/actions/runs/10714445036
Success actions: https://github.com/pytorch/pytorch/actions/runs/10716710830

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135201
Approved by: https://github.com/chuanqi129, https://github.com/atalman
2024-09-05 14:36:23 +00:00
ecbd715363 [Intel GPU][Windows] Fix overriding default CMAKE_CXX_FLAGS (#135093)
The root cause is that `/EHsc` is part of the default `CMAKE_CXX_FLAGS` in CMake.
The fix is to not override the default `CMAKE_CXX_FLAGS`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135093
Approved by: https://github.com/EikanWang, https://github.com/atalman
2024-09-05 12:52:43 +00:00
58f2477a26 [Dynamo] Support builtin function frozenset (#134563)
Support builtin function frozenset in dynamo
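
An illustrative example (not from the PR's test suite) of the kind of code this enables:

```python
import torch

@torch.compile(fullgraph=True)
def f(x):
    allowed_dims = frozenset({1, 2, 3})  # the frozenset builtin is now traceable
    if x.dim() in allowed_dims:
        return x + 1
    return x - 1

print(f(torch.randn(4)))
```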

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134563
Approved by: https://github.com/anijain2305, https://github.com/EikanWang, https://github.com/jansel
2024-09-05 12:15:10 +00:00
43dcb4bb61 Revise CPU vectorization ISA support API (#135075)
Revising (mostly renaming) CPU vectorization ISA support API (non-frontend-user-facing). Also added AVX512_BF16 ISA detection API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135075
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/ezyang
2024-09-05 12:14:56 +00:00
50d1e37079 [AOTI] Fix a unbacked symint retrieve bug (#134670)
Summary: Fix https://github.com/pytorch/pytorch/issues/134081. When an unbacked symint is computed as the shape of a tensor from a tuple, the generated C++ code needs to use std::get<> to extract the tensor.

Differential Revision: [D62142113](https://our.internmc.facebook.com/intern/diff/D62142113)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134670
Approved by: https://github.com/angelayi, https://github.com/22quinn, https://github.com/chenyang78
2024-09-05 11:34:14 +00:00
b99ef1a02e Update torch-xpu-ops pin (ATen XPU implementation) (#135185)
Release cycle for PyTorch 2.5
1. Update specific AOT targets for Windows. On Windows, the AOT target list prefers Intel client GPUs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135185
Approved by: https://github.com/EikanWang
2024-09-05 10:05:23 +00:00
8a5c8e5db9 Update unbacked symints in masked_select more precisely (#134899)
## Summary
At the moment, the fake impl for `masked_select` simply sets the upper bound of its size-like SymInt's range to `sys.maxsize` (9223372036854775807, the max value for a signed int64) if there are any SymInts in the original input tensor shape. This PR constrains the range more intelligently by using the upper ranges of each SymInt in the input tensor shape.

This solves an issue where a model being lowered to Executorch errors during memory planning because the memory allocated for `masked_select` ended up exceeding the 64-bit address space (`INT_MAX * size(dtype)`).
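
A minimal sketch of the tightened bound (assumed helper, numbers illustrative):

```python
import math

def masked_select_upper_bound(input_dim_upper_bounds):
    # The number of selected elements can never exceed the number of elements in
    # the input, so the size-like unbacked SymInt's upper range is the product of
    # the per-dimension upper bounds rather than sys.maxsize.
    return math.prod(input_dim_upper_bounds)

print(masked_select_upper_bound([4096, 128]))  # 524288
```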

## Test plan
- Passes existing unit tests (tests the case where the upper bound is inf)
- Added unit test to verify upper bound reduction calculation
- Tested end-to-end by exporting with TORCH_LOGS="export" and ensuring that the range for `masked_select`'s SymInt size has the correct upper bound
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134899
Approved by: https://github.com/ezyang
2024-09-05 09:01:06 +00:00
c7328dff7f Enhance the stability of the complex divide code (#134647)
In C++, when a floating-point literal (e.g., 3.14) is compared with a variable of type float, the literal is by default interpreted as a double.
```c++
float f = 3.14f;
if (f == 3.14) {
    // Do something
}
```
If a device does not support double, an error will occur.
This PR addresses the issue of complex64 errors on machines that do not support double operations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134647
Approved by: https://github.com/EikanWang, https://github.com/albanD
2024-09-05 08:36:37 +00:00
316 changed files with 9703 additions and 4135 deletions

View File

@ -108,10 +108,10 @@ ENV CMAKE_C_COMPILER cc
ENV CMAKE_CXX_COMPILER c++
COPY ./common/install_triton.sh install_triton.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/triton-rocm.txt triton-rocm.txt
COPY ci_commit_pins/triton.txt triton.txt
COPY triton_version.txt triton_version.txt
RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi
RUN rm install_triton.sh common_utils.sh triton-rocm.txt triton_version.txt
RUN rm install_triton.sh common_utils.sh triton.txt triton_version.txt
# Install AOTriton (Early fail)
COPY ./aotriton_version.txt aotriton_version.txt

View File

@ -1 +0,0 @@
21eae954efa5bf584da70324b640288c3ee7aede

View File

@ -1 +1 @@
1b2f15840e0d70eec50d84c7a0575cb835524def
cc981feba10a3f4c2e46f3fe368e8fcf5f5643df

View File

@ -1 +1 @@
dedb7bdf339a3546896d4820366ca562c586bfa0
757b6a61e7df814ba806f498f8bb3160f84b120c

View File

@ -12,10 +12,7 @@ conda_reinstall() {
as_jenkins conda install -q -n py_$ANACONDA_PYTHON_VERSION -y --force-reinstall $*
}
if [ -n "${ROCM_VERSION}" ]; then
TRITON_REPO="https://github.com/openai/triton"
TRITON_TEXT_FILE="triton-rocm"
elif [ -n "${XPU_VERSION}" ]; then
if [ -n "${XPU_VERSION}" ]; then
TRITON_REPO="https://github.com/intel/intel-xpu-backend-for-triton"
TRITON_TEXT_FILE="triton-xpu"
else

View File

@ -30,9 +30,14 @@ dill==0.3.7
#Pinned versions: 0.3.7
#test that import: dynamo/test_replay_record.py test_dataloader.py test_datapipe.py test_serialization.py
expecttest==0.1.6
expecttest==0.2.1
#Description: method for writing tests where test framework auto populates
# the expected output based on previous runs
#Pinned versions: 0.2.1
#test that import:
fbscribelogger==0.1.6
#Description: write to scribe from authenticated jobs on CI
#Pinned versions: 0.1.6
#test that import:

View File

@ -100,10 +100,10 @@ ARG TRITON
# try to reach out to S3, which docker build runners don't have access
COPY ./common/install_triton.sh install_triton.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/triton-rocm.txt triton-rocm.txt
COPY ci_commit_pins/triton.txt triton.txt
COPY triton_version.txt triton_version.txt
RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi
RUN rm install_triton.sh common_utils.sh triton-rocm.txt triton_version.txt
RUN rm install_triton.sh common_utils.sh triton.txt triton_version.txt
# Install AOTriton
COPY ./aotriton_version.txt aotriton_version.txt

View File

@ -596,6 +596,9 @@ test_single_dynamo_benchmark() {
test_inductor_micro_benchmark() {
TEST_REPORTS_DIR=$(pwd)/test/test-reports
if [[ "${TEST_CONFIG}" == *cpu* ]]; then
test_inductor_set_cpu_affinity
fi
python benchmarks/gpt_fast/benchmark.py --output "${TEST_REPORTS_DIR}/gpt_fast_benchmark.csv"
}

View File

@ -119,6 +119,11 @@ fi
# Test the package
/builder/check_binary.sh
if [[ "\$GPU_ARCH_TYPE" != *s390x* && "\$GPU_ARCH_TYPE" != *xpu* && "\$GPU_ARCH_TYPE" != *rocm* && "$PACKAGE_TYPE" != libtorch ]]; then
# Exclude s390, xpu, rocm and libtorch builds from smoke testing
python /builder/test/smoke_test/smoke_test.py --package=torchonly --torch-compile-check disabled
fi
# Clean temp files
cd /builder && git clean -ffdx

View File

@ -90,7 +90,7 @@ fi
if [[ "$PACKAGE_TYPE" =~ .*wheel.* && -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*rocm.* && $(uname) == "Linux" ]]; then
TRITON_REQUIREMENT="pytorch-triton-rocm==${TRITON_VERSION}; ${TRITON_CONSTRAINT}"
if [[ -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*dev.* ]]; then
TRITON_SHORTHASH=$(cut -c1-10 $PYTORCH_ROOT/.ci/docker/ci_commit_pins/triton-rocm.txt)
TRITON_SHORTHASH=$(cut -c1-10 $PYTORCH_ROOT/.ci/docker/ci_commit_pins/triton.txt)
TRITON_REQUIREMENT="pytorch-triton-rocm==${TRITON_VERSION}+${TRITON_SHORTHASH}; ${TRITON_CONSTRAINT}"
fi
if [[ -z "${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:-}" ]]; then

View File

@ -9,6 +9,7 @@ ciflow_push_tags:
- ciflow/inductor-rocm
- ciflow/inductor-perf-compare
- ciflow/inductor-micro-benchmark
- ciflow/inductor-micro-benchmark-cpu-x86
- ciflow/inductor-cu124
- ciflow/linux-aarch64
- ciflow/mps

View File

@ -1,6 +1,7 @@
boto3==1.19.12
hypothesis==6.56.4
expecttest==0.1.6
expecttest==0.2.1
fbscribelogger==0.1.6
librosa>=0.6.2
mpmath==1.3.0
networkx==2.8.7

View File

@ -15,9 +15,7 @@ REPO_DIR = SCRIPT_DIR.parent.parent
def read_triton_pin(device: str = "cuda") -> str:
triton_file = "triton.txt"
if device == "rocm":
triton_file = "triton-rocm.txt"
elif device == "xpu":
if device == "xpu":
triton_file = "triton-xpu.txt"
with open(REPO_DIR / ".ci" / "docker" / "ci_commit_pins" / triton_file) as f:
return f.read().strip()

View File

@ -325,6 +325,7 @@ def generate_wheels_matrix(
os: str,
arches: Optional[List[str]] = None,
python_versions: Optional[List[str]] = None,
use_split_build: bool = False,
) -> List[Dict[str, str]]:
package_type = "wheel"
if os == "linux" or os == "linux-aarch64" or os == "linux-s390x":
@ -371,7 +372,17 @@ def generate_wheels_matrix(
) and python_version == "3.13":
continue
if use_split_build and (
arch_version not in ["12.4", "12.1", "11.8", "cpu"] or os != "linux"
):
raise RuntimeError(
"Split build is only supported on linux with cuda 12.4, 12.1, 11.8, and cpu.\n"
f"Currently attempting to build on arch version {arch_version} and os {os}.\n"
"Please modify the matrix generation to exclude this combination."
)
# 12.1 linux wheels require PYTORCH_EXTRA_INSTALL_REQUIREMENTS to install
if (
arch_version in ["12.4", "12.1", "11.8"]
and os == "linux"
@ -385,6 +396,7 @@ def generate_wheels_matrix(
"desired_cuda": translate_desired_cuda(
gpu_arch_type, gpu_arch_version
),
"use_split_build": "True" if use_split_build else "False",
"devtoolset": (
"cxx11-abi" if arch_version == "cuda-aarch64" else ""
),
@ -400,7 +412,8 @@ def generate_wheels_matrix(
),
}
)
if arch_version != "cuda-aarch64":
# Special build building to use on Colab. PyThon 3.10 for 12.1 CUDA
if python_version == "3.10" and arch_version == "12.1":
ret.append(
{
"python_version": python_version,
@ -409,40 +422,16 @@ def generate_wheels_matrix(
"desired_cuda": translate_desired_cuda(
gpu_arch_type, gpu_arch_version
),
"use_split_build": "True",
"use_split_build": "True" if use_split_build else "False",
"devtoolset": "",
"container_image": WHEEL_CONTAINER_IMAGES[arch_version],
"package_type": package_type,
"pytorch_extra_install_requirements": (
PYTORCH_EXTRA_INSTALL_REQUIREMENTS[arch_version] # fmt: skip
if os != "linux-aarch64"
else ""
),
"build_name": f"{package_type}-py{python_version}-{gpu_arch_type}{gpu_arch_version}-split".replace( # noqa: B950
"pytorch_extra_install_requirements": "",
"build_name": f"{package_type}-py{python_version}-{gpu_arch_type}{gpu_arch_version}-full".replace( # noqa: B950
".", "_"
),
}
)
# Special build building to use on Colab. PyThon 3.10 for 12.1 CUDA
if python_version == "3.10" and arch_version == "12.1":
ret.append(
{
"python_version": python_version,
"gpu_arch_type": gpu_arch_type,
"gpu_arch_version": gpu_arch_version,
"desired_cuda": translate_desired_cuda(
gpu_arch_type, gpu_arch_version
),
"use_split_build": "False",
"devtoolset": "",
"container_image": WHEEL_CONTAINER_IMAGES[arch_version],
"package_type": package_type,
"pytorch_extra_install_requirements": "",
"build_name": f"{package_type}-py{python_version}-{gpu_arch_type}{gpu_arch_version}-full".replace( # noqa: B950
".", "_"
),
}
)
else:
ret.append(
{
@ -452,6 +441,7 @@ def generate_wheels_matrix(
"desired_cuda": translate_desired_cuda(
gpu_arch_type, gpu_arch_version
),
"use_split_build": "True" if use_split_build else "False",
"devtoolset": (
"cxx11-abi" if arch_version == "cpu-cxx11-abi" else ""
),
@ -467,6 +457,7 @@ def generate_wheels_matrix(
),
}
)
return ret

View File

@ -61,6 +61,7 @@ class BinaryBuildWorkflow:
# Mainly for macos
cross_compile_arm64: bool = False
macos_runner: str = "macos-14-xlarge"
use_split_build: bool = False
def __post_init__(self) -> None:
if self.abi_version:
@ -75,6 +76,11 @@ class BinaryBuildWorkflow:
GITHUB_DIR
/ f"workflows/generated-{self.build_environment}-{self.branches}.yml"
)
if self.use_split_build:
output_file_path = (
GITHUB_DIR
/ f"workflows/generated-{self.build_environment}-{self.branches}-split.yml"
)
with open(output_file_path, "w") as output_file:
GENERATED = "generated" # Note that please keep the variable GENERATED otherwise phabricator will hide the whole file
output_file.writelines([f"# @{GENERATED} DO NOT EDIT MANUALLY\n"])
@ -110,6 +116,20 @@ LINUX_BINARY_BUILD_WORFKLOWS = [
isolated_workflow=True,
),
),
BinaryBuildWorkflow(
os=OperatingSystem.LINUX,
package_type="manywheel",
build_configs=generate_binary_build_matrix.generate_wheels_matrix(
OperatingSystem.LINUX,
use_split_build=True,
arches=["11.8", "12.1", "12.4", "cpu"],
),
ciflow_config=CIFlowConfig(
labels={LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_WHEEL},
isolated_workflow=True,
),
use_split_build=True,
),
BinaryBuildWorkflow(
os=OperatingSystem.LINUX,
package_type="conda",
@ -162,6 +182,21 @@ LINUX_BINARY_SMOKE_WORKFLOWS = [
),
branches="main",
),
BinaryBuildWorkflow(
os=OperatingSystem.LINUX,
package_type="manywheel",
build_configs=generate_binary_build_matrix.generate_wheels_matrix(
OperatingSystem.LINUX,
arches=["11.8", "12.1", "12.4"],
python_versions=["3.9"],
use_split_build=True,
),
ciflow_config=CIFlowConfig(
labels={LABEL_CIFLOW_PERIODIC},
),
branches="main",
use_split_build=True,
),
BinaryBuildWorkflow(
os=OperatingSystem.LINUX,
package_type="libtorch",

View File

@ -45,7 +45,7 @@
{%- if is_windows %}
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.8"
DESIRED_PYTHON: "3.9"
{%- endif %}
{%- else %}

View File

@ -13,7 +13,6 @@ on:
- .github/scripts/build_triton_wheel.py
- .github/ci_commit_pins/triton.txt
- .ci/docker/ci_commit_pins/triton.txt
- .ci/docker/ci_commit_pins/triton-rocm.txt
- .ci/docker/ci_commit_pins/triton-xpu.txt
pull_request:
paths:
@ -21,7 +20,6 @@ on:
- .github/scripts/build_triton_wheel.py
- .github/ci_commit_pins/triton.txt
- .ci/docker/ci_commit_pins/triton.txt
- .ci/docker/ci_commit_pins/triton-rocm.txt
- .ci/docker/ci_commit_pins/triton-xpu.txt
concurrency:
@ -31,7 +29,7 @@ concurrency:
jobs:
build-wheel:
name: "Build Triton Wheel"
runs-on: [self-hosted, linux.2xlarge]
runs-on: [self-hosted, linux.4xlarge]
strategy:
fail-fast: false
matrix:
@ -120,7 +118,7 @@ jobs:
fi
docker exec -t "${container_name}" chown -R 1000.1000 /artifacts
- uses: actions/upload-artifact@v3
- uses: actions/upload-artifact@v4.4.0
with:
name: pytorch-triton-wheel-${{ matrix.py_vers }}-${{ matrix.device }}
if-no-files-found: error
@ -253,7 +251,7 @@ jobs:
docker exec -t "${container_name}" python /pytorch/.github/scripts/build_triton_wheel.py --build-conda --py-version="${PY_VERS}" $RELEASE
docker exec -t "${container_name}" chown -R 1000.1000 /artifacts
- uses: actions/upload-artifact@v3
- uses: actions/upload-artifact@v4.4.0
with:
name: pytorch-triton-conda-${{ matrix.py_vers }}
if-no-files-found: error

View File

@ -58,6 +58,7 @@ jobs:
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cpu-aarch64-main
use_split_build: False
DESIRED_PYTHON: "3.9"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
@ -81,6 +82,7 @@ jobs:
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cpu-aarch64-main
use_split_build: False
DESIRED_PYTHON: "3.9"
build_name: manywheel-py3_9-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel
@ -103,6 +105,7 @@ jobs:
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cpu-aarch64-main
use_split_build: False
DESIRED_PYTHON: "3.9"
build_name: manywheel-py3_9-cpu-aarch64
secrets:
@ -125,6 +128,7 @@ jobs:
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.4-main
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.9"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
@ -149,6 +153,7 @@ jobs:
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.4-main
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.9"
build_name: manywheel-py3_9-cuda-aarch64
secrets:
@ -170,6 +175,7 @@ jobs:
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cpu-aarch64-main
use_split_build: False
DESIRED_PYTHON: "3.10"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
@ -193,6 +199,7 @@ jobs:
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cpu-aarch64-main
use_split_build: False
DESIRED_PYTHON: "3.10"
build_name: manywheel-py3_10-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel
@ -215,6 +222,7 @@ jobs:
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cpu-aarch64-main
use_split_build: False
DESIRED_PYTHON: "3.10"
build_name: manywheel-py3_10-cpu-aarch64
secrets:
@ -237,6 +245,7 @@ jobs:
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.4-main
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.10"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
@ -261,6 +270,7 @@ jobs:
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.4-main
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.10"
build_name: manywheel-py3_10-cuda-aarch64
secrets:
@ -282,6 +292,7 @@ jobs:
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cpu-aarch64-main
use_split_build: False
DESIRED_PYTHON: "3.11"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
@ -305,6 +316,7 @@ jobs:
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cpu-aarch64-main
use_split_build: False
DESIRED_PYTHON: "3.11"
build_name: manywheel-py3_11-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel
@ -327,6 +339,7 @@ jobs:
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cpu-aarch64-main
use_split_build: False
DESIRED_PYTHON: "3.11"
build_name: manywheel-py3_11-cpu-aarch64
secrets:
@ -349,6 +362,7 @@ jobs:
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.4-main
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.11"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
@ -373,6 +387,7 @@ jobs:
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.4-main
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.11"
build_name: manywheel-py3_11-cuda-aarch64
secrets:
@ -394,6 +409,7 @@ jobs:
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cpu-aarch64-main
use_split_build: False
DESIRED_PYTHON: "3.12"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
@ -417,6 +433,7 @@ jobs:
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cpu-aarch64-main
use_split_build: False
DESIRED_PYTHON: "3.12"
build_name: manywheel-py3_12-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel
@ -439,6 +456,7 @@ jobs:
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cpu-aarch64-main
use_split_build: False
DESIRED_PYTHON: "3.12"
build_name: manywheel-py3_12-cpu-aarch64
secrets:
@ -461,6 +479,7 @@ jobs:
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.4-main
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.12"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
@ -485,6 +504,7 @@ jobs:
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.4-main
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.12"
build_name: manywheel-py3_12-cuda-aarch64
secrets:

View File

@ -0,0 +1,182 @@
# @generated DO NOT EDIT MANUALLY
# Template is at: .github/templates/linux_binary_build_workflow.yml.j2
# Generation script: .github/scripts/generate_ci_workflows.py
name: linux-binary-manywheel
on:
push:
branches:
- main
tags:
- 'ciflow/periodic/*'
workflow_dispatch:
env:
# Needed for conda builds
ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine"
ANACONDA_USER: pytorch
AWS_DEFAULT_REGION: us-east-1
BINARY_ENV_FILE: /tmp/env
BUILD_ENVIRONMENT: linux-binary-manywheel
BUILDER_ROOT: /builder
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
PR_NUMBER: ${{ github.event.pull_request.number }}
PYTORCH_FINAL_PACKAGE_DIR: /artifacts
PYTORCH_ROOT: /pytorch
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
SKIP_ALL_TESTS: 0
concurrency:
group: linux-binary-manywheel-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true
jobs:
get-label-type:
name: get-label-type
uses: ./.github/workflows/_runner-determinator.yml
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
manywheel-py3_9-cuda11_8-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu118
GPU_ARCH_VERSION: 11.8
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.8-main
use_split_build: True
DESIRED_PYTHON: "3.9"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_9-cuda11_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu11==11.8.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu11==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu11==11.11.3.6; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu11==10.9.0.58; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu11==11.4.1.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu11==11.7.5.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu11==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu11==11.8.86; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda11_8-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_9-cuda11_8-build
- get-label-type
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu118
GPU_ARCH_VERSION: 11.8
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.8-main
use_split_build: True
DESIRED_PYTHON: "3.9"
build_name: manywheel-py3_9-cuda11_8
build_environment: linux-binary-manywheel
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda12_1-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu121
GPU_ARCH_VERSION: 12.1
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.1-main
use_split_build: True
DESIRED_PYTHON: "3.9"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_9-cuda12_1
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda12_1-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_9-cuda12_1-build
- get-label-type
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu121
GPU_ARCH_VERSION: 12.1
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.1-main
use_split_build: True
DESIRED_PYTHON: "3.9"
build_name: manywheel-py3_9-cuda12_1
build_environment: linux-binary-manywheel
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda12_4-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.4-main
use_split_build: True
DESIRED_PYTHON: "3.9"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_9-cuda12_4
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda12_4-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_9-cuda12_4-build
- get-label-type
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.4-main
use_split_build: True
DESIRED_PYTHON: "3.9"
build_name: manywheel-py3_9-cuda12_4
build_environment: linux-binary-manywheel
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}

View File

@ -54,6 +54,7 @@ jobs:
GPU_ARCH_VERSION: 11.8
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.8-main
use_split_build: False
DESIRED_PYTHON: "3.9"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_9-cuda11_8
@ -77,6 +78,7 @@ jobs:
GPU_ARCH_VERSION: 11.8
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.8-main
use_split_build: False
DESIRED_PYTHON: "3.9"
build_name: manywheel-py3_9-cuda11_8
build_environment: linux-binary-manywheel
@ -85,53 +87,6 @@ jobs:
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda11_8-split-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu118
GPU_ARCH_VERSION: 11.8
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.8-main
use_split_build: True
DESIRED_PYTHON: "3.9"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_9-cuda11_8-split
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu11==11.8.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu11==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu11==11.11.3.6; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu11==10.9.0.58; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu11==11.4.1.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu11==11.7.5.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu11==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu11==11.8.86; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda11_8-split-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_9-cuda11_8-split-build
- get-label-type
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu118
GPU_ARCH_VERSION: 11.8
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.8-main
use_split_build: True
DESIRED_PYTHON: "3.9"
build_name: manywheel-py3_9-cuda11_8-split
build_environment: linux-binary-manywheel
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda12_1-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@ -146,6 +101,7 @@ jobs:
GPU_ARCH_VERSION: 12.1
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.1-main
use_split_build: False
DESIRED_PYTHON: "3.9"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_9-cuda12_1
@ -169,6 +125,7 @@ jobs:
GPU_ARCH_VERSION: 12.1
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.1-main
use_split_build: False
DESIRED_PYTHON: "3.9"
build_name: manywheel-py3_9-cuda12_1
build_environment: linux-binary-manywheel
@ -177,53 +134,6 @@ jobs:
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda12_1-split-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu121
GPU_ARCH_VERSION: 12.1
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.1-main
use_split_build: True
DESIRED_PYTHON: "3.9"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_9-cuda12_1-split
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda12_1-split-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_9-cuda12_1-split-build
- get-label-type
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu121
GPU_ARCH_VERSION: 12.1
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.1-main
use_split_build: True
DESIRED_PYTHON: "3.9"
build_name: manywheel-py3_9-cuda12_1-split
build_environment: linux-binary-manywheel
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda12_4-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@ -238,6 +148,7 @@ jobs:
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.4-main
use_split_build: False
DESIRED_PYTHON: "3.9"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_9-cuda12_4
@ -261,6 +172,7 @@ jobs:
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.4-main
use_split_build: False
DESIRED_PYTHON: "3.9"
build_name: manywheel-py3_9-cuda12_4
build_environment: linux-binary-manywheel
@ -268,50 +180,3 @@ jobs:
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda12_4-split-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.4-main
use_split_build: True
DESIRED_PYTHON: "3.9"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_9-cuda12_4-split
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda12_4-split-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_9-cuda12_4-split-build
- get-label-type
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.4-main
use_split_build: True
DESIRED_PYTHON: "3.9"
build_name: manywheel-py3_9-cuda12_4-split
build_environment: linux-binary-manywheel
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}

File diff suppressed because it is too large

File diff suppressed because it is too large


@ -58,6 +58,7 @@ jobs:
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-s390x
DOCKER_IMAGE: pytorch/manylinuxs390x-builder:cpu-s390x-main
use_split_build: False
DESIRED_PYTHON: "3.9"
runs_on: linux.s390x
ALPINE_IMAGE: "docker.io/s390x/alpine"
@ -81,6 +82,7 @@ jobs:
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-s390x
DOCKER_IMAGE: pytorch/manylinuxs390x-builder:cpu-s390x-main
use_split_build: False
DESIRED_PYTHON: "3.9"
build_name: manywheel-py3_9-cpu-s390x
build_environment: linux-s390x-binary-manywheel
@ -103,6 +105,7 @@ jobs:
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-s390x
DOCKER_IMAGE: pytorch/manylinuxs390x-builder:cpu-s390x-main
use_split_build: False
DESIRED_PYTHON: "3.9"
build_name: manywheel-py3_9-cpu-s390x
secrets:
@ -124,6 +127,7 @@ jobs:
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-s390x
DOCKER_IMAGE: pytorch/manylinuxs390x-builder:cpu-s390x-main
use_split_build: False
DESIRED_PYTHON: "3.10"
runs_on: linux.s390x
ALPINE_IMAGE: "docker.io/s390x/alpine"
@ -147,6 +151,7 @@ jobs:
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-s390x
DOCKER_IMAGE: pytorch/manylinuxs390x-builder:cpu-s390x-main
use_split_build: False
DESIRED_PYTHON: "3.10"
build_name: manywheel-py3_10-cpu-s390x
build_environment: linux-s390x-binary-manywheel
@ -169,6 +174,7 @@ jobs:
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-s390x
DOCKER_IMAGE: pytorch/manylinuxs390x-builder:cpu-s390x-main
use_split_build: False
DESIRED_PYTHON: "3.10"
build_name: manywheel-py3_10-cpu-s390x
secrets:
@ -190,6 +196,7 @@ jobs:
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-s390x
DOCKER_IMAGE: pytorch/manylinuxs390x-builder:cpu-s390x-main
use_split_build: False
DESIRED_PYTHON: "3.11"
runs_on: linux.s390x
ALPINE_IMAGE: "docker.io/s390x/alpine"
@ -213,6 +220,7 @@ jobs:
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-s390x
DOCKER_IMAGE: pytorch/manylinuxs390x-builder:cpu-s390x-main
use_split_build: False
DESIRED_PYTHON: "3.11"
build_name: manywheel-py3_11-cpu-s390x
build_environment: linux-s390x-binary-manywheel
@ -235,6 +243,7 @@ jobs:
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-s390x
DOCKER_IMAGE: pytorch/manylinuxs390x-builder:cpu-s390x-main
use_split_build: False
DESIRED_PYTHON: "3.11"
build_name: manywheel-py3_11-cpu-s390x
secrets:
@ -256,6 +265,7 @@ jobs:
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-s390x
DOCKER_IMAGE: pytorch/manylinuxs390x-builder:cpu-s390x-main
use_split_build: False
DESIRED_PYTHON: "3.12"
runs_on: linux.s390x
ALPINE_IMAGE: "docker.io/s390x/alpine"
@ -279,6 +289,7 @@ jobs:
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-s390x
DOCKER_IMAGE: pytorch/manylinuxs390x-builder:cpu-s390x-main
use_split_build: False
DESIRED_PYTHON: "3.12"
build_name: manywheel-py3_12-cpu-s390x
build_environment: linux-s390x-binary-manywheel
@ -301,6 +312,7 @@ jobs:
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-s390x
DOCKER_IMAGE: pytorch/manylinuxs390x-builder:cpu-s390x-main
use_split_build: False
DESIRED_PYTHON: "3.12"
build_name: manywheel-py3_12-cpu-s390x
secrets:
@ -322,6 +334,7 @@ jobs:
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-s390x
DOCKER_IMAGE: pytorch/manylinuxs390x-builder:cpu-s390x-main
use_split_build: False
DESIRED_PYTHON: "3.13"
runs_on: linux.s390x
ALPINE_IMAGE: "docker.io/s390x/alpine"
@ -345,6 +358,7 @@ jobs:
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-s390x
DOCKER_IMAGE: pytorch/manylinuxs390x-builder:cpu-s390x-main
use_split_build: False
DESIRED_PYTHON: "3.13"
build_name: manywheel-py3_13-cpu-s390x
build_environment: linux-s390x-binary-manywheel
@ -367,6 +381,7 @@ jobs:
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-s390x
DOCKER_IMAGE: pytorch/manylinuxs390x-builder:cpu-s390x-main
use_split_build: False
DESIRED_PYTHON: "3.13"
build_name: manywheel-py3_13-cpu-s390x
secrets:


@ -49,7 +49,7 @@ jobs:
DESIRED_DEVTOOLSET: cxx11-abi
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.8"
DESIRED_PYTHON: "3.9"
steps:
# NOTE: These environment variables are put here so that they can be applied on every job equally
# They are also here because setting them at a workflow level doesn't give us access to the


@ -51,7 +51,7 @@ jobs:
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.8"
DESIRED_PYTHON: "3.9"
steps:
- name: Display EC2 information
shell: bash
@ -169,7 +169,7 @@ jobs:
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.8"
DESIRED_PYTHON: "3.9"
steps:
- name: Display EC2 information
shell: bash


@ -58,7 +58,7 @@ jobs:
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.8"
DESIRED_PYTHON: "3.9"
steps:
- name: Display EC2 information
shell: bash
@ -176,7 +176,7 @@ jobs:
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.8"
DESIRED_PYTHON: "3.9"
steps:
- name: Display EC2 information
shell: bash
@ -290,7 +290,7 @@ jobs:
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.8"
DESIRED_PYTHON: "3.9"
build_name: libtorch-cpu-shared-with-deps-debug
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -316,7 +316,7 @@ jobs:
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.8"
DESIRED_PYTHON: "3.9"
steps:
- name: Display EC2 information
shell: bash
@ -435,7 +435,7 @@ jobs:
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.8"
DESIRED_PYTHON: "3.9"
steps:
- name: Display EC2 information
shell: bash
@ -550,7 +550,7 @@ jobs:
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.8"
DESIRED_PYTHON: "3.9"
build_name: libtorch-cuda11_8-shared-with-deps-debug
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -576,7 +576,7 @@ jobs:
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.8"
DESIRED_PYTHON: "3.9"
steps:
- name: Display EC2 information
shell: bash
@ -695,7 +695,7 @@ jobs:
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.8"
DESIRED_PYTHON: "3.9"
steps:
- name: Display EC2 information
shell: bash
@ -810,7 +810,7 @@ jobs:
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.8"
DESIRED_PYTHON: "3.9"
build_name: libtorch-cuda12_1-shared-with-deps-debug
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -836,7 +836,7 @@ jobs:
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.8"
DESIRED_PYTHON: "3.9"
steps:
- name: Display EC2 information
shell: bash
@ -955,7 +955,7 @@ jobs:
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.8"
DESIRED_PYTHON: "3.9"
steps:
- name: Display EC2 information
shell: bash
@ -1070,7 +1070,7 @@ jobs:
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.8"
DESIRED_PYTHON: "3.9"
build_name: libtorch-cuda12_4-shared-with-deps-debug
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}


@ -51,7 +51,7 @@ jobs:
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.8"
DESIRED_PYTHON: "3.9"
steps:
- name: Display EC2 information
shell: bash
@ -169,7 +169,7 @@ jobs:
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.8"
DESIRED_PYTHON: "3.9"
steps:
- name: Display EC2 information
shell: bash


@ -58,7 +58,7 @@ jobs:
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.8"
DESIRED_PYTHON: "3.9"
steps:
- name: Display EC2 information
shell: bash
@ -176,7 +176,7 @@ jobs:
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.8"
DESIRED_PYTHON: "3.9"
steps:
- name: Display EC2 information
shell: bash
@ -290,7 +290,7 @@ jobs:
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.8"
DESIRED_PYTHON: "3.9"
build_name: libtorch-cpu-shared-with-deps-release
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -316,7 +316,7 @@ jobs:
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.8"
DESIRED_PYTHON: "3.9"
steps:
- name: Display EC2 information
shell: bash
@ -435,7 +435,7 @@ jobs:
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.8"
DESIRED_PYTHON: "3.9"
steps:
- name: Display EC2 information
shell: bash
@ -550,7 +550,7 @@ jobs:
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.8"
DESIRED_PYTHON: "3.9"
build_name: libtorch-cuda11_8-shared-with-deps-release
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -576,7 +576,7 @@ jobs:
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.8"
DESIRED_PYTHON: "3.9"
steps:
- name: Display EC2 information
shell: bash
@ -695,7 +695,7 @@ jobs:
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.8"
DESIRED_PYTHON: "3.9"
steps:
- name: Display EC2 information
shell: bash
@ -810,7 +810,7 @@ jobs:
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.8"
DESIRED_PYTHON: "3.9"
build_name: libtorch-cuda12_1-shared-with-deps-release
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -836,7 +836,7 @@ jobs:
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.8"
DESIRED_PYTHON: "3.9"
steps:
- name: Display EC2 information
shell: bash
@ -955,7 +955,7 @@ jobs:
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.8"
DESIRED_PYTHON: "3.9"
steps:
- name: Display EC2 information
shell: bash
@ -1070,7 +1070,7 @@ jobs:
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.8"
DESIRED_PYTHON: "3.9"
build_name: libtorch-cuda12_4-shared-with-deps-release
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}


@ -0,0 +1,40 @@
name: inductor-micro-benchmark-x86
on:
schedule:
- cron: 0 7 * * *
push:
tags:
- ciflow/inductor-micro-benchmark-cpu-x86/*
workflow_dispatch:
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }}
cancel-in-progress: true
permissions: read-all
jobs:
linux-jammy-cpu-py3_9-gcc11-inductor-build:
name: linux-jammy-cpu-py3.9-gcc11-inductor
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-jammy-py3.9-gcc11
docker-image-name: pytorch-linux-jammy-py3.9-gcc11-inductor-benchmarks
# Use metal host for benchmark jobs
test-matrix: |
{ include: [
{ config: "inductor-micro-benchmark-cpu-x86", shard: 1, num_shards: 1, runner: "linux.24xl.spr-metal" },
]}
linux-jammy-cpu-py3_9-gcc11-inductor-micro-benchmark-test:
name: linux-jammy-cpu-py3.9-gcc11-inductor
uses: ./.github/workflows/_linux-test.yml
needs: linux-jammy-cpu-py3_9-gcc11-inductor-build
with:
build-environment: linux-jammy-py3.9-gcc11
docker-image: ${{ needs.linux-jammy-cpu-py3_9-gcc11-inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-cpu-py3_9-gcc11-inductor-build.outputs.test-matrix }}
use-gha: anything-non-empty-to-use-gha
timeout-minutes: 720


@ -223,7 +223,7 @@ jobs:
cache: pip
- name: Install dependencies
run: |
pip install pytest-rerunfailures==11.1.* pytest-flakefinder==1.1.* pytest-xdist==3.3.* expecttest==0.1.* numpy==1.24.*
pip install pytest-rerunfailures==11.1.* pytest-flakefinder==1.1.* pytest-xdist==3.3.* expecttest==0.2.* fbscribelogger==0.1.* numpy==1.24.*
pip install torch --pre --index-url https://download.pytorch.org/whl/nightly/cpu/
- name: Run run_test.py (nonretryable)
run: |


@ -2,7 +2,7 @@ name: Upload test stats
on:
workflow_run:
workflows: [pull, trunk, periodic, inductor, unstable, slow, unstable-periodic, inductor-periodic, rocm, inductor-micro-benchmark, inductor-cu124, inductor-rocm]
workflows: [pull, trunk, periodic, inductor, unstable, slow, unstable-periodic, inductor-periodic, rocm, inductor-micro-benchmark, inductor-micro-benchmark-x86, inductor-cu124, inductor-rocm]
types:
- completed


@ -138,7 +138,7 @@ init_command = [
'--dry-run={{DRYRUN}}',
'numpy==1.24.3 ; python_version == "3.8"',
'numpy==1.26.0 ; python_version >= "3.9"',
'expecttest==0.1.6',
'expecttest==0.2.1',
'mypy==1.10.0',
'sympy==1.12.1 ; python_version == "3.8"',
'sympy==1.13.0 ; python_version >= "3.9"',


@ -332,6 +332,7 @@ intern_build_aten_ops(
"@fbgemm",
"@mkl",
"@sleef",
"@mkl_dnn//:mkl-dnn",
],
)


@ -57,7 +57,6 @@ nn/qat/ @jerryzh168
# Docker
/.ci/docker/ @jeffdaily
/.ci/docker/ci_commit_pins/triton.txt @desertfire @Chillee @eellison @shunting314 @bertmaher @jeffdaily @jataylo @jithunnair-amd @pruthvistony
/.ci/docker/ci_commit_pins/triton-rocm.txt @jeffdaily @jataylo @jithunnair-amd @pruthvistony
/.ci/docker/ci_commit_pins/triton-xpu.txt @EikanWang @gujinghui
# Github Actions


@ -50,6 +50,7 @@ Following is the Release Compatibility Matrix for PyTorch releases:
| PyTorch version | Python | Stable CUDA | Experimental CUDA | Stable ROCm |
| --- | --- | --- | --- | --- |
| 2.5 | >=3.9, <=3.12, (3.13 experimental) | CUDA 11.8, CUDA 12.1, CUDA 12.4, CUDNN 9.1.0.70 | None | ROCm 6.2 |
| 2.4 | >=3.8, <=3.12 | CUDA 11.8, CUDA 12.1, CUDNN 9.1.0.70 | CUDA 12.4, CUDNN 9.1.0.70 | ROCm 6.1 |
| 2.3 | >=3.8, <=3.11, (3.12 experimental) | CUDA 11.8, CUDNN 8.7.0.84 | CUDA 12.1, CUDNN 8.9.2.26 | ROCm 6.0 |
| 2.2 | >=3.8, <=3.11, (3.12 experimental) | CUDA 11.8, CUDNN 8.7.0.84 | CUDA 12.1, CUDNN 8.9.2.26 | ROCm 5.7 |


@ -707,7 +707,12 @@ bool are_all_mutations_under_no_grad_or_inference_mode(const Tensor& functional_
}
bool isFunctionalTensor(const at::Tensor& tensor) {
return tensor.unsafeGetTensorImpl()->key_set().has(c10::DispatchKey::Functionalize);
}
bool isBaseTensor(const at::Tensor& tensor) {
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(isFunctionalTensor(tensor));
return unsafeGetFunctionalWrapper(tensor)->isBaseTensor();
}
bool isFunctionalTensor(const std::optional<Tensor>& t) {


@ -165,6 +165,12 @@ struct TORCH_API FunctionalTensorWrapper : public c10::TensorImpl {
was_storage_changed_ = true;
}
// A FunctionalTensor is considered a base if it's not a view of another
// tensor.
bool isBaseTensor() const {
return view_metas_.empty();
}
c10::SymInt get_storage_size(bool before) {
return functional_storage_impl()->get_storage_size(before);
}
@ -290,6 +296,8 @@ TORCH_API inline FunctionalTensorWrapper* unsafeGetFunctionalWrapper(
return functional_impl;
}
TORCH_API bool isBaseTensor(const at::Tensor& tensor);
TORCH_API bool isFunctionalTensor(const at::Tensor& tensor);
TORCH_API bool isFunctionalTensor(const std::optional<Tensor>& t);
TORCH_API bool isFunctionalTensor(


@ -69,7 +69,7 @@ thread_local std::array<at::ScalarType, at::COMPILE_TIME_MAX_DEVICE_TYPES>
at::ScalarType::Undefined, // Vulkan
at::ScalarType::Undefined, // Metal
at::kHalf, // XPU
at::ScalarType::Undefined, // MPS
at::kHalf, // MPS
at::ScalarType::Undefined, // Meta (tensors with no data)
at::kBFloat16, // HPU / HABANA
at::ScalarType::Undefined, // SX-Aurora / NEC
@ -206,6 +206,118 @@ TORCH_LIBRARY_IMPL(aten, Autocast, m) {
TORCH_FN((&at::autocast::binary_cross_entropy_banned)));
}
TORCH_LIBRARY_IMPL(_, AutocastMPS, m) {
m.fallback(torch::CppFunction::makeFallthrough());
}
TORCH_LIBRARY_IMPL(aten, AutocastMPS, m) {
// lower_precision_fp
KERNEL_MPS2(_convolution, deprecated, lower_precision_fp)
KERNEL_MPS(_convolution, lower_precision_fp)
KERNEL_MPS(conv1d, lower_precision_fp)
KERNEL_MPS(conv2d, lower_precision_fp)
KERNEL_MPS(conv_tbc, lower_precision_fp)
KERNEL_MPS(conv_transpose1d, lower_precision_fp)
KERNEL_MPS2(conv_transpose2d, input, lower_precision_fp)
KERNEL_MPS(convolution, lower_precision_fp)
KERNEL_MPS(_mps_convolution, lower_precision_fp)
KERNEL_MPS(prelu, lower_precision_fp)
KERNEL_MPS(addmm, lower_precision_fp)
KERNEL_MPS(addmv, lower_precision_fp)
KERNEL_MPS(addr, lower_precision_fp)
KERNEL_MPS(matmul, lower_precision_fp)
KERNEL_MPS(einsum, lower_precision_fp)
KERNEL_MPS(mm, lower_precision_fp)
KERNEL_MPS(mv, lower_precision_fp)
KERNEL_MPS(linear, lower_precision_fp)
KERNEL_MPS(addbmm, lower_precision_fp)
KERNEL_MPS(baddbmm, lower_precision_fp)
KERNEL_MPS(bmm, lower_precision_fp)
KERNEL_MPS(chain_matmul, lower_precision_fp)
KERNEL_MPS(linalg_multi_dot, lower_precision_fp)
KERNEL_MPS(lstm_cell, lower_precision_fp)
// fp32
KERNEL_MPS(acos, fp32)
KERNEL_MPS(asin, fp32)
KERNEL_MPS(cosh, fp32)
KERNEL_MPS(erfinv, fp32)
KERNEL_MPS(exp, fp32)
KERNEL_MPS(expm1, fp32)
KERNEL_MPS(log, fp32)
KERNEL_MPS(log10, fp32)
KERNEL_MPS(log2, fp32)
KERNEL_MPS(log1p, fp32)
KERNEL_MPS(reciprocal, fp32)
KERNEL_MPS(rsqrt, fp32)
KERNEL_MPS(sinh, fp32)
KERNEL_MPS(tan, fp32)
KERNEL_MPS2(pow, Tensor_Scalar, fp32)
KERNEL_MPS2(pow, Tensor_Tensor, fp32)
KERNEL_MPS2(pow, Scalar, fp32)
KERNEL_MPS(softplus, fp32)
KERNEL_MPS(layer_norm, fp32)
KERNEL_MPS(native_layer_norm, fp32)
KERNEL_MPS(group_norm, fp32)
KERNEL_MPS2(frobenius_norm, dim, fp32)
KERNEL_MPS(nuclear_norm, fp32)
KERNEL_MPS2(nuclear_norm, dim, fp32)
KERNEL_MPS(batch_norm, fp32)
KERNEL_MPS(cosine_similarity, fp32)
KERNEL_MPS(poisson_nll_loss, fp32)
KERNEL_MPS(cosine_embedding_loss, fp32)
KERNEL_MPS(nll_loss, fp32)
KERNEL_MPS(nll_loss2d, fp32)
KERNEL_MPS(hinge_embedding_loss, fp32)
KERNEL_MPS(kl_div, fp32)
KERNEL_MPS(l1_loss, fp32)
KERNEL_MPS(smooth_l1_loss, fp32)
KERNEL_MPS(huber_loss, fp32)
KERNEL_MPS(mse_loss, fp32)
KERNEL_MPS(margin_ranking_loss, fp32)
KERNEL_MPS(multilabel_margin_loss, fp32)
KERNEL_MPS(soft_margin_loss, fp32)
KERNEL_MPS(triplet_margin_loss, fp32)
KERNEL_MPS(multi_margin_loss, fp32)
KERNEL_MPS(binary_cross_entropy_with_logits, fp32)
KERNEL_MPS(dist, fp32)
KERNEL_MPS(pdist, fp32)
KERNEL_MPS(cdist, fp32)
KERNEL_MPS(renorm, fp32)
KERNEL_MPS(logsumexp, fp32)
// fp32_set_opt_dtype
KERNEL_MPS(prod, fp32)
KERNEL_MPS2(prod, dim_int, fp32)
KERNEL_MPS2(prod, dim_Dimname, fp32)
KERNEL_MPS2(softmax, int, fp32)
KERNEL_MPS2(softmax, Dimname, fp32)
KERNEL_MPS2(log_softmax, int, fp32)
KERNEL_MPS2(log_softmax, Dimname, fp32)
KERNEL_MPS(cumprod, fp32)
KERNEL_MPS2(cumprod, dimname, fp32)
KERNEL_MPS(cumsum, fp32)
KERNEL_MPS2(cumsum, dimname, fp32)
KERNEL_MPS(linalg_vector_norm, fp32)
KERNEL_MPS(linalg_matrix_norm, fp32)
KERNEL_MPS2(linalg_matrix_norm, str_ord, fp32)
KERNEL_MPS(sum, fp32)
KERNEL_MPS2(sum, dim_IntList, fp32)
KERNEL_MPS2(sum, dim_DimnameList, fp32)
//
// promote
KERNEL_MPS(addcdiv, promote)
KERNEL_MPS(addcmul, promote)
KERNEL_MPS(atan2, promote)
KERNEL_MPS(bilinear, promote)
KERNEL_MPS(cross, promote)
KERNEL_MPS(dot, promote)
KERNEL_MPS(grid_sampler, promote)
KERNEL_MPS(index_put, promote)
KERNEL_MPS(tensordot, promote)
KERNEL_MPS(scatter_add, promote)
}
TORCH_LIBRARY_IMPL(_, AutocastCPU, m) {
m.fallback(torch::CppFunction::makeFallthrough());
}
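The registrations above make MPS behave like the other autocast backends: ops listed under lower_precision_fp are cast to the MPS autocast dtype (which the dispatch-table change above now defaults to at::kHalf), ops listed under fp32 are forced back to float32, and everything else hits the fallthrough fallback. A minimal C++ sketch of that behavior, assuming the device-generic at::autocast getters/setters (only is_autocast_available appears in this change; the setters and the exact header are assumptions):

#include <ATen/ATen.h>
#include <ATen/autocast_mode.h>

void autocast_mps_sketch() {
  // Assumed API: device-generic enable/dtype setters in at::autocast.
  if (!at::autocast::is_autocast_available(at::kMPS)) {
    return;  // autocast not available for this device type
  }
  at::autocast::set_autocast_dtype(at::kMPS, at::kHalf);  // matches the new default above
  at::autocast::set_autocast_enabled(at::kMPS, true);
  auto a = at::randn({8, 8}, at::device(at::kMPS).dtype(at::kFloat));
  auto b = at::randn({8, 8}, at::device(at::kMPS).dtype(at::kFloat));
  // mm is registered with CastPolicy::lower_precision_fp, so its fp32 inputs
  // are cast to half before redispatch and the result comes back as half.
  auto c = at::mm(a, b);
  at::autocast::set_autocast_enabled(at::kMPS, false);
}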


@ -145,6 +145,8 @@ inline bool is_autocast_eligible(
return tensor.is_xla() && tensor.is_floating_point();
case c10::DeviceType::PrivateUse1:
return tensor.is_privateuseone() && tensor.is_floating_point();
case c10::DeviceType::MPS:
return tensor.is_mps() && tensor.is_floating_point();
default:
return false;
}
@ -168,6 +170,8 @@ inline DispatchKey get_autocast_dispatch_key_from_device_type(
return DispatchKey::AutocastXLA;
case c10::DeviceType::PrivateUse1:
return DispatchKey::AutocastPrivateUse1;
case c10::DeviceType::MPS:
return DispatchKey::AutocastMPS;
default:
throw std::runtime_error(
"unknown device type for autocast in get_autocast_dispatch_key_from_device_type");
@ -178,7 +182,7 @@ inline bool is_autocast_available(c10::DeviceType device_type) {
if (device_type == at::kCPU || device_type == at::kCUDA ||
device_type == at::kXPU || device_type == at::kIPU ||
device_type == at::kHPU || device_type == at::kXLA ||
device_type == at::kPrivateUse1) {
device_type == at::kPrivateUse1 || device_type == at::kMPS) {
return true;
} else {
return false;
@ -745,6 +749,27 @@ copy pasted in from VariableTypeEverything.cpp with appropriate substitutions.
REDISPATCH_SIGNATURE, \
POLICY)
// KERNEL_MPS registration for AutocastMPS
#define KERNEL_MPS(OP, POLICY) \
m.impl( \
TORCH_SELECTIVE_NAME("aten::" #OP), \
&WrapFunction< \
CastPolicy::POLICY, \
DeviceType::MPS, \
decltype(ATEN_FN(OP)), \
decltype(ATEN_FN(OP)), \
&ATEN_FN(OP)>::type::call);
#define KERNEL_MPS2(OP, OVERLOAD, POLICY) \
m.impl( \
TORCH_SELECTIVE_NAME("aten::" #OP "." #OVERLOAD), \
&WrapFunction< \
CastPolicy::POLICY, \
DeviceType::MPS, \
decltype(ATEN_FN2(OP, OVERLOAD)), \
decltype(ATEN_FN2(OP, OVERLOAD)), \
&ATEN_FN2(OP, OVERLOAD)>::type::call);
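For readers unfamiliar with these helpers, the first registration in the AutocastMPS block, KERNEL_MPS(mm, lower_precision_fp), expands to approximately the following (a sketch of the macro expansion inside TORCH_LIBRARY_IMPL(aten, AutocastMPS, m), not additional source in this change):

// KERNEL_MPS(mm, lower_precision_fp) expands roughly to:
m.impl(
    TORCH_SELECTIVE_NAME("aten::mm"),
    &WrapFunction<
        CastPolicy::lower_precision_fp,  // cast eligible fp32 args down first
        DeviceType::MPS,
        decltype(ATEN_FN(mm)),           // signature of at::mm
        decltype(ATEN_FN(mm)),
        &ATEN_FN(mm)>::type::call);      // then redispatch to the real mm kernel

KERNEL_MPS2 does the same for a named overload, e.g. KERNEL_MPS2(pow, Tensor_Scalar, fp32) registers "aten::pow.Tensor_Scalar" with CastPolicy::fp32.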
// Op lists for different policies.
// To make sure other backends can reuse the policy op list.
#define AT_FORALL_LOWER_PRECISION_FP(_) \


@ -228,6 +228,7 @@ namespace c10 {
_(aten, is_autocast_cpu_enabled) \
_(aten, is_autocast_xla_enabled) \
_(aten, get_autocast_dtype) \
_(aten, is_autocast_mps_enabled) \
FORALL_ATEN_BASE_SYMBOLS(_) \
_(onnx, Add) \
_(onnx, Concat) \


@ -9,7 +9,7 @@
#endif
namespace at::cpu {
bool is_cpu_support_avx2() {
bool is_avx2_supported() {
#if !defined(__s390x__) && !defined(__powerpc__)
return cpuinfo_initialize() && cpuinfo_has_x86_avx2();
#else
@ -17,7 +17,7 @@ bool is_cpu_support_avx2() {
#endif
}
bool is_cpu_support_avx512() {
bool is_avx512_supported() {
#if !defined(__s390x__) && !defined(__powerpc__)
return cpuinfo_initialize() && cpuinfo_has_x86_avx512f() && cpuinfo_has_x86_avx512vl() && cpuinfo_has_x86_avx512bw() && cpuinfo_has_x86_avx512dq();
#else
@ -25,7 +25,7 @@ bool is_cpu_support_avx512() {
#endif
}
bool is_cpu_support_avx512_vnni() {
bool is_avx512_vnni_supported() {
#if !defined(__s390x__) && !defined(__powerpc__)
return cpuinfo_initialize() && cpuinfo_has_x86_avx512vnni();
#else
@ -33,7 +33,15 @@ bool is_cpu_support_avx512_vnni() {
#endif
}
bool is_cpu_support_amx_tile() {
bool is_avx512_bf16_supported() {
#if !defined(__s390x__) && !defined(__powerpc__)
return cpuinfo_initialize() && cpuinfo_has_x86_avx512bf16();
#else
return false;
#endif
}
bool is_amx_tile_supported() {
#if !defined(__s390x__) && !defined(__powerpc__)
return cpuinfo_initialize() && cpuinfo_has_x86_amx_tile();
#else
@ -42,7 +50,7 @@ bool is_cpu_support_amx_tile() {
}
bool init_amx() {
if (!is_cpu_support_amx_tile()) {
if (!is_amx_tile_supported()) {
return false;
}


@ -6,14 +6,17 @@
namespace at::cpu {
TORCH_API bool is_cpu_support_avx2();
TORCH_API bool is_cpu_support_avx512();
TORCH_API bool is_avx2_supported();
TORCH_API bool is_avx512_supported();
// Detect if CPU support Vector Neural Network Instruction.
TORCH_API bool is_cpu_support_avx512_vnni();
TORCH_API bool is_avx512_vnni_supported();
// Detect if CPU supports AVX512_BF16 ISA
TORCH_API bool is_avx512_bf16_supported();
// Detect if CPU support Advanced Matrix Extension.
TORCH_API bool is_cpu_support_amx_tile();
TORCH_API bool is_amx_tile_supported();
// Enable the system to use AMX instructions.
TORCH_API bool init_amx();
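A small usage sketch for the renamed predicates declared above (illustrative only; the include path ATen/cpu/Utils.h is an assumption based on the surrounding context):

#include <ATen/cpu/Utils.h>
#include <iostream>

int main() {
  // is_cpu_support_* has been renamed to is_*_supported; avx512_bf16 is new.
  std::cout << "avx2:        " << at::cpu::is_avx2_supported() << '\n';
  std::cout << "avx512:      " << at::cpu::is_avx512_supported() << '\n';
  std::cout << "avx512_vnni: " << at::cpu::is_avx512_vnni_supported() << '\n';
  std::cout << "avx512_bf16: " << at::cpu::is_avx512_bf16_supported() << '\n';
  std::cout << "amx_tile:    " << at::cpu::is_amx_tile_supported() << '\n';
  // init_amx() requests AMX tile permission for the calling thread.
  if (at::cpu::is_amx_tile_supported() && at::cpu::init_amx()) {
    std::cout << "AMX enabled\n";
  }
  return 0;
}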


@ -636,6 +636,21 @@ inline void transpose_mxn<float, 8, 8>(
_mm256_storeu_ps(&dst[7 * ld_dst], th);
}
template<>
inline void transpose_mxn<float, 16, 16>(
const float* src,
int64_t ld_src,
float* dst,
int64_t ld_dst) {
transpose_mxn<float, 8, 8>(
src , ld_src, dst, ld_dst);
transpose_mxn<float, 8, 8>(
src + 8, ld_src, dst + 8 * ld_dst, ld_dst);
transpose_mxn<float, 8, 8>(
src + 8 * ld_src, ld_src, dst + 8, ld_dst);
transpose_mxn<float, 8, 8>(
src + 8 * ld_src + 8, ld_src, dst + 8 * ld_dst + 8, ld_dst);
}
#endif
}} // namespace at::vec::CPU_CAPABILITY
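The new 16x16 specialization is the block-matrix transpose identity applied to 8x8 tiles, which is why the two off-diagonal sub-blocks swap their source/destination offsets (src + 8 lands at dst + 8 * ld_dst and src + 8 * ld_src lands at dst + 8):

\begin{pmatrix} A & B \\ C & D \end{pmatrix}^{\top}
= \begin{pmatrix} A^{\top} & C^{\top} \\ B^{\top} & D^{\top} \end{pmatrix},
\qquad A, B, C, D \in \mathbb{R}^{8 \times 8}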


@ -582,8 +582,7 @@ Vectorized<float> inline fmsub(const Vectorized<float>& a, const Vectorized<floa
// https://github.com/pytorch/FBGEMM/blob/39a423e4ad1a04b77fea81c7d09c3e6f8984fae9/src/UtilsAvx512.cc#L230-L304
// kernel for transposing mxn where m, n <= 16
// M + (M + 1) / 2 * 2 + (M + 3) / 4 * 4 + (M + 7) / 8 * 8 + 2 * N instructions
template <>
inline void transpose_mxn<float>(const float* src, int64_t ld_src, float* dst, int64_t ld_dst, int M, int N) {
inline void transpose_mxn_16x16(const float* src, int64_t ld_src, float* dst, int64_t ld_dst, int M, int N) {
TORCH_CHECK(M <= 16 && N <= 16, "transpose_mxn<float> expects M, N <= 16.");
// load from src to registers
__m512 input[16];
@ -667,8 +666,39 @@ inline void transpose_mxn<float>(const float* src, int64_t ld_src, float* dst, i
}
}
template<>
inline void transpose_mxn<float>(const float* src, int64_t ld_src, float* dst, int64_t ld_dst, int M, int N) {
int64_t i = 0;
for (; i < M / 16 * 16; i += 16) {
int64_t j = 0;
for (; j < N / 16 * 16; j += 16) {
transpose_mxn_16x16(
src + i * ld_src + j, ld_src, dst + j * ld_dst + i, ld_dst, 16, 16);
}
// handle remainder j
int nrem = N - j;
if (nrem > 0) {
transpose_mxn_16x16(
src + i * ld_src + j, ld_src, dst + j * ld_dst + i, ld_dst, 16, nrem);
}
}
// handle remainder i
int mrem = M - i;
if (mrem > 0) {
int j = 0;
for (; j < N / 16 * 16; j += 16) {
transpose_mxn_16x16(
src + i * ld_src + j, ld_src, dst + j * ld_dst + i, ld_dst, mrem, 16);
}
// handle remainder j
int nrem = N - j;
transpose_mxn_16x16(
src + i * ld_src + j, ld_src, dst + j * ld_dst + i, ld_dst, mrem, nrem);
}
}
template <typename T, int M, int N,
typename std::enable_if_t<std::is_same<T, float>::value && M <= 16 && N <= 16, int> = 0>
typename std::enable_if_t<std::is_same<T, float>::value, int> = 0>
inline void transpose_mxn(const float* src, int64_t ld_src, float* dst, int64_t ld_dst) {
transpose_mxn<float>(src, ld_src, dst, ld_dst, M, N);
}
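As a worked example of the tiling above (the sizes are illustrative, not from this change): for M = 20, N = 35 the main loop runs i over {0} since 20 / 16 * 16 = 16, and j over {0, 16} since 35 / 16 * 16 = 32, leaving mrem = 20 - 16 = 4 and nrem = 35 - 32 = 3. The function therefore issues six transpose_mxn_16x16 calls: two full 16x16 tiles, one 16x3 tile for the column remainder, two 4x16 tiles for the row remainder, and one 4x3 corner tile.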


@ -23,6 +23,9 @@ TORCH_LIBRARY_IMPL(aten, FuncTorchVmapMode, m) {
OP_DECOMPOSE(dropout_);
OP_DECOMPOSE(feature_alpha_dropout_);
OP_DECOMPOSE(feature_dropout_);
OP_DECOMPOSE(dropout);
OP_DECOMPOSE(_scaled_dot_product_attention_math);
OP_DECOMPOSE(scaled_dot_product_attention);
}
static void unsupportedData(const c10::OperatorHandle& op, torch::jit::Stack* stack) {
@ -235,7 +238,6 @@ TORCH_LIBRARY_IMPL(aten, FuncTorchBatchedDecomposition, m) {
OP_DECOMPOSE(relu6_);
OP_DECOMPOSE(prelu);
OP_DECOMPOSE2(softmax, int);
OP_DECOMPOSE(scaled_dot_product_attention);
OP_DECOMPOSE(special_gammainc);
OP_DECOMPOSE(special_gammaincc);
OP_DECOMPOSE(special_logit);
@ -261,7 +263,6 @@ TORCH_LIBRARY_IMPL(aten, FuncTorchBatchedDecomposition, m) {
OP_DECOMPOSE(special_xlogy);
OP_DECOMPOSE2(special_xlogy, other_scalar);
OP_DECOMPOSE2(special_xlogy, self_scalar);
OP_DECOMPOSE(_scaled_dot_product_attention_math);
m.impl("split.sizes", native::split_symint);
@ -386,6 +387,11 @@ TORCH_LIBRARY_IMPL(aten, FuncTorchBatchedDecomposition, m) {
OP_DECOMPOSE2(to, dtype);
OP_DECOMPOSE2(to, dtype_layout);
OP_DECOMPOSE2(to, other);
// Random ops that are also registered here
OP_DECOMPOSE(dropout);
OP_DECOMPOSE(_scaled_dot_product_attention_math);
OP_DECOMPOSE(scaled_dot_product_attention);
}
} // namespace at::functorch


@ -496,6 +496,11 @@ _scaled_dot_product_flash_attention_batch_rule(
bool return_debug_mask,
c10::optional<double> scale
) {
if (dropout_p > 0) {
auto maybe_layer = maybeCurrentDynamicLayer();
RandomnessType randomness = maybe_layer->randomness();
check_randomness(randomness, query_bdim.has_value() || key_bdim.has_value() || value_bdim.has_value());
}
auto batch_size = get_bdim_size3(query, query_bdim, key, key_bdim, value, value_bdim);
auto query_ = moveBatchDimToFront(query, query_bdim);
auto key_ = moveBatchDimToFront(key, key_bdim);
@ -540,6 +545,11 @@ fourOutputs _scaled_dot_product_efficient_attention_batch_rule(
bool is_causal,
c10::optional<double> scale
) {
if (dropout_p > 0) {
auto maybe_layer = maybeCurrentDynamicLayer();
RandomnessType randomness = maybe_layer->randomness();
check_randomness(randomness, query_bdim.has_value() || key_bdim.has_value() || value_bdim.has_value());
}
auto batch_size = get_bdim_size3(query, query_bdim, key, key_bdim, value, value_bdim);
auto query_ = moveBatchDimToFront(query, query_bdim);
auto key_ = moveBatchDimToFront(key, key_bdim);
@ -577,6 +587,11 @@ _scaled_dot_product_cudnn_attention_batch_rule(
bool return_debug_mask,
c10::optional<double> scale
) {
if (dropout_p > 0) {
auto maybe_layer = maybeCurrentDynamicLayer();
RandomnessType randomness = maybe_layer->randomness();
check_randomness(randomness, query_bdim.has_value() || key_bdim.has_value() || value_bdim.has_value());
}
auto batch_size = get_bdim_size3(query, query_bdim, key, key_bdim, value, value_bdim);
auto query_ = moveBatchDimToFront(query, query_bdim);
auto key_ = moveBatchDimToFront(key, key_bdim);


@ -41,6 +41,17 @@ extern "C" void zaxpy_(int *n, void *a, const void *x, int *incx, void *y, int *
#include <fbgemm/FbgemmI64.h>
#endif // USE_FBGEMM
#if AT_MKLDNN_ENABLED()
#include <oneapi/dnnl/dnnl_version.h>
#endif // oneDNN
#define ONEDNN_UKERNEL_ENABLED (DNNL_VERSION_MAJOR >=3 && DNNL_VERSION_MINOR >=5)
#if ONEDNN_UKERNEL_ENABLED && (defined(__x86_64__) || (defined(_M_X64) && !defined(_M_ARM64EC)))
#include <oneapi/dnnl/dnnl_ukernel.hpp>
#include <oneapi/dnnl/dnnl.hpp>
#endif // oneDNN BRGEMM
namespace at::native::cpublas {
namespace internal {
@ -822,4 +833,366 @@ void copy(int64_t n, const c10::complex<float> *x, int64_t incx, c10::complex<fl
n, x, incx, y, incy);
}
} // namespace at::native::cpublas
// oneDNN BRGEMM
#if ONEDNN_UKERNEL_ENABLED && (defined(__x86_64__) || (defined(_M_X64) && !defined(_M_ARM64EC)))
struct BrgemmKey {
int64_t M;
int64_t N;
int64_t K;
int64_t batch_size;
int64_t lda;
int64_t ldb;
int64_t ldc;
ScalarType dt_a;
ScalarType dt_b;
ScalarType dt_c;
float alpha;
float beta;
BrgemmKey(
int64_t M,
int64_t N,
int64_t K,
int64_t batch_size,
int64_t lda,
int64_t ldb,
int64_t ldc,
ScalarType dt_a,
ScalarType dt_b,
ScalarType dt_c,
float alpha,
float beta)
: M(M),
N(N),
K(K),
batch_size(batch_size),
lda(lda),
ldb(ldb),
ldc(ldc),
dt_a(dt_a),
dt_b(dt_b),
dt_c(dt_c),
alpha(alpha),
beta(beta) {}
bool operator==(const BrgemmKey& other) const {
return M == other.M && N == other.N && K == other.K &&
batch_size == other.batch_size && lda == other.lda &&
ldb == other.ldb && ldc == other.ldc && dt_a == other.dt_a &&
dt_b == other.dt_b && dt_c == other.dt_c && alpha == other.alpha &&
beta == other.beta;
}
};
struct PackKey {
int64_t K;
int64_t N;
int64_t ld_in;
int64_t ld_out;
ScalarType dt_in;
ScalarType dt_out;
PackKey(
int64_t K,
int64_t N,
int64_t ld_in,
int64_t ld_out,
ScalarType dt_in,
ScalarType dt_out)
: K(K),
N(N),
ld_in(ld_in),
ld_out(ld_out),
dt_in(dt_in),
dt_out(dt_out) {}
bool operator==(const PackKey& other) const {
return N == other.N && K == other.K && ld_in == other.ld_in &&
ld_out == other.ld_out && dt_in == other.dt_in &&
dt_out == other.dt_out;
}
};
inline dnnl::memory::data_type get_dnnl_dtype(ScalarType dtype) {
if (dtype == ScalarType::Float) {
return dnnl::memory::data_type::f32;
} else if (dtype == ScalarType::BFloat16) {
return dnnl::memory::data_type::bf16;
} else if (dtype == ScalarType::Half) {
return dnnl::memory::data_type::f16;
} else if (dtype == ScalarType::Byte) {
return dnnl::memory::data_type::u8;
} else if (dtype == ScalarType::Char) {
return dnnl::memory::data_type::s8;
} else {
TORCH_CHECK(false, "get_dnnl_dtype expects float/bfloat16/half/int8 tensor input");
}
}
template<typename key_t>
struct UnsafeUkernelKeyHasher {
std::size_t operator()(const key_t& key) const;
};
template<>
std::size_t UnsafeUkernelKeyHasher<BrgemmKey>::operator()(const BrgemmKey& key) const {
// Use beta, M, N, and K to compute hash to reduce the overhead as
// batch size, alpha, and data types are unlikely to change within the same kernel and
// leading dimensions are likely to be related to M, K, N or use fixed values.
std::size_t h = std::hash<float>()(key.beta + 1);
h = std::hash<int64_t>()(key.M) ^ (h << 1);
h = std::hash<int64_t>()(key.N) ^ (h << 1);
h = std::hash<int64_t>()(key.K) ^ (h << 1);
h = std::hash<int64_t>()(key.ldc) ^ (h << 1);
return h;
}
template<>
std::size_t UnsafeUkernelKeyHasher<PackKey>::operator()(const PackKey& key) const {
// Use K and N to compute hash to reduce the overhead as
// data types are unlikely to change and
// ld_in/ld_out is likely to be related to K, N or use fixed values
std::size_t h = std::hash<int64_t>()(key.K);
h = std::hash<int64_t>()(key.N) ^ (h << 1);
return h;
}
template <typename key_t, typename value_t>
struct KernelCache {
using kstore_t = std::unordered_map<key_t, std::shared_ptr<value_t>, UnsafeUkernelKeyHasher<key_t>>;
static inline std::shared_ptr<value_t>&& fetch_or_create(
const key_t& key,
const std::function<std::shared_ptr<value_t>()>& callback) {
auto&& search = get_store().find(key);
if (search != get_store().end()) {
return std::move(search->second);
} else {
get_store().insert({key, callback()});
return std::move(get_store()[key]);
}
}
static inline kstore_t& get_store() {
static thread_local kstore_t cache_kernels;
return cache_kernels;
}
};
// Helper struct for convenient brgemm configuration
struct GemmHelper {
GemmHelper(
int64_t M,
int64_t N,
int64_t K,
int64_t bs,
int64_t ld_a,
int64_t ld_b,
int64_t ld_c,
ScalarType dt_a,
ScalarType dt_b,
ScalarType dt_c,
const float alpha,
const float beta) {
// Create brgemm
brg = dnnl::ukernel::brgemm(
M,
N,
K,
bs,
ld_a,
ld_b,
ld_c,
get_dnnl_dtype(dt_a),
get_dnnl_dtype(dt_b),
get_dnnl_dtype(dt_c),
alpha,
beta);
// Create a scratchpad buffer for the brgemm execution
scratchpad = std::vector<uint8_t>(brg.get_scratchpad_size());
// Prepare default vector of pairs of tensors A and B offsets for each batch.
A_B_offsets.resize(1);
A_B_offsets[0] = std::make_pair(0, 0);
}
dnnl::ukernel::brgemm brg;
std::vector<uint8_t> scratchpad;
std::vector<std::pair<int64_t, int64_t>> A_B_offsets;
};
struct Brgemm : public KernelCache <BrgemmKey, GemmHelper> {
// Fetch/create GemmHelper object and execute brgemm with batch size = 1
template <typename scalar_t_a, typename scalar_t_b, typename scalar_t_c>
static inline void call(
int64_t M,
int64_t N,
int64_t K,
int64_t ld_a,
int64_t ld_b,
int64_t ld_c,
const float alpha,
const float beta,
const scalar_t_a* A,
const scalar_t_b* B,
scalar_t_c* C) {
auto&& key = BrgemmKey(
M,
N,
K,
int64_t(1),
ld_a,
ld_b,
ld_c,
c10::CppTypeToScalarType<scalar_t_a>::value,
c10::CppTypeToScalarType<scalar_t_b>::value,
c10::CppTypeToScalarType<scalar_t_c>::value,
alpha,
beta);
// Fetch/create GemmHelper object
auto&& value = fetch_or_create(key, [&]() {
auto&& v = std::make_shared<GemmHelper>(
M,
N,
K,
1,
ld_a,
ld_b,
ld_c,
c10::CppTypeToScalarType<scalar_t_a>::value,
c10::CppTypeToScalarType<scalar_t_b>::value,
c10::CppTypeToScalarType<scalar_t_c>::value,
alpha,
beta);
(*v).brg.generate();
return std::move(v);
});
if (get_current() != value) {
dnnl::ukernel::brgemm::release_hw_context();
((*value).brg).set_hw_context();
get_current() = value;
}
((*value).brg)
.execute(A, B, (*value).A_B_offsets, C, (*value).scratchpad.data());
}
static inline std::shared_ptr<GemmHelper>& get_current() {
static thread_local std::shared_ptr<GemmHelper> current;
return current;
}
static inline bool device_check(ScalarType dtype) {
if (!at::globalContext().userEnabledMkldnn()) {
return false;
}
if (dtype == ScalarType::Half) {
static bool fp16_support = dnnl::get_effective_cpu_isa() >= dnnl::cpu_isa::avx512_core_fp16;
return fp16_support;
}
return false;
}
};
using pack_t = dnnl::ukernel::brgemm_pack_B;
struct Pack : public KernelCache <PackKey, pack_t> {
static inline void call(
int64_t K,
int64_t N,
int64_t ld_in,
int64_t ld_out,
ScalarType dt_in,
ScalarType dt_out,
const void* in,
void* out) {
auto&& key = PackKey(K, N, ld_in, ld_out, dt_in, dt_out);
auto&& pack = fetch_or_create(key, [&]() {
auto&& p = std::make_shared<pack_t>(
K, N, ld_in, ld_out, get_dnnl_dtype(dt_in), get_dnnl_dtype(dt_out));
if (need_pack(dt_in)) {
(*p).generate();
}
return std::move(p);
});
if (need_pack(dt_in)) {
(*pack).execute(in, out);
} else {
TORCH_CHECK(false, "No need to pack");
}
}
static inline bool need_pack(ScalarType dtype) {
if (!at::globalContext().userEnabledMkldnn()) {
return false;
}
if (dtype == ScalarType::Half) {
static bool fp16_pack = dnnl::get_effective_cpu_isa() >= dnnl::cpu_isa::avx512_core_amx_fp16;
return fp16_pack;
}
return false;
}
};
#endif
void brgemm(
int64_t M,
int64_t N,
int64_t K,
int64_t ld_a,
int64_t ld_b,
int64_t ld_c,
const float alpha,
const float beta,
const at::Half* A,
const at::Half* B,
float* C) {
#if ONEDNN_UKERNEL_ENABLED && (defined(__x86_64__) || (defined(_M_X64) && !defined(_M_ARM64EC)))
if (Brgemm::device_check(ScalarType::Half)) {
Brgemm::call<at::Half, at::Half, float>(
M, N, K, ld_a, ld_b, ld_c, alpha, beta, A, B, C);
return;
}
#endif
TORCH_CHECK(false,
"Half Brgemm is only supported on X64 when oneDNN ukernel is enabled and avx512_fp16 is supported");
}
void brgemm(
int64_t M,
int64_t N,
int64_t K,
int64_t ld_a,
int64_t ld_b,
int64_t ld_c,
const float alpha,
const float beta,
const at::BFloat16* A,
const at::BFloat16* B,
float* C) {
TORCH_CHECK(false,
"BFloat16 Brgemm is currently not supported");
}
void brgemm_release() {
#if ONEDNN_UKERNEL_ENABLED && (defined(__x86_64__) || (defined(_M_X64) && !defined(_M_ARM64EC)))
dnnl::ukernel::brgemm::release_hw_context();
#endif
}
void pack(
int64_t K,
int64_t N,
int64_t ld_in,
int64_t ld_out,
ScalarType dt_in,
ScalarType dt_out,
const void* in,
void* out) {
#if ONEDNN_UKERNEL_ENABLED && (defined(__x86_64__) || (defined(_M_X64) && !defined(_M_ARM64EC)))
Pack::call(K, N, ld_in, ld_out, dt_in, dt_out, in, out);
#else
TORCH_CHECK(false, "pack is only supported on X64 with oneDNN ukernel enabled");
#endif
}
bool need_pack(ScalarType dt_in) {
#if ONEDNN_UKERNEL_ENABLED && (defined(__x86_64__) || (defined(_M_X64) && !defined(_M_ARM64EC)))
return Pack::need_pack(dt_in);
#else
return false;
#endif
}
} // namespace at::native::cpublas


@ -7,6 +7,7 @@
#include <c10/core/ScalarType.h>
#include <c10/core/Scalar.h>
namespace at::native::cpublas {
namespace internal {
@ -186,4 +187,58 @@ void copy(int64_t n, const float *x, int64_t incx, float *y, int64_t incy);
void copy(int64_t n, const c10::complex<double> *x, int64_t incx, c10::complex<double> *y, int64_t incy);
void copy(int64_t n, const c10::complex<float> *x, int64_t incx, c10::complex<float> *y, int64_t incy);
} // namespace at::native::cpublas
// Batch-reduce GEMM
// Operates by the following formula:
// C = alpha * SUM(A[i] x B[i]) + beta * C, i = 0 to batch size
// A Base pointer to a tensor A.
// B Base pointer to a tensor B.
// Byte offsets vector of pairs of tensors A and B offsets for
// each batch. The number of batches must coincide with the
// `batch_size` value passed at object construction stage.
// C Pointer to a tensor C (accumulation buffer).
// scratchpad Pointer to a scratchpad buffer.
// Currently, only brgemm with batch size = 1 will be used
TORCH_API void brgemm(
int64_t M,
int64_t N,
int64_t K,
int64_t ld_a,
int64_t ld_b,
int64_t ld_c,
const float alpha,
const float beta,
const at::Half* A,
const at::Half* B,
float* C);
TORCH_API void brgemm(
int64_t M,
int64_t N,
int64_t K,
int64_t ld_a,
int64_t ld_b,
int64_t ld_c,
const float alpha,
const float beta,
const at::BFloat16* A,
const at::BFloat16* B,
float* C);
// Release brgemm hardware context
void brgemm_release();
// Pack B matrix to get better performance if needed
void pack(
int64_t K,
int64_t N,
int64_t ld_in,
int64_t ld_out,
ScalarType dt_in,
ScalarType dt_out,
const void* in,
void* out);
// Whether pack is needed in the platform.
bool need_pack(ScalarType dt_in);
} // namespace at::native::cpublas
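A minimal usage sketch of this interface (assumptions: the header path ATen/native/CPUBlas.h, and a machine where the Half brgemm path is actually available; on other platforms the call raises the TORCH_CHECK shown in the .cpp diff):

#include <ATen/native/CPUBlas.h>
#include <vector>

void brgemm_sketch() {
  constexpr int64_t M = 32, N = 32, K = 64;
  std::vector<at::Half> A(M * K);        // row-major, ld_a = K
  std::vector<at::Half> B(K * N);        // row-major, ld_b = N
  std::vector<float> C(M * N, 0.f);      // accumulator, ld_c = N

  // Repack B first when the ISA prefers a blocked layout.
  std::vector<at::Half> B_packed(K * N);
  const at::Half* b_ptr = B.data();
  if (at::native::cpublas::need_pack(at::ScalarType::Half)) {
    at::native::cpublas::pack(K, N, /*ld_in=*/N, /*ld_out=*/N,
                              at::ScalarType::Half, at::ScalarType::Half,
                              B.data(), B_packed.data());
    b_ptr = B_packed.data();
  }

  // C = 1.0 * (A x B) + 0.0 * C with batch size 1.
  at::native::cpublas::brgemm(M, N, K, /*ld_a=*/K, /*ld_b=*/N, /*ld_c=*/N,
                              /*alpha=*/1.f, /*beta=*/0.f,
                              A.data(), b_ptr, C.data());

  // Release the ukernel hardware context once the brgemm sequence is done.
  at::native::cpublas::brgemm_release();
}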


@ -144,7 +144,7 @@ static void col2im_out_cpu_template(
output.resize_({batch_size, n_output_plane, output_height, output_width});
AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND2(kBFloat16, kHalf,
AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND3(kBFloat16, kHalf, kBool,
input.scalar_type(), "col2im_out_cpu", [&] {
Tensor input_n = Tensor();
Tensor output_n = Tensor();


@ -421,12 +421,18 @@ struct ConvParams {
// cudnn and miopen are guaranteed not to be on mobile, and T102591915 / T110194934 suggest
// that maybe the compiledWithCuDNN() check sometimes segfaults (though I can't imagine how)
#if !defined(C10_MOBILE)
if (needs_64bit_indexing_no_split(input, weight)) {
return false;
}
if (!detail::getCUDAHooks().compiledWithCuDNN()) {
return false;
}
if (needs_64bit_indexing_no_split(input, weight)) {
static long cudnn_version = detail::getCUDAHooks().versionCuDNN();
if (!(cudnn_version >= 90300 && at::native::cudnnv8_enabled_check_debug())) {
TORCH_WARN_ONCE("cuDNN cannot be used for large non-batch-splittable convolutions"
" if the V8 API is not enabled or before cuDNN version 9.3+."
" Consider upgrading cuDNN and/or enabling the V8 API for better efficiency.");
return false;
}
}
if (!input.is_cuda() || !cudnn_enabled) {
return false;
}


@ -94,7 +94,7 @@ static void im2col_out_cpu_template(
output.resize_({batch_size, n_output_plane, output_length});
AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND2(kBFloat16, kHalf,
AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND3(kBFloat16, kHalf, kBool,
input.scalar_type(), "im2col_out_cpu", [&] {
Tensor input_n;
Tensor output_n;


@ -19,6 +19,7 @@
#include <ATen/native/ReduceOpsUtils.h>
#include <ATen/native/Resize.h>
#include <ATen/native/mkldnn/Matmul.h>
#include <ATen/native/mkldnn/Utils.h>
#include <c10/core/GradMode.h>
#include <c10/util/accumulate.h>
#include <c10/util/irange.h>
@ -1358,13 +1359,8 @@ static inline int64_t get_mkldnn_matmul_min_dim() {
static auto value = [&] {
const int64_t default_min_dim = [&] {
// Minimum dimension requirement for MKLDNN; derived based on experiments.
// By default, it's only enabled on Neoverse V1.
#if !defined(__s390x__) && !defined(__powerpc__)
if (cpuinfo_initialize() && cpuinfo_get_uarchs_count() == 1 && cpuinfo_get_uarch(0)->uarch == cpuinfo_uarch_neoverse_v1) {
return 8;
}
#endif
return 0;
// It's enabled on all Neoverse CPUs.
return is_arm_neoverse() ? 8 : 0;
}();
const char* ptr = std::getenv("TORCH_MKLDNN_MATMUL_MIN_DIM");
return ptr != nullptr ? std::atoi(ptr) : default_min_dim;
@ -1377,13 +1373,8 @@ static inline int64_t get_mkldnn_matmul_min_size() {
static auto value = [&] {
const int64_t default_min_size = [&] {
// Minimum size requirement for MKLDNN; derived based on experiments.
// By default, it's only enabled on Neoverse V1.
#if !defined(__s390x__) && !defined(__powerpc__)
if (cpuinfo_initialize() && cpuinfo_get_uarchs_count() == 1 && cpuinfo_get_uarch(0)->uarch == cpuinfo_uarch_neoverse_v1) {
return 8 * 1024;
}
#endif
return 0;
// It's enabled on all Neoverse CPUs.
return is_arm_neoverse() ? 8 * 1024 : 0;
}();
const char* ptr = std::getenv("TORCH_MKLDNN_MATMUL_MIN_SIZE");
return ptr != nullptr ? std::atoi(ptr) : default_min_size;


@ -284,7 +284,7 @@ void resize_bytes_nocuda(const Storage& storage, const c10::SymInt& newsize) {
} else if (device_type == at::kPrivateUse1) {
at::detail::getPrivateUse1Hooks().resizePrivateUse1Bytes(
storage, newsize.expect_int());
} else if (device_type == at::kXPU || device_type == at::kHPU) {
} else if (device_type == at::kXPU || device_type == at::kHPU || device_type == at::kMTIA) {
ptrdiff_t size_bytes_i = newsize.expect_int();
TORCH_CHECK(
!c10::overflows<int64_t>(size_bytes_i),


@ -102,7 +102,7 @@ void col2im_out_cuda_template(
output.resize_({batch_size, n_output_plane, output_height, output_width});
int64_t output_batch_stride = output.stride(0);
AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND2(kHalf, kBFloat16,
AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND3(kHalf, kBFloat16, kBool,
input.scalar_type(), "col2im_out_cuda", [&] {
int64_t height_col = (output_height + 2 * pad_height -
(dilation_height * (kernel_height - 1) + 1)) /


@ -103,7 +103,7 @@ static void im2col_out_cuda_template(
output.resize_({batch_size, n_output_plane, output_length});
// Launch kernel
AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND2(kHalf, kBFloat16,
AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND3(kHalf, kBFloat16, kBool,
input.scalar_type(), "im2col_out_cuda", [&] {
Tensor input_n;
Tensor output_n;


@ -89,10 +89,22 @@ const std::map<c10::string_view, ideep::algorithm>& fusion_binary_alg_map();
inline bool mkldnn_bf16_device_check_arm() {
return cpuinfo_initialize() && cpuinfo_has_arm_bf16();
}
inline bool is_arm_neoverse() {
return (cpuinfo_initialize() && cpuinfo_get_uarchs_count() == 1 &&
(cpuinfo_get_uarch(0)->uarch == cpuinfo_uarch_neoverse_v1 ||
cpuinfo_get_uarch(0)->uarch == cpuinfo_uarch_neoverse_v2 ||
cpuinfo_get_uarch(0)->uarch == cpuinfo_uarch_neoverse_n1 ||
cpuinfo_get_uarch(0)->uarch == cpuinfo_uarch_neoverse_n2));
}
#else
constexpr bool mkldnn_bf16_device_check_arm() {
return false;
}
constexpr bool is_arm_neoverse() {
return false;
}
#endif
#if AT_MKLDNN_ENABLED()


@ -3400,9 +3400,9 @@
- func: fbgemm_pack_gemm_matrix_fp16(Tensor input) -> Tensor
- func: wrapped_linear_prepack(Tensor weight, Tensor weight_scale, Tensor weight_zero_point, Tensor bias) -> Tensor
- func: _wrapped_linear_prepack(Tensor weight, Tensor weight_scale, Tensor weight_zero_point, Tensor bias) -> Tensor
- func: wrapped_quantized_linear_prepacked(Tensor input, Tensor input_scale, Tensor input_zero_point, Tensor packed_weight, Tensor output_scale, Tensor output_zero_point, int out_channel) -> Tensor
- func: _wrapped_quantized_linear_prepacked(Tensor input, Tensor input_scale, Tensor input_zero_point, Tensor packed_weight, Tensor output_scale, Tensor output_zero_point, int out_channel) -> Tensor
- func: fbgemm_linear_fp16_weight_fp32_activation(Tensor input, Tensor packed_weight, Tensor bias) -> Tensor


@ -436,12 +436,12 @@ at::Tensor wrapped_quantized_linear_meta(
#endif // USE_FBGEMM
}
at::Tensor wrapped_linear_prepack(const at::Tensor& weight,
at::Tensor _wrapped_linear_prepack(const at::Tensor& weight,
const at::Tensor& weight_scale,
const at::Tensor& weight_zero_point,
const at::Tensor& bias);
at::Tensor wrapped_linear_prepack(const at::Tensor& weight,
at::Tensor _wrapped_linear_prepack(const at::Tensor& weight,
const at::Tensor& weight_scale,
const at::Tensor& weight_zero_point,
const at::Tensor& bias) {
@ -474,14 +474,14 @@ at::Tensor wrapped_linear_prepack(const at::Tensor& weight,
#endif // USE_FBGEMM
}
at::Tensor wrapped_quantized_linear_prepacked(const at::Tensor& input, const at::Tensor& input_scale,
at::Tensor _wrapped_quantized_linear_prepacked(const at::Tensor& input, const at::Tensor& input_scale,
const at::Tensor& input_zero_point,
const at::Tensor& packed_weight,
const at::Tensor& output_scale,
const at::Tensor& output_zero_point,
[[maybe_unused]] const int64_t out_channel);
at::Tensor wrapped_quantized_linear_prepacked(const at::Tensor& input, const at::Tensor& input_scale,
at::Tensor _wrapped_quantized_linear_prepacked(const at::Tensor& input, const at::Tensor& input_scale,
const at::Tensor& input_zero_point,
const at::Tensor& packed_weight,
const at::Tensor& output_scale,
@ -507,12 +507,12 @@ at::Tensor wrapped_quantized_linear_prepacked(const at::Tensor& input, const at:
#endif // USE_FBGEMM
}
at::Tensor wrapped_linear_prepack_meta(const at::Tensor& weight,
at::Tensor _wrapped_linear_prepack_meta(const at::Tensor& weight,
[[maybe_unused]] const at::Tensor& weight_scale,
[[maybe_unused]] const at::Tensor& weight_zero_point,
[[maybe_unused]] const at::Tensor& bias);
at::Tensor wrapped_linear_prepack_meta(const at::Tensor& weight,
at::Tensor _wrapped_linear_prepack_meta(const at::Tensor& weight,
[[maybe_unused]] const at::Tensor& weight_scale,
[[maybe_unused]] const at::Tensor& weight_zero_point,
[[maybe_unused]] const at::Tensor& bias) {
@ -530,7 +530,7 @@ at::Tensor wrapped_linear_prepack_meta(const at::Tensor& weight,
#endif // USE_FBGEMM
}
at::Tensor wrapped_quantized_linear_prepacked_meta(const at::Tensor& input,
at::Tensor _wrapped_quantized_linear_prepacked_meta(const at::Tensor& input,
[[maybe_unused]] const at::Tensor& input_scale,
[[maybe_unused]] const at::Tensor& input_zero_point,
[[maybe_unused]] const at::Tensor& packed_weight,
@ -538,7 +538,7 @@ at::Tensor wrapped_quantized_linear_prepacked_meta(const at::Tensor& input,
[[maybe_unused]] const at::Tensor& output_zero_point,
const int64_t out_channel);
at::Tensor wrapped_quantized_linear_prepacked_meta(const at::Tensor& input,
at::Tensor _wrapped_quantized_linear_prepacked_meta(const at::Tensor& input,
[[maybe_unused]] const at::Tensor& input_scale,
[[maybe_unused]] const at::Tensor& input_zero_point,
[[maybe_unused]] const at::Tensor& packed_weight,
@ -695,21 +695,21 @@ TORCH_LIBRARY_IMPL(_quantized, CPU, m) {
m.impl(TORCH_SELECTIVE_NAME("_quantized::linear_prepack_fp16_legacy"), TORCH_FN(QLinearPackWeightFp16Legacy::run));
m.impl(TORCH_SELECTIVE_NAME("_quantized::wrapped_quantized_linear"), TORCH_FN(wrapped_quantized_linear));
m.impl(
TORCH_SELECTIVE_NAME("_quantized::wrapped_linear_prepack"),
wrapped_linear_prepack);
TORCH_SELECTIVE_NAME("_quantized::_wrapped_linear_prepack"),
_wrapped_linear_prepack);
m.impl(
TORCH_SELECTIVE_NAME("_quantized::wrapped_quantized_linear_prepacked"),
wrapped_quantized_linear_prepacked);
TORCH_SELECTIVE_NAME("_quantized::_wrapped_quantized_linear_prepacked"),
_wrapped_quantized_linear_prepacked);
}
TORCH_LIBRARY_IMPL(_quantized, Meta, m) {
m.impl(TORCH_SELECTIVE_NAME("_quantized::wrapped_quantized_linear"), TORCH_FN(wrapped_quantized_linear_meta));
m.impl(
TORCH_SELECTIVE_NAME("_quantized::wrapped_linear_prepack"),
wrapped_linear_prepack_meta);
TORCH_SELECTIVE_NAME("_quantized::_wrapped_linear_prepack"),
_wrapped_linear_prepack_meta);
m.impl(
TORCH_SELECTIVE_NAME("_quantized::wrapped_quantized_linear_prepacked"),
wrapped_quantized_linear_prepacked_meta);
TORCH_SELECTIVE_NAME("_quantized::_wrapped_quantized_linear_prepacked"),
_wrapped_quantized_linear_prepacked_meta);
}
TORCH_LIBRARY_IMPL(onednn, CPU, m) {


@ -251,8 +251,8 @@ TORCH_LIBRARY(_quantized, m) {
m.def(TORCH_SELECTIVE_SCHEMA("_quantized::wrapped_fbgemm_pack_gemm_matrix_fp16(Tensor W) -> Tensor"));
m.def(TORCH_SELECTIVE_SCHEMA("_quantized::wrapped_fbgemm_linear_fp16_weight(Tensor X, Tensor W, Tensor B, int out_channel) -> Tensor"));
m.def(TORCH_SELECTIVE_SCHEMA("_quantized::wrapped_quantized_linear(Tensor X, Tensor X_scale, Tensor X_zero_point, Tensor W, Tensor W_scale, Tensor W_zero_point, Tensor B, Tensor output_scale, Tensor output_zero_point, int out_channel) -> Tensor Y"));
m.def(TORCH_SELECTIVE_SCHEMA("_quantized::wrapped_linear_prepack(Tensor W, Tensor W_scale, Tensor W_zero_point, Tensor B) -> Tensor"));
m.def(TORCH_SELECTIVE_SCHEMA("_quantized::wrapped_quantized_linear_prepacked(Tensor X, Tensor X_scale, Tensor X_zero_point, Tensor W_prepack, Tensor output_scale, Tensor output_zero_point, int out_channel) -> Tensor Y"));
m.def(TORCH_SELECTIVE_SCHEMA("_quantized::_wrapped_linear_prepack(Tensor W, Tensor W_scale, Tensor W_zero_point, Tensor B) -> Tensor"));
m.def(TORCH_SELECTIVE_SCHEMA("_quantized::_wrapped_quantized_linear_prepacked(Tensor X, Tensor X_scale, Tensor X_zero_point, Tensor W_prepack, Tensor output_scale, Tensor output_zero_point, int out_channel) -> Tensor Y"));
}
TORCH_LIBRARY(onednn, m) {


@ -46,6 +46,15 @@
desc: |
This tag indicates that the operator should be passed Tensors following
the same stride permutation as observed in eager when compiled in inductor.
Only one of {needs_fixed_stride_order, flexible_layout} can apply; if
multiple are assigned then we assume the most restrictive one.
- tag: flexible_layout
desc: |
This tag indicates that the custom operator can accept inputs with varying
strides/storage_offset and that when compiled, Inductor is allowed to change
the strides/storage_offset of inputs to the custom operator.
Only one of {needs_fixed_stride_order, flexible_layout} can apply; if
multiple are assigned then we assume the most restrictive one.
# NOTE [Core ATen Ops]
- tag: core
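
A minimal sketch of how a custom-op author might opt into one of these two layout tags from Python (the operator name and schema below are hypothetical, and passing `tags=` to `torch.library.define` is assumed to be available as in recent PyTorch releases):

```python
import torch

# Hypothetical op "mylib::my_filter"; pick exactly one of the two layout tags.
torch.library.define(
    "mylib::my_filter",
    "(Tensor x) -> Tensor",
    tags=(torch.Tag.needs_fixed_stride_order,),  # or torch.Tag.flexible_layout
)

@torch.library.impl("mylib::my_filter", "cpu")
def my_filter_cpu(x):
    # With needs_fixed_stride_order, inductor passes x with the same stride
    # permutation eager mode would use; with flexible_layout it may change it.
    return x.clone()
```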


@ -90,7 +90,7 @@ detectron2_maskrcnn_r_50_fpn,fail_to_run,0
dlrm,fail_to_run,0
dlrm,pass,0



@ -90,7 +90,7 @@ detectron2_maskrcnn_r_50_fpn,fail_to_run,0
dlrm,fail_to_run,0
dlrm,pass,0



@ -74,7 +74,7 @@ detectron2_fasterrcnn_r_50_fpn,fail_to_run,0
dlrm,fail_to_run,0
dlrm,pass,0



@ -74,7 +74,7 @@ detectron2_fasterrcnn_r_50_fpn,fail_to_run,0
dlrm,fail_to_run,0
dlrm,pass,0



@ -3,7 +3,12 @@ import csv
import dataclasses
import os
from generate import run_llama2_7b_bf16, run_llama2_7b_int8, run_mixtral_8x7b_int8
from generate import (
get_arch_name,
run_llama2_7b_bf16,
run_llama2_7b_int8,
run_mixtral_8x7b_int8,
)
import torch
import torch.nn as nn
@ -24,6 +29,7 @@ class Experiment:
actual: float
dtype: str
device: str
arch: str # GPU name for CUDA or CPU arch for CPU
is_model: bool = False
@ -71,7 +77,12 @@ def run_mlp_layer_norm_gelu(device: str = "cuda"):
for _ in range(WARMUP_ITER):
compiled_mod(x)
us_per_iter = benchmarker.benchmark_gpu(lambda: compiled_mod(x)) * 1000
benchmark_fn = (
benchmarker.benchmark_gpu
if device == "cuda"
else benchmarker.benchmark_cpu
)
us_per_iter = benchmark_fn(lambda: compiled_mod(x)) * 1000
flops_utilization += us_per_iter * flops / 1e9 / A100_40G_BF16_TFLOPS
flops_utilization = flops_utilization / len(input_shapes)
@ -84,6 +95,7 @@ def run_mlp_layer_norm_gelu(device: str = "cuda"):
f"{flops_utilization:.02f}",
dtype_str,
device,
get_arch_name(),
)
)
return results
@ -108,7 +120,12 @@ def run_layer_norm(device: str = "cuda"):
for _ in range(WARMUP_ITER):
compiled_mod(x)
us_per_iter = benchmarker.benchmark_gpu(lambda: compiled_mod(x)) * 1000
benchmark_fn = (
benchmarker.benchmark_gpu
if device == "cuda"
else benchmarker.benchmark_cpu
)
us_per_iter = benchmark_fn(lambda: compiled_mod(x)) * 1000
memory_bandwidth += (1e6 / us_per_iter) * 2 * BS * D * dtype.itemsize / 1e9
memory_bandwidth = memory_bandwidth / len(input_shapes)
@ -121,6 +138,7 @@ def run_layer_norm(device: str = "cuda"):
f"{memory_bandwidth:.02f}",
dtype_str,
device,
get_arch_name(),
)
)
return results
@ -151,9 +169,12 @@ def run_gather_gemv(device: str = "cuda"):
for _ in range(WARMUP_ITER):
compiled_fn(W, score_idxs, x)
us_per_iter = (
benchmarker.benchmark_gpu(lambda: compiled_fn(W, score_idxs, x)) * 1000
benchmark_fn = (
benchmarker.benchmark_gpu
if device == "cuda"
else benchmarker.benchmark_cpu
)
us_per_iter = benchmark_fn(lambda: compiled_fn(W, score_idxs, x)) * 1000
memory_bandwidth += (1e6 / us_per_iter) * 2 * D * D * dtype.itemsize / 1e9
memory_bandwidth = memory_bandwidth / len(input_shapes)
@ -166,6 +187,7 @@ def run_gather_gemv(device: str = "cuda"):
f"{memory_bandwidth:.02f}",
dtype_str,
device,
get_arch_name(),
)
)
return results
@ -186,15 +208,20 @@ def run_gemv(device: str = "cuda"):
def gemv(W, x):
return W.to(x.dtype) @ x
W = torch.randn(D, D, device="cuda").to(dtype=dtype)
x = torch.randn(D, device="cuda", dtype=torch.bfloat16)
W = torch.randn(D, D, device=device).to(dtype=dtype)
x = torch.randn(D, device=device, dtype=torch.bfloat16)
compiled_fn = torch.compile(gemv, dynamic=False)
for _ in range(WARMUP_ITER):
compiled_fn(W, x)
us_per_iter = benchmarker.benchmark_gpu(lambda: compiled_fn(W, x)) * 1000
benchmark_fn = (
benchmarker.benchmark_gpu
if device == "cuda"
else benchmarker.benchmark_cpu
)
us_per_iter = benchmark_fn(lambda: compiled_fn(W, x)) * 1000
memory_bandwidth += (1e6 / us_per_iter) * D * D * dtype.itemsize / 1e9
memory_bandwidth = memory_bandwidth / len(input_shapes)
@ -207,6 +234,7 @@ def run_gemv(device: str = "cuda"):
f"{memory_bandwidth:.02f}",
dtype_str,
device,
get_arch_name(),
)
)
return results
@ -252,7 +280,13 @@ def main(output_file=DEFAULT_OUTPUT_FILE):
results = []
for func in all_experiments:
lst = func()
try:
device = "cuda" if torch.cuda.is_available() else "cpu"
except AssertionError:
# This happens when torch is compiled with CUDA turned off completely
device = "cpu"
lst = func(device)
for x in lst:
results.append(dataclasses.astuple(x))


@ -1,5 +1,6 @@
import dataclasses
import itertools
import platform
import time
from typing import Optional, Tuple
@ -41,6 +42,14 @@ def device_sync(device):
print(f"device={device} is not yet suppported")
def get_arch_name() -> str:
if torch.cuda.is_available():
return torch.cuda.get_device_name()
else:
# This returns x86_64 or arm64 (for aarch64)
return platform.machine()
def multinomial_sample_one_no_sync(
probs_sort,
): # Does multinomial sampling without a cuda synchronization
@ -198,7 +207,7 @@ def run_experiment(
) -> None:
print(f"Loading model {x.name}")
t0 = time.time()
model = _load_model(x)
model = _load_model(x, device=device)
device_sync(device=device) # MKG
print(f"Time to load model: {time.time() - t0:.02f} seconds")
@ -257,7 +266,9 @@ def run_llama2_7b_bf16(device: str = "cuda"):
1253,
162,
)
token_per_sec, memory_bandwidth, compilation_time = run_experiment(model)
token_per_sec, memory_bandwidth, compilation_time = run_experiment(
model, device=device
)
return [
Experiment(
model.name,
@ -266,6 +277,7 @@ def run_llama2_7b_bf16(device: str = "cuda"):
f"{token_per_sec:.02f}",
model.mode,
device,
get_arch_name(),
True,
),
Experiment(
@ -275,6 +287,7 @@ def run_llama2_7b_bf16(device: str = "cuda"):
f"{memory_bandwidth:.02f}",
model.mode,
device,
get_arch_name(),
True,
),
Experiment(
@ -284,6 +297,7 @@ def run_llama2_7b_bf16(device: str = "cuda"):
f"{compilation_time:.02f}",
model.mode,
device,
get_arch_name(),
True,
),
]
@ -302,7 +316,9 @@ def run_llama2_7b_int8(device: str = "cuda"):
957,
172,
)
token_per_sec, memory_bandwidth, compilation_time = run_experiment(model)
token_per_sec, memory_bandwidth, compilation_time = run_experiment(
model, device=device
)
return [
Experiment(
model.name,
@ -311,6 +327,7 @@ def run_llama2_7b_int8(device: str = "cuda"):
f"{token_per_sec:.02f}",
model.mode,
device,
get_arch_name(),
True,
),
Experiment(
@ -320,6 +337,7 @@ def run_llama2_7b_int8(device: str = "cuda"):
f"{memory_bandwidth:.02f}",
model.mode,
device,
get_arch_name(),
True,
),
Experiment(
@ -329,6 +347,7 @@ def run_llama2_7b_int8(device: str = "cuda"):
f"{compilation_time:.02f}",
model.mode,
device,
get_arch_name(),
True,
),
]
@ -348,7 +367,9 @@ def run_mixtral_8x7b_int8(device: str = "cuda"):
1130,
162,
)
token_per_sec, memory_bandwidth, compilation_time = run_experiment(model)
token_per_sec, memory_bandwidth, compilation_time = run_experiment(
model, device=device
)
return [
Experiment(
model.name,
@ -357,6 +378,7 @@ def run_mixtral_8x7b_int8(device: str = "cuda"):
f"{token_per_sec:.02f}",
model.mode,
device,
get_arch_name(),
True,
),
Experiment(
@ -366,6 +388,7 @@ def run_mixtral_8x7b_int8(device: str = "cuda"):
f"{memory_bandwidth:.02f}",
model.mode,
device,
get_arch_name(),
True,
),
Experiment(
@ -375,6 +398,7 @@ def run_mixtral_8x7b_int8(device: str = "cuda"):
f"{compilation_time:.02f}",
model.mode,
device,
get_arch_name(),
True,
),
]


@ -149,6 +149,8 @@ const char* toString(DispatchKey t) {
return "AutocastXLA";
case DispatchKey::AutocastPrivateUse1:
return "AutocastPrivateUse1";
case DispatchKey::AutocastMPS:
return "AutocastMPS";
case DispatchKey::FuncTorchBatched:
return "FuncTorchBatched";
@ -297,6 +299,7 @@ c10::DispatchKey parseDispatchKey(const std::string& k) {
{"AutocastCUDA", c10::DispatchKey::AutocastCUDA},
{"AutocastXLA", c10::DispatchKey::AutocastXLA},
{"AutocastPrivateUse1", c10::DispatchKey::AutocastPrivateUse1},
{"AutocastMPS", c10::DispatchKey::AutocastMPS},
{"FuncTorchBatched", c10::DispatchKey::FuncTorchBatched},
{"BatchedNestedTensor", c10::DispatchKey::BatchedNestedTensor},
{"FuncTorchVmapMode", c10::DispatchKey::FuncTorchVmapMode},


@ -359,6 +359,7 @@ enum class DispatchKey : uint16_t {
AutocastXLA,
// AutocastXLA is only being used for TPUs. XLA GPUs continue to use
// AutocastCUDA.
AutocastMPS,
AutocastCUDA,
AutocastPrivateUse1,


@ -655,6 +655,7 @@ constexpr DispatchKeySet autograd_dispatch_keyset = DispatchKeySet({
constexpr DispatchKeySet autocast_dispatch_keyset = DispatchKeySet({
DispatchKey::AutocastCPU,
DispatchKey::AutocastMPS,
DispatchKey::AutocastCUDA,
DispatchKey::AutocastXPU,
DispatchKey::AutocastIPU,
@ -671,6 +672,7 @@ constexpr DispatchKeySet default_included_set = DispatchKeySet({
constexpr DispatchKeySet default_excluded_set = DispatchKeySet({
DispatchKey::AutocastCPU,
DispatchKey::AutocastMPS,
DispatchKey::AutocastCUDA,
DispatchKey::AutocastXPU,
DispatchKey::AutocastIPU,
@ -863,6 +865,7 @@ inline DispatchKeySet getAutocastRelatedKeySetFromBackend(BackendComponent t) {
constexpr auto autocast_xla_ks = DispatchKeySet(DispatchKey::AutocastXLA);
constexpr auto autocast_privateuse1_ks =
DispatchKeySet(DispatchKey::AutocastPrivateUse1);
constexpr auto autocast_mps_ks = DispatchKeySet(DispatchKey::AutocastMPS);
switch (t) {
case BackendComponent::CPUBit:
return autocast_cpu_ks;
@ -878,6 +881,8 @@ inline DispatchKeySet getAutocastRelatedKeySetFromBackend(BackendComponent t) {
return autocast_xla_ks;
case BackendComponent::PrivateUse1Bit:
return autocast_privateuse1_ks;
case BackendComponent::MPSBit:
return autocast_mps_ks;
default:
return DispatchKeySet();
}
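
Taken together, these changes register an AutocastMPS dispatch key and fold it into the autocast-related key sets. At the Python level the intended effect is presumably that `torch.autocast` can target MPS like the other backends; a sketch (assumes a build with MPS available, e.g. Apple Silicon, and that the Python API accepts `device_type="mps"` once this key is wired up):

```python
import torch

x = torch.randn(8, 8, device="mps")
w = torch.randn(8, 8, device="mps")

with torch.autocast(device_type="mps", dtype=torch.float16):
    y = x @ w  # matmul runs in float16 under the MPS autocast key

print(y.dtype)  # expected: torch.float16
```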


@ -6,6 +6,7 @@
#include <c10/cuda/CUDAFunctions.h>
#include <c10/cuda/CUDAGuard.h>
#include <c10/util/CallOnce.h>
#include <c10/util/Gauge.h>
#include <c10/util/ScopeExit.h>
#include <c10/util/UniqueVoidPtr.h>
#include <c10/util/flat_hash_map.h>
@ -1429,6 +1430,12 @@ class DeviceCachingAllocator {
if (block->size >= CUDAAllocatorConfig::max_split_size())
stats.oversize_allocations.increase(1);
auto allocated_bytes_gauge =
STATIC_GAUGE(pytorch.CUDACachingAllocator.allocated_bytes);
allocated_bytes_gauge.record(
stats.allocated_bytes[static_cast<int64_t>(StatType::AGGREGATE)]
.current);
c10::reportMemoryUsageToProfiler(
block->ptr,
static_cast<int64_t>(block->size),
@ -1456,6 +1463,11 @@ class DeviceCachingAllocator {
stats.allocation[stat_type].decrease(1);
stats.allocated_bytes[stat_type].decrease(block->size);
});
auto allocated_bytes_gauge =
STATIC_GAUGE(pytorch.CUDACachingAllocator.allocated_bytes);
allocated_bytes_gauge.record(
stats.allocated_bytes[static_cast<int64_t>(StatType::AGGREGATE)]
.current);
record_trace(
TraceEntry::FREE_REQUESTED,
@ -2245,6 +2257,11 @@ class DeviceCachingAllocator {
for_each_selected_stat_type(stat_types, [&](size_t stat_type) {
stats.reserved_bytes[stat_type].increase(mapped_range.size);
});
auto reserved_bytes_gauge =
STATIC_GAUGE(pytorch.CUDACachingAllocator.reserved_bytes);
reserved_bytes_gauge.record(
stats.reserved_bytes[static_cast<int64_t>(StatType::AGGREGATE)]
.current);
stats.num_device_alloc++;
record_trace(
@ -2683,6 +2700,11 @@ class DeviceCachingAllocator {
});
if (size >= CUDAAllocatorConfig::max_split_size())
stats.oversize_segments.increase(1);
auto reserved_bytes_gauge =
STATIC_GAUGE(pytorch.CUDACachingAllocator.reserved_bytes);
reserved_bytes_gauge.record(
stats.reserved_bytes[static_cast<int64_t>(StatType::AGGREGATE)]
.current);
// p.block came from new, not cudaMalloc. It should not be nullptr here.
TORCH_INTERNAL_ASSERT(p.block != nullptr && p.block->ptr != nullptr);
@ -2820,6 +2842,11 @@ class DeviceCachingAllocator {
stats.segment[stat_type].decrease(1);
stats.reserved_bytes[stat_type].decrease(block->size);
});
auto reserved_bytes_gauge =
STATIC_GAUGE(pytorch.CUDACachingAllocator.reserved_bytes);
reserved_bytes_gauge.record(
stats.reserved_bytes[static_cast<int64_t>(StatType::AGGREGATE)]
.current);
if (block->size >= CUDAAllocatorConfig::max_split_size())
stats.oversize_segments.decrease(1);
@ -2876,6 +2903,11 @@ class DeviceCachingAllocator {
for_each_selected_stat_type(stat_types, [&](size_t stat_type) {
stats.reserved_bytes[stat_type].decrease(unmapped.size);
});
auto reserved_bytes_gauge =
STATIC_GAUGE(pytorch.CUDACachingAllocator.reserved_bytes);
reserved_bytes_gauge.record(
stats.reserved_bytes[static_cast<int64_t>(StatType::AGGREGATE)]
.current);
if (block->pool->owner_PrivatePool) {
// The cudaFreed block belonged to a CUDA graph's PrivatePool.


@ -68,6 +68,12 @@ template <
inline T expm1(T a) {
return std::expm1(float(a));
}
template <
typename T,
typename std::enable_if_t<is_reduced_floating_point_v<T>, int> = 0>
inline bool isfinite(T a) {
return std::isfinite(float(a));
}
template <
typename T,
typename std::enable_if_t<is_reduced_floating_point_v<T>, int> = 0>
@ -237,10 +243,9 @@ C10_HOST_DEVICE inline T nextafter(T from, T to) {
// Reference:
// https://git.musl-libc.org/cgit/musl/tree/src/math/nextafter.c
using int_repr_t = uint16_t;
using float_t = T;
constexpr uint8_t bits = 16;
union {
float_t f;
T f;
int_repr_t i;
} ufrom = {from}, uto = {to};


@ -261,19 +261,19 @@ struct alignas(sizeof(T) * 2) complex {
#endif
if (abs_c >= abs_d) {
if (abs_c == 0 && abs_d == 0) {
if (abs_c == U(0) && abs_d == U(0)) {
/* divide by zeros should yield a complex inf or nan */
real_ = a / abs_c;
imag_ = b / abs_d;
} else {
auto rat = d / c;
auto scl = 1.0 / (c + d * rat);
auto scl = U(1.0) / (c + d * rat);
real_ = (a + b * rat) * scl;
imag_ = (b - a * rat) * scl;
}
} else {
auto rat = c / d;
auto scl = 1.0 / (d + c * rat);
auto scl = U(1.0) / (d + c * rat);
real_ = (a * rat + b) * scl;
imag_ = (b * rat - a) * scl;
}


@ -9,6 +9,8 @@
namespace c10::xpu::XPUCachingAllocator {
using namespace c10::CachingDeviceAllocator;
// Newly allocated memory is aligned to 512 bytes.
constexpr size_t kDeviceAlignment = 512;
// all sizes are rounded to at least 512 bytes
@ -117,6 +119,7 @@ struct AllocParams {
BlockPool* pool;
size_t alloc_size;
Block* block;
StatTypes stat_types = {};
};
} // anonymous namespace
@ -124,6 +127,7 @@ struct AllocParams {
class DeviceCachingAllocator {
private:
mutable std::recursive_mutex mutex;
DeviceStats stats;
BlockPool large_blocks; // unallocated cached blocks larger than 1 MB
BlockPool small_blocks; // unallocated cached blocks 1 MB or smaller
ska::flat_hash_set<Block*> active_blocks; // allocated or in use by a stream
@ -173,6 +177,12 @@ class DeviceCachingAllocator {
active_blocks.erase(block);
bool inserted = pool.blocks.insert(block).second;
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(inserted);
StatTypes stat_types = get_stat_types_for_pool(pool);
for_each_selected_stat_type(stat_types, [&](size_t stat_type) {
stats.active_bytes[stat_type].decrease(block->size);
stats.requested_bytes[stat_type].decrease(block->requested_size);
});
}
void process_events() {
@ -250,6 +260,9 @@ class DeviceCachingAllocator {
return false;
}
p.block = new Block(device, p.queue(), size, p.pool, ptr);
for_each_selected_stat_type(p.stat_types, [&](size_t stat_type) {
stats.reserved_bytes[stat_type].increase(size);
});
return true;
}
@ -281,6 +294,12 @@ class DeviceCachingAllocator {
sycl::free(block->ptr, xpu::get_device_context());
auto* pool = block->pool;
pool->blocks.erase(block);
StatTypes stat_types = get_stat_types_for_pool(*pool);
for_each_selected_stat_type(stat_types, [&](size_t stat_type) {
stats.reserved_bytes[stat_type].decrease(block->size);
});
delete block;
}
@ -314,6 +333,14 @@ class DeviceCachingAllocator {
}
}
StatTypes get_stat_types_for_pool(const BlockPool& pool) {
StatTypes stat_types = {};
stat_types[static_cast<size_t>(StatType::AGGREGATE)] = true;
stat_types[static_cast<size_t>(
pool.is_small ? StatType::SMALL_POOL : StatType::LARGE_POOL)] = true;
return stat_types;
}
Block* alloc_found_block(
AllocParams params,
size_t orig_size,
@ -350,6 +377,12 @@ class DeviceCachingAllocator {
bool inserted = active_blocks.insert(block).second;
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(inserted)
for_each_selected_stat_type(params.stat_types, [&](size_t stat_type) {
stats.allocated_bytes[stat_type].increase(block->size);
stats.active_bytes[stat_type].increase(block->size);
stats.requested_bytes[stat_type].increase(block->requested_size);
});
return block;
}
@ -376,6 +409,7 @@ class DeviceCachingAllocator {
auto& pool = get_pool(size);
const size_t alloc_size = get_allocation_size(size);
AllocParams params(device, size, &queue, &pool, alloc_size);
params.stat_types = get_stat_types_for_pool(pool);
// First, try to get a block from the existing pool.
bool block_found = get_free_block(params);
@ -384,9 +418,32 @@ class DeviceCachingAllocator {
block_found = alloc_block(params) ||
(release_cached_blocks() && alloc_block(params));
}
TORCH_CHECK(
block_found,
"XPU out of memory, please use `empty_cache` to release all unoccupied cached memory.");
if (!block_found) {
c10::xpu::DeviceProp device_prop;
c10::xpu::get_device_properties(&device_prop, device);
auto device_total = device_prop.global_mem_size;
auto allocated_bytes =
stats.allocated_bytes[static_cast<size_t>(StatType::AGGREGATE)]
.current;
auto reserved_bytes =
stats.reserved_bytes[static_cast<size_t>(StatType::AGGREGATE)]
.current;
TORCH_CHECK_WITH(
OutOfMemoryError,
false,
"XPU out of memory. Tried to allocate ",
format_size(alloc_size),
". GPU ",
static_cast<int>(device),
" has a total capacity of ",
format_size(device_total),
". Of the allocated memory ",
format_size(allocated_bytes),
" is allocated by PyTorch, and ",
format_size(reserved_bytes - allocated_bytes),
" is reserved by PyTorch but unallocated.",
" Please use `empty_cache` to release all unoccupied cached memory.");
}
bool split_remainder = should_split(params.block, params.size());
return alloc_found_block(std::move(params), orig_size, split_remainder);
}
@ -395,6 +452,11 @@ class DeviceCachingAllocator {
std::scoped_lock<std::recursive_mutex> lock(mutex);
block->allocated = false;
StatTypes stat_types = get_stat_types_for_pool(*block->pool);
for_each_selected_stat_type(stat_types, [&](size_t stat_type) {
stats.allocated_bytes[stat_type].decrease(block->size);
});
if (!block->stream_uses.empty()) {
insert_events(block);
} else {
@ -414,6 +476,35 @@ class DeviceCachingAllocator {
std::scoped_lock<std::recursive_mutex> lock(mutex);
release_cached_blocks();
}
DeviceStats getStats() {
std::scoped_lock<std::recursive_mutex> lock(mutex);
return stats;
}
void resetAccumulatedStats() {
std::scoped_lock<std::recursive_mutex> lock(mutex);
for (const auto statType :
c10::irange(static_cast<size_t>(StatType::NUM_TYPES))) {
stats.allocated_bytes[statType].reset_accumulated();
stats.reserved_bytes[statType].reset_accumulated();
stats.active_bytes[statType].reset_accumulated();
stats.requested_bytes[statType].reset_accumulated();
}
}
void resetPeakStats() {
std::scoped_lock<std::recursive_mutex> lock(mutex);
for (const auto statType :
c10::irange(static_cast<size_t>(StatType::NUM_TYPES))) {
stats.allocated_bytes[statType].reset_peak();
stats.reserved_bytes[statType].reset_peak();
stats.active_bytes[statType].reset_peak();
stats.requested_bytes[statType].reset_peak();
}
}
};
void local_raw_delete(void* ptr);
@ -547,6 +638,30 @@ class XPUAllocator : public Allocator {
void copy_data(void* dest, const void* src, std::size_t count) const final {
xpu::getCurrentXPUStream().queue().memcpy(dest, src, count);
}
void assertValidDevice(DeviceIndex device) {
const auto device_num = device_allocators.size();
TORCH_CHECK(
0 <= device && device < static_cast<int64_t>(device_num),
"Invalid device argument ",
device,
": did you call init?");
}
DeviceStats getDeviceStats(DeviceIndex device) {
assertValidDevice(device);
return device_allocators[device]->getStats();
}
void resetPeakStats(DeviceIndex device) {
assertValidDevice(device);
device_allocators[device]->resetPeakStats();
}
void resetAccumulatedStats(DeviceIndex device) {
assertValidDevice(device);
device_allocators[device]->resetAccumulatedStats();
}
};
static XPUAllocator allocator;
@ -567,6 +682,18 @@ void emptyCache() {
return allocator.emptyCache();
}
void resetPeakStats(DeviceIndex device) {
return allocator.resetPeakStats(device);
}
void resetAccumulatedStats(DeviceIndex device) {
return allocator.resetAccumulatedStats(device);
}
DeviceStats getDeviceStats(DeviceIndex device) {
return allocator.getDeviceStats(device);
}
void* raw_alloc(size_t size) {
return allocator.raw_alloc(size);
}


@ -1,6 +1,6 @@
#pragma once
#include <c10/core/Allocator.h>
#include <c10/core/CachingDeviceAllocator.h>
#include <c10/xpu/XPUStream.h>
namespace c10::xpu::XPUCachingAllocator {
@ -11,6 +11,13 @@ C10_XPU_API void init(DeviceIndex device_count);
C10_XPU_API void emptyCache();
C10_XPU_API void resetPeakStats(DeviceIndex device);
C10_XPU_API void resetAccumulatedStats(DeviceIndex device);
C10_XPU_API c10::CachingDeviceAllocator::DeviceStats getDeviceStats(
DeviceIndex device);
C10_XPU_API void* raw_alloc(size_t size);
C10_XPU_API void raw_delete(void* ptr);


@ -3,6 +3,7 @@
#include <algorithm>
#include <sstream>
#include <vector>
#include <cstdint>
namespace caffe2 {


@ -43,7 +43,9 @@ IF(NOT MKLDNN_FOUND)
endif()
endif()
if(LINUX)
set(ABI_NEUTRAL_FLAGS -fpreview-breaking-changes)
set(DNNL_CXX_FLAGS "-DCMAKE_CXX_FLAGS=-fpreview-breaking-changes")
else()
set(DNNL_CXX_FLAGS "")
endif()
ExternalProject_Add(xpu_mkldnn_proj
SOURCE_DIR ${MKLDNN_ROOT}
@ -51,7 +53,7 @@ IF(NOT MKLDNN_FOUND)
BUILD_IN_SOURCE 0
CMAKE_ARGS -DCMAKE_C_COMPILER=icx
-DCMAKE_CXX_COMPILER=${SYCL_CXX_DRIVER}
-DCMAKE_CXX_FLAGS=${ABI_NEUTRAL_FLAGS}
${DNNL_CXX_FLAGS}
-DDNNL_GPU_RUNTIME=SYCL
-DDNNL_CPU_RUNTIME=THREADPOOL
-DDNNL_BUILD_TESTS=OFF
@ -85,13 +87,18 @@ IF(NOT MKLDNN_FOUND)
SET(ONEDNN_BUILD_GRAPH ON CACHE BOOL "" FORCE)
ENDIF(NOT APPLE AND NOT WIN32 AND NOT BUILD_LITE_INTERPRETER)
IF(EXISTS "${MKLDNN_ROOT}/include/oneapi/dnnl/dnnl_ukernel.hpp")
MESSAGE("-- Will build oneDNN UKERNEL")
SET(DNNL_EXPERIMENTAL_UKERNEL ON CACHE BOOL "" FORCE)
ENDIF(EXISTS "${MKLDNN_ROOT}/include/oneapi/dnnl/dnnl_ukernel.hpp")
FIND_PACKAGE(BLAS)
FIND_PATH(IDEEP_INCLUDE_DIR ideep.hpp PATHS ${IDEEP_ROOT} PATH_SUFFIXES include)
FIND_PATH(MKLDNN_INCLUDE_DIR dnnl.hpp dnnl.h PATHS ${MKLDNN_ROOT} PATH_SUFFIXES include/oneapi/dnnl)
FIND_PATH(MKLDNN_INCLUDE_DIR dnnl.hpp dnnl.h dnnl_ukernel.hpp dnnl_ukernel.h PATHS ${MKLDNN_ROOT} PATH_SUFFIXES include/oneapi/dnnl)
IF(NOT MKLDNN_INCLUDE_DIR)
MESSAGE("MKLDNN_INCLUDE_DIR not found")
EXECUTE_PROCESS(COMMAND git${CMAKE_EXECUTABLE_SUFFIX} submodule update --init mkl-dnn WORKING_DIRECTORY ${IDEEP_ROOT})
FIND_PATH(MKLDNN_INCLUDE_DIR dnnl.hpp dnnl.h PATHS ${MKLDNN_ROOT} PATH_SUFFIXES include)
FIND_PATH(MKLDNN_INCLUDE_DIR dnnl.hpp dnnl.h dnnl_ukernel.hpp dnnl_ukernel.h PATHS ${MKLDNN_ROOT} PATH_SUFFIXES include)
ENDIF(NOT MKLDNN_INCLUDE_DIR)
IF(BUILD_ONEDNN_GRAPH)
FIND_PATH(LLGA_INCLUDE_DIR dnnl_graph.hpp PATHS ${LLGA_ROOT} PATH_SUFFIXES include/oneapi/dnnl)


@ -283,9 +283,11 @@ The following ops are currently supported:
kron
meshgrid
narrow
nn.functional.unfold
ravel
select
split
stack
t
transpose
vsplit
@ -294,6 +296,7 @@ The following ops are currently supported:
Tensor.expand_as
Tensor.reshape
Tensor.reshape_as
Tensor.unfold
Tensor.view
.. This module needs to be documented. Adding here in the meantime
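
As a quick illustration of the newly listed ops, a sketch using the prototype masked-tensor API (exact import path and semantics may differ across releases):

```python
import torch
from torch.masked import masked_tensor

data = torch.arange(6.0).reshape(2, 3)
mask = torch.tensor([[True, False, True], [True, True, False]])
mt = masked_tensor(data, mask)

# Tensor.unfold and stack are among the ops listed as supported above.
windows = mt.unfold(1, 2, 1)            # sliding windows of size 2 along dim 1
stacked = torch.stack([mt, mt], dim=0)  # stack two masked tensors
print(windows.shape, stacked.shape)
```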


@ -398,3 +398,4 @@ The following utility functions are related to serialization:
.. autofunction:: clear_safe_globals
.. autofunction:: get_safe_globals
.. autoclass:: safe_globals
.. autoclass:: skip_data
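
For context, `skip_data` is documented here as a serialization helper; a minimal sketch of the expected usage (assuming it is a context manager around `torch.save` that writes the checkpoint structure without the tensor bytes):

```python
import torch

state = {"weight": torch.randn(4, 4)}

# Writes the archive layout and tensor metadata but skips the storage bytes,
# so the file can be filled in (or inspected) without materializing real data.
with torch.serialization.skip_data():
    torch.save(state, "skeleton.pt")
```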


@ -28,7 +28,6 @@ The exporter is designed to be modular and extensible. It is composed of the fol
- **FX Graph Extractor**: :class:`FXGraphExtractor` extracts the FX graph from the PyTorch model.
- **Fake Mode**: :class:`ONNXFakeContext` is a context manager that enables fake mode for large scale models.
- **ONNX Program**: :class:`ONNXProgram` is the output of the exporter that contains the exported ONNX graph and diagnostics.
- **ONNX Program Serializer**: :class:`ONNXProgramSerializer` serializes the exported model to a file.
- **ONNX Diagnostic Options**: :class:`DiagnosticOptions` has a set of options that control the diagnostics emitted by the exporter.
Dependencies
@ -144,15 +143,9 @@ API Reference
.. autoclass:: torch.onnx.ONNXProgram
:members:
.. autoclass:: torch.onnx.ONNXProgramSerializer
:members:
.. autoclass:: torch.onnx.ONNXRuntimeOptions
:members:
.. autoclass:: torch.onnx.InvalidExportOptionsError
:members:
.. autoclass:: torch.onnx.OnnxExporterError
:members:
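
With `ONNXProgramSerializer` removed from the public surface, serialization is expected to go through the `ONNXProgram` object itself. A sketch of that workflow (model, input, and file names are placeholders; assumes the ONNX export dependencies are installed):

```python
import torch

class TinyModel(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x + 1)

model = TinyModel()
example_input = torch.randn(2, 3)

# The dynamo-based exporter returns an ONNXProgram; saving is a method on it
# rather than a separate serializer class.
onnx_program = torch.onnx.dynamo_export(model, example_input)
onnx_program.save("tiny_model.onnx")
```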


@ -13,7 +13,6 @@ torch.xpu
device
device_count
device_of
empty_cache
get_device_capability
get_device_name
get_device_properties
@ -51,7 +50,25 @@ Streams and events
Stream
Memory management
-----------------
.. autosummary::
:toctree: generated
:nosignatures:
empty_cache
max_memory_allocated
max_memory_reserved
memory_allocated
memory_reserved
memory_stats
memory_stats_as_nested_dict
reset_accumulated_memory_stats
reset_peak_memory_stats
.. This module needs to be documented. Adding here in the meantime
.. for tracking purposes
.. py:module:: torch.xpu.memory
.. py:module:: torch.xpu.random
.. py:module:: torch.xpu.streams


@ -1,6 +1,6 @@
# Python dependencies required for development
astunparse
expecttest!=0.2.0
expecttest>=0.2.1
hypothesis
numpy
psutil


@ -93,7 +93,7 @@ annotations from the example above one would write:
* `CHECK-COUNT-EXACTLY-<num>: <pattern>`
Scans the input and succeeds when a line containing exactly `NUM` entries of
`PATTERN` is found.
* `CHECK-DAG: pattern`
* `CHECK-DAG: <pattern>`
Works similar to the usual `CHECK` pragma, but also matches if there exists a
way to reorder the CHECK-DAG pragmas to satisfy all patterns.
For example the following pattern:
@ -110,3 +110,18 @@ annotations from the example above one would write:
bar
end
```
* `CHECK-SOURCE-HIGHLIGHTED: <pattern>`
Checks for highlighted source ranges. This is useful when writing tests about generated error messages that require source-code highlighting.
For example the following pattern:
```
# CHECK-SOURCE-HIGHLIGHTED: raise Exception("raised exception
```
would match the following input:
```
def method_that_raises() -> torch.Tensor:
raise Exception("raised exception") # noqa: TRY002
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
builtins.Exception: raised exception
```
* `CHECK-REGEX: <pattern>`
Scans the input until `PATTERN` is matched; accepts the regular-expression syntax of `std::regex`.
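For example, the following illustrative pattern (not taken from an existing test) would match input lines such as `dim=3` or `dim=128`:
```
# CHECK-REGEX: dim=[0-9]+
```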


@ -0,0 +1,111 @@
# Owner(s): ["oncall: distributed"]
import copy
import torch
import torch.nn as nn
from torch.amp.grad_scaler import GradScaler, OptState
from torch.distributed._composable.fsdp import fully_shard
from torch.distributed._tensor import init_device_mesh
from torch.distributed.tensor.parallel import (
ColwiseParallel,
parallelize_module,
RowwiseParallel,
)
from torch.testing._internal.common_distributed import skip_if_lt_x_gpu
from torch.testing._internal.common_fsdp import FSDPTest, MLP
from torch.testing._internal.common_utils import run_tests, skipIfRocm
class TestFullyShardGradientScaler(FSDPTest):
@skip_if_lt_x_gpu(4)
@skipIfRocm
def test_gradient_scaler(self):
self.run_subtests(
{"has_inf": [True, False], "test_2d": [True, False]},
self._test_gradient_scaler,
)
def _test_gradient_scaler(self, has_inf: bool, test_2d: bool):
torch.manual_seed(0)
model = nn.Sequential(
*[nn.Linear(4, 4, device="cuda", bias=False) for _ in range(2)]
)
for layer in model:
fully_shard(layer)
fully_shard(model)
input = torch.randn([4, 4], device="cuda")
if test_2d:
mesh_2d = init_device_mesh(
"cuda", (2, self.world_size // 2), mesh_dim_names=("dp", "tp")
)
dp_mesh, tp_mesh = mesh_2d["dp"], mesh_2d["tp"]
model = nn.Sequential(MLP(2), MLP(2), MLP(2))
tp_parallelize_plan = {
"0.in_proj": ColwiseParallel(),
"0.out_proj": RowwiseParallel(),
"1.in_proj": ColwiseParallel(),
"1.out_proj": RowwiseParallel(),
"2.in_proj": ColwiseParallel(),
"2.out_proj": RowwiseParallel(),
}
model = parallelize_module(
model,
device_mesh=tp_mesh,
parallelize_plan=tp_parallelize_plan,
)
for module in model:
fully_shard(module, mesh=dp_mesh)
fully_shard(model, mesh=dp_mesh)
input = torch.randn((2,), device="cuda")
loss = model(input).sum()
scaler = GradScaler(init_scale=2.0, enabled=True)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
scaler.scale(loss).backward()
inv_scale = scaler._scale.double().reciprocal().float()
if (
has_inf is True
and opt.param_groups[0]["params"][0].grad._local_tensor.device.index == 1
):
opt.param_groups[0]["params"][0].grad._local_tensor[0, 0].fill_(
float("inf")
)
inital_grad = opt.param_groups[0]["params"][0].grad.to_local().clone()
scaler.unscale_(opt)
for found_inf in scaler._per_optimizer_states[id(opt)][
"found_inf_per_device"
].values():
self.assertEqual(found_inf, has_inf)
self.assertEqual(
scaler._per_optimizer_states[id(opt)]["stage"].value,
OptState.UNSCALED.value,
)
unscaled_grad = opt.param_groups[0]["params"][0].grad.to_local().clone()
self.assertEqual(unscaled_grad, inital_grad * inv_scale)
initial_scale = scaler.get_scale()
initial_state = copy.copy(opt.state)
scaler.step(opt)
steped_state = copy.copy(opt.state)
if has_inf:
# assert parameters are the same before/after
self.assertEqual(steped_state, initial_state)
else:
# new parameters here if no inf found during .unscale_()
self.assertNotEqual(steped_state.items(), initial_state.items())
scaler.update()
updated_scale = scaler.get_scale()
if has_inf:
# assert scale is updated
backoff_factor = scaler.get_backoff_factor()
self.assertEqual(updated_scale, initial_scale * backoff_factor)
else:
# scale is not updated
self.assertEqual(updated_scale, initial_scale)
if __name__ == "__main__":
run_tests()


@ -32,6 +32,7 @@ from torch.testing._internal.common_distributed import (
skip_if_rocm,
)
from torch.testing._internal.common_utils import run_tests
from torch.testing._internal.distributed.fake_pg import FakeStore
from torch.utils._triton import has_triton
from torch.utils.checkpoint import checkpoint
@ -367,35 +368,28 @@ class ReplicateTest(MultiProcessInductorTestCase):
fc.run(code)
class DDP_TP_Test(MultiProcessInductorTestCase):
@property
def world_size(self) -> int:
return min(4, torch.cuda.device_count())
class DDP_TP_Test(InductorTestCase):
def setUp(self):
self.rank = 0
self.world_size = 4
torch.cuda.set_device("cuda:0")
def setUp(self) -> None:
super().setUp()
self._spawn_processes()
store = FakeStore()
dist.init_process_group(
backend="fake",
world_size=self.world_size,
rank=self.rank,
store=store,
)
def tearDown(self):
super().tearDown()
try:
os.remove(self.file_name)
except OSError:
pass
dist.destroy_process_group()
@unittest.skipIf(not has_triton(), "Inductor+gpu needs triton and recent GPU arch")
@skip_if_rocm
@skip_if_lt_x_gpu(4)
def test_ddp_tp(self):
torch.cuda.set_device(f"cuda:{self.rank}")
dist.init_process_group(
backend="nccl",
rank=self.rank,
world_size=self.world_size,
store=dist.FileStore(self.file_name, self.world_size),
)
model = Net().cuda()
compiled_replicate_model = deepcopy(model)
ref_model = Net()
compiled_replicate_model = deepcopy(ref_model)
mesh_2d = init_device_mesh(
"cuda", (2, self.world_size // 2), mesh_dim_names=("dp", "tp")
)
@ -407,8 +401,8 @@ class DDP_TP_Test(MultiProcessInductorTestCase):
"fc3": ColwiseParallel(),
"fc4": RowwiseParallel(),
}
model = parallelize_module(model, tp_mesh, parallelize_plan)
model = replicate(model, device_mesh=dp_mesh)
ref_model = parallelize_module(ref_model, tp_mesh, parallelize_plan)
ref_model = replicate(ref_model, device_mesh=dp_mesh)
compiled_replicate_model = parallelize_module(
compiled_replicate_model, tp_mesh, parallelize_plan
)
@ -416,15 +410,23 @@ class DDP_TP_Test(MultiProcessInductorTestCase):
compiled_replicate_model, device_mesh=dp_mesh
)
compiled_replicate_model = torch.compile(compiled_replicate_model)
data = torch.randn([1, DIM]).cuda()
data = torch.randn([1, DIM])
with compiled_autograd.enable(compiler_fn()):
loss = compiled_replicate_model(data).sum()
loss.backward()
# TODO: We need "pre-dispatch tracing of backward graph" to make this work:
# https://github.com/pytorch/pytorch/issues/127797#issuecomment-2291695474
with self.assertRaisesRegex(
AssertionError,
"Expected ProxyTensor, got <class 'torch.distributed._tensor.api.DTensor'>",
):
loss.backward()
loss = model(data).sum()
loss.backward()
for p1, p2 in zip(model.parameters(), compiled_replicate_model.parameters()):
self.assertEqual(p1.grad, p2.grad)
# ref_loss = ref_model(data).sum()
# ref_loss.backward()
# for p1, p2 in zip(
# ref_model.parameters(), compiled_replicate_model.parameters()
# ):
# self.assertEqual(p1.grad, p2.grad)
if __name__ == "__main__":


@ -299,6 +299,70 @@ class TestDTensorCompile(torch._dynamo.test_case.TestCase):
self.assertEqual(res, ref)
self.assertEqual(cnt.frame_count, 2)
def test_dynamo_dtensor_from_local_dynamic_shapes(self):
mesh = DeviceMesh(self.device_type, torch.arange(self.world_size))
# Case 1: all dims dynamic
def fn(x):
dt = DTensor.from_local(
x,
mesh,
[Replicate()],
run_check=False,
shape=x.shape,
stride=x.stride(),
)
return dt.to_local() + 2
inp = torch.randn(4, 6, requires_grad=True)
ref = fn(inp)
cnt = torch._dynamo.testing.CompileCounterWithBackend("aot_eager")
res = torch.compile(fn, backend=cnt, fullgraph=True, dynamic=True)(inp)
res.sum().backward()
self.assertEqual(res, ref)
self.assertEqual(cnt.frame_count, 1)
# Case 2: only sizes are dynamic, strides are static
def fn(x):
dt = DTensor.from_local(
x, mesh, [Replicate()], run_check=False, shape=x.shape, stride=(1,)
)
return dt.to_local() + 2
inp = torch.randn(4, requires_grad=True)
torch._dynamo.mark_dynamic(inp, 0)
ref = fn(inp)
cnt = torch._dynamo.testing.CompileCounterWithBackend("aot_eager")
res = torch.compile(fn, backend=cnt, fullgraph=True)(inp)
res.sum().backward()
self.assertEqual(res, ref)
self.assertEqual(cnt.frame_count, 1)
# Case 3: both sizes and strides have a mix of dynamic and static dims
def fn(x):
dt = DTensor.from_local(
x,
mesh,
[Replicate()],
run_check=False,
shape=(x.shape[0], x.shape[1], 2),
stride=(x.stride()[0], 2, 1),
)
return dt.to_local() + 2
inp = torch.randn(4, 6, 2, requires_grad=True)
torch._dynamo.mark_dynamic(inp, 0)
torch._dynamo.mark_dynamic(inp, 1)
ref = fn(inp)
cnt = torch._dynamo.testing.CompileCounterWithBackend("aot_eager")
res = torch.compile(fn, backend=cnt, fullgraph=True)(inp)
res.sum().backward()
self.assertEqual(res, ref)
self.assertEqual(cnt.frame_count, 1)
def test_dynamo_dtensor_recompile(self):
mesh = DeviceMesh(self.device_type, torch.arange(self.world_size))


@ -7,7 +7,13 @@ from typing import cast, List
import torch
import torch.distributed as dist
from torch import rand, randn, Tensor
from torch.distributed._tensor import DeviceMesh, distribute_tensor, Replicate, Shard
from torch.distributed._tensor import (
DeviceMesh,
distribute_tensor,
init_device_mesh,
Replicate,
Shard,
)
from torch.distributed._tensor.debug import CommDebugMode
from torch.distributed._tensor.ops._view_ops import (
Broadcast,
@ -29,6 +35,10 @@ from torch.utils import _pytree as pytree
class TestViewOps(DTensorTestBase):
@property
def world_size(self) -> int:
return 6
def test_view_groups(self):
self.assertEqual(
view_groups([2, 3], [3, 2]),
@ -106,8 +116,8 @@ class TestViewOps(DTensorTestBase):
view_groups([1, 1, 3, 2, 1, 1], [6, 1, 1, 1]),
(
Flatten((InputDim(2), InputDim(3))),
Singleton(),
Singleton(),
InputDim(4),
InputDim(5),
Singleton(),
),
)
@ -116,7 +126,7 @@ class TestViewOps(DTensorTestBase):
(
Split(InputDim(2), (3, 4), 0),
Split(InputDim(2), (3, 4), 1),
Singleton(),
InputDim(3),
Flatten((InputDim(6), InputDim(7))),
),
)
@ -125,10 +135,6 @@ class TestViewOps(DTensorTestBase):
(InputDim(0), InputDim(1), InputDim(2)),
)
@property
def world_size(self) -> int:
return 6
def call_dt_test(self, op, args, kwargs, device_mesh: DeviceMesh):
dim_map = dim_maps[op]
rules = dim_map(*args, **kwargs)
@ -429,7 +435,7 @@ class TestViewOps(DTensorTestBase):
self.dimmap_test(
Tensor.view,
(randn(1, 1, 42, 1, 24, 1), -1),
(Flatten((InputDim(2), InputDim(4))),),
(Flatten((InputDim(2), InputDim(input_dim=3), InputDim(4))),),
)
self.dimmap_test(
@ -525,6 +531,46 @@ class TestViewOps(DTensorTestBase):
)
self.assertEqual(out, out_dt.full_tensor())
@with_comms
def test_dtensor_view_op_uneven(self):
"""
Test two uneven cases for view op:
1) the sharded tensor dim is 1 so that only the first rank has a non-empty shard.
2) the sharded tensor dim is uneven such that some ranks have full shards,
smaller non-empty shards, and empty shards.
"""
dim0_sizes = [1, self.world_size + 1]
for dim0_size in dim0_sizes:
p = torch.randn(dim0_size, 2, 2, 2)
mesh = init_device_mesh(self.device_type, (self.world_size,))
dtensor = distribute_tensor(p, mesh, [Shard(0)])
with CommDebugMode() as comm_mode:
view = dtensor.view(dim0_size, 2, 4)
self.assertEqual(len(comm_mode.get_comm_counts()), 0)
# when no communication happens, the data pointer should be the same.
self.assertEqual(
view.to_local().data_ptr(), dtensor.to_local().data_ptr()
)
view = dtensor.view(dim0_size, 4, 2)
self.assertEqual(
view.to_local().data_ptr(), dtensor.to_local().data_ptr()
)
self.assertEqual(len(comm_mode.get_comm_counts()), 0)
view = dtensor.view(dim0_size, 8)
self.assertEqual(
view.to_local().data_ptr(), dtensor.to_local().data_ptr()
)
self.assertEqual(len(comm_mode.get_comm_counts()), 0)
view = dtensor.view(dtensor.shape)
self.assertEqual(
view.to_local().data_ptr(), dtensor.to_local().data_ptr()
)
self.assertEqual(len(comm_mode.get_comm_counts()), 0)
if __name__ == "__main__":
run_tests()


@ -34,7 +34,7 @@ class FsdpModelStateCheckpoint(DTensorTestBase):
"model": model.state_dict(),
}
dist_cp.save_state_dict(
dist_cp.save(
state_dict=state_dict,
storage_writer=dist_cp.FileSystemWriter(CHECKPOINT_DIR),
planner=DefaultSavePlanner(),
@ -55,7 +55,7 @@ class FsdpModelStateCheckpoint(DTensorTestBase):
"model": model_2.state_dict(),
}
dist_cp.load_state_dict(
dist_cp.load(
state_dict=state_dict,
storage_reader=dist_cp.FileSystemReader(CHECKPOINT_DIR),
planner=DefaultLoadPlanner(),


@ -40,7 +40,7 @@ class TestFsdpTpCheckpointConversion(DTensorTestBase):
fsdp_state_dict = fsdp_model.state_dict()
# save fsdp_state_dict to storage
dist_cp.save_state_dict(
dist_cp.save(
state_dict=fsdp_state_dict,
storage_writer=dist_cp.FileSystemWriter(CHECKPOINT_DIR),
)


@ -94,7 +94,7 @@ class TestHSDPCheckpoint(DTensorTestBase):
state_dict = {"model": model.state_dict()}
state_dict_to_save = deepcopy(state_dict)
dist_cp.save_state_dict(
dist_cp.save(
state_dict=state_dict_to_save,
storage_writer=dist_cp.FileSystemWriter(CHECKPOINT_DIR),
planner=DefaultSavePlanner(),
@ -113,7 +113,7 @@ class TestHSDPCheckpoint(DTensorTestBase):
self.assertEqual(v1.placements, v2.placements)
self.assertNotEqual(v1.to_local(), v2.to_local())
dist_cp.load_state_dict(
dist_cp.load(
state_dict=state_dict_to_save,
storage_reader=dist_cp.FileSystemReader(CHECKPOINT_DIR),
planner=DefaultLoadPlanner(),


@ -6,6 +6,7 @@
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.
import asyncio
import ctypes
import multiprocessing
import os
@ -362,6 +363,9 @@ if not (TEST_WITH_DEV_DBG_ASAN or IS_WINDOWS or IS_MACOS):
self.assertTrue(pc._stderr_tail.stopped())
self.assertTrue(pc._stdout_tail.stopped())
def test_pcontext_wait_on_a_child_thread(self):
asyncio.run(asyncio.to_thread(self.test_pcontext_wait))
def test_multiprocess_context_close(self):
pc = start_processes(
name="sleep",


@ -25,6 +25,7 @@ from torch.distributed.elastic.rendezvous.c10d_rendezvous_backend import (
C10dRendezvousBackend,
create_backend,
)
from torch.distributed.elastic.utils.distributed import get_free_port
class TCPStoreBackendTest(TestCase, RendezvousBackendTestMixin):
@ -69,9 +70,11 @@ class CreateBackendTest(TestCase):
# For testing, the default parameters used are for tcp. If a test
# uses parameters for file store, we set the self._params to
# self._params_filestore.
port = get_free_port()
self._params = RendezvousParameters(
backend="dummy_backend",
endpoint="localhost:29300",
endpoint=f"localhost:{port}",
run_id="dummy_run_id",
min_nodes=1,
max_nodes=1,
@ -95,7 +98,7 @@ class CreateBackendTest(TestCase):
self._expected_temp_dir = tempfile.gettempdir()
self._expected_endpoint_host = "localhost"
self._expected_endpoint_port = 29300
self._expected_endpoint_port = port
self._expected_store_type = TCPStore
self._expected_read_timeout = timedelta(seconds=10)
@ -173,11 +176,14 @@ class CreateBackendTest(TestCase):
def test_create_backend_returns_backend_if_endpoint_port_is_not_specified(
self,
) -> None:
self._params.endpoint = self._expected_endpoint_host
# patch default port and pass endpoint with no port specified
with mock.patch(
"torch.distributed.elastic.rendezvous.c10d_rendezvous_backend.DEFAULT_PORT",
self._expected_endpoint_port,
):
self._params.endpoint = self._expected_endpoint_host
self._expected_endpoint_port = 29400
self._assert_create_backend_returns_backend()
self._assert_create_backend_returns_backend()
def test_create_backend_returns_backend_if_endpoint_file_is_not_specified(
self,


@ -1597,6 +1597,23 @@ class CreateHandlerTest(TestCase):
create_handler(self._store, self._backend, self._params)
record_mock.assert_called_once()
def test_create_handler_rdzv_local_addr(self) -> None:
params = RendezvousParameters(
backend=self._backend.name,
endpoint="dummy_endpoint",
run_id="dummy_run_id",
min_nodes=1,
max_nodes=1,
join_timeout="50",
last_call_timeout="60",
close_timeout="70",
local_addr="127.0.0.2",
)
store = HashStore()
handler = create_handler(store, self._backend, params)
rdzv_info = handler.next_rendezvous()
self.assertEqual(rdzv_info.bootstrap_store_info.master_addr, "127.0.0.2")
def _ignore_exception(exception_type: Exception, fn: Callable):
try:
@ -1656,7 +1673,7 @@ class IntegrationTest(TestCase):
"min_nodes": 2,
"max_nodes": 2,
"join_timeout": "5",
"local_addr": f"address_{len(self._handlers)}",
"local_addr": f"127.0.0.{len(self._handlers)}",
}
params.update(**kwargs)
@ -1714,7 +1731,7 @@ class IntegrationTest(TestCase):
state_and_token = self._backend.get_state()
state = pickle.loads(state_and_token[0])
addresses = [node.addr for node in state.redundancy_list]
self.assertListEqual(addresses, ["address_2"])
self.assertListEqual(addresses, ["127.0.0.2"])
def test_redundancy_transition_to_wait_list_then_join_rendezvous(self) -> None:
handler1 = self._create_handler(


@ -6,6 +6,7 @@
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.
import multiprocessing as mp
import os
import signal
import time
import unittest
@ -37,7 +38,7 @@ if not (IS_WINDOWS or IS_MACOS):
def setUp(self):
super().setUp()
self.max_interval = 0.01
self.file_path = "/tmp/test_file_path_" + str(uuid.uuid4())
self.file_path = f"/tmp/test_file_path_{os.getpid()}_{uuid.uuid4()}"
self.server = timer.FileTimerServer(
self.file_path, "test", self.max_interval
)
@ -204,7 +205,7 @@ if not (IS_WINDOWS or IS_MACOS):
class FileTimerServerTest(TestCase):
def setUp(self):
super().setUp()
self.file_path = "/tmp/test_file_path_" + str(uuid.uuid4())
self.file_path = f"/tmp/test_file_path_{os.getpid()}_{uuid.uuid4()}"
self.max_interval = 0.01
self.server = timer.FileTimerServer(
self.file_path, "test", self.max_interval


@ -914,9 +914,6 @@ class TestMultiProc(DynamoDistributedMultiProcTestCase):
with _dynamo_dist_per_rank_init(self.rank, self.world_size):
torch._dynamo.utils.clear_compilation_metrics()
# TODO: This should be possible to do inside the function, but
device = f"cuda:{self.rank}"
@torch.compile()
def f(x, y):
zx = x.shape
@ -940,6 +937,28 @@ class TestMultiProc(DynamoDistributedMultiProcTestCase):
for r in res[1:]:
self.assertEqual(res[0], r)
@unittest.skipIf(not has_triton(), "Inductor+gpu needs triton and recent GPU arch")
@config.patch(enable_compiler_collectives=True)
def test_compiler_collectives_missing_source(self):
with _dynamo_dist_per_rank_init(self.rank, self.world_size):
torch._dynamo.utils.clear_compilation_metrics()
@torch.compile()
def f(rank, xs):
return xs[rank].sum()
xs = []
for _ in range(self.world_size):
xs.append(torch.randn(10, device=self.rank))
f(self.rank, xs)
metrics = torch._dynamo.utils.get_compilation_metrics()
res = [None] * self.world_size
torch.distributed.all_gather_object(res, len(metrics))
for r in res[1:]:
self.assertEqual(res[0], r)
@unittest.skipIf(not has_triton(), "Inductor+gpu needs triton and recent GPU arch")
@patch.object(torch._inductor.config, "fx_graph_cache", False)
@patch.object(torch._inductor.config, "fx_graph_remote_cache", False)


@ -1013,22 +1013,17 @@ class TestCollectivesInductor(DynamoDistributedSingleProcTestCase):
return ar
input = torch.ones(4, 4, device="cuda", requires_grad=True)
# TODO implement backwards
with self.assertRaisesRegex(
RuntimeError,
"element 0 of tensors does not require grad and does not have a grad_fn",
):
compiled = torch.compile(
func, backend="aot_eager"
) # inductor bug with single-op allreduce graph
out = compiled(input)
out.sum().backward()
compiled = torch.compile(
func, backend="aot_eager"
) # inductor bug with single-op allreduce graph
out = compiled(input)
out.sum().backward()
correct_input = input.clone().detach().requires_grad_()
correct = func(correct_input)
correct.sum().backward()
self.assertTrue(same(out, correct))
self.assertTrue(same(input.grad, correct_input.grad))
correct_input = input.clone().detach().requires_grad_()
correct = func(correct_input)
correct.sum().backward()
self.assertTrue(same(out, correct))
self.assertTrue(same(input.grad, correct_input.grad))
def test_meta(self):
x = torch.rand((2, 3, 4), device="meta")


@ -107,7 +107,7 @@ due to:
Traceback (most recent call last):
File "test_exc.py", line N, in f
raise NotImplementedError
torch._dynamo.exc.InternalTorchDynamoError:
torch._dynamo.exc.InternalTorchDynamoError: NotImplementedError:
from user code:
File "test_exc.py", line N, in fn001


@ -239,6 +239,22 @@ class FunctionTests(torch._dynamo.test_case.TestCase):
v = v + x
return v
def test_itertools_reconstruct(self):
def fn(a):
it1 = itertools.repeat(1)
it2 = itertools.count(2)
for _ in range(3):
a += next(it1)
a += next(it2)
return it1, it2, a
opt_fn = torch.compile(fn, backend="eager", fullgraph=True)
i1, i2, a = fn(torch.ones(3, 3))
it1, it2, b = opt_fn(torch.ones(3, 3))
self.assertEqual(next(i1), next(it1))
self.assertEqual(next(i2), next(it2))
self.assertEqual(a, b)
@make_test
def test_obj_eq(a, b):
v = a + b
@ -507,8 +523,7 @@ class FunctionTests(torch._dynamo.test_case.TestCase):
empty = collections.deque()
d.extend(empty)
# dynamo same() util doesn't support deque so just return a list
return list(d)
return d
@make_test
def test_slice1(a):
@ -3115,6 +3130,199 @@ class GraphModule(torch.nn.Module):
fn(arr, np.s_[..., 1], np.array([3, 3])), np.array([[1, 3], [2, 3]])
)
def test_map_return(self):
def fn(a, b):
return map(lambda x: x + 1, [a, b])
opt_fn = torch.compile(fn, backend="eager", fullgraph=True)
m = opt_fn(torch.randn(3, 3), torch.randn(3, 3))
self.assertIsInstance(m, map)
@make_test
def test_map_max(a, b):
return max(map(lambda x: x.sum(), [a, b]))
# max(map(...)) graph breaks
@unittest.expectedFailure
@make_test
def test_map_max_const(a):
return max(map(lambda x: x, [1, 2, 3])), a + 1
@make_test
def test_map_list(a, b):
return list(map(lambda x: x + 1, [a, b]))
@make_test
def test_map_tuple(a, b):
return tuple(map(lambda x: x + 1, [a, b]))
@make_test
def test_map_iter(a, b):
it = iter(map(lambda x: x + 1, [a, b]))
return next(it)
@make_test
def test_map_zip_dict(a):
d = dict(
zip(
map(lambda x: x + 1, [0, 1, 2]),
[map(lambda x: x - 1, [y]) for y in [3, 4, 5]],
)
)
return list(d[3])[0], a + 1 # noqa: RUF015
@make_test
def test_map_dict_fromkeys(a):
return dict.fromkeys(map(lambda x: x + 1, [0, 1])), a + 1
@make_test
def test_map_set(a):
return set(map(lambda x: x + 1, [0, 1])), a + 1
# test_map_sum defined earlier
@make_test
def test_map_reduce(a, b):
return functools.reduce(lambda x, y: x + y, map(lambda x: x + 1, [a, b]))
@make_test
def test_map_sorted(a):
return sorted(map(lambda x: x + 1, [0, 4, 3, 1, 2])), a + 1
@make_test
def test_map_list_extend(a, b, c):
l = [a]
l.extend(map(lambda x: x + 1, [b, c]))
return l
@make_test
def test_map_list_slice_assign(a, b, c, d, e):
l = [a, b, c]
l[1:2] = map(lambda x: x + 1, [d, e])
return l
@make_test
def test_map_deque_extendleft(a, b, c):
d = collections.deque([a])
d.extendleft(map(lambda x: x + 1, [b, c]))
return d
@make_test
def test_map_str_join(a):
return "".join(map(lambda x: x, ["a", "b", "c"])), a + 1
def test_map_with_graph_break(self):
def f(a):
a += 1
def g(x):
nonlocal a
a += 1
return x + 1
m = map(g, [1, 2, 3, 4, 5])
a += next(m) # won't graph break
torch._dynamo.graph_break()
a += next(m) # will graph break
return a
cnts = torch._dynamo.testing.CompileCounter()
opt_f = torch.compile(f, backend=cnts)
self.assertEqual(f(torch.ones(3, 3)), opt_f(torch.ones(3, 3)))
self.assertEqual(cnts.frame_count, 3)
def test_map_reconstruct(self):
def fn(a):
return map(lambda x: x[0] + x[1], zip([1, 2, 3], [1, 2, 3])), a + 1
opt_fn = torch.compile(fn, backend="eager", fullgraph=True)
m = opt_fn(torch.ones(3, 3))[0]
self.assertIsInstance(m, map)
self.assertEqual(list(m), list(fn(torch.ones(3, 3))[0]))
def test_zip_reconstruct(self):
def fn(a):
return zip([1, 2, 3], map(lambda x: x + 1, [1, 2, 3])), a + 1
opt_fn = torch.compile(fn, backend="eager", fullgraph=True)
m = opt_fn(torch.ones(3, 3))[0]
self.assertIsInstance(m, zip)
self.assertEqual(list(m), list(fn(torch.ones(3, 3))[0]))
@make_test
def test_map_partial_unpack(a, b):
y = 1
def f(x):
nonlocal y
y += 1
return x
l = list(zip([a, b], map(f, [1, 2, 3, 4])))
return a + y
@make_test
def test_map_call_function_ex(a, b):
def f(x, y):
return x + y
return f(*map(lambda x: x + 1, [a, b]))
@make_test
def test_map_unpack_twice(a, b):
m = map(lambda x: x + 1, [a, b])
l1 = list(m)
l2 = list(m)
return l1, l2
@make_test
def test_enumerate(a, b):
return list(enumerate([a, b], start=1)), a + 1
@make_test
def test_map_enumerate(a, b):
return list(enumerate(map(lambda x: x + 1, [a, b]), start=1)), a + 1
@make_test
def test_map_infinite(a, b):
return list(map(lambda x, y: x + y, [a, b], itertools.count(3)))
@make_test
def test_map_unpack_vars(a, b):
x, y = map(lambda x: x + 1, [a, b])
return x + y
def test_enumerate_custom(self):
class MyClass:
def __iter__(self):
self.a = 1
return self
def __next__(self):
if self.a > 3:
raise StopIteration
self.a += 1
return self.a
def fn(x):
for i, it in enumerate(MyClass()):
x += i + it
return x
opt_fn = torch.compile(fn, backend="eager", fullgraph=True)
self.assertEqual(fn(torch.ones(3, 3)), opt_fn(torch.ones(3, 3)))
def test_enumerate_reconstruct(self):
def fn(a, b):
return enumerate([a, b], start=1)
opt_fn = torch.compile(fn, backend="eager", fullgraph=True)
inps = (torch.randn(3, 3), torch.randn(3, 3))
it1 = fn(*inps)
it2 = opt_fn(*inps)
self.assertIsInstance(it2, enumerate)
self.assertEqual(list(it1), list(it2))
def udf_mul(x, y):
return x * y
@ -3394,6 +3602,71 @@ class DefaultsTests(torch._dynamo.test_case.TestCase):
ref = opt_fn(x)
self.assertEqual(ref, res)
def test_frozenset_construction(self):
def fn(x):
s = frozenset({x})
t = frozenset(s)
return len(t)
opt_fn = torch.compile(fn, backend="eager", fullgraph=True)
x = torch.randn(4)
res = fn(x)
ref = opt_fn(x)
self.assertEqual(ref, res)
def test_frozenset_reconstruction(self):
d = {}
f = frozenset()
d[f] = torch.randn(4)
def fn(x):
k = frozenset()
torch._dynamo.graph_break()
return d[k] * x
opt_fn = torch.compile(fn, backend="eager")
x = torch.randn(4)
res = fn(x)
ref = opt_fn(x)
self.assertEqual(ref, res)
def test_frozenset_illegal_call_method(self):
def fn_add():
s = frozenset((1, 2, 3))
s.add({2})
return len(s)
def fn_pop():
s = frozenset((1, 2, 3))
s.pop()
return len(s)
def fn_update():
s = frozenset((1, 2, 3))
s.update({4, 5, 6})
return len(s)
def fn_remove():
s = frozenset((1, 2, 3))
s.remove(2)
return len(s)
def fn_discard():
s = frozenset((1, 2, 3))
s.discard(2)
return len(s)
def fn_clear():
s = frozenset((1, 2, 3))
s.clear()
return len(s)
for fn in [fn_add, fn_pop, fn_update, fn_remove, fn_discard, fn_clear]:
torch._dynamo.reset()
opt_fn = torch.compile(fn, backend="eager", fullgraph=True)
with self.assertRaises(torch._dynamo.exc.InternalTorchDynamoError):
opt_fn()
def test_is_tensor_tensor(self):
def fn(x, y):
if x is y:
@ -3605,10 +3878,16 @@ class DefaultsTests(torch._dynamo.test_case.TestCase):
with self.assertRaisesRegex(torch._dynamo.exc.UserError, "zip()"):
nopython_fn(x, ys[:1], zs)
with self.assertRaisesRegex(torch._dynamo.exc.UserError, "zip()"):
nopython_fn(x, ys, zs[:1])
# Should cause fallback if allow graph break
with self.assertRaisesRegex(ValueError, "zip()"):
opt_fn(x, ys[:1], zs)
with self.assertRaisesRegex(ValueError, "zip()"):
opt_fn(x, ys, zs[:1])
def test_fn_with_attr(self):
def fn(x):
if fn.pred:


@ -308,6 +308,19 @@ class MiscTests(torch._inductor.test_case.TestCase):
"Graph break for an optree C/C++ function optree._C.PyCapsule.flatten. Consider using torch.utils._pytree - https://github.com/pytorch/pytorch/blob/main/torch/utils/_pytree.py",
)
def test_scalar_device_movement(self):
if not torch._dynamo.config.assume_static_by_default:
self.skipTest("Doesn't work with symints")
def add_fn(a, b, out):
res = torch.add(a, b, out=out)
return res
res = add_fn(2, 3, torch.tensor(0.0))
add_fn = torch.compile(add_fn, backend="eager", fullgraph=True)
res_compiled = add_fn(2, 3, torch.tensor(0.0))
self.assertEqual(res, res_compiled)
@skipIfNNModuleInlined("fails internal CI")
@unittest.skipIf(IS_FBCODE, "inline cpp_extension doesn't work in fbcode")
def test_cpp_extension_recommends_custom_ops(self):
@ -3367,6 +3380,21 @@ utils_device.CURRENT_DEVICE == None""".split(
self.assertTrue(same(obj41.y, obj42.y))
self.assertEqual(cnts.frame_count, 1)
def test_thread_local_setattr(self):
from threading import local
loc = local()
@torch.compile(fullgraph=True)
def fn(x, l):
l.x = x
return x + 1
x = torch.ones(2, 2)
fn(x, loc)
self.assertTrue(loc.x is x)
def test_user_defined_class_name(self):
class MyClassFoo:
pass


@ -47,6 +47,21 @@ class TestHFPretrained(torch._dynamo.test_case.TestCase):
res = opt_fn(x, tmp)
self.assertTrue(same(ref, res))
@maybe_skip
def test_pretrained_non_const_attr(self):
def fn(a, tmp):
if tmp.pruned_heads:
return a + 1
else:
return a - 1
x = torch.randn(2)
tmp = PretrainedConfig()
ref = fn(x, tmp)
opt_fn = torch.compile(backend="eager", fullgraph=True)(fn)
res = opt_fn(x, tmp)
self.assertTrue(same(ref, res))
class TestModelOutput(torch._dynamo.test_case.TestCase):
@maybe_skip


@ -1,5 +1,4 @@
# Owner(s): ["module: dynamo"]
from unittest.mock import patch

import torch
import torch._dynamo.test_case
@@ -14,6 +13,17 @@ from torch.utils._device import DeviceContext
from torch.utils._python_dispatch import TorchDispatchMode


class TestMode(BaseTorchFunctionMode):
    def __torch_function__(self, func, types, args, kwargs=None):
        if not kwargs:
            kwargs = {}

        if func == torch.add:
            return torch.zeros(2, 2)

        return super().__torch_function__(func, types, args, kwargs)
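
# TestMode above short-circuits torch.add to torch.zeros(2, 2) and defers every
# other op to the base mode; the TorchFunctionModeTests below use it to detect
# whether the mode was honored inside compiled regions.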

class TorchDispatchModeTests(torch._dynamo.test_case.TestCase):
    @classmethod
    def setUpClass(cls):
@@ -57,9 +67,11 @@ class TorchFunctionModeTests(torch._dynamo.test_case.TestCase):
    def setUp(self):
        torch.set_default_device(None)
        torch._dynamo.reset()

    def tearDown(self):
        torch.set_default_device(None)
        torch._dynamo.reset()

    def _run_torch_function_mode_guard_test(self):
        class TestMode1(BaseTorchFunctionMode):
@@ -94,70 +106,6 @@ class TorchFunctionModeTests(torch._dynamo.test_case.TestCase):
            fn(inp)
        self.assertEqual(cnt.frame_count, 4)

    def _run_ignored_mode_types_test(self):
        class IgnoredMode(BaseTorchFunctionMode):
            pass

        cnt = torch._dynamo.testing.CompileCounter()

        @torch.compile(backend=cnt.__call__, fullgraph=True)
        def fn(x):
            return x + 1

        inp = torch.ones(2, 2)

        with patch(
            "torch._dynamo.variables.torch_function.IGNORED_MODES", {IgnoredMode}
        ):
            # initial compile
            fn(inp)

            # no recompile, mode ignored
            # note: the ref stack is length 0, and the stack we are checking against has length 2
            # we want to check both ref stack len > runtime stack, and ref stack len < runtime stack
            with IgnoredMode(), IgnoredMode():
                fn(inp)
            self.assertEqual(cnt.frame_count, 1)

            # recompile due to new mode on the stack
            with BaseTorchFunctionMode(), BaseTorchFunctionMode(), BaseTorchFunctionMode():
                fn(inp)
            self.assertEqual(cnt.frame_count, 2)

            # recompile
            # tests both ref stack len > runtime stack len for the above guard check
            # and ref stack len < runtime stack len for the initial zero mode case
            with BaseTorchFunctionMode(), IgnoredMode(), BaseTorchFunctionMode():
                fn(inp)
            self.assertEqual(cnt.frame_count, 3)

            # no recompile
            with IgnoredMode(), IgnoredMode(), BaseTorchFunctionMode(), BaseTorchFunctionMode():
                fn(inp)
            self.assertEqual(cnt.frame_count, 3)

            # This is tricky, basically the ignored modes are baked into the guard
            # IgnoredMode will be ignored forever by that guard.
            # This is okay since we don't expect to be modifying IGNORED_MODES
            # in the middle of execution except for the purposes of testing.
            torch._dynamo.reset()
            with IgnoredMode():
                fn(inp)
            self.assertEqual(cnt.frame_count, 4)

    @torch._dynamo.config.patch("enable_cpp_guard_manager", False)
    def test_torch_function_mode_guards_ignored_types_py(self):
        self._run_ignored_mode_types_test()

    def test_torch_function_mode_guards_ignored_types_cpp(self):
        self._run_ignored_mode_types_test()

    @torch._dynamo.config.patch("enable_cpp_guard_manager", False)
    def test_torch_function_mode_guards_py(self):
        self._run_torch_function_mode_guard_test()
@@ -324,6 +272,218 @@ class TorchFunctionModeTests(torch._dynamo.test_case.TestCase):
            fn(inp)
        self.assertEqual(cnt.frame_count, 2)
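
    # Two stacked modes below: TestMode1 intercepts torch.add and TestMode2 intercepts
    # torch.mul; the compiled and eager runs should agree on both the outputs and on
    # which modes were invoked.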
    def test_nested_torch_function_mode(self):
        mode_1_called = False
        mode_2_called = False

        def reset_state():
            nonlocal mode_1_called
            nonlocal mode_2_called
            mode_1_called = False
            mode_2_called = False

        ones = torch.ones(2, 2)
        zeros = torch.zeros(2, 2)

        class TestMode1(BaseTorchFunctionMode):
            def __torch_function__(self, func, types, args, kwargs=None):
                if not kwargs:
                    kwargs = {}

                nonlocal mode_1_called
                mode_1_called = True

                if func == torch.add:
                    return zeros

                return super().__torch_function__(func, types, args, kwargs)

        class TestMode2(BaseTorchFunctionMode):
            def __torch_function__(self, func, types, args, kwargs=None):
                if not kwargs:
                    kwargs = {}

                nonlocal mode_2_called
                mode_2_called = True

                if func == torch.mul:
                    return ones

                return super().__torch_function__(func, types, args, kwargs)

        def fn(x):
            return torch.add(x, 3)

        def fn_2(x):
            return torch.mul(x, 3) + torch.add(x, 3)

        inp = torch.ones(2, 2) + 1

        for fn_i in [fn, fn_2]:
            fn_opt = torch.compile(fn_i, fullgraph=True)
            with TestMode1(), TestMode2():
                expected = fn_i(inp), mode_1_called, mode_2_called
                reset_state()

                actual = fn_opt(inp), mode_1_called, mode_2_called
                reset_state()

            self.assertEqual(expected, actual)
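
    # DisableTorchFunctionSubclass and DisableTorchFunction entered outside the
    # compiled region should be respected: compiled and eager results must match
    # in both cases.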
    def test_torch_function_mode_disable(self):
        class TestSubclass(torch.Tensor):
            @classmethod
            def __torch_function__(cls, func, types, args, kwargs=None):
                if not kwargs:
                    kwargs = {}
                if func == torch.add:
                    return torch.ones(2, 2)
                return super().__torch_function__(func, types, args, kwargs)

        class TestMode(BaseTorchFunctionMode):
            def __torch_function__(self, func, types, args, kwargs=None):
                if not kwargs:
                    kwargs = {}
                if func == torch.add:
                    return torch.zeros(2, 2)
                return super().__torch_function__(func, types, args, kwargs)

        def fn(x):
            return torch.add(x, 3)

        inp = (torch.ones(2, 2) + 1).as_subclass(TestSubclass)

        fn_opt = torch.compile(fn, fullgraph=True)
        with TestMode(), torch._dynamo.config.patch(
            "traceable_tensor_subclasses", {TestSubclass}
        ):
            with torch._C.DisableTorchFunctionSubclass():
                expected = fn(inp)
                actual = fn_opt(inp)
                self.assertEqual(expected, actual)

            with torch._C.DisableTorchFunction():
                expected = fn(inp)
                actual = fn_opt(inp)
                self.assertEqual(expected, actual)
    def test_torch_function_mode_highest_priority(self):
        class TestSubclass(torch.Tensor):
            @classmethod
            def __torch_function__(cls, func, types, args, kwargs=None):
                if not kwargs:
                    kwargs = {}
                if func == torch.add:
                    return torch.ones(2, 2)
                return super().__torch_function__(func, types, args, kwargs)

        def fn(x):
            return torch.add(x, 3)

        inp = (torch.ones(2, 2) + 1).as_subclass(TestSubclass)

        fn_opt = torch.compile(fn, fullgraph=True)
        with TestMode(), torch._dynamo.config.patch(
            "traceable_tensor_subclasses", {TestSubclass}
        ):
            expected = fn(inp)
            actual = fn_opt(inp)
            self.assertEqual(expected, actual)
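
    # Constructing, entering, and exiting TestMode inside the compiled function is
    # traced end to end (fullgraph=True); the compiled result must match eager.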
    def test_torch_function_mode_enter_exit(self):
        def fn(x, y):
            with TestMode():
                o = torch.add(x, 3)

            return torch.add(o, y)

        inp = (torch.ones(2, 2) + 1, torch.ones(2, 2) + 2)
        fn_opt = torch.compile(fn, fullgraph=True)

        expected = fn(*inp)
        actual = fn_opt(*inp)

        self.assertEqual(expected, actual)

    def test_torch_function_mode_graph_break(self):
        def fn(x, y):
            with TestMode():
                torch._dynamo.graph_break()
                o = torch.add(x, 3)

            return torch.add(o, y)

        inp = (torch.ones(2, 2) + 1, torch.ones(2, 2) + 2)
        fn_opt = torch.compile(fn)

        expected = fn(*inp)
        actual = fn_opt(*inp)

        self.assertEqual(expected, actual)
    def test_torch_function_mode_and_pop_graph_break(self):
        def fn(x, y):
            with TestMode():
                z = _pop_torch_function_stack()
                torch._dynamo.graph_break()
                _push_on_torch_function_stack(z)

                o = torch.add(x, 3)

            return torch.add(o, y)

        inp = (torch.ones(2, 2) + 1, torch.ones(2, 2) + 2)
        fn_opt = torch.compile(fn)

        expected = fn(*inp)
        actual = fn_opt(*inp)

        self.assertEqual(expected, actual)
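
    # If an exception escapes the compiled region, the torch function mode stack must
    # still be unwound; the test expects the stack to be empty afterwards.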
    def test_torch_function_mode_restore_on_exc(self):
        @torch._dynamo.disable()
        def err():
            raise RuntimeError("test")

        @torch.compile()
        def fn(x):
            with TestMode():
                x += 1
                err()
                x += 2
                return x

        try:
            fn(torch.ones(2, 2))
        except RuntimeError:
            pass

        self.assertEqual(_len_torch_function_stack(), 0)
    def test_torch_function_mode_and_pop_graph_break_mutation(self):
        def fn(x, y):
            with TestMode():
                z = _pop_torch_function_stack()
                z.y = 5
                torch._dynamo.graph_break()
                _push_on_torch_function_stack(z)

                o = torch.add(x, 3)
                o = torch.mul(o, z.y)

            return torch.add(o, y)

        inp = (torch.ones(2, 2) + 1, torch.ones(2, 2) + 2)
        fn_opt = torch.compile(fn)

        expected = fn(*inp)
        actual = fn_opt(*inp)

        self.assertEqual(expected, actual)


if __name__ == "__main__":
    from torch._dynamo.test_case import run_tests

Some files were not shown because too many files have changed in this diff.