Compare commits

...

128 Commits

Author SHA1 Message Date
515abb7744 [CI] Add Triton 3.13t build (#143212)
By just extending the matrix and invoking the script with the appropriate CPython runtime
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143212
Approved by: https://github.com/clee2000, https://github.com/atalman, https://github.com/seemethere
2024-12-13 23:45:47 +00:00
8621b9ff0c Infer whether prologues can be computed without upcasting to fp32 without changing numerics (#142402)
For prologues that only perform loads (e.g., gathers) or dtype conversions, and no actual arithmetic on lower-precision types, we can codegen them without upcasting to fp32 and without changing numerics.

Prologues that actually do arithmetic will need to use invoke quant. But I would like to support upcasts/gathers out of the box.

We could potentially extend this in the future to avoid upcasting max pooling operations as well, if there were perf benefits to be had (less likely).
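
To make the class of prologue concrete, here is a minimal sketch of the kind of load-only prologue this applies to: a gather feeding a max-autotune matmul. The function name, shapes, and CUDA device are illustrative assumptions, not code from this PR.

```
import torch

# Illustrative only: the gather (a load) is the entire prologue; there is no
# arithmetic on the low-precision values before the tl.dot, so no fp32 upcast
# is needed for correct numerics. Assumes a CUDA device is available.
@torch.compile(mode="max-autotune")
def gathered_mm(w, idx, x):
    return w[idx] @ x

w = torch.randn(1024, 64, dtype=torch.float16, device="cuda")
x = torch.randn(64, 64, dtype=torch.float16, device="cuda")
idx = torch.randint(0, 1024, (128,), device="cuda")
out = gathered_mm(w, idx, x)
```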

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142402
Approved by: https://github.com/jansel
ghstack dependencies: #142401
2024-12-13 23:25:15 +00:00
4e0de50eb5 Revert "[CI] Add Triton 3.13t build (#143212)"
This reverts commit 571cd92d7c4c7bd2d5f068b5a285e0e70b8d0a40.

Reverted https://github.com/pytorch/pytorch/pull/143212 on behalf of https://github.com/janeyx99 due to lint is failing, the other failures don't seem relevant but ci has turned red after this change haha ([comment](https://github.com/pytorch/pytorch/pull/143212#issuecomment-2542521875))
2024-12-13 23:03:45 +00:00
f406207af2 Revert "[ROCm] Prune old gfx archs gfx900/gfx906 from binaries (#142827)"
This reverts commit 1e2b841675e50a6abd8dab9a95b33fda64b12e2b.

Reverted https://github.com/pytorch/pytorch/pull/142827 on behalf of https://github.com/jeffdaily due to prematurely dropped support for gfx900/gfx906 ([comment](https://github.com/pytorch/pytorch/pull/142827#issuecomment-2542507857))
2024-12-13 22:48:44 +00:00
ad2faec8bb Add a pass which analyzes whether a prologue preserves zero mask (#142401)
We load inputs to prologue fusion with a mask. The masked-out values must still be zero before we run `tl.dot`. Previously, we would always apply the mask:
```
        tmp0 = tl.load(in_ptr1 + (tl.broadcast_to(xindex, xindex.shape)), a_mask, eviction_policy='evict_last')
        tmp1 = tmp0.to(tl.float32)
        a = tl.where(a_mask, tmp1, 0.0)
```
Now we do not need to:
```
        tmp0 = tl.load(in_ptr1 + (tl.broadcast_to(xindex, xindex.shape)), a_mask, eviction_policy='evict_last')
        tmp1 = tmp0.to(tl.float32)
        a = tmp1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142401
Approved by: https://github.com/jansel
2024-12-13 22:37:33 +00:00
b29fc52f82 [Profiler] Add Optional Flag to turn off external correlations (#142516)
Summary: External correlations are super spammy and oftentimes not even useful. Add a flag during init to remove them entirely.

Test Plan: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Dec_10_12_33_31.531106.pt.trace.json.gz&bucket=gpu_traces

Differential Revision: D67048206

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142516
Approved by: https://github.com/ngimel
2024-12-13 22:32:09 +00:00
bb574abe73 [BC-Breaking] Remove capture_pre_autograd_graph references in quantization (#139505)
Summary:
As title

This is a BC-breaking change because a graph produced by "capture_pre_autograd_graph" cannot be input to quantization anymore. But this is OK, since this API has been deprecated for a while and is going to be deleted. We have removed all call sites of it.

We remove the deprecated API references in code, docs, and tests.

We also removed two tests that are specific to the capture_pre_autograd_graph API.
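
For reference, a minimal sketch of the replacement flow, using torch.export.export_for_training to feed PT2E quantization; the toy module and the XNNPACKQuantizer choice are illustrative, not taken from this diff.

```
import torch
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(8, 8)

    def forward(self, x):
        return self.linear(x)

example_inputs = (torch.randn(2, 8),)
# export_for_training takes the place of the deprecated capture_pre_autograd_graph
m = torch.export.export_for_training(M(), example_inputs).module()

quantizer = XNNPACKQuantizer().set_global(get_symmetric_quantization_config())
m = prepare_pt2e(m, quantizer)
m(*example_inputs)  # calibration
m = convert_pt2e(m)
```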

Test Plan: CI

Differential Revision: D65351887

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139505
Approved by: https://github.com/tugsbayasgalan, https://github.com/andrewor14, https://github.com/jerryzh168
2024-12-13 22:26:22 +00:00
d25e6e623f Fix unused Python variables in test/[a-d]* (#134665)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134665
Approved by: https://github.com/albanD
2024-12-13 22:13:12 +00:00
e19f493f02 add private config to temporarily preserve old FSDP guard behavior (#142871)
Summary: https://github.com/pytorch/pytorch/pull/138819 wobbled dynamo guards in a way that caused some performance regression, so this PR temporarily adds a config to get the old behavior back while we investigate.

Test Plan: CI

Differential Revision: D67096751

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142871
Approved by: https://github.com/yf225
2024-12-13 22:06:48 +00:00
8fae4397b4 Add "inductor_pre_grad_graph" logging (#142717) (#143126)
Summary:

Add new structured logging "inductor_pre_grad_graph"

This is for the inductor provenance-tracking front end to load this graph from tlparse.
ghstack-source-id: 257581974
exported-using-ghexport

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' //caffe2/test/dynamo:test_dynamo -- -r StructuredTraceTest
```

Differential Revision: D67150288

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143126
Approved by: https://github.com/desertfire
2024-12-13 21:48:25 +00:00
8a04018329 [MPS] Fix conv backward for channels last (cont) (#143196)
This is a continuation of https://github.com/pytorch/pytorch/issues/140902 but extends the same logic to input.

Looks like the existing channels-last logic just produced incorrect results on pre-MacOS-15 versions and fails on MacOS-15, so removing it feels like the right idea.

Fixes https://github.com/pytorch/pytorch/issues/142344
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143196
Approved by: https://github.com/manuelcandales
2024-12-13 21:32:42 +00:00
571cd92d7c [CI] Add Triton 3.13t build (#143212)
By just extending the matrix and invoking the script with the appropriate CPython runtime
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143212
Approved by: https://github.com/clee2000, https://github.com/atalman, https://github.com/seemethere
2024-12-13 21:28:52 +00:00
60c54467db [logging] Log runtime autotuning timing to scuba (#141919)
See test plan in internal diff [D66679369](https://our.internmc.facebook.com/intern/diff/D66679369)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141919
Approved by: https://github.com/jamesjwu, https://github.com/ezyang
2024-12-13 21:22:13 +00:00
0d6d29af38 [CUDA] Follow up to clean up some set_per_process_memory_fraction usage in tests (#142811)
follow-up to #140852 now that #140620 has landed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142811
Approved by: https://github.com/Skylion007
2024-12-13 21:09:05 +00:00
65d0a25289 [associative_scan] patch inductor tests to always run with static shape (#143161)
fixes #143053

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143161
Approved by: https://github.com/eellison
2024-12-13 21:06:12 +00:00
52f31cc238 dynamo tracing perf: Guard slots: 51.76 -> 51.34 (#143060)
See #143056 for overall docs.

This PR: Add slots to Guard
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143060
Approved by: https://github.com/jansel
ghstack dependencies: #143066, #143056, #143058, #143059
2024-12-13 21:02:50 +00:00
e87f07d3b8 Revert "Migrate compiler config to Config (#143152)"
This reverts commit 1ebdfd56053dafa8880a0dedf535fff70aa92e09.

Reverted https://github.com/pytorch/pytorch/pull/143152 on behalf of https://github.com/oulgen due to lint failure ([comment](https://github.com/pytorch/pytorch/pull/143152#issuecomment-2542342073))
2024-12-13 20:55:14 +00:00
625b4edb97 [CD] Test torch.compile on 3.13 (#143207)
Follow up after https://github.com/pytorch/pytorch/pull/143162
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143207
Approved by: https://github.com/atalman, https://github.com/ZainRizvi
2024-12-13 20:01:36 +00:00
fe9365f3f5 Add check_binary workflow to pytorch/pytorch (#143201)
Migrated from pytorch/builder
Related to: https://github.com/pytorch/builder/issues/2054

Copying from : 3468139e81
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143201
Approved by: https://github.com/seemethere, https://github.com/malfet
2024-12-13 19:30:10 +00:00
8f40446770 Fix precedence of bitwise and/or printing (#143197)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143197
Approved by: https://github.com/albanD, https://github.com/williamwen42
2024-12-13 19:29:42 +00:00
1ebdfd5605 Migrate compiler config to Config (#143152)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143152
Approved by: https://github.com/ezyang
ghstack dependencies: #143150, #143151
2024-12-13 19:29:07 +00:00
f1ff8bc1c5 Add type to Config (#143151)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143151
Approved by: https://github.com/ezyang
ghstack dependencies: #143150
2024-12-13 19:29:07 +00:00
9d05c8110d Require Config to have a default (#143150)
With aliases coming soon, we want to reject the alias + default combo, so we need defaults to be passed in. On top of this, it simplifies statically type checking the config.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143150
Approved by: https://github.com/ezyang
2024-12-13 19:28:59 +00:00
bf711a9cce [ROCm] Improve performance of reduce sum for 3D shapes (#143137)
Improve performance of reduce sum for 3D shapes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143137
Approved by: https://github.com/jeffdaily, https://github.com/eqy
2024-12-13 19:02:00 +00:00
6178be822d dynamo tracing perf: direct Guard: 52.58 -> 51.76 (#143059)
See #143056 for overall docs.

This PR: Remove the explicit constant check from `VariableBuilder.install_guards()` and its args calling convention. Also remove a lambda binding.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143059
Approved by: https://github.com/williamwen42, https://github.com/jansel
ghstack dependencies: #143066, #143056, #143058
2024-12-13 18:20:48 +00:00
6bcda3a21a dynamo tracing perf: cache on import_source: 52.9 -> 52.58 (#143058)
See #143056 for overall docs.

This PR: add cache to `InstructionTranslatorBase.import_source()`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143058
Approved by: https://github.com/jansel
ghstack dependencies: #143066, #143056
2024-12-13 18:20:48 +00:00
b472d82c96 dynamo tracing perf: import in build: 60.48 -> 59.92 (#143056)
A series of directed perf improvements to drive down the dynamo tracing cost of
the given test. Before this PR stack the compile took about 60s, and after it takes about
30s. Individual improvements are listed below along with the approximate
improvement of that change.

Tested with this model:
```
@torch.compile(backend="eager")
def model_add(x, y):
    out = x
    for i in range(5000):
        out = torch.add(out, y)
    return out
```

This PR: Stop importing builder in the inner loop of `VariableTracker.build()`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143056
Approved by: https://github.com/jansel
ghstack dependencies: #143066
2024-12-13 18:20:48 +00:00
63e1f97f4b dynamo tracing perf: don't unnecessarily call getframeinfo on the hot path: 47.26 -> 37.66 (#143066)
See #143056 for overall docs.

This PR: Stop using `getframeinfo()` when we only care about the function name
and throw the rest away.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143066
Approved by: https://github.com/jansel
2024-12-13 18:20:48 +00:00
e0c8abda76 Fix potentially undefined behaviour in index_put sample input (#143116)
From the [docs](https://pytorch.org/docs/stable/generated/torch.Tensor.index_put_.html) for index_put_:

> If accumulate is True, the elements in values are added to self. If accumulate is False, the behavior is undefined if indices contain duplicate elements.

Currently the sample inputs for `index_put` generate 2 indices. Because they are generated randomly, they could be the same, leading to undefined behaviour if `accumulate=False`.

This PR changes the input generation to only generate a single index if `accumulate=False` preventing duplicate indices and undefined behaviour.
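
A small sketch of why duplicates matter, following the documented `index_put_` semantics; the tensors are illustrative.

```
import torch

idx = (torch.tensor([1, 1]),)        # duplicate indices
vals = torch.tensor([10.0, 20.0])

x = torch.zeros(5)
x.index_put_(idx, vals, accumulate=True)   # well defined: x[1] == 30.0

y = torch.zeros(5)
y.index_put_(idx, vals, accumulate=False)  # undefined: y[1] may end up 10.0 or 20.0
```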

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143116
Approved by: https://github.com/albanD
2024-12-13 17:59:01 +00:00
23b8ea3094 Allow disabling int specialization on nn.Modules (#142829)
Resolves issue #140464 by adding an option to not specialize int from nn.Modules (False by default to maintain existing behavior).
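
A minimal sketch of how the option would be used; the exact config name (`allow_unspec_int_on_nn_module`) is an assumption here, since this commit message does not spell it out.

```
import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.steps = 10  # a plain int attribute on the module

    def forward(self, x):
        return x + self.steps

# Assumed knob name; when enabled, the int attribute is treated as
# unspecialized instead of being baked into the graph as a constant.
torch._dynamo.config.allow_unspec_int_on_nn_module = True
out = torch.compile(M())(torch.randn(4))
```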

Test Plan: `buck2 test mode/opt caffe2/test/dynamo:test_dynamo -- test_modules.py::NNModuleTests::test_nn_module_unspec_int_attr`

Differential Revision: D66837042

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142829
Approved by: https://github.com/ezyang, https://github.com/yanboliang
2024-12-13 17:26:11 +00:00
82a45d19b4 Expose sharedMemPerMultiprocessor device property to python (#143119)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143119
Approved by: https://github.com/ezyang
2024-12-13 16:53:57 +00:00
3f62054de1 [ROCm] upgrade nightly wheels to rocm6.3 - 1 of 2 (docker images) (#142151)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142151
Approved by: https://github.com/jeffdaily
2024-12-13 16:21:17 +00:00
7968732f5b Fix int8 mm V.ops.mul dispatching (#143127)
This is sort of subtle - because we were doing `V.ops.mul` at binding time, we don't redispatch later when we invoke the epilogue, and then later run into the assertion checking in the PR above.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143127
Approved by: https://github.com/drisspg
ghstack dependencies: #143048
2024-12-13 16:17:23 +00:00
da67a6a7bb [inductor] Replace set by OrderedSet (#138466)
Uses the set_linter from https://github.com/pytorch/pytorch/pull/138454
and considerable manual editing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138466
Approved by: https://github.com/eellison
2024-12-13 16:08:45 +00:00
fbfc530442 [export][ez] Fix forward D67044185 (#143193)
Summary: Fixing forward D67044185 and T210459833 by adding the missing build file.

Test Plan: buck2 build --flagfile fbcode//mode/opt fbcode//admarket/training_data/augmentation/processors/tests:model_manager_test

Differential Revision: D67200056

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143193
Approved by: https://github.com/tugsbayasgalan
2024-12-13 16:06:42 +00:00
04bb82f097 Linux Wheels: Remove triton dependency python < 3.13 constraint (#143162)
We do build the pytorch-triton package for Python 3.13: https://github.com/pytorch/pytorch/actions/runs/12304476674/job/34344764271
Hence the constraint is no longer needed.
This stack enabled torch.compile for Python 3.13: https://github.com/pytorch/pytorch/pull/141264
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143162
Approved by: https://github.com/kit1980
2024-12-13 15:08:44 +00:00
810808d97d Enable cutlass-based all-gather matmul when TORCH_SYMM_MEM_ENABLE_NATIVE_ASYNC_TP is set (#142283)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142283
Approved by: https://github.com/weifengpy, https://github.com/Chillee
2024-12-13 10:29:14 +00:00
3e1f587514 [AOTI] Fix an autotune block grid computation issue (#143098)
Summary: There is a grid computation issue after switching to one-pass codegen in https://github.com/pytorch/pytorch/pull/141980. When max-autotune is turned on, there is an incorrect grid codegen in some cases.

Reviewed By: henrylhtsang

Differential Revision: D67120987

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143098
Approved by: https://github.com/henrylhtsang
2024-12-13 07:52:30 +00:00
9f90583ca2 [CI] Run aarch64 tests on Graviton3 (#143129)
Graviton3 is Armv8.6, which has SVE and BF16 capability.

mkldnn_pattern_matcher skips are tracked in https://github.com/pytorch/pytorch/issues/143146

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143129
Approved by: https://github.com/digantdesai
2024-12-13 07:39:22 +00:00
c37185c76a [BE] Stop using deprecated APIs in mkldnn_pattern_matcher (#143156)
This should fix
```
/var/lib/jenkins/workspace/test/inductor/test_mkldnn_pattern_matcher.py:157: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
```
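
The non-deprecated spelling suggested by the warning looks like the following (a minimal sketch; the dtype choice is illustrative):

```
import torch

# torch.amp.autocast with an explicit device type replaces torch.cpu.amp.autocast
with torch.amp.autocast("cpu", dtype=torch.bfloat16):
    y = torch.mm(torch.randn(8, 8), torch.randn(8, 8))
```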

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143156
Approved by: https://github.com/kit1980
2024-12-13 06:37:20 +00:00
cyy
075905b7bd [14/N] Fix extra warnings brought by clang-tidy-17 (#141644)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141644
Approved by: https://github.com/ezyang

Co-authored-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com>
2024-12-13 06:22:13 +00:00
72fd7abb35 [ca] fix flex attention backward HOP capture in initial graph (#143155)
FIXES https://github.com/pytorch/pytorch/issues/142313

So with previous HOPs, compiled autograd could just inline into their body and get their post-dispatch aten representation. You can't do that with this flex attention HOP, which just wants any proxy tracing mechanism to insert it into its graph. Okay, compiled autograd does use proxy tracing, so we can do that.

This is safe because other than the reenter_make_fx call, there were no other make_fx internals usage in the HOP. And compiled autograd specializes on the AOT backward's saved symints which should cover any changes in shapes to the inputs of the HOP.

However, there's still an issue: Dynamo doesn't know how to handle `FlexAttentionBackwardHOP` and will graph break, so the flex attention backward is running in eager as of this PR. The tlparse looks really scuffed after the compiled autograd capture: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpMMHBEH/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143155
Approved by: https://github.com/drisspg
2024-12-13 06:04:39 +00:00
b4f4c75e19 [dynamo] Support multiple inheritance for custom dict construction (#142416)
This patch applies a local and practical workaround for custom dict
construction when multiple inheritance is involved.

Handling multiple inheritance in general could be a lot more involved,
so I created #142414 to track that.

Fixes #141118.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142416
Approved by: https://github.com/jansel
2024-12-13 05:13:05 +00:00
b5d8d2444a add README.md for compile time benchmarks (#143145)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143145
Approved by: https://github.com/laithsakka
ghstack dependencies: #141517, #143143
2024-12-13 05:12:26 +00:00
b7ad52abb0 Use new group instead of split group on non-CUDA device (#141469)
Motivation:

Currently, `split_group` only works for the NCCL backend (https://github.com/pytorch/pytorch/blob/main/torch/distributed/distributed_c10d.py#L4745), so we need to use `new_group` on other, non-CUDA devices.
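
A minimal sketch of that fallback path on a backend without `split_group` support; the gloo backend and rank selection are illustrative, and it assumes the process group was already initialized by the launcher.

```
import torch.distributed as dist

# assumes dist.init_process_group(backend="gloo") has already run under torchrun
ranks = list(range(dist.get_world_size() // 2))
subgroup = dist.new_group(ranks=ranks)  # works on any backend, unlike split_group
```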

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141469
Approved by: https://github.com/kwen2501, https://github.com/gujinghui, https://github.com/albanD
2024-12-13 05:11:33 +00:00
57c46af47a [Inductor][CPU] Add torchao da8w8 pattern with sym quantized act & wgt (#142110)
### Summary

Extends #142036 for Inductor pattern-matching pattern covered for torchao API `int8_dynamic_activation_int8_weight` in the following scenario (inference-only, freezing enabled) -

- int8 quantized (symmetrically) activation (per token quantized).
- Statically (so, scales are also constant. But then they would have been constant even in case of dynamic quantization due to constant weights, anyway) per-channel int8 quantized (symmetrically) weights (which are also constant because freezing is enabled).

The pattern that's matched is `torch._int_mm` -> convert to FP32/BF16 -> [optional expand for activation scale] -> `mul` -> `mul`.

We don't check whether the activation is dynamically quantized or whether the weights are statically quantized, though (since the implementation won't have any side effects even if that weren't true).

In practice, it also matches the smooth-quant int8 quantized linear pattern if its output is not reshaped (if activation is 2D).
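
A rough eager-mode sketch of the decomposed da8w8 linear that the pattern matcher targets; shapes, scale layouts, and random data are illustrative assumptions.

```
import torch

act_int8 = torch.randint(-128, 127, (32, 64), dtype=torch.int8)
wgt_int8 = torch.randint(-128, 127, (64, 32), dtype=torch.int8)
act_scale = torch.rand(32, 1)                  # per-token activation scale
wgt_scale = torch.rand(32)                     # per-channel weight scale

out_int32 = torch._int_mm(act_int8, wgt_int8)  # int8 x int8 -> int32 GEMM
out = out_int32.to(torch.float32) * act_scale * wgt_scale
```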

### More details

oneDNN int8 matmul supports application of per-channel weight scale but not a vector activation scale, which could be applied as a post op, but is currently unsupported in ATen. Bias addition (which could be supported with an add post-op) is also unfused.

The fusion pattern used in this PR is `torch._int_mm` -> convert to FP32/BF16 -> `mul`, which will be replaced by the oneDNN qlinear op.

The speedup over eager-mode is due to 2 reasons -
1. fusion of int8xint8 -> int32 GEMM, conversion to FP32/BF16 & application of weight scale. (In case of BF16, many intermediate conversions are also avoided).
2. weight is pre-packed & cached by Inductor, so a reorder is avoided at run-time.

But, in the future, the whole pattern (including application of activation scale, which would be a mul post-op) + bias could be fused if corresponding support would be enabled in ATen.

### Verification

Added UT in this PR
```
python test/inductor/test_mkldnn_pattern_matcher.py -v -k test_da8w8_sym_act_sym_wgt_with_int_mm
```

#### Corresponding torchao UTs

1. int8 Smoothquant legacy API - `TORCHINDUCTOR_FREEZING=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+inductor" python test/integration/test_integration.py -v -k test_non_dynamically_quantizable_linear`.
The difference from #139595 is that there are no reshapes of the linear output in this pattern.

2. int8 da8w8 - symmetrically quantized activation (dynamically) & statically quantized weights -  ` TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+inductor" TORCHINDUCTOR_FREEZING=1 python test/integration/test_integration.py -v -k test_int8_dynamic_quant_subclass_api_0_cpu`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142110
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #142036
2024-12-13 04:59:03 +00:00
b731ced91f Prologue Fusion (#134532)
This PR extends our ability to fuse pointwise nodes onto the outputs of triton templates (epilogue fusion) with the ability to fuse pointwise nodes into the inputs of triton templates - prologue fusion.

Similar to the store_output api:
`{{store_output(("idx_m", "idx_n"), "acc", "mask")}}`

And the modification api:

```
{{ modification(
    subgraph_number=0,
    output_name="post_mod_scores",
    score="qk",
    out="qk"
) | indent_except_first(1) }}
```

We have:

```{{load_input("B", "b", ("idx_m", "idx_n"), mask=None if EVEN_K else "b_mask", indent_width=8)}}```

Because we are now loading the input with explicit indices and mask, I needed to rewrite the mm kernel to no longer update the [pointers by BLOCK_K](bb03ef7aca/torch/_inductor/kernel/mm.py (L110-L111)) on every iteration and instead compute indices on each iteration from the k_idx of each loop. This did not have any perf difference.

There are a couple main use cases for prologue fusion:

- Fusing dequants into a matmul. particularly for more bandwidth bound scenarios.
- Fusing gather into a matmul. This is useful particularly in MOE. See https://github.com/pytorch/pytorch/issues/134535 for more details.

Prologue fusion is generally much less profitable than epilogue fusion, because it must be applied to an element of an input on each loop of the matmul, compared to only once in the epilogue (gather into matmul is a potential exception). Accordingly, we are much less aggressive in attempting prologue fusion. We only attempt fusion if it does not increase the number of memory bytes read inside the triton template, multiplied by a small factor to allow gathers. This restricts reliably unprofitable fusions like fp32->fp16 conversion inside the kernel. In a future PR we could potentially add an API for being more aggressive if we know we are in a bandwidth-bound regime. See: https://github.com/pytorch/pytorch/pull/134532/files#diff-d2539c9c8dc6a3d7e457767a880612e96d3c85752a77ead49a9e4e00a3e4c3c7R3060-R3066

Other notes:

By default we will upcast to fp32 inside every kernel. This matches eager numerics. This is fine enough for epilogue because it is only done once (although it is probably unnecessary for say a relu) but tanks perf for prologue. I am currently using the `codegen_upcast_to_fp32` option to avoid it, but that will not work for libdevice calls that require fp32. We will need https://github.com/pytorch/pytorch/pull/136778/ and dtype-aware codegen to upcast fp16 ops into libdevice calls.

With prologue fusion, we now have essentially separate kernels for each input, and for the output. I had to increase the number of fields that are swapped out in `set_subgraph_body` by a large number :/ I also updated the fusion logic because the inputs will have a different group than the outputs. Maybe as part of enabling multiple outputs, this could get cleaned up a bit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134532
Approved by: https://github.com/jansel
2024-12-13 04:18:25 +00:00
ceb664aca6 add float_args benchmark (#143143)
71% improvement with automatic dynamic float arguments

with specialize_float=False
```
float_args,compile_time_instruction_count,346293869
```

with specialize_float=True
```
float_args,compile_time_instruction_count,1198546486
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143143
Approved by: https://github.com/laithsakka
ghstack dependencies: #141517
2024-12-13 03:35:59 +00:00
ab04f3aee1 [ca] set autograd graph task state (#143108)
GraphTask holds metadata needed for a single execution of backward(), it is 1:1 with backward calls, at least for compiled autograd. It is used for certain torch._C global autograd state APIs.

In SAC, we use torch._C._current_graph_task_id() as a dict key to store information during unpack hook execution: a5fb07af27/torch/utils/checkpoint.py (L1128)

If we don't set an active task, it will randomize the key and will do its logic as if each unpacked tensor were from a different graph task:
a5fb07af27/torch/utils/checkpoint.py (L1112-L1115)
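
A minimal sketch of the dict-keying pattern in question, using saved-tensor hooks; the bookkeeping itself is illustrative, not the checkpoint code.

```
import torch

per_task_count = {}

def pack(t):
    return t

def unpack(t):
    # key per-backward bookkeeping by the current graph task, as checkpoint does
    task_id = torch._C._current_graph_task_id()  # -1 if no graph task is active
    per_task_count[task_id] = per_task_count.get(task_id, 0) + 1
    return t

x = torch.randn(3, requires_grad=True)
with torch.autograd.graph.saved_tensors_hooks(pack, unpack):
    y = (x * x).sum()
y.backward()  # all unpacks during this backward should share one task id
```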

The sketchy part of this PR is that in eager autograd, GraphTask is mutated during execution. But inspecting the struct, the mutation seems to only be used to communicate between autograd threads (created when multiple devices are involved) or for deprecated uses. We shouldn't run into the mutation case at all in compiled autograd. Also, only the graph task id is accessible from python hooks.

FIXES https://github.com/pytorch/pytorch/issues/142862

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143108
Approved by: https://github.com/jansel, https://github.com/albanD
2024-12-13 03:10:48 +00:00
dbe4b69df0 [Inductor] Fix cooperative reduction tests broken in recent refactor (#143135)
These tests were broken by https://github.com/pytorch/pytorch/pull/142020. This PR updates the fixed configs accordingly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143135
Approved by: https://github.com/jansel, https://github.com/huydhn
2024-12-13 02:03:43 +00:00
cyy
9f5ebf3fc6 Clang-format aten/src/ATen/native/Tensor*{cpp,h} (#143089)
These files are relatively stable, so it should be safe to format them without incurring conflicts

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143089
Approved by: https://github.com/albanD
2024-12-13 00:06:48 +00:00
2533a5a843 upgrade sccache to 0.9.0 (#142854)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142854
Approved by: https://github.com/malfet, https://github.com/ZainRizvi
2024-12-12 22:49:50 +00:00
fb93462904 [Reopen][Inductor][CPU] Fuse SmoothQuant int8 linear pattern (#142036)
Reopen of https://github.com/pytorch/pytorch/pull/139595

**About the PR**
In the implementation of SmoothQuant in Torchao, quantized linear is computed by `_int_mm(a, b)` + `mul(b_scale)` + `mul(a_scale)` (+ optional `add` for bias) with `reshape` and `convert_dtype` in between.
This PR adds a pass to fuse the corresponding patterns:
- (no bias) `reshape -> _int_mm -> convert_element_type -> (expand -> mul) -> mul -> reshape`
- (with bias) `pattern_no_bias -> add -> reshape -> reshape`

The patterns are replaced by `onednn.qlinear_pointwise` and `onednn.qlinear_prepack`, the latter of which is evaluated and frozen during the freezing process of Inductor. The final graph contains `onednn.qlinear_pointwise` only with packed weight constants.

Note that `onednn.qlinear_pointwise` only supports a scalar activation scale, which is a limitation of oneDNN library, so in that case we set activation scale to 1 and bias to none and apply scales and add bias after `onednn.qlinear_pointwise`.

**Validation results**
Accuracy/perplexity is not changed with or without this fusion pass.
Latency is improved by >10% with the fusion pass.
Test method:
- Model: EleutherAI/gpt-j-6b
- Hardware: Intel(R) Xeon(R) Platinum 8490H, running on 1 socket, 60 cores
- Using Intel OMP and Tcmalloc
- Running [the example script of SmoothQuant in Torchao](https://github.com/pytorch/ao/blob/main/torchao/prototype/smoothquant/example.py) with `TORCHINDUCTOR_FREEZING=1 numactl -N1 python example.py -m EleutherAI/gpt-j-6b --device=cpu --quant-mode=dynamic --compile`

**Test plan**
```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_smooth_quant_with_int_mm
```

Differential Revision: [D66796966](https://our.internmc.facebook.com/intern/diff/D66796966)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142036
Approved by: https://github.com/jerryzh168, https://github.com/jgong5

Co-authored-by: sanchitintel <sanchit.jain@intel.com>
2024-12-12 21:18:03 +00:00
602c86a420 [DSD] Fix strict=False case for DDP (#143038)
Summary:
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143038
Approved by: https://github.com/mori360
2024-12-12 21:15:21 +00:00
a7509e98c5 [pipelining] fix backward_one_chunk when the output of the model is a… (#142237)
fixes #142229

If any of the ``stage_output`` tensors is a view, it cannot be detached in place. Replacing it with ``t = t.detach()`` or similar would not free the graph for the output given to the user. Detaching the base tensor could cause a side effect.

The same code is used in ``_backward.py`` (b64a537993/torch/distributed/pipelining/_backward.py (L215)) but does not seem to cause any issue in my case. Maybe needs some investigation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142237
Approved by: https://github.com/H-Huang
2024-12-12 20:59:35 +00:00
39cacc1d81 Fix missing tests on test tool lint job (#143052)
A follow-up from https://github.com/pytorch/pytorch/pull/142476#discussion_r1878888558 where some tests are not discovered correctly by pytest

### Testing

https://github.com/pytorch/pytorch/actions/runs/12287448581/job/34289531307?pr=143052#step:14:162 shows the correct number of tests now
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143052
Approved by: https://github.com/ZainRizvi
2024-12-12 20:29:32 +00:00
82ce888273 c10::string_view -> std::string_view in more places (#142517)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142517
Approved by: https://github.com/malfet
2024-12-12 19:45:59 +00:00
0b75b7ff2b [Easy] factor out inductor ophandler decompositions (#142400)
Factor out inductor operator decompositions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142400
Approved by: https://github.com/Chillee, https://github.com/jansel
2024-12-12 19:03:26 +00:00
c170248b78 [Profiler] Enable Iterative Step without profiler in fbcode (#142077)
Summary: Adds a post-optimizer hook for fbcode so that we can run the iterative step on demand without having to use a frontend profiler interface. Since this is being used more frequently, it would be convenient for users to be able to trigger this on-demand feature without having to worry about being within some timing window.

Test Plan: Ran iterative tracing without profiler.profile

Differential Revision: D66734119

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142077
Approved by: https://github.com/briancoutinho
2024-12-12 19:00:13 +00:00
e3fe5f62b6 Remove Checkout pytorch/builder for Linux Binary Builds (#143125)
Follow Up after: https://github.com/pytorch/pytorch/pull/142282

Remove Checkout pytorch/builder for Linux Binary Builds
I believe we were not using builder already, hence remove this checkout.
We should be using scripts from this folder:
```
/pytorch/.ci/${{ inputs.PACKAGE_TYPE }}/build.sh
```

TODO: Will follow up with removing BUILDER_ROOT everywhere from the PyTorch repo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143125
Approved by: https://github.com/kit1980
2024-12-12 18:55:00 +00:00
d48b16a725 Revert "[Dynamo] only import einops if version is lower than 0.7.0 (#142847)"
This reverts commit 357e261b1eded933d98de18ddcef2b083f87259d.

Reverted https://github.com/pytorch/pytorch/pull/142847 on behalf of https://github.com/atalman due to Breaks binary builds, see the comment above ([comment](https://github.com/pytorch/pytorch/pull/142847#issuecomment-2539759580))
2024-12-12 18:44:35 +00:00
b0c3d39e0d [pipelining] Update tutorials and documentation (#143045)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143045
Approved by: https://github.com/wconstab, https://github.com/kwen2501
2024-12-12 18:42:17 +00:00
ee5bceaee6 [sigmoid] Write the new export schema format to archive without breaking compatibility. (#142511)
Summary:
This diff makes it possible to migrate to PyTorch's OSS export schema from sigmoid. Basically, we add a new field called "methods" to ExportedProgram in the Model definition, which contains the thrift schema generated based on schema.py from OSS. This way, we can keep writing the old fields while double-writing a new format in equivalent form. Since thrift doesn't support inlining type definitions, we do it manually here, and it shouldn't break on-wire compatibility. As long as every sigmoid user is using sigmoid.frontend.serialization.serialize, we always guarantee to have the new format saved as well.

Eventually we will use JSON deserialization from OSS, so we will only keep this double writing for a couple of months. Eventually, we will migrate every serialization path to the OSS workflow.

Test Plan:
buck test mode/opt sigmoid/frontend:serialization_test
buck test mode/opt sigmoid/frontend/test_gpu:serializer_test

Differential Revision: D67044185

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142511
Approved by: https://github.com/desertfire
2024-12-12 18:41:10 +00:00
5dabe2d464 Fix NJT backward tests (#143072)
This PR fixes some issues with NJT backward / compile backward tests:
1. `requires_grad` was not being propagated appropriately during `SampleInput` generation, so a LOT of backward cases were untested before (sad times). This PR utilizes a helper function `_clone()` to clone() / detach() NJTs for SampleInputs while preserving `requires_grad` status. Note: the clone() / detach() stuff is for autograd; can't have two SampleInputs as part of the same autograd graph.
2. Per-sample skips weren't -fully- working; the op logic would still be invoked even with a skip. I found this out thanks to `split_with_sizes`, which segfaults during backwards because it tries to use an NST-specific formula. As annoying as it is, I tried a ton of things but ultimately had to split the `subtest_ctx` into that + a `skip_xfail_ctx` to run the subtests within.
    * Updated all uses of per-sample skips / xfails: 4 in `test_nestedtensor.py` and 1 in `test_vmap.py`
3. Added the appropriate skips / xfails to get everything passing. There are a shitton of bugs to fix!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143072
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
2024-12-12 18:06:23 +00:00
d47a80246a [dynamo][pytree][3/N] make CXX pytree traceable: tree_map / tree_map_ (#137399)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137399
Approved by: https://github.com/jansel
ghstack dependencies: #137398
2024-12-12 18:05:25 +00:00
7edeb1005a [dynamo][pytree][2/N] make CXX pytree traceable: tree_flatten / tree_unflatten / tree_structure (#137398)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137398
Approved by: https://github.com/jansel
2024-12-12 18:05:25 +00:00
c85323c5e8 Revert "Tests Generelization for multiple accelerator devices (#139184)"
This reverts commit b576a8c318201b63269f7ff25ec5830d00662a7a.

Reverted https://github.com/pytorch/pytorch/pull/139184 on behalf of https://github.com/clee2000 due to Failing internally when trying to pickle distributed test files D67098795 ([comment](https://github.com/pytorch/pytorch/pull/139184#issuecomment-2539610187))
2024-12-12 17:48:30 +00:00
2f0fe82f6d Revert "[14/N] Fix extra warnings brought by clang-tidy-17 (#141644)"
This reverts commit 24a5a2ef258d2b482ded674cdb9555afaf081402.

Reverted https://github.com/pytorch/pytorch/pull/141644 on behalf of https://github.com/clee2000 due to failing internally D67112938 ([comment](https://github.com/pytorch/pytorch/pull/141644#issuecomment-2539602023))
2024-12-12 17:43:36 +00:00
dc23f1944a Remove unused Python variables in torch/[_-a]* (#133492)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133492
Approved by: https://github.com/albanD
2024-12-12 17:39:14 +00:00
7667235a23 c10::optional -> std::optional (#142514)
Fixes issues introduced in https://github.com/pytorch/pytorch/pull/141348 and https://github.com/pytorch/pytorch/pull/139578

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142514
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-12-12 17:23:46 +00:00
520ba556cd [Inductor] Refactor "r" reduction prefix to {"r0_", "r1_"}. (#142020)
Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243.

# Feature

This PR changes the `RINDEX` / `"r"` symbol type to `(R0_INDEX, R1_INDEX)` and `("r0_", "r1_")`, respectively. This allows the relevant code to support 2D (often ND) reductions. Unlike the parent PR, this one does not change the tiling algorithm, so `"r1_"` is never used. However, it prepares other parts of the system to handle `"r1_"` once we start using it. This should significantly reduce the chances of hitting merge conflicts, making the parent PR much easier to land.

The only change to the generated triton code is to rename `"rindex"` -> `"r0_index"`, `"RBLOCK"` -> `"R0_BLOCK"`, etc. To maintain compatibility with existing codegen, this also generates aliases to the old reduction variables like `rindex = r0_index`. If we generated 2D reductions (which this PR will not do), the aliases would be more complicated and would collapse 2D multi-indices to linear indices. See some example kernels in the parent PR.

These aliases can be eliminated by the Triton compiler, and should not impact the final machine code running on the GPU. See the perf testing in the parent PR which confirms the aliases do not impact perf.

# Test plan

The existing CI provides good coverage. This PR modifies the expected code in a few places, renaming reduction variables from `r.*` to `r0_.*`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142020
Approved by: https://github.com/jansel

Co-authored-by: Jason Ansel <jansel@meta.com>
2024-12-12 17:22:20 +00:00
cf538efd0c Revert "Hide torch_python symbols (#142214)"
This reverts commit da76e912a4c58c649061fc84b29a42714897a0ca.

Reverted https://github.com/pytorch/pytorch/pull/142214 on behalf of https://github.com/huydhn due to The MacOS failure looks legit as it shows up in trunk ([comment](https://github.com/pytorch/pytorch/pull/142214#issuecomment-2539543504))
2024-12-12 17:15:51 +00:00
15ee2960e1 [aot] Functionalize aot backward prologue and epilogue wrappers (#142415)
For functional compiled autograd, we're having dynamo trace through the aot backward implementation. To avoid graph breaking and imposing too many restrictions, we allow_in_graph the prologue and epilogue. This adds 2 restrictions:
- code must be available in the global context
- inputs other than tensors/symnodes must be const foldable
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142415
Approved by: https://github.com/bdhirsh
2024-12-12 17:14:29 +00:00
30b61e521c [logging] Populate compile_time_autotune_time_us (#143104)
See testing in attached diff

Differential Revision: [D67128210](https://our.internmc.facebook.com/intern/diff/D67128210)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143104
Approved by: https://github.com/ezyang
2024-12-12 17:08:43 +00:00
e3ddc0ca33 Support remote caching requiring redis auth (#141679)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141679
Approved by: https://github.com/masnesral
2024-12-12 17:07:50 +00:00
0f78be5573 Fix search icon (#142808)
Removing:

.pytorch-left-menu-search input[type=text] {
    background-image: none;
}
so that the search icon correctly appears in the sphinx searchbox

Also, fixing scrolling

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142808
Approved by: https://github.com/albanD
2024-12-12 16:09:30 +00:00
725526abc5 Fix scan dtypes (#143048)
Fix for https://github.com/pytorch/pytorch/issues/142883. We weren't getting test coverage of scan because the tests were being skipped; see https://github.com/pytorch/pytorch/issues/143053.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143048
Approved by: https://github.com/arui-meta, https://github.com/blaine-rister
2024-12-12 15:57:00 +00:00
d83a049232 [EZ] Update lintrunner in CI to 0.12.7 (#143073)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143073
Approved by: https://github.com/wdvr
2024-12-12 15:35:37 +00:00
7cc3a591c2 [FlexAttention] Fix a few more symbolic shape issues (#142816)
# Summary

See  https://github.com/pytorch/pytorch/issues/139064 for more details. This fixes a number of issues with dynamic shapes. Thanks to @alexdremov for finding most of these

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142816
Approved by: https://github.com/yanboliang, https://github.com/ezyang
2024-12-12 15:29:21 +00:00
84f791381a Python 3.13 CI add crossref test to existing linux-focal-py3_13-clang10-build (#143074)
Add  linux-jammy-py3_13-gcc11-build and test - similar to Py 3.9
Add crossref test to existing linux-focal-py3_13-clang10-build
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143074
Approved by: https://github.com/malfet
2024-12-12 14:45:56 +00:00
cd1b5924d5 Revert "[Inductor] Use sleef implementation for CPP backend asinh codegen (#142360)"
This reverts commit 79cf8fa75176a8f6bb79d426c6d0f9369d03ff98.

Reverted https://github.com/pytorch/pytorch/pull/142360 on behalf of https://github.com/jeanschmidt due to seems to have broken macos tests ([comment](https://github.com/pytorch/pytorch/pull/142360#issuecomment-2539143039))
2024-12-12 14:42:55 +00:00
30e2b322a1 Add <string> to uninteresting_files (#142984)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142984
Approved by: https://github.com/albanD, https://github.com/IvanKobzarev
2024-12-12 14:35:30 +00:00
91261107e0 debug handler maintain through decomposition (#141612)
Add checks in the ao numeric debugger to guard debug handle consistency across aten op decomposition.

Differential Revision: [D66517480](https://our.internmc.facebook.com/intern/diff/D66517480/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141612
Approved by: https://github.com/jerryzh168
2024-12-12 12:26:45 +00:00
18785c1af9 [BE][accelerator] formalize API name {current,set}_device_{idx => index} (#140542)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140542
Approved by: https://github.com/guangyey, https://github.com/albanD
2024-12-12 10:53:48 +00:00
a5fb07af27 [Torch][#142396]Resolve Failure When Uploading To Remote Storage (#143046)
Summary: Catch the io.UnsupportedOperation exception so that streams without fileno support don't cause a failure.

Test Plan: UT

Differential Revision: D67108487

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143046
Approved by: https://github.com/saumishr
2024-12-12 08:17:15 +00:00
497f89ff83 fix dynamo nn module stack fqn (#142823)
Dynamo can produce sources that have funny patterns in their `.name()` that break `nn_module_stack` fqns. Added a test that used to have `._modules` inside nn_module_stack fqns, now doesn't. (Unfortunately couldn't repro a case mentioned in the GH issue where `.slice(...)` is claimed to appear as well.)

Fixes https://github.com/pytorch/pytorch/issues/141939

Differential Revision: [D67064189](https://our.internmc.facebook.com/intern/diff/D67064189/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142823
Approved by: https://github.com/pianpwk, https://github.com/zhxchen17
2024-12-12 07:02:13 +00:00
da76e912a4 Hide torch_python symbols (#142214)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142214
Approved by: https://github.com/ezyang
2024-12-12 07:00:54 +00:00
dcb128d495 [ROCm] TunableOp use thread-safe getenv functions (#142274)
Fixes #142403

~~PR fixes breakage due to this commit
8cd7ad8b48~~

PR is a partial reland of this https://github.com/pytorch/pytorch/pull/140594 with a unit test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142274
Approved by: https://github.com/jeffdaily, https://github.com/eqy
2024-12-12 06:49:26 +00:00
5ad7d5304c [DTensor][random] add HSDP+TP model init test (#143077)
**Summary**
1. Move the model init tests from `DistTensorRandomOpTest` to `DistTensorRandomInitTest`
2. Added an HSDP+TP meta init test to show the correct model init result in this use case. Note that this test requires 8 GPUs to run, and our CI doesn't have that capacity, so this test will be skipped in CI. A local run shows that the test passes on an 8-GPU host.

**Test**
`pytest test/distributed/_tensor/test_random_ops.py -s -k test_hsdp_tp_model_meta_init`

<details>
<summary> Test Result </summary>
<img width="3343" alt="image" src="https://github.com/user-attachments/assets/a960c5e6-37bc-49be-9e36-ecc29ed47eb0" />

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143077
Approved by: https://github.com/weifengpy
2024-12-12 06:46:16 +00:00
357e261b1e [Dynamo] only import einops if version is lower than 0.7.0 (#142847)
Fixes internal xref (https://fb.workplace.com/groups/257735836456307/posts/804793021750583/?comment_id=805229281706957&reply_comment_id=805232695039949)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142847
Approved by: https://github.com/zou3519
2024-12-12 06:38:22 +00:00
9701c50bdc [Dynamo] Add missing tensor builtins to allowed functions (#142841)
Fixes https://github.com/pytorch/pytorch/issues/141232

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142841
Approved by: https://github.com/yanboliang
2024-12-12 06:38:19 +00:00
b25f64b613 Add -o pipefail for all bash scripts (#143050)
Fixes #142380
I have added -o pipefail in all bash scripts in pytorch/.ci/pytorch. Sorry I didn't double-check the submodule in my last PR. Thanks for the correction! Please contact me again if there are any problems with this fix^^. (Actually contributing to the open source community is an assignment for one of my courses and today is the deadline so I rushed to revise it when I saw an email early in the morning. Haha.)
 @ezyang @malfet @huydhn @zou3519

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143050
Approved by: https://github.com/ezyang, https://github.com/huydhn

Co-authored-by: Edward Z. Yang <ezyang@mit.edu>
2024-12-12 06:18:41 +00:00
79cf8fa751 [Inductor] Use sleef implementation for CPP backend asinh codegen (#142360)
**Summary**
Fix https://github.com/pytorch/pytorch/issues/142345. Previously, we used `asinh(x) = log(x + sqrt(1 + x**2))` to calculate the result of `asinh`. The issue happens with an input of `-10000.1`, which makes `x + sqrt(1 + x**2)` close to 0, and log(0) is invalid. We use the `sleef` implementation in this PR to fix this issue.
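
A small repro sketch of the cancellation (exact printed values depend on the platform):

```
import torch

x = torch.tensor(-10000.1, dtype=torch.float32)
naive = torch.log(x + torch.sqrt(1 + x * x))  # catastrophic cancellation: the sum rounds to ~0
print(naive)           # -inf or a badly inaccurate value in float32
print(torch.asinh(x))  # roughly -9.9035, the correct reference value
```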

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_asinh_with_corner_inputs
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142360
Approved by: https://github.com/jgong5
2024-12-12 05:40:48 +00:00
1e2b841675 [ROCm] Prune old gfx archs gfx900/gfx906 from binaries (#142827)
Remove gfx900 and gfx906 archs as they're long-in-the-tooth. Should help reduce the increasing size of ROCm binaries.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142827
Approved by: https://github.com/jeffdaily
2024-12-12 05:33:40 +00:00
cyy
fda43c98d1 Improve implementation of quantized_batch_norm (#141570)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141570
Approved by: https://github.com/albanD
2024-12-12 04:35:00 +00:00
cyy
20df80a669 Remove unneeded optional dereference (#141578)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141578
Approved by: https://github.com/swolchok
2024-12-12 04:34:43 +00:00
cyy
f7b9533c3f [4/N] Apply bugprone-unchecked-optional-access (#142832)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142832
Approved by: https://github.com/albanD
2024-12-12 04:33:32 +00:00
fbbafd0320 Turn on AOTAutogradCache by default on open source (#141981)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141981
Approved by: https://github.com/bdhirsh, https://github.com/oulgen
2024-12-12 04:21:11 +00:00
4d0775462e E2E composability testing (#141398)
Add a 3D (pp+tp+fsdp) test `test_3d_with_tp_dp_pp` in test_pp_composability.
Currently we provide @parametrize on:
- "ScheduleClass" for pp in [ScheduleGPipe, Schedule1F1B, ScheduleInterleaved1F1B, ScheduleLoopedBFS, ScheduleInterleavedZeroBubble]
- "MixedPrecisionParam" for fsdp in [torch.bfloat16, torch.float32]

Future work:
1. add fp8
2. add cp(context parallelism) to enable 4D test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141398
Approved by: https://github.com/wconstab, https://github.com/kwen2501
2024-12-12 04:19:29 +00:00
cyy
2903cf0ad8 Re-enable some C++ warnings (#142332)
This PR enables some C++ warnings since the code base is fairly clean. Meanwhile, Wextra-semi is disabled on CUDA-generated code since there is no way to fix those warnings without the cooperation of the CUDA team.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142332
Approved by: https://github.com/albanD, https://github.com/eqy
2024-12-12 04:02:12 +00:00
f892f9862a [ROCM] Enable *_load_dwordx4 ISA for BFloat16 and Half. (#141397)
Remove the input_vec_size constexpr and move it to a template parameter. This enables generation of vectorized loads in the ROCm AMDGPU backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141397
Approved by: https://github.com/jeffdaily

Co-authored-by: Jerry Mannil <jerry.mannil@amd.com>
2024-12-12 03:27:49 +00:00
4d8357e912 [CD] Use Anaconda cmake for Mac builds (#143054)
To find the Anaconda-env-installed OpenMP
(as OpenMP from PyPI is looking for it in different places)

For posterity: our build script names are very confusing:
 - [`.ci/wheel/build_wheel.sh`](6cb6e8d790/.ci/wheel/build_wheel.sh) is only used for MacOS wheel/libtorch builds
 - [`.ci/manywheel/build.sh`](6cb6e8d790/.ci/manywheel/build.sh) are used for Linux wheel/libtorch builds
 - [`.ci/pytorch/windows/build_pytorch.bat`](6cb6e8d790/.ci/pytorch/windows/build_pytorch.bat) is used for Windows wheel builds

Fixes https://github.com/pytorch/pytorch/issues/142873
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143054
Approved by: https://github.com/Jack-Khuu, https://github.com/atalman
2024-12-12 03:05:46 +00:00
cb354f8b47 [PGNCCL] Move NCCLComm impl to cpp (#142826)
BE as titled. No behavior change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142826
Approved by: https://github.com/wconstab, https://github.com/c-p-i-o
2024-12-12 02:45:52 +00:00
06075d3d18 [Inductor][CPP] Fix Mask Dtype mismatch (#142103)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/141559. The `vec_mask` store data type isn't aligned when doing `bitwise_and`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142103
Approved by: https://github.com/jgong5
2024-12-12 01:21:32 +00:00
d68403df3b filelock: Make waitcounter variant to use (#139816)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139816
Approved by: https://github.com/ezyang
2024-12-12 01:18:34 +00:00
6cb6e8d790 Python 3.11, 3.12 Remove tests covered by 3.13 (#143078)
We do have linux-focal-py3_13-clang10-build and test. Hence removing linux-focal-py3_11-clang10-build/test and linux-focal-py3_12-clang10-build/test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143078
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-12-12 01:12:00 +00:00
1dd6f21029 Cuda 12.1 - Remove from trunk tests (#143076)
Remove cuda 12.1 from trunk tests. This is covered by 12.4 tests.
Move ``libtorch-linux-focal-cuda12_4-py3_7-gcc9-debug-build`` -> ``libtorch-linux-focal-cuda12_4-py3_10-gcc9-debug-build``

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143076
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-12-12 01:10:09 +00:00
bd7d81db9e Use validate-docker-images workflow from test-infra (#143081)
After PR: https://github.com/pytorch/test-infra/pull/6029 use validate-docker-images.yml from test-infra.
Related to: https://github.com/pytorch/builder/issues/2054

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143081
Approved by: https://github.com/huydhn
2024-12-12 00:24:27 +00:00
cyy
db81a3f31c [TorchGen] remove remove_non_owning_ref_types from valuetype_type (#142449)
It is not used
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142449
Approved by: https://github.com/ezyang
2024-12-12 00:15:44 +00:00
1b3f8b7589 Revert "[RELAND] Add UTs for accelerator device-agnostic runtime APIs (#133572)"
This reverts commit 209119424922b135fef39aba1f25da3b67f5879a.

Reverted https://github.com/pytorch/pytorch/pull/133572 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the new test is still very flaky on MacOS even when it does not segfault anymore ([comment](https://github.com/pytorch/pytorch/pull/133572#issuecomment-2537256522))
2024-12-11 21:47:18 +00:00
dfe5669076 Revert "[RELAND] Add device-agnostic runtime Device/Stream C++ API (#138677)"
This reverts commit 734bb01460d59e661e9114e7aa17e04821e4b57a.

Reverted https://github.com/pytorch/pytorch/pull/138677 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the new test is still very flaky on MacOS even when it does not segfault anymore ([comment](https://github.com/pytorch/pytorch/pull/133572#issuecomment-2537256522))
2024-12-11 21:47:17 +00:00
cd50bd8477 Revert "[BE][accelerator] formalize API name {current,set}_device_{idx => index} (#140542)"
This reverts commit fb02b40d27737213e0547dec0e30977dfc50f2f3.

Reverted https://github.com/pytorch/pytorch/pull/140542 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but I need to revert this in order to revert https://github.com/pytorch/pytorch/pull/133572#issuecomment-2537204202 due to a conflict ([comment](https://github.com/pytorch/pytorch/pull/140542#issuecomment-2537253665))
2024-12-11 21:44:23 +00:00
de313f1155 [foreach_map] Initial foreach map HOP impl for inference (#142098)
This is the initial foreach map HOP for pointwise ops which will be extended in the future to support grouped GEMMs and other ops.

This PR utilizes PrimHOPBase class to represent foreach_map as a HOP with a single subgraph. The way this is implemented is that the user API `foreach_map` provides a single pointwise torch op, and internally this function calls a polyfill which has the same semantics as a foreach op (ie iterates over lists of operands applying the op elementwise). The higher order op is passed through the stack down to inductor where a lowering in essence inlines the subgraph into the main graph. This is done by interpreting it with a pointwise subgraph lowering, grouping the outputs by device, and registering the output buffers as foreach groups as applicable. For testing I was able to reuse the existing foreach tests by creating a wrapper function which matches the foreach op interfaces for those tests and then run all of the existing foreach tests on foreach_map.
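
A minimal sketch of the polyfill semantics described above, written as plain Python rather than the actual HOP machinery; the helper name is made up.

```
import torch

# apply one pointwise op elementwise across lists of operands, like a foreach op would
def foreach_map_like(op, xs, *rest):
    return [op(x, *(r[i] for r in rest)) for i, x in enumerate(xs)]

xs = [torch.randn(3) for _ in range(4)]
ys = [torch.randn(3) for _ in range(4)]
out = foreach_map_like(torch.add, xs, ys)  # matches torch._foreach_add(xs, ys)
```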

TODO before landing:
* Add tests for general functions
* Test warning if unsupported op will block fusion

Followups:
* I need to add tests for backwards (this will be a follow-up PR because backwards will require other work as well)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142098
Approved by: https://github.com/eellison
2024-12-11 21:32:11 +00:00
bd199bc754 [EZ] Move slow job from CU12.1 to CU12.4 (#142856)
I thought it was migrated a while back

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142856
Approved by: https://github.com/huydhn, https://github.com/atalman, https://github.com/ZainRizvi
2024-12-11 21:12:35 +00:00
688f44824b DistributedDataParallel: add init_sync option to control collectives during initialization (#142824)
This controls whether or not we run collectives during the DDP init function, which makes it easier to use fault-tolerant ProcessGroup implementations whose ranks may not all start at the same time.

torchft uses a dummy process group and a comm hook to get around these checks. With this change torchft can use the normal ProcessGroup API via the stock comm hook.

https://github.com/pytorch-labs/torchft/blob/main/torchft/ddp.py#L50-L59
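
A minimal sketch of the intended usage (assuming the new keyword lands as `init_sync` on the DDP constructor, per this PR's title; everything else here is placeholder setup):

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes a process group has already been initialized, e.g. with
# torch.distributed.init_process_group(...), possibly a fault-tolerant implementation.
model = torch.nn.Linear(8, 8)
ddp_model = DDP(
    model,
    init_sync=False,  # skip the verification/broadcast collectives at construction time
)
```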

Test plan:

```
pytest test/distributed/test_c10d_pypg.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142824
Approved by: https://github.com/wconstab, https://github.com/fegin, https://github.com/H-Huang
2024-12-11 20:28:38 +00:00
fd65bd755d [BE] replace incorrect .. note:: invocations (#142868)
Something I've noticed is that a lot of the distributed sites don't render on our docs at all, but if they ever do, the notes will render properly now 😛

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142868
Approved by: https://github.com/albanD
2024-12-11 19:58:18 +00:00
0b96413dbf Upgrade expecttest to 0.3.0 (#142869)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142869
Approved by: https://github.com/albanD, https://github.com/malfet
2024-12-11 19:04:16 +00:00
cyy
e5f08c0cbf [TorchGen] Remove cpp_type_registration_declarations (#142452)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142452
Approved by: https://github.com/ezyang
2024-12-11 19:01:36 +00:00
cyy
e228381846 [TorchGen] Simplify argument_type_str (#142491)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142491
Approved by: https://github.com/ezyang
2024-12-11 19:01:20 +00:00
42d4eec5f3 Don't install lintrunner on S390 (#142876)
Not sure if there are many users of this platform, but hopefully this will fix https://github.com/pytorch/pytorch/issues/142872

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142876
Approved by: https://github.com/jeanschmidt
2024-12-11 18:54:12 +00:00
e647b6d590 Fix undesired specialization on slice after split. (#142372)
Fix: #141251

This PR adds a few static guard checks when decomposing and lowering the `slice` operation so that we avoid adding unnecessary guards, specifically when clamping the end values.

In summary, the changes are:

- `slice` dynamo decomposition: check `end >= sizes[dim]` statically; only if that can't be
  determined statically do we add a guard establishing whether clamping is needed.
- `evaluate_min` inductor `sizevar` function: check whether the comparison can be resolved
  statically before actually creating a new guard.

The latter had to be changed because `evaluate_min` (called by the `ir.SliceView` constructor)
would always try to create a guard based on the hint values. If both the `left` and `right`
hints held, it would default to a `left <= right` guard. By checking statically first, we
avoid that (a sketch of this pattern follows the repro below).

```python
N = 16

@torch.compile(backend="inductor", dynamic=False, fullgraph=True)
def fn(x):
    splits = torch.ops.aten.split.Tensor(x, N)
    first = splits[0]
    return torch.ops.aten.slice.Tensor(first, 0, 0, N)

x = torch.arange(N)
torch._dynamo.mark_dynamic(x, 0)

fn(x)
```
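
For intuition, a minimal sketch of the check-statically-before-guarding pattern (hypothetical helper names, not Inductor's actual API):

```python
# Hedged sketch of the pattern with hypothetical helpers: try to resolve the
# comparison symbolically first, and only guard as a last resort.
def evaluate_min(left, right, statically_known_leq, guard_leq):
    if statically_known_leq(left, right):   # provable without adding a guard
        return left
    if statically_known_leq(right, left):   # provable without adding a guard
        return right
    # Unknown statically: fall back to a hint-based guard.
    return left if guard_leq(left, right) else right
```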

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142372
Approved by: https://github.com/ezyang
2024-12-11 18:52:17 +00:00
0ddb33ba22 [ONNX] Avoid overwriting overlapped decomposed functions (#142831)
Fixes #141770

The decomposed functions in `torch.export.default_decompositions().items()` were being overwritten by `torch._decomp.decomposition_table`. From the `torch.onnx.export()` perspective, we should instead respect the decomposition table from `torch.export.default_decompositions().items()` and avoid overwriting it with `torch._decomp.decomposition_table`.
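
A rough sketch of the intended precedence (a hand-written illustration, not the exact code in this PR): build the combined table so that the export defaults are applied last and therefore never overwritten.

```python
import torch
from torch._decomp import decomposition_table

# Start from the generic ATen decomposition table, then apply the export
# defaults last so they take precedence and are never overwritten.
decomp_table = dict(decomposition_table)
decomp_table.update(dict(torch.export.default_decompositions().items()))
```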
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142831
Approved by: https://github.com/justinchuby
2024-12-11 18:47:40 +00:00
c632e29774 [hop][dynamo] support torch.SymInt inputs (#141524)
Fixes https://github.com/pytorch/pytorch/issues/141305.

```python
        class M(torch.nn.Module):
            def forward(self, x, y, z):
                a = y.shape[0]
                b = z.shape[0]

                def true_fn(x):
                    return x + a

                def false_fn(x):
                    return x + b * z

                # When exporting with non-strict: a and b are symints,
                # so torch.compile needs to wrap and trace symint inputs.
                return torch.cond(x.shape[0] > 5, true_fn, false_fn, (x,))
```

In non-strict export, when inputs are annotated with dynamic shapes, `a` and `b` in the above example are torch.SymInt, so true_fn and false_fn close over values of torch.SymInt type. The error is triggered because dynamo didn't handle SymInt inputs and ended up using a UserDefinedObjectVariable for them, which doesn't have a proxy. We added support by following how SymBool inputs were handled previously.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141524
Approved by: https://github.com/zou3519
ghstack dependencies: #142185
2024-12-11 18:46:58 +00:00
a8fa98ccef skip test dynamo for aot_dispatch tests on ci (#142185)
A lot of tests in test_aotdispatch.py are not meaningful (from a user's perspective) when run with dynamo, so we skip them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142185
Approved by: https://github.com/zou3519
2024-12-11 18:46:58 +00:00
cyy
24a5a2ef25 [14/N] Fix extra warnings brought by clang-tidy-17 (#141644)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141644
Approved by: https://github.com/ezyang
2024-12-11 18:40:42 +00:00
be27dbf2b8 Enable CPP/CUDAExtension with py_limited_api for python agnosticism (#138088)
This was getting tested with ao, but now there is a real test I added.

## What does this PR do?

We want to allow custom PyTorch extensions to be able to build one wheel for multiple Python versions, in other words, to achieve python agnosticism. It turns out setuptools/Python already provides such a way! Namely, if the user promises to use only the Python limited API in their extension, they can pass `py_limited_api` to their Extension class and to the bdist_wheel command (with a minimum python version) in order to build one wheel that will suffice across multiple Python versions.

Sounds lovely! Why don't people do that already with PyTorch? Well, two things. This workflow is hardly documented (even searching specifically for python agnostic does not reveal many answers), so I'd expect that people simply don't know about it. But even if they did, _PyTorch_ custom Extensions would still not work, because we always link torch_python, which does not abide by py_limited_api rules.

So this is where this PR comes in! We respect when the user specifies py_limited_api and skip linking torch_python under that condition, allowing users to enroll in the provided functionality I just described.
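
A minimal `setup.py` sketch of the workflow described above (the project name, source file, and minimum version tag are placeholders/assumptions; the point is passing `py_limited_api` through both the Extension and bdist_wheel):

```python
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CppExtension

setup(
    name="my_extension",                      # placeholder project name
    ext_modules=[
        CppExtension(
            "my_extension._C",
            ["csrc/extension.cpp"],           # placeholder source file
            py_limited_api=True,              # promise to use only the limited API
        ),
    ],
    cmdclass={"build_ext": BuildExtension},
    # Tag the wheel as abi3 so one artifact covers Python >= 3.9.
    options={"bdist_wheel": {"py_limited_api": "cp39"}},
)
```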

## How do I know this PR works?

I manually tested my silly little ultra_norm locally (with `import python_agnostic`) and wrote a test case for the extension showing that
- torch_python doesn't show up in the ldd tree
- no Py- symbols show up
It may be a little confusing that our test case is actually python-free (which is cleaner than python-agnostic), but it is sufficient (though not necessary) for showing that this change works.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138088
Approved by: https://github.com/ezyang, https://github.com/albanD
2024-12-11 18:22:55 +00:00
fb02b40d27 [BE][accelerator] formalize API name {current,set}_device_{idx => index} (#140542)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140542
Approved by: https://github.com/guangyey, https://github.com/albanD
2024-12-11 17:57:56 +00:00
cyy
82aaf64422 [3/N] Apply py39 ruff fixes (#142115)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142115
Approved by: https://github.com/ezyang
2024-12-11 17:50:10 +00:00
619 changed files with 14257 additions and 7694 deletions

View File

@ -9,7 +9,7 @@ install_ubuntu() {
# Instead use lib and headers from OpenSSL1.1 installed in `install_openssl.sh``
apt-get install -y cargo
echo "Checking out sccache repo"
git clone https://github.com/mozilla/sccache -b v0.8.2
git clone https://github.com/mozilla/sccache -b v0.9.0
cd sccache
echo "Building sccache"
cargo build --release

View File

@ -30,10 +30,10 @@ dill==0.3.7
#Pinned versions: 0.3.7
#test that import: dynamo/test_replay_record.py test_dataloader.py test_datapipe.py test_serialization.py
expecttest==0.2.1
expecttest==0.3.0
#Description: method for writing tests where test framework auto populates
# the expected output based on previous runs
#Pinned versions: 0.2.1
#Pinned versions: 0.3.0
#test that import:
fbscribelogger==0.1.7
@ -280,9 +280,9 @@ unittest-xml-reporting<=3.2.0,>=2.0.0
#test that import:
#lintrunner is supported on aarch64-linux only from 0.12.4 version
lintrunner==0.12.5
lintrunner==0.12.7
#Description: all about linters!
#Pinned versions: 0.12.5
#Pinned versions: 0.12.7
#test that import:
redis>=4.0.0

View File

@ -247,7 +247,7 @@ if [[ "$BUILD_ENVIRONMENT" != *rocm* && "$BUILD_ENVIRONMENT" != *s390x* && -d /v
fi
if [[ "$BUILD_ENVIRONMENT" == *-bazel-* ]]; then
set -e
set -e -o pipefail
get_bazel

View File

@ -3,7 +3,7 @@
# Common setup for all Jenkins scripts
# shellcheck source=./common_utils.sh
source "$(dirname "${BASH_SOURCE[0]}")/common_utils.sh"
set -ex
set -ex -o pipefail
# Required environment variables:
# $BUILD_ENVIRONMENT (should be set by your Docker image)

View File

@ -160,7 +160,7 @@ function install_torchvision() {
}
function install_tlparse() {
pip_install --user "tlparse==0.3.25"
pip_install --user "tlparse==0.3.30"
PATH="$(python -m site --user-base)/bin:$PATH"
}

View File

@ -40,7 +40,7 @@ echo "Building PyTorch C++ API docs..."
rm -rf cppdocs
git clone https://github.com/pytorch/cppdocs
set -ex
set -ex -o pipefail
# Generate ATen files
pushd "${pt_checkout}"

View File

@ -5,7 +5,7 @@ pt_checkout="/var/lib/jenkins/workspace"
source "$pt_checkout/.ci/pytorch/common_utils.sh"
echo "functorch_doc_push_script.sh: Invoked with $*"
set -ex
set -ex -o pipefail
version=${DOCS_VERSION:-nightly}
echo "version: $version"

View File

@ -6,7 +6,7 @@
# return the same thing, ex checks for for rocm, CUDA, and changing the path
# where sccache is installed, and not changing /etc/environment.
set -ex
set -ex -o pipefail
install_binary() {
echo "Downloading sccache binary from S3 repo"

View File

@ -7,7 +7,7 @@ source "$pt_checkout/.ci/pytorch/common_utils.sh"
echo "python_doc_push_script.sh: Invoked with $*"
set -ex
set -ex -o pipefail
# for statements like ${1:-${DOCS_INSTALL_PATH:-docs/}}
# the order of operations goes:
@ -63,7 +63,7 @@ build_docs () {
echo "(tried to echo the WARNINGS above the ==== line)"
echo =========================
fi
set -ex
set -ex -o pipefail
return $code
}

View File

@ -180,7 +180,7 @@ def smoke_test_cuda(
# torch.compile is available on macos-arm64 and Linux for python 3.8-3.13
if (
torch_compile_check == "enabled"
and sys.version_info < (3, 13, 0)
and sys.version_info < (3, 14, 0)
and target_os in ["linux", "linux-aarch64", "macos-arm64", "darwin"]
):
smoke_test_compile("cuda" if torch.cuda.is_available() else "cpu")

View File

@ -4,7 +4,7 @@
# (This is set by default in the Docker images we build, so you don't
# need to set it yourself.
set -ex
set -ex -o pipefail
# Suppress ANSI color escape sequences
export TERM=vt100
@ -313,6 +313,7 @@ test_dynamo_wrapped_shard() {
--exclude-jit-executor \
--exclude-distributed-tests \
--exclude-torch-export-tests \
--exclude-aot-dispatch-tests \
--shard "$1" "$NUM_TEST_SHARDS" \
--verbose \
--upload-artifacts-while-running
@ -1243,7 +1244,7 @@ EOF
}
test_bazel() {
set -e
set -e -o pipefail
# bazel test needs sccache setup.
# shellcheck source=./common-build.sh

View File

@ -38,7 +38,7 @@ if [[ $PYLONG_API_CHECK == 0 ]]; then
echo "PyLong_AsUnsignedLong -> THPUtils_unpackUInt32 / THPUtils_unpackUInt64"
exit 1
fi
set -ex
set -ex -o pipefail
"$SCRIPT_HELPERS_DIR"/build_pytorch.bat

View File

@ -1,5 +1,5 @@
#!/bin/bash
set -ex
set -ex -o pipefail
SCRIPT_PARENT_DIR=$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )
# shellcheck source=./common.sh
@ -41,7 +41,7 @@ python -m pip install pytest-rerunfailures==10.3 pytest-cpp==2.3.0 tensorboard==
python -m pip install z3-solver==4.12.2.0
# Install tlparse for test\dynamo\test_structured_trace.py UTs.
python -m pip install tlparse==0.3.25
python -m pip install tlparse==0.3.30
# Install parameterized
python -m pip install parameterized==0.8.1

View File

@ -173,8 +173,10 @@ conda create ${EXTRA_CONDA_INSTALL_FLAGS} -yn "$tmp_env_name" python="$desired_p
source activate "$tmp_env_name"
pip install -q "numpy=${NUMPY_PINNED_VERSION}" "pyyaml${PYYAML_PINNED_VERSION}" requests
retry conda install ${EXTRA_CONDA_INSTALL_FLAGS} -yq llvm-openmp=14.0.6 cmake ninja "setuptools${SETUPTOOLS_PINNED_VERSION}" typing_extensions
retry pip install -qr "${pytorch_rootdir}/requirements.txt" || true
# TODO : Remove me later (but in the interim, use Anaconda cmake, to find Anaconda installed OpenMP)
retry pip uninstall -y cmake
retry conda install ${EXTRA_CONDA_INSTALL_FLAGS} -yq llvm-openmp=14.0.6 cmake ninja "setuptools${SETUPTOOLS_PINNED_VERSION}" typing_extensions
# For USE_DISTRIBUTED=1 on macOS, need libuv and pkg-config to find libuv.
export USE_DISTRIBUTED=1

View File

@ -75,9 +75,8 @@ export PYTORCH_BUILD_NUMBER=1
TRITON_VERSION=$(cat $PYTORCH_ROOT/.ci/docker/triton_version.txt)
# Here PYTORCH_EXTRA_INSTALL_REQUIREMENTS is already set for the all the wheel builds hence append TRITON_CONSTRAINT
TRITON_CONSTRAINT="platform_system == 'Linux' and platform_machine == 'x86_64' and python_version < '3.13'"
TRITON_CONSTRAINT="platform_system == 'Linux' and platform_machine == 'x86_64'"
if [[ "$PACKAGE_TYPE" =~ .*wheel.* && -n "${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:-}" ]]; then
# Only linux Python < 3.13 are supported wheels for triton
TRITON_REQUIREMENT="triton==${TRITON_VERSION}; ${TRITON_CONSTRAINT}"
if [[ -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*dev.* ]]; then
TRITON_SHORTHASH=$(cut -c1-8 $PYTORCH_ROOT/.ci/docker/ci_commit_pins/triton.txt)

View File

@ -1,6 +1,6 @@
boto3==1.35.42
hypothesis==6.56.4
expecttest==0.2.1
expecttest==0.3.0
fbscribelogger==0.1.7
librosa>=0.6.2
mpmath==1.3.0

View File

@ -19,7 +19,7 @@ fi
# if lintrunner is not installed, install it
if ! command -v lintrunner &> /dev/null; then
python3 -m pip install lintrunner==0.12.5
python3 -m pip install lintrunner==0.12.7
fi
# This has already been cached in the docker image

View File

@ -206,21 +206,6 @@ jobs:
git clean -fxd
working-directory: pytorch
- name: Checkout pytorch/builder to builder dir
uses: malfet/checkout@silent-checkout
with:
ref: main
submodules: recursive
repository: pytorch/builder
path: builder
quiet-checkout: true
- name: Clean pytorch/builder checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: builder
- name: Check if the job is disabled
id: filter
uses: ./pytorch/.github/actions/filter-test-configs
@ -246,7 +231,6 @@ jobs:
mkdir -p artifacts/
container_name=$(docker run \
-e BINARY_ENV_FILE \
-e BUILDER_ROOT \
-e BUILD_ENVIRONMENT \
-e DESIRED_CUDA \
-e DESIRED_DEVTOOLSET \
@ -264,7 +248,6 @@ jobs:
--tty \
--detach \
-v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \
-v "${GITHUB_WORKSPACE}/builder:/builder" \
-v "${RUNNER_TEMP}/artifacts:/artifacts" \
-w / \
"${DOCKER_IMAGE}"
@ -272,10 +255,8 @@ jobs:
docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh"
if [[ ${BUILD_ENVIRONMENT} == *"aarch64"* ]]; then
docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash /pytorch/.ci/aarch64_linux/aarch64_ci_build.sh"
elif [[ ${{ inputs.PACKAGE_TYPE }} == "manywheel" || ${{ inputs.PACKAGE_TYPE }} == "libtorch" ]]; then
docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash /pytorch/.ci/${{ inputs.PACKAGE_TYPE }}/build.sh"
else
docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash /builder/${{ inputs.PACKAGE_TYPE }}/build.sh"
docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash /pytorch/.ci/${{ inputs.PACKAGE_TYPE }}/build.sh"
fi
- name: Chown artifacts

View File

@ -87,7 +87,7 @@ jobs:
runs-on: "${{ needs.get-label-type.outputs.label-type }}linux.9xlarge.ephemeral"
strategy:
matrix:
rocm_version: ["6.1", "6.2.4"]
rocm_version: ["6.2.4", "6.3"]
env:
GPU_ARCH_TYPE: rocm
GPU_ARCH_VERSION: ${{ matrix.rocm_version }}

View File

@ -178,7 +178,7 @@ jobs:
runs-on: "${{ needs.get-label-type.outputs.label-type }}linux.9xlarge.ephemeral"
strategy:
matrix:
rocm_version: ["6.1", "6.2.4"]
rocm_version: ["6.2.4", "6.3"]
env:
GPU_ARCH_TYPE: rocm-manylinux_2_28
GPU_ARCH_VERSION: ${{ matrix.rocm_version }}

View File

@ -44,7 +44,7 @@ jobs:
strategy:
fail-fast: false
matrix:
py_vers: [ "3.9", "3.10", "3.11", "3.12", "3.13" ]
py_vers: [ "3.9", "3.10", "3.11", "3.12", "3.13", "3.13t" ]
device: ["cuda", "rocm", "xpu"]
docker-image: ["pytorch/manylinux-builder:cpu", "pytorch/manylinux2_28-builder:cpu"]
exclude:
@ -114,6 +114,9 @@ jobs:
3.13)
PYTHON_EXECUTABLE=/opt/python/cp313-cp313/bin/python
;;
3.13t)
PYTHON_EXECUTABLE=/opt/python/cp313-cp313t/bin/python
;;
*)
echo "Unsupported python version ${PY_VERS}"
exit 1

View File

@ -165,7 +165,7 @@ jobs:
validate:
needs: build
uses: pytorch/builder/.github/workflows/validate-docker-images.yml@main
uses: pytorch/test-infra/.github/workflows/validate-docker-images.yml@main
with:
channel: nightly
ref: main

View File

@ -207,8 +207,8 @@ jobs:
conda activate "${CONDA_ENV}"
# Test tools
PYTHONPATH=$(pwd) pytest tools/test/test_*.py
PYTHONPATH=$(pwd) pytest .github/scripts/test_*.py
PYTHONPATH=$(pwd) pytest tools/test -o "python_files=test*.py"
PYTHONPATH=$(pwd) pytest .github/scripts -o "python_files=test*.py"
test_run_test:
name: Test `run_test.py` is usable without boto3
@ -229,7 +229,7 @@ jobs:
- name: Install dependencies
run: |
python3 -m pip install --upgrade pip
pip install pytest-rerunfailures==11.1.* pytest-flakefinder==1.1.* pytest-xdist==3.3.* expecttest==0.2.* fbscribelogger==0.1.* numpy==1.24.*
pip install pytest-rerunfailures==11.1.* pytest-flakefinder==1.1.* pytest-xdist==3.3.* expecttest==0.3.* fbscribelogger==0.1.* numpy==1.24.*
pip install torch --pre --index-url https://download.pytorch.org/whl/nightly/cpu/
- name: Run run_test.py (nonretryable)
run: |

View File

@ -41,6 +41,9 @@ jobs:
{ config: "default", shard: 2, num_shards: 4, runner: "linux.arm64.2xlarge" },
{ config: "default", shard: 3, num_shards: 4, runner: "linux.arm64.2xlarge" },
{ config: "default", shard: 4, num_shards: 4, runner: "linux.arm64.2xlarge" },
{ config: "default", shard: 1, num_shards: 3, runner: "linux.arm64.m7g.4xlarge" },
{ config: "default", shard: 2, num_shards: 3, runner: "linux.arm64.m7g.4xlarge" },
{ config: "default", shard: 3, num_shards: 3, runner: "linux.arm64.m7g.4xlarge" },
]}
secrets: inherit

View File

@ -214,73 +214,6 @@ jobs:
test-matrix: ${{ needs.linux-focal-py3_9-clang10-build.outputs.test-matrix }}
secrets: inherit
linux-focal-py3_11-clang10-build:
name: linux-focal-py3.11-clang10
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-focal-py3.11-clang10
docker-image-name: pytorch-linux-focal-py3.11-clang10
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "default", shard: 2, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "default", shard: 3, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "default", shard: 4, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "default", shard: 5, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "crossref", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "crossref", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "dynamo_wrapped", shard: 1, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "dynamo_wrapped", shard: 2, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "dynamo_wrapped", shard: 3, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
]}
secrets: inherit
linux-focal-py3_11-clang10-test:
name: linux-focal-py3.11-clang10
uses: ./.github/workflows/_linux-test.yml
needs:
- linux-focal-py3_11-clang10-build
- target-determination
with:
build-environment: linux-focal-py3.11-clang10
docker-image: ${{ needs.linux-focal-py3_11-clang10-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-py3_11-clang10-build.outputs.test-matrix }}
secrets: inherit
linux-focal-py3_12-clang10-build:
name: linux-focal-py3.12-clang10
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-focal-py3.12-clang10
docker-image-name: pytorch-linux-focal-py3.12-clang10
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "default", shard: 2, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "default", shard: 3, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "default", shard: 4, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "default", shard: 5, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "dynamo_wrapped", shard: 1, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "dynamo_wrapped", shard: 2, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "dynamo_wrapped", shard: 3, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
]}
secrets: inherit
linux-focal-py3_12-clang10-test:
name: linux-focal-py3.12-clang10
uses: ./.github/workflows/_linux-test.yml
needs: linux-focal-py3_12-clang10-build
with:
build-environment: linux-focal-py3.12-clang10
docker-image: ${{ needs.linux-focal-py3_12-clang10-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-py3_12-clang10-build.outputs.test-matrix }}
timeout-minutes: 600
secrets: inherit
linux-focal-py3_13-clang10-build:
name: linux-focal-py3.13-clang10
uses: ./.github/workflows/_linux-build.yml
@ -296,6 +229,8 @@ jobs:
{ config: "default", shard: 3, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "default", shard: 4, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "default", shard: 5, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "crossref", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "crossref", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "dynamo_wrapped", shard: 1, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "dynamo_wrapped", shard: 2, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "dynamo_wrapped", shard: 3, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },

View File

@ -47,14 +47,14 @@ jobs:
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
linux-focal-cuda12_1-py3_10-gcc9-sm86-build:
name: linux-focal-cuda12.1-py3.10-gcc9-sm86
linux-focal-cuda12_4-py3_10-gcc9-sm86-build:
name: linux-focal-cuda12.4-py3.10-gcc9-sm86
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-focal-cuda12.1-py3.10-gcc9-sm86
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9
build-environment: linux-focal-cuda12.4-py3.10-gcc9-sm86
docker-image-name: pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9
cuda-arch-list: 8.6
test-matrix: |
{ include: [
@ -64,16 +64,16 @@ jobs:
]}
secrets: inherit
linux-focal-cuda12_1-py3_10-gcc9-sm86-test:
name: linux-focal-cuda12.1-py3.10-gcc9-sm86
linux-focal-cuda12_4-py3_10-gcc9-sm86-test:
name: linux-focal-cuda12.4-py3.10-gcc9-sm86
uses: ./.github/workflows/_linux-test.yml
needs:
- linux-focal-cuda12_1-py3_10-gcc9-sm86-build
- linux-focal-cuda12_4-py3_10-gcc9-sm86-build
- target-determination
with:
build-environment: linux-focal-cuda12.1-py3.10-gcc9-sm86
docker-image: ${{ needs.linux-focal-cuda12_1-py3_10-gcc9-sm86-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_1-py3_10-gcc9-sm86-build.outputs.test-matrix }}
build-environment: linux-focal-cuda12.4-py3.10-gcc9-sm86
docker-image: ${{ needs.linux-focal-cuda12_4-py3_10-gcc9-sm86-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_4-py3_10-gcc9-sm86-build.outputs.test-matrix }}
secrets: inherit
linux-focal-py3_9-clang10-build:

.github/workflows/test-check-binary.yml (new file, 40 lines)
View File

@ -0,0 +1,40 @@
name: Test check_binary
on:
pull_request:
paths:
- .github/workflows/test-check-binary.yml
- .ci/pytorch/check_binary.sh
- .ci/pytorch//smoke_test/smoke_test.py
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true
jobs:
check_binary_linux_cpu:
if: github.repository_owner == 'pytorch'
name: Test check_binary.sh for Linux CPU
uses: pytorch/test-infra/.github/workflows/linux_job.yml@main
with:
docker-image: python:3.11
docker-build-dir: "skip-docker-build"
script: |
pushd .ci/pytorch/
pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cpu
DESIRED_PYTHON=3.11 DESIRED_CUDA=cpu PACKAGE_TYPE=manywheel ./check_binary.sh
popd
check_binary_linux_cuda:
if: github.repository_owner == 'pytorch'
name: Test check_binary.sh for Linux CUDA
uses: pytorch/test-infra/.github/workflows/linux_job.yml@main
with:
runner: linux.4xlarge.nvidia.gpu
docker-image: python:3.11
docker-build-dir: "skip-docker-build"
script: |
pushd .ci/pytorch/
pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu124
DESIRED_PYTHON=3.11 DESIRED_CUDA=cu124 PACKAGE_TYPE=manywheel ./check_binary.sh
popd

View File

@ -45,43 +45,12 @@ jobs:
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
libtorch-linux-focal-cuda12_1-py3_7-gcc9-debug-build:
name: libtorch-linux-focal-cuda12.1-py3.7-gcc9-debug
libtorch-linux-focal-cuda12_4-py3_10-gcc9-debug-build:
name: libtorch-linux-focal-cuda12.4-py3.10-gcc9-debug
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
build-environment: libtorch-linux-focal-cuda12.1-py3.7-gcc9
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9
build-generates-artifacts: false
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runner: "linux.4xlarge"
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 1 },
]}
secrets: inherit
# no-ops builds test USE_PER_OPERATOR_HEADERS=0 where ATen/ops is not generated
linux-focal-cuda12_1-py3_10-gcc9-no-ops-build:
name: linux-focal-cuda12.1-py3.10-gcc9-no-ops
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-focal-cuda12.1-py3.10-gcc9-no-ops
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 1 },
]}
secrets: inherit
libtorch-linux-focal-cuda12_4-py3_7-gcc9-debug-build:
name: libtorch-linux-focal-cuda12.4-py3.7-gcc9-debug
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
build-environment: libtorch-linux-focal-cuda12.4-py3.7-gcc9
build-environment: libtorch-linux-focal-cuda12.4-py3.10-gcc9
docker-image-name: pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9
build-generates-artifacts: false
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"

View File

@ -73,6 +73,8 @@ include_patterns = [
'aten/src/ATen/native/cudnn/*.cpp',
'aten/src/ATen/native/mkldnn/xpu/**/*.h',
'aten/src/ATen/native/mkldnn/xpu/**/*.cpp',
'aten/src/ATen/native/Tensor*.h',
'aten/src/ATen/native/Tensor*.cpp',
'c10/**/*.h',
'c10/**/*.cpp',
'torch/csrc/**/*.h',
@ -143,7 +145,7 @@ init_command = [
'--dry-run={{DRYRUN}}',
'numpy==1.26.4 ; python_version >= "3.9" and python_version <= "3.11"',
'numpy==2.1.0 ; python_version >= "3.12"',
'expecttest==0.2.1',
'expecttest==0.3.0',
'mypy==1.13.0',
'sympy==1.13.0 ; python_version >= "3.9"',
'types-requests==2.27.25',
@ -1711,7 +1713,7 @@ command = [
'@{{PATHSFILE}}'
]
include_patterns = [
'torch/**/does-not-exist.py'
"torch/_inductor/**/*.py",
]
is_formatter = true

View File

@ -997,8 +997,6 @@ if(NOT MSVC)
append_cxx_flag_if_supported("-Wnarrowing" CMAKE_CXX_FLAGS)
append_cxx_flag_if_supported("-Wno-missing-field-initializers"
CMAKE_CXX_FLAGS)
append_cxx_flag_if_supported("-Wno-type-limits" CMAKE_CXX_FLAGS)
append_cxx_flag_if_supported("-Wno-array-bounds" CMAKE_CXX_FLAGS)
append_cxx_flag_if_supported("-Wno-unknown-pragmas" CMAKE_CXX_FLAGS)
append_cxx_flag_if_supported("-Wno-unused-parameter" CMAKE_CXX_FLAGS)
append_cxx_flag_if_supported("-Wno-strict-overflow" CMAKE_CXX_FLAGS)
@ -1076,7 +1074,6 @@ if(NOT MSVC)
set(WERROR FALSE)
endif()
endif()
append_cxx_flag_if_supported("-Wno-unused-but-set-variable" CMAKE_CXX_FLAGS)
append_cxx_flag_if_supported("-Wno-maybe-uninitialized" CMAKE_CXX_FLAGS)
append_cxx_flag_if_supported("-fstandalone-debug" CMAKE_CXX_FLAGS_DEBUG)
if(CMAKE_SYSTEM_PROCESSOR MATCHES "aarch64" AND CMAKE_CXX_COMPILER_ID MATCHES "GNU")
@ -1093,6 +1090,7 @@ if(NOT MSVC)
append_cxx_flag_if_supported("-fno-trapping-math" CMAKE_CXX_FLAGS)
append_cxx_flag_if_supported("-Werror=format" CMAKE_CXX_FLAGS)
if(CMAKE_COMPILER_IS_GNUCXX AND CMAKE_CXX_COMPILER_VERSION VERSION_GREATER_EQUAL 13)
append_cxx_flag_if_supported("-Wno-dangling-reference" CMAKE_CXX_FLAGS)
append_cxx_flag_if_supported("-Wno-error=dangling-reference" CMAKE_CXX_FLAGS)
append_cxx_flag_if_supported("-Wno-error=redundant-move" CMAKE_CXX_FLAGS)
endif()

View File

@ -1,8 +1,6 @@
#include <ATen/Context.h>
#include <ATen/DeviceAccelerator.h>
#include <c10/core/impl/VirtualGuardImpl.h>
namespace at::accelerator {
namespace at {
std::optional<c10::DeviceType> getAccelerator(bool checked) {
#define DETECT_AND_ASSIGN_ACCELERATOR(device_name) \
@ -39,8 +37,8 @@ std::optional<c10::DeviceType> getAccelerator(bool checked) {
#undef DETECT_AND_ASSIGN_ACCELERATOR
}
bool isAccelerator(c10::DeviceType device_type) {
switch (device_type) {
bool isAccelerator(c10::DeviceType d) {
switch (d) {
case at::kCUDA:
case at::kMTIA:
case at::kXPU:
@ -54,50 +52,4 @@ bool isAccelerator(c10::DeviceType device_type) {
}
}
c10::DeviceIndex deviceCount() {
const auto device_type = getAccelerator(false);
if (!device_type.has_value()) {
return static_cast<c10::DeviceIndex>(0);
}
c10::impl::VirtualGuardImpl impl(device_type.value());
return static_cast<c10::DeviceIndex>(impl.deviceCount());
}
void setDeviceIndex(c10::DeviceIndex device_index) {
const auto device_type = getAccelerator(true).value();
c10::impl::VirtualGuardImpl impl(device_type);
impl.setDevice({device_type, device_index});
}
c10::DeviceIndex getDeviceIndex() {
const auto device_type = getAccelerator(true).value();
c10::impl::VirtualGuardImpl impl(device_type);
return static_cast<c10::DeviceIndex>(impl.getDevice().index());
}
void setCurrentStream(c10::Stream stream) {
const auto device_type = getAccelerator(true).value();
TORCH_CHECK(
device_type == stream.device_type(),
"stream's device type ",
c10::DeviceTypeName(stream.device_type()),
" doesn't match the current accelerator ",
c10::DeviceTypeName(device_type));
c10::impl::VirtualGuardImpl impl(device_type);
impl.exchangeStream(stream);
}
c10::Stream getCurrentStream(c10::DeviceIndex device_index) {
const auto device_type = getAccelerator(true).value();
c10::impl::VirtualGuardImpl impl(device_type);
return impl.getStream({device_type, device_index});
}
void synchronizeDevice(c10::DeviceIndex device_index) {
const auto device_type = getAccelerator(true).value();
c10::impl::VirtualGuardImpl impl(device_type);
// impl.synchronizeDevice should can be safely called from any device
impl.synchronizeDevice(device_index);
}
} // namespace at::accelerator
} // namespace at

View File

@ -6,8 +6,6 @@
#include <ATen/detail/MTIAHooksInterface.h>
#include <optional>
namespace at::accelerator {
// This file defines the top level Accelerator concept for PyTorch.
// A device is an accelerator per the definition here if:
// - It is mutually exclusive with all other accelerators
@ -17,39 +15,13 @@ namespace at::accelerator {
// As of today, accelerator devices are (in no particular order):
// CUDA, MTIA, XPU, HIP, MPS, PrivateUse1
namespace at {
// Ensures that only one accelerator is available (at
// compile time if possible) and return it.
// When checked is true, the returned optional always has a value.
TORCH_API std::optional<c10::DeviceType> getAccelerator(bool checked = false);
// Check if the given device type is an accelerator.
TORCH_API bool isAccelerator(c10::DeviceType device_type);
TORCH_API bool isAccelerator(c10::DeviceType d);
// Return the number of the device available. Note that this is *REQUIRED* to
// not raise any exception.
TORCH_API c10::DeviceIndex deviceCount();
// Set the current device index to the given device index.
TORCH_API void setDeviceIndex(c10::DeviceIndex device_index);
// Get the current device index.
TORCH_API c10::DeviceIndex getDeviceIndex();
// Set the current stream to a given stream. Note that this API doesn't change
// the current device index.
TORCH_API void setCurrentStream(c10::Stream stream);
// Get the current stream of the given device index.
TORCH_API c10::Stream getCurrentStream(c10::DeviceIndex device_index);
// Wait (by blocking the calling thread) until all the work previously enqueued
// on the given device index has been completed.
TORCH_API void synchronizeDevice(c10::DeviceIndex device_index);
} // namespace at::accelerator
namespace at {
// Keep BC only
using at::accelerator::getAccelerator;
using at::accelerator::isAccelerator;
} // namespace at

View File

@ -92,8 +92,8 @@ class MatrixRef {
/// The declaration here is extra complicated so that "arrayRef = {}"
/// continues to select the move assignment operator.
template <typename U>
// NOLINTNEXTLINE(cppcoreguidelines-missing-std-forward)
std::enable_if_t<std::is_same_v<U, T>, MatrixRef<T>>& operator=(
// NOLINTNEXTLINE(cppcoreguidelines-missing-std-forward)
U&& Temporary) = delete;
/// Disallow accidental assignment from a temporary.

View File

@ -1129,6 +1129,7 @@ TEST(ListTest, canAccessOptionalStringByReference) {
EXPECT_EQ("two", str1);
EXPECT_FALSE(str2.has_value());
EXPECT_TRUE(strRef1.has_value());
// NOLINTNEXTLINE(bugprone-unchecked-optional-access)
EXPECT_EQ("two", strRef1.value().get());
EXPECT_FALSE(strRef2.has_value());
}

View File

@ -66,7 +66,7 @@ struct TORCH_API EnumType : public NamedType {
}
const QualifiedName& qualifiedClassName() const {
// NOLINTLEXTLINE(bugprone-unchecked-optional-access)
// NOLINTNEXTLINE(bugprone-unchecked-optional-access)
return name().value();
}

View File

@ -95,7 +95,7 @@ struct TORCH_API Argument {
const TypePtr& real_type() const {
return real_type_;
}
std::optional<int32_t> N() const {
const std::optional<int32_t>& N() const {
return N_;
}
const std::optional<IValue>& default_value() const {
@ -651,11 +651,11 @@ template<>
hash = c10::hash_combine(hash, type_hash);
hash = c10::hash_combine(hash, kwarg_only_hash);
// hashing optional fields if they exist
if (arg.default_value()) {
auto default_value_hash = c10::hash<c10::IValue>{}(arg.default_value().value());
if (arg.default_value().has_value()) {
auto default_value_hash = c10::hash<c10::IValue>{}(*arg.default_value());
hash = c10::hash_combine(hash, default_value_hash);
}
if (arg.N()) {
if (arg.N().has_value()) {
auto N_hash = std::hash<int64_t>{}(*arg.N());
hash = c10::hash_combine(hash, N_hash);
}

View File

@ -1546,11 +1546,11 @@ struct WeakOrStrongCompilationUnit {
}
bool holdingStrongRef() const {
return strong_ptr_ != std::nullopt;
return strong_ptr_.has_value();
}
bool holdingEmptyStrongRef() const {
return holdingStrongRef() && *strong_ptr_ == nullptr;
return strong_ptr_ == nullptr;
}
std::optional<std::shared_ptr<torch::jit::CompilationUnit>> strong_ptr_;

View File

@ -625,13 +625,13 @@ struct TORCH_API TensorType : public SharedType {
return strides_;
}
std::optional<at::Device> device() const {
const std::optional<at::Device>& device() const {
return device_;
}
std::optional<at::ScalarType> scalarType() const {
const std::optional<at::ScalarType>& scalarType() const {
return scalar_type_;
}
std::optional<bool> requiresGrad() const {
const std::optional<bool>& requiresGrad() const {
return requires_grad_;
}
bool requires_grad() const override {
@ -656,7 +656,7 @@ struct TORCH_API TensorType : public SharedType {
const auto& shape = sizes();
for (size_t i = 0; i < shape.size(); i++) {
if (!shape[i]) {
if (!shape[i].has_value()) {
return std::optional<size_t>{};
}
prod *= shape[i].value();

View File

@ -292,7 +292,7 @@ TensorTypePtr TensorType::create(
scalar_type, device, symbol_sizes, sprops, requires_grad, undefined);
} else {
// strides are all null, but still have number of strides equal to number of ranks
TORCH_INTERNAL_ASSERT(sizes.sizes() && sizes.size());
TORCH_INTERNAL_ASSERT(sizes.sizes().has_value() && sizes.size());
auto symbol_sizes = SymbolicShape(*sizes.sizes());
return TensorType::create(
scalar_type, device, symbol_sizes, VaryingShape<Stride>(*sizes.size()), requires_grad, undefined);

View File

@ -61,8 +61,8 @@ std::ostream& operator<<(std::ostream & out, const Type & t) {
} else {
out << "Tensor";
}
if (auto ndim = value->sizes().size()) {
bool has_valid_strides_info = *ndim > 0 &&
if (auto ndim = value->sizes().size(); ndim.has_value()) {
bool has_valid_strides_info = ndim > 0 &&
value->strides().isComplete() && value->strides().size() == ndim;
out << "(";
@ -87,7 +87,7 @@ std::ostream& operator<<(std::ostream & out, const Type & t) {
if (i > 0) {
out << ", ";
}
out << *value->strides()[i];
out << value->strides()[i].value();
}
out << "]";
}
@ -903,7 +903,7 @@ bool ListType::isSubtypeOfExt(const Type& rhs_, std::ostream* why_not) const {
std::string TupleType::str() const {
std::stringstream ss;
if (schema_ && name()) {
if (schema_ && name().has_value()) {
ss << name()->qualifiedName();
} else {
ss << "(";

View File

@ -106,6 +106,7 @@ static hipblasStatus_t rocBLASStatusToHIPStatus(rocblas_status error)
namespace {
static cublasOperation_t _cublasOpFromChar(char op) {
// NOLINTNEXTLINE(bugprone-switch-missing-default-case)
switch (op) {
case 'n':
case 'N':

View File

@ -466,6 +466,6 @@ void CUDAHooks::deviceSynchronize(DeviceIndex device_index) const {
using at::CUDAHooksRegistry;
using at::RegistererCUDAHooksRegistry;
REGISTER_CUDA_HOOKS(CUDAHooks);
REGISTER_CUDA_HOOKS(CUDAHooks)
} // namespace at::cuda::detail

View File

@ -127,8 +127,8 @@ RETTYPE NAME(ARG1 a1, ARG2 a2, ARG3 a3, ARG4 a4) {
#define NVRTC_STUB2(NAME, A1, A2) _STUB_2(NVRTC, NAME, nvrtcResult, A1, A2)
#define NVRTC_STUB3(NAME, A1, A2, A3) _STUB_3(NVRTC, NAME, nvrtcResult, A1, A2, A3)
NVRTC_STUB2(nvrtcVersion, int*, int*);
NVRTC_STUB2(nvrtcAddNameExpression, nvrtcProgram, const char * const);
NVRTC_STUB2(nvrtcVersion, int*, int*)
NVRTC_STUB2(nvrtcAddNameExpression, nvrtcProgram, const char * const)
nvrtcResult nvrtcCreateProgram(nvrtcProgram *prog,
const char *src,
@ -143,32 +143,32 @@ nvrtcResult nvrtcCreateProgram(nvrtcProgram *prog,
return fn(prog, src, name, numHeaders, headers, includeNames);
}
NVRTC_STUB1(nvrtcDestroyProgram, nvrtcProgram *);
NVRTC_STUB2(nvrtcGetPTXSize, nvrtcProgram, size_t *);
NVRTC_STUB2(nvrtcGetPTX, nvrtcProgram, char *);
NVRTC_STUB1(nvrtcDestroyProgram, nvrtcProgram *)
NVRTC_STUB2(nvrtcGetPTXSize, nvrtcProgram, size_t *)
NVRTC_STUB2(nvrtcGetPTX, nvrtcProgram, char *)
#if defined(CUDA_VERSION) && CUDA_VERSION >= 11010
NVRTC_STUB2(nvrtcGetCUBINSize, nvrtcProgram, size_t *);
NVRTC_STUB2(nvrtcGetCUBIN, nvrtcProgram, char *);
NVRTC_STUB2(nvrtcGetCUBINSize, nvrtcProgram, size_t *)
NVRTC_STUB2(nvrtcGetCUBIN, nvrtcProgram, char *)
#endif
NVRTC_STUB3(nvrtcCompileProgram, nvrtcProgram, int, const char * const *);
_STUB_1(NVRTC, nvrtcGetErrorString, const char *, nvrtcResult);
NVRTC_STUB2(nvrtcGetProgramLogSize,nvrtcProgram, size_t*);
NVRTC_STUB2(nvrtcGetProgramLog, nvrtcProgram, char *);
NVRTC_STUB3(nvrtcGetLoweredName, nvrtcProgram, const char *, const char **);
NVRTC_STUB3(nvrtcCompileProgram, nvrtcProgram, int, const char * const *)
_STUB_1(NVRTC, nvrtcGetErrorString, const char *, nvrtcResult)
NVRTC_STUB2(nvrtcGetProgramLogSize,nvrtcProgram, size_t*)
NVRTC_STUB2(nvrtcGetProgramLog, nvrtcProgram, char *)
NVRTC_STUB3(nvrtcGetLoweredName, nvrtcProgram, const char *, const char **)
CUDA_STUB2(cuModuleLoadData, CUmodule *, const void *);
CUDA_STUB3(cuModuleGetFunction, CUfunction *, CUmodule, const char *);
CUDA_STUB4(cuOccupancyMaxActiveBlocksPerMultiprocessor, int *, CUfunction, int, size_t);
CUDA_STUB2(cuGetErrorString, CUresult, const char **);
CUDA_STUB1(cuCtxGetCurrent, CUcontext *);
CUDA_STUB1(cuCtxSetCurrent, CUcontext);
CUDA_STUB1(cuModuleUnload, CUmodule);
CUDA_STUB3(cuDevicePrimaryCtxGetState, CUdevice, unsigned int *, int *);
CUDA_STUB2(cuDevicePrimaryCtxRetain, CUcontext *, CUdevice);
CUDA_STUB4(cuLinkCreate, unsigned int, CUjit_option *, void **, CUlinkState *);
CUDA_STUB3(cuLinkComplete, CUlinkState, void **, size_t *);
CUDA_STUB3(cuFuncSetAttribute, CUfunction, CUfunction_attribute, int);
CUDA_STUB3(cuFuncGetAttribute, int*, CUfunction_attribute, CUfunction);
CUDA_STUB2(cuModuleLoadData, CUmodule *, const void *)
CUDA_STUB3(cuModuleGetFunction, CUfunction *, CUmodule, const char *)
CUDA_STUB4(cuOccupancyMaxActiveBlocksPerMultiprocessor, int *, CUfunction, int, size_t)
CUDA_STUB2(cuGetErrorString, CUresult, const char **)
CUDA_STUB1(cuCtxGetCurrent, CUcontext *)
CUDA_STUB1(cuCtxSetCurrent, CUcontext)
CUDA_STUB1(cuModuleUnload, CUmodule)
CUDA_STUB3(cuDevicePrimaryCtxGetState, CUdevice, unsigned int *, int *)
CUDA_STUB2(cuDevicePrimaryCtxRetain, CUcontext *, CUdevice)
CUDA_STUB4(cuLinkCreate, unsigned int, CUjit_option *, void **, CUlinkState *)
CUDA_STUB3(cuLinkComplete, CUlinkState, void **, size_t *)
CUDA_STUB3(cuFuncSetAttribute, CUfunction, CUfunction_attribute, int)
CUDA_STUB3(cuFuncGetAttribute, int*, CUfunction_attribute, CUfunction)
#if defined(CUDA_VERSION) && CUDA_VERSION >= 12000
CUresult CUDAAPI

View File

@ -13,6 +13,7 @@
#include <ATen/cuda/tunable/Tunable.h>
#include <c10/util/Exception.h>
#include <c10/util/StringUtil.h>
#include <c10/util/env.h>
#include <torch/version.h>
#ifndef _WIN32
@ -435,8 +436,8 @@ void TuningContext::EnableTunableOp(bool value) {
}
bool TuningContext::IsTunableOpEnabled() const {
static const char *env = std::getenv("PYTORCH_TUNABLEOP_ENABLED");
if (env != nullptr && strcmp(env, "1") == 0) {
static const bool eval = c10::utils::get_env("PYTORCH_TUNABLEOP_ENABLED") == "1";
if (eval) {
return true;
}
return enable_;
@ -462,16 +463,16 @@ void TuningContext::EnableRecordUntuned(bool value) {
}
bool TuningContext::IsTuningEnabled() const {
static const char *env = std::getenv("PYTORCH_TUNABLEOP_TUNING");
if (env != nullptr && strcmp(env, "0") == 0) {
static const bool eval = c10::utils::get_env("PYTORCH_TUNABLEOP_TUNING") == "0";
if (eval) {
return false;
}
return tuning_enable_;
}
bool TuningContext::IsRecordUntunedEnabled() const {
static const char *env = std::getenv("PYTORCH_TUNABLEOP_RECORD_UNTUNED");
if (env != nullptr && strcmp(env, "1") == 0) {
static const bool eval = c10::utils::get_env("PYTORCH_TUNABLEOP_RECORD_UNTUNED") == "1";
if (eval) {
return true;
}
return record_untuned_enable_;
@ -479,8 +480,8 @@ bool TuningContext::IsRecordUntunedEnabled() const {
std::ofstream& TuningContext::GetUntunedFile(){
if (!untuned_file_.is_open()) {
const char *env = std::getenv("PYTORCH_TUNABLEOP_UNTUNED_FILENAME");
std::string filename = (env == nullptr) ? "tunableop_untuned.csv" : env;
const auto env = c10::utils::get_env("PYTORCH_TUNABLEOP_UNTUNED_FILENAME");
std::string filename = (!env.has_value()) ? "tunableop_untuned.csv" : env.value();
std::string device = c10::str(int(c10::cuda::current_device()));
std::size_t found = filename.rfind('.');
@ -517,9 +518,9 @@ void TuningContext::SetMaxTuningDurationMs(int max_duration_ms) {
}
int TuningContext::GetMaxTuningDurationMs() const {
static const char *env = std::getenv("PYTORCH_TUNABLEOP_MAX_TUNING_DURATION_MS");
if (env != nullptr) {
int val = atoi(env);
static const auto env = c10::utils::get_env("PYTORCH_TUNABLEOP_MAX_TUNING_DURATION_MS");
if (env.has_value()) {
int val = stoi(env.value());
return val < 0 ? 0 : val;
}
return max_tuning_duration_ms_;
@ -530,9 +531,9 @@ void TuningContext::SetMaxTuningIterations(int max_iter) {
}
int TuningContext::GetMaxTuningIterations() const {
static const char *env = std::getenv("PYTORCH_TUNABLEOP_MAX_TUNING_ITERATIONS");
if (env != nullptr) {
int val = atoi(env);
static const auto env = c10::utils::get_env("PYTORCH_TUNABLEOP_MAX_TUNING_ITERATIONS");
if (env.has_value()) {
int val = stoi(env.value());
return val < 0 ? 0 : val;
}
return max_tuning_iterations_;
@ -543,9 +544,9 @@ void TuningContext::SetMaxWarmupDurationMs(int max_duration_ms) {
}
int TuningContext::GetMaxWarmupDurationMs() const {
static const char *env = std::getenv("PYTORCH_TUNABLEOP_MAX_WARMUP_DURATION_MS");
if (env != nullptr) {
int val = atoi(env);
static const auto env = c10::utils::get_env("PYTORCH_TUNABLEOP_MAX_WARMUP_DURATION_MS");
if (env.has_value()) {
int val = stoi(env.value());
return val < 0 ? 0 : val;
}
return max_warmup_duration_ms_;
@ -556,9 +557,9 @@ void TuningContext::SetMaxWarmupIterations(int max_iter) {
}
int TuningContext::GetMaxWarmupIterations() const {
static const char *env = std::getenv("PYTORCH_TUNABLEOP_MAX_WARMUP_ITERATIONS");
if (env != nullptr) {
int val = atoi(env);
static const auto env = c10::utils::get_env("PYTORCH_TUNABLEOP_MAX_WARMUP_ITERATIONS");
if (env.has_value()) {
int val = stoi(env.value());
return val < 0 ? 0 : val;
}
return max_warmup_iterations_;
@ -569,8 +570,8 @@ void TuningContext::EnableICacheFlush(bool value) {
}
bool TuningContext::IsICacheFlushEnabled() const {
static const char *env = std::getenv("PYTORCH_TUNABLEOP_ICACHE_FLUSH_ENABLED");
if (env != nullptr && strcmp(env, "0") == 0) {
static const auto env = c10::utils::get_env("PYTORCH_TUNABLEOP_ICACHE_FLUSH_ENABLED");
if (env == "0") {
return false;
}
return icache_flush_;
@ -581,10 +582,10 @@ void TuningContext::SetRotatingBufferSize(int size) {
}
int TuningContext::GetRotatingBufferSize() const {
static const char *env = std::getenv("PYTORCH_TUNABLEOP_ROTATING_BUFFER_SIZE");
if (env != nullptr) {
static const auto env = c10::utils::get_env("PYTORCH_TUNABLEOP_ROTATING_BUFFER_SIZE");
if (env.has_value()) {
constexpr int MB = 1024 * 1024;
int val = atoi(env);
int val = stoi(env.value());
return val < 0 ? 0 : val * MB; // env var is specified as MB, returned as bytes
}
else {
@ -604,8 +605,8 @@ TuningResultsManager& TuningContext::GetTuningResultsManager() {
manager_initialized_ = true;
if (GetFilename().empty()) {
// if SetFilename() was not already called, call it now with the default or env var
const char *env = std::getenv("PYTORCH_TUNABLEOP_FILENAME");
std::string filename = (env == nullptr) ? "tunableop_results.csv" : env;
const auto env = c10::utils::get_env("PYTORCH_TUNABLEOP_FILENAME");
std::string filename = (!env.has_value()) ? "tunableop_results.csv" : env.value();
SetFilename(filename, true);
}
auto filename = GetFilename();

View File

@ -42,8 +42,9 @@ static Tensor materializeGradWrappers(const Tensor& tensor, int64_t current_leve
if (!wrapper) {
return makeTensorWrapper(tensor, current_level, /*is_immutable=*/true);
}
TORCH_INTERNAL_ASSERT(wrapper->level().value() <= current_level, "escaped?");
if (wrapper->level() == current_level) {
auto level = wrapper->level();
TORCH_INTERNAL_ASSERT(level.has_value() && level <= current_level, "escaped?");
if (level == current_level) {
TORCH_INTERNAL_ASSERT(tensor.defined());
return tensor;
}

View File

@ -54,6 +54,8 @@ struct BinaryRandomPointwiseBatchRuleHelper<F, Func, typelist<T1, T2, T...>> {
static Tensor apply(const Tensor& tensor, const Tensor& other, T... extra_args) {
c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchVmapMode);
auto maybe_layer = maybeCurrentDynamicLayer();
TORCH_INTERNAL_ASSERT(maybe_layer.has_value())
// NOLINTNEXTLINE(bugprone-unchecked-optional-access)
auto cur_level = maybe_layer->layerId();
RandomnessType randomness = maybe_layer->randomness();

View File

@ -19,6 +19,7 @@ struct NewBlahBatchRuleHelperSymInt<F, Func, typelist<A, B, T...>> {
std::optional<int64_t> batch_dim,
SymIntArrayRef shape,
T... extra_args) {
// NOLINTNEXTLINE(bugprone-unchecked-optional-access)
const auto bdim_size = tensor.sym_size(batch_dim.value());
c10::SmallVector<c10::SymInt> new_shape;
new_shape.reserve(shape.size() + 1);

View File

@ -9,7 +9,7 @@
namespace at::functorch {
Tensor moveBatchDimToFront(const Tensor& tensor, std::optional<int64_t> maybe_batch_dim) {
Tensor moveBatchDimToFront(Tensor tensor, std::optional<int64_t> maybe_batch_dim) {
if (!maybe_batch_dim.has_value()) {
return tensor;
}

View File

@ -30,7 +30,7 @@ TORCH_API Tensor reshape_dim_outof(int64_t src, int64_t size1, const Tensor& x);
TORCH_API Tensor reshape_dim_outof_symint(int64_t src, const c10::SymInt& size1, const Tensor& x);
Tensor moveBatchDimToFront(const Tensor& tensor, std::optional<int64_t> maybe_batch_dim);
Tensor moveBatchDimToFront(Tensor tensor, std::optional<int64_t> maybe_batch_dim);
int64_t rankWithoutBatchDim(const Tensor& tensor, std::optional<int64_t> maybe_batch_dim);
int64_t numelWithoutBatchDim(const Tensor& tensor, std::optional<int64_t> maybe_batch_dim);
std::optional<int64_t> valIfNonempty(std::optional<int64_t> maybe_empty, int64_t new_val);
@ -243,9 +243,8 @@ inline void boxed_existing_bdim_all_batch_rule(
const auto num_arguments = static_cast<int64_t>(schema.arguments().size());
c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched);
auto maybe_layer = maybeCurrentDynamicLayer();
const auto maybe_layer = maybeCurrentDynamicLayer();
vmap_check_escaped(maybe_layer, "boxed_existing_bdim_all_batch_rule");
int64_t cur_level = maybe_layer->layerId();
const auto arguments = torch::jit::last(stack, num_arguments);
if (std::none_of(arguments.begin(), arguments.end(), ivalueParticipatesInCurrentLevel)) {
@ -257,6 +256,8 @@ inline void boxed_existing_bdim_all_batch_rule(
SmallVector<UnpackedBatchedTensor, 5> tensor_inputs;
SmallVector<int64_t, 5> tensor_pos;
int64_t batch_size = 0;
// NOLINTNEXTLINE(bugprone-unchecked-optional-access)
int64_t cur_level = maybe_layer->layerId();
find_and_unpack_tensors(
stack, num_arguments, cur_level,

View File

@ -492,6 +492,7 @@ _scaled_dot_product_flash_attention_batch_rule(
) {
if (dropout_p > 0) {
auto maybe_layer = maybeCurrentDynamicLayer();
// NOLINTNEXTLINE(bugprone-unchecked-optional-access)
RandomnessType randomness = maybe_layer->randomness();
check_randomness(randomness, query_bdim.has_value() || key_bdim.has_value() || value_bdim.has_value());
}
@ -543,6 +544,7 @@ fourOutputs _scaled_dot_product_efficient_attention_batch_rule(
) {
if (dropout_p > 0) {
auto maybe_layer = maybeCurrentDynamicLayer();
// NOLINTNEXTLINE(bugprone-unchecked-optional-access)
RandomnessType randomness = maybe_layer->randomness();
check_randomness(randomness, query_bdim.has_value() || key_bdim.has_value() || value_bdim.has_value());
}
@ -585,6 +587,7 @@ _scaled_dot_product_cudnn_attention_batch_rule(
) {
if (dropout_p > 0) {
auto maybe_layer = maybeCurrentDynamicLayer();
// NOLINTNEXTLINE(bugprone-unchecked-optional-access)
RandomnessType randomness = maybe_layer->randomness();
check_randomness(randomness, query_bdim.has_value() || key_bdim.has_value() || value_bdim.has_value());
}

View File

@ -90,6 +90,7 @@ static Tensor binary_cross_entropy_plumbing(
const std::optional<Tensor>& weight, int64_t reduction) {
auto maybe_layer = maybeCurrentDynamicLayer();
vmap_check_escaped(maybe_layer, "binary_cross_entropy_plumbing");
// NOLINTNEXTLINE(bugprone-unchecked-optional-access)
int64_t cur_level = maybe_layer->layerId();
if (!isBatchedAtLevel(self, cur_level) && !isBatchedAtLevel(target, cur_level)
@ -126,6 +127,7 @@ static Tensor binary_cross_entropy_backward_plumbing(
const std::optional<Tensor>& weight_opt, int64_t reduction) {
auto maybe_layer = maybeCurrentDynamicLayer();
vmap_check_escaped(maybe_layer, "binary_cross_entropy_backward_plumbing");
// NOLINTNEXTLINE(bugprone-unchecked-optional-access)
int64_t cur_level = maybe_layer->layerId();
if (!areAnyBatchedAtLevel({grad, input, target, weight_opt}, cur_level)) {

View File

@ -57,7 +57,7 @@ embedding_dense_backward_batch_rule(
c10::SymInt num_weights, c10::SymInt padding_idx, bool scale_grad_by_freq) {
Tensor grad = grad_;
Tensor indices = indices_;
if (!indices_bdim && grad_bdim) {
if (!indices_bdim.has_value() && grad_bdim) {
const auto bdim_size = grad.sym_size(*grad_bdim);
grad = reshape_dim_into(*grad_bdim, -1, grad);
auto result = at::embedding_dense_backward_symint(
@ -162,12 +162,12 @@ grid_sample_backward_helper_in(
static std::tuple<Tensor, std::optional<int64_t>, Tensor, std::optional<int64_t>>
grid_sample_backward_helper_out(
std::tuple<Tensor, Tensor> bw_out,
std::optional<int64_t> grad_input_out_bdim,
std::optional<int64_t> grad_grid_out_bdim,
int64_t grad_input_out_bdim,
int64_t grad_grid_out_bdim,
int64_t bdim_size) {
auto& [grad_input, grad_grid] = bw_out;
grad_input = reshape_dim_outof(*grad_input_out_bdim, bdim_size, grad_input);
grad_grid = reshape_dim_outof(*grad_grid_out_bdim, bdim_size, grad_grid);
grad_input = reshape_dim_outof(grad_input_out_bdim, bdim_size, grad_input);
grad_grid = reshape_dim_outof(grad_grid_out_bdim, bdim_size, grad_grid);
return std::make_tuple(std::move(grad_input), grad_input_out_bdim, std::move(grad_grid), grad_grid_out_bdim);
}

View File

@ -218,6 +218,8 @@ std::tuple<at::Tensor,at::Tensor,at::Tensor> batch_norm_backward_plumbing(
c10::MaybeOwned<Tensor> running_var_maybe_owned = at::borrow_from_optional_tensor(running_var_opt);
const Tensor& running_var = *running_var_maybe_owned;
// NB: not sure why these are optional...these are required from the forward
TORCH_INTERNAL_ASSERT(save_mean_opt.has_value());
TORCH_INTERNAL_ASSERT(save_rstd_opt.has_value());
const Tensor& save_mean = *save_mean_opt;
const Tensor& save_rstd = *save_rstd_opt;
TORCH_INTERNAL_ASSERT(save_mean.defined());
@ -226,6 +228,7 @@ std::tuple<at::Tensor,at::Tensor,at::Tensor> batch_norm_backward_plumbing(
// plumbing
auto maybe_layer = maybeCurrentDynamicLayer();
vmap_check_escaped(maybe_layer, "batch_norm_backward_plumbing");
// NOLINTNEXTLINE(bugprone-unchecked-optional-access)
int64_t cur_level = maybe_layer->layerId();
auto [grad_out_value, grad_out_bdim] = unwrapTensorAtLevel(grad_out, cur_level);
@ -298,6 +301,7 @@ static std::tuple<Tensor,Tensor,Tensor> native_group_norm_plumbing(
auto maybe_layer = maybeCurrentDynamicLayer();
vmap_check_escaped(maybe_layer, "native_group_norm_plumbing");
// NOLINTNEXTLINE(bugprone-unchecked-optional-access)
int64_t cur_level = maybe_layer->layerId();
if (!areAnyBatchedAtLevel({input, weight_opt, bias_opt}, cur_level)) {
@ -380,6 +384,7 @@ static std::tuple<Tensor,Tensor,Tensor> native_group_norm_backward_plumbing(
// plumbing
auto maybe_layer = maybeCurrentDynamicLayer();
vmap_check_escaped(maybe_layer, "native_group_norm_backward_plumbing");
// NOLINTNEXTLINE(bugprone-unchecked-optional-access)
int64_t cur_level = maybe_layer->layerId();
if (!areAnyBatchedAtLevel({grad_out, input, mean, rstd, weight_opt}, cur_level)) {
@ -579,6 +584,7 @@ static std::tuple<at::Tensor,at::Tensor,at::Tensor> native_layer_norm_backward_p
// plumbing
auto maybe_layer = maybeCurrentDynamicLayer();
vmap_check_escaped(maybe_layer, "native_layer_norm_backward_plumbing");
// NOLINTNEXTLINE(bugprone-unchecked-optional-access)
int64_t cur_level = maybe_layer->layerId();
if (!areAnyBatchedAtLevel({grad_out, input, mean, rstd, weight_opt, bias_opt}, cur_level)) {
c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched);
@ -721,6 +727,7 @@ struct NativeBatchNormBackwardBatchRuleHelper {
auto maybe_layer = maybeCurrentDynamicLayer();
vmap_check_escaped(maybe_layer, "NativeBatchNormBackwardBatchRuleHelper.apply");
// NOLINTNEXTLINE(bugprone-unchecked-optional-access)
int64_t cur_level = maybe_layer->layerId();
if (!areAnyBatchedAtLevel({grad_out, input, weight_opt, running_mean_opt,
@ -751,6 +758,7 @@ struct CudnnBatchNormBackwardBatchRuleHelper {
auto maybe_layer = maybeCurrentDynamicLayer();
vmap_check_escaped(maybe_layer, "CudnnBatchNormBackwardBatchRuleHelper.apply");
// NOLINTNEXTLINE(bugprone-unchecked-optional-access)
int64_t cur_level = maybe_layer->layerId();
if (!areAnyBatchedAtLevel({input, grad_out, weight, running_mean_opt,
@ -779,6 +787,7 @@ struct MiopenBatchNormBackwardBatchRuleHelper {
auto maybe_layer = maybeCurrentDynamicLayer();
vmap_check_escaped(maybe_layer, "MiopenBatchNormBackwardBatchRuleHelper.apply");
// NOLINTNEXTLINE(bugprone-unchecked-optional-access)
int64_t cur_level = maybe_layer->layerId();
if (!areAnyBatchedAtLevel({input, grad_out, weight, running_mean_opt,

View File

@ -28,8 +28,10 @@ max_pool_with_indices_batch_rule_helper(
return std::make_tuple(std::move(std::get<0>(result)), 0, std::move(std::get<1>(result)), 0);
}
// Tensor[B, N, logical_rank...] -> Tensor[B * N, logical_rank...]
auto bdim_size = self.size(*self_bdim);
auto self_ = reshape_dim_into(*self_bdim, 0, self);
// NOLINTNEXTLINE(bugprone-unchecked-optional-access)
auto bdim_size = self.size(self_bdim.value());
// NOLINTNEXTLINE(bugprone-unchecked-optional-access)
auto self_ = reshape_dim_into(self_bdim.value(), 0, self);
auto result = pooling_fn(
self_, kernel_size, stride, padding, dilation, ceil_mode);
return std::make_tuple(

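The hunk above also swaps *self_bdim for self_bdim.value(): on an empty optional, operator* is undefined behaviour while value() throws std::bad_optional_access, so the invariant now fails loudly at runtime even though the NOLINT still silences the static analyzer. A small demonstration with plain std::optional (not the batch-rule types):

#include <cstdio>
#include <optional>

int main() {
  std::optional<int> empty;
  try {
    (void)empty.value();  // throws std::bad_optional_access on an empty optional
  } catch (const std::bad_optional_access&) {
    std::puts("value() threw as expected");
  }
  // (void)*empty;  // would be undefined behaviour rather than an exception
  return 0;
}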
View File

@ -25,6 +25,7 @@ Tensor random_batching_rule(SymIntArrayRef shape, ExtraArgs... extra_args) {
c10::SmallVector<SymInt> shapeVec(1, maybe_layer->batchSize());
shapeVec.reserve(shape.size() + 1);
shapeVec.insert(shapeVec.end(), shape.begin(), shape.end());
// NOLINTNEXTLINE(bugprone-unchecked-optional-access)
RandomnessType randomness = maybe_layer->randomness();
check_randomness(randomness);
if (randomness == RandomnessType::Different) {
@ -38,9 +39,11 @@ template <typename F, F Func, typename... ExtraArgs>
Tensor& random_inplace_batching_rule(Tensor& self, ExtraArgs... extra_args) {
c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchVmapMode);
auto maybe_layer = maybeCurrentDynamicLayer();
// NOLINTNEXTLINE(bugprone-unchecked-optional-access)
const auto cur_level = maybe_layer->layerId();
auto [self_value, self_bdim] = unwrapTensorAtLevel(self, cur_level);
self_value = moveBatchDimToFront(self_value, self_bdim);
self_value = moveBatchDimToFront(std::move(self_value), self_bdim);
// NOLINTNEXTLINE(bugprone-unchecked-optional-access)
RandomnessType randomness = maybe_layer->randomness();
check_randomness(randomness);
TORCH_CHECK(

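The functional part of this hunk is passing self_value through std::move so moveBatchDimToFront consumes the handle instead of copying it (for at::Tensor that copy is presumably just a refcount bump, so this is a small saving when the value is reassigned from the result anyway). The general idiom with a toy type:

#include <string>
#include <utility>

// By-value parameter: callers choose copy or move at the call site.
std::string take(std::string s) {
  return s;
}

int main() {
  std::string value = "batched";
  value = take(std::move(value));  // moves into the parameter, no deep copy
  return value.empty() ? 1 : 0;
}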
View File

@ -1732,11 +1732,10 @@ std::tuple<Tensor,Tensor,Tensor> _convolution_double_backward( const std::option
// See [Note: hacky wrapper removal for optional tensor]
c10::MaybeOwned<Tensor> ggI_maybe_owned = at::borrow_from_optional_tensor(ggI_opt);
const Tensor& ggI = *ggI_maybe_owned;
const Tensor& ggW_r = ggW_r_opt.value_or(Tensor());
Tensor ggW = ggW_r_opt.value_or(Tensor());
const Tensor& ggb = ggb_opt.value_or(Tensor());
auto ggW = ggW_r;
auto gO = gO_r;
auto weight = weight_r;

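The _convolution_double_backward tweak appears to rely on value_or returning by value: the old code bound a const Tensor& to that temporary (lifetime-extended) and then copied it into a mutable ggW anyway, so initializing a plain Tensor directly drops the extra step. The same shape with std::string standing in for Tensor:

#include <optional>
#include <string>

int main() {
  std::optional<std::string> maybe_weight;
  // value_or returns a prvalue; initialize the object directly instead of
  // binding a const reference to the temporary and copying it again later.
  std::string weight = maybe_weight.value_or(std::string("fallback"));
  return weight == "fallback" ? 0 : 1;
}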
View File

@ -251,20 +251,12 @@ Tensor kl_div(const Tensor& input, const Tensor& target, int64_t reduction, bool
}
Tensor binary_cross_entropy_cpu(const Tensor& input, const Tensor& target, const std::optional<Tensor>& weight_opt, int64_t reduction) {
// See [Note: hacky wrapper removal for optional tensor]
c10::MaybeOwned<Tensor> weight_maybe_owned = at::borrow_from_optional_tensor(weight_opt);
const Tensor& weight = *weight_maybe_owned;
Tensor loss = at::empty_like(input);
return at::native::binary_cross_entropy_out_cpu(
input, target, weight, reduction, loss);
input, target, weight_opt, reduction, loss);
}
Tensor& binary_cross_entropy_out_cpu(const Tensor& input, const Tensor& target, const std::optional<Tensor>& weight_opt, int64_t reduction, Tensor& loss) {
// See [Note: hacky wrapper removal for optional tensor]
c10::MaybeOwned<Tensor> weight_maybe_owned = at::borrow_from_optional_tensor(weight_opt);
const Tensor& weight = *weight_maybe_owned;
Tensor loss_squeezed = at::squeeze(loss);
auto iter = TensorIteratorConfig()
@ -297,8 +289,8 @@ Tensor& binary_cross_entropy_out_cpu(const Tensor& input, const Tensor& target,
});
});
if (weight.defined()) {
loss.mul_(weight);
if (weight_opt.has_value() && weight_opt->defined()) {
loss.mul_(*weight_opt);
}
if (reduction != at::Reduction::None) {
Tensor loss_reduced = apply_loss_reduction(loss, reduction);
@ -308,20 +300,12 @@ Tensor& binary_cross_entropy_out_cpu(const Tensor& input, const Tensor& target,
}
Tensor binary_cross_entropy_backward_cpu(const Tensor& grad, const Tensor& input, const Tensor& target, const std::optional<Tensor>& weight_opt, int64_t reduction) {
// See [Note: hacky wrapper removal for optional tensor]
c10::MaybeOwned<Tensor> weight_maybe_owned = at::borrow_from_optional_tensor(weight_opt);
const Tensor& weight = *weight_maybe_owned;
Tensor grad_input = at::empty_like(input);
return at::native::binary_cross_entropy_backward_out_cpu(
grad, input, target, weight, reduction, grad_input);
grad, input, target, weight_opt, reduction, grad_input);
}
Tensor& binary_cross_entropy_backward_out_cpu(const Tensor& grad, const Tensor& input, const Tensor& target, const std::optional<Tensor>& weight_opt, int64_t reduction, Tensor& grad_input) {
// See [Note: hacky wrapper removal for optional tensor]
c10::MaybeOwned<Tensor> weight_maybe_owned = at::borrow_from_optional_tensor(weight_opt);
const Tensor& weight = *weight_maybe_owned;
Tensor grad_input_squeezed = at::squeeze(grad_input);
auto iter = TensorIteratorConfig()
@ -350,8 +334,8 @@ Tensor& binary_cross_entropy_backward_out_cpu(const Tensor& grad, const Tensor&
});
});
if (weight.defined()) {
grad_input.mul_(weight);
if (weight_opt.has_value() && weight_opt->defined()) {
grad_input.mul_(*weight_opt);
}
if (reduction == at::Reduction::Mean) {
grad_input.div_(input.numel());
@ -360,23 +344,17 @@ Tensor& binary_cross_entropy_backward_out_cpu(const Tensor& grad, const Tensor&
}
Tensor binary_cross_entropy_with_logits(const Tensor& input, const Tensor& target, const std::optional<Tensor>& weight_opt, const std::optional<Tensor>& pos_weight_opt, int64_t reduction) {
// See [Note: hacky wrapper removal for optional tensor]
c10::MaybeOwned<Tensor> weight_maybe_owned = at::borrow_from_optional_tensor(weight_opt);
const Tensor& weight = *weight_maybe_owned;
c10::MaybeOwned<Tensor> pos_weight_maybe_owned = at::borrow_from_optional_tensor(pos_weight_opt);
const Tensor& pos_weight = *pos_weight_maybe_owned;
auto log_sigmoid_input = at::log_sigmoid(input);
if (pos_weight.defined()) {
if (pos_weight_opt.has_value() && pos_weight_opt->defined()) {
// pos_weight need to be broadcasted, thus mul(target) is not inplace.
auto log_weight = (pos_weight - 1).mul(target).add_(1);
auto log_weight = (*pos_weight_opt - 1).mul(target).add_(1);
log_sigmoid_input.mul_(log_weight);
}
Tensor loss = (1 - target).mul_(input).sub_(log_sigmoid_input);
if (weight.defined()) {
loss.mul_(weight);
if (weight_opt.has_value() && weight_opt->defined()) {
loss.mul_(*weight_opt);
}
return apply_loss_reduction(loss, reduction);

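These loss-function changes follow one pattern: the std::optional<Tensor> argument is forwarded as-is and only dereferenced behind a has_value() && defined() guard, rather than being materialized up front via borrow_from_optional_tensor. A reduced sketch of that guard with a stub Tensor type (simplified, not the actual ATen signatures):

#include <optional>

struct Tensor {
  bool is_defined = true;
  bool defined() const { return is_defined; }
  void mul_(const Tensor&) {}
};

void apply_weight(Tensor& loss, const std::optional<Tensor>& weight_opt) {
  // Dereference only when a defined tensor was actually supplied.
  if (weight_opt.has_value() && weight_opt->defined()) {
    loss.mul_(*weight_opt);
  }
}

int main() {
  Tensor loss;
  std::optional<Tensor> no_weight;    // no weight supplied: guard skips the multiply
  apply_weight(loss, no_weight);
  apply_weight(loss, Tensor{});       // defined weight: multiply runs
  return 0;
}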
View File

@ -659,20 +659,12 @@ Tensor cross_entropy_loss_symint(
}
Tensor & nll_loss_out(const Tensor & self, const Tensor & target, const std::optional<Tensor>& weight_opt, int64_t reduction, int64_t ignore_index, Tensor & output) {
// See [Note: hacky wrapper removal for optional tensor]
c10::MaybeOwned<Tensor> weight_maybe_owned = at::borrow_from_optional_tensor(weight_opt);
const Tensor& weight = *weight_maybe_owned;
Tensor total_weight = at::empty({0}, self.options());
return std::get<0>(at::nll_loss_forward_out(output, total_weight, self, target, weight, reduction, ignore_index));
return std::get<0>(at::nll_loss_forward_out(output, total_weight, self, target, weight_opt, reduction, ignore_index));
}
Tensor nll_loss_symint(const Tensor & self, const Tensor & target, const std::optional<Tensor>& weight_opt, int64_t reduction, c10::SymInt ignore_index) {
// See [Note: hacky wrapper removal for optional tensor]
c10::MaybeOwned<Tensor> weight_maybe_owned = at::borrow_from_optional_tensor(weight_opt);
const Tensor& weight = *weight_maybe_owned;
return std::get<0>(at::nll_loss_forward_symint(self, target, weight, reduction, std::move(ignore_index)));
return std::get<0>(at::nll_loss_forward_symint(self, target, weight_opt, reduction, std::move(ignore_index)));
}
Tensor nll_loss_nd_symint(

View File

@ -424,14 +424,10 @@ std::tuple<Tensor, Tensor> nll_loss2d_forward_cpu(
const Tensor& target, const std::optional<Tensor>& weight_opt,
int64_t reduction,
int64_t ignore_index) {
// See [Note: hacky wrapper removal for optional tensor]
c10::MaybeOwned<Tensor> weight_maybe_owned = at::borrow_from_optional_tensor(weight_opt);
const Tensor& weight = *weight_maybe_owned;
auto output = at::empty({0}, self.options());
auto total_weight = at::empty({0}, self.options());
at::native::nll_loss2d_forward_out_cpu(
self, target, weight, reduction, ignore_index, output, total_weight);
self, target, weight_opt, reduction, ignore_index, output, total_weight);
return std::make_tuple(output, total_weight);
}
@ -465,16 +461,12 @@ Tensor nll_loss2d_backward_cpu(
int64_t reduction,
int64_t ignore_index,
const Tensor& total_weight) {
// See [Note: hacky wrapper removal for optional tensor]
c10::MaybeOwned<Tensor> weight_maybe_owned = at::borrow_from_optional_tensor(weight_opt);
const Tensor& weight = *weight_maybe_owned;
auto grad_input = at::zeros_like(self);
at::native::nll_loss2d_backward_out_cpu(
grad_output,
self,
target,
weight,
weight_opt,
reduction,
ignore_index,
total_weight,
@ -483,20 +475,12 @@ Tensor nll_loss2d_backward_cpu(
}
Tensor & nll_loss2d_out(const Tensor & self, const Tensor & target, const std::optional<Tensor>& weight_opt, int64_t reduction, int64_t ignore_index, Tensor & output) {
// See [Note: hacky wrapper removal for optional tensor]
c10::MaybeOwned<Tensor> weight_maybe_owned = at::borrow_from_optional_tensor(weight_opt);
const Tensor& weight = *weight_maybe_owned;
Tensor total_weight = at::empty({0}, self.options());
return std::get<0>(at::nll_loss2d_forward_out(output, total_weight, self, target, weight, reduction, ignore_index));
return std::get<0>(at::nll_loss2d_forward_out(output, total_weight, self, target, weight_opt, reduction, ignore_index));
}
Tensor nll_loss2d_symint(const Tensor & self, const Tensor & target, const std::optional<Tensor>& weight_opt, int64_t reduction, c10::SymInt ignore_index) {
// See [Note: hacky wrapper removal for optional tensor]
c10::MaybeOwned<Tensor> weight_maybe_owned = at::borrow_from_optional_tensor(weight_opt);
const Tensor& weight = *weight_maybe_owned;
return std::get<0>(at::nll_loss2d_forward_symint(self, target, weight, reduction, std::move(ignore_index)));
return std::get<0>(at::nll_loss2d_forward_symint(self, target, weight_opt, reduction, std::move(ignore_index)));
}
} // namespace at::native

File diff suppressed because it is too large

View File

@ -13,21 +13,62 @@ struct TensorIterator;
namespace at::native {
using index_put_with_sort_fn = void(*)(Tensor &, const c10::List<std::optional<Tensor>> &, const Tensor &, bool accumulate, bool unsafe);
using index_put_with_sort_quantized_fn = void(*)(Tensor& self, const c10::List<std::optional<Tensor>>& indices, const Tensor& value, double scale, int zero_point, bool unsafe);
using gather_fn = void (*)(const Tensor & result, const Tensor & self, int64_t dim, const Tensor & index);
using scatter_fn = void(*)(const Tensor& self, int64_t dim, const Tensor& index, const Tensor& src);
using scatter_fill_fn = void(*)(const Tensor& self, int64_t dim, const Tensor& index, const Scalar& src);
using scatter_add_fn = void(*)(const Tensor& self, int64_t dim, const Tensor& index, const Tensor& src);
using scatter_reduce_fn = void(*)(const Tensor& self, const int64_t dim, const Tensor& index,
const Tensor& src, const ReductionType& reduce);
using scatter_scalar_reduce_fn = void(*)(const Tensor& self, const int64_t dim, const Tensor& index,
const Scalar& value, const ReductionType& reduce);
using scatter_reduce_two_fn = void(*)(const Tensor& self, const int64_t dim, const Tensor& index,
const Tensor& src, const ReductionType& reduce);
using index_put_with_sort_fn = void (*)(
Tensor&,
const c10::List<std::optional<Tensor>>&,
const Tensor&,
bool accumulate,
bool unsafe);
using index_put_with_sort_quantized_fn = void (*)(
Tensor& self,
const c10::List<std::optional<Tensor>>& indices,
const Tensor& value,
double scale,
int zero_point,
bool unsafe);
using gather_fn = void (*)(
const Tensor& result,
const Tensor& self,
int64_t dim,
const Tensor& index);
using scatter_fn = void (*)(
const Tensor& self,
int64_t dim,
const Tensor& index,
const Tensor& src);
using scatter_fill_fn = void (*)(
const Tensor& self,
int64_t dim,
const Tensor& index,
const Scalar& src);
using scatter_add_fn = void (*)(
const Tensor& self,
int64_t dim,
const Tensor& index,
const Tensor& src);
using scatter_reduce_fn = void (*)(
const Tensor& self,
const int64_t dim,
const Tensor& index,
const Tensor& src,
const ReductionType& reduce);
using scatter_scalar_reduce_fn = void (*)(
const Tensor& self,
const int64_t dim,
const Tensor& index,
const Scalar& value,
const ReductionType& reduce);
using scatter_reduce_two_fn = void (*)(
const Tensor& self,
const int64_t dim,
const Tensor& index,
const Tensor& src,
const ReductionType& reduce);
DECLARE_DISPATCH(index_put_with_sort_fn, index_put_with_sort_stub)
DECLARE_DISPATCH(index_put_with_sort_quantized_fn, index_put_with_sort_quantized_stub)
DECLARE_DISPATCH(
index_put_with_sort_quantized_fn,
index_put_with_sort_quantized_stub)
DECLARE_DISPATCH(gather_fn, gather_stub)
DECLARE_DISPATCH(scatter_fn, scatter_stub)
DECLARE_DISPATCH(scatter_fill_fn, scatter_fill_stub)
@ -36,14 +77,26 @@ DECLARE_DISPATCH(scatter_reduce_fn, scatter_reduce_stub)
DECLARE_DISPATCH(scatter_scalar_reduce_fn, scatter_scalar_reduce_stub)
DECLARE_DISPATCH(scatter_reduce_two_fn, scatter_reduce_two_stub)
TORCH_API Tensor& index_out(Tensor& result, const Tensor & self, const c10::List<std::optional<at::Tensor>>& indices);
TORCH_API Tensor& index_out(
Tensor& result,
const Tensor& self,
const c10::List<std::optional<at::Tensor>>& indices);
using scatter_add_expanded_index_fn = void(*)(const Tensor&, const Tensor&, const Tensor&);
using scatter_reduce_expanded_index_fn = void(*)(const Tensor&, const Tensor&, const Tensor&, const ReductionType& reduce, bool);
using gather_expanded_index_fn = void (*)(const Tensor&, const Tensor&, const Tensor&);
using scatter_add_expanded_index_fn =
void (*)(const Tensor&, const Tensor&, const Tensor&);
using scatter_reduce_expanded_index_fn = void (*)(
const Tensor&,
const Tensor&,
const Tensor&,
const ReductionType& reduce,
bool);
using gather_expanded_index_fn =
void (*)(const Tensor&, const Tensor&, const Tensor&);
DECLARE_DISPATCH(scatter_add_expanded_index_fn, scatter_add_expanded_index_stub)
DECLARE_DISPATCH(scatter_reduce_expanded_index_fn, scatter_reduce_expanded_index_stub)
DECLARE_DISPATCH(
scatter_reduce_expanded_index_fn,
scatter_reduce_expanded_index_stub)
DECLARE_DISPATCH(gather_expanded_index_fn, gather_expanded_index_stub)
} // namespace at::native

View File

@ -23,28 +23,38 @@ inline std::string shapes_as_str(TensorList tensors) {
#endif
} // anonymous namespace
inline std::tuple<bool, Tensor> canDispatchToMaskedFill(const Tensor& self, const torch::List<std::optional<at::Tensor>>& indices,
const Tensor& value){
if (!(value.numel() ==1 && value.device().is_cpu())){
return std::make_tuple(false,Tensor());
inline std::tuple<bool, Tensor> canDispatchToMaskedFill(
const Tensor& self,
const torch::List<std::optional<at::Tensor>>& indices,
const Tensor& value) {
if (!(value.numel() == 1 && value.device().is_cpu())) {
return std::make_tuple(false, Tensor());
}
int64_t num_ind = 0;
Tensor mask;
auto self_device = self.device();
for (const std::optional<Tensor>& i: indices) {
if (!i.has_value() || !(*i).defined()){
for (const std::optional<Tensor>& i : indices) {
if (!i.has_value() || !(*i).defined()) {
num_ind++;
} else {
const Tensor &index = *i;
const Tensor& index = *i;
if ((index.scalar_type() != kByte && index.scalar_type() != kBool) ||
index.device() != self_device || mask.defined()){
index.device() != self_device || mask.defined()) {
return std::make_tuple(false, Tensor());
} else {
mask = index;
for (const auto j : c10::irange(index.dim())) {
int64_t srcIdx = num_ind + j;
TORCH_CHECK_INDEX(index.size(j) == self.size(srcIdx), "The shape of the mask ", index.sizes(), " at index ", j,
" does not match the shape of the indexed tensor ", self.sizes(), " at index ", srcIdx);
TORCH_CHECK_INDEX(
index.size(j) == self.size(srcIdx),
"The shape of the mask ",
index.sizes(),
" at index ",
j,
" does not match the shape of the indexed tensor ",
self.sizes(),
" at index ",
srcIdx);
}
num_ind += mask.ndimension();
}
@ -59,14 +69,18 @@ const Tensor& value){
inline AdvancedIndex make_info(Tensor self, IOptTensorListRef orig) {
checkIndexTensorTypes(orig, /*allow_int*/ true);
// first expand BoolTensor (masks) or ByteTensor (masks) into 1 or more LongTensors
// first expand BoolTensor (masks) or ByteTensor (masks) into 1 or more
// LongTensors
auto indices = expandTensors(self, orig);
// next broadcast all index tensors together
try {
indices = expand_outplace(indices);
} catch (std::exception& e) {
TORCH_CHECK_INDEX(false, "shape mismatch: indexing tensors could not be broadcast together"
" with shapes ", shapes_as_str(indices));
TORCH_CHECK_INDEX(
false,
"shape mismatch: indexing tensors could not be broadcast together"
" with shapes ",
shapes_as_str(indices));
}
// add missing null Tensors so that it matches self.dim()
while (indices.size() < (size_t)self.dim()) {
@ -78,12 +92,12 @@ inline AdvancedIndex make_info(Tensor self, IOptTensorListRef orig) {
std::tie(self, indices) = transposeToFront(self, indices);
}
// Ensure indices are on the same device as self
for (auto & indice : indices) {
for (auto& indice : indices) {
if (indice.defined() && indice.device() != self.device()) {
indice = indice.to(self.device());
}
}
for (auto & indice : indices) {
for (auto& indice : indices) {
if (indice.defined() && indice.dtype() == at::kInt) {
indice = indice.to(at::kLong);
}

File diff suppressed because it is too large

View File

@ -10,7 +10,7 @@ namespace at {
class Tensor;
struct TensorIterator;
struct TensorIteratorBase;
}
} // namespace at
namespace at::native {
@ -22,28 +22,35 @@ using structured_reduce_minmax_fn =
DECLARE_DISPATCH(structured_reduce_minmax_fn, max_stub)
DECLARE_DISPATCH(structured_reduce_minmax_fn, min_stub)
using where_fn = void (*)(TensorIterator &);
using where_fn = void (*)(TensorIterator&);
DECLARE_DISPATCH(where_fn, where_kernel)
using is_infinity_op_fn = void (*)(TensorIteratorBase &);
using is_infinity_op_fn = void (*)(TensorIteratorBase&);
DECLARE_DISPATCH(is_infinity_op_fn, isposinf_stub)
DECLARE_DISPATCH(is_infinity_op_fn, isneginf_stub)
using mode_fn = void (*)(Tensor&, Tensor&, const Tensor&, int64_t, bool);
DECLARE_DISPATCH(mode_fn, mode_stub)
using clamp_tensor_fn = void (*)(TensorIteratorBase &);
using clamp_tensor_fn = void (*)(TensorIteratorBase&);
DECLARE_DISPATCH(clamp_tensor_fn, clamp_stub)
namespace detail {
enum class ClampLimits {Min, Max, MinMax};
enum class ClampLimits { Min, Max, MinMax };
}
DECLARE_DISPATCH(void (*)(TensorIteratorBase &, const c10::Scalar&, const c10::Scalar&), clamp_scalar_stub)
DECLARE_DISPATCH(void (*)(TensorIteratorBase &, c10::Scalar), clamp_min_scalar_stub)
DECLARE_DISPATCH(void (*)(TensorIteratorBase &, c10::Scalar), clamp_max_scalar_stub)
DECLARE_DISPATCH(
void (*)(TensorIteratorBase&, const c10::Scalar&, const c10::Scalar&),
clamp_scalar_stub)
DECLARE_DISPATCH(
void (*)(TensorIteratorBase&, c10::Scalar),
clamp_min_scalar_stub)
DECLARE_DISPATCH(
void (*)(TensorIteratorBase&, c10::Scalar),
clamp_max_scalar_stub)
using isin_default_fn = void (*)(const Tensor&, const Tensor&, bool, const Tensor&);
using isin_default_fn =
void (*)(const Tensor&, const Tensor&, bool, const Tensor&);
DECLARE_DISPATCH(isin_default_fn, isin_default_stub)
} // namespace at::native

File diff suppressed because it is too large

View File

@ -7,7 +7,7 @@
#include <optional>
namespace at {
class Tensor;
class Tensor;
namespace native {
bool to_will_alias(
const Tensor& self,
@ -20,7 +20,12 @@ bool to_will_alias(
Tensor to_meta(const Tensor& tensor);
std::optional<Tensor> to_meta(const std::optional<Tensor>& tensor);
std::vector<Tensor> to_meta(at::ITensorListRef t_list);
Tensor dense_to_sparse_with_mask(const Tensor& self, const Tensor& mask, std::optional<c10::Layout> layout, OptionalIntArrayRef blocksize, std::optional<int64_t> dense_dim_opt);
Tensor dense_to_sparse_with_mask(
const Tensor& self,
const Tensor& mask,
std::optional<c10::Layout> layout,
OptionalIntArrayRef blocksize,
std::optional<int64_t> dense_dim_opt);
} // namespace native
} // namespace at

View File

@ -3,10 +3,15 @@
#include <c10/util/irange.h>
namespace at::native {
//input tensors are non-zero dim and non-empty
template<typename T1, typename T2, typename Function>
// input tensors are non-zero dim and non-empty
template <typename T1, typename T2, typename Function>
void tensor_dim_apply3(const Tensor& self, Tensor& values, Tensor& indices, int64_t dim, Function func) {
void tensor_dim_apply3(
const Tensor& self,
Tensor& values,
Tensor& indices,
int64_t dim,
Function func) {
int ndims = self.dim();
int tensor_dim_apply_has_finished = 0;
std::vector<int64_t> counter(ndims, 0);
@ -19,9 +24,16 @@ void tensor_dim_apply3(const Tensor& self, Tensor& values, Tensor& indices, int6
int self_dim_size = self.size(dim);
while (!tensor_dim_apply_has_finished) {
func(self_data, values_data, indices_data, self_dim_size, self_stride, values_stride, indices_stride);
func(
self_data,
values_data,
indices_data,
self_dim_size,
self_stride,
values_stride,
indices_stride);
if (ndims == 1) {
break;
break;
}
for (const auto dim_i : c10::irange(ndims)) {
if (dim_i == dim) {
@ -37,18 +49,18 @@ void tensor_dim_apply3(const Tensor& self, Tensor& values, Tensor& indices, int6
indices_data += indices.stride(dim_i);
if (counter[dim_i] == self.size(dim_i)) {
if (dim_i == ndims-1) {
if (dim_i == ndims - 1) {
tensor_dim_apply_has_finished = 1;
break;
} else {
self_data -= counter[dim_i]*self.stride(dim_i);
values_data -= counter[dim_i]*values.stride(dim_i);
indices_data -= counter[dim_i]*indices.stride(dim_i);
self_data -= counter[dim_i] * self.stride(dim_i);
values_data -= counter[dim_i] * values.stride(dim_i);
indices_data -= counter[dim_i] * indices.stride(dim_i);
counter[dim_i] = 0;
}
} else {
break;
}
}
}
}
}

File diff suppressed because it is too large

View File

@ -1,10 +1,10 @@
#pragma once
#include <ATen/core/Tensor.h>
#include <ATen/EmptyTensor.h>
#include <ATen/TensorIterator.h>
#include <ATen/Dispatch.h>
#include <ATen/Dispatch_v2.h>
#include <ATen/EmptyTensor.h>
#include <ATen/TensorIterator.h>
#include <ATen/core/Tensor.h>
#include <ATen/native/DispatchStub.h>
#ifndef AT_PER_OPERATOR_HEADERS
@ -41,9 +41,9 @@ inline int64_t get_tril_size(int64_t row, int64_t col, int64_t offset) {
return 0;
}
// number of elements in the first row of the tril
auto m_first_row = offset > 0 ?
std::min<int64_t>(col, 1 + offset) : // upper bounded by col
row + offset > 0; // either 0 or 1
auto m_first_row = offset > 0 ? std::min<int64_t>(col, 1 + offset)
: // upper bounded by col
row + offset > 0; // either 0 or 1
// number of elements in the last row of the tril, bounded by [0, col]
auto m_last_row = std::max<int64_t>(0, std::min<int64_t>(col, row + offset));
// number of rows, bounded by [0, row]
@ -63,35 +63,49 @@ inline int64_t get_tril_size(int64_t row, int64_t col, int64_t offset) {
}
inline void check_args(
int64_t row, int64_t col, std::optional<Layout> layout_opt) {
int64_t row,
int64_t col,
std::optional<Layout> layout_opt) {
TORCH_CHECK(row >= 0, "row must be non-negative, got", row);
TORCH_CHECK(col >= 0, "col must be non-negative, got", col);
if (layout_opt.has_value()) {
TORCH_CHECK(
*layout_opt == at::kStrided,
"only support layout=torch.strided, got",
*layout_opt)
*layout_opt == at::kStrided,
"only support layout=torch.strided, got",
*layout_opt)
}
}
using at::check_size_nonnegative;
// assumes maximum value in created tensor is n-1 (e.g., torch.randperm(n))
inline void check_supported_max_int_with_precision(int64_t n, const Tensor& tensor) {
inline void check_supported_max_int_with_precision(
int64_t n,
const Tensor& tensor) {
// match defined() to behavior of checks below
TORCH_CHECK(at::scalar_tensor(n>0?n-1:n, tensor.options()).defined(),
"n is too large for result tensor type: '", tensor.toString(), "'");
TORCH_CHECK(
at::scalar_tensor(n > 0 ? n - 1 : n, tensor.options()).defined(),
"n is too large for result tensor type: '",
tensor.toString(),
"'");
// Ensure sufficient precision for floating point representation.
switch (tensor.scalar_type()) {
case at::ScalarType::Half:
TORCH_CHECK(n <= (int64_t(1) << 11) + 1, "n cannot be greater than 2049 for Half type.");
TORCH_CHECK(
n <= (int64_t(1) << 11) + 1,
"n cannot be greater than 2049 for Half type.");
break;
case at::ScalarType::Float:
TORCH_CHECK(n <= (int64_t(1) << 24) + 1, "n cannot be greater than 2^24+1 for Float type.");
TORCH_CHECK(
n <= (int64_t(1) << 24) + 1,
"n cannot be greater than 2^24+1 for Float type.");
break;
case at::ScalarType::Double: // Unlikely to happen, but doesn't hurt to check
TORCH_CHECK(n <= (int64_t(1) << 53) + 1, "n cannot be greater than 2^53+1 for Double type.");
case at::ScalarType::Double: // Unlikely to happen, but doesn't hurt to
// check
TORCH_CHECK(
n <= (int64_t(1) << 53) + 1,
"n cannot be greater than 2^53+1 for Double type.");
break;
default:
break;
@ -104,14 +118,24 @@ inline void check_supported_max_int_with_precision(int64_t n, const Tensor& tens
inline Tensor& fill_empty_deterministic_(Tensor& tensor) {
if (tensor.is_floating_point() || tensor.is_complex()) {
AT_DISPATCH_V2(
tensor.scalar_type(), "fill_empty_deterministic_", AT_WRAP([&]() {
tensor.fill_(std::numeric_limits<scalar_t>::quiet_NaN());
}), AT_EXPAND(AT_FLOATING_TYPES), AT_EXPAND(AT_COMPLEX_TYPES), AT_EXPAND(AT_FLOAT8_TYPES), kBFloat16, kHalf, kComplexHalf);
tensor.scalar_type(),
"fill_empty_deterministic_",
AT_WRAP([&]() {
tensor.fill_(std::numeric_limits<scalar_t>::quiet_NaN());
}),
AT_EXPAND(AT_FLOATING_TYPES),
AT_EXPAND(AT_COMPLEX_TYPES),
AT_EXPAND(AT_FLOAT8_TYPES),
kBFloat16,
kHalf,
kComplexHalf);
} else {
AT_DISPATCH_V2(
tensor.scalar_type(), "fill_empty_deterministic_", AT_WRAP([&]() {
tensor.fill_(std::numeric_limits<scalar_t>::max());
}), kBool, AT_EXPAND(AT_INTEGRAL_TYPES_V2));
tensor.scalar_type(),
"fill_empty_deterministic_",
AT_WRAP([&]() { tensor.fill_(std::numeric_limits<scalar_t>::max()); }),
kBool,
AT_EXPAND(AT_INTEGRAL_TYPES_V2));
}
return tensor;
}
@ -130,7 +154,10 @@ struct ZeroTensorAllocator final : public at::Allocator {
DeleterFnPtr raw_deleter() const override {
return deleter;
}
void copy_data(void* dest [[maybe_unused]], const void* src [[maybe_unused]], std::size_t count [[maybe_unused]]) const final {}
void copy_data(
void* dest [[maybe_unused]],
const void* src [[maybe_unused]],
std::size_t count [[maybe_unused]]) const final {}
at::Device device_;
};

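The limits enforced by check_supported_max_int_with_precision follow from significand width: float carries 24 significand bits, so every integer up to 2^24 is exact and 2^24 + 1 is the first that rounds away (likewise 2^11 for Half and 2^53 for Double); since the largest value actually stored is n - 1, n itself may be one past the boundary. A quick stand-alone check of the float case:

#include <cstdio>

int main() {
  float exact = 16777216.0f;    // 2^24: exactly representable in float
  float rounded = 16777217.0f;  // 2^24 + 1: rounds back to 2^24
  std::printf("%s\n", exact == rounded ? "2^24 + 1 is not representable" : "distinct");
  return 0;
}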
View File

@ -1,39 +1,39 @@
#pragma once
#include <complex>
#include <type_traits>
#include <c10/core/ScalarType.h>
#include <ATen/detail/FunctionTraits.h>
#include <ATen/native/TensorIterator.h>
#include <c10/core/ScalarType.h>
#include <complex>
#include <type_traits>
// This file includes utilities for dynamic_casting done by TensorIterator, see
// CUDALoops.cuh and Loops.h.
// This file includes utilities for dynamic_casting done by TensorIterator, see CUDALoops.cuh and Loops.h.
// dynamic_casting handles when the types expected by the iterator do not match the types of the arguments
// to the function that is being called.
// On CUDA, the cast is currently pushed down into the kernel (for performance reasons).
// On CPU, there is currently an internal assert that a dynamic_cast is not needed.
// dynamic_casting handles when the types expected by the iterator do not match
// the types of the arguments to the function that is being called. On CUDA, the
// cast is currently pushed down into the kernel (for performance reasons). On
// CPU, there is currently an internal assert that a dynamic_cast is not needed.
namespace at::native {
// `needs_dynamic_casting` compares the types expected by iterator
// (i.e. dtypes of the operands) with the actual type of the arguments
// (and returns) of func_t
template<typename func_t, int nargs=function_traits<func_t>::arity>
template <typename func_t, int nargs = function_traits<func_t>::arity>
struct needs_dynamic_casting {
static bool check(TensorIteratorBase& iter) {
using traits = function_traits<func_t>;
using cpp_type = typename traits::template arg<nargs - 1>::type;
using cpp_map = c10::CppTypeToScalarType<cpp_type>;
if (iter.input_dtype(nargs-1) != cpp_map::value) {
if (iter.input_dtype(nargs - 1) != cpp_map::value) {
return true;
}
return needs_dynamic_casting<func_t, nargs - 1>::check(iter);
}
};
template<typename func_t>
template <typename func_t>
struct needs_dynamic_casting<func_t, 0> {
static bool check(TensorIteratorBase& iter) {
using traits = function_traits<func_t>;
@ -49,4 +49,4 @@ struct needs_dynamic_casting<func_t, 0> {
}
};
} //namespace at::native
} // namespace at::native

View File

@ -1,6 +1,6 @@
#define TORCH_ASSERT_ONLY_METHOD_OPERATORS
#include <ATen/TensorIterator.h>
#include <ATen/Parallel.h>
#include <ATen/TensorIterator.h>
#include <ATen/TensorIteratorInternal.h>
#ifndef AT_PER_OPERATOR_HEADERS
@ -22,7 +22,9 @@ static void two_pass_reduction(TensorIteratorBase& iter, loop2d_t loop);
static void parallel_dim_reduction(TensorIteratorBase& iter, loop2d_t loop);
void TensorIteratorBase::parallel_reduce(loop2d_t loop) {
TORCH_CHECK(ntensors() == 2, "parallel_reduce only supports one input and one output");
TORCH_CHECK(
ntensors() == 2,
"parallel_reduce only supports one input and one output");
int64_t numel = this->numel();
if (numel < at::internal::GRAIN_SIZE || at::get_num_threads() == 1 ||
at::in_parallel_region()) {
@ -54,18 +56,24 @@ static void two_pass_reduction(TensorIteratorBase& iter, loop2d_t loop) {
auto first_reduce = TensorIterator::reduce_op(buffer_0, iter.input(0));
TORCH_INTERNAL_ASSERT(first_reduce.output(0).is_alias_of(buffer_0));
at::parallel_for(0, iter.numel(), internal::GRAIN_SIZE, [&](int64_t begin, int64_t end) {
const auto thread_num = at::get_thread_num();
auto shape = first_reduce.shape();
auto strides = first_reduce.get_strides();
at::parallel_for(
0, iter.numel(), internal::GRAIN_SIZE, [&](int64_t begin, int64_t end) {
const auto thread_num = at::get_thread_num();
auto shape = first_reduce.shape();
auto strides = first_reduce.get_strides();
// Bump output ptr so each thread has its own output slice
auto base_ptrs = first_reduce.get_base_ptrs();
base_ptrs[0] += buffer_stride * thread_num;
// Bump output ptr so each thread has its own output slice
auto base_ptrs = first_reduce.get_base_ptrs();
base_ptrs[0] += buffer_stride * thread_num;
at::internal::serial_for_each(shape, strides, base_ptrs.data(),
base_ptrs.size(), loop, {begin, end});
});
at::internal::serial_for_each(
shape,
strides,
base_ptrs.data(),
base_ptrs.size(),
loop,
{begin, end});
});
auto final_reduce = TensorIterator::reduce_op(unsqueezed, buffer);
final_reduce.for_each(loop);
@ -91,8 +99,12 @@ static int find_split_dim(TensorIteratorBase& iter) {
return best_dim;
}
static std::tuple<int64_t, int64_t>
round_columns(TensorIteratorBase& iter, int dim, int multiple, int64_t begin, int64_t end) {
static std::tuple<int64_t, int64_t> round_columns(
TensorIteratorBase& iter,
int dim,
int multiple,
int64_t begin,
int64_t end) {
begin = begin - (begin % multiple);
if (end != iter.shape()[dim]) {
// only round the 'end' column down if it's not the final column
@ -113,7 +125,8 @@ static void parallel_dim_reduction(TensorIteratorBase& iter, loop2d_t loop) {
// round columns to multiples of 128 bytes if adjacent columns are
// contiguous in memory.
int64_t cols_per_128_bytes = 128 / element_size;
std::tie(begin, end) = round_columns(iter, dim, cols_per_128_bytes, begin, end);
std::tie(begin, end) =
round_columns(iter, dim, cols_per_128_bytes, begin, end);
}
if (begin == end) {
return;
@ -124,7 +137,9 @@ static void parallel_dim_reduction(TensorIteratorBase& iter, loop2d_t loop) {
});
}
void TensorIteratorBase::foreach_reduced_elt(loop_subiter_t loop, bool parallelize) {
void TensorIteratorBase::foreach_reduced_elt(
loop_subiter_t loop,
bool parallelize) {
AT_ASSERT(ninputs() == 1);
AT_ASSERT(noutputs() >= 1);
@ -134,26 +149,26 @@ void TensorIteratorBase::foreach_reduced_elt(loop_subiter_t loop, bool paralleli
}
if (output(0).numel() == 1) {
loop(*this);
}
else if (numel() < at::internal::GRAIN_SIZE || at::get_num_threads() == 1 ||
} else if (
numel() < at::internal::GRAIN_SIZE || at::get_num_threads() == 1 ||
at::in_parallel_region() || !parallelize) {
auto reduce_dims = num_reduce_dims();
auto non_reduced_shape = shape.slice(reduce_dims, shape.size() - reduce_dims);
auto non_reduced_shape =
shape.slice(reduce_dims, shape.size() - reduce_dims);
int64_t non_reduced_numel = 1;
for (const auto i : non_reduced_shape) {
non_reduced_numel *= i;
}
DimCounter dims {non_reduced_shape, {0, non_reduced_numel}};
DimCounter dims{non_reduced_shape, {0, non_reduced_numel}};
while (!dims.is_done()) {
TensorIterator reduced = *this;
reduced.select_all_keeping_dim(reduce_dims, dims.values);
loop(reduced);
dims.increment({1, 1});
}
}
else {
} else {
int dim = find_split_dim(*this);
int64_t cols = shape[dim];
at::parallel_for(0, cols, 1, [&](int64_t begin, int64_t end) {
@ -177,4 +192,4 @@ void TensorIteratorBase::foreach_reduced_elt(loop_subiter_t loop, bool paralleli
}
}
} // namespace at
} // namespace at

View File

@ -1,7 +1,7 @@
#define TORCH_ASSERT_ONLY_METHOD_OPERATORS
#include <ATen/core/Tensor.h>
#include <ATen/Context.h>
#include <ATen/NamedTensorUtils.h>
#include <ATen/core/Tensor.h>
#include <ATen/detail/CUDAHooksInterface.h>
#include <ATen/native/TensorProperties.h>
@ -36,9 +36,10 @@ bool nested_is_same_size(const Tensor& self, const Tensor& other) {
TORCH_CHECK(
self.is_nested() && other.is_nested(),
"Expected both self and other to be nested tensors. ",
"Self ", self.is_nested()? "is " : "is not ",
"Self ",
self.is_nested() ? "is " : "is not ",
"nested. While Other ",
other.is_nested()? "is " : "is not ",
other.is_nested() ? "is " : "is not ",
"nested.")
const auto self_nt_size = _nested_tensor_size(self);
const auto other_nt_size = _nested_tensor_size(other);
@ -79,16 +80,21 @@ int64_t stride(const Tensor& self, Dimname dim) {
}
bool cudnn_is_acceptable(const TensorBase& self) {
if (!globalContext().userEnabledCuDNN()) return false;
if (!self.is_cuda()) return false;
if (!globalContext().userEnabledCuDNN())
return false;
if (!self.is_cuda())
return false;
auto st = self.scalar_type();
if (!(st == kDouble || st == kFloat || st == kHalf)) return false;
if (!detail::getCUDAHooks().compiledWithCuDNN()) return false;
if (!(st == kDouble || st == kFloat || st == kHalf))
return false;
if (!detail::getCUDAHooks().compiledWithCuDNN())
return false;
// cuDNN functions like grid_sampler returns CUDNN_STATUS_BAD_PARAM on empty
// tensors. Maybe some cuDNN functions actually support empty tensors, but
// native/THNN kernels shouldn't be much slower because the output is also
// likely empty.
if (self.sym_numel() == 0) return false;
if (self.sym_numel() == 0)
return false;
// NB: In the old Python code, there was also a test to see if the
// cuDNN library was actually dynamically linked or not. I'm not
// sure if we can actually test this.
@ -99,9 +105,10 @@ bool cudnn_is_acceptable(const Tensor& self) {
return cudnn_is_acceptable(static_cast<const TensorBase&>(self));
}
Tensor & detach_(Tensor & self) {
// this just exists to give us a hook in VariableType and an entry in Declarations.yaml
//TORCH_CHECK(false, "detach_ is not implemented for Tensor");
Tensor& detach_(Tensor& self) {
// this just exists to give us a hook in VariableType and an entry in
// Declarations.yaml
// TORCH_CHECK(false, "detach_ is not implemented for Tensor");
return self;
}
@ -117,7 +124,8 @@ Tensor contiguous(const Tensor& self, MemoryFormat memory_format) {
}
bool is_set_to(const Tensor& self, const Tensor& src) {
if (self.storage().unsafeGetStorageImpl() == src.storage().unsafeGetStorageImpl() &&
if (self.storage().unsafeGetStorageImpl() ==
src.storage().unsafeGetStorageImpl() &&
self.storage_offset() == src.storage_offset() &&
self.dim() == src.dim()) {
for (const auto d : c10::irange(self.dim())) {

File diff suppressed because it is too large

View File

@ -1,7 +1,7 @@
#pragma once
#include <ATen/core/IListRef.h>
#include <ATen/core/Tensor.h>
#include <c10/util/irange.h>
#include <ATen/core/IListRef.h>
namespace at::native {
@ -11,45 +11,74 @@ inline bool cat_should_skip_tensor(const Tensor& t) {
return t.sym_numel() == 0 && t.dim() == 1;
}
// Check to see if the shape of tensors is compatible
// for being concatenated along a given dimension.
inline void check_cat_shape_except_dim(const Tensor & first, const Tensor & second, int64_t dimension, int64_t index) {
int64_t first_dims = first.dim();
int64_t second_dims = second.dim();
TORCH_CHECK(first_dims == second_dims, "Tensors must have same number of dimensions: got ",
first_dims, " and ", second_dims);
for (const auto dim : c10::irange(first_dims)) {
if (dim == dimension) {
continue;
}
int64_t first_dim_size = first.sizes()[dim];
int64_t second_dim_size = second.sizes()[dim];
TORCH_CHECK(first_dim_size == second_dim_size, "Sizes of tensors must match except in dimension ",
dimension, ". Expected size ", static_cast<long long>(first_dim_size), " but got size ", static_cast<long long>(second_dim_size), " for tensor number ", index, " in the list.");
}
}
// Check to see if the shape of tensors is compatible
// for being concatenated along a given dimension.
inline void check_cat_shape_except_dim(
const Tensor& first,
const Tensor& second,
int64_t dimension,
int64_t index) {
int64_t first_dims = first.dim();
int64_t second_dims = second.dim();
TORCH_CHECK(
first_dims == second_dims,
"Tensors must have same number of dimensions: got ",
first_dims,
" and ",
second_dims);
for (const auto dim : c10::irange(first_dims)) {
if (dim == dimension) {
continue;
}
int64_t first_dim_size = first.sizes()[dim];
int64_t second_dim_size = second.sizes()[dim];
TORCH_CHECK(
first_dim_size == second_dim_size,
"Sizes of tensors must match except in dimension ",
dimension,
". Expected size ",
static_cast<long long>(first_dim_size),
" but got size ",
static_cast<long long>(second_dim_size),
" for tensor number ",
index,
" in the list.");
}
}
inline void check_cat_no_zero_dim(const MaterializedITensorListRef& tensors) {
[[maybe_unused]] int64_t i = 0;
for(const Tensor& t : tensors) {
TORCH_CHECK(t.dim() > 0,
"zero-dimensional tensor (at position ", i, ") cannot be concatenated");
for (const Tensor& t : tensors) {
TORCH_CHECK(
t.dim() > 0,
"zero-dimensional tensor (at position ",
i,
") cannot be concatenated");
i++;
}
}
inline int64_t get_num_splits(const Tensor& self, int64_t split_size, int64_t dim) {
inline int64_t get_num_splits(
const Tensor& self,
int64_t split_size,
int64_t dim) {
TORCH_CHECK(self.dim() != 0, "split expects at least a 1-dimensional tensor");
TORCH_CHECK(split_size >= 0, "split expects split_size be non-negative, but got split_size=", split_size);
TORCH_CHECK(
split_size >= 0,
"split expects split_size be non-negative, but got split_size=",
split_size);
int64_t dim_size = self.size(dim);
TORCH_CHECK(split_size > 0 || dim_size == 0,
"split_size can only be 0 if dimension size is 0, "
"but got dimension size of ", dim_size);
TORCH_CHECK(
split_size > 0 || dim_size == 0,
"split_size can only be 0 if dimension size is 0, "
"but got dimension size of ",
dim_size);
// if split_size is 0 and dimension size is 0, there is 1 split.
int64_t num_splits = 1;
if (split_size != 0) {
// ensuring num_splits is at least 1 makes consistent the case where split_size > dim_size
// (returns a single split). We might want to error here, but keep it for BC.
// ensuring num_splits is at least 1 makes consistent the case where
// split_size > dim_size (returns a single split). We might want to error
// here, but keep it for BC.
num_splits = std::max<int64_t>((dim_size + split_size - 1) / split_size, 1);
}
return num_splits;
@ -58,7 +87,7 @@ inline int64_t get_num_splits(const Tensor& self, int64_t split_size, int64_t di
inline bool have_same_ndims(TensorList tensors) {
auto ndim = tensors[0].dim();
for (const auto tensor_idx : c10::irange(tensors.size())) {
if(tensors[tensor_idx].dim() != ndim) {
if (tensors[tensor_idx].dim() != ndim) {
return false;
}
}
@ -67,35 +96,46 @@ inline bool have_same_ndims(TensorList tensors) {
inline void leading_dimension_matches(TensorList tensors, int64_t dim) {
auto tensor_zero_size = tensors[0].sizes();
std::vector<c10::SymInt> leading_dim_sizes(tensor_zero_size.begin(), tensor_zero_size.begin() + dim);
std::vector<c10::SymInt> leading_dim_sizes(
tensor_zero_size.begin(), tensor_zero_size.begin() + dim);
for (const auto i : c10::irange(tensors.size())) {
at::Tensor tensor = tensors[i];
for(const auto j : c10::irange(dim)) {
for (const auto j : c10::irange(dim)) {
TORCH_CHECK(
tensor.size(j) == leading_dim_sizes[j],
"_chunk_cat expects same sizes of 0,...,dim-1 dimensions for all tensors"
);
tensor.size(j) == leading_dim_sizes[j],
"_chunk_cat expects same sizes of 0,...,dim-1 dimensions for all tensors");
}
}
}
inline int64_t preprocess_chunk_cat_inputs(TensorList tensors, int64_t dim, int64_t num_chunks) {
inline int64_t preprocess_chunk_cat_inputs(
TensorList tensors,
int64_t dim,
int64_t num_chunks) {
TORCH_CHECK(num_chunks >= 1, "_chunk_cat expects positive num_chunks");
TORCH_CHECK(!tensors.empty(),
"_chunk_cat expects a non-empty input tensor list");
TORCH_CHECK(
!tensors.empty(), "_chunk_cat expects a non-empty input tensor list");
auto expected_dtype = tensors[0].dtype();
auto expected_device = tensors[0].device();
for(const auto i : c10::irange(tensors.size())) {
for (const auto i : c10::irange(tensors.size())) {
TORCH_CHECK(tensors[i].numel() > 0, "_chunk_cat expects non-empty tensor");
TORCH_CHECK(tensors[i].dtype() == expected_dtype, "_chunk_cat expects all input tensors with the same dtype");
TORCH_CHECK(tensors[i].device() == expected_device, "_chunk_cat expects all inputs tensors on the same device");
TORCH_CHECK(
tensors[i].dtype() == expected_dtype,
"_chunk_cat expects all input tensors with the same dtype");
TORCH_CHECK(
tensors[i].device() == expected_device,
"_chunk_cat expects all inputs tensors on the same device");
}
if (have_same_ndims(tensors)) {
dim = maybe_wrap_dim(dim, tensors[0].dim());
} else {
TORCH_CHECK(dim >= 0, "_chunk_cat expects non-negative dim when input tensors have different ndims")
for(const auto i : c10::irange(tensors.size())) {
TORCH_CHECK(dim < tensors[i].ndimension(), "_chunk_cat expects dim < ndim for all input tensors");
TORCH_CHECK(
dim >= 0,
"_chunk_cat expects non-negative dim when input tensors have different ndims")
for (const auto i : c10::irange(tensors.size())) {
TORCH_CHECK(
dim < tensors[i].ndimension(),
"_chunk_cat expects dim < ndim for all input tensors");
}
}
leading_dimension_matches(tensors, dim);

View File

@ -1,6 +1,6 @@
#define TORCH_ASSERT_ONLY_METHOD_OPERATORS
#include <ATen/native/IndexKernel.h> // for flip_stub
#include <ATen/native/TensorTransformations.h>
#include <ATen/native/IndexKernel.h> // for flip_stub
#include <ATen/Parallel.h>
#include <ATen/TensorIterator.h>
@ -44,28 +44,30 @@ Tensor flip(const Tensor& self, IntArrayRef dims) {
int n = 0;
auto strides = DimVector(self.strides());
for (const auto i : c10::irange(total_dims)) {
if(flip_dims_b[i] && self.size(i) > 1 && self.stride(i) != 0) {
if (flip_dims_b[i] && self.size(i) > 1 && self.stride(i) != 0) {
n++;
strides[i] = 0;
}
}
// Nothing to do, we return fast
if (n == 0 || self.numel() <=1) {
if (n == 0 || self.numel() <= 1) {
out_tensor.copy_(self);
return out_tensor;
}
//create dummy output with 0 strides at flipped dimension, to prevent tensorIterator from coalescing flipped dims
// create dummy output with 0 strides at flipped dimension, to prevent
// tensorIterator from coalescing flipped dims
const auto restrided_self = self.as_strided(self.sizes(), strides);
auto iter = TensorIteratorConfig()
.set_check_mem_overlap(false)
.check_all_same_dtype(false)
.declare_static_dtype_and_device(self.scalar_type(), self.device())
.add_output(out_tensor)
.add_const_input(self)
.add_const_input(restrided_self)
.build();
auto iter =
TensorIteratorConfig()
.set_check_mem_overlap(false)
.check_all_same_dtype(false)
.declare_static_dtype_and_device(self.scalar_type(), self.device())
.add_output(out_tensor)
.add_const_input(self)
.add_const_input(restrided_self)
.build();
auto* data = reinterpret_cast<char*>(iter.data_ptr(0));
const auto sizes = iter.shape();
@ -83,11 +85,12 @@ Tensor flip(const Tensor& self, IntArrayRef dims) {
// - We iterate in the opposite direction (invert the strides)
for (const auto i : c10::irange(iter.ndim())) {
// We know that a dimension has a zero stride and self[i] does not, as we defined above
// Note that it may be the case that strides_dummy[i] = 0 not because we set it, but because
// strides_self[i] == 0. We do not want to do anything there
// We know that a dimension has a zero stride and self[i] does not, as we
// defined above. Note that it may be the case that strides_dummy[i] = 0 not
// because we set it, but because strides_self[i] == 0. We do not want to do
// anything there
if (strides_dummy[i] == 0 && strides_self[i] != 0) {
data += strides_bytes[i] * (sizes[i]-1);
data += strides_bytes[i] * (sizes[i] - 1);
strides_bytes[i] *= -1;
}
}
@ -99,7 +102,10 @@ Tensor flip(const Tensor& self, IntArrayRef dims) {
return out_tensor;
}
Tensor roll(const Tensor& self, IntArrayRef shifts, IntArrayRef dims) { // Used by CPU and MPS dispatch.
Tensor roll(
const Tensor& self,
IntArrayRef shifts,
IntArrayRef dims) { // Used by CPU and MPS dispatch.
if (dims.size() != 1 || shifts.size() != 1) {
return roll_common(self, shifts, dims);
}
@ -115,7 +121,7 @@ Tensor roll(const Tensor& self, IntArrayRef shifts, IntArrayRef dims) { // Used
if (start < 0) {
start = start + size;
}
auto t0 = self.narrow(dim, start, size-start);
auto t0 = self.narrow(dim, start, size - start);
auto t1 = self.narrow(dim, 0, start);
return at::cat({std::move(t0), std::move(t1)}, dim);
}
@ -123,27 +129,38 @@ Tensor roll(const Tensor& self, IntArrayRef shifts, IntArrayRef dims) { // Used
Tensor rot90(const Tensor& self, int64_t k, IntArrayRef dims) {
const int64_t total_dims = self.dim(), total_rot_dims = dims.size();
TORCH_CHECK(total_rot_dims == 2,
"expected total rotation dims == 2, but got dims = ", total_rot_dims);
TORCH_CHECK(
total_rot_dims == 2,
"expected total rotation dims == 2, but got dims = ",
total_rot_dims);
TORCH_CHECK(total_dims >= 2,
"expected total dims >= 2, but got total dims = ", total_dims);
TORCH_CHECK(
total_dims >= 2,
"expected total dims >= 2, but got total dims = ",
total_dims);
TORCH_CHECK(dims[0] != dims[1] && std::abs(dims[0] - dims[1]) != total_dims,
"expected rotation dims to be different, but got dim0 = ", dims[0],
" and dim1 = ", dims[1]);
TORCH_CHECK(
dims[0] != dims[1] && std::abs(dims[0] - dims[1]) != total_dims,
"expected rotation dims to be different, but got dim0 = ",
dims[0],
" and dim1 = ",
dims[1]);
// check range of dims
TORCH_CHECK(dims[0] < total_dims && dims[0] >= -total_dims,
"Rotation dim0 out of range, dim0 = ", dims[0]);
TORCH_CHECK(
dims[0] < total_dims && dims[0] >= -total_dims,
"Rotation dim0 out of range, dim0 = ",
dims[0]);
TORCH_CHECK(dims[1] < total_dims && dims[1] >= -total_dims,
"Rotation dim1 out of range, dim1 = ", dims[1]);
TORCH_CHECK(
dims[1] < total_dims && dims[1] >= -total_dims,
"Rotation dim1 out of range, dim1 = ",
dims[1]);
// handle modulo with negative k
k = (4 + (k % 4)) % 4;
switch(k) {
switch (k) {
case 1:
return self.flip({dims[1]}).transpose_(dims[0], dims[1]);
case 2:
@ -181,7 +198,8 @@ std::vector<Tensor> atleast_1d(TensorList tensors) {
auto transform_lambda = [](const Tensor& input) -> Tensor {
return at::native::atleast_1d(input);
};
std::transform(tensors.cbegin(), tensors.cend(), result.begin(), transform_lambda);
std::transform(
tensors.cbegin(), tensors.cend(), result.begin(), transform_lambda);
return result;
}
@ -202,7 +220,8 @@ std::vector<Tensor> atleast_2d(TensorList tensors) {
auto transform_lambda = [](const Tensor& input) -> Tensor {
return at::native::atleast_2d(input);
};
std::transform(tensors.cbegin(), tensors.cend(), result.begin(), transform_lambda);
std::transform(
tensors.cbegin(), tensors.cend(), result.begin(), transform_lambda);
return result;
}
@ -226,7 +245,8 @@ std::vector<Tensor> atleast_3d(TensorList tensors) {
auto transform_lambda = [](const Tensor& input) -> Tensor {
return at::native::atleast_3d(input);
};
std::transform(tensors.cbegin(), tensors.cend(), result.begin(), transform_lambda);
std::transform(
tensors.cbegin(), tensors.cend(), result.begin(), transform_lambda);
return result;
}

View File

@ -10,16 +10,21 @@
namespace at::native {
static inline Tensor roll_common(const Tensor& self, IntArrayRef shifts, IntArrayRef dims) {
static inline Tensor roll_common(
const Tensor& self,
IntArrayRef shifts,
IntArrayRef dims) {
TORCH_CHECK(!shifts.empty(), "`shifts` required");
if (dims.empty() && shifts.size() == 1) {
auto flattened = self.contiguous().view(self.numel());
return roll(flattened, shifts[0], 0).view(self.sizes());
}
TORCH_CHECK(
shifts.size() == dims.size(),
"shifts and dimensions must align. shifts: ", shifts.size(), ", dims:", dims.size()
);
shifts.size() == dims.size(),
"shifts and dimensions must align. shifts: ",
shifts.size(),
", dims:",
dims.size());
AT_ASSERT(dims.size() > 1);
auto tail_shifts = shifts.slice(1);
auto tail_dims = dims.slice(1);
@ -27,4 +32,4 @@ static inline Tensor roll_common(const Tensor& self, IntArrayRef shifts, IntArra
return at::roll(first_dim_rolled, tail_shifts, tail_dims);
}
} // namespace at::native
} // namespace at::native

View File

@ -63,13 +63,9 @@ void binary_cross_entropy_backward_out_kernel(Tensor& grad_input, const Tensor&
namespace at::native {
Tensor binary_cross_entropy_cuda(const Tensor& input, const Tensor& target, const std::optional<Tensor>& weight_opt, int64_t reduction) {
// See [Note: hacky wrapper removal for optional tensor]
c10::MaybeOwned<Tensor> weight_maybe_owned = at::borrow_from_optional_tensor(weight_opt);
const Tensor& weight = *weight_maybe_owned;
Tensor loss = at::empty_like(input);
return at::native::binary_cross_entropy_out_cuda(
input, target, weight, reduction, loss);
input, target, weight_opt, reduction, loss);
}
Tensor& binary_cross_entropy_out_cuda(const Tensor& input, const Tensor& target, const std::optional<Tensor>& weight_opt, int64_t reduction, Tensor& loss) {
@ -122,13 +118,9 @@ Tensor& binary_cross_entropy_out_cuda(const Tensor& input, const Tensor& target,
}
Tensor binary_cross_entropy_backward_cuda(const Tensor& grad, const Tensor& input, const Tensor& target, const std::optional<Tensor>& weight_opt, int64_t reduction) {
// See [Note: hacky wrapper removal for optional tensor]
c10::MaybeOwned<Tensor> weight_maybe_owned = at::borrow_from_optional_tensor(weight_opt);
const Tensor& weight = *weight_maybe_owned;
Tensor grad_input = at::empty_like(input);
return at::native::binary_cross_entropy_backward_out_cuda(
grad, input, target, weight, reduction, grad_input);
grad, input, target, weight_opt, reduction, grad_input);
}
Tensor& binary_cross_entropy_backward_out_cuda(const Tensor& grad, const Tensor& input, const Tensor& target, const std::optional<Tensor>& weight_opt, int64_t reduction, Tensor& grad_input) {

View File

@ -75,8 +75,6 @@ struct ReduceConfig {
static constexpr int BLOCK_Y = 1;
static constexpr int CTA = 2;
static constexpr int input_vec_size = 4;
ReduceConfig(int element_size_bytes, int num_outputs, int num_inputs)
: element_size_bytes(element_size_bytes)
, num_inputs(num_inputs)
@ -286,7 +284,6 @@ struct ReduceJitOp {
//TODO for now arg_t is always opmath_t of the input, later we'll need to change it
using arg_t = at::opmath_type<scalar_t>;
static constexpr int input_vec_size = ReduceConfig::input_vec_size;
//TODO - ReduceJitOp will probably need to be changed for reductions that need full functor,
//not just wrapper
arg_t ident;
@ -336,7 +333,7 @@ struct ReduceJitOp {
}
};
template <typename scalar_t, typename ops_t, typename index_t, typename out_scalar_t=scalar_t, int vt0=4>
template <typename scalar_t, typename ops_t, typename index_t, typename out_scalar_t=scalar_t, int vt0=4, int input_vec_size=vt0>
struct ReduceOp {
using traits = function_traits<decltype(&ops_t::reduce)>;
using arg_t = typename std::decay<typename traits::template arg<0>::type>::type;
@ -348,8 +345,6 @@ struct ReduceOp {
std::is_convertible_v<arg_t, out_scalar_t>
&& std::is_convertible_v<out_scalar_t, arg_t>;
static constexpr int input_vec_size = ReduceConfig::input_vec_size;
ops_t ops;
arg_t ident;
ReduceConfig config;
@ -996,7 +991,7 @@ int get_output_vec_size(const TensorIterator &iter) {
return vec_size;
}
template<typename arg_t, typename scalar_t, int vt0>
template<typename arg_t, typename scalar_t, int vt0, int input_vec_size=vt0>
ReduceConfig setReduceConfig(const TensorIterator& iter){
// Start by assuming that each thread handles a single output and all
// the inputs for that output.
@ -1063,12 +1058,16 @@ ReduceConfig setReduceConfig(const TensorIterator& iter){
// threads with different threadIdx.x are independent and will produce results for different outputs.
// In such case, values in each loaded vector always correspond to different outputs.
if (fastest_moving_stride == sizeof(scalar_t)) {
if (reduction_on_fastest_striding_dimension && dim0 > 128 && iter.num_reduce_dims() == 1 && vt0 >= ReduceConfig::input_vec_size) {
#ifdef USE_ROCM
if (reduction_on_fastest_striding_dimension && dim0 > 128 && iter.num_reduce_dims() == 1) {
#else
if (reduction_on_fastest_striding_dimension && dim0 > 128 && iter.num_reduce_dims() == 1 && vt0 >= input_vec_size) {
#endif
// Case 1: "vectorize along input"
// Note that if vt0 < ReduceConfig::vec_size, then this means the register pressure could be high, in such case,
// we should avoid vectorization.
config.vectorize_input = true;
dim0 /= config.input_vec_size;
dim0 /= input_vec_size;
} else if (!reduction_on_fastest_striding_dimension) {
// Case 2: "vectorize along output"
config.output_vec_size = get_output_vec_size<scalar_t>(iter);
@ -1123,7 +1122,7 @@ ReduceConfig setReduceConfig(const TensorIterator& iter){
// Control the number of threadblocks by adjusting the maximum number of
// threads per multi-processor. These numbers better reflect the maximum
// theoretical achievable threads per MP for the reduction operation.
if (iter.ndim() == 1)
if (iter.ndim() == 1 || iter.ndim() == 3)
max_threads_per_mp = 512;
if (iter.ndim() == 2)
max_threads_per_mp = 256;
@ -1169,7 +1168,7 @@ ReduceConfig setReduceConfig(const TensorIterator& iter){
return config;
};
template <typename scalar_t, typename out_scalar_t, int vt0=4, typename ops_t, typename ident_t=double>
template <typename scalar_t, typename out_scalar_t, int vt0=4, int input_vec_size=vt0, typename ops_t, typename ident_t=double>
inline void gpu_reduce_kernel(TensorIterator& iter, const ops_t& ops, ident_t ident=0,
AccumulationBuffer* acc_buf_ptr=nullptr, int64_t base_idx=0) {
AT_ASSERT(iter.numel() > 0 && iter.ntensors() - iter.noutputs() == 1 && iter.noutputs() >= 1);
@ -1221,7 +1220,7 @@ inline void gpu_reduce_kernel(TensorIterator& iter, const ops_t& ops, ident_t id
for (auto& sub_iter : iter.with_32bit_indexing()) {
int64_t sub_iter_base_idx = sub_iter.view_offsets()[0];
gpu_reduce_kernel<scalar_t, out_scalar_t, vt0>(sub_iter, ops, ident,
gpu_reduce_kernel<scalar_t, out_scalar_t, vt0, input_vec_size>(sub_iter, ops, ident,
acc_buf_ptr, sub_iter_base_idx);
}
return;
@ -1238,7 +1237,7 @@ inline void gpu_reduce_kernel(TensorIterator& iter, const ops_t& ops, ident_t id
}
char* acc_data = acc_buf_ptr->get_acc_slice(out_data);
ReduceConfig config = setReduceConfig<arg_t, scalar_t, vt0>(iter);
ReduceConfig config = setReduceConfig<arg_t, scalar_t, vt0, input_vec_size>(iter);
at::DataPtr buffer;
at::DataPtr semaphores;
if (config.should_global_reduce()) {
@ -1253,7 +1252,7 @@ inline void gpu_reduce_kernel(TensorIterator& iter, const ops_t& ops, ident_t id
AT_ASSERT(can_use_32bit_indexing);
auto output_calc = make_output_calculator<uint32_t>(iter);
auto input_calc = make_input_calculator<uint32_t>(iter);
auto reduce = ReduceOp<scalar_t, ops_t, uint32_t, out_scalar_t, vt0>(
auto reduce = ReduceOp<scalar_t, ops_t, uint32_t, out_scalar_t, vt0, input_vec_size>(
ops,
config,
input_calc,

View File

@ -13,6 +13,20 @@ namespace at::native {
template <typename scalar_t, typename acc_t = scalar_t, typename out_t = scalar_t>
struct sum_functor {
void operator()(TensorIterator& iter) {
#ifdef USE_ROCM
// Half and BFloat16 can be packed in groups of up to 8 elements and
// can use *_DWORDX4 instructions to achieve that.
const bool is_16_bits =
( (std::is_same<at::Half, scalar_t>::value) ||
(std::is_same<at::BFloat16, scalar_t>::value) );
if (is_16_bits) {
gpu_reduce_kernel<scalar_t, out_t, /*vt0=*/4, /*input_vec_size=*/8>(
iter, func_wrapper<out_t>([] GPU_LAMBDA(acc_t a, acc_t b) -> acc_t {
return a + b;
}));
return;
}
#endif
gpu_reduce_kernel<scalar_t, out_t>(
iter, func_wrapper<out_t>([] GPU_LAMBDA(acc_t a, acc_t b) -> acc_t {
return a + b;

View File

@ -190,13 +190,7 @@ Tensor layer_norm_symint(
c10::SymIntArrayRef normalized_shape, const std::optional<Tensor>& weight_opt /* optional */, const std::optional<Tensor>& bias_opt /* optional */,
double eps,
bool /* cudnn_enable, deprecated */) {
// See [Note: hacky wrapper removal for optional tensor]
c10::MaybeOwned<Tensor> weight_maybe_owned = at::borrow_from_optional_tensor(weight_opt);
const Tensor& weight = *weight_maybe_owned;
c10::MaybeOwned<Tensor> bias_maybe_owned = at::borrow_from_optional_tensor(bias_opt);
const Tensor& bias = *bias_maybe_owned;
return std::get<0>(at::native_layer_norm_symint(input, normalized_shape, weight, bias, eps));
return std::get<0>(at::native_layer_norm_symint(input, normalized_shape, weight_opt, bias_opt, eps));
}
DEFINE_DISPATCH(LayerNormKernel);

View File

@ -54,7 +54,7 @@ at::Tensor quantized_convolution(
at::Tensor weight,
at::Tensor weight_scales,
at::Tensor weight_zero_points,
c10::optional<at::Tensor> bias,
std::optional<at::Tensor> bias,
torch::List<int64_t> stride,
torch::List<int64_t> padding,
torch::List<int64_t> dilation,
@ -63,15 +63,15 @@ at::Tensor quantized_convolution(
at::Tensor output,
double inv_output_scale,
int64_t output_zero_point,
c10::optional<at::Tensor> accum,
std::optional<at::Tensor> accum,
double accum_scale,
int64_t accum_zero_point,
c10::optional<c10::ScalarType> output_dtype,
c10::optional<std::string_view> binary_attr,
c10::optional<at::Scalar> binary_alpha,
c10::optional<std::string_view> unary_attr,
torch::List<c10::optional<at::Scalar>> unary_scalars,
c10::optional<std::string_view> unary_algorithm) {
std::optional<c10::ScalarType> output_dtype,
std::optional<std::string_view> binary_attr,
std::optional<at::Scalar> binary_alpha,
std::optional<std::string_view> unary_attr,
torch::List<std::optional<at::Scalar>> unary_scalars,
std::optional<std::string_view> unary_algorithm) {
Attr attr =
Attr(/*q_scale=*/1.0 / inv_output_scale, /*zp=*/output_zero_point);

View File

@ -114,7 +114,7 @@ at::Tensor quantized_convolution(
at::Tensor weight,
at::Tensor weight_scales,
at::Tensor weight_zero_points,
c10::optional<at::Tensor> bias,
std::optional<at::Tensor> bias,
torch::List<int64_t> stride,
torch::List<int64_t> padding,
torch::List<int64_t> dilation,
@ -123,14 +123,14 @@ at::Tensor quantized_convolution(
at::Tensor output,
double inv_output_scale,
int64_t output_zero_point,
c10::optional<at::Tensor> accum,
std::optional<at::Tensor> accum,
double accum_scale,
int64_t accum_zero_point,
c10::optional<c10::ScalarType> output_dtype,
c10::optional<std::string_view> binary_attr,
c10::optional<at::Scalar> binary_alpha,
c10::optional<std::string_view> unary_attr,
torch::List<c10::optional<at::Scalar>> unary_scalars,
c10::optional<std::string_view> unary_algorithm);
std::optional<c10::ScalarType> output_dtype,
std::optional<std::string_view> binary_attr,
std::optional<at::Scalar> binary_alpha,
std::optional<std::string_view> unary_attr,
torch::List<std::optional<at::Scalar>> unary_scalars,
std::optional<std::string_view> unary_algorithm);
} // namespace at::native::onednn

View File

@ -31,17 +31,17 @@ class QConvoneDNNXPU final {
at::Tensor weight,
at::Tensor weight_scales,
at::Tensor weight_zero_points,
c10::optional<at::Tensor> bias,
std::optional<at::Tensor> bias,
torch::List<int64_t> stride,
torch::List<int64_t> padding,
torch::List<int64_t> dilation,
int64_t groups,
double inv_output_scale,
int64_t output_zero_point,
c10::optional<c10::ScalarType> output_dtype,
std::optional<c10::ScalarType> output_dtype,
std::string_view attr,
torch::List<c10::optional<at::Scalar>> scalars,
c10::optional<std::string_view> algorithm) {
torch::List<std::optional<at::Scalar>> scalars,
std::optional<std::string_view> algorithm) {
if (act.dim() == 3 || act.dim() == 5) {
TORCH_CHECK(
attr == "none",

View File

@ -372,7 +372,6 @@ static Tensor mps_convolution_backward_input(IntArrayRef input_size,
using namespace at::native::mps;
using namespace mps;
bool is3DConv = grad_output_t.dim() == 5;
if (!is_macos_13_or_newer(MacOSVersion::MACOS_VER_15_1_PLUS)) {
// On macOS < 15.1, MPS convolution kernel does not support output channels > 2^16
for (auto elem : grad_output_t.sizes()) {
@ -417,36 +416,29 @@ static Tensor mps_convolution_backward_input(IntArrayRef input_size,
assert(0 && "Check should have been done earlier\n");
}
MPSShape* gradOutputShape = getMPSShape(grad_output_t, memory_format);
MPSShape* mps_input_shape = getMPSShape(input_size);
NSString* ns_shape_key = [[gradOutputShape valueForKey:@"description"] componentsJoinedByString:@","];
string key;
if (is3DConv) {
key = "mps_3d_convolution_backward_input:" + std::to_string(stride[0]) + ":" + std::to_string(stride[1]) + ":" +
":" + std::to_string(stride[2]) + std::to_string(dilation[0]) + ":" + std::to_string(dilation[1]) + ":" +
std::to_string(dilation[2]) + ":" + std::to_string(padding[0]) + ":" + std::to_string(padding[1]) + ":" +
std::to_string(padding[2]) + ":" + std::to_string(groups) + ":" + mem_format_key +
getTensorsStringKey({grad_output_t, weight_t}) + ":" + string([ns_shape_key UTF8String]);
getTensorsStringKey({grad_output_t, weight_t});
} else {
key = "mps_convolution_backward_input:" + std::to_string(stride[0]) + ":" + std::to_string(stride[1]) + ":" +
std::to_string(dilation[0]) + ":" + std::to_string(dilation[1]) + ":" + std::to_string(padding[0]) + ":" +
std::to_string(padding[1]) + ":" + std::to_string(groups) + ":" + mem_format_key +
getTensorsStringKey({grad_output_t, weight_t}) + ":" + string([ns_shape_key UTF8String]);
getTensorsStringKey({grad_output_t, weight_t});
}
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
MPSGraphTensor* gradOutputTensor =
mpsGraphRankedPlaceHolder(mpsGraph, getMPSScalarType(grad_output_t), gradOutputShape);
MPSGraphTensor* weightTensor = mpsGraphRankedPlaceHolder(mpsGraph, weight_t);
auto gradOutputTensor = mpsGraphRankedPlaceHolder(mpsGraph, grad_output_t);
auto weightTensor = mpsGraphRankedPlaceHolder(mpsGraph, weight_t);
MPSGraphTensor* gradOutputTensorTranspose = gradOutputTensor;
if (is_channels_last) {
gradOutputTensorTranspose = mps::convertNHWCtoNCHW(mpsGraph, gradOutputTensorTranspose);
}
MPSGraphTensor* gradInputTensor;
MPSShape* weightOutputShape = mps::getMPSShape(weight_t);
// Depthwise conv is input feature channels = groups. So I in OIHW has to be 1.
bool isDepthwiseConv = ((groups > 1 && (weightOutputShape[1].intValue == 1)) && gradOutputShape.count >= 4 &&
bool isDepthwiseConv = ((groups > 1 && (weightOutputShape[1].intValue == 1)) && grad_output_t.ndimension() >= 4 &&
weightOutputShape.count >= 4 && !is_channels_last);
if (is3DConv) {
@ -462,7 +454,7 @@ static Tensor mps_convolution_backward_input(IntArrayRef input_size,
padding[1],
padding[0],
groups);
gradInputTensor = [mpsGraph convolution3DDataGradientWithIncomingGradientTensor:gradOutputTensorTranspose
gradInputTensor = [mpsGraph convolution3DDataGradientWithIncomingGradientTensor:gradOutputTensor
weightsTensor:weightTensor
outputShape:mps_input_shape
forwardConvolutionDescriptor:conv3dDescriptor_
@ -484,7 +476,7 @@ static Tensor mps_convolution_backward_input(IntArrayRef input_size,
withDimension:-4
name:nil];
gradInputTensor =
[mpsGraph depthwiseConvolution3DDataGradientWithIncomingGradientTensor:gradOutputTensorTranspose
[mpsGraph depthwiseConvolution3DDataGradientWithIncomingGradientTensor:gradOutputTensor
weightsTensor:weightTransposeTensor
outputShape:mps_input_shape
descriptor:depthWiseConv3dDescriptor_
@ -501,7 +493,7 @@ static Tensor mps_convolution_backward_input(IntArrayRef input_size,
at::MemoryFormat::Contiguous,
groups);
gradInputTensor = [mpsGraph convolution2DDataGradientWithIncomingGradientTensor:gradOutputTensorTranspose
gradInputTensor = [mpsGraph convolution2DDataGradientWithIncomingGradientTensor:gradOutputTensor
weightsTensor:weightTensor
outputShape:mps_input_shape
forwardConvolutionDescriptor:conv2dDescriptor_
@ -513,7 +505,7 @@ static Tensor mps_convolution_backward_input(IntArrayRef input_size,
newCachedGraph->gradInputTensor_ = gradInputTensor;
});
auto gradOutputPlaceholder = Placeholder(cachedGraph->gradOutputTensor_, grad_output_t, gradOutputShape);
auto gradOutputPlaceholder = Placeholder(cachedGraph->gradOutputTensor_, grad_output_t);
auto weightsPlaceholder = Placeholder(cachedGraph->weightTensor_, weight_t);
auto outputPlaceholder = Placeholder(cachedGraph->gradInputTensor_, *grad_input);

View File

@ -385,19 +385,15 @@ Tensor quantized_batch_norm(
double eps,
double output_scale,
int64_t output_zero_point) {
// See [Note: hacky wrapper removal for optional tensor]
c10::MaybeOwned<Tensor> weight_maybe_owned = at::borrow_from_optional_tensor(weight_opt);
const Tensor& weight = *weight_maybe_owned;
const Tensor& bias = bias_opt.value_or(Tensor());
Tensor qy;
// TODO: this should arguably support 3d as well
qy = q_batch_norm2d_impl<false>(
return q_batch_norm_impl<false>(
qx,
weight.defined() ? std::make_optional(weight) : std::nullopt,
bias.defined() ? std::make_optional(bias) : std::nullopt,
mean, var, eps, output_scale, output_zero_point);
return qy;
weight_opt,
bias_opt,
mean,
var,
eps,
output_scale,
output_zero_point);
}
TORCH_LIBRARY_IMPL(quantized, QuantizedCPU, m) {

View File

@ -931,8 +931,8 @@ static at::Tensor linear_int8_with_onednn_weight(
std::string_view& unary_post_op_algorithm) {
using ideep::tensor;
const int64_t dim = input.dim();
TORCH_CHECK(input.scalar_type() == c10::ScalarType::Byte,
"qlinear with mkldnn tensor: data type of input should be uint8 (unsigned char).");
TORCH_CHECK(input.scalar_type() == c10::ScalarType::Byte || input.scalar_type() == c10::ScalarType::Char,
"qlinear with mkldnn tensor: data type of input should be uint8 or int8 (unsigned char or char).");
TORCH_CHECK(onednn_weight.scalar_type() == c10::ScalarType::Char,
"qlinear with mkldnn tensor: data type of weight should be int8 (char).");
TORCH_CHECK(
@ -1021,7 +1021,8 @@ static at::Tensor linear_int8_with_onednn_weight(
empty_tensor;
// Create onednn primitive
auto src_desc = tensor::desc(src_dims, ideep::data_type::u8, ideep::format_tag::any);
auto src_dtype = input.scalar_type() == c10::kByte ? ideep::data_type::u8 : ideep::data_type::s8;
auto src_desc = tensor::desc(src_dims, src_dtype, ideep::format_tag::any);
auto weights_desc = packed_weight.get_desc();
auto dst_dtype = dst.get_data_type();
auto dst_desc = tensor::desc(dst_dims, dst_dtype, ideep::format_tag::any);
@ -1118,12 +1119,14 @@ namespace at::native {
torch::List<std::optional<at::Scalar>> post_op_args,
std::string_view post_op_algorithm) {
#if AT_MKLDNN_ENABLED()
TORCH_CHECK(act_scale.numel() == 1 && act_zero_point.numel() == 1,
"onednn int8 linear: act scale/zp size should be 1");
// act_zero_point.numel() == 0 for symmetric quantization
TORCH_CHECK(act_scale.numel() == 1 && act_zero_point.numel() <= 1,
"onednn int8 linear: act scale/zp size should be 1/<=1");
static std::optional<at::Tensor> other = std::nullopt;
static const std::string_view binary_post_op = "none";
int64_t act_zp = act_zero_point.numel() == 1 ? act_zero_point.item().toLong() : 0;
return linear_int8_with_onednn_weight(
act, act_scale.item().toDouble(), act_zero_point.item().toLong(),
act, act_scale.item().toDouble(), act_zp,
onednn_weight, weight_scales, weight_zero_points,
bias, output_scale, output_zero_point, output_dtype,
other, /*other scale*/1.0, /*other zp*/0,
@ -1154,10 +1157,12 @@ namespace at::native {
torch::List<std::optional<at::Scalar>> unary_post_op_args,
std::string_view unary_post_op_algorithm) {
#if AT_MKLDNN_ENABLED()
TORCH_CHECK(act_scale.numel() == 1 && act_zero_point.numel() == 1,
"onednn int8 linear: act scale/zp size should be 1");
// act_zero_point.numel() == 0 for symmetric quantization
TORCH_CHECK(act_scale.numel() == 1 && act_zero_point.numel() <= 1,
"onednn int8 linear: act scale/zp size should be 1/<=1");
int64_t act_zp = act_zero_point.numel() == 1 ? act_zero_point.item().toLong() : 0;
return linear_int8_with_onednn_weight(
act, act_scale.item().toDouble(), act_zero_point.item().toLong(),
act, act_scale.item().toDouble(), act_zp,
onednn_weight, weight_scales, weight_zero_points,
bias, output_scale, output_zero_point, output_dtype,
other, other_scale, other_zero_point,

View File

@ -561,8 +561,8 @@ namespace {
bool expected = std::isnan(val);
CACHE_ALIGN c10::Half actual_vals[vHalf::size()];
vHalf(val).isnan().store(actual_vals);
for (int jj = 0; jj < vHalf::size(); ++jj) {
EXPECT_EQ(expected, c10::bit_cast<uint16_t>(actual_vals[jj]) != 0) << "fp16 isnan failure for bit pattern " << std::hex << ii << std::dec;
for (auto actual_val : actual_vals) {
EXPECT_EQ(expected, c10::bit_cast<uint16_t>(actual_val) != 0) << "fp16 isnan failure for bit pattern " << std::hex << ii << std::dec;
}
}
}
@ -1046,7 +1046,7 @@ namespace {
mask[idx] = (VT)0;
}
else {
int64_t hex_mask = 0xFFFFFFFFFFFFFFFF;
uint64_t hex_mask = 0xFFFFFFFFFFFFFFFF;
std::memcpy(&mask[idx], &hex_mask, sizeof(VT));
}
if (!test_blendv<vec, VT, idx+1, N>(expected_val, a, b, mask)) return false;
@ -1315,8 +1315,8 @@ namespace {
ValueGen<float> generator_sc(1.f, 15.f, seed.add(2));
for ([[maybe_unused]] const auto i : c10::irange(trials)) {
float scale = generator_sc.get();
int32_t zero_point_val = generator.get();
float scale_zp_premul = -(scale * zero_point_val);
auto zero_point_val = generator.get();
float scale_zp_premul = -(scale * static_cast<float>(zero_point_val));
vfloat vf_scale = vfloat{scale};
vfloat vf_zp = vfloat{static_cast<float>(zero_point_val)};
vfloat vf_scale_zp = vfloat{scale_zp_premul};
@ -1657,18 +1657,16 @@ namespace {
TEST(HalfConversionTest, HalfFloat) {
float f32s[100];
for (const auto i : c10::irange(100)) {
f32s[i] = i + 0.3;
f32s[i] = static_cast<float>(i + 0.3);
}
uint16_t u16;
float x;
for (const auto i : c10::irange(100)) {
#if (defined(CPU_CAPABILITY_AVX2) || defined(CPU_CAPABILITY_AVX512)) && \
!defined(__APPLE__)
u16 = at::vec::float2half_scalar(f32s[i]);
x = at::vec::half2float_scalar(u16);
uint16_t u16 = at::vec::float2half_scalar(f32s[i]);
float x = at::vec::half2float_scalar(u16);
#else
u16 = c10::detail::fp16_ieee_from_fp32_value(f32s[i]);
x = c10::detail::fp16_ieee_to_fp32_value(u16);
uint16_t u16 = c10::detail::fp16_ieee_from_fp32_value(f32s[i]);
float x = c10::detail::fp16_ieee_to_fp32_value(u16);
#endif
EXPECT_EQ(u16, c10::detail::fp16_ieee_from_fp32_value(f32s[i]))
@ -1697,7 +1695,7 @@ namespace {
VT v_pinf = static_cast<VT>(*(float *)&infBits);
values[index] = v_pinf;
auto vec_pinf = vec::loadu(values);
int negInfBits = 0xFF800000;
unsigned int negInfBits = 0xFF800000;
VT v_ninf = static_cast<VT>(*(float *)&negInfBits);
values[index] = v_ninf;
auto vec_ninf = vec::loadu(values);
@ -1779,8 +1777,8 @@ namespace {
const auto expected = static_cast<float>(val);
CACHE_ALIGN float actual_vals[vfloat::size()];
at::vec::convert<float>(vBFloat16(val)).store(actual_vals);
for (int jj = 0; jj < vfloat::size(); ++jj) {
EXPECT_EQ(c10::bit_cast<uint32_t>(expected), c10::bit_cast<uint32_t>(actual_vals[jj]))
for (auto actual_val : actual_vals) {
EXPECT_EQ(c10::bit_cast<uint32_t>(expected), c10::bit_cast<uint32_t>(actual_val))
<< "convert-to-float failure for bf16 bit pattern "
<< std::hex << ii << std::dec;
}
@ -1794,20 +1792,20 @@ namespace {
#define TEST_MASK_LOAD(dst_t, mask_t, mask_n) \
do { \
CACHE_ALIGN dst_t x[mask_n * size]; \
CACHE_ALIGN dst_t y[mask_n * size]; \
CACHE_ALIGN dst_t ref[mask_n * size]; \
auto seed = TestSeed(); \
dst_t generator_min = std::numeric_limits<dst_t>::is_signed ? dst_t(-100) : dst_t(0); \
ValueGen<dst_t> generator(generator_min, dst_t(100), seed); \
for (const auto i : c10::irange(mask_n * size)) { \
x[i] = generator.get(); \
} \
auto vec_mask = generate_vec_mask<mask_t, mask_n>(seed); \
constexpr int dst_size = at::vec::Vectorized<dst_t>::size(); \
constexpr int dst_n = mask_n * size / dst_size; \
constexpr int rnd_n = (mask_n * size + dst_size - 1) / dst_size; \
if constexpr(dst_n * dst_size >= mask_n * size) { \
CACHE_ALIGN dst_t x[mask_n * size]; \
CACHE_ALIGN dst_t y[mask_n * size]; \
CACHE_ALIGN dst_t ref[mask_n * size]; \
auto seed = TestSeed(); \
dst_t generator_min = std::numeric_limits<dst_t>::is_signed ? dst_t(-100) : dst_t(0); \
ValueGen<dst_t> generator(generator_min, dst_t(100), seed); \
for (const auto i : c10::irange(mask_n * size)) { \
x[i] = generator.get(); \
} \
auto vec_mask = generate_vec_mask<mask_t, mask_n>(seed); \
constexpr int rnd_n = (mask_n * size + dst_size - 1) / dst_size;\
auto x_vec = vec_mask.template loadu<dst_t, rnd_n>(x); \
x_vec.store(y); \
for (const auto i : c10::irange(mask_n * size)) { \

View File

@ -0,0 +1,9 @@
# Instructions on how to make a new compile time benchmark
1. Make a new benchmark file in /benchmarks/dynamo/pr_time_benchmarks/benchmarks/, e.g. https://github.com/pytorch/pytorch/blob/0b75b7ff2b8ab8f40e433a52b06a671d6377997f/benchmarks/dynamo/pr_time_benchmarks/benchmarks/add_loop.py (a minimal sketch follows this list)
2. cd into the pr_time_benchmarks directory `cd benchmarks/dynamo/pr_time_benchmarks`
3. Run `PYTHONPATH=./ python benchmarks/[YOUR_BENCHMARK].py a.txt`
4. (Optional) Flip a flag that you know will change the benchmark and run again, this time writing to `b.txt`: `PYTHONPATH=./ python benchmarks/[YOUR_BENCHMARK].py b.txt`
5. Compare `a.txt` and `b.txt` located within the `benchmarks/dynamo/pr_time_benchmarks` folder to make sure things look as you expect
6. Check in your new benchmark file and submit a new PR
7. In a few days, if your benchmark is stable, bug Laith Sakka to enable running your benchmark on all PRs. If you're a Meta employee, you can find the dashboard here: internalfb.com/intern/unidash/dashboard/pt2_diff_time_metrics
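
For step 1, a benchmark file is just a `BenchmarkBase` subclass plus a small `main()`. Below is a minimal sketch with a hypothetical category name and a trivial compiled function; the `float_args` benchmark added in this diff (next file) is a complete real example. The sketch assumes it is run from `benchmarks/dynamo/pr_time_benchmarks` with `PYTHONPATH=./` so that `benchmark_base` resolves.

```python
import sys

from benchmark_base import BenchmarkBase  # resolved via PYTHONPATH=./ (step 3)

import torch


class Benchmark(BenchmarkBase):
    def __init__(self):
        # Hypothetical category name; backend/device describe what gets compiled.
        super().__init__(category="my_benchmark", backend="inductor", device="cpu")

    def name(self):
        return self.category()

    def description(self):
        return "Minimal template: compile and run a trivial function."

    def _prepare_once(self):
        torch.manual_seed(0)

    def _prepare(self):
        torch._dynamo.reset()  # measure a cold compile on every iteration

    def _work(self):
        @torch.compile(backend="inductor")
        def f(x):
            return x + 1

        f(torch.arange(3))


def main():
    result_path = sys.argv[1]  # a.txt / b.txt from steps 3-4
    Benchmark().enable_compile_time_instruction_count().collect_all().append_results(
        result_path
    )


if __name__ == "__main__":
    main()
```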

View File

@ -0,0 +1,47 @@
import sys

from benchmark_base import BenchmarkBase

import torch
from torch._inductor.utils import fresh_inductor_cache


class Benchmark(BenchmarkBase):
    def __init__(self):
        super().__init__(
            category="float_args",
            backend="inductor",
            device="cpu",
        )

    def name(self):
        return f"{self.category()}"

    def description(self):
        return "Benchmark to measure recompilations with float arguments."

    def _prepare_once(self):
        torch.manual_seed(0)

    def _prepare(self):
        torch._dynamo.reset()

    def _work(self):
        @torch.compile(backend="inductor")
        def f(x, y):
            return x + y

        with fresh_inductor_cache():
            for i in range(8):
                f(torch.arange(3), i * 2.5)


def main():
    result_path = sys.argv[1]

    Benchmark().enable_compile_time_instruction_count().collect_all().append_results(
        result_path
    )


if __name__ == "__main__":
    main()

View File

@ -6,27 +6,27 @@ add_loop_eager_dynamic,compile_time_instruction_count,5703000000,0.025
add_loop_inductor,compile_time_instruction_count,29510000000,0.015
add_loop_inductor,compile_time_instruction_count,32220000000,0.015
add_loop_inductor_dynamic_gpu,compile_time_instruction_count,43280000000,0.025
add_loop_inductor_dynamic_gpu,compile_time_instruction_count,44500000000,0.025
add_loop_inductor_gpu,compile_time_instruction_count,25690000000,0.015
add_loop_inductor_gpu,compile_time_instruction_count,27320000000,0.015
basic_modules_ListOfLinears_eager,compile_time_instruction_count,1033000000,0.015
basic_modules_ListOfLinears_eager,compile_time_instruction_count,1018000000,0.015
basic_modules_ListOfLinears_inductor,compile_time_instruction_count,20810000000,0.015
basic_modules_ListOfLinears_inductor,compile_time_instruction_count,21760000000,0.015
basic_modules_ListOfLinears_inductor_gpu_force_shape_pad,compile_time_instruction_count,17020000000,0.015
basic_modules_ListOfLinears_inductor_gpu_force_shape_pad,compile_time_instruction_count,17810000000,0.015
@ -38,7 +38,7 @@ update_hint_regression,compile_time_instruction_count,1669000000,0.02
sum_floordiv_regression,compile_time_instruction_count,1113000000,0.015
sum_floordiv_regression,compile_time_instruction_count,1033000000,0.015
@ -50,7 +50,7 @@ aotdispatcher_inference_nosubclass_cpu,compile_time_instruction_count,2018000000
aotdispatcher_inference_subclass_cpu,compile_time_instruction_count,5843000000,0.015
aotdispatcher_inference_subclass_cpu,compile_time_instruction_count,5796000000,0.015
@ -62,4 +62,4 @@ aotdispatcher_training_nosubclass_cpu,compile_time_instruction_count,3863000000,
aotdispatcher_training_subclass_cpu,compile_time_instruction_count,10410000000,0.015
aotdispatcher_training_subclass_cpu,compile_time_instruction_count,10330000000,0.015

1 add_loop_eager compile_time_instruction_count 3066000000 0.015
6 basic_modules_ListOfLinears_eager compile_time_instruction_count 1033000000 1018000000 0.015
7 basic_modules_ListOfLinears_inductor compile_time_instruction_count 20810000000 21760000000 0.015
8 basic_modules_ListOfLinears_inductor_gpu_force_shape_pad compile_time_instruction_count 17020000000 17810000000 0.015
9 basic_modules_ListOfLinears_inductor_gpu compile_time_instruction_count 17260000000 0.2
10 update_hint_regression compile_time_instruction_count 1669000000 0.02
11 sum_floordiv_regression compile_time_instruction_count 1113000000 1033000000 0.015
12 symint_sum compile_time_instruction_count 3293000000 0.015
13 aotdispatcher_inference_nosubclass_cpu compile_time_instruction_count 2018000000 0.015
14 aotdispatcher_inference_subclass_cpu compile_time_instruction_count 5843000000 5796000000 0.015
15 aotdispatcher_partitioner_cpu compile_time_instruction_count 9095000000 0.015
16 aotdispatcher_training_nosubclass_cpu compile_time_instruction_count 3863000000 0.015
17 aotdispatcher_training_subclass_cpu compile_time_instruction_count 10410000000 10330000000 0.015

View File

@ -353,7 +353,7 @@ void testStaticRuntime(
size_t new_managed_bytes =
memory_planner ? memory_planner->total_managed() : 0;
if (check_resize && new_managed_bytes >= 0) {
if (check_resize) {
EXPECT_GE(new_managed_bytes, managed_bytes);
}

View File

@ -123,7 +123,7 @@ inline constexpr crc64_t crc64(const char* str, size_t size) {
return crc64_t{detail::crc64impl(0, str, size)};
}
inline constexpr crc64_t crc64(c10::string_view str) {
inline constexpr crc64_t crc64(std::string_view str) {
return crc64(str.data(), str.size());
}
} // namespace c10::util

View File

@ -92,7 +92,7 @@ size_t ReplaceAll(std::string& s, std::string_view from, std::string_view to) {
std::string::size_type last_pos = 0u;
std::string::size_type cur_pos = 0u;
std::string::size_type write_pos = 0u;
const c10::string_view input(s);
const std::string_view input(s);
if (from.size() >= to.size()) {
// If the replacement string is not larger than the original, we

View File

@ -188,7 +188,6 @@ class BlockingCounter {
// returns false.
bool DecrementCount() {
const auto count_value = count_.fetch_sub(1, std::memory_order_relaxed) - 1;
TORCH_DCHECK_GE(count_value, 0);
if (count_value == 0) {
std::lock_guard<std::mutex> g(mutex_);
cond_.notify_one();

View File

@ -414,6 +414,9 @@ function(torch_compile_options libname)
$<$<COMPILE_LANGUAGE:CXX>:${private_compile_options}>)
if(USE_CUDA)
foreach(option IN LISTS private_compile_options)
if("${option}" STREQUAL "-Wextra-semi")
continue()
endif()
target_compile_options(${libname} PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:-Xcompiler ${option}>)
endforeach()
endif()

View File

@ -1,7 +1,7 @@
/* styles needed for the Google Search button */
.pytorch-left-menu-search input[type=text] {
background-image: none;
.gsc-overflow-hidden {
overflow: visible !important;
}
.gsc-control-cse {

View File

@ -10,7 +10,9 @@ torch.accelerator
device_count
is_available
current_accelerator
set_device_index
set_device_idx
current_device_index
current_device_idx
set_stream
current_stream

View File

@ -305,6 +305,7 @@ coverage_ignore_functions = [
"node_arg_is_weight",
"return_arg_list",
# torch.ao.quantization.pt2e.graph_utils
"bfs_trace_with_node_process",
"find_sequential_partitions",
"get_equivalent_types",
"update_equivalent_types_dict",

View File

@ -199,15 +199,8 @@ the model. For example:
stage_index,
num_stages,
device,
input_args=example_input_microbatch,
)
The ``PipelineStage`` requires an example argument ``input_args`` representing
the runtime input to the stage, which would be one microbatch worth of input
data. This argument is passed through the forward method of the stage module to
determine the input and output shapes required for communication.
When composing with other Data or Model parallelism techniques, ``output_args``
may also be required, if the output shape/dtype of the model chunk will be
affected.
@ -421,7 +414,7 @@ are subclasses of ``PipelineScheduleMulti``.
Logging
*******
You can turn on additional logging using the `TORCH_LOGS` environment variable from [`torch._logging`](https://pytorch.org/docs/main/logging.html#module-torch._logging):
You can turn on additional logging using the `TORCH_LOGS` environment variable from `torch._logging <https://pytorch.org/docs/main/logging.html#module-torch._logging>`_:
* `TORCH_LOGS=+pp` will display `logging.DEBUG` messages and all levels above it.
* `TORCH_LOGS=pp` will display `logging.INFO` messages and above.
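
For example, here is a minimal sketch of launching a run with these logs enabled; the script name `train_pipeline.py` is a placeholder, and this is simply the programmatic equivalent of `TORCH_LOGS=+pp python train_pipeline.py`:

```python
import os
import subprocess
import sys

# "+pp" selects logging.DEBUG and above for pipeline-parallel logs; "pp" selects INFO and above.
env = {**os.environ, "TORCH_LOGS": "+pp"}
subprocess.run([sys.executable, "train_pipeline.py"], env=env, check=True)
```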

View File

@ -508,7 +508,7 @@ API Example::
import torch
from torch.ao.quantization.quantize_pt2e import prepare_pt2e
from torch._export import capture_pre_autograd_graph
from torch.export import export_for_training
from torch.ao.quantization.quantizer import (
XNNPACKQuantizer,
get_symmetric_quantization_config,
@ -535,7 +535,7 @@ API Example::
# Step 1. program capture
# NOTE: this API will be updated to torch.export API in the future, but the captured
# result should mostly stay the same
m = capture_pre_autograd_graph(m, *example_inputs)
m = export_for_training(m, *example_inputs).module()
# we get a model with aten ops
# Step 2. quantization

Some files were not shown because too many files have changed in this diff.