Compare commits

...

255 Commits

Author SHA1 Message Date
209a9bd3ca Update base for Update on "[NJT] Actually inline NT torch function during dynamo"
To avoid having any temporary state where behavior is regressed, this PR does all of the following at once:

(1) Disables torch function running a second time in AOTAutograd

If you have a tensor subclass that relies on dispatching into the same op without unwrapping, by calling torch._C.DisableTorchFunctionSubclass(), the torch-function-ness will survive into AOTAutograd (when normally we would expect the torch function to be inlined away during dynamo). If this happens, we should make sure not to run the torch function logic a second time (see the sketch after this list).

(2) Enables torch function to be inlined in dynamo for NT

Due to torch function running a second time in AOTAutograd, NT was actually relying on this behavior instead of properly inlining through torch function at the dynamo level.

(3) Fixes graph breaks for NT torch function

Now that we are inlining through torch function for the first time in dynamo, we've uncovered some graph breaks. Thanks to mlazos, we should now have support for custom attributes on torch function subclasses. We also add support for a custom Enum type. Finally, a few graph breaks can be eliminated by adding allow_in_graph (though we may need to double-check the soundness here).
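
A minimal sketch (not the NT implementation) of the pattern described in (1): a subclass whose `__torch_function__` re-dispatches into the same op without unwrapping, using `torch._C.DisableTorchFunctionSubclass()` to avoid infinite recursion. The torch-function-ness of such a subclass is what can survive into AOTAutograd.
```python
import torch

class PassthroughTensor(torch.Tensor):
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        # Re-dispatch into the same op without unwrapping the subclass.
        with torch._C.DisableTorchFunctionSubclass():
            return func(*args, **kwargs)

x = torch.ones(3).as_subclass(PassthroughTensor)
y = x + 1  # goes through __torch_function__, then dispatches as usual
```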


Fixes https://github.com/pytorch/pytorch/issues/120654, https://github.com/pytorch/pytorch/issues/120124

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang

[ghstack-poisoned]
2024-04-01 17:02:30 -07:00
23df075e2e Update base for Update on "[NJT] Actually inline NT torch function during dynamo"
To avoid having any temporary state where behavior is regressed, this PR does all of the following at once:

(1) Disables torch function running a second time in AOTAutograd

If you have a tensor subclass that relies on dispatching into the same op without unwrapping, by calling torch._C.DisableTorchFunctionSubclass(), the torch-function-ness will survive into AOTAutograd (when normally we would expect the torch function to be inlined away during dynamo). If this happens, we should make sure not to run the torch function logic a second time.

(2) Enables torch function to be inlined in dynamo for NT

Due to torch function running a second time in AOTAutograd, NT was actually relying on this behavior instead of properly inlining through torch function at the dynamo level.

(3) Fixes graph breaks for NT torch function

Now that we are inlining through torch function for the first time in dynamo, we've uncovered some graph breaks. Thanks to mlazos, we should now have support for custom attributes on torch function subclasses. We also add support for a custom Enum type. Finally, a few graph breaks can be eliminated by adding allow_in_graph (though we may need to double-check the soundness here).


Fixes https://github.com/pytorch/pytorch/issues/120654, https://github.com/pytorch/pytorch/issues/120124

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang

[ghstack-poisoned]
2024-04-01 11:22:39 -07:00
76a87e33a0 Remove cuda dependencies when building AOTriton (#122982)
Downloading CUDA sometimes fails and breaks the build process, but AOTriton does not need these packages for its own Triton fork. This commit comments out the related downloading scripts.

The actual changes from Triton can be found at: 9b73a543a5

Fixes the following build error:
```
[2/6] cd /var/lib/jenkins/workspace/build/aotriton/src/third_party/triton/python && /opt/conda/envs/py_3.8/bin/cmake -E env VIRTUAL_ENV=/var/lib/jenkins/workspace/build/aotriton/build/venv PATH="/var/lib/jenkins/workspace/build/aotriton/build/venv/bin:/opt/cache/bin:/opt/rocm/llvm/bin:/opt/rocm/opencl/bin:/opt/rocm/hip/bin:/opt/rocm/hcc/bin:/opt/rocm/bin:/opt/conda/envs/py_3.8/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" TRITON_BUILD_DIR=/var/lib/jenkins/workspace/build/aotriton/build/triton_build python setup.py develop
FAILED: CMakeFiles/aotriton_venv_triton /var/lib/jenkins/.local/lib/python3.8/site-packages/triton/_C/libtriton.so /var/lib/jenkins/workspace/build/aotriton/build/CMakeFiles/aotriton_venv_triton
cd /var/lib/jenkins/workspace/build/aotriton/src/third_party/triton/python && /opt/conda/envs/py_3.8/bin/cmake -E env VIRTUAL_ENV=/var/lib/jenkins/workspace/build/aotriton/build/venv PATH="/var/lib/jenkins/workspace/build/aotriton/build/venv/bin:/opt/cache/bin:/opt/rocm/llvm/bin:/opt/rocm/opencl/bin:/opt/rocm/hip/bin:/opt/rocm/hcc/bin:/opt/rocm/bin:/opt/conda/envs/py_3.8/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" TRITON_BUILD_DIR=/var/lib/jenkins/workspace/build/aotriton/build/triton_build python setup.py develop
downloading and extracting https://conda.anaconda.org/nvidia/label/cuda-12.1.1/linux-64/cuda-nvcc-12.1.105-0.tar.bz2 ...
downloading and extracting https://conda.anaconda.org/nvidia/label/cuda-12.1.1/linux-64/cuda-cuobjdump-12.1.111-0.tar.bz2 ...
Traceback (most recent call last):
  File "/var/lib/jenkins/workspace/build/aotriton/src/third_party/triton/python/setup.py", line 325, in <module>
    download_and_copy(
  File "/var/lib/jenkins/workspace/build/aotriton/src/third_party/triton/python/setup.py", line 151, in download_and_copy
    ftpstream = urllib.request.urlopen(url)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/urllib/request.py", line 215, in urlopen
    return opener.open(url, data, timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/urllib/request.py", line 521, in open
    response = meth(req, response)
               ^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/urllib/request.py", line 630, in http_response
    response = self.parent.error(
               ^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/urllib/request.py", line 559, in error
    return self._call_chain(*args)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/urllib/request.py", line 492, in _call_chain
    result = func(*args)
             ^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/urllib/request.py", line 639, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 524:
ninja: build stopped: subcommand failed.
```

Example of failed build log: https://github.com/pytorch/pytorch/actions/runs/8483953034/job/23245996425
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122982
Approved by: https://github.com/jansel
2024-04-01 17:50:35 +00:00
c422bce131 [codemod] Fix some namespace issues in caffe2 (#121847)
Summary:
Removes `using namespace` from a header file. Having `using namespace` in a header file is *always* a bad idea. A previous raft of diffs provided appropriate qualifications to everything that relied on this `using namespace`, so it is now safe to remove it in this separate diff.

Helps us enable `-Wheader-hygiene`.

Test Plan: Sandcastle

Reviewed By: dmm-fb

Differential Revision: D54838298

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121847
Approved by: https://github.com/Skylion007
2024-04-01 17:45:16 +00:00
533c1b6c49 Disable vulkan logsoftmax test (#123103)
Example: https://github.com/pytorch/pytorch/actions/runs/8509797936/job/23306567177

The failure only surfaced after #122845 (the bug fix to surface cpp test failures), so I don't know when it started.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123103
Approved by: https://github.com/kit1980
2024-04-01 17:41:59 +00:00
d7a274e1b0 [dtensor] switch aten.t to use op strategy (#122950)
as titled

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122950
Approved by: https://github.com/awgu, https://github.com/tianyu-l
ghstack dependencies: #122929, #122949
2024-04-01 17:39:43 +00:00
9e1447dad6 [dtensor] make sure expected input spec have correct tensor meta (#122949)
As titled. Previously we could return an expected input spec that is shared by multiple args. This is not OK since different args might have different tensor metas; the reason it worked before is that redistribute in these cases becomes a no-op.

This PR fixes it by making each expected input spec shallow-clone the corresponding input metadata.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122949
Approved by: https://github.com/tianyu-l
ghstack dependencies: #122929
2024-04-01 17:39:42 +00:00
afee5bea92 [dtensor] refactor schema suggestions in output sharding (#122929)
This PR refactors schema_suggestions in OutputSharding to be a single OpSchema instead of a list of schemas, since in practice we only ever have one. The multiple-resharding case has also moved to OpStrategy, so there is no case left that needs a list.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122929
Approved by: https://github.com/tianyu-l
2024-04-01 17:39:39 +00:00
b4c810491e [export] Temporarily block mutating ops in quant tests. (#122863)
Summary: After we migrate to torch.export, we won't see ops like add_ and mul_ due to functionalization. We are rolling out pre dispatch export, so for now we just skip those mutating ops in tests.

Test Plan: buck run mode/opt caffe2/test/quantization:test_quantization

Reviewed By: tugsbayasgalan

Differential Revision: D55442019

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122863
Approved by: https://github.com/clee2000
2024-04-01 16:41:13 +00:00
526ca5f28e [vec] fix compile warning in vec_n.h (#123090)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123090
Approved by: https://github.com/lezcano
2024-04-01 15:55:27 +00:00
9ff2a9dcdd [dynamo] Skip leaf check on assert_metadata_eq if grad tensor level is -2 (#122728)
When fakifying a grad-tracking tensor, if the level is -2 (a sentinel value) we can just unwrap the grad tensor and return a fake version of it. In this PR, we update `assert_metadata_eq` to not compare whether the grad tensor and the unwrapped one are leaves, as this may not always be true.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122728
Approved by: https://github.com/zou3519
2024-04-01 15:38:16 +00:00
03439d4c1c [inductor] Lower divide by constant as multiplication by reciprocal (#121924)
Fixes #101039

This lowers division by a constant value to multiplication by the reciprocal.
The same optimization is applied in eager mode on CUDA:

0636c11811/aten/src/ATen/native/cuda/BinaryDivTrueKernel.cu (L36-L38)
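
A rough illustration of the equivalence being exploited (not the inductor lowering itself): dividing by a compile-time constant can be rewritten as multiplying by its reciprocal.
```python
import torch

x = torch.randn(8)
c = 3.0
# The rewritten form matches the division up to floating-point rounding.
assert torch.allclose(x / c, x * (1.0 / c))
```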

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121924
Approved by: https://github.com/lezcano
2024-04-01 14:37:37 +00:00
6939279a17 [dynamo] Forward OptimizedModule.__setattr__ to the wrapped module (#122098)
Fixes #114844

In the linked issue we have
```
compiled_module = torch.compile(module)
compiled_module.x = ...
compiled_module(...)  # Mutates self.x
```
Since the module mutates `self.x`, you would expect `compiled_module.x`
to be updated, but actually `compiled_module.x = ...` sets an attribute "x"
on the `OptimizedModule` object, while the forward method of the module mutates
`module.x`.

This gives the expected behavior by forwarding `compiled_module.__setattr__`
down to `module.__setattr__`. There is already a corresponding `__getattr__`
so now `compiled_module.x` becomes an alias for `module.x`.
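
A hedged sketch of the expected behavior after this change (a minimal module standing in for the one in the issue):
```python
import torch
import torch.nn as nn

class Counter(nn.Module):
    def __init__(self):
        super().__init__()
        self.x = 0

    def forward(self, inp):
        self.x += 1  # mutates module state
        return inp + self.x

module = Counter()
compiled_module = torch.compile(module)
compiled_module.x = 10           # now forwarded to module.__setattr__
compiled_module(torch.zeros(1))  # forward mutates the same attribute
print(module.x, compiled_module.x)  # both now read the same attribute (11 after one call)
```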

Co-authored-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122098
Approved by: https://github.com/ezyang, https://github.com/lezcano
2024-04-01 14:30:44 +00:00
dd8a24b8b7 [xla hash update] update the pinned xla hash (#123078)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123078
Approved by: https://github.com/pytorchbot
2024-04-01 11:17:02 +00:00
4b725e1619 [AOTInductor] Support quantized linear on CPU with fbgemm (#123069)
Summary:
Added support for quantized linear on CPU with fbgemm.
Specifically, for torch.ops.quantized.linear_unpacked_dynamic_fp16, we
decompose it into two steps: packing the weight, and calling fbgemm's qlinear
with the packed weight.

Test Plan:
Included in commit.
test_aot_inductor::test_quantized_linear

Differential Revision: [D55577959](https://our.internmc.facebook.com/intern/diff/D55577959)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123069
Approved by: https://github.com/hl475
2024-04-01 09:15:05 +00:00
6b1f13ea2f Add skip models by device in Dynamo Test (#122591)
Fix the skip logic in `runner.py`: add a skip list defined per device for the dynamo benchmark runner `runner.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122591
Approved by: https://github.com/chuanqi129, https://github.com/desertfire, https://github.com/jgong5
2024-04-01 03:16:32 +00:00
8b7da5b791 Inductor cpp wrapper: fix dtype of ShapeAsConstantBuffer (#122297)
For `at::scalar_tensor` the default dtype will be `float` ([link to scalar_tensor](0d8e960f74/aten/src/ATen/native/TensorFactories.cpp (L856)), [link to default dtype](0d8e960f74/c10/core/TensorOptions.h (L551))) if we don't set the `dtype` value. However, the input scalar value is not necessarily a `float` value. With `torch::tensor(x)`, the dtype of the tensor will be decided according to the dtype of the scalar.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122297
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-04-01 01:32:41 +00:00
781e8d2201 [dynamo] Support __next__ on UserDefinedObjectVariable (#122565)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122565
Approved by: https://github.com/yanboliang
2024-03-31 19:00:03 +00:00
5fc0f52bf0 [BE] Use modern C++ in ATen tests (#123031)
`std::is_same<A, B>::value` -> `std::is_same_v<A, B>`
`std::is_floating_point<T>::value` -> `std::is_floating_point_v<T>`
And use constexpr instead of defining two mutually exclusive templates
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123031
Approved by: https://github.com/Skylion007
2024-03-31 16:07:38 +00:00
fa6178d246 [CI] Updated expected result files after https://github.com/pytorch/pytorch/pull/122846 (#123035)
Summary: Before https://github.com/pytorch/pytorch/pull/122846, pyhpc_isoneutral_mixing segfaulted in the AOTI inference run, so its result was not logged in the expected result file. Now it shows as fail_to_run instead of None.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123035
Approved by: https://github.com/chenyang78
2024-03-31 13:56:00 +00:00
6c2f36c984 Upgrade submodule pybind to 2.12.0 (#122899)
To fix https://github.com/pytorch/pytorch/issues/122056

Building with NP 2.0 allows me to run locally with both NP 2.0 and 1.26.
Any other tests we should run, @rgommers?

FYI @Skylion007 @atalman
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122899
Approved by: https://github.com/Skylion007
2024-03-31 11:29:40 +00:00
cyy
6d8bb0e984 [Distributed] [1/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#122884)
This PR fixes some clang-tidy warnings in distributed code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122884
Approved by: https://github.com/kwen2501
2024-03-31 09:06:35 +00:00
a52e89b6f7 [inductor]re-enable cpu reduction ut (#122289)
Re-enable these two UTs. They pass on my local machine, and we can see their status in the CI for this PR.

See the background about why they are disabled https://github.com/pytorch/pytorch/issues/93542, https://github.com/pytorch/pytorch/issues/87157.

After https://github.com/pytorch/pytorch/pull/115620, the reduction orders should be deterministic.
However, the orders may not be exactly the same as the reference path (`aten`). We may set a larger tolerance if they still cannot pass in CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122289
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-03-31 08:33:14 +00:00
56451cd49d Enable x86 CPU vectorization on windows [submodule sleef] (#118980)
Enable VEC on Windows OS.
1. Fix some type definition gaps between Windows and Linux.
2. Fix some operators not supported on Windows, such as [] and /.
3. Enable static sleef library build on Windows.
4. Disable unsupported function overloading on MSVC.
5. Upgrade the sleef submodule, which fixes a build issue on Windows.
6. Fix bazel build issues.
7. Fix the test app not linking to sleef on Windows.

Note: If the rebuild fails after pulling this PR, please sync the `sleef` submodule by running:
```cmd
git submodule sync
git submodule update --init --recursive
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118980
Approved by: https://github.com/jgong5, https://github.com/ezyang, https://github.com/malfet
2024-03-31 03:07:32 +00:00
2b1ba0ceae [DeviceMesh] Cache and reuse sliced result (#122975)
Fixes #118849

Add a map for parent_to_child_mappings in _mesh_resources so we can cache and reuse submesh slicing results, avoiding recreating the submesh and the underlying sub-PG repeatedly, which could lead to funky behaviors.

We will follow up by reusing the pg from the parent_mesh during submesh creation.
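
A minimal memoization sketch of the idea (names and the helper are illustrative stand-ins, not the actual _mesh_resources internals): cache the sliced child mesh keyed by the parent mesh and the mesh dim name so repeated slicing reuses the same object.
```python
from typing import Any, Dict, Tuple

_parent_to_child_mappings: Dict[Tuple[int, str], Any] = {}

def _slice_submesh(parent_mesh, mesh_dim_name):
    # Hypothetical stand-in for the real slicing logic that builds a submesh
    # (and its sub process groups) from the parent mesh.
    return (parent_mesh, mesh_dim_name)

def get_or_create_submesh(parent_mesh, mesh_dim_name):
    # Reuse the cached result instead of recreating the submesh each time.
    key = (id(parent_mesh), mesh_dim_name)
    if key not in _parent_to_child_mappings:
        _parent_to_child_mappings[key] = _slice_submesh(parent_mesh, mesh_dim_name)
    return _parent_to_child_mappings[key]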

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122975
Approved by: https://github.com/wanchaol
2024-03-30 23:56:55 +00:00
35c493f2cf [CPP Extension] Escape include paths (#122974)
By using `shlex.quote` on Linux/Mac and `_nt_quote_args` on Windows

Tested by adding a non-existent path with spaces and a single quote.

TODO: Fix double quotes on Windows (this will require touching `_nt_quote_args`, so it is left for another day).
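
An illustrative sketch of the quoting approach (not the actual cpp_extension code): quote each include path so spaces and quotes don't split the compiler command line.
```python
import shlex

include_dirs = ["/opt/my libs/include", "/tmp/o'brien/include"]
# Each path is quoted so the shell treats it as a single argument.
flags = ["-I" + shlex.quote(d) for d in include_dirs]
print(" ".join(flags))
```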

Fixes https://github.com/pytorch/pytorch/issues/122476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122974
Approved by: https://github.com/Skylion007
2024-03-30 21:58:29 +00:00
557e7c9c16 Add some type hints to functions and update a few spelling mistakes (#123015)
# Summary
While working on this PR: https://github.com/pytorch/pytorch/pull/121845
I found that these type hints made my IDE / newcomer experience easier to reason about.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123015
Approved by: https://github.com/Skylion007
2024-03-30 21:15:01 +00:00
e203aa9fab [FSDP] [easy] fix HSDP validation error msg (#123019)
Summary:
This would otherwise yield

> ValueError: ('Manual wrapping with ShardingStrategy.HYBRID_SHARD', 'requires explicit specification of process group or device_mesh.')

which is odd.

Remove the extra trailing commas.
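
The bug pattern, for illustration: a trailing comma turns what was meant to be one message (implicit string concatenation across lines) into a tuple of two strings, which is why the ValueError printed a tuple.
```python
msg_as_tuple = (
    "Manual wrapping with ShardingStrategy.HYBRID_SHARD",  # <- stray comma makes this a tuple
    "requires explicit specification of process group or device_mesh."
)
msg_as_str = (
    "Manual wrapping with ShardingStrategy.HYBRID_SHARD "
    "requires explicit specification of process group or device_mesh."
)
print(type(msg_as_tuple).__name__, type(msg_as_str).__name__)  # tuple str
```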

Test Plan: CI

Differential Revision: D55549851

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123019
Approved by: https://github.com/Skylion007
2024-03-30 18:12:34 +00:00
ec58f1f74e [inductor] make mask_rcnn inference work in max-autotune mode (#123008)
Inference for the vision_maskrcnn model fails when max-autotune is enabled.

Repro:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/torchbench.py --accuracy --inference --bfloat16 --backend inductor --only vision_maskrcnn
```

It turns out that the max-autotune code receives an empty input tensor for convolution, and some places in the MA-related code do not handle this corner case properly. This PR fixes that, and the accuracy test above now passes.

Regarding why the input tensor is empty, it's probably because no objects are detected in the input images (random data?).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123008
Approved by: https://github.com/jansel
2024-03-30 16:39:57 +00:00
5e878be101 Revert "Enable x86 CPU vectorization on windows [submodule sleef] (#118980)"
This reverts commit d94db5f6ee0af745c0d17cc6c87f695baa2b3b5f.

Reverted https://github.com/pytorch/pytorch/pull/118980 on behalf of https://github.com/atalman due to Breaks internal build ([comment](https://github.com/pytorch/pytorch/pull/118980#issuecomment-2028084839))
2024-03-30 14:20:54 +00:00
b8550f527f Support gpu trace on XPU (#121795)
# Motivation
Support GPU trace on the XPU backend by adding GPU trace to the XPU runtime. This is beneficial for generalizing the device caching allocator in the next step.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121795
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/albanD
ghstack dependencies: #121794
2024-03-30 13:07:53 +00:00
eb7adc3ae0 Refactor gpu trace to be device-agnostic (#121794)
# Motivation
Refactor gpu trace to be device-agnostic. GPU trace is usually used in runtime components, including Device, Stream, Event, Guard, and Allocator. It should be device-agnostic and shareable across device backends.

# Solution
Move `_cuda_trace.py` to `_gpu_trace.py`, so that each device backend owns its own callbacks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121794
Approved by: https://github.com/jgong5, https://github.com/albanD, https://github.com/EikanWang, https://github.com/gujinghui
2024-03-30 13:04:38 +00:00
99f8f77de9 [Inductor] Fix AFOC QPS Regression. (#122944)
Summary: Recently, we observed an ~8% QPS regression for the AFOC model. After digging into the problem, I found it was introduced by D55272024, where split-node normalization was skipped for call_method split nodes, while our pattern detection is based on the assumption that all split nodes have been normalized to call_function nodes. More context: https://docs.google.com/document/d/19h-fu2BqdUXMaSqbd7c0-Qe00ic7quUN-emJqH_1-SA/edit

Test Plan:
# unit test
```
buck2 test @mode/dev-nosan //caffe2/test/inductor:split_cat_fx_passes
```
Buck UI: https://www.internalfb.com/buck2/0792d406-3d64-4b9c-95cc-15fb0cc76a96
Test UI: https://www.internalfb.com/intern/testinfra/testrun/11258999096315690
Network: Up: 113KiB  Down: 535KiB  (reSessionID-6132c09b-2ce7-4e89-b61d-d6c6142630cc)
Jobs completed: 26. Time elapsed: 1:25.6s.
Cache hits: 0%. Commands: 2 (cached: 0, remote: 0, local: 2)
Tests finished: Pass 10. Fail 0. Fatal 0. Skip 0. Build failure 0
```
buck2 test @mode/dev-nosan //caffe2/test/inductor:group_batch_fusion
```
Test UI: https://www.internalfb.com/intern/testinfra/testrun/13792273886410433
Network: Up: 1.3MiB  Down: 960KiB  (reSessionID-0bea8575-f163-4c5d-b201-69e05806af98)
Jobs completed: 68. Time elapsed: 2:47.2s.
Cache hits: 0%. Commands: 13 (cached: 0, remote: 1, local: 12)
Tests finished: Pass 9. Fail 0. Fatal 0. Skip 0. Build failure 0

# local reproduce
```
buck2 run @mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "afoc" --flow_id 545665840
```
Now the merge_splits_pass is conducted.
```
'inductor': Counter({'pattern_matcher_nodes': 1614, 'pattern_matcher_count': 1566, 'normalization_pass': 645, 'remove_split_with_size_one_pass': 629, 'batch_aten_mul': 13, 'scmerge_split_sections_removed': 11, 'scmerge_cat_removed': 5, 'scmerge_cat_added': 4, 'merge_splits_pass': 3, 'merge_getitem_cat_pass': 2, 'scmerge_split_removed': 2, 'batch_linear_post_grad': 2, 'batch_aten_sub': 2, 'batch_layernorm': 1, 'scmerge_split_added': 1})}
```

# e2e
baseline:
f545633808

before_fix:
f545665840

After_fix:
f546227494

proposal:

Differential Revision: D55513494

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122944
Approved by: https://github.com/jackiexu1992
2024-03-30 07:34:41 +00:00
2cd3ef4777 Check scale dtype for fake_quantize_per_channel_affine_cachemask (#120987)
Fixes #120903

The scale for fake quant is assumed to be FP32 but not checked. If scales of double dtype are passed in, an internal error is raised: `TORCH_INTERNAL_ASSERT(!needs_dynamic_casting<func_t>::check(iter));` in aten/src/ATen/native/cpu/Loops.h.
This PR adds a check of the scale dtype.
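
A hedged example using the user-facing op (which is backed by the cachemask variant): scales must be float32, so cast a double scale before calling; after this check, a double scale raises a clear error instead of the internal assert.
```python
import torch

x = torch.randn(2, 3)
scale = torch.tensor([0.1, 0.2], dtype=torch.float64)  # double would trip the assert
zero_point = torch.zeros(2, dtype=torch.int32)
y = torch.fake_quantize_per_channel_affine(
    x, scale.float(), zero_point, 0, 0, 255  # cast the scale to float32
)
```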

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120987
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2024-03-30 07:32:32 +00:00
07f0ff6ed7 [DCP][FSDP2][Test] Add_adamW to test_train_parity_2d_transformer_checkpoint_resume (#122002)
Want to add the option of AdamW here, as this is currently the only test for 2D.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122002
Approved by: https://github.com/awgu, https://github.com/fegin
2024-03-30 07:28:41 +00:00
ed457c7dbe [export] Add torch_fn (#122693)
This PR adds a new piece of metadata, `torch_fn`, which is meant to replace `source_fn_stack`, as `source_fn_stack` is not entirely well-defined between strict/non-strict. Previous discussion [here](https://docs.google.com/document/d/1sPmmsmh6rZFWH03QBOe49MaXrQkP8SxoG8AOMb-pFk4/edit#heading=h.anmx9qknhvm).

`torch_fn` represents the torch function that a particular aten operator came from. For example, `torch.nn.Linear` goes down to the `torch.nn.functional.linear` at the `__torch_function__` layer, and then `aten.t/aten.addmm` in the `__torch_dispatch__` layer. So the nodes `aten.t/aten.addmm` will now have the `torch_fn` metadata containing the `torch.nn.functional.linear`.

The `torch_fn` metadata is a tuple of 2 strings: a unique identifier for each torch function call, and the actual torch function `f"{fn.__class__}.{fn.__name__}"`. The purpose of the first value is to distinguish between 2 consecutive calls to the same function. For example, if we had 2 calls to `torch.nn.Linear`, the nodes and corresponding metadata would look something like:
```
aten.t - ("linear_1", "builtin_function_or_method.linear"),
aten.addmm - ("linear_1", "builtin_function_or_method.linear"),
aten.t - ("linear_2", "builtin_function_or_method.linear"),
aten.addmm - ("linear_2", "builtin_function_or_method.linear"),
```
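
A hedged sketch of inspecting the new metadata on an exported graph (assumes a simple module; the exact values depend on the traced program):
```python
import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 4)

    def forward(self, x):
        return self.linear(x)

ep = torch.export.export(M(), (torch.randn(2, 4),))
for node in ep.graph.nodes:
    if node.op == "call_function":
        # torch_fn is the (call identifier, torch function) pair described above.
        print(node.name, node.meta.get("torch_fn"))
```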

Higher order ops -- currently we can get the torch_fn metadata for nodes within the HOO's subgraph, but after retracing, this becomes the `(cond, higher_order_op.cond)` :( This is because `fx_traceback.set_current_meta` points to the cond node in the toplevel graph, rather than the original node in the subgraph. I think this is because `fx.Interpreter` does not go into the cond subgraphs. (will discuss with Yidi more ab this)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122693
Approved by: https://github.com/tugsbayasgalan
2024-03-30 06:47:15 +00:00
3a9eead4ab [inductor] Don't compile MultiKernelCall in a subprocess (#123010)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123010
Approved by: https://github.com/shunting314
ghstack dependencies: #123009
2024-03-30 05:46:09 +00:00
6c0911f1d9 [inductor] Skip cudagraphs warning on CPU (#123009)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123009
Approved by: https://github.com/shunting314
2024-03-30 05:46:09 +00:00
0b7a156f68 [executorch hash update] update the pinned executorch hash (#122662)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122662
Approved by: https://github.com/pytorchbot
2024-03-30 05:18:53 +00:00
c66a44ea79 [AOTInductor] Support many outputs aliasing the same tensor (#122846)
fixes https://github.com/pytorch/pytorch/issues/122826

# Problem
When the model returns multiple outputs that alias the same tensor, we get a SEGFAULT, because we try to release the same buffer twice.
```
def forward(x):
  x_out = x + 1
  contig = x_out.contiguous()   # alias of same tensor as x_out
  return x_out, contig

run_impl() {
  output_handles[0] = buf0.release();
  output_handles[1] = buf0.release();   # SEGFAULT
}

# if we try to workaround this by assign aliases without creating a new tensor,
# then, we'll get a double free error during handle clean-up.
output_handles[1] = output_handles[0];    # assign without creating a new tensor
...
alloc_tensors_by_stealing_from_handles(){
  aoti_torch_delete_tensor_object(handles[0]);
  aoti_torch_delete_tensor_object(handles[1]);   # Double free
}
```

# Solution
~~Instead, we use the first `output_handle` that shares the same tensor and alias it.~~
```
output_handles[0] = buf0.release();
aoti_torch_alias_tensor(output_handles[0], &output_handles[1]);  # No SEGFAULT & No double free!
```

A simpler approach is to figure out which handles are duplicates. Then we simply copy all duplicates except the last one. The last one will use `std::move` and free the tensor owned by the model instance.
```
output_handles[0] = buf0.release();
output_handles[1] = output_handles[0];
```

Differential Revision: [D55455344](https://our.internmc.facebook.com/intern/diff/D55455344)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122846
Approved by: https://github.com/desertfire, https://github.com/chenyang78, https://github.com/jingsh
2024-03-30 04:41:17 +00:00
aaba3a87b1 tune down batch-size for res2net to avoid OOM (#122977)
The batch size for this model was previously 64. We later changed it to 256, which caused OOM in the cudagraphs setting. This PR tunes the batch size down to 128.

Share more logs from my local run
```
cuda,res2net101_26w_4s,128,1.603578,110.273572,335.263494,1.042566,11.469964,11.001666,807,2,7,6,0,0
cuda,res2net101_26w_4s,256,1.714980,207.986155,344.013071,1.058278,22.260176,21.034332,807,2,7,6,0,0
```

The log shows that torch.compile uses 11GB at batch size 128 and 21GB at batch size 256. I guess the benchmark script has extra overhead that causes the model to OOM at batch size 256 in the dashboard run.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122977
Approved by: https://github.com/Chillee
2024-03-30 03:54:53 +00:00
5a06b8ebfd Remove skipIfTorchDynamo from TestComposability in test_eager_transforms.py (#121830)
Fixes: https://github.com/pytorch/pytorch/issues/96559

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121830
Approved by: https://github.com/zou3519
ghstack dependencies: #121410, #121665
2024-03-30 01:55:04 +00:00
3d3d4e1cd5 export XPUStream to doc (#121398)
# Motivation
We would like to export XPUStream to public [doc](https://pytorch.org/cppdocs/api/library_root.html). The detailed documentation can help users understand and utilize XPU more effectively.

# Additional Context
A detailed XPUStream API and its usage should be documented in the public docs, like CUDA's [doc](https://github.com/pytorch/pytorch/blob/main/docs/cpp/source/notes/tensor_cuda_stream.rst).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121398
Approved by: https://github.com/gujinghui, https://github.com/jgong5, https://github.com/EikanWang, https://github.com/albanD
2024-03-30 00:36:26 +00:00
f4ff063c33 Add attributes to xpu device prop (#121898)
# Motivation
Add some attributes to `XPUDeviceProp` and expose them via `torch.xpu.get_device_properties` and `torch.xpu.get_device_capability`. They can be used in `torch.compile`  or directly passed to triton to generate more optimized code based on device properties.

# Additional Context
Expose the following attributes via `torch.xpu.get_device_properties` (usage is sketched after the list):
- `has_fp16` (newly added)
- `has_fp64` (newly added)
- `has_atomic64` (newly added)
- `driver_version`
- `vendor`
- `version`
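
A hedged usage sketch (assumes a PyTorch build with XPU support and an available XPU device):
```python
import torch

props = torch.xpu.get_device_properties(0)
# Newly exposed capability flags and identification fields.
print(props.has_fp16, props.has_fp64, props.has_atomic64)
print(props.driver_version, props.vendor, props.version)
print(torch.xpu.get_device_capability(0))
```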

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121898
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/malfet, https://github.com/albanD, https://github.com/atalman
2024-03-30 00:25:39 +00:00
b5bef9bbfd Fix cpp tests not running + failing to surface (#122845)
The comment in the code should have the information
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122845
Approved by: https://github.com/huydhn
2024-03-29 22:41:45 +00:00
4282bb8b07 [c10d] add the source rank which detects the timeout (#122850)
Summary:
When a rank detects a timeout from the TCPStore and triggers the dump, it's good to have more info about the source rank that detected the collective timeout locally. We just need to put the source rank as the value in the kvstore.
Test Plan:
In the unit test, we trigger the timeout on rank 0; rank 1 should then get the timeout signal from the store and log the correct source rank:

```
(sqzhang_1) [sqzhang@devgpu009.cln1 ~/pytorch (34d27652)]$  python
test/distributed/test_c10d_nccl.py NCCLTraceTestTimeoutDumpOnStuckRanks
NCCL version 2.19.3+cuda12.0
[rank0]:[E327 17:04:16.986381360 ProcessGroupNCCL.cpp:565] [Rank 0]
Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2,
OpType=ALLREDUCE, NumelIn=12, NumelOut=12, Timeout(ms)=1000) ran for
1099 milliseconds before timing out.
[rank0]:[E327 17:04:16.988036373 ProcessGroupNCCL.cpp:1582] [PG 0 Rank
0] Timeout at NCCL work: 2, last enqueued NCCL work: 2, last completed
   NCCL work: 1.
   [rank0]:[E327 17:04:16.182548526 ProcessGroupNCCL.cpp:1346] [PG 0
   Rank 0] Received a timeout signal from this local rank and will start
   to dump the debug info. Last enqueued NCCL work: 2, last completed
   NCCL work: 1.
   [rank0]:[E327 17:04:16.247574460 ProcessGroupNCCL.cpp:1167] [PG 0
   Rank 0] ProcessGroupNCCL preparing to dump debug info.
   [rank1]:[E327 17:04:16.273332178 ProcessGroupNCCL.cpp:1346] [PG 0
   Rank 1] Received a global timeout from another rank 0, and will start
   to dump the debug info. Last enqueued NCCL work: 1, last completed
   NCCL work: 1.
   [rank1]:[E327 17:04:16.273565177 ProcessGroupNCCL.cpp:1167] [PG 0
   Rank 1] ProcessGroupNCCL preparing to dump debug info.
   [rank1]:[F327 17:04:16.274256512 ProcessGroupNCCL.cpp:1185] [PG 0
   Rank 1] [PG 0 Rank 1] ProcessGroupNCCL's watchdog detected a
   collective timeout from another rank 0 and notified the current rank.
   This is most likely caused by incorrect usages of collectives, e.g.,
   wrong sizes used across ranks, the order of collectives is not same
   for all ranks or the scheduled collective, for some reason, didn't
   run. Additionally, this can be caused by GIL deadlock or other
   reasons such as network errors or bugs in the communications library
   (e.g. NCCL), etc. We tried our best to dump the debug info into the
   storage to help you debug the issue.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122850
Approved by: https://github.com/wconstab
2024-03-29 22:22:37 +00:00
d7d77a152c [ez] Increase slow grad check shards 4 to 6 (#122631)
They take almost 4 hours to run completely for one shard

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122631
Approved by: https://github.com/huydhn
2024-03-29 21:49:27 +00:00
ea33adf6c2 [vec] test VecMask in vec_test_all_types (#122878)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122878
Approved by: https://github.com/malfet
ghstack dependencies: #119979, #122869
2024-03-29 21:48:29 +00:00
c9b32c9caa [vec] test at::vec::convert in vec_test_all_types (#122869)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122869
Approved by: https://github.com/malfet
ghstack dependencies: #119979
2024-03-29 21:48:29 +00:00
6f4ed57b8a [inductor][cpp] unified the vectorized conversion with at::vec::convert for all data types (#119979)
This PR unified the vectorized conversion with `at::vec::convert` for all vectorized data types. The intrinsics implementations are implemented as a specialization and moved to their own arch-specific files. The vectorized conversion logic in cpp Inductor is simplified.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119979
Approved by: https://github.com/jansel, https://github.com/malfet
2024-03-29 21:48:29 +00:00
05e54536fb [CI] Removed tests for torch.utils.tensorboard.summary.hparams (#122556)
Partially addresses #122160

In the module `torch.utils.tensorboard.summary`, the `hparams` method does not depend on any utilities from PyTorch, as it uses only utilities from `tensorboard`. Thus, I think it is safe to delete the test for the `hparams` method, since it does not depend on PyTorch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122556
Approved by: https://github.com/huydhn
2024-03-29 21:44:02 +00:00
482d8bf1ea [aoti] Change aot_compile callsites (#122225)
Summary:
Replacing `torch._export.aot_compile` callsites with
```
ep = torch.export._trace._export(.., predispatch=True)   # Traces the given program into predispatch IR
so_path = torch._inductor.aot_compile_ep(ep, ...)  # Takes an exported program and compiles it into a .so
```

This allows us to explicitly split up the export step from AOTInductor. We can later modify tests to do `export + serialize + deserialize + inductor` to mimic internal production use cases better.

Test Plan: CI

Differential Revision: D54808612

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122225
Approved by: https://github.com/SherlockNoMad, https://github.com/khabinov
2024-03-29 21:34:20 +00:00
267145c5d0 Enable full state checking (#122971)
Fixes https://github.com/pytorch/pytorch/issues/115679

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122971
Approved by: https://github.com/anijain2305
2024-03-29 21:24:57 +00:00
4d6cb7bca0 Use Q-NEON register to compute the dot product (#122952)
Make transposed gemv a bit faster
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122952
Approved by: https://github.com/kimishpatel
ghstack dependencies: #122951
2024-03-29 21:09:08 +00:00
73e362756b Avoid COW materialize in conv forward ops (#122748)
Part of #97856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122748
Approved by: https://github.com/ezyang
ghstack dependencies: #122720
2024-03-29 20:34:19 +00:00
cyy
7423092227 [TorchGen] [2/N] Remove unused variables and simplify dictionary iterations (#122585)
This PR continues to remove unused variables and simplifies dictionary iterations from TorchGen scripts, following #122576.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122585
Approved by: https://github.com/ezyang
2024-03-29 20:34:11 +00:00
57a9a64e10 [BE] Give a different error message when evaluating an integer. (#122938)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122938
Approved by: https://github.com/Skylion007
2024-03-29 19:14:15 +00:00
3178ba0dc9 Don't use sympy Float functions, use an opaque one with no reasoning (#122823)
Sympy simplifications don't obey floating-point semantics, so don't use
Sympy for this. Keep the expressions as-is; only evaluate with the reference
implementations when all arguments are known.
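
A concrete reason symbolic simplification can diverge from float semantics: floating-point addition is not associative, so rewriting (x + y) - x to y changes the computed value.
```python
# In float64, 1e16 + 1.0 rounds back to 1e16, so the "simplified" answer 1.0 is wrong.
x, y = 1e16, 1.0
print((x + y) - x)  # 0.0, not 1.0
```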

This may end up getting subsumed by some other changes later, but I
wanted to understand if this was easy and it seems to be easy.

This doesn't actually depend on the earlier diffs on the stack and I can detach it.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122823
Approved by: https://github.com/lezcano
2024-03-29 19:13:55 +00:00
ae0cf1f98d [TD][ez] Set pytest cache bucket default to gha-artifacts (#122901)
After https://github.com/pytorch/pytorch/pull/121907/files

Example failure: https://github.com/pytorch/pytorch/actions/runs/8473386479/job/23217733984#step:5:130
```
usage: pytest_cache.py [-h] (--upload | --download) --cache_dir CACHE_DIR
                       --pr_identifier PR_IDENTIFIER --job_identifier
                       JOB_IDENTIFIER [--sha SHA] [--test_config TEST_CONFIG]
                       [--shard SHARD] [--repo REPO] [--temp_dir TEMP_DIR]
                       [--bucket BUCKET]
pytest_cache.py: error: argument --bucket: expected one argument
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122901
Approved by: https://github.com/huydhn
2024-03-29 18:52:58 +00:00
99d939f51f [dynamo] Bugfix for HASATTR guard (#122947)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122947
Approved by: https://github.com/jansel
ghstack dependencies: #122828
2024-03-29 18:50:33 +00:00
0a7162f898 Fix svd_lowrank parameter M (#122681)
ISSUE: #122699

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122681
Approved by: https://github.com/lezcano
2024-03-29 18:06:38 +00:00
487b6d40ec Add RMSNorm module (#121364)
Similar to dbeed9724b/torchmultimodal/modules/layers/normalizations.py (L51)

**The implementation here is not optimized and we welcome pull requests to improve this**

- Use `normalized_shape` instead of a singular integer `dim` to be aligned with the `nn.LayerNorm` implementation (see the usage sketch after this list)
- Remove the [upcast to float and downcast](dbeed9724b/torchmultimodal/modules/layers/normalizations.py (L73))
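
A hedged usage sketch of the new module (per this PR, it takes a `normalized_shape` like `nn.LayerNorm` rather than a single dim):
```python
import torch
import torch.nn as nn

rms_norm = nn.RMSNorm(normalized_shape=[64])
x = torch.randn(8, 64)
y = rms_norm(x)
print(y.shape)  # torch.Size([8, 64])
```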

Differential Revision: [D55485840](https://our.internmc.facebook.com/intern/diff/D55485840)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121364
Approved by: https://github.com/albanD
2024-03-29 18:05:28 +00:00
3243be7c3a [FSDP2] Removed wrapSwapTensorsTest since no longer needed (#122962)
We do not need to set the flag after https://github.com/pytorch/pytorch/pull/122755.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122962
Approved by: https://github.com/mikaylagawarecki
2024-03-29 17:53:18 +00:00
a236fa9f06 Revert "[aoti] clear precomputed symbol replacements before cpp wrapper compilation (#122882)"
This reverts commit 384de46395234e793a319325e5c9d20a60407a64.

Reverted https://github.com/pytorch/pytorch/pull/122882 on behalf of https://github.com/jithunnair-amd due to broke ROCm CI ([comment](https://github.com/pytorch/pytorch/pull/122882#issuecomment-2027544640))
2024-03-29 17:52:39 +00:00
2a137f7af1 [dynamo] Support hasattr on UserDefinedClassVariable (#122564)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122564
Approved by: https://github.com/anijain2305
2024-03-29 17:34:14 +00:00
772e142e70 [dynamo] Delay cuda device registration (#122795)
The module-level `torch.cuda.device_count` calls are delayed until the registered devices are read.

Fixes #122085

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122795
Approved by: https://github.com/ezyang
2024-03-29 17:22:18 +00:00
315bd951e4 Add inductor fx pass unit test for shape propagation (#122897)
Summary: Pre-grad fx passes expect information from shape propagation to be present. D55221119 ensured that `pass_execution_and_save` invokes shape propagation, and this diff adds a covering unit test to prevent regression.

Test Plan: New UT passes locally.

Differential Revision: D55440240

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122897
Approved by: https://github.com/khabinov, https://github.com/Skylion007
2024-03-29 16:44:22 +00:00
b83c94339e Fix performance regression and memory storage handling of Flash Attention on ROCM (#122857)
This PR fixes two major issues that were discovered after the initial merge of PR #121561:
1. The Flash Attention support added there has severe performance regressions on regular shapes (power-of-two head dimensions and sequence lengths) compared with PR #115981. Its performance is worse than the math backend and it only has numerical-stability advantages. This PR fixes this problem.
2. There is a flaw in the memory storage handling in PR #121561 which does not copy the gradients back to the designated output tensor. This PR removes the deprecated `TensorStorageSanitizer` class, which is unnecessary due to the more flexible backward kernel shipped by PR #121561.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122857
Approved by: https://github.com/jeffdaily, https://github.com/drisspg
2024-03-29 16:37:24 +00:00
d8b69de73b [EZ] Run fp16 torch.mm/torch.mv across CPU threads (#122951)
This significantly speeds up real world applications, such as LLMs

Before this change, llama2-7b fp16 inference ran at 1.5 tokens per sec;
after it, it runs at almost 6 tokens per sec.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122951
Approved by: https://github.com/ezyang
2024-03-29 16:14:59 +00:00
cyy
fb90b4d4b2 [TorchGen] Use std::optional in generated code (#121454)
This PR changes TorchGen to generate std::optional.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121454
Approved by: https://github.com/ezyang
2024-03-29 14:11:09 +00:00
375a8041ed [AOTI][refactor] Improve logging (#122932)
Summary: Improve some logging msgs, and change a data type to remove a compile time warning.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122932
Approved by: https://github.com/chenyang78
2024-03-29 14:02:23 +00:00
cyy
769d1909f0 Enable clang-tidy warnings of aten/src/ATen/functorch (#122933)
Enable clang-tidy in aten/src/ATen/functorch,  following #122779.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122933
Approved by: https://github.com/ezyang
2024-03-29 14:01:28 +00:00
38946bff51 Added DispatchKey.CompositeImplicitAutograd to all upsample_nearest*.default decompositions (#122782)
Related to https://github.com/pytorch/pytorch/pull/117632#issuecomment-2021321172
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122782
Approved by: https://github.com/ezyang
2024-03-29 13:55:25 +00:00
b524a404e0 Fixed support for uint8 in upsample bicubic2d decomposition (#120411)
Supersedes https://github.com/pytorch/pytorch/pull/104248

Description:
- Fixed support for uint8 in the upsample bicubic2d decomposition (on `main` the results are wrong, so we can tolerate the slowdown; see the usage sketch after this list)
- Added the missing clamp(0, 1) for xscale and yscale
  - slowdown for f32 on CPU; the PR on node fusion on CPU (https://github.com/pytorch/pytorch/pull/120077) can help for upsampling cases with align_corners=True
  - the slowdown is mainly due to the added clamp op, and is partially reduced by using torch.stack in the weights computation on CPU
- Removed the lowering implementation
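
A hedged example of the op whose decomposition is being fixed: bicubic upsampling of a uint8 image tensor (antialias=False, as in the benchmarks below).
```python
import torch
import torch.nn.functional as F

x = torch.randint(0, 256, (1, 3, 500, 400), dtype=torch.uint8)
y = F.interpolate(x, size=(256, 256), mode="bicubic", align_corners=False)
print(y.dtype, y.shape)  # torch.uint8 torch.Size([1, 3, 256, 256])
```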

Benchmarks:
```
[-------------------------------------------------------------------------------------------------------------------------------------------------------- Interpolate, cpu --------------------------------------------------------------------------------------------------------------------------------------------------------]
                                                                                                                                                   |  Eager (2.4.0a0+git0c61c20) PR  |  Compiled (2.4.0a0+git0c61c20) PR  |  Compiled (2.4.0a0+git069270d) Nightly  |  speed-up PR vs Nightly  |  Eager (2.4.0a0+git069270d) Nightly
1 threads: -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input (1, 3, 500, 400), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256)       |        613.029 (+-1.590)        |         5477.608 (+-9.027)         |           3060.314 (+-12.368)           |     0.559 (+-0.000)      |          608.735 (+-6.336)
      Input (1, 3, 500, 400), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256)      |        610.176 (+-1.428)        |        5718.503 (+-11.203)         |           3424.022 (+-12.836)           |     0.599 (+-0.000)      |          604.781 (+-6.229)
      Input (1, 3, 500, 400), torch.uint8, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256)           |        325.001 (+-0.840)        |        6183.029 (+-10.893)         |            3275.032 (+-7.625)           |     0.530 (+-0.000)      |          325.693 (+-1.067)
      Input (1, 3, 500, 400), torch.uint8, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256)          |        325.855 (+-1.108)        |        6391.394 (+-11.552)         |            3533.410 (+-7.666)           |     0.553 (+-0.000)      |          325.838 (+-1.457)
      Input (1, 3, 1200, 1300), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (200, 300)     |       2521.533 (+-14.857)       |        5025.217 (+-13.415)         |            2814.304 (+-6.742)           |     0.560 (+-0.000)      |         2520.308 (+-10.796)
      Input (1, 3, 1200, 1300), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (200, 300)    |       2531.204 (+-12.534)       |        5294.925 (+-11.994)         |            3147.590 (+-6.808)           |     0.594 (+-0.000)      |         2521.228 (+-11.732)
      Input (1, 3, 1200, 1300), torch.uint8, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (200, 300)         |        758.352 (+-10.362)       |        5639.912 (+-14.495)         |            3014.123 (+-8.799)           |     0.534 (+-0.000)      |          756.114 (+-4.792)
      Input (1, 3, 1200, 1300), torch.uint8, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (200, 300)        |        758.712 (+-5.781)        |         5927.541 (+-9.982)         |            3249.555 (+-7.226)           |     0.548 (+-0.000)      |          757.719 (+-5.653)
      Input (1, 3, 300, 400), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (600, 700)       |       1524.469 (+-12.860)       |        34321.641 (+-80.310)        |           19373.714 (+-56.351)          |     0.564 (+-0.000)      |         1518.082 (+-49.653)
      Input (1, 3, 300, 400), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (600, 700)      |       1521.746 (+-13.780)       |        35949.711 (+-81.010)        |           21782.366 (+-68.938)          |     0.606 (+-0.000)      |         1467.911 (+-15.901)
      Input (1, 3, 300, 400), torch.uint8, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (600, 700)           |        712.311 (+-5.361)        |        38826.510 (+-92.267)        |           20762.314 (+-59.303)          |     0.535 (+-0.000)      |          712.669 (+-4.673)
      Input (1, 3, 300, 400), torch.uint8, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (600, 700)          |        715.060 (+-4.757)        |        40269.353 (+-92.543)        |           22402.114 (+-81.574)          |     0.556 (+-0.000)      |          716.001 (+-8.945)

      Input (4, 3, 500, 400), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256)       |       2331.889 (+-29.159)       |        21541.096 (+-72.346)        |           12181.194 (+-45.288)          |     0.565 (+-0.000)      |         2304.864 (+-21.351)
      Input (4, 3, 500, 400), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256)      |       2333.697 (+-10.066)       |        22514.154 (+-57.798)        |           21709.449 (+-98.307)          |     0.964 (+-0.000)      |         2302.141 (+-13.041)
      Input (4, 3, 500, 400), torch.uint8, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256)           |        1198.768 (+-5.364)       |       37652.371 (+-101.644)        |           42740.413 (+-98.571)          |     1.135 (+-0.000)      |          1197.104 (+-7.225)
      Input (4, 3, 500, 400), torch.uint8, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256)          |        1196.851 (+-5.118)       |       39678.341 (+-173.750)        |           46807.738 (+-92.744)          |     1.180 (+-0.000)      |          1189.322 (+-5.681)
      Input (4, 3, 1200, 1300), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (200, 300)     |       10020.978 (+-54.855)      |        19955.290 (+-71.891)        |           11420.521 (+-53.179)          |     0.572 (+-0.000)      |         9999.583 (+-61.230)
      Input (4, 3, 1200, 1300), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (200, 300)    |       10066.441 (+-62.700)      |       21058.334 (+-183.414)        |           19986.577 (+-65.304)          |     0.949 (+-0.000)      |         10018.672 (+-59.188)
      Input (4, 3, 1200, 1300), torch.uint8, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (200, 300)         |       3171.135 (+-14.635)       |        19687.864 (+-54.320)        |           23313.699 (+-57.391)          |     1.184 (+-0.000)      |         3182.191 (+-17.686)
      Input (4, 3, 1200, 1300), torch.uint8, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (200, 300)        |       3181.314 (+-13.784)       |        20224.468 (+-50.827)        |          30541.963 (+-381.385)          |     1.510 (+-0.000)      |         3183.578 (+-16.203)
      Input (4, 3, 300, 400), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (600, 700)       |       5879.450 (+-31.551)       |       136918.555 (+-480.320)       |          77723.568 (+-331.766)          |     0.568 (+-0.000)      |         5726.061 (+-87.517)
      Input (4, 3, 300, 400), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (600, 700)      |       5882.869 (+-30.325)       |       143378.094 (+-513.842)       |         137244.074 (+-4827.730)         |     0.957 (+-0.000)      |         5727.679 (+-22.164)
      Input (4, 3, 300, 400), torch.uint8, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (600, 700)           |       2674.937 (+-45.003)       |      244829.360 (+-1930.579)       |         271283.073 (+-2243.245)         |     1.108 (+-0.000)      |         2676.054 (+-24.632)
      Input (4, 3, 300, 400), torch.uint8, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (600, 700)          |       2676.217 (+-16.601)       |      248658.668 (+-2904.952)       |         296514.520 (+-2983.281)         |     1.192 (+-0.000)      |         2682.844 (+-19.886)

      Input (1, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256)     |        1768.437 (+-6.294)       |        2934.013 (+-28.870)         |            2520.649 (+-6.797)           |     0.859 (+-0.000)      |          1759.292 (+-5.097)
      Input (1, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256)    |        1748.660 (+-5.550)       |         3271.104 (+-7.557)         |            2891.306 (+-7.632)           |     0.884 (+-0.000)      |          1746.341 (+-5.845)
      Input (1, 3, 500, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256)         |        2813.150 (+-6.656)       |         3258.973 (+-7.543)         |            2766.286 (+-6.473)           |     0.849 (+-0.000)      |          2805.077 (+-7.611)
      Input (1, 3, 500, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256)        |        2812.102 (+-8.211)       |         3568.780 (+-9.018)         |            3125.870 (+-7.324)           |     0.876 (+-0.000)      |          2834.178 (+-9.034)
      Input (1, 3, 1200, 1300), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (200, 300)   |        1687.975 (+-9.527)       |         2752.085 (+-9.627)         |            2373.274 (+-7.888)           |     0.862 (+-0.000)      |          1698.782 (+-8.098)
      Input (1, 3, 1200, 1300), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (200, 300)  |        1696.606 (+-8.678)       |        3056.317 (+-13.303)         |           2699.160 (+-10.638)           |     0.883 (+-0.000)      |         1684.942 (+-10.519)
      Input (1, 3, 1200, 1300), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (200, 300)       |        2613.491 (+-9.769)       |        3176.493 (+-13.366)         |            2730.193 (+-9.573)           |     0.859 (+-0.000)      |          2625.085 (+-9.943)
      Input (1, 3, 1200, 1300), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (200, 300)      |       2614.946 (+-34.129)       |        3465.398 (+-11.165)         |           3044.396 (+-11.447)           |     0.879 (+-0.000)      |          2627.355 (+-9.608)
      Input (1, 3, 300, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (600, 700)     |       10784.549 (+-58.181)      |        18292.452 (+-59.344)        |           15909.922 (+-49.864)          |     0.870 (+-0.000)      |         10837.656 (+-51.947)
      Input (1, 3, 300, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (600, 700)    |       10786.513 (+-52.308)      |        20449.038 (+-56.204)        |           18295.997 (+-54.522)          |     0.895 (+-0.000)      |         10843.751 (+-44.781)
      Input (1, 3, 300, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (600, 700)         |       17532.699 (+-64.807)      |        20425.699 (+-80.271)        |           17517.040 (+-79.705)          |     0.858 (+-0.000)      |         17595.597 (+-61.870)
      Input (1, 3, 300, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (600, 700)        |       17530.816 (+-55.131)      |        22450.080 (+-92.899)        |           19827.828 (+-77.649)          |     0.883 (+-0.000)      |         17615.934 (+-71.716)

      Input (4, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256)     |       6875.484 (+-40.543)       |        11569.509 (+-62.462)        |          10053.350 (+-208.136)          |     0.869 (+-0.000)      |         6864.501 (+-46.747)
      Input (4, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256)    |       6843.126 (+-44.498)       |        12915.236 (+-60.654)        |          25335.058 (+-382.640)          |     1.962 (+-0.000)      |         6899.002 (+-46.861)
      Input (4, 3, 500, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256)         |       11103.418 (+-51.318)      |        28834.389 (+-78.395)        |          37405.463 (+-581.646)          |     1.297 (+-0.000)      |         11223.012 (+-60.709)
      Input (4, 3, 500, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256)        |       11092.994 (+-70.835)      |       36597.023 (+-118.988)        |           45761.267 (+-85.051)          |     1.250 (+-0.000)      |         11104.014 (+-61.288)
      Input (4, 3, 1200, 1300), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (200, 300)   |       7106.791 (+-63.666)       |        11191.071 (+-45.402)        |           9786.037 (+-75.781)           |     0.874 (+-0.000)      |         7129.419 (+-77.674)
      Input (4, 3, 1200, 1300), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (200, 300)  |       7146.519 (+-28.376)       |        12443.571 (+-39.425)        |           20147.067 (+-74.771)          |     1.619 (+-0.000)      |         7179.622 (+-64.847)
      Input (4, 3, 1200, 1300), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (200, 300)       |       10533.849 (+-44.227)      |       34814.909 (+-138.127)        |          42803.001 (+-114.326)          |     1.229 (+-0.000)      |         10644.039 (+-59.681)
      Input (4, 3, 1200, 1300), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (200, 300)      |       10548.910 (+-44.221)      |       42876.940 (+-146.959)        |          49711.443 (+-139.276)          |     1.159 (+-0.000)      |         10652.375 (+-44.174)
      Input (4, 3, 300, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (600, 700)     |      42814.521 (+-103.198)      |       73100.489 (+-435.262)        |          63587.659 (+-134.266)          |     0.870 (+-0.000)      |        43208.921 (+-195.287)
      Input (4, 3, 300, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (600, 700)    |      42812.373 (+-103.870)      |       81769.160 (+-373.369)        |         175159.813 (+-2028.558)         |     2.142 (+-0.000)      |         43007.691 (+-96.358)
      Input (4, 3, 300, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (600, 700)         |      69955.505 (+-373.373)      |      215248.616 (+-2040.775)       |         267511.246 (+-2094.161)         |     1.243 (+-0.000)      |        70382.679 (+-594.941)
      Input (4, 3, 300, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (600, 700)        |      69852.157 (+-490.076)      |      242841.484 (+-19645.513)      |         317931.678 (+-2016.498)         |     1.309 (+-0.000)      |        70074.819 (+-352.919)

Times are in microseconds (us).

[-------------------------------------------------------------------------------------------------------------------------------------------------------- Interpolate, cuda ---------------------------------------------------------------------------------------------------------------------------------------------------------]
                                                                                                                                                     |  Eager (2.4.0a0+git0c61c20) PR  |  Compiled (2.4.0a0+git0c61c20) PR  |  Compiled (2.4.0a0+git069270d) Nightly  |  speed-up PR vs Nightly  |  Eager (2.4.0a0+git069270d) Nightly
1 threads: ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input (1, 3, 2345, 2456), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (1234, 1345)   |         97.727 (+-0.018)        |          97.765 (+-0.025)          |             97.773 (+-0.027)            |     1.000 (+-0.000)      |           97.905 (+-0.040)
      Input (1, 3, 2345, 2456), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (1234, 1345)  |         97.615 (+-0.066)        |          97.332 (+-0.032)          |             97.950 (+-0.026)            |     1.006 (+-0.000)      |           97.690 (+-0.062)
      Input (1, 3, 2345, 2456), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (1234, 1345)       |        100.635 (+-0.033)        |         125.883 (+-0.020)          |            102.499 (+-0.116)            |     0.814 (+-0.000)      |          101.103 (+-0.027)
      Input (1, 3, 2345, 2456), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (1234, 1345)      |        100.898 (+-0.036)        |         109.717 (+-0.336)          |            102.558 (+-0.120)            |     0.935 (+-0.000)      |          101.642 (+-0.105)
      Input (4, 3, 2345, 2456), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (1234, 1345)   |        462.853 (+-0.028)        |         382.475 (+-0.047)          |            382.472 (+-0.033)            |     1.000 (+-0.000)      |          462.188 (+-0.014)
      Input (4, 3, 2345, 2456), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (1234, 1345)  |        462.783 (+-0.021)        |         382.806 (+-0.037)          |            382.563 (+-0.043)            |     0.999 (+-0.000)      |          462.089 (+-0.028)
      Input (4, 3, 2345, 2456), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (1234, 1345)       |        466.721 (+-0.022)        |         384.438 (+-0.027)          |            384.886 (+-0.037)            |     1.001 (+-0.000)      |          467.014 (+-0.025)
      Input (4, 3, 2345, 2456), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (1234, 1345)      |        466.993 (+-0.032)        |         384.212 (+-0.009)          |            383.946 (+-0.029)            |     0.999 (+-0.000)      |          466.575 (+-0.020)
      Input (1, 3, 1234, 1345), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (2345, 2456)   |        190.070 (+-0.082)        |         209.353 (+-1.096)          |            202.870 (+-0.888)            |     0.969 (+-0.000)      |          189.371 (+-0.164)
      Input (1, 3, 1234, 1345), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (2345, 2456)  |        190.021 (+-0.018)        |         210.504 (+-0.456)          |            201.814 (+-0.770)            |     0.959 (+-0.000)      |          189.314 (+-0.036)
      Input (1, 3, 1234, 1345), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (2345, 2456)       |        188.860 (+-0.207)        |         336.635 (+-0.023)          |            252.026 (+-0.510)            |     0.749 (+-0.000)      |          188.860 (+-0.170)
      Input (1, 3, 1234, 1345), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (2345, 2456)      |        188.725 (+-0.214)        |         276.329 (+-0.563)          |            251.439 (+-0.524)            |     0.910 (+-0.000)      |          188.776 (+-0.189)
      Input (4, 3, 1234, 1345), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (2345, 2456)   |        781.879 (+-0.086)        |         836.389 (+-7.177)          |            816.483 (+-6.626)            |     0.976 (+-0.000)      |          781.362 (+-0.106)
      Input (4, 3, 1234, 1345), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (2345, 2456)  |        781.824 (+-0.099)        |         840.406 (+-7.111)          |            807.530 (+-6.514)            |     0.961 (+-0.000)      |          781.307 (+-0.129)
      Input (4, 3, 1234, 1345), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (2345, 2456)       |        769.290 (+-0.309)        |         675.498 (+-1.537)          |            688.171 (+-4.326)            |     1.019 (+-0.000)      |          769.830 (+-0.222)
      Input (4, 3, 1234, 1345), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (2345, 2456)      |        769.240 (+-0.179)        |         675.800 (+-1.113)          |            673.176 (+-1.740)            |     0.996 (+-0.000)      |          769.935 (+-0.171)

Times are in microseconds (us).

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120411
Approved by: https://github.com/lezcano
2024-03-29 13:15:25 +00:00
d94db5f6ee Enable x86 CPU vectorization on windows [submodule sleef] (#118980)
Enable VEC on Windows OS.
1. Fix some type definition gaps between Windows and Linux.
2. Fix some operators not supported on Windows, such as `[]` and `/`.
3. Enable static sleef library build on Windows.
4. Disable unsupported function overloading on MSVC.
5. Upgrade the sleef submodule, which fixes a build issue on Windows.
6. Fix bazel build issues.
7. Fix the test app not linking to sleef on Windows.

Note: If the rebuild fails after pulling this PR, please sync the `sleef` submodule by running:
```cmd
git submodule sync
git submodule update --init --recursive
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118980
Approved by: https://github.com/jgong5, https://github.com/ezyang, https://github.com/malfet
2024-03-29 07:28:31 +00:00
35c56f85fd [dynamo][pt2d] avoid skipping modules from torch/testing/_internal (#122851)
Dynamo skips user-defined modules from `torch/testing/_internal` (e.g. MLP, Transformer). This PR adds `torch/testing/_internal/...` to `manual_torch_name_rule_map`, which ensures FSDP CI + torch.compile are meaningfully tested.

unit test shows frame count = 0 before and frame count > 0 after
```pytest test/dynamo/test_trace_rules.py -k test_module_survive_skip_files```

Some FSDP unit tests actually start to compile modules with this change; add a triton availability check or disable those tests for now.
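
As a rough illustration (the module import and constructor below are assumptions; CompileCounter is Dynamo's test backend that counts compiled frames), this is the kind of check the rule-map change enables:

```python
import torch
from torch._dynamo.testing import CompileCounter
# Assumed location/signature of the internal test MLP; any small nn.Module under
# torch/testing/_internal exercises the same code path.
from torch.testing._internal.common_fsdp import MLP

cnt = CompileCounter()
mod = torch.compile(MLP(16), backend=cnt)
mod(torch.randn(2, 16))
assert cnt.frame_count > 0  # was 0 while Dynamo skipped torch/testing/_internal
```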

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122851
Approved by: https://github.com/jansel
2024-03-29 06:42:06 +00:00
10bdf64427 Properly pexpr the actual sympy.Expression, don't repr it. (#122893)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122893
Approved by: https://github.com/albanD, https://github.com/desertfire, https://github.com/jansel
2024-03-29 06:40:19 +00:00
ed37fbdf60 made gpt_fast benchmark run faster (#122872)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122872
Approved by: https://github.com/msaroufim, https://github.com/yifuwang
ghstack dependencies: #122848
2024-03-29 03:49:19 +00:00
b9c9f037d1 Added some checkpointing tests (#122848)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122848
Approved by: https://github.com/anijain2305
2024-03-29 03:49:19 +00:00
b6201a60c5 [BE] minor logging cleanup in distributed (#122921)
Summary:
    Minor logging cleanup in distributed library
    1. Don't use "f" formatted strings - address linter issues.
    2. Nits: Make use of unused `e` (error) in a few logs.
    3. Change info->debug as asked in issue #113545
    4. Nit: rename log -> logger in a few files for consistency
    5. Fix a linter error.

    Test Plan:
    1. Local build passes.
    2. Linter is happy.

    Reviewers: wanchaol

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122921
Approved by: https://github.com/wanchaol
2024-03-29 03:34:01 +00:00
6a45809580 Simplify forward AD missing support error (#122639)
This thing about jit decomposition confuses users greatly and I'm not sure what it adds. So removing it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122639
Approved by: https://github.com/soulitzer
2024-03-29 02:11:46 +00:00
76d8020e62 Add tests for pre_dispatch + run_decomp flow and taskify failures (#122508)
Differential Revision: [D55448616](https://our.internmc.facebook.com/intern/diff/D55448616)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122508
Approved by: https://github.com/angelayi, https://github.com/zhxchen17
2024-03-29 01:47:07 +00:00
cyy
f041df8530 Fix order conditioning of norm kernel (#122874)
NormOneOps is not executed due to an incorrect comparison; this PR fixes it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122874
Approved by: https://github.com/Skylion007
2024-03-29 00:28:13 +00:00
6b8205d3de Revert "Support map in pre-dispatch functionalization (#121444)"
This reverts commit 079feea3379c021a330dbfac7668a5fc8fccc3bd.

Reverted https://github.com/pytorch/pytorch/pull/121444 on behalf of https://github.com/clee2000 due to sorry windows failure seems related 079feea337 https://github.com/pytorch/pytorch/actions/runs/8474191301/job/23220791555. PR got force merged before windows job finished ([comment](https://github.com/pytorch/pytorch/pull/121444#issuecomment-2026323614))
2024-03-28 23:42:26 +00:00
16771747c2 Add tensor step and capturable support to rprop (#122261)
Towards fixing https://github.com/pytorch/pytorch/issues/115679
Fixes Rprop step update while compiling

Also adds capturable support + testing
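
A minimal usage sketch (assuming a CUDA build; `capturable` is the flag this PR adds to Rprop) of compiling the optimizer step:

```python
import torch

model = torch.nn.Linear(8, 8, device="cuda")
opt = torch.optim.Rprop(model.parameters(), lr=1e-2, capturable=True)

@torch.compile
def step():
    opt.step()  # with a tensor `step`, the update compiles instead of graph-breaking

loss = model(torch.randn(4, 8, device="cuda")).sum()
loss.backward()
step()
```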

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122261
Approved by: https://github.com/janeyx99
2024-03-28 23:31:18 +00:00
e63e013c3b Skip use_count() debug assert for _nested_get_offsets() (#122917)
This broke [internal tests](https://www.internalfb.com/intern/test/844425064039866/) that run with unset `NDEBUG`. It wasn't initially caught because we don't test with unset `NDEBUG` in OSS CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122917
Approved by: https://github.com/soulitzer
ghstack dependencies: #122902
2024-03-28 23:19:17 +00:00
6fc5ad931c Use zeros for NJT dummy to avoid messing with randomness (#122902)
Use of randomness was breaking vmap.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122902
Approved by: https://github.com/vmoens, https://github.com/zou3519
2024-03-28 22:09:31 +00:00
f476d707fd Remove previous grad impl. in torch dynamo (#122215)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122215
Approved by: https://github.com/zou3519
2024-03-28 22:00:23 +00:00
079feea337 Support map in pre-dispatch functionalization (#121444)
When we enter map_autograd, we try to trace through the fwd/bwd of a map operator that is wrapped in the ctx.functionalize wrapper. This forces us to go through PreDispatch functionalization again (only the Python part). As a result, it revealed a previous bug where pre-dispatch mode handling doesn't actually manage the local dispatch key set (if there is no active mode, we need to turn off the PreDispatch key). This PR fixes that. I also shuffled some APIs around so that there is less code duplication, as the setting/unsetting logic is quite hard to get right.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121444
Approved by: https://github.com/bdhirsh
2024-03-28 21:56:36 +00:00
481c9bb1fc Upgrade submodule oneDNN to v3.3.6 (#122164)
As the title says. Includes fixes for the following aarch64 issues:
- https://github.com/oneapi-src/oneDNN/pull/1831
- https://github.com/oneapi-src/oneDNN/pull/1834

---

## Validation results
(on Intel CPU + Linux)
**Static quantization with Inductor on CV models**

Quant method | Geomean throughput ratio (v3.3.6/baseline)
-- | --
ptq | 0.982937
ptq (cpp wrapper) | 0.978384
qat | 0.978828

**Torchbench cpu userbenchmark with Inductor**

Items | Perf Geomean Ratio (v3.3.6/baseline)
-- | --
eager_throughtput_bf16_infer | 1.00x
eager_throughtput_fp32_infer | 1.00x
jit_llga_throughtput_amp_bf16 | 1.01x
jit_llga_throughtput_fp32 | 1.00x
eager_throughtput_fx_int8 | 1.00x
eager_throughtput_bf16_train | 1.46x
eager_throughtput_fp32_train | 1.41x

**Dynamo benchmarks tests**
Precision | Shape | Wrapper | Thread | Eager old/new GEOMEAN | Inductor old/new GEOMEAN
-- | -- | -- | -- | -- | --
Float32 | Static | Default | Multiple | 1.003836812 | 1.003425
Float32 | Static | Default | Single | 1.000181451 | 0.999611
Float32 | Dynamic | Default | Multiple | 1.003980183 | 1.006563
Float32 | Dynamic | Default | Single | 1.000076939 | 0.999969
AMP | Static | Default | Multiple | 0.996824772 | 0.998715
AMP | Static | Default | Single | 0.996402574 | 1.001483
AMP | Dynamic | Default | Multiple | 0.994919866 | 1.000467
AMP | Dynamic | Default | Single | 0.9962054 | 1.000767

(on Aarch64)
https://github.com/pytorch/pytorch/pull/122164#issuecomment-2007912919

---

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122164
Approved by: https://github.com/snadampal, https://github.com/malfet, https://github.com/atalman
2024-03-28 21:36:27 +00:00
3924d2189c [FSDP2] Simplified _move_states_to_device (#122907)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122907
Approved by: https://github.com/Skylion007
2024-03-28 21:22:59 +00:00
3beb9d85a6 Revert "Add non strict inline constraints and runtime assertions to non-strict exported program (#122722)"
This reverts commit b693fff5d72b249d39436ced577a88d3b866bbba.

Reverted https://github.com/pytorch/pytorch/pull/122722 on behalf of https://github.com/BoyuanFeng due to This breaks torchrec.distributed.tests.test_pt2.TestPt2: test_kjt__getitem__ ([comment](https://github.com/pytorch/pytorch/pull/122722#issuecomment-2026078351))
2024-03-28 20:42:35 +00:00
8852b09abc [FSDP2] Used _chunk_cat for reduce-scatter copy-in (#122888)
This PR uses `_chunk_cat` to fuse, into a single op, padding gradients on dim-0, chunking them into `world_size` chunks, and copying them into the reduce-scatter input.
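
For intuition, here is a rough, unfused sketch of the semantics being fused (it mirrors the idea, not the actual kernel or FSDP2's exact memory layout):

```python
import torch
import torch.nn.functional as F

def chunk_cat_reference(grads, world_size):
    per_rank = [[] for _ in range(world_size)]
    for g in grads:
        pad_rows = (-g.shape[0]) % world_size              # pad dim-0 to a multiple of world_size
        g = F.pad(g, [0, 0] * (g.dim() - 1) + [0, pad_rows])
        for i, chunk in enumerate(g.chunk(world_size, dim=0)):
            per_rank[i].append(chunk.reshape(-1))
    # reduce-scatter input: all chunks destined for rank i are laid out contiguously
    return torch.cat([torch.cat(chunks) for chunks in per_rank])

out = chunk_cat_reference([torch.randn(5), torch.randn(7, 3)], world_size=4)
```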

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122888
Approved by: https://github.com/yifuwang, https://github.com/BoyuanFeng, https://github.com/weifengpy
ghstack dependencies: #122726, #122847
2024-03-28 20:35:45 +00:00
8df99732a4 Revert "Workaround dind-rootless volumes mount as root (#122787)"
This reverts commit 84dc76156a0b8a73e56d80c3947ed9dd03c5ac5e.

Reverted https://github.com/pytorch/pytorch/pull/122787 on behalf of https://github.com/zxiiro due to This broke rocm tests ([comment](https://github.com/pytorch/pytorch/pull/122787#issuecomment-2026022659))
2024-03-28 20:10:19 +00:00
dacc73669c [export] Make quantizer compatible with the standard nn_module_stack. (#122819)
Summary: When we migrate to torch.export, we won't put L['self'] as the prefix for all the FQNs in nn_module_stack. This diff adds a branch to handle the new case.

Test Plan: buck test mode/opt caffe2/test/quantization:test_quantization -- -r set_module_name

Differential Revision: D55436617

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122819
Approved by: https://github.com/tugsbayasgalan
2024-03-28 19:36:46 +00:00
384de46395 [aoti] clear precomputed symbol replacements before cpp wrapper compilation (#122882)
After we codegen a triton kernel in the triton codegen backend,
we cache the generated triton source code in the wrapper to avoid
producing multiple triton kernels with the same content.

In AOTI compilation flow, this caching mechanism imposes a strong requirement
on the codegen that we must generate the same triton source code
for the same schedule node in both python and cpp codegen phases.
Otherwise, we would end up with a mismatch between the kernel name
formed in the cpp codegen and the cuda kernel key produced from
the python codegen. Consequently, we would hit a missing-cuda-kernel
error.

The precomputed symbol replacements saved in V.graph.sizevars
can cause such source-code inconsistency related to the code for indexing
tensors. For example, let's say in the python codegen phase,
we produce "ks2\*48" as part of indexing an input for schedule
node A while yielding a replacement pair "ks0 -> ks2\*48" in
the precomputed replacements. In the second cpp codegen phase,
we would produce "ks0" for the same indexing code of schedule
node A due to the "ks0 -> ks2*48" replacement pair.

This PR fixes the issue by clearing precomputed_replacements
and inv_precomputed_replacements before cpp wrapper codegen.
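
The toy sketch below (not Inductor's real data structures) illustrates the invariant: both passes must render the same indexing expression, and carrying a precomputed replacement into the second pass breaks that.

```python
import sympy

ks2 = sympy.Symbol("ks2")
precomputed_replacements = {}  # state that used to survive into the cpp wrapper pass

def render_index(expr):
    for short, long_expr in precomputed_replacements.items():
        if expr == long_expr:
            return str(short)            # later passes reuse the precomputed symbol
    short = sympy.Symbol(f"ks{len(precomputed_replacements)}")
    precomputed_replacements[short] = expr
    return str(expr)                     # the first pass emits the full expression

python_pass = render_index(ks2 * 48)     # "48*ks2", and records ks0 -> ks2*48
cpp_pass = render_index(ks2 * 48)        # "ks0" -> generated source no longer matches

precomputed_replacements.clear()         # the fix: clear before cpp wrapper codegen
cpp_pass_fixed = render_index(ks2 * 48)  # "48*ks2" again, matching the python pass
print(python_pass, cpp_pass, cpp_pass_fixed)
```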

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122882
Approved by: https://github.com/desertfire
2024-03-28 19:06:29 +00:00
646dd1ab8d Rewrite quantized conv transpose2d for vulkan (#122547)
Summary: Vulkan rewrite so that quantized transpose 2d ops can run in a model

Test Plan:
Run vulkan api test:
# buck2 build --target-platforms ovr_config//platform/macos:arm64-fbsource  //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output"
# buck-out//v2/gen/fbsource/xplat/caffe2/pt_vulkan_api_test_binAppleMac
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
[==========] Running 418 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 418 tests from VulkanAPITest
....
[----------] Global test environment tear-down
[==========] 418 tests from 1 test suite ran. (4510 ms total)
[  PASSED  ] 417 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log

  YOU HAVE 9 DISABLED TESTS

Run quantized vulkan api test: Note the linear quantized are failing but all the convolution tests still pass. Linear failures are being debugged.
# buck2 build --target-platforms ovr_config//platform/macos:arm64-fbsource  //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output"
# buck-out//v2/gen/fbsource/xplat/caffe2/pt_vulkan_quantized_api_test_binAppleMac
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
[==========] Running 86 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 86 tests from VulkanAPITest
...
[  PASSED  ] 77 tests.
[  FAILED  ] 9 tests, listed below:
[  FAILED  ] VulkanAPITest.linear_2d_flat
[  FAILED  ] VulkanAPITest.linear_2d_small
[  FAILED  ] VulkanAPITest.linear_2d_large
[  FAILED  ] VulkanAPITest.linear_3d_flat
[  FAILED  ] VulkanAPITest.linear_3d_small
[  FAILED  ] VulkanAPITest.linear_3d_large
[  FAILED  ] VulkanAPITest.linear_4d_flat
[  FAILED  ] VulkanAPITest.linear_4d_small
[  FAILED  ] VulkanAPITest.linear_4d_large

 9 FAILED TESTS
  YOU HAVE 8 DISABLED TESTS

# Run CUNET quantized model on hibiki board.

Reviewed By: manuelcandales

Differential Revision: D52344263

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122547
Approved by: https://github.com/manuelcandales, https://github.com/copyrightly, https://github.com/yipjustin
2024-03-28 18:51:44 +00:00
71b5b7e081 Let dynamo trace some functions in functorch.deprecated.* namespace (#121665)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121665
Approved by: https://github.com/zou3519
ghstack dependencies: #121410
2024-03-28 18:50:43 +00:00
966ae943df Add wrapper for fbgemm quantization operations (#122763)
Summary:
We add wrappers for fbgemm's packing so we can pass it through PT2 to the
lowering phase of AOTInductor.

Test Plan:
Included in commit.
test_quantized_ops::test_wrapped_fbgemm_linear_fp16

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D55433204](https://our.internmc.facebook.com/intern/diff/D55433204)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122763
Approved by: https://github.com/jerryzh168
ghstack dependencies: #122762
2024-03-28 18:41:18 +00:00
e296722e0e Z3 validation: Lift operators later when we actually run with Z3 (#122791)
Previously, we lifted operators by putting them into the FX graph, limiting
the applicability of the FX graph to only Z3.  Now, we lift operators
when we are interpreting, which means I can use the graph for other
things.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122791
Approved by: https://github.com/Chillee, https://github.com/lezcano
2024-03-28 18:31:30 +00:00
3d2d7ba19d Delete torch.autograd.function.traceable APIs (#122817)
We deprecated them in 2.3 with plans to delete in 2.4. Very few OSS
repos use this flag at all and it also does nothing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122817
Approved by: https://github.com/albanD
2024-03-28 18:24:15 +00:00
a3b30851c5 Add quantized.linear_unpacked_dynamic_fp16 (#122762)
Summary:

We add a new op quantized.linear_unpacked_dynamic_fp16, which is essentially linear_dynamic_fp16 with a different (unpacked) weight/bias format.
This op does packing on the fly for each call with standard at::Tensor weight & bias.

Test Plan:
Included in commit.
test_quantized_op::test_unpacked_qlinear_dynamic_fp16

Differential Revision: [D55433203](https://our.internmc.facebook.com/intern/diff/D55433203)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122762
Approved by: https://github.com/jerryzh168
2024-03-28 18:02:27 +00:00
59f6393209 [docs] Update PT2+Profiler docs (#122272)
Document:
* Torch-Compiled Region
* What to expect in kernels inside a torch-compiled region

For review, see https://docs-preview.pytorch.org/pytorch/pytorch/122272/torch.compiler_profiling_torch_compile.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122272
Approved by: https://github.com/aaronenyeshi
2024-03-28 17:52:28 +00:00
091a24495b [AOTInductor] Support use_runtime_constant_folding for CPU. (#122563)
Summary:
We allow CPU to use the config use_runtime_constant_folding.
Changes include:
1. Rearrange USE_CUDA flags. Add CPU sections that consume memory directly.
2. Codegen changes to accommodate cpp fusions for CPU only. Specifically, we shouldn't generate two headers that would cause re-declaration.
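
A rough usage sketch (the option spelling mirrors the Inductor config name and, like the `aot_compile` entry point used here, should be treated as an assumption) of enabling this for a CPU model:

```python
import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.w = torch.nn.Parameter(torch.randn(16, 16))

    def forward(self, x):
        return x @ (self.w + self.w)  # parameter-only subexpression, foldable at load time

so_path = torch._export.aot_compile(
    M(),
    (torch.randn(4, 16),),
    options={"aot_inductor.use_runtime_constant_folding": True},
)
```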

Test Plan: Activate tests that were deactivated for CPU before.

Reviewed By: khabinov

Differential Revision: D55234300

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122563
Approved by: https://github.com/chenyang78
2024-03-28 17:49:05 +00:00
8a33a77fd1 Back out "Added a check in register_lowering to avoid decomposed ops (#117632)" (#122709)
Summary:
Original commit changeset: ebda663a196b

Original Phabricator Diff: D55271788

Test Plan: Some models are failing torch compile with this, retrying the tests

Reviewed By: colinchan15

Differential Revision: D55374457

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122709
Approved by: https://github.com/huydhn
2024-03-28 17:46:57 +00:00
4670dcc94c [Inductor]Fix a couple of broken unit tests (#122714)
Summary: As titled.

Test Plan:
```
buck2 test mode/dev-nosan //caffe2/test/inductor:group_batch_fusion
```
Buck UI: https://www.internalfb.com/buck2/ad05a43c-cb4a-443e-8904-b4d53e4f4b1e
Test UI: https://www.internalfb.com/intern/testinfra/testrun/13510798909218388
Network: Up: 107KiB  Down: 28KiB  (reSessionID-d7146e4f-773a-46ea-9852-f10f59302479)
Jobs completed: 24. Time elapsed: 1:49.3s.
Cache hits: 0%. Commands: 2 (cached: 0, remote: 0, local: 2)
Tests finished: Pass 9. Fail 0. Fatal 0. Skip 0. Build failure 0

```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor/fb:split_cat_fx_passes_fb
```

Buck UI: https://www.internalfb.com/buck2/82dbf3b0-c747-4c07-98b8-53b69afa3157
Test UI: https://www.internalfb.com/intern/testinfra/testrun/1125900267699118
Network: Up: 1.4GiB  Down: 2.3GiB  (reSessionID-0bd22c6d-5dfe-4b4a-bc24-705eadac884b)
Jobs completed: 252570. Time elapsed: 7:25.2s.
Cache hits: 95%. Commands: 123778 (cached: 117999, remote: 2779, local: 3000)
Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. Build failure 0

Differential Revision: D55378009

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122714
Approved by: https://github.com/SherlockNoMad
2024-03-28 17:44:30 +00:00
07f94df1a6 [torch quantization]fix HistogramObserver OOM when (self.max_val - self.min_val) is too small (#122659)
Differential Revision: D55347133

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122659
Approved by: https://github.com/jerryzh168
2024-03-28 17:41:21 +00:00
d65b9dff73 [AMD] turn off triton memcache for amd devices (#122560)
Summary:
triton memcache is not supported on amd devices yet and causes torch.compile to fail

Created from CodeHub with https://fburl.com/edit-in-codehub

Test Plan:
ci

Sandcastle run

Differential Revision: D55285655

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122560
Approved by: https://github.com/jansel
2024-03-28 17:38:21 +00:00
d9a08de9a4 Add Opinfo entries for HOP testing (#122265)
In this PR, we add a systematic way to test that all HOPs are exportable, as the export team has been running into various bugs related to newly added HOPs due to a lack of tests. We do this by creating:
- hop_db -> a list of HOP OpInfo tests which is then used inside various flows including export functionalities: aot-export, pre-dispatch export, retrace, and ser/der

For now, we also create an allowlist so that people can bypass the failures, but we should discourage people from doing that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122265
Approved by: https://github.com/ydwu4, https://github.com/zou3519
2024-03-28 17:36:43 +00:00
0bfa9f4758 [ROCm][ATen][Native] Fix kernel cache selecting kernels for incorrect architectures (#121401)
Fixes #120794

Torch creates a cache of compiled kernels at $HOME/.cache/torch/kernels. The names used to save and select the cached kernels use cuda_major and cuda_minor to identify the GPU architecture for which the kernels were compiled. On ROCm this is insufficient, because cudaDeviceProp's cuda_major and cuda_minor are mapped to hipDeviceProp_t::major and hipDeviceProp_t::minor, which correspond to the first and second numbers of the LLVM target for the architecture in question:

GFX1030 is major = 10, minor = 3
GFX1032 is major = 10, minor = 3
GFX900 is major = 9,  minor = 0
GFX906 is major = 9,  minor = 0
GFX908 is major = 9,  minor = 0

Thus it can be seen that hipDeviceProp_t::major and hipDeviceProp_t::minor are insufficient to uniquely identify the ROCm architecture. This causes the ROCm runtime to raise an error when an operation uses a cached kernel that was first cached on an architecture with the same hipDeviceProp_t::major and hipDeviceProp_t::minor but a different LLVM target.

The solution provided in this PR is to replace the use of hipDeviceProp_t::major/hipDeviceProp_t::minor with hipDeviceProp_t::gcnArchName when PyTorch is compiled for ROCm; gcnArchName contains a string identical to the LLVM target of the architecture in question.
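
A small illustration (toy code, not the actual cache implementation) of why the (major, minor) pair collides while gcnArchName does not:

```python
archs = {
    "gfx1030": (10, 3),
    "gfx1032": (10, 3),  # same (major, minor) as gfx1030, different ISA
    "gfx900": (9, 0),
    "gfx906": (9, 0),
    "gfx908": (9, 0),
}

def old_cache_key(major, minor):
    return f"kernel_{major}{minor}"   # gfx1030 and gfx1032 both map to "kernel_103"

def new_cache_key(gcn_arch_name):
    return f"kernel_{gcn_arch_name}"  # unique per LLVM target

assert old_cache_key(*archs["gfx1030"]) == old_cache_key(*archs["gfx1032"])  # the collision
assert new_cache_key("gfx1030") != new_cache_key("gfx1032")                  # the fix
```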

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121401
Approved by: https://github.com/jeffdaily, https://github.com/hongxiayang, https://github.com/malfet
2024-03-28 17:24:31 +00:00
9693797491 [PT2][Inductor][Observability] Improve the optimus scuba log (#122361)
Summary: As titled.

Test Plan:
```
buck2 test mode/dev-nosan //caffe2/test/inductor:group_batch_fusion
```
Test UI: https://www.internalfb.com/intern/testinfra/testrun/18014398535709463
Network: Up: 113KiB           Down: 480KiB           (reSessionID-1d2e3558-15b5-4a4e-8c5d-10c983afb389)
Discovered 9. Pass 0. Fail 0. Fatal 0. Skip 0. Timeout 0
Command: test.                                                                                 Remaining: 9/24. Cache hits: 0%. Time elapsed: 44.3s
Command: test.                                                                                 Remaining: 9/24. Cache hits: 0%. Time elapsed: 44.4s
Command: test.                                                                                 Remaining: 9/24. Cache hits: 0%. Time elapsed: 44.5s
Network: Up: 117KiB  Down: 507KiB  (reSessionID-1d2e3558-15b5-4a4e-8c5d-10c983afb389)
Jobs completed: 24. Time elapsed: 1:48.3s.
Cache hits: 0%. Commands: 2 (cached: 0, remote: 0, local: 2)
Tests finished: Pass 9. Fail 0. Fatal 0. Skip 0. Build failure 0
```
buck2 test mode/dev-nosan //caffe2/test/inductor:split_cat_fx_passes
```
Test UI: https://www.internalfb.com/intern/testinfra/testrun/16044073698893554
Network: Up: 120KiB  Down: 60KiB  (reSessionID-57f2c21b-3f4e-462b-9e5b-fe3dd15f6b7d)
Jobs completed: 28. Time elapsed: 1:47.5s.
Cache hits: 0%. Commands: 2 (cached: 0, remote: 0, local: 2)
Tests finished: Pass 11. Fail 0. Fatal 0. Skip 0. Build failure 0

optimus_scuba_log:
```
{'before_recompile_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GIbj2haUwKx69H8BAKXdGqXZSpoybr0LAAAz', 'group_batch_fusion_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GFqhiRYcJ_C4JFoDABKPTsfpzjJ_br0LAAAz', 'normalization_pass_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GIvswhaiAVyipcoGAJZ5sUi8Bb5qbr0LAAAz', 'remove_split_with_size_one_pass_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GFneTxcVBPaqVuwCADCiI4q1mEwlbr0LAAAz', 'merge_getitem_cat_pass_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GJc0Phn87ljuMO0CADBPGqqehKp2br0LAAAz', 'merge_splits_pass_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GLWB_BbvLyT7D_0DABmygDYPDjJ_br0LAAAz', 'after_recompile_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GO6eQBeIj6oV3o4JAFLzQ3ECMTIrbr0LAAAz', 'inductor_pre_grad': Counter({'pattern_matcher_nodes': 2006, 'pattern_matcher_count': 1806, 'normalization_pass': 861, 'remove_split_with_size_one_pass': 748, 'merge_splits_pass': 82, 'merge_getitem_cat_pass': 11, 'scmerge_split_sections_removed': 4, 'batch_layernorm': 1, 'batch_sigmoid': 1, 'scmerge_split_added': 1, 'scmerge_cat_added': 1, 'scmerge_split_removed': 1, 'scmerge_cat_removed': 1}), 'before_recompile_post_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GMoKmxYg6AUeQ40KAMDaJ4EVDwYmbr0LAAAz', 'group_batch_fusion_post_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GHIvQxkrV1PMBggEACv7786a2bE8br0LAAAz', 'after_recompile_post_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GIpBNxXupQTHWx8BALSiVrKgDbtfbr0LAAAz', 'inductor_post_grad': Counter({'pattern_matcher_nodes': 2093, 'pattern_matcher_count': 1893, 'normalization_pass': 861, 'remove_split_with_size_one_pass': 748, 'merge_splits_pass': 82, 'merge_getitem_cat_pass': 11, 'scmerge_split_sections_removed': 4, 'batch_layernorm': 1, 'batch_sigmoid': 1, 'scmerge_split_added': 1, 'scmerge_cat_added': 1, 'scmerge_split_removed': 1, 'scmerge_cat_removed': 1, 'batch_aten_mul': 1})}
```

Differential Revision: D55107000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122361
Approved by: https://github.com/jackiexu1992
2024-03-28 17:13:32 +00:00
049d68d8bb [inductor][Autotune] Add matrix_instr_nonkdim to triton_meta (#122852)
Summary: Previous work `https://github.com/pytorch/pytorch/pull/120742` to enable `matrix_instr_nonkdim` only dealt with the autotuner benchmarking, but failed to enable the parameter in Triton meta for real runs. `matrix_instr_nonkdim` needs to be visible to the compiler driver to set up the optimization pipeline, so it's unlike other kernel parameters such as `BLOCK_N` that can be just set inside the kernel itself.

Test Plan:
P1201466917

  triton_heuristics.template(
    num_stages=1,
    num_warps=4,
    triton_meta={'signature': {0: '*fp32', 1: '*fp32', 2: '*fp32'}, 'device': 0, 'device_type': 'cuda', 'constants': {}, 'configs': [instance_descriptor(divisible_by_16=(0, 1, 2), equal_to_1=(), ids_of_folded_args=(), divisible_by_8=())], 'matrix_instr_nonkdim': 16},
    inductor_meta={'kernel_name': 'triton_tem_fused_mm_0', 'backend_hash': None},
  )

Perf :
Before: 1.693ms    0.134GB    79.28GB/s
After:    1.577ms    0.134GB    85.12GB/s

Differential Revision: D55456401

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122852
Approved by: https://github.com/xw285cornell
2024-03-28 16:58:38 +00:00
1e8d4b389b Super tiny fix typo (#122881)
"CustoType" -> "CustomType"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122881
Approved by: https://github.com/awgu
2024-03-28 16:13:25 +00:00
958dbb876c Revert "_foreach_copy with different src/dst dtypes (#121717)"
This reverts commit da2a9a05127c2b44e447e734d99e727d856cb36f.

Reverted https://github.com/pytorch/pytorch/pull/121717 on behalf of https://github.com/janeyx99 due to Causing IMAs on V100s internally :C ([comment](https://github.com/pytorch/pytorch/pull/121717#issuecomment-2025553295))
2024-03-28 15:54:40 +00:00
8698121636 Revert "Add RMSNorm module (#121364)"
This reverts commit a7306de0dc96cda8b698d19680a88d27aa45a31d.

Reverted https://github.com/pytorch/pytorch/pull/121364 on behalf of https://github.com/atalman due to Broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/121364#issuecomment-2025502007))
2024-03-28 15:31:10 +00:00
8007d9a34a Revert "[fx] Preserve Fx graph node order in partitioner across runs (#115621)"
This reverts commit f2c1060de3cdddbfefcab11e547211993d0f9cfa.

Reverted https://github.com/pytorch/pytorch/pull/115621 on behalf of https://github.com/atalman due to Broke internal executorch test ([comment](https://github.com/pytorch/pytorch/pull/115621#issuecomment-2025496296))
2024-03-28 15:28:02 +00:00
9208df45cb Fixed increasing CPU overhead of RemovableHandle.__init__ (#122847)
For some reason, if we construct `class Handle(RemovableHandle)` inside `register_multi_grad_hook`, then over time the call to `RemovableHandle.__init__` slows down more and more (when we have GC disabled). Perhaps this is related to the class attribute `next_id: int = 0`. Python experts: please let me know if you have thoughts 😅

I am open to any suggestions on how we should deal with this `Handle` class. For now, I changed it to a private `_MultiHandle`.

<details>
<summary> Experiment Script </summary>

```
import gc
import time

import torch

NUM_TENSORS = int(5e4)
ts = [torch.empty(1, requires_grad=True) for _ in range(NUM_TENSORS)]

def hook(grad) -> None:
    return

gc.disable()
times = []
for i, t in enumerate(ts):
    start_time = time.time()

    torch.autograd.graph.register_multi_grad_hook([t], hook)

    end_time = time.time()
    times.append(end_time - start_time)

print([f"{t * 1e6:.3f} us" for t in times[1:6]])  # print first few times
print([f"{t * 1e6:.3f} us" for t in times[-5:]])  # print last few times

times = []
for i, t in enumerate(ts):
    start_time = time.time()

    t.register_hook(hook)

    end_time = time.time()
    times.append(end_time - start_time)

print([f"{t * 1e6:.3f} us" for t in times[1:6]])  # print first few times
print([f"{t * 1e6:.3f} us" for t in times[-5:]])  # print last few times
```
</details>

<details>
<summary> Results </summary>

Before fix:
```
['23.603 us', '19.550 us', '15.497 us', '12.875 us', '13.828 us']
['327.110 us', '341.177 us', '329.733 us', '332.832 us', '341.177 us']
['318.050 us', '315.189 us', '319.719 us', '311.613 us', '308.990 us']
['374.317 us', '394.821 us', '350.714 us', '337.362 us', '331.402 us']
```
Calling `register_multi_grad_hook` makes subsequent calls to itself and to `register_hook` slower (actually, any call to `RemovableHandle.__init__`).

After fix:
```
['13.590 us', '9.060 us', '12.875 us', '7.153 us', '8.583 us']
['4.530 us', '5.245 us', '6.437 us', '4.768 us', '5.007 us']
['2.623 us', '1.907 us', '1.431 us', '1.669 us', '1.192 us']
['1.431 us', '1.431 us', '1.192 us', '1.192 us', '1.431 us']
```
</details>

Update: from @soulitzer

> Your suspicion about next_id is right. I think what is happening is that whenever a class attribute is set, it needs to invalidate some cached data for the subclasses one-by-one. eefff682f0/Objects/typeobject.c (L845)
And this PR fixes the issue by avoiding creating many subclasses dynamically. Changing next_id to something like List[int] or incrementing a global instead also fixes this.
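
A minimal sketch (hypothetical names) of the class of fix: read a shared counter instead of writing a class attribute, and avoid defining a new subclass on every call.

```python
import itertools

class HandleSketch:
    _ids = itertools.count()       # shared counter; __init__ never writes a class attribute

    def __init__(self):
        self.id = next(self._ids)  # stays cheap no matter how many subclasses exist

# The slow pattern this PR removes looked roughly like:
#   def register_multi_grad_hook(...):
#       class Handle(RemovableHandle):   # a brand-new subclass on every call
#           ...
# Every such subclass enlarges the set whose attribute caches CPython must invalidate
# whenever the base class's `next_id` attribute is written.
```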

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122847
Approved by: https://github.com/soulitzer
ghstack dependencies: #122726
2024-03-28 15:24:12 +00:00
4290a57e9c Revert "[NJT] .to() properly updates device of offsets (#122797)"
This reverts commit 3e7fd45b409966440c54f5e370885b4b2a388a01.

Reverted https://github.com/pytorch/pytorch/pull/122797 on behalf of https://github.com/jeffdaily due to Sorry for reverting your change but it is failing CUDA and ROCm jobs in trunk. Please help take a look and reland the change ([comment](https://github.com/pytorch/pytorch/pull/122797#issuecomment-2025473181))
2024-03-28 15:17:45 +00:00
cyy
d6aed1b692 Fix clang-tidy warnings of aten/src/ATen/functorch (#122779)
This PR fixes some performance-related clang-tidy warnings in aten/src/ATen/functorch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122779
Approved by: https://github.com/ezyang
2024-03-28 15:15:06 +00:00
6e1c81c687 Revert "Let dynamo trace some functions in functorch.deprecated.* namespace (#121665)"
This reverts commit f9eab9ca92c603e671e7714669758a81ce8d7111.

Reverted https://github.com/pytorch/pytorch/pull/121665 on behalf of https://github.com/guilhermeleobas due to revert PR ([comment](https://github.com/pytorch/pytorch/pull/121665#issuecomment-2025460500))
2024-03-28 15:11:51 +00:00
f9eab9ca92 Let dynamo trace some functions in functorch.deprecated.* namespace (#121665)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121665
Approved by: https://github.com/zou3519
ghstack dependencies: #121410
2024-03-28 15:07:18 +00:00
f178d996a8 [dynamo] Fix traceback generation on runtime errors (#122746)
Fixes `During handling of the above exception, another exception occurred: [...] torch._dynamo.exc.Unsupported: generator`. traceback.format_exc uses generators, which aren't supported by dynamo yet.
<details>
  <summary>current error message</summary>

```
======================================================================
ERROR: test_custom_fn_saved_tensors (__main__.TestCompiledAutograd)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/xmfan/core/pytorch/torch/fx/graph_module.py", line 307, in __call__
    return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
  File "/home/xmfan/core/pytorch/torch/nn/modules/module.py", line 1527, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/xmfan/core/pytorch/torch/nn/modules/module.py", line 1537, in _call_impl
    return forward_call(*args, **kwargs)
  File "<eval_with_key>.0", line 4, in forward
    def forward(self, inputs, sizes, hooks):
IndexError: list index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/xmfan/core/pytorch/torch/testing/_internal/common_utils.py", line 2741, in wrapper
    method(*args, **kwargs)
  File "/home/xmfan/core/pytorch/test/inductor/test_compiled_autograd.py", line 499, in test_custom_fn_saved_tensors
    self.check_output_and_recompiles(fn, 1)
  File "/home/xmfan/core/pytorch/test/inductor/test_compiled_autograd.py", line 61, in check_output_and_recompiles
    actual = list(opt_fn())
  File "/home/xmfan/core/pytorch/test/inductor/test_compiled_autograd.py", line 495, in fn
    loss.backward()
  File "/home/xmfan/core/pytorch/torch/_tensor.py", line 534, in backward
    torch.autograd.backward(
  File "/home/xmfan/core/pytorch/torch/autograd/__init__.py", line 267, in backward
    _engine_run_backward(
  File "/home/xmfan/core/pytorch/torch/autograd/graph.py", line 766, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/xmfan/core/pytorch/torch/nn/modules/module.py", line 1527, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/xmfan/core/pytorch/torch/nn/modules/module.py", line 1537, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/xmfan/core/pytorch/torch/_dynamo/eval_frame.py", line 397, in _fn
    res = fn(*args, **kwargs)
  File "/home/xmfan/core/pytorch/torch/fx/graph_module.py", line 741, in call_wrapped
    return self._wrapped_call(self, *args, **kwargs)
  File "/home/xmfan/core/pytorch/torch/fx/graph_module.py", line 315, in __call__
    _WrappedCall._generate_error_message(topmost_framesummary),
  File "/home/xmfan/core/pytorch/torch/fx/graph_module.py", line 289, in _generate_error_message
    tb_repr = get_traceback()
  File "/home/xmfan/core/pytorch/torch/fx/graph_module.py", line 288, in get_traceback
    return traceback.format_exc()
  File "/home/xmfan/.conda/envs/benchmarks/lib/python3.10/traceback.py", line 183, in format_exc
    return "".join(format_exception(*sys.exc_info(), limit=limit, chain=chain))
  File "/home/xmfan/.conda/envs/benchmarks/lib/python3.10/traceback.py", line 136, in format_exception
    return list(te.format(chain=chain))
  File "/home/xmfan/core/pytorch/torch/_dynamo/convert_frame.py", line 941, in catch_errors
    return callback(frame, cache_entry, hooks, frame_state, skip=1)
  File "/home/xmfan/core/pytorch/torch/_dynamo/convert_frame.py", line 348, in _convert_frame_assert
    unimplemented("generator")
  File "/home/xmfan/core/pytorch/torch/_dynamo/exc.py", line 199, in unimplemented
    raise Unsupported(msg)
torch._dynamo.exc.Unsupported: generator
```

</details>

With this change, we get back the descriptive error message:
<details>
  <summary>post-fix error message</summary>

```
Traceback (most recent call last):
  File "/home/xmfan/core/pytorch/torch/fx/graph_module.py", line 307, in __call__
    return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
  File "/home/xmfan/core/pytorch/torch/nn/modules/module.py", line 1527, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/xmfan/core/pytorch/torch/nn/modules/module.py", line 1537, in _call_impl
    return forward_call(*args, **kwargs)
  File "<eval_with_key>.0", line 4, in forward
    def forward(self, inputs, sizes, hooks):
IndexError: list index out of range

Call using an FX-traced Module, line 4 of the traced Module's generated forward function:

def forward(self, inputs, sizes, hooks):

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    getitem = inputs[0]

    getitem_1 = inputs[1];  inputs = None
```

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122746
Approved by: https://github.com/jansel, https://github.com/anijain2305
ghstack dependencies: #122691
2024-03-28 14:40:54 +00:00
1d96791661 [dynamo] Fix list proxy to list element proxy source propagation (#122691)
Currently, when we create proxies for a list's elements in wrap_fx_proxy_cls, we create them using the same source as the list's, e.g. `LocalSource(inputs)` instead of `GetItemSource(LocalSource(inputs), index=i)`. This results in invalid guards when the tensors the list contains become dynamic, because the guard system thinks the list is a tensor:
```
Malformed guard:
L['sizes'][0] == L['inputs'].size()[0]
Malformed guard:
2 <= L['inputs'].size()[0]

Traceback [...]
AttributeError: 'list' object has no attribute 'size'
```
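
The guard strings below are illustrative (built by hand rather than with Dynamo's Source classes), but they show why per-element sources matter:

```python
def local_source(name: str) -> str:
    return f"L['{name}']"

def get_item_source(base: str, index: int) -> str:
    return f"{base}[{index}]"

inputs = local_source("inputs")

# old: element proxies reused the list's source, so dynamic-shape guards land on the list
bad_guard = f"2 <= {inputs}.size()[0]"                       # a Python list has no .size()
# new: element proxies carry GetItemSource, so guards land on the tensor element
good_guard = f"2 <= {get_item_source(inputs, 0)}.size()[0]"  # 2 <= L['inputs'][0].size()[0]

print(bad_guard)
print(good_guard)
```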

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122691
Approved by: https://github.com/jansel, https://github.com/anijain2305
2024-03-28 14:40:54 +00:00
0284bca99b Don't cache device_count if we haven't initialized CUDA yet (#122815)
Before initializing CUDA, it can change by modifying CUDA_VISIBLE_DEVICES
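
A minimal sketch (assumes a machine with more than one GPU) of the behavior this enables:

```python
import os
import torch

print(torch.cuda.device_count())          # e.g. 8; this call previously cached the value

os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # still allowed: CUDA has not been initialized

assert not torch.cuda.is_initialized()
print(torch.cuda.device_count())          # now 1; before this PR the stale 8 was returned
```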

Fixes https://github.com/pytorch/pytorch/issues/122085
Fixes https://github.com/pytorch/pytorch/issues/38616
Fixes https://github.com/pytorch/pytorch/issues/110000
Fixes https://github.com/pytorch/pytorch/issues/110971
Fixes https://github.com/pytorch/pytorch/issues/95073

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122815
Approved by: https://github.com/albanD
2024-03-28 13:23:45 +00:00
84dc76156a Workaround dind-rootless volumes mount as root (#122787)
In ARC Runners we are using dind-rootless to run docker-in-docker and
in rootless mode volume mounts always mount as root but are mapped to
the local `runner` user in ARC. This causes the build.sh and test.sh
scripts to fail because they run as the `jenkins` user and expect to
be able to write to the workspace path that's being mounted.

Signed-off-by: Thanh Ha <thanh.ha@linuxfoundation.org>
2024-03-28 09:06:40 -04:00
cyy
d1da9cc654 [ClangTidy] Disable misc-include-cleaner (#122855)
misc-include-cleaner was introduced in clang-tidy-17 as a way to check for missing and unused includes. However, there are lots of transitive headers in PyTorch, and it would take enormous effort to add the related annotations needed to direct this checker. For this reason, it's better to disable it for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122855
Approved by: https://github.com/cpuhrsch
2024-03-28 10:10:43 +00:00
8c8e4e31f2 Some improvements to nonzero post guard_size_oblivious (#122156)
Prompted by https://github.com/pytorch/pytorch/pull/121571

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122156
Approved by: https://github.com/jansel
2024-03-28 03:53:16 +00:00
caa57e4fcd Add tensor step and capturable support to rmsprop (#122264)
Towards fixing https://github.com/pytorch/pytorch/issues/115679
Fixes RMSprop step update while compiling

Adds capturable support to RMSprop

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122264
Approved by: https://github.com/janeyx99
2024-03-28 03:39:28 +00:00
927bc4b558 [vision hash update] update the pinned vision hash (#122754)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122754
Approved by: https://github.com/pytorchbot
2024-03-28 03:27:07 +00:00
c10352a406 [audio hash update] update the pinned audio hash (#122584)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122584
Approved by: https://github.com/pytorchbot
2024-03-28 03:26:21 +00:00
235f24fc66 [inductor] Add FileLock around V.debug.copy (#122665)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122665
Approved by: https://github.com/ezyang
2024-03-28 03:17:33 +00:00
1b5ccdb0f0 Avoid COW materialize in more forward ops (#122720)
Affected ops:
* ormqr
* lerp
* multinomial
* bernoulli
* histogram
* searchsorted
* log_softmax
* jiterator ops
* dropout
* _segment_reduce

Part of #97856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122720
Approved by: https://github.com/ezyang
2024-03-28 03:02:13 +00:00
60f3c092d4 [dynamo] Config option to Inline builtin nn module forward (#122725)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122725
Approved by: https://github.com/jansel
ghstack dependencies: #122646, #122647, #122716, #122769, #122818
2024-03-28 03:01:27 +00:00
d4317becce [dynamo][easy] Force recompilation in a test (#122818)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122818
Approved by: https://github.com/williamwen42
ghstack dependencies: #122646, #122647, #122716, #122769
2024-03-28 03:01:27 +00:00
52b1d2a73d Increase timm batch sizes to make less overhead-bound and less noisy (#122581)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122581
Approved by: https://github.com/ezyang
ghstack dependencies: #122686, #122688, #121692, #122841
2024-03-28 02:34:32 +00:00
e6ee8322d7 nn.Module: use swap_tensors for Tensor subclasses (#122755)
This fixes a bug when casting a module that has DTensor parameters. The old behavior would swap the .data field of the Tensor subclass, which is incorrect when dealing with tensor subclasses that may have multiple child tensors.

This uses the `swap_tensors` method to swap all of the tensors, not just the .data field.

Test plan:

```
pytest test/distributed/_tensor/test_api.py -k 'test_distribute_module_casting'
python test/distributed/fsdp/test_wrap.py -k test_auto_wrap_smoke_test_cuda_init_mode1_cpu_offload0_use_device_id_True
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122755
Approved by: https://github.com/wanchaol, https://github.com/mikaylagawarecki
2024-03-28 02:03:09 +00:00
3e7fd45b40 [NJT] .to() properly updates device of offsets (#122797)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122797
Approved by: https://github.com/jbschlosser
2024-03-28 00:56:23 +00:00
574a8ccf10 Remove several expectedFailureNonStrict (#122802)
This PR removes several `expectedFailureNonStrict` from `test_export.py`, where the error messages from strict and non-strict export differ a bit.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122802
Approved by: https://github.com/ydwu4
2024-03-28 00:42:49 +00:00
12116aee68 Add Flash Attention support on ROCM (#121561)
This patch addresses the major limitations in our previous [PR #115981](https://github.com/pytorch/pytorch/pull/115981) through the new dedicated repository [AOTriton](https://github.com/ROCm/aotriton)

- [x] Only supports MI200 series GPU (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`).
    * MI300X is supported. More architectures will be added once Triton supports them.
- [x] Only supports power of two sequence lengths.
    * Now it supports arbitrary sequence lengths.
- [ ] No support for varlen APIs.
    * The varlen API will be supported in a future release of AOTriton.
- [x] Only supports head dimensions 16, 32, 64, 128.
    * Now it supports arbitrary head dimensions <= 256.
- [x] Performance is still being optimized.
    * Kernel is selected according to autotune information from Triton.

Other improvements from AOTriton include
* Allow more flexible Tensor storage layout
* More flexible API

This is a more extensive fix to #112997
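
A minimal usage sketch (assumes a ROCm build on a supported GPU such as MI200/MI300X) that forces the flash-attention backend:

```python
import torch
import torch.nn.functional as F

# head dim 64 <= 256 and a non-power-of-two sequence length are both fine now
q, k, v = (torch.randn(2, 8, 1000, 64, device="cuda", dtype=torch.float16) for _ in range(3))

with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(out.shape)  # torch.Size([2, 8, 1000, 64])
```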

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121561
Approved by: https://github.com/huydhn
2024-03-28 00:27:38 +00:00
8d676a6e8e [dynamo][cpp-guards] Bugfix for size/strides for tensor match (#122828)
This got missed because CPP guard manager is not ON by default.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122828
Approved by: https://github.com/mlazos, https://github.com/jansel
2024-03-28 00:16:49 +00:00
66510c641f [c10d][NCCL] Refactor coalesced storage (#122651)
The `coalescedDevice_` and `coalescedComms_` are used inefficiently and, in the case of consecutive coalescing comms, can cause a read-before-write condition.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122651
Approved by: https://github.com/kwen2501, https://github.com/eqy
2024-03-27 23:56:02 +00:00
cc12668053 Fix swap_tensors path in _apply for modules that inherit from RNNBase (RNN, GRU, LSTM) (#122800)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122800
Approved by: https://github.com/albanD
2024-03-27 23:34:16 +00:00
0348773655 Forward fix for subtly breaking AC with compile in the case of stacked (#122841)
checkpoint layers separated by recomputable op
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122841
Approved by: https://github.com/anijain2305
ghstack dependencies: #122686, #122688, #121692
2024-03-27 23:23:04 +00:00
a8b7480f0d fix dynamo.explain examples (#122745)
`dynamo.explain()` was updated to return a structure but the docs weren't updated to match.
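
A hedged example of the structured API the docs now describe; the attribute names are assumed from the returned `ExplainOutput` structure and may differ slightly between releases:

```
import torch

def fn(x):
    return x.sin() + x.cos()

explanation = torch._dynamo.explain(fn)(torch.randn(8))
print(explanation.graph_count)        # how many graphs were captured
print(explanation.graph_break_count)  # how many graph breaks occurred
print(explanation.break_reasons)      # why each break happened
```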

- Update the docs to use the new API
- Remove some dead code left when `explain` was updated.
- Drive-by: Fix some `nopython` uses that I noticed
- Drive-by: I noticed an ignored error coming from CleanupHook on shutdown - make it check the global before setting it.

Fixes #122573

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122745
Approved by: https://github.com/jansel
2024-03-27 22:53:27 +00:00
a54ea7bbd8 Made several changes to min-cut partitioner that allow it to recompute more things (#121692)
Perf results
<img width="862" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/8d44e633-8941-46a6-8e7d-806330a8c890">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121692
Approved by: https://github.com/shunting314, https://github.com/eellison
ghstack dependencies: #122686, #122688
2024-03-27 22:45:52 +00:00
bef01c7c2b Revert "Optimize multi_tensor_apply (take 2) (#119764)"
This reverts commit fe41ba47652ca73569453bddb43605c77bb85184.

Reverted https://github.com/pytorch/pytorch/pull/119764 on behalf of https://github.com/atalman due to Failing internally ([comment](https://github.com/pytorch/pytorch/pull/119764#issuecomment-2024105399))
2024-03-27 22:42:07 +00:00
222dfc4282 [Inductor] Run pattern matcher over the original graph (#122519)
Differential Revision: [D55429070](https://our.internmc.facebook.com/intern/diff/D55429070)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122519
Approved by: https://github.com/jansel
2024-03-27 22:09:36 +00:00
530e13cf3d Revert "[c10d] disable compute_duration by default (#122138)" (#122539)
This reverts commit bf18e967b4abc90c27ad460680497d8f5ec55962.

It is stacked after a fix to elapsed_time that will resolve the memory issues that required the introduction of this flag.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122539
Approved by: https://github.com/wconstab, https://github.com/shuqiangzhang
ghstack dependencies: #122538
2024-03-27 21:53:28 +00:00
933d3a7829 Allow dynamo to inline through "hessian" (#121410)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121410
Approved by: https://github.com/zou3519
2024-03-27 21:39:37 +00:00
a7306de0dc Add RMSNorm module (#121364)
Similar to dbeed9724b/torchmultimodal/modules/layers/normalizations.py (L51)

**The implementation here is not optimized and we welcome pull requests to improve this**

- Use `normalized_shape` instead of singular integer `dim` to be aligned with the `nn.LayerNorm` implementation
- Remove the [upcast to float and downcast](dbeed9724b/torchmultimodal/modules/layers/normalizations.py (L73))
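
A hedged usage sketch of the new module, assuming the `torch.nn.RMSNorm` API added here with `normalized_shape` semantics matching `nn.LayerNorm`:

```
import torch
import torch.nn as nn

rms_norm = nn.RMSNorm(normalized_shape=[64])  # normalize over the last dimension of size 64
x = torch.randn(8, 16, 64)
y = rms_norm(x)  # output has the same shape as x
print(y.shape)   # torch.Size([8, 16, 64])
```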

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121364
Approved by: https://github.com/albanD
2024-03-27 21:39:30 +00:00
b693fff5d7 Add non strict inline constraints and runtime assertions to non-strict exported program (#122722)
This PR reduces the difference between strict and non-strict exported program by

- Support `inline_constraints` for non-strict exported program
- Add runtime assertions for range constraints to non-strict exported program

After this PR, the following unit tests are no longer `expectedFailureNonStrict`:
- test_automatic_constrain_size
- test_export_with_inline_constraints
- test_redundant_asserts
- test_constrain_size_with_constrain_value
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122722
Approved by: https://github.com/pianpwk
2024-03-27 21:20:03 +00:00
abe4a0e9eb [dynamo] pop result of print reordering (#122744)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122744
Approved by: https://github.com/jansel
ghstack dependencies: #122146, #122335, #122354, #122355, #122356, #122449, #122455, #122456, #122530, #122737, #122738, #122739, #122740, #122741, #122742, #122743
2024-03-27 20:39:39 +00:00
76fe0faadd [dynamo, 3.12] add END_SEND (#122743)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122743
Approved by: https://github.com/jansel
ghstack dependencies: #122146, #122335, #122354, #122355, #122356, #122449, #122455, #122456, #122530, #122737, #122738, #122739, #122740, #122741, #122742
2024-03-27 20:39:39 +00:00
c5d372dafc [dynamo, 3.12] trace through __mro__ attribute access (#122742)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122742
Approved by: https://github.com/jansel
ghstack dependencies: #122146, #122335, #122354, #122355, #122356, #122449, #122455, #122456, #122530, #122737, #122738, #122739, #122740, #122741
2024-03-27 20:39:39 +00:00
71d40ff861 [dynamo, 3.12] fix typing variable tracing (#122741)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122741
Approved by: https://github.com/jansel
ghstack dependencies: #122146, #122335, #122354, #122355, #122356, #122449, #122455, #122456, #122530, #122737, #122738, #122739, #122740
2024-03-27 20:39:39 +00:00
5d0a792d5f [dynamo, 3.12] fix some tests (#122740)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122740
Approved by: https://github.com/jansel
ghstack dependencies: #122146, #122335, #122354, #122355, #122356, #122449, #122455, #122456, #122530, #122737, #122738, #122739
2024-03-27 20:39:39 +00:00
a9704848d1 [dynamo, 3.12] add CALL_INTRINSIC_1 (#122739)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122739
Approved by: https://github.com/jansel
ghstack dependencies: #122146, #122335, #122354, #122355, #122356, #122449, #122455, #122456, #122530, #122737, #122738
2024-03-27 20:39:39 +00:00
8e5a4248a3 [dynamo, 3.12] add LOAD_SUPER_ATTR (#122738)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122738
Approved by: https://github.com/jansel
ghstack dependencies: #122146, #122335, #122354, #122355, #122356, #122449, #122455, #122456, #122530, #122737
2024-03-27 20:39:39 +00:00
8cd7bb7422 [dynamo, 3.12] add LOAD_FAST variants (#122737)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122737
Approved by: https://github.com/jansel
ghstack dependencies: #122146, #122335, #122354, #122355, #122356, #122449, #122455, #122456, #122530
2024-03-27 20:39:39 +00:00
a9b27bbbe9 [dynamo, 3.12] update jump instructions (#122530)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122530
Approved by: https://github.com/jansel
ghstack dependencies: #122146, #122335, #122354, #122355, #122356, #122449, #122455, #122456
2024-03-27 20:39:39 +00:00
f44f16ebd5 [dynamo, 3.12] add END_FOR (#122456)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122456
Approved by: https://github.com/jansel
ghstack dependencies: #122146, #122335, #122354, #122355, #122356, #122449, #122455
2024-03-27 20:39:39 +00:00
bcdd0c6f59 [dynamo, 3.12] add BINARY/STORE_SLICE (#122455)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122455
Approved by: https://github.com/jansel
ghstack dependencies: #122146, #122335, #122354, #122355, #122356, #122449
2024-03-27 20:39:39 +00:00
7b13228038 [dynamo, 3.12] fix DICT_VERSION C++ guards (#122449)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122449
Approved by: https://github.com/jansel
ghstack dependencies: #122146, #122335, #122354, #122355, #122356
2024-03-27 20:39:39 +00:00
01547960bc [dynamo, 3.12] remove LOAD_METHOD, update LOAD_ATTR (#122356)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122356
Approved by: https://github.com/jansel
ghstack dependencies: #122146, #122335, #122354, #122355
2024-03-27 20:39:39 +00:00
8ba26f4aa5 [dynamo, 3.12] support RETURN_CONST (#122355)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122355
Approved by: https://github.com/jansel
ghstack dependencies: #122146, #122335, #122354
2024-03-27 20:39:39 +00:00
3a67c86f72 [dynamo, 3.12] remove references to PRECALL instruction in 3.12 (#122354)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122354
Approved by: https://github.com/jansel
ghstack dependencies: #122146, #122335
2024-03-27 20:39:39 +00:00
35382f0573 [dynamo, 3.12] Use CPython internal _PyOpcode_Caches instead of hardcoding (#122335)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122335
Approved by: https://github.com/jansel
ghstack dependencies: #122146
2024-03-27 20:39:39 +00:00
2564f6cf0e [dynamo, 3.12] Allocate Dynamo shadow frames by mimicking CPython (#122146)
Python 3.12 changed a few things with how `_PyInterpreterFrame`s are allocated and freed:
- Frames are now required to be placed on the Python frame stack. In 3.11, we could allocate frames anywhere in memory. In 3.12, we now need to use `THP_PyThreadState_BumpFramePointerSlow`/`push_chunk`/`allocate_chunk`. This method of allocating/freeing frames is also compatible with 3.11.
- The eval frame function is now responsible for clearing the frame (see https://docs.python.org/3/whatsnew/changelog.html#id128, the point about "...which now clear the frame.")

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122146
Approved by: https://github.com/jansel
2024-03-27 20:39:39 +00:00
b73c603771 Update base for Update on "[NJT] Actually inline NT torch function during dynamo"
In order to avoid having any temporary state where the behavior of anything is regressed. This PR does all of the following at once:

(1) Disables torch function running a second time in AOTAutograd

If you have a tensor subclass that relies on dispatch into the same op without unwrapping and calling torch._C.DisableTorchFunctionSubclass() the torch function-ness will survive into AOTAutograd (when normally we may expect the torch function to be inlined away during dynamo). If this happens, we should make sure to not run the torch function logic a second time.

2.  Enables torch function to be inlined in dynamo for NT

Due to torch function running a second time AOTAutograd, NT was actually relying on this behavior instead of properly inlining through torch function at the dynamo level. 

3. Fixes graph breaks for NT torch function

Now that we are inlining through torch function for the first time in dynamo, we've uncovered some graph breaks. Thanks to mlazos, we should have support for custom attributes for torch function now. We also add support for a custom Enum type. Finally, a few of them we can get rid of by adding allow_in_graph (though we may need to double check the soundness here).


Fixes https://github.com/pytorch/pytorch/issues/120654, https://github.com/pytorch/pytorch/issues/120124

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang

[ghstack-poisoned]
2024-03-27 13:01:50 -07:00
ccfc87b199 include scheduler_on_plateau in optim.h (#121722)
Fixes #121593
Co-authored-by: Jane Xu <janeyx@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121722
Approved by: https://github.com/albanD
2024-03-27 19:45:25 +00:00
ceff2205e9 [dynamo][cpp-guards] Bugfix to pass on correct example_value (#122769)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122769
Approved by: https://github.com/jansel
ghstack dependencies: #122646, #122647, #122716
2024-03-27 19:40:46 +00:00
7281c5afdc [dynamo][fbcode][torchrec] Selectively inline torchrec/distributed/types.py (#122716)
Manually verified for the internal model.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122716
Approved by: https://github.com/jansel
ghstack dependencies: #122646, #122647
2024-03-27 19:40:46 +00:00
5b42c41b19 [dynamo][improve-guard-overhead] Skip TENSOR_MATCH guards on parameters for optimizers (#122647)
**1.32x guard overhead reduction** (1.092 vs 0.827 ms) for MegatronBertForCausalLM with 394 params.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122647
Approved by: https://github.com/jansel, https://github.com/mlazos
ghstack dependencies: #122646
2024-03-27 19:40:43 +00:00
c108696228 [dynamo][guards-cpp-refactor][easy] Env variable to turn on cpp manager (#122646)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122646
Approved by: https://github.com/jansel
2024-03-27 19:40:37 +00:00
1b9c7e41bb Remove .data call in LSTM as it is not necessary (#122733)
Summary: Title

Test Plan: CI

Differential Revision: D55392057

Functional pre-dispatch tracing chokes on the LSTM .data call today. While we need to fix that, this call seems unnecessary here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122733
Approved by: https://github.com/mikaylagawarecki, https://github.com/albanD
2024-03-27 19:08:22 +00:00
1d6fc0d4de Fixed _infer_device_type warning in checkpoint (#122726)
Previously, we were checking `len(device_types)` where `device_types` is a `list`. This meant that if there were multiple inputs, we would see something like `device_types = ["cuda", "cuda"]` and a false positive warning. We should check `len(set(device_types))`.
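
A minimal illustration of the check being fixed (toy values, not the checkpoint code itself):

```
device_types = ["cuda", "cuda"]    # multiple inputs, all on the same device type
print(len(device_types) > 1)       # True  -> the old check gives a false-positive warning
print(len(set(device_types)) > 1)  # False -> the new check warns only for genuinely mixed devices
```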
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122726
Approved by: https://github.com/soulitzer
2024-03-27 18:38:42 +00:00
37e3c8f33f [DCP] Supporting resolve_bytes in LoadPlanner (#122700)
1. Supporting resolve bytes, similar to resolve_tensor.
2. This will allow us to load the bytes, directly on to the user provided ioBytes buffer.

This essentially mirrors the existing pattern we have for tensors, where the user is expected to follow some version of:

```
1. resolve_tensor
2. copy to target tensor
3. commit_tensor
```

Differential Revision: [D55259699](https://our.internmc.facebook.com/intern/diff/D55259699/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122700
Approved by: https://github.com/Skylion007, https://github.com/wz337, https://github.com/pradeepfn
2024-03-27 17:43:32 +00:00
cd51496f8b add a couple debug options (#121033)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121033
Approved by: https://github.com/ezyang
2024-03-27 17:24:43 +00:00
5af839f86d [quant][pt2e] Enable observer sharing between different quantization specs (#122734)
Summary:

Right now we don't insert additional observers (i.e., we share observers) if qspec.dtype and qspec.is_dynamic match exactly. Since fixed qparams quantization spec and derived quantization spec do not have an is_dynamic field currently, observer sharing does not happen between them and quantization spec. In this PR we fix the issue by adding is_dynamic to all quantization specs.

Note: SharedQuantizationSpec should probably be its own type in the future
TODO later:
(1). move all these fields (dtype, is_dynamic, quant_min, quant_max etc.) to QuantizationSpecBase,
(2). make SharedQuantizationSpec a separate type
(3). add quant_min/quant_max in observer sharing checking in pt2e/prepare.py

Test Plan:
python test/test_quantization.py -k test_fixed_qparams_qspec_observer_dedup

Differential Revision: [D55396546](https://our.internmc.facebook.com/intern/diff/D55396546)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122734
Approved by: https://github.com/andrewor14
2024-03-27 16:45:19 +00:00
af7ac3e5c4 Update base for Update on "[NJT] Actually inline NT torch function during dynamo"
In order to avoid having any temporary state where the behavior of anything is regressed. This PR does all of the following at once:

(1) Disables torch function running a second time in AOTAutograd

If you have a tensor subclass that relies on dispatch into the same op without unwrapping and calling torch._C.DisableTorchFunctionSubclass() the torch function-ness will survive into AOTAutograd (when normally we may expect the torch function to be inlined away during dynamo). If this happens, we should make sure to not run the torch function logic a second time.

2.  Enables torch function to be inlined in dynamo for NT

Due to torch function running a second time AOTAutograd, NT was actually relying on this behavior instead of properly inlining through torch function at the dynamo level. 

3. Fixes graph breaks for NT torch function

Now that we are inlining through torch function for the first time in dynamo, we've uncovered some graph breaks. Thanks to mlazos, we should have support for custom attributes for torch function now. We also add support for a custom Enum type. Finally, a few of them we can get rid of by adding allow_in_graph (though we may need to double check the soundness here).


Fixes https://github.com/pytorch/pytorch/issues/120654, https://github.com/pytorch/pytorch/issues/120124

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang

[ghstack-poisoned]
2024-03-27 09:22:15 -07:00
b63f6f78dc Revert "[Inductor] Run pattern matcher over the original graph (#122519)"
This reverts commit 1f5fcb4e203eb343e8c53f6444015c98e8f68d60.

Reverted https://github.com/pytorch/pytorch/pull/122519 on behalf of https://github.com/atalman due to Breaks internal tests ([comment](https://github.com/pytorch/pytorch/pull/122519#issuecomment-2023022311))
2024-03-27 15:13:26 +00:00
f3b82a4dc2 [xla hash update] update the pinned xla hash (#122628)
Originally made this PR since xla was failing, but the PR that changed the pin got reverted, so this is just a normal update now

The old pin was ~2 weeks old?

Currently XLA is broken https://github.com/pytorch/pytorch/actions/runs/8438508272/job/23115239444
Co-authored-by: Andrey Talman <atalman@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122628
Approved by: https://github.com/malfet, https://github.com/JackCaoG
2024-03-27 15:09:42 +00:00
f140309e9c Revert "Only update momentum buffers for SGD if momentum is enabled (#122349)"
This reverts commit a333b080c16a3a6bbb057b4fbaaec4a4e14615dd.

Reverted https://github.com/pytorch/pytorch/pull/122349 on behalf of https://github.com/atalman due to Broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/122349#issuecomment-2023001467))
2024-03-27 15:04:52 +00:00
70c3deef2d Revert "[xla hash update] update the pinned xla hash (#122628)"
This reverts commit 04399a30913fd04c2120420b671cd432659d56e6.

Reverted https://github.com/pytorch/pytorch/pull/122628 on behalf of https://github.com/atalman due to Need revert and then reland ([comment](https://github.com/pytorch/pytorch/pull/122628#issuecomment-2022995857))
2024-03-27 15:01:33 +00:00
eb5381da66 Skip storage check debug assert in view codegen when output is a subclass instance (#122718)
Before the fix, this assert blows up in DEBUG mode for views where the input (base) is a dense tensor and the output (view) is a subclass instance.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122718
Approved by: https://github.com/soulitzer
2024-03-27 14:39:51 +00:00
105381ea11 [inductor][cpp] simplify CppVecKernelChecker (remove bool/int8 load as mask and load as float flags) (#119734)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119734
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
ghstack dependencies: #119654, #119655
2024-03-27 11:20:35 +00:00
49121603ab [inductor][cpp] support vectorized indirect indexing (#119655)
This PR adds vectorized indirect indexing so that we can further simplify the `CppVecKernelChecker` (done in the later PR #119734) and remove the check that throws `CppVecUnsupportedError`. A boundary assertion check is added on vectorized indices via the new `indirect_assert` method on `Kernel`; the base implementation handles scalar indices and is overridden in `CppVecKernel` for vectorized indices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119655
Approved by: https://github.com/jansel
ghstack dependencies: #119654
2024-03-27 10:25:45 +00:00
a697d972b1 Fix torchbench errors (#122735)
Summary: It looks like this target has stopped working; let's fix it.

Test Plan:
```
buck2 run mode/opt //caffe2/benchmarks/dynamo/:test
```
now works

Differential Revision: D55389546

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122735
Approved by: https://github.com/xmfan
2024-03-27 06:59:16 +00:00
367ec62ae3 [inductor][cpp] generalize vector mask for dtypes (#119654)
Vectorized boolean values in CPU Inductor were modeled with `Vectorized<float>`, which cannot work for operations with other data types. This PR generalizes it with the new `VecMask` template class that can work for masks on any vectorized data type. The intrinsics implementations in `cpp_prefix.h` for mask conversion, cast, and masked load are now implemented as specializations of `VecMask` and moved to the corresponding header files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119654
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2024-03-27 05:33:53 +00:00
f2c1060de3 [fx] Preserve Fx graph node order in partitioner across runs (#115621)
Fixes #ISSUE_NUMBER
The partitioner generates a different graph on each recompilation run.
Co-authored-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115621
Approved by: https://github.com/ezyang
2024-03-27 02:20:37 +00:00
d1104d76aa [Easy] Fix freezing bug with mismatched bias sizes (#122724)
Fix for https://github.com/pytorch/pytorch/issues/121231

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122724
Approved by: https://github.com/davidberard98
2024-03-27 01:41:00 +00:00
249e65b92d Graph-Safe RNG State Exchange for Tensor Parallelism (#114068)
See #113541

The PR allows for registering and controlling multiple RNG states using indices, ensuring cudagraph-safe operations, and includes both C++ and Python API changes to support this functionality.

cc  @eellison @anijain2305 @jansel @ezyang @ptrblck @csarofeen @mcarilli
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114068
Approved by: https://github.com/ezyang, https://github.com/eqy, https://github.com/xuzhao9
2024-03-27 01:14:38 +00:00
fe41ba4765 Optimize multi_tensor_apply (take 2) (#119764)
### Take 2

The first take (#119153) landed but was reverted because it broke cuda graph for `multi_tensor_apply`. This PR is a reland of #119153:
- Incorporate #119652 so that the optimization can be applied (1) without increasing binary size (2) to all 3 MTA variants without much code duplication.
- Ensure the optimization is compatible with cuda graph.

### Summary

Due to the dynamic nature of the workload, the kernel arguments aren't guaranteed to fit in the static 4kb kernel argument memory. Previously with the apex implementation, we overcame this limitation by dividing a multi_tensor_apply workload into multiple kernel launches. However, this led to low sustained occupancy, affecting the performance of memory bound ops.

Based on the observation that the kernel argument memory limitation doesn't correlate well with available SM resources, we adopt a different approach:
- When the kernel arguments fit into the static kernel argument memory, we use this memory to transfer the arguments.
- Conversely, when the kernel arguments don't fit into the static kernel argument memory, instead of sacrificing sustained occupancy, we use a page-locked cudaMemcpyAsync to transfer the arguments, then perform the entire workload in a single kernel.
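
A hedged, purely illustrative sketch of the dispatch rule above (the real implementation lives in the CUDA/C++ multi_tensor_apply code; the constant and function names here are invented):

```
STATIC_KERNEL_ARG_LIMIT = 4096  # bytes of static kernel-argument space

def choose_arg_transfer(packed_arg_bytes: int) -> str:
    if packed_arg_bytes <= STATIC_KERNEL_ARG_LIMIT:
        # Fits: pass the packed tensor-list metadata directly as kernel arguments.
        return "static kernel arguments"
    # Doesn't fit: stage the metadata in page-locked memory, cudaMemcpyAsync it,
    # and still run the whole workload in a single kernel launch.
    return "pinned-memory async copy"

print(choose_arg_transfer(1024))       # small tensor list
print(choose_arg_transfer(64 * 1024))  # large tensor list
```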

This PR only covers `multi_tensor_apply` for tensors. The change can be easily applied to `multi_tensor_apply` for tensors + scalars and `multi_tensor_apply_for_fused_optimizer`.

### Benchmark (WIP)

The only benchmark I've conducted so far is on `_foreach_copy_` with a set of sizes that resembles an internal workload. I need to benchmark more problem sizes. The speedup should vary among problem sizes. **However, I believe this PR should not be slower than the previous impl on any problem sizes.**

The benchmark can be reproduced with [this script](https://gist.github.com/yifuwang/178c1f4bf951c5794ea79c04d90e44fa).

**Baseline**

A single iteration in trace:
<img width="831" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/5c8d72d0-0628-4989-88a8-c756f6bc1319">

```
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_5a59145f-567b-472f-8eef-c61c388d45b4.json
device ms: 1.111, cpu ms: 7.151
memory bandwidth: 1169.825 GB/s
```

**This PR**

A single iteration in trace:
<img width="967" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/a023e183-8166-48f7-b7c0-c8ba32653d2b">

```
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_da060725-62a8-466e-b570-2ad67ff0e29d.json
device ms: 0.892, cpu ms: 0.810
memory bandwidth: 1456.744 GB/s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119764
Approved by: https://github.com/eqy, https://github.com/eellison, https://github.com/crcrpar
2024-03-27 00:51:30 +00:00
67a4d6d6cb Stopped TORCH_COMPILE_DEBUG from printing out a bunch of logs (#122688)
@ezyang suggests using TORCH_TRACE for dumping out all intermediate logs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122688
Approved by: https://github.com/ezyang, https://github.com/mlazos
ghstack dependencies: #122686
2024-03-27 00:24:40 +00:00
602c2af9e3 Cleaned up/fixed get_args after_aot repro (#122686)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122686
Approved by: https://github.com/ezyang
2024-03-27 00:24:40 +00:00
c81c9ba472 Disallow {FakeTensor,FunctionalTensor}.data_ptr (#122514)
This PR:
- disallows FakeTensor.data_ptr when it is called inside PT2 or fx tracing.
- disallows FunctionalTensor.data_ptr (python FunctionalTensor is only used in
  PT2)

The motivation behind this is that the leading cause of segfaults when
using custom ops with PT2 is calling .data_ptr on FunctionalTensor or
FakeTensor.

This change is BC-breaking. If your code broke as a result of this, it's
because there was a bug in it (these .data_ptr should never be
accessed!). You can either fix the bug (recommended) or get the previous
behavior back with:
```
from torch._subclasses.fake_tensor import FakeTensor
from torch._subclasses.functional_tensor import FunctionalTensor

data_ptr = 0 if isinstance(tensor, (FakeTensor, FunctionalTensor)) else tensor.data_ptr()
```

Test Plan:
- existing tests

Differential Revision: [D55366199](https://our.internmc.facebook.com/intern/diff/D55366199)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122514
Approved by: https://github.com/ezyang, https://github.com/albanD, https://github.com/yifuwang, https://github.com/kurtamohler
2024-03-26 23:55:42 +00:00
04399a3091 [xla hash update] update the pinned xla hash (#122628)
Originally made this PR since xla was failing, but the PR that changed the pin got reverted, so this is just a normal update now

The old pin was ~2 weeks old?

Currently XLA is broken https://github.com/pytorch/pytorch/actions/runs/8438508272/job/23115239444
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122628
Approved by: https://github.com/malfet, https://github.com/JackCaoG
2024-03-26 23:51:38 +00:00
07b618e2d4 Graph break cleanly in Dynamo for module parametrization (#121041)
Fixes #118795

This is a graph breaking partial fix for #120914. We still need -actual- module parametrization tracing support, but at least it doesn't blow up hard now.

**Background**: Module parametrization injects a property as the module parameter attribute that calls a `nn.Module` whose forward takes in a module parameter and returns a reparametrized module parameter.
Example:
```
class MyParametrization(nn.Module):
    def forward(X):
        # This reparametrization just negates the original parameter value
        return -X

m = nn.Linear(...)
p = MyParametrization()
register_parametrization(m, "weight", p)

# Accessing the "weight" attribute will invoke p's forward() on m's original weight and return the output as the new weight.
# m.weight here is now an injected property that does the above instead of an actual Parameter.
# This property is defined in torch/nn/utils/parametrize.py.
m.weight

# NB: Parametrization changes the module type (e.g. torch.nn.utils.parametrize.ParametrizedLinear)
print(type(m))
```

**Problem 1**: Dynamo has special tracing rules for things in `torch.nn`. Parametrizing a module changes the type of the module and the parametrized attribute, so now these rules wrongly affect tracing here. To fix this:
* For parametrized modules, call `convert_to_unspecialized()` to restart analysis where Dynamo starts inlining the module.

**Problem 2**: The issue seen in #118795 is that Dynamo will see a dynamically constructed tensor when `m.weight` is called and introduce that to its `tensor_weakref_to_sizes_strides` cache during fake-ification. This tensor is also made to be a graph input, since it's a module parameter. When guards are created for this module parameter input, the logic calls `m.weight` again and tries to look the result up in the cache, but this is a different tensor now, giving the `KeyError` symptom. To fix this:
* Replace Dynamo's `tensor_weakref_to_sizes_strides` cache with a `input_source_to_sizes_strides` cache.
    * This cache was originally introduced in #100128.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121041
Approved by: https://github.com/anijain2305
2024-03-26 23:44:51 +00:00
2367d0dacd [AOTInductor] Add tensor_constantX to pass constant buffer update's check (#122562) (#122690)
Summary:

During tracing, some constants (tensor_constant{idx}) are generated internally.
Those constants are neither parameters nor buffers, and users have no control over them.

To accommodate this, we should allow users to skip passing in those internally generated constants while still being able to update the constants in the model.

Test Plan:
Included in commit.
```
build/bin/test_aot_inductor
```

Reviewed By: zoranzhao

Differential Revision: D55354548

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122690
Approved by: https://github.com/khabinov
2024-03-26 23:25:15 +00:00
09cb42ce29 [dynamo] delete graph_out_{n} after restoring local vars (#122658)
At graph breaks, we create a graph_out_{n} symbol to hold the graph output and
use it to restore the local vars. In addition to their own symbols, the local
vars are kept alive by the symbol we created. This means that if the graph
break is the last usage of one of the symbols, the symbol would still be kept
alive upon graph resumption.

This PR: delete the graph_out_{n} symbol after restoring local vars so the
lifetime of the local vars is governed by themselves.

## Example Problem
Tensor `b`'s last usage is in the graph break. However, it won't be deallocated until `bar()` completes. In the original issue report by @Yuzhen11, `b` is a large tensor and `bar()` is an expensive computation.

```python
import torch

def foo(a):
    return torch.mm(a, a)

@torch._dynamo.disable()
def graph_break_fn(a):
    ret = a.bfloat16()
    return ret

def bar(c):
    return torch.mm(c, c)

def fn(a):
    b = foo(a)
    c = graph_break_fn(b)
    # del b
    return bar(c)

fn_compiled = torch.compile(fn, backend="eager")
a = torch.randn(10000, 10000, device="cuda", requires_grad=True)

fn_compiled(a).sum().backward()
```

Bytecode before this PR:
```
ORIGINAL BYTECODE fn /home/yifu/microbench/del2.py line 18
 19           0 LOAD_GLOBAL              0 (foo)
              2 LOAD_FAST                0 (a)
              4 CALL_FUNCTION            1
              6 STORE_FAST               1 (b)

 20           8 LOAD_GLOBAL              1 (graph_break_fn)
             10 LOAD_FAST                1 (b)
             12 CALL_FUNCTION            1
             14 STORE_FAST               2 (c)

 22          16 LOAD_GLOBAL              2 (bar)
             18 LOAD_FAST                2 (c)
             20 CALL_FUNCTION            1
             22 RETURN_VALUE

MODIFIED BYTECODE fn /home/yifu/microbench/del2.py line 18
 18           0 LOAD_GLOBAL              3 (__compiled_fn_0)
              2 LOAD_FAST                0 (a)
              4 CALL_FUNCTION            1
              6 STORE_FAST               3 (graph_out_0)
              8 LOAD_GLOBAL              1 (graph_break_fn)
             10 LOAD_FAST                3 (graph_out_0)
             12 LOAD_CONST               1 (0)
             14 BINARY_SUBSCR

 20          16 CALL_FUNCTION            1
             18 LOAD_GLOBAL              4 (__resume_at_14_1)
             20 ROT_TWO
             22 CALL_FUNCTION            1
             24 RETURN_VALUE

ORIGINAL BYTECODE torch_dynamo_resume_in_fn_at_20 /home/yifu/microbench/del2.py line 20
 20           0 LOAD_FAST                0 (___stack0)
              2 JUMP_ABSOLUTE            9 (to 18)
              4 LOAD_GLOBAL              0 (foo)
              6 LOAD_FAST                1 (a)
              8 CALL_FUNCTION            1
             10 STORE_FAST               2 (b)
             12 LOAD_GLOBAL              1 (graph_break_fn)
             14 LOAD_FAST                2 (b)
             16 CALL_FUNCTION            1
        >>   18 STORE_FAST               3 (c)

 22          20 LOAD_GLOBAL              2 (bar)
             22 LOAD_FAST                3 (c)
             24 CALL_FUNCTION            1
             26 RETURN_VALUE

MODIFIED BYTECODE torch_dynamo_resume_in_fn_at_20 /home/yifu/microbench/del2.py line 20
 20           0 LOAD_GLOBAL              3 (__compiled_fn_2)
              2 LOAD_FAST                0 (___stack0)
              4 CALL_FUNCTION            1
              6 UNPACK_SEQUENCE          1
              8 RETURN_VALUE
```

Bytecode after this PR:
```
ORIGINAL BYTECODE fn /home/yifu/microbench/del2.py line 18
 19           0 LOAD_GLOBAL              0 (foo)
              2 LOAD_FAST                0 (a)
              4 CALL_FUNCTION            1
              6 STORE_FAST               1 (b)

 20           8 LOAD_GLOBAL              1 (graph_break_fn)
             10 LOAD_FAST                1 (b)
             12 CALL_FUNCTION            1
             14 STORE_FAST               2 (c)

 22          16 LOAD_GLOBAL              2 (bar)
             18 LOAD_FAST                2 (c)
             20 CALL_FUNCTION            1
             22 RETURN_VALUE

MODIFIED BYTECODE fn /home/yifu/microbench/del2.py line 18
 18           0 LOAD_GLOBAL              3 (__compiled_fn_0)
              2 LOAD_FAST                0 (a)
              4 CALL_FUNCTION            1
              6 STORE_FAST               3 (graph_out_0)
              8 LOAD_GLOBAL              1 (graph_break_fn)
             10 LOAD_FAST                3 (graph_out_0)
             12 LOAD_CONST               1 (0)
             14 BINARY_SUBSCR
             16 DELETE_FAST              3 (graph_out_0)

 20          18 CALL_FUNCTION            1
             20 LOAD_GLOBAL              4 (__resume_at_14_1)
             22 ROT_TWO
             24 CALL_FUNCTION            1
             26 RETURN_VALUE

ORIGINAL BYTECODE torch_dynamo_resume_in_fn_at_20 /home/yifu/microbench/del2.py line 20
 20           0 LOAD_FAST                0 (___stack0)
              2 JUMP_ABSOLUTE            9 (to 18)
              4 LOAD_GLOBAL              0 (foo)
              6 LOAD_FAST                1 (a)
              8 CALL_FUNCTION            1
             10 STORE_FAST               2 (b)
             12 LOAD_GLOBAL              1 (graph_break_fn)
             14 LOAD_FAST                2 (b)
             16 CALL_FUNCTION            1
        >>   18 STORE_FAST               3 (c)

 22          20 LOAD_GLOBAL              2 (bar)
             22 LOAD_FAST                3 (c)
             24 CALL_FUNCTION            1
             26 RETURN_VALUE

MODIFIED BYTECODE torch_dynamo_resume_in_fn_at_20 /home/yifu/microbench/del2.py line 20
 20           0 LOAD_GLOBAL              3 (__compiled_fn_2)
              2 LOAD_FAST                0 (___stack0)
              4 CALL_FUNCTION            1
              6 UNPACK_SEQUENCE          1
              8 RETURN_VALUE

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122658
Approved by: https://github.com/jansel, https://github.com/anijain2305
2024-03-26 22:49:05 +00:00
df724153c1 Add option to skip cudagraphing on dynamic shape graphs (#122520)
This was requested internally.

Differential Revision: [D55264528](https://our.internmc.facebook.com/intern/diff/D55264528)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122520
Approved by: https://github.com/mlazos, https://github.com/shunting314
2024-03-26 21:49:21 +00:00
e229ec6886 [NEON] Speedup float16 convert (#122702)
By using `vcvt_f16_f32` and back

According to [benchmark_convert.py](d3279637ca) this makes float32 to float16 tensor conversion roughly 3 times faster: time to convert 4096x4096 float32 tensor drops from  5.23 msec to 1.66 msec on M2 Pro

Test plan: run `vector_test_all_types` + CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122702
Approved by: https://github.com/kimishpatel
2024-03-26 21:48:12 +00:00
6767c04fde Forward fix for broken internal tests related to NJT view dummy (#122704)
(internal link) [example test breakage](https://www.internalfb.com/intern/test/562950061753019?ref_report_id=0)

Symptom: `type stub not overridden` for SymInt. The global NJT dummy relies on `SymInt.__mul__()` in its constructor. Lazily constructing the dummy avoids the race.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122704
Approved by: https://github.com/soulitzer
2024-03-26 21:22:12 +00:00
291848bf30 [Build] Fix AVX detection logic (#122708)
`CXX_AVX[2|512]_FOUND` flags should indicate whether the compiler supports generating code for a given instruction set, rather than whether the host machine can run the generated code.

This fixes a weird problem that surfaced after https://github.com/pytorch/pytorch/pull/122503, where the builder can sometimes be dispatched to an old CPU architecture that cannot run AVX512 instructions but can compile for them just fine.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122708
Approved by: https://github.com/jeanschmidt
2024-03-26 20:37:35 +00:00
3bede14fa7 Don't create world pg variable out of thin air when rewriting c10d collectives (#122561)
Fixes https://github.com/pytorch/pytorch/issues/122404

Previously, when rewriting c10d collectives, if the group argument is
unspecified or None, we create a world pg variable out of thin air and
pass it to the rewrite target. The approach was problematic, as it
assumes the symbol `torch` is available in the scope (see #122404).

After #120560, dynamo can now trace dist.group.WORLD. If the group
argument is unspecified, we can just set it with dist.group.WORLD in the
rewrite target.
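
A hedged, standalone illustration of the default-group handling described above (single-rank gloo setup for demonstration; this is not the dynamo rewrite code itself):

```
import os
import torch
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

group = None                                              # group argument left unspecified...
group = group if group is not None else dist.group.WORLD  # ...now defaults to the traceable WORLD group

t = torch.ones(4)
dist.all_reduce(t, group=group)
print(t)

dist.destroy_process_group()
```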

Testing

pytest test/distributed/test_inductor_collectives.py -k test_dynamo_rewrite_dist_allreduce

Also verified with the repro provided in #122404

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122561
Approved by: https://github.com/wconstab
ghstack dependencies: #120560
2024-03-26 20:12:08 +00:00
852111e1c2 [TORCH_TRACE] Record stack when no compile context is available (#122644)
This will help me track down those annoying unknown compile products.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122644
Approved by: https://github.com/jamesjwu
2024-03-26 19:30:52 +00:00
f631586084 Revert "[dynamo] Forward OptimizedModule.__setattr__ to the wrapped module (#122098)"
This reverts commit b6982bf2b25d2d3ba5d82488a39721d6013a838f.

Reverted https://github.com/pytorch/pytorch/pull/122098 on behalf of https://github.com/atalman due to Failing internally ([comment](https://github.com/pytorch/pytorch/pull/122098#issuecomment-2021233604))
2024-03-26 18:54:17 +00:00
537cd66e73 [Inductor] Support custom op in JIT with cpp wrapper (#122554)
Summary: Calling custom ops in an ABI-compatible way requires doing a boxed call with varargs across the C shim. In JIT mode, we can get around this by calling into Python. https://gist.github.com/desertfire/be2a65b0a9b47780bb716b53ac2cd2b3 is an example of the generated code.

Differential Revision: [D55326556](https://our.internmc.facebook.com/intern/diff/D55326556)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122554
Approved by: https://github.com/jansel, https://github.com/chenyang78
2024-03-26 18:48:45 +00:00
e61aaab725 Log autotune time in scuba (#122637)
Summary:
This diff
* Refactors triton and autotune caches to be child classes of the original memcache based cache infra
* Swaps scuba table for autotune
* Adds autotune time spent/saved to scuba table

Test Plan:
Local testing using:
```
buck run mode/opt fbcode//caffe2/test/inductor/:max_autotune -- -r test_max_autotune_remote_caching_dynamic_False
```
and
```
TORCH_INDUCTOR_AUTOTUNE_REMOTE_CACHE=1 buck2 run mode/opt //scripts/oulgen:runner
```

Differential Revision: D55332620

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122637
Approved by: https://github.com/jamesjwu
2024-03-26 17:51:33 +00:00
1f5fcb4e20 [Inductor] Run pattern matcher over the original graph (#122519)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122519
Approved by: https://github.com/jansel
2024-03-26 17:30:32 +00:00
8cfbdc0451 [Easy][DCP] Fix small typo in assert (#122633)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122633
Approved by: https://github.com/awgu, https://github.com/wconstab
2024-03-26 16:46:12 +00:00
30a579dba3 Add XPU ATen merge rule (#122484)
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122484
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-03-26 16:20:48 +00:00
e08cbc0d41 update comment of test_invalid_last_dim_stride in test_transformers.py (#122679)
Fixes #122594

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122679
Approved by: https://github.com/mikaylagawarecki
2024-03-26 15:40:24 +00:00
8bad7b63c8 [ez] Add more files to trigger inductor (#122669)
To catch https://github.com/pytorch/pytorch/pull/122562/files
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122669
Approved by: https://github.com/desertfire
2024-03-26 15:19:30 +00:00
9b90c5e2a1 [CI] Switch pull job linux-jammy-py3_8-gcc11-build to use ARC with runner groups (#122503)
title says it all...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122503
Approved by: https://github.com/atalman
2024-03-26 14:38:12 +00:00
85845a29db Refactor ShapeEnvSettings so it's directly on ShapeEnv (#122310)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122310
Approved by: https://github.com/masnesral, https://github.com/lezcano
2024-03-26 14:16:33 +00:00
7e176ebb47 Log compilation_metrics to TORCH_TRACE (#122638)
It's not technically needed as you can get it from Scuba too, but it's
more convenient for tlparse to get at it this way.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122638
Approved by: https://github.com/albanD
2024-03-26 14:10:55 +00:00
99c822c0ba Let dynamo inline through jacfwd (#121254)
Similar to #121146, the changes are simple and don't require any fancy modification to the codebase. Moved a few entries in trace_rules.py and added tests.
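
A hedged illustration of the newly supported pattern, assuming dynamo can now inline through `torch.func.jacfwd` as described:

```
import torch

def f(x):
    # Forward-mode Jacobian of sin at x.
    return torch.func.jacfwd(torch.sin)(x)

compiled = torch.compile(f, backend="eager")
print(compiled(torch.randn(3)))
```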

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121254
Approved by: https://github.com/zou3519
ghstack dependencies: #120338
2024-03-26 12:43:30 +00:00
2b4173e0de [Quant] [Inductor] Enable the Inductor Lowering of QConv2d post op HardTanh with int8-mix-bf16 (#122374)
**Summary**
Enable the fusion pattern of `QConv2d -> hardtanh` lowering for int8-mixed-bf16 case.

**Test Plan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d_hardtanh_int8_mixed_bf16_cpu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122374
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #122266, #122267, #122268, #122373
2024-03-26 08:12:41 +00:00
293579363c [Quant] [Inductor] Enable the Inductor Lowering of QConv2d post op HardSwish with int8-mix-bf16 (#122373)
**Summary**
Enable the fusion pattern of `QConv2d -> hardswish` lowering for int8-mixed-bf16 case.

**Test Plan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d_hardswish_int8_mixed_bf16_cpu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122373
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #122266, #122267, #122268
2024-03-26 08:09:35 +00:00
caf9c23310 [Quant] [Inductor] Enable the Inductor Lowering of QConv2d post op SiLU (#122268)
**Summary**
Enable the fusion pattern of `QConv2d -> silu` lowering to `swish` as `QConv2d` post operator.

**Test Plan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d_silu_cpu
python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d_silu_int8_mixed_bf16_cpu
python -m pytest test_mkldnn_pattern_matcher.py -k test_qat_qconv2d_silu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122268
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #122266, #122267
2024-03-26 08:07:06 +00:00
41d24df08f [export] hack skip index_put_ in dce (#122683)
Summary: Ideally we should do what's in the TODO. Just doing this for now to unblock llama capture.

Test Plan: capturing llama and using pt2e to quantize it

Differential Revision: D55354487

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122683
Approved by: https://github.com/kimishpatel
2024-03-26 08:05:06 +00:00
e0329cba8a [Quant] [PT2] Add SiLU into X86InductorQuantizer Conv2d Unary Annotation (#122267)
**Summary**
Add `SiLU` into X86InductorQuantizer Conv2d Unary Annotation

**TestPlan**
```
python -m pytest test_x86inductor_quantizer.py -k test_conv2d_unary
python -m pytest test_x86inductor_quantizer.py -k test_qat_conv2d_unary
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122267
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #122266
2024-03-26 08:03:42 +00:00
b7089937dc Disable test (test_mm_plus_mm2_cuda_cuda_wrapper) (#122682)
Summary:
The test is unstable at the moment. We need to make sure both the ATen
and Triton kernels work before reactivating the test.

Test Plan:
Disabling test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122682
Approved by: https://github.com/clee2000
2024-03-26 07:14:35 +00:00
f8eeae7aaa Enable CPP wrapper codegen registration (#121296)
Extend codegen registration for `CppWrapper`. With this PR, a new backend can register its specific `CppWrapper` at runtime.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121296
Approved by: https://github.com/jansel, https://github.com/desertfire
2024-03-26 06:51:03 +00:00
d1f58eaaf5 [inductor] Fix bug with freezing + split_cat passes (#122544)
Fixes #122380

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122544
Approved by: https://github.com/eellison
2024-03-26 06:12:57 +00:00
268b0cc714 Do not run CUDA lazy init if it is triggered with fake mode on. (#122636)
Partially fixes https://github.com/pytorch/pytorch/issues/122109

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122636
Approved by: https://github.com/zou3519
2024-03-26 05:43:59 +00:00
dd3f2cb53a [Inductor] Add NEON ISA support on arm64 Macs (#122217)
This started as a re-land of https://github.com/pytorch/pytorch/pull/105590 but focusing on enabling it on MacOS, but quickly turned into landing very limited platform-specific acceleration at this time (I.e. this PR does not add any NEON accelerated code at all, just enables vectorized compilation for the existing abstractions)

Enabling the test harness, uncovered number of latent issues in CPU inductor that were fixed in the following PRS:
- https://github.com/pytorch/pytorch/pull/122511
- https://github.com/pytorch/pytorch/pull/122513
- https://github.com/pytorch/pytorch/pull/122580
- https://github.com/pytorch/pytorch/pull/122608

Following was added/changed to enable vectorization code to work on MacOS
 - Added VecNEON class to `_inductor/codecache.py`  that is supported on all AppleSilicon Macs
 - Added `Vectorized::loadu_one_fourth` to `vec_base.h`, and limit it to 8-bit types
 - Change 64-bit integral types mapping to `int64_t`/`uint64_t` to align with the rest of the code, as on MacOS, `int64_t` is a `long long` rather than `long` (see https://github.com/pytorch/pytorch/pull/118149 for more details)

See table below for perf changes with and without torch.compile using [gpt-fast](https://github.com/pytorch-labs/gpt-fast) running `stories15M` on M2 Pro:
| dtype  | Eager | Compile (before) | Compile (after) |
| ------ | ------ | --------- | --------- |
| bfloat16  | 120 tokens/sec  | 130 tokens/sec | 156 tokens/sec |
| float32  | 158 tokens/sec  | 140 tokens/sec | 236 tokens/sec |
| float16  | 235 tokens/sec  | 81 tokens/sec | 58 tokens/sec |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122217
Approved by: https://github.com/jansel
2024-03-26 05:07:30 +00:00
a333b080c1 Only update momentum buffers for SGD if momentum is enabled (#122349)
As title

[benchmark](https://gist.github.com/mlazos/1171f035a2392c33778aaa3d7bf24370)

Helps compiled vanilla SGD execution time by 2x on certain models with large number of small params (ex.
ElectraForQuestionAnswering goes from 1090us -> 554us)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122349
Approved by: https://github.com/janeyx99
2024-03-26 04:19:39 +00:00
0c47f8028e Keep example_inputs when saving and loading ExportedProgram (#122618)
Summary:
`torch.export` is a powerful tool for creating a structured and shareable package from arbitrary pytorch code. One great use case of `torch.export` is sharing models or subgraphs in a way that allows results to be easily replicated. However, in the current implementation of `export`, the `example_inputs` field is thrown out. When trying to replicate bugs, benchmarks, or behaviors, losing the original input shapes and values makes the process much messier.

This change adds saving and loading for the `example_inputs` attribute of an `ExportedProgram` when using `torch.export.save` and `torch.export.load`. This simple addition makes `ExportedProgram`s a fantastic tool for performance and accuracy replication. For example, with this change we enable the following workflow:

```
# Script to create a reproducible accuracy issue with my model.
kwargs = {"fastmath_mode": True}
exp_program = export(my_model, sample_inputs, kwargs)
result = exp_program.module()(*sample_inputs, **kwargs)
# Uhoh, I dont like that result, lets send the module to a colleague to take a look.
torch.export.save(exp_program, "my_model.pt2")
```

My colleague can then easily reproduce my results like so:

```
# Script to load and reproduce results from a saved ExportedProgram.
loaded_program = torch.export.load("my_model.pt2")
# The following line is enabled by this Diff, we pull out the arguments
# and options that caused the issue.
args, kwargs = loaded_program.example_inputs
reproduced_result = loaded_program.module()(*args, **kwargs)
# Oh I see what happened here, lets fix it.
```

Being able to share exact inputs and arguments makes `ExportedProgram`s much
cleaner and more powerful with little downside. The main potential issue with this change
is that it does slightly increase the size of saved programs. However, the size of
inputs will be much smaller than parameters in most cases. I am curious to hear
discussion on saved file size though.

The deserialization of `example_inputs` is currently implemented as `Optional`. Although this wont effect users of `export.save` and `export.load`, it does give backwards compatibility to any direct users of `serialize` and `deserialize`.

Test Plan:
This diff includes a new test which exercises the save / load flow with multiple args and kwargs.

```
buck test //caffe2/test:test_export -- TestSerialize
```

Differential Revision: D55294614

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122618
Approved by: https://github.com/zhxchen17
2024-03-26 03:32:44 +00:00
47e8d60627 [dtensor] add op support for view_as_complex and view_as_real (#122569)
This PR will unblock DTensor computations for [rotary embeddings](https://github.com/meta-llama/llama/blob/main/llama/model.py#L132) used in LLaMa training.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122569
Approved by: https://github.com/wanchaol
ghstack dependencies: #122541
2024-03-26 03:32:04 +00:00
1af6fc5e03 Remove top-level DisableFuncTorch; clearing interpreter stack should work. (#122610)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122610
Approved by: https://github.com/zou3519
ghstack dependencies: #122202
2024-03-26 03:08:22 +00:00
f42818321b Restore DILL_AVAILABLE for backwards compat with torchdata (#122616)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122616
Approved by: https://github.com/peterbell10
2024-03-26 02:18:51 +00:00
55f36d1ada Revert "[AOTInductor] Add tensor_constantX to pass constant buffer update's check (#122562)"
This reverts commit 57a3d00b0659e4ac37c4a35a36c71f710e89197a.

Reverted https://github.com/pytorch/pytorch/pull/122562 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/122562#issuecomment-2019262415))
2024-03-26 02:18:19 +00:00
4e0b5d59fa [dtensor] add backward support for scaled dot product attention (flash-attention) (#122541)
As titled, as a followup to the forward part #120298.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122541
Approved by: https://github.com/wanchaol
2024-03-26 01:50:24 +00:00
c2d4f8fa7a Update base for Update on "[NJT] Actually inline NT torch function during dynamo"
In order to avoid having any temporary state where the behavior of anything is regressed. This PR does all of the following at once:

(1) Disables torch function running a second time in AOTAutograd

If you have a tensor subclass that relies on dispatch into the same op without unwrapping and calling torch._C.DisableTorchFunctionSubclass() the torch function-ness will survive into AOTAutograd (when normally we may expect the torch function to be inlined away during dynamo). If this happens, we should make sure to not run the torch function logic a second time.

2.  Enables torch function to be inlined in dynamo for NT

Due to torch function running a second time AOTAutograd, NT was actually relying on this behavior instead of properly inlining through torch function at the dynamo level. 

3. Fixes graph breaks for NT torch function

Now that we are inlining through torch function for the first time in dynamo, we've uncovered some graph breaks. Thanks to mlazos, we should have support for custom attributes for torch function now. We also add support for a custom Enum type. Finally, a few of them we can get rid of by adding allow_in_graph (though we may need to double check the soundness here).


Fixes https://github.com/pytorch/pytorch/issues/120654, https://github.com/pytorch/pytorch/issues/120124

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang

[ghstack-poisoned]
2024-03-21 19:31:23 -07:00
c269e4a200 Update base for Update on "[NJT] Actually inline NT torch function during dynamo"
In order to avoid any temporary state where behavior is regressed, this PR does all of the following at once:

(1) Disables torch function running a second time in AOTAutograd

If you have a tensor subclass that relies on dispatching into the same op without unwrapping and without calling torch._C.DisableTorchFunctionSubclass(), the torch-function-ness will survive into AOTAutograd (when normally we would expect the torch function to be inlined away during dynamo). If this happens, we should make sure not to run the torch function logic a second time.

(2) Enables torch function to be inlined in dynamo for NT

Because torch function ran a second time in AOTAutograd, NT was actually relying on that behavior instead of properly inlining through torch function at the dynamo level.

(3) Fixes graph breaks for NT torch function

Now that we are inlining through torch function for the first time in dynamo, we've uncovered some graph breaks. Thanks to mlazos, we should now have support for custom attributes for torch function. We also add support for a custom Enum type. Finally, a few of the graph breaks can be removed by adding allow_in_graph (though we may need to double-check the soundness here).


Fixes https://github.com/pytorch/pytorch/issues/120654, https://github.com/pytorch/pytorch/issues/120124

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang

[ghstack-poisoned]
2024-03-21 19:29:26 -07:00
f59cfa5d5b Update base for Update on "[NJT] Actually inline NT torch function during dynamo"
In order to avoid any temporary state where behavior is regressed, this PR does all of the following at once:

(1) Disables torch function running a second time in AOTAutograd

If you have a tensor subclass that relies on dispatching into the same op without unwrapping and without calling torch._C.DisableTorchFunctionSubclass(), the torch-function-ness will survive into AOTAutograd (when normally we would expect the torch function to be inlined away during dynamo). If this happens, we should make sure not to run the torch function logic a second time.

(2) Enables torch function to be inlined in dynamo for NT

Because torch function ran a second time in AOTAutograd, NT was actually relying on that behavior instead of properly inlining through torch function at the dynamo level.

(3) Fixes graph breaks for NT torch function

Now that we are inlining through torch function for the first time in dynamo, we've uncovered some graph breaks. Thanks to mlazos, we should now have support for custom attributes for torch function. We also add support for a custom Enum type. Finally, a few of the graph breaks can be removed by adding allow_in_graph (though we may need to double-check the soundness here).


Fixes https://github.com/pytorch/pytorch/issues/120654, https://github.com/pytorch/pytorch/issues/120124

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang

[ghstack-poisoned]
2024-03-21 19:25:40 -07:00
f633002021 Update base for Update on "[NJT] Actually inline NT torch function during dynamo"
In order to avoid any temporary state where behavior is regressed, this PR does all of the following at once:

(1) Disables torch function running a second time in AOTAutograd

If you have a tensor subclass that relies on dispatching into the same op without unwrapping and without calling torch._C.DisableTorchFunctionSubclass(), the torch-function-ness will survive into AOTAutograd (when normally we would expect the torch function to be inlined away during dynamo). If this happens, we should make sure not to run the torch function logic a second time.

(2) Enables torch function to be inlined in dynamo for NT

Because torch function ran a second time in AOTAutograd, NT was actually relying on that behavior instead of properly inlining through torch function at the dynamo level.

(3) Fixes graph breaks for NT torch function

Now that we are inlining through torch function for the first time in dynamo, we've uncovered some graph breaks. Thanks to mlazos, we should now have support for custom attributes for torch function. We also add support for a custom Enum type. Finally, a few of the graph breaks can be removed by adding allow_in_graph (though we may need to double-check the soundness here).


Fixes https://github.com/pytorch/pytorch/issues/120654, https://github.com/pytorch/pytorch/issues/120124

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang

[ghstack-poisoned]
2024-03-21 19:11:19 -07:00
535da81018 Update base for Update on "[NJT] Actually inline NT torch function during dynamo"
In order to avoid any temporary state where behavior is regressed, this PR does all of the following at once:

(1) Disables torch function running a second time in AOTAutograd

If you have a tensor subclass that relies on dispatching into the same op without unwrapping and without calling torch._C.DisableTorchFunctionSubclass(), the torch-function-ness will survive into AOTAutograd (when normally we would expect the torch function to be inlined away during dynamo). If this happens, we should make sure not to run the torch function logic a second time.

(2) Enables torch function to be inlined in dynamo for NT

Because torch function ran a second time in AOTAutograd, NT was actually relying on that behavior instead of properly inlining through torch function at the dynamo level.

(3) Fixes graph breaks for NT torch function

Now that we are inlining through torch function for the first time in dynamo, we've uncovered some graph breaks. Thanks to mlazos, we should now have support for custom attributes for torch function. We also add support for a custom Enum type. Finally, a few of the graph breaks can be removed by adding allow_in_graph (though we may need to double-check the soundness here).


Fixes https://github.com/pytorch/pytorch/issues/120654, https://github.com/pytorch/pytorch/issues/120124

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang

[ghstack-poisoned]
2024-03-19 14:53:36 -07:00
10716a4af4 Update base for Update on "[NJT] Actually inline NT torch function during dynamo"
In order to avoid any temporary state where behavior is regressed, this PR does all of the following at once:

(1) Disables torch function running a second time in AOTAutograd

If you have a tensor subclass that relies on dispatching into the same op without unwrapping and without calling torch._C.DisableTorchFunctionSubclass(), the torch-function-ness will survive into AOTAutograd (when normally we would expect the torch function to be inlined away during dynamo). If this happens, we should make sure not to run the torch function logic a second time.

(2) Enables torch function to be inlined in dynamo for NT

Because torch function ran a second time in AOTAutograd, NT was actually relying on that behavior instead of properly inlining through torch function at the dynamo level.

(3) Fixes graph breaks for NT torch function

Now that we are inlining through torch function for the first time in dynamo, we've uncovered some graph breaks. Thanks to mlazos, we should now have support for custom attributes for torch function. We also add support for a custom Enum type. Finally, a few of the graph breaks can be removed by adding allow_in_graph (though we may need to double-check the soundness here).


Fixes https://github.com/pytorch/pytorch/issues/120654, https://github.com/pytorch/pytorch/issues/120124

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang

[ghstack-poisoned]
2024-03-19 13:26:37 -07:00
30ed14c97c Update base for Update on "[NJT] Actually inline NT torch function during dynamo"
In order to avoid any temporary state where behavior is regressed, this PR does all of the following at once:

(1) Disables torch function running a second time in AOTAutograd

If you have a tensor subclass that relies on dispatching into the same op without unwrapping and without calling torch._C.DisableTorchFunctionSubclass(), the torch-function-ness will survive into AOTAutograd (when normally we would expect the torch function to be inlined away during dynamo). If this happens, we should make sure not to run the torch function logic a second time.

(2) Enables torch function to be inlined in dynamo for NT

Because torch function ran a second time in AOTAutograd, NT was actually relying on that behavior instead of properly inlining through torch function at the dynamo level.

(3) Fixes graph breaks for NT torch function

Now that we are inlining through torch function for the first time in dynamo, we've uncovered some graph breaks. Thanks to mlazos, we should now have support for custom attributes for torch function. We also add support for a custom Enum type. Finally, a few of the graph breaks can be removed by adding allow_in_graph (though we may need to double-check the soundness here).


Fixes https://github.com/pytorch/pytorch/issues/120654, https://github.com/pytorch/pytorch/issues/120124

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang

[ghstack-poisoned]
2024-03-19 13:23:33 -07:00
bfc53c9d89 Update base for Update on "[NJT] Actually inline NT torch function during dynamo"
In order to avoid any temporary state where behavior is regressed, this PR does all of the following at once:

(1) Disables torch function running a second time in AOTAutograd

If you have a tensor subclass that relies on dispatching into the same op without unwrapping and without calling torch._C.DisableTorchFunctionSubclass(), the torch-function-ness will survive into AOTAutograd (when normally we would expect the torch function to be inlined away during dynamo). If this happens, we should make sure not to run the torch function logic a second time.

(2) Enables torch function to be inlined in dynamo for NT

Because torch function ran a second time in AOTAutograd, NT was actually relying on that behavior instead of properly inlining through torch function at the dynamo level.

(3) Fixes graph breaks for NT torch function

Now that we are inlining through torch function for the first time in dynamo, we've uncovered some graph breaks. Thanks to mlazos, we should now have support for custom attributes for torch function. We also add support for a custom Enum type. Finally, a few of the graph breaks can be removed by adding allow_in_graph (though we may need to double-check the soundness here).


Fixes https://github.com/pytorch/pytorch/issues/120654, https://github.com/pytorch/pytorch/issues/120124

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang

[ghstack-poisoned]
2024-03-12 12:32:30 -07:00
3f52723029 Update base for Update on "[NJT] Actually inline NT torch function during dynamo"
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang

[ghstack-poisoned]
2024-03-12 12:07:07 -07:00
f518b82db8 Update on "Prevent __torch_function__ running a second time in AOTAutograd"
If you have a tensor subclass that relies on dispatching into the same op without unwrapping and without calling torch._C.DisableTorchFunctionSubclass(), the torch-function-ness will survive into AOTAutograd (when normally we would expect the torch function to be inlined away during dynamo). If this happens, we should make sure not to run the torch function logic a second time.
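For reference, a minimal sketch of the conventional subclass pattern this paragraph contrasts against: the guard shown below is what such a subclass would normally use, and a subclass that omits it is the case being handled here (the class itself is hypothetical):

```python
import torch

class MyTensor(torch.Tensor):
    # Conventional pattern: re-dispatch into the same op with subclass
    # __torch_function__ handling disabled, so the call does not recurse
    # and the torch-function logic does not run again downstream.
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        with torch._C.DisableTorchFunctionSubclass():
            return func(*args, **kwargs)

t = torch.randn(3).as_subclass(MyTensor)
print((t + 1).__class__)   # the result stays a MyTensor
```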

Fixes https://github.com/pytorch/pytorch/issues/120654, https://github.com/pytorch/pytorch/issues/120124

cc bdhirsh 

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang

[ghstack-poisoned]
2024-03-07 00:54:08 -05:00
05de3d9f0e Update base for Update on "Prevent __torch_function__ running a second time in AOTAutograd"
If you have a tensor subclass that relies on dispatching into the same op without unwrapping and without calling torch._C.DisableTorchFunctionSubclass(), the torch-function-ness will survive into AOTAutograd (when normally we would expect the torch function to be inlined away during dynamo). If this happens, we should make sure not to run the torch function logic a second time.

Fixes https://github.com/pytorch/pytorch/issues/120654, https://github.com/pytorch/pytorch/issues/120124

cc bdhirsh 

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang

[ghstack-poisoned]
2024-03-07 00:54:08 -05:00
917caf322f Update on "Prevent __torch_function__ running a second time in AOTAutograd"
If you have a tensor subclass that relies on dispatching into the same op without unwrapping and without calling torch._C.DisableTorchFunctionSubclass(), the torch-function-ness will survive into AOTAutograd (when normally we would expect the torch function to be inlined away during dynamo). If this happens, we should make sure not to run the torch function logic a second time.

Fixes https://github.com/pytorch/pytorch/issues/120654, https://github.com/pytorch/pytorch/issues/120124

cc bdhirsh 

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang

[ghstack-poisoned]
2024-03-06 23:59:58 -05:00
8a666191c1 Update base for Update on "Prevent __torch_function__ running a second time in AOTAutograd"
If you have a tensor subclass that relies on dispatching into the same op without unwrapping and without calling torch._C.DisableTorchFunctionSubclass(), the torch-function-ness will survive into AOTAutograd (when normally we would expect the torch function to be inlined away during dynamo). If this happens, we should make sure not to run the torch function logic a second time.

Fixes https://github.com/pytorch/pytorch/issues/120654, https://github.com/pytorch/pytorch/issues/120124

cc bdhirsh 

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang

[ghstack-poisoned]
2024-03-06 23:59:58 -05:00
628277a810 Update on "Prevent __torch_function__ running a second time in AOTAutograd"
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang

[ghstack-poisoned]
2024-03-06 21:20:29 -05:00
8352effea1 Update base for Update on "Prevent __torch_function__ running a second time in AOTAutograd"
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang

[ghstack-poisoned]
2024-03-06 21:20:29 -05:00
29580cc5a9 Update on "Prevent __torch_function__ running a second time in AOTAutograd"
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang

[ghstack-poisoned]
2024-03-06 21:18:15 -05:00
1fce3be2aa Update base for Update on "Prevent __torch_function__ running a second time in AOTAutograd"
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang

[ghstack-poisoned]
2024-03-06 21:18:15 -05:00
90fe67165d Prevent __torch_function__ running a second time in AOTAutograd
[ghstack-poisoned]
2024-03-06 20:34:29 -05:00
cdd0f0db49 Update on "Add infra TorchFunctionModeKey enum and extra TLS state for disabling"
[ghstack-poisoned]
2024-03-06 20:34:29 -05:00
a4dfc66721 Add infra TorchFunctionModeKey enum and extra TLS state for disabling
[ghstack-poisoned]
2024-03-06 20:05:55 -05:00
545 changed files with 11043 additions and 5045 deletions

View File

@ -1 +1 @@
7f96f5a852ba452670255d28d59f1e6398141fbb
d4b3e5cc607e97afdba79dc90f8ef968142f347c

View File

@ -36,6 +36,7 @@ hicpp-exception-baseclass,
hicpp-avoid-goto,
misc-*,
-misc-const-correctness,
-misc-include-cleaner,
-misc-use-anonymous-namespace,
-misc-unused-parameters,
-misc-no-recursion,

View File

@ -10,9 +10,9 @@ inputs:
description: Text that uniquely identifies a given job type within a workflow. All shards of a job should share the same job identifier.
required: true
s3_bucket:
description: S3 bucket to upload/download PyTest cache
description: S3 bucket to download PyTest cache
required: false
default: ""
default: "gha-artifacts"
runs:
using: composite

View File

@ -1 +1 @@
17a70815259222570feb071034acd7bae2adc019
ea437b31ce316ea3d66fe73768c0dcb94edb79ad

View File

@ -1 +1 @@
a0c79b399b75368208464b2c638708165cca7ef1
2c4665ffbb64f03f5d18016d3398af4ac4da5f03

View File

@ -1 +1 @@
707a632930bfde19ffb361cdf5c31a7682af4e67
b0ba29f98a695671972d4a4cc07441014dba2892

.github/labeler.yml (vendored), 2 changes
View File

@ -35,6 +35,8 @@
- test/distributed/tensor/parallel/test_fsdp_2d_parallel.py
- torch/distributed/_tensor/**
- torch/distributed/fsdp/**
- torch/csrc/inductor/**
- test/cpp/aot_inductor/**
"module: cpu":
- aten/src/ATen/cpu/**

View File

@ -236,6 +236,20 @@
- Lint
- pull
- name: XPU ATen
patterns:
- aten/src/ATen/xpu/**
- c10/xpu/**
- third_party/xpu.txt
approved_by:
- EikanWang
- jgong5
- gujinghui
mandatory_checks_name:
- EasyCLA
- Lint
- pull
- name: Distributions
patterns:
- torch/distributions/**

View File

@ -37,7 +37,7 @@ jobs:
linux-jammy-py3_8-gcc11-build:
name: linux-jammy-py3.8-gcc11
uses: ./.github/workflows/_linux-build.yml
uses: ./.github/workflows/_linux-build-rg.yml
with:
build-environment: linux-jammy-py3.8-gcc11
docker-image-name: pytorch-linux-jammy-py3.8-gcc11

View File

@ -45,10 +45,12 @@ jobs:
cuda-arch-list: 8.6
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 4, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 2, num_shards: 4, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 3, num_shards: 4, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 4, num_shards: 4, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 1, num_shards: 6, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 2, num_shards: 6, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 3, num_shards: 6, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 4, num_shards: 6, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 5, num_shards: 6, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 6, num_shards: 6, runner: "linux.g5.4xlarge.nvidia.gpu" },
]}
linux-focal-cuda12_1-py3-gcc9-slow-gradcheck-test:

View File

@ -192,6 +192,8 @@ include_patterns = [
'aten/src/ATen/*.cpp',
'aten/src/ATen/core/*.h',
'aten/src/ATen/core/*.cpp',
'aten/src/ATen/functorch/*.h',
'aten/src/ATen/functorch/*.cpp',
'c10/**/*.cpp',
'c10/**/*.h',
'torch/csrc/*.h',
@ -1906,6 +1908,7 @@ exclude_patterns = [
'torch/compiler/__init__.py',
'torch/contrib/__init__.py',
'torch/contrib/_tensorboard_vis.py',
"torch/cuda/_gpu_trace.py",
'torch/cuda/_memory_viz.py', # mypy: Value of type "object" is not indexable
'torch/distributed/__init__.py',
'torch/distributed/_composable_state.py',
@ -2371,7 +2374,7 @@ exclude_patterns = [
'torch/testing/_internal/common_subclass.py',
'torch/testing/_internal/common_utils.py',
'torch/testing/_internal/composite_compliance.py',
'torch/testing/_internal/control_flow_opinfo_db.py',
'torch/testing/_internal/hop_db.py',
'torch/testing/_internal/custom_op_db.py',
'torch/testing/_internal/data/__init__.py',
'torch/testing/_internal/data/network1.py',
@ -2433,7 +2436,6 @@ exclude_patterns = [
'torch/utils/_contextlib.py',
'torch/utils/_cpp_extension_versioner.py',
'torch/utils/_crash_handler.py',
'torch/utils/_cuda_trace.py',
'torch/utils/_device.py',
'torch/utils/_foreach_utils.py',
'torch/utils/_freeze.py',
@ -2442,7 +2444,6 @@ exclude_patterns = [
'torch/utils/_stats.py',
'torch/utils/_sympy/__init__.py',
'torch/utils/_sympy/functions.py',
'torch/utils/_sympy/value_ranges.py',
'torch/utils/_traceback.py',
'torch/utils/_zip.py',
'torch/utils/backcompat/__init__.py',
@ -2562,6 +2563,7 @@ exclude_patterns = [
'torch/utils/viz/__init__.py',
'torch/utils/viz/_cycles.py',
'torch/utils/weak.py',
'torch/xpu/_gpu_trace.py',
]
init_command = [
'python3',

View File

@ -742,13 +742,28 @@ if(MSVC)
append_cxx_flag_if_supported("/utf-8" CMAKE_CXX_FLAGS)
endif()
# CAVEAT: do NOT check USE_ROCM here, because USE_ROCM is always True until
# include(cmake/Dependencies.cmake)
# Note for ROCM platform:
# 1. USE_ROCM is always ON until include(cmake/Dependencies.cmake)
# 2. USE_CUDA will become OFF during re-configuration
# Truth Table:
# CUDA 1st pass: USE_CUDA=True;USE_ROCM=True, FLASH evaluates to ON by default
# CUDA 2nd pass: USE_CUDA=True;USE_ROCM=False, FLASH evaluates to ON by default
# ROCM 1st pass: USE_CUDA=True;USE_ROCM=True, FLASH evaluates to ON by default
# ROCM 2nd pass: USE_CUDA=False;USE_ROCM=True, FLASH evaluates to ON by default
# CPU 1st pass: USE_CUDA=False(Cmd Option);USE_ROCM=True, FLASH evaluates to OFF by default
# CPU 2nd pass: USE_CUDA=False(Cmd Option);USE_ROCM=False, FLASH evaluates to OFF by default
# Thus we cannot tell ROCM 2nd pass and CPU 1st pass
#
# The only solution is to include(cmake/Dependencies.cmake), and defer the
# aotriton build decision later.
include(cmake/Dependencies.cmake)
cmake_dependent_option(
USE_FLASH_ATTENTION
"Whether to build the flash_attention kernel for scaled dot product attention.\
Will be disabled if not supported by the platform" ON
"USE_CUDA AND NOT MSVC" OFF)
"USE_CUDA OR USE_ROCM;NOT MSVC" OFF)
# We are currenlty not using alibi attention for Flash
# So we disable this feature by default
@ -764,8 +779,6 @@ cmake_dependent_option(
Will be disabled if not supported by the platform" ON
"USE_CUDA" OFF)
include(cmake/Dependencies.cmake)
if(DEBUG_CUDA)
string(APPEND CMAKE_CUDA_FLAGS_DEBUG " -lineinfo")
string(APPEND CMAKE_CUDA_FLAGS_RELWITHDEBINFO " -lineinfo")

View File

@ -67,6 +67,7 @@ nn/qat/ @jerryzh168
/test/run_test.py @pytorch/pytorch-dev-infra
/torch/testing/_internal/common_device_type.py @mruberry
/torch/testing/_internal/common_utils.py @pytorch/pytorch-dev-infra
/torch/testing/_internal/hop_db.py @tugsbayasgalan @zou3519 @ydwu4
# Parametrizations
/torch/nn/utils/parametriz*.py @lezcano

View File

@ -419,32 +419,25 @@ if(NOT CMAKE_SYSTEM_PROCESSOR MATCHES "^(s390x|ppc64le)$")
list(APPEND ATen_CPU_DEPENDENCY_LIBS cpuinfo)
endif()
if(NOT MSVC AND NOT EMSCRIPTEN AND NOT INTERN_BUILD_MOBILE)
# Preserve values for the main build
set(__aten_sleef_build_shared_libs ${BUILD_SHARED_LIBS})
set(__aten_sleef_build_tests ${BUILD_TESTS})
# Unset our restrictive C++ flags here and reset them later.
# Remove this once we use proper target_compile_options.
set(OLD_CMAKE_CXX_FLAGS ${CMAKE_CXX_FLAGS})
set(CMAKE_CXX_FLAGS)
# Bump up optimization level for sleef to -O1, since at -O0 the compiler
# excessively spills intermediate vector registers to the stack
# and makes things run impossibly slowly
set(OLD_CMAKE_C_FLAGS_DEBUG ${CMAKE_C_FLAGS_DEBUG})
if(${CMAKE_C_FLAGS_DEBUG} MATCHES "-O0")
string(REGEX REPLACE "-O0" "-O1" CMAKE_C_FLAGS_DEBUG ${OLD_CMAKE_C_FLAGS_DEBUG})
else()
set(CMAKE_C_FLAGS_DEBUG "${CMAKE_C_FLAGS_DEBUG} -O1")
if(NOT EMSCRIPTEN AND NOT INTERN_BUILD_MOBILE)
if(NOT MSVC)
# Bump up optimization level for sleef to -O1, since at -O0 the compiler
# excessively spills intermediate vector registers to the stack
# and makes things run impossibly slowly
set(OLD_CMAKE_C_FLAGS_DEBUG ${CMAKE_C_FLAGS_DEBUG})
if(${CMAKE_C_FLAGS_DEBUG} MATCHES "-O0")
string(REGEX REPLACE "-O0" "-O1" CMAKE_C_FLAGS_DEBUG ${OLD_CMAKE_C_FLAGS_DEBUG})
else()
set(CMAKE_C_FLAGS_DEBUG "${CMAKE_C_FLAGS_DEBUG} -O1")
endif()
endif()
if(NOT USE_SYSTEM_SLEEF)
set(BUILD_SHARED_LIBS OFF CACHE BOOL "Build sleef static" FORCE)
set(BUILD_DFT OFF CACHE BOOL "Don't build sleef DFT lib" FORCE)
set(BUILD_GNUABI_LIBS OFF CACHE BOOL "Don't build sleef gnuabi libs" FORCE)
set(BUILD_TESTS OFF CACHE BOOL "Don't build sleef tests" FORCE)
set(OLD_CMAKE_BUILD_TYPE ${CMAKE_BUILD_TYPE})
set(SLEEF_BUILD_SHARED_LIBS OFF CACHE BOOL "Build sleef static" FORCE)
set(SLEEF_BUILD_DFT OFF CACHE BOOL "Don't build sleef DFT lib" FORCE)
set(SLEEF_BUILD_GNUABI_LIBS OFF CACHE BOOL "Don't build sleef gnuabi libs" FORCE)
set(SLEEF_BUILD_TESTS OFF CACHE BOOL "Don't build sleef tests" FORCE)
set(SLEEF_BUILD_SCALAR_LIB OFF CACHE BOOL "libsleefscalar will be built." FORCE)
if(CMAKE_SYSTEM_NAME STREQUAL "Darwin")
if(CMAKE_SYSTEM_PROCESSOR STREQUAL "arm64" OR CMAKE_OSX_ARCHITECTURES MATCHES "arm64")
set(DISABLE_SVE ON CACHE BOOL "Xcode's clang-12.5 crashes while trying to compile SVE code" FORCE)
@ -465,12 +458,9 @@ if(NOT MSVC AND NOT EMSCRIPTEN AND NOT INTERN_BUILD_MOBILE)
endif()
list(APPEND ATen_CPU_DEPENDENCY_LIBS sleef)
set(CMAKE_C_FLAGS_DEBUG ${OLD_CMAKE_C_FLAGS_DEBUG})
set(CMAKE_CXX_FLAGS ${OLD_CMAKE_CXX_FLAGS})
# Set these back. TODO: Use SLEEF_ to pass these instead
set(BUILD_SHARED_LIBS ${__aten_sleef_build_shared_libs} CACHE BOOL "Build shared libs" FORCE)
set(BUILD_TESTS ${__aten_sleef_build_tests} CACHE BOOL "Build tests" FORCE)
if(NOT MSVC)
set(CMAKE_C_FLAGS_DEBUG ${OLD_CMAKE_C_FLAGS_DEBUG})
endif()
endif()
if(USE_CUDA AND NOT USE_ROCM)

View File

@ -484,8 +484,8 @@ c10::optional<Tensor> to_functional_tensor(const c10::optional<Tensor>& tensor)
}
return c10::nullopt;
}
c10::List<c10::optional<Tensor>> to_functional_tensor(const c10::List<c10::optional<Tensor>>& t_list) {
c10::List<c10::optional<Tensor>> outputs;
c10::List<::std::optional<Tensor>> to_functional_tensor(const c10::List<::std::optional<Tensor>>& t_list) {
c10::List<::std::optional<Tensor>> outputs;
outputs.reserve(t_list.size());
for (const auto i : c10::irange(t_list.size())) {
outputs.push_back(to_functional_tensor(t_list[i]));
@ -536,8 +536,8 @@ std::vector<Tensor> from_functional_tensor(ITensorListRef t_list) {
}
return outputs;
}
c10::List<c10::optional<Tensor>> from_functional_tensor(const c10::List<c10::optional<Tensor>>& t_list) {
c10::List<c10::optional<Tensor>> outputs;
c10::List<::std::optional<Tensor>> from_functional_tensor(const c10::List<::std::optional<Tensor>>& t_list) {
c10::List<::std::optional<Tensor>> outputs;
outputs.reserve(t_list.size());
for (const auto i : c10::irange(t_list.size())) {
outputs.push_back(from_functional_tensor(t_list[i], /*assert_functional=*/false));
@ -572,7 +572,7 @@ void sync(ITensorListRef t_list) {
sync(t);
}
}
void sync(const c10::List<c10::optional<Tensor>>& t_list) {
void sync(const c10::List<::std::optional<Tensor>>& t_list) {
for (const auto i : c10::irange(t_list.size())) {
sync(t_list[i]);
}
@ -652,7 +652,7 @@ bool isFunctionalTensor(const c10::optional<Tensor>& t) {
}
}
bool isFunctionalTensor(const c10::List<c10::optional<Tensor>>& t_list) {
bool isFunctionalTensor(const c10::List<::std::optional<Tensor>>& t_list) {
if (t_list.empty()) return false;
auto functional_count = 0;
for (const auto i : c10::irange(t_list.size())) {

View File

@ -317,10 +317,10 @@ static inline void recordTensorIndex(
(*dim_ptr)++;
};
static inline c10::List<c10::optional<Tensor>> typeConvertIndices(
static inline c10::List<::std::optional<Tensor>> typeConvertIndices(
const Tensor& /*self*/,
std::vector<Tensor>&& indices) {
c10::List<c10::optional<Tensor>> converted_inds;
c10::List<::std::optional<Tensor>> converted_inds;
converted_inds.reserve(indices.size());
for (auto&& i : std::move(indices)) {
converted_inds.push_back(std::move(i));

View File

@ -13,4 +13,12 @@ at::Tensor Generator::get_state() const {
return at::Tensor::wrap_tensor_impl(this->impl_->get_state());
}
void Generator::graphsafe_set_state(const Generator& new_state) {
this->impl_->graphsafe_set_state(new_state.getIntrusivePtr());
}
Generator Generator::graphsafe_get_state() const {
return Generator(this->impl_->graphsafe_get_state());
}
} // namespace at

View File

@ -107,6 +107,10 @@ struct TORCH_API Generator {
at::Tensor get_state() const;
void graphsafe_set_state(const Generator& new_state);
Generator graphsafe_get_state() const;
std::mutex& mutex() {
return impl_->mutex_;
}

View File

@ -1154,15 +1154,15 @@ TEST(OperatorRegistrationTest, testAvailableArgTypes) {
"(int[]? a) -> int[]?");
// Test list of optional (with empty list)
testArgTypes<c10::List<c10::optional<int64_t>>>::test(
c10::List<c10::optional<int64_t>>(c10::List<c10::optional<int64_t>>({})), [] (const c10::List<c10::optional<int64_t>>& v) {EXPECT_EQ(0, v.size());},
c10::List<c10::optional<int64_t>>(c10::List<c10::optional<int64_t>>({})), [] (const IValue& v) {EXPECT_EQ(0, v.to<c10::List<c10::optional<int64_t>>>().size());},
testArgTypes<c10::List<::std::optional<int64_t>>>::test(
c10::List<::std::optional<int64_t>>(c10::List<::std::optional<int64_t>>({})), [] (const c10::List<::std::optional<int64_t>>& v) {EXPECT_EQ(0, v.size());},
c10::List<::std::optional<int64_t>>(c10::List<::std::optional<int64_t>>({})), [] (const IValue& v) {EXPECT_EQ(0, v.to<c10::List<::std::optional<int64_t>>>().size());},
"(int?[] a) -> int?[]");
// Test list of optional (with values)
testArgTypes<c10::List<c10::optional<int64_t>>>::test(
c10::List<c10::optional<int64_t>>(c10::List<c10::optional<int64_t>>({3, c10::nullopt, 2})), [] (const c10::List<c10::optional<int64_t>>& v) {expectListEquals<c10::optional<int64_t>>({3, c10::nullopt, 2}, v);},
c10::List<c10::optional<int64_t>>(c10::List<c10::optional<int64_t>>({3, c10::nullopt, 2})), [] (const IValue& v) {expectListEquals<c10::optional<int64_t>>({3, c10::nullopt, 2}, v.to<c10::List<c10::optional<int64_t>>>());},
testArgTypes<c10::List<::std::optional<int64_t>>>::test(
c10::List<::std::optional<int64_t>>(c10::List<::std::optional<int64_t>>({3, c10::nullopt, 2})), [] (const c10::List<::std::optional<int64_t>>& v) {expectListEquals<c10::optional<int64_t>>({3, c10::nullopt, 2}, v);},
c10::List<::std::optional<int64_t>>(c10::List<::std::optional<int64_t>>({3, c10::nullopt, 2})), [] (const IValue& v) {expectListEquals<c10::optional<int64_t>>({3, c10::nullopt, 2}, v.to<c10::List<::std::optional<int64_t>>>());},
"(int?[] a) -> int?[]");
// dict types
@ -1234,15 +1234,15 @@ TEST(OperatorRegistrationTest, testAvailableArgTypes) {
"(Dict(int, Tensor) a) -> Dict(int, Tensor)");
// weird deeply nested type
using DeeplyNestedType = c10::List<c10::Dict<std::string, c10::List<c10::optional<c10::Dict<int64_t, std::string>>>>>;
using DeeplyNestedType = c10::List<c10::Dict<std::string, c10::List<::std::optional<c10::Dict<int64_t, std::string>>>>>;
auto makeDeeplyNestedObject = [] () -> DeeplyNestedType {
c10::Dict<int64_t, std::string> inner3;
inner3.insert(1, "1");
c10::List<c10::optional<c10::Dict<int64_t, std::string>>> inner2;
c10::List<::std::optional<c10::Dict<int64_t, std::string>>> inner2;
inner2.push_back(std::move(inner3));
c10::Dict<std::string, c10::List<c10::optional<c10::Dict<int64_t, std::string>>>> inner1;
c10::Dict<std::string, c10::List<::std::optional<c10::Dict<int64_t, std::string>>>> inner1;
inner1.insert("key", std::move(inner2));
c10::List<c10::Dict<std::string, c10::List<c10::optional<c10::Dict<int64_t, std::string>>>>> result;
c10::List<c10::Dict<std::string, c10::List<::std::optional<c10::Dict<int64_t, std::string>>>>> result;
result.push_back(inner1);
return result;
};

View File

@ -22,6 +22,9 @@
#include <ATen/cpu/vec/vec256/vec256_bfloat16.h>
#endif
#include <ATen/cpu/vec/vec256/vec256_convert.h>
#include <ATen/cpu/vec/vec256/vec256_mask.h>
#include <algorithm>
#include <cstddef>
#include <cstdint>
@ -69,7 +72,7 @@ std::ostream& operator<<(std::ostream& stream, const Vectorized<T>& vec) {
}
#if defined(CPU_CAPABILITY_AVX2) && !defined(_MSC_VER)
#if defined(CPU_CAPABILITY_AVX2)
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ CAST (AVX2) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@ -94,7 +97,8 @@ inline Vectorized<double> cast<double, int64_t>(const Vectorized<int64_t>& src)
}
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ GATHER ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#ifndef _MSC_VER
// MSVC is not working well on complex function overload.
template<int64_t scale = 1>
std::enable_if_t<scale == 1 || scale == 2 || scale == 4 || scale == 8, Vectorized<double>>
inline gather(const double* base_addr, const Vectorized<int64_t>& vindex) {
@ -106,9 +110,10 @@ std::enable_if_t<scale == 1 || scale == 2 || scale == 4 || scale == 8, Vectorize
inline gather(const float* base_addr, const Vectorized<int32_t>& vindex) {
return _mm256_i32gather_ps(base_addr, vindex, scale);
}
#endif
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ MASK GATHER ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#ifndef _MSC_VER
// MSVC is not working well on complex function overload.
template<int64_t scale = 1>
std::enable_if_t<scale == 1 || scale == 2 || scale == 4 || scale == 8, Vectorized<double>>
inline mask_gather(const Vectorized<double>& src, const double* base_addr,
@ -122,7 +127,7 @@ inline mask_gather(const Vectorized<float>& src, const float* base_addr,
const Vectorized<int32_t>& vindex, Vectorized<float>& mask) {
return _mm256_mask_i32gather_ps(src, base_addr, vindex, mask, scale);
}
#endif
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ CONVERT ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
// Only works for inputs in the range: [-2^51, 2^51]
@ -302,6 +307,6 @@ inline Vectorized<uint8_t> flip(const Vectorized<uint8_t> & v) {
return flip8(v);
}
#endif // (defined(CPU_CAPABILITY_AVX2) && !defined(_MSC_VER)
#endif // (defined(CPU_CAPABILITY_AVX2)
}} // namepsace at::vec::CPU_CAPABILITY

View File

@ -7,7 +7,8 @@
#include <ATen/cpu/vec/vec_base.h>
#include <c10/util/irange.h>
#if defined(CPU_CAPABILITY_AVX2) && !defined(_MSC_VER)
#if defined(CPU_CAPABILITY_AVX2)
#define SLEEF_STATIC_LIBS
#include <sleef.h>
#endif
@ -18,7 +19,18 @@ namespace at::vec {
// See Note [CPU_CAPABILITY namespace]
inline namespace CPU_CAPABILITY {
#if defined(CPU_CAPABILITY_AVX2) && !defined(_MSC_VER)
#if defined(CPU_CAPABILITY_AVX2)
#ifndef SLEEF_CONST
#if (defined(__GNUC__) || defined(__CLANG__)) && !defined(__INTEL_COMPILER)
#define SLEEF_CONST const
#else
#define SLEEF_CONST
#endif
#define SLEEF_CONST_OLD SLEEF_CONST
#else
#define SLEEF_CONST_OLD
#endif
// bfloat16 conversion
static inline void cvtbf16_fp32(const __m128i& a, __m256& o) {
@ -31,6 +43,28 @@ static inline void cvtbf16_fp32(const __m256i& a, __m256& o1, __m256& o2) {
cvtbf16_fp32(lo, o1);
cvtbf16_fp32(hi, o2);
}
static inline __m128i cvtfp32_bf16(const __m256& src) {
__m256i value = _mm256_castps_si256(src);
__m256i nan = _mm256_set1_epi32(0xffff);
__m256i mask = _mm256_castps_si256(_mm256_cmp_ps(src, src, _CMP_ORD_Q));
__m256i ones = _mm256_set1_epi32(0x1);
__m256i vec_bias = _mm256_set1_epi32(0x7fff);
// uint32_t lsb = (input >> 16) & 1;
auto t_value = _mm256_and_si256(_mm256_srli_epi32(value, 16), ones);
// uint32_t rounding_bias = 0x7fff + lsb;
t_value = _mm256_add_epi32(t_value, vec_bias);
// input += rounding_bias;
t_value = _mm256_add_epi32(t_value, value);
// input = input >> 16;
t_value = _mm256_srli_epi32(t_value, 16);
// Check NaN before converting back to bf16
t_value = _mm256_blendv_epi8(nan, t_value, mask);
t_value = _mm256_packus_epi32(t_value, t_value); // t[4-7] t[4-7] t[0-4] t[0-4]
t_value = _mm256_permute4x64_epi64(t_value, 0xd8); // 11 01 10 00
return _mm256_castsi256_si128(t_value);
}
static inline __m256i cvtfp32_bf16(const __m256& a, const __m256& b) {
__m256i lo = _mm256_castps_si256(a);
__m256i hi = _mm256_castps_si256(b);
@ -80,6 +114,11 @@ static inline void cvtfp16_fp32(const __m256i& a, __m256& o1, __m256& o2) {
cvtfp16_fp32(hi, o2);
}
static inline __m128i cvtfp32_fp16(const __m256& src) {
return _mm256_cvtps_ph(
src, (_MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC));
}
static inline __m256i cvtfp32_fp16(const __m256& a, const __m256& b) {
__m128i lo = _mm256_cvtps_ph(
a, (_MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC));
@ -265,7 +304,8 @@ public:
}
return b;
}
Vectorized<T> map(const __m256 (*const vop)(__m256)) const {
Vectorized<T> map(SLEEF_CONST __m256 (*SLEEF_CONST_OLD vop)(__m256)) const {
__m256 lo, hi;
cvt_to_fp32<T>(values, lo, hi);
const auto o1 = vop(lo);
@ -1026,7 +1066,7 @@ inline Vectorized<type> convert_float_##name(const Vectorized<float>& a, const V
CONVERT_VECTORIZED_INIT(BFloat16, bfloat16);
CONVERT_VECTORIZED_INIT(Half, half);
#else // defined(CPU_CAPABILITY_AVX2) && !defined(_MSC_VER)
#else // defined(CPU_CAPABILITY_AVX2)
#define CONVERT_NON_VECTORIZED_INIT(type, name) \
inline std::tuple<Vectorized<float>, Vectorized<float>> convert_##name##_float(const Vectorized<type>& a) { \
@ -1049,11 +1089,39 @@ inline Vectorized<type> convert_float_##name(const Vectorized<float>& a, const V
return Vectorized<type>::loadu(arr2); \
}
CONVERT_NON_VECTORIZED_INIT(BFloat16, bfloat16);
#if defined(__aarch64__) && !defined(C10_MOBILE) && !defined(__CUDACC__)
inline std::tuple<Vectorized<float>, Vectorized<float>> convert_half_float(const Vectorized<Half>& a) {
static_assert(Vectorized<Half>::size() == 2 * Vectorized<float>::size());
auto arr = reinterpret_cast<const float16_t*>(a.operator const Half*());
float16x8_t x = vld1q_f16(arr);
float32x4_t x1 = vcvt_f32_f16(vget_low_f16(x));
float32x4_t x2 = vcvt_f32_f16(vget_high_f16(x));
float16x8_t y = vld1q_f16(arr + Vectorized<float>::size());
float32x4_t y1 = vcvt_f32_f16(vget_low_f16(y));
float32x4_t y2 = vcvt_f32_f16(vget_high_f16(y));
return { Vectorized<float>(x1, x2), Vectorized<float>(y1, y2) };
}
inline Vectorized<Half> convert_float_half(const Vectorized<float>& a, const Vectorized<float>& b) {
static_assert(Vectorized<Half>::size() == 2 * Vectorized<float>::size());
float32x4x2_t x = a;
float32x4x2_t y = b;
float16x4_t x1 = vcvt_f16_f32(x.val[0]);
float16x4_t x2 = vcvt_f16_f32(x.val[1]);
float16x4_t y1 = vcvt_f16_f32(y.val[0]);
float16x4_t y2 = vcvt_f16_f32(y.val[1]);
Vectorized<Half> rc;
auto arr = reinterpret_cast<float16_t*>(rc.operator Half*());
vst1q_f16(arr, vcombine_f16(x1, x2));
vst1q_f16(arr + Vectorized<float>::size(), vcombine_f16(y1, y2));
return rc;
}
#else
CONVERT_NON_VECTORIZED_INIT(Half, half);
#endif
#endif // defined(CPU_CAPABILITY_AVX2) && !defined(_MSC_VER)
#endif // defined(CPU_CAPABILITY_AVX2)
#if defined(CPU_CAPABILITY_AVX2) && !defined(_MSC_VER)
#if defined(CPU_CAPABILITY_AVX2)
#define LOAD_FP32_VECTORIZED_INIT(type, name) \
inline void load_fp32_from_##name(const type *data, Vectorized<float>& out) { \
auto values = _mm_loadu_si128(reinterpret_cast<const __m128i*>(data)); \
@ -1072,7 +1140,7 @@ inline void load_fp32_from_##name(const type *data, Vectorized<float>& out1, Vec
LOAD_FP32_VECTORIZED_INIT(BFloat16, bf16);
LOAD_FP32_VECTORIZED_INIT(Half, fp16);
#else // defined(CPU_CAPABILITY_AVX2) && !defined(_MSC_VER)
#else // defined(CPU_CAPABILITY_AVX2)
#define LOAD_FP32_NON_VECTORIZED_INIT(type, name) \
inline void load_fp32_from_##name(const type *data, Vectorized<float>& out) { \
__at_align__ float values[Vectorized<float>::size()]; \

View File

@ -8,7 +8,8 @@
#include <ATen/cpu/vec/intrinsics.h>
#include <ATen/cpu/vec/vec_base.h>
#if defined(CPU_CAPABILITY_AVX2) && !defined(_MSC_VER)
#if defined(CPU_CAPABILITY_AVX2)
#define SLEEF_STATIC_LIBS
#include <sleef.h>
#endif
@ -16,7 +17,7 @@ namespace at::vec {
// See Note [CPU_CAPABILITY namespace]
inline namespace CPU_CAPABILITY {
#if defined(CPU_CAPABILITY_AVX2) && !defined(_MSC_VER)
#if defined(CPU_CAPABILITY_AVX2)
template <> class Vectorized<c10::complex<double>> {
private:
@ -145,7 +146,7 @@ public:
auto abs = abs_();
auto zero = _mm256_setzero_pd();
auto mask = _mm256_cmp_pd(abs, zero, _CMP_EQ_OQ);
auto div = values / abs;
auto div = _mm256_div_pd(values, abs);
return _mm256_blendv_pd(div, zero, mask);
}
__m256d real_() const {

View File

@ -7,7 +7,8 @@
#include <c10/util/irange.h>
#include <ATen/cpu/vec/intrinsics.h>
#include <ATen/cpu/vec/vec_base.h>
#if defined(CPU_CAPABILITY_AVX2) && !defined(_MSC_VER)
#if defined(CPU_CAPABILITY_AVX2)
#define SLEEF_STATIC_LIBS
#include <sleef.h>
#endif
@ -15,7 +16,7 @@ namespace at::vec {
// See Note [CPU_CAPABILITY namespace]
inline namespace CPU_CAPABILITY {
#if defined(CPU_CAPABILITY_AVX2) && !defined(_MSC_VER)
#if defined(CPU_CAPABILITY_AVX2)
template <> class Vectorized<c10::complex<float>> {
private:
@ -180,7 +181,7 @@ public:
auto abs = abs_();
auto zero = _mm256_setzero_ps();
auto mask = _mm256_cmp_ps(abs, zero, _CMP_EQ_OQ);
auto div = values / abs;
auto div = _mm256_div_ps(values, abs);
return _mm256_blendv_ps(div, zero, mask);
}
__m256 real_() const {

View File

@ -0,0 +1,173 @@
#pragma once
#include <ATen/cpu/vec/functional_bfloat16.h>
#include <ATen/cpu/vec/intrinsics.h>
#include <ATen/cpu/vec/vec_base.h>
#include <ATen/cpu/vec/vec_convert.h>
namespace at::vec {
inline namespace CPU_CAPABILITY {
#if defined(CPU_CAPABILITY_AVX2) && !defined(_MSC_VER)
template <>
struct VecConvert<float, 1, BFloat16, 1> {
static inline VectorizedN<float, 1> apply(
const VectorizedN<BFloat16, 1>& src) {
VectorizedN<float, 1> result;
__m256 value;
cvtbf16_fp32(_mm256_castsi256_si128(src[0]), value);
result[0] = value;
return result;
}
};
template <>
struct VecConvert<float, 1, Half, 1> {
static inline VectorizedN<float, 1> apply(const VectorizedN<Half, 1>& src) {
VectorizedN<float, 1> result;
__m256 value;
cvtfp16_fp32(_mm256_castsi256_si128(src[0]), value);
result[0] = value;
return result;
}
};
template <>
struct VecConvert<BFloat16, 1, float, 1> {
static inline VectorizedN<BFloat16, 1> apply(
const VectorizedN<float, 1>& src) {
VectorizedN<BFloat16, 1> result;
result[0] = _mm256_castsi128_si256(cvtfp32_bf16(src[0]));
return result;
}
};
template <>
struct VecConvert<Half, 1, float, 1> {
static inline VectorizedN<Half, 1> apply(const VectorizedN<float, 1>& src) {
VectorizedN<Half, 1> result;
result[0] = _mm256_castsi128_si256(cvtfp32_fp16(src[0]));
return result;
}
};
template <>
inline Vectorized<double> convert_to_fp_of_same_size<double>(
const Vectorized<int64_t>& src);
template <>
struct VecConvert<float, 1, int64_t, 2> {
static inline VectorizedN<float, 1> apply(
const VectorizedN<int64_t, 2>& src) {
auto low_double = at::vec::convert_to_fp_of_same_size<double>(src[0]);
auto low = _mm256_cvtpd_ps(low_double);
auto high_double = at::vec::convert_to_fp_of_same_size<double>(src[1]);
auto high = _mm256_cvtpd_ps(high_double);
return Vectorized<float>(
_mm256_insertf128_ps(_mm256_castps128_ps256(low), high, 1));
}
};
template <>
inline Vectorized<int32_t> convert_to_int_of_same_size<float>(
const Vectorized<float>& src);
template <>
struct VecConvert<int64_t, 2, float, 1> {
static inline VectorizedN<int64_t, 2> apply(
const VectorizedN<float, 1>& src) {
at::vec::VectorizedN<int64_t, 2> result;
auto int32_vec = at::vec::convert_to_int_of_same_size(src[0]);
result[0] = _mm256_cvtepi32_epi64(_mm256_castsi256_si128(int32_vec));
result[1] = _mm256_cvtepi32_epi64(_mm256_extracti128_si256(int32_vec, 1));
return result;
}
};
template <>
struct VecConvert<int32_t, 1, int64_t, 2> {
static inline VectorizedN<int32_t, 1> apply(
const VectorizedN<int64_t, 2>& src) {
auto low = _mm256_shuffle_epi32(src[0], _MM_SHUFFLE(2, 0, 2, 0));
auto high = _mm256_shuffle_epi32(src[1], _MM_SHUFFLE(2, 0, 2, 0));
auto low_perm = _mm256_permute4x64_epi64(low, _MM_SHUFFLE(3, 1, 2, 0));
auto high_perm = _mm256_permute4x64_epi64(high, _MM_SHUFFLE(3, 1, 2, 0));
return Vectorized<int32_t>(_mm256_blend_epi32(low_perm, high_perm, 0xF0));
}
};
template <>
struct VecConvert<int64_t, 2, int32_t, 1> {
static inline VectorizedN<int64_t, 2> apply(
const VectorizedN<int32_t, 1>& src) {
at::vec::VectorizedN<int64_t, 2> result;
result[0] = _mm256_cvtepi32_epi64(_mm256_castsi256_si128(src[0]));
result[1] = _mm256_cvtepi32_epi64(_mm256_extracti128_si256(src[0], 1));
return result;
}
};
template <>
struct VecConvert<int32_t, 1, int8_t, 1> {
static inline VectorizedN<int32_t, 1> apply(
const VectorizedN<int8_t, 1>& src) {
auto src128 = _mm256_castsi256_si128(src[0]);
return Vectorized<int32_t>(_mm256_cvtepi8_epi32(src128));
}
};
template <>
struct VecConvert<int32_t, 1, uint8_t, 1> {
static inline VectorizedN<int32_t, 1> apply(
const VectorizedN<uint8_t, 1>& src) {
auto src128 = _mm256_castsi256_si128(src[0]);
return Vectorized<int32_t>(_mm256_cvtepu8_epi32(src128));
}
};
template <typename dst_t>
struct VecConvert<
dst_t,
1,
int64_t,
2,
typename std::enable_if<
std::is_same_v<dst_t, int8_t> ||
std::is_same_v<dst_t, uint8_t>>::type> {
static inline VectorizedN<dst_t, 1> apply(
const VectorizedN<int64_t, 2>& src) {
return VecConvert<dst_t, 1, int32_t, 1>::apply(
VecConvert<int32_t, 1, int64_t, 2>::apply(src));
}
};
#endif
template <typename src_t>
struct VecConvert<
float,
1,
src_t,
1,
typename std::enable_if_t<is_reduced_floating_point_v<src_t>, void>> {
static inline VectorizedN<float, 1> apply(const VectorizedN<src_t, 1>& src) {
auto [res_vec1, res_vec2] = convert_to_float<src_t>(src[0]);
return res_vec1;
}
};
template <typename dst_t>
struct VecConvert<
dst_t,
1,
float,
1,
typename std::enable_if_t<is_reduced_floating_point_v<dst_t>, void>> {
static inline VectorizedN<dst_t, 1> apply(const VectorizedN<float, 1>& src) {
return convert_from_float<dst_t>(src[0], src[0]);
}
};
} // namespace CPU_CAPABILITY
} // namespace at::vec

View File

@ -6,7 +6,8 @@
#include <ATen/cpu/vec/intrinsics.h>
#include <ATen/cpu/vec/vec_base.h>
#include <c10/util/irange.h>
#if defined(CPU_CAPABILITY_AVX2) && !defined(_MSC_VER)
#if defined(CPU_CAPABILITY_AVX2)
#define SLEEF_STATIC_LIBS
#include <sleef.h>
#endif
@ -15,7 +16,7 @@ namespace at::vec {
inline namespace CPU_CAPABILITY {
#if defined(CPU_CAPABILITY_AVX2) && !defined(_MSC_VER)
#if defined(CPU_CAPABILITY_AVX2)
template <> class Vectorized<double> {
private:

View File

@ -6,7 +6,8 @@
#include <ATen/cpu/vec/intrinsics.h>
#include <ATen/cpu/vec/vec_base.h>
#include <c10/util/irange.h>
#if defined(CPU_CAPABILITY_AVX2) && !defined(_MSC_VER)
#if defined(CPU_CAPABILITY_AVX2)
#define SLEEF_STATIC_LIBS
#include <sleef.h>
#endif
@ -14,7 +15,7 @@ namespace at::vec {
// See Note [CPU_CAPABILITY namespace]
inline namespace CPU_CAPABILITY {
#if defined(CPU_CAPABILITY_AVX2) && !defined(_MSC_VER)
#if defined(CPU_CAPABILITY_AVX2)
template <> class Vectorized<float> {
private:
@ -226,14 +227,14 @@ public:
static __m256 vec_factorial_5 =
_mm256_set1_ps(0.00828929059f); // 1/factorial(5)
static __m256 vec_exp_log2ef =
(__m256)_mm256_set1_epi32(0x3fb8aa3b); // log2(e)
_mm256_castsi256_ps(_mm256_set1_epi32(0x3fb8aa3b)); // log2(e)
static __m256 vec_half = _mm256_set1_ps(0.5f);
static __m256 vec_one = _mm256_set1_ps(1.f);
static __m256 vec_zero = _mm256_set1_ps(0.f);
static __m256 vec_two = _mm256_set1_ps(2.f);
static __m256 vec_ln2f = (__m256)_mm256_set1_epi32(0x3f317218); // ln(2)
static __m256 vec_ln_flt_min = (__m256)_mm256_set1_epi32(0xc2aeac50);
static __m256 vec_ln_flt_max = (__m256)_mm256_set1_epi32(0x42b17218);
static __m256 vec_ln2f = _mm256_castsi256_ps(_mm256_set1_epi32(0x3f317218)); // ln(2)
static __m256 vec_ln_flt_min = _mm256_castsi256_ps(_mm256_set1_epi32(0xc2aeac50));
static __m256 vec_ln_flt_max = _mm256_castsi256_ps(_mm256_set1_epi32(0x42b17218));
static __m256i vec_127 = _mm256_set1_epi32(0x0000007f);
static int n_mantissa_bits = 23;
@ -266,7 +267,7 @@ public:
auto vec_exp_number_i = _mm256_cvtps_epi32(vec_exp_number);
auto vec_two_pow_n_i = _mm256_add_epi32(vec_exp_number_i, vec_127);
vec_two_pow_n_i = _mm256_slli_epi32(vec_two_pow_n_i, n_mantissa_bits);
auto vec_two_pow_n = (__m256)vec_two_pow_n_i;
auto vec_two_pow_n = _mm256_castsi256_ps(vec_two_pow_n_i);
vec_two_pow_n =
_mm256_blendv_ps(vec_two_pow_n, vec_zero, less_ln_flt_min_mask);

View File

@ -0,0 +1,93 @@
#pragma once
#include <ATen/cpu/vec/intrinsics.h>
#include <ATen/cpu/vec/vec_base.h>
#include <ATen/cpu/vec/vec_mask.h>
namespace at::vec {
inline namespace CPU_CAPABILITY {
#if defined(CPU_CAPABILITY_AVX2) && !defined(_MSC_VER)
template <typename T, typename mask_t>
struct VecMaskLoad<
T,
1,
mask_t,
1,
typename std::enable_if_t<
std::is_same_v<T, float> || std::is_same_v<T, int32_t> ||
std::is_same_v<T, uint32_t>,
void>> {
static inline VectorizedN<T, 1> apply(
const T* ptr,
const VecMask<mask_t, 1>& vec_mask) {
auto int_mask = vec_mask.template cast<int, 1>()[0];
if constexpr (std::is_same_v<T, float>) {
return Vectorized<T>(_mm256_maskload_ps(ptr, int_mask));
} else {
return Vectorized<T>(_mm256_maskload_epi32(ptr, int_mask));
}
}
};
// TODO: add specialization of VecMaskLoad for bfloat16/half and int8/uint8
template <>
struct VecMaskCast<float, 1, int, 1> {
static inline VecMask<float, 1> apply(const VecMask<int, 1>& vec_mask) {
return Vectorized<float>(_mm256_castsi256_ps(vec_mask[0]));
}
};
template <>
struct VecMaskCast<int, 1, float, 1> {
static inline VecMask<int, 1> apply(const VecMask<float, 1>& vec_mask) {
return Vectorized<int>(_mm256_castps_si256(vec_mask[0]));
}
};
template <typename dst_t>
struct VecMaskCast<dst_t, 1, int64_t, 2> {
static inline VecMask<dst_t, 1> apply(const VecMask<int64_t, 2>& vec_mask) {
auto int_vec = convert<int, 1, int64_t, 2>(VectorizedN<int64_t, 2>(vec_mask));
return VecMask<int, 1>(int_vec).cast<dst_t, 1>();
}
};
template <>
inline bool VecMask<int, 1>::all_zero() const {
return _mm256_testz_si256(mask_[0], mask_[0]);
}
template <>
inline bool VecMask<int, 1>::is_masked(int i) const {
return _mm256_movemask_ps(_mm256_castsi256_ps(mask_[0])) & (1 << i);
}
template <>
inline bool VecMask<int, 1>::all_masked() const {
int mask = _mm256_movemask_ps(_mm256_castsi256_ps(mask_[0]));
return mask == 0xff;
}
#define VEC_MASK_METHOD_WITH_CAST_TO_INT( \
T, N, return_type, method, args_def, args) \
template <> \
inline return_type VecMask<T, N>::method args_def const { \
return cast<int, 1>().method args; \
}
VEC_MASK_METHOD_WITH_CAST_TO_INT(float, 1, bool, all_zero, (), ())
VEC_MASK_METHOD_WITH_CAST_TO_INT(int64_t, 2, bool, all_zero, (), ())
VEC_MASK_METHOD_WITH_CAST_TO_INT(float, 1, bool, is_masked, (int i), (i))
VEC_MASK_METHOD_WITH_CAST_TO_INT(int64_t, 2, bool, is_masked, (int i), (i))
VEC_MASK_METHOD_WITH_CAST_TO_INT(float, 1, bool, all_masked, (), ())
VEC_MASK_METHOD_WITH_CAST_TO_INT(int64_t, 2, bool, all_masked, (), ())
#undef VEC_MASK_DEFINE_METHOD_WITH_CAST_TO_INT
#endif
} // namespace CPU_CAPABILITY
} // namespace at::vec

View File

@ -41,11 +41,17 @@
namespace at::vec {
inline namespace CPU_CAPABILITY {
#if defined(CPU_CAPABILITY_AVX2) && !defined(_MSC_VER)
#if defined(CPU_CAPABILITY_AVX2)
#ifdef _MSC_VER
__declspec(align(64)) struct Vectorizedqi {
protected:
__m256i vals;
#else
struct Vectorizedqi {
protected:
__m256i vals __attribute__((aligned(64)));
#endif
public:
Vectorizedqi() {}
@ -133,7 +139,7 @@ inline convert_float_to_int8(at::vec::Vectorized<float> src) {
}
template <typename T>
inline void __attribute__((always_inline)) QuantizeAvx2(
__FORCE_INLINE void QuantizeAvx2(
const float* src,
T* dst,
int len,
@ -1331,5 +1337,5 @@ Vectorized<c10::quint8> inline maximum(const Vectorized<c10::quint8>& a, const V
return a.maximum(b);
}
#endif // if defined(CPU_CAPABILITY_AVX2) && !defined(_MSC_VER)
#endif // if defined(CPU_CAPABILITY_AVX2)
}} // namespace at::vec::CPU_CAPABILITY

View File

@ -13,6 +13,8 @@
#include <ATen/cpu/vec/vec512/vec512_qint.h>
#include <ATen/cpu/vec/vec512/vec512_complex_float.h>
#include <ATen/cpu/vec/vec512/vec512_complex_double.h>
#include <ATen/cpu/vec/vec512/vec512_convert.h>
#include <ATen/cpu/vec/vec512/vec512_mask.h>
#include <algorithm>
#include <cstddef>
@ -55,7 +57,7 @@ std::ostream& operator<<(std::ostream& stream, const Vectorized<T>& vec) {
}
#if defined(CPU_CAPABILITY_AVX512) && !defined(_MSC_VER)
#if defined(CPU_CAPABILITY_AVX512)
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ CAST (AVX512) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@ -80,7 +82,8 @@ inline Vectorized<double> cast<double, int64_t>(const Vectorized<int64_t>& src)
}
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ GATHER ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#ifndef _MSC_VER
// MSVC is not working well on complex function overload.
template<int64_t scale = 1>
std::enable_if_t<scale == 1 || scale == 2 || scale == 4 || scale == 8, Vectorized<double>>
inline gather(const double* base_addr, const Vectorized<int64_t>& vindex) {
@ -92,9 +95,10 @@ std::enable_if_t<scale == 1 || scale == 2 || scale == 4 || scale == 8, Vectorize
inline gather(const float* base_addr, const Vectorized<int32_t>& vindex) {
return _mm512_i32gather_ps(vindex, base_addr, scale);
}
#endif
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ MASK GATHER ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#ifndef _MSC_VER
// MSVC is not working well on complex function overload.
template<int64_t scale = 1>
std::enable_if_t<scale == 1 || scale == 2 || scale == 4 || scale == 8, Vectorized<double>>
inline mask_gather(const Vectorized<double>& src, const double* base_addr,
@ -112,7 +116,7 @@ inline mask_gather(const Vectorized<float>& src, const float* base_addr,
auto mask_ = _mm512_cmp_ps_mask(all_ones, mask.values, _CMP_EQ_OQ);
return _mm512_mask_i32gather_ps(src, mask_, vindex, base_addr, scale);
}
#endif
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ CONVERT ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
template<>
@ -270,6 +274,6 @@ inline Vectorized<uint8_t> flip(const Vectorized<uint8_t> & v) {
return flip8(v);
}
#endif // defined(CPU_CAPABILITY_AVX512) && !defined(_MSC_VER)
#endif // defined(CPU_CAPABILITY_AVX512)
}}}

View File

@ -7,7 +7,8 @@
#include <ATen/cpu/vec/vec_base.h>
#include <c10/util/irange.h>
#if defined(CPU_CAPABILITY_AVX512) && !defined(_MSC_VER)
#if defined(CPU_CAPABILITY_AVX512)
#define SLEEF_STATIC_LIBS
#include <sleef.h>
#endif
@ -16,7 +17,18 @@ namespace vec {
// See Note [CPU_CAPABILITY namespace]
inline namespace CPU_CAPABILITY {
#if defined(CPU_CAPABILITY_AVX512) && !defined(_MSC_VER)
#if defined(CPU_CAPABILITY_AVX512)
#ifndef SLEEF_CONST
#if (defined(__GNUC__) || defined(__CLANG__)) && !defined(__INTEL_COMPILER)
#define SLEEF_CONST const
#else
#define SLEEF_CONST
#endif
#define SLEEF_CONST_OLD SLEEF_CONST
#else
#define SLEEF_CONST_OLD
#endif
// bfloat16 conversion
static inline void cvtbf16_fp32(const __m256i& a, __m512& o) {
@ -100,6 +112,11 @@ static inline void cvtfp16_fp32(const __m512i& a, __m512& o1, __m512& o2) {
cvtfp16_fp32(hi, o2);
}
static inline __m256i cvtfp32_fp16(const __m512& src) {
return _mm512_cvtps_ph(
src, (_MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC));
}
static inline __m512i cvtfp32_fp16(const __m512& a, const __m512& b) {
__m256i lo = _mm512_cvtps_ph(
a, (_MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC));
@ -362,7 +379,8 @@ public:
}
#pragma clang diagnostic push
#pragma clang diagnostic ignored "-Wignored-qualifiers"
Vectorized<T> map(const __m512 (*const vop)(__m512)) const {
Vectorized<T> map(SLEEF_CONST __m512 (*SLEEF_CONST_OLD vop)(__m512)) const {
__m512 lo, hi;
cvt_to_fp32<T>(values, lo, hi);
const auto o1 = vop(lo);
@ -1571,7 +1589,7 @@ inline Vectorized<type> convert_float_##name(const Vectorized<float>& a, const V
CONVERT_VECTORIZED_INIT(BFloat16, bfloat16);
CONVERT_VECTORIZED_INIT(Half, half);
#else //defined(CPU_CAPABILITY_AVX512) && !defined(_MSC_VER)
#else //defined(CPU_CAPABILITY_AVX512)
#define CONVERT_NON_VECTORIZED_INIT(type, name) \
inline std::tuple<Vectorized<float>, Vectorized<float>> convert_##name##_float(const Vectorized<type>& a) { \
@ -1601,9 +1619,9 @@ inline Vectorized<type> convert_float_##name(const Vectorized<float>& a, const V
CONVERT_NON_VECTORIZED_INIT(BFloat16, bfloat16);
CONVERT_NON_VECTORIZED_INIT(Half, half);
#endif // defined(CPU_CAPABILITY_AVX512) && !defined(_MSC_VER)
#endif // defined(CPU_CAPABILITY_AVX512)
#if defined(CPU_CAPABILITY_AVX512) && !defined(_MSC_VER)
#if defined(CPU_CAPABILITY_AVX512)
#define LOAD_FP32_VECTORIZED_INIT(type, name) \
inline void load_fp32_from_##name(const type *data, Vectorized<float>& out) { \
auto values = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(data)); \
@ -1622,7 +1640,7 @@ inline void load_fp32_from_##name(const type *data, Vectorized<float>& out1, Vec
LOAD_FP32_VECTORIZED_INIT(BFloat16, bf16);
LOAD_FP32_VECTORIZED_INIT(Half, fp16);
#else // defined(CPU_CAPABILITY_AVX512) && !defined(_MSC_VER)
#else // defined(CPU_CAPABILITY_AVX512)
#define LOAD_FP32_NON_VECTORIZED_INIT(type, name) \
inline void load_fp32_from_##name(const type *data, Vectorized<float>& out) { \
__at_align__ float values[Vectorized<float>::size()]; \

View File

@ -7,7 +7,8 @@
#include <c10/util/irange.h>
#include <ATen/cpu/vec/intrinsics.h>
#include <ATen/cpu/vec/vec_base.h>
#if defined(CPU_CAPABILITY_AVX512) && !defined(_MSC_VER)
#if defined(CPU_CAPABILITY_AVX512)
#define SLEEF_STATIC_LIBS
#include <sleef.h>
#endif
@ -16,7 +17,7 @@ namespace vec {
// See Note [CPU_CAPABILITY namespace]
inline namespace CPU_CAPABILITY {
#if defined(CPU_CAPABILITY_AVX512) && !defined(_MSC_VER)
#if defined(CPU_CAPABILITY_AVX512)
template <> class Vectorized<c10::complex<double>> {
private:
@ -203,7 +204,7 @@ public:
auto abs = abs_();
auto zero = _mm512_setzero_pd();
auto mask = _mm512_cmp_pd_mask(abs, zero, _CMP_EQ_OQ);
auto div = values / abs;
auto div = _mm512_div_pd(values, abs);
return _mm512_mask_blend_pd(mask, div, zero);
}
__m512d real_() const {

View File

@ -7,7 +7,8 @@
#include <c10/util/irange.h>
#include <ATen/cpu/vec/intrinsics.h>
#include <ATen/cpu/vec/vec_base.h>
#if defined(CPU_CAPABILITY_AVX512) && !defined(_MSC_VER)
#if defined(CPU_CAPABILITY_AVX512)
#define SLEEF_STATIC_LIBS
#include <sleef.h>
#endif
@ -16,7 +17,7 @@ namespace vec {
// See Note [CPU_CAPABILITY namespace]
inline namespace CPU_CAPABILITY {
#if defined(CPU_CAPABILITY_AVX512) && !defined(_MSC_VER)
#if defined(CPU_CAPABILITY_AVX512)
template <> class Vectorized<c10::complex<float>> {
private:
@ -708,7 +709,7 @@ public:
auto abs = abs_();
auto zero = _mm512_setzero_ps();
auto mask = _mm512_cmp_ps_mask(abs, zero, _CMP_EQ_OQ);
auto div = values / abs;
auto div = _mm512_div_ps(values, abs);
return _mm512_mask_blend_ps(mask, div, zero);
}
__m512 real_() const {

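Both complex specializations above replace `values / abs` with the explicit division intrinsic, since `operator/` on `__m512` / `__m512d` is a GCC/Clang vector extension that MSVC does not accept. A minimal standalone sketch of the single-precision pattern (assumes an AVX-512F build; not ATen code):

#include <immintrin.h>

// Divide lane-wise, then force lanes whose magnitude is zero back to zero,
// mirroring the blend in the method above.
static inline __m512 div_or_zero(__m512 values, __m512 abs) {
  const __m512 zero = _mm512_setzero_ps();
  const __mmask16 zero_mask = _mm512_cmp_ps_mask(abs, zero, _CMP_EQ_OQ);
  const __m512 div = _mm512_div_ps(values, abs);   // portable spelling of values / abs
  return _mm512_mask_blend_ps(zero_mask, div, zero);
}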
View File

@ -0,0 +1,139 @@
#pragma once
#include <ATen/cpu/vec/intrinsics.h>
#include <ATen/cpu/vec/vec512/vec512_bfloat16.h>
#include <ATen/cpu/vec/vec_base.h>
#include <ATen/cpu/vec/vec_convert.h>
namespace at::vec {
inline namespace CPU_CAPABILITY {
#if defined(CPU_CAPABILITY_AVX512) && !defined(_MSC_VER)
template <>
struct VecConvert<float, 1, BFloat16, 1> {
static inline VectorizedN<float, 1> apply(
const VectorizedN<BFloat16, 1>& src) {
VectorizedN<float, 1> result;
__m512 value;
cvtbf16_fp32(_mm512_castsi512_si256(src[0]), value);
result[0] = value;
return result;
}
};
template <>
struct VecConvert<float, 1, Half, 1> {
static inline VectorizedN<float, 1> apply(const VectorizedN<Half, 1>& src) {
VectorizedN<float, 1> result;
__m512 value;
cvtfp16_fp32(_mm512_castsi512_si256(src[0]), value);
result[0] = value;
return result;
}
};
template <>
struct VecConvert<BFloat16, 1, float, 1> {
static inline VectorizedN<BFloat16, 1> apply(
const VectorizedN<float, 1>& src) {
VectorizedN<BFloat16, 1> result;
result[0] = _mm512_castsi256_si512(cvtfp32_bf16(src[0]));
return result;
}
};
template <>
struct VecConvert<Half, 1, float, 1> {
static inline VectorizedN<Half, 1> apply(const VectorizedN<float, 1>& src) {
VectorizedN<Half, 1> result;
result[0] = _mm512_castsi256_si512(cvtfp32_fp16(src[0]));
return result;
}
};
template <>
struct VecConvert<float, 1, int64_t, 2> {
static inline VectorizedN<float, 1> apply(
const VectorizedN<int64_t, 2>& src) {
auto low = _mm512_cvtepi64_ps(src[0]);
auto high = _mm512_cvtepi64_ps(src[1]);
return Vectorized<float>(
_mm512_insertf32x8(_mm512_castps256_ps512(low), high, 1));
}
};
template <>
struct VecConvert<int64_t, 2, float, 1> {
static inline VectorizedN<int64_t, 2> apply(
const VectorizedN<float, 1>& src) {
at::vec::VectorizedN<int64_t, 2> result;
result[0] = _mm512_cvt_roundps_epi64(
_mm512_castps512_ps256(src[0]), _MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC);
result[1] = _mm512_cvt_roundps_epi64(
_mm512_extractf32x8_ps(src[0], 1),
_MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC);
return result;
}
};
template <>
struct VecConvert<int32_t, 1, int64_t, 2> {
static inline VectorizedN<int32_t, 1> apply(
const VectorizedN<int64_t, 2>& src) {
auto low = _mm512_cvtepi64_epi32(src[0]);
auto high = _mm512_cvtepi64_epi32(src[1]);
return Vectorized<int32_t>(
_mm512_inserti32x8(_mm512_castsi256_si512(low), high, 1));
}
};
template <>
struct VecConvert<int64_t, 2, int32_t, 1> {
static inline VectorizedN<int64_t, 2> apply(
const VectorizedN<int32_t, 1>& src) {
at::vec::VectorizedN<int64_t, 2> result;
result[0] = _mm512_cvtepi32_epi64(_mm512_castsi512_si256(src[0]));
result[1] = _mm512_cvtepi32_epi64(_mm512_extracti32x8_epi32(src[0], 1));
return result;
}
};
template <>
struct VecConvert<int32_t, 1, int8_t, 1> {
static inline VectorizedN<int32_t, 1> apply(
const VectorizedN<int8_t, 1>& src) {
auto src128 = _mm512_castsi512_si128(src[0]);
return Vectorized<int32_t>(_mm512_cvtepi8_epi32(src128));
}
};
template <>
struct VecConvert<int32_t, 1, uint8_t, 1> {
static inline VectorizedN<int32_t, 1> apply(
const VectorizedN<uint8_t, 1>& src) {
auto src128 = _mm512_castsi512_si128(src[0]);
return Vectorized<int32_t>(_mm512_cvtepu8_epi32(src128));
}
};
template <typename dst_t>
struct VecConvert<
dst_t,
1,
int64_t,
2,
typename std::enable_if<
std::is_same_v<dst_t, int8_t> ||
std::is_same_v<dst_t, uint8_t>>::type> {
static inline VectorizedN<dst_t, 1> apply(
const VectorizedN<int64_t, 2>& src) {
return VecConvert<dst_t, 1, int32_t, 1>::apply(
VecConvert<int32_t, 1, int64_t, 2>::apply(src));
}
};
#endif
} // namespace CPU_CAPABILITY
} // namespace at::vec

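A hypothetical usage sketch of the float/int64_t pair above (assumes an AVX-512 build of ATen): one 16-lane float register corresponds to two 8-lane int64_t registers, so the destination type is `VectorizedN<int64_t, 2>` rather than a single `Vectorized`. The helper name `float_to_int64` is illustrative only.

#include <ATen/cpu/vec/vec.h>

// Widen a full float vector to int64_t via the VecConvert<int64_t, 2, float, 1>
// specialization above (scalar fallback on non-AVX-512 builds).
void float_to_int64(const float* src, int64_t* dst) {
  using namespace at::vec;
  VectorizedN<float, 1> vf(Vectorized<float>::loadu(src));   // 16 float lanes
  VectorizedN<int64_t, 2> vi = convert<int64_t, 2>(vf);      // 2 x 8 int64_t lanes
  vi.store(dst, VectorizedN<int64_t, 2>::size());
}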
View File

@ -6,7 +6,8 @@
#include <ATen/cpu/vec/intrinsics.h>
#include <ATen/cpu/vec/vec_base.h>
#include <c10/util/irange.h>
#if (defined(CPU_CAPABILITY_AVX512)) && !defined(_MSC_VER)
#if (defined(CPU_CAPABILITY_AVX512))
#define SLEEF_STATIC_LIBS
#include <sleef.h>
#endif
@ -15,7 +16,7 @@ namespace vec {
// See Note [CPU_CAPABILITY namespace]
inline namespace CPU_CAPABILITY {
#if defined(CPU_CAPABILITY_AVX512) && !defined(_MSC_VER)
#if defined(CPU_CAPABILITY_AVX512)
template <> class Vectorized<double> {
private:

View File

@ -6,7 +6,8 @@
#include <ATen/cpu/vec/intrinsics.h>
#include <ATen/cpu/vec/vec_base.h>
#include <c10/util/irange.h>
#if defined(CPU_CAPABILITY_AVX512) && !defined(_MSC_VER)
#if defined(CPU_CAPABILITY_AVX512)
#define SLEEF_STATIC_LIBS
#include <sleef.h>
#endif
@ -15,7 +16,7 @@ namespace vec {
// See Note [CPU_CAPABILITY namespace]
inline namespace CPU_CAPABILITY {
#if defined(CPU_CAPABILITY_AVX512) && !defined(_MSC_VER)
#if defined(CPU_CAPABILITY_AVX512)
template <> class Vectorized<float> {
private:
@ -246,14 +247,14 @@ public:
static __m512 vec_factorial_5 =
_mm512_set1_ps(0.00828929059f); // 1/factorial(5)
static __m512 vec_exp_log2ef =
(__m512)_mm512_set1_epi32(0x3fb8aa3b); // log2(e)
_mm512_castsi512_ps(_mm512_set1_epi32(0x3fb8aa3b)); // log2(e)
static __m512 vec_half = _mm512_set1_ps(0.5f);
static __m512 vec_one = _mm512_set1_ps(1.f);
static __m512 vec_zero = _mm512_set1_ps(0.f);
static __m512 vec_two = _mm512_set1_ps(2.f);
static __m512 vec_ln2f = (__m512)_mm512_set1_epi32(0x3f317218); // ln(2)
static __m512 vec_ln_flt_min = (__m512)_mm512_set1_epi32(0xc2aeac50);
static __m512 vec_ln_flt_max = (__m512)_mm512_set1_epi32(0x42b17218);
static __m512 vec_ln2f = _mm512_castsi512_ps(_mm512_set1_epi32(0x3f317218)); // ln(2)
static __m512 vec_ln_flt_min = _mm512_castsi512_ps(_mm512_set1_epi32(0xc2aeac50));
static __m512 vec_ln_flt_max = _mm512_castsi512_ps(_mm512_set1_epi32(0x42b17218));
static __m512i vec_127 = _mm512_set1_epi32(0x0000007f);
static int n_mantissa_bits = 23;
@ -288,7 +289,7 @@ public:
auto vec_exp_number_i = _mm512_cvtps_epi32(vec_exp_number);
auto vec_two_pow_n_i = _mm512_add_epi32(vec_exp_number_i, vec_127);
vec_two_pow_n_i = _mm512_slli_epi32(vec_two_pow_n_i, n_mantissa_bits);
auto vec_two_pow_n = (__m512)vec_two_pow_n_i;
auto vec_two_pow_n = _mm512_castsi512_ps(vec_two_pow_n_i);
vec_two_pow_n =
_mm512_mask_blend_ps(less_ln_flt_min_mask, vec_two_pow_n, vec_zero);

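The constants above drop the C-style `(__m512)` casts of integer vectors in favour of `_mm512_castsi512_ps`, which reinterprets the same 512 bits as floats, compiles to no instruction, and is accepted by MSVC (the vector-type cast is a GCC/Clang extension). A standalone illustration:

#include <immintrin.h>

// 0x3f317218 is the IEEE-754 bit pattern of ln(2) as a float; the cast intrinsic
// reinterprets the broadcast integer lanes as floats without any value conversion.
static inline __m512 ln2_vector() {
  return _mm512_castsi512_ps(_mm512_set1_epi32(0x3f317218));
}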
View File

@ -1069,7 +1069,7 @@ Vectorized<int8_t> inline maximum(const Vectorized<int8_t>& a, const Vectorized<
template <>
Vectorized<uint8_t> inline maximum(const Vectorized<uint8_t>& a, const Vectorized<uint8_t>& b) {
return _mm512_max_epi8(a, b);
return _mm512_max_epu8(a, b);
}
template <>

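The one-character family change above matters because `_mm512_max_epi8` compares lanes as signed bytes, so any uint8_t value of 0x80 or above loses to a small positive value, while `_mm512_max_epu8` compares them as unsigned. A tiny standalone check (requires AVX-512BW, e.g. -mavx512bw; not ATen code):

#include <immintrin.h>
#include <cstdint>
#include <cstdio>

int main() {
  __m512i a = _mm512_set1_epi8((char)0xFF);  // 255 as uint8_t, -1 as int8_t
  __m512i b = _mm512_set1_epi8(1);
  alignas(64) uint8_t s[64], u[64];
  _mm512_store_si512(reinterpret_cast<__m512i*>(s), _mm512_max_epi8(a, b));
  _mm512_store_si512(reinterpret_cast<__m512i*>(u), _mm512_max_epu8(a, b));
  std::printf("signed max: %u, unsigned max: %u\n", (unsigned)s[0], (unsigned)u[0]);  // 1 vs 255
  return 0;
}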
View File

@ -0,0 +1,155 @@
#pragma once
#include <ATen/cpu/vec/intrinsics.h>
#include <ATen/cpu/vec/vec_base.h>
#include <ATen/cpu/vec/vec_mask.h>
namespace at::vec {
inline namespace CPU_CAPABILITY {
#if defined(CPU_CAPABILITY_AVX512) && !defined(_MSC_VER)
template <typename T, typename mask_t>
struct VecMaskLoad<
T,
1,
mask_t,
1,
typename std::enable_if_t<
std::is_same_v<T, float> || std::is_same_v<T, int32_t> ||
std::is_same_v<T, uint32_t>,
void>> {
static inline VectorizedN<T, 1> apply(
const T* ptr,
const VecMask<mask_t, 1>& vec_mask) {
at::vec::Vectorized<T> zero_vec(0);
auto all_ones = _mm512_set1_epi32(0xFFFFFFFF);
auto int_mask = vec_mask.template cast<int, 1>()[0];
auto mmask = _mm512_cmp_epi32_mask(int_mask, all_ones, _MM_CMPINT_EQ);
if constexpr (std::is_same_v<T, float>) {
return Vectorized<T>(_mm512_mask_loadu_ps(zero_vec, mmask, ptr));
} else {
return Vectorized<T>(_mm512_mask_loadu_epi32(zero_vec, mmask, ptr));
}
}
};
template <typename data_t, typename mask_t>
struct VecMaskLoad<
data_t,
1,
mask_t,
1,
typename std::enable_if<
std::is_same_v<data_t, BFloat16> ||
std::is_same_v<data_t, Half>>::type> {
static inline VectorizedN<data_t, 1> apply(
const data_t* ptr,
const VecMask<mask_t, 1>& vec_mask) {
auto all_ones = _mm512_set1_epi32(0xFFFFFFFF);
auto int_mask = vec_mask.template cast<int, 1>()[0];
auto mmask = _mm512_cmp_epi32_mask(int_mask, all_ones, _MM_CMPINT_EQ);
auto zero = _mm256_set1_epi16(0);
auto temp = _mm256_mask_loadu_epi16(zero, mmask, ptr);
return Vectorized<data_t>(
_mm512_inserti32x8(_mm512_castsi256_si512(temp), zero, 1));
}
};
template <typename data_t, typename mask_t>
struct VecMaskLoad<
data_t,
1,
mask_t,
1,
typename std::enable_if<
std::is_same_v<data_t, int8_t> ||
std::is_same_v<data_t, uint8_t>>::type> {
static inline VectorizedN<data_t, 1> apply(
const data_t* ptr,
const VecMask<mask_t, 1>& vec_mask) {
auto all_ones = _mm512_set1_epi32(0xFFFFFFFF);
auto int_mask = vec_mask.template cast<int, 1>()[0];
auto mmask = _mm512_cmp_epi32_mask(int_mask, all_ones, _MM_CMPINT_EQ);
auto zero = _mm_set1_epi8(0);
auto temp = _mm_mask_loadu_epi8(zero, mmask, ptr);
return Vectorized<data_t>(
_mm512_inserti64x2(_mm512_set1_epi32(0), temp, 0));
}
};
template <typename mask_t>
struct VecMaskLoad<int64_t, 2, mask_t, 1> {
static inline VectorizedN<int64_t, 2> apply(
const int64_t* ptr,
const VecMask<mask_t, 1>& vec_mask) {
auto all_ones = _mm512_set1_epi32(0xFFFFFFFF);
auto zero = _mm512_set1_epi64(0);
auto int_mask = vec_mask.template cast<int, 1>()[0];
auto mmask = _mm512_cmp_epi32_mask(int_mask, all_ones, _MM_CMPINT_EQ);
at::vec::VectorizedN<int64_t, 2> result;
result[0] = _mm512_mask_loadu_epi64(zero, (__mmask8)mmask, ptr);
result[1] = _mm512_mask_loadu_epi64(zero, (__mmask8)(mmask >> 8), ptr + 8);
return result;
}
};
template <>
struct VecMaskCast<float, 1, int, 1> {
static inline VecMask<float, 1> apply(const VecMask<int, 1>& vec_mask) {
return Vectorized<float>(_mm512_castsi512_ps(vec_mask[0]));
}
};
template <>
struct VecMaskCast<int, 1, float, 1> {
static inline VecMask<int, 1> apply(const VecMask<float, 1>& vec_mask) {
return Vectorized<int>(_mm512_castps_si512(vec_mask[0]));
}
};
template <typename dst_t>
struct VecMaskCast<dst_t, 1, int64_t, 2> {
static inline VecMask<dst_t, 1> apply(const VecMask<int64_t, 2>& vec_mask) {
auto int_vec = convert<int, 1, int64_t, 2>(VectorizedN<int64_t, 2>(vec_mask));
return VecMask<int, 1>(int_vec).cast<dst_t, 1>();
}
};
template <>
inline bool VecMask<int, 1>::all_zero() const {
__mmask16 mask = _mm512_test_epi32_mask(mask_[0], mask_[0]);
return mask == 0;
}
template <>
inline bool VecMask<int, 1>::is_masked(int i) const {
return _mm512_movepi32_mask(mask_[0]) & (1 << i);
}
template <>
inline bool VecMask<int, 1>::all_masked() const {
__mmask16 mask = _mm512_movepi32_mask(mask_[0]);
return mask == 0xffff;
}
#define VEC_MASK_METHOD_WITH_CAST_TO_INT( \
T, N, return_type, method, args_def, args) \
template <> \
inline return_type VecMask<T, N>::method args_def const { \
return cast<int, 1>().method args; \
}
VEC_MASK_METHOD_WITH_CAST_TO_INT(float, 1, bool, all_zero, (), ())
VEC_MASK_METHOD_WITH_CAST_TO_INT(int64_t, 2, bool, all_zero, (), ())
VEC_MASK_METHOD_WITH_CAST_TO_INT(float, 1, bool, is_masked, (int i), (i))
VEC_MASK_METHOD_WITH_CAST_TO_INT(int64_t, 2, bool, is_masked, (int i), (i))
VEC_MASK_METHOD_WITH_CAST_TO_INT(float, 1, bool, all_masked, (), ())
VEC_MASK_METHOD_WITH_CAST_TO_INT(int64_t, 2, bool, all_masked, (), ())
#undef VEC_MASK_METHOD_WITH_CAST_TO_INT
#endif
} // namespace CPU_CAPABILITY
} // namespace at::vec

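A standalone illustration (AVX-512F; not ATen code) of the masked-load building block the specializations above rely on: `_mm512_mask_loadu_ps` reads only the lanes whose mask bit is set and takes the remaining lanes from the pass-through operand, which is what lets `VecMaskLoad` touch a partial row without over-reading.

#include <immintrin.h>

// Load the first n floats (0 <= n <= 16); lanes beyond n come from the zero vector.
static inline __m512 load_first_n(const float* ptr, int n) {
  const __mmask16 m =
      (n >= 16) ? (__mmask16)0xFFFF : (__mmask16)((1u << n) - 1);
  return _mm512_mask_loadu_ps(_mm512_setzero_ps(), m, ptr);
}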
View File

@ -42,11 +42,17 @@ namespace at {
namespace vec {
inline namespace CPU_CAPABILITY {
#if defined(CPU_CAPABILITY_AVX512) && !defined(_MSC_VER)
#if defined(CPU_CAPABILITY_AVX512)
#ifdef _MSC_VER
__declspec(align(64)) struct Vectorizedqi {
protected:
__m512i vals;
#else
struct Vectorizedqi {
protected:
__m512i vals __attribute__((aligned(64)));
#endif
public:
Vectorizedqi() {}
@ -136,7 +142,7 @@ inline convert_float_to_int8(at::vec::Vectorized<float> src) {
}
template <typename T>
inline void __attribute__((always_inline)) QuantizeAvx512(
__FORCE_INLINE void QuantizeAvx512(
const float* src,
T* dst,
int len,
@ -525,10 +531,17 @@ struct Vectorized<c10::qint8> : public Vectorizedqi {
Vectorized<float> scale,
Vectorized<float> zero_point,
Vectorized<float> scale_neg_zp_premul) const {
#if defined(_MSC_VER) && !defined(__clang__)
__m128i int_val0 = _mm_set_epi64x(vals.m512i_u64[1], vals.m512i_u64[0]);
__m128i int_val1 = _mm_set_epi64x(vals.m512i_u64[3], vals.m512i_u64[2]);
__m128i int_val2 = _mm_set_epi64x(vals.m512i_u64[5], vals.m512i_u64[4]);
__m128i int_val3 = _mm_set_epi64x(vals.m512i_u64[7], vals.m512i_u64[6]);
#else
__m128i int_val0 = _mm_set_epi64x(vals[1], vals[0]);
__m128i int_val1 = _mm_set_epi64x(vals[3], vals[2]);
__m128i int_val2 = _mm_set_epi64x(vals[5], vals[4]);
__m128i int_val3 = _mm_set_epi64x(vals[7], vals[6]);
#endif
__m512 float_val0 = _mm512_cvtepi32_ps(cvtepi8_epi32(int_val0));
__m512 float_val1 = _mm512_cvtepi32_ps(cvtepi8_epi32(int_val1));
@ -549,10 +562,17 @@ struct Vectorized<c10::qint8> : public Vectorizedqi {
float_vec_return_type dequantize(
Vectorized<float> scale,
Vectorized<float> zero_point) const {
#if defined(_MSC_VER) && !defined(__clang__)
__m128i int_val0 = _mm_set_epi64x(vals.m512i_u64[1], vals.m512i_u64[0]);
__m128i int_val1 = _mm_set_epi64x(vals.m512i_u64[3], vals.m512i_u64[2]);
__m128i int_val2 = _mm_set_epi64x(vals.m512i_u64[5], vals.m512i_u64[4]);
__m128i int_val3 = _mm_set_epi64x(vals.m512i_u64[7], vals.m512i_u64[6]);
#else
__m128i int_val0 = _mm_set_epi64x(vals[1], vals[0]);
__m128i int_val1 = _mm_set_epi64x(vals[3], vals[2]);
__m128i int_val2 = _mm_set_epi64x(vals[5], vals[4]);
__m128i int_val3 = _mm_set_epi64x(vals[7], vals[6]);
#endif
__m512 float_val0 = _mm512_cvtepi32_ps(cvtepi8_epi32(int_val0));
__m512 float_val1 = _mm512_cvtepi32_ps(cvtepi8_epi32(int_val1));
@ -598,20 +618,34 @@ struct Vectorized<c10::qint8> : public Vectorizedqi {
}
int_vec_return_type widening_subtract(Vectorized<c10::qint8> b) const {
#if defined(_MSC_VER) && !defined(__clang__)
__m128i int_val0 = _mm_set_epi64x(vals.m512i_u64[1], vals.m512i_u64[0]);
__m128i int_val1 = _mm_set_epi64x(vals.m512i_u64[3], vals.m512i_u64[2]);
__m128i int_val2 = _mm_set_epi64x(vals.m512i_u64[5], vals.m512i_u64[4]);
__m128i int_val3 = _mm_set_epi64x(vals.m512i_u64[7], vals.m512i_u64[6]);
#else
__m128i int_val0 = _mm_set_epi64x(vals[1], vals[0]);
__m128i int_val1 = _mm_set_epi64x(vals[3], vals[2]);
__m128i int_val2 = _mm_set_epi64x(vals[5], vals[4]);
__m128i int_val3 = _mm_set_epi64x(vals[7], vals[6]);
#endif
__m512i int32_val0 = cvtepi8_epi32(int_val0);
__m512i int32_val1 = cvtepi8_epi32(int_val1);
__m512i int32_val2 = cvtepi8_epi32(int_val2);
__m512i int32_val3 = cvtepi8_epi32(int_val3);
#if defined(_MSC_VER) && !defined(__clang__)
__m128i int_b0 = _mm_set_epi64x(b.vals.m512i_u64[1], b.vals.m512i_u64[0]);
__m128i int_b1 = _mm_set_epi64x(b.vals.m512i_u64[3], b.vals.m512i_u64[2]);
__m128i int_b2 = _mm_set_epi64x(b.vals.m512i_u64[5], b.vals.m512i_u64[4]);
__m128i int_b3 = _mm_set_epi64x(b.vals.m512i_u64[7], b.vals.m512i_u64[6]);
#else
__m128i int_b0 = _mm_set_epi64x(b.vals[1], b.vals[0]);
__m128i int_b1 = _mm_set_epi64x(b.vals[3], b.vals[2]);
__m128i int_b2 = _mm_set_epi64x(b.vals[5], b.vals[4]);
__m128i int_b3 = _mm_set_epi64x(b.vals[7], b.vals[6]);
#endif
__m512i int32_b0 = cvtepi8_epi32(int_b0);
__m512i int32_b1 = cvtepi8_epi32(int_b1);
@ -721,10 +755,17 @@ struct Vectorized<c10::quint8> : public Vectorizedqi {
Vectorized<float> scale,
Vectorized<float> zero_point,
Vectorized<float> scale_zp_premul) const {
#if defined(_MSC_VER) && !defined(__clang__)
__m128i int_val0 = _mm_set_epi64x(vals.m512i_u64[1], vals.m512i_u64[0]);
__m128i int_val1 = _mm_set_epi64x(vals.m512i_u64[3], vals.m512i_u64[2]);
__m128i int_val2 = _mm_set_epi64x(vals.m512i_u64[5], vals.m512i_u64[4]);
__m128i int_val3 = _mm_set_epi64x(vals.m512i_u64[7], vals.m512i_u64[6]);
#else
__m128i int_val0 = _mm_set_epi64x(vals[1], vals[0]);
__m128i int_val1 = _mm_set_epi64x(vals[3], vals[2]);
__m128i int_val2 = _mm_set_epi64x(vals[5], vals[4]);
__m128i int_val3 = _mm_set_epi64x(vals[7], vals[6]);
#endif
__m512 float_val0 = _mm512_cvtepi32_ps(cvtepu8_epi32(int_val0));
__m512 float_val1 = _mm512_cvtepi32_ps(cvtepu8_epi32(int_val1));
@ -746,10 +787,17 @@ struct Vectorized<c10::quint8> : public Vectorizedqi {
float_vec_return_type dequantize(
Vectorized<float> scale,
Vectorized<float> zero_point) const {
#if defined(_MSC_VER) && !defined(__clang__)
__m128i int_val0 = _mm_set_epi64x(vals.m512i_u64[1], vals.m512i_u64[0]);
__m128i int_val1 = _mm_set_epi64x(vals.m512i_u64[3], vals.m512i_u64[2]);
__m128i int_val2 = _mm_set_epi64x(vals.m512i_u64[5], vals.m512i_u64[4]);
__m128i int_val3 = _mm_set_epi64x(vals.m512i_u64[7], vals.m512i_u64[6]);
#else
__m128i int_val0 = _mm_set_epi64x(vals[1], vals[0]);
__m128i int_val1 = _mm_set_epi64x(vals[3], vals[2]);
__m128i int_val2 = _mm_set_epi64x(vals[5], vals[4]);
__m128i int_val3 = _mm_set_epi64x(vals[7], vals[6]);
#endif
__m512 float_val0 = _mm512_cvtepi32_ps(cvtepu8_epi32(int_val0));
__m512 float_val1 = _mm512_cvtepi32_ps(cvtepu8_epi32(int_val1));
@ -796,20 +844,34 @@ struct Vectorized<c10::quint8> : public Vectorizedqi {
}
int_vec_return_type widening_subtract(Vectorized<c10::quint8> b) const {
#if defined(_MSC_VER) && !defined(__clang__)
__m128i int_val0 = _mm_set_epi64x(vals.m512i_u64[1], vals.m512i_u64[0]);
__m128i int_val1 = _mm_set_epi64x(vals.m512i_u64[3], vals.m512i_u64[2]);
__m128i int_val2 = _mm_set_epi64x(vals.m512i_u64[5], vals.m512i_u64[4]);
__m128i int_val3 = _mm_set_epi64x(vals.m512i_u64[7], vals.m512i_u64[6]);
#else
__m128i int_val0 = _mm_set_epi64x(vals[1], vals[0]);
__m128i int_val1 = _mm_set_epi64x(vals[3], vals[2]);
__m128i int_val2 = _mm_set_epi64x(vals[5], vals[4]);
__m128i int_val3 = _mm_set_epi64x(vals[7], vals[6]);
#endif
__m512i int32_val0 = cvtepu8_epi32(int_val0);
__m512i int32_val1 = cvtepu8_epi32(int_val1);
__m512i int32_val2 = cvtepu8_epi32(int_val2);
__m512i int32_val3 = cvtepu8_epi32(int_val3);
#if defined(_MSC_VER) && !defined(__clang__)
__m128i int_b0 = _mm_set_epi64x(b.vals.m512i_u64[1], b.vals.m512i_u64[0]);
__m128i int_b1 = _mm_set_epi64x(b.vals.m512i_u64[3], b.vals.m512i_u64[2]);
__m128i int_b2 = _mm_set_epi64x(b.vals.m512i_u64[5], b.vals.m512i_u64[4]);
__m128i int_b3 = _mm_set_epi64x(b.vals.m512i_u64[7], b.vals.m512i_u64[6]);
#else
__m128i int_b0 = _mm_set_epi64x(b.vals[1], b.vals[0]);
__m128i int_b1 = _mm_set_epi64x(b.vals[3], b.vals[2]);
__m128i int_b2 = _mm_set_epi64x(b.vals[5], b.vals[4]);
__m128i int_b3 = _mm_set_epi64x(b.vals[7], b.vals[6]);
#endif
__m512i int32_b0 = cvtepu8_epi32(int_b0);
__m512i int32_b1 = cvtepu8_epi32(int_b1);

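The `_MSC_VER` branches above exist because subscripting a `__m512i` directly (`vals[1]`, `b.vals[3]`, ...) is a GCC/Clang vector extension; MSVC instead exposes union members such as `m512i_u64`. A minimal standalone helper showing the same split:

#include <immintrin.h>
#include <cstdint>

// Read one 64-bit lane of a __m512i on either compiler family.
static inline uint64_t lane_u64(const __m512i& v, int i) {
#if defined(_MSC_VER) && !defined(__clang__)
  return v.m512i_u64[i];                 // MSVC union member access
#else
  return static_cast<uint64_t>(v[i]);    // GCC/Clang vector-extension subscript
#endif
}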
View File

@ -36,6 +36,12 @@
#include <c10/util/irange.h>
#include <c10/util/Load.h>
#if defined(__GNUC__)
#define __FORCE_INLINE __attribute__((always_inline)) inline
#elif defined(_MSC_VER)
#define __FORCE_INLINE __forceinline
#endif
// These macros helped us unify vec_base.h
#ifdef CPU_CAPABILITY_AVX512
#if defined(__GNUC__)
@ -228,6 +234,11 @@ public:
std::memcpy(vector.values, ptr, count * sizeof(T));
return vector;
}
static Vectorized<T> loadu_one_fourth(const void* ptr) {
static_assert(std::is_same_v<T, signed char> || std::is_same_v<T, unsigned char>, "For byte types only");
return Vectorized::loadu(ptr, 8);
}
void store(void* ptr, int count = size()) const {
std::memcpy(ptr, values, count * sizeof(T));
}
@ -835,8 +846,8 @@ inline Vectorized<T> operator^(const Vectorized<T>& a, const Vectorized<T>& b) {
template<class T, typename std::enable_if_t<!std::is_base_of<Vectorizedi, Vectorized<T>>::value, int> = 0>
inline Vectorized<T> operator~(const Vectorized<T>& a) {
Vectorized<T> ones; // All bits are 1
memset((T*) ones, 0xFF, VECTOR_WIDTH);
using int_t = int_same_size_t<T>;
Vectorized<T> ones(c10::bit_cast<T>((int_t)(~(int_t)0))); // All bits are 1
return a ^ ones;
}
@ -1106,3 +1117,8 @@ inline void transpose_mxn(const T* src, int64_t ld_src, T* dst, int64_t ld_dst)
}
}} // namespace at::vec::CPU_CAPABILITY
// additional headers for more operations that depend on vec_base
#include <ATen/cpu/vec/vec_n.h>
#include <ATen/cpu/vec/vec_mask.h>
#include <ATen/cpu/vec/vec_convert.h>

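The rewritten `operator~` above builds its all-ones constant by broadcasting a single scalar whose bits are all set (via `c10::bit_cast` on the same-sized integer type) instead of `memset`-ing a vector buffer. A plain C++ sketch of that scalar construction, using `std::memcpy` where the header uses `c10::bit_cast`:

#include <cstdint>
#include <cstring>

// Produce a T whose bit pattern is all ones; for float this is a NaN pattern,
// which is fine because it is only used as an XOR mask.
template <typename T, typename IntT>
T all_ones_scalar() {
  static_assert(sizeof(T) == sizeof(IntT), "IntT must match the width of T");
  IntT bits = ~IntT(0);
  T out;
  std::memcpy(&out, &bits, sizeof(T));
  return out;
}
// e.g. all_ones_scalar<float, uint32_t>() has bit pattern 0xFFFFFFFF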
View File

@ -0,0 +1,56 @@
#pragma once
#include <ATen/cpu/vec/vec_base.h>
#include <ATen/cpu/vec/vec_n.h>
namespace at::vec {
inline namespace CPU_CAPABILITY {
template <
typename dst_t,
int dst_n,
typename src_t,
int src_n,
typename Enabled = void>
struct VecConvert {
static inline VectorizedN<dst_t, dst_n> apply(
const VectorizedN<src_t, src_n>& src) {
constexpr int count = std::min(
VectorizedN<src_t, src_n>::size(), VectorizedN<dst_t, dst_n>::size());
__at_align__ src_t src_buf[VectorizedN<src_t, src_n>::size()];
src.store(src_buf);
__at_align__ dst_t dst_buf[VectorizedN<dst_t, dst_n>::size()];
for (int i = 0; i < count; i++) {
dst_buf[i] = static_cast<dst_t>(src_buf[i]);
}
return VectorizedN<dst_t, dst_n>::loadu(dst_buf, count);
}
};
template <typename dst_t, typename src_t>
inline Vectorized<dst_t> convert(const Vectorized<src_t>& src) {
return VecConvert<dst_t, 1, src_t, 1>::apply(src);
}
template <
typename dst_t,
int dst_n,
typename src_t,
int src_n,
std::enable_if_t<dst_n != 1, int> = 0>
inline VectorizedN<dst_t, dst_n> convert(const VectorizedN<src_t, src_n>& src) {
return VecConvert<dst_t, dst_n, src_t, src_n>::apply(src);
}
template <
typename dst_t,
int dst_n,
typename src_t,
int src_n,
std::enable_if_t<dst_n == 1, int> = 0>
inline Vectorized<dst_t> convert(const VectorizedN<src_t, src_n>& src) {
return VecConvert<dst_t, dst_n, src_t, src_n>::apply(src);
}
} // namespace CPU_CAPABILITY
} // namespace at::vec

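A hypothetical usage sketch of the generic helper above: on AVX-512 builds `convert<float>` resolves to the `VecConvert<float, 1, BFloat16, 1>` specialization from vec512_convert.h shown earlier, and elsewhere it falls back to the scalar store/cast/load loop of the primary template. The helper name `bf16_to_fp32_head` is illustrative only.

#include <ATen/cpu/vec/vec.h>

// Convert the first Vectorized<float>::size() bf16 lanes to floats.
void bf16_to_fp32_head(const at::BFloat16* src, float* dst) {
  auto v_bf16 = at::vec::Vectorized<at::BFloat16>::loadu(src);
  at::vec::Vectorized<float> v_f32 = at::vec::convert<float>(v_bf16);
  v_f32.store(dst);
}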
View File

@ -0,0 +1,248 @@
#pragma once
#include <ATen/cpu/vec/vec_base.h>
#include <ATen/cpu/vec/vec_n.h>
namespace at::vec {
inline namespace CPU_CAPABILITY {
/**
* The `VecMask` class provides a convenient interface for working with
* vectorized masks in SIMD operations. It encapsulates a `VectorizedN<T, N>`
* mask that can be used directly in masked vectorized operations. It provides
* various methods for manipulating and accessing the mask elements:
* 1. `from` and `to`: Conversion between a vector of boolean values and a
* vectorized mask.
* 2. `cast`: Casts the mask to a different base type.
* 3. `all_zero`: Checks if all mask elements are zero.
* 4. `is_masked`: Checks if a specific element is masked.
* 5. `loadu`: Loads data from memory using the mask.
* 6. `all_masked`: Checks if all mask elements are masked.
*
* Some helper template classes are provided to simplify the specialization of
* the `VecMask` for the specific CPU arch:
* 1. `VecMaskLoad`: Loads data from memory using the mask.
* 2. `VecMaskTo`: Converts the mask to boolean.
* 3. `VecMaskCast`: Casts the mask to a different base type.
*
*/
template <typename T, int N>
class VecMask;
template <
typename data_t,
int data_n,
typename mask_t,
int mask_n,
typename Enabled = void>
struct VecMaskLoad {
static inline VectorizedN<data_t, data_n> apply(
const data_t* ptr,
const VecMask<mask_t, mask_n>& vec_mask) {
constexpr typename VecMask<mask_t, mask_n>::size_type size =
VecMask<mask_t, mask_n>::size();
static_assert(VectorizedN<data_t, data_n>::size() >= size);
__at_align__ data_t data[size];
__at_align__ mask_t mask[size];
auto mask_ = VectorizedN<mask_t, mask_n>(vec_mask);
mask_.store(mask);
for (int i = 0; i < size; i++) {
data[i] = mask[i] ? ptr[i] : static_cast<data_t>(0);
}
return VectorizedN<data_t, data_n>::loadu(data, size);
}
};
template <
typename dst_t,
int dst_n,
typename src_t,
int src_n,
typename Enabled = void>
struct VecMaskTo {
static inline VecMask<dst_t, dst_n> apply(
const VecMask<src_t, src_n>& vec_mask) {
auto zeros = VectorizedN<dst_t, dst_n>(static_cast<dst_t>(0));
auto ones = VectorizedN<dst_t, dst_n>(static_cast<dst_t>(1));
return VectorizedN<dst_t, dst_n>::blendv(
zeros, ones, vec_mask.template cast<dst_t, dst_n>());
}
};
template <typename dst_t, int dst_n, typename src_t, int src_n>
struct VecMaskCast {
static inline VecMask<dst_t, dst_n> apply(
const VecMask<src_t, src_n>& vec_mask) {
return VecMask<dst_t, dst_n>::from(VectorizedN<src_t, src_n>(vec_mask));
}
};
template <typename T, int N>
struct VecMaskCast<T, N, T, N> {
static inline VecMask<T, N> apply(const VecMask<T, N>& vec_mask) {
return vec_mask;
}
};
template <typename T, int N>
class VecMask {
public:
using size_type = int;
static constexpr size_type size() {
return VectorizedN<T, N>::size();
}
private:
VectorizedN<T, N> mask_;
public:
VecMask() : mask_(static_cast<T>(0)) {}
VecMask(const VectorizedN<T, N>& mask) : mask_(mask) {}
template <int L = N, typename std::enable_if_t<L == 1, int> = 0>
VecMask(const Vectorized<T>& mask) : mask_(mask) {}
template <typename U, int L>
static VecMask<T, N> from(const VectorizedN<U, L>& b_vec) {
__at_align__ U b_buf[size()];
if constexpr (size() >= VectorizedN<U, L>::size()) {
b_vec.store(b_buf);
for (int i = VectorizedN<U, L>::size(); i < size(); i++) {
b_buf[i] = static_cast<U>(0);
}
} else {
b_vec.store(b_buf, size());
}
return from(b_buf);
}
template <typename U>
static VecMask<T, N> from(U b) {
using int_t = int_same_size_t<T>;
T mask = b ? c10::bit_cast<T>((int_t)(~(int_t)0)) : (T)0;
return VectorizedN<T, N>(mask);
}
template <typename U>
static VecMask<T, N> from(U* b) {
using int_t = int_same_size_t<T>;
__at_align__ T mask[size()];
#pragma unroll
for (int i = 0; i < size(); i++) {
*(int_t*)(mask + i) = b[i] ? ~(int_t)0 : (int_t)0;
}
return VectorizedN<T, N>(VectorizedN<T, N>::loadu(mask));
}
template <typename U, int L, std::enable_if_t<L >= 2, int> = 0>
inline VectorizedN<U, L> to() const {
return VecMaskTo<U, L, T, N>::apply(*this);
}
template <typename U, int L, std::enable_if_t<L == 1, int> = 0>
inline Vectorized<U> to() const {
return VecMaskTo<U, L, T, N>::apply(*this);
}
template <typename U, int L>
inline VecMask<U, L> cast() const {
return VecMaskCast<U, L, T, N>::apply(*this);
}
inline bool all_zero() const {
__at_align__ T mask[size()];
mask_.store(mask);
return std::all_of(
mask, mask + size(), [](T m) { return m == static_cast<T>(0); });
}
inline bool all_masked() const {
__at_align__ T mask[size()];
mask_.store(mask);
return std::all_of(
mask, mask + size(), [](T m) { return m != static_cast<T>(0); });
}
inline bool is_masked(int i) const {
__at_align__ T mask[size()];
mask_.store(mask);
return mask[i] != static_cast<T>(0);
}
inline operator VectorizedN<T, N>() const {
return mask_;
}
template <int L = N, typename std::enable_if_t<L == 1, int> = 0>
inline operator Vectorized<T>() const {
return mask_[0];
}
inline Vectorized<T> operator[](int i) const {
return mask_[i];
}
template <
typename U,
int L,
std::enable_if_t<L >= 2 && VectorizedN<U, L>::size() >= size(), int> = 0>
VectorizedN<U, L> loadu(const U* ptr) const {
return VecMaskLoad<U, L, T, N>::apply(ptr, *this);
}
template <
typename U,
int L,
std::enable_if_t<L == 1 && Vectorized<U>::size() >= size(), int> = 0>
Vectorized<U> loadu(const U* ptr) const {
return VecMaskLoad<U, L, T, N>::apply(ptr, *this);
}
};
#define VEC_MASK_DEFINE_UNARY_OP_GLOBAL(op) \
template <typename T, int N> \
inline VecMask<T, N> op(const VecMask<T, N>& a) { \
return op(VectorizedN<T, N>(a)); \
}
#define VEC_MASK_DEFINE_BINARY_OP_GLOBAL(op) \
template < \
typename T, \
int N, \
typename V, \
int M, \
std::enable_if_t<VecMask<T, N>::size() == VecMask<V, M>::size(), int> = \
0> \
inline VecMask<T, N> op(const VecMask<T, N>& a, const VecMask<V, M>& b) { \
return op( \
VectorizedN<T, N>(a), VectorizedN<T, N>(b.template cast<T, N>())); \
}
#define VEC_MASK_DEFINE_BINARY_OP_WITH_EXPR_GLOBAL(op, EXPR) \
template < \
typename T, \
int N, \
typename V, \
int M, \
std::enable_if_t<VecMask<T, N>::size() == VecMask<V, M>::size(), int> = \
0> \
inline VecMask<T, N> op(const VecMask<T, N>& a, const VecMask<V, M>& b) { \
return EXPR; \
}
VEC_MASK_DEFINE_UNARY_OP_GLOBAL(operator~)
VEC_MASK_DEFINE_BINARY_OP_GLOBAL(operator&)
VEC_MASK_DEFINE_BINARY_OP_GLOBAL(operator|)
VEC_MASK_DEFINE_BINARY_OP_GLOBAL(operator^)
VEC_MASK_DEFINE_BINARY_OP_WITH_EXPR_GLOBAL(operator>, a & ~b)
VEC_MASK_DEFINE_BINARY_OP_WITH_EXPR_GLOBAL(operator<, ~a& b)
VEC_MASK_DEFINE_BINARY_OP_WITH_EXPR_GLOBAL(operator==, ~(a ^ b))
VEC_MASK_DEFINE_BINARY_OP_WITH_EXPR_GLOBAL(operator>=, (a == b) | (a > b))
VEC_MASK_DEFINE_BINARY_OP_WITH_EXPR_GLOBAL(operator<=, (a == b) | (a < b))
#undef VEC_MASK_DEFINE_UNARY_OP_GLOBAL
#undef VEC_MASK_DEFINE_BINARY_OP_GLOBAL
#undef VEC_MASK_DEFINE_BINARY_OP_WITH_EXPR_GLOBAL
} // namespace CPU_CAPABILITY
} // namespace at::vec

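A hypothetical usage sketch of `VecMask` built only from the names declared above: create a mask from a bool array, bail out early when nothing is selected, and do a guarded load in which deselected lanes read as zero. `masked_copy` and `keep` are illustrative names; `keep`, `src`, and `dst` are assumed to each provide at least `VecMask<float, 1>::size()` elements.

#include <ATen/cpu/vec/vec.h>

void masked_copy(const bool* keep, const float* src, float* dst) {
  using namespace at::vec;
  auto mask = VecMask<float, 1>::from(keep);         // all-ones bit pattern per kept lane
  if (mask.all_zero()) {
    return;                                          // nothing selected
  }
  Vectorized<float> v = mask.loadu<float, 1>(src);   // zeros where !keep[i]
  v.store(dst);                                      // deselected lanes are written as zero
}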
View File

@ -1,3 +1,5 @@
#pragma once
#include <ATen/cpu/vec/vec_base.h>
#include <array>
@ -83,11 +85,19 @@ class VectorizedN {
}
}
const Vectorized<T>& operator[](int i) const {
template <int L = N, typename std::enable_if_t<L == 1, int> = 0>
VectorizedN(const Vectorized<T>& val) : values({val}) {}
template <int L = N, typename std::enable_if_t<L == 1, int> = 0>
inline operator Vectorized<T>() const {
return values[0];
}
inline const Vectorized<T>& operator[](int i) const {
return values[i];
}
Vectorized<T>& operator[](int i) {
inline Vectorized<T>& operator[](int i) {
return values[i];
}
@ -97,7 +107,7 @@ class VectorizedN {
const VectorizedN<T, N>& b) {
VectorizedN<T, N> result;
for (int i = 0; i < N; ++i) {
result.values[i] = Vectorized<T>::blend<mask>(a.values[i], b.values[i]);
result.values[i] = Vectorized<T>::template blend<mask>(a.values[i], b.values[i]);
}
return result;
}
@ -132,8 +142,10 @@ class VectorizedN {
int64_t count = size()) {
VectorizedN<T, N> result;
for (int i = 0; i < N; ++i) {
result.values[i] =
Vectorized<T>::set(a.values[i], b.values[i], std::min(count, Vectorized<T>::size()));
result.values[i] = Vectorized<T>::set(
a.values[i],
b.values[i],
std::min(count, (int64_t)Vectorized<T>::size()));
count -= Vectorized<T>::size();
if (count <= 0) {
break;
@ -154,8 +166,8 @@ class VectorizedN {
static VectorizedN<T, N> loadu(const void* ptr, int64_t count) {
VectorizedN<T, N> result;
for (int i = 0; i < N; ++i) {
result.values[i] =
Vectorized<T>::loadu(ptr, std::min(count, Vectorized<T>::size()));
result.values[i] = Vectorized<T>::loadu(
ptr, std::min(count, (int64_t)Vectorized<T>::size()));
ptr = static_cast<const T*>(ptr) + Vectorized<T>::size();
count -= Vectorized<T>::size();
if (count <= 0) {
@ -174,7 +186,7 @@ class VectorizedN {
void store(void* ptr, int count) const {
for (int i = 0; i < N; ++i) {
values[i].store(ptr, std::min(count, Vectorized<T>::size()));
values[i].store(ptr, std::min(count, (int)Vectorized<T>::size()));
ptr = static_cast<T*>(ptr) + Vectorized<T>::size();
count -= Vectorized<T>::size();
if (count <= 0) {
@ -341,4 +353,4 @@ inline T vec_reduce_all(const OpVec& vec_fun, VectorizedN<T, N> acc_vec) {
}
} // namespace CPU_CAPABILITY
} // namespace at::vec
} // namespace at::vec

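A small sketch of the quality-of-life additions above: with the new N == 1 converting constructor and conversion operator, a `VectorizedN<float, 1>` can be passed around interchangeably with `Vectorized<float>`. The helper name `first_lane` is illustrative only.

#include <ATen/cpu/vec/vec.h>

float first_lane(const float* ptr) {
  using namespace at::vec;
  VectorizedN<float, 1> v(Vectorized<float>::loadu(ptr));  // implicit wrap (N == 1 only)
  Vectorized<float> back = v;                              // implicit unwrap via operator Vectorized<T>()
  __at_align__ float buf[Vectorized<float>::size()];
  back.store(buf);
  return buf[0];
}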
View File

@ -48,7 +48,7 @@ struct TORCH_CUDA_CPP_API CUDAEvent {
CUDAGuard guard(device_index_);
const c10::impl::PyInterpreter* interp = c10::impl::GPUTrace::get_trace();
if (C10_UNLIKELY(interp)) {
(*interp)->trace_gpu_event_deletion(reinterpret_cast<uintptr_t>(event_));
(*interp)->trace_gpu_event_deletion(at::kCUDA, reinterpret_cast<uintptr_t>(event_));
}
AT_CUDA_CHECK(cudaEventDestroy(event_));
}
@ -122,7 +122,7 @@ struct TORCH_CUDA_CPP_API CUDAEvent {
AT_CUDA_CHECK(cudaEventRecord(event_, stream));
const c10::impl::PyInterpreter* interp = c10::impl::GPUTrace::get_trace();
if (C10_UNLIKELY(interp)) {
(*interp)->trace_gpu_event_record(
(*interp)->trace_gpu_event_record(at::kCUDA,
reinterpret_cast<uintptr_t>(event_),
reinterpret_cast<uintptr_t>(stream.stream())
);
@ -138,7 +138,7 @@ struct TORCH_CUDA_CPP_API CUDAEvent {
AT_CUDA_CHECK(cudaStreamWaitEvent(stream, event_, 0));
const c10::impl::PyInterpreter* interp = c10::impl::GPUTrace::get_trace();
if (C10_UNLIKELY(interp)) {
(*interp)->trace_gpu_event_wait(
(*interp)->trace_gpu_event_wait(at::kCUDA,
reinterpret_cast<uintptr_t>(event_),
reinterpret_cast<uintptr_t>(stream.stream())
);
@ -165,7 +165,7 @@ struct TORCH_CUDA_CPP_API CUDAEvent {
if (is_created_) {
const c10::impl::PyInterpreter* interp = c10::impl::GPUTrace::get_trace();
if (C10_UNLIKELY(interp)) {
(*interp)->trace_gpu_event_synchronization(reinterpret_cast<uintptr_t>(event_));
(*interp)->trace_gpu_event_synchronization(at::kCUDA, reinterpret_cast<uintptr_t>(event_));
}
AT_CUDA_CHECK(cudaEventSynchronize(event_));
}
@ -195,7 +195,7 @@ private:
AT_CUDA_CHECK(cudaEventCreateWithFlags(&event_, flags_));
const c10::impl::PyInterpreter* interp = c10::impl::GPUTrace::get_trace();
if (C10_UNLIKELY(interp)) {
(*interp)->trace_gpu_event_creation(reinterpret_cast<uintptr_t>(event_));
(*interp)->trace_gpu_event_creation(at::kCUDA, reinterpret_cast<uintptr_t>(event_));
}
is_created_ = true;
}

View File

@ -1,5 +1,8 @@
#include <ATen/Functions.h>
#include <ATen/Tensor.h>
#include <ATen/Utils.h>
#include <ATen/cuda/CUDAGeneratorImpl.h>
#include <ATen/cuda/CUDAGraph.h>
#include <ATen/cuda/CUDAGraphsUtils.cuh>
#include <c10/core/StreamGuard.h>
#include <c10/cuda/CUDAFunctions.h>
@ -24,10 +27,10 @@ static std::deque<c10::once_flag> cuda_gens_init_flag;
static std::vector<Generator> default_gens_cuda;
/*
* Populates the global variables related to CUDA generators
* Warning: this function must only be called once!
*/
static void initCUDAGenVector(){
* Populates the global variables related to CUDA generators
* Warning: this function must only be called once!
*/
static void initCUDAGenVector() {
num_gpus = c10::cuda::device_count();
cuda_gens_init_flag.resize(num_gpus);
default_gens_cuda.resize(num_gpus);
@ -77,6 +80,150 @@ Generator createCUDAGenerator(DeviceIndex device_index) {
} // namespace cuda::detail
/**
* Creates a clone of this CUDA Generator State.
*/
c10::intrusive_ptr<CUDAGeneratorState> CUDAGeneratorState::clone() {
return make_intrusive<CUDAGeneratorState>(
seed_, philox_offset_per_thread_, offset_intragraph_);
}
/**
* Function to increase the internal offset based on the specified increment.
*/
void CUDAGeneratorState::increase(uint64_t increment) {
// Rounds increment up to the nearest multiple of 4 to meet alignment
// requirements.
// see Note [Why enforce RNG offset % 4 == 0?]
increment = ((increment + 3) / 4) * 4;
// Handling different behaviors based on whether capturing is active.
if (at::cuda::currentStreamCaptureStatus() != at::cuda::CaptureStatus::None) {
// Ensures that the state is actually capturing.
TORCH_CHECK(
capturing_,
"Attempt to increase offset for a CUDA generator not in capture mode.");
// Ensures the offset is a multiple of 4
// see Note [Why enforce RNG offset % 4 == 0?]
TORCH_INTERNAL_ASSERT(
offset_intragraph_ % 4 == 0, "RNG offset must be a multiple of 4.");
// Ensures the increment does not cause overflow.
TORCH_INTERNAL_ASSERT(
offset_intragraph_ <= std::numeric_limits<uint32_t>::max() - increment,
"Increment causes overflow in the offset value.");
offset_intragraph_ += increment;
} else {
// Checks that the increment is expected outside graph capturing.
TORCH_CHECK(
!capturing_,
"Offset increment outside graph capture encountered unexpectedly.");
// Ensures the offset is a multiple of 4
// see Note [Why enforce RNG offset % 4 == 0?]
TORCH_INTERNAL_ASSERT(
philox_offset_per_thread_ % 4 == 0,
"RNG offset must be a multiple of 4.");
philox_offset_per_thread_ += increment;
}
}
/**
* Registers this state to a CUDA graph to manage within the graph.
*/
void CUDAGeneratorState::register_graph(cuda::CUDAGraph* graph) {
// Ensures that the RNG state is not currently being captured.
at::cuda::assertNotCapturing(
"Cannot register the state during capturing stage.");
// If this is the first graph to be registered, allocate memory for the seed
// and offset on the GPU.
if (registered_graphs_.empty()) {
auto options = at::TensorOptions().device(at::kCUDA).dtype(at::kLong);
seed_extragraph_ = at::empty({1}, options);
offset_extragraph_ = at::empty({1}, options);
}
// Insert the graph into the set of registered graphs if it's not already
// registered.
if (registered_graphs_.find(graph) == registered_graphs_.end()) {
registered_graphs_.insert(graph);
}
}
/**
* Unregisters a CUDA graph from the RNG state.
*/
void CUDAGeneratorState::unregister_graph(cuda::CUDAGraph* graph) {
// Ensures that the RNG state is not currently being captured.
at::cuda::assertNotCapturing(
"Cannot unregister the state during capturing stage.");
// Verify the graph was previously registered.
TORCH_CHECK(
registered_graphs_.find(graph) != registered_graphs_.end(),
"The graph should be registered to the state");
// Remove the graph from the set of registered graphs.
registered_graphs_.erase(graph);
// If no more graphs are registered, deallocate the GPU memory for the seed
// and offset.
if (registered_graphs_.empty()) {
seed_extragraph_.reset();
offset_extragraph_.reset();
}
}
/**
* Note [Explicit Registration of Generators to the CUDA Graph]
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*
* Ideally, it would be more user-friendly if the state could be exchanged and generators
* could be registered with the CUDA graph implicitly. However, resetting GPU tensors during
* the capture stage causes these reset operations to be recorded within the CUDA graph.
* This behavior is undesirable because we do not want these tensors to be reset during
* the replay stage of the graph.
*
* As of now, there is no available method to perform a CUDA operation during the graph's
* recording phase without having that operation be included in the CUDA graph.
* This limitation necessitates explicit user action to register generators with the graph.
* By requiring users to manually register their generators, we can ensure that state resets
* (capture_prologue) only occur before the graph capture begins, thus avoiding unintended
* resets during the replay of the graph. See https://github.com/pytorch/pytorch/pull/114068.
*/
/**
* Performs the prologue steps for capturing a CUDA graph state.
* This method is intended to reset graph-related state variables before capturing begins.
*/
void CUDAGeneratorState::capture_prologue() {
capturing_ = true;
offset_intragraph_ = 0;
seed_extragraph_.fill_(int64_t(seed_));
offset_extragraph_.fill_(int64_t(0));
}
/**
* Ends the capturing phase and resets related variables, returning the whole
* graph increment.
*/
uint64_t CUDAGeneratorState::capture_epilogue() {
capturing_ = false;
return offset_intragraph_;
}
/**
* Prepares the state for replay by setting initial state tensors and applying
* total increment.
*/
void CUDAGeneratorState::replay_prologue(uint64_t wholegraph_increment) {
// Ensures the generator is not in capturing mode.
at::cuda::assertNotCapturing(
"Cannot prepare for replay during capturing stage.");
seed_extragraph_.fill_(int64_t(seed_));
offset_extragraph_.fill_(int64_t(philox_offset_per_thread_));
// Applies the total increment achieved during previous captures to update the
// offset.
increase(wholegraph_increment);
}
/**
* Note [Why enforce RNG offset % 4 == 0?]
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@ -97,8 +244,18 @@ Generator createCUDAGenerator(DeviceIndex device_index) {
*/
CUDAGeneratorImpl::CUDAGeneratorImpl(DeviceIndex device_index)
: c10::GeneratorImpl{Device(DeviceType::CUDA, device_index),
DispatchKeySet(c10::DispatchKey::CUDA)} {
DispatchKeySet(c10::DispatchKey::CUDA)} {
at::cuda::assertNotCapturing("Cannot construct a new CUDAGeneratorImpl");
state_ = make_intrusive<CUDAGeneratorState>();
no_reset_rnn_state_.clear();
}
CUDAGeneratorImpl::CUDAGeneratorImpl(
DeviceIndex device_index,
c10::intrusive_ptr<CUDAGeneratorState> state)
: c10::
GeneratorImpl{Device(DeviceType::CUDA, device_index), DispatchKeySet(c10::DispatchKey::CUDA)},
state_(std::move(state)) {
no_reset_rnn_state_.clear();
}
@ -109,9 +266,10 @@ CUDAGeneratorImpl::CUDAGeneratorImpl(DeviceIndex device_index)
* See Note [Acquire lock when using random generators]
*/
void CUDAGeneratorImpl::set_current_seed(uint64_t seed) {
at::cuda::assertNotCapturing("Cannot call CUDAGeneratorImpl::set_current_seed");
seed_ = seed;
philox_offset_per_thread_ = 0;
at::cuda::assertNotCapturing(
"Cannot call CUDAGeneratorImpl::set_current_seed");
state_->seed_ = seed;
state_->philox_offset_per_thread_ = 0;
no_reset_rnn_state_.clear();
}
@ -134,15 +292,9 @@ uint64_t CUDAGeneratorImpl::get_offset() const {
// Debatable if get_offset() should be allowed in captured regions.
// Conservatively disallow it for now.
at::cuda::assertNotCapturing("Cannot call CUDAGeneratorImpl::get_offset");
return philox_offset_per_thread_;
return state_->philox_offset_per_thread_;
}
#define CAPTURE_DEFAULT_GENS_MSG \
"In regions captured by CUDA graphs, you may only use the default CUDA RNG " \
"generator on the device that's current when capture begins. " \
"If you need a non-default (user-supplied) generator, or a generator on another " \
"device, please file an issue."
/**
* Gets the current seed of CUDAGeneratorImpl.
*/
@ -150,7 +302,7 @@ uint64_t CUDAGeneratorImpl::current_seed() const {
// Debatable if current_seed() should be allowed in captured regions.
// Conservatively disallow it for now.
at::cuda::assertNotCapturing("Cannot call CUDAGeneratorImpl::current_seed");
return seed_;
return state_->seed_;
}
/**
@ -194,6 +346,8 @@ c10::intrusive_ptr<c10::TensorImpl> CUDAGeneratorImpl::get_state() const {
* and size of the internal state.
*/
void CUDAGeneratorImpl::set_state(const c10::TensorImpl& new_state) {
at::cuda::assertNotCapturing(
"Please ensure to utilize the CUDAGeneratorImpl::set_state_index method during capturing.");
static const size_t seed_size = sizeof(uint64_t);
static const size_t offset_size = sizeof(int64_t);
static const size_t total_size = seed_size + offset_size;
@ -208,7 +362,7 @@ void CUDAGeneratorImpl::set_state(const c10::TensorImpl& new_state) {
TORCH_CHECK(new_state_size == total_size, "RNG state is wrong size");
}
uint64_t input_seed;
uint64_t input_seed = 0;
auto new_rng_state = new_state.data_dtype_initialized<uint8_t>();
memcpy(&input_seed, new_rng_state, seed_size);
this->set_current_seed(input_seed);
@ -219,44 +373,59 @@ void CUDAGeneratorImpl::set_state(const c10::TensorImpl& new_state) {
this->set_philox_offset_per_thread(static_cast<uint64_t>(philox_offset));
}
/**
* Sets the generator's current state to that of another registered generator.
* This function allows switching between different registered states of
* the generator.
*/
void CUDAGeneratorImpl::graphsafe_set_state(
const c10::intrusive_ptr<GeneratorImpl>& gen) {
c10::intrusive_ptr<CUDAGeneratorImpl> cuda_gen =
dynamic_intrusive_pointer_cast<CUDAGeneratorImpl>(gen);
TORCH_CHECK(cuda_gen, "Expected a CUDA Generator");
state_ = cuda_gen->state_;
}
/**
* Get the GeneratorImpl that point to current state_
*/
c10::intrusive_ptr<c10::GeneratorImpl> CUDAGeneratorImpl::graphsafe_get_state()
const {
auto gen = make_intrusive<CUDAGeneratorImpl>(device().index(), state_);
return gen;
}
/**
* Sets the philox_offset_per_thread_ to be used by curandStatePhilox4_32_10
*
* See Note [Acquire lock when using random generators]
*/
void CUDAGeneratorImpl::set_philox_offset_per_thread(uint64_t offset) {
at::cuda::assertNotCapturing("Cannot call CUDAGeneratorImpl::set_philox_offset_per_thread");
// see Note [Why enforce RNG offset % 4 == 0?]
TORCH_CHECK(offset % 4 == 0, "offset must be a multiple of 4");
philox_offset_per_thread_ = offset;
state_->philox_offset_per_thread_ = offset;
}
/**
* Gets the current philox_offset_per_thread_ of CUDAGeneratorImpl.
*/
uint64_t CUDAGeneratorImpl::philox_offset_per_thread() const {
at::cuda::assertNotCapturing("Cannot call CUDAGeneratorImpl::philox_offset_per_thread");
return philox_offset_per_thread_;
return state_->philox_offset_per_thread_;
}
/**
* Called by CUDAGraph to prepare this instance for a graph capture region.
* offset_extragraph is the initial offset at the start of the graphed region.
* offset_intragraph tracks the offset in the graphed region.
* Registers this state to a CUDA graph to manage within the graph.
*/
void CUDAGeneratorImpl::capture_prologue(int64_t* seed_extragraph, int64_t* offset_extragraph) {
seed_extragraph_ = seed_extragraph;
offset_extragraph_ = offset_extragraph;
offset_intragraph_ = 0;
graph_expects_this_gen_ = true;
void CUDAGeneratorImpl::register_graph(cuda::CUDAGraph* graph) {
graph->register_generator_state(state_);
state_->register_graph(graph);
}
/**
* Called by CUDAGraph to finalize a graph capture region for this instance.
* Unregisters a CUDA graph from the RNG state.
*/
uint64_t CUDAGeneratorImpl::capture_epilogue() {
graph_expects_this_gen_ = false;
return offset_intragraph_;
void CUDAGeneratorImpl::unregister_graph(cuda::CUDAGraph* graph) {
state_->unregister_graph(graph);
}
/**
@ -281,30 +450,17 @@ uint64_t CUDAGeneratorImpl::capture_epilogue() {
* See Note [Acquire lock when using random generators]
*/
PhiloxCudaState CUDAGeneratorImpl::philox_cuda_state(uint64_t increment) {
// rounds increment up to the nearest multiple of 4
increment = ((increment + 3) / 4) * 4;
if (at::cuda::currentStreamCaptureStatus() != at::cuda::CaptureStatus::None) {
TORCH_CHECK(graph_expects_this_gen_,
"philox_cuda_state for an unexpected CUDA generator used during capture. "
CAPTURE_DEFAULT_GENS_MSG);
// see Note [Why enforce RNG offset % 4 == 0?]
TORCH_INTERNAL_ASSERT(this->offset_intragraph_ % 4 == 0);
uint32_t offset = this->offset_intragraph_;
TORCH_INTERNAL_ASSERT(this->offset_intragraph_ <=
std::numeric_limits<uint32_t>::max() - increment);
this->offset_intragraph_ += increment;
return PhiloxCudaState(this->seed_extragraph_,
this->offset_extragraph_,
offset);
uint32_t offset = state_->offset_intragraph_;
state_->increase(increment);
return PhiloxCudaState(
state_->seed_extragraph_.data_ptr<int64_t>(),
state_->offset_extragraph_.data_ptr<int64_t>(),
offset);
} else {
TORCH_CHECK(!graph_expects_this_gen_,
"CUDA generator expects graph capture to be underway, "
"but the current stream is not capturing.");
// see Note [Why enforce RNG offset % 4 == 0?]
TORCH_INTERNAL_ASSERT(this->philox_offset_per_thread_ % 4 == 0);
uint64_t offset = this->philox_offset_per_thread_;
this->philox_offset_per_thread_ += increment;
return PhiloxCudaState(this->seed_, offset);
uint64_t offset = state_->philox_offset_per_thread_;
state_->increase(increment);
return PhiloxCudaState(state_->seed_, offset);
}
}
@ -312,16 +468,13 @@ PhiloxCudaState CUDAGeneratorImpl::philox_cuda_state(uint64_t increment) {
* Temporarily accommodates call sites that use philox_engine_inputs.
* Allows incremental refactor of call sites to use philox_cuda_state.
*/
std::pair<uint64_t, uint64_t> CUDAGeneratorImpl::philox_engine_inputs(uint64_t increment) {
at::cuda::assertNotCapturing("Refactor this op to use CUDAGeneratorImpl::philox_cuda_state. "
"Cannot call CUDAGeneratorImpl::philox_engine_inputs");
// rounds increment up to the nearest multiple of 4
increment = ((increment + 3) / 4) * 4;
// see Note [Why enforce RNG offset % 4 == 0?]
TORCH_INTERNAL_ASSERT(this->philox_offset_per_thread_ % 4 == 0);
uint64_t offset = this->philox_offset_per_thread_;
this->philox_offset_per_thread_ += increment;
return std::make_pair(this->seed_, offset);
std::pair<uint64_t, uint64_t> CUDAGeneratorImpl::philox_engine_inputs(
uint64_t increment) {
at::cuda::assertNotCapturing(
"Refactor this op to use CUDAGeneratorImpl::philox_cuda_state. Cannot call CUDAGeneratorImpl::philox_engine_inputs");
uint64_t offset = state_->philox_offset_per_thread_;
state_->increase(increment);
return std::make_pair(state_->seed_, offset);
}
/*
@ -348,9 +501,7 @@ std::shared_ptr<CUDAGeneratorImpl> CUDAGeneratorImpl::clone() const {
*/
CUDAGeneratorImpl* CUDAGeneratorImpl::clone_impl() const {
at::cuda::assertNotCapturing("Cannot call CUDAGeneratorImpl::clone_impl");
auto gen = new CUDAGeneratorImpl(this->device().index());
gen->set_current_seed(this->seed_);
gen->set_philox_offset_per_thread(this->philox_offset_per_thread_);
auto gen = new CUDAGeneratorImpl(this->device().index(), state_->clone());
return gen;
}

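A worked illustration of the rounding step inside `CUDAGeneratorState::increase()`: the Philox offset must stay a multiple of 4 (see Note [Why enforce RNG offset % 4 == 0?]), so the requested increment is first rounded up to the next multiple of 4.

#include <cstdint>

constexpr uint64_t round_up_to_multiple_of_4(uint64_t increment) {
  return ((increment + 3) / 4) * 4;   // same expression as in increase()
}
static_assert(round_up_to_multiple_of_4(1) == 4, "");
static_assert(round_up_to_multiple_of_4(4) == 4, "");
static_assert(round_up_to_multiple_of_4(10) == 12, "");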
View File

@ -1,12 +1,19 @@
#pragma once
#include <ATen/core/Generator.h>
#include <ATen/cuda/PhiloxCudaState.h>
#include <ATen/Context.h>
#include <limits>
#include <ATen/core/Generator.h>
#include <ATen/core/TensorBase.h>
#include <ATen/cuda/PhiloxCudaState.h>
#include <atomic>
#include <limits>
#include <memory>
#include <unordered_set>
namespace at {
namespace cuda {
struct CUDAGraph;
}
/**
* Note [CUDA Graph-safe RNG states]
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@ -87,9 +94,41 @@ namespace at {
*
*/
struct CUDAGeneratorState : public c10::intrusive_ptr_target {
uint64_t seed_;
uint64_t philox_offset_per_thread_;
uint32_t offset_intragraph_;
bool capturing_{};
std::unordered_set<cuda::CUDAGraph*> registered_graphs_;
at::TensorBase seed_extragraph_{};
at::TensorBase offset_extragraph_{};
CUDAGeneratorState(
uint64_t seed = default_rng_seed_val,
uint64_t philox_offset_per_thread = 0,
uint32_t offset_intragraph = 0)
: seed_(seed),
philox_offset_per_thread_(philox_offset_per_thread),
offset_intragraph_(offset_intragraph) {}
void increase(uint64_t increment);
void register_graph(cuda::CUDAGraph* graph);
void unregister_graph(cuda::CUDAGraph* graph);
void capture_prologue();
// capture_epilogue returns the wholegraph_increment
uint64_t capture_epilogue();
void replay_prologue(uint64_t wholegraph_increment);
c10::intrusive_ptr<CUDAGeneratorState> clone();
};
struct TORCH_CUDA_CPP_API CUDAGeneratorImpl : public c10::GeneratorImpl {
// Constructors
CUDAGeneratorImpl(DeviceIndex device_index = -1);
CUDAGeneratorImpl(
DeviceIndex device_index,
c10::intrusive_ptr<CUDAGeneratorState> state_);
~CUDAGeneratorImpl() override = default;
// CUDAGeneratorImpl methods
@ -101,10 +140,18 @@ struct TORCH_CUDA_CPP_API CUDAGeneratorImpl : public c10::GeneratorImpl {
uint64_t seed() override;
void set_state(const c10::TensorImpl& new_state) override;
c10::intrusive_ptr<c10::TensorImpl> get_state() const override;
void graphsafe_set_state(
const c10::intrusive_ptr<GeneratorImpl>& state) override;
c10::intrusive_ptr<c10::GeneratorImpl> graphsafe_get_state() const override;
void set_philox_offset_per_thread(uint64_t offset);
uint64_t philox_offset_per_thread() const;
void capture_prologue(int64_t* seed_extragraph, int64_t* offset_extragraph);
uint64_t capture_epilogue();
void register_graph(cuda::CUDAGraph* graph);
void unregister_graph(cuda::CUDAGraph* graph);
// Generates a PhiloxCudaState with a specified increment, and increments the
// current state accordingly
PhiloxCudaState philox_cuda_state(uint64_t increment);
bool reset_rnn_state() {
@ -117,14 +164,10 @@ struct TORCH_CUDA_CPP_API CUDAGeneratorImpl : public c10::GeneratorImpl {
static c10::DeviceType device_type();
private:
private:
CUDAGeneratorImpl* clone_impl() const override;
uint64_t seed_ = default_rng_seed_val;
uint64_t philox_offset_per_thread_ = 0;
int64_t* seed_extragraph_{};
int64_t* offset_extragraph_{};
uint32_t offset_intragraph_ = 0;
bool graph_expects_this_gen_ = false;
c10::intrusive_ptr<CUDAGeneratorState> state_;
std::atomic_flag no_reset_rnn_state_;
};

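A hedged sketch of the graph-safe state API declared above: `graphsafe_set_state` makes one generator point at another generator's `CUDAGeneratorState`, so offset increases performed through either of them are visible to both. The function name `share_rng_state` is illustrative only.

#include <ATen/cuda/CUDAGeneratorImpl.h>

void share_rng_state(at::CUDAGeneratorImpl& dst, const at::CUDAGeneratorImpl& src) {
  dst.graphsafe_set_state(src.graphsafe_get_state());  // dst now shares src's state_
}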
View File

@ -6,7 +6,10 @@
#include <c10/cuda/CUDAFunctions.h>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>
namespace at::cuda {
@ -86,26 +89,33 @@ CUDAGraph::CUDAGraph()
#endif
}
void CUDAGraph::register_generator_state(
c10::intrusive_ptr<at::CUDAGeneratorState> state) {
captured_generator_states_[std::move(state)] = 0;
}
void CUDAGraph::register_generator_state(const at::Generator& generator) {
c10::intrusive_ptr<CUDAGeneratorImpl> cuda_gen =
dynamic_intrusive_pointer_cast<CUDAGeneratorImpl>(
generator.getIntrusivePtr());
cuda_gen->register_graph(this);
}
void CUDAGraph::capture_begin(MempoolId_t pool/*=0*/, cudaStreamCaptureMode capture_mode) {
#if !defined(USE_ROCM) || ROCM_VERSION >= 50300
TORCH_CHECK(!has_graph_exec_,
"This CUDAGraph instance already owns a captured graph. "
"To capture a new graph, create a new instance.");
// For now, a CUDAGraph instance only accommodates the default generator on the device that's
// current when capture begins. If any op in the captured region uses a non-default generator,
// or a generator on another device, the offending generator will throw an error.
// These restrictions simplify CUDAGraph, but could be relaxed in the future:
// in principle, the underlying Cuda calls do permit cross-device ops to be captured.
// default generator is always registered
auto* gen = get_generator_or_default<CUDAGeneratorImpl>(
c10::nullopt, cuda::detail::getDefaultCUDAGenerator());
gen->register_graph(this);
auto options = TensorOptions().device(at::kCUDA).dtype(at::kLong);
seed_extragraph_ = at::empty({1}, options);
offset_extragraph_ = at::empty({1}, options);
seed_extragraph_.fill_(int64_t(gen->current_seed()));
gen->capture_prologue(seed_extragraph_.data_ptr<int64_t>(), offset_extragraph_.mutable_data_ptr<int64_t>());
for (auto& [generator_state, wholegraph_increments] :
captured_generator_states_) {
generator_state->capture_prologue();
}
auto stream = at::cuda::getCurrentCUDAStream();
@ -115,7 +125,6 @@ void CUDAGraph::capture_begin(MempoolId_t pool/*=0*/, cudaStreamCaptureMode capt
"default stream.)");
capture_stream_ = stream;
capture_gen_ = gen;
capture_dev_ = c10::cuda::current_device();
id_ = capture_sequence_id();
@ -215,13 +224,10 @@ void CUDAGraph::capture_end() {
has_graph_exec_ = true;
auto* gen = get_generator_or_default<CUDAGeneratorImpl>(
c10::nullopt, cuda::detail::getDefaultCUDAGenerator());
TORCH_CHECK(gen == capture_gen_,
"Default CUDA RNG generator on current device at capture end "
"is different from default generator on current device "
"when capture began");
wholegraph_increment_ = gen->capture_epilogue();
for (auto& [generator_state, wholegraph_increments] :
captured_generator_states_) {
wholegraph_increments = generator_state->capture_epilogue();
}
size_t numCUDAGraphNodes = 0;
AT_CUDA_CHECK(cudaGraphGetNodes(graph_, NULL, &numCUDAGraphNodes));
@ -251,17 +257,10 @@ void CUDAGraph::replay() {
c10::OptionalDeviceGuard device_guard{capture_stream_.device()};
// Just like any RNG consumer kernel!
auto* gen = get_generator_or_default<CUDAGeneratorImpl>(
c10::nullopt, cuda::detail::getDefaultCUDAGenerator());
PhiloxCudaState rng_engine_inputs;
{
std::lock_guard<std::mutex> lock(gen->mutex_);
rng_engine_inputs = gen->philox_cuda_state(wholegraph_increment_);
for (auto& [generator_state, wholegraph_increments] :
captured_generator_states_) {
generator_state->replay_prologue(wholegraph_increments);
}
seed_extragraph_.fill_(int64_t(gen->current_seed()));
offset_extragraph_.fill_(int64_t(rng_engine_inputs.offset_.val));
// graph_exec_ may be replayed in any stream.
AT_CUDA_CHECK(cudaGraphLaunch(graph_exec_, at::cuda::getCurrentCUDAStream()));
@ -355,6 +354,10 @@ TORCH_CHECK(has_graph_exec_,
}
CUDAGraph::~CUDAGraph() {
for (auto& [generator_state, wholegraph_increments] :
captured_generator_states_) {
generator_state->unregister_graph(this);
}
reset();
}

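A hedged usage sketch of the registration flow implemented above: the default CUDA generator is registered automatically inside `capture_begin()`, while a non-default generator must be registered explicitly before capture starts (see Note [Explicit Registration of Generators to the CUDA Graph]). The function name `capture_with_custom_rng` is illustrative only.

#include <ATen/core/Generator.h>
#include <ATen/cuda/CUDAGraph.h>

void capture_with_custom_rng(at::cuda::CUDAGraph& graph, const at::Generator& my_gen) {
  graph.register_generator_state(my_gen);  // explicit registration before capture
  graph.capture_begin();
  // ... enqueue RNG-consuming kernels on the capturing stream ...
  graph.capture_end();
  graph.replay();  // replay_prologue() refreshes each registered seed/offset first
}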
View File

@ -4,12 +4,13 @@
#include <c10/core/Device.h>
#include <c10/cuda/CUDAGraphsC10Utils.h>
#include <c10/cuda/CUDAStream.h>
#include <mutex>
#include <c10/util/flat_hash_map.h>
namespace at {
struct Generator;
struct CUDAGeneratorImpl;
struct CUDAGeneratorState;
namespace cuda {
@ -24,7 +25,12 @@ struct TORCH_CUDA_CPP_API CUDAGraph {
static void inc_pending_event_queries();
static void dec_pending_event_queries();
static int num_pending_event_queries();
void capture_begin(MempoolId_t pool={0, 0}, cudaStreamCaptureMode capture_mode = cudaStreamCaptureModeGlobal);
// See Note [Explicit Registration of Generators to the CUDA Graph]
void register_generator_state(c10::intrusive_ptr<at::CUDAGeneratorState> state);
void register_generator_state(const at::Generator& generator);
void capture_begin(
MempoolId_t pool = {0, 0},
cudaStreamCaptureMode capture_mode = cudaStreamCaptureModeGlobal);
void capture_end();
void replay();
void reset();
@ -32,7 +38,7 @@ struct TORCH_CUDA_CPP_API CUDAGraph {
void enable_debug_mode();
void debug_dump(const std::string& debug_path);
protected:
protected:
#if !defined(USE_ROCM) || ROCM_VERSION >= 50300
cudaGraph_t graph_ = NULL;
cudaGraphExec_t graph_exec_ = NULL;
@ -73,19 +79,16 @@ struct TORCH_CUDA_CPP_API CUDAGraph {
// Stream on which capture began
at::cuda::CUDAStream capture_stream_;
// Default generator on device where capture began
at::CUDAGeneratorImpl* capture_gen_;
// multiple generator states and their wholegraph_increments in this graph
// that are managed by the CUDA Graph
ska::flat_hash_map<c10::intrusive_ptr<at::CUDAGeneratorState>, uint64_t>
captured_generator_states_;
// Device where capture occurred. Right now, for simplicity, we require all ops
// in a capture to run on the same device, but this is a limitation of CUDAGraph,
// not CUDA itself. We can straightforwardly modify CUDAGraph to support multi-device
// captures if needed.
int capture_dev_;
// RNG state trackers
at::Tensor seed_extragraph_;
at::Tensor offset_extragraph_;
uint64_t wholegraph_increment_;
};
} // namespace cuda
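Taken together, the CUDAGraph changes above replace the single implicit default-generator bookkeeping (capture_gen_, wholegraph_increment_, the extragraph seed/offset tensors) with a map of explicitly registered generator states. A minimal capture/replay sketch against the API declared in this header; the include paths and getDefaultCUDAGenerator are assumptions from the surrounding ATen code, and the captured work is elided:
#include <ATen/cuda/CUDAGraph.h>
#include <ATen/cuda/CUDAGeneratorImpl.h>

void capture_and_replay_sketch() {
  at::cuda::CUDAGraph graph;
  const at::Generator& gen = at::cuda::detail::getDefaultCUDAGenerator();
  graph.register_generator_state(gen);   // explicit registration, per the note referenced above
  // NB: real callers must be on a non-default (side) stream; capture_begin() rejects the default stream.
  graph.capture_begin();                 // default mempool {0, 0}, cudaStreamCaptureModeGlobal
  // ... enqueue the CUDA work (including RNG consumers) to be captured ...
  graph.capture_end();                   // capture_epilogue() records per-generator increments
  graph.replay();                        // replay_prologue() advances each registered state
}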


@ -339,7 +339,7 @@ c10::SmallVector<at::Tensor> CompileAndLaunchKernel(
config.add_owned_output(outs[i]);
}
for (const auto& t: tensors) {
config.add_input(t);
config.add_const_input(t);
}
TensorIterator iter = config.build();
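Several hunks in this compare make the same TensorIterator change: operands that are only read are registered with add_const_input rather than add_input. A minimal sketch of the resulting pattern — out, a, and b are hypothetical tensors, not names from the diff:
auto iter = at::TensorIteratorConfig()
    .add_output(out)        // written to
    .add_const_input(a)     // read-only operands, borrowed as const
    .add_const_input(b)
    .build();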


@ -73,7 +73,7 @@ static void autogradBasedTransformProcess(
return materializeGradWrappers(tensor, current_level);
};
auto num_args = op.schema().arguments().size();
foreachTensorInplace(*stack, stack->size() - num_args, stack->size(), maybeTransformGradWrappers);
foreachTensorInplace(*stack, static_cast<int64_t>(stack->size() - num_args), static_cast<int64_t>(stack->size()), maybeTransformGradWrappers);
setup_dispatch_key_tls(transform_type, {});
op.callBoxed(stack);
@ -133,7 +133,7 @@ static void autogradBasedTransformSendToNext(
auto args_size = op.schema().arguments().size();
const auto ret_size = op.schema().returns().size();
// Step 1
auto front = stack->size() - args_size;
auto front = static_cast<int64_t>(stack->size()) - args_size;
for (const auto arg_idx : c10::irange(0, args_size)) {
stack->push_back((*stack)[front + arg_idx]);
}
@ -151,7 +151,7 @@ static void autogradBasedTransformSendToNext(
// if the input is immutable, we find if it aliases anything, noting that
// args are in reverse order on stack, so the last arg is at the top of the stack
const auto relative_pos = idx - (stack->size() - args_size);
const auto aliased_out = findAliasedOutput(op.schema(), relative_pos);
const auto aliased_out = findAliasedOutput(op.schema(), static_cast<int64_t>(relative_pos));
if (aliased_out.has_value()) {
outputs_aliasing_immutable.flip(*aliased_out); // each output aliases at most one input, so we can only hit this once
}
@ -160,7 +160,7 @@ static void autogradBasedTransformSendToNext(
}
// Step 2
foreachTensorInplace(*stack, stack->size() - args_size, stack->size(), unwrap);
foreachTensorInplace(*stack, static_cast<int64_t>(stack->size() - args_size), static_cast<int64_t>(stack->size()), unwrap);
// See NOTE [grad and vjp interaction with no_grad]
optional<c10::AutoGradMode> grad_guard;
@ -183,7 +183,7 @@ static void autogradBasedTransformSendToNext(
op.callBoxed(stack);
// Step 4
foreachTensorInplaceWithFlag(*stack, stack->size() - ret_size, stack->size(), outputs_aliasing_immutable, wrap);
foreachTensorInplaceWithFlag(*stack, static_cast<int64_t>(stack->size() - ret_size), static_cast<int64_t>(stack->size()), outputs_aliasing_immutable, wrap);
// Step 5
auto args_front = stack->size() - args_size - ret_size;
@ -200,7 +200,7 @@ static void autogradBasedTransformSendToNext(
}
// Step 6
stack->erase(stack->end() - (args_size + ret_size), stack->end() - ret_size);
stack->erase(stack->end() - std::ptrdiff_t(args_size + ret_size), stack->end() - std::ptrdiff_t(ret_size));
}
void GradInterpreterPtr::processImpl(


@ -29,7 +29,7 @@ convolution_batch_rule(const Tensor& lhs, optional<int64_t> lhs_bdim, const Tens
// If we have a batched bias or weight, we need to perform the computation separately.
optional<Tensor> unbatched_bias;
bool separate_bias;
bool separate_bias = false;
if ((rhs_bdim && bias && bias->defined()) || bias_bdim) {
TORCH_INTERNAL_ASSERT(bias.has_value());
TORCH_INTERNAL_ASSERT(bias->defined());
@ -245,7 +245,7 @@ convolution_backward_input_batch_rule(
const Tensor& input, optional<int64_t> input_bdim,
const Tensor& weight, optional<int64_t> weight_bdim,
c10::SymIntArrayRef stride, c10::SymIntArrayRef padding, c10::SymIntArrayRef dilation, bool transposed,
c10::SymIntArrayRef output_padding, c10::SymInt groups) {
c10::SymIntArrayRef output_padding, const c10::SymInt& groups) {
const std::array<bool, 3> mask = {true, false, false};
if (grad_output_bdim && weight_bdim) {
// regular: BNO, BOI -> N(BO), (BO)I -> N(BI)
@ -326,7 +326,7 @@ convolution_backward_weight_batch_rule(
const Tensor& input, optional<int64_t> input_bdim,
const Tensor& weight, optional<int64_t> weight_bdim,
c10::SymIntArrayRef stride, c10::SymIntArrayRef padding, c10::SymIntArrayRef dilation, bool transposed,
c10::SymIntArrayRef output_padding, c10::SymInt groups) {
c10::SymIntArrayRef output_padding, const c10::SymInt& groups) {
const std::array<bool, 3> mask = {false, true, false};
if (grad_output_bdim && input_bdim) {
// BNO, BNI -> N(BO), N(BI) -> (BO)I (regular) (BI)O (transposed)


@ -226,6 +226,7 @@ TORCH_LIBRARY_IMPL(aten, FuncTorchBatchedDecomposition, m) {
m.impl("reshape", native::reshape_symint);
OP_DECOMPOSE(resolve_conj);
OP_DECOMPOSE(resolve_neg);
OP_DECOMPOSE(rms_norm);
OP_DECOMPOSE(row_stack);
OP_DECOMPOSE(rrelu);
OP_DECOMPOSE(rrelu_);


@ -118,11 +118,9 @@ Tensor reshape_dim_outof(int64_t src, int64_t size1, const Tensor& x) {
// NOTE: 0 % 0 leads to FPE
TORCH_INTERNAL_ASSERT(shape[src] % size1 == 0);
}
int64_t size2;
// split any size out of `0`-sized dim
if (shape[src] == 0) {
size2 = 0;
} else {
int64_t size2 = 0;
if (shape[src] != 0) {
size2 = shape[src] / size1;
}
shape[src] = size1;
@ -130,7 +128,7 @@ Tensor reshape_dim_outof(int64_t src, int64_t size1, const Tensor& x) {
return at::reshape(x, shape);
}
Tensor reshape_dim_outof_symint(int64_t src, c10::SymInt size1, const Tensor& x) {
Tensor reshape_dim_outof_symint(int64_t src, const c10::SymInt& size1, const Tensor& x) {
src = maybe_wrap_dim(src, x.dim());
c10::SymDimVector shape(x.sym_sizes().begin(), x.sym_sizes().end());
if (shape[src] != 0) {
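For reference, reshape_dim_outof splits dimension src of size size1 * size2 into two adjacent dimensions [size1, size2], with the zero-sized case handled explicitly above. A hypothetical illustration, not code from the diff (assumes the ATen and functorch headers):
auto x = at::randn({2, 12});
auto y = at::functorch::reshape_dim_outof(/*src=*/1, /*size1=*/3, x);
// y.sizes() == [2, 3, 4]; a zero-sized source dim would instead split into [..., 3, 0, ...]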


@ -28,7 +28,7 @@ namespace at::functorch {
TORCH_API Tensor reshape_dim_into(int64_t src, int64_t dst, const Tensor& x);
TORCH_API Tensor reshape_dim_outof(int64_t src, int64_t size1, const Tensor& x);
TORCH_API Tensor reshape_dim_outof_symint(int64_t src, c10::SymInt size1, const Tensor& x);
TORCH_API Tensor reshape_dim_outof_symint(int64_t src, const c10::SymInt& size1, const Tensor& x);
Tensor moveBatchDimToFront(const Tensor& tensor, optional<int64_t> maybe_batch_dim);
int64_t rankWithoutBatchDim(const Tensor& tensor, optional<int64_t> maybe_batch_dim);
@ -146,7 +146,7 @@ void boxed_tensor_inputs_batch_rule(const c10::OperatorHandle& op, torch::jit::S
if (ivalue.isTensor()) {
auto [tensor_value, tensor_bdim] = unwrapTensorAtLevel(ivalue.toTensor(), cur_level);
tensor_inputs.emplace_back(tensor_value, tensor_bdim);
tensor_pos.push_back(idx);
tensor_pos.push_back(static_cast<int64_t>(idx));
}
}
Func(tensor_inputs);
@ -212,7 +212,7 @@ inline void find_and_unpack_tensors(
int64_t* batch_size) {
int64_t computed_batch_size = -1;
int64_t args_begin = stack->size() - num_args;
int64_t args_begin = static_cast<int64_t>(stack->size()) - num_args;
for (const auto idx : c10::irange(0, num_args)) {
const auto& ivalue = (*stack)[args_begin + idx];
@ -241,7 +241,7 @@ inline void boxed_existing_bdim_all_batch_rule(
const c10::OperatorHandle& op, torch::jit::Stack* stack) {
const auto& schema = op.schema();
const auto num_returns = schema.returns().size();
const auto num_arguments = schema.arguments().size();
const auto num_arguments = static_cast<int64_t>(schema.arguments().size());
c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchBatched);
auto maybe_layer = maybeCurrentDynamicLayer();
@ -254,10 +254,10 @@ inline void boxed_existing_bdim_all_batch_rule(
return;
}
int64_t args_begin = stack->size() - num_arguments;
int64_t args_begin = static_cast<int64_t>(stack->size()) - num_arguments;
SmallVector<UnpackedBatchedTensor, 5> tensor_inputs;
SmallVector<int64_t, 5> tensor_pos;
int64_t batch_size;
int64_t batch_size = 0;
find_and_unpack_tensors(
stack, num_arguments, cur_level,
@ -310,13 +310,13 @@ inline void boxed_all_tensors_have_optional_bdim(
return;
}
int64_t args_begin = stack->size() - num_arguments;
int64_t args_begin = static_cast<int64_t>(stack->size() - num_arguments);
SmallVector<UnpackedBatchedTensor, 5> tensor_inputs;
SmallVector<int64_t, 5> tensor_pos;
int64_t batch_size;
int64_t batch_size = 0;
find_and_unpack_tensors(
stack, num_arguments, cur_level,
stack, static_cast<int64_t>(num_arguments), cur_level,
&tensor_inputs, &tensor_pos, &batch_size);
optional<bool> is_no_batch_dim_case;


@ -370,7 +370,7 @@ fourOutputs solve_ex_batch_rule(
TORCH_CHECK(A_logical_rank >= 2,
"linalg.solve: The input tensor A must have at least 2 dimensions.");
int b_logical_rank = max_logical_rank;
auto b_logical_rank = max_logical_rank;
if (A_logical_rank > B_logical_rank) { // vector case: B was a vector or batched vector
// not accurate but matches linalg error message
TORCH_CHECK(B_logical_rank >= 1, "linalg.solve: The input tensor B must have at least 2 dimensions.");
@ -574,6 +574,7 @@ pinv_batch_rule(
}
// These need to be outside. String constant must be declared outside of a macro to be used as template param
// NOLINTBEGIN(*array*)
LINALG_CHECK_MATRIX_UNARY_ONE_OUT(cholesky, cholesky);
LINALG_CHECK_MATRIX_UNARY_ONE_OUT(cholesky_inverse, cholesky_inverse);
LINALG_CHECK_MATRIX_UNARY_TWO_OUT(linalg_cholesky_ex, linalg.cholesky);
@ -590,6 +591,7 @@ LINALG_CHECK_MATRIX_UNARY_THREE_OUT(_linalg_det, linalg.det);
LINALG_CHECK_MATRIX_UNARY_TWO_OUT(_linalg_eigh, linalg.eigh);
LINALG_CHECK_MATRIX_UNARY_FOUR_OUT(_linalg_slogdet, linalg.slogdet);
LINALG_CHECK_MATRIX_UNARY_THREE_OUT(_linalg_svd, linalg.svd);
// NOLINTEND(*array*)
TORCH_LIBRARY_IMPL(aten, FuncTorchBatched, m) {
VMAP_SUPPORT(bmm, bmm_batch_rule);


@ -474,7 +474,7 @@ C10_ALWAYS_INLINE void _check_layer_norm_inputs(
const Tensor& weight, optional<int64_t> weight_bdim,
const Tensor& bias, optional<int64_t> bias_bdim) {
const int normalized_ndim = normalized_shape.size();
const auto normalized_ndim = normalized_shape.size();
TORCH_CHECK(
normalized_ndim >= 1,
"Expected normalized_shape to be at least 1-dimensional, i.e., ",
@ -616,7 +616,7 @@ static std::tuple<at::Tensor,at::Tensor,at::Tensor> native_layer_norm_backward_p
if (num_front_dims_to_reduce == 0) {
grad_bias = grad_out;
} else {
grad_bias = grad_out.sum(range(0, num_front_dims_to_reduce));
grad_bias = grad_out.sum(range(0, static_cast<int64_t>(num_front_dims_to_reduce)));
}
}
if (output_mask[1] && weight_value.has_value()) {
@ -628,7 +628,7 @@ static std::tuple<at::Tensor,at::Tensor,at::Tensor> native_layer_norm_backward_p
if (num_front_dims_to_reduce == 0) {
grad_weight = expanded_grad_weight;
} else {
grad_weight = expanded_grad_weight.sum(range(0, num_front_dims_to_reduce));
grad_weight = expanded_grad_weight.sum(range(0, static_cast<int64_t>(num_front_dims_to_reduce)));
}
}
if (output_mask[0]) {


@ -199,8 +199,8 @@ static std::tuple<Tensor,Tensor> native_dropout_batching_rule(const Tensor& tens
}
auto [output, mask] = at::native_dropout(tensor_value, p, train);
return std::make_tuple(
makeBatched(std::move(output), 0, cur_level),
makeBatched(std::move(mask), 0, cur_level));
makeBatched(output, 0, cur_level),
makeBatched(mask, 0, cur_level));
}
// repeated code from the CPU kernel since the CUDA one doesn't call bernoulli_ explicitly
@ -264,7 +264,7 @@ struct RandomBatchRuleHelper<F, Func, typelist<T1, T...>> {
template <typename F, F Func, typename... T>
Tensor rand_int_wrapper(SymIntArrayRef shape, c10::SymInt high, T... extra_args) {
return Func(high, std::move(shape), std::forward<T>(extra_args)...);
return Func(high, shape, std::forward<T>(extra_args)...);
}
template <typename A, A a, typename C>


@ -75,7 +75,7 @@ static Tensor any_decomp(const Tensor& self) {
return at::any(self.flatten(), 0, false);
}
enum ReductionCase { DimArray, Dim };
enum class ReductionCase:uint8_t { DimArray, Dim };
// Macros and templates have a difficult time dealing with enums,
// so we didn't turn this into an enum.
@ -129,7 +129,7 @@ void boxed_reduction_batch_rule(const c10::OperatorHandle& op, torch::jit::Stack
auto logical_dim = rankWithoutBatchDim(self, self_bdim);
std::vector<int64_t> dims;
ReductionCase reduction_case;
ReductionCase reduction_case{};
if (arguments[dim_arg_pos].isIntList()) {
reduction_case = ReductionCase::DimArray;
dims = arguments[dim_arg_pos].toIntList().vec();


@ -11,7 +11,6 @@
#include <ATen/native/TensorAdvancedIndexing.h>
#include <ATen/native/IndexKernel.h>
#include <ATen/native/IndexingUtils.h>
#include <iostream>
#include <torch/library.h>
@ -810,7 +809,7 @@ Tensor get_expanded_index(const Tensor& index, IntArrayRef self_size, int64_t di
if (index.dim() == 0) {
return index.expand(self_size);
}
dim = maybe_wrap_dim(dim, self_size.size());
dim = maybe_wrap_dim(dim, static_cast<int64_t>(self_size.size()));
// setup new_index_shape as [BS, 1, ..., idx_size, ..., 1]
// to reshape index_


@ -5,7 +5,6 @@
// LICENSE file in the root directory of this source tree.
#include <ATen/functorch/BatchRulesHelper.h>
#include <iostream>
#include <utility>
#include <ATen/Operators.h>
@ -202,7 +201,7 @@ std::tuple<Tensor, optional<int64_t>> squeeze_batch_rule(const Tensor& self, opt
int64_t new_batch_idx = 0;
int64_t original_idx = 0;
for (auto it : shape) {
for (const auto& it : shape) {
// Keep only dimensions != 1 and the batch dimension (irrespective of size).
if (it != 1 || original_idx == bdim) {
squeezed_sizes.push_back(it);
@ -452,7 +451,7 @@ std::tuple<Tensor, optional<int64_t>> expand_batch_rule(
auto self_ = moveBatchDimToFront(self, self_bdim);
auto self_sizes = self_.sym_sizes();
auto batch_size = self_sizes[0];
const auto& batch_size = self_sizes[0];
c10::SmallVector<c10::SymInt> size_(size.size() + 1);
size_[0] = batch_size;


@ -159,7 +159,7 @@ static void batchedTensorInplaceForLoopFallback(const c10::OperatorHandle& op, t
"please file a bug report instead.");
}
batched_tensor_inputs.push_back(tensor);
batched_tensor_inputs_position.push_back(idx);
batched_tensor_inputs_position.push_back(static_cast<int64_t>(idx));
}
TORCH_INTERNAL_ASSERT(!batched_tensor_inputs.empty());
@ -304,7 +304,7 @@ void batchedTensorForLoopFallback(const c10::OperatorHandle& op, torch::jit::Sta
continue;
}
batched_tensor_inputs.push_back(tensor);
batched_tensor_inputs_position.push_back(idx);
batched_tensor_inputs_position.push_back(static_cast<int64_t>(idx));
}
TORCH_INTERNAL_ASSERT(!batched_tensor_inputs.empty());
@ -445,18 +445,18 @@ void batchedNestedTensorForLoopFallback(const c10::OperatorHandle& op, torch::ji
continue;
}
batched_tensor_inputs.push_back(tensor);
batched_tensor_inputs_position.push_back(idx);
batched_tensor_inputs_position.push_back(static_cast<int64_t>(idx));
}
TORCH_INTERNAL_ASSERT(!batched_tensor_inputs.empty());
std::vector<std::vector<Tensor>> unbound;
for (auto iter = batched_tensor_inputs.begin(); iter != batched_tensor_inputs.end(); ++iter) {
auto *batched_impl = maybeGetBatchedImpl(*iter);
for (auto const &batched_tensor_input: batched_tensor_inputs) {
auto *batched_impl = maybeGetBatchedImpl(batched_tensor_input);
TORCH_INTERNAL_ASSERT(batched_impl->value().is_nested() || batched_impl->bdim() == 0,
"Fallback not supported for mixed nested / non-nested arguments without bdim=0");
c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::BatchedNestedTensor);
auto this_unbound = batched_impl->value().unbind();
if (unbound.size() > 0) {
if (!unbound.empty()) {
TORCH_INTERNAL_ASSERT(unbound.front().size() == this_unbound.size(),
"Fallback not supported for differently-sized nested arguments");
}


@ -70,7 +70,7 @@ void BatchedTensorImpl::refreshTensorMetadata() {
int64_t BatchedTensorImpl::actualDim(int64_t dim, bool wrap_dim) const {
if (wrap_dim) {
const auto ndim = sizes_and_strides_.size();
dim = maybe_wrap_dim(dim, ndim);
dim = maybe_wrap_dim(dim, static_cast<int64_t>(ndim));
}
if (bdim_ <= dim) {
return dim + 1;
@ -160,6 +160,7 @@ c10::intrusive_ptr<TensorImpl> BatchedTensorImpl::shallow_copy_and_detach(
}
c10::intrusive_ptr<TensorImpl> BatchedTensorImpl::shallow_copy_and_detach(
// NOLINTNEXTLINE(cppcoreguidelines-rvalue-reference-param-not-moved)
c10::VariableVersion&& version_counter,
bool allow_tensor_metadata_change) const {
TORCH_CHECK(false, "accessing `data` under vmap transform is not allowed");


@ -7,7 +7,6 @@
#pragma once
#include <bitset>
#include <utility>
#include <ATen/ArrayRef.h>
#include <ATen/SmallVector.h>
@ -119,15 +118,15 @@ inline bool isBatchedTensor(const Tensor& tensor) {
// It is unsafe to call this on a Tensor that is not backed by a
// BatchedTensorImpl. Please use `maybeGetBatchedImpl` whenever possible.
inline BatchedTensorImpl* unsafeGetBatchedImpl(Tensor tensor) {
inline BatchedTensorImpl* unsafeGetBatchedImpl(const Tensor& tensor) {
return static_cast<BatchedTensorImpl*>(tensor.unsafeGetTensorImpl());
}
inline BatchedTensorImpl* maybeGetBatchedImpl(Tensor tensor) {
inline BatchedTensorImpl* maybeGetBatchedImpl(const Tensor& tensor) {
if (!isBatchedTensor(tensor)) {
return nullptr;
}
return unsafeGetBatchedImpl(std::move(tensor));
return unsafeGetBatchedImpl(tensor);
}
// Returns a bitset. If bit i is set, then that means dim i is a batchdim.
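With maybeGetBatchedImpl now taking const Tensor& (probing no longer copies or moves the tensor), the usual call pattern is a null-checked pointer, as in the fallback loops elsewhere in this compare. A small sketch; tensor is a hypothetical at::Tensor:
if (auto* batched = at::functorch::maybeGetBatchedImpl(tensor)) {
  const auto& value = batched->value();   // underlying, unbatched tensor
  const auto bdim = batched->bdim();      // which dim carries the vmap batch
  // ... apply a batching rule using value and bdim ...
}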


@ -234,7 +234,7 @@ int64_t pushDynamicLayer(DynamicLayer&& dynamic_layer) {
auto& dynamicLayerStack = dynamicLayerStackAccessor();
int64_t layerId = 1 + dynamicLayerStack.size();
TORCH_INTERNAL_ASSERT(layerId == dynamic_layer.layerId());
dynamicLayerStack.emplace_back(dynamic_layer);
dynamicLayerStack.emplace_back(std::move(dynamic_layer));
if (layerId == 1) {
setDynamicLayerFrontBackKeysIncluded(true);
@ -257,7 +257,7 @@ int64_t initAndPushDynamicLayer(
optional<bool> functionalize_add_back_views) {
const auto& dynamicLayerStack = dynamicLayerStackAccessor();
const auto layerId = 1 + dynamicLayerStack.size();
DynamicLayer new_layer(transform_type, layerId, batch_size, randomness, prev_grad_mode, prev_fwd_grad_mode, functionalize_add_back_views);
DynamicLayer new_layer(transform_type, layerId, std::move(batch_size), randomness, prev_grad_mode, prev_fwd_grad_mode, functionalize_add_back_views);
// NB: this function should be called while holding the GIL to avoid races
new_layer.interpreter().set_is_alive(true);
pushDynamicLayer(std::move(new_layer));
@ -306,7 +306,7 @@ void foreachTensorInplace(std::vector<IValue>& args, int64_t begin, int64_t end,
}
void foreachTensorInplaceWithFlag(std::vector<IValue>& args, int64_t begin, int64_t end,
const std::bitset<64> use_flag_relative, std::function<Tensor(const Tensor&, bool)> func){
const std::bitset<64> use_flag_relative, const std::function<Tensor(const Tensor&, bool)>& func){
TORCH_INTERNAL_ASSERT(begin >= 0);
TORCH_INTERNAL_ASSERT(end >= 0);
TORCH_INTERNAL_ASSERT(begin <= end);


@ -6,8 +6,6 @@
#include <ATen/functorch/ADInterpreters.h>
#include <ATen/functorch/DynamicLayer.h>
#include <utility>
namespace at::functorch {
static DispatchKeySet get_all_dynlayer_keyset() {
@ -92,12 +90,12 @@ std::ostream& operator<<(std::ostream& os, const TransformType& t) {
void sanityCheckStack(const c10::OperatorHandle& op, torch::jit::Stack* stack) {
auto num_args = op.schema().arguments().size();
foreachTensorInplace(*stack, stack->size() - num_args, stack->size(),
foreachTensorInplace(*stack, static_cast<int64_t>(stack->size() - num_args), static_cast<int64_t>(stack->size()),
[](const Tensor& tensor) {
auto result = unwrapIfDead(tensor);
auto* wrapper = maybeGetTensorWrapper(result);
TORCH_INTERNAL_ASSERT(wrapper == nullptr);
auto* batched = maybeGetBatchedImpl(std::move(result));
auto* batched = maybeGetBatchedImpl(result);
TORCH_INTERNAL_ASSERT(batched == nullptr);
return tensor;
});


@ -5,6 +5,7 @@
#include <c10/core/impl/LocalDispatchKeySet.h>
#include <c10/util/Optional.h>
#include <bitset>
#include <utility>
#include <variant>
namespace at::functorch {
@ -144,7 +145,7 @@ struct Interpreter {
void saveLocalDispatchKeySet(c10::impl::LocalDispatchKeySet keyset) {
TORCH_INTERNAL_ASSERT(!savedLocalDispatchKeySet_.has_value());
savedLocalDispatchKeySet_ = std::move(keyset);
savedLocalDispatchKeySet_ = keyset;
}
void clearSavedLocalDispatchKeySet() {
TORCH_INTERNAL_ASSERT(savedLocalDispatchKeySet_.has_value());
@ -173,11 +174,11 @@ struct Interpreter {
private:
explicit Interpreter(TransformType type, int64_t level, InterpreterMeta meta):
type_(type), level_(level), is_alive_(std::make_shared<bool>(false)), meta_(meta) {}
type_(type), level_(level), is_alive_(std::make_shared<bool>(false)), meta_(std::move(meta)) {}
// fields
TransformType type_;
int64_t level_;
TransformType type_{};
int64_t level_{};
optional<c10::impl::LocalDispatchKeySet> savedLocalDispatchKeySet_;
std::shared_ptr<bool> is_alive_;
InterpreterMeta meta_;
@ -195,7 +196,7 @@ void foreachTensorInplace(std::vector<IValue>& args, int64_t begin, int64_t end,
// args[i] = func(args[i], i - begin, true)
// args[i] = func(args[i], i - begin)
void foreachTensorInplaceWithFlag(std::vector<IValue>& args, int64_t begin, int64_t end,
const std::bitset<64> use_flag_relative, std::function<Tensor(const Tensor&, bool)> func);
const std::bitset<64> use_flag_relative, const std::function<Tensor(const Tensor&, bool)>& func);
std::vector<int64_t> findUnwrappedInputs(std::vector<IValue>& args, int64_t begin, int64_t end);


@ -286,7 +286,7 @@ std::vector<Tensor> unbind_batching_rule(const Tensor& self, int64_t dim) {
// can be indexed (or nullopt if such a location doesn't exist, e.g., tensors
// with zero-size dims).
static optional<c10::SymInt> maximum_indexable_location(
c10::SymIntArrayRef sizes, c10::SymIntArrayRef strides, c10::SymInt storage_offset) {
c10::SymIntArrayRef sizes, c10::SymIntArrayRef strides, const c10::SymInt& storage_offset) {
auto result = native::storage_size_for(sizes, strides);
if (result == 0) {
return nullopt;
@ -303,7 +303,7 @@ static void checkBasicAsStridedValidForSlice(
int64_t num_batch_dims,
c10::SymIntArrayRef sizes,
c10::SymIntArrayRef strides,
optional<c10::SymInt> maybe_storage_offset) {
const optional<c10::SymInt>& maybe_storage_offset) {
auto slice_sizes = physical_tensor.sym_sizes().slice(num_batch_dims);
auto slice_strides = physical_tensor.sym_strides().slice(num_batch_dims);
auto base_offset = physical_tensor.sym_storage_offset();
@ -693,17 +693,17 @@ Tensor new_empty_strided_batching_rule(
}
Tensor nested_cat_batching_rule(const ITensorListRef& tensors, int64_t dim) {
TORCH_CHECK(tensors.size() > 0, "cat() not supported on empty tensor list");
TORCH_CHECK(!tensors.empty(), "cat() not supported on empty tensor list");
std::vector<std::vector<Tensor>> unbound;
for (auto tensor_iter = tensors.begin(); tensor_iter != tensors.end(); ++tensor_iter) {
auto* maybe_batched_impl = maybeGetBatchedImpl(*tensor_iter);
for (const auto & tensor : tensors) {
auto* maybe_batched_impl = maybeGetBatchedImpl(tensor);
TORCH_CHECK(maybe_batched_impl, "Tried to run batching rule for cat() on a non-batched tensor");
auto nt = maybe_batched_impl->value();
TORCH_CHECK(nt.is_nested(), "Tried to run batching rule for cat() on a non-nested tensor");
c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::BatchedNestedTensor);
auto this_unbound = nt.unbind();
if (unbound.size() > 0) {
if (!unbound.empty()) {
TORCH_INTERNAL_ASSERT(unbound.front().size() == this_unbound.size(),
"cat() not supported for differently-sized nested arguments");
}


@ -135,7 +135,7 @@ MultiBatchVmapTransform::logicalToPhysical(ITensorListRef logical_tensors) {
TORCH_INTERNAL_ASSERT(bdim_size != -1);
std::bitset<kVmapNumLevels> levels;
levels[cur_level] = 1;
levels[cur_level] = true;
VmapPhysicalViewVec result;
for (const auto& logical_tensor : logical_tensors) {
@ -184,7 +184,7 @@ VmapPhysicalViewVec BroadcastingVmapTransform::logicalToPhysical(TensorList logi
TORCH_INTERNAL_ASSERT(bdim_size != -1);
std::bitset<kVmapNumLevels> levels;
levels[cur_level] = 1;
levels[cur_level] = true;
// figure out the example ndim
int64_t max_example_dim = -1;


@ -120,7 +120,7 @@ struct VmapPhysicalToLogicalMap;
// levels: 012345
struct TORCH_API VmapPhysicalView {
VmapPhysicalView(Tensor&& tensor, std::bitset<kVmapNumLevels> levels)
: levels_(levels), tensor_(tensor) {
: levels_(levels), tensor_(std::move(tensor)) {
// TORCH_INTERNAL_ASSERT(!isBatchedTensor(tensor));
}


@ -167,7 +167,7 @@ namespace dropout_hack {
namespace {
template<bool inplace>
using Ctype = typename std::conditional<inplace, Tensor&, Tensor>::type;
using Ctype = std::conditional_t<inplace, Tensor&, Tensor>;
static Tensor make_feature_noise(const Tensor& input) {
auto input_sizes = input.sizes();


@ -50,7 +50,7 @@ void TensorWrapper::refreshMetadata() {
void dumpTensorCout(const Tensor& tensor) {
dumpTensor(std::cout, tensor);
std::cout << std::endl;
std::cout << '\n';
}
static c10::intrusive_ptr<TensorWrapper> makeTensorWrapperPtr(const Tensor& tensor, int64_t level, const std::shared_ptr<bool>& life_handle) {


@ -649,8 +649,8 @@ void apply_ormqr(const Tensor& input, const Tensor& tau, const Tensor& other, bo
char side = left ? 'L' : 'R';
char trans = transpose ? (input.is_complex() ? 'C' : 'T') : 'N';
auto input_data = input.data_ptr<scalar_t>();
auto tau_data = tau.data_ptr<scalar_t>();
auto input_data = input.const_data_ptr<scalar_t>();
auto tau_data = tau.const_data_ptr<scalar_t>();
auto other_data = other.data_ptr<scalar_t>();
auto input_matrix_stride = matrixStride(input);
@ -670,21 +670,21 @@ void apply_ormqr(const Tensor& input, const Tensor& tau, const Tensor& other, bo
// Query for the optimal size of the workspace tensor
int lwork = -1;
scalar_t wkopt;
lapackOrmqr<scalar_t>(side, trans, m, n, k, input_data, lda, tau_data, other_data, ldc, &wkopt, lwork, &info);
lapackOrmqr<scalar_t>(side, trans, m, n, k, const_cast<scalar_t*>(input_data), lda, const_cast<scalar_t*>(tau_data), other_data, ldc, &wkopt, lwork, &info);
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(info == 0);
lwork = std::max<int>(1, real_impl<scalar_t, value_t>(wkopt));
Tensor work = at::empty({lwork}, input.options());
for (const auto i : c10::irange(batch_size)) {
scalar_t* input_working_ptr = &input_data[i * input_matrix_stride];
const scalar_t* input_working_ptr = &input_data[i * input_matrix_stride];
scalar_t* other_working_ptr = &other_data[i * other_matrix_stride];
scalar_t* tau_working_ptr = &tau_data[i * tau_stride];
const scalar_t* tau_working_ptr = &tau_data[i * tau_stride];
// now compute the actual result
lapackOrmqr<scalar_t>(
side, trans, m, n, k,
input_working_ptr, lda,
tau_working_ptr,
const_cast<scalar_t*>(input_working_ptr), lda,
const_cast<scalar_t*>(tau_working_ptr),
other_working_ptr, ldc,
work.data_ptr<scalar_t>(), lwork, &info);


@ -2,6 +2,7 @@
#include <ATen/Context.h>
#include <ATen/Config.h>
#include <ATen/OpMathType.h>
#include <ATen/Parallel.h>
#include <c10/core/ScalarType.h>
#include <c10/util/Exception.h>
#include <c10/util/complex.h>
@ -210,34 +211,39 @@ static inline float16_t reduce(float16x4_t x) {
auto sum = vpadd_f16(x, x);
return vget_lane_f16(vpadd_f16(sum, sum), 0);
}
static inline float16_t reduce(float16x8_t x) {
return reduce(vadd_f16(vget_low_f16(x), vget_high_f16(x)));
}
static void fp16_gemv_trans_fp16_arith(const int m, const int n, const float16_t* a, const int lda, const float16_t *x, float16_t* y, int incy) {
for (auto i = 0 ; i < n; i += 4) {
float16x4_t sum0Vec = vdup_n_f16(0);
float16x4_t sum1Vec = vdup_n_f16(0);
float16x4_t sum2Vec = vdup_n_f16(0);
float16x4_t sum3Vec = vdup_n_f16(0);
const auto row0 = a + lda * (i + 0);
const auto row1 = a + lda * (i + 1);
const auto row2 = a + lda * (i + 2);
const auto row3 = a + lda * (i + 3);
for (auto j = 0; j < m; j += 4) {
float16x4_t a0Vec = vld1_f16(row0 + j);
float16x4_t a1Vec = vld1_f16(row1 + j);
float16x4_t a2Vec = vld1_f16(row2 + j);
float16x4_t a3Vec = vld1_f16(row3 + j);
float16x4_t xVec = vld1_f16(x + j);
sum0Vec = vadd_f16(sum0Vec, vmul_f16(a0Vec, xVec));
sum1Vec = vadd_f16(sum1Vec, vmul_f16(a1Vec, xVec));
sum2Vec = vadd_f16(sum2Vec, vmul_f16(a2Vec, xVec));
sum3Vec = vadd_f16(sum3Vec, vmul_f16(a3Vec, xVec));
parallel_for(0, n / 4, 1, [&](int begin, int end) {
for (auto i = begin * 4 ; i < end * 4; i += 4) {
float16x8_t sum0Vec = vdupq_n_f16(0);
float16x8_t sum1Vec = vdupq_n_f16(0);
float16x8_t sum2Vec = vdupq_n_f16(0);
float16x8_t sum3Vec = vdupq_n_f16(0);
const auto row0 = a + lda * (i + 0);
const auto row1 = a + lda * (i + 1);
const auto row2 = a + lda * (i + 2);
const auto row3 = a + lda * (i + 3);
for (auto j = 0; j < m; j += 8) {
float16x8_t xVec = vld1q_f16(x + j);
float16x8_t a0Vec = vld1q_f16(row0 + j);
sum0Vec = vaddq_f16(sum0Vec, vmulq_f16(a0Vec, xVec));
float16x8_t a1Vec = vld1q_f16(row1 + j);
sum1Vec = vaddq_f16(sum1Vec, vmulq_f16(a1Vec, xVec));
float16x8_t a2Vec = vld1q_f16(row2 + j);
sum2Vec = vaddq_f16(sum2Vec, vmulq_f16(a2Vec, xVec));
float16x8_t a3Vec = vld1q_f16(row3 + j);
sum3Vec = vaddq_f16(sum3Vec, vmulq_f16(a3Vec, xVec));
}
y[(i + 0) * incy] = reduce(sum0Vec);
y[(i + 1) * incy] = reduce(sum1Vec);
y[(i + 2) * incy] = reduce(sum2Vec);
y[(i + 3) * incy] = reduce(sum3Vec);
}
y[(i + 0) * incy] = reduce(sum0Vec);
y[(i + 1) * incy] = reduce(sum1Vec);
y[(i + 2) * incy] = reduce(sum2Vec);
y[(i + 3) * incy] = reduce(sum3Vec);
}
});
}
#endif
@ -247,31 +253,33 @@ static inline float reduce(float32x4_t x) {
}
static void fp16_gemv_trans_fp32_arith(const int m, const int n, const float16_t* a, const int lda, const float16_t *x, float16_t* y, int incy) {
for (auto i = 0 ; i < n; i += 4) {
float32x4_t sum0Vec = vdupq_n_f32(0);
float32x4_t sum1Vec = vdupq_n_f32(0);
float32x4_t sum2Vec = vdupq_n_f32(0);
float32x4_t sum3Vec = vdupq_n_f32(0);
const auto row0 = a + lda * (i + 0);
const auto row1 = a + lda * (i + 1);
const auto row2 = a + lda * (i + 2);
const auto row3 = a + lda * (i + 3);
for (auto j = 0; j < m; j += 4) {
float32x4_t a0Vec = vcvt_f32_f16(vld1_f16(row0 + j));
float32x4_t a1Vec = vcvt_f32_f16(vld1_f16(row1 + j));
float32x4_t a2Vec = vcvt_f32_f16(vld1_f16(row2 + j));
float32x4_t a3Vec = vcvt_f32_f16(vld1_f16(row3 + j));
float32x4_t xVec = vcvt_f32_f16(vld1_f16(x + j));
sum0Vec = vaddq_f32(sum0Vec, vmulq_f32(a0Vec, xVec));
sum1Vec = vaddq_f32(sum1Vec, vmulq_f32(a1Vec, xVec));
sum2Vec = vaddq_f32(sum2Vec, vmulq_f32(a2Vec, xVec));
sum3Vec = vaddq_f32(sum3Vec, vmulq_f32(a3Vec, xVec));
parallel_for(0, n / 4, 1, [&](int begin, int end) {
for (auto i = begin * 4 ; i < end * 4; i += 4) {
float32x4_t sum0Vec = vdupq_n_f32(0);
float32x4_t sum1Vec = vdupq_n_f32(0);
float32x4_t sum2Vec = vdupq_n_f32(0);
float32x4_t sum3Vec = vdupq_n_f32(0);
const auto row0 = a + lda * (i + 0);
const auto row1 = a + lda * (i + 1);
const auto row2 = a + lda * (i + 2);
const auto row3 = a + lda * (i + 3);
for (auto j = 0; j < m; j += 4) {
float32x4_t xVec = vcvt_f32_f16(vld1_f16(x + j));
float32x4_t a0Vec = vcvt_f32_f16(vld1_f16(row0 + j));
sum0Vec = vaddq_f32(sum0Vec, vmulq_f32(a0Vec, xVec));
float32x4_t a1Vec = vcvt_f32_f16(vld1_f16(row1 + j));
sum1Vec = vaddq_f32(sum1Vec, vmulq_f32(a1Vec, xVec));
float32x4_t a2Vec = vcvt_f32_f16(vld1_f16(row2 + j));
sum2Vec = vaddq_f32(sum2Vec, vmulq_f32(a2Vec, xVec));
float32x4_t a3Vec = vcvt_f32_f16(vld1_f16(row3 + j));
sum3Vec = vaddq_f32(sum3Vec, vmulq_f32(a3Vec, xVec));
}
y[(i + 0) * incy] = reduce(sum0Vec);
y[(i + 1) * incy] = reduce(sum1Vec);
y[(i + 2) * incy] = reduce(sum2Vec);
y[(i + 3) * incy] = reduce(sum3Vec);
}
y[(i + 0) * incy] = reduce(sum0Vec);
y[(i + 1) * incy] = reduce(sum1Vec);
y[(i + 2) * incy] = reduce(sum2Vec);
y[(i + 3) * incy] = reduce(sum3Vec);
}
});
}
void fp16_gemv_trans(
@ -287,8 +295,8 @@ void fp16_gemv_trans(
const int incy) {
if (incx == 1 && alpha == 1.0 && beta == 0.0 && m % 4 == 0 && n % 4 == 0) {
#ifdef __ARM_FEATURE_FP16_SCALAR_ARITHMETIC
return at::globalContext().allowFP16ReductionCPU() ? fp16_gemv_trans_fp16_arith(m, n, a, lda, x, y, incy)
: fp16_gemv_trans_fp32_arith(m, n, a, lda, x, y, incy);
return at::globalContext().allowFP16ReductionCPU() && m % 8 == 0 ? fp16_gemv_trans_fp16_arith(m, n, a, lda, x, y, incy)
: fp16_gemv_trans_fp32_arith(m, n, a, lda, x, y, incy);
#else
return fp16_gemv_trans_fp32_arith(m, n, a, lda, x, y, incy);
#endif
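The gemv changes above swap the serial loop over output columns for at::parallel_for over blocks of four columns, widen the fp16 path to float16x8 accumulators (hence the new m % 8 == 0 requirement in the dispatch), and load the x vector before the row vectors. A sketch of the parallel_for contract being relied on, with the arithmetic elided; the signature is assumed from ATen/Parallel.h:
at::parallel_for(0, n / 4, /*grain_size=*/1, [&](int64_t begin, int64_t end) {
  // Each task owns the half-open block range [begin, end); a block is 4 output columns.
  for (auto block = begin; block < end; ++block) {
    const auto i = block * 4;   // first of the 4 columns handled in this iteration
    // ... accumulate four dot products and write y[(i + 0..3) * incy] ...
  }
});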


@ -92,9 +92,9 @@ void searchsorted_cpu_contiguous(Tensor& result, const Tensor& input, const Tens
int64_t idim_in = is_scalar_input ? 1 : input.sizes().back();
int64_t idim_bd = boundaries.sizes().back();
const input_t *data_in = input.data_ptr<input_t>();
const input_t *data_bd = boundaries.data_ptr<input_t>();
const int64_t *data_st = sorter.defined() ? sorter.data_ptr<int64_t>() : nullptr;
const input_t *data_in = input.const_data_ptr<input_t>();
const input_t *data_bd = boundaries.const_data_ptr<input_t>();
const int64_t *data_st = sorter.defined() ? sorter.const_data_ptr<int64_t>() : nullptr;
output_t *data_out = result.data_ptr<output_t>();
bool is_1d_boundaries = boundaries.dim() == 1;
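This hunk is representative of the const-correctness theme running through many files in this compare: buffers that are only read are fetched through const_data_ptr or const accessors, while outputs keep data_ptr / mutable_data_ptr. A minimal sketch of the distinction; t (2-D) and result are hypothetical tensors:
const float* in = t.const_data_ptr<float>();      // read-only pointer into the storage
float* out = result.mutable_data_ptr<float>();    // mutable pointer, used for outputs
auto in_acc = t.accessor<const float, 2>();       // const CPU accessor, as in the conv hunks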


@ -61,7 +61,7 @@ static Tensor compute_columns2d(
kernel_height * kernel_width * n_input_plane : output_height * output_width;
columns = at::empty({batch_size, row, col}, input.options());
AT_DISPATCH_ALL_TYPES_AND2(kBFloat16, kHalf, input.scalar_type(), "slow_conv2d_cpu", [&]{
auto input_a = input.accessor<scalar_t, 4>();
auto input_a = input.accessor<const scalar_t, 4>();
auto columns_a = columns.accessor<scalar_t, 3>();
at::parallel_for(0, batch_size, 0, [&](int64_t start, int64_t end) {
@ -220,9 +220,9 @@ static inline Tensor view_weight_2d(const Tensor& weight_,
template <typename scalar_t>
static void slow_conv2d_update_output_frame(
TensorAccessor<scalar_t, 3> input,
TensorAccessor<const scalar_t, 3> input,
TensorAccessor<scalar_t, 3> output,
TensorAccessor<scalar_t, 2> weight,
TensorAccessor<const scalar_t, 2> weight,
bool has_bias,
TensorAccessor<scalar_t, 2> finput,
int64_t kernel_height,
@ -588,10 +588,10 @@ Tensor& slow_conv2d_forward_out_cpu(
TORCH_CHECK(output.is_contiguous(memory_format), "slow_conv2d output tensor must be contiguous");
AT_DISPATCH_ALL_TYPES_AND2(kBFloat16, kHalf, input.scalar_type(), "slow_conv2d_cpu", [&]{
auto input_a = input.accessor<scalar_t, 4>();
auto input_a = input.accessor<const scalar_t, 4>();
auto output_a = output.accessor<scalar_t, 4>();
auto finput_a = finput.accessor<scalar_t, 3>();
auto weight_2d_a = weight_2d.accessor<scalar_t, 2>();
auto weight_2d_a = weight_2d.accessor<const scalar_t, 2>();
at::parallel_for(0, batch_size, 0, [&](int64_t start, int64_t end) {
for (const auto t : c10::irange(start, end)) {


@ -72,7 +72,7 @@ static Tensor compute_columns3d(
input.options());
AT_DISPATCH_ALL_TYPES_AND2(kBFloat16, kHalf, input.scalar_type(), "compute_columns3d", [&] {
auto input_a = input.accessor<scalar_t, 5>();
auto input_a = input.accessor<const scalar_t, 5>();
auto columns_a = columns.accessor<scalar_t, 3>();
at::parallel_for(0, batch_size, CONV3D_GRAIN_SALT, [&](int64_t start, int64_t end) {
@ -261,11 +261,11 @@ static Tensor view_weight_2d(const Tensor& weight_) {
template <typename scalar_t>
static void slow_conv3d_update_output_frame(
TensorAccessor<scalar_t, 4> input,
TensorAccessor<const scalar_t, 4> input,
TensorAccessor<scalar_t, 4> output,
TensorAccessor<scalar_t, 2> weight,
TensorAccessor<const scalar_t, 2> weight,
bool has_bias,
TensorAccessor<scalar_t, 2> finput,
TensorAccessor<const scalar_t, 2> finput,
int64_t kernel_depth,
int64_t kernel_height,
int64_t kernel_width,
@ -623,10 +623,10 @@ Tensor& slow_conv3d_forward_out_cpu(const Tensor& self,
TORCH_CHECK(output.is_contiguous(), "slow_conv3d output must be contiguous");
AT_DISPATCH_ALL_TYPES_AND2(kBFloat16, kHalf, input.scalar_type(), "slow_conv3d_cpu", [&] {
auto input_a = input.accessor<scalar_t, 5>();
auto input_a = input.accessor<const scalar_t, 5>();
auto output_a = output.accessor<scalar_t, 5>();
auto finput_a = finput.accessor<scalar_t, 3>();
auto weight_2d_a = weight_2d.accessor<scalar_t, 2>();
auto finput_a = finput.accessor<const scalar_t, 3>();
auto weight_2d_a = weight_2d.accessor<const scalar_t, 2>();
at::parallel_for(
0, batch_size, CONV3D_GRAIN_SALT, [&](int64_t start, int64_t end) {


@ -102,13 +102,12 @@ inline void check_foreach_api_restrictions(
// corresponding tensors (aligning in index across the tensorLists) share the
// same device and dtype.
inline bool _check_tensors_share_device_and_dtype(
ArrayRef<TensorList> tensorLists,
const bool skip_dtype_check = false) {
ArrayRef<TensorList> tensorLists) {
const auto expected_dtype = tensorLists[0][0].dtype();
const auto expected_device = tensorLists[0][0].device();
auto is_tensor_okay = [&](const Tensor& tensor) {
return (skip_dtype_check || tensor.dtype() == expected_dtype) &&
return tensor.dtype() == expected_dtype &&
tensor.device() == expected_device && tensor.layout() == at::kStrided &&
tensor.is_non_overlapping_and_dense();
};


@ -20,9 +20,9 @@ TORCH_META_FUNC(lerp_Tensor)(
" for `weight` but got dtype ", weight.dtype());
build(at::TensorIteratorConfig()
.add_output(maybe_get_output())
.add_input(self)
.add_input(end)
.add_input(weight));
.add_const_input(self)
.add_const_input(end)
.add_const_input(weight));
}
TORCH_META_FUNC(lerp_Scalar)(


@ -52,7 +52,7 @@ void _segment_reduce_lengths_cpu_kernel1(
AT_DISPATCH_FLOATING_TYPES_AND2(
kBFloat16, kHalf, data.scalar_type(), "_segment_reduce_cpu", [&]() {
auto* output_data = output.data_ptr<scalar_t>();
const auto* values_data = data.data_ptr<scalar_t>();
const auto* values_data = data.const_data_ptr<scalar_t>();
for (const auto outer_idx : c10::irange(outer_offset)) {
int64_t segment_start, segment_length;
int64_t segment_end = is_offsets_like ?
@ -145,7 +145,7 @@ Tensor _segment_reduce_lengths_cpu_kernel(
auto output = at::empty(output_shape, data.options());
AT_DISPATCH_INDEX_TYPES(lengths.scalar_type(), "_segment_reduce_lengths_cpu_kernel1", [&]() {
const auto* lengths_data = lengths.data_ptr<index_t>();
const auto* lengths_data = lengths.const_data_ptr<index_t>();
_segment_reduce_lengths_cpu_kernel1(
reduction, data, lengths_data, axis, initial, output, segment_count, lengths_stride_axis);
});
@ -171,7 +171,7 @@ Tensor _segment_reduce_offsets_cpu_kernel(
auto output = at::empty(output_shape, data.options());
AT_DISPATCH_INDEX_TYPES(offsets.scalar_type(), "_segment_reduce_offsets_cpu_kernel1", [&]() {
const auto* offsets_data = offsets.data_ptr<index_t>();
const auto* offsets_data = offsets.const_data_ptr<index_t>();
_segment_reduce_lengths_cpu_kernel1<index_t, /*is_offsets_like=*/true>(
reduction, data, offsets_data, axis, initial, output, segment_count, offsets_stride_axis);
});
@ -214,7 +214,7 @@ void _segment_reduce_cpu_lengths_backward_kernel1(
auto* output_data = output_contig.data_ptr<scalar_t>();
auto* grad_data = grad_contig.data_ptr<scalar_t>();
auto* grad_input_data = grad_input.mutable_data_ptr<scalar_t>();
const auto* values_data = data_contig.data_ptr<scalar_t>();
const auto* values_data = data_contig.const_data_ptr<scalar_t>();
// Used to calculate exclusive prod
scalar_t initial_prod_value;
if (reduction == ReductionType::PROD) {
@ -331,7 +331,7 @@ Tensor _segment_reduce_cpu_lengths_backward_kernel(
AT_DISPATCH_INDEX_TYPES(
lengths_contig.scalar_type(), "_segment_reduce_cpu_lengths_backward_kernel1", [&] {
const auto* lengths_data = lengths_contig.data_ptr<index_t>();
const auto* lengths_data = lengths_contig.const_data_ptr<index_t>();
_segment_reduce_cpu_lengths_backward_kernel1(
grad_contig,
output_contig,
@ -364,7 +364,7 @@ Tensor _segment_reduce_cpu_offsets_backward_kernel(
AT_DISPATCH_INDEX_TYPES(
offsets_contig.scalar_type(), "_segment_reduce_cpu_offsets_backward_kernel1", [&] {
const auto* offsets_data = offsets_contig.data_ptr<index_t>();
const auto* offsets_data = offsets_contig.const_data_ptr<index_t>();
_segment_reduce_cpu_lengths_backward_kernel1<index_t, /*is_offsets_like=*/true>(
grad_contig,
output_contig,


@ -1,5 +1,6 @@
#define TORCH_ASSERT_NO_OPERATORS
#include <ATen/Dispatch.h>
#include <ATen/Parallel.h>
#include <ATen/native/CPUBlas.h>
#include <ATen/native/cpu/zmath.h>
#include <c10/util/irange.h>
@ -337,20 +338,22 @@ void gemm_transa_(
at::native::blas_impl::fp16_gemv_trans(k, m, alpha, reinterpret_cast<const float16_t*>(a), lda, reinterpret_cast<const float16_t*>(b), 1, beta, reinterpret_cast<float16_t*>(c), 1);
return;
}
const auto *a_ = a;
for (const auto i : c10::irange(m)) {
const auto *b_ = b;
for (const auto j : c10::irange(n)) {
const auto dot = compute_dot(reinterpret_cast<const float16_t*>(a_), reinterpret_cast<const float16_t*>(b_), k);
b_ += ldb;
if (beta == 0) {
c[j*ldc+i] = alpha*dot;
} else {
c[j*ldc+i] = beta*c[j*ldc+i]+alpha*dot;
parallel_for(0, m, 1, [&](int64_t begin, int64_t end) {
const auto *a_ = a + begin * lda;
for (const auto i : c10::irange(begin, end)) {
const auto *b_ = b;
for (const auto j : c10::irange(n)) {
const auto dot = compute_dot(reinterpret_cast<const float16_t*>(a_), reinterpret_cast<const float16_t*>(b_), k);
b_ += ldb;
if (beta == 0) {
c[j*ldc+i] = alpha*dot;
} else {
c[j*ldc+i] = beta*c[j*ldc+i]+alpha*dot;
}
}
a_ += lda;
}
a_ += lda;
}
});
}
#endif


@ -292,9 +292,9 @@ Tensor _convolution_depthwise3x3_winograd(
bias_potentially_undefined :
at::zeros({kernel_sizes[0]}, input.options());
auto input_data = input.data_ptr<float>();
auto kernel_data = kernel.data_ptr<float>();
auto bias_data = bias.data_ptr<float>();
auto input_data = input.const_data_ptr<float>();
auto kernel_data = kernel.const_data_ptr<float>();
auto bias_data = bias.const_data_ptr<float>();
auto output_data = output.data_ptr<float>();
at::parallel_for(0, args.batch * args.out_channels, 0, [&](int64_t start, int64_t end) {


@ -321,7 +321,7 @@ void bernoulli_kernel(const TensorBase &self, const TensorBase &p_, RNG generato
auto p = expand_inplace(self, p_cpu);
auto iter = TensorIteratorConfig()
.add_output(self)
.add_input(*p)
.add_const_input(*p)
.check_all_same_dtype(false)
.build();
if (p->scalar_type() == kDouble) {


@ -98,14 +98,14 @@ void histogramdd_cpu_contiguous(Tensor& hist, const TensorList& bin_edges,
return;
}
TensorAccessor<input_t, 2> accessor_in = input.accessor<input_t, 2>();
TensorAccessor<const input_t, 2> accessor_in = input.accessor<const input_t, 2>();
/* Constructs a c10::optional<TensorAccessor> containing an accessor iff
* the optional weight tensor has a value.
*/
const auto accessor_wt = weight.has_value()
? c10::optional<TensorAccessor<input_t, 1>>(weight.value().accessor<input_t, 1>())
: c10::optional<TensorAccessor<input_t, 1>>();
? c10::optional<TensorAccessor<const input_t, 1>>(weight.value().accessor<const input_t, 1>())
: c10::optional<TensorAccessor<const input_t, 1>>();
std::vector<input_t*> bin_seq(D);
std::vector<int64_t> num_bin_edges(D);


@ -36,7 +36,7 @@ multinomial_with_replacement_apply(
/* cumulative probability distribution vector */
Tensor cum_dist = at::empty({n_categories}, self.options());
const scalar_t* const self_ptr = self.data_ptr<scalar_t>();
const scalar_t* const self_ptr = self.const_data_ptr<scalar_t>();
scalar_t* const cum_dist_ptr = cum_dist.data_ptr<scalar_t>();
int64_t* const result_ptr = result.data_ptr<int64_t>();


@ -195,7 +195,7 @@ template <typename scalar_t, typename acc_t=typename scalar_value_type<scalar_t>
void norm_kernel_cpu_impl(TensorIterator& iter, const double& val) {
if (val == 0.0) {
binary_kernel_reduce(iter, NormZeroOps<scalar_t, acc_t, out_t>(), acc_t(0));
} else if (val == 0.0) {
} else if (val == 1.0) {
binary_kernel_reduce(iter, NormOneOps<scalar_t, acc_t, out_t>(), acc_t(0));
} else if (val == 2.0) {
binary_kernel_reduce(iter, NormTwoOps<scalar_t, acc_t, out_t>(), acc_t(0));


@ -32,10 +32,10 @@ PackedTensorAccessor32<scalar_t, ndim, PtrTraits> dummy_packed_accessor32() {
template <int kSize, typename scalar_t, typename index_t>
__global__ void conv_depthwise2d_forward_kernel(
const PackedTensorAccessor32<scalar_t, 4, DefaultPtrTraits> input,
const PackedTensorAccessor32<const scalar_t, 4, DefaultPtrTraits> input,
PackedTensorAccessor32<scalar_t, 4, DefaultPtrTraits> output,
const PackedTensorAccessor32<scalar_t, 4, DefaultPtrTraits> weight,
const PackedTensorAccessor32<scalar_t, 1, DefaultPtrTraits> bias,
const PackedTensorAccessor32<const scalar_t, 4, DefaultPtrTraits> weight,
const PackedTensorAccessor32<const scalar_t, 1, DefaultPtrTraits> bias,
bool biasEnabled,
index_t totalElements,
const int outputChannels,
@ -309,12 +309,12 @@ void conv_depthwise2d_forward_out(
// Create PackedTensorAccessor
// Kernel currently relies upon all the Tensors to be contiguous, but we made
// them contiguous above
const auto input_a = input.packed_accessor32<scalar_t, 4>();
const auto weight_a = weight.packed_accessor32<scalar_t, 4>();
const auto input_a = input.packed_accessor32<const scalar_t, 4>();
const auto weight_a = weight.packed_accessor32<const scalar_t, 4>();
const auto output_a = output.packed_accessor32<scalar_t, 4>();
const auto bias_a = has_bias ?
bias.packed_accessor32<scalar_t, 1>() :
dummy_packed_accessor32<scalar_t, 1>();
bias.packed_accessor32<const scalar_t, 1>() :
dummy_packed_accessor32<const scalar_t, 1>();
if (kW == 3 && kH == 3) {
conv_depthwise2d_forward_kernel<3> <<<grid, block, 0, stream>>>(
input_a, output_a, weight_a, bias_a, has_bias, n, outputChannels, depthwiseMultiplier,


@ -26,9 +26,9 @@ template <typename scalar_t, typename accscalar_t,
int kKnownKernelT, int kKnownKernelH, int kKnownKernelW,
int kKnownDilationT, int kKnownDilationH, int kKnownDilationW>
__global__ void conv_depthwise3d_cuda_kernel(
const PackedTensorAccessor32<scalar_t, 5> input,
const PackedTensorAccessor32<const scalar_t, 5> input,
PackedTensorAccessor32<scalar_t, 5> output,
const PackedTensorAccessor32<scalar_t, 5> kernel,
const PackedTensorAccessor32<const scalar_t, 5> kernel,
const scalar_t* bias,
int strideT, int strideH, int strideW,
int paddingT, int paddingH, int paddingW,
@ -361,9 +361,9 @@ void conv_depthwise_shape_check(
conv_depthwise3d_cuda_kernel \
<scalar_t, accscalar_t, (kt), (kh), (kw), (dilt), (dilh), (dilw)> \
<<<grid, block, (smem), at::cuda::getCurrentCUDAStream()>>>( \
input_.packed_accessor32<scalar_t, 5>(), \
input_.packed_accessor32<const scalar_t, 5>(), \
output_.packed_accessor32<scalar_t, 5>(), \
weight_.packed_accessor32<scalar_t, 5>(), \
weight_.packed_accessor32<const scalar_t, 5>(), \
bias_ptr, \
stride[0], stride[1], stride[2], \
padding[0], padding[1], padding[2], \
@ -377,9 +377,9 @@ void conv_depthwise_shape_check(
conv_depthwise3d_cuda_kernel \
<scalar_t,accscalar_t, -1, -1, -1, -1, -1, -1> \
<<<grid, block, (smem), at::cuda::getCurrentCUDAStream()>>>( \
input_.packed_accessor32<scalar_t, 5>(), \
input_.packed_accessor32<const scalar_t, 5>(), \
output_.packed_accessor32<scalar_t, 5>(), \
weight_.packed_accessor32<scalar_t, 5>(), \
weight_.packed_accessor32<const scalar_t, 5>(), \
bias_ptr, \
stride[0], stride[1], stride[2], \
padding[0], padding[1], padding[2], \


@ -618,7 +618,7 @@ void bernoulli_tensor_cuda_kernel(
};
// The template argument `4` below indicates that we want to operate on four
// elements at a time. See NOTE [ CUDA_tensor_applyN helpers ] for details.
at::cuda::CUDA_tensor_apply2<scalar_t, prob_t, 4, decltype(functor),
at::cuda::CUDA_tensor_apply2<scalar_t, const prob_t, 4, decltype(functor),
/*max_threads_per_block=*/512,
/*min_blocks_per_sm==*/2>(ret, p, functor);
}


@ -187,8 +187,8 @@ void masked_scale_kernel(at::Tensor& ret, const at::Tensor& src, const at::Tenso
auto iter = at::TensorIteratorConfig()
.check_all_same_dtype(false)
.add_output(ret)
.add_input(src)
.add_input(mask)
.add_const_input(src)
.add_const_input(mask)
.build();
at::native::gpu_kernel(


@ -4,7 +4,6 @@
#include <ATen/native/cuda/ForeachFunctors.cuh>
#include <ATen/native/cuda/ForeachMinMaxFunctors.cuh>
#include <functional>
#include <type_traits>
#ifndef AT_PER_OPERATOR_HEADERS
#include <ATen/NativeFunctions.h>
@ -251,152 +250,20 @@ FOREACH_BINARY_OP_LIST(
power_functor,
/*division_op*/ true);
template <typename dst_t, typename src_t = dst_t>
struct Copy {
__device__ __forceinline__ dst_t operator()(const src_t& x) {
return static_cast<dst_t>(x);
template <typename T>
struct Identity {
__device__ __forceinline__ T operator()(const T& x) {
return x;
}
};
template <typename dst_t>
struct Copy<dst_t, c10::complex<double>> {
__device__ __forceinline__ dst_t operator()(const c10::complex<double>& x) {
if constexpr (!(std::is_same_v<dst_t, c10::complex<double>> ||
std::is_same_v<dst_t, c10::complex<float>>)) {
return static_cast<dst_t>(x.real());
} else {
return static_cast<dst_t>(x);
}
}
};
template <typename dst_t>
struct Copy<dst_t, c10::complex<float>> {
__device__ __forceinline__ dst_t operator()(const c10::complex<float>& x) {
if constexpr (!(std::is_same_v<dst_t, c10::complex<double>> ||
std::is_same_v<dst_t, c10::complex<float>>)) {
return static_cast<dst_t>(x.real());
} else {
return static_cast<dst_t>(x);
}
}
};
#define AT_DISPATCH_SOURCE_TYPES(TYPE, NAME, ...) \
AT_DISPATCH_SWITCH( \
TYPE, \
NAME, \
AT_PRIVATE_CASE_TYPE_USING_HINT( \
at::ScalarType::Byte, src_t, __VA_ARGS__) \
AT_PRIVATE_CASE_TYPE_USING_HINT( \
at::ScalarType::Char, src_t, __VA_ARGS__) \
AT_PRIVATE_CASE_TYPE_USING_HINT( \
at::ScalarType::Long, src_t, __VA_ARGS__) \
AT_PRIVATE_CASE_TYPE_USING_HINT( \
at::ScalarType::Short, src_t, __VA_ARGS__) \
AT_PRIVATE_CASE_TYPE_USING_HINT( \
at::ScalarType::Double, src_t, __VA_ARGS__) \
AT_PRIVATE_CASE_TYPE_USING_HINT( \
at::ScalarType::Float, src_t, __VA_ARGS__) \
AT_PRIVATE_CASE_TYPE_USING_HINT( \
at::ScalarType::ComplexDouble, \
src_t, \
__VA_ARGS__) \
AT_PRIVATE_CASE_TYPE_USING_HINT( \
at::ScalarType::ComplexFloat, \
src_t, \
__VA_ARGS__) \
AT_PRIVATE_CASE_TYPE_USING_HINT( \
at::ScalarType::Half, \
src_t, \
__VA_ARGS__) \
AT_PRIVATE_CASE_TYPE_USING_HINT( \
at::ScalarType::BFloat16, \
src_t, \
__VA_ARGS__) \
AT_PRIVATE_CASE_TYPE_USING_HINT( \
at::ScalarType::Bool, \
src_t, \
__VA_ARGS__))
namespace {
template <
typename T,
typename src_t,
int depth,
int r_args_depth,
int res_arg_index>
struct CopyFunctor {
static_assert(depth == 2 && r_args_depth == 1 && res_arg_index == 1);
template <typename Op>
__device__ __forceinline__ void operator()(
int chunk_size,
TensorListMetadata<depth>& tl,
Op op) {
const auto tensor_loc = tl.block_to_tensor[blockIdx.x];
const auto chunk_idx = tl.block_to_chunk[blockIdx.x];
auto n = tl.numel_for_tensor[tensor_loc];
src_t* src_ptr = (src_t*)tl.addresses[0][tensor_loc];
src_ptr += chunk_idx * chunk_size;
T* self_ptr = (T*)tl.addresses[1][tensor_loc];
self_ptr += chunk_idx * chunk_size;
const bool all_aligned{is_aligned(src_ptr) && is_aligned(self_ptr)};
n -= chunk_idx * chunk_size;
src_t src_args[kILP];
T r_args[kILP];
// to make things simple, we put aligned case in a different code path
if (n % kILP == 0 && chunk_size % kILP == 0 && all_aligned) {
for (int64_t i_start = threadIdx.x;
i_start * kILP < n && i_start * kILP < chunk_size;
i_start += blockDim.x) {
// load
load_store(src_args, src_ptr, 0, i_start);
#pragma unroll
for (int ii = 0; ii < kILP; ii++) {
r_args[ii] = static_cast<T>(op(src_args[ii]));
}
// store
load_store(self_ptr, r_args, i_start, 0);
}
} else {
for (int64_t i_start = 0; i_start < n && i_start < chunk_size;
i_start += blockDim.x * kILP) {
#pragma unroll
for (int ii = 0; ii < kILP; ii++) {
const auto i = i_start + threadIdx.x + ii * blockDim.x;
src_args[ii] = src_ptr[i];
}
#pragma unroll
for (int ii = 0; ii < kILP; ii++) {
r_args[ii] = static_cast<T>(op(src_args[ii]));
}
store_args(self_ptr, r_args, i_start, chunk_size, n);
}
}
}
};
} // anonymous namespace
void foreach_tensor_copy_list_kernel_cuda_(
TensorList self,
TensorList src,
const bool non_blocking) {
check_foreach_api_restrictions(self, src);
if (!(_check_tensors_share_device_and_dtype(
{self, src}, /* skip_dtype_check */ true) &&
std::all_of(
src.cbegin(),
src.cend(),
[&](const auto& t) -> bool {
return t.dtype() == src[0].dtype();
}) &&
_check_tensors_share_sizes_and_strides({self, src}))) {
if (!can_use_fast_route(
self, src, /* does_op_promote_integer_inputs_to_float */ false)) {
return at::native::foreach_tensor_copy_list_kernel_slow_(
self, src, non_blocking);
}
@ -411,38 +278,16 @@ void foreach_tensor_copy_list_kernel_cuda_(
"foreach_tensor_copy",
[&]() {
using opmath_t = at::opmath_type<scalar_t>;
AT_DISPATCH_SOURCE_TYPES(src[0].scalar_type(), "foreach_tensor_copy", [&] {
if constexpr (std::is_same_v<scalar_t, src_t>) {
multi_tensor_apply<2>(
tensor_lists,
UnaryOpFunctor<
scalar_t,
/* depth */ 2,
/* r_args_depth */ 1,
/* res_arg_index */ 1>(),
Copy<opmath_t, opmath_t>());
} else {
// Ref:
// https://github.com/pytorch/pytorch/blob/656134c38f4737d13c3f43fc5c59470bc23c1d2f/aten/src/ATen/native/Copy.cpp#L299-L301
if (!self[0].is_complex() && src[0].is_complex()) {
TORCH_WARN_ONCE(
"Casting complex values to real discards the imaginary part");
}
multi_tensor_apply<2>(
tensor_lists,
CopyFunctor<
scalar_t,
src_t,
/* depth */ 2,
/* r_args_depth */ 1,
/* res_arg_index */ 1>(),
Copy<scalar_t, src_t>());
}
});
multi_tensor_apply<2>(
tensor_lists,
UnaryOpFunctor<
scalar_t,
/* depth */ 2,
/* r_args_depth */ 1,
/* res_arg_index */ 1>(),
Identity<opmath_t>());
});
increment_version(self);
}
#undef AT_DISPATCH_SOURCE_TYPES
} // namespace at::native


@ -65,7 +65,7 @@ C10_LAUNCH_BOUNDS_1(cuda::getApplyBlockSize())
__global__ void kernelHistogram1D(
detail::TensorInfo<output_t, IndexType> a, /* output */
detail::TensorInfo<output_t, IndexType> p, /* partial output */
detail::TensorInfo<input_t, IndexType> b, /* input */
detail::TensorInfo<const input_t, IndexType> b, /* input */
int64_t nbins,
at::acc_type<input_t, /*is_cuda=*/true> minvalue,
at::acc_type<input_t, /*is_cuda=*/true> maxvalue,
@ -86,7 +86,7 @@ __global__ void kernelHistogram1D(
FOR_KERNEL_LOOP(linearIndex, totalElements) {
// Convert `linearIndex` into an offset of `b`
const IndexType bOffset =
detail::IndexToOffset<input_t, IndexType, BDims>::get(linearIndex, b);
detail::IndexToOffset<const input_t, IndexType, BDims>::get(linearIndex, b);
const auto bVal = b.data[bOffset];
if (bVal >= minvalue && bVal <= maxvalue) {
// Use value at `b` as an offset of `smem`
@ -112,7 +112,7 @@ __global__ void kernelHistogram1D(
FOR_KERNEL_LOOP(linearIndex, totalElements) {
// Convert `linearIndex` into an offset of `b`
const IndexType bOffset =
detail::IndexToOffset<input_t, IndexType, BDims>::get(linearIndex, b);
detail::IndexToOffset<const input_t, IndexType, BDims>::get(linearIndex, b);
const auto bVal = b.data[bOffset];
if (bVal >= minvalue && bVal <= maxvalue) {
// Use value at `b` as an offset of `a`
@ -219,7 +219,7 @@ bool CUDA_tensor_histogram(
using IndexType = int64_t;
auto aInfo = detail::getTensorInfo<output_t, IndexType>(a);
auto bInfo = detail::getTensorInfo<input_t, IndexType>(b);
auto bInfo = detail::getTensorInfo<const input_t, IndexType>(b);
detail::TensorInfo<output_t, IndexType> pInfo(nullptr, 0, {}, {});
if (HasWeights) {


@ -1500,7 +1500,11 @@ NvrtcFunction jit_pwise_function(
std::stringstream ss;
ss << *cache_dir << "/";
ss << kernel_name;
#ifdef USE_ROCM
ss << "_arch" << prop->gcnArchName;
#else
ss << "_arch" << cuda_major << "." << cuda_minor;
#endif
ss << "_nvrtc" << nvrtc_major << "." << nvrtc_minor;
ss << (compile_to_sass ? "_sass" : "_ptx");
ss << "_" << code.length();
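For context, this makes the JIT kernel-cache key architecture-specific on ROCm (gcnArchName) instead of reusing the CUDA major.minor scheme. Hypothetical resulting key prefixes, with every value invented for illustration and the components after code.length() truncated in this excerpt:
//   ROCm:  <cache_dir>/my_kernel_archgfx90a_nvrtc11.8_sass_4096_...
//   CUDA:  <cache_dir>/my_kernel_arch8.6_nvrtc11.8_sass_4096_...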


@ -1078,8 +1078,8 @@ static void apply_ormqr(const Tensor& input, const Tensor& tau, const Tensor& ot
auto side = left ? CUBLAS_SIDE_LEFT : CUBLAS_SIDE_RIGHT;
auto trans = transpose ? (input.is_complex() ? CUBLAS_OP_C : CUBLAS_OP_T) : CUBLAS_OP_N;
auto input_data = input.data_ptr<scalar_t>();
auto tau_data = tau.data_ptr<scalar_t>();
auto input_data = input.const_data_ptr<scalar_t>();
auto tau_data = tau.const_data_ptr<scalar_t>();
auto other_data = other.data_ptr<scalar_t>();
auto input_matrix_stride = matrixStride(input);
@ -1101,9 +1101,9 @@ static void apply_ormqr(const Tensor& input, const Tensor& tau, const Tensor& ot
auto info_data = info.data_ptr<int>();
for (auto i = decltype(batch_size){0}; i < batch_size; i++) {
scalar_t* input_working_ptr = &input_data[i * input_matrix_stride];
const scalar_t* input_working_ptr = &input_data[i * input_matrix_stride];
scalar_t* other_working_ptr = &other_data[i * other_matrix_stride];
scalar_t* tau_working_ptr = &tau_data[i * tau_stride];
const scalar_t* tau_working_ptr = &tau_data[i * tau_stride];
auto handle = at::cuda::getCurrentCUDASolverDnHandle();
// allocate workspace storage
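
This file, like most of the cuDNN/MIOpen/oneDNN changes below, is a const-correctness pass: operands that are only read move from data_ptr<T>() to const_data_ptr<T>(), which returns const T*, while outputs keep data_ptr()/mutable_data_ptr(). A minimal sketch of the pattern with illustrative names (not taken from this diff; assumes contiguous float CPU tensors):

    #include <ATen/ATen.h>

    // x is only read, y is written in place.
    void axpy_like(const at::Tensor& x, at::Tensor& y, float alpha) {
      const float* x_data = x.const_data_ptr<float>();  // read-only operand
      float* y_data = y.mutable_data_ptr<float>();      // mutated output
      for (int64_t i = 0; i < y.numel(); ++i) {
        y_data[i] += alpha * x_data[i];
      }
    }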


@ -95,9 +95,9 @@ std::ostream& operator<<(std::ostream& out, const ConvolutionArgs& args) {
<< "weight: " << args.wdesc // already has a trailing newline
<< "Pointer addresses: "
<< "\n"
<< " input: " << args.input.data_ptr() << "\n"
<< " output: " << args.output.data_ptr() << "\n"
<< " weight: " << args.weight.data_ptr() << "\n";
<< " input: " << args.input.const_data_ptr() << "\n"
<< " output: " << args.output.const_data_ptr() << "\n"
<< " weight: " << args.weight.const_data_ptr() << "\n";
return out;
}
@ -306,9 +306,9 @@ struct algorithm_search<cudnnConvolutionFwdAlgoPerf_t> {
cudnnFindConvolutionForwardAlgorithmEx(
args.handle,
args.idesc.desc(),
args.input.data_ptr(),
args.input.const_data_ptr(),
args.wdesc.desc(),
args.weight.data_ptr(),
args.weight.const_data_ptr(),
args.cdesc.desc(),
args.odesc.desc(),
args.output.data_ptr(),
@ -390,9 +390,9 @@ struct algorithm_search<cudnnConvolutionBwdDataAlgoPerf_t> {
cudnnFindConvolutionBackwardDataAlgorithmEx(
args.handle,
args.wdesc.desc(),
args.weight.data_ptr(),
args.weight.const_data_ptr(),
args.odesc.desc(),
args.output.data_ptr(),
args.output.const_data_ptr(),
args.cdesc.desc(),
args.idesc.desc(),
args.input.data_ptr(),
@ -478,9 +478,9 @@ struct algorithm_search<cudnnConvolutionBwdFilterAlgoPerf_t> {
cudnnFindConvolutionBackwardFilterAlgorithmEx(
args.handle,
args.idesc.desc(),
args.input.data_ptr(),
args.input.const_data_ptr(),
args.odesc.desc(),
args.output.data_ptr(),
args.output.const_data_ptr(),
args.cdesc.desc(),
args.wdesc.desc(),
args.weight.data_ptr(),
@ -760,9 +760,9 @@ void raw_cudnn_convolution_forward_out_32bit(
args.handle,
&one,
args.idesc.desc(),
input.data_ptr(),
input.const_data_ptr(),
args.wdesc.desc(),
weight.data_ptr(),
weight.const_data_ptr(),
args.cdesc.desc(),
fwdAlgPerf.algo,
workspace.data_ptr(),
@ -871,9 +871,9 @@ void raw_cudnn_convolution_backward_input_out_32bit(
args.handle,
&one,
args.wdesc.desc(),
weight.data_ptr(),
weight.const_data_ptr(),
args.odesc.desc(),
grad_output.data_ptr(),
grad_output.const_data_ptr(),
args.cdesc.desc(),
bwdDataAlgPerf.algo,
workspace.data_ptr(),
@ -884,7 +884,7 @@ void raw_cudnn_convolution_backward_input_out_32bit(
args,
"Additional pointer addresses: \n",
" grad_output: ",
grad_output.data_ptr(),
grad_output.const_data_ptr(),
"\n",
" grad_input: ",
grad_input.mutable_data_ptr(),
@ -990,9 +990,9 @@ void raw_cudnn_convolution_backward_weight_out_32bit(
args.handle,
&one,
args.idesc.desc(),
input.data_ptr(),
input.const_data_ptr(),
args.odesc.desc(),
grad_output.data_ptr(),
grad_output.const_data_ptr(),
args.cdesc.desc(),
bwdFilterAlgPerf.algo,
workspace.data_ptr(),
@ -1003,7 +1003,7 @@ void raw_cudnn_convolution_backward_weight_out_32bit(
args,
"Additional pointer addresses: \n",
" grad_output: ",
grad_output.data_ptr(),
grad_output.const_data_ptr(),
"\n",
" grad_weight: ",
grad_weight.data_ptr(),
@ -1173,18 +1173,18 @@ void raw_cudnn_convolution_add_relu_out_v7(
args.handle,
&one,
args.idesc.desc(),
input.data_ptr(),
input.const_data_ptr(),
args.wdesc.desc(),
weight.data_ptr(),
weight.const_data_ptr(),
args.cdesc.desc(),
fwdAlgPerf.algo,
workspace.data_ptr(),
fwdAlgPerf.memory,
&alpha_,
zdesc.desc(),
z.data_ptr(),
z.const_data_ptr(),
bdesc.desc(),
bias.data_ptr(),
bias.const_data_ptr(),
adesc.desc(),
args.odesc.desc(),
output.data_ptr()),


@ -52,7 +52,7 @@ constexpr int64_t operator"" _TiB(unsigned long long n) {
uint8_t getAlignment(const Tensor& t) {
// alignment are in bytes
uint8_t alignment = 1;
uintptr_t address = reinterpret_cast<uintptr_t>(t.data_ptr());
uintptr_t address = reinterpret_cast<uintptr_t>(t.const_data_ptr());
for (; alignment < 32; alignment *= 2) {
if (address % (alignment * 2)) {
return alignment;
@ -358,12 +358,30 @@ void run_conv_plan(
const Tensor& x,
const Tensor& y,
const Tensor& w,
const cudnn_frontend::ExecutionPlan& plan) {
const cudnn_frontend::ExecutionPlan& plan,
const cudnnBackendDescriptorType_t operation) {
c10::DeviceGuard g(x.options().device());
auto workspace_size = plan.getWorkspaceSize();
auto workspace_ptr =
c10::cuda::CUDACachingAllocator::get()->allocate(workspace_size);
void* data_ptrs[] = {x.data_ptr(), y.data_ptr(), w.data_ptr()};
void* data_ptrs[3];
if (operation == CUDNN_BACKEND_OPERATION_CONVOLUTION_FORWARD_DESCRIPTOR) {
data_ptrs[0] = const_cast<void*>(x.const_data_ptr());
data_ptrs[1] = y.data_ptr();
data_ptrs[2] = const_cast<void*>(w.const_data_ptr());
} else if (
operation ==
CUDNN_BACKEND_OPERATION_CONVOLUTION_BACKWARD_DATA_DESCRIPTOR) {
data_ptrs[0] = x.data_ptr();
data_ptrs[1] = const_cast<void*>(y.const_data_ptr());
data_ptrs[2] = const_cast<void*>(w.const_data_ptr());
} else {
data_ptrs[0] = x.data_ptr();
data_ptrs[1] = y.data_ptr();
data_ptrs[2] = w.data_ptr();
}
int64_t uids[] = {'x', 'y', 'w'};
auto variantPack =
cudnn_frontend::VariantPackBuilder()
@ -843,10 +861,11 @@ void try_plans(
const cudnnHandle_t handle,
const Tensor& x,
const Tensor& y,
const Tensor& w) {
const Tensor& w,
const cudnnBackendDescriptorType_t operation) {
for (auto& plan : plans) {
try {
run_conv_plan(handle, x, y, w, plan);
run_conv_plan(handle, x, y, w, plan, operation);
benchmark_cache.update(key, plan);
return;
} catch (cudnn_frontend::cudnnException& e) {
@ -890,7 +909,8 @@ bool try_configs(
const cudnnHandle_t handle,
const Tensor& x,
const Tensor& y,
const Tensor& w) {
const Tensor& w,
const cudnnBackendDescriptorType_t operation) {
for (auto& config : configs) {
try {
auto plan = cudnn_frontend::ExecutionPlanBuilder()
@ -900,7 +920,7 @@ bool try_configs(
if (plan_errata_exception(handle, plan.getTag())) {
continue;
}
run_conv_plan(handle, x, y, w, plan);
run_conv_plan(handle, x, y, w, plan, operation);
benchmark_cache.update(key, plan);
return true;
} catch (cudnn_frontend::cudnnException& e) {
@ -971,7 +991,7 @@ void run_single_conv(
auto search = benchmark_cache.find(key);
if (search) {
try {
run_conv_plan(handle, x, y, w, *search);
run_conv_plan(handle, x, y, w, *search, operation);
return;
} catch (c10::OutOfMemoryError& e) {
(void)cudaGetLastError(); // clear CUDA error
@ -994,7 +1014,7 @@ void run_single_conv(
deterministic,
allow_tf32,
false);
if (try_configs(configs, opgraph_tag, key, handle, x, y, w)) {
if (try_configs(configs, opgraph_tag, key, handle, x, y, w, operation)) {
return;
}
// fallback configs
@ -1012,7 +1032,7 @@ void run_single_conv(
deterministic,
allow_tf32,
true);
if (try_configs(configs, opgraph_tag, key, handle, x, y, w)) {
if (try_configs(configs, opgraph_tag, key, handle, x, y, w, operation)) {
return;
}
TORCH_CHECK(
@ -1035,7 +1055,7 @@ void run_single_conv(
if (at::native::_cudnn_get_conv_benchmark_empty_cache()) {
c10::cuda::CUDACachingAllocator::emptyCache();
}
try_plans(plans, key, handle, x, y, w);
try_plans(plans, key, handle, x, y, w, operation);
}
}
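
run_conv_plan now receives the backend operation type because the cuDNN frontend variant pack only accepts non-const void* pointers: which of x/y/w is the writable output depends on whether the plan is a forward, backward-data, or backward-filter convolution, so the read-only operands go through const_data_ptr() and are const_cast back. A small sketch of that idiom, with a hypothetical helper name:

    #include <ATen/ATen.h>

    // Read-only operand handed to a C API that only takes void*; the const_cast is
    // safe here because the selected plan never writes through this pointer.
    inline void* as_variant_pack_ptr(const at::Tensor& t) {
      return const_cast<void*>(t.const_data_ptr());
    }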


@ -2,6 +2,7 @@
#include <ATen/native/layer_norm.h>
#include <ATen/core/Tensor.h>
#include <ATen/Dispatch.h>
#include <ATen/Parallel.h>
#include <ATen/native/cpu/mixed_data_type.h>
#include <c10/util/irange.h>
@ -18,6 +19,9 @@
#include <ATen/ops/native_layer_norm.h>
#include <ATen/ops/native_layer_norm_backward_native.h>
#include <ATen/ops/native_layer_norm_native.h>
#include <ATen/ops/pow.h>
#include <ATen/ops/rsqrt.h>
#include <ATen/ops/rms_norm.h>
#include <ATen/ops/zeros_like_native.h>
#endif
@ -258,4 +262,49 @@ std::tuple<Tensor, Tensor, Tensor> math_native_layer_norm(
rstd = rstd.view(stat_shape);
return std::make_tuple(out, mean, rstd);
}
Tensor rms_norm(
const Tensor& input,
IntArrayRef normalized_shape,
const c10::optional<Tensor>& weight_opt /* optional */,
c10::optional<double> eps) {
// See [Note: hacky wrapper removal for optional tensor]
c10::MaybeOwned<Tensor> weight_maybe_owned = at::borrow_from_optional_tensor(weight_opt);
const Tensor& weight = *weight_maybe_owned;
auto bias_opt = at::optional<Tensor>();
const Tensor& bias = *at::borrow_from_optional_tensor(bias_opt);
(void) _check_layer_norm_inputs(input, normalized_shape, weight, bias);
std::vector<int64_t> dims_to_reduce;
for (const auto i : c10::irange(normalized_shape.size())) {
dims_to_reduce.push_back(input.dim() - i - 1);
}
IntArrayRef dims_to_reduce_ref = IntArrayRef(dims_to_reduce);
auto result = AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND2(
at::ScalarType::Half,
at::ScalarType::BFloat16,
input.scalar_type(),
"rms_norm",
[&] {
scalar_t eps_val;
if (!eps.has_value()) {
eps_val = std::numeric_limits<at::scalar_value_type<scalar_t>::type>::epsilon();
} else {
eps_val = eps.value();
}
auto result = input.mul(at::rsqrt(at::pow(input, 2).mean(dims_to_reduce_ref, /*keep_dim=*/true).add_(eps_val)));
if (weight_opt.has_value()) {
result = result.mul(weight_opt.value());
}
return result;
});
return result;
}
} // namespace at::native
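
The composite op added above computes y = x * rsqrt(mean(x^2 over the normalized dimensions) + eps), optionally scaled by weight, with eps defaulting to the dtype's machine epsilon. A hedged usage sketch, assuming the op is exposed as at::rms_norm via the native_functions.yaml entry later in this diff:

    #include <ATen/ATen.h>

    at::Tensor rms_norm_demo() {
      at::Tensor x = at::randn({8, 16});
      at::Tensor w = at::ones({16});
      // Normalize over the last dimension; eps falls back to the dtype's epsilon.
      return at::rms_norm(x, /*normalized_shape=*/{16}, w, /*eps=*/c10::nullopt);
    }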


@ -71,6 +71,12 @@ void layer_norm_cpu_out(
int64_t M,
int64_t N);
Tensor rms_norm(
const Tensor& input,
IntArrayRef normalized_shape,
const c10::optional<Tensor>& weight_opt /* optional */,
c10::optional<double> eps);
using forward_fn = void (*)(
const Tensor& /* X */,
const Tensor& /* gamma */,


@ -371,8 +371,8 @@ struct algorithm_search<miopenConvFwdAlgorithm_t> {
Workspace ws(max_ws_size);
MIOPEN_CHECK(miopenFindConvolutionForwardAlgorithm(
args.handle,
args.idesc.desc(), args.input.data_ptr(),
args.wdesc.desc(), args.weight.data_ptr(),
args.idesc.desc(), args.input.const_data_ptr(),
args.wdesc.desc(), args.weight.const_data_ptr(),
args.cdesc.desc(),
args.odesc.desc(), args.output.data_ptr(),
1, // just return the fastest
@ -444,8 +444,8 @@ struct algorithm_search<miopenConvBwdDataAlgorithm_t> {
Workspace ws(max_ws_size);
MIOPEN_CHECK(miopenFindConvolutionBackwardDataAlgorithm(
args.handle,
args.odesc.desc(), args.output.data_ptr(),
args.wdesc.desc(), args.weight.data_ptr(),
args.odesc.desc(), args.output.const_data_ptr(),
args.wdesc.desc(), args.weight.const_data_ptr(),
args.cdesc.desc(),
args.idesc.desc(), args.input.data_ptr(),
1, // just return the fastest
@ -517,8 +517,8 @@ struct algorithm_search<miopenConvBwdWeightsAlgorithm_t> {
Workspace ws(max_ws_size);
MIOPEN_CHECK(miopenFindConvolutionBackwardWeightsAlgorithm(
args.handle,
args.odesc.desc(), args.output.data_ptr(),
args.idesc.desc(), args.input.data_ptr(),
args.odesc.desc(), args.output.const_data_ptr(),
args.idesc.desc(), args.input.const_data_ptr(),
args.cdesc.desc(),
args.wdesc.desc(), args.weight.data_ptr(),
1, // just return the fastest
@ -682,7 +682,7 @@ void miopen_convolution_add_bias_(CheckedFrom c, const TensorArg& output, const
Constant one(dataType, 1);
Constant zero(dataType, 0);
MIOPEN_CHECK(miopenConvolutionForwardBias(handle, &one, bdesc.desc(), bias->data_ptr(),
MIOPEN_CHECK(miopenConvolutionForwardBias(handle, &one, bdesc.desc(), bias->const_data_ptr(),
&zero, odesc.desc(), output->data_ptr()));
*/
}
@ -730,8 +730,8 @@ void raw_miopen_convolution_forward_out(
MIOPEN_CHECK(miopenConvolutionForward(
args.handle,
&one, args.idesc.desc(), input.data_ptr(),
args.wdesc.desc(), weight.data_ptr(),
&one, args.idesc.desc(), input.const_data_ptr(),
args.wdesc.desc(), weight.const_data_ptr(),
args.cdesc.desc(), fwdAlg, &zero,
args.odesc.desc(), output.data_ptr(), workspace.data, workspace.size));
}
@ -741,8 +741,8 @@ void raw_miopen_convolution_forward_out(
MIOPEN_CHECK(miopenConvolutionForwardImmediate(
args.handle,
args.wdesc.desc(), weight.data_ptr(),
args.idesc.desc(), input.data_ptr(),
args.wdesc.desc(), weight.const_data_ptr(),
args.idesc.desc(), input.const_data_ptr(),
args.cdesc.desc(),
args.odesc.desc(), output.data_ptr(), workspace.data, workspace.size, solution_id));
}
@ -838,8 +838,8 @@ void raw_miopen_depthwise_convolution_forward_out(
MIOPEN_CHECK(miopenConvolutionForward(
args.handle,
&one, args.idesc.desc(), input.data_ptr(),
args.wdesc.desc(), weight.data_ptr(),
&one, args.idesc.desc(), input.const_data_ptr(),
args.wdesc.desc(), weight.const_data_ptr(),
args.cdesc.desc(), fwdAlg, &zero,
args.odesc.desc(), output.data_ptr(), workspace.data, workspace.size));
}
@ -849,8 +849,8 @@ void raw_miopen_depthwise_convolution_forward_out(
MIOPEN_CHECK(miopenConvolutionForwardImmediate(
args.handle,
args.wdesc.desc(), weight.data_ptr(),
args.idesc.desc(), input.data_ptr(),
args.wdesc.desc(), weight.const_data_ptr(),
args.idesc.desc(), input.const_data_ptr(),
args.cdesc.desc(),
args.odesc.desc(), output.data_ptr(), workspace.data, workspace.size, solution_id));
}
@ -993,8 +993,8 @@ void raw_miopen_convolution_backward_weight_out(
MIOPEN_CHECK(miopenConvolutionBackwardWeights(
args.handle,
&one, args.odesc.desc(), grad_output.data_ptr(),
args.idesc.desc(), input.data_ptr(),
&one, args.odesc.desc(), grad_output.const_data_ptr(),
args.idesc.desc(), input.const_data_ptr(),
args.cdesc.desc(), bwdFilterAlg, &zero,
args.wdesc.desc(), grad_weight.data_ptr(), workspace.data, workspace.size));
}
@ -1004,8 +1004,8 @@ void raw_miopen_convolution_backward_weight_out(
MIOPEN_CHECK(miopenConvolutionBackwardWeightsImmediate(
args.handle,
args.odesc.desc(), grad_output.data_ptr(),
args.idesc.desc(), input.data_ptr(),
args.odesc.desc(), grad_output.const_data_ptr(),
args.idesc.desc(), input.const_data_ptr(),
args.cdesc.desc(),
args.wdesc.desc(), grad_weight.data_ptr(), workspace.data, workspace.size, solution_id));
}
@ -1037,8 +1037,8 @@ void raw_miopen_depthwise_convolution_backward_weight_out(
MIOPEN_CHECK(miopenConvolutionBackwardWeights(
args.handle,
&one, args.odesc.desc(), grad_output.data_ptr(),
args.idesc.desc(), input.data_ptr(),
&one, args.odesc.desc(), grad_output.const_data_ptr(),
args.idesc.desc(), input.const_data_ptr(),
args.cdesc.desc(), bwdFilterAlg, &zero,
args.wdesc.desc(), grad_weight.data_ptr(), workspace.data, workspace.size));
}
@ -1048,8 +1048,8 @@ void raw_miopen_depthwise_convolution_backward_weight_out(
MIOPEN_CHECK(miopenConvolutionBackwardWeightsImmediate(
args.handle,
args.odesc.desc(), grad_output.data_ptr(),
args.idesc.desc(), input.data_ptr(),
args.odesc.desc(), grad_output.const_data_ptr(),
args.idesc.desc(), input.const_data_ptr(),
args.cdesc.desc(),
args.wdesc.desc(), grad_weight.data_ptr(), workspace.data, workspace.size, solution_id));
}
@ -1242,8 +1242,8 @@ void raw_miopen_convolution_backward_input_out(
MIOPEN_CHECK(miopenConvolutionBackwardData(
args.handle,
&one, args.odesc.desc(), grad_output.data_ptr(),
args.wdesc.desc(), weight.data_ptr(),
&one, args.odesc.desc(), grad_output.const_data_ptr(),
args.wdesc.desc(), weight.const_data_ptr(),
args.cdesc.desc(), bwdDataAlg, &zero,
args.idesc.desc(), grad_input.mutable_data_ptr(), workspace.data, workspace.size));
}
@ -1253,8 +1253,8 @@ void raw_miopen_convolution_backward_input_out(
MIOPEN_CHECK(miopenConvolutionBackwardDataImmediate(
args.handle,
args.odesc.desc(), grad_output.data_ptr(),
args.wdesc.desc(), weight.data_ptr(),
args.odesc.desc(), grad_output.const_data_ptr(),
args.wdesc.desc(), weight.const_data_ptr(),
args.cdesc.desc(),
args.idesc.desc(), grad_input.mutable_data_ptr(), workspace.data, workspace.size, solution_id));
}
@ -1351,8 +1351,8 @@ void raw_miopen_depthwise_convolution_backward_input_out(
MIOPEN_CHECK(miopenConvolutionBackwardData(
args.handle,
&one, args.odesc.desc(), grad_output.data_ptr(),
args.wdesc.desc(), weight.data_ptr(),
&one, args.odesc.desc(), grad_output.const_data_ptr(),
args.wdesc.desc(), weight.const_data_ptr(),
args.cdesc.desc(), bwdDataAlg, &zero,
args.idesc.desc(), grad_input.mutable_data_ptr(), workspace.data, workspace.size));
}
@ -1362,8 +1362,8 @@ void raw_miopen_depthwise_convolution_backward_input_out(
MIOPEN_CHECK(miopenConvolutionBackwardDataImmediate(
args.handle,
args.odesc.desc(), grad_output.data_ptr(),
args.wdesc.desc(), weight.data_ptr(),
args.odesc.desc(), grad_output.const_data_ptr(),
args.wdesc.desc(), weight.const_data_ptr(),
args.cdesc.desc(),
args.idesc.desc(), grad_input.mutable_data_ptr(), workspace.data, workspace.size, solution_id));
}
@ -1528,11 +1528,11 @@ void raw_miopen_convolution_relu_out(
float activ_gamma = static_cast<float>(0);
miopenOperatorArgs_t fusionArgs;
MIOPEN_CHECK(miopenCreateOperatorArgs(&fusionArgs));
MIOPEN_CHECK(miopenSetOpArgsConvForward(fusionArgs, convoOp, &alpha, &beta, weight.data_ptr()));
MIOPEN_CHECK(miopenSetOpArgsBiasForward(fusionArgs, biasOp, &alpha, &beta, bias.data_ptr()));
MIOPEN_CHECK(miopenSetOpArgsConvForward(fusionArgs, convoOp, &alpha, &beta, weight.const_data_ptr()));
MIOPEN_CHECK(miopenSetOpArgsBiasForward(fusionArgs, biasOp, &alpha, &beta, bias.const_data_ptr()));
MIOPEN_CHECK(miopenSetOpArgsActivForward(fusionArgs, activOp, &alpha, &beta, activ_alpha, activ_beta, activ_gamma));
miopenExecuteFusionPlan(args.handle, fusePlanDesc, args.idesc.desc(), input.data_ptr(), args.odesc.desc(), output.data_ptr(), fusionArgs);
miopenExecuteFusionPlan(args.handle, fusePlanDesc, args.idesc.desc(), input.const_data_ptr(), args.odesc.desc(), output.data_ptr(), fusionArgs);
// Cleanup
miopenDestroyFusionPlan(fusePlanDesc);


@ -223,10 +223,10 @@ static void _mkldnn_convolution_out (
auto memory_format = mkldnn_convolution_memory_format(input_t.ndimension(), is_channels_last);
auto input = input_t.is_mkldnn() ? input_t : input_t.contiguous(memory_format);
auto weight = weight_t.is_mkldnn() ? weight_t : weight_t.contiguous(memory_format);
const ideep::tensor x = itensor_from_tensor(input);
const ideep::tensor w = itensor_from_tensor(weight);
const ideep::tensor x = itensor_from_tensor(input, /*from_const_data_ptr*/true);
const ideep::tensor w = itensor_from_tensor(weight, /*from_const_data_ptr*/true);
if (bias.defined()) {
const ideep::tensor b = itensor_from_tensor(bias);
const ideep::tensor b = itensor_from_tensor(bias, /*from_const_data_ptr*/true);
ideep::convolution_forward::compute_v3(
x,
w,
@ -704,9 +704,9 @@ Tensor _mkldnn_convolution_transpose(
auto output_sizes = conv_input_size(input.sizes(), weight_IOHW_sizes, padding_expanded, output_padding_expanded, stride_expanded, dilation_expanded, groups);
auto output = at::empty({0}, input.options());
const ideep::tensor x = itensor_from_tensor(input);
const ideep::tensor x = itensor_from_tensor(input, /*from_const_data_ptr*/true);
ideep::tensor w = itensor_from_tensor(weight);
ideep::tensor w = itensor_from_tensor(weight, /*from_const_data_ptr*/true);
if (!weight.is_mkldnn()) {
// mkldnn transposed convolution has weight in logical order of OIHW or OIDHW,
// while PyTorch has IOHW or IODHW, `._tranpose()` switches strides (no memory copy).
@ -720,7 +720,7 @@ Tensor _mkldnn_convolution_transpose(
}
if (bias.defined()) {
const ideep::tensor b = itensor_from_tensor(bias);
const ideep::tensor b = itensor_from_tensor(bias, /*from_const_data_ptr*/true);
ideep::convolution_transpose_forward::compute_v3(
x,
w,


@ -3268,6 +3268,8 @@
autogen: native_layer_norm_backward.out
tags: core
- func: rms_norm(Tensor input, int[] normalized_shape, Tensor? weight=None, float? eps=None) -> Tensor
- func: nan_to_num(Tensor self, float? nan=None, float? posinf=None, float? neginf=None) -> Tensor
variants: function, method
dispatch:


@ -48,6 +48,8 @@ std::tuple<Tensor, Tensor> fake_quantize_per_channel_affine_cachemask(
int64_t axis,
int64_t quant_min,
int64_t quant_max) {
TORCH_CHECK(scale.scalar_type() == ScalarType::Float,
"Scale must be Float, found ", scale.scalar_type());
TORCH_CHECK(zero_point.scalar_type() == ScalarType::Int || zero_point.scalar_type() == ScalarType::Float || zero_point.scalar_type() == ScalarType::Half,
"Zero-point must be Int32, Float or Half, found ", zero_point.scalar_type());
TORCH_CHECK(scale.dim() == 1, "scale should be a 1-D tensor");
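
The added checks reject unsupported scale/zero_point dtypes up front with a clear error rather than quantizing incorrectly. A hedged sketch of a call the new check is expected to reject, assuming at::fake_quantize_per_channel_affine routes through the cachemask variant shown above:

    #include <ATen/ATen.h>

    void fake_quant_dtype_check_demo() {
      at::Tensor x = at::randn({2, 3});
      at::Tensor scale = at::ones({2}, at::kDouble);   // must be Float per the new check
      at::Tensor zero_point = at::zeros({2}, at::kInt);
      // Expected to throw: "Scale must be Float, found Double"
      at::fake_quantize_per_channel_affine(x, scale, zero_point,
                                           /*axis=*/0, /*quant_min=*/0, /*quant_max=*/255);
    }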

Some files were not shown because too many files have changed in this diff.