Compare commits

...

100 Commits

Author SHA1 Message Date
3ddec713b8 Revert "[cuDNN][Quantization] Don't print when plan finalization fails in cuDNN quantization backend (#128177)"
This reverts commit cac7a22b92478d897488688010e562b7bd36b97f.

Reverted https://github.com/pytorch/pytorch/pull/128177 on behalf of https://github.com/clee2000 due to broke test/test_quantization.py::TestQuantizedLinear::test_qlinear_cudnn on sm86 tests cac7a22b92 https://github.com/pytorch/pytorch/actions/runs/9470648757/job/26100448913.  Probably a landrace, the test ran on the PR and succeeded ([comment](https://github.com/pytorch/pytorch/pull/128177#issuecomment-2161977110))
2024-06-12 02:20:15 +00:00
85eeb90d2c [dynamo] Fix graph breaks related to HF ModelOutput (#127780)
Fixes https://github.com/pytorch/pytorch/issues/126028 and https://github.com/pytorch/pytorch/issues/126027.

Changes:
- Support building `CustomizedDictVariable` in `VariableBuilder` (but only for HF `ModelOutput` subclasses)
- Remove `DataClassVariable` since it's not really being used anywhere (`CustomizedDictVariable` can be used instead)
- Support side effects for `CustomizedDictVariable`
- Allow `NO_HASATTR` leaf guard on `DictSubclassGuardManager`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127780
Approved by: https://github.com/jansel, https://github.com/anijain2305
2024-06-12 02:16:24 +00:00
7f6daf289b [inductor] parallel compile: set LD_LIBRARY_PATH for sub-processes in internal (#128376)
Test Plan: `TORCHINDUCTOR_WORKER_START=subprocess TORCHINDUCTOR_COMPILE_THREADS=16 buck run mode/opt scripts/slarsen/torch_compile:run`

Differential Revision: D58371264

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128376
Approved by: https://github.com/eellison
2024-06-12 01:55:53 +00:00
3d55d84ec2 [Fix] Check tensor dtype before using torch.allclose in _trace log (#128438)
#### Issue
`torch.allclose` errors out during logging due to different dtypes.
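
A minimal sketch of the failure mode and the guard, assuming two tensors whose dtypes may differ (the tensor names are illustrative, not the ones used in the `_trace` log):
```python
import torch

a = torch.randn(3)        # float32
b = a.to(torch.float64)   # float64

# torch.allclose raises a RuntimeError when the dtypes differ, so check
# the dtypes first and cast before comparing.
if a.dtype == b.dtype:
    close = torch.allclose(a, b)
else:
    close = torch.allclose(a, b.to(a.dtype))
print(close)
```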

#### Test
* `pytest test/test_jit.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128438
Approved by: https://github.com/angelayi
2024-06-12 01:52:09 +00:00
bb2a995529 Back out "[Dynamo] Treat integers stored on nn.Modules as dynamic (#126466)" (#128432)
Summary:
Original commit changeset: c7d2e6b13922

Original Phabricator Diff: D57618942

Differential Revision: D58383241

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128432
Approved by: https://github.com/ezyang, https://github.com/Yuzhen11
2024-06-12 01:34:32 +00:00
cyy
9538bf4e7c [2/N] Remove inclusion of c10/util/string_utils.h (#128372)
Follows  #128300.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128372
Approved by: https://github.com/aaronenyeshi
2024-06-12 01:18:20 +00:00
cyy
219da29dfd [7/N] Remove unused functions (#128407)
Follows  #128309
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128407
Approved by: https://github.com/ezyang
2024-06-12 01:10:33 +00:00
cyy
fb013ecb24 Remove unused private List::ptr_to_first_element (#128405)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128405
Approved by: https://github.com/ezyang
2024-06-12 01:07:14 +00:00
6af4c6acad Migrate test to internal base class, fixes (#128367)
Summary:
## Remove etcd deps
Converted tests to a non-etcd-based rdzv handler so that the tests don't have a dependency on an etcd server.

## Adopt pytorch test conventions
- test starts with `test_TESTS.py`
- Test base class is torch.testing._internal.common_utils.TestCase
- include a `__main__` handler (see the sketch below)
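
A minimal sketch of these conventions (the class and test names here are illustrative):
```python
# test_example.py
from torch.testing._internal.common_utils import TestCase, run_tests


class ElasticLaunchTest(TestCase):
    def test_something(self):
        self.assertEqual(1 + 1, 2)


if __name__ == "__main__":
    run_tests()
```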

## reduce test timing (used to take > 300 seconds):

3.05s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_init_method_env_with_torchelastic
2.59s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_init_method_tcp_with_torchelastic
2.33s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_elastic_worker_raise_exception
2.33s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_run_path
2.30s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_nproc_launch_auto_configurations
2.24s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_is_torchelastic_launched_with_logs_spec_defined
2.24s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_is_torchelastic_launched
2.17s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_elastic_multiple_agents
2.12s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_elastic
2.08s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_nproc_gpu_launch_configurations
1.32s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_standalone
1.05s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_nproc_launch_number_configurations
1.05s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_with_env_vars
1.05s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_user_script_python
1.05s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_user_script_python_caffe2_bc
1.04s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_user_script_bash
1.03s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_user_script_default_nproc
0.04s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_logs_logs_spec_entrypoint_must_be_defined
0.01s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_elastic_agent_raise_exception
0.01s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_shutdown

Test Plan: pytest --durations=0  test/distributed/launcher/run_test.py

Differential Revision: D58388182

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128367
Approved by: https://github.com/d4l3k
2024-06-12 01:03:40 +00:00
786c24a4cd [inductor] Always realize sigmoid for CPU (#128339)
Summary: Currently the cpu backend prefers to always realize exp because it's a heavy op on CPU. For the same reason, we need to realize sigmoid as well. This solves a problem in llama2 inference where exp was repeated many times in an inner loop.
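
A minimal sketch of the kind of pattern this targets, assuming a function where the sigmoid result feeds several consumers (illustrative, not the llama2 code):
```python
import torch


@torch.compile
def gate(x):
    # sigmoid (and the exp inside it) is reused by several consumers;
    # realizing it once avoids recomputing exp in each use on CPU.
    s = torch.sigmoid(x)
    return s * x + s.sum() + (1 - s)


gate(torch.randn(1024))
```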

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128339
Approved by: https://github.com/eellison, https://github.com/helloguo, https://github.com/jansel, https://github.com/jgong5, https://github.com/peterbell10
2024-06-12 00:46:33 +00:00
5d8c7f39d4 Revert "Introduce int_oo (#127693)"
This reverts commit 9cab5987bdeb66df8efbc581b3469bfe300e168c.

Reverted https://github.com/pytorch/pytorch/pull/127693 on behalf of https://github.com/clee2000 due to sorry executorch CI is a bit weird regarding pins, I'll make a chat with mergen with the choices of what to do and how it'll affect executorch CI, reverting for now to prevent more divergences in the meantime ([comment](https://github.com/pytorch/pytorch/pull/127693#issuecomment-2161775400))
2024-06-11 23:36:08 +00:00
c9c1fed065 Revert "Flip default value for mypy disallow_untyped_defs [10+2/11] (#128374)"
This reverts commit c13e03c87428b986972a48d8fc78dbffc2579f63.

Reverted https://github.com/pytorch/pytorch/pull/128374 on behalf of https://github.com/clee2000 due to sorry I need to revert this in order to revert something else, to remerge, just rebase and fix the merge conflict ([comment](https://github.com/pytorch/pytorch/pull/128374#issuecomment-2161772864))
2024-06-11 23:34:03 +00:00
94fea82d66 init sub comment (#128082)
Fixes #127905

### Description

Add docstring to torch/onnx/symbolic_opset9.py:sigmoid function

### Checklist
- [x] The issue that is being fixed is referred in the description
- [x] Only one issue is addressed in this pull request
- [x] Labels from the issue that this PR is fixing are added to this pull request
- [x] No unnecessary issues are included into this pull request

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128082
Approved by: https://github.com/titaiwangms
2024-06-11 22:42:35 +00:00
447173198b Add docstring for the torch.fx.operator_schemas.create_type_hint func… (#128139)
Fixes: #127916

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128139
Approved by: https://github.com/SherlockNoMad
2024-06-11 22:42:11 +00:00
b79d056e76 [export] FIx unflattener for preserving modules containing unused inputs (#128260)
Currently the unflattener fails if the module it is preserving the module signature for contains unused inputs/outputs.

This also fixes unflattener issues in D57829276.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128260
Approved by: https://github.com/pianpwk
2024-06-11 22:32:08 +00:00
eb567b1f40 Pass params to dump_nccl_trace_pickle (#128307)
Summary:
Pass parameters from the request to the dump_nccl_trace_pickle handler.
The supported parameter names and values are all lowercase.
includecollectives={true, false}
includestacktraces={true, false}
onlyactive={true, false}

Example post is:
/handler/dump_nccl_trace_pickle?includecollectives=true&includestacktraces=false&onlyactive=true

Test Plan:
unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128307
Approved by: https://github.com/d4l3k
ghstack dependencies: #128191
2024-06-11 22:28:53 +00:00
1dd2431f86 [Test] Add test for only_active flag (#128191)
Summary:
Add a unit test for the only_active flag of the _dump_nccl_trace API call.
With this flag, we only expect active records to be returned.

Test Plan:
Unit test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128191
Approved by: https://github.com/d4l3k
2024-06-11 22:26:01 +00:00
5fcb5f0c8b init reshape_from_tensor_shape comment (#128171)
Fixes #127897

### Description
Add docstring to torch/onnx/symbolic_opset9.py:sigmoid function

### Checklist
- [x] The issue that is being fixed is referred in the description
- [x] Only one issue is addressed in this pull request
- [x] Labels from the issue that this PR is fixing are added to this pull request
- [x] No unnecessary issues are included into this pull request

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128171
Approved by: https://github.com/titaiwangms
2024-06-11 21:56:33 +00:00
a55d0d9718 Fix side effect pruning (#128028)
Summary:
The previous side effect pruning algorithm would keep many dead cell
variables alive. For example, in
https://github.com/pytorch/pytorch/issues/125078, the compiled function
has one return but there were three in the Dynamo graph due to two
dead cell variables not being pruned away.

This PR adds a corrected algorithm: "new cell variables" are alive if
they can be reached from one of the following:
1. any of the tx.symbolic_locals or tx.stack (that is, if they are
   involved in a return from the function or are an intermediate variable
   during a graph break). Example: an alive NestedUserFunctionVariable
2. "mutations to pre-existing objects". Example: appending a
   NestedUserFunctionVariable to a global list

The new algorithm reflects this, but please let me know if there are
more cases to handle.
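
A minimal sketch of a dead cell variable of the kind described above (illustrative, not the repro from the linked issue):
```python
import torch


def f(x):
    y = x + 1

    def unused():   # never returned or stored anywhere, so the cell
        return y    # holding `y` is dead and should be pruned

    return x * 2


torch.compile(f)(torch.randn(3))
```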

Test Plan:
- existing tests (afaict, test/dynamo/test_python_autograd is the best
  SideEffects test case we have)
- see in test/dynamo/test_higher_order_ops that the expecttests changed
  -- the functorch dynamo graphs no longer return dead cellvars.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128028
Approved by: https://github.com/jansel
2024-06-11 21:40:48 +00:00
8c1247cffb [BE] Fixed CPU autocast warning (#127774)
This PR fixes
```
/data/users/andgu/pytorch/torch/utils/checkpoint.py:1398: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
```
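
For reference, a minimal sketch of the non-deprecated spelling the warning asks for:
```python
import torch

a = torch.randn(8, 8)
b = torch.randn(8, 8)

# Deprecated: torch.cpu.amp.autocast(...)
# Preferred:  torch.amp.autocast('cpu', ...)
with torch.amp.autocast("cpu", dtype=torch.bfloat16):
    c = a @ b
print(c.dtype)  # bfloat16 inside the autocast region
```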

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127774
Approved by: https://github.com/soulitzer, https://github.com/Skylion007, https://github.com/tianyu-l
2024-06-11 21:33:35 +00:00
70a1e85718 [Traceable FSDP2] Use custom ops for AllGather copy-in / copy-out and ReduceScatter copy-in (#127856)
Making these operations into custom ops helps Inductor identify these ops and enforce the FSDP communication op ordering.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127856
Approved by: https://github.com/awgu
2024-06-11 20:15:03 +00:00
adb699189b Revert "[RELAND][dynamo][nn-modules] Trace through nn.Module dunder methods for UnspecializedNNModule (#126578)"
This reverts commit b2d602306a9eb19e30328cbaee941c874f8148a9.

Reverted https://github.com/pytorch/pytorch/pull/126578 on behalf of https://github.com/clee2000 due to failed internal test D58394084.  Author has forward fix but includes external changes so reverting is a bit easier to coordinate ([comment](https://github.com/pytorch/pytorch/pull/126578#issuecomment-2161481839))
2024-06-11 19:41:41 +00:00
eqy
45dccfddcd [cuDNN][SDPA] Support different key, value dimension in cuDNN SDPA (#128350)
CC @vedaanta-nvidia @drisspg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128350
Approved by: https://github.com/Skylion007
2024-06-11 19:22:21 +00:00
3e09123797 Enable UFMT on test_nestedtensor.py (#128359)
Split it into two PRs since it is more than 2k lines of changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128359
Approved by: https://github.com/davidberard98
2024-06-11 19:14:04 +00:00
61f922c2ca Fix 'get_real_value' on placeholder nodes (#127698)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127698
Approved by: https://github.com/jansel
ghstack dependencies: #127695, #127696
2024-06-11 18:57:25 +00:00
984b1a8c35 Fix 'get_attr' call in dynamo 'run_node' (#127696)
Fixes #124858

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127696
Approved by: https://github.com/jansel
ghstack dependencies: #127695
2024-06-11 18:57:25 +00:00
205410cb44 add xpu to torch.tensors (#127280)
As support for Intel GPU has been upstreamed, this PR adds the XPU-related content to the torch.tensors doc.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127280
Approved by: https://github.com/svekars
2024-06-11 18:13:01 +00:00
cac7a22b92 [cuDNN][Quantization] Don't print when plan finalization fails in cuDNN quantization backend (#128177)
Similar in spirit to #125790; hopefully addresses failures seen for the cuDNN 9.1 upgrade: https://github.com/pytorch/pytorch/pull/128166

CC @nWEIdia @atalman

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128177
Approved by: https://github.com/nWEIdia, https://github.com/Skylion007
2024-06-11 18:09:25 +00:00
8a09940a54 [inductor] fix compile time regression by caching get_gpu_type (#128363)
We recently observed a significant compile time regression in torchtitan when turning
on 2D parallel + torch.compile, so I decided to get a deeper
understanding of why.

It turns out this is affecting **all the trainings** that have functional collectives
captured in the graph, not only 2D parallel (2D parallel was just the
job that happened to have collectives captured in the TP region).

The root cause is that when doing inductor lowering, we call
the comm analysis pass to get an estimated collective time for each
collective node in the graph, and for each of those checks we call
`get_gpu_type()`, which under the hood calls
`torch.utils.collect_env.run` to get the GPU info. However, this call is
super expensive! It effectively spawns a new
process and calls `nvidia-smi` to get the GPU info, so the cost is **linear**
in the number of collective nodes in the graph.

see https://github.com/pytorch/pytorch/blob/main/torch/utils/collect_env.py#L75

The fix is to add an lru cache to the function, so that we only call it
once and reuse the cached result afterwards.
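
A minimal sketch of the shape of the fix, assuming a standalone `get_gpu_type`-like helper (this is not the actual inductor code):
```python
import functools
import subprocess


@functools.lru_cache(None)
def get_gpu_type() -> str:
    # The expensive part: spawning a process to query the GPU. With the
    # cache, this runs once instead of once per collective node.
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    lines = result.stdout.strip().splitlines()
    return lines[0] if lines else "unknown"
```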

torchtitan benchmark shows:
* before this fix: 2D parallel + fp8 compile time: 6min +
* after this fix: 2D parallel + fp8 compile time: 2min 48s (more than 100% improvement)

There is more room to improve the compile time, but this PR fixes the biggest regression I have found so far.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128363
Approved by: https://github.com/yf225
2024-06-11 18:02:13 +00:00
1d233b8f50 Revert "Make nn.Module state_dict load_state_dict pre-hook and state_dict post hook public (#126704)"
This reverts commit c38b3381a12a0ec033dd417827c530c4474b8165.

Reverted https://github.com/pytorch/pytorch/pull/126704 on behalf of https://github.com/clee2000 due to broke internal typecheck D58394110 (which probably means the code wouldn't work either but I guess it didn't run on the diff). Probably an easy fix? ([comment](https://github.com/pytorch/pytorch/pull/126704#issuecomment-2161299193))
2024-06-11 17:45:20 +00:00
491c4a5dcb Revert "Make sure #126704 is BC for torch.save-ed nn.Module (#128344)"
This reverts commit 841d87177a900c2bbd59b6589165189141c4e8bb.

Reverted https://github.com/pytorch/pytorch/pull/128344 on behalf of https://github.com/clee2000 due to broke internal typecheck D58394110 (which probably means the code wouldn't work either but I guess it didn't run on the diff). Probably an easy fix? ([comment](https://github.com/pytorch/pytorch/pull/126704#issuecomment-2161299193))
2024-06-11 17:45:20 +00:00
4345d98663 [dynamo] Fix for #127696 (#128358)
Test Plan:
`buck2 test @//mode/dev-nosan //executorch/exir/backend/...`
https://www.internalfb.com/intern/testinfra/testrun/12666373989243932

Differential Revision: D58384518

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128358
Approved by: https://github.com/ydwu4
2024-06-11 16:43:15 +00:00
a838e90964 Add Intel Gaudi device/HPU to auto load in instantiate_device_type_tests (#126970)
### Motivation
The Intel Gaudi accelerator (device name hpu) is seen to have a good pass rate with the pytorch framework UTs; however, being an out-of-tree device, we face challenges in adapting the device to natively run the existing pytorch UTs under pytorch/test. The UTs, however, are a good indicator of device stack health, and as such we run them regularly with adaptations.
Although we can add the Gaudi/HPU device to generate the device-specific tests using the TORCH_TEST_DEVICES environment variable, we miss out on a lot of features such as executing for specific dtypes and skipping or overriding opInfo. With the significant changes introduced every Pytorch release, maintaining these adaptations becomes difficult and time-consuming.
Hence, with this PR we introduce the Gaudi device in the common_device_type framework, so that the tests are instantiated for Gaudi when the library is loaded.
The eventual goal is to make Gaudi out-of-tree support equivalent to that of in-tree devices.

### Changes
Add HPUTestBase of type DeviceTypeTestBase specifying appropriate attributes for Gaudi/HPU.
Include code to check if the Intel Gaudi software library is loaded and, if so, add the device to the list of devices considered for instantiation of device type tests (see the sketch below).
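
A minimal sketch of how a device registered with this framework gets its tests instantiated (the test body is illustrative):
```python
import torch
from torch.testing._internal.common_utils import TestCase, run_tests
from torch.testing._internal.common_device_type import instantiate_device_type_tests


class TestOps(TestCase):
    def test_add(self, device):
        x = torch.ones(2, device=device)
        self.assertEqual(x + x, torch.full((2,), 2.0, device=device))


# Generates TestOpsCPU, TestOpsCUDA, ... and, once HPU is registered in
# common_device_type, TestOpsHPU as well.
instantiate_device_type_tests(TestOps, globals())

if __name__ == "__main__":
    run_tests()
```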

### Additional Context
please refer the following RFC : https://github.com/pytorch/rfcs/pull/63/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126970
Approved by: https://github.com/albanD
2024-06-11 16:35:17 +00:00
29081059b6 [Static Runtime] Fix & run gen_static_runtime_ops (#128299)
gen_static_runtime_ops hasn't been updated in a while. In preparation for https://github.com/pytorch/pytorch/pull/127675 in which I need to re-run the codegen step for cumprod, I want to land these changes beforehand in case there are any other issues that arise.

I added a number of ops to the blocklist:
```
+        "_nested_tensor_storage_offsets",
+        "_nested_get_values",  # no CPU backend
+        "_nested_get_values_copy",  # no CPU backend
+        "_nested_view_from_jagged",  # testing needs to be patched
+        "_nested_view_from_jagged_copy",  # testing needs to be patched
+        "_nested_view_from_buffer",  # testing needs to be patched
+        "_nested_view_from_buffer_copy",  # testing needs to be patched
+        "_int_mm",  # testing needs to be patched
+        "_to_sparse_csc",  # testing needs to be patched
+        "_to_sparse_csr",  # testing needs to be patched
+        "segment_reduce",  # testing needs to be patched
```

Most of these are added just because testing doesn't work right now.

Additionally, a few `fft` ops seem to have been removed from native_functions.yaml; I'm guessing it's unlikely FFT would have been used in many real models though.

Differential Revision: [D58329403](https://our.internmc.facebook.com/intern/diff/D58329403/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128299
Approved by: https://github.com/YuqingJ
2024-06-11 16:27:39 +00:00
f8c45996d5 [MPS] Make erfinv compilable for bfloat16 (#128375)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128375
Approved by: https://github.com/Skylion007
ghstack dependencies: #128373
2024-06-11 16:04:11 +00:00
c13e03c874 Flip default value for mypy disallow_untyped_defs [10+2/11] (#128374)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128374
Approved by: https://github.com/Skylion007
2024-06-11 15:58:28 +00:00
053930e194 [MPS][BE] Remove code duplication (#128373)
Use `scalarToMetalTypeString` instead of `getMetalType`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128373
Approved by: https://github.com/Skylion007
2024-06-11 15:58:04 +00:00
9a38cae299 [AOTI] Switch to use shim v2 (#127674)
Differential Revision: D56709309

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127674
Approved by: https://github.com/desertfire
2024-06-11 15:01:25 +00:00
55901fb3da [fx] Preserve Fx graph node order in partitioner across runs (#115621)
Fixes #ISSUE_NUMBER
The partitioner generates a different graph on each recompilation run.
Co-authored-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115621
Approved by: https://github.com/ezyang
2024-06-11 14:04:52 +00:00
fc77fdca6f [guard_size_oblivious] Add gso ExpandUtils:_sym_to (#128224)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128224
Approved by: https://github.com/ezyang
2024-06-11 14:01:34 +00:00
648625b230 Make TraceUtils.h to be device-agnostic (#126969)
Some features of third-party devices depend on TraceUtils.h, so some of the CUDA code was removed and split into NCCLUtils files.

In addition, some common functions still remain in TraceUtils.h since I'm not sure if other devices will use them later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126969
Approved by: https://github.com/c-p-i-o
2024-06-11 08:38:07 +00:00
207c2248a8 [inductor] Fix lowering full with SymBool value (#128213)
Fixes #128161, fixes #128095

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128213
Approved by: https://github.com/lezcano
2024-06-11 08:33:35 +00:00
a206dcc79e fb_memcache: Move to fbcode from thirdparty (#128174)
Summary: The fb_memcache injection location and path are changing.

Test Plan: Existing tests should pass.

Reviewed By: bertmaher, oulgen

Differential Revision: D57973772

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128174
Approved by: https://github.com/oulgen
2024-06-11 07:46:12 +00:00
f2d7f235a6 [dynamo][yolov3] Track UnspecializedNNModuleVariable for mutation (#128269)
Fixes https://github.com/pytorch/pytorch/issues/101168

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128269
Approved by: https://github.com/jansel
ghstack dependencies: #128295, #126578, #128268, #128254
2024-06-11 07:09:04 +00:00
402b289f3b Properly register parameter for binary folding test (#128356)
This PR properly registers the tensor used in the module compute as a parameter. This bug was hidden previously because all tensors on nn modules would be considered constant by dynamo; with inlining of NN modules, this is no longer the case.
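
A minimal sketch of the difference (module and attribute names are illustrative):
```python
import torch
import torch.nn as nn


class Mod(nn.Module):
    def __init__(self):
        super().__init__()
        # Properly registered: shows up in parameters()/state_dict().
        self.weight = nn.Parameter(torch.randn(4, 4))
        # Plain tensor attribute: not registered as a parameter; with
        # inlined NN modules dynamo no longer treats it as a constant.
        self.offset = torch.randn(4, 4)

    def forward(self, x):
        return x @ self.weight + self.offset
```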

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128356
Approved by: https://github.com/anijain2305
ghstack dependencies: #128355
2024-06-11 06:48:26 +00:00
a32157c67c Mark params static if inlining modules and freezing (#128355)
Today, inlining builtin nn modules is not compatible with parameter freezing. Freezing parameters and then constant folding them through the graph relies on the assumption that they will not be inputs and will be static across calls to the same graph. When inlining builtin nn modules this assumption is broken and we reuse the same graph for different instances of the same nn module. There are three options: 1) abandon constant folding, 2) create a dispatcher layer (like cudagraphs) which will dispatch to the correct constant-folded graph for each distinct set of parameters, or 3) recompile.

This PR implements option 3 by introducing guards on the parameter pointers, since freezing is relatively rare and performance sensitive. Option 2 had many more unknowns, and option 1 is not viable due to the drop in performance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128355
Approved by: https://github.com/anijain2305
2024-06-11 06:48:26 +00:00
24e7f29099 Lowering for avg_pool_3d_backward (Fixes:#127101) (#127722)
We implemented a lowering for the avg_pool3d_backward operation and created tests for it.
We ran some benchmarks and achieved the following results:

```
[-------------- avgpool_3d_backwards --------------]
                             |  Decomposed  |  Eager
16 threads: ----------------------------------------
      (3, 5, 400, 200, 200)  |     6061     |  11160
      (3, 5, 300, 200, 200)  |     4547     |   8372
      (3, 5, 200, 200, 200)  |     3032     |   5585
      (3, 5, 300, 300, 300)  |    10100     |  18840
      (3, 5, 100, 100, 100)  |      381     |    703
      (3, 5, 100, 300, 200)  |     2270     |   4190
      (8, 8, 128, 128, 128)  |     3397     |   6253
      (2, 3, 150, 150, 150)  |      520     |    947
      (1, 3, 128, 128, 128)  |      161     |    299
      (8, 16, 64, 64, 64)    |      851     |   1569
      (1, 1, 50, 50, 50)     |       17     |     11
      (3, 5, 20, 40, 40)     |       17     |     30
      (3, 5, 10, 20, 20)     |       17     |     11
      (1, 1, 10, 10, 10)     |       16     |     11
      (3, 5, 5, 10, 10)      |       17     |     11
      (3, 5, 2, 5, 5)        |       17     |     11
```
These were run on an RTX 3050, so we were not able to allocate larger tensors due to memory limitations.
We believe it would be beneficial to benchmark this on more recent hardware, just to check if the performance holds up with larger sizes.

Furthermore, we also refactored code from adaptive_avg_pool2d and adaptive_max_pool2d to reduce code duplication.
We diffed the kernels and they are identical.
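
A rough sketch of how such a comparison can be timed, assuming a CUDA device; the shape and harness here are assumptions, not the exact setup behind the table above:
```python
import torch
import torch.nn.functional as F
from torch.utils import benchmark

x = torch.randn(3, 5, 100, 100, 100, device="cuda", requires_grad=True)

def step(fn):
    out = fn(x)
    out.backward(torch.ones_like(out))
    x.grad = None

eager = lambda t: F.avg_pool3d(t, kernel_size=3)
compiled = torch.compile(eager)  # compiled backward can use the decomposed lowering

for name, fn in [("eager", eager), ("decomposed", compiled)]:
    timer = benchmark.Timer(stmt="step(fn)", globals={"step": step, "fn": fn})
    print(name, timer.timeit(20))
```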

Fixes #127101

Co-authored-by: Martim Mendes <martimccmendes@tecnico.ulisboa.pt>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127722
Approved by: https://github.com/jansel
2024-06-11 06:39:04 +00:00
5b5d269d34 Speed up fx graph iteration by implementing it in C++ (#128288)
Before this change
```
python benchmarks/dynamo/microbenchmarks/fx_microbenchmarks.py
iterating over 100000000 FX nodes took 19.5s (5132266 nodes/s)
```

After this change
```
python benchmarks/dynamo/microbenchmarks/fx_microbenchmarks.py
iterating over 100000000 FX nodes took 3.4s (29114001 nodes/s)
```

5.7x improvement
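
For context, a minimal sketch of the loop the microbenchmark exercises (the traced function is illustrative):
```python
import torch
import torch.fx


def f(x):
    return torch.relu(x) + 1


gm = torch.fx.symbolic_trace(f)

# Iterating gm.graph.nodes is what the benchmark above measures; the C++
# implementation makes this loop faster without changing its API.
for node in gm.graph.nodes:
    print(node.op, node.name, node.target)
```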

Differential Revision: [D58343997](https://our.internmc.facebook.com/intern/diff/D58343997)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128288
Approved by: https://github.com/jansel, https://github.com/albanD
2024-06-11 05:48:31 +00:00
fa88f390a0 Revert "[inductor] enable fx graph cache on torchbench (#128239)"
This reverts commit 734e8f6ad7e7f0fa0341fb658f1f986225173f5f.

Reverted https://github.com/pytorch/pytorch/pull/128239 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to surface a bunch of inductor failures in trunk 734e8f6ad7 ([comment](https://github.com/pytorch/pytorch/pull/128239#issuecomment-2159789242))
2024-06-11 04:53:38 +00:00
fe39c07826 [pipelining][doc] Remove duplicated words (#128368)
"for execution" is used in both step titles

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128368
Approved by: https://github.com/wconstab
ghstack dependencies: #128361
2024-06-11 04:52:57 +00:00
cba195c8ed Support aten operations with out tensor (#124926)
This PR intends to support aten operations with an `out` tensor.

Currently, the AOT compile always does **NOT** keep input tensor mutations. According to the comments, this is because it has not encountered such a use case.
> For now there's no use case involving keeping input mutations in the graph (which we can only do in the inference case anyway). We can add this later if we need to.

However, for aten operations it is common that the `out` tensor is an input parameter and needs to be mutated. This PR supports this by adding a `keep_inference_input_mutations` flag (`aot_inductor.keep_inference_input_mutations`). This flag gives the callee the flexibility to decide whether the AOT compile needs to keep input tensor mutations in the graph.

Take `clamp` as an example:
```python
out_tensor = torch.randn(128, dtype=torch.float, device=device).fill_(-2.0)
inp_tensor = torch.randn(128, dtype=torch.float, device=device).fill_(1.0)
min_tensor = inp_tensor - 0.05
max_tensor = inp_tensor + 0.05
torch.clamp(input=inp_tensor, min=min_tensor, max=max_tensor, out=out_tensor)
```

W/O this PR
```python
def forward(self):
    arg0_1: "f32[128]"; arg1_1: "f32[128]"; arg2_1: "f32[128]"; arg3_1: "f32[128]";

    arg0_1, arg1_1, arg2_1, arg3_1, = fx_pytree.tree_flatten_spec([], self._in_spec)
    clamp_min: "f32[128]" = torch.ops.aten.clamp_min.Tensor(arg0_1, arg1_1);  arg0_1 = arg1_1 = None
    clamp_max: "f32[128]" = torch.ops.aten.clamp_max.Tensor(clamp_min, arg2_1);  clamp_min = arg2_1 = None
    return (clamp_max, clamp_max)
```

W/ this PR
```python
def forward(self):
    arg0_1: "f32[128]"; arg1_1: "f32[128]"; arg2_1: "f32[128]"; arg3_1: "f32[128]";

    arg0_1, arg1_1, arg2_1, arg3_1, = fx_pytree.tree_flatten_spec([], self._in_spec)
    clamp_min: "f32[128]" = torch.ops.aten.clamp_min.Tensor(arg0_1, arg1_1);  arg0_1 = arg1_1 = None
    clamp_max: "f32[128]" = torch.ops.aten.clamp_max.Tensor(clamp_min, arg2_1);  clamp_min = arg2_1 = None
    copy_: "f32[128]" = torch.ops.aten.copy_.default(arg3_1, clamp_max);  arg3_1 = clamp_max = None
    return (copy_,)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124926
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/angelayi
2024-06-11 04:35:27 +00:00
16e67be7f1 Also preserve unbacked SymInts when partitioning as backward inputs (#128338)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128338
Approved by: https://github.com/IvanKobzarev
2024-06-11 04:27:09 +00:00
7afffdf48b [CI] Comment hf_T5_generate, hf_GPT2 and timm_efficientnet in inductor cpu smoketest for performance unstable issue (#127588)
Fixes #126993

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127588
Approved by: https://github.com/chuanqi129, https://github.com/jgong5, https://github.com/desertfire
2024-06-11 03:12:11 +00:00
ca45649eb5 [easy][dynamo][inline work] Fix test with inlining inbuilt nn modules (#128254)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128254
Approved by: https://github.com/williamwen42
ghstack dependencies: #128295, #126578, #128268
2024-06-11 03:02:51 +00:00
665e568381 [inductor][inlining nn module] Skip batchnorm version check test for inlining (#128268)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128268
Approved by: https://github.com/zou3519
ghstack dependencies: #128295, #126578
2024-06-11 03:02:51 +00:00
4077cdd589 [pipelining][doc] Update arg list of pipeline API (#128361)
And document the use of `build_stage` API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128361
Approved by: https://github.com/wconstab
2024-06-11 02:55:17 +00:00
cyy
e4bd0adca5 [6/N] Remove unused functions (#128309)
Follows #127185

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128309
Approved by: https://github.com/ezyang
2024-06-11 02:46:33 +00:00
793df7b7cb Prevent expansion of cat indexing to avoid int64 intermediate (#127815)
Fix for https://github.com/pytorch/pytorch/issues/127652

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127815
Approved by: https://github.com/shunting314, https://github.com/peterbell10
2024-06-11 02:41:07 +00:00
d1d9bc7aa6 init add comment (#128083)
Fixes #127898

### Description

Add docstring to torch/onnx/symbolic_opset9.py:sigmoid function

### Checklist
- [x] The issue that is being fixed is referred in the description
- [x] Only one issue is addressed in this pull request
- [x] Labels from the issue that this PR is fixing are added to this pull request
- [x] No unnecessary issues are included into this pull request

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128083
Approved by: https://github.com/titaiwangms
2024-06-11 02:37:04 +00:00
841d87177a Make sure #126704 is BC for torch.save-ed nn.Module (#128344)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128344
Approved by: https://github.com/albanD
ghstack dependencies: #126906, #126704
2024-06-11 02:26:06 +00:00
3b555ba477 Add docstring for torch.utils.data.datapipes.decoder.basicandlers (#128018)
Fixes #127912

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128018
Approved by: https://github.com/andrewkho
2024-06-11 01:32:45 +00:00
734e8f6ad7 [inductor] enable fx graph cache on torchbench (#128239)
Summary: We've already enabled this for timm and huggingface, but we had failures saving cache entries for moco. It looks like https://github.com/pytorch/pytorch/pull/128052 has fixed that issue, so we can enable it for torchbench.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128239
Approved by: https://github.com/oulgen
2024-06-11 00:40:31 +00:00
cyy
99f5a85a09 [Clang Tidy] Fix misc-header-include-cycle errors in clang-tidy and ignore some files (#127233)
There are such cycles in libfmt and PyTorch, which are detected by clang-tidy:
```
/home/cyy/pytorch/third_party/fmt/include/fmt/format-inl.h:25:10: error: circular header file dependency detected while including 'format.h', please check the include path [misc-header-include-cycle,-warnings-as-errors]
   25 | #include "format.h"
      |          ^
/home/cyy/pytorch/third_party/fmt/include/fmt/format.h:4530:12: note: 'format-inl.h' included from here
 4530 | #  include "format-inl.h"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127233
Approved by: https://github.com/ezyang
2024-06-10 23:49:58 +00:00
f843ccbb1a [MTIA] Add set_device support (#128040)
Summary: Support set_device API in MTIA backend.

Reviewed By: gnahzg

Differential Revision: D58089498

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128040
Approved by: https://github.com/gnahzg
2024-06-10 23:42:52 +00:00
cyy
30875953a4 [1/N] Remove inclusion of c10/util/string_utils.h (#128300)
This is a first step toward removing it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128300
Approved by: https://github.com/ezyang, https://github.com/eqy
2024-06-10 23:40:47 +00:00
cyy
2126ae186e Remove caffe2/perfkernels files (#128186)
These files are not used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128186
Approved by: https://github.com/ezyang, https://github.com/r-barnes
2024-06-10 23:40:18 +00:00
739aa224ec [Fix] Parameter un/lifting issues in the TorchScript to ExportedProgram converter (#127975)
This PR fixes issues related to parameter and input lifting in the converter.

#### Issue 1
```
> Graph[linear.weights, bias.weights, x.1]
%1 ...
%2 ...
%3 = CreateObject()

	> Block 0[]
        %linear.0 = GetAttr(linear)[%3]

	             > Block 0.0[]
	             %weight.0 = GetAttr(weights)[%linear.0]

	> Block 1[]
	...
```
* Model parameters for the top level module should be unlifted, while parameters from sub-blocks should be lifted.
#### Fixes
* Bottom-up traversal (i.e., start from the innermost block) to figure out which parameters should be lifted for sub-blocks.

#### Test Plan
* Add test cases for nested block without control flow `pytest test/export/test_converter.py -s -k test_convert_nn_module_with_nested_param`
* Add test cases for nested block with control flow `pytest test/export/test_converter.py -s -k test_convert_nn_module_with_nested_if_and_param`

#### Outcome
##### TorchScript
```
graph(%x.1 : Float(3, strides=[1], requires_grad=0, device=cpu),
      %m1.m1.linear.weight : Float(3, 3, strides=[3, 1], requires_grad=0, device=cpu),
      %m1.m1.linear.bias : Float(3, strides=[1], requires_grad=0, device=cpu),
      %m1.linear.weight : Float(3, 3, strides=[3, 1], requires_grad=0, device=cpu),
      %m1.linear.bias : Float(3, strides=[1], requires_grad=0, device=cpu),
      %m1.m2.linear.weight : Float(3, 3, strides=[3, 1], requires_grad=0, device=cpu),
      %m1.m2.linear.bias : Float(3, strides=[1], requires_grad=0, device=cpu),
      %linear.weight : Float(3, 3, strides=[3, 1], requires_grad=0, device=cpu),
      %linear.bias : Float(3, strides=[1], requires_grad=0, device=cpu),
      %m2.m1.linear.weight : Float(3, 3, strides=[3, 1], requires_grad=0, device=cpu),
      %m2.m1.linear.bias : Float(3, strides=[1], requires_grad=0, device=cpu),
      %m2.linear.weight : Float(3, 3, strides=[3, 1], requires_grad=0, device=cpu),
      %m2.linear.bias : Float(3, strides=[1], requires_grad=0, device=cpu),
      %m2.m2.linear.weight : Float(3, 3, strides=[3, 1], requires_grad=0, device=cpu),
      %m2.m2.linear.bias : Float(3, strides=[1], requires_grad=0, device=cpu)):
  %15 : __torch__.export.test_converter.___torch_mangle_14.SuperNestedM1 = prim::CreateObject()
  %16 : NoneType = prim::Constant(), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1
  %17 : int = prim::Constant[value=1](), scope: export.test_converter.SuperNestedM1:: # /data/users/jiashenc/pytorch/test/export/test_converter.py:342:34
  %18 : Tensor = aten::max(%x.1), scope: export.test_converter.SuperNestedM1:: # /data/users/jiashenc/pytorch/test/export/test_converter.py:342:19
  %19 : Tensor = aten::gt(%18, %17), scope: export.test_converter.SuperNestedM1:: # /data/users/jiashenc/pytorch/test/export/test_converter.py:342:19
  %20 : bool = aten::Bool(%19), scope: export.test_converter.SuperNestedM1:: # /data/users/jiashenc/pytorch/test/export/test_converter.py:342:19
  %21 : Tensor = prim::If(%20), scope: export.test_converter.SuperNestedM1:: # /data/users/jiashenc/pytorch/test/export/test_converter.py:342:16
    block0():
      %linear.6 : __torch__.torch.nn.modules.linear.___torch_mangle_17.Linear = prim::GetAttr[name="linear"](%15), scope: export.test_converter.SuperNestedM1::
      %m1.1 : __torch__.export.test_converter.___torch_mangle_15.NestedM = prim::GetAttr[name="m1"](%15), scope: export.test_converter.SuperNestedM1::
      %24 : Tensor = aten::sum(%x.1, %16), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 # /data/users/jiashenc/pytorch/test/export/test_converter.py:327:19
      %25 : Tensor = aten::gt(%24, %17), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 # /data/users/jiashenc/pytorch/test/export/test_converter.py:327:19
      %26 : bool = aten::Bool(%25), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 # /data/users/jiashenc/pytorch/test/export/test_converter.py:327:19
      %27 : Tensor = prim::If(%26), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 # /data/users/jiashenc/pytorch/test/export/test_converter.py:327:16
        block0():
          %linear.10 : __torch__.torch.nn.modules.linear.___torch_mangle_17.Linear = prim::GetAttr[name="linear"](%m1.1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1
          %m1.3 : __torch__.export.test_converter.___torch_mangle_16.M = prim::GetAttr[name="m1"](%m1.1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1
          %linear.12 : __torch__.torch.nn.modules.linear.___torch_mangle_17.Linear = prim::GetAttr[name="linear"](%m1.3), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1
          %weight.4 : Tensor = prim::GetAttr[name="weight"](%linear.12), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1
          %bias.4 : Tensor = prim::GetAttr[name="bias"](%linear.12), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1
          %33 : Tensor = aten::linear(%x.1, %weight.4, %bias.4), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 # /data/users/jiashenc/pytorch/torch/nn/modules/linear.py:116:15
          %weight.6 : Tensor = prim::GetAttr[name="weight"](%linear.10), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1
          %bias.6 : Tensor = prim::GetAttr[name="bias"](%linear.10), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1
          %36 : Tensor = aten::linear(%33, %weight.6, %bias.6), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 # /data/users/jiashenc/pytorch/torch/nn/modules/linear.py:116:15
          -> (%36)
        block1():
          %linear.14 : __torch__.torch.nn.modules.linear.___torch_mangle_17.Linear = prim::GetAttr[name="linear"](%m1.1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1
          %m2.3 : __torch__.export.test_converter.___torch_mangle_16.M = prim::GetAttr[name="m2"](%m1.1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1
          %linear.16 : __torch__.torch.nn.modules.linear.___torch_mangle_17.Linear = prim::GetAttr[name="linear"](%m2.3), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1
          %weight.8 : Tensor = prim::GetAttr[name="weight"](%linear.16), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1
          %bias.8 : Tensor = prim::GetAttr[name="bias"](%linear.16), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1
          %42 : Tensor = aten::linear(%x.1, %weight.8, %bias.8), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 # /data/users/jiashenc/pytorch/torch/nn/modules/linear.py:116:15
          %weight.2 : Tensor = prim::GetAttr[name="weight"](%linear.14), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1
          %bias.2 : Tensor = prim::GetAttr[name="bias"](%linear.14), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1
          %45 : Tensor = aten::linear(%42, %weight.2, %bias.2), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 # /data/users/jiashenc/pytorch/torch/nn/modules/linear.py:116:15
          -> (%45)
      %weight.10 : Tensor = prim::GetAttr[name="weight"](%linear.6), scope: export.test_converter.SuperNestedM1::/torch.nn.modules.linear.Linear::linear
      %bias.10 : Tensor = prim::GetAttr[name="bias"](%linear.6), scope: export.test_converter.SuperNestedM1::/torch.nn.modules.linear.Linear::linear
      %48 : Tensor = aten::linear(%27, %weight.10, %bias.10), scope: export.test_converter.SuperNestedM1::/torch.nn.modules.linear.Linear::linear # /data/users/jiashenc/pytorch/torch/nn/modules/linear.py:116:15
      -> (%48)
    block1():
      %linear.8 : __torch__.torch.nn.modules.linear.___torch_mangle_17.Linear = prim::GetAttr[name="linear"](%15), scope: export.test_converter.SuperNestedM1::
      %m2.1 : __torch__.export.test_converter.___torch_mangle_15.NestedM = prim::GetAttr[name="m2"](%15), scope: export.test_converter.SuperNestedM1::
      %51 : Tensor = aten::sum(%x.1, %16), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 # /data/users/jiashenc/pytorch/test/export/test_converter.py:327:19
      %52 : Tensor = aten::gt(%51, %17), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 # /data/users/jiashenc/pytorch/test/export/test_converter.py:327:19
      %53 : bool = aten::Bool(%52), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 # /data/users/jiashenc/pytorch/test/export/test_converter.py:327:19
      %54 : Tensor = prim::If(%53), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 # /data/users/jiashenc/pytorch/test/export/test_converter.py:327:16
        block0():
          %linear.1 : __torch__.torch.nn.modules.linear.___torch_mangle_17.Linear = prim::GetAttr[name="linear"](%m2.1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2
          %m1 : __torch__.export.test_converter.___torch_mangle_16.M = prim::GetAttr[name="m1"](%m2.1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2
          %linear.5 : __torch__.torch.nn.modules.linear.___torch_mangle_17.Linear = prim::GetAttr[name="linear"](%m1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2
          %weight.1 : Tensor = prim::GetAttr[name="weight"](%linear.5), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2
          %bias.1 : Tensor = prim::GetAttr[name="bias"](%linear.5), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2
          %60 : Tensor = aten::linear(%x.1, %weight.1, %bias.1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 # /data/users/jiashenc/pytorch/torch/nn/modules/linear.py:116:15
          %weight.3 : Tensor = prim::GetAttr[name="weight"](%linear.1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2
          %bias.3 : Tensor = prim::GetAttr[name="bias"](%linear.1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2
          %63 : Tensor = aten::linear(%60, %weight.3, %bias.3), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 # /data/users/jiashenc/pytorch/torch/nn/modules/linear.py:116:15
          -> (%63)
        block1():
          %linear.3 : __torch__.torch.nn.modules.linear.___torch_mangle_17.Linear = prim::GetAttr[name="linear"](%m2.1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2
          %m2 : __torch__.export.test_converter.___torch_mangle_16.M = prim::GetAttr[name="m2"](%m2.1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2
          %linear : __torch__.torch.nn.modules.linear.___torch_mangle_17.Linear = prim::GetAttr[name="linear"](%m2), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2
          %weight.5 : Tensor = prim::GetAttr[name="weight"](%linear), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2
          %bias.5 : Tensor = prim::GetAttr[name="bias"](%linear), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2
          %69 : Tensor = aten::linear(%x.1, %weight.5, %bias.5), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 # /data/users/jiashenc/pytorch/torch/nn/modules/linear.py:116:15
          %weight.12 : Tensor = prim::GetAttr[name="weight"](%linear.3), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2
          %bias.12 : Tensor = prim::GetAttr[name="bias"](%linear.3), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2
          %72 : Tensor = aten::linear(%69, %weight.12, %bias.12), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 # /data/users/jiashenc/pytorch/torch/nn/modules/linear.py:116:15
          -> (%72)
      %weight : Tensor = prim::GetAttr[name="weight"](%linear.8), scope: export.test_converter.SuperNestedM1::/torch.nn.modules.linear.Linear::linear
      %bias : Tensor = prim::GetAttr[name="bias"](%linear.8), scope: export.test_converter.SuperNestedM1::/torch.nn.modules.linear.Linear::linear
      %75 : Tensor = aten::linear(%54, %weight, %bias), scope: export.test_converter.SuperNestedM1::/torch.nn.modules.linear.Linear::linear # /data/users/jiashenc/pytorch/torch/nn/modules/linear.py:116:15
      -> (%75)
  return (%21)
```
##### ExportedProgram
```
ExportedProgram:
    class GraphModule(torch.nn.Module):
        def forward(self, p_linear_weight: "f32[3, 3]", p_linear_bias: "f32[3]", p_m1_linear_weight: "f32[3, 3]", p_m1_linear_bias: "f32[3]", p_m1_m1_linear_weight: "f32[3, 3]", p_m1_m1_linear_bias: "f32[3]", p_m1_m2_linear_weight: "f32[3, 3]", p_m1_m2_linear_bias: "f32[3]", p_m2_linear_weight: "f32[3, 3]", p_m2_linear_bias: "f32[3]", p_m2_m1_linear_weight: "f32[3, 3]", p_m2_m1_linear_bias: "f32[3]", p_m2_m2_linear_weight: "f32[3, 3]", p_m2_m2_linear_bias: "f32[3]", x_1: "f32[3]"):
            # No stacktrace found for following nodes
            max_1: "f32[]" = torch.ops.aten.max.default(x_1)
            gt: "b8[]" = torch.ops.aten.gt.Scalar(max_1, 1);  max_1 = None

            # File: <eval_with_key>.137:23 in forward, code: cond = torch.ops.higher_order.cond(l_args_0_, cond_true_2, cond_false_2, [l_args_3_0_, l_args_3_13_, l_args_3_5_, l_args_3_12_, l_args_3_14_, l_args_3_1_, l_args_3_3_, l_args_3_4_, l_args_3_7_, l_args_3_10_, l_args_3_11_, l_args_3_2_, l_args_3_6_, l_args_3_8_, l_args_3_9_]);  l_args_0_ = cond_true_2 = cond_false_2 = l_args_3_0_ = l_args_3_13_ = l_args_3_5_ = l_args_3_12_ = l_args_3_14_ = l_args_3_1_ = l_args_3_3_ = l_args_3_4_ = l_args_3_7_ = l_args_3_10_ = l_args_3_11_ = l_args_3_2_ = l_args_3_6_ = l_args_3_8_ = l_args_3_9_ = None
            true_graph_0 = self.true_graph_0
            false_graph_0 = self.false_graph_0
            conditional = torch.ops.higher_order.cond(gt, true_graph_0, false_graph_0, [p_linear_weight, p_linear_bias, x_1, p_m1_linear_weight, p_m1_m1_linear_bias, p_m1_linear_bias, p_m1_m2_linear_weight, p_m1_m2_linear_bias, p_m1_m1_linear_weight, p_m2_m2_linear_bias, p_m2_m1_linear_weight, p_m2_linear_weight, p_m2_m1_linear_bias, p_m2_m2_linear_weight, p_m2_linear_bias]);  gt = true_graph_0 = false_graph_0 = p_linear_weight = p_linear_bias = x_1 = p_m1_linear_weight = p_m1_m1_linear_bias = p_m1_linear_bias = p_m1_m2_linear_weight = p_m1_m2_linear_bias = p_m1_m1_linear_weight = p_m2_m2_linear_bias = p_m2_m1_linear_weight = p_m2_linear_weight = p_m2_m1_linear_bias = p_m2_m2_linear_weight = p_m2_linear_bias = None
            getitem: "f32[3]" = conditional[0];  conditional = None
            return (getitem,)

        class <lambda>(torch.nn.Module):
            def forward(self, p_linear_weight: "f32[3, 3]", p_linear_bias: "f32[3]", x_1: "f32[3]", p_m1_linear_weight: "f32[3, 3]", p_m1_m1_linear_bias: "f32[3]", p_m1_linear_bias: "f32[3]", p_m1_m2_linear_weight: "f32[3, 3]", p_m1_m2_linear_bias: "f32[3]", p_m1_m1_linear_weight: "f32[3, 3]", p_m2_m2_linear_bias: "f32[3]", p_m2_m1_linear_weight: "f32[3, 3]", p_m2_linear_weight: "f32[3, 3]", p_m2_m1_linear_bias: "f32[3]", p_m2_m2_linear_weight: "f32[3, 3]", p_m2_linear_bias: "f32[3]"):
                # File: <eval_with_key>.134:8 in forward, code: sum_default = torch.ops.aten.sum.default(l_args_3_5__1, dtype = None)
                sum_1: "f32[]" = torch.ops.aten.sum.default(x_1)

                # File: <eval_with_key>.134:9 in forward, code: gt_scalar = torch.ops.aten.gt.Scalar(sum_default, 1);  sum_default = None
                gt: "b8[]" = torch.ops.aten.gt.Scalar(sum_1, 1);  sum_1 = None

                # File: <eval_with_key>.134:12 in forward, code: cond = torch.ops.higher_order.cond(gt_scalar, cond_true_0, cond_false_0, [l_args_3_12__true_branch, l_args_3_1__true_branch, l_args_3_5__1, l_args_3_14__true_branch, l_args_3_7__true_branch, l_args_3_3__true_branch, l_args_3_4__true_branch]);  gt_scalar = cond_true_0 = cond_false_0 = l_args_3_12__true_branch = l_args_3_1__true_branch = l_args_3_5__1 = l_args_3_14__true_branch = l_args_3_7__true_branch = l_args_3_3__true_branch = l_args_3_4__true_branch = None
                true_graph_0 = self.true_graph_0
                false_graph_0 = self.false_graph_0
                conditional = torch.ops.higher_order.cond(gt, true_graph_0, false_graph_0, [p_m1_linear_weight, p_m1_linear_bias, x_1, p_m1_m1_linear_bias, p_m1_m1_linear_weight, p_m1_m2_linear_weight, p_m1_m2_linear_bias]);  gt = true_graph_0 = false_graph_0 = p_m1_linear_weight = p_m1_linear_bias = x_1 = p_m1_m1_linear_bias = p_m1_m1_linear_weight = p_m1_m2_linear_weight = p_m1_m2_linear_bias = None
                getitem: "f32[3]" = conditional[0];  conditional = None

                # File: <eval_with_key>.134:14 in forward, code: linear_default = torch.ops.aten.linear.default(getitem, l_args_3_0__1, l_args_3_13__1);  getitem = l_args_3_0__1 = l_args_3_13__1 = None
                linear: "f32[3]" = torch.ops.aten.linear.default(getitem, p_linear_weight, p_linear_bias);  getitem = p_linear_weight = p_linear_bias = None
                return (linear,)

            class <lambda>(torch.nn.Module):
                def forward(self, p_m1_linear_weight: "f32[3, 3]", p_m1_linear_bias: "f32[3]", x_1: "f32[3]", p_m1_m1_linear_bias: "f32[3]", p_m1_m1_linear_weight: "f32[3, 3]", p_m1_m2_linear_weight: "f32[3, 3]", p_m1_m2_linear_bias: "f32[3]"):
                    # File: <eval_with_key>.130:8 in forward, code: linear_default = torch.ops.aten.linear.default(l_args_3_5__1, l_args_3_7__true_branch, l_args_3_14__true_branch);  l_args_3_5__1 = l_args_3_7__true_branch = l_args_3_14__true_branch = None
                    linear: "f32[3]" = torch.ops.aten.linear.default(x_1, p_m1_m1_linear_weight, p_m1_m1_linear_bias);  x_1 = p_m1_m1_linear_weight = p_m1_m1_linear_bias = None

                    # File: <eval_with_key>.130:9 in forward, code: linear_default_1 = torch.ops.aten.linear.default(linear_default, l_args_3_12__1, l_args_3_1__1);  linear_default = l_args_3_12__1 = l_args_3_1__1 = None
                    linear_1: "f32[3]" = torch.ops.aten.linear.default(linear, p_m1_linear_weight, p_m1_linear_bias);  linear = p_m1_linear_weight = p_m1_linear_bias = None
                    return (linear_1,)

            class <lambda>(torch.nn.Module):
                def forward(self, p_m1_linear_weight: "f32[3, 3]", p_m1_linear_bias: "f32[3]", x_1: "f32[3]", p_m1_m1_linear_bias: "f32[3]", p_m1_m1_linear_weight: "f32[3, 3]", p_m1_m2_linear_weight: "f32[3, 3]", p_m1_m2_linear_bias: "f32[3]"):
                    # File: <eval_with_key>.131:8 in forward, code: linear_default = torch.ops.aten.linear.default(l_args_3_5__1, l_args_3_3__false_branch, l_args_3_4__false_branch);  l_args_3_5__1 = l_args_3_3__false_branch = l_args_3_4__false_branch = None
                    linear: "f32[3]" = torch.ops.aten.linear.default(x_1, p_m1_m2_linear_weight, p_m1_m2_linear_bias);  x_1 = p_m1_m2_linear_weight = p_m1_m2_linear_bias = None

                    # File: <eval_with_key>.131:9 in forward, code: linear_default_1 = torch.ops.aten.linear.default(linear_default, l_args_3_12__1, l_args_3_1__1);  linear_default = l_args_3_12__1 = l_args_3_1__1 = None
                    linear_1: "f32[3]" = torch.ops.aten.linear.default(linear, p_m1_linear_weight, p_m1_linear_bias);  linear = p_m1_linear_weight = p_m1_linear_bias = None
                    return (linear_1,)

        class <lambda>(torch.nn.Module):
            def forward(self, p_linear_weight: "f32[3, 3]", p_linear_bias: "f32[3]", x_1: "f32[3]", p_m1_linear_weight: "f32[3, 3]", p_m1_m1_linear_bias: "f32[3]", p_m1_linear_bias: "f32[3]", p_m1_m2_linear_weight: "f32[3, 3]", p_m1_m2_linear_bias: "f32[3]", p_m1_m1_linear_weight: "f32[3, 3]", p_m2_m2_linear_bias: "f32[3]", p_m2_m1_linear_weight: "f32[3, 3]", p_m2_linear_weight: "f32[3, 3]", p_m2_m1_linear_bias: "f32[3]", p_m2_m2_linear_weight: "f32[3, 3]", p_m2_linear_bias: "f32[3]"):
                # File: <eval_with_key>.135:8 in forward, code: sum_default = torch.ops.aten.sum.default(l_args_3_5__1, dtype = None)
                sum_1: "f32[]" = torch.ops.aten.sum.default(x_1)

                # File: <eval_with_key>.135:9 in forward, code: gt_scalar = torch.ops.aten.gt.Scalar(sum_default, 1);  sum_default = None
                gt: "b8[]" = torch.ops.aten.gt.Scalar(sum_1, 1);  sum_1 = None

                # File: <eval_with_key>.135:12 in forward, code: cond = torch.ops.higher_order.cond(gt_scalar, cond_true_1, cond_false_1, [l_args_3_2__false_branch, l_args_3_5__1, l_args_3_9__false_branch, l_args_3_11__false_branch, l_args_3_6__false_branch, l_args_3_10__false_branch, l_args_3_8__false_branch]);  gt_scalar = cond_true_1 = cond_false_1 = l_args_3_2__false_branch = l_args_3_5__1 = l_args_3_9__false_branch = l_args_3_11__false_branch = l_args_3_6__false_branch = l_args_3_10__false_branch = l_args_3_8__false_branch = None
                true_graph_0 = self.true_graph_0
                false_graph_0 = self.false_graph_0
                conditional = torch.ops.higher_order.cond(gt, true_graph_0, false_graph_0, [p_m2_linear_weight, x_1, p_m2_linear_bias, p_m2_m1_linear_weight, p_m2_m1_linear_bias, p_m2_m2_linear_bias, p_m2_m2_linear_weight]);  gt = true_graph_0 = false_graph_0 = p_m2_linear_weight = x_1 = p_m2_linear_bias = p_m2_m1_linear_weight = p_m2_m1_linear_bias = p_m2_m2_linear_bias = p_m2_m2_linear_weight = None
                getitem: "f32[3]" = conditional[0];  conditional = None

                # File: <eval_with_key>.135:14 in forward, code: linear_default = torch.ops.aten.linear.default(getitem, l_args_3_0__1, l_args_3_13__1);  getitem = l_args_3_0__1 = l_args_3_13__1 = None
                linear: "f32[3]" = torch.ops.aten.linear.default(getitem, p_linear_weight, p_linear_bias);  getitem = p_linear_weight = p_linear_bias = None
                return (linear,)

            class <lambda>(torch.nn.Module):
                def forward(self, p_m2_linear_weight: "f32[3, 3]", x_1: "f32[3]", p_m2_linear_bias: "f32[3]", p_m2_m1_linear_weight: "f32[3, 3]", p_m2_m1_linear_bias: "f32[3]", p_m2_m2_linear_bias: "f32[3]", p_m2_m2_linear_weight: "f32[3, 3]"):
                    # File: <eval_with_key>.132:8 in forward, code: linear_default = torch.ops.aten.linear.default(l_args_3_5__1, l_args_3_11__true_branch, l_args_3_6__true_branch);  l_args_3_5__1 = l_args_3_11__true_branch = l_args_3_6__true_branch = None
                    linear: "f32[3]" = torch.ops.aten.linear.default(x_1, p_m2_m1_linear_weight, p_m2_m1_linear_bias);  x_1 = p_m2_m1_linear_weight = p_m2_m1_linear_bias = None

                    # File: <eval_with_key>.132:9 in forward, code: linear_default_1 = torch.ops.aten.linear.default(linear_default, l_args_3_2__1, l_args_3_9__1);  linear_default = l_args_3_2__1 = l_args_3_9__1 = None
                    linear_1: "f32[3]" = torch.ops.aten.linear.default(linear, p_m2_linear_weight, p_m2_linear_bias);  linear = p_m2_linear_weight = p_m2_linear_bias = None
                    return (linear_1,)

            class <lambda>(torch.nn.Module):
                def forward(self, p_m2_linear_weight: "f32[3, 3]", x_1: "f32[3]", p_m2_linear_bias: "f32[3]", p_m2_m1_linear_weight: "f32[3, 3]", p_m2_m1_linear_bias: "f32[3]", p_m2_m2_linear_bias: "f32[3]", p_m2_m2_linear_weight: "f32[3, 3]"):
                    # File: <eval_with_key>.133:8 in forward, code: linear_default = torch.ops.aten.linear.default(l_args_3_5__1, l_args_3_8__false_branch, l_args_3_10__false_branch);  l_args_3_5__1 = l_args_3_8__false_branch = l_args_3_10__false_branch = None
                    linear: "f32[3]" = torch.ops.aten.linear.default(x_1, p_m2_m2_linear_weight, p_m2_m2_linear_bias);  x_1 = p_m2_m2_linear_weight = p_m2_m2_linear_bias = None

                    # File: <eval_with_key>.133:9 in forward, code: linear_default_1 = torch.ops.aten.linear.default(linear_default, l_args_3_2__1, l_args_3_9__1);  linear_default = l_args_3_2__1 = l_args_3_9__1 = None
                    linear_1: "f32[3]" = torch.ops.aten.linear.default(linear, p_m2_linear_weight, p_m2_linear_bias);  linear = p_m2_linear_weight = p_m2_linear_bias = None
                    return (linear_1,)

Graph signature: ExportGraphSignature(input_specs=[InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_linear_weight'), target='linear.weight', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_linear_bias'), target='linear.bias', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m1_linear_weight'), target='m1.linear.weight', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m1_linear_bias'), target='m1.linear.bias', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m1_m1_linear_weight'), target='m1.m1.linear.weight', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m1_m1_linear_bias'), target='m1.m1.linear.bias', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m1_m2_linear_weight'), target='m1.m2.linear.weight', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m1_m2_linear_bias'), target='m1.m2.linear.bias', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m2_linear_weight'), target='m2.linear.weight', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m2_linear_bias'), target='m2.linear.bias', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m2_m1_linear_weight'), target='m2.m1.linear.weight', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m2_m1_linear_bias'), target='m2.m1.linear.bias', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m2_m2_linear_weight'), target='m2.m2.linear.weight', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m2_m2_linear_bias'), target='m2.m2.linear.bias', persistent=None), InputSpec(kind=<InputKind.USER_INPUT: 1>, arg=TensorArgument(name='x_1'), target=None, persistent=None)], output_specs=[OutputSpec(kind=<OutputKind.USER_OUTPUT: 1>, arg=TensorArgument(name='getitem'), target=None)])
Range constraints: {}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127975
Approved by: https://github.com/angelayi, https://github.com/ydwu4
2024-06-10 23:24:16 +00:00
b2d602306a [RELAND][dynamo][nn-modules] Trace through nn.Module dunder methods for UnspecializedNNModule (#126578)
Tracing through `__init__` is important because it initializes members (calls STORE_ATTR on them). By doing that, we kick in the mutation tracking for these objects, so things like mutating `_modules` etc. are tracked automatically.

Fixes https://github.com/pytorch/pytorch/issues/111837

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126578
Approved by: https://github.com/jansel
ghstack dependencies: #128295
2024-06-10 23:11:04 +00:00
05711eece9 [dynamo][inlining inbuilt modules] Ensure BC for nn_module_stack (#128295)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128295
Approved by: https://github.com/ydwu4
2024-06-10 23:11:04 +00:00
a287ff75d0 Use init_torchbind_implementations in inductor torchbind tests. (#128341)
Summary: To unify how we load the torch bind libraries for testing.

Test Plan: Existing tests.

Differential Revision: D58372372

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128341
Approved by: https://github.com/angelayi
2024-06-10 23:02:48 +00:00
4bbadeee8a Revert "Set simdlen based on ATEN_CPU_CAPABILITY (#123514)"
This reverts commit b66e3f0957b96b058c9b632ca60833d9717a9d8a.

Reverted https://github.com/pytorch/pytorch/pull/123514 on behalf of https://github.com/clee2000 due to broke test/inductor/test_torchinductor.py::CpuTests::test_new_cpp_build_logical_cpu on periodic test on the no gpu tests b66e3f0957 https://github.com/pytorch/pytorch/actions/runs/9453518547/job/26040077301 ([comment](https://github.com/pytorch/pytorch/pull/123514#issuecomment-2159433432))
2024-06-10 22:46:01 +00:00
2176ef7dfa [compiled autograd] support .backward(inputs=) (#128252)
autograd already marks nodes as needed or not before calling compiled autograd, so our worklist already skips nodes not specified in the `inputs` kwarg.

For the .backward(inputs=) case, I'm keeping the grads as outputs, just like for .grad(inputs=); this is to still guard on graph_output when we collect the nodes. This does not get DCE'd right now, and is ignored in the post-graph bytecode.
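
A minimal usage sketch of the pattern this enables (not the PR's own test code; the compiled-autograd context manager and the eager backend choice are the usual test-style setup and should be treated as assumptions):
```
import torch
from torch._dynamo import compiled_autograd

x = torch.randn(4, requires_grad=True)
y = torch.randn(4, requires_grad=True)

def compiler_fn(gm):
    # Compile the generated autograd graph; the backend choice is illustrative.
    return torch.compile(gm, backend="eager")

loss = (x * y).sum()
with compiled_autograd.enable(compiler_fn):
    # Only x is listed, so autograd's worklist already skips y's accumulation.
    loss.backward(inputs=[x])

print(x.grad is not None, y.grad is None)  # True True
```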

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128252
Approved by: https://github.com/jansel
2024-06-10 22:20:51 +00:00
583a56d5a8 DOC: add docstring to construct_and_record_rdzv_event() (#128189)
Fixes #127902

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128189
Approved by: https://github.com/kurman
2024-06-10 22:17:33 +00:00
c38b3381a1 Make nn.Module state_dict load_state_dict pre-hook and state_dict post hook public (#126704)
Fixes https://github.com/pytorch/pytorch/issues/75287 and https://github.com/pytorch/pytorch/issues/117437

- `nn.Module._register_state_dict_hook` --> add public `nn.Module.register_state_dict_post_hook`
   - Add a test as this API was previously untested
- `nn.Module._register_load_state_dict_pre_hook` --> add public `nn.Module.register_load_state_dict_pre_hook` (remove the `with_module` flag, defaulting it to `True`); see the usage sketch after this list
    ~- For consistency with optimizer `load_state_dict_pre_hook` raised by @janeyx99, allow the pre-hook to return a new `state_dict`~
 - Document the issue pointed out by https://github.com/pytorch/pytorch/issues/117437: for the private `_register_state_dict_hook`, the semantic of returning a new state_dict is only respected for the root module
       - Remove this behavior for the public `register_state_dict_post_hook`
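
A minimal usage sketch of the two public registration methods above; the hook signatures shown are assumptions carried over from the private variants, not documentation:
```
import torch
import torch.nn as nn

m = nn.Linear(2, 2)

def state_dict_post_hook(module, state_dict, prefix, local_metadata):
    # Runs after state_dict() is populated; mutate the dict in place (the
    # public hook's return value is not used, per the notes above).
    state_dict[prefix + "extra_version"] = torch.tensor(1)

def load_state_dict_pre_hook(module, state_dict, prefix, local_metadata,
                             strict, missing_keys, unexpected_keys, error_msgs):
    # Runs before load_state_dict() consumes state_dict; strip the extra key
    # so it does not show up as an unexpected key.
    state_dict.pop(prefix + "extra_version", None)

m.register_state_dict_post_hook(state_dict_post_hook)
m.register_load_state_dict_pre_hook(load_state_dict_pre_hook)

sd = m.state_dict()       # contains the injected "extra_version" entry
m.load_state_dict(sd)     # pre-hook removes it before loading
print(sorted(sd.keys()))
```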

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126704
Approved by: https://github.com/albanD
ghstack dependencies: #126906
2024-06-10 21:50:17 +00:00
a2d4fea872 [easy] Move state_dict hooks tests to test_module_hooks and decorate tests that call load_state_dict with swap (#126906)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126906
Approved by: https://github.com/albanD
2024-06-10 21:50:17 +00:00
58083ffb10 Improve unbacked reasoning involving has internal overlap (#128332)
Fixes https://github.com/pytorch/pytorch/issues/122477
Partially addresses https://github.com/pytorch/pytorch/issues/116336

This PR is slightly overkill: not only does it disable the overlap test
when there are unbacked SymInts, it also improves the is non-overlapping
and dense test for some more unbacked situations.  We technically don't
need the latter change, but I was already deep in the sauce and just
went ahead and did it.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128332
Approved by: https://github.com/lezcano
2024-06-10 21:49:38 +00:00
6630dcd53c Add docstring for the torch.serialization.default_restore_location function (#128132)
Fixes: #127887

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128132
Approved by: https://github.com/mikaylagawarecki
2024-06-10 21:33:56 +00:00
3a2d0755a4 enable test_ParameterList with dynamo if nn module inlining enabled only (#128308)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128308
Approved by: https://github.com/anijain2305
2024-06-10 21:25:40 +00:00
b459713ca7 [aota] compiled forward outputs requires_grad alignment with eager (#128016)
Original issue: https://github.com/pytorch/pytorch/issues/114338

We assume only two possible mutually exclusive scenarios:

1. Running compiled region for training (Any of inputs has requires_grad)
	- Produced differentiable outputs should have requires_grad.

2. Running compiled region for inference (None of inputs has requires_grad)
	- All outputs do not have requires_grad.

Even if the user runs the region under no_grad() but has an input Tensor with requires_grad, we go with the training scenario (1).

With current state that means:
1/ needs_autograd should not check torch.is_grad_enabled(), only whether any of the inputs requires_grad
2/ if needs_autograd => trace_joint (we are in training scenario 1) => always run the compiled region under torch.enable_grad() (a short sketch of the two scenarios follows)
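
A short illustrative sketch of the two scenarios (not the PR's tests); the printed values follow the rule stated above:
```
import torch

@torch.compile
def f(a, b):
    return a * b + 1

# Scenario 2 (inference): no input requires grad, so no output does either.
out = f(torch.randn(3), torch.randn(3))
print(out.requires_grad)  # False

# Scenario 1 (training): any input with requires_grad means the joint graph is
# traced and differentiable outputs carry requires_grad.
x = torch.randn(3, requires_grad=True)
out = f(x, torch.randn(3))
print(out.requires_grad)  # True
```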

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128016
Approved by: https://github.com/bdhirsh
2024-06-10 20:51:22 +00:00
4460e481bc Disable jacrev/jacfwd/hessian if compiling with dynamo (#128255)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128255
Approved by: https://github.com/zou3519
2024-06-10 20:47:53 +00:00
90bb510ece Revert "Deprecate torch._utils.is_compiling() and torch._dynamo.external_utils.is_compiling() (#127690)"
This reverts commit 348b181a97abc2e636a6c18e5880a78e5d1dab94.

Reverted https://github.com/pytorch/pytorch/pull/127690 on behalf of https://github.com/clee2000 due to sorry I think https://github.com/pytorch/pytorch/pull/126898#issuecomment-2142884456 is still relevant, I will reach out to them to see what needs to be done in internal to get this remerged ([comment](https://github.com/pytorch/pytorch/pull/127690#issuecomment-2159248859))
2024-06-10 20:44:42 +00:00
38e0a0440c [AMD] Default to hipblaslt in gemm (#127944)
Summary: It has been a constant pain that we have to specify an env var to opt into the hipblaslt path. The default path is very slow on MI300. Therefore, let's default to hipblaslt.
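
For reference, a hedged sketch of the user-facing knobs around this change: on ROCm builds the env var below used to be the opt-in for hipBLASLt, and the preferred backend can be inspected from Python; the exact default you see depends on the build, so treat the printed value as an example only.
```
import os
os.environ.setdefault("TORCH_BLAS_PREFER_HIPBLASLT", "1")  # old opt-in style
import torch

print(torch.backends.cuda.preferred_blas_library())  # e.g. Cublaslt on ROCm
```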

Differential Revision: D58150764

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127944
Approved by: https://github.com/aaronenyeshi, https://github.com/houseroad
2024-06-10 19:55:21 +00:00
946f554c8f Flip default value for mypy disallow_untyped_defs [10+1/11] (#128293)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128293
Approved by: https://github.com/oulgen
2024-06-10 19:32:44 +00:00
55646554b7 [EZ] Fix typos in SECURITY.md (#128340)
permisisons -> permissions
lates -> latest

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128340
Approved by: https://github.com/clee2000, https://github.com/atalman, https://github.com/kit1980
2024-06-10 19:21:39 +00:00
9cab5987bd Introduce int_oo (#127693)
In a previous life, we used sympy.oo to represent the lower/upper bounds of integer ranges. Later, we changed this to be sys.maxsize - 1 for a few reasons: (1) sometimes we do tests on a value being exactly sys.maxsize, and we wanted to avoid a data dependent guard in this case, (2) sympy.oo corresponds to floating point infinity, so you get incorrect types for value ranges with oo, and (3) you can do slightly better reasoning if you assume that input sizes fall within representable 64-bit integer range.

After working in the sys.maxsize regime for a bit, I've concluded that this was actually a bad idea. Specifically, the problem is that you end up with sys.maxsize in your upper bound, and then whenever you do any sort of size-increasing computation like size * 2, you end up with 2 * sys.maxsize, and you end up doing a ton of arbitrary precision int computation that is totally unnecessary. A symbolic bound is better.

But especially after #126905, we can't go back to using sympy.oo, because that advertises that it's not an integer, and now your ValueRanges is typed incorrectly. So what do we do? We define a new numeric constant `int_oo`, which is like `sympy.oo` but it advertises `is_integer`. **test/test_sympy_utils.py** describes some basic properties of the number, and **torch/utils/_sympy/numbers.py** has the actual implementation.
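
As a toy illustration of the idea (this is not the numbers.py implementation), an integer-typed infinity can advertise `is_integer` and saturate under arithmetic instead of blowing up into huge arbitrary-precision values:
```
from functools import total_ordering

@total_ordering
class IntOO:
    is_integer = True

    def __eq__(self, other):
        return isinstance(other, IntOO)

    def __lt__(self, other):
        return False            # nothing is larger than int_oo

    def __add__(self, other):
        return self             # int_oo + k saturates to int_oo

    __radd__ = __add__

    def __mul__(self, other):
        return self             # 2 * int_oo stays int_oo, no bignum blowup

    __rmul__ = __mul__

    def __repr__(self):
        return "int_oo"

int_oo = IntOO()
print(int_oo > 2**63 - 1)       # True: larger than any finite integer bound
print(int_oo + 1, 2 * int_oo)   # int_oo int_oo
print(int_oo.is_integer)        # True, so value ranges stay integer-typed
```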

The rest of the changes of the PR are working out the implications of this change. I'll give more commentary as inline comments.

Fixes https://github.com/pytorch/pytorch/issues/127396

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127693
Approved by: https://github.com/lezcano
ghstack dependencies: #126905
2024-06-10 19:09:53 +00:00
db2fa7b827 Revert "[export] FIx unflattener for preserving modules containing unused inputs (#128260)"
This reverts commit 093a4ff5f859ccbbd8ba62dd189f76e5faadfb04.

Reverted https://github.com/pytorch/pytorch/pull/128260 on behalf of https://github.com/angelayi due to breaking windows test ([comment](https://github.com/pytorch/pytorch/pull/128260#issuecomment-2159050726))
2024-06-10 18:42:33 +00:00
093a4ff5f8 [export] FIx unflattener for preserving modules containing unused inputs (#128260)
Currently the unflattener fails if the module whose signature it is preserving contains unused inputs/outputs.

This also fixes unflattener issues in D57829276.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128260
Approved by: https://github.com/pianpwk
2024-06-10 18:39:33 +00:00
fa8ec8e718 [dynamo] handle hashable exceptions in trace_rules lookup (#128078)
Summary: Found during user empathy day when attempting to hash a fractions.Fraction object before it was fully constructed. See https://github.com/pytorch/pytorch/issues/128075
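
A minimal repro sketch of the failure mode (the guard at the end shows the shape of the fix, not the PR's code): hashing an object whose `__hash__` reads attributes that are not set yet raises, so a hash-based lookup has to treat that as a miss.
```
import fractions

# Allocate a Fraction without running __init__, mimicking "not fully constructed".
frac = fractions.Fraction.__new__(fractions.Fraction)
try:
    hash(frac)  # __hash__ reads _numerator/_denominator, which don't exist yet
except Exception as e:
    print(type(e).__name__)  # AttributeError

def safe_lookup(table, key):
    # Illustrative: an unhashable or half-built key is treated as "no match".
    try:
        return table.get(key)
    except Exception:
        return None

print(safe_lookup({}, frac))  # None instead of raising
```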

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128078
Approved by: https://github.com/anijain2305
2024-06-10 18:23:22 +00:00
136bdb96cb Update Kineto submodule with fix to test_basic_chrome_trace (#128333)
Summary: We've updated the sort_index in Kineto chrome traces to support device ids for up to 16 devices. This should make chrome trace rows be ordered in the same way as on CUDA. We need to update the unit test as well.

Test Plan:
Ran locally the changing test:
```
$ buck2 test 'fbcode//mode/opt' fbcode//caffe2/test:test_profiler_cuda -- --exact 'caffe2/test:test_profiler_cuda - test_basic_chrome_trace (profiler.test_profiler.TestProfiler)'
File changed: fbcode//caffe2/third_party/kineto.submodule.txt
Buck UI: https://www.internalfb.com/buck2/f4fd1e9a-99f1-4422-aeed-b54903c64146
Test UI: https://www.internalfb.com/intern/testinfra/testrun/16888498639845776
Network: Up: 5.4KiB  Down: 8.6KiB  (reSessionID-0329120e-7fa2-4bc0-b539-7e58058f8fce)
Jobs completed: 6. Time elapsed: 1:01.2s.
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Differential Revision: D58362964

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128333
Approved by: https://github.com/Skylion007
2024-06-10 18:12:34 +00:00
83941482f7 Add docstring for the torch.distributed.elastic.utils.distributed.get_free_port function (#128133)
Fixes: #127914

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128133
Approved by: https://github.com/H-Huang
2024-06-10 18:10:58 +00:00
08d038f8a8 [PT2] Fix a typo and lint problem (#128258)
Summary: As titled.

Test Plan: see signal

Differential Revision: D58310169

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128258
Approved by: https://github.com/dshi7, https://github.com/Yuzhen11
2024-06-10 18:03:40 +00:00
46948300a2 [c10d] integrate PMI NCCL initialization to NCCL-PG (#128243)
Summary: Move broadcastUniqueID check to NCCLUtils

Differential Revision: D58273755

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128243
Approved by: https://github.com/wconstab
2024-06-10 17:20:03 +00:00
ab3a0b192a [RFC] add per-collective timeout value in flight recorder (#128190)
Summary:
Add a timeout value field to every collected record.

Test Plan:
Unit tests

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128190
Approved by: https://github.com/wconstab
2024-06-10 17:12:57 +00:00
8e482e909b Add some guard to size oblivious has_internal_overlap (#128328)
This doesn't actually help on
https://github.com/pytorch/pytorch/issues/122477 but I noticed this
modest improvement so sure, why not.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128328
Approved by: https://github.com/Skylion007
2024-06-10 17:11:26 +00:00
7b9c5e0e3f Turn on GraphTransformObserver for inductor (#127962)
The FX graphs for some PT2 models are very complicated, and Inductor usually goes through many passes of graph optimization to generate the final FX graph. It's very difficult to see the change made by each pass and to check whether the optimized graph is correct and optimal.

GraphTransformObserver is an observer that listens to all add/erase node events on a GraphModule during a graph transform pass and saves the changed nodes. When the pass is done, if there is any change in the graph, GraphTransformObserver saves SVG files of the input graph and the output graph for that pass.

This PR is to enable GraphTransformObserver for inductor.
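
A conceptual sketch of what such an observer tracks; this is hand-rolled for illustration and is not the GraphTransformObserver API or the inductor hook-up:
```
import torch
import torch.fx as fx

class SimpleTransformObserver:
    """Records which nodes a graph pass added or erased, by name."""
    def __init__(self, gm: fx.GraphModule, passname: str):
        self.gm, self.passname = gm, passname
        self.before = {n.name for n in gm.graph.nodes}

    def report(self):
        after = {n.name for n in self.gm.graph.nodes}
        print(f"[{self.passname}] added={after - self.before} "
              f"erased={self.before - after}")

def replace_relu_with_gelu(gm: fx.GraphModule):
    # A toy "transform pass" that adds one node and erases another.
    for n in list(gm.graph.nodes):
        if n.op == "call_function" and n.target is torch.relu:
            with gm.graph.inserting_after(n):
                new = gm.graph.call_function(torch.nn.functional.gelu, n.args)
            n.replace_all_uses_with(new)
            gm.graph.erase_node(n)
    gm.recompile()

def f(x):
    return torch.relu(x + 1)

gm = fx.symbolic_trace(f)
obs = SimpleTransformObserver(gm, "replace_relu_with_gelu")
replace_relu_with_gelu(gm)
obs.report()  # added={'gelu'} erased={'relu'}
```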

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127962
Approved by: https://github.com/jansel
2024-06-10 16:49:02 +00:00
ca561d639b Revert "Fix 'get_attr' call in dynamo 'run_node' (#127696)"
This reverts commit b741819b0580204e6a6b60c62ce44dacaf7787c8.

Reverted https://github.com/pytorch/pytorch/pull/127696 on behalf of https://github.com/clee2000 due to broke (executorch?) internal tests D58295865 ([comment](https://github.com/pytorch/pytorch/pull/127696#issuecomment-2158820093))
2024-06-10 16:29:20 +00:00
d22287d1ad Revert "Fix 'get_real_value' on placeholder nodes (#127698)"
This reverts commit 19b31d899a78a6806314bcc73b88172dabf0c26e.

Reverted https://github.com/pytorch/pytorch/pull/127698 on behalf of https://github.com/clee2000 due to broke (executorch?) internal tests D58295865 ([comment](https://github.com/pytorch/pytorch/pull/127696#issuecomment-2158820093))
2024-06-10 16:29:20 +00:00
3b73f5de3a Revert "Add OpInfo entry for alias_copy (#127232) (#128142)"
This reverts commit 04da6aeb61f4d57bf73ed1054dd897abbcceca83.

Reverted https://github.com/pytorch/pytorch/pull/128142 on behalf of https://github.com/DanilBaibak due to The changes broke the test_output_match_alias_copy_cpu_complex64 test. ([comment](https://github.com/pytorch/pytorch/pull/128142#issuecomment-2158793878))
2024-06-10 16:17:16 +00:00
c993f1b37f Fix edge cases for gather in inductor (#126893)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126893
Approved by: https://github.com/peterbell10
ghstack dependencies: #126876
2024-06-10 15:31:03 +00:00
04da6aeb61 Add OpInfo entry for alias_copy (#127232) (#128142)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128142
Approved by: https://github.com/lezcano
2024-06-10 15:01:53 +00:00
292 changed files with 5702 additions and 6491 deletions

View File

@ -62,4 +62,6 @@ readability-string-compare,
'
HeaderFilterRegex: '^(aten/|c10/|torch/).*$'
WarningsAsErrors: '*'
CheckOptions:
misc-header-include-cycle.IgnoredFilesList: 'format.h;ivalue.h;custom_class.h;Dict.h;List.h'
...

View File

@ -1099,7 +1099,6 @@ exclude_patterns = [
'test/test_namedtuple_return_api.py',
'test/test_native_functions.py',
'test/test_native_mha.py',
'test/test_nestedtensor.py',
'test/test_nn.py',
'test/test_out_dtype_op.py',
'test/test_overrides.py',

View File

@ -461,15 +461,8 @@ filegroup(
filegroup(
name = "caffe2_perfkernels_srcs",
srcs = [
"caffe2/perfkernels/adagrad.cc",
"caffe2/perfkernels/embedding_lookup.cc",
"caffe2/perfkernels/embedding_lookup_idx.cc",
"caffe2/perfkernels/fused_8bit_rowwise_embedding_lookup.cc",
"caffe2/perfkernels/fused_8bit_rowwise_embedding_lookup_idx.cc",
"caffe2/perfkernels/fused_nbit_rowwise_conversion.cc",
"caffe2/perfkernels/lstm_unit_cpu_common.cc",
"caffe2/perfkernels/math_cpu_base.cc",
"caffe2/perfkernels/typed_axpy.cc",
],
)

View File

@ -40,7 +40,7 @@ Important Note: The trustworthiness of a model is not binary. You must always de
### Untrusted inputs during training and prediction
If you plan to open your model to untrusted inputs, be aware that inputs can also be used as vectors by malicious agents. To minimize risks, make sure to give your model only the permisisons strictly required, and keep your libraries updated with the lates security patches.
If you plan to open your model to untrusted inputs, be aware that inputs can also be used as vectors by malicious agents. To minimize risks, make sure to give your model only the permissions strictly required, and keep your libraries updated with the latest security patches.
If applicable, prepare your model against bad inputs and prompt injections. Some recommendations:
- Pre-analysis: check how the model performs by default when exposed to prompt injection (e.g. using fuzzing for prompt injection).

View File

@ -385,8 +385,11 @@ class TORCH_API Context {
? at::LinalgBackend::Cusolver
: at::LinalgBackend::Default;
at::BlasBackend blas_preferred_backend =
(c10::utils::check_env("TORCH_BLAS_PREFER_CUBLASLT") == true ||
c10::utils::check_env("TORCH_BLAS_PREFER_HIPBLASLT") == true)
#ifdef USE_ROCM
(c10::utils::check_env("TORCH_BLAS_PREFER_HIPBLASLT") != false)
#else
(c10::utils::check_env("TORCH_BLAS_PREFER_CUBLASLT") == true)
#endif
? at::BlasBackend::Cublaslt
: at::BlasBackend::Cublas;
#ifdef C10_MOBILE

View File

@ -143,7 +143,7 @@ static Device getATenDevice(const DLDevice& ctx, void* data) {
return at::detail::getXPUHooks().getDeviceFromPtr(data);
default:
TORCH_CHECK(
false, "Unsupported device_type: " + c10::to_string(ctx.device_type));
false, "Unsupported device_type: ", std::to_string(ctx.device_type));
}
}
@ -167,7 +167,7 @@ ScalarType toScalarType(const DLDataType& dtype) {
break;
default:
TORCH_CHECK(
false, "Unsupported kUInt bits " + c10::to_string(dtype.bits));
false, "Unsupported kUInt bits ", std::to_string(dtype.bits));
}
break;
case DLDataTypeCode::kDLInt:
@ -186,7 +186,7 @@ ScalarType toScalarType(const DLDataType& dtype) {
break;
default:
TORCH_CHECK(
false, "Unsupported kInt bits " + c10::to_string(dtype.bits));
false, "Unsupported kInt bits ", std::to_string(dtype.bits));
}
break;
case DLDataTypeCode::kDLFloat:
@ -202,7 +202,7 @@ ScalarType toScalarType(const DLDataType& dtype) {
break;
default:
TORCH_CHECK(
false, "Unsupported kFloat bits " + c10::to_string(dtype.bits));
false, "Unsupported kFloat bits ", std::to_string(dtype.bits));
}
break;
case DLDataTypeCode::kDLBfloat:
@ -212,7 +212,7 @@ ScalarType toScalarType(const DLDataType& dtype) {
break;
default:
TORCH_CHECK(
false, "Unsupported kFloat bits " + c10::to_string(dtype.bits));
false, "Unsupported kFloat bits ", std::to_string(dtype.bits));
}
break;
case DLDataTypeCode::kDLComplex:
@ -228,7 +228,7 @@ ScalarType toScalarType(const DLDataType& dtype) {
break;
default:
TORCH_CHECK(
false, "Unsupported kFloat bits " + c10::to_string(dtype.bits));
false, "Unsupported kFloat bits ", std::to_string(dtype.bits));
}
break;
case DLDataTypeCode::kDLBool:
@ -238,11 +238,11 @@ ScalarType toScalarType(const DLDataType& dtype) {
break;
default:
TORCH_CHECK(
false, "Unsupported kDLBool bits " + c10::to_string(dtype.bits));
false, "Unsupported kDLBool bits ", std::to_string(dtype.bits));
}
break;
default:
TORCH_CHECK(false, "Unsupported code " + c10::to_string(dtype.code));
TORCH_CHECK(false, "Unsupported code ", std::to_string(dtype.code));
}
return stype;
}
@ -298,9 +298,7 @@ Tensor fromDLPack(DLManagedTensor* src) {
return fromDLPack(src, std::move(deleter));
}
Tensor fromDLPack(
DLManagedTensor* src,
std::function<void(void*)> deleter) {
Tensor fromDLPack(DLManagedTensor* src, std::function<void(void*)> deleter) {
Device device = getATenDevice(src->dl_tensor.device, src->dl_tensor.data);
ScalarType stype = toScalarType(src->dl_tensor.dtype);
if (!src->dl_tensor.strides) {

View File

@ -462,7 +462,7 @@ inline Tensor _sum_to(
reduce_dims.push_back(i);
}
for (int64_t i = leading_dims; i < static_cast<int64_t>(sizes.size()); ++i) {
if (shape[i - leading_dims] == 1 &&
if (TORCH_GUARD_SIZE_OBLIVIOUS(sym_eq(shape[i - leading_dims], 1)) &&
TORCH_GUARD_SIZE_OBLIVIOUS(sym_ne(sizes[i], 1))) {
reduce_dims.push_back(i);
}

View File

@ -19,7 +19,13 @@ MemOverlap has_internal_overlap(TensorImpl* t) {
auto strides = t->sym_strides();
auto sizes = t->sym_sizes();
for (const auto i : c10::irange(strides.size())) {
if (strides[i] == 0 && sizes[i] > 1) {
// NB: The size oblivious test is written very carefully here. When
// unbacked SymInts are involved, we should try to conservatively report
// if memory overlap /could/ happen under some setting of unbacked
// SymInts. Thus, if I have u0 size, we should assume that this has > 1
// elements (first expression), but if I have a u0 stride, I should NOT
// assume that it is not zero (second expression)
if (TORCH_GUARD_SIZE_OBLIVIOUS(sizes[i].sym_gt(1)) && strides[i] == 0) {
return MemOverlap::Yes;
}
}

View File

@ -22,7 +22,6 @@
#endif
#include <c10/util/irange.h>
#include <c10/util/string_utils.h>
#include <c10/util/SmallBuffer.h>
#include <array>
@ -1398,7 +1397,7 @@ bool TensorIteratorBase::fast_set_up(const TensorIteratorConfig& config) {
break;
}
default:
TORCH_INTERNAL_ASSERT(false, "Unsupported fast setup type", c10::to_string((int)setup_type));
TORCH_INTERNAL_ASSERT(false, "Unsupported fast setup type", std::to_string((int)setup_type));
}
//coalescing dimensions consists of collapsing dimensions to 1 (we are limited to contiguous no-broadcast cases here)
if (ndim() > 1){

View File

@ -31,7 +31,7 @@ struct TemplateEnv {
// Add a number 'v' to the map at key 'k'
template <typename T>
void d(const std::string& k, const T& v) {
strings_[k] = c10::to_string(v);
strings_[k] = std::to_string(v);
lists_.erase(k);
}

View File

@ -478,8 +478,6 @@ namespace impl {
// (maybe except for some internal prim ops).
using GenericList = List<IValue>;
const IValue* ptr_to_first_element(const GenericList& list);
}
}

View File

@ -350,11 +350,4 @@ void List<T>::unsafeSetElementType(TypePtr t) {
impl_->elementType = std::move(t);
}
namespace impl {
inline const IValue* ptr_to_first_element(const GenericList& list) {
return &list.impl_->list[0];
}
}
}

View File

@ -440,15 +440,6 @@ TORCH_IMPL_FUNC(log_softmax_backward_cpu_out) (
}
}
static Tensor softmax(const Tensor& input_, const int64_t dim_) {
auto result = [&]() {
NoNamesGuard guard;
return at::_softmax(input_, dim_, false);
}();
namedinference::propagate_names(result, input_);
return result;
}
Tensor softmax(const Tensor& input_, const int64_t dim_, std::optional<ScalarType> dtype) {
auto result = [&]() {
NoNamesGuard guard;
@ -505,15 +496,6 @@ Tensor special_softmax(const Tensor& input_, const int64_t dim_, std::optional<S
return at::softmax(input_, dim_, dtype);
}
static Tensor log_softmax(const Tensor& input_, const int64_t dim_) {
auto result = [&]() {
NoNamesGuard guard;
return at::_log_softmax(input_, dim_, false);
}();
namedinference::propagate_names(result, input_);
return result;
}
Tensor log_softmax(const Tensor& input_, const int64_t dim_, std::optional<ScalarType> dtype) {
auto result = [&]() {
NoNamesGuard guard;

View File

@ -1195,15 +1195,6 @@ Tensor istft(const Tensor& self, const int64_t n_fft, const optional<int64_t> ho
#undef REPR
}
static Tensor istft(const Tensor& self, const int64_t n_fft, const optional<int64_t> hop_lengthOpt,
const optional<int64_t> win_lengthOpt, const Tensor& window,
const bool center, const bool normalized, const optional<bool> onesidedOpt,
const optional<int64_t> lengthOpt) {
return at::native::istft(
self, n_fft, hop_lengthOpt, win_lengthOpt, window, center, normalized,
onesidedOpt, lengthOpt, /*return_complex=*/false);
}
void _fft_fill_with_conjugate_symmetry_(const Tensor& input, IntArrayRef dim_) {
const auto input_sizes = input.sizes();
const auto input_strides = input.strides();

View File

@ -172,18 +172,10 @@ Tensor arange(
return at::arange_out(result, start, end, step);
}
static Tensor& arange_start_out(const Scalar& start, const Scalar& end, Tensor& result) {
return at::arange_out(result, start, end, /*step=*/1);
}
Tensor& arange_out(const Scalar& end, Tensor& result) {
return at::arange_out(result, /*start=*/0, end, /*step=*/1);
}
static Tensor& arange_out(Tensor& result, const Scalar& start, const Scalar& end) {
return at::arange_out(result, start, end, /*step=*/1);
}
Tensor _dim_arange(const Tensor& like, int64_t dim) {
return at::arange(like.size(dim), like.options().dtype(at::kLong));
}

View File

@ -105,10 +105,6 @@ Tensor & detach_(Tensor & self) {
return self;
}
static Tensor contiguous(const Tensor & self) {
return contiguous(self, MemoryFormat::Contiguous);
}
Tensor contiguous(const Tensor& self, MemoryFormat memory_format) {
if (self.is_contiguous(memory_format)) {
return self;

View File

@ -210,7 +210,6 @@
#include <ATen/ops/zeros_native.h>
#endif
#include <c10/util/StringUtil.h>
#include <algorithm>
#include <cstdint>
#include <utility>
@ -1181,14 +1180,6 @@ Tensor as_strided_tensorimpl(const Tensor& self, IntArrayRef size, IntArrayRef s
return result;
}
static Tensor as_strided_tensorimpl_meta(const Tensor& self, IntArrayRef size, IntArrayRef stride, optional<int64_t> storage_offset_) {
auto storage_offset = storage_offset_.value_or(self.storage_offset());
auto result = at::detail::make_tensor<TensorImpl>(
c10::TensorImpl::VIEW, Storage(self.storage()), self.key_set(), self.dtype());
setStrided(result, size, stride, storage_offset);
return result;
}
template <typename T>
inline void setStridedUnchecked(
const Tensor& self,
@ -1249,10 +1240,6 @@ const Tensor &as_strided__symint(const Tensor& self, SymIntArrayRef size, SymInt
return self;
}
static Tensor narrow_copy_dense(const Tensor& self, int64_t dim, int64_t start, int64_t length) {
return self.narrow(dim, start, length).clone(at::MemoryFormat::Contiguous);
}
// Should just use narrow_copy_out, but this API is used internally at Meta:
// https://github.com/pytorch/pytorch/pull/87045#issuecomment-1309353561
Tensor narrow_copy_dense_cpu(const Tensor& self, int64_t dim, int64_t start, int64_t length){
@ -3587,10 +3574,6 @@ Tensor view_as(const Tensor& self, const Tensor& other) {
return self.view_symint(other.sym_sizes());
}
static int64_t numel(const Tensor& self) {
return self.unsafeGetTensorImpl()->numel();
}
std::vector<Tensor> unbind(const Tensor &self, int64_t dim) {
dim = maybe_wrap_dim(dim, self.dim());
int64_t size = self.size(dim);

View File

@ -1002,7 +1002,7 @@ std::string generate_code(
std::string extra_args = "";
for (size_t i = 0; i < extra_args_typenames.size(); i++) {
auto type = std::string(extra_args_typenames[i]);
auto name = "extra_arg_" + std::string(to_string(i));
auto name = "extra_arg_" + std::to_string(i);
extra_params += "," + type + " " + name;
extra_args += ", " + name;
}

View File

@ -13,7 +13,8 @@ void run_cudnn_SDP_fprop(
int64_t h,
int64_t s_q,
int64_t s_kv,
int64_t d,
int64_t d_qk,
int64_t d_v,
float scaling_factor,
bool isTraining,
bool is_causal,
@ -34,7 +35,8 @@ void run_cudnn_SDP_bprop(
int64_t h,
int64_t s_q,
int64_t s_kv,
int64_t d,
int64_t d_qk,
int64_t d_v,
float scaling_factor,
bool is_causal,
float dropout_probability,
@ -128,7 +130,8 @@ struct MHAParams {
int64_t h;
int64_t s_q;
int64_t s_kv;
int64_t d;
int64_t d_qk;
int64_t d_v;
double dropout_probability;
bool is_causal;
bool return_softmaxstats;
@ -140,7 +143,8 @@ void setMHAParams(
int64_t h,
int64_t s_q,
int64_t s_kv,
int64_t d,
int64_t d_qk,
int64_t d_v,
const Tensor& q,
const Tensor& k,
const Tensor& v,
@ -155,7 +159,8 @@ void setMHAParams(
}
params.b = b;
params.h = h;
params.d = d;
params.d_qk = d_qk;
params.d_v = d_v;
params.s_q = s_q;
params.s_kv = s_kv;
params.dropout_probability = dropout_probability;
@ -193,7 +198,8 @@ struct MHACacheKeyWrapper : ParamsWrapper<MHAParams> {
int64_t h,
int64_t s_q,
int64_t s_kv,
int64_t d,
int64_t d_qk,
int64_t d_v,
const Tensor& q,
const Tensor& k,
const Tensor& v,
@ -206,7 +212,8 @@ struct MHACacheKeyWrapper : ParamsWrapper<MHAParams> {
h,
s_q,
s_kv,
d,
d_qk,
d_v,
q,
k,
v,
@ -249,7 +256,8 @@ auto build_graph_and_tensors(
int64_t h,
int64_t s_q,
int64_t s_kv,
int64_t d,
int64_t d_qk,
int64_t d_v,
float scaling_factor,
bool return_softmaxstats,
bool is_causal,
@ -383,7 +391,8 @@ auto build_graph_and_tensors_backward(
int64_t h,
int64_t s_q,
int64_t s_kv,
int64_t d,
int64_t d_qk,
int64_t d_v,
float scaling_factor,
bool is_causal,
float dropout_probability,
@ -514,7 +523,8 @@ void run_cudnn_SDP_fprop(
int64_t h,
int64_t s_q,
int64_t s_kv,
int64_t d,
int64_t d_qk,
int64_t d_v,
float scaling_factor,
bool return_softmaxstats,
bool is_causal,
@ -528,7 +538,7 @@ void run_cudnn_SDP_fprop(
Tensor& dropoutoffset) {
cudnnHandle_t handle = getCudnnHandle();
o = at::empty_strided(
{b, h, s_q, d}, {s_q * h * d, d, h * d, 1}, q.options());
{b, h, s_q, d_v}, {s_q * h * d_v, d_v, h * d_v, 1}, q.options());
if (return_softmaxstats) {
// TODO(eqy): verify that this is correct
softmaxstats = at::empty({b, h, s_q}, q.options().dtype(kFloat));
@ -539,7 +549,8 @@ void run_cudnn_SDP_fprop(
h,
s_q,
s_kv,
d,
d_qk,
d_v,
q,
k,
v,
@ -556,7 +567,8 @@ void run_cudnn_SDP_fprop(
h,
s_q,
s_kv,
d,
d_qk,
d_v,
scaling_factor,
return_softmaxstats,
is_causal,
@ -599,7 +611,8 @@ void run_cudnn_SDP_bprop(
int64_t h,
int64_t s_q,
int64_t s_kv,
int64_t d,
int64_t d_qk,
int64_t d_v,
float scaling_factor,
bool is_causal,
float dropout_probability,
@ -623,7 +636,18 @@ void run_cudnn_SDP_bprop(
}
cudnnHandle_t handle = getCudnnHandle();
auto key = MHACacheKeyWrapper(
b, h, s_q, s_kv, d, q, k, v, dropout_probability, is_causal, true);
b,
h,
s_q,
s_kv,
d_qk,
d_v,
q,
k,
v,
dropout_probability,
is_causal,
true);
auto graph_and_tensors_backward_ptr = mhagraphbackwardcache.find(key);
graph_and_tensors_backward graph_and_tensors_backward_values;
if (graph_and_tensors_backward_ptr) {
@ -634,7 +658,8 @@ void run_cudnn_SDP_bprop(
h,
s_q,
s_kv,
d,
d_qk,
d_v,
scaling_factor,
is_causal,
dropout_probability,
@ -684,5 +709,4 @@ void run_cudnn_SDP_bprop(
} // namespace native
} // namespace at
#endif

View File

@ -9,7 +9,8 @@ void run_cudnn_SDP_fprop(
int64_t h,
int64_t s_q,
int64_t s_kv,
int64_t d,
int64_t d_k,
int64_t d_v,
float scaling_factor,
bool isTraining,
bool is_causal,
@ -27,7 +28,8 @@ void run_cudnn_SDP_bprop(
int64_t h,
int64_t s_q,
int64_t s_kv,
int64_t d,
int64_t d_k,
int64_t d_v,
float scaling_factor,
bool is_causal,
float dropout_probability,

View File

@ -27,53 +27,7 @@ Tensor mkldnn_convolution(
TORCH_CHECK(false, "mkldnn_convolution_forward: ATen not compiled with MKLDNN support");
}
static Tensor mkldnn_convolution_backward_input(
IntArrayRef input_size, const Tensor& grad_output, const Tensor& weight,
IntArrayRef padding, IntArrayRef stride, IntArrayRef dilation, int64_t groups, bool bias_defined) {
TORCH_CHECK(false, "mkldnn_convolution_backward_input: ATen not compiled with MKLDNN support");
}
static std::tuple<Tensor, Tensor> mkldnn_convolution_backward_weights(
IntArrayRef weight_size, const Tensor& grad_output, const Tensor& input,
IntArrayRef padding, IntArrayRef stride, IntArrayRef dilation, int64_t groups, bool bias_defined) {
TORCH_CHECK(false, "mkldnn_convolution_backward_weights: ATen not compiled with MKLDNN support");
}
static std::tuple<Tensor, Tensor, Tensor> mkldnn_convolution_backward(
const Tensor& input, const Tensor& grad_output_t, const Tensor& weight,
IntArrayRef padding, IntArrayRef stride, IntArrayRef dilation, int64_t groups, std::array<bool,3> output_mask) {
TORCH_CHECK(false, "mkldnn_convolution_backward: ATen not compiled with MKLDNN support");
}
REGISTER_NO_CPU_DISPATCH(mkldnn_convolution_backward_stub);
static Tensor mkldnn_convolution_transpose(
const Tensor& input, const Tensor& weight, const std::optional<Tensor>& bias_opt,
IntArrayRef padding, IntArrayRef output_padding, IntArrayRef stride, IntArrayRef dilation, int64_t groups) {
TORCH_CHECK(false, "mkldnn_convolution_transpose: ATen not compiled with MKLDNN support");
}
static Tensor mkldnn_convolution_transpose_backward_input(
IntArrayRef input_size, const Tensor& grad_output, const Tensor& weight,
IntArrayRef padding, IntArrayRef output_padding, IntArrayRef stride, IntArrayRef dilation,
int64_t groups, bool bias_defined) {
TORCH_CHECK(false, "mkldnn_convolution_transpose_backward_input: ATen not compiled with MKLDNN support");
}
static std::tuple<Tensor, Tensor> mkldnn_convolution_transpose_backward_weights(
IntArrayRef weight_size, const Tensor& grad_output, const Tensor& input,
IntArrayRef padding, IntArrayRef output_padding, IntArrayRef stride, IntArrayRef dilation,
int64_t groups, bool bias_defined) {
TORCH_CHECK(false, "mkldnn_convolution_transpose_backward_weights: ATen not compiled with MKLDNN support");
}
static std::tuple<Tensor, Tensor, Tensor> mkldnn_convolution_transpose_backward(
const Tensor& input, const Tensor& grad_output_t, const Tensor& weight,
IntArrayRef padding, IntArrayRef output_padding, IntArrayRef stride, IntArrayRef dilation,
int64_t groups, std::array<bool,3> output_mask) {
TORCH_CHECK(false, "mkldnn_convolution_transpose_backward: ATen not compiled with MKLDNN support");
}
REGISTER_NO_CPU_DISPATCH(mkldnn_convolution_transpose_stub);
REGISTER_NO_CPU_DISPATCH(mkldnn_convolution_transpose_backward_stub);

View File

@ -18,26 +18,21 @@ kernel void erfinv_mps_kernel( device {0} *output [[buffer(0)]],
/* coefficients in rational expansion */
float y_abs = abs(y);
if(y_abs > 1.0f){{
output[index] = NAN;
if (y_abs >= 1.0f) {{
output[index] = {0}( y_abs > 1.0f ? NAN : copysign(INFINITY, y));
return;
}}
if(y_abs == 1.0f){{
output[index] = copysign(INFINITY, y);
return;
}}
if(y_abs <= 0.7f) {{
if (y_abs <= 0.7f) {{
z = y * y;
num = (((a[3]*z + a[2])*z + a[1])*z + a[0]);
dem = ((((b[3]*z + b[2])*z + b[1])*z +b[0]) * z + 1.0f);
num = ((a[3] * z + a[2]) * z + a[1])*z + a[0];
dem = (((b[3] * z + b[2]) * z + b[1]) * z +b[0]) * z + 1.0f;
x = y * num / dem;
}}
else{{
}} else {{
z = sqrt(-1.0f*log((1.0-y_abs)/2.0));
num = ((c[3]*z + c[2])*z + c[1]) * z + c[0];
dem = (d[1]*z + d[0])*z + 1.0f;
num = ((c[3] * z + c[2]) * z + c[1]) * z + c[0];
dem = (d[1] * z + d[0]) * z + 1.0f;
x = copysign(num, y) / dem;
}}
output[index] = x;
}})METAL";
output[index] = {0}(x);
}})METAL";

View File

@ -143,7 +143,7 @@ TORCH_IMPL_FUNC(leaky_relu_out_mps)(const Tensor& self, const Scalar& negative_s
Tensor output_ = at::empty_like(self, executeGatherOp ? MemoryFormat::Contiguous : MemoryFormat::Preserve);
@autoreleasepool {
string key = "leaky_relu" + getTensorsStringKey({self}) + ":" + to_string(negative_slope.to<double>());
string key = "leaky_relu" + getTensorsStringKey({self}) + ":" + std::to_string(negative_slope.to<double>());
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
MPSGraphTensor* inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, self);
@ -193,8 +193,8 @@ TORCH_IMPL_FUNC(leaky_relu_backward_out_mps)
Tensor output_ = at::empty_like(self, self.suggest_memory_format());
@autoreleasepool {
string key =
"leaky_relu_backward" + getTensorsStringKey({self, grad_output}) + ":" + to_string(negative_slope.to<double>());
string key = "leaky_relu_backward" + getTensorsStringKey({self, grad_output}) + ":" +
std::to_string(negative_slope.to<double>());
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
MPSGraphTensor* inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, self);
MPSGraphTensor* gradOutputTensor = mpsGraphRankedPlaceHolder(mpsGraph, grad_output);
@ -242,7 +242,7 @@ TORCH_IMPL_FUNC(log_softmax_mps_out)
MPSStream* stream = at::mps::getCurrentMPSStream();
@autoreleasepool {
string key = "log_softmax_mps_out" + getTensorsStringKey({self}) + ":" + to_string(dim);
string key = "log_softmax_mps_out" + getTensorsStringKey({self}) + ":" + std::to_string(dim);
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
MPSGraphTensor* inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, self);
@ -285,7 +285,7 @@ TORCH_IMPL_FUNC(log_softmax_backward_mps_out)
MPSStream* stream = at::mps::getCurrentMPSStream();
@autoreleasepool {
string key = "log_softmax_backward_mps_out:" + getMPSTypeString(grad_output) + ":" + to_string(dim);
string key = "log_softmax_backward_mps_out:" + getMPSTypeString(grad_output) + ":" + std::to_string(dim);
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
MPSGraphTensor* gradOutputTensor = mpsGraphUnrankedPlaceHolder(mpsGraph, getMPSDataType(grad_output));
MPSGraphTensor* outputTensor = mpsGraphUnrankedPlaceHolder(mpsGraph, getMPSDataType(output));
@ -539,8 +539,8 @@ TORCH_IMPL_FUNC(threshold_out_mps)
MPSStream* stream = getCurrentMPSStream();
@autoreleasepool {
string key = "threshold_out_mps" + getTensorsStringKey({self}) + ":" + to_string(threshold.to<double>()) + ":" +
to_string(value.to<double>());
string key = "threshold_out_mps" + getTensorsStringKey({self}) + ":" + std::to_string(threshold.to<double>()) +
":" + std::to_string(value.to<double>());
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
MPSGraphTensor* inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, self);
@ -587,7 +587,7 @@ TORCH_IMPL_FUNC(threshold_backward_out_mps)
@autoreleasepool {
string key =
"threshold_backward_out_mps" + getTensorsStringKey({self, grad}) + ":" + to_string(threshold.to<double>());
"threshold_backward_out_mps" + getTensorsStringKey({self, grad}) + ":" + std::to_string(threshold.to<double>());
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
MPSGraphTensor* inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, self);
@ -826,8 +826,8 @@ static void elu_variants_out_mps(const Tensor& self,
MPSStream* stream = getCurrentMPSStream();
@autoreleasepool {
string key = func_name + ":" + getTensorsStringKey({self}) + ":" + to_string(alpha.to<double>()) + ":" +
to_string(scale.to<double>()) + ":" + to_string(input_scale.to<double>());
string key = func_name + ":" + getTensorsStringKey({self}) + ":" + std::to_string(alpha.to<double>()) + ":" +
std::to_string(scale.to<double>()) + ":" + std::to_string(input_scale.to<double>());
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
MPSGraphTensor* inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, self);
@ -916,8 +916,8 @@ TORCH_IMPL_FUNC(elu_backward_out_mps)
@autoreleasepool {
string key = "elu_backward_out_mps:" + getTensorsStringKey({grad_output, self_or_result}) + ":" +
to_string(alpha.to<double>()) + ":" + to_string(scale.to<double>()) + ":" +
to_string(input_scale.to<double>()) + ":" + to_string(is_result);
std::to_string(alpha.to<double>()) + ":" + std::to_string(scale.to<double>()) + ":" +
std::to_string(input_scale.to<double>()) + ":" + std::to_string(is_result);
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
MPSGraphTensor* gradOutputTensor = mpsGraphRankedPlaceHolder(mpsGraph, grad_output);
@ -1010,7 +1010,7 @@ TORCH_IMPL_FUNC(glu_out_mps)(const Tensor& self, const int64_t dim, const Tensor
MPSStream* stream = getCurrentMPSStream();
@autoreleasepool {
string key = "glu_out_mps" + getTensorsStringKey({self}) + ":" + to_string(dim);
string key = "glu_out_mps" + getTensorsStringKey({self}) + ":" + std::to_string(dim);
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
MPSGraphTensor* inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, getMPSDataType(self), getMPSShape(self));
NSArray<MPSGraphTensor*>* outputTensorsArray = [mpsGraph splitTensor:inputTensor
@ -1052,7 +1052,7 @@ Tensor& glu_backward_mps_out(const Tensor& grad_output, const Tensor& self, cons
MPSStream* stream = getCurrentMPSStream();
@autoreleasepool {
string key = "glu_backward_mps_out" + getTensorsStringKey({grad_output, self}) + ":" + to_string(dim);
string key = "glu_backward_mps_out" + getTensorsStringKey({grad_output, self}) + ":" + std::to_string(dim);
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
MPSGraphTensor* inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, getMPSDataType(self), getMPSShape(self));
MPSGraphTensor* gradOutputTensor =
@ -1855,8 +1855,8 @@ Tensor& hardtanh_backward_out_mps(const Tensor& grad_output,
MPSStream* stream = getCurrentMPSStream();
@autoreleasepool {
string key = "hardtanh_backward_out_mps:" + getTensorsStringKey({grad_output}) + ":" + to_string(min.to<double>()) +
":" + to_string(max.to<double>());
string key = "hardtanh_backward_out_mps:" + getTensorsStringKey({grad_output}) + ":" +
std::to_string(min.to<double>()) + ":" + std::to_string(max.to<double>());
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
MPSGraphTensor* gradOutputTensor = mpsGraphRankedPlaceHolder(mpsGraph, grad_output);

View File

@ -136,8 +136,8 @@ static Tensor& addmv_out_mps_impl(const Tensor& self,
Tensor matMulVec = at::mm(mat, vec.unsqueeze(1)).squeeze(1);
@autoreleasepool {
string key = "addmv_out_mps_impl" + getTensorsStringKey({self, matMulVec}) + ":" + to_string(beta_.toDouble()) +
":" + to_string(alpha_.toDouble());
string key = "addmv_out_mps_impl" + getTensorsStringKey({self, matMulVec}) + ":" +
std::to_string(beta_.toDouble()) + ":" + std::to_string(alpha_.toDouble());
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
MPSGraphTensor* matMulVecTensor = mpsGraphRankedPlaceHolder(mpsGraph, matMulVec);
MPSGraphTensor* selfTensor = mpsGraphRankedPlaceHolder(mpsGraph, self);

View File

@ -33,7 +33,7 @@ static Tensor& fill_scalar_mps_impl(Tensor& self, const Scalar& value) {
};
@autoreleasepool {
string key = "fill_scalar_mps_impl" + getTensorsStringKey(self) + ":" + to_string(value.toDouble());
string key = "fill_scalar_mps_impl" + getTensorsStringKey(self) + ":" + std::to_string(value.toDouble());
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
MPSGraphTensor* inputTensor = mpsGraphScalarPlaceHolder(mpsGraph, getMPSDataType(self.scalar_type()));

View File

@ -193,24 +193,24 @@ static Tensor _mps_convolution_impl(const Tensor& input_t,
string bias_shape_key;
if (bias_defined) {
bias_shape_key = to_string(bias_shape[0]);
bias_shape_key = std::to_string(bias_shape[0]);
} else {
bias_shape_key = "nobias";
}
string key;
if (is3DConv) {
key = "mps_3d_convolution:" + to_string(stride[0]) + ":" + to_string(stride[1]) + ":" + to_string(stride[2]) +
":" + to_string(dilation[0]) + ":" + to_string(dilation[1]) + ":" + to_string(dilation[2]) + ":" +
to_string(padding[0]) + ":" + to_string(padding[1]) + ":" + to_string(padding[2]) + ":" + to_string(groups) +
":" + mem_format_key + mps::getTensorsStringKey({input_t, weight_t}) + ":" + to_string(bias_defined) + ":" +
bias_shape_key;
key = "mps_3d_convolution:" + std::to_string(stride[0]) + ":" + std::to_string(stride[1]) + ":" +
std::to_string(stride[2]) + ":" + std::to_string(dilation[0]) + ":" + std::to_string(dilation[1]) + ":" +
std::to_string(dilation[2]) + ":" + std::to_string(padding[0]) + ":" + std::to_string(padding[1]) + ":" +
std::to_string(padding[2]) + ":" + std::to_string(groups) + ":" + mem_format_key +
mps::getTensorsStringKey({input_t, weight_t}) + ":" + std::to_string(bias_defined) + ":" + bias_shape_key;
} else {
key = "mps_convolution:" + to_string(stride[0]) + ":" + to_string(stride[1]) + ":" + to_string(dilation[0]) +
":" + to_string(dilation[1]) + ":" + to_string(padding[0]) + ":" + to_string(padding[1]) + ":" +
to_string(groups) + ":" + mem_format_key + mps::getTensorsStringKey({input_t, weight_t}) + ":" +
to_string(bias_defined) + ":" + bias_shape_key;
key = "mps_convolution:" + std::to_string(stride[0]) + ":" + std::to_string(stride[1]) + ":" +
std::to_string(dilation[0]) + ":" + std::to_string(dilation[1]) + ":" + std::to_string(padding[0]) + ":" +
std::to_string(padding[1]) + ":" + std::to_string(groups) + ":" + mem_format_key +
mps::getTensorsStringKey({input_t, weight_t}) + ":" + std::to_string(bias_defined) + ":" + bias_shape_key;
}
MPSShape* inputShape = mps::getMPSShape(input_t, memory_format);
@ -388,16 +388,16 @@ static Tensor mps_convolution_backward_input(IntArrayRef input_size,
NSString* ns_shape_key = [[gradOutputShape valueForKey:@"description"] componentsJoinedByString:@","];
string key;
if (is3DConv) {
key = "mps_3d_convolution_backward_input:" + to_string(stride[0]) + ":" + to_string(stride[1]) + ":" + ":" +
to_string(stride[2]) + to_string(dilation[0]) + ":" + to_string(dilation[1]) + ":" + to_string(dilation[2]) +
":" + to_string(padding[0]) + ":" + to_string(padding[1]) + ":" + to_string(padding[2]) + ":" +
to_string(groups) + ":" + mem_format_key + getTensorsStringKey({grad_output_t, weight_t}) + ":" +
string([ns_shape_key UTF8String]);
key = "mps_3d_convolution_backward_input:" + std::to_string(stride[0]) + ":" + std::to_string(stride[1]) + ":" +
":" + std::to_string(stride[2]) + std::to_string(dilation[0]) + ":" + std::to_string(dilation[1]) + ":" +
std::to_string(dilation[2]) + ":" + std::to_string(padding[0]) + ":" + std::to_string(padding[1]) + ":" +
std::to_string(padding[2]) + ":" + std::to_string(groups) + ":" + mem_format_key +
getTensorsStringKey({grad_output_t, weight_t}) + ":" + string([ns_shape_key UTF8String]);
} else {
key = "mps_convolution_backward_input:" + to_string(stride[0]) + ":" + to_string(stride[1]) + ":" +
to_string(dilation[0]) + ":" + to_string(dilation[1]) + ":" + to_string(padding[0]) + ":" +
to_string(padding[1]) + ":" + to_string(groups) + ":" + mem_format_key +
key = "mps_convolution_backward_input:" + std::to_string(stride[0]) + ":" + std::to_string(stride[1]) + ":" +
std::to_string(dilation[0]) + ":" + std::to_string(dilation[1]) + ":" + std::to_string(padding[0]) + ":" +
std::to_string(padding[1]) + ":" + std::to_string(groups) + ":" + mem_format_key +
getTensorsStringKey({grad_output_t, weight_t}) + ":" + string([ns_shape_key UTF8String]);
}
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
@ -547,15 +547,15 @@ static Tensor mps_convolution_backward_weights(IntArrayRef weight_size,
NSString* ns_shape_key = [[gradOutputShape valueForKey:@"description"] componentsJoinedByString:@","];
string key;
if (is3DConv) {
key = "mps_3d_convolution_backward_weights:" + to_string(stride[0]) + ":" + to_string(stride[1]) + ":" +
to_string(stride[2]) + ":" + to_string(dilation[0]) + ":" + to_string(dilation[1]) + ":" +
to_string(dilation[2]) + ":" + to_string(padding[0]) + ":" + to_string(padding[1]) + ":" +
to_string(padding[2]) + ":" + to_string(groups) + ":" + mem_format_key +
key = "mps_3d_convolution_backward_weights:" + std::to_string(stride[0]) + ":" + std::to_string(stride[1]) + ":" +
std::to_string(stride[2]) + ":" + std::to_string(dilation[0]) + ":" + std::to_string(dilation[1]) + ":" +
std::to_string(dilation[2]) + ":" + std::to_string(padding[0]) + ":" + std::to_string(padding[1]) + ":" +
std::to_string(padding[2]) + ":" + std::to_string(groups) + ":" + mem_format_key +
getTensorsStringKey({grad_output_t, input_t, grad_weight_t}) + ":" + string([ns_shape_key UTF8String]);
} else {
key = "mps_convolution_backward_weights:" + to_string(stride[0]) + ":" + to_string(stride[1]) + ":" +
to_string(dilation[0]) + ":" + to_string(dilation[1]) + ":" + to_string(padding[0]) + ":" +
to_string(padding[1]) + ":" + to_string(groups) + ":" + mem_format_key +
key = "mps_convolution_backward_weights:" + std::to_string(stride[0]) + ":" + std::to_string(stride[1]) + ":" +
std::to_string(dilation[0]) + ":" + std::to_string(dilation[1]) + ":" + std::to_string(padding[0]) + ":" +
std::to_string(padding[1]) + ":" + std::to_string(groups) + ":" + mem_format_key +
getTensorsStringKey({grad_output_t, input_t, grad_weight_t}) + ":" + string([ns_shape_key UTF8String]);
}
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {

View File

@ -63,7 +63,7 @@ Tensor& random_mps_impl(Tensor& self,
@autoreleasepool {
string key = op_name + getTensorsStringKey({self, mean_opt.value_or(Tensor()), std_opt.value_or(Tensor())}) + ":" +
to_string(val1) + ":" + to_string(val2);
std::to_string(val1) + ":" + std::to_string(val2);
auto cachedGraph = LookUpOrCreateCachedGraph<RandomCachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
newCachedGraph->stateTensor =
mpsGraphRankedPlaceHolder(mpsGraph, MPSDataTypeInt32, @[ @(at::mps::detail::PHILOX_STATE_N) ]);
@ -469,7 +469,7 @@ static Tensor& multinomial_with_replacement_mps_kernel(const Tensor& self,
MPSStream* stream = getCurrentMPSStream();
@autoreleasepool {
string key = "multinomial_with_replacement:" + getTensorsStringKey({self}) + ":" + to_string(n_sample);
string key = "multinomial_with_replacement:" + getTensorsStringKey({self}) + ":" + std::to_string(n_sample);
auto cachedGraph = LookUpOrCreateCachedGraph<RandomCachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
MPSShape* prob_shape = getMPSShape(self_v);
newCachedGraph->stateTensor = mpsGraphRankedPlaceHolder(mpsGraph, MPSDataTypeInt32, @[ @7 ]);

View File

@ -236,7 +236,7 @@ static std::tuple<Tensor, Tensor> _mps_linear_backward_weights(const Tensor& gra
MPSStream* stream = getCurrentMPSStream();
@autoreleasepool {
string key = "mps_linear_backward_weights:" + to_string(bias_defined) + ":" +
string key = "mps_linear_backward_weights:" + std::to_string(bias_defined) + ":" +
getTensorsStringKey({input_reshaped, weight, grad_output_reshaped});
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
MPSGraphTensor* inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, input_reshaped);

View File

@ -229,8 +229,8 @@ static Tensor& addbmm_or_baddbmm_out_mps_impl(const Tensor& input,
@autoreleasepool {
string key = (opType == ADDBMM_OP_TYPE) ? ("addbmm_out_mps_impl") : ("baddbmm_out_mps_impl");
key += getTensorsStringKey({batch1, batch2, input}) + ":" + to_string(beta.toDouble()) + ":" +
to_string(alpha.toDouble());
key += getTensorsStringKey({batch1, batch2, input}) + ":" + std::to_string(beta.toDouble()) + ":" +
std::to_string(alpha.toDouble());
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
MPSGraphTensor* inputTensor = mps::mpsGraphRankedPlaceHolder(mpsGraph, input);
@ -331,8 +331,8 @@ static Tensor& addmm_out_mps_impl(const Tensor& bias,
};
@autoreleasepool {
string key = "addmm_out_mps_impl" + getTensorsStringKey({self, other, *bias_}) + ":" + to_string(beta.toDouble()) +
":" + to_string(alpha.toDouble());
string key = "addmm_out_mps_impl" + getTensorsStringKey({self, other, *bias_}) + ":" +
std::to_string(beta.toDouble()) + ":" + std::to_string(alpha.toDouble());
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
MPSGraphTensor* selfTensor = nil;
MPSGraphTensor* otherTensor = nil;
@ -615,8 +615,8 @@ Tensor& addr_out_mps(const Tensor& self,
};
@autoreleasepool {
string key = "addr_out_mps_impl" + getTensorsStringKey({vec1, vec2, *self_}) + ":" + to_string(beta.toDouble()) +
":" + to_string(alpha.toDouble());
string key = "addr_out_mps_impl" + getTensorsStringKey({vec1, vec2, *self_}) + ":" +
std::to_string(beta.toDouble()) + ":" + std::to_string(alpha.toDouble());
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
MPSGraphTensor* t1 = mps::mpsGraphRankedPlaceHolder(mpsGraph, getMPSDataType(vec1), inputShape);
MPSGraphTensor* t2 = mps::mpsGraphRankedPlaceHolder(mpsGraph, getMPSDataType(vec2), otherShape);

View File

@ -69,7 +69,7 @@ static Tensor& mse_loss_backward_out_impl(const Tensor& grad_output,
};
@autoreleasepool {
string key = op_name + reductionToString(reduction) + ":" + to_string(grad_input.sizes()[1]) +
string key = op_name + reductionToString(reduction) + ":" + std::to_string(grad_input.sizes()[1]) +
getTensorsStringKey({input, target, grad_output});
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
newCachedGraph->inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, input);
@ -327,8 +327,8 @@ static void nllnd_loss_backward_impl(Tensor& grad_input_arg,
}
@autoreleasepool {
string key = "nllnd_loss_backward" + getTensorsStringKey({input, grad_output, target, weight, total_weight}) +
to_string(numClasses) + ":" + to_string(ignore_index) + ":" + to_string(isWeightsArrayValid) + ":" +
to_string(isTargetCasted) + ":" + reductionToString(reduction);
std::to_string(numClasses) + ":" + std::to_string(ignore_index) + ":" + std::to_string(isWeightsArrayValid) +
":" + std::to_string(isTargetCasted) + ":" + reductionToString(reduction);
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
MPSGraphTensor* inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, input);
@ -463,9 +463,9 @@ static void nllnd_loss_forward_impl(Tensor& output,
NSString* ns_shape_key = [[input_shape valueForKey:@"description"] componentsJoinedByString:@","];
// TODO: Make the key
string key = "nllnd_loss_forward_impl:" + to_string(ignore_index) + ":" + to_string(isWeightsArrayValid) + ":" +
reductionToString(reduction) + ":" + [ns_shape_key UTF8String] + ":" + getMPSTypeString(input) + ":" +
getMPSTypeString(target) + ":" + to_string(isTargetCasted) + ":" + getMPSTypeString(weight);
string key = "nllnd_loss_forward_impl:" + std::to_string(ignore_index) + ":" + std::to_string(isWeightsArrayValid) +
":" + reductionToString(reduction) + ":" + [ns_shape_key UTF8String] + ":" + getMPSTypeString(input) + ":" +
getMPSTypeString(target) + ":" + std::to_string(isTargetCasted) + ":" + getMPSTypeString(weight);
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
MPSGraphTensor* inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, getMPSDataType(input), input_shape);
MPSGraphTensor* targetTensor = mpsGraphRankedPlaceHolder(mpsGraph, getMPSDataType(target), target_shape);
@ -598,7 +598,7 @@ static void smooth_l1_loss_impl(const Tensor& input,
NSString* ns_shape_key = [[input_shape valueForKey:@"description"] componentsJoinedByString:@","];
string key = "smooth_l1_loss_impl:" + reductionToString(reduction) + ":" + [ns_shape_key UTF8String] + ":" +
to_string(beta) + ":" + getMPSTypeString(input) + ":" + getMPSTypeString(target);
std::to_string(beta) + ":" + getMPSTypeString(input) + ":" + getMPSTypeString(target);
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
// smooth_l1_loss_mps:
// ln = 0.5 * ( xn - yn ) ^ 2 / beta, if |xn - yn| < beta
@ -734,7 +734,7 @@ static void smooth_l1_loss_backward_impl(const Tensor& grad_output,
@autoreleasepool {
string key = "smooth_l1_loss_backward" + getTensorsStringKey({input, grad_output, grad_input, target}) + ":" +
reductionToString(reduction) + ":" + to_string(beta);
reductionToString(reduction) + ":" + std::to_string(beta);
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
MPSGraphTensor* inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, input);

View File

@ -106,7 +106,7 @@ Tensor& arange_mps_out(const Scalar& start, const Scalar& end, const Scalar& ste
auto stream = getCurrentMPSStream();
auto mpsDataType = getMPSDataType(result);
@autoreleasepool {
string key = "arange_mps_out" + getTensorsStringKey({result}) + ":" + to_string(size);
string key = "arange_mps_out" + getTensorsStringKey({result}) + ":" + std::to_string(size);
auto cachedGraph = cache_->LookUpAs<RangeCachedGraph>(key);
if (!cachedGraph) {
cachedGraph = cache_->CreateCachedGraphAs<RangeCachedGraph>(key, ^MPSCachedGraph*() {
@ -173,7 +173,7 @@ Tensor& range_mps_out(const Scalar& start, const Scalar& end, const Scalar& step
auto stream = getCurrentMPSStream();
auto mpsDataType = getMPSDataType(result);
@autoreleasepool {
string key = "arange_mps_out" + getTensorsStringKey({result}) + ":" + to_string(size);
string key = "arange_mps_out" + getTensorsStringKey({result}) + ":" + std::to_string(size);
auto cachedGraph = cache_->LookUpAs<RangeCachedGraph>(key);
if (!cachedGraph) {
cachedGraph = cache_->CreateCachedGraphAs<RangeCachedGraph>(key, ^MPSCachedGraph*() {
@ -221,8 +221,8 @@ Tensor& linspace_out_mps(const Scalar& start, const Scalar& end, int64_t steps,
bool start_less_end = (start.to<double>() <= end.to<double>());
@autoreleasepool {
string key =
"linspace_out_mps:" + getTensorsStringKey({result}) + ":" + to_string(steps) + to_string(start_less_end);
string key = "linspace_out_mps:" + getTensorsStringKey({result}) + ":" + std::to_string(steps) +
std::to_string(start_less_end);
auto cachedGraph = cache_->LookUpAs<RangeCachedGraph>(key);
if (!cachedGraph) {

View File

@ -359,8 +359,8 @@ static void impl_func_norm_mps(const Tensor& input_tensor,
NSString* ns_key = [[wrappedAxes valueForKey:@"description"] componentsJoinedByString:@","];
string keepdim_info = (keepdim) ? "keepdim=1" : "keepdim=0";
string tensor_key = cdist ? getTensorsStringKey({input_tensor, other_tensor}) : getTensorsStringKey({input_t});
string key = string("norm_out_mps:") + [ns_key UTF8String] + ":" + tensor_key + ":p" + to_string(p) + ":" +
keepdim_info + ":" + toString(in_dtype) + ":" + to_string(castInputData);
string key = string("norm_out_mps:") + [ns_key UTF8String] + ":" + tensor_key + ":p" + std::to_string(p) + ":" +
keepdim_info + ":" + toString(in_dtype) + ":" + std::to_string(castInputData);
auto cachedGraph = LookUpOrCreateCachedGraph<MPSBinaryCachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
newCachedGraph->inputTensor_ = mpsGraphRankedPlaceHolder(mpsGraph, input_tensor);
@ -572,7 +572,7 @@ static Tensor std_var_common_impl_mps(const Tensor& input_t,
string op_key = (stdVarType == STANDARD_DEVIATION) ? "std_mps" : "var_mps";
NSString* ns_key = [[wrappedAxes valueForKey:@"description"] componentsJoinedByString:@","];
string bessel_corrected = (use_correction && correction_value) ? "unbiased " : "biased ";
string use_dim_info = (use_dim) ? "use_dim=1:" + to_string(dim_value.size()) : "use_dim=0";
string use_dim_info = (use_dim) ? "use_dim=1:" + std::to_string(dim_value.size()) : "use_dim=0";
string keepdim_info = (keepdim) ? "keepdim=1" : "keepdim=0";
string key = op_key + ":" + getTensorsStringKey(input_t) + ":" + use_dim_info + ":" + keepdim_info + ":" +
string([ns_key UTF8String]) + ":" + bessel_corrected + ":" + std::to_string(correction_value);
@ -700,7 +700,7 @@ static void min_max_out_mps(const Tensor& input_t,
auto stream = at::mps::getCurrentMPSStream();
@autoreleasepool {
string key = func_name + getTensorsStringKey({input_t, indices_t}) + ":" + to_string(dim_);
string key = func_name + getTensorsStringKey({input_t, indices_t}) + ":" + std::to_string(dim_);
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
MPSGraphTensor* inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, input_t);
MPSGraphTensor* outputTensor = nil;
@ -860,7 +860,7 @@ static void argmax_argmin_out_mps(const Tensor& input_t,
@autoreleasepool {
NSString* ns_key = [[apparent_in_shape valueForKey:@"description"] componentsJoinedByString:@","];
string key =
func_name + ":" + to_string(dim_) + ":" + getTensorsStringKey(input_t) + ":" + string([ns_key UTF8String]);
func_name + ":" + std::to_string(dim_) + ":" + getTensorsStringKey(input_t) + ":" + string([ns_key UTF8String]);
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
auto inputScalarType = input_t.scalar_type();
MPSGraphTensor* inputTensor =
@ -1217,7 +1217,7 @@ TORCH_IMPL_FUNC(any_out_mps)
@autoreleasepool {
MPSShape* input_t_shape = getMPSShape(input_t);
string key = string("any_out_mps:") + getMPSShapeString(input_t_shape) + ":" + to_string(dim_) + ":" +
string key = string("any_out_mps:") + getMPSShapeString(input_t_shape) + ":" + std::to_string(dim_) + ":" +
getMPSTypeString(input_t);
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
MPSDataType input_type = getMPSDataType(input_t);
@ -1313,7 +1313,7 @@ TORCH_IMPL_FUNC(all_out_mps)
@autoreleasepool {
MPSShape* input_t_shape = getMPSShape(input_t);
string key = string("all_out_mps:") + getMPSShapeString(input_t_shape) + ":" + to_string(dim_) + ":" +
string key = string("all_out_mps:") + getMPSShapeString(input_t_shape) + ":" + std::to_string(dim_) + ":" +
getMPSTypeString(input_t);
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
MPSDataType input_type = getMPSDataType(input_t);
@ -1531,8 +1531,8 @@ static void median_out_mps(const Tensor& input_t,
auto stream = at::mps::getCurrentMPSStream();
@autoreleasepool {
string key =
func_name + ":" + to_string(dim_) + ":" + getTensorsStringKey(input_t) + ":" + getTensorsStringKey(indices_t);
string key = func_name + ":" + std::to_string(dim_) + ":" + getTensorsStringKey(input_t) + ":" +
getTensorsStringKey(indices_t);
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
MPSGraphTensor* inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, input_t);
MPSGraphTensor* castInputTensor =

View File

@ -108,8 +108,8 @@ TORCH_IMPL_FUNC(topk_out_mps)
// Input as placeholders
MPSShape* input_shape = getMPSShape(self);
NSString* ns_shape_key = [[input_shape valueForKey:@"description"] componentsJoinedByString:@","];
string key = string("topk:") + [ns_shape_key UTF8String] + ":" + getMPSTypeString(self) + ":k" + to_string(k) +
":dim" + to_string(dim_) + ":largest" + to_string(largest);
string key = string("topk:") + [ns_shape_key UTF8String] + ":" + getMPSTypeString(self) + ":k" + std::to_string(k) +
":dim" + std::to_string(dim_) + ":largest" + std::to_string(largest);
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
newCachedGraph->selfTensor = mpsGraphRankedPlaceHolder(mpsGraph, getMPSDataType(self), input_shape);
@ -320,12 +320,12 @@ TORCH_IMPL_FUNC(cat_out_mps)
};
@autoreleasepool {
string key =
"cat_out_mps:" + to_string(dimension) + ":" + (memory_format == MemoryFormat::ChannelsLast ? "NHWC" : "NCHW");
string key = "cat_out_mps:" + std::to_string(dimension) + ":" +
(memory_format == MemoryFormat::ChannelsLast ? "NHWC" : "NCHW");
if (!all_same_dtype) {
key += getTensorsStringKey(input_tensors, true, all_same_sizes_and_stride);
} else {
key += ":" + getMPSTypeString(input_tensors[0].scalar_type(), true) + ":" + to_string(inputs.size());
key += ":" + getMPSTypeString(input_tensors[0].scalar_type(), true) + ":" + std::to_string(inputs.size());
}
for (auto idx : skipped_tensor_indices) {
key += "," + std::to_string(idx);

View File

@ -60,8 +60,8 @@ TORCH_IMPL_FUNC(sort_stable_out_mps)
// Input as placeholders
MPSShape* input_shape = getMPSShape(self);
NSString* ns_shape_key = [[input_shape valueForKey:@"description"] componentsJoinedByString:@","];
string key = string("sort:") + [ns_shape_key UTF8String] + ":" + getMPSTypeString(self) + ":dim" + to_string(dim) +
":descending" + to_string(descending);
string key = string("sort:") + [ns_shape_key UTF8String] + ":" + getMPSTypeString(self) + ":dim" +
std::to_string(dim) + ":descending" + std::to_string(descending);
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
newCachedGraph->selfTensor = mpsGraphRankedPlaceHolder(mpsGraph, getMPSDataType(self), input_shape);

View File

@ -240,8 +240,8 @@ static void clamp_scalar_out_mps(const Tensor& input_t,
@autoreleasepool {
// the optional min/max refs could affect how we build the cached graph
string key = op_name + (has_min ? ("_min:" + to_string(min_scalar)) : "") +
(has_max ? ("_max:" + to_string(max_scalar)) : "") + "_scalar:" + getTensorsStringKey({input_t});
string key = op_name + (has_min ? ("_min:" + std::to_string(min_scalar)) : "") +
(has_max ? ("_max:" + std::to_string(max_scalar)) : "") + "_scalar:" + getTensorsStringKey({input_t});
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
if (has_min)
newCachedGraph->minTensor = [mpsGraph

View File

@ -13,32 +13,6 @@
#include <fmt/format.h>
namespace at::native {
static const std::string& getMetalType(const c10::ScalarType& t) {
// Mapping from c10::ScalarType to integral type that can be used for unary ops
static std::unordered_map<c10::ScalarType, std::string> scalar_to_metal_type = {
{c10::ScalarType::Half, "half"},
{c10::ScalarType::Float, "float"},
{c10::ScalarType::Long, "long"},
{c10::ScalarType::Int, "int"},
{c10::ScalarType::Short, "short"},
{c10::ScalarType::Bool, "bool"},
{c10::ScalarType::Char, "int8_t"},
{c10::ScalarType::Byte, "uint8_t"},
};
auto it = scalar_to_metal_type.find(t);
TORCH_CHECK(it != scalar_to_metal_type.end(), "Unsupported type ", t);
return it->second;
}
static const std::string& getMetalType(const c10::Scalar& s) {
return getMetalType(s.type());
}
static const std::string& getMetalType(const Tensor& t) {
return getMetalType(t.scalar_type());
}
static mps::MetalShaderLibrary lib(UNARY_KERNEL_TEMPLATE, 2);
TORCH_IMPL_FUNC(erfinv_out_mps)(const Tensor& self, const Tensor& output_) {
@ -57,7 +31,8 @@ TORCH_IMPL_FUNC(erfinv_out_mps)(const Tensor& self, const Tensor& output_) {
}
using namespace mps;
@autoreleasepool {
auto cplState = lib.getPipelineStateForFunc("erfinv_mps_kernel", {getMetalType(outputTensor), getMetalType(self)});
auto cplState = lib.getPipelineStateForFunc("erfinv_mps_kernel",
{scalarToMetalTypeString(outputTensor), scalarToMetalTypeString(self)});
if (!self.is_contiguous()) {
inputTensor = inputTensor.contiguous();

View File

@ -36,8 +36,8 @@ static std::string getUniqueKey(const ScalarType& dtype,
const bool consecutive,
c10::optional<int64_t> dimOpt) {
return "_unique2_mps:" + getMPSTypeString(dtype) + "[" + getArrayRefString(base_shape) + "]:[" +
(dimOpt.has_value() ? to_string(dimOpt.value()) : "None") + "]:[" + to_string(return_inverse) + "]:[" +
to_string(return_counts) + "]:[" + to_string(consecutive) + "]";
(dimOpt.has_value() ? std::to_string(dimOpt.value()) : "None") + "]:[" + std::to_string(return_inverse) + "]:[" +
std::to_string(return_counts) + "]:[" + std::to_string(consecutive) + "]";
}
// dim arg not supported when non consecutive, ie sorted

View File

@ -99,7 +99,7 @@ static void upsample_out_template(const Tensor& input,
@autoreleasepool {
string key = "upsample_" + std::string(resize_mode_str) + (align_corners ? "_aligned_corners" : "") +
getTensorsStringKey({input}) + ":[" + to_string(scale_h) + "," + to_string(scale_w) + "]:[" +
getTensorsStringKey({input}) + ":[" + std::to_string(scale_h) + "," + std::to_string(scale_w) + "]:[" +
(is_backward_pass ? getArrayRefString(input_size) : "Undefined") + "]";
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {

View File

@ -42,7 +42,7 @@ static std::string getStridedKey(const ScalarType& self_dtype,
}
return (is_scatter ? "scatter:" : "gather:") + dtype_key + "[" + getArrayRefString(base_shape) + "]:[" +
getArrayRefString(new_shape) + "]:[" + getArrayRefString(stride) + "]:[" + to_string(storage_offset) + "]";
getArrayRefString(new_shape) + "]:[" + getArrayRefString(stride) + "]:[" + std::to_string(storage_offset) + "]";
}
// initializes the MTLBuffers for tensor data and runs the MPSGraph for the view op

View File

@ -172,16 +172,6 @@ Tensor mean_quantized_cpu(
return result;
}
static Tensor& mean_out_quantized_cpu(
Tensor& result,
const Tensor& self,
DimnameList dim,
bool keepdim,
std::optional<ScalarType> opt_dtype) {
return mean_out_quantized_cpu(
self, dimnames_to_positions(self, dim), keepdim, opt_dtype, result);
}
// qstd
inline bool is_std_inner_dim_fast_path(
const Tensor& self,

View File

@ -216,20 +216,6 @@ Tensor upsample_bilinear2d_quantized_cpu(
}
}
using at::native::upsample::compute_output_size;
using at::native::upsample::get_scale_value;
static Tensor upsample_bilinear2d_quantized_cpu(
const Tensor& input,
at::OptionalIntArrayRef output_size,
bool align_corners,
std::optional<ArrayRef<double>> scale_factors) {
auto osize = compute_output_size(input.sizes(), output_size, scale_factors);
auto scale_h = get_scale_value(scale_factors, 0);
auto scale_w = get_scale_value(scale_factors, 1);
return upsample_bilinear2d_quantized_cpu(input, osize, align_corners, scale_h, scale_w);
}
DEFINE_DISPATCH(qupsample_bilinear2d_nhwc_stub);
} // namespace native
} // namespace at

View File

@ -1,6 +1,7 @@
#define TORCH_ASSERT_ONLY_METHOD_OPERATORS
#include <algorithm>
#include <cmath>
#include <string>
#include <vector>
#include <ATen/core/Tensor.h>
@ -35,7 +36,6 @@
#endif
#include <c10/util/irange.h>
#include <c10/util/string_utils.h>
namespace {
// To have a sanity check for maximum matrix size.
@ -1848,15 +1848,15 @@ class QConvInt8ForBC final {
int64_t output_zero_point) {
if (kReluFused) {
TORCH_WARN_ONCE(
"Arguments [stride, padding, dilation, groups] in ops.quantized.conv"
+ c10::to_string(kSpatialDim) + "d_relu, " +
"have been removed, please update your model to remove these arguments.");
"Arguments [stride, padding, dilation, groups] in ops.quantized.conv" +
std::to_string(kSpatialDim),
"d_relu, have been removed, please update your model to remove these arguments.");
return packed_weight->apply_relu(act, output_scale, output_zero_point);
} else {
TORCH_WARN_ONCE(
"Arguments [stride, padding, dilation, groups] in ops.quantized.conv"
+ c10::to_string(kSpatialDim) + "d, " +
"have been removed, please update your model to remove these arguments.");
"Arguments [stride, padding, dilation, groups] in ops.quantized.conv",
std::to_string(kSpatialDim),
"d, have been removed, please update your model to remove these arguments.");
return packed_weight->apply(act, output_scale, output_zero_point);
}
}
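This hunk, like the TORCH_CHECK hunk in the next file, stops concatenating the warning message with c10::to_string and instead passes comma-separated message parts. A rough stand-in for that idea, assuming only that the macro accepts a variadic list of parts and stringifies each one itself:

#include <iostream>
#include <sstream>
#include <string>

// Stand-in for a variadic warn/check macro: each argument is streamed into the
// message, so call sites pass parts (strings, ints, ...) instead of
// pre-concatenating them with to_string.
template <typename... Args>
void warn_once(const Args&... parts) {
  static bool warned = false;
  if (warned) {
    return;
  }
  warned = true;
  std::ostringstream oss;
  (oss << ... << parts);  // C++17 fold expression
  std::cerr << "Warning: " << oss.str() << '\n';
}

int main() {
  const int kSpatialDim = 2;  // illustrative value only
  warn_once("Arguments [stride, padding, dilation, groups] in ops.quantized.conv",
            kSpatialDim,
            "d_relu have been removed, please update your model to remove these arguments.");
}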

View File

@ -342,7 +342,10 @@ Tensor qembeddingbag_byte_prepack_meta(const Tensor& weight) {
output_shape[cols_dim] = output_columns;
at::SymDimVector output_shape_vec(output_shape);
return at::empty_symint(output_shape_vec, weight.options().dtype(weight.scalar_type()), weight.suggest_memory_format());
return at::empty_symint(
output_shape_vec,
weight.options().dtype(weight.scalar_type()),
weight.suggest_memory_format());
}
namespace {
@ -373,9 +376,10 @@ Tensor _qembeddingbag_nbit_prepack_helper(
int NUM_ELEM_PER_BYTE = 8 / bit_width;
TORCH_CHECK(
weight_contig.size(weight.dim() - 1) % NUM_ELEM_PER_BYTE == 0,
"qembeddingbag_" + c10::to_string(bit_width) +
"bit_prepack only works for the number of columns a multiple of " +
c10::to_string(NUM_ELEM_PER_BYTE));
"qembeddingbag_",
std::to_string(bit_width),
"bit_prepack only works for the number of columns a multiple of ",
std::to_string(NUM_ELEM_PER_BYTE));
// The "fused" representation stores the scale and bias with the
// row-wise quantized data in one tensor.
@ -551,11 +555,9 @@ TORCH_LIBRARY_IMPL(quantized, QuantizedCPU, m) {
TORCH_FN(QEmbeddingPackWeights::run));
}
TORCH_LIBRARY_IMPL(quantized, Meta, m) {
m.impl(
"quantized::embedding_bag_byte_prepack",
qembeddingbag_byte_prepack_meta);
"quantized::embedding_bag_byte_prepack", qembeddingbag_byte_prepack_meta);
}
} // namespace

View File

@ -270,10 +270,6 @@ Tensor& div_sparse_(Tensor& self, const Tensor& value) {
return div_out_sparse_zerodim(self, value, self);
}
static SparseTensor& div_out_sparse_scalar(const SparseTensor& t, Scalar value, SparseTensor& r) {
return div_out_sparse_zerodim(t, wrapped_scalar_tensor(value), r);
}
Tensor div_sparse(const Tensor& self, const Tensor& value, std::optional<c10::string_view> rounding_mode) {
auto commonDtype = at::result_type(self, value);
if (c10::isIntegralType(commonDtype, /*includeBool=*/true) && !rounding_mode.has_value()) {
@ -287,10 +283,6 @@ Tensor& div_sparse_(Tensor& self, const Tensor& value, std::optional<c10::string
return div_out_sparse_zerodim(self, value, std::move(rounding_mode), self);
}
static SparseTensor& div_out_sparse_scalar(const SparseTensor& t, Scalar value, std::optional<c10::string_view> rounding_mode, SparseTensor& r) {
return div_out_sparse_zerodim(t, wrapped_scalar_tensor(value), std::move(rounding_mode), r);
}
// --------------------------------------------------------------------
// floor_divide(SparseTensor, Scalar)
// --------------------------------------------------------------------
@ -350,10 +342,6 @@ Tensor& floor_divide_sparse_(Tensor& self, const Tensor& value) {
return floor_divide_out_sparse_zerodim(self, value, self);
}
static SparseTensor& floor_divide_out_sparse_scalar(SparseTensor& r, const SparseTensor& t, const Scalar& value) {
return floor_divide_out_sparse_zerodim(t, wrapped_scalar_tensor(value), r);
}
// --------------------------------------------------------------------
// norm(SparseTensor, Scalar)
// --------------------------------------------------------------------

View File

@ -764,8 +764,8 @@ std::tuple<Tensor, Tensor, Tensor, Tensor> _scaled_dot_product_cudnn_attention_c
const int64_t batch_size = query.size(0);
const int64_t num_heads = query.size(1);
const int64_t max_seqlen_batch_q = query.size(2);
const int64_t head_dim = query.size(3);
const int64_t head_dim_qk = query.size(3);
const int64_t head_dim_v = value.size(3);
const int64_t max_seqlen_batch_k = key.size(2);
const int64_t max_seqlen_batch_v = value.size(2);
TORCH_CHECK(
@ -806,7 +806,8 @@ std::tuple<Tensor, Tensor, Tensor, Tensor> _scaled_dot_product_cudnn_attention_c
num_heads/*int64_t h*/,
max_seqlen_batch_q/*int64_t s_q*/,
max_seqlen_batch_k/*int64_t s_kv*/,
head_dim/*int64_t d*/,
head_dim_qk/*int64_t d_qk*/,
head_dim_v/*int64_t d_v*/,
softmax_scale/*float scaling_factor*/,
compute_logsumexp/* bool */,
is_causal/* bool */,

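These hunks stop assuming a single head dimension and pass separate d_qk and d_v values to the cuDNN call. The shape argument for why the two can differ, written for one attention head (standard scaled-dot-product-attention algebra, not code from this diff):

\[
O = \operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_{qk}}}\right)V,
\qquad
Q \in \mathbb{R}^{s_q \times d_{qk}},\;
K \in \mathbb{R}^{s_{kv} \times d_{qk}},\;
V \in \mathbb{R}^{s_{kv} \times d_v},\;
O \in \mathbb{R}^{s_q \times d_v}.
\]

Only d_qk enters the QK^T product and the scaling, while d_v only sets the width of V and of the output, so the kernel needs both values.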
View File

@ -194,12 +194,11 @@ std::tuple<Tensor, Tensor, Tensor> _scaled_dot_product_cudnn_attention_backward_
const int64_t batch_size = query.size(0);
const int64_t num_heads = query.size(1);
const int64_t head_dim = query.size(3);
const int64_t head_dim_qk = query.size(3);
const int64_t head_dim_v = value.size(3);
const int64_t max_seqlen_batch_q = query.size(1);
const int64_t max_seqlen_batch_k = key.size(1);
const auto softmax_scale = sdp::calculate_scale(query, scale).as_float_unchecked();
auto dq = at::empty_like(query);
auto dk = at::empty_like(key);
auto dv = at::empty_like(value);
@ -207,7 +206,8 @@ std::tuple<Tensor, Tensor, Tensor> _scaled_dot_product_cudnn_attention_backward_
num_heads /*int64_t h*/,
max_seqlen_batch_q /*int64_t s_q*/,
max_seqlen_batch_k /*int64_t s_kv*/,
head_dim /*int64_t d*/,
head_dim_qk /*int64_t d_qk*/,
head_dim_v /*int64_t d_v*/,
softmax_scale /*float scaling_factor*/,
is_causal /*bool is_causal*/,
dropout_p /*float dropout_probability*/,

View File

@ -14,11 +14,11 @@ AllenaiLongformerBase,pass,9
BartForCausalLM,pass,12
BartForCausalLM,pass,6
BartForConditionalGeneration,pass,24
BartForConditionalGeneration,pass,8
@ -34,11 +34,11 @@ BlenderbotForCausalLM,eager_fail_to_run,0
BlenderbotSmallForCausalLM,pass,12
BlenderbotSmallForCausalLM,pass,6
BlenderbotSmallForConditionalGeneration,pass,24
BlenderbotSmallForConditionalGeneration,pass,8
@ -102,11 +102,11 @@ M2M100ForConditionalGeneration,pass,4
MBartForCausalLM,pass,12
MBartForCausalLM,pass,6
MBartForConditionalGeneration,pass,24
MBartForConditionalGeneration,pass,8
@ -130,23 +130,23 @@ MobileBertForQuestionAnswering,pass,3
OPTForCausalLM,pass,12
OPTForCausalLM,pass,6
PLBartForCausalLM,pass,12
PLBartForCausalLM,pass,6
PLBartForConditionalGeneration,pass,29
PLBartForConditionalGeneration,pass,8
PegasusForCausalLM,pass,12
PegasusForCausalLM,pass,6
PegasusForConditionalGeneration,pass,23
PegasusForConditionalGeneration,pass,7
@ -158,7 +158,7 @@ RobertaForQuestionAnswering,pass,5
Speech2Text2ForCausalLM,pass,12
Speech2Text2ForCausalLM,pass,6
@ -170,11 +170,11 @@ T5Small,pass,5
TrOCRForCausalLM,pass,12
TrOCRForCausalLM,pass,6
XGLMForCausalLM,pass,12
XGLMForCausalLM,pass,6

(CSV columns: name, accuracy, graph_breaks)

View File

@ -150,7 +150,7 @@ hf_Bert_large,pass,0
hf_BigBird,pass,46
hf_BigBird,pass,43
@ -378,4 +378,4 @@ vision_maskrcnn,pass,17
yolov3,pass,2
yolov3,pass,0


View File

@ -98,7 +98,7 @@ hf_Bert_large,pass,6
hf_BigBird,pass, 52
hf_BigBird,pass,49
@ -286,4 +286,4 @@ vision_maskrcnn,pass,34
yolov3,pass,9
yolov3,fail_accuracy,8


View File

@ -242,7 +242,7 @@ pyhpc_equation_of_state,pass,0
pyhpc_isoneutral_mixing,fail_to_run,0
pyhpc_isoneutral_mixing,pass,0
@ -350,4 +350,4 @@ vision_maskrcnn,fail_to_run,0
yolov3,fail_to_run,0
yolov3,pass,0


View File

@ -338,4 +338,4 @@ vision_maskrcnn,pass,28
yolov3,pass,2
yolov3,pass,0


View File

@ -338,4 +338,4 @@ vision_maskrcnn,pass,28
yolov3,pass,2
yolov3,pass,0


View File

@ -242,7 +242,7 @@ pyhpc_equation_of_state,pass,0
pyhpc_isoneutral_mixing,fail_to_run,0
pyhpc_isoneutral_mixing,pass,0
@ -350,4 +350,4 @@ vision_maskrcnn,fail_to_run,0
yolov3,fail_to_run,0
yolov3,pass,0


View File

@ -14,11 +14,11 @@ AllenaiLongformerBase,pass,9
BartForCausalLM,pass,12
BartForCausalLM,pass,6
BartForConditionalGeneration,pass,24
BartForConditionalGeneration,pass,8
@ -34,11 +34,11 @@ BlenderbotForCausalLM,eager_fail_to_run,0
BlenderbotSmallForCausalLM,pass,12
BlenderbotSmallForCausalLM,pass,6
BlenderbotSmallForConditionalGeneration,pass,24
BlenderbotSmallForConditionalGeneration,pass,8
@ -102,11 +102,11 @@ M2M100ForConditionalGeneration,pass,4
MBartForCausalLM,pass,12
MBartForCausalLM,pass,6
MBartForConditionalGeneration,pass,24
MBartForConditionalGeneration,pass,8
@ -130,23 +130,23 @@ MobileBertForQuestionAnswering,pass,3
OPTForCausalLM,pass,12
OPTForCausalLM,pass,6
PLBartForCausalLM,pass,12
PLBartForCausalLM,pass,6
PLBartForConditionalGeneration,pass,29
PLBartForConditionalGeneration,pass,8
PegasusForCausalLM,pass,12
PegasusForCausalLM,pass,6
PegasusForConditionalGeneration,pass,23
PegasusForConditionalGeneration,pass,7
@ -158,7 +158,7 @@ RobertaForQuestionAnswering,pass,5
Speech2Text2ForCausalLM,pass,12
Speech2Text2ForCausalLM,pass,6
@ -170,11 +170,11 @@ T5Small,pass,5
TrOCRForCausalLM,pass,12
TrOCRForCausalLM,pass,6
XGLMForCausalLM,pass,12
XGLMForCausalLM,pass,6


View File

@ -14,11 +14,11 @@ AllenaiLongformerBase,pass,9
BartForCausalLM,pass,12
BartForCausalLM,pass,6
BartForConditionalGeneration,pass,24
BartForConditionalGeneration,pass,8
@ -34,11 +34,11 @@ BlenderbotForCausalLM,eager_fail_to_run,0
BlenderbotSmallForCausalLM,pass,12
BlenderbotSmallForCausalLM,pass,6
BlenderbotSmallForConditionalGeneration,pass,24
BlenderbotSmallForConditionalGeneration,pass,8
@ -102,11 +102,11 @@ M2M100ForConditionalGeneration,pass,4
MBartForCausalLM,pass,12
MBartForCausalLM,pass,6
MBartForConditionalGeneration,pass,24
MBartForConditionalGeneration,pass,8
@ -130,23 +130,23 @@ MobileBertForQuestionAnswering,pass,3
OPTForCausalLM,pass,12
OPTForCausalLM,pass,6
PLBartForCausalLM,pass,12
PLBartForCausalLM,pass,6
PLBartForConditionalGeneration,pass,29
PLBartForConditionalGeneration,pass,8
PegasusForCausalLM,pass,12
PegasusForCausalLM,pass,6
PegasusForConditionalGeneration,pass,23
PegasusForConditionalGeneration,pass,7
@ -158,7 +158,7 @@ RobertaForQuestionAnswering,pass,5
Speech2Text2ForCausalLM,pass,12
Speech2Text2ForCausalLM,pass,6
@ -170,11 +170,11 @@ T5Small,pass,5
TrOCRForCausalLM,pass,12
TrOCRForCausalLM,pass,6
XGLMForCausalLM,pass,12
XGLMForCausalLM,pass,6


View File

@ -14,11 +14,11 @@ AllenaiLongformerBase,pass,9
BartForCausalLM,pass,12
BartForCausalLM,pass,6
BartForConditionalGeneration,pass,24
BartForConditionalGeneration,pass,8
@ -34,11 +34,11 @@ BlenderbotForCausalLM,eager_fail_to_run,0
BlenderbotSmallForCausalLM,pass,12
BlenderbotSmallForCausalLM,pass,6
BlenderbotSmallForConditionalGeneration,pass,24
BlenderbotSmallForConditionalGeneration,pass,8
@ -102,11 +102,11 @@ M2M100ForConditionalGeneration,pass,4
MBartForCausalLM,pass,12
MBartForCausalLM,pass,6
MBartForConditionalGeneration,pass,24
MBartForConditionalGeneration,pass,8
@ -130,23 +130,23 @@ MobileBertForQuestionAnswering,pass,3
OPTForCausalLM,pass,12
OPTForCausalLM,pass,6
PLBartForCausalLM,pass,12
PLBartForCausalLM,pass,6
PLBartForConditionalGeneration,pass,29
PLBartForConditionalGeneration,pass,8
PegasusForCausalLM,pass,12
PegasusForCausalLM,pass,6
PegasusForConditionalGeneration,pass,23
PegasusForConditionalGeneration,pass,7
@ -158,7 +158,7 @@ RobertaForQuestionAnswering,pass,5
Speech2Text2ForCausalLM,pass,12
Speech2Text2ForCausalLM,pass,6
@ -170,11 +170,11 @@ T5Small,pass,5
TrOCRForCausalLM,pass,12
TrOCRForCausalLM,pass,6
XGLMForCausalLM,pass,12
XGLMForCausalLM,pass,6


View File

@ -150,7 +150,7 @@ hf_Bert_large,pass,0
hf_BigBird,pass,46
hf_BigBird,pass,43
@ -374,4 +374,4 @@ vision_maskrcnn,pass,17
yolov3,pass,2
yolov3,pass,0


View File

@ -98,7 +98,7 @@ hf_Bert_large,pass,6
hf_BigBird,pass,52
hf_BigBird,pass,49
@ -282,4 +282,4 @@ vision_maskrcnn,pass,34
yolov3,pass,9
yolov3,fail_accuracy,8


View File

@ -298,4 +298,4 @@ vision_maskrcnn,pass,28
yolov3,pass,2
yolov3,pass,0


View File

@ -14,11 +14,11 @@ AllenaiLongformerBase,pass,9
BartForCausalLM,pass,12
BartForCausalLM,pass,6
BartForConditionalGeneration,pass,24
BartForConditionalGeneration,pass,8
@ -34,11 +34,11 @@ BlenderbotForCausalLM,eager_fail_to_run,0
BlenderbotSmallForCausalLM,pass,12
BlenderbotSmallForCausalLM,pass,6
BlenderbotSmallForConditionalGeneration,pass,24
BlenderbotSmallForConditionalGeneration,pass,8
@ -102,11 +102,11 @@ M2M100ForConditionalGeneration,pass,4
MBartForCausalLM,pass,12
MBartForCausalLM,pass,6
MBartForConditionalGeneration,pass,24
MBartForConditionalGeneration,pass,8
@ -130,23 +130,23 @@ MobileBertForQuestionAnswering,pass,3
OPTForCausalLM,pass,12
OPTForCausalLM,pass,6
PLBartForCausalLM,pass,12
PLBartForCausalLM,pass,6
PLBartForConditionalGeneration,pass,29
PLBartForConditionalGeneration,pass,8
PegasusForCausalLM,pass,12
PegasusForCausalLM,pass,6
PegasusForConditionalGeneration,pass,23
PegasusForConditionalGeneration,pass,7
@ -158,7 +158,7 @@ RobertaForQuestionAnswering,pass,5
Speech2Text2ForCausalLM,pass,12
Speech2Text2ForCausalLM,pass,6
@ -170,11 +170,11 @@ T5Small,pass,5
TrOCRForCausalLM,pass,12
TrOCRForCausalLM,pass,6
XGLMForCausalLM,pass,12
XGLMForCausalLM,pass,6


View File

@ -150,7 +150,7 @@ hf_Bert_large,pass,0
hf_BigBird,fail_accuracy,46
hf_BigBird,fail_accuracy,43
@ -374,4 +374,4 @@ vision_maskrcnn,pass,17
yolov3,pass,2
yolov3,pass,0


View File

@ -98,7 +98,7 @@ hf_Bert_large,pass,6
hf_BigBird,pass,52
hf_BigBird,pass,49
@ -282,4 +282,4 @@ vision_maskrcnn,pass,34
yolov3,pass,9
yolov3,pass,8


View File

@ -14,11 +14,11 @@ AllenaiLongformerBase,pass,9
BartForCausalLM,pass,12
BartForCausalLM,pass,6
BartForConditionalGeneration,pass,24
BartForConditionalGeneration,pass,8
@ -34,11 +34,11 @@ BlenderbotForCausalLM,eager_fail_to_run,0
BlenderbotSmallForCausalLM,pass,12
BlenderbotSmallForCausalLM,pass,6
BlenderbotSmallForConditionalGeneration,pass,24
BlenderbotSmallForConditionalGeneration,pass,8
@ -102,11 +102,11 @@ M2M100ForConditionalGeneration,pass,4
MBartForCausalLM,pass,12
MBartForCausalLM,pass,6
MBartForConditionalGeneration,pass,24
MBartForConditionalGeneration,pass,8
@ -130,23 +130,23 @@ MobileBertForQuestionAnswering,pass,3
OPTForCausalLM,pass,12
OPTForCausalLM,pass,6
PLBartForCausalLM,pass,12
PLBartForCausalLM,pass,6
PLBartForConditionalGeneration,pass,29
PLBartForConditionalGeneration,pass,8
PegasusForCausalLM,pass,12
PegasusForCausalLM,pass,6
PegasusForConditionalGeneration,pass,23
PegasusForConditionalGeneration,pass,7
@ -158,7 +158,7 @@ RobertaForQuestionAnswering,pass,5
Speech2Text2ForCausalLM,pass,12
Speech2Text2ForCausalLM,pass,6
@ -170,11 +170,11 @@ T5Small,pass,5
TrOCRForCausalLM,pass,12
TrOCRForCausalLM,pass,6
XGLMForCausalLM,pass,12
XGLMForCausalLM,pass,6


View File

@ -150,7 +150,7 @@ hf_Bert_large,pass,0
hf_BigBird,pass,46
hf_BigBird,pass,43
@ -378,4 +378,4 @@ vision_maskrcnn,pass,17
yolov3,pass,2
yolov3,pass,0


View File

@ -98,7 +98,7 @@ hf_Bert_large,pass,6
hf_BigBird,pass,52
hf_BigBird,pass,49
@ -286,4 +286,4 @@ vision_maskrcnn,pass,34
yolov3,pass,9
yolov3,pass,8


View File

@ -14,11 +14,11 @@ AllenaiLongformerBase,pass,9
BartForCausalLM,pass,12
BartForCausalLM,pass,6
BartForConditionalGeneration,pass,24
BartForConditionalGeneration,pass,8
@ -34,11 +34,11 @@ BlenderbotForCausalLM,eager_fail_to_run,0
BlenderbotSmallForCausalLM,pass,12
BlenderbotSmallForCausalLM,pass,6
BlenderbotSmallForConditionalGeneration,pass,24
BlenderbotSmallForConditionalGeneration,pass,8
@ -102,11 +102,11 @@ M2M100ForConditionalGeneration,pass,4
MBartForCausalLM,pass,12
MBartForCausalLM,pass,6
MBartForConditionalGeneration,pass,24
MBartForConditionalGeneration,pass,8
@ -130,23 +130,23 @@ MobileBertForQuestionAnswering,pass,3
OPTForCausalLM,pass,12
OPTForCausalLM,pass,6
PLBartForCausalLM,pass,12
PLBartForCausalLM,pass,6
PLBartForConditionalGeneration,pass,29
PLBartForConditionalGeneration,pass,8
PegasusForCausalLM,pass,12
PegasusForCausalLM,pass,6
PegasusForConditionalGeneration,pass,23
PegasusForConditionalGeneration,pass,7
@ -158,7 +158,7 @@ RobertaForQuestionAnswering,pass,5
Speech2Text2ForCausalLM,pass,12
Speech2Text2ForCausalLM,pass,6
@ -170,11 +170,11 @@ T5Small,pass,5
TrOCRForCausalLM,pass,12
TrOCRForCausalLM,pass,6
XGLMForCausalLM,pass,12
XGLMForCausalLM,pass,6


View File

@ -150,7 +150,7 @@ hf_Bert_large,pass,0
hf_BigBird,fail_accuracy,46
hf_BigBird,fail_accuracy,43
@ -378,4 +378,4 @@ vision_maskrcnn,pass,17
yolov3,pass,2
yolov3,pass,0


View File

@ -98,7 +98,7 @@ hf_Bert_large,pass,6
hf_BigBird,pass,52
hf_BigBird,pass,49
@ -286,4 +286,4 @@ vision_maskrcnn,pass,34
yolov3,pass,9
yolov3,pass,8


View File

@ -4,12 +4,11 @@ phlippe_densenet,float32,static,default,1.3988316
basic_gnn_gcn,float32,dynamic,default,1.074576405
llama_v2_7b_16h,float32,dynamic,default,1.211740245
resnet50,float32,dynamic,default,1.65984261
timm_efficientnet,float32,static,cpp,2.271561735
#timm_efficientnet,float32,static,cpp,2.1938112
mobilenet_v3_large,float32,static,cpp,2.63375628
timm_resnest,float32,dynamic,cpp,1.67998548
pyhpc_turbulent_kinetic_energy,float32,dynamic,cpp,1.59968463
#hf_GPT2,float32,dynamic,cpp,
hf_GPT2,float32,dynamic,cpp,1.379885175
#hf_GPT2,float32,dynamic,cpp,1.292704418
resnext50_32x4d,amp,static,default,1.461687045
vgg16,amp,static,default,1.267194285
hf_Longformer,amp,dynamic,default,0.997006035
@ -17,6 +16,6 @@ hf_Bert_large,amp,dynamic,default,0.99391146
llama,amp,static,default,1.32950568
timm_regnet,amp,static,cpp,1.157188305
lennard_jones,amp,static,cpp,2.240104485
hf_T5_generate,amp,dynamic,cpp,1.447656135
#hf_T5_generate,amp,dynamic,cpp,1.29339502
timm_vovnet,amp,dynamic,cpp,1.07856471
mobilenet_v2,amp,dynamic,cpp,2.27774577

(CSV columns: #name, data_type, shape, wrapper, perf_speedup_target_c7i_metal_24xl)

View File

@ -272,6 +272,38 @@ TEST(StaticRuntime, autogen_addr) {
/*check_resize=*/true);
}
TEST(StaticRuntime, autogen__test_functorch_fallback) {
const std::string script = R"IR(
graph(%self: Tensor, %other: Tensor):
%bias: None = prim::Constant()
%ret = aten::_test_functorch_fallback(%self, %other)
%cloned = aten::clone(%ret, %bias)
return (%cloned)
)IR";
auto self0 = at::rand({6, 6, 6});
auto other0 = at::rand({6, 6, 6});
std::vector<IValue> args{self0, other0};
testStaticRuntime(
script,
args,
{},
/*use_allclose=*/false,
/*use_equalnan=*/false,
/*check_resize=*/true);
auto self1 = at::rand({22, 22, 22});
auto other1 = at::rand({22, 22, 22});
std::vector<IValue> args2{self1, other1};
testStaticRuntime(
script,
args,
args2,
/*use_allclose=*/false,
/*use_equalnan=*/false,
/*check_resize=*/true);
}
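// Note on the pattern used by the generated tests in this file: the first
// testStaticRuntime call runs the graph with one set of inputs, and the second
// call additionally passes the differently shaped args2 so that, together with
// check_resize=true, output buffers are exercised across a shape change.
// (This description is inferred from the surrounding tests rather than from
// testStaticRuntime's implementation.)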
TEST(StaticRuntime, autogen_argmax) {
const std::string script = R"IR(
graph(%self: Tensor, %dim: int?, %keepdim: bool):
@ -4440,6 +4472,40 @@ TEST(StaticRuntime, autogen_masked_select) {
/*check_resize=*/true);
}
TEST(StaticRuntime, autogen_nonzero_static) {
const std::string script = R"IR(
graph(%self: Tensor, %size: int, %fill_value: int):
%bias: None = prim::Constant()
%ret = aten::nonzero_static(%self, %size, %fill_value)
%cloned = aten::clone(%ret, %bias)
return (%cloned)
)IR";
auto self0 = at::rand({6, 6, 6});
auto size0 = 1;
auto fill_value0 = 1;
std::vector<IValue> args{self0, size0, fill_value0};
testStaticRuntime(
script,
args,
{},
/*use_allclose=*/false,
/*use_equalnan=*/false,
/*check_resize=*/true);
auto self1 = at::rand({22, 22, 22});
auto size1 = 1;
auto fill_value1 = 1;
std::vector<IValue> args2{self1, size1, fill_value1};
testStaticRuntime(
script,
args,
args2,
/*use_allclose=*/false,
/*use_equalnan=*/false,
/*check_resize=*/true);
}
TEST(StaticRuntime, autogen_gather) {
const std::string script = R"IR(
graph(%self: Tensor, %dim: int, %index: Tensor, %sparse_grad: bool):
@ -7106,222 +7172,6 @@ TEST(StaticRuntime, autogen_special_multigammaln) {
/*check_resize=*/true);
}
TEST(StaticRuntime, autogen_fft_fft) {
const std::string script = R"IR(
graph(%self: Tensor, %n: int?, %dim: int, %norm: str?):
%bias: None = prim::Constant()
%ret = aten::fft_fft(%self, %n, %dim, %norm)
%cloned = aten::clone(%ret, %bias)
return (%cloned)
)IR";
auto self0 = at::rand({6, 6, 6});
auto n0 = 1;
auto dim0 = 1;
auto norm0 = "forward";
std::vector<IValue> args{self0, n0, dim0, norm0};
testStaticRuntime(
script,
args,
{},
/*use_allclose=*/false,
/*use_equalnan=*/false,
/*check_resize=*/true);
auto self1 = at::rand({22, 22, 22});
auto n1 = 1;
auto dim1 = 1;
auto norm1 = "forward";
std::vector<IValue> args2{self1, n1, dim1, norm1};
testStaticRuntime(
script,
args,
args2,
/*use_allclose=*/false,
/*use_equalnan=*/false,
/*check_resize=*/true);
}
TEST(StaticRuntime, autogen_fft_ifft) {
const std::string script = R"IR(
graph(%self: Tensor, %n: int?, %dim: int, %norm: str?):
%bias: None = prim::Constant()
%ret = aten::fft_ifft(%self, %n, %dim, %norm)
%cloned = aten::clone(%ret, %bias)
return (%cloned)
)IR";
auto self0 = at::rand({6, 6, 6});
auto n0 = 1;
auto dim0 = 1;
auto norm0 = "forward";
std::vector<IValue> args{self0, n0, dim0, norm0};
testStaticRuntime(
script,
args,
{},
/*use_allclose=*/false,
/*use_equalnan=*/false,
/*check_resize=*/true);
auto self1 = at::rand({22, 22, 22});
auto n1 = 1;
auto dim1 = 1;
auto norm1 = "forward";
std::vector<IValue> args2{self1, n1, dim1, norm1};
testStaticRuntime(
script,
args,
args2,
/*use_allclose=*/false,
/*use_equalnan=*/false,
/*check_resize=*/true);
}
TEST(StaticRuntime, autogen_fft_rfft) {
const std::string script = R"IR(
graph(%self: Tensor, %n: int?, %dim: int, %norm: str?):
%bias: None = prim::Constant()
%ret = aten::fft_rfft(%self, %n, %dim, %norm)
%cloned = aten::clone(%ret, %bias)
return (%cloned)
)IR";
auto self0 = at::rand({6, 6, 6});
auto n0 = 1;
auto dim0 = 1;
auto norm0 = "forward";
std::vector<IValue> args{self0, n0, dim0, norm0};
testStaticRuntime(
script,
args,
{},
/*use_allclose=*/false,
/*use_equalnan=*/false,
/*check_resize=*/true);
auto self1 = at::rand({22, 22, 22});
auto n1 = 1;
auto dim1 = 1;
auto norm1 = "forward";
std::vector<IValue> args2{self1, n1, dim1, norm1};
testStaticRuntime(
script,
args,
args2,
/*use_allclose=*/false,
/*use_equalnan=*/false,
/*check_resize=*/true);
}
TEST(StaticRuntime, autogen_fft_irfft) {
const std::string script = R"IR(
graph(%self: Tensor, %n: int?, %dim: int, %norm: str?):
%bias: None = prim::Constant()
%ret = aten::fft_irfft(%self, %n, %dim, %norm)
%cloned = aten::clone(%ret, %bias)
return (%cloned)
)IR";
auto self0 = at::rand({6, 6, 6});
auto n0 = 1;
auto dim0 = 1;
auto norm0 = "forward";
std::vector<IValue> args{self0, n0, dim0, norm0};
testStaticRuntime(
script,
args,
{},
/*use_allclose=*/false,
/*use_equalnan=*/false,
/*check_resize=*/true);
auto self1 = at::rand({22, 22, 22});
auto n1 = 1;
auto dim1 = 1;
auto norm1 = "forward";
std::vector<IValue> args2{self1, n1, dim1, norm1};
testStaticRuntime(
script,
args,
args2,
/*use_allclose=*/false,
/*use_equalnan=*/false,
/*check_resize=*/true);
}
TEST(StaticRuntime, autogen_fft_hfft) {
const std::string script = R"IR(
graph(%self: Tensor, %n: int?, %dim: int, %norm: str?):
%bias: None = prim::Constant()
%ret = aten::fft_hfft(%self, %n, %dim, %norm)
%cloned = aten::clone(%ret, %bias)
return (%cloned)
)IR";
auto self0 = at::rand({6, 6, 6});
auto n0 = 1;
auto dim0 = 1;
auto norm0 = "forward";
std::vector<IValue> args{self0, n0, dim0, norm0};
testStaticRuntime(
script,
args,
{},
/*use_allclose=*/false,
/*use_equalnan=*/false,
/*check_resize=*/true);
auto self1 = at::rand({22, 22, 22});
auto n1 = 1;
auto dim1 = 1;
auto norm1 = "forward";
std::vector<IValue> args2{self1, n1, dim1, norm1};
testStaticRuntime(
script,
args,
args2,
/*use_allclose=*/false,
/*use_equalnan=*/false,
/*check_resize=*/true);
}
TEST(StaticRuntime, autogen_fft_ihfft) {
const std::string script = R"IR(
graph(%self: Tensor, %n: int?, %dim: int, %norm: str?):
%bias: None = prim::Constant()
%ret = aten::fft_ihfft(%self, %n, %dim, %norm)
%cloned = aten::clone(%ret, %bias)
return (%cloned)
)IR";
auto self0 = at::rand({6, 6, 6});
auto n0 = 1;
auto dim0 = 1;
auto norm0 = "forward";
std::vector<IValue> args{self0, n0, dim0, norm0};
testStaticRuntime(
script,
args,
{},
/*use_allclose=*/false,
/*use_equalnan=*/false,
/*check_resize=*/true);
auto self1 = at::rand({22, 22, 22});
auto n1 = 1;
auto dim1 = 1;
auto norm1 = "forward";
std::vector<IValue> args2{self1, n1, dim1, norm1};
testStaticRuntime(
script,
args,
args2,
/*use_allclose=*/false,
/*use_equalnan=*/false,
/*check_resize=*/true);
}
TEST(StaticRuntime, autogen_linalg_cross) {
const std::string script = R"IR(
graph(%self: Tensor, %other: Tensor, %dim: int):

View File

@ -827,6 +827,7 @@ libtorch_python_core_sources = [
"torch/csrc/dynamo/guards.cpp",
"torch/csrc/dynamo/init.cpp",
"torch/csrc/functorch/init.cpp",
"torch/csrc/fx/node.cpp",
"torch/csrc/mps/Module.cpp",
"torch/csrc/mtia/Module.cpp",
"torch/csrc/inductor/aoti_runner/pybind.cpp",

View File

@ -1,186 +0,0 @@
#include "caffe2/perfkernels/adagrad.h"
#include <cmath>
#include "caffe2/perfkernels/common.h"
namespace caffe2 {
void adagrad_update__base(
int N,
const float* w,
const float* g,
const float* h,
float* nw,
float* nh,
float epsilon,
float decay,
const float lr,
const float weight_decay = 0.f) {
internal::adagrad_update_base_inlined(
N, w, g, h, nw, nh, decay, epsilon, lr, weight_decay);
}
void adagrad_update_prefetch__base(
int N,
const float* w,
const float* /* w_n */, // prefetch ptr
const float* g,
const float* h,
const float* /* h_n */, // prefetch ptr
float* nw,
float* /* nw_n */, // prefetch ptr
float* nh,
float* /* nh_n */, // prefetch ptr
float epsilon,
float lr,
float weight_decay = 0.f) {
adagrad_update__base(N, w, g, h, nw, nh, epsilon, 1.0f, lr, weight_decay);
}
void adagrad_fp16_update_prefetch__base(
int N,
const at::Half* w,
const at::Half* /* w_n */, // prefetch ptr
const float* g,
const at::Half* h,
const at::Half* /* h_n */, // prefetch ptr
at::Half* nw,
at::Half* /* nw_n */, // prefetch ptr
at::Half* nh,
at::Half* /* nh_n */, // prefetch ptr
float epsilon,
float lr,
float weight_decay = 0.f) {
internal::adagrad_update_base_inlined(
N, w, g, h, nw, nh, 1.0f, epsilon, lr, weight_decay);
}
// version without prefetching
decltype(adagrad_update__base) adagrad_update__avx2_fma;
decltype(adagrad_update__base) adagrad_update__avx512;
void adagrad_update(
int N,
const float* w,
const float* g,
const float* h,
float* nw,
float* nh,
float epsilon,
float decay,
float lr,
float weight_decay) {
AVX512_DO(adagrad_update, N, w, g, h, nw, nh, epsilon, decay, lr, weight_decay);
AVX2_FMA_DO(
adagrad_update, N, w, g, h, nw, nh, epsilon, decay, lr, weight_decay);
BASE_DO(adagrad_update, N, w, g, h, nw, nh, epsilon, decay, lr, weight_decay);
}
decltype(adagrad_update_prefetch__base) adagrad_update_prefetch__avx2_fma;
void adagrad_update_prefetch(
int N,
const float* w,
const float* w_n, // prefetch ptr
const float* g,
const float* h,
const float* h_n, // prefetch ptr
float* nw,
float* nw_n, // prefetch ptr
float* nh,
float* nh_n, // prefetch ptr
float epsilon,
float lr,
float weight_decay) {
AVX2_FMA_DO(
adagrad_update_prefetch,
N,
w,
w_n,
g,
h,
h_n,
nw,
nw_n,
nh,
nh_n,
epsilon,
lr,
weight_decay);
BASE_DO(
adagrad_update_prefetch,
N,
w,
w_n,
g,
h,
h_n,
nw,
nw_n,
nh,
nh_n,
epsilon,
lr,
weight_decay);
}
// Version with prefetching for embeddings and
// momentum using fp16
decltype(adagrad_fp16_update_prefetch__base)
adagrad_fp16_update_prefetch__avx2_fma;
void adagrad_fp16_update_prefetch(
int N,
const at::Half* w,
const at::Half* w_n, // prefetch ptr
const float* g,
const at::Half* h,
const at::Half* h_n, // prefetch ptr
at::Half* nw,
at::Half* nw_n, // prefetch ptr
at::Half* nh,
at::Half* nh_n, // prefetch ptr
float epsilon,
float lr,
float weight_decay) {
AVX2_FMA_DO(
adagrad_fp16_update_prefetch,
N,
w,
w_n,
g,
h,
h_n,
nw,
nw_n,
nh,
nh_n,
epsilon,
lr,
weight_decay);
BASE_DO(
adagrad_fp16_update_prefetch,
N,
w,
w_n,
g,
h,
h_n,
nw,
nw_n,
nh,
nh_n,
epsilon,
lr,
weight_decay);
}
} // namespace caffe2
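The deleted adagrad.cc above selects a kernel at runtime through the BASE_DO / AVX2_FMA_DO / AVX512_DO macros. Below is a self-contained sketch of that dispatch idiom with a dummy feature probe; the real macros live in caffe2/perfkernels/common.h and query the CPU differently.

#include <cstdio>

// Sketch of the dispatch idiom: each *_DO macro probes a CPU feature and, when
// present, forwards to the suffixed kernel and returns, so the base kernel only
// runs as the final fallback. The probe and macro bodies here are illustrative.
static bool cpu_has_avx2_fma() {
  return false;  // dummy probe; the real code checks CPU capabilities
}

#define AVX2_FMA_DO(func, ...)            \
  if (cpu_has_avx2_fma()) {               \
    return func##__avx2_fma(__VA_ARGS__); \
  }
#define BASE_DO(func, ...) return func##__base(__VA_ARGS__);

static void example_update__avx2_fma(int n) { std::printf("avx2+fma path, n=%d\n", n); }
static void example_update__base(int n) { std::printf("scalar path, n=%d\n", n); }

void example_update(int n) {
  AVX2_FMA_DO(example_update, n)
  BASE_DO(example_update, n)
}

int main() {
  example_update(64);
}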

View File

@ -1,205 +0,0 @@
#pragma once
#if defined(__AVX__) && !defined(__NVCC__) && \
(defined(__x86_64__) || defined(_M_X64) || defined(__i386__))
#define CAFFE2_PERFKERNELS_ADAGRAD_H_USE_INTRINSIC
#include <immintrin.h>
#endif
#include <c10/util/Half.h>
#include <c10/util/irange.h>
namespace caffe2 {
namespace internal {
// The following functions inside internal namespace are inlined because they
// are performance critical.
template <typename T>
static inline void adagrad_update_base_inlined(
int N,
const T* w,
const float* g,
const T* h,
T* nw,
T* nh,
float decay,
float epsilon,
float lr,
float weight_decay = 0.f) {
for (const auto i : c10::irange(N)) {
float gi = std::fma(weight_decay, w[i], g[i]);
float hi = decay * h[i] + gi * gi;
nh[i] = hi;
nw[i] = w[i] + lr * gi / (std::sqrt(hi) + epsilon);
}
}
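// In symbols, the loop above computes, for every element i:
//   g_i   = g[i] + weight_decay * w[i]                 (via std::fma)
//   nh[i] = decay * h[i] + g_i^2                       (squared-gradient accumulator)
//   nw[i] = w[i] + lr * g_i / (sqrt(nh[i]) + epsilon)
// i.e. an Adagrad-style update of the accumulator and the weights.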
// version with prefetching
// TODO(msmelyan)
// The crux of the computation is computing a / (sqrt(b) + epsilon), where a
// and b are vectors and epsilon is very small (e.g., 10^-5) and does not
// change. Today it is computed with vector sqrt and vector divide SIMD
// instructions, which is slow. We can take advantage of the existing fast
// VRSQRTPS instruction, which computes approximate reciprocals of square roots
// of a vector and is about 6x faster than the vsqrt and vdiv combination.
// Since the addition of epsilon is only done to avoid division by zero, we
// approximate a / (sqrt(b) + epsilon) by a / sqrt(b + sqrt(epsilon)). If we do
// that, we can use VRSQRTPS instead. VRSQRTPS is not very accurate:
// specifically, for a test on random numbers between 0.1 and 1, the absolute
// error was about 10^-3 compared to the slower but more accurate combination
// of vsqrt and vdiv. Extend Marat's function with more NR iterations to get
// more accuracy for training.
// TODO(msmelyan)
// explore streaming stores, but need to have unique indices (deduplication)
inline void adagrad_update_prefetch_inlined(
int N,
const float* w,
#ifdef CAFFE2_PERFKERNELS_ADAGRAD_H_USE_INTRINSIC
const float* w_n, // prefetch ptr
#else
const float* /* unused */,
#endif
const float* g,
const float* h,
#ifdef CAFFE2_PERFKERNELS_ADAGRAD_H_USE_INTRINSIC
const float* h_n, // prefetch ptr
#else
const float* /* unused */,
#endif
float* nw,
#ifdef CAFFE2_PERFKERNELS_ADAGRAD_H_USE_INTRINSIC
float* nw_n, // prefetch ptr
#else
float* /* unused */,
#endif
float* nh,
#ifdef CAFFE2_PERFKERNELS_ADAGRAD_H_USE_INTRINSIC
float* nh_n, // prefetch ptr
#else
float* /* unused */,
#endif
float epsilon,
float lr,
float weight_decay = 0.f) {
auto i = 0;
#ifdef CAFFE2_PERFKERNELS_ADAGRAD_H_USE_INTRINSIC
constexpr int kSize = 8;
for (; i + kSize <= N; i += kSize) {
_mm_prefetch(reinterpret_cast<const char*>(&w_n[i]), _MM_HINT_T0);
_mm_prefetch(reinterpret_cast<const char*>(&h_n[i]), _MM_HINT_T0);
_mm_prefetch(reinterpret_cast<const char*>(&nw_n[i]), _MM_HINT_T0);
_mm_prefetch(reinterpret_cast<const char*>(&nh_n[i]), _MM_HINT_T0);
__m256 gi = _mm256_loadu_ps(g + i);
__m256 hi = _mm256_loadu_ps(h + i);
__m256 wi = _mm256_loadu_ps(w + i);
#ifdef __FMA__
gi = _mm256_fmadd_ps(_mm256_set1_ps(weight_decay), wi, gi);
#else
gi = _mm256_add_ps(_mm256_mul_ps(_mm256_set1_ps(weight_decay), wi), gi);
#endif
__m256 nhi = _mm256_add_ps(hi, _mm256_mul_ps(gi, gi));
_mm256_storeu_ps(nh + i, nhi);
__m256 vtmp = _mm256_div_ps(
_mm256_mul_ps(_mm256_set1_ps(lr), gi),
_mm256_add_ps(_mm256_sqrt_ps(nhi), _mm256_set1_ps(epsilon)));
_mm256_storeu_ps(nw + i, _mm256_add_ps(wi, vtmp));
}
#endif
adagrad_update_base_inlined(
N - i,
w + i,
g + i,
h + i,
nw + i,
nh + i,
1.0f,
epsilon,
lr,
weight_decay);
}
} // namespace internal
// version with prefetching
// TODO(msmelyan)
// The crux of the computation is computing a / (sqrt(b) + epsilon), where a
// and b are vectors and epsilon is very small (e.g., 10^-5) and does not
// change. Today it is computed with vector sqrt and vector divide SIMD
// instructions, which is slow. We can take advantage of the existing fast
// VRSQRTPS instruction, which computes approximate reciprocals of square roots
// of a vector and is about 6x faster than the vsqrt and vdiv combination.
// Since the addition of epsilon is only done to avoid division by zero, we
// approximate a / (sqrt(b) + epsilon) by a / sqrt(b + sqrt(epsilon)). If we do
// that, we can use VRSQRTPS instead. VRSQRTPS is not very accurate:
// specifically, for a test on random numbers between 0.1 and 1, the absolute
// error was about 10^-3 compared to the slower but more accurate combination
// of vsqrt and vdiv. Extend Marat's function with more NR iterations to get
// more accuracy for training.
// TODO(msmelyan)
// explore streaming stores, but need to have unique indices (deduplication)
void adagrad_update_prefetch(
int N,
const float* w,
const float* w_n, // prefetch ptr
const float* g,
const float* h,
const float* h_n, // prefetch ptr
float* nw,
float* nw_n, // prefetch ptr
float* nh,
float* nh_n, // prefetch ptr
float epsilon,
float lr,
float weight_decay = 0.f);
// Version with prefetching for embeddings and
// momentum using fp16
void adagrad_fp16_update_prefetch(
int N,
const at::Half* w,
const at::Half* w_n, // prefetch ptr
const float* g,
const at::Half* h,
const at::Half* h_n, // prefetch ptr
at::Half* nw,
at::Half* nw_n, // prefetch ptr
at::Half* nh,
at::Half* nh_n, // prefetch ptr
float epsilon,
float lr,
float weight_decay = 0.f);
// version without prefetching
void adagrad_update(
int N,
const float* w,
const float* g,
const float* h,
float* nw,
float* nh,
float epsilon,
float decay,
float lr,
float weight_decay = 0.f);
} // namespace caffe2
#ifdef CAFFE2_PERFKERNELS_ADAGRAD_H_USE_INTRINSIC
#undef CAFFE2_PERFKERNELS_ADAGRAD_H_USE_INTRINSIC
#endif
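The TODO in the comment above suggests refining the VRSQRTPS estimate with Newton-Raphson (NR) iterations. As a hedged scalar sketch of that refinement (not taken from the kernels above), one NR step for y ~= 1/sqrt(b) is y <- y * (1.5 - 0.5 * b * y * y), and each step roughly doubles the number of correct bits:

#include <cmath>
#include <cstdio>

// Scalar model of refining an approximate reciprocal square root; y0 stands in
// for the ~12-bit estimate an instruction like VRSQRTPS would produce.
static float refine_rsqrt(float b, float y0, int iters) {
  float y = y0;
  for (int i = 0; i < iters; ++i) {
    y = y * (1.5f - 0.5f * b * y * y);  // Newton-Raphson step for 1/sqrt(b)
  }
  return y;
}

int main() {
  const float b = 0.37f;
  const float crude = 1.0f / std::sqrt(b) * 1.01f;  // pretend ~1% initial error
  std::printf("exact=%f refined=%f\n", 1.0 / std::sqrt(b), refine_rsqrt(b, crude, 2));
  return 0;
}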


@ -1,125 +0,0 @@
#include "caffe2/perfkernels/adagrad.h"
#include "caffe2/perfkernels/cvtsh_ss_bugfix.h"
#include <emmintrin.h>
#include <immintrin.h>
namespace caffe2 {
// version without prefetching
void adagrad_update__avx2_fma(
int N,
const float* w,
const float* g,
const float* h,
float* nw,
float* nh,
float epsilon,
float decay,
float lr,
float weight_decay = 0.f) {
constexpr int kSize = 8;
auto i = 0;
for (; i + kSize <= N; i += kSize) {
__m256 gi = _mm256_loadu_ps(g + i);
__m256 hi = _mm256_loadu_ps(h + i);
__m256 wi = _mm256_loadu_ps(w + i);
gi = _mm256_fmadd_ps(_mm256_set1_ps(weight_decay), wi, gi);
__m256 nhi = _mm256_add_ps(
_mm256_mul_ps(_mm256_set1_ps(decay), hi), _mm256_mul_ps(gi, gi));
_mm256_storeu_ps(nh + i, nhi);
__m256 vtmp = _mm256_div_ps(
_mm256_mul_ps(_mm256_set1_ps(lr), gi),
_mm256_add_ps(_mm256_sqrt_ps(nhi), _mm256_set1_ps(epsilon)));
_mm256_storeu_ps(nw + i, _mm256_add_ps(wi, vtmp));
}
for (; i < N; ++i) {
float gi = std::fma(weight_decay, w[i], g[i]);
float hi = nh[i] = decay * h[i] + gi * gi;
nw[i] = w[i] + lr * gi / (std::sqrt(hi) + epsilon);
}
}
void adagrad_update_prefetch__avx2_fma(
int N,
const float* w,
const float* w_n, // prefetch ptr
const float* g,
const float* h,
const float* h_n, // prefetch ptr
float* nw,
float* nw_n, // prefetch ptr
float* nh,
float* nh_n, // prefetch ptr
float epsilon,
float lr,
float weight_decay = 0.f) {
internal::adagrad_update_prefetch_inlined(
N, w, w_n, g, h, h_n, nw, nw_n, nh, nh_n, epsilon, lr, weight_decay);
}
// Compute adagrad sparse, assumes embedding and momentum are at::Half
void adagrad_fp16_update_prefetch__avx2_fma(
int N,
const at::Half* w,
const at::Half* w_n, // prefetch ptr
const float* g,
const at::Half* h,
const at::Half* h_n, // prefetch ptr
at::Half* nw,
at::Half* nw_n, // prefetch ptr
at::Half* nh,
at::Half* nh_n, // prefetch ptr
float epsilon,
float lr,
float weight_decay = 0.f) {
constexpr int kSize = 8;
auto i = 0;
for (; i + kSize <= N; i += kSize) {
_mm_prefetch(reinterpret_cast<const char*>(&w_n[i]), _MM_HINT_T0);
_mm_prefetch(reinterpret_cast<const char*>(&h_n[i]), _MM_HINT_T0);
_mm_prefetch(reinterpret_cast<const char*>(&nw_n[i]), _MM_HINT_T0);
_mm_prefetch(reinterpret_cast<const char*>(&nh_n[i]), _MM_HINT_T0);
// only convert momentum and embedding, gradient is fp32
__m256 gi = _mm256_loadu_ps(g + i);
__m128i hhi = _mm_loadu_si128(reinterpret_cast<const __m128i*>(h + i));
__m256 hi = _mm256_cvtph_ps(hhi);
__m128i whi = _mm_loadu_si128(reinterpret_cast<const __m128i*>(w + i));
__m256 wi = _mm256_cvtph_ps(whi);
gi = _mm256_fmadd_ps(_mm256_set1_ps(weight_decay), wi, gi);
__m256 nhi = _mm256_add_ps(hi, _mm256_mul_ps(gi, gi));
__m128i nhhi = _mm256_cvtps_ph(nhi, 0);
_mm_storeu_si128(reinterpret_cast<__m128i*>(nh + i), nhhi);
__m256 vtmp = _mm256_div_ps(
_mm256_mul_ps(_mm256_set1_ps(lr), gi),
_mm256_add_ps(_mm256_sqrt_ps(nhi), _mm256_set1_ps(epsilon)));
__m256 nwi = _mm256_add_ps(wi, vtmp);
__m128i nhwi = _mm256_cvtps_ph(nwi, 0);
_mm_storeu_si128(reinterpret_cast<__m128i*>(nw + i), nhwi);
}
for (; i < N; ++i) {
float gi = std::fma(
weight_decay,
_cvtsh_ss(reinterpret_cast<const unsigned short*>(w)[i]),
g[i]);
float nhi =
_cvtsh_ss(reinterpret_cast<const unsigned short*>(h)[i]) + gi * gi;
reinterpret_cast<unsigned short*>(nh)[i] = _cvtss_sh(nhi, 0);
float nwi = _cvtsh_ss(reinterpret_cast<const unsigned short*>(w)[i]) +
lr * gi / (std::sqrt(nhi) + epsilon);
reinterpret_cast<unsigned short*>(nw)[i] = _cvtss_sh(nwi, 0);
}
}
} // namespace caffe2


@ -1,45 +0,0 @@
#include "caffe2/perfkernels/adagrad.h"
#include "caffe2/perfkernels/cvtsh_ss_bugfix.h"
#include <emmintrin.h>
#include <immintrin.h>
namespace caffe2 {
// version without prefetching
void adagrad_update__avx512(
int N,
const float* w,
const float* g,
const float* h,
float* nw,
float* nh,
float epsilon,
float decay,
float lr,
float weight_decay = 0.f) {
constexpr int kSize = 16;
auto i = 0;
for (; i + kSize <= N; i += kSize) {
__m512 gi = _mm512_loadu_ps(g + i);
__m512 hi = _mm512_loadu_ps(h + i);
__m512 wi = _mm512_loadu_ps(w + i);
gi = _mm512_fmadd_ps(_mm512_set1_ps(weight_decay), wi, gi);
__m512 nhi = _mm512_add_ps(
_mm512_mul_ps(_mm512_set1_ps(decay), hi), _mm512_mul_ps(gi, gi));
_mm512_storeu_ps(nh + i, nhi);
__m512 vtmp = _mm512_div_ps(
_mm512_mul_ps(_mm512_set1_ps(lr), gi),
_mm512_add_ps(_mm512_sqrt_ps(nhi), _mm512_set1_ps(epsilon)));
_mm512_storeu_ps(nw + i, _mm512_add_ps(wi, vtmp));
}
for (; i < N; ++i) {
float gi = std::fma(weight_decay, w[i], g[i]);
float hi = nh[i] = decay * h[i] + gi * gi;
nw[i] = w[i] + lr * gi / (std::sqrt(hi) + epsilon);
}
}
} // namespace caffe2


@ -1,113 +0,0 @@
#include "caffe2/perfkernels/common.h"
#include <algorithm>
#include <cstdint>
#include <cmath>
namespace caffe2 {
namespace {
template <typename T>
void BoxCoxNaive(
std::size_t N,
std::size_t D,
const T* data_ptr,
const T* __restrict lambda1_ptr,
const T* __restrict lambda2_ptr,
T* output_ptr) {
constexpr T k_eps = static_cast<T>(1e-6);
for (std::size_t i = 0; i < N; i++) {
for (std::size_t j = 0; j < D; j++, data_ptr++, output_ptr++) {
T lambda1_v = lambda1_ptr[j];
T lambda2_v = lambda2_ptr[j];
T tmp = std::max(*data_ptr + lambda2_v, k_eps);
if (lambda1_v == 0) {
*output_ptr = std::log(tmp);
} else {
T lambda_1 = 1 / lambda1_v;
T pow = std::pow(tmp, lambda1_v);
*output_ptr = lambda_1 * pow - lambda_1;
}
}
}
}
}
#if defined(CAFFE2_PERF_WITH_AVX2) && defined(CAFFE2_PERF_USE_MKL)
namespace details {
template <typename T>
void compute_batch_box_cox__avx2_fma(
std::size_t N,
std::size_t D,
std::size_t block_size,
const T* data_ptr,
const T* __restrict lambda1_ptr,
const T* __restrict lambda2_ptr,
T* output_ptr);
extern template
void compute_batch_box_cox__avx2_fma<float>(
std::size_t N,
std::size_t D,
std::size_t block_size,
const float* self_data,
const float* __restrict lambda1_data,
const float* __restrict lambda2_data,
float* output_data);
extern template
void compute_batch_box_cox__avx2_fma<double>(
std::size_t N,
std::size_t D,
std::size_t block_size,
const double* self_data,
const double* __restrict lambda1_data,
const double* __restrict lambda2_data,
double* output_data);
} // namespace details
#endif
template <typename T>
void compute_batch_box_cox(
std::size_t N,
std::size_t D,
std::size_t block_size,
const T* data,
const T* lambda1_data,
const T* lambda2_data,
T* output_data) {
#ifdef CAFFE2_PERF_WITH_AVX2
AVX2_FMA_DO(
details::compute_batch_box_cox,
N,
D,
block_size,
data,
lambda1_data,
lambda2_data,
output_data);
#endif
BoxCoxNaive<T>(N, D, data, lambda1_data, lambda2_data, output_data);
}
template void compute_batch_box_cox<float>(
std::size_t N,
std::size_t D,
std::size_t block_size,
const float* data,
const float* lambda1_data,
const float* lambda2_data,
float* output_data);
template void compute_batch_box_cox<double>(
std::size_t N,
std::size_t D,
std::size_t block_size,
const double* data,
const double* lambda1_data,
const double* lambda2_data,
double* output_data);
} // namespace caffe2
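Restated in math, BoxCoxNaive above applies the two-parameter Box-Cox transform elementwise to each entry x with per-column parameters lambda1 and lambda2, clamping the shifted input at epsilon = 1e-6:

$$
y = \begin{cases}
  \ln\big(\max(x + \lambda_2,\ \varepsilon)\big) & \text{if } \lambda_1 = 0 \\
  \dfrac{\max(x + \lambda_2,\ \varepsilon)^{\lambda_1} - 1}{\lambda_1} & \text{if } \lambda_1 \neq 0
\end{cases}
$$

The AVX2/MKL path that follows computes the same quantity; it only reorders the work (separating zero-lambda1 and nonzero-lambda1 columns) for vectorization.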


@ -1,35 +0,0 @@
// Implements the BoxCox operator for CPU
#pragma once
#include <cstdint>
namespace caffe2 {
template <typename T>
void compute_batch_box_cox(
std::size_t N,
std::size_t D,
std::size_t block_size,
const T* self_data,
const T* lambda1_data,
const T* lambda2_data,
T* output_data);
extern template void compute_batch_box_cox<float>(
std::size_t N,
std::size_t D,
std::size_t block_size,
const float* data,
const float* lambda1_data,
const float* lambda2_data,
float* output_data);
extern template void compute_batch_box_cox<double>(
std::size_t N,
std::size_t D,
std::size_t block_size,
const double* data,
const double* lambda1_data,
const double* lambda2_data,
double* output_data);
} // namespace caffe2


@ -1,399 +0,0 @@
#include <immintrin.h>
#ifdef CAFFE2_PERF_USE_MKL
#include <c10/util/irange.h>
#include <caffe2/perfkernels/common.h>
#include <folly/SingletonThreadLocal.h>
#include "vectorizer.h"
// Enable compiler vectorized version only if numerical consistency is not
// required between dev and opt versions - disabled for now
#ifndef FAST_VECTORIZED_KERNEL
#define CPU_CAPABILITY_AVX2
#include <ATen/cpu/vec/vec.h>
namespace at::vec {
// Implements the vectorized version of the std::max() operation,
// which does NOT propagate NaN for the second argument
template <typename scalar_t>
Vectorized<scalar_t> max(const Vectorized<scalar_t>& a, const Vectorized<scalar_t>& b);
template <>
Vectorized<double> max(const Vectorized<double>& a, const Vectorized<double>& b) {
// std::max(NaN, nonNan) -> NaN
return _mm256_max_pd(b, a);
}
template <>
Vectorized<float> max(const Vectorized<float>& a, const Vectorized<float>& b) {
// std::max(NaN, nonNan) -> NaN
return _mm256_max_ps(b, a);
}
// Implements a reciprocal method based on the Newton-Raphson method
// 1. use the RCP approximation
// 2. refine with RCP = RCP * (2 - X * RCP)
template <typename scalar_t>
Vectorized<scalar_t> fast_recieprocal(const Vectorized<scalar_t>& b);
template <typename scalar_t>
scalar_t fast_recieprocal(scalar_t b);
template<>
Vectorized<float> fast_recieprocal(const Vectorized<float>& b) {
auto minus2 = _mm256_set1_ps(-2.f);
auto rcp = _mm256_rcp_ps(b);
rcp = _mm256_mul_ps(rcp, _mm256_fnmsub_ps(rcp, b, minus2));
rcp = _mm256_mul_ps(rcp, _mm256_fnmsub_ps(rcp, b, minus2));
return rcp;
}
template <>
float fast_recieprocal(float b) {
auto minus2 = _mm_set_ss(-2.f);
auto b_reg = _mm_set_ss(b);
auto rcp = _mm_rcp_ss(b_reg);
rcp = _mm_mul_ss(rcp, _mm_fnmsub_ss(rcp, b_reg, minus2));
rcp = _mm_mul_ss(rcp, _mm_fnmsub_ss(rcp, b_reg, minus2));
return _mm_cvtss_f32(rcp);
}
template<>
Vectorized<double> fast_recieprocal(const Vectorized<double>& b) {
return b.reciprocal();
}
template <>
double fast_recieprocal(double b) {
return 1./b;
}
}
#endif
#include <cstdint>
#include <cmath>
#include <vector>
#include <mkl.h>
namespace caffe2::details {
// MKL VML function templates.
template <typename T>
void PackV(const int N, const T* a, const int* ia, T* y);
template <typename T>
void UnpackV(const int N, const T* a, T* y, const int* iy);
#define DELEGATE_PACKV_FUNCTION(T, OriginalFunc) \
template <> \
void PackV<T>(const int N, const T* a, const int* ia, T* y) { \
OriginalFunc(N, a, ia, y); \
}
DELEGATE_PACKV_FUNCTION(float, vsPackV)
DELEGATE_PACKV_FUNCTION(double, vdPackV)
#undef DELEGATE_PACKV_FUNCTION
#define DELEGATE_UNPACKV_FUNCTION(T, OriginalFunc) \
template <> \
void UnpackV<T>(const int N, const T* a, T* y, const int* iy) { \
OriginalFunc(N, a, y, iy); \
}
DELEGATE_UNPACKV_FUNCTION(float, vsUnpackV)
DELEGATE_UNPACKV_FUNCTION(double, vdUnpackV)
#undef DELEGATE_UNPACKV_FUNCTION
#ifndef FAST_VECTORIZED_KERNEL
template <typename T>
void box_cox_zero_lambda(
size_t D,
const T* const self_data,
const T* const lambda2_data,
T k_eps,
T* const output_data) {
int j = 0;
using Vec = at::vec::Vectorized<T>;
constexpr int64_t VLEN = Vec::size();
auto k_eps_vec = Vec(k_eps);
for(; j + VLEN < D; j += VLEN) {
auto data = Vec::loadu(self_data + j);
auto lambda2 = Vec::loadu(lambda2_data + j);
auto sum = data + lambda2;
auto max = at::vec::max(sum, k_eps_vec);
auto res = max.log();
res.store(output_data + j);
}
for ( ;j < D; ++j) {
auto sum = self_data[j] + lambda2_data[j];
auto max = std::max(sum, k_eps);
output_data[j] = std::log(max);
}
}
template <typename T>
void box_cox_nonzero_lambda(
int64_t D,
const T* data_ptr,
const T* lambda1_ptr,
const T* lambda2_ptr,
T k_eps,
T* out) {
int j = 0;
using Vec = at::vec::Vectorized<T>;
constexpr int64_t VLEN = Vec::size();
auto k_eps_vec = Vec(k_eps);
for(; j + VLEN < D; j += VLEN) {
auto data = Vec::loadu(data_ptr + j);
auto lambda2 = Vec::loadu(lambda2_ptr + j);
auto sum = data + lambda2;
auto max = at::vec::max(sum, k_eps_vec);
auto lambda1 = Vec::loadu(lambda1_ptr + j);
auto lambda_over_1 = at::vec::fast_recieprocal(lambda1);
auto pow = max.pow(lambda1);
auto res = at::vec::fmsub(pow, lambda_over_1, lambda_over_1);
res.store(out + j);
}
for ( ;j < D; ++j) {
auto sum = data_ptr[j] + lambda2_ptr[j];
auto max = std::max(sum, k_eps);
auto lambda_over_1 = at::vec::fast_recieprocal(lambda1_ptr[j]);
auto pow = std::pow(max, lambda1_ptr[j]);
out[j] = pow * lambda_over_1 - lambda_over_1;
}
}
#else
template <typename T>
void box_cox_zero_lambda(
size_t D,
const T* const self_data,
const T* const lambda2_data,
T k_eps,
T* const output_data) {
VECTOR_LOOP for (auto j=0 ;j < D; ++j) {
auto sum = self_data[j] + lambda2_data[j];
auto max = std::max(sum, k_eps);
output_data[j] = std::log(max);
}
}
template <typename T>
void box_cox_nonzero_lambda(
int64_t D,
const T* data_ptr,
const T* lambda1_ptr,
const T* lambda2_ptr,
T k_eps,
T* out) {
VECTOR_LOOP for (auto j=0 ;j < D; ++j) {
FAST_MATH
auto sum = data_ptr[j] + lambda2_ptr[j];
auto max = std::max(sum, k_eps);
auto lamda1 = lambda1_ptr[j];
auto lambda_over_1 = 1 / lamda1;
if constexpr (std::is_same<T, float>::value) {
lambda_over_1 = lambda_over_1 * (T{2} - lambda_over_1 * lamda1);
lambda_over_1 = lambda_over_1 * (T{2} - lambda_over_1 * lamda1);
}
auto pow = std::pow(max, lamda1);
out[j] = pow * lambda_over_1 - lambda_over_1;
}
}
#endif
template <typename T>
void box_cox_mixed_lambda(
const T* const self_data,
const std::vector<int>& nonzeros,
const std::vector<int>& zeros,
const T* const lambda1,
const T* const lambda2,
const T* const lambda2_z_,
T k_eps,
T* const buffer,
T* const output_data) {
PackV(nonzeros.size(), self_data, nonzeros.data(), buffer);
box_cox_nonzero_lambda<T>(
nonzeros.size(), buffer, lambda1, lambda2, k_eps, buffer);
UnpackV(nonzeros.size(), buffer, output_data, nonzeros.data());
PackV(zeros.size(), self_data, zeros.data(), buffer);
box_cox_zero_lambda<T>(
zeros.size(), buffer, lambda2_z_, k_eps, buffer);
UnpackV(zeros.size(), buffer, output_data, zeros.data());
}
template <typename T>
void TileArrayIntoVector(
const T* const a,
const size_t D,
const int K,
std::vector<T>& b) {
b.resize(K * D);
for (const auto k : c10::irange(K)) {
std::copy(a, a + D, b.begin() + k * D);
}
}
void TileIndicesInPlace(std::vector<int>& v, const std::size_t D, const std::size_t K) {
auto n = v.size();
v.resize(K * n);
for (const auto k : c10::irange(1, K)) {
for (const auto j : c10::irange(n)) {
v[k * n + j] = v[j] + k * D;
}
}
}
template <typename T>
void compute_batch_box_cox__avx2_fma(
std::size_t N,
std::size_t D,
std::size_t block_size,
const T* self_data,
const T* __restrict lambda1_data,
const T* __restrict lambda2_data,
T* output_data) {
constexpr T k_eps = static_cast<T>(1e-6);
FOLLY_DECLARE_REUSED(zeros, std::vector<int>);
FOLLY_DECLARE_REUSED(nonzeros, std::vector<int>);
// Don't bother calling reserve; calls after the first will get a
// correctly-sized allocation anyway.
for (const auto j : c10::irange(D)) {
if (lambda1_data[j] == 0) {
zeros.push_back(j);
} else {
nonzeros.push_back(j);
}
}
// Process K rows at a time for effective vectorization with small rows.
const auto K = std::min(N, (block_size + D - 1) / D);
FOLLY_DECLARE_REUSED(lambda1_, std::vector<T>);
FOLLY_DECLARE_REUSED(lambda2_, std::vector<T>);
FOLLY_DECLARE_REUSED(lambda2_z_, std::vector<T>);
if (nonzeros.size() == D) {
// ((x + lambda2)^lambda1 - 1)/lambda1, if lambda1 != 0
size_t i = 0;
if (K > 1) {
TileArrayIntoVector(lambda1_data, D, K, lambda1_);
TileArrayIntoVector(lambda2_data, D, K, lambda2_);
DCHECK_EQ(K * D, lambda1_.size());
DCHECK_EQ(K * D, lambda2_.size());
for (; i < N - K + 1; i += K, self_data += K * D, output_data += K * D) {
box_cox_nonzero_lambda<T>(
K * D,
self_data,
lambda1_.data(),
lambda2_.data(),
k_eps,
output_data);
}
}
for (; i < N; i++, self_data += D, output_data += D) {
box_cox_nonzero_lambda<T>(
D, self_data, lambda1_data, lambda2_data, k_eps, output_data);
}
} else if (zeros.size() == D) {
// ln(x + lambda2), if lambda1 == 0
size_t i = 0;
if (K > 1) {
TileArrayIntoVector(lambda2_data, D, K, lambda2_z_);
DCHECK_EQ(K * D, lambda2_z_.size());
for (; i < N - K + 1; i += K, self_data += K * D, output_data += K * D) {
box_cox_zero_lambda<T>(
K * D, self_data, lambda2_z_.data(), k_eps, output_data);
}
}
for (; i < N; i++, self_data += D, output_data += D) {
box_cox_zero_lambda<T>(
D, self_data, lambda2_data, k_eps, output_data);
}
} else {
// mix zeros and nonzeros
const size_t n = nonzeros.size();
if (K > 1) {
TileIndicesInPlace(nonzeros, 0, K);
TileIndicesInPlace(zeros, 0, K);
}
FOLLY_DECLARE_REUSED(buffer, std::vector<T>);
buffer.resize(std::max(nonzeros.size(), zeros.size()));
lambda1_.resize(nonzeros.size());
lambda2_.resize(nonzeros.size());
lambda2_z_.resize(zeros.size());
PackV(nonzeros.size(), lambda1_data, nonzeros.data(), lambda1_.data());
PackV(nonzeros.size(), lambda2_data, nonzeros.data(), lambda2_.data());
PackV(zeros.size(), lambda2_data, zeros.data(), lambda2_z_.data());
size_t i = 0;
if (K > 1) {
// Truncate to original size, and re-tile with offsets this time.
nonzeros.resize(n);
DCHECK_GT(D, n);
zeros.resize(D - n);
TileIndicesInPlace(nonzeros, D, K);
TileIndicesInPlace(zeros, D, K);
DCHECK_EQ(nonzeros.size(), lambda1_.size());
DCHECK_EQ(nonzeros.size(), lambda2_.size());
DCHECK_EQ(zeros.size(), lambda2_z_.size());
for (; i < N - K + 1; i += K, self_data += K * D, output_data += K * D) {
box_cox_mixed_lambda<T>(
self_data,
nonzeros,
zeros,
lambda1_.data(),
lambda2_.data(),
lambda2_z_.data(),
k_eps,
buffer.data(),
output_data);
}
// Truncate to original size.
nonzeros.resize(n);
zeros.resize(D - n);
}
for (; i < N; i++, self_data += D, output_data += D) {
box_cox_mixed_lambda<T>(
self_data,
nonzeros,
zeros,
lambda1_.data(),
lambda2_.data(),
lambda2_z_.data(),
k_eps,
buffer.data(),
output_data);
}
}
}
template
void compute_batch_box_cox__avx2_fma<float>(
std::size_t N,
std::size_t D,
std::size_t block_size,
const float* self_data,
const float* __restrict lambda1_data,
const float* __restrict lambda2_data,
float* output_data);
template
void compute_batch_box_cox__avx2_fma<double>(
std::size_t N,
std::size_t D,
std::size_t block_size,
const double* self_data,
const double* __restrict lambda1_data,
const double* __restrict lambda2_data,
double* output_data);
} // namespace caffe2::details
#endif
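The fast reciprocal used above follows the Newton-Raphson refinement named in the comment near the top of this file (RCP = RCP * (2 - X * RCP)). A scalar sketch of that iteration, separate from the vectorized code above:

#include <cstdio>

// Given an estimate r ~= 1/x, each Newton-Raphson step computes r <- r * (2 - x*r),
// roughly doubling the number of correct digits.
static double refine_recip(double x, double r, int iters) {
  for (int i = 0; i < iters; ++i) {
    r = r * (2.0 - x * r);
  }
  return r;
}

int main() {
  const double x = 3.0;
  std::printf("exact=%.12f refined=%.12f\n", 1.0 / x, refine_recip(x, 0.3, 3));
  return 0;
}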


@ -1,75 +0,0 @@
#pragma once
// Apple clang was fixed in 8.1
#if defined(__apple_build_version__) && \
((__clang_major__ < 8) || \
((__clang_major__ == 8) && (__clang_minor__ < 1)))
#define CAFFE2_INTERNAL_APPLE_NEED_FIX 1
#endif
// Regular clang was fixed in 3.9
#if defined(__clang__) && (__clang_major__ < 4) && (__clang_minor__ < 9)
#define CAFFE2_INTERNAL_CLANG_NEED_FIX 1
#endif
#if defined(CAFFE2_INTERNAL_APPLE_NEED_FIX) || \
defined(CAFFE2_INTERNAL_CLANG_NEED_FIX)
#include <c10/util/Half.h>
#include <emmintrin.h>
// This version of clang has a bug where _cvtsh_ss is not defined; see
// https://reviews.llvm.org/D16177
static __inline float
__attribute__((__always_inline__, __nodebug__, __target__("f16c")))
_cvtsh_ss(unsigned short a) {
__v8hi v = {(short)a, 0, 0, 0, 0, 0, 0, 0};
__v4sf r = __builtin_ia32_vcvtph2ps(v);
return r[0];
}
static __inline unsigned short
__attribute__((__always_inline__, __nodebug__, __target__("f16c")))
_cvtss_sh(float a, int imm8) {
unsigned short ret;
*reinterpret_cast<at::Half*>(&ret) = a;
return ret;
}
#endif // CAFFE2_INTERNAL_APPLE_NEED_FIX || CAFFE2_INTERNAL_CLANG_NEED_FIX
#undef CAFFE2_INTERNAL_APPLE_NEED_FIX
#undef CAFFE2_INTERNAL_CLANG_NEED_FIX
#if defined(_MSC_VER) && !defined(__clang__)
#include <c10/util/Half.h>
#include <cstdint>
// Microsoft MSVC does not appear to have a _cvtsh_ss implementation, so
// we add a dummy version here.
static inline float _cvtsh_ss(unsigned short x) {
union {
std::uint32_t intval;
float floatval;
} t1;
std::uint32_t t2, t3;
t1.intval = x & 0x7fff; // Non-sign bits
t2 = x & 0x8000; // Sign bit
t3 = x & 0x7c00; // Exponent
t1.intval <<= 13; // Align mantissa on MSB
t2 <<= 16; // Shift sign bit into position
t1.intval += 0x38000000; // Adjust bias
t1.intval = (t3 == 0 ? 0 : t1.intval); // Denormals-as-zero
t1.intval |= t2; // Re-insert sign bit
return t1.floatval;
}
static inline unsigned short _cvtss_sh(float x, int imm8) {
unsigned short ret;
*reinterpret_cast<at::Half*>(&ret) = x;
return ret;
}
#endif // _MSC_VER
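The MSVC fallback above is a plain bit-level half-to-float conversion for normal values: a half is laid out as 1 sign, 5 exponent, and 10 mantissa bits, while a float is 1/8/23, so shifting the non-sign bits left by 13 aligns the mantissa under the float mantissa field, and adding

(127 - 15) << 23 = 112 * 2^23 = 0x38000000

rebiases the exponent from the half bias (15) to the float bias (127). The code also maps denormal halves to zero and re-inserts the sign bit, as the inline comments note.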


@ -1,211 +0,0 @@
#include "caffe2/perfkernels/fused_8bit_rowwise_embedding_lookup.h"
#include "caffe2/perfkernels/common.h"
#include <c10/util/Logging.h>
#include <c10/util/irange.h>
namespace caffe2 {
/**
* Base implementation does runtime dispatch for each segment of reduction
* @return false if there is an out-of-bound error
*/
template <
typename IndexType,
typename InType,
typename OutType,
bool IS_WEIGHT_POSITIONAL = false>
static bool Fused8BitRowwiseEmbeddingLookupGenericSlow(
const int64_t block_size,
const int64_t output_size,
const int64_t index_size,
const int64_t data_size,
const InType* input,
const IndexType* indices,
const int* lengths,
const float* weights, // optional, can be null for sum reducer
bool normalize_by_lengths,
OutType* out) {
// block_size is the number of elements and fused_block_size is the size of
// an entire row, including scale and bias.
const auto scale_bias_offset = 8 / sizeof(InType);
const int64_t fused_block_size = block_size + scale_bias_offset;
int64_t current = 0;
for (const auto m : c10::irange(output_size)) {
memset(out, 0, sizeof(OutType) * block_size);
if (current + lengths[m] > index_size) {
return false;
}
for (int i = 0; i < lengths[m]; ++i) {
int64_t idx = indices[current];
if (idx < 0 || idx >= data_size) {
return false;
}
#ifdef __GNUC__
if (current + 1 < index_size) {
__builtin_prefetch(
input + fused_block_size * indices[current + 1], 0, 1);
}
#endif // __GNUC__
const float* scale_bias = reinterpret_cast<const float*>(
input + fused_block_size * indices[current] + block_size);
float weight = 1.0f;
if (weights) {
weight = weights[IS_WEIGHT_POSITIONAL ? i : current];
}
const float scale = weight * scale_bias[0];
const float bias = weight * scale_bias[1];
for (const auto j : c10::irange(block_size)) {
out[j] += scale * input[fused_block_size * indices[current] + j] + bias;
}
++current;
}
if (normalize_by_lengths && lengths[m]) {
// NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-narrowing-conversions)
float scale = 1.f / lengths[m];
for (const auto j : c10::irange(block_size)) {
out[j] *= scale;
}
}
out += block_size;
}
return current == index_size;
}
// clang-format off
// Proxy back to generic implementation
#define FUSED_8BIT_ROWWISE_EMBEDDING_SPECIALIZATION(IndexType, OutType) \
bool \
Fused8BitRowwiseEmbeddingLookup_##IndexType##_uint8_t_##OutType##_false__base( \
const int64_t block_size, \
const int64_t output_size, \
const int64_t index_size, \
const int64_t data_size, \
const uint8_t* input, \
const IndexType* indices, \
const int* lengths, \
const float* weights, \
bool normalize_by_lengths, \
OutType* out) { \
return Fused8BitRowwiseEmbeddingLookupGenericSlow< \
IndexType, \
uint8_t, \
OutType, \
false>( \
block_size, \
output_size, \
index_size, \
data_size, \
input, \
indices, \
lengths, \
weights, \
normalize_by_lengths, \
out); \
} \
decltype( \
Fused8BitRowwiseEmbeddingLookup_##IndexType##_uint8_t_##OutType##_false__base) \
Fused8BitRowwiseEmbeddingLookup_##IndexType##_uint8_t_##OutType##_false__avx2_fma; \
bool Fused8BitRowwiseEmbeddingLookup_##IndexType##_uint8_t_##OutType( \
const int64_t block_size, \
const int64_t output_size, \
const int64_t index_size, \
const int64_t data_size, \
const uint8_t* input, \
const IndexType* indices, \
const int* lengths, \
const float* weights, \
bool normalize_by_lengths, \
OutType* out) { \
const int32_t one = 1; \
CAFFE_ENFORCE_EQ( \
reinterpret_cast<const uint8_t*>(&one)[0], \
1, \
"Fused8BitRowwiseEmbeddingLookup is not supported on this platform"); \
AVX2_FMA_DO( \
Fused8BitRowwiseEmbeddingLookup_##IndexType##_uint8_t_##OutType##_false, \
block_size, \
output_size, \
index_size, \
data_size, \
input, \
indices, \
lengths, \
weights, \
normalize_by_lengths, \
out); \
BASE_DO( \
Fused8BitRowwiseEmbeddingLookup_##IndexType##_uint8_t_##OutType##_false, \
block_size, \
output_size, \
index_size, \
data_size, \
input, \
indices, \
lengths, \
weights, \
normalize_by_lengths, \
out); \
} \
template <> \
void Fused8BitRowwiseEmbeddingLookup<IndexType, uint8_t, OutType, false>( \
const int64_t block_size, \
const int64_t output_size, \
const int64_t index_size, \
const int64_t data_size, \
const uint8_t* input, \
const IndexType* indices, \
const int* lengths, \
const float* weights, \
bool normalize_by_lengths, \
OutType* out) { \
bool success = \
Fused8BitRowwiseEmbeddingLookup_##IndexType##_uint8_t_##OutType( \
block_size, \
output_size, \
index_size, \
data_size, \
input, \
indices, \
lengths, \
weights, \
normalize_by_lengths, \
out); \
if (success) { \
return; \
} \
int64_t current = 0; \
for (int m = 0; m < output_size; ++m) { \
for (int i = 0; i < lengths[m]; ++i) { \
CAFFE_ENFORCE_LT(current, index_size); \
IndexType idx = indices[current]; \
CAFFE_ENFORCE( \
0 <= idx && idx < data_size, \
"Index ", \
current, \
" is out of bounds: ", \
idx, \
", range 0 to ", \
data_size); \
++current; \
} \
} \
CAFFE_ENFORCE_EQ( \
current, \
index_size, \
"Your input seems to be incorrect: the sum of lengths values should be " \
"the size of the indices tensor, but it appears not."); \
}
// clang-format on
FUSED_8BIT_ROWWISE_EMBEDDING_SPECIALIZATION(int32_t, float);
FUSED_8BIT_ROWWISE_EMBEDDING_SPECIALIZATION(int64_t, float);
#undef FUSED_8BIT_ROWWISE_EMBEDDING_SPECIALIZATION
} // namespace caffe2


@ -1,55 +0,0 @@
#pragma once
#include <cstdint>
namespace caffe2 {
/**
* Embedding lookup with reduction.
*
* `input` of size data_size * (block_size + 8B)
* `indices` of size index_size
* `lengths` of size output_size
* `weights` nullptr or array of size index_size
* `out` of size output_size * block_size
* sum(lengths[i]) == index_size
*
* Note that block_size should be the number of quantized values per row in the
* data, i.e. excluding the scale and bias. The total (fused) block size is
* assumed to be this block_size, plus 4 bytes for scale and 4 bytes for bias.
*
* Behavior is roughly equivalent to pseudocode:
*
* pos = 0
* fused_block_size = block_size + 8B // quantized values and scale and bias
* for (i = 0..output_size-1)
* for (k = 0..block_size-1)
* out[i*block_size + k] = 0
* for (j = 0..lengths[i]-1)
* for (k = 0..block_size-1)
* out[i*block_size + k] += input[indices[pos]*(fused_block_size) + k] *
* (weights ? weights[IS_WEIGHT_POSITIONAL ? j : pos] : 1.0)
* pos += 1
* if (normalize_weights && lengths[i] > 0)
* for (k = 0..block_size-1)
* out[i*block_size + k] /= lengths[i]
*
*/
template <
typename IndexType,
typename InType,
typename OutType,
bool IS_WEIGHT_POSITIONAL = false>
void Fused8BitRowwiseEmbeddingLookup(
const std::int64_t block_size,
const std::int64_t output_size,
const std::int64_t index_size,
const std::int64_t data_size,
const InType* input,
const IndexType* indices,
const int* lengths,
const float* weights, // optional, can be null for non-weighted sum
bool normalize_by_lengths,
OutType* out);
} // namespace caffe2
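A minimal sketch of driving this lookup, assuming the program links against the perfkernels translation unit above that defines the int32_t/uint8_t/float specialization; the table contents, sizes, and setup here are illustrative, not from the original code:

#include <cstdint>
#include <cstring>
#include <vector>
#include "caffe2/perfkernels/fused_8bit_rowwise_embedding_lookup.h"

int main() {
  const std::int64_t block_size = 4;           // quantized values per row
  const std::int64_t fused = block_size + 8;   // + 4B float scale + 4B float bias
  const std::int64_t data_size = 3, output_size = 2, index_size = 3;
  std::vector<std::uint8_t> table(data_size * fused, 0);
  for (std::int64_t r = 0; r < data_size; ++r) {
    const float scale = 0.1f, bias = 0.0f;
    std::memcpy(table.data() + r * fused + block_size, &scale, sizeof(scale));
    std::memcpy(table.data() + r * fused + block_size + 4, &bias, sizeof(bias));
  }
  std::vector<std::int32_t> indices = {0, 2, 1};
  std::vector<int> lengths = {2, 1};           // sum(lengths) == index_size
  std::vector<float> out(output_size * block_size);
  caffe2::Fused8BitRowwiseEmbeddingLookup(
      block_size, output_size, index_size, data_size,
      table.data(), indices.data(), lengths.data(),
      /*weights=*/nullptr, /*normalize_by_lengths=*/false, out.data());
  return 0;
}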


@ -1,213 +0,0 @@
#include "caffe2/perfkernels/fused_8bit_rowwise_embedding_lookup_idx.h"
#include "caffe2/perfkernels/common.h"
#include <c10/util/Logging.h>
#include <c10/util/irange.h>
namespace caffe2 {
/**
* Base implementation does runtime dispatch for each segment of reduction
* @return false if there is an out-of-bound error
*/
template <
typename IndexType,
typename InType,
typename OutType,
bool IS_WEIGHT_POSITIONAL = false>
static bool Fused8BitRowwiseEmbeddingLookupGenericSlowIdx(
const int64_t block_size,
const int64_t output_size,
const int64_t index_size,
const int64_t data_size,
const InType* input,
const IndexType* indices,
const IndexType* offsets,
const float* weights, // optional, can be null for sum reducer
bool normalize_by_lengths,
OutType* out) {
// block_size is the number of elements and fused_block_size is the size of
// an entire row, including scale and bias.
const auto scale_bias_offset = 8 / sizeof(InType);
const int64_t fused_block_size = block_size + scale_bias_offset;
int64_t current = 0;
for (const auto m : c10::irange(output_size)) {
memset(out, 0, sizeof(OutType) * block_size);
if (current != offsets[m] - offsets[0]) {
return false;
}
int64_t start_offset = offsets[m];
int64_t end_offset = offsets[m + 1];
int64_t length = end_offset - start_offset;
for (const auto i : c10::irange(start_offset, end_offset)) {
int64_t idx = indices[current];
if (idx < 0 || idx >= data_size) {
return false;
}
#ifdef __GNUC__
if (current + 1 < index_size) {
__builtin_prefetch(
input + fused_block_size * indices[current + 1], 0, 1);
}
#endif // __GNUC__
const float* scale_bias = reinterpret_cast<const float*>(
input + fused_block_size * indices[current] + block_size);
float weight = 1.0f;
if (weights) {
weight = weights[IS_WEIGHT_POSITIONAL ? i : current];
}
const float scale = weight * scale_bias[0];
const float bias = weight * scale_bias[1];
for (const auto j : c10::irange(block_size)) {
out[j] += scale * input[fused_block_size * indices[current] + j] + bias;
}
++current;
}
if (normalize_by_lengths && length) {
float scale = 1.f / length;
for (const auto j : c10::irange(block_size)) {
out[j] *= scale;
}
}
out += block_size;
}
return current == index_size;
}
// clang-format off
// Proxy back to generic implementation
#define FUSED_8BIT_ROWWISE_EMBEDDING_IDX_SPECIALIZATION(IndexType, OutType) \
bool \
Fused8BitRowwiseEmbeddingLookupIdx_##IndexType##_uint8_t_##OutType##_false__base( \
const int64_t block_size, \
const int64_t output_size, \
const int64_t index_size, \
const int64_t data_size, \
const uint8_t* input, \
const IndexType* indices, \
const IndexType* offsets, \
const float* weights, \
bool normalize_by_lengths, \
OutType* out) { \
return Fused8BitRowwiseEmbeddingLookupGenericSlowIdx< \
IndexType, \
uint8_t, \
OutType, \
false>( \
block_size, \
output_size, \
index_size, \
data_size, \
input, \
indices, \
offsets, \
weights, \
normalize_by_lengths, \
out); \
} \
decltype( \
Fused8BitRowwiseEmbeddingLookupIdx_##IndexType##_uint8_t_##OutType##_false__base) \
Fused8BitRowwiseEmbeddingLookupIdx_##IndexType##_uint8_t_##OutType##_false__avx2_fma; \
bool Fused8BitRowwiseEmbeddingLookupIdx_##IndexType##_uint8_t_##OutType( \
const int64_t block_size, \
const int64_t output_size, \
const int64_t index_size, \
const int64_t data_size, \
const uint8_t* input, \
const IndexType* indices, \
const IndexType* offsets, \
const float* weights, \
bool normalize_by_lengths, \
OutType* out) { \
const int32_t one = 1; \
CAFFE_ENFORCE_EQ( \
reinterpret_cast<const uint8_t*>(&one)[0], \
1, \
"Fused8BitRowwiseEmbeddingLookup is not supported on this platform"); \
AVX2_FMA_DO( \
Fused8BitRowwiseEmbeddingLookupIdx_##IndexType##_uint8_t_##OutType##_false, \
block_size, \
output_size, \
index_size, \
data_size, \
input, \
indices, \
offsets, \
weights, \
normalize_by_lengths, \
out); \
BASE_DO( \
Fused8BitRowwiseEmbeddingLookupIdx_##IndexType##_uint8_t_##OutType##_false, \
block_size, \
output_size, \
index_size, \
data_size, \
input, \
indices, \
offsets, \
weights, \
normalize_by_lengths, \
out); \
} \
template <> \
void Fused8BitRowwiseEmbeddingLookupIdx<IndexType, uint8_t, OutType, false>( \
const int64_t block_size, \
const int64_t output_size, \
const int64_t index_size, \
const int64_t data_size, \
const uint8_t* input, \
const IndexType* indices, \
const IndexType* offsets, \
const float* weights, \
bool normalize_by_lengths, \
OutType* out) { \
bool success = \
Fused8BitRowwiseEmbeddingLookupIdx_##IndexType##_uint8_t_##OutType( \
block_size, \
output_size, \
index_size, \
data_size, \
input, \
indices, \
offsets, \
weights, \
normalize_by_lengths, \
out); \
if (success) { \
return; \
} \
int64_t current = 0; \
for (int m = 0; m < output_size; ++m) { \
for (int64_t i = offsets[m]; i < offsets[m + 1]; ++i) { \
CAFFE_ENFORCE_LT(current, index_size); \
IndexType idx = indices[current]; \
CAFFE_ENFORCE( \
0 <= idx && idx < data_size, \
"Index ", \
current, \
" is out of bounds: ", \
idx, \
", range 0 to ", \
data_size); \
++current; \
} \
} \
CAFFE_ENFORCE_EQ( \
current, \
index_size, \
"Your input seems to be incorrect: the sum of lengths values should be " \
"the size of the indices tensor, but it appears not."); \
}
// clang-format on
FUSED_8BIT_ROWWISE_EMBEDDING_IDX_SPECIALIZATION(int32_t, float);
FUSED_8BIT_ROWWISE_EMBEDDING_IDX_SPECIALIZATION(int64_t, float);
#undef FUSED_8BIT_ROWWISE_EMBEDDING_IDX_SPECIALIZATION
} // namespace caffe2


@ -1,57 +0,0 @@
#pragma once
#include <cstdint>
namespace caffe2 {
/**
* Embedding lookup with reduction.
*
* `input` of size data_size * (block_size + 8B)
* `indices` of size index_size
* `offsets` of size output_size
* `weights` nullptr or array of size index_size
* `out` of size output_size * block_size
*
* Note that block_size should be the number of quantized values per row in the
* data, i.e. excluding the scale and bias. The total (fused) block size is
* assumed to be this block_size, plus 4 bytes for scale and 4 bytes for bias.
*
* Behavior is roughly equivalent to pseudocode:
*
* pos = 0
* fused_block_size = block_size + 8B // quantized values and scale and bias
* for (i = 0..output_size-1)
* for (k = 0..block_size-1)
* out[i*block_size + k] = 0
* start_offset = offsets[i]
* end_offset = i == output_size-1 ? index_size : offsets[i+1] - 1
* length = end_offset - start_offset
* for (j = start_offset..end_offset)
* for (k = 0..block_size-1)
* out[i*block_size + k] += input[indices[pos]*(fused_block_size) + k] *
* (weights ? weights[IS_WEIGHT_POSITIONAL ? j : pos] : 1.0)
* pos += 1
* if (normalize_weights && length > 0)
* for (k = 0..block_size-1)
* out[i*block_size + k] /= length
*
*/
template <
typename IndexType,
typename InType,
typename OutType,
bool IS_WEIGHT_POSITIONAL = false>
void Fused8BitRowwiseEmbeddingLookupIdx(
const std::int64_t block_size,
const std::int64_t output_size,
const std::int64_t index_size,
const std::int64_t data_size,
const InType* input,
const IndexType* indices,
const IndexType* offsets,
const float* weights, // optional, can be null for non-weighted sum
bool normalize_by_lengths,
OutType* out);
} // namespace caffe2
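Compared with the lengths-based variant, this version consumes CSR-style offsets; the base implementation above reads offsets[m + 1] for every output row, which implies output_size + 1 accessible entries with a terminal entry equal to index_size. A hypothetical helper (not part of the original code) for building such offsets from per-segment lengths:

#include <cstddef>
#include <cstdint>
#include <vector>

// Convert per-segment lengths into CSR-style offsets with a terminal entry,
// matching how the base kernel above walks offsets[m] .. offsets[m + 1].
std::vector<std::int64_t> lengths_to_offsets(const std::vector<int>& lengths) {
  std::vector<std::int64_t> offsets(lengths.size() + 1, 0);
  for (std::size_t i = 0; i < lengths.size(); ++i) {
    offsets[i + 1] = offsets[i] + lengths[i];
  }
  return offsets;  // e.g. {2, 1, 3} -> {0, 2, 3, 6}
}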


@ -1,214 +0,0 @@
#include "./fused_nbit_rowwise_conversion.h"
#include <c10/util/Half.h>
#include <algorithm>
#include <cmath>
#include "common.h"
#ifdef USE_FBGEMM
#include "fbgemm/QuantUtils.h"
#endif
namespace caffe2 {
void FloatToFused8BitRowwiseQuantized__base(
const float* input,
size_t input_rows,
int input_columns,
std::uint8_t* output) {
constexpr float kEpsilon = 1e-8f;
// NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-narrowing-conversions)
int output_columns = input_columns + 2 * sizeof(float);
for (std::size_t row = 0; row < input_rows; ++row) {
const float* input_row = input + row * input_columns;
std::uint8_t* output_row = output + row * output_columns;
float* output_row_scale_bias =
reinterpret_cast<float*>(output_row + input_columns);
float minimum_element =
*std::min_element(input_row, input_row + input_columns);
float maximum_element =
*std::max_element(input_row, input_row + input_columns);
float range = maximum_element - minimum_element;
output_row_scale_bias[0] = range / 255.0f;
output_row_scale_bias[1] = minimum_element;
const auto inverse_scale = 255.0f / (range + kEpsilon);
for (std::size_t col = 0; col < static_cast<size_t>(input_columns); ++col) {
output_row[col] =
std::lrintf((input_row[col] - minimum_element) * inverse_scale);
}
}
}
void Fused8BitRowwiseQuantizedToFloat__base(
const std::uint8_t* input,
size_t input_rows,
int input_columns,
float* output) {
// NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-narrowing-conversions)
int output_columns = input_columns - 2 * sizeof(float);
for (std::size_t row = 0; row < input_rows; ++row) {
const std::uint8_t* input_row = input + row * input_columns;
const float* input_row_scale_bias =
reinterpret_cast<const float*>(input_row + output_columns);
float* output_row = output + row * output_columns;
for (std::size_t col = 0; col < static_cast<std::size_t>(output_columns); ++col) {
output_row[col] =
// NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-narrowing-conversions)
input_row[col] * input_row_scale_bias[0] + input_row_scale_bias[1];
}
}
}
void FloatToFused8BitRowwiseQuantized(
const float* input,
size_t input_rows,
int input_columns,
std::uint8_t* output) {
#ifdef USE_FBGEMM
fbgemm::FloatOrHalfToFused8BitRowwiseQuantizedSBFloat<float>(
input, input_rows, input_columns, output);
#else
FloatToFused8BitRowwiseQuantized__base(
input, input_rows, input_columns, output);
#endif
}
void Fused8BitRowwiseQuantizedToFloat(
const std::uint8_t* input,
size_t input_rows,
int input_columns,
float* output) {
#ifdef USE_FBGEMM
fbgemm::Fused8BitRowwiseQuantizedSBFloatToFloatOrHalf<float>(
input, input_rows, input_columns, output);
#else
Fused8BitRowwiseQuantizedToFloat__base(
input, input_rows, input_columns, output);
#endif
}
void FloatToFusedNBitRowwiseQuantizedSBHalf__base(
int bit_rate,
const float* input,
size_t input_rows,
int input_columns,
std::uint8_t* output) {
int num_elem_per_byte = 8 / bit_rate;
int output_columns =
// NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-narrowing-conversions)
(input_columns + num_elem_per_byte - 1) / num_elem_per_byte +
2 * sizeof(at::Half);
for (std::size_t row = 0; row < input_rows; ++row) {
const float* input_row = input + row * input_columns;
std::uint8_t* output_row = output + row * output_columns;
at::Half* output_row_scale_bias = reinterpret_cast<at::Half*>(
output_row +
(input_columns + num_elem_per_byte - 1) / num_elem_per_byte);
float minimum_element =
*std::min_element(input_row, input_row + input_columns);
float maximum_element =
*std::max_element(input_row, input_row + input_columns);
minimum_element = static_cast<at::Half>(minimum_element);
const float range = maximum_element - minimum_element;
// NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-narrowing-conversions)
at::Half scale = range == 0 ? 1.0f : range / ((1 << bit_rate) - 1);
if (scale == 0) {
// Corner case handling when maximum_element == minimum_element
// Any scale would work because X - minimum_element will be 0 for all X
scale = 1.0f;
}
float inverse_scale = 1.0f / scale;
if (std::isinf(inverse_scale)) {
scale = 1.0f;
inverse_scale = 1.0f;
}
output_row_scale_bias[0] = scale;
output_row_scale_bias[1] = minimum_element;
for (std::size_t col = 0; col < static_cast<size_t>(input_columns); ++col) {
float X = input_row[col];
std::uint8_t quantized = std::max(
0,
std::min<int>(
std::lrintf((X - minimum_element) * inverse_scale),
(1 << bit_rate) - 1));
if (col % num_elem_per_byte == 0) {
output_row[col / num_elem_per_byte] = quantized;
} else {
output_row[col / num_elem_per_byte] |=
(quantized << ((col % num_elem_per_byte) * bit_rate));
}
}
}
}
void FusedNBitRowwiseQuantizedSBHalfToFloat__base(
int bit_rate,
const std::uint8_t* input,
size_t input_rows,
int input_columns,
float* output) {
int num_elem_per_byte = 8 / bit_rate;
int output_columns =
// NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-narrowing-conversions)
(input_columns - 2 * sizeof(at::Half)) * num_elem_per_byte;
for (std::size_t row = 0; row < static_cast<size_t>(input_rows); ++row) {
const std::uint8_t* input_row = input + row * input_columns;
const at::Half* input_row_scale_bias = reinterpret_cast<const at::Half*>(
input_row +
(output_columns + num_elem_per_byte - 1) / num_elem_per_byte);
float scale = input_row_scale_bias[0];
float bias = input_row_scale_bias[1];
float* output_row = output + row * output_columns;
for (std::size_t col = 0; col < static_cast<std::size_t>(output_columns); ++col) {
std::uint8_t quantized = input_row[col / num_elem_per_byte];
quantized >>= (col % num_elem_per_byte) * bit_rate;
quantized &= (1 << bit_rate) - 1;
// NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-narrowing-conversions)
output_row[col] = scale * quantized + bias;
}
}
}
void FloatToFusedNBitRowwiseQuantizedSBHalf(
int bit_rate,
const float* input,
size_t input_rows,
int input_columns,
std::uint8_t* output) {
#ifdef USE_FBGEMM
fbgemm::FloatOrHalfToFusedNBitRowwiseQuantizedSBHalf<float>(
bit_rate, input, input_rows, input_columns, output);
#else
FloatToFusedNBitRowwiseQuantizedSBHalf__base(
bit_rate, input, input_rows, input_columns, output);
#endif
}
void FusedNBitRowwiseQuantizedSBHalfToFloat(
int bit_rate,
const std::uint8_t* input,
size_t input_rows,
int input_columns,
float* output) {
#ifdef USE_FBGEMM
fbgemm::FusedNBitRowwiseQuantizedSBHalfToFloatOrHalf<float>(
bit_rate, input, input_rows, input_columns, output);
#else
FusedNBitRowwiseQuantizedSBHalfToFloat__base(
bit_rate, input, input_rows, input_columns, output);
#endif
}
} // namespace caffe2


@ -1,39 +0,0 @@
#pragma once
#include <cstddef>
#include <cstdint>
namespace caffe2 {
void FloatToFused8BitRowwiseQuantized(
const float* input,
size_t input_rows,
int input_columns,
std::uint8_t* output);
void Fused8BitRowwiseQuantizedToFloat(
const std::uint8_t* input,
size_t input_rows,
int input_columns,
float* output);
/**
* Row-wise quantization with fp16 scale and bias
*
* @param bit_rate can be 2, 4, or 8
*/
void FloatToFusedNBitRowwiseQuantizedSBHalf(
int bit_rate,
const float* input,
size_t input_rows,
int input_columns,
std::uint8_t* output);
void FusedNBitRowwiseQuantizedSBHalfToFloat(
int bit_rate,
const std::uint8_t* input,
size_t input_rows,
int input_columns,
float* output);
} // namespace caffe2
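A short round-trip sketch of the 8-bit entry points above: each fused row stores input_columns uint8 codes followed by a float scale and a float bias, so the fused row width is input_columns + 2*sizeof(float). The concrete values below are illustrative only:

#include <cstddef>
#include <cstdint>
#include <vector>
#include "caffe2/perfkernels/fused_nbit_rowwise_conversion.h"

int main() {
  const std::size_t rows = 1;
  const int cols = 4;
  const int fused_cols = cols + 2 * sizeof(float);  // codes + scale + bias
  std::vector<float> in = {0.0f, 0.25f, 0.5f, 1.0f};
  std::vector<std::uint8_t> fused(rows * fused_cols);
  std::vector<float> back(rows * cols);
  caffe2::FloatToFused8BitRowwiseQuantized(in.data(), rows, cols, fused.data());
  caffe2::Fused8BitRowwiseQuantizedToFloat(fused.data(), rows, fused_cols, back.data());
  // back now approximates in; with scale = range/255 the per-element error is
  // bounded by roughly one quantization step.
  return 0;
}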


@ -1,141 +0,0 @@
#pragma once
#include <string.h>
#include <cmath>
#include <cstdint>
#include "c10/util/irange.h"
#include "caffe2/utils/conversions.h"
#include "vectorizer.h"
namespace caffe2 {
namespace perfkernels {
namespace {
template <typename T>
inline T sigmoid(T x) {
return 1 / (1 + std::exp(-x));
}
template <typename T>
inline T host_tanh(T x) {
return 2 * sigmoid(2 * x) - 1;
}
template <typename T>
inline void LstmUnitImpl(
const int N,
const int D,
const int t,
const T* H_prev,
const T* C_prev,
const T* X,
const int32_t* seqLengths,
const bool drop_states,
T* C,
T* H,
const float forget_bias) {
const T forgetBias = convert::To<float, T>(forget_bias);
for (const auto n : c10::irange(N)) {
const bool valid = seqLengths == nullptr || t < seqLengths[n];
if (!valid) {
if (drop_states) {
memset(H, 0, sizeof(T) * D);
memset(C, 0, sizeof(T) * D);
} else {
memcpy(H, H_prev, sizeof(T) * D);
memcpy(C, C_prev, sizeof(T) * D);
}
} else {
const T* X_D = &X[D];
const T* X_2D = &X[2 * D];
const T* X_3D = &X[3 * D];
VECTOR_LOOP for (const auto d : c10::irange(D)) {
const T i = sigmoid(X[d]);
const T f = sigmoid(X_D[d] + forgetBias);
const T o = sigmoid(X_2D[d]);
const T g = host_tanh(X_3D[d]);
const T c_prev = C_prev[d];
const T c = f * c_prev + i * g;
C[d] = c;
const T host_tanh_c = host_tanh(c);
H[d] = o * host_tanh_c;
}
}
H_prev += D;
C_prev += D;
X += 4 * D;
C += D;
H += D;
}
}
template <typename T>
inline void LstmUnitGradientImpl(
int N,
int D,
int t,
const T* C_prev,
const T* X,
const int32_t* seqLengths,
const T* C,
const T* H,
const T* C_diff,
const T* H_diff,
bool drop_states,
T* H_prev_diff,
T* C_prev_diff,
T* X_diff,
const float forget_bias) {
const T localForgetBias = convert::To<float, T>(forget_bias);
for (const auto n : c10::irange(N)) {
const bool valid = seqLengths == nullptr || t < seqLengths[n];
if (!valid) {
if (drop_states) {
memset(C_prev_diff, 0, sizeof(T) * D);
memset(H_prev_diff, 0, sizeof(T) * D);
} else {
memcpy(H_prev_diff, H_diff, sizeof(T) * D);
memcpy(C_prev_diff, C_diff, sizeof(T) * D);
}
memset(X_diff, 0, 4 * sizeof(T) * D);
} else {
VECTOR_LOOP for (const auto d : c10::irange(D)) {
T* c_prev_diff = C_prev_diff + d;
T* h_prev_diff = H_prev_diff + d;
T* i_diff = X_diff + d;
T* f_diff = X_diff + 1 * D + d;
T* o_diff = X_diff + 2 * D + d;
T* g_diff = X_diff + 3 * D + d;
const T i = sigmoid(X[d]);
const T f = sigmoid(X[1 * D + d] + localForgetBias);
const T o = sigmoid(X[2 * D + d]);
const T g = host_tanh(X[3 * D + d]);
const T c_prev = C_prev[d];
const T c = C[d];
const T host_tanh_c = host_tanh(c);
const T c_term_diff =
C_diff[d] + H_diff[d] * o * (1 - host_tanh_c * host_tanh_c);
*c_prev_diff = c_term_diff * f;
*h_prev_diff = 0; // not used in 'valid' case
*i_diff = c_term_diff * g * i * (1 - i);
*f_diff = c_term_diff * c_prev * f * (1 - f);
*o_diff = H_diff[d] * host_tanh_c * o * (1 - o);
*g_diff = c_term_diff * i * (1 - g * g);
}
}
C_prev += D;
X += 4 * D;
C += D;
H += D;
C_diff += D;
H_diff += D;
X_diff += 4 * D;
H_prev_diff += D;
C_prev_diff += D;
}
}
} // namespace
} // namespace perfkernels
} // namespace caffe2
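For reference, the inner loop of LstmUnitImpl above computes the standard LSTM cell update, with the four gate pre-activations packed contiguously in X as [i, f, o, g] blocks of width D and the forget bias added to the f block:

$$
i = \sigma(x_i), \quad f = \sigma(x_f + b_f), \quad o = \sigma(x_o), \quad g = \tanh(x_g)
$$
$$
c_t = f \odot c_{t-1} + i \odot g, \qquad h_t = o \odot \tanh(c_t)
$$

LstmUnitGradientImpl is the elementwise backward pass of these same equations.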


@ -1,73 +0,0 @@
#pragma once
#include <cstdint>
namespace caffe2 {
namespace detail {
// Forward declaration of the templated LSTMUnit
// implementation
template <typename T>
void LstmUnitCpu(
const int N,
const int D,
const int t,
const T* H_prev,
const T* C_prev,
const T* X,
const int32_t* seqLengths,
const bool drop_states,
T* C,
T* H,
const float forget_bias);
// Forward specialization
extern template void LstmUnitCpu<float>(
const int N,
const int D,
const int t,
const float* H_prev,
const float* C_prev,
const float* X,
const int32_t* seqLengths,
const bool drop_states,
float* C,
float* H,
const float forget_bias);
template <typename T>
void LstmUnitGradientCpu(
int N,
int D,
int t,
const T* C_prev,
const T* X,
const int32_t* seqLengths,
const T* C,
const T* H,
const T* C_diff,
const T* H_diff,
bool drop_states,
T* H_prev_diff,
T* C_prev_diff,
T* X_diff,
const float forget_bias);
extern template void LstmUnitGradientCpu<float>(
int N,
int D,
int t,
const float* C_prev,
const float* X,
const int32_t* seqLengths,
const float* C,
const float* H,
const float* C_diff,
const float* H_diff,
bool drop_states,
float* H_prev_diff,
float* C_prev_diff,
float* X_diff,
const float forget_bias);
} // namespace detail
} // namespace caffe2


@ -1,123 +0,0 @@
#include "caffe2/perfkernels/lstm_unit_cpu-impl.h"
namespace caffe2 {
namespace perfkernels {
namespace {
// Explicit instantiation for float and AVX2 vectorization
template void LstmUnitImpl<float>(
const int N,
const int D,
const int t,
const float* H_prev,
const float* C_prev,
const float* X,
const int32_t* seqLengths,
const bool drop_states,
float* C,
float* H,
const float forget_bias);
template void LstmUnitGradientImpl<float>(
int N,
int D,
int t,
const float* C_prev,
const float* X,
const int32_t* seqLengths,
const float* C,
const float* H,
const float* C_diff,
const float* H_diff,
bool drop_states,
float* H_prev_diff,
float* C_prev_diff,
float* X_diff,
const float forget_bias);
} // namespace
// Define the templated implementation of LSTM kernels on CPU supporting AVX2
template <typename T>
void LstmUnitImpl__avx2_fma(
const int N,
const int D,
const int t,
const T* H_prev,
const T* C_prev,
const T* X,
const int32_t* seqLengths,
const bool drop_states,
T* C,
T* H,
const float forget_bias) {
LstmUnitImpl(
N, D, t, H_prev, C_prev, X, seqLengths, drop_states, C, H, forget_bias);
}
template <typename T>
void LstmUnitGradientImpl__avx2_fma(
int N,
int D,
int t,
const T* C_prev,
const T* X,
const int32_t* seqLengths,
const T* C,
const T* H,
const T* C_diff,
const T* H_diff,
bool drop_states,
T* H_prev_diff,
T* C_prev_diff,
T* X_diff,
const float forget_bias) {
LstmUnitGradientImpl(
N,
D,
t,
C_prev,
X,
seqLengths,
C,
H,
C_diff,
H_diff,
drop_states,
H_prev_diff,
C_prev_diff,
X_diff,
forget_bias);
}
// Explicit instantiation for float
template void LstmUnitImpl__avx2_fma<float>(
const int N,
const int D,
const int t,
const float* H_prev,
const float* C_prev,
const float* X,
const int32_t* seqLengths,
const bool drop_states,
float* C,
float* H,
const float forget_bias);
template void LstmUnitGradientImpl__avx2_fma<float>(
int N,
int D,
int t,
const float* C_prev,
const float* X,
const int32_t* seqLengths,
const float* C,
const float* H,
const float* C_diff,
const float* H_diff,
bool drop_states,
float* H_prev_diff,
float* C_prev_diff,
float* X_diff,
const float forget_bias);
} // namespace perfkernels
} // namespace caffe2


@ -1,125 +0,0 @@
#include "caffe2/perfkernels/lstm_unit_cpu_common.h"
#include "caffe2/perfkernels/common.h"
#include "caffe2/perfkernels/lstm_unit_cpu-impl.h"
namespace caffe2 {
namespace detail {
// Define the templated implementation of LSTM kernels on CPU
template <typename T>
void LstmUnitCpu(
const int N,
const int D,
const int t,
const T* H_prev,
const T* C_prev,
const T* X,
const int32_t* seqLengths,
const bool drop_states,
T* C,
T* H,
const float forget_bias) {
// Do CPU dispatching
AVX2_FMA_DO(
perfkernels::LstmUnitImpl,
N,
D,
t,
H_prev,
C_prev,
X,
seqLengths,
drop_states,
C,
H,
forget_bias);
perfkernels::LstmUnitImpl(
N, D, t, H_prev, C_prev, X, seqLengths, drop_states, C, H, forget_bias);
}
template <typename T>
void LstmUnitGradientCpu(
int N,
int D,
int t,
const T* C_prev,
const T* X,
const int32_t* seqLengths,
const T* C,
const T* H,
const T* C_diff,
const T* H_diff,
bool drop_states,
T* H_prev_diff,
T* C_prev_diff,
T* X_diff,
const float forget_bias) {
// Do CPU dispatching
AVX2_FMA_DO(
perfkernels::LstmUnitGradientImpl,
N,
D,
t,
C_prev,
X,
seqLengths,
C,
H,
C_diff,
H_diff,
drop_states,
H_prev_diff,
C_prev_diff,
X_diff,
forget_bias);
perfkernels::LstmUnitGradientImpl(
N,
D,
t,
C_prev,
X,
seqLengths,
C,
H,
C_diff,
H_diff,
drop_states,
H_prev_diff,
C_prev_diff,
X_diff,
forget_bias);
}
// Explicit instantiation for float
template void LstmUnitCpu<float>(
const int N,
const int D,
const int t,
const float* H_prev,
const float* C_prev,
const float* X,
const int32_t* seqLengths,
const bool drop_states,
float* C,
float* H,
const float forget_bias);
template void LstmUnitGradientCpu<float>(
int N,
int D,
int t,
const float* C_prev,
const float* X,
const int32_t* seqLengths,
const float* C,
const float* H,
const float* C_diff,
const float* H_diff,
bool drop_states,
float* H_prev_diff,
float* C_prev_diff,
float* X_diff,
const float forget_bias);
} // namespace detail
} // namespace caffe2


@ -1,71 +0,0 @@
#pragma once
#include <cstdint>
namespace caffe2 {
namespace perfkernels {
template <typename T>
void LstmUnitImpl__avx2_fma(
const int N,
const int D,
const int t,
const T* H_prev,
const T* C_prev,
const T* X,
const int32_t* seqLengths,
const bool drop_states,
T* C,
T* H,
const float forget_bias);
template <typename T>
void LstmUnitGradientImpl__avx2_fma(
int N,
int D,
int t,
const T* C_prev,
const T* X,
const int32_t* seqLengths,
const T* C,
const T* H,
const T* C_diff,
const T* H_diff,
bool drop_states,
T* H_prev_diff,
T* C_prev_diff,
T* X_diff,
const float forget_bias);
// Forward declaration of specialized functions
extern template void LstmUnitImpl__avx2_fma(
const int N,
const int D,
const int t,
const float* H_prev,
const float* C_prev,
const float* X,
const int32_t* seqLengths,
const bool drop_states,
float* C,
float* H,
const float forget_bias);
extern template void LstmUnitGradientImpl__avx2_fma(
int N,
int D,
int t,
const float* C_prev,
const float* X,
const int32_t* seqLengths,
const float* C,
const float* H,
const float* C_diff,
const float* H_diff,
bool drop_states,
float* H_prev_diff,
float* C_prev_diff,
float* X_diff,
const float forget_bias);
} // namespace perfkernels
} // namespace caffe2


@ -1,35 +0,0 @@
#pragma once
#include <cstdint>
namespace caffe2 {
namespace math {
// Returns the quantized and compressed values of floating-point inputs.
// The "fused" representation stores [bitwidth][tail][min][max] together with
// the quantized data in one array. Since we store 8/bitwidth quantized values
// in one byte, the last buckets of some bytes may have unused bits; in total,
// *tail* buckets are unused.
// We encode *bitwidth* and *tail* at the beginning, followed by 32-bit
// floating-point values representing min and max.
// | bitwidth | tail | min | max | ... int8 data ... |
// | 1B | 1B | 4B | 4B | ...output_data....|
// In output_data: the b-th bucket of the i-th byte stores
// the i-th value of the b-th segment of the input row.
void quantize_and_compress(
const float* input_data,
std::uint8_t* output_data,
std::uint64_t input_size,
std::uint64_t bitwidth,
bool random,
const float* random_buffer);
void decompress_and_dequantize(
const std::uint8_t* input_data,
float* output_data,
std::uint64_t input_size);
} // namespace math
} // namespace caffe2
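A sketch of the buffer-size bookkeeping implied by the layout comment above (a 10-byte header of bitwidth, tail, min, and max, followed by the packed codes at 8/bitwidth values per byte); the exact packing order remains defined by quantize_and_compress itself:

#include <cstdint>
#include <cstdio>

// Bytes needed for the fused output of quantize_and_compress, per the layout
// comment above. Assumes bitwidth is one of 1, 2, 4, 8.
static std::uint64_t fused_output_bytes(std::uint64_t input_size, std::uint64_t bitwidth) {
  const std::uint64_t values_per_byte = 8 / bitwidth;
  const std::uint64_t data_bytes =
      (input_size + values_per_byte - 1) / values_per_byte;  // ceil division
  return 1 + 1 + 4 + 4 + data_bytes;                         // header + codes
}

int main() {
  // 10 inputs at 2 bits each -> 3 code bytes, 13 bytes total (tail = 2 buckets unused).
  std::printf("%llu\n", static_cast<unsigned long long>(fused_output_bytes(10, 2)));
  return 0;
}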


@ -1,246 +0,0 @@
// Implements the math functions for CPU.
// The implementation in this file allows us to build the underlying numerical
// computation routines under different compiler options (-mno-avx2 or -mavx2).
#include <immintrin.h>
#include <cmath>
#include <cstdint>
#include <c10/util/irange.h>
using std::uint64_t;
using std::uint8_t;
namespace caffe2 {
namespace math {
static constexpr double QEPSILON = 1e-8;
void quantize_and_compress__avx2(
const float* input_data,
uint8_t* output_data,
uint64_t input_size,
uint64_t bitwidth,
bool random,
const float* random_buffer) {
__m256i shuffle_mask_v = _mm256_set_epi8(
// NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-avoid-magic-numbers,cppcoreguidelines-narrowing-conversions)
0xff,
// NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-avoid-magic-numbers,cppcoreguidelines-narrowing-conversions)
0xff,
// NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-avoid-magic-numbers,cppcoreguidelines-narrowing-conversions)
0xff,
// NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-avoid-magic-numbers,cppcoreguidelines-narrowing-conversions)
0xff,
// NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-avoid-magic-numbers,cppcoreguidelines-narrowing-conversions)
0xff,
// NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-avoid-magic-numbers,cppcoreguidelines-narrowing-conversions)
0xff,
// NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-avoid-magic-numbers,cppcoreguidelines-narrowing-conversions)
0xff,
// NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-avoid-magic-numbers,cppcoreguidelines-narrowing-conversions)
0xff,
// NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-avoid-magic-numbers,cppcoreguidelines-narrowing-conversions)
0xff,
// NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-avoid-magic-numbers,cppcoreguidelines-narrowing-conversions)
0xff,
// NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-avoid-magic-numbers,cppcoreguidelines-narrowing-conversions)
0xff,
// NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-avoid-magic-numbers,cppcoreguidelines-narrowing-conversions)
0xff,
0x0c,
0x08,
0x04,
0x00,
// NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-avoid-magic-numbers,cppcoreguidelines-narrowing-conversions)
0xff,
// NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-avoid-magic-numbers,cppcoreguidelines-narrowing-conversions)
0xff,
// NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-avoid-magic-numbers,cppcoreguidelines-narrowing-conversions)
0xff,
// NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-avoid-magic-numbers,cppcoreguidelines-narrowing-conversions)
0xff,
// NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-avoid-magic-numbers,cppcoreguidelines-narrowing-conversions)
0xff,
// NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-avoid-magic-numbers,cppcoreguidelines-narrowing-conversions)
0xff,
// NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-avoid-magic-numbers,cppcoreguidelines-narrowing-conversions)
0xff,
// NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-avoid-magic-numbers,cppcoreguidelines-narrowing-conversions)
0xff,
// NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-avoid-magic-numbers,cppcoreguidelines-narrowing-conversions)
0xff,
// NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-avoid-magic-numbers,cppcoreguidelines-narrowing-conversions)
0xff,
// NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-avoid-magic-numbers,cppcoreguidelines-narrowing-conversions)
0xff,
// NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-avoid-magic-numbers,cppcoreguidelines-narrowing-conversions)
0xff,
0x0c,
0x08,
0x04,
0x00);
__m256i permute_mask_v =
_mm256_set_epi32(0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x04, 0x00);
uint64_t data_per_byte = 8 / bitwidth;
uint64_t tail = input_size % data_per_byte;
tail = tail ? data_per_byte - tail : 0;
uint64_t segment_size = (input_size + data_per_byte - 1) / data_per_byte;
// basic info
float minimum_element = INFINITY, maximum_element = -INFINITY;
for (const auto i : c10::irange(input_size)) {
minimum_element =
(input_data[i] < minimum_element) ? input_data[i] : minimum_element;
maximum_element =
(input_data[i] > maximum_element) ? input_data[i] : maximum_element;
}
output_data[0] = bitwidth;
output_data[1] = tail;
reinterpret_cast<float*>(output_data + 2)[0] = minimum_element;
reinterpret_cast<float*>(output_data + 2)[1] = maximum_element;
// NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-narrowing-conversions)
float gap = (maximum_element - minimum_element) / ((1 << bitwidth) - 1.0f);
// NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-narrowing-conversions)
float gap_inverse = 1. / (gap + QEPSILON);
uint8_t max_q = (1 << bitwidth) - 1;
uint64_t bit_start = 0;
if (random) {
for (uint64_t start = 0; start < input_size; start += segment_size) {
uint64_t stride = start + segment_size <= input_size ? segment_size
: input_size - start;
uint64_t i = 0;
constexpr int VLEN = 8;
for (; i < stride / VLEN * VLEN; i += VLEN) {
__m256 r_v = _mm256_loadu_ps(&random_buffer[start + i]);
__m256 fval_v = _mm256_loadu_ps(input_data + start + i);
__m256 thetimes_v = _mm256_mul_ps(
_mm256_sub_ps(fval_v, _mm256_set1_ps(minimum_element)),
_mm256_set1_ps(gap_inverse));
__m256 rounded_v = _mm256_floor_ps(_mm256_add_ps(thetimes_v, r_v));
rounded_v = _mm256_max_ps(
_mm256_setzero_ps(),
_mm256_min_ps(_mm256_set1_ps(max_q), rounded_v));
__m256i qval_v = _mm256_cvtps_epi32(rounded_v);
__m256i orval_v = _mm256_cvtepu8_epi32(_mm_lddqu_si128(
reinterpret_cast<const __m128i*>(output_data + 10 + i)));
orval_v =
_mm256_or_si256(orval_v, _mm256_slli_epi32(qval_v, bit_start));
orval_v = _mm256_shuffle_epi8(orval_v, shuffle_mask_v);
orval_v = _mm256_permutevar8x32_epi32(orval_v, permute_mask_v);
*reinterpret_cast<int64_t*>(output_data + 10 + i) =
_mm256_extract_epi64(orval_v, 0);
}
for (; i < stride; ++i) {
float fval = input_data[start + i];
float thetimes = (fval - minimum_element) * gap_inverse;
float rounded = floor(thetimes + random_buffer[start + i]);
rounded = rounded < static_cast<float>(max_q)
? rounded
: static_cast<float>(max_q);
rounded = rounded > 0.0f ? rounded : 0.0f;
uint8_t qval = rounded;
uint8_t orval = output_data[10 + i];
output_data[10 + i] = orval | static_cast<uint8_t>(qval << bit_start);
}
bit_start += bitwidth;
}
} else {
// !random
for (uint64_t start = 0; start < input_size; start += segment_size) {
uint64_t stride = start + segment_size <= input_size ? segment_size
: input_size - start;
uint64_t i = 0;
constexpr int VLEN = 8;
for (; i < stride / VLEN * VLEN; i += VLEN) {
__m256 fval_v = _mm256_loadu_ps(input_data + start + i);
__m256 thetimes_v = _mm256_mul_ps(
_mm256_sub_ps(fval_v, _mm256_set1_ps(minimum_element)),
_mm256_set1_ps(gap_inverse));
thetimes_v = _mm256_max_ps(
_mm256_setzero_ps(),
_mm256_min_ps(_mm256_set1_ps(max_q), thetimes_v));
__m256i qval_v = _mm256_cvtps_epi32(_mm256_round_ps(
thetimes_v, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC));
__m256i orval_v = _mm256_cvtepu8_epi32(_mm_lddqu_si128(
reinterpret_cast<const __m128i*>(output_data + 10 + i)));
orval_v =
_mm256_or_si256(orval_v, _mm256_slli_epi32(qval_v, bit_start));
orval_v = _mm256_shuffle_epi8(orval_v, shuffle_mask_v);
orval_v = _mm256_permutevar8x32_epi32(orval_v, permute_mask_v);
*reinterpret_cast<int64_t*>(output_data + 10 + i) =
_mm256_extract_epi64(orval_v, 0);
}
for (; i < stride; ++i) {
float fval = input_data[start + i];
float thetimes = (fval - minimum_element) * gap_inverse;
thetimes = thetimes < static_cast<float>(max_q)
? thetimes
: static_cast<float>(max_q);
thetimes = thetimes > 0.0f ? thetimes : 0.0f;
uint8_t qval = nearbyint(thetimes);
uint8_t orval = output_data[10 + i];
output_data[10 + i] = orval | static_cast<uint8_t>(qval << bit_start);
}
bit_start += bitwidth;
}
} // !random
}
void decompress_and_dequantize__avx2(
const uint8_t* input_data,
float* output_data,
uint64_t input_size) {
// basic info
const float minimum_element =
reinterpret_cast<const float*>(input_data + 2)[0];
const float maximum_element =
reinterpret_cast<const float*>(input_data + 2)[1];
const uint64_t bitwidth = input_data[0];
const float gap =
// NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-narrowing-conversions)
(maximum_element - minimum_element) / ((1 << bitwidth) - 1.f) +
QEPSILON; // for exact recovery
const uint64_t tail = input_data[1];
const uint64_t output_size = (input_size - 10) * (8 / bitwidth) - tail;
// decoding
uint64_t bit_start = 0;
const uint64_t segment_size = input_size - 10;
for (uint64_t start = 0; start < output_size; start += segment_size) {
uint64_t stride = start + segment_size <= output_size ? segment_size
: output_size - start;
uint8_t mask = (1 << bitwidth) - 1;
uint64_t i = 0;
// Can process 8 elements at a time because we need to expand uint8_t
// to int32_t to use epi32 vector instructions.
constexpr int VLEN = 8;
for (; i < stride / VLEN * VLEN; i += VLEN) {
__m128i in_v = _mm_lddqu_si128(
reinterpret_cast<const __m128i*>(input_data + 10 + i));
__m256i out_epi32_v = _mm256_and_si256(
_mm256_srli_epi32(_mm256_cvtepu8_epi32(in_v), bit_start),
_mm256_set1_epi32(mask));
__m256 out_v = _mm256_fmadd_ps(
_mm256_cvtepi32_ps(out_epi32_v),
_mm256_set1_ps(gap),
_mm256_set1_ps(minimum_element));
_mm256_storeu_ps(output_data + start + i, out_v);
}
for (; i < stride; ++i) {
output_data[start + i] =
// NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-avoid-magic-numbers,cppcoreguidelines-narrowing-conversions)
((input_data[10 + i] >> bit_start) & mask) * gap + minimum_element;
}
bit_start += bitwidth;
}
}
} // namespace math
} // namespace caffe2


@@ -1,168 +0,0 @@
// Implements the math functions for CPU.
// The implementation in this file allows us to build the underlying numerical
// computation routines under different compiler options (-mno-avx2 or -mavx2).
#include <cfloat>
#include <cmath>
#include <cstdint>
#include "common.h"
// NOLINTNEXTLINE(modernize-deprecated-headers)
#include "math.h"
#include <c10/util/irange.h>
using std::uint64_t;
using std::uint8_t;
namespace caffe2 {
namespace math {
static constexpr double QEPSILON = 1e-8;
void quantize_and_compress__base(
const float* input_data,
uint8_t* output_data,
uint64_t input_size,
uint64_t bitwidth,
bool random,
const float* random_buffer) {
uint64_t data_per_byte = 8 / bitwidth;
uint64_t tail = input_size % data_per_byte;
tail = tail ? data_per_byte - tail : 0;
uint64_t segment_size = (input_size + data_per_byte - 1) / data_per_byte;
// basic info
float minimum_element = INFINITY, maximum_element = -INFINITY;
for (const auto i : c10::irange(input_size)) {
minimum_element =
input_data[i] < minimum_element ? input_data[i] : minimum_element;
maximum_element =
input_data[i] > maximum_element ? input_data[i] : maximum_element;
}
output_data[0] = bitwidth;
output_data[1] = tail;
reinterpret_cast<float*>(output_data + 2)[0] = minimum_element;
reinterpret_cast<float*>(output_data + 2)[1] = maximum_element;
// NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-narrowing-conversions)
float gap = (maximum_element - minimum_element) / ((1 << bitwidth) - 1.0f);
// NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-narrowing-conversions)
float gap_inverse = 1. / (gap + QEPSILON);
uint8_t max_q = (1 << bitwidth) - 1;
uint64_t bit_start = 0;
if (random) {
for (uint64_t start = 0; start < input_size; start += segment_size) {
uint64_t stride = start + segment_size <= input_size ? segment_size
: input_size - start;
uint64_t i = 0;
for (; i < stride; ++i) {
float fval = input_data[start + i];
float thetimes = (fval - minimum_element) * gap_inverse;
float rounded = floor(thetimes + random_buffer[start + i]);
rounded = rounded < static_cast<float>(max_q)
? rounded
: static_cast<float>(max_q);
rounded = rounded > 0.0f ? rounded : 0.0f;
uint8_t qval = rounded;
uint8_t orval = output_data[10 + i];
output_data[10 + i] = orval | static_cast<uint8_t>(qval << bit_start);
}
bit_start += bitwidth;
}
} else {
for (uint64_t start = 0; start < input_size; start += segment_size) {
uint64_t stride = start + segment_size <= input_size ? segment_size
: input_size - start;
uint64_t i = 0;
for (; i < stride; ++i) {
float fval = input_data[start + i];
float thetimes = (fval - minimum_element) * gap_inverse;
thetimes = thetimes < static_cast<float>(max_q)
? thetimes
: static_cast<float>(max_q);
thetimes = thetimes > 0.0f ? thetimes : 0.0f;
uint8_t qval = nearbyint(thetimes);
uint8_t orval = output_data[10 + i];
output_data[10 + i] = orval | static_cast<uint8_t>(qval << bit_start);
}
bit_start += bitwidth;
}
}
}
decltype(quantize_and_compress__base) quantize_and_compress__avx2;
void quantize_and_compress(
const float* input_data,
uint8_t* output_data,
uint64_t input_size,
uint64_t bitwidth,
bool random,
const float* random_buffer) {
AVX2_DO(
quantize_and_compress,
input_data,
output_data,
input_size,
bitwidth,
random,
random_buffer);
BASE_DO(
quantize_and_compress,
input_data,
output_data,
input_size,
bitwidth,
random,
random_buffer);
}
void decompress_and_dequantize__base(
const uint8_t* input_data,
float* output_data,
uint64_t input_size) {
// basic info
const float minimum_element =
reinterpret_cast<const float*>(input_data + 2)[0];
const float maximum_element =
reinterpret_cast<const float*>(input_data + 2)[1];
const uint64_t bitwidth = input_data[0];
const float gap =
// NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-narrowing-conversions)
(maximum_element - minimum_element) / ((1 << bitwidth) - 1.f) +
QEPSILON; // for exact recovery
const uint64_t tail = input_data[1];
const uint64_t output_size = (input_size - 10) * (8 / bitwidth) - tail;
// decoding
uint64_t bit_start = 0;
const uint64_t segment_size = input_size - 10;
for (uint64_t start = 0; start < output_size; start += segment_size) {
uint64_t stride = start + segment_size <= output_size ? segment_size
: output_size - start;
uint8_t mask = (1 << bitwidth) - 1;
uint64_t i = 0;
for (; i < stride; ++i) {
output_data[start + i] =
// NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-avoid-magic-numbers,cppcoreguidelines-narrowing-conversions)
((input_data[10 + i] >> bit_start) & mask) * gap + minimum_element;
}
bit_start += bitwidth;
}
}
decltype(decompress_and_dequantize__base) decompress_and_dequantize__avx2;
void decompress_and_dequantize(
const uint8_t* input_data,
float* output_data,
uint64_t input_size) {
AVX2_DO(decompress_and_dequantize, input_data, output_data, input_size);
BASE_DO(decompress_and_dequantize, input_data, output_data, input_size);
}
} // namespace math
} // namespace caffe2
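To make the scalar packing above concrete, consider a worked example, for illustration only: quantizing the inputs {0, 1, 2, 3} with bitwidth = 2 gives min = 0, max = 3, gap = 1 and quantized codes {0, 1, 2, 3}; data_per_byte = 4, segment_size = 1 and tail = 0, so all four codes share the single data byte at successive 2-bit offsets.

#include <cassert>
#include <cstdint>

int main() {
  const std::uint8_t codes[4] = {0, 1, 2, 3};  // quantized values, one per segment
  std::uint8_t packed = 0;
  for (int segment = 0; segment < 4; ++segment) {
    // Matches output_data[10 + i] |= qval << bit_start with bit_start = 2 * segment.
    packed |= static_cast<std::uint8_t>(codes[segment] << (2 * segment));
  }
  assert(packed == 0xE4);  // 0b 11 10 01 00
  return 0;
}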


@@ -1,88 +0,0 @@
#include <c10/util/Half.h>
#include "caffe2/perfkernels/typed_axpy.h"
#include "caffe2/perfkernels/common.h"
namespace caffe2 {
void TypedAxpy__base(int N, const float a, const float* x, float* y) {
for (int i = 0; i < N; ++i) {
y[i] += a * x[i];
}
}
decltype(TypedAxpy__base) TypedAxpy__avx2_fma;
decltype(TypedAxpy__base) TypedAxpy__avx_f16c;
template <>
void TypedAxpy<float, float>(int N, const float a, const float* x, float* y) {
AVX2_FMA_DO(TypedAxpy, N, a, x, y);
AVX_F16C_DO(TypedAxpy, N, a, x, y);
BASE_DO(TypedAxpy, N, a, x, y);
}
void TypedAxpyHalffloat__base(
int N,
const float a,
const at::Half* x,
float* y) {
for (int i = 0; i < N; ++i) {
// NOLINTNEXTLINE(cppcoreguidelines-pro-type-member-init)
union {
uint32_t intval;
float floatval;
} t1;
// NOLINTNEXTLINE(cppcoreguidelines-init-variables)
uint32_t t2, t3;
// NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers)
t1.intval = x[i].x & 0x7fff; // Non-sign bits
// NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers)
t2 = x[i].x & 0x8000; // Sign bit
// NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers)
t3 = x[i].x & 0x7c00; // Exponent
// NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers)
t1.intval <<= 13; // Align mantissa on MSB
// NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers)
t2 <<= 16; // Shift sign bit into position
// NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers)
t1.intval += 0x38000000; // Adjust bias
t1.intval = (t3 == 0 ? 0 : t1.intval); // Denormals-as-zero
t1.intval |= t2; // Re-insert sign bit
y[i] += t1.floatval * a;
}
}
decltype(TypedAxpyHalffloat__base) TypedAxpyHalffloat__avx2_fma;
decltype(TypedAxpyHalffloat__base) TypedAxpyHalffloat__avx_f16c;
template <>
void TypedAxpy<at::Half, float>(
int N,
const float a,
const at::Half* x,
float* y) {
AVX2_FMA_DO(TypedAxpyHalffloat, N, a, x, y);
AVX_F16C_DO(TypedAxpyHalffloat, N, a, x, y);
BASE_DO(TypedAxpyHalffloat, N, a, x, y);
}
void TypedAxpy_uint8_float__base(
int N,
const float a,
const std::uint8_t* x,
float* y) {
for (int i = 0; i < N; ++i) {
y[i] += (float)(x[i]) * a;
}
}
decltype(TypedAxpy_uint8_float__base) TypedAxpy_uint8_float__avx2_fma;
decltype(TypedAxpy_uint8_float__base) TypedAxpy_uint8_float__avx_f16c;
template <>
void TypedAxpy<std::uint8_t, float>(
int N,
const float a,
const std::uint8_t* x,
float* y) {
AVX2_FMA_DO(TypedAxpy_uint8_float, N, a, x, y);
BASE_DO(TypedAxpy_uint8_float, N, a, x, y);
}
} // namespace caffe2
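The bit manipulation in TypedAxpyHalffloat__base above is a branch-light fp16-to-fp32 conversion: shift the exponent and mantissa bits into fp32 position, rebase the exponent bias from 15 to 127 by adding 0x38000000, flush fp16 denormals to zero, and re-insert the sign bit. A standalone sketch that traces the same steps; the function name half_bits_to_float is illustrative, and memcpy stands in for the union purely to avoid type punning:

#include <cassert>
#include <cstdint>
#include <cstring>

inline float half_bits_to_float(std::uint16_t h) {
  std::uint32_t bits = static_cast<std::uint32_t>(h & 0x7fff) << 13;  // align mantissa/exponent
  const std::uint32_t sign = static_cast<std::uint32_t>(h & 0x8000) << 16;
  const std::uint32_t exponent = h & 0x7c00;
  bits += 0x38000000;                 // rebase exponent bias: 15 -> 127
  bits = (exponent == 0) ? 0 : bits;  // treat fp16 denormals (and zero) as zero
  bits |= sign;
  float out;
  std::memcpy(&out, &bits, sizeof(out));
  return out;
}

int main() {
  // fp16 1.0 is 0x3C00: (0x3C00 << 13) + 0x38000000 = 0x3F800000, i.e. 1.0f.
  assert(half_bits_to_float(0x3C00) == 1.0f);
  assert(half_bits_to_float(0xBC00) == -1.0f);  // sign bit re-inserted
  return 0;
}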


@@ -1,12 +0,0 @@
#pragma once
namespace caffe2 {
// Similar to Axpy, which computes y = a * x + y, but allows x and y to be
// of different data types.
// It also provides a performance optimization hint (use_a) indicating whether a
// is going to be 1 or not.
template <typename IN, typename OUT>
void TypedAxpy(int N, const OUT a, const IN* x, OUT* y);
} // namespace caffe2
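A minimal usage sketch for the declaration above, assuming the caffe2 perfkernels and c10 headers are on the include path; the function typed_axpy_example and the chosen values are illustrative only:

#include <c10/util/Half.h>
#include <vector>
#include "caffe2/perfkernels/typed_axpy.h"

void typed_axpy_example() {
  std::vector<at::Half> x(16, at::Half(2.0f));  // half-precision input
  std::vector<float> y(16, 1.0f);               // float accumulator
  // y[i] += 0.5f * x[i]; after the call every y[i] equals 2.0f.
  caffe2::TypedAxpy<at::Half, float>(
      static_cast<int>(x.size()), 0.5f, x.data(), y.data());
}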


@@ -1,68 +0,0 @@
#include "caffe2/perfkernels/cvtsh_ss_bugfix.h"
#include <c10/util/Half.h>
#include <emmintrin.h>
#include <immintrin.h>
namespace caffe2 {
void TypedAxpy__avx_f16c(int N, const float a, const float* x, float* y) {
int current = 0;
const int bound = (N % 8) ? N - 8 : N;
__m256 mma = _mm256_set1_ps(a);
// NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers)
for (; current < bound; current += 8) {
_mm256_storeu_ps(
y + current,
_mm256_add_ps(
_mm256_mul_ps(mma, _mm256_loadu_ps(x + current)),
_mm256_loadu_ps(y + current)));
}
if (bound != N) {
while (current < N) {
y[current] += x[current] * a;
++current;
}
}
}
void TypedAxpyHalffloat__avx_f16c(
int N,
const float a,
const at::Half* x,
float* y) {
// If x does not start at a 16-byte boundary, process the first few elements
// one at a time until we reach an aligned address.
// NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers)
while ((reinterpret_cast<unsigned long>(x) % 16) && N) {
*(y++) += _cvtsh_ss((*(x++)).x) * a;
--N;
}
// From here on we can do vectorized additions using __m256 (8 floats at a
// time); any remaining tail elements fall back to scalar _cvtsh_ss conversion.
__m256 mma = _mm256_set1_ps(a);
int current = 0;
const int bound = (N % 8) ? N - 8 : N;
// NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers)
for (; current < bound; current += 8) {
__m128i mmx_16 =
_mm_loadu_si128(reinterpret_cast<const __m128i*>(x + current));
__m256 mmx_32 = _mm256_cvtph_ps(mmx_16);
__m256 mmy_in = _mm256_loadu_ps(y + current);
__m256 mmmul = _mm256_mul_ps(mmx_32, mma);
__m256 mmy_out = _mm256_add_ps(mmmul, mmy_in);
_mm256_storeu_ps(y + current, mmy_out);
}
if (bound != N) {
while (current < N) {
y[current] += _cvtsh_ss(x[current].x) * a;
++current;
}
}
}
} // namespace caffe2


@@ -1,104 +0,0 @@
#include "caffe2/perfkernels/cvtsh_ss_bugfix.h"
#include <c10/util/Half.h>
#include <emmintrin.h>
#include <immintrin.h>
namespace caffe2 {
void TypedAxpy__avx2_fma(int N, const float a, const float* x, float* y) {
int current = 0;
const int bound = (N % 8) ? N - 8 : N;
__m256 mma = _mm256_set1_ps(a);
// NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers)
for (; current < bound; current += 8) {
_mm256_storeu_ps(
y + current,
_mm256_fmadd_ps(
mma, _mm256_loadu_ps(x + current), _mm256_loadu_ps(y + current)));
}
if (bound != N) {
while (current < N) {
y[current] += x[current] * a;
++current;
}
}
}
void TypedAxpyHalffloat__avx2_fma(
int N,
const float a,
const at::Half* x,
float* y) {
// If x does not start at a 16-byte boundary, process the first few elements
// one at a time until we reach an aligned address.
// NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers)
while ((reinterpret_cast<unsigned long>(x) % 16) && N) {
*(y++) += _cvtsh_ss((*(x++)).x) * a;
--N;
}
// From here on we can do vectorized additions using __m256 (8 floats at a
// time); any remaining tail elements fall back to scalar _cvtsh_ss conversion.
__m256 mma = _mm256_set1_ps(a);
int current = 0;
const int bound = (N % 8) ? N - 8 : N;
// NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers)
for (; current < bound; current += 8) {
__m128i mmx_16 =
_mm_loadu_si128(reinterpret_cast<const __m128i*>(x + current));
__m256 mmx_32 = _mm256_cvtph_ps(mmx_16);
__m256 mmy = _mm256_loadu_ps(y + current);
mmy = _mm256_fmadd_ps(mmx_32, mma, mmy);
_mm256_storeu_ps(y + current, mmy);
}
if (bound != N) {
while (current < N) {
y[current] += _cvtsh_ss(x[current].x) * a;
++current;
}
}
}
void TypedAxpy_uint8_float__avx2_fma(
int N,
const float a,
const std::uint8_t* x,
float* y) {
// If x does not start at a 16-byte boundary, process the first few elements
// one at a time until we reach an aligned address.
// NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers)
while ((reinterpret_cast<unsigned long>(x) % 16) && N) {
*(y++) += static_cast<float>(*(x++)) * a;
--N;
}
// From here on we can do vectorized additions using __m256 (8 floats at a
// time); any remaining tail elements fall back to scalar conversion.
__m256 mma = _mm256_set1_ps(a);
int current = 0;
const int bound = (N % 8) ? N - 8 : N;
// NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers)
for (; current < bound; current += 8) {
__m256i mmx_int32 = _mm256_cvtepi8_epi32(
_mm_loadu_si128(reinterpret_cast<const __m128i*>(x + current)));
__m256 mmx_fp32 = _mm256_cvtepi32_ps(mmx_int32);
__m256 mmy = _mm256_loadu_ps(y + current);
mmy = _mm256_fmadd_ps(mmx_fp32, mma, mmy);
_mm256_storeu_ps(y + current, mmy);
}
if (bound != N) {
while (current < N) {
y[current] += (float)(x[current]) * a;
++current;
}
}
}
} // namespace caffe2


@@ -1,28 +0,0 @@
#pragma once
#if (ENABLE_VECTORIZATION > 0) && !defined(_DEBUG) && !defined(DEBUG)
#if defined(__clang__) && (__clang_major__ > 7)
#define IS_SANITIZER \
((__has_feature(address_sanitizer) == 1) || \
(__has_feature(memory_sanitizer) == 1) || \
(__has_feature(thread_sanitizer) == 1) || \
(__has_feature(undefined_sanitizer) == 1))
#if IS_SANITIZER == 0
#define VECTOR_LOOP _Pragma("clang loop vectorize(enable)")
#define FAST_MATH _Pragma("clang fp contract(fast)")
#define VECTORIZED_KERNEL 1
#endif
#elif defined(_OPENMP) && (_OPENMP >= 201511)
// Supported with OpenMP 4.5 and above
#define VECTOR_LOOP _Pragma("omp for simd")
#define VECTORIZED_KERNEL 1
#define FAST_MATH
#endif
#endif
#ifndef VECTOR_LOOP
// Not supported
#define VECTOR_LOOP
#define FAST_MATH
#endif
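A sketch of how a kernel might consume these hints; the function name axpy_with_hints and the include path are assumptions, and the real perfkernels code may use the macros differently. When vectorization support is detected, VECTOR_LOOP asks the compiler to vectorize the loop and FAST_MATH permits floating-point contraction (fused multiply-add); otherwise both macros expand to nothing and the loop compiles as plain scalar code.

#include "caffe2/perfkernels/vectorizer.h"  // assumed location of the macros above

// y[i] += a * x[i]; FAST_MATH allows the compiler to fuse the multiply-add,
// and VECTOR_LOOP asks it to vectorize the loop.
inline void axpy_with_hints(int n, float a, const float* x, float* y) {
  VECTOR_LOOP
  for (int i = 0; i < n; ++i) {
    FAST_MATH
    y[i] += a * x[i];
  }
}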


@@ -62,8 +62,8 @@ Overall, the ``pipelining`` package provides the following features:
application on the Llama model.
Step 1: build ``PipelineStage`` for execution
*********************************************
Step 1: build ``PipelineStage``
*******************************
Before we can use a ``PipelineSchedule``, we need to create ``PipelineStage``
objects that wrap the part of the model running in that stage. The
@@ -261,11 +261,12 @@ Let us see how the ``pipeline`` API works:
from torch.distributed.pipelining import pipeline, SplitPoint
# An example micro-batch input
x = torch.LongTensor([1, 2, 4, 5])
pipe = pipeline(
module=mod,
num_chunks=1,
example_args=(x,),
mb_args=(x,),
split_spec={
"layers.1": SplitPoint.BEGINNING,
}
@@ -306,7 +307,7 @@ If we ``print(pipe)``, we can see::
The "model partitions" are represented by submodules (``submod_0``,
``submod_1``), each of which is reconstructed with original model operations
``submod_1``), each of which is reconstructed with original model operations, weights
and hierarchies. In addition, a "root-level" ``forward`` function is
reconstructed to capture the data flow between those partitions. Such data flow
will be replayed by the pipeline runtime later, in a distributed fashion.
@@ -317,12 +318,29 @@ The ``Pipe`` object provides a method for retrieving the "model partitions":
stage_mod : nn.Module = pipe.get_stage_module(stage_idx)
You can also create a distributed stage runtime on a device using ``Pipe``:
The returned ``stage_mod`` is a ``nn.Module``, with which you can create an
optimizer, save or load checkpoints, or apply other parallelisms.
``Pipe`` also allows you to create a distributed stage runtime on a device given
a ``ProcessGroup``:
.. code-block:: python
stage = pipe.build_stage(stage_idx, device, group)
Alternatively, if you would like to build the stage runtime later after some
modification to the ``stage_mod``, you can use a functional version of the
``build_stage`` API. For example:
.. code-block:: python
from torch.distributed.pipelining import build_stage
from torch.nn.parallel import DistributedDataParallel
dp_mod = DistributedDataParallel(stage_mod)
info = pipe.info()
stage = build_stage(dp_mod, stage_idx, info, device, group)
.. note::
The ``pipeline`` frontend uses a tracer (``torch.export``) to capture your
model into a single graph. If your model is not full-graph'able, you can use


@@ -10,6 +10,8 @@ API Methods
.. autofunction:: torch.distributed.elastic.events.record
.. autofunction:: torch.distributed.elastic.events.construct_and_record_rdzv_event
.. autofunction:: torch.distributed.elastic.events.get_logging_handler
Event Objects

Some files were not shown because too many files have changed in this diff.