pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-20 21:14:14 +08:00

Author	SHA1	Message	Date
Pearu Peterson	70d4d109f2	Make SparseCsr a functionality dispatch key (#120703 ) As in the title. To enable meta and fake tensor support for sparse compressed tensors in compliance with the meta/fake tensor support for sparse COO tensor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120703 Approved by: https://github.com/ezyang	2024-03-01 13:28:46 +00:00
PyTorch MergeBot	8a32a07856	Revert "Add meta device support to sparse compressed tensors (#120498 )" This reverts commit 5d71ba688563ef491bb28d47c493ec6fc7791da2. Reverted https://github.com/pytorch/pytorch/pull/120498 on behalf of https://github.com/zou3519 due to broke CI ([comment](https://github.com/pytorch/pytorch/pull/120498#issuecomment-1964491999))	2024-02-26 15:59:36 +00:00
Pearu Peterson	5d71ba6885	Add meta device support to sparse compressed tensors (#120498 ) As in the title. Unblocks https://github.com/pytorch/pytorch/pull/117907#discussion_r1499251745 Pull Request resolved: https://github.com/pytorch/pytorch/pull/120498 Approved by: https://github.com/ezyang	2024-02-25 16:50:17 +00:00
Joel Schlosser	b928e08f3d	Initial vmap + NT support with unbind fallback (#106786) PoC demonstrating vmap + NT based on the [design doc](https://docs.google.com/document/d/1dVVk6TOqz93PLTIneU2T3xaxCs9qZ0MaJyCvOAp_bC0).
This PR:
* Allows `BatchedTensorImpl`s to contain NTs
* Introduces a `BatchedNestedTensor` dispatch key for NT-specific batching rules
* Provides a batching rule fallback that unbinds the NTs -> performs computation on constituent -> rebinds results into NT
Restrictions:
* Only supports one level of vmap
* Only supports vmapping over dim=0 for NTs
* For operations with mixed NT / dense inputs, support is also limited to dim=0 for the dense inputs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106786 Approved by: https://github.com/zou3519	2023-09-07 13:53:20 +00:00
Meghan	6ff4548b6e	[AMP] Support XLA:TPU (#96370) With https://github.com/pytorch/xla/pull/5148 and https://github.com/pytorch/xla/pull/4740. With these changes, XLA:GPU users should use `torch.cuda.amp.autocast()` for AMP with float16, and XLA:TPU users should use `torch.amp.autocast('xla')` for AMP with bfloat16. Pull Request resolved: https://github.com/pytorch/pytorch/pull/96370 Approved by: https://github.com/bdhirsh, https://github.com/malfet	2023-06-23 19:46:42 +00:00
Charlie West-Taylor	5eb7325bc7	Add autocast support for IPU (#103890 ) As part of this, a new `AutocastIPU` dispatch key has been added. There's an existing PR, #85043, to make `Autocast` a proper per-backend functionality key, but it ran into issues with layering with other functionality keys and went stale. This has been tested in the out-of-tree IPU PyTorch backend. Pull Request resolved: https://github.com/pytorch/pytorch/pull/103890 Approved by: https://github.com/albanD	2023-06-22 15:38:45 +00:00
Brian Hirsh	c3c03e7cb8	Reland of https://github.com/pytorch/pytorch/pull/101818 (#103888 ) Original PR broke internal This reverts commit 5ed618132f466440ad76c884240e07796c7e2c6b. Pull Request resolved: https://github.com/pytorch/pytorch/pull/103888 Approved by: https://github.com/albanD	2023-06-21 21:00:56 +00:00
PyTorch MergeBot	5ed618132f	Revert "change pre_autograd to pre_dispatch tracing (#101818 )" This reverts commit b0392de2c39d132b5901fc9a366afc1ddc214f96. Reverted https://github.com/pytorch/pytorch/pull/101818 on behalf of https://github.com/izaitsevfb due to Breaks internal builds see D46629736 TypeError: wrap_key() got an unexpected keyword argument pre_autograd ([comment](https://github.com/pytorch/pytorch/pull/101818#issuecomment-1587837667))	2023-06-12 18:16:37 +00:00
Brian Hirsh	b0392de2c3	change pre_autograd to pre_dispatch tracing (#101818 ) We discussed in a composability meeting a few weeks ago that `pre_autograd` should probably be renamed to `pre_dispatch`. One question in this PR was: should I re-use a dispatch key? Or should I create a new dispatch key (that yet again corresponds to "top of the dispatcher")? ~~For now, I ended up sticking our proxy mode on the mode stack corresponding to `PythonTLSSnapshot`, because it was simple and it works. It looks like one of the functorch dispatch keys has higher priority though, so it's possible that functorch will end up running first. Open to options, but we can consider adding a new dispatch key later if that becomes a problem~~ Update: I added a dedicated dispatch key, `PreDispatch`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/101818 Approved by: https://github.com/ezyang, https://github.com/Neilblaze, https://github.com/albanD, https://github.com/zou3519	2023-06-09 17:30:15 +00:00
shibo19	af50efca24	add nested/sparse/quantized tensor key for privateuse1 (#102696) Fixes #ISSUE_NUMBER add nested/sparse/quantized tensor key for privateuse1 Pull Request resolved: https://github.com/pytorch/pytorch/pull/102696 Approved by: https://github.com/bdhirsh	2023-06-02 22:35:52 +00:00
shibo	6b691b99da	add amp support for custom backend (#96188) Fixes #ISSUE_NUMBER 1. add amp support for custom backend 2. optimize the file `backend_registration.py` and rename it to `custom_backend_registration.py`. We will then register other functions for the custom backend. Pull Request resolved: https://github.com/pytorch/pytorch/pull/96188 Approved by: https://github.com/bdhirsh	2023-03-20 20:27:35 +00:00
PyTorch MergeBot	a8f36dd646	Revert "add amp support for custom backend (#96188 )" This reverts commit cf12edee02a44009c4f06e36efa97d9a7372ab35. Reverted https://github.com/pytorch/pytorch/pull/96188 on behalf of https://github.com/kit1980 due to Broke some linalg tests : https://github.com/pytorch/pytorch/actions/runs/4420037607/jobs/7750708339	2023-03-15 00:03:19 +00:00
shibo	cf12edee02	add amp support for custom backend (#96188) Fixes #ISSUE_NUMBER 1. add amp support for custom backend 2. optimize the file `backend_registration.py` and rename it to `custom_backend_registration.py`. We will then register other functions for the custom backend. Pull Request resolved: https://github.com/pytorch/pytorch/pull/96188 Approved by: https://github.com/bdhirsh	2023-03-14 20:43:21 +00:00
Mikayla Gawarecki	895d4781b8	[easy] Add NestedTensorMeta to parseDispatchKey (#94279 ) ran into this when trying to use `torch.library.Library("aten", "IMPL", "NestedTensorMeta")` Pull Request resolved: https://github.com/pytorch/pytorch/pull/94279 Approved by: https://github.com/bdhirsh	2023-02-07 19:46:29 +00:00
Hangchen Yu	5a0fa04a49	Add MTIA DeviceType for Meta training and inference devices (#92232 ) Summary: This adds a new MTIA DeviceType which is associated with the MTIA DispatchKey and will be used for the Meta in-house training and inference accelerators. Test Plan: All CI should pass. Differential Revision: D42526044 Pull Request resolved: https://github.com/pytorch/pytorch/pull/92232 Approved by: https://github.com/ezyang	2023-01-16 12:20:23 +00:00
Sean Ross-Ross	5f881ac2d1	Adding dispatch alias 'FuncTorchBatchedDecomposition' (#88771 ) part of https://github.com/pytorch/functorch/issues/1009 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88771 Approved by: https://github.com/zou3519	2022-12-02 04:38:28 +00:00
Edward Z. Yang	825f4e602b	Add support for symbolic shapes to sparse tensor (#88573 ) Along the way, I undid making sparse/dense dim symint (they're dimensions, so they should be static.) Also symintify set_indices_and_values_unsafe There is a little bit of a nontrivial infra change here: previously, we didn't populate the strides field on sparse tensors. It is now populated with "empty" strides, and this meant that sparse tensors were falsely reporting they were non-overlapping dense/contiguous. I added in a hack to work around this case. Signed-off-by: Edward Z. Yang <ezyang@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/88573 Approved by: https://github.com/anjali411	2022-11-08 03:13:42 +00:00
Horace He	b3b9786fdd	Unified symbolic shape variables between AOTAutograd and Inductor (#86659 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86659 Approved by: https://github.com/wconstab	2022-10-14 00:24:43 +00:00
Amadeusz Skrzypczak	6be9d9a630	Add AutocastHPU support (#84927 ) New dispatch key and necessary functions are added to PyTorch. Backend implementation will be added in the external library. Pull Request resolved: https://github.com/pytorch/pytorch/pull/84927 Approved by: https://github.com/bdhirsh	2022-10-12 19:37:16 +00:00
Edward Z. Yang	490727a35f	New calling convention for Python dispatcher (#85133) Instead of calling into the Python dispatcher for EVERY dispatcher call, we now have a two step process. First, we getattr(op: OpOverload, dispatch_key) to "load" the handler for the function. This can either be a conventional function (in which case we will call it, in the same way the old Python dispatcher worked), or it can be a DispatchKey, in which case we will directly call that DispatchKey in C++, bypassing marshalling between Python and C++ entirely. OpOverload.__getattr__ is carefully written so that it will cache the A further optimization would be to define __slots__ on OpOverload, and ensuring that the DispatchKey strings are interned. The resulting Python dispatcher is less flexible: after the first lookup, the handler is cached and we won't recompute it. Furthermore, by default, dispatches will not go into Python, and so you won't get stack frames for the Python dispatcher by default. But we get a huge performance improvement: on the following microbenchmark we go from 2.5s to 1.9s.
```
import time
import torch
from functorch import make_fx

def f(x):
    for i in range(1000):
        x = x * x
    return x

begin = time.time()
res = make_fx(f, tracing_mode="symbolic")(torch.randn(10, 20))
print(time.time()-begin)
```
Signed-off-by: Edward Z. Yang <ezyang@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/85133 Approved by: https://github.com/wconstab	2022-09-16 20:38:21 +00:00
Michael Voznesensky	8ca1839d32	Python Dispatcher integration with C++ dispatcher (#85050 ) #84826 but without ghstack Pull Request resolved: https://github.com/pytorch/pytorch/pull/85050 Approved by: https://github.com/malfet	2022-09-15 00:43:36 +00:00
PyTorch MergeBot	706b990306	Revert "Python Dispatcher integration with C++ dispatcher (#84826 )" This reverts commit 35f6a69191ef762cf22b6cbfe94b8d9406e16674. Reverted https://github.com/pytorch/pytorch/pull/84826 on behalf of https://github.com/malfet due to Broke dynamo, see `35f6a69191`	2022-09-14 14:07:58 +00:00
Michael Voznesensky	35f6a69191	Python Dispatcher integration with C++ dispatcher (#84826) Signed-off-by: Edward Z. Yang <ezyang@fb.com> From @ezyang's original PR: There are a number of situations where we have non-backend kernels (e.g., CompositeImplicitAutograd, batching rules) which we would like to port to Python, but we have no way to integrate these ports with the overall system while using preexisting C++ registrations otherwise. This PR changes that by introducing a Python dispatcher (which can have its own kernels directly in Python), which can be interposed over ordinary C++ dispatch. The ingredients:
* We introduce a new PythonDispatcher dispatch key, that has the same tenor as FuncTorchDynamicLayerFrontMode: it works by getting triggered before every other dispatch key in the dispatch key set, and shunting to a Python implementation.
* The Python dispatcher is a per-interpreter global object that is enabled/disabled via the guard EnablePythonDispatcher/DisablePythonDispatcher. We don't make it compositional as I have no idea what a compositional version of this feature would look like. Because it is global, we don't need to memory manage it and so I use a simpler SafePyHandle (newly added) to control access to this pointer from non-Python C++. Like __torch_dispatch__, we use PyInterpreter to get to the Python interpreter to handle the dispatch.
* I need to reimplement dispatch table computation logic in Python. To do this, I expose a lot more helper functions for doing computations on alias dispatch keys and similar. I also improve the pybind11 handling for DispatchKey so that you can either accept the pybind11 bound enum or a string; this simplifies our binding code. See https://github.com/pybind/pybind11/issues/483#issuecomment-1237418106 for how this works; the technique is generally useful.
* I need to be able to call backend fallbacks. I do this by permitting you to call at a dispatch key which doesn't have a kernel for the operator; if the kernel doesn't exist, we check the backend fallback table instead.
Signed-off-by: Edward Z. Yang <ezyang@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/84826 Approved by: https://github.com/ezyang	2022-09-14 06:57:19 +00:00
YifanShenSZ	673b35c847	Better reshape with autograd support (#82754 ) (#84154 ) The original author is @YifanShenSZ and the original PR is: #82754 # Summary: Previous reshape [https://github.com/pytorch/pytorch/issues/80981](https://github.com/pytorch/pytorch/pull/80981) is ok for forward, but needs improvement for backward: need to handle "sometimes view sometimes copy" behavior. This pull request fixes it by: 1. add a new alias dispatch key `CompositeImplicitAutogradNestedTensor`, which ideally would work as nested-tensor version of `CompositeImplicitAutograd` 2. register `reshape_nested` to `reshape` by `CompositeImplicitAutogradNestedTensor` Side changes: * add contiguous memory format support to `clone_nested` * add `view_nested` * add `reshape_as_nested` Fix issue [https://github.com/pytorch/pytorch/issues/83041](https://github.com/pytorch/pytorch/issues/83041) Pull Request resolved: https://github.com/pytorch/pytorch/pull/82754 Test Plan: Imported from GitHub, without a `Test Plan:` line. Static Docs Preview: executorch \|[Full Site](https://our.intern.facebook.com/intern/staticdocs/eph/D39023822/V13/executorch/)\| \|Modified Pages\| Reviewed By: albanD Differential Revision: D39023822 Pulled By: drisspg Pull Request resolved: https://github.com/pytorch/pytorch/pull/84154 Approved by: https://github.com/bdhirsh, https://github.com/albanD	2022-09-01 20:01:39 +00:00
Elias Ellison	642aed8b99	Add Autocast Support for FakeTensors / use fake device dispatch keys (#82449) From PR:
```
Note: [Fake Tensor Dispatch Keys]
In order to model the behavior of device-specific autocast and autograd logic, we update the dispatch keys of FakeTensors to reflect their fake device. This includes the BackendComponent (DispatchKey::Meta -> DispatchKey::CUDA), and also the BackendComponent related Autocast and Autograd keys. __torch_dispatch__ sits below Autocast and Autograd, and is only invoked when we are at the kernel for the BackendComponent. Then, we add Meta to the thread-local dispatch include set to hit the meta kernel instead of the kernel of the BackendComponent for the fake device.
```
Also adds the `conv1/2/3d.padding` operators to the Autocast rule set. Without that fix, the FakeTensor dtype would diverge. See: https://github.com/pytorch/pytorch/issues/81608 Pull Request resolved: https://github.com/pytorch/pytorch/pull/82449 Approved by: https://github.com/ezyang	2022-08-01 21:40:36 +00:00
Edward Z. Yang	1724e9f21f	Refactor functionality and backend keys to reduce duplication (#81752 ) Define some macros for stamping these out, and then use them everywhere applicable. Parsing should get this treatment too but I leave it to a follow up. Signed-off-by: Edward Z. Yang <ezyang@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/81752 Approved by: https://github.com/cpuhrsch, https://github.com/bdhirsh	2022-07-21 21:23:54 +00:00
Brian Hirsh	adf8060600	add a new alias key for functional to view op decompositions Pull Request resolved: https://github.com/pytorch/pytorch/pull/79615 Approved by: https://github.com/zou3519	2022-06-15 23:18:09 +00:00
Edward Z. Yang	7313a7a987	Make Meta into a backend component Seems like it should be one. This will make it possible to register meta implementations even when there is a CompositeImplicitAutograd registration already. It also paves the way for sparse meta, etc. Signed-off-by: Edward Z. Yang <ezyangfb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/78469 Approved by: https://github.com/ngimel	2022-05-31 18:59:16 +00:00
Kulin Seth	f348b1b2b5	Add the Runtime components for MPS backend. (#76725 ) The PR adds the runtime components and few basic operations like copy, as_strided for MPS backend. Current list of identified TODOs are: - https://github.com/pytorch/pytorch/issues/77176 - Unify the logic with CUDACachingAllocator and remove redundant code. - https://github.com/pytorch/pytorch/issues/77170 - Look into using C++ smart pointers where possible with ObjC code - Use empty_strided_generic() to implement the `empty_strided_mps` code - https://github.com/pytorch/pytorch/issues/77144 Pull Request resolved: https://github.com/pytorch/pytorch/pull/76725 Approved by: https://github.com/albanD	2022-05-11 17:19:45 +00:00
Kulin Seth	54c75e1e8f	Add "mps" device to PyTorch framework. Remove the "mlc" device for Mac platforms. This commit will be followed up with: * adding MPS runtime components * PyTorch ops for MPS device Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/76291 Approved by: https://github.com/albanD	2022-04-27 19:21:57 +00:00
Can Balioglu	a0bf0f5611	Add new dispatch keys for Fake Tensor and Deferred Module Initialization Thanks to @bdhirsh's work, we now have room for new dispatch keys in `DispatchKey` enum. This PR adds two new keys for out-of-core [Fake Tensor](https://pytorch.org/torchdistx/latest/fake_tensor.html) and [Deferred Module Initialization](https://pytorch.org/torchdistx/latest/deferred_init.html) features. Pull Request resolved: https://github.com/pytorch/pytorch/pull/76139 Approved by: https://github.com/bdhirsh	2022-04-27 18:48:44 +00:00
Guo Yejun	6f991fc5fc	add XPU support for autocast Pull Request resolved: https://github.com/pytorch/pytorch/pull/75250 Approved by: https://github.com/bdhirsh	2022-04-19 21:18:23 +00:00
Scott Wolchok	0a5e788ab2	[PyTorch] Add NestedTensorCPU and NestedTensorCUDA dispatch keys (#75808 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/75808 Just as it is often difficult to write a single kernel that can handle both CPU and CUDA, so can it be difficult to do the same for NestedTensor. ghstack-source-id: 154171542 (Note: this ignores all push blocking failures!) Test Plan: CI? Reviewed By: bdhirsh Differential Revision: D35603836 fbshipit-source-id: fb0ebb19d34531ed96ce176aca325f8e2b5f90e6 (cherry picked from commit 0bcd753f93c04256c1b745f84a74ecccf0dceef5)	2022-04-19 18:12:12 +00:00
Anthony Barbier	ce9e27a0fc	Add new keys for Graphcore IPU (DispatchKey / Backend / DeviceType) We need a key to register our out of tree backend: https://github.com/graphcore/poptorch Pull Request resolved: https://github.com/pytorch/pytorch/pull/74763 Approved by: https://github.com/bdhirsh	2022-04-07 17:18:45 +00:00
Brian Hirsh	1b7d7d9327	Reland: "free up dispatch key space (in C++)" (#74963 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74963 This is a re-land of D35192346 (`9872a06d77`) and D35192317 (`a9216cde6c`), which together are a diff that changes the internal representation of `DispatchKeySet` in pytorch core to free up the number of dispatch keys that we have available. See a more detailed description of the design in the original PR: https://github.com/pytorch/pytorch/pull/69633. The original PR broke Milan workflows, which use a pytorch mobile build, and manifested as a memory corruption bug inside of `liboacrmerged.so`. Background: Existing Mobile Optimization Pytorch mobile builds have an existing optimization (here `cc23725e89/c10/core/DispatchKey.h (L382)` and here `cc23725e89/aten/src/ATen/core/dispatch/OperatorEntry.h (L214)`), which works as follows: Every operator in pytorch has a "dispatch table" of function pointers, corresponding to all of the (up to 64) different kernels that we might dispatch to when we run an operator in pytorch (autograd, cpu, cuda, complex number support, etc). In mobile builds, the size of that table is shrunk from 64 to 8 to save a bunch of space, because mobile doesn't end up using the functionality associated with most dispatch keys. The dispatcher also has a notion of "fallback kernels", which are kernels that you can register to a particular dispatch key, but should be able to work for "any operator". The array of fallback kernels is defined here: `cc23725e89/aten/src/ATen/core/dispatch/Dispatcher.h (L294)`. The mobile-optimization currently does not extend to this array (it wouldn't be that useful anyway because there is only one array of fallback kernels globally - vs. there is a separate dispatch table of function pointers per operator). So the per-operator tables on mobile are size 8, while the fallback table is size 64. 
The Bug
This PR actually makes it difficult to enable that optimization separately for the per-operator arrays vs. the fallback array, and incidentally shrank the size of the fallback array from 64 to 8 for mobile (that happened on this line: https://github.com/pytorch/pytorch/pull/69633/files#diff-f735cd7aa68f15b624100cbc4bb3b5ea76ffc7c9d3bec3b0ccabaa09609e5319R294). That isn't a problem by itself (since mobile doesn't actually use any of the fallbacks that can no longer be stored). However, pytorch core will still register all of those fallback kernels on startup in mobile builds, even if they aren't used. When we tried to register one of those fallbacks on startup, it would try to dump the kernel somewhere in memory past the bounds of the (now smaller) array inside of the `Dispatcher` object, `backendFallbackKernels_`.
Why didn't this problem show up in OSS CI? Why didn't it break other internal mobile workflows aside from Milan?
Ideally, this failure would show up as part of the OSS signal on GitHub, since we already have mobile OSS builds. Given that it was another memory corruption issue that only affected Milan (subset of mobile), I'm not sure what's specific about Milan's builds that caused it only to manifest there. dreiss I wonder if there's another flavor of mobile builds we could run in OSS CI that could potentially help catch this?
The debugging experience was pretty difficult
Debugging the Milan-specific failure was made difficult by the following:
(1) lack of CI - the original Milan failure didn't surface on my original diff, because the Milan job(s) that failed weren't triggered to run on pytorch changes. There's probably a balance to strike here, since those jobs will only be useful if they aren't flaky, and if they can produce reliable failure logs for debugging.
(2) It's difficult to get a repro. 
- my work laptop doesn't have the right specs to run the Milan development workflow (not enough disk space) - There is an existing OnDemand workflow for Milan, but it appears to be relatively new, and after a bunch of help from MarcioPorto, we ran into issues forwarding the log output from Milan tests on the emulator back to the terminal (see the original discussion here: https://fb.workplace.com/groups/OnDemandFRL/permalink/1424937774645433/) (3) Lack of stack-traces. - Most Milan failures didn't include actionable stack traces. phding generously helped me debug by running my suggested patches locally, and reporting back if there were any failures. The failing test didn't include a stack trace though (just the line where the crash appeared), so I ended up making some educated guesses about what the issue was based on the area of the crash. ghstack-source-id: 152688542 Test Plan: Confirmed with phding that the broken Milan workflow from the previous version of this diff is now passing. Reviewed By: phding, albanD Differential Revision: D35222806 fbshipit-source-id: 0ad115a0f768bc8ea5d4c203b2990254c7092d30 (cherry picked from commit 002b91966f11fd55ab3fa3801b636fa39a6dd12c)	2022-03-31 21:52:38 +00:00
Brian Hirsh	9872a06d77	Back out "free up dispatch key space (in C++)" (#74859 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74859 Original commit changeset: 6d1dd0fd8144 Original Phabricator Diff: D34227616 (`2cbddc0e9b`) ghstack-source-id: 152381077 (Note: this ignores all push blocking failures!) Test Plan: Test on Milan with "get weather utterance" buck build fbsourcefbandroid/mode/opt fbsourcefbandroid/mode/milan_build_rdk //fbandroid/apps/wearable/system/speechservice:speechservice_target30_xhdpi_armv7_release_debug_keystore -c pt.has_backtaces=1 Reviewed By: phding Differential Revision: D35192346 fbshipit-source-id: b962de5d5effaf23f9aa8afd3ef36f8c6383de5b (cherry picked from commit 913e3027a11457aaa2d97a9d89ebc6133b14213c)	2022-03-29 15:39:17 +00:00
Brian Hirsh	2cbddc0e9b	free up dispatch key space (in C++) (#72827 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72827 Reland of D34034848 (`6690256021`) ghstack-source-id: 152161452 Test Plan: Confirm that Milan tests are passing Reviewed By: ezyang Differential Revision: D34227616 fbshipit-source-id: 6d1dd0fd8144dfbd9e194cd7564cce017e7db968 (cherry picked from commit e5c1b29fedd5c2a0bad810cedc94aa784136b6aa)	2022-03-25 17:04:51 +00:00
Alban Desmaison	a7cac05ca6	Add new tls snapshot feature (#72832 ) Summary: Reland of https://github.com/pytorch/pytorch/pull/72623 that was reverted for the tls cleanup was removed. From close inspection on the counting of the number of available keys, I think there is one more since the guard is actually one after the last usable key. With this update assert, the last updated key will still be <=63 which will fit just fine. Pull Request resolved: https://github.com/pytorch/pytorch/pull/72832 Reviewed By: H-Huang Differential Revision: D34228571 Pulled By: albanD fbshipit-source-id: ce5e10a841ea87386727346cfc8d9327252574c4 (cherry picked from commit 59d3b863534a37ac3463e2814bc9599c322669ee)	2022-02-15 19:02:05 +00:00
Brian Hirsh	22ccf448e8	Revert D34034848: free up dispatch key space (in C++) Test Plan: revert-hammer Differential Revision: D34034848 (`6690256021`) Original commit changeset: 9677ee2c0a1a Original Phabricator Diff: D34034848 (`6690256021`) fbshipit-source-id: fd50943d915ef813bb9f9ab278fb582429eea3b1 (cherry picked from commit 3acefee1cdb89bc051d1ef2e9deb5698d2bd85c3)	2022-02-14 23:29:00 +00:00
Brian Hirsh	f1a9650e4f	Revert D34214953: Add new tls snapshot feature Test Plan: revert-hammer Differential Revision: D34214953 (`6199b5231f`) Original commit changeset: 7aa5d5e3540a Original Phabricator Diff: D34214953 (`6199b5231f`) fbshipit-source-id: 5d271e9a5ab021b8202402630dbf917b43c55421 (cherry picked from commit a12c630198d391e05b413ee2ff5155ab1aee282f)	2022-02-14 23:14:19 +00:00
Alban Desmaison	6199b5231f	Add new tls snapshot feature (#72623 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72623 Test Plan: Imported from OSS Reviewed By: samdow Differential Revision: D34214953 Pulled By: albanD fbshipit-source-id: 7aa5d5e3540a45a0ae70c5af3a4495c755908aa9 (cherry picked from commit dc0a1ab54a459019e4cd91b30a34adbc2e4ac5a4)	2022-02-14 20:46:54 +00:00
Brian Hirsh	6690256021	free up dispatch key space (in C++) (#72402) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72402 The original PR had an array-out-of-bounds access in `DispatchKeyExtractor.cpp`, that wasn't caught by ASAN and appeared to only manifest in a subset of android internal tests. After fixing the OOB access (and adding more asserts), I confirmed that the android internal test passes. Reland of D33255193 (`20b8653dfa`) ghstack-source-id: 148830728 Test Plan: Steps to test:
(1) connect to a mobile OD
(2) run `one_world android emulator android-29` in a terminal to start the android emulator
(3) In a separate terminal, run the test: `buck test //fbandroid/instrumentation_tests/com/facebook/pytorch/bi_xray:instrumentation_test -c test.external_runner=tpx -- --regex 'testBIXRayModel.*PyTorchBIXRayInstrumentationTest' --force-remote-execution --run-disabled`
I also ran `buck test fbandroid/mode/dbg //fbandroid/instrumentation_tests/com/facebook/pytorch/bi_xray:instrumentation_test`, which failed before and passed after the PR. Reviewed By: albanD Differential Revision: D34034848 fbshipit-source-id: 9677ee2c0a1afd1183896f7055009445712523c5 (cherry picked from commit 9ab9b12d355540ad0923c6869ed088ff6c21490c)	2022-02-14 16:02:29 +00:00
Jacob Szwejbka	791e7df7d9	Back out "free up dispatch key space (in C++)" Summary: I think this diff stack broke all the related tasks below. Test Plan: For our failing tests: buck test //fbandroid/instrumentation_tests/com/facebook/pytorch/bi_xray:instrumentation_test -c test.external_runner=tpx -- --regex 'testBIXRayModel.*PyTorchBIXRayInstrumentationTest' --force-remote-execution --run-disabled For the ubn: Not really sure what to do, trying to build the app and see if I can use an effect? Reviewed By: shoumikhin Differential Revision: D34018849 fbshipit-source-id: 3571718cb6621931af931b494e0a70d6e0164e65 (cherry picked from commit 3cc63cb2ea2664dd1063b190614f2034cce5f2d0)	2022-02-05 01:25:42 +00:00
Brian Hirsh	20b8653dfa	free up dispatch key space (in C++) (#69633 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69633 Test Plan: Imported from OSS Reviewed By: albanD Differential Revision: D33255193 Pulled By: bdhirsh fbshipit-source-id: 79773e9c15bf4f2f27675121a49ff5ffd1375238 (cherry picked from commit eac0b1300569e035f3de28a1f0fdce03f60bd270)	2022-02-04 17:57:38 +00:00
vfdev	4d28cef03a	Added AutocastCPU string (#70013) Summary: Description: - Added "AutocastCPU" string repr into `toString` method
Before:
```
std::cout << c10::DispatchKey::AutocastCPU;
> UNKNOWN_TENSOR_TYPE_ID
```
and now:
```
std::cout << c10::DispatchKey::AutocastCPU;
> AutocastCPU
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70013 Reviewed By: ejguan Differential Revision: D33550777 Pulled By: bdhirsh fbshipit-source-id: b31e15e6d52fc1768af085e428328117d588f283	2022-01-12 12:06:46 -08:00
anjali411	3e6164449f	Add efficient zero tensors (#64837 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64837 Test Plan: Imported from OSS Reviewed By: gchanan Differential Revision: D32834987 Pulled By: anjali411 fbshipit-source-id: 20ea08ade0db0044ca633d9c1a117a6a2e65d1fd	2021-12-08 10:37:39 -08:00
Mark Richardson	834bd3134e	Back out "Add efficient zero tensors" (#69327 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69327 Original commit changeset: d44096d88265 Original Phabricator Diff: D32144240 (`668574af4a`) Test Plan: CI original diff failed 175 builds in CI Reviewed By: airboyang, anjali411 Differential Revision: D32809407 fbshipit-source-id: c7c8e69bcee0274992e2d5da901f035332e60071	2021-12-02 19:11:41 -08:00
anjali411	668574af4a	Add efficient zero tensors (#64837 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64837 Test Plan: Imported from OSS Reviewed By: albanD Differential Revision: D32144240 Pulled By: anjali411 fbshipit-source-id: d44096d882657c7f9270a16636900e0b73cefa40	2021-12-02 08:47:45 -08:00
Brian Hirsh	0032fa7725	Add a Functionalization pass in core (#64432) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64432 Original PR description + feedback here: https://github.com/pytorch/pytorch/pull/63048 I've addressed all of the feedback in the original PR and made some pretty large changes, listed below.
Table of Contents
- Starting points
- List of the main changes from the original PR
- Next Steps
- Example codegen output (for a view, mutation, and view+mutation op)
Starting Points
A good place to start when looking through the PR:
* Alban mentioned that this is a useful mental model (thanks Ed for originally making this clear to me). Semantically, the pass currently does THREE things, which are all needed by functorch - all fused together into one big pass.
* (a) alias removal, which replaces {view} calls with {view}_copy calls, and manually tracks aliasing information, so that when one tensor is mutated, we re-apply the same mutation to all of the aliases. This is the bulk of the work - once this is done, the next 2 things are trivial to implement.
* (b) mutation removal, which is easy to do once we know that there are no aliases. Every mutation `a.add_(b)` becomes `a.replace_(a.add(b))`
* (c) reapplying views: all of the `{view}_copy` calls are replaced with `{view}` calls again. This is an optimization that we can make specifically for functorch (and strided backends), that only care about mutation removal and not alias removal
* XLA and Vulkan only want (a), or (a) + (b). Later, we'll want to split this out so that you can actually opt into different versions of this logic.
* There is currently no {view}_copy replacement, because the pass just <replace views with copies> and <replace copies with views> steps have been combined. Later, we'll want to actually implement {view}_copy variants of each view operator, probably with codegen. 
* documentation breadcrumb 1, in `FunctionalTensorWrapper.cpp`: https://github.com/pytorch/pytorch/pull/64432/files#diff-a0bac99bf205dba5b94cb64fc2466d3d55d991887572f9cd6a02e27b3a91dd60R59 (you might have to expand the `FunctionalTensorWrapper.cpp` file, which GitHub closes by default because it's large) * documentation breadcrumb 2, in `FunctionalTensorWrapper.h`: https://github.com/pytorch/pytorch/pull/64432/files#diff-c945c71a4ccac65871f24a912e8904f9a5088b24a32e636727ea9c8fe920708aR12 * Reading through the codegen output at the bottom of this description. Main changes from the original PR (1) I use lambdas instead of a giant enum to handle all of the different views. This results in less boilerplate per view op (and more stuff that can be codegen'd). Every `ViewMeta` object now contains a `forward` and `reverse` lambda, that knows how to replay the view and its inverse. This makes the actual code that executes the replaying logic a lot less boilerplate-y (see `Alias::sync_update_operations` and `FunctionalTensorWrapper::sync_`) (2) Every tensor during the functionalization pass is always wrapped in a `FunctionalTensorWrapper`. This is potentially unnecessary for Vulkan/XLA, and will have a mild perf impact, but for now this PR just targets the functorch use case. I previously had a complicated design (a `FunctionalTensorImplBase` class) to avoid needing the wrapper for XLA, but it had some subtleties that are gonna require more thought to fix, so I'm pushing that off for now. (3) `FunctionalTensorWrapper` objects accurately report stride information. It's a little annoying to do this though, because the logic that calculates stride info for each view isn't easily separated from the actual view kernels in core, `at::native::{view}`. I do this by adding logic in each `at::functionalization::{view}` kernel to call the reference implementation `at::native::{view}`. 
I don't do anything with the output aside from taking its size/stride/storage_offset to set the actual output tensor's size/stride/storage_offset correctly. There's another annoying part to this: I'm pretty sure that we want to pass in the actual wrapper tensors directly into the native kernels, not their inner unwrapped values. But there are some `at::native::{view}` kernels that call other tensor methods, which re-invokes the dispatcher, calling functionalization/functorch kernels that try to do the unwrapping. To do this, right now I have an `AutoDispatchDirectlyToNative` guard that basically ensures that any tensor methods called inside of the at::native::{view} op always redispatch straight to the CPU kernel (which will be another at::native:: kernel). This feels kind of heavy handed, but I'm not sure of a better way to do it. (4) `FunctionalTensorWrapper` objects accurately report aliasing information. There's a new `FunctionalStorageImpl` class (subclass of `StorageImpl`) that allows tensors in the functionalization pass to accurately alias storage. If two tensors `a` and `b` in a functionalized program are views of one another, then `a.storage.is_alias_of(b.storage)` should return true. I added this in a pretty similar way to how meta tensors allocate storage, although I don't pass in an actual allocator (I think this is fine because you should never resize a functional tensor's storage). One thing I'm not sure about - should `FunctionalTensorWrapper` set `storage_access_should_throw_`: (a) always, (b) never, (c) only if its wrapped tensor has it set. Right now I have it not set, mostly because calling the reference view functions (`at::native::{view}`) requires looking at the storage. But that means that if you try to access storage from python in a functionalized program, you'll get silent garbage instead of an error. Related question: are we planning on exposing meta tensor storage to python in the future (even though it contains garbage)? 
(5) better docs :) View operator coverage (6) The functionalization pass now gets math-composite view ops for free. I didn't add the `Functionalize` dispatch key to the composite set, because I don't want composite ops like `torch.ones` to get decomposed before hitting the functionalization pass. Instead, I added codegen to manually register the `at::native::` kernels of composite view ops. This is a little hairy, because the names of the `at::native::` kernels aren't easily accessible. They're stored in a `Dict[DispatchKey, BackendIndex]`. I made a best-effort attempt to get each view kernel's name, basically by assuming that every view op has either a composite or cpu implementation. There's also a hardcoded list of composite view ops in `gen_inplace_or_view_type.py`, but it looks like it's wrong. This is probably worth rationalizing later, but instead I created a new list of the "complete" set of composite view ops, and preserved the old set by hardcoding the delta between the two sets. (7) I've added codegen for ops that are both views AND mutations, like `transpose_()` (why do we even have these 😢). From some light testing, it looks like they work correctly with one caveat: I had a hard time ensuring that functorch programs that mutate their inputs using ops like `transpose_()` preserve the input mutations after the program finishes running. For now (in my corresponding functorch branch) I emit a warning when this happens, and just don't preserve the mutation. (8) I added `{view}_inverse` implementations for every view op, in `FunctionalInverses.cpp`. These are needed to take mutations made to views and replay them back onto the base. To reduce boilerplate, the codegen generates function declarations for each `{view}_inverse` function, so you get a nice compiler error when someone eventually adds a new view op. The only view ops currently not supported are (a) as_strided, and (b) the sparse view ops (values()/indices()). 
I can add support for as_strided, but it needs an `as_strided_inverse()` function. That will look really similar to the `as_strided_backward()` function in FunctionsManual.cpp, but it has some noticeable differences: we basically want an `as_strided_embed` for autograd and `as_strided_scatter` for functionalization. We also will probably need them to be primitives w.r.t. autograd, since the current implementation for autograd uses view().copy_() calls that XLA won't be able to handle. I'm wondering if anyone has any objections, but otherwise I can make those changes (which will require writing backward formulas for `as_strided_embed` and `as_strided_scatter`). I did a bunch of manual testing that all looks pretty good, but it's definitely not fully tested. Ed pointed out that once XLA uses this pass (or at least once there's a POC), we can just run the existing xla view test suite. Hopefully that delay is okay - if it's not, maybe we can think about using OpInfos similar to how functorch uses them for testing. Note: there's some duplication with autograd's view code. Every `{view}_inverse` implementation is really similar to the implementation for that view listed in `derivatives.yaml`. There are some major differences though: * the autograd implementations of those backwards functions (like `permute_backwards()`, in `FunctionsManual.cpp`) internally call other view ops. For functionalization, we want them to (eventually) call `{view}_copy` operators. * For view ops that take a subset of the original storage, like `slice/select/diagonal/as_strided()`, the autograd backward functions fill the "spaces" in the inverse call with zeroes. For functionalization, we want to fill them with the value of `base` at those positions. 
It looks like this currently applies to 6 total ops (since we can ignore composites): * select * slice * diagonal * as_strided * split * split_with_sizes A nice end state would probably be for the autograd + functionalization codegen to both look at the same yaml (either `derivatives.yaml`, or something else), and automatically generate the right thing. I didn't leave that in scope for this PR though. Current State + Next Steps There are a bunch of followups after this PR eventually lands. Roughly in order: * Use the current pass to register problematic composite ops in functorch. Also, nested `functionalize()` calls aren't supported yet (I mostly just need to remove some debug asserts and test it). * Work on freeing up dispatch key space by deduplicating the `{backend}`/`Autograd{backend}`/`Sparse{backend}`/`Quantized{backend}` keys * Once we have more dispatch keys, split up this pass into 3 pieces - it's currently fused, and doesn't do the right thing for vulkan/XLA. Specifically, all of the `{view}` calls in the current pass's view-replay logic should turn into `{view}_copy` calls that vulkan/XLA know how to implement, and there will be separate passes for (a) removing mutations, and (b) turning `{view}_copy` calls back into `{view}` calls. For Vulkan, we eventually want a pass that ONLY removes aliasing and view calls, and doesn't remove mutations. We can also probably make the 2 new passes user dispatch keys to save dispatch key space, if they'll only be used by functorch anyway. * Do more of a dive on perf for the vulkan/xla use cases. There are several areas to improve perf with varying levels of effort required. The simplest one that I'll probably do regardless is to codegen the out-of-place kernels instead of using a boxed fallback. Getting a POC working for xla will also be useful to test the view operator coverage. 
Example Codegen Output View Op: ``` ::std::vector<at::Tensor> split_Tensor(c10::DispatchKeySet ks, const at::Tensor & self, int64_t split_size, int64_t dim) { auto self_ = at::functionalization::impl::unwrapFunctionalTensor(self); ::std::vector<at::Tensor> out; { at::AutoDispatchBelowFunctionalize guard; auto tmp_output = at::redispatch::split(ks & c10::after_func_keyset, self_, split_size, dim); out = at::functionalization::impl::wrapFunctionalTensor(tmp_output); // I'm fusing the [alias removal], [mutation removal], [add views back] passes together. // Later, we'll want to turn them into separate passes (since e.g. vulkan only cares about alias removal). } at::functionalization::ViewMeta view_meta = at::functionalization::ViewMeta( [split_size, dim](const at::Tensor& base, int64_t mutated_view_idx) -> at::Tensor { return base.split(split_size, dim)[mutated_view_idx]; }, [split_size, dim](const at::Tensor& base, const at::Tensor& mutated_view, int64_t mutated_view_idx) -> at::Tensor { return at::functionalization::impl::split_inverse(base, mutated_view, mutated_view_idx, split_size, dim); } ); at::functionalization::impl::set_view_meta(out, self, view_meta); at::AutoDispatchDirectlyToNative native_guard; ::std::vector<at::Tensor> reference_tensor_output = at::native::split(self, split_size, dim); at::functionalization::impl::set_strides(out, reference_tensor_output); return out; } ``` Mutation Op: ``` at::Tensor & add__Tensor(c10::DispatchKeySet ks, at::Tensor & self, const at::Tensor & other, const at::Scalar & alpha) { at::functionalization::impl::sync(self); at::functionalization::impl::sync(other); auto self_ = at::functionalization::impl::unwrapFunctionalTensor(self); auto other_ = at::functionalization::impl::unwrapFunctionalTensor(other); at::Tensor tmp_output; { at::AutoDispatchBelowFunctionalize guard; // The functionalization pass explicitly doesn't pass out= parameters to the redispatch tmp_output = at::redispatch::add( ks & c10::after_func_keyset, 
self_, other_, alpha); } self.replace_(tmp_output); at::functionalization::impl::maybe_add_update(self); return self; } ``` View + Mutation Op: ``` at::Tensor & transpose_(c10::DispatchKeySet ks, at::Tensor & self, int64_t dim0, int64_t dim1) { at::functionalization::ViewMeta view_meta = at::functionalization::ViewMeta( [dim0, dim1](const at::Tensor& base, int64_t mutated_view_idx) -> at::Tensor { return base.transpose(dim0, dim1); }, [dim0, dim1](const at::Tensor& base, const at::Tensor& mutated_view, int64_t mutated_view_idx) -> at::Tensor { return at::functionalization::impl::transpose_inverse(base, mutated_view, dim0, dim1); } ); at::functionalization::impl::mutate_view_meta(self, view_meta); // See Note [Propagating strides in the functionalization pass] // Directly update the sizes/strides/storage_offset fields on self using the inplace call. // I need the guard because I don't want the at::native kernel to end up calling more functionalization/functorch kernels. // Its only job is to directly compute the output size/stride/storage_offset metadata. at::AutoDispatchDirectlyToNative native_guard; at::native::transpose_(self, dim0, dim1); return self; } ``` Test Plan: Imported from OSS Reviewed By: albanD Differential Revision: D31942093 Pulled By: bdhirsh fbshipit-source-id: b95598dae35dd1842fa8b1d8d1448332f3afaadf	2021-10-28 10:51:17 -07:00
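The alias-removal and view-replay idea in the Functionalization commit message above can be sketched with a toy model. This is only an illustration using plain Python lists in place of tensors; every name in it (`slice_copy`, `slice_inverse`, `slice_backward_like`, `add`) is hypothetical and is not a PyTorch API. It shows a mutation on a slice being replaced by an out-of-place op and then replayed onto the base, and contrasts filling the gaps with the values of `base` (what the message asks of a `{view}_inverse`) with the zero-filling that the message attributes to the autograd backward functions.

```python
# Toy model of view replay with plain lists; all helper names are hypothetical.

def slice_copy(base, start, end):
    """Stand-in for a {view}_copy op: returns a fresh copy, never an alias."""
    return list(base[start:end])

def slice_inverse(base, mutated_view, start, end):
    """Replay a mutated slice onto a copy of base.
    Positions outside the slice keep the values of base."""
    out = list(base)
    out[start:end] = mutated_view
    return out

def slice_backward_like(base, grad_view, start, end):
    """Autograd-style counterpart, for contrast: gaps are filled with zeros."""
    out = [0] * len(base)
    out[start:end] = grad_view
    return out

def add(a, b):
    """Out-of-place elementwise add, standing in for the functional op."""
    return [x + y for x, y in zip(a, b)]

base = [1, 2, 3, 4, 5]
view = slice_copy(base, 1, 4)            # alias removal: the view becomes a copy
view = add(view, [10, 10, 10])           # mutation removal: view.add_(b) -> add(view, b)
base = slice_inverse(base, view, 1, 4)   # replay the mutation back onto the base
other_alias = slice_copy(base, 0, 2)     # another "view" regenerated from the updated base
zero_filled = slice_backward_like([1, 2, 3, 4, 5], [12, 13, 14], 1, 4)
```

After this runs, `base` carries the update in positions 1 through 3 and keeps its original first and last elements, while `zero_filled` has zeros in those two positions.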
Brian Hirsh	bcc6e3ab5e	add python API to print all operators that have kernels registered to a particular DispatchKey (#63575 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63575 Test Plan: Imported from OSS Reviewed By: ezyang, Chillee Differential Revision: D30426919 Pulled By: bdhirsh fbshipit-source-id: b0e487e48dfe02f7b9d678403f0a2b5bfe146f4e	2021-09-22 09:15:55 -07:00


100 Commits