Summary:
# Context
See the first PR https://github.com/pytorch/pytorch/pull/153670
# This PR
1. Migrate 3 clamp ops from out-of-tree to in-tree (all 3 had to be migrated together, because clamp.out calls all 3 stubs, which are also called by the other 2 ops; see the Python sketch below the list):
- clamp.out
- clamp_min.out
- clamp_max.out
2. Also enabled structured kernel codegen for MTIA, which clamp needs.
3. Also introduced the `--mtia` flag in torchgen to prevent OSS builds from generating MTIA code. (Otherwise we hit link errors like `lib/libtorch_cpu.so: undefined reference to at::detail::empty_mtia`.)
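For reference, the three migrated out-variants exercised from Python:
```python
import torch

x = torch.randn(4)
out = torch.empty(4)
torch.clamp(x, min=-1.0, max=1.0, out=out)  # clamp.out
torch.clamp_min(x, 0.0, out=out)            # clamp_min.out
torch.clamp_max(x, 0.0, out=out)            # clamp_max.out
```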
Differential Revision: D74674418
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154015
Approved by: https://github.com/albanD, https://github.com/nautsimon
Summary:
# Context
The MTIA New Aten Backend work essentially moves MTIA operators from PyTorch out-of-tree to in-tree, with the following benefits:
1. Avoid duplicate code copied from PyTorch, e.g. view-op implementations and util functions.
2. Utilize TensorIterator and structured kernel codegen; avoid manual implementation of broadcasting, dtype casting, asserting, etc.
3. Eliminate MTIA's own codegen flow, which is unnecessary complexity.
4. Overall, make MTIA's ATen backend more PyTorch-native.
Differential Revision: D74672464
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153670
Approved by: https://github.com/albanD, https://github.com/nautsimon
This change enables basic NestedTensor operations on HPU,
fixing the runtime error when creating a NestedTensor on HPU.
- Extended `NestedTensorImpl` to recognize `hpu` as a valid storage device.
- Added `NestedTensorHPU` to `DispatchKey` parsing in `DispatchKey.cpp`.
- Updated `torchgen/model.py` to include `NestedTensorHPU` in `dispatch_keys`.
- Modified `native_functions.yaml` to enable `NestedTensorHPU` support for various ops.
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148659
Approved by: https://github.com/jeromean, https://github.com/albanD, https://github.com/sujoysaraswati
Fixes https://github.com/pytorch/executorch/issues/8711
In ExecuTorch, when we try to parse the following schema:
```
aten::__lshift__.Scalar(Tensor self, Scalar other) -> Tensor
```
Repro:
```python
from torchgen.model import FunctionSchema
native_schema = FunctionSchema.parse("aten::__lshift__.Scalar(Tensor self, Scalar other) -> Tensor")
```
It fails because `BaseOperatorName` categorizes it as an in-place operator.
I understand we are not supposed to pass the namespace "aten::" into `FunctionSchema.parse()`, but unfortunately ExecuTorch requires this to work.
This PR adds a new `namespace` attribute to `BaseOperatorName` and makes sure the rest of the stack works as before if a schema without a namespace is passed in.
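A quick sketch of the intended behavior after this change (assuming the new attribute is exposed as `BaseOperatorName.namespace`; the exact stored format is an assumption here):
```python
from torchgen.model import FunctionSchema

schema = FunctionSchema.parse(
    "aten::__lshift__.Scalar(Tensor self, Scalar other) -> Tensor"
)
base = schema.name.name  # BaseOperatorName
print(base.base)         # "__lshift__" (no longer misparsed)
print(base.inplace)      # False (not treated as an in-place op)
print(base.namespace)    # "aten" (new attribute; name/format assumed)
```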
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148038
Approved by: https://github.com/bdhirsh
Reference: https://docs.astral.sh/ruff/formatter/black/#assert-statements
> Unlike Black, Ruff prefers breaking the message over breaking the assertion, similar to how both Ruff and Black prefer breaking the assignment value over breaking the assignment target:
>
> ```python
> # Input
> assert (
>     len(policy_types) >= priority + num_duplicates
> ), f"This tests needs at least {priority+num_duplicates} many types."
>
>
> # Black
> assert (
>     len(policy_types) >= priority + num_duplicates
> ), f"This tests needs at least {priority+num_duplicates} many types."
>
> # Ruff
> assert len(policy_types) >= priority + num_duplicates, (
> f"This tests needs at least {priority + num_duplicates} many types."
> )
> ```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144546
Approved by: https://github.com/malfet
Summary:
Continuing the work from https://github.com/pytorch/pytorch/pull/146427
Adds the `torch.float8_e8m0fnu` dtype to PyTorch, as detailed in
https://github.com/pytorch/pytorch/issues/146414. Please see the issue for a detailed definition of the format. Example of basic functionality:
```python
import torch
# round trip
x0 = torch.randn(4, 4, dtype=torch.float32)
x1 = x0.to(torch.float8_e8m0fnu) # RNE rounding
x2 = x1.to(torch.float32) # 2 ** exponent
# creation with empty
x0 = torch.empty(4, 4, dtype=torch.float8_e8m0fnu)
# printing
print(x0)
```
Done in this PR:
* numerical correctness
* op coverage (except for `torch._scaled_mm`): create tensor, cast to/from float32
* printing a tensor works
For future PRs:
* performance optimizations for casting
* torch._scaled_mm
* PT2
* various cleanups (detailed in comments with issue numbers)
Test Plan:
```
pytest test/quantization/core/experimental/test_float8.py -s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147466
Approved by: https://github.com/drisspg
* Automatically applies ruff rule PERF401, turning loops into equivalent list comprehensions, which are faster and do not leak the loop variables into the enclosing scope (see the sketch below this list).
* List comprehensions not only often have better typing, but are also 50+% faster than for loops when measuring loop overhead. They preserve length information and are easier for the interpreter to optimize.
* Manually went back and made mypy happy after the change.
* Also fixed style lints in files covered by flake8 but not by pyfmt.
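For illustration, a minimal example of the transformation PERF401 applies (the variable names here are made up):
```python
items = [-2, -1, 0, 1, 2]

# Before: manual loop accumulation (the pattern PERF401 flags)
result = []
for x in items:
    if x > 0:
        result.append(x * 2)

# After: the equivalent list comprehension produced by the autofix
result = [x * 2 for x in items if x > 0]
```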
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140980
Approved by: https://github.com/justinchuby, https://github.com/malfet
[Intel GPU] Support RegisterXPU.cpp codegen and compile for the in-tree XPU structured GEMM ops.
Motivation: There are two parts to the ATen ops for XPU: in-tree ops like the GEMM-related ops, and out-of-tree ops in torch-xpu-ops. For the in-tree part, since PyTorch uses native_functions.yaml registration and is equipped with convenient codegen capabilities, we want to take advantage of these benefits as well.
At the same time, since AOT Inductor also uses native_functions.yaml to generate C shim wrappers, we need to enable this mechanism for XPU as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139025
Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/desertfire
This PR is a supplement to #130082. The previous PR #130082 fulfilled the basic codegen functionality, but we found that it fails to handle the device-sameness check in many unit tests. The current PR aims to enable XPU device-guard code generation.
With this PR, the code snippet in `RegisterXPU.cpp` is as follows, where we can see the device guard is successfully generated.
```c++
namespace {
at::Tensor & wrapper_XPU_Tensor_float_out_normal_out(const at::Tensor & mean, double std, ::std::optional<at::Generator> generator, at::Tensor & out) {
std::optional<Device> common_device = std::nullopt;
(void)common_device; // Suppress unused variable warning
c10::impl::check_and_update_common_device(common_device, out, "wrapper_XPU_Tensor_float_out_normal_out", "out");
c10::impl::check_and_update_common_device(common_device, mean, "wrapper_XPU_Tensor_float_out_normal_out", "mean");
const OptionalDeviceGuard device_guard(device_of(out));
return at::native::normal_out(mean, std, generator, out);
}
} // anonymous namespace
```
Without the current change, the generated code is:
```c++
namespace {
at::Tensor & wrapper_XPU_Tensor_float_out_normal_out(const at::Tensor & mean, double std, ::std::optional<at::Generator> generator, at::Tensor & out) {
// No device check
// DeviceGuard omitted
return at::native::normal_out(mean, std, generator, out);
}
} // anonymous namespace
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133980
Approved by: https://github.com/EikanWang, https://github.com/malfet
Add a way of generating a FunctionSchema from example values, because a HOP's schema varies even for the same HOP.
We didn't use torch._C.FunctionSchema because we cannot construct those classes directly (e.g. `__init__` cannot be used for torch._C.FunctionSchema), and extending the basic types in C++ seems nontrivial.
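A minimal sketch of the idea using torchgen's string parser; the helper and its type mapping are hypothetical, not the actual API added by this PR:
```python
import torch
from torchgen.model import FunctionSchema

def schema_from_example(op_name: str, args: dict) -> FunctionSchema:
    # Hypothetical helper: derive a schema string from example values,
    # then reuse torchgen's parser. Only two types are mapped for brevity.
    def type_of(v):
        if isinstance(v, torch.Tensor):
            return "Tensor"
        if isinstance(v, int):
            return "int"
        raise NotImplementedError(type(v))

    params = ", ".join(f"{type_of(v)} {k}" for k, v in args.items())
    return FunctionSchema.parse(f"{op_name}({params}) -> Tensor")

print(schema_from_example("my_hop", {"x": torch.randn(2), "n": 3}))
# my_hop(Tensor x, int n) -> Tensor
```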
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133521
Approved by: https://github.com/zou3519
# Motivation
This PR intends to enhance the codegen to allow generating code for the XPU backend.
XPU operators currently have to be registered by hand. Developers cannot take advantage of the shared code that handles tensor meta setting (strides, proxy outputs, structured kernels). Manually porting this code is error-prone and leads to a high maintenance burden.
We utilize the `backend_whitelist` argument of `gen.py` to generate the headers and source files XPU needs.
# Usage
XPU ops live in `third_party/torch-xpu-ops`; the codegen process is triggered before the compilation of `torch-xpu-ops`.
We use the following command to generate XPU operators:
` python -m torchgen.gen --source-path path/to/yaml/of/xpu --install-dir build/xpu --per-operator-headers --static-dispatch-backend --backend-whitelist=XPU`
The key difference is `--backend-whitelist=XPU`; `backend-whitelist` is an existing torchgen argument.
The inputs to `gen.py` are code templates and an operators yaml. We share the same templates as `aten`. A simplified yaml lives in `third_party/torch-xpu-ops`, which includes only the supported XPU operators. This yaml is a copy-and-modify of `native_functions.yaml`: no extra entries are added, and the format is the same as the one in `aten`.
# Result
All operator headers are generated in `build/xpu/ATen/ops` independently, which does not affect operators declared/defined by CPU/CUDA or any other backend. XPU operators only include headers from this folder.
# Verification
* In `third_party/torch-xpu-ops`, we migrate all supported kernels to the structured-kernel style, where they are registered through `REGISTER_XPU_DISPATCH` or `TORCH_IMPL_FUNC`, and we have UT verification based on `test_ops.py`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130082
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/atalman
ghstack dependencies: #130019
In the schema registry, the default value of `dims` for `rot90()` is `[0,1]` because we split the function schema on `", "`; there must be no space after the `,` in `[0,1]`.
5c9d5272e4/aten/src/ATen/native/native_functions.yaml (L6120-L6126)
Then the default value is formatted as `(0,1)` in the `pyi` files. This PR manually adds the missing whitespace when re-rendering the default value to a string:
```python
", ".join(string.split(","))
```
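For example, applied to the stored default:
```python
>>> ", ".join("[0,1]".split(","))
'[0, 1]'
```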
```python
# before
def rot90(input: Tensor, k: _int = 1, dims: _size = (0,1)) -> Tensor: ...
# after
def rot90(input: Tensor, k: _int = 1, dims: _size = (0, 1)) -> Tensor: ...
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129884
Approved by: https://github.com/ezyang
Previously it worked with torchgen.model.FunctionSchema. This PR extends
it to work with torch._C._FunctionSchema by making
torchgen.model.FunctionSchema look more like torch._C._FunctionSchema.
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123108
Approved by: https://github.com/albanD
Introduces a new op `slice_inverse()`. This is used in the reverse view_func for slice and several other ops (e.g. `split_with_sizes`, `chunk`). It's implemented behind the scenes by a call to `as_strided()`, but it's easier for subclasses to implement the more limited `slice_inverse()` than the full `as_strided()`. This PR:
* Introduces the op itself
* Updates all relevant functional inverses to call `slice_inverse()` instead of `as_strided()` directly
* Makes codegen changes to allow `slice_scatter()` to be the copy variant for `slice_inverse()` (see the sketch after this list)
* Avoids view_copy codegen (the codegen assumes that if a view op's name ends in "inverse", we don't need to generate a copy variant, which is possibly a fragile assumption)
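For intuition, a rough sketch of the view/copy-variant relationship using the public `slice_scatter()` op; this is illustrative only, and the argument conventions of `slice_inverse()` itself may differ:
```python
import torch

base = torch.zeros(3, 4)
view = base[:, 1:3]           # a slice view of base
new_view = torch.ones(3, 2)   # a functional "update" of that view

# slice_scatter is the copy variant: it writes new_view back into a copy
# of base at the sliced location, functionally inverting the slice.
result = torch.slice_scatter(base, new_view, dim=1, start=1, end=3)
```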
@albanD / @soulitzer / @bdhirsh: I'm most interested in your thoughts on the codegen changes and whether this is the right way to go.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117041
Approved by: https://github.com/bdhirsh
In this PR, we implement functionalization on the pre-dispatch graph. Today, every dispatch key except DispatchKey.Python has a dedicated mode stack in Python. Pre-dispatch tracing relies on this behaviour by pushing ProxyTorchDispatchMode onto the DispatchKey.PreDispatch mode stack and handling the dispatching logic in Python. To make pre-dispatch functionalization work, we now need to push FunctionalTensorMode onto the DispatchKey.PreDispatch mode stack and make sure it runs before ProxyTorchDispatchMode (this is very similar to how post-dispatch tracing works). Here are some design decisions we made for this flow to work:
1. FunctionalTensorMode internally calls C++ functionalize key. Since C++ functionalization goes after PreDispatch, if we are not careful, we will keep re-entering into PreDispatch key. We solve this by directly dispatching to C++ Functionalize key.
2. We delete the mode_stack_per_key logic, because the only realistic time it is exercised is for PreDispatch, and it is in general not safe to keep a plain list: the ordering of FunctionalTensorMode and ProxyTorchDispatchMode matters and is hard to enforce on a plain list. Instead, we now have a private class that tracks the PreDispatch mode stack.
3. We will still run CompositeImplicitAutograd decomps in this PR, and disable this logic later as a follow-up.
Some missing bits after this PR:
1. Preserving autograd ops in a functional form. Right now they still show up in the graph but in a "non-functional" way.
2. Turn off CompositeImplicitAutograd decomps
3. Functionalizing HOOs (higher-order operators)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113728
Approved by: https://github.com/bdhirsh
torch.library.impl now accepts a device string (e.g. "cpu", "cuda"). It still accepts DispatchKey strings, but we no longer document this, because using arbitrary DispatchKeys is more for power users.
We map the device string to a DispatchKey and then register the impl for
said DispatchKey. A user may also specify multiple device strings at once
or specify "types=default" to get a CompositeExplicitAutograd registration.
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111659
Approved by: https://github.com/soulitzer
ghstack dependencies: #111380
I added some tests for Conj, Neg and ZeroTensor for both Python and C++ functionalization. This also fixes a nasty segfault when running a functorch `jacfwd` test with `torch.compile`, once AOTAutograd is using `FunctionalTensor`.
Changes:
(1) I use Jeffrey's `make_wrapper_subclass(extra_dispatch_keys)` kwarg to plumb extra dispatch keys onto the wrapper, mirroring what C++ functionalization does (C++ functionalization mirrors all dispatch keys from the inner tensor to the wrapper, except for the Python and functorch keys).
(2) FunctionalTensorMode will decompose CompositeImplicitAutograd ops, since (for example) ZeroTensor kernels can send ops like `.to()` directly to the Python key. We'll need a way to toggle this later for pre-dispatch functionalization
(3) Bound `_ForceDispatchKeyGuard` and BatchedTensorImpl's dispatch keyset to python
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109023
Approved by: https://github.com/zou3519
ghstack dependencies: #108654, #109662, #109632
This PR adds a new `FunctionalTensor` subclass, and `FunctionalTensorMode` torch dispatch mode. Together, this class/mode are a lightweight wrapper around our existing C++ functionalization logic.
This idea came from Ed - later in the stack, I want to be able to run functionalization **underneath** torch_dispatch, when performing tracing in AOTAutograd. I can't do this easily with vanilla C++ functionalization, because it has a dedicated dispatch key that always runs before TorchDispatch. However, by adding a torch_dispatch mode shim around functionalization, we can use functionalization as a torch_dispatch mode, which will make it easier to run underneath other modes later.
This PR provides the basic new classes, and some light testing.
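A rough sketch of what using the new wrapper looks like (module paths and entry points as they exist in today's tree; details may have shifted since this PR):
```python
import torch
from torch._subclasses.functional_tensor import (
    FunctionalTensor,
    FunctionalTensorMode,
)

with FunctionalTensorMode():
    x = FunctionalTensor.to_functional(torch.ones(3))
    x.add_(1)  # the mutation is routed through the C++ functionalization kernels
```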
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106404
Approved by: https://github.com/ezyang
PoC demonstrating vmap + NT based on the [design doc](https://docs.google.com/document/d/1dVVk6TOqz93PLTIneU2T3xaxCs9qZ0MaJyCvOAp_bC0). This PR:
* Allows `BatchedTensorImpl`s to contain NTs
* Introduces a `BatchedNestedTensor` dispatch key for NT-specific batching rules
* Provides a batching rule fallback that unbinds the NTs -> performs computation on constituent -> rebinds results into NT
Restrictions:
* Only supports one level of vmap
* Only supports vmapping over dim=0 for NTs
* For operations with mixed NT / dense inputs, support is also limited to dim=0 for the dense inputs
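An illustrative use within those restrictions (a single vmap level over dim=0 of an NT; whether a given op hits a batching rule or the unbind fallback depends on coverage):
```python
import torch

nt = torch.nested.nested_tensor([torch.randn(2), torch.randn(3)])
# Fallback path: unbind the NT -> apply the op per constituent -> rebind into an NT
out = torch.vmap(torch.sin)(nt)
```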
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106786
Approved by: https://github.com/zou3519
**Update:** I made a refactor of the original PR. See the original description below, but here I'll describe the updates:
(1) TLS changes in `TorchDispatchModeTLS.h/cpp`.
I added a `TorchDispatchModeKey` enum, that (for now) just contains PROXY and FAKE. The ModeTLS used to just contain a `std::vector<std::shared_ptr<c10::SafePyObject>>` corresponding to the mode stack. It now **also** contains a separate array of "infra modes", indexed by mode key (PROXY and FAKE, with a new addition, FUNCTIONAL, coming later in the stack).
`TorchDispatchModeTLS::push_onto_stack` and `TorchDispatchModeTLS::pop_stack` are now a bit more complicated. Pushing accepts an optional mode_key which, if set, tells us to add the given mode directly to our "infra_modes" array. Popping will first check the "user mode" stack before trying to pop anything from the infra mode stack. It also optionally returns the mode key of the mode we popped, if there was one - that way, if we push that same mode back onto the TLS later, we know where it goes.
`TorchDispatchModeTLS::dispatch_mode_enabled()` now accepts an optional `skip_infra_modes` param, so you can separately query if there are "any modes at all", or if there are "any user modes".
`TorchDispatchModeTLS::get/set/unset_mode()` all take in a mode key, and get/set/unset the mode at that particular mode key (meaning they are only meant to be used for infra modes).
There were also some mild codegen changes to support the new enum.
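On the Python side, the mode-key-indexed accessors surface roughly like this (a sketch against today's private bindings; names are internal and may change):
```python
import torch
from torch._subclasses.fake_tensor import FakeTensorMode

with FakeTensorMode():
    # Fetch the infra mode by its mode key rather than by stack position.
    fake = torch._C._get_dispatch_mode(torch._C._TorchDispatchModeKey.FAKE)
    assert isinstance(fake, FakeTensorMode)
```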
(2) `fake_tensor.py/proxy_tensor.py/_python_dispatch.py`
The way I tell the infra that certain subclasses/modes are "infra" is through the enum: I gave `FakeTensor` and `FakeTensorMode` a `self._mode_key = torch._C.TorchDispatchModeKey.FAKE`. `TorchDispatchMode.__enter/exit__()` (in `_python_dispatch.py`) now checks if the current mode has a mode key, and if so plumbs it into any `push_onto_stack()` calls (which eventually instruct `TorchDispatchModeTLS` where to put the mode). Same thing for `ProxyTorchDispatchMode`.
I also had to change both of these modes' enter/exit to handle the fact that there can no longer be multiple proxy/fake modes on the mode stack at once. I updated them both to have a `self.enter_stack: List[Optional[TorchDispatchMode]]` - whenever we push a given mode in `__enter__`, we remove the current ambient fake/proxy mode from the mode stack and save it in `enter_stack`, so that on exit we can reset the state properly.
(3) dispatching logic in `python_arg_parser.cpp`
This is where the core dispatching logic changes are. I added two helpers, `dispatch_on_subclass()` and `dispatch_on_mode()`. The overall dispatching order is now:
```
(a) dispatch_on_mode() # try user modes first (where the mode stack automatically considers infra modes last)
(b) dispatch_on_subclass() # try user subclasses next (skipping infra subclasses)
(c) dispatch_on_subclass() # try infra subclasses next (skipping user subclasses)
```
Note that we still want "user subclasses" to run before "infra modes". As Ed helped me realize, this will work today: if the proxy/fake modes run in step (a) and see a user subclass, they'll return NotImplemented, allowing us to redispatch to the user subclass.
How do (b) and (c) distinguish between user and infra subclasses? Infra subclasses (FakeTensor, and later FunctionalTensor) are required to have a `_mode_key` hidden on the subclass - so we filter via arguments that do/don't have the _mode_key.
(4) I also changed `DoubleTensor` to `TwoTensor` to minimize confusion (@albanD pointed out that DoubleTensor would be easily confused with `torch.FloatTensor` and friends).
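(`TwoTensor` ended up in PyTorch's testing utilities; a quick usage sketch, assuming its current location:)
```python
import torch
from torch.testing._internal.two_tensor import TwoTensor

a, b = torch.randn(3), torch.randn(3)
t = TwoTensor(a, b)  # a wrapper subclass carrying two tensors
out = t + 1          # ops run on both components via __torch_dispatch__
```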
----- original description below -----
The main purpose of this PR is to fix the "ordering problem" between torch_dispatch modes, where we want to ensure that our Fake and Proxy dispatch modes always run **after** any dispatch modes created by the user, regardless of where they are in the stack. See this doc for more details: https://docs.google.com/document/d/1COQ291nOZvtFnzGTQMJqoYZ3sttEYFw_7HbfSyL8gcA/edit
Full set of changes below. I ended up including a few semi-related changes in this PR that I documented - but if folks would rather I separate them out, happy to try to do that.
**(1) Add dedicated TLS slots for FakeTensorMode and ProxyTensorMode**
This is the main component of this PR. There are two new slots, `TorchDispatchModeTLS.fake_mode_` and `TorchDispatchModeTLS.proxy_mode_`, which correspond to a single "global" fake and proxy mode. There is now an invariant that `torchDispatchModeState.stack_` can never contain either of these modes.
I also added a `TorchDispatchModeTLS::maybe_highest_mode()` helper that consults the `stack_` as well as both the proxy and fake slots, and returns the highest priority mode - this is because there are a few places in the codebase where we legitimately want to get the highest priority mode, *including* fake or proxy, if one is set.
This also made the implementations of the existing `disable_proxy_modes_tracing()` and `get_innermost_proxy_mode()` marginally simpler.
**(2) Updated the dispatching logic in handle_torch_function_no_python_arg_parser()**
This is the function that actually figures out which torch_dispatch implementation to call, given the current mode stack and tensor subclass inputs. This function got marginally more complicated as part of the refactor: First we inspect the mode stack and any non-fake subclass inputs. Then we check for the proxy mode slot. Then we check for the Fake mode slot, before finally checking for any fake subclass inputs.
**(3) new python `_get_fake_tensor_mode()` and `_get_proxy_tensor_mode()` APIs**
Before, if you wanted to see if proxy or fake modes were active in python, you would have to consult the mode stack. Since these two modes are no longer part of the actual mode stack, I added two new APIs to directly check if either proxy or fake modes are active.
**(4) Allow traceable tensor subclasses to access storages from python**
This is convenient later in the stack, where AOTAutograd needs to detect aliasing of inputs and outputs, where those inputs and outputs might be tensor subclasses. Previously, `x.untyped_storage()` would raise an error if `x` was a subclass. In this PR, I tried to relax this constraint as little as possible: `THPVariable_storage()` will only try to return a storage to Python if the tensor subclass being passed in is "traceable".
**(5) Fixed subclass fakeification**
@wanchaol recently added support to be able to fakeify tensor subclasses. That fakeification logic works in most cases, but there is one case it doesn't handle: autograd metadata. In particular, since autograd sees our tensor subclasses and not their desugared tensors, we need to make sure that our fakeified subclass has the same autograd metadata as the original subclass. I updated `meta_utils.py` to make sure that the autograd metadata is correct.
**(6) make tensor subclasses resizeable**
Previously we didn't allow tensor subclasses to be resizeable. I ran into an issue where fakeifying a tensor subclass occasionally requires swapping out its storage, which can involve resizing the tensor. Mechanically, this required updating `at::for_blob()` to expose a way to request that the tensor that you create has resizeable storage, and then using this new API in `_make_wrapper_tensor()`.
**(7) Added a basic DoubleTensor subclass for testing**
I use this subclass more later in this stack in my AOTAutograd tests - but it serves as a simple subclass example to test the dispatch ordering in this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104482
Approved by: https://github.com/ezyang
ghstack dependencies: #107415