Summary:
# Context
See the first PR https://github.com/pytorch/pytorch/pull/153670
# This PR
1. Migrate 3 clamp ops from out-of-tree to in-tree(had to migrate the 3 ops altogether, because clamp.out calls all 3 stubs, which are also called by the other 2 ops):
- clamp.out
- clamp_min.out
- clamp_max.out
2. Also enabled structured kernel codegen for MTIA, which is needed by clamp
3. Also introduced the `--mtia` flag to torchgen to prevent OSS from gencoding MTIA code.(Otherwise we got such link error `lib/libtorch_cpu.so: undefined reference to at::detail::empty_mtia`)
Differential Revision: D74674418
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154015
Approved by: https://github.com/albanD, https://github.com/nautsimon
Summary:
# Context
The MTIA New Aten Backend work is essentially to move MTIA operators from pytorch out-of-tree to in-tree, with following benefits:
1. Avoid duplicate code copied from pytorch, e.g. view ops implementation, util functions.
2. Utilize TensorIterator and structured kernel codegen, avoid manual implementation of broadcasting, dtype casting, asserting, etc.
3. Eliminate MTIA's own codegen flow, which is unnecessary complexity.
4. Overall make MTIA's aten backend more pytorch native.
Differential Revision: D74672464
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153670
Approved by: https://github.com/albanD, https://github.com/nautsimon
This change enables basic NestedTensor operations on HPU,
fixing the runtime error when creating a NestedTensor on HPU.
- Extended `NestedTensorImpl` to recognize `hpu` as a valid storage device.
- Added `NestedTensorHPU` to `DispatchKey` parsing in `DispatchKey.cpp`.
- Updated `torchgen/model.py` to include `NestedTensorHPU` in `dispatch_keys`.
- Modified `native_functions.yaml` to enable `NestedTensorHPU` support for various ops.
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148659
Approved by: https://github.com/jeromean, https://github.com/albanD, https://github.com/sujoysaraswati
Fixes https://github.com/pytorch/executorch/issues/8711
In ExecuTorch when we try to parse the following schema:
```
aten::__lshift__.Scalar(Tensor self, Scalar other) -> Tensor
```
Repro:
```python
from torchgen.model import FunctionSchema
native_schema = FunctionSchema.parse("aten::__lshift__.Scalar(Tensor self, Scalar other) -> Tensor")
```
It's failing because `BaseOperatorName` categorizes it to be a
inplace operator.
I understand we are not supposed to pass in namespace "aten::" into
`FunctionSchema.parse()` but unfortunately ExecuTorch requires this
feature to work.
This PR adds a new `namespace` attribute to `BaseOperatorName` and makes
sure the rest of the stack works as before, if a schema without
namespace is passed in
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148038
Approved by: https://github.com/bdhirsh
Reference: https://docs.astral.sh/ruff/formatter/black/#assert-statements
> Unlike Black, Ruff prefers breaking the message over breaking the assertion, similar to how both Ruff and Black prefer breaking the assignment value over breaking the assignment target:
>
> ```python
> # Input
> assert (
> len(policy_types) >= priority + num_duplicates
> ), f"This tests needs at least {priority+num_duplicates} many types."
>
>
> # Black
> assert (
> len(policy_types) >= priority + num_duplicates
> ), f"This tests needs at least {priority+num_duplicates} many types."
>
> # Ruff
> assert len(policy_types) >= priority + num_duplicates, (
> f"This tests needs at least {priority + num_duplicates} many types."
> )
> ```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144546
Approved by: https://github.com/malfet
Summary:
Continuing the work from https://github.com/pytorch/pytorch/pull/146427
Adds the `torch.float8_e8m0fnu` dtype to PyTorch, as detailed in
https://github.com/pytorch/pytorch/issues/146414 . Please see the issue for a detailed definition of the format. Example of basic functionality:
```python
import torch
# round trip
x0 = torch.randn(4, 4, dtype=torch.float32)
x1 = x0.to(torch.float8_e8m0fnu) # RNE rounding
x2 = x1.to(torch.float32) # 2 ** exponent
# creation with empty
x0 = torch.empty(4, 4, dtype=torch.float8_e8m0fnu)
# printing
print(x0)
```
Done in this PR:
* numerical correctness
* op coverage (except for `torch._scaled_mm`): create tensor, cast to/from float32
* printing a tensor works
For future PRs:
* performance optimizations for casting
* torch._scaled_mm
* PT2
* various cleanups (detailed in comments with issue numbers)
Test Plan:
```
pytest test/quantization/core/experimental/test_float8.py -s
```
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147466
Approved by: https://github.com/drisspg
* Automatically applies ruff rule 401. Turns loops into equivalent list comprehensions which are faster and do not leak the scope of the loop variables.
* list comprehensions not only often have better typing, but are 50+% faster than for loops on overhead. They also preserve length information etc and are better for the interpreter to optimize.
* Manually went back and made mypy happy after the change.
* Also fixed style lints in files covered by flake8 but not by pyfmt
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140980
Approved by: https://github.com/justinchuby, https://github.com/malfet
[Intel GPU] Support RegisterXPU.cpp codegen and compile for the in-tree XPU structured GEMM ops.
Motivation: There are two parts of aten ops for XPU, one is in-tree ops like GEMM related OPs and the other is out-off-tree ops in torch-xpu-ops. For the in-tree part,since Pytorch uses native_functions.yaml registration and is equipped with convenient codegen capabilities, we want to take advantage of these benefits as well.
At the same time, since AOT Inductor also uses native_functions.yaml to generate c shim wrappers, we also need to enable this mechanism for XPU.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139025
Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/desertfire
This PR is a supplement to #130082. The previous PR #130082 fulfill the basic functionality of codegen, while we found it fails to handle the device sameness check in lots of uts. Current PR is aimed to facilitate the XPU device guard code generation.
With current PR, the code snippet in `RegisterXPU.cpp` is as follows, where we can see the device guard is successfully generated.
```c++
namespace {
at::Tensor & wrapper_XPU_Tensor_float_out_normal_out(const at::Tensor & mean, double std, ::std::optional<at::Generator> generator, at::Tensor & out) {
std::optional<Device> common_device = std::nullopt;
(void)common_device; // Suppress unused variable warning
c10::impl::check_and_update_common_device(common_device, out, "wrapper_XPU_Tensor_float_out_normal_out", "out");
c10::impl::check_and_update_common_device(common_device, mean, "wrapper_XPU_Tensor_float_out_normal_out", "mean");
const OptionalDeviceGuard device_guard(device_of(out));
return at::native::normal_out(mean, std, generator, out);
}
} // anonymous namespace
```
Nevertheless, without current change, the generated code is
```c++
namespace {
at::Tensor & wrapper_XPU_Tensor_float_out_normal_out(const at::Tensor & mean, double std, ::std::optional<at::Generator> generator, at::Tensor & out) {
// No device check
// DeviceGuard omitted
return at::native::normal_out(mean, std, generator, out);
}
} // anonymous namespace
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133980
Approved by: https://github.com/EikanWang, https://github.com/malfet
Add a way of generating a FunctionSchema from example values because hop's schema varies even for the same hop.
We didn't use torch._C.FunctionSchema because we cannot construct the classes directly (e.g. "__init__" cannot be used for torch._C.FunctionSchema). Also extending the Basic types in c++ seems not that easy.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133521
Approved by: https://github.com/zou3519
# Motivation
This PR intends to enhance the codegen to allow generate codes for XPU backend.
XPU operators need be registered in an hand-written way currently. Developers have no chance to take the advantage of shared code to handle tensor meta setting (like strides, proxy output, structured kernels). Manually porting code is erro-prone and may lead to high maintaining efforts.
We utilize the backend_whitelist argument in `gen.py` to generate XPU needed headers and source codes.
# Usage
XPU ops lie in `third_pary/torch-xpu-ops`, the codegen process is triggered before the complation of `torch-xpu-ops`
We use the following commands to generate XPU operators
` python -m torchgen.gen --source-path path/to/yaml/of/xpu --install-dir build/xpu --per-operator-headers --static-dispatch-backend --backend-whitelist=XPU`
The diff lies at `backend-whitelist=XPU`. The backend-whitelist key is an existent argument in torchgen.
The input of `gen.py` are code templates and operators yaml. We share the same templates in `aten`. A simplified yaml lies in `third_party/torch-xpu-ops`, which only includes the supported xpu operators. This yaml is a copy-and-modify of `native_functions.yaml`. No extra entry is added, the format is same as the one in `aten`
# Result
All operators headers are generated in `build/xpu/ATen/ops` independently, which would not affect operators declared/defined by CPU/CUDA or any other backend. XPU operators only include headers in this folder.
# Verification
* In `third-party/torch-xpu-ops`, we migrate all supported kernels to structured kernels style, where they are registered through `REGISTER_XPU_DISPATCH` or `TORCH_IMPL_FUNC`, and we have UT verification based on `test_ops.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130082
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/atalman
ghstack dependencies: #130019
The default value of `rot90()` in the schema registry is `[0,1]` because we split the function schema by `", "`. There should be no space after `,` in `[0,1]`.
5c9d5272e4/aten/src/ATen/native/native_functions.yaml (L6120-L6126)
Then the the default value is formatted to `(0,1)` in `pyi` files. This PR manually adds an extra whitespace when rerendering the default value to a string.
```python
", ".join(string.split(","))
```
```python
# before
def rot90(input: Tensor, k: _int = 1, dims: _size = (0,1)) -> Tensor: ...
# after
def rot90(input: Tensor, k: _int = 1, dims: _size = (0, 1)) -> Tensor: ...
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129884
Approved by: https://github.com/ezyang
Previously it worked with torchgen.model.FunctionSchema. This PR extends
it to work with torch._C._FunctionSchema by making
torchgen.model.FunctionSchema look more like torch._C._FunctionSchema.
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123108
Approved by: https://github.com/albanD
Introduces a new op `slice_inverse()`. This is used in the reverse view_func for slice and several other ops (e.g. `split_with_sizes`, `chunk`). It's implemented behind the scenes by a call to `as_strided()`, but it's easier for subclasses to implement the more limited `slice_inverse()` than the full `as_strided()`. This PR:
* Introduces the op itself
* Updates all relevant functional inverses to call `slice_inverse()` instead of `as_strided()` directly
* Makes codegen changes to allow `slice_scatter()` to be the copy variant for `slice_inverse()`
* Need to avoid view_copy codegen (assumes if view name ends in inverse, we don't need to gen one, which is possibly a bad assumption)
@albanD / @soulitzer / @bdhirsh: I'm most interested in your thoughts on the codegen changes and whether this is the right way to go.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117041
Approved by: https://github.com/bdhirsh
In this PR, we are implementing Functionalization on pre-dispatch graph. Today, every dispatch key except for Dispatchkey.Python has a dedicated mode stack in python. PreDispatch tracing relies on this behaviour by pushing ProxyTorchDispatchMode to Dispatchkey.PreDispatch mode stack and handle the dispatching logic in python. To make pre-dispatch functionalization work, we now need to push FunctionalTensorMode on DispatchKey.PreDispatch mode stack and make sure it runs before ProxyTorchDispatchMode. (this is very similar to how post-dispatch tracing work). Here are some design decisions we made for this flow to work:
1. FunctionalTensorMode internally calls C++ functionalize key. Since C++ functionalization goes after PreDispatch, if we are not careful, we will keep re-entering into PreDispatch key. We solve this by directly dispatching to C++ Functionalize key.
2. We delete mode_stack_per_key logic because the only realistic time it is exercised is for PreDispatch and it is in general not safe to have a plain list because FunctionalTensorMode and ProxyTorchDispatchMode ordering matter and it is hard to enforce it on plain list. Instead, now we have a private class that tracks PreDispatch mode stack.
3. We will still run CompositeImplicitAutograd decomps in this PR, and disable this logic later as a followup.
Some missing bits after this PR:
1. Preserving autograd ops in a functional form. Right now they still show up in the graph but in a "non-functional" way.
2. Turn off CompositeImplicitAutograd decomps
3. Functionalizing HOO
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113728
Approved by: https://github.com/bdhirsh
torch.library.impl now accepts a device string (e.g. "cpu", "cuda"). It
still accepts DispatchKey strings, but we no longer document this, because
using arbitrary DispatchKeys is more for the power users.
We map the device string to a DispatchKey and then register the impl for
said DispatchKey. A user may also specify multiple device strings at once
or specify "types=default" to get a CompositeExplicitAutograd registration.
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111659
Approved by: https://github.com/soulitzer
ghstack dependencies: #111380
I added some tests for Conj, Neg and ZeroTensor for both python and C++ functionalization. This also fixes a nasty segfult when running a functorch `jacfwd` test with `torch.compile`, once AOTAutograd is using `FunctionalTensor`.
Changes:
(1) I use Jeffrey's `make_wrapper_subclass(extra_dispatch_keys)` kwarg to plumb extra dispatch keys ontoto the wrapper, mirroring what C++ functionalization does (C++ functionalization will mirror all dispatch keys from the inner tensor to the wrapper, except for python and functorch keys).
(2) FunctionalTensorMode will decompose CompositeImplicitAutograd ops, since (for example) ZeroTensor kernels can send ops like `.to()` directly to the Python key. We'll need a way to toggle this later for pre-dispatch functionalization
(3) Bound `_ForceDispatchKeyGuard` and BatchedTensorImpl's dispatch keyset to python
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109023
Approved by: https://github.com/zou3519
ghstack dependencies: #108654, #109662, #109632
This PR adds a new `FunctionalTensor` subclass, and `FunctionalTensorMode` torch dispatch mode. Together, this class/mode are a lightweight wrapper around our existing C++ functionalization logic.
This idea came from Ed - later in the stack, I want to be able to run functionalization **underneath** torch_dispatch, when performing tracing in AOTAutograd. I can't do this easily with vanilla C++ functionalization, because it has a dedicated dispatch key that always runs before TorchDispatch. However, by adding a torch_dispatch mode shim around functionalization, we can use functionalization as a torch_dispatch mode, which will make it easier to run underneath other modes later.
This PR provides the basic new classes, and some light testing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106404
Approved by: https://github.com/ezyang