This PR:
- moves some of the dtype-string utilities into ScalarType.{h, cpp}
- adds a new utility to get a mapping from dtype name to the C++ dtype
- the perser now checks if the string is a dtype name; if it is then it
pulls the c++ dtype from the mapping.
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129189
Approved by: https://github.com/albanD
ghstack dependencies: #129177, #129178, #129179
This PR renames the implementation details of register_fake to align
more with the new name. It is in its own PR because this is risky
(torch.package sometimes depends on private library functions and
implementation details).
Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123938
Approved by: https://github.com/williamwen42
This matches our autograd logic for pytorch native operators. There's no
need to invoke an autograd.Function if we're under a torch.no_grad() or
if none of the inputs have requires_grad=True (invoking an
autograd.Function results in (noticeable) overhead).
Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127976
Approved by: https://github.com/williamwen42
Fixes https://github.com/pytorch/pytorch/issues/128544
Fixes https://github.com/pytorch/pytorch/issues/128535
We had a problem with multithreading where the nonlocals were being
clobbered. In the first place, we stored these nonlocals because we
wanted to ferry information from an autograd.Function.apply to
autograd.Function.forward.
Our new approach is:
- pass the information directly as an input to the
autograd.Function.apply. This means that the autograd.Function.forward
will receive the information too.
- this messes up ctx.needs_input_grad, which has an element per input to
forward. The user should not see the additional information we passed.
We fix this by temporarily overriding ctx.needs_input_grad to the
right thing.
- this exposed a bug in that ctx.needs_input_grad wasn't correct for
TensorList inputs. This PR fixes that too.
Test Plan:
- existing and new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128547
Approved by: https://github.com/williamwen42, https://github.com/soulitzer
If a user accesses an OpOverloadPacket, then creates a new OpOverload,
then uses the OpOverloadPacket, the new OpOverload never gets hit. This
is because OpOverloadPacket caches OpOverloads when it is constructed.
This PR fixes the problem by "refreshing" the OpOverloadPacket if a new
OpOverload gets constructed and the OpOverloadPacket exists.
Test Plan:
- new tests
This is the third land attempt. The first one was reverted for breaking
internal tests, the second was reverted for being erroneously suspected
of causing a perf regression.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128000
Approved by: https://github.com/albanD
This PR excises opcheck's dependency on
torch.testing._internal.common_utils, (which comes with dependencies on
expecttest and hypothesis). We do this by moving what we need to
torch.testing._utils and adding a test for it.
Fixes#126870, #126871
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127292
Approved by: https://github.com/williamwen42
ghstack dependencies: #127291
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo make `usort` do more and generate the changes in the PR. Except `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127126
Approved by: https://github.com/kit1980
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo make `usort` do more and generate the changes in the PR. Except `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127126
Approved by: https://github.com/kit1980
ghstack dependencies: #127122, #127123, #127124, #127125
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo make `usort` do more and generate the changes in the PR. Except `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127125
Approved by: https://github.com/Skylion007
ghstack dependencies: #127122, #127123, #127124
Summary:
co-dev reland of https://github.com/pytorch/pytorch/pull/124520, which requires
the removal of some executorch tests.
Before this PR, we didn't check that types in a schema were valid. This
is because TorchScript treats unknown types as type variables.
This PR checks types in a schema for the TORCH_LIBRARY APIs. To do this,
we add an `allow_typevars` flag to parseSchema so that TorchScript can
use allow_typevars=True. We also add some error messages for common
mistakes (e.g. using int64_t or double in schema).
Test Plan: Wait for tests
Differential Revision: D57666659
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126861
Approved by: https://github.com/albanD
If a user accesses an OpOverloadPacket, then creates a new OpOverload,
then uses the OpOverloadPacket, the new OpOverload never gets hit. This
is because OpOverloadPacket caches OpOverloads when it is constructed.
This PR fixes the problem by "refreshing" the OpOverloadPacket if a new
OpOverload gets constructed and the OpOverloadPacket exists.
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126863
Approved by: https://github.com/albanD
More details further down, but first a more high-level description of "how do we functionalize storage resizing"
Today, dynamo converts `param.untyped_storage().resize_(x)` calls that it sees from fsdp into a custom op, `ops.inductor.resize_storage_bytes_(x)`
So given this setup, there are 3 main cases that I think we want to handle:
(1) graph input starts with a real storage size, gets resized down to zero in the graph
(2) graph input starts with 0 storage size, gets resized up in the graph
(3) graph input starts with 0 storage size, gets resized up and used in some compute, then resized back down to 0
For case (1) we need to emit a `resize_storage_bytes_` at the end of the graph, similar to how we emit `copy_()` for data mutations.
For case (2), we need to emit a `resize_storage_bytes_` in the graph, and we **also** need to emit a `copy_()` (the input had its storage resized up, and filled in with data, which is we need to reflect as an input mutation)
For case (3), the net effect is that the input had no data on entry and exit of the function, so we don't need to emit any mutable ops in the end of the graph.
The main thing to call out is that: we need to write a functionalization rule for `resize_storage_byte_`, (`FunctionalTensorWrapper::storage_resize_()`) and this rule actually does very little. We would like to **not** emit any new ops in the graph (like say, a functional resize op). Instead, we should expect / rely on the fact that any resize up will be immediately followed by a `copy_()`/`foreach_copy_`/`out=` op, that will fill in the data of the tensor. So `FunctionalTensor` can temporarily live in a state where its data is invalid, until the `x.copy_(y)` "updates" its data with the new tensor.
So effectively, all that this rule does is:
(1) it stores metadata on the storage, indicating that the tensor was resized, as well as the updated storage size. We need this info in AOTAutograd, so it knows whether to emit a mutable resize_() op in the graph epilogue
(2) There is also a corner case: if we are resizing down to zero, but our tensor had **previously** had a zero size storage, then we update `value_` to point to the original value of the tensor. The reason this seems safe is because if we have a zero storage sized tensor `x`, and we resize it up, use it in some compute, resize it back down to zero, and use it somewhere, we would want the functional version of this code to use the original `x` after the second resize. For FSDP, this is important because we end up saving parameters (graph inputs) for backward, and we want to make sure that the thing we save (and the output to the forward graph) is the original, zero-storage-sized parameter, and not the "version 2" of the parameter after the first resize_()
I think a good order to look at changes in this PR would be:
(1) `test_aotdispatch.py` shows the 3 main cases I focused on as well as the expected functionalized graphs
(2) In `FunctionalStorageImpl.h/cpp`, I had to add a notion of "original base", and "original/curr_size". The first is so I can re-use the zero-size tensor after multiple resizes, and the second is so I can tell in AOTAutograd whether any resizes canceled each other out into a no-op
(3) FunctionalTensorWrapper.h/cpp has the new resize functionalizion rule + some extra utils
(4) `_functorch/_autograd`: the main changes in this folder were around adding the logic at trace-time to detect when we need to put a resize_() in the graph. I also have some assertions to check that any inputs that experience storage resizing will **always be in the graph** and not the opaque epilogue, and I also limited the resize_() mutation case so that you can only ever start with zero storage, or end with zero storage (you can't do e.g. `torch.ones(2).storage().resize_(3)`), and banned it on tensor subclasses
(5) `fake_tensor.py`/`meta_utils.py`: we now need to be able to fakeify tensors with zero storage, so I added a quick version of it in meta_utils.py. This also.. has ramifications for fake tensor caching that I need to fix (include the storage size on the cache key, maybe?)
------------------
This PR subsumes https://github.com/pytorch/pytorch/pull/120971.
This PR is enough to **almost** get a simple ppFSDP forward pass tracing with a functionalized resize_() properly. It also attempts to do the updated version from @jansel, where we don't have any notion of `resize_()` in the graph at all, post functionalization. It would probably be good to test it with @yf225 's FSDP changes, and see how many of the FX passes it allows us to remove. I think that in theory, it should allow us to remove all FX passes that affect the forward graph / partitioner, **except** the one that forces views to be recomputed in the backward (more details below).
There are a few things worth calling out:
(1) failed attempt at functionalizing `aten.copy_()`. I originally wanted to get a version takes these operations:
```
param.storage().resize_(all_gather_size)
param.copy_(all_gather_buffer)
out = aten.matmul(param, param)
```
and functionalizes them into:
```
out = aten.matmul(all_gather_buffer, all_gather_buffer)
```
This would involve getting functionalization to turn `x.copy_(y)` into a giant no-op that just returns `y`. Unfortunately, we can't actually do this in a reasonable way within functionalization (instead, there's a functional `aten.copy` in the graph - see the test case graph expecttest for details). Why? In order for that transformation to be safe, `x` and `y` need to have the same metadata. However, it's possible for `x` and `y` to be subclasses of different types. This is not something we can easily tell from within functionalization, and would be a layering violation. So for now I'm leaving it to downstream code to optimize away the `aten.copy` (this is already the case today, so I think inductor can handle this)
(2) The forward doesn't **actually** run successfully in this PR (see the `assertRaisesRegex` in the test). Why?
The final forward graph looks like this:
```
def forward(self, primals_1, primals_2):
_foreach_copy = torch.ops.aten._foreach_copy.default([primals_1], [primals_2]); primals_2 = None
getitem = _foreach_copy[0]; _foreach_copy = None
mm = torch.ops.aten.mm.default(getitem, getitem); getitem = None
t_1 = torch.ops.aten.t.default(primals_1); primals_1 = None
return [mm, t_1]
```
Where `primals_1` starts out as a secretly-zero-storage-size parameter, and gets resized up and back down within the forward (these are functionalized away).
Importantly, the matmul happy on the result of the `foreach_copy`, **but** the activation that we save for backward (`t_1`) is the result of transposing the **original parameter** (the zero-storage-size param). This is exactly the optimization in fsdp that allows us to have good peak memory usage.
The problem is that the min-cut partitioner decides to save `t_1` for backward. Running this code in eager breaks, because the kernel for `aten.permute(x)` is not happy when `x` has secretly-zero-sized-storage.
The real problem here is that in eager mode the `permute` kernel runs during the backward, after backward hooks have properly resized the saved activation. Here, we are running the transpose in the forward.
One option would be to turn off the checks in our view kernels and allow them to work on zero-storage-sized tensors, which feels pretty bad. Another option is to tweak the partitioner (or use one of Will's FX passes) to force the partitioner to not save views for backward, and allow the views to be recomputed in the backward. This seems kind of silly, but is also probably harmless.
(3) The backward is still broken. To be fair, this issue is pretty separable from "functionalizing storage resize calls", and can be fixed later (either by a real fix to our tracing infra, or via another hacky FX pass). More description of this problem is described at issue (8) of my PR description in https://github.com/pytorch/pytorch/pull/120971
(4) I only added support for "full graph" resizing: basically, the limited case where a param starts with zero storage size, and gets resized up and back down. I think we can add support for the graph break case, but I think we can keep that add-on separate from this PR unless we need it immediately. I also added asserts so we should fail loudly when we hit this case
(5) I have a change to FakeTensor creation when inputs have zero storage size that.. is probably ok. But I also removed FakeTensor caching on view ops, which I probably need to fix before I can land this PR
(6) I added a notion of "original_base" to `FunctionalStorageImpl`. More details are in the comments, but my rational for this was that we basically need it to ensure that autograd saves the **original**, zero-storage-sized param for backward, after resizing up and back down
(7) I had to update our eager kernels for `aten.copy` and `aten._foreach_copy`, to handle the case where the `self` argument has secretly-zero-storage. Inductor can probably generate correct code for this case, but we need these ops to work properly in this situation for the `aot_eager` backend to do the right thing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122434
Approved by: https://github.com/jansel
torch.library.register_fake reports the python module the fake impl is
located in. This is used to check against
`m.set_python_module("foo.bar")` calls in C++.
The module reporting logic was wrong in most cases. This PR fixes it.
Test Plan:
- exhaustive tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125037
Approved by: https://github.com/williamwen42
This is to mirror autograd.Function's setup_context behavior.
The PyTorch Dispatcher removes default values for "FC/BC reasons", but I
convinced myself there's no FC/BC problem for the setup_context API.
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124852
Approved by: https://github.com/albanD
ghstack dependencies: #124637, #124805, #124806
This is a partial revert of https://github.com/pytorch/pytorch/pull/124059
Like in #124297, profiling has revealed that testing equality on *every* output is kind of expensive. So we only test equality when we know there is an unbacked binding. This is the same playbook as the previous PR, just on FakeTensorProp instead of PropagateUnbackedSymInts. Note that we also need to populate `unbacked_bindings` in proxy_tensor.py, since we're generating an entirely new graph in that case.
We now have enough propagation that we're able to trigger a bug related to divisibility replacement. In https://github.com/pytorch/pytorch/pull/113165 we allowed to replace `u0` with `u1 * c` for some constant c, when we have determined that u0 is divisible by c. However, where does the binding for u1 come from? What we will have in practice is that there is some node that is supposed to have bound u1, but which actually is getting a `u1 * c` in its output. So, to get u1, we must divide out c. Fortunately, under the divisibility condition, this is always possible (but remember, we must test divisibility at runtime!)
Because we have tightened up asserts, it is now an error to allocate unbacked SymInts and then fail to track them under unbacked_bindings. In torch/_dynamo/eval_frame.py and torch/_functorch/_aot_autograd/collect_metadata_analysis.py there are examples of benign cases where we repropagated fake tensors but then immediately threw away the results. In these cases, it's not appropriate to rebind, since we're still using the old FX graph that has all of the old symbols. So we just manually clear it. It is possible that other cases will need to be updated, so this PR is "risky" from the perspective of hitting fbcode.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124310
Approved by: https://github.com/lezcano
Before this PR, we didn't check that types in a schema were valid. This
is because TorchScript treats unknown types as type variables.
This PR checks types in a schema for the TORCH_LIBRARY APIs. To do this,
we add an `allow_typevars` flag to parseSchema so that TorchScript can
use allow_typevars=True. We also add some error messages for common
mistakes (e.g. using int64_t or double in schema).
Test Plan:
- new tests
Differential Revision: [D56432690](https://our.internmc.facebook.com/intern/diff/D56432690)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124520
Approved by: https://github.com/albanD
The user does not need to return gradients for these args.
We also change how setup_context works to adapt to kwargonly-args. If
the user's op has no kwonly-args, then their setup_context function must
look like `setup_context(ctx, inputs, output)`: we require that the
arguments have the same names.
If the user's op has kwonly-args, then their setup_context function must
look like `setup_context(ctx, inputs, keyword_only_inputs, output)`.
We require that the arguments have the same names.
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124806
Approved by: https://github.com/albanD, https://github.com/williamwen42
ghstack dependencies: #124637, #124805
If a user accesses an OpOverloadPacket, then creates a new OpOverload,
then uses the OpOverloadPacket, the new OpOverload never gets hit. This
is because OpOverloadPacket caches OpOverloads when it is constructed.
This PR fixes the problem by "refreshing" the OpOverloadPacket if a new
OpOverload gets constructed and the OpOverloadPacket exists.
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124654
Approved by: https://github.com/albanD
This PR:
- exposes torch.testing._internal.optests.opcheck as
torch.library.opcheck
- Adds support for CustomOpDef (aka functions decorated with
torch.library.custom_op) to opcheck.
Test Plan:
- Updated tests
- We validated opcheck's design internally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124496
Approved by: https://github.com/williamwen42
Before this PR, we didn't check that types in a schema were valid. This
is because TorchScript treats unknown types as type variables.
This PR checks types in a schema for the TORCH_LIBRARY APIs. To do this,
we add an `allow_typevars` flag to parseSchema so that TorchScript can
use allow_typevars=True. We also add some error messages for common
mistakes (e.g. using int64_t or double in schema).
Test Plan:
- new tests
Differential Revision: [D56432690](https://our.internmc.facebook.com/intern/diff/D56432690)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124520
Approved by: https://github.com/albanD
Motivations:
- this is pretty redundant with test_aot_dispatch_dynamic.
- The user story for opcheck is that a user should use opcheck to see
if their operator was "registered correctly". If a user's custom op
only supports dynamic shapes, then it's a bit awkward for
one of the tests (e.g. `test_aot_dispatch_static`) to fail.
- We've already stopped running test_aot_dispatch_static in all of
our opcheck tests.
Test Plan:
- wait for CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124495
Approved by: https://github.com/williamwen42
ghstack dependencies: #124180, #124200, #124299, #124134, #124199, #124403, #124414
old: `register_autograd(setup_context, backward, /)`
new: `register_autograd(backward, /, *, setup_context=None)`
Motivations:
- We introduce these APIs as "give us a backward and use setup_context
to save things for backward".
- setup_context isn't always necessary.
Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124403
Approved by: https://github.com/albanD
ghstack dependencies: #124180, #124200, #124299, #124134, #124199
Motivation:
- The API is used for registering an implementation for a specific
device type.
- "impl" is ambiguous and can be confused with Library.impl.
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124200
Approved by: https://github.com/albanD
ghstack dependencies: #124180
We allow it to accept:
- a string with the op name
- an opoverload
- a new-style custom op
If any of these are referring to a new-style custom op (created with the
custom_op decorator), then we dispatch to CustomOpDef.register_fake.
Otherwise, we do what we previously did.
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124066
Approved by: https://github.com/albanD
ghstack dependencies: #123937, #124064, #124065
This PR:
- adds a new torch.library.register_fake and deprecates
torch.library.impl_abstract. The motivation is that we have a lot of
confusion around the naming so we are going to align the naming with
the actual subsystem (FakeTensor).
- renames `m.impl_abstract_pystub("fbgemm_gpu.sparse_ops")` to
`m.has_python_registration("fbgemm_gpu.sparse_ops")`. No deprecation
here yet; I need to test how this works with static initialization.
- Renames a bunch of internals to match (e.g. abstractimplpystub ->
pystub)
I'm scared to rename the Python-side internal APIs (e.g.
torch._library.abstract_impl) because of torch.package concerns. I'll do
that in its own isolated PR next just in case it causes problems.
DEPRECATION NOTE: torch.library.impl_abstract was renamed to to
torch.library.register_fake. Please use register_fake. We'll delete
impl_abstract in a future version of PyTorch.
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123937
Approved by: https://github.com/albanD
If a user accesses an OpOverloadPacket, then creates a new OpOverload,
then uses the OpOverloadPacket, the new OpOverload never gets hit. This
is because OpOverloadPacket caches OpOverloads when it is constructed.
This PR fixes the problem by "refreshing" the OpOverloadPacket if a new
OpOverload gets constructed and the OpOverloadPacket exists.
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123578
Approved by: https://github.com/albanD
ghstack dependencies: #123453
The user provides a `setup_context` and a `backward_function`. These
get put into a torch.autograd.Function that gets registered as the
custom op's autograd implementation.
Test Plan:
- we update custom ops in the custom_op_db to use the new
register_autograd API.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123110
Approved by: https://github.com/albanD
ghstack dependencies: #123108, #123109
Previously it worked with torchgen.model.FunctionSchema. This PR extends
it to work with torch._C._FunctionSchema by making
torchgen.model.FunctionSchema look more like torch._C._FunctionSchema.
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123108
Approved by: https://github.com/albanD
Previously, it suggested that a user add a manual functionalization
kernel. However, since we have auto_functionalize now, the user's first
course of action should be to modify their op into the form that
auto_functionalize accepts (this is possible in the majority of custom
ops).
Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123261
Approved by: https://github.com/williamwen42