This PR makes basic nn.Module forward hooks work by default, without any overhead. However, it leaves silent correctness issues if users modify or remove their hooks later, so it also emits a warning.
- the usual case is not to use hooks, so we avoid guard overhead there
- registering any hook before compile will trigger a warning about hook support
- registering a hook later (or removing one) requires user knowledge and opting in; currently this isn't warnable (but maybe we can observe compiled nn.Modules to make it warnable)
Why skip hook guards by default instead of not tracing __call__/hooks by default?
- avoid having a mode flag that alters dynamo tracing behavior (it would be harder to test both codepaths in CI with full coverage)
- the most basic hook use case (registering a hook before compile and never removing it) will work by default with this PR, whereas it would require explicit enablement and incur overhead under the 'not tracing __call__' proposal (see the sketch below)
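A minimal sketch of that supported pattern, with a hypothetical model and hook:
```python
import torch
import torch.nn as nn

model = nn.Linear(4, 4)

def scale_output(module, inputs, output):
    # forward hook that post-processes the module's output
    return output * 2

# Registered before compile and never removed: works by default, no guard overhead.
model.register_forward_hook(scale_output)

compiled = torch.compile(model)
out = compiled(torch.randn(2, 4))  # the hook fires inside the compiled module
```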
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98371
Approved by: https://github.com/jansel
Small QoL improvement so that `add_numbered_label` now works more intuitively. Now, if we push different labels, instead of getting `[reverted, mergedX2, revertX3, mergedX4, revertedX5, mergedX6]` we get `[reverted, merged, revertX2, mergedX2, revertedX3, mergedX3]`.
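For illustration, a hypothetical per-label counter (this is not the actual implementation of `add_numbered_label` in the CI tooling):
```python
def add_numbered_label(labels: list[str], base: str) -> None:
    # Count prior occurrences of this base label (plain or already numbered)
    # and number only within that label's own sequence.
    count = sum(1 for l in labels if l == base or l.startswith(base + "X"))
    labels.append(base if count == 0 else f"{base}X{count + 1}")

labels: list[str] = []
for base in ["reverted", "merged"] * 3:
    add_numbered_label(labels, base)
print(labels)
# ['reverted', 'merged', 'revertedX2', 'mergedX2', 'revertedX3', 'mergedX3']
```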
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98551
Approved by: https://github.com/huydhn
Significantly reduces the overhead of constructing Tensors and Storages and checking Storage liveness. Removes the regression for the HF models I tested and removes 75% of the overhead of the extremely overhead-bound resnet50 training we have in torchbench (0.91x base commit, 1.02x torchinductor default, 1.16x this PR, 1.25x previous cudagraphs impl).
This PR takes care of all of the lower-hanging fruit.
- Computes storage aliasing at record time instead of at runtime. We no longer need a runtime storage cache, and can instead index directly into the existing alias if there is one, or construct a new Storage (see the sketch below)
- Batches the heavyweight C++ calls - getting storage weakrefs and constructing tensors
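A minimal sketch of the record-time idea, under my own naming (not the actual cudagraphs code): aliasing is resolved once while recording into an index table, so runtime reconstruction is a plain list lookup rather than a storage-cache query.
```python
import torch

def record_alias_table(output_tensors):
    """Built once at record time: for each output, the index of an earlier
    output whose storage it aliases, or None if it owns fresh storage."""
    seen = {}       # storage data_ptr -> index of first output using it
    alias_of = []
    for i, t in enumerate(output_tensors):
        key = t.untyped_storage().data_ptr()
        if key in seen:
            alias_of.append(seen[key])  # runtime can index directly into this
        else:
            alias_of.append(None)
            seen[key] = i
    return alias_of

base = torch.randn(4)
print(record_alias_table([base, base.view(2, 2), torch.randn(3)]))  # [None, 0, None]
```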
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98529
Approved by: https://github.com/jansel, https://github.com/ngimel
Summary:
This PR adds annotation support for conv2d + relu, linear, maxpool2d, add, and add + relu, so
that we can successfully quantize resnet18 with the prepare_pt2e_quantizer API and get the same result
as fx graph mode quantization.
Test Plan:
python test/test_quantization.py TestQuantizePT2EModels.test_resnet18_with_quantizer_api
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98507
Approved by: https://github.com/vkuzo
Add a PrivateUse1 folder to contain all the feature adaptations for PrivateUse1 under ATen, for example GetGeneratorPrivate, which is used by third-party backends to register their own Generator implementations. This makes it easier for us to centrally manage these features, and it will make adaptation more convenient for different backend vendors. For more info: https://github.com/pytorch/pytorch/issues/98073
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98127
Approved by: https://github.com/bdhirsh
This is yet another wrong shard number calculation on ASAN causing flakiness. I figure that we don't really need to run this test on ASAN, so let's disable it. There is also an ongoing discussion about running ASAN only periodically.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98544
Approved by: https://github.com/malfet
Summary: This is a reland of #98264.
When _inductor.config.cpp_wrapper is specified, we run a
two-pass wrapper codegen to generate wrapper code in C++, which calls
cuLaunchKernel to launch pre-compiled CUDA kernels, and then calls
load_inline to load that generated wrapper back into the Python world.
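A minimal usage sketch (the compiled function is hypothetical; the flag is the one named above):
```python
import torch
import torch._inductor.config as inductor_config

inductor_config.cpp_wrapper = True  # opt into the two-pass C++ wrapper codegen

def f(x):
    return (x.sin() + 1).cos()

compiled = torch.compile(f)
# The generated wrapper is C++ that launches the pre-compiled CUDA kernels
# via cuLaunchKernel and is loaded back into Python with load_inline.
out = compiled(torch.randn(8, device="cuda"))
```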
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98534
Approved by: https://github.com/huydhn
Fixes #98149
The type of `mul`'s output is inconsistent with the type of its input. This PR fixes the type of `mul`'s output.
Here is the output code for the newly added test case `pow + cos`. `tmp4` is 1024 before the fix and 0 after the fix.
#### Before fixing
```
auto tmp0 = in_ptr0[static_cast<long>(0)]; // tmp0 is unsigned_char
auto tmp1 = tmp0 * tmp0; // tmp1 is int
auto tmp2 = tmp1 * tmp1; // tmp2 is int
auto tmp3 = tmp2 * tmp0; // tmp3 is int
auto tmp4 = static_cast<float>(tmp3); // tmp4 is float
auto tmp5 = std::cos(tmp4);
out_ptr0[static_cast<long>(0)] = tmp5;
```
#### After fixing
```
auto tmp0 = in_ptr0[static_cast<long>(0)]; // tmp0 is unsigned_char
auto tmp1 = decltype(tmp0)(tmp0 * tmp0); // tmp1 is unsigned_char
auto tmp2 = decltype(tmp1)(tmp1 * tmp1); // tmp2 is unsigned_char
auto tmp3 = decltype(tmp2)(tmp2 * tmp0); // tmp3 is unsigned_char
auto tmp4 = static_cast<float>(tmp3); // tmp4 is float
auto tmp5 = std::cos(tmp4);
out_ptr0[static_cast<long>(0)] = tmp5;
```
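My hedged reconstruction of the new `pow + cos` test (the input value 4 is inferred from the 1024-vs-0 numbers above):
```python
import torch

def f(x):
    return torch.cos(torch.pow(x, 5))

x = torch.tensor([4], dtype=torch.uint8)
# Eager uint8 arithmetic wraps: 4**5 = 1024 -> 0, so f(x) == cos(0.0) == 1.0.
# Before the fix, the generated kernel kept the intermediate as int and
# computed cos(1024.0) instead.
print(f(x))
print(torch.compile(f)(x))  # should now match eager
```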
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98473
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/jansel
This PR explicitly adds $CONDA_ENV/bin to the MacOS PATH, so that it always detects and uses the correct Python. $CONDA_ENV is always set to the correct value by setup-miniconda: https://github.com/pytorch/test-infra/blob/main/.github/actions/setup-miniconda/action.yml#L141
This pull request fixes the conda-pip environment mismatch for the macOS build and test workflows by using consistent pip requirements files. It also adds a conditional block to the `.github/workflows/_mac-test-mps.yml` file to enable the test MPS job.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98522
Approved by: https://github.com/malfet
Pattern replacement behaves incorrectly when the replacement pattern maps inputs to outputs (such a pattern can be used to replace redundant code). The current code in `torch.fx.subgraph_rewriter._replace_pattern` causes the list of replacement nodes to include the entire graph before that node, resulting in an exponential slowdown due to recursive calls traversing the entire graph multiple times.
The proposed fix is to add a check in `_replace_pattern` to prevent the call to `get_replacement_nodes`:
```python
for ret_node in copied_returning_nodes:
if ret_node in match.placeholder_nodes:
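        # The replacement output is itself a pattern input: record it
        # directly instead of recursing, which would sweep the entire
        # upstream graph into the replacement node list.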
replacement_nodes.append(ret_node)
else:
get_replacement_nodes(ret_node)
```
Fixes #97817
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97903
Approved by: https://github.com/angelayi
This PR addresses the issue seen in PR #97417, where a newly added op requires `kwargs`; however, tools/autograd/gen_annotated_fn_args.py currently does not support `kwargs` - only `func_args` are generated for test_overrides.py.
This PR adds a new field, "is_kwarg_only", to each argument, indicating whether it is keyword-only. See example:
```
annotated_args = {
torch._C._VariableFunctions._cast_Byte: [{'is_kwarg_only': 'False', 'name': 'self', 'simple_type': 'Tensor'}],
...
```
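For illustration, a self-contained, hypothetical sketch of how a harness consuming these annotations might use the new flag (none of these helper names come from the PR): keyword-only arguments have to be routed through `**kwargs`.
```python
def build_call(specs, make_sample):
    # Split annotated argument specs into positional args and kwargs,
    # honoring the new is_kwarg_only flag.
    args, kwargs = [], {}
    for spec in specs:
        value = make_sample(spec["simple_type"])
        if spec["is_kwarg_only"] == "True":
            kwargs[spec["name"]] = value  # keyword-only: must be passed by name
        else:
            args.append(value)
    return args, kwargs

specs = [{"is_kwarg_only": "False", "name": "self", "simple_type": "Tensor"},
         {"is_kwarg_only": "True", "name": "dtype", "simple_type": "dtype"}]
print(build_call(specs, make_sample=lambda t: f"<{t}>"))
# (['<Tensor>'], {'dtype': '<dtype>'})
```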
The full comparison of the generated file `annotated_fn_args.py` can be found here:
- **Before**: [P681991116](https://www.internalfb.com/phabricator/paste/view/P681991116)
- **After**: [P681994218](https://www.internalfb.com/intern/paste/P681994218/)
Differential Revision: D44698310
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98396
Approved by: https://github.com/ezyang
The meta implementation for these _like functions is wrong whenever device != "meta" (it doesn't fill the memory!).
zeros_like is special due to sparse and is fixed directly by always filling the result with zeros.
Every other one is a CompositeExplicit implementation; I went with removing their meta registrations and tweaking the code to avoid infinite recursion.
I could do the same as zeros_like (and add the proper filling for each), but that would duplicate the C++ logic and make the meta registrations non-trivial. I can do that if you prefer it to removal.
test_meta works fine with these fixes; relying on CI to see if other tests break as well.
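To illustrate the bug class (my own example, not from the PR): a _like op that runs through a fill-free meta-style kernel on a real device would return uninitialized memory.
```python
import torch

x = torch.randn(4)

# On a real device the result must actually be zero-filled; an "allocate
# only" kernel (what the meta registration effectively did) would leave
# this memory uninitialized.
y = torch.zeros_like(x)
assert (y == 0).all()

# On the meta device there is no data at all, so allocation alone is fine.
m = torch.zeros_like(x.to("meta"))
assert m.device.type == "meta"
```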
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98160
Approved by: https://github.com/ezyang