In order to avoid having any temporary state where behavior is regressed, this PR does all of the following at once:
1. Disables torch function running a second time in AOTAutograd
If you have a tensor subclass that relies on dispatching into the same op without unwrapping and calling torch._C.DisableTorchFunctionSubclass(), the torch function-ness will survive into AOTAutograd (when normally we would expect the torch function to be inlined away during dynamo). If this happens, we should make sure not to run the torch function logic a second time (a minimal sketch of this pattern appears after this list).
2. Enables torch function to be inlined in dynamo for NT
Due to torch function running a second time in AOTAutograd, NT was actually relying on this behavior instead of properly inlining through torch function at the dynamo level.
3. Fixes graph breaks for NT torch function
Now that we are inlining through torch function for the first time in dynamo, we've uncovered some graph breaks. Thanks to mlazos, we now have support for custom attributes for torch function. We also add support for a custom Enum type. Finally, a few of the graph breaks can be removed by adding allow_in_graph (though we may need to double-check the soundness here).
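A minimal sketch of the subclass pattern described in item 1 (a hypothetical subclass, not the NT implementation): `__torch_function__` re-dispatches into the same op on the subclass instances without unwrapping, so the subclass-ness, and with it the torch function handling, survives past dynamo.
```python
import torch

class MySubclass(torch.Tensor):
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        # Re-dispatch the same op on the subclass instances (no unwrapping).
        # DisableTorchFunctionSubclass only stops __torch_function__ from
        # recursing here; the outputs are still MySubclass instances.
        with torch._C.DisableTorchFunctionSubclass():
            return func(*args, **kwargs)

x = torch.randn(3).as_subclass(MySubclass)
y = x + 1  # still a MySubclass, so torch function handling can fire again downstream
```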
Fixes https://github.com/pytorch/pytorch/issues/120654, https://github.com/pytorch/pytorch/issues/120124
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang
[ghstack-poisoned]
Downloading CUDA sometimes fails and breaks the build process, but AOTriton does not need these packages for its own Triton fork. This commit comments out the related download scripts.
The actual changes from Triton can be found at: 9b73a543a5
Fixes the following build error:
```
[2/6] cd /var/lib/jenkins/workspace/build/aotriton/src/third_party/triton/python && /opt/conda/envs/py_3.8/bin/cmake -E env VIRTUAL_ENV=/var/lib/jenkins/workspace/build/aotriton/build/venv PATH="/var/lib/jenkins/workspace/build/aotriton/build/venv/bin:/opt/cache/bin:/opt/rocm/llvm/bin:/opt/rocm/opencl/bin:/opt/rocm/hip/bin:/opt/rocm/hcc/bin:/opt/rocm/bin:/opt/conda/envs/py_3.8/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" TRITON_BUILD_DIR=/var/lib/jenkins/workspace/build/aotriton/build/triton_build python setup.py develop
FAILED: CMakeFiles/aotriton_venv_triton /var/lib/jenkins/.local/lib/python3.8/site-packages/triton/_C/libtriton.so /var/lib/jenkins/workspace/build/aotriton/build/CMakeFiles/aotriton_venv_triton
cd /var/lib/jenkins/workspace/build/aotriton/src/third_party/triton/python && /opt/conda/envs/py_3.8/bin/cmake -E env VIRTUAL_ENV=/var/lib/jenkins/workspace/build/aotriton/build/venv PATH="/var/lib/jenkins/workspace/build/aotriton/build/venv/bin:/opt/cache/bin:/opt/rocm/llvm/bin:/opt/rocm/opencl/bin:/opt/rocm/hip/bin:/opt/rocm/hcc/bin:/opt/rocm/bin:/opt/conda/envs/py_3.8/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" TRITON_BUILD_DIR=/var/lib/jenkins/workspace/build/aotriton/build/triton_build python setup.py develop
downloading and extracting https://conda.anaconda.org/nvidia/label/cuda-12.1.1/linux-64/cuda-nvcc-12.1.105-0.tar.bz2 ...
downloading and extracting https://conda.anaconda.org/nvidia/label/cuda-12.1.1/linux-64/cuda-cuobjdump-12.1.111-0.tar.bz2 ...
Traceback (most recent call last):
File "/var/lib/jenkins/workspace/build/aotriton/src/third_party/triton/python/setup.py", line 325, in <module>
download_and_copy(
File "/var/lib/jenkins/workspace/build/aotriton/src/third_party/triton/python/setup.py", line 151, in download_and_copy
ftpstream = urllib.request.urlopen(url)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/urllib/request.py", line 215, in urlopen
return opener.open(url, data, timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/urllib/request.py", line 521, in open
response = meth(req, response)
^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/urllib/request.py", line 630, in http_response
response = self.parent.error(
^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/urllib/request.py", line 559, in error
return self._call_chain(*args)
^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/urllib/request.py", line 492, in _call_chain
result = func(*args)
^^^^^^^^^^^
File "/opt/conda/lib/python3.12/urllib/request.py", line 639, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 524:
ninja: build stopped: subcommand failed.
```
Example of failed build log: https://github.com/pytorch/pytorch/actions/runs/8483953034/job/23245996425
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122982
Approved by: https://github.com/jansel
Summary:
Removes `using namespace` from a header file. Having `using namespace` in a header file is *always* a bad idea. A previous raft of diffs provided appropriate qualifications to everything that relied on this `using namespace`, so it is now safe to remove it in this separate diff.
Helps us enable `-Wheader-hygiene`.
Test Plan: Sandcastle
Reviewed By: dmm-fb
Differential Revision: D54838298
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121847
Approved by: https://github.com/Skylion007
As titled: previously we could return an expected input spec that was shared by multiple args. This is not OK since different args might have different tensor metas; it only worked before because redistribute became a no-op in these cases.
This PR fixes it by making each expected input spec shallow-clone the corresponding input metadata.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122949
Approved by: https://github.com/tianyu-l
ghstack dependencies: #122929
This PR refactors schema_suggestions in OutputSharding to be a single OpSchema instead of a list of schemas; in practice we only ever have one. The multiple-resharding case has also moved to OpStrategy, so there is no case left that needs it to be a list.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122929
Approved by: https://github.com/tianyu-l
Summary: After we migrate to torch.export, we won't see ops like add_ and mul_ due to functionalization. We are rolling out pre dispatch export, so for now we just skip those mutating ops in tests.
Test Plan: buck run mode/opt caffe2/test/quantization:test_quantization
Reviewed By: tugsbayasgalan
Differential Revision: D55442019
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122863
Approved by: https://github.com/clee2000
When fakifying a grad tracking tensor, if the level is -2 (sentinel value) we can just unwrap the grad tensor and return a fake version of it. In this PR, we update `assert_metadata_eq` to not compare whether the grad tensor and the unwrapped one are leaves, as this may not always be true.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122728
Approved by: https://github.com/zou3519
Fixes #114844
In the linked issue we have
```
compiled_module = torch.compile(module)
compiled_module.x = ...
compiled_module(...) # Mutates self.x
```
Since the module mutates `self.x`, you would expect `compiled_module.x` to be updated, but `compiled_module.x = ...` actually sets an attribute "x" on the `OptimizedModule` object while the forward method of the module mutates `module.x`.
This PR gives the expected behavior by forwarding `compiled_module.__setattr__` down to `module.__setattr__`. There is already a corresponding `__getattr__`, so now `compiled_module.x` becomes an alias for `module.x`.
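A minimal sketch of the forwarding idea (illustrative only, not the actual `OptimizedModule` implementation):
```python
import torch.nn as nn

class CompiledWrapper:
    def __init__(self, mod: nn.Module):
        # Bypass our own __setattr__ so the wrapped module lands on the wrapper itself.
        object.__setattr__(self, "_orig_mod", mod)

    def __getattr__(self, name):
        # Attribute reads fall through to the wrapped module.
        return getattr(self._orig_mod, name)

    def __setattr__(self, name, value):
        # Attribute writes are forwarded too, so wrapper.x and module.x stay in sync.
        setattr(self._orig_mod, name, value)

m = nn.Linear(2, 2)
w = CompiledWrapper(m)
w.x = 1
assert m.x == 1 and w.x == 1
```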
Co-authored-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122098
Approved by: https://github.com/ezyang, https://github.com/lezcano
Summary:
Added support for quantized linear on CPU with fbgemm.
Specifically, for torch.ops.quantized.linear_unpacked_dynamic_fp16, we decompose it into two steps: pack the weight, then call fbgemm's qlinear with the packed weight.
Test Plan:
Included in commit.
test_aot_inductor::test_quantized_linear
Differential Revision: [D55577959](https://our.internmc.facebook.com/intern/diff/D55577959)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123069
Approved by: https://github.com/hl475
Enable VEC on Windows OS.
1. Fix some type definition gaps between Windows and Linux.
2. Fix some operators not supported on Windows, such as [] and /.
3. Enable static sleef library build on Windows.
4. Disable unsupported function overloading on MSVC.
5. Upgrade the sleef submodule, which fixes a build issue on Windows.
6. Fix bazel build issues.
7. Fix the test app not linking to sleef on Windows.
Note: If the rebuild fails after pulling this PR, please sync the `sleef` submodule by running:
```cmd
git submodule sync
git submodule update --init --recursive
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118980
Approved by: https://github.com/jgong5, https://github.com/ezyang, https://github.com/malfet
Fixes #118849
Add a map for parent_to_child_mappings in _mesh_resources so we can cache and reuse submesh slicing results and avoid recreating the submesh and the underlying sub-PG repeatedly, which could lead to funky behaviors.
We will follow up with reusing pg from the parent_mesh during submesh creation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122975
Approved by: https://github.com/wanchaol
Summary:
This would otherwise yield
> ValueError: ('Manual wrapping with ShardingStrategy.HYBRID_SHARD', 'requires explicit specification of process group or device_mesh.')
which is odd.
Remove the extra trailing commas.
Test Plan: CI
Differential Revision: D55549851
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123019
Approved by: https://github.com/Skylion007
Inference for the vision_maskrcnn model fails when max-autotune is enabled.
Repro:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/torchbench.py --accuracy --inference --bfloat16 --backend inductor --only vision_maskrcnn
```
It turns out that the MA code receives an empty input tensor for convolution, and some places in MA-related code do not handle this corner case properly. This PR enhances that, and now the accuracy test above passes.
Regarding why the input tensor is empty, it's probably because no objects are detected in the input images (random data?).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123008
Approved by: https://github.com/jansel
This PR adds a new metadata, `torch_fn` which is meant to replace `source_fn_stack` as `source_fn_stack` is not entirely well defined between strict/nonstrict. Previous discussion [here](https://docs.google.com/document/d/1sPmmsmh6rZFWH03QBOe49MaXrQkP8SxoG8AOMb-pFk4/edit#heading=h.anmx9qknhvm).
`torch_fn` represents the torch function that a particular aten operator came from. For example, `torch.nn.Linear` goes down to the `torch.nn.functional.linear` at the `__torch_function__` layer, and then `aten.t/aten.addmm` in the `__torch_dispatch__` layer. So the nodes `aten.t/aten.addmm` will now have the `torch_fn` metadata containing the `torch.nn.functional.linear`.
The `torch_fn` metadata is a tuple of 2 strings: a unique identifier for each torch function call, and the actual torch function `f"{fn.__class__}.{fn.__name__}"`. The purpose of the first value is to distinguish between 2 consecutive calls to the same function. For example, if we had 2 calls to `torch.nn.Linear`, the nodes and corresponding metadata would look something like:
```
aten.t - ("linear_1", "builtin_function_or_method.linear"),
aten.addmm - ("linear_1", "builtin_function_or_method.linear"),
aten.t - ("linear_2", "builtin_function_or_method.linear"),
aten.addmm - ("linear_2", "builtin_function_or_method.linear"),
```
Higher order ops -- currently we can get the torch_fn metadata for nodes within the HOO's subgraph, but after retracing, this becomes the `(cond, higher_order_op.cond)` :( This is because `fx_traceback.set_current_meta` points to the cond node in the toplevel graph, rather than the original node in the subgraph. I think this is because `fx.Interpreter` does not go into the cond subgraphs. (will discuss with Yidi more ab this)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122693
Approved by: https://github.com/tugsbayasgalan
fixes https://github.com/pytorch/pytorch/issues/122826
# Problem
When the model returns multiple outputs which alias the same tensor, we get a SEGFAULT. Because we try to release the same buffer twice.
```
def forward(x):
x_out = x + 1
contig = x_out.contiguous() # alias of same tensor as x_out
return x_out, contig
run_impl() {
output_handles[0] = buf0.release();
output_handles[1] = buf0.release(); # SEGFAULT
}
# if we try to workaround this by assign aliases without creating a new tensor,
# then, we'll get a double free error during handle clean-up.
output_handles[1] = output_handles[0]; # assign without creating a new tensor
...
alloc_tensors_by_stealing_from_handles(){
aoti_torch_delete_tensor_object(handles[0]);
aoti_torch_delete_tensor_object(handles[1]); # Double free
}
```
# Solution
~~Instead, we use the first `output_handle` that shares the same tensor and alias it.~~
```
output_handles[0] = buf0.release();
aoti_torch_alias_tensor(output_handles[0], &output_handles[1]); # No SEGFAULT & No double free!
```
A simpler approach is to figure out which handles are duplicate. Then we simply copy all duplicate except the last one. The last one will use `std::move` and free the tensor owned by the model instance.
```
output_handles[0] = buf0.release();
output_handles[1] = output_handles[0];
```
Differential Revision: [D55455344](https://our.internmc.facebook.com/intern/diff/D55455344)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122846
Approved by: https://github.com/desertfire, https://github.com/chenyang78, https://github.com/jingsh
The batch size for this model was previously 64. We later changed it to 256, which causes OOM in the cudagraphs setting. This PR tunes the batch size down to 128.
Sharing more logs from my local run:
```
cuda,res2net101_26w_4s,128,1.603578,110.273572,335.263494,1.042566,11.469964,11.001666,807,2,7,6,0,0
cuda,res2net101_26w_4s,256,1.714980,207.986155,344.013071,1.058278,22.260176,21.034332,807,2,7,6,0,0
```
The log shows that torch.compile uses 11GB for batch size 128 and 21GB for batch size 256. I guess the benchmark script has extra overhead that causes the model to OOM at batch size 256 in the dashboard run.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122977
Approved by: https://github.com/Chillee
# Motivation
Add some attributes to `XPUDeviceProp` and expose them via `torch.xpu.get_device_properties` and `torch.xpu.get_device_capability`. They can be used in `torch.compile` or directly passed to triton to generate more optimized code based on device properties.
# Additional Context
expose the following attributes to `torch.xpu.get_device_properties`:
- `has_fp16` (newly added)
- `has_fp64` (newly added)
- `has_atomic64` (newly added)
- `driver_version`
- `vendor`
- `version`
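A hedged usage sketch (assumes a PyTorch build with XPU support and an available device; the attribute names follow the list above):
```python
import torch

if torch.xpu.is_available():
    props = torch.xpu.get_device_properties(0)
    # Newly added feature flags
    print(props.has_fp16, props.has_fp64, props.has_atomic64)
    # Also exposed now
    print(props.driver_version, props.vendor, props.version)
    print(torch.xpu.get_device_capability(0))
```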
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121898
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/malfet, https://github.com/albanD, https://github.com/atalman
Summary:
When a rank detects a timeout from TCPStore and triggers the dump, it's good to have more info about the source rank which detected the collective timeout locally. We just need to put the source rank as the value in the kvstore.
Test Plan:
In the unit test, we trigger the timeout on rank 0, and rank 1 should get the timeout signal from the store and log the correct source rank:
```
(sqzhang_1) [sqzhang@devgpu009.cln1 ~/pytorch (34d27652)]$ python
test/distributed/test_c10d_nccl.py NCCLTraceTestTimeoutDumpOnStuckRanks
NCCL version 2.19.3+cuda12.0
[rank0]:[E327 17:04:16.986381360 ProcessGroupNCCL.cpp:565] [Rank 0]
Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2,
OpType=ALLREDUCE, NumelIn=12, NumelOut=12, Timeout(ms)=1000) ran for
1099 milliseconds before timing out.
[rank0]:[E327 17:04:16.988036373 ProcessGroupNCCL.cpp:1582] [PG 0 Rank
0] Timeout at NCCL work: 2, last enqueued NCCL work: 2, last completed
NCCL work: 1.
[rank0]:[E327 17:04:16.182548526 ProcessGroupNCCL.cpp:1346] [PG 0
Rank 0] Received a timeout signal from this local rank and will start
to dump the debug info. Last enqueued NCCL work: 2, last completed
NCCL work: 1.
[rank0]:[E327 17:04:16.247574460 ProcessGroupNCCL.cpp:1167] [PG 0
Rank 0] ProcessGroupNCCL preparing to dump debug info.
[rank1]:[E327 17:04:16.273332178 ProcessGroupNCCL.cpp:1346] [PG 0
Rank 1] Received a global timeout from another rank 0, and will start
to dump the debug info. Last enqueued NCCL work: 1, last completed
NCCL work: 1.
[rank1]:[E327 17:04:16.273565177 ProcessGroupNCCL.cpp:1167] [PG 0
Rank 1] ProcessGroupNCCL preparing to dump debug info.
[rank1]:[F327 17:04:16.274256512 ProcessGroupNCCL.cpp:1185] [PG 0
Rank 1] [PG 0 Rank 1] ProcessGroupNCCL's watchdog detected a
collective timeout from another rank 0 and notified the current rank.
This is most likely caused by incorrect usages of collectives, e.g.,
wrong sizes used across ranks, the order of collectives is not same
for all ranks or the scheduled collective, for some reason, didn't
run. Additionally, this can be caused by GIL deadlock or other
reasons such as network errors or bugs in the communications library
(e.g. NCCL), etc. We tried our best to dump the debug info into the
storage to help you debug the issue.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122850
Approved by: https://github.com/wconstab
This PR unified the vectorized conversion with `at::vec::convert` for all vectorized data types. The intrinsics implementations are implemented as a specialization and moved to their own arch-specific files. The vectorized conversion logic in cpp Inductor is simplified.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119979
Approved by: https://github.com/jansel, https://github.com/malfet
Partially addresses #122160
In the module `torch.utils.tensorboard.summary`, the `hparams` method does not depend on any utilities from PyTorch, as it uses only utilities from `tensorboard`. Thus it should be safe to delete the test for the `hparams` method.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122556
Approved by: https://github.com/huydhn
Summary:
Replacing `torch._export.aot_compile` callsites with
```
ep = torch.export._trace._export(.., predispatch=True) # Traces the given program into predispatch IR
so_path = torch._inductor.aot_compile_ep(ep, ...) # Takes an exported program and compiles it into a .so
```
This allows us to explicitly split up the export step from AOTInductor. We can later modify tests to do `export + serialize + deserialize + inductor` to mimic internal production use cases better.
Test Plan: CI
Differential Revision: D54808612
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122225
Approved by: https://github.com/SherlockNoMad, https://github.com/khabinov
Sympy simplifications don't obey floating point semantics, so don't
use Sympy for this. Keep them as is, only evaluate with the reference
implementations when all arguments are known.
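A small example of the mismatch: algebraic simplification gives an answer that float arithmetic does not.
```python
import sympy

x = sympy.Symbol("x")
expr = (x + 1e16) - 1e16
print(sympy.simplify(expr))  # x, because the constants cancel algebraically
print((0.5 + 1e16) - 1e16)   # 0.0 in float64, because 0.5 is absorbed by rounding
```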
This may end up getting subsumed by some other changes later, but I
wanted to understand if this was easy and it seems to be easy.
This doesn't actually depend on the earlier diffs on the stack and I can detach it.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122823
Approved by: https://github.com/lezcano
Summary: Pre-grad fx passes expect information from shape propagation to be present. D55221119 ensured that `pass_execution_and_save` invokes shape propagation, and this diff adds a covering unit test to prevent regression.
Test Plan: New UT passes locally.
Differential Revision: D55440240
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122897
Approved by: https://github.com/khabinov, https://github.com/Skylion007
This PR fixes the two major issues that were discovered after the initial merge of PR #121561.
1. The Flash Attention support added by that PR has severe performance regressions on regular shapes (power-of-two head dimensions and sequence lengths) compared with PR #115981. Its performance is worse than the math backend and it only has numerical stability advantages. This PR fixes this problem.
2. There is a flaw in the memory storage handling in PR #121561 which does not copy the gradients back to the designated output tensor. This PR removes the deprecated `TensorStorageSanitizer` class, which is unnecessary due to the more flexible backward kernel shipped by PR #121561.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122857
Approved by: https://github.com/jeffdaily, https://github.com/drisspg
This significantly speeds up real-world applications, such as LLMs.
Before this change, llama2-7b fp16 inference ran at 1.5 tokens per sec;
after it, it runs at almost 6 tokens per sec.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122951
Approved by: https://github.com/ezyang
Dynamo skips user-defined modules from `torch/testing/_internal` (e.g. MLP, Transformer). This PR adds `torch/testing/_internal/...` to `manual_torch_name_rule_map`. It ensures FSDP CI + torch.compile are meaningfully tested.
The unit test shows frame count = 0 before and frame count > 0 after:
```pytest test/dynamo/test_trace_rules.py -k test_module_survive_skip_files```
Some FSDP unit tests actually start to compile modules with this change; add a triton availability check or disable those tests for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122851
Approved by: https://github.com/jansel
Summary:
Minor logging cleanup in distributed library
1. Don't use "f" formatted strings - address linter issues.
2. Nits: Make use of unused `e` (error) in a few logs.
3. Change info->debug as asked in issue #113545
4. Nit: rename log -> logger in a few files for consistency
5. Fix a linter error.
Test Plan:
1. Local build passes.
2. Linter is happy.
Reviewers: wanchaol
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122921
Approved by: https://github.com/wanchaol
When we enter map_autograd, we try to trace through the fwd/bwd of a map operator that is wrapped in a ctx.functionalize wrapper. This forces us to go through PreDispatch functionalization again (only the python part). As a result, it revealed a previous bug where pre-dispatch mode handling doesn't actually manage the local dispatch key set (if there is no active mode, we need to turn off the PreDispatch key). This PR fixes that. Also, I shuffled some APIs around so that there is less code duplication, as the setting/unsetting logic is quite hard to get right.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121444
Approved by: https://github.com/bdhirsh
Summary: When we migrate to torch.export, we won't put L['self'] as the prefix for all the fqn in nn_module_stack. This diff adds the branch to handle the new case.
Test Plan: buck test mode/opt caffe2/test/quantization:test_quantization -- -r set_module_name
Differential Revision: D55436617
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122819
Approved by: https://github.com/tugsbayasgalan
After we codegen a triton kernel in the triton codegen backend,
we cache the generated triton source code in the wrapper to avoid
producing multiple triton kernels with the same content.
In AOTI compilation flow, this caching mechanism imposes a strong requirement
on the codegen that we must generate the same triton source code
for the same schedule node in both python and cpp codegen phases.
Otherwise, we would end up with a mismatch between the kernel name
formed in the cpp codegen and the cuda kernel key produced from
the python codegen. Consequently, we would hit a missing-cuda-kernel
error.
The precomputed symbol replacements saved in V.graph.sizevars
can cause such source-code inconsistency related to the code for indexing
tensors. For example, let's say in the python codegen phase,
we produce "ks2\*48" as part of indexing an input for schedule
node A while yielding a replacement pair "ks0 -> ks2\*48" in
the precomputed replacements. In the second cpp codegen phase,
we would produce "ks0" for the same indexing code of schedule
node A due to the "ks0 -> ks2*48" replacement pair.
This PR fixed the issue by clearing precomputed_replacements
and inv_precomputed_replacements before cpp wrapper codegen.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122882
Approved by: https://github.com/desertfire
Summary: Vulkan rewrite so that quantized transpose 2d ops can run in a model.
Test Plan:
Run vulkan api test:
# buck2 build --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output"
# buck-out//v2/gen/fbsource/xplat/caffe2/pt_vulkan_api_test_binAppleMac
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
[==========] Running 418 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 418 tests from VulkanAPITest
....
[----------] Global test environment tear-down
[==========] 418 tests from 1 test suite ran. (4510 ms total)
[ PASSED ] 417 tests.
[ SKIPPED ] 1 test, listed below:
[ SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
YOU HAVE 9 DISABLED TESTS
Run quantized vulkan api test: Note that the quantized linear tests are failing but all the convolution tests still pass. The linear failures are being debugged.
# buck2 build --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output"
# buck-out//v2/gen/fbsource/xplat/caffe2/pt_vulkan_quantized_api_test_binAppleMac
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
[==========] Running 86 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 86 tests from VulkanAPITest
...
[ PASSED ] 77 tests.
[ FAILED ] 9 tests, listed below:
[ FAILED ] VulkanAPITest.linear_2d_flat
[ FAILED ] VulkanAPITest.linear_2d_small
[ FAILED ] VulkanAPITest.linear_2d_large
[ FAILED ] VulkanAPITest.linear_3d_flat
[ FAILED ] VulkanAPITest.linear_3d_small
[ FAILED ] VulkanAPITest.linear_3d_large
[ FAILED ] VulkanAPITest.linear_4d_flat
[ FAILED ] VulkanAPITest.linear_4d_small
[ FAILED ] VulkanAPITest.linear_4d_large
9 FAILED TESTS
YOU HAVE 8 DISABLED TESTS
# Run CUNET quantized model on hibiki board.
Reviewed By: manuelcandales
Differential Revision: D52344263
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122547
Approved by: https://github.com/manuelcandales, https://github.com/copyrightly, https://github.com/yipjustin
Summary:
We add a new op, quantized.linear_unpacked_dynamic_fp16, which is essentially linear_dynamic_fp16 with a different (unpacked) weight/bias format.
This op does packing on the fly for each call with standard at::Tensor weight & bias.
Test Plan:
Included in commit.
test_quantized_op::test_unpacked_qlinear_dynamic_fp16
Differential Revision: [D55433203](https://our.internmc.facebook.com/intern/diff/D55433203)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122762
Approved by: https://github.com/jerryzh168
Summary:
We allow CPU to use the config use_runtime_constant_folding.
Changes include:
1. Rearrange USE_CUDA flags. Add CPU sections that consume memory directly.
2. Codegen changes to accommodate cpp fusions for CPU only. Specifically, we shouldn't generate 2 headers that would cause re-declaration.
Test Plan: Activate tests that were deactivated for CPU before.
Reviewed By: khabinov
Differential Revision: D55234300
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122563
Approved by: https://github.com/chenyang78
Summary:
Original commit changeset: ebda663a196b
Original Phabricator Diff: D55271788
Test Plan: Some models are failing torch compile with this, retrying the tests
Reviewed By: colinchan15
Differential Revision: D55374457
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122709
Approved by: https://github.com/huydhn
In this PR, we add a systematic way to test that all HOPs are exportable, as the export team has been running into various bugs related to newly added HOPs due to a lack of tests. We do this by creating:
- hop_db -> a list of HOP OpInfo tests, which is then used across various flows including export functionalities: aot-export, pre-dispatch export, retrace, and ser/der.
For now, we also create an allowlist so that people can bypass the failures. But we should discourage people from doing that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122265
Approved by: https://github.com/ydwu4, https://github.com/zou3519
Fixes #120794
Torch creates a cache of compiled kernels at $HOME/.cache/torch/kernels. The names used to save and select the cached kernels use cuda_major and cuda_minor to identify the GPU architecture for which the kernels were compiled. On ROCm this is insufficient: cudaDeviceProp's cuda_major and cuda_minor are mapped to hipDeviceProp_t::major and hipDeviceProp_t::minor, which correspond to the first and second numbers of the LLVM target for the architecture in question:
GFX1030 is major = 10, minor = 3
GFX1032 is major = 10, minor = 3
GFX900 is major = 9, minor = 0
GFX906 is major = 9, minor = 0
GFX908 is major = 9, minor = 0
Thus hipDeviceProp_t::major and hipDeviceProp_t::minor are insufficient to uniquely identify the ROCm architecture. This causes the ROCm runtime to raise an error when an operation uses a cached kernel that was first cached on an architecture with the same hipDeviceProp_t::major and hipDeviceProp_t::minor but a different LLVM target.
The solution provided in this PR is to replace the use of hipDeviceProp_t::major/hipDeviceProp_t::minor with hipDeviceProp_t::gcnArchName when PyTorch is compiled for ROCm, which contains a string identical to the LLVM target of the architecture in question.
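An illustrative check on a ROCm build (the exact strings depend on the GPU; `torch.cuda` maps to HIP on ROCm):
```python
import torch

props = torch.cuda.get_device_properties(0)
print(props.major, props.minor)  # e.g. 10, 3 for both gfx1030 and gfx1032
print(props.gcnArchName)         # e.g. "gfx1030", which uniquely identifies the arch
```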
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121401
Approved by: https://github.com/jeffdaily, https://github.com/hongxiayang, https://github.com/malfet
Summary: Previous work `https://github.com/pytorch/pytorch/pull/120742` to enable `matrix_instr_nonkdim` only dealt with the autotuner benchmarking, but failed to enable the parameter in Triton meta for real runs. `matrix_instr_nonkdim` needs to be visible to the compiler driver to set up the optimization pipeline, so it's unlike other kernel parameters such as `BLOCK_N` that can be just set inside the kernel itself.
Test Plan:
P1201466917
```
triton_heuristics.template(
    num_stages=1,
    num_warps=4,
    triton_meta={'signature': {0: '*fp32', 1: '*fp32', 2: '*fp32'}, 'device': 0, 'device_type': 'cuda', 'constants': {}, 'configs': [instance_descriptor(divisible_by_16=(0, 1, 2), equal_to_1=(), ids_of_folded_args=(), divisible_by_8=())], 'matrix_instr_nonkdim': 16},
    inductor_meta={'kernel_name': 'triton_tem_fused_mm_0', 'backend_hash': None},
)
```
Perf:
Before: 1.693ms 0.134GB 79.28GB/s
After: 1.577ms 0.134GB 85.12GB/s
Differential Revision: D55456401
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122852
Approved by: https://github.com/xw285cornell
For some reason, if we construct `class Handle(RemovableHandle)` inside `register_multi_grad_hook`, then over time the call to `RemovableHandle.__init__` slows down more and more (when we have GC disabled). Perhaps this is related to the class attribute `next_id: int = 0`. Python experts: please let me know if you have thoughts 😅
I am open to any suggestions on if how we should deal with this `Handle` class. For now, I changed it to a private `_MultiHandle`.
<details>
<summary> Experiment Script </summary>
```
import gc
import time
import torch
NUM_TENSORS = int(5e4)
ts = [torch.empty(1, requires_grad=True) for _ in range(NUM_TENSORS)]
def hook(grad) -> None:
return
gc.disable()
times = []
for i, t in enumerate(ts):
start_time = time.time()
torch.autograd.graph.register_multi_grad_hook([t], hook)
end_time = time.time()
times.append(end_time - start_time)
print([f"{t * 1e6:.3f} us" for t in times[1:6]]) # print first few times
print([f"{t * 1e6:.3f} us" for t in times[-5:]]) # print last few times
times = []
for i, t in enumerate(ts):
start_time = time.time()
t.register_hook(hook)
end_time = time.time()
times.append(end_time - start_time)
print([f"{t * 1e6:.3f} us" for t in times[1:6]]) # print first few times
print([f"{t * 1e6:.3f} us" for t in times[-5:]]) # print last few times
```
</details>
<details>
<summary> Results </summary>
Before fix:
```
['23.603 us', '19.550 us', '15.497 us', '12.875 us', '13.828 us']
['327.110 us', '341.177 us', '329.733 us', '332.832 us', '341.177 us']
['318.050 us', '315.189 us', '319.719 us', '311.613 us', '308.990 us']
['374.317 us', '394.821 us', '350.714 us', '337.362 us', '331.402 us']
```
Calling `register_multi_grad_hook` makes calling itself and `register_hook` slower (actually, any call to `RemovableHandle.__init__`).
After fix:
```
['13.590 us', '9.060 us', '12.875 us', '7.153 us', '8.583 us']
['4.530 us', '5.245 us', '6.437 us', '4.768 us', '5.007 us']
['2.623 us', '1.907 us', '1.431 us', '1.669 us', '1.192 us']
['1.431 us', '1.431 us', '1.192 us', '1.192 us', '1.431 us']
```
</details>
Update: from @soulitzer
> Your suspicion about next_id is right. I think what is happening is that whenever a class attribute is set, it needs to invalidate some cached data for the subclasses one-by-one. eefff682f0/Objects/typeobject.c (L845)
And this PR fixes the issue by avoiding creating many subclasses dynamically. Changing next_id to something like List[int] or incrementing a global instead also fixes this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122847
Approved by: https://github.com/soulitzer
ghstack dependencies: #122726
Fixes `During handling of the above exception, another exception occurred: [...] torch._dynamo.exc.Unsupported: generator`. traceback.format_exc uses generators, which aren't supported by dynamo yet.
<details>
<summary>current error message</summary>
```
======================================================================
ERROR: test_custom_fn_saved_tensors (__main__.TestCompiledAutograd)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/xmfan/core/pytorch/torch/fx/graph_module.py", line 307, in __call__
return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
File "/home/xmfan/core/pytorch/torch/nn/modules/module.py", line 1527, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/xmfan/core/pytorch/torch/nn/modules/module.py", line 1537, in _call_impl
return forward_call(*args, **kwargs)
File "<eval_with_key>.0", line 4, in forward
def forward(self, inputs, sizes, hooks):
IndexError: list index out of range
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/xmfan/core/pytorch/torch/testing/_internal/common_utils.py", line 2741, in wrapper
method(*args, **kwargs)
File "/home/xmfan/core/pytorch/test/inductor/test_compiled_autograd.py", line 499, in test_custom_fn_saved_tensors
self.check_output_and_recompiles(fn, 1)
File "/home/xmfan/core/pytorch/test/inductor/test_compiled_autograd.py", line 61, in check_output_and_recompiles
actual = list(opt_fn())
File "/home/xmfan/core/pytorch/test/inductor/test_compiled_autograd.py", line 495, in fn
loss.backward()
File "/home/xmfan/core/pytorch/torch/_tensor.py", line 534, in backward
torch.autograd.backward(
File "/home/xmfan/core/pytorch/torch/autograd/__init__.py", line 267, in backward
_engine_run_backward(
File "/home/xmfan/core/pytorch/torch/autograd/graph.py", line 766, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/home/xmfan/core/pytorch/torch/nn/modules/module.py", line 1527, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/xmfan/core/pytorch/torch/nn/modules/module.py", line 1537, in _call_impl
return forward_call(*args, **kwargs)
File "/home/xmfan/core/pytorch/torch/_dynamo/eval_frame.py", line 397, in _fn
res = fn(*args, **kwargs)
File "/home/xmfan/core/pytorch/torch/fx/graph_module.py", line 741, in call_wrapped
return self._wrapped_call(self, *args, **kwargs)
File "/home/xmfan/core/pytorch/torch/fx/graph_module.py", line 315, in __call__
_WrappedCall._generate_error_message(topmost_framesummary),
File "/home/xmfan/core/pytorch/torch/fx/graph_module.py", line 289, in _generate_error_message
tb_repr = get_traceback()
File "/home/xmfan/core/pytorch/torch/fx/graph_module.py", line 288, in get_traceback
return traceback.format_exc()
File "/home/xmfan/.conda/envs/benchmarks/lib/python3.10/traceback.py", line 183, in format_exc
return "".join(format_exception(*sys.exc_info(), limit=limit, chain=chain))
File "/home/xmfan/.conda/envs/benchmarks/lib/python3.10/traceback.py", line 136, in format_exception
return list(te.format(chain=chain))
File "/home/xmfan/core/pytorch/torch/_dynamo/convert_frame.py", line 941, in catch_errors
return callback(frame, cache_entry, hooks, frame_state, skip=1)
File "/home/xmfan/core/pytorch/torch/_dynamo/convert_frame.py", line 348, in _convert_frame_assert
unimplemented("generator")
File "/home/xmfan/core/pytorch/torch/_dynamo/exc.py", line 199, in unimplemented
raise Unsupported(msg)
torch._dynamo.exc.Unsupported: generator
```
</details>
With this change, we get back the descriptive error message:
<details>
<summary>post-fix error message</summary>
```
Traceback (most recent call last):
File "/home/xmfan/core/pytorch/torch/fx/graph_module.py", line 307, in __call__
return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
File "/home/xmfan/core/pytorch/torch/nn/modules/module.py", line 1527, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/xmfan/core/pytorch/torch/nn/modules/module.py", line 1537, in _call_impl
return forward_call(*args, **kwargs)
File "<eval_with_key>.0", line 4, in forward
def forward(self, inputs, sizes, hooks):
IndexError: list index out of range
Call using an FX-traced Module, line 4 of the traced Module's generated forward function:
def forward(self, inputs, sizes, hooks):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
getitem = inputs[0]
getitem_1 = inputs[1]; inputs = None
```
</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122746
Approved by: https://github.com/jansel, https://github.com/anijain2305
ghstack dependencies: #122691
Currently, when we create proxies for a list's elements in wrap_fx_proxy_cls, we create them using the same source as the list's, e.g. `LocalSource(inputs)` instead of `GetItemSource(LocalSource(inputs), index=i)`. This results in invalid guards when the tensors the list contains become dynamic, and the guard system thinks the list is a tensor:
```
Malformed guard:
L['sizes'][0] == L['inputs'].size()[0]
Malformed guard:
2 <= L['inputs'].size()[0]
Traceback [...]
AttributeError: 'list' object has no attribute 'size'
```
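A schematic contrast of the two sources (dynamo-internal classes, shown only to illustrate the guard naming):
```python
from torch._dynamo.source import GetItemSource, LocalSource

base = LocalSource("inputs")
element_0 = GetItemSource(base, 0)
# Old: elements proxied with the list's own source -> guards refer to L['inputs']
# New: element i proxied with a per-item source   -> guards refer to L['inputs'][i]
print(base.name(), element_0.name())
```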
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122691
Approved by: https://github.com/jansel, https://github.com/anijain2305
In ARC Runners we are using dind-rootless to run docker-in-docker and
in rootless mode volume mounts always mount as root but are mapped to
the local `runner` user in ARC. This causes the build.sh and test.sh
scripts to fail because they run as the `jenkins` user and expect to
be able to write to the workspace path that's being mounted.
Signed-off-by: Thanh Ha <thanh.ha@linuxfoundation.org>
misc-include-cleaner was introduced in clang-tidy-17 as a way to check for missing and unused includes. However, there are lots of transitive headers in PyTorch, and it would take enormous effort to add the related annotations to them in order to direct this checker. For this reason, it's better to disable it for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122855
Approved by: https://github.com/cpuhrsch
This fixes a bug when casting a module that has DTensor parameters. The old behavior swaps the .data field of the Tensor subclass, which is incorrect for tensor subclasses that may have multiple child tensors.
This uses the `swap_tensors` method to swap the whole tensors, not just the .data field.
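A minimal sketch of the difference (`torch.utils.swap_tensors` is the helper used; the parameters here are illustrative):
```python
import torch

p = torch.nn.Parameter(torch.zeros(4))
new_p = torch.nn.Parameter(torch.ones(4, dtype=torch.bfloat16))

# Old approach: only the .data field changes, which breaks subclasses such as
# DTensor that carry more state than a single .data tensor.
# p.data = new_p.data

# New approach: swap the entire tensor objects in place.
torch.utils.swap_tensors(p, new_p)
print(p.dtype)  # torch.bfloat16
```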
Test plan:
```
pytest test/distributed/_tensor/test_api.py -k 'test_distribute_module_casting'
python test/distributed/fsdp/test_wrap.py -k test_auto_wrap_smoke_test_cuda_init_mode1_cpu_offload0_use_device_id_True
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122755
Approved by: https://github.com/wanchaol, https://github.com/mikaylagawarecki
This patch addresses the major limitations in our previous [PR #115981](https://github.com/pytorch/pytorch/pull/115981) through the new dedicated repository [AOTriton](https://github.com/ROCm/aotriton):
- [x] Only supports MI200 series GPUs (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`).
  * MI300X is now supported. More architectures will be added once Triton supports them.
- [x] Only supports power-of-two sequence lengths.
  * It now supports arbitrary sequence lengths.
- [ ] No support for varlen APIs.
  * The varlen API will be supported in a future release of AOTriton.
- [x] Only supports head dimensions 16, 32, 64, 128.
  * It now supports arbitrary head dimensions <= 256.
- [x] Performance is still being optimized.
  * The kernel is selected according to autotune information from Triton.
Other improvements from AOTriton include:
* More flexible Tensor storage layouts
* A more flexible API
This is a more extensive fix to #112997
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121561
Approved by: https://github.com/huydhn
`dynamo.explain()` was updated to return a structure but the docs weren't updated to match.
- Update the docs to use the new API
- Remove some dead code left when `explain` was updated.
- Drive-by: Fix some `nopython` uses that I noticed
- Drive-by: I noticed an ignored error coming from CleanupHook on shutdown - make it check the global before setting it.
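A hedged sketch of the structured API (the field names shown, e.g. `graph_count` and `break_reasons`, are what I recall `ExplainOutput` exposing; check the updated docs for the full set):
```python
import torch

def f(x):
    x = x.sin()
    print("graph break here")  # print() forces a graph break
    return x.cos()

explanation = torch._dynamo.explain(f)(torch.randn(8))
print(explanation.graph_count)
print(explanation.graph_break_count)
print(explanation.break_reasons)
```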
Fixes #122573
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122745
Approved by: https://github.com/jansel
This PR reduces the difference between strict and non-strict exported program by
- Support `inline_constraints` for non-strict exported program
- Add runtime assertions for range constraints to non-strict exported program
After this PR, the following unit tests are no longer `expectedFailureNonStrict`:
- test_automatic_constrain_size
- test_export_with_inline_constraints
- test_redundant_asserts
- test_constrain_size_with_constrain_value
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122722
Approved by: https://github.com/pianpwk
Python 3.12 changed a few things with how `_PyInterpreterFrame`s are allocated and freed:
- Frames are now required to be placed on the Python frame stack. In 3.11, we could allocate frames anywhere in memory. In 3.12, we now need to use `THP_PyThreadState_BumpFramePointerSlow`/`push_chunk`/`allocate_chunk`. This method of allocating/freeing frames is also compatible with 3.11.
- The eval frame function is now responsible for clearing the frame (see https://docs.python.org/3/whatsnew/changelog.html#id128, the point about "...which now clear the frame.")
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122146
Approved by: https://github.com/jansel
Previously, we were checking `len(device_types)` where `device_types` is a `list`. This meant that if there were multiple inputs, we would see something like `device_types = ["cuda", "cuda"]` and a false positive warning. We should check `len(set(device_types))`.
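A tiny illustration of the check (`device_types` stands for the per-input list described above):
```python
device_types = ["cuda", "cuda"]    # two inputs on the same device
print(len(device_types) > 1)       # True  -> old check, false-positive warning
print(len(set(device_types)) > 1)  # False -> new check, warns only on real mixing
```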
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122726
Approved by: https://github.com/soulitzer
Summary:
Right now we don't insert additional observers (i.e., share observers) if qspec.dtype and qspec.is_dynamic match exactly. Since fixed qparams quantization spec and derived quantization spec do not currently have an is_dynamic field, observer sharing does not happen between them and quantization spec. In this PR we fix the issue by adding is_dynamic to all quantization specs.
Note: SharedQuantizationSpec should probably be its own type in the future.
TODO later:
(1) Move all these fields (dtype, is_dynamic, quant_min, quant_max, etc.) to QuantizationSpecBase.
(2) Make SharedQuantizationSpec a separate type.
(3) Add quant_min/quant_max to the observer sharing check in pt2e/prepare.py.
Test Plan:
python test/test_quantization.py -k test_fixed_qparams_qspec_observer_dedup
Differential Revision: [D55396546](https://our.internmc.facebook.com/intern/diff/D55396546)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122734
Approved by: https://github.com/andrewor14
This PR adds the vectorized indirect indexing so that we can further simplify the `CppVecKernelChecker` (done in the later PR #119734) and remove the check that throws `CppVecUnsupportedError`. A boundary assertion check is added on vectorized indices and via the new `indirect_assert` method on `Kernel` - the base implementation is for scalar indices, overridden in `CppVecKernel` for vectorized indices.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119655
Approved by: https://github.com/jansel
ghstack dependencies: #119654
Summary: It looks like this target has stopped working, let's fix it.
Test Plan:
```
buck2 run mode/opt //caffe2/benchmarks/dynamo/:test
```
now works
Differential Revision: D55389546
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122735
Approved by: https://github.com/xmfan
Vectorized boolean values in CPU Inductor were modeled with `Vectorized<float>` which cannot work for operations with other data types. This PR generalizes it with the new `VecMask` template class that can work for masks on any vectorized data types. The intrinsics implementation in `cpp_prefix.h` for mask conversion, cast and masked load are now implemented as the specialization for `VecMask` and moved to corresponding header files.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119654
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
This PR:
- disallows FakeTensor.data_ptr when it is called inside PT2 or fx tracing.
- disallows FunctionalTensor.data_ptr (python FunctionalTensor is only used in
PT2)
The motivation behind this is that the leading cause of segfaults when
using custom ops with PT2 is calling .data_ptr on FunctionalTensor or
FakeTensor.
This change is BC-breaking. If your code broke as a result of this, it's
because there was a bug in it (these .data_ptr should never be
accessed!). You can either fix the bug (recommended) or get the previous
behavior back with:
```
from torch._subclasses.fake_tensor import FakeTensor
from torch._subclasses.functional_tensor import FunctionalTensor
data_ptr = 0 if isinstance(tensor, (FakeTensor, FunctionalTensor)) else tensor.data_ptr()
```
Test Plan:
- existing tests
Differential Revision: [D55366199](https://our.internmc.facebook.com/intern/diff/D55366199)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122514
Approved by: https://github.com/ezyang, https://github.com/albanD, https://github.com/yifuwang, https://github.com/kurtamohler
Fixes #118795
This is a graph breaking partial fix for #120914. We still need -actual- module parametrization tracing support, but at least it doesn't blow up hard now.
**Background**: Module parametrization injects a property as the module parameter attribute that calls a `nn.Module` whose forward takes in a module parameter and returns a reparametrized module parameter.
Example:
```
class MyParametrization(nn.Module):
    def forward(self, X):
        # This reparametrization just negates the original parameter value
        return -X

m = nn.Linear(...)
p = MyParametrization()
register_parametrization(m, "weight", p)

# Accessing the "weight" attribute will invoke p's forward() on m's original weight and return the output as the new weight.
# m.weight here is now an injected property that does the above instead of an actual Parameter.
# This property is defined in torch/nn/utils/parametrize.py.
m.weight

# NB: Parametrization changes the module type (e.g. torch.nn.utils.parametrize.ParametrizedLinear)
print(type(m))
```
**Problem 1**: Dynamo has special tracing rules for things in `torch.nn`. Parametrizing a module changes the type of the module and the parametrized attribute, so now these rules wrongly affect tracing here. To fix this:
* For parametrized modules, call `convert_to_unspecialized()` to restart analysis where Dynamo starts inlining the module.
**Problem 2**: The issue seen in #118795 is that Dynamo will see a dynamically constructed tensor when `m.weight` is called and introduce that to its `tensor_weakref_to_sizes_strides` cache during fake-ification. This tensor is also made to be a graph input, since it's a module parameter. When guards are created for this module parameter input, the logic calls `m.weight` again and tries to look the result up in the cache, but this is a different tensor now, giving the `KeyError` symptom. To fix this:
* Replace Dynamo's `tensor_weakref_to_sizes_strides` cache with a `input_source_to_sizes_strides` cache.
* This cache was originally introduced in #100128.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121041
Approved by: https://github.com/anijain2305
Summary:
During tracing, some constants (tensor_constant{idx}) are generated internally.
Those constants are neither parameters nor buffers, and users have zero control over them.
To accommodate this, we should allow users not to pass in those internally generated constants while still being able to use the constants in the model.
Test Plan:
Included in commit.
```
build/bin/test_aot_inductor
```
Reviewed By: zoranzhao
Differential Revision: D55354548
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122690
Approved by: https://github.com/khabinov
By using `vcvt_f16_f32` and back.
According to [benchmark_convert.py](d3279637ca), this makes float32-to-float16 tensor conversion roughly 3 times faster: the time to convert a 4096x4096 float32 tensor drops from 5.23 msec to 1.66 msec on M2 Pro.
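A rough timing sketch (not the linked benchmark_convert.py; absolute numbers will vary by machine):
```python
import timeit

import torch

x = torch.randn(4096, 4096, dtype=torch.float32)
t = timeit.timeit(lambda: x.to(torch.float16), number=100) / 100
print(f"{t * 1e3:.3f} ms per float32 -> float16 conversion")
```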
Test plan: run `vector_test_all_types` + CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122702
Approved by: https://github.com/kimishpatel
`CXX_AVX[2|512]_FOUND` flags should indicate whether the compiler supports generating code for the given instruction set, rather than whether the host machine can run the generated code.
This fixes a weird problem that surfaced after https://github.com/pytorch/pytorch/pull/122503, where the builder can sometimes be dispatched to an old CPU architecture that cannot run AVX512 instructions but can compile for them just fine.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122708
Approved by: https://github.com/jeanschmidt
Fixes https://github.com/pytorch/pytorch/issues/122404
Previously, when rewriting c10d collectives, if the group argument was
unspecified or None, we created a world pg variable out of thin air and
passed it to the rewrite target. That approach was problematic, as it
assumed the symbol `torch` was available in the scope (see #122404).
After #120560, dynamo can now trace dist.group.WORLD. If the group
argument is unspecified, we can just set it to dist.group.WORLD in the
rewrite target.
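A small sketch of the traced pattern (assumes `dist.init_process_group` has already been called; the function is illustrative):
```python
import torch
import torch.distributed as dist

@torch.compile
def allreduce_sum(t):
    # group is left unspecified; the rewrite now fills in dist.group.WORLD
    dist.all_reduce(t)
    return t
```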
Testing
pytest test/distributed/test_inductor_collectives.py -k test_dynamo_rewrite_dist_allreduce
Also verified with the repro provided in #122404
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122561
Approved by: https://github.com/wconstab
ghstack dependencies: #120560
Summary:
This diff
* Refactors triton and autotune caches to be child classes of the original memcache based cache infra
* Swaps scuba table for autotune
* Adds autotune time spent/saved to scuba table
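A minimal shape sketch of that refactor (class and method names are hypothetical, not the actual cache infra):
```
class RemoteCacheBase:
    # Thin wrapper around a key/value backend (e.g. a memcache client).
    def __init__(self, backend):
        self._backend = backend

    def get(self, key):
        return self._backend.get(key)

    def put(self, key, value):
        self._backend.put(key, value)

class TritonRemoteCache(RemoteCacheBase):
    # Caches compiled Triton artifacts, keyed by source hash.
    pass

class AutotuneRemoteCache(RemoteCacheBase):
    # Caches autotune timings, keyed by kernel and input signature.
    pass
```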
Test Plan:
Local testing using:
```
buck run mode/opt fbcode//caffe2/test/inductor/:max_autotune -- -r test_max_autotune_remote_caching_dynamic_False
```
and
```
TORCH_INDUCTOR_AUTOTUNE_REMOTE_CACHE=1 buck2 run mode/opt //scripts/oulgen:runner
```
Differential Revision: D55332620
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122637
Approved by: https://github.com/jamesjwu
Summary:
The test is unstable at the moment. We need to make sure both the ATen
and the Triton kernels work before re-enabling the test.
Test Plan:
Disabling test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122682
Approved by: https://github.com/clee2000
This started as a re-land of https://github.com/pytorch/pytorch/pull/105590, focusing on enabling it on macOS, but it quickly turned into landing only very limited platform-specific acceleration for now (i.e. this PR does not add any NEON-accelerated code at all; it just enables vectorized compilation for the existing abstractions).
Enabling the test harness uncovered a number of latent issues in the CPU inductor that were fixed in the following PRs:
- https://github.com/pytorch/pytorch/pull/122511
- https://github.com/pytorch/pytorch/pull/122513
- https://github.com/pytorch/pytorch/pull/122580
- https://github.com/pytorch/pytorch/pull/122608
The following was added/changed to enable the vectorization code to work on macOS:
- Added VecNEON class to `_inductor/codecache.py` that is supported on all AppleSilicon Macs
- Added `Vectorized::loadu_one_fourth` to `vec_base.h`, and limit it to 8-bit types
- Changed the 64-bit integral type mappings to `int64_t`/`uint64_t` to align with the rest of the code, since on macOS `int64_t` is a `long long` rather than a `long` (see https://github.com/pytorch/pytorch/pull/118149 for more details)
See the table below for perf changes with and without torch.compile using [gpt-fast](https://github.com/pytorch-labs/gpt-fast) running `stories15M` on an M2 Pro (a minimal timing sketch follows the table):
| dtype | Eager | Compile (before) | Compile (after) |
| ------ | ------ | --------- | --------- |
| bfloat16 | 120 tokens/sec | 130 tokens/sec | 156 tokens/sec |
| float32 | 158 tokens/sec | 140 tokens/sec | 236 tokens/sec |
| float16 | 235 tokens/sec | 81 tokens/sec | 58 tokens/sec |
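For reference, a minimal sketch of the eager-vs-compiled comparison above (a hypothetical CPU micro-benchmark on a toy model, not gpt-fast itself):
```
import time
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512), torch.nn.GELU(), torch.nn.Linear(512, 512)
)
x = torch.randn(64, 512)

def throughput(fn, iters=50):
    fn(x)  # warm-up (triggers compilation for the compiled variant)
    start = time.perf_counter()
    for _ in range(iters):
        fn(x)
    return iters / (time.perf_counter() - start)

print(f"eager:    {throughput(model):.1f} iters/sec")
print(f"compiled: {throughput(torch.compile(model)):.1f} iters/sec")
```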
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122217
Approved by: https://github.com/jansel
Summary:
`torch.export` is a powerful tool for creating a structured and shareable package from arbitrary PyTorch code. One great use case of `torch.export` is sharing models or subgraphs in a way that allows results to be easily replicated. However, in the current implementation of `export`, the `example_inputs` field is thrown out. When trying to replicate bugs, benchmarks, or behaviors, losing the original input shapes and values makes the process much messier.
This change adds saving and loading for the `example_inputs` attribute of an `ExportedProgram` when using `torch.export.save` and `torch.export.load`. This simple addition makes `ExportedProgram`s a fantastic tool for performance and accuracy replication. For example, with this change we enable the following workflow:
```
# Script to create a reproducible accuracy issue with my model.
kwargs = {"fastmath_mode": True}
exp_program = export(my_model, sample_inputs, kwargs)
result = exp_program.module()(*sample_inputs, **kwargs)
# Uh oh, I don't like that result, let's send the module to a colleague to take a look.
torch.export.save(exp_program, "my_model.pt2")
```
My colleague can then easily reproduce my results like so:
```
# Script to load and reproduce results from a saved ExportedProgram.
loaded_program = torch.export.load("my_model.pt2")
# The following line is enabled by this Diff, we pull out the arguments
# and options that caused the issue.
args, kwargs = loaded_program.example_inputs
reproduced_result = loaded_program.module()(*args, **kwargs)
# Oh, I see what happened here, let's fix it.
```
Being able to share exact inputs and arguments makes `ExportedProgram`s much
cleaner and more powerful with little downside. The main potential issue with this change
is that it slightly increases the size of saved programs. However, the inputs
will be much smaller than the parameters in most cases. I am curious to hear
discussion on saved file size, though.
The deserialization of `example_inputs` is currently implemented as `Optional`. Although this won't affect users of `export.save` and `export.load`, it does give backwards compatibility to any direct users of `serialize` and `deserialize`.
Test Plan:
This diff includes a new test which exercises the save / load flow with multiple args and kwargs.
```
buck test //caffe2/test:test_export -- TestSerialize
```
Differential Revision: D55294614
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122618
Approved by: https://github.com/zhxchen17
If you have a tensor subclass that relies on dispatch into the same op without unwrapping and calling torch._C.DisableTorchFunctionSubclass() the torch function-ness will survive into AOTAutograd (when normally we may expect the torch function to be inlined away during dynamo). If this happens, we should make sure to not run the torch function logic a second time.
Fixes https://github.com/pytorch/pytorch/issues/120654, https://github.com/pytorch/pytorch/issues/120124
cc bdhirsh
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang
[ghstack-poisoned]
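To make the pattern concrete, here is a minimal sketch (a hypothetical subclass, not NT itself) of a `__torch_function__` override that re-dispatches into the same op under `torch._C.DisableTorchFunctionSubclass()` instead of unwrapping its inputs:
```
import torch

class LoggingTensor(torch.Tensor):
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        print(f"torch_function: {func}")
        # Re-dispatch into the same op without unwrapping; the subclass is
        # still attached to the inputs when the op runs underneath.
        with torch._C.DisableTorchFunctionSubclass():
            return func(*args, **kwargs)

x = torch.ones(2).as_subclass(LoggingTensor)
y = x + 1
print(type(y))
```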
"Zero-point must be Int32, Float or Half, found ",zero_point.scalar_type());
TORCH_CHECK(scale.dim()==1,"scale should be a 1-D tensor");