Fixes #115922
This PR was split out of the existing https://github.com/pytorch/pytorch/pull/116926 in order to apply the suggestions from that review.
`scalar_t`, which is defined as `c10::impl::ScalarTypeToCPPType<ScalarType::Half>::t`, appears to cause an internal compiler error with `Visual Studio 2022 17.8.4` (which ships with `MSVC 14.38.33130`).
Error message:
```
aten\src\ATen/cpu/vec/vec_base.h(150): fatal error C1001: Internal compiler error.
(compiler file 'D:\a_work\1\s\src\vctools\Compiler\CxxFE\sl\p1\c\toinil.c', line 910)
```
---
The related line (the `scalar_t` definition) was added earlier as a workaround for a similar issue: [Fix compile error for vs2022](https://github.com/pytorch/pytorch/pull/85958)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117497
Approved by: https://github.com/ezyang, https://github.com/malfet
(cherry picked from commit fa86fa7a61e7cb85e1d193ed69d41757abe43310)
Co-authored-by: Ozan Aydin <148207261+ozanMSFT@users.noreply.github.com>
For graphs with a single output, torch.export / torch.compile expect the graph_module to return a single torch.Tensor rather than a tuple.
However, after running the `_SplitterBase` partitioner on such a graph_module (obtained from torch.export/torch.compile), the resultant graph module returns a tuple of tensors, in this case `(output,)`.
This PR enables codegen on the graphs produced by the `_SplitterBase` partitioner. Setting this ensures pytree unflatten nodes are added automatically to handle unflattening of the output, so single outputs are returned directly.
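For illustration, a minimal sketch of the expected calling convention (plain torch.fx here, not the partitioner itself):
```python
import torch
import torch.fx as fx

class M(torch.nn.Module):
    def forward(self, x):
        return x.relu()

gm = fx.symbolic_trace(M())
print(type(gm(torch.ones(2))))  # <class 'torch.Tensor'>: a bare tensor

# Without output codegen, a partitioned module would instead return a
# one-element tuple `(output,)`, breaking callers that expect the bare
# tensor shown above.
```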
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120361
Approved by: https://github.com/angelayi
(cherry picked from commit 15add24bf28477843a7e13d9deaa4beb39473900)
Co-authored-by: Dheeraj Peri <peri.dheeraj@gmail.com>
It's already a requirement for building PyTorch, but it should also be a
requirement for linking extensions with it, as mixing compilers can lead to
runtime crashes: the `std::optional` template layout is incompatible between
gcc-9 and older compilers.
Also, update the minimum supported clang version to 9.x (used to build Android), as clang-5 is clearly not C++17 compliant.
Fixes https://github.com/pytorch/pytorch/issues/120020
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120126
Approved by: https://github.com/Skylion007
(cherry picked from commit 3ad067fe2b969d17773e9ada918c67da829bb5cc)
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Fixes #120044
Should fix the build-from-source instructions on the release branch here: https://github.com/pytorch/pytorch#from-source
Please note we are using the /test/ channel for the release here to make sure it works before the actual release is completed.
Test main:
```
make triton
pip3 uninstall -y triton
WARNING: Skipping triton as it is not installed.
Looking in indexes: https://download.pytorch.org/whl/nightly/
Collecting pytorch-triton==3.0.0+a9bc1a3647
Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.0.0%2Ba9bc1a3647-cp310-cp310-linux_x86_64.whl (239.0 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 239.0/239.0 MB 8.7 MB/s eta 0:00:00
Requirement already satisfied: filelock in /home/atalman/miniconda3/envs/py310/lib/python3.10/site-packages (from pytorch-triton==3.0.0+a9bc1a3647) (3.13.1)
Installing collected packages: pytorch-triton
Attempting uninstall: pytorch-triton
Found existing installation: pytorch-triton 2.2.0
Uninstalling pytorch-triton-2.2.0:
Successfully uninstalled pytorch-triton-2.2.0
Successfully installed pytorch-triton-3.0.0+a9bc1a3647
```
Test release/2.2:
```
make triton
pip3 uninstall -y triton
WARNING: Skipping triton as it is not installed.
Looking in indexes: https://download.pytorch.org/whl/test/
Collecting pytorch-triton==2.2.0
Using cached https://download.pytorch.org/whl/test/pytorch_triton-2.2.0-cp310-cp310-linux_x86_64.whl (183.1 MB)
Requirement already satisfied: filelock in /home/atalman/miniconda3/envs/py310/lib/python3.10/site-packages (from pytorch-triton==2.2.0) (3.13.1)
Installing collected packages: pytorch-triton
Attempting uninstall: pytorch-triton
Found existing installation: pytorch-triton 3.0.0+a9bc1a3647
Uninstalling pytorch-triton-3.0.0+a9bc1a3647:
Successfully uninstalled pytorch-triton-3.0.0+a9bc1a3647
Successfully installed pytorch-triton-2.2.0
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121169
Approved by: https://github.com/seemethere
Currently (after https://github.com/pytorch/pytorch/pull/114407), the user must pass the original user ``model`` to APIs such as ``ONNXProgram.__call__``, ``ONNXProgram.adapt_torch_inputs_to_onnx`` and ``ONNXProgram.adapt_torch_outputs_to_onnx``.
This was needed because when the model is fakefied, a version of the non-fakefied model is needed so that the initializers, buffers and constants can be extracted from a real model (and used as input to the ONNX model).
That approach places an unnecessary usability burden on the user when the model is not fakefied, because the model that was already passed to ``torch.onnx.dynamo_export`` could be used to extract the ``state_dict``.
This PR adds an ``ONNXProgram._model_torch`` attribute to store the user model and demotes the ``model`` argument of the aforementioned APIs from required to optional.
As a result, for the fakefied-model scenario, the user still needs to pass the model, but for non-fakefied models, the persisted model is implicitly used to extract the model ``state_dict``, making the APIs easier to use.
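A rough usage sketch of the resulting behavior (a sketch, assuming the 2.2-era ``torch.onnx.dynamo_export`` API with its onnxscript dependency; the ``model_with_state_dict`` keyword name is taken from the related #114407/#115461 discussion):
```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return x * 2.0

model, x = M(), torch.randn(2, 3)
onnx_program = torch.onnx.dynamo_export(model, x)

# Before this PR, the user model had to be threaded through again:
#   onnx_program.adapt_torch_inputs_to_onnx(x, model_with_state_dict=model)
# After it, for non-fakefied models the persisted model is used implicitly:
onnx_inputs = onnx_program.adapt_torch_inputs_to_onnx(x)
```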
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115281
Approved by: https://github.com/BowenBao
ghstack dependencies: #114407
Currently, the ONNX exporter using torch.nn.Module as input can support
FakeTensor because the ONNX model stores all initializers.
When using torch.export.ExportedProgram as input, the initializers are
lifted as inputs. In order to execute the ONNX model, we need to pass a
reference to the non-fake model to the
ONNXProgram.adapt_torch_inputs_to_onnx API, so that initializers can be
fetched from the model and fed to the ONNX model as inputs.
P.S.: https://github.com/pytorch/pytorch/issues/115461 will track the API revision for the cases where an additional `model_with_state_dict` is required to produce complete ONNX files exported with fake support. This is also tracked by the umbrella fake tensor issue https://github.com/pytorch/pytorch/issues/105464 FYI @BowenBao
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114407
Approved by: https://github.com/BowenBao
- Add call to `at::globalContext().userEnabledMkldnn()` to `apply_mkldnn_matmul_heur`
- Surround calls to `mkldnn_matmul` with `try {} catch {}`
- Print warning and fall back to BLAS (by calling `at::globalContext().setUserEnabledMkldnn()`) if `mkldnn_matmul()` fails
Test plan: On Linux arm run:
```shell
$ sudo chmod 400 /sys; python -c "import torch;m=torch.nn.Linear(1, 32);print(torch.__version__);print(m(torch.rand(32, 1)))"
Error in cpuinfo: failed to parse the list of possible processors in /sys/devices/system/cpu/possible
Error in cpuinfo: failed to parse the list of present processors in /sys/devices/system/cpu/present
Error in cpuinfo: failed to parse both lists of possible and present processors
2.3.0.dev20231215
bad err=11 in Xbyak::Error
bad err=11 in Xbyak::Error
/home/ubuntu/miniconda3/envs/py311/lib/python3.11/site-packages/torch/nn/modules/linear.py:116: UserWarning: mkldnn_matmul failed, switching to BLAS gemm:internal error (Triggered internally at /pytorch/aten/src/ATen/native/LinearAlgebra.cpp:1509.)
return F.linear(input, self.weight, self.bias)
tensor([[-0.5183, 0.2279, -0.4035, ..., -0.3446, 0.0938, -0.2113],
[-0.5111, 0.2362, -0.3821, ..., -0.3536, 0.1011, -0.2159],
[-0.6387, 0.0894, -0.7619, ..., -0.1939, -0.0282, -0.1344],
...,
[-0.6352, 0.0934, -0.7516, ..., -0.1983, -0.0247, -0.1366],
[-0.4790, 0.2733, -0.2862, ..., -0.3939, 0.1338, -0.2365],
[-0.5702, 0.1682, -0.5580, ..., -0.2796, 0.0412, -0.1782]],
grad_fn=<AddmmBackward0>)
```
Fixes https://github.com/pytorch/pytorch/issues/114750
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115936
Approved by: https://github.com/lezcano
Co-authored-by: Nikita Shulga <nshulga@meta.com>
A bug fix for a recently merged PR, per this comment: https://github.com/pytorch/pytorch/pull/101507#discussion_r1271393706
The following test would fail without this bug fix:
```
import torch

def test_erfinv():
    for device in ['cpu', 'mps']:
        x = torch.tensor([0.1, 0.2, 0.3, 0.4, 0.5], device=device)
        y = x[2:].erfinv()

        x2 = torch.tensor([0.3, 0.4, 0.5], device=device)
        y2 = x2.erfinv()

        print(y)
        print(y2)

        torch.testing.assert_close(y, y2)
        print(f"{device} passes.")

test_erfinv()
```
Cherry-pick of https://github.com/pytorch/pytorch/pull/105801 into release/2.2
Co-authored-by: Peter Pham <peterpham86@gmail.com>
Run slow/periodic and inductor workflows on push to release branches
Right now there is no signal from those jobs on release branches at all.
This will run periodic jobs on every commit to a release branch, which is fine, as release branches are short-lived and have much lower traffic than regular branches.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116516
Approved by: https://github.com/clee2000
Co-authored-by: Nikita Shulga <nshulga@meta.com>
# Summary
Found using the repros mentioned in this issue: #112577
After many go-rounds with compute-sanitizer and eventual printf debugging, I feel pretty confident that this was the underlying issue.
Cherry-pick of https://github.com/pytorch/pytorch/pull/116234 into release/2.2 branch
Related to #103973, #110532, #108404, #94891
**Context:**
As commented in 6ae0554d11/cmake/Dependencies.cmake (L1198)
Kernel asserts are enabled by default for CUDA and disabled for ROCm.
However, this was somewhat broken, and kernel asserts were still enabled for ROCm.
Disabling kernel asserts is also needed for users who do not have PCIe atomics support. These community users have verified that disabling kernel asserts in PyTorch on the ROCm platform fixed their PyTorch workflows, like a torch.sum script and stable-diffusion (see the related issues).
**Changes:**
This pull request serves the following purposes:
* Refactor and clean up the logic, making it simpler for ROCm to enable and disable kernel asserts
* Fix the bug that kernel asserts for ROCm were not disabled by default.
Specifically,
- Renamed `TORCH_DISABLE_GPU_ASSERTS` to `C10_USE_ROCM_KERNEL_ASSERT` for the following reasons:
(1) This variable only applies to ROCm.
(2) The new name is more aligned with the `#define CUDA_KERNEL_ASSERT` macro.
(3) With `USE_` in front of the name, we can easily control it with an environment variable to turn this feature on and off during the build (e.g. `USE_ROCM_KERNEL_ASSERT=1 python setup.py develop` will enable kernel asserts for a ROCm build).
- Got rid of `ROCM_FORCE_ENABLE_GPU_ASSERTS` to simplify the logic and make it easier to understand and maintain
- Added `#cmakedefine` to carry the CMake variable over to C++
**Tests:**
(1) Build in the default mode and verify that USE_ROCM_KERNEL_ASSERT is OFF (0) and kernel asserts are disabled:
```
python setup.py develop
```
Verify that CMakeCache.txt has the correct value.
```
/xxxx/pytorch/build$ grep USE_ROCM_KERNEL_ASSERT CMakeCache.txt
USE_ROCM_KERNEL_ASSERT:BOOL=0
```
Tested the following code in a ROCm build and a CUDA build, expecting different return codes.
```
subprocess.call([sys.executable, '-c', "import torch;torch._assert_async(torch.tensor(0,device='cuda'));torch.cuda.synchronize()"])
```
This piece of code is adapted from the unit test below, to get around the limitation that the unit test is currently skipped for ROCm. (We will look into enabling this unit test in the future.)
```
python test/test_cuda_expandable_segments.py -k test_fixed_cuda_assert_async
```
Ran the following script, expecting r == 0 since CUDA_KERNEL_ASSERT is defined as nothing:
```
>>> import sys
>>> import subprocess
>>> r=subprocess.call([sys.executable, '-c', "import torch;torch._assert_async(torch.tensor(0,device='cuda'));torch.cuda.synchronize()"])
>>> r
0
```
(2) Enable the kernel assert by building with USE_ROCM_KERNEL_ASSERT=1, or USE_ROCM_KERNEL_ASSERT=ON
```
USE_ROCM_KERNEL_ASSERT=1 python setup.py develop
```
Verify `USE_ROCM_KERNEL_ASSERT` is `1`
```
/xxxx/pytorch/build$ grep USE_ROCM_KERNEL_ASSERT CMakeCache.txt
USE_ROCM_KERNEL_ASSERT:BOOL=1
```
Run the assert test, expecting a return code not equal to 0.
```
>>> import sys
>>> import subprocess
>>> r=subprocess.call([sys.executable, '-c', "import torch;torch._assert_async(torch.tensor(0,device='cuda'));torch.cuda.synchronize()"])
>>>/xxxx/pytorch/aten/src/ATen/native/hip/TensorCompare.hip:108: _assert_async_cuda_kernel: Device-side assertion `input[0] != 0' failed.
:0:rocdevice.cpp :2690: 2435301199202 us: [pid:206019 tid:0x7f6cf0a77700] Callback: Queue 0x7f64e8400000 aborting with error : HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception. code: 0x1016
>>> r
-6
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114660
Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/jithunnair-amd
(cherry picked from commit 66a76516bfc341b2b55bb2056d2faa9c2de46d69)
Co-authored-by: hongxyan <hongxyan@amd.com>
* [DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#115193)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115193
Rename _device_mesh.py to device_mesh.py, update all callsites, and add documentation.
We created stubs for the public class and methods in torch.distributed.device_mesh so that torch.distributed.device_mesh can be imported whether or not distributed is available.
Original diff reverted: D51629761
Original PR reverted: https://github.com/pytorch/pytorch/pull/115099
Prior to landing, all CI signals passed. Shipit added the "ci/trunk" label to the PR, DID NOT wait for it, and went ahead committing. More context can be found in the reverted PR above.
Test Plan: CI.
Differential Revision: D51861018
fbshipit-source-id: dc7b26cea7340d55498730123e82a42cef46ff55
* fix doc
* Update device_mesh.py docs imports
#116074
Some typos resulted in the note section not being rendered properly; this couldn't be seen from the last PR directly, as the last PR only showed the first commit's documentation :(
Also make the parallelize_module doc example more concrete.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115974
Approved by: https://github.com/wz337
The 2MB THP (transparent huge pages) pages provide better allocation latencies compared to the standard 4KB pages. This change has shown substantial improvement for batch-mode use cases where the tensor sizes are larger than 100MB.
Only enabled if `THP_MEM_ALLOC_ENABLE` environment variable is set.
Relanding https://github.com/pytorch/pytorch/pull/93888 with functionality disabled for Android
Cherry-pick of https://github.com/pytorch/pytorch/pull/107697 into release/2.2 branch
(cherry-picked from commit 88207b10cab33b08a15a9009630b5c1e7549ea2b)
Because not all tests in the Dynamo shard actually run in CI, we've
started to bitrot on this implementation. Since our plan is to trace
into the functorch implementations instead of constructing a HOP
(which is what capture_func_transforms=True does), let's turn off this
config by default.
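For reference, a minimal sketch of opting back in (assuming the config name stays `capture_func_transforms`):
```python
import torch

# default after this PR: trace into the functorch implementations.
# Opting back in to the old HOP-based capture would look like:
torch._dynamo.config.capture_func_transforms = True
```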
Test Plan:
- Tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115267
Approved by: https://github.com/voznesenskym, https://github.com/guilhermeleobas
Cherry pick of https://github.com/pytorch/pytorch/pull/115193 into release/2.2 branch
Rename `_device_mesh.py` to `device_mesh.py`, update all callsites, and add documentation.
We created stubs for the public class and methods in torch.distributed.device_mesh so that torch.distributed.device_mesh can be imported whether or not distributed is available.
Original diff reverted: D51629761
Original PR reverted: https://github.com/pytorch/pytorch/pull/115099
Prior to landing, all CI signals passed. Shipit added the "ci/trunk" label to the PR, DID NOT wait for it, and went ahead committing. More context can be found in the reverted PR above.
Test Plan: CI.
Differential Revision: D51861018
fbshipit-source-id: dc7b26cea7340d55498730123e82a42cef46ff55
In certain edge cases when using lazy tensors, the base tensor stored in the `FunctionalStorageImpl` and the `value_` tensor stored in the `FunctionalTensorWrapper` diverge. For instance, take this simple example
```python
import torch

class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(4, 2, bias=False)

    def forward(self, x):
        return x @ self.fc1.weight.transpose(0, 1)

with torch.device("lazy"):
    model = Model()
    x = torch.ones(4)
    out = model(x)
```
The call to `transpose` on the lazily initialized weight `fc1.weight` applies a view op to the functional tensor, which only gets propagated to the functional tensor wrapper and not to the base tensor in the storage, thus causing them to diverge.
To fix this behaviour, we need to reset the functional tensor's storage. To facilitate this, we add an `_unsafe_reset_storage` method to `FunctionalTensorWrapper` which clears away the old storage and view metas.
Porting over PR from https://github.com/pytorch/pytorch/pull/115235
Cherry-picked: 73c0035160e7b2c5772417bb7206b316bdf34044
There was a SEV when FSDP modules were registered as graph attributes; this unit test prevents it from happening again.
Without the SEV fix (D48810186):
```
python test/distributed/test_dynamo_distributed.py -k
test_fsdp_skip_register_attr_or_module
File "/data/users/weif/pytorch/torch/_dynamo/repro/after_dynamo.py",
line 117, in debug_wrapper
compiled_gm = compiler_fn(gm, example_inputs)
File
"/data/users/weif/pytorch/test/distributed/test_dynamo_distributed.py", line 897, in debug_compiler
self.assertFalse(name in node.name, f"FSDP module {name} should not
be registered as attributes")
torch._dynamo.exc.BackendCompilerFailed: backend='debug_compiler' raised:
AssertionError: True is not false : FSDP module l__self___net_0_weight should not be registered as attributes
```
With the SEV fix (D48810186):
```
python test/distributed/test_dynamo_distributed.py -k test_fsdp_skip_register_attr_or_module
Ran 1 test in 6.438s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115112
Approved by: https://github.com/mlazos
I was curious what hf_T5_generate was trying to deepcopy, so I updated the error message:
Before:
```
STATS graph_break
("'skip function deepcopy in file /home/jansel/conda/envs/pytorch/lib/python3.10/copy.py'', skipped according skipfiles.SKIP_DIRS'", 3)
...
```
After:
```
STATS graph_break
('copy.deepcopy UserDefinedObjectVariable(GenerationConfig)', 3)
...
```
Related issue: #115122
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115120
Approved by: https://github.com/oulgen
ghstack dependencies: #115095, #115046, #115057, #115119
**Summary**:
DTensor sharding propagation returns an `OpStrategy` object in the case of a
tuple of multiple DTensors with the same `placements`, and this object is later
expanded into a tuple of `DTensorSpec`s. However, the expansion copied the
object's reference instead of copying/creating new objects, and
this led to a wrong-override issue in the tensor meta propagation logic.
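The failure mode, sketched in plain Python (illustrative names only, not the actual `DTensorSpec` code):
```python
import copy

spec = {"placements": ("Shard(0)",), "tensor_meta": None}

aliased = (spec, spec)                            # reference copied twice
fresh = tuple(copy.copy(spec) for _ in range(2))  # independent objects

aliased[0]["tensor_meta"] = "meta_for_output_0"
print(aliased[1]["tensor_meta"])  # "meta_for_output_0" -- wrongly overridden
fresh[0]["tensor_meta"] = "meta_for_output_1"
print(fresh[1]["tensor_meta"])    # None -- unaffected
```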
**Test**:
pytest test/distributed/_tensor/test_math_ops.py
pytest test/distributed/_tensor/test_dtensor_ops.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115161
Approved by: https://github.com/wanchaol
I hit the following error when performing pipeline parallel for T5:
```
return default_pg.send([tensor], dst, tag)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: Tensors must be contiguous
```
In theory, we shouldn't require the tensors to be contiguous, especially for P2P ops, because we are just doing a bit-wise copy.
Thus, this PR relaxes the requirement and instead calls out that it is the user's responsibility to guarantee that the source and destination tensors have the same contiguity setting.
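A small sketch of what this means in practice:
```python
import torch

t = torch.ones(4, 4).t()       # a transposed, non-contiguous view
assert not t.is_contiguous()

# Before this PR, p2p ops such as dist.send(t, dst=1) raised
# "ValueError: Tensors must be contiguous". With the check relaxed, the
# user must keep src/dst contiguity in sync; when in doubt, send an
# explicitly contiguous copy:
t_safe = t.contiguous()
```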
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114982
Approved by: https://github.com/H-Huang
Summary: When the model takes no inputs, AOTInductor relies on checking weights to figure out which device to compile the model into. Currently recording buffer device type happens too late, and this PR fixes that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114682
Approved by: https://github.com/chenyang78
Users may wish to torch.compile custom ops that mutate their inputs
and return nothing (this is a common class of operators).
torch.compile will automatically support this op without anyone needing
to provide a functionalization kernel for it. Here's how.
Let's say we have a hypothetical mylib::sin_(Tensor(a!) x) -> ()
op. First, when FakeTensor sees this op, it can just return None.
This is the case because custom ops are not allowed to mutate input
metadata, so the FakeTensor rule for one that returns nothing is trivial.
Next, when Python FunctionalTensor sees the op, it will functionalize
it by emitting a call to an auto_functionalize(op, ["x"], {"x": ...})
HOP and replacing the mutated inputs with the outputs of this HOP.
This HOP effectively runs the functional version of the op when
called: it clones inputs that will be mutated, runs the op, and
then returns Tensors with the new values.
In the future we can teach Inductor how to do re-inplacing when it sees
this HOP (like how triton kernels do it) but this isn't urgent (and is
more of a performance problem).
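A minimal sketch of such an op, using the `torch.library` API (the `mylib::sin_` name mirrors the hypothetical example above; exact behavior under compile depends on the build):
```python
import torch

torch.library.define("mylib::sin_", "(Tensor(a!) x) -> ()")

@torch.library.impl("mylib::sin_", "CompositeExplicitAutograd")
def sin_(x):
    x.sin_()

@torch.compile(fullgraph=True)
def f(x):
    torch.ops.mylib.sin_(x)  # functionalized via the auto_functionalize HOP
    return x + 1
```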
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114955
Approved by: https://github.com/bdhirsh
Summary:
In commit 9cc040fef64154a2424b2ccd2c0909641e245cf0, we accidentally changed some of the environment variable names to the non-deprecated form. The intent was to support both the deprecated and the new forms of the env variables (with a warning thrown for the deprecated form).
Test Plan:
OSS CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115082
Approved by: https://github.com/zdevito
Summary:
As discussed in D51695982, fusion may not always be good. We want to let the user customize the fx passes.
Some examples for the new configs:
* Use the batch_fusion config: this will automatically use the following batch fusions, including batch linear, layernorm, relu, tanh, sigmoid, and post-grad batch linear fusion.
* Use the pre_grad_fusion_options config:
```
"pre_grad_fusion_options": {
"batch_linear": {"min_fuse_set_size": 10},
"batch_linear_lhs": {},
"batch_layernorm": {"max_fuse_search_depth": 100},
"batch_tanh": {},
"batch_relu": {},
"batch_sigmoid": {}
},
```
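For instance, wiring the options shown above into inductor's config could look like this (a sketch, assuming the `pre_grad_fusion_options` key added here):
```python
import torch._inductor.config as inductor_config

inductor_config.pre_grad_fusion_options = {
    "batch_linear": {"min_fuse_set_size": 10},
    "batch_layernorm": {"max_fuse_search_depth": 100},
    "batch_tanh": {},
}
```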
Test Plan:
with flag: f509168388
with config: f509168595
Reviewed By: frank-wei, mengluy0125
Differential Revision: D51817314
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115128
Approved by: https://github.com/mengluy0125
Summary:
Rename _device_mesh.py to device_mesh.py, update all callsites, and add documentation.
Original diff reverted: D51629761
Original PR reverted: https://github.com/pytorch/pytorch/pull/114991
It was failing a public module binding test on macOS, due to the change in import order for torch/distributed/fsdp/_common_utils.py. Since the original import would still work, we removed the changes in this file.
Test Plan: CI.
Differential Revision: D51825114
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115099
Approved by: https://github.com/wanchaol, https://github.com/fegin
`ir.Wait` generates the last 2 lines of this code:
```python
buf1_work = dist.all_gather_into_tensor(buf1[0], buf1_inputs[0], async_op=True, group=buf1_pg)
fun_col_impl._register_tensor_work(buf1, buf1_work)
buf2 = buf1[0]
del buf1
buf2 = _wait_tensor(buf2) # <- generated by ir.Wait
buf3 = buf2; # reuse <- generated by ir.Wait
```
`_wait_tensor` technically is a "mutation" op that changes `buf2` in place. So we should mark `ir.Wait` as a mutation op (by overriding its `get_mutation_names()`).
This fixes a very peculiar issue when inductor comm reordering is used for the llama model: downstream nodes that use the all-gather comm output sometimes take a dependency on `buf2` (the node before `ir.Wait`) instead of on `buf3` (`ir.Wait`) (it's still unclear why it behaves like this). To work around the issue, we add the missing annotation that `buf3` is a mutation of `buf2`, so that the scheduler knows to schedule `buf3` before any of `buf2`'s users.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115104
Approved by: https://github.com/wanchaol
This fixes the "should have been handled in replace_random.py" error
raised during lowering.
I also fixed `test_randn_generator` to catch any regressions.
Previously, it did not use the result of randn(), so dynamo tracing
omitted that node entirely.
Fixes #114203.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115051
Approved by: https://github.com/eellison
As titled, this PR fixes the empty-shape init case: if we pass in
something like `torch.dtensor.zeros([])`, it should call `torch.zeros([])`
under the hood, not `torch.empty(0)`. This makes the dtensor constructor and
the torch constructor align.
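The distinction, in plain torch:
```python
import torch

assert torch.zeros([]).shape == torch.Size([])  # 0-dim scalar tensor
assert torch.empty(0).shape == torch.Size([0])  # 1-d tensor with 0 elements
```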
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115091
Approved by: https://github.com/XilunWu
Fixes #88443
Forces the internal `dtype` of `torch.distributions.von_mises.VonMises` to be `torch.double` and mirrors the numpy implementation of the second order Taylor expansion for `concentration < 1e-5`. Samples and log probs are returned with `dtype` of argument `loc`.
In principle one could also use masking in the rejection sampler to return uniformly distributed numbers for `concentration < 1e-8`, as in numpy. This may be slightly more efficient, but isn't required to solve the hanging issue.
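A small sketch of the resulting behavior:
```python
import torch
from torch.distributions import VonMises

# concentration below 1e-5 exercises the Taylor-expansion path; samples
# and log probs come back with the dtype of `loc`
d = VonMises(loc=torch.tensor(0.0), concentration=torch.tensor(1e-6))
s = d.sample()
print(s.dtype, d.log_prob(s).dtype)  # torch.float32 torch.float32
```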
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114498
Approved by: https://github.com/fritzo
Summary:
The FX graph mode quant workflow and the pt2e flow rely on the `is_dynamic` flag in the observer/QuantizationSpec to
convert an observer to dynamic quantization patterns (choose_qparams -> q -> dq). This PR adds the is_dynamic flag
to all observers so that it's possible to convert these observers to the pattern.
However, this dynamic quantization pattern (choose_qparams -> q -> dq) is actually only valid for MovingAverageObserver (averaging_constant=1),
so that the computations before convert and after convert match in the context of QAT. So we'll have some sanity
checks in the other observers to make sure their is_dynamic is False.
Test Plan:
python test/test_quantization.py TestXNNPACKQuantizer.test_qat_dynamic_linear
Differential Revision: [D51124725](https://our.internmc.facebook.com/intern/diff/D51124725)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113288
Approved by: https://github.com/kimishpatel
Summary:
Performance with a 2D texture for weight and bias is better for quantized conv2d; the un-quantized version of conv2d also uses a 2D texture.
The performance gain is:
```
Kernel Name                     Workgroup Size  Duration P50 (ns)
===========                     ==============  =================
With 3D:
vulkan.quantized_conv2d         {96, 72, 2}     5965440
vulkan.quantized_conv2d         {96, 72, 2}     11316968
vulkan.quantized_conv2d_dw      {96, 72, 2}     2735564
vulkan.quantized_conv2d_pw_2x2  {96, 72, 2}     1645696
With 2D:
vulkan.quantized_conv2d         {96, 72, 2}     4295772
vulkan.quantized_conv2d         {96, 72, 2}     7874620
vulkan.quantized_conv2d_dw      {96, 72, 2}     2658552
vulkan.quantized_conv2d_pw_2x2  {96, 72, 2}     1632020
```
Test Plan:
Ensure all vulkan quantize tests pass:
```
buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output
Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
[==========] Running 78 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 78 tests from VulkanAPITest
....
[----------] 78 tests from VulkanAPITest (1519 ms total)
[----------] Global test environment tear-down
[==========] 78 tests from 1 test suite ran. (1519 ms total)
[ PASSED ] 78 tests.
```
```
buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output
Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
[==========] Running 395 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 395 tests from VulkanAPITest
......
[----------] 395 tests from VulkanAPITest (6515 ms total)
[----------] Global test environment tear-down
[==========] 395 tests from 1 test suite ran. (6515 ms total)
[ PASSED ] 394 tests.
[ SKIPPED ] 1 test, listed below:
[ SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
YOU HAVE 5 DISABLED TESTS
```
Reviewed By: yipjustin
Differential Revision: D50997534
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114902
Approved by: https://github.com/yipjustin
Accounts for the case where `state_dict` keys may be present in different orders. Since users may be calling collectives in their `state_dict` and `load_state_dict` calls, differently ordered keys could cause a deadlock. This is mostly a defensive move, meant to match the feature in TSS.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114304
Approved by: https://github.com/fegin, https://github.com/wz337
Summary:
This PR adds in support for passing in an alpha Tensor, which represents
a tensor of alpha values to fuse into the matmul:
```
cusparselt_sparse_mm = alpha * (A @ B) + bias
```
This operation is necessary for quantization, where we would like to
fuse one of the dequant matmuls into the sparse op.
Test Plan:
```
python test/test_sparse_semi_structured.py -k alpha
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112056
Approved by: https://github.com/cpuhrsch
Summary:
We did two things:
1. We add back the batch_fusion and group_fusion flags to keep the current production model implementation.
2. We distinguish between batch and group fusion in the post-grad pass, since group fusion needs fbgemm.
Test Plan:
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:group_batch_fusion
```
Buck UI: https://www.internalfb.com/buck2/13d152d2-5d4d-4c7a-ab88-51f8e8218942
Test UI: https://www.internalfb.com/intern/testinfra/testrun/1125900253044737
Network: Up: 376KiB Down: 44KiB (reSessionID-c508aedc-8cc2-434a-8c17-bbe075a05562)
Jobs completed: 17. Time elapsed: 1:23.1s.
Cache hits: 0%. Commands: 1 (cached: 0, remote: 0, local: 1)
Tests finished: Pass 6. Fail 0. Fatal 0. Skip 0. Build failure 0
Differential Revision: D51695982
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114841
Approved by: https://github.com/jackiexu1992
This was invaluable when I was debugging #114917. Without the node names
in the log message, it was difficult to make sense of them.
However, I did not want to bloat the number of LOC with this change.
Thus, instead of calling `debug()` directly with the node arguments, I
made a new callable class WhyNoFuse to partially apply the node
arguments at the top of each fusion-checking method. WhyNoFuse generates
the logging string only when its `__str__` method gets called, so there
is minimal overhead when logging is disabled.
I also removed the various logging 'tags' like "vert:1" / "triton:1" --
the log messages themselves are unique enough that the user can identify
them without the tag.
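A sketch of the pattern (simplified from inductor's scheduler; names abbreviated):
```python
import logging

fusion_log = logging.getLogger("fusion")

class WhyNoFuse:
    # constructed once at the top of a fusion-checking method; the full
    # message is only built when logging is enabled and %s forces __str__
    def __init__(self, node1, node2):
        self.node1, self.node2 = node1, node2
        self.reason, self.args = "", ()

    def __call__(self, reason, *args):
        self.reason, self.args = reason, args
        fusion_log.debug("cannot fuse: %s", self)

    def __str__(self):
        return f"{self.node1} <-> {self.node2}: " + (self.reason % self.args)

# usage inside a fusion check:
#   why = WhyNoFuse(node1, node2)
#   if device1 != device2:
#       why("device mismatch (%s vs %s)", device1, device2)
```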
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115003
Approved by: https://github.com/Skylion007
Enable pylint rules `PLC0131` and `PLC0132`. There was a violation of `PLC0132`, so this commit also fixes it and enables the rules so the violation does not occur again. `PLC0205` checks for accidentally setting your `__slots__` to a string, which is almost always a bug.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115015
Approved by: https://github.com/jansel, https://github.com/malfet
This PR rewrites the Tensor Parallel implementation. The Tensor Parallel APIs
are supposed to be a very thin wrapper around the DTensor APIs, but the current
implementation got too messy and buggy. It's really hard to debug what
went wrong when using it. It's crucially important for advanced users or
developers to understand the API and its implementation easily, without
going through all the different types of functions and utils, so that
they can trust what happens under the hood.
In particular this PR:
* Make ParallelStyle a real contract API for parallelize_module to
take; each concrete ParallelStyle only needs to implement `apply` to
apply the sharding to an nn.Module, removing all unnecessary fields. This
also enables easier ParallelStyle authoring going forward.
* Keep the ColwiseParallel and RowwiseParallel public interfaces, but
refactor them in a way that makes the parameter sharding and the input and
output handling live within the style itself, so that it's easy to
understand how Linear/Embedding layers are sharded and how the inputs/outputs
transformations are performed.
* Remove all the private _prepare_input/_prepare_output_fn fields for
both ColwiseParallel/RowwiseParallel. Since we have thrown deprecation
messages in nightlies for a while, TP is a prototype release, and the
fields are private, it should be safe to remove them.
* Refactor the recently landed PrepareModuleInput/Output styles: change
output_layouts to desired_input/output_layouts, group
the functions inside the style itself, and have no default arguments for these
two styles, so the user needs to specify them and think about the sharding
layouts. Fixed bugs where the
`use_local_output` flag was not handled.
* Make default arguments None instead of a Placement object; it is
standard Python practice not to have a custom object instance as a default
argument.
* Remove all dead APIs (i.e. the PairwiseParallel and SequenceParallel
styles and all prepare input/output functions), as we have thrown deprecation
msgs for a while; we are in the process of removing all of them from the tests.
* Throw a deprecation warning for `tp_mesh_dim`, as we recommend using device
mesh slicing/indexing instead of manually specifying the mesh dim.
* Rewrite the documentation for every ParallelStyle and make it
clearer what each style is doing.
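A minimal usage sketch of the resulting API surface (assuming the 2.2-era `torch.distributed.tensor.parallel` entry points; mesh setup elided):
```python
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

# inside an initialized process group:
#   mesh = init_device_mesh("cuda", (8,))
#   model = parallelize_module(model, mesh, {
#       "feed_forward.w1": ColwiseParallel(),
#       "feed_forward.w2": RowwiseParallel(),
#   })
```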
TODOs:
* Rewrite TP tests to adjust for the changes we have in this PR
* add more tests to guard the bug fixes
Differential Revision: [D51761183](https://our.internmc.facebook.com/intern/diff/D51761183)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114732
Approved by: https://github.com/wz337, https://github.com/fduwjj
Summary:
As titled; this is because the histogram observer does not work for a corner case in mobilebert (observing a scalar tensor of the float32 max value),
because the histc operator errors out when the value is larger than a certain number.
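A sketch of the corner case (assuming this build reproduces the histc failure):
```python
import torch

t = torch.tensor([torch.finfo(torch.float32).max])
try:
    torch.histc(t, bins=2048)  # may error out for values this large
except RuntimeError as e:
    print("histc failed:", e)
```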
Test Plan:
python test/test_quantization.py -k test_mul_float32_max
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113405
Approved by: https://github.com/mcr229
**Dynamo**
We don't want setattr in the graph. Setting data has interesting implications on both aliasing and on the autograd engine.
The safe recipe is:
1) Disable grad
2) Call set_()
3) Manually lower the version counter on the object to hide it from the autograd engine
This is effectively the same exact thing as setting .data, and it composes properly with aot_autograd and inductor.
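A sketch of that recipe in user-level code (step 3 is shown only as a comment, since dynamo performs it through an internal autograd binding):
```python
import torch

x = torch.ones(3, requires_grad=True)
new = torch.zeros(5)

with torch.no_grad():  # step 1: disable grad
    x.set_(new)        # step 2: swap in the new storage/metadata
# step 3 (not shown): lower x's version counter so the autograd engine
# does not see the mutation -- together this emulates `x.data = new`
```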
**aot_autograd**
For aot_autograd, there's another snag.
Specifically, when we invoke aot_autograd, we call `fake_mode.from_tensor()`, relying on memo to get the right tensor out. For .data mutations, this doesn't work, because the memoized fake_tensor is in the state it will be in at the end of the trace, not at the beginning. This means that the .data call is already applied, and the tensor shape (as in the case of these tests) mismatches. aot_autograd produces an invalid graph, with illegal calls like `torch.ops.aten.view.default(primals_2, [0])` where primals is actually sized `([6])` on input.
The new plan here is to:
1) Record tensor fakification policy in dynamo
2) provide a fresh fake mode to all backends
3) Invoke from_tensor with the stored policy to get fresh new fake tensors in aot_autograd
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113080
Approved by: https://github.com/bdhirsh
Previously, both the `optimize_ctx` call and the `experiment` call would do export and session creation, doubling the resource cost. This PR makes the `experiment` call re-use the ONNX model created by `optimize_ctx`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114907
Approved by: https://github.com/thiagocrepaldi
ghstack dependencies: #110178
This PR converts the stride to the default contiguous stride in `mkldnn_linear_pointwise` before calling oneDNN, so as to hit an optimization path similar to https://github.com/pytorch/pytorch/pull/99511. Also refactored the code to provide a common utility function.
https://github.com/pytorch/pytorch/pull/111976 ignores dims of value 1 in Require_Stride_order. For a tensor with `size = [1, 1280]`, `stride = [0, 1]`:
**Before the above PR**, it was considered non-contiguous; thus, in the call below, it was converted to `size = [1, 1280]`, `stride = [1280, 1]`:
25b83521be/torch/_inductor/ir.py (L5263)
**After the above PR**, dims of value 1 are ignored, so this tensor is considered already contiguous and we'd feed a tensor with `stride = [0, 1]` to oneDNN, which results in poor performance.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114939
Approved by: https://github.com/jgong5
We want to be able to benchmark epilogue-fused triton matmul kernels for a couple of reasons:
1. @eellison found that certain TB models (resnet50, resnet152, moco) sometimes fail in max-autotune mode on the dashboard. The issue is quite hard to repro due to flakiness, and it only gets triggered when a certain triton config for a certain epilogue-fused kernel gets picked (disabling epilogue fusion bypasses the issue). It would be nice to have a runnable script that directly runs that kernel to ease further debugging.
2. This is a necessary piece for doing benchmark fusion for triton matmul kernels. cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler for this
Example runnable kernel script: https://gist.github.com/shunting314/00bdbc1b6b46bfa73d1389d8f40cd669
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114809
Approved by: https://github.com/eellison
Previously:
```
[W Utils.hpp:133] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
[W Utils.hpp:133] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
```
With this PR, those warnings disappear. They were introduced in #114077
This change was generated with this sed script, applied with `sed -i -f /tmp/x **/*.{py,hpp,cpp,cc}` and hand inspected.
```
s/\bNCCL_BLOCKING_WAIT\b/TORCH_NCCL_BLOCKING_WAIT/g
s/\bNCCL_ENABLE_TIMING\b/TORCH_NCCL_ENABLE_TIMING/g
s/\bNCCL_DESYNC_DEBUG\b/TORCH_NCCL_DESYNC_DEBUG/g
s/\bNCCL_ASYNC_ERROR_HANDLING\b/TORCH_NCCL_ASYNC_ERROR_HANDLING/g
s/\bENABLE_NCCL_HEALTH_CHECK\b/TORCH_ENABLE_NCCL_HEALTH_CHECK/g
s/\bNCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK\b/TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK/g
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114880
Approved by: https://github.com/kwen2501
The linter uses libcst to check for a call to run_tests or a raised exception when the test file is run as main, to ensure that all test files either get run in OSS CI or don't run and are expected not to run.
A better option than making this a linter might be to add this code to run_test, since there's also a list of blocklisted tests there that needs to be updated when a test file raises an exception.
This is possibly overkill, since run on its own the code takes ~1 minute to run without multiprocessing on all the files.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114882
Approved by: https://github.com/kit1980
The diff looks messy, but this PR essentially does one thing: move non-nll-loss tests in the `TestNLLLoss` class to the `TestMPS` class. After doing so, it ends up with two stack tests of the same name, `test_stack`; therefore, I rename one of them to `test_stack_storage_offset`, which is what that test actually does.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113312
Approved by: https://github.com/kulinseth
Summary: This is a follow-up from D51428979. These tests should be run only from `TestQuantizePT2EQAT_ConvBn1d` and `TestQuantizePT2EQAT_ConvBn2d`. The base class doesn't have the necessary setup to run them and will fail as expected. I previously ignored the failures on D51428979, and these failed tests have been disabled.
Test Plan:
Run an example test there and confirm that two versions from `TestQuantizePT2EQAT_ConvBn1d` and `TestQuantizePT2EQAT_ConvBn2d` are run while the one from `BaseTestQuantizePT2EQAT_ConvBn` is skipped
```
$ buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/quantization:test_quantization -- --run-disabled 'caffe2/test/quantization:test_quantization - test_qat_conv_bn_fusion_literal_args'
File changed: fbcode//caffe2/test/quantization/pt2e/test_quantize_pt2e_qat.py
↷ Skip: caffe2/test/quantization:test_quantization - test_qat_conv_bn_fusion_literal_args (caffe2.test.quantization.pt2e.test_quantize_pt2e_qat.BaseTestQuantizePT2EQAT_ConvBn) (0.0s)
/data/users/huydo/fbsource/buck-out/v2/gen/fbcode/689edf96bfbb5738/caffe2/test/quantization/__test_quantization__/test_quantization#link-tree/torch/_utils_internal.py:230: NCCL_DEBUG env var is set to None
/data/users/huydo/fbsource/buck-out/v2/gen/fbcode/689edf96bfbb5738/caffe2/test/quantization/__test_quantization__/test_quantization#link-tree/torch/_utils_internal.py:239: NCCL_DEBUG is WARN from /etc/nccl.conf
INFO:2023-11-29 19:20:33 3049620:3049620 CuptiActivityProfiler.cpp:225] CUDA versions. CUPTI: 18; Runtime: 12000; Driver: 12000
/data/users/huydo/fbsource/buck-out/v2/gen/fbcode/689edf96bfbb5738/caffe2/test/quantization/__test_quantization__/test_quantization#link-tree/torch/_utils_internal.py:158: DeprecationWarning: This is a NOOP in python >= 3.7, its just too dangerous with how we write code at facebook. Instead we patch os.fork and multiprocessing which can raise exceptions if a deadlock would happen.
threadSafeForkRegisterAtFork()
test_qat_conv_bn_fusion_literal_args (caffe2.test.quantization.pt2e.test_quantize_pt2e_qat.BaseTestQuantizePT2EQAT_ConvBn) ... skipped 'Skipping test running from BaseTestQuantizePT2EQAT_ConvBn'
----------------------------------------------------------------------
Ran 1 test in 0.001s
OK (skipped=1)
Skipped: Skipping test running from BaseTestQuantizePT2EQAT_ConvBn
Buck UI: https://www.internalfb.com/buck2/7b70fb33-44cb-4745-92e1-64031bb413b8
Test UI: https://www.internalfb.com/intern/testinfra/testrun/6473924660765251
Network: Up: 12KiB Down: 0B (reSessionID-0399f0c3-e671-4770-a41c-75c06ae709d5)
Jobs completed: 11. Time elapsed: 1:07.2s.
Cache hits: 0%. Commands: 1 (cached: 0, remote: 0, local: 1)
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 1. Build failure 0
```
Differential Revision: D51694959
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114829
Approved by: https://github.com/clee2000
Follow-up: #114539
This PR introduces a minor change to the `move_constructor_to_cuda` implementation, while
refactoring the whole pass into a class. Here's a brief summary of the changes:
- Create a new `ConstructorMoverPass`
- Rephrase the condition:
```python
# before:
if not isinstance(
    node.target, torch._ops.OpOverload
) or node.target.namespace not in ("prims", "aten"):
    ...

# after:
if not (
    isinstance(node.target, torch._ops.OpOverload)
    and node.target.namespace in ("prims", "aten")
):
    ...
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114626
Approved by: https://github.com/eellison
Summary:
This is a util for the numeric suite in pt2 export, so that we can build
a more streamlined UX for numerical debugging in the quant + executorch stack.
Test Plan:
python test/test_quantization.py TestGenerateNumericDebugHandle
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114315
Approved by: https://github.com/zhxchen17
Prior to this PR, op-level debug reported mismatches whenever complex dtypes were involved, because ONNX supports the real representation. This PR makes sure we use the real representation to compare the results.
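Concretely, the real representation round-trips losslessly, which makes it a safe basis for comparison:
```python
import torch

c = torch.randn(3, dtype=torch.complex64)
r = torch.view_as_real(c)  # shape (3, 2), dtype float32
assert torch.equal(torch.view_as_complex(r), c)
```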
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114885
Approved by: https://github.com/BowenBao
Summary:
Collectives like broadcast/gather/reduce/scatter need root rank info in order to be replayed in PARAM benchmarks. Log the root rank instead of the local rank in RECORD_PARAM_COMMS_DATA.
Reference: distributed/c10d/Types.hpp
Test Plan: Tested in HPC
Differential Revision: D51381196
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113828
Approved by: https://github.com/fduwjj
Fixes #104797
```
File "/home/jansel/pytorch/torch/_dynamo/utils.py", line 1486, in <lambda>
lambda: run_node(tx.output, node, args, kwargs, nnmodule)
File "/home/jansel/pytorch/torch/_dynamo/utils.py", line 1591, in run_node
raise RuntimeError(fn_str + str(e)).with_traceback(e.__traceback__) from e
File "/home/jansel/pytorch/torch/_dynamo/utils.py", line 1570, in run_node
return node.target(*args, **kwargs)
File "/home/jansel/conda/envs/pytorch/lib/python3.10/site-packages/einops/packing.py", line 153, in unpack
n_unknown_composed_axes = sum(x == -1 for x in lengths_of_composed_axes)
torch._dynamo.exc.TorchRuntimeError: Failed running call_function <function unpack at 0x7f644b962710>(*(FakeTensor(..., device='cuda:0', size=(1, s0*s1, 128)), [(s0, s1)], 'b * c'), **{}):
unsupported operand type(s) for +: 'int' and 'SymBool'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114828
Approved by: https://github.com/lezcano
As in the title.
In addition, this PR fixes a bug in the `bsr_dense_mm` and `bsr_dense_addmm` return value handling, where computations are performed on the `make_triton_contiguous` return value while `bsr_dense_mm`/`bsr_dense_addmm` return a tensor that is an input to `make_triton_contiguous`. If `make_triton_contiguous` makes a copy of the input, the return values of `bsr_dense_mm`/`bsr_dense_addmm` will contain garbage.
The PR increases the performance of nn.linear as follows (float32, `NVIDIA A100-SXM4-80GB`):
- with 16x16 blocks, the average/maximal speed up is 67/78 %
- with 32x32 blocks, the average/maximal speed up is 72/79 %
- with 64x64 blocks, the average/maximal speed up is 71/79 %
- with 128x128 blocks, the average/maximal speed up is 62/76 %
The performance increase is illustrated also by the following sparsity-speedup graphs (before and after this PR):
<img src="https://github.com/pytorch/pytorch/assets/402156/55ce0bf7-8ef2-47ab-99e8-8878f159037d" width="48%"> <img src="https://github.com/pytorch/pytorch/assets/402156/df256175-a594-4bd7-b244-90867fb9a45e" width="48%">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114757
Approved by: https://github.com/cpuhrsch
We draw our fx graphs with the "record" shape attribute by default.
Sometimes, when the graph is very complex, we may hit dot errors like the one below:
"flat edge between adjacent nodes one of which has a record shape -
replace records with HTML-like labels"
and thus fail to generate a graph. So, let's give the user an option
to specify the shape attribute for the dot graph. For example, passing
INDUCTOR_DOT_GRAPH_SHAPE_SVG = "none" would let us generate HTML-like labels
to work around the above failure.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114811
Approved by: https://github.com/weifengpy
**Remaining Issue**
When replacing SequenceParallel, tests would pass even when setting `input_layouts=Replicate()`. Still looking into it...
**Summary**
This is a follow-up PR to #114314.
**Test Plan**
`python test_files.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114402
Approved by: https://github.com/wanchaol
time_created_us is the CPU-side epoch time (in usec) at which a flight-recorder
event was created. It loosely corresponds to the time the c10d collective
API was called and a work object was created. It does NOT correspond to
the time the collective started on the GPU.
We follow the precedent of microsecond epoch time from this PR, which added timestamps
to the CUDA caching allocator:
https://github.com/pytorch/pytorch/pull/112266
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114810
Approved by: https://github.com/zdevito
New function for continuing on error.
Another solution might be to run the entire suite to the end and use last-failed, but I'm worried about concurrent processes writing to the same last-failed cache entry. It's a bit different from the usual test rerunning strategy we use, especially regarding segfaults and other ways the test suite can suddenly end, and there are some cases where the entire test suite should immediately get rerun in a new process (e.g. a cuda error that causes sync to fail).
Find example logs on commit 2f1510839727f6ef2631040d5f0edde26265015d.
TODO: continue-on-error for --subprocess and test_distributed isn't fully working.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112098
Approved by: https://github.com/huydhn
Because isCompleted() returns true on an exception, a timeout exception
will cause the flight recorder to consider the event completed even though it timed out.
This changes the logic to explicitly query the completion events on "retirement"
when the work item leaves the workMetaList. We mark events as retired so
we can distinguish between an event still in the queue but not completed and one
that timed out.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114804
Approved by: https://github.com/wconstab
**Summary**
To solve issue #113706:
1. Replace `PairwiseParallel` with `ColwiseParallel` and `RowwiseParallel`.
2. Replace the input of ColwiseParallel from `make_input_replicate_1d` and `make_output_replicate_1d` to `input_layouts` and `output_layouts`.
3. Deprecate the tests for `_parallelize_mlp`, as it only supports `PairwiseParallel`.
**Test Plan**
`pytest pytorch/test/distributed/tensor/parallel/`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114314
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
Summary:
`Layernorm` has two arguments, weight and bias, which are stored as constant tensors on the CPU and transferred to the GPU at every inference call. We create a context for this op to avoid the repeated passing. Specifically, we
- created `create_layernorm_context` and `run_layernorm_context` in `Layernorm.h` and `Layernorm.cpp`
- registered them in `Register.cpp`
- rewrote the graph representation of the op in `vulkan_rewrite.cpp`
Test Plan:
## Numerical test
```
[luwei@devbig984.prn1 /data/users/luwei/fbsource (b6ccc956c)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*layer_norm*"
Recommended: For faster builds try buck2: replace 'buck' with 'buck2'
NOTE: buck-out/ has changed: look for files in fbsource/buck-out/v2/
'buck2 build --show-output //xplat/caffe2:pt_vulkan_api_test_bin' will print the new output paths.
If you are building in fbsource//xplat and have questions, post in 'Cross Platform Dev Discussions': https://fb.workplace.com/groups/xplat.qa
Targets matching .buckconfig buck2.supported_projects:
{'//xplat/caffe2:pt_vulkan_api_test_bin': '//xplat'}
To suppress this warning: touch ~/.config/.dont_hint_buck2
Building: finished in 0.1 sec (100%) 339/339 jobs, 0/339 updated
Total time: 0.2 sec
BUILD SUCCEEDED
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *layer_norm*
[==========] Running 10 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 10 tests from VulkanAPITest
[ RUN ] VulkanAPITest.packed_layer_norm_2d
[ OK ] VulkanAPITest.packed_layer_norm_2d (342 ms)
[ RUN ] VulkanAPITest.packed_layer_norm_3d
[ OK ] VulkanAPITest.packed_layer_norm_3d (284 ms)
[ RUN ] VulkanAPITest.packed_layer_norm_4d
[ OK ] VulkanAPITest.packed_layer_norm_4d (5 ms)
[ RUN ] VulkanAPITest.layer_norm_invalid_inputs
[ OK ] VulkanAPITest.layer_norm_invalid_inputs (28 ms)
[ RUN ] VulkanAPITest.layer_norm_2d
[ OK ] VulkanAPITest.layer_norm_2d (1 ms)
[ RUN ] VulkanAPITest.layer_norm_3d
[ OK ] VulkanAPITest.layer_norm_3d (2 ms)
[ RUN ] VulkanAPITest.layer_norm_4d
[ OK ] VulkanAPITest.layer_norm_4d (4 ms)
[ RUN ] VulkanAPITest.native_layer_norm_2d
[ OK ] VulkanAPITest.native_layer_norm_2d (1 ms)
[ RUN ] VulkanAPITest.native_layer_norm_3d
[ OK ] VulkanAPITest.native_layer_norm_3d (2 ms)
[ RUN ] VulkanAPITest.native_layer_norm_4d
[ OK ] VulkanAPITest.native_layer_norm_4d (6 ms)
[----------] 10 tests from VulkanAPITest (679 ms total)
[----------] Global test environment tear-down
[==========] 10 tests from 1 test suite ran. (679 ms total)
[ PASSED ] 10 tests.
```
Full test result in P888496077, summary as below
```
[----------] 419 tests from VulkanAPITest (21652 ms total)
[----------] Global test environment tear-down
[==========] 419 tests from 1 test suite ran. (21652 ms total)
[ PASSED ] 418 tests.
[ SKIPPED ] 1 test, listed below:
[ SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
```
## Graph representation comparison
We created a model using `layer_norm` and traced it as below:
```
import torch

class MyModel(torch.nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.layer_norm = torch.nn.LayerNorm(normalized_shape=10)

    def forward(self, x):
        return self.layer_norm(x)

# Create an instance of the model
model = MyModel()
# Create a dummy input tensor for tracing
input_tensor = torch.randn(1, 10)
# Use torch.jit.trace to trace the model and generate a graph
traced_model = torch.jit.trace(model, input_tensor)
```
Then we converted the traced model to the Vulkan backend using `optimize_for_mobile`:
```
from torch.utils import mobile_optimizer

vulkan_model = mobile_optimizer.optimize_for_mobile(
    traced_model, backend="vulkan", preserved_methods=to_preserve
)
```
Then we can print the graph of the `vulkan_model` with `print(vulkan_model.graph)`:
- Before this diff
```
%4 : bool = prim::Constant[value=1](), scope: __module.layer_norm # /mnt/xarfuse/uid-602118/33e18f68-seed-nspid4026531836_cgpid32066351-ns-4026531840/torch/nn/functional.py:2546:0
%5 : float = prim::Constant[value=1.0000000000000001e-05](), scope: __module.layer_norm # /mnt/xarfuse/uid-602118/33e18f68-seed-nspid4026531836_cgpid32066351-ns-4026531840/torch/nn/functional.py:2546:0
%14 : int[] = prim::Constant[value=[10]]()
%33 : Tensor = aten::to(%x, %53, %30, %31, %31)
%10 : Tensor = aten::layer_norm(%33, %14, %self.layer_norm.weight, %self.layer_norm.bias, %5, %4), scope: __module.layer_norm # /mnt/xarfuse/uid-602118/33e18f68-seed-nspid4026531836_cgpid32066351-ns-4026531840/torch/nn/functional.py:2546:0
```
- after this diff
```
%14 : int[] = prim::Constant[value=[10]]()
%47 : Tensor = aten::to(%x, %78, %44, %45, %45)
%16 : Tensor = vulkan_prepack::run_layernorm_context(%47, %14, %17)
```
Reviewed By: SS-JIA
Differential Revision: D51530478
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114701
Approved by: https://github.com/yipjustin
This finally breaks the FC promise on new dtypes by serializing untyped
storage and tensor dtypes:
- Add `_rebuild_tensor_v3`, which takes an extra dtype argument
- In `Tensor.__reduce_ex__`, serialize the tensor using untyped storage for
v3_dtypes (which are at the moment limited to float8 dtypes)
Test plan: `python -c "import torch;x=torch.arange(10).to(dtype=torch.float8_e4m3fn);torch.save(x, 'pt.pt');print(torch.load('pt.pt'))"`
Fixes https://github.com/pytorch/pytorch/issues/114634
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114662
Approved by: https://github.com/ngimel
Context:
https://github.com/pytorch/pytorch/pull/47724 fixed the problem that SparseAdam could not handle generators by using the `list(...)` construct. However, this meant that SparseAdam deviated from other optimizers in that it could _accept_ a raw Tensors/Parameter vs requiring a container of them. This is not really a big deal.
So why this PR?
I do think this PR is cleaner. It uses the fact that the Optimizer parent class already containerizes parameters into parameter groups, so we could reuse that here by calling `super().__init__` first and then filter the param_groups after. This change would also make SparseAdam consistent with the rest of our optimizers in that only containerized params are accepted, which technically is BC breaking SO I've added a deprecation warning that we should remove in May 2024.
(But is it really BC breaking when we've said in the docs that params should be an iterable this whole time? Maybe this is just a bug fix....😛)
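A minimal sketch of the containerized calling convention (the embedding here is just an illustrative sparse-gradient use case, not code from the PR):
```
import torch

emb = torch.nn.Embedding(10, 4, sparse=True)
# Containerized params, consistent with every other optimizer:
opt = torch.optim.SparseAdam(list(emb.parameters()), lr=0.1)
# Passing a bare Parameter, e.g. SparseAdam(emb.weight), now emits a
# deprecation warning rather than being silently accepted.
loss = emb(torch.tensor([1, 2])).sum()
loss.backward()  # sparse=True produces sparse gradients for emb.weight
opt.step()
```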
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114425
Approved by: https://github.com/drisspg
Context: pt2 oncall is revamping its labeling system. One of the guidelines is to remove duplicate labeling in our system. Both primTorch and decomposition labels are referring to the same thing. primTorch was the legacy name (and we no longer have a primTorch project), so using decomposition as the label name makes more sense.
Right now, the only open issues that use "module: primTorch" are the ones generated by the DISABLED bots. Once we replace the label in the bot, we can safely remove the primTorch label.
Here is an example of an issue that has the primTorch label:
https://github.com/pytorch/pytorch/issues/112719
Torchbot uses the following logic to auto-extract module owners:
https://github.com/pytorch/test-infra/blob/main/torchci/pages/api/flaky-tests/disable.ts#L391
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114754
Approved by: https://github.com/huydhn
… device placement of the RNG. This PR now ignores the user-specified default device, allocates the tensor on the CPU and then moves the tensor to the device of the input tensor. This was more or less already the standard procedure in case the default device wasn't set.
Fixes#114536.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114560
Approved by: https://github.com/soulitzer
Summary:
The current order of implicit sharing breaks common annotation patterns of SharedQuantizationSpec, so we changed the order here.
But it's not going to work in all possible annotation cases, so quantizer implementors still need to be careful.
In general, if people only refer to nodes/edges that come before the current node/edge in SharedQuantizationSpec, it should work, I think.
Test Plan: CI; make sure this fixes some internal tests
Differential Revision: D51605918
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114704
Approved by: https://github.com/andrewor14
While many models regress in training when converted to channels last, in inference the results are quite different. Almost all of the models experienced a speedup when converted to channels last. There were a few big regressions in torchbench - `timm_regnet` from `1.4343 → 1.0573` and `timm_resnet` from `1.7484 → 1.2868`.
I used a modified script of the operator benchmarks [here](https://gist.github.com/eellison/e11dc645412f52e8b45fb26ba6f9f6a1) to measure the average speedup of convolutions across all of the input shapes found in torchbench according to the existing classifications that @shunting314 used - grouped convs, small channel convs, convolution with larger in-channel than out-channel. Only grouped convolutions benchmarked as a slowdown in inference.
I updated the inference heuristic to multiply the flops of each conv with its predicted speedup/slowdown in channels last. With this heuristic the two previously regressing models no longer regress.
Speeds up inference for torchbench ~8% and timm ~6%. The motivating model here was SDXL which now hits channels last and improves 10%.
There were some models that were sped up in training when forcing channels last (along with a number of regressions). It's possible there is some speedup in training to be had with additional heuristics. We could also have more granular classification/predictions which might benefit both training and inference.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114600
Approved by: https://github.com/jansel, https://github.com/shunting314
As in the title.
The `bsr_dense_addmm` kernel implemented in this PR is a generalization of `bsr_dense_mm` in the following respects (in addition to having input, beta, and alpha parameters):
- it implements the `SPLIT_N` kernel parameter that enables efficient kernel launches in the case of wide inputs. For instance, the timing of nn.linear with 256x256 BSR weights having 16x16 blocks and 256x131072 strided input was reduced by about 16x (this corresponds to the 94 % speed up value listed below).
- it supports rectangular blocks in sparse BSR tensor weights
The performance increase of nn.linear is as follows (float16, `NVIDIA A100-SXM4-80GB`; a usage sketch follows these numbers):
- with 16x16 blocks, the average/maximal speed up is 55/94 %
- with 32x32 blocks, the average/maximal speed up is 33/63 %
- with 64x64 blocks, the average/maximal speed up is 23/42 %
- with 128x128 blocks, the average/maximal speed up is 15/39 %
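As a usage sketch (shapes, dtype, and block size are illustrative; this assumes `F.linear` dispatches to the BSR kernels for half-precision CUDA inputs, which is the setup benchmarked above):
```
import torch
import torch.nn.functional as F

w = torch.randn(256, 256, dtype=torch.float16, device="cuda")
bsr_w = w.to_sparse_bsr(blocksize=(32, 32))  # rectangular blocks are also supported now
x = torch.randn(64, 256, dtype=torch.float16, device="cuda")
bias = torch.randn(256, dtype=torch.float16, device="cuda")
out = F.linear(x, bsr_w, bias)
```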
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114595
Approved by: https://github.com/cpuhrsch
Quick recap of events:
(1) https://github.com/pytorch/pytorch/pull/111347, which fixed a perf regression in 2.1 compared to 2.0, introduced a correctness problem around input mutations on inputs that require grad that show up in an inference-only graph (the specific case where this can happen is rare and nobody reported the issue, but it was fixed a few weeks later)
(2) That fix happened here: https://github.com/pytorch/pytorch/pull/113584, which makes sure to keep input mutations outside of the graph, so the autograd engine can set metadata properly on them
(3) That in turn caused a slight regression compared to (1), which is what this PR attempts to fix. In particular, code like the below is safe to keep the mutations in the graph for:
```
@torch.compile
def f(x):
    x.mul_(2)

x = torch.ones(2, requires_grad=True).clone()
# x requires_grad, so the input mutation will change some autograd metadata, like the version counter
# However, the mutation is under no_grad, so we don't have to worry about e.g. aliases of x having their .grad_fn fields changed
with torch.no_grad():
    f(x)
```
This particular case is pretty important to the shampoo optimizer code, which is run under `torch.compile`, and mutates parameters (which require grad).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114646
Approved by: https://github.com/zou3519
Summary:
1. We stop using excess memory in generate_opcheck_tests. This is safe because
all the individual test utils already ensure that they do not modify the
inputs.
2. We re-enable the fbgemm TBE tests (see internal diff, but all of this is open
source). They were previously removed because they OOM'ed when run serially;
(1) and (3) cut down the memory usage to ~20gb peak.
3. I needed to skip some newly failing generated tests and also some that had an
impact on the memory usage.
Test Plan: - run tests
Reviewed By: sryap
Differential Revision: D51601964
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114641
Approved by: https://github.com/williamwen42
_cslt_sparse_mm + additional stride checking in test.
Summary:
This PR adds in meta registrations for _cslt_sparse_mm.
Based on the work @drisspg did
in #114370.
Additionally, it updates the tests by checking that the strides of the
sparse result and the result returned by sparse+compile are the same, to
avoid errors like those found in
https://github.com/pytorch/pytorch/pull/114477.
Test Plan:
```
python test/test_sparse_semi_structured -k compile_cusparselt
python test/test_sparse_semi_structured -k compile_cutlass
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114685
Approved by: https://github.com/alexsamardzic, https://github.com/drisspg
I can hold off on reviews / landing until I talk to Driss and we confirm that we need this for FP8. This PR also needs testing and probably shouldn't land until Tugsuu's input mutation handling [PR](https://github.com/pytorch/pytorch/pull/111046) goes through.
What this PR tries to solve is when you have a model that tries to mutate some nn module state (a buffer), but during the **backward**. It appears that this might be necessary for FP8's delayed scaling.
Today, AOTAutograd will just not realize if you happened to mutate any graph inputs when running the backward pass; it will functionalize the mutations away without realizing that they were input mutations. This PR tries to:
(a) detect this situation (input mutations during the backward)
(b) put `copy_()`'s in the graph to properly handle the input mutation when we can. In cases where we can't keep the copy_() in the graph, we just error loudly (I imagine that these cases will be extremely rare, but we can fix them if they ever come up).
This is mostly a prototype for now, not ready for review.
I made this example locally to test out:
```
import torch

class MutatingAutogradFn(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, buf):
        ctx.save_for_backward(buf)
        return x

    @staticmethod
    def backward(ctx, x_grad):
        buf = ctx.saved_tensors[0]
        buf.add_(x_grad)
        return x_grad * 3, None

class Mod(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.buf = torch.ones(2)

    @torch._dynamo.allow_in_graph
    def backward_mutating_fn(self, x, buf):
        return MutatingAutogradFn.apply(x, buf)

    def forward(self, x):
        tmp = self.backward_mutating_fn(x, self.buf)
        return tmp + self.buf

m = Mod()
x = torch.ones(2, requires_grad=True)
out = m(x)
# After the fw, buf should not have been mutated
print(m.buf)
out.sum().backward()
# bw has run, so buf should now be mutated
print(m.buf)
print(x.grad)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112906
Approved by: https://github.com/ezyang
Smuggle important and not too slow tests to run on this trunk job,
instead of just on the periodic job where they currently reside.
- test_dtensor_compile took 70sec, test_fsdp_2d_parallel took 198sec
locally
As a follow up, organize the distributed-mgpu tests better and maybe
rename this job to reflect its more general 'dist mgpu' scope.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114642
Approved by: https://github.com/wanchaol, https://github.com/malfet
I'm looking to repurpose some logic in `torch.utils.collect_env` for the `geowatch` package. I'm mostly able to just use this script as a library, which is great because it reduces code in my package. However, the issue is that the package patterns that are relevant to torch are hard-coded inside of `get_conda_packages` and `get_pip_packages`.
The changes I made are simple. I defined the default package patterns as two global sets, and I added an argument to each function that lets the user customize exactly what package patterns are relevant. If they are not specified the defaults are used.
I was considering extending the power of the patterns by utilizing `fnmatch`, `re` (or [xdev.pattern](https://github.com/Erotemic/xdev/blob/main/xdev/patterns.py) which abstracts them both), but instead I opted to just use the existing `__contains__` test to keep things simple.
From torch's perspective this should make maintaining this file slightly easier because to update relevant packages, the developer now updates two neighboring top-level globals instead of two separated local variables. However, it does add an argument to two functions, and that argument isn't used in torch itself, so there is an argument for removing that, and then users *could* still have some control by modifying globals, but I think the way I did it balances the tradeoffs well.
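A minimal sketch of the new usage, assuming the added argument is named `patterns` and reusing the module's own `run` helper (exact names may differ):
```
from torch.utils.collect_env import get_pip_packages, run

# Report only pip packages whose names contain one of these substrings.
pip_version, packages = get_pip_packages(run, patterns={"torch", "numpy", "geowatch"})
print(packages)
```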
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112993
Approved by: https://github.com/zou3519
Updated version of #108885 addressing the review. In this PR:
- We add a VT.can_reconstruct utility that checks if VT.reconstruct()
does something.
- If functools.wraps(fn) is passed a `fn` that either has a source or
has .can_reconstruct() == True, then we stash the source (or the VT)
- Later on, we use the source (or VT.reconstruct) to actually
reconstruct the object in codegen (a toy example follows this list).
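A toy sketch of the kind of code this lets Dynamo handle (hypothetical example; `functools.wraps` is applied to a function defined inside the compiled region, so the wrapped function must be reconstructed in codegen):
```
import functools
import torch

def template(x):
    "docstring carried over by wraps"
    return x

@torch.compile(backend="eager", fullgraph=True)
def f(x):
    @functools.wraps(template)
    def inner(y):
        return y.sin()
    return inner(x)

print(f(torch.ones(3)))
```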
Test Plan:
- New tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114279
Approved by: https://github.com/voznesenskym
torch.split(x, l) fails when the sizes in l are unbacked symints.
E.g. `l = y.tolist()` makes `l` unbacked, because `l` depends on the
data access of `y`. The downstream call `SliceView.create()`
evaluates the shape even if the input shape is an unbacked symint,
which brings up the bug.
Test Plan:
python test/inductor/test_unbacked_symints.py -k test_split_with_sizes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113406
Approved by: https://github.com/aakhundov, https://github.com/ezyang
Add TORCH_NCCL_DUMP_DEBUG_INFO env to control dumping independently
of desync debug feature.
Currently defaults to disabled (so no behavior change by default),
but plan to default this to true after validation.
Moves 'sleep for 30 sec' that used to be after desync debug to before
it. In my view sleeping before desync is equivalent since we always
sleep the same duration, and keeps the code simpler this way.
Fixes#114433
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114614
Approved by: https://github.com/zdevito
ghstack dependencies: #114651
This should be enough to get @voznesenskym 's FSDP branch to plumb `set_()` through AOTAutograd properly and have everything properly no-op out. Main changes are:
(1) graph break on `aten::set_.source_Tensor_storage_offset` (we could support it but it isn't needed, seems safer to graph break)
(2) Functionalization: add a "proper" functionalization kernel for `aten::set_.source_Tensor`. The previous one we had was codegen'd and it was wrong (it would just clone() and call set_(), which does not do the right thing). I also manually mark on the `FunctionalTensorWrapper` when a given tensor has been mutated by a `set_()` call.
(3) AOTAutograd: I added a new field, `InputAliasInfo.mutates_storage_metadata`, so we can distinguish between "regular" metadata mutations, and metadata mutations due to `set_()` calls. This is mainly because at runtime, one requires calling `as_strided_()` to fix up metadata, while the other requires calling `set_()`.
(4) Made AOTAutograd's detection for metadata mutations / set_() mutations smarter and detect no-ops (if the storage and metadata are all the same).
I also killed `was_updated()` and `was_metadata_updated()`, and replaced them with (existing) `has_data_mutation()` and (new) `has_data_mutation()`, which can more accurately distinguish between data-mutation vs. `set_()` calls vs. metadata-mutation
**This PR is still silently incorrect in one case though**, which I'd like to discuss more. In particular, this example:
```
def f(x):
    x_view = x.view(-1)
    x.set_(torch.ones(2))
    x_view.mul_(2)
    return
```
If you have an input that experiences both a data-mutation **and** a `x_old.set_(x_new)` call, there are two cases:
(a) the data mutation happened on the storage of `x_new`. This case should be handled automatically: if x_new is a graph intermediate then we will functionalize the mutation. If x_new is a different graph input, then we will perform the usual `copy_()` on that other graph input
(b) the data mutation happened on the storage of `x_old`. This is more of a pain to handle, and doesn't currently work. At runtime, the right thing to do is probably something like:
```
def functionalized_f(x):
    x_view = x.view(-1)
    # set_() desugars into a no-op; later usages of x will use x_output
    x_output = torch.ones(2)
    # functionalize the mutation on x_view
    x_view_updated = x.mul(2)
    x_updated = x_view_updated.view(x.shape)
    # x experienced TWO TYPES of mutations; a data mutation and a metadata mutation
    # We need to return both updated tensors in our graph
    return x_updated, x_output

def runtime_wrapper(x):
    x_data_mutation_result, x_set_mutation_result = compiled_graph(x)
    # First, perform the data mutation on x's old storage
    x.copy_(x_data_mutation_result)
    # Then, swap out the storage of x with the new storage
    x.set_(x_set_mutation_result)
There are two things that make this difficult to do though:
(1) Functionalization: the functionalization rule for `set_()` will fully throw away the old `FunctionalStorageImpl` on the graph input. So if there are any mutations to that `FunctionalStorageImpl` later on in the graph, the current graph input won't know about it. Maybe we can have a given `FunctionalTensorWrapper` remember all previous storages that it had, and track mutations on all of them - although this feels pretty complicated.
(2) AOTAutograd now needs to know that we might have *two* graph outputs that correspond to a single "mutated input", which is annoying.
It's worth pointing out that this issue is probably extremely unlikely for anyone to run into - can we just detect it and error? This feels slightly easier than solving it, although not significantly easier. We would still need `FunctionalTensorWrapper` to keep track of mutations on any of its "previous" storages, so it can report this info back to AOTAutograd so we can raise an error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111554
Approved by: https://github.com/ezyang
ghstack dependencies: #113926
Summary:
This diff adds support in the ExecuTorch codegen layer to log the outputs of kernels to event_tracer. It does this by calling the `event_tracer_log_evalue` API.
When the `ET_EVENT_TRACER_ENABLED` flag is disabled this is essentially a no-op and will add no overhead.
Test Plan: CI
Reviewed By: larryliu0820
Differential Revision: D51534590
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114584
Approved by: https://github.com/larryliu0820
CPP Stacktrace processing (symbolizer) takes a long time on some systems
using a particular version of addr2line. On slow systems, this makes
flight-recorder dumping slow enough to time out on even toy programs.
TORCH_NCCL_TRACE_CPP_STACK=True will re-enable CPP stacktrace collection
as part of the flight recorder.
CPP stacktrace is fast enough for use on certain combinations of OS. We
can investigate moving to llvm's symbolizer as a replacement.
On devserver with C++ stacktraces disabled/enabled:
```
python test/distributed/test_c10d_nccl.py -k test_short
Ran 1 test in 12.175s
TORCH_NCCL_TRACE_CPP_STACK=1 python test/distributed/test_c10d_nccl.py -k test_short
Ran 1 test in 53.338s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114651
Approved by: https://github.com/zdevito
This adds some unit testing for the `ignored_states` argument and auto wrapping. There is some ongoing discussion with @erhoo82 about his particular use case, but it should not block this PR. (We can land a separate PR if needed.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114612
Approved by: https://github.com/wanchaol
ghstack dependencies: #114611
With this PR it is possible to differentiate through NumPy code modulo
the usual caveats that apply to differentiation:
- That there are no graphbreaks
- That the decomposition in `torch._numpy` is differentiable
@ev-br and I were somewhat careful to achieve the second point, but
it is not tested through and through, so YMMV
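A small sketch of the pattern this unlocks, following the documented torch.compile/NumPy interop (names are illustrative):
```
import numpy as np
import torch

@torch.compile
def f(x_t):
    x_np = x_t.numpy()  # traced into torch._numpy under torch.compile
    return torch.from_numpy(np.sum(np.sin(x_np) ** 2))

x = torch.randn(4, requires_grad=True)
f(x).backward()  # gradients now flow through the NumPy code
print(x.grad)
```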
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114608
Approved by: https://github.com/voznesenskym
Changes:
1. Add `_private_register_pytree_node` API in both C++ and Python pytree. In C++ pytree, the API will only register pytree node for C++ pytree. In Python pytree, the API will only register pytree node for Python pytree.
2. Do not allow registering a type as pytree node twice in the Python pytree.
3. Add thread lock to the Python pytree node register API.
4. The old `_register_pytree_node` API will call the `_private_register_pytree_node` API and raise a deprecation warning.
5. Add a new `register_pytree_node` API to register a node type in both C++ and Python implementations (see the sketch after this list).
6. Add tests to ensure a warning will be raised when the old private function is called.
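A minimal sketch of the new unified API from item 5 (the `Pair` container is hypothetical, and optional keyword arguments such as serialization names are omitted):
```
import torch.utils._pytree as pytree

class Pair:
    def __init__(self, first, second):
        self.first, self.second = first, second

pytree.register_pytree_node(
    Pair,
    lambda p: ((p.first, p.second), None),  # flatten: (children, context)
    lambda children, ctx: Pair(*children),  # unflatten
)

leaves, spec = pytree.tree_flatten(Pair(1, 2))
print(leaves)  # [1, 2]
print(pytree.tree_unflatten(leaves, spec).first)  # 1
```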
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112111
Approved by: https://github.com/zou3519
As in the title.
As a result, `nn.linear(<strided tensor>, <BSR tensor>, bias=<strided tensor>)` performance increases as follows (`float16`, `NVIDIA A100-SXM4-80GB`):
- 256x256 weights, speed up is 14..27 %
- 512x512 weights, speed up is 9..25 %
- 1024x1024 weights, speed up is 5..20 %
- 2048x2048 weights, speed up is 3..16 %
- 4092x4092 weights, speed up is 2..9 %
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114484
Approved by: https://github.com/cpuhrsch
It appears that `mypy` is now checking a few more previously-unchecked files; these files
are being found via import-following. Not sure exactly why they weren't being checked before.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114160
Approved by: https://github.com/eellison
ghstack dependencies: #114162
**Summary**
Add the `extra_check` in `_register_quantized_conv_binary_lowering` to skip the pattern which matched unexpected. To match a Conv-Binary pattern, we should expect the extra input of binary node comes from a dequant pattern instead of a constant scalar.
**Test Plan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d_add_2
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114541
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
ghstack dependencies: #114540
# Summary
Improved Fix for Attention Mask Alignment Issue (#112577)
This PR addresses Issue #112577 by refining the previously implemented fix, which was found to be incorrect and caused unneeded memory regressions. The update simplifies the approach to handling the alignment of the attention mask for mem eff attention.
## Changes
Alignment Check and Padding: Initially, the alignment of the attention mask is checked. If misalignment is detected, padding is applied, followed by slicing. During this process, a warning is raised to alert users.
Should this be warn_once?
We only call expand once, on the aligned mask.
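A rough sketch of the pad-then-slice trick described above (an illustrative helper, not the actual internal code):
```
import torch.nn.functional as F

def realign_last_dim(mask, multiple=8):
    # Pad the last dim up to a multiple of `multiple`, then slice back to
    # the original size: values are unchanged, but the underlying storage
    # (and hence the stride of the second-to-last dim) is now aligned.
    pad = (-mask.size(-1)) % multiple
    if pad == 0:
        return mask
    return F.pad(mask, (0, pad))[..., : mask.size(-1)]
```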
Reference
https://github.com/facebookresearch/xformers/blob/main/xformers/ops/fmha/cutlass.py#L115
@albanD, @mruberry, @jbschlosser, @walterddr, and @mikaylagawarecki.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114173
Approved by: https://github.com/danthe3rd
In https://github.com/pytorch/pytorch/pull/113432, we changed
the behavior of _is_allowed_module_prefix, where we moved the '.'
from the module prefixes. Consequently, 'LOAD_ATTR submodule'
(e.g. LOAD_ATTR fx) is turned into PythonModuleVariable instead
of TorchVariable. This caused some issues for record_replayer.record_module_access,
which is enabled by setting TORCH_COMPILER_DEBUG=1, because 'torch.fx'
doesn't exist in record_replayer's name_to_modrec dictionary when
record_module_access is called.
This PR fixed the issue by adding "torch.fx" into record_replayer's
EXCLUDES list. The fix is likely just a workaround to unblock the
internal workflow. There might be some fundamental changes
to the relevant pieces along with Yanbo's refactoring PRs for
tracing in-graph functions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114623
Approved by: https://github.com/mlazos, https://github.com/yanboliang
There are now 3 ways to see logs from DDPOptimizer.
1) TORCH_LOGS="distributed"
2) TORCH_LOGS="dynamo"
3) TORCH_LOGS="torch._dynamo.backends.distributed"
(1 and 2 are different supersets of 3 that also include other content)
Note: ddp_graphs is still a separate 'artifact' logger, which just
includes graph dumps from the graph-splitting process.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114376
Approved by: https://github.com/wanchaol
Currently, we print out the profile bandwidth result for each triton
kernel to stdout after each profiling run finishes. Consequently,
the profiling results are mixed with other debug outputs.
This PR adds a config, profile_bandwidth_output, to specify a file
where we can dump the results in a sorted order. The new config can
be set by setting "TORCHINDUCTOR_PROFILE_OUTPUT" environment variable.
Hopefully it would offer a slightly better way to navigate the profiling
results.
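A minimal sketch of using it from Python (this assumes the pre-existing `profile_bandwidth` flag alongside the new `profile_bandwidth_output` config; setting TORCHINDUCTOR_PROFILE_OUTPUT in the environment is the equivalent):
```
import torch
import torch._inductor.config as inductor_config

inductor_config.profile_bandwidth = True  # assumed pre-existing profiling flag
inductor_config.profile_bandwidth_output = "/tmp/bw.txt"  # new: sorted results file

@torch.compile
def f(x):
    return x.sin() + x.cos()

f(torch.randn(2**20, device="cuda"))
```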
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114587
Approved by: https://github.com/Chillee
**Summary**
To annotate a conv-binary pattern, we should skip the pattern if the conv node has more than one user.
**Test Plan**
```
python -m pytest test_x86inductor_quantizer.py -k test_conv2d_binary2
python -m pytest test_x86inductor_quantizer.py -k test_qat_conv2d_binary2
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114540
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
Summary:
Fixed the nn_module_stack metadata produced by symbolic trace to align with the nn_module_stack metadata produced by dynamo. The key should be the module path, with the value being a unique name and the type. Something like: `{'L__self___one_module': ("L['self'].one_module", <class 'torch.fx.graph_module.GraphModule.__new__.<locals>.GraphModuleImpl'>)}`
This was causing some tests to fail when using export + the old quantization flow (prepare_fx calls symbolic_trace).
Test Plan: D51534471 `buck2 run @//mode/dev-nosan //executorch/backends/xnnpack/test:test_xnnpack_quantized -- -r "test_xnnpack_leaky_relu"`
Differential Revision: D51539118
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114422
Approved by: https://github.com/JacobSzwejbka, https://github.com/jerryzh168
https://github.com/pytorch/pytorch/pull/113580 introduced the `DDP._update_process_group` API. However, the implementation did not correctly reset all of the necessary state in the reducer. In particular if an error occurred during backward, DDP would end up in an incorrect state.
As a result, in this PR I've enhanced the unit test to test for this case and also appropriately fixed resetting Reducer state.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114194
Approved by: https://github.com/rohan-varma
if logger_qname is a.b.c and dynamo_qnames contains a.b, it still
matches dynamo's INFO setting
concretely, torch._dynamo.backends.distributed is implicitly part of
the dynamo namespace since it is covered by `torch._dynamo` which is
one of dynamo_qnames. However, it is not an exact match for any
of dynamo_qnames, which made this test fail when adding a specific
qname for backends.distributed in the subsequent PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114429
Approved by: https://github.com/Skylion007
ghstack dependencies: #114428
This reduces compile time of Adam on 1k parameters from 180s to 140s (28%), the main reason being that thousands of buffers no longer get sent to the scheduler.
The idea behind this is that if a destination buffer (from a copy_) has no users, it shouldn't matter if dst aliases src.
This is implemented by reinplacing copy_ nodes when safe.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112440
Approved by: https://github.com/jansel
BEFORE
```
expected torch._dynamo.backends.distributed is DEBUG, got 0
```
(0 is both unhelpful and also not numerically the right value,
getEffectiveLevel() returns 20 not 0 for this particular case)
AFTER
```
expected torch._dynamo.backends.distributed is DEBUG, got INFO
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114428
Approved by: https://github.com/Skylion007
This PR adds a CommDebugMode debugging tool to record the number of
distributed collectives, utilizing TorchDispatchMode. The idea borrows
from FlopCounterMode, and we can expand this later to make it more
feature-complete, like FlopCounterMode.
This is useful for debugging with DTensor and testing. In general this
fits any complex distributed algorithm where it's non-trivial to
understand the algorithm; we can use this tool to understand what
happened under the hood. We can later cover c10d collectives directly.
Not sure if it would be a good general distributed debug tool yet,
so adding to the dtensor package first.
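A minimal usage sketch (assuming the mode lives in the dtensor debug package and exposes a counts getter; `sharded_model` and `inp` are placeholders for any DTensor computation):
```
from torch.distributed._tensor.debug import CommDebugMode  # location assumption

comm_mode = CommDebugMode()
with comm_mode:
    out = sharded_model(inp)  # placeholder: any DTensor computation

# Mapping from collective op (e.g. all_reduce) to how often it ran.
print(comm_mode.get_comm_counts())
```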
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113592
Approved by: https://github.com/wconstab
Summary:
The primary problem we are setting out to solve here is fake tensor freshness. Before this PR, fake tensors after dynamo represented fake tensors *at the end* of trace, so subsequent retraces like aot_autograd would start off with fake tensors in the wrong (end result) state, rather than their expected fresh state. The solution here is to start a fresh fake mode, and re-fakify the tensors. The nuance comes from ensuring that symbols are uniformly created for the symbolic sizes and strides of the tensor.
This PR is the result of *a lot* of back and forth with ezyang and eellison. Initially, the first pass at this was not super different from what we have in the PR - the broad strokes were the same:
1) We cache source->symbol in shape_env
2) We pass policy objects around, stored at dynamo fakification time, and reused for later fakification
3) We create a new fake mode for backends
(from https://github.com/pytorch/pytorch/pull/113605/files)
This is ugly, and has some layering violations. We detoured our decision making through a few other alternatives. Immutable/mutable fake tensor mode was the most interesting alternative, https://github.com/pytorch/pytorch/pull/113653, and was struck down on concerns of complexity in fake mode combined with it not covering all edge cases. We also detoured on what to do about tensor memoization returning back potentially different tensors than requested, and if that was an anti pattern (it is) we want to hack in with the symbol cache (we don't).
We went back to the drawing board here, but with a few concessions:
1) the cache for source->symbol must live outside of shape_env, for both lifecycle, and layering reasons
2) A good amount of work needs to be done to pipe policy around fake_mode and meta_utils correctly, to cover all the cases (ezyang did this)
cc penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 aakhundov kadeng
imported-using-ghimport
Test Plan: Imported from OSS
Reviewed By: huydhn, Chillee
Differential Revision: D51566250
Pulled By: voznesenskym
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114526
Approved by: https://github.com/Chillee, https://github.com/huydhn
Calls to `assertTrue` corrected to be `assertEqual` in `ElasticLaunchTest test_min_max_nodes_parse`.
As originally written, the `assertTrue` statements will always pass, not actually asserting anything of value for the test.
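The failure mode, with hypothetical values: `assertTrue` treats its second positional argument as the failure message, so the check passes regardless.
```
# assertTrue only checks the truthiness of its first argument; the second
# positional argument is the failure *message*, not an expected value.
self.assertTrue(min_nodes, 1)   # passes even when min_nodes != 1
self.assertEqual(min_nodes, 1)  # actually compares the two values
```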
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114537
Approved by: https://github.com/Skylion007
If you copy and paste the env var in the docs:
```console
TORCHDYNAMO_REPRO_AFTER=“aot”
```
it leads to this error:
```python
@functools.wraps(unconfigured_compiler_fn)
def debug_wrapper(gm, example_inputs, **kwargs):
compiler_fn = functools.partial(unconfigured_compiler_fn, **kwargs)
> assert config.repro_after in ("dynamo", "aot", None)
E torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
E AssertionError:
```
because `config.repro_after` ends up being `'“aot”'` rather than `'aot'`.
---
It would've saved a few minutes of my time 😄
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114530
Approved by: https://github.com/Chillee
Summary: `None` arguments are codegened as `*i8` in the `triton_meta` of the generated or user-defined Triton kernels:
85aa372374/torch/_inductor/codegen/triton_utils.py (L33-L36)
Due to this, contrary to conventional Triton, we actually should pass `nullptr` to the Triton kernels in C++ wrapper codegen instead of passing nothing (as normally `None` doesn't make it to the generated PTX parameters, just like `tl.constexpr` args).
This PR adds two things:
1. Proper C++ wrapper codegening (ABI and non-ABI) of `nullptr` and `c10::nullopt`, as the prior way of codegening `c10::nullopt` as a tensor breaks (also `c10` breaks in the ABI mode).
2. Skipping `tl.constexpr` args when calling the loaded-from-cubin compiled Triton kernel in the C++ wrapper codegen. As a side effect, this also resolves an issue with string arguments: now they are simply omitted in the C++ wrapper codegen.
Test Plan:
```
$ python test/inductor/test_aot_inductor.py -k test_triton_kernel_with_none_input
...
----------------------------------------------------------------------
Ran 4 tests in 40.364s
OK (skipped=2)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114475
Approved by: https://github.com/oulgen
For whatever reason, calling `.cpu()` on an `nn.Parameter` wrapping a CUDA tensor will return a plain (non-parameter) tensor. This PR fixes the symptom in the linked issue, but not the underlying issue.
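For reference, the behavior in question (requires a CUDA device):
```
import torch

p = torch.nn.Parameter(torch.randn(2, device="cuda"))
print(type(p.cpu()))  # <class 'torch.Tensor'>, not torch.nn.Parameter
```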
Fixes#113999.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114029
Approved by: https://github.com/zhxchen17
This PR adds a unit test that leverages `torch.export.ExportedProgram` models that mutates registered buffers. Although the exporter already works out of the box in such scenario, the GraphModule and the exported ONNX model have extra outputs containing the mutated buffers. On future runs of the ONNX model, the mutated buffers are used as input to the model.
The aforementioned extra inputs and outputs are by design and the `ONNXProgram.model_signature` can be used to fetch detailed input/output schema for the exported model.
However, when we want to compare PyTorch output to ONNX's, there is a mismatch between the schemas because the PyTorch output does not include the mutated buffers present in the ONNX output.
This PR extends `onnx_program.adapt_torch_outputs_to_onnx(torch_outputs)` so that the mutated buffers are prepended to the PyTorch output, matching the ONNX schema.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112272
Approved by: https://github.com/titaiwangms, https://github.com/BowenBao
- [c10d] (retry) Opportunistically use `ncclCommSplit` when creating new NCCL groups (#112889)
- Guard use of `split_from` with a `hasattr` check for cases when NCCL (or RCCL) lacks `ncclCommSplit`
Fixes cause of revert of original PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114385
Approved by: https://github.com/huydhn
Test Plan:
```
python test/distributed/_tensor/test_dtensor_compile.py
```
We found that after https://github.com/pytorch/pytorch/pull/114236 landed, DTensor + `torch.compile` tests were breaking (which was confounded with `DTensorSpec` hash changes). The temporary solution is to pass `dynamic=False`.
Otherwise, we see errors like:
<details>
```
======================================================================
ERROR: test_2d_fsdp_tp_ac_compile (__main__.TestDTensorCompileE2E)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/data/users/andgu/pytorch/torch/testing/_internal/common_distributed.py", line 533, in wrapper
self._join_processes(fn)
File "/data/users/andgu/pytorch/torch/testing/_internal/common_distributed.py", line 752, in _join_processes
self._check_return_codes(elapsed_time)
File "/data/users/andgu/pytorch/torch/testing/_internal/common_distributed.py", line 802, in _check_return_codes
raise RuntimeError(error)
RuntimeError: Process 2 exited with error code 10 and exception:
Traceback (most recent call last):
File "/data/users/andgu/pytorch/torch/testing/_internal/common_distributed.py", line 649, in run_test
getattr(self, test_name)()
File "/data/users/andgu/pytorch/torch/testing/_internal/common_distributed.py", line 535, in wrapper
fn()
File "/data/users/andgu/pytorch/torch/testing/_internal/common_utils.py", line 2652, in wrapper
method(*args, **kwargs)
File "/data/users/andgu/pytorch/torch/testing/_internal/distributed/_tensor/common_dtensor.py", line 193, in wrapper
func(self, *args, **kwargs) # type: ignore[misc]
File "/data/users/andgu/pytorch/torch/testing/_internal/common_distributed.py", line 174, in wrapper
return func(*args, **kwargs)
File "/data/users/andgu/pytorch/test/distributed/_tensor/test_dtensor_compile.py", line 328, in test_2d_fsdp_tp_ac_compile
compiled_output = compiled_2d(inp)
File "/data/users/andgu/pytorch/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/data/users/andgu/pytorch/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/data/users/andgu/pytorch/torch/_dynamo/eval_frame.py", line 489, in _fn
return fn(*args, **kwargs)
File "/data/users/andgu/pytorch/torch/_dynamo/external_utils.py", line 17, in inner
return fn(*args, **kwargs)
File "/data/users/andgu/pytorch/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/data/users/andgu/pytorch/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/data/users/andgu/pytorch/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 848, in forward
output = self._fsdp_wrapped_module(*args, **kwargs)
File "/data/users/andgu/pytorch/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/data/users/andgu/pytorch/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/data/users/andgu/pytorch/torch/_dynamo/eval_frame.py", line 655, in catch_errors
return callback(frame, cache_entry, hooks, frame_state)
File "/data/users/andgu/pytorch/torch/_dynamo/convert_frame.py", line 721, in _convert_frame
result = inner_convert(frame, cache_entry, hooks, frame_state)
File "/data/users/andgu/pytorch/torch/_dynamo/convert_frame.py", line 383, in _convert_frame_assert
compiled_product = _compile(
File "/data/users/andgu/pytorch/torch/_dynamo/convert_frame.py", line 645, in _compile
guarded_code = compile_inner(code, one_graph, hooks, transform)
File "/data/users/andgu/pytorch/torch/_dynamo/utils.py", line 244, in time_wrapper
r = func(*args, **kwargs)
File "/data/users/andgu/pytorch/torch/_dynamo/convert_frame.py", line 562, in compile_inner
out_code = transform_code_object(code, transform)
File "/data/users/andgu/pytorch/torch/_dynamo/bytecode_transformation.py", line 1033, in transform_code_object
transformations(instructions, code_options)
File "/data/users/andgu/pytorch/torch/_dynamo/convert_frame.py", line 151, in _fn
return fn(*args, **kwargs)
File "/data/users/andgu/pytorch/torch/_dynamo/convert_frame.py", line 527, in transform
tracer.run()
File "/data/users/andgu/pytorch/torch/_dynamo/symbolic_convert.py", line 2123, in run
super().run()
File "/data/users/andgu/pytorch/torch/_dynamo/symbolic_convert.py", line 818, in run
and self.step()
File "/data/users/andgu/pytorch/torch/_dynamo/symbolic_convert.py", line 781, in step
getattr(self, inst.opname)(inst)
File "/data/users/andgu/pytorch/torch/_dynamo/symbolic_convert.py", line 2238, in RETURN_VALUE
self.output.compile_subgraph(
File "/data/users/andgu/pytorch/torch/_dynamo/output_graph.py", line 912, in compile_subgraph
self.compile_and_call_fx_graph(tx, list(reversed(stack_values)), root)
File "/home/andgu/local/miniconda3/envs/pytorch-3.10/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/data/users/andgu/pytorch/torch/_dynamo/output_graph.py", line 1069, in compile_and_call_fx_graph
compiled_fn = self.call_user_compiler(gm)
File "/data/users/andgu/pytorch/torch/_dynamo/utils.py", line 244, in time_wrapper
r = func(*args, **kwargs)
File "/data/users/andgu/pytorch/torch/_dynamo/output_graph.py", line 1141, in call_user_compiler
raise BackendCompilerFailed(self.compiler_fn, e).with_traceback(
File "/data/users/andgu/pytorch/torch/_dynamo/output_graph.py", line 1122, in call_user_compiler
compiled_fn = compiler_fn(gm, self.example_inputs())
File "/data/users/andgu/pytorch/torch/_dynamo/repro/after_dynamo.py", line 117, in debug_wrapper
compiled_gm = compiler_fn(gm, example_inputs)
File "/data/users/andgu/pytorch/torch/__init__.py", line 1696, in __call__
return self.compiler_fn(model_, inputs_, **self.kwargs)
File "/data/users/andgu/pytorch/torch/_dynamo/backends/common.py", line 55, in compiler_fn
cg = aot_module_simplified(gm, example_inputs, **kwargs)
File "/data/users/andgu/pytorch/torch/_functorch/aot_autograd.py", line 4946, in aot_module_simplified
compiled_fn = create_aot_dispatcher_function(
File "/data/users/andgu/pytorch/torch/_dynamo/utils.py", line 244, in time_wrapper
r = func(*args, **kwargs)
File "/data/users/andgu/pytorch/torch/_functorch/aot_autograd.py", line 4486, in create_aot_dispatcher_function
compiled_fn = compiler_fn(flat_fn, fake_flat_args, aot_config, fw_metadata=fw_metadata)
File "/data/users/andgu/pytorch/torch/_functorch/aot_autograd.py", line 2825, in aot_wrapper_dedupe
return compiler_fn(flat_fn, leaf_flat_args, aot_config, fw_metadata=fw_metadata)
File "/data/users/andgu/pytorch/torch/_functorch/aot_autograd.py", line 3011, in aot_wrapper_synthetic_base
return compiler_fn(flat_fn, flat_args, aot_config, fw_metadata=fw_metadata)
File "/data/users/andgu/pytorch/torch/_functorch/aot_autograd.py", line 3714, in aot_dispatch_autograd
fx_g, joint_inputs, maybe_subclass_meta = aot_dispatch_autograd_graph(flat_fn, flat_args, aot_config, fw_metadata=fw_metadata)
File "/data/users/andgu/pytorch/torch/_functorch/aot_autograd.py", line 3694, in aot_dispatch_autograd_graph
fx_g = create_graph(joint_fn_to_trace, updated_joint_inputs, aot_config=aot_config)
File "/data/users/andgu/pytorch/torch/_functorch/aot_autograd.py", line 1955, in create_graph
fx_g = make_fx(f, decomposition_table=aot_config.decompositions)(*args)
File "/data/users/andgu/pytorch/torch/fx/experimental/proxy_tensor.py", line 869, in wrapped
t = dispatch_trace(wrap_key(func, args, fx_tracer, pre_dispatch), tracer=fx_tracer, concrete_args=tuple(phs))
File "/data/users/andgu/pytorch/torch/_compile.py", line 24, in inner
return torch._dynamo.disable(fn, recursive)(*args, **kwargs)
File "/data/users/andgu/pytorch/torch/_dynamo/eval_frame.py", line 489, in _fn
return fn(*args, **kwargs)
File "/data/users/andgu/pytorch/torch/_dynamo/external_utils.py", line 17, in inner
return fn(*args, **kwargs)
File "/data/users/andgu/pytorch/torch/fx/experimental/proxy_tensor.py", line 481, in dispatch_trace
graph = tracer.trace(root, concrete_args)
File "/data/users/andgu/pytorch/torch/_dynamo/eval_frame.py", line 489, in _fn
return fn(*args, **kwargs)
File "/data/users/andgu/pytorch/torch/_dynamo/external_utils.py", line 17, in inner
return fn(*args, **kwargs)
File "/data/users/andgu/pytorch/torch/fx/_symbolic_trace.py", line 821, in trace
(self.create_arg(fn(*args)),),
File "/data/users/andgu/pytorch/torch/fx/_symbolic_trace.py", line 688, in flatten_fn
tree_out = root_fn(*tree_args)
File "/data/users/andgu/pytorch/torch/fx/experimental/proxy_tensor.py", line 517, in wrapped
out = f(*tensors)
File "/data/users/andgu/pytorch/torch/_functorch/aot_autograd.py", line 3607, in joint_fn
return inner_fn(flat_fn_maybe_joint, (primals, tangents), use_trace_joint=True)
File "/data/users/andgu/pytorch/torch/_functorch/aot_autograd.py", line 3591, in inner_fn
wrapped_outs = fn(*all_args)
File "/data/users/andgu/pytorch/torch/_functorch/aot_autograd.py", line 1941, in joint_helper
return functionalized_f_helper(primals, tangents)
File "/data/users/andgu/pytorch/torch/_functorch/aot_autograd.py", line 1894, in functionalized_f_helper
f_outs = fn(*f_args)
File "/data/users/andgu/pytorch/torch/_functorch/aot_autograd.py", line 1862, in inner_fn_with_anomaly
return inner_fn(*args)
File "/data/users/andgu/pytorch/torch/_functorch/aot_autograd.py", line 1796, in inner_fn
outs, tangent_mask = fn(*primals)
File "/data/users/andgu/pytorch/torch/_functorch/aot_autograd.py", line 1724, in inner_fn
outs = fn(*args_maybe_cloned)
File "/data/users/andgu/pytorch/torch/_functorch/aot_autograd.py", line 4552, in functional_call
out = Interpreter(mod).run(*args[params_len:], **kwargs)
File "/data/users/andgu/pytorch/torch/fx/interpreter.py", line 138, in run
self.env[node] = self.run_node(node)
File "/data/users/andgu/pytorch/torch/fx/interpreter.py", line 195, in run_node
return getattr(self, n.op)(n.target, args, kwargs)
File "/data/users/andgu/pytorch/torch/fx/interpreter.py", line 267, in call_function
return target(*args, **kwargs)
File "/data/users/andgu/pytorch/torch/distributed/_tensor/api.py", line 280, in __torch_dispatch__
return DTensor._op_dispatcher.dispatch(
File "/data/users/andgu/pytorch/torch/distributed/_tensor/dispatch.py", line 106, in dispatch
self.sharding_propagator.propagate(op_info)
File "/data/users/andgu/pytorch/torch/distributed/_tensor/sharding_prop.py", line 161, in propagate
output_sharding = self.propagate_op_sharding_non_cached(op_info.schema)
File "/data/users/andgu/pytorch/torch/distributed/_tensor/sharding_prop.py", line 175, in propagate_op_sharding_non_cached
out_tensor_meta = self._propagate_tensor_meta(op_schema)
File "/data/users/andgu/pytorch/torch/distributed/_tensor/sharding_prop.py", line 85, in _propagate_tensor_meta
fake_args = op_schema.gen_fake_args()
File "/data/users/andgu/pytorch/torch/distributed/_tensor/op_schema.py", line 332, in gen_fake_args
return tree_map_only(
File "/data/users/andgu/pytorch/torch/utils/_cxx_pytree.py", line 765, in tree_map_only
return tree_map(
File "/data/users/andgu/pytorch/torch/utils/_cxx_pytree.py", line 607, in tree_map
return optree.tree_map(
File "/home/andgu/local/miniconda3/envs/pytorch-3.10/lib/python3.10/site-packages/optree/ops.py", line 473, in tree_map
return treespec.unflatten(flat_results)
File "/data/users/andgu/pytorch/torch/utils/_cxx_pytree.py", line 713, in wrapped
return func(x)
File "/data/users/andgu/pytorch/torch/distributed/_tensor/op_schema.py", line 31, in _rebuild_tensor_from_dtensor_meta
return torch.empty_strided(
File "/data/users/andgu/pytorch/torch/_subclasses/functional_tensor.py", line 297, in __torch_dispatch__
outs_unwrapped = func(*args_unwrapped, **kwargs_unwrapped)
File "/data/users/andgu/pytorch/torch/_ops.py", line 509, in __call__
return self._op(*args, **kwargs or {})
File "/data/users/andgu/pytorch/torch/utils/_stats.py", line 20, in wrapper
return fn(*args, **kwargs)
File "/data/users/andgu/pytorch/torch/fx/experimental/proxy_tensor.py", line 594, in __torch_dispatch__
return self.inner_torch_dispatch(func, types, args, kwargs)
File "/data/users/andgu/pytorch/torch/fx/experimental/proxy_tensor.py", line 629, in inner_torch_dispatch
return proxy_call(self, func, self.pre_dispatch, args, kwargs)
File "/data/users/andgu/pytorch/torch/fx/experimental/proxy_tensor.py", line 317, in proxy_call
proxy_args, proxy_kwargs = pytree.tree_map_only(
File "/data/users/andgu/pytorch/torch/utils/_pytree.py", line 631, in tree_map_only
return tree_map(map_only(__type_or_types)(func), tree)
File "/data/users/andgu/pytorch/torch/utils/_pytree.py", line 523, in tree_map
return tree_unflatten([func(i) for i in flat_args], spec)
File "/data/users/andgu/pytorch/torch/utils/_pytree.py", line 523, in <listcomp>
return tree_unflatten([func(i) for i in flat_args], spec)
File "/data/users/andgu/pytorch/torch/utils/_pytree.py", line 591, in wrapped
return func(x)
File "/data/users/andgu/pytorch/torch/fx/experimental/proxy_tensor.py", line 247, in inner
return get_proxy_slot(n, tracer)()
File "/data/users/andgu/pytorch/torch/fx/experimental/proxy_tensor.py", line 110, in get_proxy_slot
raise RuntimeError(f"{obj} is not tracked with proxy for {tracer}")
torch._dynamo.exc.BackendCompilerFailed: backend='aot_eager' raised:
RuntimeError: s0 is not tracked with proxy for <torch.fx.experimental.proxy_tensor.PythonKeyTracer object at 0x7fae60366c50>
While executing %result_2 : [num_users=1] = call_function[target=torch._C._nn.linear](args = (%prim_redistribute_2, %l_self_mlp_0_net2_weight, %l_self_mlp_0_net2_bias), kwargs = {})
Original traceback:
File "/data/users/andgu/pytorch/test/distributed/_tensor/test_dtensor_compile.py", line 51, in forward
return self.mlp_1(self.mlp_0(input))
File "/data/users/andgu/pytorch/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/data/users/andgu/pytorch/torch/testing/_internal/distributed/_tensor/common_dtensor.py", line 64, in forward
return self.net2(self.relu(self.net1(x)))
File "/data/users/andgu/pytorch/torch/nn/modules/module.py", line 1561, in _call_impl
result = forward_call(*args, **kwargs)
File "/data/users/andgu/pytorch/torch/nn/modules/linear.py", line 116, in forward
return F.linear(input, self.weight, self.bias)
```
</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114390
Approved by: https://github.com/wanchaol, https://github.com/huydhn
ghstack dependencies: #114379
If we assign `spec.tensor_meta = ...`, we do not have to recompute the hash eagerly. We just need to clear the existing hash so that the next call to `__hash__` recomputes it.
We found that the breakage of the DTensor + `torch.compile` tests comes from https://github.com/pytorch/pytorch/pull/114236 and are not directly related to the `DTensorSpec` hashing changes. We fix that in the following PR temporarily by passing `dynamic=False`.
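A minimal sketch of the lazy-hash pattern being described (an illustrative class, not the actual `DTensorSpec` code):
```
class Spec:
    def __init__(self, tensor_meta):
        self._tensor_meta = tensor_meta
        self._hash = None  # computed lazily on first __hash__

    @property
    def tensor_meta(self):
        return self._tensor_meta

    @tensor_meta.setter
    def tensor_meta(self, value):
        self._tensor_meta = value
        self._hash = None  # just invalidate; the next __hash__ recomputes

    def __hash__(self):
        if self._hash is None:
            self._hash = hash(self._tensor_meta)
        return self._hash
```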
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114379
Approved by: https://github.com/wanchaol
Summary:
Forward fix test failures caused by D51468211.
The root cause is that when converting the param_buffer into a fake_tensor, we didn't set static_shapes=True; this caused the shape_env to have more symbols than expected. The current assumption is that all params and buffers have constant sizes.
Test Plan: buck2 test 'fbcode//mode/opt' fbcode//aps_models/ads/icvr/tests:export_test_cpu -- --exact 'aps_models/ads/icvr/tests:export_test_cpu - test_20x_icvr_export (aps_models.ads.icvr.tests.export_test.ExportTest)'
Reviewed By: hongtansun-meta
Differential Revision: D51531279
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114381
Approved by: https://github.com/angelayi
1. This tags docker images using docker pull/tag/push for current release
2. Sets RELEASE_VERSION_TAG var and regenerates the workflows using the new docker tag
3. Removes conda token setting and binary tests release changes; these are already automated
4. Pins unstable and disabled jobs; automate: https://github.com/pytorch/pytorch/pull/111675
Test:
```
RELEASE_VERSION=2.2 ./scripts/release/apply-release-changes.sh
Tagging pytorch/manylinux-builder:cuda11.8-main to pytorch/manylinux-builder:cuda11.8-2.2 , dry_run: enabled
Tagging pytorch/manylinux-builder:cuda12.1-main to pytorch/manylinux-builder:cuda12.1-2.2 , dry_run: enabled
Tagging pytorch/libtorch-cxx11-builder:cuda11.8-main to pytorch/libtorch-cxx11-builder:cuda11.8-2.2 , dry_run: enabled
Tagging pytorch/libtorch-cxx11-builder:cuda12.1-main to pytorch/libtorch-cxx11-builder:cuda12.1-2.2 , dry_run: enabled
Tagging pytorch/manylinux-builder:rocm5.6-main to pytorch/manylinux-builder:rocm5.6-2.2 , dry_run: enabled
Tagging pytorch/manylinux-builder:rocm5.7-main to pytorch/manylinux-builder:rocm5.7-2.2 , dry_run: enabled
Tagging pytorch/libtorch-cxx11-builder:rocm5.6-main to pytorch/libtorch-cxx11-builder:rocm5.6-2.2 , dry_run: enabled
Tagging pytorch/libtorch-cxx11-builder:rocm5.7-main to pytorch/libtorch-cxx11-builder:rocm5.7-2.2 , dry_run: enabled
Tagging pytorch/manylinux-builder:cpu-main to pytorch/manylinux-builder:cpu-2.2 , dry_run: enabled
Tagging pytorch/libtorch-cxx11-builder:cpu-main to pytorch/libtorch-cxx11-builder:cpu-2.2 , dry_run: enabled
Tagging pytorch/manylinuxcxx11-abi-builder:cpu-cxx11-abi-main to pytorch/manylinuxcxx11-abi-builder:cpu-cxx11-abi-2.2 , dry_run: enabled
Tagging pytorch/manylinuxaarch64-builder:cpu-aarch64-main to pytorch/manylinuxaarch64-builder:cpu-aarch64-2.2 , dry_run: enabled
Tagging pytorch/conda-builder:cuda11.8-main to pytorch/conda-builder:cuda11.8-2.2 , dry_run: enabled
Tagging pytorch/conda-builder:cuda12.1-main to pytorch/conda-builder:cuda12.1-2.2 , dry_run: enabled
Tagging pytorch/conda-builder:cpu-main to pytorch/conda-builder:cpu-2.2 , dry_run: enabled
/data/users/atalman/pytorch/.github/workflows/generated-linux-binary-manywheel-nightly.yml
/data/users/atalman/pytorch/.github/workflows/generated-linux-binary-conda-nightly.yml
/data/users/atalman/pytorch/.github/workflows/generated-linux-binary-libtorch-cxx11-abi-nightly.yml
/data/users/atalman/pytorch/.github/workflows/generated-linux-binary-libtorch-pre-cxx11-nightly.yml
/data/users/atalman/pytorch/.github/workflows/generated-linux-aarch64-binary-manywheel-nightly.yml
/data/users/atalman/pytorch/.github/workflows/generated-linux-binary-manywheel-main.yml
/data/users/atalman/pytorch/.github/workflows/generated-linux-binary-libtorch-cxx11-abi-main.yml
/data/users/atalman/pytorch/.github/workflows/generated-linux-binary-libtorch-pre-cxx11-main.yml
/data/users/atalman/pytorch/.github/workflows/generated-windows-binary-wheel-nightly.yml
/data/users/atalman/pytorch/.github/workflows/generated-windows-binary-conda-nightly.yml
/data/users/atalman/pytorch/.github/workflows/generated-windows-binary-libtorch-release-nightly.yml
/data/users/atalman/pytorch/.github/workflows/generated-windows-binary-libtorch-debug-nightly.yml
/data/users/atalman/pytorch/.github/workflows/generated-windows-binary-libtorch-release-main.yml
/data/users/atalman/pytorch/.github/workflows/generated-windows-binary-libtorch-debug-main.yml
/data/users/atalman/pytorch/.github/workflows/generated-macos-binary-wheel-nightly.yml
/data/users/atalman/pytorch/.github/workflows/generated-macos-binary-conda-nightly.yml
/data/users/atalman/pytorch/.github/workflows/generated-macos-binary-libtorch-cxx11-abi-nightly.yml
/data/users/atalman/pytorch/.github/workflows/generated-macos-arm64-binary-libtorch-cxx11-abi-nightly.yml
/data/users/atalman/pytorch/.github/workflows/generated-macos-arm64-binary-wheel-nightly.yml
/data/users/atalman/pytorch/.github/workflows/generated-macos-arm64-binary-conda-nightly.yml
```
Result of pinning unstable and disabled jobs:
```
# The link to the published list of disabled jobs
DISABLED_JOBS_URL = "https://ossci-metrics.s3.amazonaws.com/disabled-jobs.json?versionid=kKJlAXdrUbk3CilXbKu.6OwNTGQB8a.B"
# and unstable jobs
UNSTABLE_JOBS_URL = "https://ossci-metrics.s3.amazonaws.com/unstable-jobs.json?versionid=vzaicOxSsh55iXBXwgGrW6dFeVtPfrhr"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114355
Approved by: https://github.com/malfet
Summary:
Move the serialized CustomClassHolder objects to the toplevel SerializedArtifact instead of embedding the bytes in the graph.
Currently the CustomClassHolder objects are embedded in the graph instead of being lifted to the ExportedProgram, so there's some logic introduced to lift them to the higher level of the serialized ExportedProgram. However, once the CustomClassHolder objects get lifted, we can remove the TODOs I added.
Test Plan: CI
Reviewed By: zhxchen17
Differential Revision: D51479125
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114371
Approved by: https://github.com/ydwu4
Hello everyone! 😄
Also @lezcano , nice to meet you! :)
Sorry if I miss anything, this is my first time around here. 🙃
This PR basically makes the behaviour the same for CUDA when using `torch.pow`. Basically, Python considers True as 1 and False as 0. I just added this check into the `pow` function. From what I understood, when I do `.equal` for a `Scalar` that is boolean, I'm sure the types match, so that won't cause more trouble.
I know that the issue suggests disabling this case, but that could be a little more complicated, in my humble opinion. And that could create some compatibility problems too, I guess.
My argument is that the code below is correct in the native language, so I guess it does make sense to send booleans as Scalar.
```
>>> x = True
>>> x + x
2
```
This was my first test:
```
Python 3.12.0 | packaged by Anaconda, Inc. | (main, Oct 2 2023, 17:29:18) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.pow(torch.tensor([1, 2], device='cuda'), True)
tensor([1, 2], device='cuda:0')
>>> torch.pow(torch.tensor([1, 2]), True)
tensor([1, 2])
>>> torch.pow(torch.tensor([1, 2]), False)
tensor([1, 1])
>>> torch.pow(torch.tensor([1, 2], device='cuda'), False)
tensor([1, 1], device='cuda:0')
```
I've run `test_torch.py` and got the following results, so my guess is that I didn't break anything. I was just looking for a test that uses linear regression, as suggested.
```
Ran 1619 tests in 52.363s
OK (skipped=111)
[TORCH_VITAL] Dataloader.enabled True
[TORCH_VITAL] Dataloader.basic_unit_test TEST_VALUE_STRING
[TORCH_VITAL] CUDA.used true
```
(I can paste whole log, if necessary)
If this is a bad idea overall, don't worry about it. It's not a big deal, it's actually a two-line change 😅 so we can talk about how to do things with a different strategy.
For the record, I've signed the agreement already. And I didn't run the linter because it's not working 😞. Looks like PyYAML 6.0 is broken and there's a 6.0.1 fix already, but I have no idea how to update that 😅
Fixes #113198
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114133
Approved by: https://github.com/lezcano
Installing the current pinned version of TorchAudio can be problematic because it
expects to be able to download a file from sourceware.org (see [ref](a8f4e97bd5/third_party/bzip2/CMakeLists.txt (L14))), which does
not have any uptime guarantees.
This bumps the pin to the latest v2.1.1 commit (https://github.com/pytorch/audio/releases/tag/v2.1.1), which should have fewer third-party dependencies and thus be less flaky
<details>
<summary> Should help with errors like: </summary>
logs link: https://github.com/pytorch/pytorch/actions/runs/6959510046/job/18942955523#step:15:592
```
5h+ pip install --progress-bar off --no-use-pep517 --user git+https://github.com/pytorch/audio.git@a8f4e97bd5356a7a77510cdf6a3a62e25a5dc602
Collecting git+https://github.com/pytorch/audio.git@a8f4e97bd5356a7a77510cdf6a3a62e25a5dc602
Cloning https://github.com/pytorch/audio.git (to revision a8f4e97bd5356a7a77510cdf6a3a62e25a5dc602) to /tmp/pip-req-build-6b5hkzmq
Running command git clone --filter=blob:none --quiet https://github.com/pytorch/audio.git /tmp/pip-req-build-6b5hkzmq
Running command git rev-parse -q --verify 'sha^a8f4e97bd5356a7a77510cdf6a3a62e25a5dc602'
Running command git fetch -q https://github.com/pytorch/audio.git a8f4e97bd5356a7a77510cdf6a3a62e25a5dc602
Running command git checkout -q a8f4e97bd5356a7a77510cdf6a3a62e25a5dc602
Resolved https://github.com/pytorch/audio.git to commit a8f4e97bd5356a7a77510cdf6a3a62e25a5dc602
Running command git submodule update --init --recursive -q
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error
× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [60 lines of output]
Traceback (most recent call last):
File "/opt/conda/envs/py_3.10/lib/python3.10/urllib/request.py", line 1348, in do_open
h.request(req.get_method(), req.selector, req.data, headers,
File "/opt/conda/envs/py_3.10/lib/python3.10/http/client.py", line 1283, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/opt/conda/envs/py_3.10/lib/python3.10/http/client.py", line 1329, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/opt/conda/envs/py_3.10/lib/python3.10/http/client.py", line 1278, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/opt/conda/envs/py_3.10/lib/python3.10/http/client.py", line 1038, in _send_output
self.send(msg)
File "/opt/conda/envs/py_3.10/lib/python3.10/http/client.py", line 976, in send
self.connect()
File "/opt/conda/envs/py_3.10/lib/python3.10/http/client.py", line 1448, in connect
super().connect()
File "/opt/conda/envs/py_3.10/lib/python3.10/http/client.py", line 942, in connect
self.sock = self._create_connection(
File "/opt/conda/envs/py_3.10/lib/python3.10/socket.py", line 845, in create_connection
raise err
File "/opt/conda/envs/py_3.10/lib/python3.10/socket.py", line 833, in create_connection
sock.connect(sa)
OSError: [Errno 99] Cannot assign requested address
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<string>", line 2, in <module>
File "<pip-setuptools-caller>", line 34, in <module>
File "/tmp/pip-req-build-6b5hkzmq/setup.py", line 184, in <module>
_main()
File "/tmp/pip-req-build-6b5hkzmq/setup.py", line 145, in _main
_fetch_third_party_libraries()
File "/tmp/pip-req-build-6b5hkzmq/setup.py", line 129, in _fetch_third_party_libraries
_fetch_archives(_parse_sources())
File "/tmp/pip-req-build-6b5hkzmq/setup.py", line 123, in _fetch_archives
torch.hub.download_url_to_file(url, dest, progress=False)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/hub.py", line 620, in download_url_to_file
u = urlopen(req)
File "/opt/conda/envs/py_3.10/lib/python3.10/urllib/request.py", line 216, in urlopen
return opener.open(url, data, timeout)
File "/opt/conda/envs/py_3.10/lib/python3.10/urllib/request.py", line 519, in open
response = self._open(req, data)
File "/opt/conda/envs/py_3.10/lib/python3.10/urllib/request.py", line 536, in _open
result = self._call_chain(self.handle_open, protocol, protocol +
File "/opt/conda/envs/py_3.10/lib/python3.10/urllib/request.py", line 496, in _call_chain
result = func(*args)
File "/opt/conda/envs/py_3.10/lib/python3.10/urllib/request.py", line 1391, in https_open
return self.do_open(http.client.HTTPSConnection, req,
File "/opt/conda/envs/py_3.10/lib/python3.10/urllib/request.py", line 1351, in do_open
raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 99] Cannot assign requested address>
-- Git branch: HEAD
-- Git SHA: a8f4e97bd5356a7a77510cdf6a3a62e25a5dc602
-- Git tag: None
-- PyTorch dependency: torch
-- Building version 2.0.0a0+a8f4e97
--- Initializing submodules
--- Initialized submodule
--- Fetching v1.2.12.tar.gz
--- Fetching bzip2-1.0.8.tar.gz
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed
```
</details>
Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114393
Approved by: https://github.com/atalman, https://github.com/kit1980, https://github.com/huydhn
This PR fixes an incorrect warning for the non-static rdzv backend; the
warning should only be thrown when the rdzv endpoint is not specified.
error repro from @stas00
```
$ cat test.py
import torch
$ python -u -m torch.distributed.run --nproc_per_node=1 --rdzv_endpoint localhost:6000 --rdzv_backend c10d test.py
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114335
Approved by: https://github.com/H-Huang
# Summary
Improved fix for the attention mask alignment issue (#112577)
This PR addresses issue #112577 by refining the previously implemented fix, which was found to be incorrect and caused unneeded memory regressions. The update simplifies the approach to handling the alignment of the attention mask for mem-eff attention.
## Changes
Alignment check and padding: first, the alignment of the attention mask is checked. If misalignment is detected, padding is applied, followed by slicing. During this process, a warning is raised to alert users.
Should this be warn_once?
We only call expand once, on the aligned mask.
Reference
https://github.com/facebookresearch/xformers/blob/main/xformers/ops/fmha/cutlass.py#L115
@albanD, @mruberry, @jbschlosser, @walterddr, and @mikaylagawarecki.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114173
Approved by: https://github.com/danthe3rd
**Summary**
To solve issue #113706:
1. replace `PairwiseParallel` with `ColwiseParallel` and `RowwiseParallel`.
2. replace the input of ColwiseParallel from `make_input_replicate_1d` and `make_output_replicate_1d` to `input_layouts` and `output_layouts`.
3. deprecate the tests for `_parallelize_mlp` as it only supports `PairwiseParallel`.
**Test Plan**
`pytest pytorch/test/distributed/tensor/parallel/`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114314
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
Summary:
As raised in this issue: https://github.com/pytorch/pytorch/issues/113776
cuSPARSELt does not support sharing handles across different threads.
Ideally we would use something like CuSparseHandlePool to do this, but
since cuSPARSELt handle creation is inconsistent with the rest of CUDA,
we have to make these variables thread_local instead.
Test Plan:
`python test/test_sparse_semi_structured.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114273
Approved by: https://github.com/danthe3rd
Fixes #97361
When a fused kernel has more than 1024 parameters, ctypes throws an error.
Limiting the number of args is a mechanism to protect stack memory: C++ passes args via the stack, and stack memory has a size limit.
Code change:
1. The cpp backend checks the fused nodes' arg count; when it reaches the limit, the backend flushes its status to ready.
2. The scheduler checks the `ready_to_flush` API and helps the backend flush codegen.
3. Add a `ready_to_flush` API to `BaseScheduling`; the Triton backend returns False as it does not support this yet. (A minimal sketch of this handshake is shown below.)
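A minimal sketch of that handshake, assuming illustrative names (`MAX_ARG_COUNT`, `CppSchedulingSketch`, and `pending_args` are hypothetical, not Inductor's real API):
```python
MAX_ARG_COUNT = 1024  # practical limit on kernel arguments passed via ctypes

class BaseSchedulingSketch:
    def ready_to_flush(self) -> bool:
        # Backends that don't support flushing (e.g. Triton) just say no.
        return False

class CppSchedulingSketch(BaseSchedulingSketch):
    def __init__(self):
        self.pending_args = set()

    def fuse(self, node_args):
        self.pending_args.update(node_args)

    def ready_to_flush(self) -> bool:
        # Ask the scheduler to flush codegen before the generated C++ kernel
        # signature outgrows what can safely be passed on the stack.
        return len(self.pending_args) >= MAX_ARG_COUNT
```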
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113131
Approved by: https://github.com/jgong5, https://github.com/mlazos
Summary: When a weight tensor is 0-size, no device memory should be allocated for it. This PR fixes the weight loading logic for such a case. This problem was found when running the 14K model test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114280
Approved by: https://github.com/chenyang78
## How to reproduce:
```py
import os
import tempfile
import torch
from torch import nn
from torch.optim import SGD
from torch.optim.lr_scheduler import CyclicLR
model = nn.Linear(100, 100)
opt = SGD(model.parameters(), lr=1.)
scheduler = CyclicLR(opt, base_lr=0.1, max_lr=0.2, scale_fn=lambda x: 0.99)
tmp = tempfile.NamedTemporaryFile(delete=False)
try:
torch.save(scheduler.state_dict(), tmp.name)
scheduler.load_state_dict(torch.load(tmp.name))
finally:
tmp.close()
os.unlink(tmp.name)
```
Error:
```
_pickle.PicklingError: Can't pickle <function <lambda> at 0x000001A51DF67600>: attribute lookup <lambda> on __main__ failed
```
## Fix:
`scale_fn` is saved to the state dict only if it is a callable object, and not if it is a function or lambda.
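A minimal sketch of the idea on a simplified scheduler class (not the exact PyTorch implementation):
```python
import types

class SchedulerSketch:
    def __init__(self, scale_fn=None):
        self.scale_fn = scale_fn

    def state_dict(self):
        # Plain functions and lambdas pickle by qualified name and fail for
        # lambdas defined in __main__, so exclude them from the state.
        state = {k: v for k, v in self.__dict__.items() if k != "scale_fn"}
        if self.scale_fn is not None and not isinstance(self.scale_fn, types.FunctionType):
            state["scale_fn"] = self.scale_fn  # callable object: safe to pickle
        return state
```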
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110931
Approved by: https://github.com/janeyx99
For embedding lookup, there are indirect indexing with indices that are invariant to the vectorized itervar. To vectorize it, we need to keep the related indexing variables as scalars and allow vectorization when the related index_exprs are invariant to the vectorized itervar.
This PR adds the support by lazily broadcasting scalar values (index_expr and constant) to vectors so that vector operations are only generated if needed by `CppVecKernel` when any of the inputs are vectors, otherwise, scalar ops are generated. The cse variable in cpp is now represented with `CppCSEVariable` which bookkeeps the relevant itervars to the variable and has a flag to mark whether it is a scalar or a vector. `CppVecOverrides` is improved to propagate these states when the ops are executed.
For the added UT `test_embedding_vec`, the generated code before this PR is:
```c++
extern "C" void kernel(const long* in_ptr0,
const float* in_ptr1,
const float* in_ptr2,
float* out_ptr0)
{
#pragma omp parallel num_threads(64)
{
{
#pragma omp for
for(long x0=static_cast<long>(0L); x0<static_cast<long>(128L); x0+=static_cast<long>(1L))
{
#pragma GCC ivdep
for(long x1=static_cast<long>(0L); x1<static_cast<long>(128L); x1+=static_cast<long>(1L))
{
auto tmp0 = in_ptr0[static_cast<long>(x0)];
auto tmp5 = in_ptr2[static_cast<long>(x1 + (128L*x0))];
auto tmp1 = decltype(tmp0)(tmp0 + 64);
auto tmp2 = tmp0 < 0;
auto tmp3 = tmp2 ? tmp1 : tmp0;
TORCH_CHECK((0 <= tmp3) & (tmp3 < 64L), "index out of bounds: 0 <= tmp3 < 64L")
auto tmp4 = in_ptr1[static_cast<long>(x1 + (128L*tmp3))];
auto tmp6 = decltype(tmp4)(tmp4 + tmp5);
out_ptr0[static_cast<long>(x1 + (128L*x0))] = tmp6;
}
}
}
}
}
```
After this PR, we have:
```c++
extern "C" void kernel(const long* in_ptr0,
const float* in_ptr1,
const float* in_ptr2,
float* out_ptr0)
{
#pragma omp parallel num_threads(64)
{
{
#pragma omp for
for(long x0=static_cast<long>(0L); x0<static_cast<long>(128L); x0+=static_cast<long>(1L))
{
for(long x1=static_cast<long>(0L); x1<static_cast<long>(128L); x1+=static_cast<long>(16L))
{
auto tmp0 = in_ptr0[static_cast<long>(x0)];
auto tmp5 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(x1 + (128L*x0)));
auto tmp1 = decltype(tmp0)(tmp0 + 64);
auto tmp2 = tmp0 < 0;
auto tmp3 = tmp2 ? tmp1 : tmp0;
TORCH_CHECK((0 <= tmp3) & (tmp3 < 64L), "index out of bounds: 0 <= tmp3 < 64L")
auto tmp4 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x1 + (128L*tmp3)));
auto tmp6 = tmp4 + tmp5;
tmp6.store(out_ptr0 + static_cast<long>(x1 + (128L*x0)));
}
}
}
}
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114062
Approved by: https://github.com/jansel
Meta-internal customers need more flexible configs for these group/batch fusions' execution order and parameters, so I'd like to provide a new inductor config that lets users fine-tune and auto-tune these group/batch fusions for different models.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113738
Approved by: https://github.com/xuzhao9
Although opmath is the right thing to do to retain on-par precision, it inserts
upcasts everywhere in the graph. This is particularly hard for the backend to optimize,
since there is no way to differentiate between inserted upcasts and model-code
casts. Hence we consolidate the input dtype to the result dtype to avoid this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113780
Approved by: https://github.com/titaiwangms, https://github.com/justinchuby
This PR checks the tensor meta of the outputs of cond's branches. This helps us identify several tests that return outputs with different requires_grad. It also fixes the error messages: the error, previously raised in torch.ops.higher_order.cond, is now raised in dynamo's CondHigherOrder.
Test Plan:
Existing tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113900
Approved by: https://github.com/zou3519
ghstack dependencies: #113819
This PR adds should_flatten_output=True for cond. This effectively allows cond to support pytree output, with the output being flattened. Note: a single tensor output is automatically cast to a tuple for torch.ops.higher_order.cond.
This PR also adds support for comparing BuiltinVariables, e.g. tuple; this is to make sure dynamo can inline the comparison of two tree_specs and verify that both branches return the same tree_spec.
Test Plan:
Existing tests. Will add more pytree tests and modify the documentations in the follow-up prs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113819
Approved by: https://github.com/zou3519
I'm not sure why we needed two overloads previously, let's find out! Removing the int overload is load bearing because it now forces specialization on SymInt arguments instead of falling through to the SymInt overload, see new test.
I decided NOT to allow storage offset simultaneously with None strides.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114236
Approved by: https://github.com/albanD
Currently the user can use torch.onnx.dynamo_export to export the model to ONNX.
```python
import torch
class Model(torch.nn.Module):
def forward(self, x):
return x + 1.0
onnx_program = torch.onnx.dynamo_export(
Model(),
torch.randn(1, 1, 2, dtype=torch.float),
)
```
The next step would be instantiating a ONNX runtime to execute it.
```python
import onnxruntime # type: ignore[import]
onnx_input = self.adapt_torch_inputs_to_onnx(*args, **kwargs)
options = options or {}
providers = options.get("providers", onnxruntime.get_available_providers())
onnx_model = self.model_proto.SerializeToString()
ort_session = onnxruntime.InferenceSession(onnx_model, providers=providers)
def to_numpy(tensor):
return (
tensor.detach().cpu().numpy()
if tensor.requires_grad
else tensor.cpu().numpy()
)
onnxruntime_input = {
k.name: to_numpy(v) for k, v in zip(ort_session.get_inputs(), onnx_input)
}
return ort_session.run(None, onnxruntime_input)
```
This PR provides the `ONNXProgram.__call__` method as a facilitator to use ONNX Runtime under the hood, similar to `torch.export.ExportedProgram.__call__`, which allows the underlying `torch.fx.GraphModule` to be executed.
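A hedged usage sketch (the exact structure of the returned outputs is not spelled out here, so treat the output handling as illustrative):
```python
import torch

class Model(torch.nn.Module):
    def forward(self, x):
        return x + 1.0

x = torch.randn(1, 1, 2, dtype=torch.float)
onnx_program = torch.onnx.dynamo_export(Model(), x)
outputs = onnx_program(x)  # an ONNX Runtime InferenceSession is created and run under the hood
print(outputs)
```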
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113495
Approved by: https://github.com/titaiwangms
Summary:
The FORMAT opcode for the mobile TorchScript interpreter contained an out-of-bounds read issue leading to memory corruption.
This change adds an explicit check that the number of inputs passed to the format method called when handling the FORMAT opcode is valid and within the bounds of the stack.
Test Plan: contbuild + OSS signals
Differential Revision: D49739095
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110303
Approved by: https://github.com/malfet
I have observed that quite a few Reduction.Outer kernels have potential for performance improvement by reducing register pressure. This is due to our current register-pressure reduction logic, which only reduces RBLOCK and doesn't work for outer reductions. While we can tighten that up, which would likely increase compile time, I found a better workaround: tune down XBLOCK in the first place.
Perf job: main 9efbb4ea73 (11/16) vs hoy/autotune/reduction
Slight compile time and perf improvement seen.
I also saw perf improvement locally for the few kernels being investigated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114284
Approved by: https://github.com/jansel
Related to the Reproducible Testing BE project. Goal is to print out the sample input that failed an OpInfo test.
Crazy idea: to avoid requiring widespread changes across tests that use OpInfo sample inputs, return a new special iterator type from `OpInfo.sample_inputs()`, etc. that tracks the most recent item seen. If a test fails later on, print out this info to identify the sample that failed the test.
This solves the problem that the test framework currently has no concept of which sample input is being operated on.
This PR contains the following changes:
* New `TrackedInputIter` that wraps a sample inputs func iterator and tracks the most recent input seen in a `TrackedInput` structure (sketched after this list)
* The information is stored in a dictionary on the test function itself, mapping `full test ID -> most recent TrackedInput`
* To determine the test function that is being run, we do some stack crawling hackery in `extract_test_fn_and_id()`
* Above applies only when one of the following is called: `OpInfo.sample_inputs()`, `OpInfo.error_inputs()`, `OpInfo.reference_inputs()`, and `OpInfo.conjugate_sample_inputs()`. This could easily be extended to `ModuleInfo`s and the sparse sample input funcs as well
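A minimal sketch of the tracking-iterator idea (the constructor arguments are illustrative, not the real `torch.testing` internals):
```python
class TrackedInputIterSketch:
    """Wraps a sample-inputs iterator and remembers the last item it yielded."""

    def __init__(self, inner, store, test_id):
        self.inner = iter(inner)
        self.store = store        # e.g. a dict hung off the test function
        self.test_id = test_id
        self.index = -1

    def __iter__(self):
        return self

    def __next__(self):
        item = next(self.inner)   # StopIteration propagates naturally
        self.index += 1
        # On a later failure, the harness can report store[test_id].
        self.store[self.test_id] = (self.index, item)
        return item
```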
Example output when a sample input causes a failure:
```
======================================================================
ERROR: test_foo_add_cpu_uint8 (__main__.TestFakeTensorCPU)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/jbschlosser/branches/reproducible_testing/torch/testing/_internal/common_device_type.py", line 911, in test_wrapper
return test(*args, **kwargs)
File "/home/jbschlosser/branches/reproducible_testing/torch/testing/_internal/common_device_type.py", line 1097, in only_fn
return fn(slf, *args, **kwargs)
File "/home/jbschlosser/branches/reproducible_testing/test/test_ops.py", line 2211, in test_foo
self.fail('Example failure')
AssertionError: Example failure
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/jbschlosser/branches/reproducible_testing/torch/testing/_internal/common_utils.py", line 2436, in wrapper
method(*args, **kwargs)
File "/home/jbschlosser/branches/reproducible_testing/torch/testing/_internal/common_device_type.py", line 414, in instantiated_test
result = test(self, **param_kwargs)
File "/home/jbschlosser/branches/reproducible_testing/torch/testing/_internal/common_device_type.py", line 917, in test_wrapper
raise Exception(
Exception: Caused by sample input at index 2: SampleInput(input=Tensor[size=(5, 1), device="cpu", dtype=torch.uint8], args=TensorList[Tensor[size=(5,), device="cpu", dtype=torch.uint8]], kwargs={}, broadcasts_input=True, name='')
To execute this test, run the following from the base repo dir:
python test/test_ops.py -k test_foo_add_cpu_uint8
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
----------------------------------------------------------------------
```
This notably doesn't print the actual `SampleInput` values, as that's hard without fully reproducible random sample generation. I went down this path for a while and it seems infeasible without adding an untenable amount of overhead to set the random seed per SampleInput (see https://github.com/pytorch/pytorch/issues/86694#issuecomment-1614943708 for more details). For now, I am settling for at least spitting out the index and some metadata of the `SampleInput`, as it seems better than nothing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99444
Approved by: https://github.com/janeyx99
Currently `ncclCommInitRankConfig` is always used when creating new
communicator groups. This is wasteful as it creates non-shared pairs
of endpoint queues as well as costs time to re-establish
communication.
This change is transparent and opportunistic; when `dist.new_group` is
called, it will use the existing, healthy world process group to
select the right ranks to include in the process group.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112889
Approved by: https://github.com/kwen2501
ATT, for cases like reduction + multiple pointwises + multi-level reduction, previously to decide num_splits of the multi-level reduction we only checked whether the input of the multi-level reduction, or the input of that input, was a reduction node (i.e., the max search level was 2). This PR changes the behavior to search for a reduction input node recursively as long as the preceding input nodes are pointwise nodes.
Performance-wise it looks fine.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112935
Approved by: https://github.com/chenyang78
Changes:
1. Add `_private_register_pytree_node` API in both C++ and Python pytree. In C++ pytree, the API will only register pytree node for C++ pytree. In Python pytree, the API will only register pytree node for Python pytree.
2. Do not allow registering a type as pytree node twice in the Python pytree.
3. Add thread lock to the Python pytree node register API.
4. The old `_register_pytree_node` API will call the `_private_register_pytree_node` API and raise a deprecation warning.
5. Add a new `register_pytree_node` API to register a node type in both the C++ and Python implementations (usage sketch after this list).
6. Add tests to ensure a warning will be raised when the old private function is called.
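A hedged usage sketch, assuming the Python-side entry point is `torch.utils._pytree.register_pytree_node` with positional flatten/unflatten arguments:
```python
import torch.utils._pytree as pytree

class Pair:
    def __init__(self, first, second):
        self.first, self.second = first, second

pytree.register_pytree_node(
    Pair,
    lambda p: ((p.first, p.second), None),  # flatten: children + static context
    lambda children, ctx: Pair(*children),  # unflatten
)

leaves, spec = pytree.tree_flatten(Pair(1, 2))
assert leaves == [1, 2]
assert pytree.tree_unflatten(leaves, spec).first == 1
```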
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112111
Approved by: https://github.com/zou3519
Summary:
In some cases (especially those involving collective calls) - we would want to always kick off a collective call first before running going down another path.
For example:
```
tbe lookup -> a2a ->
overarch
dense ------------->
```
If the forward code is written as
```
a2a_out = a2a
dense = dense_net
out = overarch(a2a_out, dense)
out.backward()
```
The current default is to run backward in the opposite order the forward was called. However, there is no data dependency between a2a and dense, so in reality either of them could run first. We would like the a2a to run first because it provides optimal (on average) overlap.
Changing the seq_nr of a2a_out to something large enough would allow autograd engine to kick it off first.
Test Plan: Tests incoming
Differential Revision: D51445261
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114120
Approved by: https://github.com/ezyang, https://github.com/albanD
By always calling `[destBuffer release]` before leaving the scope in which it was allocated.
Leak was introduced by https://github.com/pytorch/pytorch/pull/84928
Add regression test.
Before the change:
```
% python ../test/test_mps.py -v -k test_copy_cast_no_leak --repeat 10
test_copy_cast_no_leak (__main__.TestMemoryLeak) ... FAIL
======================================================================
FAIL: test_copy_cast_no_leak (__main__.TestMemoryLeak)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/nshulga/git/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 2554, in wrapper
method(*args, **kwargs)
File "/Users/nshulga/git/pytorch/pytorch/build/../test/test_mps.py", line 1064, in test_copy_cast_no_leak
self.assertTrue(driver_before == driver_after, f"Detected {driver_after-driver_before} bytes leak of GPU memory")
AssertionError: False is not true : Detected 65536 bytes leak of GPU memory
To execute this test, run the following from the base repo dir:
python test/test_mps.py -k test_copy_cast_no_leak
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
----------------------------------------------------------------------
Ran 1 test in 1.102s
FAILED (failures=1)
```
After:
```
% python ../test/test_mps.py -k test_copy_cast_no_leak --repeat 10
.
----------------------------------------------------------------------
Ran 1 test in 0.819s
OK
.
----------------------------------------------------------------------
Ran 1 test in 0.001s
OK
.
----------------------------------------------------------------------
Ran 1 test in 0.002s
OK
...
```
Fixes https://github.com/pytorch/pytorch/issues/114096
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114197
Approved by: https://github.com/kit1980
Submodule update: updating pthreadpool to this revision.
This is in preparation for upgrading XNNPACK, as the new XNNPACK version uses some of the new pthreadpool APIs introduced in this revision.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113904
Approved by: https://github.com/Skylion007
Fixes #113940. This vendors the relevant parts of `packaging==23.2.0` to have access to `Version` and `InvalidVersion` without taking a runtime dependency on `setuptools` or `packaging`.
I didn't find any vendoring policy so I put it under `torch._vendor.packaging`. While I have only vendored the files we need, I have not touched or trimmed the files otherwise.
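A hedged sketch of what this enables internally; since the vendored module mirrors upstream `packaging`, the `version` submodule path below is an assumption:
```python
from torch._vendor.packaging.version import InvalidVersion, Version

def at_least(ver: str, minimum: str) -> bool:
    # Compare version strings without a runtime dependency on `packaging`.
    try:
        return Version(ver) >= Version(minimum)
    except InvalidVersion:
        return False
```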
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114108
Approved by: https://github.com/malfet, https://github.com/albanD
Summary: Some models have hardcoded int constant tensors for some binary operations.
Test Plan:
```
yipjustin@yipjustin-mbp fbsource % buck2 run -c pt.has_backtraces=1 --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -- --gtest_filter="*"
...
[ OK ] VulkanAPITest.linear_3d_flat (0 ms)
[ RUN ] VulkanAPITest.linear_3d_small
[ OK ] VulkanAPITest.linear_3d_small (0 ms)
[ RUN ] VulkanAPITest.linear_3d_large
[ OK ] VulkanAPITest.linear_3d_large (0 ms)
[ RUN ] VulkanAPITest.linear_4d_flat
[ OK ] VulkanAPITest.linear_4d_flat (0 ms)
[ RUN ] VulkanAPITest.linear_4d_small
[ OK ] VulkanAPITest.linear_4d_small (0 ms)
[ RUN ] VulkanAPITest.linear_4d_large
[ OK ] VulkanAPITest.linear_4d_large (0 ms)
[ RUN ] VulkanAPITest.lstm_success
[ OK ] VulkanAPITest.lstm_success (5 ms)
[ RUN ] VulkanAPITest.lstm_mclareninputs_success
[ OK ] VulkanAPITest.lstm_mclareninputs_success (21 ms)
[ RUN ] VulkanAPITest.lstm_prepack_success
[ OK ] VulkanAPITest.lstm_prepack_success (8 ms)
[ RUN ] VulkanAPITest.querypool_flushed_shader_log
xplat/caffe2/aten/src/ATen/test/vulkan_api_test.cpp:8108: Skipped
QueryPool is not available
[ SKIPPED ] VulkanAPITest.querypool_flushed_shader_log (0 ms)
[----------] 414 tests from VulkanAPITest (5690 ms total)
[----------] Global test environment tear-down
[==========] 414 tests from 1 test suite ran. (5690 ms total)
[ PASSED ] 413 tests.
[ SKIPPED ] 1 test, listed below:
[ SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
YOU HAVE 9 DISABLED TESTS
```
Full Paste: P885827407
Differential Revision: D51452935
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114145
Approved by: https://github.com/SS-JIA
Summary:
Two nodes can point to the same attribute via node.target.
This makes sure that:
- we don't try to delete an already-deleted attribute, i.e. we delete the attr only once
- we do delete all the nodes pointing to the attribute
Test Plan:
```
buck run fbcode//mode/dev-nosan fbcode//executorch/backends/xnnpack/test:test_xnnpack_passes -- executorch.backends.xnnpack.test.passes.test_batch_norm_fusion.TestBatchNormFusion.test_q8_batch_norm_fusion
```
Differential Revision: D51419442
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113957
Approved by: https://github.com/Skylion007
The NCCL_ prefix should only be used for the NCCL library's own environment variables. We currently use a few environment variables in PyTorch with the NCCL_ prefix that the NCCL library does not understand.
This patch renames such environment variables to use the TORCH_NCCL_ prefix instead. We still honor the old NCCL_ variables, but throw a warning when they are used (a small sketch of that fallback follows the list below).
The following env changes have been made:
`NCCL_BLOCKING_WAIT` -> `TORCH_NCCL_BLOCKING_WAIT`
`NCCL_ENABLE_TIMING` -> `TORCH_NCCL_ENABLE_TIMING`
`NCCL_DESYNC_DEBUG` -> `TORCH_NCCL_DESYNC_DEBUG`
`NCCL_ASYNC_ERROR_HANDLING` -> `TORCH_NCCL_ASYNC_ERROR_HANDLING`
`ENABLE_NCCL_HEALTH_CHECK` -> `TORCH_ENABLE_NCCL_HEALTH_CHECK`
`NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK` -> `TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK`
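A minimal Python sketch of the rename-with-fallback behavior (the real implementation lives in C++; the helper name here is hypothetical):
```python
import os
import warnings

def nccl_env(new_name: str, old_name: str, default: str = "") -> str:
    if new_name in os.environ:
        return os.environ[new_name]
    if old_name in os.environ:
        # Old NCCL_-prefixed variables still work, but warn on use.
        warnings.warn(f"{old_name} is deprecated; use {new_name} instead.")
        return os.environ[old_name]
    return default

blocking_wait = nccl_env("TORCH_NCCL_BLOCKING_WAIT", "NCCL_BLOCKING_WAIT")
```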
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114077
Approved by: https://github.com/fduwjj
Document torch._C._jit_set_fusion_strategy, which controls how many static-shape compilation attempts are made before falling back to dynamic shapes, and then to uncompiled graph execution.
It would be good to keep all the graph executor settings documented in one place.
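A hedged usage sketch; to the best of my understanding the setting takes a list of (kind, depth) pairs:
```python
import torch

# Allow up to 2 static-shape specializations, then up to 2 dynamic-shape
# compilations, before the executor falls back to running the graph uncompiled.
torch._C._jit_set_fusion_strategy([("STATIC", 2), ("DYNAMIC", 2)])
```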
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113964
Approved by: https://github.com/eellison
Summary: our docs were saying that dynamic embedding bag wasn't supported, but
it actually is (at least at the same level as embeddings were); it just wasn't previously tested/listed.
Test Plan: python test/test_quantization.py -k "test_embedding"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107623
Approved by: https://github.com/jerryzh168
Previously it lacked a type hint and so was treated as an Any type. This
resulted in a lot of untyped code downstream as V.graph is referenced in
many places in inductor code. I've typed it properly now as
GraphLowering, and fixed the numerous type errors this surfaced.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114025
Approved by: https://github.com/eellison
ghstack dependencies: #114013
This should be enough to get @voznesenskym's FSDP branch to plumb `set_()` through AOTAutograd properly and have everything properly no-op out. Main changes are:
(1) graph break on `aten::set_.source_Tensor_storage_offset` (we could support it but it isn't needed, seems safer to graph break)
(2) Functionalization: add a "proper" functionalization kernel for `aten::set_.source_Tensor`. The previous one we had was codegen'd and it was wrong (it would just clone() and call set_(), which does not do the right thing). I also manually mark on the `FunctionalTensorWrapper` when a given tensor has been mutated by a `set_()` call.
(3) AOTAutograd: I added a new field, `InputAliasInfo.mutates_storage_metadata`, so we can distinguish between "regular" metadata mutations, and metadata mutations due to `set_()` calls. This is mainly because at runtime, one requires calling `as_strided_()` to fix up metadata, while the other requires calling `set_()`.
(4) Made AOTAutograd's detection for metadata mutations / set_() mutations smarter and detect no-ops (if the storage and metadata are all the same).
I also killed `was_updated()` and `was_metadata_updated()`, and replaced them with (existing) `has_data_mutation()` and (new) `has_data_mutation()`, which can more accurately distinguish between data-mutation vs. `set_()` calls vs. metadata-mutation
**This PR is still silently incorrect in one case though**, which I'd like to discuss more. In particular, this example:
```
def f(x):
x_view = x.view(-1)
x.set_(torch.ones(2))
x_view.mul_(2)
return
```
If you have an input that experiences both a data-mutation **and** a `x_old.set_(x_new)` call, there are two cases:
(a) the data mutation happened on the storage of `x_new`. This case should be handled automatically: if x_new is a graph intermediate then we will functionalize the mutation. If x_new is a different graph input, then we will perform the usual `copy_()` on that other graph input
(b) the data mutation happened on the storage of `x_old`. This is more of a pain to handle, and doesn't currently work. At runtime, the right thing to do is probably something like:
```
def functionalized_f(x):
x_view = x.view(-1)
# set_() desugars into a no-op; later usages of x will use x_output
x_output = torch.ones(2)
# functionalize the mutation on x_view
x_view_updated = x.mul(2)
x_updated = x_view_updated.view(x.shape)
# x experienced TWO TYPES of mutations; a data mutation and a metadata mutation
# We need to return both updated tensors in our graph
return x_updated, x_output
def runtime_wrapper(x):
x_data_mutation_result, x_set_mutation_result = compiled_graph(x)
# First, perform the data mutation on x's old storage
x.copy_(x_data_mutation_result)
# Then, swap out the storage of x with the new storage
x.set_(x_set_mutation_result)
```
There are two things that make this difficult to do though:
(1) Functionalization: the functionalization rule for `set_()` will fully throw away the old `FunctionalStorageImpl` on the graph input. So if there are any mutations to that `FunctionalStorageImpl` later on in the graph, the current graph input won't know about it. Maybe we can have a given `FunctionalTensorWrapper` remember all previous storages that it had, and track mutations on all of them - although this feels pretty complicated.
(2) AOTAutograd now needs to know that we might have *two* graph outputs that correspond to a single "mutated input", which is annoying.
It's worth pointing out that this issue is probably extremely unlikely for anyone to run into - can we just detect it and error? This feels slightly easier than solving it, although not significantly easier. We would still need `FunctionalTensorWrapper` to keep track of mutations on any of its "previous" storages, so it can report this info back to AOTAutograd so we can raise an error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111554
Approved by: https://github.com/ezyang
ghstack dependencies: #113926
I have been seeing the warning below even though I did not set `TORCH_NCCL_AVOID_RECORD_STREAMS` to 1.
```
Warning: 0TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
```
Turns out that `TORCH_WARN_ONCE` is unconditional, so the original code below would print out both the value of `avoidRecordStreams_` and the error message:
```
TORCH_WARN_ONCE(
avoidRecordStreams_,
"TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point "
"collectives.");
```
That's also where the "0" in the message came from.
Cc: @eqy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114168
Approved by: https://github.com/eqy, https://github.com/fduwjj, https://github.com/H-Huang
This is a reland of #112757. We cannot land the original one internally because the internal diff is not in sync with OSS, due to issues dealing with two export repos (executorch and pytorch) using the ghimport-ghexport approach.
We will try the web UI import/export instead of the ghimport-ghexport flow.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113813
Approved by: https://github.com/angelayi
The primary problem we are setting out to solve here is fake tensor freshness. Before this PR, fake tensors after dynamo represented fake tensors *at the end* of trace, so subsequent retraces like aot_autograd would start off with fake tensors in the wrong (end result) state, rather than their expected fresh state. The solution here is to start a fresh fake mode, and re-fakify the tensors. The nuance comes from ensuring that symbols are uniformly created for the symbolic sizes and strides of the tensor.
This PR is the result of *a lot* of back and forth with @ezyang and @eellison. Initially, the first pass at this was not super different from what we have in the PR - the broad strokes were the same:
1) We cache source->symbol in shape_env
2) We pass policy objects around, stored at dynamo fakification time, and reused for later fakification
3) We create a new fake mode for backends
(from https://github.com/pytorch/pytorch/pull/113605/files)
This is ugly, and has some layering violations. We detoured our decision making through a few other alternatives. Immutable/mutable fake tensor mode was the most interesting alternative, https://github.com/pytorch/pytorch/pull/113653, and was struck down on concerns of complexity in fake mode combined with it not covering all edge cases. We also detoured on what to do about tensor memoization returning potentially different tensors than requested, and whether that is an anti-pattern (it is) that we want to hack around with the symbol cache (we don't).
We went back to the drawing board here, but with a few concessions:
1) the cache for source->symbol must live outside of shape_env, for both lifecycle, and layering reasons
2) A good amount of work needs to be done to pipe policy around fake_mode and meta_utils correctly, to cover all the cases (@ezyang did this)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113926
Approved by: https://github.com/ezyang, https://github.com/eellison
Summary: Scatter fallback calls `at::scatter` in the C++ wrapper codegen. This doesn't work in the ABI compatibility mode, as the latter requires a shim function. One is added in this PR.
Test Plan:
```
$ python test/inductor/test_aot_inductor.py -k test_scatter_fallback
s...
----------------------------------------------------------------------
Ran 4 tests in 52.713s
OK (skipped=1)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114027
Approved by: https://github.com/chenyang78, https://github.com/desertfire
ghstack dependencies: #114024
Summary: Previously, when we had two dynamic shape symbols `s0` and `s1` bound by the relationship `s1 == s0 + 1`, even when the range constraints were set in accordance with the relationship (e.g., to `[2, 1024]` for `s0` and to `[3, 1025]` for `s1`), `torch._dynamo.export` raised an error saying that the constraint is violated. Here we add a range check between the expression and the constraint and, if the ranges match, don't declare the constraint violated.
We also add a flag to disable the dim constraint solver in `torch._dynamo.export` (not set by default for BC), passed down from the `torch._export.aot_compile`. This is because, even for simple constraints like `s1 == s0 + 1`, the solver claims that the constraint is too complex and the dimension `s0` must be specialized. The new flag is not exposed as a part of the public API (i.e., the one without `_`s in the module names).
Both changes are required to unblock PT2 compilation of an internal model with AOT Inductor.
Test Plan:
```
$ python test/inductor/test_aot_inductor.py -k test_shifted_constraint_ranges
s...
----------------------------------------------------------------------
Ran 4 tests in 53.247s
OK (skipped=1)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114024
Approved by: https://github.com/zhxchen17
During the course of fake tensor propagation (and, potentially, also Dynamo execution, although I do not believe it is possible to exercise this right now), we may generate deferred runtime asserts, which represent "guards" on unbacked symbols which cannot be immediately checked on entry to a code block; instead, they have to be checked at runtime. However, we currently accumulate these deferred runtime asserts into the ShapeEnv, and don't do anything with them.
This PR modifies Dynamo to automatically insert these runtime asserts into the FX graph, before passing it on to the backend compiler. The assert format coincides with the export assert format as practiced in `torch/_export/passes/add_runtime_assertions_for_constraints_pass.py`, but actually these passes are completely disjoint right now as I only handle deferred runtime asserts, while export only handles ranges (which I should probably also handle, but don't in this PR.)
The assertions must be inserted by Dynamo, because you could potentially then pass the asserts onto another backend like "eager" which no longer looks at the ShapeEnv before. Thanks to previous work in export, these asserts are preserved in AOTAutograd, but they are dropped by Inductor, which needs to be fixed in future work. This piece will be a bit awkward, as Inductor would have preferred to work with the Sympy expressions directly, ah well.
Here is what the Dynamo traced FX graph looks like for the test in question:
```
<eval_with_key>.0 class GraphModule(torch.nn.Module):
def forward(self, L_x_ : torch.Tensor):
l_x_ = L_x_
# File: /data/users/ezyang/c/pytorch/wu.py:8, code: y = x.item()
item = l_x_.item()
# No stacktrace found for following nodes
ge_1 = item >= 0
scalar_tensor_default = torch.ops.aten.scalar_tensor.default(ge_1); ge_1 = None
_assert_async_msg = torch.ops.aten._assert_async.msg(scalar_tensor_default, "Deferred runtime assert failed: i0 >= 0, where i0 was defined by 'item' (for more information, run with TORCH_LOGS=+dynamo,dynamic)"); scalar_tensor_default = None
# File: /data/users/ezyang/c/pytorch/wu.py:9, code: torch._check_is_size
_check_is_size = torch._check_is_size(item)
# File: /data/users/ezyang/c/pytorch/wu.py:10, code: if y >= 0:
ge = item >= 0; item = None
# File: /data/users/ezyang/c/pytorch/wu.py:11, code: return x * 2
mul = l_x_ * 2; l_x_ = None
return (mul,)
```
Note that we actually keep the `_check_is_size` in the graph redundantly. However, assert_async is retained in the graph, whereas _check_is_size ends up getting DCE'ed.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113958
Approved by: https://github.com/aakhundov, https://github.com/tugsbayasgalan
ghstack dependencies: #113978
To codegen deferred runtime asserts, I need to be able to convert sympy expressions back into regular Python expressions that I can put in FX graphs. This PR adds some of the machinery to do this: it adds a new sympy analysis that runs operations on all FX traceable operations that can also be run with plain Python int/float/bool/etc. It's tested by symbolic tracing through the analysis, and then testing that this traced graph gives the same result as running the Python analysis directly.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113978
Approved by: https://github.com/aakhundov, https://github.com/lezcano
This PR creates a prototype of training convolutional neural networks based on DTensor.
- Register required ops and implement operator dispatch
- Add unit tests and example
Basically, we shard the activations and replicate the model weights in this prototype. We can scale out to multiple GPUs and reduce the per-GPU memory footprint with this approach, and achieve weak scaling in terms of training performance (i.e., time per iteration).
Reference log (on 2xA100 GPU):
Unit Test
```bash
root@luna-prod-78-80gb:/pytorch# python3 test/distributed/_tensor/test_convolution_ops.py
/opt/conda/lib/python3.10/site-packages/torch/nn/modules/conv.py:456: UserWarning: 0TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (Triggered internally at /opt/conda/conda-bld/pytorch_1699257304556/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2170.)
return F.conv2d(input, weight, bias, self.stride,
/opt/conda/lib/python3.10/site-packages/torch/nn/modules/conv.py:456: UserWarning: 0TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (Triggered internally at /opt/conda/conda-bld/pytorch_1699257304556/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2170.)
return F.conv2d(input, weight, bias, self.stride,
..
----------------------------------------------------------------------
Ran 2 tests in 30.354s
OK
root@luna-prod-78-80gb:/pytorch# python3 test/distributed/_tensor/test_other_ops.py
[rank0]:[W ProcessGroupNCCL.cpp:2170] Warning: 0TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[rank0]:[W ProcessGroupNCCL.cpp:2170] Warning: 0TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[rank1]:[W ProcessGroupNCCL.cpp:2170] Warning: 0TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[rank1]:[W ProcessGroupNCCL.cpp:2170] Warning: 0TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
...
----------------------------------------------------------------------
Ran 3 tests in 16.343s
OK
```
ConvNeXt Example
```bash
root@luna-prod-78-80gb:/pytorch# python3 torch/distributed/_tensor/examples/convnext_example.py
rank 3, 20 iterations, latency 584.80 ms, forward 102.84 ms, backward 297.80 ms, max reserved 16.34 GiB, max allocated 14.75 GiB
rank 1, 20 iterations, latency 584.64 ms, forward 104.85 ms, backward 297.60 ms, max reserved 16.40 GiB, max allocated 14.74 GiB
rank 0, 20 iterations, latency 584.48 ms, forward 104.64 ms, backward 297.90 ms, max reserved 16.39 GiB, max allocated 14.75 GiB
rank 2, 20 iterations, latency 584.96 ms, forward 93.21 ms, backward 297.95 ms, max reserved 16.40 GiB, max allocated 14.74 GiB
```
@wanchaol @fduwjj FYI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113123
Approved by: https://github.com/wanchaol
I'm looking for a make_fx(tracing_mode=real) that doesn't error out on
data-dependent operations. This PR adds a flag to do that. We use this
to help implement offline generation, but this is useful by itself:
sometimes we want to trace a function with real tensors and don't care
if we bake values in (because we just want to see what happened).
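A hedged usage sketch; the keyword name `_error_on_data_dependent_ops` is an assumption on my part, since the flag is not named above:
```python
import torch
from torch.fx.experimental.proxy_tensor import make_fx

def f(x):
    # .item() is data-dependent and would normally raise during tracing.
    if x.sum().item() > 0:
        return x * 2
    return x - 1

# The concrete value of x.sum().item() gets baked into the traced graph.
gm = make_fx(f, tracing_mode="real", _error_on_data_dependent_ops=False)(torch.ones(3))
print(gm.graph)
```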
Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114129
Approved by: https://github.com/ezyang
ghstack dependencies: #114128
There can be a potential race condition while loading the `vmap` decomposition library in multi-threaded programs.
This PR adds a thread lock to avoid registering the kernels multiple times.
```python
import threading
from torch._functorch.vmap import lazy_load_decompositions
threads = []
for i in range(10000):
thread = threading.Thread(target=lazy_load_decompositions)
threads.append(thread)
for thread in threads:
thread.start()
for thread in threads:
thread.join()
```
```text
RuntimeError: This is not allowed since there's already a kernel registered from python overriding mse_loss_backward's behavior for FuncTorchBatched dispatch key and aten namespace.
VMAP_DECOMPOSITIONS_LIB.impl(decomp, decomposition_table[decomp])
RuntimeError: This is not allowed since there's already a kernel registered from python overriding mse_loss_backward's behavior for FuncTorchBatched dispatch key and aten namespace.
RuntimeError: This is not allowed since there's already a kernel registered from python overriding mse_loss_backward's behavior for FuncTorchBatched dispatch key and aten namespace.
RuntimeError: This is not allowed since there's already a kernel registered from python overriding mse_loss_backward's behavior for FuncTorchBatched dispatch key and aten namespace.
RuntimeError: This is not allowed since there's already a kernel registered from python overriding mse_loss_backward's behavior for FuncTorchBatched dispatch key and aten namespace.
```
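A minimal sketch of the double-checked locking pattern that avoids this (illustrative names, not the exact implementation):
```python
import threading

_decomp_lock = threading.Lock()
_decomp_loaded = False

def lazy_load_decompositions_sketch():
    global _decomp_loaded
    if _decomp_loaded:        # fast path once initialized
        return
    with _decomp_lock:
        if _decomp_loaded:    # re-check inside the lock
            return
        # ... register the vmap decomposition kernels exactly once ...
        _decomp_loaded = True
```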
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113520
Approved by: https://github.com/zou3519
Summary: I had a job fail due to rank mismatch but didn't find enough information in the assertion message. This change makes the message more informative.
Test Plan:
CI tests and I ran a test job which failed as expected:
```
Rank 1 has different values for step: 8016.0. Other ranks: 7870.0
```
Differential Revision: D51322046
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113765
Approved by: https://github.com/wz337, https://github.com/fegin
Summary: For the second-pass, we don't have to rerun the whole inductor flow again. This PR moves that second-pass to the codegen time. This change not only speeds up the compilation, but also removes kernel scheduling inconsistency between the two passes. Another future improvement is to make the second-pass reuse the scheduler and do the wrapper codegen only.
This is a copy of https://github.com/pytorch/pytorch/pull/113762 to land in github first.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114067
Approved by: https://github.com/chenyang78
This PR:
- updates `torch/sparse/_triton_ops_meta.py` for the API change in `triton.testing.do_bench`
- force `num_stages` to be 1 when blocksize is 128x128 to avoid an out-of-resources exception when `bsr_dense_mm` is called from `nn.linear`.
- as in the title. The performance of `nn.linear` on BSR tensor weights (dtypes `float16` and `bfloat16`) is increased as follows (`NVIDIA A100-SXM4-80GB`):
- for blocksize 16x16, the average/maximum speed up is about 11/20 %
- for blocksize 32x32, the average/maximum speed up is about 15/24 %
- for blocksize 64x64, the average/maximum speed up is about 18/26 %
- for blocksize 128x128, the average/maximum speed up is about 15/28 %
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114026
Approved by: https://github.com/cpuhrsch
Summary:
After the refactoring of matrix multiplication, the `bmm` logic can easily be adapted using the existing `mm` code, since the only difference is on the "batch" dimension, which is not packed now. So we can simply add computation along the "z" dimension.
Further, I realized that `bias` and `beta` are simply a post-processing step after the matrix multiplication, so I have factored them out. The nice part about this factoring is that we can directly leverage the broadcasting logic, hence we don't need a separate shader just to add the bias.
So this diff massively simplifies the `mm` code.
1. Reduce 4 shaders `mm`, `addmm`, `bmm`, `baddbmm` into just `mm`.
2. Remove packing for bias.
3. Add support for `at::bmm(m1.vulkan(), m2.vulkan())` <= This is a blocking feature for the emformer models.
Test Plan:
```
% buck2 run -c pt.has_backtraces=1 --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64
...
[ OK ] VulkanAPITest.linear_4d_flat (0 ms)
[ RUN ] VulkanAPITest.linear_4d_small
[ OK ] VulkanAPITest.linear_4d_small (0 ms)
[ RUN ] VulkanAPITest.linear_4d_large
[ OK ] VulkanAPITest.linear_4d_large (1 ms)
[ RUN ] VulkanAPITest.lstm_success
[ OK ] VulkanAPITest.lstm_success (5 ms)
[ RUN ] VulkanAPITest.lstm_mclareninputs_success
[ OK ] VulkanAPITest.lstm_mclareninputs_success (22 ms)
[ RUN ] VulkanAPITest.lstm_prepack_success
[ OK ] VulkanAPITest.lstm_prepack_success (3 ms)
[ RUN ] VulkanAPITest.querypool_flushed_shader_log
xplat/caffe2/aten/src/ATen/test/vulkan_api_test.cpp:8056: Skipped
QueryPool is not available
[ SKIPPED ] VulkanAPITest.querypool_flushed_shader_log (0 ms)
[----------] 411 tests from VulkanAPITest (5754 ms total)
[----------] Global test environment tear-down
[==========] 411 tests from 1 test suite ran. (5754 ms total)
[ PASSED ] 410 tests.
[ SKIPPED ] 1 test, listed below:
[ SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
```
Full Paste: P884697749
Differential Revision: D51421256
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113943
Approved by: https://github.com/SS-JIA
I've run into a couple of cases now where max_split_size_mb has been set
in projects as a workaround for fragmentation but it ends up causing problems
later, such as degraded performance from freeing empty segments. While it
is a useful setting to have, expandable_segments is probably a better first
resort for fixing fragmentation since when it works it is less likely to
need synchronous GPU operations to continue running.
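For reference, a hedged sketch of opting into expandable segments via the allocator config; the variable must be set before the CUDA allocator is first used:
```python
import os

# Prefer expandable_segments over max_split_size_mb when fighting
# fragmentation; set this before the first CUDA allocation happens.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # importing after setting the variable is the safest ordering
```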
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113481
Approved by: https://github.com/msaroufim, https://github.com/albanD
ghstack dependencies: #113231
When torchelastic notices that one rank has failed, it will send a SIGTERM
signal to other trainer ranks to tear them down before restarting. However,
if the trainer itself launches subprocesses, or is launched by a non-python
wrapper script, then the SIGTERM will be delivered only to the direct child of
torchelastic and not all descendants. This PR opens subprocesses in a new
Linux session, which starts a new process group with the pgid equal to the
trainer's pid. Then when we send signals, we deliver them to the process
group rather than just the direct child.
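A minimal sketch of the session-based signaling idea (not torchelastic's actual code; `trainer.py` is a placeholder):
```python
import os
import signal
import subprocess

# Launch the trainer in its own session: its pgid equals its pid, and any
# subprocesses it spawns inherit that process group.
proc = subprocess.Popen(["python", "trainer.py"], start_new_session=True)

# On failure of another rank, signal the whole process group so all
# descendants receive the SIGTERM, not just the direct child.
os.killpg(proc.pid, signal.SIGTERM)
```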
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113231
Approved by: https://github.com/H-Huang
Summary:
The getCvar* functions allow us to provide multiple environment variables for the same value. This allows us to deprecate some variables in favor of others, while still allowing users to temporarily use the old variables for some time.
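The idea, sketched in Python for illustration (the real getCvar* helpers are C++, and the variable names below are hypothetical):
```python
import os
import warnings

def get_cvar(names, default):
    # `names` is ordered from preferred to deprecated; a deprecated
    # name still works but emits a warning.
    for i, name in enumerate(names):
        if name in os.environ:
            if i > 0:
                warnings.warn(f"{name} is deprecated, use {names[0]} instead")
            return os.environ[name]
    return default

timeout = get_cvar(["TORCH_DIST_TIMEOUT", "OLD_DIST_TIMEOUT"], "1800")  # hypothetical names
```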
Test Plan: OSS CI
Reviewed By: fduwjj, XilunWu
Differential Revision: D51225487
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113797
Approved by: https://github.com/fduwjj
Summary:
Clang compilation failed recently because crtbeginS.o and libgcc could not be found, but we should not be using them anyway; for self-containedness, Clang uses compiler-rt instead. I'm also switching to the lld linker, which also comes from the clang release package.
There was another issue where glibc was not found during linking. It looks like the glibc path was passed into the linker via `-B`; it should also be passed via `-L`, which reaches the linker as a library search path.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114017
Approved by: https://github.com/hl475
Prior to this PR, autotuned arguments could only be at the back of the argument list. This is an inductor limitation and not a triton limitation. Fixing this allows more MRS kernels to use user-defined triton kernels.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114002
Approved by: https://github.com/aakhundov
ghstack dependencies: #113967
Summary:
Enable logic for converting a channel-packed tensor into a height-packed one.
It is not yet connected with the rest of the system.
Test Plan:
```
(base) yipjustin@yipjustin-mac fbsource % buck2 run -c pt.has_backtraces=1 --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 -- --gtest_filter="*packing*"
File changed: fbsource//xplat/caffe2/aten/src/ATen/test/vulkan_quantized_api_test.cpp
Buck UI: https://www.internalfb.com/buck2/9a0d6bd6-e4a2-4d58-8f38-f806a0703122
Network: Up: 0B Down: 0B
Jobs completed: 4. Time elapsed: 0.1s.
BUILD SUCCEEDED
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *packing*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from VulkanAPITest
[ RUN ] VulkanAPITest.channel_to_height_packing_test
[ OK ] VulkanAPITest.channel_to_height_packing_test (35 ms)
[----------] 1 test from VulkanAPITest (35 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (36 ms total)
[ PASSED ] 1 test.
```
Reviewed By: SS-JIA
Differential Revision: D51379737
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113883
Approved by: https://github.com/SS-JIA
This PR also contains a basket of fixes that were turned up by now testing more arguments with SymInt. I fixed as many of the easy ones as I could easily get earlier in this stack and a bunch here, but there are some more annoying ones I xfailed.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113452
Approved by: https://github.com/Chillee
ghstack dependencies: #113877, #113911
As half of those tests fail if run individually, but the first failure masks all subsequent ones, i.e.
```
PYTORCH_TEST_WITH_INDUCTOR=1 python3 test/test_torch.py -v -k test_lazy_clone_cuda_float32
test_lazy_clone_cuda_float32 (__main__.TestTorchDeviceTypeCUDA) ... FAIL
...
self.assertTrue(torch._C._is_cow_tensor(t))
AssertionError: False is not true
----------------------------------------------------------------------
Ran 1 test in 19.419s
FAILED (failures=1)
```
But
```
$ PYTORCH_TEST_WITH_INDUCTOR=1 python3 test/test_torch.py -k test_lazy_clone_
...
......................
----------------------------------------------------------------------
Ran 24 tests in 24.969s
OK
```
This flaky behavior was already detected, for example see https://github.com/pytorch/pytorch/issues/113953
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114012
Approved by: https://github.com/huydhn, https://github.com/kit1980
Fixes #97361
When a fused kernel has more than 1024 parameters, ctypes throws an error.
Limiting the number of args is a mechanism to protect stack memory: as we know, C++ passes args via the stack, and stack memory has a size limit.
Code change:
1. The cpp backend checks the fused nodes' arg count; when it reaches the limit, it flushes its status to ready.
2. The scheduler checks the `ready_to_flush` API and helps the backend flush codegen.
3. Add a `ready_to_flush` API to `BaseScheduling`; the Triton backend returns False since it does not support it yet.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113131
Approved by: https://github.com/jgong5, https://github.com/mlazos
`_TestONNXRuntime` has infra to test models which are either Callable or a `torch.nn.Module`.
After #111497, we want to re-run all those tests for model of type `torch.export.ExportedProgram`.
This PR adds to `self.run_test_with_fx_to_onnx_exporter_and_onnx_runtime` the capability to detect the model type to be tested and to export the incoming `torch.nn.Module` model to `torch.export.ExportedProgram` before running ONNX export tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112289
Approved by: https://github.com/titaiwangms
Summary:
As discussed in the meeting, we are inverting the policy on the use of proxy executor for aten fallbacks.
By default, aten fallback ops will use proxy executor, unless a c-shim is available.
Test Plan: CIs
Differential Revision: D51417683
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113918
Approved by: https://github.com/chenyang78
Move the pytest cache downloading into the build step and store it in additional ci files so that it stays consistent during sharding.
Only build env is taken into account now instead of also test config, since we might not have the test config at build time. This makes it less specific, but I also think this might be better since tests are likely to fail across the same test config (I also think it might be worth not even looking at build env, but that's a different topic).
Each cache upload should only include information from the current run. Do not merge the current cache with the downloaded cache during upload (shouldn't matter anyway, since the downloaded cache won't exist at the time).
From what I can tell of the S3 retention policy, pytest cache files will be deleted after 30 days (cc @ZainRizvi to confirm), so we never have to worry about space or pulling old versions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113804
Approved by: https://github.com/ZainRizvi
Abort merges invoked with `-r` if there is nothing to rebase
Make `rebase_onto`/`rebase_ghstack_onto` return False if rebase is no-op and abort merge in that case
Remove `-e` option from both trymerge and tryrebase workflows as one should never report failures on workflow dispatch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113960
Approved by: https://github.com/clee2000
Fixes #113641
As written, there is an off-by-one error whenever `end - start` doesn't evenly
divide into `step`. e.g. if `end - start = 1` and `step = 2` we should get a
single element but `1 // 2 == 0` so this wouldn't take anything from the slice.
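A small illustration of the miscount (the fix itself lives in the slicing logic):
```python
start, end, step = 0, 1, 2
wrong = (end - start) // step     # 0: floor division drops the element
right = -((start - end) // step)  # 1: ceiling division counts it
assert right == len(range(start, end, step))
```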
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113838
Approved by: https://github.com/Chillee
Summary: Previously the PT2 QAT code only supported conv2d-bn.
This commit extends all existing QAT fusion support to conv1d-bn,
including support for all variants like relu, no bias, literal
args, cuda etc. This commit also refactors the code such that
we can support conv3d-bn easily in the future.
Test Plan:
python test/test_quantization.py TestQuantizePT2EQAT_ConvBn1d
Reviewers: jerryzh168, kimishpatel
Subscribers: jerryzh168, kimishpatel, supriyar
Differential Revision: [D51428979](https://our.internmc.facebook.com/intern/diff/D51428979)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113714
Approved by: https://github.com/jerryzh168
The test checks for the `mkldnn_fusion.linear` pass which checks `_is_packable_linear` that depends on `torch._C.has_mkl`. So skip the test as it would fail due to no pattern matches counted.
See https://github.com/pytorch/pytorch/blob/main/torch/_inductor/fx_passes/mkldnn_fusion.py#L827
CC @XiaobingSuper as the author of the test.
Not sure how many other tests are affected by similar issues, but this is the one in the pattern matcher I see failing.
Strangely the first part of the test succeeds where `bias = True` as it finds a match for `unfuse_bias_add_to_pointwise` (torch/_inductor/fx_passes/post_grad.py)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113949
Approved by: https://github.com/jansel
Support for bfloat16 scalars was missing. When I use the gloo backend
`torch.distributed.init_process_group(backend='gloo')`
and run
`torch.nn.parallel.DistributedDataParallel(model)`
and _model_ has BFloat16 parameters, I receive the following error:
`RuntimeError: Invalid scalar type`
This change fixes the issue.
c10::BFloat16 defines conversions from/to float, so calculations for bfloat16 are performed in float.
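A hedged repro sketch (single-process gloo group with minimal env setup):
```python
import os
import torch
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

model = torch.nn.Linear(8, 8).to(torch.bfloat16)
# Before the fix this raised "RuntimeError: Invalid scalar type".
ddp = torch.nn.parallel.DistributedDataParallel(model)
```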
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113557
Approved by: https://github.com/XilunWu, https://github.com/jgong5
The existing try-catch doesn't work because it doesn't call err.persist(). This is in contrast to the try-catch for evaluate_function which does work because it calls into python_engine's thread_on_exception which calls persist.
Calling persist on a python_error stashes the PyErr state from the thread-local PyThreadState onto the python_error object, so that when this error object is stored onto the future and passed back to the calling cpu thread, python_engine's execute try-catch can then err.restore() the error state. Finally, the python_engine's execute would re-raise so that this is re-caught by the HANDLE_TH_ERRORS macro.
Fixes https://github.com/pytorch/pytorch/issues/75750
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113702
Approved by: https://github.com/albanD
In cases like #113444, users usually stop at UnsupportedNodeAnalysis with unsupported-nodes information. Although in SARIF they can clearly see it's due to lack of COMPLEX support, the on-screen error message only shows the original FX node name, such as `aten.mul.Tensor`. ~~This PR catches the information from diagnostic messages and reveal it to users.~~
The root cause is that UnsupportedNodeAnalysis leverages `onnxfunction_dispatcher.get_function_overloads()` to decide whether an ATen op is supported. However, in `onnxfunction_dispatcher.get_function_overloads()`, a lack of complex function support is considered unsupported. This PR defines unsupported FX nodes as those not in the registry.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113785
Approved by: https://github.com/thiagocrepaldi
The bot that creates the issue got changed, but the search did not, so it wasn't finding old PRs and was just making new ones.
This PR makes it reuse PRs again instead of making a new one every time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113961
Approved by: https://github.com/huydhn
WIP
* [X] Update this pinned commit periodically, similar to https://github.com/pytorch/pytorch/pull/113499
* [ ] Increase ET coverage on PT CI, ideally, we should run all ET pull jobs?
* [ ] Switch ExecuTorch's torch, vision, and audio nightly pins to commit pins
* [ ] Update ExecuTorch's torch, vision, and audio commit pins periodically
### Testing
`python .github/scripts/update_commit_hashes.py --repo-name executorch --branch main --pin-folder .ci/docker/ci_commit_pins`
The testing PR is https://github.com/pytorch/pytorch/pull/113834
(I will move the pinned commit out of the Docker image if the docker build process is flaky; otherwise, refreshing the Docker image daily seems like a good way to catch early issues with the images.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113832
Approved by: https://github.com/clee2000
Skipping importing some packages for now to make this change more
tractable.
For some reason, lintrunner on CI raises errors in all imported `.pyi` files,
even though it doesn't on my local machine. The errors are all from missing
generic types, as the MYPYINDUCTOR config has `disallow_any_generics`
set. I have thus added `disable-error-code` comments to the relevant files,
though I fixed a few that were easy enough.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113830
Approved by: https://github.com/Skylion007
ghstack dependencies: #113722, #113721
Summary:
The mobile_ivalue_size field in the mobile_bytecode flatbuffer schema can be larger than the ivalues vector. This introduces potential for memory corruption when parsing the mobile_bytecode Module.
This diff fixes the issue by ensuring that mobile_ivalue_size is less than the size of the ivalues vector.
Test Plan: contbuild & OSS CI
Differential Revision: D49687548
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110162
Approved by: https://github.com/malfet
Using mypy in code that depends on pytorch, I noticed that the type annotation doesn't allow a device ordinal.
`error: Argument "device" to "to_empty" of "Module" has incompatible type "int"; expected "str | device" [arg-type]`
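A minimal example of the call that mypy used to reject:
```python
import torch

m = torch.nn.Linear(2, 2)
# Passing a device ordinal previously failed type checking, even though
# it works at runtime on a CUDA build (0 means cuda:0 here).
m = m.to_empty(device=0)
```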
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113647
Approved by: https://github.com/albanD
Fixes https://github.com/pytorch/pytorch/issues/113393
Another chapter in the story of Python's horrible handling of int <-> bool interactions.
```python
print(True and 1) # 1
print(1 and True) # True
print(True or 1) # True
print(1 or True) # 1
```
For sanity's sake, since we have defined more sane type promotion rules, let's use those and ensure `out_hint` conforms to `SymNode`'s `pytype`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113848
Approved by: https://github.com/ezyang
Fixes https://github.com/pytorch/pytorch/issues/113041
In the case where we have an object represented as an UnspecializedNNModuleVariable, the source of an attribute on that object is `AttrSource(base=NotNNModuleSource(base=NNModuleSource(base=AttrSource(base=LocalSource(local_name='self', cell_or_freevar=False), member='seq'))), member='b')`. This causes dynamo to add an extra attribute as it doesn't go to this [`register_attr` step](eddce3c054/torch/_dynamo/variables/builder.py (L955-L962)).
However if we have an object represented as a UserDefinedObjectVariable, the source of an attribute on that object is `AttrSource(base=NNModuleSource(base=AttrSource(base=LocalSource(local_name='self', cell_or_freevar=False), member='seq')), member='b')`.
It seems that UnspecializedNNModuleVariables should behave in the same way as UserDefinedObjectVariables, but the sources in these two cases are different. So, I removed the part that changes the source for UnspecializedNNModuleVariables, and it seems to work! And CI is green (+ reduced graph breaks).
```
def test_unspecialized_nnmodule(self):
    class TestModule(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.a = torch.tensor(1.0)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return x + self.a

    def forward_hook(
        module: torch.nn.Module, inputs, output
    ) -> torch.Tensor:
        return 2 * output

    seq = torch.nn.Sequential(TestModule()).eval()
    seq.b = torch.tensor(2)
    handle = seq.register_forward_hook(forward_hook)

    class M(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.seq = seq

        def forward(self, x):
            # self.seq.b has source: AttrSource(base=NotNNModuleSource(base=NNModuleSource(base=AttrSource(base=LocalSource(local_name='self', cell_or_freevar=False), member='seq'))), member='b')
            return self.seq(x) + self.seq.b

    inp = (torch.randn(2, 8),)
    ep = export(M(), inp)
```
```
def test_user_defined_var(self):
    class TestModule(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.a = torch.tensor(1.0)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return x + self.a

    class UserDefined:
        def __init__(self):
            self.test_module = TestModule()
            self.b = torch.tensor(2)

        def __call__(self, x):
            return self.test_module(x)

    class M(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.seq = UserDefined()

        def forward(self, x):
            # self.seq.b has source: AttrSource(base=NNModuleSource(base=AttrSource(base=LocalSource(local_name='self', cell_or_freevar=False), member='seq')), member='b')
            return self.seq(x) + self.seq.b

    inp = (torch.randn(2, 8),)
    ep = export(M(), inp)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113852
Approved by: https://github.com/yanboliang
Copied from @ezyang 's #113693.
The motivation for this change is that we'd like to guard on storage offset in inductor, to make assumptions about data alignment.
create_symbolic_sizes_strides_storage_offset() creates the sizes/strides/offset for fake tensors - they can either be integers or symints. This PR changes storage_offset to always be dynamic. In variables/builder.py, we remove a conditional so that all tensors get added to tracked_fakes. This is because the storage offset will be dynamic even if the other logic in builder.py suggests that it will be static; otherwise, we run into this issue:
1e260c851b/torch/fx/experimental/symbolic_shapes.py (L892-L895)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113734
Approved by: https://github.com/ezyang
Fixes #103973
**Background:**
After https://github.com/pytorch/pytorch/pull/113098, users verified that torch.sum() worked in environments where PCIe atomics were a problem for such operations.
**Goal:**
This extends the changes to other kernels where assert is called. The goal is the same: we can easily disable kernel assertions for those users when the call sites consistently use CUDA_KERNEL_ASSERT.
**Test:**
We built wheels with these fixes for the users who had the PCIe atomics issue, and they verified they can now run their workflows.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113563
Approved by: https://github.com/jeffdaily, https://github.com/ezyang
This PR adds foreach_zero_ op support, to fix the case where
optim.zero_grad(set_to_none=False) hits this op and errors out with a
"device mesh not found" issue.
Also move the test to use zero_grad as the last step, as that's when we
are going to have dtensors as grads.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113897
Approved by: https://github.com/awgu
#113340 was reverted initially due to a bad default parametrization name. The test looked like
```python
@common_utils.parametrize(
    "type_fn",
    [
        type,
        lambda obj: obj.__class__,
    ],
)
def test_access_class_method_from_user_class(self, type_fn):
```
This is a valid parametrization, but results in these default test names:
```bash
❯ pytest test/dynamo/test_export.py -k test_access_class_method_from_user_class --co -q
test/dynamo/test_export.py::ExportTests::test_access_class_method_from_user_class_type_fn_<class 'type'>
test/dynamo/test_export.py::ExportTests::test_access_class_method_from_user_class_type_fn_<function ExportTests_<lambda> at 0x7f3be5de0c10>
```
Ignoring the whitespace in the test names, which can lead to other issues down the line, the problem in #113340 was that the lambda parameter included a memory address. IIUC, internally, the tests are not collected and run in the same process. Meaning, the address of the lambda and in turn the test name is no longer valid on the runner. This is fixed earlier in the stack by giving the parametrization an explicit name with `subtest`, but this PR is about preventing issues in the default case.
`pytest` solves this by simply using the name of the parameter plus its index as id in the test name:
```python
import pytest

class Foo:
    def __repr__(self):
        return str(id(self))

@pytest.mark.parametrize(
    "bar",
    [
        pytest.param(type),
        pytest.param(lambda obj: obj.__class__),
        pytest.param(Foo()),
    ],
)
def test_foo(bar):
    pass
```
```
❯ pytest main.py --co -q
main.py::test_foo[type]
main.py::test_foo[<lambda>]
main.py::test_foo[bar2]
```
`pytest` has better defaults for `type` and `lambda` than we do, but it has a safe default for custom objects.
This PR aligns our default test name with `pytest`. Using the parametrization from above again, we now collect
```bash
❯ pytest test/dynamo/test_export.py -k test_access_class_method_from_user_class --co -q
test/dynamo/test_export.py::ExportTests::test_access_class_method_from_user_class_type_fn0
test/dynamo/test_export.py::ExportTests::test_access_class_method_from_user_class_type_fn1
```
which might not be as expressive at first glance, but at least prevents bugs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113856
Approved by: https://github.com/malfet, https://github.com/huydhn
ghstack dependencies: #113855
Sometimes indexing tensors are constructed on CPU and then used to index a CUDA tensor. This prevents cudagraphs when it does not need to. This PR adds a pass which moves constructors from cpu->cuda when we can prove the downstream uses can be safely converted.
This PR allows us to cudagraph `clip` from the blueberries model, which improves perf from ~1.5x -> ~4x.
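A hedged illustration of the pattern the pass targets:
```python
import torch

def f(x):
    idx = torch.tensor([0, 2])  # index tensor constructed on CPU in the graph
    return x[idx]               # x is a CUDA tensor

# The pass moves such constructors to CUDA when all downstream uses are
# provably safe, so the graph no longer mixes devices and can be cudagraphed.
compiled = torch.compile(f)
```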
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109665
Approved by: https://github.com/ezyang, https://github.com/jansel
Fixes a bug in TD metrics generation where it wouldn't be able to find the rank and relevance that a heuristic gave a test run if that heuristic had divided that test into multiple test runs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113789
Approved by: https://github.com/clee2000
Summary: Quantize bias in setup step so that we do not incur additional time on quantizing bias in the first iteration.
Test Plan:
Ensure all vulkan quantize tests pass:
buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output
.....
[----------] Global test environment tear-down
[==========] 78 tests from 1 test suite ran. (1519 ms total)
[ PASSED ] 78 tests.
YOU HAVE 8 DISABLED TESTS
buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output
Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
[==========] Running 395 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 395 tests from VulkanAPITest
[----------] 395 tests from VulkanAPITest (6515 ms total)
.....
[----------] Global test environment tear-down
[==========] 395 tests from 1 test suite ran. (6515 ms total)
[ PASSED ] 394 tests.
[ SKIPPED ] 1 test, listed below:
[ SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
YOU HAVE 5 DISABLED TESTS
Reviewed By: yipjustin, copyrightly
Differential Revision: D50997531
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113582
Approved by: https://github.com/yipjustin
When the ONNX model is exported from a torch.export.ExportedProgram, a
torch.export.ExportedGraphSignature is available with the specification
of the model inputs and outputs.
ExportedGraphSignature includes information such as the mapping between
the exported input/buffer/output ONNX name to the original pytorch input/buffer/output name.
It also specifies the kind of the input, such as user_input, parameter,
buffer or constant_tensor. Output kinds can be user_output, loss_output,
buffer_mutation, etc.
Such information can be useful to understand what the ONNX model expects
as inputs and how the output will look like when the ONNX input/output
differs from the original PyTorch input/output schema.
When the ONNX model is exported from a Callable or regular
torch.nn.Module, such information is not available and
ONNXProgram.model_signature will yield None.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113477
Approved by: https://github.com/BowenBao
Followup to https://github.com/pytorch/pytorch/pull/110325 - re-add the `report_all_guard_failures` config as a logging artifact `recompiles_verbose` with the following changes:
- evaluating the check must be wrapped with exception handling because subsequent code parts following the first failure may result in errors if evaluated (e.g. if a guard checks first for size, then tries to index - a guard failure due to insufficient size would result in an index error for the latter check).
- Adding a test for this case
Sample:
```python
import torch

def fn(x):
    return torch.rand(x[-1], len(x))

opt_fn = torch.compile(fn)
opt_fn([4, 5, 6])
opt_fn([7, 8])
opt_fn([9])
```
Output (with `TORCH_LOGS="recompiles_verbose"`):
```bash
[2023-11-15 16:13:26,741] torch._dynamo.guards.__recompiles_verbose: [DEBUG] Recompiling function fn in /data/users/williamwen/pytorch/playground5.py:15
[2023-11-15 16:13:26,741] torch._dynamo.guards.__recompiles_verbose: [DEBUG] triggered by the following guard failure(s):
[2023-11-15 16:13:26,741] torch._dynamo.guards.__recompiles_verbose: [DEBUG] guard 0 failures:
[2023-11-15 16:13:26,741] torch._dynamo.guards.__recompiles_verbose: [DEBUG] - len(L['x']) == 3
[2023-11-15 16:13:26,741] torch._dynamo.guards.__recompiles_verbose: [DEBUG] - L['x'][0] == 4
[2023-11-15 16:13:26,741] torch._dynamo.guards.__recompiles_verbose: [DEBUG] - L['x'][1] == 5
[2023-11-15 16:13:26,970] torch._dynamo.guards.__recompiles_verbose: [DEBUG] Recompiling function fn in /data/users/williamwen/pytorch/playground5.py:15
[2023-11-15 16:13:26,970] torch._dynamo.guards.__recompiles_verbose: [DEBUG] triggered by the following guard failure(s):
[2023-11-15 16:13:26,970] torch._dynamo.guards.__recompiles_verbose: [DEBUG] guard 0 failures:
[2023-11-15 16:13:26,970] torch._dynamo.guards.__recompiles_verbose: [DEBUG] - len(L['x']) == 2
[2023-11-15 16:13:26,970] torch._dynamo.guards.__recompiles_verbose: [DEBUG]
[2023-11-15 16:13:26,970] torch._dynamo.guards.__recompiles_verbose: [DEBUG] guard 1 failures:
[2023-11-15 16:13:26,970] torch._dynamo.guards.__recompiles_verbose: [DEBUG] - len(L['x']) == 3
[2023-11-15 16:13:26,970] torch._dynamo.guards.__recompiles_verbose: [DEBUG] - L['x'][0] == 4
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113585
Approved by: https://github.com/jon-chuang, https://github.com/ezyang
Summary: Due to the implementation of `native_layer_norm`, we can simplify the implementation of `layer_norm` by just invoking `native_layer_norm`.
Test Plan:
```
[luwei@devbig984.prn1 /data/users/luwei/fbsource (7f66eb77b)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*layer_norm*"
Building: finished in 0.1 sec (100%) 339/339 jobs, 0/339 updated
Total time: 0.2 sec
BUILD SUCCEEDED
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *layer_norm*
[==========] Running 7 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 7 tests from VulkanAPITest
[ RUN ] VulkanAPITest.layer_norm_invalid_inputs
[ OK ] VulkanAPITest.layer_norm_invalid_inputs (69 ms)
[ RUN ] VulkanAPITest.layer_norm_2d
[ OK ] VulkanAPITest.layer_norm_2d (292 ms)
[ RUN ] VulkanAPITest.layer_norm_3d
[ OK ] VulkanAPITest.layer_norm_3d (289 ms)
[ RUN ] VulkanAPITest.layer_norm_4d
[ OK ] VulkanAPITest.layer_norm_4d (4 ms)
[ RUN ] VulkanAPITest.native_layer_norm_2d
[ OK ] VulkanAPITest.native_layer_norm_2d (5 ms)
[ RUN ] VulkanAPITest.native_layer_norm_3d
[ OK ] VulkanAPITest.native_layer_norm_3d (2 ms)
[ RUN ] VulkanAPITest.native_layer_norm_4d
[ OK ] VulkanAPITest.native_layer_norm_4d (4 ms)
[----------] 7 tests from VulkanAPITest (667 ms total)
[----------] Global test environment tear-down
[==========] 7 tests from 1 test suite ran. (667 ms total)
[ PASSED ] 7 tests.
```
Reviewed By: yipjustin
Differential Revision: D51297971
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113676
Approved by: https://github.com/yipjustin
We saw some use cases in higher-order operators that try to directly inline a user-level function with no tensor operations (e.g. pytree.tree_flatten and pytree.tree_unflatten) by manually constructing a UserFunctionVariable and running call_function on it.
This PR consolidates this pattern a little bit by adding a _make_inlined helper function to make the UX better (i.e. the calling convention is kept the same as the function we'd like to inline), reduce redundancy, and increase readability.
Test Plan:
Existing tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113814
Approved by: https://github.com/yanboliang
Summary:
The new implementation of mat-mul missed a critical step that does width-packing on GPU: see T169764697.
The existing implementation of mat-mul also misses a case: when the "B" matrix is already in vulkan, it fails to do packing, leading to wrong results. (I have added a disabled unittest to reflect the issue.)
We will take multiple steps to enable (width / height) packing and transformation between different packings on Vulkan.
This is a first diff that enables a critical toolset: it allows developers to fetch values from the underlying tensor, making it possible to implement tests for the transformation shaders.
Test Plan:
P882053410,
P882053410
Differential Revision: D51291737
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113627
Approved by: https://github.com/SS-JIA
This PR enables storing the NCCL flight recorder to storage and makes it configurable by letting users register their own way of storing the debug info. We will then provide users a script to parse and process the dumped blobs offline.
One thing this PR is not trying to resolve is deciding where to dump the debug info. I will send a follow-up PR to address that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113503
Approved by: https://github.com/zdevito
Summary:
Based on discussions with Sherlock + Zhengxu in D51118067, updated the internal thrift schema to match the OSS schema.
Verifier failures:
* Test contains a None as input, resulting in no meta["val"]
* Test contains torch.autograd.grad_mode.set_grad_enabled as an op, which also results in no meta["val"]
* torch.autograd.grad_mode.set_grad_enabled is also not a valid op
* Test adds a "parameter" to the state dict but the parameter is not an nn.Parameter, causing an assertion failure
So to bypass these failures I did the following hacks(?):
* Before creating the exported program in deserialization, populate nodes w/o meta["val"] with meta["val"] = None
* Add torch.autograd.grad_mode.set_grad_enabled to the skip opset
* Duplicated ExportGraphSignature into aot_export.py so that the graph signature checks will be skipped
Configerator changes in D51343615
Test Plan: CI
Reviewed By: zhxchen17
Differential Revision: D51342921
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113810
Approved by: https://github.com/zhxchen17
Summary:
encapsulate fb changes into `torch._inductor.fx_passes.fb`, so that adding new passes (`fb.xxx`) won't need to touch OSS code like so:
```
# in torch/_inductor/fx_passes/pre_grad.py
if config.is_fbcode():
from .fb import xxx # every new fb/xxx.py would have needed this change in OSS code base
```
Test Plan: CI
Differential Revision: D51315193
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113677
Approved by: https://github.com/khabinov, https://github.com/chenyang78
Fixes https://github.com/pytorch/pytorch/issues/90552. This is a simpler fix that just detects the situation where AOTAutograd can't create a proper backward graph and graph-breaks. This was technically a silent correctness issue before.
This PR tries to always graph break when we see a factory function that returns a tensor requiring grad. I check this by seeing if the op returned a `TensorVariable` in dynamo, and if one of the input arguments was a `requires_grad=True` kwarg. I think this is high-fidelity enough, and I'm also hoping that this is uncommon enough that a graph break is reasonable here.
The fix to avoid the graph break in user land is also pretty easy - just instantiate your tensor outside of the compiled region and plumb it in.
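A hedged sketch of that user-land workaround:
```python
import torch

# Create the requires_grad tensor eagerly, outside the compiled region...
t = torch.ones(3, requires_grad=True)

@torch.compile
def f(x, t):
    # ...and plumb it in, instead of calling a factory function with
    # requires_grad=True inside the compiled region (which now graph-breaks).
    return (x * t).sum()

f(torch.randn(3), t)
```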
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113277
Approved by: https://github.com/eellison
ghstack dependencies: #113267, #113416, #113584
This is the first part to start build and test ExecuTorch on PyTorch using a pinned commit. There will be another PR later to update the pinned commit periodically.
* The pinned commit is in `.ci/docker/ci_commit_pins/executorch.txt` as part of PT Docker image
* I add one simple test `source .ci/scripts/test.sh mv3 cmake xnnpack-quantization-delegation ''`. More could be added later, in fact, any ET tests on Linux could be run here
* Building and installing vision and audio need to be done in CI after building PyTorch because they will be broken otherwise
Next steps, in sequence:
* [ ] Update this pinned commit periodically, similar to https://github.com/pytorch/pytorch/pull/113499
* [ ] Increase ET coverage on PT CI, ideally, we should run all ET pull jobs?
* [ ] Switch ExecuTorch's torch, vision, and audio nightly pins to commit pins
* [ ] Update ExecuTorch's torch, vision, and audio commit pins periodically
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113364
Approved by: https://github.com/ZainRizvi, https://github.com/malfet, https://github.com/guangy10
Since MYPYNOFOLLOW is about to turn on import following, there's no
reason to keep test_utils.py in the MYPYNOFOLLOW config. Moreover, I'm
not sure it still takes 10 minutes to typecheck this file; adding it to
the MYPY config takes `lintrunner --take MYPY --all-files` from 53s to
57s on my machine, which is substantial but not horrible. I guess we'll
see how it fares on CI.
(Note that we cannot simply merge MYPY and MYPYNOFOLLOW because the
latter config turns on `disallow_any_generics` and so is in that sense
stricter than the MYPY config.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113745
Approved by: https://github.com/clee2000
I used a couple of type-ignore comments in ir.py because it constructs
short-lived instances of FixedLayout and GraphModuleSerializer, just to
call a single method on them that doesn't use all their members. Making
those unused members optional would make the rest of the code a lot
messier with sprinkled `assert` statements.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113534
Approved by: https://github.com/albanD
torch.compile + SAC unit test is causing adjacent unit tests to be flaky due to its modification of shared singleton object. This PR attaches the checkpoint context fn to the checkpointed GraphModule, and look it up during execution, avoiding the need to make the higher-order op stateful.
Specifically, we attach the `context_fn` to the checkpointed GraphModule. These two will be gc'ed at the same time, so it satisfies the lifetime requirement.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112672
Approved by: https://github.com/wanchaol
Fixes #110100 by making aot_export_modules use dynamo.export's fake_mode in export.
Test Plan:
Add new tests. One of the tests places the fake tensor on cuda devices manually, and we are able to export the program and preserve the device information in the final produced graph module even on a machine with a cpu-only version of pytorch. One workaround we need is to set all tensors' requires_grad to false, as fake tensors with cuda devices don't compose well with aot_autograd right now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113681
Approved by: https://github.com/SherlockNoMad
Summary:
Previously this logic was scattered in two different places: before inserting observers and during observer insertion;
this PR moves everything to before we insert observers.
* Next: refactor QuantizationSpec and check more fields for sharing
Test Plan:
CI (regression tests)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113458
Approved by: https://github.com/kimishpatel
This PR is ALMOST basically just following the steps from #106677 EXCEPT! We do add one feature. Similar to fused_adam(w), for the CUDA dispatches: when the scalar tensor is on CPU, we .item and redispatch to the normal scalar overload. Otherwise, the cuda kernel will complain about mismatch in devices between the scalar and the tensors.
Why do we add this feature? Our optimizers want to allow lr as a tensor, and lr could be a CPU tensor. lr is used with foreach_div_ in Adam, so our CI will break otherwise.
After this PR, `_foreach_mul` and `_foreach_div` will accept either a CPU or a GPU tensor for the scalar tensor (vs only a GPU tensor). They join the ranks of `fused_adam(w)` in this characteristic. I did not yet do the same thing for foreach_add (the only other foreach op with a .Tensor overload) because there is no use case and will be more involved.
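A hedged sketch of the newly accepted call pattern (requires a CUDA build):
```python
import torch

tensors = [torch.ones(4, device="cuda") for _ in range(3)]
scalar = torch.tensor(2.0)  # a scalar tensor living on CPU

# Previously the CUDA kernel complained about mismatched devices; now the
# scalar is .item()'d and redispatched to the scalar overload.
out = torch._foreach_mul(tensors, scalar)
```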
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113688
Approved by: https://github.com/mlazos, https://github.com/albanD
Fixes https://github.com/pytorch/pytorch/issues/104594.
The reason for the exporter behavior in original posted issue is explained as follows:
ONNX model track shape related computes that were done in pytorch by python
numbers as tensor computes. This is the only way for ONNX to track them properly
since ONNX only has tensor type, otherwise the computation result will be tracked
statically as constant, and the model won't work for another input that differs in shape.
Now for type promotion logic, scalars should be treated differently with tensors.
Exporter mistook the shape related scalars as tensors in this case and incorrectly promoted.
This PR fixes the behavior and relaxes the criteria of scalar recognition. For floating point,
previously only a value from a model initializer with dtype torch.double and rank 0 was
treated as a scalar. This is now relaxed to any intermediate value, as well as to dtype torch.float.
The previous assumption was that a python number is traced as torch.double dtype, which also
appears to no longer be valid.
NOTE that this might introduce regression that a REAL 0-rank tensor is now being recognized as
scalar. The downside is the model will drop in accuracy for these cases as certain computations
will happen in lower precision data types.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113404
Approved by: https://github.com/justinchuby
Summary:
We are planning to lazily initialize CUPTI when profiling is actually performed. Therefore, we need to remove profiler init dependency on CUPTI Callbacks' RESOURCE_CONTEXT_CREATED.
Instead, we can initialize the profilers during torch profiler pybind, ie. THPAutograd_initExtension() and lazily in profilerStep().
Test Plan:
CI and ran internally, see internal diff logs.
Pulled By: aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112623
Approved by: https://github.com/albanD
The original behavior of torch.compile w.r.t. input mutations maintains that if an input to a graph was mutated, **and** requires grad, we will keep the input mutation outside of the graph and replay it at runtime.
This is important because, e.g., an input can have outstanding aliases, and mutating the input in eager mode will cause autograd to change the `grad_fn` of all outstanding aliases.
It looks like landing https://github.com/pytorch/pytorch/pull/111347 changed this behavior slightly:
* The linked PR makes it possible for AOTAutograd to go down the inference code path, even if some inputs require grad (because all of the outputs of the graph were seen to not require grad)
* AOTAutograd's logic in the inference code path today is to **always** keep input mutations in the graph.
This PR fixes that regression: regardless of inference vs. training, we should always keep input mutations outside of the graph if the input requires_grad.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113584
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #113267, #113416
Fixes https://github.com/pytorch/pytorch/issues/113010
In eager mode, when you call an out= op like `add(..., out=out_arg)` with an out argument that is noncontiguous, the noncontiguous out arg will be returned directly. When we functionalize though, functionalization replaces it with a call to `add(...)` which ignores the contiguity of the original out arg.
Instead of trying to support this, this PR detects that situation and graph breaks
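A hedged repro sketch of the situation that now graph-breaks:
```python
import torch

a = torch.ones(4, 4)
out = torch.zeros(4, 4).t()  # noncontiguous out argument

def f(a, out):
    return torch.add(a, 1, out=out)

f(a, out)  # eager returns `out` directly, preserving its layout
# torch.compile(f)(a, out) graph-breaks here instead of silently
# functionalizing away the out arg's contiguity.
```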
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113267
Approved by: https://github.com/albanD
If `cpuinfo_initialize` returns false, calls to subsequent cpuinfo functions may result in `abort()`.
Also, the `defaultNumThreads()` method now assumes that if one method fails, it should try another, and finally return 1.
Alas, there is no good way to test it on the x86 platform, but on ARM one can replicate it by running `sudo chmod 750 /sys` and then `python3 -c "import torch;torch._C.profiler.gather_traceback(True, True, True)"`
### <samp>🤖 Generated by Copilot at 4d942e8</samp>
> _`cpuinfo` fails_
> _avoid undefined behavior_
> _check before you count_
Partially addresses https://github.com/pytorch/pytorch/issues/113568
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113771
Approved by: https://github.com/atalman
For scenarios where FSDP is not the root module, the `_use_dtensor` flag would not be switched on. This PR fixes it by checking whether the submodule has the `device_mesh` and turn `_use_dtensor` flag on accordingly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113593
Approved by: https://github.com/fegin
Fixes #112633
Fixed errors relating to pydocstyle in the following files. The remaining errors are not covered in this issue. `torch/utils/dlpack.py` was not modified, as the errors relate to the function signature in the first line of the docstring, which must be maintained as-is for proper Sphinx interpretation.
```python
def from_dlpack(ext_tensor: Any) -> 'torch.Tensor':
"""from_dlpack(ext_tensor) -> Tensor
.....
"""
```
pydocstyle torch/utils/_contextlib.py --count
before: 4
after: 0
pydocstyle torch/backends/mps/__init__.py --count
before: 8
after: 1
**remaining errors**
```
torch/backends/mps/__init__.py:1 at module level:
D104: Missing docstring in public package
```
pydocstyle torch/backends/xeon/run_cpu.py --count
before: 13
after: 1
**remaining errors**
```
torch/backends/xeon/run_cpu.py:864 in public function `main`:
D103: Missing docstring in public function
```
pydocstyle torch/backends/cpu/__init__.py --count
before: 2
after: 1
**remaining errors**
```
torch/backends/cpu/__init__.py:1 at module level:
D104: Missing docstring in public package
```
pydocstyle torch/utils/cpp_backtrace.py --count
before: 4
after: 1
**remaining errors**
```
torch/utils/cpp_backtrace.py:1 at module level:
D100: Missing docstring in public module
```
pydocstyle torch/utils/bundled_inputs.py --count
before: 8
after: 1
**remaining errors**
```
torch/utils/bundled_inputs.py:1 at module level:
D100: Missing docstring in public module
```
pydocstyle torch/utils/file_baton.py --count
before: 8
after: 1
**remaining errors**
```
torch/utils/file_baton.py:1 at module level:
D100: Missing docstring in public module
```
pydocstyle torch/utils/mobile_optimizer.py --count
before: 6
after: 1
**remaining errors**
```
torch/utils/mobile_optimizer.py:8 in public class `LintCode`:
D101: Missing docstring in public class
```
pydocstyle torch/backends/opt_einsum/__init__.py --count
before: 7
after: 5
**remaining errors**
```
torch/backends/opt_einsum/__init__.py:1 at module level:
D104: Missing docstring in public package
torch/backends/opt_einsum/__init__.py:67 in public function `set_flags`:
D103: Missing docstring in public function
torch/backends/opt_einsum/__init__.py:77 in public function `flags`:
D103: Missing docstring in public function
torch/backends/opt_einsum/__init__.py:93 in public class `OptEinsumModule`:
D101: Missing docstring in public class
torch/backends/opt_einsum/__init__.py:94 in public method `__init__`:
D107: Missing docstring in __init__
```
pydocstyle torch/utils/_device.py --count
before: 9
after: 6
**remaining errors**
```
torch/utils/_device.py:58 in public class `DeviceContext`:
D101: Missing docstring in public class
torch/utils/_device.py:59 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/_device.py:62 in public method `__enter__`:
D105: Missing docstring in magic method
torch/utils/_device.py:68 in public method `__exit__`:
D105: Missing docstring in magic method
torch/utils/_device.py:73 in public method `__torch_function__`:
D105: Missing docstring in magic method
torch/utils/_device.py:80 in public function `device_decorator`:
D103: Missing docstring in public function
```
pydocstyle torch/utils/_freeze.py --count
before: 15
after: 7
**remaining errors**
```
torch/utils/_freeze.py:77 in public function `indent_msg`:
D103: Missing docstring in public function
torch/utils/_freeze.py:89 in public class `FrozenModule`:
D101: Missing docstring in public class
torch/utils/_freeze.py:100 in public class `Freezer`:
D101: Missing docstring in public class
torch/utils/_freeze.py:101 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/_freeze.py:106 in public method `msg`:
D102: Missing docstring in public method
torch/utils/_freeze.py:185 in public method `get_module_qualname`:
D102: Missing docstring in public method
torch/utils/_freeze.py:206 in public method `compile_string`:
D102: Missing docstring in public method
```
pydocstyle torch/utils/throughput_benchmark.py --count
before: 25
after: 8
**remaining errors**
```
torch/utils/throughput_benchmark.py:1 at module level:
D100: Missing docstring in public module
torch/utils/throughput_benchmark.py:27 in public class `ExecutionStats`:
D101: Missing docstring in public class
torch/utils/throughput_benchmark.py:28 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/throughput_benchmark.py:33 in public method `latency_avg_ms`:
D102: Missing docstring in public method
torch/utils/throughput_benchmark.py:37 in public method `num_iters`:
D102: Missing docstring in public method
torch/utils/throughput_benchmark.py:46 in public method `total_time_seconds`:
D102: Missing docstring in public method
torch/utils/throughput_benchmark.py:50 in public method `__str__`:
D105: Missing docstring in magic method
torch/utils/throughput_benchmark.py:94 in public method `__init__`:
D107: Missing docstring in __init__
```
pydocstyle torch/utils/hooks.py --count
before: 14
after: 11
**remaining errors**
```
torch/utils/hooks.py:1 at module level:
D100: Missing docstring in public module
torch/utils/hooks.py:23 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/hooks.py:34 in public method `remove`:
D102: Missing docstring in public method
torch/utils/hooks.py:44 in public method `__getstate__`:
D105: Missing docstring in magic method
torch/utils/hooks.py:50 in public method `__setstate__`:
D105: Missing docstring in magic method
torch/utils/hooks.py:64 in public method `__enter__`:
D105: Missing docstring in magic method
torch/utils/hooks.py:67 in public method `__exit__`:
D105: Missing docstring in magic method
torch/utils/hooks.py:82 in public function `warn_if_has_hooks`:
D103: Missing docstring in public function
torch/utils/hooks.py:103 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/hooks.py:188 in public method `setup_input_hook`:
D102: Missing docstring in public method
torch/utils/hooks.py:197 in public method `setup_output_hook`:
D102: Missing docstring in public method
```
pydocstyle torch/utils/_traceback.py --count
before: 19
after: 14
**remaining errors**
```
torch/utils/_traceback.py:47 in public function `report_compile_source_on_error`:
D103: Missing docstring in public function
torch/utils/_traceback.py:160 in public class `CapturedTraceback`:
D101: Missing docstring in public class
torch/utils/_traceback.py:163 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/_traceback.py:167 in public method `cleanup`:
D102: Missing docstring in public method
torch/utils/_traceback.py:170 in public method `summary`:
D102: Missing docstring in public method
torch/utils/_traceback.py:182 in public method `__getstate__`:
D105: Missing docstring in magic method
torch/utils/_traceback.py:190 in public method `extract`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/_traceback.py:190 in public method `extract`:
D400: First line should end with a period (not 't')
torch/utils/_traceback.py:213 in public method `format`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/_traceback.py:213 in public method `format`:
D400: First line should end with a period (not 'f')
torch/utils/_traceback.py:213 in public method `format`:
D401: First line should be in imperative mood (perhaps 'Format', not 'Formats')
torch/utils/_traceback.py:224 in public method `format_all`:
D200: One-line docstring should fit on one line with quotes (found 3)
torch/utils/_traceback.py:247 in private function `_extract_symbolized_tb`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/_traceback.py:247 in private function `_extract_symbolized_tb`:
D400: First line should end with a period (not 'f')
```
pydocstyle torch/utils/mkldnn.py --count
before: 28
after: 26
**remaining errors**
```
torch/utils/mkldnn.py:1 at module level:
D100: Missing docstring in public module
torch/utils/mkldnn.py:4 in public class `MkldnnLinear`:
D101: Missing docstring in public class
torch/utils/mkldnn.py:5 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/mkldnn.py:19 in public method `__getstate__`:
D105: Missing docstring in magic method
torch/utils/mkldnn.py:23 in public method `__setstate__`:
D105: Missing docstring in magic method
torch/utils/mkldnn.py:29 in public method `forward`:
D102: Missing docstring in public method
torch/utils/mkldnn.py:75 in public class `MkldnnConv1d`:
D101: Missing docstring in public class
torch/utils/mkldnn.py:76 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/mkldnn.py:82 in public method `__setstate__`:
D105: Missing docstring in magic method
torch/utils/mkldnn.py:88 in public class `MkldnnConv2d`:
D101: Missing docstring in public class
torch/utils/mkldnn.py:89 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/mkldnn.py:100 in public method `__setstate__`:
D105: Missing docstring in magic method
torch/utils/mkldnn.py:110 in public class `MkldnnConv3d`:
D101: Missing docstring in public class
torch/utils/mkldnn.py:111 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/mkldnn.py:122 in public method `__setstate__`:
D105: Missing docstring in magic method
torch/utils/mkldnn.py:133 in public class `MkldnnBatchNorm`:
D101: Missing docstring in public class
torch/utils/mkldnn.py:136 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/mkldnn.py:155 in public method `__getstate__`:
D105: Missing docstring in magic method
torch/utils/mkldnn.py:163 in public method `__setstate__`:
D105: Missing docstring in magic method
torch/utils/mkldnn.py:171 in public method `forward`:
D102: Missing docstring in public method
torch/utils/mkldnn.py:184 in public class `MkldnnPrelu`:
D101: Missing docstring in public class
torch/utils/mkldnn.py:185 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/mkldnn.py:190 in public method `__getstate__`:
D105: Missing docstring in magic method
torch/utils/mkldnn.py:194 in public method `__setstate__`:
D105: Missing docstring in magic method
torch/utils/mkldnn.py:199 in public method `forward`:
D102: Missing docstring in public method
torch/utils/mkldnn.py:205 in public function `to_mkldnn`:
D103: Missing docstring in public function
```
pydocstyle torch/utils/weak.py --count
before: 32
after: 30
**remaining errors**
```
torch/utils/weak.py:1 at module level:
D100: Missing docstring in public module
torch/utils/weak.py:42 in public class `WeakIdRef`:
D101: Missing docstring in public class
torch/utils/weak.py:45 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/weak.py:54 in public method `__call__`:
D102: Missing docstring in public method
torch/utils/weak.py:61 in public method `__hash__`:
D105: Missing docstring in magic method
torch/utils/weak.py:64 in public method `__eq__`:
D105: Missing docstring in magic method
torch/utils/weak.py:84 in public class `WeakIdKeyDictionary`:
D101: Missing docstring in public class
torch/utils/weak.py:87 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/weak.py:131 in public method `__delitem__`:
D105: Missing docstring in magic method
torch/utils/weak.py:135 in public method `__getitem__`:
D105: Missing docstring in magic method
torch/utils/weak.py:138 in public method `__len__`:
D105: Missing docstring in magic method
torch/utils/weak.py:145 in public method `__repr__`:
D105: Missing docstring in magic method
torch/utils/weak.py:148 in public method `__setitem__`:
D105: Missing docstring in magic method
torch/utils/weak.py:151 in public method `copy`:
D102: Missing docstring in public method
torch/utils/weak.py:162 in public method `__deepcopy__`:
D105: Missing docstring in magic method
torch/utils/weak.py:172 in public method `get`:
D102: Missing docstring in public method
torch/utils/weak.py:175 in public method `__contains__`:
D105: Missing docstring in magic method
torch/utils/weak.py:182 in public method `items`:
D102: Missing docstring in public method
torch/utils/weak.py:189 in public method `keys`:
D102: Missing docstring in public method
torch/utils/weak.py:198 in public method `values`:
D102: Missing docstring in public method
torch/utils/weak.py:216 in public method `popitem`:
D102: Missing docstring in public method
torch/utils/weak.py:224 in public method `pop`:
D102: Missing docstring in public method
torch/utils/weak.py:228 in public method `setdefault`:
D102: Missing docstring in public method
torch/utils/weak.py:231 in public method `update`:
D102: Missing docstring in public method
torch/utils/weak.py:241 in public method `__ior__`:
D105: Missing docstring in magic method
torch/utils/weak.py:245 in public method `__or__`:
D105: Missing docstring in magic method
torch/utils/weak.py:252 in public method `__ror__`:
D105: Missing docstring in magic method
torch/utils/weak.py:262 in public method `__eq__`:
D105: Missing docstring in magic method
torch/utils/weak.py:276 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/weak.py:280 in public method `__call__`:
D102: Missing docstring in public method
```
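For context, most of the remaining errors above are missing docstrings; for the style codes (D205/D400/D401) a typical fix looks like this (illustrative example, not taken from the PR):
```python
# Before: trips D205 (no blank line after the summary), D400 (summary does
# not end with a period), and D401 (not in imperative mood).
def extract(frames):
    """Extracts a traceback
    from the given frames"""

# After: one-line imperative summary ending in a period, blank line, details.
def extract(frames):
    """Extract a traceback.

    Build the traceback from the given frames.
    """
```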
@mikaylagawarecki @jbschlosser @svekars
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113311
Approved by: https://github.com/ezyang
In TorchVision we use the following (simplified) dispatch mechanism:
```python
import torch

def kernel1(tensor):
    return tensor + 2

def dispatcher1(input):
    kernel = get_kernel(dispatcher1, type(input))
    return kernel(input)

def kernel2(tensor):
    return tensor - 2

def dispatcher2(input):
    kernel = get_kernel(dispatcher2, type(input))
    return kernel(input)

# We actually use the function and type as keys, rather than their names.
# However, this is currently not supported, but should be easy to add after
# https://github.com/pytorch/pytorch/pull/111196
REGISTRY = {
    "dispatcher1": {"Tensor": kernel1},
    "dispatcher2": {"Tensor": kernel2},
}

def get_kernel(dispatcher, input_type):
    dispatcher_registry = REGISTRY[dispatcher.__name__]
    # Walk the MRO and use the first class with a registered kernel.
    for cls in input_type.__mro__:
        if cls.__name__ in dispatcher_registry:
            kernel = dispatcher_registry[cls.__name__]
            break
    return kernel
```
This can be compiled without graph breaks:
```python
cfn = torch.compile(dispatcher1, fullgraph=True)
torch.testing.assert_close(int(cfn(torch.tensor(3))), 5)
cfn = torch.compile(dispatcher2, fullgraph=True)
torch.testing.assert_close(int(cfn(torch.tensor(3))), 1)
```
However, if we start chaining these calls, we hit some issues:
```python
class Pipeline(torch.nn.Module):
    def forward(self, input):
        input = dispatcher1(input)
        input = dispatcher2(input)
        return input

cfn = torch.compile(Pipeline(), fullgraph=True)
torch.testing.assert_close(int(cfn(torch.tensor(3))), 3)
```
```
Can't access members of type(obj) for a generated custom object. Please use __class__ instead
```
The error message is not really helpful here. The following happens: when compiling `dispatcher1`, `get_kernel` gets inlined. That means when hitting `dispatcher2`, the `type` call no longer happens on an input with a source. Thus, in the first iteration we hit the top branch, while in the second we hit the bottom:
addb8e29cd/torch/_dynamo/variables/builtin.py (L1264-L1268)
The error message posted above originates from the type being treated as a constant. This PR replaces that with a `SourcelessBuilder` instead.
With that fix in place, we hit another error, this time pointing to `input_type.__mro__`:
```
AssertionError: Consider SourcelessBuilder for ephemeral objects, usually objects created locally.
```
The fix is similar: instead of using a `VariableBuilder` here, we use a `SourcelessBuilder` when we have no `source`:
addb8e29cd/torch/_dynamo/variables/builtin.py (L1167-L1168)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113340
Approved by: https://github.com/peterbell10, https://github.com/lezcano
# Motivation
If we would like to reinitialize DDP with a different ProcessGroup (PG) under `torch.compile`, we need to do the following:
```python
del old_ddp
del old_pg
pg = init_pg(...)
ddp = DDP(model, process_group=pg)
compiled_model = torch.compile(ddp)
```
This results in recompilation of the entire model and is very expensive. Since the only thing we need to update is the PG, we should be able to do this without having to compile the model again.
# Proposal
As a result, in this PR I've introduced an `_update_process_group` API which can dynamically update the underlying ProcessGroup used by DDP without needing to reinitialize DDP again.
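A minimal sketch of the intended flow (hedged: the name `_update_process_group` comes from this description, but the exact call form and placement below are assumptions):
```python
# Run under torchrun so a default process group can be initialized.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("gloo")
model = torch.nn.Linear(8, 8)
ddp = DDP(model)                    # wraps the module with the default PG
compiled = torch.compile(ddp)
compiled(torch.randn(2, 8))         # first call triggers compilation

new_pg = dist.new_group()           # a fresh PG over the same ranks
ddp._update_process_group(new_pg)   # assumed API: swap the PG in place
compiled(torch.randn(2, 8))         # reuses the already-compiled model
```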
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113580
Approved by: https://github.com/fduwjj
Summary: Currently the QAT tests are very specific to conv-bn-2d.
This makes it difficult to test new patterns like conv-bn-1d if
we want to add them. This commit refactors these tests so we can
add and test future patterns easily.
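A hypothetical sketch of the kind of parametrization such a refactor enables (the helper below is illustrative, not one of the actual test utilities):
```python
import torch.nn as nn

# Map dimensionality -> layer classes so one test body can cover
# conv-bn-1d and conv-bn-2d (and future patterns) uniformly.
CONV = {1: nn.Conv1d, 2: nn.Conv2d}
BN = {1: nn.BatchNorm1d, 2: nn.BatchNorm2d}

def make_conv_bn(dim: int) -> nn.Module:
    return nn.Sequential(CONV[dim](3, 3, 3), BN[dim](3))

for dim in (1, 2):
    m = make_conv_bn(dim)  # one assertion path per pattern, not per test class
```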
Test Plan:
python test/test_quantization.py TestQuantizePT2EQAT_ConvBn2d
Reviewers: jerryzh168, kimishpatel
Subscribers: jerryzh168, kimishpatel, supriyar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113658
Approved by: https://github.com/jerryzh168
Fixes #112988
For the following files:
__init__.py
_correct_bias.py
_equalize.py
_learnable_fake_quantize.py
backend_config
experimental
fake_quantize.py
fuse_modules.py
fuser_method_mappings.py
Correct the following errors:
```
__init__.py:1 at module level:
D104: Missing docstring in public package
__init__.py:144 in public function `default_eval_fn`:
D205: 1 blank line required between summary line and description (found 0)
__init__.py:144 in public function `default_eval_fn`:
D400: First line should end with a period (not 'f')
__init__.py:144 in public function `default_eval_fn`:
D401: First line should be in imperative mood; try rephrasing (found 'Default')
__init__.py:152 in private class `_DerivedObserverOrFakeQuantize`:
D204: 1 blank line required after class docstring (found 0)
__init__.py:152 in private class `_DerivedObserverOrFakeQuantize`:
D205: 1 blank line required between summary line and description (found 0)
__init__.py:152 in private class `_DerivedObserverOrFakeQuantize`:
D210: No whitespaces allowed surrounding docstring text
__init__.py:152 in private class `_DerivedObserverOrFakeQuantize`:
D400: First line should end with a period (not 's')
_correct_bias.py:20 in public function `get_module`:
D200: One-line docstring should fit on one line with quotes (found 2)
_correct_bias.py:20 in public function `get_module`:
D210: No whitespaces allowed surrounding docstring text
_correct_bias.py:20 in public function `get_module`:
D300: Use """triple double quotes""" (found '''-quotes)
_correct_bias.py:20 in public function `get_module`:
D400: First line should end with a period (not 'l')
_correct_bias.py:25 in public function `parent_child_names`:
D200: One-line docstring should fit on one line with quotes (found 2)
_correct_bias.py:25 in public function `parent_child_names`:
D300: Use """triple double quotes""" (found '''-quotes)
_correct_bias.py:25 in public function `parent_child_names`:
D400: First line should end with a period (not 'e')
_correct_bias.py:25 in public function `parent_child_names`:
D401: First line should be in imperative mood (perhaps 'Split', not 'Splits')
_correct_bias.py:34 in public function `get_param`:
D205: 1 blank line required between summary line and description (found 0)
_correct_bias.py:34 in public function `get_param`:
D210: No whitespaces allowed surrounding docstring text
_correct_bias.py:34 in public function `get_param`:
D300: Use """triple double quotes""" (found '''-quotes)
_correct_bias.py:34 in public function `get_param`:
D400: First line should end with a period (not 's')
_correct_bias.py:44 in public class `MeanShadowLogger`:
D204: 1 blank line required after class docstring (found 0)
_correct_bias.py:44 in public class `MeanShadowLogger`:
D205: 1 blank line required between summary line and description (found 0)
_correct_bias.py:44 in public class `MeanShadowLogger`:
D400: First line should end with a period (not 'n')
_correct_bias.py:47 in public method `__init__`:
D107: Missing docstring in __init__
_correct_bias.py:56 in public method `forward`:
D205: 1 blank line required between summary line and description (found 0)
_correct_bias.py:56 in public method `forward`:
D210: No whitespaces allowed surrounding docstring text
_correct_bias.py:56 in public method `forward`:
D300: Use """triple double quotes""" (found '''-quotes)
_correct_bias.py:56 in public method `forward`:
D401: First line should be in imperative mood; try rephrasing (found 'The')
_correct_bias.py:77 in public method `clear`:
D102: Missing docstring in public method
_correct_bias.py:85 in public function `bias_correction`:
D205: 1 blank line required between summary line and description (found 0)
_correct_bias.py:85 in public function `bias_correction`:
D210: No whitespaces allowed surrounding docstring text
_correct_bias.py:85 in public function `bias_correction`:
D300: Use """triple double quotes""" (found '''-quotes)
_correct_bias.py:85 in public function `bias_correction`:
D400: First line should end with a period (not 's')
_correct_bias.py:85 in public function `bias_correction`:
D401: First line should be in imperative mood (perhaps 'Use', not 'Using')
_equalize.py:22 in public function `set_module_weight`:
D103: Missing docstring in public function
_equalize.py:28 in public function `set_module_bias`:
D103: Missing docstring in public function
_equalize.py:34 in public function `get_module_weight`:
D103: Missing docstring in public function
_equalize.py:40 in public function `get_module_bias`:
D103: Missing docstring in public function
_equalize.py:47 in public function `max_over_ndim`:
D200: One-line docstring should fit on one line with quotes (found 2)
_equalize.py:47 in public function `max_over_ndim`:
D210: No whitespaces allowed surrounding docstring text
_equalize.py:47 in public function `max_over_ndim`:
D300: Use """triple double quotes""" (found '''-quotes)
_equalize.py:47 in public function `max_over_ndim`:
D400: First line should end with a period (not 's')
_equalize.py:47 in public function `max_over_ndim`:
D401: First line should be in imperative mood (perhaps 'Apply', not 'Applies')
_equalize.py:55 in public function `min_over_ndim`:
D200: One-line docstring should fit on one line with quotes (found 2)
_equalize.py:55 in public function `min_over_ndim`:
D210: No whitespaces allowed surrounding docstring text
_equalize.py:55 in public function `min_over_ndim`:
D300: Use """triple double quotes""" (found '''-quotes)
_equalize.py:55 in public function `min_over_ndim`:
D400: First line should end with a period (not 's')
_equalize.py:55 in public function `min_over_ndim`:
D401: First line should be in imperative mood (perhaps 'Apply', not 'Applies')
_equalize.py:63 in public function `channel_range`:
D200: One-line docstring should fit on one line with quotes (found 2)
_equalize.py:63 in public function `channel_range`:
D210: No whitespaces allowed surrounding docstring text
_equalize.py:63 in public function `channel_range`:
D300: Use """triple double quotes""" (found '''-quotes)
_equalize.py:63 in public function `channel_range`:
D400: First line should end with a period (not 'l')
_equalize.py:63 in public function `channel_range`:
D401: First line should be in imperative mood (perhaps 'Find', not 'finds')
_equalize.py:63 in public function `channel_range`:
D403: First word of the first line should be properly capitalized ('Finds', not 'finds')
_equalize.py:76 in public function `cross_layer_equalization`:
D205: 1 blank line required between summary line and description (found 0)
_equalize.py:76 in public function `cross_layer_equalization`:
D210: No whitespaces allowed surrounding docstring text
_equalize.py:76 in public function `cross_layer_equalization`:
D300: Use """triple double quotes""" (found '''-quotes)
_equalize.py:76 in public function `cross_layer_equalization`:
D400: First line should end with a period (not 't')
_equalize.py:120 in public function `equalize`:
D205: 1 blank line required between summary line and description (found 0)
_equalize.py:120 in public function `equalize`:
D210: No whitespaces allowed surrounding docstring text
_equalize.py:120 in public function `equalize`:
D300: Use """triple double quotes""" (found '''-quotes)
_equalize.py:120 in public function `equalize`:
D400: First line should end with a period (not 'l')
_equalize.py:159 in public function `converged`:
D205: 1 blank line required between summary line and description (found 0)
_equalize.py:159 in public function `converged`:
D210: No whitespaces allowed surrounding docstring text
_equalize.py:159 in public function `converged`:
D300: Use """triple double quotes""" (found '''-quotes)
_equalize.py:159 in public function `converged`:
D400: First line should end with a period (not 's')
_equalize.py:159 in public function `converged`:
D401: First line should be in imperative mood (perhaps 'Test', not 'Tests')
_learnable_fake_quantize.py:8 in private class `_LearnableFakeQuantize`:
D204: 1 blank line required after class docstring (found 0)
_learnable_fake_quantize.py:8 in private class `_LearnableFakeQuantize`:
D205: 1 blank line required between summary line and description (found 0)
_learnable_fake_quantize.py:8 in private class `_LearnableFakeQuantize`:
D210: No whitespaces allowed surrounding docstring text
_learnable_fake_quantize.py:8 in private class `_LearnableFakeQuantize`:
D400: First line should end with a period (not 'h')
_learnable_fake_quantize.py:68 in private method `enable_param_learning`:
D205: 1 blank line required between summary line and description (found 0)
_learnable_fake_quantize.py:68 in private method `enable_param_learning`:
D400: First line should end with a period (not 'd')
_learnable_fake_quantize.py:68 in private method `enable_param_learning`:
D401: First line should be in imperative mood (perhaps 'Enable', not 'Enables')
_learnable_fake_quantize.py:78 in private method `enable_static_estimate`:
D205: 1 blank line required between summary line and description (found 0)
_learnable_fake_quantize.py:78 in private method `enable_static_estimate`:
D400: First line should end with a period (not 'f')
_learnable_fake_quantize.py:78 in private method `enable_static_estimate`:
D401: First line should be in imperative mood (perhaps 'Enable', not 'Enables')
_learnable_fake_quantize.py:87 in private method `enable_static_observation`:
D205: 1 blank line required between summary line and description (found 0)
_learnable_fake_quantize.py:87 in private method `enable_static_observation`:
D400: First line should end with a period (not 't')
_learnable_fake_quantize.py:87 in private method `enable_static_observation`:
D401: First line should be in imperative mood (perhaps 'Enable', not 'Enables')
fake_quantize.py:1 at module level:
D205: 1 blank line required between summary line and description (found 0)
fake_quantize.py:1 at module level:
D400: First line should end with a period (not 'n')
fake_quantize.py:61 in public class `FakeQuantizeBase`:
D205: 1 blank line required between summary line and description (found 0)
fake_quantize.py:61 in public class `FakeQuantizeBase`:
D210: No whitespaces allowed surrounding docstring text
fake_quantize.py:61 in public class `FakeQuantizeBase`:
D400: First line should end with a period (not 'e')
fake_quantize.py:74 in public method `__init__`:
D107: Missing docstring in __init__
fake_quantize.py:83 in public method `forward`:
D102: Missing docstring in public method
fake_quantize.py:87 in public method `calculate_qparams`:
D102: Missing docstring in public method
fake_quantize.py:91 in public method `enable_fake_quant`:
D102: Missing docstring in public method
fake_quantize.py:95 in public method `disable_fake_quant`:
D102: Missing docstring in public method
fake_quantize.py:99 in public method `enable_observer`:
D102: Missing docstring in public method
fake_quantize.py:103 in public method `disable_observer`:
D102: Missing docstring in public method
fake_quantize.py:107 in public method `with_args`:
D102: Missing docstring in public method
fake_quantize.py:115 in public class `FakeQuantize`:
D205: 1 blank line required between summary line and description (found 0)
fake_quantize.py:115 in public class `FakeQuantize`:
D210: No whitespaces allowed surrounding docstring text
fake_quantize.py:115 in public class `FakeQuantize`:
D412: No blank lines allowed between a section header and its content ('Attributes')
fake_quantize.py:150 in public method `__init__`:
D107: Missing docstring in __init__
fake_quantize.py:188 in public method `calculate_qparams`:
D102: Missing docstring in public method
fake_quantize.py:191 in public method `forward`:
D102: Missing docstring in public method
fake_quantize.py:214 in public method `extra_repr`:
D102: Missing docstring in public method
fake_quantize.py:262 in public class `FixedQParamsFakeQuantize`:
D205: 1 blank line required between summary line and description (found 0)
fake_quantize.py:262 in public class `FixedQParamsFakeQuantize`:
D210: No whitespaces allowed surrounding docstring text
fake_quantize.py:262 in public class `FixedQParamsFakeQuantize`:
D400: First line should end with a period (not 'n')
fake_quantize.py:268 in public method `__init__`:
D107: Missing docstring in __init__
fake_quantize.py:279 in public method `calculate_qparams`:
D102: Missing docstring in public method
fake_quantize.py:283 in public method `extra_repr`:
D102: Missing docstring in public method
fake_quantize.py:292 in public class `FusedMovingAvgObsFakeQuantize`:
D205: 1 blank line required between summary line and description (found 0)
fake_quantize.py:292 in public class `FusedMovingAvgObsFakeQuantize`:
D400: First line should end with a period (not 'e')
fake_quantize.py:307 in public method `__init__`:
D107: Missing docstring in __init__
fake_quantize.py:322 in public method `calculate_qparams`:
D102: Missing docstring in public method
fake_quantize.py:326 in public method `extra_repr`:
D102: Missing docstring in public method
fake_quantize.py:342 in public method `forward`:
D102: Missing docstring in public method
fake_quantize.py:480 in private function `_is_fake_quant_script_module`:
D200: One-line docstring should fit on one line with quotes (found 2)
fake_quantize.py:480 in private function `_is_fake_quant_script_module`:
D210: No whitespaces allowed surrounding docstring text
fake_quantize.py:480 in private function `_is_fake_quant_script_module`:
D300: Use """triple double quotes""" (found '''-quotes)
fake_quantize.py:480 in private function `_is_fake_quant_script_module`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
fake_quantize.py:491 in public function `disable_fake_quant`:
D400: First line should end with a period (not ':')
fake_quantize.py:502 in public function `enable_fake_quant`:
D400: First line should end with a period (not ':')
fake_quantize.py:513 in public function `disable_observer`:
D400: First line should end with a period (not ':')
fake_quantize.py:524 in public function `enable_observer`:
D400: First line should end with a period (not ':')
fuse_modules.py:1 at module level:
D100: Missing docstring in public module
fuse_modules.py:39 in public function `fuse_known_modules`:
D205: 1 blank line required between summary line and description (found 0)
fuse_modules.py:39 in public function `fuse_known_modules`:
D400: First line should end with a period (not 'd')
fuse_modules.py:39 in public function `fuse_known_modules`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
fuse_modules.py:104 in public function `fuse_modules`:
D400: First line should end with a period (not 'e')
fuse_modules.py:167 in public function `fuse_modules_qat`:
D200: One-line docstring should fit on one line with quotes (found 2)
fuse_modules.py:167 in public function `fuse_modules_qat`:
D210: No whitespaces allowed surrounding docstring text
fuse_modules.py:167 in public function `fuse_modules_qat`:
D400: First line should end with a period (not '`')
fuser_method_mappings.py:1 at module level:
D100: Missing docstring in public module
fuser_method_mappings.py:18 in public function `fuse_conv_bn`:
D400: First line should end with a period (not 'e')
fuser_method_mappings.py:55 in public function `fuse_conv_bn_relu`:
D400: First line should end with a period (not 'e')
fuser_method_mappings.py:102 in public function `fuse_linear_bn`:
D400: First line should end with a period (not 'e')
fuser_method_mappings.py:131 in public function `fuse_convtranspose_bn`:
D400: First line should end with a period (not 'e')
fuser_method_mappings.py:154 in private function `_sequential_wrapper2`:
D205: 1 blank line required between summary line and description (found 0)
fuser_method_mappings.py:154 in private function `_sequential_wrapper2`:
D210: No whitespaces allowed surrounding docstring text
fuser_method_mappings.py:154 in private function `_sequential_wrapper2`:
D400: First line should end with a period (not 's')
fuser_method_mappings.py:182 in public function `get_fuser_method`:
D205: 1 blank line required between summary line and description (found 0)
fuser_method_mappings.py:182 in public function `get_fuser_method`:
D210: No whitespaces allowed surrounding docstring text
fuser_method_mappings.py:182 in public function `get_fuser_method`:
D300: Use """triple double quotes""" (found '''-quotes)
fuser_method_mappings.py:182 in public function `get_fuser_method`:
D400: First line should end with a period (not ',')
fuser_method_mappings.py:205 in private function `_get_valid_patterns`:
D205: 1 blank line required between summary line and description (found 0)
fuser_method_mappings.py:205 in private function `_get_valid_patterns`:
D400: First line should end with a period (not ',')
fuser_method_mappings.py:205 in private function `_get_valid_patterns`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
fuser_method_mappings.py:238 in public function `get_fuser_method_new`:
D205: 1 blank line required between summary line and description (found 0)
fuser_method_mappings.py:238 in public function `get_fuser_method_new`:
D210: No whitespaces allowed surrounding docstring text
fuser_method_mappings.py:238 in public function `get_fuser_method_new`:
D400: First line should end with a period (not 'd')
fuser_method_mappings.py:238 in public function `get_fuser_method_new`:
D401: First line should be in imperative mood; try rephrasing (found 'This')
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112992
Approved by: https://github.com/kit1980
Subsumes half of https://github.com/pytorch/pytorch/pull/113605
We support fakeifying an already fake tensor, which gives you a new fake tensor mirroring the same structure as the original fake tensor; this is what is needed by https://github.com/pytorch/pytorch/issues/113643. However, when this refakeification happens, we naively reallocate all new sizes for the fake tensor. That is the right thing to do if you are re-fakeifying on a fresh ShapeEnv (because you're reparametrizing the sizes or something), but if you have two fake tensor modes sharing a shape environment, you would rather just reuse the original sizes/strides/storage offset from the original fake tensor. This ends up being pretty simple. I recommend viewing with whitespace diff turned off.
There's some fuzz around jagged tensor handling; that code is probably not quite right, but I fixed it for this particular case in the most straightforward way.
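A hedged sketch of the situation this targets (these are internal APIs, so treat the reuse behavior as an assumption based on this description):
```python
import torch
from torch._subclasses.fake_tensor import FakeTensorMode
from torch.fx.experimental.symbolic_shapes import ShapeEnv

shape_env = ShapeEnv()
mode1 = FakeTensorMode(shape_env=shape_env)
mode2 = FakeTensorMode(shape_env=shape_env)  # shares the same ShapeEnv

t = torch.randn(4, 8)
fake1 = mode1.from_tensor(t)
# Re-fakeify the already-fake tensor under the second mode. With this
# change, fake2 should reuse fake1's sizes/strides/offset rather than
# allocating fresh symbols in the shared ShapeEnv.
fake2 = mode2.from_tensor(fake1)
```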
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113651
Approved by: https://github.com/albanD, https://github.com/eellison, https://github.com/bdhirsh
Fixes #112594
Docstrings updated.
Here are the pydocstyle outputs, with the error counts before and after.
1) torch/autograd/forward_ad.py
Before:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/forward_ad.py:1 at module level:
D100: Missing docstring in public module
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/forward_ad.py:23 in public function `enter_dual_level`:
D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/forward_ad.py:23 in public function `enter_dual_level`:
D401: First line should be in imperative mood; try rephrasing (found 'Function')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/forward_ad.py:42 in public function `exit_dual_level`:
D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/forward_ad.py:42 in public function `exit_dual_level`:
D401: First line should be in imperative mood; try rephrasing (found 'Function')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/forward_ad.py:62 in public function `make_dual`:
D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/forward_ad.py:62 in public function `make_dual`:
D400: First line should end with a period (not 'a')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/forward_ad.py:128 in public class `UnpackedDualTensor`:
D204: 1 blank line required after class docstring (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/forward_ad.py:128 in public class `UnpackedDualTensor`:
D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/forward_ad.py:128 in public class `UnpackedDualTensor`:
D209: Multi-line docstring closing quotes should be on a separate line
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/forward_ad.py:134 in public function `unpack_dual`:
D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/forward_ad.py:165 in public class `dual_level`:
D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/forward_ad.py:165 in public class `dual_level`:
D400: First line should end with a period (not 't')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/forward_ad.py:199 in public method `__enter__`:
D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/forward_ad.py:202 in public method `__exit__`:
D105: Missing docstring in magic method
15
```
After:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/forward_ad.py:1 at module level:
D100: Missing docstring in public module
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/forward_ad.py:205 in public method `__enter__`:
D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/forward_ad.py:208 in public method `__exit__`:
D105: Missing docstring in magic method
3
```
2) torch/autograd/functional.py
Before:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/functional.py:1 at module level:
D100: Missing docstring in public module
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/functional.py:262 in public function `vjp`:
D202: No blank lines allowed after function docstring (found 1)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/functional.py:262 in public function `vjp`:
D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/functional.py:262 in public function `vjp`:
D400: First line should end with a period (not 'e')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/functional.py:262 in public function `vjp`:
D401: First line should be in imperative mood; try rephrasing (found 'Function')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/functional.py:359 in public function `jvp`:
D202: No blank lines allowed after function docstring (found 1)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/functional.py:359 in public function `jvp`:
D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/functional.py:359 in public function `jvp`:
D400: First line should end with a period (not 'f')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/functional.py:359 in public function `jvp`:
D401: First line should be in imperative mood; try rephrasing (found 'Function')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/functional.py:584 in public function `jacobian`:
D401: First line should be in imperative mood; try rephrasing (found 'Function')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/functional.py:841 in public function `hessian`:
D202: No blank lines allowed after function docstring (found 1)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/functional.py:841 in public function `hessian`:
D401: First line should be in imperative mood; try rephrasing (found 'Function')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/functional.py:973 in public function `vhp`:
D202: No blank lines allowed after function docstring (found 1)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/functional.py:973 in public function `vhp`:
D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/functional.py:973 in public function `vhp`:
D400: First line should end with a period (not 'e')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/functional.py:973 in public function `vhp`:
D401: First line should be in imperative mood; try rephrasing (found 'Function')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/functional.py:1076 in public function `hvp`:
D202: No blank lines allowed after function docstring (found 1)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/functional.py:1076 in public function `hvp`:
D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/functional.py:1076 in public function `hvp`:
D400: First line should end with a period (not 'r')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/functional.py:1076 in public function `hvp`:
D401: First line should be in imperative mood; try rephrasing (found 'Function')
20
```
After:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/functional.py:1 at module level:
D100: Missing docstring in public module
1
```
3) torch/autograd/profiler_legacy.py
Before:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:1 at module level:
D100: Missing docstring in public module
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:27 in public class `profile`:
D400: First line should end with a period (not 'd')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:29 in public method `__init__`:
D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:62 in public method `config`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:74 in public method `__enter__`:
D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:86 in public method `__exit__`:
D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:103 in public method `__repr__`:
D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:108 in public method `__str__`:
D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:117 in public method `table`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:141 in public method `export_chrome_trace`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:148 in public method `export_stacks`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:154 in public method `key_averages`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:161 in public method `total_average`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:170 in public method `self_cpu_time_total`:
D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:170 in public method `self_cpu_time_total`:
D400: First line should end with a period (not 'f')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:180 in private nested function `_get_record_key`:
D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:180 in private nested function `_get_record_key`:
D400: First line should end with a period (not 'd')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:180 in private nested function `_get_record_key`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
18
```
After:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:1 at module level:
D100: Missing docstring in public module
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:29 in public method `__init__`:
D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:62 in public method `config`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:74 in public method `__enter__`:
D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:86 in public method `__exit__`:
D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:103 in public method `__repr__`:
D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:108 in public method `__str__`:
D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:117 in public method `table`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:141 in public method `export_chrome_trace`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:148 in public method `export_stacks`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:154 in public method `key_averages`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_legacy.py:161 in public method `total_average`:
D102: Missing docstring in public method
12
```
4) torch/autograd/gradcheck.py
Before:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/gradcheck.py:1 at module level:
D100: Missing docstring in public module
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/gradcheck.py:27 in public class `GradcheckError`:
D204: 1 blank line required after class docstring (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/gradcheck.py:27 in public class `GradcheckError`:
D400: First line should end with a period (not '`')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/gradcheck.py:258 in private function `_get_numerical_jacobian`:
D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/gradcheck.py:258 in private function `_get_numerical_jacobian`:
D400: First line should end with a period (not 'f')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/gradcheck.py:258 in private function `_get_numerical_jacobian`:
D401: First line should be in imperative mood (perhaps 'Compute', not 'Computes')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/gradcheck.py:308 in public function `get_numerical_jacobian`:
D401: First line should be in imperative mood; try rephrasing (found 'Deprecated')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/gradcheck.py:459 in public function `get_numerical_jacobian_wrt_specific_input`:
D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/gradcheck.py:488 in private function `_get_analytical_jacobian_forward_ad`:
D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/gradcheck.py:488 in private function `_get_analytical_jacobian_forward_ad`:
D400: First line should end with a period (not 't')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/gradcheck.py:488 in private function `_get_analytical_jacobian_forward_ad`:
D401: First line should be in imperative mood (perhaps 'Compute', not 'Computes')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/gradcheck.py:816 in public function `get_analytical_jacobian`:
D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/gradcheck.py:1944 in public function `gradcheck`:
D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/gradcheck.py:1944 in public function `gradcheck`:
D400: First line should end with a period (not 'l')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/gradcheck.py:2133 in public function `gradgradcheck`:
D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/gradcheck.py:2133 in public function `gradgradcheck`:
D400: First line should end with a period (not 's')
16
```
After:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/gradcheck.py:1 at module level:
D100: Missing docstring in public module
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/gradcheck.py:463 in public function `get_numerical_jacobian_wrt_specific_input`:
D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/gradcheck.py:820 in public function `get_analytical_jacobian`:
D103: Missing docstring in public function
3
```
5) torch/autograd/function.py
Before:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:1 at module level:
D100: Missing docstring in public module
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:27 in public class `FunctionCtx`:
D101: Missing docstring in public class
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:29 in public method `save_for_backward`:
D401: First line should be in imperative mood (perhaps 'Save', not 'Saves')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:88 in public method `save_for_forward`:
D401: First line should be in imperative mood (perhaps 'Save', not 'Saves')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:141 in public method `mark_dirty`:
D401: First line should be in imperative mood (perhaps 'Mark', not 'Marks')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:177 in public method `mark_shared_storage`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:185 in public method `mark_non_differentiable`:
D401: First line should be in imperative mood (perhaps 'Mark', not 'Marks')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:217 in public method `set_materialize_grads`:
D401: First line should be in imperative mood (perhaps 'Set', not 'Sets')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:276 in public class `BackwardCFunction`:
D101: Missing docstring in public class
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:277 in public method `apply`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:291 in public method `apply_jvp`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:308 in public method `__init__`:
D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:322 in private method `forward`:
D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:322 in private method `forward`:
D400: First line should end with a period (not 's')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:322 in private method `forward`:
D401: First line should be in imperative mood; try rephrasing (found 'This')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:384 in private method `backward`:
D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:384 in private method `backward`:
D400: First line should end with a period (not 'e')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:384 in private method `backward`:
D401: First line should be in imperative mood (perhaps 'Define', not 'Defines')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:416 in private method `jvp`:
D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:416 in private method `jvp`:
D400: First line should end with a period (not 'e')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:416 in private method `jvp`:
D401: First line should be in imperative mood (perhaps 'Define', not 'Defines')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:439 in public class `Function`:
D400: First line should end with a period (not '`')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:472 in public method `__init__`:
D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:482 in public method `__call__`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:505 in public method `vmap`:
D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:505 in public method `vmap`:
D400: First line should end with a period (not 'h')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:505 in public method `vmap`:
D401: First line should be in imperative mood (perhaps 'Define', not 'Defines')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:536 in public method `apply`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:564 in public function `once_differentiable`:
D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:612 in public function `traceable`:
D401: First line should be in imperative mood (perhaps 'Mark', not 'Marks')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:626 in public class `InplaceFunction`:
D101: Missing docstring in public class
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:627 in public method `__init__`:
D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:741 in public class `NestedIOFunction`:
D101: Missing docstring in public class
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:761 in public method `backward`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:768 in public method `forward`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:775 in public method `save_for_backward`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:780 in public method `saved_tensors`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:784 in public method `mark_dirty`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:787 in public method `mark_non_differentiable`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:790 in public method `forward_extended`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:793 in public method `backward_extended`:
D102: Missing docstring in public method
41
```
After:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:1 at module level:
D100: Missing docstring in public module
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:27 in public class `FunctionCtx`:
D101: Missing docstring in public class
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:177 in public method `mark_shared_storage`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:276 in public class `BackwardCFunction`:
D101: Missing docstring in public class
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:277 in public method `apply`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:291 in public method `apply_jvp`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:308 in public method `__init__`:
D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:471 in public method `__init__`:
D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:481 in public method `__call__`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:536 in public method `apply`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:564 in public function `once_differentiable`:
D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:626 in public class `InplaceFunction`:
D101: Missing docstring in public class
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:627 in public method `__init__`:
D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:741 in public class `NestedIOFunction`:
D101: Missing docstring in public class
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:761 in public method `backward`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:768 in public method `forward`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:775 in public method `save_for_backward`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:780 in public method `saved_tensors`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:784 in public method `mark_dirty`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:787 in public method `mark_non_differentiable`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:790 in public method `forward_extended`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/function.py:793 in public method `backward_extended`:
D102: Missing docstring in public method
22
```
6) torch/autograd/profiler_util.py
Before:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:1 at module level:
D100: Missing docstring in public module
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:26 in public class `EventList`:
D400: First line should end with a period (not ')')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:28 in public method `__init__`:
D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:46 in public method `__str__`:
D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:70 in private method `_populate_cpu_children`:
D202: No blank lines allowed after function docstring (found 1)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:70 in private method `_populate_cpu_children`:
D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:70 in private method `_populate_cpu_children`:
D401: First line should be in imperative mood (perhaps 'Populate', not 'Populates')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:166 in public method `self_cpu_time_total`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:179 in public method `table`:
D401: First line should be in imperative mood (perhaps 'Print', not 'Prints')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:210 in public method `export_chrome_trace`:
D401: First line should be in imperative mood (perhaps 'Export', not 'Exports')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:266 in public method `supported_export_stacks_metrics`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:273 in public method `export_stacks`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:354 in private function `_format_time`:
D400: First line should end with a period (not 't')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:354 in private function `_format_time`:
D401: First line should be in imperative mood (perhaps 'Define', not 'Defines')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:365 in private function `_format_time_share`:
D400: First line should end with a period (not 't')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:365 in private function `_format_time_share`:
D401: First line should be in imperative mood (perhaps 'Define', not 'Defines')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:373 in private function `_format_memory`:
D400: First line should end with a period (not 'g')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:373 in private function `_format_memory`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:408 in public method `cpu_time`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:412 in public method `cuda_time`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:416 in public method `privateuse1_time`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:420 in public class `Interval`:
D101: Missing docstring in public class
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:421 in public method `__init__`:
D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:425 in public method `elapsed_us`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:435 in public method `__init__`:
D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:488 in public method `append_kernel`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:504 in public method `set_cpu_parent`:
D400: First line should end with a period (not 't')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:518 in public method `self_cpu_memory_usage`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:526 in public method `self_cuda_memory_usage`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:534 in public method `self_privateuse1_memory_usage`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:542 in public method `self_cpu_time_total`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:550 in public method `cuda_time_total`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:567 in public method `self_cuda_time_total`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:579 in public method `cpu_time_total`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:586 in public method `self_privateuse1_time_total`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:598 in public method `privateuse1_time_total`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:615 in public method `key`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:618 in public method `__repr__`:
D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:659 in public method `__init__`:
D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:687 in public method `add`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:726 in public method `__iadd__`:
D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:729 in public method `__repr__`:
D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:763 in public class `StringTable`:
D101: Missing docstring in public class
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:764 in public method `__missing__`:
D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:773 in public class `MemRecordsAcc`:
D400: First line should end with a period (not 'l')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:775 in public method `__init__`:
D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:783 in public method `in_interval`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:846 in private function `_build_table`:
D401: First line should be in imperative mood (perhaps 'Print', not 'Prints')
48
```
After:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:1 at module level:
D100: Missing docstring in public module
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:28 in public method `__init__`:
D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:46 in public method `__str__`:
D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:166 in public method `self_cpu_time_total`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:266 in public method `supported_export_stacks_metrics`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:273 in public method `export_stacks`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:408 in public method `cpu_time`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:412 in public method `cuda_time`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:416 in public method `privateuse1_time`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:420 in public class `Interval`:
D101: Missing docstring in public class
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:421 in public method `__init__`:
D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:425 in public method `elapsed_us`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:435 in public method `__init__`:
D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:488 in public method `append_kernel`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:518 in public method `self_cpu_memory_usage`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:526 in public method `self_cuda_memory_usage`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:534 in public method `self_privateuse1_memory_usage`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:542 in public method `self_cpu_time_total`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:550 in public method `cuda_time_total`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:567 in public method `self_cuda_time_total`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:579 in public method `cpu_time_total`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:586 in public method `self_privateuse1_time_total`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:598 in public method `privateuse1_time_total`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:615 in public method `key`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:618 in public method `__repr__`:
D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:659 in public method `__init__`:
D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:687 in public method `add`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:726 in public method `__iadd__`:
D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:729 in public method `__repr__`:
D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:763 in public class `StringTable`:
D101: Missing docstring in public class
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:764 in public method `__missing__`:
D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:775 in public method `__init__`:
D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/profiler_util.py:783 in public method `in_interval`:
D102: Missing docstring in public method
33
```
7) torch/autograd/grad_mode.py
Before:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:1 at module level:
D100: Missing docstring in public module
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:73 in public method `__init__`:
D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:78 in public method `__enter__`:
D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:82 in public method `__exit__`:
D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:133 in public method `__enter__`:
D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:137 in public method `__exit__`:
D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:182 in public method `__init__`:
D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:187 in public method `__enter__`:
D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:190 in public method `__exit__`:
D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:193 in public method `clone`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:198 in public class `inference_mode`:
D400: First line should end with a period (not 'e')
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:250 in public method `__init__`:
D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:257 in public method `__new__`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:262 in public method `__enter__`:
D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:266 in public method `__exit__`:
D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:269 in public method `clone`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:301 in public method `__init__`:
D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:306 in public method `__enter__`:
D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:309 in public method `__exit__`:
D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:312 in public method `clone`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:354 in private class `_unsafe_preserve_version_counter`:
D400: First line should end with a period (not '!')
21
```
After:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:1 at module level:
D100: Missing docstring in public module
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:73 in public method `__init__`:
D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:78 in public method `__enter__`:
D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:82 in public method `__exit__`:
D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:133 in public method `__enter__`:
D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:137 in public method `__exit__`:
D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:182 in public method `__init__`:
D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:187 in public method `__enter__`:
D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:190 in public method `__exit__`:
D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:193 in public method `clone`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:250 in public method `__init__`:
D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:257 in public method `__new__`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:262 in public method `__enter__`:
D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:266 in public method `__exit__`:
D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:269 in public method `clone`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:301 in public method `__init__`:
D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:306 in public method `__enter__`:
D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:309 in public method `__exit__`:
D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/autograd/grad_mode.py:312 in public method `clone`:
D102: Missing docstring in public method
19
```
@svekars @kit1980 @subramen
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113266
Approved by: https://github.com/aaronenyeshi, https://github.com/soulitzer, https://github.com/kit1980
Use the same strategy as for the unsafe pickler, i.e., use a dummy `torch.serialization.StorageType` to represent legacy typed storage classes during deserialization. Add a `_dtype` property so it can be used for both new- and legacy-format deserialization.
Parametrize `test_serialization_new_format_old_format_compat`
Add a regression test to validate that legacy models can be loaded without any warnings.
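For reference, a minimal sketch of such a warnings-catching check (the checkpoint path is illustrative; the assertion mirrors the one in the updated test):
```python
import warnings

import torch

# Load a legacy-format checkpoint and assert that it emits no warnings.
with warnings.catch_warnings(record=True) as w:
    warnings.simplefilter("always")
    torch.load("legacy_checkpoint.pt", weights_only=True)  # path is illustrative
assert len(w) == 0, f"Expected no warnings but got {[str(x) for x in w]}"
```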
Before the change:
```
% python test_serialization.py -v -k test_serialization_new_format_old_format_compat_
test_serialization_new_format_old_format_compat_cpu (__main__.TestBothSerializationCPU) ... ok
test_serialization_new_format_old_format_compat_safe_cpu (__main__.TestBothSerializationCPU) ... /Users/nshulga/git/pytorch/pytorch/torch/_utils.py:836: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.__get__(instance, owner)()
ok
----------------------------------------------------------------------
Ran 2 tests in 0.116s
OK
```
Without the change, but with the test updated to catch warnings:
```
% python test_serialization.py -v -k test_serialization_new_format_old_format_compat_
test_serialization_new_format_old_format_compat_weights_only_False_cpu (__main__.TestBothSerializationCPU) ... ok
test_serialization_new_format_old_format_compat_weights_only_True_cpu (__main__.TestBothSerializationCPU) ... FAIL
======================================================================
FAIL: test_serialization_new_format_old_format_compat_weights_only_True_cpu (__main__.TestBothSerializationCPU)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/nshulga/git/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 2536, in wrapper
method(*args, **kwargs)
File "/Users/nshulga/git/pytorch/pytorch/torch/testing/_internal/common_device_type.py", line 415, in instantiated_test
result = test(self, **param_kwargs)
File "/Users/nshulga/git/pytorch/pytorch/test/test_serialization.py", line 807, in test_serialization_new_format_old_format_compat
self.assertTrue(len(w) == 0, msg=f"Expected no warnings but got {[str(x) for x in w]}")
AssertionError: False is not true : Expected no warnings but got ["{message : UserWarning('TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()'), category : 'UserWarning', filename : '/Users/nshulga/git/pytorch/pytorch/torch/_utils.py', lineno : 836, line : None}"]
To execute this test, run the following from the base repo dir:
python test/test_serialization.py -k test_serialization_new_format_old_format_compat_weights_only_True_cpu
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
----------------------------------------------------------------------
Ran 2 tests in 0.109s
FAILED (failures=1)
```
Fixes problem reported in https://github.com/pytorch/pytorch/issues/52181#issuecomment-1715738910
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113614
Approved by: https://github.com/kit1980, https://github.com/albanD
Fixes#100764
This PR fixes the unary ops implementation and refactors the binary ops implementation a bit.
For unary ops:
Previously we didn't take into account unary ops that have a non-contiguous/storage-offset output, causing an incorrect result (because the MPS graph kernel always writes the buffer contiguously). Therefore, this PR creates a temporary output tensor for the graph first and then copies the result back to the original output tensor. We currently do not have a better fix than this, I think.
For binary ops, see https://github.com/pytorch/pytorch/pull/97085#discussion_r1140999125
See the added test for repro.
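A minimal Python sketch of the fix pattern (the actual change lives in the MPS backend code; `graph_op` and the function name are illustrative):
```python
import torch

def run_unary_graph_op(graph_op, out):
    # The MPS graph kernel always writes its result contiguously, so a
    # non-contiguous or storage-offset output must go through a temporary.
    if out.is_contiguous() and out.storage_offset() == 0:
        graph_op(out)  # safe to write into the output buffer directly
    else:
        tmp = torch.empty_like(out, memory_format=torch.contiguous_format)
        graph_op(tmp)   # kernel writes the contiguous temporary
        out.copy_(tmp)  # copy_ respects the output's strides/offset
    return out
```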
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97085
Approved by: https://github.com/malfet
Fixes#112599
Fixed pydocstyle errors in the following files. The remaining errors relate to docstrings at the module level and on methods within each module, such as `forward()`, `reset_parameters`, `__init__`, etc.
pydocstyle torch/nn/modules/pooling.py --count
before: 49
after: 29
**remaining errors:**
```
torch/nn/modules/pooling.py:1 at module level:
D100: Missing docstring in public module
torch/nn/modules/pooling.py:90 in public method `forward`:
D102: Missing docstring in public method
torch/nn/modules/pooling.py:163 in public method `forward`:
D102: Missing docstring in public method
torch/nn/modules/pooling.py:240 in public method `forward`:
D102: Missing docstring in public method
torch/nn/modules/pooling.py:315 in public method `__init__`:
D107: Missing docstring in __init__
torch/nn/modules/pooling.py:321 in public method `forward`:
D102: Missing docstring in public method
torch/nn/modules/pooling.py:402 in public method `__init__`:
D107: Missing docstring in __init__
torch/nn/modules/pooling.py:408 in public method `forward`:
D102: Missing docstring in public method
torch/nn/modules/pooling.py:472 in public method `__init__`:
D107: Missing docstring in __init__
torch/nn/modules/pooling.py:478 in public method `forward`:
D102: Missing docstring in public method
torch/nn/modules/pooling.py:541 in public method `__init__`:
D107: Missing docstring in __init__
torch/nn/modules/pooling.py:550 in public method `forward`:
D102: Missing docstring in public method
torch/nn/modules/pooling.py:620 in public method `__init__`:
D107: Missing docstring in __init__
torch/nn/modules/pooling.py:630 in public method `forward`:
D102: Missing docstring in public method
torch/nn/modules/pooling.py:706 in public method `__init__`:
D107: Missing docstring in __init__
torch/nn/modules/pooling.py:716 in public method `forward`:
D102: Missing docstring in public method
torch/nn/modules/pooling.py:720 in public method `__setstate__`:
D105: Missing docstring in magic method
torch/nn/modules/pooling.py:774 in public method `__init__`:
D107: Missing docstring in __init__
torch/nn/modules/pooling.py:792 in public method `forward`:
D102: Missing docstring in public method
torch/nn/modules/pooling.py:845 in public method `__init__`:
D107: Missing docstring in __init__
torch/nn/modules/pooling.py:863 in public method `forward`:
D102: Missing docstring in public method
torch/nn/modules/pooling.py:925 in public method `forward`:
D102: Missing docstring in public method
torch/nn/modules/pooling.py:979 in public method `forward`:
D102: Missing docstring in public method
torch/nn/modules/pooling.py:1026 in public method `forward`:
D102: Missing docstring in public method
torch/nn/modules/pooling.py:1068 in public method `forward`:
D102: Missing docstring in public method
torch/nn/modules/pooling.py:1111 in public method `forward`:
D102: Missing docstring in public method
torch/nn/modules/pooling.py:1150 in public method `forward`:
D102: Missing docstring in public method
torch/nn/modules/pooling.py:1189 in public method `forward`:
D102: Missing docstring in public method
torch/nn/modules/pooling.py:1228 in public method `forward`:
D102: Missing docstring in public method
```
pydocstyle torch/nn/modules/upsampling.py --count
before: 14
after: 7
**remaining:**
```
torch/nn/modules/upsampling.py:1 at module level:
D100: Missing docstring in public module
torch/nn/modules/upsampling.py:142 in public method `__init__`:
D107: Missing docstring in __init__
torch/nn/modules/upsampling.py:156 in public method `forward`:
D102: Missing docstring in public method
torch/nn/modules/upsampling.py:160 in public method `__setstate__`:
D105: Missing docstring in magic method
torch/nn/modules/upsampling.py:166 in public method `extra_repr`:
D102: Missing docstring in public method
torch/nn/modules/upsampling.py:216 in public method `__init__`:
D107: Missing docstring in __init__
torch/nn/modules/upsampling.py:263 in public method `__init__`:
D107: Missing docstring in __init__
```
pydocstyle torch/nn/modules/rnn.py --count
before: 47
after: 40
**remaining:**
```
torch/nn/modules/rnn.py:1 at module level:
D100: Missing docstring in public module
torch/nn/modules/rnn.py:59 in public method `__init__`:
D107: Missing docstring in __init__
torch/nn/modules/rnn.py:160 in public method `__setattr__`:
D105: Missing docstring in magic method
torch/nn/modules/rnn.py:225 in public method `reset_parameters`:
D102: Missing docstring in public method
torch/nn/modules/rnn.py:230 in public method `check_input`:
D102: Missing docstring in public method
torch/nn/modules/rnn.py:242 in public method `get_expected_hidden_size`:
D102: Missing docstring in public method
torch/nn/modules/rnn.py:256 in public method `check_hidden_size`:
D102: Missing docstring in public method
torch/nn/modules/rnn.py:272 in public method `check_forward_args`:
D102: Missing docstring in public method
torch/nn/modules/rnn.py:278 in public method `permute_hidden`:
D102: Missing docstring in public method
torch/nn/modules/rnn.py:284 in public method `extra_repr`:
D102: Missing docstring in public method
torch/nn/modules/rnn.py:305 in public method `__getstate__`:
D105: Missing docstring in magic method
torch/nn/modules/rnn.py:313 in public method `__setstate__`:
D105: Missing docstring in magic method
torch/nn/modules/rnn.py:355 in public method `all_weights`:
D102: Missing docstring in public method
torch/nn/modules/rnn.py:471 in public method `__init__`:
D107: Missing docstring in __init__
torch/nn/modules/rnn.py:478 in public method `__init__`:
D107: Missing docstring in __init__
torch/nn/modules/rnn.py:481 in public method `__init__`:
D107: Missing docstring in __init__
torch/nn/modules/rnn.py:503 in public method `forward` (skipping F811):
D102: Missing docstring in public method
torch/nn/modules/rnn.py:762 in public method `__init__`:
D107: Missing docstring in __init__
torch/nn/modules/rnn.py:768 in public method `__init__`:
D107: Missing docstring in __init__
torch/nn/modules/rnn.py:771 in public method `__init__`:
D107: Missing docstring in __init__
torch/nn/modules/rnn.py:774 in public method `get_expected_cell_size`:
D102: Missing docstring in public method
torch/nn/modules/rnn.py:786 in public method `check_forward_args`:
D102: Missing docstring in public method
torch/nn/modules/rnn.py:798 in public method `permute_hidden`:
D102: Missing docstring in public method
torch/nn/modules/rnn.py:809 in public method `forward` (skipping F811):
D102: Missing docstring in public method
torch/nn/modules/rnn.py:820 in public method `forward` (skipping F811):
D102: Missing docstring in public method
torch/nn/modules/rnn.py:1030 in public method `__init__`:
D107: Missing docstring in __init__
torch/nn/modules/rnn.py:1036 in public method `__init__`:
D107: Missing docstring in __init__
torch/nn/modules/rnn.py:1039 in public method `__init__`:
D107: Missing docstring in __init__
torch/nn/modules/rnn.py:1046 in public method `forward` (skipping F811):
D102: Missing docstring in public method
torch/nn/modules/rnn.py:1054 in public method `forward` (skipping F811):
D102: Missing docstring in public method
torch/nn/modules/rnn.py:1123 in public class `RNNCellBase`:
D101: Missing docstring in public class
torch/nn/modules/rnn.py:1134 in public method `__init__`:
D107: Missing docstring in __init__
torch/nn/modules/rnn.py:1152 in public method `extra_repr`:
D102: Missing docstring in public method
torch/nn/modules/rnn.py:1160 in public method `reset_parameters`:
D102: Missing docstring in public method
torch/nn/modules/rnn.py:1224 in public method `__init__`:
D107: Missing docstring in __init__
torch/nn/modules/rnn.py:1230 in public method `forward`:
D102: Missing docstring in public method
torch/nn/modules/rnn.py:1327 in public method `__init__`:
D107: Missing docstring in __init__
torch/nn/modules/rnn.py:1332 in public method `forward`:
D102: Missing docstring in public method
torch/nn/modules/rnn.py:1422 in public method `__init__`:
D107: Missing docstring in __init__
torch/nn/modules/rnn.py:1427 in public method `forward`:
D102: Missing docstring in public method
```
pydocstyle torch/nn/modules/pixelshuffle.py --count
before: 13
after: 8
**remaining:**
```
torch/nn/modules/pixelshuffle.py:1 at module level:
D100: Missing docstring in public module
torch/nn/modules/pixelshuffle.py:52 in public method `__init__`:
D107: Missing docstring in __init__
torch/nn/modules/pixelshuffle.py:56 in public method `forward`:
D102: Missing docstring in public method
torch/nn/modules/pixelshuffle.py:59 in public method `extra_repr`:
D102: Missing docstring in public method
torch/nn/modules/pixelshuffle.py:105 in public method `__init__`:
D107: Missing docstring in __init__
torch/nn/modules/pixelshuffle.py:109 in public method `forward`:
D102: Missing docstring in public method
torch/nn/modules/pixelshuffle.py:112 in public method `extra_repr`:
D102: Missing docstring in public method
```
pydocstyle torch/nn/modules/sparse.py --count
before: 14
after: 8
**remaining errors:**
```
torch/nn/modules/sparse.py:1 at module level:
D100: Missing docstring in public module
torch/nn/modules/sparse.py:124 in public method `__init__`:
D107: Missing docstring in __init__
torch/nn/modules/sparse.py:153 in public method `reset_parameters`:
D102: Missing docstring in public method
torch/nn/modules/sparse.py:162 in public method `forward`:
D102: Missing docstring in public method
torch/nn/modules/sparse.py:167 in public method `extra_repr`:
D102: Missing docstring in public method
torch/nn/modules/sparse.py:320 in public method `__init__`:
D107: Missing docstring in __init__
torch/nn/modules/sparse.py:350 in public method `reset_parameters`:
D102: Missing docstring in public method
torch/nn/modules/sparse.py:396 in public method `extra_repr`:
D102: Missing docstring in public method
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113177
Approved by: https://github.com/ezyang
I prefer not to modify the module if it does not have any of our APIs applied. The side effect of inserting a registry on the module when calling a getter is non-intuitive to me.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113654
Approved by: https://github.com/fegin
Summary: We implement `native_layer_norm`. Compared to [`layer_norm`](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html), the output of `native_layer_norm` is a tuple of tensors: (layer_norm, mean, 1/sqrt(var + eps)).
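For reference, a small CPU example of the op's contract (illustrative shapes):
```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 3, 8)
weight, bias = torch.ones(8), torch.zeros(8)

# native_layer_norm returns (normalized_output, mean, 1/sqrt(var + eps))
out, mean, rstd = torch.native_layer_norm(x, [8], weight, bias, 1e-5)

# The first element matches the composite layer_norm op
torch.testing.assert_close(out, F.layer_norm(x, [8], weight, bias, 1e-5))
```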
Test Plan:
```
[luwei@devbig984.prn1 /data/users/luwei/fbsource (2b2052666|remote/fbandroid/stable)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*native_layer_norm*"
Building: finished in 0.1 sec (100%) 339/339 jobs, 0/339 updated
Total time: 0.2 sec
BUILD SUCCEEDED
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *native_layer_norm*
[==========] Running 3 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 3 tests from VulkanAPITest
[ RUN ] VulkanAPITest.native_layer_norm_2d
[ OK ] VulkanAPITest.native_layer_norm_2d (352 ms)
[ RUN ] VulkanAPITest.native_layer_norm_3d
[ OK ] VulkanAPITest.native_layer_norm_3d (308 ms)
[ RUN ] VulkanAPITest.native_layer_norm_4d
[ OK ] VulkanAPITest.native_layer_norm_4d (6 ms)
[----------] 3 tests from VulkanAPITest (667 ms total)
[----------] Global test environment tear-down
[==========] 3 tests from 1 test suite ran. (667 ms total)
[ PASSED ] 3 tests.
```
full test result in P881016177
Reviewed By: yipjustin
Differential Revision: D51247030
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113573
Approved by: https://github.com/yipjustin
Applies PLW0108, which removes useless lambda calls in Python. The rule is in preview, so it is not ready to be enabled by default just yet. These are the autofixes from the rule.
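For illustration, the kind of code the rule rewrites:
```python
# Before: the lambda merely forwards its argument (flagged by PLW0108)
apply = lambda x: abs(x)

# After the autofix: use the callable directly
apply = abs
```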
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113602
Approved by: https://github.com/albanD
This PR removes the need for passing `ragged_size` into the `NestedTensor` constructor. This was an artifact of fake-ification, where sometimes we needed the NT to have a symbolic singleton symint shape for the ragged dimension. The new way of achieving this is to also store mappings between fake / functional tensors -> symbolic symints in the ragged structure registry. Now the `NestedTensor` constructor can just query this registry for the `ragged_size`.
Old: `NestedTensor(values, offsets, *, ragged_size=None, **kwargs)`
New: `NestedTensor(values, offsets, **kwargs)`
This makes it possible to have a `_nested_view_from_values_offsets(values, offsets)` without needing to pass a `ragged_size`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113491
Approved by: https://github.com/ezyang, https://github.com/soulitzer
This prepares the PR where we implement sets in terms of dicts.
To do so, rather than storing internally a dictionary that maps literals to VariableTrackers, it stores (pretty much) a dictionary from VTs to VTs. Keys are wrapped in an opaque internal class `_Hashable`. The `_Hashable` class is opaque on purpose so that it fails hard if it inadvertently leaks back into user code.
We also found and fixed a number of latent bugs and inconsistencies in the way dynamo checked what can be a dict key. More generally, we make much clearer what needs to be modified to add a new supported key type to dicts.
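A minimal sketch of the opaque-key idea (names and attributes here are illustrative, not dynamo's actual implementation):
```python
class _Hashable:
    """Opaque wrapper so raw keys never leak back into user code unnoticed."""

    def __init__(self, vt):
        self.vt = vt  # the VariableTracker used as a dict key

    def __hash__(self):
        return hash(self.vt.underlying_value)  # assumes the VT exposes a hashable value

    def __eq__(self, other):
        # Fail hard if an unwrapped object shows up where a key is expected.
        assert isinstance(other, _Hashable)
        return self.vt.underlying_value == other.vt.underlying_value
```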
Fixes https://github.com/pytorch/pytorch/issues/107595
Fixes https://github.com/pytorch/pytorch/issues/111603
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111196
Approved by: https://github.com/jansel
Summary: This commit significantly simplifies the QAT fusion
code for the `conv-bn` pattern by removing add and relu nodes
from the match and replacement patterns. This does not reduce
functionality; patterns like `conv-bn-relu`, `conv-bn-add`,
and `conv-bn-add-relu` are still supported. We simply do not
match these extra nodes, since there is actually no need to
replace them.
This has the additional benefit of reducing the number of
patterns being matched by 16x, since for each add and relu
variant of the `conv-bn` pattern there is also an in-place
variant. This also enables more flexible `conv-bn` pattern
matching in the future and keeps the number of patterns
more scalable.
One important change needed in this commit was to remove
the match filter that requires the input and output
activations to be quantized. This was necessary because
otherwise we would always expect q-dq nodes immediately
after the getitem node, instead of after the add or relu
nodes for example. This has another side benefit of
keeping QAT fusion flexible enough to support weight-only quantization.
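A rough sketch of what the slimmed-down match pattern looks like (illustrative; the real pattern is captured from traced code and handles more arguments):
```python
import torch.nn.functional as F

def conv_bn_pattern(x, conv_weight, conv_bias,
                    bn_weight, bn_bias, bn_running_mean, bn_running_var):
    # Only the conv-bn core is matched; any add/relu users of the output
    # are left in place, so one pattern covers conv-bn[-add][-relu].
    x = F.conv2d(x, conv_weight, conv_bias)
    x = F.batch_norm(x, bn_running_mean, bn_running_var, bn_weight, bn_bias,
                     training=True)
    return x
```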
Test Plan:
python test/test_quantization.py TestQuantizePT2EQAT
Reviewers: jerryzh168, kimishpatel
Subscribers: jerryzh168, kimishpatel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113006
Approved by: https://github.com/jerryzh168
Previous discussion: https://github.com/pytorch/pytorch/pull/109476
In this PR, I made following additions to the original PR:
1) The unlifted graph module now runs the runtime assertions in its forward call.
2) When we retrace, we run the assertions to make sure the user is tracing the module with inputs consistent with the assumptions we made during the first trace. The way I do this is to create a new graph module type with a modified call method. The runtime assertions run under torchdynamo.disable so that they execute eagerly; we don't want them to be a traced part of the graph (see the sketch after this list).
3) Both ep.module and capture_pre_autograd now return _UnliftedGraphModule.
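A minimal sketch of point 2 (class and method names are illustrative, not the actual implementation):
```python
import torch

class _UnliftedGraphModuleSketch(torch.nn.Module):
    def __init__(self, gm, check_inputs):
        super().__init__()
        self.gm = gm
        # Run the input checks eagerly, outside of any dynamo trace.
        self._check_inputs = torch._dynamo.disable(check_inputs)

    def forward(self, *args):
        self._check_inputs(*args)  # assert the assumptions from the first trace
        return self.gm(*args)
```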
Differential Revision: [D51078056](https://our.internmc.facebook.com/intern/diff/D51078056)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110222
Approved by: https://github.com/zhxchen17
As titled. Built on top of the work @wz337 enabled, this could save some runtime CPU time spent recreating DTensor parameters with the correct shape/stride, and avoid issues with unevenly sharded parameters.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113547
Approved by: https://github.com/XilunWu
ghstack dependencies: #113323, #113324
Apply a few optimizations to funcol:
- for allgather on a non-0 dim, the resulting tensor already needs to access the data in order to do torch.cat, so we sync-wait there; that way we don't need to go through ACT dispatch for chunk + cat altogether
- add a fast-return path for aten.view, as it's a commonly hit op for view-related ops
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113324
Approved by: https://github.com/XilunWu
ghstack dependencies: #113323
ATT, there are cases where multiple kernel invocations have the same kernel name, and key_averages() will wrongly average results across different invocations. This fix uses cuda_time_total / n_repeat instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113611
Approved by: https://github.com/chenyang78
Summary:
new version of this: https://www.internalfb.com/diff/D49110166?dst_version_fbid=252052334533986
Fix a device-assignment error when a module has multiple devices.
If fc_fp16_quantization is enabled for a CPU model, and the module REMOTE_OTHER has multiple devices: {device(type='meta'), device(type='cpu')},
we fail on this assertion:
fbcode/caffe2/torch/ao/quantization/fx/utils.py:232
assert len(devices) <= 1, (
Since CPU models run on CPU devices, a condition was added before the assertion: if CPU is among the module's devices, set the device to CPU.
Please see debug details:
https://docs.google.com/document/d/1pMPCeJyMPA15NhFc2uAyNDkS9azR40uaNyOP0DIgHjU/edit
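A rough sketch of the added guard (illustrative; the real code lives in `torch/ao/quantization/fx/utils.py`):
```python
from typing import Optional, Set

import torch

def assign_device(devices: Set[torch.device]) -> Optional[torch.device]:
    # Prefer CPU when it is among the module's devices instead of
    # asserting that only a single device is present.
    if torch.device("cpu") in devices:
        return torch.device("cpu")
    assert len(devices) <= 1
    return next(iter(devices)) if devices else None
```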
Test Plan:
AIMP_DISAGG_CPU=true buck run mode/opt -c python.package_style=inplace -c fbcode.enable_gpu_sections=true lego/scripts:lego_cli -- run-locally --model_entity_id 959168967 --config_version 28 --publish_context OFFLINE_PUBLISH --lego_pipeline aiplatform.modelstore.model_generation.lego.lego_pipeline_builder.gmpp_lego_pipeline --gmpp_config '{"gmpp_pipeline_descriptor": "aiplatform.modelstore.model_generation.v1.ads_pipelines.aimp_pyper_pipeline.model_generation_pipeline", "worker_process_number":12, "worker_thread_per_process_number": 6, "use_work_assignment": true}' 2>&1 | tee /tmp/gmpp_lc.txt
Snapshot:
https://www.internalfb.com/manifold/explorer/ads_storage_fblearner/tree/user/facebook/fblearner/predictor/959168967/47
Differential Revision: D51226114
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113509
Approved by: https://github.com/jerryzh168
The `@classproperty` decorator adds another wrapper, so a warning with the default stacklevel (2) would always point at the wrapper implementation rather than at the call site.
For example, before this change following code
```python
import torch
print(torch.FloatStorage.dtype)
```
will produce inactionable warning:
```
/Users/nshulga/git/pytorch/pytorch/torch/_utils.py:836: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.__get__(instance, owner)()
```
But after the change warning turns into:
```
/Users/nshulga/test/bar.py:2: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
print(torch.FloatStorage.dtype)
```
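The underlying mechanics, for reference (the count of 3 is illustrative; it must match the number of wrapper frames between `warnings.warn` and the user's call):
```python
import warnings

def _deprecation_warning():
    # stacklevel=2 would attribute the warning to the @classproperty
    # wrapper frame; bumping it one more level points at the user's code.
    warnings.warn("TypedStorage is deprecated.", UserWarning, stacklevel=3)
```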
Discovered while reading https://github.com/pytorch/pytorch/issues/109108
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113601
Approved by: https://github.com/kit1980
They are used in many contexts that don't actually check whether the returned value is `None`. I have also created `try_get()` for the cases where we do actually want an Optional type returned.
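An illustrative shape of the two accessors (a hypothetical container, not the actual signatures):
```python
from typing import Dict, Optional

class Registry:
    """Hypothetical container illustrating the get/try_get split."""

    def __init__(self) -> None:
        self._items: Dict[str, object] = {}

    def get(self, key: str) -> object:
        value = self._items.get(key)
        assert value is not None, f"{key} not found"
        return value  # callers may rely on a non-None result

    def try_get(self, key: str) -> Optional[object]:
        return self._items.get(key)  # callers must handle None explicitly
```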
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113535
Approved by: https://github.com/ezyang
ghstack dependencies: #113412
As in the title.
This PR is a follow-up to PR https://github.com/pytorch/pytorch/pull/112737 to address bfloat16 and float32 dtype cases. The performance increase is as follows (`NVIDIA A100-SXM4-80GB`):
- bsr_scatter_mm and bfloat16
  - for blocksize 16x16, the average/maximum speedup is about 29/75%.
  - for blocksize 32x32, the average/maximum speedup is about 23/58%.
  - for blocksize 64x64, the average/maximum speedup is about 27/66%.
  - for blocksize 128x128, the average/maximum speedup is about 33/72%.
- bsr_dense_mm and bfloat16
  - for blocksize 16x16, the average/maximum speedup is about 47/61%.
  - for blocksize 32x32, the average/maximum speedup is about 29/43%.
  - for blocksize 64x64, the average/maximum speedup is about 21/41%.
  - for blocksize 128x128, the average/maximum speedup is about 12/29%.
- bsr_dense_mm and float32
  - for blocksize 16x16, the average/maximum speedup is about 35/49%.
  - for blocksize 32x32, the average/maximum speedup is about 2/5%.
  - for blocksize 64x64, the average/maximum speedup is about 2/21%.
  - for blocksize 128x128, the average/maximum speedup is about 79/84%.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113553
Approved by: https://github.com/cpuhrsch
While running the accuracy minifier, I was getting the error:
```
NotImplementedError("xor_sum only implemented with inductor")
```
The logs showed that the cache limit was exceeded and it was falling back to eager mode, which doesn't work for this function. The cache failures were due to the code guarding on the id of the function being compiled, which in this case is a closure that gets re-created on each call, so the guard always fails.
This fixes the issue by making the storage hash kernel a global function and working around the dynamo dependency with the `lazy_compile` helper, which defers the `torch.compile` call to the first invocation.
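A minimal sketch of what such a lazy-compile helper might look like (illustrative, not the exact helper in the tree):
```python
import functools

import torch

def lazy_compile(fn):
    """Defer torch.compile until the wrapped function is first called."""
    compiled = None

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        nonlocal compiled
        if compiled is None:
            compiled = torch.compile(fn)  # happens once, on first invocation
        return compiled(*args, **kwargs)

    return wrapper
```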
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113533
Approved by: https://github.com/Skylion007
When SymNode was refactored into its own module, this broke logging for this file, as the `dynamic` alias no longer covered it. This PR adds supports for an alias to point to multiple qualified module names. To drive the refactor, I renamed `log_alias_to_log_qname` to `log_alias_to_log_qnames` and then audited all use sites. I invite you to do so as well.
For good measure, I also add dynamic to dynamo, so that I always get dynamic logs when dynamo is enabled. Empirically this will be helpful because people keep sending me dynamo debug logs that don't have dynamic logs.
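Conceptually, the mapping goes from one qname per alias to a list (the module names below are assumptions based on the SymNode split):
```python
# before: each alias mapped to a single qualified module name
log_alias_to_log_qname = {"dynamic": "torch.fx.experimental.symbolic_shapes"}

# after: an alias may fan out to several qualified module names
log_alias_to_log_qnames = {
    "dynamic": [
        "torch.fx.experimental.symbolic_shapes",
        "torch.fx.experimental.sym_node",  # assumed new home of SymNode
    ],
}
```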
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113567
Approved by: https://github.com/Skylion007, https://github.com/lezcano, https://github.com/mlazos
ghstack dependencies: #113566
Previously, the way our logging system worked was that for every registered log, we would explicit set a log level for it. This would lead to unintuitive behavior when you had multiple overlapping loggers, e.g., from the module hierarchy. Specifically, if you had `TORCH_LOGS=torch`, this would not actually set the logging level for torch._dynamo to be INFO, because the default log level is WARNING, and because torch._dynamo has a registered logger 'dynamo' this would end up setting the level on torch._dynamo to be WARNING, thereby overriding the level of the parent module torch. The 'all' logger did not display this behavior, but only because it was special cased to directly modify the default log level of all other loggers (and so this would not work for any sub-hierarchies).
This PR refactors our code into a much more logical setup using NOTSET. Instead of setting the level of all loggers to some level, we instead default all loggers to NOTSET, unless a user explicitly requested logging from some logger. This means that if we have some logger which isn't explicitly mentioned by the user, parent loggers now have a chance to influence their log behavior. With this, I can eliminate the 'all' special case; 'all' really just means 'torch'. (I keep special handling directing all to torch for backwards compatibility, though arguably I can probably just turn all into an alias.)
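The stdlib behavior this leans on, for reference:
```python
import logging

logging.getLogger("torch").setLevel(logging.INFO)

# A child logger left at NOTSET defers to its nearest configured ancestor,
# so its effective level now comes from "torch":
child = logging.getLogger("torch._dynamo")
assert child.getEffectiveLevel() == logging.INFO
```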
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113566
Approved by: https://github.com/mlazos, https://github.com/Skylion007
torch.equal/is_same_size currently skip sharding prop and directly do local tensor compute, which is wrong. For these two ops:
- torch.equal: should not skip sharding prop; the two DTensors need to have the SAME sharding before comparing local shard values
- torch.is_same_size: needs to completely skip both sharding prop and local compute
This PR refactors the existing op_dispatch into a class instance so that we can do custom op handling, then fixes both torch.equal and torch.is_same_size.
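A rough sketch of the torch.equal handling (illustrative; the real handler lives in the new dispatch class, and a full implementation would also reduce the per-rank results across the mesh):
```python
import torch

def dtensor_equal(lhs, rhs) -> bool:
    # Both operands must have the SAME sharding before local shards
    # can be compared element-wise.
    if lhs.placements != rhs.placements:
        rhs = rhs.redistribute(lhs.device_mesh, lhs.placements)
    return torch.equal(lhs.to_local(), rhs.to_local())
```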
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112927
Approved by: https://github.com/fduwjj, https://github.com/XilunWu
Fixes#112597
### Output:
**BEFORE:**
```functional.py:1 at module level:
D400: First line should end with a period (not 'e')
functional.py:438 in public function `fractional_max_pool2d_with_indices`:
D400: First line should end with a period (not ')')
functional.py:537 in public function `fractional_max_pool3d_with_indices`:
D400: First line should end with a period (not ')')
functional.py:646 in public function `max_pool1d_with_indices`:
D400: First line should end with a period (not ')')
functional.py:732 in public function `max_pool2d_with_indices`:
D400: First line should end with a period (not ')')
functional.py:818 in public function `max_pool3d_with_indices`:
D400: First line should end with a period (not ')')
functional.py:932 in public function `max_unpool1d`:
D401: First line should be in imperative mood (perhaps 'Compute', not 'Computes')
functional.py:968 in public function `max_unpool2d`:
D401: First line should be in imperative mood (perhaps 'Compute', not 'Computes')
functional.py:1000 in public function `max_unpool3d`:
D401: First line should be in imperative mood (perhaps 'Compute', not 'Computes')
functional.py:1031 in public function `lp_pool2d`:
D205: 1 blank line required between summary line and description (found 0)
functional.py:1031 in public function `lp_pool2d`:
D400: First line should end with a period (not 'f')
functional.py:1031 in public function `lp_pool2d`:
D401: First line should be in imperative mood (perhaps 'Apply', not 'Applies')
functional.py:1056 in public function `lp_pool1d`:
D205: 1 blank line required between summary line and description (found 0)
functional.py:1056 in public function `lp_pool1d`:
D400: First line should end with a period (not 'f')
functional.py:1056 in public function `lp_pool1d`:
D401: First line should be in imperative mood (perhaps 'Apply', not 'Applies')
functional.py:1077 in public function `adaptive_max_pool1d_with_indices`:
D400: First line should end with a period (not ')')
functional.py:1119 in public function `adaptive_max_pool2d_with_indices`:
D400: First line should end with a period (not ')')
functional.py:1163 in public function `adaptive_max_pool3d_with_indices`:
D400: First line should end with a period (not ')')
functional.py:1220 in public function `adaptive_avg_pool2d`:
D205: 1 blank line required between summary line and description (found 0)
functional.py:1220 in public function `adaptive_avg_pool2d`:
D400: First line should end with a period (not 'f')
functional.py:1220 in public function `adaptive_avg_pool2d`:
D401: First line should be in imperative mood (perhaps 'Apply', not 'Applies')
functional.py:1237 in public function `adaptive_avg_pool3d`:
D205: 1 blank line required between summary line and description (found 0)
functional.py:1237 in public function `adaptive_avg_pool3d`:
D400: First line should end with a period (not 'f')
functional.py:1237 in public function `adaptive_avg_pool3d`:
D401: First line should be in imperative mood (perhaps 'Apply', not 'Applies')
functional.py:1255 in public function `dropout`:
D205: 1 blank line required between summary line and description (found 0)
functional.py:1255 in public function `dropout`:
D400: First line should end with a period (not 't')
functional.py:1275 in public function `alpha_dropout`:
D401: First line should be in imperative mood (perhaps 'Apply', not 'Applies')
functional.py:1287 in public function `dropout1d`:
D205: 1 blank line required between summary line and description (found 0)
functional.py:1287 in public function `dropout1d`:
D400: First line should end with a period (not ',')
functional.py:1325 in public function `dropout2d`:
D205: 1 blank line required between summary line and description (found 0)
functional.py:1325 in public function `dropout2d`:
D400: First line should end with a period (not ',')
functional.py:1369 in public function `dropout3d`:
D205: 1 blank line required between summary line and description (found 0)
functional.py:1369 in public function `dropout3d`:
D400: First line should end with a period (not ',')
functional.py:1408 in public function `feature_alpha_dropout`:
D205: 1 blank line required between summary line and description (found 0)
functional.py:1408 in public function `feature_alpha_dropout`:
D400: First line should end with a period (not ',')
functional.py:1466 in public function `relu`:
D400: First line should end with a period (not 'r')
functional.py:1466 in public function `relu`:
D402: First line should not be the function's "signature"
functional.py:1491 in public function `glu`:
D400: First line should end with a period (not 'r')
functional.py:1491 in public function `glu`:
D402: First line should not be the function's "signature"
functional.py:1516 in public function `hardtanh`:
D400: First line should end with a period (not 'r')
functional.py:1516 in public function `hardtanh`:
D402: First line should not be the function's "signature"
functional.py:1542 in public function `relu6`:
D400: First line should end with a period (not 'r')
functional.py:1542 in public function `relu6`:
D402: First line should not be the function's "signature"
functional.py:1558 in public function `elu`:
D401: First line should be in imperative mood (perhaps 'Apply', not 'Applies')
functional.py:1582 in public function `selu`:
D400: First line should end with a period (not 'r')
functional.py:1582 in public function `selu`:
D402: First line should not be the function's "signature"
functional.py:1611 in public function `celu`:
D400: First line should end with a period (not 'r')
functional.py:1611 in public function `celu`:
D402: First line should not be the function's "signature"
functional.py:1638 in public function `leaky_relu`:
D400: First line should end with a period (not 'r')
functional.py:1638 in public function `leaky_relu`:
D402: First line should not be the function's "signature"
functional.py:1688 in public function `rrelu`:
D400: First line should end with a period (not 'r')
functional.py:1688 in public function `rrelu`:
D402: First line should not be the function's "signature"
functional.py:1755 in public function `tanhshrink`:
D400: First line should end with a period (not 'r')
functional.py:1755 in public function `tanhshrink`:
D402: First line should not be the function's "signature"
functional.py:1767 in public function `softsign`:
D400: First line should end with a period (not 'r')
functional.py:1767 in public function `softsign`:
D402: First line should not be the function's "signature"
functional.py:1806 in public function `softmin`:
D401: First line should be in imperative mood (perhaps 'Apply', not 'Applies')
functional.py:1832 in public function `softmax`:
D401: First line should be in imperative mood (perhaps 'Apply', not 'Applies')
functional.py:1868 in public function `gumbel_softmax`:
D401: First line should be in imperative mood (perhaps 'Sample', not 'Samples')
functional.py:1930 in public function `log_softmax`:
D401: First line should be in imperative mood (perhaps 'Apply', not 'Applies')
functional.py:1969 in public function `tanh`:
D400: First line should end with a period (not 'r')
functional.py:1969 in public function `tanh`:
D402: First line should not be the function's "signature"
functional.py:1980 in public function `sigmoid`:
D400: First line should end with a period (not 'r')
functional.py:1980 in public function `sigmoid`:
D402: First line should not be the function's "signature"
functional.py:1990 in public function `hardsigmoid`:
D400: First line should end with a period (not 'n')
functional.py:1990 in public function `hardsigmoid`:
D401: First line should be in imperative mood (perhaps 'Apply', not 'Applies')
functional.py:2057 in public function `silu`:
D205: 1 blank line required between summary line and description (found 0)
functional.py:2057 in public function `silu`:
D401: First line should be in imperative mood (perhaps 'Apply', not 'Applies')
functional.py:2081 in public function `mish`:
D205: 1 blank line required between summary line and description (found 0)
functional.py:2081 in public function `mish`:
D401: First line should be in imperative mood (perhaps 'Apply', not 'Applies')
functional.py:2100 in public function `hardswish`:
D400: First line should end with a period (not ':')
functional.py:2100 in public function `hardswish`:
D401: First line should be in imperative mood (perhaps 'Apply', not 'Applies')
functional.py:2136 in public function `embedding`:
D202: No blank lines allowed after function docstring (found 1)
functional.py:2136 in public function `embedding`:
D401: First line should be in imperative mood; try rephrasing (found 'A')
functional.py:2254 in public function `embedding_bag`:
D205: 1 blank line required between summary line and description (found 0)
functional.py:2254 in public function `embedding_bag`:
D400: First line should end with a period (not 'e')
functional.py:2254 in public function `embedding_bag`:
D401: First line should be in imperative mood (perhaps 'Compute', not 'Computes')
functional.py:2462 in public function `batch_norm`:
D401: First line should be in imperative mood (perhaps 'Apply', not 'Applies')
functional.py:2507 in public function `instance_norm`:
D205: 1 blank line required between summary line and description (found 0)
functional.py:2507 in public function `instance_norm`:
D400: First line should end with a period (not 'a')
functional.py:2507 in public function `instance_norm`:
D401: First line should be in imperative mood (perhaps 'Apply', not 'Applies')
functional.py:2540 in public function `layer_norm`:
D401: First line should be in imperative mood (perhaps 'Apply', not 'Applies')
functional.py:2554 in public function `group_norm`:
D401: First line should be in imperative mood (perhaps 'Apply', not 'Applies')
functional.py:2567 in public function `local_response_norm`:
D205: 1 blank line required between summary line and description (found 0)
functional.py:2567 in public function `local_response_norm`:
D400: First line should end with a period (not 'f')
functional.py:2567 in public function `local_response_norm`:
D401: First line should be in imperative mood (perhaps 'Apply', not 'Applies')
functional.py:2611 in public function `ctc_loss`:
D401: First line should be in imperative mood; try rephrasing (found 'The')
functional.py:2679 in public function `nll_loss`:
D401: First line should be in imperative mood; try rephrasing (found 'The')
functional.py:2895 in public function `kl_div`:
D205: 1 blank line required between summary line and description (found 0)
functional.py:2895 in public function `kl_div`:
D400: First line should end with a period (not 's')
functional.py:2895 in public function `kl_div`:
D401: First line should be in imperative mood; try rephrasing (found 'The')
functional.py:2978 in public function `cross_entropy`:
D401: First line should be in imperative mood; try rephrasing (found 'This')
functional.py:3069 in public function `binary_cross_entropy`:
D205: 1 blank line required between summary line and description (found 0)
functional.py:3069 in public function `binary_cross_entropy`:
D400: First line should end with a period (not 't')
functional.py:3069 in public function `binary_cross_entropy`:
D401: First line should be in imperative mood; try rephrasing (found 'Function')
functional.py:3139 in public function `binary_cross_entropy_with_logits`:
D205: 1 blank line required between summary line and description (found 0)
functional.py:3139 in public function `binary_cross_entropy_with_logits`:
D400: First line should end with a period (not 't')
functional.py:3139 in public function `binary_cross_entropy_with_logits`:
D401: First line should be in imperative mood; try rephrasing (found 'Function')
functional.py:3211 in public function `smooth_l1_loss`:
D205: 1 blank line required between summary line and description (found 0)
functional.py:3211 in public function `smooth_l1_loss`:
D400: First line should end with a period (not 'e')
functional.py:3211 in public function `smooth_l1_loss`:
D401: First line should be in imperative mood; try rephrasing (found 'Function')
functional.py:3251 in public function `huber_loss`:
D205: 1 blank line required between summary line and description (found 0)
functional.py:3251 in public function `huber_loss`:
D400: First line should end with a period (not 'e')
functional.py:3251 in public function `huber_loss`:
D401: First line should be in imperative mood; try rephrasing (found 'Function')
functional.py:3282 in public function `l1_loss`:
D400: First line should end with a period (not 'r')
functional.py:3282 in public function `l1_loss`:
D402: First line should not be the function's "signature"
functional.py:3313 in public function `mse_loss`:
D400: First line should end with a period (not 'r')
functional.py:3313 in public function `mse_loss`:
D402: First line should not be the function's "signature"
functional.py:3346 in public function `margin_ranking_loss`:
D400: First line should end with a period (not 'r')
functional.py:3346 in public function `margin_ranking_loss`:
D402: First line should not be the function's "signature"
functional.py:3382 in public function `hinge_embedding_loss`:
D400: First line should end with a period (not 'r')
functional.py:3382 in public function `hinge_embedding_loss`:
D402: First line should not be the function's "signature"
functional.py:3411 in public function `multilabel_margin_loss`:
D400: First line should end with a period (not 'r')
functional.py:3411 in public function `multilabel_margin_loss`:
D402: First line should not be the function's "signature"
functional.py:3439 in public function `soft_margin_loss`:
D400: First line should end with a period (not 'r')
functional.py:3439 in public function `soft_margin_loss`:
D402: First line should not be the function's "signature"
functional.py:3462 in public function `multilabel_soft_margin_loss`:
D400: First line should end with a period (not 'r')
functional.py:3462 in public function `multilabel_soft_margin_loss`:
D402: First line should not be the function's "signature"
functional.py:3510 in public function `cosine_embedding_loss`:
D400: First line should end with a period (not 'r')
functional.py:3510 in public function `cosine_embedding_loss`:
D402: First line should not be the function's "signature"
functional.py:3543 in public function `multi_margin_loss`:
D400: First line should end with a period (not 'r')
functional.py:3543 in public function `multi_margin_loss`:
D402: First line should not be the function's "signature"
functional.py:3708 in public function `upsample` (skipping F811,B950):
D103: Missing docstring in public function
functional.py:3713 in public function `upsample` (skipping F811,B950):
D103: Missing docstring in public function
functional.py:3718 in public function `upsample` (skipping F811):
D205: 1 blank line required between summary line and description (found 0)
functional.py:3718 in public function `upsample` (skipping F811):
D400: First line should end with a period (not 'n')
functional.py:3783 in private function `_is_integer`:
D205: 1 blank line required between summary line and description (found 0)
functional.py:3794 in public function `interpolate` (skipping F811,B950):
D103: Missing docstring in public function
functional.py:3799 in public function `interpolate` (skipping F811,B950):
D103: Missing docstring in public function
functional.py:3804 in public function `interpolate` (skipping F811,B950):
D103: Missing docstring in public function
functional.py:3809 in public function `interpolate` (skipping F811):
D103: Missing docstring in public function
functional.py:3821 in public function `interpolate` (skipping F811,B950):
D205: 1 blank line required between summary line and description (found 0)
functional.py:3821 in public function `interpolate` (skipping F811,B950):
D400: First line should end with a period (not 'n')
functional.py:4062 in public function `upsample_nearest` (skipping F811):
D103: Missing docstring in public function
functional.py:4067 in public function `upsample_nearest` (skipping F811):
D103: Missing docstring in public function
functional.py:4100 in public function `upsample_bilinear` (skipping F811):
D103: Missing docstring in public function
functional.py:4107 in public function `upsample_bilinear` (skipping F811):
D103: Missing docstring in public function
functional.py:4114 in public function `upsample_bilinear` (skipping F811):
D103: Missing docstring in public function
functional.py:4121 in public function `upsample_bilinear` (skipping F811):
D103: Missing docstring in public function
functional.py:4174 in public function `grid_sample`:
D205: 1 blank line required between summary line and description (found 0)
functional.py:4174 in public function `grid_sample`:
D400: First line should end with a period (not 'e')
functional.py:4315 in public function `affine_grid`:
D205: 1 blank line required between summary line and description (found 0)
functional.py:4315 in public function `affine_grid`:
D400: First line should end with a period (not 'f')
functional.py:4315 in public function `affine_grid`:
D401: First line should be in imperative mood (perhaps 'Generate', not 'Generates')
functional.py:4608 in public function `triplet_margin_loss`:
D200: One-line docstring should fit on one line with quotes (found 3)
functional.py:4608 in public function `triplet_margin_loss`:
D400: First line should end with a period (not 's')
functional.py:4643 in public function `triplet_margin_with_distance_loss`:
D200: One-line docstring should fit on one line with quotes (found 3)
functional.py:4705 in public function `normalize`:
D401: First line should be in imperative mood (perhaps 'Perform', not 'Performs')
functional.py:4733 in public function `assert_int_or_pair`:
D103: Missing docstring in public function
functional.py:4743 in public function `unfold`:
D401: First line should be in imperative mood (perhaps 'Extract', not 'Extracts')
functional.py:4773 in public function `fold`:
D205: 1 blank line required between summary line and description (found 0)
functional.py:4773 in public function `fold`:
D400: First line should end with a period (not 'g')
functional.py:4773 in public function `fold`:
D401: First line should be in imperative mood (perhaps 'Combine', not 'Combines')
functional.py:4800 in private function `_in_projection_packed`:
D205: 1 blank line required between summary line and description (found 0)
functional.py:4800 in private function `_in_projection_packed`:
D401: First line should be in imperative mood (perhaps 'Perform', not 'Performs')
functional.py:4867 in private function `_in_projection`:
D205: 1 blank line required between summary line and description (found 0)
functional.py:4867 in private function `_in_projection`:
D400: First line should end with a period (not 'y')
functional.py:4867 in private function `_in_projection`:
D401: First line should be in imperative mood (perhaps 'Perform', not 'Performs')
functional.py:5128 in public function `multi_head_attention_forward`:
D205: 1 blank line required between summary line and description (found 0)
functional.py:5128 in public function `multi_head_attention_forward`:
D400: First line should end with a period (not ':')
160
```
**AFTER:**
```
functional.py:3709 in public function `upsample` (skipping F811,B950):
D103: Missing docstring in public function
functional.py:3714 in public function `upsample` (skipping F811,B950):
D103: Missing docstring in public function
functional.py:3798 in public function `interpolate` (skipping F811,B950):
D103: Missing docstring in public function
functional.py:3803 in public function `interpolate` (skipping F811,B950):
D103: Missing docstring in public function
functional.py:3808 in public function `interpolate` (skipping F811,B950):
D103: Missing docstring in public function
functional.py:3813 in public function `interpolate` (skipping F811):
D103: Missing docstring in public function
functional.py:4068 in public function `upsample_nearest` (skipping F811):
D103: Missing docstring in public function
functional.py:4073 in public function `upsample_nearest` (skipping F811):
D103: Missing docstring in public function
functional.py:4106 in public function `upsample_bilinear` (skipping F811):
D103: Missing docstring in public function
functional.py:4113 in public function `upsample_bilinear` (skipping F811):
D103: Missing docstring in public function
functional.py:4120 in public function `upsample_bilinear` (skipping F811):
D103: Missing docstring in public function
functional.py:4127 in public function `upsample_bilinear` (skipping F811):
D103: Missing docstring in public function
functional.py:4742 in public function `assert_int_or_pair`:
D103: Missing docstring in public function
13
```
The file contained several docstring errors. I have fixed all of them (hopefully) and have tried to improve the overall readability of the code. For the most part, I have included a relevant description for each function (referring to the official PyTorch docs). In cases where a function is purely mathematical or it is difficult to give a one-line description, I have just included references.
For testing, I relied on my local system and created a separate file. For the final edits, I directly changed the contents of the forked repo, as is already visible.
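To illustrate (with a hypothetical function, not one from this PR) what clearing the common D205/D400/D401 errors looks like:
```python
# Hypothetical example, not from the PR: a docstring that satisfies
# D205, D400, and D401.
def smooth(x):
    """Apply smoothing to the input.

    The summary line above is in the imperative mood (D401), ends with a
    period (D400), and is separated from this description by a blank
    line (D205).
    """
    return x
```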
Kindly review @svekars @subramen @kit1980
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112856
Approved by: https://github.com/kit1980
Fixes#112592
1) **File: torch/cuda/random.py**
```
Before:
/content/pytorch/torch/cuda/random.py:1 at module level:
D100: Missing docstring in public module
/content/pytorch/torch/cuda/random.py:21 in public function `get_rng_state`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
/content/pytorch/torch/cuda/random.py:43 in public function `get_rng_state_all`:
D202: No blank lines allowed after function docstring (found 1)
/content/pytorch/torch/cuda/random.py:43 in public function `get_rng_state_all`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
/content/pytorch/torch/cuda/random.py:54 in public function `set_rng_state`:
D401: First line should be in imperative mood (perhaps 'Set', not 'Sets')
/content/pytorch/torch/cuda/random.py:79 in public function `set_rng_state_all`:
D208: Docstring is over-indented
/content/pytorch/torch/cuda/random.py:79 in public function `set_rng_state_all`:
D209: Multi-line docstring closing quotes should be on a separate line
/content/pytorch/torch/cuda/random.py:79 in public function `set_rng_state_all`:
D401: First line should be in imperative mood (perhaps 'Set', not 'Sets')
/content/pytorch/torch/cuda/random.py:79 in public function `set_rng_state_all`:
D414: Section has no content ('Args')
/content/pytorch/torch/cuda/random.py:88 in public function `manual_seed`:
D205: 1 blank line required between summary line and description (found 0)
/content/pytorch/torch/cuda/random.py:88 in public function `manual_seed`:
D401: First line should be in imperative mood (perhaps 'Set', not 'Sets')
/content/pytorch/torch/cuda/random.py:110 in public function `manual_seed_all`:
D205: 1 blank line required between summary line and description (found 0)
/content/pytorch/torch/cuda/random.py:110 in public function `manual_seed_all`:
D401: First line should be in imperative mood (perhaps 'Set', not 'Sets')
/content/pytorch/torch/cuda/random.py:128 in public function `seed`:
D205: 1 blank line required between summary line and description (found 0)
/content/pytorch/torch/cuda/random.py:128 in public function `seed`:
D401: First line should be in imperative mood (perhaps 'Set', not 'Sets')
/content/pytorch/torch/cuda/random.py:146 in public function `seed_all`:
D205: 1 blank line required between summary line and description (found 0)
/content/pytorch/torch/cuda/random.py:146 in public function `seed_all`:
D401: First line should be in imperative mood (perhaps 'Set', not 'Sets')
/content/pytorch/torch/cuda/random.py:167 in public function `initial_seed`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
18
```
```
After:
/content/pytorch/torch/cuda/random.py:1 at module level:
D100: Missing docstring in public module
1
```
2) **File: torch/cuda/amp/autocast_mode.py**
```
Before:
/content/pytorch/torch/cuda/amp/autocast_mode.py:1 at module level:
D100: Missing docstring in public module
/content/pytorch/torch/cuda/amp/autocast_mode.py:18 in public class `autocast`:
D205: 1 blank line required between summary line and description (found 0)
/content/pytorch/torch/cuda/amp/autocast_mode.py:23 in public method `__init__`:
D107: Missing docstring in __init__
/content/pytorch/torch/cuda/amp/autocast_mode.py:38 in public method `__enter__`:
D105: Missing docstring in magic method
/content/pytorch/torch/cuda/amp/autocast_mode.py:44 in public method `__exit__`:
D105: Missing docstring in magic method
/content/pytorch/torch/cuda/amp/autocast_mode.py:49 in public method `__call__`:
D102: Missing docstring in public method
/content/pytorch/torch/cuda/amp/autocast_mode.py:90 in public function `custom_fwd`:
D205: 1 blank line required between summary line and description (found 0)
/content/pytorch/torch/cuda/amp/autocast_mode.py:90 in public function `custom_fwd`:
D400: First line should end with a period (not 'f')
/content/pytorch/torch/cuda/amp/autocast_mode.py:90 in public function `custom_fwd`:
D401: First line should be in imperative mood; try rephrasing (found 'Helper')
/content/pytorch/torch/cuda/amp/autocast_mode.py:130 in public function `custom_bwd`:
D205: 1 blank line required between summary line and description (found 0)
/content/pytorch/torch/cuda/amp/autocast_mode.py:130 in public function `custom_bwd`:
D400: First line should end with a period (not 'f')
/content/pytorch/torch/cuda/amp/autocast_mode.py:130 in public function `custom_bwd`:
D401: First line should be in imperative mood; try rephrasing (found 'Helper')
12
```
```
After:
/content/pytorch/torch/cuda/amp/autocast_mode.py:1 at module level:
D100: Missing docstring in public module
/content/pytorch/torch/cuda/amp/autocast_mode.py:23 in public method `__init__`:
D107: Missing docstring in __init__
/content/pytorch/torch/cuda/amp/autocast_mode.py:38 in public method `__enter__`:
D105: Missing docstring in magic method
/content/pytorch/torch/cuda/amp/autocast_mode.py:44 in public method `__exit__`:
D105: Missing docstring in magic method
/content/pytorch/torch/cuda/amp/autocast_mode.py:49 in public method `__call__`:
D102: Missing docstring in public method
5
```
3) **File: torch/cuda/amp/grad_scaler.py**
```
Before:
/content/pytorch/torch/cuda/amp/grad_scaler.py:1 at module level:
D100: Missing docstring in public module
/content/pytorch/torch/cuda/amp/grad_scaler.py:17 in private class `_MultiDeviceReplicator`:
D200: One-line docstring should fit on one line with quotes (found 3)
/content/pytorch/torch/cuda/amp/grad_scaler.py:39 in public class `OptState`:
D101: Missing docstring in public class
/content/pytorch/torch/cuda/amp/grad_scaler.py:50 in public class `GradScaler`:
D205: 1 blank line required between summary line and description (found 0)
/content/pytorch/torch/cuda/amp/grad_scaler.py:50 in public class `GradScaler`:
D400: First line should end with a period (not 'g')
/content/pytorch/torch/cuda/amp/grad_scaler.py:115 in public method `__init__`:
D107: Missing docstring in __init__
/content/pytorch/torch/cuda/amp/grad_scaler.py:354 in public method `step`:
D400: First line should end with a period (not ':')
/content/pytorch/torch/cuda/amp/grad_scaler.py:456 in public method `update`:
D401: First line should be in imperative mood (perhaps 'Update', not 'Updates')
/content/pytorch/torch/cuda/amp/grad_scaler.py:529 in public method `get_scale`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
/content/pytorch/torch/cuda/amp/grad_scaler.py:544 in public method `get_growth_factor`:
D200: One-line docstring should fit on one line with quotes (found 3)
/content/pytorch/torch/cuda/amp/grad_scaler.py:544 in public method `get_growth_factor`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
/content/pytorch/torch/cuda/amp/grad_scaler.py:550 in public method `set_growth_factor`:
D205: 1 blank line required between summary line and description (found 0)
/content/pytorch/torch/cuda/amp/grad_scaler.py:550 in public method `set_growth_factor`:
D400: First line should end with a period (not ':')
/content/pytorch/torch/cuda/amp/grad_scaler.py:557 in public method `get_backoff_factor`:
D200: One-line docstring should fit on one line with quotes (found 3)
/content/pytorch/torch/cuda/amp/grad_scaler.py:557 in public method `get_backoff_factor`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
/content/pytorch/torch/cuda/amp/grad_scaler.py:563 in public method `set_backoff_factor`:
D205: 1 blank line required between summary line and description (found 0)
/content/pytorch/torch/cuda/amp/grad_scaler.py:563 in public method `set_backoff_factor`:
D400: First line should end with a period (not ':')
/content/pytorch/torch/cuda/amp/grad_scaler.py:570 in public method `get_growth_interval`:
D200: One-line docstring should fit on one line with quotes (found 3)
/content/pytorch/torch/cuda/amp/grad_scaler.py:570 in public method `get_growth_interval`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
/content/pytorch/torch/cuda/amp/grad_scaler.py:576 in public method `set_growth_interval`:
D205: 1 blank line required between summary line and description (found 0)
/content/pytorch/torch/cuda/amp/grad_scaler.py:576 in public method `set_growth_interval`:
D400: First line should end with a period (not ':')
/content/pytorch/torch/cuda/amp/grad_scaler.py:592 in public method `is_enabled`:
D200: One-line docstring should fit on one line with quotes (found 3)
/content/pytorch/torch/cuda/amp/grad_scaler.py:592 in public method `is_enabled`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
/content/pytorch/torch/cuda/amp/grad_scaler.py:598 in public method `state_dict`:
D400: First line should end with a period (not ':')
/content/pytorch/torch/cuda/amp/grad_scaler.py:598 in public method `state_dict`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
/content/pytorch/torch/cuda/amp/grad_scaler.py:624 in public method `load_state_dict`:
D401: First line should be in imperative mood (perhaps 'Load', not 'Loads')
/content/pytorch/torch/cuda/amp/grad_scaler.py:649 in public method `__getstate__`:
D105: Missing docstring in magic method
/content/pytorch/torch/cuda/amp/grad_scaler.py:665 in public method `__setstate__`:
D105: Missing docstring in magic method
28
```
```
After:
/content/pytorch/torch/cuda/amp/grad_scaler.py:1 at module level:
D100: Missing docstring in public module
/content/pytorch/torch/cuda/amp/grad_scaler.py:40 in public class `OptState`:
D101: Missing docstring in public class
/content/pytorch/torch/cuda/amp/grad_scaler.py:117 in public method `__init__`:
D107: Missing docstring in __init__
/content/pytorch/torch/cuda/amp/grad_scaler.py:647 in public method `__getstate__`:
D105: Missing docstring in magic method
/content/pytorch/torch/cuda/amp/grad_scaler.py:663 in public method `__setstate__`:
D105: Missing docstring in magic method
5
```
4) **File: torch/optim/_functional.py**
```
Before:
/content/pytorch/torch/optim/_functional.py:1 at module level:
D400: First line should end with a period (not 'e')
1
```
```
After:
0
```
5) **File: torch/optim/__init__.py**
```
Before:
/content/pytorch/torch/optim/__init__.py:1 at module level:
D205: 1 blank line required between summary line and description (found 0)
1
```
```
After:
0
```
6) **File: torch/optim/lbfgs.py**
```
Before:
/content/pytorch/torch/optim/lbfgs.py:1 at module level:
D100: Missing docstring in public module
/content/pytorch/torch/optim/lbfgs.py:185 in public class `LBFGS`:
D205: 1 blank line required between summary line and description (found 0)
/content/pytorch/torch/optim/lbfgs.py:185 in public class `LBFGS`:
D400: First line should end with a period (not 'c')
/content/pytorch/torch/optim/lbfgs.py:215 in public method `__init__`:
D107: Missing docstring in __init__
/content/pytorch/torch/optim/lbfgs.py:285 in public method `step`:
D401: First line should be in imperative mood (perhaps 'Perform', not 'Performs')
5
```
```
After:
/content/pytorch/torch/optim/lbfgs.py:1 at module level:
D100: Missing docstring in public module
/content/pytorch/torch/optim/lbfgs.py:217 in public method `__init__`:
D107: Missing docstring in __init__
2
```
7) **File: torch/optim/sparse_adam.py**
```
Before:
/content/pytorch/torch/optim/sparse_adam.py:1 at module level:
D100: Missing docstring in public module
/content/pytorch/torch/optim/sparse_adam.py:7 in public class `SparseAdam`:
D101: Missing docstring in public class
/content/pytorch/torch/optim/sparse_adam.py:8 in public method `__init__`:
D107: Missing docstring in __init__
/content/pytorch/torch/optim/sparse_adam.py:40 in public method `step`:
D401: First line should be in imperative mood (perhaps 'Perform', not 'Performs')
4
```
```
After:
/content/pytorch/torch/optim/sparse_adam.py:1 at module level:
D100: Missing docstring in public module
/content/pytorch/torch/optim/sparse_adam.py:7 in public class `SparseAdam`:
D101: Missing docstring in public class
/content/pytorch/torch/optim/sparse_adam.py:8 in public method `__init__`:
D107: Missing docstring in __init__
3
```
8) **File: torch/optim/adadelta.py**
```
Before:
/content/pytorch/torch/optim/adadelta.py:1 at module level:
D100: Missing docstring in public module
/content/pytorch/torch/optim/adadelta.py:11 in public class `Adadelta`:
D101: Missing docstring in public class
/content/pytorch/torch/optim/adadelta.py:12 in public method `__init__`:
D107: Missing docstring in __init__
/content/pytorch/torch/optim/adadelta.py:44 in public method `__setstate__`:
D105: Missing docstring in magic method
/content/pytorch/torch/optim/adadelta.py:82 in public method `step`:
D401: First line should be in imperative mood (perhaps 'Perform', not 'Performs')
/content/pytorch/torch/optim/adadelta.py:193 in public function `adadelta`:
D202: No blank lines allowed after function docstring (found 1)
6
```
```
After:
/content/pytorch/torch/optim/adadelta.py:1 at module level:
D100: Missing docstring in public module
/content/pytorch/torch/optim/adadelta.py:11 in public class `Adadelta`:
D101: Missing docstring in public class
/content/pytorch/torch/optim/adadelta.py:12 in public method `__init__`:
D107: Missing docstring in __init__
/content/pytorch/torch/optim/adadelta.py:44 in public method `__setstate__`:
D105: Missing docstring in magic method
4
```
9) **File: torch/optim/adagrad.py**
```
Before:
/content/pytorch/torch/optim/adagrad.py:1 at module level:
D100: Missing docstring in public module
/content/pytorch/torch/optim/adagrad.py:11 in public class `Adagrad`:
D101: Missing docstring in public class
/content/pytorch/torch/optim/adagrad.py:12 in public method `__init__`:
D107: Missing docstring in __init__
/content/pytorch/torch/optim/adagrad.py:63 in public method `__setstate__`:
D105: Missing docstring in magic method
/content/pytorch/torch/optim/adagrad.py:78 in public method `share_memory`:
D102: Missing docstring in public method
/content/pytorch/torch/optim/adagrad.py:100 in public method `step`:
D401: First line should be in imperative mood (perhaps 'Perform', not 'Performs')
/content/pytorch/torch/optim/adagrad.py:201 in public function `adagrad`:
D202: No blank lines allowed after function docstring (found 1)
7
```
```
After:
/content/pytorch/torch/optim/adagrad.py:1 at module level:
D100: Missing docstring in public module
/content/pytorch/torch/optim/adagrad.py:11 in public class `Adagrad`:
D101: Missing docstring in public class
/content/pytorch/torch/optim/adagrad.py:12 in public method `__init__`:
D107: Missing docstring in __init__
/content/pytorch/torch/optim/adagrad.py:63 in public method `__setstate__`:
D105: Missing docstring in magic method
/content/pytorch/torch/optim/adagrad.py:78 in public method `share_memory`:
D102: Missing docstring in public method
5
```
10) **File: torch/optim/adam.py**
```
Before:
/content/pytorch/torch/optim/adam.py:1 at module level:
D100: Missing docstring in public module
/content/pytorch/torch/optim/adam.py:14 in public class `Adam`:
D101: Missing docstring in public class
/content/pytorch/torch/optim/adam.py:15 in public method `__init__`:
D107: Missing docstring in __init__
/content/pytorch/torch/optim/adam.py:65 in public method `__setstate__`:
D105: Missing docstring in magic method
/content/pytorch/torch/optim/adam.py:135 in public method `step`:
D401: First line should be in imperative mood (perhaps 'Perform', not 'Performs')
/content/pytorch/torch/optim/adam.py:281 in public function `adam`:
D202: No blank lines allowed after function docstring (found 1)
/content/pytorch/torch/optim/adam.py:281 in public function `adam`:
D205: 1 blank line required between summary line and description (found 0)
7
```
```
After:
/content/pytorch/torch/optim/adam.py:1 at module level:
D100: Missing docstring in public module
/content/pytorch/torch/optim/adam.py:14 in public class `Adam`:
D101: Missing docstring in public class
/content/pytorch/torch/optim/adam.py:15 in public method `__init__`:
D107: Missing docstring in __init__
/content/pytorch/torch/optim/adam.py:65 in public method `__setstate__`:
D105: Missing docstring in magic method
4
```
11) **File: torch/optim/adamax.py**
```
Before:
/content/pytorch/torch/optim/adamax.py:1 at module level:
D100: Missing docstring in public module
/content/pytorch/torch/optim/adamax.py:12 in public class `Adamax`:
D101: Missing docstring in public class
/content/pytorch/torch/optim/adamax.py:13 in public method `__init__`:
D107: Missing docstring in __init__
/content/pytorch/torch/optim/adamax.py:47 in public method `__setstate__`:
D105: Missing docstring in magic method
/content/pytorch/torch/optim/adamax.py:91 in public method `step`:
D401: First line should be in imperative mood (perhaps 'Perform', not 'Performs')
/content/pytorch/torch/optim/adamax.py:203 in public function `adamax`:
D202: No blank lines allowed after function docstring (found 1)
6
```
```
After:
/content/pytorch/torch/optim/adamax.py:1 at module level:
D100: Missing docstring in public module
/content/pytorch/torch/optim/adamax.py:12 in public class `Adamax`:
D101: Missing docstring in public class
/content/pytorch/torch/optim/adamax.py:13 in public method `__init__`:
D107: Missing docstring in __init__
/content/pytorch/torch/optim/adamax.py:47 in public method `__setstate__`:
D105: Missing docstring in magic method
4
```
12) **File: torch/optim/adamw.py**
```
Before:
/content/pytorch/torch/optim/adamw.py:1 at module level:
D100: Missing docstring in public module
/content/pytorch/torch/optim/adamw.py:12 in public class `AdamW`:
D101: Missing docstring in public class
/content/pytorch/torch/optim/adamw.py:13 in public method `__init__`:
D107: Missing docstring in __init__
/content/pytorch/torch/optim/adamw.py:73 in public method `__setstate__`:
D105: Missing docstring in magic method
/content/pytorch/torch/optim/adamw.py:153 in public method `step`:
D401: First line should be in imperative mood (perhaps 'Perform', not 'Performs')
/content/pytorch/torch/optim/adamw.py:304 in public function `adamw`:
D202: No blank lines allowed after function docstring (found 1)
6
```
```
After:
/content/pytorch/torch/optim/adamw.py:1 at module level:
D100: Missing docstring in public module
/content/pytorch/torch/optim/adamw.py:12 in public class `AdamW`:
D101: Missing docstring in public class
/content/pytorch/torch/optim/adamw.py:13 in public method `__init__`:
D107: Missing docstring in __init__
/content/pytorch/torch/optim/adamw.py:73 in public method `__setstate__`:
D105: Missing docstring in magic method
4
```
13) **File: torch/optim/asgd.py**
```
Before:
/content/pytorch/torch/optim/asgd.py:1 at module level:
D100: Missing docstring in public module
/content/pytorch/torch/optim/asgd.py:17 in public class `ASGD`:
D101: Missing docstring in public class
/content/pytorch/torch/optim/asgd.py:18 in public method `__init__`:
D107: Missing docstring in __init__
/content/pytorch/torch/optim/asgd.py:52 in public method `__setstate__`:
D105: Missing docstring in magic method
/content/pytorch/torch/optim/asgd.py:107 in public method `step`:
D401: First line should be in imperative mood (perhaps 'Perform', not 'Performs')
/content/pytorch/torch/optim/asgd.py:195 in public function `asgd`:
D202: No blank lines allowed after function docstring (found 1)
6
```
```
After:
/content/pytorch/torch/optim/asgd.py:1 at module level:
D100: Missing docstring in public module
/content/pytorch/torch/optim/asgd.py:17 in public class `ASGD`:
D101: Missing docstring in public class
/content/pytorch/torch/optim/asgd.py:18 in public method `__init__`:
D107: Missing docstring in __init__
/content/pytorch/torch/optim/asgd.py:52 in public method `__setstate__`:
D105: Missing docstring in magic method
4
```
Resolved the docstring errors as listed. I initially made the changes on the main branch of my forked repo, which caused them to appear in my PR for another issue. I have fixed that and hope this PR won't have any conflicts.
Kindly review @svekars @jbschlosser.
In case of any other issues, please let me know. Thanks!
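For example, the repeated D401 errors on `step` were cleared by rewriting the summary line into the imperative mood, roughly like this (a sketch, not the exact diff):
```python
def step(self, closure=None):
    """Perform a single optimization step."""  # was: "Performs a single optimization step."
    ...
```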
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112964
Approved by: https://github.com/kit1980
Disable the `_int_mm` test for sm90 or later.
```
python test/test_linalg.py -k test__int_mm_k_32_n_32_use_transpose_a_False_use_transpose_b_False_cuda
_ TestLinalgCUDA.test__int_mm_k_32_n_32_use_transpose_a_False_use_transpose_b_False_cuda _
Traceback (most recent call last):
File "/usr/lib/python3.10/unittest/case.py", line 59, in testPartExecutor
yield
File "/usr/lib/python3.10/unittest/case.py", line 591, in run
self._callTestMethod(testMethod)
File "/usr/lib/python3.10/unittest/case.py", line 549, in _callTestMethod
method()
File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 2410, in wrapper
method(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 2410, in wrapper
method(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_device_type.py", line 428, in instantiated_test
raise rte
File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_device_type.py", line 415, in instantiated_test
result = test(self, **param_kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_device_type.py", line 1084, in only_fn
return fn(slf, *args, **kwargs)
File "/opt/pytorch/pytorch/test/test_linalg.py", line 5719, in test__int_mm
_test(17, k, n, use_transpose_a, use_transpose_b)
File "/opt/pytorch/pytorch/test/test_linalg.py", line 5680, in _test
c_int32 = torch._int_mm(a_int8, b_int8)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasLtMatmul with transpose_mat1 0 transpose_mat2 0 m 32 n 17 k 32 mat1_ld 32 mat2_ld 32 result_ld 32 abType 3 cType 10 computeType 72 scaleType 10
```
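A minimal sketch of gating the test on compute capability (hypothetical decorator name; the actual change may use a different mechanism):
```python
import unittest

import torch

def skip_on_sm90_or_later(fn):
    # cuBLASLt reports CUBLAS_STATUS_NOT_SUPPORTED for this int8 matmul
    # configuration on sm90+, so skip the test there.
    if torch.cuda.is_available():
        major, _minor = torch.cuda.get_device_capability()
        if major >= 9:
            return unittest.skip("_int_mm unsupported on sm90 or later")(fn)
    return fn
```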
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113327
Approved by: https://github.com/malfet
Fixes#112589
Fixed errors relating to pydocstyle in the following files. The remaining errors are related to docstrings at the module level and to methods within each module (see details below).
pydocstyle torch/cuda/_utils.py --count
before: 3
after: 0
pydocstyle torch/cuda/jiterator.py --count
before: 3
after: 1
**remaining errors:**
```
torch/cuda/jiterator.py:1 at module level:
D100: Missing docstring in public module
```
pydocstyle torch/cuda/graphs.py --count
before: 25
after: 7
**remaining errors:**
```
torch/cuda/graphs.py:1 at module level:
D100: Missing docstring in public module
torch/cuda/graphs.py:54 in public method `__new__`:
D102: Missing docstring in public method
torch/cuda/graphs.py:108 in public method `debug_dump`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/graphs.py:108 in public method `debug_dump`:
D400: First line should end with a period (not ':')
torch/cuda/graphs.py:150 in public method `__init__`:
D107: Missing docstring in __init__
torch/cuda/graphs.py:172 in public method `__enter__`:
D105: Missing docstring in magic method
torch/cuda/graphs.py:186 in public method `__exit__`:
D105: Missing docstring in magic method
```
pydocstyle torch/cuda/_sanitizer.py --count
before: 35
after: 31
**remaining errors:**
```
torch/cuda/_sanitizer.py:43 in public class `AccessType`:
D101: Missing docstring in public class
torch/cuda/_sanitizer.py:47 in public method `__str__`:
D105: Missing docstring in magic method
torch/cuda/_sanitizer.py:84 in public method `__init__`:
D107: Missing docstring in __init__
torch/cuda/_sanitizer.py:96 in public method `__str__`:
D105: Missing docstring in magic method
torch/cuda/_sanitizer.py:139 in public method `__init__`:
D107: Missing docstring in __init__
torch/cuda/_sanitizer.py:142 in public method `__str__`:
D105: Missing docstring in magic method
torch/cuda/_sanitizer.py:218 in public class `StreamSynchronizations`:
D101: Missing docstring in public class
torch/cuda/_sanitizer.py:219 in public method `__init__`:
D107: Missing docstring in __init__
torch/cuda/_sanitizer.py:256 in public method `create_stream`:
D102: Missing docstring in public method
torch/cuda/_sanitizer.py:268 in public method `create_event`:
D102: Missing docstring in public method
torch/cuda/_sanitizer.py:272 in public method `delete_event`:
D102: Missing docstring in public method
torch/cuda/_sanitizer.py:276 in public method `update_seq_num`:
D102: Missing docstring in public method
torch/cuda/_sanitizer.py:280 in public method `record_state`:
D102: Missing docstring in public method
torch/cuda/_sanitizer.py:291 in public method `stream_wait_for_event`:
D102: Missing docstring in public method
torch/cuda/_sanitizer.py:298 in public method `all_streams_wait_for_event`:
D102: Missing docstring in public method
torch/cuda/_sanitizer.py:307 in public method `all_streams_wait_for_stream`:
D102: Missing docstring in public method
torch/cuda/_sanitizer.py:316 in public method `sync_all_streams`:
D102: Missing docstring in public method
torch/cuda/_sanitizer.py:323 in public method `is_ordered_after`:
D102: Missing docstring in public method
torch/cuda/_sanitizer.py:339 in public method `__init__`:
D107: Missing docstring in __init__
torch/cuda/_sanitizer.py:460 in public function `zip_by_key`:
D103: Missing docstring in public function
torch/cuda/_sanitizer.py:466 in public function `zip_arguments`:
D103: Missing docstring in public function
torch/cuda/_sanitizer.py:478 in public class `ArgumentHandler`:
D101: Missing docstring in public class
torch/cuda/_sanitizer.py:479 in public method `__init__`:
D107: Missing docstring in __init__
torch/cuda/_sanitizer.py:505 in public method `parse_inputs`:
D102: Missing docstring in public method
torch/cuda/_sanitizer.py:520 in public method `parse_outputs`:
D102: Missing docstring in public method
torch/cuda/_sanitizer.py:527 in public class `CUDASanitizerDispatchMode`:
D101: Missing docstring in public class
torch/cuda/_sanitizer.py:528 in public method `__init__`:
D107: Missing docstring in __init__
torch/cuda/_sanitizer.py:562 in public method `__torch_dispatch__`:
D105: Missing docstring in magic method
torch/cuda/_sanitizer.py:597 in public method `__init__`:
D107: Missing docstring in __init__
torch/cuda/_sanitizer.py:601 in public method `enable`:
D102: Missing docstring in public method
torch/cuda/_sanitizer.py:605 in public method `__del__`:
D105: Missing docstring in magic method
```
pydocstyle torch/storage.py --count
before: 90
after: 37
**remaining errors:**
```
torch/storage.py:1 at module level:
D100: Missing docstring in public module
torch/storage.py:310 in public class `UntypedStorage`:
D101: Missing docstring in public class
torch/storage.py:311 in public method `__getitem__`:
D105: Missing docstring in magic method
torch/storage.py:317 in public method `is_cuda`:
D102: Missing docstring in public method
torch/storage.py:321 in public method `is_hpu`:
D102: Missing docstring in public method
torch/storage.py:325 in public method `share_memory_`:
D102: Missing docstring in public method
torch/storage.py:444 in public class `TypedStorage`:
D101: Missing docstring in public class
torch/storage.py:453 in public method `fill_`:
D102: Missing docstring in public method
torch/storage.py:458 in public method `__new__`:
D102: Missing docstring in public method
torch/storage.py:530 in public method `__init__`:
D107: Missing docstring in __init__
torch/storage.py:599 in public method `is_cuda`:
D102: Missing docstring in public method
torch/storage.py:604 in public method `is_hpu`:
D102: Missing docstring in public method
torch/storage.py:624 in public method `__len__`:
D105: Missing docstring in magic method
torch/storage.py:653 in public method `__setitem__`:
D105: Missing docstring in magic method
torch/storage.py:681 in public method `__getitem__`:
D105: Missing docstring in magic method
torch/storage.py:715 in public method `copy_`:
D102: Missing docstring in public method
torch/storage.py:723 in public method `nbytes`:
D102: Missing docstring in public method
torch/storage.py:731 in public method `type`:
D102: Missing docstring in public method
torch/storage.py:744 in public method `cuda`:
D102: Missing docstring in public method
torch/storage.py:751 in public method `hpu`:
D102: Missing docstring in public method
torch/storage.py:758 in public method `element_size`:
D102: Missing docstring in public method
torch/storage.py:766 in public method `get_device`:
D102: Missing docstring in public method
torch/storage.py:770 in public method `__str__`:
D105: Missing docstring in magic method
torch/storage.py:781 in public method `__repr__`:
D105: Missing docstring in magic method
torch/storage.py:785 in public method `__iter__`:
D105: Missing docstring in magic method
torch/storage.py:789 in public method `__copy__`:
D105: Missing docstring in magic method
torch/storage.py:793 in public method `__deepcopy__`:
D105: Missing docstring in magic method
torch/storage.py:801 in public method `__sizeof__`:
D105: Missing docstring in magic method
torch/storage.py:877 in public method `device`:
D102: Missing docstring in public method
torch/storage.py:881 in public method `size`:
D102: Missing docstring in public method
torch/storage.py:891 in public method `pickle_storage_type`:
D102: Missing docstring in public method
torch/storage.py:902 in public method `__reduce__`:
D105: Missing docstring in magic method
torch/storage.py:907 in public method `data_ptr`:
D102: Missing docstring in public method
torch/storage.py:915 in public method `resize_`:
D102: Missing docstring in public method
torch/storage.py:931 in public method `from_buffer`:
D102: Missing docstring in public method
torch/storage.py:1032 in public method `from_file`:
D402: First line should not be the function's "signature"
torch/storage.py:1075 in public method `is_shared`:
D102: Missing docstring in public method
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113227
Approved by: https://github.com/kit1980
In our first implementation of the SDPA shim API, we didn't consider
the case where the optional scale argument could be None. It went
unnoticed because we always got a default argument for the CUDA backend.
The issue was detected with the CPU backend.
This PR implements versioning for shim kernels. Currently, only the
SDPA API has multiple versions. We expect to maintain only a very
small number of ABI-compatible shim APIs that have multiple versions.
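As an illustration only (a Python model with hypothetical names, not the actual C shim interface), versioning amounts to keeping one entry point per ABI revision so existing callers keep their original signature:
```python
from typing import Callable, Dict, Optional

import torch.nn.functional as F

def _sdpa_v1(q, k, v):
    # v1 signature: no scale argument; callers always got the default scale.
    return _sdpa_v2(q, k, v, scale=None)

def _sdpa_v2(q, k, v, scale: Optional[float] = None):
    # v2 signature: scale is explicit and may be None.
    return F.scaled_dot_product_attention(q, k, v, scale=scale)

# Old binaries resolve version 1; new ones can pass the optional scale via
# version 2. Neither breaks when the other changes.
SDPA_BY_VERSION: Dict[int, Callable] = {1: _sdpa_v1, 2: _sdpa_v2}
```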
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113487
Approved by: https://github.com/int3, https://github.com/desertfire
This should be enough to get @voznesenskym's FSDP branch to plumb `set_()` through AOTAutograd properly and have everything properly no-op out. Main changes are:
(1) graph break on `aten::set_.source_Tensor_storage_offset` (we could support it, but it isn't needed, and it seems safer to graph break)
(2) Functionalization: add a "proper" functionalization kernel for `aten::set_.source_Tensor`. The previous one we had was codegen'd and it was wrong (it would just clone() and call set_(), which does not do the right thing). I also manually mark on the `FunctionalTensorWrapper` when a given tensor has been mutated by a `set_()` call.
(3) AOTAutograd: I added a new field, `InputAliasInfo.mutates_storage_metadata`, so we can distinguish between "regular" metadata mutations, and metadata mutations due to `set_()` calls. This is mainly because at runtime, one requires calling `as_strided_()` to fix up metadata, while the other requires calling `set_()`.
(4) Made AOTAutograd's detection for metadata mutations / set_() mutations smarter and detect no-ops (if the storage and metadata are all the same).
I also killed `was_updated()` and `was_metadata_updated()`, and replaced them with (existing) `has_data_mutation()` and (new) `has_data_mutation()`, which can more accurately distinguish between data mutations vs. `set_()` calls vs. metadata mutations.
**This PR is still silently incorrect in one case though**, which I'd like to discuss more. In particular, this example:
```
def f(x):
    x_view = x.view(-1)
    x.set_(torch.ones(2))
    x_view.mul_(2)
    return
```
If you have an input that experiences both a data-mutation **and** a `x_old.set_(x_new)` call, there are two cases:
(a) the data mutation happened on the storage of `x_new`. This case should be handled automatically: if x_new is a graph intermediate then we will functionalize the mutation. If x_new is a different graph input, then we will perform the usual `copy_()` on that other graph input.
(b) the data mutation happened on the storage of `x_old`. This is more of a pain to handle, and doesn't currently work. At runtime, the right thing to do is probably something like:
```
def functionalized_f(x):
    x_view = x.view(-1)
    # set_() desugars into a no-op; later usages of x will use x_output
    x_output = torch.ones(2)
    # functionalize the mutation on x_view
    x_view_updated = x.mul(2)
    x_updated = x_view_updated.view(x.shape)
    # x experienced TWO TYPES of mutations: a data mutation and a metadata mutation
    # We need to return both updated tensors in our graph
    return x_updated, x_output

def runtime_wrapper(x):
    x_data_mutation_result, x_set_mutation_result = compiled_graph(x)
    # First, perform the data mutation on x's old storage
    x.copy_(x_data_mutation_result)
    # Then, swap out the storage of x with the new storage
    x.set_(x_set_mutation_result)
```
There are two things that make this difficult to do though:
(1) Functionalization: the functionalization rule for `set_()` will fully throw away the old `FunctionalStorageImpl` on the graph input. So if there are any mutations to that `FunctionalStorageImpl` later on in the graph, the current graph input won't know about it. Maybe we can have a given `FunctionalTensorWrapper` remember all previous storages that it had, and track mutations on all of them - although this feels pretty complicated.
(2) AOTAutograd now needs to know that we might have *two* graph outputs that correspond to a single "mutated input", which is annoying.
It's worth pointing out that this issue is probably extremely unlikely for anyone to run into - can we just detect it and error? This feels slightly easier than solving it, although not significantly easier. We would still need `FunctionalTensorWrapper` to keep track of mutations on any of its "previous" storages, so it can report this info back to AOTAutograd so we can raise an error.
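For intuition, here is a minimal eager-mode sketch (not AOTAutograd's actual machinery) of how a `set_()` storage swap can be told apart from a plain data mutation:
```python
import torch

def mutation_kind(fn, x):
    # A data mutation (e.g. mul_) writes into the existing storage, while
    # set_() swaps the storage pointer out entirely.
    before = x.untyped_storage().data_ptr()
    fn(x)
    after = x.untyped_storage().data_ptr()
    return "storage swap (set_)" if after != before else "data mutation"

print(mutation_kind(lambda t: t.mul_(2), torch.ones(2)))              # data mutation
print(mutation_kind(lambda t: t.set_(torch.ones(3)), torch.ones(2)))  # storage swap (set_)
```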
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111554
Approved by: https://github.com/ezyang
Fixes#112591
Fixed errors relating to pydocstyle in the following files. The remaining errors are related to docstrings at the module level and methods within each module (see details below).
pydocstyle torch/cuda/_memory_viz.py --count
before: 7
after: 4
**remaining errors:**
```
torch/cuda/_memory_viz.py:77 in public function `format_flamegraph`:
D103: Missing docstring in public function
torch/cuda/_memory_viz.py:121 in public function `segments`:
D103: Missing docstring in public function
torch/cuda/_memory_viz.py:128 in public function `memory`:
D103: Missing docstring in public function
torch/cuda/_memory_viz.py:135 in public function `compare`:
D103: Missing docstring in public function
```
pydocstyle torch/cuda/streams.py --count
before: 29
after: 8
**remaining errors:**
```
torch/cuda/streams.py:1 at module level:
D100: Missing docstring in public module
torch/cuda/streams.py:31 in public method `__new__`:
D102: Missing docstring in public method
torch/cuda/streams.py:105 in public method `__eq__`:
D105: Missing docstring in magic method
torch/cuda/streams.py:110 in public method `__hash__`:
D105: Missing docstring in magic method
torch/cuda/streams.py:113 in public method `__repr__`:
D105: Missing docstring in magic method
torch/cuda/streams.py:135 in public method `__new__`:
D102: Missing docstring in public method
torch/cuda/streams.py:163 in public method `__new__`:
D102: Missing docstring in public method
torch/cuda/streams.py:237 in public method `__repr__`:
D105: Missing docstring in magic method
```
pydocstyle torch/cuda/__init__.py --count
before: 100
after: 46
**remaining errors:**
```
torch/cuda/__init__.py:251 in public class `DeferredCudaCallError`:
D101: Missing docstring in public class
torch/cuda/__init__.py:327 in public function `cudart`:
D103: Missing docstring in public function
torch/cuda/__init__.py:332 in public class `cudaStatus`:
D101: Missing docstring in public class
torch/cuda/__init__.py:337 in public class `CudaError`:
D101: Missing docstring in public class
torch/cuda/__init__.py:338 in public method `__init__`:
D107: Missing docstring in __init__
torch/cuda/__init__.py:343 in public function `check_error`:
D103: Missing docstring in public function
torch/cuda/__init__.py:369 in public method `__init__`:
D107: Missing docstring in __init__
torch/cuda/__init__.py:373 in public method `__enter__`:
D105: Missing docstring in magic method
torch/cuda/__init__.py:376 in public method `__exit__`:
D105: Missing docstring in magic method
torch/cuda/__init__.py:391 in public method `__init__`:
D107: Missing docstring in __init__
torch/cuda/__init__.py:473 in public class `StreamContext`:
D204: 1 blank line required after class docstring (found 0)
torch/cuda/__init__.py:485 in public method `__init__`:
D107: Missing docstring in __init__
torch/cuda/__init__.py:499 in public method `__enter__`:
D105: Missing docstring in magic method
torch/cuda/__init__.py:514 in public method `__exit__`:
D105: Missing docstring in magic method
torch/cuda/__init__.py:541 in public function `set_stream`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/__init__.py:838 in public function `current_blas_handle`:
D400: First line should end with a period (not 'e')
torch/cuda/__init__.py:894 in public function `memory_usage`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/__init__.py:894 in public function `memory_usage`:
D400: First line should end with a period (not ')')
torch/cuda/__init__.py:913 in public function `utilization`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/__init__.py:913 in public function `utilization`:
D400: First line should end with a period (not 'r')
torch/cuda/__init__.py:949 in public function `power_draw`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/__init__.py:949 in public function `power_draw`:
D400: First line should end with a period (not ')')
torch/cuda/__init__.py:1089 in public class `ByteStorage`:
D101: Missing docstring in public class
torch/cuda/__init__.py:1091 in public method `dtype`:
D102: Missing docstring in public method
torch/cuda/__init__.py:1100 in public class `DoubleStorage`:
D101: Missing docstring in public class
torch/cuda/__init__.py:1102 in public method `dtype`:
D102: Missing docstring in public method
torch/cuda/__init__.py:1111 in public class `FloatStorage`:
D101: Missing docstring in public class
torch/cuda/__init__.py:1113 in public method `dtype`:
D102: Missing docstring in public method
torch/cuda/__init__.py:1122 in public class `HalfStorage`:
D101: Missing docstring in public class
torch/cuda/__init__.py:1124 in public method `dtype`:
D102: Missing docstring in public method
torch/cuda/__init__.py:1133 in public class `LongStorage`:
D101: Missing docstring in public class
torch/cuda/__init__.py:1135 in public method `dtype`:
D102: Missing docstring in public method
torch/cuda/__init__.py:1144 in public class `IntStorage`:
D101: Missing docstring in public class
torch/cuda/__init__.py:1146 in public method `dtype`:
D102: Missing docstring in public method
torch/cuda/__init__.py:1155 in public class `ShortStorage`:
D101: Missing docstring in public class
torch/cuda/__init__.py:1157 in public method `dtype`:
D102: Missing docstring in public method
torch/cuda/__init__.py:1166 in public class `CharStorage`:
D101: Missing docstring in public class
torch/cuda/__init__.py:1168 in public method `dtype`:
D102: Missing docstring in public method
torch/cuda/__init__.py:1177 in public class `BoolStorage`:
D101: Missing docstring in public class
torch/cuda/__init__.py:1179 in public method `dtype`:
D102: Missing docstring in public method
torch/cuda/__init__.py:1188 in public class `BFloat16Storage`:
D101: Missing docstring in public class
torch/cuda/__init__.py:1190 in public method `dtype`:
D102: Missing docstring in public method
torch/cuda/__init__.py:1199 in public class `ComplexDoubleStorage`:
D101: Missing docstring in public class
torch/cuda/__init__.py:1201 in public method `dtype`:
D102: Missing docstring in public method
torch/cuda/__init__.py:1210 in public class `ComplexFloatStorage`:
D101: Missing docstring in public class
torch/cuda/__init__.py:1212 in public method `dtype`:
D102: Missing docstring in public method
```
@mikaylagawarecki @albanD @svekars @jbschlosser
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113233
Approved by: https://github.com/malfet
In TorchVision we use the following (simplified) dispatch mechanism:
```python
import torch
def kernel1(tensor):
    return tensor + 2

def dispatcher1(input):
    kernel = get_kernel(dispatcher1, type(input))
    return kernel(input)

def kernel2(tensor):
    return tensor - 2

def dispatcher2(input):
    kernel = get_kernel(dispatcher2, type(input))
    return kernel(input)

# We actually use the function and type as keys, rather than their names.
# However, this is currently not supported, but it should be easy to add after
# https://github.com/pytorch/pytorch/pull/111196
REGISTRY = {
    "dispatcher1": {"Tensor": kernel1},
    "dispatcher2": {"Tensor": kernel2},
}

def get_kernel(dispatcher, input_type):
    dispatcher_registry = REGISTRY[dispatcher.__name__]
    for cls in input_type.__mro__:
        kernel = dispatcher_registry[cls.__name__]
        break
    return kernel
```
This can be compiled without graph breaks:
```python
cfn = torch.compile(dispatcher1, fullgraph=True)
torch.testing.assert_close(int(cfn(torch.tensor(3))), 5)
cfn = torch.compile(dispatcher2, fullgraph=True)
torch.testing.assert_close(int(cfn(torch.tensor(3))), 1)
```
However, if we start chaining these calls, we hit some issues:
```python
class Pipeline(torch.nn.Module):
    def forward(self, input):
        input = dispatcher1(input)
        input = dispatcher2(input)
        return input
cfn = torch.compile(Pipeline(), fullgraph=True)
torch.testing.assert_close(int(cfn(torch.tensor(3))), 3)
```
```
Can't access members of type(obj) for a generated custom object. Please use __class__ instead
```
The error message is not really helpful here. The following happens: when compiling `dispatcher1`, `get_kernel` gets inlined. That means when hitting `dispatcher2`, the `type` call no longer happens on an input with a source. Thus, in the first iteration we hit the top branch, while in the second we hit the bottom:
addb8e29cd/torch/_dynamo/variables/builtin.py (L1264-L1268)
And the error message I posted above originates from the type being treated as a constant. This PR replaces that with a `SourcelessBuilder` instead.
With that fix in place, we hit another error, pointing to `input_type.__mro__`:
```
AssertionError: Consider SourcelessBuilder for ephemeral objects, usually objects created locally.
```
The fix is similar: instead of using a `VariableBuilder` here, we use a `SourcelessBuilder` when we have no `source`:
addb8e29cd/torch/_dynamo/variables/builtin.py (L1167-L1168)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113340
Approved by: https://github.com/peterbell10, https://github.com/lezcano
We found a scenario where ``tensor.device().is_privateuseone()`` is used to determine whether a tensor is PrivateUse1, but it fails.
In the code of ``Autograd``, for example:
```
::std::tuple<at::Tensor,at::Tensor,at::Tensor> native_batch_norm(c10::DispatchKeySet ks, const at::Tensor & input, const c10::optional<at::Tensor> & weight, const c10::optional<at::Tensor> & bias, const c10::optional<at::Tensor> & running_mean, const c10::optional<at::Tensor> & running_var, bool training, double momentum, double eps) {
auto& input_ = unpack(input, "input", 0);
[[maybe_unused]] auto _any_requires_grad = compute_requires_grad( input, weight, bias );
[[maybe_unused]] auto _any_has_forward_grad_result0 = (isFwGradDefined(input) || isFwGradDefined(weight) || isFwGradDefined(bias));
check_no_requires_grad(running_mean, "running_mean", "native_batch_norm");
check_no_requires_grad(running_var, "running_var", "native_batch_norm");
std::shared_ptr<NativeBatchNormBackward0> grad_fn;
if (_any_requires_grad) {
grad_fn = std::shared_ptr<NativeBatchNormBackward0>(new NativeBatchNormBackward0(), deleteNode);
grad_fn->set_next_edges(collect_next_edges( input, weight, bias ));
grad_fn->eps = eps;
grad_fn->input_ = SavedVariable(input, false);
grad_fn->running_mean_ = SavedVariable(running_mean, false);
grad_fn->running_var_ = SavedVariable(running_var, false);
grad_fn->training = training;
grad_fn->weight_ = SavedVariable(weight, false);
}
...
}
```
When ``weight`` is ``None``, an empty tensor is automatically generated and will be transferred to the backward calculation:
c7e12c7427/torch/csrc/autograd/saved_variable.cpp (L121-L128)
At the beginning of the backward calculation in our scenario, we need to determine whether the input tensor is ``PrivateUse1``. However, if we use ``tensor.device().is_privateuseone()``, we will get an error ``"tensor does not have a device"``:
c7e12c7427/c10/core/TensorImpl.h (L1223-L1235)
I think this part of the code can be optimized; what do you think?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113421
Approved by: https://github.com/albanD
These kernels are incredibly slow to compile, and for the most part they
are completely independent of ATen/c10, yet they still end up including
half of `c10` transitively through `CUDAGeneratorImpl.h` and
`CUDAContext.h`.
This trims the fat so `mem_eff_attention` doesn't depend on ATen/c10 at all,
and `flash_attn` now only depends on `PhiloxUtils.cuh` (split out from
`CUDAGeneratorImpl.h`) and `CUDAContextLight.h` which doesn't transitively
include `TensorImpl.h`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113493
Approved by: https://github.com/lezcano
Motivation: since `intel_extension_for_pytorch` was added to `THIRDPARTY_SKIPLIST`, the functions defined in IPEX are skipped when an IPEX-optimized model uses `torch.compile`. These functions cannot generate the corresponding FX graph through Dynamo and cannot be optimized by the compiler, so unnecessary graph breaks occur. This PR removes `intel_extension_for_pytorch` from `THIRDPARTY_SKIPLIST` so that IPEX and `torch.compile` can work better together.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112840
Approved by: https://github.com/jgong5, https://github.com/jansel
Fixes#113194
Docstrings updated.
Here are the outputs with the counts before and after:
1) torch/sparse/__init__.py
Before:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:1 at module level:
D104: Missing docstring in public package
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:183 in public function `sum`:
D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:183 in public function `sum`:
D400: First line should end with a period (not 'n')
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:183 in public function `sum`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:391 in public class `check_sparse_tensor_invariants`:
D207: Docstring is under-indented
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:436 in public method `is_enabled`:
D207: Docstring is under-indented
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:436 in public method `is_enabled`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:448 in public method `enable`:
D207: Docstring is under-indented
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:468 in public method `disable`:
D207: Docstring is under-indented
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:475 in public method `__init__`:
D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:479 in public method `__enter__`:
D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:486 in public method `__exit__`:
D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:492 in public method `__call__`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:502 in public function `as_sparse_gradcheck`:
D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:502 in public function `as_sparse_gradcheck`:
D400: First line should end with a period (not 'l')
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:502 in public function `as_sparse_gradcheck`:
D401: First line should be in imperative mood (perhaps 'Decorate', not 'Decorator')
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:518 in private nested function `gradcheck_with_sparse_support`:
D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:518 in private nested function `gradcheck_with_sparse_support`:
D400: First line should end with a period (not 's')
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:518 in private nested function `gradcheck_with_sparse_support`:
D401: First line should be in imperative mood; try rephrasing (found 'Same')
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:528 in private nested function `convert_to_strided_representation`:
D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:528 in private nested function `convert_to_strided_representation`:
D400: First line should end with a period (not 'n')
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:559 in private nested function `restore_from_strided_representation`:
D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:559 in private nested function `restore_from_strided_representation`:
D400: First line should end with a period (not 'd')
23
```
After:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:1 at module level:
D104: Missing docstring in public package
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:476 in public method `__init__`:
D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:480 in public method `__enter__`:
D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:487 in public method `__exit__`:
D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/sparse/__init__.py:493 in public method `__call__`:
D102: Missing docstring in public method
5
```
2) torch/contrib/_tensorboard_vis.py
Before:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/contrib/_tensorboard_vis.py:21 in public function `dump_tensorboard_summary`:
D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/contrib/_tensorboard_vis.py:54 in public function `visualize_graph_executor`:
D401: First line should be in imperative mood (perhaps 'Append', not 'Appends')
2
```
After:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/contrib/_tensorboard_vis.py:21 in public function `dump_tensorboard_summary`:
D103: Missing docstring in public function
1
```
3) torch/jit/_state.py
Before:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_state.py:1 at module level:
D400: First line should end with a period (not 'e')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_state.py:20 in public method `__init__`:
D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_state.py:25 in public method `parse_env`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_state.py:41 in public method `__bool__`:
D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_state.py:48 in public function `disable`:
D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_state.py:52 in public function `enable`:
D103: Missing docstring in public function
6
```
After:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_state.py:20 in public method `__init__`:
D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_state.py:25 in public method `parse_env`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_state.py:41 in public method `__bool__`:
D105: Missing docstring in magic method
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_state.py:48 in public function `disable`:
D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_state.py:52 in public function `enable`:
D103: Missing docstring in public function
5
```
4) torch/jit/_monkeytype_config.py
Before:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:27 in public function `is_torch_native_class`:
D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:40 in public function `get_type`:
D200: One-line docstring should fit on one line with quotes (found 3)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:40 in public function `get_type`:
D401: First line should be in imperative mood; try rephrasing (found 'Helper')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:62 in public function `get_optional_of_element_type`:
D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:62 in public function `get_optional_of_element_type`:
D400: First line should end with a period (not 'l')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:62 in public function `get_optional_of_element_type`:
D401: First line should be in imperative mood; try rephrasing (found 'Helper')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:75 in public function `get_qualified_name`:
D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:84 in public method `__init__`:
D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:87 in public method `log`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:90 in public class `JitTypeTraceStore`:
D101: Missing docstring in public class
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:91 in public method `__init__`:
D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:98 in public method `add`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:103 in public method `filter`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:111 in public method `analyze`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:122 in public method `consolidate_types`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:139 in public method `get_args_types`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:142 in public class `JitTypeTraceConfig`:
D101: Missing docstring in public class
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:143 in public method `__init__`:
D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:148 in public method `trace_logger`:
D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:148 in public method `trace_logger`:
D400: First line should end with a period (not 'd')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:148 in public method `trace_logger`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:154 in public method `trace_store`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:157 in public method `code_filter`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:163 in public class `JitTypeTraceStoreLogger`:
D101: Missing docstring in public class
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:164 in public method `__init__`:
D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:167 in public class `JitTypeTraceStore`:
D101: Missing docstring in public class
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:168 in public method `__init__`:
D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:171 in public class `JitTypeTraceConfig`:
D101: Missing docstring in public class
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:172 in public method `__init__`:
D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:179 in public function `jit_code_filter`:
D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:179 in public function `jit_code_filter`:
D401: First line should be in imperative mood; try rephrasing (found 'Custom')
31
```
After:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:27 in public function `is_torch_native_class`:
D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:74 in public function `get_qualified_name`:
D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:83 in public method `__init__`:
D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:86 in public method `log`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:89 in public class `JitTypeTraceStore`:
D101: Missing docstring in public class
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:90 in public method `__init__`:
D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:97 in public method `add`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:102 in public method `filter`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:110 in public method `analyze`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:121 in public method `consolidate_types`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:138 in public method `get_args_types`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:141 in public class `JitTypeTraceConfig`:
D101: Missing docstring in public class
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:142 in public method `__init__`:
D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:150 in public method `trace_store`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:153 in public method `code_filter`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:159 in public class `JitTypeTraceStoreLogger`:
D101: Missing docstring in public class
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:160 in public method `__init__`:
D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:163 in public class `JitTypeTraceStore`:
D101: Missing docstring in public class
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:164 in public method `__init__`:
D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:167 in public class `JitTypeTraceConfig`:
D101: Missing docstring in public class
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_monkeytype_config.py:168 in public method `__init__`:
D107: Missing docstring in __init__
21
```
5) torch/jit/_fuser.py
Before:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_fuser.py:9 in public function `optimized_execution`:
D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_fuser.py:9 in public function `optimized_execution`:
D400: First line should end with a period (not 'n')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_fuser.py:9 in public function `optimized_execution`:
D401: First line should be in imperative mood; try rephrasing (found 'A')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_fuser.py:23 in public function `fuser`:
D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_fuser.py:23 in public function `fuser`:
D400: First line should end with a period (not 'n')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_fuser.py:23 in public function `fuser`:
D401: First line should be in imperative mood; try rephrasing (found 'A')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_fuser.py:136 in public function `set_fusion_strategy`:
D401: First line should be in imperative mood (perhaps 'Set', not 'Sets')
7
```
After:
```
0
```
6) torch/jit/_async.py
Before:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_async.py:1 at module level:
D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_async.py:1 at module level:
D400: First line should end with a period (not 'I')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_async.py:20 in public function `fork`:
D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_async.py:20 in public function `fork`:
D400: First line should end with a period (not 'e')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_async.py:20 in public function `fork`:
D401: First line should be in imperative mood (perhaps 'Create', not 'Creates')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_async.py:88 in public function `wait`:
D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_async.py:88 in public function `wait`:
D400: First line should end with a period (not 'e')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_async.py:88 in public function `wait`:
D401: First line should be in imperative mood (perhaps 'Force', not 'Forces')
8
```
After:
```
0
```
7) torch/jit/_await.py
Before:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_await.py:11 in private function `_awaitable`:
D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_await.py:11 in private function `_awaitable`:
D400: First line should end with a period (not ',')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_await.py:11 in private function `_awaitable`:
D401: First line should be in imperative mood (perhaps 'Create', not 'Creates')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_await.py:19 in private function `_awaitable_wait`:
D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_await.py:19 in private function `_awaitable_wait`:
D400: First line should end with a period (not ',')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_await.py:19 in private function `_awaitable_wait`:
D401: First line should be in imperative mood (perhaps 'Request', not 'Requests')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_await.py:27 in private function `_awaitable_nowait`:
D200: One-line docstring should fit on one line with quotes (found 3)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_await.py:27 in private function `_awaitable_nowait`:
D401: First line should be in imperative mood (perhaps 'Create', not 'Creates')
8
```
After:
```
0
```
8) torch/jit/_check.py
Before:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_check.py:10 in public class `AttributeTypeIsSupportedChecker`:
D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_check.py:10 in public class `AttributeTypeIsSupportedChecker`:
D400: First line should end with a period (not 'e')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_check.py:10 in public class `AttributeTypeIsSupportedChecker`:
D412: No blank lines allowed between a section header and its content ('Example')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_check.py:61 in public method `check`:
D102: Missing docstring in public method
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_check.py:110 in public method `visit_Assign`:
D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_check.py:110 in public method `visit_Assign`:
D400: First line should end with a period (not 'n')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_check.py:132 in public method `visit_AnnAssign`:
D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_check.py:132 in public method `visit_AnnAssign`:
D400: First line should end with a period (not '`')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_check.py:187 in public method `visit_Call`:
D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_check.py:187 in public method `visit_Call`:
D400: First line should end with a period (not '`')
10
```
After:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_check.py:58 in public method `check`:
D102: Missing docstring in public method
1
```
9) torch/jit/_freeze.py
Before:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_freeze.py:1 at module level:
D400: First line should end with a period (not 'g')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_freeze.py:16 in public function `freeze`:
D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_freeze.py:16 in public function `freeze`:
D400: First line should end with a period (not 'd')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_freeze.py:127 in public function `run_frozen_optimizations`:
D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_freeze.py:127 in public function `run_frozen_optimizations`:
D401: First line should be in imperative mood (perhaps 'Run', not 'Runs')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_freeze.py:182 in public function `optimize_for_inference`:
D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_freeze.py:182 in public function `optimize_for_inference`:
D400: First line should end with a period (not 'e')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_freeze.py:182 in public function `optimize_for_inference`:
D401: First line should be in imperative mood (perhaps 'Perform', not 'Performs')
8
```
After:
```
0
```
10) torch/jit/_recursive.py
Before:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:69 in public function `make_stub`:
D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:75 in public function `make_stub_from_method`:
D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:90 in public function `make_stubs_from_exported_methods`:
D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:103 in public function `jit_ignored_properties`:
D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:155 in public class `SourceContext`:
D101: Missing docstring in public class
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:156 in public method `__init__`:
D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:160 in public function `get_annotations`:
D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:186 in public function `infer_concrete_type_builder`:
D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:186 in public function `infer_concrete_type_builder`:
D400: First line should end with a period (not 's')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:423 in public class `ConcreteTypeStore`:
D101: Missing docstring in public class
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:427 in public method `__init__`:
D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:434 in public method `get_or_create_concrete_type`:
D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:434 in public method `get_or_create_concrete_type`:
D400: First line should end with a period (not 'T')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:459 in public function `create_methods_and_properties_from_stubs`:
D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:474 in public function `create_hooks_from_stubs`:
D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:485 in public function `get_module_concrete_type`:
D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:485 in public function `get_module_concrete_type`:
D400: First line should end with a period (not 'e')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:485 in public function `get_module_concrete_type`:
D401: First line should be in imperative mood (perhaps 'Get', not 'Gets')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:539 in public function `create_script_module`:
D400: First line should end with a period (not 'e')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:539 in public function `create_script_module`:
D401: First line should be in imperative mood (perhaps 'Create', not 'Creates')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:725 in public function `script_model_defines_attr`:
D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:735 in public function `add_python_attr_to_scripted_model`:
D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:740 in public function `get_overload_annotations`:
D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:772 in public function `get_overload_name_mapping`:
D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:797 in public function `make_stubs_for_overloads`:
D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:816 in public function `check_module_initialized`:
D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:842 in public function `infer_methods_to_compile`:
D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:842 in public function `infer_methods_to_compile`:
D400: First line should end with a period (not 'g')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:842 in public function `infer_methods_to_compile`:
D401: First line should be in imperative mood (perhaps 'Implement', not 'Implements')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:904 in public function `get_hook_stubs`:
D200: One-line docstring should fit on one line with quotes (found 3)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:904 in public function `get_hook_stubs`:
D400: First line should end with a period (not 's')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:904 in public function `get_hook_stubs`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:940 in public function `get_property_stubs`:
D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:940 in public function `get_property_stubs`:
D400: First line should end with a period (not 'd')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:963 in public function `interface_script`:
D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:963 in public function `interface_script`:
D400: First line should end with a period (not 'r')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:963 in public function `interface_script`:
D401: First line should be in imperative mood (perhaps 'Make', not 'Makes')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:977 in private nested function `infer_interface_methods_to_compile`:
D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:977 in private nested function `infer_interface_methods_to_compile`:
D400: First line should end with a period (not 'h')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:989 in public function `try_compile_fn`:
D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:1014 in public function `wrap_cpp_class`:
D200: One-line docstring should fit on one line with quotes (found 3)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:1021 in public function `wrap_cpp_module`:
D200: One-line docstring should fit on one line with quotes (found 3)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:1021 in public function `wrap_cpp_module`:
D400: First line should end with a period (not 's')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:1040 in public function `compile_unbound_method`:
D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:1052 in public function `lazy_bind`:
D205: 1 blank line required between summary line and description (found 0)
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:1052 in public function `lazy_bind`:
D400: First line should end with a period (not 'd')
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:1052 in public function `lazy_bind`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
47
```
After:
```
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:69 in public function `make_stub`:
D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:75 in public function `make_stub_from_method`:
D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:90 in public function `make_stubs_from_exported_methods`:
D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:103 in public function `jit_ignored_properties`:
D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:155 in public class `SourceContext`:
D101: Missing docstring in public class
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:156 in public method `__init__`:
D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:160 in public function `get_annotations`:
D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:424 in public class `ConcreteTypeStore`:
D101: Missing docstring in public class
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:428 in public method `__init__`:
D107: Missing docstring in __init__
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:457 in public function `create_methods_and_properties_from_stubs`:
D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:472 in public function `create_hooks_from_stubs`:
D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:724 in public function `script_model_defines_attr`:
D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:734 in public function `add_python_attr_to_scripted_model`:
D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:739 in public function `get_overload_annotations`:
D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:771 in public function `get_overload_name_mapping`:
D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:796 in public function `make_stubs_for_overloads`:
D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:815 in public function `check_module_initialized`:
D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:979 in public function `try_compile_fn`:
D103: Missing docstring in public function
/home/ubuntu/Desktop/Docathon/pytorch/torch/jit/_recursive.py:1026 in public function `compile_unbound_method`:
D103: Missing docstring in public function
19
```
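Most of these fixes follow the same mechanical pattern. For example, the D205/D400/D401 findings against `torch.sparse.sum` above are all resolved by reshaping the docstring's first line (an illustrative paraphrase, not the verbatim source):
```
# Before: D205 (no blank line after the summary), D400 (no trailing
# period), D401 (first line not in imperative mood):
def sum(input, dim=None, dtype=None):
    """Returns the sum of each row of the sparse tensor along dim
    If dim is None, sums all elements instead.
    """

# After: one imperative summary line ending in a period, then a blank line:
def sum(input, dim=None, dtype=None):
    """Return the sum of each row of the given sparse tensor along dim.

    If dim is None, sum all elements instead.
    """
```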
@svekars
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113371
Approved by: https://github.com/davidberard98
Notes:
* `debug_insert_nops` in testing.py was passing `None` to the compiler_fn
parameter of `OutputGraph`, hence the modifications there.
* I added `disable-error-code="method-assign"` to debug_utils.py, as it
does several such assignments (see the sketch below). I guess mypy doesn't
like method assignment because it makes code near-impossible to safely typecheck.
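As a sketch (with a hypothetical class, not code from debug_utils.py), this is the kind of assignment the module-level directive permits:
```
# mypy: disable-error-code="method-assign"
# Module-level mypy directive: silences errors of the form
#   error: Cannot assign to a method  [method-assign]
# for the whole file.

class Repro:  # hypothetical stand-in
    def run(self) -> None:
        print("original")

r = Repro()
r.run = lambda: print("patched")  # the assignment mypy would flag
r.run()  # -> "patched"
```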
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113519
Approved by: https://github.com/Skylion007
ghstack dependencies: #113413, #113518
Summary: Previously, when the first tensor argument to `aten.cat` was empty and there was only one non-empty tensor argument, the first (empty) tensor was erroneously returned by the `aten.cat` decomposition. Here we fix the bug.
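A minimal sketch of the failure mode (illustrative, assuming the default inductor backend; the real coverage is `test_cat_empty` below):
```
import torch

def f(a, b):
    return torch.cat([a, b])

a = torch.empty(0, 3)   # empty first argument
b = torch.randn(2, 3)   # the single non-empty argument

# Before the fix, the decomposition returned `a` (shape (0, 3));
# with the fix, the compiled result matches eager (shape (2, 3)).
out = torch.compile(f)(a, b)
assert torch.equal(out, f(a, b))
```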
Test Plan:
```
$ python test/inductor/test_torchinductor.py -k test_cat_empty
...
----------------------------------------------------------------------
Ran 2 tests in 5.760s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113514
Approved by: https://github.com/jansel
Fixes https://github.com/pytorch/pytorch/issues/104594.
The reason for the exporter behavior in the originally posted issue is as follows:
the ONNX model tracks shape-related computations that were done in PyTorch with Python
numbers as tensor computations. This is the only way for ONNX to track them properly,
since ONNX has only a tensor type; otherwise the computation result would be tracked
statically as a constant, and the model would not work for another input that differs in shape.
For type promotion, however, scalars should be treated differently from tensors.
The exporter mistook these shape-related scalars for tensors and incorrectly promoted them.
This PR fixes the behavior and relaxes the criteria for scalar recognition. For floating point,
previously only a value from a model initializer with dtype torch.double and rank 0 was
treated as a scalar. Now this is relaxed to any intermediate value, and to dtype torch.float as well.
The previous assumption was that a Python number is traced with dtype torch.double, which
no longer appears to hold.
NOTE that this might introduce a regression in which a real 0-rank tensor is now recognized as a
scalar. The downside is that the model will drop in accuracy for these cases, since certain
computations will happen in lower-precision data types.
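For illustration, a hypothetical module of the affected kind (not the exact repro from the linked issue):
```
import torch

class Scale(torch.nn.Module):
    def forward(self, x):
        # x.shape[-1] is a Python number at trace time. The exporter
        # records it as a tensor so the exported graph stays valid for
        # inputs of other shapes, but for type promotion it must still
        # count as a scalar: promoting a float32 `x` up to the scalar's
        # dtype is the mispromotion described above.
        return x / x.shape[-1]

out = Scale()(torch.randn(2, 4))  # eager sanity check; dtype stays float32
```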
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113404
Approved by: https://github.com/justinchuby
Previously, floor_div operations were defined in
ATen/native/BinaryOps.h. Since this header was not included under
ABI-compat mode, trying to use those indexing operations would result in
compilation errors.
Technically, it is safe to use aten::native::floor_div_* functions in
ABI-compat mode as they are header-only; we could simply include
BinaryOps.h. However, there are other declarations in BinaryOps.h that
are not binary-compatible, so this is not ideal. Thus, I have moved those
functions into a separate file, and put them under c10/util, since they
don't really have tensor-specific logic.
c10 functions are not all header-only, so this still isn't ideal, but it
seems like an improvement nonetheless. Moreover, cpp_prefix.h -- used
when compiling cpp kernels -- already includes c10 header files, so
ABI-compatibility already depends on maintaining some c10 functions as
header-only.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113276
Approved by: https://github.com/chenyang78, https://github.com/desertfire
Ex. https://github.com/pytorch/pytorch/actions/runs/6829618325/job/18576593307
```
File "test/run_test.py", line 1806, in main
test_stats = aggregated_heuristics.get_test_stats(test)
File "/var/lib/jenkins/pytorch/tools/testing/target_determination/heuristics/interface.py", line 391, in get_test_stats
metrics = heuristic_results.get_priority_info_for_test(test)
File "/var/lib/jenkins/pytorch/tools/testing/target_determination/heuristics/interface.py", line 307, in get_priority_info_for_test
relevance = self._get_test_relevance_group(test_run)
File "/var/lib/jenkins/pytorch/tools/testing/target_determination/heuristics/interface.py", line 278, in _get_test_relevance_group
raise ValueError(f"Test {test_run} not found in any relevance group")
ValueError: Test test_cuda_expandable_segments not found in any relevance group
```
I believe the root cause is that HistoricalClassFailurCorrelation splits `test_cuda_expandable_segments` into two sets: one with a class and one without. Then, when the entire `test_cuda_expandable_segments` fails (because we currently don't do class-level granularity in TD), it is unable to find what HistoricalClassFailurCorrelation ranked the test as, since the test is split in two.
I don't think this is that important for normal CI users, since this code only runs if a test failed in the first place. However, it does mean that we can't gather TD stats, so I am going to disable it for now.
One possible solution is to switch the contains call and take the worst or best of the bunch, which I think is what https://github.com/pytorch/pytorch/blob/main/tools/testing/target_determination/heuristics/interface.py#L272 is trying to do, though that's unclear.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113497
Approved by: https://github.com/kit1980, https://github.com/huydhn, https://github.com/malfet
Previously it would fail here:
```
File "/data/users/ezyang/a/pytorch/torch/_inductor/fx_passes/post_grad.py", line 597, in remove_noop_ops
for out in tuple(graph.nodes)[-1].args[0]:
```
Now you'll trigger this assert instead.
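For context, a small sketch (a hypothetical repro, not code from the PR) of why that line breaks when a graph returns a single value rather than a tuple:
```
import torch.fx as fx

def f(x):
    return x + 1  # single, non-tuple output

graph = fx.symbolic_trace(f).graph
output_node = list(graph.nodes)[-1]  # the `output` node
print(output_node.args[0])           # a single Node, not a tuple of Nodes
# `for out in output_node.args[0]:` raises TypeError here; the new
# assert reports this case with a clearer message.
```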
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113451
Approved by: https://github.com/albanD
We want something like torch.empty(i0, 12).view(4, -1, 12) to work. Right now, it chokes on guards on data-dependent accesses. It turns out we are very close to having it work, based on experiments in https://github.com/pytorch/pytorch/issues/112347, if we do the replacement trick: setting i0 = i1 * 4 to explicitly encode the divisibility. This is good enough for Sympy to handle the rest.
There are two parts to this PR.
* First, we must discover that there is this divisibility constraint. For view, this happens in `infer_size`; however, we are unable to discover the modulus test with `expect_true`, because the condition is currently written with a Python boolean operator that forces guarding too early: `numel == newsize or (dim is not None and newsize > 0 and numel % newsize == 0)`. We rewrite this into an equivalent version that first tests whether dim is None, before performing the individual tests (see the sketch after this list). The main nontrivial reasoning here is showing that the rewritten tests are still sufficient when `numel == newsize`; but if `numel == newsize`, then the modulus test must pass, so the two forms are equivalent.
* Given the modifications to `infer_size`, this suffices to produce a runtime assert `Eq(Mod(192*i0, 2304), 0)`. Now we must simply turn this into the replacement automatically. I wasn't really sure how to get Sympy to do this for me, so I just manually pattern-matched this particular expression form and perform the replacement when it matches.
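A sketch of the condition rewrite from the first bullet (simplified and illustrative, not the exact `infer_size` source):
```
# `numel` is the input's element count, `newsize` the product of the
# explicitly given target sizes, and `dim` the index of a -1 entry, if any.

# Original form: the leading `numel == newsize` short-circuits, forcing a
# guard on data-dependent sizes before the divisibility test can be
# recorded with expect_true.
def valid_old(numel, newsize, dim):
    return numel == newsize or (
        dim is not None and newsize > 0 and numel % newsize == 0
    )

# Rewritten form: branch on `dim is None` first, so the modulus test is
# always evaluated when a -1 dimension is present. If numel == newsize
# (with newsize > 0), the modulus test passes anyway, so the two forms
# accept the same well-defined inputs.
def valid_new(numel, newsize, dim):
    if dim is None:
        return numel == newsize
    return newsize > 0 and numel % newsize == 0
```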
Note that this is really only useful for export, because inductor chokes on views involving unbacked SymInts. That will be a follow-up.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113165
Approved by: https://github.com/lezcano, https://github.com/aakhundov
Fixes #112595
- `torch/autograd/profiler.py` <br/>
**Before: 37**
```
torch/autograd/profiler.py:1 at module level:
D100: Missing docstring in public module
torch/autograd/profiler.py:91 in public class `profile`:
D205: 1 blank line required between summary line and description (found 0)
torch/autograd/profiler.py:175 in public method `__init__`:
D107: Missing docstring in __init__
torch/autograd/profiler.py:261 in public method `config`:
D102: Missing docstring in public method
torch/autograd/profiler.py:272 in public method `__enter__`:
D105: Missing docstring in magic method
torch/autograd/profiler.py:290 in public method `__exit__`:
D105: Missing docstring in magic method
torch/autograd/profiler.py:308 in public method `__repr__`:
D105: Missing docstring in magic method
torch/autograd/profiler.py:313 in public method `__str__`:
D105: Missing docstring in magic method
torch/autograd/profiler.py:322 in public method `table`:
D102: Missing docstring in public method
torch/autograd/profiler.py:346 in public method `export_chrome_trace`:
D102: Missing docstring in public method
torch/autograd/profiler.py:355 in public method `export_stacks`:
D102: Missing docstring in public method
torch/autograd/profiler.py:361 in public method `key_averages`:
D102: Missing docstring in public method
torch/autograd/profiler.py:368 in public method `total_average`:
D102: Missing docstring in public method
torch/autograd/profiler.py:377 in public method `self_cpu_time_total`:
D205: 1 blank line required between summary line and description (found 0)
torch/autograd/profiler.py:377 in public method `self_cpu_time_total`:
D400: First line should end with a period (not 'f')
torch/autograd/profiler.py:555 in public class `record_function`:
D205: 1 blank line required between summary line and description (found 0)
torch/autograd/profiler.py:555 in public class `record_function`:
D400: First line should end with a period (not 'f')
torch/autograd/profiler.py:591 in public method `__init__`:
D107: Missing docstring in __init__
torch/autograd/profiler.py:602 in public method `__enter__`:
D105: Missing docstring in magic method
torch/autograd/profiler.py:608 in public method `__exit__`:
D105: Missing docstring in magic method
torch/autograd/profiler.py:625 in private method `_call_end_callbacks_on_future`:
D205: 1 blank line required between summary line and description (found 0)
torch/autograd/profiler.py:625 in private method `_call_end_callbacks_on_future`:
D400: First line should end with a period (not 'c')
torch/autograd/profiler.py:707 in public method `__init__`:
D107: Missing docstring in __init__
torch/autograd/profiler.py:712 in public method `__enter__`:
D105: Missing docstring in magic method
torch/autograd/profiler.py:733 in public method `__exit__`:
D105: Missing docstring in magic method
torch/autograd/profiler.py:826 in public method `__init__`:
D107: Missing docstring in __init__
torch/autograd/profiler.py:831 in public method `__enter__`:
D105: Missing docstring in magic method
torch/autograd/profiler.py:853 in public method `__exit__`:
D105: Missing docstring in magic method
torch/autograd/profiler.py:863 in public function `load_nvprof`:
D401: First line should be in imperative mood (perhaps 'Open', not 'Opens')
torch/autograd/profiler.py:874 in public method `__init__`:
D107: Missing docstring in __init__
torch/autograd/profiler.py:877 in public method `see`:
D102: Missing docstring in public method
torch/autograd/profiler.py:883 in public function `parse_nvprof_trace`:
D103: Missing docstring in public function
torch/autograd/profiler.py:951 in public class `KinetoStepTracker`:
D205: 1 blank line required between summary line and description (found 0)
torch/autograd/profiler.py:991 in public method `init_step_count`:
D102: Missing docstring in public method
torch/autograd/profiler.py:995 in public method `erase_step_count`:
D102: Missing docstring in public method
torch/autograd/profiler.py:1000 in public method `increment_step`:
D205: 1 blank line required between summary line and description (found 0)
torch/autograd/profiler.py:1023 in public method `current_step`:
D102: Missing docstring in public method
37
```
**After: 27**
```
torch/autograd/profiler.py:1 at module level:
D100: Missing docstring in public module
torch/autograd/profiler.py:176 in public method `__init__`:
D107: Missing docstring in __init__
torch/autograd/profiler.py:262 in public method `config`:
D102: Missing docstring in public method
torch/autograd/profiler.py:273 in public method `__enter__`:
D105: Missing docstring in magic method
torch/autograd/profiler.py:291 in public method `__exit__`:
D105: Missing docstring in magic method
torch/autograd/profiler.py:309 in public method `__repr__`:
D105: Missing docstring in magic method
torch/autograd/profiler.py:314 in public method `__str__`:
D105: Missing docstring in magic method
torch/autograd/profiler.py:323 in public method `table`:
D102: Missing docstring in public method
torch/autograd/profiler.py:347 in public method `export_chrome_trace`:
D102: Missing docstring in public method
torch/autograd/profiler.py:356 in public method `export_stacks`:
D102: Missing docstring in public method
torch/autograd/profiler.py:362 in public method `key_averages`:
D102: Missing docstring in public method
torch/autograd/profiler.py:369 in public method `total_average`:
D102: Missing docstring in public method
torch/autograd/profiler.py:593 in public method `__init__`:
D107: Missing docstring in __init__
torch/autograd/profiler.py:604 in public method `__enter__`:
D105: Missing docstring in magic method
torch/autograd/profiler.py:610 in public method `__exit__`:
D105: Missing docstring in magic method
torch/autograd/profiler.py:708 in public method `__init__`:
D107: Missing docstring in __init__
torch/autograd/profiler.py:713 in public method `__enter__`:
D105: Missing docstring in magic method
torch/autograd/profiler.py:734 in public method `__exit__`:
D105: Missing docstring in magic method
torch/autograd/profiler.py:827 in public method `__init__`:
D107: Missing docstring in __init__
torch/autograd/profiler.py:832 in public method `__enter__`:
D105: Missing docstring in magic method
torch/autograd/profiler.py:854 in public method `__exit__`:
D105: Missing docstring in magic method
torch/autograd/profiler.py:875 in public method `__init__`:
D107: Missing docstring in __init__
torch/autograd/profiler.py:878 in public method `see`:
D102: Missing docstring in public method
torch/autograd/profiler.py:884 in public function `parse_nvprof_trace`:
D103: Missing docstring in public function
torch/autograd/profiler.py:993 in public method `init_step_count`:
D102: Missing docstring in public method
torch/autograd/profiler.py:997 in public method `erase_step_count`:
D102: Missing docstring in public method
torch/autograd/profiler.py:1025 in public method `current_step`:
D102: Missing docstring in public method
27
```
- `torch/autograd/graph.py` <br/>
**Before: 22**
```
torch/autograd/graph.py:1 at module level:
D100: Missing docstring in public module
torch/autograd/graph.py:24 in public class `Node`:
D101: Missing docstring in public class
torch/autograd/graph.py:27 in public method `name`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/autograd/graph.py:42 in public method `next_functions`:
D102: Missing docstring in public method
torch/autograd/graph.py:47 in public method `metadata`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/autograd/graph.py:56 in public method `register_hook`:
D401: First line should be in imperative mood (perhaps 'Register', not 'Registers')
torch/autograd/graph.py:94 in public method `register_prehook`:
D401: First line should be in imperative mood (perhaps 'Register', not 'Registers')
torch/autograd/graph.py:129 in public method `__subclasshook__`:
D105: Missing docstring in magic method
torch/autograd/graph.py:147 in public function `get_gradient_edge`:
D205: 1 blank line required between summary line and description (found 0)
torch/autograd/graph.py:147 in public function `get_gradient_edge`:
D400: First line should end with a period (not 'f')
torch/autograd/graph.py:147 in public function `get_gradient_edge`:
D401: First line should be in imperative mood; try rephrasing (found 'This')
torch/autograd/graph.py:166 in public function `increment_version`:
D205: 1 blank line required between summary line and description (found 0)
torch/autograd/graph.py:166 in public function `increment_version`:
D400: First line should end with a period (not 'd')
torch/autograd/graph.py:166 in public function `increment_version`:
D401: First line should be in imperative mood; try rephrasing (found 'This')
torch/autograd/graph.py:243 in public method `__init__`:
D107: Missing docstring in __init__
torch/autograd/graph.py:251 in public method `__enter__`:
D105: Missing docstring in magic method
torch/autograd/graph.py:256 in public method `__exit__`:
D105: Missing docstring in magic method
torch/autograd/graph.py:261 in public class `save_on_cpu`:
D205: 1 blank line required between summary line and description (found 0)
torch/autograd/graph.py:261 in public class `save_on_cpu`:
D400: First line should end with a period (not 'e')
torch/autograd/graph.py:303 in public method `__init__`:
D107: Missing docstring in __init__
torch/autograd/graph.py:365 in public function `register_multi_grad_hook`:
D401: First line should be in imperative mood (perhaps 'Register', not 'Registers')
torch/autograd/graph.py:588 in public function `allow_mutation_on_saved_tensors`:
D400: First line should end with a period (not 'd')
22
```
**After: 8**
```
torch/autograd/graph.py:1 at module level:
D100: Missing docstring in public module
torch/autograd/graph.py:24 in public class `Node`:
D101: Missing docstring in public class
torch/autograd/graph.py:42 in public method `next_functions`:
D102: Missing docstring in public method
torch/autograd/graph.py:129 in public method `__subclasshook__`:
D105: Missing docstring in magic method
torch/autograd/graph.py:244 in public method `__init__`:
D107: Missing docstring in __init__
torch/autograd/graph.py:252 in public method `__enter__`:
D105: Missing docstring in magic method
torch/autograd/graph.py:257 in public method `__exit__`:
D105: Missing docstring in magic method
torch/autograd/graph.py:303 in public method `__init__`:
D107: Missing docstring in __init__
8
```
- `torch/multiprocessing/pool.py` <br/>
**Before: 6**
```
torch/multiprocessing/pool.py:1 at module level:
D100: Missing docstring in public module
torch/multiprocessing/pool.py:7 in public function `clean_worker`:
D103: Missing docstring in public function
torch/multiprocessing/pool.py:18 in public class `Pool`:
D205: 1 blank line required between summary line and description (found 0)
torch/multiprocessing/pool.py:18 in public class `Pool`:
D209: Multi-line docstring closing quotes should be on a separate line
torch/multiprocessing/pool.py:29 in private method `_repopulate_pool`:
D205: 1 blank line required between summary line and description (found 0)
torch/multiprocessing/pool.py:29 in private method `_repopulate_pool`:
D400: First line should end with a period (not ',')
6
```
**After: 2**
```
torch/multiprocessing/pool.py:1 at module level:
D100: Missing docstring in public module
torch/multiprocessing/pool.py:7 in public function `clean_worker`:
D103: Missing docstring in public function
2
```
- `torch/multiprocessing/queue.py` <br/>
**Before: 11**
```
torch/multiprocessing/queue.py:1 at module level:
D100: Missing docstring in public module
torch/multiprocessing/queue.py:8 in public class `ConnectionWrapper`:
D205: 1 blank line required between summary line and description (found 0)
torch/multiprocessing/queue.py:8 in public class `ConnectionWrapper`:
D209: Multi-line docstring closing quotes should be on a separate line
torch/multiprocessing/queue.py:8 in public class `ConnectionWrapper`:
D400: First line should end with a period (not 'o')
torch/multiprocessing/queue.py:11 in public method `__init__`:
D107: Missing docstring in __init__
torch/multiprocessing/queue.py:14 in public method `send`:
D102: Missing docstring in public method
torch/multiprocessing/queue.py:19 in public method `recv`:
D102: Missing docstring in public method
torch/multiprocessing/queue.py:23 in public method `__getattr__`:
D105: Missing docstring in magic method
torch/multiprocessing/queue.py:29 in public class `Queue`:
D101: Missing docstring in public class
torch/multiprocessing/queue.py:30 in public method `__init__`:
D107: Missing docstring in __init__
torch/multiprocessing/queue.py:38 in public class `SimpleQueue`:
D101: Missing docstring in public class
11
```
**After: 8**
```
torch/multiprocessing/queue.py:1 at module level:
D100: Missing docstring in public module
torch/multiprocessing/queue.py:10 in public method `__init__`:
D107: Missing docstring in __init__
torch/multiprocessing/queue.py:13 in public method `send`:
D102: Missing docstring in public method
torch/multiprocessing/queue.py:18 in public method `recv`:
D102: Missing docstring in public method
torch/multiprocessing/queue.py:22 in public method `__getattr__`:
D105: Missing docstring in magic method
torch/multiprocessing/queue.py:28 in public class `Queue`:
D101: Missing docstring in public class
torch/multiprocessing/queue.py:29 in public method `__init__`:
D107: Missing docstring in __init__
torch/multiprocessing/queue.py:37 in public class `SimpleQueue`:
D101: Missing docstring in public class
8
```
- `torch/multiprocessing/reductions.py` <br/>
**Before: 31**
```
torch/multiprocessing/reductions.py:1 at module level:
D100: Missing docstring in public module
torch/multiprocessing/reductions.py:24 in public class `StorageWeakRef`:
D209: Multi-line docstring closing quotes should be on a separate line
torch/multiprocessing/reductions.py:31 in public method `__init__`:
D107: Missing docstring in __init__
torch/multiprocessing/reductions.py:38 in public method `from_weakref`:
D102: Missing docstring in public method
torch/multiprocessing/reductions.py:44 in public method `expired`:
D102: Missing docstring in public method
torch/multiprocessing/reductions.py:47 in public method `__del__`:
D105: Missing docstring in magic method
torch/multiprocessing/reductions.py:50 in public method `__hash__`:
D105: Missing docstring in magic method
torch/multiprocessing/reductions.py:53 in public method `__eq__`:
D105: Missing docstring in magic method
torch/multiprocessing/reductions.py:60 in public class `SharedCache`:
D400: First line should end with a period (not 'f')
torch/multiprocessing/reductions.py:62 in public method `__init__`:
D107: Missing docstring in __init__
torch/multiprocessing/reductions.py:75 in public method `get`:
D102: Missing docstring in public method
torch/multiprocessing/reductions.py:79 in public method `__setitem__`:
D105: Missing docstring in magic method
torch/multiprocessing/reductions.py:85 in public method `free_dead_references`:
D102: Missing docstring in public method
torch/multiprocessing/reductions.py:99 in public function `rebuild_event`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:103 in public function `reduce_event`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:108 in public function `rebuild_tensor`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:121 in public function `rebuild_cuda_tensor`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:189 in public function `reduce_tensor`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:347 in public function `rebuild_nested_tensor`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:364 in public function `reduce_nested_tensor`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:389 in public function `fd_id`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:397 in public function `storage_from_cache`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:404 in public function `rebuild_storage_fd`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:417 in public function `rebuild_storage_filename`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:437 in public function `rebuild_storage_empty`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:441 in public function `rebuild_typed_storage`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:446 in public function `reduce_typed_storage`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:450 in public function `rebuild_typed_storage_child`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:455 in public function `reduce_typed_storage_child`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:459 in public function `reduce_storage`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:488 in public function `init_reductions`:
D103: Missing docstring in public function
31
```
**After: 29**
```
torch/multiprocessing/reductions.py:1 at module level:
D100: Missing docstring in public module
torch/multiprocessing/reductions.py:32 in public method `__init__`:
D107: Missing docstring in __init__
torch/multiprocessing/reductions.py:39 in public method `from_weakref`:
D102: Missing docstring in public method
torch/multiprocessing/reductions.py:45 in public method `expired`:
D102: Missing docstring in public method
torch/multiprocessing/reductions.py:48 in public method `__del__`:
D105: Missing docstring in magic method
torch/multiprocessing/reductions.py:51 in public method `__hash__`:
D105: Missing docstring in magic method
torch/multiprocessing/reductions.py:54 in public method `__eq__`:
D105: Missing docstring in magic method
torch/multiprocessing/reductions.py:63 in public method `__init__`:
D107: Missing docstring in __init__
torch/multiprocessing/reductions.py:76 in public method `get`:
D102: Missing docstring in public method
torch/multiprocessing/reductions.py:80 in public method `__setitem__`:
D105: Missing docstring in magic method
torch/multiprocessing/reductions.py:86 in public method `free_dead_references`:
D102: Missing docstring in public method
torch/multiprocessing/reductions.py:100 in public function `rebuild_event`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:104 in public function `reduce_event`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:109 in public function `rebuild_tensor`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:122 in public function `rebuild_cuda_tensor`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:190 in public function `reduce_tensor`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:348 in public function `rebuild_nested_tensor`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:365 in public function `reduce_nested_tensor`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:390 in public function `fd_id`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:398 in public function `storage_from_cache`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:405 in public function `rebuild_storage_fd`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:418 in public function `rebuild_storage_filename`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:438 in public function `rebuild_storage_empty`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:442 in public function `rebuild_typed_storage`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:447 in public function `reduce_typed_storage`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:451 in public function `rebuild_typed_storage_child`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:456 in public function `reduce_typed_storage_child`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:460 in public function `reduce_storage`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:489 in public function `init_reductions`:
D103: Missing docstring in public function
29
```
- `torch/multiprocessing/spawn.py` </br>
**Before: 19**
```
torch/multiprocessing/spawn.py:1 at module level:
D100: Missing docstring in public module
torch/multiprocessing/spawn.py:11 in public class `ProcessException`:
D101: Missing docstring in public class
torch/multiprocessing/spawn.py:14 in public method `__init__`:
D107: Missing docstring in __init__
torch/multiprocessing/spawn.py:20 in public method `__reduce__`:
D105: Missing docstring in magic method
torch/multiprocessing/spawn.py:25 in public class `ProcessRaisedException`:
D205: 1 blank line required between summary line and description (found 0)
torch/multiprocessing/spawn.py:25 in public class `ProcessRaisedException`:
D400: First line should end with a period (not 'n')
torch/multiprocessing/spawn.py:30 in public method `__init__`:
D107: Missing docstring in __init__
torch/multiprocessing/spawn.py:40 in public class `ProcessExitedException`:
D205: 1 blank line required between summary line and description (found 0)
torch/multiprocessing/spawn.py:40 in public class `ProcessExitedException`:
D400: First line should end with a period (not 'l')
torch/multiprocessing/spawn.py:47 in public method `__init__`:
D107: Missing docstring in __init__
torch/multiprocessing/spawn.py:59 in public method `__reduce__`:
D105: Missing docstring in magic method
torch/multiprocessing/spawn.py:85 in public class `ProcessContext`:
D101: Missing docstring in public class
torch/multiprocessing/spawn.py:86 in public method `__init__`:
D107: Missing docstring in __init__
torch/multiprocessing/spawn.py:93 in public method `pids`:
D102: Missing docstring in public method
torch/multiprocessing/spawn.py:97 in public method `join`:
D205: 1 blank line required between summary line and description (found 0)
torch/multiprocessing/spawn.py:97 in public method `join`:
D401: First line should be in imperative mood (perhaps 'Try', not 'Tries')
torch/multiprocessing/spawn.py:166 in public class `SpawnContext`:
D101: Missing docstring in public class
torch/multiprocessing/spawn.py:167 in public method `__init__`:
D107: Missing docstring in __init__
torch/multiprocessing/spawn.py:180 in public function `start_processes`:
D103: Missing docstring in public function
19
```
**After: 13**
```
torch/multiprocessing/spawn.py:1 at module level:
D100: Missing docstring in public module
torch/multiprocessing/spawn.py:11 in public class `ProcessException`:
D101: Missing docstring in public class
torch/multiprocessing/spawn.py:14 in public method `__init__`:
D107: Missing docstring in __init__
torch/multiprocessing/spawn.py:20 in public method `__reduce__`:
D105: Missing docstring in magic method
torch/multiprocessing/spawn.py:27 in public method `__init__`:
D107: Missing docstring in __init__
torch/multiprocessing/spawn.py:41 in public method `__init__`:
D107: Missing docstring in __init__
torch/multiprocessing/spawn.py:53 in public method `__reduce__`:
D105: Missing docstring in magic method
torch/multiprocessing/spawn.py:79 in public class `ProcessContext`:
D101: Missing docstring in public class
torch/multiprocessing/spawn.py:80 in public method `__init__`:
D107: Missing docstring in __init__
torch/multiprocessing/spawn.py:87 in public method `pids`:
D102: Missing docstring in public method
torch/multiprocessing/spawn.py:161 in public class `SpawnContext`:
D101: Missing docstring in public class
torch/multiprocessing/spawn.py:162 in public method `__init__`:
D107: Missing docstring in __init__
torch/multiprocessing/spawn.py:175 in public function `start_processes`:
D103: Missing docstring in public function
13
```
- `torch/multiprocessing/__init__.py` </br>
**Before: 5**
```
torch/multiprocessing/__init__.py:1 at module level:
D205: 1 blank line required between summary line and description (found 0)
torch/multiprocessing/__init__.py:1 at module level:
D400: First line should end with a period (not '`')
torch/multiprocessing/__init__.py:57 in public function `set_sharing_strategy`:
D401: First line should be in imperative mood (perhaps 'Set', not 'Sets')
torch/multiprocessing/__init__.py:69 in public function `get_sharing_strategy`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/multiprocessing/__init__.py:74 in public function `get_all_sharing_strategies`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
5
```
**After: 0**
- `torch/nn/__init__.py` </br>
**Before: 3**
```
torch/nn/__init__.py:1 at module level:
D104: Missing docstring in public package
torch/nn/__init__.py:14 in public function `factory_kwargs`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/__init__.py:14 in public function `factory_kwargs`:
D400: First line should end with a period (not 'd')
3
```
**After: 1**
```
torch/nn/__init__.py:1 at module level:
D104: Missing docstring in public package
1
```
- `torch/nn/cpp.py` </br>
**Before: 16**
```
torch/nn/cpp.py:7 in public class `OrderedDictWrapper`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/cpp.py:7 in public class `OrderedDictWrapper`:
D400: First line should end with a period (not 'e')
torch/nn/cpp.py:16 in public method `__init__`:
D107: Missing docstring in __init__
torch/nn/cpp.py:21 in public method `cpp_dict`:
D102: Missing docstring in public method
torch/nn/cpp.py:27 in public method `items`:
D102: Missing docstring in public method
torch/nn/cpp.py:30 in public method `keys`:
D102: Missing docstring in public method
torch/nn/cpp.py:33 in public method `values`:
D102: Missing docstring in public method
torch/nn/cpp.py:36 in public method `__iter__`:
D105: Missing docstring in magic method
torch/nn/cpp.py:39 in public method `__len__`:
D105: Missing docstring in magic method
torch/nn/cpp.py:42 in public method `__contains__`:
D105: Missing docstring in magic method
torch/nn/cpp.py:45 in public method `__getitem__`:
D105: Missing docstring in magic method
torch/nn/cpp.py:50 in public class `ModuleWrapper`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/cpp.py:50 in public class `ModuleWrapper`:
D400: First line should end with a period (not 'd')
torch/nn/cpp.py:55 in public method `__init__`:
D107: Missing docstring in __init__
torch/nn/cpp.py:83 in public method `training`:
D102: Missing docstring in public method
torch/nn/cpp.py:90 in public method `__repr__`:
D105: Missing docstring in magic method
16
```
**After: 12**
```
torch/nn/cpp.py:16 in public method `__init__`:
D107: Missing docstring in __init__
torch/nn/cpp.py:21 in public method `cpp_dict`:
D102: Missing docstring in public method
torch/nn/cpp.py:27 in public method `items`:
D102: Missing docstring in public method
torch/nn/cpp.py:30 in public method `keys`:
D102: Missing docstring in public method
torch/nn/cpp.py:33 in public method `values`:
D102: Missing docstring in public method
torch/nn/cpp.py:36 in public method `__iter__`:
D105: Missing docstring in magic method
torch/nn/cpp.py:39 in public method `__len__`:
D105: Missing docstring in magic method
torch/nn/cpp.py:42 in public method `__contains__`:
D105: Missing docstring in magic method
torch/nn/cpp.py:45 in public method `__getitem__`:
D105: Missing docstring in magic method
torch/nn/cpp.py:52 in public method `__init__`:
D107: Missing docstring in __init__
torch/nn/cpp.py:80 in public method `training`:
D102: Missing docstring in public method
torch/nn/cpp.py:87 in public method `__repr__`:
D105: Missing docstring in magic method
12
```
- `torch/nn/grad.py` </br>
**Before: 10**
```
torch/nn/grad.py:1 at module level:
D400: First line should end with a period (not 'e')
torch/nn/grad.py:8 in public function `conv1d_input`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/grad.py:8 in public function `conv1d_input`:
D401: First line should be in imperative mood (perhaps 'Compute', not 'Computes')
torch/nn/grad.py:40 in public function `conv1d_weight`:
D401: First line should be in imperative mood (perhaps 'Compute', not 'Computes')
torch/nn/grad.py:71 in public function `conv2d_input`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/grad.py:71 in public function `conv2d_input`:
D401: First line should be in imperative mood (perhaps 'Compute', not 'Computes')
torch/nn/grad.py:103 in public function `conv2d_weight`:
D401: First line should be in imperative mood (perhaps 'Compute', not 'Computes')
torch/nn/grad.py:134 in public function `conv3d_input`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/grad.py:134 in public function `conv3d_input`:
D401: First line should be in imperative mood (perhaps 'Compute', not 'Computes')
torch/nn/grad.py:166 in public function `conv3d_weight`:
D401: First line should be in imperative mood (perhaps 'Compute', not 'Computes')
10
```
**After: 0**
- `torch/nn/parameter.py` </br>
**Before: 17**
```
torch/nn/parameter.py:1 at module level:
D100: Missing docstring in public module
torch/nn/parameter.py:14 in public class `Parameter`:
D204: 1 blank line required after class docstring (found 0)
torch/nn/parameter.py:33 in public method `__new__`:
D102: Missing docstring in public method
torch/nn/parameter.py:54 in public method `__deepcopy__`:
D105: Missing docstring in magic method
torch/nn/parameter.py:62 in public method `__repr__`:
D105: Missing docstring in magic method
torch/nn/parameter.py:65 in public method `__reduce_ex__`:
D105: Missing docstring in magic method
torch/nn/parameter.py:84 in public class `UninitializedTensorMixin`:
D101: Missing docstring in public class
torch/nn/parameter.py:105 in public method `materialize`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/parameter.py:125 in public method `shape`:
D102: Missing docstring in public method
torch/nn/parameter.py:132 in public method `share_memory_`:
D102: Missing docstring in public method
torch/nn/parameter.py:138 in public method `__repr__`:
D105: Missing docstring in magic method
torch/nn/parameter.py:141 in public method `__reduce_ex__`:
D105: Missing docstring in magic method
torch/nn/parameter.py:149 in public method `__torch_function__`:
D105: Missing docstring in magic method
torch/nn/parameter.py:164 in public function `is_lazy`:
D103: Missing docstring in public function
torch/nn/parameter.py:186 in public method `__new__`:
D102: Missing docstring in public method
torch/nn/parameter.py:191 in public method `__deepcopy__`:
D105: Missing docstring in magic method
torch/nn/parameter.py:217 in public method `__new__`:
D102: Missing docstring in public method
17
```
**After: 15**
```
torch/nn/parameter.py:1 at module level:
D100: Missing docstring in public module
torch/nn/parameter.py:34 in public method `__new__`:
D102: Missing docstring in public method
torch/nn/parameter.py:55 in public method `__deepcopy__`:
D105: Missing docstring in magic method
torch/nn/parameter.py:63 in public method `__repr__`:
D105: Missing docstring in magic method
torch/nn/parameter.py:66 in public method `__reduce_ex__`:
D105: Missing docstring in magic method
torch/nn/parameter.py:85 in public class `UninitializedTensorMixin`:
D101: Missing docstring in public class
torch/nn/parameter.py:127 in public method `shape`:
D102: Missing docstring in public method
torch/nn/parameter.py:134 in public method `share_memory_`:
D102: Missing docstring in public method
torch/nn/parameter.py:140 in public method `__repr__`:
D105: Missing docstring in magic method
torch/nn/parameter.py:143 in public method `__reduce_ex__`:
D105: Missing docstring in magic method
torch/nn/parameter.py:151 in public method `__torch_function__`:
D105: Missing docstring in magic method
torch/nn/parameter.py:166 in public function `is_lazy`:
D103: Missing docstring in public function
torch/nn/parameter.py:188 in public method `__new__`:
D102: Missing docstring in public method
torch/nn/parameter.py:193 in public method `__deepcopy__`:
D105: Missing docstring in magic method
torch/nn/parameter.py:219 in public method `__new__`:
D102: Missing docstring in public method
15
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113052
Approved by: https://github.com/mikaylagawarecki, https://github.com/soulitzer
This PR fixes two cases when fx generated code is invalid in python (syntax error):
1. multiple type annotation in one line: `var1: annotation1, var2: annotation2 = function_call()`
2. invalid type annotation for scalars like `var1: f32[] = function_call()`.
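For illustration, here is a minimal, hypothetical reproduction of both invalid forms, with legal alternatives a codegen could emit instead (the exact fix in the PR may differ):
```python
def fn():
    return 1, 2

# Case 1: annotations are not allowed on tuple-unpacking targets.
#   a: int, b: int = fn()   # SyntaxError
a, b = fn()                  # legal: drop the inline annotations

# Case 2: an empty subscript such as f32[] is not a valid expression.
#   c: f32[] = fn()          # SyntaxError
c: "f32[]" = fn()            # legal: a quoted string is a valid annotation
```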
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113345
Approved by: https://github.com/ezyang
Summary:
We are planning to lazily initialize CUPTI when profiling is actually performed. Therefore, we need to remove profiler init dependency on CUPTI Callbacks' RESOURCE_CONTEXT_CREATED.
Instead, we can initialize the profilers during the torch profiler pybind init, i.e. THPAutograd_initExtension(), and lazily in profilerStep().
Test Plan:
CI and ran internally, see internal diff logs.
Differential Revision: D50894961
Pulled By: aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112623
Approved by: https://github.com/albanD
SymIntType is referenced by wrapper.py, so I added its .pyi definition.
I also added SymBoolType along the way for completeness.
The `isinstance` checks in wrapper.py reference torch.Type, which seems
to cause mypy to choke. Not entirely sure why; I've just added
type-ignore comments for now.
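A rough sketch of what such stub additions look like (the exact contents of the real `.pyi` entries are an assumption here):
```python
# sketch of the added stubs, e.g. in torch/_C/__init__.pyi
class Type: ...

class SymIntType(Type): ...

class SymBoolType(Type): ...
```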
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113411
Approved by: https://github.com/Skylion007
ghstack dependencies: #113409, #113410
Currently `test_torchinductor.py` imports `torchvision` at import time, which
is problematic when you have a broken `torchvision` install as test collection
will fail. This could happen, for example, if `torchvision` was built against a
different version of PyTorch, as may happen regularly in development.
This moves the check inside `test_roi_align` so a failure to import
`torchvision` only causes a test failure and the other tests can run fine.
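A minimal sketch of the pattern, with a hypothetical test class name:
```python
import unittest

class TestRoiAlign(unittest.TestCase):  # hypothetical test class
    def test_roi_align(self):
        # Deferred import: a broken torchvision install now fails only this
        # test instead of aborting collection of the whole file.
        from torchvision.ops import roi_align
        self.assertTrue(callable(roi_align))
```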
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113385
Approved by: https://github.com/lezcano
ghstack dependencies: #113384
`test_dist` uses bfloat16 which isn't well supported by triton on pre-sm80
hardware, so split the test in two and add a skip. This also adds a
`skipCUDAIf` decorator which only skips on CUDA devices so the test still runs
on CPU.
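A sketch of what such a device-conditional skip decorator does (the real `skipCUDAIf` lives in PyTorch's test framework; this is an illustrative reimplementation assuming the device-generic test signature):
```python
import functools
import unittest

def skipCUDAIf(condition, reason):
    """Skip the test only when it runs on a CUDA device; CPU runs proceed."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(self, device, *args, **kwargs):
            if "cuda" in str(device) and condition:
                raise unittest.SkipTest(reason)
            return fn(self, device, *args, **kwargs)
        return wrapper
    return decorator
```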
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113384
Approved by: https://github.com/lezcano
This PR mimics what we have done to trace ProcessGroup. This allows us to use FakeProcessGroup with torch.compile. FakeProcessGroup allows us to use world_size > 1 without creating multiple processes, thus enabling the usage of PDB to debug bucketing DDP allreduce in the Inductor. We can theoretically use GLOO with world_size==1 to achieve the same goal. However, the `wait()` seems to be optimized away when the world_size is 1.
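For reference, a typical way to set up a fake process group in a test (paths from PyTorch's internal test utilities; treat the exact import path as an assumption):
```python
import torch.distributed as dist
from torch.testing._internal.distributed.fake_pg import FakeStore

# A single process pretends to be rank 0 of a 2-rank job; collectives are
# no-ops, so a debugger like PDB can step through compiled collective code.
dist.init_process_group(backend="fake", store=FakeStore(), rank=0, world_size=2)
```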
Differential Revision: [D51136463](https://our.internmc.facebook.com/intern/diff/D51136463/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113314
Approved by: https://github.com/wanchaol
The algorithm is taken from the NumPy implementation at https://github.com/numpy/numpy/blob/main/numpy/lib/arraysetops.py#L323: it first does a sort on the input sequence and then uses a `mask` to record the unique element of each consecutive section.
We don't yet have a parallel sort for 1-dimensional float tensors; it will be enabled in a next step. Parallel radix sort is used for 1-dimensional int tensors.
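As a rough illustration of the sort-and-mask approach (the actual implementation is parallelized C++, not this Python sketch):
```python
import torch

def unique_sorted(x):
    sorted_x, _ = torch.sort(x)
    mask = torch.ones_like(sorted_x, dtype=torch.bool)
    mask[1:] = sorted_x[1:] != sorted_x[:-1]   # True at the start of each run
    uniques = sorted_x[mask]
    starts = torch.nonzero(mask).flatten()
    counts = torch.diff(starts, append=torch.tensor([x.numel()]))
    return uniques, counts

print(unique_sorted(torch.tensor([3, 1, 3, 2, 1])))  # (tensor([1, 2, 3]), tensor([2, 1, 2]))
```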
The following data is collected with script in the issue on Intel(R) Xeon(R) Gold 6248 CPU @ 2.5GHz with single sockets (20 cores):
#### before (dtype int64)
```
Numpy just sort: 0.4271528720855713 s
Numpy sort + indexes: 6.383563041687012 s
Torch just sort: 0.46924352645874023 s
Torch sort + indexes: 1.8140404224395752 s
```
#### after (dtype int64)
```
Torch just sort: 0.2540090084075928 s
Torch sort + indexes: 0.2766146659851074 s
```
#### before (float32)
```
Numpy just sort: 0.41129398345947266 s
Numpy sort + indexes: 6.422696590423584 s
Torch just sort: 9.109549283981323 s
Torch sort + indexes: 37.59021711349487 s
```
#### after (float32)
```
Torch just sort: 3.5369982719421387 s
Torch sort + indexes: 3.582240581512451 s
```
if we enabled parallel sort on 1-dimension float tensor, the performance is:
```
Torch just sort: 0.3212606906890869 s
Torch sort + indexes: 0.36211371421813965 s
```
Since I have fused the `inverse_indices` and `count` calculations into the fused parallel loop (the algorithm is identical to NumPy's but better optimized), they add only a small amount of additional time.
Use a reduction implementation for unique when dtype is bool on CPU.
This reverts commit 6dca81c054c1f7e378e956900265b085ca521e47 as `torch.sort` errors has been fixed in FBGEMM by 70c6e83c29.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113420
Approved by: https://github.com/malfet
The numeric test for round-trip casting of float8 dtypes originally consisted of generating a 100x100 tensor in the range 0..max.
This change refactors the test, adds further edge cases, and fixes multiple issues with the lower-precision simulation against which the results of the round-trip cast were checked.
Set atol=0 and rtol=0 to ensure an exact equality comparison.
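As a flavor of what an exact round-trip comparison looks like (assuming the `torch.float8_e4m3fn` dtype; the real test covers more dtypes and edge cases):
```python
import torch

finfo = torch.finfo(torch.float8_e4m3fn)
# Values exactly representable in float8 must survive the round trip bit-exactly.
x = torch.tensor([0.0, finfo.tiny, 1.0, finfo.max], dtype=torch.float32)
roundtrip = x.to(torch.float8_e4m3fn).to(torch.float32)
torch.testing.assert_close(roundtrip, x, atol=0, rtol=0)  # exact equality
```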
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113361
Approved by: https://github.com/malfet, https://github.com/Neilblaze
Reorganized the two C++ and Python pytree submodules into a subpackage. I think this will make it easier to implement the abstract `PyTreeAPI` class with two implementations. And it will be much easier for the user to switch between the two implementations.
Before:
```text
torch
├── utils
│ ├── _pytree.py
│ ├── _cxx_pytree.py
│ ...
...
```
After:
```text
torch
├── utils
│ ├── _pytree
│ │ ├── __init__.py
│ │ └── api
│ │ ├── __init__.py
│ │ ├── cxx.py
│ │ └── python.py
│ ...
...
```
The `torch.utils._pytree` module will import all APIs from `torch.utils._pytree.api.python`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112278
Approved by: https://github.com/zou3519
ghstack dependencies: #112111
Summary: Seems like we already support kwargs in _infer_argument, so we don't need the extra assertion here.
Test Plan: buck test caffe2/test:test_export -- -r lazy_module_kwargs
Differential Revision: D51170339
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113387
Approved by: https://github.com/yanboliang
Summary:
This change fixes the Chrome trace loading issue with all_to_all input split length > 30.
Currently, when the `all_to_all` input split size is larger than 30, we truncate the content and add `...` at the end, which caused trouble when loading the trace in Chrome.
Test Plan:
**Trace with length = 2**:
- Link: https://fburl.com/perfdoctor/b94u4x82
**Looking into the json file**:
```
Before:
"In split size": [6058496, 5942784]
After
"In split size": "[6058496, 5942784]"
```
Reviewed By: aaronenyeshi
Differential Revision: D51167843
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113392
Approved by: https://github.com/aaronenyeshi
This PR has the following goals:
1. Detect an unhealthy NCCL watchdog thread by implementing a heartbeat. The NCCL watchdog can sometimes hang for several reasons, such as nccl/cuda API bugs or unexpected blocking behaviors. This is the last resort to ensure that we don't silently let the training job run for hours (a Python sketch of this monitor appears after this list).
2. Sometimes, the process gets stuck in the destroy of NCCL PG, and this PR will ensure that we will eventually abort it after some time (by default 2 mins)
3. Once heartbeat cannot be heard, we dump debug information (for now, we just use the flight recorder implemented in https://github.com/pytorch/pytorch/pull/110960/files) to disk. (How and where to dump the debug info will be addressed in the following PR).
4. Finally, we initiate std::abort via `LOG(FATAL)` to kill the process.
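The heartbeat idea in item 1 can be sketched like this (illustrative Python; the actual implementation is a C++ side thread inside ProcessGroupNCCL):
```python
import os
import threading

def heartbeat_monitor(get_heartbeat, interval_s, terminate: threading.Event):
    """Abort the process if the watchdog's heartbeat counter stops advancing."""
    last = get_heartbeat()
    while not terminate.wait(interval_s):
        current = get_heartbeat()
        if current == last:
            # Watchdog made no progress for a full interval: dump debug info
            # (flight recorder) and abort, mirroring LOG(FATAL) in the C++ code.
            print("NCCL watchdog unresponsive, aborting")
            os.abort()
        last = current
```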
To clarify further what this PR is trying to solve, we first list the four cases a NCCL PG can end up in:
- case 1: ncclwatchdog gets stuck (maybe some blocking API) and heartbeat monitor kills it during regular heartbeat monitor loop.
- case 2: ncclwatchdog times out and a desync report or destroy kicks in (let's call it shutdown), but this shutdown takes so long that heartbeat monitoring decides it has to kill the process anyway.
- case 3: ncclwatchdog aborts the process (heartbeat monitor not involved)
- case 4: program exits cleanly (heartbeat monitor not involved)
As we can see here, this PR is trying to address cases one and two, and we also want to ensure that adding one more monitor thread does not interfere with what we are currently doing in cases three and four. That's why we added two flags, `terminateHeartbeatMonitorThread_` and `collectiveDebugInfoMode_`.
For cases three and four, either `monitorWakeUpCV_` will be woken up in the destructor or `terminateHeartbeatMonitorThread_` will be set to true, so the monitor thread will just exit ASAP.
For case one, both `terminateHeartbeatMonitorThread_` and `collectiveDebugInfoMode_` will still be false when the monitor thread sees there is no heartbeat, so it will directly kill the process. For case two, either `terminateHeartbeatMonitorThread_` or `collectiveDebugInfoMode_` will be true, so the monitor thread will wait extra time before killing the process.
Differential Revision: [D51146305](https://our.internmc.facebook.com/intern/diff/D51146305)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112518
Approved by: https://github.com/kwen2501, https://github.com/wconstab
The problem is that we have a subclass (FunctionalTensor) that overrides size/stride calls, causing them to go through __torch_dispatch__.
But when SAC is active, we have _CachingTorchDispatchMode.__torch_dispatch__ active, which intercepts those size/stride calls first and does something different with them instead of letting FunctionalTensor.__torch_dispatch__ handle them.
This PR updates the SAC torch dispatch mode to know not to handle metadata calls, and to let its tensor arguments handle them directly.
Right now, `FunctionalTensor` has a hardcoded list of metadata ops, but we should probably put them somewhere more general.
I'll add better testing before landing this PR.
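Conceptually, the change makes the caching mode step aside for metadata ops, along these lines (a hedged sketch; the op set and structure here are illustrative, not the real implementation):
```python
import torch
from torch.utils._python_dispatch import TorchDispatchMode

METADATA_OPS = {torch.ops.aten.sym_size.default}  # illustrative, not the real list

class CachingModeSketch(TorchDispatchMode):
    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        if func in METADATA_OPS:
            # Don't cache metadata calls; let the tensor arguments (e.g. a
            # FunctionalTensor subclass) handle them directly.
            return func(*args, **kwargs)
        # ... caching of activations would happen here ...
        return func(*args, **kwargs)
```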
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113380
Approved by: https://github.com/yf225, https://github.com/wanchaol
Fixes#113191
```
pydocstyle torch/distributed/fsdp/fully_sharded_data_parallel.py --count
```
On master: 80
After my changes on this PR: 3
```
pydocstyle torch/distributed/_spmd/comm_tensor.py --count
```
On master: 5
After my changes on this PR: 3
```
pydocstyle torch/distributed/_spmd/experimental_ops.py --count
```
On master: 3
After my changes on this PR: 1
```
pydocstyle torch/distributed/_spmd/iter_graph_module.py --count
```
On master: 39
After my changes on this PR: 27
```
pydocstyle torch/distributed/_spmd/graph_utils.py --count
```
On master: 16
After my changes on this PR: 4
```
pydocstyle torch/distributed/_spmd/distribute.py --count
```
On master: 19
After my changes on this PR: 10
```
pydocstyle torch/distributed/_spmd/api.py --count
```
On master: 10
After my changes on this PR: 3
```
pydocstyle torch/distributed/_spmd/batch_dim_utils.py --count
```
On master: 14
After my changes on this PR: 3
```
pydocstyle torch/distributed/_spmd/data_parallel.py --count
```
On master: 34
After my changes on this PR: 2
```
pydocstyle torch/distributed/_spmd/graph_optimization.py --count
```
On master: 35
After my changes on this PR: 13
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113216
Approved by: https://github.com/ezyang
Summary:
Python allows users to write code like
```
x = 1
x += y
x += z
```
This code has well-defined semantics: because x is an immutable primitive, the first `+=` will actually re-bind x, it is equivalent to `x = x + y`.
The second in-place operation will either similarly desugar (if the result of `x + y` is itself immutable), or possibly result in "true" in-place operation.
Now, this is a problem for us because today, dynamo tries to both resolve constant variables to their literal values at compile time and also compile in a way that treats `operator.*` builtin functions consistently. This leads to a bug where code like
```
x = 1
x += y
```
actually gets compiled to
```
1 += y
```
which is both semantically meaningless and a syntax error.
A very simple fix that we've already used to fix the special case of `+=` is to detect this, treat it as an edge case, and desugar eagerly into `x = x + y`.
The problem with that fix is that it only patched `iadd`, but actually *all* of the in-place operators exhibit this behavior.
This commit proposes that we tackle all of the in-place operators supported by fx in the same way: eagerly remap the operation to an assignment when the left side is actually an immutable constant.
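A hedged sketch of the remapping idea (the real change lives in dynamo's variable tracking; the mapping below is an illustrative subset):
```python
import operator

IN_PLACE_TO_OUT_OF_PLACE = {  # illustrative subset of the fx in-place ops
    operator.iadd: operator.add,
    operator.isub: operator.sub,
    operator.imul: operator.mul,
}

def desugar_inplace(op, lhs, rhs, lhs_is_constant):
    # When the left operand is an immutable compile-time constant, rewrite the
    # in-place operator as its out-of-place counterpart, which re-binds the
    # name instead of mutating in place.
    if lhs_is_constant and op in IN_PLACE_TO_OUT_OF_PLACE:
        return IN_PLACE_TO_OUT_OF_PLACE[op](lhs, rhs)
    return op(lhs, rhs)
```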
**Alternatives?**
There might be some other fix possible that wouldn't produce a hardcoded remapping; I know that we generally don't like the growth of mappings and blocklists in dynamo.
I'm a little skeptical about a general solution though, because the bug is due precisely to Python's highly dynamic dispatching of inplace operations by type; since the fx graph has to be purely static, I suspect that we actually have to desugar this somewhere, because the dataflow is fundamentally different for true inplace operations on types that define `__iadd__`, etc vs the desugaring on primitives.
I'm open to other suggestions
Test Plan:
I verified that the code in
https://github.com/pytorch/pytorch/issues/112656
compiles with this fix, and the compiled functions produce the same outputs as the originals.
This needs unit tests, but I'd like to get feedback on the approach in the meantime.
Fixes#112656
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113117
Approved by: https://github.com/yanboliang
This PR reduces docstring errors from a total of 98 to 0. This can be verified by running,
`pydocstyle path-to-zero_redundancy_optimizer.py --count`
BEFORE the PR:
`pydocstyle torch/distributed/optim/zero_redundancy_optimizer.py --count`
98
AFTER the PR:
`pydocstyle torch/distributed/optim/zero_redundancy_optimizer.py --count`
0
Fixes#112642
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113200
Approved by: https://github.com/weifengpy
I am trying to turn on `follow_imports=silent` for MYPYNOFOLLOW.
However, this requires a huge number of changes, so I am breaking it
down to a per-file basis.
Unfortunately, we will not be able to turn on `follow_imports` until all
files are fixed, so there is no way to stop regressions. So I hope to get
these fixes in as fast as possible.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113304
Approved by: https://github.com/Skylion007
Summary:
The NSString writeToFile:atomically: method was deprecated in iOS 2.0.
This diff replaces it with a call to writeToFile:atomically:encoding:error:
duplicate of D51003188 to fix gh permissions
Test Plan: ci
Differential Revision: D51164941
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113377
Approved by: https://github.com/kirklandsign
Summary:
pg_config_info is used to dump pg information in the Execution Trace (ET). For trace analysis purposes and the PARAM replay benchmark, global ranks are more meaningful than group ranks.
P.S. `ranks` is a map of global rank -> group rank.
Test Plan: Tested in HPC
Differential Revision: D51136587
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113316
Approved by: https://github.com/XilunWu
Summary: Add quantized max pool 2d operation
Test Plan:
Check that all quantized tests pass:
buck2 build --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output
Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
[==========] Running 78 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 78 tests from VulkanAPITest
[ RUN ] VulkanAPITest.uniform_buffer_copy
[ OK ] VulkanAPITest.uniform_buffer_copy (66 ms)
[ RUN ] VulkanAPITest.copy_to_buffer
[ OK ] VulkanAPITest.copy_to_buffer (61 ms)
[ RUN ] VulkanAPITest.copy_to_buffer_channels_last
[ OK ] VulkanAPITest.copy_to_buffer_channels_last (28 ms)
[ RUN ] VulkanAPITest.cpu_to_vulkan_and_dequantize_quint8
[ OK ] VulkanAPITest.cpu_to_vulkan_and_dequantize_quint8 (58 ms)
[ RUN ] VulkanAPITest.cpu_to_vulkan_and_dequantize_qint8
[ OK ] VulkanAPITest.cpu_to_vulkan_and_dequantize_qint8 (44 ms)
[ RUN ] VulkanAPITest.cpu_to_vulkan_and_dequantize_qint32
[ OK ] VulkanAPITest.cpu_to_vulkan_and_dequantize_qint32 (72 ms)
[ RUN ] VulkanAPITest.quantize_dequantize
[ OK ] VulkanAPITest.quantize_dequantize (2 ms)
[ RUN ] VulkanAPITest.quantize_per_tensor_and_dequantize_quint8
[ OK ] VulkanAPITest.quantize_per_tensor_and_dequantize_quint8 (69 ms)
[ RUN ] VulkanAPITest.quantize_per_tensor_and_dequantize_quint8_qparams
[ OK ] VulkanAPITest.quantize_per_tensor_and_dequantize_quint8_qparams (58 ms)
[ RUN ] VulkanAPITest.quantize_per_tensor_and_dequantize_qint8
[ OK ] VulkanAPITest.quantize_per_tensor_and_dequantize_qint8 (77 ms)
[ RUN ] VulkanAPITest.quantize_per_tensor_and_dequantize_qint8_qparams
[ OK ] VulkanAPITest.quantize_per_tensor_and_dequantize_qint8_qparams (54 ms)
[ RUN ] VulkanAPITest.quantize_per_tensor_and_dequantize_qint32
[ OK ] VulkanAPITest.quantize_per_tensor_and_dequantize_qint32 (93 ms)
[ RUN ] VulkanAPITest.quantize_per_tensor_and_dequantize_qint32_qparams
[ OK ] VulkanAPITest.quantize_per_tensor_and_dequantize_qint32_qparams (90 ms)
[ RUN ] VulkanAPITest.quantized_add
[ OK ] VulkanAPITest.quantized_add (2 ms)
[ RUN ] VulkanAPITest.quantized_add_broadcast
WARNING: Logging before InitGoogleLogging() is written to STDERR
W1103 17:42:18.018113 4075724928 Resize.cpp:35] Warning: An output with one or more elements was resized since it had shape [2, 13, 1, 27], which does not match the required output shape [2, 13, 32, 27]. This behavior is deprecated, and in a future PyTorch release outputs will not be resized unless they have zero elements. You can explicitly reuse an out tensor t by resizing it, inplace, to zero elements with t.resize_(0). (function _resize_output_check)
[ OK ] VulkanAPITest.quantized_add_broadcast (2 ms)
[ RUN ] VulkanAPITest.quantized_add_broadcast1
[ OK ] VulkanAPITest.quantized_add_broadcast1 (1 ms)
[ RUN ] VulkanAPITest.quantized_add_broadcast2
W1103 17:42:18.022008 4075724928 Resize.cpp:35] Warning: An output with one or more elements was resized since it had shape [32, 1], which does not match the required output shape [32, 27]. This behavior is deprecated, and in a future PyTorch release outputs will not be resized unless they have zero elements. You can explicitly reuse an out tensor t by resizing it, inplace, to zero elements with t.resize_(0). (function _resize_output_check)
[ OK ] VulkanAPITest.quantized_add_broadcast2 (0 ms)
[ RUN ] VulkanAPITest.quantized_add_broadcast3
[ OK ] VulkanAPITest.quantized_add_broadcast3 (0 ms)
[ RUN ] VulkanAPITest.quantized_add_dif_params
[ OK ] VulkanAPITest.quantized_add_dif_params (1 ms)
[ RUN ] VulkanAPITest.conv2d
[ OK ] VulkanAPITest.conv2d (4 ms)
[ RUN ] VulkanAPITest.conv2d_pw
[ OK ] VulkanAPITest.conv2d_pw (88 ms)
[ RUN ] VulkanAPITest.conv2d_dw
[ OK ] VulkanAPITest.conv2d_dw (32 ms)
[ RUN ] VulkanAPITest.quantized_sub
[ OK ] VulkanAPITest.quantized_sub (1 ms)
[ RUN ] VulkanAPITest.quantized_mul
[ OK ] VulkanAPITest.quantized_mul (1 ms)
[ RUN ] VulkanAPITest.quantized_div
[ OK ] VulkanAPITest.quantized_div (1 ms)
[ RUN ] VulkanAPITest.quantized_upsample_nearest2d
[ OK ] VulkanAPITest.quantized_upsample_nearest2d (0 ms)
[ RUN ] VulkanAPITest.max_pool2d_qint8
[ OK ] VulkanAPITest.max_pool2d_qint8 (5 ms)
[ RUN ] VulkanAPITest.max_pool2d_quint8
[ OK ] VulkanAPITest.max_pool2d_quint8 (4 ms)
[ RUN ] VulkanAPITest.quantized_add_tests
[ OK ] VulkanAPITest.quantized_add_tests (77 ms)
[ RUN ] VulkanAPITest.quantized_sub_tests
[ OK ] VulkanAPITest.quantized_sub_tests (104 ms)
[ RUN ] VulkanAPITest.quantized_mul_tests
[ OK ] VulkanAPITest.quantized_mul_tests (78 ms)
[ RUN ] VulkanAPITest.quantized_div_tests
[ OK ] VulkanAPITest.quantized_div_tests (124 ms)
[ RUN ] VulkanAPITest.conv2d_quantized_fixed_params_uint8
[ OK ] VulkanAPITest.conv2d_quantized_fixed_params_uint8 (1 ms)
[ RUN ] VulkanAPITest.conv2d_quantized_computed_params_uint8
[ OK ] VulkanAPITest.conv2d_quantized_computed_params_uint8 (0 ms)
[ RUN ] VulkanAPITest.conv2d_quantized_random_params_uint8
[ OK ] VulkanAPITest.conv2d_quantized_random_params_uint8 (0 ms)
[ RUN ] VulkanAPITest.conv2d_quantized_prepack_fixed_params_uint8
[ OK ] VulkanAPITest.conv2d_quantized_prepack_fixed_params_uint8 (0 ms)
[ RUN ] VulkanAPITest.conv2d_quantized_prepack_computed_params_uint8
[ OK ] VulkanAPITest.conv2d_quantized_prepack_computed_params_uint8 (0 ms)
[ RUN ] VulkanAPITest.conv2d_quantized_prepack_random_params_uint8
[ OK ] VulkanAPITest.conv2d_quantized_prepack_random_params_uint8 (0 ms)
[ RUN ] VulkanAPITest.conv2d_dw_quantized_fixed_params_uint8
[ OK ] VulkanAPITest.conv2d_dw_quantized_fixed_params_uint8 (4 ms)
[ RUN ] VulkanAPITest.conv2d_dw_quantized_computed_params_uint8
[ OK ] VulkanAPITest.conv2d_dw_quantized_computed_params_uint8 (3 ms)
[ RUN ] VulkanAPITest.conv2d_dw_quantized_random_params_uint8
[ OK ] VulkanAPITest.conv2d_dw_quantized_random_params_uint8 (3 ms)
[ RUN ] VulkanAPITest.conv2d_dw_quantized_prepack_fixed_params_uint8
[ OK ] VulkanAPITest.conv2d_dw_quantized_prepack_fixed_params_uint8 (3 ms)
[ RUN ] VulkanAPITest.conv2d_dw_quantized_prepack_computed_params_uint8
[ OK ] VulkanAPITest.conv2d_dw_quantized_prepack_computed_params_uint8 (3 ms)
[ RUN ] VulkanAPITest.conv2d_dw_quantized_prepack_random_params_uint8
[ OK ] VulkanAPITest.conv2d_dw_quantized_prepack_random_params_uint8 (3 ms)
[ RUN ] VulkanAPITest.conv2d_pw_quantized_fixed_params_uint8
[ OK ] VulkanAPITest.conv2d_pw_quantized_fixed_params_uint8 (11 ms)
[ RUN ] VulkanAPITest.conv2d_pw_quantized_computed_params_uint8
input_dif too big: 0.0175897. generating input again ...
[ OK ] VulkanAPITest.conv2d_pw_quantized_computed_params_uint8 (17 ms)
[ RUN ] VulkanAPITest.conv2d_pw_quantized_random_params_uint8
[ OK ] VulkanAPITest.conv2d_pw_quantized_random_params_uint8 (11 ms)
[ RUN ] VulkanAPITest.conv2d_pw_quantized_prepack_fixed_params_uint8
[ OK ] VulkanAPITest.conv2d_pw_quantized_prepack_fixed_params_uint8 (11 ms)
[ RUN ] VulkanAPITest.conv2d_pw_quantized_prepack_computed_params_uint8
[ OK ] VulkanAPITest.conv2d_pw_quantized_prepack_computed_params_uint8 (11 ms)
[ RUN ] VulkanAPITest.conv2d_pw_quantized_prepack_random_params_uint8
[ OK ] VulkanAPITest.conv2d_pw_quantized_prepack_random_params_uint8 (11 ms)
[ RUN ] VulkanAPITest.conv2d_quantized_fixed_params_int8_int32
[ OK ] VulkanAPITest.conv2d_quantized_fixed_params_int8_int32 (1 ms)
[ RUN ] VulkanAPITest.conv2d_quantized_computed_params_int8_int32
[ OK ] VulkanAPITest.conv2d_quantized_computed_params_int8_int32 (0 ms)
[ RUN ] VulkanAPITest.conv2d_quantized_random_params_int8_int32
[ OK ] VulkanAPITest.conv2d_quantized_random_params_int8_int32 (0 ms)
[ RUN ] VulkanAPITest.conv2d_quantized_prepack_fixed_params_int8_int32
[ OK ] VulkanAPITest.conv2d_quantized_prepack_fixed_params_int8_int32 (0 ms)
[ RUN ] VulkanAPITest.conv2d_quantized_prepack_computed_params_int8_int32
[ OK ] VulkanAPITest.conv2d_quantized_prepack_computed_params_int8_int32 (0 ms)
[ RUN ] VulkanAPITest.conv2d_quantized_prepack_random_params_int8_int32
[ OK ] VulkanAPITest.conv2d_quantized_prepack_random_params_int8_int32 (0 ms)
[ RUN ] VulkanAPITest.conv2d_dw_quantized_fixed_params_int8_int32
[ OK ] VulkanAPITest.conv2d_dw_quantized_fixed_params_int8_int32 (3 ms)
[ RUN ] VulkanAPITest.conv2d_dw_quantized_computed_params_int8_int32
[ OK ] VulkanAPITest.conv2d_dw_quantized_computed_params_int8_int32 (4 ms)
[ RUN ] VulkanAPITest.conv2d_dw_quantized_random_params_int8_int32
[ OK ] VulkanAPITest.conv2d_dw_quantized_random_params_int8_int32 (3 ms)
[ RUN ] VulkanAPITest.conv2d_dw_quantized_prepack_fixed_params_int8_int32
[ OK ] VulkanAPITest.conv2d_dw_quantized_prepack_fixed_params_int8_int32 (4 ms)
[ RUN ] VulkanAPITest.conv2d_dw_quantized_prepack_computed_params_int8_int32
[ OK ] VulkanAPITest.conv2d_dw_quantized_prepack_computed_params_int8_int32 (3 ms)
[ RUN ] VulkanAPITest.conv2d_dw_quantized_prepack_random_params_int8_int32
[ OK ] VulkanAPITest.conv2d_dw_quantized_prepack_random_params_int8_int32 (3 ms)
[ RUN ] VulkanAPITest.conv2d_pw_quantized_fixed_params_int8_int32
[ OK ] VulkanAPITest.conv2d_pw_quantized_fixed_params_int8_int32 (11 ms)
[ RUN ] VulkanAPITest.conv2d_pw_quantized_computed_params_int8_int32
[ OK ] VulkanAPITest.conv2d_pw_quantized_computed_params_int8_int32 (12 ms)
[ RUN ] VulkanAPITest.conv2d_pw_quantized_random_params_int8_int32
[ OK ] VulkanAPITest.conv2d_pw_quantized_random_params_int8_int32 (11 ms)
[ RUN ] VulkanAPITest.conv2d_pw_quantized_prepack_fixed_params_int8_int32
[ OK ] VulkanAPITest.conv2d_pw_quantized_prepack_fixed_params_int8_int32 (11 ms)
[ RUN ] VulkanAPITest.conv2d_pw_quantized_prepack_computed_params_int8_int32
[ OK ] VulkanAPITest.conv2d_pw_quantized_prepack_computed_params_int8_int32 (12 ms)
[ RUN ] VulkanAPITest.conv2d_pw_quantized_prepack_random_params_int8_int32
[ OK ] VulkanAPITest.conv2d_pw_quantized_prepack_random_params_int8_int32 (11 ms)
[ RUN ] VulkanAPITest.quantized_tensor_get_scale_zero_point
[ OK ] VulkanAPITest.quantized_tensor_get_scale_zero_point (0 ms)
[ RUN ] VulkanAPITest.linear_2d_flat
[ OK ] VulkanAPITest.linear_2d_flat (3 ms)
[ RUN ] VulkanAPITest.linear_2d_small
[ OK ] VulkanAPITest.linear_2d_small (0 ms)
[ RUN ] VulkanAPITest.linear_2d_large
[ OK ] VulkanAPITest.linear_2d_large (2 ms)
[ RUN ] VulkanAPITest.linear_3d_flat
[ OK ] VulkanAPITest.linear_3d_flat (2 ms)
[ RUN ] VulkanAPITest.linear_3d_small
[ OK ] VulkanAPITest.linear_3d_small (1 ms)
[ RUN ] VulkanAPITest.linear_3d_large
[ OK ] VulkanAPITest.linear_3d_large (1 ms)
[ RUN ] VulkanAPITest.linear_4d_flat
[ OK ] VulkanAPITest.linear_4d_flat (1 ms)
[ RUN ] VulkanAPITest.linear_4d_small
[ OK ] VulkanAPITest.linear_4d_small (1 ms)
[ RUN ] VulkanAPITest.linear_4d_large
[ OK ] VulkanAPITest.linear_4d_large (2 ms)
[----------] 78 tests from VulkanAPITest (1537 ms total)
[----------] Global test environment tear-down
[==========] 78 tests from 1 test suite ran. (1537 ms total)
[ PASSED ] 78 tests.
Differential Revision: D50821920
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112937
Approved by: https://github.com/yipjustin
Fixes: #113193
`pydocstyle <all_files_in_issue> --count`
- Before: 345
- After: 130
For deprecated methods, I have added a `noqa` to ignore them. I was not able to find the file `torch/distributed/tensor/parallel/multihead_attention_tp.py`, so I've ignored it for this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113241
Approved by: https://github.com/kit1980
- Run each `(batch_size, compile)` benchmark 10 times in `./runner.sh` and get the mean and standard deviation of metrics in the output table.
- Only report `warmup latency`, `average_latency`, `throughput` and `gpu_util`.
- Break the `output.md` file into a single markdown file per `(batch_size, compile)` configuration. Further runs of `./runner.sh` will append one row to the table in each file for easy comparison.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113309
Approved by: https://github.com/albanD
The full_tensor API should return synchronously instead of returning an
AsyncCollectiveTensor; if the return value is one, we do the wait
directly. This makes the full_tensor API more precise.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113322
Approved by: https://github.com/wz337
This is the cheap and cheerful implementation, which is only enabled when TORCH_SHOW_CPP_STACKTRACES is set, because it *eagerly* symbolizes immediately at exception throw time, even if the exception will end up getting caught. It would be better to do this lazily and only symbolize when we try to print the exception, but that requires a more involved refactor of c10::Error that I don't feel like doing.
Compare the output before:
```
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x95 (0x7fa21b99d975 in /data/users/ezyang/c/pytorch/torch/lib/libc10.so)
frame #1: c10::TensorImpl::throw_cannot_call_with_symbolic(char const*) const + 0x8d (0x7fa21b951269 in /data/users/ezyang/c/pytorch/torch/lib/libc10.so)
frame #2: c10::TensorImpl::sizes_custom() const + 0x9f (0x7fa21b9770df in /data/users/ezyang/c/pytorch/torch/lib/libc10.so)
frame #3: at::meta::structured_mm::meta(at::Tensor const&, at::Tensor const&) + 0x31e (0x7fa20a202a8e in /data/users/ezyang/c/pytorch/torch/lib/libtorch_cpu.so)
frame #4: <unknown function> + 0x29f34de (0x7fa20b5f34de in /data/users/ezyang/c/pytorch/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0x2a1fd8e (0x7fa20b61fd8e in /data/users/ezyang/c/pytorch/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x6b907b (0x7fa2142b907b in /data/users/ezyang/c/pytorch/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x6b6175 (0x7fa2142b6175 in /data/users/ezyang/c/pytorch/torch/lib/libtorch_python.so)
```
and after:
```
#4 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
#5 c10::TensorImpl::throw_cannot_call_with_symbolic(char const*) const from ??:0
#6 c10::TensorImpl::sizes_custom() const [clone .localalias] from TensorImpl.cpp:0
#7 at::meta::structured_mm::meta(at::Tensor const&, at::Tensor const&) from ??:0
#8 at::(anonymous namespace)::wrapper_Meta_mm_out_out(at::Tensor const&, at::Tensor const&, at::Tensor&) from RegisterMeta.cpp:0
#9 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor& (at::Tensor const&, at::Tensor const&, at::Tensor&), &at::(anonymous namespace)::wrapper_Meta_mm_out_out>, at::Tensor&, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&, at::Tensor&> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from RegisterMeta.cpp:0
```
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113207
Approved by: https://github.com/Skylion007
Main: `RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn`
This PR: graph breaks and eager applies the mutation, new tensors are tracked
Fixes https://github.com/pytorch/pytorch/issues/109505 (the original bug does not occur, but a new bug where the mutation isn't applied - because AOTAutograd is not `requires_grad` mutation aware - is mitigated)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113163
Approved by: https://github.com/bdhirsh
Summary:
To be able to get more info on serialization/deserialization events, we add these two fields to the metadata logging:
- file_name
- file_size
Test Plan: buck2 test mode/dev caffe2/caffe2/serialize:inline_container_test
Reviewed By: davidberard98
Differential Revision: D51040426
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113077
Approved by: https://github.com/davidberard98
This PR enables AC + torch.compile to work with FSDP + TP. The fix to
the higher-order op path is that we need to check both tensor and tensor
subclass bases when making a sourceless builder.
NOTE: selective AC + 2D is still not working, need to fix this
separately
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112536
Approved by: https://github.com/yf225
Summary: yolo fixing issues. See Test plan
Test Plan:
buck2 run 'fbcode//mode/dev' fbcode//executorch/examples/portable/test:test_export -- -r test_mv3_export_to_executorch
[Need ACL to repro this but the error message looks straightforward]
buck2 test 'fbcode//mode/dev-nosan' fbcode//pye/model_inventory/nlu_stella_cap:nlu_stella_cap_test -- --exact 'pye/model_inventory/nlu_stella_cap:nlu_stella_cap_test - test_export_to_backend_dynamic_quantized (pye.model_inventory.nlu_stella_cap.NluStellaCapTest.NluStellaCapTest)'
Differential Revision: D51128480
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113296
Approved by: https://github.com/tugsbayasgalan
Currently, when we have 2D composition, a global variable _extensions controls the 2D deviation we need to take in state_dict calls (See https://github.com/pytorch/pytorch/blob/release/2.1/torch/distributed/fsdp/_fsdp_extensions.py#L66-L68). This is problematic when we have both a 2D model and a plain FSDP model in the same dist environment, as the _extensions will be mistakenly turned on for the plain FSDP model, resulting in state_dict error (RuntimeError: No parent device_mesh is found for FSDP device_mesh.).
This PR binds _fsdp_extension to the FSDP instances to make sure that state_dict calls do not interfere with each other when mixing both 2D and 1D parallelism.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113237
Approved by: https://github.com/fduwjj, https://github.com/fegin
Summary:
Test fails on CPU currently with some weird error when `wait_tensor = torch.ops.c10d_functional.wait_tensor.default(all_gather_into_tensor);` runs.
```
[rank2]:[2023-11-08 13:30:29,940] torch.testing._internal.common_distributed: [ERROR] RuntimeError: A view was created in no_grad mode and is being modified inplace with grad mode enabled. This view is the output of a function that returns multiple views. Such functions do not allow the output views to be modified inplace. You should replace the inplace operation by an out-of-place one.
```
https://www.internalfb.com/intern/test/562950070959214/
Test Plan: buck test mode/opt -c fbcode.enable_gpu_sections=true //caffe2/test/distributed/_tensor/experimental:tp_transform
Reviewed By: weifengpy
Differential Revision: D51131676
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113310
Approved by: https://github.com/wanchaol
Currently whenever the sizes or strides are modified for a `TensorImpl` we
eagerly recompute the numel and memory format flags. This is fine for static
shapes as it's all fast C++ code, but for symbolic shapes it runs slow python code.
This instead changes the `SymbolicShapeMeta` object to compute the derived
quantities lazily at the first request. This has the added benefit that we can
now pass assumptions in `empty_tensor_restride` which remove the need to compute
some contiguity flags at all.
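A Python analogy of the laziness being introduced (the real change is in the C++ `SymbolicShapeMeta`; the class name below is a hypothetical stand-in):
```python
import math
from functools import cached_property

class ShapeMetaSketch:  # hypothetical stand-in for the C++ class
    def __init__(self, sizes):
        self.sizes = sizes

    @cached_property
    def numel(self):
        # Computed once, on first access, instead of eagerly on every
        # size/stride update -- the expensive path for symbolic shapes.
        return math.prod(self.sizes)
```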
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112785
Approved by: https://github.com/ezyang
ghstack dependencies: #112689, #112890
masked_scatter_backward was previously implemented as a
CompositeExplicitAutograd, which involved a decomp that calls
masked_select, and masked_select in general produces data-dependent
shapes that inductor doesn't support. But masked_scatter_backward
reshapes the return value of masked_select such that the end result has
a static shape again.
I have converted masked_scatter_backward into an aten op to avoid this
issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109642
Approved by: https://github.com/ezyang
ghstack dependencies: #108170
In this PR, we try to keep the input mutations in the forward graph IFF the input mutation is a data mutation (not a metadata mutation) and the input doesn't require grad. This is for optimizing inductor training graphs. (For more details: https://github.com/pytorch/pytorch/issues/109240)
We keep the input mutation in the graph by wrapping the original callable in a wrapper function that, at the end, adds an input.copy_(updated_input) call, which is then traced via make_fx. Previously, this was only enabled for the forward-only path but unconditionally disabled for the joint graph.
Another caveat is that when we are tracing through tensor subclasses, we won't allow any input mutations to be preserved in the graph. The reason is that it makes the code logic quite ugly for no obvious performance improvement.
Most of the changes in this PR are mechanical and I didn't have to make any change to the partitioner. Previously, forward/backward heavily relied on the metadata field `num_mutated_inps` to figure out whether something is returned as an extra output or not. But now, since we keep some mutations in the graph, we need to propagate something similar to `num_mutated_inps - num_graph_handled_inps`.
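A minimal sketch of the wrapper trick (the functionalization details are elided; `fn` is a stand-in for the original callable):
```python
import torch
from torch.fx.experimental.proxy_tensor import make_fx

def fn(x):
    return x.sin()

def wrapper(x):
    updated = fn(x)
    x.copy_(updated)   # re-applies the input mutation inside the traced graph
    return updated

gm = make_fx(wrapper)(torch.randn(3))
print(gm.graph)  # the graph now contains an explicit copy_ node
```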
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111046
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
Summary:
For a Node: node1 and edge: (node1, node2), since they are observing the same
Tensor, we may want to implicitly share observers, this flag allows people to
turn off this behavior for the output of the node
See the test_allow_implicit_sharing test for use case
Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_allow_implicit_sharing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112929
Approved by: https://github.com/kimishpatel
This PR adds Inductor support for [native c10d_functional ops](https://github.com/pytorch/pytorch/pull/110570).
The Inductor IRs introduced in this PR will replace the existing `CollectiveKernel` IR hierarchy. Compared to the existing collective IRs, the new IRs:
- Are target language agnostic and support AOTInductor.
- Express the constraints solely with read/write deps. This maximizes the potential for buffer reuse.
- Address an issue where out-of-place collective's input buffers could be mutated while being volatilely read.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112439
Approved by: https://github.com/Chillee
This was originally ipiszy's PR: https://github.com/pytorch/pytorch/pull/112358
It turns out that we need to add support for optional types in order to
support fp8 gemm (i.e. scaled_mm). Since our ABI-stable C interface
can't support optional<> directly, I am passing in optional types via
pointer instead.
`AtenTensorHandle`s are already pointers, so nothing needs to change
there. Only value types need to change.
We decided on this approach instead of adding an extra `bool` param to
the callee because this simplifies things. Having the same number of
arguments regardless of whether we are emitting Python / C++ /
ABI-compatible C++ makes codegen easier.
There are a number of existing ABI-compatible functions that have
optional-typed value parameters. Previously, they just assumed they
would never be passed a `nullopt` / `None` at runtime. Changing them to
use pointer types now would break ABI stability, so I have created an
exclude list for those functions.
Finally, I think the current implementation is kind of messy, and only
works for FallbackKernels, even though technically ExternKernels could
also have the same issue. It also doesn't support optional types nested
in lists. I've left FIXME comments for both issues.
Differential Revision: [D51084289](https://our.internmc.facebook.com/intern/diff/D51084289)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112527
Approved by: https://github.com/chenyang78, https://github.com/desertfire
For `_scaled_dot_product_flash_attention` we don't have
`Tensor? attn_mask=None`
but `scaled_dot_product_attention` has. In the original decomp there's a
mixup where I added this argument to
`_scaled_dot_product_flash_attention`.
Fix it so that `_scaled_dot_product_flash_attention` is being decomposed correctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113102
Approved by: https://github.com/ezyang
Related to #94891
**Problem:**
We are trying to disable `printf` in kernels for Pytorch build on ROCm to fix the `torch.sum()` issues for certain community users by disabling `CUDA_KERNEL_ASSERT`, but found that there are still hostcall printf happening in `ReduceSumProdKernel` used by `torch.sum`.
**Reason:**
The reason is that there are `assert` function calls inside `Reduce.cuh`, ( defined as `__assert_fail` ) which caused `printf`.
**Fix:**
This pull request is to change `assert` to `CUDA_KERNEL_ASSERT` so that we can consistently disable assertion/printf in cuda/hip kernel code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113098
Approved by: https://github.com/ezyang
The previous documentation did not appear to accurately describe
the actual semantics in CUDA caching allocator.
When you record stream, we only record a stream use:
```
void recordStream(Block* block, cuda::CUDAStream stream) {
std::lock_guard<std::recursive_mutex> lock(mutex);
if (stream.stream() == block->stream) {
// ignore uses on the allocation stream, since those don't require any
// special synchronization
return;
}
block->stream_uses.insert(stream);
}
```
It is only at deallocation time when we actually install an event on
stream uses that we will subsequently query to determine if the block
can be reused or not.
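In user-facing terms, the semantics correspond to this usage pattern:
```python
import torch

if torch.cuda.is_available():
    side_stream = torch.cuda.Stream()
    x = torch.empty(1024, device="cuda")
    with torch.cuda.stream(side_stream):
        y = x * 2
    # Records a use of x's block on side_stream; the allocator only installs
    # the corresponding event later, when x's memory is actually freed.
    x.record_stream(side_stream)
```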
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113282
Approved by: https://github.com/Skylion007, https://github.com/albanD
One small runtime change: `get_interface_for_device()` now throws
instead of returning None when an interface is not found. Inspecting all
the callsites in the codebase shows that none of them actually check if
the return type is None, so I think this is safe.
I also silenced a bunch of mypy errors around method assignment; mypy
seems unable to handle the subtype checks correctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112974
Approved by: https://github.com/eellison
ghstack dependencies: #112130, #112970, #112971, #112972, #112973
`install_config_module` makes a regular module into a ConfigModule with
extra methods defined on it. mypy thinks those extra methods (or module
functions) are undefined since it cannot analyze something so
dynamic. As a workaround, I've created a fake module that defines these
extra functions, which I import into the config modules during type
checking.
As part of this change, I've also added more types to config_utils.py
and enabled typechecking for torch/_dynamo/config.py.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112130
Approved by: https://github.com/jansel
After #84624, `aten::linalg_vector_norm` started being used instead of `aten::norm`. In the ONNX exporter, the latter leveraged Reduce{L1,L2} when p={1,2}, which resulted in more optimized code in ONNX Runtime.
This PR extends `aten::linalg_vector_norm` to also use Reduce{L1,L2} when ord={1,2}, producing an equivalent ONNX subgraph.
This PR is a WIP. Pending work includes checking argument equivalence between `aten::norm` and `aten::linalg_vector_norm` and possibly re-enabling the tests disabled by #84624.
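A simplified sketch of the mapping, in the style of TorchScript ONNX symbolic functions (argument handling is elided and this is not the exporter's exact code):
```python
def linalg_vector_norm(g, self, ord, dim, keepdim, dtype):
    # Map ord in {1, 2} to the dedicated ONNX reduction ops.
    if ord == 1.0:
        return g.op("ReduceL1", self, axes_i=dim, keepdims_i=int(keepdim))
    if ord == 2.0:
        return g.op("ReduceL2", self, axes_i=dim, keepdims_i=int(keepdim))
    # Other ord values keep the generic abs/pow/ReduceSum/pow decomposition.
    raise NotImplementedError
```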
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113173
Approved by: https://github.com/justinchuby
Summary: We noticed that the nodes created by users lack example values, so they could not be normalized in the normalization pass; we therefore change the format to the normalized format to enable the split-cat merge.
Test Plan: N/A
Differential Revision: D51058817
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113179
Approved by: https://github.com/jackiexu1992
**Summary**:
In #113105 I used the util function `normalize_to_torch_size` to unify the `size` argument that may come in multiple formats. However, that function only handled input of type `Sequence[int]`, so I submit this forward fix to make `normalize_to_torch_size` also handle `size` arguments of type `int` and `torch.Size`. A side product of this fix is that it also enables 3 dtensor op tests (check `test_dtensor_ops.py`).
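A hedged sketch of the widened contract (simplified; the real helper and its exact signature live in the DTensor utilities):
```python
import torch
from typing import Sequence, Union

def normalize_to_torch_size(size: Union[int, Sequence[int], torch.Size]) -> torch.Size:
    if isinstance(size, torch.Size):   # already normalized
        return size
    if isinstance(size, int):          # single dimension
        return torch.Size([size])
    return torch.Size(size)            # Sequence[int]

assert normalize_to_torch_size(4) == torch.Size([4])
assert normalize_to_torch_size([2, 3]) == normalize_to_torch_size(torch.Size([2, 3]))
```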
**Test**:
`pytest test/distributed/_tensor/test_dtensor_ops.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113244
Approved by: https://github.com/wanchaol
ghstack dependencies: #113105
### 🤖 Generated by Copilot at 27084ed
This pull request simplifies and cleans up the code that uses the cuDNN library for convolution, batch normalization, CTC loss, and quantized operations. It removes the unnecessary checks and conditions for older cuDNN versions and the experimental cuDNN v8 API, and ~~replaces them with the stable `cudnn_frontend` API that requires cuDNN v8 or higher. It also adds the dependency and configuration for the `cudnn_frontend` library in the cmake and bazel files.~~ Correction: The v7 API will still be available with this PR, and can still be used, without any changes to the defaults. This change simply always _builds_ the v8 API, and removes the case where _only_ the v7 API is built.
This is a re-land of https://github.com/pytorch/pytorch/pull/91527
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95722
Approved by: https://github.com/malfet
Summary:
Add split and layer_norm_backward.
Note: it is non-trivial to support `split_with_sizes` backward, so we add the `split` operation to support the use case in the model.
Test Plan: unit tests
Differential Revision: D51052966
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113108
Approved by: https://github.com/soulitzer
Reorganized the C++ and Python pytree submodules into a subpackage. I think this will make it easier to implement the abstract `PyTreeAPI` class with two implementations, and it will be much easier for users to switch between the two implementations.
Before:
```text
torch
├── utils
│ ├── _pytree.py
│ ├── _cxx_pytree.py
│ ...
...
```
After:
```text
torch
├── utils
│ ├── _pytree
│ │ ├── __init__.py
│ │ └── api
│ │ ├── __init__.py
│ │ ├── cxx.py
│ │ └── python.py
│ ...
...
```
The `torch.utils._pytree` module will import all APIs from `torch.utils._pytree.api.python`.
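A sketch of what that forwarding shim could look like (simplified; the real `__init__.py` may re-export names explicitly):
```python
# torch/utils/_pytree/__init__.py
from torch.utils._pytree.api.python import *  # noqa: F401,F403
```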
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112278
Approved by: https://github.com/zou3519
ghstack dependencies: #112111
Fixes https://github.com/pytorch/pytorch/issues/112446
This is a doozy of a PR, there's a few important things to keep in mind here:
1) We MUST lift all tensors accessed via attrs to inputs; getattr is a no-go in the graph, as it violates the aot_autograd contract. Furthermore, aot_autograd does not know how to apply in-place ops to intermediary tensors that are attributes (aka from getattr) anyway. Views from ops are fine.
2) `.grad` access handling in dynamo peeks at the underlying value (the real tensor), because re-piping FakeTensors already made with this fake_mode through the builder anew is a no-go.
3) We have no proper mechanism for updating the hint / grapharg.example (the real value in (2) above) midway through the trace.
Therefore, what we need to do is reconcile the difference in grad stashed on grapharg.example. The easiest way to do this is lazily, upon .grad access, by reading the new value off the right fake tensors. We can then make a tensor using that data as a hint to VariableBuilder to make the right VariableTracker. Note that the example value used here (torch.zeros) in the PR, is a dummy value only used as a tracing hint, it does not leak out into real runtime code.
Alternatively, we could implement accumulate_grad_ in python...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112811
Approved by: https://github.com/jansel
The following assumptions are not always valid and need checking:
1. `snode.node` exists
2. `snode.node.layout.size` exists
3. `snode.node.layout.stride` exists
4. `snode.node.name` exists
Also there is no guarantee that there won't be two collectives running at the same time. But it's hard to visualize the overlap in that case. So disable the visualization for that case for now.
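A defensive sketch of the kind of checks this implies (names follow the list above; illustrative, not the actual visualization code):
```python
def describe(snode) -> str:
    node = getattr(snode, "node", None)
    if node is None:
        return "<scheduler node without an IR node>"
    layout = getattr(node, "layout", None)
    size = getattr(layout, "size", "<unknown>") if layout is not None else "<unknown>"
    stride = getattr(layout, "stride", "<unknown>") if layout is not None else "<unknown>"
    return f"{getattr(node, 'name', '<unnamed>')}: size={size}, stride={stride}"
```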
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113066
Approved by: https://github.com/wanchaol
Thanks aakhundov for constructing the test case. This PR was constructed by running the failing test case, and then fixing problems until we got all the way to the end. There are a few distinct fixes:
* AOTAutograd performs equality tests on tensor metadata to determine if a metadata mutation had occurred. If we test i0 vs i1, we should report these are NOT equal, since obviously we have somehow resized the tensor from i0 to i1 (even if, on a particular run, it is possible i0 == i1).
* There's a sketchy fix for `test_aot_autograd_exhaustive_matmul_cpu_float32` where we check if the output shape equals the tangent shape. Unfortunately, the same `definitely_true` treatment does not work here; it still fails on the example. I piled an extra sketchy fix on top of it, where I just try my best to avoid doing the view. Maybe we should have some sort of logging here.
* Partitioner needs to get out a size for unbacked SymInt when partitioning. I just feed it a random heuristic value in this case, similar to how we've been dealing with this in Inductor.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113159
Approved by: https://github.com/aakhundov, https://github.com/bdhirsh
Summary: Turn on the verifier check for the exported program ctor. Note that this effectively detects a large surface of spec violations, so we also spent some time fixing them one by one in this diff.
Test Plan: CI
Differential Revision: D51014944
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113075
Approved by: https://github.com/angelayi
Closes #108931, closes #108932; see also conda-forge/pytorch-cpu-feedstock#203
Currently we compare `CUDA_INCLUDE_DIRS` and expect exact equality
with `CUDAToolkit_INCLUDE_DIR`; however, this fails in the presence of
symbolic links or for split installs where there are multiple include paths.
Given that, it makes sense to loosen the requirement to just version
equality under the assumption that two installs of the same version
should still be compatible.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113174
Approved by: https://github.com/malfet
Summary:
We add a performance test binary for the recently implemented operator `LayerNorm` (D50436726). The difference between this test and the existing `vulkan_conv_arithmetic_perf_test.cpp` and `vulkan_mm_perf_test.cpp` is that
- the existing tests benchmark a specific Vulkan shader such as `vulkan.mm`, `vulkan.addmm`, etc
- but `LayerNorm` is implemented by invoking [a sequence of other operators (shader files)](https://www.internalfb.com/code/fbsource/[ff4989384cacda66a2eed4c800f69c69f6832c52]/fbcode/caffe2/aten/src/ATen/native/vulkan/ops/Layernorm.cpp?lines=94) instead of a single dedicated shader file. Reusing the existing test code wouldn't print meaningful results.
To deal with this, we add a function `extractTotalShaderResultsAndSetState` which aggregates the latency of all invoked shaders except `nchw_to_image` and `image_to_nchw`. This test can be applied to other operators which don't have a dedicated shader file.
Test Plan:
- build the binary
```
(base) luwei@luwei-mbp fbsource % buck2 build -c ndk.debug_info_level=0 -c ndk.static_linking=true -c pt.enable_qpl=0 -c pt.vulkan_use_gpu_diagnostics=1 --target-platforms=ovr_config//platform/android:arm32-fbsource //xplat/caffe2:pt_vulkan_layernorm_perf_test_binAndroid --show-output -c pt.vulkan_full_precision=1
```
- push to device
```
(base) luwei@luwei-mbp fbsource % adb push buck-out/v2/gen/fbsource/f1f3f9bed27e143c/xplat/caffe2/__pt_vulkan_layernorm_perf_test_binAndroid__/pt_vulkan_layernorm_perf_test_binAndroid /data/local/tmp
```
- test on device
```
(base) luwei@luwei-mbp ~ % adb shell /data/local/tmp/pt_vulkan_layernorm_perf_test_binAndroid
```
- output, excerpt below, full test result in P871803721
**it shows that the aggregation of invoked shaders takes 14.2 ms on average**
```
Kernel Name Workgroup Size Duration (ns)
=========== ============== ===========
vulkan.nchw_to_image {75, 75, 19} 1310660
vulkan.nchw_to_image {75, 75, 19} 1313260
vulkan.nchw_to_image {75, 75, 19} 1268748
vulkan.mean_dim_keepdim {1, 75, 19} 878124
vulkan.mean_dim_keepdim {1, 1, 19} 53300
vulkan.mean_dim_keepdim {1, 1, 1} 62660
vulkan.mean_dim_keepdim {1, 75, 19} 871260
vulkan.mean_dim_keepdim {1, 1, 19} 53144
vulkan.mean_dim_keepdim {1, 1, 1} 62400
vulkan.sub {75, 75, 19} 1787760
vulkan.mul {75, 75, 19} 1866904
vulkan.mean_dim_keepdim {1, 75, 19} 868764
vulkan.mean_dim_keepdim {1, 1, 19} 56212
vulkan.mean_dim_keepdim {1, 1, 1} 62400
vulkan.sub {75, 75, 19} 1782872
vulkan.add_scalar {1, 1, 1} 2028
vulkan.pow_tensor_scalar {1, 1, 1} 2236
vulkan.mul {75, 75, 19} 1771276
...
vulkan.add {75, 75, 19} 1909544
vulkan.image_to_nchw {75, 75, 19} 1143844
------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------------------------------------------------------
layer_norm_benchmark/N:75/M:75/P:75/iterations:50/manual_time/threads:1 14.2 ms 48.9 ms 50
```
Reviewed By: yipjustin
Differential Revision: D50940613
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112915
Approved by: https://github.com/yipjustin
Adopt the algorithm from the AVX2 implementation.
This change fixes the test `test_complex_div_underflow_overflow_cpu_complex128`
from `test/test_binary_ufuncs.py`.
At the same time, it breaks some of the `Arithmetics/*.Division` tests
from `vec_test_all_types_ZVECTOR`, but those are also broken on AVX2 and AVX512.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108516
Approved by: https://github.com/ezyang
Fixes #112639
```txt
torch/utils/_sympy/value_ranges.py
torch/utils/_sympy/value_ranges.py:60 in public class `ValueRanges`:
D101: Missing docstring in public class
torch/utils/_sympy/value_ranges.py:68 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/_sympy/value_ranges.py:81 in public method `__contains__`:
D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:86 in public method `tighten`:
D400: First line should end with a period (not 'n')
torch/utils/_sympy/value_ranges.py:90 in public method `__and__`:
D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:103 in public method `__or__`:
D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:113 in public method `is_singleton`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:118 in public method `unknown`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:122 in public method `wrap`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:129 in public method `increasing_map`:
D400: First line should end with a period (not ')')
torch/utils/_sympy/value_ranges.py:135 in public method `decreasing_map`:
D400: First line should end with a period (not ')')
torch/utils/_sympy/value_ranges.py:141 in public method `monotone_map`:
D400: First line should end with a period (not 'g')
torch/utils/_sympy/value_ranges.py:149 in public method `convex_min_zero_map`:
D400: First line should end with a period (not '0')
torch/utils/_sympy/value_ranges.py:149 in public method `convex_min_zero_map`:
D403: First word of the first line should be properly capitalized ('Fn', not 'fn')
torch/utils/_sympy/value_ranges.py:158 in public method `coordinatewise_increasing_map`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/_sympy/value_ranges.py:158 in public method `coordinatewise_increasing_map`:
D400: First line should end with a period (not ':')
torch/utils/_sympy/value_ranges.py:171 in public method `coordinatewise_monotone_map`:
D400: First line should end with a period (not 'e')
torch/utils/_sympy/value_ranges.py:180 in private class `SymPyValueRangeAnalysis`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/_sympy/value_ranges.py:180 in private class `SymPyValueRangeAnalysis`:
D400: First line should end with a period (not 's')
torch/utils/_sympy/value_ranges.py:386 in private method `reciprocal`:
D210: No whitespaces allowed surrounding docstring text
torch/utils/_sympy/value_ranges.py:386 in private method `reciprocal`:
D400: First line should end with a period (not 'n')
torch/utils/_sympy/value_ranges.py:488 in public class `ValueRangeAnalysis`:
D101: Missing docstring in public class
torch/utils/_sympy/value_ranges.py:489 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/_sympy/value_ranges.py:501 in public method `bool_handler`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:506 in public method `default_handler`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:511 in public method `load`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:514 in public method `store`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:517 in public method `reduction`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:520 in public method `index_expr`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:525 in public method `to_dtype`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:558 in public method `square`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:562 in public method `neg`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:566 in public method `truncdiv`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:577 in public method `sub`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:580 in public method `__getattr__`:
D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:585 in public function `bound_sympy`:
D103: Missing docstring in public function
36
torch/utils/_sympy/value_ranges.py:60 in public class `ValueRanges`:
D101: Missing docstring in public class
torch/utils/_sympy/value_ranges.py:68 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/_sympy/value_ranges.py:81 in public method `__contains__`:
D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:86 in public method `tighten`:
D400: First line should end with a period (not 'n')
torch/utils/_sympy/value_ranges.py:90 in public method `__and__`:
D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:103 in public method `__or__`:
D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:113 in public method `is_singleton`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:118 in public method `unknown`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:122 in public method `wrap`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:182 in private class `SymPyValueRangeAnalysis`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/_sympy/value_ranges.py:182 in private class `SymPyValueRangeAnalysis`:
D400: First line should end with a period (not 's')
torch/utils/_sympy/value_ranges.py:388 in private method `reciprocal`:
D210: No whitespaces allowed surrounding docstring text
torch/utils/_sympy/value_ranges.py:388 in private method `reciprocal`:
D400: First line should end with a period (not 'n')
torch/utils/_sympy/value_ranges.py:490 in public class `ValueRangeAnalysis`:
D101: Missing docstring in public class
torch/utils/_sympy/value_ranges.py:491 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/_sympy/value_ranges.py:503 in public method `bool_handler`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:508 in public method `default_handler`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:513 in public method `load`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:516 in public method `store`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:519 in public method `reduction`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:522 in public method `index_expr`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:527 in public method `to_dtype`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:560 in public method `square`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:564 in public method `neg`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:568 in public method `truncdiv`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:579 in public method `sub`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:582 in public method `__getattr__`:
D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:587 in public function `bound_sympy`:
D103: Missing docstring in public function
28
torch/utils/viz/_cycles.py
torch/utils/viz/_cycles.py:14 in public function `observe_garbage`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:207 in public function `object_annotation`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/viz/_cycles.py:207 in public function `object_annotation`:
D400: First line should end with a period (not 'g')
torch/utils/viz/_cycles.py:256 in public class `Node`:
D101: Missing docstring in public class
torch/utils/viz/_cycles.py:262 in public function `create_graph`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:308 in public function `escape`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:312 in public function `is_cuda_tensor`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:315 in public function `cuda_allocation_context`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:335 in public function `to_dot`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:406 in public function `to_html`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:416 in public function `observe_tensor_cycles`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:429 in public function `warn_tensor_cycles`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/viz/_cycles.py:429 in public function `warn_tensor_cycles`:
D400: First line should end with a period (not 'p')
torch/utils/viz/_cycles.py:429 in public function `warn_tensor_cycles`:
D401: First line should be in imperative mood; try rephrasing (found 'Reference')
14
torch/utils/viz/_cycles.py:14 in public function `observe_garbage`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:256 in public class `Node`:
D101: Missing docstring in public class
torch/utils/viz/_cycles.py:262 in public function `create_graph`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:308 in public function `escape`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:312 in public function `is_cuda_tensor`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:315 in public function `cuda_allocation_context`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:335 in public function `to_dot`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:406 in public function `to_html`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:416 in public function `observe_tensor_cycles`:
D103: Missing docstring in public function
9
torch/distributed/argparse_util.py
torch/distributed/argparse_util.py:1 at module level:
D100: Missing docstring in public module
torch/distributed/argparse_util.py:13 in public class `env`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/argparse_util.py:13 in public class `env`:
D400: First line should end with a period (not 'g')
torch/distributed/argparse_util.py:13 in public class `env`:
D412: No blank lines allowed between a section header and its content ('Example')
torch/distributed/argparse_util.py:43 in public method `__init__`:
D107: Missing docstring in __init__
torch/distributed/argparse_util.py:56 in public method `__call__`:
D102: Missing docstring in public method
torch/distributed/argparse_util.py:61 in public class `check_env`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/argparse_util.py:61 in public class `check_env`:
D400: First line should end with a period (not 's')
torch/distributed/argparse_util.py:61 in public class `check_env`:
D412: No blank lines allowed between a section header and its content ('Example')
torch/distributed/argparse_util.py:97 in public method `__init__`:
D107: Missing docstring in __init__
torch/distributed/argparse_util.py:102 in public method `__call__`:
D102: Missing docstring in public method
11
torch/distributed/argparse_util.py:1 at module level:
D100: Missing docstring in public module
torch/distributed/argparse_util.py:43 in public method `__init__`:
D107: Missing docstring in __init__
torch/distributed/argparse_util.py:56 in public method `__call__`:
D102: Missing docstring in public method
torch/distributed/argparse_util.py:97 in public method `__init__`:
D107: Missing docstring in __init__
torch/distributed/argparse_util.py:102 in public method `__call__`:
D102: Missing docstring in public method
5
torch/distributed/_composable_state.py
torch/distributed/_composable_state.py:20 in private function `_get_module_state`:
D202: No blank lines allowed after function docstring (found 1)
torch/distributed/_composable_state.py:20 in private function `_get_module_state`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/_composable_state.py:20 in private function `_get_module_state`:
D400: First line should end with a period (not '`')
3
0
torch/distributed/launch.py
torch/distributed/launch.py:1 at module level:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/launch.py:1 at module level:
D400: First line should end with a period (not 'd')
torch/distributed/launch.py:156 in public function `parse_args`:
D103: Missing docstring in public function
torch/distributed/launch.py:171 in public function `launch`:
D103: Missing docstring in public function
torch/distributed/launch.py:180 in public function `main`:
D103: Missing docstring in public function
5
torch/distributed/launch.py:157 in public function `parse_args`:
D103: Missing docstring in public function
torch/distributed/launch.py:172 in public function `launch`:
D103: Missing docstring in public function
torch/distributed/launch.py:181 in public function `main`:
D103: Missing docstring in public function
3
torch/distributed/remote_device.py
torch/distributed/remote_device.py:1 at module level:
D100: Missing docstring in public module
torch/distributed/remote_device.py:81 in private method `worker_name`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/remote_device.py:81 in private method `worker_name`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/distributed/remote_device.py:88 in private method `rank`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/remote_device.py:88 in private method `rank`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/distributed/remote_device.py:95 in private method `device`:
D200: One-line docstring should fit on one line with quotes (found 3)
torch/distributed/remote_device.py:95 in private method `device`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
7
torch/distributed/remote_device.py:1 at module level:
D100: Missing docstring in public module
torch/distributed/remote_device.py:85 in private method `rank`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/remote_device.py:85 in private method `rank`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
3
torch/distributed/rendezvous.py
torch/distributed/rendezvous.py:1 at module level:
D100: Missing docstring in public module
torch/distributed/rendezvous.py:23 in public function `register_rendezvous_handler`:
D401: First line should be in imperative mood (perhaps 'Register', not 'Registers')
torch/distributed/rendezvous.py:88 in public function `rendezvous`:
D103: Missing docstring in public function
torch/distributed/rendezvous.py:147 in private function `_create_c10d_store`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/rendezvous.py:147 in private function `_create_c10d_store`:
D400: First line should end with a period (not 'r')
5
torch/distributed/rendezvous.py:1 at module level:
D100: Missing docstring in public module
torch/distributed/rendezvous.py:89 in public function `rendezvous`:
D103: Missing docstring in public function
2
torch/distributed/run.py
torch/distributed/run.py:9 at module level:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/run.py:9 at module level:
D400: First line should end with a period (not '`')
torch/distributed/run.py:393 in public function `get_args_parser`:
D202: No blank lines allowed after function docstring (found 1)
torch/distributed/run.py:393 in public function `get_args_parser`:
D401: First line should be in imperative mood; try rephrasing (found 'Helper')
torch/distributed/run.py:610 in public function `parse_args`:
D103: Missing docstring in public function
torch/distributed/run.py:615 in public function `parse_min_max_nnodes`:
D103: Missing docstring in public function
torch/distributed/run.py:629 in public function `determine_local_world_size`:
D103: Missing docstring in public function
torch/distributed/run.py:670 in public function `get_rdzv_endpoint`:
D103: Missing docstring in public function
torch/distributed/run.py:677 in public function `get_use_env`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/run.py:677 in public function `get_use_env`:
D401: First line should be in imperative mood (perhaps 'Retrieve', not 'Retrieves')
torch/distributed/run.py:689 in public function `config_from_args`:
D103: Missing docstring in public function
torch/distributed/run.py:770 in public function `run_script_path`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/run.py:770 in public function `run_script_path`:
D401: First line should be in imperative mood (perhaps 'Run', not 'Runs')
torch/distributed/run.py:781 in public function `run`:
D103: Missing docstring in public function
torch/distributed/run.py:804 in public function `main`:
D103: Missing docstring in public function
15
torch/distributed/run.py:611 in public function `parse_args`:
D103: Missing docstring in public function
torch/distributed/run.py:616 in public function `parse_min_max_nnodes`:
D103: Missing docstring in public function
torch/distributed/run.py:630 in public function `determine_local_world_size`:
D103: Missing docstring in public function
torch/distributed/run.py:671 in public function `get_rdzv_endpoint`:
D103: Missing docstring in public function
torch/distributed/run.py:691 in public function `config_from_args`:
D103: Missing docstring in public function
torch/distributed/run.py:784 in public function `run`:
D103: Missing docstring in public function
torch/distributed/run.py:807 in public function `main`:
D103: Missing docstring in public function
7
torch/distributed/__init__.py
torch/distributed/__init__.py:1 at module level:
D104: Missing docstring in public package
torch/distributed/__init__.py:8 in public function `is_available`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/__init__.py:8 in public function `is_available`:
D400: First line should end with a period (not ',')
torch/distributed/__init__.py:8 in public function `is_available`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
4
torch/distributed/__init__.py:1 at module level:
D104: Missing docstring in public package
1
torch/distributed/utils.py:1 at module level:
D100: Missing docstring in public module
torch/distributed/utils.py:16 in private function `_pack_kwargs`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/utils.py:16 in private function `_pack_kwargs`:
D400: First line should end with a period (not ')')
torch/distributed/utils.py:47 in private function `_cast_forward_inputs`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/utils.py:88 in private function `_recursive_to`:
D200: One-line docstring should fit on one line with quotes (found 3)
torch/distributed/utils.py:141 in private function `_p_assert`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/utils.py:141 in private function `_p_assert`:
D209: Multi-line docstring closing quotes should be on a separate line
torch/distributed/utils.py:141 in private function `_p_assert`:
D400: First line should end with a period (not 't')
torch/distributed/utils.py:141 in private function `_p_assert`:
D401: First line should be in imperative mood; try rephrasing (found 'This')
torch/distributed/utils.py:275 in private function `_sync_module_states`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/utils.py:275 in private function `_sync_module_states`:
D400: First line should end with a period (not 'n')
torch/distributed/utils.py:275 in private function `_sync_module_states`:
D401: First line should be in imperative mood (perhaps 'Sync', not 'Syncs')
torch/distributed/utils.py:300 in private function `_sync_params_and_buffers`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/utils.py:300 in private function `_sync_params_and_buffers`:
D400: First line should end with a period (not 'y')
torch/distributed/utils.py:300 in private function `_sync_params_and_buffers`:
D401: First line should be in imperative mood (perhaps 'Synchronize', not 'Synchronizes')
15
torch/distributed/utils.py:1 at module level:
D100: Missing docstring in public module
1
```
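For context, a typical fix for the D400/D401 warnings listed above looks like this (illustrative):
```python
def rendezvous(url):
    # Before: """Returns an iterator over rendezvous results"""  (D401, D400)
    """Return an iterator over rendezvous results."""
    ...
```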
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112953
Approved by: https://github.com/weifengpy
Summary:
Add an ability to customize log lines and additional template-like behavior to enrich log information.
Motivation:
a) Log stream processing/aggregation gains additional value when it includes information about the global rank. An extension of this is that it will be easier to map ranks to hosts from log stream information (less relevant at the moment).
b) Users can easily map the failure to the right rank without matching node rank offset+local rank.
Implementation:
- BC change - keeps the log line prefix as `[<role name><local rank>]:`
- Optional env variable TORCHELASTIC_LOG_LINE_HEADER that will be used as the prefix when specified; it currently exposes `role_name`, `rank`, and `local_rank` variables that will be bound when the agent assigns the ranks.
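A hedged usage sketch (the placeholder syntax below is a guess for illustration; the exact template format is defined by the PR):
```python
import os

# role_name / rank / local_rank are the variables the agent binds when it
# assigns ranks; the ${...} syntax here is illustrative only.
os.environ["TORCHELASTIC_LOG_LINE_HEADER"] = "[${role_name}-${rank}]:"
# Subsequent torchelastic launches would then prefix each captured log line
# with, e.g., "[trainer-3]:" instead of the default "[trainer3]:".
```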
Test Plan:
CI
https://fburl.com/mlhub/mzx5xspv
Differential Revision: D50584590
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112357
Approved by: https://github.com/kiukchung
Summary:
We've made the following changes:
- The new way to use the API is `m.impl_abstract_pystub(module, context)`.
Every subsequent m.def of an op inside the TORCH_LIBRARY block gives
the op the `impl_abstract_pystub`.
- Added a mechanism to determine if an operator was defined in Python or C++.
Library.define in Python appends the op to a global set, which is analogous
to what we do for tracking Library.impl.
- If someone does `torch.library.impl_abstract` in Python for an operator, then
we require that it has an `impl_abstract_pystub` specified and we also check
that the module in the `impl_abstract_pystub` is the same as the module where
the call to `torch.library.impl_abstract` exists.
- Unfortunately we can't check the "context" (which is the buck target on
buck-based systems) because buck sits above us.
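For the Python side, a hedged sketch of a registration these checks apply to (`mylib::my_op` is a hypothetical operator whose C++ `m.def` would carry the matching `impl_abstract_pystub`):
```python
import torch

@torch.library.impl_abstract("mylib::my_op")  # hypothetical op
def my_op_abstract(x: torch.Tensor) -> torch.Tensor:
    # Abstract (fake) implementation: only output metadata matters here.
    return torch.empty_like(x)
```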
bypass-github-export-checks
Test Plan: - existing tests
Differential Revision: D51080493
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113182
Approved by: https://github.com/ezyang
Fixes #112632
Before: 171
```
torch/backends/_nnapi/prepare.py:24 in public method `__init__`:
D107: Missing docstring in __init__
torch/backends/_nnapi/prepare.py:46 in public method `init`:
D102: Missing docstring in public method
torch/backends/_nnapi/prepare.py:60 in public method `forward`:
D102: Missing docstring in public method
torch/backends/_nnapi/prepare.py:94 in public function `convert_model_to_nnapi`:
D103: Missing docstring in public function
torch/backends/_nnapi/prepare.py:153 in public function `process_for_nnapi`:
D103: Missing docstring in public function
torch/backends/_nnapi/prepare.py:177 in private nested class `ShapeComputeModule`:
D400: First line should end with a period (not 'n')
torch/backends/_nnapi/serializer.py:19 in public class `NNAPI_OperandCode`:
D101: Missing docstring in public class
torch/backends/_nnapi/serializer.py:35 in public class `NNAPI_OperationCode`:
D101: Missing docstring in public class
torch/backends/_nnapi/serializer.py:133 in public class `NNAPI_FuseCode`:
D101: Missing docstring in public class
torch/backends/_nnapi/serializer.py:140 in public class `OperandValueSourceType`:
D101: Missing docstring in public class
torch/backends/_nnapi/serializer.py:150 in public class `TorchScalarTypes`:
D101: Missing docstring in public class
torch/backends/_nnapi/serializer.py:154 in public function `approx_equal`:
D103: Missing docstring in public function
torch/backends/_nnapi/serializer.py:158 in public function `tensor_size`:
D103: Missing docstring in public function
torch/backends/_nnapi/serializer.py:172 in public function `change_element`:
D103: Missing docstring in public function
torch/backends/_nnapi/serializer.py:194 in public class `DimOrder`:
D101: Missing docstring in public class
torch/backends/_nnapi/serializer.py:225 in public method `use_nchw`:
D102: Missing docstring in public method
torch/backends/_nnapi/serializer.py:233 in public function `broadcast_shapes`:
D103: Missing docstring in public function
torch/backends/_nnapi/serializer.py:260 in public function `get_conv_pool_shape`:
D103: Missing docstring in public function
torch/backends/_nnapi/serializer.py:284 in public function `fix_shape`:
D103: Missing docstring in public function
torch/backends/_nnapi/serializer.py:301 in public function `reverse_map_dim`:
D103: Missing docstring in public function
torch/backends/_nnapi/serializer.py:312 in public function `flex_name`:
D103: Missing docstring in public function
torch/backends/_nnapi/serializer.py:1337 in private method `_do_add_binary`:
D400: First line should end with a period (not 's')
torch/backends/_nnapi/serializer.py:1337 in private method `_do_add_binary`:
D401: First line should be in imperative mood; try rephrasing (found 'Helper')
torch/backends/_nnapi/serializer.py:2180 in public function `serialize_model`:
D202: No blank lines allowed after function docstring (found 1)
torch/backends/_nnapi/serializer.py:2180 in public function `serialize_model`:
D205: 1 blank line required between summary line and description (found 0)
torch/backends/_nnapi/serializer.py:2180 in public function `serialize_model`:
D400: First line should end with a period (not ':')
torch/backends/cuda/__init__.py:1 at module level:
D104: Missing docstring in public package
torch/backends/cuda/__init__.py:30 in public function `is_built`:
D205: 1 blank line required between summary line and description (found 0)
torch/backends/cuda/__init__.py:30 in public function `is_built`:
D209: Multi-line docstring closing quotes should be on a separate line
torch/backends/cuda/__init__.py:30 in public function `is_built`:
D400: First line should end with a period (not 's')
torch/backends/cuda/__init__.py:30 in public function `is_built`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/backends/cuda/__init__.py:37 in public class `cuFFTPlanCacheAttrContextProp`:
D101: Missing docstring in public class
torch/backends/cuda/__init__.py:40 in public method `__init__`:
D107: Missing docstring in __init__
torch/backends/cuda/__init__.py:44 in public method `__get__`:
D105: Missing docstring in magic method
torch/backends/cuda/__init__.py:47 in public method `__set__`:
D105: Missing docstring in magic method
torch/backends/cuda/__init__.py:54 in public class `cuFFTPlanCache`:
D205: 1 blank line required between summary line and description (found 0)
torch/backends/cuda/__init__.py:54 in public class `cuFFTPlanCache`:
D400: First line should end with a period (not 'e')
torch/backends/cuda/__init__.py:60 in public method `__init__`:
D107: Missing docstring in __init__
torch/backends/cuda/__init__.py:73 in public method `clear`:
D102: Missing docstring in public method
torch/backends/cuda/__init__.py:78 in public class `cuFFTPlanCacheManager`:
D205: 1 blank line required between summary line and description (found 0)
torch/backends/cuda/__init__.py:78 in public class `cuFFTPlanCacheManager`:
D400: First line should end with a period (not ',')
torch/backends/cuda/__init__.py:89 in public method `__init__`:
D107: Missing docstring in __init__
torch/backends/cuda/__init__.py:93 in public method `__getitem__`:
D105: Missing docstring in magic method
torch/backends/cuda/__init__.py:106 in public method `__getattr__`:
D105: Missing docstring in magic method
torch/backends/cuda/__init__.py:109 in public method `__setattr__`:
D105: Missing docstring in magic method
torch/backends/cuda/__init__.py:116 in public class `cuBLASModule`:
D101: Missing docstring in public class
torch/backends/cuda/__init__.py:117 in public method `__getattr__`:
D105: Missing docstring in magic method
torch/backends/cuda/__init__.py:126 in public method `__setattr__`:
D105: Missing docstring in magic method
torch/backends/cuda/__init__.py:147 in public function `preferred_linalg_library`:
D202: No blank lines allowed after function docstring (found 1)
torch/backends/cuda/__init__.py:204 in public class `SDPBackend`:
D204: 1 blank line required after class docstring (found 0)
torch/backends/cudnn/__init__.py:1 at module level:
D104: Missing docstring in public package
torch/backends/cudnn/__init__.py:81 in public function `version`:
D400: First line should end with a period (not 'N')
torch/backends/cudnn/__init__.py:81 in public function `version`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/backends/cudnn/__init__.py:95 in public function `is_available`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/backends/cudnn/__init__.py:99 in public function `is_acceptable`:
D103: Missing docstring in public function
torch/backends/cudnn/__init__.py:122 in public function `set_flags`:
D103: Missing docstring in public function
torch/backends/cudnn/__init__.py:150 in public function `flags`:
D103: Missing docstring in public function
torch/backends/cudnn/__init__.py:174 in public class `CudnnModule`:
D101: Missing docstring in public class
torch/backends/cudnn/__init__.py:175 in public method `__init__`:
D107: Missing docstring in __init__
torch/backends/mkl/__init__.py:1 at module level:
D104: Missing docstring in public package
torch/backends/mkl/__init__.py:5 in public function `is_available`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/backends/mkl/__init__.py:14 in public class `verbose`:
D205: 1 blank line required between summary line and description (found 0)
torch/backends/mkl/__init__.py:14 in public class `verbose`:
D400: First line should end with a period (not 'y')
torch/backends/mkl/__init__.py:41 in public method `__init__`:
D107: Missing docstring in __init__
torch/backends/mkl/__init__.py:44 in public method `__enter__`:
D105: Missing docstring in magic method
torch/backends/mkl/__init__.py:53 in public method `__exit__`:
D105: Missing docstring in magic method
torch/backends/mkldnn/__init__.py:1 at module level:
D104: Missing docstring in public package
torch/backends/mkldnn/__init__.py:9 in public function `is_available`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/backends/mkldnn/__init__.py:19 in public class `verbose`:
D205: 1 blank line required between summary line and description (found 0)
torch/backends/mkldnn/__init__.py:19 in public class `verbose`:
D400: First line should end with a period (not 'y')
torch/backends/mkldnn/__init__.py:47 in public method `__init__`:
D107: Missing docstring in __init__
torch/backends/mkldnn/__init__.py:50 in public method `__enter__`:
D105: Missing docstring in magic method
torch/backends/mkldnn/__init__.py:59 in public method `__exit__`:
D105: Missing docstring in magic method
torch/backends/mkldnn/__init__.py:64 in public function `set_flags`:
D103: Missing docstring in public function
torch/backends/mkldnn/__init__.py:71 in public function `flags`:
D103: Missing docstring in public function
torch/backends/mkldnn/__init__.py:81 in public class `MkldnnModule`:
D101: Missing docstring in public class
torch/backends/mkldnn/__init__.py:82 in public method `__init__`:
D107: Missing docstring in __init__
torch/backends/openmp/__init__.py:1 at module level:
D104: Missing docstring in public package
torch/backends/openmp/__init__.py:5 in public function `is_available`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/nn/intrinsic/qat/modules/conv_fused.py:2 at module level:
D400: First line should end with a period (not 's')
torch/nn/intrinsic/qat/modules/linear_fused.py:2 at module level:
D400: First line should end with a period (not 's')
torch/nn/intrinsic/qat/modules/linear_relu.py:2 at module level:
D400: First line should end with a period (not 's')
torch/nn/qat/__init__.py:2 at module level:
D400: First line should end with a period (not 's')
torch/nn/qat/dynamic/__init__.py:2 at module level:
D400: First line should end with a period (not 's')
torch/nn/qat/dynamic/modules/linear.py:2 at module level:
D400: First line should end with a period (not 's')
torch/nn/qat/modules/__init__.py:2 at module level:
D400: First line should end with a period (not 's')
torch/nn/qat/modules/conv.py:2 at module level:
D400: First line should end with a period (not 's')
torch/nn/qat/modules/embedding_ops.py:2 at module level:
D400: First line should end with a period (not 's')
torch/nn/qat/modules/linear.py:2 at module level:
D400: First line should end with a period (not 's')
torch/nn/quantizable/modules/activation.py:2 at module level:
D400: First line should end with a period (not 's')
torch/nn/quantizable/modules/rnn.py:2 at module level:
D400: First line should end with a period (not 's')
torch/nn/quantized/_reference/modules/__init__.py:2 at module level:
D400: First line should end with a period (not 's')
torch/nn/quantized/_reference/modules/conv.py:2 at module level:
D400: First line should end with a period (not 's')
torch/nn/quantized/_reference/modules/linear.py:2 at module level:
D400: First line should end with a period (not 's')
torch/nn/quantized/_reference/modules/rnn.py:2 at module level:
D400: First line should end with a period (not 's')
torch/nn/quantized/_reference/modules/sparse.py:2 at module level:
D400: First line should end with a period (not 's')
torch/nn/quantized/_reference/modules/utils.py:2 at module level:
D400: First line should end with a period (not 's')
torch/nn/quantized/dynamic/modules/__init__.py:2 at module level:
D400: First line should end with a period (not 's')
torch/nn/quantized/dynamic/modules/conv.py:2 at module level:
D400: First line should end with a period (not 's')
torch/nn/quantized/dynamic/modules/linear.py:2 at module level:
D400: First line should end with a period (not 's')
torch/nn/quantized/dynamic/modules/rnn.py:2 at module level:
D400: First line should end with a period (not 's')
torch/nn/quantized/functional.py:1 at module level:
D400: First line should end with a period (not 'l')
torch/nn/quantized/modules/__init__.py:1 at module level:
D400: First line should end with a period (not 's')
torch/nn/quantized/modules/activation.py:2 at module level:
D400: First line should end with a period (not 's')
torch/nn/quantized/modules/batchnorm.py:2 at module level:
D400: First line should end with a period (not 's')
torch/nn/quantized/modules/conv.py:2 at module level:
D400: First line should end with a period (not 's')
torch/nn/quantized/modules/dropout.py:2 at module level:
D400: First line should end with a period (not 's')
torch/nn/quantized/modules/embedding_ops.py:2 at module level:
D400: First line should end with a period (not 's')
torch/nn/quantized/modules/functional_modules.py:2 at module level:
D400: First line should end with a period (not 's')
torch/nn/quantized/modules/linear.py:2 at module level:
D400: First line should end with a period (not 's')
torch/nn/quantized/modules/normalization.py:2 at module level:
D400: First line should end with a period (not 's')
torch/nn/quantized/modules/rnn.py:2 at module level:
D400: First line should end with a period (not 's')
torch/nn/quantized/modules/utils.py:2 at module level:
D400: First line should end with a period (not 's')
torch/nn/utils/_expanded_weights/conv_utils.py:13 in public function `conv_picker`:
D103: Missing docstring in public function
torch/nn/utils/_expanded_weights/conv_utils.py:23 in public function `conv_args_and_kwargs`:
D103: Missing docstring in public function
torch/nn/utils/_expanded_weights/conv_utils.py:31 in public function `conv_normalizer`:
D103: Missing docstring in public function
torch/nn/utils/_expanded_weights/conv_utils.py:35 in public function `conv_input_for_string_padding`:
D103: Missing docstring in public function
torch/nn/utils/_expanded_weights/conv_utils.py:43 in public function `int_padding_for_string_padding`:
D103: Missing docstring in public function
torch/nn/utils/_expanded_weights/conv_utils.py:59 in public function `conv_padding_for_same`:
D103: Missing docstring in public function
torch/nn/utils/_expanded_weights/conv_utils.py:66 in public function `conv_backward`:
D103: Missing docstring in public function
torch/nn/utils/_expanded_weights/conv_utils.py:131 in public function `conv_unfold_weight_grad_sample`:
D103: Missing docstring in public function
torch/nn/utils/_expanded_weights/conv_utils.py:166 in public function `conv_group_weight_grad_sample`:
D103: Missing docstring in public function
torch/nn/utils/_expanded_weights/conv_utils.py:189 in public function `unfold3d`:
D202: No blank lines allowed after function docstring (found 1)
torch/nn/utils/_expanded_weights/conv_utils.py:189 in public function `unfold3d`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_expanded_weights/conv_utils.py:189 in public function `unfold3d`:
D401: First line should be in imperative mood (perhaps 'Extract', not 'Extracts')
torch/nn/utils/_expanded_weights/expanded_weights_utils.py:6 in public function `is_batch_first`:
D103: Missing docstring in public function
torch/nn/utils/_expanded_weights/expanded_weights_utils.py:19 in public function `standard_kwargs`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_expanded_weights/expanded_weights_utils.py:19 in public function `standard_kwargs`:
D300: Use """triple double quotes""" (found '''-quotes)
torch/nn/utils/_expanded_weights/expanded_weights_utils.py:19 in public function `standard_kwargs`:
D400: First line should end with a period (not 'e')
torch/nn/utils/_expanded_weights/expanded_weights_utils.py:28 in public function `forward_helper`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_expanded_weights/expanded_weights_utils.py:28 in public function `forward_helper`:
D300: Use """triple double quotes""" (found '''-quotes)
torch/nn/utils/_expanded_weights/expanded_weights_utils.py:28 in public function `forward_helper`:
D400: First line should end with a period (not ')')
torch/nn/utils/_expanded_weights/expanded_weights_utils.py:84 in public function `maybe_scale_by_batch_size`:
D103: Missing docstring in public function
torch/nn/utils/_expanded_weights/expanded_weights_utils.py:90 in public function `set_grad_sample_if_exists`:
D103: Missing docstring in public function
torch/nn/utils/_expanded_weights/expanded_weights_utils.py:108 in public function `unpack_expanded_weight_or_tensor`:
D103: Missing docstring in public function
torch/nn/utils/_expanded_weights/expanded_weights_utils.py:123 in public function `sum_over_all_but_batch_and_last_n`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_expanded_weights/expanded_weights_utils.py:123 in public function `sum_over_all_but_batch_and_last_n`:
D400: First line should end with a period (not 't')
torch/nn/utils/_expanded_weights/expanded_weights_utils.py:123 in public function `sum_over_all_but_batch_and_last_n`:
D401: First line should be in imperative mood (perhaps 'Calculate', not 'Calculates')
torch/nn/utils/convert_parameters.py:1 at module level:
D100: Missing docstring in public module
torch/nn/utils/convert_parameters.py:57 in private function `_check_param_device`:
D202: No blank lines allowed after function docstring (found 1)
torch/nn/utils/convert_parameters.py:57 in private function `_check_param_device`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/convert_parameters.py:57 in private function `_check_param_device`:
D400: First line should end with a period (not 'd')
torch/nn/utils/convert_parameters.py:57 in private function `_check_param_device`:
D401: First line should be in imperative mood; try rephrasing (found 'This')
torch/nn/utils/rnn.py:1 at module level:
D100: Missing docstring in public module
torch/nn/utils/rnn.py:28 in public class `PackedSequence`:
D204: 1 blank line required after class docstring (found 0)
torch/nn/utils/rnn.py:63 in public method `__new__`:
D102: Missing docstring in public method
torch/nn/utils/rnn.py:73 in public method `pin_memory`:
D102: Missing docstring in public method
torch/nn/utils/rnn.py:80 in public method `cuda`:
D102: Missing docstring in public method
torch/nn/utils/rnn.py:87 in public method `cpu`:
D102: Missing docstring in public method
torch/nn/utils/rnn.py:94 in public method `double`:
D102: Missing docstring in public method
torch/nn/utils/rnn.py:97 in public method `float`:
D102: Missing docstring in public method
torch/nn/utils/rnn.py:100 in public method `half`:
D102: Missing docstring in public method
torch/nn/utils/rnn.py:103 in public method `long`:
D102: Missing docstring in public method
torch/nn/utils/rnn.py:106 in public method `int`:
D102: Missing docstring in public method
torch/nn/utils/rnn.py:109 in public method `short`:
D102: Missing docstring in public method
torch/nn/utils/rnn.py:112 in public method `char`:
D102: Missing docstring in public method
torch/nn/utils/rnn.py:115 in public method `byte`:
D102: Missing docstring in public method
torch/nn/utils/rnn.py:119 in public method `to`:
D202: No blank lines allowed after function docstring (found 1)
torch/nn/utils/rnn.py:119 in public method `to`:
D401: First line should be in imperative mood (perhaps 'Perform', not 'Performs')
torch/nn/utils/rnn.py:146 in public method `is_cuda`:
D400: First line should end with a period (not 'u')
torch/nn/utils/rnn.py:150 in public method `is_pinned`:
D400: First line should end with a period (not 'y')
torch/nn/utils/rnn.py:150 in public method `is_pinned`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/nn/utils/rnn.py:198 in public function `invert_permutation`:
D103: Missing docstring in public function
torch/nn/utils/rnn.py:274 in public function `pad_packed_sequence`:
D401: First line should be in imperative mood (perhaps 'Pad', not 'Pads')
torch/nn/utils/rnn.py:347 in public function `pad_sequence`:
D202: No blank lines allowed after function docstring (found 1)
torch/nn/utils/rnn.py:347 in public function `pad_sequence`:
D400: First line should end with a period (not '`')
torch/nn/utils/rnn.py:408 in public function `unpad_sequence`:
D202: No blank lines allowed after function docstring (found 1)
torch/nn/utils/rnn.py:408 in public function `unpad_sequence`:
D400: First line should end with a period (not 's')
torch/nn/utils/rnn.py:454 in public function `pack_sequence`:
D400: First line should end with a period (not 's')
torch/nn/utils/rnn.py:490 in public function `unpack_sequence`:
D202: No blank lines allowed after function docstring (found 1)
torch/nn/utils/rnn.py:490 in public function `unpack_sequence`:
D400: First line should end with a period (not 's')
171
```
After: 81
```
torch/backends/_nnapi/prepare.py:24 in public method `__init__`:
D107: Missing docstring in __init__
torch/backends/_nnapi/prepare.py:46 in public method `init`:
D102: Missing docstring in public method
torch/backends/_nnapi/prepare.py:60 in public method `forward`:
D102: Missing docstring in public method
torch/backends/_nnapi/prepare.py:94 in public function `convert_model_to_nnapi`:
D103: Missing docstring in public function
torch/backends/_nnapi/prepare.py:153 in public function `process_for_nnapi`:
D103: Missing docstring in public function
torch/backends/_nnapi/serializer.py:19 in public class `NNAPI_OperandCode`:
D101: Missing docstring in public class
torch/backends/_nnapi/serializer.py:35 in public class `NNAPI_OperationCode`:
D101: Missing docstring in public class
torch/backends/_nnapi/serializer.py:133 in public class `NNAPI_FuseCode`:
D101: Missing docstring in public class
torch/backends/_nnapi/serializer.py:140 in public class `OperandValueSourceType`:
D101: Missing docstring in public class
torch/backends/_nnapi/serializer.py:150 in public class `TorchScalarTypes`:
D101: Missing docstring in public class
torch/backends/_nnapi/serializer.py:154 in public function `approx_equal`:
D103: Missing docstring in public function
torch/backends/_nnapi/serializer.py:158 in public function `tensor_size`:
D103: Missing docstring in public function
torch/backends/_nnapi/serializer.py:172 in public function `change_element`:
D103: Missing docstring in public function
torch/backends/_nnapi/serializer.py:194 in public class `DimOrder`:
D101: Missing docstring in public class
torch/backends/_nnapi/serializer.py:225 in public method `use_nchw`:
D102: Missing docstring in public method
torch/backends/_nnapi/serializer.py:233 in public function `broadcast_shapes`:
D103: Missing docstring in public function
torch/backends/_nnapi/serializer.py:260 in public function `get_conv_pool_shape`:
D103: Missing docstring in public function
torch/backends/_nnapi/serializer.py:284 in public function `fix_shape`:
D103: Missing docstring in public function
torch/backends/_nnapi/serializer.py:301 in public function `reverse_map_dim`:
D103: Missing docstring in public function
torch/backends/_nnapi/serializer.py:312 in public function `flex_name`:
D103: Missing docstring in public function
torch/backends/cuda/__init__.py:1 at module level:
D104: Missing docstring in public package
torch/backends/cuda/__init__.py:39 in public class `cuFFTPlanCacheAttrContextProp`:
D101: Missing docstring in public class
torch/backends/cuda/__init__.py:42 in public method `__init__`:
D107: Missing docstring in __init__
torch/backends/cuda/__init__.py:46 in public method `__get__`:
D105: Missing docstring in magic method
torch/backends/cuda/__init__.py:49 in public method `__set__`:
D105: Missing docstring in magic method
torch/backends/cuda/__init__.py:63 in public method `__init__`:
D107: Missing docstring in __init__
torch/backends/cuda/__init__.py:76 in public method `clear`:
D102: Missing docstring in public method
torch/backends/cuda/__init__.py:91 in public method `__init__`:
D107: Missing docstring in __init__
torch/backends/cuda/__init__.py:95 in public method `__getitem__`:
D105: Missing docstring in magic method
torch/backends/cuda/__init__.py:108 in public method `__getattr__`:
D105: Missing docstring in magic method
torch/backends/cuda/__init__.py:111 in public method `__setattr__`:
D105: Missing docstring in magic method
torch/backends/cuda/__init__.py:118 in public class `cuBLASModule`:
D101: Missing docstring in public class
torch/backends/cuda/__init__.py:119 in public method `__getattr__`:
D105: Missing docstring in magic method
torch/backends/cuda/__init__.py:128 in public method `__setattr__`:
D105: Missing docstring in magic method
torch/backends/cudnn/__init__.py:1 at module level:
D104: Missing docstring in public package
torch/backends/cudnn/__init__.py:99 in public function `is_acceptable`:
D103: Missing docstring in public function
torch/backends/cudnn/__init__.py:122 in public function `set_flags`:
D103: Missing docstring in public function
torch/backends/cudnn/__init__.py:150 in public function `flags`:
D103: Missing docstring in public function
torch/backends/cudnn/__init__.py:174 in public class `CudnnModule`:
D101: Missing docstring in public class
torch/backends/cudnn/__init__.py:175 in public method `__init__`:
D107: Missing docstring in __init__
torch/backends/mkl/__init__.py:1 at module level:
D104: Missing docstring in public package
torch/backends/mkl/__init__.py:42 in public method `__init__`:
D107: Missing docstring in __init__
torch/backends/mkl/__init__.py:45 in public method `__enter__`:
D105: Missing docstring in magic method
torch/backends/mkl/__init__.py:54 in public method `__exit__`:
D105: Missing docstring in magic method
torch/backends/mkldnn/__init__.py:1 at module level:
D104: Missing docstring in public package
torch/backends/mkldnn/__init__.py:48 in public method `__init__`:
D107: Missing docstring in __init__
torch/backends/mkldnn/__init__.py:51 in public method `__enter__`:
D105: Missing docstring in magic method
torch/backends/mkldnn/__init__.py:60 in public method `__exit__`:
D105: Missing docstring in magic method
torch/backends/mkldnn/__init__.py:65 in public function `set_flags`:
D103: Missing docstring in public function
torch/backends/mkldnn/__init__.py:72 in public function `flags`:
D103: Missing docstring in public function
torch/backends/mkldnn/__init__.py:82 in public class `MkldnnModule`:
D101: Missing docstring in public class
torch/backends/mkldnn/__init__.py:83 in public method `__init__`:
D107: Missing docstring in __init__
torch/backends/openmp/__init__.py:1 at module level:
D104: Missing docstring in public package
torch/nn/utils/_expanded_weights/conv_utils.py:13 in public function `conv_picker`:
D103: Missing docstring in public function
torch/nn/utils/_expanded_weights/conv_utils.py:23 in public function `conv_args_and_kwargs`:
D103: Missing docstring in public function
torch/nn/utils/_expanded_weights/conv_utils.py:31 in public function `conv_normalizer`:
D103: Missing docstring in public function
torch/nn/utils/_expanded_weights/conv_utils.py:35 in public function `conv_input_for_string_padding`:
D103: Missing docstring in public function
torch/nn/utils/_expanded_weights/conv_utils.py:43 in public function `int_padding_for_string_padding`:
D103: Missing docstring in public function
torch/nn/utils/_expanded_weights/conv_utils.py:59 in public function `conv_padding_for_same`:
D103: Missing docstring in public function
torch/nn/utils/_expanded_weights/conv_utils.py:66 in public function `conv_backward`:
D103: Missing docstring in public function
torch/nn/utils/_expanded_weights/conv_utils.py:131 in public function `conv_unfold_weight_grad_sample`:
D103: Missing docstring in public function
torch/nn/utils/_expanded_weights/conv_utils.py:166 in public function `conv_group_weight_grad_sample`:
D103: Missing docstring in public function
torch/nn/utils/_expanded_weights/expanded_weights_utils.py:6 in public function `is_batch_first`:
D103: Missing docstring in public function
torch/nn/utils/_expanded_weights/expanded_weights_utils.py:87 in public function `maybe_scale_by_batch_size`:
D103: Missing docstring in public function
torch/nn/utils/_expanded_weights/expanded_weights_utils.py:93 in public function `set_grad_sample_if_exists`:
D103: Missing docstring in public function
torch/nn/utils/_expanded_weights/expanded_weights_utils.py:111 in public function `unpack_expanded_weight_or_tensor`:
D103: Missing docstring in public function
torch/nn/utils/convert_parameters.py:1 at module level:
D100: Missing docstring in public module
torch/nn/utils/rnn.py:1 at module level:
D100: Missing docstring in public module
torch/nn/utils/rnn.py:64 in public method `__new__`:
D102: Missing docstring in public method
torch/nn/utils/rnn.py:74 in public method `pin_memory`:
D102: Missing docstring in public method
torch/nn/utils/rnn.py:81 in public method `cuda`:
D102: Missing docstring in public method
torch/nn/utils/rnn.py:88 in public method `cpu`:
D102: Missing docstring in public method
torch/nn/utils/rnn.py:95 in public method `double`:
D102: Missing docstring in public method
torch/nn/utils/rnn.py:98 in public method `float`:
D102: Missing docstring in public method
torch/nn/utils/rnn.py:101 in public method `half`:
D102: Missing docstring in public method
torch/nn/utils/rnn.py:104 in public method `long`:
D102: Missing docstring in public method
torch/nn/utils/rnn.py:107 in public method `int`:
D102: Missing docstring in public method
torch/nn/utils/rnn.py:110 in public method `short`:
D102: Missing docstring in public method
torch/nn/utils/rnn.py:113 in public method `char`:
D102: Missing docstring in public method
torch/nn/utils/rnn.py:116 in public method `byte`:
D102: Missing docstring in public method
torch/nn/utils/rnn.py:198 in public function `invert_permutation`:
D103: Missing docstring in public function
81
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112695
Approved by: https://github.com/mikaylagawarecki
Custom VariableTrackers currently exist for classes that live outside of pytorch.
For these cases dynamo eagerly imports the module to get the class
object to compare against.
This instead uses `sys.modules.get("module.path")` such that the module is never
imported by dynamo itself, but if the user has imported the module then we will
still access the module and grab the type we need to compare against.
I noticed this issue because importing `KeyedJaggedTensor` fails half-way
through if `fbgemm_gpu` has been built with an incompatible PyTorch version, in
which case it retries the import again each time!
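A minimal sketch of the lookup pattern (the module path and class name are only illustrative):
```python
import sys

def maybe_get_class(module_path: str, class_name: str):
    # sys.modules.get never triggers an import: the module is only
    # visible here if the user has already imported it themselves.
    mod = sys.modules.get(module_path)
    return getattr(mod, class_name, None) if mod is not None else None

# e.g. only compare against KeyedJaggedTensor if torchrec was already imported
kjt_cls = maybe_get_class("torchrec.sparse.jagged_tensor", "KeyedJaggedTensor")
```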
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112319
Approved by: https://github.com/lezcano, https://github.com/ezyang
Summary:
Fix zero-dim test. Use `at::zeros` instead of `at::empty` as the init value inside a `at::empty` tensor is undefined. Likely to be the cause of test flakiness.
{F1142344469}
Test Plan:
Run on devserver
```
$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin
...
[ OK ] VulkanAPITest.linear_4d_large (2 ms)
[ RUN ] VulkanAPITest.lstm_success
[ OK ] VulkanAPITest.lstm_success (4 ms)
[ RUN ] VulkanAPITest.lstm_mclareninputs_success
[ OK ] VulkanAPITest.lstm_mclareninputs_success (45 ms)
[ RUN ] VulkanAPITest.lstm_prepack_success
[ OK ] VulkanAPITest.lstm_prepack_success (2 ms)
[ RUN ] VulkanAPITest.querypool_flushed_shader_log
xplat/caffe2/aten/src/ATen/test/vulkan_api_test.cpp:7773: Skipped
QueryPool is not available
[ SKIPPED ] VulkanAPITest.querypool_flushed_shader_log (0 ms)
[----------] 402 tests from VulkanAPITest (24598 ms total)
[----------] Global test environment tear-down
[==========] 402 tests from 1 test suite ran. (24598 ms total)
[ PASSED ] 399 tests.
[ SKIPPED ] 1 test, listed below:
[ SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
[ FAILED ] 2 tests, listed below:
[ FAILED ] VulkanAPITest.conv2d_pw_prepack
[ FAILED ] VulkanAPITest.conv2d_pw_prepack_bc
2 FAILED TESTS
YOU HAVE 7 DISABLED TESTS
```
Last two are known failures on devserver.
Full output: P875058890
Differential Revision: D51055623
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113116
Approved by: https://github.com/manuelcandales
This got reverted a couple of months ago. We have since fixed the known
problems with the flag. It is time to try again.
Context:
This PR adds a new fallback to the Autograd dispatch keys. The previous
behavior was a big footgun; we are moving toward deprecating it.
If you would prefer the old behavior:
- A quick (unsupported) way to get the previous behavior is to call
torch._C._set_autograd_fallback("nothing") (see the sketch below)
- Register "torch::CppFunction::makeFallthrough()" to your Autograd key,
like in https://gist.github.com/zou3519/d09a5f4b1afe2430af09fea67c6ff2c8
It is possible that this PR regresses performance of overhead-bound
models. If this is the case, please reach out (and apply one of the
temporary fixes in the previous section).
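As a sketch, the quick (unsupported) opt-out mentioned above is a single call:
```python
import torch

# Restore the pre-PR behavior: no autograd fallback on the Autograd keys.
torch._C._set_autograd_fallback("nothing")
```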
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113166
Approved by: https://github.com/soulitzer
Summary: Only propagate sharding if the op has a sharding strategy registered; otherwise, mark the op's inputs/outputs as `Replicate`.
Test Plan: buck test mode/opt -c fbcode.enable_gpu_sections=true //caffe2/test/distributed/_tensor/experimental:tp_transform
Differential Revision: D50747611
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112530
Approved by: https://github.com/wanchaol
Attempt number 2 at https://github.com/pytorch/pytorch/issues/108950.
Improves debugging for guard failures/recompilations by:
- only running guard fail reason generation during recompilation, instead of when a guard fails during dynamo cache lookup (so generating guard failure reasons is not on the critical path)
- ~~always reporting all guard failures~~ Reports the first-failing guard failure for each cache entry.
We don't expect a performance hit since the guard fail reasons are only generated at recompile time rather than runtime. Perf benchmark to check this (https://hud.pytorch.org/benchmark/torchbench/inductor_with_cudagraphs?startTime=Fri,%2027%20Oct%202023%2017:42:43%20GMT&stopTime=Fri,%2003%20Nov%202023%2017:42:43%20GMT&granularity=hour&mode=training&dtype=amp&lBranch=gh/williamwen42/62/head&lCommit=f4724f5ffc6d17ceae513a42fc18627be7b85482&rBranch=main&rCommit=29f3d392bf230072e3bffae37b078e770cae1956). We may also need to verify this on benchmarks where guard fails are common.
Sample script:
```python
import torch
def generate_data(b):
return (
torch.randn(b, 3, 32, 32).to(torch.float32).cuda(),
torch.randint(1000, (b,)).cuda(),
)
from torchvision.models import resnet18
def init_model():
return resnet18().to(torch.float32).cuda()
model = init_model()
model_opt = torch.compile(model, dynamic=False)
for b in range(16, 32):
data = generate_data(b)
model_opt(data[0])
```
Sample logs:
```bash
(/data/users/williamwen/py310-env) [williamwen@devgpu020.odn1 /data/users/williamwen/pytorch (wwen/log-all-guards)]$ python playground5.py
/data/users/williamwen/pytorch/torch/_inductor/compile_fx.py:141: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
warnings.warn(
[2023-11-06 14:50:47,605] torch._dynamo.convert_frame: [WARNING] torch._dynamo hit config.cache_size_limit (8)
[2023-11-06 14:50:47,605] torch._dynamo.convert_frame: [WARNING] function: 'forward' (/data/users/williamwen/torchvision/torchvision/models/resnet.py:284)
[2023-11-06 14:50:47,605] torch._dynamo.convert_frame: [WARNING] last reason: tensor 'L['x']' size mismatch at index 0. expected 16, actual 24
[2023-11-06 14:50:47,605] torch._dynamo.convert_frame: [WARNING] To log all recompilation reasons, use TORCH_LOGS="recompiles".
[2023-11-06 14:50:47,605] torch._dynamo.convert_frame: [WARNING] To diagnose recompilation issues, see https://pytorch.org/docs/master/compile/troubleshooting.html.
(/data/users/williamwen/py310-env) [williamwen@devgpu020.odn1 /data/users/williamwen/pytorch (wwen/log-all-guards)]$ TORCH_LOGS="recompiles" python playground5.py
/data/users/williamwen/pytorch/torch/_inductor/compile_fx.py:141: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
warnings.warn(
[2023-11-06 14:53:31,591] torch._dynamo.guards.__recompiles: [DEBUG] Recompiling function forward in /data/users/williamwen/torchvision/torchvision/models/resnet.py:284
[2023-11-06 14:53:31,591] torch._dynamo.guards.__recompiles: [DEBUG] triggered by the following guard failure(s):
[2023-11-06 14:53:31,591] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 16, actual 17
[2023-11-06 14:53:41,333] torch._dynamo.guards.__recompiles: [DEBUG] Recompiling function forward in /data/users/williamwen/torchvision/torchvision/models/resnet.py:284
[2023-11-06 14:53:41,333] torch._dynamo.guards.__recompiles: [DEBUG] triggered by the following guard failure(s):
[2023-11-06 14:53:41,333] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 17, actual 18
[2023-11-06 14:53:41,333] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 16, actual 18
[2023-11-06 14:53:50,463] torch._dynamo.guards.__recompiles: [DEBUG] Recompiling function forward in /data/users/williamwen/torchvision/torchvision/models/resnet.py:284
[2023-11-06 14:53:50,463] torch._dynamo.guards.__recompiles: [DEBUG] triggered by the following guard failure(s):
[2023-11-06 14:53:50,463] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 18, actual 19
[2023-11-06 14:53:50,463] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 17, actual 19
[2023-11-06 14:53:50,463] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 16, actual 19
[2023-11-06 14:53:59,848] torch._dynamo.guards.__recompiles: [DEBUG] Recompiling function forward in /data/users/williamwen/torchvision/torchvision/models/resnet.py:284
[2023-11-06 14:53:59,848] torch._dynamo.guards.__recompiles: [DEBUG] triggered by the following guard failure(s):
[2023-11-06 14:53:59,848] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 19, actual 20
[2023-11-06 14:53:59,848] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 18, actual 20
[2023-11-06 14:53:59,848] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 17, actual 20
[2023-11-06 14:53:59,848] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 16, actual 20
[2023-11-06 14:54:08,549] torch._dynamo.guards.__recompiles: [DEBUG] Recompiling function forward in /data/users/williamwen/torchvision/torchvision/models/resnet.py:284
[2023-11-06 14:54:08,549] torch._dynamo.guards.__recompiles: [DEBUG] triggered by the following guard failure(s):
[2023-11-06 14:54:08,549] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 20, actual 21
[2023-11-06 14:54:08,549] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 19, actual 21
[2023-11-06 14:54:08,549] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 18, actual 21
[2023-11-06 14:54:08,549] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 17, actual 21
[2023-11-06 14:54:08,549] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 16, actual 21
[2023-11-06 14:54:17,795] torch._dynamo.guards.__recompiles: [DEBUG] Recompiling function forward in /data/users/williamwen/torchvision/torchvision/models/resnet.py:284
[2023-11-06 14:54:17,795] torch._dynamo.guards.__recompiles: [DEBUG] triggered by the following guard failure(s):
[2023-11-06 14:54:17,795] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 21, actual 22
[2023-11-06 14:54:17,795] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 20, actual 22
[2023-11-06 14:54:17,795] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 19, actual 22
[2023-11-06 14:54:17,795] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 18, actual 22
[2023-11-06 14:54:17,795] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 17, actual 22
[2023-11-06 14:54:17,795] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 16, actual 22
[2023-11-06 14:54:27,430] torch._dynamo.guards.__recompiles: [DEBUG] Recompiling function forward in /data/users/williamwen/torchvision/torchvision/models/resnet.py:284
[2023-11-06 14:54:27,430] torch._dynamo.guards.__recompiles: [DEBUG] triggered by the following guard failure(s):
[2023-11-06 14:54:27,430] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 22, actual 23
[2023-11-06 14:54:27,430] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 21, actual 23
[2023-11-06 14:54:27,430] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 20, actual 23
[2023-11-06 14:54:27,430] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 19, actual 23
[2023-11-06 14:54:27,430] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 18, actual 23
[2023-11-06 14:54:27,430] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 17, actual 23
[2023-11-06 14:54:27,430] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 16, actual 23
[2023-11-06 14:54:36,744] torch._dynamo.guards.__recompiles: [DEBUG] Recompiling function forward in /data/users/williamwen/torchvision/torchvision/models/resnet.py:284
[2023-11-06 14:54:36,744] torch._dynamo.guards.__recompiles: [DEBUG] triggered by the following guard failure(s):
[2023-11-06 14:54:36,744] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 23, actual 24
[2023-11-06 14:54:36,744] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 22, actual 24
[2023-11-06 14:54:36,744] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 21, actual 24
[2023-11-06 14:54:36,744] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 20, actual 24
[2023-11-06 14:54:36,744] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 19, actual 24
[2023-11-06 14:54:36,744] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 18, actual 24
[2023-11-06 14:54:36,744] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 17, actual 24
[2023-11-06 14:54:36,744] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 16, actual 24
[2023-11-06 14:54:36,744] torch._dynamo.convert_frame: [WARNING] torch._dynamo hit config.cache_size_limit (8)
[2023-11-06 14:54:36,744] torch._dynamo.convert_frame: [WARNING] function: 'forward' (/data/users/williamwen/torchvision/torchvision/models/resnet.py:284)
[2023-11-06 14:54:36,744] torch._dynamo.convert_frame: [WARNING] last reason: tensor 'L['x']' size mismatch at index 0. expected 16, actual 24
[2023-11-06 14:54:36,744] torch._dynamo.convert_frame: [WARNING] To log all recompilation reasons, use TORCH_LOGS="recompiles".
[2023-11-06 14:54:36,744] torch._dynamo.convert_frame: [WARNING] To diagnose recompilation issues, see https://pytorch.org/docs/master/compile/troubleshooting.html.
[2023-11-06 14:54:45,922] torch._dynamo.guards.__recompiles: [DEBUG] Recompiling function _forward_impl in /data/users/williamwen/torchvision/torchvision/models/resnet.py:266
[2023-11-06 14:54:45,922] torch._dynamo.guards.__recompiles: [DEBUG] triggered by the following guard failure(s):
[2023-11-06 14:54:45,922] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 24, actual 25
[2023-11-06 14:54:54,691] torch._dynamo.guards.__recompiles: [DEBUG] Recompiling function _forward_impl in /data/users/williamwen/torchvision/torchvision/models/resnet.py:266
[2023-11-06 14:54:54,691] torch._dynamo.guards.__recompiles: [DEBUG] triggered by the following guard failure(s):
[2023-11-06 14:54:54,691] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 25, actual 26
[2023-11-06 14:54:54,691] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 24, actual 26
[2023-11-06 14:55:03,591] torch._dynamo.guards.__recompiles: [DEBUG] Recompiling function _forward_impl in /data/users/williamwen/torchvision/torchvision/models/resnet.py:266
[2023-11-06 14:55:03,591] torch._dynamo.guards.__recompiles: [DEBUG] triggered by the following guard failure(s):
[2023-11-06 14:55:03,591] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 26, actual 27
[2023-11-06 14:55:03,591] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 25, actual 27
[2023-11-06 14:55:03,591] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 24, actual 27
[2023-11-06 14:55:12,384] torch._dynamo.guards.__recompiles: [DEBUG] Recompiling function _forward_impl in /data/users/williamwen/torchvision/torchvision/models/resnet.py:266
[2023-11-06 14:55:12,384] torch._dynamo.guards.__recompiles: [DEBUG] triggered by the following guard failure(s):
[2023-11-06 14:55:12,384] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 27, actual 28
[2023-11-06 14:55:12,384] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 26, actual 28
[2023-11-06 14:55:12,384] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 25, actual 28
[2023-11-06 14:55:12,384] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 24, actual 28
[2023-11-06 14:55:21,442] torch._dynamo.guards.__recompiles: [DEBUG] Recompiling function _forward_impl in /data/users/williamwen/torchvision/torchvision/models/resnet.py:266
[2023-11-06 14:55:21,442] torch._dynamo.guards.__recompiles: [DEBUG] triggered by the following guard failure(s):
[2023-11-06 14:55:21,442] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 28, actual 29
[2023-11-06 14:55:21,442] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 27, actual 29
[2023-11-06 14:55:21,442] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 26, actual 29
[2023-11-06 14:55:21,442] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 25, actual 29
[2023-11-06 14:55:21,442] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 24, actual 29
[2023-11-06 14:55:30,315] torch._dynamo.guards.__recompiles: [DEBUG] Recompiling function _forward_impl in /data/users/williamwen/torchvision/torchvision/models/resnet.py:266
[2023-11-06 14:55:30,315] torch._dynamo.guards.__recompiles: [DEBUG] triggered by the following guard failure(s):
[2023-11-06 14:55:30,315] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 29, actual 30
[2023-11-06 14:55:30,315] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 28, actual 30
[2023-11-06 14:55:30,315] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 27, actual 30
[2023-11-06 14:55:30,315] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 26, actual 30
[2023-11-06 14:55:30,315] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 25, actual 30
[2023-11-06 14:55:30,315] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 24, actual 30
[2023-11-06 14:55:39,839] torch._dynamo.guards.__recompiles: [DEBUG] Recompiling function _forward_impl in /data/users/williamwen/torchvision/torchvision/models/resnet.py:266
[2023-11-06 14:55:39,839] torch._dynamo.guards.__recompiles: [DEBUG] triggered by the following guard failure(s):
[2023-11-06 14:55:39,839] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 30, actual 31
[2023-11-06 14:55:39,839] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 29, actual 31
[2023-11-06 14:55:39,839] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 28, actual 31
[2023-11-06 14:55:39,839] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 27, actual 31
[2023-11-06 14:55:39,839] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 26, actual 31
[2023-11-06 14:55:39,839] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 25, actual 31
[2023-11-06 14:55:39,839] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 24, actual 31
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110325
Approved by: https://github.com/ezyang, https://github.com/jon-chuang
Summary: Previously we only copied over q/dq args for the per
tensor case. This was because the qparams for `quantize_per_tensor`
are literals while the qparams for `quantize_per_channel` are
`get_attr` nodes (tensors), which disappear from the original
nodes in the graph after subgraph rewriting.
However, this is problematic because, in the per channel case,
not all q/dq args are tensors. In particular, the args after
the qparams (axis, qmin, qmax, dtype) are all literals. For
these literal args we simply used the hardcoded ones
(0, -127, 127, torch.int8 respectively), even if the user
explicitly specified to use a different weight dtype. This
commit fixes this by copying over these literal args for the
per channel case as well.
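For reference, a sketch of the decomposed per-channel quantize call whose trailing literal args are now copied over instead of hardcoded (this assumes the `quantized_decomposed` ops used by PT2E; the values are illustrative):
```python
import torch

w = torch.randn(8, 4)
scales = torch.full((8,), 0.1)
zero_points = torch.zeros(8, dtype=torch.int64)
# scales/zero_points are get_attr tensors in the graph; the remaining
# args (axis=0, qmin=-127, qmax=127, dtype=torch.int8) are plain literals,
# which must be copied from the user's pattern rather than hardcoded.
q = torch.ops.quantized_decomposed.quantize_per_channel(
    w, scales, zero_points, 0, -127, 127, torch.int8
)
```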
Test Plan:
python test/test_quantization.py TestQuantizePT2EQAT.test_qat_per_channel_weight_custom_dtype
Reviewers: jerryzh168, kimishpatel
Subscribers: jerryzh168, kimishpatel, supriyar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112612
Approved by: https://github.com/jerryzh168
CC @malfet @ptrblck
~~We've been seeing a lot of noise from Ampere and later devices due to reduced precision reductions, so preemptively disabling them for addmm tests.~~
Breaking out addmm tests into one with and without reduced precision reductions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112545
Approved by: https://github.com/malfet
This PR:
1. Adds `run_subtests` to `DTensorTestbase`, which runs test functions given by `test_fn` as subtests. This amortizes the costly dist setup (see the sketch after this list).
2. Updates `test/distributed/checkpoint/test_state_dict.py` to use `DTensorTestbase`. This fixes the "Duplicate GPU detected: rank 0 and rank 1 both on CUDA device 11000" error when running on 1 GPU, since skip_if_lt_x_gpu currently happens after dist setup.
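A rough sketch of the subtest pattern (the test class and per-config body below are hypothetical):
```python
import torch
from torch.testing._internal.distributed._tensor.common_dtensor import (
    DTensorTestBase,
    with_comms,
)

class MyOpTest(DTensorTestBase):
    @with_comms
    def test_op(self):
        # One costly dist setup, many cheap parametrized subtests.
        self.run_subtests(
            {"dtype": [torch.float32, torch.float16], "use_bias": [True, False]},
            self._check_op,  # hypothetical body, called with dtype=..., use_bias=...
        )
```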
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113051
Approved by: https://github.com/fegin
This PR fixes#106294.
Due to the lack of a request validation mechanism, TCPStore in torch mistakenly treats nmap scan messages as valid query messages, which leads to DDP OOM. The simple solution enforces that the very first query from a client be a validation query carrying a predefined magic number. If the validation fails, the server terminates the connection.
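A minimal sketch of the idea (the constant and helpers below are illustrative, not the actual TCPStore wire protocol):
```python
import socket
import struct

MAGIC = 0x3C85F7CE  # illustrative; not the real constant

def handle_queries(conn: socket.socket) -> None:
    ...  # the normal query loop (elided)

def serve_client(conn: socket.socket) -> None:
    # The very first message must be the validation query.
    first = conn.recv(4)
    if len(first) != 4 or struct.unpack("!I", first)[0] != MAGIC:
        conn.close()  # nmap probes and other junk never reach the store logic
        return
    handle_queries(conn)
```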
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107607
Approved by: https://github.com/cbalioglu, https://github.com/XilunWu
**Summary**:
The dtensor_ops test has a helper function `assert_ref_dtensor_equal` that was written as if it expected a `DTensor` argument `dtensor_rs`, but the test actually passes a `torch.Tensor`. This PR removes the `to_local()` call on that object since it's actually a `torch.Tensor`.
This PR is a part of internal task [T169242924](https://www.internalfb.com/intern/tasks/?t=169242924) for better engineering.
**Test**:
`pytest test/distributed/_tensor/test_dtensor_ops.py`
`pytest test/distributed/_tensor/test_dtensor_ops.py -s -k baddbmm`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113104
Approved by: https://github.com/wanchaol
Summary:
We've made the following changes:
- The new way to use the API is `m.impl_abstract_pystub(module, context)`.
Every subsequent m.def of an op inside the TORCH_LIBRARY block gives
the op the `impl_abstract_pystub`.
- Added a mechanism to determine if an operator was defined in Python or C++.
Library.define in Python appends the op to a global set, which is analogous
to what we do for tracking Library.impl.
- If someone does `torch.library.impl_abstract` in Python for an operator, then
we require that it has an `impl_abstract_pystub` specified and we also check
that the module in the `impl_abstract_pystub` is the same as the module where
the call to `torch.library.impl_abstract` exists (see the sketch after this list).
- Unfortunately we can't check the "context" (which is the buck target on
buck-based systems) because buck sits above us.
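A minimal sketch of the Python-side contract (the op name and namespace are illustrative):
```python
import torch
from torch.library import Library, impl_abstract

lib = Library("mylib", "DEF")  # illustrative namespace
lib.define("my_op(Tensor x) -> Tensor")

# Ops defined in Python (like this one) are exempt from the check; for a
# C++-defined op, its impl_abstract_pystub module must match this module.
@impl_abstract("mylib::my_op")
def my_op_abstract(x):
    # Abstract/fake impl: only metadata (shape, dtype, device) matters here.
    return torch.empty_like(x)
```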
Test Plan: - existing tests
Differential Revision: D50972148
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112851
Approved by: https://github.com/ezyang
In https://github.com/pytorch/pytorch/pull/112156 I added support for creating replacements on unbacked SymInts, so if you asserted that `i0 == s0`, we would replace i0 with s0 (only ever replacing unbacked with backed.)
However, if we have assertions involving only unbacked SymInts, we can also replace in this case! E.g., `i0 == i1` or `i0 == i1 * 12`. The previous logic for generating replacements would reject these cases, because you're not allowed to replace unbacked with unbacked. Modifying the logic is not so easy though; ordinarily, we decide what substitution to prioritize by trying to replace the largest hinted symbol, but for unbacked integers we don't have this. To get around this problem, for now I only set up replacements for trivial "symbol equals something else" situations. Check the diff with whitespace ignored; the addition is quite small.
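A small illustration of the kind of assertion that can now produce a replacement (a sketch; exact behavior depends on the tracing pipeline):
```python
import torch

def f(x: torch.Tensor) -> torch.Tensor:
    i0 = x[0].item()  # unbacked SymInt i0
    i1 = x[1].item()  # unbacked SymInt i1
    torch._check(i0 == i1 * 12)  # after this, i0 can be replaced by i1 * 12
    return torch.zeros(i0)
```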
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112653
Approved by: https://github.com/aakhundov
Summary: Support passing in a parallel strategy map and apply the TP transform based on that. This makes it easier to manually select layers from a real model to parallelize and benchmark.
Test Plan: buck test mode/opt -c fbcode.enable_gpu_sections=true //caffe2/test/distributed/_tensor/experimental:tp_transform
Differential Revision: D50591039
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112286
Approved by: https://github.com/wanchaol
Previous PRs changed the c++ default timeout for PGNccl, but this path
was only hit in some cases, and the python defaults took over in other
cases.
This PR ensures that NCCL PGs always default to the changed NCCL-specific
timeout value.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113094
Approved by: https://github.com/fduwjj
Fixes https://github.com/pytorch/pytorch/issues/113030
Alias information needs to be applied in eager before we can continue to trace the graph.
----
Perhaps this is too strict - couldn't we fx trace through the in-graph (pointer) aliasing, track mutations through fake tensors instead, and still apply the aliasing mutation epilogue for further mutations outside of the graph? 🤔
Regardless, it didn't seem to work too well when I tried this. It seems that `Tensor.__setattr__` doesn't work well in an fx graph.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113043
Approved by: https://github.com/ezyang, https://github.com/voznesenskym
TestNNDeviceTypeCUDA.test_softmax_forward_64bit_indexing_cuda started failing for ROCm after #112096 with the message
torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 13.35 GiB. GPU 0 has a total capacity of 31.98 GiB of which 3.89 GiB is free. Of the allocated memory 26.69 GiB is allocated by PyTorch, and 18.91 MiB is reserved by PyTorch but unallocated.
This amounts to approximately 41GB. The test is currently decorated with `largeTensorTest("30GB", "cuda")` but this is not sufficient for ROCm.
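The decorator in question; the new size must cover the ~41GB peak (the value below is illustrative, not necessarily what the PR settled on):
```python
from torch.testing._internal.common_device_type import largeTensorTest

@largeTensorTest("42GB", "cuda")  # was "30GB": not enough for ROCm's ~41GB peak
def test_softmax_forward_64bit_indexing(self, device):
    ...
```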
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113093
Approved by: https://github.com/malfet
After adding support for the labeler, we don't need CODEOWNERS.
This change will cause the distributed team members previously listed in
CODEOWNERS to stop being auto-added as reviewers on PRs touching these
files. The preceding PR adds labeler support for these same sets of
files, and contains instructions for adding yourself to be cc'd for that
label.
It is preferable to be auto-cc'd rather than auto-tagged as reviewer, so
that there is more signal in the reviewers list (either someone opted in,
which shows the PR author that someone is likely looking at it, or the PR
author added someone specifically, which is a stronger notification to
the tagged reviewer than the blanket CODEOWNERS behavior).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112813
Approved by: https://github.com/wanchaol, https://github.com/fduwjj
Fix https://github.com/pytorch/pytorch/issues/112454 .
The current fix is quite simple. The kernel has multiple triton configs. Previously, if any triton config failed to compile, we skipped everything else and failed. Now we just skip the bad configs and pick the best one from the remaining configs.
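A sketch of the new behavior (the helper names are hypothetical, not inductor's actual API):
```python
def pick_best_config(configs, compile_fn, bench_fn):
    compiled = []
    for cfg in configs:
        try:
            compiled.append((cfg, compile_fn(cfg)))
        except Exception:
            continue  # previously the first failure aborted the whole kernel
    if not compiled:
        raise RuntimeError("no triton config compiled successfully")
    # Benchmark the survivors and keep the fastest one.
    return min(compiled, key=lambda pair: bench_fn(pair[1]))[0]
```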
There are other ways to fix the issues more fundamentally but requires much more work:
1. Horace mentioned an idea to make sure the largest one of size_hints is the rightmost dimension. This way, that largest dimension will be mapped to XBLOCK and we won't scale it up too much, since the threshold for the max grid size for the x dimension is quite large (2**31 - 1). But this may require us to change loop-ordering heuristics, which may have other perf impact.
2. The issue happens because the kernel requires 2D tiling, which uses shared memory. We can stop scaling up block sizes if `XBLOCK * YBLOCK * element_size >= max_shared_memory`. max_shared_memory is around 160K for A100. The tricky part here is that we don't know the dtype in the `triton_config` method to decide the `element_size`. From metadata, we can find the dtype for each tensor, but if the kernel uses tensors of mixed types, we won't know what dtype is actually used for the data loaded into shared memory.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112916
Approved by: https://github.com/Chillee, https://github.com/eellison, https://github.com/jansel
Summary:
`conv2d_pw_prepack` and `conv2d_pw_prepack_bc` have been broken for a long time on Meta's CI.
The cause is unknown yet; the tests pass with a smaller input.
Hence we disable the large tests and enable tests with a smaller tensor.
Test Plan:
Devserver:
```
LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*"
```
Output: P872944689
```
...
[ OK ] VulkanAPITest.linear_4d_small (0 ms)
[ RUN ] VulkanAPITest.linear_4d_large
[ OK ] VulkanAPITest.linear_4d_large (2 ms)
[ RUN ] VulkanAPITest.lstm_success
[ OK ] VulkanAPITest.lstm_success (7 ms)
[ RUN ] VulkanAPITest.lstm_mclareninputs_success
[ OK ] VulkanAPITest.lstm_mclareninputs_success (39 ms)
[ RUN ] VulkanAPITest.lstm_prepack_success
[ OK ] VulkanAPITest.lstm_prepack_success (3 ms)
[ RUN ] VulkanAPITest.querypool_flushed_shader_log
xplat/caffe2/aten/src/ATen/test/vulkan_api_test.cpp:7627: Skipped
QueryPool is not available
[ SKIPPED ] VulkanAPITest.querypool_flushed_shader_log (0 ms)
[----------] 396 tests from VulkanAPITest (23847 ms total)
[----------] Global test environment tear-down
[==========] 396 tests from 1 test suite ran. (23847 ms total)
[ PASSED ] 395 tests.
[ SKIPPED ] 1 test, listed below:
[ SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
YOU HAVE 9 DISABLED TESTS
```
Differential Revision: D50997218
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112936
Approved by: https://github.com/manuelcandales
This may seem a bit silly but we spend ~5% of compilation on simply checking if the `ShapeEnv` cache has been invalidated. It isn't necessarily slow, but we call it millions of times per compile so everything adds up.
To improve the situation, I've added a version counter to the shape env that gets incremented whenever the cache key changes. This does require a bit of care in `ShapeEnv` that we don't modify the relevant state without calling `self._update_version_counter()`. However, we already have a similar situation for the translation validation feature which requires `_set_replacement` to be called instead of modifying the replacements directly.
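A sketch of the version-counter idea (names mirror the description, but this is not the real ShapeEnv):
```python
class VersionedEnv:
    def __init__(self):
        self._version_counter = 0
        self._replacements = {}

    def _update_version_counter(self):
        self._version_counter += 1

    def _set_replacement(self, symbol, expr):
        # All state mutation funnels through methods that bump the counter.
        self._replacements[symbol] = expr
        self._update_version_counter()

def cache_key(env: VersionedEnv, args):
    # Checking invalidation becomes an int comparison instead of
    # recomputing and hashing the relevant ShapeEnv state.
    return (env._version_counter, args)
```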
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112687
Approved by: https://github.com/ezyang
ghstack dependencies: #112933
Summary:
- Introduce an improved algorithm for 2d float GEMM [ output = alpha * (input) * (weight) + beta * (bias) ] that shows more than 50% shader latency reduction on Qualcomm GPUs. Does not apply to the quantized [integer] and batch [3d] matrix multiplication cases.
- At function call of `run_linear_context()`/`run_addmm_context()`, re-pack the input tensor data to be row-wise element-dense in each texel
- Reducing global I/O reads and writes through "batching" by fetching 4 input and weight texels each, then performing 16 output computations and writes, in each shader invocation
- Leverage a loop unrolling/coalescing compile-time optimization of the GLSL->SPIR-V compiler using a macro for 4
Test Plan:
# Numerical Validation
- There are two pre-existing failures on trunk related to conv2d, unrelated to this diff's code paths
- `LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin`
```
[ SKIPPED ] VulkanAPITest.querypool_flushed_shader_log (0 ms)
[----------] 396 tests from VulkanAPITest (38014 ms total)
[----------] Global test environment tear-down
[==========] 396 tests from 1 test suite ran. (38014 ms total)
[ PASSED ] 393 tests.
[ SKIPPED ] 1 test, listed below:
[ SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
[ FAILED ] 2 tests, listed below:
[ FAILED ] VulkanAPITest.conv2d_pw_prepack
[ FAILED ] VulkanAPITest.conv2d_pw_prepack_bc
```
# Performance Validation with Matrix Multiplication Benchmark Binary
- build the benchmark binary on both this diff and trunk modified for 100 iterations, `buck2 build -c ndk.debug_info_level=0 -c ndk.static_linking=true -c pt.enable_qpl=0 -c pt.vulkan_use_gpu_diagnostics=1 --target-platforms=ovr_config//platform/android:arm32-fbsource //xplat/caffe2:pt_vulkan_mm_perf_test_binAndroid --show-output`
__on local testing against a Samsung Galaxy S22 Ultra **75% reduction**__
- this diff
```
Benchmark Time CPU Iterations
---------------------------------------------------------------------------------------------------------------------------------
run_linear_context_benchmark/N:500/M:500/P:500/iterations:100/manual_time/threads:1 2.08 ms 10.3 ms 100
```
- trunk:
```
Benchmark Time CPU Iterations
---------------------------------------------------------------------------------------------------------------------------------
run_linear_context_benchmark/N:500/M:500/P:500/iterations:100/manual_time/threads:1 9.11 ms 13.8 ms 100
```
__on local testing against our Android chipset of interest **50% reduction**__
- this diff:
```
Benchmark Time CPU Iterations
---------------------------------------------------------------------------------------------------------------------------------
[...]
run_linear_context_benchmark/N:500/M:500/P:500/iterations:100/manual_time/threads:1 40.0 ms 90.6 ms 100
```
- trunk:
```
Benchmark Time CPU Iterations
---------------------------------------------------------------------------------------------------------------------------------
[...]
run_linear_context_benchmark/N:500/M:500/P:500/iterations:100/manual_time/threads:1 81.3 ms 106 ms 100
```
__on local testing against Google Pixel 7 Pro **55% reduction**__
- this diff:
```
Benchmark Time CPU Iterations
---------------------------------------------------------------------------------------------------------------------------------
[...]
run_linear_context_benchmark/N:500/M:500/P:500/iterations:100/manual_time/threads:1 7.38 ms 10.7 ms 100
```
- trunk:
```
Benchmark Time CPU Iterations
---------------------------------------------------------------------------------------------------------------------------------
[...]
run_linear_context_benchmark/N:500/M:500/P:500/iterations:100/manual_time/threads:1 16.2 ms 12.4 ms 100
```
Reviewed By: yipjustin
Differential Revision: D50441864
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112918
Approved by: https://github.com/yipjustin
Since PyTorch 2.1, when the torch.export API was introduced, the term "export"
has been overloaded due to the already existing torch.onnx.export API.
The torch.onnx.dynamo_export API was introduced in PyTorch 2.0 and it
exposed a torch.onnx.ExportOutput, which can now be confused with
the output of torch.export.export.
To prevent such ambiguity and standardize names around the new
torch.export.ExportedProgram, this PR renames torch.onnx.ExportOutput to
torch.onnx.ONNXProgram.
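Usage after the rename, as a sketch:
```python
import torch

model = torch.nn.Linear(4, 2)
onnx_program = torch.onnx.dynamo_export(model, torch.randn(1, 4))
# onnx_program is a torch.onnx.ONNXProgram (formerly torch.onnx.ExportOutput)
onnx_program.save("linear.onnx")
```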
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112263
Approved by: https://github.com/BowenBao
ghstack dependencies: #112444
It doesn't make sense to set this (on import!), as CUDA cannot be used with PyTorch in this case, but it leads to messages like
> No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
when CUDA happens to be installed, which is at least confusing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106310
Approved by: https://github.com/ezyang
Fixes#111876
`torch.load` without setting `weights_only=True` is unsafe. This PR updates examples of `torch.load` to use `weights_only=True` where possible, and `weights_only=False` elsewhere with a warning that doing so is unsafe.
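For example:
```python
import torch

torch.save({"w": torch.randn(2, 2)}, "ckpt.pt")

# Preferred: restricts unpickling to tensors and plain containers.
state = torch.load("ckpt.pt", weights_only=True)

# Only for checkpoints from a fully trusted source; arbitrary pickled
# objects (and hence arbitrary code) can be loaded otherwise.
state = torch.load("ckpt.pt", weights_only=False)
```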
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112860
Approved by: https://github.com/kit1980
Fix ONNX Runtime failure due to `[ONNXRuntimeError] : 1 : FAIL : Type Error : Type parameter (T) of Optype (Conv) bound to different types (tensor(float) and tensor(float16) in node (Conv_5401).`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112806
Approved by: https://github.com/BowenBao
We spend somewhere on the order of 1% of compile time in `sympy.Expr.free_symbols`, as it is called millions of times.
Most of the time we actually just want to know "is this a constant?"; however, `e.is_constant()` is
horribly slow. It turns out, though, that there is another property, `is_number`, that does what we want.
> property is_number:
>
> Returns True if self has no free symbols and no undefined functions (AppliedUndef, to be precise). It will be faster
> than if not self.free_symbols, however, since is_number will fail as soon as it hits a free symbol or undefined
> function.
Even further, we also avoid the overhead of building the unnecessary set object.
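A quick illustration:
```python
import sympy

x = sympy.Symbol("x")
e_const = sympy.Integer(2) * 3 + sympy.Rational(1, 2)
e_sym = x + 1

# is_number short-circuits at the first free symbol and never
# materializes the free_symbols set.
assert e_const.is_number
assert not e_sym.is_number
```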
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112688
Approved by: https://github.com/lezcano
Summary:
This PR adds epilogue fusion code generation support for the new experimental
[Inductor Cutlass backend](https://github.com/pytorch/pytorch/pull/108015).
Details:
A fusion happens on the GEMM template level by taking a Cutlass 3.x GEMM Universal Matmul Kernel template
and adding a custom template functor based on Cutlass' new “Epilogue Visitor Trees” (EVT) on top, which represents and
performs the computation of the fused Pointwise / Elementwise computation nodes.
This is the approach dictated by [NVIDIA/cutlass example 49](https://github.com/NVIDIA/cutlass/blob/main/examples/49_hopper_gemm_with_collective_builder/49_collective_builder.cu),
which is currently the only documentation and example of Cutlass Epilogue Visitor Trees.
This EVT functor in turn is a hierarchical template expression which represents an abstract syntax tree of the fused computation to perform.
A second codegen task is to create a hierarchical initializer expression, which provides potentially necessary arguments
to each of the functor subexpressions.
Step 1 functionality:
* End to end code generation is possible using the above approach.
* Supports simple elementwise expression fusion of chains of elementwise operations (with scalar constants )
after a matmul.
* Elementwise operation support includes addition, subtraction, multiplication, division, minimum, maximum etc.
* Examples / Unit tests include ReLU and ReLU6 fusion (see the user-level sketch after this list).
* Support for fp16 and fp16 with fp32 accumulation data types.
* Generates SM90 ( Hopper ) based CUDA Kernels ( as Cutlass up to 3.2.0 only supported EVT for SM90 )
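A user-level sketch of the class of fusion covered by step 1 (the config knob below is an assumption about how the experimental backend is opted into, not something this PR requires):
```python
import torch
import torch._inductor.config as inductor_config

# Assumption: route GEMM autotuning to the experimental Cutlass backend.
inductor_config.max_autotune_gemm_backends = "CUTLASS"

@torch.compile(mode="max-autotune")
def gemm_relu6(a, b):
    # The elementwise chain after the matmul becomes an EVT epilogue.
    return torch.clamp(a @ b, min=0.0, max=6.0)  # ReLU6
```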
The following is not yet supported, and is left for future work:
* Full operation support ( e.g. full set of all ops usually handled via V.ops handlers )
* Cutlass EVT with SM80 support ( possible in Cutlass 3.2.1 according to release notes, but not yet documented )
* Add support for additional (auxiliary) inputs, which changes the Template Kernels' call signature
* Add support for additional (auxiliary) outputs ( requires support for full computation graphs )
* Add support for reduction operations and operations which use different output layouts than the input
* Add support for additional dtypes ( as far as Cutlass allows )
This PR updates third_party/cutlass to v3.2.2, which has some important improvements and features
for the inductor backend.
See also Cutlass release notes:
https://github.com/NVIDIA/cutlass/releases/tag/v3.2.1 and https://github.com/NVIDIA/cutlass/releases/tag/v3.2.2
Notable changes in Cutlass 3.2.1 include:
* Cutlass codegen python code has moved into a package with the "cutlass_library" namespace, which allows
preventing namespace clashes without resorting to monkey-patching (which was done earlier).
* Support for SM80 epilogue visitor trees ( according to the Release Notes, not tried yet )
* Small API changes to the cutlass_library API ( requires adapting the inductor backend code )
Notable changes in Cutlass 3.2.2 include:
* Bugfix that led to CUDA Illegal memory access in some Pytorch unit tests involving flash attention
Test Plan:
* CI
* pytest test/inductor/test_max_autotune.py
Note: So far, the CUTLASS backend is still disabled by default. Benchmarks are planned once more advanced fusions are enabled.
Differential Revision: [D50988161](https://our.internmc.facebook.com/intern/diff/D50988161)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110890
Approved by: https://github.com/jansel
ghstack dependencies: #112762
Summary:
We need to register all cache segments allocated by the allocator, so that NCCL can apply zero-copy algorithms in collective and point-to-point operations.
How to track and register all cache segments:
- It registers a register hook and a deregister hook with the cache allocator as action-tracker callbacks, tracking SEGMENT_ALLOC and SEGMENT_FREE trace entries, respectively. When SEGMENT_ALLOC is tracked, the register hook registers the new segment with the PG's communicators on the same device. Similarly, when SEGMENT_FREE is tracked, the deregister hook handles deregistration before cudaFree (see the sketch after this list).
- When a new NCCL communicator is created, it dumps a snapshot from the cache allocator to register all existing cache segments at once.
- When a NCCL communicator is aborted, it deregisters all segments that have been registered by this communicator.
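A self-contained pseudocode sketch of the hook flow (the real implementation is in C++; these types and names are illustrative):
```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class TraceEntry:  # mirrors a cache-allocator trace record
    action: str    # "SEGMENT_ALLOC" or "SEGMENT_FREE"
    device: int
    ptr: int
    size: int

class Comm:  # stand-in for an NCCL communicator with buffer registration
    def __init__(self, device: int):
        self.device = device
        self.registered = {}

    def register_segment(self, ptr: int, size: int) -> None:
        self.registered[ptr] = size

    def deregister_segment(self, ptr: int) -> None:
        self.registered.pop(ptr, None)

def make_trace_hook(comms: List[Comm]) -> Callable[[TraceEntry], None]:
    def on_trace(entry: TraceEntry) -> None:
        for comm in (c for c in comms if c.device == entry.device):
            if entry.action == "SEGMENT_ALLOC":
                comm.register_segment(entry.ptr, entry.size)
            elif entry.action == "SEGMENT_FREE":
                # Deregister before the backing cudaFree happens.
                comm.deregister_segment(entry.ptr)
    return on_trace
```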
Test Plan: See test in D50726971
Reviewed By: wconstab
Differential Revision: D50726970
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112850
Approved by: https://github.com/wconstab
This PR reduces docstring errors from a total of 128 to 0. This can be verified by running `pydocstyle path-to-distributed_c10d.py --count`,
where path-to-distributed_c10d.py is `torch/distributed/distributed_c10d.py`.
BEFORE the PR:
`pydocstyle torch/distributed/distributed_c10d.py --count`
128
AFTER the PR:
`pydocstyle torch/distributed/distributed_c10d.py --count`
0
Fixes#112640
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112693
Approved by: https://github.com/H-Huang
* `torch/distributed/_tensor/ops/math_ops.py` and `test/distributed/_tensor/test_math_ops.py`: add min, max and prod sharding propagation rules (see the sketch after this list)
* `torch/distributed/_tensor/sharding_prop.py` Validate OutputSpec to provide better errors when provided invalid specs
* `torch/distributed/_tensor/op_schema.py`: import `OpOverload` directly to aid linters
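A sketch of what the new rules enable (assumes an already-initialized process group across 2 ranks):
```python
import torch
from torch.distributed._tensor import DeviceMesh, Shard, distribute_tensor

mesh = DeviceMesh("cuda", [0, 1])  # requires torch.distributed to be initialized
dt = distribute_tensor(torch.randn(4, 4), mesh, [Shard(0)])
out = dt.max()  # min/max/prod now have sharding propagation rules
```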
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112403
Approved by: https://github.com/wanchaol
Summary: Previously, QAT fusion assumed the bias is not quantized.
This works for the existing XNNPACKQuantizer, but not for custom
quantizers that wish to quantize the bias. This commit supports
this by adding the necessary patterns. This requires refactoring
the code, however, since it previously assumed that there would
only be one pair of q-dq (from the conv weight) in the matched
pattern, and this is no longer true.
Test Plan:
python test/test_quantization.py TestQuantizePT2EQAT.test_qat_conv_bn_bias_derived_qspec
Reviewers: jerryzh168, kimishpatel
Subscribers: jerryzh168, kimishpatel, supriyar
Differential Revision: [D50856377](https://our.internmc.facebook.com/intern/diff/D50856377)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112528
Approved by: https://github.com/jerryzh168
For #108345, #111484
Addresses the forward kernels implicated in the issues, but will take another look at the backward kernels (in follow-up PRs if necessary).
The spatial softmax kernel is changed to use signed integer indexing rather than unsigned, as `ScalarType` only has signed integer types declared for now; this should be a minor change.
CC @ptrblck @crcrpar (who landed a few related PRs recently).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112096
Approved by: https://github.com/mikaylagawarecki
Fixes#112631. The previous PR #112943 had an accidental merge, so it is resolved through this PR.
- torch/nn/utils/parametrizations.py
**Before - 6**
```
torch\nn\utils\parametrizations.py:1 at module level:
D100: Missing docstring in public module
torch\nn\utils\parametrizations.py:23 in private function `_make_orthogonal`:
D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\parametrizations.py:23 in private function `_make_orthogonal`:
D210: No whitespaces allowed surrounding docstring text
torch\nn\utils\parametrizations.py:178 in public function `orthogonal`:
D401: First line should be in imperative mood (perhaps 'Apply', not 'Applies')
torch\nn\utils\parametrizations.py:309 in public function `weight_norm`:
D401: First line should be in imperative mood (perhaps 'Apply', not 'Applies')
torch\nn\utils\parametrizations.py:483 in public function `spectral_norm`:
D401: First line should be in imperative mood (perhaps 'Apply', not 'Applies')
6
```
**After - 1**
```
torch\nn\utils\parametrizations.py:1 at module level:
D100: Missing docstring in public module
1
```
- torch/nn/utils/prune.py
**Before - 100**
```
torch\nn\utils\prune.py:1 at module level:
D200: One-line docstring should fit on one line with quotes (found 3)
torch\nn\utils\prune.py:1 at module level:
D400: First line should end with a period (not 's')
torch\nn\utils\prune.py:13 in public class `BasePruningMethod`:
D204: 1 blank line required after class docstring (found 0)
torch\nn\utils\prune.py:21 in public method `__call__`:
D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:21 in public method `__call__`:
D400: First line should end with a period (not ')')
torch\nn\utils\prune.py:34 in public method `compute_mask`:
D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:34 in public method `compute_mask`:
D401: First line should be in imperative mood (perhaps 'Compute', not 'Computes')
torch\nn\utils\prune.py:53 in public method `apply_mask`:
D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:53 in public method `apply_mask`:
D400: First line should end with a period (not 'g')
torch\nn\utils\prune.py:74 in public method `apply`:
D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:74 in public method `apply`:
D400: First line should end with a period (not 'd')
torch\nn\utils\prune.py:74 in public method `apply`:
D401: First line should be in imperative mood (perhaps 'Add', not 'Adds')
torch\nn\utils\prune.py:200 in public method `prune`:
D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:200 in public method `prune`:
D400: First line should end with a period (not '`')
torch\nn\utils\prune.py:200 in public method `prune`:
D401: First line should be in imperative mood (perhaps 'Compute', not 'Computes')
torch\nn\utils\prune.py:229 in public method `remove`:
D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:229 in public method `remove`:
D400: First line should end with a period (not 'd')
torch\nn\utils\prune.py:229 in public method `remove`:
D401: First line should be in imperative mood (perhaps 'Remove', not 'Removes')
torch\nn\utils\prune.py:256 in public class `PruningContainer`:
D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:264 in public method `__init__`:
D107: Missing docstring in __init__
torch\nn\utils\prune.py:277 in public method `add_pruning_method`:
D401: First line should be in imperative mood (perhaps 'Add', not 'Adds')
torch\nn\utils\prune.py:297 in public method `__len__`:
D105: Missing docstring in magic method
torch\nn\utils\prune.py:300 in public method `__iter__`:
D105: Missing docstring in magic method
torch\nn\utils\prune.py:303 in public method `__getitem__`:
D105: Missing docstring in magic method
torch\nn\utils\prune.py:307 in public method `compute_mask`:
D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:307 in public method `compute_mask`:
D400: First line should end with a period (not 's')
torch\nn\utils\prune.py:307 in public method `compute_mask`:
D401: First line should be in imperative mood (perhaps 'Apply', not 'Applies')
torch\nn\utils\prune.py:335 in private nested function `_combine_masks`:
D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:335 in private nested function `_combine_masks`:
D400: First line should end with a period (not ':')
torch\nn\utils\prune.py:404 in public class `Identity`:
D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:404 in public class `Identity`:
D400: First line should end with a period (not 'e')
torch\nn\utils\prune.py:410 in public method `compute_mask`:
D102: Missing docstring in public method
torch\nn\utils\prune.py:416 in public method `apply`:
D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:416 in public method `apply`:
D400: First line should end with a period (not 'd')
torch\nn\utils\prune.py:416 in public method `apply`:
D401: First line should be in imperative mood (perhaps 'Add', not 'Adds')
torch\nn\utils\prune.py:442 in public method `__init__`:
D107: Missing docstring in __init__
torch\nn\utils\prune.py:447 in public method `compute_mask`:
D102: Missing docstring in public method
torch\nn\utils\prune.py:469 in public method `apply`:
D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:469 in public method `apply`:
D400: First line should end with a period (not 'd')
torch\nn\utils\prune.py:469 in public method `apply`:
D401: First line should be in imperative mood (perhaps 'Add', not 'Adds')
torch\nn\utils\prune.py:486 in public class `L1Unstructured`:
D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:486 in public class `L1Unstructured`:
D400: First line should end with a period (not 's')
torch\nn\utils\prune.py:498 in public method `__init__`:
D107: Missing docstring in __init__
torch\nn\utils\prune.py:503 in public method `compute_mask`:
D102: Missing docstring in public method
torch\nn\utils\prune.py:527 in public method `apply`:
D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:527 in public method `apply`:
D400: First line should end with a period (not 'd')
torch\nn\utils\prune.py:527 in public method `apply`:
D401: First line should be in imperative mood (perhaps 'Add', not 'Adds')
torch\nn\utils\prune.py:564 in public method `__init__`:
D107: Missing docstring in __init__
torch\nn\utils\prune.py:571 in public method `compute_mask`:
D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:571 in public method `compute_mask`:
D401: First line should be in imperative mood (perhaps 'Compute', not 'Computes')
torch\nn\utils\prune.py:634 in public method `apply`:
D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:634 in public method `apply`:
D400: First line should end with a period (not 'd')
torch\nn\utils\prune.py:634 in public method `apply`:
D401: First line should be in imperative mood (perhaps 'Add', not 'Adds')
torch\nn\utils\prune.py:653 in public class `LnStructured`:
D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:653 in public class `LnStructured`:
D400: First line should end with a period (not 'r')
torch\nn\utils\prune.py:669 in public method `__init__`:
D107: Missing docstring in __init__
torch\nn\utils\prune.py:677 in public method `compute_mask`:
D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:677 in public method `compute_mask`:
D401: First line should be in imperative mood (perhaps 'Compute', not 'Computes')
torch\nn\utils\prune.py:747 in public method `apply`:
D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:747 in public method `apply`:
D400: First line should end with a period (not 'd')
torch\nn\utils\prune.py:747 in public method `apply`:
D401: First line should be in imperative mood (perhaps 'Add', not 'Adds')
torch\nn\utils\prune.py:779 in public class `CustomFromMask`:
D101: Missing docstring in public class
torch\nn\utils\prune.py:783 in public method `__init__`:
D107: Missing docstring in __init__
torch\nn\utils\prune.py:786 in public method `compute_mask`:
D102: Missing docstring in public method
torch\nn\utils\prune.py:793 in public method `apply`:
D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:793 in public method `apply`:
D400: First line should end with a period (not 'd')
torch\nn\utils\prune.py:793 in public method `apply`:
D401: First line should be in imperative mood (perhaps 'Add', not 'Adds')
torch\nn\utils\prune.py:806 in public function `identity`:
D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:806 in public function `identity`:
D400: First line should end with a period (not 'e')
torch\nn\utils\prune.py:806 in public function `identity`:
D401: First line should be in imperative mood (perhaps 'Apply', not 'Applies')
torch\nn\utils\prune.py:839 in public function `random_unstructured`:
D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:839 in public function `random_unstructured`:
D400: First line should end with a period (not '`')
torch\nn\utils\prune.py:874 in public function `l1_unstructured`:
D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:874 in public function `l1_unstructured`:
D400: First line should end with a period (not '`')
torch\nn\utils\prune.py:916 in public function `random_structured`:
D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:916 in public function `random_structured`:
D400: First line should end with a period (not '`')
torch\nn\utils\prune.py:955 in public function `ln_structured`:
D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:955 in public function `ln_structured`:
D400: First line should end with a period (not '`')
torch\nn\utils\prune.py:1000 in public function `global_unstructured`:
D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:1000 in public function `global_unstructured`:
D400: First line should end with a period (not '`')
torch\nn\utils\prune.py:1120 in public function `custom_from_mask`:
D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:1120 in public function `custom_from_mask`:
D400: First line should end with a period (not '`')
torch\nn\utils\prune.py:1154 in public function `remove`:
D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:1154 in public function `remove`:
D400: First line should end with a period (not 'e')
torch\nn\utils\prune.py:1154 in public function `remove`:
D401: First line should be in imperative mood (perhaps 'Remove', not 'Removes')
torch\nn\utils\prune.py:1184 in public function `is_pruned`:
D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:1184 in public function `is_pruned`:
D400: First line should end with a period (not 'r')
torch\nn\utils\prune.py:1211 in private function `_validate_pruning_amount_init`:
D401: First line should be in imperative mood (perhaps 'Validate', not 'Validation')
torch\nn\utils\prune.py:1243 in private function `_validate_pruning_amount`:
D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:1243 in private function `_validate_pruning_amount`:
D400: First line should end with a period (not 'e')
torch\nn\utils\prune.py:1243 in private function `_validate_pruning_amount`:
D401: First line should be in imperative mood (perhaps 'Validate', not 'Validation')
torch\nn\utils\prune.py:1265 in private function `_validate_structured_pruning`:
D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:1265 in private function `_validate_structured_pruning`:
D400: First line should end with a period (not '-')
torch\nn\utils\prune.py:1265 in private function `_validate_structured_pruning`:
D401: First line should be in imperative mood (perhaps 'Validate', not 'Validation')
torch\nn\utils\prune.py:1284 in private function `_compute_nparams_toprune`:
D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:1284 in private function `_compute_nparams_toprune`:
D400: First line should end with a period (not 'a')
torch\nn\utils\prune.py:1308 in private function `_validate_pruning_dim`:
D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:1308 in private function `_validate_pruning_dim`:
D400: First line should end with a period (not ':')
torch\nn\utils\prune.py:1318 in private function `_compute_norm`:
D205: 1 blank line required between summary line and description (found 0)
torch\nn\utils\prune.py:1318 in private function `_compute_norm`:
D400: First line should end with a period (not 'n')
100
```
**After - 14**
```
torch\nn\utils\prune.py:266 in public method `__init__`:
D107: Missing docstring in __init__
torch\nn\utils\prune.py:299 in public method `__len__`:
D105: Missing docstring in magic method
torch\nn\utils\prune.py:302 in public method `__iter__`:
D105: Missing docstring in magic method
torch\nn\utils\prune.py:305 in public method `__getitem__`:
D105: Missing docstring in magic method
torch\nn\utils\prune.py:411 in public method `compute_mask`:
D102: Missing docstring in public method
torch\nn\utils\prune.py:445 in public method `__init__`:
D107: Missing docstring in __init__
torch\nn\utils\prune.py:450 in public method `compute_mask`:
D102: Missing docstring in public method
torch\nn\utils\prune.py:502 in public method `__init__`:
D107: Missing docstring in __init__
torch\nn\utils\prune.py:507 in public method `compute_mask`:
D102: Missing docstring in public method
torch\nn\utils\prune.py:570 in public method `__init__`:
D107: Missing docstring in __init__
torch\nn\utils\prune.py:677 in public method `__init__`:
D107: Missing docstring in __init__
torch\nn\utils\prune.py:790 in public class `CustomFromMask`:
D101: Missing docstring in public class
torch\nn\utils\prune.py:794 in public method `__init__`:
D107: Missing docstring in __init__
torch\nn\utils\prune.py:797 in public method `compute_mask`:
D102: Missing docstring in public method
14
```
- torch/nn/utils/weight_norm.py
**Before - 10**
```
torch\nn\utils\weight_norm.py:1 at module level:
D200: One-line docstring should fit on one line with quotes (found 3)
torch\nn\utils\weight_norm.py:1 at module level:
D400: First line should end with a period (not '8')
torch\nn\utils\weight_norm.py:12 in public class `WeightNorm`:
D101: Missing docstring in public class
torch\nn\utils\weight_norm.py:16 in public method `__init__`:
D107: Missing docstring in __init__
torch\nn\utils\weight_norm.py:23 in public method `compute_weight`:
D102: Missing docstring in public method
torch\nn\utils\weight_norm.py:29 in public method `apply`:
D102: Missing docstring in public method
torch\nn\utils\weight_norm.py:59 in public method `remove`:
D102: Missing docstring in public method
torch\nn\utils\weight_norm.py:66 in public method `__call__`:
D102: Missing docstring in public method
torch\nn\utils\weight_norm.py:73 in public function `weight_norm`:
D401: First line should be in imperative mood (perhaps 'Apply', not 'Applies')
torch\nn\utils\weight_norm.py:137 in public function `remove_weight_norm`:
D401: First line should be in imperative mood (perhaps 'Remove', not 'Removes')
10
```
**After - 6**
```
torch\nn\utils\weight_norm.py:10 in public class `WeightNorm`:
D101: Missing docstring in public class
torch\nn\utils\weight_norm.py:14 in public method `__init__`:
D107: Missing docstring in __init__
torch\nn\utils\weight_norm.py:21 in public method `compute_weight`:
D102: Missing docstring in public method
torch\nn\utils\weight_norm.py:27 in public method `apply`:
D102: Missing docstring in public method
torch\nn\utils\weight_norm.py:57 in public method `remove`:
D102: Missing docstring in public method
torch\nn\utils\weight_norm.py:64 in public method `__call__`:
D102: Missing docstring in public method
6
```
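For context, a typical fix for the D205/D400/D401 warnings above looks like this (an illustrative sketch, not an exact excerpt from the diff):
```py
# Before: summary not imperative, no period, no blank line before description.
def compute_mask(self, t, default_mask):
    """Computes and returns a mask for the input tensor ``t``
    based on the pruning method recipe"""

# After: imperative summary ending with a period, blank line, then description.
def compute_mask(self, t, default_mask):
    """Compute and return a mask for the input tensor ``t``.

    Apply the pruning method recipe to combine ``default_mask`` with the
    new mask for this pruning iteration.
    """
```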
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113021
Approved by: https://github.com/lezcano
As this is the oldest gcc that is fully compatible with the C++17 standard.
- Replace a number of conditional version checks with the simpler `if(CMAKE_COMPILER_IS_GNUCXX)` or `append_cxx_flag_if_supported`.
- As the `-Wsuggest-override` condition was hidden behind an incorrect guard, add the missing `override` keywords to `torch::autograd::PyFunctionTensorPostAccGradHooks::apply_with_saved`, `caffe2::python::TensorFeeder::Feed`, and `caffe2::NetObserverReporterPrint::report`
Fixes https://github.com/pytorch/pytorch/issues/101839
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112858
Approved by: https://github.com/Skylion007, https://github.com/albanD
Summary:
If there are xfails in the failures_dict and the operator has the `pt2_compliant_tag`, then we raise an error. These generated tests are separate from those in the failures dict because we don't actually need any sample inputs to perform this check.
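A minimal sketch of the described check, with hypothetical names (not the actual test-generation internals):
```py
# Hypothetical sketch: raise if an op tagged pt2_compliant still has xfail
# entries in the failures dict. All names here are illustrative.
def check_pt2_compliance(op_name, tags, failures_dict):
    entries = failures_dict.get(op_name, {})
    has_xfail = any(e.get("status") == "xfail" for e in entries.values())
    if "pt2_compliant_tag" in tags and has_xfail:
        raise RuntimeError(
            f"{op_name} is tagged pt2_compliant but has xfails in the failures dict"
        )
```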
Test Plan:
- New tests
Differential Revision: D50936201
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112759
Approved by: https://github.com/ezyang
Testing out some new rules that are in beta; I think I will apply this one codebase-wide once it's out of preview. Replaces the hack of using `[:]` to copy lists with the proper `copy` method. More efficient and more readable.
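The kind of rewrite this rule performs, roughly:
```py
items = [1, 2, 3]

copy_old = items[:]       # before: copy via a full slice
copy_new = items.copy()   # after: the explicit list.copy() method

assert copy_old == copy_new == items
```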
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112990
Approved by: https://github.com/ezyang
Enables some tests that were incorrectly not being run and enables PIE794 globally. This rule checks whether a class variable is defined twice and flags it, as that is likely a bug. In fact, we found several cases where it was a bug. It does have a couple of false positives, which I flagged upstream and replaced with noqas: https://github.com/astral-sh/ruff/issues/8497
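An illustrative example of what PIE794 flags:
```py
class Config:
    timeout = 30
    retries = 3
    timeout = 60  # PIE794: `timeout` defined twice; the first value is silently lost
```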
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112989
Approved by: https://github.com/malfet
In my work on making guards installed eagerly (look up the stack), I found that our checkpoint/restore mechanism is very broken. There is lots of state (especially in shape_env) which we don't checkpoint and restore properly. We also have lots of mutable state on variable trackers already which is not checkpointed/restored. (See other PRs in this stack for some spot fixes.)
Since we wanted to get rid of this anyway for making VariableTracker mutable, I figured I would just switch to restarting analysis.
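Conceptually, the restart approach looks something like this (hypothetical names; Dynamo's actual control flow is more involved):
```py
# Hypothetical sketch of speculate-then-restart: instead of checkpointing and
# rolling back mutable state, re-run the whole analysis with a recorded hint.
class SpeculationFailed(Exception):
    pass

def analyze_with_restarts(run_analysis, record_hint, max_restarts=3):
    for _ in range(max_restarts):
        try:
            return run_analysis()
        except SpeculationFailed as failure:
            record_hint(failure)  # do something different on the next attempt
    raise RuntimeError("analysis did not converge after restarts")
```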
For other usages of copy_graphstate/restore_graphstate:
1) Many usages were pointless and not needed; these are removed in PRs below this.
2) Some other usage (similar to this one) is removed in PRs above this.
3) The tricky one I am not handling is higher_order_ops, which uses checkpoint/restore a lot. There might be some cases there where this speculate/restart trick won't work.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112902
Approved by: https://github.com/voznesenskym
Finding optimal meta parameters for the bsr_dense_mm and bsr_scatter_mm triton kernels is a tedious job. This PR introduces a tool (a Python script, `torch/sparse/_triton_ops_meta.py`) that finds the optimal set of meta parameters for a given set of matrix multiplication inputs and their block sizes. Currently, such a set is found for square bsr tensor inputs with sizes 256...16384 and square block sizes 16...128, and dense tensor inputs with sizes 256...131072.
As a result, bsr_dense_mm performance has increased as follows (`NVIDIA A100-SXM4-80GB`):
- for block size 16x16, the average/maximum speedup is about 40/60%.
- for block size 32x32, the average/maximum speedup is about 28/45%.
- for block size 64x64, the average/maximum speedup is about 26/43%.
- for block size 128x128, the average/maximum speedup is about 12/28%.
To enable these performance improvements through meta parameter optimization for other CUDA devices, one must execute `_triton_ops_meta.py`, which will calculate the optimal meta parameters and store the results in a dictionary object defined in `_triton_ops_meta.py`.
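At its core, such a search just times each candidate set of meta parameters and keeps the fastest; a minimal sketch (not the actual `_triton_ops_meta.py` API):
```py
import torch

def find_best_meta(run_kernel, candidates):
    """Return the candidate meta-parameter dict with the lowest measured time."""
    best, best_ms = None, float("inf")
    for meta in candidates:
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        run_kernel(**meta)  # e.g. a bsr_dense_mm call specialized with `meta`
        end.record()
        torch.cuda.synchronize()
        ms = start.elapsed_time(end)
        if ms < best_ms:
            best, best_ms = meta, ms
    return best
```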
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112737
Approved by: https://github.com/cpuhrsch
As part of this diff, I have upgraded the `python_version` config setting to 3.11. `bytecode_transformation.py` (and a few other files) have functions using APIs only available in Python 3.11+. Those APIs are gated by a sys.version_info check in their typeshed .pyi files. So setting the min version to 3.11 allows those functions to typecheck properly.
An alternative is to make the relevant types Any:
```
if sys.version_info >= (3, 11):
    _Positions = dis.Positions
else:
    _Positions = Any
```
However, with python_version = 3.8, that means we're not getting any useful typechecking signal when encountering values of type `_Positions`.
Changing the python_version to 3.11 does mean that we will stop typechecking codepaths that run only on lower versions, but that seems a small price to pay. It does also mean that we won't catch code that uses newer APIs without the appropriate version check, but again, not sure this has much of an impact.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112561
Approved by: https://github.com/ezyang
Fixes#110762
This PR:
fixes the issue described in #110762 by adding kwargs for shape and stride when creating a DTensor using `DTensor.from_local()`. When `shape` and `stride` are provided, we skip the calculation of `tensor_shape` and `tensor_stride` via `compute_global_tensor_info()`, as `compute_global_tensor_info()` always assumes even sharding.
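A minimal usage sketch, assuming a 2-rank process group is already set up (e.g. launched via torchrun; the sizes here are illustrative):
```py
import torch
from torch.distributed._tensor import DTensor, Shard
from torch.distributed.device_mesh import init_device_mesh

mesh = init_device_mesh("cpu", (2,))
# Uneven sharding on dim 0: rank 0 holds 3 rows, rank 1 holds 2 rows.
local = torch.randn(3 if mesh.get_rank() == 0 else 2, 4)
# Pass the global shape/stride so from_local() does not infer them via
# compute_global_tensor_info(), which assumes even sharding.
dt = DTensor.from_local(local, mesh, [Shard(0)], shape=(5, 4), stride=(4, 1))
```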
Test plan:
```
python3 test/distributed/_tensor/test_dtensor.py -k test_from_local_uneven_sharding
python3 test/distributed/_tensor/test_dtensor.py -k test_from_local_uneven_sharding_raise_error
```
cc. @wanchaol
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110781
Approved by: https://github.com/wanchaol
Plan B for https://github.com/pytorch/pytorch/pull/112839
Motivation for the change:
1. We need to remove `funcol` as a dependency of device_mesh.py to resolve circular dependency issues when introducing device_mesh as an arg for DDP. At the same time, we should not go from funcol to non-funcol, as @voznesenskym suggested. Therefore, we want to remove this all_gather check completely.
2. At large scale, it would not make sense to validate the mesh globally anyway.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112928
Approved by: https://github.com/wanchaol
Sample output if incorrect/missing args are specified:
```
RuntimeError: Missing expected env vars for `test_distributed_spawn.py`. Please
ensure to specify the following:
'BACKEND' = one of ('gloo', 'nccl', 'ucc')
'WORLD_SIZE' = int >= 2
'TEMP_DIR' specifying a directory containing a barrier file named
'barrier'.
e.g.
touch /tmp/barrier && TEMP_DIR=/tmp BACKEND='nccl' WORLD_SIZE=2 python
/data/users/whc/pytorch/test/distributed/test_distributed_spawn.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112924
Approved by: https://github.com/wanchaol
**Summary**
This diff restructures the quantization test-case pattern-matcher checks. Instead of checking all the patterns matched in Inductor, we will only check the core pattern match count and node numbers, such as dequant promotion, QConv/Linear Unary, and QConv Binary.
**Test Plan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k test_q
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112570
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
Summary: With D50030659, we are able to support zero-size tensors. Hence, remove the check in slice. Also update related tests.
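For context, a zero-size tensor arises naturally from an empty slice (eager CPU example for illustration; the PR itself targets the Vulkan backend):
```py
import torch

t = torch.randn(4, 6)
z = t[:, 3:3]      # empty slice range -> zero-size tensor
print(z.shape)     # torch.Size([4, 0])
```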
Test Plan:
```
[yipjustin@189650.od ~/fbsource (876ab81e3)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*slice*"
File changed: fbsource//xplat/caffe2/aten/src/ATen/test/vulkan_api_test.cpp
File changed: fbcode//caffe2/aten/src/ATen/test/vulkan_api_test.cpp
File changed: fbcode//caffe2/aten/src/ATen/native/vulkan/ops/Slice.cpp
1 additional file change events
Buck UI: https://www.internalfb.com/buck2/85adf6a3-7d17-4685-8d8b-a0b600df0b73
Network: Up: 44KiB Down: 1.3MiB (reSessionID-5afd53d4-0303-4f4d-a245-1eb810308fd3)
Jobs completed: 6. Time elapsed: 22.8s.
Cache hits: 0%. Commands: 2 (cached: 0, remote: 1, local: 1)
BUILD SUCCEEDED
Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *slice*
[==========] Running 6 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 6 tests from VulkanAPITest
[ RUN ] VulkanAPITest.slice_width_success
[ OK ] VulkanAPITest.slice_width_success (160 ms)
[ RUN ] VulkanAPITest.slice_height_success
[ OK ] VulkanAPITest.slice_height_success (6 ms)
[ RUN ] VulkanAPITest.slice_feature_success
[ OK ] VulkanAPITest.slice_feature_success (84 ms)
[ RUN ] VulkanAPITest.slice_batch_success
[ OK ] VulkanAPITest.slice_batch_success (5 ms)
[ RUN ] VulkanAPITest.slice_zero_sized
[ OK ] VulkanAPITest.slice_zero_sized (0 ms)
[ RUN ] VulkanAPITest.slice_invalidinputs_exceptions
[ OK ] VulkanAPITest.slice_invalidinputs_exceptions (0 ms)
[----------] 6 tests from VulkanAPITest (257 ms total)
[----------] Global test environment tear-down
[==========] 6 tests from 1 test suite ran. (257 ms total)
[ PASSED ] 6 tests.
```
Differential Revision: D50961979
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112879
Approved by: https://github.com/manuelcandales
Summary:
The assertion is causing build failures when running Pysa, our security-focused static analyzer.
This is because we run `pyre infer` on the source code before analyzing it, which introduces annotations such as `def foo() -> 'torch._tensor.Tensor'`.
This does not work with the `out_wrapper` decorator which relies on inspecting the signature of the decorated function.
Let's skip the check on the return type if we detect that it was introduced by `pyre infer`.
Test Plan: eyes
Differential Revision: D50976601
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112870
Approved by: https://github.com/ZainRizvi
Summary: tsia
Test Plan:
```
[yipjustin@189650.od ~/fbsource (631468db3)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*softmax*"
File changed: fbsource//xplat/caffe2/aten/src/ATen/test/vulkan_api_test.cpp
File changed: fbcode//caffe2/aten/src/ATen/native/vulkan/ops/Softmax.cpp
File changed: fbcode//caffe2/aten/src/ATen/test/vulkan_api_test.cpp
1 additional file change events
Buck UI: https://www.internalfb.com/buck2/d4f62e52-aba9-448a-a181-cf8881affb14
Network: Up: 0B Down: 0B
Jobs completed: 4. Time elapsed: 0.5s.
BUILD SUCCEEDED
Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *softmax*
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from VulkanAPITest
[ RUN ] VulkanAPITest.softmax
[ OK ] VulkanAPITest.softmax (467 ms)
[ RUN ] VulkanAPITest.log_softmax
[ OK ] VulkanAPITest.log_softmax (95 ms)
[----------] 2 tests from VulkanAPITest (563 ms total)
[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (563 ms total)
[ PASSED ] 2 tests.
YOU HAVE 1 DISABLED TEST
[yipjustin@189650.od ~/fbsource (631468db3)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*log*"
Buck UI: https://www.internalfb.com/buck2/e8210eb5-fd56-45f7-bf6c-5024931e778e
Network: Up: 0B Down: 0B
Jobs completed: 4. Time elapsed: 0.2s.
BUILD SUCCEEDED
Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *log*
[==========] Running 4 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 4 tests from VulkanAPITest
[ RUN ] VulkanAPITest.log_softmax
[ OK ] VulkanAPITest.log_softmax (572 ms)
[ RUN ] VulkanAPITest.unary_op_log
[ OK ] VulkanAPITest.unary_op_log (0 ms)
[ RUN ] VulkanAPITest.unary_op_log_
[ OK ] VulkanAPITest.unary_op_log_ (59 ms)
[ RUN ] VulkanAPITest.querypool_flushed_shader_log
xplat/caffe2/aten/src/ATen/test/vulkan_api_test.cpp:7677: Skipped
QueryPool is not available
[ SKIPPED ] VulkanAPITest.querypool_flushed_shader_log (0 ms)
[----------] 4 tests from VulkanAPITest (633 ms total)
[----------] Global test environment tear-down
[==========] 4 tests from 1 test suite ran. (633 ms total)
[ PASSED ] 3 tests.
[ SKIPPED ] 1 test, listed below:
[ SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
YOU HAVE 1 DISABLED TEST
```
Differential Revision: D50961359
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112828
Approved by: https://github.com/manuelcandales
Summary:
Use multiple concurrent readers to access records from different start indices. This can provide better performance when the data being accessed is large.
bypass-github-pytorch-ci-checks
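The idea, in a minimal Python sketch (illustrative only; the actual change is in the C++ `inline_container` reader):
```py
from concurrent.futures import ThreadPoolExecutor

def read_chunk(path, offset, size):
    # Each reader opens its own handle and seeks to a different start index.
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(size)

def read_concurrently(path, chunks):
    """Read (offset, size) chunks of one file in parallel threads."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda c: read_chunk(path, *c), chunks))
```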
Test Plan:
```
buck2 run @//mode/dev //caffe2/caffe2/serialize:inline_container_test
```
Reviewed By: YazhiGao
Differential Revision: D50957607
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112818
Approved by: https://github.com/houseroad, https://github.com/huydhn
A follow-up of PR #112617 on issue #112596
Added the suggested changes from the review:
- Be more specific about the type of uniform and normal distribution used.
```py
def xavier_uniform_(tensor: Tensor, gain: float = 1.) -> Tensor:
    r"""Fill the input `Tensor` with values using a Xavier uniform distribution.

    The method is described in `Understanding the difficulty of training...
    """
```
```py
def kaiming_normal_(
    tensor: Tensor, a: float = 0, mode: str = 'fan_in', nonlinearity: str = 'leaky_relu'
):
    r"""Fill the input `Tensor` with values using a Kaiming normal distribution.

    The method is described in `Delving deep into rectifiers: Surpassing...
    """
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112864
Approved by: https://github.com/kit1980
Fixes#112636
Before: 265
```
torch/utils/data/datapipes/dataframe/structures.py:1 at module level:
D100: Missing docstring in public module
torch/utils/data/datapipes/dataframe/structures.py:8 in public class `DataChunkDF`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/dataframe/structures.py:8 in public class `DataChunkDF`:
D208: Docstring is over-indented
torch/utils/data/datapipes/dataframe/structures.py:8 in public class `DataChunkDF`:
D400: First line should end with a period (not ',')
torch/utils/data/datapipes/dataframe/structures.py:13 in public method `__iter__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/dataframe/structures.py:17 in public method `__len__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/datapipe.py:1 at module level:
D100: Missing docstring in public module
torch/utils/data/datapipes/datapipe.py:43 in public class `IterDataPipe`:
D204: 1 blank line required after class docstring (found 0)
torch/utils/data/datapipes/datapipe.py:119 in public method `__iter__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/datapipe.py:122 in public method `__getattr__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/datapipe.py:135 in public method `register_function`:
D102: Missing docstring in public method
torch/utils/data/datapipes/datapipe.py:139 in public method `register_datapipe_as_function`:
D102: Missing docstring in public method
torch/utils/data/datapipes/datapipe.py:161 in public method `__getstate__`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/datapipe.py:161 in public method `__getstate__`:
D401: First line should be in imperative mood; try rephrasing (found 'This')
torch/utils/data/datapipes/datapipe.py:171 in public method `__reduce_ex__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/datapipe.py:180 in public method `set_getstate_hook`:
D102: Missing docstring in public method
torch/utils/data/datapipes/datapipe.py:186 in public method `set_reduce_ex_hook`:
D102: Missing docstring in public method
torch/utils/data/datapipes/datapipe.py:191 in public method `__repr__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/datapipe.py:197 in public method `__str__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/datapipe.py:203 in public method `__dir__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/datapipe.py:208 in public method `reset`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/datapipe.py:208 in public method `reset`:
D400: First line should end with a period (not ',')
torch/utils/data/datapipes/datapipe.py:217 in public class `DFIterDataPipe`:
D101: Missing docstring in public class
torch/utils/data/datapipes/datapipe.py:223 in public class `MapDataPipe`:
D204: 1 blank line required after class docstring (found 0)
torch/utils/data/datapipes/datapipe.py:261 in public method `__getattr__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/datapipe.py:274 in public method `register_function`:
D102: Missing docstring in public method
torch/utils/data/datapipes/datapipe.py:278 in public method `register_datapipe_as_function`:
D102: Missing docstring in public method
torch/utils/data/datapipes/datapipe.py:293 in public method `__getstate__`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/datapipe.py:293 in public method `__getstate__`:
D401: First line should be in imperative mood; try rephrasing (found 'This')
torch/utils/data/datapipes/datapipe.py:303 in public method `__reduce_ex__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/datapipe.py:312 in public method `set_getstate_hook`:
D102: Missing docstring in public method
torch/utils/data/datapipes/datapipe.py:318 in public method `set_reduce_ex_hook`:
D102: Missing docstring in public method
torch/utils/data/datapipes/datapipe.py:323 in public method `__repr__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/datapipe.py:329 in public method `__str__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/datapipe.py:335 in public method `__dir__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/datapipe.py:392 in public class `DataChunk`:
D101: Missing docstring in public class
torch/utils/data/datapipes/datapipe.py:393 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/datapipe.py:397 in public method `as_str`:
D102: Missing docstring in public method
torch/utils/data/datapipes/datapipe.py:401 in public method `__iter__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/datapipe.py:404 in public method `raw_iterator`:
D102: Missing docstring in public method
torch/utils/data/datapipes/iter/callable.py:1 at module level:
D100: Missing docstring in public module
torch/utils/data/datapipes/iter/callable.py:23 in public class `MapperIterDataPipe`:
D204: 1 blank line required after class docstring (found 0)
torch/utils/data/datapipes/iter/callable.py:23 in public class `MapperIterDataPipe`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/iter/callable.py:63 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/callable.py:121 in public method `__iter__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/callable.py:125 in public method `__len__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/callable.py:173 in public class `CollatorIterDataPipe`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/iter/callable.py:213 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/combinatorics.py:1 at module level:
D100: Missing docstring in public module
torch/utils/data/datapipes/iter/combinatorics.py:18 in public class `SamplerIterDataPipe`:
D204: 1 blank line required after class docstring (found 0)
torch/utils/data/datapipes/iter/combinatorics.py:29 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/combinatorics.py:44 in public method `__iter__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combinatorics.py:47 in public method `__len__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combinatorics.py:56 in public class `ShufflerIterDataPipe`:
D204: 1 blank line required after class docstring (found 0)
torch/utils/data/datapipes/iter/combinatorics.py:56 in public class `ShufflerIterDataPipe`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/iter/combinatorics.py:56 in public class `ShufflerIterDataPipe`:
D400: First line should end with a period (not 'r')
torch/utils/data/datapipes/iter/combinatorics.py:94 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/combinatorics.py:114 in public method `set_shuffle`:
D102: Missing docstring in public method
torch/utils/data/datapipes/iter/combinatorics.py:118 in public method `set_seed`:
D102: Missing docstring in public method
torch/utils/data/datapipes/iter/combinatorics.py:122 in public method `__iter__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combinatorics.py:137 in public method `__len__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combinatorics.py:142 in public method `reset`:
D102: Missing docstring in public method
torch/utils/data/datapipes/iter/combinatorics.py:150 in public method `__getstate__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combinatorics.py:165 in public method `__setstate__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combinatorics.py:179 in public method `__del__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combining.py:1 at module level:
D100: Missing docstring in public module
torch/utils/data/datapipes/iter/combining.py:26 in public class `ConcaterIterDataPipe`:
D204: 1 blank line required after class docstring (found 0)
torch/utils/data/datapipes/iter/combining.py:26 in public class `ConcaterIterDataPipe`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/iter/combining.py:26 in public class `ConcaterIterDataPipe`:
D400: First line should end with a period (not 'l')
torch/utils/data/datapipes/iter/combining.py:44 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/combining.py:51 in public method `__iter__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combining.py:55 in public method `__len__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combining.py:64 in public class `ForkerIterDataPipe`:
D204: 1 blank line required after class docstring (found 0)
torch/utils/data/datapipes/iter/combining.py:92 in public method `__new__`:
D102: Missing docstring in public method
torch/utils/data/datapipes/iter/combining.py:108 in private class `_ContainerTemplate`:
D204: 1 blank line required after class docstring (found 0)
torch/utils/data/datapipes/iter/combining.py:108 in private class `_ContainerTemplate`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/iter/combining.py:108 in private class `_ContainerTemplate`:
D400: First line should end with a period (not 'd')
torch/utils/data/datapipes/iter/combining.py:126 in private method `get_length_by_instance`:
D200: One-line docstring should fit on one line with quotes (found 3)
torch/utils/data/datapipes/iter/combining.py:126 in private method `get_length_by_instance`:
D400: First line should end with a period (not '`')
torch/utils/data/datapipes/iter/combining.py:136 in private class `_ForkerIterDataPipe`:
D204: 1 blank line required after class docstring (found 0)
torch/utils/data/datapipes/iter/combining.py:136 in private class `_ForkerIterDataPipe`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/iter/combining.py:136 in private class `_ForkerIterDataPipe`:
D400: First line should end with a period (not 's')
torch/utils/data/datapipes/iter/combining.py:275 in private class `_ChildDataPipe`:
D204: 1 blank line required after class docstring (found 0)
torch/utils/data/datapipes/iter/combining.py:275 in private class `_ChildDataPipe`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/iter/combining.py:275 in private class `_ChildDataPipe`:
D400: First line should end with a period (not 's')
torch/utils/data/datapipes/iter/combining.py:320 in private method `_set_main_datapipe_valid_iterator_id`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/iter/combining.py:343 in private method `_check_valid_iterator_id`:
D200: One-line docstring should fit on one line with quotes (found 3)
torch/utils/data/datapipes/iter/combining.py:351 in public class `DemultiplexerIterDataPipe`:
D204: 1 blank line required after class docstring (found 0)
torch/utils/data/datapipes/iter/combining.py:351 in public class `DemultiplexerIterDataPipe`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/iter/combining.py:351 in public class `DemultiplexerIterDataPipe`:
D400: First line should end with a period (not 'n')
torch/utils/data/datapipes/iter/combining.py:384 in public method `__new__`:
D102: Missing docstring in public method
torch/utils/data/datapipes/iter/combining.py:399 in private class `_DemultiplexerIterDataPipe`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/iter/combining.py:399 in private class `_DemultiplexerIterDataPipe`:
D400: First line should end with a period (not 's')
torch/utils/data/datapipes/iter/combining.py:534 in public class `MultiplexerIterDataPipe`:
D204: 1 blank line required after class docstring (found 0)
torch/utils/data/datapipes/iter/combining.py:534 in public class `MultiplexerIterDataPipe`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/iter/combining.py:534 in public class `MultiplexerIterDataPipe`:
D400: First line should end with a period (not ',')
torch/utils/data/datapipes/iter/combining.py:549 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/combining.py:553 in public method `__iter__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combining.py:566 in public method `__len__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combining.py:572 in public method `reset`:
D102: Missing docstring in public method
torch/utils/data/datapipes/iter/combining.py:575 in public method `__getstate__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combining.py:585 in public method `__setstate__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combining.py:593 in public method `__del__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combining.py:599 in public class `ZipperIterDataPipe`:
D204: 1 blank line required after class docstring (found 0)
torch/utils/data/datapipes/iter/combining.py:599 in public class `ZipperIterDataPipe`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/iter/combining.py:615 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/combining.py:622 in public method `__iter__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combining.py:626 in public method `__len__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/filelister.py:1 at module level:
D100: Missing docstring in public module
torch/utils/data/datapipes/iter/filelister.py:15 in public class `FileListerIterDataPipe`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/iter/filelister.py:36 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/filelister.py:58 in public method `__iter__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/filelister.py:62 in public method `__len__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/fileopener.py:1 at module level:
D100: Missing docstring in public module
torch/utils/data/datapipes/iter/fileopener.py:15 in public class `FileOpenerIterDataPipe`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/iter/fileopener.py:15 in public class `FileOpenerIterDataPipe`:
D400: First line should end with a period (not 'm')
torch/utils/data/datapipes/iter/fileopener.py:42 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/fileopener.py:66 in public method `__iter__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/fileopener.py:69 in public method `__len__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/grouping.py:1 at module level:
D100: Missing docstring in public module
torch/utils/data/datapipes/iter/grouping.py:31 in public class `BatcherIterDataPipe`:
D204: 1 blank line required after class docstring (found 0)
torch/utils/data/datapipes/iter/grouping.py:31 in public class `BatcherIterDataPipe`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/iter/grouping.py:31 in public class `BatcherIterDataPipe`:
D400: First line should end with a period (not 's')
torch/utils/data/datapipes/iter/grouping.py:55 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/grouping.py:68 in public method `__iter__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/grouping.py:79 in public method `__len__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/grouping.py:91 in public class `UnBatcherIterDataPipe`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/iter/grouping.py:91 in public class `UnBatcherIterDataPipe`:
D400: First line should end with a period (not 'l')
torch/utils/data/datapipes/iter/grouping.py:112 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/grouping.py:118 in public method `__iter__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/grouping.py:143 in public class `GrouperIterDataPipe`:
D204: 1 blank line required after class docstring (found 0)
torch/utils/data/datapipes/iter/grouping.py:143 in public class `GrouperIterDataPipe`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/iter/grouping.py:143 in public class `GrouperIterDataPipe`:
D400: First line should end with a period (not ',')
torch/utils/data/datapipes/iter/grouping.py:185 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/grouping.py:233 in public method `__iter__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/grouping.py:257 in public method `reset`:
D102: Missing docstring in public method
torch/utils/data/datapipes/iter/grouping.py:261 in public method `__getstate__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/grouping.py:278 in public method `__setstate__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/grouping.py:294 in public method `__del__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/routeddecoder.py:1 at module level:
D100: Missing docstring in public module
torch/utils/data/datapipes/iter/routeddecoder.py:19 in public class `RoutedDecoderIterDataPipe`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/iter/routeddecoder.py:19 in public class `RoutedDecoderIterDataPipe`:
D400: First line should end with a period (not 'a')
torch/utils/data/datapipes/iter/routeddecoder.py:37 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/routeddecoder.py:53 in public method `add_handler`:
D102: Missing docstring in public method
torch/utils/data/datapipes/iter/routeddecoder.py:56 in public method `__iter__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/routeddecoder.py:62 in public method `__len__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/selecting.py:1 at module level:
D100: Missing docstring in public module
torch/utils/data/datapipes/iter/selecting.py:21 in public class `FilterIterDataPipe`:
D204: 1 blank line required after class docstring (found 0)
torch/utils/data/datapipes/iter/selecting.py:46 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/selecting.py:70 in public method `__iter__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/sharding.py:1 at module level:
D100: Missing docstring in public module
torch/utils/data/datapipes/iter/sharding.py:17 in public class `SHARDING_PRIORITIES`:
D101: Missing docstring in public class
torch/utils/data/datapipes/iter/sharding.py:30 in public class `ShardingFilterIterDataPipe`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/iter/sharding.py:30 in public class `ShardingFilterIterDataPipe`:
D400: First line should end with a period (not 's')
torch/utils/data/datapipes/iter/sharding.py:39 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/sharding.py:47 in public method `apply_sharding`:
D102: Missing docstring in public method
torch/utils/data/datapipes/iter/sharding.py:74 in public method `__iter__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/sharding.py:79 in public method `__len__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/streamreader.py:1 at module level:
D100: Missing docstring in public module
torch/utils/data/datapipes/iter/streamreader.py:10 in public class `StreamReaderIterDataPipe`:
D204: 1 blank line required after class docstring (found 0)
torch/utils/data/datapipes/iter/streamreader.py:10 in public class `StreamReaderIterDataPipe`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/iter/streamreader.py:10 in public class `StreamReaderIterDataPipe`:
D400: First line should end with a period (not 'l')
torch/utils/data/datapipes/iter/streamreader.py:27 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/streamreader.py:31 in public method `__iter__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/utils.py:1 at module level:
D100: Missing docstring in public module
torch/utils/data/datapipes/iter/utils.py:9 in public class `IterableWrapperIterDataPipe`:
D204: 1 blank line required after class docstring (found 0)
torch/utils/data/datapipes/iter/utils.py:29 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/utils.py:33 in public method `__iter__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/utils.py:49 in public method `__len__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/map/callable.py:1 at module level:
D100: Missing docstring in public module
torch/utils/data/datapipes/map/callable.py:14 in public function `default_fn`:
D103: Missing docstring in public function
torch/utils/data/datapipes/map/callable.py:20 in public class `MapperMapDataPipe`:
D204: 1 blank line required after class docstring (found 0)
torch/utils/data/datapipes/map/callable.py:20 in public class `MapperMapDataPipe`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/map/callable.py:45 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/map/callable.py:55 in public method `__len__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/map/callable.py:58 in public method `__getitem__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/map/combinatorics.py:1 at module level:
D100: Missing docstring in public module
torch/utils/data/datapipes/map/combinatorics.py:15 in public class `ShufflerIterDataPipe`:
D204: 1 blank line required after class docstring (found 0)
torch/utils/data/datapipes/map/combinatorics.py:55 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/map/combinatorics.py:68 in public method `set_shuffle`:
D102: Missing docstring in public method
torch/utils/data/datapipes/map/combinatorics.py:72 in public method `set_seed`:
D102: Missing docstring in public method
torch/utils/data/datapipes/map/combinatorics.py:76 in public method `__iter__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/map/combinatorics.py:85 in public method `reset`:
D102: Missing docstring in public method
torch/utils/data/datapipes/map/combinatorics.py:92 in public method `__len__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/map/combinatorics.py:95 in public method `__getstate__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/map/combinatorics.py:110 in public method `__setstate__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/map/combining.py:1 at module level:
D100: Missing docstring in public module
torch/utils/data/datapipes/map/combining.py:12 in public class `ConcaterMapDataPipe`:
D204: 1 blank line required after class docstring (found 0)
torch/utils/data/datapipes/map/combining.py:12 in public class `ConcaterMapDataPipe`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/map/combining.py:34 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/map/combining.py:43 in public method `__getitem__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/map/combining.py:52 in public method `__len__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/map/combining.py:58 in public class `ZipperMapDataPipe`:
D204: 1 blank line required after class docstring (found 0)
torch/utils/data/datapipes/map/combining.py:58 in public class `ZipperMapDataPipe`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/map/combining.py:76 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/map/combining.py:85 in public method `__getitem__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/map/combining.py:94 in public method `__len__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/map/grouping.py:1 at module level:
D100: Missing docstring in public module
torch/utils/data/datapipes/map/grouping.py:12 in public class `BatcherMapDataPipe`:
D204: 1 blank line required after class docstring (found 0)
torch/utils/data/datapipes/map/grouping.py:12 in public class `BatcherMapDataPipe`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/map/grouping.py:12 in public class `BatcherMapDataPipe`:
D400: First line should end with a period (not 's')
torch/utils/data/datapipes/map/grouping.py:34 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/map/grouping.py:47 in public method `__getitem__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/map/grouping.py:60 in public method `__len__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/map/utils.py:1 at module level:
D100: Missing docstring in public module
torch/utils/data/datapipes/map/utils.py:9 in public class `SequenceWrapperMapDataPipe`:
D204: 1 blank line required after class docstring (found 0)
torch/utils/data/datapipes/map/utils.py:32 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/map/utils.py:45 in public method `__getitem__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/map/utils.py:48 in public method `__len__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/utils/common.py:1 at module level:
D100: Missing docstring in public module
torch/utils/data/datapipes/utils/common.py:26 in public function `validate_input_col`:
D400: First line should end with a period (not 'n')
torch/utils/data/datapipes/utils/common.py:26 in public function `validate_input_col`:
D401: First line should be in imperative mood (perhaps 'Check', not 'Checks')
torch/utils/data/datapipes/utils/common.py:127 in private function `_check_unpickable_fn`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/utils/common.py:127 in private function `_check_unpickable_fn`:
D400: First line should end with a period (not 'g')
torch/utils/data/datapipes/utils/common.py:127 in private function `_check_unpickable_fn`:
D401: First line should be in imperative mood (perhaps 'Check', not 'Checks')
torch/utils/data/datapipes/utils/common.py:156 in public function `match_masks`:
D103: Missing docstring in public function
torch/utils/data/datapipes/utils/common.py:170 in public function `get_file_pathnames_from_root`:
D103: Missing docstring in public function
torch/utils/data/datapipes/utils/common.py:207 in public function `get_file_binaries_from_pathnames`:
D103: Missing docstring in public function
torch/utils/data/datapipes/utils/common.py:220 in public function `validate_pathname_binary_tuple`:
D103: Missing docstring in public function
torch/utils/data/datapipes/utils/common.py:290 in public class `StreamWrapper`:
D204: 1 blank line required after class docstring (found 0)
torch/utils/data/datapipes/utils/common.py:290 in public class `StreamWrapper`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/utils/common.py:290 in public class `StreamWrapper`:
D400: First line should end with a period (not 'y')
torch/utils/data/datapipes/utils/common.py:298 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/utils/common.py:315 in public method `close_streams`:
D200: One-line docstring should fit on one line with quotes (found 3)
torch/utils/data/datapipes/utils/common.py:331 in public method `__getattr__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/utils/common.py:335 in public method `close`:
D102: Missing docstring in public method
torch/utils/data/datapipes/utils/common.py:351 in public method `autoclose`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/utils/common.py:351 in public method `autoclose`:
D400: First line should end with a period (not 's')
torch/utils/data/datapipes/utils/common.py:359 in public method `__dir__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/utils/common.py:364 in public method `__del__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/utils/common.py:368 in public method `__iter__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/utils/common.py:371 in public method `__next__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/utils/common.py:374 in public method `__repr__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/utils/common.py:380 in public method `__getstate__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/utils/common.py:383 in public method `__setstate__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/utils/decoder.py:1 at module level:
D100: Missing docstring in public module
torch/utils/data/datapipes/utils/decoder.py:31 in public function `basichandlers`:
D103: Missing docstring in public function
torch/utils/data/datapipes/utils/decoder.py:87 in public function `handle_extension`:
D202: No blank lines allowed after function docstring (found 1)
torch/utils/data/datapipes/utils/decoder.py:87 in public function `handle_extension`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/utils/decoder.py:87 in public function `handle_extension`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/utils/data/datapipes/utils/decoder.py:115 in public class `ImageHandler`:
D204: 1 blank line required after class docstring (found 0)
torch/utils/data/datapipes/utils/decoder.py:115 in public class `ImageHandler`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/utils/decoder.py:139 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/utils/decoder.py:143 in public method `__call__`:
D102: Missing docstring in public method
torch/utils/data/datapipes/utils/decoder.py:187 in public function `imagehandler`:
D103: Missing docstring in public function
torch/utils/data/datapipes/utils/decoder.py:194 in public function `videohandler`:
D103: Missing docstring in public function
torch/utils/data/datapipes/utils/decoder.py:215 in public function `audiohandler`:
D103: Missing docstring in public function
torch/utils/data/datapipes/utils/decoder.py:236 in public class `MatHandler`:
D101: Missing docstring in public class
torch/utils/data/datapipes/utils/decoder.py:237 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/utils/decoder.py:247 in public method `__call__`:
D102: Missing docstring in public method
torch/utils/data/datapipes/utils/decoder.py:253 in public function `mathandler`:
D103: Missing docstring in public function
torch/utils/data/datapipes/utils/decoder.py:261 in public function `extension_extract_fn`:
D103: Missing docstring in public function
torch/utils/data/datapipes/utils/decoder.py:270 in public class `Decoder`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/utils/decoder.py:276 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/utils/decoder.py:282 in public method `add_handler`:
D102: Missing docstring in public method
torch/utils/data/datapipes/utils/decoder.py:292 in public method `decode1`:
D102: Missing docstring in public method
torch/utils/data/datapipes/utils/decoder.py:309 in public method `decode`:
D102: Missing docstring in public method
torch/utils/data/datapipes/utils/decoder.py:326 in public method `__call__`:
D102: Missing docstring in public method
torch/utils/data/datapipes/utils/snapshot.py:1 at module level:
D100: Missing docstring in public module
torch/utils/data/datapipes/utils/snapshot.py:11 in private function `_simple_graph_snapshot_restoration`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/data/datapipes/utils/snapshot.py:11 in private function `_simple_graph_snapshot_restoration`:
D400: First line should end with a period (not ',')
torch/utils/data/datapipes/utils/snapshot.py:11 in private function `_simple_graph_snapshot_restoration`:
D401: First line should be in imperative mood; try rephrasing (found 'This')
torch/utils/tensorboard/_convert_np.py:1 at module level:
D200: One-line docstring should fit on one line with quotes (found 3)
torch/utils/tensorboard/_convert_np.py:9 in public function `make_np`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/tensorboard/_convert_np.py:9 in public function `make_np`:
D400: First line should end with a period (not ':')
265
```
After: 166
```
torch/utils/data/datapipes/dataframe/structures.py:1 at module level:
D100: Missing docstring in public module
torch/utils/data/datapipes/dataframe/structures.py:10 in public method `__iter__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/dataframe/structures.py:14 in public method `__len__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/datapipe.py:1 at module level:
D100: Missing docstring in public module
torch/utils/data/datapipes/datapipe.py:120 in public method `__iter__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/datapipe.py:123 in public method `__getattr__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/datapipe.py:136 in public method `register_function`:
D102: Missing docstring in public method
torch/utils/data/datapipes/datapipe.py:140 in public method `register_datapipe_as_function`:
D102: Missing docstring in public method
torch/utils/data/datapipes/datapipe.py:173 in public method `__reduce_ex__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/datapipe.py:182 in public method `set_getstate_hook`:
D102: Missing docstring in public method
torch/utils/data/datapipes/datapipe.py:188 in public method `set_reduce_ex_hook`:
D102: Missing docstring in public method
torch/utils/data/datapipes/datapipe.py:193 in public method `__repr__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/datapipe.py:199 in public method `__str__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/datapipe.py:205 in public method `__dir__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/datapipe.py:221 in public class `DFIterDataPipe`:
D101: Missing docstring in public class
torch/utils/data/datapipes/datapipe.py:266 in public method `__getattr__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/datapipe.py:279 in public method `register_function`:
D102: Missing docstring in public method
torch/utils/data/datapipes/datapipe.py:283 in public method `register_datapipe_as_function`:
D102: Missing docstring in public method
torch/utils/data/datapipes/datapipe.py:309 in public method `__reduce_ex__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/datapipe.py:318 in public method `set_getstate_hook`:
D102: Missing docstring in public method
torch/utils/data/datapipes/datapipe.py:324 in public method `set_reduce_ex_hook`:
D102: Missing docstring in public method
torch/utils/data/datapipes/datapipe.py:329 in public method `__repr__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/datapipe.py:335 in public method `__str__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/datapipe.py:341 in public method `__dir__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/datapipe.py:398 in public class `DataChunk`:
D101: Missing docstring in public class
torch/utils/data/datapipes/datapipe.py:399 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/datapipe.py:403 in public method `as_str`:
D102: Missing docstring in public method
torch/utils/data/datapipes/datapipe.py:407 in public method `__iter__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/datapipe.py:410 in public method `raw_iterator`:
D102: Missing docstring in public method
torch/utils/data/datapipes/iter/callable.py:1 at module level:
D100: Missing docstring in public module
torch/utils/data/datapipes/iter/callable.py:65 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/callable.py:123 in public method `__iter__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/callable.py:127 in public method `__len__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/callable.py:216 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/combinatorics.py:1 at module level:
D100: Missing docstring in public module
torch/utils/data/datapipes/iter/combinatorics.py:30 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/combinatorics.py:45 in public method `__iter__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combinatorics.py:48 in public method `__len__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combinatorics.py:97 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/combinatorics.py:117 in public method `set_shuffle`:
D102: Missing docstring in public method
torch/utils/data/datapipes/iter/combinatorics.py:121 in public method `set_seed`:
D102: Missing docstring in public method
torch/utils/data/datapipes/iter/combinatorics.py:125 in public method `__iter__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combinatorics.py:140 in public method `__len__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combinatorics.py:145 in public method `reset`:
D102: Missing docstring in public method
torch/utils/data/datapipes/iter/combinatorics.py:153 in public method `__getstate__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combinatorics.py:168 in public method `__setstate__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combinatorics.py:182 in public method `__del__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combining.py:1 at module level:
D100: Missing docstring in public module
torch/utils/data/datapipes/iter/combining.py:46 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/combining.py:53 in public method `__iter__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combining.py:57 in public method `__len__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combining.py:95 in public method `__new__`:
D102: Missing docstring in public method
torch/utils/data/datapipes/iter/combining.py:388 in public method `__new__`:
D102: Missing docstring in public method
torch/utils/data/datapipes/iter/combining.py:556 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/combining.py:560 in public method `__iter__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combining.py:573 in public method `__len__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combining.py:579 in public method `reset`:
D102: Missing docstring in public method
torch/utils/data/datapipes/iter/combining.py:582 in public method `__getstate__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combining.py:592 in public method `__setstate__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combining.py:600 in public method `__del__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combining.py:624 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/combining.py:631 in public method `__iter__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/combining.py:635 in public method `__len__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/filelister.py:1 at module level:
D100: Missing docstring in public module
torch/utils/data/datapipes/iter/filelister.py:37 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/filelister.py:59 in public method `__iter__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/filelister.py:63 in public method `__len__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/fileopener.py:1 at module level:
D100: Missing docstring in public module
torch/utils/data/datapipes/iter/fileopener.py:41 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/fileopener.py:65 in public method `__iter__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/fileopener.py:68 in public method `__len__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/grouping.py:1 at module level:
D100: Missing docstring in public module
torch/utils/data/datapipes/iter/grouping.py:57 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/grouping.py:70 in public method `__iter__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/grouping.py:81 in public method `__len__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/grouping.py:115 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/grouping.py:121 in public method `__iter__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/grouping.py:190 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/grouping.py:238 in public method `__iter__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/grouping.py:262 in public method `reset`:
D102: Missing docstring in public method
torch/utils/data/datapipes/iter/grouping.py:266 in public method `__getstate__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/grouping.py:283 in public method `__setstate__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/grouping.py:299 in public method `__del__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/routeddecoder.py:1 at module level:
D100: Missing docstring in public module
torch/utils/data/datapipes/iter/routeddecoder.py:38 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/routeddecoder.py:54 in public method `add_handler`:
D102: Missing docstring in public method
torch/utils/data/datapipes/iter/routeddecoder.py:57 in public method `__iter__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/routeddecoder.py:63 in public method `__len__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/selecting.py:1 at module level:
D100: Missing docstring in public module
torch/utils/data/datapipes/iter/selecting.py:47 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/selecting.py:71 in public method `__iter__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/sharding.py:1 at module level:
D100: Missing docstring in public module
torch/utils/data/datapipes/iter/sharding.py:17 in public class `SHARDING_PRIORITIES`:
D101: Missing docstring in public class
torch/utils/data/datapipes/iter/sharding.py:40 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/sharding.py:48 in public method `apply_sharding`:
D102: Missing docstring in public method
torch/utils/data/datapipes/iter/sharding.py:75 in public method `__iter__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/sharding.py:80 in public method `__len__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/streamreader.py:1 at module level:
D100: Missing docstring in public module
torch/utils/data/datapipes/iter/streamreader.py:29 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/streamreader.py:33 in public method `__iter__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/utils.py:1 at module level:
D100: Missing docstring in public module
torch/utils/data/datapipes/iter/utils.py:30 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/iter/utils.py:34 in public method `__iter__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/iter/utils.py:50 in public method `__len__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/map/callable.py:1 at module level:
D100: Missing docstring in public module
torch/utils/data/datapipes/map/callable.py:14 in public function `default_fn`:
D103: Missing docstring in public function
torch/utils/data/datapipes/map/callable.py:47 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/map/callable.py:57 in public method `__len__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/map/callable.py:60 in public method `__getitem__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/map/combinatorics.py:1 at module level:
D100: Missing docstring in public module
torch/utils/data/datapipes/map/combinatorics.py:56 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/map/combinatorics.py:69 in public method `set_shuffle`:
D102: Missing docstring in public method
torch/utils/data/datapipes/map/combinatorics.py:73 in public method `set_seed`:
D102: Missing docstring in public method
torch/utils/data/datapipes/map/combinatorics.py:77 in public method `__iter__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/map/combinatorics.py:86 in public method `reset`:
D102: Missing docstring in public method
torch/utils/data/datapipes/map/combinatorics.py:93 in public method `__len__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/map/combinatorics.py:96 in public method `__getstate__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/map/combinatorics.py:111 in public method `__setstate__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/map/combining.py:1 at module level:
D100: Missing docstring in public module
torch/utils/data/datapipes/map/combining.py:36 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/map/combining.py:45 in public method `__getitem__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/map/combining.py:54 in public method `__len__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/map/combining.py:80 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/map/combining.py:89 in public method `__getitem__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/map/combining.py:98 in public method `__len__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/map/grouping.py:1 at module level:
D100: Missing docstring in public module
torch/utils/data/datapipes/map/grouping.py:36 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/map/grouping.py:49 in public method `__getitem__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/map/grouping.py:62 in public method `__len__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/map/utils.py:1 at module level:
D100: Missing docstring in public module
torch/utils/data/datapipes/map/utils.py:33 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/map/utils.py:46 in public method `__getitem__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/map/utils.py:49 in public method `__len__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/utils/common.py:1 at module level:
D100: Missing docstring in public module
torch/utils/data/datapipes/utils/common.py:157 in public function `match_masks`:
D103: Missing docstring in public function
torch/utils/data/datapipes/utils/common.py:171 in public function `get_file_pathnames_from_root`:
D103: Missing docstring in public function
torch/utils/data/datapipes/utils/common.py:208 in public function `get_file_binaries_from_pathnames`:
D103: Missing docstring in public function
torch/utils/data/datapipes/utils/common.py:221 in public function `validate_pathname_binary_tuple`:
D103: Missing docstring in public function
torch/utils/data/datapipes/utils/common.py:300 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/utils/common.py:331 in public method `__getattr__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/utils/common.py:335 in public method `close`:
D102: Missing docstring in public method
torch/utils/data/datapipes/utils/common.py:356 in public method `__dir__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/utils/common.py:361 in public method `__del__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/utils/common.py:365 in public method `__iter__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/utils/common.py:368 in public method `__next__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/utils/common.py:371 in public method `__repr__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/utils/common.py:377 in public method `__getstate__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/utils/common.py:380 in public method `__setstate__`:
D105: Missing docstring in magic method
torch/utils/data/datapipes/utils/decoder.py:1 at module level:
D100: Missing docstring in public module
torch/utils/data/datapipes/utils/decoder.py:31 in public function `basichandlers`:
D103: Missing docstring in public function
torch/utils/data/datapipes/utils/decoder.py:141 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/utils/decoder.py:145 in public method `__call__`:
D102: Missing docstring in public method
torch/utils/data/datapipes/utils/decoder.py:189 in public function `imagehandler`:
D103: Missing docstring in public function
torch/utils/data/datapipes/utils/decoder.py:196 in public function `videohandler`:
D103: Missing docstring in public function
torch/utils/data/datapipes/utils/decoder.py:217 in public function `audiohandler`:
D103: Missing docstring in public function
torch/utils/data/datapipes/utils/decoder.py:238 in public class `MatHandler`:
D101: Missing docstring in public class
torch/utils/data/datapipes/utils/decoder.py:239 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/utils/decoder.py:249 in public method `__call__`:
D102: Missing docstring in public method
torch/utils/data/datapipes/utils/decoder.py:255 in public function `mathandler`:
D103: Missing docstring in public function
torch/utils/data/datapipes/utils/decoder.py:263 in public function `extension_extract_fn`:
D103: Missing docstring in public function
torch/utils/data/datapipes/utils/decoder.py:279 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/data/datapipes/utils/decoder.py:285 in public method `add_handler`:
D102: Missing docstring in public method
torch/utils/data/datapipes/utils/decoder.py:295 in public method `decode1`:
D102: Missing docstring in public method
torch/utils/data/datapipes/utils/decoder.py:312 in public method `decode`:
D102: Missing docstring in public method
torch/utils/data/datapipes/utils/decoder.py:329 in public method `__call__`:
D102: Missing docstring in public method
torch/utils/data/datapipes/utils/snapshot.py:1 at module level:
D100: Missing docstring in public module
166
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112765
Approved by: https://github.com/ejguan
Fix: #111506
This PR skips aliasing correction on `lift_fresh` calls. The reasoning: although unlifted and lifted tensors are technically aliases, they live at different levels of abstraction (`FunctionalTensorWrapper` and `XLATensor`).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112202
Approved by: https://github.com/bdhirsh
- Added `runner.sh` that does a sweep over `batch_size=(1, 32, 64, 128, 256)` and `compile=(True, False)`
- Added GPU utilization as a metric
- Converted the frontend from 2 processes (one putting requests into `request_queue`, one reading from `response_queue` and collecting metrics) to a single process with 3 threads: one putting requests into `request_queue`, one reading from `response_queue` and collecting metrics, and one polling `nvidia-smi` for GPU utilization (see the sketch below)
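A minimal sketch of such a polling thread (the `nvidia-smi` query flags are standard; the sampling interval and stop mechanism are illustrative assumptions):
```python
import subprocess
import threading
import time

def poll_gpu_utilization(samples, stop, interval=0.5):
    # Sample GPU utilization (%) via nvidia-smi until asked to stop.
    while not stop.is_set():
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=utilization.gpu",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
        samples.append(float(out.splitlines()[0]))
        time.sleep(interval)

samples, stop = [], threading.Event()
t = threading.Thread(target=poll_gpu_utilization, args=(samples, stop))
t.start()
# ... drive requests through request_queue / response_queue here ...
stop.set()
t.join()
```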
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112863
Approved by: https://github.com/albanD
ghstack dependencies: #112582
Summary:
Since nested tensors don't have `size()`, the profiler throws an exception in `saveExtraArgs()` when `with_flops` is turned on.
Supporting flop computation for nested tensors is tricky because they have dynamic shapes, so for now, skip the flop computation for nested tensors instead of throwing an exception (see the sketch below).
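The actual change is in the C++ `saveExtraArgs()`; below is a Python sketch of the guard pattern (the function name and message format are illustrative):
```python
from typing import Optional
import warnings
import torch

def try_save_shapes(op_name: str, inputs) -> Optional[list]:
    # Skip shape capture when any input is a nested tensor, since calling
    # .size() on a nested tensor raises; warn instead of throwing.
    for i, t in enumerate(inputs):
        if isinstance(t, torch.Tensor) and t.is_nested:
            warnings.warn(
                f"Failed to save extra arguments for flops computation of op "
                f"{op_name} with input[{i}] as nested tensor."
            )
            return None
    return [list(t.size()) for t in inputs if isinstance(t, torch.Tensor)]
```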
Test Plan:
Used profiler with NT, the log shows this warning instead of throwing.
```/torch/nested/_internal/nested_tensor.py:205: UserWarning: Failed to save extra arguments for flops computation of op aten::add with input[0] as nested tensor. (Triggered internally at fbcode/caffe2/torch/csrc/profiler/util.cpp:433.)```
Differential Revision: D50919789
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112767
Approved by: https://github.com/aaronenyeshi
By adding `.cast_common_dtype_to_outputs(true)` to `build_ternary_op`.
According to profiling, this change does not result in an additional kernel invocation on GPUs, i.e., the following script
```python
import torch

def bench_addcdiv(size=(32*1024**2, 5), device="cuda"):
    x = torch.rand(size, device=device, dtype=torch.float)
    y = torch.rand(size, device=device, dtype=torch.double)
    with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CUDA]) as prof:
        torch.addcdiv(x, x, x, out=y)
    rc = prof.key_averages()
    print(rc)

if __name__ == "__main__":
    bench_addcdiv()
```
shows that before and after the change it takes roughly the same time to finish the computation.
Before:
```
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
cudaLaunchKernel 92.99% 20.096ms 92.99% 20.096ms 20.096ms 0.000us 0.00% 0.000us 0.000us 1
void at::native::unrolled_elementwise_kernel<at::nat... 0.00% 0.000us 0.00% 0.000us 0.000us 1.605ms 100.00% 1.605ms 1.605ms 1
cudaDeviceSynchronize 7.01% 1.515ms 7.01% 1.515ms 1.515ms 0.000us 0.00% 0.000us 0.000us 1
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 21.611ms
Self CUDA time total: 1.605ms
```
After:
```
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
cudaLaunchKernel 92.92% 19.996ms 92.92% 19.996ms 19.996ms 0.000us 0.00% 0.000us 0.000us 1
void at::native::unrolled_elementwise_kernel<at::nat... 0.00% 0.000us 0.00% 0.000us 0.000us 1.603ms 100.00% 1.603ms 1.603ms 1
cudaDeviceSynchronize 7.08% 1.523ms 7.08% 1.523ms 1.523ms 0.000us 0.00% 0.000us 0.000us 1
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 21.519ms
Self CUDA time total: 1.603ms
```
Add regression test.
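A minimal sketch of the kind of check such a regression test can make (illustrative, not the exact test added here): with the common dtype cast to the outputs, `addcdiv` into a `double` output computes `x/x` in double, so the result equals `x.double() + 1` exactly.
```python
import torch

x = torch.rand(5, dtype=torch.float) + 0.5  # avoid zeros so x/x == 1 exactly
out = torch.empty(5, dtype=torch.double)
torch.addcdiv(x, x, x, out=out)  # input + value * tensor1 / tensor2, value=1
torch.testing.assert_close(out, x.double() + 1)
```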
Fixes https://github.com/pytorch/pytorch/issues/112490
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112682
Approved by: https://github.com/albanD
Fixes https://github.com/pytorch/pytorch/issues/112072
Grad mode mutations, which are the responsibility of aotautograd, need to be persisted outside of the graph as side-effects in the runtime wrapper.
To facilitate this, and to maintain global state hygiene, we restore the grad mode to its value prior to tracing, for both dynamo (alongside other global states) and aot_autograd.
This is in line with the assumption that aot_autograd should work as though it were called from eager, before the given GraphModule has been run.
It is assumed that other global state (autocast mode, torch function) already maintains hygiene via its context manager APIs.
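A sketch of the restore-around-tracing pattern being described (illustrative; the real logic lives in the dynamo and aot_autograd tracing entry points):
```python
import torch

def trace_with_grad_mode_hygiene(trace_fn, *args):
    # The traced function may mutate grad mode; that mutation is replayed
    # as a side effect in the runtime wrapper, not leaked into tracing.
    prior = torch.is_grad_enabled()
    try:
        return trace_fn(*args)
    finally:
        torch.set_grad_enabled(prior)
```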
---
### Future Work?
Should we also do this for:
1. autocast mode
2. torch_function_enabled
Answer: no. (at least at present)
As noted above, autocast mode and torch function state are assumed to maintain hygiene via their context manager APIs.
Furthermore, mutating this state directly is currently unsupported in dynamo, unlike `set_grad_enabled`.
Repro:
```python
import torch
def fn(x):
    x = x + 1
    torch.set_autocast_enabled(True)
    return x + 1
print(torch.compile(fn, fullgraph=True)(torch.zeros(1)))
# torch._dynamo.exc.Unsupported: call_method UserDefinedObjectVariable(set_autocast_enabled) __call__ [ConstantVariable(bool)] {}
```
```python
import torch
def fn(x):
    x = x + 1
    torch.overrides.BaseTorchFunctionMode.__enter__()
    return x + 1, torch._C._is_torch_function_enabled()
print(torch.compile(fn, fullgraph=True)(torch.zeros(1)))
# torch._dynamo.exc.Unsupported: 'call_function TorchFunctionMode.__enter__ in skip_files /home/jonch/Desktop/Programming/mlsys/pytorch/torch/overrides.py, skipped according skipfiles.SKIP_DIRS'
```
~~I believe 1. is clearly yes - even if it is a corner case (autocast only has ctx manager public API, while dynamo will always emit ctx manager exits before compiling the graph, so one needs to use the internal _enter_autocast API to directly perform a mutation).~~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112396
Approved by: https://github.com/bdhirsh
Fixes #112590
Fixed docstring errors in `torch/cuda/memory.py` and `torch/cuda/nvtx.py`.
memory.py
Before
```
torch/cuda/memory.py:1 at module level:
D100: Missing docstring in public module
torch/cuda/memory.py:67 in public function `caching_allocator_alloc`:
D401: First line should be in imperative mood (perhaps 'Perform', not 'Performs')
torch/cuda/memory.py:103 in public function `caching_allocator_delete`:
D401: First line should be in imperative mood (perhaps 'Delete', not 'Deletes')
torch/cuda/memory.py:122 in public function `set_per_process_memory_fraction`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:148 in public function `empty_cache`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:148 in public function `empty_cache`:
D400: First line should end with a period (not 'g')
torch/cuda/memory.py:163 in public function `memory_stats`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:163 in public function `memory_stats`:
D400: First line should end with a period (not 'a')
torch/cuda/memory.py:163 in public function `memory_stats`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/cuda/memory.py:264 in public function `memory_stats_as_nested_dict`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/cuda/memory.py:272 in public function `reset_accumulated_memory_stats`:
D401: First line should be in imperative mood (perhaps 'Reset', not 'Resets')
torch/cuda/memory.py:292 in public function `reset_peak_memory_stats`:
D401: First line should be in imperative mood (perhaps 'Reset', not 'Resets')
torch/cuda/memory.py:311 in public function `reset_max_memory_allocated`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:311 in public function `reset_max_memory_allocated`:
D400: First line should end with a period (not 'y')
torch/cuda/memory.py:311 in public function `reset_max_memory_allocated`:
D401: First line should be in imperative mood (perhaps 'Reset', not 'Resets')
torch/cuda/memory.py:338 in public function `reset_max_memory_cached`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:338 in public function `reset_max_memory_cached`:
D400: First line should end with a period (not 'e')
torch/cuda/memory.py:338 in public function `reset_max_memory_cached`:
D401: First line should be in imperative mood (perhaps 'Reset', not 'Resets')
torch/cuda/memory.py:365 in public function `memory_allocated`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:365 in public function `memory_allocated`:
D400: First line should end with a period (not 'n')
torch/cuda/memory.py:365 in public function `memory_allocated`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/cuda/memory.py:383 in public function `max_memory_allocated`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:383 in public function `max_memory_allocated`:
D400: First line should end with a period (not 'n')
torch/cuda/memory.py:383 in public function `max_memory_allocated`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/cuda/memory.py:405 in public function `memory_reserved`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:405 in public function `memory_reserved`:
D400: First line should end with a period (not 's')
torch/cuda/memory.py:405 in public function `memory_reserved`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/cuda/memory.py:421 in public function `max_memory_reserved`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:421 in public function `max_memory_reserved`:
D400: First line should end with a period (not 's')
torch/cuda/memory.py:421 in public function `max_memory_reserved`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/cuda/memory.py:443 in public function `memory_cached`:
D401: First line should be in imperative mood; try rephrasing (found 'Deprecated')
torch/cuda/memory.py:452 in public function `max_memory_cached`:
D401: First line should be in imperative mood; try rephrasing (found 'Deprecated')
torch/cuda/memory.py:461 in public function `memory_snapshot`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/cuda/memory.py:474 in public function `memory_summary`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:474 in public function `memory_summary`:
D400: First line should end with a period (not 'r')
torch/cuda/memory.py:474 in public function `memory_summary`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/cuda/memory.py:612 in public function `list_gpu_processes`:
D202: No blank lines allowed after function docstring (found 1)
torch/cuda/memory.py:612 in public function `list_gpu_processes`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:612 in public function `list_gpu_processes`:
D400: First line should end with a period (not 's')
torch/cuda/memory.py:612 in public function `list_gpu_processes`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/cuda/memory.py:648 in public function `mem_get_info`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:648 in public function `mem_get_info`:
D400: First line should end with a period (not 'n')
torch/cuda/memory.py:648 in public function `mem_get_info`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/cuda/memory.py:684 in private function `_record_memory_history`:
D202: No blank lines allowed after function docstring (found 1)
torch/cuda/memory.py:684 in private function `_record_memory_history`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:684 in private function `_record_memory_history`:
D400: First line should end with a period (not 'y')
torch/cuda/memory.py:684 in private function `_record_memory_history`:
D401: First line should be in imperative mood (perhaps 'Enable', not 'Enables')
torch/cuda/memory.py:742 in private function `_snapshot`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:742 in private function `_snapshot`:
D401: First line should be in imperative mood (perhaps 'Save', not 'Saves')
torch/cuda/memory.py:818 in private function `_dump_snapshot`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:818 in private function `_dump_snapshot`:
D401: First line should be in imperative mood (perhaps 'Save', not 'Saves')
torch/cuda/memory.py:849 in public function `get_allocator_backend`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:849 in public function `get_allocator_backend`:
D400: First line should end with a period (not 'y')
torch/cuda/memory.py:849 in public function `get_allocator_backend`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/cuda/memory.py:894 in public method `__init__`:
D107: Missing docstring in __init__
torch/cuda/memory.py:904 in public function `change_current_allocator`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:904 in public function `change_current_allocator`:
D401: First line should be in imperative mood (perhaps 'Change', not 'Changes')
torch/cuda/memory.py:917 in private function `_get_current_allocator`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
58
```
After
```
torch/cuda/memory.py:151 in public function `empty_cache`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:151 in public function `empty_cache`:
D400: First line should end with a period (not 'g')
torch/cuda/memory.py:439 in public function `memory_cached`:
D401: First line should be in imperative mood; try rephrasing (found 'Deprecated')
torch/cuda/memory.py:448 in public function `max_memory_cached`:
D401: First line should be in imperative mood; try rephrasing (found 'Deprecated')
torch/cuda/memory.py:676 in private function `_record_memory_history`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:676 in private function `_record_memory_history`:
D400: First line should end with a period (not 'y')
torch/cuda/memory.py:841 in public function `get_allocator_backend`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/memory.py:841 in public function `get_allocator_backend`:
D400: First line should end with a period (not 'y')
8
```
nvtx.py
Before
```
torch/cuda/nvtx.py:1 at module level:
D100: Missing docstring in public module
torch/cuda/nvtx.py:24 in public function `range_push`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/nvtx.py:24 in public function `range_push`:
D400: First line should end with a period (not 'd')
torch/cuda/nvtx.py:35 in public function `range_pop`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/nvtx.py:35 in public function `range_pop`:
D400: First line should end with a period (not 'e')
torch/cuda/nvtx.py:43 in public function `range_start`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/nvtx.py:43 in public function `range_start`:
D400: First line should end with a period (not 'e')
torch/cuda/nvtx.py:81 in public function `range`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/nvtx.py:81 in public function `range`:
D400: First line should end with a period (not 'g')
9
```
After
```
torch/cuda/nvtx.py:41 in public function `range_start`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/nvtx.py:41 in public function `range_start`:
D400: First line should end with a period (not 'e')
torch/cuda/nvtx.py:79 in public function `range`:
D205: 1 blank line required between summary line and description (found 0)
torch/cuda/nvtx.py:79 in public function `range`:
D400: First line should end with a period (not 'g')
4
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112751
Approved by: https://github.com/kit1980
Summary:
This patch prototypes a trace tracker callback mechanism based on existing TraceEntry records.
- It allows code external to the cache allocator to "attach" trace tracker callbacks.
- When a TraceEntry is recorded, it triggers all attached callbacks. Callbacks can selectively behave based on the trace action.
- **RISK**: The attached callback would be called within an allocator call stack (e.g., free during an allocate call). Potential deadlock may occur if other locks are acquired within the callback and have an interdependency with the device allocator lock. It is the callback developer's responsibility to avoid any potential deadlock.
- **ADVICE**: The callback mechanism is designed **only for Pytorch internal use**. We should not expose it to the Python layer, due to the Python GIL, which would cause a deadlock.
See example in D50726970 that attaches NCCL register/deregister hooks via the trace tracker callback, so that all CUDA segments allocated by the allocator can be registered to NCCL communicators before any NCCL communication happens. This enables fast zero copy algorithms in NCCL.
Differential Revision: D50726971
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112238
Approved by: https://github.com/zdevito
Adds E2E tests for saving/loading distributed checkpoints. Supported so far are:
- FSDP
- HSDP
- FSDP + TP
Each method is also tested using `torch.compile`
To run all tests:
`python test/distributed/checkpoint/e2e/test_e2e_save_and_load.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112541
Approved by: https://github.com/fegin, https://github.com/wz337
If code is compiled without `glog`, there is no way to control log levels other than by explicitly calling `c10::initLogging()`.
Test plan: Run `TORCH_CPP_LOG_LEVEL=0 ./bin/ProcessGroupNCCLTest` and observe extra log messages
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112809
Approved by: https://github.com/fduwjj
Summary:
If IS_FBCODE is False, then we print an OSS repro if a test fails. We do
set IS_FBCODE manually on most internal tests, but we don't do it for
all of them. This PR changes it so that IS_FBCODE gets set to the
correct default value (and tests are able to override it if
they'd like).
Test Plan:
- Tested locally
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112766
Approved by: https://github.com/williamwen42
1) This PR moves the grid function codegen to the wrapper so that we can use
`IndentedBuffer`s as opposed to manually adding tabs for indentation.
2) In inductor, emits the grid function in the body of the kernel call so
that it can use free symbols from dynamic shapes (see the sketch below).
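A sketch of emitting a grid function through an indent-aware buffer (assuming `IndentedBuffer` from `torch._inductor.utils`; the grid formula is illustrative):
```python
from torch._inductor.utils import IndentedBuffer

buf = IndentedBuffer()
buf.writeline("def grid(meta):")
with buf.indent():
    # Indentation is managed by the buffer rather than hand-placed tabs,
    # and free symbols like xnumel can be referenced directly.
    buf.writeline("return ((xnumel + meta['XBLOCK'] - 1) // meta['XBLOCK'],)")
print(buf.getvalue())
```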
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112523
Approved by: https://github.com/Chillee
Fixes #112596
Fix docstring errors in init.py
### Before the change -> 38 errors
```
╭─user@pc ~/Path/to/pytorch ‹fix/docstring_init›
╰─➤ pydocstyle torch/nn/init.py --count 127 ↵
torch/nn/init.py:1 at module level:
D100: Missing docstring in public module
torch/nn/init.py:68 in public function `calculate_gain`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/init.py:123 in public function `uniform_`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/init.py:123 in public function `uniform_`:
D400: First line should end with a period (not 'm')
torch/nn/init.py:123 in public function `uniform_`:
D401: First line should be in imperative mood (perhaps 'Fill', not 'Fills')
torch/nn/init.py:141 in public function `normal_`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/init.py:141 in public function `normal_`:
D400: First line should end with a period (not 'l')
torch/nn/init.py:141 in public function `normal_`:
D401: First line should be in imperative mood (perhaps 'Fill', not 'Fills')
torch/nn/init.py:165 in public function `trunc_normal_`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/init.py:165 in public function `trunc_normal_`:
D400: First line should end with a period (not 'd')
torch/nn/init.py:165 in public function `trunc_normal_`:
D401: First line should be in imperative mood (perhaps 'Fill', not 'Fills')
torch/nn/init.py:187 in public function `constant_`:
D401: First line should be in imperative mood (perhaps 'Fill', not 'Fills')
torch/nn/init.py:203 in public function `ones_`:
D401: First line should be in imperative mood (perhaps 'Fill', not 'Fills')
torch/nn/init.py:216 in public function `zeros_`:
D401: First line should be in imperative mood (perhaps 'Fill', not 'Fills')
torch/nn/init.py:229 in public function `eye_`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/init.py:229 in public function `eye_`:
D400: First line should end with a period (not 'y')
torch/nn/init.py:229 in public function `eye_`:
D401: First line should be in imperative mood (perhaps 'Fill', not 'Fills')
torch/nn/init.py:249 in public function `dirac_`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/init.py:249 in public function `dirac_`:
D400: First line should end with a period (not 'c')
torch/nn/init.py:249 in public function `dirac_`:
D401: First line should be in imperative mood (perhaps 'Fill', not 'Fills')
torch/nn/init.py:311 in public function `xavier_uniform_`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/init.py:311 in public function `xavier_uniform_`:
D400: First line should end with a period (not 'd')
torch/nn/init.py:311 in public function `xavier_uniform_`:
D401: First line should be in imperative mood (perhaps 'Fill', not 'Fills')
torch/nn/init.py:338 in public function `xavier_normal_`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/init.py:338 in public function `xavier_normal_`:
D400: First line should end with a period (not 'd')
torch/nn/init.py:338 in public function `xavier_normal_`:
D401: First line should be in imperative mood (perhaps 'Fill', not 'Fills')
torch/nn/init.py:376 in public function `kaiming_uniform_`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/init.py:376 in public function `kaiming_uniform_`:
D400: First line should end with a period (not 'd')
torch/nn/init.py:376 in public function `kaiming_uniform_`:
D401: First line should be in imperative mood (perhaps 'Fill', not 'Fills')
torch/nn/init.py:425 in public function `kaiming_normal_`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/init.py:425 in public function `kaiming_normal_`:
D400: First line should end with a period (not 'd')
torch/nn/init.py:425 in public function `kaiming_normal_`:
D401: First line should be in imperative mood (perhaps 'Fill', not 'Fills')
torch/nn/init.py:462 in public function `orthogonal_`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/init.py:462 in public function `orthogonal_`:
D400: First line should end with a period (not 's')
torch/nn/init.py:462 in public function `orthogonal_`:
D401: First line should be in imperative mood (perhaps 'Fill', not 'Fills')
torch/nn/init.py:507 in public function `sparse_`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/init.py:507 in public function `sparse_`:
D400: First line should end with a period (not 'e')
torch/nn/init.py:507 in public function `sparse_`:
D401: First line should be in imperative mood (perhaps 'Fill', not 'Fills')
38
```
### After the change -> 0 errors
```
╭─user@pc ~/Path/to/pytorch ‹fix/docstring_init*›
╰─➤ pydocstyle torch/nn/init.py --count
0
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112617
Approved by: https://github.com/mikaylagawarecki
### **Description**:
The problem is that the graph was cast to `fp32` at a certain point but never reverted to `fp16`, causing the rest of the graph to run in `fp32`. This change aims to fix that issue and improve performance.
### **Changes Made**:
- Modified the ONNX exporter code to ensure that the graph is correctly cast back to `fp16` after a necessary cast to `fp32` (see the sketch below).
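A rough sketch of the resulting node pattern using `onnx.helper` (the op choice and tensor names are illustrative):
```python
from onnx import TensorProto, helper

# An op that needs fp32 precision is bracketed by casts, so everything
# downstream of y_fp16 keeps running in fp16.
cast_up = helper.make_node("Cast", ["x_fp16"], ["x_fp32"], to=TensorProto.FLOAT)
compute = helper.make_node("Softmax", ["x_fp32"], ["y_fp32"], axis=-1)
cast_down = helper.make_node("Cast", ["y_fp32"], ["y_fp16"], to=TensorProto.FLOAT16)
```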
### **Why This Change is Necessary**:
This change is necessary to ensure that the exported ONNX graph remains in `fp16` where appropriate, leading to significant gains in performance and memory savings. Without this fix, the graph would run entirely in `fp32`, causing suboptimal performance.
### **Testing**:
- Performed extensive testing with various models and scenarios to validate the correctness of the changes.
### **Benchmarking Results**:
Experiments ran on:
8 GPUs - Tesla V100 - 32GB
**Before Fix: ort + 4 hidden layers + without fix**
- **Train Runtime**: 78.7088 seconds
- **Train Samples per Second**: 10.164
- **Train Steps per Second**: 1.271
- **Train Loss**: 5.624655108451844
- **Epoch**: 0.3
**After Fix: ort + 4 hidden layers + with fix**
- **Train Runtime**: 72.5636 seconds
- **Train Samples per Second**: 11.025
- **Train Steps per Second**: 1.378
- **Train Loss**: 5.6252727746963505
- **Epoch**: 0.3
We can see a 7.79% perf gain after this fix.
- I only ran it on 4 hidden layers due to GPU constraints; the perf gain is going to be much higher on the full model.
- You could see the gain on other models that use `_attention_scale` as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112554
Approved by: https://github.com/BowenBao, https://github.com/thiagocrepaldi
Previously, we added a change which required users to pass in a serialized name if they want to serialize a pytree, so that the serialized name does not depend on the Python environment. However, this is currently breaking AOTInductor benchmark tests, as AOTInductor serializes the pytree into the .so for flattening/unflattening the inputs. The registration for those pytree types in the AOTInductor benchmarks lives in the huggingface repo, so I'm not sure what a good fix is for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112748
Approved by: https://github.com/zhxchen17, https://github.com/malfet
This PR fixes #106294.
Due to the lack of a request validation mechanism, TCPStore in torch mistakenly treats nmap scan messages as valid query messages, which leads to DDP OOM. The simple solution enforces that the very first query from a client is a validation query carrying a predefined magic number; if validation fails, the server terminates the connection (see the sketch below).
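A minimal sketch of such a first-message handshake on the server side (the magic value, wire format, and helper name are illustrative, not TCPStore's actual protocol):
```python
import socket
import struct
from typing import Optional

MAGIC = 0x3C85F7CE  # illustrative value, not TCPStore's real magic number

def accept_validated(server: socket.socket) -> Optional[socket.socket]:
    # The first 4 bytes from a new client must be the magic number;
    # anything else (e.g. an nmap probe) gets the connection dropped.
    conn, _ = server.accept()
    data = conn.recv(4)
    if len(data) != 4 or struct.unpack("<I", data)[0] != MAGIC:
        conn.close()
        return None
    return conn
```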
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107607
Approved by: https://github.com/cbalioglu, https://github.com/XilunWu
Summary:
Zion-4s cores have poor perf when reading large tensors (e.g. 300GB), whether downloading from Manifold or reading from files. In this diff, I changed the `getRecord` function from single-threaded to multi-threaded by passing multiple readers to `getRecord` and accessing the same record at different chunks with different readers.
We control the number of additional readers with the `sigrid_model_manager_additional_reader` flag. The default value is 0. When `additional_reader=2`, we allocate `2` extra read client threads (see the sketch below).
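The change itself is in the C++ `getRecord`; here is a Python sketch of the chunked multi-reader idea (the reader interface with `seek`/`read` is an assumption for illustration):
```python
from concurrent.futures import ThreadPoolExecutor

def get_record_chunked(readers, offset, size):
    # Each reader covers one chunk of the same record, so a very large
    # read proceeds in parallel instead of on a single thread.
    n = len(readers)
    chunk = (size + n - 1) // n

    def read_chunk(i):
        start = offset + i * chunk
        readers[i].seek(start)
        return readers[i].read(min(chunk, offset + size - start))

    with ThreadPoolExecutor(max_workers=n) as pool:
        return b"".join(pool.map(read_chunk, range(n)))
```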
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111426
Approved by: https://github.com/jiayisuse
The default rendering of these code snippets uses typographic quotes for the `TORCH_CUDA_ARCH_LIST` values, which prevents the examples from being directly copyable. Use code style for the two extension examples.
Fixes #112763
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112764
Approved by: https://github.com/malfet
This creates a more consistent interface for saving and loading sharded state dicts. A planner can be specified when saving a sharded optimizer state dict, but there is currently no planner support for loading one. This change does not affect the default behavior of the function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112259
Approved by: https://github.com/wz337
Since https://github.com/pytorch/pytorch/pull/99699 introduced a dependency on nvml for oom reporting in `c10/cuda/driver_api.h`, `c10/cuda/driver_api.cpp`, and `reportProcessMemoryInfo` from `c10/cuda/CUDACachingAllocator.cpp`, we've seen failures regarding cuda expandable segments and oom reporting in NVIDIA's internal CI, specifically on Jetson devices, which don't have nvml support since nvml is incompatible with Jetson. Example failures using the latest upstream on an Orin AGX node:
`python test/test_cuda.py -k test_notifies_oom` generates
```
Traceback (most recent call last):
File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/usr/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/opt/pytorch/pytorch/test/test_cuda.py", line 1643, in _worker
results[t] = torch.nn.functional.conv2d(results[t], weight, padding=0)
RuntimeError: CUDA driver error: out of memory
```
`python test/test_cuda_expandable_segments.py` generates
```
Traceback (most recent call last):
File "/opt/pytorch/pytorch/test/test_cuda_expandable_segments.py", line 12, in <module>
exec(compile(open(filepath).read(), filepath, mode='exec'))
File "/opt/pytorch/pytorch/test/test_cuda.py", line 66, in <module>
class TestCuda(TestCase):
File "/opt/pytorch/pytorch/test/test_cuda.py", line 1609, in TestCuda
@unittest.skipIf(not TEST_CUDNN, 'CUDNN not available')
File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 4628, in wrapped
self._value = self._cb()
File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_cuda.py", line 20, in <lambda>
TEST_CUDNN = LazyVal(lambda: TEST_CUDA and torch.backends.cudnn.is_acceptable(torch.tensor(1., device=CUDA_DEVICE)))
RuntimeError: handle_0 INTERNAL ASSERT FAILED at "/opt/pytorch/pytorch/c10/cuda/driver_api.cpp":15, please report a bug to PyTorch.
```
This PR intends to fix this issue by adding various dlopen checks to make sure nvml actually exists, safely falling back to the older libcuda-based features of cuda expandable segments and oom reporting if nvml is not found (see the sketch below).
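The actual checks are C++ `dlopen` probes; below is a Python sketch of the probe-and-fall-back shape of the fix (function names and messages are illustrative):
```python
import ctypes

def nvml_available() -> bool:
    # Probe for the nvml shared library the way dlopen would; on Jetson
    # the load fails, so callers fall back to libcuda-only code paths.
    try:
        ctypes.CDLL("libnvidia-ml.so.1")
        return True
    except OSError:
        return False

def report_process_memory_info(device: int) -> str:
    if nvml_available():
        return f"device {device}: per-process usage via nvml (sketch)"
    return f"device {device}: total usage via cudaMemGetInfo only (sketch)"
```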
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112121
Approved by: https://github.com/eqy, https://github.com/ngimel, https://github.com/albanD
Allow heuristics to actually downgrade the relevance of a test. Note that NONE/UNLIKELY tests will still get executed, but they will be run at the end of the CI.
The Relevance chosen affects the outcome when Heuristics offer conflicting predictions. A relevance higher up in this list means higher confidence in the declared relevance:
HIGH > NONE > PROBABLE > UNLIKELY > UNRANKED
Given that we currently assume ordering based on the list in init (since the lists are appended), do a similar thing for UNLIKELY and NONE. For example, with `HEURISTICS = [a, b, c, d]`:
- currently, everything in `b.high` is added after `a.high`
- if `b.none` includes things in `a.high`, `a.high` trumps
- if `b.none` includes things in `a.probable`, `b.none` trumps, since NONE is stronger than PROBABLE
- if `b.unlikely` includes things from `a.high`/`a.probable`, `a.high`/`a.probable` trumps, since HIGH and PROBABLE are at a higher strength than UNLIKELY
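A minimal sketch of that confidence ordering (names are illustrative, not the actual heuristics code):
```python
from enum import IntEnum

class Relevance(IntEnum):
    # Higher value = higher confidence when heuristics conflict.
    UNRANKED = 0
    UNLIKELY = 1
    PROBABLE = 2
    NONE = 3
    HIGH = 4

def merge_predictions(predictions):
    # predictions: (test_name, Relevance) pairs in HEURISTICS order;
    # per test, the most confident relevance wins regardless of order.
    merged = {}
    for test, rel in predictions:
        if test not in merged or rel > merged[test]:
            merged[test] = rel
    return merged

print(merge_predictions([("test_foo", Relevance.PROBABLE),
                         ("test_foo", Relevance.NONE)]))  # NONE wins
```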
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112671
Approved by: https://github.com/clee2000
Fixes #112601
```
pydocstyle torch/nn/modules/module.py --count
```
On master:
115
After my changes on this PR:
8
The remaining 8 are due to missing docstrings in the magic methods:
```
torch/nn/modules/module.py:1 at module level:
D100: Missing docstring in public module
torch/nn/modules/module.py:1635 in public method `__getstate__`:
D105: Missing docstring in magic method
torch/nn/modules/module.py:1640 in public method `__setstate__`:
D105: Missing docstring in magic method
torch/nn/modules/module.py:1674 in public method `__getattr__`:
D105: Missing docstring in magic method
torch/nn/modules/module.py:1689 in public method `__setattr__`:
D105: Missing docstring in magic method
torch/nn/modules/module.py:1748 in public method `__delattr__`:
D105: Missing docstring in magic method
torch/nn/modules/module.py:2480 in public method `__repr__`:
D105: Missing docstring in magic method
torch/nn/modules/module.py:2505 in public method `__dir__`:
D105: Missing docstring in magic method
```
Should I add them too? Happy to do it, I just wasn't sure if you wanted these documented. Please let me know.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112674
Approved by: https://github.com/mikaylagawarecki
Summary:
When passed from C++ to Python, `c10d::ProcessGroup` and `c10d::Work` are automatically converted to their pybind class which can't be used for dispatcher ops. `.boxed()` exposes `c10d::ProcessGroup` and `c10d::Work` as boxed custom class object to Python.
```python
import tempfile

import torch
import torch.distributed as dist

if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as tmpf:
        dist.init_process_group(
            backend="nccl", init_method=f"file://{tmpf.name}", rank=0, world_size=1
        )
        group = dist.group.WORLD
        print(group)
        print(group.boxed())
```
<torch.distributed.distributed_c10d.ProcessGroup object at 0x7fe42fb78d30>
ScriptObject <__torch__.torch.classes.c10d.ProcessGroup>
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111997
Approved by: https://github.com/lw
Add a shortcut for a sequence of arrays only. This removes a graph break on a common pattern of
`np.array([np.cos(theta), np.sin(theta)])` and its ilk.
This PR is a simplified alternative to https://github.com/pytorch/pytorch/pull/112521 --- it still breaks on mixing arrays and scalars or array_likes (e.g. `np.array([[1, 2], np.array([3, 4])])`) and instead adds a simple shortcut.
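A minimal sketch of the pattern this unbreaks, assuming torch.compile's NumPy tracing; `fullgraph=True` makes any remaining graph break fail loudly:
```python
import numpy as np
import torch

@torch.compile(fullgraph=True)
def rotation(theta):
    # previously graph-broke: an ndarray built from a sequence of arrays
    return np.array([np.cos(theta), np.sin(theta)])

print(rotation(0.5))
```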
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112711
Approved by: https://github.com/lezcano
Prior to this PR, we only checked TorchScript nodes for scope compatibility, skipping their parents' scope reference check.
This PR adds a check not only for the node being traversed, but for its parents as well
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112654
Approved by: https://github.com/BowenBao
Previously a crash in PyTorch on power systems was fixed with #110708.
Even with the fix, the torch_test.py test throws the following error
for one of the tests.
"Error in cpuinfo: processor architecture is not supported in cpuinfo"
This is a follow up patch to fix this error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112707
Approved by: https://github.com/albanD
While reading the documentation of the optimizers I noticed the description of the `maximize` option is misleading. It currently reads as if the parameters would be maximized, which is factually incorrect. This PR proposes a clearer description.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112724
Approved by: https://github.com/albanD
Previously we added a change which required users to pass in a serialized name if they want to serialize a pytree, so that the serialized name does not depend on the python environment. However, this is currently breaking AOTInductor benchmark tests, as AOTInductor serializes the pytree into the .so for flattening/unflattening the inputs, while the registration for those pytree types in the AOTInductor benchmarks lives in the huggingface repo, so I'm not sure what a good fix is for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112748
Approved by: https://github.com/zhxchen17, https://github.com/malfet
This was originally @jansel's PR:
https://github.com/pytorch/pytorch/pull/102625, which I've built upon.
This diff implements static memory planning. It's disabled by default
while we examine its performance.
We use a greedy-by-size approach. For dynamic shapes, the sizes of the
example inputs are used as estimates when making planning decisions. We
generate expressions to calculate the actual memory offsets and sizes at
runtime when the values of the dynamic shapes are known. In order to
simplify these calculations, we have organized the allocations into a
tree that branches on space (address offsets) and time (live ranges).
Finally, we need to align these offsets, so we have added an `align`
sympy Expr to express these calculations.
Some limitations:
1. It is only enabled during inference for now. Enabling it for training
increases peak memory usage as we allocate all the memory needed for
training upfront, before freeing the memory allocated during
inference. We can probably address this by doing planning for both
the inference and training passes together.
2. It doesn't work with PyTorch Distributed, because kernels like
AllGatherIntoTensor codegen strings which do memory operations. We
can fix this down the line by having them emit MemoryPlanningLines
instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112178
Approved by: https://github.com/desertfire, https://github.com/jansel
Fixes #112604
Fixes docstrings by following `pydocstyle` output.
- torch/nn/parallel/distributed.py
Before: 84
```
torch/nn/parallel/distributed.py:1 at module level:
D100: Missing docstring in public module
torch/nn/parallel/distributed.py:92 in private function `_cast_buffers`:
D200: One-line docstring should fit on one line with quotes (found 3)
torch/nn/parallel/distributed.py:103 in private function `_setup_mixed_precision_params`:
D200: One-line docstring should fit on one line with quotes (found 3)
torch/nn/parallel/distributed.py:103 in private function `_setup_mixed_precision_params`:
D401: First line should be in imperative mood (perhaps 'Create', not 'Creates')
torch/nn/parallel/distributed.py:143 in private function `_find_tensors`:
D200: One-line docstring should fit on one line with quotes (found 3)
torch/nn/parallel/distributed.py:273 in private method `__init__`:
D200: One-line docstring should fit on one line with quotes (found 3)
torch/nn/parallel/distributed.py:273 in private method `__init__`:
D401: First line should be in imperative mood (perhaps 'Set', not 'Sets')
torch/nn/parallel/distributed.py:287 in private method `main_hook`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:287 in private method `main_hook`:
D400: First line should end with a period (not 'd')
torch/nn/parallel/distributed.py:324 in private method `post_hook`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:324 in private method `post_hook`:
D400: First line should end with a period (not 'l')
torch/nn/parallel/distributed.py:324 in private method `post_hook`:
D401: First line should be in imperative mood (perhaps 'Sync', not 'Syncs')
torch/nn/parallel/distributed.py:332 in public class `DistributedDataParallel`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:332 in public class `DistributedDataParallel`:
D400: First line should end with a period (not 'n')
torch/nn/parallel/distributed.py:633 in public method `__init__`:
D107: Missing docstring in __init__
torch/nn/parallel/distributed.py:960 in private method `_fire_reducer_autograd_hook`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:960 in private method `_fire_reducer_autograd_hook`:
D401: First line should be in imperative mood (perhaps 'Fire', not 'Fires')
torch/nn/parallel/distributed.py:969 in private method `_root_copy_hook`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:969 in private method `_root_copy_hook`:
D400: First line should end with a period (not 's')
torch/nn/parallel/distributed.py:1012 in private method `_module_wait_for_copy_hook`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:1012 in private method `_module_wait_for_copy_hook`:
D400: First line should end with a period (not 'e')
torch/nn/parallel/distributed.py:1050 in private method `_ddp_init_helper`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:1050 in private method `_ddp_init_helper`:
D400: First line should end with a period (not ':')
torch/nn/parallel/distributed.py:1050 in private method `_ddp_init_helper`:
D401: First line should be in imperative mood (perhaps 'Initialize', not 'Initialization')
torch/nn/parallel/distributed.py:1146 in public method `__getstate__`:
D105: Missing docstring in magic method
torch/nn/parallel/distributed.py:1154 in public method `__setstate__`:
D105: Missing docstring in magic method
torch/nn/parallel/distributed.py:1222 in private method `_assign_modules_buffers`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:1222 in private method `_assign_modules_buffers`:
D400: First line should end with a period (not 'o')
torch/nn/parallel/distributed.py:1222 in private method `_assign_modules_buffers`:
D401: First line should be in imperative mood (perhaps 'Assign', not 'Assigns')
torch/nn/parallel/distributed.py:1277 in private method `_get_parameters`:
D200: One-line docstring should fit on one line with quotes (found 3)
torch/nn/parallel/distributed.py:1277 in private method `_get_parameters`:
D400: First line should end with a period (not 's')
torch/nn/parallel/distributed.py:1277 in private method `_get_parameters`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/nn/parallel/distributed.py:1312 in public method `no_sync`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:1312 in public method `no_sync`:
D400: First line should end with a period (not 'P')
torch/nn/parallel/distributed.py:1312 in public method `no_sync`:
D401: First line should be in imperative mood; try rephrasing (found 'A')
torch/nn/parallel/distributed.py:1340 in private method `_get_active_ddp_module`:
D200: One-line docstring should fit on one line with quotes (found 3)
torch/nn/parallel/distributed.py:1340 in private method `_get_active_ddp_module`:
D403: First word of the first line should be properly capitalized ('Torchdynamo', not 'TorchDynamo')
torch/nn/parallel/distributed.py:1517 in public method `forward`:
D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1527 in public method `scatter`:
D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1530 in public method `to_kwargs`:
D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1539 in public method `gather`:
D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1542 in public method `train`:
D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1617 in public method `join`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:1617 in public method `join`:
D400: First line should end with a period (not 'f')
torch/nn/parallel/distributed.py:1617 in public method `join`:
D401: First line should be in imperative mood; try rephrasing (found 'A')
torch/nn/parallel/distributed.py:1723 in public method `join_hook`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:1723 in public method `join_hook`:
D400: First line should end with a period (not 'y')
torch/nn/parallel/distributed.py:1723 in public method `join_hook`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/nn/parallel/distributed.py:1752 in public method `join_device`:
D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1756 in public method `join_process_group`:
D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1765 in private method `_register_buffer_comm_hook`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:1765 in private method `_register_buffer_comm_hook`:
D400: First line should end with a period (not 'e')
torch/nn/parallel/distributed.py:1765 in private method `_register_buffer_comm_hook`:
D401: First line should be in imperative mood (perhaps 'Allow', not 'Allows')
torch/nn/parallel/distributed.py:1805 in public method `register_comm_hook`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:1805 in public method `register_comm_hook`:
D400: First line should end with a period (not 'a')
torch/nn/parallel/distributed.py:1805 in public method `register_comm_hook`:
D401: First line should be in imperative mood (perhaps 'Register', not 'Registers')
torch/nn/parallel/distributed.py:1887 in private method `_register_builtin_comm_hook`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:1887 in private method `_register_builtin_comm_hook`:
D400: First line should end with a period (not 'P')
torch/nn/parallel/distributed.py:1887 in private method `_register_builtin_comm_hook`:
D401: First line should be in imperative mood (perhaps 'Register', not 'Registers')
torch/nn/parallel/distributed.py:1914 in private method `_register_fused_optim`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:1914 in private method `_register_fused_optim`:
D400: First line should end with a period (not 'a')
torch/nn/parallel/distributed.py:1914 in private method `_register_fused_optim`:
D401: First line should be in imperative mood (perhaps 'Register', not 'Registers')
torch/nn/parallel/distributed.py:2005 in public method `will_sync_module_buffers`:
D102: Missing docstring in public method
torch/nn/parallel/distributed.py:2060 in private method `_default_broadcast_coalesced`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:2060 in private method `_default_broadcast_coalesced`:
D400: First line should end with a period (not 'e')
torch/nn/parallel/distributed.py:2128 in private method `_get_data_parallel_params`:
D200: One-line docstring should fit on one line with quotes (found 3)
torch/nn/parallel/distributed.py:2128 in private method `_get_data_parallel_params`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/nn/parallel/distributed.py:2141 in private method `_set_params_and_buffers_to_ignore_for_model`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:2141 in private method `_set_params_and_buffers_to_ignore_for_model`:
D400: First line should end with a period (not 'r')
torch/nn/parallel/distributed.py:2141 in private method `_set_params_and_buffers_to_ignore_for_model`:
D401: First line should be in imperative mood (perhaps 'Set', not 'Sets')
torch/nn/parallel/distributed.py:2170 in private method `_get_ddp_logging_data`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:2170 in private method `_get_ddp_logging_data`:
D400: First line should end with a period (not 's')
torch/nn/parallel/distributed.py:2170 in private method `_get_ddp_logging_data`:
D401: First line should be in imperative mood; try rephrasing (found 'This')
torch/nn/parallel/distributed.py:2184 in private method `_set_ddp_runtime_logging_sample_rate`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:2184 in private method `_set_ddp_runtime_logging_sample_rate`:
D400: First line should end with a period (not 'g')
torch/nn/parallel/distributed.py:2184 in private method `_set_ddp_runtime_logging_sample_rate`:
D401: First line should be in imperative mood; try rephrasing (found 'This')
torch/nn/parallel/distributed.py:2202 in private method `_set_static_graph`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:2202 in private method `_set_static_graph`:
D400: First line should end with a period (not 'l')
torch/nn/parallel/distributed.py:2202 in private method `_set_static_graph`:
D401: First line should be in imperative mood; try rephrasing (found 'It')
torch/nn/parallel/distributed.py:2227 in private method `_remove_autograd_hooks`:
D200: One-line docstring should fit on one line with quotes (found 3)
torch/nn/parallel/distributed.py:2227 in private method `_remove_autograd_hooks`:
D401: First line should be in imperative mood (perhaps 'Remove', not 'Removes')
torch/nn/parallel/distributed.py:2233 in private method `_check_reducer_finalized`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:2233 in private method `_check_reducer_finalized`:
D400: First line should end with a period (not 'd')
torch/nn/parallel/distributed.py:2233 in private method `_check_reducer_finalized`:
D401: First line should be in imperative mood (perhaps 'Check', not 'Checks')
84
```
After: 12
```
torch/nn/parallel/distributed.py:1 at module level:
D100: Missing docstring in public module
torch/nn/parallel/distributed.py:618 in public method `__init__`:
D107: Missing docstring in __init__
torch/nn/parallel/distributed.py:1133 in public method `__getstate__`:
D105: Missing docstring in magic method
torch/nn/parallel/distributed.py:1141 in public method `__setstate__`:
D105: Missing docstring in magic method
torch/nn/parallel/distributed.py:1503 in public method `forward`:
D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1513 in public method `scatter`:
D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1516 in public method `to_kwargs`:
D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1525 in public method `gather`:
D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1528 in public method `train`:
D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1734 in public method `join_device`:
D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1738 in public method `join_process_group`:
D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1986 in public method `will_sync_module_buffers`:
D102: Missing docstring in public method
12
```
- torch/nn/utils/_named_member_accessor.py
Before: 23
```
torch/nn/utils/_named_member_accessor.py:12 in public function `set_tensor`:
D103: Missing docstring in public function
torch/nn/utils/_named_member_accessor.py:29 in public function `swap_tensor`:
D103: Missing docstring in public function
torch/nn/utils/_named_member_accessor.py:85 in public function `swap_submodule`:
D103: Missing docstring in public function
torch/nn/utils/_named_member_accessor.py:109 in public class `NamedMemberAccessor`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_named_member_accessor.py:109 in public class `NamedMemberAccessor`:
D400: First line should end with a period (not 's')
torch/nn/utils/_named_member_accessor.py:115 in public method `__init__`:
D107: Missing docstring in __init__
torch/nn/utils/_named_member_accessor.py:122 in public method `get_submodule`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_named_member_accessor.py:155 in public method `swap_submodule`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_named_member_accessor.py:164 in public method `get_tensor`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_named_member_accessor.py:185 in public method `set_tensor`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_named_member_accessor.py:194 in public method `del_tensor`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_named_member_accessor.py:211 in public method `swap_tensor`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_named_member_accessor.py:224 in public method `get_tensors`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_named_member_accessor.py:233 in public method `set_tensors`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_named_member_accessor.py:249 in public method `set_tensors_dict`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_named_member_accessor.py:261 in public method `del_tensors`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_named_member_accessor.py:276 in public method `swap_tensors`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_named_member_accessor.py:296 in public method `swap_tensors_dict`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_named_member_accessor.py:325 in public method `check_keys`:
D200: One-line docstring should fit on one line with quotes (found 3)
torch/nn/utils/_named_member_accessor.py:340 in public method `named_parameters`:
D200: One-line docstring should fit on one line with quotes (found 3)
torch/nn/utils/_named_member_accessor.py:349 in public method `named_buffers`:
D200: One-line docstring should fit on one line with quotes (found 3)
torch/nn/utils/_named_member_accessor.py:358 in public method `named_tensors`:
D200: One-line docstring should fit on one line with quotes (found 3)
torch/nn/utils/_named_member_accessor.py:368 in public method `named_modules`:
D200: One-line docstring should fit on one line with quotes (found 3)
23
```
After: 4
```
torch/nn/utils/_named_member_accessor.py:12 in public function `set_tensor`:
D103: Missing docstring in public function
torch/nn/utils/_named_member_accessor.py:29 in public function `swap_tensor`:
D103: Missing docstring in public function
torch/nn/utils/_named_member_accessor.py:85 in public function `swap_submodule`:
D103: Missing docstring in public function
torch/nn/utils/_named_member_accessor.py:116 in public method `__init__`:
D107: Missing docstring in __init__
4
```
- torch/nn/utils/_per_sample_grad.py
Before: 3
```
torch/nn/utils/_per_sample_grad.py:12 in public function `call_for_per_sample_grads`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_per_sample_grad.py:12 in public function `call_for_per_sample_grads`:
D400: First line should end with a period (not ')')
torch/nn/utils/_per_sample_grad.py:12 in public function `call_for_per_sample_grads`:
D402: First line should not be the function's "signature"
3
```
After: 0
```
0
```
- torch/nn/utils/init.py
Before: 3
```
torch/nn/utils/init.py:1 at module level:
D100: Missing docstring in public module
torch/nn/utils/init.py:6 in public function `skip_init`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/init.py:6 in public function `skip_init`:
D400: First line should end with a period (not 'g')
3
```
After: 1
```
torch/nn/utils/init.py:1 at module level:
D100: Missing docstring in public module
1
```
- torch/nn/utils/memory_format.py
Before: 4
```
torch/nn/utils/memory_format.py:1 at module level:
D100: Missing docstring in public module
torch/nn/utils/memory_format.py:5 in public function `convert_conv2d_weight_memory_format`:
D202: No blank lines allowed after function docstring (found 1)
torch/nn/utils/memory_format.py:5 in public function `convert_conv2d_weight_memory_format`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/memory_format.py:5 in public function `convert_conv2d_weight_memory_format`:
D400: First line should end with a period (not '`')
4
```
After: 1
```
torch/nn/utils/memory_format.py:1 at module level:
D100: Missing docstring in public module
1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112657
Approved by: https://github.com/fduwjj
This PR is split out of https://github.com/pytorch/pytorch/pull/108193 . It adds the ability to insert an assertion after each triton kernel call to make sure all tensor arguments are not nan/inf. It helped me find a few bugs when working on benchmark fusion (due to messing up some kernel/graph level states when generating kernel code).
Right now we have to disable cudagraphs to enable the nan/inf checks. Otherwise we will see errors like: https://gist.github.com/shunting314/053db66c4f121e5f4c5de159bf0032ed . My best guess is it's due to GPU->CPU copy during capturing for cudagraphs. cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @eellison if there is easy way to make it work with cudagraphs. But even if the nan-checker is not compatible with cudagraphs, it's probably still fine since it's just for debugging purpose.
Test command:
```
TORCHINDUCTOR_BENCHMARK_KERNEL=1 TORCHINDUCTOR_NAN_ASSERTS=1 python benchmarks/dynamo/huggingface.py --backend inductor --amp --performance --only BertForMaskedLM --training --disable-cudagraphs
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112091
Approved by: https://github.com/eellison, https://github.com/jansel
This PR introduces the ability to produce GraphModules with Core ATen IR only through decompositions. It also removes `None` from user inputs, as ONNX does not support them.
Tests for these features will be executed when #112289 is merged, but for reference, they are as below:
```python
def test_log_sigmoid(self):
    # This produces op as `torch.ops.aten.log_sigmoid_forward`, instead of the more
    # conventional `torch.ops.aten.log_sigmoid`.
    class Model(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.m = torch.nn.LogSigmoid()

        def forward(self, x):
            return self.m(x)

    input = torch.randn(2)
    self.run_test_with_fx_to_onnx_exporter_and_onnx_runtime(
        Model(), (input,), model_type=self.model_type
    )

def test_none_input(self):
    class NoneInputModel(torch.nn.Module):
        def forward(
            self, x: torch.Tensor, y: Optional[torch.Tensor], z: torch.Tensor
        ):
            if y is None:
                return x + z
            return x + y + z

    self.run_test_with_fx_to_onnx_exporter_and_onnx_runtime(
        NoneInputModel(),
        (torch.randn(1, 2), None, torch.randn(1, 2)),
        model_type=self.model_type,
    )
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112444
Approved by: https://github.com/BowenBao
triton_meta is intended to be passed directly to triton. Previously we were also putting other metadata into triton_meta, but we should split out the other metadata into a separate dict to avoid possible conflicts in the future.
This PR splits out triton_meta and inductor_meta so we have a place to put additional metadata that isn't intended to be passed to triton.
Tests - wait for CI
Differential Revision: [D50864493](https://our.internmc.facebook.com/intern/diff/D50864493)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112351
Approved by: https://github.com/eellison
Custom classes that are serialized with pytree are serialized by default with `f"{class.__module__}.{class.__name__}"`. This is a dependency from our serialized program directly into the outer Python environment. If a user moves the class to a different directory, the serialized program will be unable to be loaded. So, we will require users to pass in an FQN if they want to serialize their custom treespec type.
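For illustration, a hedged sketch of what such a registration could look like; `Point` and the FQN string are hypothetical, and the exact keyword name is an assumption based on this PR's description:
```python
import torch.utils._pytree as pytree

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

pytree._register_pytree_node(
    Point,
    lambda p: ((p.x, p.y), None),           # flatten
    lambda children, _: Point(*children),   # unflatten
    serialized_type_name="mylib.Point",     # assumed: user-chosen stable FQN
)
```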
Differential Revision: [D50886366](https://our.internmc.facebook.com/intern/diff/D50886366)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112428
Approved by: https://github.com/suo
To do this, there is a little detour to remove hint caching for unbacked
SymInts; now, we just always attempt to update the hint (using
maybe_evaluate_static; this is much better than the replace we were
doing before) if we don't think we know it.
With this change, we now can generally infer that i0 == 1 is false for
a size-like unbacked SymInt. So if we write the size match /
broadcasting test very carefully (see comment), we will eventually
end up expect_true(sizeA == sizeB), which is good enough to cause
refinement. Phew!
I think I still want to setup a replacement if you do i0 == s0, but I'm
going to do that in a follow up.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112155
Approved by: https://github.com/aakhundov, https://github.com/voznesenskym
Summary:
See internal diff for more changes. Whenever we encounter a non-compliant op,
we add it to a set on the OutputGraph. When a compilation event happens, we log
the contents of this set.
I'm planning on flipping the `only_allow_pt2_compliant_ops` config from False
to True after the logging determines that existing models do not use
non-compliant ops.
Test Plan: - Tested the logging internally locally
Differential Revision: D50884828
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112581
Approved by: https://github.com/yanboliang
split_cat fx passes expect the `example_value` metadata on every node. However, the graph module from _export_torch_ir does not contain this metadata, causing the split_cat fx passes to not run. So, I added a pass to add this metadata to every node in the graph.
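A hedged sketch of the general idea (not necessarily the exact pass added here): run the graph once with example inputs via an fx Interpreter and stash each node's value as `example_value`:
```python
import torch
from torch import fx

class ExampleValueProp(fx.Interpreter):
    def run_node(self, n):
        result = super().run_node(n)
        n.meta["example_value"] = result  # metadata the split_cat passes expect
        return result

gm = fx.symbolic_trace(torch.nn.Linear(3, 3))
ExampleValueProp(gm).run(torch.randn(2, 3))
```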
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112415
Approved by: https://github.com/frank-wei
Summary:
In order to make sure that quantization_tag is preserved through second
stage export, this PR adds it as a special metadata that should be
preserved.
Since quantization in export path will work on top of pre dispatch
graph, subsequent post dispatch op decomposition, will decompose ops
that quant workflow tagged. In order to make sure that the patterns
identified by quantizer, remains identifiable, even after decompositions
are applied, we must preserve "quantization_tag".
This enables backend delegates, that quantized a model for specific
backend, to be able to identify "quantized" patterns.
Test Plan:
metadata porting tests
Differential Revision: [D49056259](https://our.internmc.facebook.com/intern/diff/D49056259)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108764
Approved by: https://github.com/tugsbayasgalan, https://github.com/jerryzh168
Fixes #112169
This PR follows voz's idea of disabling renaming inside a higher order operator's body to avoid the confusion between renaming and mutating, where we'd like to allow renames and forbid mutation. Specifically, the confusion arises because a rename creates a new variable tracker and calls replace_all for MutableLocal. We would either have to (1) look at the fields of the variable tracker to determine whether it's just a name change, (2) pass some information into replace_all telling it it's a rename op so it doesn't check for side effects, or (3) make rename mutate the user_code_variable_name on the variable tracker (note: we've been doing this for MutableSideEffects). All of these approaches seem undesirable.
We end up disabling renaming while dynamo is speculating inside a higher order operator's body.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112537
Approved by: https://github.com/zou3519
Summary:
Update the submodule of the Kineto project. Includes the following changes:
- Fix HAS_CUPTI macro uses
- Added error condition count tracking and prints
- Collect more info on cudaEventRecord for stream wait sync events
- Fix CUDA 11.7 support for new cudaLaunchKernelExC
- Fix newlines in error info logging causing broken JSON
- Kineto sample programs fixed and updated
- ROCm lib path fixed
- Clear rocTracer cached data that was causing memory leaks
- Fix int overflow in counter of activities
- Populate collective metadata from CPU op to GPU kernels
- Updated TEARDOWN_CUPTI to check value is 1
Test Plan: CI
Differential Revision: D50861994
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112540
Approved by: https://github.com/davidberard98
`ShapeEnv` has tons of functionality that is conditioned on this
`translation_validation_enabled()` check, to the point where 8% of
time in `empty_strided` is spent just in that function.
However, it doesn't really make sense for the value of
`translation_validation_enabled()` to change throughout the life of a `ShapeEnv`
so we might as well run the check once and store it in the `ShapeEnv`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112493
Approved by: https://github.com/lezcano
ghstack dependencies: #112418
This function repeatedly flattens and unflattens the `args, kwargs` pair so we
get a quite significant perf improvement from saving the `flat_args` and
operating directly on those. I see a 15% improvement in dispatch for
`empty_strided`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112418
Approved by: https://github.com/lezcano
In dynamo/inductor, it sometimes helps to gather metrics/statistics for each model at different levels: model level, graph level, kernel level, or pair-of-fusion-nodes level. This kind of thing is very easy to do with Scuba, but we only have Scuba in fbcode. This PR builds metric tables to solve part of the problem.
Q: why not log to stdout/stderr directly?
A: sometimes we need more structured data. E.g., it would be helpful to gather all the stats in a CSV and then do post-processing (like calculating a geomean etc.). Also, the metric table tags each row with the model name, which is helpful.
Q: what's the difference with speedup_inductor.csv?
A: speedup_inductor.csv is a special case that gathers statistics at the model level: i.e., we have one row for each model. But recording statistics at a finer-grained level like per-graph is also helpful.
Example use cases:
- As a follow-up to the benchmark fusion PR, I want to gather all the 'slow' fusions and analyze them. With the metric table, I can easily log slow fusions for each model into a CSV file. Here is the log gathered for huggingface:
https://gist.github.com/shunting314/964e73cc98368b301414ec7b7ad4c702 .
- To help understand the effect of the 'loop ordering after fusion' PR, it would be helpful to gather stats like how many fusions happen for each graph. Previously we logged these metrics to stderr directly, but logging them in a structured way is more useful.
- Gather the number of registers, register spills, and shared memory usage for each kernel in each model, with runnable kernel code logged.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109245
Approved by: https://github.com/jansel, https://github.com/mlazos
This PR comprises a few small contributions:
1. `PowerTransform` returned a sign of `+1` irrespective of exponent. However, it should return the sign of the exponent because the gradient has the same sign as the exponent. That issue has been fixed.
2. Added tests to catch errors akin to 1. in the future.
3. Added an `InverseGamma` distribution as a `TransformedDistribution` with `PowerTransform(-1)` and `Gamma` base distribution. The `InverseGamma` is often used as a prior for the length scale of Gaussian processes to aggressively suppress short length scales (see [here](https://betanalpha.github.io/assets/case_studies/gaussian_processes.html#323_Informative_Prior_Model) for a discussion).
Note: I added a `positive` constraint for the support of the inverse gamma distribution because the `PowerTransform(-1)` can fail for `nonnegative` constraints if the random variable is zero.
```python
>>> torch.distributions.InverseGamma(0.5, 1.0).log_prob(torch.zeros(1))
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-8-758aa22deacd> in <module>
----> 1 torch.distributions.InverseGamma(0.5, 1.0).log_prob(torch.zeros(1))
~/git/pytorch/torch/distributions/transformed_distribution.py in log_prob(self, value)
140 """
141 if self._validate_args:
--> 142 self._validate_sample(value)
143 event_dim = len(self.event_shape)
144 log_prob = 0.0
~/git/pytorch/torch/distributions/distribution.py in _validate_sample(self, value)
298 valid = support.check(value)
299 if not valid.all():
--> 300 raise ValueError(
301 "Expected value argument "
302 f"({type(value).__name__} of shape {tuple(value.shape)}) "
ValueError: Expected value argument (Tensor of shape (1,)) to be within the support (GreaterThan(lower_bound=0.0)) of the distribution InverseGamma(), but found invalid values:
tensor([0.])
```
This differs from the scipy implementation.
```python
>>> scipy.stats.invgamma(0.5).pdf(0)
0.0
```
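A small usage sketch of the new distribution (parameter values are illustrative):
```python
import torch

# InverseGamma added in this PR: PowerTransform(-1) over a Gamma base
d = torch.distributions.InverseGamma(concentration=3.0, rate=1.0)
x = d.sample((5,))
print(x.min() > 0)    # support is strictly positive
print(d.log_prob(x))  # finite for positive samples
```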
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104501
Approved by: https://github.com/fritzo, https://github.com/ezyang
Since PyTorch does not have readonly tensors, compiling code with readonly numpy arrays warns about possible UB. Thus detect readonly arrays, flip them to be writeable and clone the resulting tensor.
BTW, this is a break away from numpy semantics: the resulting array is writeable.
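A hedged repro sketch of the scenario this handles, assuming torch.compile's NumPy interop:
```python
import numpy as np
import torch

arr = np.arange(6.0)
arr.setflags(write=False)  # a readonly array, e.g. memory-mapped or borrowed data

@torch.compile
def f(x):
    return x * 2

# previously warned about possible UB; now the readonly input is detected and copied
print(f(arr))
```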
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112524
Approved by: https://github.com/lezcano
Printing just the device name is not helpful when investigating PyTorch issues filed for specific AMD GPUs, as the support/issue might depend on the gfx arch, which is part of the gcnArchName property.
`torch.cuda.get_device_properties(0).gcnArchName` will print the value of the `gcnArchName` property: eg.
```
>>> torch.cuda.get_device_properties(0).gcnArchName
'gfx906:sramecc+:xnack-'
```
```
root@6f064e3c19fb:/data/pytorch/test# python ../torch/utils/collect_env.py
...
GPU models and configuration: AMD Radeon Graphics(gfx906:sramecc+:xnack-)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107477
Approved by: https://github.com/albanD
Summary:
Allow doing pattern replacement with just an fx.Graph instead of an fx.GraphModule,
which can let callers avoid paying the cost of `recompile()` for a small graph if they
don't need the module.
This is a significant speedup if you use hundreds of small patterns for replacement.
Test Plan: Tested in a diff stacked on top of this: {D50756722}
Reviewed By: SherlockNoMad, angelayi
Differential Revision: D50756723
@diff-train-skip-merge
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112409
Approved by: https://github.com/ZainRizvi
Fixes https://github.com/pytorch/pytorch/issues/112449
elementwise_type_promotion_wrapper will promote `aten.normal` to the dtypes of `mean`, `std` args.
But this is incorrect if we provide the dtype param. Hence, we allow overriding the result_dtype if a specified dtype arg is available.
This problem is unique to `aten.normal`; all other decorated ops do not have a dtype param.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112467
Approved by: https://github.com/lezcano
Summary:
As of https://github.com/pytorch/pytorch/pull/103192, dynamo
supports code that creates OrderedDict instances using kwargs
for the key-value pairs rather than passing a dict literal.
But custom dicts (for example subclasses of OrderedDict) follow
a different codepath so that we can check for conditions such
as a custom `__init__` that need to force a graph break.
This commit allows kwargs for custom dict constructors - if the
args are empty and the class is not also a dataclass (which is
the case that, for example, a
`transformers.modeling_outputs.ModelOutput` instance will wind
up hitting) then treat the kwargs as the key-value pairs.
NOTE: For this to behave 100% correctly, we are relying on
the fact that python dicts behave like ordered dicts so that they
preserve the kwargs' ordering. Technically it is not guaranteed that
future versions of Python will respect this; if that behavior changes
we would need to ensure that dynamo uses OrderedDict for kwargs all
the way down in order to handle special cases like OrderedDict where
the kwargs' ordering does matter.
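A minimal sketch of the newly supported pattern (the class name is hypothetical):
```python
import collections
import torch

class MyDict(collections.OrderedDict):  # custom dict without a custom __init__
    pass

@torch.compile(fullgraph=True)
def f(x):
    d = MyDict(a=x + 1, b=x + 2)  # kwargs-based construction, no graph break
    return d["a"] * d["b"]

print(f(torch.ones(2)))
```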
Test Plan:
```
pytest test/dynamo/test_functions.py
```
I also verified that the new test fails without the changes to
`dicts.py`.
Reviewers: yanboliang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112513
Approved by: https://github.com/yanboliang
Summary:
CUDA Graph does not work well with CUPTI teardown.
1) crashes on 1st lazy CUPTI re-init after teardown (CUDA 11)
2) crashes on 2nd non-lazy CUPTI re-init after teardown (CUDA 12)
Workaround: turn off CUPTI teardown when using CUDA Graphs completely.
Test Plan: CI
Differential Revision: D50811284
Pulled By: aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112507
Approved by: https://github.com/davidberard98
I propose some changes so that the `FileSystemReader` and `FileSystemWriter` can be used on other file systems. The user only needs to provide `path` as a subclass of `Path` that overrides the necessary interfaces.
For example, one can utilize `tf.io.gfile` to implement an interface to save to or load from HDFS. The following code snippet shows a working implementation.
```python
from pathlib import Path

import tensorflow as tf

class GFileWrapper(tf.io.gfile.GFile):
    def __init__(self, path, mode="r") -> None:
        super().__init__(path, mode)

    def write(self, data):
        return super().write(bytes(data))

    # a not quite efficient readinto, but it works
    def readinto(self, buffer):
        # read up to buffer's length
        data = self.read(len(buffer))
        length = len(data)
        buffer[:length] = data
        return length

class HdfsPath(type(Path())):
    def __new__(cls, *pathsegments):
        return super().__new__(cls, *pathsegments)

    @staticmethod
    def _fix_path(path):
        path = str(path)
        if path.startswith("hdfs:/") and not path.startswith("hdfs://"):
            path = path.replace("hdfs:/", "hdfs://")
        return path

    def open(self, mode="r", *args, **kwargs):
        return GFileWrapper(HdfsPath._fix_path(self), mode=mode)

    def mkdir(self, **kwargs) -> None:
        return tf.io.gfile.makedirs(HdfsPath._fix_path(self))

    def rename(self, target):
        return tf.io.gfile.rename(HdfsPath._fix_path(self), HdfsPath._fix_path(target))
```
```python
writer = FileSystemWriter(HdfsPath("hdfs://..."), sync_files=False)
reader = FileSystemReader(HdfsPath("hdfs://..."))
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106635
Approved by: https://github.com/fduwjj
Summary: Add support for caching graphs that have tensor args with symbolic shapes. The high-level approach is to serialize guards with the on-disk cached object and validate that those guards pass before serving a cached object.
Test Plan: New unit tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111421
Approved by: https://github.com/ezyang
Previously, under config.only_allow_pt2_compliant_ops, Dynamo graph
breaks when it sees an OpOverloadPacket where any overload is not
PT2 compliant. This is potentially brittle: if someone (unlikely) adds
a new overload for a custom operator, then this would cause a
previously non-graph-breaking call to the OpOverloadPacket to graph break.
In this PR:
- When Dynamo is about to write a call to an operator to the FX graph,
we check if it is PT2 compliant.
- For OpOverload, we check to see if the tag is on it
- For OpOverloadPacket, we do overload resolution and check to see if
the tag is on the OpOverload that it resolves to.
Test Plan:
- new tests, existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112200
Approved by: https://github.com/bdhirsh
~~Shape is assumed by `TensorMetadata` to be torch.Shape/tuple, however, some of the scheduler node groups utilize `int`, so convert to tuple.~~
The root cause is actually the `foreach` scheduler node having a silent-error group of `int`, when in fact it ought to be the opaque `foreach` group.
**Previously:** silent error / confusing shape of (0,)

**Now:** clear that it is foreach which does not have well-defined shape:

~~Alternate might be to create list of shapes for each of its subnodes. Actually, for debuggability sake, I may prefer this. We can ensure that the recursive generation of this string is only done dynamically in a debug code path. Else, incrementally computing it on initialization of ForeachKernel may also be feasible.~~ This is quite infeasible for 100s of params.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110336
Approved by: https://github.com/mlazos
Changes the heuristic framework to support prioritizing individual classes within a test file.
Components of this included:
- Updating TestPrioritizations to accept individual test classes being prioritized. Previously, when a heuristic wanted to prioritize a test file, it would pass in the test's name; now, to prioritize a class within a test, it uses the notation "test::classname"
- Changes are fully backwards compatible with existing heuristics
- Test sharding now supports sharding individual tests (for when they're prioritized)
- When a TestClass is prioritized, we pass the appropriate "-k" flags down to pytest
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112161
Approved by: https://github.com/huydhn
Summary: This commit refactors q-dq patterns used in QAT fusion,
reducing code duplication. This is important for future efforts
to support quantizing bias.
Test Plan:
python test/test_quantization.py TestQuantizePT2EQAT
Reviewers: jerryzh168, kimishpatel
Subscribers: jerryzh168, kimishpatel, supriyar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112279
Approved by: https://github.com/jerryzh168
ghstack dependencies: #112159
Summary: Today, we have special handling for special qspecs like
`SharedQuantizationSpec` or `DerivedQuantizationSpec`, since these
qspecs refer to other nodes in the graph and these node references
need to be updated after replacement (since they referred to nodes
in the original graph that no longer exist in the new graph).
However, we only do the above for special nodes like conv, bn,
getitem, and relu. This doesn't cover the common use case of
having conv bias derive its qparams from those of conv input
activations and conv weight. This commit adds support for this
use case by also replacing the node references for these nodes.
Test Plan:
python test/test_quantization.py TestQuantizePT2EQAT.test_qat_conv_bn_bias_derived_qspec
Reviewers: jerryzh168, kimishpatel
Subscribers: jerryzh168, kimishpatel, supriyar
Differential Revision: [D50697078](https://our.internmc.facebook.com/intern/diff/D50697078)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112159
Approved by: https://github.com/jerryzh168
When the value of `fill` is complex, the line `value.toDouble() == 0.0` will error out, saying that converting complex to double will cause overflow. So we should handle the complex value first and only then enter this condition.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111937
Approved by: https://github.com/malfet
ghstack dependencies: #111885
Summary: as titled; after the SharedQuantizationSpec bug fix we are doing some checks beforehand, which simplifies the logic when we insert observers
Test Plan:
contbuild & OSS CI, see bf998a2c5d
Test plan from GitHub:
python test/test_quantization.py TestQuantizePT2E
CIs
Differential Revision: D50816224
Pulled By: jerryzh168
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112453
Approved by: https://github.com/andrewor14
We commonly do some variation of `tree_leaves((args, kwargs))`. This adds a new
function `arg_tree_leaves(*args, **kwargs)` which takes advantage of the known
structure of `args` and `kwargs` to skip their `flatten_fn`.
I see ~1 us improvement per call for args + kwargs, or a 0.5 us improvement
when passing just one of `args` or `kwargs`. For shallow structures, this can be
proportionally quite significant. For example, the empty_strided call I've been
using as a benchmark:
```
args = ((100, 100), (100, 1))
kwargs = dict(device="cuda")
```
Sees a 30% speedup from this.
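A hedged usage sketch, assuming the helper lives in `torch.utils._pytree` alongside `tree_leaves`:
```python
import torch.utils._pytree as pytree

args = ((100, 100), (100, 1))
kwargs = dict(device="cuda")

# Same leaves either way; arg_tree_leaves skips flattening the known
# (args, kwargs) outer structure.
assert pytree.arg_tree_leaves(*args, **kwargs) == pytree.tree_leaves((args, kwargs))
```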
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112393
Approved by: https://github.com/lezcano
ghstack dependencies: #112391, #112392
In https://github.com/pytorch/pytorch/pull/111122, an optimization was introduced for reduction + pointwise + multi-level reduction fusion. The main idea of this optimization is to have the first-level reduction of the multi-level reduction reuse the reduction sizes of the first reduction kernel, so that there are better chances that the first reduction kernel and the first-level reduction of the multi-level reduction kernel can be fused. However, it introduced a bug for the pointwise + multi-level reduction pattern, where the first-level reduction kernel wrongly reuses the reduction ranges (which are []) from the previous pointwise kernel. This PR fixes this issue.
Test plan:
`python timm_models.py --training --amp --performance --only=dm_nfnet_f0 --inductor`
Results before this PR: 0.869x
Results after this PR: 1.232x
Benchmark results:

<img width="1491" alt="Screenshot 2023-10-30 at 3 10 06 PM" src="https://github.com/pytorch/pytorch/assets/10527447/608d26ea-dcc5-4f2a-8700-4a928701392b">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112297
Approved by: https://github.com/jansel
The tested conditions in `test_binary_op_list_error_cases` look almost identical, although they cover method and in-place variants. Use a for loop to make the distinction a bit more explicit
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112348
Approved by: https://github.com/albanD
ghstack dependencies: #112349
This PR introduces a `full_tensor` API to DTensor; there are so many
callsites that exercise the `redistribute(replicate)` path that I feel
it deserves a separate API, mostly just syntactic sugar
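A hedged sketch of the sugar; this assumes an initialized process group and uses the module path from around this PR:
```python
import torch
import torch.distributed as dist
from torch.distributed._tensor import DeviceMesh, Replicate, distribute_tensor

mesh = DeviceMesh("cuda", list(range(dist.get_world_size())))
dt = distribute_tensor(torch.randn(8, 8), mesh, placements=[Replicate()])

full = dt.full_tensor()
# roughly the manual pattern it replaces:
manual = dt.redistribute(mesh, [Replicate()]).to_local()
```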
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112224
Approved by: https://github.com/wz337
Summary: The current implementation is not properly attaching output strides to the tracing context when an fx graph is loaded from the cache. That bug leads to assertion failures like `AssertionError: expected size 3==3, stride 1==9 at dim=1`. This change saves the output strides in the serialized object cached on disk and inserts them into the tracing context whether the graph is loaded from the cache or compiled.
Test Plan:
* New unit test using resnet18 (which repros the problem)
* Ran the timm benchmark suite with `--training`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112041
Approved by: https://github.com/ezyang
(legality) It is currently impossible (and should remain impossible) - (due to dedup guards - all static tensors are unique) - to access the same **static** tensor value from a **different source**.
As for `getattr(nn.Module, tensor)` source collisions, we will never instantiate a `nn.Module getattr` source for a static tensor, due to:
- side-effect tracking (as long as we track all static tensors - see also https://github.com/pytorch/pytorch/pull/112025 for extra sanity check)
- See: c8a5bb451e/torch/_dynamo/variables/builder.py (L227)
(no worse) In any case, this field is currently unused.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111911
Approved by: https://github.com/voznesenskym
When I run Stable Diffusion in [Huggingface/Diffusers](https://github.com/huggingface/diffusers), an error occurred:
```
LoweringException: AssertionError: should have been handled in replace_random.py.
target: aten.randn.generator
args[0]: [1, 4, 64, 64]
kwargs: {'generator': None, 'dtype': torch.float16, 'layout': torch.strided, 'device': device(type='cuda', index=0), 'pin_memory': False}
```
It looks like a bug in dynamo, and you can reproduce it like this:
```python
import torch

def model(shape, generator):
    return torch.randn(shape, generator=generator, device="cuda:0")

model = torch.compile(model)
x = model((1, 3, 64, 64), None)
print(x)
```
The error occurs because `None` is passed in for `generator`, and dynamo processes `torch.randn` into the fx node `torch.ops.aten.randn.generator`.
`aten.randn.generator` is not handled by decomposition; it is handled by lowering in [torch/_inductor/lowering.py](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/lowering.py#L1815), where randn.generator is processed like this:
```python
@register_lowering(aten.randn)
def randn(*args, **kwargs):
    if kwargs.get("generator", None) is not None:
        return fallback_randn_generator(*args, **kwargs)
    elif config.fallback_random:
        return fallback_randn_default(*args, **kwargs)
    raise AssertionError("should have been handled in replace_random.py")
```
As you can see, because `generator` is None, it will not step into `fallback_randn_generator`, and of course, if you don't enable `config.fallback_random`, it will not step into `fallback_randn_default` either. Actually, if `generator` is None, it could also be processed as `aten.randn.default`. As it stands, the AssertionError will be thrown; I will not discuss here how best to fix this and will open an issue.
Actually, `config.fallback_random` offers a way to debug randn in [config.py](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/config.py#L190), so I tried enabling `config.fallback_random` to debug my model. But when I enable it by:
```python
# fallback to eager for random/dropout, this is slow but useful for debugging
fallback_random = True
```
Another error occurs!
```python
LoweringException: RuntimeError: Unknown keyword argument 'generator' for operator 'aten::randn'. Schema: aten::randn(SymInt[] size, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None) -> Tensor
```
Obviously, `aten::randn` does not support `kwargs:{generator: None}`, so the key should be popped before kwargs are fed into `fallback_randn_default`.
That's all I'm going to say. Thanks for reading carefully.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112240
Approved by: https://github.com/jansel
For some reason, fbcode internal tests have listing errors when a test is skipped and has ", " in its name. This fix replaces the shape list with a string to avoid those internal test-listing errors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112300
Approved by: https://github.com/chenyang78
By checking that list sizes are the same before computing forward gradients.
Before the change
```cpp
::std::vector<at::Tensor> _foreach_add_List(c10::DispatchKeySet ks, at::TensorList self, at::TensorList other, const at::Scalar & alpha) {
  auto self_ = unpack(self, "self", 0);
  auto other_ = unpack(other, "other", 1);
  [[maybe_unused]] auto _any_requires_grad = compute_requires_grad( self, other );
  std::vector<bool> _any_has_forward_grad_result(self.size());
  for (const auto& i : c10::irange(self.size())) {
    _any_has_forward_grad_result[i] = isFwGradDefined(self[i]) || isFwGradDefined(other[i]);
  }
  ...
```
after the change:
```cpp
::std::vector<at::Tensor> _foreach_add_List(c10::DispatchKeySet ks, at::TensorList self, at::TensorList other, const at::Scalar & alpha) {
  auto self_ = unpack(self, "self", 0);
  auto other_ = unpack(other, "other", 1);
  [[maybe_unused]] auto _any_requires_grad = compute_requires_grad( self, other );
  TORCH_CHECK(
      self.size() == other.size(),
      "Tensor lists must have the same number of tensors, got ",
      self.size(),
      " and ",
      other.size());
  std::vector<bool> _any_has_forward_grad_result(self.size());
  for (const auto& i : c10::irange(self.size())) {
    _any_has_forward_grad_result[i] = isFwGradDefined(self[i]) || isFwGradDefined(other[i]);
  }
```
Add regression test
Fixes https://github.com/pytorch/pytorch/issues/112305
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112349
Approved by: https://github.com/Chillee
Summary: This diff fixes an OOB read found by fuzzing in torch/../jit/mobile
Test Plan:
CI and
```
arc lionhead crash reproduce 853835926354224
```
doesn't crash anymore.
Differential Revision: D49537377
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110453
Approved by: https://github.com/davidberard98
This adds bfloat16 support to `torch.nn.functional.grid_sample`. This is particularly important when doing feature sampling, such as for rendering techniques used in PyTorch3D or for camera projections to voxel grids such as in SimpleBEV.
Related to #57707
Test plan:
```
pytest test/test_nn.py -k grid_sample
pytest test/test_ops.py -k grid_sample
```
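A quick usage sketch of what this enables (shapes and values here are illustrative):
```python
import torch
import torch.nn.functional as F

# grid values are expected in [-1, 1]
inp = torch.randn(1, 3, 8, 8, device="cuda", dtype=torch.bfloat16)
grid = torch.rand(1, 4, 4, 2, device="cuda", dtype=torch.bfloat16) * 2 - 1
out = F.grid_sample(inp, grid, mode="bilinear", align_corners=False)
print(out.dtype)  # torch.bfloat16
```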
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112331
Approved by: https://github.com/zou3519
Currently meta_utils relies on as_strided when handling the view case (recursively meta-ify the base, and then do as_strided to simulate the view), but NestedTensor does not support as_strided today (though maybe it could?), so what we want to do instead is call `Tensor._view_func`. Conveniently, `_view_func` IS always available for nested tensors.
A detail to note is that _view_func actually incurs a guard because it needs to perform some metadata checks to make sure the view is still valid. This PR adds Tensor._unsafe_view_func which can avoid that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112205
Approved by: https://github.com/jbschlosser
Summary:
Generalize [layer_norm](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html) to all tensors of 2d to 4d. Using the mean and var operators added in this diff stack, we can compute layer_norm directly and remove the old shader file `layernorm.glsl`.
```
(input - input.mean(normalized_shape, keepdim=True)) / torch.sqrt(input.var(normalized_shape, correction=0, keepdim=True) + eps) * weight + bias
```
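A sanity sketch (not from the diff) checking that the composite formula matches `torch.nn.functional.layer_norm` for a last-dim normalization:
```python
import torch
import torch.nn.functional as F

x = torch.randn(3, 4, 5)
w, b = torch.randn(5), torch.randn(5)
eps = 1e-5
ref = F.layer_norm(x, (5,), w, b, eps)
out = (x - x.mean(-1, keepdim=True)) / torch.sqrt(
    x.var(-1, correction=0, keepdim=True) + eps) * w + b
torch.testing.assert_close(out, ref)
```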
Test Plan:
```
[luwei@devbig984.prn1 /data/users/luwei/fbsource (0a5028d8c)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*layer_norm*"
Building: finished in 0.1 sec (100%) 339/339 jobs, 0/339 updated
Total time: 0.1 sec
BUILD SUCCEEDED
Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *layer_norm*
[==========] Running 4 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 4 tests from VulkanAPITest
[ RUN ] VulkanAPITest.layer_norm_invalid_inputs
[ OK ] VulkanAPITest.layer_norm_invalid_inputs (69 ms)
[ RUN ] VulkanAPITest.layer_norm_2d
[ OK ] VulkanAPITest.layer_norm_2d (288 ms)
[ RUN ] VulkanAPITest.layer_norm_3d
[ OK ] VulkanAPITest.layer_norm_3d (302 ms)
[ RUN ] VulkanAPITest.layer_norm_4d
[ OK ] VulkanAPITest.layer_norm_4d (8 ms)
[----------] 4 tests from VulkanAPITest (668 ms total)
[----------] Global test environment tear-down
[==========] 4 tests from 1 test suite ran. (668 ms total)
[ PASSED ] 4 tests.
```
Reviewed By: yipjustin
Differential Revision: D50436726
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112322
Approved by: https://github.com/yipjustin
As in the title.
Currently, the try-catch block hides failures from triton kernel launches that are unrelated to the exceptions the block is meant to ignore. When a triton kernel launch fails (e.g. due to bugs in triton or lack of resources), ignoring such failures leads to hard-to-explain, unrelated errors in subsequent code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112154
Approved by: https://github.com/amjames, https://github.com/cpuhrsch
pytree is used in many hot paths for dynamo tracing and in many cases we don't
care about the tree spec and just want the flattened list. This improves
`pytree.tree_leaves` to not construct the spec which gives a noticeable
performance improvement when multiplied by the many times it gets called
during tracing.
Concretely, I see a 2x speedup compared to `tree_flatten` in this benchmark:
```python
import torch.utils._pytree as pytree
%timeit pytree.tree_flatten([((100, 100), (100, 1)), dict(device="cuda")])[0]
%timeit pytree.tree_leaves([((100, 100), (100, 1)), dict(device="cuda")])
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112323
Approved by: https://github.com/lezcano, https://github.com/XuehaiPan
ghstack dependencies: #112327
This function is called in `normalize_function` which is in a fairly hot path for
`FakeTensor` dispatch. In this simple benchmark I see `normalize_function`
improve from 92 us to 17 us just by caching this signature object.
```python
import torch
from torch._subclasses import FakeTensorMode
from torch.fx.operator_schemas import normalize_function
aten = torch._ops.ops.aten
%timeit normalize_function(
    aten.empty_strided.default, args=((100, 100), (100, 1)),
    kwargs=dict(device="cuda"), normalize_to_only_use_kwargs=True)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112327
Approved by: https://github.com/lezcano
I need this for later. This roughly returns all the OpOverloads
for an OpOverloadPacket in the order that the OpOverloadPacket decides
to resolve them in.
Test Plan:
- wait for CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112198
Approved by: https://github.com/ezyang
AutogradFunctionContextVariable was mutating self._saved_tensors, which is generally not allowed since VariableTracker objects should be read-only and are frequently copied via apply/clone. This was causing some test failures up the PR stack.
This moves the mutation into a separate object that is not copied.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112216
Approved by: https://github.com/voznesenskym
ghstack dependencies: #112122
Summary:
## Context
The embedding feature validation for GatherRangeToDense was added in the previous diff: D18031155. The logic checks the mismatched ranges or empty ranges over the whole model lifecycle; once the ratio exceeds some threshold, it triggers an ENFORCE failure (exception).
In the current implementation, it may have carry-on impact. The mismatch ratio is equal to:
```
ratio = mismatched_ranges_from_t0_to_t1 / total_ranges_from_t0_to_t1
```
if mismatched_ranges_from_t0 somehow increases a lot (a bad-value spike) at request N, the ratio becomes much larger than the threshold. Then it may take until t2 for the new ratio to drop below the threshold; however, the requests between t1 and t2 may all be good requests, so the carry-on impact persists.
Instead, we would like to propose a new strategy: when an exception happens at t1, we clean up all the history counters for this bad feature and make it a clean run for the next phase until the next exception. This gets rid of the carry-on impact.
In this logic, we will clean up the counter based on the bad feature J.
more context: https://docs.google.com/document/d/1tYHISyiLf-PVKPVGlZRZ0iq2Hvog3g5BjCKMCDLZHHo/edit
Test Plan:
hardcode a much smaller threshold as 0.0001 to force trigger the exception, deploy on some hosts in prod tiers
```
EPHEMERAL_PACKAGE=d44a3de1305c3b4c30fd62bc354a1285 tw update fbcode/tupperware/config/admarket/sigrid/predictor/prod.tw tsp_cln/admarket/sigrid_predictor_v2_dh_t1_elastic_ha --tasks=100-199 --fast --force
```
## 1) totalRanges can be correctly logged
```
I1018 20:37:09.012523 2074 gather_ranges_to_dense_op.h:69 req:00f00000001abbf1] In GatherRangesToDenseOp:
Lifetime empty ranges for each feature is 12354.
Lifetime mismatched ranges for each feature is 526.
With a total of 87503 examples for each feature.
```
## 2) exception can still be triggered
```
E1018 21:08:42.007398 668 LoggingPredictorService.cpp:701 req:001000000013df51] getRequestPrecomputedDataOnePass failure on model 481948521_146: [enforce fail at gather_ranges_to_dense_op.h:215] std::max(totalRangesTemp, minObservation_) * maxMismatchedRatio_ >= mismatchedRangesTemp. 0.1 vs 1. Ratio of range length mismatch for feature at index 0 is 0.00813008 (1/123) which exceeds 1e-05. The incorrect lengths include: 15 (Error from operator:
```
Differential Revision: D50570811
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111838
Approved by: https://github.com/malfet
Even after PR #111431, the `collective(...)` function still uses the underscore-suffixed member `avoidRecordStreams_` internally and does not respect each collective call's preference, since `avoidRecordStreams_` is controlled only by an environment variable.
As a fix, we pass `avoidRecordStreams` into the collective() function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112195
Approved by: https://github.com/awgu
Summary:
Currently error message doesn't give you details on nature of mismatch:
Output annotation element type and runtime tensor element type must match for tolist()
After update, error becomes actionable:
RuntimeError: Output annotation element type and runtime tensor element type must match for tolist(): Long vs Int
Test Plan: existing unit tests
Reviewed By: iseeyuan
Differential Revision: D50082858
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110872
Approved by: https://github.com/houseroad
This PR:
* Adds support for the `layout` kwarg to `torch.nested.as_nested_tensor()`
* Fixes `torch.nested.nested_tensor()`
* It should accept a list of lists of scalars
* It should not preserve autograd history
* Adds extensive testing for these two functions
Semantics for the two functions follow those of the strided layout:
* `torch.nested.nested_tensor(tensor_list, layout=torch.jagged)`: Creates a new jagged layout NT **with no autograd history**
* `tensor_list` can be a list of Tensors or list of lists of scalars
* `torch.nested.as_nested_tensor(tensor_list, layout=torch.jagged)`: Creates a new jagged layout NT **preserving autograd history of `tensor_list`**
* `tensor_list` must be a list of Tensors
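A brief usage sketch of the semantics above (values are illustrative):
```python
import torch

# nested_tensor: fresh jagged NT with no autograd history; also accepts
# a list of lists of scalars.
nt = torch.nested.nested_tensor([[1.0, 2.0], [3.0, 4.0, 5.0]],
                                layout=torch.jagged)

# as_nested_tensor: preserves autograd history of the input tensors.
ts = [torch.randn(2, 5, requires_grad=True),
      torch.randn(3, 5, requires_grad=True)]
nt2 = torch.nested.as_nested_tensor(ts, layout=torch.jagged)
print(nt2.requires_grad)  # True
```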
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112304
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
This is a small fix in "do_bench_using_profiling()".
When CUDA kernels execute in a non-default CUDA stream and cuda.synchronize() is called, a CUDA kernel named "Context Sync" is launched on the default stream to wait until all other streams finish. This kernel has "CUDA time" but is not a real kernel to profile. This fix excludes "Context Sync" when calculating total kernel time.
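An illustrative sketch of the filtering idea (the actual code in `do_bench_using_profiling()` may differ; the event name and fields used here are assumptions based on the description):
```python
import torch
from torch.profiler import profile, ProfilerActivity

a = torch.randn(1024, 1024, device="cuda")
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    torch.mm(a, a)
    torch.cuda.synchronize()

# Sum kernel time while skipping the default-stream wait pseudo-kernel.
total_us = sum(
    evt.cuda_time_total
    for evt in prof.key_averages()
    if evt.key != "Context Sync"
)
print(total_us)
```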
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112223
Approved by: https://github.com/int3, https://github.com/chenyang78
Summary: The current implementation is not properly attaching output strides to the tracing context when an fx graph is loaded from the cache. That bug leads to assertion failures like `AssertionError: expected size 3==3, stride 1==9 at dim=1`. This change saves the output strides in the serialized object cached on disk and inserts them into the tracing context whether the graph is loaded from cache or freshly compiled.
Test Plan:
* New unit test using resnet18 (which repros the problem)
* Ran the timm benchmark suite with `--training`
Differential Revision: [D50756653](https://our.internmc.facebook.com/intern/diff/D50756653)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112041
Approved by: https://github.com/ezyang
Major change in this PR is to make torch context manager class a separate ```TorchCtxManagerClassVariable```, since we have dynamo implementation for these ctx managers.
I was thinking to wrap them as ```UserDefinedClassVariable``` and do dispatch at ```USCVariable.call_function```, but it seems almost the same amount of work and this way is more clear.
This is on the way to moving ```TorchVariable``` to ```TorchFunctionVariable```, which will only handle functions that are allowed in graph (e.g., ```torch.sin```) or constant folded (e.g., ```torch.is_floating_point```). All other torch functions would go through the skip/inline rules and be wrapped as ```UserFunctionVariable``` (for inlined) or ```SkipFilesVariable``` (for skipped).
The next steps:
* Wrap torch modules, classes, objects as regular ```PythonModuleVariable```, ```UserDefinedClassVariable``` and ```UserDefinedObjectVariable```.
* Generate the allow in graph torch functions list and wrap them as ```TorchFunctionVariable```.
* Finally merge ```skipfiles.check``` and ```is_allowed``` into one function ```allow_skip.check(fn)``` which would return a Enum of allow, skip and inline.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111622
Approved by: https://github.com/jansel
Summary: tsia
Test Plan:
## Compile on Mac and run on Android
```
buck2 build -c ndk.static_linking=true -c pt.enable_qpl=0 --target-platforms=ovr_config//platform/android:arm32-fbsource //xplat/caffe2:pt_vulkan_api_test_binAndroid --show-output && adb push buck-out/v2/gen/fbsource/f1f3f9bed27e143c/xplat/caffe2/__pt_vulkan_api_test_binAndroid__/pt_vulkan_api_test_binAndroid /data/local/tmp
```
Run on android
```
$ adb shell /data/local/tmp/pt_vulkan_api_test_binAndroid
...
[ RUN ] VulkanAPITest.lstm_prepack_success
[ OK ] VulkanAPITest.lstm_prepack_success (11 ms)
[ RUN ] VulkanAPITest.querypool_flushed_shader_log
xplat/caffe2/aten/src/ATen/test/vulkan_api_test.cpp:7667: Skipped
QueryPool is not available
[ SKIPPED ] VulkanAPITest.querypool_flushed_shader_log (0 ms)
[----------] 396 tests from VulkanAPITest (29980 ms total)
[----------] Global test environment tear-down
[==========] 396 tests from 1 test suite ran. (29980 ms total)
[ PASSED ] 395 tests.
[ SKIPPED ] 1 test, listed below:
[ SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
YOU HAVE 7 DISABLED TESTS
```
All Passed.
Full Output: P865232089
Reviewed By: copyrightly
Differential Revision: D50677361
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112190
Approved by: https://github.com/manuelcandales
Summary:
Move the profiler's Approximate Clock from libtorch to libc10. The main reason is to allow c10 features to get time.
The clock is using TSC when available for performance. CUDA Caching Allocator's implementation of memory snapshot will add the timestamps to memory events with this same clock in subsequent diff.
Test Plan: CI
Differential Revision: D50601935
Pulled By: aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111972
Approved by: https://github.com/davidberard98
Currently, if the master store does not have all clients join within the `timeout` time, it just continues silently, which can lead to errors down the road. However, if a client does not connect to the master within the specified time, an exception is raised. This change makes the master store error out if not all clients have joined, making server and client consistent with each other.
Since this is changing the default behavior of master store I am open to suggestions.
Example:
```python
import torch.distributed as dist
import torch.multiprocessing as mp
from datetime import timedelta
def main(rank, world_size):
    if rank == 0:
        print("creating store")
        # world size is 2 so this eventually times out
        store = dist.TCPStore("localhost", 1234, 2, True, timeout=timedelta(seconds=5))
        print("finished creating store")

if __name__ == "__main__":
    world_size = 2
    mp.spawn(main, (world_size,), nprocs=world_size)
```
Previous output
```
creating store
finished creating store
```
Now
```
creating store
torch.distributed.DistStoreError: Timed out after 6 seconds waiting for workers. 1/2 workers joined.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111805
Approved by: https://github.com/XilunWu, https://github.com/fduwjj
Summary:
Always enter no_grad mode in AOTInductor run entries.
```
// AOTInductor uses at::addmm_out, which doesn't supports
// arguments that requires gradient. For this reason, we
// enforce no_grad context for run APIs.
```
Test Plan:
buck2 test mode/dev-nosan caffe2/test/inductor:test_aot_inductor
and OSS CI
Differential Revision: D50432042
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111613
Approved by: https://github.com/chenyang78, https://github.com/khabinov
Summary:
We implement [`torch.var`](https://pytorch.org/docs/stable/generated/torch.var.html) for tensors of 2d to 4d.
By using the `mean`, `sub` and `pow` ops, we can compute the variance as below without adding a new shader.
```
at::Tensor self_mean = self.mean(opt_dim, true);
at::Tensor output = (self.sub(self_mean).pow(2)).mean(opt_dim, keepdim);
```
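A sanity sketch (not from the diff) of the composite-variance identity used above:
```python
import torch

x = torch.randn(2, 3, 4)
dim, keepdim = (1, 2), True
ref = x.var(dim, correction=0, keepdim=keepdim)
out = (x - x.mean(dim, keepdim=True)).pow(2).mean(dim, keepdim=keepdim)
torch.testing.assert_close(out, ref)
```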
Test Plan:
```
[luwei@devbig984.prn1 /data/users/luwei/fbsource (2da0640c6)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*var*"
Building: finished in 0.1 sec (100%) 339/339 jobs, 0/339 updated
Total time: 0.1 sec
BUILD SUCCEEDED
Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *var*
[==========] Running 6 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 6 tests from VulkanAPITest
[ RUN ] VulkanAPITest.var_2d_unbiased
[ OK ] VulkanAPITest.var_2d_unbiased (322 ms)
[ RUN ] VulkanAPITest.var_2d_biased
[ OK ] VulkanAPITest.var_2d_biased (0 ms)
[ RUN ] VulkanAPITest.var_3d_unbiased
[ OK ] VulkanAPITest.var_3d_unbiased (2 ms)
[ RUN ] VulkanAPITest.var_3d_biased
[ OK ] VulkanAPITest.var_3d_biased (2 ms)
[ RUN ] VulkanAPITest.var_4d_unbiased
[ OK ] VulkanAPITest.var_4d_unbiased (175 ms)
[ RUN ] VulkanAPITest.var_4d_biased
[ OK ] VulkanAPITest.var_4d_biased (5 ms)
[----------] 6 tests from VulkanAPITest (508 ms total)
[----------] Global test environment tear-down
[==========] 6 tests from 1 test suite ran. (508 ms total)
[ PASSED ] 6 tests.
```
Reviewed By: yipjustin
Differential Revision: D50398925
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111965
Approved by: https://github.com/yipjustin
Previously, layout opt with sdpa would cause failures because we would pass a non-dense last dim to sdpa. Those layout constraints were added in prior PRs. Now we can do conv layout opt with sdpa.
Improves twins_pcpvt_base 1.4622 → 1.5351, xcit_large_24_p8_224 3.0681 → 3.1839
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112045
Approved by: https://github.com/shunting314
ghstack dependencies: #111976, #111721
Triggers `__torch_function__` tracing on attribute/method/property access matching the eager behavior for non-overridden attributes/methods/properties that are present on `torch.Tensor`.
Some caveats:
1. for methods there doesn't seem to be a way to check if the original implementation of a method is overridden via monkey patching or not. For example:
```
class LocalSubclass(torch.Tensor):
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        if kwargs is None:
            kwargs = {}
        return super().__torch_function__(func, types, args, kwargs)

x = torch.ones(2, 2).as_subclass(LocalSubclass)
> x.sigmoid
<built-in method sigmoid of LocalSubclass object at 0x7f8d305bb5e0>
```
There isn't a way to verify that this built-in method is equivalent to the base `torch.Tensor` implementation as each instance will have a different built-in method object that can't be traced back to the original `torch.Tensor` impl. You can check that the class itself has the original implementation via
```
> inspect.getattr_static(LocalSubclass, "sigmoid")
<method 'sigmoid' of 'torch._C.TensorBase' objects>
```
But we can't detect if the user dynamically patches an object with a built-in method called sigmoid which does something completely different.
2. If a user overrides a method but calls the original implementation we will still graph break. This will require modifying `SuperVariable` (and any other way to get the original impl) to handle tensor subclasses.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111737
Approved by: https://github.com/jansel, https://github.com/ezyang
Summary: - Modify the result of get_estimated_runtime() for ExternKernelSchedulerNode to count both bytes and FLOPs and return the maximum of the two.
Reviewed By: xmfan
Differential Revision: D48987490
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110539
Approved by: https://github.com/xw285cornell
Summary: Adding fsspec.transaction to safeguard checkpoint writing. With the context, a write should only be committed if there was no exception and discarded otherwise.
Test Plan:
```
command: buck test @//mode/dev-nosan //caffe2/test/distributed/checkpoint/fb:test_fsspec_filesystem -- --print-passing-details
```
Reviewed By: rohan-varma
Differential Revision: D50701929
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112191
Approved by: https://github.com/rohan-varma
This PR fixes the pointwise op strategy linearity and switches the
linear pointwise ops to use the strategy. It also adds tests showing that
the new approach enables fully sharded (S(0), S(0))-like operations.
Why is this useful? For 2-D parallel-like patterns where the named
parameters may be fully sharded on all devices, [S(0), S(0)],
[S(1), S(0)], etc. need to work; since we don't use the sharding rules
anymore, this is possible at this point.
@awgu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112107
Approved by: https://github.com/wz337
Summary:
Raising an exception here causes pybind11's dispatcher to kick in, which triggers aiplatform's logic (aiplatform::error_reporting::util::printAddressesWithBestEffortLocationInfo), which ultimately uses `folly::symbolizer::Symbolizer::symbolize` to build the stack trace. In 3.8 this uses about 3.62% of the CPU time per pyperf (https://fburl.com/scuba/pyperf_experimental/on_demand/oi554uvy). In Cinder 3.8 for some reason this is worse, using 5.94% of the CPU.
This exception is happening when doing a hasattr() on `prims` for things like `bitwise_left_shift` which don't exist: https://www.internalfb.com/code/fbsource/[2d695f650d00]/fbcode/caffe2/torch/_inductor/lowering.py?lines=590
That exception is ultimately going to be swallowed anyway, and the stack trace has no meaningful value. Furthermore because this is kind of an expected outcome in the code versus some random C++ exception the stack trace is less valuable as well.
This changes the code to return (None, None) in the failure case instead of a valid op/overload list, avoiding the exception and reclaiming the 3.62%-5.94% of time.
Test Plan: Existing CI and perf run: https://fburl.com/scuba/pyperf_experimental/on_demand/oi554uvy
Differential Revision: D50018789
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111438
Approved by: https://github.com/davidberard98
This PR:
- Moves TrueDiv, LShift, RShift, IsNonOverlappingAndDenseIndicator to `_sympy.functions.py`
- Moves SymNode to `fx.experimental.sym_node`.
- This file does not have any SymPy dependencies at import time
- It installs the magic methods in Sym{Bool,Int,Float}.
- N.b. With this split, we may be able to move Sym{Bool,Int,Float} to this file, and remove quite a few of the hacks around these classes
- Imports `sym_node` in `torch/__init__.py` rather than the whole `symbolic_shapes.py`.
This breaks the import-time dependency between torch and SymPy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112037
Approved by: https://github.com/peterbell10
ghstack dependencies: #112035, #112036
Updates `_export.aot_compile` to pass a torch IR graph to inductor, allowing inductor to run the pre_grad_passes and reuse more of inductor's code.
Also updates the API to only return the `so_path`, without returning the exported program. The pytree call spec is now serialized and placed inside the generated model code. When calling the model, because no C++ pytree implementation is linked yet, we can access the call specs through `get_call_spec()` and call pytree flatten/unflatten in Python.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110020
Approved by: https://github.com/desertfire
Summary:
Previously, we wanted to maintain forward compatibility by skipping
default args in the serialized artifacts in fbcode. However, some of our shim
interfaces require default values to be set. Discussed with Sherlock offline,
and we decided to allow serializing default args into the C++ wrapper code
for now. We will refine this part if we see a real FC requirement.
Test Plan: ci
Differential Revision: D50638663
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112085
Approved by: https://github.com/SherlockNoMad
Fixes#111776
Support check_regex in FileCheck() by adding `find_regex` in `struct TORCH_API StringCordView`.
The call site accepts RE syntax for std::regex.
However, I haven't figured out submatch IDs yet.
For example, "buf5[0], buf6_inputs[0]" is still considered a match.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112077
Approved by: https://github.com/yf225
When converting bn to sync bn, we need to keep sync bn's training flag consistent with the original bn's flag. The motivation: the given model may have some bn training flags set and others not set; after converting to sync bn, we hope not to change this behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111998
Approved by: https://github.com/mikaylagawarecki
Motivation: MYPYNOFOLLOW currently typechecks almost all inductor files
and some dynamo files as well. However, it has `follow_imports=skip`
enabled which greatly nerfs its effectiveness. I would like to enable
import following for all the files currently checked by MYPYNOFOLLOW.
But that leads to a lot of new errors in other files.
I can exclude errors from files in other directories, but it is somewhat
difficult to do that for dynamo and inductor files themselves. Thus I am
making sure all the dynamo files typecheck first.
Note on changes: I could not type the return value of
`make_function_id_set` since it was returning a class defined in the
function body. Thus I deleted `make_function_id_set` and replaced it
with a direct construction of the `FunctionIdSet` instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111894
Approved by: https://github.com/Skylion007, https://github.com/eellison
Per the documentation, one should be able to explicitly pass the dim argument as None to get the tensor's sizes/strides across all dimensions, but before this change it was incorrectly interpreted as a named-tensor call.
Modify the `size` and `stride` signatures generated by `gen_pyi.py` to highlight that the overload with `None` returns a Tuple, while the one with `dim: _int` returns an `int`.
Add a regression test to validate the behavior, and remove the check for asserts from two named-tensor tests (NamedTensors are dead, aren't they?)
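Per the description, the now-supported calls look like:
```python
import torch

t = torch.randn(2, 3)
print(t.size(None))    # torch.Size([2, 3]); previously misread as a named-tensor call
print(t.size(1))       # 3
print(t.stride(None))  # (3, 1)
```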
Fixes https://github.com/pytorch/pytorch/issues/111944
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111991
Approved by: https://github.com/zou3519
Partially addresses https://github.com/pytorch/pytorch/issues/111081
This fixes the majority of the slowness from https://fb.workplace.com/groups/1405155842844877/permalink/7491314274228973/. In particular, the type of example that suffers the most perf-wise in AOTAutograd looks like this:
```
@torch.compile
def f(x):
    intermediate = x.mul(2)
    outs = intermediate.unbind(0)
    return outs

x = torch.randn(50, 50, requires_grad=True)
outs = f(x)
sum(outs).sum().backward()
```
There are 50 output tensors in the above function, that all alias each other. AOTAutograd will dutifully exercise its intermediate base [logic](https://github.com/pytorch/pytorch/blob/main/torch/_functorch/aot_autograd.py#L294), and try to regenerate the aliases outside of the compiled `autograd.Function` at runtime, to ensure that the autograd engine is aware of the aliasing.
In this case, this will result in **50 AsStridedBackward nodes in the backward**, because we will fall back to using as_strided to generate each of those 50 outputs. The current PR as is (somewhat unsafely) ensures that the backward graph consists of a single `UnbindBackward`, or a call to `aten.cat()`.
I left a long comment in the code describing the situation, but the core idea is that **autograd does not let you mutate grad_fn of tensor aliases that come from multi-output views**. So if we have `k` outputs that alias each other, but `k-1` of them are aliases that came from multi-output views, then in eager mode, it would not be possible to mutate one of the aliases in a way that would change the grad_fn of any of the other aliases, without causing an error in the backward. So the claim I'm making is that if we hide this aliasing from the autograd engine, then it is impossible for the user to perform any mutations that would cause autograd metadata to diverge between torch.compile and eager in a way that isn't an error in eager mode.
To be fair, I think that taking the approach outlined in https://docs.google.com/document/d/1DlfFq8TKbuAn2zyJxLfoW-X1qkkm5PLdHFtySo03QAk/edit would also help us avoid the as_strided calls in this particularly egregious case, **and** keep the autograd error messages. This relies on both pre-dispatch functionalization being fully hardened **and** adding some pretty invasive changes to AOTAutograd though, and is probably at least several months out.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111411
Approved by: https://github.com/ezyang
- rename `__HIP_PLATFORM_HCC__` to `__HIP_PLATFORM_AMD__`
- rename `HIP_HCC_FLAGS` to `HIP_CLANG_FLAGS`
- rename `PYTORCH_HIP_HCC_LIBRARIES` to `PYTORCH_HIP_LIBRARIES`
- workaround in tools/amd_build/build_amd.py until submodules are updated
These symbols have had a long deprecation cycle and will finally be removed in ROCm 6.0.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111975
Approved by: https://github.com/ezyang, https://github.com/hongxiayang
Summary:
Traditionally, when a user wants to update the arguments of an FX node, the only way is to call the setter of the .args property on the node. This may be problematic when we insert a lot of arguments, because given the semantics of the setter method, it has a worst-case O(n) complexity.
Adding a new insert_arg provides two benefits:
1. The operation is guaranteed to be O(1) cost.
2. Users can express the intent more directly, instead of writing code like `node.args = (arg,) + node.args` (see the sketch below).
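A small sketch of the new API, assuming the `insert_arg(index, arg)` signature described above:
```python
import torch
import torch.fx as fx

def sum_all(*tensors):
    return sum(tensors)

g = fx.Graph()
x = g.placeholder("x")
y = g.placeholder("y")
call = g.call_function(sum_all, (x,))
g.output(call)

# O(1) insertion with clear intent...
call.insert_arg(1, y)
# ...versus the old worst-case O(n) pattern:
#   call.args = call.args[:1] + (y,) + call.args[1:]

gm = fx.GraphModule(torch.nn.Module(), g)
print(gm(torch.ones(2), torch.ones(2)))  # tensor([2., 2.])
```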
Test Plan: caffe2/test:fx -- -r test_insert_arg
Reviewed By: suo
Differential Revision: D50574435
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111974
Approved by: https://github.com/angelayi
Running `python test_nn.py -v -k test_nll_loss_large_tensor` on a machine with a small host RAM availability (e.g. ~50GB) fails with a `SIGKILL` even though the currently specified memory requirements for CPU (and GPU) are set to 48GB and are thus met.
Profiling the peak memory usage via:
```
\time -v python test_nn.py -v -k test_nll_loss_large_tensor
```
and adding `print(torch.cuda.memory_summary())` at the end of the test shows a higher host RAM usage of >100GB and a device memory usage of ~32GB.
```
Command being timed: "python test_nn.py -v -k test_nll_loss_large_tensor"
User time (seconds): 81.66
System time (seconds): 229.02
Percent of CPU this job got: 671%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:46.30
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 118150096
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 90280839
Voluntary context switches: 1669
Involuntary context switches: 1214548
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
```
```
| PyTorch CUDA memory summary, device ID 0 |
|---------------------------------------------------------------------------|
| CUDA OOMs: 0 | cudaMalloc retries: 0 |
|===========================================================================|
| Metric | Cur Usage | Peak Usage | Tot Alloc | Tot Freed |
|---------------------------------------------------------------------------|
| Allocated memory | 32769 MiB | 32769 MiB | 81923 MiB | 49154 MiB |
| from large pool | 32768 MiB | 32768 MiB | 81921 MiB | 49152 MiB |
| from small pool | 0 MiB | 0 MiB | 1 MiB | 1 MiB |
|---------------------------------------------------------------------------|
| Active memory | 32769 MiB | 32769 MiB | 81923 MiB | 49154 MiB |
| from large pool | 32768 MiB | 32768 MiB | 81921 MiB | 49152 MiB |
| from small pool | 0 MiB | 0 MiB | 1 MiB | 1 MiB |
|---------------------------------------------------------------------------|
| Requested memory | 32769 MiB | 32769 MiB | 81923 MiB | 49154 MiB |
| from large pool | 32768 MiB | 32768 MiB | 81921 MiB | 49152 MiB |
| from small pool | 0 MiB | 0 MiB | 1 MiB | 1 MiB |
|---------------------------------------------------------------------------|
| GPU reserved memory | 32774 MiB | 32774 MiB | 81938 MiB | 49164 MiB |
| from large pool | 32772 MiB | 32772 MiB | 81930 MiB | 49158 MiB |
| from small pool | 2 MiB | 2 MiB | 8 MiB | 6 MiB |
|---------------------------------------------------------------------------|
...
```
We haven't seen this issue before as the majority of our runners have sufficient host RAM and I just ran into it by chance.
CC @atalman @malfet @crcrpar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110963
Approved by: https://github.com/mikaylagawarecki, https://github.com/eqy, https://github.com/malfet
Ignore dims of value 1 in require_stride_order, since they don't affect layout. Previously, unsqueezed dims would always cause a copy because the stride of 0 would throw off the sorted stride order. This was causing perf problems with require_stride_order in the next commit in the stack.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111976
Approved by: https://github.com/Chillee
This PR introduces a native version of c10d_functional ops. The main goal is to add collective support in AOTInductor and allow collective ops to work in multi-threaded native runtimes.
The native version also incorporated API improvements we wished to implement in Python c10d_functional:
- Removed `ranks` and `group_size` from collective op signatures which were proven to be redundant.
- Use tensor storage as opposed to `void*` to resolve in-flight work.
The native process group registration/resolution mechanism is only used for native c10d_functional in this PR. It will become the single source of truth in upcoming PRs.
The upcoming PRs will implement Inductor/AOTInductor support for c10d_functional, after which native c10d_functional will replace Python c10d_functional.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110570
Approved by: https://github.com/wanchaol
This PR:
- adds the pt2 compliant tag. This tag specifies that the operator works
with the PT2 compilation APIs. A custom op author should test their
ops with opcheck if they choose to add this tag.
- adds a config for Dynamo to allow only pt2 compliant ops into the
graph and graph break on all other OpOverload/OpOverloadPacket.
Bikeshedding help wanted on the name of the tag. It should be easily
grep-able so we can set up rules for it.
Test Plan:
- new tests
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111933
Approved by: https://github.com/ezyang
ghstack dependencies: #111912, #111915, #111948
Unlike the previous torch.library.define, this schema doesn't take a
name (the name is a part of the qualname). We separated out the qualname
from the schema in the new APIs so that they're all consistent with each
other (they all accept the qualname separately).
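A short sketch of the new shape (the op name here is chosen for illustration):
```python
import torch

# The name lives in the qualname; the schema is just the signature.
torch.library.define("mylib::double", "(Tensor x) -> Tensor")
```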
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111915
Approved by: https://github.com/suo, https://github.com/ezyang
ghstack dependencies: #111912
This build is used in releases as far as I know. For releases we don't need the suffix.
Test in Release:
```
python3 .github/scripts/generate_pytorch_version.py
2.1.1+cpu
python3 .github/scripts/generate_pytorch_version.py --no-build-suffix
2.1.1
```
Test with nightly:
```
python3 .github/scripts/generate_pytorch_version.py --no-build-suffix
2.2.0.dev20231025
```
With suffix:
```
python3 .github/scripts/generate_pytorch_version.py
2.2.0.dev20231025+cpu
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112046
Approved by: https://github.com/huydhn
Summary
- faster than the previous try-catch.
- more stable than the previous try-catch. In some circumstances, serializing models > 2GB into a single protobuf file ends up with a corrupted file without raising an exception.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111984
Approved by: https://github.com/justinchuby
Now we compile the generated wrapper C++ code with clang in fbcode.
When the Model's run_impl function is too large, clang issues
a warning like:
Function foo is too big to optimize [-Wignored-optimization-argument]
and compiles the code without any optimization.
I think we may want to be more proactive in such cases. If the
generated C++ code is too complex or too large to be optimized,
we would like to be notified loudly with errors, so that we
would figure out ways to address the issue.
Later if we feel that turning this warning into an error is too
aggressive, we would add a config to disable it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112008
Approved by: https://github.com/desertfire, https://github.com/htyu
Fixes#109889
This PR adds `torch.export.export` as another `FXGraphExtractor` implementation. `torch.onnx.dynamo_export` automatically uses this new FX tracer when a `torch.export.ExportedProgram` is specified as `model`
Implementation is back compatible, thus non `ExportedProgram` models are handled the exact same way as before
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111497
Approved by: https://github.com/BowenBao
This should be the last of the "it used to work with static shapes but
it doesn't work with dynamic shapes" hard errors. Now we will just
specialize if you hit it from C++.
The strategy here is a bit clever. We shunt the size() call to Python
binding if an error would have occurred. Importantly, we already have
logic to make sure the newly allocated ints stay live for the duration
of the ArrayRef access.
storage_offset is intentionally omitted because there are some problems
with it. I will fix them next.
This should let us get rid of the aotautograd_static test configuration.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111935
Approved by: https://github.com/zou3519
Use conditional imports: when running under dynamo, import the original NumPy, not torch._numpy. The original is what we want to trace, not our implementation.
With this, the test suite passes with and without `PYTORCH_TEST_WITH_DYNAMO=1` (modulo a couple of test modules which are not meant to be compiled, e.g. `test_nep50_examples`). There are two new decorators, `x{fail,pass}ifTorchDynamo`; the `xpass` in most cases indicates a graph break and a fallback to eager for things we do not implement.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110401
Approved by: https://github.com/lezcano
Fixes#111999
Adds an assert that provides a more informative error message
For example, when running a compiled function with mps (currently unsupported):
```
...
File "/Users/andrew.hu/Desktop/pytorch/torch/_inductor/graph.py", line 927, in init_wrapper_code
assert wrapper_code_gen_cls is not None, f"Device {device_type} not supported"
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
AssertionError: Device mps not supported
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112001
Approved by: https://github.com/peterbell10
Summary:
There's no reason to terminate the parent process while trying to find the name of the signal received by the child process.
Let's make sure this is handled properly, which will ensure that the parent process can process child failures.
Test Plan: Unit tests.
Reviewed By: aaronenyeshi
Differential Revision: D50615419
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111961
Approved by: https://github.com/aaronenyeshi
Summary:
att; after the SharedQuantizationSpec bug fix we do some checks beforehand, which simplifies the logic when we insert observers
Test Plan:
python test/test_quantization.py TestQuantizePT2E
CIs
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111828
Approved by: https://github.com/kimishpatel
ghstack dependencies: #111827
The default size-based auto-wrap policy may not be representative of actual usage of the models. We add support for a few handpicked models and fall back to the size-based policy otherwise.
sample command:
`PYTHONPATH=~/benchmark/ python benchmarks/dynamo/torchbench.py -dcuda --training --backend=inductor --multiprocess --performance --only nanogpt --fsdp`
1.257x
1.256x
1.257x
1.252x
1.257x
1.262x
1.258x
1.272x
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111505
Approved by: https://github.com/H-Huang, https://github.com/xuzhao9
Fixes#110611.
The current Torchinductor `log` implementations call `sleef` functions in `aten::Vec`, which show worse performance than ATen's `log` implementations that invoke `MKL` functions. The reason is that the `sleef` algorithms sacrifice performance in order to achieve higher precision. This PR changes Torchinductor's `log` implementations from the `sleef` functions with a `1.0` ULP error bound to the ones with a `3.5` ULP error bound.
**Performance**
Machine: ICX
The original perf numbers, with `Sleef_logf16_u10`:
```bash
numactl -C0 python test.py
log
eager: 368.8463559374213
compiled: 616.8672097846866
logit
eager: 565.499295014888
compiled: 1010.4096410796046
```
Perf with `Sleef_logf16_u35`:
```bash
numactl -C0 python test.py
log
eager: 364.8629770614207
compiled: 360.2141812443733
logit
eager: 562.3160391114652
compiled: 545.2622110024095
```
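The referenced test.py isn't included in the description; a rough guess at its shape, under the stated single-core CPU setup, might be:
```python
import time
import torch

def bench(fn, x, iters=1000):
    fn(x)  # warm-up (also triggers compilation for compiled fns)
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(x)
    return (time.perf_counter() - t0) / iters * 1e6  # microseconds

x = torch.randn(2**20)
for name, op in [("log", torch.log), ("logit", torch.logit)]:
    print(name)
    print("eager:", bench(op, x))
    print("compiled:", bench(torch.compile(op), x))
```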
**Accuracy**
error_bound | tol=1e-6 | tol=1e-7
-- | -- | --
1.0 ULP | PASS | FAIL
3.5 ULP | PASS | FAIL
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111898
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
Fixes#109889
This PR adds `torch.export.export` as another `FXGraphExtractor` implementation. `torch.onnx.dynamo_export` automatically uses this new FX tracer when a `torch.export.ExportedProgram` is specified as `model`
Implementation is back compatible, thus non `ExportedProgram` models are handled the exact same way as before
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111497
Approved by: https://github.com/BowenBao
Summary: Spotted this while compiling on a Mac M1. The code in these files is gated behind #ifdef and requires SSE, so when building for ARM these files become empty.
Test Plan: CI
Differential Revision: D50407334
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111475
Approved by: https://github.com/digantdesai
### <samp>🤖 Generated by Copilot at a3b51df</samp>
This pull request adds support for testing binary uploads to Anaconda Cloud using different tokens and channels based on the branch name. It modifies the `.github/workflows/_binary-upload.yml` workflow and several other workflows that use the `.github/templates/upload.yml.j2` template. It also adds a new secret variable `conda-pytorchbot-token-test` to store the test token.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111821
Approved by: https://github.com/osalpekar, https://github.com/huydhn
We set manually_set_graph_inputs to False for CondHigherOrder. After that, it became necessary to deduplicate the inputs. We'll add pytree tests in a follow-up PR.
Test Plan:
existing tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111611
Approved by: https://github.com/zou3519
ghstack dependencies: #111610
Fixes#111222
This pull request adds tests for factory functions that create tensors with a strided layout. The tests are added to the `test_ops.py` file and check the behavior of the `empty`, `zeros`, `ones`, and `rand` factory functions when used with the `layout=torch.strided` argument.
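The behavior the new tests exercise, sketched:
```python
import torch

# torch.strided is the dense layout; the factory functions accept it
# explicitly.
for factory in (torch.empty, torch.zeros, torch.ones, torch.rand):
    t = factory(2, 3, layout=torch.strided)
    assert t.layout == torch.strided
```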
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111463
Approved by: https://github.com/lezcano
Summary:
The dist backend used on MTIA doesn't support int32 allreduce for now, so local_used_map_dev_ has to be placed on the CPU.
Test Plan: See diff D50387636
Differential Revision: D50460304
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111581
Approved by: https://github.com/fduwjj
In https://github.com/pytorch/pytorch/pull/111122, an optimization was introduced for reduction() + () + multi-level reduction. In this case, we make the multi-level reduction's first-level reduction ranges the same as the previous reduction's ranges, so that Inductor has a better chance of fusing the first reduction with the first-level reduction of the multi-level reduction kernel.
There is a corner case where the multi-level reduction kernel has `keepdim=True`. In this case, the ranges of the multi-level reduction kernel are not empty, and the dim info needs to be used to create the inner loader of the first-level reduction kernel. To keep the logic simple, for now we simply disable the optimization when `keepdim=True`.
Differential Revision: [D50544876](https://our.internmc.facebook.com/intern/diff/D50544876)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111781
Approved by: https://github.com/malfet, https://github.com/jansel
Keep a buffer of the last 16384 nccl work actions, including the stack
trace that launched the event.
When torch._C._distributed_c10d._dump_nccl_trace() is called, it can dump these to a pickled archive.
a pickled archive.
For each action we get:
process_group_id, seq_id, collective_name, size_of_first_tensor, stack trace
state - issued, started, completed (based on cuda events and queried if
necessary when the dump is requested)
I tested that it is possible to query event state when the streams are
otherwise stuck.
Differential Revision: [D50138956](https://our.internmc.facebook.com/intern/diff/D50138956)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110960
Approved by: https://github.com/wconstab
Currently, the only way ProcessGroupNCCL shuts down its background threads and aborts all communicators is via the destructor.
However, given how python GC works and code holding references to the PG in multiple places, in practice calling `destroy_process_group` doesn't actually end up invoking the destructor.
As a result, in this PR I'm adding an explicit shutdown method that users can call to clean up all resources.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111392
Approved by: https://github.com/XilunWu, https://github.com/wanchaol, https://github.com/fduwjj
Right now, memory ops are being lowered to strings partly in
scheduler.codegen() and partly in wrapper.codegen(). But that makes
static memory planning (which is done entirely in `wrapper.codegen()`)
difficult to implement as information is "lost" by that point.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111402
Approved by: https://github.com/jansel
Tracks https://github.com/pytorch/pytorch/issues/98161
Complex number support in Pytorch isn't ideal today as complex operations will mostly end up taken care of by the aten runtime, except for `torch.angle` which is handled in [105609](https://github.com/pytorch/pytorch/pull/105609). In general a better way to handle that could be to decompose complex operations first so that more opportunities for fusion could be unveiled, and then to have Triton take care of non-continuous (strided) tensor operations more efficiently. This change adds support to decompose complex addtions.
```
@triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 6
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (x0), xmask)
    tmp1 = tl.load(in_ptr1 + (x0), xmask)
    tmp2 = tmp0 + tmp1
    tl.store(out_ptr0 + (x0), tmp2, xmask)
```
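A sanity sketch (not from the diff) of the decomposition idea: a complex add is two real adds over the real/imaginary components:
```python
import torch

a = torch.randn(6, dtype=torch.complex64)
b = torch.randn(6, dtype=torch.complex64)
c = torch.view_as_complex(torch.view_as_real(a) + torch.view_as_real(b))
torch.testing.assert_close(c, a + b)
```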
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110740
Approved by: https://github.com/jansel
Summary:
This PR adds torch.compile support for semi-structured sparsity,
using the subclass tracing @bdhirsh added.
Based on whether we are using cuSPARSELt or CUTLASS, we return a
different representation of the inner tensors.
Test Plan:
```
python test/test_sparse_semi_structured.py -k compile
```
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111049
Approved by: https://github.com/cpuhrsch
Argument "device" was missed.
So, "test_sparse.py::TestSparseAnyCUDA::test_as_sparse_gradcheck_*_cuda" was always run on the default device ("cpu") if another default torch device was not configured before.
This fix will probably detect a number of issues on various devices which were previously missed.
Should fix failed rocm CI jobs with "##[error]The action has timed out." and speedup test execution
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111584
Approved by: https://github.com/soulitzer
As in the title.
The figures below illustrate the performance differences of bsr_dense_mm with optimized parameters and bsr_dense_mm with default parameters (GPU: NVIDIA A100-SXM4-80GB). The first figure represents the performance equilibrium point in BSR tensor sparsity at which value bsr_dense_mm have the same performance characteristics as torch.matmul. The second figure represents speedups from using optimized meta parameters in bsr_dense_mm at its performance equilibrium points with respect to bsr_dense_mm with default meta parameters.
In sum, this PR speeds up `bsr_dense_mm` by about 50% depending on the BSR tensor shape and block size, and lowers the performance equilibrium points of BSR tensor sparsity and strided tensors for matmul operations.
<img src="https://github.com/pytorch/pytorch/assets/402156/6fe9d35f-dd21-4aa0-bb01-6ee257254453" width="48%"> <img src="https://github.com/pytorch/pytorch/assets/402156/506921c6-3770-4209-ad3d-498d2ae4989d" width="48%">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111760
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #110396, #111470, #111489
torch.library.impl now accepts a device string (e.g. "cpu", "cuda"). It
still accepts DispatchKey strings, but we no longer document this, because
using arbitrary DispatchKeys is more for the power users.
We map the device string to a DispatchKey and then register the impl for
said DispatchKey. A user may also specify multiple device strings at once
or specify "types=default" to get a CompositeExplicitAutograd registration.
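A sketch of the device-string registration described above ("mylib::relu6" is an illustrative op):
```python
import torch

torch.library.define("mylib::relu6", "(Tensor x) -> Tensor")

@torch.library.impl("mylib::relu6", "cpu")
def relu6_cpu(x):
    return x.clamp(0, 6)

@torch.library.impl("mylib::relu6", "cuda")
def relu6_cuda(x):
    return x.clamp(0, 6)

print(torch.ops.mylib.relu6(torch.tensor([-1.0, 3.0, 9.0])))
# tensor([0., 3., 6.])
```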
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111659
Approved by: https://github.com/soulitzer
ghstack dependencies: #111380
See title. Makes this consistent with torch.library.{define, impl, impl_device}, where we have named the same argument `qualname`. This is not BC-breaking because we have not released a version of PyTorch with impl_abstract in it yet.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111380
Approved by: https://github.com/soulitzer
# Summary:
This pull request **removes** support for non-square sequence lengths in causal attention when using FlashAttention V2.
### Why we are doing this
// FlashAttention 2 updated the default mask meaning for causal in this PR:
// 9e5e8bc91e it is now aligned to lower_right which would be a BC break
// for non-square masks. We will not support non-square masks for causal w/ FAV2
For more context see:
https://github.com/pytorch/pytorch/issues/108108
### Followup
A large number of people will likely want to use FAV2 with lower_right causal attention for non equal sequence lengths. See this RFC : https://github.com/pytorch/pytorch/issues/110681
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111007
Approved by: https://github.com/cpuhrsch
Summary:
The slimdsnn model is currently built with GCC, and I see that Clang-15 generates better code than GCC (about 10% faster) after a stack of backporting (D50338220).
There are likely other improvements in internal Clang, as the TOT Clang in LLVM upstream generates even better code.
Test Plan:
Before:
buck2 run mode/{opt,inplace} //accelerators/workloads/models/slimdsnn:slimdsnn_dso_benchmark -- --iterations=100000000
Starting benchmark, 100000000 iterations...
Batch=1 latency: 0.643 us
After:
buck2 run mode/{opt,inplace} //accelerators/workloads/models/slimdsnn:slimdsnn_dso_benchmark -- --iterations=100000000
Starting benchmark, 100000000 iterations...
Batch=1 latency: 0.593 us
Reviewed By: bertmaher
Differential Revision: D50399150
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111537
Approved by: https://github.com/bertmaher
This PR introduces a `scatter_mm` operation (computing `mm` over arbitrary pairs of tensors given in batches) that is used to implement `bsr_scatter_mm`, which is equivalent to `bsr_dense_mm` (the `mm` operation on BSR and strided tensors). The implementation is provided both in Triton (when tensor dimensions are multiples of 16) and in PyTorch (otherwise).
The figures below illustrate the performance differences of `bsr_scatter_mm` and `bsr_dense_mm` (GPU: `NVIDIA GeForce RTX 2060 SUPER`). The first figure represents the performance equilibrium point in BSR tensor sparsity at which value `bsr_scatter_mm` or `bsr_dense_mm` have the same performance characteristics as `torch.matmul`. The second figure represents speedups from using `bsr_scatter_mm` at its performance equilibrium points with respect to `bsr_dense_mm`.
<img src="https://github.com/pytorch/pytorch/assets/402156/526d182e-937f-4812-a6c4-904f52d6d5ab" width="48%"> <img src="https://github.com/pytorch/pytorch/assets/402156/ccb606ab-1f3f-4133-887c-b56285f4f168" width="48%">
The same figures for GPU card `NVIDIA A100-SXM4-80GB`:
<img src="https://github.com/pytorch/pytorch/assets/402156/25466f1d-df34-4d1c-a975-afb478e4d9f0" width="48%"> <img src="https://github.com/pytorch/pytorch/assets/402156/6ada91f0-a20f-4f0d-8a48-1f4ccc60d08e" width="48%">
In sum:
- `bsr_scatter_mm` is about 2x faster than `bsr_dense_mm` for small block sizes of 16 and 32 and large tensors [GPU: `NVIDIA GeForce RTX 2060 SUPER`].
- `bsr_scatter_mm` is up to 2x faster than `bsr_dense_mm` for small block sizes of 16 and large tensors [GPU: `NVIDIA A100-SXM4-80GB`].
- `bsr_dense_mm` is up to 20 % faster than `bsr_scatter_mm` for block sizes of 64 or larger [GPU: `NVIDIA GeForce RTX 2060 SUPER`].
- However, `bsr_dense_mm` fails with `OutOfResources` exception for block sizes of 256 or larger whereas `bsr_scatter_mm` succeeds.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110396
Approved by: https://github.com/cpuhrsch
Summary: When doing quantization, int_mm -> mul or int_mm -> mul ->
to(dtype) is an extremely common op pattern which is currently not
handled well by inductor. Ideally, since the output of
int_mm has dtype int32, we'd prefer to only realize a smaller dtype like
bf16 or float16. Currently inductor doesn't have a way to force this; in
many cases the mul gets fused with a bunch of subsequent pointwise and reduction
ops from the dequant and following ops, creating an increase in memory overhead and a general
slowdown compared to the fused version.
As an external benchmark, for SAM this seems to improve our e2e image encoder times by 3-5% depending on
batch size and reduces memory usage by 20%.
Test Plan: python test/inductor/test_pattern_matcher.py -k
"int_mm_mul"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111413
Approved by: https://github.com/jansel
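A minimal sketch of the pattern in question; `torch._int_mm` is the int8 matmul that produces the int32 accumulator that the subsequent mul and downcast consume (shapes, names, and the scale tensor below are illustrative, and `_int_mm` requires CUDA):
```python
import torch

# Sketch of the int_mm -> mul -> to(dtype) pattern this PR teaches inductor
# to handle: only the small-dtype result needs to be realized.
@torch.compile
def quantized_mm(a_int8, b_int8, scale):
    acc = torch._int_mm(a_int8, b_int8)       # int32 accumulator
    return (acc * scale).to(torch.bfloat16)   # mul + to(dtype) epilogue

a = torch.randint(-128, 127, (64, 64), dtype=torch.int8, device="cuda")
b = torch.randint(-128, 127, (64, 64), dtype=torch.int8, device="cuda")
scale = torch.rand(64, device="cuda")
out = quantized_mm(a, b, scale)
```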
Summary:
This is a re-land of https://github.com/pytorch/pytorch/pull/111117 with
updates to our internal tests included.
This splits out changes from
https://github.com/pytorch/pytorch/pull/102625 to make things easier to
review.
This diff creates a `make_allocation()` method that extracts the logic
from `make_buffer_allocation()` while allowing us to allocate non-buffer
objects. In particular, we will use this to allocate memory pools during
memory planning.
This diff also includes a small optimization -- if the desired
allocation is contiguous, then we emit a call to `empty()` instead of
`empty_strided()` with its superfluous stride argument.
Test Plan: contbuild & OSS CI, see 9ce0ae836d
Differential Revision: D50429424
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111511
Approved by: https://github.com/jansel
Previously, we threw runtime_error exceptions with some string
operations upon failures. However, inlining such calls into
the main run function causes exponential compilation-time
behavior in the host compiler, which may spend
an hour running call-graph related passes for some large models.
This PR replaces the relevant code with non-inlined calls.
With this change, we reduced the compilation time from more than
an hour down to a couple of minutes for some large models.
Note that these non-inlined calls have little impact on the
model inference runtime, because they are on the error-handling
paths.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111787
Approved by: https://github.com/desertfire, https://github.com/jansel
- Pass `GIT_DEFAULT_BRANCH` and `TEST_CONFIG` as well.
- Unify `_mac-test.yml` and `_mac-test-mps.yml` further by passing runner type via the matrix and uploading results using the same pattern (before the change MacOS12 and MacOS13 results on PRs were overwritten)
- Add `Cleanup disk space` step to `_mac-test-mps.yml` job
Should fix the
```
Warning: Gathered no stats from artifacts for build env None build env and None test config. Using default build env and default test config instead.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111595
Approved by: https://github.com/atalman
Summary:
We implement [`torch.mean(input, dim, keepdim)`](https://pytorch.org/docs/stable/generated/torch.mean.html) for 2d to 4d tensors.
Since 0-dim tensors aren't supported yet, we only support `dim.size() < input.dim()` for now. We will support the following cases in future work:
- `dim.size() == input.dim()`
- `input.dim() == 1`
Test Plan:
```
[luwei@devbig984.prn1 /data/users/luwei/fbsource (970fcd90c)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*mean*"
Building: finished in 0.1 sec (100%) 339/339 jobs, 0/339 updated
Total time: 0.1 sec
BUILD SUCCEEDED
Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *mean*
[==========] Running 7 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 7 tests from VulkanAPITest
[ RUN ] VulkanAPITest.mean_invalid_inputs
[ OK ] VulkanAPITest.mean_invalid_inputs (46 ms)
[ RUN ] VulkanAPITest.mean_dim_2d
[ OK ] VulkanAPITest.mean_dim_2d (127 ms)
[ RUN ] VulkanAPITest.mean_dim_3d
[ OK ] VulkanAPITest.mean_dim_3d (103 ms)
[ RUN ] VulkanAPITest.mean_dim_4d
[ OK ] VulkanAPITest.mean_dim_4d (89 ms)
[ RUN ] VulkanAPITest.mean_dim_keepdim_2d
[ OK ] VulkanAPITest.mean_dim_keepdim_2d (66 ms)
[ RUN ] VulkanAPITest.mean_dim_keepdim_3d
[ OK ] VulkanAPITest.mean_dim_keepdim_3d (127 ms)
[ RUN ] VulkanAPITest.mean_dim_keepdim_4d
[ OK ] VulkanAPITest.mean_dim_keepdim_4d (4 ms)
[----------] 7 tests from VulkanAPITest (564 ms total)
[----------] Global test environment tear-down
[==========] 7 tests from 1 test suite ran. (564 ms total)
[ PASSED ] 7 tests.
```
Reviewed By: yipjustin
Differential Revision: D50312990
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111609
Approved by: https://github.com/yipjustin
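For illustration, the supported regime in plain PyTorch semantics (the Vulkan backend follows the same signature):
```python
import torch

# dim.size() < input.dim() is the supported case on Vulkan for now.
x = torch.rand(2, 3, 4)            # 3d input
y = x.mean(dim=[1], keepdim=True)  # ok: 1 < 3 -> shape (2, 1, 4)
z = x.mean(dim=[0, 2])             # ok: 2 < 3 -> shape (3,)
# x.mean(dim=[0, 1, 2]) would need dim.size() == input.dim(), not yet supported
```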
Summary: Use `TORCHDYNAMO_DISABLE=1` when running generated opcheck tests. Enable some `fbgemm::pack_segments` tests that errored out (with error `RuntimeError: expected int but got s0*s1**2`) because dynamo was being run in the opcheck tests.
Test Plan: `parsh -v --build-flags mode/dev-nosan //deeplearning/fbgemm/fbgemm_gpu:sparse_ops_test` then `run_tests("test_pack_segments")`
Differential Revision: D50508958
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111685
Approved by: https://github.com/zou3519
This PR adds support for sym_ite, which is useful for converting a SymBool to a SymInt in e.g. #109916. Internally, it uses sympy.Piecewise. We cannot use sympy.ITE because it expects the arguments and output to all be boolean, while we want to return a SymInt when converting a SymBool to a SymInt. So we use sympy.Piecewise to denote the symbolic relationship.
Note that this pr uses the range analysis for sympy.Piecewise implemented in https://github.com/pytorch/pytorch/blob/main/torch/utils/_sympy/value_ranges.py.
Test Plan:
See added test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111440
Approved by: https://github.com/ezyang
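A minimal sketch of the semantics, assuming the `torch.sym_ite` entry point behaves like a symbolic if-then-else as described here:
```python
import torch

# sym_ite(cond, then_val, else_val) evaluates like `then_val if cond else
# else_val`, but stays symbolic (via sympy.Piecewise) under tracing, so a
# SymBool can be turned into a SymInt without guarding on its value.
def bool_to_int(b):
    return torch.sym_ite(b, 1, 0)

print(bool_to_int(True))   # 1 (plain Python inputs take the eager path)
print(bool_to_int(False))  # 0
```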
Building with `USE_CUSTOM_DEBINFO=torch/csrc/Module.cpp python setup.py develop`, for example, will provide debug info only for this file.
This makes it possible to enable debug symbols very quickly from a non-debug build by doing a clean then develop (as long as you have ccache), and avoids the very large binaries that take a very long time to load in gdb.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111748
Approved by: https://github.com/drisspg, https://github.com/ezyang, https://github.com/malfet
Summary:
Some builds use -Wshadow and currently there is a compiler warning when building that file.
Code inspection shows that `torch::autograd::impl::get_view_autograd_meta` simply extracts information from the passed object, which is `const`. Therefore the returned views should be the same all the time, and we can fetch the view only once.
Test Plan:
CI
NOTE: please advise for a more comprehensive test plan.
Differential Revision: D50407625
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111476
Approved by: https://github.com/Skylion007, https://github.com/albanD
CUDAAllocatorConfig uses a size_t max_split_size and initializes it to std::numeric_limits<size_t>::max(), and then the value is assigned to max_split_size of DeviceStats, which is of type int64_t, so that the command
```
python3 -c "import torch;y=torch.empty(3,device='cuda');print(torch.cuda.memory_stats(0)['max_split_size'])"
```
returned -1.
After skimming through the code and reading the doc at https://pytorch.org/docs/stable/generated/torch.cuda.memory_stats.html, it is clear that negative values of max_split_size make no sense and we should use size_t instead. With the fix, the command returns std::numeric_limits<size_t>::max().
This issue was found in revert of #111137
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111555
Approved by: https://github.com/colesbury
This pull request addresses an inconsistency in the representation of the Hadamard product across PyTorch documentation. Currently, the notation varies among different modules:
- In `torch.nn.LSTM` documentation the Hadamard product is represented with $\odot$
- In `torch.nn.GRU` documentation the Hadamard product is represented with $*$
- In `torch.nn.LSTMCell` documentation the Hadamard product is represented with $*$
- In `torch.nn.GRUCell` documentation the Hadamard product is represented with $*$
- In `torch.ao.nn.quantized.dynamic.GRU` documentation the Hadamard product is represented with $*$
This PR proposes consistently representing the Hadamard product throughout the documentation to enhance clarity and align with established standards.
The notation $\odot$ will be uniformly adopted, following the convention in the [Deep Learning Book](https://www.deeplearningbook.org/contents/linear_algebra.html).
**Changes Made:**
- Modified `torch.nn.GRU` documentation to represent the Hadamard product with $\odot$
- Modified `torch.nn.LSTMCell` documentation to represent the Hadamard product with $\odot$
- Modified `torch.nn.GRUCell` documentation to represent the Hadamard product with $\odot$
- Modified `torch.ao.nn.quantized.dynamic.GRU` documentation to represent the Hadamard product with $\odot$
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111763
Approved by: https://github.com/albanD
Summary: Somehow "getitem" started to receive a Tensor starting from ads_ranking:996, which broke the SDD pipelining FX transformer. We need to skip Tensor nodes in annotation.
Test Plan:
N4326037
with ads_ranking kernel
# Before
ads_ranking:v996
{F1100009226}
# With this diff
{F1100009310}
Differential Revision: D49567615
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109937
Approved by: https://github.com/xush6528
This PR implements 2 things:
1. support the device agnostic stream and runtime APIs captured by the dynamo.
2. support the stream methods(include the event) captured by the dynamo.
Here are the details for the first:
Previously the stream captured in dynamo was tightly bound to CUDA. Here we implement a global singleton container named `StreamMethodContainer` for different backends to register their associated stream methods with dynamo. When the backend's package is imported, the stream operations can be registered directly by calling
```
device_stream_method = {'current_stream': method_1,
                        'create_stream_context': method_2,
                        'set_stream': method_3,
                        'set_stream_by_id': method_4}
torch._dynamo.stream.register_stream_method(device_name, device_stream_method)
```
Stream methods need to be passed to this API according to the precise semantics represented by the dict keys in `device_stream_method`. After registration, these methods can be used by dynamo to capture the stream operations in users' scripts, for example, getting the current stream or setting a specific stream. Additionally, the wrapped stream variable and the stream context variable are changed to be device-agnostic; the proxy functions of these variables are assigned the associated methods from the container. All of this is illustrated below.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108312
Approved by: https://github.com/jansel, https://github.com/jgong5
The major change in this PR is to consolidate the skipfiles.check rules, chiefly merging the original ```FILE_INLINELIST``` and ```SUBMOD_INLINELIST``` into the new ```MOD_INLINELIST``` and a legacy ```LEGACY_MOD_INLINELIST```.
Let's use the following example to illustrate what is the expected behavior for this force inline list:
fa995626a8/torch/_dynamo/skipfiles.py (L344-L369)
The handling logic is:
* If f2 is inlined, we will check both ```MOD_INLINELIST``` and ```LEGACY_MOD_INLINELIST``` to consult the force-inline rules for f3.
* If f2 is skipped, we will check ```LEGACY_MOD_INLINELIST``` only for the inline rules for f3.
The reason behind this design is: if f2 is skipped and we always trace all recursively called functions, we will go down to very low-level functions (e.g, ```super().__init__```), which causes graph breaks. We treat this as a signal that all functions f2 recursively calls should be skipped as well if f2 is skipped. This is also a feature that many PyTorch developers requested: they just want to skip all recursive functions if they mark the upper-level function as skipped.
For PyTorch developers, we should only use ```MOD_INLINELIST``` going forward. I think most of the modules in the ```LEGACY_MOD_INLINELIST``` are legacy things to workaround when we didn't have a good skip/inline API.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111451
Approved by: https://github.com/ezyang
TP style still has some regressions due to negative dim specifications;
fix it by allowing the DTensor API to handle negative dims and normalize them.
i.e. TP uses `Shard(-1)` and then tries to redistribute `Shard(1) -> Shard(-1)`. This should ideally be a no-op, but currently it runs a decompose-sharding phase that turns this transformation into `Shard(1) -> Replicate -> Shard(-1)`, which is wrong and triggers unnecessary allgathers
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111750
Approved by: https://github.com/rohan-varma
AOTAutograd's handling for resize_() isn't fully robust (and on top of that, functionalization can potentially give up and raise an error if the tensor you're resizing has outstanding views).
So given that, and given that resize_() is rare, I updated dynamo to graph break on resize_() instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111553
Approved by: https://github.com/ezyang
Improves perf of llama_v2 locally from 1.55 -> 1.57.
The initial heuristic is to lower to pointwise if the # of inputs is <= 4 and all the inputs are pointwise or cannot be memory-planned away, or if all the outputs are pointwise.
The perf run was +3% on inference. There are definitely instances where we should be lowering to foreach kernels, but it's less flexible for fusion. The motivating example was:
```
def rotate_half(x):
    """Rotates half the hidden dims of the input."""
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(q, k, cos, sin):
    iota = torch.ops.prims.iota.default(512, start = 0, step = 1, dtype = torch.int64, device = device(type='cuda', index=0), requires_grad = False)
    # File: /scratch/eellison/work/torchdynamo/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py:657, code: position_ids = position_ids.unsqueeze(0).view(-1, seq_length)
    unsqueeze = torch.ops.aten.unsqueeze.default(iota, 0)
    position_ids = torch.ops.aten.reshape.default(unsqueeze, [-1, 512]); unsqueeze = None
    # The first two dimensions of cos and sin are always 1, so we can `squeeze` them.
    cos = cos.squeeze(1).squeeze(0)  # [seq_len, dim]
    sin = sin.squeeze(1).squeeze(0)  # [seq_len, dim]
    cos = cos[position_ids].unsqueeze(1)  # [bs, 1, seq_len, dim]
    sin = sin[position_ids].unsqueeze(1)  # [bs, 1, seq_len, dim]
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed
```
Also not sure if I should be more worried about concatting reduction->pointwise inputs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111233
Approved by: https://github.com/Chillee
This is fixing following failure in the release branch:
```
cp: cannot create directory '/tmp/pytorch-release/2.1': No such file or directory
```
Link: https://github.com/pytorch/pytorch/actions/runs/6591657669/job/17910724990
cp will report that error if the parent directory (pytorch-release in this case) does not exist.
This works in main since `PT_RELEASE_NAME: pytorch-main`; however, for release it is `PT_RELEASE_NAME: pytorch-release/2.1`.
Test:
```
export tag_or_branch=release/2.1
tag_or_branch="${tag_or_branch//\//_}"
echo $tag_or_branch
release_2.1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111697
Approved by: https://github.com/huydhn, https://github.com/osalpekar
Alternative to #104487.
Several have chimed in that #104487 introduces a dependency from torch (c10d) to ATen, which is considered backward and messy. This alternative switches the dependency relationship at the cost of requiring graphs to potentially do some polling before the capture.
CC @huydhn @malfet @Aidyn-A @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110665
Approved by: https://github.com/kwen2501
Summary:
Cinder has two new opcodes which optimize `super()` in classes. This implements
the opcodes for `torch._dynamo`.
Test Plan:
```
buck2 test mode/opt-split-dwarf aps_models/ads/icvr/... -c fbcode.use_cinder_fast_test=true
```
Differential Revision: D50516475
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111707
Approved by: https://github.com/jansel
Summary:
Previously we did not really support this; this PR adds the support.
Next:
* clean up the insert-observer logic
* add an allow_transitive_sharing boolean flag to allow people to turn this off for certain edges
Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_shared_qspec_transitivity
Differential Revision: [D50250789](https://our.internmc.facebook.com/intern/diff/D50250789)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111172
Approved by: https://github.com/kimishpatel
Fixes https://github.com/pytorch/pytorch/issues/111031
The current design of autograd.Function tracing in dynamo is that we:
1) speculate fwd, and if it's fine,
2) speculate bwd, and if it's fine,
3) install the .apply in the graph alongside fwd guards.
The mechanism for doing so involves creating HOPs for fwd, bwd, and apply. The speculation for fwd and bwd each creates its own subtracer. This is fine, until a proxy created in fwd is used in bwd.
For a simple example, consider:
```
class Foo(Function):
    @staticmethod
    def forward(ctx, x):
        ctx.x0 = x.size(0)
        return x * 2

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out * ctx.x0
```
the value stored at `x0` is a proxy - but it is a proxy belonging to the fwd speculation subtracer. Rather than teaching it to the subtracer for bwd, we choose to create a subtracer that covers both fwd and bwd speculation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111588
Approved by: https://github.com/zou3519
Summary:
The existing implementation of `aten::sum.dim_IntList` depends on the following steps:
- store the items of the argument `opt_dim` in a `std::set<int64_t> dims_set;`
- iterate through `dims_set` in reverse order (i.e. from largest to smallest) and compute the sum over one designated dim in `sum_dim`
But when `opt_dim` contains negative items and `keepdim==false`, the dimension iteration over the set gets messed up. For example, the existing implementation fails on the test case `test_sum_dim({10, 7, 5}, {-1, -2});`.
We fix the issue by invoking `int64_t dim_normalized = utils::normalize(d, self.dim());` to get the normalized dim in the range [0, `self.dim()` - 1].
Moreover, the existing TORCH_CHECK of the condition
```
d >= -self.dim() - 1 && d <= self.dim()
```
is wrong and fixed by
```
d >= -self.dim() && d < self.dim()
```
Test Plan:
```
[luwei@devbig984.prn1 /data/users/luwei/fbsource (04b08a835)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*sum*"
Building: finished in 0.1 sec (100%) 339/339 jobs, 0/339 updated
Total time: 0.2 sec
BUILD SUCCEEDED
Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *sum*
[==========] Running 8 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 8 tests from VulkanAPITest
[ RUN ] VulkanAPITest.cumsum
[ OK ] VulkanAPITest.cumsum (105 ms)
[ RUN ] VulkanAPITest.sum_invalid_inputs
[ OK ] VulkanAPITest.sum_invalid_inputs (0 ms)
[ RUN ] VulkanAPITest.sum_dim_2d
[ OK ] VulkanAPITest.sum_dim_2d (145 ms)
[ RUN ] VulkanAPITest.sum_dim_3d
[ OK ] VulkanAPITest.sum_dim_3d (91 ms)
[ RUN ] VulkanAPITest.sum_dim_4d
[ OK ] VulkanAPITest.sum_dim_4d (89 ms)
[ RUN ] VulkanAPITest.sum_dim_keepdim_2d
[ OK ] VulkanAPITest.sum_dim_keepdim_2d (63 ms)
[ RUN ] VulkanAPITest.sum_dim_keepdim_3d
[ OK ] VulkanAPITest.sum_dim_keepdim_3d (135 ms)
[ RUN ] VulkanAPITest.sum_dim_keepdim_4d
[ OK ] VulkanAPITest.sum_dim_keepdim_4d (4 ms)
[----------] 8 tests from VulkanAPITest (637 ms total)
[----------] Global test environment tear-down
[==========] 8 tests from 1 test suite ran. (637 ms total)
[ PASSED ] 8 tests.
```
Reviewed By: yipjustin
Differential Revision: D50442152
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111586
Approved by: https://github.com/yipjustin
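A quick Python sketch of the dim normalization that `utils::normalize` performs:
```python
# Map a possibly-negative dim d into [0, ndim - 1], as the fix does.
def normalize(d, ndim):
    return d % ndim

assert normalize(-1, 3) == 2   # {-1, -2} on a 3d tensor -> {2, 1}
assert normalize(-2, 3) == 1
```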
We are refactoring the parallel style to accomplish the following:
1. Further simplify the code logic to make it more readable for users.
2. Remove the tuple check so that we can work with dynamo for now. Ideally dynamo needs to support this case, and we will fix it in parallel.
3. Add tests for the newly added parallel style in UT and torch compile tests so that we can catch regressions due to code changes.
4. Move the placements early-return check into DTensor since it is bypassed by dynamo.
5. Remove PairwiseParallelStyle from unit tests in favor of the new Col/Rowwise parallel styles.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111625
Approved by: https://github.com/wanchaol
Summary:
0-size tensors are allowed in PyTorch (e.g. a tensor with size {2, 1, 0}). However, this currently causes issues with PyTorch Vulkan as the Vulkan API would raise an error when attempting to allocate a resource with no memory.
This diff fixes the behaviour by adding support for `VulkanImage` and `VulkanBuffer` objects that do not have any associated memory.
Test Plan:
Tested locally with `vulkan_api_test` on Mac as a sanity test.
```
buck run //xplat/caffe2:pt_vulkan_api_test_bin --target-platforms ovr_config//platform/macos:x86_64-fbsource -- --gtest_filter="*"
```
But given how foundational a change this is, more extensive testing should be done to be safe.
Reviewed By: yipjustin
Differential Revision: D50030659
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111512
Approved by: https://github.com/yipjustin
Adding a Tensor overload will allow us to:
- optimize in more cases than before
- increase coverage for scalarTensor instead of just scalars in our foreach APIs
The main complication in this PR was that add.Tensor has a scalar overload, so I've now built out support for that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111079
Approved by: https://github.com/albanD
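A minimal sketch of what the new overload enables, assuming the `_foreach_add` Tensor overload described above:
```python
import torch

# Previously only Python scalars were handled; the Tensor overload lets a
# 0-dim scalar tensor be passed directly to the foreach API.
xs = [torch.ones(3) for _ in range(4)]
alpha = torch.tensor(2.0)             # scalar tensor, not a Python float
out = torch._foreach_add(xs, alpha)   # each element becomes 3.0
```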
Improves llama_v2 perf locally from 1.48x -> 1.55x.
A good future rewrite would be to unify the freezing batching with the other batching rules that @yanboliang & co were working on. I want to wait for the forthcoming pre-dispatch changes to settle down first.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111232
Approved by: https://github.com/Chillee
Did some easy fixes from enabling TRY200. Most of these seem like oversights instead of intentional. The proper way to silence intentional errors is with `from None` to note that you thought about whether it should contain the cause and decided against it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111496
Approved by: https://github.com/malfet
Fixes#111422
allreduce_sparse_cuda gets dispatched to allreduce_sparse, which doesn't exist for gloo. However, gloo has an existing implementation, so this just fixes the dispatching to it.
The reason CI didn't catch this is that we were calling the backend directly. Added a test which calls the public API (dist.XYZ) and goes through the dispatcher.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111485
Approved by: https://github.com/fduwjj
This PR optimizes cases like layer_norm + fp8 quant (which includes amax and fp8 quant) fusion when amax is split into multiple reduction kernels.
Benchmark:
```
python test/inductor/test_fp8.py -k test_layernorm_fp8_quant_benchmark
Before this PR:
Config: float8_dtype=torch.float8_e5m2, shape=(4, 2048, 4096).
Benchmark results: Inductor: 0.13262102689486555ms, Eager: 0.8211962616822429ms, LN only Inductor: 0.09606276150627614ms.
After this PR:
Config: float8_dtype=torch.float8_e5m2, shape=(4, 2048, 4096).
Benchmark results: Inductor: 0.08281274131274131ms, Eager: 0.8217452830188678ms, LN only Inductor: 0.09586902286902287ms.
```
LN + fp8 quant is even faster than LN itself. The reason could be that LN + fp8 outputs fp8 while LN outputs fp16.
From the Inductor nightly benchmark test:
There are perf differences in the cuda_graph / cuda_graph_dynamic / default runs, but no difference in inductor_max_autotune. So the perf differences are most likely fluctuations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111122
Approved by: https://github.com/jansel
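A hedged sketch of the fused pattern being benchmarked above (the function name and scaling recipe are illustrative, not the test's exact code, and fp8 support depends on hardware/backend):
```python
import torch

# layer_norm followed by an amax reduction and fp8 quantization; this PR lets
# inductor fuse the split amax reduction kernels with the LN epilogue.
@torch.compile
def ln_fp8(x, w, b):
    y = torch.nn.functional.layer_norm(x, x.shape[-1:], w, b)
    amax = y.abs().amax()
    scale = torch.finfo(torch.float8_e5m2).max / amax
    return (y * scale).to(torch.float8_e5m2), amax
```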
This is code related to parallelism and test times that isn't used, so remove it.
Tested by running locally with `python3 -m tools.stats.upload_test_stats --workflow-run-id 6551035874 --workflow-run-attempt 1 --head-branch main --head-repository "pytorch/pytorch"` and commenting out parts for uploading to s3.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111504
Approved by: https://github.com/huydhn
Currently, if the input matrices (small batches) to `matrix_exp` contain NaN values, it keeps outputting a "normal" result without any NaN values in it, which can cause problems we may not notice. This PR prevents such undefined behavior by bringing back those NaN values.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111539
Approved by: https://github.com/lezcano
Fixes#111499
This PR slightly alters the new fused `all_gather` `optim_state_dict` implementation to support optimizers without tensor state (e.g. SGD) in a `use_orig_params=True` context.
The principle change is to short-circuit `_allgather_orig_param_states` if an empty `state_buffers` dict is returned after completing `_convert_all_state_info` here:
93e5065ba0/torch/distributed/fsdp/_optim_utils.py (L1481-L1484)
To allow `_convert_all_state_info` to accommodate optimizers with no tensor state, I also change the scope of `dtype` and make the return type `Optional`.
As discussed in the issue this PR fixes, I'm [extending](93e5065ba0/test/distributed/fsdp/test_fsdp_optim_state.py (L1915I)) `test_state_dict_with_none_tensor_state` to test with both Adam and SGD optimizers to validate scalar and non-tensor states continue to be restored for both optimizer types.
Thanks to the distributed team as always for their adroit design and exceptionally valuable contributions to the open source ML community. Hope you all feel appreciated commensurate with the compounding progress your work enables.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111501
Approved by: https://github.com/fegin
The vectorized implementation in https://github.com/pytorch/pytorch/pull/111021 changed the order of arithmetic instructions in `layer_norm_grad_input`, causing non bitwise identical results when compared to the non-vectorized implementation. At merging, all accuracy checks passed, including internal inductor ones.
There are CI periodic inductor dynamo tests (e.g. `pit_b_224`) that run eager mode models several times and compare results. If the input buffers are aligned to the vector length, the vectorized implementation will be used. If not, the default one will be used. If the 2 eager runs end up having different buffer alignments, 2 implementations will be called and then the results would be very close but not bitwise identical. The tests check for bitwise identical results and in some cases they may fail.
This fix makes sure that the operation order between non-vectorized and vectorized is the same and the 2 implementations **should** produce bitwise identical results.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111488
Approved by: https://github.com/malfet
Something is broken with automatic slow detection, so let's do it manually
Those tests were previously classified as slow, see:
```
test_decomp.py::TestDecompCUDA::test_quick_core_backward_baddbmm_cuda_float64 SKIPPED [0.0003s] (test is slow; run with PYTORCH_TEST_WITH_SLOW to enable test) [ 53%]
test_decomp.py::TestDecompCUDA::test_quick_core_backward_clamp_max_cuda_float64 SKIPPED [0.0002s] (test is slow; run with PYTORCH_TEST_WITH_SLOW to enable test) [ 53%]
test_decomp.py::TestDecompCUDA::test_quick_core_backward_clamp_min_cuda_float64 SKIPPED [0.0002s] (test is slow; run with PYTORCH_TEST_WITH_SLOW to enable test) [ 53%]
```
from https://ossci-raw-job-status.s3.amazonaws.com/log/17792633247
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111524
Approved by: https://github.com/kit1980, https://github.com/izaitsevfb, https://github.com/huydhn
For synchronous ops (i.e. `asyncOp = False`), we don't want to record streams because we know that the NCCL stream will join back to the "current" stream right after this op. So we might just as well keep the stream ownership of the input/output tensors unchanged. The benefit would be that the allocation/free of the tensors would look deterministic to the "current" stream so that the caching allocator can reuse memory pool for this stream in a clever way.
To prevent the input/output tensors from being recycled by python, we rely on the stashing mechanism in ProcessGroupNCCL (which can be also turned on by setting `TORCH_NCCL_AVOID_RECORD_STREAMS=1`).
This mechanism change is for libraries like FSDP which uses `all_gather_into_tensor` and `reduce_scatter_tensor` in a synchronous way and which cannot set `TORCH_NCCL_AVOID_RECORD_STREAMS=1` for their users. And therefore, this change is limited to these two collectives for now.
Cc: @awgu @janeyx99 @albanD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111431
Approved by: https://github.com/H-Huang
We’ve been struggling to get the job id since 9/28/2023 12:03 pm. Before this we had almost 0 problems getting the job id, but after, we get a lot of `Recieved status code '502' when attempting to retrieve https://api.github.com/repos/pytorch/pytorch/actions/runs/6551579728/jobs?per_page=100:\n", 'Bad Gateway\n\nheaders=Server: GitHub.com\nDate: Tue, 17 Oct 2023 20:32:52 GMT\nContent-Type: application/json\nContent-Length: 32\nETag: "652eed15-20"\nVary: Accept-Encoding, Accept, X-Requested-With\nX-GitHub-Request-Id: EC62:7EE0:166AAF5:2D51A8E:652EEF6A\nconnection: close\n\n` ex https://github.com/pytorch/pytorch/actions/runs/6551579728/job/17793898278#step:18:22
Recently, it has been happening around 1/4 of the time, possibly more. I think this happens almost only on linux.
I believe this is somehow caused by a test, since distributed tests seems to be disproportionately affected, so I move the step to get the job id before the test step. This also has the benefit of the test step being able to get the job id now if we want it.
Regardless of whether this works or not, it's a pretty harmless change that might make things easier in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111483
Approved by: https://github.com/huydhn
As it's painfully slow (10+ min on A100):
```shell
$ time python3 test_decomp.py -v -k test_quick_core_backward_baddbmm_cuda_float64
Fail to import hypothesis in common_utils, tests are not derandomized
test_quick_core_backward_baddbmm_cuda_float64 (__main__.TestDecompCUDA) ... ok
----------------------------------------------------------------------
Ran 1 test in 897.523s
OK
real 15m4.773s
user 15m0.207s
sys 0m6.492s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111493
Approved by: https://github.com/clee2000, https://github.com/huydhn
Summary: In AOTInductor ABI Compatible-mode, we don't serialize missing args with default value.
Test Plan: buck2 run mode/dev-nosan deeplearning/aot_inductor/test:test_custom_ops
Differential Revision: D50345729
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111425
Approved by: https://github.com/angelayi
### <samp>🤖 Generated by Copilot at 21e87dc</samp>
> _We're sailing on the sea of code, with warnings to avoid_
> _We use the `C10_UNUSED` macro for variables unexploited_
> _We heave and ho and pull and push, and make the code more neat_
> _We sing this shanty as we go, to keep us in good spirits_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111439
Approved by: https://github.com/huydhn
`test_typing.py` was written to use `pytest` in https://github.com/pytorch/pytorch/pull/54234 which unfortunately rendered it incompatible with run_test.py, and therefore it was not running in CI all this time.
In this PR, the same functionality is rewritten using the unittest framework and `parametrize` from `torch.testing._internal._common_utils`.
Validated `test_typing.py` with ufmt.
Disabled `fail/bitwise_ops.py` and `pass/jit.py` as they regressed at some point, as well as one of the examples in `namedtuple.py`, since the `torch.linalg.qr` type is no longer revealed correctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111428
Approved by: https://github.com/clee2000
Removes the existing integration code & build of nvfuser in TorchScript.
Note that I intentionally left out the part where we wipe out the `third_party/nvfuser` repo; I'll do that in a separate PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111093
Approved by: https://github.com/albanD
Summary:
As titled; only tensor-scalar.
This diff does not include the element-wise tensor-tensor operation.
Test Plan:
```
[yipjustin@33167.od ~/fbsource (9cfca7c97)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*floor_divide_scalar*"
Watchman fresh instance: new mergebase, cleared graph state, cleared dep files
Buck UI: https://www.internalfb.com/buck2/bcac40be-79af-47c5-bd3f-95c11179aa68
Network: Up: 29MiB Down: 264MiB (reSessionID-2fef8b89-76b0-4496-bb27-b10d42cf7ef4)
Jobs completed: 5196. Time elapsed: 45.9s.
Cache hits: 81%. Commands: 2070 (cached: 1672, remote: 375, local: 23)
BUILD SUCCEEDED
Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *floor_divide_scalar*
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from VulkanAPITest
[ RUN ] VulkanAPITest.floor_divide_scalar
[ OK ] VulkanAPITest.floor_divide_scalar (150 ms)
[ RUN ] VulkanAPITest.floor_divide_scalar_inplace
[ OK ] VulkanAPITest.floor_divide_scalar_inplace (39 ms)
[----------] 2 tests from VulkanAPITest (189 ms total)
[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (190 ms total)
[ PASSED ] 2 tests.
```
Differential Revision: D50001740
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110785
Approved by: https://github.com/SS-JIA
This PR has a number of changes that improve subclass support for AOTAutograd/Inductor in general:
- Previously, if a subclass did extra aliasing between graph outputs/inputs, the partitioner would complain because grad_outputs are the outputs reused as-is. Now we do a view_as(self) to work around this.
- Use dense -> dense metadata when working with fwd_output_strides during backward. This is important since the stride information comes from inductor which sees the dense to dense graph.
- Inductor requires that the inputs to the compiled backward to match some expected strides computed during compilation. We make sure to make the inner tensors of the subclass contiguous (previously, we only made the subclass itself contiguous)
Changes specific to NestedTensor relevant to compilation:
- Properly handle the case where `__tensor_unflatten__` is passed non-symbolic dense tensors with meta extracted from fake subclasses.
- Skip var_to_range logic for singleton int
- Skip size hint logic in inductor for singleton int
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110529
Approved by: https://github.com/bdhirsh
Summary:
_wrapped_fns_to_patch points to f_globals which might change during iteration due to factors like lazy imports. This diff fixes potential runtime errors like:
```
RuntimeError: dictionary changed size during iteration
```
Test Plan: CI
Reviewed By: Kronuz
Differential Revision: D50283983
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111267
Approved by: https://github.com/yanboliang
When the model returns a constant, we cannot "release" its handle,
because the constant doesn't have any handle at all. Instead,
we should allocate a new tensor and then return a copy of the constant.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111356
Approved by: https://github.com/hl475
This splits out changes from
https://github.com/pytorch/pytorch/pull/102625 to make things easier to
review.
This diff creates a `make_allocation()` method that extracts the logic
from `make_buffer_allocation()` while allowing us to allocate non-buffer
objects. In particular, we will use this to allocate memory pools during
memory planning.
This diff also includes a small optimization -- if the desired
allocation is contiguous, then we emit a call to `empty()` instead of
`empty_strided()` with its superfluous stride argument.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111117
Approved by: https://github.com/jansel
`state_dict` is a very common variable name people use to represent a local
state_dict and `load_state_dict` conflicts with DCP's `load_state_dict`.
This PR changes `state_dict` to `get_state_dict`. `get_state_dict` is closer to what this API does -- users use the API to get the current state_dict for saving or for loading (passed to DCP for loading in-place).
This PR also changes `load_state_dict` to `set_state_dict`. `set_state_dict` is less ideal compared to `get_state_dict` but is symmetric. We can still change the API name before it goes to beta.
This PR also simplifies the API signatures. `model_only` is removed and `optim_only` only exists for `get_state_dict`.
Differential Revision: [D50213931](https://our.internmc.facebook.com/intern/diff/D50213931/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111120
Approved by: https://github.com/wz337
ghstack dependencies: #111106, #111107, #111275, #111109, #111110
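A hedged usage sketch of the renamed APIs, assuming they are exposed from `torch.distributed.checkpoint.state_dict` with the (model, optimizers) calling convention:
```python
import torch
from torch.distributed.checkpoint.state_dict import get_state_dict, set_state_dict

model = torch.nn.Linear(4, 4)
optim = torch.optim.Adam(model.parameters())

# Grab a saveable snapshot of model + optimizer state...
model_sd, optim_sd = get_state_dict(model, optim)

# ...and later restore both in-place from the snapshot.
set_state_dict(model, optim, model_state_dict=model_sd, optim_state_dict=optim_sd)
```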
Re-land of https://github.com/pytorch/pytorch/pull/111011.
The original PR ended up having a bad interaction with code that tried to run `torch.compile` under `with torch.inference_mode`, which caused some internal tests to fail.
The issue was that:
(1) AOTInductor invokes the pattern matcher passes in inductor
(2) The pattern matcher registers some code with [training_graph](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/fx_passes/pad_mm.py#L461)
(3) The `training_graph` function expects to be able to set the global autograd state to `requires_grad`, and always get out a joint graph (assertion [here](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/pattern_matcher.py#L1196)).
(4) However, when inference_mode is activated, and you try to run AOTAutograd, AOTAutograd will witness that all outputs to the traced function will not require grad, and (now correctly) think that we are tracing an inference graph, which fails the above assert.
After talking to Bin, it sounds like these training-only patterns aren't necessary when we know we are compiling an inference graph (which should always be the case if you're running torch.compile with inference_mode). So I updated the pattern matcher to ignore any pattern matches using `training_graph`, when inference_mode is enabled.
This reverts commit cf6b1cdf6ac74d375b0787bd8f9463cb3a53b0e5.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111347
Approved by: https://github.com/Chillee
Summary:
Original commit changeset: 96813f0fac68
Original Phabricator Diff: D50161780
This breaks the integration test on T166457344
Test Plan: Sandcastle.
Differential Revision: D50344243
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111401
Approved by: https://github.com/izaitsevfb
Summary:
This PR adds in torch.compile support for semi-structured sparsity,
using the subclass tracing @bdhirsh added.
Based on whether we are using cuSPARSELt or CUTLASS, we return a
different representation of the inner tensors.
Test Plan:
```
python test/test_sparse_semi_structured.py -k compile
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111049
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #110583
Fixes part 1 of https://github.com/pytorch/pytorch/issues/111370#issuecomment-1764730773
While at it, add a test for numpy ndarray `.size` attribute. This started as an attempt to remove the delegation of what looks like a `.size()` method --- which does not exist in numpy --- on the same line this patch adds a `tolist` to.
But this is apparently needed for something else and existing tests start failing. Thus, declare it as _ain't broken don't fix_, and only keep the test. Can remove the test if wanted though.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111382
Approved by: https://github.com/lezcano
We add a new overload to torch.library.impl that accepts an optional
Library arg. If provided, the lifetime of the registration will be
tied to the Library arg, otherwise, it will live forever.
Test Plan:
- existing and new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111308
Approved by: https://github.com/soulitzer
ghstack dependencies: #111307
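A minimal sketch of the new overload, using a hypothetical `mylib::foo` operator; when `lib` is passed, the registration's lifetime is tied to that Library object:
```python
import torch
from torch.library import Library, impl

lib = Library("mylib", "FRAGMENT")   # registrations live as long as `lib` does
lib.define("foo(Tensor x) -> Tensor")

@impl("mylib::foo", "CPU", lib=lib)  # lifetime tied to `lib`; omit to live forever
def foo_cpu(x):
    return x.clone()

print(torch.ops.mylib.foo(torch.ones(2)))
```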
Summary: Add linear quantize for vulkan to custom ops so it can be used from a model.
Test Plan:
buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource -c pt.vulkan_full_precision=1
//xplat/caffe2/fb/custom_ops/vulkan_quantized:pt_vulkan_quantized_test_binAppleMac\#macosx-arm64
[ OK ] VulkanAPITest.convert_qconv2d_context (135 ms)
[ RUN ] VulkanAPITest.linear_2d
[ OK ] VulkanAPITest.linear_2d (4 ms)
[----------] 2 tests from VulkanAPITest (139 ms total)
[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (139 ms total)
[ PASSED ] 2 tests.
##############################################################
buck2 build --target-platforms ovr_config//platform/macos:arm64-fbsource
//xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output"
buck-out//v2/gen/fbsource/xplat/caffe2/pt_vulkan_quantized_api_test_binAppleMac
[ OK ] VulkanAPITest.conv2d_pw_quantized_prepack_random_params_int8_int32 (11 ms)
[ RUN ] VulkanAPITest.linear_2d_flat
[ OK ] VulkanAPITest.linear_2d_flat (4 ms)
[ RUN ] VulkanAPITest.linear_2d_small
[ OK ] VulkanAPITest.linear_2d_small (1 ms)
[ RUN ] VulkanAPITest.linear_2d_large
[ OK ] VulkanAPITest.linear_2d_large (1 ms)
[ RUN ] VulkanAPITest.linear_3d_flat
[ OK ] VulkanAPITest.linear_3d_flat (2 ms)
[ RUN ] VulkanAPITest.linear_3d_small
[ OK ] VulkanAPITest.linear_3d_small (2 ms)
[ RUN ] VulkanAPITest.linear_3d_large
[ OK ] VulkanAPITest.linear_3d_large (1 ms)
[ RUN ] VulkanAPITest.linear_4d_flat
[ OK ] VulkanAPITest.linear_4d_flat (1 ms)
[ RUN ] VulkanAPITest.linear_4d_small
[ OK ] VulkanAPITest.linear_4d_small (1 ms)
[ RUN ] VulkanAPITest.linear_4d_large
[ OK ] VulkanAPITest.linear_4d_large (1 ms)
[ RUN ] VulkanAPITest.linear_custom
[ OK ] VulkanAPITest.linear_custom (0 ms)
[----------] 76 tests from VulkanAPITest (1811 ms total)
[----------] Global test environment tear-down
[==========] 76 tests from 1 test suite ran. (1811 ms total)
[ PASSED ] 76 tests.
YOU HAVE 8 DISABLED TESTS
##############################################################
buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1
[----------] Global test environment tear-down
[==========] 346 tests from 1 test suite ran. (5648 ms total)
[ PASSED ] 345 tests.
[ SKIPPED ] 1 test, listed below:
[ SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
YOU HAVE 5 DISABLED TESTS
Reviewed By: manuelcandales
Differential Revision: D49609985
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111148
Approved by: https://github.com/yipjustin
Fixes#110597
Summary:
* Generic code: The `torch._C.Value.node().mustBeNone()` check is encapsulated into the high-level API `JitScalarType.from_value`; `_is_none` was also extended to allow either `None` or `torch._C.Value.node.mustBeNone()`, so users don't manually call into the TorchScript API when implementing operators
* Specific to `new_zeros` (and `*_like` and `new_*` ops): when checking `dtype`, we must always use `_is_none`, which calls the check proposed by #110935
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110956
Approved by: https://github.com/justinchuby, https://github.com/BowenBao
This is kind of hard to test, but I can try to add a test case if requested.
I noticed locally that we now end up logging to the ProxyTensorMode and FakeTensorMode `not_implemented` logs in very simple compile examples: https://github.com/pytorch/pytorch/blob/main/torch/fx/experimental/proxy_tensor.py#L269
It was because `_mirror_autograd_meta_to()` indirectly queries sizes, and since modes have higher priority than subclasses, `aten::sym_sizes()` was getting dispatched to our modes before going to `FunctionalTensor.__torch_dispatch__`.
This works out fine (they return NotImplemented and we eventually get to `FunctionalTensor`) but I figured we want to avoid cluttering up the logs. So I wrapped the calls with `FunctionalTensorMode`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111040
Approved by: https://github.com/ezyang
This PR switches matrix ops to generate sharding strategies, and
with the cost selection algorithm introduced in the previous PR we are
able to enable these and more ops to leverage strategy-based sharding
propagation.
This also fixes a bunch of corner cases that the existing propagation does
not cover, resulting in full coverage for baddbmm.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110717
Approved by: https://github.com/fduwjj
ghstack dependencies: #109145
Previously we were generating a graph to add runtime assertions on inputs and then running that graph to check input constraints. This PR checks input constraints directly.
Differential Revision: D50289970
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111262
Approved by: https://github.com/zhxchen17
Fixes#109604
Resubmit gh-109715 + several skips and small fixes to make tests pass.
The main fix here is by @ysiraichi : previously, dynamo did not resume tracing numpy ndarrays after a graph break.
While at it, fix several small issues Yukio's fix uncovers:
- graph break gracefully on numpy dtypes which do not map to torch.dtypes (uint16 etc)
- recognize array scalars in dynamo, treat them as 0D ndarrays
- make sure that iterating over torch.ndarray generates arrays not bare tensors
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110512
Approved by: https://github.com/lezcano
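A small sketch of the behavior being fixed; the print forces a graph break, and tracing now resumes on the ndarray afterwards instead of falling back to eager for the rest of the function:
```python
import numpy as np
import torch

@torch.compile
def f(x):
    a = np.sin(x)           # traced via torch.ndarray
    print("graph break")    # graph break; tracing used to stop here for ndarrays
    return np.cos(a)        # now traced again after the break

print(f(np.zeros(3)))
```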
As part of TP UX improvements, we want to keep our API simple (not easy) so that users get the flexibility to do what they want and avoid a too generic API which tries to solve everything and get things too complicated. We are updating the doc accordingly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111176
Approved by: https://github.com/wanchaol
ghstack dependencies: #111160, #111166
# Problem
When sharding_strategy is set to SHARD_GRAD_OP and forward_prefetch is turned on, validation after training sees an incorrect weight shape.
<img width="1508" alt="image" src="https://github.com/pytorch/pytorch/assets/41232043/57a9c3bb-cb5c-46df-ac26-922740686f9e">
# Analyze
When using `SHARD_GRAD_OP`, the `free_unsharded_flat_param` in `_post_forward_reshard` is often False, so it does not set the handle's `_prefetched` flag to False after the forward.
The normal train phase sets this flag to False in `_post_backward_final_callback`; the validation phase doesn't execute that hook, so after the first iter of validation is done, the handle's `_prefetched` flag remains True.
This causes the handle to skip `_unshard` in the next `_pre_forward_unshard`, and `_prefetch_handle` will not do a prefetch, which results in an incorrect weight shape.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110139
Approved by: https://github.com/awgu
The scalar overloads of some ops like `bitwise_xor.Scalar` were dispatched to `CompositeImplicitAutograd` by default. It is against the rule for `CompositeImplicitAutograd` that all tensor operations (except reading metadata) must be done through calls to the ATen dispatcher rather than interacting with the Tensor directly.
So, this PR updates the dispatch of these overloads to `CompositeExplicitAutograd`.
Fixes#93224
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110918
Approved by: https://github.com/peterbell10
In some use cases, we found that users might want to annotate the input/output DTensor layouts for the parent module rather than for the submodule whose parameters are to be distributed. These two classes let users annotate input/output DTensor layouts so that we register pre-FWD/FWD hooks for the TP-lized module.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111166
Approved by: https://github.com/wanchaol
ghstack dependencies: #111160
One thing we find challenging for users is that we don't want to expose the concepts of prepare_input and prepare_out, since there are so many function names for users to select from, which is quite confusing. On the other hand, colwise and rowwise parallel always need the input(out) and output(in) to have a certain layout, so we can simplify the logic here and make it more usable.
So we added three public attributes to the ParallelStyle, and the code logic looks like:
```python
class ParallelStyle(ABC):
    """
    The parallel style the user wants the module or submodule to be parallelized with.
    We can add more in the future, but this seems sufficient for immediate needs. Users can extend this class to build their own parallel style with customized input/output preparations.
    """
    input_layouts: Union[placement, Tuple[placement]]
    output_layouts: Union[placement, Tuple[placement]]
    use_local: bool

class RowwiseParallel(ParallelStyle):
    """
    Partitioning the row of a module. We assume the input to be a sharded DTensor and output to be a replicated Tensor.
    """
    def __init__(self):
        super().__init__(input_layouts=Shard(-1), output_layouts=Replicate(), use_local=True)

class ColwiseParallel(ParallelStyle):
    """
    Partitioning the column of a module. We assume the input to be a replicated DTensor and output to be a sharded DTensor.
    """
    def __init__(self):
        super().__init__(input_layouts=Replicate(), output_layouts=Shard(-1), use_local=True)

# For the case of sequence parallel, users just set a different input shard, Shard(0) or Shard(1), instead of Replicate()

class PrepareModuleInput(ParallelStyle):
    """
    Only used to specify the input distribute spec for a module.
    """
    def __init__(self):
        super().__init__(input_layouts=Shard(0), output_layouts=Replicate(), use_local=False)

class PrepareModuleOutput(ParallelStyle):
    """
    Only used to specify the output distribute spec for a module.
    """
    def __init__(self):
        super().__init__(input_layouts=Replicate(), output_layouts=Shard(0), use_local=True)

parallelize_plan = {
    "embedding": ColwiseParallel(output_shard=Replicate()),
    "attn": PrepareModuleInput(),
    "attn.w1": ColwiseParallel(),
    "attn.w2": ColwiseParallel(),
    "attn.w3": ColwiseParallel(),
    "attn.wo": RowwiseParallel(),
}

parallelize_module(
    module=block,  # this can be a submodule or module
    device_mesh=mesh['tp'],
    parallelize_plan=parallelize_plan,
)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111160
Approved by: https://github.com/wanchaol
This PR implements intra-graph communication reordering pass on Inductor scheduler IR, based on Horace's previous PR #100762.
Main algorithm:
1. Greedily moves waits as late as possible (i.e. until we reach a use)
2. Greedily moves comms as early as possible (i.e. until we reach an input)
3. Move computes following simple heuristics to improve overlap.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108091
Approved by: https://github.com/Chillee, https://github.com/wanchaol
Summary: Unbacked SymInts can't get a `sizevars.size_hint` due to being data-dependent. #109893 has added a new `fallback` parameter to `sizevars.size_hint` to specify the fallback value in cases like unbacked SymInt. In this PR we add more of those.
Test Plan: CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110520
Approved by: https://github.com/jansel, https://github.com/ezyang
Summary:
Sometimes the backend compiler can encounter a transient failure (in
our case, a remote build service infrequently hits a hiccup). We'd rather run
eager than fail the training job.
Test Plan:
Inject an exception in the RE path and run:
```
buck2 run @//mode/{opt,inplace} //caffe2/test/inductor:smoke
```
Differential Revision: D50234516
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111153
Approved by: https://github.com/ezyang, https://github.com/jansel
Summary: Fall back to the old nodes when the meta val is missing.
Test Plan: buck2 run //executorch/examples/portable/scripts:export -- --model_name=emformer_predict
Differential Revision: D50278439
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111225
Approved by: https://github.com/larryliu0820
This PR relands #110022 but accounts for the changes in #110191. Also, the function for creating COW storages is called `lazy_clone_storage` in this PR, instead of `try_ensure`
NOTE: COW storages do not actually copy on write yet, they just have the COW deleter and deleter context applied to them
Part of #109833
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110192
Approved by: https://github.com/ezyang
Fixes #110982
https://github.com/pytorch/pytorch/pull/62257 deprecated the `torch.onnx.export(use_external_data_format: bool=...)` argument, but it seems the introduced `EncoderBase::GetGraphProtoSize` has a bug and doesn't detect models > 2GB when onnx Constant nodes are large (and responsible for the size overflow).
This PR adds the constant nodes to the total size of the model, along with initializers.
In Python, what we need to do is:
```python
import onnx
def compute_tensor_size(tensor):
    # Compute the size of the tensor based on its shape and data type
    size = tensor.size * tensor.itemsize
    return size

def sum_constant_and_initializer_sizes(model_path):
    # Load the ONNX model
    model = onnx.load(model_path)

    total_size = 0
    initializer_size = 0
    constant_size = 0

    # Compute the size of constant nodes
    for node in model.graph.node:
        if node.op_type == 'Constant':
            constant_value = node.attribute[0].t
            # Convert constant value to numpy array
            constant_array = onnx.numpy_helper.to_array(constant_value)
            # Compute the size of the constant tensor
            tensor_size = compute_tensor_size(constant_array)
            total_size += tensor_size
            constant_size += tensor_size

    # Compute the size of initializer nodes that are not graph inputs
    for initializer in model.graph.initializer:
        if initializer.name not in [input.name for input in model.graph.input]:
            # Convert the shape and data type information to calculate size
            # tensor = onnx.helper.tensor_value_info_to_tensor(input)
            tensor = onnx.numpy_helper.to_array(initializer)
            tensor_size = compute_tensor_size(tensor)
            total_size += tensor_size
            initializer_size += tensor_size

    return total_size, constant_size, initializer_size

model_path = '/path/to/model.onnx'
total_size, constant_size, initializer_size = sum_constant_and_initializer_sizes(model_path)
print("Total size of constant nodes in bytes:", constant_size)
print("Total size of initializer nodes (excluding graph inputs) in bytes:", initializer_size)
print("Total size of constant and initializer nodes (excluding graph inputs) in bytes:", total_size)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111097
Approved by: https://github.com/justinchuby, https://github.com/zhipenghan
The CUDA architecture flags from TORCH_CUDA_ARCH_LIST are skipped if TORCH_EXTENSION_NAME includes the substring "arch". A C++ extension should be allowed to have any name, so this change simply skips the TORCH_EXTENSION_NAME flag when checking whether one of the flags specifies an "arch". There is probably a better fix, but I'll leave that to the experts.
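A rough sketch of the workaround (hypothetical helper name; the real logic lives in `torch.utils.cpp_extension`):
```python
# Sketch: ignore the -DTORCH_EXTENSION_NAME=... define when checking whether
# the user already passed an explicit "arch" flag to nvcc.
def user_specified_arch(cflags, extension_name):
    name_define = f"-DTORCH_EXTENSION_NAME={extension_name}"
    return any("arch" in flag for flag in cflags if flag != name_define)
```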
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111211
Approved by: https://github.com/ezyang
Summary: When doing quantization, int_mm -> mul or int_mm -> mul ->
to(dtype) is an extremely common op pattern which is currently not
handled well by inductor. Ideally, since the output of
int_mm has dtype int32, we'd prefer to only realize a smaller dtype like
bf16 or float16. Currently inductor doesn't have a way to force this; in
many cases the mul gets fused with a bunch of subsequent pointwise
ops from the dequant, creating an increase in memory overhead and a general
slowdown compared to the fused version.
Theoretically, with better control of / smarter inductor fusion, this could be something we get for free, at which point these changes can be removed.
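For reference, a minimal sketch of the pattern in question (`torch._int_mm` multiplies int8 matrices into an int32 result, subject to backend shape constraints; the fusion handling itself is inductor-internal):
```python
import torch

def quantized_mm(a_int8, b_int8, scale):
    out_int32 = torch._int_mm(a_int8, b_int8)  # int32 accumulator output
    # The mul dequantizes; ideally only the small bf16 result is realized.
    return (out_int32 * scale).to(torch.bfloat16)
```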
Test Plan: python test/inductor/test_pattern_matcher.py -k
"int_mm_mul"
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111125
Approved by: https://github.com/jansel, https://github.com/cpuhrsch
Fixes #110549
We currently have a circular import between dynamo and einops as described in the issue.
This works around the issue by adding a mechanism to register initialization callbacks
that are called the first time an object is seen from that particular module.
This means that dynamo will only import `einops` after it's already fully initialized
and being called in a function being traced.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110575
Approved by: https://github.com/jansel
ghstack dependencies: #110990
`sys.modules` is currently treated as a constant dictionary and any reference to
it will result in guards on the full contents of `sys.modules`. This instead
adds a specialized variable tracker which tries to guard only on the modules
referenced by the code. e.g.
```
sys.modules["operator"].add(x, x)
```
will generate the guard
```
___dict_contains('operator', G['sys'].modules)
```
It does this with special support for `__contains__`, `__getitem__`, and `.get`,
which are probably the most commonly used with `sys.modules`. For anything else
we just fall back to building the dict tracker as normal.
While accessing `sys.modules` may seem unusual, it actually comes up when
inlining the `warnings.catch_warnings` context manager which internally accesses
`sys.modules["warnings"]`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110990
Approved by: https://github.com/ezyang
Summary:
Currently we have shape constraints in semi-structured sparsity for both
CUTLASS and cuSPARSELt
These shape constraints unfortunately apply to both the dense and sparse
matrices in sparse-dense matmul.
This PR adds support for calling `F.pad` in order to pad dense
matrices to the right size with zeros and then pull out the
corresponding rows from the resulting matrix.
We also throw a warning in this case.
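A rough sketch of the padding idea, under assumed shape constraints (`sparse_mm` and `multiple` are illustrative names, not the actual implementation):
```python
import torch.nn.functional as F

def pad_dense_then_slice(dense, sparse_mm, multiple=8):
    rows = dense.shape[0]
    pad = (-rows) % multiple  # rows needed to reach the next multiple
    if pad:
        dense = F.pad(dense, (0, 0, 0, pad))  # zero-pad extra rows at the end
    out = sparse_mm(dense)
    return out[:rows]  # pull out only the rows that correspond to real input
```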
The tests have also been updated to take in a dense_input_shape
parameter.
Test Plan:
```
python test/test_sparse_semi_structured.py
```
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110583
Approved by: https://github.com/alexsamardzic, https://github.com/cpuhrsch
Summary:
Without the `all` added in this fix, the condition reads:
```
node.kwargs.get("beta", 1.0) == 1.0
node.kwargs.get("alpha", 1.0) == 1.0
and len(input_shape) == 2
and len(weight_shape) == 2
and all(x % 2 == 0 for x in input_shape + weight_shape)
and shape <= MAX_FUSE_TENSOR_SIZE_GROUP_LINEAR # <----- HERE
for shape in input_shape + weight_shape
```
The marked statement evaluates to a generator object, which is always truthy. One of the issues is that the shapes could be odd numbers, which forces GMM to load element-by-element rather than using vectorized loads. In the VDDv3 torchbench example (posted in the test plan), you can see a 37ms GMM call which swamps any gain from fusion. Overall, this change makes the GMM fusion 24% faster.
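A quick illustration of the bug class (a bare generator expression is always truthy):
```python
sizes = [64, 3071]
limit = 1024
broken = (s <= limit for s in sizes)    # generator object, not a bool
print(bool(broken))                     # True, regardless of the shapes
print(all(s <= limit for s in sizes))   # False -- the intended check
```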
Differential Revision: D48696572
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111174
Approved by: https://github.com/davidberard98
Summary:
This error was run into when running ExportPassBase on an exported model with lifted constant tensors:
```
File "/data/users/angelayi/pytorch/torch/_subclasses/fake_tensor.py", line 1444, in dispatch
len(kwargs) == 0 and len(args) == 1 and type(args[0]) is torch.Tensor
AssertionError: (FakeTensor(..., size=(s0,)),) {}
While executing %lift_fresh_copy_1 : [num_users=1] = call_function[target=torch.ops.aten.lift_fresh_copy.default](args = (%_lifted_tensor_constant99,), kwargs = {})
Original traceback:
File "" in forward
mean = torch.tensor([0.485, 0.456, 0.406]).reshape(3, 1, 1)
```
In ExportPassBase, we retrace using the fake tensors in the placeholder nodes, but when running into this lift_fresh_copy operator, it cannot be called with the fake tensors.
Test Plan: CI
Reviewed By: chakriu
Differential Revision: D50211827
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111140
Approved by: https://github.com/zhxchen17
C++ side callbacks allow advanced users to get
access to the collective firehose.
It's worth mentioning and discussing the dire environment in which those
callbacks are invoked: from either the main thread or the watchdog thread, and
with a PTD lock held.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110307
Approved by: https://github.com/fduwjj
ghstack dependencies: #111061, #111072
This PR addresses the persistent issue of merge conflicts in the benchmarks/dynamo/ci_expected_accuracy/ directory, specifically those arising from frequently updated CSV files. Based on @malfet's suggestion, the implemented solution adds three spaces between each line in the CSV files. This approach has proven effective in preventing merge conflicts, as evidenced in [D50239634](https://www.internalfb.com/intern/diff/D50239634/). Regardless of these changes, the extra newlines should still allow the CSVs to be ingested as normal.
If you have access to the diff:
Normally, modifying a line that is later altered in the stack results in a merge conflict during restacking. With this new spacing strategy, lines that are not modified further down the stack will not trigger merge conflicts, achieving our intended outcome.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111163
Approved by: https://github.com/malfet, https://github.com/huydhn
This PR will cause logging.exception() to also dump the exception and stacktrace. Copied from 74723e1110/Lib/logging/__init__.py (L707-L711)
repro:
<details>
```python
import torch
import torch._inductor.config
torch._inductor.config.triton.inject_relu_bug_TESTING_ONLY = "runtime_error"
def fn(x, y):
return (x @ y).relu()
x, y = [torch.rand((16, 16), device='cuda') for _ in range (2)]
torch.compile(fn)(x, y)
```
run with TORCHDYNAMO_REPRO_AFTER=aot TORCHDYNAMO_REPRO_LEVEL=4
</details>
before:
```
...
[2023-10-12 14:18:52,902] torch._dynamo.debug_utils: [ERROR] While minifying the program in accuracy minification mode, ran into a runtime exception which is likely an unrelated issue. Skipping this graph.
```
now:
```
...
[2023-10-12 14:18:52,902] torch._dynamo.debug_utils: [ERROR] While minifying the program in accuracy minification mode, ran into a runtime exception which is likely an unrelated issue. Skipping this graph.
Traceback (most recent call last):
File "/data/users/dberard/scripts/relu_accuracy_issue.py", line 10, in <module>
torch.compile(fn)(x, y)
...
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111164
Approved by: https://github.com/eellison
Summary:
We add a 2D implementation to the op [LayerNorm](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html?fbclid=IwAR00Xi7gt-qo4_LDFo18aaKTxnC4s1vlqk5EREqL0KE0Iz_97-WTlvi0muY)
The current implementation of layer_norm D37407311 supports
- input of 3D and normalized_shape also of 3D, or
- input of 4D with batch dim equal to 1 and normalized_shape of 3D
Since a 2D tensor of [H, W] can be represented as [1, H, W] in shader, we make a straightforward generalization to the case where both input and normalized_shape are of 2D.
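A minimal sketch of the newly supported case, assuming a Vulkan-enabled build (illustrative shapes):
```python
import torch
import torch.nn.functional as F

x = torch.randn(5, 7)
w = torch.randn(5, 7)
b = torch.randn(5, 7)
# 2D input with a 2D normalized_shape, now handled on the Vulkan backend
# via the [1, H, W] representation described above.
y = F.layer_norm(x.to("vulkan"), [5, 7], w.to("vulkan"), b.to("vulkan"))
```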
Test Plan:
## Before
```
[luwei@devbig984.prn1 ~/fbsource (e09fe4ae4|remote/fbsource/stable...)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*layer_norm_2d*"
Recommended: For faster builds try buck2: replace 'buck' with 'buck2'
NOTE: buck-out/ has changed: look for files in fbsource/buck-out/v2/
'buck2 build --show-output //xplat/caffe2:pt_vulkan_api_test_bin' will print the new output paths.
If you are building in fbsource//xplat and have questions, post in 'Cross Platform Dev Discussions': https://fb.workplace.com/groups/xplat.qa
Targets matching .buckconfig buck2.supported_projects:
{'//xplat/caffe2:pt_vulkan_api_test_bin': '//xplat'}
To suppress this warning: touch ~/.config/.dont_hint_buck2
clang-12: warning: argument unused during compilation: '-pthread' [-Wunused-command-line-argument]
Downloaded 2/4 artifacts, 125.45 Kbytes, 33.3% cache miss (for updated rules)
Building: finished in 4.9 sec (100%) 2637/2637 jobs, 3/2637 updated
Total time: 4.9 sec
BUILD SUCCEEDED
Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *layer_norm_2d*
[==========] Running 3 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 3 tests from VulkanAPITest
[ RUN ] VulkanAPITest.layer_norm_2d_small
unknown file: Failure
C++ exception with description "Vulkan layernorm expects 3-dim or 4-dim input!
Exception raised from layer_norm at xplat/caffe2/aten/src/ATen/native/vulkan/ops/Layernorm.cpp:66 (most recent call first):
(no backtrace available)" thrown in the test body.
[ FAILED ] VulkanAPITest.layer_norm_2d_small (56 ms)
[ RUN ] VulkanAPITest.layer_norm_2d_medium
unknown file: Failure
C++ exception with description "Vulkan layernorm expects 3-dim or 4-dim input!
Exception raised from layer_norm at xplat/caffe2/aten/src/ATen/native/vulkan/ops/Layernorm.cpp:66 (most recent call first):
(no backtrace available)" thrown in the test body.
[ FAILED ] VulkanAPITest.layer_norm_2d_medium (0 ms)
[ RUN ] VulkanAPITest.layer_norm_2d_large
unknown file: Failure
C++ exception with description "Vulkan layernorm expects 3-dim or 4-dim input!
Exception raised from layer_norm at xplat/caffe2/aten/src/ATen/native/vulkan/ops/Layernorm.cpp:66 (most recent call first):
(no backtrace available)" thrown in the test body.
[ FAILED ] VulkanAPITest.layer_norm_2d_large (27 ms)
[----------] 3 tests from VulkanAPITest (84 ms total)
[----------] Global test environment tear-down
[==========] 3 tests from 1 test suite ran. (84 ms total)
[ PASSED ] 0 tests.
[ FAILED ] 3 tests, listed below:
[ FAILED ] VulkanAPITest.layer_norm_2d_small
[ FAILED ] VulkanAPITest.layer_norm_2d_medium
[ FAILED ] VulkanAPITest.layer_norm_2d_large
3 FAILED TESTS
```
## After
```
[luwei@devbig984.prn1 ~/fbsource (e09fe4ae4|remote/fbsource/stable...)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*layer_norm_2d*"
Recommended: For faster builds try buck2: replace 'buck' with 'buck2'
NOTE: buck-out/ has changed: look for files in fbsource/buck-out/v2/
'buck2 build --show-output //xplat/caffe2:pt_vulkan_api_test_bin' will print the new output paths.
If you are building in fbsource//xplat and have questions, post in 'Cross Platform Dev Discussions': https://fb.workplace.com/groups/xplat.qa
Targets matching .buckconfig buck2.supported_projects:
{'//xplat/caffe2:pt_vulkan_api_test_bin': '//xplat'}
To suppress this warning: touch ~/.config/.dont_hint_buck2
clang-12: warning: argument unused during compilation: '-pthread' [-Wunused-command-line-argument]
Downloaded 1/3 artifacts, 1.40 Mbytes, 50.0% cache miss (for updated rules)
Building: finished in 5.0 sec (100%) 2637/2637 jobs, 2/2637 updated
Total time: 5.0 sec
BUILD SUCCEEDED
Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *layer_norm_2d*
[==========] Running 3 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 3 tests from VulkanAPITest
[ RUN ] VulkanAPITest.layer_norm_2d_small
[ OK ] VulkanAPITest.layer_norm_2d_small (282 ms)
[ RUN ] VulkanAPITest.layer_norm_2d_medium
[ OK ] VulkanAPITest.layer_norm_2d_medium (0 ms)
[ RUN ] VulkanAPITest.layer_norm_2d_large
[ OK ] VulkanAPITest.layer_norm_2d_large (214 ms)
[----------] 3 tests from VulkanAPITest (497 ms total)
[----------] Global test environment tear-down
[==========] 3 tests from 1 test suite ran. (497 ms total)
[ PASSED ] 3 tests.
```
full test result: P848167714
Differential Revision: D50048054
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110796
Approved by: https://github.com/yipjustin
This PR contains the changes needed to support using the NT jagged subclass within SAM. Note that a NT with multiple ragged dims is still required at the extremes for inputs / outputs, but the internal computation generally involves a single ragged dim, making the jagged layout usable.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109123
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
`_shard_tensor()` calls into `dist.all_gather_object()` and this is causing optimizer state dict loading to be super slow. Workaround: call `FSDP._shard_utils._create_chunk_sharded_tensor()` to construct ShardedTensor without any communication.
Thanks to @fegin for suggesting the fix!
Thanks @mvpatel2000 for reporting the issue and providing profiling details to help us isolate the problematic source code quickly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111096
Approved by: https://github.com/fegin
Fixes #111066, #111065, #111064
Currently `use_cutlass_template` returns True on ROCm even though the feature is not supported there; fix it to return False on ROCm. I considered adding this change to `try_import_cutlass` instead, but the comments hinted that this function would be removed at some point.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111132
Approved by: https://github.com/jansel
This reverts commit 314a502eb04c6382e2cc9af0573533efba54109d.
Changes since original PR:
Reland 1
* rename torch.distributed.hooks to torch.distributed._hooks
Reland 2
* make _hooks importable even if !distributed.is_available()
* handle cuda driver exit intermittent failure caused by new cuda api usage in callback caller (see prev PR in stack)
(original PR https://github.com/pytorch/pytorch/pull/108815 desc copied below)
Expose a set of observability hooks into C10D such that our users can
detect collectives failure both faster and more easily.
The design is similar to NCCL desync debug in that it minimizes the
overhead by doing most of the work off the main thread.
This PR introduces a new module torch.distributed.hooks that exposes the following set of methods:
register_collective_start_hook
register_collective_end_hook
register_process_group_hook
The process group hook exposes PG creation on the member ranks; these hooks are called inline from
the PG creation code. This is fine since this happens during initialization and only a limited number of times.
The collective start/end hooks are fired from a single background thread, which reads
events from a C++ queue and dispatches them.
Queue notification is oddly done using a pipe; this is needed so Python can abort the thread on shutdown
and keep it as a background thread. This is not possible with more reasonable choices like a condvar.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111072
Approved by: https://github.com/malfet
ghstack dependencies: #111061
The pattern here is that main may exit and kill cuda driver before
c10d watchdog related threads have cleanly exited. If this happens,
c10d threads may still make CUDA api calls and raise an exception about
the cuda driver being dead.
In the past we've patched a few helper functions that call into cuda
to specifically handle this driver exiting message. Instead, we know
that this problem applies only to codepaths in our background threads,
so we should catch at that scope and not worry about fine-grained
catching at the helper granularity. (And if a helper is used from the main
thread, we should NOT catch this exception; it's the application's fault.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111061
Approved by: https://github.com/malfet, https://github.com/fduwjj
`libshm.so` depends on the torch library exclusively for `at::RefcountedMapAllocator`,
so it makes sense to move it to c10 along with the other memory allocators.
This means `libshm.so` only depends on `c10` and we don't need to relink
`libshm.so` for every ATen change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109881
Approved by: https://github.com/albanD
`is_allowed` is a tricky bit of functionality - it sits early up in builder and is used to drive the creation of TorchVariable (more notes here, meta only https://fb.workplace.com/groups/pytorch.dev/permalink/1393563781222098/)
If we are tracing distributed in full, we want to route certain calls in distributed to NOT PASS is_allowed (confusingly, this does not mean that they are not allowed, but rather that we don't want them to become TorchVariable); others, we are fine with preserving.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110894
Approved by: https://github.com/ezyang
We switch to pytorch-linux-jammy-cuda11.8-cudnn8-py3.9-linter for lintrunner to check CUDA C++ sources. Meanwhile, there is a Dockerfile change due to a missing libiomp installation, plus some other clang-tidy fixes triggered by the switch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110502
Approved by: https://github.com/malfet
This can be useful for advanced users (like AOTAutograd) who don't want to keep the corresponding Tensor alive (for memory reasons, for example) or when an inplace op will change the Tensor's grad_fn (but gradients w.r.t. the original value are needed).
I went with a minimal API change, but I'm open to suggestions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110867
Approved by: https://github.com/soulitzer
Summary:
Previously we designed the GraphSignature format as a bunch of input and output node names. After a discussion in the design meeting we decided to change the format to make the signature more self-contained. Now the signature format looks like the following:
```
[
InputSpec(
kind=InputKind.USER_INPUT,
arg=TensorArgument(name="arg0_1"),
target=None,
),
...
]
```
Test Plan: CI
Reviewed By: angelayi
Differential Revision: D49876258
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111017
Approved by: https://github.com/angelayi
Currently, we only support intra-node TP when composing TP with other parallelisms. This PR adds an additional check to validate the TP mesh dim during TP initialization when a parent mesh exists.
cc. @fegin, @fduwjj
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111001
Approved by: https://github.com/fduwjj
Summary: Adding Linear vulkan quantize operator
Test Plan:
buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource -c pt.vulkan_full_precision=1
//xplat/caffe2/fb/custom_ops/vulkan_quantized:pt_vulkan_quantized_test_binAppleMac\#macosx-arm64
[ OK ] VulkanAPITest.convert_qconv2d_context (135 ms)
[ RUN ] VulkanAPITest.linear_2d
[ OK ] VulkanAPITest.linear_2d (4 ms)
[----------] 2 tests from VulkanAPITest (139 ms total)
[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (139 ms total)
[ PASSED ] 2 tests.
##############################################################
buck2 build --target-platforms ovr_config//platform/macos:arm64-fbsource
//xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output"
buck-out//v2/gen/fbsource/xplat/caffe2/pt_vulkan_quantized_api_test_binAppleMac
[ OK ] VulkanAPITest.conv2d_pw_quantized_prepack_random_params_int8_int32 (11 ms)
[ RUN ] VulkanAPITest.linear_2d_flat
[ OK ] VulkanAPITest.linear_2d_flat (4 ms)
[ RUN ] VulkanAPITest.linear_2d_small
[ OK ] VulkanAPITest.linear_2d_small (1 ms)
[ RUN ] VulkanAPITest.linear_2d_large
[ OK ] VulkanAPITest.linear_2d_large (1 ms)
[ RUN ] VulkanAPITest.linear_3d_flat
[ OK ] VulkanAPITest.linear_3d_flat (2 ms)
[ RUN ] VulkanAPITest.linear_3d_small
[ OK ] VulkanAPITest.linear_3d_small (2 ms)
[ RUN ] VulkanAPITest.linear_3d_large
[ OK ] VulkanAPITest.linear_3d_large (1 ms)
[ RUN ] VulkanAPITest.linear_4d_flat
[ OK ] VulkanAPITest.linear_4d_flat (1 ms)
[ RUN ] VulkanAPITest.linear_4d_small
[ OK ] VulkanAPITest.linear_4d_small (1 ms)
[ RUN ] VulkanAPITest.linear_4d_large
[ OK ] VulkanAPITest.linear_4d_large (1 ms)
[ RUN ] VulkanAPITest.linear_custom
[ OK ] VulkanAPITest.linear_custom (0 ms)
[----------] 76 tests from VulkanAPITest (1811 ms total)
[----------] Global test environment tear-down
[==========] 76 tests from 1 test suite ran. (1811 ms total)
[ PASSED ] 76 tests.
YOU HAVE 8 DISABLED TESTS
##############################################################
buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1
[----------] Global test environment tear-down
[==========] 346 tests from 1 test suite ran. (5648 ms total)
[ PASSED ] 345 tests.
[ SKIPPED ] 1 test, listed below:
[ SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
YOU HAVE 5 DISABLED TESTS
Reviewed By: manuelcandales
Differential Revision: D48812642
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110581
Approved by: https://github.com/yipjustin
Several improvements for skipfiles:
* Add ```FUNC_INLINELIST``` to support function level skip/inline check.
* Use ```fn.__code__``` to match function since we can't get the function object sometimes.
* Use python module string name for ```FILE_INLINELIST``` and ```SUBMODULE_INLINELIST```.
* Use filename to match file and python module, which can fundamentally resolve the circular import issues introduced by skipfiles.
* Use ```TYPE_CHECKING``` to ensure the python module string name is correct.
* Add unit tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110835
Approved by: https://github.com/ezyang
This PR updates DTensor to support torch.compile
Cool stuff: there are some new tests in `test_dtensor.py` that show both the forward and backward graphs that we can send to inductor, when running a matmul with DTensor's. In particular, for this user code:
```
def fn(x, y):
dt = DTensor.from_local(x.reshape(2, 4), mesh, [Shard(0)], run_check=False)
dt2 = DTensor.from_local(y.reshape(4, 2), mesh, [Shard(1)], run_check=False)
dt_out = torch.matmul(dt, dt2)
dt_out_redistribute = dt_out.redistribute(mesh, [Replicate()])
return dt_out.to_local()
```
We generate the following fw and backward graphs.
Forward graph:
```
def forward(self, primals_1, primals_2):
view = torch.ops.aten.view.default(primals_1, [2, 4]); primals_1 = None
_to_copy = torch.ops.aten._to_copy.default(view, dtype = torch.float32, layout = torch.strided, device = device(type='cuda', index=0)); view = None
detach = torch.ops.aten.detach.default(_to_copy); _to_copy = None
detach_1 = torch.ops.aten.detach.default(detach); detach = None
view_1 = torch.ops.aten.view.default(primals_2, [4, 2]); primals_2 = None
_to_copy_1 = torch.ops.aten._to_copy.default(view_1, dtype = torch.float32, layout = torch.strided, device = device(type='cuda', index=0)); view_1 = None
detach_2 = torch.ops.aten.detach.default(_to_copy_1); _to_copy_1 = None
detach_3 = torch.ops.aten.detach.default(detach_2); detach_2 = None
detach_4 = torch.ops.aten.detach.default(detach_1)
all_gather_into_tensor = torch.ops.c10d_functional.all_gather_into_tensor.default(detach_3, 'ptd:0', [0, 1], 2)
wait_tensor = torch.ops.c10d_functional.wait_tensor.default(all_gather_into_tensor); all_gather_into_tensor = None
split = torch.ops.aten.split.Tensor(wait_tensor, 4); wait_tensor = None
getitem = split[0]
getitem_1 = split[1]; split = None
cat = torch.ops.aten.cat.default([getitem, getitem_1], 1); getitem = getitem_1 = None
detach_5 = torch.ops.aten.detach.default(cat); cat = None
mm = torch.ops.aten.mm.default(detach_4, detach_5); detach_4 = detach_5 = None
detach_6 = torch.ops.aten.detach.default(mm); mm = None
detach_9 = torch.ops.aten.detach.default(detach_6); detach_6 = None
detach_10 = torch.ops.aten.detach.default(detach_9); detach_9 = None
t = torch.ops.aten.t.default(detach_1); detach_1 = None
detach_13 = torch.ops.aten.detach.default(t); t = None
t_1 = torch.ops.aten.t.default(detach_3); detach_3 = None
detach_15 = torch.ops.aten.detach.default(t_1); t_1 = None
clone = torch.ops.aten.clone.default(detach_15, memory_format = torch.contiguous_format); detach_15 = None
return [detach_10, detach_13, clone]
```
Backward graph:
```
def forward(self, detach_13, clone, tangents_1):
detach_11 = torch.ops.aten.detach.default(tangents_1); tangents_1 = None
detach_12 = torch.ops.aten.detach.default(detach_11); detach_11 = None
mm_1 = torch.ops.aten.mm.default(detach_13, detach_12); detach_13 = None
detach_14 = torch.ops.aten.detach.default(mm_1); mm_1 = None
detach_16 = torch.ops.aten.detach.default(detach_12); detach_12 = None
all_gather_into_tensor_2 = torch.ops.c10d_functional.all_gather_into_tensor.default(clone, 'ptd:0', [0, 1], 2); clone = None
wait_tensor_2 = torch.ops.c10d_functional.wait_tensor.default(all_gather_into_tensor_2);
detach_17 = torch.ops.aten.detach.default(wait_tensor_2); wait_tensor_2 = None
mm_2 = torch.ops.aten.mm.default(detach_16, detach_17); detach_16 = detach_17 = None
detach_18 = torch.ops.aten.detach.default(mm_2); mm_2 = None
split_1 = torch.ops.aten.split.Tensor(detach_14, 2, 1); detach_14 = None
getitem_2 = split_1[0]
getitem_3 = split_1[1]; split_1 = None
cat_1 = torch.ops.aten.cat.default([getitem_2, getitem_3]); getitem_2 = getitem_3 = None
reduce_scatter_tensor = torch.ops.c10d_functional.reduce_scatter_tensor.default(cat_1, 'SUM', 'ptd:0', [0, 1], 2); cat_1 = None
wait_tensor_3 = torch.ops.c10d_functional.wait_tensor.default(reduce_scatter_tensor); reduce_scatter_tensor = None
detach_19 = torch.ops.aten.detach.default(wait_tensor_3); wait_tensor_3 = None
detach_20 = torch.ops.aten.detach.default(detach_19); detach_19 = None
detach_21 = torch.ops.aten.detach.default(detach_20); detach_20 = None
detach_22 = torch.ops.aten.detach.default(detach_21); detach_21 = None
_to_copy_2 = torch.ops.aten._to_copy.default(detach_22, dtype = torch.float32, layout = torch.strided, device = device(type='cpu')); detach_22 = None
view_2 = torch.ops.aten.view.default(_to_copy_2, [8]); _to_copy_2 = None
detach_23 = torch.ops.aten.detach.default(detach_18); detach_18 = None
detach_24 = torch.ops.aten.detach.default(detach_23); detach_23 = None
_to_copy_3 = torch.ops.aten._to_copy.default(detach_24, dtype = torch.float32, layout = torch.strided, device = device(type='cpu')); detach_24 = None
view_3 = torch.ops.aten.view.default(_to_copy_3, [8]); _to_copy_3 = None
return [view_3, view_2]
```
Some of the stuff in this graph looks kind of silly though (e.g. an unnecessary split() + cat(), and all the extra detach() calls).
Stuff that's broken:
- functionalization is pretty horribly broken. In particular, the original strategy I used in this stack was to have functionalization run **above** subclass desugaring. But that doesn't play well with the way we want to compile DTensor. DTensor has a few APIs like `.redistribute()`, `.to_local()`, and the `DTensor()` constructor, that we want to put directly into the graph so that we can compile them (e.g. redistribute() will desugar into collective ops). Doing this requires functionalization to run **underneath** the subclass though. I hacked around this for now, by forcing these functions to run functionalization first if they need to.
- the backward test that I have is... wrong. The backward graph that we trace out looks kind of reasonable, but it gives incorrect gradients on one of the two inputs. This needs further debugging (presumably we should be able to stare at the graph and identify which part of it is wrong?).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105236
Approved by: https://github.com/wanchaol
Summary:
The current ShardedTensor.gather does not work as expected when the shard is empty on any rank.
The root cause is that when a sharded tensor has no placement on a specific rank, the metadata doesn't include that rank's placement, which introduces a KeyError in: ```shard_offset = shard_placement[shard.metadata][1]```
It's fixed by adding an empty-tensor check.
Test Plan:
before change:
after change:
Differential Revision: D50114085
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110962
Approved by: https://github.com/wz337
Summary:
https://docs.google.com/document/d/1QJJEGnj2nHGPODlw38BEG3KLLCOTfdOVjPrNQbz_LM8/edit#bookmark=id.lp80wfshq130
Changes:
* `torch.export` will return a functional ATen graph but not lowered to core aten decompositions (CompositeImplicitAutograd decomps still run)
* `exported_program.run_decompositions(decomposition_table)` will optionally take a decomposition table, and run decompositions on the exported program, returning a new exported program. By default we will run the Core ATen decomposition table.
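As a usage sketch of this two-step flow (assuming a module `model` and a tuple of example inputs `args`):
```python
import torch

ep = torch.export.export(model, args)  # functional ATen, no core decompositions
core_ep = ep.run_decompositions()      # defaults to the Core ATen decomp table
```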
Calling convention for Executorch stays the same:
```
pre_autograd_graph = capture_pre_autograd_graph(f, args, ...)
aten_graph_no_decomps = torch.export.export(pre_autograd_graph, args, ...)
# Within to_edge we decompose to core aten and then convert to edge
edge_graph = exir.to_edge(aten_graph_no_decomps)
```
Test Plan: CI
Differential Revision: D50172210
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111030
Approved by: https://github.com/ydwu4
Summary:
This is a hacky flag that we had before in fx flow, and we don't want this in the new flow
Test Plan:
python test/test_quantization.py TestQuantizePT2E
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111000
Approved by: https://github.com/andrewor14
This PR adds a all_gather_dtensor() method to fsdp/_fsdp_extensions.py and the actual implementation in tensor/parallel/fsdp.py. This enables FSDP to load 2D DTensor state_dict into model when calling `model.load_state_dict()`.
cc. @fegin
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110925
Approved by: https://github.com/fegin
ghstack dependencies: #110831, #110846
Several improvements for skipfiles:
* Add ```FUNC_INLINELIST``` to support function level skip/inline check.
* Use ```fn.__code__``` to match function since we can't get the function object sometimes.
* Use python module string name for ```FILE_INLINELIST``` and ```SUBMODULE_INLINELIST```.
* Use filename to match file and python module, which can fundamentally resolve the circular import issues introduced by skipfiles.
* Use ```TYPE_CHECKING``` to ensure the python module string name is correct.
* Add unit tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110835
Approved by: https://github.com/ezyang
This PR switches the replicate -> partial conversion to do division instead of
zeroing out other ranks. It preserves the same numerics but avoids the
per-rank behavior difference, and is friendly to torch.compile.
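In other words (a sketch of the numerics only, not the actual DTensor code):
```python
# Every rank does the same thing: divide by world size, so that the later
# sum-reduction of the partial placements reproduces the replicated value.
def replicate_to_partial(tensor, world_size):
    return tensor / world_size
```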
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110898
Approved by: https://github.com/fduwjj
Summary: Add more details about the export_memory_timeline API, as we've landed new representations of the memory timeline data.
Test Plan: CI, should be no functional change, as we only changed comments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110983
Approved by: https://github.com/DanilBaibak
Summary: Introduce a utility class AOTIModelRunner to take care of running an AOTInductor compiled model. It does things like dlopen a model, initialize the model container, setup inputs and outputs, and destroy the model container.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110891
Approved by: https://github.com/chenyang78
ghstack dependencies: #110652
This PR adds a parametrized test for cond. It tests that cond can be traced with valid inputs. Specifically, a valid input is a combination of:
- pred (python boolean, boolean tensor, int tensor, scalar tensor)
- true_fn/false_fn (func, obj, nn_module)
- Operands (0 or more tensor inputs), tested with 0 and 2
- closures (0 or more tensor closures), tested with 0 and 2
- nested_level (no nesting or level-2 nested cond)
What this test doesn't cover:
- pred: symbolic boolean expression as predicate
- true_fn/false_fn that mutate intermediate tensors
- operands: non-tensor operands such as float, int
- closures: nn_module attribute closures, python constant closures
- nested_level: 3+
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110727
Approved by: https://github.com/zou3519
By calling `at::mps::sign_outf` rather than `at::sign_out`, which would call the dispatcher again.
Also, do not copy output unnecessarily.
<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at f942e74</samp>
> _Metal tensors rise from the ashes_
> _`sign` and `sgn` unleash their flashes_
> _MPSFunctions reign supreme_
> _In the header of the metal dream_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110955
Approved by: https://github.com/kulinseth, https://github.com/albanD
Summary: Implement an on-disk cache to save and reuse compiled FX Graphs. This implementation does not handle tensors with symbolic shapes. This needs to be done in a follow-up PR.
Test Plan:
* New unit tests exercising saving and load from the cache.
* New unit tests to exercise the cache key calculations.
* Ran several benchmarks to see cache hit and resulting compilation times.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103453
Approved by: https://github.com/eellison, https://github.com/Chillee
Summary:
Make it easier to add `generate_opcheck_tests` by adding defaults for
the failures_dict location, the additional decorators, and the test
utils.
Test Plan:
Existing tests
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110977
Approved by: https://github.com/williamwen42
ghstack dependencies: #110951
- This PR is the first part of a bigger change to use `MPSEvent` to synchronize shared-buffers between CPU/GPU.
- Add APIs to record and wait for `MPSEvents` in `MPSAllocator`.
- Use a container list for Buffer Pools to simplify iterating over them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106938
Approved by: https://github.com/kulinseth
Summary:
https://docs.google.com/document/d/1QJJEGnj2nHGPODlw38BEG3KLLCOTfdOVjPrNQbz_LM8/edit#bookmark=id.lp80wfshq130
Changes:
* `torch.export` will return a functional ATen graph w/o decompositions
* `exported_program.run_decompositions(decomposition_table)` will optionally take a decomposition table, and run decompositions on the exported program, returning a new exported program. By default we will run the Core ATen decomposition table.
Calling convention for Executorch stays the same:
```
pre_autograd_graph = capture_pre_autograd_graph(f, args, ...)
aten_graph_no_decomps = torch.export.export(pre_autograd_graph, args, ...)
# Within to_edge we decompose to core aten and then convert to edge
edge_graph = exir.to_edge(aten_graph_no_decomps)
```
Test Plan: CI
Differential Revision: D49742989
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110410
Approved by: https://github.com/ydwu4
Otherwise the following error is thrown when attempting to compile with WERROR enabled:
```
In file included from /home/nshulga/git/pytorch/pytorch/torch/csrc/distributed/c10d/socket.cpp:30:
/home/nshulga/git/pytorch/pytorch/third_party/fmt/include/fmt/chrono.h:340:24: warning: redundant redeclaration of ‘constexpr’ static data member ‘fmt::v10::detail::codecvt_result<CodeUnit>::max_size’ [-Wdeprecated]
340 | constexpr const size_t codecvt_result<CodeUnit>::max_size;
| ^~~~~~~~~~~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/third_party/fmt/include/fmt/chrono.h:335:33: note: previous declaration of ‘fmt::v10::detail::codecvt_result<CodeUnit>::max_size’
335 | static constexpr const size_t max_size = 32;
| ^~~~~~~~
```
or the following if using clang as the host compiler:
```
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/distributed/c10d/socket.cpp:30:
/Users/nshulga/git/pytorch/pytorch/third_party/fmt/include/fmt/chrono.h:340:50: warning: out-of-line definition of constexpr static data member is redundant in C++17 and is deprecated [-Wdeprecated]
constexpr const size_t codecvt_result<CodeUnit>::max_size;
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111002
Approved by: https://github.com/drisspg
To better support device-agnostic code, add a "cpu" return value for `current_device()` in torch.cpu so that we won't run into `AttributeError: module 'torch.cpu' has no attribute 'current_device'`.
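A small illustration of the device-agnostic pattern this enables (sketch):
```python
import torch

# Pick the backend module dynamically; torch.cpu now also exposes
# current_device(), returning the string "cpu".
mod = torch.cuda if torch.cuda.is_available() else torch.cpu
dev = mod.current_device()  # "cpu" on the CPU backend, a device index on CUDA
```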
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110987
Approved by: https://github.com/wanchaol
In this PR:
- Adds support for strides for jagged tensor (design doc for this coming soon)
- NestedTensor skips automatic dynamic
- Make use of @bdhirsh's subclass fakification logic by adding the __tensor_{un,}flatten__ functions.
- Additional logic for fakification: since the existing subclass fakification logic does not handle the case where the outer tensor has an additional dimension, we insert one-off logic to (1) insert an extra SingletonSymInt onto the fakified NestedTensor and (2) make sure we call track_symint on the sizes of both the inner and outer tensors during guard creation.
Remaining things that are weird:
- Still need to skip some logic in meta utils for some reason (I was going to write this up more, but decided not to since we're not able to do this anyway for an immediate reason: we cannot arbitrarily compare singleton ints. For now I'm just following Brian's advice from [here](https://github.com/pytorch/pytorch/pull/109171#discussion_r1328137070))
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109171
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
People access activation checkpoint through many layers of config and it is not always guaranteed that all the layers of wrapping around checkpoint properly propagate all the kwargs, e.g. debug mode. This context manager offers an alternative way to enable debug mode that bypasses the need for all layers to propagate kwargs.
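A hedged usage sketch (assuming the context manager is exposed as `set_checkpoint_debug_enabled` in `torch.utils.checkpoint`, and that a `model` and `inputs` exist):
```python
import torch.utils.checkpoint as cp

# Force debug mode for every checkpointed region inside the block, without
# threading debug=True through each wrapper layer's kwargs.
with cp.set_checkpoint_debug_enabled(True):
    loss = model(inputs).sum()
    loss.backward()
```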
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110728
Approved by: https://github.com/albanD
ghstack dependencies: #110673, #110674, #110675, #110676
The main thrust of the initial effort here was to capture `register_hook` calls on tensors in compile regions. The first part of this was done in https://github.com/pytorch/pytorch/pull/108903 wherein we added support for register_hook input tensors.
The distinction between input and intermediary is due to implementation differences.
There are 2 kinds of hooks:
1) Hooks on objects with sources (inputs, params)
2) Hooks on objects w/o sources (intermediaries, and outputs).
Note: As outputs can be made simple by how dynamo handles residuals, they could actually be handled as if they were inputs, but, for the sake of this PR, we will refer to hooks as either hooks on inputs (sourced), or hooks on intermediaries (not sourced).
**The plan:**
For tensors w/ a source: (The PR above)
We record registered hooks, store them as a global, and associate them with the tensor in residuals. This means that when dynamo goes to create the frame, where we produce bytecode to stitch together our PT2 modified bytecode with the original eager code, we call register_hook. This registration of hooks in residuals is sound because (a) it happens right after a Pt2 frame region ends and (b) we know that the tensor is alive in f_locals, f_globals, or a module in the users invoking frame. This means we can soundly know it will be around to invoke register_hook on. As long as we guard on the identity of the lifted function, this is sound to do.
For tensors w/o a source: (This PR)
Ostensibly, the most correct and complete solution would be to smuggle hooks into a runtime wrapper in aot_autograd, where all the items the hooks close over are lifted to inputs as necessary and passed alongside the user provided function. This is necessary so that we can properly trace out and capture all the mutations within the user defined hook at backwards time.
This is too complicated - so, we limited the scope of this initial PR to a simple subset of hooks:
- Hooks must have a source (be known to us already, not a lambda or intermediary defined function)
- We must be tracing under compiled autograd
**The flow**:
We use the HOP added in https://github.com/pytorch/pytorch/pull/109690/files, referred to as the HOP below.
1) We intercept register_hook calls and wrap the user defined fn in the HOP
2) We write a `_register_hook_trampoline` to the graph that is a local no-arg function that is invoked as a call_function in the dynamo graph
3) aot_autograd inlines through it during its trace, and sees the HOP
4) the HOP preserves itself in the graph - it does not get traced into
5) During backwards, compiled_autograd installs the HOP under a hook call
6) When compiled_autograd enters compilation over its generated graph, dynamo traces the contents of the hook
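A hedged sketch of the kind of user code this targets (the hook is a named module-level function, so it has a source; running it also requires compiled autograd to be enabled, which is not shown here):
```python
import torch

def scale_grad(grad):            # sourced: a named, module-level function
    return grad * 2

@torch.compile
def fn(x):
    y = x.sin()                  # intermediate tensor (no source)
    y.register_hook(scale_grad)  # wrapped in the HOP, resolved by compiled autograd
    return y.cos()
```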
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109537
Approved by: https://github.com/ezyang
## Context
Add decompositions for `aten.max`, `aten.min`, and `aten.var_mean`. These operators follow a pattern of returning a tuple of outputs from two component operators:
```
aten.max(x) -> return aten.amax(x), aten.argmax(x)
aten.min(x) -> return aten.amin(x), aten.argmin(x)
aten.var_mean(x) -> return aten.var(x), aten.mean(x)
```
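As a sketch in Python (illustrative only; the actual decompositions live in `torch._decomp`):
```python
import torch

# The pattern above, written out for aten.max.dim: the tuple output is
# recovered from the two component reductions.
def max_decomp(x, dim, keepdim=False):
    return torch.amax(x, dim, keepdim), torch.argmax(x, dim, keepdim)
```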
For `var_mean`, the `refs` implementation was doing something similar, so I changed it to call `torch.` ops instead, as was done for other `refs` implementations previously. cc: @peterbell10 @lezcano
Note that Inductor lowers all these directly, so they are excluded from the Inductor decomp table.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110906
Approved by: https://github.com/manuelcandales
Avoid changing the default for other backends, as the CPU backend (GLOO) may need
longer timeouts.
Motivated by trying to save cluster time when encountering collective
hangs. Generally collectives should time out within seconds and 30
minutes (or 10 minutes) should provide ample headroom for edge cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110947
Approved by: https://github.com/xw285cornell, https://github.com/fduwjj
Summary:
We want the matcher to return a name -> node mapping in the target graph
so that we can refer to nodes by name; this is useful for downstream applications like
quantization.
Also, we can use the torch API as the source of truth instead of matching the aten API directly.
Test Plan:
python test/fx/test_matcher_utils.py
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110743
Approved by: https://github.com/SherlockNoMad
This PR adds the following helper functions for generated opcheck tests:
- dontGenerateOpCheckTests is a decorator that skips generation of the
opcheck tests for the generated function
- is_inside_opcheck_mode lets us query if we are in a generated test.
Useful for fast debugging out-of-tree without needing to update
PyTorch.
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110951
Approved by: https://github.com/williamwen42
This reverts commit ff0358b0384d6a3a5b8ceeae625c93221612ba8e.
(original PR https://github.com/pytorch/pytorch/pull/108815 desc copied below)
Expose a set of observability hooks into C10D such that our users can
detect collectives failure both faster and more easily.
The design is similar to NCCL desync debug in that it minimizes the
overhead by doing most of the work off the main thread.
This PR introduces a new module torch.distributed.hooks that exposes the following set of methods:
register_collective_start_hook
register_collective_end_hook
register_process_group_hook
The process group hook exposes PG creation on the member ranks; these hooks are called inline from
the PG creation code. This is fine since this happens during initialization and only a limited number of times.
The collective start/end hooks are fired from a single background thread, which reads
events from a C++ queue and dispatches them.
Queue notification is oddly done using a pipe; this is needed so Python can abort the thread on shutdown
and keep it as a background thread. This is not possible with more reasonable choices like a condvar.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110907
Approved by: https://github.com/fduwjj
# Summary
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 318764f</samp>
This pull request implements the CUDA backend of the SDPA kernel for nested tensors, which enables efficient transformer models with variable-length sequences. It adds a new dispatch key, a backward function, a unit test, and some helper functions for the kernel. It modifies `test/test_transformers.py`, `aten/src/ATen/native/native_functions.yaml`, `aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctionsBackward.cpp`, and `aten/src/ATen/native/nested/cuda/NestedTensorTransformerUtils.h`.
<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at ed4a773</samp>
> _Fused kernels of doom, unleash the flash attention_
> _Nested tensors on fire, reshape and pad with caution_
> _Backward pass of power, dispatch the CUDA key_
> _Test the gradients of hell, warn the user if they disagree_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97485
Approved by: https://github.com/jbschlosser
Summary: Add more details about the export_memory_timeline API, as we've landed new representations of the memory timeline data.
Test Plan: CI, should be no functional change, as we only changed comments.
Differential Revision: D50123450
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110949
Approved by: https://github.com/davidberard98
This is a PoC of AOTDispatch support. This PR actually works on basic examples, and I'm working on testing it out on `DTensor` (with @wanchaol), `SemiStructuredSparsityTensor` (with @jcaip), and `FP8Tensor`.
There are some design decisions baked into the PR that I think we need consensus on though - so I'm planning on writing a larger design doc to go over the changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104483
Approved by: https://github.com/ezyang
Fixes #108754.
`hf_T5_generate` would encounter a regression when calling `extern_kernels.bmm`, if one input is `reinterpret_tensor(buf2, (8, 1, 64), (64, 0, 1))` rather than `reinterpret_tensor(buf2, (8, 1, 64), (64, 512, 1), 0)`. As @jgong5 mentioned in comment, in fact the two tensors are equivalent: The stride doesn't matter when the corresponding size is 1.
We revise the definition of contiguity in `bmm` to treat the above situation as a contiguous case. Thus, when the stride equals 0, `extern_kernels.bmm` can still use MKL's `gemm` to gain performance.
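A quick demonstration that the two views are element-wise identical (the stride of a size-1 dimension never contributes to any offset):
```python
import torch

buf = torch.randn(8 * 64)
a = buf.as_strided((8, 1, 64), (64, 0, 1))    # stride 0 on the size-1 dim
b = buf.as_strided((8, 1, 64), (64, 512, 1))  # any stride works there
print(torch.equal(a, b))  # True
```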
Speedup of `hf_T5_generate` is **1.343x** now and **1.138x** before, with script `bash inductor_single_test.sh multiple inference performance torchbench hf_T5_generate float32 first dynamic default 0`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110811
Approved by: https://github.com/jgong5, https://github.com/lezcano, https://github.com/Chillee
We want to get to a point where most UserErrors link to exportdb examples. This PR makes passing case names non-optional to make this intent clearer and encourage developers who raise UserErrors to make or point to examples that make fixing such errors more obvious for users.
In addition, sometimes there are multiple examples that are relevant to an error. Thus this PR also enables passing multiple case names.
Retry of #110733 which was reverted due to a landrace.
Differential Revision: [D50087148](https://our.internmc.facebook.com/intern/diff/D50087148/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110878
Approved by: https://github.com/gmagogsfm, https://github.com/tugsbayasgalan
Summary:
Currently, PyTorch incorrectly calculates the size of the returned
matrix when we pass a non-contiguous batched (>2d) input to the
semi-structured sparse subclass.
This is most common in MLP layers, where we have 2 linear layers back to back.
This will lead to an error like the following:
```
RuntimeError: shape '[20, 64, 64, 3072]' is invalid for input of size
62914560
```
Where the size of the sparse matmul result is off because we infer the
output shape with the wrong tensor shape.
This happens because of a bug where we did not update the subclass
tensor shape when doing transpose.
For semi-structured sparsity, transposing is a no-op where we just set
the boolean flag, but we forgot to also update the tensor shape.
Note that this error goes away in inference mode, since we avoid
decomposing the aten.linear op and handle shape folding ourselves,
which changes the execution path.
An alternative way to fix this issue is to set
TORCH_FLATTEN_LINEAR_3D=True, which will also fix this error.
Test Plan:
```
python test/test_sparse_semi_structured.py -k test_mlp
```
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110420
Approved by: https://github.com/alexsamardzic, https://github.com/cpuhrsch
In https://github.com/pytorch/pytorch/pull/99243, a check was added to ensure the `size` only contained integers.
This PR updates the check to also include numpy integers based on this comment (cc @kit1980): https://github.com/pytorch/pytorch/pull/99243#issuecomment-1646736646. Similar to the other commenter, I also ran into issues where existing software broke due to this after upgrading to PT2.1:
```
if not torch.jit.is_scripting():
if not all(_is_integer(x) for x in size):
> raise TypeError(
"expected size to be one of int or Tuple[int] or Tuple[int, int] or "
f"Tuple[int, int, int], but got size with types {[type(x) for x in size]}"
)
E TypeError: expected size to be one of int or Tuple[int] or Tuple[int, int] or Tuple[int, int, int], but got size with types [<class 'numpy.int64'>, <class 'numpy.int64'>]
/conda-env/lib/python3.8/site-packages/torch/nn/functional.py:3924: TypeError
```
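A minimal sketch of the relaxed check (illustrative; the actual implementation may differ):
```python
import numbers

def _is_integer(x):
    # numpy integer scalars register as numbers.Integral, so this accepts
    # both Python ints and np.int64 while still rejecting floats; bool is
    # integral in Python but is not a valid size entry.
    return isinstance(x, numbers.Integral) and not isinstance(x, bool)
```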
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110778
Approved by: https://github.com/mikaylagawarecki
I tested by adding some warning logs in C++, run a distributed program and show that they now had `[rank0]:` in the messages. There is no existing test infra for C++ logging so I couldn't easily add a unit test.
The implementation strategy is to setup a global variable in C++, and then poke it when we initialize a process group. This was the simplest thing I could think of that would work.
This PR only works for non-glog logging. Probably need to come up with some other strategy for glog, e.g., a custom prefix, but need to make sure this doesn't conflict with fbcode. I can't easily test this from OSS, will leave as follow up work.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110623
Approved by: https://github.com/voznesenskym, https://github.com/wanchaol, https://github.com/fduwjj
`libshm.so` depends on the torch library exclusively for `at::RefcountedMapAllocator`,
so it makes sense to move it to c10 along with the other memory allocators.
This means `libshm.so` only depends on `c10` and we don't need to relink
`libshm.so` for every ATen change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109881
Approved by: https://github.com/albanD
## Context
For more context, please refer to [this PyTorch forums post](https://dev-discuss.pytorch.org/t/defining-the-core-aten-opset/1464).
This PR registers some additional ATen operators as `core`, based on feedback from the forums post as well as the experiences from adding other core ATen decompositions.
The ATen operators registered as core in this diff, with the associated reasoning, are:
ATen op | reasoning
--|--
aten::atan2 | This operator often maps to a hardware intrinsic.
aten::diagonal | There is no straightforward decomposition for this operator.
aten::empty_like | Decomposition for this operator would require `as_strided` to retain the strides of the input tensor, which should be avoided.
aten::expm1 | This operator often maps to a hardware intrinsic; Furthermore, decomposing it will negatively impact the numerical precision of the output.
aten::full_like | Decomposition for this operator would require `as_strided` to retain the strides of the input tensor, which should be avoided.
aten::log10 | This operator often maps to a hardware intrinsic; Furthermore, decomposing it will negatively impact the numerical precision of the output.
aten::log1p | This operator often maps to a hardware intrinsic; Furthermore, decomposing it will negatively impact the numerical precision of the output.
aten::log2 | This operator often maps to a hardware intrinsic; Furthermore, decomposing it will negatively impact the numerical precision of the output.
aten::pow.Scalar_Tensor | This is a Scalar variant of pow.Tensor_Tensor, which is a part of core.
aten::resize | There is no valid decomposition for this operator.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110882
Approved by: https://github.com/lezcano
Summary
Currently, when detecting a timeout/exception in the watchdog
workCleanupLoop, we call nccl APIs to abort all the active communicators
before finally re-raising the exception and killing the process. The
nccl APIs may hang, causing additional problems. Instead, just re-raise.
@kumpera proposed that changing this default should save us from a lot of commonly observed errors.
Note: there are other cuda/nccl api calls in our watchdog, which also could hang. This change is not a substitute for a deeper refactor.
Detail
The current default (NCCL_ASYNC_ERROR_HANDLING=1:TearDown) meant the following:
SHOULD_TEAR_DOWN() evaluates to true
- This affects 'ProcessGroupNCCL::WorkNCCL::handleException`
- handleException is called from two places:
- work.wait() -> synchronizeInternal() -> handleException()
- workCleanupLoop() -> handleException()
- when true, the exception is logged and rethrown
SHOULD_CLEAN_UP() evaluates to true
- This only impacts the workCleanupLoop()
- When true, it means all communicators will be aborted (ncclCommAbort())
upon work exception or timeout
The proposed new default is NCCL_ASYNC_ERROR_HANDLING=3:SkipCleanUp.
This only changes SHOULD_CLEAN_UP() to false, impacting workCleanupLoop() behavior.
Communicators will no longer be aborted, which should avoid a class of bugs where the watchdog hangs due to calling nccl APIs which may block/hang.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110723
Approved by: https://github.com/fduwjj, https://github.com/xw285cornell
Skip tensor.to in from_local and distribute_tensor when the device_type of the
device mesh matches the tensor's device type. Since from_local is on the critical
path of TP, this might also reduce some overhead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110774
Approved by: https://github.com/fduwjj
Fixes #86805
Adds support for sgn to MPS backend.
Notes:
1. @malfet self-assigned this when he was working on implementing polar, but from what I can tell, he didn't end up needing to implement it.
2. @Berzeg implemented this last year, before view_as_complex was supported. Because of @malfet's recent contributions, however, @Berzeg's implementation works. I've removed the part of his implementation that dealt with non-complex dtypes (since these can just be passed to at::sign), matched the more recent pattern we've been using in UnaryOps.mm, and thrown in a simple implementation of _efficientzerotensor for mps, so that the backward function works.
3. @Berzeg deserves a good bit of credit for this, so let me know if there's a way to assign him some without jamming up the PR (he seems to be AWOL since last working on this)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110829
Approved by: https://github.com/malfet
Enable Flake8-PYI rules codebase-wide. Most of the rules already match our codebase style; the remaining ones that were not autofixed I have added to the pyproject.toml to be enabled in a later PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110830
Approved by: https://github.com/albanD
Summary: Implement an on-disk cache to save and reuse compiled FX Graphs. This implementation does not handle tensors with symbolic shapes. This needs to be done in a follow-up PR.
Test Plan:
* New unit tests exercising saving to and loading from the cache.
* New unit tests to exercise the cache key calculations.
* Ran several benchmarks to see cache hit and resulting compilation times.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103453
Approved by: https://github.com/eellison
The "import torch" crashes with following cpuinfo error on powerpc64.
==============================================================
>>> import torch
Error in cpuinfo: processor architecture is not supported in cpuinfo
Fatal error in cpuinfo: cpuinfo_get_processors_count called before cpuinfo is initialized
Aborted (core dumped)
==================================================================
The patch fixes this by excluding powerpc from using cpuinfo as it is not supported for ppc64.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110708
Approved by: https://github.com/ezyang
The motivation for removing this is already present in the pre-PR comments. Copying it
~~~
# NB - SuperSource is a weird one.
# it is our only source with 2 bases, so we use the object
# as the base, rather than the type, since an invocation
# like super(Foo, foo) is represented here, the source object base is more spiritually
# aligned with the instance, rather than the type.
# This whole construction is questionable tho, and we should probably find a way to
# avoid this exception to our otherwise nice source parentage invariant.
~~~
Instead of using super(a, b), we can use `type(b).__mro__[index]`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110475
Approved by: https://github.com/jansel
Summary: Forward fix a performance regression caused by https://github.com/pytorch/pytorch/pull/110510. When a model is run once, all those kernel pointers are initialized, and removing the if-nullptr check would cause those loadKernel calls to be unnecessarily executed again when we rerun the forward function. Another way to do this is to codegen loadKernel in the initializer, which I may do in a later PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110800
Approved by: https://github.com/jansel
optree recently landed and provides quite good perf; conditionally import
optree if it is installed.
Some numbers testing an MLP layer with TP + func collective:
before this PR: 10.390ms
after this PR: 9.189ms
so around a 10% end-to-end CPU overhead reduction
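A minimal sketch of the conditional import pattern, assuming a hypothetical `_python_tree_flatten` fallback (the actual wiring in torch lives elsewhere):
```python
try:
    import optree
    HAS_OPTREE = True
except ImportError:
    HAS_OPTREE = False

def tree_flatten(tree):
    if HAS_OPTREE:
        # Fast C-backed implementation when optree is available
        return optree.tree_flatten(tree, none_is_leaf=True)
    return _python_tree_flatten(tree)  # hypothetical pure-Python fallback
```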
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110670
Approved by: https://github.com/fegin
This PR exposes torch._higher_order_ops.cond as torch.cond.
1. Need to add #noqa: F811 to the _check calls in torch/__init__.py to address a confusing linter error ("Redefinition of unused 'cond'"); only one cond is imported, and the lines that trigger this error don't define cond but just use it as an argument.
2. Also add cond to the list of functions allowed to be traced through, so that dynamo can trigger the CondHigherOrder logic instead of creating a TorchVariable.
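A minimal usage sketch of the newly exposed API:
```python
import torch

def true_fn(x):
    return x.sin()

def false_fn(x):
    return x.cos()

@torch.compile
def f(x):
    # Both branches are traced; the predicate may be data-dependent.
    return torch.cond(x.sum() > 0, true_fn, false_fn, (x,))

print(f(torch.randn(4)))
```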
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110293
Approved by: https://github.com/zou3519
Fixes#88919
@mruberry @peterbell10
This PR adds a warning to the .cpp STFT and ISTFT functions if a window is not provided.
It also describes the warning in the documentation on `functional.py`.
Finally, it adds unit tests to check if the warning is being produced.
I have audited for internal calls of `stft` and `istft` in PyTorch and haven't found any.
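A small sketch of the behavior this PR introduces:
```python
import torch

x = torch.randn(1024)
# Without an explicit window, a warning is now emitted:
s1 = torch.stft(x, n_fft=256, return_complex=True)
# Passing a window silences it:
s2 = torch.stft(x, n_fft=256, window=torch.hann_window(256), return_complex=True)
```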
Thank you for the opportunity to contribute!
Eric
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110695
Approved by: https://github.com/ezyang
Fixes the string_view errors and reland the work. The previous changes in torch/csrc/utils/invalid_arguments.cpp were too aggressive and not tested thoroughly. They are discarded.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110518
Approved by: https://github.com/ezyang
Summary: `torch.nonzero` doesn't have inductor lowering (yet). To invoke the operator in AOT Inductor's ABI compatibility mode we need a dedicated shim function.
Test Plan:
```
$ python test/inductor/test_aot_inductor.py -k test_zero_grid_with_unbacked_symbols
...
----------------------------------------------------------------------
Ran 4 tests in 78.650s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110766
Approved by: https://github.com/chenyang78
ghstack dependencies: #110713, #110745, #110764
Summary: For unbacked SymInts, the C++ wrapper codegen can generate expressions like `buf123.size()` or `.stride()` or `.storage_offset()`:
7cc0020a80/torch/_inductor/ir.py (L2504-L2520)
Here we add corresponding methods to the `RAIIAtenTensorHandle` class so that the above codegen works in the ABI compatibility mode.
Test Plan: CI + the following PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110764
Approved by: https://github.com/chenyang78
ghstack dependencies: #110713, #110745
Summary: `repeat_interleave.Tensor` doesn't have inductor lowering. To invoke the operator in AOT Inductor's ABI compatibility mode we need a dedicated shim function.
Test Plan:
```
$ python test/inductor/test_aot_inductor.py -k test_repeat_interleave
...
----------------------------------------------------------------------
Ran 4 tests in 70.526s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110745
Approved by: https://github.com/chenyang78
ghstack dependencies: #110713
Summary:
In some scenarios, we want to update constants at runtime.
In such cases, we have to keep the original constants in
the generated code without applying any constant-inlining
optimizations.
This PR adds a config to force us to add tensor constants.
Differential Revision: D49895154
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110491
Approved by: https://github.com/mikekgfb
**Get the device index via `torch.privateuse1._utils._get_device_index`, if that method exists.**
Reason:
Before this change, only device index 0 could be obtained when `location` was something like 'privateuse1' without an explicit index.
Using `_get_device_index` yields the accurate device index in this scenario.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108123
Approved by: https://github.com/albanD
Summary:
## Context
Both `aten.sum` and `aten.squeeze` have a "most generic" variant in the form of `aten.sum.dim_IntList` and `aten.squeeze.dims` respectively. Add decompositions for other non generic variants of these operators to express them using the most generic variant.
Note that to register these decomps, the reference implementation under `_refs` had to be removed as registered decompositions. cc: @lezcano @peterbell10
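As an illustration only (not the exact decomps added here), a non-generic variant can be expressed via the generic overload roughly like this, registering into a private table to avoid clashing with existing registrations:
```python
import torch
from torch._decomp import register_decomposition

aten = torch.ops.aten
my_decomp_table = {}

@register_decomposition(aten.squeeze.default, registry=my_decomp_table)
def squeeze_default(x):
    # Express the no-arg variant via the generic dims overload.
    return aten.squeeze.dims(x, [d for d in range(x.dim()) if x.shape[d] == 1])
```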
Test Plan: Github CI + Meta Internal CI
Differential Revision: D49965952
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110645
Approved by: https://github.com/peterbell10, https://github.com/digantdesai, https://github.com/manuelcandales
We want to get to a point where most `UserError`s link to `exportdb` examples. This PR makes passing case names non-optional to make this intent clearer and encourage developers who raise `UserError`s to make or point to examples that make fixing such errors more obvious for users.
In addition, sometimes there are multiple examples that are relevant to an error. Thus this PR also enables passing multiple case names.
Differential Revision: [D50020465](https://our.internmc.facebook.com/intern/diff/D50020465/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110733
Approved by: https://github.com/zhxchen17
We had the option but never used cpu_offload, as optimizer state_dict offloads the tensors to CPU by default. This is usually what most users want, since the tensors have to be moved to CPU eventually. However, we may want to disable offloading to CPU in some cases, especially for debugging purposes. This PR lets optimizer state_dict read the flag.
Differential Revision: [D48913340](https://our.internmc.facebook.com/intern/diff/D48913340/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108434
Approved by: https://github.com/wz337
This PR adds support for aten.where and implicit scalar promotion: when we
meet scalar tensors in the dispatching logic, we implicitly convert them to
replicated DTensors.
The latter also enables a bunch of ops in the op db to pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110584
Approved by: https://github.com/fduwjj
Summary:
As title, this diff hides the contiguous requirement for user input mesh when initializing DeviceMesh.
In the current implementation, when testing with inter-node model parallelism, an exception is thrown during mesh validation when the following input is provided:
```
mesh = torch.arange(0, world_size).view(mp_size, dp_size).transpose(0, 1)
device_mesh = DeviceMesh(
"cuda",
mesh.contiguous(),
mesh_dim_names=("dp", "mp")
)
```
Test Plan:
**Unit Test**:
```
buck2 test mode/dev-nosan //caffe2/test/distributed/_tensor:device_mesh -- test_validate_device_mesh
Test UI: https://www.internalfb.com/intern/testinfra/testrun/3940649876878399
Network: Up: 0B Down: 0B
Jobs completed: 6. Time elapsed: 1:58.7s.
Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. Build failure 0
```
**Test with MP**
```
mesh = torch.arange(0, world_size).view(mp_size, dp_size).transpose(0, 1)
device_mesh = DeviceMesh(
"cuda",
mesh.contiguous(),
mesh_dim_names=("dp", "mp")
)
```
Without the change: exception.
After this change: initialized successfully.
Differential Revision: D49942839
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110628
Approved by: https://github.com/wanchaol, https://github.com/xw285cornell, https://github.com/fduwjj
Follow up to #110123, removing the CUDA_VERSION check for ROCm because HIP already has hipMallocAsync() and doesn't need the version check there.
Follow up to #108488, fixing the failing unit tests by accepting either a "cuda" or "hip" attribute for the caching allocator options. This is aligned to the masquerading strategy for ROCm/HIP.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110715
Approved by: https://github.com/ezyang
Previously, `Dim` definitions that shared the same name but had different ranges were allowed to appear in the `dynamic_shapes` argument of an `export` call. They would correspond to the *same* dynamic dimension (identified by the shared name), whose effective range would be the *intersection* of the different ranges.
However this behavior can be confusing, because having different definitions with the same name is more likely than not unintentional. Therefore, this PR makes it a user error.
We still allow different definitions with the same name to exist at the same time (no global uniqueness) as long as they are not confused in the same `export` call. Redefinitions with the same bounds are also allowed, in case they are accidentally created by executing the same code multiple times.
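A sketch of what now raises a user error:
```python
from torch.export import Dim

# Two Dims sharing the name "batch" but with different ranges:
b1 = Dim("batch", min=1, max=16)
b2 = Dim("batch", min=1, max=32)
# Using both within the *same* export call is now a user error;
# redefinitions with identical bounds remain allowed.
```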
Differential Revision: [D49965944](https://our.internmc.facebook.com/intern/diff/D49965944/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110638
Approved by: https://github.com/zhxchen17
While the `range_constraints` that is initially derived by processing of constraints only contains symbols that appear in the graph module, eventually the `range_constraints` that are in the exported program seem to contain more symbols than those that appear in the graph module. Clearly this is a regression, because the example of "Expressing Dynamism" in our public docs (https://pytorch.org/docs/stable/export.html#expressing-dynamism) does not show the extra symbols in `range_constraints`, but running the example does.
The problem seems to arise when we are running `_transform` passes, where we regenerate the `range_constraints` from the `shape_env`. However, as a rule, symbols that have `replacements` are actually replaced (by other expressions, including constants or other symbols), so they should never appear in the graph module. Thus we can filter such symbols out from `range_constraints` as well.
Differential Revision: [D49969620](https://our.internmc.facebook.com/intern/diff/D49969620/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110644
Approved by: https://github.com/zhxchen17
Summary: The runtime assertions inserted in the `torch._export.export` by the `_AddRuntimeAssertionsForInlineConstraintsPass` lead to errors in AOT Inductor like #109884. In `torch._export.aot_compile` export and AOT compilation are run consecutively which would lead to the above issue if any assertions are inserted.
In this PR, we're adding a new parameter / flag to `torch._export.aot_compile`, `remove_runtime_assertions`, to remove the assertions inserted during export before AOT compilation. The flag is set to `False` for BC.
Additionally, we remove the flag `add_runtime_assertions_for_inline_constraints` recently added to `torch._dynamo.config`, as it can lead to undesirable `torch._export` behavior and is no longer required for the AOT Inductor testing purposes.
Test Plan: CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110710
Approved by: https://github.com/zhxchen17, https://github.com/chenyang78
Expose a set of observability hooks into C10D such that our users can
detect collective failures both faster and more easily.
The design is similar to NCCL desync debug in that it minimizes the
overhead by doing most of the work off the main thread.
This PR introduces a new module torch.distributed.hooks that exposes the following set of methods:
register_collective_start_hook
register_collective_end_hook
register_process_group_hook
The process group hook exposes PG creation on the member ranks and is called inline from
the PG creation code. This is fine since this happens during initialization and a limited number of times.
The collective start/end hooks are fired from a single background thread, which reads
events from a C++ queue and dispatches them.
Queue notification is, oddly, done using a pipe; this is needed so Python can abort the thread on shutdown
and keep it as a background thread, which is not possible with more reasonable choices like a condvar.
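A hypothetical usage sketch; the callback signature is an assumption based on the description above:
```python
import torch.distributed.hooks as dist_hooks

def on_collective_start(event):
    # Runs on the background dispatch thread described above.
    print("collective started:", event)

dist_hooks.register_collective_start_hook(on_collective_start)
```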
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108815
Approved by: https://github.com/wconstab, https://github.com/fduwjj
Summary:
With the release of cuSPARSELt v0.5.0, we now have support for
int8 x int8 -> int32 matmul.
This PR adds support for this via out_dtype.
Test Plan:
```
python test/test_sparse_semi_structured.py -k int32
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110499
Approved by: https://github.com/cpuhrsch
Fixes recent broken unit tests caused by PR #109908 because cudnn and miopen have separate batch norm functions.
```
2023-10-05T09:35:01.6606614Z _______________ TestQuantizePT2EQAT.test_qat_conv_bn_fusion_cuda _______________
2023-10-05T09:35:01.6606948Z Traceback (most recent call last):
2023-10-05T09:35:01.6607362Z File "/var/lib/jenkins/pytorch/test/quantization/pt2e/test_quantize_pt2e_qat.py", line 323, in test_qat_conv_bn_fusion_cuda
2023-10-05T09:35:01.6607767Z self._verify_symmetric_xnnpack_qat_graph(
2023-10-05T09:35:01.6608217Z File "/var/lib/jenkins/pytorch/test/quantization/pt2e/test_quantize_pt2e_qat.py", line 130, in _verify_symmetric_xnnpack_qat_graph
2023-10-05T09:35:01.6608658Z self._verify_symmetric_xnnpack_qat_graph_helper(
2023-10-05T09:35:01.6609105Z File "/var/lib/jenkins/pytorch/test/quantization/pt2e/test_quantize_pt2e_qat.py", line 173, in _verify_symmetric_xnnpack_qat_graph_helper
2023-10-05T09:35:01.6609623Z m = prepare_qat_pt2e(m, quantizer)
2023-10-05T09:35:01.6610171Z File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/ao/quantization/quantize_pt2e.py", line 178, in prepare_qat_pt2e
2023-10-05T09:35:01.6610561Z _fuse_conv_bn_qat(model)
2023-10-05T09:35:01.6611072Z File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/ao/quantization/pt2e/qat_utils.py", line 501, in _fuse_conv_bn_qat
2023-10-05T09:35:01.6611497Z m = _fuse_conv_bn_qat_helper(m, is_cuda=True)
2023-10-05T09:35:01.6612065Z File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/ao/quantization/pt2e/qat_utils.py", line 575, in _fuse_conv_bn_qat_helper
2023-10-05T09:35:01.6612492Z _get_conv_bn_getitem_nodes(r.replacements)
2023-10-05T09:35:01.6613058Z File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/ao/quantization/pt2e/qat_utils.py", line 383, in _get_conv_bn_getitem_nodes
2023-10-05T09:35:01.6613465Z assert bn_node is not None
2023-10-05T09:35:01.6613716Z AssertionError
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110653
Approved by: https://github.com/jerryzh168, https://github.com/pruthvistony
Add non-package python modules to the public API checks.
The original change is to remove the `ispkg` check in this line
https://github.com/pytorch/pytorch/blob/main/docs/source/conf.py#L518
Everything else is to:
- add the appropriate modules to the rst files,
- make sure every module we provide can be imported (fixed by either making optional dependencies optional or just deleting files that have been un-importable for 3 years),
- make APIs that are both modules and functions (like torch.autograd.gradcheck) render properly on the docs website without confusion, and
- add every non-documented API to the allow list (~3k of them).
Next steps will be to try and fix these missing docs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110568
Approved by: https://github.com/zou3519
Adam part of: https://github.com/pytorch/pytorch/issues/110506
TODO:
- If this approach is validated as a good one, it can also be applied to all other optimizers which convert `complex` via list comprehensions
### Results:
`NUM_PARAMS=200, foreach=True`
- main: dynamo: 43s, inductor: 31s, total: 74s
- this PR: dynamo: 3.5s, inductor: 30s, total: 34s (dynamo speedup: 12.3x, overall speedup: 2.1x)
`NUM_PARAMS=1000, foreach=True, has_complex shortcut`:
```
<class 'torch.optim.adam.Adam'> {'lr': 0.01, 'foreach': True} torch.float32 TorchDynamo compilation metrics:
Function Runtimes (s)
------------------------------------ -------------------------------
_compile.<locals>.compile_inner 0.0329, 50.0806, 0.0041
OutputGraph.call_user_compiler 44.9924
```
`NUM_PARAMS=1000, foreach=True`:
```
<class 'torch.optim.adam.Adam'> {'lr': 0.01, 'foreach': True} torch.float32 TorchDynamo compilation metrics:
Function Runtimes (s)
------------------------------------ -------------------------------
_compile.<locals>.compile_inner 0.0389, 58.6069, 0.0043
OutputGraph.call_user_compiler 44.1425
```
### Discussion
- The `has_complex` shortcut provides an additional 2x dynamo speedup, though it is not necessary for achieving a significant overall speedup; a sketch follows.
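A minimal sketch of the shortcut; the example parameter list is illustrative:
```python
import torch

params = [torch.randn(2), torch.randn(2, dtype=torch.complex64)]  # example group
has_complex = any(torch.is_complex(p) for p in params)
if has_complex:
    # Only pay for the per-tensor conversion when complex params exist.
    params = [torch.view_as_real(p) if torch.is_complex(p) else p for p in params]
```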
CC: @janeyx99 @mlazos
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110607
Approved by: https://github.com/janeyx99, https://github.com/lezcano
In https://github.com/pytorch/pytorch/pull/110362, the failure was flaky but merge bot treated it as an actual failure. This is a regression after https://github.com/pytorch/test-infra/pull/4604 where the name returned by Dr.CI now includes workflow name. For example, the name is `trunk / macos-12-py3-arm64 / test (default, 2, 3, macos-m1-12)` in the JSON response:
```
{"FAILED": [], "FLAKY": [{"workflowId": 6372581477, "id": 17297638807, "name": "trunk / macos-12-py3-arm64 / test (default, 2, 3, macos-m1-12)", "jobName": "macos-12-py3-arm64 / test (default, 2, 3, macos-m1-12)", "conclusion": "failure", "completed_at": "2023-10-01T22:18:28Z", "html_url": "https://github.com/pytorch/pytorch/actions/runs/6372581477/job/17297638807", "head_branch": "ciflow/trunk/110362", "pr_number": 110362, "head_sha": "03f51e36dedf234931006d1db61677b229c9a119", "failure_captures": ["Failure: There is only 4671284KB free space left in /, which is less than the minimum requirement of"], "failure_line": "Failure: There is only 4671284KB free space left in /, which is less than the minimum requirement of 6291456KB for macOS", "time": "2023-10-01T22:17:53.847751Z"}], "BROKEN_TRUNK": [], "UNSTABLE": []}
```
I updated merge bot to handle this better by considering the workflow name, the job name, and the combined full name.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110661
Approved by: https://github.com/clee2000
Move the print to the beginning instead, because putting it at the end makes you scroll through the logs when debugging, and nothing in that function indicates that it should be printing anything.
Also moved the line printing disabled issues out of the for loop.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110621
Approved by: https://github.com/huydhn
Summary:
Sometimes local_shards are empty on some ranks while out.dtype is float16, which causes an error if enforce_dtype is True because `data` will be float32.
Callers know best what dtype they want, so we can just let callers decide.
Temporarily keep enforce_dtype for backward compatibility.
Test Plan: Run local and MAST job
Reviewed By: uciyc123
Differential Revision: D46886551
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110561
Approved by: https://github.com/wanchaol, https://github.com/malfet
Summary:
Since we changed the IR that we are working with to pre-autograd ATen IR, it's easier
to use plain pattern matching instead of relying on source_matcher_utils now. This
PR refactors the annotation for conv to use aten ops directly.
Also fixed reentrant test after this change.
Test Plan:
python test/test_quantization.py TestQuantizePT2E
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110308
Approved by: https://github.com/kimishpatel
Replace `dispatch_to_CDouble`, `dispatch_to_CLong` and `dispatch_to_CComplexDouble` with `dispatch_to<T>` template
<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at c3d9d01</samp>
> _Sing, O Muse, of the clever coder who devised_
> _A wondrous template function, `dispatch_to<T>`, that could_
> _Handle with ease the various scalar types that vexed_
> _The previous code, which was verbose and dull as wood._
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110619
Approved by: https://github.com/soulitzer, https://github.com/albanD
ghstack dependencies: #110618
When we convert to a local tensor, DTensor can't track the autograd or
gradient layout of the local tensor anymore. If the user does something unexpected, there
needs to be a way for the user to hint at the gradient layout of the
local tensor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110629
Approved by: https://github.com/zdevito
Summary:
Original commit changeset: 03980fb054d5
Original Phabricator Diff: D49519512
Bisecting shows that this diff is the cause of S369683. Since this affects Ads production, need to back out this diff immediately.
Test Plan: See S369683
Reviewed By: ezyang
Differential Revision: D49958638
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110622
Approved by: https://github.com/yanboliang
To reduce the amount of logs
* for successes, only print the part that says what tests ran and don't print the rest. Zip the log into an artifact. The line listing all the test names is really long, but if you view the source of the raw logs, it will not wrap, so it will only be one line. The log classifier can also be configured to ignore this line. Gets rid of lines like `test_ops.py::TestCommonCPU::test_multiple_devices_round_cpu_int64 SKIPPED [0.0010s] (Only runs on cuda) [ 9%]`
* for failures/reruns, print logs. Do not zip.
Also
* change log artifact name
Examples of various logs:
a074db0f7f failures
1b439e24c4 failures
Possibly controversial, haha: should I include an option for always printing?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110033
Approved by: https://github.com/huydhn
Summary:
D49187352 caused our model conversion and loading of QAT checkpoints to get stuck with a thrift timeout.
We are actively checking in the final code and model for the static quant HTP prod model, and encountered this breakage at head on Thursday.
A thrift timeout is not a hard failure, and because of that, it's hard to bisect and find this culprit. It is also hard to set up a unit test, because the job simply times out. A better test is needed to guard downstream model conversion against upstream changes.
Our suspicion of why this diff broke us is that we create a lot of modules with qat (in a recursive manner) but our model is not a qat-traceable module (it is a graph with many qat modules and floating point modules). With functools.partial as in the original diff, we will be caching modules in memory, causing the memory of the machine to be taken up completely.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110392
Approved by: https://github.com/junesg, https://github.com/jerryzh168
Defining kernels as static vars is problematic for subsequent model loading on non-default CUDA devices.
If those kernels were loaded in the context of device #0, they are no longer nullptr, so the kernels won't work on devices other than device #0.
This change makes devices remembered at model level in AOT mode.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110554
Approved by: https://github.com/chenyang78, https://github.com/desertfire
Fixes#106698
Also added a check for the Python API, because the current error message
```
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/sehoon/pytorch-latest/torch/nn/modules/adaptive.py", line 128, in __init__
or (min(cutoffs) <= 0) \
ValueError: min() arg is an empty sequence
```
is not very comprehensible.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106777
Approved by: https://github.com/albanD
Ideally all `_dynamo.exc.UserError`s should have "case names", i.e., link to examples in `exportdb`.
This PR adds case names to several instances of `_dynamo.exc.UserError`. In particular, looking at coverage based on `UserErrorType`:
* `DYNAMIC_CONTROL_FLOW`, `ANTI_PATTERN`, and `STANDARD_LIBRARY` are fully covered.
* `CONSTRAINT_VIOLATION` and `DYNAMIC_DIM` have no coverage. We don't seem to have any dedicated examples of specifying dynamic shapes in `exportdb` (although they are used in some other examples without explanation, to avoid some specialization that would make such examples moot).
* `INVALID_INPUT` is only partly covered. Frankly this is tedious to cover via examples.
Differential Revision: [D49928518](https://our.internmc.facebook.com/intern/diff/D49928518/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110555
Approved by: https://github.com/angelayi, https://github.com/ydwu4
Summary: Today, we get different batch norm ops depending on
the device the model is placed on at export time. Exporting
`model.cpu()` gives `_native_batch_norm_legit`, while exporting
`model.cuda()` gives `cudnn_batch_norm`. QAT fusion currently
only supports the former and silently ignores the latter. This
commit fixes this by additionally matching on the latter op
during QAT fusion.
Test Plan:
python test/test_quantization.py TestQuantizePT2EQAT.test_qat_conv_bn_fusion
python test/test_quantization.py TestQuantizePT2EQAT.test_qat_conv_bn_relu_fusion
Reviewers: jerryzh168, kimishpatel
Subscribers: jerryzh168, kimishpatel, supriyar
Differential Revision: [D49615145](https://our.internmc.facebook.com/intern/diff/D49615145)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109908
Approved by: https://github.com/jerryzh168
There is an issue with float8 type promotion, because _promoteTypesLookup doesn't contain records for a few types between bfloat16 and float8.
I have simply moved the float8 types just after bfloat16; however, I'm not sure whether this breaks serialization.
Please decide if it can stay like this, or whether I should instead insert the missing records filled with "ud" into _promoteTypesLookup rather than moving the types.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110279
Approved by: https://github.com/albanD
Part of #109684
- #109684
Changes:
- Add new functions `tree_structure`, `tree_leaves`, `tree_map_` and `tree_map_only_` to Python pytree.
- Extract reusable tests for pytree to `TestGenericPytree`.
- Change `treespec_dumps` and `treespec_loads` in C++ pytree to call Python pytree and use JSON string as serialization type.
- Rename `torch.utils.pytree` -> `torch.utils._cxx_pytree`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110395
Approved by: https://github.com/zou3519
We want to be able to use SingletonSymNode to represent strides for Jagged layout tensor. The following is for 3D, but easily generalizable to higher dimensions.
Constraints:
- [B, x, D] (where x represents the "variably lengthed dim") can be strided in two ways [x, 1, sum(x)] and [dx, d, 1]. We need two different placeholder values depending on how the jagged tensor is strided.
- When doing operations we need the strides of output tensors to be expressible in terms of the strides and sizes of the inner tensors. Given [B, x, D] @ [D, D'], the output strides are [x * D', D', 1] rather than some opaque [x2, D', 1]. This constraint exists because if I'm tracing, I need a symint to represent the output stride. This symint needs to come from somewhere; I could get it in several ways: (1) create a constant, (2) an unbacked symint, (3) create a new input using a source, (4) the output of an operation on an existing symint. It is clear that (4) is what we want here, which brings us to the design below.
Design:
Given the two constraints, the most straightforward way to implement this is actually to update SingletonSymNode to include some scalar factor; morally, SingletonSymNode then represents `factor * [s_0, s_1, …, s_n]`. This enables us to symbolically compute strides from sizes.
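An integer-only illustration of why the factor suffices (plain ints stand in for the symbolic node):
```python
# For a dense [B, x, D] tensor, contiguous strides are [x * D, D, 1]:
# the stride on dim 0 is "the same x, scaled by factor D".
def dense_strides(sizes):
    strides = [1] * len(sizes)
    for i in range(len(sizes) - 2, -1, -1):
        strides[i] = strides[i + 1] * sizes[i + 1]
    return strides

print(dense_strides([4, 7, 3]))  # [21, 3, 1] -> x=7 shows up as 7 * 3
```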
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110369
Approved by: https://github.com/ezyang
ghstack dependencies: #110044
Previously, something like j0 >= 3 would return False. In sympy, however, it is not possible to make both j0 >= 3 and j0 < 3 return False. In sympy, you only get to dispatch on Ge, and the remaining comparisons are derived; e.g. defining Ge(j0, 3) to be False would force Lt(j0, 3) to be True, which is not what we want.
In this PR, we make it so that both j0 >= 3 and j0 < 3 error, so that in a future PR, when we create the symbolic counterpart of this singleton, the behaviors can be the same.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110044
Approved by: https://github.com/ezyang
When mapping between the original signature of a program and the graph-captured signature of its exported program, we emit errors when we see unexpected original or graph-captured inputs or outputs.
These errors can arise because of various reasons, e.g.:
1. some input or output has been lifted because of mutation
2. some type is not pytree-registered for flattening / unflattening
3. some type cannot be realized with graph operations
(This is probably not an exhaustive list.)
Previously we used to emit errors based on a vanilla id-based membership check between the two sides, mostly anticipating (1) as the reason for errors. But this does not do justice to errors because of (2) or (3).
This PR emits a different error when it finds (3) to be a probable cause. Specifically, it considers only Tensor and Sym* types to be "supported": no other type seems to be realizable by graph operations.
When (2) is a probable cause, we sometimes also hit the same error because we would expect the supported types to show through upon registration. But this kind of error may need some more work in the future.
Differential Revision: [D49885828](https://our.internmc.facebook.com/intern/diff/D49885828/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110472
Approved by: https://github.com/ydwu4
Summary:
This PR removed several APIs from the AOTInductor interface,
which are not used by the client.
It also simplified AOTInductor's model class by removing
the dim info for input/output tensors. We included dim info
before to return max output shapes, which was used by the client
to allocate memory for output tensors. Now, we allocate output
tensor memory from the .so so that we don't need to maintain
such information any more. The deletion of dim info from
the model class also simplified the codegen quite a bit.
Test Plan: ci
Reviewed By: khabinov
Differential Revision: D49835430
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110411
Approved by: https://github.com/khabinov, https://github.com/desertfire, https://github.com/jansel
Summary: Registering param/buffer will write into a vector inside Object, need to maintain thread safety if we have threads reading from the vector and writing to the vector at the same time.
Test Plan: CI
Differential Revision: D49882601
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110488
Approved by: https://github.com/davidberard98
Summary: This diff refactors the code by moving CUDAAllocatorConfig into the header file. This config refactoring is done so that we can use the same config code for CUDA pinned memory as well.
Test Plan: sandcastle
Differential Revision: D49653265
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110123
Approved by: https://github.com/zdevito
This PR inlines the expected strings onto the `assertExpectedInline` calls, so that, when a
change is needed, we may do that by using the `expecttest` machinery: setting the
environment variable `EXPECTTEST_ACCEPT=1`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110290
Approved by: https://github.com/zou3519
Summary:
See wrapper.codegen_reinterpret_view(), it return a temporary handle for tensor, which has following problem.
```
# NB, the return handle here represents a temporary tensor, which will be automatically
# released.
# Here's a sample usage in the cpp wrapper code:
# ```
# aoti_torch_addmm_out(
# buf1,
# arg1_1,
# RAIIAtenTensorHandle(tmp_tensor_handle_0),
# buf0,
# 1L,
# 1L));
# ```
# RAIIAtenTensorHandle(tmp_tensor_handle_0) will be released after the call to addmm_out.
# This could be problematic when it's used in a different pattern, for example:
# ````
# AtenTensorHandle tensor_args[] = {RAIIAtenTensorHandle(tmp_tensor_handle_2), buf5, buf6};
# aoti_torch_proxy_executor_call_function(..., tensor_args);
# ````
# RAIIAtenTensorHandle(tmp_tensor_handle_2) will be invalid when it's used in the latter
# kernel call.
return f"RAIIAtenTensorHandle({tmp_name})"
```
As a result, ProxyExecutor would generate the following code, which causes invalid memory access.
Before:
```
// Source Nodes: [fn_with_tuple_output], Original ATen: [fb.fn_with_tuple_output]
AtenTensorHandle tmp_tensor_handle_2;
AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch__reinterpret_tensor(buf3, 2, int_array_0, int_array_1, 0L, &tmp_tensor_handle_2));
...
AtenTensorHandle tensor_args[] = {RAIIAtenTensorHandle(tmp_tensor_handle_2), buf5, buf6};
int64_t int_args[] = {1};
aoti_torch_proxy_executor_call_function(proxy_executor, 1, 1, int_args, 3, tensor_args);
buf3.reset();
```
With the fix in this diff, ProxyExecutor generates the following code:
After:
```
// Source Nodes: [fn_with_tuple_output], Original ATen: [fb.fn_with_tuple_output]
AtenTensorHandle tmp_tensor_handle_2;
AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch__reinterpret_tensor(buf3, 2, int_array_0, int_array_1, 0L, &tmp_tensor_handle_2));
...
aoti_torch_proxy_executor_call_function(proxy_executor, 1, 1, std::vector<int64_t>{1}.data(), 3, std::vector<AtenTensorHandle>{RAIIAtenTensorHandle(tmp_tensor_handle_2), buf5, buf6}.data());
buf3.reset();
```
I am not exactly a big fan of such `std::vector{...}.data()` for creating a temp array, but I can't think of another fix.
Test Plan: buck2 run mode/dev-nosan deeplearning/aot_inductor/test:test_custom_ops
Reviewed By: desertfire
Differential Revision: D49758764
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110451
Approved by: https://github.com/desertfire
Summary: Point-to-point ops don't enqueue their work into the `workMetaList_`, which means that the NCCL watchdog does not watch over them; hence, they do not respect the collective timeouts.
Test Plan:
While trying to add a test, I found we don't have tests which validate the NCCL watchdog. It looks like this is because we don't have a good way to detect when the NCCL watchdog has thrown an error (the exception is thrown in a side thread) in our testing framework / `MultiprocessTestCase`.
I manually tested this change with the script in https://github.com/pytorch/pytorch/issues/109401, but need to look more closely at how to automate a test for the NCCL watchdog.
Differential Revision: D49418976
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109611
Approved by: https://github.com/wconstab
Triplet Margin Loss takes in a Callable `distance_function` parameter which is not supported as an argument on the fx graph. See previous error:
> File "/scratch/eellison/work/pytorch/torch/_dynamo/symbolic_convert.py", line 562, in call_function
self.push(fn.call_function(self, args, kwargs))
File "/scratch/eellison/work/pytorch/torch/_dynamo/variables/torch.py", line 723, in call_function
*proxy_args_kwargs(args, kwargs),
File "/scratch/eellison/work/pytorch/torch/_dynamo/utils.py", line 504, in proxy_args_kwargs
f"call_function args: {typestr(*args)} {typestr(*list(kwargs.values()))}"
File "/scratch/eellison/work/pytorch/torch/_dynamo/exc.py", line 143, in unimplemented
raise Unsupported(msg)
torch._dynamo.exc.Unsupported: call_function args: TensorVariable() TensorVariable() TensorVariable() ConstantVariable(float) NNModuleVariable()
This is fixable by just inlining into `triplet_margin_loss` and continuing to compile it. This required support for `has_torch_function_variadic`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110302
Approved by: https://github.com/mlazos
The first reland broke internal (failing diff: D49617462).
The major error looks like it's because there's an internal-only higher order op that needs a new functionalization rule. I'm going to land an internal diff for that and confirm tests pass before relanding this PR.
Also confirmed that the issue from https://github.com/pytorch/pytorch/issues/110121 is fixed, and added a test.
This reverts commit 1b90f07f5a9fcb9187fee94f769fc117490c1e39.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110079
Approved by: https://github.com/ezyang
Summary: This diff fixes a heap underflow found by fuzzing in torch/csrc/jit/runtime/vararg_functions.cpp
Test Plan:
CI and
```
arc lionhead crash reproduce 1753074381791061
```
doesn't crash anymore.
Differential Revision: D49537535
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110441
Approved by: https://github.com/Skylion007
Summary:
Previously, we linked against cuda libs even for the pure cpp backend.
This caused issues for cases where the inference platform does not
have GPUs. This diff removes the cuda dependency for the cpp backend.
Reviewed By: bertmaher, muchulee8, mikekgfb
Differential Revision: D49800712
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110409
Approved by: https://github.com/bertmaher, https://github.com/desertfire
Description:
- Fixed misleading test sample case
Context: the sample input is composed of an input tensor `(N, C, iH, iW)` and a grid tensor `(N, oH, oW, 2)`; however, the grid was defined as `(N, C, oW, 2)`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110383
Approved by: https://github.com/peterbell10
**Background**: recordStream can result in memory spikes, so we don't want it to appear in FSDP (https://dev-discuss.pytorch.org/t/fsdp-cudacachingallocator-an-outsider-newb-perspective/1486). @awgu is working on fixing this, but it turns out profiler was causing recordStream to get called when it is enabled.
Why profiler was causing recordStream to get called: NCCL calls add profiler events manually; they register a callback to be executed when the future for the collective is completed; this indicates the end of the CPU-side profiler event for the callback:
c2c7c4035f/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (L1822-L1824)
In order to guarantee safety, ivalue::Future::invokeCallback calls `recordStream` on the future's storage buffers; this marks the fact that other streams (e.g. the one that the callback runs on) may need to use the storage.
c2c7c4035f/aten/src/ATen/core/ivalue_inl.h (L1171-L1173)
**Change**: The end-profiler-event callback doesn't actually use the future, so we don't need to recordStream on it. This PR introduces an optional parameter `uses_future` for adding callbacks; a user can set this variable to "false" to unsafely skip the recordStream, if the user knows that the future will not be used in the lambda.
**Tests**: (a) unit tests; (b) added an assert in recordStream: c2c7c4035f/c10/cuda/CUDACachingAllocator.cpp (L3260) and verified that it doesn't get triggered when running basic distributed tests w/ profiler enabled
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109933
Approved by: https://github.com/wconstab
Summary: after converting nn.MultiheadAttention we weren't deleting the
old in_proj_weight and in_proj_bias, despite not (really) using them.
Test Plan: python test/test_quantization.py -k
"test_custom_module_multi_head_attention"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110407
Approved by: https://github.com/jerryzh168
Unlike half/bfloat16 casts, where the entire model is cast to half-precision floats, only parts of the network may be in float8, and therefore the performance of the casts is important.
Speed up the casts by implementing non-dynamically-castable variants using the newly refactored `gpu_kernel_nocast` template.
Measure performance using the following script:
```python
import torch
def run_cast_bench(size=(10000, 10000), src_dtype=torch.float16, dtype=torch.float8_e5m2):
    x = torch.rand(size, device="cuda", requires_grad=False, dtype=src_dtype)
    z = torch.empty(size, device="cuda", dtype=dtype, requires_grad=False)
    with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CUDA]) as prof:
        z.copy_(x)
    rc = prof.key_averages()
    print(f"Running bench for src_dtype={src_dtype} dst_dtype={dtype} cuda_time={rc[1].cuda_time}")
if __name__ == "__main__":
    for dtype in [torch.float8_e5m2, torch.float8_e4m3fn]:
        run_cast_bench(src_dtype=torch.half, dtype=dtype)
        run_cast_bench(src_dtype=torch.float, dtype=dtype)
        run_cast_bench(src_dtype=torch.bfloat16, dtype=dtype)
```
Below are before and after results:
| Cast type | After | Before |
| ---------- | ------ | ----- |
| fp32->e5m2 | 228 us | 336 us|
| fp16->e5m2 | 150 us | 323 us|
| bf16->e5m2 | 150 us | 322 us|
| fp32->e4m3 | 227 us | 331 us|
| fp16->e4m3 | 148 us | 318 us|
| bf16->e4m3 | 149 us | 318 us|
Skip the optimizations on ROCm platform
TODO:
- Investigate why `__nv_cvt` intrinsics defined in https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__FP8__MISC.html end up being slower
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110251
Approved by: https://github.com/drisspg
This PR makes backend a property of DTensorTestbase and adds "cpu:gloo,cuda:nccl" support in DTensorTestbase, so that we can use the `cpu:gloo,cuda:nccl` backend for checkpoint unit tests.
cc. @wanchaol, @fduwjj, @XilunWu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110397
Approved by: https://github.com/wanchaol
Summary: See summary
Test Plan:
buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource -c pt.vulkan_full_precision=1
//xplat/caffe2/fb/custom_ops/vulkan_quantized:pt_vulkan_quantized_test_binAppleMac\#macosx-arm64
[ OK ] VulkanAPITest.convert_qconv2d_context (135 ms)
[ RUN ] VulkanAPITest.linear_2d
[ OK ] VulkanAPITest.linear_2d (4 ms)
[----------] 2 tests from VulkanAPITest (139 ms total)
[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (139 ms total)
[ PASSED ] 2 tests.
##############################################################
buck2 build --target-platforms ovr_config//platform/macos:arm64-fbsource
//xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output"
buck-out//v2/gen/fbsource/xplat/caffe2/pt_vulkan_quantized_api_test_binAppleMac
[ OK ] VulkanAPITest.conv2d_pw_quantized_prepack_random_params_int8_int32 (11 ms)
[ RUN ] VulkanAPITest.linear_2d_flat
[ OK ] VulkanAPITest.linear_2d_flat (4 ms)
[ RUN ] VulkanAPITest.linear_2d_small
[ OK ] VulkanAPITest.linear_2d_small (1 ms)
[ RUN ] VulkanAPITest.linear_2d_large
[ OK ] VulkanAPITest.linear_2d_large (1 ms)
[ RUN ] VulkanAPITest.linear_3d_flat
[ OK ] VulkanAPITest.linear_3d_flat (2 ms)
[ RUN ] VulkanAPITest.linear_3d_small
[ OK ] VulkanAPITest.linear_3d_small (2 ms)
[ RUN ] VulkanAPITest.linear_3d_large
[ OK ] VulkanAPITest.linear_3d_large (1 ms)
[ RUN ] VulkanAPITest.linear_4d_flat
[ OK ] VulkanAPITest.linear_4d_flat (1 ms)
[ RUN ] VulkanAPITest.linear_4d_small
[ OK ] VulkanAPITest.linear_4d_small (1 ms)
[ RUN ] VulkanAPITest.linear_4d_large
[ OK ] VulkanAPITest.linear_4d_large (1 ms)
[ RUN ] VulkanAPITest.linear_custom
[ OK ] VulkanAPITest.linear_custom (0 ms)
[----------] 76 tests from VulkanAPITest (1811 ms total)
[----------] Global test environment tear-down
[==========] 76 tests from 1 test suite ran. (1811 ms total)
[ PASSED ] 76 tests.
YOU HAVE 8 DISABLED TESTS
##############################################################
buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1
[----------] Global test environment tear-down
[==========] 346 tests from 1 test suite ran. (5648 ms total)
[ PASSED ] 345 tests.
[ SKIPPED ] 1 test, listed below:
[ SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
YOU HAVE 5 DISABLED TESTS
Differential Revision: D49609986
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110095
Approved by: https://github.com/yipjustin
Minor logging changes that just kind of annoyed me:
* prevent constant printing of `No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'` by moving import within the function (idk if this is ok)
* prevent constant printing of `Ignoring disabled issues: ['']` (no idea why it was not gated behind a function or main)
* change all prints in run_tests.py to go through stderr so there's no weird interleaving (although if everything goes through stderr, might as well just print everything through stdout...)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110188
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/ZainRizvi
Partially fixes `test_memory_format_factory_like_functions_preserve` with PYTORCH_TEST_WITH_INDUCTOR. Inductor preserves memory layouts for user-visible outputs as annotated on the fx graph that it is passed in. That graph is generated from running aot_autograd with decompositions. If the decompositions give incorrect strides, so will inductor.
This preserves the layout of `_like` operators when it corresponds to a `torch.memory_format`. It doesn't fix a) arbitrary permutations, b) striding of non-dense outputs. Both of these are lower-pri compared to preserving channels-last. We would need either https://github.com/pytorch/pytorch/issues/92920 or a `to` variant that takes in a physical layout for arbitrary permutations. I converted the output of rand to the correct layout instead of passing the layout in, so that this would compose with the `replace_random` pass, and because the two pointwise ops will get fused anyway.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110242
Approved by: https://github.com/int3
The original implementation of `_gather_orig_param_state` is naive. It performs one allgather_object and two allgathers (if the optimizer is Adam) per FQN. This can be slow and makes `_optim_state_dict` a bottleneck.
This PR rewrites the implementation and fuses all the `allgather_object`s into one. As for `allgather`, it is fused based on the information of the FlatParameters, so there will be 2N `allgather`s, where N is the number of FlatParameters and 2 is due to Adam having 2 states per FQN.
One experiment on 8 A100 GPUs shows that the execution time of the gathering improved from 3 seconds to 0.3 seconds.
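A hedged sketch of the fusion idea (helper names are illustrative, not the actual implementation):
```python
import torch.distributed as dist

def gather_all_state_meta(local_meta):
    # One all_gather_object for every FQN's metadata at once,
    # instead of one collective per FQN.
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, local_meta)
    merged = {}
    for rank_meta in gathered:
        merged.update(rank_meta)
    return merged
```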
Differential Revision: [D48835138](https://our.internmc.facebook.com/intern/diff/D48835138/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108298
Approved by: https://github.com/awgu
Summary:
Adding back D46578700 / PR https://github.com/pytorch/pytorch/pull/108426
Note: The changes were originally reverted due to memory regression, these changes are putting the code behind a gflag so it is only used by binaries that require expanded stack for BPF Profiling.
Original Diff comment:
To get a Node's call stack we currently loop on the InlinedCallStack graph and follow the "callee" chain. Since the node's inlined stack does not change we can optimize this but expanding the node's inlined stack once and reusing it. This is particularly useful when reading the node's stack from another process (e.g. BPF) as it simplified the memory traversal process.
The new data structure (NodeSourceInfo) only holds pointers to the function name and file name variables, and assumes these objects will be alive throughout the lifetime of the process.
Each Node has an extended attribute that has an index to a vector of stack frames expanded_node_stacks_
node_stack_attr_symbol_ is only needed to make accessing the stack vector index attribute easier from BPF.
Test Plan:
- Verified using BPF Program in subsequent diffs
- Perf testing for loading large model: P822455246
Differential Revision: D49565461
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110229
Approved by: https://github.com/zdevito
If the `XLA_HLO_DEBUG` flag is enabled, generate a minified HLO graph when using the minifier. This function enables HLO minification support by porting the minified FX graph to StableHLO via the `save_torch_model_as_stablehlo` function.
This allows users to port the minified graph to compilers that are not compatible with TorchDynamo/Inductor workflow and use XLA instead. The purpose of this PR is to help XLA users debug accuracy and compilation errors. It will also be helpful for existing TorchDynamo/XLA workflow on `torchxla_trace_once` backend as well.
Fixes [#5461](https://github.com/pytorch/xla/issues/5461) in Torch XLA repo. CC @GleasonK @qihqi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109084
Approved by: https://github.com/anijain2305
This fixes the daily timeout of ROCm jobs when running with memory leak check turning on. I want to use something like `inputs.timeout-minutes * 2` but that syntax, unfortunately, isn't supported in GitHub action YAML. So I decide to just x2 the current timeout value of 300 minutes to make it 600 minutes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110193
Approved by: https://github.com/clee2000
pytree is a great tool, but it is sometimes considered evil for
tensor subclasses; it's useful for implementing a subclass quickly, but it:
* exposes non-trivial CPU overhead
* isn't needed by many ops; only the ones with list/dict arguments need it
* has semantic issues for inplace/out ops when blindly used to re-wrap outputs
This PR avoids using pytree for most ops during torch_dispatch and only
enables it for certain ops.
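A minimal sketch of the idea, with illustrative helper names:
```python
import torch.utils._pytree as pytree

def unwrap_args(args, unwrap_one):
    if any(isinstance(a, (list, tuple, dict)) for a in args):
        return pytree.tree_map(unwrap_one, args)  # general, slower path
    return tuple(unwrap_one(a) for a in args)     # common fast path, no pytree
```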
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110132
Approved by: https://github.com/fduwjj
Summary:
Previously we used export's constraints to specify all batch-size dimensions being dynamic. This is done by creating 1 constraint `dynamic_dim(inp[0][0], lower, upper)`, followed by `dynamic_dim(inp[0][0]) == dynamic_dim(inp[i][0])` for every input `i`.
Through the new `dynamic_shapes` API, we can use `Dims("batch_size")` on every dimension to specify which dimensions are dynamic and equal to each other, and `None` otherwise: `{i: [Dims("batch_size", lower, upper), None] for every input i}`
Note: `dynamic_shapes` and `constraints` utilize the same "constraints" backend so this diff should be idempotent.
Test Plan: `buck2 run @//mode/dev-nosan //caffe2/torch/fb/model_transform/experimental/benchmark/test/aotinductor:test_aot_inductor_benchmark`
Reviewed By: chenyang78, aakhundov
Differential Revision: D49784351
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110360
Approved by: https://github.com/desertfire
When running on "gloo" and "cpu:gloo,cuda:nccl" backend, it will run into the following error.
```
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/data/users/irisz/pytorch/torch/multiprocessing/spawn.py", line 74, in _wrap
fn(i, *args)
File "/data/users/irisz/pytorch/torch/distributed/checkpoint/examples/fsdp_checkpoint_example.py", line 105, in run_fsdp_checkpoint_example
optim_state = load_sharded_optimizer_state_dict(
File "/data/users/irisz/pytorch/torch/distributed/checkpoint/optimizer.py", line 295, in load_sharded_optimizer_state_dict
_alloc_tensor(value.properties, value.size, dp_pg_device_type), sharding_spec
File "/data/users/irisz/pytorch/torch/distributed/checkpoint/optimizer.py", line 109, in _alloc_tensor
device=cast(torch.device, _get_device_module(device_type).current_device()),
AttributeError: module 'torch.cpu' has no attribute 'current_device'
```
This PR fixes the error in optimizer.py. Will follow up to add "cpu:gloo,cuda:nccl" support in DTensorTestbase so we can update the unit test to include this backend.
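A hedged sketch of the guard (the actual fix in optimizer.py may differ):
```python
import torch

def _safe_current_device(device_type):
    module = getattr(torch, device_type, None)
    if module is not None and hasattr(module, "current_device"):
        return torch.device(device_type, module.current_device())
    return torch.device(device_type)  # e.g. plain "cpu"
```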
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110299
Approved by: https://github.com/kumpera
Summary:
https://docs.google.com/document/d/1QJJEGnj2nHGPODlw38BEG3KLLCOTfdOVjPrNQbz_LM8/edit#bookmark=id.lp80wfshq130
`exported_program.run_decompositions(decomposition_table)` will optionally take a decomposition table, and run decompositions on the exported program, returning a new exported program. By default we will run the Core ATen decomposition table.
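A usage sketch based on this description; `model` and `example_args` are placeholders, shown with today's torch.export spelling:
```python
import torch
from torch._decomp import core_aten_decompositions

ep = torch.export.export(model, example_args)  # placeholders
core_ep = ep.run_decompositions(core_aten_decompositions())
# or ep.run_decompositions() to use the default Core ATen table
```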
Splitting up this diff with the following one (D49742989) to make migrating Executorch easier:
1. Land this diff
1. Wait for a pytorch nightly to include this diff
1. Update executorch's pytorch nightly
1. Land the following diff to have export() return no decomps
Test Plan: Tested in following diff
Differential Revision: D49743208
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110236
Approved by: https://github.com/gmagogsfm
Example of when the `evict_first` heuristic helps.
```
import torch

@torch.compile
def f(a, b):
    return (a * b).sum(dim=-1)

N = 512
inps = (torch.randn(N, N, N).permute(2, 1, 0), torch.randn(N, N, N).permute(1, 2, 0))
from torch._inductor.utils import do_bench
print(do_bench(lambda: f(*inps)))
```
This generates code like this: http://ix.io/4HFs
```
Original: 3.8 ms
This PR: 3.54 ms
Always `evict_first`: 5.4ms
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108841
Approved by: https://github.com/lezcano, https://github.com/jansel
When users define a customized attention mask with `dtype=torch.float16`, e.g.
```python
import torch
from torch.nn import functional as F

float_min = torch.finfo(torch.float16).min
attention_mask_fp16 = (attention_mask * 1.0).masked_fill(attention_mask, float_min).to(torch.float16)
attn_output = F.scaled_dot_product_attention(
    query_layer_, key_layer_, value_layer_, attention_mask_fp16, 0.0, is_causal=False
)
```
the ONNX graph cannot be exported.
When q, k, v have fp16 type, we can also support an fp16 `attn_mask` by changing the dtype check to
```
elif (
    _type_utils.JitScalarType.from_value(attn_mask)
    in (_type_utils.JitScalarType.FLOAT, _type_utils.JitScalarType.HALF)
):
```
With this change, the `.onnx` graph can be exported.
Fixes#109336
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110306
Approved by: https://github.com/titaiwangms
Summary:
## Why?
The .so and .h files are compiled separately with different flags. The .so is compiled by AOTInductor, and the .h files (e.g. c10/core/TensorImpl.h) are compiled by buck2.
Let's make sure the .so is also compiled with this macro in fbcode.
Differential Revision: D49664078
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110122
Approved by: https://github.com/chenyang78, https://github.com/khabinov
This PR enables the misc-XX checks in clang-tidy. Meanwhile, I excluded some of them that require a lot of code changes and have no immediate benefits. Some additional fixes and suppression were also given.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110283
Approved by: https://github.com/albanD
Summary: with the grid computed in terms of unbacked `SymInt`s, it can happen that the grid is zero-size. This causes a CUDA error on `cuLaunchKernel` in the AOT Inductor codegen.
In this PR, when the grid contains unbacked `SymInt`s, a check is added around the `launchKernel` call in the AOT Inductor C++ wrapper codegen to make sure that the grid is not zero-size.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110312
Approved by: https://github.com/chenyang78
# Summary
Logging Mode is great, and helped me identify that we are doing an unnecessary slice sometimes.
### Numbers
For small sizes, e.g. (16, 16, 32, 32):
This brings the timing from:
`flash_time: 29.344002110883594 microseconds`
to
`flash_time: 26.971791498363018 microseconds`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110324
Approved by: https://github.com/cpuhrsch
**Summary**
Follow up to https://github.com/pytorch/pytorch/pull/109893, which has an issue with CPU support as reported in https://github.com/pytorch/pytorch/issues/109897. This fix mainly includes 2 changes:
- The current implementation of `rename_indexing`
10c646295d/torch/_inductor/codegen/common.py (L1023) only adds symbols whose names start with `s` or `ps` into `kernel.args.sizevars`. However, an unbacked symint's name starts with `i`, so we extend the implementation of `rename_indexing` to support symbols starting with `i`.
- Currently, the internal loop index names also start with `i`. Since `i` is already used for unbacked symints, change the loop index names to start with `x`, which aligns with Triton.
**Test Plan**
```
python -u -m pytest -s -v test_torchinductor_dynamic_shapes.py -k test_bool_mask_nobreak
python -u -m pytest -s -v test_torchinductor_dynamic_shapes.py -k test_nonzero_size_factory_nobreak
python -u -m pytest -s -v test_torchinductor_dynamic_shapes.py -k test_item_zeros_nobreak
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110262
Approved by: https://github.com/ezyang, https://github.com/jgong5
Summary: This diff fixes a heap use-after-free (UAF) found by fuzzing in torch/csrc/jit/mobile/interpreter.cpp
Test Plan:
CI and
```
arc lionhead crash reproduce 1009060456885023
```
doesn't crash anymore.
Reviewed By: malfet
Differential Revision: D49538326
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110289
Approved by: https://github.com/malfet
# Summary
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 318764f</samp>
This pull request implements the CUDA backend of the SDPA kernel for nested tensors, which enables efficient transformer models with variable-length sequences. It adds a new dispatch key, a backward function, a unit test, and some helper functions for the kernel. It modifies `test/test_transformers.py`, `aten/src/ATen/native/native_functions.yaml`, `aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctionsBackward.cpp`, and `aten/src/ATen/native/nested/cuda/NestedTensorTransformerUtils.h`.
<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at ed4a773</samp>
> _Fused kernels of doom, unleash the flash attention_
> _Nested tensors on fire, reshape and pad with caution_
> _Backward pass of power, dispatch the CUDA key_
> _Test the gradients of hell, warn the user if they disagree_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97485
Approved by: https://github.com/jbschlosser
An `ExportedProgram`'s `__call__` signature is different from the original module, so `dynamic_shapes` that follow the original signature would fail when applied to re-export an `ExportedProgram`.
This PR fixes this issue, in other words, the original `dynamic_shapes` should now work when re-exporting.
Differential Revision: D49764011
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110276
Approved by: https://github.com/tugsbayasgalan
generate_output may return non-list/tuple outputs. Let's force
those to be lists, because we will enumerate kernel.outputs
later in the codegen.
Also fixed a minor issue in an assertion message.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110145
Approved by: https://github.com/aakhundov
Caused by #109866.
The test registers a new device module; the PR above checks for XPU, sees that it got registered, and uses it, but it is only a dummy module.
This causes any test after it to fail, so I "clean up" the registered module.
Another possible solution would be to run this test last.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110254
Approved by: https://github.com/huydhn
Generating reference outputs sometimes fails because of type mismatches in the graph,
an issue which was noticed previously for `prims.convert_element_type` and fixed in #92036,
but the same issue happens with other functions such as tensor constructors.
This expands the fix from #92036 to all dtype keyword arguments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110232
Approved by: https://github.com/ezyang
Fix a bug where the historical correlations heuristic sorts tests in the opposite order, ranking the least relevant tests most highly.
<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 70333d1</samp>
> _`test_files` sorted_
> _by ratings, high to low_
> _a faster spring test_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110257
Approved by: https://github.com/clee2000
Removing the functionality from the nvfuser Python APIs.
Since nvfuser was deprecated before the last release cut, we are removing TorchScript support.
The next PR will actually remove the code base.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110124
Approved by: https://github.com/davidberard98
Summary:
Also added annotation support for conv1d_relu and conv1d in XNNPACKQuantizer; the quantized results still
match the fx quant path (which didn't quantize conv1d), so tests are not disabled.
Test Plan: with-proxy buck2 run executorch/examples/quantization:example -- -m=w2l --verify
Differential Revision: D49479546
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109830
Approved by: https://github.com/kimishpatel
Summary:
Add the test to make sure we can call the quantize API multiple times
Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_reentrant
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110125
Approved by: https://github.com/kimishpatel
ghstack dependencies: #110097
Opaque pointer support is disabled in LLVM 14 and enabled by default from LLVM 15 onward.
Usage of the setOpaquePointers API is deprecated as of LLVM 16, so this API is removed.
Also updates the CreateMalloc and CreateFree APIs for the latest LLVM release.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110200
Approved by: https://github.com/Skylion007
One should never edit `gql_mocks.json` by hand, as otherwise it does not validate mergebot behavior using the actual GitHub data, but rather snapshot of this data frozen in time.
Unfortunately, GitHub started to delete checkrun statuses against older
PR, so some tests needs to be updated.
For example https://github.com/pytorch/pytorch/pull/77700/checks committed on May 19th 2022 has no checks at the time of the writing (Sep 28th 2023)
Deleted `test_checksuites_pagination`, as its checks are gone and it tests the same functionality as `test_get_checkruns_many_runs`, which was updated to use a more recent PR.
Deleted `test_get_classifications_pending_unstable`, because what it wants to test is inherently unreliable and therefore it must be rewritten using some different mechanism.
Disabled `test_internal_changes`, as the mechanism is broken at the moment, see https://github.com/pytorch/pytorch/issues/110218
Updated `test_pr_dependencies_ghstack` and `test_pr_dependencies` to generate `msg` using `pr.get_body()` rather than hardcoding the text (which was updated after the test was committed).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110221
Approved by: https://github.com/clee2000, https://github.com/huydhn
Indeed, `uint32_t(x) >= 0` is always true.
Warning typically looks as follows:
```
[337/1379] Building CUDA object caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/EmbeddingBag.cu.o
../aten/src/ATen/core/ivalue.h(1283): warning #186-D: pointless comparison of unsigned integer with zero
Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110227
Approved by: https://github.com/atalman, https://github.com/albanD
Make `np.arange` respect an explicitly provided dtype.
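A tiny illustration of the fixed behavior (assuming the `torch._numpy` layer mirrors numpy's `arange` signature, as the tests below exercise):
```python
import torch._numpy as np

a = np.arange(5, dtype=np.float32)
print(a.dtype)  # float32 -- the explicitly requested dtype is respected
```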
Also remove duplicated tests:
- torch_np/test_function_base.py::TestArange is a dupe of
- torch_np/numpy_tests/core/test_multiarray.py::TestArange
Fixes#109975
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110005
Approved by: https://github.com/lezcano
A resubmit of https://github.com/pytorch/pytorch/pull/108447. Copy over the descriptions:
This is a follow-up of the discussion in https://github.com/pytorch/pytorch/pull/108356, where we want to replace source_fn with source_fn_stack.
Before this PR, for the following example:
```python
backend = EagerAndRecordGraphs()

@torch.compile(backend=backend, fullgraph=True)
def cond_f(pred, pred2, x, y):
    def true_fn(pred2, x, y):
        return x + y

    def false_fn(pred2, x, y):
        def true_fn2(x, y):
            return x.sin() - y.cos()

        def false_fn2(x, y):
            return x.cos() - y.sin()

        return control_flow.cond(pred2, true_fn2, false_fn2, (x, y))

    return control_flow.cond(pred, true_fn, false_fn, (pred2, x, y))
```
The graph captured is shown below:
```python
class GraphModule(torch.nn.Module):
    def forward(self, L_pred_ : torch.Tensor, L_pred2_ : torch.Tensor, L_x_ : torch.Tensor, L_y_ : torch.Tensor):
        l_pred_ = L_pred_
        l_pred2_ = L_pred2_
        l_x_ = L_x_
        l_y_ = L_y_
        cond_true_1 = self.cond_true_1
        cond_false_1 = self.cond_false_1
        cond = torch.ops.higher_order.cond(l_pred_, cond_true_1, cond_false_1, [l_pred2_, l_x_, l_y_]); l_pred_ = cond_true_1 = cond_false_1 = l_pred2_ = l_x_ = l_y_ = None
        return (cond,)

class GraphModule(torch.nn.Module):
    def forward(self, l_pred2_, l_x_, l_y_):
        add = l_x_ + l_y_; l_x_ = l_y_ = None
        return add

class GraphModule(torch.nn.Module):
    def forward(self, l_pred2_, l_x_, l_y_):
        cond_true_0 = self.cond_true_0
        cond_false_0 = self.cond_false_0
        cond = torch.ops.higher_order.cond(l_pred2_, cond_true_0, cond_false_0, [l_x_, l_y_]); l_pred2_ = cond_true_0 = cond_false_0 = l_x_ = l_y_ = None
        return cond

class GraphModule(torch.nn.Module):
    def forward(self, l_x_, l_y_):
        sin = l_x_.sin(); l_x_ = None
        cos = l_y_.cos(); l_y_ = None
        sub = sin - cos; sin = cos = None
        return sub

class GraphModule(torch.nn.Module):
    def forward(self, l_x_, l_y_):
        cos = l_x_.cos(); l_x_ = None
        sin = l_y_.sin(); l_y_ = None
        sub = cos - sin; cos = sin = None
        return sub
```
the source_fn for inner cond, sin, cos will be a (name, target) tuple:
```
('cond', <torch._ops.HigherOrderOperator object at xxx>)
('sin', 'sin')
('cos', 'cos')
('sub', <built-in function sub>)
```
After this PR, the source_fn_stack will be a list of (name, target) tuples. The bottom of the stack is the end of the list.
```
[('cond', <torch._ops.HigherOrderOperator object at xxx>), ('cond', <torch._ops.HigherOrderOperator object at xxx>)],
[('cond', <torch._ops.HigherOrderOperator object at xxx>), ('cond', <torch._ops.HigherOrderOperator object at xxx>), ('sin', 'sin')],
[('cond', <torch._ops.HigherOrderOperator object at xxx>), ('cond', <torch._ops.HigherOrderOperator object at xxx>), ('cos', 'cos')]
[('cond', <torch._ops.HigherOrderOperator object at xxx>), ('cond', <torch._ops.HigherOrderOperator object at xxx>), ('sub', <built-in function sub>)]
```
Test Plan:
See added tests in test_higher_order_ops.py and modify existing test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108595
Approved by: https://github.com/angelayi, https://github.com/zou3519
This PR allows us to use the same failures_dict for multiple test
classes. This is helpful if you have a bunch of small TestCases and
want to centralize all the failures dicts into one big one.
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110164
Approved by: https://github.com/williamwen42
This PR does the following:
* Combine `cow/context.<h/cpp>` and `cow/deleter.<h/cpp>` into `cow/COWDeleter.<h/cpp>`
* Rename `Context` to `COWDeleterContext`
* Rename `delete_context` to `cow_deleter`
* Remove the separate `impl_cow_context` bazel library, combining it with the base c10 core library
* Rename `context_test.cpp` to `cow_test.cpp`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110191
Approved by: https://github.com/ezyang
When sharding_strategy is set to SHARD_GRAD_OP and forward_prefetch=True, during a direct validation run self.is_first_iter will always be True (because training=False, iter+1 is not executed). Additionally, the _pre_forward_order_index of the first handle entering the record_pre_forward function is 0. This causes the handle to get a False result in the if condition at line 166 when entering record_pre_forward again (the expected value should be True, because _pre_forward_order_index has actually been assigned a value). As a result, the first handle is repeatedly added to handles_pre_forward_order, leading to an incorrect prefetch order.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110138
Approved by: https://github.com/awgu
It's useful to have a simple, lightweight way to run a model that adds
essentially no overhead to calling the model's generated `run_impl` method.
This C API is a super thin wrapper around AOTInductorModel: Create, Run, and
Delete are provided, and do very little work beyond dispatch to the appropriate
helpers.
Note the Create function also provides additional functionality beyond the
Container API; it allows the user to pass in a weight map defined in userland,
which is a requirement for several serving use cases.
Differential Revision: [D49670711](https://our.internmc.facebook.com/intern/diff/D49670711/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110158
Approved by: https://github.com/desertfire, https://github.com/chenyang78
Summary:
Use the upstream `torch._dynamo.same` function in accuracy checking and remove the self-hosted version in torchbench.
Now cmf_10x and ads_dhen_5x can run in deterministic mode; deepcopy and deterministic mode are enabled.
Test Plan:
```
$ buck2 run mode/opt //pytorch/benchmark:run -- cmf_10x -d cuda -t train --accuracy
Running train method from cmf_10x on cuda in eager mode with input batch size 4 and precision tf32.
Accuracy: pass
```
```
$ buck2 run mode/opt //pytorch/benchmark:run -- cmf_10x -d cuda -t train --torchdynamo inductor --torchinductor_enable_batch_fusion --torchinductor_enable_split_cat_fx_pass --accuracy
Running train method from cmf_10x on cuda in dynamo inductor mode with input batch size 4 and precision tf32.
Accuracy: pass
```
Without this PR, it will print:
```
File "/mnt/xarfuse/uid-234232/9aa53cfe-seed-nspid4026531836_cgpid9238070-ns-4026531840/torch/_dynamo/utils.py", line 190, in time_wrapper
r = func(*args, **kwargs)
File "/mnt/xarfuse/uid-234232/9aa53cfe-seed-nspid4026531836_cgpid9238070-ns-4026531840/torch/_inductor/graph.py", line 464, in run
return super().run(*args)
File "/mnt/xarfuse/uid-234232/9aa53cfe-seed-nspid4026531836_cgpid9238070-ns-4026531840/torch/fx/interpreter.py", line 138, in run
self.env[node] = self.run_node(node)
File "/mnt/xarfuse/uid-234232/9aa53cfe-seed-nspid4026531836_cgpid9238070-ns-4026531840/torch/_inductor/graph.py", line 826, in run_node
result.realize_hint()
File "/mnt/xarfuse/uid-234232/9aa53cfe-seed-nspid4026531836_cgpid9238070-ns-4026531840/torch/_inductor/ir.py", line 5273, in realize_hint
and self.is_pointwise_non_scalar_tensor_num_reads_larger_than_one()
File "/mnt/xarfuse/uid-234232/9aa53cfe-seed-nspid4026531836_cgpid9238070-ns-4026531840/torch/_inductor/utils.py", line 343, in wrapper
setattr(self, key, fn(self))
File "/mnt/xarfuse/uid-234232/9aa53cfe-seed-nspid4026531836_cgpid9238070-ns-4026531840/torch/_inductor/ir.py", line 5332, in is_pointwise_non_scalar_tensor_num_reads_larger_than_one
(sum(read.index != 0 for read in self.data.get_reads()) > 1)
File "/mnt/xarfuse/uid-234232/9aa53cfe-seed-nspid4026531836_cgpid9238070-ns-4026531840/torch/_inductor/ir.py", line 5332, in <genexpr>
(sum(read.index != 0 for read in self.data.get_reads()) > 1)
File "/mnt/xarfuse/uid-234232/9aa53cfe-seed-nspid4026531836_cgpid9238070-ns-4026531840/torch/_inductor/dependencies.py", line 74, in index
raise NotImplementedError("StarDep does not have an index")
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
NotImplementedError: StarDep does not have an index
Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
You can suppress this exception and fall back to eager by setting:
import torch._dynamo
torch._dynamo.config.suppress_errors = True
```
Reviewed By: jackiexu1992, mengluy0125
Differential Revision: D49639733
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110189
Approved by: https://github.com/desertfire, https://github.com/jansel
Recently we updated the `export` API to take an experimental `dynamic_shapes` argument that was meant to subsume the existing `constraints` argument.
This PR deprecates `constraints` (with a warning on its use, but without actually removing it). Simultaneously it replaces all uses of `constraints` in docs, examples, and tests with corresponding uses of `dynamic_shapes` (preserving behavior). This exercise fortunately revealed some minor bugs in the implementation which have also been fixed in this PR.
Some uses of `constraints` still remain, e.g., when `torch._dynamo.export` is called directly. (Meta-internal uses will be updated in a separate diff.)
Differential Revision: D49676049
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110143
Approved by: https://github.com/tugsbayasgalan
Summary:
Resolving error:
AttributeError: Can't pickle local object '_add_module_to_qconfig_obs_ctr.<locals>.get_factory_kwargs_based_on_module_device'
by moving the nested function out to the main module
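For reference, a minimal illustration of the failure mode (names here are made up): locally defined functions cannot be pickled, which is why the nested function had to move to module scope.
```python
import pickle

def make_ctr():
    def get_factory_kwargs():  # nested, so pickle cannot resolve it by name
        return {}
    return get_factory_kwargs

try:
    pickle.dumps(make_ctr())
except AttributeError as e:
    print(e)  # Can't pickle local object 'make_ctr.<locals>.get_factory_kwargs'
```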
Test Plan: Added test to CI
Reviewed By: andrewor14
Differential Revision: D49187352
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109288
Approved by: https://github.com/andrewor14
The commit message of #107862 says it enabled mypy checking for
sizevars.py, but it seems that it neglected to update .lintrunner.toml.
New type errors appear to have crept in since then, so I've fixed them
accordingly.
A similar mistake happened with #109955 for joint_graph.py, though that
one is more recent and so hasn't had any new type errors to fix.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110109
Approved by: https://github.com/Skylion007
This is a reland of PRs https://github.com/pytorch/pytorch/pull/108626 and #109564. We fixed the iOS build failure by changing
```
((CHECK) ? (EXPR) : ([] { assert(!#CHECK); }(), (EXPR)))
```
to
```
((CHECK) ? (EXPR) : ([] { assert(false); }(), (EXPR)))
```
in TR2_OPTIONAL_ASSERTED_EXPRESSION, since the former syntax was invalid on Apple Clang. Anyway, we can apply this simple fix, hoping that c10::optional will be replaced by std::optional soon.
We also enabled -Wdeprecated on c10.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110019
Approved by: https://github.com/clee2000
## Context
Add existing decomps for `lift_fresh`, `split.Tensor`, and `unbind` to the core ATen decomposition table. Do not use them in inductor, since Inductor currently lowers these directly.
One note though is that `lift_fresh`'s decomposition has a note saying it's not correct under autograd. However, my understanding is that these decompositions are registered to the `"post_autograd"` decomposition table, meaning autograd wouldn't be a factor. Would like some confirmation that this premise is correct.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110102
Approved by: https://github.com/jansel
PR #108360 uses the same default `last_dim_size` formula from complex-to-real (C2R) transforms for
complex-to-complex (C2C) and real-to-complex (R2C). However, this is not correct, because for C2R
the input is only half the size of the full tensor, which is not the case for C2C and R2C.
This error is mostly benign since `last_dim_size` was only used for the `>= 1` condition which is
almost always met anyway.
For this PR I now use it as the argument to `_apply_norm`, which makes it load-bearing for correctness,
so it is thoroughly tested now.
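For reference, a small example of the size relationship at play (standard `torch.fft` behavior): for C2R the input holds only the Hermitian half of the spectrum, while C2C/R2C see the full last dim.
```python
import torch

full = 8
half = full // 2 + 1                   # 5: Hermitian-packed input length for C2R
spec = torch.randn(half, dtype=torch.complex64)
out = torch.fft.irfft(spec, n=full)    # C2R: output last dim is 8
sig = torch.randn(full)
r2c = torch.fft.rfft(sig)              # R2C: input last dim 8, output last dim 5
print(out.shape, r2c.shape)            # torch.Size([8]) torch.Size([5])
```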
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109083
Approved by: https://github.com/lezcano
Summary:
- Remove onnx bench related scripts and `_onnx` folder.
- Update `common.py` to include onnx related patches previously under `_onnx` folder.
- Update `merge_rules.json` to include bench files.
- Added quick sanity onnx bench test to onnx CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103983
Approved by: https://github.com/kit1980
First error
```
Traceback (most recent call last):
File "/home/ubuntu/exporty.py", line 8, in <module>
ep = torch.export.export(MyModule(), torch.randn(5))
File "/opt/conda/envs/sam/lib/python3.10/site-packages/torch/export/__init__.py", line 509, in export
return export(f, args, kwargs, constraints)
File "/opt/conda/envs/sam/lib/python3.10/site-packages/torch/_export/__init__.py", line 314, in export
raise UserError(UserErrorType.INVALID_INPUT,
torch._dynamo.exc.UserError: Expecting `args` to be a tuple of example positional inputs, got <class 'torch.Tensor'>
```
Second error
```
(sam) ubuntu@ip-172-31-9-217:~$ python exporty.py
Traceback (most recent call last):
File "/home/ubuntu/exporty.py", line 13, in <module>
torch.export.save(ep, 'exported_program.pt2', extra_files=extra_files)
File "/opt/conda/envs/sam/lib/python3.10/site-packages/torch/export/__init__.py", line 566, in save
save(ep, f, extra_files=extra_files, opset_version=opset_version)
File "/opt/conda/envs/sam/lib/python3.10/site-packages/torch/_export/__init__.py", line 595, in save
encoded_content = content.encode('utf-8')
AttributeError: 'bytes' object has no attribute 'encode'. Did you mean: 'decode'?
```
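After the fixes, both repro paths work; a minimal sketch (module and file names are made up):
```python
import torch

class MyModule(torch.nn.Module):
    def forward(self, x):
        return x + 1

# First error: args must be passed as a tuple of example inputs.
ep = torch.export.export(MyModule(), (torch.randn(5),))

# Second error: extra_files values may be str or bytes.
extra_files = {"note.txt": "hello", "blob.bin": b"\x00\x01"}
torch.export.save(ep, "exported_program.pt2", extra_files=extra_files)
```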
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110169
Approved by: https://github.com/angelayi
After https://github.com/pytorch/test-infra/pull/4589, we can now query Dr.CI to get the list of flaky failures there. This change queries the Dr.CI API endpoint and checks whether a failure is flaky using the `is_flaky` function.
Because the change is relatively large, I'm breaking it down to several smaller PRs in this order:
* [x] This PR queries Dr.CI and adds `is_flaky` check
* [ ] Clean up the flaky rules logic because it has already been implemented on Dr. CI
* [ ] Clean up the broken trunk logic for the same reason
### Testing
* Create a new `drci_mocks.json` file to capture the JSON response from the Dr.CI API endpoint. The API requires `DRCI_BOT_KEY`.
* `pytest -v test_trymerge.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110054
Approved by: https://github.com/clee2000
Summary:
Exposing a codegen mode for generating a hook for users to register their kernels.
If we pass the `--manual-registration` flag to `gen_executorch.py`, we will generate the following files:
1. RegisterKernels.h, which declares a `register_all_kernels()` API inside the `torch::executor` namespace.
2. RegisterKernelsEverything.cpp, which implements `register_all_kernels()` by defining an array of generated kernels.
This way users can depend on the library declared by the `executorch_generated_lib` macro (with `manual_registration=True`) and include `RegisterKernels.h`. They can then manually call `register_all_kernels()` instead of relying on the C++ static initialization mechanism, which is not available on some embedded systems.
Test Plan:
Rely on the unit test:
```
buck2 test fbcode//executorch/runtime/kernel/test:test_kernel_manual_registration
```
Reviewed By: cccclai
Differential Revision: D49439673
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110086
Approved by: https://github.com/cccclai
Generally, to extend PyTorch with custom operators, a user will
create a Python module whose import triggers registration of
the custom operators via a torch.ops.load_library call or a call
to one or more torch.library.* APIs.
It is unexpected for Python modules to have side effects, so some
linters and formatters will complain. Use torch.ops.import_module to
import the module without a linter or formatter complaining.
NB: A more robust API would actually check if a custom op was registered
or modified, but this is technically challenging to do. In the future we
can add a warning if a custom op wasn't registered or modified.
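A usage sketch (the module name here is hypothetical):
```python
import torch

# Imports "my_project.custom_ops" for its registration side effects, without
# leaving an apparently-unused `import` statement for linters to flag.
mod = torch.ops.import_module("my_project.custom_ops")
```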
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110090
Approved by: https://github.com/ezyang
Summary: Split out from D48975975, this handles the pytorch specific changes to add support for event_tracer in codegen layer.
Test Plan: CI
Reviewed By: dbort
Differential Revision: D49487710
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109990
Approved by: https://github.com/Jack-Khuu
See the test case for what we didn't catch (SymInt vs. const SymInt& mismatch).
It's necessary to test for both, because we will fall back to the
non-SymInt signature if there is no SymInt unboxed kernel available.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109727
Approved by: https://github.com/zou3519
Summary:
This PR supports the _scaled_dot_product_flash_attention fallback kernel.
Note that in abi_compatible mode, we retrieve outputs by passing
output argument pointers rather than relying on std::get.
It also fixes an issue related to dynamic shapes, where we wrongly
queried undefined dynamic symbols.
Test Plan: ci
Reviewed By: frank-wei
Differential Revision: D49620191
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110085
Approved by: https://github.com/desertfire
Summary: We are trying to use wire messages to pass Python objects like KJT. In order for JIT to be able to unpickle them, we need to provide a type resolver as well as an obj loader. This diff modifies the interface so that we are able to do that.
Test Plan:
Rely on current CI to make sure existing usage doesn't break.
The next diff will test e2e.
Differential Revision: D49438569
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109730
Approved by: https://github.com/davidberard98
@crcrpar's last attempt to fix the 0-size problem unfortunately did not cover all cases. See my comment in https://github.com/pytorch/pytorch/issues/100701. When we have a tail tensor of size 0, the old code would mess with the chunk logic to check the previous tensor's length. This is flawed because:
1. if the previous tensor was also 0-sized (so a tensor list of [tensor, tensor, tensor, ..., 0-sized tensor, 0-sized tensor]), chunks would still be 0 and the nested for loop would be missed.
2. the nested for loop produces side effects on tensorListMeta that _shouldn't_ be there! This can mess up the compute in unexpected ways that I haven't really needed to reason through.
We noticed that the problem had not been fixed due to an internal report. This PR solves the issue by:
- removing the finagling of chunks when the tail tensor is 0-sized
- adding a surefire way for the kernel to be launched in the case where the last tensor is 0-sized AND there's content in the metadata, signifying there is stuff to compute still.
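For reference, a minimal shape of the edge case being fixed (a CUDA device is assumed for the fused path):
```python
import torch

if torch.cuda.is_available():
    tensors = [
        torch.randn(4, device="cuda"),
        torch.empty(0, device="cuda"),  # 0-sized tail tensors used to derail
        torch.empty(0, device="cuda"),  # the chunking logic
    ]
    out = torch._foreach_add(tensors, 1.0)
    print([t.shape for t in out])  # non-empty tensors must still be computed
```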
## Test plan
As I went through the code, I also added some comments explaining what's up and modified our tensor inputs to ensure that this case is tested in the test_parity test in test_foreach.py. Yes, I do realize there is quite a bit of duplication and that this file could be due for a refactor. That said, the primary goal of this PR is to fix the pretty egregious bug and refactoring can be a followup.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109402
Approved by: https://github.com/albanD
Sequence numbers must be associated with a Work object
if we want to use it as a way to report collective progress.
The API surface change is introducing Work::getSequenceNumber, which
should eventually be exposed to python.
The bulk of this change is changing gloo to make the sequence number
always be in use and weave it through the dozens of subclasses of Work.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109136
Approved by: https://github.com/fduwjj
# Summary
Introduced a BC-breaking change in #109533 when self is a view of the value. By using the copy_() op inside fill_, we were hitting `assert_no_partial_overlap` in tensor iterator.
Ideally we would be able to avoid this check if value.numel() == 1. But rather than monkeying around with tensor iterator, I just clone the input instead.
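A minimal illustration of the aliasing case in question (post-fix behavior):
```python
import torch

base = torch.arange(4.0)
value = base[1]       # 0-dim view of `base`, so self and value overlap
base.fill_(value)     # previously tripped assert_no_partial_overlap via copy_()
print(base)           # tensor([1., 1., 1., 1.])
```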
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109835
Approved by: https://github.com/mikaylagawarecki
Extend the metric library to allow setting global metrics at the process level, which will always be emitted.
The current use case is to include shard information every time a metric is emitted by run_test.py.
<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 0cae92c</samp>
> _`run_test` refactored_
> _Sharding metrics in Rockset_
> _Autumn of testing_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110035
Approved by: https://github.com/clee2000
Fixes#109196
When we have a split reduction and the tensor is not an even multiple of the split size,
we use `ops.masked` to pad to an even multiple. In the case here we generated:
```python
tmp5 = tl.where(mask, tmp4, 0)
```
which implicitly promotes our boolean value to `int32`. The fix is to give the default
value the same dtype as `result`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109325
Approved by: https://github.com/lezcano
Replacing https://github.com/pytorch/pytorch/pull/109553 as it got reverted.
This PR enables training with the new 2D flow and adds an associated test. In addition, this PR moves the fsdp-specific parts of tensor/parallel/_data_parallel_utils.py back to tensor/parallel/fsdp.py to avoid a circular dependency for ddp.py and test/distributed/tensor/parallel/test_ddp_2d_parallel.py.
state_dict related changes will come in later PRs.
cc. @fegin, @fduwjj, @wanchaol, @awgu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110034
Approved by: https://github.com/fduwjj
## Context
Introduce a core decomposition for `aten.floor_divide` into other `aten` ops, and add it to the core ATen decomposition table.
This replaces the decomposition of `floor_divide` that was used by Inductor. I noticed there was a note on that decomposition
```
# TorchInductor-only decomposition. It should not be taken to core.
# See https://github.com/pytorch/torchdynamo/pull/1120
```
but couldn't discern the reason why this is the case. cc: @lezcano
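For reference, the semantics involved (a plain-PyTorch sanity check, not the PR's actual decomposition body): `floor_divide` matches `div` with `rounding_mode="floor"`.
```python
import torch

a = torch.tensor([7, -7, 7, -7])
b = torch.tensor([2, 2, -2, -2])
assert torch.equal(
    torch.floor_divide(a, b),
    torch.div(a, b, rounding_mode="floor"),
)  # tensor([ 3, -4, -4,  3])
```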
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110046
Approved by: https://github.com/peterbell10
This flag is requested by @Chillee, who is seeing recompilations with simple gpt experiments. We are observing recompilations because the `_parameters` ordered dict keeps changing from run to run, and it's unclear why that is happening.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110039
Approved by: https://github.com/Chillee
ghstack dependencies: #110023
Summary:
Saw this issue when running PyTorch Vulkan on an LSTM model:
https://www.internalfb.com/phabricator/paste/view/P834993118
Found that we don't always do the Vulkan transfer on `at::cat`.
Test Plan:
(Not running the LSTM model yet, since there are other crashes.)
```
[yipjustin@47884.od /data/sandcastle/boxes/fbsource (3fd2308f8|remote/fbcode/warm_fbcode_od_stable...)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*cat*"
Building: finished in 0.1 sec (100%) 330/330 jobs, 0/330 updated
Total time: 0.2 sec
BUILD SUCCEEDED
Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *cat*
[==========] Running 43 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 43 tests from VulkanAPITest
[ RUN ] VulkanAPITest.replication_pad2d
[ OK ] VulkanAPITest.replication_pad2d (102 ms)
[ RUN ] VulkanAPITest.cat_4d_dim0_invalidinputs_exceptions
[ OK ] VulkanAPITest.cat_4d_dim0_invalidinputs_exceptions (67 ms)
[ RUN ] VulkanAPITest.cat_4d_dim0_samebatch_success
[ OK ] VulkanAPITest.cat_4d_dim0_samebatch_success (111 ms)
[ RUN ] VulkanAPITest.cat_4d_dim0_diffbatch_success
[ OK ] VulkanAPITest.cat_4d_dim0_diffbatch_success (76 ms)
[ RUN ] VulkanAPITest.cat_4d_dim0_singledepth_success
[ OK ] VulkanAPITest.cat_4d_dim0_singledepth_success (40 ms)
[ RUN ] VulkanAPITest.cat_4d_dim0_singletensor_success
[ OK ] VulkanAPITest.cat_4d_dim0_singletensor_success (7 ms)
[ RUN ] VulkanAPITest.cat_4d_dim0_twotensors_success
[ OK ] VulkanAPITest.cat_4d_dim0_twotensors_success (30 ms)
[ RUN ] VulkanAPITest.cat_4d_dim0_negdim_success
[ OK ] VulkanAPITest.cat_4d_dim0_negdim_success (78 ms)
[ RUN ] VulkanAPITest.cat_4d_dim1_negdim_success
[ OK ] VulkanAPITest.cat_4d_dim1_negdim_success (130 ms)
[ RUN ] VulkanAPITest.cat_4d_dim2_negdim_success
[ OK ] VulkanAPITest.cat_4d_dim2_negdim_success (75 ms)
[ RUN ] VulkanAPITest.cat_4d_dim3_negdim_success
[ OK ] VulkanAPITest.cat_4d_dim3_negdim_success (68 ms)
[ RUN ] VulkanAPITest.cat_4d_dim1_texture2d_success
[ OK ] VulkanAPITest.cat_4d_dim1_texture2d_success (2 ms)
[ RUN ] VulkanAPITest.cat_4d_dim1_singledepth_success
[ OK ] VulkanAPITest.cat_4d_dim1_singledepth_success (65 ms)
[ RUN ] VulkanAPITest.cat_4d_dim1_singletensor_success
[ OK ] VulkanAPITest.cat_4d_dim1_singletensor_success (8 ms)
[ RUN ] VulkanAPITest.cat_4d_dim1_bat1_mult4ch_success
[ OK ] VulkanAPITest.cat_4d_dim1_bat1_mult4ch_success (9 ms)
[ RUN ] VulkanAPITest.cat_4d_dim1_bat2_mult4ch_success
[ OK ] VulkanAPITest.cat_4d_dim1_bat2_mult4ch_success (18 ms)
[ RUN ] VulkanAPITest.cat_4d_dim1_mult4ch_mixed_success
[ OK ] VulkanAPITest.cat_4d_dim1_mult4ch_mixed_success (60 ms)
[ RUN ] VulkanAPITest.cat_4d_dim2_sameheight_success
[ OK ] VulkanAPITest.cat_4d_dim2_sameheight_success (80 ms)
[ RUN ] VulkanAPITest.cat_4d_dim2_diffheight_success
[ OK ] VulkanAPITest.cat_4d_dim2_diffheight_success (69 ms)
[ RUN ] VulkanAPITest.cat_4d_dim2_singledepth_success
[ OK ] VulkanAPITest.cat_4d_dim2_singledepth_success (12 ms)
[ RUN ] VulkanAPITest.cat_4d_dim2_invalidinputs_exceptions
[ OK ] VulkanAPITest.cat_4d_dim2_invalidinputs_exceptions (63 ms)
[ RUN ] VulkanAPITest.cat_4d_dim3_invalidinputs_exceptions
[ OK ] VulkanAPITest.cat_4d_dim3_invalidinputs_exceptions (86 ms)
[ RUN ] VulkanAPITest.cat_4d_dim3_samewidth_success
[ OK ] VulkanAPITest.cat_4d_dim3_samewidth_success (117 ms)
[ RUN ] VulkanAPITest.cat_4d_dim3_diffwidth_success
[ OK ] VulkanAPITest.cat_4d_dim3_diffwidth_success (72 ms)
[ RUN ] VulkanAPITest.cat_3d_dim0_mult4ch_success
[ OK ] VulkanAPITest.cat_3d_dim0_mult4ch_success (12 ms)
[ RUN ] VulkanAPITest.cat_3d_dim0_diff_channel_success
[ OK ] VulkanAPITest.cat_3d_dim0_diff_channel_success (28 ms)
[ RUN ] VulkanAPITest.cat_3d_dim0_same_channel_success
[ OK ] VulkanAPITest.cat_3d_dim0_same_channel_success (15 ms)
[ RUN ] VulkanAPITest.cat_3d_dim1_diffheight_success
[ OK ] VulkanAPITest.cat_3d_dim1_diffheight_success (21 ms)
[ RUN ] VulkanAPITest.cat_3d_dim1_same_height_success
[ OK ] VulkanAPITest.cat_3d_dim1_same_height_success (10 ms)
[ RUN ] VulkanAPITest.cat_3d_dim2_diffwidth_success
[ OK ] VulkanAPITest.cat_3d_dim2_diffwidth_success (21 ms)
[ RUN ] VulkanAPITest.cat_3d_dim2_samewidth_success
[ OK ] VulkanAPITest.cat_3d_dim2_samewidth_success (11 ms)
[ RUN ] VulkanAPITest.cat_3d_dim0_negdim_success
[ OK ] VulkanAPITest.cat_3d_dim0_negdim_success (25 ms)
[ RUN ] VulkanAPITest.cat_3d_dim1_negdim_success
[ OK ] VulkanAPITest.cat_3d_dim1_negdim_success (23 ms)
[ RUN ] VulkanAPITest.cat_3d_dim2_negdim_success
[ OK ] VulkanAPITest.cat_3d_dim2_negdim_success (10 ms)
[ RUN ] VulkanAPITest.cat_2d_dim0_same_height_success
[ OK ] VulkanAPITest.cat_2d_dim0_same_height_success (3 ms)
[ RUN ] VulkanAPITest.cat_2d_dim0_diff_height_success
[ OK ] VulkanAPITest.cat_2d_dim0_diff_height_success (2 ms)
[ RUN ] VulkanAPITest.cat_2d_dim1_same_width_success
[ OK ] VulkanAPITest.cat_2d_dim1_same_width_success (3 ms)
[ RUN ] VulkanAPITest.cat_2d_dim1_diff_width_success
[ OK ] VulkanAPITest.cat_2d_dim1_diff_width_success (4 ms)
[ RUN ] VulkanAPITest.cat_2d_dim0_negdim_success
[ OK ] VulkanAPITest.cat_2d_dim0_negdim_success (3 ms)
[ RUN ] VulkanAPITest.cat_2d_dim1_negdim_success
[ OK ] VulkanAPITest.cat_2d_dim1_negdim_success (3 ms)
[ RUN ] VulkanAPITest.cat_1d_dim0_same_width_success
[ OK ] VulkanAPITest.cat_1d_dim0_same_width_success (52 ms)
[ RUN ] VulkanAPITest.cat_1d_dim0_diff_width_success
[ OK ] VulkanAPITest.cat_1d_dim0_diff_width_success (0 ms)
[ RUN ] VulkanAPITest.cat_1d_dim0_negdim_success
[ OK ] VulkanAPITest.cat_1d_dim0_negdim_success (0 ms)
[----------] 43 tests from VulkanAPITest (1717 ms total)
[----------] Global test environment tear-down
[==========] 43 tests from 1 test suite ran. (1717 ms total)
[ PASSED ] 43 tests.
YOU HAVE 4 DISABLED TESTS
```
Differential Revision: D49566743
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109936
Approved by: https://github.com/SS-JIA
For tests that TD prioritizes, we should track what their ordering _would have been_ if none of the TD heuristics had applied.
This is useful for two reasons:
1. It lets us better understand how TD may have contributed to that test running sooner
2. It's possible that heuristics actually mark a test as less important than the default sorting would have claimed (the default sorts tests in a fixed order). This will let us track how often that happens
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110031
Approved by: https://github.com/clee2000
Changelog:
- torch.library.impl_abstract optionally accepts a torch.library.Library
object. If passed in, then the lifetime of the registration is tied to
the Library object.
- we've also changed torch.library.impl_abstract to work on all
operators, including overloads.
- we refactored the `torch._custom_ops.*` and `torch._custom_op.*`
impl_abstract APIs and put them under torch._library. This is the
final resting place for them. I will follow-up with deleting
all the `torch._custom_ops.*` stuff later.
- There is a new "SimpleOperatorRegistry" where we actually collect the
abstract_impl. We will expand this to also hold the other
torch._custom_ops.* APIs when we move those to torch.library
NB: Previously we had designed
`impl_abstract` assuming a very high-level Python-only custom op API.
We've revisited that since; now, impl_abstract works for all custom ops,
no matter python or C++, no matter the schema. The new refactored design
reflects this better.
Test Plan:
- existing and new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109912
Approved by: https://github.com/ezyang
This PR adds a test for the previous PR in this stack: #109904. In summary, it calls
functions decorated with `@record_shapeenv_event`, that don't have an explicit `ShapeEnv`
parameter, with arguments that don't hold a `ShapeEnv` instance.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109944
Approved by: https://github.com/ezyang
ghstack dependencies: #109904
- Extend `test_torch_dispatch_meta_outplace` to test torch ops that do not have an out parameter but have aten op overloads that have out parameters. Additionally, Python decompositions may register `OpOverloadPacket`'s so decompositions need to be tested to ensure all `OpOverloads` still function for the `Meta` key (e.g. if a python decomposition is registered for an aten op `aten.foo` with overloads `[default, out]`, the python function needs to support receiving out arguments)
- Add out parameter wrappers to python decomps for aten ops that have out overloads
CC. @ezyang @albanD @lezcano
Fixes#107713
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107707
Approved by: https://github.com/lezcano
In cudagraph trees, we invalidate tensors at some point and drop their storage. Then, when they are accessed with .data_ptr(), a custom error message is thrown. Previously, this invalidation didn't also make untyped_storage()/storage() error, which could result in a segfault.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109750
Approved by: https://github.com/zou3519
Fixes#101777
- [x] Duplicated the tests from `test/jit/test_union.py` into [`test/jit/test_union_pep604.py`](https://github.com/pytorch/pytorch/pull/109293/files#diff-b981f6493093482b43b0e62057b0c01b004b3e932d4e63a1166c3808c0172b83), using PEP604 style Unions
- [x] Exchanged custom `get_args` and `get_origin` with `typing.get_args` and `typing.get_origin` which have the same functionality and became part of the standard library in 3.8
- [x] Added utility function `pep604union_to_union` in `tree_views.h` which converts a `BinOP("|")` node into the corresponding `Union`. This function intercepts `ScriptTypeParser::parseTypeFromExpr` and `ScriptTypeParser::parseTypeFromExprImpl` and patches the expression.
- [ ] There is a single failing test; I commented it out for the moment to see if CI complains about anything else. I tried for several hours to figure out how to patch it, but I am not experienced with C++ development and debugging.
From what I could gather, the following fails:
```python
def test_union_optional_of_union_return(self):
    @torch.jit.script
    def fn() -> None | str | int:
        y: Optional[int | str] = "foo"
        return y
```
In the section:
75b954b715/torch/csrc/jit/frontend/script_type_parser.cpp (L232-L243)
When using a regular `Union`, the `resolver` path is taken, whereas with the patched PEP 604 union, `resolveType` doesn't work.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109293
Approved by: https://github.com/ezyang
I'm pretty sure this is fixed, but I'll run inductor and trunk CI. The failing test in trunk previously was that the recently landed selective activation checkpointing code assumes it can detect whether or not AOTAutograd is running by checking if the inputs to SAC are C++ `FunctionalTensorWrapper`s.
The previous land broke some inductor trunk tests.
This reverts commit 629a628cc8bb1f62e2cce11bf0c8a00d3d06f896.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109906
Approved by: https://github.com/ezyang
Most `torch.cuda` ops (e.g. `torch.cuda.synchronize`) do not release the GIL in C++ land. This has the potential to cause deadlocks and freeze the Python process. For example, `torch.cuda.synchronize` could hold the GIL and get blocked on some operation. However, that operation might never complete in Python land since the GIL is held by `torch.cuda.synchronize`.
In this PR, I've tried to release the GIL as much as possible in `torch.cuda` ops.
See https://github.com/pytorch/pytorch/issues/109074 for an example of how holding GIL causes a deadlock.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109159
Approved by: https://github.com/ezyang
# Motivation
@jansel As discussed before, we want to generalize some CUDA-specific code. This can make inductor more friendly to third-party backends so that we can leverage inductor code as much as possible.
# Solution
To implement this, we introduce a device runtime abstraction. We wrap the runtime inside `DeviceInterface` and use `register_interface_for_device` to register each kind of device with inductor, then use `get_interface_for_device` to fetch the corresponding runtime from the device type. Usage looks like this:
```python
device_interface = get_interface_for_device("xpu")
device_interface.is_available()  # check if XPU is available
device_interface.device_count()  # check how many XPU devices are available
```
The `DeviceInterface` is a simple abstraction, which enables third-party backends that implement CUDA-like semantics to be integrated with inductor. This prevents third-party backends from having to monkey-patch utility functions, like `decode_device`, that are hard-coded for CUDA.
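A hedged sketch of what a backend registration might look like (the module path and the exact method set are assumptions for illustration; the PR defines the real interface):
```python
import torch
from torch._dynamo.device_interface import (
    DeviceInterface,
    register_interface_for_device,
)

class XPUInterface(DeviceInterface):
    @staticmethod
    def is_available() -> bool:
        return hasattr(torch, "xpu") and torch.xpu.is_available()

    @staticmethod
    def device_count() -> int:
        return torch.xpu.device_count() if XPUInterface.is_available() else 0

# After registration, inductor can fetch this runtime by device type.
register_interface_for_device("xpu", XPUInterface)
```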
# Additional Context
The main code change:
- To leverage AsyncCompile, make it device-agnostic
- Avoid monkey patches, make some utility functions device-agnostic
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109486
Approved by: https://github.com/jansel, https://github.com/jgong5, https://github.com/EikanWang
Summary:
This preserves `requires_grad` in the case where all parameters within a `FlatParameter` have the same `requires_grad` value.
Currently, unsharded parameters have `requires_grad=True` in some cases where the `FlatParameter` and all original parameters have `requires_grad=False`.
This could be extended to support `FlatParameters` with a mix of `requires_grad` states by extending `ParamInfo` to capture `requires_grad` for each parameter.
Test Plan: test added
Differential Revision: D49517155
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109892
Approved by: https://github.com/awgu
Based on William's recent diff on preserving node metadata on retracing, we no longer need to skip dynamo when retracing. This softens our previous restriction of not allowing any new constraints from the user side, because we can now utilize dynamo to analyze the constraints. As a result, re-export can technically happen with any new constraints. This opens up another question: is it OK to use looser constraints on retracing? If we allow looser constraints, we can technically diverge from eager behavior, because, for example, we could have eliminated unsafe control flow based on the previous assumption. But one can also argue this is OK, because we treat the exported callable as an independent callable from its original source code.
We could technically ban looser constraints inside export, but my concern is that we would break the abstraction by doing special-case checks on ExportedProgram.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109476
Approved by: https://github.com/avikchaudhuri, https://github.com/zhxchen17
Fixes#109842.
This disables the implicit `UserWarning`s that were raised for deprecated `torch` attributes. The filtering was designed to be as specific as possible, in order to not filter any other warnings that may be raised.
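For illustration, a filter of this shape keeps the suppression narrow (the message text here is made up):
```python
import warnings

# Suppress only the one deprecation message, not UserWarnings in general.
warnings.filterwarnings(
    "ignore",
    message=r"'some_attr' is deprecated",  # hypothetical message
    category=UserWarning,
    module=r"torch",
)
```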
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109890
Approved by: https://github.com/ezyang
1. Inherit from TestCase
2. Use pytorch parametrization
3. Use unittest.expectedFailure to mark xfails
All this to make pytest-less invocation work:
$ python test/torch_np/test_basic.py
Furthermore, tests can now be run under dynamo, and we see the first errors:
```
$ PYTORCH_TEST_WITH_DYNAMO=1 python test/torch_np/test_basic.py -k test_toscalar_list_func
.E.
======================================================================
ERROR: test_toscalar_list_func_<function shape at 0x7f9b83a4fc10>_np_func_<function shape at 0x7f9a8dd38af0> (__main__.TestOneArrToScalar)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/ev-br/repos/pytorch/torch/testing/_internal/common_utils.py", line 356, in instantiated_test
test(self, **param_kwargs)
File "test/torch_np/test_basic.py", line 232, in test_toscalar_list
@parametrize("func, np_func", one_arg_scalar_funcs)
File "/home/ev-br/repos/pytorch/torch/nn/modules/module.py", line 1519, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ev-br/repos/pytorch/torch/nn/modules/module.py", line 1528, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ev-br/repos/pytorch/torch/_dynamo/eval_frame.py", line 406, in _fn
return fn(*args, **kwargs)
File "/home/ev-br/repos/pytorch/torch/fx/graph_module.py", line 726, in call_wrapped
return self._wrapped_call(self, *args, **kwargs)
File "/home/ev-br/repos/pytorch/torch/fx/graph_module.py", line 305, in __call__
raise e
File "/home/ev-br/repos/pytorch/torch/fx/graph_module.py", line 292, in __call__
return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
File "/home/ev-br/repos/pytorch/torch/nn/modules/module.py", line 1519, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ev-br/repos/pytorch/torch/nn/modules/module.py", line 1528, in _call_impl
return forward_call(*args, **kwargs)
File "<eval_with_key>.2", line 5, in forward
shape = torch._numpy._funcs_impl.shape([[1, 2, 3], [4, 5, 6]])
File "/home/ev-br/repos/pytorch/torch/_numpy/_funcs_impl.py", line 655, in shape
return tuple(a.shape)
AttributeError: 'list' object has no attribute 'shape'
----------------------------------------------------------------------
Ran 3 tests in 0.915s
FAILED (errors=1)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109593
Approved by: https://github.com/lezcano
Building IPEX-XPU with PyTorch fails with `error: builtin is not supported on this target _BitScanReverse` on Windows.
The root cause is that the `_BitScanReverse` compiler intrinsic is not supported in SYCL target device code with the DPC++ compiler, while it is supported in host code with the MSVC compiler. Thanks to @gujinghui, @xuhancn for the help in identifying the root cause and debugging.
A minimal reproducible script:
```cpp
#include <CL/sycl.hpp>
#include <chrono>
#include <iostream>
#ifdef _MSC_VER
#include <intrin.h>
#endif
void test(sycl::queue& q) {
  uint8_t input = 123;
  const uint32_t w = (uint32_t)input << 24;
  const uint32_t nonsign = w & UINT32_C(0x7FFFFFFF);
  unsigned long nonsign_bsr;
  _BitScanReverse(&nonsign_bsr, (unsigned long)nonsign); // host code, no error
  sycl::range<2> global_range{1, 1};
  sycl::range<2> local_range{1, 1};
  auto e = q.submit([&](auto& h) {
    sycl::stream out(100000, 256, h);
    h.parallel_for(sycl::nd_range<2>{global_range, local_range},
                   [=](sycl::nd_item<2> item) {
#if defined(_MSC_VER)
          uint8_t input = 123;
          const uint32_t w = (uint32_t)input << 24;
          unsigned long nonsign_bsr;
          _BitScanReverse(&nonsign_bsr, (unsigned long)nonsign); // device code, error: builtin is not supported on this target
#else
          __builtin_clz(nonsign);
#endif
          // Fix to add a check for SYCL device code:
          /*
          #if defined(__SYCL_DEVICE_ONLY__)
          out << "DPC++ SYCL" << sycl::endl;
          __builtin_clz(nonsign);
          #elif defined(_MSC_VER)
          out << "MSVC" << sycl::endl;
          uint8_t input = 123;
          const uint32_t w = (uint32_t)input << 24;
          unsigned long nonsign_bsr;
          _BitScanReverse(&nonsign_bsr, (unsigned long)nonsign);
          #endif
          */
        });
  });
  q.wait();
}

int main() {
#if defined(__SYCL_DEVICE_ONLY__)
  std::cout << "DPC++ SYCL" << std::endl;
#elif defined(_MSC_VER)
  std::cout << "MSVC" << std::endl;
#endif
  sycl::queue q(sycl::default_selector_v);
  test(q);
  return 0;
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109911
Approved by: https://github.com/ezyang
### Description
This PR fixes a bug with getting module attributes during `torch.onnx.export` when `export_modules_as_functions` is used. With this fix, we can compare the LLaMA-2 models produced by the TorchScript exporter and the [Dynamo exporter](https://github.com/pytorch/pytorch/issues/104903).
### Context
When exporting LLaMA-2 from Hugging Face with `export_modules_as_functions`, the `Embedding` object does not have the `freeze` attribute.
```
File "/home/kvaishnavi/.local/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 662, in forward
inputs_embeds = self.embed_tokens(input_ids)
File "/home/kvaishnavi/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1519, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/kvaishnavi/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1558, in _call_impl
args_result = hook(self, args)
File "/home/kvaishnavi/.local/lib/python3.8/site-packages/torch/onnx/utils.py", line 1394, in _track_module_attributes_forward_pre_hook
setattr(module, attr_name, _get_module_attributes(module))
File "/home/kvaishnavi/.local/lib/python3.8/site-packages/torch/onnx/utils.py", line 1474, in _get_module_attributes
return {k: getattr(module, k) for k in annotations}
File "/home/kvaishnavi/.local/lib/python3.8/site-packages/torch/onnx/utils.py", line 1474, in <dictcomp>
return {k: getattr(module, k) for k in annotations}
File "/home/kvaishnavi/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1696, in __getattr__
raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'Embedding' object has no attribute 'freeze'
```
To get around this issue, we can skip adding keys to the dictionary when the object does not have the attribute.
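A minimal sketch of the guarded dict comprehension described above (simplified; the real helper lives in `torch/onnx/utils.py`):
```python
import typing

def _get_module_attributes(module):
    annotations = typing.get_type_hints(type(module))
    # Skip annotated attributes the instance doesn't actually have
    # (e.g. `Embedding.freeze` in the LLaMA-2 trace above).
    return {k: getattr(module, k) for k in annotations if hasattr(module, k)}
```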
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109759
Approved by: https://github.com/BowenBao
Fix https://github.com/pytorch/pytorch/issues/109736 .
The HF pin move causes a regression on the accuracy check for HF models on the dashboard. Manually reverting the HF PR (https://github.com/huggingface/transformers/pull/24696/files) could recover, but this may hide some real issue. I happened to find that using a warm matmul max-autotune cache works around the issue. Put another way:
- making all calls to check_cache miss the cache reproduces the issue
- making all calls to check_cache hit the cache works around the issue
I did a sort of 'bisect', halving the number of cache misses each time while making sure we could still repro. Luckily, reducing to a single cache miss still reproduced the issue. With more debugging, it turned out that the call to `torch.randn` on a cuda device was causing the problem.
The fix is to make sure we restore the rng state when we generate random inputs for max-autotune benchmarking.
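A minimal sketch of the idea (not the PR's exact code): generate benchmarking inputs under a forked RNG so the global state that the rest of compilation relies on is left untouched.
```python
import torch

def gen_autotune_input(shape, device="cuda"):
    # fork_rng snapshots and restores CPU (and listed CUDA) RNG state on exit.
    devices = [0] if device == "cuda" else []
    with torch.random.fork_rng(devices=devices):
        return torch.randn(*shape, device=device)
```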
TBH, I cannot fully explain the root cause, although I know it's caused by an rng state change. AOTAutograd already has some logic to preserve rng state, and I cannot repro the issue in unit tests. I have a few guesses as to why the RNG state is not restored in the first place after we generate random inputs for max-autotune:
- maybe AOTAutograd misses some corner case when preserving the rng state
- maybe for the failed models there are some eager fallbacks that aren't handled by inductor, and if those fallbacks call random-number-related APIs we will see the issue. But again, I didn't find a good way to simulate this.
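A minimal sketch of the fix's idea (not the actual inductor code; `benchmark_inputs` is a hypothetical helper):
```python
import torch

def benchmark_inputs(shapes, device="cuda"):
    # Fork the RNG state so that generating random benchmark inputs for
    # max-autotune leaves the model's own random number stream untouched.
    with torch.random.fork_rng():
        return [torch.randn(s, device=device) for s in shapes]
```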
Repro:
```
TORCHINDUCTOR_BENCHMARK_KERNEL=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 CUDA_VISIBLE_DEVICES=3 time python benchmarks/dynamo/huggingface.py --backend inductor --amp --accuracy --only PLBartForCausalLM --training --cold-start-latency
```
We always repro the issue without the PR but pass the accuracy check with the PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109828
Approved by: https://github.com/eellison
Fix a bug in socket.cpp timeout detection that only shows up with 10k ranks.
Make the minimum wait time in _store_based_barrier adaptive based on the number of ranks.
Longer timeouts give more room for the store to do productive work when swamped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109218
Approved by: https://github.com/XilunWu
ghstack dependencies: #109217
Previously, when deciding whether to dynamically scale down rblock, we used the following formula to compute the upper bound on the number of blocks per SM:
```
max_threads_per_multi_processor / (32 * num_warps)
```
This is correct but a bit loose; because of the loose upper bound, we sometimes skip optimization opportunities.
The new upper bound is 65536 / n_reg_used_by_each_block. This is a tighter upper bound and can be helpful if the kernel uses too many registers (i.e., much more than 32 per thread); a sketch of both bounds is shown below.
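A minimal sketch of the two bounds, with hypothetical names, assuming a 64K-register file and up to 2048 threads per SM:
```python
def max_blocks_per_sm(n_regs_per_thread, num_warps,
                      max_threads_per_sm=2048, regs_per_sm=65536):
    # Old, looser bound: limited only by the thread count per SM.
    thread_bound = max_threads_per_sm // (32 * num_warps)
    # New, tighter bound: the register file per SM divided by the
    # registers consumed by one block.
    reg_bound = regs_per_sm // (n_regs_per_thread * 32 * num_warps)
    return min(thread_bound, reg_bound)
```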
For the kernel https://gist.github.com/shunting314/59aeafd297ed8ff03aa12030a2dd41ae (a real kernel inductor generates for HF), the change improves its perf from:
0.485ms 0.332GB 684.29GB/s
to
0.240ms 0.332GB 1382.70GB/s
The perf was previously bad because of register spills.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109839
Approved by: https://github.com/jansel
Summary:
This follows a similar implementation to BMM & ADDMM: the bias tensor uses the packed-weights layout, similar to MM, but increases the index along the z-dim to reach the additional matrices in the batch.
Packed bias (input of MM):
```
ivec3 pos(k_, j_, 0);
float v = texelFetch(uInput, pos, 0)
# v.xyzw are 4 numbers in one matrix
# no batch
# k_, j_ has only 1/4 of the range as the original matrix size (H*W matrix i=> H/2*W/2*1 3D Image).
```
Packed bias (input of BMM):
```
ivec3 pos(k_, j_, i);
float v = texelFetch(uInput, pos, 0)
# v.xyzw are 4 numbers in one matrix
# i as batch id
```
**To support broadcasting**, the bias packing of `mm` is slightly different from weight packing: it repeats the single element in the height-dim twice to fill the 4 planes (see code for details). The width-dim doesn't repeat twice, but the code still works, because stacking 3 planes together with the last one empty yields the same 3D image.
However, this doesn't work for `bmm`, since it's a series of `{4 planes} {4 planes} … {4 planes}`, and each `{4 planes}` represents a matrix, so having only 3 planes completely messes up the indexing. Thus, I repeat the single element in the width-dim as well to fill all 4 planes and get the correct indexing.
https://pytorch.org/docs/stable/generated/torch.baddbmm.html
Test Plan:
```
[ttingchulin@27298.od /data/sandcastle/boxes/fbsource (bmm)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin
```
Reviewed By: yipjustin
Differential Revision: D49402181
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109851
Approved by: https://github.com/yipjustin
Remove the redundant (and unsafe) `mobile::serialization::ModuleBufferHasIdentifier(data)`, as `mobile::serialization::VerifyModuleBuffer(verifier)` validates the same thing but in a boundary-check-safe manner.
Test Plan: Out of bounds read crash no longer reproduces
Differential Revision: D48914114
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108439
Approved by: https://github.com/manuelcandales, https://github.com/malfet
Summary: Change the returned values to come after the parameters, because 1) it is more consistent with the AOTInductor runtime API convention; and 2) since the out-variant ops have the out tensor at the beginning of the parameters, this makes the return values easier to distinguish from the parameters.
Test Plan:
```
buck test mode/opt caffe2/torch/fb/model_transform/experimental/benchmark/test/aotinductor:test_aot_inductor_benchmark
```
Differential Revision: D49522928
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109834
Approved by: https://github.com/chenyang78
See the test case for what we didn't catch (a SymInt vs const SymInt& mismatch).
It's necessary to test for both, because we will fall back to the
non-SymInt signature if there is no SymInt unboxed kernel available.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109727
Approved by: https://github.com/zou3519
Now that FunctionalTensor and `FunctionalTensorMode` are lower down in this stack, the changes in this PR are more mechanical: Everywhere in AOTAutograd that I used to use the C++ functionalization API, I now use the python functionalization API.
Note that this doesn't actually cause functionalization to run underneath torch_dispatch. I'm saving that re-ordering for later in the stack.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106406
Approved by: https://github.com/ezyang
ghstack dependencies: #108654, #109662, #109632, #109023
I added some tests for Conj, Neg and ZeroTensor for both python and C++ functionalization. This also fixes a nasty segfault when running a functorch `jacfwd` test with `torch.compile`, once AOTAutograd is using `FunctionalTensor`.
Changes:
(1) I use Jeffrey's `make_wrapper_subclass(extra_dispatch_keys)` kwarg to plumb extra dispatch keys onto the wrapper, mirroring what C++ functionalization does (C++ functionalization will mirror all dispatch keys from the inner tensor to the wrapper, except for python and functorch keys).
(2) FunctionalTensorMode will decompose CompositeImplicitAutograd ops, since (for example) ZeroTensor kernels can send ops like `.to()` directly to the Python key. We'll need a way to toggle this later for pre-dispatch functionalization
(3) Bound `_ForceDispatchKeyGuard` and BatchedTensorImpl's dispatch keyset to python
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109023
Approved by: https://github.com/zou3519
ghstack dependencies: #108654, #109662, #109632
I missed a few tests the first time around - this fixes out= op handling for `_return_and_correct_aliasing`, which failed a few tests in the python functionalization <> AOTAutograd PR above.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109662
Approved by: https://github.com/ezyang
ghstack dependencies: #108654
This PR fixes the ownership/lifetime handling for tensor subclasses that override sizes/strides, when tensors get resized.
This is needed now, because `FunctionalTensor` is a subclass that has a custom size/stride (so it can plumb requests to its inner tensor), and is also a core piece of infra (it's used during tracing in AOTAutograd, which means that metadata mutation and resizing that happens to work with torch.compile today needs to work with FunctionalTensor).
After a bunch of discussion with @ezyang and @soulitzer, I updated `PyInterpreter::sym_sizes()` (and friends) so that:
(1) They allocate a py::capsule buffer and stash it on the tensor on the first call to size/stride
(2) On a size/stride call where we notice that the number of **dimensions** on the tensor has changed (so our buffer is stale), we re-allocate the buffer
(3) On a size/stride call where we notice that the number of dimensions is the same, but the values are different (this happens whenever a tensor experiences a metadata mutation, like `.transpose_()`), we inplace-modify the buffer and put the new ints/symints into it
I also ended up doing the SmallVector optimization, which was required to fix some tests in AOTAutograd. Ideally we should look into those tests, and nail down the parts of our codebase that rely on SmallVector not re-allocating on a resize... but I'm saving this for a followup.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108654
Approved by: https://github.com/ezyang
Our experience using `constraints` / `dynamic_dim` with the existing export API has found it to be (subjectively) clunky and (objectively) verbose in common cases.
This PR implements a new design for the export API that replaces the use of `constraints` / `dynamic_dim` with a new way of specifying dynamic shapes, involving the following concepts:
* a constructor `Dim` for first-class named dynamic dimensions with ranges (similar to `functorch.dim`, and analogous to internal symbolic sizes)
* a mechanism that uses the above in `export` calls to associate inputs to their dynamic shape specifications (`dynamic_shapes`)
Design doc: https://docs.google.com/presentation/d/168U7XK72C_WSsZpGESP6Cho9udh193fi0gfjxCNcJ4E/edit#slide=id.p (Meta-only). Note that we only implement Option 1 in that doc. An older version of this PR also implemented Option 3, which is an alternative way of specifying dynamic shapes using tensor type annotations on the exported callable; but we have moved that to future work for now.
See docs for these new features in `torch.export`. The existing `torch.export.export` is modified to use the new API, `torch._export.export__RC__`, whenever `constraints=None`. We have not deprecated the existing API yet, but will do so in a follow-up.
Constraint violation errors arising through use of the new API will now contain suggested fixes using the new API. No longer do we need to report all specializations for static dimensions and suggest all constraints over dynamic dimensions to fix such errors. Instead, due to the redesign, the suggested fixes are much more concise, only involving modifying the definitions of relevant `Dim`s.
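A minimal sketch of the new API as described (the module and shapes are illustrative; see the `torch.export` docs for the final details):
```python
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x):
        return x * 2

# A first-class named dynamic dimension with a range.
batch = Dim("batch", min=1, max=1024)
# Associate the input "x" with its dynamic shape spec: dim 0 is dynamic.
ep = export(M(), (torch.randn(4, 8),), dynamic_shapes={"x": {0: batch}})
```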
Differential Revision: [D48919204](https://our.internmc.facebook.com/intern/diff/D48919204/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108448
Approved by: https://github.com/suo, https://github.com/gmagogsfm
We want users to be able to define custom ops in C++ but put the
abstract impl in Python (since it is easier to write them in Python and
the abstract impl better models device semantics and data-dependent
operators).
`m.impl_abstract_pystub(opname, python_module, context)` declares the
abstract_impl of the operator to exist in the given python module.
When the abstract_impl needs to be accessed (either via FakeTensor or
Meta), and it does not exist, the PyTorch Dispatcher will yell
with a descriptive error message.
Some details:
- We construct a new global AbstractImplPyStub mapping in
Dispatcher.cpp. Read/write to this map is protected by the Dispatcher
lock.
- We add a new Meta Tensor fallback kernel. The fallback errors out if there is
no meta kernel, but also offers a nicer error message if we see that there is
a pystub.
- We create a `torch._utils_internal.throw_abstract_impl_not_imported_error`
helper function to throw errors. This way, we can throw different error
messages in OSS PyTorch vs internal PyTorch. To invoke this from C++, we
added a PyInterpreter::throw_abstract_impl_not_imported_error.
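For context, a hedged sketch of the Python side this enables (the op `mylib::myop` is hypothetical; the decorator shown is `torch.library.impl_abstract`):
```python
import torch

# Assumes "mylib::myop" is defined in C++ with an impl_abstract_pystub
# pointing at this module.
@torch.library.impl_abstract("mylib::myop")
def myop_abstract(x):
    # Shape/dtype propagation only; runs under FakeTensor and Meta.
    return torch.empty_like(x)
```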
Differential Revision: [D49464753](https://our.internmc.facebook.com/intern/diff/D49464753/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109529
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
Summary:
BMM is developed on top of the MM methodology; the main difference is that the 1st input matrix changed from a standard Vulkan 2D tensor to a Vulkan 3D tensor, so the indexing is quite different. The matrices of a batch are appended along the z-dimension, with channels of size 4 (texels).
The 2nd input matrix keeps the same packed format (fitting an `H*W` matrix into an `H/2 * W/2 * 1` 3D image texture by utilizing all 4 values in each texel), but appends more matrices of the batch along the z-dimension (which has only 1 element in the case of MM).
**Vulkan 2D Basic (1st input of MM & output):**
```
ivec3 pos(j, i, 0);
float v = texelFetch(uInput, pos, 0)[0];
# no batch
```
**Vulkan 3D Basic (1st input of BMM & output):**
```
ivec3 pos(k, j, i/4);
float v = texelFetch(uInput, pos, 0)[i % 4];
# i as batch id
```
**Packed weights (2nd input of MM):**
```
ivec3 pos(k_, j_, 0);
float v = texelFetch(uInput, pos, 0)
# v.xyzw are 4 numbers in one matrix
# no batch
# k_, j_ has only 1/4 of the range as the original matrix size (H*W matrix i=> H/2*W/2*1 3D Image).
```
**Packed weights (2nd input of BMM):**
```
ivec3 pos(k_, j_, i);
float v = texelFetch(uInput, pos, 0)
# v.xyzw are 4 numbers in one matrix
# i as batch id
```
Based on the different indexing of MM & BMM, I modified the MM methodology to produce the desired output image.
Test Plan:
```
[ttingchulin@27298.od /data/sandcastle/boxes/fbsource (bmm)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*<test>*" eg. -- --gtest_filter="*mm*"
Building: finished in 0.1 sec (100%) 328/3361 jobs, 0/3361 updated
Total time: 0.1 sec
BUILD SUCCEEDED
Running main() from xplat/third-party/gmock/googletest-1.12.1/googletest/src/gtest_main.cc
Note: Google Test filter = *mm*
[==========] Running 8 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 8 tests from VulkanAPITest
[ RUN ] VulkanAPITest.addmm
[ OK ] VulkanAPITest.addmm (125 ms)
[ RUN ] VulkanAPITest.addmm_expand
[ OK ] VulkanAPITest.addmm_expand (76 ms)
[ RUN ] VulkanAPITest.addmm_expand2
[ OK ] VulkanAPITest.addmm_expand2 (0 ms)
[ RUN ] VulkanAPITest.bmm
[ OK ] VulkanAPITest.bmm (152 ms)
[ RUN ] VulkanAPITest.bmm_large
[ OK ] VulkanAPITest.bmm_large (4818 ms)
[ RUN ] VulkanAPITest.bmm_small
[ OK ] VulkanAPITest.bmm_small (4 ms)
[ RUN ] VulkanAPITest.bmm_one
[ OK ] VulkanAPITest.bmm_one (0 ms)
[ RUN ] VulkanAPITest.mm
[ OK ] VulkanAPITest.mm (55 ms)
[----------] 8 tests from VulkanAPITest (5233 ms total)
[----------] Global test environment tear-down
[==========] 8 tests from 1 test suite ran. (5233 ms total)
[ PASSED ] 8 tests.
```
Differential Revision: D49306279
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109360
Approved by: https://github.com/yipjustin
Summary:
Change AOTInductor to directly return output tensors instead of taking pre-allocated output tensors to return the results. This gives several benefits:
* It makes sure AOTInductor has the same behavior when managing the output tensors as the default Inductor, which is widely tested and thus more reliable.
* As we have debugged before, there are cases where we still have to codegen extra copy_ ops to fill the pre-allocated output tensors, which doesn't make sense for performance.
* With the coming enhanced memory planning, this again will make sure the memory planning logic is the same between AOTInductor and Inductor, which will greatly simplify the problem and improve the reliability.
This change also combines D49494954 from Yang and https://github.com/pytorch/pytorch/pull/109560 from Angela.
Differential Revision: D49502318
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109790
Approved by: https://github.com/chenyang78
Previously, if ClosingTHPObjectPtr was destructed because we
were unwinding the stack from an exception, we would attempt to call
close() which just isn't going to work. Two fixes:
1. Detect if we're unwinding due to a Python error, and don't try
to do more Python stuff if so.
2. If close() fails somehow, write an unraisable exception, don't
try to throw because that will terminate if you're in an
exception.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109758
Approved by: https://github.com/jansel
`python benchmarks/dynamo/torchbench.py --multiprocess` currently fails due to initializing distributed multiple times:
```
torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:6789 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:6789
(errno: 98 - Address already in use).
```
Because torchbench calls itself via mp.spawn, there is the parent run (with `--multiprocess`) and child runs (with `--multiprocess --only <model>`).
This PR addresses this by fixing two issues:
1) distributed was initialized once in the parent run and once in the child runs; it should be initialized only in child runs, where we have accurate rank and world size info
2) torchbench overrides CUDA_VISIBLE_DEVICES/world_size sometimes, but it shouldn't for distributed use cases where we want to use all available gpus
I am also adding a CI test to cover this type of issue in #109311
### Test plan
parent run test: `python benchmarks/dynamo/torchbench.py --ci --accuracy --timing --explain --inductor --device cuda --inference --bfloat16 --output /home/xmfan/local/pytorch/test/test-reports/inference_torchbench.csv --multiprocess`
child run test: `python benchmarks/dynamo/torchbench.py --ci --accuracy --timing --explain --inductor --device cuda --inference --bfloat16 --output /home/xmfan/local/pytorch/test/test-reports/inference_torchbench.csv --multiprocess --only simple_gpt`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109657
Approved by: https://github.com/H-Huang
Python decompositions wrapped by `out_wrapper` need to be unwrapped before compiling with TorchScript since:
- `out_wrapper` extends the decompositions signature with an out parameter, however this `out` parameter is not present in the source code of the original decomposition so the resulting `ScriptFunction` will not have an `out` parameter
- `out_wrapper` is in the `torch._prims_common.wrappers` module, so its `globals()` are different from the globals of the decomposition to be wrapped. This may cause symbol resolution to fail with the TorchScript compiler, since it is compiling the unwrapped decomp's source code rather than the wrapper
The python decomposition for `aten.trace` is wrapped as an example, other decompositions are to be fixed in https://github.com/pytorch/pytorch/pull/107707
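A minimal sketch of the unwrapping idea (illustrative; it assumes the wrapper exposes the original function via `__wrapped__`, as `functools.wraps` does):
```python
import inspect
import torch

def script_decomp(wrapped_decomp):
    # Recover the original decomposition before TorchScript compilation,
    # so the compiled signature matches the real source code.
    original = inspect.unwrap(wrapped_decomp)
    return torch.jit.script(original)
```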
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109367
Approved by: https://github.com/lezcano
NOTE: this PR is tagged "not user facing", because it's not ready to be announced externally yet.
This PR implements torch.compile + selective activation checkpoint (SAC) integration, by using `TagActivationCheckpoint` (same backend as torch.compile + full activation checkpoint integration).
The TorchDispatchMode-based implementation cannot support including in-place ops in the checkpointed region at the moment (the reason for this needs investigation), and there is also no way to ban them (because TorchDispatchMode now only sees "after-functionalization" ops, so it can't detect whether an op is in-place). Hence we hide torch.compile + SAC behind a flag (`torch._dynamo.config._experimental_support_context_fn_in_torch_utils_checkpoint`) and will only use it internally for cases that are known not to have in-place ops. This state won't last too long, because in-place ops will at least be able to be detected after Brian's mode reordering and related functionalization changes.
So next steps after this PR:
1. Wait for Brian's mode reordering and related functionalization changes to land, and then try to enable the "inplace ops" unit test for torch.compile + selective activation checkpoint (if it doesn't work, investigate why).
2. Unify selective- and full-checkpoint under TorchDispatchMode based implementation.
Differential Revision: D47497145
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105489
Approved by: https://github.com/anijain2305
Summary: Add enough type hints to enable mypy checking for codegen/triton_foreach. Also fixed a bug in a dtype param.
Test Plan:
* `python test/inductor/test_foreach.py`
* `lintrunner`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109643
Approved by: https://github.com/mlazos
ghstack dependencies: #109146
Summary: The current parallel autotune implementation sets the CUDA_VISIBLE_DEVICES env var too late -- after the benchmarking subprocess has started -- and the torch libraries don't recognize the change. Since the multiprocessing library doesn't support providing an environment for the subprocess, temporarily set CUDA_VISIBLE_DEVICES in the parent process so that the change is inherited by the subprocess; a sketch of this pattern follows.
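A minimal sketch of the parent-side pattern (illustrative; the helper name is hypothetical):
```python
import os
from contextlib import contextmanager

@contextmanager
def parent_visible_devices(device_idx):
    # Set CUDA_VISIBLE_DEVICES in the parent so the benchmarking
    # subprocess inherits it at startup, then restore the old value.
    prev = os.environ.get("CUDA_VISIBLE_DEVICES")
    os.environ["CUDA_VISIBLE_DEVICES"] = str(device_idx)
    try:
        yield
    finally:
        if prev is None:
            os.environ.pop("CUDA_VISIBLE_DEVICES", None)
        else:
            os.environ["CUDA_VISIBLE_DEVICES"] = prev
```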
Test Plan:
* New unit test to verify the env var is set in the sub-process and fail the benchmark if it's not.
* Ran multiprocess autotuning and looked at the output from `nvidia-smi pmon` to make sure that all GPUs were assigned processes.
Snippet:
```
1 3442314 C 2 1 - - python
2 3442318 C 2 1 - - python
3 3442320 C 8 2 - - python
4 3442323 C 9 4 - - python
5 3442325 C 10 4 - - python
6 3442327 C 10 4 - - python
7 3442329 C 2 0 - - python
0 3434906 C 0 0 - - python
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109500
Approved by: https://github.com/eellison, https://github.com/shunting314
Example usage:
* `TORCH_COMPILE_DEBUG=1 INDUCTOR_ORIG_FX_SVG=1 INDUCTOR_POST_FUSION_SVG=1 python trig.py`: shows the original fx node name, file, and code. See snapshot 2, where we have origin_0, 1, 2
* trig.py can be found in P816304818
Implementation
* keep the original fx graph in GraphLowering: ```self.orig_gm: torch.fx.GraphModule = gm.__copy__()```
* draw the original fx graph with origins in ir_post_fusion: ```V.debug.draw_orig_fx_graph(self.orig_gm, self.scheduler.nodes)```. node.meta["buff_meta"] tracks buf_name
<img width="350" alt="Screenshot 2023-08-29 at 12 40 24 PM" src="https://github.com/pytorch/pytorch/assets/134637289/c4e197cb-ab3b-4a09-a584-c1356376accb">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107752
Approved by: https://github.com/mlazos
`torch.autocast` with the `xla` backend has been restricted to `torch.bfloat16`. This shouldn't be the case anymore.
This works with `xla::cast( ..., type=f16)`
```
IR {
%0 = f32[] prim::Constant(), xla_shape=f32[], value=1
%1 = f32[3,2]{1,0} aten::expand(%0), xla_shape=f32[3,2]{1,0}, size=(3, 2), dynamic_dims=(0, 0)
%2 = f16[3,2]{1,0} xla::cast(%1), xla_shape=f16[3,2]{1,0}, type=f16, dtype=Half, stype=Float
%3 = f32[] prim::Constant(), xla_shape=f32[], value=1
%4 = f32[2,3]{1,0} aten::expand(%3), xla_shape=f32[2,3]{1,0}, size=(2, 3), dynamic_dims=(0, 0)
%5 = f16[2,3]{1,0} xla::cast(%4), xla_shape=f16[2,3]{1,0}, type=f16, dtype=Half, stype=Float
%6 = f16[2,2]{1,0} aten::mm(%5, %2), xla_shape=f16[2,2]{1,0}, ROOT=0
}
```
This will allow PyTorch/XLA to extend its autocast implementation to use `xla` backend for `float16` type as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109554
Approved by: https://github.com/JackCaoG, https://github.com/bdhirsh
In Python 3.11+, typed enums (such as `enum.IntEnum`) retain `__new__`, `__str__`, and other methods of the base class via the `__init_subclass__()` mechanism (see https://docs.python.org/3/whatsnew/3.11.html#enum ), i.e. the following code
```python
import sys
import inspect
from enum import Enum

class IntColor(int, Enum):
    RED = 1
    GREEN = 2

class Color(Enum):
    RED = 1
    GREEN = 2

def get_methods(cls):
    def predicate(m):
        if not inspect.isfunction(m) and not inspect.ismethod(m):
            return False
        return m.__name__ in cls.__dict__
    return inspect.getmembers(cls, predicate=predicate)

if __name__ == "__main__":
    print(sys.version)
    print(f"IntColor methods {get_methods(IntColor)}")
    print(f"Color methods {get_methods(Color)}")
```
This returns an empty list for both cases on older Python, but on Python 3.11+ it returns a list containing enum constructors and other inherited methods:
```shell
% conda run -n py310 python bar.py
3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:41:52) [Clang 15.0.7 ]
IntColor methods []
Color methods []
% conda run -n py311 python bar.py
3.11.0 | packaged by conda-forge | (main, Oct 25 2022, 06:21:25) [Clang 14.0.4 ]
IntColor methods [('__format__', <function Enum.__format__ at 0x105006ac0>), ('__new__', <function Enum.__new__ at 0x105006660>), ('__repr__', <function Enum.__repr__ at 0x1050068e0>)]
Color methods []
```
This change allows typed enums to be scriptable on 3.11 by explicitly marking several `enum.Enum` methods to be dropped by jit script, and adds a test that typed enums are jit-scriptable; a small example of what now works is shown below.
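A small sketch of the kind of code the added test covers (illustrative; assumes the fix is in place):
```python
import torch
from enum import Enum

class Color(int, Enum):  # a typed enum, as in the snippet above
    RED = 1
    GREEN = 2

@torch.jit.script
def is_red(c: Color) -> bool:
    return c == Color.RED
```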
Fixes https://github.com/pytorch/pytorch/issues/108933
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109717
Approved by: https://github.com/atalman, https://github.com/davidberard98
For some weird reason, the batch file gets rid of the `exit /b 1` inside the for loop, so failures never actually get surfaced. Add skips for the tests that were failing.
Also don't run the windows cpu build on main since it's in trunk. This is what currently works for the rocm build.
The temp file failure originates from https://github.com/pytorch/pytorch/pull/108508 (got fixed before I merged this PR)
I'm not sure when the ChunkRecordIteratorTest started failing, but it was after the above.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109393
Approved by: https://github.com/malfet
Bugfix:
- previously, SymBool did not implement `__eq__`, so Python fell back to the default `__eq__` and `__hash__`
- in this PR, we make SymBool implement `__eq__`
- a symbolic SymBool now raises an error when hashed, just like SymInt/SymFloat
New feature:
- previously, SymInt and SymFloat were unhashable (even if singleton or constant)
- in this PR, SymInt and SymBool are hashable if singleton/constant
Stays the same:
- SymNode remains hashable due to default Python behavior
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109170
Approved by: https://github.com/ezyang
ghstack dependencies: #109169
In this PR:
- When Constant SymNodes are detected in unary/binary ops, demote them to plain int/bool before proceeding. Sometimes this means that doing a unary op with a Constant SymNode results in a plain bool.
- Introduce an is_symbolic method, only available from Python. We need this because isinstance(x, SymInt) is no longer sufficient to check whether a given int/SymInt is symbolic or not. See later PR in the stack to see how this is used.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109169
Approved by: https://github.com/ezyang
Collectives timing gates the tracking of when a collective starts on the device.
Currently it's enabled by setting the NCCL_ENABLE_TIMING env var.
The goal of this PR is to make it possible to dynamically enable that flag so users of the PG hooks don't have to set that flag in order to have their hooks work.
The design is that once set, all new collectives will have such behavior so we track it on each Work object.
We make enableTiming_ atomic in PGNCCL to avoid races on non-TSO hardware.
To ensure consistency, we copy its value during Work construction and replace all previous usage of enableTiming_ from the PG with usages from the Work, which now has an immutable value.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108814
Approved by: https://github.com/wconstab, https://github.com/fduwjj
ghstack dependencies: #108813
The "safety" aspect refers to the output not being registered as aliasing the
input, but after AOTAutograd I don't think this distinction matters. However,
we shouldn't use the same decomposition as the safe variant in case the backend
doesn't want to decompose split.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109668
Approved by: https://github.com/lezcano
ghstack dependencies: #109667
This PR fixes the `expectedFailure` introduced in the previous PR.
**Problem:** container variables, such as `ConstDictVariable`, aren't registered nodes anymore, but we still have to process the tensors inside them.
**Solution:** wrap the pytree functions in a `UserFunctionVariable`, and call it. This should inline the given pytree function, and return the expected processed arguments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109107
Approved by: https://github.com/zou3519
ghstack dependencies: #109201, #108533
Summary: AOT inductor is only sort-of supported on CPU right now, but it works
with a few hacks (the .so needs to be compiled and run with CUDA present,
because we haven't excised the CUDA deps; also there's an `is_cpu` flag that
needs to be plumbed into the call, or else all the weights are erroneously
allocated on GPU).
But, with those hacks in place, it currently works, so it's worth having the
current state of it continue working (and at some point we'll remove the
hacks).
Test Plan:
```
python test_aot_inductor -k test_simple_cpu
```
Reviewers: binbao
Differential Revision: [D49427400](https://our.internmc.facebook.com/intern/diff/D49427400)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109625
Approved by: https://github.com/mikekgfb, https://github.com/chenyang78, https://github.com/desertfire
## Description
Add a test case to cover the corner case of empty shards when creating ShardedTensor.
Original fix contributed by a user.
https://github.com/pytorch/pytorch/pull/108915
## Test
With the fix, the test added runs fine.
Without the fix in https://github.com/pytorch/pytorch/pull/108915, the test case added would throw the following assertion error.
```
(/home/irisz/local/a/pytorch-env) [irisz@devgpu051.cln3 ~/local/pytorch (add_test_for_corner_case_for_chunk_sharding_spec)]$ python3 test/distributed/_shard/sharded_tensor/test_sharded_tensor.py TestShardTensor.test_shard_tensor_with_empty_shard
Fail to import hypothesis in common_utils, tests are not derandomized
INFO:numba.cuda.cudadrv.driver:init
Fail to import hypothesis in common_utils, tests are not derandomized
Fail to import hypothesis in common_utils, tests are not derandomized
Fail to import hypothesis in common_utils, tests are not derandomized
Fail to import hypothesis in common_utils, tests are not derandomized
NCCL version 2.18.3+cuda12.0
[rank3]:[2023-09-19 11:19:27,071] torch.testing._internal.common_distributed: [ERROR] Caught exception:
[rank3]:[2023-09-19 11:19:27,071] torch.testing._internal.common_distributed: [ERROR] Traceback (most recent call last):
[rank3]:[2023-09-19 11:19:27,071] torch.testing._internal.common_distributed: [ERROR] File "/data/users/irisz/pytorch/torch/testing/_internal/common_distributed.py", line 658, in run_test
[rank3]:[2023-09-19 11:19:27,071] torch.testing._internal.common_distributed: [ERROR] getattr(self, test_name)()
[rank3]:[2023-09-19 11:19:27,071] torch.testing._internal.common_distributed: [ERROR] File "/data/users/irisz/pytorch/torch/testing/_internal/common_distributed.py", line 544, in wrapper
[rank3]:[2023-09-19 11:19:27,071] torch.testing._internal.common_distributed: [ERROR] fn()
[rank3]:[2023-09-19 11:19:27,071] torch.testing._internal.common_distributed: [ERROR] File "/data/users/irisz/pytorch/torch/testing/_internal/common_utils.py", line 2406, in wrapper
[rank3]:[2023-09-19 11:19:27,071] torch.testing._internal.common_distributed: [ERROR] method(*args, **kwargs)
[rank3]:[2023-09-19 11:19:27,071] torch.testing._internal.common_distributed: [ERROR] File "/data/users/irisz/pytorch/torch/testing/_internal/distributed/_shard/sharded_tensor/__init__.py", line 94, in wrapper
[rank3]:[2023-09-19 11:19:27,071] torch.testing._internal.common_distributed: [ERROR] func(self, *args, **kwargs)
[rank3]:[2023-09-19 11:19:27,071] torch.testing._internal.common_distributed: [ERROR] File "/data/users/irisz/pytorch/torch/testing/_internal/common_distributed.py", line 174, in wrapper
[rank3]:[2023-09-19 11:19:27,071] torch.testing._internal.common_distributed: [ERROR] return func(*args, **kwargs)
[rank3]:[2023-09-19 11:19:27,071] torch.testing._internal.common_distributed: [ERROR] File "/data/users/irisz/pytorch/test/distributed/_shard/sharded_tensor/test_sharded_tensor.py", line 258, in test_shard_tensor_with_empty_shard
[rank3]:[2023-09-19 11:19:27,071] torch.testing._internal.common_distributed: [ERROR] st = _shard_tensor(tensor, spec)
[rank3]:[2023-09-19 11:19:27,071] torch.testing._internal.common_distributed: [ERROR] File "/data/users/irisz/pytorch/torch/distributed/_shard/api.py", line 68, in _shard_tensor
[rank3]:[2023-09-19 11:19:27,071] torch.testing._internal.common_distributed: [ERROR] st = sharding_spec.shard(tensor, src_rank=src_rank, process_group=process_group)
[rank3]:[2023-09-19 11:19:27,071] torch.testing._internal.common_distributed: [ERROR] File "/data/users/irisz/pytorch/torch/distributed/_shard/sharding_spec/chunk_sharding_spec.py", line 170, in shard
[rank3]:[2023-09-19 11:19:27,071] torch.testing._internal.common_distributed: [ERROR] assert local_tensor is not None
[rank3]:[2023-09-19 11:19:27,071] torch.testing._internal.common_distributed: [ERROR] AssertionError
[rank3]:[2023-09-19 11:19:27,071] torch.testing._internal.common_distributed: [ERROR] exiting process 3 with exit code: 10
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR] Caught exception:
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR] Traceback (most recent call last):
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR] File "/data/users/irisz/pytorch/torch/testing/_internal/common_distributed.py", line 658, in run_test
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR] getattr(self, test_name)()
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR] File "/data/users/irisz/pytorch/torch/testing/_internal/common_distributed.py", line 544, in wrapper
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR] fn()
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR] File "/data/users/irisz/pytorch/torch/testing/_internal/common_utils.py", line 2406, in wrapper
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR] method(*args, **kwargs)
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR] File "/data/users/irisz/pytorch/torch/testing/_internal/distributed/_shard/sharded_tensor/__init__.py", line 94, in wrapper
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR] func(self, *args, **kwargs)
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR] File "/data/users/irisz/pytorch/torch/testing/_internal/common_distributed.py", line 174, in wrapper
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR] return func(*args, **kwargs)
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR] File "/data/users/irisz/pytorch/test/distributed/_shard/sharded_tensor/test_sharded_tensor.py", line 258, in test_shard_tensor_with_empty_shard
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR] st = _shard_tensor(tensor, spec)
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR] File "/data/users/irisz/pytorch/torch/distributed/_shard/api.py", line 68, in _shard_tensor
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR] st = sharding_spec.shard(tensor, src_rank=src_rank, process_group=process_group)
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR] File "/data/users/irisz/pytorch/torch/distributed/_shard/sharding_spec/chunk_sharding_spec.py", line 179, in shard
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR] dist.scatter(
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR] File "/data/users/irisz/pytorch/torch/distributed/c10d_logger.py", line 68, in wrapper
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR] return func(*args, **kwargs)
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR] File "/data/users/irisz/pytorch/torch/distributed/distributed_c10d.py", line 3143, in scatter
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR] _check_tensor_list(scatter_list, "scatter_list")
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR] File "/data/users/irisz/pytorch/torch/distributed/distributed_c10d.py", line 808, in _check_tensor_list
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR] raise TypeError(
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR] TypeError: Invalid function argument. Expected parameter `scatter_list` to be of type List[torch.Tensor].
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR]
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR] To execute this test, run the following from the base repo dir:
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR] python test/distributed/_shard/sharded_tensor/test_sharded_tensor.py -k test_shard_tensor_with_empty_shard
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR]
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR] This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
[rank0]:[2023-09-19 11:19:27,123] torch.testing._internal.common_distributed: [ERROR] exiting process 0 with exit code: 10
Process 3 terminated with exit code 10, terminating remaining processes.
E
======================================================================
ERROR: test_shard_tensor_with_empty_shard (__main__.TestShardTensor)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/data/users/irisz/pytorch/torch/testing/_internal/common_distributed.py", line 542, in wrapper
self._join_processes(fn)
File "/data/users/irisz/pytorch/torch/testing/_internal/common_distributed.py", line 761, in _join_processes
self._check_return_codes(elapsed_time)
File "/data/users/irisz/pytorch/torch/testing/_internal/common_distributed.py", line 811, in _check_return_codes
raise RuntimeError(error)
RuntimeError: Process 3 exited with error code 10 and exception:
Traceback (most recent call last):
File "/data/users/irisz/pytorch/torch/testing/_internal/common_distributed.py", line 658, in run_test
getattr(self, test_name)()
File "/data/users/irisz/pytorch/torch/testing/_internal/common_distributed.py", line 544, in wrapper
fn()
File "/data/users/irisz/pytorch/torch/testing/_internal/common_utils.py", line 2406, in wrapper
method(*args, **kwargs)
File "/data/users/irisz/pytorch/torch/testing/_internal/distributed/_shard/sharded_tensor/__init__.py", line 94, in wrapper
func(self, *args, **kwargs)
File "/data/users/irisz/pytorch/torch/testing/_internal/common_distributed.py", line 174, in wrapper
return func(*args, **kwargs)
File "/data/users/irisz/pytorch/test/distributed/_shard/sharded_tensor/test_sharded_tensor.py", line 258, in test_shard_tensor_with_empty_shard
st = _shard_tensor(tensor, spec)
File "/data/users/irisz/pytorch/torch/distributed/_shard/api.py", line 68, in _shard_tensor
st = sharding_spec.shard(tensor, src_rank=src_rank, process_group=process_group)
File "/data/users/irisz/pytorch/torch/distributed/_shard/sharding_spec/chunk_sharding_spec.py", line 170, in shard
assert local_tensor is not None
AssertionError
----------------------------------------------------------------------
Ran 1 test in 21.207s
FAILED (errors=1)
```
cc. @fduwjj @wanchaol
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109626
Approved by: https://github.com/fduwjj
This PR adds an `efficient_conv_bn_eval_graph_transform` pass to the inductor. It tries to identify consecutive conv + bn **computation** with bn in eval mode, and changes it to a more efficient implementation. It does not modify parameters, which makes it **support training** without any pain. If no such patterns are identified, it does nothing. Therefore, it is backward compatible.
It has a great benefit in terms of memory footprint:
For resnet50 with input batch size 64, image size 224, forward + backward training:
| Technique | Memory Footprint (GB) | Remarks |
|-------------------------------|----------------------------|-------------------------------------------|
| Eager Mode | 5.18 | |
| torch.compile | 5.46 | Strangely, not saving memory |
| torch.compile with this PR | 2.88 | **Saves about 50% memory!** |
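For context, a minimal sketch of the eval-mode conv + bn folding idea behind the pass (assumes a 2D conv and affine bn; the real transform operates on the fx graph rather than on modules):
```python
import torch
import torch.nn.functional as F

def conv_bn_eval(conv, bn, x):
    # Fold the eval-mode batch norm affine transform into the conv
    # weights on the fly; the original parameters are never mutated,
    # which is what keeps training supported.
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    weight = conv.weight * scale.reshape(-1, 1, 1, 1)
    bias = bn.bias - bn.running_mean * scale
    if conv.bias is not None:
        bias = bias + conv.bias * scale
    return F.conv2d(x, weight, bias, conv.stride, conv.padding,
                    conv.dilation, conv.groups)
```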
The script to measure the memory footprint:
```python
from torchvision.models.resnet import resnet50
import torch
net = resnet50().eval().cuda()
input = torch.randn(64, 3, 224, 224).cuda()
opt_net = torch.compile(net) # Use torch.compile
# opt_net = net # Eager mode
current_memory = torch.cuda.memory_allocated()
torch.cuda.reset_peak_memory_stats()
for i in range(10):
opt_net.zero_grad()
output = opt_net(input)
output.sum().backward()
del output
peak_memory = torch.cuda.max_memory_allocated()
additional_peak_memory = peak_memory - current_memory
print(f"Additional peak memory used: {additional_peak_memory / (1024 ** 3)} GB")
```
More results can be found in the corresponding paper (this method is called Tune Mode in the tables).
<img width="709" alt="image" src="https://github.com/pytorch/pytorch/assets/23236638/db4815b0-d93e-4726-b1d5-e6651f256484">
<img width="653" alt="image" src="https://github.com/pytorch/pytorch/assets/23236638/22e5e1ab-6129-4c3d-a875-3c7343293b2e">
Note: the difference between this PR and https://github.com/pytorch/pytorch/pull/106372 is that https://github.com/pytorch/pytorch/pull/106372 tries to fix and change the implementation of `torch.fx.experimental.optimization.fuse`, which causes compatibility issues; this PR only introduces a new graph transform pass and does not break the previous code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108757
Approved by: https://github.com/jansel
aten.softmax will generate a different decomposition for fp16/bf16 and fp32 because, when invoked in lower precision, it will upcast the inputs to fp32 and then downcast after. This has been causing us to miss bf16 patterns. For example, Camembert improves 20% with this PR (as I'm sure many other models do).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109142
Approved by: https://github.com/yanboliang
ghstack dependencies: #109663, #108894, #108917
On failure of a test, we will always print a "repro". This repro isn't
really runnable but gives the user a sense of how to actually reproduce
the test without the test suite, because using the test suite is a bit
convoluted.
If the user passes PYTORCH_OPCHECK_PRINT_BETTER_REPRO, we will print a
fuller repro that saves the exact problematic test inputs to disk and
reads them back out.
Test Plan:
- expecttests on the generate_repro helper function
- tried this out locally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109640
Approved by: https://github.com/bdhirsh, https://github.com/soulitzer
ghstack dependencies: #109637, #109638, #109639
Adds a Python Pretty Printer to the pattern matcher that serializes patterns as python. Generating our fuse attention patterns was taking 4 seconds of compile time, which will only get worse as we add more variants (which I will do in the rest of this stack). To write out patterns, build pytorch, then run `gen_attention_patterns.py`.
Since there is a line limit for PRs, I'm only including _sdpa_pattern1 in this first diff.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108894
Approved by: https://github.com/yanboliang
ghstack dependencies: #109663
Summary: We met with this problem in practice. Whenever we meet with this problem, something bad has probably happened upstream, e.g. the run instance may have returned immediately due to an error. Throw here to see if we can catch the problem earlier; that'll be better.
Test Plan: CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109328
Approved by: https://github.com/ezyang, https://github.com/jansel
We now have two types of functionalization: C++ functionalization (through the `Functionalize` dispatch key), and Python functionalization (through the `FunctionalTensorMode` torch_dispatch mode).
This means that all higher order ops need custom functionalization rules for the python variant too. I added them here, as well as a helper function `dispatch_functionalize()` - equivalent to `torch.func.functionalize()`, except that it uses `FunctionalTensorMode`.
In theory we could have secretly switched `torch.func.functionalize` to use `FunctionalTensorMode`. This would be BC-breaking, though, since `FunctionalTensorMode` isn't composable with the other functorch transforms (the functorch layer-mode stack doesn't know how to re-order torch_dispatch modes arbitrarily).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108656
Approved by: https://github.com/zou3519
ghstack dependencies: #109024, #109248
This is needed to allow the custom ops in our custom op autograd tests to accept `FunctionalTensor` arguments as inputs that we compute gradients for. Previously, custom ops would raise an error if you tried to pass in a tensor subclass when using autograd.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109248
Approved by: https://github.com/zou3519
ghstack dependencies: #109024
Fix https://github.com/pytorch/pytorch/issues/107098, improve `unique` performance on CPU.
The algorithm is taken from the Numpy implementation at https://github.com/numpy/numpy/blob/main/numpy/lib/arraysetops.py#L323: it first does a sort on the input sequence and then uses a `mask` to record the unique element at the start of each consecutive section (sketched below).
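Illustrative Python only; the actual implementation is parallelized C++:
```python
import torch

def unique_via_sort(x):
    # Sort, then mark the first element of each run of equal values.
    sorted_x, _ = torch.sort(x.flatten())
    mask = torch.ones_like(sorted_x, dtype=torch.bool)
    mask[1:] = sorted_x[1:] != sorted_x[:-1]
    return sorted_x[mask]
```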
We don't have a parallel sort for 1-dimensional float tensors yet; it will be enabled in a next step. Parallel radix sort is used for 1-dimensional int tensors.
The following data is collected with the script in the issue on an Intel(R) Xeon(R) Gold 6248 CPU @ 2.5GHz with a single socket (20 cores):
#### before (dtype int64)
```
Numpy just sort: 0.4271528720855713 s
Numpy sort + indexes: 6.383563041687012 s
Torch just sort: 0.46924352645874023 s
Torch sort + indexes: 1.8140404224395752 s
```
#### after (dtype int64)
```
Torch just sort: 0.2540090084075928 s
Torch sort + indexes: 0.2766146659851074 s
```
#### before (float32)
```
Numpy just sort: 0.41129398345947266 s
Numpy sort + indexes: 6.422696590423584 s
Torch just sort: 9.109549283981323 s
Torch sort + indexes: 37.59021711349487 s
```
#### after (float32)
```
Torch just sort: 3.5369982719421387 s
Torch sort + indexes: 3.582240581512451 s
```
If parallel sort were enabled for 1-dimensional float tensors, the performance would be:
```
Torch just sort: 0.3212606906890869 s
Torch sort + indexes: 0.36211371421813965 s
```
Since I have fused the `inverse_indices` and `count` calculations into the parallel loop (the algorithm is identical to NumPy's but better optimized), they add a small amount of additional time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107846
Approved by: https://github.com/jgong5, https://github.com/nikitaved, https://github.com/peterbell10
Fix: #107315
This PR enables dynamo to trace through the `pytree` API by inlining its functions. In
order to do so, a few details of `pytree` had to be changed.
In summary, this PR:
- Introduces `TreeSpecVariable` for representing `TreeSpec` instances
- Specializes `<type>.__bases__` call, returning a `TupleVariable`
- Enables calling the `id` builtin function for every variable that implements the `as_python_constant` method
- Specializes `ConstantVariable.call_method` for its (un)flatten functions
- Implements `UserDefinedObjectVariable.as_python_constant`
- Modifies `pytree` by:
- Makes `SUPPORTED_NODES` a map of ids (instead of types) to `NodeDef`
- Removes the `functools.wraps` decorator, since it can't be inlined
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108533
Approved by: https://github.com/ezyang, https://github.com/voznesenskym
ghstack dependencies: #109201
This PR fixes the `is_typing` function, which checks whether a value is an instance of a class from the `typing` package.
This reverts commit b09c09f7bb3adb6a5b8a107a5b96757b569daa8d.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109201
Approved by: https://github.com/ezyang
Fixes #109186.
This PR updates the docs for
- `torch.var`
- `torch.var_mean`
- `torch.std`
- `torch.std_mean`
- `torch.cov`
to reflect the actual implementation behavior when `correction >= N`. The math for `torch.cov` should probably be double checked before merging.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109326
Approved by: https://github.com/albanD
Summary: This can help debug issues, especially fc/bc issues with coreml tools, when a model fails to load.
Test Plan:
On a macbook, in fbsource,
```
arc focus2 -b pp-ios -a ModelRunner -a //xplat/caffe2/c10:c10Apple -a //xplat/caffe2/fb/dynamic_pytorch:dynamic_pytorch_implApple -a //xplat/caffe2:coreml_delegateApple --auto-test-schemes --force-with-wrong-xcode
```
It builds and runs the Playground app using a bunch of coreml models on my iPhone. Here is one for example,
https://pxl.cl/3nSPn
I also forcefully triggered an MLModel ctor failure to test this code by setting `modelURL=nil`, and as expected got this:
```
libc++abi: terminating due to uncaught exception of type c10::Error: Error loading MLModel Error details: Localized_description: nil value for URL Domain: com.apple.CoreML Code: 3 User Info: {
NSLocalizedDescription = "nil value for URL";
} Input Shapes: N/A
Exception raised from compile at xplat/caffe2/torch/csrc/jit/backends/coreml/objc/PTMCoreMLBackend.mm:162 (most recent call first):
(no backtrace available)
```
Whereas the previous message would have been:
```
Loading MLModel failed
```
Unrelated issues
* P829736691 - with running MaskRCNN on Coreml with the Playground app. Only happens some times.
* P829741377 - with Metal Operator Tests with the Playground app.
Differential Revision: D49349726
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109444
Approved by: https://github.com/kimishpatel
Reland - the previous PR was reverted by internal with this error:
```
File "/data/sandcastle/boxes/eden-trunk-hg-fbcode-fbsource/buck-out/v2/gen/fbcode/363cd7e240f5d021/caffe2/torch/fb/trainer/data_modules/tests/__test_dataloader__/test_dataloader#link-tree/torch/__init__.py", line 29, in <module>
from ._utils_internal import _functionalize_sync as _sync
ImportError: cannot import name '_functionalize_sync' from 'torch._utils_internal'
```
I couldn't figure out why internal was unhappy with the import. One potential reason is that I see a build rule for *another* `_utils_internal.py` in the fb folder here ([link](https://www.internalfb.com/code/fbsource/[30ed85cd88409af98b7490be137aaa5dfd7afd01]/fbcode/caffe2/TARGETS?lines=444))
Rather than burn more time investigating, I confirmed internally that the error goes away if I move the util from `torch/_utils_internal.py` to `torch/_utils.py`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109518
Approved by: https://github.com/albanD
Summary:
X-link: https://github.com/pytorch/executorch/pull/359
When exporting using enable_aot (through the torch.export path), we want to lift all constant tensors as buffers to the exported program. The ScalarToTensor pass in EXIR's aten_to_edge passes will create some constant tensors in the graph, so we will need to run a lift_constant_tensors pass afterwards.
Note that this only needs to be applied when exporting using the torch.export path because in the original path, nothing is lifted.
Test Plan: CI
Differential Revision: D49207492
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109382
Approved by: https://github.com/cccclai
Added the following properties:
- "code"
- "code_with_constants"
- "graph"
- "inlined_graph"
- "original_name"
with appropriate type hints to the `ScriptModule` stub, and removed them from the child class `RecursiveScriptModule`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109535
Approved by: https://github.com/ezyang
# Motivation
Fix an hpu deserialization bug. We should check for the hpu module if and only if the location starts with "hpu". Otherwise, deserialization always raises an AssertionError when hpu is not imported, which breaks serialization/deserialization for other third-party backends like IPEX.
# Solution
Only assert that the hpu module is present when the location starts with "hpu"; a sketch of the guard is shown below.
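A minimal sketch of the guard (illustrative only, not the exact registered deserialization hook):
```python
import torch

def _hpu_deserialize(obj, location):
    # Only check for the hpu backend when the serialized location
    # actually starts with "hpu"; other locations are left to the
    # handlers of their own backends.
    if location.startswith("hpu"):
        hpu = getattr(torch, "hpu", None)
        assert hpu is not None, "HPU device module is not loaded"
        return obj.to(location)
```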
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109499
Approved by: https://github.com/ezyang
Previously, the code for passing inputs to the exported program was:
```
if kwargs:
    return (args, kwargs)
else:
    return args
```
However, this causes some inconsistency where if the original input contains args and kwargs, the treespec would be a tuple containing a tuple of arguments, and a dictionary of keyword arguments. But if the original input only contained args, the treespec would just be a tuple of arguments. This inconsistency causes some inconveniences in the runtime.
So I updated the code to just always keep the kwargs around.
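A small illustration of the resulting consistency, using `torch.utils._pytree` (the attribute access below is an assumption about the `TreeSpec` internals):
```python
import torch.utils._pytree as pytree

# With kwargs always kept, every set of inputs flattens to a treespec of
# the same shape -- a tuple of (args, kwargs) -- even when kwargs is empty.
_, spec_a = pytree.tree_flatten(((1, 2), {"scale": 3}))
_, spec_b = pytree.tree_flatten(((1, 2), {}))
assert len(spec_a.children_specs) == len(spec_b.children_specs) == 2
```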
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109160
Approved by: https://github.com/zhxchen17, https://github.com/avikchaudhuri
Summary:
If there are no inline constraints added, just return the original graph.
We want to do this because this pass sometimes messes up the node names;
until we actually fix that, we can make the behavior a bit less buggy
by skipping no-op passes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109395
Approved by: https://github.com/angelayi
Summary:
Now that quantization works on the pre-dispatch aten IR, moving to the full set
of aten ops is OK. Plus, when tracing models like ViT, the linear
projections of k, q, v use functional.linear and not nn.Linear,
which results in not being able to extract the nodes corresponding to linear.
Test Plan:
quant tests
Differential Revision: [D49252194](https://our.internmc.facebook.com/intern/diff/D49252194)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109254
Approved by: https://github.com/jerryzh168
For a large reduction (with large xnumel and rnumel), we potentially need to run a large number of thread blocks. Occupancy matters here, since with larger occupancy we can run more blocks on each SM and may need fewer waves to run the entire kernel on the GPU. The number of registers used by each thread can limit occupancy. For A100, it's safe to say that register usage does not limit occupancy only if each thread uses <= 32 registers. This PR leverages this observation and reduces RBLOCK (thus reducing the registers used by each thread) when register usage limits occupancy for a large reduction.
The scenario mentioned can happen for the softmax kernel used in transformers. Here are some results gathered on a devgpu:
- PLBartForCausalLM we improve from 1.88x (58.7ms) to 2.00x (55.82ms)
- TrOCRForCausalLM we improve from 1.45x (92.9ms) to 1.51x (89.12ms)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109275
Approved by: https://github.com/jansel
Summary: Use a RAII class to wrap around at::cuda::CUDAStreamGuard. Previous implementation didn't follow the exact CUDAStreamGuard behavior.
Test Plan: CI
Differential Revision: D49355542
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109471
Approved by: https://github.com/chenyang78
Summary:
The basic concept behind this diff is to modify Dynamo's tracing behavior when it encounters a KeyedJaggedTensor that is synced (aka has `_length_per_key` and `_offset_per_key` populated). These fields are lists of integers; ordinarily, Dynamo will optimistically try to specialize on integers, however, for KJTs, we know that these integers will definitely vary from run-to-run. Furthermore, ordinarily, we would also specialize these integers if they are 0/1, but we will frequently expect features in KJTs to be 0/1.
The fix is to detect KJTs and treat these integers as *unbacked integers*. This is NOT a universally sound optimization: when treating these integers as unbacked, we never report them as equal to zero or one. In return, we always generate graphs that generalize no matter the length of values on features. This is enough to trace through APS sparse arch, torchrec_dlrm and some small split-cat examples.
The special integer behavior is triggered by a dynamically scoped `force_unspec_int_unbacked_size_like` variable on TracingContext, which we trigger when we wrap a KJT. There probably are other ways to do this, but this was simple and worked.
Test Plan:
```
buck2 test mode/dev-nosan //pytorch/benchmark/fb/test_gpu:run_test_gpu
```
from aakhundov
1. first build feed_lower_benchmark:
```
buck2 build --show-output mode/opt -c python.package_style=inplace -c fbcode.enable_gpu_sections=true -c fbcode.platform=platform010 -c fbcode.split-dwarf=true hpc/new/models/feed/benchmark:feed_lower_benchmark
```
2. then run the lowering of the model with it:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 TORCH_LOGS="output_code,graph_code" TORCH_COMPILE_DEBUG=1 ../buck-out/v2/gen/fbcode/79c6b019ee0f9469/hpc/new/models/feed/benchmark/__feed_lower_benchmark__/feed_lower_benchmark.par --load=manifold://ig_inference_model/tree/user/facebook/fblearner/predictor/960999465/60/gpu_lowering/input.predictor --skip-trt --skip-ait --sync-mode=0 --enable-aot-inductor --lower-presets="ig_stories" --gpu-trace
```
cf https://docs.google.com/document/d/1yD30xYrdmM8r2HTdmXnZTg0-MHVexfVrAa0294m1AUE/edit?pli=1#heading=h.qiv3fp7e6zg0
From torchrec: https://www.internalfb.com/intern/wiki/Torchrec/Development/Testing_production_models/
From ge0405
baseline (without your diff): f477293168
your diff: f477292363
```
buck2 test //caffe2/test/dynamo:test_dynamo_torchrec
buck2 run 'fbcode//mode/opt' fbcode//pytorch/benchmark/fb/test_gpu:run_test_gpu -- 'pytorch.benchmark.fb.test_gpu.test_gpu.TestBenchmarkFbGpu.test_train_blue_reels_vdd_v3_inductor_speedup'
```
Differential Revision: D49236757
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109216
Approved by: https://github.com/voznesenskym
Fixes#107995
In the reproducer we have the fx graph:
```python
class <lambda>(torch.nn.Module):
    def forward(self, arg0_1: f32[1]):
        # File: <ipython-input-1-5f62cb746ad5>:10, code: return self.layer1(inputs)
        gt: b8[1] = torch.ops.aten.gt.Scalar(arg0_1, 0)
        mul: f32[1] = torch.ops.aten.mul.Tensor(arg0_1, 5.2955089)
        where: f32[1] = torch.ops.aten.where.self(gt, arg0_1, mul); gt = mul = None
        # No stacktrace found for following nodes
        copy_: f32[1] = torch.ops.aten.copy_.default(arg0_1, where); arg0_1 = None
        return (where,)
```
The `where` node is both copied into `arg0_1` and returned as the output of the
function. Currently `realize_into` converts the where's storage into a
`MutationLayout` of `arg0_1`, for which no tensor named `buf0` is allocated.
This is incorrect as `where` and `arg0_1` shouldn't share storage. It also
breaks the wrapper code generation which references `buf0` directly in the
return, but never allocates a `buf0`.
This issue only appears for size zero tensors, because otherwise the `src`
buffer becomes a user of `arg0_1` which forces this copy to happen anyway.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108126
Approved by: https://github.com/jansel
Enables two ruff rules derived from pylint:
* PLR1722 replaces any exit() calls with sys.exit(). exit() is only designed to be used in REPL contexts and may not always be available by default; this rule always uses the version in the sys module, which is more reliable.
* PLW3301 replaces nested min/max calls with simplified versions (i.e. `min(a, min(b, c))` => `min(a, b, c)`). The new version is more idiomatic and more efficient.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109461
Approved by: https://github.com/ezyang
In https://github.com/pytorch/pytorch/pull/107901, CUDA event based
profiling was changed to profiler-based profiling to avoid counting CPU-side
kernel launch overhead in the final latency numbers. However, it turns out that
torch.profiler is significantly slower than CUDA events, which hurts model
compilation speed quite a bit. This PR changes back to CUDA event
based profiling.
Follow-ups:
* Try CUDA event profiling with CUDAGraphs;
* Multi-GPU profiling;
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109338
Approved by: https://github.com/frank-wei
Summary:
This PR adds a limited C shim layer for libtorch. The ultimate goal is to ban any direct reference to aten/c10 data structures or functions, to avoid ABI breakage by providing stable C interfaces.
To make the review and landing easier, we broke the changes into several steps. In this PR (a combination of https://github.com/pytorch/pytorch/pull/109022 and https://github.com/pytorch/pytorch/pull/109351), we add C interfaces for certain libtorch functions and modify the wrapper codegen to generate calls to those interfaces. There are a few other items to be addressed in future PRs:
* The AOTInductor runtime interface still takes lists of aten tensors as input and output
* The interaction with ProxyExecutor (general fallback support) needs to move away from aten tensor
* Remove all references to aten/c10 headers in the AOTInductor-generated code
Differential Revision: D49302669
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109391
Approved by: https://github.com/chenyang78
This fixes numerous tests which were xfailing. For instance, the
`_segment_reduce.lengths` OpInfo test, which was previously relying on
the fallback kernel to determine the shape of the meta tensor. The
fallback kernel would fail with
segment_reduce(): Expected all rows of lengths along axis to sum to data.size(lengths.dim()-1) when !unsafe.
as it was trying to read the values of a meta tensor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109359
Approved by: https://github.com/ezyang
When you in-place matmul two one-dimensional numpy arrays, numpy=="1.24.3" gives
```
TypeError: In-place matrix multiplication is not (yet) supported. Use 'a = a @ b' instead of 'a @= b'.
```
but numpy=="1.25.2" gives
```
ValueError: inplace matrix multiplication requires the first operand to have at least one and the second at least two dimensions.
```
This diff makes it so that the test does not fail on newer versions of numpy, which raise a ValueError that we previously did not catch.
An alternative solution would be to update the test cases to be 2 dimensional, but that would have impact on other operators being tested.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109087
Approved by: https://github.com/jansel
Before this PR, if we run the following code:
```python
def true_fn(x):
    return x - x.cos()

def false_fn(x):
    return x + x.sin()

def foo(x):
    return cond(x.shape[0] == 4, true_fn, false_fn, [x])

gm = make_fx(foo, tracing_mode='symbolic')(torch.ones(3, 4))
gm = make_fx(foo, tracing_mode='symbolic')(torch.ones(4, 5))
```
we'll have the following error:
```python
Traceback (most recent call last):
File "/home/yidi/local/pytorch/make_fx.py", line 16, in <module>
gm = make_fx(foo, tracing_mode='symbolic')(torch.ones(4, 5))
File "/home/yidi/local/pytorch/torch/fx/experimental/proxy_tensor.py", line 841, in wrapped
t = dispatch_trace(wrap_key(func, args, fx_tracer, pre_dispatch), tracer=fx_tracer, concrete_args=tuple(phs))
File "/home/yidi/local/pytorch/torch/_compile.py", line 24, in inner
return torch._dynamo.disable(fn, recursive)(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 397, in _fn
return fn(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/external_utils.py", line 17, in inner
return fn(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/fx/experimental/proxy_tensor.py", line 461, in dispatch_trace
graph = tracer.trace(root, concrete_args)
File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 397, in _fn
return fn(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/external_utils.py", line 17, in inner
return fn(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/fx/_symbolic_trace.py", line 817, in trace
(self.create_arg(fn(*args)),),
File "/home/yidi/local/pytorch/torch/fx/experimental/proxy_tensor.py", line 497, in wrapped
out = f(*tensors)
File "/home/yidi/local/pytorch/make_fx.py", line 13, in foo
return control_flow.cond(x.shape[0] == 4, true_fn, false_fn, [x])
File "/home/yidi/local/pytorch/torch/_higher_order_ops/cond.py", line 151, in cond
return torch.compile(cond_op, backend="eager", fullgraph=True)(
File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 397, in _fn
return fn(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 545, in catch_errors
return callback(frame, cache_entry, hooks, frame_state)
File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 140, in _fn
return fn(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 380, in _convert_frame_assert
return _compile(
File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 561, in _compile
guarded_code = compile_inner(code, one_graph, hooks, transform)
File "/home/yidi/local/pytorch/torch/_dynamo/utils.py", line 189, in time_wrapper
r = func(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 483, in compile_inner
out_code = transform_code_object(code, transform)
File "/home/yidi/local/pytorch/torch/_dynamo/bytecode_transformation.py", line 1028, in transform_code_object
transformations(instructions, code_options)
File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 432, in transform
tracer = InstructionTranslator(
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2032, in __init__
self.symbolic_locals = collections.OrderedDict(
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2035, in <genexpr>
VariableBuilder(
File "/home/yidi/local/pytorch/torch/_dynamo/variables/builder.py", line 229, in __call__
vt = self._wrap(value).clone(**self.options())
File "/home/yidi/local/pytorch/torch/_dynamo/variables/builder.py", line 374, in _wrap
return type_dispatch(self, value)
File "/home/yidi/local/pytorch/torch/_dynamo/variables/builder.py", line 808, in wrap_listlike
output = [
File "/home/yidi/local/pytorch/torch/_dynamo/variables/builder.py", line 809, in <listcomp>
VariableBuilder(self.tx, GetItemSource(self.get_source(), i))(
File "/home/yidi/local/pytorch/torch/_dynamo/variables/builder.py", line 229, in __call__
vt = self._wrap(value).clone(**self.options())
File "/home/yidi/local/pytorch/torch/_dynamo/variables/builder.py", line 374, in _wrap
return type_dispatch(self, value)
File "/home/yidi/local/pytorch/torch/_dynamo/variables/builder.py", line 808, in wrap_listlike
output = [
File "/home/yidi/local/pytorch/torch/_dynamo/variables/builder.py", line 809, in <listcomp>
VariableBuilder(self.tx, GetItemSource(self.get_source(), i))(
File "/home/yidi/local/pytorch/torch/_dynamo/variables/builder.py", line 229, in __call__
vt = self._wrap(value).clone(**self.options())
File "/home/yidi/local/pytorch/torch/_dynamo/variables/builder.py", line 374, in _wrap
return type_dispatch(self, value)
File "/home/yidi/local/pytorch/torch/_dynamo/variables/builder.py", line 1040, in wrap_tensor
tensor_variable = wrap_fx_proxy(
File "/home/yidi/local/pytorch/torch/_dynamo/variables/builder.py", line 1267, in wrap_fx_proxy
return wrap_fx_proxy_cls(
File "/home/yidi/local/pytorch/torch/_dynamo/variables/builder.py", line 1382, in wrap_fx_proxy_cls
example_value = wrap_to_fake_tensor_and_record(
File "/home/yidi/local/pytorch/torch/_dynamo/variables/builder.py", line 1652, in wrap_to_fake_tensor_and_record
dynamic_dims, constraint_dims = _automatic_dynamic(
File "/home/yidi/local/pytorch/torch/_dynamo/variables/builder.py", line 1550, in _automatic_dynamic
if dim is not None and e.size()[i] != dim:
File "/home/yidi/local/pytorch/torch/__init__.py", line 352, in __bool__
return self.node.bool_()
File "/home/yidi/local/pytorch/torch/fx/experimental/symbolic_shapes.py", line 1019, in bool_
return self.guard_bool("", 0)
File "/home/yidi/local/pytorch/torch/fx/experimental/symbolic_shapes.py", line 1001, in guard_bool
r = self.shape_env.evaluate_expr(self.expr, self.hint, fx_node=self.fx_node)
File "/home/yidi/local/pytorch/torch/fx/experimental/recording.py", line 227, in wrapper
return fn(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/fx/experimental/symbolic_shapes.py", line 3793, in evaluate_expr
assert orig_expr == hint, f"{orig_expr} != {hint}"
AssertionError: False != True
from user code:
Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
You can suppress this exception and fall back to eager by setting:
import torch._dynamo
torch._dynamo.config.suppress_errors = True
```
It's because we record the SymInt in the frame state in _automatic_dynamic the first time we compile the function. Then the second time, when we are given a SymInt-sized input with different hints, the comparison fails.
Implementation:
This PR derives shape dynamism from the dynamism of the inputs: if a dimension is a SymInt, return DYNAMIC; otherwise return STATIC.
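A sketch of that rule (the real code maps to dynamo's internal dynamic-dim markers rather than strings):
```python
import torch

def dim_dynamism(size) -> str:
    # A SymInt-sized dimension is already known to vary across runs.
    return "DYNAMIC" if isinstance(size, torch.SymInt) else "STATIC"
```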
Test Plan:
Add a test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109331
Approved by: https://github.com/ezyang
Requested from @tugsbayasgalan: we want dynamo to preserve some FX node metadata when we trace `GraphModule`s (`nn_module_stack`, `source_fn`, `stack_trace`). This is helpful for the case when we export an aten-level `GraphModule`, add some (possibly non-torch or non-aten) ops, and we want to transform the graph back into an aten-level graph. Without preserving metadata, future passes that look at metadata (e.g. quantization passes) won't work.
This feature also has the additional benefit of being able to preserve origin line of code when `print_readable`'ing a `GraphModule`. This is helpful when debugging graphs that have passed through dynamo several times.
The added unit test demonstrates the added functionality of this PR.
~This PR is currently a proof-of-concept implementation that shows that preserving node metadata across dynamo is possible.~ This PR preserves node metadata across dynamo by doing the following:
- ~inject a counter variable into the `GraphModule` source code, which is incremented every time a node is run~
- Construct a line number -> node index map in `GraphModule` as the source code is being generated.
- pass a list of node metadata and the line number map to dynamo's bytecode analyzer
- ~dynamo traces the counter as a `ConstantVariable`, so when we create a new proxy, we can determine which original node index this proxy corresponds by looking at the value of the traced counter~
- When we create a new proxy, get the current instruction's line number, and get the node index using the line number map
- index into the original node metadata ~using the counter variable's tracked value.~
~Some things that should be addressed off the top of my head:~
- ~Is this feature even desirable? (Do we really want Dynamo to have special behavior for `GraphModules`? Should we expect users to re-export `GraphModules`?)~
- ~Is there a better approach than to use a counter? We considered using node names, line numbers, and assuming that proxies are created in the same order as the nodes, but each of these 3 have shortcomings. For node names, we only have access to new node names, not the old ones. Using line number is fragile. The third is problematic since not all created nodes go through `create_proxy` (e.g. inputs). We currently generate a line number to node index map when the `GraphModule`'s code is generated.~
- ~What's the best way to send data across the "CPython gap"? That is, it is not obvious how to cleanly pass data from dynamo's `eval_frame.py:_TorchDynamoContext.__call__` to `symbolic_convert.py:InstructionTranslatorBase.__init__`. In this PR, we use a global.~
Differential Revision: [D49257108](https://our.internmc.facebook.com/intern/diff/D49257108)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107067
Approved by: https://github.com/jansel
This issue is that `str(torch.ops.aten.conv2d.default._schema)` does not return the same schema that is in native_functions.yaml ([link](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/native_functions.yaml#L1654)).
Torchscript appears to change the default arg string `int[2] strides=1` to `int[2] strides=[1, 1]`. If you try to parse that with torchgen, torchgen is unhappy (it tries to split arguments on comma, but now we have a comma inside of the default argument).
Fixing the issue directly in torchgen was a bit more painful, so I opted just to undo the transformation that torchscript made: convert `=[1, 1]` back into `=1`.
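A hypothetical sketch of that undo step (the real helper in the PR may differ): collapse a uniform list default like `=[1, 1]` back to the scalar form `=1` before handing the schema string to torchgen.
```python
import re

def collapse_broadcast_defaults(schema: str) -> str:
    # Only uniform lists (the result of broadcasting a scalar default)
    # are collapsed; "=[1, 2]" would be left alone.
    return re.sub(r"=\[(\d+)(?:, \1)*\]", r"=\1", schema)

print(collapse_broadcast_defaults("conv2d(Tensor input, int[2] stride=[1, 1])"))
# conv2d(Tensor input, int[2] stride=1)
```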
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108897
Approved by: https://github.com/ezyang
ghstack dependencies: #106404, #107917
Added two new utils to help with turning python functionalization on in AOTAutograd (next PR):
(1) updated `torch._sync()`. Previously, this API could only handle `torch.Tensor` instances that had a `FunctionalTensorWrapper` TensorImpl. It now needs to handle python `FunctionalTensor`s. In theory I can probably break BC and change this API (since it's private?), but I decided not to do it in this PR stack to minimize the chance of reverts. Instead of updating that API directly (which is in C++), I just added a python shim that first tries to unwrap the python `FunctionalTensor` if there is one, then calls the existing C++ logic
(2) `mirror_autograd_meta` is now a standalone API that tries to mirror the `requires_grad` and `is_leaf` autograd metadata from one tensor to another. Previously this was hardcoded into `torch._to_functional_tensor()`. But I now need to use it in a more standalone way: later in AOTAutograd when we unwrap and re-wrap a tensor subclasses, we need to manually mirror the autograd metadata from the original to the updated version of the subclass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107917
Approved by: https://github.com/ezyang
ghstack dependencies: #106404
This PR adds a new `FunctionalTensor` subclass, and `FunctionalTensorMode` torch dispatch mode. Together, this class/mode are a lightweight wrapper around our existing C++ functionalization logic.
This idea came from Ed - later in the stack, I want to be able to run functionalization **underneath** torch_dispatch, when performing tracing in AOTAutograd. I can't do this easily with vanilla C++ functionalization, because it has a dedicated dispatch key that always runs before TorchDispatch. However, by adding a torch_dispatch mode shim around functionalization, we can use functionalization as a torch_dispatch mode, which will make it easier to run underneath other modes later.
This PR provides the basic new classes, and some light testing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106404
Approved by: https://github.com/ezyang
Summary:
This PR adds support for sparse matmul using cuSPARSELt with int8
inputs and fp16 outputs.
It does so by adding an out_dtype flag to `torch._cslt_sparse_mm`.
Because the only mixed-dtype support present in cuSPARSELt is for int8
input and fp16 output, we error out (see the sketch after this list) if:
* out_dtype is set and the input tensors are not int8.
* out_dtype is set to any value other than fp16
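A sketch of that validation under the stated rules (error messages assumed):
```python
import torch

def check_cslt_out_dtype(a: torch.Tensor, b: torch.Tensor, out_dtype):
    if out_dtype is not None:
        if a.dtype != torch.int8 or b.dtype != torch.int8:
            raise RuntimeError("out_dtype is only supported for int8 inputs")
        if out_dtype is not torch.float16:
            raise RuntimeError("only fp16 is supported as out_dtype")
```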
Test Plan:
python test/test_sparse_semi_structured.py -k int8_in_fp16_out
Reviewers:
@cpuhrsch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109214
Approved by: https://github.com/cpuhrsch
For code like this:
```python
import torch
from functorch.experimental import control_flow
def exportdb_example2(x):
    def true_fn():
        return torch.sin(x)

    def false_fn():
        return torch.cos(x)

    return control_flow.cond(x.sum() > 0, true_fn, false_fn, [])

ep = torch._export.export(exportdb_example2, (torch.randn(4, 5),))
```
Before this PR, when the branches take an empty list/tuple of inputs, we'd get an error like the following:
```python
Traceback (most recent call last):
File "/home/yidi/local/pytorch/test_cond.py", line 11, in <module>
ep = torch._export.export(exportdb_example2, (torch.randn(4, 5),))
File "/home/yidi/local/pytorch/torch/_export/__init__.py", line 340, in export
gm_torch_level, _ = torch._dynamo.export(
File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 1207, in inner
result_traced = opt_f(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 397, in _fn
return fn(*args, **kwargs)
File "/home/yidi/local/pytorch/test_cond.py", line 3, in exportdb_example2
def exportdb_example2(x):
File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 397, in _fn
return fn(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/external_utils.py", line 17, in inner
return fn(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 1173, in result_capturing_wrapper
graph_captured_result = torch.func.functional_call(
File "/home/yidi/local/pytorch/torch/_functorch/functional_call.py", line 143, in functional_call
return nn.utils.stateless._functional_call(
File "/home/yidi/local/pytorch/torch/nn/utils/stateless.py", line 264, in _functional_call
return module(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/fx/graph_module.py", line 725, in call_wrapped
return self._wrapped_call(self, *args, **kwargs)
File "/home/yidi/local/pytorch/torch/fx/graph_module.py", line 305, in __call__
raise e
File "/home/yidi/local/pytorch/torch/fx/graph_module.py", line 292, in __call__
return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
File "/home/yidi/local/pytorch/torch/nn/modules/module.py", line 1519, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/nn/modules/module.py", line 1528, in _call_impl
return forward_call(*args, **kwargs)
File "<eval_with_key>.2", line 10, in forward
File "/home/yidi/local/pytorch/torch/_ops.py", line 301, in __call__
return wrapper()
File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 397, in _fn
return fn(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/external_utils.py", line 17, in inner
return fn(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/_ops.py", line 297, in wrapper
return self.dispatch(
File "/home/yidi/local/pytorch/torch/_ops.py", line 280, in dispatch
return kernel(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/_higher_order_ops/utils.py", line 52, in inner
return autograd_not_implemented_inner(op, deferred_error, *args, **kwargs)
File "/home/yidi/local/pytorch/torch/_higher_order_ops/utils.py", line 25, in autograd_not_implemented_inner
result = operator(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/_ops.py", line 301, in __call__
return wrapper()
File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 397, in _fn
return fn(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/external_utils.py", line 17, in inner
return fn(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/_ops.py", line 297, in wrapper
return self.dispatch(
File "/home/yidi/local/pytorch/torch/_ops.py", line 255, in dispatch
return self.python_key_mode_table[type(curr_mode)](*args, **kwargs)
File "/home/yidi/local/pytorch/torch/_higher_order_ops/cond.py", line 310, in cond_fake_tensor_mode
flat_false_outs, _ = pytree.tree_flatten(false_fn(*operands))
File "/home/yidi/local/pytorch/torch/fx/graph_module.py", line 725, in call_wrapped
return self._wrapped_call(self, *args, **kwargs)
File "/home/yidi/local/pytorch/torch/fx/graph_module.py", line 305, in __call__
raise e
File "/home/yidi/local/pytorch/torch/fx/graph_module.py", line 292, in __call__
return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
File "/home/yidi/local/pytorch/torch/nn/modules/module.py", line 1519, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/nn/modules/module.py", line 1528, in _call_impl
return forward_call(*args, **kwargs)
TypeError: forward() takes 2 positional arguments but 3 were given
```
Thanks to @williamwen42 for spotting this error! We fix it by addressing the case when add_after is -1.
Test Plan:
See newly added tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109308
Approved by: https://github.com/williamwen42
This PR introduces binary search for finding smaller validation errors, when they occur.
We do that by bisecting the sequence of `torch._assert` FX nodes recorded as the source
expression of the translation validator (TV) by `ShapeEnv.evaluate_expr` calls. Then, we
raise the error caused by the earliest node.
In summary, the changes are:
- Call `bisect` on `ValidationError` @ _torch/_dynamo/convert_frame.py_
- Implement the binary search @ _torch/fx/experimental/symbolic_shapes.py_
Edit: moved `ShapeEnv` replay-recording to #107989
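A toy version of the bisection described above (assuming failures are monotone in the prefix of recorded `torch._assert` nodes): find the earliest node whose assertion prefix already fails validation.
```python
def earliest_failing_node(nodes, prefix_validates):
    # prefix_validates(prefix) -> bool: True if the TV accepts the prefix.
    lo, hi = 0, len(nodes) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix_validates(nodes[: mid + 1]):
            lo = mid + 1   # failure lies strictly after `mid`
        else:
            hi = mid       # prefix up to `mid` already fails
    return nodes[lo]
```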
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107493
Approved by: https://github.com/ezyang
ghstack dependencies: #107989
Summary:
Port x86 inline assembly to aarch64:
- Use `sp` instead of `%rsp` for stack pointer; move to second caller-
saved register `x1` instead of `%rsi`
- Use `x29` instead of `%rbp` for base pointer; move to third caller-
saved register `x2` instead of `%rdx`
Test Plan:
```
$ buck2 build fbcode//mode/opt fbcode//caffe2/torch/fb/model_transform/fx2trt/packaging:generate_merge_net_file
```
Reviewed By: jasonjk-park
Differential Revision: D47242468
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104707
Approved by: https://github.com/aaronenyeshi
The tests were failing because the input tensors were getting
incorrectly promoted to int64, due to `transform_args` incorrectly
considering the `dim` argument when determining the type to promote to.
It doesn't seem like type promotion is necessary in general for max and
min, so I've switched them to `type_promotion_kind=None`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109264
Approved by: https://github.com/eellison
ghstack dependencies: #109165
Both the eager and compiled versions fail with the following message
when trying to compute the grad:
RuntimeError: linalg_eig_backward: The eigenvectors in the complex
case are specified up to multiplication by e^{i phi}. The specified
loss function depends on this quantity, so it is ill-defined.
I'm not sure if there's a way to adapt the OpInfo such that the grad is
computable, but we should at least check that the forward pass is
correct.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109165
Approved by: https://github.com/eellison
https://github.com/pytorch/pytorch/pull/108351/files is failing on mac and windows because we don't have the dependency
It is available on linux because it is included in .ci/docker/requirements-docs.txt
Adding skips to make it green.
Here are some outputs for future debugging
https://github.com/pytorch/pytorch/actions/runs/6192933622/job/16813841625
https://ossci-raw-job-status.s3.amazonaws.com/log/16813841625
```
2023-09-15T02:09:43.2397460Z =================================== FAILURES ===================================
2023-09-15T02:09:43.2397650Z ______________________ TestTensorBoardSummary.test_audio _______________________
2023-09-15T02:09:43.2397830Z Traceback (most recent call last):
2023-09-15T02:09:43.2398090Z File "/Users/ec2-user/runner/_work/pytorch/pytorch/test/test_tensorboard.py", line 417, in test_audio
2023-09-15T02:09:43.2398390Z self.assertTrue(compare_proto(summary.audio('dummy', tensor_N(shape=(42,))), self))
2023-09-15T02:09:43.2398720Z File "/Users/ec2-user/runner/_work/_temp/conda_environment_6192933622/lib/python3.9/unittest/case.py", line 688, in assertTrue
2023-09-15T02:09:43.2399100Z ##[endgroup]
2023-09-15T02:09:43.2399240Z raise self.failureException(msg)
2023-09-15T02:09:43.2399400Z AssertionError: False is not true
2023-09-15T02:09:43.2399490Z
2023-09-15T02:09:43.2399590Z To execute this test, run the following from the base repo dir:
2023-09-15T02:09:43.2399820Z python test/test_tensorboard.py -k test_audio
2023-09-15T02:09:43.2399930Z
```
https://github.com/pytorch/pytorch/actions/runs/6192933622/job/16814065258
https://ossci-raw-job-status.s3.amazonaws.com/log/16814065258
```
2023-09-15T02:38:44.6284979Z ================================== FAILURES ===================================
2023-09-15T02:38:44.6285295Z ______________________ TestTensorBoardNumpy.test_scalar _______________________
2023-09-15T02:38:44.6285556Z Traceback (most recent call last):
2023-09-15T02:38:44.6285915Z File "C:\actions-runner\_work\pytorch\pytorch\test\test_tensorboard.py", line 794, in test_scalar
2023-09-15T02:38:44.6286325Z res = make_np(np.float128(1.00008 + 9))
2023-09-15T02:38:44.6286705Z File "C:\Jenkins\Miniconda3\lib\site-packages\numpy\__init__.py", line 315, in __getattr__
2023-09-15T02:38:44.6287700Z raise AttributeError("module {!r} has no attribute "
2023-09-15T02:38:44.6288060Z AttributeError: module 'numpy' has no attribute 'float128'
2023-09-15T02:38:44.6288241Z
2023-09-15T02:38:44.6288390Z To execute this test, run the following from the base repo dir:
2023-09-15T02:38:44.6288679Z python test\test_tensorboard.py -k test_scalar
2023-09-15T02:38:44.6288846Z
```
https://github.com/pytorch/pytorch/actions/runs/6193449301/job/16815113985
https://ossci-raw-job-status.s3.amazonaws.com/log/16815113985
```
2023-09-15T03:25:53.7797550Z =================================== FAILURES ===================================
2023-09-15T03:25:53.7797790Z __________________ TestTensorBoardSummary.test_histogram_auto __________________
2023-09-15T03:25:53.7798000Z Traceback (most recent call last):
2023-09-15T03:25:53.7798310Z File "/Users/ec2-user/runner/_work/pytorch/pytorch/test/test_tensorboard.py", line 426, in test_histogram_auto
2023-09-15T03:25:53.7798690Z self.assertTrue(compare_proto(summary.histogram('dummy', tensor_N(shape=(1024,)), bins='auto', max_bins=5), self))
2023-09-15T03:25:53.7799090Z File "/Users/ec2-user/runner/_work/_temp/conda_environment_6193449301/lib/python3.9/unittest/case.py", line 688, in assertTrue
2023-09-15T03:25:53.7799430Z raise self.failureException(msg)
2023-09-15T03:25:53.7799610Z AssertionError: False is not true
2023-09-15T03:25:53.7799720Z
2023-09-15T03:25:53.7799840Z To execute this test, run the following from the base repo dir:
2023-09-15T03:25:53.7800170Z python test/test_tensorboard.py -k test_histogram_auto
2023-09-15T03:25:53.7800310Z
2023-09-15T03:25:53.7800430Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2023-09-15T03:25:53.7800870Z - generated xml file: /Users/ec2-user/runner/_work/pytorch/pytorch/test/test-reports/python-pytest/test_tensorboard/test_tensorboard-aef95b5e2d69c061.xml -
2023-09-15T03:25:53.7801200Z =========================== short test summary info ============================
```
https://github.com/pytorch/pytorch/actions/runs/6193576371/job/16815396352
https://ossci-raw-job-status.s3.amazonaws.com/log/16815396352
```
2023-09-15T03:47:02.9430070Z _________________ TestTensorBoardSummary.test_histogram_doane __________________
2023-09-15T03:47:02.9430250Z Traceback (most recent call last):
2023-09-15T03:47:02.9430520Z File "/Users/ec2-user/runner/_work/pytorch/pytorch/test/test_tensorboard.py", line 433, in test_histogram_doane
2023-09-15T03:47:02.9430850Z self.assertTrue(compare_proto(summary.histogram('dummy', tensor_N(shape=(1024,)), bins='doane', max_bins=5), self))
2023-09-15T03:47:02.9431180Z File "/Users/ec2-user/runner/_work/_temp/conda_environment_6193576371/lib/python3.9/unittest/case.py", line 688, in assertTrue
2023-09-15T03:47:02.9431390Z raise self.failureException(msg)
2023-09-15T03:47:02.9431550Z AssertionError: False is not true
2023-09-15T03:47:02.9431640Z
2023-09-15T03:47:02.9431730Z To execute this test, run the following from the base repo dir:
2023-09-15T03:47:02.9432000Z python test/test_tensorboard.py -k test_histogram_doane
2023-09-15T03:47:02.9432120Z
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109349
Approved by: https://github.com/huydhn
resolves https://github.com/pytorch/pytorch/issues/109101
The problem is essentially that we were hashing all the arguments, including
the scalar (i.e. aten.div(tensor, scalar)). In the optimizer, the scalar might
change every time we call the op, so we got a cache miss on every call.
This PR improves the sharding cache behavior by introducing a
RuntimeSchemaInfo, used to record, at op registration time, the information
necessary for runtime hashing. This enables us to (see the sketch after this list):
* only hash arguments that are tensors or have static_argnum; this lets
many cases like aten.div.Tensor(tensor, 0.23231) hit the cache, whereas
we currently hash all args, which excludes those cases
* with the correct cache behavior, optimizers hit the cache again,
resolving the high CPU overhead issue.
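An illustrative sketch of the hashing rule (RuntimeSchemaInfo's actual fields and the tensor metadata DTensor hashes may differ):
```python
import torch

def sharding_cache_key(op_name, args, static_argnum):
    parts = [op_name]
    for i, a in enumerate(args):
        if isinstance(a, torch.Tensor):
            parts.append((tuple(a.shape), a.dtype))  # metadata, not values
        elif i >= static_argnum:
            parts.append(a)  # args marked static still participate
        # non-static scalars (the 0.23231 in aten.div.Tensor) are skipped,
        # so a changing scalar no longer causes a cache miss
    return hash(tuple(parts))
```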
A simple MLP shows all cache hits, and a single addmm goes to 0.319ms (from 0.341ms), showing some hashing improvements:
<img width="1172" alt="Screenshot 2023-09-14 at 11 06 07 AM" src="https://github.com/pytorch/pytorch/assets/9443650/3406d673-dd8d-4ad9-9b80-9d4721c430e3">
The Adam optimizer shows aten.div hitting the sharding cache again:
<img width="1016" alt="Screenshot 2023-09-14 at 11 02 10 AM" src="https://github.com/pytorch/pytorch/assets/9443650/4280e8e3-af44-4fc2-8360-ea80b768f1d9">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109306
Approved by: https://github.com/fduwjj
Document triton dependency for the Pytorch Release
Add documentation for Triton dependency in `RELEASE.md`. The documentation covers how to install and use Triton for various PyTorch builds and platforms.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109296
Approved by: https://github.com/malfet
Summary:
The `tensor_proto` function in the TensorBoard summary writer code doesn't correctly encode `torch.bfloat16` tensors; it tries to use a data type of `DT_BFLOAT` when creating the protobuf, but `DT_BFLOAT` is not a valid enum value (see `types.proto`). The correct value to use when encoding tensors of this type is `DT_BFLOAT16`. This diff updates the type map in the summary code to use the correct type.
While fixing this error, I also noticed the wrong field of the protobuf was being used when encoding tensors of this type; per the docs in the proto file, the DT_HALF and DT_BFLOAT16 types should use the `half_val` field, not `float_val`. Since this might confuse folks trying to read this data from storage in the future, I've updated the code to correctly use to `half_val` field for these cases. Note that there's no real size advantage from doing this, since both the `half_val` and `float_val` fields are 32 bits long.
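A sketch of the corrected mapping (enum/field names taken from the text above; the real table lives in torch/utils/tensorboard/summary.py and may be shaped differently):
```python
import torch

_TENSOR_TYPE_MAP = {
    torch.half:     ("DT_HALF",     "half_val"),   # unchanged
    torch.bfloat16: ("DT_BFLOAT16", "half_val"),   # was "DT_BFLOAT"/"float_val"
    torch.float32:  ("DT_FLOAT",    "float_val"),
}
```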
Test Plan:
Added a parameterized unit test that tests encoding tensors with `torch.half`, `torch.float16`, and `torch.bfloat16` data types.
# Before this change
The test fails with an `ValueError` due to the incorrect enum label:
```
======================================================================
ERROR: test_bfloat16_tensor_proto (test_tensorboard.TestTensorProtoSummary)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/data/users/jcarreiro/fbsource/buck-out/v2/gen/fbcode/f88b3f368c9334db/caffe2/test/__tensorboard__/tensorboard#link-tree/torch/testing/_internal/common_utils.py", line 2382, in wrapper
method(*args, **kwargs)
File "/data/users/jcarreiro/fbsource/buck-out/v2/gen/fbcode/f88b3f368c9334db/caffe2/test/__tensorboard__/tensorboard#link-tree/test_tensorboard.py", line 871, in test_bfloat16_tensor_proto
tensor_proto(
File "/data/users/jcarreiro/fbsource/buck-out/v2/gen/fbcode/f88b3f368c9334db/caffe2/test/__tensorboard__/tensorboard#link-tree/torch/utils/tensorboard/summary.py", line 400, in tensor_proto
tensor_proto = TensorProto(**tensor_proto_args)
ValueError: unknown enum label "DT_BFLOAT"
To execute this test, run the following from the base repo dir:
python test/__tensorboard__/tensorboard#link-tree/test_tensorboard.py -k test_bfloat16_tensor_proto
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
----------------------------------------------------------------------
```
# After this change
The test passes.
Reviewed By: tanvigupta17
Differential Revision: D48828958
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108351
Approved by: https://github.com/hamzajzmati, https://github.com/XilunWu
We could have SymBool inputs for torch.compile, e.g. in the following situation:
```
def f(x: torch.Tensor):
    pred = x.size(0) == 3
torch.compile(f)(pred, x)
make_fx(f, tracing_mode="symbolic")(x)
```
The idea of this PR (credit to @ezyang) is to support SymBool by re-using the infra we've already had for SymInt so that we don't need to replicate a lot of stuff.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107850
Approved by: https://github.com/ezyang
ghstack dependencies: #107662
## Summary
- Change the default of `supports_autograd` and `supports_forward_ad` of `ForeachFuncInfo` to `True`
- Add `test_zero_size_tensor_inputs` to make sure that foreach functions can handle 0-size Tensor inputs
- Add `test_parity` to check the consistency between outputs of foreach and for-loop of native function.
- Add `test_autodiff` to check forward-mode and reverse-mode AD
- Keep the corner cases that are not covered by the newly introduced methods
rel:
- #58833
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107869
Approved by: https://github.com/janeyx99
Fixes#108955.
Right now, the `_is_zipfile` check in `torch.load` performs multiple `read()` calls, reading 1 byte at a time in a loop. This is rather wasteful and leads to performance problems when accessing files on a network share (see #108955) .
This PR replaces those 1 byte calls with a single big call. Functionally, this is equivalent as `read(n)` only reads up to `n` bytes, so even if the file is shorter there should not be any problems.
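For illustration, a minimal sketch of the single-read idea (the byte string is the standard ZIP local-file-header magic; `_is_zipfile`'s real logic in torch/serialization.py may differ):
```python
def starts_with_zip_magic(f) -> bool:
    magic = b"PK\x03\x04"  # standard ZIP local file header signature
    # read(n) returns *up to* n bytes, so short files are handled too.
    return f.read(len(magic)) == magic
```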
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109119
Approved by: https://github.com/mikaylagawarecki
As we deprecate the TorchScript ONNX exporter, we need to refactor the onnx caffe2 tests to start using private functions instead of public ones.
That requires changes to the merge rules to allow the ONNX exporter team to drive the deprecation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109295
Approved by: https://github.com/malfet
The sample inputs is a bit involved because there are a lot of
shenanigans in the derivative formula. Check comments.
This is exercised in vdd, internal test `buck2 run '@fbcode//mode/opt' fbcode//pytorch/benchmark/fb/test_gpu:run_test_gpu -- 'pytorch.benchmark.fb.test_gpu.test_gpu.TestBenchmarkFbGpu.test_train_blue_reels_vdd_v3_inductor_speedup'`
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109211
Approved by: https://github.com/albanD, https://github.com/zou3519
Before the PR, for HF's ModelOutput class, we used dicts.py/DataClassVariable with our own implementations of `__getitem__`, `__setattr__`, and `__setitem__`. There is a risk that the ModelOutput logic may change, since it is user code.
After the PR, we inline `__getitem__`, `__setattr__`, and `__setitem__` using dicts.py/CustomizedDictVariable, so the traced logic always matches the user code.
unit test
* python test/dynamo/test_model_output.py -k test_HF_bert_model_output
test on HF benchmark
* python benchmarks/dynamo/huggingface.py -d cuda --inference --accuracy --progress --inductor --print-dataframe-summary 2>&1
* all metric are the same before/after the PR, including pass rate, unique_graphs, graph_breaks, unique_graph_breaks
* before the PR: P790393916
* after the PR: P790368991
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105044
Approved by: https://github.com/jansel
This decomposes masked_scatter into `aten.cumsum` and a single pointwise kernel,
which is similar to what is done in eager. I only do this for CUDA because on CPU
it isn't split into two passes like this, so the decomposition would cause a slowdown there.
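A rough sketch of the decomposition idea (not Inductor's actual code): a cumsum over the flattened mask tells each True position which element of `source` it should take, and a single pointwise `where` finishes the job.
```python
import torch

def masked_scatter_decomp(self, mask, source):
    # cumsum assigns each True position its ordinal among the Trues,
    # i.e. the index of the `source` element it should receive
    idx = torch.cumsum(mask.reshape(-1).to(torch.int64), 0) - 1
    gathered = source.reshape(-1)[idx.clamp(min=0)].reshape(self.shape)
    # single pointwise kernel: pick gathered where mask, else original
    return torch.where(mask, gathered, self)

x = torch.zeros(2, 3)
m = torch.tensor([[True, False, True], [False, True, False]])
src = torch.tensor([1.0, 2.0, 3.0])
print(masked_scatter_decomp(x, m, src))  # matches x.masked_scatter(m, src)
```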
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108803
Approved by: https://github.com/lezcano
ghstack dependencies: #108802
Summary:
Added a flag `is_cpu` that can be specified by the user to
indicate whether the AOTInductor runtime targets CPU. It's
false by default.
Test Plan: ci
Reviewed By: hl475, aakhundov
Differential Revision: D49253826
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109300
Approved by: https://github.com/aakhundov
**Motivation:**
We try to make torch.cond use torch.compile automatically so that we can error out when there are side effects in the branches and correctly handle the closures.
Before this PR, we get a warning (or, with the config raise_on_backend_change turned on, an error) for the following code:
```python
def foo():
    # Inside torch.cond, we'd like to do something like
    torch.compile(foo, backend="eager", fullgraph=True)(...)
    ...

# Users may then call torch.compile somewhere else.
# Dynamo will use the cached code of foo for the "eager" backend,
# but we expect dynamo to recompile with the "inductor" backend.
torch.compile(foo, backend="inductor")(...)
```
This PR adds a BACKEND_MATCH guard. Effectively, it implements a per-backend cache. In the above example, the cached code for "eager" won't work for "inductor" due to guard check failures and the second torch.compile will do a re-compilation. In the future, it might be useful to have something like a configuration guard that guards against dynamo configuration changes across different compiles (e.g. compile a function with fullgraph=False then compile it again with fullgraph=True).
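A sketch of the observable behavior after this PR (semantics as described above, not the guard implementation itself):
```python
import torch

def foo(x):
    return x.sin()

x = torch.randn(4)
torch.compile(foo, backend="eager")(x)     # compiles with the "eager" backend
torch.compile(foo, backend="inductor")(x)  # BACKEND_MATCH guard fails on the
                                           # cached entry -> recompilation
```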
**Implementation:**
1. We add a guarded_backend_cache and check the most_recent_backend against the backend associated with the cached code. We also remove the raise_on_backend_change flag.
Note: more lines are printed in the debug log due to the newly added context manager and guard additions.
**Test Plan:**
Removed original tests that raise on different backend and add a new test to test whether the BACKEND_MATCH guard can guard against backend change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107337
Approved by: https://github.com/jansel
Fixes#98431
This change addresses a hardware assertion that was triggered on ROCm only tests.
Description of the problem:
Assertion triggered
```
Device-side assertion `target_val >= zero && target_val <= one' failed.
```
The issue in question is due to a GPU-side assertion in `binary_cross_entropy_out_cuda` where a `target_val` gets passed to the kernel that does not fall between 0 and 1. The value that triggers the assertion is -0.000000000810. This negative value originates from one of the tensors generated for the test. In this tensor, one of the values (on ROCm) is 0.000000999190, which adheres to the restriction that it is between 0 and 1. However, this value is eventually passed as a single-entry tensor to gradcheck.py::_compute_numerical_gradient
(https://github.com/pytorch/pytorch/blob/main/torch/autograd/gradcheck.py#L347)
This function perturbs the tensor value in-place by subtracting `v` from it and then adding it back. The value of `v` comes from the default `eps` value defined here: https://github.com/pytorch/pytorch/blob/main/torch/autograd/gradcheck.py#L2119
Currently pegged at `1e-6`. So what occurs is that when an input is less than the default eps (like 0.000000999190), the perturbation calculation causes an entry in the tensor to flip to negative, i.e. 0.000000999190 - 1e-6 = -0.000000000810 (due to the subtraction here: https://github.com/pytorch/pytorch/blob/main/torch/autograd/gradcheck.py#L364), which then triggers the device-side assertion in `binary_cross_entropy_out_cuda`.
This PR loosens the eps by an order of magnitude to get around the error. Since this issue has not been caught in the field in any meaningful way, I find this to be an adequate solution, though I am happy to hear opposing viewpoints.
Important to mention: while this error was only occurring on ROCm platforms, the issue described is also present in CUDA-based environments. The difference is that CUDA doesn't seem to generate a tensor with any values less than `1e-6`. When injecting the small value on an Nvidia box, the same device-side assertion was triggered.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109038
Approved by: https://github.com/jeffdaily, https://github.com/albanD
Richard, I'm curious to see what you think of this. I'm trying to use optest on the torchvision test suite, and after hacking up pytest support in https://github.com/pytorch/pytorch/pull/108929 I noticed that this was 5x'ing the test time... for no good reason.
* torchvision nms tests before optests: 60 passed, 4 skipped, 1206 deselected in 11.47s
* after optests: 300 passed, 20 skipped, 1206 deselected in 49.85s
There's no good reason for it: torchvision has parametrized the tests to get a spread of various random values, but for checking schemas or fake tensors, we don't actually need to test different values.
This PR hacks up the codegen to replace pytest parametrize markers so that, instead of sampling many values, we sample only one value if you mark it with `opcheck_only_one`. There's a carveout for device parametrization, where we always run all those variants.
With this PR:
* reduced optests: 88 passed, 4 skipped, 1206 deselected in 13.89s
Companion torchvision PR which uses this at https://github.com/pytorch/vision/pull/7961
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108936
Approved by: https://github.com/zou3519
Summary:
We skip pre_grad_passes if graph comes from export (aten IR) since
pre_grad_passes (i.e. remove_identity) would not preserve meta["val"] in aten IR.
Differential Revision: D49246374
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109246
Approved by: https://github.com/aakhundov
Summary:
Integer adaptive_avg_pool2d is not well defined, due to the different possible ways of rounding an fp32 value to an integer value, and
this op isn't too critical for numerics (since it appears infrequently), so we'll skip it for now.
We might need to revert the changes that added an integer impl for the adaptive_avg_pool op as well.
Test Plan:
python test/test_quantization.py TestQuantizePT2ERepresentation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108924
Approved by: https://github.com/kimishpatel
Summary: This PR adds dynamic-shape support for AOTInductor
* On the runtime/interface side, we added two structs, StaticDimInfo
and DynamicDimInfo, to hold values for static and dynamic dimensions,
respectively. Dynamic dimensions are tracked by an unordered map field
defined in AOTInductorModelBase. At inference time, the inference run
method will assign the current real dimensional value to each dynamic
dimension before executing any kernel.
* On the CUDA wrapper codegen side, we generate dynamic symbols
appropriately for shape computations. We simulate kernel launch grids
in the C++ land by re-using the grid functions from the Python world.
The returned grid configs, which may contain symbolic expressions,
are printed out in their C++ forms via the CppPrinter. Note that
when dynamic shapes are involved, we have to compute grid configs
for each kernel at runtime in the same way as we do for launching
the corresponding Triton kernel. Otherwise, we may end up with
memory-access failures or mis-computations caused by invalid indices
for fetching or storing data in device memory.
Differential Revision: D49100472
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109012
Approved by: https://github.com/khabinov, https://github.com/desertfire, https://github.com/hl475
Or any other distro that has different purelib and platlib paths. The regression was introduced when the small-wheel base dependency was migrated from CUDA-11 to CUDA-12.
Not sure why, but the minor version of the library is no longer shipped with the following CUDA-12 packages:
- nvidia_cuda_nvrtc_cu12-12.1.105
- nvidia-cuda-cupti-cu12-12.1.105
But those were present in the CUDA-11 release, e.g.:
``` shell
bash-5.2# curl -OL 922c5996aa/nvidia_cuda_nvrtc_cu11-11.7.99-2-py3-none-manylinux1_x86_64.whl; unzip -t nvidia_cuda_nvrtc_cu11-11.7.99-2-py3-none-manylinux1_x86_64.whl |grep \.so
testing: nvidia/cuda_nvrtc/lib/libnvrtc-builtins.so.11.7 OK
testing: nvidia/cuda_nvrtc/lib/libnvrtc.so.11.2 OK
bash-5.2# curl -OL c64c03f49d/nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl; unzip -t nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl|grep \.so
testing: nvidia/cuda_nvrtc/lib/libnvrtc-builtins.so.12.1 OK
testing: nvidia/cuda_nvrtc/lib/libnvrtc.so.12 OK
```
Fixes https://github.com/pytorch/pytorch/issues/109221
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109244
Approved by: https://github.com/huydhn
The strategy in this PR is pretty straightforward.
There are 2 kinds of hooks:
1) Hooks on objects with sources (inputs, params)
2) Hooks on objects w/o sources (intermediaries, and outputs).
Note: As outputs can be made simple by how dynamo handles residuals, they could actually be handled as if they were inputs, but, for the sake of this PR, we will refer to hooks as either hooks on inputs (sourced), or hooks on intermediaries (not sourced).
The plan:
**For tensors w/ a source:**
We record registered hooks, store them as a global, and associate them with the tensor in residuals. This means that when dynamo goes to create the frame, where we produce bytecode to stitch together our PT2 modified bytecode with the original eager code, we call `register_hook`. This registration of hooks in residuals is sound because (a) it happens right after a Pt2 frame region ends and (b) we know that the tensor is alive in f_locals, f_globals, or a module in the users invoking frame. This means we can soundly know it will be around to invoke `register_hook` on. As long as we guard on the identity of the lifted function, this is sound to do.
**For tensors w/o a source:**
Graph break - we will support this in a subsequent PR
**Handles:**
An interesting new component here is the creation of a `STORE_FAST` -> `LOAD_FAST` pair associated with the handle, the return result of `register_hook`. If the user code stored the result of `register_hook` in a handle, we need to honor that. We do so by interceding in `STORE_FAST` and recording the name of the local variable as directed by user code. We then honor that same name in the reconstructed bytecode. If the user did not store the handle, we merely pop the produced value to preserve the stack.
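The user pattern being honored, for illustration (the handle assignment is the `STORE_FAST` dynamo must reproduce in the reconstructed bytecode):
```python
import torch

def f(x):
    handle = x.register_hook(lambda g: 2 * g)  # STORE_FAST 'handle'
    y = x * x
    handle.remove()                            # LOAD_FAST 'handle'
    return y

f(torch.randn(3, requires_grad=True))
```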
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108903
Approved by: https://github.com/ezyang
ghstack dependencies: #108846, #109092
Summary: Let's just enable mypy checking for files where it already passes. I checked all entries in the exclude list and enabled any that individually pass. This also needed one trivial change to a file that was already enabled.
Test Plan: `lintrunner torch/_inductor/*.py torch/_inductor/*/*.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109238
Approved by: https://github.com/eellison
Summary: Step 1 in revamping subprocess autotune to support multiple GPUs. This diff just does some refactoring to autotune_process.py in order to prepare for the next diff:
* Move all logic for managing the sub-process (like detecting sub-process crashes) into the TuningProcess class.
* Use log.debug statements instead of print statements
Test Plan: python test/inductor/test_max_autotune.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109126
Approved by: https://github.com/shunting314, https://github.com/eellison
Summary: This broke as a result of the flash-v2 PR. The tests couldn't be listed except on an A100 machine, which is weird.
Test Plan: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:fused_attention
Differential Revision: D49239716
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109220
Approved by: https://github.com/eellison
## Exception in cond:
For code below:
```python
import torch
import functorch.experimental.control_flow as control_flow
def true_fn(x):
    return x.sin()

def false_fn(x):
    return x, x

def f(x, y):
    return control_flow.cond(y, true_fn, false_fn, [x])

f(torch.ones(3, 4), torch.tensor(False))
```
The original exception stack trace is:
```python
Traceback (most recent call last):
File "/home/yidi/local/pytorch/test_exc.py", line 33, in <module>
f(torch.ones(3, 4), torch.tensor(False))
File "/home/yidi/local/pytorch/test_exc.py", line 31, in f
return control_flow.cond(y, true_fn, false_fn, [x])
File "/home/yidi/local/pytorch/torch/_higher_order_ops/cond.py", line 154, in cond
return torch.compile(cond_op, backend="eager", fullgraph=True)(
File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 365, in _fn
return fn(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 513, in catch_errors
return callback(frame, cache_entry, hooks, frame_state)
File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 140, in _fn
return fn(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 380, in _convert_frame_assert
return _compile(
File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 560, in _compile
guarded_code = compile_inner(code, one_graph, hooks, transform)
File "/home/yidi/local/pytorch/torch/_dynamo/utils.py", line 197, in time_wrapper
r = func(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 482, in compile_inner
out_code = transform_code_object(code, transform)
File "/home/yidi/local/pytorch/torch/_dynamo/bytecode_transformation.py", line 1028, in transform_code_object
transformations(instructions, code_options)
File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 449, in transform
tracer.run()
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2083, in run
super().run()
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 733, in run
and self.step()
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 696, in step
getattr(self, inst.opname)(inst)
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 397, in wrapper
return inner_fn(self, inst)
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1164, in CALL_FUNCTION_EX
self.call_function(fn, argsvars.items, kwargsvars.items)
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 570, in call_function
self.push(fn.call_function(self, args, kwargs))
File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 418, in call_function
(false_r, false_graph, false_lifted_freevars) = speculate_branch(False)
File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 410, in speculate_branch
raise UncapturedHigherOrderOpError(
torch._dynamo.exc.UncapturedHigherOrderOpError: Expected branch to return a single tensor
from user code:
File "/home/yidi/local/pytorch/torch/_dynamo/external_utils.py", line 17, in inner
return fn(*args, **kwargs)
Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
You can suppress this exception and fall back to eager by setting:
import torch._dynamo
torch._dynamo.config.suppress_errors = True
```
After this PR we get:
```python
Traceback (most recent call last):
File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 50, in graph_break_as_hard_error
return fn(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 429, in call_function
(false_r, false_graph, false_lifted_freevars) = speculate_branch(False)
File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 421, in speculate_branch
unimplemented(
File "/home/yidi/local/pytorch/torch/_dynamo/exc.py", line 187, in unimplemented
raise Unsupported(msg)
torch._dynamo.exc.Unsupported: Expected branch to return a single tensor
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/yidi/local/pytorch/test_exc.py", line 33, in <module>
f(torch.ones(3, 4), torch.tensor(False))
File "/home/yidi/local/pytorch/test_exc.py", line 31, in f
return control_flow.cond(y, true_fn, false_fn, [x])
File "/home/yidi/local/pytorch/torch/_higher_order_ops/cond.py", line 154, in cond
return torch.compile(cond_op, backend="eager", fullgraph=True)(
File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 338, in _fn
return fn(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 500, in catch_errors
return callback(frame, cache_entry, hooks, frame_state)
File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 140, in _fn
return fn(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 382, in _convert_frame_assert
return _compile(
File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 562, in _compile
guarded_code = compile_inner(code, one_graph, hooks, transform)
File "/home/yidi/local/pytorch/torch/_dynamo/utils.py", line 189, in time_wrapper
r = func(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 484, in compile_inner
out_code = transform_code_object(code, transform)
File "/home/yidi/local/pytorch/torch/_dynamo/bytecode_transformation.py", line 1028, in transform_code_object
transformations(instructions, code_options)
File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 451, in transform
tracer.run()
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2088, in run
super().run()
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 728, in run
and self.step()
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 691, in step
getattr(self, inst.opname)(inst)
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 392, in wrapper
return inner_fn(self, inst)
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1159, in CALL_FUNCTION_EX
self.call_function(fn, argsvars.items, kwargsvars.items)
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 565, in call_function
self.push(fn.call_function(self, args, kwargs))
File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 53, in graph_break_as_hard_error
raise UncapturedHigherOrderOpError(reason + msg) from e
torch._dynamo.exc.UncapturedHigherOrderOpError: Cond doesn't work unless it is captured completely with torch.compile. Scroll up to find out what causes the graph break.
from user code:
File "/home/yidi/local/pytorch/torch/_dynamo/external_utils.py", line 17, in inner
return fn(*args, **kwargs)
Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
You can suppress this exception and fall back to eager by setting:
import torch._dynamo
torch._dynamo.config.suppress_errors = True
```
## Exception during speculating branches
The example code below has an in-place buffer mutation error:
```python
import torch
import functorch.experimental.control_flow as control_flow
class Foo(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer("buffer", torch.ones(6, 4))

    def forward(self, x):
        def true_fn(x):
            self.buffer += 1
            return self.buffer.sum() + x.sum()

        def false_fn(x):
            return (x - 1).sum()

        return control_flow.cond(x.shape[0] > 4, true_fn, false_fn, [x])

mod_for_compile = torch.compile(Foo(), backend="eager", dynamic=True)
mod_for_compile(torch.ones(3, 4))
```
Before this PR the exception looks like:
```python
[2023-09-08 15:20:03,332] [0/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting cond, we were unable to trace function `true_fn` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[2023-09-08 15:20:03,332] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] Can't inplace modify module params/buffers inside HigherOrderOp
Traceback (most recent call last):
File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 163, in speculate_subgraph
output = f.call_function(tx, args, sub_kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 90, in call_function
return tx.inline_user_function_return(
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 606, in inline_user_function_return
result = InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2200, in inline_call
return cls.inline_call_(parent, func, args, kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2316, in inline_call_
tracer.run()
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 733, in run
and self.step()
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 696, in step
getattr(self, inst.opname)(inst)
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1219, in STORE_ATTR
.call_function(self, [obj, ConstantVariable(inst.argval), val], {})
File "/home/yidi/local/pytorch/torch/_dynamo/variables/builtin.py", line 618, in call_function
result = handler(tx, *args, **kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/variables/builtin.py", line 1169, in call_setattr
raise AttributeMutationError(
torch._dynamo.exc.AttributeMutationError: Can't inplace modify module params/buffers inside HigherOrderOp
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 394, in speculate_branch
ret_val, ret_graph, ret_lifted_freevars = speculate_subgraph(
File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 222, in speculate_subgraph
raise Unsupported(
torch._dynamo.exc.Unsupported: speculate_subgraph: while introspecting cond, we were unable to trace function `true_fn` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. Scroll up for the stack trace of the initial exception. The reason was: Can't inplace modify module params/buffers inside HigherOrderOp
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/yidi/local/pytorch/test_exc.py", line 20, in <module>
mod_for_compile(torch.ones(3, 4))
File "/home/yidi/local/pytorch/torch/nn/modules/module.py", line 1519, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/nn/modules/module.py", line 1528, in _call_impl
return forward_call(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 365, in _fn
return fn(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/nn/modules/module.py", line 1519, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/nn/modules/module.py", line 1528, in _call_impl
return forward_call(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 513, in catch_errors
return callback(frame, cache_entry, hooks, frame_state)
File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 632, in _convert_frame
result = inner_convert(frame, cache_entry, hooks, frame_state)
File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 140, in _fn
return fn(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 380, in _convert_frame_assert
return _compile(
File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 560, in _compile
guarded_code = compile_inner(code, one_graph, hooks, transform)
File "/home/yidi/local/pytorch/torch/_dynamo/utils.py", line 197, in time_wrapper
r = func(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 482, in compile_inner
out_code = transform_code_object(code, transform)
File "/home/yidi/local/pytorch/torch/_dynamo/bytecode_transformation.py", line 1028, in transform_code_object
transformations(instructions, code_options)
File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 449, in transform
tracer.run()
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2083, in run
super().run()
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 733, in run
and self.step()
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 696, in step
getattr(self, inst.opname)(inst)
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 397, in wrapper
return inner_fn(self, inst)
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1124, in CALL_FUNCTION
self.call_function(fn, args, {})
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 570, in call_function
self.push(fn.call_function(self, args, kwargs))
File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 261, in call_function
return super().call_function(tx, args, kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 90, in call_function
return tx.inline_user_function_return(
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 606, in inline_user_function_return
result = InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2200, in inline_call
return cls.inline_call_(parent, func, args, kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2316, in inline_call_
tracer.run()
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 733, in run
and self.step()
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 696, in step
getattr(self, inst.opname)(inst)
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 397, in wrapper
return inner_fn(self, inst)
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1124, in CALL_FUNCTION
self.call_function(fn, args, {})
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 570, in call_function
self.push(fn.call_function(self, args, kwargs))
File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 415, in call_function
(true_r, true_graph, true_lifted_freevars) = speculate_branch(True)
File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 405, in speculate_branch
raise UncapturedHigherOrderOpError(
torch._dynamo.exc.UncapturedHigherOrderOpError: Cond doesn't work unless it is captured completely with torch.compile
from user code:
File "/home/yidi/local/pytorch/test_exc.py", line 16, in forward
return control_flow.cond(x.shape[0] > 4, true_fn, false_fn, [x])
File "/home/yidi/local/pytorch/torch/_higher_order_ops/cond.py", line 127, in cond
return cond_op(pred, true_fn, false_fn, operands)
Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
You can suppress this exception and fall back to eager by setting:
import torch._dynamo
torch._dynamo.config.suppress_errors = True
```
After this PR, the only difference is that the error message of `UncapturedHigherOrderOpError` changes from `Cond doesn't work unless it is captured completely with torch.compile` to `Cond doesn't work unless it is captured completely with torch.compile. Scroll up to find out what causes the graph break`.
```python
[2023-09-08 15:17:02,052] [0/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting cond, we were unable to trace function `true_fn` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[2023-09-08 15:17:02,052] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] Can't inplace modify module params/buffers inside HigherOrderOp
Traceback (most recent call last):
File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 177, in speculate_subgraph
output = f.call_function(tx, args, sub_kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 90, in call_function
return tx.inline_user_function_return(
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 601, in inline_user_function_return
result = InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2193, in inline_call
return cls.inline_call_(parent, func, args, kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2300, in inline_call_
tracer.run()
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 728, in run
and self.step()
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 691, in step
getattr(self, inst.opname)(inst)
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1214, in STORE_ATTR
.call_function(self, [obj, ConstantVariable(inst.argval), val], {})
File "/home/yidi/local/pytorch/torch/_dynamo/variables/builtin.py", line 618, in call_function
result = handler(tx, *args, **kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/variables/builtin.py", line 1169, in call_setattr
raise AttributeMutationError(
torch._dynamo.exc.AttributeMutationError: Can't inplace modify module params/buffers inside HigherOrderOp
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 50, in graph_break_as_hard_error
return fn(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 426, in call_function
(true_r, true_graph, true_lifted_freevars) = speculate_branch(True)
File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 410, in speculate_branch
ret_val, ret_graph, ret_lifted_freevars = speculate_subgraph(
File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 236, in speculate_subgraph
raise Unsupported(
torch._dynamo.exc.Unsupported: speculate_subgraph: while introspecting cond, we were unable to trace function `true_fn` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. Scroll up for the stack trace of the initial exception. The reason was: Can't inplace modify module params/buffers inside HigherOrderOp
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/yidi/local/pytorch/test_exc.py", line 20, in <module>
mod_for_compile(torch.ones(3, 4))
File "/home/yidi/local/pytorch/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 338, in _fn
return fn(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 500, in catch_errors
return callback(frame, cache_entry, hooks, frame_state)
File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 634, in _convert_frame
result = inner_convert(frame, cache_entry, hooks, frame_state)
File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 140, in _fn
return fn(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 382, in _convert_frame_assert
return _compile(
File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 562, in _compile
guarded_code = compile_inner(code, one_graph, hooks, transform)
File "/home/yidi/local/pytorch/torch/_dynamo/utils.py", line 189, in time_wrapper
r = func(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 484, in compile_inner
out_code = transform_code_object(code, transform)
File "/home/yidi/local/pytorch/torch/_dynamo/bytecode_transformation.py", line 1028, in transform_code_object
transformations(instructions, code_options)
File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 451, in transform
tracer.run()
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2088, in run
super().run()
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 728, in run
and self.step()
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 691, in step
getattr(self, inst.opname)(inst)
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 392, in wrapper
return inner_fn(self, inst)
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1119, in CALL_FUNCTION
self.call_function(fn, args, {})
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 565, in call_function
self.push(fn.call_function(self, args, kwargs))
File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 261, in call_function
return super().call_function(tx, args, kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 90, in call_function
return tx.inline_user_function_return(
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 601, in inline_user_function_return
result = InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2193, in inline_call
return cls.inline_call_(parent, func, args, kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2300, in inline_call_
tracer.run()
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 728, in run
and self.step()
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 691, in step
getattr(self, inst.opname)(inst)
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 392, in wrapper
return inner_fn(self, inst)
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1119, in CALL_FUNCTION
self.call_function(fn, args, {})
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 565, in call_function
self.push(fn.call_function(self, args, kwargs))
File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 53, in graph_break_as_hard_error
raise UncapturedHigherOrderOpError(reason + msg) from e
torch._dynamo.exc.UncapturedHigherOrderOpError: Cond doesn't work unless it is captured completely with torch.compile. Scroll up to find out what causes the graph break.
from user code:
File "/home/yidi/local/pytorch/test_exc.py", line 16, in forward
return control_flow.cond(x.shape[0] > 4, true_fn, false_fn, [x])
File "/home/yidi/local/pytorch/torch/_higher_order_ops/cond.py", line 127, in cond
return cond_op(pred, true_fn, false_fn, operands)
Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
You can suppress this exception and fall back to eager by setting:
import torch._dynamo
torch._dynamo.config.suppress_errors = True
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108817
Approved by: https://github.com/zou3519
Summary:
There's currently a bug in `CUDACachingAllocator` which makes it impossible to determine whether a `malloc`ed sample has been deallocated (introduced in D48229150).
It happens because we currently instrument the `malloc` SDT **before** a block of memory has been allocated by either `cudaMalloc` or the local caching allocator's `malloc` call. Since this is a static tracepoint, it receives arg values at the point of instrumentation. Currently, it receives the memory pointer, `void* p`, which is NULL.
Changes in this diff:
1) Move this SDT to right before the `allocate` function returns, so that memory has already been allocated and the `p` pointer points to a valid, non-NULL address.
2) Enable tracing of `cudaMalloc` calls, in addition to `NativeCachingAllocator::malloc`.
3) Rename a poorly-named local var: `r` --> `devPtr` (pointer to the allocated memory block).
Test Plan:
Tested with a local PyTorch script that leaks memory. Verified the following:
* prior to this fix (prod), malloc samples are **not** marked as "freed"
* with the fix (branch), samples **are** marked as "freed"
* results are comparable with the current uprobe implementation to sample PyTorch malloc events in `gpusnoop`
Reviewed By: chaekit
Differential Revision: D48873734
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108907
Approved by: https://github.com/chaekit
For this program:
```python
def func(a, *, tag, ranks, group_size):
    ar = torch.ops.c10d_functional.all_reduce(a, "sum", tag, ranks, group_size)
    ar = torch.ops.c10d_functional.wait_tensor(ar)
    c = torch.relu(a)
    # c = a
    d = torch.matmul(c, c)
    e = d + ar
    return (e,)
```
the generated code is:
```python
def call(args):
    arg0_1, = args
    args.clear()
    assert_size_stride(arg0_1, (4, 4), (4, 1))
    with torch.cuda._DeviceGuard(1):
        torch.cuda.set_device(1) # no-op to ensure context
        buf0 = empty_strided((4, 4), (4, 1), device='cuda', dtype=torch.float32)
        buf0.copy_(arg0_1) #no reuse
        buf1_pg = c10d._find_or_create_pg_by_ranks_and_tag('', [0, 1], 2)
        buf1 = buf0
        buf1_work = dist.all_reduce(buf1, async_op=True, group=buf1_pg, op=fun_col_impl._str_to_reduce_op('sum'))
        fun_col_impl._register_tensor_work(buf1, buf1_work)
        del buf1
        buf0 = _wait_tensor(buf0)
        buf2 = buf0
        buf3 = buf0; del buf0  # reuse
        # Source Nodes: [relu], Original ATen: [aten.relu]
        stream1 = get_cuda_stream(1)
        triton_poi_fused_relu_0.run(arg0_1, buf3, 16, grid=grid(16), stream=stream1)
        del arg0_1
        buf4 = empty_strided((4, 4), (4, 1), device='cuda', dtype=torch.float32)
        # Source Nodes: [add, relu], Original ATen: [aten.add, aten.relu]
        extern_kernels.addmm(buf2, buf3, buf3, alpha=1, beta=1, out=buf4)
    return (buf4, )
```
We can see that the allreduce input (`buf1`, which is an alias of `buf0`) is incorrectly reused as the input (`buf3`) of the in-place triton kernel `triton_poi_fused_relu_0`, diverging from eager-mode logic.
In general, we should make it so that Inductor doesn't try to reuse the input buffer to an inplace functional collective.
We have a similar problem for output buffer of out-of-place functional collectives, see https://github.com/pytorch/pytorch/issues/108780#issuecomment-1714921994.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108811
Approved by: https://github.com/Chillee, https://github.com/wconstab
The execution order check seems to have been causing more problems than it prevents. Motivated by an internal issue, we move this check behind `DISTRIBUTED_DEBUG_LEVEL=DETAIL`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109049
Approved by: https://github.com/fegin
Surrounds the `stream->synchronize()` call with `dispatch_sync(stream->queue(), ^{});`, which is a no-op for a single-threaded program but serializes calls to synchronize across threads using the same stream.
Prevents the `[IOGPUMetalCommandBuffer validate]:215: failed assertion 'commit an already committed command buffer'` non-recoverable exception, which is triggered every time one uses PyCharm to inspect tensors on the MPS device.
Fixes https://github.com/pytorch/pytorch/issues/100285
### <samp>🤖 Generated by Copilot at 1662ce2</samp>
> _Sing, O Muse, of the swift and skillful coders_
> _Who fixed the dreadful deadlock of the stream_
> _That crashed the mighty tensors of the MPS_
> _When they sought out the nonzero elements._
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108996
Approved by: https://github.com/kulinseth
The order of lines of code can change, so a specific line number should not be used when creating a link. A specific line is also not needed here, since the overall documentation refers to the function by name.
Previously, a fix was made by updating the line number for the issue mentioned in this PR, but the line eventually moved, resulting in a broken link.
Fixes#102183
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108957
Approved by: https://github.com/ezyang
Summary:
To get a Node's call stack we currently loop on the InlinedCallStack graph and follow the "callee" chain. Since the node's inlined stack does not change, we can optimize this by expanding the node's inlined stack once and reusing it. This is particularly useful when reading the node's stack from another process (e.g. BPF), as it simplifies the memory traversal process.
The new data structure (NodeSourceInfo) only holds pointers to the function name and file name variables, and assumes these objects will be alive throughout the lifetime of the process.
Each Node has an extended attribute that has an index to a vector of stack frames `expanded_node_stacks_`
`node_stack_attr_symbol_` is only needed to make accessing the stack vector index attribute easier from BPF.
Test Plan:
- Performance Impact: The cost of expanding the call stack is between 500 - 1000 ns and happens only per instruction node at initialization time.
- Verified using BPF Program in subsequent diffs
Reviewed By: zdevito
Differential Revision: D46578700
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108426
Approved by: https://github.com/zdevito
Summary:
Original commit changeset: e11cddf1fecc
Original Phabricator Diff: D49064185
Test Plan:
Comparing PT1 and PT2 performance on the IG Feed Model with this diff backed out: N4274204
Comparing the PT1 and PT2 performance on IG Feed with this diff committed: N4271093
Reviewed By: zou3519
Differential Revision: D49230047
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109199
Approved by: https://github.com/zou3519, https://github.com/xw285cornell
# Summary
## PR Dependencies
I don't use ghstack :( this is a PR where it would have been helpful. That being said, I am going to peel off some PRs to make reviewing this easier:
- [x] Separate build flags for Flash and MemEff: #107985
### Description
This pull request updates the version of _scaled_dot_product_flash_attention from version 1 to version 2. The changes are based on the flash attention code originally authored by @tridao
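For orientation, a small usage sketch of the op this updates; the shapes and flags below are illustrative, and the `sdp_kernel` context manager simply restricts dispatch to the flash backend:
```python
import torch
import torch.nn.functional as F

q = torch.randn(2, 8, 2048, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# Force the flash kernel so the v2 code path is exercised (sm80+ only).
with torch.backends.cuda.sdp_kernel(
    enable_flash=True, enable_math=False, enable_mem_efficient=False
):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```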
### Changes Made
The majority of the changes in this pull request involve:
- Copying over the flash_attention sources.
- Updating header files.
- Removing padding and slicing code from within the flash_attention kernel and relocating it to the composite implicit region of the SDPA. This was needed to make the kernel functional and appease autograd.
- Introducing a simple kernel generator to generate different instantiations of the forward and backward flash templates.
- Adding conditional compilation (ifdef) to prevent building when nvcc is invoked with gencode < sm80.
- Introducing a separate dependent option for mem_eff_attention, as flash_attention v2 lacks support for Windows and cannot be built for sm50 generation codes.
- Modifying build.sh to reduce parallelization on sm86 runners and to lower the maximum parallelization on the manywheel builds. This adjustment was made to address out-of-memory issues during the compilation of FlashAttentionV2 sources.
- Adding/Updating tests.
### Notes for Reviewers
This is not a fun review, and I apologize in advance.
Most of the files-changed are in the flash_attn/ folder. The only files of interest here IMO:
- aten/src/ATen/native/transformers/cuda/flash_attn/flash_api.cpp
- aten/src/ATen/native/transformers/cuda/flash_attn/kernels/generate_kernels.py ( this has been incorporated upstream to flash-attention github)
There are a number of files all related to avoiding OOMs in CI/CD. These are typically shell scripts.
### Follow up items
- Include the updates from e07aa036db and 9e5e8bc91e | https://github.com/pytorch/pytorch/issues/108108
### Work Items
- [x] I don't think Windows will be supported for 3.1.0 - Need to update cmake
- [x] Let multi_query/attention pass through and test | UPDATE: I have the fast path implemented here: https://github.com/pytorch/pytorch/pull/106730 but since this will require changes to semantics of math to call repeat_interleave, I think this should be done as a followup.
- [x] Had to drop cutlass back to 3.0.0 to get it to compile. Need to figure out how to upgrade to 3.1.0 and later. Spoke with Tri and he is going to be taking a look. Note: compiling with clang currently errors for the cute headers.
- [x] Update test exercise above codepath
- [x] Still need to disable on seq_len % 128 != 0 for backward( Tri beat me to it a4f148b6ab)
- [x] Add determinism warning to BWD, Tri got to this one as well: 1c41d2b
- [x] Update dispatcher to universally prefer FlashV2
- [x] Update tests to exercise new head_dims
- [x] Move the head_dim padding from kernel to top level composite implicit function in order to make it purely functional
- [x] Create template generator script
- [x] Initial cmake support for building kernels/ folder
- [x] Replay CudaGraph changes
### Results
#### Forward only
The TFLOPs reported here are on an A100 that is underclocked.

#### Forward+Backward
Ran a sweep and for large compute bound sizes we do see a ~2x performance increase for forw+back.
<img width="1684" alt="Screenshot 2023-07-20 at 3 47 47 PM" src="https://github.com/pytorch/pytorch/assets/32754868/fdd26e07-0077-4878-a417-f3a418b6fb3b">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105602
Approved by: https://github.com/huydhn, https://github.com/cpuhrsch
- Update cross-ref FakeMode test to use ShapeEnv. Dynamic ops can now
return an unbacked SymInt. We always accept this as equal to whatever
the real value was.
- Relax test so it works on all classes, not just unittest.TestCase
- Properly wrap the original method, so things like pytest.mark.parametrize are carried over
- Support dynamic shapes by default for make_fx `tracing_mode="fake"` without symbolifying everything else
Fixes https://github.com/pytorch/pytorch/issues/108927
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108929
Approved by: https://github.com/zou3519
We changed the failures_dict format from .py to json and added a way to
automatically update the failures dict (the user can set
PYTORCH_OPCHECK_ACCEPT=1 to do so), assuming the tests don't crash in the
process.
Some details:
- We introduced a FailuresDict class that handles save/load and from which one
can query a test status ("xfail", "skip", etc).
- PYTORCH_OPCHECK_ACCEPT=1 does not override everything. In particular: it
doesn't try to update the failures dict for a test marked as "skip", but it
will update it for tests marked as "xfail" or "success".
- PYTORCH_OPCHECK_ACCEPT=1 also does not override the "comment" field, unless
it is flipping an "xfail" into "success".
- I'll update the gdoc linked in the comments with how to actually use
PYTORCH_OPCHECK_ACCEPT=1 internally (it's not trivial).
Note that this isn't multithreading-safe, the current recommendation is to run
the tests sequentially if the user wants to use PYTORCH_OPCHECK_ACCEPT=1.
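A minimal sketch of the mechanics described above; the JSON layout and method names here are assumptions, not the exact schema:
```python
import json
import os

class FailuresDict:
    """Loads/saves the JSON failures dict and answers test-status queries."""

    def __init__(self, path, data):
        self.path = path
        # assumed layout: {op_name: {test_name: {"status": ..., "comment": ...}}}
        self.data = data

    @classmethod
    def load(cls, path):
        with open(path) as f:
            return cls(path, json.load(f))

    def test_status(self, op_name, test_name):
        return self.data.get(op_name, {}).get(test_name, {}).get("status", "success")

    def set_status(self, op_name, test_name, status, comment=""):
        entry = self.data.setdefault(op_name, {}).setdefault(
            test_name, {"status": "success", "comment": ""}
        )
        if entry["status"] == "skip":
            return  # accept-mode never overrides an explicit skip
        if entry["status"] == "xfail" and status == "success":
            entry["comment"] = comment  # the only case where the comment is replaced
        entry["status"] = status

    def save(self):
        if os.environ.get("PYTORCH_OPCHECK_ACCEPT") == "1":
            with open(self.path, "w") as f:
                json.dump(self.data, f, indent=2)
```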
Differential Revision: D49167181
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109110
Approved by: https://github.com/ezyang
Fix several issues with `torch._numpy.random` functions on eager
1. actually return scalars when `size is None`
2. fix dispatch with USE_NUMPY_STREAM
3. make tnp.random functions composable: make numpy functions receive numpy arguments, not `tnp.ndarray`s
4. fix random.shuffle for e.g. lists
The main need for these gymnastics is that `np.random` functions return an ndarray or a Python scalar depending on the `size` argument. We decided a while ago to replicate this behavior in `tnp.random` but not elsewhere, where we always return 0D arrays instead. A quick illustration follows.
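A sketch of the intended behavior per item 1 above (not a test from the PR):
```python
import torch._numpy as tnp

s = tnp.random.uniform(0.0, 1.0, size=None)  # python scalar, as in numpy
a = tnp.random.uniform(0.0, 1.0, size=(3,))  # 1D tnp.ndarray

assert isinstance(s, float)
assert a.shape == (3,)
```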
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108944
Approved by: https://github.com/lezcano
We were using make_fx for strategy-based propagation so that we could get
a graph and the shape-related metadata, but this is overkill
for the sharding propagation purpose. This change refactors the strategy
propagation to remove the graph-based propagation and instead just use the
op to index into the strategy functions.
We also use a fake shape prop instead of relying on fx tracing for
the shape/stride propagation.
For a possible future decomposed propagation, we will exercise a different
codepath to enable that.
NOTE that this also greatly reduces the latency of:
1. first-time DTensor operations when populating the cache: the first iter becomes faster again!
2. test_dtensor_ops.py: the whole test now finishes within 2-3 mins again.
A sketch of the new dispatch shape follows.
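Conceptually, the propagation collapses to a plain dispatch table; a minimal sketch with illustrative names:
```python
# Op-indexed registry replacing the make_fx-based propagation.
op_strategy_funcs = {}

def register_op_strategy(op):
    def wrapper(strategy_func):
        op_strategy_funcs[op] = strategy_func
        return strategy_func
    return wrapper

def propagate_sharding(op, op_schema):
    # A direct dict lookup instead of building and walking an fx graph;
    # shape/stride come from a lightweight fake prop, not tracing.
    return op_strategy_funcs[op](op_schema)
```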
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108262
Approved by: https://github.com/fduwjj
ghstack dependencies: #107306, #108261
This PR switches the usage of fx's shape prop TensorMetadata to
DTensor's own dedicated TensorMeta. This is because DTensor
only cares about three fields: shape/stride/dtype; all other fields are
unnecessary and can be inferred from the local_tensor directly. This
significantly simplifies how we deal with tensor metadata, by not
carrying the other fields (a sketch follows).
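The dedicated type reduces to just those three fields; a sketch, with the caveat that the real definition may differ in detail:
```python
from dataclasses import dataclass
from typing import Tuple

import torch

@dataclass(frozen=True)
class TensorMeta:
    shape: torch.Size
    stride: Tuple[int, ...]
    dtype: torch.dtype

def meta_from_local(local_tensor: torch.Tensor) -> TensorMeta:
    # Everything else (device, layout, ...) can be inferred from the
    # local tensor directly, so it is not duplicated here.
    return TensorMeta(local_tensor.shape, local_tensor.stride(), local_tensor.dtype)
```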
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108261
Approved by: https://github.com/fduwjj
ghstack dependencies: #107306
The function schema doesn't provide us anything, as we can also get the schema from `op._schema`. Including the op directly in `OpSchema` makes it easier for sharding prop to do fake execution, and in principle it should also make the hash comparison faster: we don't need to hash the function schema, we just hash `id(op)`, which is constant (see the sketch below).
This PR is just a refactor to include the op in `OpSchema` instead of the function schema; no other logic changes.
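The hashing argument in concrete terms, as a sketch rather than the actual `OpSchema` code:
```python
class OpSchema:
    def __init__(self, op, args_schema, kwargs_schema):
        self.op = op  # the op itself; its schema is still reachable via op._schema
        self.args_schema = args_schema      # assumed hashable (e.g. a tuple)
        self.kwargs_schema = kwargs_schema  # assumed hashable

    def __hash__(self):
        # id(op) is constant for the lifetime of the process, so hashing it
        # is cheaper than hashing the parsed function schema.
        return hash((id(self.op), self.args_schema, self.kwargs_schema))
```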
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107306
Approved by: https://github.com/fduwjj
Device should always be "cpu" for cpu tensor types.
This will fix fb buck test failure when running in internal.
```
buck2 test '@fbcode//mode/dev-nosan' fbcode//caffe2/test/distributed/checkpoint:file_system_checkpoint_cpu -- --exact 'caffe2/test/distributed/checkpoint:file_system_checkpoint_cpu - test_switch_between_sharded_tensor_to_tensor_thread_count_1 (test_file_system_checkpoint_cpu.TestDistributedReshardOnLoad)'
```
This will unblock [D48667323](https://www.internalfb.com/diff/D48667323).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109141
Approved by: https://github.com/fegin
Prior to this PR, if None was returned from intermediate nodes, it would crash the export, because None is not expected to be passed into `_fill_tensor_shape_type`, which raises a beartype error. The function fills in shape and type on a TorchScriptTensor according to its info from the FX graph.
This was discovered after https://github.com/microsoft/onnxscript/pull/1043 was supported. The op specifically generates None in one of its inputs, but the only output from it that is consumed is the first one (not None).
Reference test from a TorchBench model:
```python
def test_nanogpt(self):
    import sys

    sys.path.append("/home/titaiwang")
    from nanoGPT.model import GPT, GPTConfig

    # Load the model
    kwargs = {
        "block_size": 256,
        "vocab_size": 8096,  # GPT-2 vocab_size of 50257, padded up to nearest multiple of 64 for efficiency
        "n_layer": 2,
        "n_head": 2,
        "n_embd": 128,
        "dropout": 0.0,
        "bias": False,  # True: bias in Linears and LayerNorms, like GPT-2. False: a bit better and faster
    }
    config = GPTConfig(**kwargs)
    with torch.backends.cuda.sdp_kernel(
        enable_flash=True, enable_mem_efficient=True
    ):
        model = GPT(config)
    print("Done loading model")
    inputs = torch.arange(128).view(2, 64)
    targets = torch.arange(128).view(2, 64)
    self.run_test_with_fx_to_onnx_exporter_and_onnx_runtime(
        model,
        (inputs,),
        input_kwargs={
            "targets": targets,
        },
        verbose=True,
    )
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108708
Approved by: https://github.com/justinchuby, https://github.com/thiagocrepaldi
In this PR:
- {in,}equality between singleton and plain ints returns false instead of erroring
- Morally define the semantic of j0 > c to be as if j0 represented an array [s_0, s_1, ... s_n] and s_k > c for all k
- Just like for equality, we don't actually want to do the comparison one by one, instead j0 is constrained to some range [min, max]. By default this range is [2, int64_t::max] so that it acts like a size and passes 0/1 specialization checks.
- In the future, we can define some API to allow users to constrain the range of their singletons
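A small sketch of these range semantics; this is illustrative only, as the real singleton int lives in the SymInt machinery:
```python
class SingletonInt:
    def __init__(self, lo=2, hi=2**63 - 1):
        # The default range [2, int64 max] makes j0 behave like a size and
        # pass 0/1 specialization checks.
        self.lo, self.hi = lo, hi

    def __eq__(self, other):
        # Equality with a plain int is False rather than an error.
        return self is other

    def __gt__(self, c):
        # j0 > c holds only if every s_k in [lo, hi] exceeds c.
        if self.lo > c:
            return True
        if self.hi <= c:
            return False
        raise ValueError("data-dependent comparison for this range")

j0 = SingletonInt()
assert (j0 == 5) is False
assert (j0 > 1) is True
```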
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108315
Approved by: https://github.com/ezyang
Summary:
Implemented `aten::normal_` shader and used it to create `aten::randn_like`.
Op definitions:
https://pytorch.org/docs/stable/generated/torch.randn_like.html
https://pytorch.org/docs/stable/generated/torch.Tensor.normal_.html
Test Plan:
```
[ttingchulin@53491.od /data/sandcastle/boxes/fbsource (randn)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*<test>*" eg. -- --gtest_filter="*randn_like*"
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from VulkanAPITest
[ RUN ] VulkanAPITest.randn_like
[ OK ] VulkanAPITest.randn_like (230 ms)
[ RUN ] VulkanAPITest.randn_like_large
[ OK ] VulkanAPITest.randn_like_large (570 ms)
[----------] 2 tests from VulkanAPITest (801 ms total)
[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (801 ms total)
[ PASSED ] 2 tests.
[ttingchulin@53491.od /data/sandcastle/boxes/fbsource (randn)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*<test>*" eg. -- --gtest_filter="*normal_*"
[==========] Running 3 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 3 tests from VulkanAPITest
[ RUN ] VulkanAPITest.normal_
[ OK ] VulkanAPITest.normal_ (222 ms)
[ RUN ] VulkanAPITest.normal_large
[ OK ] VulkanAPITest.normal_large (136 ms)
[ RUN ] VulkanAPITest.normal_error
[ OK ] VulkanAPITest.normal_error (37 ms)
[----------] 3 tests from VulkanAPITest (396 ms total)
[----------] Global test environment tear-down
[==========] 3 tests from 1 test suite ran. (396 ms total)
[ PASSED ] 3 tests.
```
Reviewed By: yipjustin
Differential Revision: D48814024
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109075
Approved by: https://github.com/yipjustin
This PR introduces record and replay functionality for `ShapeEnv` instances. In short,
throughout the execution of a program, we record events (e.g. function calls that modify
its state) so that, in the future, we are able to reproduce any intermediary state of the
instance.
In summary, this PR introduces the following changes (they mostly belong to
_symbolic_shapes.py_ unless otherwise stated):
- Create `ShapeEnvEvent` class for recording function calls + arguments
- Create `record_shapeenv_event` decorator and decorate every function that changes the
state of a `ShapeEnv`: it creates an appropriate event and add it to the available
ShapeEnv instance (sometimes it has to extract from `SymTypes`).
- Create `SymNode.with_shape_env` convenient function for replacing `ShapeEnv` references
- Wraps `ShapeEnv` initialization method: so that we also save the exact way a `ShapeEnv`
was constructed, i.e. arguments
- Introduces a way to compare two `ShapeEnv` instances, defining a concept of state for
that class. In short, the state of `ShapeEnv` is every variable that may change the
execution flow
- Create `check_shape_env_recorded_events` dynamo configuration for enabling the check for
equality the state of `ShapeEnv` with another one that was constructed by replaying all
the recorded events. This check takes place inside `produce_guards`
- Create `replay_shape_env_events` function for replaying given events. It assumes the
first event is `ShapeEnv` initialization function
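Stripped to its essentials, the mechanism is a decorator that logs each mutating call plus a loop that re-applies the log; a self-contained sketch with simplified names:
```python
import functools

class ShapeEnvEvent:
    def __init__(self, fn, args, kwargs):
        self.fn, self.args, self.kwargs = fn, args, kwargs

def record_shapeenv_event(fn):
    @functools.wraps(fn)
    def wrapper(self, *args, **kwargs):
        # Log the mutation so the event list alone can rebuild this state.
        self.events.append(ShapeEnvEvent(fn, args, kwargs))
        return fn(self, *args, **kwargs)
    return wrapper

class ShapeEnv:
    def __init__(self):
        self.events = []
        self.guards = []

    @record_shapeenv_event
    def add_guard(self, expr):
        self.guards.append(expr)

def replay_shape_env_events(events):
    # Construction is the implicit first event in this simplified version.
    replayed = ShapeEnv()
    for e in events:
        e.fn(replayed, *e.args, **e.kwargs)  # raw fn: no re-recording
    return replayed

env = ShapeEnv()
env.add_guard("s0 > 1")
assert replay_shape_env_events(env.events).guards == env.guards
```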
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107989
Approved by: https://github.com/ezyang
Without setting `reference_in_float`, cumprod's single sample case
passes (i.e. the compiled f16 result matches the eager mode f16 result;
in fact they are identical because they both call into aten). However,
the grad calculation does not line up.
Turning on `reference_in_float` causes the grad check to pass (i.e. we
are closer to the more accurate f64 grad calculation) but causes the
single sample case to fail. Since the compiled f16 case is no less
accurate than the eager f16 case for the single sample, relaxing the
tolerances here seems fine.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109128
Approved by: https://github.com/eellison
ghstack dependencies: #109081, #109089
It seems that the compiled f16 op is more accurate than the eager f16
op:
**Compiled float16 vs Eager float64**
Mismatched elements: 25 / 25 (100.0%)
Greatest absolute difference: 3.718038455710615e-05 at index (1, 0) (up to 1e-07 allowed)
Greatest relative difference: 0.0018021699903143316 at index (0, 4) (up to 1e-07 allowed)
**Eager float16 vs Eager float64**
Mismatched elements: 25 / 25 (100.0%)
Greatest absolute difference: 7.280254198286512e-05 at index (3, 3) (up to 1e-07 allowed)
Greatest relative difference: 0.004104326045245938 at index (0, 4) (up to 1e-07 allowed)
**Compiled float16 vs Eager float16**
Mismatched elements: 7 / 25 (28.0%)
Greatest absolute difference: 7.62939453125e-05 at index (3, 3) (up to 1e-05 allowed)
Greatest relative difference: 0.00588226318359375 at index (0, 4) (up to 0.001 allowed)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109081
Approved by: https://github.com/eellison
- Add `if __name__ == "__main__": run_tests()` stanzas to test files in `torch_np` folder so that these tests run on CI
- Skip / xfail things smoked out by this change
- remove a stray python file which should not have been added to tests in the first place.
- fix einsum if opt_einsum is present
- add skips for older numpies
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108762
Approved by: https://github.com/lezcano
Fixes issue mentioned in #77764
e.g. https://github.com/pytorch/pytorch/issues/77764#issuecomment-1654111744
Adds MPS support for the following ops:
- lgamma
- mvlgamma
- digamma
- polygamma
The lgamma function does not yet have an MPS backend implementation. I've added one using a custom Metal kernel (following John D. Cook's C++ implementation of the log-gamma function: https://www.johndcook.com/blog/cpp_gamma/). For the backward pass op, I've added a digamma kernel that follows the cpu+cuda digamma implementation, and for the backward pass of the digamma op, I've added a polygamma + trigamma kernel following, again, the cpu+cuda implementations.
NOTE:
The cpu implementation of the polygamma function incorrectly (as far as I can tell) outputs a finite number for order = 1 and x a negative integer. The MPS implementation correctly outputs infinity. (see https://github.com/pytorch/pytorch/issues/106692)
The polygamma tests currently don't pass because of the error in the cpu+cuda kernels, but also because there are smallish discrepancies near the negative integers between the cpu+cuda and the mps polygamma and trigamma kernels. I'm not sure exactly why this is, but let me know if the discrepancies are too big.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106292
Approved by: https://github.com/kulinseth
This decomposes masked_scatter into `aten.cumsum` and a single pointwise kernel,
which is similar to what is done in eager. I only do this for CUDA because on CPU
the op isn't split into two passes like this, so the decomposition would cause a slowdown.
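A rough functional equivalent of the decomposition for the same-shape case; this is a sketch of the idea, not the exact Inductor decomp:
```python
import torch

def masked_scatter_decomp(self_t, mask, source):
    # cumsum assigns each True position its running index into the
    # flattened source tensor...
    idx = torch.cumsum(mask.reshape(-1).to(torch.int64), dim=0) - 1
    gathered = source.reshape(-1)[idx.clamp(min=0)].reshape(mask.shape)
    # ...and a single pointwise kernel selects old vs. scattered values.
    return torch.where(mask, gathered, self_t)

x = torch.zeros(2, 3)
mask = torch.tensor([[True, False, True], [False, True, False]])
src = torch.arange(1.0, 7.0)
assert torch.equal(masked_scatter_decomp(x, mask, src),
                   x.masked_scatter(mask, src))
```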
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108803
Approved by: https://github.com/lezcano
ghstack dependencies: #108802
Summary:
When we retrace the graph containing constant tensors, they get lifted as buffer inputs.
AotInductor also wants to lift all the constants as inputs.
If we separate out the constants as their own thing, it adds complexity: we would now have to keep track of 3 kinds of inputs (params, buffers, constants).
Cons: people might care about which inputs specifically are or are not buffers?
If people want to know specifically which buffers are constants, we can add an additional field in the graph signature to mark this.
Test Plan: CI
Differential Revision: D49153367
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109040
Approved by: https://github.com/zhxchen17
Summary:
0-dim tensors are not supported in Vulkan.
If binary_op_tensor is called with the `other` argument being a 0-dim tensor, we extract the scalar out and call binary_op_scalar instead (sketched below).
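In Python terms the dispatch looks roughly like this; a sketch of the C++ logic, where `binary_op_scalar` stands in for the real scalar path:
```python
import torch

def binary_op_scalar(op, self_t, other_scalar):
    return op(self_t, other_scalar)

def binary_op_tensor(op, self_t, other):
    # Vulkan kernels cannot consume 0-dim tensors, so a 0-dim `other`
    # is unpacked into its scalar value and routed to the scalar path.
    if isinstance(other, torch.Tensor) and other.dim() == 0:
        return binary_op_scalar(op, self_t, other.item())
    return op(self_t, other)  # normal tensor-tensor kernel

out = binary_op_tensor(torch.add, torch.ones(2, 2), torch.tensor(3.0))
assert torch.equal(out, torch.full((2, 2), 4.0))
```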
Used to run the [FD model](https://www.internalfb.com/manifold/explorer/wrist-camera-ml/tree/models/fd-ted-pi/fd-hybrid/fd_hybrid_vulkan.ptl) on the [CLI](https://www.internalfb.com/intern/wiki/Malibu/Software/Machine_Learning/PyTorch_On_Device_Catalog/#build-and-run-native-pyt).
Test Plan:
```
lfq@lfq-mbp fbsource % buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1
...
[ RUN ] VulkanAPITest.querypool_flushed_shader_log
xplat/caffe2/aten/src/ATen/test/vulkan_api_test.cpp:6891: Skipped
QueryPool is not available
[ SKIPPED ] VulkanAPITest.querypool_flushed_shader_log (0 ms)
[----------] 339 tests from VulkanAPITest (5308 ms total)
[----------] Global test environment tear-down
[==========] 339 tests from 1 test suite ran. (5308 ms total)
[ PASSED ] 338 tests.
[ SKIPPED ] 1 test, listed below:
[ SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
YOU HAVE 5 DISABLED TESTS
```
Differential Revision: D48672535
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109035
Approved by: https://github.com/manuelcandales
Summary:
The basic concept behind this diff is to modify Dynamo's tracing behavior when it encounters a KeyedJaggedTensor that is synced (aka has `_length_per_key` and `_offset_per_key` populated). These fields are lists of integers; ordinarily, Dynamo will optimistically try to specialize on integers, however, for KJTs, we know that these integers will definitely vary from run-to-run. Furthermore, ordinarily, we would also specialize these integers if they are 0/1, but we will frequently expect features in KJTs to be 0/1.
The fix is to detect KJTs and treat these integers as *unbacked integers*. This is NOT a universally sound optimization: when treating these integers as unbacked, we never report them as equal to zero or one. In return, we always generate graphs that generalize no matter the length of values on features. This is enough to trace through APS sparse arch, torchrec_dlrm and some small split-cat examples.
The special integer behavior is triggered by a dynamically scoped `force_unspec_int_unbacked_size_like` variable on TracingContext, which we trigger when we wrap a KJT. There probably are other ways to do this, but this was simple and worked.
Test Plan:
```
buck2 test mode/dev-nosan //pytorch/benchmark/fb/test_gpu:run_test_gpu
```
from aakhundov
1. first build feed_lower_benchmark:
```
buck2 build --show-output mode/opt -c python.package_style=inplace -c fbcode.enable_gpu_sections=true -c fbcode.platform=platform010 -c fbcode.split-dwarf=true hpc/new/models/feed/benchmark:feed_lower_benchmark
```
2. then run the lowering of the model with it:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 TORCH_LOGS="output_code,graph_code" TORCH_COMPILE_DEBUG=1 ../buck-out/v2/gen/fbcode/79c6b019ee0f9469/hpc/new/models/feed/benchmark/__feed_lower_benchmark__/feed_lower_benchmark.par --load=manifold://ig_inference_model/tree/user/facebook/fblearner/predictor/960999465/60/gpu_lowering/input.predictor --skip-trt --skip-ait --sync-mode=0 --enable-aot-inductor --lower-presets="ig_stories" --gpu-trace
```
cf https://docs.google.com/document/d/1yD30xYrdmM8r2HTdmXnZTg0-MHVexfVrAa0294m1AUE/edit?pli=1#heading=h.qiv3fp7e6zg0
From torchrec: https://www.internalfb.com/intern/wiki/Torchrec/Development/Testing_production_models/
From ge0405
baseline (without your diff): f477293168
your diff: f477292363
Differential Revision: D49019987
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108960
Approved by: https://github.com/voznesenskym
Summary: Allow the user to specify the cxx version to use when compiling. For applications that compile with C++20, we wish to also compile this library with C++20 to avoid subtle ODR violations with using different library standards.
Test Plan: Built the project successfully.
Reviewed By: smeenai
Differential Revision: D48636406
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108687
Approved by: https://github.com/davidberard98
# Motivation
Without this PR, if we would like to include a header file like ```#include <ATen/native/ForeachUtils.h>``` in our C++ extension, it raises the error ```/home/xxx/torch/include/ATen/native/ForeachUtils.h:7:10: fatal error: 'ATen/native/utils/ParamsHash.h' file not found```. We should fix it.
# Solution
Add the ATen/native/utils header files to the build.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109013
Approved by: https://github.com/ezyang
This PR:
1. Drop assert for 1D DeviceMesh check to allow DTensor with nD DeviceMesh when creating write_item.
2. Add tests for both placement changes and mesh changes for both 1D and 2D scenarios.
cc. @kumpera @wanchaol @fegin
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106230
Approved by: https://github.com/kumpera
Summary:
NNAPI: Internal test infra can't find test_nnapi.py. Easiest solution is to just skip these tests if test_nnapi.py can't be found
test_ivalue: fails due to qscheme op not implemented for CPU backend. In OSS, it doesn't run because it's not included in test_jit.py.
CPU NNC tests: test_working_byte_cpu_float32 is failing, but hard to repro; we don't use CPU NNC internally, so let's just skip CPU NNC tests internally.
Differential Revision: D48041615
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108937
Approved by: https://github.com/eellison
### Add python binding variables for enabling and disabling
These changes will be used in the pytorch/xla repository for lowering HLO for the AWS Neuron compiler. For correct tensor lowerings the device cache size must be set to zero.
It is advantageous to be able to enable and disable the cache without deleting it. This allows use of the XLA device, and HLO lowering in a single python file, isolating cache disablement to a python context.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107827
Approved by: https://github.com/JackCaoG, https://github.com/wconstab, https://github.com/bdhirsh
Summary:
I plan to use the Box-Muller method for sampling from the normal distribution to implement `aten::randn_like`, which can use the existing uniform functions, so I move them out to a `random.h`.
https://en.wikipedia.org/wiki/Box%E2%80%93Muller_transform
Test Plan:
```
[ttingchulin@95660.od /data/sandcastle/boxes/fbsource (rand_lib)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*<test>*" eg. -- --gtest_filter="*uniform*"
BUILD SUCCEEDED
Running main() from xplat/third-party/gmock/googletest-1.12.1/googletest/src/gtest_main.cc
Note: Google Test filter = *uniform*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from VulkanAPITest
[ RUN ] VulkanAPITest.uniform
[ OK ] VulkanAPITest.uniform (120 ms)
[----------] 1 test from VulkanAPITest (120 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (120 ms total)
[ PASSED ] 1 test.
```
Reviewed By: yipjustin
Differential Revision: D48750679
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108724
Approved by: https://github.com/yipjustin
The following two cases fail due to a small oversight in `CUDAGraph::reset()` that causes failures in the graph destructor
```Python
import torch
x = torch.zeros(4, device="cuda")
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    x = x + 1
g.reset()
del g
```
that fails with:
```
terminate called after throwing an instance of 'c10::Error'
what(): uc >= 0 INTERNAL ASSERT FAILED at ".../pytorch/c10/cuda/CUDACachingAllocator.cpp":2157, please report a bug to PyTorch.
```
and reset and subsequent re-capture
```Python
import torch
x = torch.zeros(4, device="cuda")
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    x = x + 1
g.reset()
with torch.cuda.graph(g):
    x = x + 1
g.replay()
```
which fails with:
```
Traceback (most recent call last):
  File "test_graph.py", line 11, in <module>
    with torch.cuda.graph(g):
  File ".../pytorch/torch/cuda/graphs.py", line 192, in __enter__
    self.cuda_graph.capture_begin(
  File ".../pytorch/torch/cuda/graphs.py", line 77, in capture_begin
    super().capture_begin(pool=pool, capture_error_mode=capture_error_mode)
RuntimeError: This CUDAGraph instance already owns a captured graph. To capture a new graph, create a new instance.
```
This PR fixes the `CUDAGraph::reset()` function for the above two use cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108896
Approved by: https://github.com/ezyang
This adds the `ir.Scan` node (currently only supported on CUDA) which re-uses the existing reduction kernel machinery to support different kinds of non-pointwise ops. Just like reductions it supports prologue and epilogue fusions and has both persistent and non-persistent kernel generation.
Currently this doesn't support the equivalent of `Reduction.create_multilayer` and will instead fall back to eager in those cases. This is because splitting into multiple kernel invocations ends up being far slower than cub's single kernel strategy which matches the performance of a copy kernel.
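Ops like `torch.cumsum` are the consumers here; under `torch.compile` on CUDA they can now lower to a single generated kernel with fused pointwise neighbors. A usage sketch:
```python
import torch

@torch.compile
def fused_scan(x):
    # Pointwise prologue (relu) + scan (cumsum) + pointwise epilogue (* 2)
    # can be emitted as one kernel on CUDA.
    return torch.cumsum(torch.relu(x), dim=-1) * 2

y = fused_scan(torch.randn(4, 1024, device="cuda"))
```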
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106581
Approved by: https://github.com/lezcano, https://github.com/atalman
Requested by @tugsbayasgalan: we want dynamo to preserve some FX node metadata when we trace `GraphModule`s (`nn_module_stack`, `source_fn`, `stack_trace`). This is helpful for the case when we export an aten-level `GraphModule`, add some (possibly non-torch or non-aten) ops, and we want to transform the graph back into an aten-level graph. Without preserving metadata, future passes that look at metadata (e.g. quantization passes) won't work.
This feature also has the additional benefit of being able to preserve origin line of code when `print_readable`'ing a `GraphModule`. This is helpful when debugging graphs that have passed through dynamo several times.
The added unit test demonstrates the added functionality of this PR.
~This PR is currently a proof-of-concept implementation that shows that preserving node metadata across dynamo is possible.~ This PR preserves node metadata across dynamo by doing the following:
- ~inject a counter variable into the `GraphModule` source code, which is incremented every time a node is run~
- Construct a line number -> node index map in `GraphModule` as the source code is being generated.
- pass a list of node metadata and the line number map to dynamo's bytecode analyzer
- ~dynamo traces the counter as a `ConstantVariable`, so when we create a new proxy, we can determine which original node index this proxy corresponds by looking at the value of the traced counter~
- When we create a new proxy, get the current instruction's line number, and get the node index using the line number map
- index into the original node metadata ~using the counter variable's tracked value.~
~Some things that should be addressed off the top of my head:~
- ~Is this feature even desirable? (Do we really want Dynamo to have special behavior for `GraphModules`? Should we expect users to re-export `GraphModules`?)~
- ~Is there a better approach than to use a counter? We considered using node names, line numbers, and assuming that proxies are created in the same order as the nodes, but each of these 3 have shortcomings. For node names, we only have access to new node names, not the old ones. Using line number is fragile. The third is problematic since not all created nodes go through `create_proxy` (e.g. inputs). We currently generate a line number to node index map when the `GraphModule`'s code is generated.~
- ~What's the best way to send data across the "CPython gap"? That is, it is not obvious how to cleanly pass data from dynamo's `eval_frame.py:_TorchDynamoContext.__call__` to `symbolic_convert.py:InstructionTranslatorBase.__init__`. In this PR, we use a global.~
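As a rough sketch of the intended effect (illustrative only, not the PR's test; `torch._dynamo.export` usage assumed):
```python
import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = torch.nn.Linear(4, 4)

    def forward(self, x):
        return self.lin(x).relu()

gm, _ = torch._dynamo.export(M(), torch.randn(2, 4))
# re-trace the GraphModule itself; with this PR, per-node metadata such as
# nn_module_stack / source_fn / stack_trace should survive the second trace
gm2, _ = torch._dynamo.export(gm, torch.randn(2, 4))
for node in gm2.graph.nodes:
    print(node.name, node.meta.get("nn_module_stack"))
```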
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107067
Approved by: https://github.com/jansel
Summary:
This commit fixes two silent correctness problems with
the current implementation of `move_model_to_eval`:
(1) Previously, the user had to manually call `eliminate_dead_code`
before calling `move_model_to_eval`, otherwise the dropout pattern
wouldn't actually get eliminated. This is because the subgraph rewriter
complains that the match is not self-contained, and so silently skips
the replacement.
(2) We wish to error when the user calls `model.train()` or
`model.eval()` on an exported model. This error is raised
correctly immediately after export today, but no longer raised
after the user calls prepare or convert.
We fix (1) by moving the `eliminate_dead_code` call into
`move_model_to_eval`, and fix (2) by ensuring the respective
errors are thrown after prepare and convert as well.
Additionally, this commit renames `move_model_to_eval` to
`move_exported_model_to_eval` to be more explicit.
bypass-github-export-checks
Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_disallow_eval_train
python test/test_quantization.py TestQuantizePT2E.test_move_exported_model_to_eval
Imported from OSS
Differential Revision: D49097293
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108891
Approved by: https://github.com/jerryzh168
When using FSDP in hybrid_shard mode,
state.process_group covers only GPUs 0-7 on node 0, so GPUs on node 1 cannot receive parameters. Setting process_group to the default (global) group fixes this issue.
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108331
Approved by: https://github.com/awgu
Unknown -Wno-XXX flags are still appended to GCC via append_cxx_flag_if_supported because of the behavior mentioned in the GCC documentation:
```
When an unrecognized warning option is requested (e.g., -Wunknown-warning),
GCC emits a diagnostic stating that the option is not recognized.
However, if the -Wno- form is used, the behavior is slightly different:
no diagnostic is produced for -Wno-unknown-warning unless other diagnostics are being produced.
This allows the use of new -Wno- options with old compilers,
but if something goes wrong, the compiler warns that an unrecognized option is present.
```
This PR fixes this by detecting support for the flag in its positive -WXXX form. Unfortunately, third_party/fbgemm/CMakeLists.txt redefines append_cxx_flag_if_supported, and our version is overwritten. As a result, we have to re-include utils.cmake to overwrite it again.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109000
Approved by: https://github.com/malfet
PR #106656 was reverted due to iOS failures. It seems that iOS builds don't fully support std::filesystem. This PR discards the std::filesystem changes and adds temp file creation on Windows. It also moves the platform syscalls into a separate cpp file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108508
Approved by: https://github.com/ezyang
Fixes https://github.com/pytorch/pytorch/issues/101939
Several fixes bundled together:
1. In valueToTensor, we only handled non-symbolic inputs and not symbolic inputs. We support symbolic Scalars, so we must also handle symbolic values.
2. In the symbolic case, we MUST NOT lift_fresh, as we are not going to inline a constant into the graph; it is going to come from a `scalar_tensor` call (so there is no need to clone it to avoid mutations).
3. In indexing `scalarToTensor`, we must not use the static "directly read out the scalar contents" logic when the scalar is symbolic.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108873
Approved by: https://github.com/jansel
- Add `if __name__ == "__main__": run_tests()` stanzas to test files in `torch_np` folder so that these tests run on CI
- Skip / xfail things smoked out by this change
- remove a stray python file which should not have been added to tests in the first place.
- fix einsum if opt_einsum is present
- add skips for older NumPy versions
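For reference, the stanza being added follows the standard PyTorch test-suite pattern:
```python
# appended to each test file under torch_np/ so CI discovers and runs it
if __name__ == "__main__":
    from torch.testing._internal.common_utils import run_tests
    run_tests()
```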
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108762
Approved by: https://github.com/lezcano
The strategy for supporting functools partials is relatively straightforward.
There are 2 cases we need to support:
**1) Functools partials as input**
In this case, we are first seeing the functools partial and it is guaranteed to have a source. As such, the args, keywords, and func of the functools partial are passed through VariableBuilder. As this is the first time we are seeing these objects (as it is an input), we re-enter VariableBuilder with a source referencing the args, keywords, and func as attributes of the input to produce:
- func: A callable VariableTracker (UDF, TorchVariable, etc) depending on the value of `func`
- args: List[VariableTracker] - note, not ListVariableTracker!
- keywords: Dict[str, VariableTracker]
A major benefit of this structure is that it very elegantly matches the args to `call_function`.
We then compose a FunctoolsPartialVariable from the VariableTrackers made above.
**2) Functools partials created within compile**
In this case, we already have all the args as known VTs, and thus just compose a FunctoolsPartialVariable as we do for case (1).
For both (1) and (2) - we propagate all guards from the func, args, and keyword VTs to the FunctoolsPartialVariable
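A minimal sketch of case (1), the input path (function and values are illustrative):
```python
import functools
import torch

def scaled_add(x, y, scale):
    return x + y * scale

partial_fn = functools.partial(scaled_add, scale=2.0)

@torch.compile(backend="eager", fullgraph=True)
def f(fn, x, y):
    # `fn` is an input with a source, so Dynamo rebuilds it as a
    # FunctoolsPartialVariable whose func/args/keywords are VariableTrackers
    return fn(x, y)

print(f(partial_fn, torch.ones(2), torch.ones(2)))
```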
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108846
Approved by: https://github.com/ezyang, https://github.com/jansel
Now looks like:
```
[2023-09-08 06:04:48,532] [0/0] torch._dynamo.symbolic_convert: [DEBUG] TRACE STORE_ATTR foo [ConstantVariable(int), NNModule
Variable()]
[2023-09-08 06:04:48,532] [0/0] torch._dynamo.convert_frame: [INFO]
Restarting analysis due to _dynamo/variables/nn_module.py
:138 in convert_to_unspecialized
[2023-09-08 06:04:48,533] [0/0_1] torch._dynamo.symbolic_convert: [INFO] Step 1: torchdynamo start tracing f /data/users/ezyang/c/pytorch/a.py:6
[2023-09-08 06:04:48,533] [0/0_1] torch._dynamo.symbolic_convert.__trace_source: [DEBUG] TRACE starts_line f /data/users/ezyang/c/pytorch/a.py:6
```
I'm happy to bikeshed the exact formatting of the attempt number if you
want.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108864
Approved by: https://github.com/mlazos, https://github.com/voznesenskym
Summary: As pointed out in https://github.com/pytorch/pytorch/pull/107479, using a set prevents collisions like "a" => "a", "a" => "a_1", "a_1" => "a_1" (but should go to "a_1_1"). We can combine using counters and a set to avoid this problem. Still gets us the performance benefit in the case of collisions with a very minor penalty in a case with no collision.
Test Plan:
Extract this code and run:
```
# New version
from typing import Dict, Set
class Net:
    _net_names_used_counters: Dict[str, int] = {}
    _net_names_used: Set[str] = set()

    @staticmethod
    def current_prefix():
        return "test_prefix"

    @staticmethod
    def _get_next_net_name(basename):
        basename = "/".join(x for x in [Net.current_prefix(), basename] if x)
        idx = Net._net_names_used_counters.get(basename, 0)
        while (name := basename if idx == 0 else f"{basename}_{idx}") in Net._net_names_used:
            idx += 1
        Net._net_names_used_counters[basename] = idx + 1
        Net._net_names_used.add(name)
        return name
print(Net._get_next_net_name("basename"))
print(Net._get_next_net_name("x_basename"))
print(Net._get_next_net_name("basename"))
print(Net._get_next_net_name("basename"))
print(Net._get_next_net_name("x_basename"))
print(Net._get_next_net_name("basename_1"))
> test_prefix/basename
> test_prefix/x_basename
> test_prefix/basename_1
> test_prefix/basename_2
> test_prefix/x_basename_1
> test_prefix/basename_1_1
```
Differential Revision: D48576516
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107743
Approved by: https://github.com/zdevito
Fixes#106893
There are two main changes:
- Before this PR, the function returned by once_differentiable was included in skipfiles (because its code object lives in torch/autograd/function.py). This PR adds a mechanism to tell Dynamo to inline a function even if it is included in skipfiles.
- A bugfix: when we are introspecting the backward, we need to turn the
grad mode off. This is to accurately model the eager-mode semantics:
In eager-mode PyTorch, if second-order gradients were not requested, then
the grad mode is off. torch.compile does not work with higher-order
gradients and just assumes we do first-order gradients, so this is OK.
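A minimal sketch of the pattern this enables (illustrative function):
```python
import torch
from torch.autograd.function import once_differentiable

class Square(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x * x

    @staticmethod
    @once_differentiable  # previously forced a skip; now inlineable by Dynamo
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # grad mode is off here, matching first-order eager semantics
        return 2.0 * x * grad_out

@torch.compile
def f(x):
    return Square.apply(x).sum()

x = torch.randn(3, requires_grad=True)
f(x).backward()
print(x.grad)
```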
Test Plan:
- new test
Differential Revision: [D49064185](https://our.internmc.facebook.com/intern/diff/D49064185)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108686
Approved by: https://github.com/voznesenskym
Summary:
Original commit changeset: 33650f7cb0fb
Original Phabricator Diff: D48833682
Test Plan: See T162942232 for how we figured out that this diff caused significant numeric difference.
Reviewed By: voznesenskym
Differential Revision: D49082219
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108823
Approved by: https://github.com/xw285cornell
This combines a bunch of python global state guards into a single C++ guard and switches to checking them 100% of the time. It also adds a few new guards for things that change inductor's behavior. Even though we are checking more things, I expect this to be much faster.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108624
Approved by: https://github.com/anijain2305
Summary:
Include constants in AOTInductor .so file.
Some notable differences:
1) serialize with ctypes instead of the native torch.storage
2) Use the underlying for_blob instead of from_blob to construct Tensor.
Test Plan:
Unit tests:
```
test/inductor/test_aot_inductor.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108473
Approved by: https://github.com/angelayi
**Motivation**: We already have a `CompiledFunction` event that comes from the autograd.Function added by aot_autograd. However, this doesn't appear during inference, or if none of the inputs to a graph require grad. It also doesn't appear if your backend doesn't use aot_autograd.
This adds a profiler event that will always appear.
<img width="615" alt="Screenshot 2023-09-01 at 4 46 28 PM" src="https://github.com/pytorch/pytorch/assets/5067123/fed90ca9-a8e7-458c-80eb-b4160de55218">
Perf - increase in latency (with profiler turned off) was within noise when I measured a simple cpu-only torch-compiled function that returned `x.view_as(x)`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108462
Approved by: https://github.com/anijain2305
Summary: Switch AOTInductor unit tests and integration tests to invoke the same runtime interface. This is only an effort to unify the usage of the runtime. The interface scrutiny will come in later PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108663
Approved by: https://github.com/ezyang
ghstack dependencies: #108653
Sometimes one might want to impose equalities that are not required by guards, e.g. say that you only want square images when rectangular images would suffice.
Curiously we never checked that the concrete values passed in example shapes actually satisfy such equality constraints. So, e.g., you could multiply two tensors of shapes MxK and KxN, specify that M and N must be equal, and then pass examples where they are not equal.
Relatedly, the symbolic shape dimensions for inputs in the exported graph were not forced to be equal.
However, runtime assertions still fire because they take into account all equality constraints. This would result in the strange situation where export would succeed but the exported program with the same example inputs would fail.
This PR fixes these issues.
Differential Revision: [D48910918](https://our.internmc.facebook.com/intern/diff/D48910918/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108429
Approved by: https://github.com/zhxchen17
Summary: Run mm decomposition tests for CPU and GPU
One nit - this will suppress CPU tests on hosts that have CUDA (i.e., TEST_CUDA is True) but don't have Triton, because we don't have access to whether the test is actually for CPU or CUDA (which would require reading the device argument).
(This is a general limitation on torch.compile tests because on CUDA they require triton in the std config.)
Test Plan: sandcastle, github
Differential Revision: D48998215
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108620
Approved by: https://github.com/bertmaher
**Motivation:**
We are trying to make torch.cond use torch.compile automatically so that we can error out when there are side effects in the branches and correctly handle closures.
Before this PR, we emitted a warning unless the config raise_on_backend_change was turned on (turning it on gives an error instead) for the following code:
```python
def foo(x):
    ...

# Inside torch.cond, we'd like to do something like:
torch.compile(foo, backend="eager", fullgraph=True)(...)

# Users may then call torch.compile somewhere else.
# Dynamo will use the cached code of foo for the "eager" backend,
# but we expect dynamo to recompile with the "inductor" backend.
torch.compile(foo, backend="inductor")(...)
```
This PR adds a BACKEND_MATCH guard. Effectively, it implements a per-backend cache. In the above example, the cached code for "eager" won't work for "inductor" due to guard check failures and the second torch.compile will do a re-compilation. In the future, it might be useful to have something like a configuration guard that guards against dynamo configuration changes across different compiles (e.g. compile a function with fullgraph=False then compile it again with fullgraph=True).
**Implementation:**
1. We add a guarded_backend_cache and check the most_recent_backend against the backend associated with cached code. We also remove the raise_on_backend_change flag.
2. The newly added context manager and guard add more lines to the debug log, so we raise the upper limit from 50 to 55.
**Test Plan:**
Removed original tests that raise on different backend and add a new test to test whether the BACKEND_MATCH guard can guard against backend change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107337
Approved by: https://github.com/jansel
Building with CUDA in a dev container leads to the error `cannot find -lcudart_static`. This is because the libraries are under a custom CUDA_HOME, and `ld` cannot find them.
Updating the `LDFLAGS` environment variable works.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108766
Approved by: https://github.com/drisspg
Before this PR, we used get_fake_value to get the fake_sub_args and then called op(*fake_sub_args) to get the example value for out_dtype.
This causes a problem when the input proxy's op type is `get_attr`: get_fake_value for a `get_attr` node will actually look at the original param/buffer and **return a real tensor** instead of a fake tensor. This is OK for export, since export's fake_mode allows non_fake_inputs (see [here](https://github.com/pytorch/pytorch/blob/main/torch/_dynamo/output_graph.py#L278)). But it causes a problem when nesting cond with out_dtype, where cond will use torch.compile(fullgraph=True) to inspect out_dtype and find that the inputs to the op are a mix of FakeTensors and real tensors.
This PR changes how we get the example values from proxies: we directly look at node.meta["example_value"]. This metadata is guaranteed to exist for all proxies during dynamo tracing, so it's safe to use (it's also used by get_fake_value to get fake tensors from args for general ops; see [here](https://github.com/pytorch/pytorch/blob/main/torch/_dynamo/utils.py#L1318)).
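In sketch form, the lookup this PR switches to (helper name hypothetical):
```python
def _example_value(proxy):
    # dynamo guarantees this key exists on every proxy it creates, and it
    # holds a fake tensor even for get_attr nodes (unlike get_fake_value)
    return proxy.node.meta["example_value"]
```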
Test Plan:
existing tests + remove expected failure for a test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108715
Approved by: https://github.com/zou3519
Summary: Currently node metadata "nn_module_stack" is only being used by export. For some exported models, we still want to retain nn_module_stack for unspecialized modules for various purposes. This diff adds a path to also record nn_module_stack when an unspecialized module has a source available.
Test Plan: test_export_nn_module_stack_patched_module
Differential Revision: D48841193
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108281
Approved by: https://github.com/yanboliang, https://github.com/tugsbayasgalan
PoC demonstrating vmap + NT based on the [design doc](https://docs.google.com/document/d/1dVVk6TOqz93PLTIneU2T3xaxCs9qZ0MaJyCvOAp_bC0). This PR:
* Allows `BatchedTensorImpl`s to contain NTs
* Introduces a `BatchedNestedTensor` dispatch key for NT-specific batching rules
* Provides a batching rule fallback that unbinds the NTs -> performs computation on constituent -> rebinds results into NT
Restrictions:
* Only supports one level of vmap
* Only supports vmapping over dim=0 for NTs
* For operations with mixed NT / dense inputs, support is also limited to dim=0 for the dense inputs
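A minimal sketch of the intended usage under these restrictions (assuming `torch.sin` takes the unbind-compute-rebind fallback):
```python
import torch

nt = torch.nested.nested_tensor([torch.randn(3), torch.randn(5)])

# one level of vmap, over dim 0 of the nested tensor: the fallback unbinds
# the NT, applies sin to each constituent, and rebinds the results
out = torch.vmap(torch.sin)(nt)
print([t.shape for t in out.unbind()])
```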
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106786
Approved by: https://github.com/zou3519
PR #108479 was reverted because
```
In file included from xplat/caffe2/c10/util/Exception.h:5:
In file included from xplat/caffe2/c10/util/StringUtil.h:6:
xplat/caffe2/c10/util/string_view.h:576:31: error: out-of-line definition of constexpr static data member is redundant in C++17 and is deprecated [-Werror,-Wdeprecated]
basic_string_view<CharT>::npos;
```
Now this is fixed and Wdeprecated generated no warnings on my host.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108622
Approved by: https://github.com/Skylion007
Summary: Looks like these already pass (and torch/_inductor/kernel/mm_plus_mm_new.py does not exist)
Test Plan: `lintrunner torch/_inductor/kernel/mm.py torch/_inductor/kernel/bmm.py torch/_inductor/kernel/__init__.py torch/_inductor/kernel/mm_plus_mm.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108678
Approved by: https://github.com/eellison
Fixes#68972
Relands #107246
To avoid causing Meta-internal CI failures, this PR avoids always asserting that the default dtype is float in the `TestCase.setUp/tearDown` methods. Instead, the assert is only done if `TestCase._default_dtype_check_enabled == True`. `_default_dtype_check_enabled` is set to True in the `if __name__ == "__main__":` blocks of all the relevant test files that have required changes for this issue
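The opt-in looks roughly like this in each affected test file (a sketch of the pattern described above):
```python
if __name__ == "__main__":
    from torch.testing._internal.common_utils import TestCase, run_tests

    # opt in to the float default-dtype assertion in setUp/tearDown
    TestCase._default_dtype_check_enabled = True
    run_tests()
```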
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108088
Approved by: https://github.com/ezyang
**This PR is a 99% copy paste of Sam Gross** (@colesbury) work at https://github.com/pytorch/pytorch/pull/100642. Copied from there
--------
The NN_MODULE guard now subsumes guards on Module attributes. The check_fn will fail if module attributes are changed (such as Module.training), if parameters, submodules, or buffers are added or removed, or if fields are changed on the type itself.
This gives up specificity in the guard check -- if any field is changed the check_fn fails -- for faster overall checks.
-----
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108528
Approved by: https://github.com/ezyang
When we retrace the graph containing constant tensors, they get lifted as buffer inputs.
AotInductor also wants to lift all the constants as inputs.
If we keep the constants as a separate category, it adds complexity: we now have to keep track of 3 kinds of inputs (params, buffers, constants).
Cons: people might care about which inputs are or are not real buffers.
If people want to know specifically which buffers are constants, we can add an additional field in the graph signature to mark this.
Differential Revision: [D49017872](https://our.internmc.facebook.com/intern/diff/D49017872)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108592
Approved by: https://github.com/zhxchen17
Summary:
[experimental] use EXCEPT_FOR env to suppress CPU tests from GPU RE -- alternative implementation to D48997976 using preexisting PYTORCH_TESTING_DEVICE_EXCEPT_FOR facility and building remaining logic (for assert-positive listers like test_transformers) on top of that.
Goal: save ~100 GPU (10% of capacity), enables us to fund more aggressive PyPer unit testing on GPU RE
Test Plan: sandcastle, github
Differential Revision: D48998582
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108672
Approved by: https://github.com/bertmaher
Summary:
Previously, `_generate_new_graph_signature` assumed that all transformations were out of place. This assumption is incorrect and led to mysterious failures when running passes that modify the graph in place.
This function is technically only needed in the case where the user output node or user input node name is changed. For example, if the user output node was "add" but a pass changes all the "add"s to "mul"s, then the output node will now be named "mul", which we have to update.
For cases where users change the number of user inputs/outputs, number of parameters/buffers, or the names of parameters/buffers it will require extra work on the user's side to update the graph signature, since there is no automatic way for us to detect where to put what.
Note: this doesn't actually change the names for the buffers_to_mutate part of the graph signature, but we're going to assume this is rare, because implementing auto-fixing for that is a little hard...
Test Plan: Running `buck test fbcode//mode/dev-nosan fbcode//executorch/backends/xnnpack/test:` on top of D48710125, https://www.internalfb.com/intern/testinfra/testrun/5066549776877081
Differential Revision: D48917505
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108571
Approved by: https://github.com/zhxchen17
Summary:
Previously we could only use native PyTorch int dtypes that have corresponding quantized dtypes (e.g. quint8, qint8). This
PR removes this assumption in observers/fake_quants so that users can use all PyTorch native dtypes (except for int64; we can add it later if needed);
the main addition here is int16.
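A minimal sketch of what this unlocks (observer kwargs assumed; quant_min/quant_max chosen to cover int16):
```python
import torch
from torch.ao.quantization.observer import MinMaxObserver

# a native dtype with no quantized counterpart, now accepted by observers
obs = MinMaxObserver(dtype=torch.int16, quant_min=-32768, quant_max=32767)
obs(torch.randn(8, 8))
scale, zero_point = obs.calculate_qparams()
print(scale, zero_point)
```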
Test Plan:
python test/test_quantization.py TestQuantizePT2E
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108453
Approved by: https://github.com/kimishpatel
Summary:
This diff demonstrates a simplified E2E workflow for PT2 Inference stack:
1. Model author with `torch.export()`
2. Model processing with `aot_inductor.compile()`
3. Model served with a new Inference Runtime API, named `ModelRunner`
`torch.export()` and `aot_inductor.compile()` produce a zip file using `PyTorchStreamWriter`.
Runtime reads the zip file with `PyTorchStreamReader`.
The zip file contents are shown in an internal attachment (F1080328179).
More discussion on packaging can be found in https://docs.google.com/document/d/1C-4DP5yu7ZhX1aB1p9JcVZ5TultDKObM10AqEtmZ-nU/edit?usp=sharing
Runtime can now switch between two Execution modes:
1. Graph Interpreter mode, implemented based on Sigmoid's Executor
2. AOTInductor mode, implemented based on FBAOTInductorModel
Test Plan:
buck2 run mode/dev-nosan mode/inplace -c fbcode.enable_gpu_sections=True //sigmoid/inference/test:e2e_test
Export and Lower with AOTInductor
buck2 run mode/dev-sand mode/inplace -c fbcode.enable_gpu_sections=True sigmoid/inference:export_package
Run with GraphInterpreter and AOTInducotr
buck2 run mode/dev-nosan //sigmoid/inference:main
Reviewed By: suo
Differential Revision: D47781098
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108482
Approved by: https://github.com/zhxchen17
On Linux, CUDA header dependencies are not correctly tracked. After you modify a CUDA header, affected CUDA files won't be rebuilt. This PR will fix this problem.
```console
$ ninja -t deps
rep_penalty.o: #deps 2, deps mtime 1693956351892493247 (VALID)
/home/qc/Workspace/NotMe/exllama/exllama_ext/cpu_func/rep_penalty.cpp
/home/qc/Workspace/NotMe/exllama/exllama_ext/cpu_func/rep_penalty.h
rms_norm.cuda.o: #deps 0, deps mtime 1693961188871054130 (VALID)
rope.cuda.o: #deps 0, deps mtime 1693961188954388632 (VALID)
cuda_buffers.cuda.o: #deps 0, deps mtime 1693961188797719768 (VALID)
...
```
Historically, this line of code has been changed twice. It was first implemented in #49344 and there's no `if IS_WINDOWS`, just like now. Then in #56015 someone added `if IS_WINDOWS` for unknown reason. That PR has no description so I don't know what bug he encountered. I don't think there's any bug with these flags on Linux, at least for today. CMake generates exactly the same flags for CUDA.
```ninja
#############################################
# Rule for compiling CUDA files.
rule CUDA_COMPILER__cpp_cuda_unscanned_Debug
depfile = $DEP_FILE
deps = gcc
command = ${LAUNCHER}${CODE_CHECK}/opt/cuda/bin/nvcc -forward-unknown-to-host-compiler $DEFINES $INCLUDES $FLAGS -MD -MT $out -MF $DEP_FILE -x cu -c $in -o $out
description = Building CUDA object $out
```
where `-MD` is short for `--generate-dependencies-with-compile` and `-MF` is short for `--dependency-output`. My words can be verified by `nvcc --help`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108613
Approved by: https://github.com/ezyang
As in the title.
The bsr_dense_mm performance on inputs using column-major storage order is relevant for the `linear(x, W)` operation, which for BSR weights is defined as `bsr_dense_mm(W, x.transpose(-2, -1)).transpose(-2, -1)`, so that the second argument to `bsr_dense_mm` is a strided tensor using column-major storage order when `x` is C-contiguous.
For large inputs (size > 1000) and moderate sparsity in the BSR input, the speed up can be more than 3 times, as illustrated in the following figure (raw data: [bench_bsr_dense_mm_1_results.txt](https://github.com/pytorch/pytorch/files/12512245/bench_bsr_dense_mm_1_results.txt)):

For small inputs (size=512), there exists a slight degradation of performance.
For row-major ordered inputs, there is no change in performance (see raw data above).
For inputs with float16 dtype, there is no considerable change in performance (see blue marks in the figure).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108512
Approved by: https://github.com/cpuhrsch
At present, we refer to the existing TensorPipe RPC backend and implement our own RPC communication backend in our extension package. During development we found that the required functions are not exported, and using them directly causes undefined symbol problems in our extension package.
This PR adds the TORCH_API macro to the functions required to implement a custom TensorPipe agent in the RPC module, exposing them to developers. At the same time, we think this risk is very controllable and hope it can be merged into version 2.1.
cc
@albanD, @kumpera
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108553
Approved by: https://github.com/kumpera, https://github.com/albanD
Move detectron2_fcos_r_50_fpn to amp. The minifier showed the following snippet as causing the divergence, where inductor has better numerics than eager:
```
import torch
def foo(x):
    return x > .2
inp = torch.tensor([.2002], device="cuda", dtype=torch.bfloat16)
print(foo(inp))
print(torch.compile(foo)(inp))
```
doctr_reco_predictor had very minimal divergence (.002 vs .001 required), bumping tolerance here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108598
Approved by: https://github.com/shunting314
Summary: See the comment in code for the reasons of the change
Test Plan:
buck2 test executorch/examples/export/test:test_export --
test_vit_export_to_executorch
Differential Revision: D48992180
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108608
Approved by: https://github.com/larryliu0820
Fixes https://github.com/pytorch/pytorch/issues/108323.
Cpp wrapper has a functional regression on `llama` and `tnt_s_patch16_224` due to the recent support of scaled dot product flash attention in inductor.
The schema of this OP is as follows:
```
- func: _scaled_dot_product_flash_attention(Tensor query, Tensor key, Tensor value, float dropout_p=0.0, bool is_causal=False, bool return_debug_mask=False, *, float? scale=None) -> (Tensor output, Tensor logsumexp, Tensor cum_seq_q, Tensor cum_seq_k, int max_q, int max_k, Tensor philox_seed, Tensor philox_offset, Tensor debug_attn_mask)
```
For `llama` and `tnt_s_patch16_224`, the OP is called in the below way, where the three positional args with default values are not passed (`float dropout_p=0.0, bool is_causal=False, bool return_debug_mask=False`).
```python
y = torch.ops.aten._scaled_dot_product_flash_attention.default(x0, x1, x2, scale = 0.125)
```
This PR fixes the cpp wrapper support for this case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108552
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
Summary: Move AOTInductor runtime header files into its own subdirectory, to separate them from to-be-added libtorch C interface.
Reviewed By: frank-wei
Differential Revision: D48905038
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108564
Approved by: https://github.com/frank-wei
Fixes#83196
Now the MPS implementation is blazingly fast.
I have several questions on improving this PR, though:
1. I copied code from `test_nn.py`. Is there a better way to test this?
2. I decided to use `usepixelshuffleorder:YES`. Am I right performance-wise? According to docs:
```
`usePixelShuffleOrder` can be
used to control how the data within spatial blocks is ordered in the
`depthAxis` dimension: with `usePixelShuffleOrder=YES` the values within the
spatial blocks are stored contiguously within the `depthAxis` dimension whereas
otherwise they are stored interleaved with existing values in the `depthAxis` dimension.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99306
Approved by: https://github.com/kulinseth, https://github.com/malfet
This PR removes four usages of compute_local_offset() in the PyTorch repo and replaces them with the new API compute_local_shape_and_global_offset().
We will remove the compute_local_offset() API in the next diff, as there are still usages internally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108547
Approved by: https://github.com/wanchaol
A couple of changes to make it more efficient:
- Because we are replacing nodes that only have a single value, store only that single value instead of the whole tensor for node replacement.
- torch.fx.Interpreter will preserve a Tensor in the env as long as it has more uses. That applies even to output uses, but we are not going to constant fold that use. So instead of using the last use for garbage collection, use the last non-output use.
If reviewers would prefer I ghstack this because of the code movement, let me know.
Fix for https://github.com/pytorch/pytorch/issues/108388
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108421
Approved by: https://github.com/jansel
Summary: Before implementing `aten::randn_like` as requested (T152843033), I think it is worth extending the existing `aten::uniform` into `aten::rand_like`, since they're so similar.
Test Plan:
```
[ttingchulin@6945.od /data/sandcastle/boxes/fbsource (rand_like)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*<test>*" e.g. -- --gtest_filter="*rand_like*"
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from VulkanAPITest
[ RUN ] VulkanAPITest.rand_like
[ OK ] VulkanAPITest.rand_like (136 ms)
[----------] 1 test from VulkanAPITest (136 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (136 ms total)
[ PASSED ] 1 test.
[ttingchulin@6945.od /data/sandcastle/boxes/fbsource (rand_like)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*<test>*" eg. -- --gtest_filter="*uniform*"
Building: finished in 0.1 sec (100%) 329/329 jobs, 0/329 updated
Total time: 0.1 sec
BUILD SUCCEEDED
Running main() from xplat/third-party/gmock/googletest-1.12.1/googletest/src/gtest_main.cc
Note: Google Test filter = *uniform*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from VulkanAPITest
[ RUN ] VulkanAPITest.uniform
[ OK ] VulkanAPITest.uniform (131 ms)
[----------] 1 test from VulkanAPITest (131 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (131 ms total)
[ PASSED ] 1 test.
[ttingchulin@6945.od /data/sandcastle/boxes/fbsource (rand_like)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin
ALL PASS
```
Differential Revision: D48710273
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108086
Approved by: https://github.com/yipjustin
The torch.linalg.matrix_power documentation suggests using the formula
`matrix_power(torch.linalg.solve(A, B), n) == matrix_power(A, -n) @ B`
to avoid negative matrix powers. But the ordering of the left side is not correct. This patch fixes it to:
`torch.linalg.solve(matrix_power(A, n), B) == matrix_power(A, -n) @ B`
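A quick numerical check of the corrected identity (a sketch; A is shifted to keep it comfortably invertible):
```python
import torch

A = torch.randn(3, 3) + 3.0 * torch.eye(3)
B = torch.randn(3, 3)
n = 2

lhs = torch.linalg.solve(torch.linalg.matrix_power(A, n), B)  # A^{-n} @ B via solve
rhs = torch.linalg.matrix_power(A, -n) @ B                    # explicit negative power
torch.testing.assert_close(lhs, rhs, rtol=1e-4, atol=1e-5)
```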
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108585
Approved by: https://github.com/lezcano
This PR:
1) Add a device_mesh kwarg to FSDP. Remove init_device_mesh() from _runtime_utils.py, as device_mesh is now passed in by the user as a kwarg.
2) Change the use_dtensor flag for state_dict_config and optim_state_dict_config to be private. If device_mesh is used with a sharded model/optim state dict, the _use_dtensor flag is set to True and the model/optim state dict returns a DTensor state_dict. Otherwise, the _use_dtensor flag is set to False and the model/optim state dict returns a sharded_tensor state_dict.
3) Update _optim_utils.py, _shard_utils.py, and _state_dict_utils.py to add support for HSDP to return 2D DTensor state_dict.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107533
Approved by: https://github.com/fegin, https://github.com/awgu, https://github.com/wanchaol
Fixes https://github.com/pytorch/pytorch/issues/105327. The problem is that `lift_fresh_copy()`'s functionalization implementation currently assumes that the input is always not functional. This is apparently too limiting: when you have "user" code like this (which can potentially come from exporting a model and then running compile on the resulting graph):
```
tensor_constant0 = torch.tensor(2)
lift_fresh = torch.ops.aten.lift_fresh_copy.default(tensor_constant0)
```
When we run this through AOTAutograd, the first call (torch.tensor(2)) will **already** be lifted into a functional tensor wrapper - so the `lift_fresh_copy` call doesn't need to do any "lifting" anymore - it just needs to do a clone.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108243
Approved by: https://github.com/albanD
ghstack dependencies: #108081, #108235
From talking to @wconstab, we agreed that because of the way DDPOptimizer is written, it is (sort of) incompatible with inductor's `keep_output_stride=False` optimizations (and will cause silent correctness problems if you use them together). Added an assertion.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108235
Approved by: https://github.com/wconstab
ghstack dependencies: #108081
@wconstab pointed out that inductor found a graph with 6 input mutations and only 1 output, and seemed to be (incorrectly) chopping off the first "6" outputs from the graph (even though there is only 1). It looks like this is because:
(1) AOTAutograd has special handling for input mutations in inference vs. training graphs. In a training graph, whenever AOTAutograd sees an input mutation, it will add an **extra** output to the graph, corresponding to the updated input (and then at runtime, it will grab the updated input, and perform the actual mutation outside of the graph).
In inference, AOTAutograd is smarter and can leave the input mutations directly in the graph for inductor to optimize (doing this in training is harder). In inference, AOTAutograd will **not** add any extra graph outputs for input mutations.
It looks like inductor was unconditionally assuming that input mutations counted as extra outputs in the graph, which is wrong for the inference case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108081
Approved by: https://github.com/wconstab
Hi!
I've been fuzzing different pytorch modules with [sydr-fuzz](https://github.com/ispras/oss-sydr-fuzz/tree/master/projects/pytorch), and found a heap buffer overflow error caused by an incorrect loop condition in torch::jit::unpickler.cpp. This bug can be triggered by the `torch::distributed::rpc::deserializeRequest()` method in the RPC module.
Docker to reproduce found error: [Dockerfile](https://github.com/ispras/oss-sydr-fuzz/tree/master/projects/pytorch).
### PoC for deserializeRequest():
[crash-001e49dcd3a3c439e2b1273d580049309e052bdd.txt](https://github.com/pytorch/pytorch/files/12498999/crash-001e49dcd3a3c439e2b1273d580049309e052bdd.txt)
### ASAN report
```
==339982==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x619000086a88 at pc 0x000000996fa4 bp 0x7fffffff9c50 sp 0x7fffffff9c48
READ of size 4 at 0x619000086a88 thread T0
#0 0x996fa3 in c10::IValue::IValue(c10::IValue const&) /pytorch/aten/src/ATen/core/ivalue.h:226:33
#1 0xdf99a38 in std::pair<c10::impl::DictIterator<c10::IValue, c10::IValue, ska_ordered::detailv3::sherwood_v3_table<std::pair<c10::IValue, c10::IValue>, c10::IValue, c10::detail::DictKeyHash, ska_ordered::detailv3::KeyOrValueHasher<c10::IValue, std::pair<c10::IValue, c10::IValue>, c10::detail::DictKeyHash>, c10::detail::DictKeyEqualTo, ska_ordered::detailv3::KeyOrValueEquality<c10::IValue, std::pair<c10::IValue, c10::IValue>, c10::detail::DictKeyEqualTo>, std::allocator<std::pair<c10::IValue, c10::IValue> >, std::allocator<ska_ordered::detailv3::sherwood_v3_entry<std::pair<c10::IValue, c10::IValue> > > >::templated_iterator<std::pair<c10::IValue, c10::IValue> > >, bool> c10::Dict<c10::IValue, c10::IValue>::insert_or_assign<c10::IValue&, c10::IValue&>(c10::IValue&, c10::IValue&) const /pytorch/aten/src/ATen/core/Dict_inl.h:136:5
#2 0xed966c7 in torch::jit::Unpickler::readInstruction() /pytorch/torch/csrc/jit/serialization/unpickler.cpp:490:14
#3 0xed94377 in torch::jit::Unpickler::run() /pytorch/torch/csrc/jit/serialization/unpickler.cpp:253:27
#4 0xed93fd1 in torch::jit::Unpickler::parse_ivalue() /pytorch/torch/csrc/jit/serialization/unpickler.cpp:206:3
#5 0xece09ee in torch::jit::unpickle(std::function<unsigned long (char*, unsigned long)>, std::function<c10::StrongTypePtr (c10::QualifiedName const&)>, c10::ArrayRef<at::Tensor>, c10::Type::SingletonOrSharedTypePtr<c10::Type> (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)) /pytorch/torch/csrc/jit/serialization/pickle.cpp:126:20
#6 0xece0dac in torch::jit::unpickle(char const*, unsigned long, std::function<c10::StrongTypePtr (c10::QualifiedName const&)>, c10::ArrayRef<at::Tensor>, c10::Type::SingletonOrSharedTypePtr<c10::Type> (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)) /pytorch/torch/csrc/jit/serialization/pickle.cpp:136:10
#7 0x1006a4e7 in torch::distributed::rpc::PythonRemoteCall::fromMessage(torch::distributed::rpc::Message const&) /pytorch/torch/csrc/distributed/rpc/python_remote_call.cpp:40:16
#8 0x101d02e1 in torch::distributed::rpc::deserializeRequest(torch::distributed::rpc::Message const&) /pytorch/torch/csrc/distributed/rpc/utils.cpp:111:14
#9 0x8db738 in LLVMFuzzerTestOneInput /message_deserialize.cc:192:27
#10 0x8d84cd in ExecuteFilesOnyByOne /AFLplusplus/utils/aflpp_driver/aflpp_driver.c:255:7
#11 0x8d82d8 in LLVMFuzzerRunDriver /AFLplusplus/utils/aflpp_driver/aflpp_driver.c
#12 0x8d7e98 in main /AFLplusplus/utils/aflpp_driver/aflpp_driver.c:300:10
#13 0x7ffff7a37082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082) (BuildId: 1878e6b475720c7c51969e69ab2d276fae6d1dee)
#14 0x817c4d in _start (/message_deserialize_afl+0x817c4d)
0x619000086a88 is located 8 bytes to the right of 1024-byte region [0x619000086680,0x619000086a80)
allocated by thread T0 here:
#0 0x8d54ca in operator new(unsigned long) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/asan/asan_new_delete.cpp:95:3
SUMMARY: AddressSanitizer: heap-buffer-overflow /pytorch/aten/src/ATen/core/ivalue.h:226:33 in c10::IValue::IValue(c10::IValue const&)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108413
Approved by: https://github.com/ezyang
The output of softmax was overwritten by the output of fc2 in the following line. So, the output of the softmax is never utilized. Now, the final output of the model includes softmax.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107988
Approved by: https://github.com/zou3519
Currently numel only supports static shapes; this expands it to support
symbolic shapes by generating the arithmetic into the graph, e.g.
```
# x.size().numel with x.size() = [s0, 1, s1]
size = l_x_.size()
getitem = size[0]
getitem_2 = size[2]; size = None
mul = getitem * getitem_2; getitem = getitem_2 = None
```
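User code producing a trace like the one above might look like (a sketch; `dynamic=True` forces symbolic sizes):
```python
import torch

@torch.compile(dynamic=True, fullgraph=True)
def normalize(x):
    # x.size().numel() becomes s0 * s1 arithmetic in the graph
    return x.sum() / x.size().numel()

print(normalize(torch.randn(4, 1, 8)))
```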
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108239
Approved by: https://github.com/ezyang
It propagated NaNs when it should not have.
We also replace it with std::abs to fix a precision mismatch.
This change fixes the test_python_ref__refs_abs_cpu_complex32 test in test/test_ops.py.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108515
Approved by: https://github.com/ezyang
It seems pip-installed torch is built with `-D_GLIBCXX_USE_CXX11_ABI=0`, and it fails inductor/test_aot_inductor.py with:
```
ERROR: test_with_offset (__main__.AotInductorTests)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/testing/_internal/common_utils.py", line 2388, in wrapper
method(*args, **kwargs)
File "/home/ubuntu/src/pytorch/test/inductor/test_aot_inductor.py", line 112, in test_with_offset
actual = AOTInductorModelRunner.run(model, example_inputs, expected)
File "/home/ubuntu/src/pytorch/test/inductor/test_aot_inductor.py", line 63, in run
optimized, exported, output_tensors, output_spec = AOTInductorModelRunner.load(
File "/home/ubuntu/src/pytorch/test/inductor/test_aot_inductor.py", line 50, in load
optimized = torch.utils.cpp_extension.load_inline(
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1635, in load_inline
return _jit_compile(
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1736, in _jit_compile
return _import_module_from_library(name, build_directory, is_python_module)
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 2136, in _import_module_from_library
module = importlib.util.module_from_spec(spec)
File "<frozen importlib._bootstrap>", line 565, in module_from_spec
File "<frozen importlib._bootstrap_external>", line 1173, in create_module
File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
ImportError: /tmp/torchinductor_ubuntu/cqrzlw3yizrsx2us5bnjosr4tzct24h6qwb6xbbx654fxvdupoub/cr6ndwlgeorw34etxhwvs547kbnftyxtwwrsmbdraa4hjeevsvji.so: undefined symbol: _ZN3c106detail23torchInternalAssertFailEPKcS2_jS2_RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
```
I'm not sure how to test this in CI, maybe run tests with prebuilt wheels?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108319
Approved by: https://github.com/ezyang
Hi!
I've been fuzzing different pytorch modules with [sydr-fuzz](https://github.com/ispras/oss-sydr-fuzz/tree/master/projects/pytorch), and found a heap buffer overflow error that occurs during the Python object deserialization routine. A vector of `IValue`s is verified to contain at least 3 elements, which are subsequently removed from the vector. The rest of the vector is passed further, where it is expected to contain at least one more element. The crash occurs on an empty vector.
Docker to reproduce found error: [Dockerfile](https://github.com/ispras/oss-sydr-fuzz/tree/master/projects/pytorch).
### PoC:
[crash-6d634f38a76bfeaa1fffc9472e8ea7b88ee8e776.txt](https://github.com/pytorch/pytorch/files/12499089/crash-6d634f38a76bfeaa1fffc9472e8ea7b88ee8e776.txt)
### ASAN report
```
==339647==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x604000105388 at pc 0x000000c2b3bc bp 0x7fffffffb8d0 sp 0x7fffffffb8c8
READ of size 4 at 0x604000105388 thread T0
#0 0xc2b3bb in c10::IValue::isString() const /pytorch/aten/src/ATen/core/ivalue.h:685:27
#1 0xc2b3bb in c10::IValue::toStringRef[abi:cxx11]() const /pytorch/aten/src/ATen/core/ivalue_inl.h:2308:3
#2 0x101ce65f in torch::distributed::rpc::SerializedPyObj::fromIValues(std::vector<c10::IValue, std::allocator<c10::IValue> >) /pytorch/torch/csrc/distributed/rpc/types.cpp:103:39
#3 0x1006a7a0 in torch::distributed::rpc::PythonRemoteCall::fromMessage(torch::distributed::rpc::Message const&) /pytorch/torch/csrc/distributed/rpc/python_remote_call.cpp:58:26
#4 0x101d02e1 in torch::distributed::rpc::deserializeRequest(torch::distributed::rpc::Message const&) /pytorch/torch/csrc/distributed/rpc/utils.cpp:111:14
#5 0x8db738 in LLVMFuzzerTestOneInput /message_deserialize.cc:192:27
#6 0x8d84cd in ExecuteFilesOnyByOne /AFLplusplus/utils/aflpp_driver/aflpp_driver.c:255:7
#7 0x8d82d8 in LLVMFuzzerRunDriver /AFLplusplus/utils/aflpp_driver/aflpp_driver.c
#8 0x8d7e98 in main /AFLplusplus/utils/aflpp_driver/aflpp_driver.c:300:10
#9 0x7ffff7a37082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082) (BuildId: 1878e6b475720c7c51969e69ab2d276fae6d1dee)
#10 0x817c4d in _start (/message_deserialize_afl+0x817c4d)
0x604000105388 is located 8 bytes to the left of 48-byte region [0x604000105390,0x6040001053c0)
allocated by thread T0 here:
#0 0x8d54ca in operator new(unsigned long) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/asan/asan_new_delete.cpp:95:3
SUMMARY: AddressSanitizer: heap-buffer-overflow /pytorch/aten/src/ATen/core/ivalue.h:685:27 in c10::IValue::isString() const
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108414
Approved by: https://github.com/ezyang
This PR replaces c10::guts::to_string with std::to_string. The major part of the change uses void* as the optimizer state key, since the string form is used only for serialization and pointers are more efficient hashing keys than strings.
Some other guts functions in the affected source files are also replaced.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108480
Approved by: https://github.com/Skylion007
**Background**: "TorchDynamo Cache Lookup" events appear in traces to indicate a dynamo cache lookup; it's useful to check when cache lookups are taking a long time. To add a profiler event, one can use the `torch.profiler.record_function` context manager, or the C++ equivalent. Previously, the python version was used; first, when the profiler was enabled, callbacks for record_function_enter and record_function_exit were registered; then those would be called before and after every cache lookup.
**This PR**: Instead of calling the python bindings for `torch.profiler.record_function`, directly call the C++ implementation. This simplifies a lot of the code for binding C/C++. It also improves performance; previously there was a lot of overhead in the "TorchDynamo Cache Lookup" event, making the event artificially take a long time. After this change the events now appear shorter, because there's less overhead in starting/stopping the event: in other words, the profiler no longer distorts the results as much.
**Performance results**:
I ran using the script below on a cpu-only 1.6GHz machine. I report the median time (from 100 measurements) of a "TorchDynamo Cache Lookup" event before and after this PR. I think it is reasonable to consider the difference to be due to a reduction in overhead.
<details>
<summary>Benchmarking script</summary>
```python
import torch

def fn(x, y):
    return (x * y).relu()

a, b = [torch.rand((4, 4), requires_grad=True) for _ in range(2)]

opt_fn = torch.compile(fn)
opt_fn(a, b)
opt_fn(a, b)

with torch.profiler.profile() as prof:
    opt_fn(a, b)
```
</details>
Median before PR: 198-228 us (median of 100, measured 5 times)
Median after PR: 27us
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108436
Approved by: https://github.com/anijain2305, https://github.com/jansel
Summary:
Earlier, the decomp was routing the _flash* variant to the _math variant, and this resulted in a failure during torch.export for a reason I couldn't trace.
However, it seems that we should really have a decomp for
scaled_dot_product_attention, instead of
scaled_dot_product_flash_attention. Right?
This diff adds that. It also adds a test to check that a model exported
via two-stage export has the op decomposed. This test needs improvement
to figure out what the core aten opset is and check for anything that is
not inside it.
Test Plan:
test_model_exports_to_core_aten
Reviewers:
Subscribers:
Tasks:
Tags:
Differential Revision: [D48917461](https://our.internmc.facebook.com/intern/diff/D48917461)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108371
Approved by: https://github.com/larryliu0820
Summary:
In SelectiveBuildOperator, we can specify the argument `include_all_overloads`. If True, all overloaded operators (for example, `aten::to.dtype_layout` and `aten::to.prim_Device` are considered overloads of `aten::to`) will be built and linked into the final binary. This can significantly increase the final binary size, which could be a deal breaker for on-device deployment.
In this diff, we make backward-compatible changes to add the new arguments `--not-include-all-overloads-static-root-ops` and `--not-include-all-overloads-closure-ops`. When they are set, we set the `include_all_overloads` flag to False for static root ops and closure ops, and rely on the code analyzer to decide which overloaded operators are actually used.
Test Plan:
- unit test
```
buck test //xplat/caffe2/tools:gen_operators_yaml_test
```
- See test plan in D48771544 where we reduce the shared lib file `libmrengine.lib` from 16653072 bytes to 13686032 bytes.
- See detailed document: https://fburl.com/gdoc/mc93h6kb
Reviewed By: larryliu0820
Differential Revision: D48772302
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108396
Approved by: https://github.com/larryliu0820
Summary:
This is a prototype for running extern fallback kernels with a host side proxy executor.
Sample of generated cpp wrapper call:
```
at::Tensor buf0; // output buffer
void* tensor_args_var_0[] = {&arg0_1, &arg0_1, &arg1_1, &arg0_1, &arg1_1, &buf0};
int64_t int_args_var_1[] = {81, 81, 7, 7, 7, 81};
proxy_executor->call_function("buf0", int_args_var_1, tensor_args_var_0);
```
- In my current implementation, the proxy executor interprets the raw pointers according to the op's schema.
This assumes that a custom op MUST have a valid schema registered with the Dispatcher. (I would like to validate this assumption.)
- I am using the callboxed() API of the custom kernels. This is inevitable, as we wish to have a single call_function API for all possible custom kernels.
- These are all the input argument types supported so far:
union Argument {
# Bool value does not matter
1: bool asNone;
2: TensorArgument asTensor;
3: list<TensorArgument> asTensors;
5: i64 asInt;
7: list<i64> asInts;
8: double asFloat;
9: list<double> asFloats;
10: string asString;
10.5: list<string> asStrings;
11: SymIntArgument asSymInt;
12: list<SymIntArgument> asSymInts;
13: ScalarType asScalarType;
14: MemoryFormat asMemoryFormat;
15: Layout asLayout;
16: Device asDevice;
17: bool asBool;
18: list<bool> asBools;
}
- We need a policy for handling unpopulated arguments with default values. Here are the options, which have BC implications:
1. Require the exported fx graph to explicitly populate default values if users don't specify them.
2. Require the cpp wrapper to explicitly populate default values if the fx graph doesn't specify them.
3. Have the proxy executor look up default values from the op schema.
For fixing T162112344
Test Plan:
frontend:
buck2 run mode/dev-sand mode/inplace -c fbcode.enable_gpu_sections=True sigmoid/frontend:export_main
test:
buck2 run mode/dev-sand //deeplearning/aot_inductor/test:test_custom_ops
backend:
buck2 run mode/dev-nosan //deeplearning/aot_inductor/fb:main
buck2 test 'fbcode//mode/opt' fbcode//caffe2/torch/fb/model_transform/experimental/benchmark/test:test_aot_inductor_benchmark -- --exact 'caffe2/torch/fb/model_transform/experimental/benchmark/test:test_aot_inductor_benchmark - test_aot_inductor_benchmark_cmf30x (caffe2.torch.fb.model_transform.experimental.benchmark.test.test_aot_inductor_benchmark.AOTInductorBenchmark)'
Reviewed By: suo
Differential Revision: D48747417
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108350
Approved by: https://github.com/izaitsevfb
In https://github.com/pytorch/pytorch/pull/106673 , I created a private API `_debug_get_cache_entry_list` to help pull out cache entries from compiled functions.
Recently, I find that @anijain2305 commented in the code that this API should be revisited, and so I created this PR.
First, this API cannot be removed even though the cache entry has become a first-class python class, `torch._C._dynamo.eval_frame._CacheEntry`. The facts that `extra_index` is static and `get_extra_state` is inline static make them inaccessible elsewhere, so `_debug_get_cache_entry_list` remains the only way for users to get all the cache entries from code.
Second, since the `torch._C._dynamo.eval_frame._CacheEntry` class is a python class, I simplified the C-side code and removed the need to create a namedtuple in the python code.
Third, I added a small improvement: if the argument is a function, we automatically pass its `__code__` to the API.
The above changes slightly alter the output, from a list of namedtuples to a list of `torch._C._dynamo.eval_frame._CacheEntry` objects. I will update the corresponding docs that use this API.
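A sketch of the updated usage (import path as I understand it; treat the details as illustrative):
```python
import torch
from torch._dynamo.eval_frame import _debug_get_cache_entry_list

def f(x):
    return x + 1

opt_f = torch.compile(f)
opt_f(torch.ones(3))

# the function itself is now accepted; its __code__ is extracted internally
for entry in _debug_get_cache_entry_list(f):
    print(entry)  # torch._C._dynamo.eval_frame._CacheEntry objects
```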
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108335
Approved by: https://github.com/jansel, https://github.com/anijain2305
Reland of PR #94924. The purpose of this PR is to deal with the complicated interactions between MKL and OpenMP.
There are two improvements:
1. It uses a flag to avoid infinite mutual recursion in calling find_package(MKL) and find_package(OpenMP) in some cases.
2. The logic of finding iomp5 is improved and now we can test MKLDNN under ASAN.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104224
Approved by: https://github.com/malfet
Summary:
During the convert step, observers are first replaced by a Q-DQ pair. In some
scenarios like the following, the output DQ has a fan-out:
---> OP2 -> Q -> DQ
/
OP -> Q -> DQ -
\
---> OP3 -> Q -> DQ
If either op OP2 or OP3 are configured to be quantized, then the input
is expected to quantized. In this case quantized equivalent of some
pattern, that quantizer asked to be quantized, should look like:
[DQ -> {pattern} -> Q]. However, in scenario like above where DQ node
is shared between multiple "quantized" patterns, boundary of "quantized"
pattern is not clear because DQ now belongs to multiple quantized
patterns.
This poses challenge for:
- Porting metadata: which "quantized" partition this DQ node belongs
- Quantized representation, equivalently, needs to identify
self-contained quantized pattern that is replaced by its equivalent pattern
that captures compute in the quantized precision.
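The direction exercised by `test_duplicate_dq_pass` below is to duplicate the shared DQ so each consuming pattern gets its own self-contained [DQ -> {pattern} -> Q] boundary; my sketch of the resulting graph:
```
                 ---> DQ -> OP2 -> Q -> DQ
                /
OP -> Q -------
                \
                 ---> DQ -> OP3 -> Q -> DQ
```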
Test Plan:
test_duplicate_dq_pass
Differential Revision: [D48663147](https://our.internmc.facebook.com/intern/diff/D48663147)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107900
Approved by: https://github.com/jerryzh168, https://github.com/andrewor14, https://github.com/leslie-fang-intel
ghstack dependencies: #107105, #107106, #107899
Summary:
In preparation for the metadata porting diff, weight quant annotation must happen via edge quantization, i.e. input_qspec_map.
Reason: metadata is ported by associating a DQ node's metadata with its
consumer, and a Q node's metadata with its producer.
Furthermore, such porting must be qualified via user intent, i.e. whether
the consumer of the DQ, or producer of the Q, actually specified the
intent to quantize.
By making the quantization annotation on the linear node's weight via
input_qspec_map, we can associate the DQ of [weight -> Q -> DQ]
with the linear module.
Test Plan:
CI
Differential Revision: [D48488414](https://our.internmc.facebook.com/intern/diff/D48488414)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107105
Approved by: https://github.com/jerryzh168
The dependency was added twice in CUDA and ROCm binaries: once as an installation dependency from builder, and again as an extra dependency for dynamo, for example:
```
Requires-Python: >=3.8.0
Description-Content-Type: text/markdown
License-File: LICENSE
License-File: NOTICE
Requires-Dist: filelock
Requires-Dist: typing-extensions
Requires-Dist: sympy
Requires-Dist: networkx
Requires-Dist: jinja2
Requires-Dist: fsspec
Requires-Dist: pytorch-triton (==2.1.0+e6216047b8)
Provides-Extra: dynamo
Requires-Dist: pytorch-triton (==2.1.0+e6216047b8) ; extra == 'dynamo'
Requires-Dist: jinja2 ; extra == 'dynamo'
Provides-Extra: opt-einsum
Requires-Dist: opt-einsum (>=3.3) ; extra == 'opt-einsum'
```
In the previous release, we needed to remove this part from `setup.py` to build release binaries https://github.com/pytorch/pytorch/pull/96010. With this, that step isn't needed anymore because the dependency will come from builder.
### Testing
Using the draft https://github.com/pytorch/pytorch/pull/108374 for testing and manually inspect the wheels artifact at https://github.com/pytorch/pytorch/actions/runs/6045878399 (don't want to go through all `ciflow/binaries` again)
* torch-2.1.0.dev20230901+cu121-cp39-cp39-linux_x86_64
```
Requires-Python: >=3.8.0
Description-Content-Type: text/markdown
Requires-Dist: filelock
Requires-Dist: typing-extensions
Requires-Dist: sympy
Requires-Dist: networkx
Requires-Dist: jinja2
Requires-Dist: fsspec
Requires-Dist: pytorch-triton (==2.1.0+e6216047b8) <-- This will be 2.1.0 on the release branch after https://github.com/pytorch/builder/pull/1515
Provides-Extra: dynamo
Requires-Dist: jinja2 ; extra == 'dynamo'
Provides-Extra: opt-einsum
Requires-Dist: opt-einsum (>=3.3) ; extra == 'opt-einsum'
```
* torch-2.1.0.dev20230901+cu121.with.pypi.cudnn-cp39-cp39-linux_x86_64
```
Requires-Python: >=3.8.0
Description-Content-Type: text/markdown
Requires-Dist: filelock
Requires-Dist: typing-extensions
Requires-Dist: sympy
Requires-Dist: networkx
Requires-Dist: jinja2
Requires-Dist: fsspec
Requires-Dist: pytorch-triton (==2.1.0+e6216047b8)
Requires-Dist: nvidia-cuda-nvrtc-cu12 (==12.1.105) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cuda-runtime-cu12 (==12.1.105) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cuda-cupti-cu12 (==12.1.105) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cudnn-cu12 (==8.9.2.26) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cublas-cu12 (==12.1.3.1) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cufft-cu12 (==11.0.2.54) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-curand-cu12 (==10.3.2.106) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cusolver-cu12 (==11.4.5.107) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cusparse-cu12 (==12.1.0.106) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-nccl-cu12 (==2.18.1) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-nvtx-cu12 (==12.1.105) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: triton (==2.1.0) ; platform_system == "Linux" and platform_machine == "x86_64" <--This is 2.1.0 because it already has https://github.com/pytorch/pytorch/pull/108423, but the package doesn't exist yet atm
Provides-Extra: dynamo
Requires-Dist: jinja2 ; extra == 'dynamo'
Provides-Extra: opt-einsum
Requires-Dist: opt-einsum (>=3.3) ; extra == 'opt-einsum'
```
* torch-2.1.0.dev20230901+rocm5.6-cp38-cp38-linux_x86_64
```
Requires-Python: >=3.8.0
Description-Content-Type: text/markdown
Requires-Dist: filelock
Requires-Dist: typing-extensions
Requires-Dist: sympy
Requires-Dist: networkx
Requires-Dist: jinja2
Requires-Dist: fsspec
Requires-Dist: pytorch-triton-rocm (==2.1.0+34f8189eae) <-- This will be 2.1.0 on the release branch after https://github.com/pytorch/builder/pull/1515
Provides-Extra: dynamo
Requires-Dist: jinja2 ; extra == 'dynamo'
Provides-Extra: opt-einsum
Requires-Dist: opt-einsum (>=3.3) ; extra == 'opt-einsum'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108424
Approved by: https://github.com/atalman
In the lowering we don't have `SymFloat` and `SymInt`, we just have `sympy.Expr`
so it is impossible to accurately determine the expected dtype of a `full` call.
For example, `sym_float(int_expr)` has `is_integer=True` but should be treated
as a float. In the decomposition though, we can get this right.
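For illustration, a hedged sketch of the dtype inference the decomposition can do (simplified; not the actual decomp code): at this level we still see the Python wrapper type (`SymFloat` vs `SymInt`), not a bare `sympy.Expr`.
```python
import torch

def full_dtype(fill_value):
    # bool must be checked before int, since bool is a subclass of int
    if isinstance(fill_value, bool):
        return torch.bool
    if isinstance(fill_value, (int, torch.SymInt)):
        return torch.int64
    if isinstance(fill_value, (float, torch.SymFloat)):
        # sym_float(int_expr) arrives here even if the underlying sympy
        # expression reports is_integer=True
        return torch.get_default_dtype()
    raise TypeError(f"unexpected fill value {fill_value!r}")
```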
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108443
Approved by: https://github.com/lezcano
**Motivation:**
Currently, for the following code that exports cond operator:
```python
import torch
from functorch.experimental.control_flow import cond

class MySubModule(torch.nn.Module):
    def foo(self, x):
        return x.cos()

    def forward(self, x):
        return self.foo(x)

class CondBranchClassMethod(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.subm = MySubModule()

    def bar(self, x):
        return x.sin()

    def forward(self, x):
        return cond(x.shape[0] <= 2, self.subm.forward, self.bar, [x])

from torch._export import capture_pre_autograd_graph

example_inputs = (torch.randn(1, 3, 3, 3),)
m = CondBranchClassMethod()
m.eval()
gm = capture_pre_autograd_graph(m, example_inputs)
print(gm)

# source_fn for original cond op, getattr submodule op are all cond op
for n in gm.graph.nodes:
    print("n:", n.format_node(), n.meta)
print("\n\n\n")

# source_fn for submodule nodes are all cond op
# Expected: ideally this should be the real ops, e.g. torch.sin, aten.cos, etc
for n in gm.submodule_0.graph.nodes:
    print("n:", n.format_node(), n.meta)
```
Output is like below:
```
GraphModule(
  (submodule_0): GraphModule()
  (submodule_1): GraphModule()
)

def forward(self, arg_0):
    arg0_1, = fx_pytree.tree_flatten_spec([arg_0], self._in_spec)
    submodule_0 = self.submodule_0
    submodule_1 = self.submodule_1
    cond = torch.ops.higher_order.cond(True, submodule_0, submodule_1, [arg0_1]);  submodule_0 = submodule_1 = arg0_1 = None
    return pytree.tree_unflatten((cond,), self._out_spec)
# To see more debug info, please use `graph_module.print_readable()`
n: %arg0_1 : [num_users=1] = placeholder[target=arg0_1] {'val': FakeTensor(..., size=(1, 3, 3, 3)), 'tensor_meta': None, 'is_torch_exported': True, 'stack_trace': 'NoneType: None\n'}
n: %submodule_0 : [num_users=1] = get_attr[target=submodule_0] {'stack_trace': 'NoneType: None\n', 'source_fn': ('cond', <torch._ops.HigherOrderOperator object at 0x7f68ae93efd0>), 'original_aten': None, 'from_node': [('cond', <torch._ops.HigherOrderOperator object at 0x7f68ae93efd0>), ('conditional', <torch._ops.HigherOrderOperator object at 0x7f68ae93efd0>), ('cond', <torch._ops.HigherOrderOperator object at 0x7f68ae93efd0>)], 'seq_nr': -1}
n: %submodule_1 : [num_users=1] = get_attr[target=submodule_1] {'stack_trace': 'NoneType: None\n', 'source_fn': ('cond', <torch._ops.HigherOrderOperator object at 0x7f68ae93efd0>), 'original_aten': None, 'from_node': [('cond', <torch._ops.HigherOrderOperator object at 0x7f68ae93efd0>), ('conditional', <torch._ops.HigherOrderOperator object at 0x7f68ae93efd0>), ('cond', <torch._ops.HigherOrderOperator object at 0x7f68ae93efd0>)], 'seq_nr': -1}
n: %cond : [num_users=1] = call_function[target=torch.ops.higher_order.cond](args = (True, %submodule_0, %submodule_1, [%arg0_1]), kwargs = {}) {'stack_trace': 'NoneType: None\n', 'source_fn': ('cond', <torch._ops.HigherOrderOperator object at 0x7f68ae93efd0>), 'original_aten': None, 'from_node': [('cond', <torch._ops.HigherOrderOperator object at 0x7f68ae93efd0>), ('conditional', <torch._ops.HigherOrderOperator object at 0x7f68ae93efd0>), ('cond', <torch._ops.HigherOrderOperator object at 0x7f68ae93efd0>)], 'seq_nr': -1, 'val': FakeTensor(..., size=(1, 3, 3, 3)), 'tensor_meta': None, 'is_torch_exported': True}
n: return (cond,) {'stack_trace': 'NoneType: None\n', 'from_node': [('output', 'output')], 'seq_nr': -1, 'is_torch_exported': True, 'val': (FakeTensor(..., size=(1, 3, 3, 3)),), 'tensor_meta': (None,)}
n: %arg0_1 : [num_users=1] = placeholder[target=arg0_1] {'stack_trace': ' File "<ipython-input-9-2a8c7c0498ed>", line 36, in forward\n return cond(x.shape[0] <= 2, self.subm.forward, self.bar, [x])\n', 'source_fn': ('cond', <torch._ops.HigherOrderOperator object at 0x7f68ae93efd0>), 'original_aten': None, 'from_node': [('cond', <torch._ops.HigherOrderOperator object at 0x7f68ae93efd0>), ('arg0_1', 'arg0_1')], 'seq_nr': -1, 'val': FakeTensor(..., size=(1, 3, 3, 3)), 'tensor_meta': None}
n: %cos_default : [num_users=1] = call_function[target=torch.ops.aten.cos.default](args = (%arg0_1,), kwargs = {}) {'stack_trace': ' File "<ipython-input-9-2a8c7c0498ed>", line 36, in forward\n return cond(x.shape[0] <= 2, self.subm.forward, self.bar, [x])\n', 'source_fn': ('cond', <torch._ops.HigherOrderOperator object at 0x7f68ae93efd0>), 'original_aten': <OpOverload(op='aten.cos', overload='default')>, 'from_node': [('cond', <torch._ops.HigherOrderOperator object at 0x7f68ae93efd0>), ('cos', <OpOverload(op='aten.cos', overload='default')>), ('cos_default', <OpOverload(op='aten.cos', overload='default')>)], 'seq_nr': -1, 'val': FakeTensor(..., size=(1, 3, 3, 3)), 'tensor_meta': None}
n: return cos_default {'stack_trace': ' File "<ipython-input-9-2a8c7c0498ed>", line 36, in forward\n return cond(x.shape[0] <= 2, self.subm.forward, self.bar, [x])\n', 'source_fn': ('cond', <torch._ops.HigherOrderOperator object at 0x7f68ae93efd0>), 'original_aten': None, 'from_node': [('cond', <torch._ops.HigherOrderOperator object at 0x7f68ae93efd0>), ('output', 'output')], 'seq_nr': -1, 'val': FakeTensor(..., size=(1, 3, 3, 3)), 'tensor_meta': None}
```
As we can see, the meta of the nodes in the subgraphs is overridden with the cond op's metadata. This is because _set_current_meta is only invoked on the top-level graph module in the interpreter. When we call into cond and deal with the submodules, we didn't properly set current_meta to the meta of the subgraph's nodes.
**Implementation:**
This PR fixes it as follows: in trace_cond, we optionally use an fx.Interpreter to interpret the subgraphs so that the metadata is preserved, but only when the following conditions are satisfied:
- The subgraphs are GraphModules: this is necessary for using fx.Interpreter.
- The current make_fx has preserve_node_meta turned on (as is the case for capture_pre_autograd_graph).
**Test Plan**
See added tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108356
Approved by: https://github.com/SherlockNoMad
# Summary
## PR Dependencies
I don't use ghstack :( this is a PR where it would have been helpful. That being said, I am going to peel off some PRs to make reviewing this easier:
- [x] Separate build flags for Flash and MemEff: #107985
### Description
This pull request updates the version of _scaled_dot_product_flash_attention from version 1 to version 2. The changes are based on the flash attention code originally authored by @tridao
### Changes Made
The majority of the changes in this pull request involve:
- Copying over the flash_attention sources.
- Updating header files.
- Removing padding and slicing code from within the flash_attention kernel and relocating it to the composite implicit region of the SDPA. This was needed to make the kernel functional and appease autograd.
- Introducing a simple kernel generator to generate different instantiations of the forward and backward flash templates.
- Adding conditional compilation (ifdef) to prevent building when nvcc is invoked with gencode < sm80.
- Introducing a separate dependent option for mem_eff_attention, as flash_attention v2 lacks support for Windows and cannot be built for sm50 generation codes.
- Modifying build.sh to reduce parallelization on sm86 runners and to lower the maximum parallelization on the manywheel builds. This adjustment was made to address out-of-memory issues during the compilation of FlashAttentionV2 sources.
- Adding/Updating tests.
### Notes for Reviewers
This is not a fun review, and I apologize in advance.
Most of the files-changed are in the flash_attn/ folder. The only files of interest here IMO:
- aten/src/ATen/native/transformers/cuda/flash_attn/flash_api.cpp
- aten/src/ATen/native/transformers/cuda/flash_attn/kernels/generate_kernels.py ( this has been incorporated upstream to flash-attention github)
There are a number of files all related to avoiding OOMs in CI/CD. These are typically shell scripts.
### Follow up items
- Include the updates from e07aa036db and 9e5e8bc91e | https://github.com/pytorch/pytorch/issues/108108
### Work Items
- [x] I don't think Windows will be supported for 3.1.0 - Need to update cmake
- [x] Let multi_query/attention pass through and test | UPDATE: I have the fast path implemented here: https://github.com/pytorch/pytorch/pull/106730 but since this will require changes to semantics of math to call repeat_interleave, I think this should be done as a followup.
- [x] Had to drop cutlass back to 3.0.0 to get it to compile. Need to figure out how to upgrade to 3.1.0 and later. Spoke with Tri and he is going to be taking a look. Note: compiling with clang currently errors for the cute headers.
- [x] Update test exercise above codepath
- [x] Still need to disable on seq_len % 128 != 0 for backward( Tri beat me to it a4f148b6ab)
- [x] Add determinism warning to BWD, Tri got to this one as well: 1c41d2b
- [x] Update dispatcher to universally prefer FlashV2
- [x] Update tests to exercise new head_dims
- [x] Move the head_dim padding from kernel to top level composite implicit function in order to make it purely functional
- [x] Create template generator script
- [x] Initial cmake support for building kernels/ folder
- [x] Replay CudaGraph changes
### Results
#### Forward only
The TFlops reported here are on an underclocked A100.

#### Forward+Backward
Ran a sweep, and for large compute-bound sizes we see a ~2x performance increase for forward+backward.
<img width="1684" alt="Screenshot 2023-07-20 at 3 47 47 PM" src="https://github.com/pytorch/pytorch/assets/32754868/fdd26e07-0077-4878-a417-f3a418b6fb3b">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105602
Approved by: https://github.com/huydhn, https://github.com/cpuhrsch
This PR fixes the test failures in the `jiterator` implementation of the `tan` and `tanh` unary kernels, as mentioned in #100842.
The failures were fixed by adjusting tolerances, but some failures in `test_unary_ufuncs.py` required adjusting input values as well. Since the jiterator kernels use libstdc++, the supported input range is smaller than that of the thrust implementation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102427
Approved by: https://github.com/malfet
The current pattern transforms:
```
ones([x, y]).cumsum(1) -> arange(1, 1 + y).expand([x, y])
```
but this generalizes it to
```
full(shape, fill_value).cumsum(d) ->
(fill_value * arange(1, 1 + shape[d])).view([1..., shape[d], 1...]).expand(shape)
```
So we handle any fill value, any number of dimensions, and broadcasting to any dimension.
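As a concrete check of the generalized rewrite, a standalone sketch (not the inductor pattern code itself):
```python
import torch

def decomposed_full_cumsum(shape, fill_value, d):
    # full(shape, fill_value).cumsum(d) rewritten as a scaled arange,
    # viewed to be broadcastable along dim d, then expanded to shape.
    r = fill_value * torch.arange(1, 1 + shape[d], dtype=torch.float32)
    view_shape = [1] * len(shape)
    view_shape[d] = shape[d]
    return r.view(view_shape).expand(shape)

shape, fill_value, d = (3, 4), 2.0, 1
expected = torch.full(shape, fill_value).cumsum(d)
assert torch.equal(decomposed_full_cumsum(shape, fill_value, d), expected)
```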
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108373
Approved by: https://github.com/lezcano
This PR adds SimpleProfiler for FSDP state_dict/load_state_dict logging purposes. SimpleProfiler uses class variables to record profiling results and does everything in Python, which can be slow. So it is only suitable for logging slow actions such as initialization and state_dict/load_state_dict.
This PR uses SimpleProfiler to log some critical/slow paths of the model and optimizer state_dict/load_state_dict.
Differential Revision: [D48774406](https://our.internmc.facebook.com/intern/diff/D48774406/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108290
Approved by: https://github.com/wz337
#103503 addresses the situation where additional work is enqueued for the NCCL watchdog to poll during a graph capture---something we want to avoid as the subsequent polling will query an event and crash the capture.
However, there is currently no check that there was not work _already_ enqueued for the watchdog to poll. If there was already work that was enqueued and not cleaned up before the start of a graph capture, then we run into a similar problem where the event query will crash the capture. We've observed this causing crashes on several workloads, although it is somewhat flaky (if the watchdog happens to poll just before the graph capture and cleanup, then we dodge the crash).
This is a bit of a tricky issue as it involves making sure that no process group has enqueued work at the start of a capture, and as such the simplest solution is to add a bit of global state to track all process groups. This PR forces the start of the graph capture to wait until all enqueued work is completed and cleaned up or times out.
I did consider the alternative of simply having the watchdog skip cleanup if we detect that we are in the middle of a graph capture, but I think deferring the cleanup until later could result in false positive timeouts (e.g., if we defer cleanup on work that has completed long ago, checking timers after the graph capture could yield a "timeout").
CC @Aidyn-A
@bottler @kwen2501 @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104487
Approved by: https://github.com/kwen2501
Summary: D41985889 removed the cast to int for the inputs to torch.histc below, allowing the inputs to still be tensors. These tensors still have requires_grad set to True, causing issues with the call to torch.histc.
Test Plan: buck2 test 'fbcode//mode/opt' fbcode//dper3/dper3/modules/low_level_modules/tests:stat_collector_test -- --exact 'dper3/dper3/modules/low_level_modules/tests:stat_collector_test - test_scripted_module (dper3.dper3.modules.low_level_modules.tests.stat_collector_test.StatCollectorTest_1)'
Differential Revision: D48800879
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108232
Approved by: https://github.com/jerryzh168
Fixes inference accuracy for `doctr_reco_predictor` and `pyhpc_turbulent_kinetic_energy`.
For the `same(float, float)` comparison we weren't going through the more rigorous tensor comparison path which takes into account the fp64 base results.
Also return True when the fp64 base results are not well formed (NaN).
I debugged these models and the source of divergence were innocuous:
`doctr_reco_predictor` - can be fixed by turning off layout optimization, decomp for batch norm
`pyhpc_turbulent_kinetic_energy` - divergence caused because fused kernel keeps precision in fp32 instead of casting back and forth from/to fp32/bf16. Fused kernel is better precision, anyway.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108202
Approved by: https://github.com/jansel
Summary:
Enables dynamo eager mode tracing for the following situation:
1. we have a torch.autograd.Function
2. the input to that function is a tensor subclass which is an intermediary
This is useful for float8 training UX.
Test Plan:
```
python test/dynamo/test_autograd_function.py -k intermediary_input
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108093
Approved by: https://github.com/bdhirsh, https://github.com/wanchaol
Summary:
Previously we run propagate_annotation by default in quantization flow to propagate annotations for ops like reshape, view etc.
Not all quantizers would need this so we moved this to xnnpack_quantizer_utils for now.
Next Step:
* make propagate_annotation function configurable with a custom list of ops
* remove unneeded ops in `_is_share_obs_or_fq_op`
Test Plan:
python test/test_quantization.py TestQuantizePT2E
Differential Revision: [D48856985](https://our.internmc.facebook.com/intern/diff/D48856985)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108320
Approved by: https://github.com/kimishpatel
Summary:
FBGEMM uses `self.iter.is_cuda` to check if the tensor is for CUDA. This diff enables similar feature `self.iter.is_mtia` for tensors with MTIA device key.
Test Plan: See diff D48693225
Reviewed By: jackm321
Differential Revision: D48809191
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108310
Approved by: https://github.com/albanD
**Summary**
Fix https://github.com/pytorch/pytorch/issues/100565 by allowing the float32 data type when Autocast CPU is disabled. The resulting behavior is:
- When autocast is disabled and the user passes in a float data type, it works well.
- When autocast is enabled and the user passes in a float data type, a `UserWarning: In CPU autocast, but the target dtype is not supported. Disabling autocast.` is raised and autocast is disabled automatically.
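A minimal sketch of the resulting behavior, assuming a CPU build where autocast otherwise targets bfloat16:
```python
import torch

x, y = torch.randn(2, 2), torch.randn(2, 2)

# Works after this fix: float32 is accepted when autocast is disabled.
with torch.autocast(device_type="cpu", dtype=torch.float32, enabled=False):
    assert torch.mm(x, y).dtype == torch.float32

# With autocast enabled, the unsupported dtype only emits a UserWarning
# and disables autocast rather than erroring.
with torch.autocast(device_type="cpu", dtype=torch.float32, enabled=True):
    assert torch.mm(x, y).dtype == torch.float32
```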
**TestPlan**
```
python -u -m pytest -s -v test_autocast.py -k test_autocast_disabled_with_fp32_dtype
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107348
Approved by: https://github.com/jgong5, https://github.com/Neilblaze, https://github.com/albanD
This PR:
1) Add a device_mesh kwarg to FSDP. Remove init_device_mesh() from _runtime_utils.py, as device_mesh would be passed in by the user as a kwarg.
2) change use_dtensor flag for state_dict_config and optim_state_dict_config to be private. If device_mesh is used with sharded model/optim state dict, _use_dtensor flag would be set to True and model/optim state dict would return dtensor state_dict. Otherwise, _use_dtensor flag would be set to False and model/optim state dict would return sharded_tensor state_dict.
3) Update _optim_utils.py, _shard_utils.py, and _state_dict_utils.py to add support for HSDP to return 2D DTensor state_dict.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107533
Approved by: https://github.com/fegin, https://github.com/awgu, https://github.com/wanchaol
By refactoring `_local_scalar_dense_mps` to use `_empty_like` to allocate the CPU tensor.
Also, print a more reasonable error message when the dst dim is less than the src in `mps_copy_`.
This fixes regression introduced by https://github.com/pytorch/pytorch/pull/105617 and adds regression test.
<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at abd06e6</samp>
> _Sing, O Muse, of the valiant deeds of the PyTorch developers_
> _Who strive to improve the performance and usability of tensors_
> _And who, with skill and wisdom, fixed a bug in the MPS backend_
> _That caused confusion and dismay to many a user of `item()`_
Fixes https://github.com/pytorch/pytorch/issues/107867
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107913
Approved by: https://github.com/albanD
# Summary
As a stop gap to supporting the FP8 Dtype within inductor we would like to fallback to eager. Currently there are 3 ops that are needed for this:
`_scaled_mm` ( matmul for fp8 types)
`clone` (for creating new copies of fp8 tensors)
`to` ( for converting to and from fp8 types).
This PR registers a fallback for `_scaled_mm` and adds fp8 to trigger `unsupported_input_tensor`.
Prior to these changes this was failing with:
``` Shell
File "/home/drisspg/meta/pytorch/torch/_inductor/codegen/triton_utils.py", line 11, in signature_of
tye = JITFunction._type_of(arg.dtype)
File "/home/drisspg/miniconda3/envs/dev/lib/python3.10/site-packages/triton/runtime/jit.py", line 229, in _type_of
return key if isinstance(key, str) else f"*{tys[dtype_str]}"
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
KeyError: 'float8_e4m3fn'
```
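The registration itself is small; a sketch of the shape of the change (the real edit lives in inductor's lowering registry):
```python
import torch
from torch._inductor.lowering import make_fallback

# Route aten._scaled_mm to the eager kernel instead of Triton codegen.
make_fallback(torch.ops.aten._scaled_mm.default)
```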
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108293
Approved by: https://github.com/peterbell10
Marks all params/optimizer state as static addresses and adds a finalizer which cleans up the graph attributes when the optimizer goes out of scope.
**Note:** this does not mark grads as static because this would increase memory usage significantly.
There are two cases:
1. The upstream graph is cudagraphed - this case will work fine OOTB
2. The upstream graph is not cudagraphed - in this case, there will be a lot of copies introduced from the upstream (to copy the grads) into cudagraphed-owned memory, unless the user explicitly marks the grads as static. If the user does this, this will also require not deallocating the grads in zero_grad() (either the mod or optimizer version) by setting them to zero vs None. There is a PR (https://github.com/pytorch/pytorch/pull/107853) in flight to throw an error if zero_grad attempts to set static grads to None.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107504
Approved by: https://github.com/eellison
torch.empty_strided is able to create a new tensor based on the meta data. For a boolean tensor, however, we call clone directly. If the input is a functional tensor, clone returns a functional tensor that won't be tracked by the tracer's tensor_tracker after dispatching, so it becomes a tensor_constant in the graph during create_arg. We therefore manually unwrap the functional tensor before calling clone.
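A rough sketch of the workaround using the private functional-tensor bindings (hedged; the actual change lives in the export tracing path):
```python
import torch

def safe_clone_for_tracing(t: torch.Tensor) -> torch.Tensor:
    # Unwrap a functional tensor before cloning so the clone is produced on
    # the plain tensor and stays tracked by the tracer, instead of turning
    # into a tensor_constant at create_arg time.
    if torch._is_functional_tensor(t):
        t = torch._from_functional_tensor(t)
    return t.clone()
```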
Test Plan:
See added test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108289
Approved by: https://github.com/angelayi
This PR fixes the new_empty_strided op to become replicated from sharded
when necessary; this is a quick fix to resolve https://github.com/pytorch/pytorch/issues/107661.
We'll need to think more about the behavior of this op when it comes to
sharding. One possibility is to follow the input sharding, but given that the
output shape of this op might not be the same as the input's, it's hard to
say we should follow the input sharding; further improvement is needed once
we figure out the op syntax.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107835
Approved by: https://github.com/fduwjj
For the inductor cpu backend, scatter_add uses `atomic_add`, which gives worse performance. For now, we fall back to eager for it to avoid a performance regression compared with eager mode (single socket of SKX):
```
basic_gnn_gin 1.16x(after) Vs 0.509x(before)
basic_gnn_sage 1.064x(after) Vs 0.496x (before)
basic_gnn_gcn 1.373x(aftre) Vs 0.720x(before)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108220
Approved by: https://github.com/jgong5, https://github.com/desertfire
Before `test_op_has_batch_rule_cholesky_solve_cpu_float32` failed:
```
PYTORCH_TEST_WITH_DYNAMO=1 pytest test/functorch/test_vmap.py -k test_op_has_batch_rule_cholesky_solve_cpu_float32
test/functorch/test_vmap.py terminate called after throwing an instance of 'pybind11::error_already_set'
what(): RuntimeError: /home/kshiteej/Pytorch/pytorch_functorch/build/aten/src/ATen/RegisterCompositeExplicitAutograd.cpp:2214: SymIntArrayRef expected to contain only concrete integers
```
After this PR the test passes.
NOTE: We can't be 100% sure of the tests on CI till we figure out https://github.com/pytorch/pytorch/issues/107444
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107947
Approved by: https://github.com/zou3519
Summary:
There's a deadlock in the current storage implementation if the size of the tensor is too large. Use ctypes to do serialization.
Test Plan:
python benchmarks/dynamo/huggingface.py --bfloat16 --accuracy --inference --device cuda --export-aot-inductor --only MT5ForConditionalGeneration
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108287
Approved by: https://github.com/desertfire, https://github.com/malfet
We allow registering decomps for HigherOrderOp via the existing decomp
mechanisms:
- I refactored those APIs to accept torch._ops.OperatorBase, which is the base
class for torch.ops.HigherOrderOperator and torch.ops.OpOverload
- HigherOrderOps must directly call maybe_handle_decomp in their
ProxyTorchDispatchMode handling in order to resolve decompositions. We
can change this in the future so that they do not need to do this.
Next, we add an inductor decomp for out_dtype. This decomp shouldn't be
generally available because we want to preserve out_dtype to the backend
for other use cases (i.e. executorch).
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108080
Approved by: https://github.com/HDCharles
Addresses [issue #106085](https://github.com/pytorch/pytorch/issues/106085).
In `torch/nn/modules/rnn.py`:
- Adds documentation string to RNNBase class.
- Adds parameters to `__init__` methods for the RNN, LSTM, and GRU classes.
- Adds type annotations to `__init__` methods for RNN, LSTM, and GRU.
In `torch/ao/nn/quantized/dynamic/modules/rnn.py`:
- Adds type specifications to `_FLOAT_MODULE` attributes in RNNBase, RNN, LSTM, and GRU classes.
> This resolves a `mypy` assignment error `Incompatible types in assignment (expression has type "Type[LSTM]", base class "RNNBase" defined the type as "Type[RNNBase]")` that seemed to be a result of fully specified type annotations in `torch/nn/modules/rnn.py`).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106222
Approved by: https://github.com/mikaylagawarecki
Inductor kernel codegen previously had the following side effects:
- in `Kernel.__exit__`, we add locally used buffers to graph.removed_buffers
- during codegen, we do memory allocation/free.
These make doing multiple versions of codegen for the same kernel hard. This PR refactors the code so that kernel codegen does not change graph-level state. After codegening a kernel, the graph-level state is unchanged, so we can go on to codegen another version of the kernel if we want.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107617
Approved by: https://github.com/jansel
The argument order for the legacy path got swapped in a recent patch.
Because there is still a blog post documenting the legacy interface,
people are hitting this pathway.
This patch fixes #108208.
I will also update the blog post to the new API so that people are
more likely to use the newer `_record_memory_history` API.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108260
Approved by: https://github.com/awgu
We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError` etc.
This results in messy code during error handling somewhat like this:
```
if "NCCL" in exception_str:
...
if "Timed out initializing process group in store based barrier on rank" in exception_str:
...
if "The client socket has timed out after" in exception_str:
...
if "Broken pipe" in exception_str:
...
if "Connection reset by peer" in exception_str:
...
```
To address this issue, in this PR I've added these error types:
1. **DistError** - the base type of all distributed errors
2. **DistBackendError** - this already existed and referred to PG backend errors
3. **DistStoreError** - for errors originating from the store
4. **DistNetworkError** - for general network errors coming from the socket library
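With the hierarchy in place, handlers can dispatch on type instead of substring matching; a minimal sketch, assuming the new types are exposed at the `torch.distributed` top level like the existing `DistBackendError`:
```python
import torch.distributed as dist

try:
    dist.init_process_group("nccl")
except dist.DistNetworkError:
    pass  # socket-level failures: broken pipe, connection reset, timeouts
except dist.DistStoreError:
    pass  # store failures, e.g. a store-based barrier timing out
except dist.DistBackendError:
    pass  # backend (e.g. NCCL) failures
except dist.DistError:
    pass  # any other distributed error
```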
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108191
Approved by: https://github.com/H-Huang
Summary: In certain cases, the content of some inputs is important for consistent behavior (and performance signals) of an operator. One example is the fbgemm jagged tensor operators, where the offsets Tensor's content must be consistent with the shape of the values Tensor (i.e. `values.size(0) == offsets[-1]`, plus monotonicity).
This is particularly important in the context of autotuning, where the inputs are currently generated as random (for float types) or all-zero (for int types) `torch.Tensor`s. Even if the extern kernel and Triton template are robust enough to tolerate improper input content, the performance signals would likely be useless.
In this PR, we add an option to pass input-generating functions for a subset of inputs of the autotuned op (to `AlgorithmSelectorCache.__call__`).
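As a rough sketch of how a caller could use this (the generator names and accessors here are my assumptions, not the exact inductor API):
```python
import torch

# Hypothetical generator for a jagged-tensor offsets input: produce monotonic
# int64 offsets instead of the default all-zero int tensor.
def gen_offsets(offsets_node) -> torch.Tensor:
    (n,) = offsets_node.get_size()  # assumed ir.Buffer-style size accessor
    return torch.arange(0, int(n), dtype=torch.int64, device="cuda") * 4

# Mapped by input index when requesting autotuning, roughly:
# algorithm_selector(choices, input_nodes, layout,
#                    input_gen_fns={1: gen_offsets})
```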
Test Plan:
```
$ python test/inductor/test_max_autotune.py
...
----------------------------------------------------------------------
Ran 17 tests in 80.146s
OK
```
Differential Revision: [D48831225](https://our.internmc.facebook.com/intern/diff/D48831225)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108242
Approved by: https://github.com/jansel
When a test fails, we will now emit fine grained details about how accurately heuristics predicted the relevance of that test.
## Context
Why only look at failing tests? Our only signal that a PR is most likely relevant to a test is whether or not a test fails on it. Green tests don't tell us if the success was due to the code being good vs being irrelevant. This isn't a perfect measure, since it can miscategorize unstable and flaky failures as having been "missed" by the heuristics, but it's a reasonable approximation.
## What's measured?
The metrics this PR collects are designed to answer the following questions
### How comprehensive are the heuristics?
- What's the false negative rate, the % of failures that ideally should have been prioritized but weren't? (Both at an aggregate level and at a per heuristic level)
### How precise are the heuristics?
- What % of failed tests were prioritized by a given heuristic? What % were prioritized overall?
- How relevant was a failed test considered to be? (Both at an aggregate level and at a per-heuristic level)
- What % of the time was a given heuristic prioritizing a failing test higher than any other heuristic?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108192
Approved by: https://github.com/huydhn
ghstack dependencies: #108117
Summary: Existing code is incorrectly overwriting the stacktrace to be None: since there is no exception happening, `traceback.format_exc` is None. Also, we should only populate the stack trace if it is not there in the first place.
Test Plan: CI
Differential Revision: D48818478
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108217
Approved by: https://github.com/zhxchen17
Fixed test_memory_profiler::TestMemoryProfilerE2E::test_memory_timeline by changing the (arbitrary) threshold for logging. With increased size allocations on newer AMD GPUs, a higher threshold over 1024 is preferred to account for those differences and yet satisfy the test requirements.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108010
Approved by: https://github.com/colesbury, https://github.com/pruthvistony
Summary: This commit adds a public facing
`torch.ao.quantization.move_model_to_eval` util function
for QAT users. Instead of calling model.eval() on an exported
model (which doesn't work, see
https://github.com/pytorch/pytorch/issues/103681), the user
would call this new util function instead. This ensures special
ops such as dropout and batchnorm (not supported yet) will have
the right behavior when the graph is later used for inference.
Note: Support for an equivalent `move_model_to_train` will be
added in the future. This is difficult to do for dropout
currently because the eval pattern of dropout is simply a clone
op, which we cannot just match and replace with a dropout op.
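A minimal usage sketch, assuming the util is exposed as described above (the model here is hypothetical):
```python
import torch
from torch._export import capture_pre_autograd_graph

model = MyQATModel()  # hypothetical user model containing dropout
example_inputs = (torch.randn(1, 3, 224, 224),)
exported = capture_pre_autograd_graph(model, example_inputs)

# exported.eval() does not change dropout behavior on an exported graph
# (https://github.com/pytorch/pytorch/issues/103681); call the util instead:
torch.ao.quantization.move_model_to_eval(exported)
```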
Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_move_model_to_eval
Reviewers: jerryzh168, kimishpatel
Subscribers: jerryzh168, kimishpatel, supriyar
Differential Revision: [D48814735](https://our.internmc.facebook.com/intern/diff/D48814735)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108184
Approved by: https://github.com/jerryzh168
`scaled_dot_product_attention` used to be decomposed in pre-autograd, given that it calls `_scaled_dot_product_attention_math` and `_scaled_dot_product_attention_math` only has a `CompositeImplicitAutograd` kernel. As a result it's decomposed into ops with finer granularity.
However recent PRs (#103826#105131) added new logic in `scaled_dot_product_attention` and now it calls `_scaled_dot_product_flash_attention` which contains a CPU kernel. This results in `_scaled_dot_product_flash_attention` showing up in `torch.export()`. This PR adds a decomposition that ensures `scaled_dot_product_attention` is still being decomposed the same way as before, i.e., going through `_scaled_dot_product_attention_math`. Notice that this decomp rule should be excluded by inductor.
Differential Revision: [D48762000](https://our.internmc.facebook.com/intern/diff/D48762000/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108180
Approved by: https://github.com/SherlockNoMad
This logic exists for index_put and index_add, but for some reason not for `index.out`
Skip testing, as this function is not technically exposed on the Python level.
<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at c688cfd</samp>
> _`index_out` checks types_
> _avoiding errors in autumn_
> _complex tensors work_
Fixes https://github.com/pytorch/pytorch/issues/107698
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108167
Approved by: https://github.com/albanD
The compute_local_shape_and_global_offset API does the following:
1) Calculates both local_shape and global_offset in one API call, replacing two API calls (compute_local_size and compute_local_shape).
2) Generates the correct global_offset for checkpointing purposes. We are currently using compute_local_offset for downstream checkpoint components, which could lead to incorrect results. For checkpointing, we need global_offset instead of local_offset. In some cases, global_offset does not equal local_offset, e.g. when a dimension is sharded multiple times on different mesh dimensions (e.g. placements = [Shard(0), Shard(0)]); see the sketch after the list below.
Follow-up PRs:
1) Replace related downstream components to use compute_local_shape_and_global_offset instead of compute_local_size and compute_local_offset.
2) Audit the existing code base to see if we can then remove compute_local_size and compute_local_offset, since they are still being used.
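A hedged sketch of the double-sharded case (import paths and mesh setup are my assumptions; illustrative only):
```python
import torch
from torch.distributed._tensor import DeviceMesh, Shard
from torch.distributed._tensor._utils import compute_local_shape_and_global_offset

# Assumes a 4-rank process group; mesh coordinates are (i, j) on a 2x2 mesh.
mesh = DeviceMesh("cpu", torch.arange(4).reshape(2, 2))

# A global tensor of shape (8,) with placements=[Shard(0), Shard(0)] shards
# dim 0 twice: into chunks of 4 (mesh dim 0), then chunks of 2 (mesh dim 1).
local_shape, global_offset = compute_local_shape_and_global_offset(
    (8,), mesh, [Shard(0), Shard(0)]
)
# On mesh coordinate (1, 0): local_shape == (2,), global_offset == (4,)
# (= 1 * 4 + 0 * 2), while the per-mesh-dim local offset alone would be 0.
```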
cc. @wanchaol
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107996
Approved by: https://github.com/wanchaol
We'd like to benchmark fusion (either for autotuning or for gathering data to find patterns that can guide optimizations). There is a deadlock here that prevents us from doing this: to benchmark fusion, we need to do codegen before all the fusions are done. However, codegen currently relies on xSchedulerNode.last_usage information to decide which buffers are not needed at all and thus don't even need to be allocated/written (Scheduler.removed_buffers tracks this). xSchedulerNode.last_usage can only be computed once the order of all the nodes has been decided, but each fusion pass (`fuse_nodes_once`) can also change node order, so we know the final node order only after all the fusions have completed. That blocks us from doing codegen during fusion (before all fusions are done).
Here I just show the above with a chain of dependencies to make it easier to understand (a -> b means a depends on b, or b has to happen before a):
```
benchmark one fusion decision -> codegen -> xSchedulerNode.last_usage -> node order -> all fusions have completed
```
Actually we only need to decide whether a buffer has only local usages (if yes, it's a candidate for removal). This can be decided if we know all the users of each buffer, so we can avoid using xSchedulerNode.last_usage in this case.
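Roughly, the reworked check only needs each buffer's user set; a hedged sketch, not the actual scheduler code:
```python
def buffer_is_removable(buf_name, users, kernel_node_names):
    # A buffer can skip allocation iff every one of its users is a node
    # inside the kernel being generated -- a property of the user set
    # alone, independent of the final node order.
    return users[buf_name] <= set(kernel_node_names)
```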
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107320
Approved by: https://github.com/peterbell10, https://github.com/jansel
There are more issues than I expected at the beginning:
* Triton was uploaded on `main` instead of `nightly` and release branch
* The environment `conda-aws-upload` wasn't used correctly in both wheel and conda upload
* Conda update wasn't run in a separate ephemeral runner
* Duplicated upload logic; should have just used `bash .circleci/scripts/binary_upload.sh` instead
* Handle `CONDA_PYTORCHBOT_TOKEN` and `CONDA_PYTORCHBOT_TOKEN_TEST` tokens in a similar way as https://github.com/pytorch/test-infra/pull/4530
Part of https://github.com/pytorch/pytorch/issues/108154
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108187
Approved by: https://github.com/atalman
This cherrypicks the reinterpret_tensor change from #102625 in order to fix a subtle correctness bug when the graph inputs already have a storage_offset set.
The view change also fixes some issues with quantized models in torchbench.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108168
Approved by: https://github.com/desertfire
Summary:
Include the constants in the AOTInductor .so file.
We do not modify existing API signatures, but instead create the necessary format with the weights lifted out.
Test Plan:
test/inductor/test_aot_inductor.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107718
Approved by: https://github.com/angelayi, https://github.com/eellison
Summary:
Focus2 builds of some apps on apple silicon Macs are failing. We've determined that removing the `user.focus_enabled=true` config option allows the build to succeed.
Reviewed By: milend
Differential Revision: D44076509
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107816
Approved by: https://github.com/kit1980
This PR removes an `except` clause for `RecursionError`. It used to be there because
`sympy.solve` was being used at the time. Since we are using the simpler `try_solve`, it's
not needed anymore.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108144
Approved by: https://github.com/Skylion007
### Summary
Hi! We've been fuzzing pytorch with [sydr-fuzz](https://github.com/ispras/oss-sydr-fuzz) and found an error of accessing an arbitrary address while parsing the flatbuffer format using the `torch::load` function.
pytorch version: 18bcf62bbcf7ffd47e3bcf2596f72aa07a07d65f (the last commit at the moment of reporting the issue)
### Details
The vulnerability appears while loading arbitrary user input using the `torch::load` function. To detect the error, the input must correspond to FlatbufferFileFormat, so the flatbuffer-parsing part of the `import_ir_module` function must be executed.
Firstly, the error can occur in `GetMutableRoot` in `module.h`, where we add to the input data buffer pointer a value obtained by dereferencing that pointer (whose data fully depends on the user input and can be arbitrary), so the resulting `flatbuffer_module` address can be corrupted.
Moreover, we can get an arbitrary address later at `flatbuffer_loader.cpp:305`, when we get the `ival` pointer with the `Get` method.
There, in the `IndirectHelper::Read` function, we add to the pointer an offset obtained by dereferencing that pointer, so the address can be corrupted again.
The corrupted `ival` pointer is dereferenced at `table.h` in the flatbuffers project, where it is used to get another address, which is later dereferenced again at `table.h`. The resulting corrupted address is written to the `func` pointer at `flatbuffer_loader.cpp:274`, which is then used in `parseFunction`, where a write access to the address occurs.
To fix the problem, we can compute the end of the memory area in the `parse_and_initialize_mobile_module` function like this:
```
auto* end = static_cast<char*>(data) + size;
```
And then pass it to all the callees and insert corresponding checks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107510
Approved by: https://github.com/albanD
This is tested by AOTAutograd later in the stack, but I can add direct tests if anyone wants them.
Previously, the second overload of `_make_wrapper_subclass` (which supports dynamic shapes) would always return a wrapper tensor that reported itself as being on `cpu`. This updates it to properly respect the `device` arg that was passed in.
At first I thought about doing this the same way that FakeTensor does it (override device to do a custom impl), but that seemed overly complicated. Since the subclass is a wrapper, we can just bake in the value on the wrapper.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107926
Approved by: https://github.com/ezyang
ghstack dependencies: #107915, #107916
This is mostly a minor fix on top of @soulitzer's PR https://github.com/pytorch/pytorch/pull/107839.
(1) `strides` wasn't going through the new `set_tensor_attr_with_capsule` flow
(2) The dynamic shapes overload for `_make_wrapper_subclass` currently errors when you try to use custom sizes - I removed the error
(3) added a test
I need this later because I'm adding a `__torch_dispatch__` `FunctionalTensor` wrapper subclass, that needs to support dynamic shapes, and also plumb metadata calls to its inner tensor later.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107916
Approved by: https://github.com/ezyang, https://github.com/soulitzer
ghstack dependencies: #107915
This PR adds a `return_and_correct_aliasing()` utility, that wrapper subclasses can use to get correct aliasing. I updated `TwoTensor` to use it, and added some testing that the aliasing of my `TwoTensor` subclass now matches the aliasing behavior of normal tensors.
Right now my test just uses a few hand-picked opinfos (that have varying aliasing behavior). I thought all op infos might be overkill (does that take a while to run?), but I'm happy to add them all if people prefer.
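A minimal sketch of a wrapper subclass opting in (hypothetical single-inner-tensor subclass for brevity; the in-tree `TwoTensor` is the real example):
```python
import torch
from torch.utils._python_dispatch import return_and_correct_aliasing
from torch.utils._pytree import tree_map_only

class WrapperLike(torch.Tensor):
    @staticmethod
    def __new__(cls, elem):
        return torch.Tensor._make_wrapper_subclass(
            cls, elem.shape, dtype=elem.dtype, device=elem.device
        )

    def __init__(self, elem):
        self.elem = elem

    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        # Unwrap, run the op on the inner tensors, then rewrap the outputs.
        out = func(
            *tree_map_only(cls, lambda t: t.elem, args),
            **tree_map_only(cls, lambda t: t.elem, kwargs),
        )
        out = tree_map_only(torch.Tensor, lambda t: cls(t), out)
        # Fix up aliasing so views/in-place ops on the wrapper behave like
        # they do on plain tensors, per func's schema:
        return return_and_correct_aliasing(func, args, kwargs, out)
```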
One more general question about this PR: eventually, proper aliasing will be a **requirement** in order for AOTAutograd to handle aliasing/mutations on subclasses properly during compilation. How can we make sure that wrapper subclasses use this API? A few options (from talking to Richard):
(1) Yolo require subclasses to use the API and hope users do as well (what this PR does)
(2) Yolo require subclasses to use the API, but add a kwarg to `_make_wrapper_subclass`, e.g. `manual_aliasing=True`, that torch.compile checks for before allowing the subclass to be used in compilation
(3) Automatically run this API in our python fallback, for **every** tensor subclass that currently implements `__tensor_flatten__` (aka only the "traceable" subclasses)
(4) Automatically run this API in our python fallback, for **every** tensor subclass. This would be a bit higher blast radius, since it would change the existing aliasing behavior of wrapper subclasses. Maybe.. this is the right thing to do though?
Either way, my tentative plan is to do (1) to unblock, and revisit this later once we want to come up with public docs + a more general "tensor subclass in PT2 requirements" plan
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107915
Approved by: https://github.com/ezyang
For the test_conv_bn_fuse dynamic case, we always fuse bn with convolution, and there is only an external convolution call, no loops, so it will fail when we do the dynamic loop-vars check. This PR skips this case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108113
Approved by: https://github.com/huydhn
**Update:** I refactored the original PR. See the original description below; here I'll describe the updates:
(1) TLS changes in `TorchDispatchModeTLS.h/cpp`.
I added a `TorchDispatchModeKey` enum, that (for now) just contains PROXY and FAKE. The ModeTLS used to just contain a `std::vector<std::shared_ptr<c10::SafePyObject>>` corresponding to the mode stack. It now **also** contains a separate array of "infra modes", indexed by mode key (PROXY and FAKE, with a new addition, FUNCTIONAL, coming later in the stack).
`TorchDispatchModeTLS::push_onto_stack` and `TorchDispatchModeTLS::pop_stack` are now a bit more complicated. Pushing accepts an optional mode_key, which if set, tells us to add the given mode directly to our "infra_modes" array. Popping will first check the "user mode" stack, before trying to pop anything from the infra mode stack. It also optionally returns the mode key of the mode we popped if there was one - that way if we push that same mode back onto the TLS later, we know where it goes.
`TorchDispatchModeTLS::dispatch_mode_enabled()` now accepts an optional `skip_infra_modes` param, so you can separately query if there are "any modes at all", or if there are "any user modes".
`TorchDispatchModeTLS::get/set/unset_mode()` all take in a mode key, and get/set/unset the mode at that particular mode key (meaning they are only meant to be used for infra modes).
There were also some mild codegen changes to support the new enum
(2) `fake_tensor.py/proxy_tensor.py/_python_dispatch.py`
The way I tell the infra that certain subclasses/modes are "infra" is through the enum: I gave `FakeTensor` and `FakeTensorMode` a `self._mode_key = torch._C.TorchDispatchModeKey.FAKE`. `TorchDispatchMode.__enter/exit__()` (in `_python_dispatch.py` now check if the current mode has a mode key, and if so they plumb it into any `push_onto_stack()` calls (which eventually instructs `TorchDispatchModeTLS` where to put the mode). Same thing for `ProxyTorchDispatchMode`.
I also had to change both of these mode's enter/exit, to handle the fact that there can no longer be multiple proxy/fake modes on the mode stack at once. I updated them both to have a `self.enter_stack: List[Optional[TorchDispatchMode]]` - whenever we push a given mode in `__enter__`, we remove the current ambient fake/proxy mode from the mode stack, and save it in `enter_stack`, so that on exit we can reset the state properly.
(2) dispatching logic in `python_arg_parser.cpp`
This is where the core dispatching logic changes are. I added two helpers, `dispatch_on_subclass()` and `dispatch_on_mode()`. The overall dispatching order is now:
```
(a) dispatch_on_mode() # try user modes first (where the mode stack automatically considers infra modes last)
(b) dispatch_on_subclass() # try user subclasses next (skipping infra subclasses)
(c) dispatch_on_subclass() # try infra subclasses next (skipping user subclasses)
```
Note that we still want "user subclasses" to run before "infra modes". As Ed helped me realize, this will work today: if proxy/fake modes run in step (a), they'll return NotImplemented if they see a user subclass, allowing us to redispatch to the user subclass.
How do (b) and (c) distinguish between user and infra subclasses? Infra subclasses (FakeTensor, and later FunctionalTensor) are required to have a `_mode_key` hidden on the subclass - so we filter via arguments that do/don't have the _mode_key.
(4) I also changed `DoubleTensor` to `TwoTensor` to minimize confusion (@albanD pointed out that DoubleTensor would be easily confused with `torch.FloatTensor` and friends).
----- original description below -----
The main purpose of this PR is to fix the "ordering problem" between torch_dispatch modes, where we want to ensure that our Fake and Proxy dispatch modes always run **after** any dispatch modes created by the user, regardless of where they are in the stack. See this doc for more details: https://docs.google.com/document/d/1COQ291nOZvtFnzGTQMJqoYZ3sttEYFw_7HbfSyL8gcA/edit
Full set of changes below. I ended up including a few semi-related changes in this PR that I documented - but if folks would rather I separate them out, happy to try to do that.
**(1) Add dedicated TLS slots for FakeTensorMode and ProxyTensorMode**
This is the main component of this PR. There are two new slots, `TorchDispatchModeTLS.fake_mode_` and `TorchDispatchModeTLS.proxy_mode_`, which correspond to a single "global" fake and proxy mode. There is now an invariant that `torchDispatchModeState.stack_` can never contain either of these modes.
I also added a `TorchDispatchModeTLS::maybe_highest_mode()` helper that consults the `stack_` as well as both the proxy and fake slots, and returns the highest priority mode - this is because there are a few places in the codebase where we legitimately want to get the highest priority mode, *including* fake or proxy, if one is set.
This also made the implementations of the existing `disable_proxy_modes_tracing()` and `get_innermost_proxy_mode()` marginally simpler.
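Schematically, the helper behaves like this (a plain-Python model of the described behavior, not the C++ code; one plausible ordering of the slots):

```
def maybe_highest_mode(tls):
    # user modes on the stack take priority; otherwise fall back to the
    # dedicated proxy slot, then the fake slot
    if tls.stack:
        return tls.stack[-1]
    if tls.proxy_mode is not None:
        return tls.proxy_mode
    return tls.fake_mode  # may be None if no mode is set at all
```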
**(2) Updated the dispatching logic in handle_torch_function_no_python_arg_parser()**
This is the function that actually figures out which torch_dispatch implementation to call, given the current mode stack and tensor subclass inputs. This function got marginally more complicated as part of the refactor: First we inspect the mode stack and any non-fake subclass inputs. Then we check for the proxy mode slot. Then we check for the Fake mode slot, before finally checking for any fake subclass inputs.
**(3) new python `_get_fake_tensor_mode()` and `_get_proxy_tensor_mode()` APIs**
Before, if you wanted to see if proxy or fake modes were active in python, you would have to consult the mode stack. Since these two modes are no longer part of the actual mode stack, I added two new APIs to directly check if either proxy or fake modes are active.
**(4) Allow traceable tensor subclasses to access storages from python**
This is convenient later in the stack, where AOTAutograd needs to detect aliasing of inputs and outputs, where those inputs and outputs might be tensor subclasses. Previously, `x.untyped_storage()` would raise an error if `x` was a subclass. In this PR, I tried to relax this constraint as little as possible: `THPVariable_storage()` will only try to return a storage to Python if the tensor subclass that you are passing in is "traceable".
**(5) Fixed subclass fakeification**
@wanchaol recently added support to be able to fakeify tensor subclasses. That fakeification logic works in most cases, but there is one case it doesn't handle: autograd metadata. In particular, since autograd sees our tensor subclasses and not their desugared tensors, we need to make sure that our fakeified subclass has the same autograd metadata as the original subclass. I updated `meta_utils.py` to make sure that the autograd metadata is correct.
**(6) make tensor subclasses resizeable**
Previously we didn't allow tensor subclasses to be resizeable. I ran into an issue where fakeifying a tensor subclass occasionally requires swapping out its storage, which can involve resizing the tensor. Mechanically, this required updating `at::for_blob()` to expose a way to request that the tensor that you create has resizeable storage, and then using this new API in `_make_wrapper_tensor()`.
**(7) Added a basic DoubleTensor subclass for testing**
I use this subclass more later in this stack in my AOTAutograd tests - but it serves as a simple subclass example to test the dispatch ordering in this PR.
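For flavor, a minimal wrapper subclass of this kind might look like the following. This is a sketch in the spirit of what became `TwoTensor`, simplified for illustration (e.g. tensor kwargs are not unwrapped), not the exact test-suite implementation:

```
import torch

class TwoTensor(torch.Tensor):
    """Wrapper subclass holding two inner tensors; every op runs on both."""

    @staticmethod
    def __new__(cls, a, b):
        assert a.shape == b.shape and a.dtype == b.dtype
        return torch.Tensor._make_wrapper_subclass(
            cls, a.shape, dtype=a.dtype, device=a.device
        )

    def __init__(self, a, b):
        self.a = a
        self.b = b

    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        unwrap = lambda t, attr: getattr(t, attr) if isinstance(t, TwoTensor) else t
        out_a = func(*[unwrap(x, "a") for x in args], **kwargs)
        out_b = func(*[unwrap(x, "b") for x in args], **kwargs)
        if isinstance(out_a, torch.Tensor):
            return TwoTensor(out_a, out_b)
        return out_a

t = TwoTensor(torch.ones(3), torch.zeros(3))
res = t + 1  # dispatches aten.add on both inner tensors
print(type(res), res.a, res.b)
```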
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104482
Approved by: https://github.com/ezyang
ghstack dependencies: #107415
There is already some support for plumbing `__torch_dispatch__` tensor subclasses through dynamo, but this PR beefs it up a bit and adds a test. In particular:
(1) Fakeifying tensor subclasses didn't properly set autograd metadata (requires_grad, is_leaf) on the newly fakeified wrapper subclass. I don't actually have a test for this in this PR, but it's tested pretty heavily later in my aot autograd tests
(2) Fakeifying tensor subclasses didn't properly track source information for dynamic shapes on the inner tensors. I added a new `WrapperSubclassFieldSource` subclass, that represents a source coming from a tensor field on a wrapper subclass, which I use in the fakeifying logic, and again in symbolic_shapes.py to generate proper guards.
(3) `_make_wrapper_subclass()` marginally updated this code to work better with dynamic shapes. One thing that's a bit weird about `_make_wrapper_subclass`: it has two overloads, and the first explicitly does not support dynamic shapes (and the second does not support kwargs). I think that later we probably want to consolidate / at least make the first overload work with dynamic shapes, but I didn't want to handle that in this PR (so these smaller changes seemed like a strict improvement).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107415
Approved by: https://github.com/ezyang
Summary:
Add a `None` check for `guard.stack`, which was causing exceptions like:
```
torch._dynamo.exc.InternalTorchDynamoError: 'NoneType' object has no attribute 'format'
```
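The fix amounts to guarding the `format()` call, along these lines (a sketch of the shape of the check, not the verbatim patch):

```
# guard.stack is a captured traceback summary, but it can legitimately be
# None; format() must not be called unconditionally
stack_str = "".join(guard.stack.format()) if guard.stack is not None else ""
```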
Test Plan: contbuild & OSS CI
Differential Revision: D48709458
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108012
Approved by: https://github.com/anijain2305
We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError` etc.
This results in messy code during error handling somewhat like this:
```
if "NCCL" in exception_str:
...
if "Timed out initializing process group in store based barrier on rank" in exception_str:
...
if "The client socket has timed out after" in exception_str:
...
if "Broken pipe" in exception_str:
...
if "Connection reset by peer" in exception_str:
...
```
To address this issue, in this PR I've added these error types:
1. **DistError** - the base type of all distributed errors
2. **DistBackendError** - this already existed and referred to PG backend errors
3. **DistStoreError** - for errors originating from the store
4. **DistNetworkError** - for general network errors coming from the socket library
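With a real hierarchy, callers can catch by type rather than by substring matching. A sketch, assuming these types are exposed on `torch.distributed`:

```
import torch.distributed as dist

try:
    dist.init_process_group(backend="nccl")
except dist.DistNetworkError:
    ...  # socket-level failures: timeouts, broken pipe, connection reset
except dist.DistStoreError:
    ...  # store failures, e.g. a store-based barrier timing out
except dist.DistBackendError:
    ...  # errors raised by the PG backend itself (e.g. NCCL)
except dist.DistError:
    ...  # any other distributed error
```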
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107651
Approved by: https://github.com/H-Huang
Summary: This commit does 4 main things:
1. When verifying QAT numerics, automatically check both the
per tensor and the per channel cases, and automatically verify
convert numerics
2. When verifying the QAT graph, automatically check both the
per tensor and the per channel cases
3. Merge verify graph and verify numerics tests for conv-bn
4. Fix `test_prepare_qat_conv_bn_fusion_getitem_placeholder`,
which was no longer testing the right thing after recent capture
changes, since the maxpool op is no longer followed by a
getitem node. However, we do still need this test for other
ops that *are* followed by getitem nodes (e.g. standalone BN).
Items (1) - (3) make the QAT tests significantly less verbose
and easier to read.
Test Plan:
python test/test_quantization.py TestQuantizePT2E
python test/test_quantization.py TestQuantizePT2EModels
Reviewers: jerryzh168, kimishpatel
Subscribers: jerryzh168, kimishpatel, supriyar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107991
Approved by: https://github.com/jerryzh168
Fixes#107282
## Overview
- The basic design decisions follow those made in #103881 (tensor operation, test cases, order & position of arguments, etc.)
- For the decoupled weight decay algorithm, I referred to [1, 2]
## backwards-incompatible changes
- positional argument `decoupled_weight_decay` is added to:
  - `torch.optim.radam`
Existing code that refers to this API can be affected.
Note: the argument `decoupled_weight_decay` is also added to `torch.optim.RAdam`; however, since it was added in the last position with a default value, existing calls are not affected (see the example below).
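Usage after this change, for illustration (the new flag defaults to `False`, preserving current behavior):

```
import torch

model = torch.nn.Linear(4, 2)

# standard RAdam: weight decay is folded into the gradient (L2 regularization)
opt = torch.optim.RAdam(model.parameters(), lr=1e-3, weight_decay=1e-2)

# decoupled weight decay (as in [1]): decay is applied directly to the
# weights, independently of the adaptive gradient update
opt = torch.optim.RAdam(
    model.parameters(), lr=1e-3, weight_decay=1e-2, decoupled_weight_decay=True
)
```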
## Reference
- [1] [Decoupled Weight Decay Regularization](https://arxiv.org/abs/1711.05101)
- [2] https://github.com/LiyuanLucasLiu/RAdam/blob/master/radam/radam.py#L5-L94
## TODO
- [x] implement tensor operation
- [x] implement test cases
- [x] modify doc-string
- [x] pass unit test code locally `python test/test_optim.py -k test_radam`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107507
Approved by: https://github.com/janeyx99
Summary: Before this change, the tensor_indices_or_sections variant of aten.tensor_split causes a `RuntimeError: The tensor has a non-zero number of elements` due to that operation needing to introspect data. Decomposing into one of the other two tensor_split variants fixes the problem.
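The two call forms split identically when the indices are known, so the decomposition is behavior-preserving; e.g.:

```
import torch

x = torch.arange(8)
# tensor variant: split points live in a tensor, which the original lowering
# had to introspect (reading data), triggering the error under inductor
a = torch.tensor_split(x, torch.tensor([2, 5]))
# list variant: same split points, no data-dependent introspection required
b = torch.tensor_split(x, [2, 5])
assert all(torch.equal(u, v) for u, v in zip(a, b))
```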
Test Plan:
Enabled tensor_split tests in test/inductor/test_torchinductor_opinfo.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107251
Approved by: https://github.com/ezyang, https://github.com/eellison
We want `cond` to always raise errors regardless of the user's torch.compile mode.
The current implementation is to
1. catch the UserError.GRAPH_BREAK_IN_CONTROL_FLOW and, once we see it, raise directly: first in [break_graph_if_unsupported](bad3f2db40/torch/_dynamo/symbolic_convert.py (L1250)), which catches and re-raises for call_function (the entry point of higher-order operators) and a few others.
2. The raised exception is caught and raised again in [step](bad3f2db40/torch/_dynamo/symbolic_convert.py (L691)), where all instructions' exceptions are handled.
3. At the top level, we treat it as a hard error and do not suppress it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108027
Approved by: https://github.com/zou3519
ghstack dependencies: #108025, #108026
# Step by step guide on using PyTorch's DevContainer
Using PyTorch's DevContainer environment involves a series of steps that will help you set up a development environment that is isolated and replicable. Below, we'll guide you through each step to make this process as smooth as possible:
## Step 1: Install VSCode
1. Navigate to the [Visual Studio Code website](https://code.visualstudio.com/).
2. Download the appropriate installer for your operating system (Windows, Linux, or macOS).
3. Run the installer and follow the on-screen instructions to install VSCode on your system.
4. After installation, launch VSCode.
## Step 2: Install DevContainer Extension
1. In VSCode, go to the Extensions view by clicking on the Extensions icon in the Activity Bar on the side of the window.
2. Search for "Dev Containers" in the Extensions view search bar.
3. Find the "Dev Containers" extension in the search results and click on the install button to install it.
You can also go to the extension's [homepage](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers) and [documentation page](https://code.visualstudio.com/docs/devcontainers/containers) to find more details.
## Step 3: Install Docker and Add Current Login User to Docker Group
1. Follow the [official guide](https://docs.docker.com/get-docker/) to install Docker. Don't forget the [post installation steps](https://docs.docker.com/engine/install/linux-postinstall/).
If you are using [Visual Studio Code Remote - SSH](https://code.visualstudio.com/docs/remote/ssh), then you only need to install Docker in the remote host, not your local computer. And the following steps should be run in the remote host.
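In particular, the post-installation step that matters here is adding your user to the `docker` group, so the Dev Containers extension can talk to the Docker daemon without root, e.g.:
```
sudo usermod -aG docker $USER
newgrp docker   # or log out and log back in for the change to take effect
```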
## Step 4: Install NVIDIA Container Toolkit (Optional)
1. If you intend to use GPU resources, first ensure you have NVIDIA drivers installed on your system. Check that `nvidia-smi` works to verify your GPU setup.
2. Follow the [official guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#docker) to install the NVIDIA Container Toolkit.
3. After installation, verify that the toolkit is installed correctly by running:
```
docker run --rm --runtime=nvidia --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
```
## Step 5: Clone PyTorch
1. Open a terminal or command prompt.
2. Use the following command to clone the PyTorch repository:
```
git clone https://github.com/pytorch/pytorch
```
3. Navigate to the cloned directory:
```
cd pytorch
```
## Step 6: Open in DevContainer
1. In VSCode, use the Command Palette (`Ctrl+Shift+P` or `Cmd+Shift+P` on macOS) to run the "Dev Containers: Open Folder in Container..." command.
2. You will be prompted with two options: CPU dev container or CUDA dev container. Choose the one you want to run.
## Step 7: Wait for Building the Environment
1. After opening the folder in a DevContainer, VSCode will start building the container. This process can take some time as it involves downloading necessary images and setting up the environment.
2. You can monitor the progress in the VSCode terminal.
3. Once the build process completes, you'll have a fully configured PyTorch development environment in a container.
4. The next time you open the same dev container, it will be much faster, as it does not require building the image again.
You are now all set to start developing with PyTorch in a DevContainer environment. This setup ensures you have a consistent and isolated development environment for your PyTorch projects.
## Step 8: Build PyTorch
To build PyTorch from source, simply run:
```
python setup.py develop
```
The process involves compiling thousands of files and will take a long time. Fortunately, the compiled objects are reused in subsequent builds: after you modify some files, only the changed files need to be recompiled the next time.
Note that only the contents of the `pytorch` directory are persisted. This directory is mounted into the Docker container, while everything else in the container is temporary and will be lost if Docker restarts the container or the server reboots.
For an in-depth understanding of Dev Container and its caveats, please refer to [the full documentation](https://code.visualstudio.com/docs/devcontainers/containers).