Which is faster then ternary.
Following script
```python
import torch
from timeit import default_timer
global_setup = """
"""
setup = """
c10::SymInt a = c10::SymInt(123);
"""
code = """
-a;
"""
from torch.utils.benchmark import Timer
t = Timer(stmt=code, setup=setup, global_setup=global_setup, language="c++", timer=default_timer)
print(t.blocked_autorange())
```
reports 4.17 ns median type before and 3.61 ns after on x86_64 Linux and 2.02 ns before and 1.91 ns after on Apple M1
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117015
Approved by: https://github.com/albanD
Inductor codegen for `_assert_async` is currently disabled because we don't really understand how to codegen `scalar_to_tensor` on a Sympy expression. I initially tried to see if I could get this to work, but I got into some weird problem involving stride sorting, so I decided to fix it properly by not going through a tensor.
So we introduce an `_assert_scalar` which takes a scalar as an argument, avoiding needing to turn a SymBool into a tensor before asserting on it. I also add `_functional_assert_scalar` for good luck, although this doesn't do anything right now because https://github.com/pytorch/pytorch/pull/104203 still hasn't been landed.
I need to customize the codegen for this operator, so I decide to directly implement it in Inductor, rather than trying to treat it as a generic ExternKernel. This leads to the new AssertScalar IR node. This is written carefully so that it doesn't get DCE'd by Inductor.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114148
Approved by: https://github.com/jansel
Summary:
config.is_predispatch is a config to instruct inductor to enable predispatch
tracing (high level pre-dispatch IR). Currently, there is no dedicated pass
for this config.
In this commit, for better pass function management, we created
`predispatch_pass` to hold the pass functions to be run on the high level
pre-dispatch IR-based graphs.
Test Plan: CI
Differential Revision: D52491332
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116788
Approved by: https://github.com/frank-wei
reland of https://github.com/pytorch/pytorch/pull/116559, which was reverted by internal.
The underlying reason for the revert is that the torch.dynamo.disable can't be used by the
pytorch codebase, as it's conflicting with some torch.deploy together, although the later one
only run some inference, but it somehow take that weird dependency on fsdp..
We have seen this issue with our functional collectives that we can't
use any dynamo components otherwise torch.deploy would complain..
verified internally that after removing torch.dynamo.disable the test
passed again
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117020
Approved by: https://github.com/awgu
Instead of printing the tensor's data print the dtype and shape metadata of the tensor.
```
Executing: <VarMeanBackward0 object at 0x1352d0e20> with grad_outputs: [None,f32[]]
```
This is important in order to avoid doing a cuda sync and also useful to reduce verbosity.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116523
Approved by: https://github.com/albanD
Through the new dynamic_shapes API and using torch.export.Dim, dimensions that are equal will now be represented by the same symbol, so we no longer need to store `equality_constraints`.
Differential Revision: D52351705
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116979
Approved by: https://github.com/avikchaudhuri
After https://github.com/pytorch/pytorch/pull/116595/files compiling every .cu file results in
```
/home/nshulga/git/pytorch/pytorch/c10/util/Half.h(450): warning #173-D: floating-point value does not fit in required integral type
-static_cast<uint64_t>(f) > static_cast<uint64_t>(limit::max()));
^
detected during:
instantiation of "std::enable_if_t<<expression>, __nv_bool> c10::overflows<To,From>(From, __nv_bool) [with To=float, From=int64_t]" at line 159 of /home/nshulga/git/pytorch/pytorch/c10/util/TypeCast.h
instantiation of "To c10::checked_convert<To,From>(From, const char *) [with To=float, From=int64_t]" at line 122 of /home/nshulga/git/pytorch/pytorch/c10/core/Scalar.h
Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
/home/nshulga/git/pytorch/pytorch/c10/util/Half.h(450): warning #173-D: floating-point value does not fit in required integral type
-static_cast<uint64_t>(f) > static_cast<uint64_t>(limit::max()));
^
detected during:
instantiation of "std::enable_if_t<<expression>, __nv_bool> c10::overflows<To,From>(From, __nv_bool) [with To=float, From=uint64_t]" at line 159 of /home/nshulga/git/pytorch/pytorch/c10/util/TypeCast.h
instantiation of "To c10::checked_convert<To,From>(From, const char *) [with To=float, From=uint64_t]" at line 122 of /home/nshulga/git/pytorch/pytorch/c10/core/Scalar.h
/home/nshulga/git/pytorch/pytorch/c10/util/Half.h(450): warning #173-D: floating-point value does not fit in required integral type
-static_cast<uint64_t>(f) > static_cast<uint64_t>(limit::max()));
^
detected during:
instantiation of "std::enable_if_t<<expression>, __nv_bool> c10::overflows<To,From>(From, __nv_bool) [with To=double, From=int64_t]" at line 159 of /home/nshulga/git/pytorch/pytorch/c10/util/TypeCast.h
instantiation of "To c10::checked_convert<To,From>(From, const char *) [with To=double, From=int64_t]" at line 122 of /home/nshulga/git/pytorch/pytorch/c10/core/Scalar.h
/home/nshulga/git/pytorch/pytorch/c10/util/Half.h(450): warning #173-D: floating-point value does not fit in required integral type
-static_cast<uint64_t>(f) > static_cast<uint64_t>(limit::max()));
^
detected during:
instantiation of "std::enable_if_t<<expression>, __nv_bool> c10::overflows<To,From>(From, __nv_bool) [with To=double, From=uint64_t]" at line 159 of /home/nshulga/git/pytorch/pytorch/c10/util/TypeCast.h
instantiation of "To c10::checked_convert<To,From>(From, const char *) [with To=double, From=uint64_t]" at line 122 of /home/nshulga/git/pytorch/pytorch/c10/core/Scalar.h
/home/nshulga/git/pytorch/pytorch/c10/util/Half.h(450): warning #173-D: floating-point value does not fit in required integral type
-static_cast<uint64_t>(f) > static_cast<uint64_t>(limit::max()));
^
detected during:
instantiation of "std::enable_if_t<<expression>, __nv_bool> c10::overflows<To,From>(From, __nv_bool) [with To=c10::complex<float>, From=int64_t]" at line 159 of /home/nshulga/git/pytorch/pytorch/c10/util/TypeCast.h
instantiation of "To c10::checked_convert<To,From>(From, const char *) [with To=c10::complex<float>, From=int64_t]" at line 122 of /home/nshulga/git/pytorch/pytorch/c10/core/Scalar.h
/home/nshulga/git/pytorch/pytorch/c10/util/Half.h(450): warning #173-D: floating-point value does not fit in required integral type
-static_cast<uint64_t>(f) > static_cast<uint64_t>(limit::max()));
^
detected during:
instantiation of "std::enable_if_t<<expression>, __nv_bool> c10::overflows<To,From>(From, __nv_bool) [with To=c10::complex<float>, From=uint64_t]" at line 159 of /home/nshulga/git/pytorch/pytorch/c10/util/TypeCast.h
instantiation of "To c10::checked_convert<To,From>(From, const char *) [with To=c10::complex<float>, From=uint64_t]" at line 122 of /home/nshulga/git/pytorch/pytorch/c10/core/Scalar.h
/home/nshulga/git/pytorch/pytorch/c10/util/Half.h(450): warning #173-D: floating-point value does not fit in required integral type
-static_cast<uint64_t>(f) > static_cast<uint64_t>(limit::max()));
^
detected during:
instantiation of "std::enable_if_t<<expression>, __nv_bool> c10::overflows<To,From>(From, __nv_bool) [with To=c10::complex<double>, From=int64_t]" at line 159 of /home/nshulga/git/pytorch/pytorch/c10/util/TypeCast.h
instantiation of "To c10::checked_convert<To,From>(From, const char *) [with To=c10::complex<double>, From=int64_t]" at line 122 of /home/nshulga/git/pytorch/pytorch/c10/core/Scalar.h
/home/nshulga/git/pytorch/pytorch/c10/util/Half.h(450): warning #173-D: floating-point value does not fit in required integral type
-static_cast<uint64_t>(f) > static_cast<uint64_t>(limit::max()));
^
detected during:
instantiation of "std::enable_if_t<<expression>, __nv_bool> c10::overflows<To,From>(From, __nv_bool) [with To=c10::complex<double>, From=uint64_t]" at line 159 of /home/nshulga/git/pytorch/pytorch/c10/util/TypeCast.h
instantiation of "To c10::checked_convert<To,From>(From, const char *) [with To=c10::complex<double>, From=uint64_t]" at line 122 of /home/nshulga/git/pytorch/pytorch/c10/core/Scalar.h
```
Fix it by using using `if constexpr` to avoid calling `static_cast<uint64_t>` for any floating point type
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117023
Approved by: https://github.com/albanD
Summary:
In torch/_inductor/lowering.py (https://fburl.com/code/jd58vxpw), we are using
```
fallback_cumsum(x, dim=axis, dtype=dtype)
```
so this will treat `x` as args, `dim` and `dtype` as kwargs from https://fburl.com/code/cikchxp9
The issue has been fixed from D52530506 for OSS but not Meta internal. This diff address the Meta internal issue by some refactoring so both Meta internal and OSS can use the same helper function. The diff also added some debug log.
Test Plan:
before
```
aoti_torch_proxy_executor_call_function(proxy_executor, 2, 1, std::vector<int64_t>{torch.int64}.data(), 2, std::vector<AtenTensorHandle>{buf702, buf708}.data());
```
after
```
aoti_torch_proxy_executor_call_function(proxy_executor, 2, 1, std::vector<int64_t>{0}.data(), 2, std::vector<AtenTensorHandle>{buf702, buf708}.data());
```
so `torch.int64` changed to `0`
Differential Revision: D52532031
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116779
Approved by: https://github.com/desertfire, https://github.com/chenyang78
When originally authored, it was not necessary to unconditionally apply
hermetic mode, but I chose to apply it in eager mode to help catch bugs.
Well, multipy is kind of dead, and hermetic mode is causing real
implementation problems for people who want to do fancy Python stuff
from the dispatcher. So let's yank this mode for now.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116996
Approved by: https://github.com/jansel
Improve perf of vec512 bfloat16 (and also float16) load and store with partial vector lanes using masked load/store instead of via `memcpy` with aux buffer. In inductor CPU backend, we do load/store half (16) vector lanes for bfloat16 and float16.
Using the following micro-benchmark script for `layernorm + add`:
```python
import torch
import torch.nn as nn
from benchmark_helper import time_with_torch_timer
class AddLayernorm(nn.Module):
def __init__(self, hidden_size):
super().__init__()
self.ln = nn.LayerNorm(hidden_size)
def forward(self, hidden_states):
return hidden_states + self.ln(hidden_states)
hidden_states = torch.randn(1, 512, 1024).to(torch.bfloat16)
with torch.no_grad():
compiled_add_ln = torch.compile(add_ln)
print(time_with_torch_timer(compiled_add_ln, hidden_states, iters=10000))
```
Measured on single-core `Intel(R) Xeon(R) Platinum 8358 CPU`.
Before: 1.39 ms
After: 498.66 us
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116961
Approved by: https://github.com/sanchitintel, https://github.com/leslie-fang-intel
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/116492, `linear` will be decomposed into `bmm` when input dim exceeds 2 and not contiguous. Fix this issue by convert the pattern back into `qlinear`. This PR focus on int8_fp32 case, will follow up int8_bf16 case in next PR.
**Test Plan**
```
python -u -m pytest -s -v test_mkldnn_pattern_matcher.py -k test_qlinear_input_dim_exceeds_2_and_not_contiguous
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116599
Approved by: https://github.com/jgong5
ghstack dependencies: #116937
This PR is to create a separate CI job for inductor UTs on ROCm. You will need to add `ciflow/inductor` tag on PRs to trigger this job. However, the job will run on its own on any commit merged in main. This job takes around 1.5 hours to run and it is run in parallel to other rocm jobs. It is run only on the MI210 CI runners to ensure maximum inductor functionality is tested.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110544
Approved by: https://github.com/jithunnair-amd, https://github.com/jansel, https://github.com/huydhn
**Summary**
Similar as https://github.com/pytorch/pytorch/pull/115153, when we do the `qconv_binary` fusion with post op sum, we also need to ensure that: all users of the extra input in this pattern should be ancestor nodes of the compute node, except for the binary node connected to the compute node.
Also change some variable names in this diff as:
- Change name of `qconv2d_node_after_weight_prepack` to `compute_node`
- Change name of `extra_input_node` to `extra_input_of_binary_node`
**Test Plan**
```
python -u -m pytest -s -v test_mkldnn_pattern_matcher.py -k test_qconv2d_add_3
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115809
Approved by: https://github.com/jgong5
ghstack dependencies: #115153
Summary: fixed an import problem for test_xnnpack_quantizer so that it can run in CI
Test Plan:
internal CI
sanity check: buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- --exact 'caffe2/test/quantization:test_quantization - test_conv2d (caffe2.test.quantization.pt2e.test_xnnpack_quantizer.TestXNNPACKQuantizer)'
Differential Revision: D52576449
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116911
Approved by: https://github.com/mcr229
Fixes#115595
As an aside, there are currently no tests checking the output of `torch.signal.windows.kaiser` against the output of scipy's implementation, which is what is done with `torch.kaiser_window`. The same goes for the other window functions in `torch.signal.windows`. I did some tests on my end, but I'm not sure what the best practice is, so I haven't included them for now.
@gchanan @mruberry
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116470
Approved by: https://github.com/ezyang
Here's the problem: if we support unsigned integer types, and in particular if we support uint64_t, we need a way to represent these integers in Scalar. However, Scalar currently stores all integral values inside int64_t, which is not wide enough to accommodate all possible uint64_t values. So we need to do something to Scalar to support it.
The obvious thing to do is add a uint64_t field to the union, and used it some situations. But when should we use it? The proposal is that we only use it if and only if the integer in question is not representable in int64_t. The historical precedent for this is our handling for uint8_t. Because this type is representable inside int64_t, we have historically stored it inside Scalar as an int64_t. In general, the concept behind Scalar is that it doesn't know the signedness/unsignedness/bitwidth of its input; in particular, we typically construct Scalar from Python int, which doesn't have any concept of how wide the integer is! So it doesn't make any sense to allow for a small integer like 255 to be representable under both the HAS_i tag and the HAS_u tag. So we forbid the latter case.
Although I have proposed this, the PR as currently written just chokes when you pass it a uint64_t that's too big. There's some more logic that would have to be written out for this. I'm putting this out to start to get some agreement that this is the way to do it.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116595
Approved by: https://github.com/albanD
Adds bfloat16 to fractional_max_pool. If op supports fp32 and fp16, it really should support bf16 for the most part. Most but not all ops satisfy this, so I am adding support for the few that do not.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116950
Approved by: https://github.com/lezcano
Whenever the monitor thread kills the watchdog thread for being stuck,
we do so to save cluster time and get a faster failure signal, but we
want to know more about why it got stuck.
One possible reason for watchdog stuckness is GIL contention, which
could be ruled out or observed by making an attempt to acquire the GIL
at exit time.
If we cannot acquire the GIL within a short time window (1s) we abort
the attempt and report GIL contention, otherwise we report that GIL was
acquired successfully.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116798
Approved by: https://github.com/zdevito
This prepares the PR where we implement sets in terms of dicts.
To do so, rather than storing internally a dictionary that maps literals
to VariableTrackers, it stores (pretty much) a dictionary from VTs to VTs.
To do so, keys are wrapped in an opaque internal class _Hashable.
The Hashable class is opaque on purpose so that it fails hard if
if it inadvertently leaks back into user code.
We also found and fixed a number of latent bugs and inconsistencies
in the way dynamo checked what can be a dict key. More generally, we
make much clearer what are the things that need to be modified to add
a new supported key type to Dicts.
Fixes [#107595](https://www.internalfb.com/tasks?t=107595)
Fixes [#111603](https://www.internalfb.com/tasks?t=111603)
Re-PR of https://github.com/pytorch/pytorch/pull/111196 sadly due to reverts, we could not reuse @lezcano's original PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116785
Approved by: https://github.com/mlazos
Summary:
We intend to preserve autograd ops for predispatch export. Therefore, we
need to exempt the autograd ops in some places, e.g. verifier and
proxy_tensor.py.
Test Plan:
python test/export/test_export.py -k test_predispatch_export_with_autograd_op
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116527
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #116339
Summary:
As current export doesn't support training, so grad mode ops doesn't
make sense. To avoid the confusion, we choose to early error if there
exist grad mode ops.
Test Plan:
python test/export/test_safeguard.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116339
Approved by: https://github.com/tugsbayasgalan
Where base stack targets default branch, rather than base. But as
default branch is likely to advance, since PR was made, search for
mergebase before determining whether `base`..`head` are in sync with `orig` branch
Also, rather than hardcode default branch name, fetch it from `GitHubPR.default_branch()`
Test Plan: https://github.com/malfet/deleteme/pull/77
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116873
Approved by: https://github.com/ezyang
Use a similar approach with Sleef as in #99550
to improve the precision and extremal value handling of the `abs` function for complex tensors.
This fixes
- test_reference_numerics_extremal__refs_abs_cpu_float64
- test_reference_numerics_extremal__refs_abs_cpu_float128
which failed on PPC.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116859
Approved by: https://github.com/lezcano
**problem**
when prefetching for next forward, current forward may be annotated by
`@torch.no_grad`. `param.grad_fn` keeps being None during prefetching.
`_post_backward_hook` never gets triggered
repro
```pytest test/distributed/fsdp/test_fsdp_freezing_weights.py```
**solution**
this PR enabled autograd during prefetching (`_use_unsharded_views`), so
`param.grad_fn` are properly assigned for next forward
a longer-term fix would be moving `_use_unsharded_views` out of
`_prefetch_handle` and put it in `_pre_forward_unshard`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116792
Approved by: https://github.com/awgu
Part 3 of implementation for general [subclass view fake-ification](https://docs.google.com/document/d/1C5taWiplmX7nKiURXDOAZG2W5VNJ2iV0fQFq92H0Cxw).
Changes codegen to generate `view_func()` / `rev_view_func()` by default for python subclasses. With `view_func()` existing more often now, the lazy view rebase logic [here](f10c3f4184/torch/csrc/autograd/variable.cpp (L665-L695)) causes some slight behavior changes for in-place ops on views:
* Additional view nodes are inserted into output graphs, changing their string representation, although they are functionally the same. The extra nodes are removed in AOTAutograd's DCE pass.
* When `t` is a `FunctionalTensor`, calling `t.grad_fn` will now invoke `view_func()`; we need to make sure we're operating in a `FunctionalTensorMode` so the view op calls succeed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116512
Approved by: https://github.com/bdhirsh, https://github.com/soulitzer
ghstack dependencies: #115894
Part 2 of implementation for general [subclass view fake-ification](https://docs.google.com/document/d/1C5taWiplmX7nKiURXDOAZG2W5VNJ2iV0fQFq92H0Cxw).
Details:
* Codegen `rev_view_func()` alongside `view_func()`
* Reverse view_func gives you a "base" from a "view": `rev_view_func(new_view) -> new_base` AKA it plays the original view backwards
* Utilizes the functional inverses defined in `FunctionalInverses.cpp`, passing `InverseReturnMode::AlwaysView`
* Manually implements functional inverses for `narrow()` and `chunk()`
* **NB: Multi-output views now set view_func() / rev_view_func() for each of the output views!**
* Due to this, the `as_view()` overload that operates on a list of views is scrapped in favor of iteration via codegen
Example codegen in `ADInplaceOrViewTypeN.cpp`:
```cpp
at::Tensor narrow(c10::DispatchKeySet ks, const at::Tensor & self, int64_t dim, c10::SymInt start, c10::SymInt length) {
auto _tmp = ([&]() {
at::AutoDispatchBelowADInplaceOrView guard;
return at::_ops::narrow::redispatch(ks & c10::after_ADInplaceOrView_keyset, self, dim, start, length);
})();
std::function<at::Tensor(const at::Tensor&)> func=nullptr;
std::function<at::Tensor(const at::Tensor&)> rev_func=nullptr;
if (false || !self.unsafeGetTensorImpl()->support_as_strided() ||
c10::AutogradState::get_tls_state().get_view_replay_enabled()) {
func = [=](const at::Tensor& input_base) {
return at::_ops::narrow::call(input_base, dim, start, length);
};
rev_func = [=](const at::Tensor& input_view) {
// NB: args from narrow() signature are passed along to the inverse
return at::functionalization::FunctionalInverses::narrow_copy_inverse(self, input_view, at::functionalization::InverseReturnMode::AlwaysView, dim, start, length);
};
}
auto result = as_view(/* base */ self, /* output */ _tmp, /* is_bw_differentiable */ true, /* is_fw_differentiable */ true, /* view_func */ func, /* rev_view_func */ rev_func, /* creation_meta */ InferenceMode::is_enabled() ? CreationMeta::INFERENCE_MODE : (at::GradMode::is_enabled() ? CreationMeta::DEFAULT : CreationMeta::NO_GRAD_MODE));
return result;
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115894
Approved by: https://github.com/soulitzer
As per title. With this, https://github.com/pytorch/pytorch/issues/93697
does not choke, but spits out many of these:
```
[ERROR] Name: "L['self']"
[ERROR] Source: local
[ERROR] Create Function: NN_MODULE
[ERROR] Guard Types: ['ID_MATCH']
[ERROR] Code List: ["___check_obj_id(L['self'], 139962171127504)"]
[ERROR] Object Weakref: <weakref at 0x7f4b72f7c9a0; to
'ActorCriticPolicy' at 0x7f4b7b7df6d0>
[ERROR] Guarded Class Weakref: <weakref at 0x7f4afbd08b30; to
'ABCMeta' at 0x56463a727840 (ActorCriticPolicy)>
[ERROR] Created at:
[ERROR] File
"/home/lezcano/git/pytorch/pytorch/torch/_dynamo/variables/builder.py",
line 248, in __call__
[ERROR] vt = self._wrap(value)
[ERROR] File
"/home/lezcano/git/pytorch/pytorch/torch/_dynamo/variables/builder.py",
line 474, in _wrap
[ERROR] return self.wrap_module(value)
[ERROR] File
"/home/lezcano/git/pytorch/pytorch/torch/_dynamo/variables/builder.py",
line 941, in wrap_module
[ERROR] return self.tx.output.register_attr_or_module(
[ERROR] File
"/home/lezcano/git/pytorch/pytorch/torch/_dynamo/output_graph.py", line
735, in register_attr_or_module
[ERROR] install_guard(source.make_guard(GuardBuilder.NN_MODULE))
[ERROR] Error while creating guard:
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116460
Approved by: https://github.com/jansel
ghstack dependencies: #116459
- Enable Sonoma testing on M2 machines
- Add 70+ ops to the list of supported ones on MacOS Sonoma
- Enable nn.functional.
- Add explicit `TORCH_CHECK` to mark scatter/gather, index_select and linalg ops as yet not supporting Complex, as attempt to call those will crash with various MPS asserts such as:
```
(mpsFileLoc): /AppleInternal/Library/BuildRoots/0032d1ee-80fd-11ee-8227-6aecfccc70fe/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:96:0: error: 'mps.reduction_min' op operand #0 must be tensor of MPS type values or memref of MPS type values, but got 'tensor<5x5xcomplex<f32>>'
(mpsFileLoc): /AppleInternal/Library/BuildRoots/0032d1ee-80fd-11ee-8227-6aecfccc70fe/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:96:0: note: see current operation: %3 = "mps.reduction_min"(%1, %2) <{keep_dims}> : (tensor<5x5xcomplex<f32>>, tensor<2xsi32>) -> tensor<1x1xcomplex<f32>>
```
- Treat bools as int8 to fix regression re-surfaced in `index_fill` (used to be broken in Monterey, then fixed in Ventura and broken in Sonoma again)
- `nn.functional.max_pool2d` results now match CPU output for uint8 dtype in Sonoma
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116764
Approved by: https://github.com/kulinseth, https://github.com/seemethere
Fixes https://github.com/pytorch/pytorch/issues/116385
Don't call `torch._transformer_encoder_layer_fwd` when `bias=False`
`bias=False` was not something that `torch._transformer_encoder_layer_fwd` was meant to work with, it was my bad that this wasn't tested as I approved https://github.com/pytorch/pytorch/pull/101687.
`bias=False` was causing the `tensor_args` in [`TransformerEncoder`](a17de2d645/torch/nn/modules/transformer.py (L663-L677)) to contain `None`s and error on checks for the fastpath like `t.requires_grad for t in tensor_args`.
Alternative fix would be to
1) Pass `torch.zeros_like({*}.weight)` to the kernel when `bias=False` and filter `tensor_args` as appropriate
2) Fix `torch._transformer_encoder_layer_fwd` to take `Optional<Tensor>` for biases and fix the kernels as appropriate
Let me know if these approaches are preferable
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116760
Approved by: https://github.com/jbschlosser
Note about the Updates:
This PR:
1. skips more flash attention related UTs on MI200
2. Fix additional ATen compiling errors after hipification
3. Fix the author "root" of a specific commit
4. Includes the patch from Nikita in favor of block level static initialization.
CAVEAT: This revised PR has a commit that modifies the CI to force its running on MI200 nodes. That specific commit must be reverted before merge.
Original PR (https://github.com/pytorch/pytorch/pull/114309) Note:
This pull requests add initial Flash Attention support for AMD/ROCM platform. It added a specialized Triton repository/branch as a compile-time dependency for Flash Attention math library on AMD/ROCM. This triton submodule is not used at runtime and will not be shipped to the final pytorch package. We have the plan to release this specialized Triton as a separate project.
Know limitations:
- Only supports MI200 series GPU (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`.
- Only supports power of two sequence lengths.
- No support for varlen APIs.
- Only support head dimension 16,32,64,128.
- Performance is still being optimized.
Fixes#112997
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115981
Approved by: https://github.com/malfet
Summary:
If heartbeat monitor times out and kills the process, we want to know why.
It's convenient to use an internal tool for this, but we plan to later
integrate with torchelastic to call into pyspy or something else, which will be
both better (including py stacks) and compatible with OSS.
Test Plan: tested manually, observed c++ stacktraces were dumped
Reviewed By: fduwjj
Differential Revision: D52370243
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116717
Approved by: https://github.com/zdevito
Summary: Now we can allocate an AOTIModelContainerRunner object instead of relying on torch.utils.cpp_extension.load_inline. Also renamed AOTInductorModelRunner to AOTIRunnerUtil in this PR.
Test Plan: CI
Reviewed By: khabinov
Differential Revision: D52339116
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116269
Approved by: https://github.com/khabinov
# Motivation
As mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), The first runtime component we would like to upstream is `Device` which contains the device management functions of Intel GPU's runtime. To facilitate the code review, we split the code changes into 4 PRs. This is one of the 4 PRs and covers the changes under `c10`.
# Design
Intel GPU device is a wrapper of sycl device on which kernels can be executed. In our design, we will maintain a sycl device pool containing all the GPU devices of the current machine, and manage the status of the device pool by PyTorch. The thread local safe is considered in this design. The corresponding C++ files related to `Device` will be placed in c10/xpu folder. And we provide the c10 device runtime APIs, like
- `c10::xpu::device_count`
- `c10::xpu::set_device`
- ...
# Additional Context
In our plan, 4 PRs should be submitted to PyTorch for `Device`:
1. for c10
2. for aten
3. for python frontend
4. for lazy initialization shared with CUDA
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116019
Approved by: https://github.com/gujinghui, https://github.com/jgong5, https://github.com/EikanWang, https://github.com/malfet
Summary:
Currently we have a very ugly specialization on edge dialect in verifier like the following:
```
# TODO Remove this branch.
if ep.dialect == "EDGE": # !!! Don't change this allowlist. !!!
pass
else:
raise e
```
In this diff we do some additional work to make signature checking also work in exir. We decouple the transformation stack in torch export and exir so that different layers of the stack can evolve in their own fashion and the team can divide and conquer them seperately.
Test Plan: CI
Differential Revision: D52499225
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116705
Approved by: https://github.com/tugsbayasgalan
As per title. With this, https://github.com/pytorch/pytorch/issues/93697
does not choke, but spits out many of these:
```
[ERROR] Name: "L['self']"
[ERROR] Source: local
[ERROR] Create Function: NN_MODULE
[ERROR] Guard Types: ['ID_MATCH']
[ERROR] Code List: ["___check_obj_id(L['self'], 139962171127504)"]
[ERROR] Object Weakref: <weakref at 0x7f4b72f7c9a0; to
'ActorCriticPolicy' at 0x7f4b7b7df6d0>
[ERROR] Guarded Class Weakref: <weakref at 0x7f4afbd08b30; to
'ABCMeta' at 0x56463a727840 (ActorCriticPolicy)>
[ERROR] Created at:
[ERROR] File
"/home/lezcano/git/pytorch/pytorch/torch/_dynamo/variables/builder.py",
line 248, in __call__
[ERROR] vt = self._wrap(value)
[ERROR] File
"/home/lezcano/git/pytorch/pytorch/torch/_dynamo/variables/builder.py",
line 474, in _wrap
[ERROR] return self.wrap_module(value)
[ERROR] File
"/home/lezcano/git/pytorch/pytorch/torch/_dynamo/variables/builder.py",
line 941, in wrap_module
[ERROR] return self.tx.output.register_attr_or_module(
[ERROR] File
"/home/lezcano/git/pytorch/pytorch/torch/_dynamo/output_graph.py", line
735, in register_attr_or_module
[ERROR] install_guard(source.make_guard(GuardBuilder.NN_MODULE))
[ERROR] Error while creating guard:
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116460
Approved by: https://github.com/jansel
ghstack dependencies: #116459
This is proof-of-concept implementation of how people can use a marker `mark_strict` to enable torchdynamo while exporting under non-strict mode. The main idea is that `mark_strict` will turn into an HOO which then utilizes dynamo to do correctness analysis in the same way how torch.cond works today. There are some notable limitations:
1. This API is not meant for public use yet
2. Strict region can't work with arbitrary container inputs
3. We don't preserve `nn_module_stack` and other node metadata for the strict region.
4. strict_mode HOO will show up in the final graph. This is undesirable in the long term, but for short term experiments, it should be good enough. Will fix this in the follow up PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114658
Approved by: https://github.com/ydwu4
Summary:
Currently, the code is working. We know this becuase we observe heartbeat
timeouts.
However, there is a chance that if the code were refactored, the compiler could
optimize away the load of heartbeat_ inside heartbeatMonitor, and we wouldn't
know.
Using atomic here is not really for thread synchronization, but more to ensure
compiler optimizations (hoisting the read outside the loop) can never be
allowed to happen. Again, we know this isn't currently happening bc if it
were, it would not be an intermittent failure, it would be an always failure.
(at least with a fixed compiler/platform).
I previously avoided atomic bc we didn't want shared locks between heartbeat
monitor and watchdog thread. Why? if watchdog held the lock and hung, monitor
could also hang. However, this really can't happen (Afaik) when using an
atomic.
Test Plan: existing CI tests
Differential Revision: D52378257
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116702
Approved by: https://github.com/fduwjj, https://github.com/zdevito
Stacks recorded when tensors are being freed during exit could
try to acquire the GIL. Py_IsInitialized can be used to check if we
are post Python exit and should not attempt to acquire the GIL.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116709
Approved by: https://github.com/aaronenyeshi
**Summary**
Take this Pattern as example
```
# ReLU
# / \
# Conv1
# / \
# Conv2
# \ /
# Add
```
The current `ConvBinaryInplace` check will fail to perform Inplace fusion (using outplace fusion instead) due to `ReLU` having 2 users. However, if all users of `ReLU` are ancestor nodes of `Conv2`, we should be able to proceed with the `ConvBinaryInplace` fusion. This diff relaxes the `ConvBinaryInplace` check accordingly.
**TestPlan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k test_conv2d_binary_inplace_fusion_pass_cpu
python -m pytest test_mkldnn_pattern_matcher.py -k test_conv2d_binary_inplace_fusion_failed_cpu
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115153
Approved by: https://github.com/CaoE, https://github.com/jgong5
We change manually_set_subgraph_inputs to three modes: manual, automatic and flatten_manual. The flatten_manual wil first flatten the sub_args then recussively call set_subgrah_inputs = "manual". This allows us to control the order of the placeholder shown up in the graph, which is necessary for map, where we want to keep the mapped arguments before the rest positional arguments.
Right now, map only takes a single tensor as mapped argument but it would become pretty easy to match the subgraph inputs to original proxy if we have a "flatten_manual" option.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115853
Approved by: https://github.com/zou3519
Ran into the following exception during C++ file compilation.
```
error: use of undeclared identifier 'aoti_torch_scatter_reduce_out'
aoti_torch_scatter_reduce_out(buf12, buf12,0,buf13,buf14, "sum",1);
^
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116700
Approved by: https://github.com/aakhundov
This PR increases the read size for the `hub.download_url_to_file` function from 8,192 bytes to 131,072 bytes (128 * 1,024), as reading in larger chunks should be more efficient. The size could probably be larger still, at the expense of the progress bar not getting updated as often.
It re-introduces use of the `READ_DATA_CHUNK` constant that was originally used for this purpose in 4a3baec961 and since forgotten.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116536
Approved by: https://github.com/NicolasHug
Updated range_constraints to be the union of shape_env.var_to_range and shape_env.runtime_var_to_range, with shape_env.runtime_var_to_range taking priority.
Due to 0/1 specialization, if we bound an unbacked symint to be less than 5, the range of possible values for this symint is actually recorded as [2, 5] in shape_env.var_to_range. To fix this so that users will be able to see a more understandable range of [0, 5], shape_env.runtime_var_to_range was created to store the range of [0, 5]. Since range_constraints is a user-facing attribute to query the ranges of certain symints, we want to use shape_env.runtime_var_to_range to get the unbacked symints ranges, rather than shape_env.var_to_range.
Additionally, run_decompositions() has an issue where it will always add assertions to the graph, even if a previous run has already added the assertions. So, I added a part to the AddRuntimeAssertionsForInlineConstraints which will store which assertions have already been added.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115427
Approved by: https://github.com/zhxchen17
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 27084ed</samp>
This pull request simplifies and cleans up the code that uses the cuDNN library for convolution, batch normalization, CTC loss, and quantized operations. It removes the unnecessary checks and conditions for older cuDNN versions and the experimental cuDNN v8 API, and ~~replaces them with the stable `cudnn_frontend` API that requires cuDNN v8 or higher. It also adds the dependency and configuration for the `cudnn_frontend` library in the cmake and bazel files.~~ Correction: The v7 API will still be available with this PR, and can still be used, without any changes to the defaults. This change simply always _builds_ the v8 API, and removes the case where _only_ the v7 API is built.
This is a re-land of https://github.com/pytorch/pytorch/pull/91527
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95722
Approved by: https://github.com/malfet, https://github.com/atalman
Fixes#116504
When this API is invoked, a runtime error occurs. When the NPU acceleration device is used, the input tensor is not processed at a branch. As a result, some input tensors are on the CPU and some are on the NPU. As a result, an error is reported.
Here, I adapt to other acceleration devices and move the tensor on the acceleration device to the CPU. It's tested and feasible.
The details are in the issue:#116504
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116682
Approved by: https://github.com/lezcano
Context: Existing FSDPExtension have some bug in the case when the
unflatten tensor involves some compute/communications in cuda stream,
the current logic of FSDPExtension unflatten tensor happens in the
unshard stream, which makes runtime lost sync with the compute stream,
and if there're some dependencies between the compute stream and the
unflatten tensor logic, currently it would lose sync point, which could
possibly lead to NaN.
This PR make the FSDPExtension to record the compute stream and let
DTensorExtension to directly use the compute stream for unflatten_tensor.
In long term we might want to directly make the FSDP runtime logic to only
make the unshard happen in unshard stream, and use unshard views to
happen in the compute stream. We currently fix this in the Extension
directly as this is the simplest thing to do without affecting FSDP
runtime logic
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116559
Approved by: https://github.com/awgu, https://github.com/fduwjj, https://github.com/yifuwang
ghstack dependencies: #116426
Summary: Sometimes we will get a None value from ops which returns Tensor type in the schema. Allow this case during serialization.
Test Plan: test__scaled_dot_product_flash_attention
Differential Revision: D52491668
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116664
Approved by: https://github.com/SherlockNoMad
Summary:
- add workMetaList_.size() so we know how many outstanding works there
were when killing
- Print our first log before debuginfo dump instead of after, since it
is clearer when reading the logs that we time out and then dump
- Organize the log strings- put them near where they are used
cc mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l yf225
imported-using-ghimport
Test Plan: Imported from OSS
Reviewed By: fduwjj
Differential Revision: D52369167
Pulled By: wconstab
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116661
Approved by: https://github.com/fduwjj
Summary:
Making union types harder to use wrong:
1. Initialize unset fields still with None, but we don't assert on the uniqueness of not None field, since it's possible to set a real field to None.
2. Raise error on unset fields in union, reducing the error surface and enforcing type safety.
3. Serialize union type with only tag and omit all the unset fields, this makes the serialized model more readable and debuggable.
Test Plan:
buck test mode/opt caffe2/test:test_export
buck test mode/opt executorch/exir/...
buck test mode/opt mode/inplace aps_models/ads/icvr/tests:export_test
Differential Revision: D52446586
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116511
Approved by: https://github.com/angelayi
Previously, we have the writer register to each NCCL PG(backend), so for every pg, we have a NCCL PG instance, so if we use some customized writer when multiple sub-PGs are used, we need to ensure user to register the writer for every backend which indicates a bad UX. Furthermore, the debug info is global, so it does not make sense to have the writer for each instance. We even have a static mutex in the `dumpDebuggingInfo` to ensure we serialize the write, that makes it more obvious that we can make the writer a singleton so that we only have one writer instance for all PG instances.
Although the rationale is clear, the implementation may vary a lot. So this PR is RFC for now to see if this implementation makes sense or not.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116489
Approved by: https://github.com/kwen2501
Summary:
We introduced `node.meta["numeric_debug_handle"]` in https://github.com/pytorch/pytorch/pull/114315 to
indicate the numeric debug handle for values in the graph, in this PR we supported preserving this field
in prepare and convert so that we can use these for numerical debugging
Next: we also want to preserve these in deepcopy of GraphModule as well
Test Plan:
python test/test_quantization.py -k test_quantize_pt2e_preserve_handle
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116477
Approved by: https://github.com/tugsbayasgalan
As titled, this PR adds gradient hook to the FSDP DTensor extension, to check if there's gradients that are AsyncCollectiveTensors, if there're some, we eagerly wait there.
This is needed because sometimes the parameter's gradient might still pending with AsyncCollectiveTensor, if we directly feed them to FSDP then FSDP would use the ACT's storage to do reduce_scatter, which is wrong.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116122
Approved by: https://github.com/awgu, https://github.com/fduwjj
Somehow the logprefix only have ProcessGroup 0 rank [global rank]. This does not give the expected result as per the comment says "a prefix that is unique to this process group and rank". So this PR fix it and make it different for different subPGs.
The reason is that we set the prefix static which is shared across all NCCLPG instances and whoever calls this function first will set `rank_` and `uid_` to the prefix. We always initialize PG 0 first that's why we always see PG[0] + global ranks for all subPGs.
<img width="484" alt="image" src="https://github.com/pytorch/pytorch/assets/6937752/7fbb0226-7e25-4306-9cee-22e17b00bc8e">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116520
Approved by: https://github.com/wconstab
ghstack dependencies: #116218
Weight tying is useful when we'd like to share weights (and their gradients) between two modules, e.g. the word/token embedding module and the output linear module in language models. This test demonstrates that with DTensor it can be achieved just as with normal tensor, e.g. using `model.fc.weight = model.embedding.weight`.
To test: `python test/distributed/tensor/parallel/test_tp_examples.py -k test_weight_tying`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116475
Approved by: https://github.com/wanchaol, https://github.com/fduwjj
A less general version of this wrapper was used in the keynote on
`torch.compile(numpy)`. We expose a generic version of the wrapper
that works seamlessly with `torch.compile`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114610
Approved by: https://github.com/albanD
Fixes#96617
There was already an attempt to fix this build issue - see #96618. One commit is reused from this attempt (@zboszor) with adjustments to commit message. Another one differs and takes into account provided review feedback (@ezyang).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115976
Approved by: https://github.com/ezyang
Introduce `mtl_setBuffer` and `mps_dispatch1DJob` and use it to bind
Tensor to metal kernel as well as disptatch Metal job
This avoids potential typos/bugs when one tries to bind tensor to a
Metal kernel but forgets about storage offset
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116566
Approved by: https://github.com/Skylion007
Fix: #115900https://github.com/pytorch/xla/issues/6085
This PR adds a last resort for testing for BF16 support on CUDA. This is necessary on GPUs
such as RTX 2060, where `torch.cuda.is_bf_supported()` returns False, but we can
successfully create a BF16 tensor on CUDA.
Before this PR:
```python
>>> torch.cuda.is_bf_supported()
False
>>> torch.tensor([1.], dtype=torch.bfloat16, device="cuda")
tensor([...], device='cuda:0', dtype=torch.bfloat16)
```
After this PR:
```python
>>> torch.cuda.is_bf_supported()
True
>>> torch.tensor([1.], dtype=torch.bfloat16, device="cuda")
tensor([...], device='cuda:0', dtype=torch.bfloat16)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115924
Approved by: https://github.com/jansel
Remove weird logic for designating matrices as transposed if sizes match(which always true if square matrices are multiplied with each other), which resulted in `torch.addmm` returns transposed matrix compared to `torch.mm`, see below:
```
% python -c "import torch;torch.set_default_device('mps');a=torch.eye(2);b=torch.arange(4.0).reshape(2, 2);print(a@b);print(torch.addmm(torch.zeros(2, 2), a,b))"
tensor([[0., 1.],
[2., 3.]], device='mps:0')
tensor([[0., 2.],
[1., 3.]], device='mps:0')
```
Fixes introduced to `torch.mm` in https://github.com/pytorch/pytorch/pull/77462 suggests that this is not needed
Modify `sample_inputs_addmm` to test `torch.addmm` with square matrices, but skip this config for `test_autograd_dense_output_addmm`, see https://github.com/pytorch/pytorch/issues/116565
TODO: probably tweak tolerances, as `test_output_match_addmm_cpu_float16` fails with 2x2 matrices, but passes using 3x3 ones with errors slightly exceeding the tolerance
Fixes https://github.com/pytorch/pytorch/issues/116331
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116547
Approved by: https://github.com/albanD, https://github.com/Skylion007
Use proper `os.fspath` to better convert `os.PathLike` object to a path.
Replace `pathlib.Path` with `os.PathLike` which is more generic and typing correct. `pathlib.Path` is an instance of `os.PathLike`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116562
Approved by: https://github.com/malfet
For training graphs (when inputs require grad), previously, we would speculate the forward and backward graph to determine if there are any graph breaks, side effect and etc but would not actually use these speculated graphs. We would just insert a call function node on the graph and later rely on autograd's tracing.
This approach does not work for more generalized graphs like graphs that include user defined triton kernels because autograd is not able to do the higher order function conversation.
This PR speculates the forward and backward functions and emits them in a HOF that later gets used via templating mechanism.
While working on this PR, I have exposed some bugs in the current tracing due to trampoline functions losing the source information resulting in incorrect graphs being produced. I have fixed these source information bugs and killed the trampolines.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116358
Approved by: https://github.com/jansel
Basically we observed that if there are multiple PGs and if the timeout happens on one of the subPG, we somehow use the local rank in the dump file. We realize that:
1. For setting the timeout signal in the store, any watchdog thread from any PG can do that.
2. For checking and dump, only the watchdog thread of default PG which we will always create and contain all ranks (no file name conflict) is needed here because the store signal and dump debug info are all global.
3. Since dump is global, we want to avoid the case when ranks from sub-PG pollute logs from global ranks (local rank 0 vs global rank 0). So that we use global ranks here to initialize debug info writer. (Down the road, we are thinking about making it a singleton so that user only register it once for multi-PG case.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116218
Approved by: https://github.com/wconstab
Turned command sequence mentioned in https://dev-discuss.pytorch.org/t/how-to-get-a-fast-debug-build/1597 and in various discussions into a tool that I use almost daily to debug crashes or correctness issues in the codebase
Essentially it allows one to turn this:
```
Process 87729 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
frame #0: 0x00000001023d55a8 libtorch_python.dylib`at::indexing::impl::applySelect(at::Tensor const&, long long, c10::SymInt, long long, c10::Device const&, std::__1::optional<c10::ArrayRef<c10::SymInt>> const&)
libtorch_python.dylib`at::indexing::impl::applySelect:
-> 0x1023d55a8 <+0>: sub sp, sp, #0xd0
0x1023d55ac <+4>: stp x24, x23, [sp, #0x90]
0x1023d55b0 <+8>: stp x22, x21, [sp, #0xa0]
0x1023d55b4 <+12>: stp x20, x19, [sp, #0xb0]
```
into this
```
Process 87741 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
frame #0: 0x00000001024e2628 libtorch_python.dylib`at::indexing::impl::applySelect(self=0x00000001004ee8a8, dim=0, index=(data_ = 3), real_dim=0, (null)=0x000000016fdfe535, self_sizes= Has Value=true ) at TensorIndexing.h:239:7
236 const at::Device& /*self_device*/,
237 const c10::optional<SymIntArrayRef>& self_sizes) {
238 // See NOTE [nested tensor size for indexing]
-> 239 if (self_sizes.has_value()) {
240 auto maybe_index = index.maybe_as_int();
241 if (maybe_index.has_value()) {
242 TORCH_CHECK_INDEX(
```
while retaining good performance for the rest of the codebase
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116521
Approved by: https://github.com/atalman
Summary:
`_fold_conv_bn_qat` is taking a long time currently, so skipping it when it's not necessary,
we can have follow up fixes to actually reduce the patterns or cache the patterns if possible
Test Plan:
uncomment the print in `test_speed`, run
python test/test_quantization.py -k test_speed
and make sure the convert time is low, e.g. 0.1s instead of 8-9 seconds
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116440
Approved by: https://github.com/andrewor14
Summary:
The INTERFACE_CALL opcode for the mobile TorchScript interpreter contained an out of bounds read issue leading to memory corruption.
This change adds an explicit check that the number of inputs passed to the format method called when handling the INTERFACE_CALL opcode is a valid and within bounds of the stack.
Test Plan: contbuild + OSS signals
Differential Revision: D49739450
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110301
Approved by: https://github.com/dbort
Run slow/periodic and inductor workflows on push to release branches
Right now there are no signal from those jobs on release branches at all.
This will run periodic jobs on every commit to release branch, which is fine, as they are short lived and have a much lower traffic that a regular jobs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116516
Approved by: https://github.com/clee2000
This PR adds a very simple mutation tracking mechanism to dynamo which can later be improved to be more thorough. Currently it allows tensors to be in tl.load but if it sees a tensor used anywhere else (including a tl.load), it bails out.
One question about the method: is `ast.NodeVisitor` the best thing to use here? Having to detect mutations with this is kinda pretty since you need to keep setting state at each transition.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116466
Approved by: https://github.com/aakhundov
Fixes https://github.com/pytorch/pytorch/issues/107842
Unlike `AdaptiveAvgPool`, `AdaptiveMaxPool` does not have a CUDA kernel for ChannelsLast. We workaround this by calling `contiguous()` on the input. However, there is an edge case when the channels dimension has size 1.
```python
>>> t = torch.randn(2, 1, 3, 3)
>>> t.stride()
(9, 9, 3, 1)
>>> t_c = t.to(memory_format=torch.channels_last)
>>> t_c.stride()
(9, 1, 3, 1) # (CHW, 1, CW, C)
>>> t_c.is_contiguous()
True # contiguity check doesn't check strides for singleton dimensions
```
Since the CUDA kernel treats the batch,`B`, and channels,`C`, dimensions as implicitly flattened and increments the data pointer for `input` to the start of the next plane using
669b182d33/aten/src/ATen/native/cuda/AdaptiveMaxPooling2d.cu (L67)
If our input falls into the aforementioned edge case, the `data_ptr` will not be incremented correctly. The simple fix for this is to calculate the stride for the channels dimension using $\prod_{i > 1}size(i)$
Analogous fix for the 3D case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116482
Approved by: https://github.com/albanD
Summary: currently pad_sequence caused symbolic shape specialization in export which is unintended. Adding a decomp seems to work to avoid the c++ kernel which caused the specialization.
Test Plan: buck test mode/opt caffe2/test:test_export -- -r pad_sequence
Reviewed By: SherlockNoMad
Differential Revision: D52345667
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116285
Approved by: https://github.com/peterbell10, https://github.com/lezcano
By adding `get_ghstack_dependent_prs` that using `git branch --contains`
finds all PRs containing stacked branch, selecting longest one (in
terms of distance between origin and default branch) and skipping all
open PRs
Please note, that reverts should be applied in a reversed order with the
one how PRs were landed originally.
Use a bit of a defensive programming, i.e. revert single PR if attempt to fetch dependencies fails for some reason.
Test plan:
- Lint
- ```
>>> from trymerge import GitRepo, GitHubPR, get_ghstack_prs, get_ghstack_dependent_prs
>>> pr=GitHubPR("pytorch", "pytorch", 115188)
>>> pr1=GitHubPR("pytorch", "pytorch", 115210)
>>> repo=GitRepo("/Users/nshulga/git/pytorch/pytorch")
>>> get_ghstack_dependent_prs(repo, pr1)
[('22742d93a5357c9b5b45a74f91a6dc5599c9c266', <trymerge.GitHubPR object at 0x100f32f40>)]
>>> get_ghstack_dependent_prs(repo, pr)
[('22742d93a5357c9b5b45a74f91a6dc5599c9c266', <trymerge.GitHubPR object at 0x10102eaf0>), ('76b1d44d576c20be79295810904c589241ca1bd2', <trymerge.GitHubPR object at 0x10102eb50>)]
>>> rc=get_ghstack_dependent_prs(repo, pr)
rc[0]>>> rc[0][1].pr_num
115210
>>> rc[1][1].pr_num
115188
```
- see: https://github.com/malfet/deleteme/pull/59#issuecomment-1869904714 and https://github.com/malfet/deleteme/pull/74#issuecomment-1870542702
Fixes https://github.com/pytorch/test-infra/issues/4845
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116447
Approved by: https://github.com/huydhn
ghstack dependencies: #116446
Prep change for allowing stacked reverts
This is a no-op that factors out some helper function that would be
useful later:
- `get_pr_commit_sha` finds a committed sha for a given PR
- `_revlist_to_prs` converts a revlist to GitHubPRs conditionally
filtering some out
- `do_revert_prs` reverts multiple PRs in a batch, but so far is
invoked with only one PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116446
Approved by: https://github.com/huydhn, https://github.com/seemethere
Fixes#112602
For some reason, I could not get the same output when running pycodestyle command as indicated in the issue. I manually ran ruff checks fixing the following issues `D202`, `D204`, `D205`, `D207`, `D400` and `D401`.
### Requested output
nn.modules.activation:
before: 135
after: 79
nn.modules.batchnorm
before: 21
after: 3
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113531
Approved by: https://github.com/mikaylagawarecki
## Context
**Currently, `*.h` and `*.cpp` produces many lint warnings/errors from `clang-tidy` in the Meta internal Phabricator mirror**. These changes address all the lint warnings in the `api`, `graph`, and `impl` folders in preparation for upcoming planned work.
## Review Guide
* Most changes are the result of automatically applied patches from `clang-tidy`
* However, some warnings had to be manually addressed
* There should be no functional changes
* Many of the `clang-tidy` warnings arose from the `facebook-hte-BadMemberName` rule which checks for compliance with variable naming rules from Meta's internal C++ style guide
* However, the rest of the ATen codebase does not conform to this rule, and PyTorch Vulkan was written to be consisten with ATen's naming conventions; thus, to stay consistent with the rest of ATen, this rule is disabled wherever relevant using `// @lint-ignore-every CLANGTIDY facebook-hte-BadMemberName`
* Lint was disabled entirely for`vulkan_api_test.cpp` since there are too many warnings to address at the moment. Addressing all of them will be a small project of its own; thus, in the interim lint will be disabled to reduce distracting signals for developers.
Internal:
## Notes for Internal Reviewers
This diff was largely created with
```
cd ~/fbsource/xplat/caffe2/aten/src/ATen/native/vulkan
arc lint -e extra -a --take CLANGTIDY * 2>&1 | tee ~/scratch/lint.txt
```
The above command automatically applied patches suggested by `clang-tidy`, and the rest of the warnings were addressed manually.
To disable `facebook-hte-BadMemberName`, I found that disabling it via a `.clang-tidy` file didn't work with `arc lint`, and the only way that worked was through the adding a comment
```
// @lint-ignore-every CLANGTIDY facebook-hte-BadMemberName
```
Differential Revision: [D50336057](https://our.internmc.facebook.com/intern/diff/D50336057/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116431
Approved by: https://github.com/GregoryComer, https://github.com/kirklandsign
After this refactor:
* ```TorchVariable``` definition and all references are removed.
* All ```is_allowed``` references except one are removed.
- The only left one is in ```torch/_dynamo/decorators:_disallow_in_graph_helper```. It was called when users put ```disallow_in_graph``` decorator on a function. Since we use the lists in ```trace_rules``` to decide the function's trace rule, so the decorator would only be used as customer function rather than torch functions. I'll defer this to a separate decorator refactor PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116312
Approved by: https://github.com/jansel
Not sure if this is recent API change or what but `gh_get_labels('malfet', 'deleteme')` used to raise an exception (see https://github.com/malfet/deleteme/actions/runs/7334535266/job/19971328673#step:6:37 )
```
File "/home/runner/work/deleteme/deleteme/.github/scripts/label_utils.py", line 50, in get_last_page_num_from_header
link_info[link_info.rindex(prefix) + len(prefix) : link_info.rindex(suffix)]
AttributeError: 'NoneType' object has no attribute 'rindex'
```
And with this fix it returns the expected list
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116444
Approved by: https://github.com/huydhn
The cuDNN RNNv6 API has been deprecated and support will be dropped in a recent release; this PR migrates to the newer API to support newer cuDNN versions that would otherwise break the build.
Note that it may not be tested yet in upstream CI if the upstream CI cuDNN version is less than 8.9.7.
CC @ptrblck @malfet
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115719
Approved by: https://github.com/albanD, https://github.com/malfet
Should allow for uses cases mentioned in #110940
This would allow scalars to also be float64s in the foreach implementation. The fused implementation would still create a float32 step on Adam and AdamW. This PR also does NOT worry about performance and is mainly for enablement.
Next steps:
- Relax the constraint on fused adam(w) and allow torch.float64 scalars there
- Allow _performant_ mixed dtypes in foreach (a bigger project in itself).
This PR will conflict with my other PRs, I will figure out a landing order
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115841
Approved by: https://github.com/albanD
When exporting a model with a convolution kernel on cpu, if mkldnn is disabled and nnpack is enabled, export will go down the nnpack optimized convolution kernel for certain shapes ((code pointer)[cd449e260c/aten/src/ATen/native/Convolution.cpp (L542-L552)]). This means that we will automatically create a guard on that certain shape. If users want to export without any restrictions, one option is to disable nnpack. However, no config function exists for this, so this PR is adding a config function, similar to the `set_mkldnn_enabled` function.
Original context is in https://fb.workplace.com/groups/1075192433118967/posts/1349589822345892/?comment_id=1349597102345164&reply_comment_id=1349677642337110.
To test the flag, the following script runs successfully:
```
import os
import torch
from torchvision.models import ResNet18_Weights, resnet18
torch.set_float32_matmul_precision("high")
model = resnet18(weights=ResNet18_Weights.DEFAULT)
model.eval()
with torch.no_grad():
# device = "cuda" if torch.cuda.is_available() else "cpu"
torch.backends.mkldnn.set_flags(False)
torch.backends.nnpack.set_flags(False) # <--- Added config
device = "cpu"
model = model.to(device=device)
example_inputs = (torch.randn(2, 3, 224, 224, device=device),)
batch_dim = torch.export.Dim("batch", min=2, max=32)
so_path = torch._export.aot_compile(
model,
example_inputs,
# Specify the first dimension of the input x as dynamic
dynamic_shapes={"x": {0: batch_dim}},
# Specify the generated shared library path
options={
"aot_inductor.output_path": os.path.join(os.getcwd(), "resnet18_pt2.so"),
"max_autotune": True,
},
)
```
I'm not sure who to add as reviewer, so please feel free to add whoever is relevant!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116152
Approved by: https://github.com/malfet
This PR is the start to enable the integrate pytorch distributed logs in Torch LOGs. We now already have one tag "distributed" for all distributed components but distributed is a very large component and we want to have some hierarchy and give users options to only turn on logs for certain submodules. So we also added tags starting with "dist_*" for each submodule. (This PR only adds some of them and we are going to add more down the road)
Related discussions can be found here: https://github.com/pytorch/pytorch/issues/113544
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116434
Approved by: https://github.com/awgu, https://github.com/wanchaol
get_unbacked_symbol_defs and get_unbacked_symbol_uses inconsistently return dicts vs. sets. The majority of the use cases of these methods use them for set membership, which is deterministic, but set iteration is non deterministic. Therefore, in the one place where we iterate through unbacked symbols, we sort by the symbol name before iterating to preserve determinism.
Another approach would be to have these functions consistently return dictionaries, where the key of the dictionary is the name of the symbol. I'm happy to do that approach if we think it's likely future code will forget to sort before iteration.
Fixes#113130
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116421
Approved by: https://github.com/oulgen, https://github.com/aakhundov
**Summary**:
Ops like `native_layer_norm_backward` return a tuple of optional torch.Tensor.
This PR allows to use OpStrategy to represent `native_layer_norm_backward`'s
return value sharding.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115682
Approved by: https://github.com/wanchaol
Removes a part of the sparse adam test and the following three tests: `test_fused_optimizer_raises`, `test_duplicate_params_across_param_groups`, `test_duplicate_params_in_one_param_group`
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (d2d129de)]$ python test/test_optim.py -k test_fused_optimizer_raises -k test_duplicate_params_across_param_groups -k test_duplicate_params_in_one_param_group
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
...
----------------------------------------------------------------------
Ran 3 tests in 0.023s
OK
```
Increases coverage by testing the duplicate param tests on ALL the optims instead of just one each. Also fixes SparseAdam bug which was accidentally calling torch.unbind through list instead of putting params in a list. This bug was caught by migrating the weird warning stuff to just one easy warning context manager, which checks that nothing else gets raised.
The new test_errors does not run slower than before, overhead is still king:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (d2d129de)]$ python test/test_optim.py -k test_errors
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
..........................
----------------------------------------------------------------------
Ran 26 tests in 10.337s
OK
```
Compared to test_errors BEFORE my commit :p
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (b47aa696)]$ python test/test_optim.py -k test_errors
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
.............sssssssssssss
----------------------------------------------------------------------
Ran 26 tests in 11.980s
OK (skipped=13)
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (b47aa696)]$
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116315
Approved by: https://github.com/mikaylagawarecki
As the title, this PR enables vectorization for the situation when the the index_expr depends on vectorized itervar. There are two cases here:
1. The vectorized itervar has constant stride in the index_expr. We vectorize the index_expr with `Vectorized<int32>::arange` for this case.
2. Otherwise, we load the index_expr vector in a non-contiguous way with a loop.
Below is the generated code for the first case from the test `test_concat_inner_vec`. Here `x1` is the index_expr and depends on the vectorized itervar `x1`. It has constant stride 1. We vectorized it with arange. We use `all_zero` to implement a short-cut for masks to avoid unnecessary execution of nested masked regions which are invalid.
Before:
```c++
#pragma omp for collapse(2)
for(long x0=static_cast<long>(0L); x0<static_cast<long>(32L); x0+=static_cast<long>(1L))
{
for(long x1=static_cast<long>(0L); x1<static_cast<long>(155L); x1+=static_cast<long>(1L))
{
auto tmp0 = c10::convert<long>(x1);
auto tmp1 = static_cast<long>(0);
auto tmp2 = tmp0 >= tmp1;
auto tmp3 = static_cast<long>(35);
auto tmp4 = tmp0 < tmp3;
auto tmp5 = [&]
{
auto tmp6 = in_ptr0[static_cast<long>(x1 + (35L*x0))];
return tmp6;
}
;
auto tmp7 = tmp4 ? tmp5() : static_cast<decltype(tmp5())>(0.0);
auto tmp8 = tmp0 >= tmp3;
auto tmp9 = static_cast<long>(155);
auto tmp10 = tmp0 < tmp9;
auto tmp11 = [&]
{
auto tmp12 = in_ptr1[static_cast<long>((-35L) + x1 + (120L*x0))];
return tmp12;
}
;
...
```
After:
```c++
#pragma omp for
for(long x0=static_cast<long>(0L); x0<static_cast<long>(32L); x0+=static_cast<long>(1L))
{
for(long x1=static_cast<long>(0L); x1<static_cast<long>(144L); x1+=static_cast<long>(16L))
{
auto tmp0 = c10::convert<int>(x1);
auto tmp1 = at::vec::Vectorized<int32_t>::arange(tmp0, 1);
auto tmp2 = static_cast<int>(0);
auto tmp3 = at::vec::Vectorized<int>(tmp2);
auto tmp4 = to_float_mask(tmp1 >= tmp3);
auto tmp5 = static_cast<int>(35);
auto tmp6 = at::vec::Vectorized<int>(tmp5);
auto tmp7 = to_float_mask(tmp1 < tmp6);
auto tmp8 = [&]
{
auto tmp9 = masked_load(in_ptr0 + static_cast<long>(x1 + (35L*x0)), to_float_mask(tmp7));
return tmp9;
}
;
auto tmp10 =
[&]
{
if (all_zero(to_float_mask(tmp7)))
{
return at::vec::Vectorized<float>(static_cast<float>(0.0));
}
else
{
return decltype(tmp8())::blendv(at::vec::Vectorized<float>(static_cast<float>(0.0)), tmp8(), to_float_mask(tmp7));
}
}
()
;
...
```
Below is the generated code for the second case from the test case `test_expr_vec_non_contiguous`. Here, the index_expr is `31L + (63L*(c10::div_floor_integer(x1, 32L))) + (c10::div_floor_integer(x2, 32L))` which depends on the vectorized itervar `x2` and doesn't have constant stride. So, we load the index_expr vector with a loop. (In fact, this can be further optimized since the index_expr is invariant with the data points in the range [x2, x2+16). So it can be regarded as a scalar. This will be optimized in the follow-up PR.) The code uses `vector_lane_mask_check` to implement the masked version of non-contiguous load.
Before:
```c++
#pragma omp for collapse(2)
for(long x0=static_cast<long>(0L); x0<static_cast<long>(4L); x0+=static_cast<long>(1L))
{
for(long x1=static_cast<long>(0L); x1<static_cast<long>(1024L); x1+=static_cast<long>(1L))
{
{
float tmp_acc0 = -std::numeric_limits<float>::infinity();
for(long x2=static_cast<long>(0L); x2<static_cast<long>(1024L); x2+=static_cast<long>(1L))
{
auto tmp0 = c10::convert<long>(31L + (63L*(c10::div_floor_integer(x1, 32L))) + (c10::div_floor_integer(x2, 32L)));
auto tmp1 = static_cast<long>(2048);
auto tmp2 = tmp0 < tmp1;
auto tmp3 = [&]
{
auto tmp4 = in_ptr0[static_cast<long>(31L + (63L*(c10::div_floor_integer(x1, 32L))) + (2048L*(static_cast<long>(x1) % static_cast<long>(32L))) + (65536L*x0) + (c10::div_floor_integer(x2, 32L)))];
return tmp4;
}
;
auto tmp5 = tmp2 ? tmp3() : static_cast<decltype(tmp3())>(0.0);
tmp_acc0 = max_propagate_nan(tmp_acc0, tmp5);
}
out_ptr0[static_cast<long>(x1 + (1024L*x0))] = tmp_acc0;
}
}
}
```
After:
```c++
#pragma omp for
for(long x0=static_cast<long>(0L); x0<static_cast<long>(4L); x0+=static_cast<long>(1L))
{
for(long x1=static_cast<long>(0L); x1<static_cast<long>(1024L); x1+=static_cast<long>(16L))
{
{
#pragma omp declare reduction(max:at::vec::Vectorized<float>:omp_out = at::vec::maximum(omp_out, omp_in)) initializer(omp_priv={at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity())})
float tmp_acc0 = -std::numeric_limits<float>::infinity();
at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity());
for(long x2=static_cast<long>(0L); x2<static_cast<long>(1024L); x2+=static_cast<long>(1L))
{
auto tmp0 =
[&]
{
__at_align__ std::array<int, 16> tmpbuf;
#pragma GCC unroll 16
for (long x1_inner = 0; x1_inner < 16; x1_inner++)
{
tmpbuf[x1_inner] = static_cast<long>(31L + (63L*(c10::div_floor_integer((x1 + x1_inner), 32L))) + (c10::div_floor_integer(x2, 32L)));
}
return at::vec::Vectorized<int>::loadu(tmpbuf.data());
}
()
;
auto tmp1 = static_cast<int>(2048);
auto tmp2 = at::vec::Vectorized<int>(tmp1);
auto tmp3 = to_float_mask(tmp0 < tmp2);
auto tmp4 = [&]
{
auto tmp5 =
[&]
{
__at_align__ std::array<float, 16> tmpbuf;
#pragma GCC unroll 16
for (long x1_inner = 0; x1_inner < 16; x1_inner++)
{
if (vector_lane_mask_check(tmp3, x1_inner))
{
tmpbuf[x1_inner] = in_ptr0[static_cast<long>(31L + (63L*(c10::div_floor_integer((x1 + x1_inner), 32L))) + (2048L*(static_cast<long>((x1 + x1_inner)) % static_cast<long>(32L))) + (65536L*x0) + (c10::div_floor_integer(x2, 32L)))];
}
}
return at::vec::Vectorized<float>::loadu(tmpbuf.data());
}
()
;
return tmp5;
}
;
auto tmp6 =
[&]
{
if (all_zero(to_float_mask(tmp3)))
{
return at::vec::Vectorized<float>(static_cast<float>(0.0));
}
else
{
return decltype(tmp4())::blendv(at::vec::Vectorized<float>(static_cast<float>(0.0)), tmp4(), to_float_mask(tmp3));
}
}
()
;
tmp_acc0_vec = at::vec::maximum(tmp_acc0_vec, tmp6);
}
tmp_acc0_vec.store(out_ptr0 + static_cast<long>(x1 + (1024L*x0)));
}
}
}
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114545
Approved by: https://github.com/lezcano
Sometimes, the first statement that sets this variable in the try block fails due to out of memory issues and the finally block tries to delete this variable, but it was not written to in the first place.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116260
Approved by: https://github.com/lezcano
The `initial` argument in `functools.reduce` can be `None`.
```python
initial_missing = object()
def reduce(function, iterable, initial=initial_missing, /):
it = iter(iterable)
if initial is initial_missing:
value = next(it)
else:
value = initial
for element in it:
value = function(value, element)
return value
```
Reference:
- python/cpython#102759
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116398
Approved by: https://github.com/Skylion007
This pull request primarily addresses two issues to resolve the `QConvPointWiseBinaryPT2E` layout problem:
- As the changes made in 611a7457ca, for `QConvPointWiseBinaryPT2E` with post-op `sum`, we should also utilize `NoneLayout` and return `accum` instead of `QConvPointWiseBinaryPT2E`.
- Additionally, this pull request fixes an issue in the `_quantized_convolution_onednn` implementation. Given that we expect `accum` to be inplace changed, we should avoid copying `accum` by changing the memory format or data type inside the kernel implementation. Instead, we have moved the necessary changes of memory format or data type to the lowering of `QConvPointWiseBinaryPT2E`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115613
Approved by: https://github.com/jgong5, https://github.com/oulgen
ghstack dependencies: #116172
**Summary**
Re-land https://github.com/pytorch/pytorch/pull/115329. Open a new PR since the origin branch has been deleted.
Change the QConv2d Binary fusion post op name from `add` to `sum`, since we are actually using OneDNN `post op sum` instead of `Binary_Add` for now.
**TestPlan**
```
python -m pytest test_quantized_op.py -k test_qconv2d_sum_pt2e
python -m pytest test_quantized_op.py -k test_qconv2d_sum_relu_pt2e
python -m pytest test_quantized_op.py -k test_qconv2d_sum_relu_float_output_pt2e
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116172
Approved by: https://github.com/kit1980
After this refactor:
* ```TorchVariable``` definition and all references are removed.
* All ```is_allowed``` references except one are removed.
- The only left one is in ```torch/_dynamo/decorators:_disallow_in_graph_helper```. It was called when users put ```disallow_in_graph``` decorator on a function. Since we use the lists in ```trace_rules``` to decide the function's trace rule, so the decorator would only be used as customer function rather than torch functions. I'll defer this to a separate decorator refactor PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116312
Approved by: https://github.com/jansel
Remove ```ExecutionRecorder.MOD_EXCLUDES``` since now torch python modules are wrapped as ```PythonModuleVariable``` after #115724.
This is reported from Meta internal user cases, where it triggers failure when replay & record is enabled. But the enablement was triggered by ```TORCH_COMPILE_DEBUG=1``` rather than they really need this. Actually they are not using it according the conversation with the team members. I think we don't maintain replay & record well, so probably we can remove them from our codebase to avoid such issues in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116347
Approved by: https://github.com/jansel
Fixes#114903
Previously large split variance reductions stored the intermediates as float16
precision, which may lead to overflow as the intermediate result is
unnormalized.
In #114903 we see two different `num_split` decisions made based on the
hardware capabilities, one of which has large enough intermediates to cause
overflows.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115181
Approved by: https://github.com/shunting314
**Motivation:**
Denormal numbers are used to store extremely small numbers that are close to 0. Denormal numbers can incur extra computational cost. To solve the low performance issue caused by denormal numbers, Pytorch supports flushing denormal numbers and it successfully configures flush denormal mode
Currently set_flush_denormal() is only supported on x86 architectures supporting SSE3 ([https://pytorch.org/docs/stable/generated/torch.set_flush_denormal.html (Opens in new window or tab)](https://pytorch.org/docs/stable/generated/torch.set_flush_denormal.html) and now we want to extend this functionality for ARM architecture.
**This PR:**
->Supports set_flush_denormal() on ARM.
->Datatypes supported and tested: FP64, FP32, BFloat16
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115184
Approved by: https://github.com/jgong5
As the [RFC](https://github.com/pytorch/pytorch/issues/114856) mentions, this is the step 1 to add Intel GPU backend as an alternative inductor backend.
### Design
Typically, in order to integrate Intel GPU backend into Inductor, we need to inherit from `WrapperCodegen` and `TritonScheduling` and implement the corresponding subclasses respectively. However, since `WrapperCodegen` and `TritonScheduling` have some device-bias code generation **scattered** in their methods, overriding them in subclasses would introduce a lot of duplicated parent class code.
For example:
2a44034895/torch/_inductor/codegen/wrapper.py (L487)2a44034895/torch/_inductor/codegen/triton.py (L1996)
So we abstract the device-bias code scattered in WrapperCodegen and TritonScheduling and provide a unified interface "DeviceOpOverrides". This way, when integrating a new backend, we can maximize the reuse of `WrapperCodegen` and `TritonScheduling` code by inherit and implement this interface for device flexibility.
Currently the `DeviceOpOverrides` only cover Python wrapper code generation. We can futher extend it to cover Cpp wrapper code generation on demand.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116020
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/jansel
Summary:
- Ported `all_to_all_single` to native c10d_functional
- Added Inductor support for the native `all_to_all_single` via the new collective IR's `create_out_of_place()`
- Since the new collective IR derives from `FallbackKernel` which implements a generic `free_unbacked_symbols`, no additional unbacked symbol handling for all_to_all_single is required
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113438
Approved by: https://github.com/yf225, https://github.com/ezyang
Summary: amd build is broken
Test Plan:
```
buck-out/v2/gen/fbcode/75c2b50d9f8b18d8/caffe2/__fb_libtorch_hipify_gen_eqsb_torch/csrc/distributed/c10d/intra_node_comm.hip__/out/torch/csrc/distributed/c10d/intra_node_comm.hip:37:1: error: non-void function does not return a value [-Werror,-Wreturn-type]
}
^
1 error generated when compiling for gfx90a.
```
Now it's gone
Differential Revision: D52373348
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116291
Approved by: https://github.com/yifuwang
Summary: lifted tensor constants were not being treated the same way as named buffers when unlifting, i.e. getting name correction to convert "." in FQNS to "_" for proper names. Additionally, future torchbind object support will allow objects to be registered, so only register_buffer for lifted constants if the value is a tensor.
Differential Revision: D52367846
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116266
Approved by: https://github.com/angelayi
Summary:
We have found out the root cause of the hang is NOT due to destruction
of stores. The hang in the check() only happens when the store is of
type FileStore.
The file held by each filestore was a temp file, which was created by
Python Tempfile, it was deleted by default when the file was closed.
Note that the file was opened and closed by every check() in the watchdog and in constructor of FileStore.
The when check() tried to open the deleted file again, open() would fail
after the timeout value (by default 5 mins), hence the hang happened.
The fix is simple, just avoid the default deletion after the file is
closed.
Test Plan:
1. We first reproduce the hang in check() in the existing unit test:
test_init_process_group_for_all_backends by enabling the
DumpOnTimeOut and making the main thread sleep for 2s, to give enough time for tempfile
to be deleted
2. Adding log to check ref count of fileStore and also the sequence of
file opening and closing
3. With the repro, an exception will be thrown as "no such file or
directory' and unit test would fail
4. Verify the tests now passes with the above knob change
5. add an unit test in test_c10d_nccl to cover the fileStore check() code path
python test/distributed/test_c10d_common.py ProcessGroupWithDispatchedCollectivesTests
python test/distributed/test_c10d_nccl.py ProcessGroupNCCLTest.test_file_store_check
Reviewers:
Subscribers:
Tasks:
T173200093
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116297
Approved by: https://github.com/fduwjj
ghstack dependencies: #116296
Fixes#115234
Currently, the unpickling code does not support integers larger than 64 bits in size. However, this is a part of the Python unpickling code.
See `pickle.py` in CPython:
```
def decode_long(data):
r"""Decode a long from a two's complement little-endian binary string.
>>> decode_long(b'')
0
>>> decode_long(b"\xff\x00")
255
>>> decode_long(b"\xff\x7f")
32767
>>> decode_long(b"\x00\xff")
-256
>>> decode_long(b"\x00\x80")
-32768
>>> decode_long(b"\x80")
-128
>>> decode_long(b"\x7f")
127
"""
return int.from_bytes(data, byteorder='little', signed=True)
```
E.g.:
```
>>> int.from_bytes(bytearray(b'\xff\xff\xff\xff\xff\xff\xff\xff\x00'), byteorder='little', signed=True)
18446744073709551615
```
This PR makes it so that integers of arbitrary size are supported with JS BigNums.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115264
Approved by: https://github.com/zdevito
`add` when passed one sparse and one dense argument will error if the
sparse argument does not have csr layout. This PR modifies the
underlying algorithm to be generic on the compressed dimension handling
both csr and csc. The functions are renamed to use the
`sparse_compressed` qualifier rather than `sparse_csr`
Fixes: #114807
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115433
Approved by: https://github.com/cpuhrsch, https://github.com/pearu
ghstack dependencies: #115432
Change default from 2 min to 10 min.
Why? Many cases of heartbeat timeout were reported, but increasing
timeout led to the same job hanging in a different place, suggesting
heartbeat kill was working well and not a false positive. However, some
others reported jobs running fine with increased timeouts. One such
case was investigated below, and suggests that indeed a 2 min timeout is
too aggressive. While we have not fully root caused the issue, it
is better to avoid killing jobs that would otherwise complete.
Current theory is that watchdog is not totally deadlocked, but is slowed
down in its processing of work objs due to some intermittent resource
contention. Hence, allowing more time is more of a workaround than a
fix.
Debug/Analysis:
https://docs.google.com/document/d/1NMNWoTB86ZpP9bqYLZ_EVA9byOlEfxw0wynMVEMlXwM
Differential Revision: [D52368791](https://our.internmc.facebook.com/intern/diff/D52368791)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116267
Approved by: https://github.com/fduwjj
Recent 2 triton PRs (https://github.com/openai/triton/pull/2701, https://github.com/openai/triton/pull/2756) change the interface for triton.compile, this PR added the necessary change on inductor side to work with both old and new compile API.
Also there is some simplification between compilation call in subprocess and the one in main process
- previously we pass warm_cache_only=True if the compilation happens in subprocess. But triton never use that argument in the currently used pin. So I removed that
- previously we only pass compute_capability if compilation happens in subprocess. The PR change that to always passing compute_capability to triton.compile no matter if the compilation happens in main or sub process.
Updated:
There are more interface change from triton side. E.g.
- tl.math.{min, max} now requires a propagate_nan argument
- JITFunction.run now requires a warmup argument. This affect the benchmarking phase of matmul max-autotune; on the other hand, JITFunction.run forbids stream argument now. Simply removing passing this in when benchmarking matmul triton kernel will work for both old and new version of triton.
- triton Autotuner change attribute name from 'warmup' to 'num_warmup' and from 'rep' to 'num_rep'. This cause dynamo failed to handle triton Autotuner object since dynamo TritonKernelVariable makes assumption about attribute names. It's used in some test cases that a model call triton Autotuner directly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115878
Approved by: https://github.com/jansel
Summary: Currently, we [`clone`](19207b9183/torch/_inductor/lowering.py (L5273)) every `TensorBox` argument of custom Triton kernels while lowering them to the Inductor IR, during which the stride information of the kernel inputs is lost. This is problematic in the common case when the strides of a `torch.Tensor` argument are passed as scalars to a custom Triton kernel alongside the tensor itself (due to the underlying Triton code interpreting the tensors as raw pointers, so the contained stride semantics of the `torch.Tensor` is lost).
In this PR, we add an extended version of the existing [`clone` lowering](19207b9183/torch/_inductor/lowering.py (L2289))---`clone_preserve_reinterpret_view`---which carries over the `ir.ReinterpretVew` layers (if any) from the source `TensorBox` to the cloned one. The rationale behind adding a new function (and switching to it in the `triton_kernel_wrap` only for now) as opposed to extending the existing `clone` is keeping the semantics of the latter untouched, as it is a lowering of `torch.clone` (albeit incomplete, as the `memory_format` is currently ignored). Changing the existing `clone` would change the semantics which is not necessarily desirable in general. Open to suggestions, though.
Test Plan:
```
$ python test/dynamo/test_functions.py -k test_triton_kernel_strided_input
...
----------------------------------------------------------------------
Ran 1 test in 5.568s
OK
```
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116219
Approved by: https://github.com/jansel
Part 1 of implementation for general [subclass view fake-ification](https://docs.google.com/document/d/1C5taWiplmX7nKiURXDOAZG2W5VNJ2iV0fQFq92H0Cxw).
The following functional inverses are currently implemented scatter-style and thus never return views:
* `as_strided_copy_inverse()`
* `diagonal_copy_inverse()`
* `expand_copy_inverse()`
* `select_copy_int_inverse()`
* `slice_copy_Tensor_inverse()`
* `split_copy_Tensor_inverse()`
* `split_with_sizes_copy_inverse()`
* `unbind_copy_int_inverse()`
* `unfold_copy_inverse()`
We need to get actual views for the introduction of reverse view funcs coming next.
Details:
* Use `as_strided()` to implement actual view inverses for the above
* Assumes we're given a mutated_view that is actually part of a bigger storage; this isn't really the case for functionalization
* Introduce `InverseReturnMode` enum for customization of functional inverses
* `AlwaysView` - always return an actual view; needed for reverse view_funcs()
* `NeverView` - always do a copy; useful for certain functionalization use cases (e.g. XLA, executorch)
* `ViewOrScatterInverse` - return an actual view in most cases, but prefer scatter inverses when they exist. this avoids the need to implement `as_strided()` for subclasses, which can be difficult or impossible
* Make sure functionalization works as before
* Use `ViewOrScatterInverse` when reapply_views TLS is True or `NeverView` otherwise
* Adds tests to ensure old behavior for above inverses **in functionalization**
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115893
Approved by: https://github.com/bdhirsh
These tests are failing in CI with this error
```
File "/opt/conda/envs/py_3.11/lib/python3.11/site-packages/torch/_dynamo/variables/builder.py", line 1126, in wrap_numpy_ndarray
value.flags.writeable = True
^^^^^^^^^^^^^^^^^^^^^
torch._dynamo.exc.InternalTorchDynamoError: cannot set WRITEABLE flag to True of this array
```
And it may be related to a `SIGKILL` exception being raised shortly after the
failure.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116263
Approved by: https://github.com/lezcano
Summary:
Original commit changeset: 2a9588cfd51b
Original Phabricator Diff: D52062368
Test Plan: In investigating S386328 and S382826, we found checkpoint loading succeed after backout D52062368: S386328_backout_1220_193648
Differential Revision: D52356011
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116243
Approved by: https://github.com/voznesenskym
I found whatever the attribute `async_op` in collective ops is `True` or `False` explicitly set by the users, it always leads to the graph break because the argument `async_op` is wrappered as `ConstantVariable(bool)` in dynamo. So here we need to use the `value` for the judgement.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115921
Approved by: https://github.com/jansel, https://github.com/wconstab
Make ```SkipFilesVariable``` only handle function type, and route skipped classes to ```UserDefinedClassVariable```. The reasons behind this are:
* We'd like to remove ```is_allowed```, so the allowed/disallowed torch classes should have a proper place to handle. We can put them in either ```SkipFilesVariable``` and ```UserDefinedClassVariable``` under the current architecture, but it's confusing to have two places do one thing.
- Going forward, let's make ```SkipFilesVariable``` only handle functions, and probably I'll rename it to ```SkippedFunctionVariable``` in the following PRs.
- Let's do dispatch by value's type, all torch classes stuff would go to ```UserDefinedClassVariable``` in the next PR.
* We'd merge in_graph/skip/inline trace decision into the same API ```trace_rule.lookup```, so probably we have to limit the input to only function for better organizing ```VariableBuilder._wrap``` logics.
- Next step, I'll merge ```skipfiles.check``` into ```trace_rules.lookup```, and do the skipfile check before wrapping them into correct variable tracker.
- Though the ```TorchCtxManagerClassVariable``` is decided by ```trace_rules.lookup```, I'll refactor it out in the following PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115963
Approved by: https://github.com/jansel
This is to address issue #115309.
Test plan
`python test/distributed/tensor/parallel/test_tp_examples.py -k test_transformer_training_is_seq_parallel_False`
`python test/distributed/tensor/parallel/test_tp_examples.py -k test_transformer_training_is_seq_parallel_True`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115530
Approved by: https://github.com/wanchaol
Recent 2 triton PRs (https://github.com/openai/triton/pull/2701, https://github.com/openai/triton/pull/2756) change the interface for triton.compile, this PR added the necessary change on inductor side to work with both old and new compile API.
Also there is some simplification between compilation call in subprocess and the one in main process
- previously we pass warm_cache_only=True if the compilation happens in subprocess. But triton never use that argument in the currently used pin. So I removed that
- previously we only pass compute_capability if compilation happens in subprocess. The PR change that to always passing compute_capability to triton.compile no matter if the compilation happens in main or sub process.
Updated:
There are more interface change from triton side. E.g.
- tl.math.{min, max} now requires a propagate_nan argument
- JITFunction.run now requires a warmup argument. This affect the benchmarking phase of matmul max-autotune; on the other hand, JITFunction.run forbids stream argument now. Simply removing passing this in when benchmarking matmul triton kernel will work for both old and new version of triton.
- triton Autotuner change attribute name from 'warmup' to 'num_warmup' and from 'rep' to 'num_rep'. This cause dynamo failed to handle triton Autotuner object since dynamo TritonKernelVariable makes assumption about attribute names. It's used in some test cases that a model call triton Autotuner directly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115878
Approved by: https://github.com/jansel
As stated. I do notice there is perhaps opportunity to abstract, but the tests as written are also super understandable and more abstraction might not be desirable.
This PR _increases coverage_. The original tests each tested 12 default configs (left out Rprop). Now the tests test ~80 configs, and then foreach + fused on top of that! Test time, we basically increase over 10-fold, but this test is tiny so we are not worried:
Old:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (5ca9672c)]$ python test/test_optim.py -k test_step_is_noop_when_params_have_no_grad
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
.
----------------------------------------------------------------------
Ran 1 test in 0.028s
OK
```
New (includes the old test):
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (5ca9672c)]$ python test/test_optim.py -k test_step_is_noop_when_params_have_no_grad
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
...........................
----------------------------------------------------------------------
Ran 27 tests in 0.456s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115299
Approved by: https://github.com/albanD
ghstack dependencies: #114802, #115023, #115025
Removing 4 tests:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (7539011b)]$ python test/test_optim.py -v -k test_fused_optimizers_with_large_tensors -k test_fused_optimizers_with_varying_tensors -k test_multi_tensor_optimizers_with_large_tensors -k test_multi_tensor_optimizers_with_varying_tensors
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_fused_optimizers_with_large_tensors (optim.test_optim.TestOptim) ... ok
test_fused_optimizers_with_varying_tensors (optim.test_optim.TestOptim) ... ok
test_multi_tensor_optimizers_with_large_tensors (optim.test_optim.TestOptim) ... ok
test_multi_tensor_optimizers_with_varying_tensors (optim.test_optim.TestOptim) ... ok
----------------------------------------------------------------------
Ran 4 tests in 22.731s
OK
```
For the same 4 but more granular:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (7539011b)]$ python test/test_optim.py -v -k test_fused_large_tensor -k test_fused_mixed_device_dtype -k test_foreach_large_tensor -k test_foreach_mixed_device_dtype
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_foreach_large_tensor_ASGD_cpu_float16 (__main__.TestOptimRenewedCPU) ... skipped 'Only runs on cuda'
....
test_fused_mixed_device_dtype_Adam_cpu_float32 (__main__.TestOptimRenewedCPU) ... skipped 'Only runs on cuda'
test_foreach_large_tensor_ASGD_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_Adadelta_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_Adagrad_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_AdamW_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_Adam_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_NAdam_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_RAdam_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_RMSprop_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_Rprop_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_SGD_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_ASGD_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_Adadelta_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_Adagrad_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_AdamW_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_Adam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_Adamax_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_NAdam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_RAdam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_RMSprop_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_Rprop_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_SGD_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_fused_large_tensor_AdamW_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_fused_large_tensor_Adam_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_fused_mixed_device_dtype_AdamW_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_fused_mixed_device_dtype_Adam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
----------------------------------------------------------------------
Ran 50 tests in 50.785s
OK (skipped=25)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115025
Approved by: https://github.com/albanD
ghstack dependencies: #114802, #115023
Replace the following:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (1bbf1c6f)]$ python test/test_optim.py -k test_peak_mem_multi_tensor_optimizers
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
.
----------------------------------------------------------------------
Ran 1 test in 38.599s
OK
```
with 11 tests (one for each foreach optim :))
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (1bbf1c6f)]$ python test/test_optim.py -k TestOptimRenewedCUDA.test_foreach_memory
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
...........
----------------------------------------------------------------------
Ran 11 tests in 39.293s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115023
Approved by: https://github.com/albanD
ghstack dependencies: #114802
New tests look like:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (af8fca04)]$ python test/test_optim.py -v -k TestOptimRenewedCUDA.test_fused
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_fused_AdamW_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_fused_Adam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
----------------------------------------------------------------------
Ran 2 tests in 34.591s
OK
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (af8fca04)]$ python test/test_optim.py
-v -k test_set_default_dtype_works_with_foreach
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_set_default_dtype_works_with_foreach_ASGD_cpu_float64 (__main__.TestOptimRenewedCPU) ... skipped 'Only runs on cuda'
...
test_set_default_dtype_works_with_foreach_ASGD_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_Adadelta_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_Adagrad_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_AdamW_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_Adam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_Adamax_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_NAdam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_RAdam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_RMSprop_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_Rprop_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_SGD_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
----------------------------------------------------------------------
Ran 22 tests in 32.915s
OK (skipped=11)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114802
Approved by: https://github.com/albanD
Added a `--num_workers` option to `server.py` that allows more than 1 worker in the `ThreadPoolWorker` used for model predictions. Each worker uses its own `cuda.Stream()` that is created when the worker thread is initialized.
Ran benchmark for 2-4 workers with `compile=False` (since compile is not thread-safe)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116190
Approved by: https://github.com/albanD
ghstack dependencies: #115286, #116187, #116188, #116189
Added two `ThreadPoolExecutor`s with 1 worker each for D2H and H2D copies. Each uses its own `cuda.Stream`. The purpose is to try to overlap D2H and H2D with compute and allow the worker handling prediction to launch compute kernels without being blocked by D2H/H2D.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116189
Approved by: https://github.com/albanD
ghstack dependencies: #115286, #116187, #116188
Before this PR, each `request_time` was separated by the time for a `torch.randn(...)` to create the fake `data` tensor on CPU. This meant that the gap between `request_times` **scaled with the batch_size**. So the latency comparisons across batch sizes were inaccurate. In this PR we generate all the fake data outside the loop to avoid this.
Other bug fixes:
- Only start polling GPU utilization after warmup event is complete
- Correct calculation of throughput: previously `(num_batches * batch_size) / sum(response_times)`, should have been `(num_batches * batch_size) / (last_response_time - first_request_time)`
- Make sure that response sent back to frontend is on CPU
- Use a lock to ensure writing to `metrics_dict` in `metrics_thread` and `gpu_utilization_thread` in a thread-safe manner
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116188
Approved by: https://github.com/albanD
ghstack dependencies: #115286, #116187
Experiment: run model predictions in the backend in a ThreadPoolExecutor so that each model prediction does not block reading requests from the queue
Baseline is reset in above PR that bugfixes a lot of the metrics calculations but I kept the metrics here anyway
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115286
Approved by: https://github.com/albanD
Follow-up to https://github.com/pytorch/pytorch/pull/115920
This PR fixes the error with symbolic expression in aten.div:
```python
import torch
aten = torch.ops.aten
def func(x, a):
return aten.div(x * 0.5, a, rounding_mode=None)
cfunc = torch.compile(func, dynamic=True, fullgraph=True)
device = "cpu"
x = 124
a = 33
out = cfunc(x, a)
expected = func(x, a)
torch.testing.assert_close(out, expected)
```
Error message:
```
File "/pytorch/torch/_inductor/graph.py", line 700, in call_function
out = lowerings[target](*args, **kwargs)
File "/pytorch/torch/_inductor/lowering.py", line 293, in wrapped
out = decomp_fn(*args, **kwargs)
File "/pytorch/torch/_inductor/lowering.py", line 4823, in div_mode
return div(a, b)
File "/pytorch/torch/_inductor/lowering.py", line 293, in wrapped
out = decomp_fn(*args, **kwargs)
File "/pytorch/torch/_inductor/lowering.py", line 4857, in div
a, b = promote_constants(
File "/pytorch/torch/_inductor/lowering.py", line 368, in promote_constants
ex = next(x for x in inputs if isinstance(x, (TensorBox, ExpandView)))
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
LoweringException: StopIteration:
target: aten.div.Tensor_mode
args[0]: 1.0*s0
args[1]: s1
kwargs: {'rounding_mode': None}
Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116196
Approved by: https://github.com/peterbell10
Removes an unnecessary duplicated utility functions and just have it rely on itertools. Since the file is low traffic, I also added the modified files to UFMT'd files and formatted them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116192
Approved by: https://github.com/malfet
Decorates all NT tests with `@markDynamoStrictTest` to ensure we get the correct signal. Adds xfails where needed to get things passing.
Includes a fix in meta_utils.py for a bug that was breaking several python 3.11 tests. In particular, a dense tensor graph input that is a view of a strided NT would slip past Dynamo's check and break in meta-ification.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116111
Approved by: https://github.com/soulitzer, https://github.com/zou3519
ghstack dependencies: #115192
Before this change `-SymInt(std::numeric_limits<int64_t>::min()) == 0` would reliably crash with null pointer dereference, as `data_` of the SymInt returned by `operator-` would be `0x8000000000000000`, because of the carry/overflow flags set by `negq`.
Before the change x86_64 assembly generated for
4f02cc0670/c10/core/SymInt.cpp (L137)
looked as follows:
```
0x7ffff7f2f490 <+115>: movq %rax, %rdx
0x7ffff7f2f493 <+118>: negq %rdx
0x7ffff7f2f496 <+121>: movq %rdx, (%rbp)
0x7ffff7f2f49a <+125>: movabsq $0x4000000000000000, %rdx ; imm = 0x4000000000000000
0x7ffff7f2f4a4 <+135>: cmpq %rdx, %rax
0x7ffff7f2f4a7 <+138>: jle 0x7ffff7f2f520 ; <+259> at SymInt.cpp:141:1
```
`negq %rfx` correspond to unary minus and `cmpq %rdx, 0x4000000000000000` are inverted `check_range`
b6d0d0819a/c10/core/SymInt.h (L247-L249)
Flags raised by `negq` will affect the results of `cmpq`, and as result value would not be allocated on heap, but rather preserved as `nullptr`.
Not sure if it's worth benchmarking, but perhaps using `__builtin_sub_overflow` would be faster as it does not require an extra comparison, just guarantees that overflow flags is cleared after the op.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116160
Approved by: https://github.com/Skylion007, https://github.com/colesbury
Updated range_constraints to be the union of shape_env.var_to_range and shape_env.runtime_var_to_range, with shape_env.runtime_var_to_range taking priority.
Due to 0/1 specialization, if we bound an unbacked symint to be less than 5, the range of possible values for this symint is actually recorded as [2, 5] in shape_env.var_to_range. To fix this so that users will be able to see a more understandable range of [0, 5], shape_env.runtime_var_to_range was created to store the range of [0, 5]. Since range_constraints is a user-facing attribute to query the ranges of certain symints, we want to use shape_env.runtime_var_to_range to get the unbacked symints ranges, rather than shape_env.var_to_range.
Additionally, run_decompositions() has an issue where it will always add assertions to the graph, even if a previous run has already added the assertions. So, I added a part to the AddRuntimeAssertionsForInlineConstraints which will store which assertions have already been added.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115427
Approved by: https://github.com/zhxchen17
This replaces a bunch of unnecessary lambdas with the operator package. This is semantically equivalent, but the operator package is faster, and arguably more readable. When the FURB rules are taken out of preview, I will enable it as a ruff check.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116027
Approved by: https://github.com/malfet
* Enable PERF402. Makes code more efficient and succinct by removing useless list copies that could be accomplished either via a list constructor or extend call. All test cases have noqa added since performance is not as sensitive in that folder.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115505
Approved by: https://github.com/malfet
Fixes incorrect range check when index is `std::numeric_limits<int64_t>::min()`, as result of unary minus operations for such values is undefined, but in practice is equal to self, see https://godbolt.org/z/Wxhh44ocr
Lower bound check was `size >= -index`, which was incorrect if `index` is `INT64_MIN`, with `-1 - index`, which for all int64_t values returns result that also fits into int64_t range. `- (index + 1)` is more readable and results in the identical optimized assembly, see https://godbolt.org/z/3vcnMYf9a , but its intermediate result for `INT64_MAX` is outside of `int64_t` range, which leads to a similar problems as with `int64_min` in original example.
Added regression test.
Fixes https://github.com/pytorch/pytorch/issues/115415
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116062
Approved by: https://github.com/Skylion007, https://github.com/albanD
Summary:
This change makes two major improvements to PyTorch Vulkan's shader authoring workflow.
## Review Guide
There are a lot of changed files because every GLSL shader had to be touched. The majority of changes is changing
```
#define PRECISION $precision
#define FORMAT $format
```
to
```
#define PRECISION ${PRECISION}
#define FORMAT ${FORMAT}
```
due to changes in how shader templates are processed.
For reviewers, the primary functional changes to review are:
* `gen_vulkan_spv.py`
* Majority of functional changes are in this file, which controls how shader templates are processed.
* `shader_params.yaml`
* controls how shader variants are generated
## Python Codeblocks in Shader Templates
From now on, every compute shader (i.e. `.glsl`) is treated as a shader template. To this effect, the `templates/` folder has been removed and there is now a global `shader_params.yaml` file to describe the shader variants that should be generated for all shader templates.
**Taking inspiration from XNNPACK's [`xngen` tool](https://github.com/google/XNNPACK/blob/master/tools/xngen.py), shader templates can now use Python codeblocks**. One example is:
```
$if not INPLACE:
layout(set = 0, binding = 0, FORMAT) uniform PRECISION restrict writeonly image3D uOutput;
layout(set = 0, binding = 1) uniform PRECISION sampler3D uInput;
layout(set = 0, binding = 2) uniform PRECISION sampler3D uOther;
layout(set = 0, binding = 3) uniform PRECISION restrict Block {
ivec4 output_sizes;
ivec4 input_sizes;
ivec4 other_sizes;
float alpha;
}
uArgs;
$else:
layout(set = 0, binding = 0, FORMAT) uniform PRECISION restrict image3D uOutput;
layout(set = 0, binding = 1) uniform PRECISION sampler3D uOther;
layout(set = 0, binding = 2) uniform PRECISION restrict Block {
ivec4 output_sizes;
ivec4 other_sizes;
float alpha;
}
uArgs;
```
Another is:
```
// PYTHON CODEBLOCK
$if not IS_DIV:
const int c_index = (pos.z % ((uArgs.output_sizes.z + 3) / 4)) * 4;
if (uArgs.other_sizes.z != 1 && c_index + 3 >= uArgs.output_sizes.z) {
ivec4 c_ind = ivec4(c_index) + ivec4(0, 1, 2, 3);
vec4 mask = vec4(lessThan(c_ind, ivec4(uArgs.output_sizes.z)));
other_texel = other_texel * mask + vec4(1, 1, 1, 1) - mask;
}
// PYTHON CODEBLOCK
$if not INPLACE:
ivec3 input_pos =
map_output_pos_to_input_pos(pos, uArgs.output_sizes, uArgs.input_sizes);
const vec4 in_texel =
load_texel(input_pos, uArgs.output_sizes, uArgs.input_sizes, uInput);
imageStore(uOutput, pos, OP(in_texel, other_texel, uArgs.alpha));
$else:
const vec4 in_texel = imageLoad(uOutput, pos);
imageStore(uOutput, pos, OP(in_texel, other_texel, uArgs.alpha));
```
In addition to making it easier and clearer to write shader templates, this enables shaders that were previously unable to be consolidated into a single template to now be represented using a single template, such as non inplace and inplace variants of the same shader.
## `generate_variant_forall` in shader variant YAML configuration
YAML files that describe how shader variants should be generated can now use a `generate_variant_forall` field to iterate over various settings for a specific parameter for each variant defined. Example:
```
unary_op:
parameter_names_with_default_values:
OPERATOR: exp(X)
INPLACE: 0
generate_variant_forall:
INPLACE:
- VALUE: 0
SUFFIX: ""
- VALUE: 1
SUFFIX: "inplace"
shader_variants:
- NAME: exp
OPERATOR: exp(X)
- NAME: sqrt
OPERATOR: sqrt(X)
- NAME: log
OPERATOR: log(X)
```
Previously, the `inplace` variants would need to have separate `shader_variants` entries. If there are multiple variables that need to be iterated across, then all possible combinations will be generated. Would be good to take a look to see how the new YAML configuration works.
Test Plan:
There is no functional change to this diff; we only need to make sure that the generated shaders are still correct. Therefore, we only need to run `vulkan_api_test`.
```
# On Mac Laptop
buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 -- --gtest_filter="*"
```
Reviewed By: digantdesai
Differential Revision: D52087084
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115948
Approved by: https://github.com/manuelcandales
Noticed that on many MRS kernels the grid wrapper for autotuning is huge with a bunch of duplicates due to num_warps and num_stages not being needed for grid calculation. Lets deduplicate these entries.
Previously, we would see wrapper like
```
def grid_wrapper_for_add_kernel_2d_autotuned_0(meta):
if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
```
now it looks like
```
def grid_wrapper_for_add_kernel_2d_autotuned_0(meta):
if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115849
Approved by: https://github.com/jansel
Summary: To be used in https://github.com/pytorch/pytorch/pull/113873. Since set_ is effectively an inplace view op, we'll need to skip caching them.
Test Plan: Built pytorch; specifically this step: `/home/slarsen/local/miniconda3/envs/pytorch-3.10/bin/python -m torchgen.gen --source-path /home/slarsen/local/pytorch/cmake/../aten/src/ATen --install_dir /home/slarsen/local/pytorch/build/aten/src/ATen --per-operator-headers --generate sources --output-dependencies /home/slarsen/local/pytorch/build/aten/src/ATen/generated_sources.cmake`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115769
Approved by: https://github.com/bdhirsh
Attempt to surface the segfault that happens on exit by resetting the "pytest last run" cache if pytest succeeds. CI does not rerun on success so we won't hit an infinite loop anywhere, and I don't expect people to rerun on success (unless they're looking for flakes? Either way I highly doubt any one is using the --sc/--scs flag locally).
This ensures that if pytest succeeds but the process gets a non zero exit code, the rerun will start at beginning instead of skipping all the "succeeding" tests.
This only applies if the --sc/--scs flags are used, custom to pytorch and probably not used anywhere other than CI, not to be confused with --stepwise, which pytest has by default
Here's a list of segfaulting inductor/test_aot_inductor tests, which I added skips for:
```
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_duplicated_params_abi_compatible_cpu_with_stack_allocation
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_fqn_abi_compatible_cpu_with_stack_allocation
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_no_args_abi_compatible_cpu_with_stack_allocation
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_output_misaligned_abi_compatible_cpu_with_stack_allocation
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_pytree_inputs_abi_compatible_cpu_with_stack_allocation
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_seq_abi_compatible_cpu_with_stack_allocation
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_simple_split_abi_compatible_cpu_with_stack_allocation
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_addmm_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_aliased_buffer_reuse_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_buffer_reuse_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_convolution_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_duplicated_params_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_empty_graph_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_fqn_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_large_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_missing_output_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_no_args_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_output_misaligned_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_output_path_1_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_pytree_inputs_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_repeat_interleave_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_return_constant_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_reuse_kernel_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_seq_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_simple_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_simple_split_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_small_constant_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_with_no_triton_profiler_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_with_offset_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_with_profiler_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_zero_size_weight_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115775
Approved by: https://github.com/desertfire
In this PR, we are implementing Functionalization on pre-dispatch graph. Today, every dispatch key except for Dispatchkey.Python has a dedicated mode stack in python. PreDispatch tracing relies on this behaviour by pushing ProxyTorchDispatchMode to Dispatchkey.PreDispatch mode stack and handle the dispatching logic in python. To make pre-dispatch functionalization work, we now need to push FunctionalTensorMode on DispatchKey.PreDispatch mode stack and make sure it runs before ProxyTorchDispatchMode. (this is very similar to how post-dispatch tracing work). Here are some design decisions we made for this flow to work:
1. FunctionalTensorMode internally calls C++ functionalize key. Since C++ functionalization goes after PreDispatch, if we are not careful, we will keep re-entering into PreDispatch key. We solve this by directly dispatching to C++ Functionalize key.
2. We delete mode_stack_per_key logic because the only realistic time it is exercised is for PreDispatch and it is in general not safe to have a plain list because FunctionalTensorMode and ProxyTorchDispatchMode ordering matter and it is hard to enforce it on plain list. Instead, now we have a private class that tracks PreDispatch mode stack.
3. We will still run CompositeImplicitAutograd decomps in this PR, and disable this logic later as a followup.
Some missing bits after this PR:
1. Preserving autograd ops in a functional form. Right now they still show up in the graph but in a "non-functional" way.
2. Turn off CompositeImplicitAutograd decomps
3. Functionalizing HOO
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113728
Approved by: https://github.com/bdhirsh
After # 113734 landed (adding dynamic storage offsets), we found that compilation times increased significantly. The reason: tensors_definitely_do_not_overlap was doing comparisons on storage offsets which were adding guards
626b7dc847/torch/_functorch/_aot_autograd/input_output_analysis.py (L268-L276)
This guard is added on all pairs of tensors which are views of the same source tensor - i.e. it the number of guards can be quadratic in the number of input tensors. This PR adds a test to prevent similar regressions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115793
Approved by: https://github.com/yanboliang
Fixes#114310 and supersedes #114748.
There are two reasons why we have quite a few special cases for `round`:
1. `round` is actually two ops. With `ndigits=None` (default), `round` always returns an integer. When `ndigits` is an integer, the returned type is a float.
2. Although `round` takes two arguments, it is a unary function with a parameter rather than a binary one.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115259
Approved by: https://github.com/peterbell10, https://github.com/lezcano
Currently the inductor code for `x.any(-1)` does a this strange dance:
```python
tmp0 = tl.load(in_ptr0 + (r1 + (128*x0)), rmask & xmask)
tmp1 = tmp0.to(tl.int64)
tmp2 = (tmp1 != 0)
```
This happens because `register_lowering` is doing type promotion with the
dimension argument, and so promotes to `int64` which we then cast back to bool.
A better fix would be to fix `register_lowering` but for now I just remove
the unnecessary type promotion from `aten.any`.
In the current code we also see:
```python
tmp5 = tl.where(rmask & xmask, tmp3, 0)
```
which promotes the boolean value to int since `0` is an int32 in triton.
This fixes it to generate a boolean constant instead.
Finally there is also a triton bug where the `tl.load` itself upcasts to
`tl.int8`. I fix this by adding an explicit cast to `tl.int1`. The final
kernel code looks like:
```python
tmp0 = tl.load(in_ptr0 + (r1 + (128*x0)), rmask & xmask).to(tl.int1)
tmp1 = tl.broadcast_to(tmp0, [XBLOCK, RBLOCK])
tmp3 = tl.full([1, 1], 0, tl.int1)
tmp4 = tl.where(rmask & xmask, tmp1, tmp3)
tmp5 = triton_helpers.any(tmp4, 1)[:, None]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109913
Approved by: https://github.com/lezcano
By representing `torch.cfloat`/`torch.chalf` as `float2`/`half2` metal types and modifying `SCATTER_OPS_TEMPLATE`/`GATHER_OPS_TEMPLATE` to accept third argument which is fully specialized `cast` function, which is no-op for regular type, but special cased for float->complex and complex->float
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115727
Approved by: https://github.com/kulinseth
This PR fixes two bugs
1) Constant folding a triton kernel results in the kernel's inputs to be returned back without any modification. Disable constant folding for triton kernels. Need more investigation
2) NoneLayout buffers should not be deleted as they do not exist
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115908
Approved by: https://github.com/aakhundov, https://github.com/jansel
Make ```SkipFilesVariable``` only handle function type, and route skipped classes to ```UserDefinedClassVariable```. The reasons behind this are:
* We'd like to remove ```is_allowed```, so the allowed/disallowed torch classes should have a proper place to handle. We can put them in either ```SkipFilesVariable``` and ```UserDefinedClassVariable``` under the current architecture, but it's confusing to have two places do one thing.
- Going forward, let's make ```SkipFilesVariable``` only handle functions, and probably I'll rename it to ```SkippedFunctionVariable``` in the following PRs.
- Let's do dispatch by value's type, all torch classes stuff would go to ```UserDefinedClassVariable``` in the next PR.
* We'd merge in_graph/skip/inline trace decision into the same API ```trace_rule.lookup```, so probably we have to limit the input to only function for better organizing ```VariableBuilder._wrap``` logics.
- Next step, I'll merge ```skipfiles.check``` into ```trace_rules.lookup```, and do the skipfile check before wrapping them into correct variable tracker.
- Though the ```TorchCtxManagerClassVariable``` is decided by ```trace_rules.lookup```, I'll refactor it out in the following PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115963
Approved by: https://github.com/jansel
Summary:
This change makes the `DTensor.from_local()` placements in backward pass from `Partial()` to `Replicate()` as pass through for following reasons:
1. When we run backward pass of DTensor.from_local, if the target placement is partial() (i.e. from user manual overwrite code instead of torch_dispatch) we keep the grad as replicate. This is because converting the gradients back to `Partial()` is meaningless.
2. The current div logic will lead to wrong numerical value in the above case.
Test Plan:
**CI**:
CI Tests
**Unit test**:
`buck2 test mode/dev-nosan //caffe2/test/distributed/_tensor:redistribute`
- Passed
**With model training**:
```
# We tested the case where input tensor is manually overwrite as Partial() and
# output tensor manually overwrite to Shard() then to local.
# Before the change: numerical value not correct
Forward pass:
collective: ReduceScatter
backward pass:
collective: AllGather + div by process group size
# After the change: div is removed as expected.
Forward pass:
collective: ReduceScatter
Backward pas:
collective: AllGather
```
Differential Revision: D52175709
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115967
Approved by: https://github.com/wanchaol
Motivation: it would be nice to be able to test using the metrics in log_compilation_event; currently dumps logs (or logs to a database in fbcode) - these are hard to use in unit tests.
This change:
* always record the information in torch._dynamo.utils.record_compilation_metrics; here, log into a limited-size deque to prevent the list of metrics from getting too long
* if config.log_compilation_metrics, then call back into the original log_compilation_event function
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115788
Approved by: https://github.com/yanboliang
## summary
`zip(inputs, self.input_layouts, self.desired_input_layouts)` is used in `_prepare_input_fn`; similar for `_prepare_output_fn`. Without assertion, unmatched dimension in inputs/outputs will be lost, potentially causing unexpected behabiors.
## test plan
`python test/distributed/tensor/parallel/test_tp_style.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115957
Approved by: https://github.com/wanchaol
Summary:
c2_protobuf_rule ([here](https://fburl.com/code/iyiulpmv)) is broken on buck2, ultimately due to the following error:
> .\./caffe2.proto: File does not reside within any path specified using --proto_path (or -I). You must specify a --proto_path which encompasses this file. Note that the proto_path must be an exact prefix of the .proto file names -- protoc is too dumb to figure out when two paths (e.g. absolute and relative) are equivalent (it's harder than you think).
The root cause is differences in how buck1 and buck2 handle `%SRCDIR%` (absolute versus relative paths). This diff fixes the build.
Test Plan:
# Before
```
buck2 build arvr/mode/win/opt //xplat/caffe2:caffe2.pb.h
```
```
More details at https://www.internalfb.com/intern/buck/build/c6550454-ae6d-479e-9d08-016e544ef050
BUILD SUCCEEDED
```
```
Action failed: fbsource//xplat/caffe2:caffe2.pb.h (genrule)
Remote command returned non-zero exit code <no exit code>
Reproduce locally: frecli cas download-action 5df17cf64b7e2fc5ab090c91e1129f2f3cad36dc72c7c182ab052af23d3f32aa:145
stdout:
stderr:
OUTMISS: Missing outputs: buck-out/v2/gen/fbsource/dd87aacb8683145b/xplat/caffe2/caffe2.pb.h/out/caffe2.pb.h
```
# After
Buck1 still works
```
buck1 build arvr/mode/win/opt //xplat/caffe2:caffe2.pb.h
```
Buck2 works
```
buck2 build arvr/mode/win/opt //xplat/caffe2:caffe2.pb.h
```
```
Buck UI: https://www.internalfb.com/buck2/e5dae607-325a-4eab-b0c9-66fe4e9a6254
BUILD SUCCEEDED
```
Differential Revision: D52218365
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115954
Approved by: https://github.com/mcr229
some typo result in the note section not rendered properly, can't see
this from the last PR directly as the last PR only show the first commit
documentation :(
Also make the parallelize_module doc example more concrete
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115974
Approved by: https://github.com/wz337
The PyTorch build breaks when building from tip on ppc64le with following error pytorch/aten/src/ATen/native/quantized/cpu/kernels/QuantizedOpKernels.cpp:863:46: error: no matching function for call to 'at::vec::DEFAULT::Vectorizedc10::qint8::dequantize(at::vec::DEFAULT::Vectorized&, at::vec::DEFAULT::Vectorized&)
Issue reported #115165
This patch fixes the build issue.
Fixes#115165
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115729
Approved by: https://github.com/albanD
Summary: This is useful the comparing the Triton kernels generated by two different invocations of torch.compile on the same model (e.g., checking of serial compile and parallel compile generate identical Triton kernels).
Test Plan:
Unit test:
buck2 test mode/opt //caffe2/torch/fb/module_factory/sync_sgd/tests:test_torchdynamo_wrapper -- --print-passing-details >& ~/tmp/log.test
PyPer Mast job:
https://www.internalfb.com/mast/job/sw-951074659-OfflineTraining_87587a4e
See the *.py files generated in:
pyper_traces/tree/torchinductor_traces/sw-951074659-OfflineTraining_87587a4e/4623
Differential Revision: D52221500
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115979
Approved by: https://github.com/yanboliang
Summary: During the inference time the intermediate graphs for optimization are not used so the Executor's graph is the only graph we need to keep around these two flags
Test Plan:
the FLAGS are all off by default
baseline
```
buck run mode/opt-clang sigrid/predictor/client/localnet:run_model -- --model_id_to_load=951679039 --model_snapshot_to_load=244 --torch_jit_do_not_store_optimized_graph=true
I1212 10:24:20.407408 401092 SigridPredictorLocalModelFactory.cpp:32] Memory usage for 951679039_244 is 182863 Kb
```
```
buck run mode/opt-clang sigrid/predictor/client/localnet:run_model -- --model_id_to_load=951679039 --model_snapshot_to_load=244 --torch_jit_do_not_store_optimized_graph=true --torch_jit_release_profiling_graph_after_optimization=true
I1212 10:31:37.663487 464000 SigridPredictorLocalModelFactory.cpp:32] Memory usage for 951679039_244 is 186127 Kb
```
```
buck run mode/opt-clang sigrid/predictor/client/localnet:run_model -- --model_id_to_load=951679039 --model_snapshot_to_load=244 --torch_jit_do_not_store_optimized_graph=true --torch_jit_release_profiling_graph_after_optimization=true --torch_jit_execution_plan_avoid_extra_graph_copy=true
I1212 10:29:42.848093 447218 SigridPredictorLocalModelFactory.cpp:32] Memory usage for 951679039_244 is 129451 Kb```
Differential Revision: D52081631
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115657
Approved by: https://github.com/houseroad
as titled, when using SAC + torch.compile, it currently only check for
functional tensor, but not checking any tensor subclasses, therefore SAC
under torch.compile would ignore the tensor types like tensor
subclasses. Fixed in this PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115960
Approved by: https://github.com/bdhirsh
Summary:
Refactor update inactive constant buffer to allow updating with active
buffer.
Test Plan:
Existing test to test inactive buffer updates.
UpdateConstantsCuda in cpp test for active buffer updates.
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116001
Approved by: https://github.com/chenyang78
Summary:
This change is to make the input tensor contiguous for DTensor reduce scatter in the case no padding is needed.
There's no exception thrown during training, but we ran into numerical value correctness issue without the change.
Test Plan:
**CI**
CI test
**WHEN model test**:
- Verified loss for each iteration within the expected range.
- Verified NE on-par with this change with 4B training data.
Differential Revision: D52170822
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115847
Approved by: https://github.com/wanchaol
- Add call to `at::globalContext().userEnabledMkldnn()` to `apply_mkldnn_matmul_heur`
- Surround calls to `mkldnn_matmul` with `try {} catch {}`
- Print warning and fall back to BLAS (by calling `at::globalContext().setUserEnabledMkldnn()`) if `mkldnn_matmul()` fails
Test plan: On Linux arm run:
```shell
$ sudo chmod 400 /sys; python -c "import torch;m=torch.nn.Linear(1, 32);print(torch.__version__);print(m(torch.rand(32, 1)))"
Error in cpuinfo: failed to parse the list of possible processors in /sys/devices/system/cpu/possible
Error in cpuinfo: failed to parse the list of present processors in /sys/devices/system/cpu/present
Error in cpuinfo: failed to parse both lists of possible and present processors
2.3.0.dev20231215
bad err=11 in Xbyak::Error
bad err=11 in Xbyak::Error
/home/ubuntu/miniconda3/envs/py311/lib/python3.11/site-packages/torch/nn/modules/linear.py:116: UserWarning: mkldnn_matmul failed, switching to BLAS gemm:internal error (Triggered internally at /pytorch/aten/src/ATen/native/LinearAlgebra.cpp:1509.)
return F.linear(input, self.weight, self.bias)
tensor([[-0.5183, 0.2279, -0.4035, ..., -0.3446, 0.0938, -0.2113],
[-0.5111, 0.2362, -0.3821, ..., -0.3536, 0.1011, -0.2159],
[-0.6387, 0.0894, -0.7619, ..., -0.1939, -0.0282, -0.1344],
...,
[-0.6352, 0.0934, -0.7516, ..., -0.1983, -0.0247, -0.1366],
[-0.4790, 0.2733, -0.2862, ..., -0.3939, 0.1338, -0.2365],
[-0.5702, 0.1682, -0.5580, ..., -0.2796, 0.0412, -0.1782]],
grad_fn=<AddmmBackward0>)
```
Fixes https://github.com/pytorch/pytorch/issues/114750
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115936
Approved by: https://github.com/lezcano
Support for something we need for both FSDP and optimizers. For sourced args that are not inputs (params, etc) - we use the dynamic_getattr flow on tensors. This soundly handles the storage and registration and guarding downstream of tensor_wrap for the grad values. For non sourced (true intermediates), we only support None (the idea being that if we have a true intermediate in the graph with grad, we are already doing something weird).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115898
Approved by: https://github.com/bdhirsh
ghstack dependencies: #115315, #112184
Summary:
This is important for writing aten IR based graph transformation.
```
In [4]: [x.name for x in torch.ops.aten.reshape.default._schema.arguments]
Out[4]: ['self', 'shape']
In [8]: torch.ops.aten.reshape.default(torch.rand(1,2), shape=[2])
Out[8]: tensor([0.7584, 0.4834])
# === CANNOT CALL `self` BY KWARGS ===
In [7]: torch.ops.aten.reshape.default(self=torch.rand(1,2), shape=[2])
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[7], line 1
----> 1 torch.ops.aten.reshape.default(self=torch.rand(1,2), shape=[2])
TypeError: OpOverload.__call__() got multiple values for argument 'self'
```
# Where's the problem?
1. the aten ops first arg is usually named `self` (aten/src/ATen/native/native_functions.yaml)
2. Unfortunately, in `torch._ops.{OpOverload, OpOverloadPacket}.__call__()`, the first arg is (by python convention) named `self` too.
So when call `self` by kwargs, `OpOverloadPacket.__call__` received:
```
OpOverloadPacket.__call__(self, {"self": ...})
```
It is Python that does not allow some argument named "arg" to appear twice. and hence
> TypeError: OpOverload.__call__() got multiple values for argument 'self'
# How to fix?
**Note that**, in above, `self` is an instance of `OpOverloadPacket`, and the "self" kwarg is the input tensor to the aten op. To fix, we only need to differentiate the two `self`s.
In Python, first arg of a method does not need to be named `self`. So we change the `__call__` definition to:
```
def __call__(_self, ...):
```
Now the call becomes:
```
OpOverloadPacket.__call__(_self, {"self": ...})
```
where:
* `_self` is the instance to the `OpOverloadPacket`
* `"self"` is the input tensor to the aten op.
Test Plan:
```
In [4]: [x.name for x in torch.ops.aten.reshape.default._schema.arguments]
Out[4]: ['self', 'shape']
In [3]: torch.ops.aten.reshape.default(self=torch.rand(1,2), shape=[2])
Out[3]: tensor([0.5127, 0.3051])
```
Differential Revision: D51731996
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114920
Approved by: https://github.com/houseroad
A bug fix of a recently merged PR per comment: https://github.com/pytorch/pytorch/pull/101507#discussion_r1271393706
The follow test would fail without this bug fix:
```
import torch
def test_erfinv():
for device in ['cpu', 'mps']:
x = torch.tensor([0.1, 0.2, 0.3, 0.4, 0.5], device=device)
y = x[2:].erfinv()
x2 = torch.tensor([0.3, 0.4, 0.5], device=device)
y2 = x2.erfinv()
print(y)
print(y2)
torch.testing.assert_close(y, y2)
print(f"{device} passes.")
test_erfinv()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105801
Approved by: https://github.com/malfet
> **__Note:__** XNNPACK Upgrade is too large in the range of **40k** files and **10m** Lines of code, Thus we break the update of the library into multiple parts. All Parts [1 - n] Must be landed together for it to work. ***This also means If there is a revert. Please revert the Entire Stack.***
This change is everything remaining requiring XNNPACK version to work.
@allow-large-files
Differential Revision: [D52099769](https://our.internmc.facebook.com/intern/diff/D52099769/)
---
submodule
(unblock merge to make ShipIt happy)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115714
Approved by: https://github.com/digantdesai
Fixes#77818
I saw that PR #99246 was approved, but no one fixed the rebase conflicts, so I am bringing this up again to be merged.
I am leveraging @mattiaspaul work. Quoting the description here:
> * this pull request enables 3D convolutions (forward/backward) for MPS (Apple Silicon) within the same Convolution.mm file as conv2d.
> * does not support channel_last (since pytorch doesn't implement channel_last for 3D tensors)
> * does not support conv3d_transpose and treats depth-separable convolutions not as normal case (there are no MPS kernels available for either of those so far)
> * requires MacOS >=13.2 (Ventura)
Please, let me know if there are any other changes needed and I'll be happy to implement them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114183
Approved by: https://github.com/malfet
Helps call attention to any cases where the dump actually times out.
The timeout is likely to hit if we run into slow stacktrace processing.
Log any exceptions encountered in the background thread, but don't raise
them- we're already willing to abandon the debug dump, and want to
proceed with our normal execution (in the case of dumppipe) or shutdown
process (when dumping happens on timeout and shutdown is already
initiated).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115876
Approved by: https://github.com/zdevito
ghstack dependencies: #115807
- It was installed twice : once in `/usr/local/cuda/lib64` folder and 2nd time in `/usr/lib64`
- And don't install CuDNN headers thrice, only in `/usr/local/cuda/includa`
- Error on unknown CUDA version
- Modify bazel builds to look for cudnn in `/usr/local/cuda` folder
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115872
Approved by: https://github.com/huydhn
## Summary
This PR added 3 intra-node GPU allreduce algorithms to PyTorch:
- One-shot allreduce (inspired by FasterTransformer): all ranks simultaneously read and accumulate data from other ranks.
- Two-shot allreduce (inspired by FasterTransformer): all ranks simultanesouly read and accumulate `1 / world_size` data from other ranks. Then all ranks read accumulated data from other ranks. (effectively one-shot reduce-scatter + one-shot all-gather).
- Hybrid cube mesh allreduce (original): a one-shot allreduce variant that avoids transmission over PCIe on HCM topology.
## Micro Benchmarks



## Details
The intra-node algos are organized behind `c10d::IntraNodeComm`, which is responsible for:
- Managing handshaking and cuda IPC handle exchange among ranks.
- Querying NVLink connection and detecting topology.
- Performing algo selection based on available info.
- Launching the selected allreduce kernel.
`c10d::IntraNodeComm` is integrated into `c10d::ProcessGroupNCCL` as follows:
- When the `ENABLE_INTRA_NODE_COMM` environment variable is set, `c10d::ProcessGroupNCCL` initialize a `c10d::IntraNodeComm` for its ranks.
- If the setup is not suitable for intra-node comm (e.g. not all ranks are from the same node), the rendezvous logic guarantees all participants fall back consistently.
- `c10d::ProcessGroupNCCL::allreduce` consults `c10d::IntraNodeComm` whether to use intra-node allreduce and carries out the communication accordingly.
We currently detect two types of topoloies from the nNVLink connection mesh:
- Fully connected: all GPU pairs has direct NVLink connection (e.g. NVSwitch or fully connected sub-set of hybrid cube mesh)
- `msg <= 256KB`: one-shot allreduce.
- `256KB < msg <= 10MB`: two-shot allreduce.
- `msg > 10MB`: instructs the caller to fallback to NCCL.
- Hybrid cube mesh
- `msg <= 256KB`: one-shot allreduce.
- `msg > 256KB`: instructs the caller to fallback to NCCL.
## Next Steps
- Fine tune algo selection based on GPU model, topology, link speed.
- Potentially optimize the two-shot allreduce impl. Accroding to FasterTransformer, two-shot allreduce is preferred until 50MB. There might be room for improvement, but PyTorch does impose more constraints:
- FasterTransformer uses a single process to drive multiple devices. It can use `cudaDeviceEnablePeerAccess` enable device-level peer access.
- PyTorch uses multiple process to drive multiple devices. With cuda IPC, a device can only share a specific region to other devices. This means extra copies may be unavoidable.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114001
Approved by: https://github.com/yf225
**Summary**
Change the QConv2d Binary fusion post op name from `add` to `sum`, since we are actually using OneDNN `post op sum` instead of `Binary_Add` for now.
**TestPlan**
```
python -m pytest test_quantized_op.py -k test_qconv2d_sum_pt2e
python -m pytest test_quantized_op.py -k test_qconv2d_sum_relu_pt2e
python -m pytest test_quantized_op.py -k test_qconv2d_sum_relu_float_output_pt2e
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115329
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
Summary:
Both cuSPASRELt and CUTLASS support 1:2 semi-structured sparsity for
fp32, which this PR enables.(thanks @alexsamardzic).
Furthermore, this PR also updates the sparse_config to take into account
the different shape constraints for sparse and dense matrices.
Technically, cuSPARSELt supports smaller sparse matrix constraints as it
seens to pad to the CUTLASS constraints under the hood. However, in
practice small sparse matrices are not commonly used and we care more
about the dense constraints for LLM inference.
For now, we keep the CUTLASS constraints in place for both cuSPARSELt
and CUTLASS tensors
This PR also reconnects the _FUSE_TRANSPOSE flag for cuSPARSELt tensors.
Test Plan:
```
python test/test_sparse_semi_structured.py
```
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115550
Approved by: https://github.com/cpuhrsch
Tests that are added to a list in dynamo_test_failures.py will
automatically be marked as expectedFailure when run with
PYTORCH_TEST_WITH_DYNAMO=1. I'm splitting this PR off on its own so that
I can test various things on top of it.
Also added an unMarkDynamoStrictTest that is not useful until we turn
on strict mode by default.
Test Plan:
- code reading
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115845
Approved by: https://github.com/voznesenskym
Summary:
We do not need to dequantize and quantize again for this op.
With this optimization cunet-enc ops:
vulkan.quantized_max_pool2d_quint8{48, 36, 2} 207532
vulkan.quantized_max_pool2d_quint8{24, 18, 4} 78832
vulkan.quantized_max_pool2d_quint8{12, 9, 8} 49296
Without optimization:
vulkan.quantized_max_pool2d_quint8{48, 36, 2} 234416
vulkan.quantized_max_pool2d_quint8{24, 18, 4} 94380
vulkan.quantized_max_pool2d_quint8{12, 9, 8} 58760
Test Plan:
Ensure all vulkan quantize tests pass:
buck2 run --target-platforms ovr_configplatform/macos:arm64-fbsourcexplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output"
Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
[==========] Running 78 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 78 tests from VulkanAPITest
...
[==========] 78 tests from 1 test suite ran. (1519 ms total)
[ PASSED ] 78 tests.
buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output"
Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
[==========] Running 395 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 395 tests from VulkanAPITest
...
[ SKIPPED ] VulkanAPITest.querypool_flushed_shader_log (0 ms)
[----------] 395 tests from VulkanAPITest (6515 ms total)
[----------] Global test environment tear-down
[==========] 395 tests from 1 test suite ran. (6515 ms total)
[ PASSED ] 394 tests.
[ SKIPPED ] 1 test, listed below:
[ SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
YOU HAVE 5 DISABLED TESTS
Reviewed By: yipjustin, copyrightly
Differential Revision: D50998619
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115690
Approved by: https://github.com/SS-JIA
This removes global (but ROCM_ONLY) `over_arch` and `gcn_arch_override_flag` variables in favor of block level static initialization introduced in C++11
To quote from [ISO/IEC 14882-2014](https://www.open-std.org/jtc1/sc22/wg21/docs/standards)
>The zero-initialization (8.5) of all block-scope variables with static storage duration (3.7.1) or thread storage
> duration (3.7.2) is performed before any other initialization takes place. Constant initialization (3.6.2) of a
> block-scope entity with static storage duration, if applicable, is performed before its block is first entered.
> An implementation is permitted to perform early initialization of other block-scope variables with static or
> thread storage duration under the same conditions that an implementation is permitted to statically initialize
> a variable with static or thread storage duration in namespace scope (3.6.2). Otherwise such a variable is
> initialized the first time control passes through its declaration; such a variable is considered initialized upon
> the completion of its initialization. If the initialization exits by throwing an exception, the initialization
> is not complete, so it will be tried again the next time control enters the declaration. If control enters
> the declaration concurrently while the variable is being initialized, the concurrent execution shall wait for
> completion of the initialization.
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115844
Approved by: https://github.com/huydhn
Noticed that on many MRS kernels the grid wrapper for autotuning is huge with a bunch of duplicates due to num_warps and num_stages not being needed for grid calculation. Lets deduplicate these entries.
Previously, we would see wrapper like
```
def grid_wrapper_for_add_kernel_2d_autotuned_0(meta):
if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
```
now it looks like
```
def grid_wrapper_for_add_kernel_2d_autotuned_0(meta):
if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115849
Approved by: https://github.com/jansel
The mutex was originally added to avoid racing to dump debuginfo,
where a race in this case would result in a corrupted dump file.
The reason a mutex helps is that it forces all dump requests to be
serialized, so that an observer would either see an in-progress file, a
complete file, or no file. Without a mutex, a fourth state is possible
(a file that has been written to by multiple threads and is invalid).
Becuase the mutex was a ProcessGroupNCCL class member, and each PG
instance has its own watchdog thread that can launch a dump, it was not
doing its job. Making the mutex static shares it between instances of
the class and ensures serialization of dumps triggered by any PG.
(Note: dumps triggered by different PGs have the same, global contents
anyway- there is only one global flight recorder, so it doesn't matter
who triggers it.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115803
Approved by: https://github.com/kwen2501
ghstack dependencies: #115771, #115798, #115800, #115801
Adds a PG {process group uid} prefix component to logs.
This is helpful in situations where there are multiple processgroups,
and rank information by itself is confusing. (For example rank0 on PG1
may correspond to rank3 on PG0. People may assume 'rank0' references
the global (PG0) world, but it may reference a sub-pg. Prefacing the PG
helps clarify this.
Does NOT change logs from inside WorkNCCL functions, since WorkNCCL
doens't know what PG ID it corresponds to. Will address these logs
separately.
Example:
```
[I ProcessGroupNCCL.cpp:787] [PG 0 Rank 0] ProcessGroupNCCL initialization ...
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115801
Approved by: https://github.com/fduwjj
ghstack dependencies: #115771, #115798, #115800
Put the repeated code that string formats [Rank {rank}] in one place.
Sets up for the next PR that also adds more info to this prefix.
(Does not change exception messages, which could be done as well.
Exception messages are not formatted quite the same way. Tries
instead to keep from changing log behavior (in this PR) and only
refactor code.
Did limited testing (some logs were observed OK).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115800
Approved by: https://github.com/fduwjj
ghstack dependencies: #115771, #115798
The NCCL flight recorder is per-process (it is shared by all
processgroups), but individual process groups used to construct their
own pipe for being signaled to dump the flight recorder.
This ensures that only one pipe per process is created, by only creating
the pipe on the first ProcessGroup (uid_ == 0) which should be the world
group.
Filenames are still keyed off of rank, but this should now be global
rank instead of sub-pg rank, making the filenames unique across the
whole trainer process.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115798
Approved by: https://github.com/zdevito
ghstack dependencies: #115771
This pull requests add initial Flash Attention support for AMD/ROCM platform. It added a specialized Triton repository/branch as a compile-time dependency for Flash Attention math library on AMD/ROCM. This triton submodule is not used at runtime and will not be shipped to the final pytorch package. We have the plan to release this specialized Triton as a separate project.
Know limitations:
- [ ] Only supports MI200 series GPU (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`.
- [ ] Only supports power of two sequence lengths.
- [ ] No support for varlen APIs.
- [ ] Only support head dimension 16,32,64,128.
- [ ] Performance is still being optimized.
Fixes https://github.com/pytorch/pytorch/issues/112997
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114309
Approved by: https://github.com/jeffdaily, https://github.com/malfet
---------
Co-authored-by: Joseph Groenenboom <joseph.groenenboom@amd.com>
## Summary
This PR added 3 intra-node GPU allreduce algorithms to PyTorch:
- One-shot allreduce (inspired by FasterTransformer): all ranks simultaneously read and accumulate data from other ranks.
- Two-shot allreduce (inspired by FasterTransformer): all ranks simultanesouly read and accumulate `1 / world_size` data from other ranks. Then all ranks read accumulated data from other ranks. (effectively one-shot reduce-scatter + one-shot all-gather).
- Hybrid cube mesh allreduce (original): a one-shot allreduce variant that avoids transmission over PCIe on HCM topology.
## Micro Benchmarks



## Details
The intra-node algos are organized behind `c10d::IntraNodeComm`, which is responsible for:
- Managing handshaking and cuda IPC handle exchange among ranks.
- Querying NVLink connection and detecting topology.
- Performing algo selection based on available info.
- Launching the selected allreduce kernel.
`c10d::IntraNodeComm` is integrated into `c10d::ProcessGroupNCCL` as follows:
- When the `ENABLE_INTRA_NODE_COMM` environment variable is set, `c10d::ProcessGroupNCCL` initialize a `c10d::IntraNodeComm` for its ranks.
- If the setup is not suitable for intra-node comm (e.g. not all ranks are from the same node), the rendezvous logic guarantees all participants fall back consistently.
- `c10d::ProcessGroupNCCL::allreduce` consults `c10d::IntraNodeComm` whether to use intra-node allreduce and carries out the communication accordingly.
We currently detect two types of topoloies from the nNVLink connection mesh:
- Fully connected: all GPU pairs has direct NVLink connection (e.g. NVSwitch or fully connected sub-set of hybrid cube mesh)
- `msg <= 256KB`: one-shot allreduce.
- `256KB < msg <= 10MB`: two-shot allreduce.
- `msg > 10MB`: instructs the caller to fallback to NCCL.
- Hybrid cube mesh
- `msg <= 256KB`: one-shot allreduce.
- `msg > 256KB`: instructs the caller to fallback to NCCL.
## Next Steps
- Fine tune algo selection based on GPU model, topology, link speed.
- Potentially optimize the two-shot allreduce impl. Accroding to FasterTransformer, two-shot allreduce is preferred until 50MB. There might be room for improvement, but PyTorch does impose more constraints:
- FasterTransformer uses a single process to drive multiple devices. It can use `cudaDeviceEnablePeerAccess` enable device-level peer access.
- PyTorch uses multiple process to drive multiple devices. With cuda IPC, a device can only share a specific region to other devices. This means extra copies may be unavoidable.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114001
Approved by: https://github.com/yf225
This test isn't actually parametrized by `dtype` so it seems to surface bogus failures where "unsupported" types "work" but in reality fp8 is used every time.
CC @drisspg I'm guessing this doesn't surface in upstream CI because there are no SM9.0 runners yet?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115661
Approved by: https://github.com/drisspg
Description:
- Added non-integer expr support for floordiv in triton codegen
- Added a test
- cpp test is skipped as failing and https://github.com/pytorch/pytorch/pull/115647 may fix it
This PR is fixing compilation error with the following code:
```python
import torch
def func(x, a):
n = (a * 1.234) // 8.234
y = x + n
return y
cfunc = torch.compile(func, dynamic=True, fullgraph=True)
device = "cuda"
x = torch.tensor(0, dtype=torch.float32, device=device)
a = 33
out = cfunc(x, a)
expected = func(x, a)
torch.testing.assert_close(out, expected)
```
Error message on Nightly:
```
File "/usr/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
raise self._exception
torch._dynamo.exc.BackendCompilerFailed: backend='compile_fx_wrapper' raised:
CompilationError: at 7:38:def triton_(in_ptr0, out_ptr0, ks0, xnumel, XBLOCK : tl.constexpr):
xoffset = tl.program_id(0) * XBLOCK
xindex = xoffset + tl.arange(0, XBLOCK)[:]
xmask = xindex < xnumel
x0 = xindex
tmp0 = tl.load(in_ptr0 + (x0), xmask)
tmp1 = ((1.23400000000000*ks0) // 8.23400000000000)
^
AssertionError()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115751
Approved by: https://github.com/peterbell10
This diff aims to directly import DeviceMesh from torch.distributed.device_mesh instead of importing it from dist._tensor. This is done to avoid a circular dependency issue. The code changes in each file of the diff are as follows:
- torch/distributed/_functional_collectives.py: import DeviceMesh from torch.distributed instead of dist._tensor.
Overall, this diff aims to improve the code by avoiding circular dependencies and improving the import statements.
==
The above summary is generated by LLM with minor manual fixes. The following summary is by me.
The original import causes some issues when compiling DDP with compiled_autograd. The root cause of compilation failure is not identified but it is good to fix the lazy initialization, which indirectly fixes the compilation issues for DDP.
Differential Revision: [D51857246](https://our.internmc.facebook.com/intern/diff/D51857246/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115649
Approved by: https://github.com/wconstab, https://github.com/wz337
ghstack dependencies: #115523, #115302, #115648
Summary:
This library contains global state, e.g. pytorch_qnnp_params. If we make
it a static library, different shared libraries linking that static
library can end up with their own copies of the global state, leading to
bugs. Make it a shared library instead, to avoid this issue.
Test Plan: buck2 test fbsource//fbandroid/javatests/com/facebook/playground/apps/fb4aplayground/scenarios/pytorchscenario:pytorchscenario -- --run-disabled --regex runBundledInputWithLocalAsset
Differential Revision: D51926024
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115570
Approved by: https://github.com/malfet
Fixes#114903
Previously large split variance reductions stored the intermediates as float16
precision, which may lead to overflow as the intermediate result is
unnormalized.
In #114903 we see two different `num_split` decisions made based on the
hardware capabilities, one of which has large enough intermediates to cause
overflows.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115181
Approved by: https://github.com/shunting314
We were only passing a subset of the group creation information to the
NCCL PG. We are specifically missing the information on which global
ranks belong to a particular PG.
This allows the NCCL PG to use this additional information for things
like better trace logging.
Test Plan:
OSS CI
Reviewers:
Subscribers:
Tasks:
Tags:
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114736
Approved by: https://github.com/kwen2501
Summary:
Dynamo test methodology provides a good example to patch various
treaments on the same set of test cases. A pitfall is the global config
that could be easily modified somewhere. Here we change the behavior of
the export API thru hijacking it with self defined code.
For supporting non-strict test suite, the `strict=False` is explicitly
passed into the export API when it's called w/ or w/o strict arg.
Test Plan:
python test/export/test_export_nonstrict.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115399
Approved by: https://github.com/zhxchen17, https://github.com/tugsbayasgalan
Related to #103973#110532#108404#94891
**Context:**
As commented in 6ae0554d11/cmake/Dependencies.cmake (L1198)
Kernel asserts are enabled by default for CUDA and disabled for ROCm.
However it is somewhat broken, and Kernel assert was still enabled for ROCm.
Disabling kernel assert is also needed for users who do not have PCIe atomics support. These community users have verified that disabling the kernel assert in PyTorch/ROCm platform fixed their pytorch workflow, like torch.sum script, stable-diffusion. (see the related issues)
**Changes:**
This pull request serves the following purposes:
* Refactor and clean up the logic, make it simpler for ROCm to enable and disable Kernel Asserts
* Fix the bug that Kernel Asserts for ROCm was not disabled by default.
Specifically,
- Renamed `TORCH_DISABLE_GPU_ASSERTS` to `C10_USE_ROCM_KERNEL_ASSERT` for the following reasons:
(1) This variable only applies to ROCm.
(2) The new name is more align with #define CUDA_KERNEL_ASSERT function.
(3) With USE_ in front of the name, we can easily control it with environment variable to turn on and off this feature during build (e.g. `USE_ROCM_KERNEL_ASSERT=1 python setup.py develop` will enable kernel assert for ROCm build).
- Get rid of the `ROCM_FORCE_ENABLE_GPU_ASSERTS' to simplify the logic and make it easier to understand and maintain
- Added `#cmakedefine` to carry over the CMake variable to C++
**Tests:**
(1) build with default mode and verify that USE_ROCM_KERNEL_ASSERT is OFF(0), and kernel assert is disabled:
```
python setup.py develop
```
Verify CMakeCache.txt has correct value.
```
/xxxx/pytorch/build$ grep USE_ROCM_KERNEL_ASSERT CMakeCache.txt
USE_ROCM_KERNEL_ASSERT:BOOL=0
```
Tested the following code in ROCm build and CUDA build, and expected the return code differently.
```
subprocess.call([sys.executable, '-c', "import torch;torch._assert_async(torch.tensor(0,device='cuda'));torch.cuda.synchronize()"])
```
This piece of code is adapted from below unit test to get around the limitation that this unit test now was skipped for ROCm. (We will check to enable this unit test in the future)
```
python test/test_cuda_expandable_segments.py -k test_fixed_cuda_assert_async
```
Ran the following script, expecting r ==0 since the CUDA_KERNEL_ASSERT is defined as nothing:
```
>> import sys
>>> import subprocess
>>> r=subprocess.call([sys.executable, '-c', "import torch;torch._assert_async(torch.tensor(0,device='cuda'));torch.cuda.synchronize()"])
>>> r
0
```
(2) Enable the kernel assert by building with USE_ROCM_KERNEL_ASSERT=1, or USE_ROCM_KERNEL_ASSERT=ON
```
USE_ROCM_KERNEL_ASSERT=1 python setup.py develop
```
Verify `USE_ROCM_KERNEL_ASSERT` is `1`
```
/xxxx/pytorch/build$ grep USE_ROCM_KERNEL_ASSERT CMakeCache.txt
USE_ROCM_KERNEL_ASSERT:BOOL=1
```
Run the assert test, and expected return code not equal to 0.
```
>> import sys
>>> import subprocess
>>> r=subprocess.call([sys.executable, '-c', "import torch;torch._assert_async(torch.tensor(0,device='cuda'));torch.cuda.synchronize()"])
>>>/xxxx/pytorch/aten/src/ATen/native/hip/TensorCompare.hip:108: _assert_async_cuda_kernel: Device-side assertion `input[0] != 0' failed.
:0:rocdevice.cpp :2690: 2435301199202 us: [pid:206019 tid:0x7f6cf0a77700] Callback: Queue 0x7f64e8400000 aborting with error : HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception. code: 0x1016
>>> r
-6
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114660
Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/jithunnair-amd
Currently this skip imports torchvision, so if your torchvision install
is broken then the entire file fails at collection time. This instead
means only the test itself will fail.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115677
Approved by: https://github.com/lezcano
This implements an optional alternate interface to the AOTI
generated DSO, intended to increase efficiency for models running on
CPU and requiring minimal overhead. See comment in config.py for more
explanation.
This took a while to get right (e.g., I initially required 1-D
MiniArrayRef<T> for the inputs, but found that multi-dimensional
ArrayRefTensor<T> ended up simplifying the implementation and allowed
test_aot_inductor.py to run) and is somewhat intricate, so I am
anticipating that review will require some back-and-forth.
Differential Revision: [D50699890](https://our.internmc.facebook.com/intern/diff/D50699890/)
**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D50699890/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112800
Approved by: https://github.com/chenyang78
Summary:
`CheckForSignals()` can be called from multiple threads concurrently, e.g. from within `ExecuteStepRecursive()`. This means that `my_sigint_count_` and `my_sighup_count_` can be written concurrently, causing data races.
To fix, use atomic exchange which writes the new value and returns the old value in one atomic operation.
Test Plan: Running TSAN tests that failed before and now pass
Differential Revision: D52018963
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115510
Approved by: https://github.com/malfet
Summary:
Dynamo test methodology provides a good example to patch various
treaments on the same set of test cases. A pitfall is the global config
that could be easily modified somewhere. Here we change the behavior of
the export API thru hijacking it with self defined code.
For supporting non-strict test suite, the `strict=False` is explicitly
passed into the export API when it's called w/ or w/o strict arg.
Test Plan:
python test/export/test_export_nonstrict.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115399
Approved by: https://github.com/zhxchen17, https://github.com/tugsbayasgalan
AOTInductor currently relies of export_to_torch_ir to generate a graph, and passes it to inductor to generate the .so. They would like the FQN to be consistent so that they can easily find/update the weights in the .so.
Note that since export flattens all modules in to a single computational graph, we will change the FQNs in the original module by replacing all periods with underscores. For example, `foo.child1param`, which points to a submodule named `foo`'s parameter named `child1param`, will be renamed to `foo_child1param` since we no longer have the submodule `foo`. This is done just by doing `name.replace(".", "_")`.
Outputted AOTInductor c++ code: https://www.internalfb.com/phabricator/paste/view/P900120950?lines=377-355%2C354
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115462
Approved by: https://github.com/tugsbayasgalan
This PR fixed#104791
bitcast requires the source and target have the bitwidth.
Because the input tensor's dtype could be promoted, e.g. from float16 to
float, we have to cast the tensor to its original source dtype before
invoking bitcast in such cases. After that, we also need to convert
the bit-casted tensor back to float to make sure we keep using higher
precision values for the rest of the computation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115619
Approved by: https://github.com/jansel, https://github.com/eellison
`torch._C.has_mkldnn` does not respect cases where users try to disable mkldnn using `torch._C._set_mkldnn_enabled()`. This is relevant to edge use cases, where they do not want decompositions to go to the ATen opset, and do not want the mkldnn operator to appear in the graph.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115448
Approved by: https://github.com/jgong5, https://github.com/ydwu4
Summary:
An internal MRS model was taking over a day's worth of time to compile due to many duplicates in dependency tracking. This PR replaces the list with a custom dedup list.
Normally one could use a set/dict for this purpose however the list in question gets elements appended as it is being iterated over which means that we need to keep the list semantics.
Test Plan: ad hoc testing
Differential Revision: D52060659
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115609
Approved by: https://github.com/jansel
This PR is intended to fix the following problem:
When using `CTCLoss`, there is a cudnn path gated by a call to [`_use_cudnn_ctc_loss`](
e918461377/aten/src/ATen/native/cudnn/LossCTC.cpp (L73-L101)) which checks some conditions
e918461377/aten/src/ATen/native/LossCTC.cpp (L486-L496)
However, there are more checks in `_cudnn_ctc_loss`
e918461377/aten/src/ATen/native/cudnn/LossCTC.cpp (L122-L130)
some of which are not present in `_use_cudnn_ctc_loss` (e.g. the check that `targets` is on CPU which will cause a RuntimeError after dispatching to `_cudnn_ctc_loss`). Instead, these checks should be in `_use_cudnn_ctc_loss` so that the normal `_ctc_loss` path will be used if the checks are not met)
e.g. Before this PR
```python
>>> import torch
>>> ctcloss = torch.nn.CTCLoss()
>>> log_probs = torch.randn((50, 3, 15), device='cuda').log_softmax(2)
>>> target = torch.randint(1, 15, (30 + 25 + 20,), dtype = torch.int)
>>> input_lengths = torch.tensor((50, 50, 50), device='cuda')
>>> target_lengths = torch.tensor((30, 25, 20), device='cuda')
>>> ctcloss(log_probs, target, input_lengths, target_lengths)
tensor(4.1172, device='cuda:0')
>>> target = target.to('cuda')
>>> ctcloss(log_probs, target, input_lengths, target_lengths)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/data/users/mg1998/pytorch/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/data/users/mg1998/pytorch/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/data/users/mg1998/pytorch/torch/nn/modules/loss.py", line 1779, in forward
return F.ctc_loss(log_probs, targets, input_lengths, target_lengths, self.blank, self.reduction,
File "/data/users/mg1998/pytorch/torch/nn/functional.py", line 2660, in ctc_loss
return torch.ctc_loss(
RuntimeError: Expected tensor to have CPU Backend, but got tensor with CUDA Backend (while checking arguments for cudnn_ctc_loss)
```
After this PR the above snippet runs without error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115617
Approved by: https://github.com/janeyx99
As in the title.
The reproducer for the bug is as follows:
```python
>>> import torch
>>> dtype = torch.bfloat16
>>> D1 = torch.tensor([[1, 2], [3, 4]], dtype=dtype, requires_grad=True)
>>> D2 = torch.tensor([[1, 2], [3, 4]], dtype=dtype, requires_grad=True)
>>> torch.autograd.gradcheck(torch.mm, (D1, D2), fast_mode=True)
```
<details>
```
torch.autograd.gradcheck.GradcheckError: Jacobian mismatch for output 0 with respect to input 0,
numerical:tensor(0., dtype=torch.bfloat16)
analytical:tensor(4.9062, dtype=torch.bfloat16)
The above quantities relating the numerical and analytical jacobians are computed
in fast mode. See: https://github.com/pytorch/pytorch/issues/53876 for more background
about fast mode. Below, we recompute numerical and analytical jacobians in slow mode:
Numerical:
tensor([[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.]], dtype=torch.bfloat16)
Analytical:
tensor([[1., 2., 0., 0.],
[3., 4., 0., 0.],
[0., 0., 1., 2.],
[0., 0., 3., 4.]], dtype=torch.bfloat16)
```
</details>
```python
The max per-element difference (slow mode) is: 4.0.
>>> torch.autograd.gradcheck(torch.mm, (D1, D2), fast_mode=True, eps=1e-1)
```
<details>
```
<snip>
torch.autograd.gradcheck.GradcheckError: Jacobian mismatch for output 0 with respect to input 0,
numerical:tensor(5., dtype=torch.bfloat16)
analytical:tensor(4.9062, dtype=torch.bfloat16)
The above quantities relating the numerical and analytical jacobians are computed
in fast mode. See: https://github.com/pytorch/pytorch/issues/53876 for more background
about fast mode. Below, we recompute numerical and analytical jacobians in slow mode:
Numerical:
tensor([[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.]], dtype=torch.bfloat16)
Analytical:
tensor([[1., 2., 0., 0.],
[3., 4., 0., 0.],
[0., 0., 1., 2.],
[0., 0., 3., 4.]], dtype=torch.bfloat16)
```
</details>
```
The max per-element difference (slow mode) is: 4.0.
```
Notice that changing `eps` value has no effect to max per-element difference.
With this PR, increasing `eps` value will lead to sensible results in numerical jacobian:
```python
>>> torch.autograd.gradcheck(torch.mm, (D1, D2), fast_mode=True, eps=1e-1)
```
<details>
```
<snip>
torch.autograd.gradcheck.GradcheckError: Jacobian mismatch for output 0 with respect to input 0,
numerical:tensor(5., dtype=torch.bfloat16)
analytical:tensor(4.9062, dtype=torch.bfloat16)
The above quantities relating the numerical and analytical jacobians are computed
in fast mode. See: https://github.com/pytorch/pytorch/issues/53876 for more background
about fast mode. Below, we recompute numerical and analytical jacobians in slow mode:
Numerical:
tensor([[0.9375, 1.8750, 0.0000, 0.0000],
[2.9688, 3.7500, 0.0000, 0.0000],
[0.0000, 0.0000, 1.2500, 2.5000],
[0.0000, 0.0000, 2.5000, 3.7500]], dtype=torch.bfloat16)
Analytical:
tensor([[1., 2., 0., 0.],
[3., 4., 0., 0.],
[0., 0., 1., 2.],
[0., 0., 3., 4.]], dtype=torch.bfloat16)
```
</details>
```
The max per-element difference (slow mode) is: 0.5.
```
Finally:
```python
>>> torch.autograd.gradcheck(torch.mm, (D1, D2), fast_mode=True, eps=1e-1, atol=1)
True
```
that would fail with the current main branch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115634
Approved by: https://github.com/lezcano, https://github.com/soulitzer, https://github.com/albanD
ghstack dependencies: #115536
Summary: The original logic has an incorrect assumption that there is at one object name left when traversing the module tree. This is not correct when the leaf module is wrapped by FSDP.
Test Plan: CI
Differential Revision: D52049293
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115592
Approved by: https://github.com/wz337
Fixes#50051.
This PR is based on #50320 and I address the last feedback.
On Windows it is enabled by default. Can be enabled or disabled via USE_CUSTOM_TERMINATE env variable.
This PR adds support for overriding the terminate handler in order to log uncaught exceptions in the threads.
If an exception is thrown and not caught, it will print <Unhandled exception caught in c10/util/AbortHandler.h>
The point of doing this is that in issue #50051, exceptions were thrown but not logged. With this logging system it will be easier to debug it in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101332
Approved by: https://github.com/albanD, https://github.com/malfet
> **__Note:__** XNNPACK Upgrade is too large in the range of **40k** files and **10m** Lines of code, Thus we break the update of the library into multiple parts. All Parts [1 - 6/n] Must be landed together for it to work. ***This also means If there is a revert. Please revert the Entire Stack.***
This change is everything remaining requiring XNNPACK version to work.
Differential Revision: [D52044420](https://our.internmc.facebook.com/intern/diff/D52044420/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115587
Approved by: https://github.com/digantdesai
As in the title.
In addition:
- improve the algorithm for finding a minima of operation timings: break the inner loop early when a next minima candidate is found
- add tests and fix bugs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115499
Approved by: https://github.com/cpuhrsch
Previously if two calls to cumsum were generated in the same triton kernel
we would generate identical helper functions with different names. Now this
recognizes identical functions and only defines it once. To do this I defer
choosing the name until after codegen.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115546
Approved by: https://github.com/lezcano
ghstack dependencies: #109132
But limit it to MacOS Sonoma +
Before the calling `torch.cat` with complex types failed, but now it works.
Before:
```
% python -c "import torch;print(torch.cat([torch.rand(3, 3, dtype=torch.cfloat).to('mps'), torch.rand(3, 3, dtype=torch.cfloat).to('mps')]))"
TypeError: Trying to convert ComplexFloat to the MPS backend but it does not have support for that dtype.
```
After:
```
% python -c "import torch;print(torch.cat([torch.rand(3, 3, dtype=torch.cfloat).to('mps'), torch.rand(3, 3, dtype=torch.cfloat).to('mps')]))"
tensor([[0.4857+0.0030j, 0.9375+0.8630j, 0.3544+0.9911j],
[0.5293+0.8652j, 0.8440+0.1991j, 0.5152+0.8276j],
[0.0136+0.7469j, 0.1403+0.4761j, 0.2943+0.0896j],
[0.6458+0.0035j, 0.3579+0.4577j, 0.1723+0.1508j],
[0.4420+0.3554j, 0.4396+0.7272j, 0.2479+0.1191j],
[0.3895+0.2292j, 0.7886+0.1613j, 0.9243+0.4180j]], device='mps:0')
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115513
Approved by: https://github.com/kulinseth
ghstack dependencies: #115512
* Try linux.large.arc for stale workflow
* Run stale workflow on PR changes
* Added arc runner lable to the list of self hosted runners
* Added concurency linux-job
* Cleanup
* Added workflow_dispatch for testing purpose
Summary:
Dynamo test methodology provides a good example to patch various
treaments on the same set of test cases. A pitfall is the global config
that could be easily modified somewhere. Here we change the behavior of
the export API thru hijacking it with self defined code.
For supporting non-strict test suite, the `strict=False` is explicitly
passed into the export API when it's called w/ or w/o strict arg.
* For existing failed strict test cases, non-strict also fails.
* For passed strict but failed non-strict cases, we mark them as
`@testing.expectedFailureNonStrict`.
* Moreover, I manually check the failure reason and some of them are not
related to nn.Module asserting exception. I mark them as `# Need to fix
for non-strict mode`.
Test Plan:
python test/export/test_export_nonstrict.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115399
Approved by: https://github.com/zhxchen17, https://github.com/tugsbayasgalan
This changes cached thread_local tensors to stack-allocated buffers. Since we were incidentally caching output in a thread_local, I had to add manual thread_local caching of outputs, which I implemented by caching a buffer and a Tensor whose storage is that buffer and then just memcpying the result into the cached buffer every time. Ideally, memory planning would be able to identify allocations that are the backing storage for outputs, but this should be good enough in the absence of planning.
Differential Revision: [D50416438](https://our.internmc.facebook.com/intern/diff/D50416438/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112116
Approved by: https://github.com/jansel, https://github.com/desertfire
This pull request adds a tool to visualize sharding. It uses the device_mesh and placement details to construct a visualization of the split of a torch dtensor.
Things to fix:
- [x] This implementation only uses the first element of the placement tuple, when can there be more than one elements?
- [x] The calculation of the split is happening here but maybe it is already done somewhere internally in Shard class and can we directly call that here?
Fixes#108746
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114307
Approved by: https://github.com/wanchaol
This PR adds a experimental implicit replication support for DTensor to
inter-op with torch.Tensor, basically under this context manager DTensor
could work together with torch.Tensor by assuming the torch.Tensor
sharding layout is replicated.
Note that this is risky for DTensor so we don't turn it on by default,
but for certain cases where it is for sure replicated, user can use this
to allow DTensor and Tensor computation work together
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115297
Approved by: https://github.com/awgu
We added a monitor thread in NCCL PG in https://github.com/pytorch/pytorch/pull/112518. To summarize what we are doing in monitor thread: it listens to the heartbeat from watchdog thread and detect unhealthy nccl watchdog hang (due to several reasons such as nccl/cuda API bugs or unexpected blocking behaviors). This is the last resort to ensure that we don't silently keep the training job run for hours.
We didn't open this feature as default, since we want to perform more due diligence and have some customers to try it out. So far, we didn't see any obstacle which blocks turning on this feature and received positive feedback from users. We now decided to turn in on by default in this PR.
If this feature turns out not to work as expected and disturb one's training process, one can set `TORCH_NCCL_ENABLE_MONITORING=0` to disable this feature. Please kindly file an issue with us so that we can see if we missed any corner cases during the design.
Differential Revision: [D52045911](https://our.internmc.facebook.com/intern/diff/D52045911)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115577
Approved by: https://github.com/wconstab, https://github.com/kwen2501
Summary:
cuSPARSELt has support for different alg_id, which are set via
`cusparseLTMatmulAlgSetAttribute`, in total there are 4 different
alg_ids, 0 - 3.
Previously we were just using the default alg_id, as from our initial
experiments we found that for most shapes the default alg_id is the
fastest and that they made no difference on numerical correctness, just
performance. From our previous experiments the fastest alg_id seemed to
differ only on small matmul shapes.
danthe3rd found a performance regression when running with
cuSPARSELt v0.4.0 vs v0.5.0, on LLM shapes, which match these
characteristics (activations are small, weights are large).
However it's likely that this is due to the alg_id ordering changing, as
mentioned in the release notes for v0.5.0.
```
cusparseLtMatmulAlgSelectionInit() does not ensure the same ordering of
algorithm id alg as in v0.4.0.
```
This PR adds in the following:
- support for passing in alg_id to _cslt_sparse_mm
- a new op, _cslt_sparse_mm_search, which returns the optimal alg_id for
a given matmul
_cslt_sparse_mm_search has the same function signature as
_cslt_sparse_mm, minus the alg_id parameter.
We are able to achieve v0.4.0 performance with alg_id=1 on the shapes
that daniel provided.
We will address autoselecting the best alg_id in a future PR, possibly
with torch.compile.
Test Plan:
```
python test/test_sparse_semi_structured -k cslt
```
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115178
Approved by: https://github.com/cpuhrsch
But limit it to MacOS Sonoma +
Before the calling `torch.cat` with complex types failed, but now it works.
Before:
```
% python -c "import torch;print(torch.cat([torch.rand(3, 3, dtype=torch.cfloat).to('mps'), torch.rand(3, 3, dtype=torch.cfloat).to('mps')]))"
TypeError: Trying to convert ComplexFloat to the MPS backend but it does not have support for that dtype.
```
After:
```
% python -c "import torch;print(torch.cat([torch.rand(3, 3, dtype=torch.cfloat).to('mps'), torch.rand(3, 3, dtype=torch.cfloat).to('mps')]))"
tensor([[0.4857+0.0030j, 0.9375+0.8630j, 0.3544+0.9911j],
[0.5293+0.8652j, 0.8440+0.1991j, 0.5152+0.8276j],
[0.0136+0.7469j, 0.1403+0.4761j, 0.2943+0.0896j],
[0.6458+0.0035j, 0.3579+0.4577j, 0.1723+0.1508j],
[0.4420+0.3554j, 0.4396+0.7272j, 0.2479+0.1191j],
[0.3895+0.2292j, 0.7886+0.1613j, 0.9243+0.4180j]], device='mps:0')
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115513
Approved by: https://github.com/kulinseth
ghstack dependencies: #115512
InterpreterModule is better than GraphModule codegen; it's more debuggable and
has better stack traces. The only reason we don't use it today is because
torch.compile doesn't work with it.
I work around this by constructing a GraphModule separately for usage during
dynamo tracing, but otherwise using torch.fx.Interpreter.
Differential Revision: [D51971661](https://our.internmc.facebook.com/intern/diff/D51971661/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115436
Approved by: https://github.com/zhxchen17
ghstack dependencies: #115408
UnflattenedModule doesn't really behave like a graph module; we customize `__call__` to do something completely different than what GraphModule does. So, things that test `isinstance(unflattened_module, GraphModule)` and do something with the GraphModule are often broken.
This change makes UnflattenedModule it's own thing.
Differential Revision: [D51959097](https://our.internmc.facebook.com/intern/diff/D51959097/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115408
Approved by: https://github.com/zhxchen17
Prerequisite for adding more complex type support and FFT operation
Check using `conjugateWithTensor:name:` selector defined as follows
```objc
/// Returns the complex conjugate of the input tensor elements.
///
/// - Parameters:
/// - tensor: The input tensor.
/// - name: An optional string which serves as an identifier for the operation..
/// - Returns: A valid `MPSGraphTensor` object containing the elementwise result of the applied operation.
-(MPSGraphTensor *) conjugateWithTensor:(MPSGraphTensor *) tensor
name:(NSString * _Nullable) name
MPS_AVAILABLE_STARTING(macos(14.0), ios(17.0), tvos(17.0))
MPS_SWIFT_NAME( conjugate(tensor:name:) );
```
- Rename `isOnMacOS13orNewer(unsigned minor)` hook to `isOnMacOSorNewer(major, minor)`
- Replace `torch._C.__mps_is_on_macos_13_or_newer` with `torch._C._mps_is_on_macos_or_newer`
- Add `torch.backends.mps.is_macos_or_newer` public API
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115512
Approved by: https://github.com/albanD
In https://github.com/pytorch/pytorch/pull/115449/ somehow after turning on `DUMP_ON_TIMEOUT=1`, some existing tests failed. Upon checking, the failing is because of TCPStore check call within watchdog thread.
1. It's not because of TCPStore creation has not completed, because if I put it sleep for a long time, the test still failed. Rather, it's because we query TCPStore after we shutdown the PG.
2. The reason for that is: The `std::chrono::steady_clock::now()` function in C++ returns a `time_point` object representing the current point in time according to the steady clock. The default unit of this time_point is not directly specified in terms of seconds or nanoseconds; rather, it is dependent on the internal representation of the steady clock, which can vary between implementations. In reality it's actually nanosecs which makes the delta so big that we are checking the store every time when watchdog thread wakes up. To make things even worse, `terminateProcessGroup_` might be turned to be `True` before the next check for the outmost while but before TCPStore check, so watchdog gets stuck because we are checking a TCPStore which is already deleted. And main thread is still waiting for watchdog to join.
The solution here is:
1. Add back `std::chrono::duration_cast` to ensure the delta is indeed mil_sec, so that the timeout check logic is working as expected.
2. Check `terminateProcessGroup_` as well so that, we don't do any dump when main thread has already mark the process exited.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115475
Approved by: https://github.com/wconstab
Summary:
In order to get better performance on conv2d pw its better to read the input together in a batch.
With this optimization on CUNET-enc ops:
Kernel Name Workgroup Size Duration P50 (ns)
=========== ============== =================
vulkan.quantized_conv2d_pw_2x2{96, 72, 2} 891332
vulkan.quantized_conv2d_pw_2x2{48, 36, 4} 528528
vulkan.quantized_conv2d_pw_2x2{24, 18, 8} 557336
Without this optimization:
Kernel Name Workgroup Size Duration P50 (ns)
=========== ============== =================
vulkan.quantized_conv2d_pw_2x2{96, 72, 2} 1633268
vulkan.quantized_conv2d_pw_2x2{48, 36, 4} 1177228
vulkan.vulkan.quantized_conv2d_pw_2x2{24, 18, 8} 1343264
Test Plan:
Ensure all vulkan quantize tests pass:
buck2 run --target-platforms ovr_configplatform/macos:arm64-fbsourcexplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output"
Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
[==========] Running 78 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 78 tests from VulkanAPITest
[ RUN ] VulkanAPITest.uniform_buffer_copy
...
[----------] Global test environment tear-down
[==========] 78 tests from 1 test suite ran. (1519 ms total)
[ PASSED ] 78 tests.
buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output"
Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
[==========] Running 395 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 395 tests from VulkanAPITest
[ RUN ] VulkanAPITest.zero_size_tensor
[ OK ] VulkanAPITest.zero_size_tensor (83 ms)
...
xplat/caffe2/aten/src/ATen/test/vulkan_api_test.cpp:7593: Skipped
QueryPool is not available
[ SKIPPED ] VulkanAPITest.querypool_flushed_shader_log (0 ms)
[----------] 395 tests from VulkanAPITest (6515 ms total)
[----------] Global test environment tear-down
[==========] 395 tests from 1 test suite ran. (6515 ms total)
[ PASSED ] 394 tests.
[ SKIPPED ] 1 test, listed below:
[ SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
YOU HAVE 5 DISABLED TESTS
Reviewed By: yipjustin
Differential Revision: D50997530
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115221
Approved by: https://github.com/yipjustin
Constant time access of first value in collection. This is a constant time operation instead of converting the item to a list to get the first item which is linear. The rule is turned on which automatically autofixes and enforces this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115507
Approved by: https://github.com/malfet
Summary:
cuSPARSELt has support for different alg_id, which are set via
`cusparseLTMatmulAlgSetAttribute`, in total there are 4 different
alg_ids, 0 - 3.
Previously we were just using the default alg_id, as from our initial
experiments we found that for most shapes the default alg_id is the
fastest and that they made no difference on numerical correctness, just
performance. From our previous experiments the fastest alg_id seemed to
differ only on small matmul shapes.
danthe3rd found a performance regression when running with
cuSPARSELt v0.4.0 vs v0.5.0, on LLM shapes, which match these
characteristics (activations are small, weights are large).
However it's likely that this is due to the alg_id ordering changing, as
mentioned in the release notes for v0.5.0.
```
cusparseLtMatmulAlgSelectionInit() does not ensure the same ordering of
algorithm id alg as in v0.4.0.
```
This PR adds in the following:
- support for passing in alg_id to _cslt_sparse_mm
- a new op, _cslt_sparse_mm_search, which returns the optimal alg_id for
a given matmul
_cslt_sparse_mm_search has the same function signature as
_cslt_sparse_mm, minus the alg_id parameter.
We are able to achieve v0.4.0 performance with alg_id=1 on the shapes
that daniel provided.
We will address autoselecting the best alg_id in a future PR, possibly
with torch.compile.
Test Plan:
```
python test/test_sparse_semi_structured -k cslt
```
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115178
Approved by: https://github.com/cpuhrsch
Currently, we place constants in the .so. To avoid cases
where constants are too large (i.e. >2G), we put the
constants into .lrodata, which allows doesn't have 2G limit.
Not sure why, lld still issues errors like beow even if
those large constants data are stored in .lrodata section:
"relocation R_X86_64_PC32 out of range: 5459191920 is not in
[-2147483648, 2147483647]"
In constrast, the default gnu ld linker works fine. Let's
switch back to use ld to unblock some internal models.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115478
Approved by: https://github.com/desertfire, https://github.com/htyu
Fixes#113422Fixes#94575
This is now possible:
```py
model = Model()
compiled_model = torch.compile(model)
model.load_state_dict(compiled_model.state_dict()) # previously key mismatch!
```
This also makes it much easier to checkpoint and load models that were wrapped like so:
```py
FSDP(torch.compile(model))
# or
DDP(torch.compile(model))
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113423
Approved by: https://github.com/msaroufim
1. Removes calls to `replace_all` and `clone` and makes VTs mutable.
2. Properly handles Tuple Iterator mutation. Previously TupleIterator variables would only be properly reconstructed if they were advanced at least once in a frame. On calls to `next`, the source information would be lost (due to constructing a new iterator without using builder), which would ensure that during codegen the variable would be reconstructed from scratch. Now that VTs are mutated, the source is never lost, so we need to properly track mutation and handle it by replaying calls to `next` at the end of the modified bytecode.
3. Added test for checking iadd side effects, this was missing in our unit test coverage.
4. Fixed two incorrect sources, DelayGraphBreakVariable, and UserMethodVariable both relied on setting the source to AttrSource(parent, name) at the callsite of `var_getattr`.
5. Fixed a bug in inplace adding for lists, it would set the resulting VariableTracker's source to `None` which would utilize a different reconstruct path in codegen. Now this is handled explicitly by reconstructing vars when allow_cache=`False`, so that during side effect replay, the mutated var is correctly updated.
In subsequent PRs:
* Refactoring side effect tracking to be significantly simpler (I think we only need an `is_modified` flag)
* Refactor `next_variables` iterator to match the signature of `next`
* Remove all references to `options` in the code
* Refactor VTs representing mutable collections to implement their own mutation update handling
* Remove clone and/or make it specific to lists for creating slices
* Add mutation tracking/replay for sets
* Add mutation tracking/replay for iter.py
* Removing setting source in builder (it's set at the top level after a var is returned)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113725
Approved by: https://github.com/jansel
Re-enable type checking for distributed_c10d.py
Type checking for distributed_c10d.py was inadvertently turned off in issues that have accumulated since.
Note: the backwards compatibility linter does not like some of these changes. But they were incorrect before. This needs human verification, however.
#suppress-api-compatibility-check
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115223
Approved by: https://github.com/wconstab
Summary: This diff is only for prototype to unblock the TP work. PyTorch distributed team is working on a more generic backward op for `aten.layer_norm`. Will remove this op from the experimental file once it is ready.
Test Plan:
**Local Test**:
Accuracy:
- Dtensor + Checkpoint: first run loss: P884569822 (on-par with baseline: P884213363)
- 2nd by loading saved checkpoint: P884583429 (on-par with baseline: P884271869)
Trace:
- Collective functions are inserted automatically.
- Example: https://fburl.com/perfdoctor/l567ww1x
**MAST Test**:
With: trainer = 128, batch_size=512
- NE on-par:
(see: 4441_ep_bs512_2fsdp_tp_sp_dtensor)
{F1155318138}
Differential Revision: D51490868
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115398
Approved by: https://github.com/wanchaol
Currently (after https://github.com/pytorch/pytorch/pull/114407), the user has must pass the original user ``model`` to APIs such as ``ONNXProgram.__call__``, ``ONNXProgram.adapt_torch_inputs_to_onnx`` and ``ONNXProgram.adapt_torch_outputs_to_onnx`` APIs.
This was needed because when the model is fakefied, a version of the non-fakefied model is needed so that the Initializers, buffers and constants can be extracted from a real model (and used as input to the ONNX model).
That approach brings an unnecessary usability burden to the user when the model is not fakefied, because the model that was already passed to ``torch.onnx.dynamo_export`` could be used to extract ``state_dict``.
This PR adds ``ONNXProgram._model_torch`` attribute to store the user model and demote ``model`` argument of the aforementioned APIs to optional, only (as opposed to required).
As a result, for the fakefied model scenario, the user still need to pass the required model, but for non fakefied models, the persisted model is implicitly used to extract the model state_dict, making it easier to use.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115281
Approved by: https://github.com/BowenBao
ghstack dependencies: #114407
multiprocessing.Queue relies on, among other things, background threads to send messages between processes. This works in the happy path but can cause issues if a process is exiting by bypassing atexit handlers or crashing because the writer to the Queue can terminate while the reader is blocked reading the queue. The reader sees the queue as non-empty yet even with a timeout will actually block forever.
An example of a Queue deadlock is here: https://gist.github.com/chipturner/342f72341f087737befe9df84d0e41ce
Since the error reporting case here is a simple one-shot message from the dying child to the parent, we can just use a file-based rendezvous. This eliminates the deadlock when a large traceback is still being flushed to the network when a child exits.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114688
Approved by: https://github.com/suo, https://github.com/yifuwang
Summary:
Most NT operations end with creating a new NestedTensor, which is time-consuming. Trying to reduce overhead during the NestedTensor creation.
The ops return a new NestedTensor with the same offsets, so "tensor not in _tensor_symint_registry" would be false in most case. The "in" (__contain__) function takes ~8 us. If we use the "get" directly, then we save a few us for most NT operations.
Test Plan:
Before:
get_tensor_symint take 15us
https://pxl.cl/3XF83
After
get_tensor_symint take 10us
https://pxl.cl/3XFc9
Differential Revision: D51992836
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115450
Approved by: https://github.com/soulitzer
Patch `--save-xml` when `TEST_IN_SUBPROCESS`
When `--save-xml` is given as a unit test argument and the test is handled by a `TEST_IN_SUBPROCESS` handler (e.g., `run_test_with_subprocess` for `distributed/test_c10d_nccl`), the `--save-xml` args were first "consumed" by argparser in `common_utils.py`. When a following subprocess in this `if TEST_IN_SUBPROCESS:` section starts, there are no `--save-xml` args, thus leaving `args.save_xml` to `None`.
Since argparser for `--save-xml` option has a default argument of `_get_test_report_path()` when the arg is `None`, it's not a problem for Github CI run. It could be an issue when people run those tests without `CI=1`. Test reports won't be saved in this case even if they passed `--save-xml=xxx`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115463
Approved by: https://github.com/clee2000
Currently, the ONNX exporter using torch.nn.Module as input can support
FakeTensor because the ONNX model stores all initializers
When using torch.export.ExportedProgram as input, the initializers are
lifted as inputs. In order to execute the ONNX model, we need to pass a
reference to the non-fake model to the
ONNXProgram.adapt_torch_inputs_to_onnx API, so that initializers can be
fetched from the model and fed to the ONNX model as input
ps: https://github.com/pytorch/pytorch/issues/115461 will track the API revision for the cases where additional `model_with_state_dict` are required to produce complete ONNX files exported with fake support. This is also tracked by the umbrella fake tensor issue https://github.com/pytorch/pytorch/issues/105464 FYI @BowenBao
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114407
Approved by: https://github.com/BowenBao
Fixes https://github.com/pytorch/pytorch/issues/113717.
When `preserve_rng_state=True`, we let AOTAutograd trace through `torch.random.fork_rng` op, and the tracing doesn't work under CUDA, hence the original error reported in the issue.
But since we are already doing RNG functionalization at Inductor level, we don't actually need to trace this `fork_rng` op. So we should just rewrite `preserve_rng_state` to False when we are using torch.compile (and let Inductor do its RNG functionalization which it's already been doing).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113718
Approved by: https://github.com/wanchaol
## Summary
Since #97351, CPU ATen kernel for `mean` for BF16 & FP16 dtypes has been unvectorized (it's not even implicitly vectorized).
This PR vectorizes `mean` for BF16 & FP16 on CPU in a `cast_fp32 -> sum -> div -> cast_bf16_or_fp16` fashion.
The perf benefit would be especially pronounced on machines with `AVX512_BF16` and/or `AVX512_FP16` ISA support.
## Benchmarking data for BF16 (collected before & after the change in this PR)
**Machine:** Intel® Xeon® (4th generation series, formerly codenamed Sapphire Rapids) Platinum 8468H
One socket (48 physical cores) - used `numactl --membind=0 --cpunodebind=0`
libtcmalloc & Intel OpenMP were preloaded
Environment variable used -
`KMP_AFFINITY=granularity=fine,compact,1,0 KMP_BLOCKTIME=1 KMP_SETTINGS=1 OMP_NUM_THREADS=48 MKL_NUM_THREADS=48`
**Workload:** E2E performance on BS 32 resnet50 (using BF16 via AMP) inference using oneDNN Graph JIT fuser (`mean` kernel is dispatched to eager mode ATen kernel, and is the bottleneck right now)
| **BEFORE:** Latency with unvectorized mean (lower is better)| **AFTER:** Latency with vectorized mean (lower is better)| Speedup due to vectorizing mean|
|----------------------------|-------------------------|------------|
| 19.1 ms | 10.8 ms | latency reduced by ~43.45% |
**Benchmarking script for BF16 -**
```
import time
import torch
import torchvision
# enable oneDNN Graph JIT fuser
torch.jit.enable_onednn_fusion(True)
# AMP for JIT mode is enabled by default, and is divergent with its eager mode counterpart
torch._C._jit_set_autocast_mode(False)
# sample input should be of the same shape as expected inputs
example_input = torch.rand(32, 3, 224, 224)
# Using resnet50 from torchvision in this example for illustrative purposes,
# but the line below can indeed be modified to use custom models as well.
model = getattr(torchvision.models, "resnet50")().eval()
with torch.no_grad(), torch.cpu.amp.autocast(cache_enabled=False, dtype=torch.bfloat16):
# Conv-BatchNorm folding for CNN-based Vision Models should be done with ``torch.fx.experimental.optimization.fuse`` when AMP is used
import torch.fx.experimental.optimization as optimization
# Please note that optimization.fuse need not be called when AMP is not used
model = optimization.fuse(model)
model = torch.jit.trace(model, (example_input))
model = torch.jit.freeze(model)
# a couple of warm-up runs
model(example_input)
model(example_input)
# speedup would be observed in subsequent runs
start = time.time()
model(example_input)
end = time.time()
inference_time = (end - start) * 1000
print("Inference time is ", inference_time)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114582
Approved by: https://github.com/jgong5, https://github.com/malfet
Summary: This PR does 2 things:
1) Previously this would simply error, now it will ignore any
torch.inf values that it recieves. note: The code checks for torch.inf after
aminmax that way if there are no torch.inf values found, the perf is a
relatively unchanged
2) as mentioned in https://github.com/pytorch/pytorch/issues/100051,
values close to (but not quite at) the maximum/minimum float value could
overflow to infinity in the course of _adjust_min_max() (when this large
value would be multiplied by something in the middle of a calculation
that would otherwise result in a non inf value). This was fixed by
rearranging the order of operations for the lines in question without
altering the actual equations. Specifically, where operations in lines
1095, 1098 and 1100 have multiplication and division of large values,
its better to divide the two large values before multiplying, rather
than multiplying the two large values together (creating overflow) before dividing like it had been.
Test Plan: python test/test_quantization.py
TestObserver.test_histogram_observer_ignore_infinity
python test/test_quantization.py TestObserver.test_histogram_observer_handle_close_to_infinity
Reviewers:
Subscribers:
Tasks:
Tags:
Differential Revision: [D51489345](https://our.internmc.facebook.com/intern/diff/D51489345)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103467
Approved by: https://github.com/andrewor14
RE #115301
Decoupling gives us a path to disable timing without disabling the
flight recorder.
Flight recorder is still useful for stuckness analysis without 'timing'.
Disabling timing makes it miss the 'started'
state that comes from using an extra nccl event at the start of each
collective. It will also be missing 'duration_ms' of collectives, which
hasn't been landed yet, but is useful for timing/perf work more than
stuckness analysis.
Hopefully we can enable timing by default and leave both on, but it's
nice to have the flexiblity for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115358
Approved by: https://github.com/fduwjj
This PR is proposing a new approach to solve the nn/optim only linked by python object identity problem.
The idea is to have a function that can swap the content of two Tensors t1 and t2 while preserving all the old references.
This would allow us to swap the `model.weight` with a new Tensor (can be any subclass of Tensor and any TensorImpl (xla, sparse, nested tensorimpl would work)). The use within nn will be done in a follow up.
This is done by swapping the whole content of the PyObject and then putting back the fields associated with external references (refcount, gc tracking and weakrefs).
Note that we have to properly handle all the cases where there is memory used before the public pointer PyObject* and where the PyObject is bigger due to dict/weakref being inlined (older CPython version) or due to slots.
The main limitation of this approach is that the number of slots need to match for the objects being swapped and thus limit usage of slots in subclasses.
Draft right now to see what @colesbury thinks about doing this?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111747
Approved by: https://github.com/colesbury
Removes always restore, assuming that a HOP will cleanup any leftover state from tracing fwd + bwd
This required a minor change to the autograd fn variable higher order op. If we are tracing forward DON'T add the call_function node into the main graph, since we are only tracing it for the purposes of speculation. Instead return the result directly to be passed to the backward for speculation. This was the only observable side effect on the output graph that I found.
Test plan:
test_smoke_from_test_autograd in test_autograd_function.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115317
Approved by: https://github.com/voznesenskym, https://github.com/jansel
Summary: Add two logic:
1. If the custom op is returning a `Tensor` but also doesn't have an out tensor as input, return an empty tensor.
2. If the custom op is returning more than one Tensor and the number of out tensors is not the same as return Tensor, return a tuple of empty tensors.
Test Plan: Rely on new unit tests
Differential Revision: D51471651
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114143
Approved by: https://github.com/cccclai
Summary:
GraphFunction internally stores the optimized graph after generating it and then it is passed into the executor which makes a copy of it. So we store the optimized graph effectively twice.
This diff allows to set a flag to not store the optimized graph inside the GraphFunction.
The code is NoP right now until the flag is enabled.
Test Plan:
I ran SL with this on raas with good memory saving on raas server. From command line:
exmaple model run
```
buck run mode/opt-clang sigrid/predictor/client/localnet:run_model -- --model_id_to_load=953556500 --model_snapshot_to_load=362
I1207 11:04:58.657143 3556226 SigridPredictorLocalModelFactory.cpp:32] Memory usage for 953556500_362 is 255646 Kb
```
then with flag enabled:
```
buck run mode/opt-clang sigrid/predictor/client/localnet:run_model -- --model_id_to_load=953556500 --model_snapshot_to_load=362 --torch_jit_do_not_store_optimized_graph=true
I1207 11:06:25.245779 3577383 SigridPredictorLocalModelFactory.cpp:32] Memory usage for 953556500_362 is 165167 Kb
```
So collective with this flag and the flag from D51950418
```
buck run mode/opt-clang sigrid/predictor/client/localnet:run_model -- --model_id_to_load=953556500 --model_snapshot_to_load=362 --torch_jit_do_not_store_optimized_graph=true --torch_jit_enable_profiling_graph_executor=false
I1207 11:09:17.502743 3592345 SigridPredictorLocalModelFactory.cpp:32] Memory usage for 953556500_362 is 114848 Kb
```
Differential Revision: D51931895
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115381
Approved by: https://github.com/malfet
Summary:
This is to allow easier extension of quant workflow in the future, as we are seening more
diverse ways of doing quantization
putting up this for feedbacks first
Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_observer_callback
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115001
Approved by: https://github.com/kimishpatel
The test introduced in #102530 has a bug:
Construction of `crow_indices` raises an exception: "value cannot be converted to type int32 without overflow" which is obviously correct.
This makes the test fail which is supposed to check for an overflow in nnz.
Fix by making the construction of `crow_indices` pass although with an invalid value which would error later but triggers the correct check.
Given that I'm not sure it is even worth checking for an overflow in nnz:
- `crow_indices[..., -1] == nnz` is already enforced
- this can only hold if `crow_indices` is able to hold `nnz` without overflow
- `col_indices` has to be of the same type as `crow_indices`
- Hence the type of `col_indices` has to be able to hold the value of `nnz`
So in conclusion: The situation being checked for cannot reasonably occur
CC @pearu as the test author for additional insight
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114940
Approved by: https://github.com/pearu, https://github.com/cpuhrsch
Summary:
Rename _device_mesh.py to device_mesh.py, update all callsites, add documentation.
We created stubs for public class and methods in torch.distributed.device_mesh so that torch.distributed.device_mesh can be imported with or without distributed is available().
Original diff reverted: D51629761
Original PR reverted: https://github.com/pytorch/pytorch/pull/115099
Prior to landing, CI signals are all passed. Shipit added the "ci/trunk" label to the PR and DID NOT wait for it and went ahead committing. More context can be found in the reverted PR above.
Test Plan: CI.
Differential Revision: D51861018
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115193
Approved by: https://github.com/fegin
*
Context:
Joel sees that unless he manually writes to the fake tensor memo, fakification seems to produce spurious symbols! Voz (me) objects, saying that not only is directly writing to memo a bad pattern, recursively invoking fakification on tensor subclass elements in dynamo should suffice! Joel says that while he morally agrees, he has a test proving otherwise, a most perplexing situation.
Digging in, I figured out that while *we were* making fake tensors correctly, with properly cached symbols and the like, we were *also* incorrectly creating spurious symbols, leading the test to fail.
Before this PR, we would only cache source->symint. This was generally fine, but meant that you would create a symbol, then potentially throw it out due to symint cache. For example, the cache hit flow was:
make a symbol (ex: s2) -> use it to make a symint -> hit the cache (my_source-s1)
Now, in this example, you have a symbol in your val_to_var/var_to_val (s2) that is unused. This is sound, but wasteful, and furthermore, misleading.
This was causing a test added in a PR in this stack to fail, specifically, because the test was using
```
curr_var_to_val = {
str(k): v for k, v in context.fake_mode.shape_env.var_to_val.items()
}
````
To validate that no new symbols were being created (that is, that recursively creating fake tensors for subclasses was working).
The test is correct, but the implementation of caching would make (by this method of observation) cache hits look like cache misses.
So, the fix here is to move the cache up to be a general symbol cache, rather than only a cache for symints.
The initial implementation did that! But then, it ran into some interesting errors when it came to replay. When replaying symbol creation, behaviors would diverge in the new shape env! How could that be? The answer is because creating a new shape_env resulted in us replaying symbol creation... but with a cache from a different shape env! This was short circuiting symbol creation - and so, adding an extra layer to the cache for id(shape_env) fixes the problem.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115396
Approved by: https://github.com/mlazos
Summary:
[pytorch] Multiprocessing api to use sigkill if sigterm doesn't kill the process
We have seen a handful of jobs training stuck where one of the trainer goes down
while others are stuck in c++ land and hence not handling the sigterm.
Test Plan: Manually validated by attaching gdb to one of the processes and sent a kill -9 to another. Saw the log ```WARNING] Unable to shutdown process 4422 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL```
Differential Revision: D51862545
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115219
Approved by: https://github.com/wconstab, https://github.com/fduwjj
Adds a useful high level wrapper for calling `dist.save/load` with the correct storage readers and writers.
Instead of doing:
```
DCP.save(
state_dict={...},
storage_writer=StorageWriter(...)
)
DCP.load(
state_dict={...},
storage_reader=StorageReader(...)
)
```
We can now do:
```
checkpointer = Checkpointer(...)
checkpointer.save(state_dict={...})
checkpointer.load(state_dict={...})
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114603
Approved by: https://github.com/fegin, https://github.com/wz337
I approved https://github.com/pytorch/pytorch/pull/110850 which did the following
Previously:
`num_batches_tracked` not in state_dict when doing `m.load_state_dict(state_dict)` --> always overwrite module's `num_batches_tracked` in `load_from_state_dict` with a 0 cpu tensor
Now:
`num_batches_tracked` not in state_dict loaded when doing `m.load_state_dict(state_dict)` --> only overwrite module's `num_batches_tracked` in `load_from_state_dict` with a 0 cpu tensor if module does not have `num_batches_tracked`
This causes the following issue:
```
with torch.device('meta'):
m = BatchNorm(...)
m.load_state_dict(state_dict, assign=True)
```
If `num_batches_tracked` is not in `state_dict`, since `modules's` `num_batches_tracked` is present on meta device, it is not overwritten with a 0 cpu tensor. When compiling, this error is raised
```
AssertionError: Does not support mixing cuda+meta
```
I am not sure whether the explicit check for meta device makes sense as a fix, will add testing if this fix is ok
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115285
Approved by: https://github.com/albanD
After auditing higher_order_ops.py, the graph checkpoints were only getting used in the event of an exception, so it is safe to remove because we restart analysis in this case now.
To make this clearer the current state is the following:
Checkpoint side effects
Capture subgraph
if graph break:
restore as usual
else:
throw away inlining translator and subgraph tracer
Restore side effects
This will change to the following after this change:
Checkpoint side effects
Capture subgraph:
if graph break:
restart analysis
else:
throw away inlining translator and subgraph tracer
Restore side effects
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115321
Approved by: https://github.com/jansel, https://github.com/zou3519
Replaces the "always sleep 30 sec before abort" with "wait up to 30 sec
for the future to complete then abort". The difference in this case is
the abort happens as soon as the dump finishes up to a maximum, instead
of always waiting the maximum.
Allows multiple calls to dump, which will be serialized.
Renames tryWriteDebugInfo to launchAsyncDebugDump in spirit of the
change to support more than one launch and to always launch rather than
only launching on the first call.
Adds a test for dumping on timeout.
This reverts commit ac7d14baad53fa7d63119418f760190f289d8a01.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115332
Approved by: https://github.com/fduwjj
Summary: Add a toggle to inductor config that will force matmul precision dtypes to match between cublas and triton backends for addmm, bmm, and mm operations.
Test Plan: CI + model launches
Reviewed By: jansel
Differential Revision: D51442001
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115326
Approved by: https://github.com/jansel
This PR enables the fx passes and mkldnn optimizations for aarch64 It improved the bert inference performance up to 5.8x on AWS c7g instance when compared torch.compile() vs no compile path. This is enabled when pytorch is built with USE_MKLDNN_ACL option for aarch64.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115037
Approved by: https://github.com/jgong5, https://github.com/malfet
This PR aims for parity+ compared to the old testing for the simplest foreach test case.
Test coverage increase: we now test foreach optimizers with CPU as well as on GPU.
Before:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (19136605)]$ python test/test_optim.py -v -k test_multi_tensor_optimizers
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_multi_tensor_optimizers (optim.test_optim.TestOptim) ... ok
----------------------------------------------------------------------
Ran 1 test in 7.253s
OK
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (19136605)]$
```
Now, we get granular test cases at the cost of overhead!
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (19136605)]$ python test/test_optim.py -v -k test_foreach
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_foreach_ASGD_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_Adadelta_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_Adagrad_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_AdamW_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_Adam_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_Adamax_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_NAdam_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_RAdam_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_RMSprop_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_Rprop_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_SGD_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_ASGD_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_Adadelta_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_Adagrad_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_AdamW_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_Adam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_Adamax_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_NAdam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_RAdam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_RMSprop_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_Rprop_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_SGD_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
----------------------------------------------------------------------
Ran 22 tests in 30.954s
OK
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (19136605)]$
```
Why the increase in time?
Two reasons:
1. overhead. Any _CUDA_ *Info test (OpInfo, ModuleInfo, OptimizerInfo) will wrap itself with the `CudaNonDefaultStream` policy, and `CudaNonDefaultStream.__enter__` when called for the first time will go through all visible CUDA devices and synchronize each of them, thus forcing the CUDAContext to be init'd. Doing this for all 8 devices takes ~10-15s. Also, test parametrization costs a little overhead too, but not to the level init'ing CUDA context does.
2. We test more! Now, we have 72 configs (in the foreach optimizer world) whereas we only had 59 before.
Next steps for the future:
- consider adding more Tensor LR configs (like a Tensor LR without capturable in the single tensor case)
- this is likely the next PR or 2: migrate all uses of _test_derived_optimizers in test_optim to TestOptimRenewed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114797
Approved by: https://github.com/albanD
Summary: We construct a unified API that can be easily add pointwise ops to be batched in the post grad
Test Plan:
# unit test
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:group_batch_fusion
```
Buck UI: https://www.internalfb.com/buck2/19b3f641-782f-4f94-a953-3ff9ce2cfa7b
Test UI: https://www.internalfb.com/intern/testinfra/testrun/1125900251953016
Network: Up: 67KiB Down: 32KiB (reSessionID-c2a80f26-8227-4f78-89fc-bcbda0ae8353)
Jobs completed: 18. Time elapsed: 1:19.8s.
Cache hits: 0%. Commands: 2 (cached: 0, remote: 0, local: 2)
Tests finished: Pass 6. Fail 0. Fatal 0. Skip 0. Build failure 0
# local reproduce
### cmf
P881792289
### igctr
### dsnn
### icvr
Reviewed By: xuzhao9
Differential Revision: D51332067
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114778
Approved by: https://github.com/xuzhao9
Default should be False because in general, we're interested
in reliability and composability: we want to check that
running PyTorch with and without Dynamo has the same semantics (with
graph breaks allowed).
Test Plan:
Existing tests?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115276
Approved by: https://github.com/voznesenskym
ghstack dependencies: #115267
Due to not all tests in the Dynamo shard actually running in CI, we've
started to bitrot on this implementation. Since our plan is to trace
into the functorch implementations instead of construct a HOP
(which is what capture_func_transforms=True does), let's turn off this
config by default.
Test Plan:
- Tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115267
Approved by: https://github.com/voznesenskym, https://github.com/guilhermeleobas
Summary:
Documenting the `Work` object
For a collective (broadcast, all_reduce, etc.) when async_op=True we return a `Work` object to which users can call `.wait()`, `.is_success()`, among other things but this class is not documented
Test Plan: Preview the docs build in OSS
Differential Revision: D51854974
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115172
Approved by: https://github.com/wconstab
As titled, this PR removes the unnessecary getitem call from the graph that's manipulated in MapHigherOrder, where we want to get the first dim slice of original tensor for specualtion but using call_method will accidentally create a get_item call in the graph, so want to avoid it by calling unpack_var_sequence on input tensor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115207
Approved by: https://github.com/yanboliang
ghstack dependencies: #115115, #115204, #115205
We want to remove the map_wrapper and replace it with dynamo always on. This is the first step of this plan.
In this PR, we make dynamo directly generates a map_impl nodes. This hasn't touch the eager logic yet. So the execution path after this PR looks like 1. `dynamo -> map_impl` when torch.compile is on. (Before this PR, it's `dynamo -> map_wrapper -> map_impl` and 2. `map_wrapper -> map_impl` (This PR did't touch the logic here).
The added TODO(yidi) is addressed in the following pr.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115205
Approved by: https://github.com/yanboliang
ghstack dependencies: #115115, #115204
# Summary
This PR updates the FlashAttention code from:
02ac572f3f.
Or Tag 2.3.2
To 92dd5703ec
Or tag 3.2.6.
As well I think that this should be cherry picked into 2.2.0 release since there was a temporary ~15% perf regression for causal masking. It is not technically a regression since Flash wasn't released yet but it would be nice to have in the release.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115313
Approved by: https://github.com/Skylion007
Previously we could only use `ncclCommSplit` when we knew all backends were connected on all shards (due to the need to perform a NOCOLOR split), which in practice meant we could only use it for subgroups that were copies of the entire world.
This change allows for specifying a bound device id to `init_process_group` which tells the pg and its backends that the specified device, and the specified device only, will be associated with this rank.
This guarantee lets us do an early connect (which we could not previously do due to how ProcessGroupNCCL infers devices based on tensors and not the rank number). And by doing the early connect, we have the guarantee ranks are connected and can perform nocolor splits when needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114916
Approved by: https://github.com/kwen2501
**Summary**:
#114174 did not test the case where `elementwise_affine=False` (i.e. `weight` and `bias` are `None`) and this test would fail due to cached sharding propagation. The difference on sharding prop between these cases is, when `weight` and `bias` are None, the forward layer norm op will be recognized as a "static shape op" and `propagate_op_sharding` will be applied rather than `propagate_op_sharding_non_cached`. A fix is to force re-compute sharding when `normalized_shape` changes by setting op schema's `RuntimeSchemaInfo`'s `static_argnum` to include `normalized_shape` (i.e. 1)
**Test**:
pytest test/distributed/_tensor/test_math_ops.py -s -k layer_norm
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115250
Approved by: https://github.com/wanchaol
Current non-strict test cases (added in #114697) are already supported by strict mode, so it can't demonstrate the incremental value of non-strict mode. How about adding test cases that fail in strict mode but pass in non-strict mode?
Test Plan:
python test/export/test_export.py -k test_external_call_non_strict_real_tensor
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115245
Approved by: https://github.com/tugsbayasgalan, https://github.com/zhxchen17
Summary: move matmul precision out of the system info (system hash) and into the cache in preparation for switching precisions during compile
Test Plan: CI
Reviewed By: jansel
Differential Revision: D51442000
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115242
Approved by: https://github.com/jansel
On some systems it is possible to receive a signal that does not have a name. Rare, but possible. This prevents our error handler from crashing and instead properly reports the signal.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114696
Approved by: https://github.com/xmfan
In certain edge cases when using lazy tensors, the base tensor stored in the `FunctionalStorageImpl` and the `value_` tensor stored in the `FunctionalTensorWrapper` diverge. For instance, take this simple example
```python
class Model(torch.nn.Module):
def __init__(self):
super().__init__()
self.fc1 = torch.nn.Linear(4, 2, bias=False)
def forward(self, x):
return x @ self.fc1.weight.transpose(0, 1)
with torch.device("lazy"):
model = Model()
x = torch.ones(4)
out = model(x)
```
The call to `transpose` on the lazily initialized weight `fc1.weight` applies a view op on the functional tensor which only gets propagated to the functional tensor wrapper and not the base tensor in the storage. Thus, causing them to diverge.
To fix this behaviour, we need to reset the functional tensor's storage. To facilitate this, we add a `reset_storage` method to `FunctionalTensorWrapper` which clears away the old storage and view metas.
CC: @behzad-a @GlebKazantaev @wconstab @bdhirsh
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115235
Approved by: https://github.com/bdhirsh
This PR covers `ExportedProgram` to `test_fx_op_consistency.py`, which helps us identify the necessary but missing io_steps.
Next, we should refactor the tests to actually cover all ops supported by registry.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114886
Approved by: https://github.com/thiagocrepaldi
Replaces the "always sleep 30 sec before abort" with "wait up to 30 sec
for the future to complete then abort". The difference in this case is
the abort happens as soon as the dump finishes up to a maximum, instead
of always waiting the maximum.
Allows multiple calls to dump, which will be serialized.
Renames `tryWriteDebugInfo` to `launchAsyncDebugDump` in spirit of the
change to support more than one launch and to always launch rather than
only launching on the first call.
Adds a test for dumping on timeout.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115176
Approved by: https://github.com/zdevito
Slight refactor to:
* lazily compute min / max seq_len used for flash. this avoids unnecessary graph breaks / specialization when we're not accessing these
* store min / max seq_len in a general `metadata_cache`. condensing these should make it easier to avoid specializing on these and others we may add in the future
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115212
Approved by: https://github.com/soulitzer, https://github.com/ani300
ghstack dependencies: #114311
Apply a few optimizations to funcol:
- allgather on non-0 dim, the resulting tensor already needs to access
data in order to do torch.cat, so we sync wait here so that we don;t
need to go through ACT dispatch for chunk + cat alltogether
- have a fast return logic to aten.view as it's a commonly hit op for
view related ops
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113324
Approved by: https://github.com/XilunWu
Summary: Right now when load_model fails (either because of loading error or validation eager run failure), the result won't be logged in generated csv files. Let's log them in csv so that they are monitored by the expected results checking.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114784
Approved by: https://github.com/malfet
# Summary
This PR introduces a new Tensor subclass that is designed to be used with torch.nn.functional.scaled_dot_product_attention. Currently we have a boolean `is_causal` flag that allows users to do do causal masking without the need to actually create the "realized" attention bias and pass into sdpa. We originally added this flag since there is native support in both fused kernels we support. This provides a big performance gain ( the kernels only need to iterate over ~0.5x the sequence, and for very large sequence lengths this can provide vary large memory improvements.
The flag was introduced when the early on in the kernel development and at the time it was implicitly meant to "upper_left" causal attention. This distinction only matters when the attention_bias is not square. For a more detailed break down see: https://github.com/pytorch/pytorch/issues/108108. The kernels default behavior has since changed, largely due to the rise of autogressive text generation. And unfortunately this would lead to a BC break. In the long term it may actually be beneficial to change the default meaning of `is_causal` to represent lower_right causal masking.
The larger theme though is laid here: https://github.com/pytorch/pytorch/issues/110681. The thesis being that there is alot of innovation in SDPA revolving around the attention_bias being used. This is the first in hopefully a few more attention_biases that we would like to add. The next interesting one would be `sliding_window` which is used by the popular mistral model family.
Results from benchmarking, I improved the meff_attention perf hence the slightly decreased max perf.
```Shell
+---------+--------------------+------------+-----------+-----------+-----------+-----------+----------------+----------+
| Type | Speedup | batch_size | num_heads | q_seq_len | k_seq_len | embed_dim | dtype | head_dim |
+---------+--------------------+------------+-----------+-----------+-----------+-----------+----------------+----------+
| Average | 1.2388050062214226 | | | | | | | |
| Max | 1.831672915579016 | 128 | 32 | 1024 | 2048 | 2048 | torch.bfloat16 | 64 |
| Min | 0.9430534166730135 | 1 | 16 | 256 | 416 | 2048 | torch.bfloat16 | 128 |
+---------+--------------------+------------+-----------+-----------+-----------+-----------+----------------+----------+
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114823
Approved by: https://github.com/cpuhrsch
Summary:
This work is for PT2 inference. Since the IR from Export will change to pre-grad aten IR in a few months. We need to start this work from now on. Here is what I do in this diff:
1) Copy the fuse parallel linear pass to fb folder and adapt it to aten IR. We still want to keep the original `group_batch_fusion.py` because it is still used in training. In future at certain time point when PT2 training decided to retire the torch IR based group_batch_fusion, we can remove it. But right now, it's better to have torch IR and aten IR version seperately.
Our plan is to gradually transform the existing and important pre-grad passes to aten IR based passes.
Differential Revision: D51017854
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114776
Approved by: https://github.com/zhxchen17
Previously we only supported Tensor, Constants, and SymNode. We lift
that restriction (there's not really a good reason for it). HOPs like
torch.cond, torch.map already do input validation (those are the ones
that can only support Tensor, Constant, and SymNode inputs).
Test Plan:
New test for `wrap`, which is a HOP that has
manually_set_subgraph_inputs=False
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115186
Approved by: https://github.com/ydwu4, https://github.com/yanboliang
ghstack dependencies: #115185
**Summary**
Enable the qlinear weight prepack when input dimension size exceeds 2. There are extra reshape node before and after the `addmm` or `mm` node if input dimension size exceeds 2.
**Test Plan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k input_dim_exceeds_2
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113928
Approved by: https://github.com/jgong5, https://github.com/eellison
ghstack dependencies: #113733, #113912
**Summary**
When decomposing `Linear` to `addmm` or `mm` within Inductor, if the input dimension size exceeds 2, `reshape` nodes are introduced to convert the input into a 2-dimensional form before and after the `addmm` or `mm` node. It is essential to identify and match this pattern during quantization for dequantization promotion. For instance,
```
# quant
# + - - - | - - - +
# | dequant |
# | | |
# | reshape |
# | / \ |
# | node1 node2 |
# + - | - - - | - +
# reshape reshape
# + - | - - - | - +
# quant quant
```
In this PR, we mainly do 2 things:
- Extend support for the dequantization pattern in QLinear when the input dimension size exceeds 2.
- Revise the implementation of the dequant promotion pass, as it now needs to accommodate the matching of four different patterns.
**Test Plan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k input_dim_exceeds_2
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113912
Approved by: https://github.com/jgong5, https://github.com/eellison
ghstack dependencies: #113733
**Summary**
In the previous QLinear implementation, it was assumed that inputs have a dimension of 2. In this update, we have modified QLinear to accept inputs with a dimension greater than 2, incorporating input and output reshaping accordingly.
**Test Plan**
```
python -u -m pytest -s -v test_quantized_op.py -k test_qlinear_pt2e
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113733
Approved by: https://github.com/jgong5, https://github.com/eellison
Might be some upstream updates, the previous hack starts to not pick up model names, updating to use the other more appropriate variable.
Also fix a bug with an unused argument that was supposed to be removed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115108
Approved by: https://github.com/thiagocrepaldi
We technically allow backends to aot_autograd to pass a config saying "yes I am ok with seeing input mutations in my graph".
With https://github.com/pytorch/pytorch/pull/112906 though, there can be input mutations that show up in the backward (that we need to handle for correctness), that are a large pain to keep out of the graph. The meta-point is that it's been ~a year since we added the config, and it almost always makes sense for backends to support input mutations for performance reasons (inductor does). So I just allow these input mutations in the graph in this rare backward situation, even if the backend didn't explicitly use the config.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115195
Approved by: https://github.com/drisspg
This adds the `ir.Scan` node (currently only supported on CUDA) which re-uses the existing reduction kernel machinery to support different kinds of non-pointwise ops. Just like reductions it supports prologue and epilogue fusions and has both persistent and non-persistent kernel generation.
Currently this doesn't support the equivalent of `Reduction.create_multilayer` and will instead fall back to eager in those cases. This is because splitting into multiple kernel invocations ends up being far slower than cub's single kernel strategy which matches the performance of a copy kernel.
Fixes https://github.com/pytorch/pytorch/issues/93631
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106581
Approved by: https://github.com/lezcano, https://github.com/atalman
Introduce OptimizerInfos + use them to refactor out the error testing.
Why OptimizerInfos?
- cleaner, easier way to test all configs of optimizers
- would plug in well with devicetype to auto-enable tests for devices like MPS, meta
- would allow for more granular testing. currently, lots of functionality is tested in `_test_basic_cases` and some of that should be broken down more.
What did I do for error testing?
- I moved out some error cases from `_test_basic_cases` into a new test_errors parametrized test.
- The new test has to live in TestOptimRenewed (bikeshedding welcome) because the parametrized tests need to take in device and dtype and hook correctly, and not all tests in TestOptim do that.
- TestOptimRenewed also is migrating to the toplevel test/test_optim.py now because importing TestOptimRenewed does not work (because of test instantiation, TestOptimRenewed gets replaced with TestOptimRenewedDevice for CPU, CUDA, and whatever other device).
Is there any change in test coverage?
- INCREASE: The error case where a single Parameter (vs a container of them) are passed in has now expanded to all optims instead of only LBFGS
- DECREASE: Not much. The only thing is we no longer test two error cases for foreach=True AND foreach=False, which I think is redundant. (Highlighted in comments)
Possible but not urgent next step: test ALL possible error cases by going through all the constructors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114178
Approved by: https://github.com/albanD
We can auto-functionalize operators that mutate their inputs as long as
the outputs of the operator do not alias their inputs. The user needs
to provide an abstract impl for the operator if it has non-trivial
returns.
- We update can_auto_functionalize(op) to include ops that return (but
do not alias) Tensors
- We update auto_functionalized(op, mutated_args_names, kwargs) to
return (out, mutated_args), where `out = op(**kwargs)` and
`mutated_args` are the new values of the inputs that would have been
mutated.
Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115135
Approved by: https://github.com/bdhirsh
ghstack dependencies: #114955, #114956, #115134
In preparation for the next PR up in the stack, which is going to update
"can_auto_functionalize" to support more operators than just ones that
return nothing. We are unable to auto-generate FakeTensor kernels for
operators that do not return nothing, but we are able to generate
functionalization kernels for operators that return something.
Test Plan:
Existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115134
Approved by: https://github.com/bdhirsh
ghstack dependencies: #114955, #114956
Summary:
This adds function to model container doing weight swapping with double buffering.
There are 2 parts for double buffering
a) Write constants into inactive buffer
b) Swap active buffer
For (a), we write the constants into the buffer that's currently not in use, and store the information in both constants map and the corresponding constant array to read.
For (b), we obtain the lock, and activate the constant map/constant array that is inactive, and flag the one that's currently in use to inactive.
Test Plan:
test/cpp/aot_inductor/test.cpp
Reviewers:
Subscribers:
Tasks:
Tags:
Differential Revision: [D51543732](https://our.internmc.facebook.com/intern/diff/D51543732)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114446
Approved by: https://github.com/chenyang78, https://github.com/eellison
The `bsr_dense_addmm` triton kernel introduced in https://github.com/pytorch/pytorch/pull/114595 is a generalization of `bsr_dense_mm` triton kernel and a more efficient version of it because it uses an extra kernel parameter `SPLIT_N` that has notable effect to performance for r.h.s operand with a larger number of columns.
This PR eliminates the `bsr_dense_mm` triton kernel in favor of using `bsr_dense_addmm` triton kernel.
The performance increase of `bsr_dense_mm` is as follows (float16, `NVIDIA A100-SXM4-80GB`):
- with 16x16 blocks, the average/maximal speed up is 50/71 %
- with 32x32 blocks, the average/maximal speed up is 30/63 %
- with 64x64 blocks, the average/maximal speed up is 12/26 %
- with 128x128 blocks, the average/maximal speed up is 7/17 %
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115030
Approved by: https://github.com/cpuhrsch
Summary:
This diff fix the param unflattening when using FSDP together with TP. Currently we hardcode the `reshape_size` to be multiplied by 2, which instead should be the size of the process group.
Before the fix, example exception: `shape '[257, 514]' is invalid for input of size 264196`, where the process group size is 4 instead of 2.
Test Plan:
**CI**:
CI test
**Unit test**:
`buck2 test mode/dev-nosan //caffe2/test/distributed/tensor/parallel:fsdp_2d_parallel`
- Passed
**Test model with WHEN**:
- Verified that checkpoint can be saved and resumed successfully;
- Verified the accuracy with window_ne, which is on-par with baseline.
https://pxl.cl/3Wp8w
Differential Revision: D51826120
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115105
Approved by: https://github.com/fegin
Continuation of #112185, following the design in this [doc](https://docs.google.com/document/d/1ipSxcTzEMMOAPvxP-YJlD5JBZZmIGgh8Q34ixtOUCRo).
Summary:
* Introduce `SubclassSymbolicPolicy` containing separate dynamic dim / constraint policies for the outer and inner tensors
* Expand the automatic dynamic algorithm to recurse into inner tensors and produce one of these for a subclass instance
* Maintain legacy behavior for subclasses by recursively calling `mark_dynamic()` on inner tensors *of the same dim as outer* when `mark_dynamic(outer, ...)` is called
* Addresses this: 6a86cf00ad/torch/_dynamo/variables/builder.py (L1750)
* Add `outer_size` and `outer_stride` arguments to `__tensor_unflatten__()` so that you can find out what symbols were allocated for the outer size / stride (you are expected to return a tensor that compares equal to the outer symbols)
* Signatures now:
```python
# attrs is a list of inner tensor attributes on x; inner_tensor = getattr(x, attr)
# ctx is anything useful for rebuilding the class we want to guard on
attrs, ctx = x.__tensor_flatten__()
...
# inner_tensors is a dict of {attr -> tensor}
# ctx is taken unmodified from flattening and (eventually) guarded on
# outer_size is the expected size of the output; possibly symbolic
# outer_stride is the expected strides of the output; possibly symbolic
y = MySubclass.__tensor_unflatten__(inner_tensors, ctx, outer_size, outer_stride)
# at the __tensor_unflatten__() call-site in PT2, we assert y.shape == outer_size and y.stride() == outer_stride
# the assert simplifies symbols when there are relationships between outer and inner symbols
```
* Size info needed for `NestedTensor` at least, stride info needed for `DTensor` at least
* Punting on `outer_storage_offset` because storage_offset handling is horribly broken in PT2 right now
* ~~Add new `__tensor_mark_dynamic__()` to allow overriding the behavior of mark_dynamic on a per-subclass basis~~ (booted to future work)
* ~~Add guards for tensor subclasses by calling `__tensor_flatten__()` in the guard to test equality on `ctx`~~
* Now handled in #114469
* Next PR: add TENSOR_MATCH guards on inner tensors
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114311
Approved by: https://github.com/ezyang, https://github.com/drisspg, https://github.com/voznesenskym, https://github.com/bdhirsh
If TORCH_NCCL_DUMP_ON_TIMEOUT is set, then along with producing a dump
file when a timeout happens, you can trigger a dump by writing to local pipe
`<TORCH_NCCL_DEBUG_INFO_TEMP_FILE>_<rank>.pipe` (by default
/tmp/nccl_trace_{rank}_<rank>.pipe).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115139
Approved by: https://github.com/wconstab
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.