Fixes #114844
In the linked issue we have
```
compiled_module = torch.compile(module)
compiled_module.x = ...
compiled_module(...) # Mutates self.x
```
Since the module mutates `self.x`, you would expect `compiled_module.x`
to be updated, but `compiled_module.x = ...` actually sets an attribute "x"
on the `OptimizedModule` object, while the forward method of the module mutates
`module.x`.
This PR gives the expected behavior by forwarding `compiled_module.__setattr__`
down to `module.__setattr__`. There is already a corresponding `__getattr__`,
so `compiled_module.x` now becomes an alias for `module.x`.
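As a rough illustration (not the actual `OptimizedModule` code; `WrapperSketch` and `Counter` below are invented for the example), forwarding writes in `__setattr__` is what makes the wrapper attribute and the inner module attribute the same thing:
```python
import torch

class WrapperSketch:
    """Minimal stand-in for the wrapper: reads and writes both forward to the inner module."""
    def __init__(self, mod):
        object.__setattr__(self, "_orig_mod", mod)

    def __getattr__(self, name):            # already existed: forwards reads
        return getattr(self._orig_mod, name)

    def __setattr__(self, name, value):     # the fix: forward writes too
        setattr(self._orig_mod, name, value)

    def __call__(self, *args, **kwargs):
        return self._orig_mod(*args, **kwargs)

class Counter(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.x = 0

    def forward(self):
        self.x += 1  # mutates the inner module's attribute

m = Counter()
w = WrapperSketch(m)
w.x = 10   # lands on m instead of shadowing it on the wrapper
w()        # forward() bumps m.x
assert w.x == m.x == 11
```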
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122098
Approved by: https://github.com/ezyang, https://github.com/lezcano
Summary:
During tracing, some constants (tensor_constant{idx}) are being generated internally.
Those constants are neither parameters nor buffers, and users have zero control over them.
To accommodate this, we should allow users to omit these internally generated constants while still being able to use the constants in the model.
Test Plan:
Included in commit.
```
build/bin/test_aot_inductor
```
Differential Revision: D55286634
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122562
Approved by: https://github.com/chenyang78, https://github.com/khabinov
Before this PR we were not precompiling triton templates in parallel. Compilation would occur during benchmarking.
Triton benchmarking templates were emitted as:
```
@triton.jit
def triton_mm(arg_A, arg_B, out_ptr0):
```
In order to precompile we need to give the full kernel specification, as we do when we emit the template in the final output code generation.
```
@triton_heuristics.template(
num_stages=3,
num_warps=8,
triton_meta={'signature': {0: '*fp32', 1: '*fp32', 2: '*fp32'}, 'device': 0, 'device_type': 'cuda', 'constants': {}, 'configs': [AttrsDescriptor(divisible_by_16=(0, 1, 2), equal_to_1=(), ids_of_folded_args=(), divisible_by_8=())]},
inductor_meta={'kernel_name': 'Placeholder.DESCRIPTIVE_NAME', 'backend_hash': 'cdeecfeccd31ad7810f96b5752194b1c2406d0a81e39a6ca09c8ee150baae183'},
)
@triton.jit
def triton_mm(arg_A, arg_B, out_ptr0):
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121998
Approved by: https://github.com/jansel
Previously, all jvp tests under dynamo/test_dynamic_shapes would fail because symbolic execution wasn't supported in some autograd functions.
List of changes:
- Update `_has_same_storage_numel` to use `sym_nbytes`
- Symintify `_efficientzerotensor_meta`
- Introduce `empty_generic_symint` with the first argument `size` as symbolic integer
- Update gen_variable_type.py script to call the symint version of zeros_fn function (zeros_symint / _efficientzerotensor_symint)
- Update `has_same_meta` to call `sym_*` functions
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120338
Approved by: https://github.com/soulitzer
When working on testing all-reduce with an alternative rccl replacement backend, my test script crashed. After debugging, I found that `ncclGetLastError(NULL)` returns null, and the code that constructs a `std::string` from that return value then seg-faults with the exception `basic_string::_M_construct null not valid`.
This pull request is to fix this edge condition so that it will exit the program gracefully with useful information.
**Test:**
Before the fix, my test script exits like below:
```
File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2051, in all_reduce
work = group.allreduce([tensor], opts)
RuntimeError: basic_string::_M_construct null not valid
```
After this fix, my test script exited with useful message like,
```
[rank0]: File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2219, in all_reduce
[rank0]: work = group.allreduce([tensor], opts)
[rank0]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:272, internal error - please report this issue to the NCCL developers, NCCL version 0.4.2
[rank0]: ncclInternalError: Internal check failed.
[rank0]: Last error: Unknown NCCL Error
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121905
Approved by: https://github.com/wconstab
At a high level, the goal of this refactor was to make it so that `MetaConverter.__call__` has a straightforward code structure in three steps: (1) check if we support doing meta conversion, (2) describe the tensor into MetaTensorDesc, (3) call `meta_tensor` on MetaTensorDesc. However, this is not so easy to do, because there is a big pile of special cases for functional tensor inside `__call__`.
The primarily complication is handling the ambient functionalization state: specifically, the functorch dynamic layer stack and the Python functionalization dispatch. The old code demands that meta tensor conversion happen with this state disabled. But I discovered that when I reconstruct functorch tensors it demands that the functorch layers be active; in fact a batch tensor will have a pointer to the internal functorch layer.
I had some discussion with Richard Zou about what code structure here makes sense. In particular, one of the goals of the refactor here is that I can inflate MetaTensorDesc from an entirely different process, which may not have all of the functorch layers activated at the time we do reconstruction. So it seems to me that we should make it explicit in MetaTensorDesc that there was some functorch layer active at the time the functorch tensor was serialized, so that we could potentially know we need to reconstruct these layers on the other side. This is NOT implemented yet, but there's some notes about how potentially it could proceed. But the important thing here is we SHOULD disable everything when we run `meta_tensor`, and internally be responsible for restoring the stack. Actually, the necessary infra bits in functorch don't exist to do this, so I added some simple implementations in pyfunctorch.py.
The rest is splitting up the manipulations on tensor (we do things like sync the real tensor before describing it; Describer is responsible for this now) and I also tried to simplify the not supported condition, based on my best understanding of what the old thicket of conditions was doing. You may notice that the internal meta_tensor handling of functional tensor is inconsistent with surrounding code: this is because I *exactly* replicated the old reconstruction behavior; a further refactor would be to rationalize this.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122202
Approved by: https://github.com/zou3519
If users want to run some CUDA test cases on other devices by setting an environment variable (for testing performance on custom devices), I think it can be done as in this PR.
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122182
Approved by: https://github.com/ezyang
Summary:
Make sure the order of submodules is the same as the original eager module.
bypass-github-export-checks
Test Plan: buck test mode/opt caffe2/test:test_export -- -r test_unflatten_submodule_ordering
Differential Revision: D55251277
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122507
Approved by: https://github.com/tugsbayasgalan
Fixes https://github.com/pytorch/pytorch/issues/121085
This PR pretty involved so pay attention to this description. At a high
level, the refactor is intended to be mechanical: anywhere in
MetaConverter where previously we took a Tensor as argument, we now take
a MetaTensorDesc, which contains all of the information that we would
have queried off of the Tensor, but placed into a separate data
structure which we can serialize or use to recreate a fake tensor in
a separate fake tensor mode in exact fidelity to the original.
However, this transformation is not always entirely mechanical. Here
is what you need to pay attention to:
- The memo table from real Tensor -> meta/fake Tensor is now broken
into two memo tables: real Tensor -> stable int id -> meta/fake
Tensor. The stable int id is needed so that when we do serialization,
we know when tensors/storages alias each other and can ensure we preserve
this aliasing upon deserialization.
The way I have implemented it changes the weak reference behavior.
Previously, when either the real Tensor OR the meta/fake Tensor went
dead, we would remove the entry from the memo table. Now, this only
removes entries from one of the two memo tables. This semantically
makes sense, because the user may have held on to the stable int id
out of band, and may expect a real Tensor to continue to be numbered
consistently / expect to be able to lookup a meta/fake tensor from
this id. If this is unacceptable, it may be possible to rejigger
the memo tables so that we have real Tensor -> stable int id
and real Tensor -> meta/fake Tensor, but TBH I find the new
implementation a lot simpler, and arranging the memo tables in this
way means that I have to muck around with the real tensor to save
to the memo table; in the current implementation, I never pass the
Tensor to meta_tensor function AT ALL, which means it is impossible
to accidentally depend on it.
- When I fill in the fields of MetaTensorDesc in describe_tensor, I need
to be careful not to poke fields when they are not valid. Previously,
preconditions were implicitly checked via the conditional structure
("is this sparse? is this nested?") that is tested before we start
reading attributes. This structure has to be replicated in
describe_tensor, and I have almost assuredly gotten it wrong on my
first try (I'll be grinding through it on CI; a careful audit will
help too, by auditing that I've tested all the same conditionals that
the original access was guarded by.)
- I originally submitted https://github.com/pytorch/pytorch/pull/121821
for the symbolic shapes change, but it turned out the way I did it
there didn't actually work so well for this PR. I ended up just
inlining the symbolic shapes allocation logic into MetaConverter
(look for calls to maybe_specialize_sym_int_with_hint), maybe there
is a better way to structure it, but what I really want is to
just read sizes/strides/offset directly off of MetaTensorDesc; I
don't want another intermediate data structure.
- Some fields aren't serializable. These are documented as "NOT
serializable". ctx/type should morally be serializable and I just
need to setup a contract with subclasses to let them be serialized.
The fake_mode is used solely to test if we are refakefying with
a pre-existing ShapeEnv and we want to reuse the SymInt
directly--serializing this case is hopeless but I am kind of hoping
after this refactor we do not need this at all. view_func is not
serializable because it's a bound C implemented method. Joel has
promised me that this is not too difficult to actually expose as a
true data structure, but this is the edgiest of edge cases and there
is no reason to deal with it right now.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122044
Approved by: https://github.com/eellison
Not sure what the idea was behind the `{self.tiling_factor}*sizeof(float)/sizeof({DTYPE_TO_CPP[dtype]})` size calculation (perhaps a copy-n-paste error during the refactor made by https://github.com/pytorch/pytorch/pull/97626), but `Vectorized::store(ptr, tiling_factor)` needs at least `tiling_factor` elements, not `tiling_factor/2` (which would be the case with the original calculation if the data type is a 64-bit value such as int64).
Discovered while trying to enable aarch64 vectorized inductor.
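A quick worked example of the under-allocation (the tiling factor value below is illustrative):
```python
# Old formula: tiling_factor * sizeof(float) / sizeof(dtype) elements.
tiling_factor = 16
sizeof_float, sizeof_int64 = 4, 8
old_elems = tiling_factor * sizeof_float // sizeof_int64  # 8 elements allocated
needed = tiling_factor                                    # store() writes 16
assert old_elems < needed  # 64-bit dtypes get half the space they need
```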
Minimal reproducer (reproducible on ARMv8 or any x86_64 machine that does not support AVX512):
```python
import torch

def do_ds(x, y):
    return torch.diagonal_scatter(x, y)

x = torch.ones(10, 10, dtype=torch.int64)
y = torch.tensor([1, 2, -8, 8, 5, 5, -7, -8, 7, 0])
dsc = torch.compile(do_ds)
assert torch.allclose(torch.diagonal_scatter(x, y), dsc(x, y))
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122580
Approved by: https://github.com/Skylion007, https://github.com/jansel
The test may fail because it either uses target flags newer than the GPU, resulting in failures loading the compiled binary, or targets a GPU for which CUDA has no support yet or anymore.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122400
Approved by: https://github.com/ezyang
The test uses the CUDA compute capabilities of the current device to
compile an extension. If nvcc is older than the device, it will fail
with a message like "Unsupported gpu architecture 'compute_80'"
resulting in a `RuntimeError: Error building extension 'cudaext_archflags'`
ultimately failing the test.
This PR checks for this case and allows execution to continue.
Fixes https://github.com/pytorch/pytorch/issues/51950
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122402
Approved by: https://github.com/ezyang
Previously, when we applied a replacement, a SymInt that was
previously an unbacked SymInt would then transmute into whatever
we replaced it into (e.g., a constant).
This has a major downside: we often look at SymInts associated with
FX nodes (e.g., the meta of x.item() return) to find out where the
unbacked SymInt was allocated. If we replace it, we no longer can
find out where, e.g., u1 was allocated! But we need to know this
so we can generate deferred runtime asserts like u1 == s0.
To solve this problem, I have a special mode for replace, resolve_unbacked=False, which lets you disable substitutions on unbacked SymInts. When reporting node.expr, we preferentially avoid applying unbacked SymInt substitutions. To understand if we might accidentally reapply the substitution later, before we have reached the deferred runtime assert, we must study the calls to simplify() in ShapeEnv. My audit turns up these sites:
* `produce_guards`: this is fine, deferred runtime asserts never show up here, we must NOT have unbacked SymInts show up here. Similarly `get_nontrivial_guards`.
* `_maybe_evaluate_static`: this is fine, we are using this to determine if it is necessary to produce a guard/runtime assert. We don't want to reissue a runtime assert if we've already asserted on it, and replacements can help us understand if this has occurred.
* `_simplify_floor_div`: this is a legitimate bug, it needs to be `resolve_unbacked=False`
* `_refine_ranges`: this is fine, a refined range doesn't affect what runtime asserts we issue
* `_update_divisible`: this updates the `self.divisible` set, which specifies when we can simplify away divisibility constraints. Since this affects replacements only, it won't cause us to oversimplify a user provided expression.
There are some situations where we DO want to always apply the substitution, specifically when we have the duplicate symbol problem (we retrace an item call and get u0 and u1 which refer to the same thing.) I don't want two symbols in this case, so a special `rename_unbacked_to` is provided which sets up the unconditional renaming.
Along the way, I make a refinement to `_update_var_to_range`: if you update a var range for a size-like unbacked SymInt, you are now no longer allowed to set its lower bound below 2. This is because if you could, then our size oblivious tests for it would be inconsistent. Actually, I think there is still some inconsistency, because if you assert `u0 == 0` we will still end up with this in deferred runtime asserts, and we will then use this to simplify these statements to be True everywhere else. Maybe we should forbid this kind of refinement; not done in this PR.
Fixes https://github.com/pytorch/pytorch/issues/119689
Fixes https://github.com/pytorch/pytorch/issues/118385
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120816
Approved by: https://github.com/lezcano
This PR proposes to use `std::optional<Generator>&` for the underlying functions to avoid unnecessary copy and move operations. The torchgen code was changed to generate the new type.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120076
Approved by: https://github.com/malfet
Summary: Currently there is a test for adding a backend in test/inductor/test_extension_backend.py for a cpp backend with a new device. However, there is no such test for the Triton backend; it should be possible for a user to create and register their own ExtensionWrapperCodegen and ExtensionScheduling for another non-CUDA device and be able to generate Triton code. For simplicity I have chosen to use a CPU device, as I think it's plausible someone might want to create a CPU Triton backend.
Unfortunately the generation and running of the code is quite tightly coupled so I've had to use a mocked function to extract the code before running. Suggestions are welcome for better ways to do this.
This is a stepping off point for some additional PRs to make the Triton code path less CUDA specific, as currently there would be no way to test this avenue.
Test plan:
```
frames [('total', 1), ('ok', 1)]
stats [('calls_captured', 3), ('unique_graphs', 1)]
inductor [('intermediate_hooks', 1)]
aot_autograd [('total', 1), ('ok', 1)]
.
----------------------------------------------------------------------
Ran 1 test in 0.394s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122396
Approved by: https://github.com/jansel
Summary:
Add a shape inference tool that helps to infer each node shape of a given graph module.
1. Given an fx graph and an example input (it doesn't need to be an accurate input that can be run through forward, but it should have valid dims and data structures), `infer shape` creates an input of symbolic shape.
2. Shape-propagating this symbolic input can catch runtime or value exceptions.
3. These errors are constraints for symbol values, and the constraint solver `infer symbolic values` helps us figure out specific values for each symbol.
4. Finally, we run the shape propagation based on input tensor to get tensor shapes for all nodes in the FX traced module.
Test Plan:
### 1. Test `infer symbol values`
Command:
```
buck2 test mode/opt //caffe2/test:fx_experimental -- test_infer_symbol_values
```
### 2. Test `infer shape`
Command:
```
buck2 test mode/opt //caffe2/test:fx_experimental -- test_infer_symbol_values
```
Inferred shape result like: P897560514
Differential Revision: D53593702
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120097
Approved by: https://github.com/yf225
At least the following tests fail when there is no supported vector ISA:
test_lowp_fp_neg_abs
test_non_contiguous_index_with_constant_stride
test_scalar_mul_bfloat16
test_transpose_non_contiguous
test_transpose_sum2d_cpu_only
test_transpose_sum_outer
test_transpose_vertical_sum_cpu_only
test_vertical_sum_cpu_only
Those tests assert `metrics.generated_cpp_vec_kernel_count` is nonzero
which is never the case without a supported vector ISA, e.g. on PPC and
maybe on AArch.
Skip those tests with a new decorator, and use the simpler one where an equivalent is already used.
Some usages of `metrics.generated_cpp_vec_kernel_count` were guarded by a check instead of skipping the test. I tried to apply that instead of a skip where the test looked similar enough to one where that was previously done.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117262
Approved by: https://github.com/jgong5, https://github.com/jansel
This PR is enough to fix https://github.com/pytorch/pytorch/issues/118600.
More description of the problem is in the issue, but the high-level problem is similar to the "tangents might be non-contiguous" problem that we handle today, via forcing all tangents to be contiguous. There, the problem was something like:
"We guessed the tangent strides incorrectly, because strides on the runtime tangents were different from strides on the forward outputs, which we used to generate tangents"
Here, the problem is similar:
"We guessed the tangent tensor subclass's metadata incorrectly, because the runtime tangent was a subclass with different metadata than the forward output subclass".
This happened in an internal DTensor issue, where the metadata in question was the `placements` (shard vs. replicate vs. Partial).
One option is to solve this problem via backward guards. This is needed to unblock internal though, so I figured handling this similarly to how we handle non-contiguous tangents would be reasonable. I did this by:
(1) Assert that the metadata on subclass tangents is the same as what we guessed, and if not raise a loud error
(2) In the error message, provide the name of an optional method that the subclass must implement to handle this case:
`def __force_same_metadata__(self, metadata_tensor):`: If the forward output had a `Replicate()` placement, but the runtime tangent had a `Shard(1)` placement, this method allows a subclass to take the tangent and "convert" it to one with a `Replicate()` placement.
`__force_standard_metadata__(self)`: One issue is that there is another placement called `_Partial`, and its semantics are such that DTensor is **unable** to convert a DTensor with some placement type into another DTensor with a `_Partial` placement.
`__force_standard_metadata__` is now called on all (fake) subclass forward outs at trace-time to generate tangents, and gives subclasses a chance to "fix" any outputs with metadata that they cannot convert to later. Morally, this is similar to the fact that we force a `contiguous()` call on all tangents at trace-time.
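To make the proposed contract concrete, here is a toy (non-DTensor) sketch; the two dunder names follow the description above, while the wrapper class and its `tag` field are invented for illustration:
```python
import torch

class ToyShardedTensor(torch.Tensor):
    """Illustrative wrapper subclass; `tag` stands in for DTensor placements."""
    @staticmethod
    def __new__(cls, data, tag="replicate"):
        t = torch.Tensor._make_wrapper_subclass(cls, data.shape, dtype=data.dtype)
        t._data = data
        t._tag = tag
        return t

    def __force_same_metadata__(self, metadata_tensor):
        # "Convert" self so its metadata matches the trace-time guess
        # (e.g. a Shard(1) runtime tangent vs. a Replicate() forward output).
        return ToyShardedTensor(self._data, tag=metadata_tensor._tag)

    def __force_standard_metadata__(self):
        # Called on (fake) forward outputs at trace time: normalize away
        # metadata (like a Partial-style tag) that cannot be converted to later.
        return ToyShardedTensor(self._data, tag="replicate")
```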
I'm interested in thoughts/feedback! Two new dunder methods on traceable subclasses is definitely a contentious change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118670
Approved by: https://github.com/ezyang
We can use Sapling (hg) with the pytorch repo, but there are a couple of minor fixes needed to make our scripting happy with having either a git or hg repo.
This change fixes some issues in:
- setup.py
- lintrunner
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122072
Approved by: https://github.com/ezyang
List of changes:
- Replace JVP_NESTING by torch._C._functorch.maybe_current_level()
- Remove all increment nesting functions from wrap_fx_proxy_cls
- fwAD.make_dual receives the dual_level as keyword argument
- Add jvp_increment_nesting, set_fwd_grad_enabled and dual_level context managers to dynamo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119926
Approved by: https://github.com/zou3519
**Motivation**: https://github.com/pytorch/pytorch/issues/112771
**Summary**: Inductor generates Triton code that assumes inputs are 16-byte aligned. If the inputs aren't aligned, Inductor clones them. This PR introduces a config option to not do this: when assume_aligned_inputs=False, Inductor will _not_ pass inputs as being divisible_by_16, and Inductor will not make clones. This can generate code that might be a bit slower, but the tradeoff can be worth it in scenarios where you would otherwise make a lot of clones.
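As a small usage sketch (the option name comes from this description; the toy function below is illustrative):
```python
import torch
import torch._inductor.config as inductor_config

# Trade a bit of codegen specialization for never cloning unaligned inputs.
inductor_config.assume_aligned_inputs = False

fn = torch.compile(lambda x: x * 2 + 1)
base = torch.randn(1025)
out = fn(base[1:])  # offset view that may not be 16-byte aligned; no clone forced
```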
Ideally, we could do this on a per-tensor basis. But this would be a lot of work, and attempts to add guards on storage offsets to do this automatically have run into issues: recompilations and excessive time to generate/check guards.
**Tests** https://github.com/pytorch/pytorch/pull/122159 flips this to False. It didn't run through all errors, but the ones we see are all expected failures: divisible_by_16 changes; triton kernel caching fails if we call the same triton kernel multiple times (this makes sense because the first call will have unaligned inputs, but subsequent calls have aligned inputs); and some xfailed tests start passing.
**Alternatives/RFC**:
* Is this the right thing to do with cudagraphs?
* Elias and Jason mentioned that we probably still want to make clones if we're dealing with unaligned inputs to matmuls. Is this something we should add in this config option? (In the use case I'm targeting, it seems like we don't need this optimization right now)
Differential Revision: [D55079094](https://our.internmc.facebook.com/intern/diff/D55079094)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122158
Approved by: https://github.com/ezyang
`unimplemented` is a function that raises an error, so
`raise unimplemented(...)` never reaches the `raise`.
Another related issue is that `raise unimplemented(...) from e`
doesn't attach the exception cause correctly. I fix this by adding
a `from_exc` argument to `unimplemented`.
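A minimal sketch of the pattern being fixed (simplified names; the real helper lives in torch._dynamo.exc):
```python
class Unsupported(RuntimeError):
    pass

def unimplemented(msg, *, from_exc=None):
    # Raises unconditionally, so callers should not write `raise unimplemented(...)`.
    if from_exc is not None:
        raise Unsupported(msg) from from_exc  # preserves the original cause
    raise Unsupported(msg)

try:
    {}["missing"]
except KeyError as e:
    try:
        unimplemented("unsupported key access", from_exc=e)
    except Unsupported as err:
        assert err.__cause__ is e
```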
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122136
Approved by: https://github.com/lezcano
This issue popped up when enabling predispatch IR on the benchmarks (https://github.com/pytorch/pytorch/pull/122225)
On the following model:
```
class M(torch.nn.Module):
    def __init__(self, device):
        super().__init__()
        self.device = device

    def forward(self, x):
        t = torch.tensor(x.size(-1), device=self.device, dtype=torch.float)
        t = torch.sqrt(t * 3)
        return x * t
```
We get the following error:
```
======================================================================
ERROR: test_constant_abi_compatible_cuda (__main__.AOTInductorTestABICompatibleCuda)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/data/users/angelayi/pytorch/torch/testing/_internal/common_utils.py", line 2741, in wrapper
method(*args, **kwargs)
File "/data/users/angelayi/pytorch/test/inductor/test_torchinductor.py", line 9232, in new_test
return value(self)
File "/home/angelayi/.conda/envs/pytorch10/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/data/users/angelayi/pytorch/test/inductor/test_aot_inductor.py", line 922, in test_constant
self.check_model(M(self.device), (torch.randn(5, 5, device=self.device),))
File "/data/users/angelayi/pytorch/test/inductor/test_aot_inductor.py", line 91, in check_model
actual = AOTIRunnerUtil.run(
File "/data/users/angelayi/pytorch/test/inductor/test_aot_inductor_utils.py", line 102, in run
so_path = AOTIRunnerUtil.compile(
File "/data/users/angelayi/pytorch/test/inductor/test_aot_inductor_utils.py", line 40, in compile
so_path = torch._inductor.aot_compile_ep(
File "/data/users/angelayi/pytorch/torch/_inductor/__init__.py", line 150, in aot_compile_ep
return compile_fx_aot(
File "/data/users/angelayi/pytorch/torch/_inductor/compile_fx.py", line 1005, in compile_fx_aot
compiled_lib_path = compile_fx(
File "/home/angelayi/.conda/envs/pytorch10/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/data/users/angelayi/pytorch/torch/_inductor/compile_fx.py", line 1111, in compile_fx
return compile_fx(
File "/home/angelayi/.conda/envs/pytorch10/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/data/users/angelayi/pytorch/torch/_inductor/compile_fx.py", line 1145, in compile_fx
return compile_fx(
File "/home/angelayi/.conda/envs/pytorch10/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/data/users/angelayi/pytorch/torch/_inductor/compile_fx.py", line 1336, in compile_fx
return inference_compiler(unlifted_gm, example_inputs_)
File "/data/users/angelayi/pytorch/torch/_dynamo/utils.py", line 265, in time_wrapper
r = func(*args, **kwargs)
File "/data/users/angelayi/pytorch/torch/_inductor/compile_fx.py", line 1266, in fw_compiler_base
return inner_compile(
File "/home/angelayi/.conda/envs/pytorch10/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/data/users/angelayi/pytorch/torch/_dynamo/repro/after_aot.py", line 83, in debug_wrapper
inner_compiled_fn = compiler_fn(gm, example_inputs)
File "/data/users/angelayi/pytorch/torch/_inductor/debug.py", line 304, in inner
return fn(*args, **kwargs)
File "/home/angelayi/.conda/envs/pytorch10/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/home/angelayi/.conda/envs/pytorch10/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/data/users/angelayi/pytorch/torch/_dynamo/utils.py", line 265, in time_wrapper
r = func(*args, **kwargs)
File "/data/users/angelayi/pytorch/torch/_inductor/compile_fx.py", line 447, in compile_fx_inner
compiled_graph = fx_codegen_and_compile(
File "/data/users/angelayi/pytorch/torch/_inductor/compile_fx.py", line 707, in fx_codegen_and_compile
graph.run(*example_inputs)
File "/data/users/angelayi/pytorch/torch/_dynamo/utils.py", line 265, in time_wrapper
r = func(*args, **kwargs)
File "/data/users/angelayi/pytorch/torch/_inductor/graph.py", line 612, in run
return super().run(*args)
File "/data/users/angelayi/pytorch/torch/fx/interpreter.py", line 145, in run
self.env[node] = self.run_node(node)
File "/data/users/angelayi/pytorch/torch/_inductor/graph.py", line 957, in run_node
result = super().run_node(n)
File "/data/users/angelayi/pytorch/torch/fx/interpreter.py", line 202, in run_node
return getattr(self, n.op)(n.target, args, kwargs)
File "/data/users/angelayi/pytorch/torch/_inductor/graph.py", line 819, in call_function
raise LoweringException(e, target, args, kwargs).with_traceback(
File "/data/users/angelayi/pytorch/torch/_inductor/graph.py", line 816, in call_function
out = lowerings[target](*args, **kwargs)
File "/data/users/angelayi/pytorch/torch/_inductor/lowering.py", line 298, in wrapped
out = decomp_fn(*args, **kwargs)
File "/data/users/angelayi/pytorch/torch/_inductor/lowering.py", line 5340, in mul
return make_pointwise(fn)(a, b)
File "/data/users/angelayi/pytorch/torch/_inductor/lowering.py", line 409, in inner
inputs = promote_constants(inputs, override_return_dtype)
File "/data/users/angelayi/pytorch/torch/_inductor/lowering.py", line 373, in promote_constants
ex = next(x for x in inputs if isinstance(x, (TensorBox, ExpandView)))
torch._inductor.exc.LoweringException: StopIteration:
target: aten.mul.Tensor
args[0]: Constant(value=5.0, dtype=torch.float32, device=device(type='cuda', index=0))
args[1]: 3
```
So I added an additional case in `promote_constants` to handle `ir.Constant`s, and now it works! Although please let me know if this is the wrong approach. Here's a paste of the full run with the inductor logs: P1198927007
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122419
Approved by: https://github.com/eellison, https://github.com/desertfire, https://github.com/chenyang78
Internal users reported that they get failures for max-autotune if tensors are not on device 0. It turns out that we may use tensors on, say, device 6 but run the kernel on them on device 0.
This PR enforces that we do benchmarking for max-autotune on the correct device.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122479
Approved by: https://github.com/xintwfb, https://github.com/Chillee
Differential Revision: D54993977
### Summary
The initial purpose of ncclCommDevIdxMap is to support NCCL zero copy algorithms. Therefore, it is only enabled (with its values filled) if useTensorRegisterAllocatorHook_ is set to true. However, now we rely on it to support dumping NCCL information in a single PG. So we need it to be always available, regardless of whether we enabled useTensorRegisterAllocatorHook_.
Move the code that fills ncclCommDevIdxMap out of the `if (useTensorRegisterAllocatorHook_)` statement.
### Test Plan
See diff
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122049
Approved by: https://github.com/shuqiangzhang
Summary:
Handle cases where a worker process is terminated without releasing the timer request; this scenario causes the process to be reaped at expiry.
Also remove the non-existent process during clear timer.
Test Plan: unit tests
Differential Revision: D55099773
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122324
Approved by: https://github.com/d4l3k
Fix the following warnings during compilation:
```
/home/pytorch/aten/src/ATen/native/cuda/int4mm.cu: In function ‘at::Tensor at::native::_weight_int4pack_mm_cuda(const at::Tensor&, const at::Tensor&, int64_t, const at::Tensor&)’:
/home/pytorch/aten/src/ATen/native/cuda/int4mm.cu:871:6: warning: variable ‘stream’ set but not used [-Wunused-but-set-variable]
871 | auto stream = at::cuda::getCurrentCUDAStream();
| ^~~~~~
/home/pytorch/aten/src/ATen/native/cuda/int4mm.cu: In function ‘at::Tensor at::native::_convert_weight_to_int4pack_cuda(const at::Tensor&, int64_t)’:
/home/pytorch/aten/src/ATen/native/cuda/int4mm.cu:1044:6: warning: variable ‘stream’ set but not used [-Wunused-but-set-variable]
1044 | auto stream = at::cuda::getCurrentCUDAStream();
| ^~~~~~
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122286
Approved by: https://github.com/soulitzer
This PR introduces `torch.nested.nested_tensor_from_jagged(values, offsets=None, lengths=None, jagged_dim=1)` (bikeshedding welcome). This is intended to be the main entrypoint for getting an NJT from the `(values, offsets, lengths)` components. The returned NJT is a view of the `values` component.
Note that `torch.nested.nested_tensor()` / `torch.nested.as_nested_tensor()` already exist for constructing an NJT from a list of tensors.
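A small usage sketch of the new entrypoint (the shapes below are made up):
```python
import torch

# Three jagged "rows" of lengths 2, 3, and 1 packed into one (6, 8) values buffer.
values = torch.randn(6, 8)
offsets = torch.tensor([0, 2, 5, 6])

nt = torch.nested.nested_tensor_from_jagged(values, offsets)
assert nt.is_nested and nt.size(0) == 3
# The NJT is a true view of `values`: in-place writes to `values` show up in nt.
```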
TODO:
* Some doc formatting; suggestions welcome there
* Tests / examples using `jagged_dim != 1`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121518
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #113279, #113280
Fixes https://github.com/pytorch/pytorch/issues/118596.
The issue was as follows:
(1) Whenever AOTAutograd sees an output that is non-contiguous, that it needs a tangent for, it forces the tangent that it generates to be contiguous during tracing
(2) However: if this tangent is a subclass, we need to generate code to flatten/unflatten the subclass at runtime.
(3) To do so, we use the metadata stashed here: https://github.com/pytorch/pytorch/blob/main/torch/_functorch/_aot_autograd/schemas.py#L231
(4) However, this metadata was **wrong** - it was generated by inspecting the tangent, **before** we made the tangent contiguous.
The fix in this PR basically moves the logic that makes `traced_tangents` contiguous earlier, to the time that we first generate `ViewAndMutationMetadata`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118669
Approved by: https://github.com/zou3519
ghstack dependencies: #118803, #119947
* Adds a configurable GEMM size threshold for using Cutlass GEMM kernels, **_inductor.config.cutlass_backend_min_gemm_size** (see the sketch after this list)
* During GEMM algorithm choice generation: **if no viable choices can be generated using the configured backends, the ATen backend will be used as a fallback backend**, even if it is not enabled in **_inductor.config.max_autotune_gemm_backends**
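A hedged configuration sketch (the option names come from the bullets above; the backend string and threshold value are illustrative):
```python
import torch
import torch._inductor.config as inductor_config

inductor_config.max_autotune_gemm_backends = "ATEN,TRITON,CUTLASS"
inductor_config.cutlass_backend_min_gemm_size = 64 * 64 * 64  # skip Cutlass for tiny GEMMs

mm = torch.compile(torch.mm, mode="max-autotune")
# For GEMMs below the threshold no Cutlass choices are generated; if no viable
# choice remains at all, ATen serves as the fallback described above.
out = mm(torch.randn(16, 16, device="cuda"), torch.randn(16, 16, device="cuda"))
```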
Test plan:
CI
Additional unit test in test_cutlass_backend.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121491
Approved by: https://github.com/jansel
ghstack dependencies: #121490
Fixes https://github.com/pytorch/pytorch/issues/121085
This PR pretty involved so pay attention to this description. At a high
level, the refactor is intended to be mechanical: anywhere in
MetaConverter where previously we took a Tensor as argument, we now take
a MetaTensorDesc, which contains all of the information that we would
have queried off of the Tensor, but placed into a separate data
structure which we can serialize or use to recreate a fake tensor in
a separate fake tensor mode in exact fidelity to the original.
However, this transformation is not always entirely mechanical. Here
is what you need to pay attention to:
- The memo table from real Tensor -> meta/fake Tensor is now broken
into two memo tables: real Tensor -> stable int id -> meta/fake
Tensor. The stable int id is needed so that when we do serialization,
we know when tensors/storages alias each other and can ensure we preserve
this aliasing upon deserialization.
The way I have implemented it changes the weak reference behavior.
Previously, when either the real Tensor OR the meta/fake Tensor went
dead, we would remove the entry from the memo table. Now, this only
removes entries from one of the two memo tables. This semantically
makes sense, because the user may have held on to the stable int id
out of band, and may expect a real Tensor to continue to be numbered
consistently / expect to be able to lookup a meta/fake tensor from
this id. If this is unacceptable, it may be possible to rejigger
the memo tables so that we have real Tensor -> stable int id
and real Tensor -> meta/fake Tensor, but TBH I find the new
implementation a lot simpler, and arranging the memo tables in this
way means that I have to muck around with the real tensor to save
to the memo table; in the current implementation, I never pass the
Tensor to meta_tensor function AT ALL, which means it is impossible
to accidentally depend on it.
- When I fill in the fields of MetaTensorDesc in describe_tensor, I need
to be careful not to poke fields when they are not valid. Previously,
preconditions were implicitly checked via the conditional structure
("is this sparse? is this nested?") that is tested before we start
reading attributes. This structure has to be replicated in
describe_tensor, and I have almost assuredly gotten it wrong on my
first try (I'll be grinding through it on CI; a careful audit will
help too, by auditing that I've tested all the same conditionals that
the original access was guarded by.)
- I originally submitted https://github.com/pytorch/pytorch/pull/121821
for the symbolic shapes change, but it turned out the way I did it
there didn't actually work so well for this PR. I ended up just
inlining the symbolic shapes allocation logic into MetaConverter
(look for calls to maybe_specialize_sym_int_with_hint), maybe there
is a better way to structure it, but what I really want is to
just read sizes/strides/offset directly off of MetaTensorDesc; I
don't want another intermediate data structure.
- Some fields aren't serializable. These are documented as "NOT
serializable". ctx/type should morally be serializable and I just
need to setup a contract with subclasses to let them be serialized.
The fake_mode is used solely to test if we are refakefying with
a pre-existing ShapeEnv and we want to reuse the SymInt
directly--serializing this case is hopeless but I am kind of hoping
after this refactor we do not need this at all. view_func is not
serializable because it's a bound C implemented method. Joel has
promised me that this is not too difficult to actually expose as a
true data structure, but this is the edgiest of edge cases and there
is no reason to deal with it right now.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122044
Approved by: https://github.com/eellison
ghstack dependencies: #122018
This PR proposes to use `std::optional<Generator>&` for the underlying functions to avoid unnecessary copy and move operations. The torchgen code was changed to generate the new type.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120076
Approved by: https://github.com/malfet
As a follow-up to #114835 and #119682, we add a limited set of ATen operator implementations for XPU. With this PR, the blocking issues for oneDNN operations and the Inductor XPU backend will be resolved, as the two components depend on these operations to support their basic features, respectively.
The added ATen operators include:
- `copy_`, `_to_copy`, `_copy_from_and_resize`, `clone`
- `view`, `view_as_real`, `view_as_complex`,
- `as_strided`, `_reshape_alias`, `resize_`, `resize_as_`,
- `add`/`add_`, `sub`/`sub_`, `mul`/`mul_`, `div`/`div_`, `abs`,
- `empty`, `empty_strided`,
- `fill_`, `zeros_`.
Co-authored-by: Wang, Eikan <eikan.wang@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120891
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/atalman
Summary: `torch.while_loop` HOP support is added to JIT Inductor. The test coverage is limited due to the functionality constraints of the upstream `torch.while_loop` op in Dynamo / Export. When those are lifted, we'll add more tests (see TODO-s in the test file).
AOT Inductor support will be added in a follow-up PR.
Test Plan:
```
$ python test/inductor/test_control_flow.py
...
----------------------------------------------------------------------
Ran 38 tests in 159.387s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122069
Approved by: https://github.com/jansel, https://github.com/eellison
This PR:
* Introduces an ATen op for creating true jagged views from a dense values buffer
* `_nested_view_from_jagged(values, offsets, lengths, ragged_idx, dummy)`
* This op is implemented on the Python side using torch.library so we can return a subclass instance
* `jagged_from_list()` now uses this instead of the old autograd.Function `NestedViewFromBuffer`
* The latter op is used for non-contiguous JTs returned via `torch.nested.narrow()`
* `dummy` is an awful hack to ensure that `NestedTensor.__torch_dispatch__()` is invoked for our view
* Introduces an ATen op for accessing the `values` component of an NT via a view
* `_nested_get_values(nt)`
* **Removes** the autograd.Functions `ViewNestedFromBuffer` and `ViewBufferFromNested` in favor of `nested_from_values_offsets()` / `nested_from_values_offsets_lengths()` and `nt.values()`, respectively.
* Changes test code to prefer `as_nested_tensor()` over `jagged_from_list()` directly
* Similarly, avoid `buffer_from_jagged()`, preferring `values()`
* Depends on general subclass view fake-ification on the PT2 side (handled solely in previous PRs in the stack)
With these changes, the semantics of jagged layout NTs are such that they are considered a true view of the underlying `values` buffer. This means views of jagged NTs are views of the underlying buffer as well, simplifying some handling.
Differential Revision: [D54269922](https://our.internmc.facebook.com/intern/diff/D54269922)
Co-authored-by: voznesenskym <voznesenskym@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113279
Approved by: https://github.com/ezyang
I have some minor fixes in the scripts to
1. Fix the bug where the empty test matrix was confusingly printed as unstable https://github.com/pytorch/pytorch/pull/121381#issuecomment-2004558588
1. Replace `print` with `logging.info`
1. Remove the hardcoded `VALID_TEST_CONFIG_LABELS` list. It's out of date and not many people use this feature besides `test-config/default`, so why bother. The behavior here is simpler now:
1. If the PR has some `test-config/*` labels, they will be applied
1. If the PR has none of them, all test configs are applied
1. Add log for the previous 2 cases to avoid confusion
### Testing
```
python filter_test_configs.py --workflow "Mac MPS" --job-name "macos-12-py3-arm64 / build" --event-name "push" --schedule "" --branch "" --tag "ciflow/mps/121381" \
--pr-number 121065 \
--test-matrix "{ include: [
{ config: "mps", shard: 1, num_shards: 1, runner: "macos-m1-stable" },
{ config: "mps", shard: 1, num_shards: 1, runner: "macos-m2-14" },
]}"
```
Also running on this PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122155
Approved by: https://github.com/clee2000
In order for a fake model to run via the ONNXProgram.__call__
interface, we need to save the model to disk along with its external data
before executing the model. This is what this PR implements.
An alternative is for ONNXProgram.__call__ to detect that the model
was exported with fake mode and explicitly raise an exception when
ONNXProgram.__call__ is executed. The exception message would instruct
the user to call ONNXProgram.save and manually execute the model using
the ONNX runtime of choice.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122230
Approved by: https://github.com/BowenBao
ghstack dependencies: #122196
In the next PR I have the IR `ops.neg(ops.constant(0.0, torch.float32))`
which should be folded to `ops.constant(-0.0, torch.float32)` but it seems that
`sympy.Float(-0.0)` doesn't respect the sign of the zero and so we instead
get a positive zero constant.
Here, I work around this by doing the constant folding with python arithmetic
which does respect signed zeros.
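A quick illustration of the distinction (plain Python arithmetic keeps the signed zero that, per the description above, `sympy.Float(-0.0)` drops):
```python
import math

folded = -0.0                      # what Python-level folding of neg(0.0) yields
print(math.copysign(1.0, folded))  # -1.0: the sign of zero survives
print(folded == 0.0)               # True: equal in value, distinct in sign bit
```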
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122031
Approved by: https://github.com/lezcano
The comment suggests that we need to replace all FakeTensors with real
tensors. `torch.empty` doesn't actually return a real Tensor because
FakeTensorMode is active!
We disable torch dispatch so that torch.empty actually returns a real Tensor.
The motivation for this PR is that we're trying to ban
FakeTensor.data_ptr (or at least warn on it) in torch.compile. See the
next PR up in the stack
Test Plan:
- Existing tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122418
Approved by: https://github.com/oulgen
Fixes breaking changes for ONNX Runtime Training.
PR https://github.com/pytorch/pytorch/pull/121102 introduced incompatibility with ORT training because of change in parameter type. Creating a PR to add previous parameter types and verified that it works with ORT training.
Error with current scenario:
```
site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/aten_op_executor/aten_op_executor.cc:60:40: error: invalid conversion from ‘const DLManagedTensor*’ to ‘DLManagedTensor*’ [-fpermissive]
at::Tensor tensor = at::fromDLPack(dlpack);
site-packages/torch/include/ATen/DLConvertor.h:15:46: note: initializing argument 1 of ‘at::Tensor at::fromDLPack(DLManagedTensor*)’
TORCH_API Tensor fromDLPack(DLManagedTensor* src);
```
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122000
Approved by: https://github.com/malfet
Before this PR we were not precompiling triton templates in parallel. Compilation would occur during benchmarking.
Triton benchmarking templates were emitted as:
```
@triton.jit
def triton_mm(arg_A, arg_B, out_ptr0):
```
In order to precompile we need to give the full kernel specification, as we do when we emit the template in the final output code generation.
```
@triton_heuristics.template(
num_stages=3,
num_warps=8,
triton_meta={'signature': {0: '*fp32', 1: '*fp32', 2: '*fp32'}, 'device': 0, 'device_type': 'cuda', 'constants': {}, 'configs': [AttrsDescriptor(divisible_by_16=(0, 1, 2), equal_to_1=(), ids_of_folded_args=(), divisible_by_8=())]},
inductor_meta={'kernel_name': 'Placeholder.DESCRIPTIVE_NAME', 'backend_hash': 'cdeecfeccd31ad7810f96b5752194b1c2406d0a81e39a6ca09c8ee150baae183'},
)
@triton.jit
def triton_mm(arg_A, arg_B, out_ptr0):
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121998
Approved by: https://github.com/jansel
ghstack dependencies: #121996, #120275, #121997
This PR allows users to specify int values for dimensions in dynamic_shapes as well as None, for example:
```
class Foo(torch.nn.Module):
    def forward(self, x, y, z):
        ...

foo = Foo()
inputs = (torch.randn(4, 6), torch.randn(5, 4), torch.randn(3, 3))
for dynamic_shapes in [
    None,
    ((4, 6), (5, 4), (3, 3)),
    ((None, 6), None, {0: 3, 1: 3}),
]:
    _ = export(foo, inputs, dynamic_shapes=dynamic_shapes)
```
All of the above should produce the same ExportedProgram.
This is done by temporarily creating a static dim constraint during analysis, where vr.lower == vr.upper. These constraints are then deleted during _process_constraints(), and do not show up in the final ExportedProgram's range_constraints.
Additionally, export() will also fail if the shapes are mis-specified, for example:
```
_ = export(foo, inputs, dynamic_shapes=((5, None), None, None))
```
leads to `torch._dynamo.exc.UserError: Static shape constraint of 5 does not match input size of 4, for L['x'].size()[0]`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121860
Approved by: https://github.com/avikchaudhuri
As a follow-up to #114835 and #119682, we add a limited set of ATen operator implementations for XPU. With this PR, the blocking issues for oneDNN operations and the Inductor XPU backend will be resolved, as the two components depend on these operations to support their basic features, respectively.
The added ATen operators include:
- `copy_`, `_to_copy`, `_copy_from_and_resize`, `clone`
- `view`, `view_as_real`, `view_as_complex`,
- `as_strided`, `_reshape_alias`, `resize_`, `resize_as_`,
- `add`/`add_`, `sub`/`sub_`, `mul`/`mul_`, `div`/`div_`, `abs`,
- `empty`, `empty_strided`,
- `fill_`, `zeros_`.
Co-authored-by: Wang, Eikan <eikan.wang@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120891
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/atalman
Currently, the in-memory onnx program model proto does
not contain the initializers saved into the disk version.
This PR changes this behavior so that both versions are
identical. This is important for running models with fake
tensors from ONNXProgram.model_proto directly, without a file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122196
Approved by: https://github.com/BowenBao
Summary: Previously, we only supported a torch.Tensor boolean scalar predicate for `torch.cond` in Inductor. This PR adds support for SymBool and Python bool predicates, to match the `torch.cond` [semantics](https://pytorch.org/docs/stable/generated/torch.cond.html) in Dynamo / Export.
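For reference, a small `torch.cond` sketch with a non-tensor predicate of the kind this PR enables (the module and shapes are illustrative):
```python
import torch

def true_fn(x):
    return x.sin()

def false_fn(x):
    return x.cos()

class M(torch.nn.Module):
    def forward(self, x):
        # Under compile/export the comparison yields a Python bool or SymBool,
        # not a tensor, which Inductor can now lower.
        return torch.cond(x.shape[0] > 4, true_fn, false_fn, (x,))

compiled = torch.compile(M())
out = compiled(torch.randn(8, 3))
```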
Test Plan:
```
$ python test/inductor/test_control_flow.py
...
----------------------------------------------------------------------
Ran 34 tests in 56.980s
OK
$ python test/inductor/test_aot_inductor.py -k test_cond
...
----------------------------------------------------------------------
Ran 54 tests in 460.093s
OK (skipped=4)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122378
Approved by: https://github.com/jansel, https://github.com/chenyang78
Summary: It would be useful to log the destination of the trace dump, whether in Manifold or a local file, so that users can quickly locate the dump.
Test Plan: Modified unit tests
Differential Revision: D54972069
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122345
Approved by: https://github.com/wconstab
Summary:
Computing durations incurs additional CUDA overhead and can possibly increase
GPU memory usage or even hang, so we want to disable it by default and enable it only
when needed, or at least only when timing is enabled.
Test Plan:
Test with existing unit test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122138
Approved by: https://github.com/wconstab
Summary:
We're getting errors that Python.h is not found because we didn't have
the proper include path set up for it.
bypass-github-export-checks
Test Plan: I can only get this to show up in Bento: N5106134
Reviewed By: hl475, chenyang78
Differential Revision: D55133110
Co-authored-by: Bert Maher <bertrand@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122363
Approved by: https://github.com/bertmaher
This PR introduces `torch.nested.nested_tensor_from_jagged(values, offsets=None, lengths=None, jagged_dim=1)` (bikeshedding welcome). This is intended to be the main entrypoint for getting an NJT from the `(values, offsets, lengths)` components. The returned NJT is a view of the `values` component.
Note that `torch.nested.nested_tensor()` / `torch.nested.as_nested_tensor()` already exist for constructing an NJT from a list of tensors.
TODO:
* Some doc formatting; suggestions welcome there
* Tests / examples using `jagged_dim != 1`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121518
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #113280
Summary: Special kwargs like `num_warps`, `num_stages`, and `num_ctas` can be passed to the Triton kernel call as kwargs. These kwargs are handled in a special way, not being passed to the underlying kernel function directly. In this PR, we move those special kwargs from `kwargs` of the `TritonKernelVariable` in dynamo to `Autotuner`'s `Config` instances (either already existing or newly created for this purpose). As a result, the special kwargs can be codegened correctly as a part of `Config`, not as direct arguments to the kernel `.run`.
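To make the two homes for such kwargs concrete, a toy sketch (the kernel itself is made up):
```python
import triton
import triton.language as tl

@triton.jit
def add_one(x_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    tl.store(x_ptr + offs, tl.load(x_ptr + offs, mask=mask) + 1, mask=mask)

# At the call site, the special kwargs ride along with the launch:
#   add_one[(num_blocks,)](x, n, BLOCK=128, num_warps=4, num_stages=2)
# After this PR they are instead carried by an autotuner Config:
cfg = triton.Config({"BLOCK": 128}, num_warps=4, num_stages=2)
```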
Test Plan:
```
python test/inductor/test_triton_kernels.py -k test_triton_kernel_special_kwargs
...
----------------------------------------------------------------------
Ran 6 tests in 6.783s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122280
Approved by: https://github.com/oulgen
See #113541
The PR allows for registering and controlling multiple RNG states using indices, ensuring cudagraph-safe operations, and includes both C++ and Python API changes to support this functionality.
cc @eellison @anijain2305 @jansel @ezyang @ptrblck @csarofeen @mcarilli
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114068
Approved by: https://github.com/ezyang
Summary: We provide an `is_in_torch_dispatch_mode` API returning `bool` to determine whether the program is running in torch dispatch mode or not.
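A usage sketch (the import path below is an assumption for illustration; the summary only names the API):
```python
import torch
from torch.utils._python_dispatch import TorchDispatchMode, is_in_torch_dispatch_mode

class NoopMode(TorchDispatchMode):
    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        return func(*args, **(kwargs or {}))

assert not is_in_torch_dispatch_mode()
with NoopMode():
    assert is_in_torch_dispatch_mode()
```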
Test Plan:
- OSS CI
- Tested with publish of hstu models with the this diff and following diffs D54964288, D54964702, D54969677, D55025489, runtime errors are not raised anymore in publish
Differential Revision: D55091453
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122339
Approved by: https://github.com/jiayisuse
`torchxla_trace_once` and `aot_torchxla_trivial` should be removed.
In our internal (hopefully the dashboard can be open sourced soon) torchbench daily runs, the `openxla` backend has a much higher passing rate and similar performance to `openxla_eval` (the non-aot-autograd backend). We still use `openxla_eval` in the llama2 example, but I think we should move users to the `openxla` backend going forward.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122128
Approved by: https://github.com/alanwaketan, https://github.com/jansel
Summary:
When building our iOS app, we get a compile error about the deprecated `volatile` keyword.
This diff attempts to fix it by replacing the usage of the deprecated `volatile` keyword with `atomic` as suggested by malfet
Test Plan: Successfully built the iOS app that previously had a compile error
Differential Revision: D55090518
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122231
Approved by: https://github.com/malfet
This PR:
* Introduces an ATen op for creating true jagged views from a dense values buffer
* `_nested_view_from_jagged(values, offsets, lengths, ragged_idx, dummy)`
* This op is implemented on the Python side using torch.library so we can return a subclass instance
* `jagged_from_list()` now uses this instead of the old autograd.Function `NestedViewFromBuffer`
* The latter op is used for non-contiguous JTs returned via `torch.nested.narrow()`
* `dummy` is an awful hack to ensure that `NestedTensor.__torch_dispatch__()` is invoked for our view
* Introduces an ATen op for accessing the `values` component of an NT via a view
* `_nested_get_values(nt)`
* **Removes** the autograd.Functions `ViewNestedFromBuffer` and `ViewBufferFromNested` in favor of `nested_from_values_offsets()` / `nested_from_values_offsets_lengths()` and `nt.values()`, respectively.
* Changes test code to prefer `as_nested_tensor()` over `jagged_from_list()` directly
* Similarly, avoid `buffer_from_jagged()`, preferring `values()`
* Depends on general subclass view fake-ification on the PT2 side (handled solely in previous PRs in the stack)
With these changes, the semantics of jagged layout NTs are such that they are considered a true view of the underlying `values` buffer. This means views of jagged NTs are views of the underlying buffer as well, simplifying some handling.
Differential Revision: [D54269922](https://our.internmc.facebook.com/intern/diff/D54269922)
Co-authored-by: voznesenskym <voznesenskym@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113279
Approved by: https://github.com/ezyang
Enable VEC on Windows OS.
1. Fix some type definition gaps between Windows and Linux.
2. Fix some operators not supported on Windows, such as `[]` and `/`.
3. Enable static sleef library build on Windows.
4. Disable unsupported function overloading on MSVC.
5. Upgrade the sleef submodule, which fixes a build issue on Windows.
6. Fix bazel build issues.
7. Fix the test app not linking to sleef on Windows.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118980
Approved by: https://github.com/jgong5, https://github.com/ezyang, https://github.com/malfet
`get_device_states` doesn't recursively look into nested lists/dicts to find tensors. As a result, activation checkpointing for such inputs produces silently incorrect results: `get_device_states` returns an empty result and no rng is saved here: https://github.com/pytorch/pytorch/blob/main/torch/utils/checkpoint.py#L188 since `fwd_device_states` is empty.
Fixed this by using `tree_map` for both `get_device_states` and `_infer_device_type`. Also added appropriate unit tests.
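A minimal sketch of the idea (hypothetical helper, not the actual checkpoint code): use pytree to visit tensors hiding inside nested containers instead of scanning only the top-level args:
```
import torch
from torch.utils._pytree import tree_map

def collect_devices(*args):
    devices = set()

    def visit(x):
        # record the device of every non-CPU tensor found anywhere in the pytree
        if isinstance(x, torch.Tensor) and x.device.type != "cpu":
            devices.add(x.device)
        return x

    tree_map(visit, args)
    return devices

# tensors nested inside a dict/list are now discovered instead of being missed
print(collect_devices({"x": [torch.randn(2), torch.randn(2)]}))
```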
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121462
Approved by: https://github.com/soulitzer
Summary: Instead of using "batch_fusion" and "group_fusion" to log, we use the specific pass name to log, which better summarizes the hits of each pattern and helps debugging.
Test Plan:
```
buck2 test mode/dev-nosan //caffe2/test/inductor:group_batch_fusion
```
Differential Revision: D55103303
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122245
Approved by: https://github.com/jackiexu1992
# PR
Vectors allocated inside `get_chunk_cat_metadata()` go out of scope before they are used in `_chunk_cat_out_cuda_contiguous()`. This PR fixes the issue by returning the vectors from `get_chunk_cat_metadata`.
This PR also adds a few unit tests to cover more edge cases.
# Tests
This PR is tested with the following commands and no error shows, so the flaky test error should be resolved.
- `PYTORCH_NO_CUDA_MEMORY_CACHING=1 compute-sanitizer python test/test_ops.py -v -k test_out__chunk_cat_cuda_float32`
- `PYTORCH_NO_CUDA_MEMORY_CACHING=1 python test/test_ops.py -v -k test_out__chunk_cat_cuda_float32 --repeat 1500`
Fixes #122026, fixes #121950
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122076
Approved by: https://github.com/yifuwang
Summary: Adds a pass that blindly removes the functionalize hop without consideration of whether it's safe. Useful for ExecuTorch today and other use cases that have additional logic that can reason about when this pass is safe to use.
Test Plan: added unit test
Differential Revision: D55103867
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122246
Approved by: https://github.com/angelayi
Summary:
We can also log the module hierarchy in the following format:
```
:ToplevelModule
sparse:SparshArch
dense:DenseArch
```
So that we can have more information recorded about model's identity.
Test Plan: CI
Differential Revision: D54921097
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121970
Approved by: https://github.com/angelayi
Precompile benchmarking choices in parallel, and then wait on those choices prior to benchmarking. In the case of deferred templates, we only wait on those choices in the scheduler, to allow multiple separate lowerings to compile in parallel.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121997
Approved by: https://github.com/jansel
ghstack dependencies: #121996, #120275
Summary: In `torch.inference_mode()`, fake tensors don't have `_version`s. This breaks unbacked SymInt memoization in `torch.nonzero` tracing. Here we disable the latter in inference mode.
Fixes https://github.com/pytorch/pytorch/issues/122127
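A minimal sketch of the scenario (assuming `capture_dynamic_output_shape_ops` is needed to trace `nonzero` without a graph break):
```
import torch

# assumption: this flag lets dynamo capture ops with data-dependent output shapes
torch._dynamo.config.capture_dynamic_output_shape_ops = True

@torch.compile
def f(x):
    return x.nonzero()

with torch.inference_mode():
    # fake tensors have no _version here, so the SymInt memoization is skipped
    print(f(torch.tensor([0, 1, 0, 2])))
```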
Test Plan:
```
$ python test/inductor/test_unbacked_symints.py -k test_nonzero_in_inference_mode
...
----------------------------------------------------------------------
Ran 2 tests in 14.060s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122147
Approved by: https://github.com/ezyang
Previously, all jvp tests under dynamo/test_dynamic_shapes would fail because symbolic execution wasn't supported in some autograd functions.
List of changes:
- Update`_has_same_storage_numel` to use `sym_nbytes`
- Symintify `_efficientzerotensor_meta`
- Introduce `empty_generic_symint` with the first argument `size` as symbolic integer
- Update gen_variable_type.py script to call the symint version of zeros_fn function (zeros_symint / _efficientzerotensor_symint)
- Update `has_same_meta` to call `sym_*` functions
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120338
Approved by: https://github.com/soulitzer
ghstack dependencies: #119926
List of changes:
- Replace JVP_NESTING by torch._C._functorch.maybe_current_level()
- Remove all increment nesting functions from wrap_fx_proxy_cls
- fwAD.make_dual receives the dual_level as keyword argument
- Add jvp_increment_nesting, set_fwd_grad_enabled and dual_level context managers to dynamo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119926
Approved by: https://github.com/zou3519
Summary: We have rewritten `conv1d` as `create_conv1d_context` and `run_conv1d_context` to enable prepack of `weight` and `bias`. We have registered `create_conv1d_context` but not `run_conv1d_context`. We add the registration in this diff.
Test Plan:
```
[luwei@devbig439.ftw3 /data/users/luwei/fbsource (f89a7de33)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*conv1d*"
Using additional configuration options from /home/luwei/.buckconfig.d/experiments_from_buck_start
Recommended: For faster builds try buck2: replace 'buck' with 'buck2'
NOTE: buck-out/ has changed: look for files in fbsource/buck-out/v2/
'buck2 build --show-output //xplat/caffe2:pt_vulkan_api_test_bin' will print the new output paths.
If you are building in fbsource//xplat and have questions, post in 'Cross Platform Dev Discussions': https://fb.workplace.com/groups/xplat.qa
Targets matching .buckconfig buck2.supported_projects:
{'//xplat/caffe2:pt_vulkan_api_test_bin': '//xplat'}
To suppress this warning: touch ~/.config/.dont_hint_buck2
Building: finished in 0.1 sec (100%) 394/394 jobs, 0/394 updated
Total time: 0.2 sec
BUILD SUCCEEDED
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *conv1d*
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from VulkanAPITest
[ RUN ] VulkanAPITest.conv1d_simple
[ OK ] VulkanAPITest.conv1d_simple (208 ms)
[ RUN ] VulkanAPITest.conv1d
[ OK ] VulkanAPITest.conv1d (81 ms)
[----------] 2 tests from VulkanAPITest (289 ms total)
[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (289 ms total)
[ PASSED ] 2 tests.
```
full test result
```
...
[----------] 427 tests from VulkanAPITest (22583 ms total)
[----------] Global test environment tear-down
[==========] 427 tests from 1 test suite ran. (22583 ms total)
[ PASSED ] 426 tests.
[ SKIPPED ] 1 test, listed below:
[ SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
YOU HAVE 11 DISABLED TESTS
```
Differential Revision: D55052816
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122172
Approved by: https://github.com/nathanaelsee
Check that the `classname` attribute actually exists.
#122017
I expect this route to happen very rarely.
At a certain point, we should just remove this parsing altogether since everything uses pytest now...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122103
Approved by: https://github.com/huydhn
1) add operand and get_dim_names API;
2) set will_resize to true when output tensor is undefined;
3) add abs_stub for dummy device and calculate on cpu device;
4) support dummy device copy with stride;
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120792
Approved by: https://github.com/ezyang
Our prior approach to epilogue fusion was to select a choice from a set of triton templates and extern calls based on benchmarking inputs, then unconditionally fuse epilogues. This can be sub-optimal in the following ways:
- We select an extern kernel, however an epilogue like relu() exists such that choosing a triton template + relu would have been faster
- We select a triton template, epilogue fuse, and register spilling occurs causing it to be slower than not epilogue fusing.
In this PR we wait to select either the Triton template or extern kernel until we have benchmarking results for both the kernel itself and its epilogue. As soon as a successful fusion occurs where a fused Triton template + epilogue is faster than the unfused choice, we finalize the MultiTemplateBuffer as that specific template. If no fusion occurs, we finalize the MultiTemplateBuffer once the fusion pass completes.
Note: if there are multiple epilogue fusions (not super likely), even though we select a template after the first fusion, we will still benchmark to see if subsequent epilogues are worth fusing. We could potentially defer choosing the template in this case in a follow-up, at the expense of compile time.
Gives a 4% HF training win and a 10% TIMM inference win. Increases compilation time, which I will be trying to address more in follow-up PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120275
Approved by: https://github.com/jansel
ghstack dependencies: #121996
Summary:
`-Wunused-exception-parameter` has identified an unused exception parameter. This diff removes it.
This:
```
try {
...
} catch (exception& e) {
// no use of e
}
```
should instead be written as
```
} catch (exception&) {
```
If the code compiles, this is safe to land.
Test Plan: Sandcastle
Reviewed By: kimishpatel, palmje
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116875
Approved by: https://github.com/Skylion007
Summary:
`-Wunused-exception-parameter` has identified an unused exception parameter. This diff removes it.
This:
```
try {
...
} catch (exception& e) {
// no use of e
}
```
should instead be written as
```
} catch (exception&) {
```
If the code compiles, this is safe to land.
Test Plan: Sandcastle
Reviewed By: palmje
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116876
Approved by: https://github.com/Skylion007
Summary:
LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.
This diff either (a) removes an unused variable and, possibly, its associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, e.g., in an `assert` statement that isn't present in production code.
- If you approve of this diff, please use the "Accept & Ship" button :-)
Test Plan: Sandcastle
Reviewed By: palmje
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120177
Approved by: https://github.com/Skylion007
Documentation states that the margin parameter of torch.nn.TripletMarginLoss must be greater than 0; however, any value was being accepted. Also fixed torch.nn.TripletMarginWithDistanceLoss, which had the same problem. Added error test inputs for the new ValueError.
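A small illustration of the new behavior (a sketch, assuming the check is raised at construction time):
```
import torch.nn as nn

# assumption: a non-positive margin now raises ValueError in the constructor
try:
    nn.TripletMarginLoss(margin=-1.0)
except ValueError as e:
    print("rejected:", e)
```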
Fixes #83241
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121978
Approved by: https://github.com/mikaylagawarecki
Summary:
LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.
This diff either (a) removes an unused variable and, possibly, its associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, e.g., in an `assert` statement that isn't present in production code.
- If you approve of this diff, please use the "Accept & Ship" button :-)
Test Plan: Sandcastle
Reviewed By: malfet, dmm-fb
Differential Revision: D52981072
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118116
Approved by: https://github.com/Skylion007
Summary:
LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.
This diff either (a) removes an unused variable and, possibly, its associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, e.g., in an `assert` statement that isn't present in production code.
- If you approve of this diff, please use the "Accept & Ship" button :-)
Test Plan: Sandcastle
Reviewed By: palmje
Differential Revision: D53779579
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120176
Approved by: https://github.com/Skylion007
Summary:
LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.
This diff either (a) removes an unused variable and, possibly, its associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, e.g., in an `assert` statement that isn't present in production code.
- If you approve of this diff, please use the "Accept & Ship" button :-)
Test Plan: Sandcastle
Reviewed By: palmje
Differential Revision: D53779549
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120178
Approved by: https://github.com/Skylion007
Summary:
LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.
This diff either (a) removes an unused variable and, possibly, its associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, e.g., in an `assert` statement that isn't present in production code.
- If you approve of this diff, please use the "Accept & Ship" button :-)
Test Plan: Sandcastle
Reviewed By: palmje
Differential Revision: D54931224
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121995
Approved by: https://github.com/Skylion007
Summary:
LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.
This diff either (a) removes an unused variable and, possibly, its associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, e.g., in an `assert` statement that isn't present in production code.
- If you approve of this diff, please use the "Accept & Ship" button :-)
Test Plan: Sandcastle
Differential Revision: D54378401
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122151
Approved by: https://github.com/Skylion007
Summary:
LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.
This diff either (a) removes an unused variable and, possibly, its associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, e.g., in an `assert` statement that isn't present in production code.
- If you approve of this diff, please use the "Accept & Ship" button :-)
Test Plan: Sandcastle
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122161
Approved by: https://github.com/Skylion007
Summary:
# Diff Specific
The signature of `copyFrom` is
```
void Tensor::CopyFrom(const Tensor& src, bool async) {
```
so the `&context` always evaluated to true.
I could dig around to see if anyone cares about what the flag should actually be, but this is old code in caffe2, so I've just used `true` and we'll keep using whatever behaviour we've been using since 2019 or so when this was written.
# General
A bug in this code was identified by `-Waddress`, which we are working to enable globally.
This diff fixes the bug. There are a few types of fixes it might employ:
The bug could be `const_char_array == "hello"` which compares two addresses and therefore is almost always false. This is fixed with `const_char_array == std::string_view("hello")` because `string_view` has an `==` operator that makes an appropriate comparison.
The bug could be `if(name_of_func)` which always returns true because the function always has an address. Likely you meant to call the function here!
- If you approve of this diff, please use the "Accept & Ship" button :-)
Test Plan: Sandcastle
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121856
Approved by: https://github.com/Skylion007
torchrec_dlrm training fail the accuracy check when max-autotune is enabled.
I found there is no real issue in PT2. We fail to get fp64 reference results for the accuracy check. In max-autotune mode numerics may change a bit and cause the cosine-similarity check to fail. Using an fp64 baseline is more reliable and makes the test pass.
The reason why we were not using an fp64 baseline earlier is that torchrec uses a dataclass [Batch](99e6e669b5/torchrec/datasets/utils.py (L28)) to represent the input. We use pytree to cast the model and inputs to fp64, and pytree cannot look into a dataclass. My fix is to convert the dataclass to a namedtuple to be more pytree friendly, as sketched below.
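A minimal sketch of why the namedtuple conversion helps (illustrative types, not the real torchrec `Batch`):
```
from collections import namedtuple
from dataclasses import dataclass

import torch
from torch.utils._pytree import tree_map

@dataclass
class BatchDC:  # pytree treats a plain dataclass instance as an opaque leaf
    x: torch.Tensor

BatchNT = namedtuple("BatchNT", ["x"])  # namedtuples are pytree nodes

to_fp64 = lambda t: t.double() if isinstance(t, torch.Tensor) else t

print(tree_map(to_fp64, BatchDC(torch.randn(2))).x.dtype)  # torch.float32 (not cast)
print(tree_map(to_fp64, BatchNT(torch.randn(2))).x.dtype)  # torch.float64
```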
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122012
Approved by: https://github.com/jansel, https://github.com/eellison
Summary:
also added some utils in xnnpack_quantizer_utils.py
* annotate_conv_transpose_bn_relu and annotate_conv_transpose_bn -> this is for QAT
* annotate_conv_transpose_relu
conv_transpose + bn weight fusion is performed automatically and cannot be disabled currently;
we can add support to disable this fusion later if needed
Test Plan:
python test/test_quantization.py -k test_conv_transpose_bn_fusion
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122046
Approved by: https://github.com/andrewor14
If TORCH_DISABLE_ADDR2LINE is set, the symbolizer will instead report the
shared library as the filename and the offset within that library as the line number,
and use dladdr to get the function name if possible. This is much faster than using addr2line,
and the symbols can later be resolved offline using addr2line if desired.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121359
Approved by: https://github.com/aaronenyeshi
@ezyang mentioned that we should not put constant args on the graph, especially when there are args that would be trickier to put on the graph; e.g. the next PR needs `triton.language.dtype` as an argument on the graph.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122140
Approved by: https://github.com/jansel
Creating this after [PR](https://github.com/pytorch/pytorch/pull/121642) got reverted.
The current dynamic shapes implementation fixes the lower bound of Dims to 2 for analysis but allows 0/1 shapes at runtime. This leads to failures when initializing Dim(1, 2). This PR sets the lower bound to 0 and avoids erroring out when it conflicts with the generated (2, maxsize) constraint during analysis.
Also resolves a derived dim constraints issue with the following code:
```
class Bar(torch.nn.Module):
def forward(self, x, y):
return x + y[1:]
dx = Dim("dx", min=1, max=3)
ep = export(
Bar(),
(torch.randn(2, 2), torch.randn(3, 2)),
dynamic_shapes=({0: dx, 1: None}, {0: dx+1, 1: None})
)
print(ep.range_constraints)
```
In main:
```
{s0: ValueRanges(lower=2, upper=3, is_bool=False), s0 + 1: ValueRanges(lower=3, upper=4, is_bool=False)}
```
This PR:
```
{s0: ValueRanges(lower=1, upper=3, is_bool=False), s0 + 1: ValueRanges(lower=2, upper=4, is_bool=False)}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121910
Approved by: https://github.com/avikchaudhuri, https://github.com/zhxchen17
Summary:
LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.
This diff either (a) removes an unused variable and, possibly, it's associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, eg, in an `assert` statement that isn't present in production code.
- If you approve of this diff, please use the "Accept & Ship" button :-)
Test Plan: Sandcastle
Reviewed By: dmm-fb
Differential Revision: D54380402
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122165
Approved by: https://github.com/Skylion007
Summary:
- we want fx nodes' stack trace format to be backward compatible and the same as before in the program we export
- however, in the serialized format we want to show a more compact stack_trace format, since otherwise the node attributes are dominated by stack traces
- the diff implements the minimal change in the serialization process to dedupe node stack traces by resorting to a fileinfo_list and a filename_to_abbrev map, so we can use an index to represent filenames and a lineno to represent lines (see the conceptual sketch after this list)
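A conceptual sketch of the dedup scheme (the names below are illustrative, not the actual serialization schema):
```
# Store each filename once in an ordered list and refer to it by index; a
# node's stack trace entry then becomes (file_index, lineno, function_name).
stack_traces = [
    ("models/llama/model.py", 10, "forward"),
    ("models/llama/model.py", 42, "attention"),
    ("models/llama/rope.py", 7, "apply_rope"),
]
file_list = sorted({f for f, _, _ in stack_traces})
file_index = {f: i for i, f in enumerate(file_list)}
compact = [(file_index[f], lineno, fn) for f, lineno, fn in stack_traces]
print(file_list)  # filenames stored once
print(compact)    # per-node entries are now small index-based tuples
```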
Test Plan:
# llm
base on D54497918
```
buck2 run @//mode/dev-nosan fbcode//executorch/examples/models/llama2:export_llama -- -c ~/stories110M.pt -p ~/params.json
```
set up breakpoint after serialization/deserialization
- serialize
```
(Pdb) v_meta = [n.meta for n in exported_program.graph_module.graph.nodes]
(Pdb) paste_client.create_phabricator_paste_object(paste_creation_client_id=1093956601162697, content=str(v_meta)).number
1193647450
(Pdb) json_program = json.dumps(_dataclass_to_dict(serialized_graph.co_fileinfo_ordered_list),cls=EnumEncoder)
(Pdb) json_bytes = json_program.encode('utf-8')
(Pdb) paste_client.create_phabricator_paste_object(paste_creation_client_id=1093956601162697, content=str(json_bytes)).number
1193604333
(Pdb) sys.getsizeof(json_bytes)
3846
(Pdb) compressed_bytes = zstd.ZstdCompressor().compress(json_bytes)
(Pdb) sys.getsizeof(compressed_bytes)
1139
```
in P1193647450 (before serialization), search for `stack_trace`
in P1193604333 (after serialization), search for `stack_trace` and `co_fileinfo_ordered_list`
[note: didn't do compression in this diff since the size is pretty small and it adds complexity if we do compression]
- deserialize
```
(Pdb) v_meta = [n.meta for n in deserialized_exported_program.graph_module.graph.nodes]
(Pdb) paste_client.create_phabricator_paste_object(paste_creation_client_id=1093956601162697, content=str(v_meta)).number
1193629435
```
in P1193629435, search for `stack_trace`
# ads
Differential Revision: D54654443
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121675
Approved by: https://github.com/angelayi
Previously, typing an autograd.Function like the following would lead to
a mypy error (which expects the first arg to forward to be named `ctx`).
This PR fixes that by deleting the ctx arg.
```py
class MySin(torch.autograd.Function):
@staticmethod
def forward(x: torch.Tensor) -> torch.Tensor:
return x.sin()
@staticmethod
def setup_context(*args, **kwargs):
pass
@staticmethod
def backward(ctx, grad):
if grad.stride(0) > 1:
return grad.sin()
return grad.cos()
```
Test Plan:
- tested locally (I don't know how to put up a test in CI for this).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122167
Approved by: https://github.com/soulitzer
Summary:
## Context
Some small fixes to the ATen-Vulkan backend.
The first is that GPU sizes for a 4 dimensional tensor with width packing had a small bug:
```
case 4:
switch (memory_layout) {
case api::GPUMemoryLayout::TENSOR_WIDTH_PACKED:
gpu_sizes.at(0) = sizes.at(0);
gpu_sizes.at(1) = sizes.at(1);
// should be gpu_sizes.at(2) = sizes.at(2)
gpu_sizes.at(2) = sizes.at(3);
gpu_sizes.at(3) = api::utils::align_up(sizes.at(3), INT64_C(4));
break;
```
This was fixed by simplifying the logic of GPU size calculation for texture storage.
The second was to modify the ctype mapping of the `api::kHalf` scalar type to be `float` instead of `unsigned short`. This is because GLSL does not natively support `float16`, so even with a FP16 texture type CPU/GPU transfer shaders will have to read from and write to `float` buffers.
In the future, we will look into integrating [VK_KHR_shader_float16_int8](https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/VK_KHR_shader_float16_int8.html) into ATen-Vulkan to allow for 16 bit and 8 bit types to be referenced explicitly.
Test Plan: CI
Differential Revision: D55018171
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122096
Approved by: https://github.com/jorgep31415
Differential Revision: D54964130
When we re-export, the auto_functionalize HOP will be in the graph. Therefore, we need to implement a proper functionalization rule for it. Since the content inside auto_functionalize is guaranteed to be functional, it is OK to just fall through it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121990
Approved by: https://github.com/ydwu4, https://github.com/zou3519
This removes the duplicate handling of comparison ops between symbolic_convert and builtin and refactors the handling to use the binop infrastructure. This change regresses overheads a bit, but this is fixed in the next PR.
New test skips are variants of `type(e) is np.ndarray` previously falling back to eager.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122043
Approved by: https://github.com/anijain2305
ghstack dependencies: #122039
Summary:
The previous implementation seems to introduce a key-value pair of {"node": None}. This causes an error in logging later on because we extract the name from "node", but it is a string instead of a torch.fx.Node.
With this change, the tests seem to pass.
Test Plan:
CI
ExecuTorch CI:
buck test mode/dev-nosan //executorch/backends/xnnpack/test:test_xnnpack_models
Reviewed By: larryliu0820
Differential Revision: D55026133
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122111
Approved by: https://github.com/mikekgfb
Forward fix for regressions introduced by https://github.com/pytorch/pytorch/pull/121381 as we failed to run MPS CI twice on it
- Do not call `minimumWithNaNPropagationWithPrimaryTensor` for integral tensors as it will crash with
```
/AppleInternal/Library/BuildRoots/ce725a5f-c761-11ee-a4ec-b6ef2fd8d87b/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Utility/MPSKernelDAG.mm:805: failed assertion `Error getting visible function: (null) Function isNaN_i16_i8 was not found in the library'
```
- Change the order of the max and min calls, as it's apparently important for
consistency: `min(max(a, b), c)` might not equal `max(min(a, c), b)` if `c` is not always less than or equal to `b` (see the small numeric example below)
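A small numeric example of the inconsistency (plain Python; the same reasoning applies to the MPS min/max kernels):
```
# When the "upper" bound c can be below the "lower" bound b, the two orderings
# of the clamp-like expression no longer agree.
a, b, c = 5, 2, 1          # c < b
print(min(max(a, b), c))   # 1
print(max(min(a, c), b))   # 2
```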
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122148
Approved by: https://github.com/huydhn
**Summary:**
This commit simplifies the existing decomposition hierarchy
of batch norm ops by adding a single, backend agnostic op:
`batch_norm_with_update`. The existing hierarchy looks like:
```
aten.batch_norm ->
aten._batch_norm_impl_index ->
[
aten.native_batch_norm ->
aten._native_batch_norm_legit (export only) ->
_batch_norm_legit_cpu/cuda (kernels, export only) ->
_batch_norm_cpu/cuda (kernels)
] OR
[ aten.cudnn_batch_norm ] OR
[ aten.miopen_batch_norm ]
```
Aside from complexity, an important problem with the
above decomposition hierarchy is cuda numerics in
export flows. We observed significantly worse convergence
when training a mobilenetv2-like model when using the
`_batch_norm_cuda` kernel instead of the `cudnn_batch_norm`
kernel. This means users who export their models on CPU
first then move the models to cuda later may silently
see worse accuracies even when cudnn is installed,
because they are using the worse kernel. This issue is
summarized in https://github.com/pytorch/pytorch/issues/111384.
Instead, the new hierarchy proposed by consolidating
existing batch norm ops will look like:
```
aten.batch_norm ->
aten.batch_norm_with_update ->
[ _batch_norm_cpu (kernel) ] OR
[ _batch_norm_cuda (kernel) ] OR
[ cudnn_batch_norm (kernel) ] OR
[ miopen_batch_norm (kernel) ]
```
The new op `batch_norm_with_update` hides backend
implementation details and automatically picks the right
kernel based on what is installed. This commit also adds
the following variants to this op:
```
batch_norm_with_update_functional
batch_norm_with_update.out
batch_norm_no_update
batch_norm_no_update.out
batch_norm_backward
```
Note that this commit only adds this op and its variants,
but does not actually change the decomps to produce these
ops in the graph. This will be done after the 2 week FC
window, and the ops used in the old stack are planned to
be removed after the 6 month BC window.
Test Plan: `OpInfo` tests for `batch_norm_with_update`.
Reviewers: albanD, bdhirsh
Subscribers: albanD, bdhirsh, supriyar
Tasks: https://github.com/pytorch/pytorch/issues/111384
Differential Revision: [D54805279](https://our.internmc.facebook.com/intern/diff/D54805279)
Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092
Approved by: https://github.com/bdhirsh, https://github.com/albanD
Fixes https://github.com/pytorch/pytorch/issues/120873.
Fixes the output stride of Conv in the case of dynamic shapes. The previous logic in inductor assumed that the output stride of Conv is always channels last while it is actually contiguous if `dynamic_shapes and is_contiguous_storage_and_layout(x)`.
### Static shape
In static shape cases, since weight is prepacked (`weight_t.is_mkldnn()` will be `true`), we'll always force output to be channels last in the Conv kernel, thus it's fine to have the assumption in Inductor that the output stride of Conv is always channels last.
96ed37ac13/aten/src/ATen/native/mkldnn/Conv.cpp (L357-L358)
### Dynamic shape
In dynamic shape cases, we won't do weight prepack for Conv, in this case, the Conv kernel decides the output layout based on the input and weight layout.
96ed37ac13/torch/_inductor/fx_passes/mkldnn_fusion.py (L1024-L1025)
For input with `channels = 1`, like tensor of size `(s0, 1, 28, 28)` and stride `(784, 784, 28, 1)`, in Inductor, with `req_stride_order` in channels last order, the `require_stride_order` on `x` of such size and stride won't change the stride of the tensor since stride for dimensions of size 1 is ignored
96ed37ac13/torch/_inductor/ir.py (L5451)
The Conv kernel, however, considers such a tensor **contiguous** instead of channels last, thus the output of the Conv kernel will be in contiguous format.
96ed37ac13/aten/src/ATen/native/ConvUtils.h (L396-L404)
To align the behavior of the Conv kernel, we set the output_stride in such case to be contiguous instead of channels last.
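A small sketch of the ambiguity for `channels == 1` inputs (plain eager code, independent of Inductor):
```
import torch

# For C == 1 the same strides satisfy both contiguity checks, so requiring a
# channels-last stride order is effectively a no-op on such inputs.
x = torch.randn(8, 1, 28, 28)
print(x.stride())                                          # (784, 784, 28, 1)
print(x.is_contiguous())                                   # True
print(x.is_contiguous(memory_format=torch.channels_last))  # True
```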
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121400
Approved by: https://github.com/jgong5, https://github.com/jansel
Summary: `has_triton` causes some import time cycles. Lets use `has_triton_package` which is enough.
Test Plan:
```
buck2 test 'fbcode//mode/opt' fbcode//fblearner/flow/projects/model_processing/pytorch_model_export_utils/logical_transformations/tests:filter_inference_feature_metadata_test -- --exact 'fblearner/flow/projects/model_processing/pytorch_model_export_utils/logical_transformations/tests:filter_inference_feature_metadata_test - test_collect_features_from_graph_module_nodes (fblearner.flow.projects.model_processing.pytorch_model_export_utils.logical_transformations.tests.filter_inference_feature_metadata_test.FilterInferenceFromFeatureMetadataTest)'
```
now passes
Differential Revision: D55001430
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122059
Approved by: https://github.com/aakhundov
Summary: Unless we register triton to be a special import, FX graph import mechanism imports it as `from fx-generated._0 import triton as triton` which is obviously broken.
Test Plan:
I could not figure out how to write a test for this but
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//tgif/lib/tests/gpu_tests:lowering_pass_test -- -r test_default_ait_lowering_multi_hardwares
```
now passes
Differential Revision: D54990782
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122041
Approved by: https://github.com/aakhundov
We would like to improve consistency for nn_module_stack metadata in torch.export.
This PR ensures that all tests in test/export/test_export.py has the following constraints:
- Remove nn_module_stack for all placeholder & output nodes, for all modules and submodules
- Ensure nn_module_stack is present for all other node types for the top-level module (there is still an issue with torch.cond submodules having empty fields)
- Add these checks to _export() in _trace.py (we would add this in the Verifier, but downstream apps construct ExportedPrograms separate from _export(), and metadata may not be maintained there)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120661
Approved by: https://github.com/avikchaudhuri
Copying a scalar 0 tensor on CPU to GPU or constructing a scalar 0 tensor on GPU requires a CPU sync with the GPU. Let us avoid doing ops that involve it.
`FSDP.clip_grad_norm_` already first checks if all parameters are not sharded and calls into `nn.utils.clip_grad_norm_`, so at the point of the code changes, there are guaranteed to be some sharded parameters.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122001
Approved by: https://github.com/wanchaol
This PR addresses the issue identified in #121920. The existing problem is that all tests are deemed mandatory if none are selected as required. This behavior is particularly noticeable during a force merge operation.
In the context of a force merge, it may not be necessary to execute any tests which are not required (imo). However, this proposed change could be seen as controversial, hence it has been separated from the main update for further discussion and review.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121921
Approved by: https://github.com/huydhn
ghstack dependencies: #121920
Summary: During key calculation for FX graph caching: Rather than specialize on "small" vs. "large" tensor constants (i.e., inlined vs. not inlined), always hash on the tensor value. Doing so avoids the complication of trying to later attach the constant values as attributes to an already-compiled module. Instead, different constants will cause an FX graph cache miss and we'll just compile.
Test Plan: New unit test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121925
Approved by: https://github.com/eellison
Enable ASGD foreach optimizer and add DTensor optimizer unit test for ASGD.
Note that we need to investigate why when using ASGD we need higher atol and rtol when comparing model parameters. Listing it as a TODO now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121942
Approved by: https://github.com/wanchaol
This PR adds support for 2D `clip_grad_norm_` (`foreach=True`).
- This PR changes `OpSchema.args_spec` to use pytree if the runtime schema info specifies it.
- This PR includes a unit test for 2D FSDP2 + SP with `clip_grad_norm_` enabled, which serves as a complete numerics test for 2D.
Note: With this PR patched, 2-way SP + 4-way FSDP matches 8-way FSDP numerics on Llama-7B (doubling local batch size for the 2-way SP run).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121945
Approved by: https://github.com/wanchaol
ghstack dependencies: #121747, #121869
Summary: The handling of the current_callable and compiled_artifact fields in the CompiledFxGraph object is unnecessarily complicated and confusing. We can simplify by storing only the callable. That field is not serializable, so the caching approach is to store a path to the generated artifact and reload from disk on a cache hit. We can just reload inline in the FX cache hit path. This change has the added benefit that it makes it easier to fallback to a "cache miss" if the path somehow doesn't exist.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121903
Approved by: https://github.com/eellison
The inductor lowering code for viewing a tensor as a type with a different bitwidth currently doesn't generate valid triton code. This change looks at the source and destination dtypes and, if their sizes differ, falls back to the eager-mode aten implementation. Prior to this change, this condition would throw an exception.
Fixes #120998.
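For reference, a minimal eager sketch of the kind of op being lowered (unrelated to the Inductor internals themselves):
```
import torch

# Viewing float32 data as a narrower dtype changes the size of the last dim:
# 4-byte elements reinterpreted as 2-byte elements double it.
x = torch.randn(4, 2)
y = x.view(torch.int16)
print(y.shape)  # torch.Size([4, 4])
```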
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121786
Approved by: https://github.com/peterbell10, https://github.com/bertmaher
The error message shown when input aliasing is detected in `while_loop_func` may not have the correct `fn_name`, as it is set only in the previous for loop. This change merges the two loops so that `fn_name` has the correct value.
No Issue Number for this minor change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121601
Approved by: https://github.com/albanD
This adds some basic comm tests to test_tp_examples. This validates that the expected distributed calls are being made for `test_transformer_training`.
Fixes#121649
Test plan:
```
pytest test/distributed/tensor/parallel/test_tp_examples.py -k test_transformer_training
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121669
Approved by: https://github.com/wanchaol
This PR addresses an issue with the trymerge function for executorch, which currently uses Facebook CLA instead of Easy CLA. This bug has been patched in #121921. However, the patch is potentially controversial, and we still want to verify Facebook CLA if it exists. Therefore, this PR includes Facebook CLA in our set of mandatory checks.
Additionally, this PR removes Facebook CLA from one of the mocks. This change is necessary because the specific PR used for testing fails due to the presence of Facebook CLA in the mock.
## Testing:
We run `find_matching_merge_rule(pr = GitHubPR("pytorch", "executorch", 2326), skip_mandatory_checks=True, skip_internal_checks=True)` to check if things work
https://pastebin.com/HHSFp2Gw
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121920
Approved by: https://github.com/huydhn
`python benchmarks/dynamo/microbenchmarks/dynamo_microbenchmarks.py`
- Before: `symbolic_convert_overhead_stress_test: 10.7s`
- After: `symbolic_convert_overhead_stress_test: 8.6s`
`tx.step()` is a small part of that benchmark, so likely the speedup in that isolated function is larger than the top line.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121790
Approved by: https://github.com/oulgen
This PR rewrites the stack strategy to be more generalized; basically,
stack/cat-like follow strategies need to be smarter, i.e. they
should be able to identify:
1. PR, PP, RP -> follow PP
2. RR, SR, RS -> follow SS
So this PR refactors how the follow strategy should work and makes sure
we start following the strategy that incurs the lowest cost, i.e. for
multiple PR/RP placements, we should be able to further delay the
pending sum reductions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121869
Approved by: https://github.com/awgu
Summary: Refer to OSS PR for details
Test Plan: CI
Differential Revision: D54812833
In pre-dispatch export, we have a special proxy torch mode where we intercept the torch._C._set_grad_enabled op to correctly capture the user's intention on train/eval. However, this is a bit problematic when we are tracing torch.cond during export, as it calls torch.compile internally. As a result, we end up capturing unwanted autograd context manager calls that happen inside dynamo framework code, because the top-level tracer is still active. We fix it by turning off this proxy torch mode. We can still capture autograd ops inside cond branches because dynamo will translate them into a HOP for us, so we don't have to intercept with the special proxy mode.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121736
Approved by: https://github.com/anijain2305, https://github.com/ydwu4
I realized there's a bug when unlifting buffer mutations in AOTI.
However, there seems to be a bug during tracing where AOTI mutates the buffer. I didn't take the time to investigate, so I left it as a TODO for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121688
Approved by: https://github.com/chenyang78
Summary:
tldr: User calls to `torch.autograd.profiler.record_function` fail when tracing with non-strict pre-dispatch export due to an effect token failure, so the solution is to skip over these operators 😅
Some user code contains calls to a `torch.autograd.profiler.record_function` context, like https://fburl.com/code/uesgknbq and https://fburl.com/code/iogbnsfw, which is used for adding user-defined events into the profiler.
Currently these function calls will be skipped/removed in dynamo (https://fburl.com/code/fkf7qmai) but **non-strict pre-dispatch export** will hit these operators during tracing. However, it seems that although these operators get hit by the dispatcher, they don't actually show up in the final graph (maybe they get DCE-d).
However, an issue comes up with a recent change with effect tokens (D54639390) which creates tokens if it sees a ScriptObject during tracing. The operator `torch.ops.profiler.record_function_exit` takes in a ScriptObject, so the effect tokens framework now tries to add an effect token to this operator, but results in the following error: (https://www.internalfb.com/intern/everpaste/?handle=GI-hvBknzj2ZxYkBABNzdztDxJVAbsIXAAAB, P1195258619)
The reason is that this operator only gets hit during pre-dispatch tracing, not post-dispatch tracing. During pre-dispatch tracing, we first trace using post-dispatch to collect the metadata needed for functionalization, and then we do pre-dispatch tracing to construct the graph. The metadata collection phase is also when we determine which operators need effect tokens and create those tokens. However, since the operator only shows up in pre-dispatch tracing, we do not create any tokens for it. During the actual pre-dispatch tracing to create the graph, we then run into this operator and try to get a token, but none exists, causing an error :(
This PR just blocks the record_function operator from being looked at by the effect tokens framework. But a proper fix might be to have functionalization run on the pre-dispatch graph or have the operator also show up in the post-dispatch graph. But since in the PT2 stack dynamo just gets rid of this operator so that it won't show up anywhere downstream, I think we can also just ignore this operator.
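For context, a minimal sketch of the user pattern in question (eager code only, not the export internals):
```
import torch

def f(x):
    # This context hits the profiler record_function enter/exit ops during
    # pre-dispatch tracing, which is what the effect-token logic stumbled on.
    with torch.autograd.profiler.record_function("my_block"):
        return x.sin() + 1

print(f(torch.randn(3)))
```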
Test Plan: Fixed test for P1195258619
Differential Revision: D54857444
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121829
Approved by: https://github.com/BoyuanFeng, https://github.com/tugsbayasgalan
Namely, it adds the `s3-bucket` argument (with the default value set to `gha-artifacts`) to the following workflows:
- _docs
- _linux-test workflows
- download-build-artifacts
- pytest-cache-download
- upload-test-artifacts
This prerequisite is required in order to start migrating to other S3 buckets for asset storage; it is one of the required steps to migrate to ARC and move our assets away from our S3 to the Linux Foundation S3.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121907
Approved by: https://github.com/malfet
To prepare for FSDP2 + TP/SP in torchtrain, we should verify that we can resume training correctly with DCP save/load. For loading into a new model/optimizer instance, torchtrain uses lightweight `ModelWrapper` and `OptimizerWrapper`. In the added unit test, we use `get_optimizer_state_dict` directly to show the minimal requirement for correctness.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121747
Approved by: https://github.com/wz337
Summary:
This diff periodically (e.g., every 30s) logs critical collective
progress status to a Scuba table, starting with a few metrics such as the last
enqueued seq id.
With the Scuba table, it is our hope that we can easily detect the straggler of a PG,
e.g., the rank that has not progressed its seq_ for X seconds while other ranks in the same PG have a larger seq_.
The implementation needs to make sure that Scuba will be used only for FB internal use
cases.
For OSS, we still provide a generic logger data struct and logger that can be
easily extended. If users do not register the logger, nothing will be logged.
Test Plan:
Re-use the existing unit test for the FB side of operations, such as
test_register_and_dump in test_c10d_manifold, and change the dump period to a
very small number, e.g., 1ms; verified that the logs are correctly shown in the Scuba table:
https://fburl.com/scuba/c10d_work_update/9trhwnmy
Reviewed By: wconstab
Differential Revision: D54556219
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121859
Approved by: https://github.com/wconstab
TestReductionsCUDA.test_nansum_out_dtype_cuda_float32 would fail or pass depending on the random inputs. Observed by ROCm internal QA testing, but the same problematic random inputs break the test for CUDA as well, verified on V100.
There is precedent in another test within the same file to relax tolerance.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121550
Approved by: https://github.com/albanD
The current dynamic shapes implementation fixes the lower bound of Dims to 2 for analysis but allows 0/1 shapes at runtime. This leads to failures when initializing Dim(1, 2). This PR sets the lower bound to 0 and avoids erroring out when it conflicts with the generated (2, maxsize) constraint during analysis.
Also resolves a derived dim constraints issue with the following code:
```
class Bar(torch.nn.Module):
def forward(self, x, y):
return x + y[1:]
dx = Dim("dx", min=1, max=3)
ep = export(
Bar(),
(torch.randn(2, 2), torch.randn(3, 2)),
dynamic_shapes=({0: dx, 1: None}, {0: dx+1, 1: None})
)
print(ep.range_constraints)
```
In main:
```
{s0: ValueRanges(lower=2, upper=3, is_bool=False), s0 + 1: ValueRanges(lower=3, upper=4, is_bool=False)}
```
This PR:
```
{s0: ValueRanges(lower=1, upper=3, is_bool=False), s0 + 1: ValueRanges(lower=2, upper=4, is_bool=False)}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121642
Approved by: https://github.com/avikchaudhuri
This PR enables DDP + TP using a TP internal API. This should not be the final implementation. A more sound implementation is to inline the TP internal API in DDP. In other words, DDP needs to be aware of DTensor so that we can support 2D state_dict.
This PR adds a compiled DDP + TP test to ensure the new compiled DDP fusion doesn't break TP all_reduce.
**TODOs**
- [x] Implement DDP allreduce fusion algorithm for Inductor post_grad pass.
- [x] Add unit tests to ensure the fusion doesn't break DDP + TP.
- [ ] Group different PG and data type of all_reduces.
- [ ] Mixed precision supports and tests
- [ ] Implement the fusions with Inductor IR.
- [ ] Add auto bucketing based on Inductor profiling.
Differential Revision: [D54105050](https://our.internmc.facebook.com/intern/diff/D54105050/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120479
Approved by: https://github.com/wz337
ghstack dependencies: #113209
Summary:
Deserialization didn't populate ShapeEnv's `var_to_val` field properly, and AOTInductor relies on this field to compile dynamic shapes properly.
As a result, AOTI failed when compiling a deserialized ExportedProgram.
Test Plan: buck2 test mode/dev-nosan caffe2/test/inductor/fb:test_aot_inductor_pt2_inference
Differential Revision: D54559494
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121759
Approved by: https://github.com/avikchaudhuri
- Adds support for custom ops backed by c++ custom autograd functions, e.g. fbgemm
- Include files more granularly to avoid namespace pollution and circular imports
limitations:
- requires the user to audit their code and opt in their custom autograd::Function via autograd::Function::is_traceable, and maybe an additional compiled_args + apply_with_saved implementation. This was the only way I could think of for soundness
- will throw if we can't hash the saved_data i.e. for any non implemented type other than list and dict in at::IValue::hash b0cfa96e82/aten/src/ATen/core/ivalue.cpp (L364)
- can technically silently fail if both the typeid hash and the typeid string name of the custom autograd::Function collide at the same time, and an identical autograd graph containing a different custom autograd::Function, yet that has an identical implementation, is called. this case seems extremely unlikely, and the only alternative to hash collision i can think of is compiling with reflection
- tensors not saved via save_variables are not lifted, and are specialized on TensorImpl*'s hash (treated as a memory address). if needed, we can lift them.
Differential Revision: [D54818488](https://our.internmc.facebook.com/intern/diff/D54818488)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120681
Approved by: https://github.com/jansel
Differential Revision: [D49858057](https://our.internmc.facebook.com/intern/diff/D49858057/)
**TL;DR**
This PR implements 2 different DDP all_reduce fusions in Inductor post_grad fx passes. The two fusions are 1) fusion with concat op and 2) fusion with all_reduce_coalesced. When DDP detects that Python reducer is being used, DDP will automatically turn on the fusion.
This PR does not invent any algorithm and simply reflects the bucket size users set to DDP.
**Implementation Details**
*Fusion with concat op*
The idea of this fusion is to use a concat op to concatenate all the gradients into one tensor and perform one `all_reduce`. After the `wait` op of the `all_reduce`, splitting and reshaping are performed to get the individual gradients back.
Because DDP needs to perform gradient scaling, the benefit of using this fusion is that we can perform the gradient scaling once over the concatenated buffer.
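A conceptual, single-process sketch of the concat fusion (the real pass operates on Inductor post-grad FX graphs; the collective call is elided here):
```
import torch

grads = [torch.randn(3, 4), torch.randn(5)]

# 1. concatenate all gradients into one flat buffer
flat = torch.cat([g.reshape(-1) for g in grads])
# 2. scale once over the concatenated buffer (world_size of 2 assumed)
flat = flat / 2.0
# 3. a single dist.all_reduce(flat) would happen here in a real run
# 4. split and reshape back into the individual gradients
sizes = [g.numel() for g in grads]
fused = [chunk.view_as(g) for chunk, g in zip(flat.split(sizes), grads)]
print([t.shape for t in fused])  # [torch.Size([3, 4]), torch.Size([5])]
```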
*Fusion with `all_reduce_coalesced`*
The idea of this fusion is to use the `all_reduce_coalesced` op to directly perform the `all_reduce` over multiple buffers. This avoids the copy overhead but may not achieve the best NCCL performance. In addition, because there are multiple buffers, we cannot do one simple gradient scaling but have to rely on `foreach_div` to help with the gradient scaling.
**Limitations**
Current fusions do not distinguish `all_reduce` generated by different DDP modules. This is okay if all DDP instances use the same PG and data type. The support of multiple DDP instances with different PG and data type will come in the later PRs.
**TODOs**
- [x] Implement DDP allreduce fusion algorithm for Inductor post_grad pass.
- [ ] Add unit tests to ensure the fusion doesn't break DDP + TP.
- [ ] Group different PG and data type of `all_reduce`s.
- [ ] Mixed precision supports and tests
- [ ] Implement the fusions with Inductor IR.
- [ ] Add auto bucketing based on Inductor profiling.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113209
Approved by: https://github.com/yf225
This PR fixes Issue #111279.
While #111279 reported the issue with `MultiheadAttention`, a minimal reproduction would be:
```python
class ToyModel(nn.Module):
def __init__(self,):
super().__init__()
self.linear = nn.Linear(128, 10)
def forward(self, x: torch.Tensor) -> torch.Tensor:
return self.linear.forward(x) # Error
# return self.linear(x) # OK
```
Dynamo treats `self.linear(x)` as `call_module` while treating `self.linear.forward(x)` as a [`get_attr` and a `call_method`](https://github.com/pytorch/pytorch/blob/main/torch/_dynamo/variables/nn_module.py#L358-L378). However, existing DDPOptimizer assumes, for a `get_attr` node, `getattr(gm, node.target)` gives a tensor with the `requires_grad` attribute. Existing DDPOptimizer also does not support `call_method` nodes.
This PR adds support for `call_method` and check on `get_attr`. It also checks if a module's parameters have been added to a bucket to support multiple method calls from the same module.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121771
Approved by: https://github.com/yf225
This is very confusing when checking memory usage while allocations are only happening through the C API. We should change it to a warning/error or just init CUDA. Codepaths that run on non-CUDA environments shouldn't call into these functions in the first place.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121698
Approved by: https://github.com/jansel
Summary: instantiate_device_type_tests() creates dynamic test case classes that derive from a "template class". By default, the test harness will call the setUpClass() and tearDownClass() methods defined by the template class (if the template class defines them). We can explicitly create these methods in the dynamic class and arrange to call those methods in both base classes. That allows us to support setUpClass & tearDownClass test classes used with instantiate_device_type_tests().
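A minimal sketch of the now-supported pattern (hypothetical test class; only the structure matters):
```
import torch
from torch.testing._internal.common_device_type import instantiate_device_type_tests
from torch.testing._internal.common_utils import TestCase, run_tests

class TemplateWithClassSetup(TestCase):
    @classmethod
    def setUpClass(cls):
        super().setUpClass()
        cls.shared = torch.ones(3)  # expensive, class-level setup

    def test_uses_shared_state(self, device):
        self.assertEqual(self.shared.sum().item(), 3.0)

# the generated per-device classes now honor the template's setUpClass/tearDownClass
instantiate_device_type_tests(TemplateWithClassSetup, globals())

if __name__ == "__main__":
    run_tests()
```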
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121686
Approved by: https://github.com/ezyang, https://github.com/eellison
Two main changes:
- Don't rethrow the exception when we fail in TV, just throw the entire
thing and trust the user will inspect logs / backtrace to see we
failed in TV
- Don't add an event to the TV logs until we've confirmed that the event
actually runs without erroring. This prevents us from recording events
that, e.g., fail because of a guard on a data-dependent size and then
fail again in TV.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120880
Approved by: https://github.com/lezcano, https://github.com/ysiraichi
Summary:
Original commit changeset: e52b8809c8d8
Original Phabricator Diff: D54778906
We have to backout this diff.
D54778906 seems to be causing test failures for APF blocking trunk health and hence release. Just starting to look at the issue. T182209248
Test Plan: Sandcastle
Reviewed By: satgera
Differential Revision: D54825114
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121763
Approved by: https://github.com/osalpekar
This PR Move `operator<<` of `BFloat16` to `BFloat16.h`.
Previously, this function was in `TensorDataContainer.h`. If you need to `std::cout` a `BFloat16` variable when debugging, `TensorDataContainer.h` has to be included. This is inconvenient and counterintuitive.
Other dtypes, such as `Half`, define their `operator<<` in the headers where they are defined, such as `Half.h`. Therefore, I think it makes more sense to move `operator<<` of `BFloat16` to `BFloat16.h`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121302
Approved by: https://github.com/ezyang
This PR changes the upload artifact step of the wheels and conda build to write
each matrix entry to a different file. This is because updating the same file
from multiple jobs can be flaky as is warned in the docs for upload-artifact
> Warning: Be careful when uploading to the same artifact via multiple jobs as artifacts may become corrupted. When uploading a file with an identical name and path in multiple jobs, uploads may fail with 503 errors due to conflicting uploads happening at the same time. Ensure uploads to identical locations to not interfere with each other.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121733
Approved by: https://github.com/huydhn
ghstack dependencies: #121268
Right now the logic is mostly duplicated between `test_output_match` and `test_output_gradient_match`.
So move the tolerance definition logic into a shared `_compute_tolerances` function and
only keep the differences (for example, grad checks are completely skipped for `torch.unique`) in the respective test functions.
Also, increase the tolerance for `pow` and `__rpow__` only on MacOS 13.3 or older and remove the GRAD xfail list entries for those.
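A hypothetical sketch of what the shared helper could look like (op names are taken from this PR, but the structure and threshold values are illustrative only, not the actual test code):
```python
import torch

def _compute_tolerances(op_name, dtype):
    # Illustrative per-op / per-dtype overrides; the real values live in the MPS tests.
    if op_name in ("pow", "__rpow__"):
        return 1e-2, 1e-2  # atol, rtol
    if dtype == torch.float16:
        return 1e-3, 1e-3
    return 1e-5, 1e-5

# Both test functions would then do something like:
# atol, rtol = _compute_tolerances(op.name, dtype)
# torch.testing.assert_close(mps_out, cpu_out, atol=atol, rtol=rtol)
```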
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121754
Approved by: https://github.com/albanD
Summary: The goal is to make `test_argmax` and `test_reduce_sum` work both before and after https://github.com/openai/triton/pull/3191 is included in the Triton pin. This is important to make those tests work during the Triton pin update process, both in OSS and internally.
Test Plan:
```
$ python test/inductor/test_triton_kernels.py -k test_reduce_sum -k test_argmax
..
----------------------------------------------------------------------
Ran 2 tests in 1.906s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121753
Approved by: https://github.com/Skylion007
Summary:
with a simple bench in TestDeserializer.test_basic function:
```
time_start = time.time()
for i in range(1000):
    self.check_graph(MyModule(), inputs)
warnings.warn(f"time_taken: {time.time() - time_start}")
```
and forcing FakeTensorConfig.debug to True, record_stack_traces to True, and the logging level to debug, it shows that the changed code is consistently around 20 secs faster (~90s vs. originally ~110s)
Test Plan:
test passed, see summary
compared debug trace before and after:
- exactly the same for fake tensor and proxy callsite https://www.internalfb.com/intern/diffing/?paste_number=1189883685
- slightly different for the user frame in proxy node https://www.internalfb.com/intern/diffing/?paste_number=1189884347
Differential Revision: D54237017
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121449
Approved by: https://github.com/angelayi
This PR corrects the AOTInductor example, which currently fails with:
```
/home/ubuntu/test/inference.cpp:21:62: error: cannot bind non-const lvalue reference of type ‘std::vector<at::Tensor>&’ to an rvalue of type ‘std::vector<at::Tensor>’
21 | std::cout << runner.run({torch::randn({2, 10}, at::kCPU)})[0] << std::endl;
|
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121672
Approved by: https://github.com/desertfire
Putting this PR as an RFC since I have resorted to some horrible hacks in order to make this work.
```
(Pdb) p triton.language.float32
triton.language.fp32
(Pdb) p str(triton.language.float32)
'fp32'
(Pdb) p repr(triton.language.float32)
'triton.language.fp32'
```
This means that we need to "rewrite" them for fx graph and inductor execution.
This PR allows Mamba2 to work with `torch.compile`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121690
Approved by: https://github.com/Skylion007
Add these logs to debug the regression of the accuracy test for the dm_nfnet_f0 model in training.
With these extra logs, when the accuracy check fails, we can verify whether it's close to succeeding or not. If yes, that indicates there is no real issue, just flakiness, and we can probably tune the tolerance to fix it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121656
Approved by: https://github.com/jansel, https://github.com/Skylion007
Summary:
`DECLARE_DISPATCH` is shadowing the variable data with the data type:
`extern TORCH_API struct name name` -> `extern TORCH_API struct gemm_stub gemm_stub` for instance.
This is probably dangerous behavior to rely on, as the compiler needs to always resolve to the type and/or the data based on context. The previous macro fails with VS2022.
Test Plan: `buck2 build arvr/mode/win/vs2022/cpp20/opt //xplat/caffe2:aten_pow_ovrsource`
Differential Revision: D54699849
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121659
Approved by: https://github.com/albanD
Summary:
# Why?
Right now I'm running into a case where `itype` is `torch.fx.immutable_collections.immutable_list`, which is a subclass of `list`. However, currently we're checking the concrete types (i.e. `list`), and `immutable_list` isn't explicitly supported here.
Thus, we use a runtime check that looks at the subclass so we can support subclasses -- such as immutable_list -- as well.
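For illustration, a small standalone snippet of the difference between the concrete-type check and the subclass-aware runtime check (not the actual code from this diff):
```python
from torch.fx.immutable_collections import immutable_list

xs = immutable_list([1, 2, 3])

print(type(xs) is list)      # False: the concrete-type check rejects the subclass
print(isinstance(xs, list))  # True: the subclass-aware check accepts immutable_list
```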
Test Plan: ci
Differential Revision: D54764829
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121662
Approved by: https://github.com/aakhundov
This PR fixes a redistribute bug by moving the early-return check into the
redistribute autograd function, so that even though we redistribute to the
same placement, the grad_placements from the `to_local` call might be
different, and the redistribute backward still needs to happen.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121653
Approved by: https://github.com/awgu
Refactor the release-only changes into a two-step execution.
1. Step ``tag-docker-images.sh``. Tags the latest docker images for the current release. This step takes about 30 min to complete and may fail due to space issues on the local host or HTTP connection issues when pulling images; hence it should be rerun if it fails.
2. Step ``apply-release-changes.sh``. Applies the release-only changes and prepares a PR with them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121728
Approved by: https://github.com/jeanschmidt
It looks like it was commented out because the original implementation was not sufficiently portable. I had to do some rewrites to the innards to make it now portable. No Windows nanoseconds support because I'm lazy.
I tested by running `build/bin/TCPStoreTest` and observing the log messages there. I am actually not sure how to look at the log messages from Python though.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121384
Approved by: https://github.com/Skylion007, https://github.com/malfet
Summary:
`np.asscalar` was deprecated and removed in a recent Numpy. It used to be implemented the following way, and the recommended alternative is to call `item()` directly:
```python
def asscalar(a):
    return a.item()
```
This fixes all of the references.
Test Plan: visual inspection and automated tests
Differential Revision: D54697760
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121545
Approved by: https://github.com/malfet
Summary:
## Context
This changeset lays the foundations for supporting dynamic shapes in the ExecuTorch Vulkan delegate via allowing Tensors to be resized in one of two ways:
1. Discarding underlying `vkImage` or `vkBuffer` and reallocating a new `vkImage` or `vkBuffer` with updated sizes. This method is intended to be used when the current `vkImage` or `vkBuffer` is not large enough to contain the new sizes.
2. Update the tensor's size metadata without reallocating any new resources. This allows shaders to interpret the underlying `vkImage` or `vkBuffer` as if it were smaller than it actually is, and allows command buffers to be preserved when sizes are changed.
Test Plan: Check CI. Tests have also been added to `vulkan_compute_api_test` that test the two methods of tensor resizing.
Differential Revision: D54728401
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121598
Approved by: https://github.com/jorgep31415
Summary:
We don't want people to move to NCCL exp without explicit opt in. It seems that sparse allreduce was accidentally called and people were confused whether they should use NCCL exp instead.
Update the error message to explicitly say that sparse_allreduce is not supported.
Test Plan: sandcastle
Differential Revision: D54759307
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121644
Approved by: https://github.com/awgu
Summary:
X-link: https://github.com/pytorch/executorch/pull/2308
Note: The initial purpose of this PR is to draw suggestion and feedback regarding better alternative, if any.
At present, dequantize op for decomposed quantized Tensor representation e.g. dequantize_per_tensor() assumes the output dtype as torch.float and hence, it does not have the output dtype in its operator argument list. However, this op signature becomes unusable when the assumption breaks. Because, in case the output dtype is different from torch.float, there is no way to specify the same during dequantization.
This change is aimed at generalizing the signature of dequantize ops like dequantize_per_tensor() for wider use-cases where the output dtype can be different from torch.float and needs to be passed during dequantization. The proposal is to use an additional argument named 'output_dtype' to solve the problem. However, we would also like to have suggestions and feedback regarding any better alternative that can be used instead.
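As a rough illustration of the proposal, here is a pure-Python reference sketch of "dequantize with an explicit output dtype" (this is not the actual op schema; the function name and math are just for intuition):
```python
import torch

def dequantize_per_tensor_ref(q, scale, zero_point, output_dtype=torch.float):
    # The caller chooses the result dtype instead of it being hard-coded to torch.float.
    return ((q.to(torch.float) - zero_point) * scale).to(output_dtype)

q = torch.tensor([10, 20, 30], dtype=torch.int8)
print(dequantize_per_tensor_ref(q, scale=0.5, zero_point=0, output_dtype=torch.bfloat16))
```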
cc jerryzh168 jianyuh raghuramank100 jamesr66a vkuzo jgong5 Xia-Weiwen leslie-fang-intel
Reviewed By: digantdesai
Differential Revision: D53590486
Pulled By: manuelcandales
Co-authored-by: kausik <kmaiti@habana.ai>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121450
Approved by: https://github.com/jerryzh168
Summary: Previously, we bailed out of the Triton kernel analysis pass when seeing a `tt.reduce` op. In this PR, we support the op and don't bail out anymore.
Test Plan: This is a bit tricky, as the extension is added to the MLIR walk-based analysis code path which is active only on when the MLIR bindings added in https://github.com/openai/triton/pull/3191 are available. So for now I've run the `test_argmax` and `test_reduce_sum` manually with a newer Triton version than the current pin. When pin updates, we'll make those tests official (left a TODO comment).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121706
Approved by: https://github.com/jansel
Reduces the `torch.compile(backend="eager")` for this code
~~~
def fn(x):
    for _ in range(10000):
        # x = torch.sin(x)
        x = torch.ops.aten.sin(x)
        # x = sin(x)
    return x
~~~
From 18 seconds to 12 seconds.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121031
Approved by: https://github.com/jansel
In this PR, we create another dynamic test class for the TestExport tests that basically serializes/deserializes the pre-dispatch IR. I encountered 4 additional failures, but 3 of them are due to a different operator showing up in the graph, and there is only one legitimate failure, which is tracked by another task internally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121678
Approved by: https://github.com/angelayi
ghstack dependencies: #121652
This PR enables `test_addmm_sizes_all_sparse_csr_k_*_n_*_m_*_cuda_complex128` for ROCm for trivial cases (m or n or k = 0)
CUSPARSE_SPMM_COMPLEX128_SUPPORTED is also used for `test_addmm_all_sparse_csr` and `test_sparse_matmul`, and both of them are skipped for ROCm by `@skipIfRocm` or `@skipCUDAIf(not _check_cusparse_spgemm_available())`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120504
Approved by: https://github.com/jithunnair-amd, https://github.com/ezyang
This patch addresses the major limitations in our previous [PR #115981](https://github.com/pytorch/pytorch/pull/115981) through the new dedicated repository [AOTriton](https://github.com/ROCm/aotriton)
- [x] Only supports MI200 series GPUs (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`).
  * MI300X is now supported. More architectures will be added once Triton supports them.
- [x] Only supports power-of-two sequence lengths.
  * Arbitrary sequence lengths are now supported.
- [ ] No support for varlen APIs.
  * varlen APIs will be supported in the next release of AOTriton.
- [x] Only supports head dimensions 16, 32, 64, 128.
  * Arbitrary head dimensions <= 256 are now supported.
- [x] Performance is still being optimized.
  * Kernels are now selected according to autotune information from Triton.
Other improvements from AOTriton include
* Allow more flexible Tensor storage layout
* More flexible API
This is a more extensive fix to #112997
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121561
Approved by: https://github.com/malfet, https://github.com/atalman
Summary: For OEMAE, this contributes 14% of the total DPER pass perf gain.
Test Plan:
Run test cases
Run the oemae lower benchmark with and without this fix. FLOP/s 29 -> 34.
Reviewed By: frank-wei
Differential Revision: D54711064
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121674
Approved by: https://github.com/frank-wei
Eventually, we should just have one unified way to check for parity between a `DTensor`-sharded model and a replicated model. This PR is a small refactor to work toward that. One current gap to use this `check_sharded_parity` function for 2D is that FSDP's `(Shard(0), Shard(0))` layout differs from that of the `DTensor` APIs since FSDP shards on dim-0 after TP shards on dim-0.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121357
Approved by: https://github.com/weifengpy
ghstack dependencies: #121360
Introduce `conditional_gil_scoped_release` and use it in `wrap_pybind_function*` to avoid a runtime branch making the code cleaner and faster.
@albanD This is the GIL change extracted from #112607 as discussed.
Also fixes a potential use of a moved-from object introduced in #116560:
- `f` is captured by value in a lambda that may be called multiple times
- After `std::move(f)` the lambda is not safe to call anymore
CC @cyyever for that change
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116695
Approved by: https://github.com/albanD, https://github.com/Skylion007
Summary: I plan to enable the FX graph cache for more inductor unit tests. This PR does some refactoring to prepare by moving the `TestCase` base class to `torch._inductor.test_case` (which mirrors the existing `torch._dynamo.test_case`). In a subsequent diff, I'll modify tests importing `torch._dynamo.test_case.TestCase` to use `torch._inductor.test_case.TestCase` instead.
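For reference, a minimal sketch of what a migrated test file would look like after this refactor (test name and body are placeholders; `run_tests` is taken from the existing `torch._dynamo.test_case` module):
```python
import torch
from torch._dynamo.test_case import run_tests
from torch._inductor.test_case import TestCase

class MyInductorTest(TestCase):
    def test_smoke(self):
        # placeholder assertion; real tests would exercise torch.compile / Inductor
        self.assertEqual(torch.ones(2).sum().item(), 2.0)

if __name__ == "__main__":
    run_tests()
```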
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121520
Approved by: https://github.com/eellison
Summary: Does not change the weights structure, so it is compatible with const folding and realtime weights updates
Test Plan: run added test cases
Reviewed By: frank-wei
Differential Revision: D53843428
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121617
Approved by: https://github.com/frank-wei
Summary: Taking the right-most part of the FQN will cause name conflicts when there are multiple instances of the same class. Changed to replace "." in the FQN with "_" to avoid invalid syntax in the input args.
Test Plan: added test case
Differential Revision: D54435230
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121145
Approved by: https://github.com/zhxchen17
After this, the sam_fast benchmark can now be run in the pytorch repo:
```
SEGMENT_ANYTHING_FAST_USE_FLASH_4=0 benchmarks/dynamo/torchbench.py --inference --amp --performance --backend=inductor --explain --only sam_fast
```
sam_fast is designed for inference only, with cuda and amp on. The code adds these restrictions to the benchmark.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121420
Approved by: https://github.com/oulgen, https://github.com/msaroufim
Update the torch_xla pin to a more recent one (03/08/2024). We need to make sure the torch_xla pin stays up-to-date so that pytorch can test against an up-to-date version of torch_xla.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121529
Approved by: https://github.com/atalman
Follow up on #119326 with addressed comment: https://github.com/pytorch/pytorch/pull/119326#issuecomment-1939428705:
> I'd like to propose a slightly different approach. We could check if scipy is version `1.12.0`. If so, overload `scipy_cumulative_trapezoid` with a function that specifically checks `t.shape[axis] == 0`, and in that case return an array of the same shape as `t`, which is the expected behavior as far as I understand. That way, we're not just skipping the test cases
I would like to add that the version check is not necessary as in any case the outcome is the same.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121541
Approved by: https://github.com/nWEIdia, https://github.com/albanD
Fix round robin sharding when there are no test times and sort_by_time=False
Adds more tests to test_test_selections for sort_by_time=False
Adds more checks to test_split_shards_random for serial/parallel ordering + ordering of tests
Refactoring of dup code
Tested locally by running `python test/run_test.py --shard 3 5` with no test times downloaded and checked that it wasn't an empty list.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121022
Approved by: https://github.com/huydhn, https://github.com/osalpekar
This reduces `torch.mv` time for a 256x768 matrix by a 256-element vector from 209 usec to 16 usec for the non-transposed case and from 104 to 18 usec if transposed
Also, add fp16-accumulation flavor to the same ops (controlled by private `torch._C._set_cpu_allow_fp16_reduced_precision_reduction` which yields a slightly better numbers), summarized in the following table
| op | original | F32+NEON | F16+NEON|
| ---| -------- | ---------- | ----- |
| torch.mv(m, v) | 209.53 usec | 16.25 usec | 14.68 usec |
| torch.mv(m.t(), v) | 104.80 usec | 28.68 usec | 24.82 usec |
Test plan: CI on MacOS for both CPU and MPS, testing fp32<->fp16 matmul consistency (for example, "test_output_grad_match_nn_functional_linear_cpu_float16" passes if fp32 reductions are performed, but fails if fp16 accumulation is used)
To investigate:
- why replacing `sum0Vec = vaddq_f32(sum0Vec, vmulq_f32(a0Vec, xVec));` with `sum0Vec = vfmaq_f32(sum0Vec, a0Vec, xVec);` slows down gemv from 16.2 to 18.2 usec
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119992
Approved by: https://github.com/mikekgfb
Make a test that fails on purpose to trigger retries. Check the opposite of success (that env vars exist)
It's a bit hacky because I want it to fail in the normal flow in order to trigger reruns, but I don't want to expose the failures to users since it's confusing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120519
Approved by: https://github.com/huydhn
Currently, when `torch.onnx.dynamo_export` is called within `torch.onnx.enable_fake_mode`, all the external PyTorch checkpoint files used to initialize the model are automatically used by `torch.onnx.ONNXProgram.save` to recreate the initializers for
the newly exported ONNX model.
This API extends the mechanism to HuggingFace models that use safetensors weights. This PR detects safetensors state files and converts them to PyTorch format using mmap on a temporary file, which is deleted after conversion is finished.
Without this PR, the user would have to convert the safetensors files to PyTorch format manually and feed them to `torch.onnx.ONNXProgram.save`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121001
Approved by: https://github.com/BowenBao, https://github.com/malfet
Fixes#118566
Unlike **OpOverload** or **OpOverloadPacket**, there is a lot of complex information in the schema, so to me keeping it as is is probably a good choice, but in theory the **\_\_repr__** function should show the class name as well as some other key information.
If you have any suggestions, please share them, thank you.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121484
Approved by: https://github.com/Skylion007
**Summary**
We should skip the `visualize_sharding()` function on those ranks that are not a part of the DTensor's mesh. If not, an exception will be thrown by the current visualization logic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121382
Approved by: https://github.com/wanchaol
ghstack dependencies: #121385
Summary:
It's very difficult to debug the passes' ineffectiveness with them mingled in one single pass container. Here we extract them into separate passes with diagnostics info.
This is also required for a later change, where we must run shape prop on each of these passes, in order for the subsequent passes to have the correct shape information.
Reviewed By: frank-wei
Differential Revision: D53579545
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121592
Approved by: https://github.com/frank-wei
`hybridCubeMeshAllReduceKernel` uses the latter half of p2p buffers as relay buffers. The relay buffer address is calculated using a bf16 base pointer and the buffer size in byte. The breakage was caused by not taking element size into account.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121575
Approved by: https://github.com/Chillee
This means when codegen depends on a particular import we only need to
add it in one place and it's applied to all triton kernels.
This also changes codegen slightly so instead of generating
`@pointwise` we now generate `@triton_heuristics.pointwise` just so
the imports are the same for all kernel types.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121438
Approved by: https://github.com/lezcano
The async output option was only available in the `full_tensor()` call, but I think it's
generally good to make this option available in the `redistribute` call directly
so that the user can control it.
This PR adds an async_op option to the redistribute call, to allow the user to control
whether to perform the tensor redistribution asynchronously or not.
By default we set this to False, to follow the semantics of the c10d
collectives.
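A hedged usage sketch of the new flag (module paths follow the private `torch.distributed._tensor` namespace used around this PR; running it requires an initialized process group and 4 GPUs):
```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._tensor import distribute_tensor, Replicate, Shard

mesh = init_device_mesh("cuda", (4,))
dt = distribute_tensor(torch.randn(8, 8), mesh, [Shard(0)])

# async_op=True kicks off the collective without waiting; it defaults to False
# to match c10d collective semantics, as described above.
replicated = dt.redistribute(mesh, [Replicate()], async_op=True)
```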
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121477
Approved by: https://github.com/wz337
The necessity of this PR lies in the fact that autograd engine + DDP calls `all_reduce` from C++, so the changes must be made in C++.
```
[rank0]: Traceback (most recent call last):
[rank0]: File "~/complex_ddp.py", line 72, in <module>
[rank0]: main()
[rank0]: File "~/complex_ddp.py", line 64, in main
[rank0]: loss.backward()
[rank0]: File "/home/usr/pytorch/torch/_tensor.py", line 525, in backward
[rank0]: torch.autograd.backward(
[rank0]: File "/home/usr/pytorch/torch/autograd/__init__.py", line 267, in backward
[rank0]: _engine_run_backward(
[rank0]: File "/home/usr/pytorch/torch/autograd/graph.py", line 744, in _engine_run_backward
[rank0]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank0]: TypeError: Input tensor data type is not supported for NCCL process group: ComplexFloat
```
I believe, for minimizing the Python overhead, the same could be done for the rest of the ops, what do you think @kwen2501?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121045
Approved by: https://github.com/eqy, https://github.com/kwen2501
# Context
I believe we have an incorrect guard being created during FakeTensor's binary op fast path.
Consider this case
```
# op.shape: (10, 192); final_shape: (s0, 10, 192)
# Guard Ne(s0, 10) is created when we create SymBool(10 == s0)
if isinstance(op, torch.Tensor) and op.shape == final_shape:
    break
```
As of right now, `op.shape == final_shape` checks whether one of the binary op's operand shapes is the same as the binary op's output shape.
* If one of them is a dynamic shape, then we'll create a guard via`SymBool` creation (i.e. `s0 == 10`).
* If the `SymBool` expr resolves to `false`, then we'll create the guard `Ne(s0, 10)`.
This is a problem when the # of dimensions aren't the same between `op.shape` & `final_shape`. Take the case above for example: `op.shape: (10, 192); final_shape: (s0, 10, 192)`. Although the shapes aren't the same, it doesn't necessarily mean that `s0 != 10`.
Some thoughts (feel free to ignore). What if the # of dimensions are equal but one of the shapes has symbols. Here's three cases:
1. `op.shape: (9000, 10, 192); final_shape: (s0, 10, 192)` -- not broadcastable.
2. `op.shape: (1, 10, 192); final_shape: (s0, 10, 192)` -- 0/1 specialization wins?
3. `op.shape: (100, 10, 192); final_shape: (s0, 10, 192) where s0 = 100` -- Ask user to mark `s0` as a constant.
# Test
```
$ TORCHDYNAMO_VERBOSE=1 PYTORCH_TEST_WITH_DYNAMO=1 pytest -s test/dynamo/test_dynamic_shapes.py -k test_export_fast_binary_broadcast_check_dynamic_shapes
torch.fx.experimental.symbolic_shapes.ConstraintViolationError: Constraints violated (dim0)! For more information, run with TORCH_LOGS="+dynamic".
- Not all values of dim0 = L['a'].size()[0] in the specified range 3 <= dim0 <= 1024 satisfy the generated guard Ne(L['a'].size()[0], 3).
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121546
Approved by: https://github.com/aakhundov
This does not introduce a new test but is tested by checking that all the classes we already have still behave as before now that they don't explicitly disable torch_function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120632
Approved by: https://github.com/ezyang
Let us try to remove this warning 😄 :
```
[rank0]:/data/users/andgu/pytorch/torch/distributed/checkpoint/filesystem.py:150: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
[rank0]: if tensor.storage().size() != tensor.numel():
```
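One possible replacement (not necessarily what this PR does; shown only to illustrate the migration): untyped storage sizes are in bytes, so the element size has to be factored into the comparison.
```python
import torch

tensor = torch.randn(4, 4)
# Equivalent check without TypedStorage: compare sizes in bytes.
if tensor.untyped_storage().nbytes() != tensor.numel() * tensor.element_size():
    print("tensor does not own its full storage")
```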
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121538
Approved by: https://github.com/wz337, https://github.com/fegin
Summary:
The `add_metadata_json` function in the profiler only works when called during trace collection. However, sometimes we want to pass in some user-defined metadata and append it to the trace before trace collection starts, e.g. when the profiler is defined.
This PR adds a function `preset_metadata_json` for this purpose. The preset metadata will be stored and appended to the trace later.
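A hedged usage sketch of the new function (the argument shape is assumed to mirror `add_metadata_json`, i.e. a string key and a JSON-encoded string value):
```python
import torch

prof = torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CPU])
# Attach metadata at definition time, before trace collection starts.
prof.preset_metadata_json("run_config", '{"batch_size": 32}')

with prof:
    torch.randn(128, 128).sum()
```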
Differential Revision: D54678790
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121487
Approved by: https://github.com/aaronenyeshi
to_local accepts a `grad_placements` argument if the user chooses to pass it; previously
we enforced the grad_out to have the "same" placement as the current
DTensor for safety.
But I realized that we DO NOT need to enforce this constraint. Why?
The backward placement does not need to be the same as the forward tensor placement; this
is already the case for param vs param.grad (i.e. the param can be replicate
and the grad can be partial), so we should not restrict this for activation
vs activation grad either.
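A hedged sketch of the call being discussed (module paths follow the private `torch.distributed._tensor` namespace; running it requires an initialized process group and 4 GPUs):
```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._tensor import distribute_tensor, Replicate, Shard

mesh = init_device_mesh("cuda", (4,))
dt = distribute_tensor(torch.randn(8, 8, requires_grad=True), mesh, [Shard(0)])

# The gradient flowing back into `local` is declared as Replicate even though the
# forward DTensor is Shard(0); after this PR that mismatch is allowed.
local = dt.to_local(grad_placements=[Replicate()])
```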
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121474
Approved by: https://github.com/awgu, https://github.com/yoyoyocmu, https://github.com/yifuwang
Summary:
Currently we do not have an easy mechanism to distinguish between models created with some specific config.
We use a warning instead of failing directly.
Test Plan: Messaging change only.
Reviewed By: aakhundov
Differential Revision: D54622522
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121396
Approved by: https://github.com/chenyang78
# Motivation
In the backward pass of per-parameter-sharding FSDP, each rank performs a reduce-scatter to sync gradients across ranks. A rank chunks each gradient tensor into `world_size` slices along the 0-th dimension and concatenates all slices along the 1st dimension. Gradient tensors will be padded before concatenation when tensor.size(0) % world_size != 0.
### Example 1
Consider `world_size=3` and tensors A (2x4), B (3x3), C (1x2):
Input tensors:
```
AAAA BBB CC
AAAA BBB
BBB
```
Reduce-scatter-copy-in Output:
```
AAAABBBCC
AAAABBB00
0000BBB00
```
### Example 2
Consider `world_size=2` and tensors A (2x4), B (3x3), C(1x2), D(4x2):
Input tensors:
```
AAAA BBB CC DD
AAAA BBB 00 DD
BBB DD
000 DD
```
Reduce-scatter-copy-in first pad:
```
AAAA BBB CC DD
AAAA BBB 00 DD
BBB DD
000 DD
```
Then chunk and cat along dim as the output:
```
AAAABBBBBBCCDDDD
AAAABBB00000DDDD
```
The performance of reduce-scatter-copy-in is critical to per-parameter sharding FSDP. However, reduce-scatter-copy-in via composing existing ATen ops involves `cat` and irregular `pad`, leading redundant data copies and unsatisfactory performance.
# PR
We provide aten native support for reduce-scatter-copy-in, namely `_chunk_cat()`:
```
_chunk_cat(Tensor[] tensors, int dim, int num_chunks) -> Tensor
```
This PR includes the registration of `_chunk_cat` and `_chunk_cat.out`, OpInfo tests, and basic implementation composing existing ATen ops.
In the next PR, we will add the CUDA implementation. Comparing with baselines of composing existing ATen ops, `_chunk_cat()` CUDA implementation improves copy bandwidth from 498 GB/s to 966 GB/s on a production benchmark.
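A hedged usage sketch matching Example 1 above (world_size/num_chunks = 3); the expected 3 x 9 output shape follows from the padded layout shown in that example:
```python
import torch

A = torch.ones(2, 4)  # "A"
B = torch.ones(3, 3)  # "B"
C = torch.ones(1, 2)  # "C"

# Each input is padded along dim 0 to a multiple of num_chunks, split into
# num_chunks pieces, and the flattened pieces are concatenated per chunk.
out = torch._chunk_cat([A, B, C], dim=0, num_chunks=3)
print(out.shape)  # expected: torch.Size([3, 9]), per Example 1
```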
## Requirements on input
1. If the input tensors have different ndims, dim should be non-negative and less than the ndims of every input tensor. If all input tensors have the same ndims, we support both negative and non-negative dim.
2. For wrapped_dim, all tensors should have the same size for the 0,...,wrapped_dim-1 dimensions. No requirements for the (wrapped_dim, ...)-th dimensions.
3. num_chunks is expected to be positive.
4. The input tensor list is expected to be non-empty, and each input tensor should have at least 1 element.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121081
Approved by: https://github.com/albanD
- Adds support for custom ops backed by c++ custom autograd functions, e.g. fbgemm
- Include files more granularly to avoid namespace pollution and circular imports
limitations:
- requires the user to audit their code and opt in their custom autograd::Function via autograd::Function::is_traceable and maybe an additional compiled_args + apply_with_saved implementation. this was the only way I could think of for soundness
- will throw if we can't hash the saved_data i.e. for any non implemented type other than list and dict in at::IValue::hash b0cfa96e82/aten/src/ATen/core/ivalue.cpp (L364)
- can technically silently fail if both the typeid hash and the typeid string name of the custom autograd::Function collide at the same time, and an identical autograd graph containing a different custom autograd::Function, yet that has an identical implementation, is called. this case seems extremely unlikely, and the only alternative to hash collision i can think of is compiling with reflection
- tensors not saved via save_variables are not lifted, and are specialized on TensorImpl*'s hash (treated as a memory address). if needed, we can lift them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120681
Approved by: https://github.com/jansel
Make a test that fails on purpose to trigger retries. Check the opposite of success (that env vars exist)
It's a bit hacky because I want it to fail in the normal flow in order to trigger reruns, but I don't want to expose the failures to users since it's confusing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120519
Approved by: https://github.com/huydhn
I was originally trying to solve https://github.com/pytorch/pytorch/issues/120799 but got sidetracked along the way.
This PR contains a couple fixes. Let me know if you want me to split them up!
- Properly handle invalid user code when "super()" is called from non-method/classmethod. It will now properly raise the same error as CPython
- Fix base VariableTracker `__str__` method shadowing all `__repr__` methods defined in subclasses
- Fix accessing a classmethod on a user object to bind "cls" and not "self"
- Fix custom class handling of super() call to properly handle mixed regular/class/static methods
Locally, test_repros.py -k test_batch_norm_act still fails, where the generated graph module is:
```
Call using an FX-traced Module, line 8 of the traced Module's generated forward function:
x = self.forward(l_x_); self = l_x_ = None
x_1 = self.L__self___act(x); x = None
```
note that "self" is being unset on the first line even though it is used on the second one.
For reference, this is the test c268ce4a6d/test/dynamo/test_repros.py (L1368-L1369)
I cannot figure out where the generated forward function comes from though, any hint would be welcome!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121365
Approved by: https://github.com/jansel
Today `GroupRegistry` employs thread isolation by default, i.e. every thread sees its own process group registry. This is intended to work for one-device-per-process (for python use cases) and one-device-per-thread case (for custom native runtimes).
However, there's a problem - there are python use cases that initialize/register process groups in one thread and run collectives in another thread. This use case should be supported. However, since `GroupRegistry` employs thread isolation by default, collectives in different threads can't find the registered process groups.
This PR fixes the issue by:
- Make `GroupRegistry` work in non-thread isolation mode by default. This would match the behavior w/o the native process group registry.
- Introduces `set_thread_isolation_mode` so one-device-per-thread runtimes can enable thread isolation mode explicitly.
Differential Revision: [D54658515](https://our.internmc.facebook.com/intern/diff/D54658515)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121457
Approved by: https://github.com/wanchaol
Summary:
The profiling, even when disabled, takes up about 1.5% cpu for a model I'm looking into.
This patch just splits into with/without profile runs.
The potential downside is that now the script can't enable profiling in itself. It doesn't seem to be used anywhere. If that's a crucial use case, we can do something about it, but ideally we wouldn't.
Test Plan:
Link with profiles:
https://fburl.com/scuba/strobelight_services/ihxsl7pj
```
buck2 run fbcode//caffe2/test/cpp/jit:jit
```
Reviewed By: zhxchen17
Differential Revision: D54066589
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121404
Approved by: https://github.com/zhxchen17
By deleting `where_mps` and registering MPS dispatch for `where_kernel`.
As result of this change resizing and type-checking logic is shared between MPS, CPU and CUDA backends.
Add a test case to `TestMPS.test_where` (it should eventually be removed when `out` OpInfo testing is enabled for MPS).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121476
Approved by: https://github.com/albanD, https://github.com/Skylion007
ghstack dependencies: #121473, #121494
- There are no usages of this internally.
- There are very few usages of this in OSS (most of these are forks of old
repositories).
- This flag doesn't do anything.
We're deprecating it to prevent confusion. I will delete it immediately
after the branch cut.
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121413
Approved by: https://github.com/albanD, https://github.com/soulitzer
This PR removes the deprecated tp_mesh_dim arg to prepare for release.
As we have deprecated this arg for a while (by throwing deprecation
messages), we should remove it before the release.
#suppress-api-compatibility-check
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121432
Approved by: https://github.com/wz337
ghstack dependencies: #121431
Fix round robin sharding when there are no test times and sort_by_time=False
Adds more tests to test_test_selections for sort_by_time=False
Adds more checks to test_split_shards_random for serial/parallel ordering + ordering of tests
Refactoring of dup code
Tested locally by running `python test/run_test.py --shard 3 5` with no test times downloaded and checked that it wasn't an empty list.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121022
Approved by: https://github.com/huydhn, https://github.com/osalpekar
**Summary:**
This commit simplifies the existing decomposition hierarchy
of batch norm ops by adding a single, backend agnostic op:
`batch_norm_with_update`. The existing hierarchy looks like:
```
aten.batch_norm ->
aten._batch_norm_impl_index ->
[
aten.native_batch_norm ->
aten._native_batch_norm_legit (export only) ->
_batch_norm_legit_cpu/cuda (kernels, export only) ->
_batch_norm_cpu/cuda (kernels)
] OR
[ aten.cudnn_batch_norm ] OR
[ aten.miopen_batch_norm ]
```
Aside from complexity, an important problem with the
above decomposition hierarchy is cuda numerics in
export flows. We observed significantly worse convergence
when training a mobilenetv2-like model when using the
`_batch_norm_cuda` kernel instead of the `cudnn_batch_norm`
kernel. This means users who export their models on CPU
first then move the models to cuda later may silently
see worse accuracies even when cudnn is installed,
because they are using the worse kernel. This issue is
summarized in https://github.com/pytorch/pytorch/issues/111384.
Instead, the new hierarchy proposed by consolidating
existing batch norm ops will look like:
```
aten.batch_norm ->
aten.batch_norm_with_update ->
[ _batch_norm_cpu (kernel) ] OR
[ _batch_norm_cuda (kernel) ] OR
[ cudnn_batch_norm (kernel) ] OR
[ miopen_batch_norm (kernel) ]
```
The new op `batch_norm_with_update` hides backend
implementation details and automatically picks the right
kernel based on what is installed. This commit also adds
the following variants to this op:
```
batch_norm_with_update_functional
batch_norm_with_update.out
batch_norm_no_update
batch_norm_no_update.out
batch_norm_backward
```
Note that this commit only adds this op and its variants,
but does not actually change the decomps to produce these
ops in the graph. This will be done after the 2 week FC
window, and the ops used in the old stack is planned to
be removed after the 6 month BC window.
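For illustration, a hedged sketch of calling the new op directly through the dispatcher (the argument order and the four-output return, output / save_mean / save_rstd / reserve, are assumptions, not documented API):
```python
import torch

x = torch.randn(2, 3, 4, 4)
weight, bias = torch.ones(3), torch.zeros(3)
running_mean, running_var = torch.zeros(3), torch.ones(3)

# Assumed signature: (input, weight, bias, running_mean, running_var, momentum, eps)
out, save_mean, save_rstd, reserve = torch.ops.aten.batch_norm_with_update(
    x, weight, bias, running_mean, running_var, 0.1, 1e-5
)
print(out.shape)  # same shape as x
```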
Test Plan: `OpInfo` tests for `batch_norm_with_update`.
Reviewers: albanD, bdhirsh
Subscribers: albanD, bdhirsh, supriyar
Tasks: https://github.com/pytorch/pytorch/issues/111384
Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092
Approved by: https://github.com/bdhirsh, https://github.com/albanD
**Summary**
Add the operator of `quantized_decomposed.fake_quant_per_channel` and test the forward and backward of this op with comparing to ATen.
**Test Plan**
```
python -u -m pytest -s -v test_cpu_repro.py -k test_decomposed_fake_quant_per_channel
```
**Next Step**
Optimize the performance: from the generated code of forward and backward graph, the code didn't vectorize.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121297
Approved by: https://github.com/jerryzh168, https://github.com/jgong5
Summary: When fake tensors are passed to a graph module and we do runtime assertions on them, we can accidentally trigger specialization guards. It's better to just relax the checking for these.
Test Plan: confirmed that problem in T181400371 is now fixed
Differential Revision: D54658960
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121460
Approved by: https://github.com/angelayi
Summary: This is in preparation to enable FX graph caching by default. First fix some bugs uncovered by running all unit tests under `test/inductor/`. I'll enable in a separate diff in case we need to revert. Summary of changes:
* Turn off caching for tests that require a compilation, e.g., when checking that a relevant counter was incremented
* Bypass caching when we see mkldnn tensors as constants (they currently don't serialize, so we can't save to disk)
* Include various global settings that could affect compilation in the cache key calculation.
* Handle a few config settings that break key calculation.
* Handle code paths where no ShapeEnv is available (the cache impl requires a shape env as part of handling guards)
* Skip caching when freezing is enabled (Freezing can embed constants that wouldn't be static across runs).
* Fix the clear() method to not throw when the cache /tmp dir doesn't exist
Test Plan: Ran all tests under `test/inductor/` twice with TORCHINDUCTOR_FX_GRAPH_CACHE=1 to exercise any test that might be affected by caching.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117888
Approved by: https://github.com/eellison
These are just tests that I noticed passed on current main
Running:
```
PYTORCH_TEST_WITH_DYNAMO=1 pytest test/dynamo/test_dynamic_shapes.py test/dynamo/test_compile.py -k 'test_export_decomp_dynamic_shapes or test_export_dynamic_dim_cleanup_dynamic_shapes or test_export_multi_dynamic_dim_constraint_dynamic_shapes or test_export_multi_dynamic_dim_unsafe_relationship_dynamic_shapes or test_export_no_raise_dynamic_shapes or test_export_preserve_constraints_as_metadata_scalar_dynamic_shapes or test_export_raise_on_relationship_dynamic_shapes or test_exported_graph_serialization_dynamic_shapes or test_retracibility_dict_container_inp_out_dynamic_shapes or test_retracibility_nested_list_out_dynamic_shapes or test_exception_table_e2e_2_dynamic_shapes or test_exception_table_e2e_dynamic_shapes or test_exception_table_parsing_dynamic_shapes or test_inference_mode_dynamic_shapes or test_inplace_view_on_graph_input_dynamic_shapes or test_numpy_torch_operators_dynamic_shapes or test_py311_jump_offset_dynamic_shapes or test_lazy_module_no_cls_to_become_dynamic_shapes or test_batchnorm_e2e_dynamic_shapes or test_functools_wraps_dynamic_shapes or test_jit_trace_errors_dynamic_shapes or test_multi_import_dynamic_shapes or test_requires_grad_guards_with_grad_mode2_dynamic_shapese or test_dynamo_signatures'
```
BEFORE: `1 failed, 1 passed, 22 skipped, 1372 deselected`
AFTER: `24 passed, 1372 deselected`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121378
Approved by: https://github.com/oulgen
Summary:
## No Functional Change
- Refactor Subprocess Handler into a separate folder for easier subclassing
- SubprocessHandler
- added `local_rank_id` in `SubprocessHandler` to make it available as a field in the class
- pass in `local_rank_id` from subprocess start
Test Plan: No functional changes.
Differential Revision: D54038627
#suppress-api-compatibility-check
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120373
Approved by: https://github.com/kurman
Things that were bad before this PR:
1. Temporarily unsetting the functional tensor mode and the proxy mode both had duplicate implementations.
2. There are variants of the mode-handling private utils that have duplicate implementations (different APIs calling repeated implementations), so I refactored them.
3. The _push_mode API used to take a dispatch key argument, which is not necessary.
4. There are unused APIs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121083
Approved by: https://github.com/zou3519
@tfsingh I got to it first--wanted to land this stack and close the gap ASAP.
This PR also fixes a discrepancy between `_init_group` and `__set_state__` because we have the constants live on params' device always.
There are some next steps though:
- ASGD can be made faster by making etas, mus, steps be on CPU when NOT capturable. (I had mistakenly thought foreachifying was faster and so we landed https://github.com/pytorch/pytorch/pull/107857, but it is slower). No one has complained yet though. ¯\_(ツ)_/¯
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121264
Approved by: https://github.com/albanD
ghstack dependencies: #121260
Summary: In this PR, `torch.cond` support and the necessary codegening infrastructure is added to C++ wrapper (AOTInductor and friends).
Notable additions:
- A new mechanism in the Python wrapper codegen to precompile and save the Triton kernels (generated and user-defined) which haven't been covered by the active path through the control flow given the sample inputs. As we can't do the runtime autotuning of the kernels outside the active path, we precompile and save them with the `launchers[0]` (corresponding to the first config).
- Codegen infra for `torch.cond` in the C++ wrapper (ABI- and non-ABI-compatible). The `torch.cond` codegen has been slightly refactored to avoid duplication across the Python and C++ wrappers.
- More extensions of the caching sites in the wrapper code to cache per codegened graph (e.g., `codegen_int_array_var`) + some infra for tracking the current codegened graph in the wrapper (both during codegen-ing in the `Scheduler.codegen` and in the `WrapperCodeGen.generate` functions).
- New unit tests to cover the added AOT Inductor + `torch.cond` functionality.
Codegen examples from the new unit tests:
- [`test_cond_simple_abi_compatible_cpu`](https://gist.github.com/aakhundov/862d5de9aa460f5df399e1387f7b342e)
- [`test_cond_simple_abi_compatible_cuda`](https://gist.github.com/aakhundov/d70b81f95fa8cc768cedef9acacb25bb)
- [`test_cond_simple_non_abi_compatible_cpu`](https://gist.github.com/aakhundov/c0ae7a8cbb6fa311c838e1b580f9a3f6)
- [`test_cond_simple_non_abi_compatible_cuda`](https://gist.github.com/aakhundov/08b945d4e8a32c97b7f9ff6272f4a223)
- [`test_cond_nested_abi_compatible_cuda`](https://gist.github.com/aakhundov/ce664f433c53e010ce4c0d96a6c13711)
- [`test_cond_with_parameters_abi_compatible_cuda`](https://gist.github.com/aakhundov/77afbeb8eaab5c5b930a3f922a7baf12)
- [`test_cond_with_multiple_outputs_abi_compatible_cuda`](https://gist.github.com/aakhundov/8cc06105ec8a3fe88be09b3f6e32c690)
Test Plan:
```
$ python test/inductor/test_aot_inductor.py -k test_cond
...
----------------------------------------------------------------------
Ran 42 tests in 170.619s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121120
Approved by: https://github.com/jansel, https://github.com/chenyang78
In this case, it's simpler to use ctx.actions.run(executable = ...), which already ensures that the runfiles associated with the executable are present.
(It's also possible to use ctx.actions.run_shell(tools = ...) with a custom command line, but it's unclear to me that indirecting through the shell is needed here.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120493
Approved by: https://github.com/ezyang
This PR:
* Uses reified ViewFuncs to swap in fake tensors / symbolic SymInts for view replay during subclass view fake-ification
* Enables automatic dynamic on view bases -> fakeifies according to the resultant symbolic context instead of the old "all-static" approach
* Covers the following view types:
* subclass -> dense
* dense -> subclass
* subclass -> subclass
* Dense -> dense views are handled the old way via an `as_strided()` call, as it's likely there is no view func available
Differential Revision: [D54269082](https://our.internmc.facebook.com/intern/diff/D54269082)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118405
Approved by: https://github.com/ezyang
Move tests that are mentioned in PR body or commit message to front. Also attempts to find any issues/PRs mentioned in the PR body and search for those too (ex if you link a disable issue and that issue contains the test file that it was failing on)
looking for: dynamo/test_export_mutations
Also removes some printed information in TD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120621
Approved by: https://github.com/osalpekar
Some implementations, like OpenDAL, do not work with AWS IMDSv2, but this script bridges the gap and enables more recent `sccache` releases (which switched from simple-s3 to OpenDAL) to work in the current CI system.
When launched it prints something like:
```
export AWS_ACCESS_KEY_ID=XXXXX
export AWS_SECRET_ACCESS_KEY=YYYY
export AWS_SESSION_TOKEN=ZZZZ
```
which can be `eval`-ed, and then sccache can use those credentials.
Validated in https://github.com/pytorch/pytorch/pull/121323
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121426
Approved by: https://github.com/Skylion007
Finishes the work started in https://github.com/pytorch/pytorch/pull/118697. Thanks @MarouaneMaatouk for the attempt, but due to inactivity I have opened this PR for Adamax. Note that the new capturable implementation is much simpler and I've modified the foreach capturable impl--it now calls fewer kernels and is more easily comparable to forloop.
Next steps:
* This PR discovered two bugs: #121178 and #121238.
* Move the now hefty graph optim tests in test_cuda to use OptimInfo.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121183
Approved by: https://github.com/albanD
This PR proposes to use std::optional<Generator>& for underlying functions to avoid unnecessary copy and move operations. The torchgen code was changed to generate the new type.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120076
Approved by: https://github.com/malfet
Summary: In production I am seeing errors like "AttributeError: module 'triton.runtime' has no attribute 'fb_memcache'", likely due to some package skew. Until this is resolved, let's wrap this code with try-catch.
Test Plan: CI
Differential Revision: D54604339
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121340
Approved by: https://github.com/aakhundov
**Summary**
In `visualize_sharding` we chose to only print on rank 0 (global rank), which means calling `visualize_sharding` will never print anything when the dtensor object's mesh doesn't include rank 0 (i.e. a sub-mesh). This PR has `visualize_sharding` always print on the rank whose mesh coordinate is (0, 0, ..., 0) instead of the rank whose global rank is 0.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121216
Approved by: https://github.com/wanchaol
ghstack dependencies: #121179, #120260
**Summary**
our goal is to demonstrate DTensor's capability to represent TorchRec's parameter sharding. Currently this is done with `ShardedTensor`, and theoretically `DTensor` can replace it with minor changes.
This PR serves as a start of this effort by adding an example test that represents TorchRec's `ShardingType.ROW_WISE` using DTensor. Note that this PR only covers the even sharding case.
**Test Run**
`torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/torchrec_sharding_example.py -e row-wise`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120260
Approved by: https://github.com/wanchaol
ghstack dependencies: #121179
This PR proposes to keep the key order the same as in the original state_dict, as the issue creator proposed. It also removes a bug concerning how ``_metadata`` is handled (see below), as well as other small changes to properly remove the prefix when it is present.
In the original code, ``_metadata`` was handled as a ``key``.
```
# also strip the prefix in metadata if any.
if "_metadata" in state_dict:
```
This is not the case, ``_metadata`` is actually an ``attribute``. Hence, the previous condition is changed to:
```
# also strip the prefix in metadata if any.
if hasattr(state_dict, "_metadata"):
```
This PR also includes the necessary test.
Fixes#106942
Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117464
Approved by: https://github.com/mikaylagawarecki
`Tensor.__repr__` calls functions which can perform logging, which ends up logging `self` (with `__repr__`), causing an infinite loop. Instead of logging all the args in FakeTensor.dispatch, log the actual parameters (and use `id` to log the tensor itself).
The change to torch/testing/_internal/common_utils.py came up during testing - in some ways of running the test, `parts` was `('test', 'test_testing.py')`, so `i` was 0 and we were doing a join on `()`, which was causing an error.
Repro:
```
import torch
from torch.testing import make_tensor
from torch._subclasses.fake_tensor import FakeTensor, FakeTensorMode
t = torch.sparse_coo_tensor(((0, 1), (1, 0)), (1, 2), size=(2, 2))
t2 = FakeTensor.from_tensor(t, FakeTensorMode())
print(repr(t2))
```
and run with `TORCH_LOGS=+all`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120206
Approved by: https://github.com/yanboliang, https://github.com/pearu
As titled, this PR introduces a dedicated `ParallelStyle` to shard the
nn.LayerNorm/nn.Dropout/RMSNorm layers. We were mainly using a manual
distribute_module calls before when sharding the RMSNorm layer, but I
think we should have a dedicate TP API to easily shard those layers,
instead of user manually using DTensors.
I call this SequenceParallel, which might bring some confusion since we
technically "deprecated" a SequenceParallel style months ago. But this
time the SequenceParallel style is significantly different from the
previous one (which used to shard two consecutive Linear layers). I
believe making it the right name is the first priority, instead of
worrying about the issue of reusing the old name.
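A hedged usage sketch of the new style (the module names in the plan are illustrative; running it requires an initialized process group and 8 GPUs):
```python
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import parallelize_module, SequenceParallel

tp_mesh = init_device_mesh("cuda", (8,))
block = nn.TransformerEncoderLayer(d_model=256, nhead=8)

# Shard the LayerNorm computation over the sequence dimension on the TP mesh.
parallelize_module(block, tp_mesh, {"norm1": SequenceParallel(), "norm2": SequenceParallel()})
```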
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121295
Approved by: https://github.com/awgu, https://github.com/tianyu-l
ghstack dependencies: #121294
# Update Profiler API to collect Execution Traces
## TLDR
We would like to simplify collecting Execution Trace and Kineto together. Execution Trace and Kineto both provide meaningful information that can be combined to enable benchmarking, performance analysis and simulating new hardware.
```
import torch

def main():
    with torch.profiler.profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.CUDA,
        ],
        …
        execution_trace_observer=ExecutionTraceObserver()  # <<<<<<< NEW
    ) as prof:
        ...
        prof.step()
```
See test/profiler/test_profiler.py 'test_execution_trace_with_kineto' for an example of using this API.
## What are Execution Traces?
[Chakra Execution Traces](https://github.com/mlcommons/chakra/wiki) offer a graph based representation of AI/ML workloads. It stands apart from conventional AI/ML frameworks by focusing on replay benchmarks, simulators, and emulators, prioritizing agile performance modeling and adaptable methodologies.
- Chakra is part of ML Commons industry standard and is being adopted by other companies besides NVIDIA too.
- At Meta we have instrumented PyPer framework to collect Execution Traces. More details on our [PyTorch implementation of Chakra can be found here](https://github.com/mlcommons/chakra/wiki)
Chakra essentially enables benchmarking and co-design for ML models without having to reproduce entire software stacks and helps companies collaborate together [[chakra paper](https://arxiv.org/pdf/2305.14516.pdf)]
## Why correlate Execution Trace with PyTorch/Kineto Trace
Both Execution Traces and Kineto traces provide different types of information, and combining them is valuable. While PyTorch ETs focus on CPU operators with explicit dependencies between them, Kineto traces encode GPU operators with their start and end times. In addition, collecting them at different timestamps will be inaccurate, as several operations (NCCL, embedding lookup) are data dependent and may not match correctly.
Thus, it makes sense to collect both ET and Kineto together. The problem is that there are two code paths.
## Proposal
The proposal is to modify the PyTorch profiler (Kineto) API to enable execution trace to be collected simultaneously, see TLDR section
# Testing
Updated the unit test for collecting kineto and Execution Trace together.
- Check the collected ET has right range of events.
- Compare two sets of IDs - record func Ids in ET and external IDs in Kineto. We check if these have a constant difference.
```
pytest test/profiler/test_profiler.py -k test_execution_trace_with_kineto -rP
Running 1 items in this shard
test/profiler/test_profiler.py [W execution_trace_observer.cpp:682] Enabling Execution Trace Observer
STAGE:2024-03-05 09:05:05 1119546:1119546 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[W execution_trace_observer.cpp:694] Disabling Execution Trace Observer
STAGE:2024-03-05 09:05:05 1119546:1119546 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-03-05 09:05:05 1119546:1119546 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119912
Approved by: https://github.com/sanrise, https://github.com/aaronenyeshi
Summary: This specific rocm logic will make aten-cpu code diverge between rocm and cuda. This is not good because we won't be able to share aten-cpu.so between rocm and cuda. More specifically, this will prevent us build aten-hip by default, which requires us to set up rocm specific rules which is an extra burden for our build system.
Test Plan: sandcastle + oss ci
Differential Revision: D54453492
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121082
Approved by: https://github.com/jeffdaily, https://github.com/aaronenyeshi, https://github.com/albanD
Since we are already checking if the RNG tracker is initialized, there is no real performance difference between erroring vs. just initializing a default RNG tracker (which we choose to be the `OffsetBasedRNGTracker`).
```
pytest test/distributed/_composable/fsdp/test_fully_shard_init.py -k test_meta
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121328
Approved by: https://github.com/wanchaol
ghstack dependencies: #120351
Fixes#121093
Previously, calling the following functions with invalid padding dimensions would cause a segmentation fault:
```
torch._C._nn.replication_pad1d, torch._C._nn.replication_pad3d
```
To fix this, condition checks were added to raise a runtime error with a debug message instead, specifying the required dimensions.
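A rough sketch of the kind of call that should now fail loudly instead of crashing (illustrative only; the exact invalid shapes come from the linked issue):
```python
import torch

x = torch.randn(())  # a 0-d tensor is not a valid input for replication padding
try:
    torch._C._nn.replication_pad1d(x, (1, 1))
except RuntimeError as e:
    print("caught:", e)  # a RuntimeError rather than a segmentation fault
```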
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121298
Approved by: https://github.com/mikaylagawarecki
This PR adds initial support for meta-device initialization for pre-training without loading from a state dict. The idea is to allow `fully_shard(module)` to return and still have sharded parameters on meta device. Then, the user is free to initialize them as they please, e.g. using `to_empty()`.
We override `_apply` to achieve the following:
- Reshard the parameters to ensure that sharded parameters are registered (for correctness) -- we will always need this
- Pad new local tensors and use the padded local tensors (to handle uneven sharding) -- we will remove this once `DTensor` pads its local tensor
We use the `swap_tensors` path in `_apply`. For now, this requires setting `torch.__future__.set_swap_module_params_on_conversion(True)`; however, in the future, this may be enabled by default for wrapper subclasses and will not need any explicit API call. If requiring this call is too intrusive in the short term, we can also call it in `_apply` or when importing `fully_shard`.
```
# Pre-training flow (no checkpoint)
global_mesh = init_device_mesh(..., mesh_dim_names=("dp", "tp"))
dp_mesh, tp_mesh = global_mesh["dp"], global_mesh["tp"]
with torch.device("meta"):
    model = ...
    parallelize_module(model, tp_mesh, ...)
    fully_shard(model, mesh=dp_mesh, ...)
for param in model.parameters():
    assert param.device.type == "meta"
model.to_empty(device="cuda")
random.manual_seed(42, global_mesh)
for module in model.modules():
    if hasattr(module, "reset_parameters"):
        module.reset_parameters()
```
This PR includes some minor changes to allow the user to similarly cast the module to a different dtype after construction time but before forward.
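As a small standalone sketch of that post-construction cast (a plain meta-device module here, not the full FSDP flow above):
```python
import torch
import torch.nn as nn

with torch.device("meta"):
    m = nn.Linear(8, 8)
# The module can still be cast to a different dtype after construction
# but before the first forward.
m = m.to(torch.bfloat16)
assert m.weight.dtype == torch.bfloat16 and m.weight.device.type == "meta"
```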
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120351
Approved by: https://github.com/wanchaol
Without this, the build will freeze at the prompt:
Proceed ([y]/n)?
I'm using rootless Podman in VS Code instead of Docker, but I think that should not affect this.
..or does conda somehow detect Docker but not Podman? Anyway, this should not break anything.
Btw, I also had to uncomment the line `"remoteUser": "root"` in devcontainer.json to finish the post-installation properly, but I guess there might be other workarounds - and perhaps you don't want to run as root if your container has root privileges.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121128
Approved by: https://github.com/drisspg
Fixes#115607
We were missing guards when the grads were set to `None`. So if we compiled the optimizer with the grads set to their proper values, and then with the grads set to `None`, we'd continuously run the `None` version, because all of its guards would pass and it would be ordered before the correct version in the cache.
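A minimal sketch of the scenario (not the original repro; `backend="eager"` just keeps the example light):
```python
import torch

params = [torch.randn(4, requires_grad=True)]
opt = torch.optim.SGD(params, lr=0.1)

@torch.compile(backend="eager")
def step():
    opt.step()

params[0].grad = torch.ones_like(params[0])
step()  # compiled with grads present

opt.zero_grad(set_to_none=True)
step()  # grads are None; guards must distinguish this case from the one above
```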
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121291
Approved by: https://github.com/Skylion007, https://github.com/anijain2305
- Adds support for custom ops backed by c++ custom autograd functions, e.g. fbgemm
- Include files more granularly to avoid namespace pollution and circular imports
Limitations:
- requires the user to audit their code and opt in their custom autograd::Function via autograd::Function::is_traceable, and possibly provide an additional compiled_args + apply_with_saved implementation. This was the only way I could think of to ensure soundness.
- will throw if we can't hash the saved_data, i.e. for any non-implemented type other than list and dict in at::IValue::hash b0cfa96e82/aten/src/ATen/core/ivalue.cpp (L364)
- can technically silently fail if both the typeid hash and the typeid string name of the custom autograd::Function collide at the same time, and an identical autograd graph containing a different custom autograd::Function with an identical implementation is called. This case seems extremely unlikely, and the only alternative to hash collisions I can think of is compiling with reflection.
- tensors not saved via save_variables are not lifted and are specialized on the TensorImpl*'s hash (treated as a memory address). If needed, we can lift them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120681
Approved by: https://github.com/jansel
Fix: https://github.com/pytorch/xla/issues/6009
This PR adds another case to the `TensorVariable.method_new` special case, where it
re-dispatches `new` into `new_empty`.
Since we are using fake tensors, the `new` call doesn't actually get to the corresponding
backend (e.g. XLA). So, things like the following might happen:
```python
@torch.compile(backend="openxla")
def foo(x):
    new_x = x.new(*x.size())
    # new_x.device() == "xla"
    # x.device() == "xla:0"
    return new_x + x
a = torch.arange(10)
foo(a.to(xm.xla_device()))
```
Resulting in the following error:
```python
Traceback (most recent call last):
...
File "torch/_dynamo/utils.py", line 1654, in get_fake_value
ret_val = wrap_fake_exception(
File "torch/_dynamo/utils.py", line 1190, in wrap_fake_exception
return fn()
File "torch/_dynamo/utils.py", line 1655, in <lambda>
lambda: run_node(tx.output, node, args, kwargs, nnmodule)
File "torch/_dynamo/utils.py", line 1776, in run_node
raise RuntimeError(make_error_message(e)).with_traceback(
File "torch/_dynamo/utils.py", line 1758, in run_node
return node.target(*args, **kwargs)
File "torch/utils/_stats.py", line 20, in wrapper
return fn(*args, **kwargs)
File "torch/_subclasses/fake_tensor.py", line 885, in __torch_dispatch__
return self.dispatch(func, types, args, kwargs)
File "torch/_subclasses/fake_tensor.py", line 1224, in dispatch
return self._cached_dispatch_impl(func, types, args, kwargs)
File "torch/_subclasses/fake_tensor.py", line 955, in _cached_dispatch_impl
output = self._dispatch_impl(func, types, args, kwargs)
File "torch/_subclasses/fake_tensor.py", line 1445, in _dispatch_impl
return self.wrap_meta_outputs_with_default_device_logic(
File "torch/_subclasses/fake_tensor.py", line 1575, in wrap_meta_outputs_with_default_device_logic
return tree_map(wrap, r)
File "torch/utils/_pytree.py", line 900, in tree_map
return treespec.unflatten(map(func, *flat_args))
File "torch/utils/_pytree.py", line 736, in unflatten
leaves = list(leaves)
File "torch/_subclasses/fake_tensor.py", line 1550, in wrap
) = FakeTensor._find_common_device(func, flat_args)
File "torch/_subclasses/fake_tensor.py", line 625, in _find_common_device
merge_devices(arg)
File "torch/_subclasses/fake_tensor.py", line 620, in merge_devices
raise RuntimeError(
torch._dynamo.exc.TorchRuntimeError: Failed running call_function <built-in function add>(*(FakeTensor(..., device='xla', size=(10,), dtype=torch.int64), FakeTensor(..., device='xla:0', size=(10,), dtype=torch.int64)), **{}):
Unhandled FakeTensor Device Propagation for aten.add.Tensor, found two different devices xla, xla:0
```
Using `new_empty`, instead, fixes this error because it uses the device from the source
tensor, instead of inferring from the current dispatch key set.
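As a quick eager-mode illustration of why `new_empty` is the safer choice (not XLA-specific):
```python
import torch

x = torch.arange(10)
# new_empty inherits both dtype and device from the source tensor, so no
# device mismatch can arise when the call is traced with fake tensors.
y = x.new_empty(x.size())
assert y.device == x.device and y.dtype == x.dtype
```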
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121075
Approved by: https://github.com/jansel
Summary: Without args we have a hard time detecting fake modes. This causes a fake mode mismatch error in non-strict (specifically, `aot_export_module`) when the module contains tensor attributes, because we create a fresh fake mode when we cannot detect one. The fix is to pass the same fake mode throughout.
Test Plan: added test
Differential Revision: D54516595
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121176
Approved by: https://github.com/angelayi, https://github.com/tugsbayasgalan
The `inline` qualifier is dropped in layer_norm_kernel.cu since it seems to be ignored according to:
```
[18/263] Building CUDA object
caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o
/home/mkozuki/ghq/github.com/crcrpar/torch-3/aten/src/ATen/native/cuda/layer_norm_kernel.cu(300):
warning #20050-D: inline qualifier ignored for "__global__" function
Remark: The warnings can be suppressed with "-diag-suppress
<warning-number>"
/home/mkozuki/ghq/github.com/crcrpar/torch-3/aten/src/ATen/native/cuda/layer_norm_kernel.cu(300):
warning #20050-D: inline qualifier ignored for "__global__" function
Remark: The warnings can be suppressed with "-diag-suppress
<warning-number>"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121246
Approved by: https://github.com/eqy, https://github.com/malfet
RECORD_FUNCTION in python_function only captures arguments that are Tensors. However, it is very common for users to pass non-tensor arguments to custom ops, for example, the sequence length in a GPT attention custom op. My previous PR tried to capture all non-tensor arguments, but it turned out that in some cases this is very expensive.
This PR adds support for primitive arguments (and containers of them) in RECORD_FUNCTION.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120949
Approved by: https://github.com/soulitzer
**Summary:**
This commit simplifies the existing decomposition hierarchy
of batch norm ops by adding a single, backend agnostic op:
`batch_norm_with_update`. The existing hierarchy looks like:
```
aten.batch_norm ->
aten._batch_norm_impl_index ->
[
aten.native_batch_norm ->
aten._native_batch_norm_legit (export only) ->
_batch_norm_legit_cpu/cuda (kernels, export only) ->
_batch_norm_cpu/cuda (kernels)
] OR
[ aten.cudnn_batch_norm ] OR
[ aten.miopen_batch_norm ]
```
Aside from complexity, an important problem with the
above decomposition hierarchy is cuda numerics in
export flows. We observed significantly worse convergence
when training a mobilenetv2-like model when using the
`_batch_norm_cuda` kernel instead of the `cudnn_batch_norm`
kernel. This means users who export their models on CPU
first then move the models to cuda later may silently
see worse accuracies even when cudnn is installed,
because they are using the worse kernel. This issue is
summarized in https://github.com/pytorch/pytorch/issues/111384.
Instead, the new hierarchy proposed by consolidating
existing batch norm ops will look like:
```
aten.batch_norm ->
aten.batch_norm_with_update ->
[ _batch_norm_cpu (kernel) ] OR
[ _batch_norm_cuda (kernel) ] OR
[ cudnn_batch_norm (kernel) ] OR
[ miopen_batch_norm (kernel) ]
```
The new op `batch_norm_with_update` hides backend
implementation details and automatically picks the right
kernel based on what is installed. This commit also adds
the following variants to this op:
```
batch_norm_with_update_functional
batch_norm_with_update.out
batch_norm_no_update
batch_norm_no_update.out
batch_norm_backward
```
Note that this commit only adds this op and its variants,
but does not actually change the decomps to produce these
ops in the graph. This will be done after the 2 week FC
window, and the ops used in the old stack are planned to
be removed after the 6 month BC window.
Test Plan: `OpInfo` tests for `batch_norm_with_update`.
Reviewers: albanD, bdhirsh
Subscribers: albanD, bdhirsh, supriyar
Tasks: https://github.com/pytorch/pytorch/issues/111384
Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092
Approved by: https://github.com/bdhirsh, https://github.com/albanD
Summary: Update the macros to exclude using __grid_constant__ when compiling for devices > sm80 but CUDA version < 11.8.
Test Plan: buck2 build --keep-going --config buck2.log_configured_graph_size=true --flagfile fbcode//mode/dev fbcode//sigrid/predictor/client/python:ig_sigrid_client_pybinding
Differential Revision: D54556796
Co-authored-by: Driss Guessous <drisspg@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121275
Approved by: https://github.com/drisspg
Context: view fake-ification should handle closed-over state in ViewFuncs for use in view replay by:
* fake-ifying tensors
* symbolicizing SymInts
This avoids invalid specialization during view replay. However, the symbols / tensors created as intermediates in the view chain should not stick around or be guarded on. This PR introduces an `EphemeralSource` intended to be used as a source for this purpose. It has the following properties:
* Considered first to be simplified out in symbol simplification logic
* Errors if guarded on
Differential Revision: [D54561597](https://our.internmc.facebook.com/intern/diff/D54561597)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120948
Approved by: https://github.com/ezyang
Manually restore `fully_shard` after `__exit__` from the `mock.patch` ctx. This will fix flaky CIs in trunk.
```
pytest test/distributed/_composable/fsdp/test_fully_shard_training.py
```
This is a workaround to make `mock.patch(fully_shard)` work with multiple threads:
* thread 1 sets `func.__module__[fully_shard]` = patched function
* thread 2 reads `func.__module__[fully_shard]`, thinks it is the original, and fails to restore `fully_shard` during `__exit__`
* this PR manually restores `fully_shard` after `__exit__` (sketched below)
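A minimal sketch of the restore pattern (module path as used by the per-parameter FSDP tests; illustrative, not the actual test code):
```python
from unittest import mock

import torch.distributed._composable.fsdp as fsdp_mod

original = fsdp_mod.fully_shard
with mock.patch.object(fsdp_mod, "fully_shard", wraps=original):
    ...  # run the multi-threaded test body
# With multiple threads, __exit__ may put the patched object back instead of
# the original, so restore it explicitly.
fsdp_mod.fully_shard = original
```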
Co-authored-by: Andrew Gu <31054793+awgu@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121058
Approved by: https://github.com/awgu
Summary: The qnnpack library merge fails on some applications. This fix implements the recommendation from the Android build team to prevent library merging for qnnpack.
Test Plan:
1. Measure the binary size impact
2. Release build failed previously; now it should succeed
Differential Revision: D54048156
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120676
Approved by: https://github.com/kimishpatel
Make `torch.__future__.get_swap_module_params_on_conversion() == True` account for `assign` argument to `nn.Module.load_state_dict`
Similar to when `torch.__future__.set_swap_module_params_on_conversion()` is `False`, `assign=True` means that we do not incur a `self.copy_(other)` and the properties of `other` will be preserved
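A minimal sketch of the `assign=True` behavior being accounted for (dtype used as the preserved property here):
```python
import torch
import torch.nn as nn

torch.__future__.set_swap_module_params_on_conversion(True)

m = nn.Linear(2, 2)
sd = {k: v.detach().clone().to(torch.float64) for k, v in m.state_dict().items()}

# With assign=True, the properties of the tensors in the state dict (here the
# float64 dtype) are preserved instead of being copied into the existing params.
m.load_state_dict(sd, assign=True)
assert m.weight.dtype == torch.float64
```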
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121158
Approved by: https://github.com/albanD
ghstack dependencies: #121157
Always preserve requires_grad of param in module. Documentation fixed in PR stacked above.
Also fix the test case to test loading a state_dict generated with `keep_vars=False` (the default).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121157
Approved by: https://github.com/albanD
Fixes#83149
There is a limitation of `TensorIterator` reductions:
The non-permuted input tensor will be coalesced down to a 2-d tensor by `TensorIterator` whereas the permuted case may become a >2d operation (for example, two reduced dimensions and one non-reduced dimension).
Since the cpu reduction loop of `TensorIterator` only operates on two dimensions at a time, this means the intermediate sums will be truncated to lower precision.
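A small sketch of the kind of discrepancy reported in #83149 (illustrative; both sums reduce the same original dimensions, so they should agree):
```python
import torch

x = torch.full((64, 1024, 16), 1e-4, dtype=torch.float16)
a = x.sum(dim=(0, 2))                     # contiguous input
b = x.permute(2, 1, 0).sum(dim=(0, 2))    # permuted input, same reduced dims
# Before the fix, the permuted path could accumulate in lower precision;
# the difference below should now stay small.
print((a - b).abs().max())
```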
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108559
Approved by: https://github.com/mingfeima, https://github.com/peterbell10
# Summary
Updates the SDPA docs to fix some small inaccuracies and point to the new `sdpa_kernel` context manager. The enum-like `SDPBackend` type bound from C++ does not render its fields for some reason, so they are listed manually for now.
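A minimal usage sketch of the context manager the docs now point to (the math backend is used so the snippet runs on CPU; the fused backends are selected the same way):
```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

q = k = v = torch.randn(2, 8, 128, 64)
# Restrict SDPA to a specific backend; FLASH_ATTENTION and EFFICIENT_ATTENTION
# are the fused alternatives listed in the docs.
with sdpa_kernel(SDPBackend.MATH):
    out = F.scaled_dot_product_attention(q, k, v)
```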
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121180
Approved by: https://github.com/mikaylagawarecki
The main thread and the autograd one are latency critical threads. They launch CPU/GPU/Accelerator kernels and if for some reason they get preempted, the rank can become a straggler in a distributed training application. By naming these threads we can debug performance issues that impact the latency sensitive threads.
I used Kineto traces to verify if the thread names were propagated:
<img width="851" alt="Screenshot 2024-03-04 at 3 07 43 PM" src="https://github.com/pytorch/pytorch/assets/23515689/68b4a09c-b8e5-4f14-a5c0-6593f866c03f">
Also:
```
nvidia-smi
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 3065920 C ...me#python#py_version_3_10 1968MiB |
| 1 N/A N/A 3065926 C ...me#python#py_version_3_10 1978MiB |
| 2 N/A N/A 3065930 C ...me#python#py_version_3_10 2084MiB |
| 3 N/A N/A 3065936 C ...me#python#py_version_3_10 2016MiB |
| 4 N/A N/A 3065939 C ...me#python#py_version_3_10 1998MiB |
| 5 N/A N/A 3065943 C ...me#python#py_version_3_10 2070MiB |
| 6 N/A N/A 3065948 C ...me#python#py_version_3_10 2026MiB |
| 7 N/A N/A 3065952 C ...me#python#py_version_3_10 2070MiB |
+-----------------------------------------------------------------------------+
[me@myhost ~]$ ps -T -p 3065920
PID SPID TTY TIME CMD
3065920 3065920 pts/14 00:01:04 pt_main_thread
...
3065920 3092181 pts/14 00:00:40 pt_autograd_d0
3065920 3092182 pts/14 00:00:00 pt_autograd_d1
3065920 3092183 pts/14 00:00:00 pt_autograd_d2
3065920 3092184 pts/14 00:00:00 pt_autograd_d3
3065920 3092185 pts/14 00:00:00 pt_autograd_d4
3065920 3092186 pts/14 00:00:00 pt_autograd_d5
3065920 3092187 pts/14 00:00:00 pt_autograd_d6
3065920 3092188 pts/14 00:00:00 pt_autograd_d7
...
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121170
Approved by: https://github.com/albanD
Closes #120988
Currently operators that hit the autograd fallback call `check_inplace`
on all mutated inputs, including out arguments. This leads to a slightly
confusing error message:
```
RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.
```
Compared to functions that don't fallback, which raise
```
RuntimeError: add(): functions with out=... arguments don't support automatic differentiation, but one of the arguments requires grad.
```
This changes the error message to make clear the issue is with the out argument,
but does not tighten the check to outright ban out arguments that require grad.
Instead, I use the same checks from `check_inplace` which allows non-leaf tensors
that require grad to pass without error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121089
Approved by: https://github.com/lezcano, https://github.com/soulitzer
ghstack dependencies: #121142
`linalg_eigvals_out` calls into a dispatch stub, so only supports CPU and CUDA
strided tensors but incorrectly claimed to be a composite op. `linalg_eigvals`
also shouldn't defer to the out variant inside a `CompositeImplicitAutograd` op
as not all types support out variants. Instead, I add a new helper
`_linalg_eigvals` which does the same thing in a non-composite operator.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121142
Approved by: https://github.com/lezcano
Summary: WrapperModule seems like a good idea but may introduce some surprising behavior to users; for example, it never registers enclosed modules as submodules, and therefore it's unclear what the state dict for the exported program should look like, because some people may argue to include every state in the state dict while others want to keep them as constants.
Test Plan: CI
Reviewed By: tugsbayasgalan
Differential Revision: D54326331
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121042
Approved by: https://github.com/angelayi
**Summary**
This PR extends `_offload_state_dict_to_cpu` to accept a `cpu_offload_state_dict` argument. If `cpu_offload_state_dict` is not None, `_offload_state_dict_to_cpu` will use `copy_` to copy the GPU data to the CPU tensors. This allows users to pass a pin_memory or share_memory version of `cpu_offload_state_dict`.
This PR also adds `_create_cpu_state_dict` to allow users to easily create a pin_memory or share_memory cpu state_dict.
**Performance improvement**
```
# The micro-benchmark has a source state_dict with 150 tensors, and each tensor is 50MB.
# The micro-benchmark is run on a H100 machine with PCIe 5
cpu_state_dict_2 = _create_cpu_state_dict(state_dict, pin_memory=True)
cpu_state_dict_3 = _create_cpu_state_dict(state_dict, share_memory=True)
# GPU->CPU memory: 4.6556 seconds
cpu_state_dict = _offload_state_dict_to_cpu(state_dict)
# GPU->pin memory: 0.1566 seconds
_offload_state_dict_to_cpu(state_dict, cpu_offload_state_dict=cpu_state_dict_2)
# GPU->shared memory: 0.5509 seconds (variation is quite large)
_offload_state_dict_to_cpu(state_dict, cpu_offload_state_dict=cpu_state_dict_3)
# GPU->pin memory->shared memory: 0.2550 seconds
_offload_state_dict_to_cpu(state_dict, cpu_offload_state_dict=cpu_state_dict_2)
_offload_state_dict_to_cpu(cpu_state_dict_2, cpu_offload_state_dict=cpu_state_dict_3)
```
Differential Revision: [D54045845](https://our.internmc.facebook.com/intern/diff/D54045845/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120378
Approved by: https://github.com/LucasLLC
The spectral norm implementation has extensive tests, but there doesn't appear to be any check that the spectral norm (= top singular value) is indeed correctly calculated. There should be at least one such test case.
This adds one such test case for the parameterizations.py implementation of spectral norm.
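A sketch of the kind of check being added (not the actual test code; after enough power-iteration steps the top singular value of the parametrized weight should approach 1):
```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import spectral_norm

m = spectral_norm(nn.Linear(8, 8))
x = torch.randn(4, 8)
for _ in range(50):       # let the power iteration converge
    m(x)
top_sv = torch.linalg.svdvals(m.weight).max()
print(top_sv)             # expected to be close to 1.0
```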
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121068
Approved by: https://github.com/soulitzer
Fixes#120044
Should fix build from source instructions on release branch here: https://github.com/pytorch/pytorch#from-source
Please note we are using /test/ channel for release here to make sure it works, before actual release is completed.
Test main:
```
make triton
pip3 uninstall -y triton
WARNING: Skipping triton as it is not installed.
Looking in indexes: https://download.pytorch.org/whl/nightly/
Collecting pytorch-triton==3.0.0+a9bc1a3647
Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.0.0%2Ba9bc1a3647-cp310-cp310-linux_x86_64.whl (239.0 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 239.0/239.0 MB 8.7 MB/s eta 0:00:00
Requirement already satisfied: filelock in /home/atalman/miniconda3/envs/py310/lib/python3.10/site-packages (from pytorch-triton==3.0.0+a9bc1a3647) (3.13.1)
Installing collected packages: pytorch-triton
Attempting uninstall: pytorch-triton
Found existing installation: pytorch-triton 2.2.0
Uninstalling pytorch-triton-2.2.0:
Successfully uninstalled pytorch-triton-2.2.0
Successfully installed pytorch-triton-3.0.0+a9bc1a3647
```
Test release/2.2:
```
make triton
pip3 uninstall -y triton
WARNING: Skipping triton as it is not installed.
Looking in indexes: https://download.pytorch.org/whl/test/
Collecting pytorch-triton==2.2.0
Using cached https://download.pytorch.org/whl/test/pytorch_triton-2.2.0-cp310-cp310-linux_x86_64.whl (183.1 MB)
Requirement already satisfied: filelock in /home/atalman/miniconda3/envs/py310/lib/python3.10/site-packages (from pytorch-triton==2.2.0) (3.13.1)
Installing collected packages: pytorch-triton
Attempting uninstall: pytorch-triton
Found existing installation: pytorch-triton 3.0.0+a9bc1a3647
Uninstalling pytorch-triton-3.0.0+a9bc1a3647:
Successfully uninstalled pytorch-triton-3.0.0+a9bc1a3647
Successfully installed pytorch-triton-2.2.0
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121169
Approved by: https://github.com/seemethere
As reported in https://github.com/pytorch/pytorch/issues/119434, `detectron2_fcos_r_50_fpn` failed with dynamic shape testing, so we propose to skip the dynamic batch size testing of this model in this PR.
* Error msg is
```
File "/home/jiayisun/pytorch/benchmarks/dynamo/common.py", line 3877, in run
assert marked, f"nothing in example_inputs had a dim with {batch_size}"
AssertionError: nothing in example_inputs had a dim with 4
```
* Root Cause is
The benchmark code will only annotate an input dim as dynamic when its size equals the batch size c617e7b407/benchmarks/dynamo/common.py (L3867-L3871). If it fails to find any dim equal to the batch size, the above error is thrown.
However, the inputs of `detectron2_fcos_r_50_fpn` are as follows:
```
([{'file_name': '/home/jiayisun/benchmark/torchbenchmark/data/.data/coco2017-minimal/coco/val2017/000000001268.jpg', 'height': 427, 'width': 640, 'image_id': 1268, 'image': tensor([[[147., 124., 82., ..., 3., 4., 5.],
[125., 104., 65., ..., 3., 3., 4.],
[ 87., 68., 34., ..., 2., 2., 2.],
...,
[ 47., 45., 41., ..., 45., 45., 45.],
[ 46., 44., 40., ..., 44., 45., 46.],
[ 46., 44., 40., ..., 43., 45., 46.]],
[[154., 129., 84., ..., 3., 4., 5.],
[133., 110., 69., ..., 3., 3., 4.],
[ 95., 76., 43., ..., 2., 2., 2.],
...,
[ 44., 42., 38., ..., 34., 37., 39.],
[ 43., 41., 37., ..., 35., 39., 41.],
[ 43., 41., 37., ..., 35., 40., 43.]],
[[171., 140., 85., ..., 3., 4., 5.],
[147., 120., 71., ..., 3., 3., 4.],
[103., 83., 47., ..., 2., 2., 2.],
...,
[ 46., 44., 40., ..., 16., 20., 22.],
[ 45., 43., 39., ..., 17., 22., 26.],
[ 45., 43., 39., ..., 18., 24., 28.]]])}, ... ],)
```
None of the input dims will equal the input batch size, so I think we need to skip the dynamic batch size testing for this model.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120697
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/desertfire
Fixes#115331.
This is a temporary fix to increase the compile time number of GPUs to 120 until #119639 can be merged. Changing the parameter to 128 leads to annoying errors, as some checks would be tautological (`int8_t` is always < 128).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121076
Approved by: https://github.com/albanD
Summary:
While testing I noticed that if we generate different configs, we will fail to use the remote cache, so let's include the configs in the cache key.
Not sure how to write a deterministic test for this.
Test Plan: existing tests
Differential Revision: D54500957
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121152
Approved by: https://github.com/aakhundov
Summary: While investigating why we were calling put each time, I noticed that the memcache backend returns a list instead of a direct result, which means that we were correctly fetching the cached result but not using it.
Test Plan: The test should now work as expected
Differential Revision: D54500851
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121151
Approved by: https://github.com/aakhundov
Summary: The current C shim layer manually implements a C interface for a handful of ops. Obviously that's not scalable if we want to extend it to cover all aten ops. This new torchgen script automatically generates C shim interfaces for CPU and CUDA backends. The interface follows the same parameter passing rules as the current C shim layer, such as
* Use plain C data types to pass parameters
* Use AtenTensorHandle to pass at::Tensor
* Use pointer type to pass optional parameter
* Use pointer+length to pass list
* Use device_type+device_index to pass device
* When a parameter is a pointer to a pointer, e.g. AtenTensorHandle**, the script generates either a list of optional values or an optional list of values
https://gist.github.com/desertfire/83701532b126c6d34dae6ba68a1b074a is an example of the generated torch/csrc/inductor/aoti_torch/generated/c_shim_cuda.cpp file. The current version doesn't generate C shim wrappers for all aten ops, and probably generates more wrappers than needed on the other hand, but it should serve as a good basis.
This PR by itself won't change AOTI codegen and thus won't introduce any FC breakage. The actual wrapper codegen changes will come in another PR with some version control flag to avoid FC breakage.
Differential Revision: [D54258087](https://our.internmc.facebook.com/intern/diff/D54258087)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120513
Approved by: https://github.com/jansel
Differential Revision: D54447700
## Context
This changeset updates Vulkan SPIR-V codegen to introduce a global SPIR-V shader registry and register shaders dynamically at static initialization time. This change makes it possible to define and link custom shader libraries to the ATen-Vulkan runtime.
Before:
* `gen_vulkan_spv.py` generated two files, `spv.h` and `spv.cpp` which would contain the definition and initialization of Vulkan shader registry variables.
After:
* Introduce the `ShaderRegistry` class in `api/`, which encapsulates functionality of the `ShaderRegistry` class previously defined in the generated `spv.h` file
* Introduce a global shader registry (defined as a static variable in the `api::shader_registry()` function)
* Define a `ShaderRegisterInit` class (taking inspiration from `TorchLibraryInit`) that allows for dynamic shader registration
* `gen_vulkan_spv.py` now only generates `spv.cpp`, which defines a static `ShaderRegisterInit` instance that triggers registration of the compiled shaders to the global shader registry.
Benefits:
* Cleaner code base; we no longer have `ShaderRegistry` defined in a generated file, and don't need a separate implementation file (`impl/Registry.*`) to handle shader lookup. All that logic now lives under `api/ShaderRegistry.*`
* Makes it possible to compile and link separate shader libraries, providing similar flexibility as defining and linking custom ATen operators
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121088
Approved by: https://github.com/manuelcandales, https://github.com/jorgep31415
Benchmark fusion currently does not support foreach kernels. If we don't explicitly skip foreach kernels, we end up with exceptions in `codegen_node_schedule` because individual nodes in a foreach kernel may have incompatible shapes from a pointwise/reduction perspective.
cc Manman Ren ( @manman-ren ) who reported the issue when turning on benchmark fusion on BertForMaskedLM.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121168
Approved by: https://github.com/Chillee
Summary:
add_histogram fails for this data type. Updating conversion code to handle it.
Stack trace for the failure -
```
[trainer0]Traceback (most recent call last):
[trainer0] File "<torch_package_0>.tensorboard/logging/summary_v2.py", line 203, in unscriptable_record_summary
[trainer0] unscriptable_histogram(name, t, step, ranks)
[trainer0] File "<torch_package_0>.tensorboard/logging/fx_v1.py", line 146, in unscriptable_histogram
[trainer0] Adhoc.writer().add_histogram(tag, x, step.int())
[trainer0] File "/tmp/aienv/images/aienv_image_09slg3j1/torch/utils/tensorboard/writer.py", line 40, in wrapper
[trainer0] resp = super_method(*args, **kwargs)
[trainer0] File "/tmp/aienv/images/aienv_image_09slg3j1/torch/utils/tensorboard/writer_oss.py", line 526, in add_histogram
[trainer0] histogram(tag, values, bins, max_bins=max_bins), global_step, walltime
[trainer0] File "/tmp/aienv/images/aienv_image_09slg3j1/torch/utils/tensorboard/summary.py", line 482, in histogram
[trainer0] values = make_np(values)
[trainer0] File "/tmp/aienv/images/aienv_image_09slg3j1/torch/utils/tensorboard/_convert_np.py", line 23, in make_np
[trainer0] return _prepare_pytorch(x)
[trainer0] File "/tmp/aienv/images/aienv_image_09slg3j1/torch/utils/tensorboard/_convert_np.py", line 30, in _prepare_pytorch
[trainer0] x = x.detach().cpu().numpy()
[trainer0]TypeError: Got unsupported ScalarType BFloat16
```
Test Plan: Updated unit test that was failing before but passes after this change.
Reviewed By: hamzajzmati, jcarreiro
Differential Revision: D53841197
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120087
Approved by: https://github.com/jcarreiro, https://github.com/yanboliang
Differential Revision: D54487619
## Context
Allow the descriptor pool of an `api::Context` object to be initialized in a deferred fashion, instead of forcing initialization upon construction. This mode of operation will be used in the ExecuTorch Vulkan delegate, where the exact number of descriptor sets can be determined once the graph is built instead of needing to "guess" an adequate amount.
## Implementation Details
* Check `config.descriptorPoolMaxSets > 0` to determine whether the descriptor pool should be initialized
* Introduce `DescriptorPool::init()` function to trigger initialization
* Introduce safeguards against using an uninitialized descriptor pool
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121134
Approved by: https://github.com/manuelcandales
Summary: Inductor currently has a best config cache for kernels that it generates. This is a local cache done via writing to the file system. This diff takes this local cache to remote by reusing the existing triton caching mechanism built via Memcache internally and Redis externally.
Test Plan:
tested locally using `TORCH_INDUCTOR_AUTOTUNE_REMOTE_CACHE=1`
Look at scuba to verify the local testing: https://fburl.com/scuba/triton_remote_cache/z6pypznk
The plan is to land this diff with this turned off and gradually introduce this.
Differential Revision: D54398076
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120963
Approved by: https://github.com/jansel
This is a BC-breaking change to distribute_module. The underlying rationale
for this change is that sometimes, in the input_fn/output_fn, the user wants
to access the current module for some attributes. This might not be
common, but in some cases it's worth having access to the module.
An outstanding use case we want to support is float8: if we want to make
float8 work with the TP API, the input_fn/output_fn of the TP parallel
styles would need access to the module, where the module might
encapsulate a `dynamic_linear.emulate` attribute that is useful for
input/output casting.
Since this is needed for fp8 and DTensor is still under prototype release,
I feel the change is worth it, and it's better to make it as early as
possible.
Right now this is a soft BC break, which means we still maintain BC
but throw deprecation messages.
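A sketch of the new hook shape (import path and hook names as used by distribute_module at the time of this change; bodies are placeholders):
```python
from torch.distributed._tensor import distribute_module  # path as of this change

# New-style hooks receive the module as the first argument, so module attributes
# (e.g. a hypothetical fp8 `emulate` flag) can influence input/output casting.
def input_fn(module, inputs, device_mesh):
    return inputs

def output_fn(module, outputs, device_mesh):
    return outputs

# dist_mod = distribute_module(my_module, device_mesh, input_fn=input_fn, output_fn=output_fn)
```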
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120895
Approved by: https://github.com/tianyu-l
Summary:
torch.testing.assert_equal doesn't support nested strided tensors because `sizes` is not implemented.
This adds special handling for nested tensors by checking for them and unbinding them if they are found.
Test Plan: test_trace_with_nested_strided_tensor_output
Differential Revision: D54430238
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121039
Approved by: https://github.com/YuqingJ
Summary:
1) As a follow-up to D53602514: found a new way to decompose mm in the backward - sum the permuted input and reduce along dim 0. Some benchmark results: P1190140001 (30x speedup).
Some explanation of why the original mm decomposition is slow: for an m x k x n mm, when m is small and k is large, the stride of the lhs is [m, 1], hence it needs to access memory k times to load all the data. As a result, the decomposition is effective with a permute since the stride becomes [k, 1].
2) Add another pattern for large k. Benchmark result: P1190596489 (28x speedup).
3) Fix the value-not-found error in ig ctr. f536115499
Test Plan:
pt2 decompose:
{F1462894821}
decompose: f536159404
baseline: f536282578
705k vs 725k 4% for ig ctr
Differential Revision: D54294491
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120933
Approved by: https://github.com/mengluy0125
Reduces the torch.compile(backend="eager") time for this code by 1-2 seconds.
~~~
def fn(x):
    for _ in range(10000):
        # x = torch.sin(x)
        x = torch.ops.aten.sin(x)
        # x = sin(x)
    return x
~~~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121052
Approved by: https://github.com/jansel
ghstack dependencies: #121053
Reduces the torch.compile(backend="eager") time for this code by 1-2 seconds.
~~~
def fn(x):
    for _ in range(10000):
        # x = torch.sin(x)
        x = torch.ops.aten.sin(x)
        # x = sin(x)
    return x
~~~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121053
Approved by: https://github.com/jansel
Summary:
Expose an option to users to specify the name of the LogsSpec implementation to use.
- Has to be defined in entrypoints under `torchrun.logs_specs` group.
- Must implement LogsSpec defined in prior PR/diff.
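For example, such an implementation could be registered from a package's setup script (hypothetical package and class names; only the `torchrun.logs_specs` group name comes from this change):
```python
from setuptools import setup

setup(
    name="my_torchrun_logs",
    py_modules=["my_logs_spec"],
    entry_points={
        # torchrun looks up LogsSpec implementations under this group by name.
        "torchrun.logs_specs": ["my_spec = my_logs_spec:MyLogsSpecs"],
    },
)
```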
Test Plan: unit test+local tests
Reviewed By: ezyang
Differential Revision: D54180838
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120942
Approved by: https://github.com/ezyang
**description**
Enable lowering of dynamic qlinear for X86Inductor. The pattern is `choose_qparams -> getitem -> q -> dq -> linear`. We only fuse `dq -> linear` and get `choose_qparams -> getitem -> q -> onednn.qlinear_pointwise`. So, we treat it as dynamic quantization of activation + static quantized linear.
The previous implementation of `onednn.qlinear_pointwise` is for the case where `x_scale` and `x_zp` are scalars. Since `choose_qparams` returns tensors, we added a variation `onednn.qlinear_pointwise.tensor` to support the case.
This feature is targeting PyTorch 2.3 release.
**Test plan**
```
python inductor/test_mkldnn_pattern_matcher.py -k test_dynamic_qlinear_cpu
python inductor/test_mkldnn_pattern_matcher.py -k test_dynamic_qlinear_qat_cpu
python inductor/test_cpu_cpp_wrapper.py -k test_dynamic_qlinear
```
**Performance before and after lowering `choose_qparam` to Inductor**
Before
- latency for shape (32, 32) = 0.151 ms
- latency for shape (128, 128) = 0.153 ms
- latency for shape (1024, 1024) = 0.247 ms
After
- latency for shape (32, 32) = 0.049 ms
- latency for shape (128, 128) = 0.052 ms
- latency for shape (1024, 1024) = 0.133 ms
Test method: A module with a single Linear layer, dynamic-quantize, lower to X86Inductor
Test env & config: Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz, single instance, single core, using Intel OpenMP and Tcmalloc
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120605
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168
Loss parallel is the last piece of sequence parallelism to enable. It enables efficient distributed cross entropy computation when the input is sharded on the class dimension (in a classification problem with many classes). The implementation is via a context manager `loss_parallel`, after enabling which users can directly use `torch.nn.functional.cross_entropy` or `torch.nn.CrossEntropyLoss` without modifying other parts of their code.
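A minimal usage sketch (assuming `logits` is a DTensor sharded on the class dimension and `labels` is the matching target tensor; the import path follows the tensor-parallel API at the time of this change):
```python
import torch.nn.functional as F
from torch.distributed.tensor.parallel import loss_parallel

# Inside the context manager, cross_entropy runs the sharded computation
# described below without any other changes to user code.
with loss_parallel():
    loss = F.cross_entropy(logits, labels)
    loss.backward()
```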
Here are the underlying rationales why we are going through these op replacements:
1. `nn.functional.cross_entropy` is the common method that OSS users use for things like transformer training. To avoid changing user code, we want users to keep using this function for loss calculation if they are already using it.
2. `nn.functional.cross_entropy` boils down into `aten.log_softmax` and `aten.nll_loss_forward/backward`, and DTensor now supports those ops already (#117723, #119255, #118917, #119256). They do the computation with the input *replicated* on the class dimension.
3. However, when the input of this loss calculation is **sharded on the class dimension**, to run the sharded computation efficiently we need to run both `aten.log_softmax` and `aten.nll_loss_forward` with multiple all-reduce collectives **in the middle of** those aten ops. This is not possible if we just override these two ops, so we need some way to **decompose** these two ops into smaller ops so that collectives can run in the middle of them.
4. We explored the existing decompositions (#118950). They seem to work, except that `log_softmax_backward` and `nll_loss_backward` combined together in aten are implemented in an inefficient way, which triggers an additional expensive collective. Recently some users also reported similar issues: https://github.com/pytorch/pytorch/issues/119261.
5. Therefore, currently we are doing our own decomposition inside a context manager for sequence parallelism specifically. Once we have a better decomposition in core, we can possibly take that instead of reinventing the wheels here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119877
Approved by: https://github.com/wanchaol
Fix https://github.com/pytorch/pytorch/issues/120545. These models fail the accuracy test with freezing because of conv-batchnorm fusion, which causes relatively big numerical churn.
For the failed TIMM models, raising the tolerance to `8 * 1e-2` can make the test pass.
For the failed TB models, the numerical difference is too large. After discussing with @eellison, we decided to skip them with freezing for now.
On the other hand, we should probably dig more into why the conv-bn fusion causes such a large numerical difference.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121054
Approved by: https://github.com/eellison
Summary:
## Context
Introduce some simple bug fixes to the Vulkan Compute API that were causing errors on Android.
1. When using deferred allocation for image textures, it is undefined behaviour to create a `vkImageView` for a `vkImage` that has not yet been bound to memory. Fix this by creating the image view only after the `vkImage` has been bound to memory.
2. When flushing the `api::Context`, the command pool is flushed but any current command buffers are not invalidated. This will cause a segmentation fault if the command buffer is not submitted prior to calling `flush()`, because subsequent calls to `submit_*_job()` will use the old command buffer which will have been freed when the command pool is flushed. To fix, invalidate any existing command buffers when calling `flush()`.
Test Plan:
Build the test binary for Android:
```
buck build --target-platforms=ovr_config//platform/android:arm64-fbsource -c ndk.custom_libcxx=false //xplat/caffe2:pt_vulkan_api_test_bin --show-output
```
Push and run the test binary on a local android phone.
Differential Revision: D54425370
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121027
Approved by: https://github.com/mcr229, https://github.com/cbilgin
Fixes https://github.com/pytorch/pytorch/issues/120645
`_internal_new_from_data` calls `_recursive_build`, but we run into errors such as in the following cases.
```
Failed running call_function <function tensor at 0xDEADBEEF>:
scalar_tensor(): argument (position 1) must be Number, not FakeTensor
# e.g. cases
1. [FakeTensor(..., size=(20, 1), dtype=torch.float64), ..., FakeTensor(..., size=(20, 1), dtype=torch.float64)]
- Here, we call _recursive_build(sizes=[4] ...) which hits the base case `if dim == ndim:` in the 2nd level of recursion.
- So, we try to return `scalar_tensor(FakeTensor)`
2. [[(FakeTensor(..., size=(1,), dtype=torch.int64), FakeTensor(..., size=(), dtype=torch.int64)]]
# site note: when can size = ()? Probably from scalar_tensor.
>>> torch.scalar_tensor(1).shape
torch.Size([])
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120872
Approved by: https://github.com/ezyang
The test has been failing sporadically in CI recently and the failures
are not reproducible locally, likely due to some nasty race condition
related to a combination of MultiThreadedTestCase, the use of global state
and finalizers, and the recently introduced test decorator for the native
funcol migration.
Switching the test to use MultiProcessTestCase to provide better
isolation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121046
Approved by: https://github.com/weifengpy
This PR uses `ncclAvg` op (via `ReduceOp.AVG`) if doing fp32 reduce-scatter. This allows the division by world size to happen in the reduce-scatter kernel itself, which seems to save extra memory read/write for dividing. This yields ~1.5% speedup on the Llama-7B workload (and makes per-parameter FSDP faster than flat-parameter FSDP 😅 ).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120919
Approved by: https://github.com/yifuwang, https://github.com/wanchaol
ghstack dependencies: #120238, #120910
This PR adds support for `clip_grad_norm_(foreach=True)` by implementing `aten._foreach_norm.Scalar` and `aten._foreach_mul_.Tensor`. `foreach=True` is required to get competitive performance with `DTensor`.
`foreach=True` reduces CPU overhead for Llama-7B from 388 ms to 63 ms. Existing flat-parameter FSDP's `clip_grad_norm_` takes 3 ms on CPU 😢 .
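A minimal single-process sketch of the `foreach=True` path (no DTensor here; the call signature is the standard `torch.nn.utils.clip_grad_norm_`):
```python
import torch
import torch.nn as nn

model = nn.Linear(16, 16)
model(torch.randn(4, 16)).sum().backward()
# foreach=True routes through the _foreach_* ops added in this PR, which is
# what keeps the DTensor path competitive with flat-parameter FSDP.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0, foreach=True)
print(total_norm)
```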
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120910
Approved by: https://github.com/wanchaol, https://github.com/janeyx99
ghstack dependencies: #120238
This PR adds `DTensor` support for `aten.linalg_vector_norm.default` and `aten.stack.default` so that we can run `clip_grad_norm_` (with `foreach=False`).
To implement `linalg_vector_norm`, we introduce a `_NormPartial` placement since the reduction op for norm is the norm itself.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120238
Approved by: https://github.com/wanchaol
Summary:
When there are multiple PGs in a process and a hardware failure happens,
we found that multiple PGs/threads in the same
process compete to dump the same records at the same time. This
affects the reliability of the dumps.
In this PR, we make the change such that only one thread/PG
dumps: PG0's monitor thread. We use a static variable to indicate
that something (e.g., a collective timeout) has triggered the dump
locally.
The monitor thread dumps debug info under any one of these 3 conditions:
1: this static variable is set to true by the watchdog thread when it detects
a timeout or pipe dump signal
2: timeout signal is received from other ranks through tcpstore
3: no heartbeat of watchdog
Test Plan:
python test/distributed/test_c10d_nccl.py -k
test_timeout_dumps_on_stuck_ranks
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120893
Approved by: https://github.com/wconstab
Give TD its own job so that each shard can get the results from this one job artifact; the shards will always be in sync with each other and we no longer need to worry about consistency issues.
* Move test discovery to its own file that is not dependent on torch so it can be run without building torch
* Cannot do cpp test discovery before building pytorch
* Move TD calculation to own file that will create a json file with the final results
* TD is now job/build env agnostic
* TD will rank all tests, including those that test jobs may not want to run (ex it will rank distributed tests along with default tests, even though these tests are never run on the same machine together)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118250
Approved by: https://github.com/huydhn
TestCustomOp's tests use helper attributes and functions from a util parent class. To support arbitrary test classes, we need to refactor the current approach. Instead of allowlisting certain methods, we can instead copy the whole class and only overwrite the "test_.*" methods.
Compiled autograd fails on ~10/90 of the newly added tests. test_autograd_function_backed_op is the example we discussed in PT-2D meeting about requiring c++ autograd::Function support. I'm addressing this in #120732
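A rough sketch of the copy-and-overwrite idea (hypothetical names; the actual wrapper enables compiled autograd around each test):
```python
import functools

def wrap_test(fn):
    # Hypothetical wrapper; the real one runs the test under compiled autograd.
    @functools.wraps(fn)
    def wrapper(self, *args, **kwargs):
        return fn(self, *args, **kwargs)
    return wrapper

def make_compiled_autograd_variant(cls):
    # Copy the whole class and overwrite only the test_* methods, keeping all
    # helper attributes and functions from the parent class intact.
    attrs = dict(cls.__dict__)
    for name, fn in list(attrs.items()):
        if name.startswith("test_") and callable(fn):
            attrs[name] = wrap_test(fn)
    return type(cls.__name__ + "CompiledAutograd", cls.__bases__, attrs)
```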
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120679
Approved by: https://github.com/jansel, https://github.com/zou3519
In particular this ensures we release the GIL when serializing:
- PyBytes objects (this is how we get the pickle object)
- Storage objects
Other string-like objects keep the GIL, which is fine because we only use this for very small strings today (for endianness), so releasing the GIL is not important there.
Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120818
Approved by: https://github.com/colesbury
For the user-defined `Mapping` type, it may contain some metadata (e.g., pytorch/tensordict#679, https://github.com/pytorch/pytorch/pull/120195#issue-2141716712). Simply using `type(mapping)({k: v for k, v in mapping.items()})` does not take this metadata into account. This PR uses `copy.copy(mapping)` to create a clone of the original collection and iteratively updates the elements in the cloned collection. This preserves the metadata in the original collection via `copy.copy(...)` rather than relying on the `__init__` method in the user-defined classes.
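A minimal sketch of the approach on a hypothetical metadata-carrying dict subclass (not the actual pytree code):
```python
import copy

class MetaDict(dict):
    def __init__(self, *args, meta=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.meta = meta

def map_values(fn, mapping):
    # copy.copy keeps attributes like `meta`; re-running __init__ would not.
    out = copy.copy(mapping)
    for k, v in mapping.items():
        out[k] = fn(v)
    return out

d = MetaDict({"a": 1, "b": 2}, meta="important")
d2 = map_values(lambda v: v * 10, d)
assert d2.meta == "important" and d2["a"] == 10
```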
Reference:
- pytorch/tensordict#679
- #120195
Closes #120195
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120553
Approved by: https://github.com/vmoens
Summary: Change to get_buffer from the input plain_graph_module instead of the new stateful_gm when restoring non_persistent buffers, since the stateful_gm doesn't contain the buffer yet.
Test Plan:
Added test case.
`buck test caffe2/test:test_export -- test_unlift_nonpersistent_buffer`
Differential Revision: D54216772
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120715
Approved by: https://github.com/zhxchen17
Summary: Today we don't allow free functions to be the tracing callable in torch.export. As a part of migrating capture_preautograd_graph usages to torch.export, we need to ban free functions in capture_preautograd_graph as well.
Test Plan: CI
Differential Revision: D54319597
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120817
Approved by: https://github.com/zhxchen17, https://github.com/andrewor14
When a dynamo backend captures the entire forward pass and the entire backward pass without a graph break, there could be many `contiguous` calls (from memory, hundreds or thousands for a big model). Here we can save that overhead by checking `is_contiguous` before the `contiguous` call.
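A minimal sketch of the check (a generic helper, not the actual call site in the backend):
```python
import torch

def maybe_contiguous(t: torch.Tensor) -> torch.Tensor:
    # Skip the extra call when the tensor is already contiguous; with hundreds
    # of such calls per step the saved overhead adds up.
    return t if t.is_contiguous() else t.contiguous()

x = torch.randn(4, 4).t()
print(maybe_contiguous(x).is_contiguous())  # True
```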
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118730
Approved by: https://github.com/thiagocrepaldi, https://github.com/ezyang
Currently when there is a print/warning in the graph, dynamo graph breaks causing export to fail. However export would like to just skip over these print/warning calls: https://github.com/pytorch/pytorch/issues/113792.
Additionally there's a torch.compile feature request to "reorder prints" so that instead of graph breaking when hitting prints/logging, we can skip over these prints to create larger compiled graphs, and then print the results out after those compiled graphs: https://github.com/pytorch/pytorch/issues/93739. This PR also adds the `reorderable_logging_functions` config for users to register logging functions to be reordered (like `print` or a custom logging function). Printout of the bytecode after reordering the prints looks like the following: P914736600
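A minimal sketch of registering `print` as reorderable (assuming the config is exposed as a set on `torch._dynamo.config`, as this PR describes):
```python
import torch
import torch._dynamo.config as dynamo_config

# Registered logging functions no longer force a graph break; the call is
# replayed after the compiled region runs.
dynamo_config.reorderable_logging_functions.add(print)

@torch.compile(backend="eager")
def f(x):
    y = x + 1
    print("y =", y)
    return y * 2

f(torch.randn(3))
```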
There are some limitations to the printing right now:
* You can only register logging functions, not methods
* Inputs to the logging functions can only be tensors, constants, and format strings
* Inputs to the logging functions which will later be mutated in-place will not be printed correctly
TODO: Add the following tests
* print function with argument of nested data structure;
* print function with argument of nested data structure being updated inside of compile region (this would test if we handle side effect correctly);
* custom defined logging functions with nn.Module or nn.Module attribute arguments;
* custom defined logging functions with submodule input/output as arguments (we need to handle the mapping and fused-out value);
* custom defined logging functions with tensor argument and mutation inside of the function (TBD: this may increase memory usage);
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116106
Approved by: https://github.com/yanboliang
Summary: Previously, we relied on the `lark`-based parsing of the string TTIR representation dumped by the Triton compiler. However, this has proven to be brittle in the face of changes both in the user-written Triton kernel code and in the Triton compiler code.
In this PR, we add an alternative way of mining the function information from the TTIR based on walking the tree of structured MLIR entities. To this end, we rely on the MLIR bindings exposed by `libtriton` (related PR in Triton: https://github.com/openai/triton/pull/3191).
For now, we introduce gating based on whether `ttir_module.hasattr("walk")`. This will allow switching to the newly introduced TTIR analysis approach only when the new MLIR bindings (including that of `ModuleOp::walk`) become available in the Triton pin. Before then, we'll keep using the old string TTIR parsing-based approach.
Test Plan: The new functionality was tested locally with the latest Triton version compiled with the added new MLIR bindings: all Triton kernel mutation tests in `test_triton_kernels.py` are passing. Here we rely on the CI for regression testing, but it won't cover the new functionality due to gating.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120476
Approved by: https://github.com/oulgen
The special-case code for handling SUPPORTED_NODES was producing a guard that looked like:
```
"G['torch'].utils._pytree.SUPPORTED_NODES[<class '__main__.CausalLMOutputWithPast'>].type"
```
resulting in a eval error trying to evaluate the guard.
This change adds a new source type (`ClassSource`) which is given a class type (in this case `CausalLMOutputWithPast`) and attempts to fetch it from its defining module. It then uses that to build the `SUPPORTED_NODES` guards instead of referring to the type directly.
Also added a unit test which fails before this change and passes after.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120798
Approved by: https://github.com/anijain2305
VariableInfo is used by both `custom_function.h` (in a templated class) and `compiled_autograd.h` (in a class with some templated methods). Another way could have been to make a `compiled_autograd.cpp` and forward declare VariableInfo, but this VariableInfo was also being used in other nodes like PyNode so it felt cleaner to do it this way.
Differential Revision: [D54287007](https://our.internmc.facebook.com/intern/diff/D54287007)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120732
Approved by: https://github.com/jansel
Update xfails for test_dispatch_meta_outplace and test_dispatch_symbolic_meta_outplace.
These tests are sometimes expected to fail because we moved the registrations from meta_registrations.py to fake_impls.py. AFAIK, this is okay: fake tensors will still work because we have special handling in fake_impls.py. The purpose of this PR is to update the xfails so they are correctly xfailing the failing tests.
Previously, I set these to xfail only for bfloat16, float16, and float32, but not float64; but this isn't really correct. Explanation below:
Scaled dot product attention (SDPA) has multiple implementations, including efficient_attention, flash_attention, and unfused attention. flash_attention supports fp16, bf16. efficient_attention supports fp16, bf16, fp32. unfused attention supports all dtypes.
efficient_attention and flash_attention implementations will fail the meta tests, but the unfused attention will not. Certain platforms may support none, both, or one of efficient_attention and flash_attention. Unfused attention will pass because it falls back to constituent ops which have registered meta kernels.
So: on CUDA, we have all 3 available: in bf16, fp16, fp32, we'll select one of the fused implementations (where this test will fail).
On ROCM, we don't have efficient_attention: so fp32 will use the unfused implementation, where the test will pass.
Fix in this PR:
* If any fused impl is available, then xfail float16 & bfloat16
* If efficient_attention is available, then also xfail float32
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120928
Approved by: https://github.com/drisspg
Inductor codegen for `_assert_async` is currently disabled because we don't really understand how to codegen `scalar_to_tensor` on a Sympy expression. I initially tried to see if I could get this to work, but I got into some weird problem involving stride sorting, so I decided to fix it properly by not going through a tensor.
So we introduce an `_assert_scalar` which takes a scalar as an argument, avoiding needing to turn a SymBool into a tensor before asserting on it. I also add `_functional_assert_scalar` for good luck, although this doesn't do anything right now because https://github.com/pytorch/pytorch/pull/104203 still hasn't been landed.
I need to customize the codegen for this operator, so I decide to directly implement it in Inductor, rather than trying to treat it as a generic ExternKernel. This leads to the new AssertScalar IR node. This is written carefully so that it doesn't get DCE'd by Inductor.
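For reference, a minimal sketch of user code that produces such a scalar assertion on an unbacked SymInt (the config flag is only needed so `.item()` is captured without a graph break):
```python
import torch

torch._dynamo.config.capture_scalar_outputs = True

@torch.compile(fullgraph=True)
def f(x):
    u = x.item()          # unbacked SymInt
    torch._check(u >= 0)  # a scalar (SymBool) assertion, no tensor needed
    return torch.zeros(u)

f(torch.tensor(3))
```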
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114148
Approved by: https://github.com/jansel
ghstack dependencies: #120800
Currently when there is a print/warning in the graph, dynamo graph breaks causing export to fail. However export would like to just skip over these print/warning calls: https://github.com/pytorch/pytorch/issues/113792.
Additionally there's a torch.compile feature request to "reorder prints" so that instead of graph breaking when hitting prints/logging, we can skip over these prints to create larger compiled graphs, and then print the results out after those compiled graphs: https://github.com/pytorch/pytorch/issues/93739. This PR also adds the `reorderable_logging_functions` config for users to register logging functions to be reordered (like `print` or a custom logging function). Printout of the bytecode after reordering the prints looks like the following: P914736600
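A hedged usage sketch of the config described above (treating `reorderable_logging_functions` as a set of callables, per this PR):
```python
import torch

torch._dynamo.config.reorderable_logging_functions.add(print)

@torch.compile
def f(x):
    y = x + 1
    print("intermediate:", y)  # no graph break; printed after the compiled graph runs
    return y * 2

f(torch.randn(4))
```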
There are some limitations to the printing right now:
* You can only register logging functions, not methods
* Inputs to the logging functions can only be tensors, constants, and format strings
* Inputs to the logging functions which will later be mutated in-place will not be printed correctly
TODO: Add the following tests
* print function with argument of nested data structure;
* print function with argument of nested data structure being updated inside of compile region (this would test if we handle side effect correctly);
* custom defined logging functions with nn.Module or nn.Module attribute arguments;
* custom defined logging functions with submodule input/output as arguments (we need to handle the mapping and fused-out value);
* custom defined logging functions with tensor argument and mutation inside of the function (TBD: this may increase memory usage);
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116106
Approved by: https://github.com/yanboliang
Summary:
This is a reimplemented version of the FB specific code in https://www.internalfb.com/diff/D54230697
The new strategy is that we unconditionally install an FB handler on the trace_log logger (and always set the level to DEBUG). When the first log message is emitted, we check the JK/filesystem to see if we should actually do logging. If we decide not to do logging, we remove the handler from trace_log and are done.
build_only[github-export-checks,executorch,pytorch_benchmark,pytorch_quantization,pytorch_distributed,pytorch_distributed_gpu,pytorch_dynamo_inductor,pytorch_functorch,pytorch_fx2trt,pytorch_diff_train_tests_ads,glow_fb_pytorch_tests,training_platform,training_platform_compatibility,training_toolkit_applications,training_toolkit_examples,training_toolkit_model_optimization,dper3_pytorch,xplat_caffe2,pytorch_dev,android-pytorch-instrumentation-tests,smartpytorchgithub_first_try_merge,frl-target-determinator,f6-buck,training_platform_for_github,sigmoid_cpu,sigmoid_gpu,aiplatform_modelprocessing_for_github,accelerators_workloads_models_slimdsnn,ae_aotinductor_benchmark_test,aps_,aps_deterministic_ne_tests,dper_lib_silvertorch,torchrec,torchrec_fb,deeplearning_aot_inductor]
Test Plan:
sandcastle
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//torchrec/inference/tests:test_single_gpu_executor -- --exact 'torchrec/inference/tests:test_single_gpu_executor - TorchDeployGPUTest.NestedModelSingleGPU'
buck2 test 'fbcode//mode/dev-nosan' fbcode//dper_lib/silvertorch/modules/dynamic_stats/tests:accumulators_test -- --exact 'dper_lib/silvertorch/modules/dynamic_stats/tests:accumulators_test - test_global_fixed_interval_accumulator (dper_lib.silvertorch.modules.dynamic_stats.tests.accumulators_test.GlobalFixedIntervalUnivalentAcculumatorTest)'
```
Also running a test flow with/without JK enabled
Differential Revision: D54275086
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120915
Approved by: https://github.com/yanboliang
A global transform list is created. All backend instances call the transform functions in that list sequentially to modify the exported ONNX model before sending the model to the ORT session. For example, `record_onnx_model_transform` below is a no-op transform that only records the ONNX graphs sent to ONNXRuntime.
```python
import torch

recorded_models = []

def record_onnx_model_transform(onnx_model):
    # Record the ONNX model seen by the transform.
    recorded_models.append(onnx_model)

from torch.onnx import (
    register_backend_graph_transform,
    unregister_backend_graph_transform,
)

# Register so that the `onnxrt` backend calls it to modify the ONNX model.
register_backend_graph_transform(record_onnx_model_transform)

def example_model(x: torch.Tensor):
    y = torch.sigmoid(x)
    z = x + y
    return z

compiled_model = torch.compile(
    example_model,
    backend="onnxrt",
    dynamic=True,
)

# Compilation happens on the first call. During compilation, the exported ONNX
# model is modified by calling `record_onnx_model_transform` before sending it
# to `onnxruntime.InferenceSession`.
compiled_model(torch.randn(4))

# Now, `recorded_models` should contain one `onnx.ModelProto` representing
# `example_model(x: torch.Tensor)`.

# Remove the transform when not needed. If `record_onnx_model_transform` is not
# removed, it will be applied to all models compiled by `backend="onnxrt"`.
unregister_backend_graph_transform(record_onnx_model_transform)
```
In the future, we plan to use this mechanism to register all graph transforms, such as graph fusion and general ONNX optimizations, for `onnxrt`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120854
Approved by: https://github.com/BowenBao, https://github.com/thiagocrepaldi
Fixes https://github.com/pytorch/pytorch/issues/120441
We follow how triton_kernel_wrapper_functional gets re-inplaced:
- If we see auto_functionalized, then first we compute what inputs we
actually need to clone ("tensors_to_clone") and fixup the graph. This happens in
`reinplace_and_refine_tensors_to_clone`, which I have refactored out
of the triton_kernel_wrapper_functional reinplacing code.
- Later on, after the reinplacing pass, we have a decomposition pass for auto_functionalized. In that decomposition pass, we make use of the "tensors_to_clone" info and only clone those inputs in the decomposition.
- We shepherd "tensors_to_clone" from the first step to the second step by setting the .meta field on the auto_functionalized node.
Test Plan:
- existing tests
- tested locally by reading the output of TORCH_LOGS="post_grad_graphs"
- added assertExpectedInline tests for the post_grad_graphs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120829
Approved by: https://github.com/oulgen
# Note: Returning Fake Tensors on First AOT Autograd Call
#
# Inductor will optimize strides of outputs when it deems it profitable.
# For instance, converting to channels last. When we split the graph here
# into multiple inductor compilations, we need to make sure that the
# output strides of one compilation are appropriately passed to the subsequent
# compilations. However, the mapping from inductor output to dynamo output
# is non-trivial due to aot_autograd's deduping, de-aliasing, mutation, re-writing,
# subclass handling, etc. In order to replay all this logic we set a flag such that
# the first invocation of inductor in aot_autograd will return Fake Tensors with
# appropriate strides. Then, all of aot autograd's runtime logic is replayed.
# This gives us the appropriately strided outputs here which will reflect runtime strides.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120523
Approved by: https://github.com/yf225, https://github.com/bdhirsh
This PR adds tests to test distributed state dict work properly for FSDP2's model and optimizer state_dict after a full training loop.
We test the combination of these options on an evenly sharded model.
```
{
"reshard_after_forward": [True, False],
"optimizer_class": [torch.optim.Adam],
"compile_model": [True, False],
},
```
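A small illustrative sketch of how such option combinations might be enumerated in a test (the real test uses the FSDP test harness; this shows only the combination logic):
```python
import itertools
import torch

options = {
    "reshard_after_forward": [True, False],
    "optimizer_class": [torch.optim.Adam],
    "compile_model": [True, False],
}

for combo in itertools.product(*options.values()):
    cfg = dict(zip(options.keys(), combo))
    # run the training loop and the state_dict parity checks with `cfg` here
    print(cfg)
```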
Followup: 1. Add test for unevenly sharded model. 2. Add test to include `torch.optim.AdamW` (seems to have some gaps currently, still investigating)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120871
Approved by: https://github.com/fegin
Summary:
Pulling out logging parameters into logging specs that can be overridden (follow-up changes will cover a possible override mechanism)
Why?
Right now the logging approach is quite rigid:
- Requires the log directory to exist and not be empty
- Will create a tempdir otherwise
- Creates a subdir for a run
- Creates a subdir for each attempt
- Creates files named stdout.log, stderr.log, error.json
In some instances users would like to customize this behavior, including file names, based on context. We do already have a mechanism to template the multiplexed teed output prefix.
With the current changes, users can create a custom logs spec that uses env variables to change the behavior.
Notes:
Made `LaunchConfig.logs_specs` an optional field that will be bound to a `DefaultLogsSpecs` instance. There are a large number of clients (code) that use the API directly without going through the torchrun API. For those cases, we have to explicitly pass a LogsSpecs implementation if we would like to override it. For regular torchrun users, we can use the pluggable approach proposed in the follow-up change.
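A hedged sketch of what overriding the logs specs could look like for direct API users; the `DefaultLogsSpecs` constructor arguments shown here (e.g. `log_dir`) are assumptions rather than verified API:
```python
from torch.distributed.launcher.api import LaunchConfig
from torch.distributed.elastic.multiprocessing import DefaultLogsSpecs

config = LaunchConfig(
    min_nodes=1,
    max_nodes=1,
    nproc_per_node=2,
    # Assumed parameter name; a custom LogsSpecs subclass could be passed instead.
    logs_specs=DefaultLogsSpecs(log_dir="/tmp/my_run_logs"),
)
```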
Test Plan: CI + unit tests
Differential Revision: D54176265
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120691
Approved by: https://github.com/ezyang
This is basically done the obvious way. For better or worse, I jammed this into what used to be `_maybe_guard_eq` but is now `_maybe_guard_rel`. I was careful to test all the off-by-one conditions, and each permutation. Let me know if you think I missed anything. Importantly, this now works for unbacked SymInts.
While testing, I noticed we are silently duck sizing all symbolic variables in `test_dynamic_shapes.py`. This may or may not be covering up bugs.
Along the way, I had to fix a bug in export constraints, where we weren't checking that the final var_to_range was consistent with what the user requested at top level.
After I implemented all this, I realized that applying this to non-unbacked SymInts was duplicative with @ysiraichi's previous work on https://github.com/pytorch/pytorch/pull/97963 . The upside is I now understand what Yukio was trying to do in the original PR, and I think my new logic is simpler and less error prone. In Yukio's earlier diff, Yukio tried very hard to avoid changing what guards we actually issue (since this would cause tests to wobble). Thus, when he refined a range, he also saved the guard that actually caused the range to refine. In this PR, I don't bother saving these guards; instead I just tighten var_to_range directly and rely on generating guards on this to be correct. The key insight is that if I assert `x < y`, it's always safe to emit (potentially) more restrictive range guards, because this won't invalidate our guards, it will just make them a little too strong (but actually, I think we are precise along the way.) If these guards make it unnecessary to test `x < y`, because now the ranges for x and y are disjoint, this is fine, we've subsumed the x < y guard and can just not bother testing it. If I've gotten it right, TV will agree with me.
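To make the range-tightening concrete, here is a tiny illustrative sketch (plain Python over closed integer ranges, not the actual ShapeEnv code) of what asserting `x < y` lets us refine:
```python
def refine_lt(x_range, y_range):
    # Given integer ranges for x and y and the fact x < y,
    # tighten x's upper bound and y's lower bound.
    x_lo, x_hi = x_range
    y_lo, y_hi = y_range
    return (x_lo, min(x_hi, y_hi - 1)), (max(y_lo, x_lo + 1), y_hi)

print(refine_lt((0, 100), (0, 100)))  # ((0, 99), (1, 100))
```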
In fact, I had a bug in this PR which TV didn't catch, which is that when we have a recorded var_to_guards for a symbol, we unconditionally never generate the range guard for it, even if var_to_guards is potentially inconsistent with var_to_range (because var_to_range was updated separately). With var_to_guards removed, I don't have to worry about this inconsistency.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120800
Approved by: https://github.com/Skylion007, https://github.com/avikchaudhuri, https://github.com/ysiraichi
Summary:
When we convert `dynamic_shapes` to `constraints` and pass them to `_dynamo.export`, we shouldn't give a deprecation warning. Such conversion happens when calling `torch.export.export`, for example, but it can also happen when calling `capture_pre_autograd_graph` (which itself gives this deprecation warning when `constraints` are passed directly).
Since `_log_export_usage` is an indicator of a top-level call (it is `True` by default but set to `False`, or at least passed through, by callers), we can (ab)use it to indicate when to give this deprecation warning.
Test Plan: none
Differential Revision: D54350172
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120896
Approved by: https://github.com/BoyuanFeng, https://github.com/zhxchen17
Current threshold is to cut the bottom 75% of test files, which results in 13 min of tests getting cut.
test_ops, functorch/test_ops, test_decomp, and other really long-running test files are not getting cut, which makes the top 25% take really long (still 90+ min).
The original plan was to test this on ROCm, but I'm worried about queuing given that cutting 75% of test files only cuts off 13 min. Crossref is rarely referenced by others and people keep talking about getting rid of it, so it's a good alternative.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119426
Approved by: https://github.com/huydhn
This diff introduces a new separate logging of autotuning results,
with the intention of making the results analyzable, specifically
those for the new experimental Cutlass backend.
Results are logged as text files with one JSON document corresponding to a single benchmark result per line.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119004
Approved by: https://github.com/jansel
ghstack dependencies: #120620
The existing use of "if(NOT ENV{ROCM_SOURCE_DIR})" seems to not work correctly, e.g.:
```
$ cmake --version
cmake version 3.26.4
$ cat CMakeLists.txt
cmake_minimum_required(VERSION 3.18 FATAL_ERROR)
project(FOO)
if(NOT ENV{ROCM_SOURCE_DIR})
message(INFO ": not defined 1")
else()
message(INFO ": defined 1: $ENV{ROCM_SOURCE_DIR}")
endif()
if("$ENV{ROCM_SOURCE_DIR}" STREQUAL "")
message(INFO ": not defined 2")
else()
message(INFO ": defined 2: $ENV{ROCM_SOURCE_DIR}")
endif()
$ ROCM_SOURCE_DIR=/tmp cmake .
INFO: not defined 1
INFO: defined 2: /tmp
-- Configuring done (0.0s)
-- Generating done (0.0s)
-- Build files have been written to: /home/yangche/tmp/tmp
```
This PR replaces it with a STREQUAL check. Note that the choice
of STREQUAL is to avoid cases like:
```
$ ROCM_SOURCE_DIR= cmake .
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120858
Approved by: https://github.com/jianyuh, https://github.com/jeffdaily
Summary:
Previously `export` would take `constraints` built with `dynamic_dim(...)`s. This has been deprecated for a while; one can now pass in a `dynamic_shapes` spec built with `Dim(...)`s.
Here we kill this deprecated API. Eventually this will lead to simplification of the underlying implementation, since the new `Dim`-based specs can map 1-1 with symbolic shapes concepts without going through indirect machinery of `dynamic_dim`-based constraints. It is expected that internal APIs like `_dynamo.export` and `_trace._export_to_torch_ir` will change when that happens.
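For reference, a minimal example of the `Dim`-based spec that replaces the old `dynamic_dim`-based constraints:
```python
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x):
        return x.sum(dim=0)

batch = Dim("batch")
ep = export(M(), (torch.randn(4, 8),), dynamic_shapes={"x": {0: batch}})
```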
Leaving `aot_compile` and `capture_pre_autograd_graph` entry points alone for now. This will eventually be updated anyway.
Test Plan: updated tests
Differential Revision: D54339703
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120860
Approved by: https://github.com/suo, https://github.com/tugsbayasgalan
Summary: Currently the logic for filling the default value for optional arguments is scattered in several places. By storing OpOverload in the base ExternKernel class, we can simplify codegen_kwargs, and this is a preparation step for enabling the torchgen-ed C shim. The default value filling logic for FallbackKernel can also be simplified, but that can come later.
Differential Revision: [D54258089](https://our.internmc.facebook.com/intern/diff/D54258089)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120629
Approved by: https://github.com/chenyang78
ghstack dependencies: #119987, #120592
Summary:
https://github.com/pytorch/pytorch/pull/104373 introduced backend_id
> an unique ID for the actual backend object, this is also exposed in record_param_comms, so we can correlate these collectives with the right backend object.
However, it is inconvenient to correlate collectives with the backend id. Using the pg id (uid) to correlate directly is a better solution.
This PR changes the ID information exposed in record_param_comms from backend_id to pg_id.
Differential Revision: D53558257
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120475
Approved by: https://github.com/aaronenyeshi
1. This moves TRITON_CONSTRAINT to the common binary_populate_env.sh so that it is set for all wheels.
Test in CI via the ``ciflow/binaries`` label. Please note we only set this constraint when PYTORCH_EXTRA_INSTALL_REQUIREMENTS is set, and this variable is set for all the wheels that get uploaded to PyPI; hence the triton constraint needs to be set in the same place.
This is done for regular wheels and ROCm wheels separately, since ROCm wheels use a different triton package.
3. Cleanup legacy unused code
Test:
``
git grep setup_linux_system_environment.sh
``
Needs: https://github.com/pytorch/builder/pull/1712
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120744
Approved by: https://github.com/huydhn
scaled_gemm for ROCm using hipblaslt. As of ROCm 6.0, HIPBLASLT_MATMUL_DESC_AMAX_D_POINTER is not supported. A work-around is provided, performing the absmax operation on the output buffer, but this results in some loss of accuracy for the absmax result. For this reason the feature should be considered beta/preview.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117822
Approved by: https://github.com/jianyuh, https://github.com/xw285cornell
Summary: In non-strict mode of torch.export(), we didn't set `is_compiling()` to return `True`, which is needed by some models.
Test Plan: Unit tests and manual testing.
Differential Revision: D53624452
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119602
Approved by: https://github.com/suo
Summary: Previously, we omitted `equal_to_1` from the `triton_meta` part of the `@user_autotune` decorator. For user-written Triton kernels, this could lead to perf regressions, as the kernel in the Inductor codegen is compiled without `equal_to_1` specialization.
Fixes#120478. The repro from the issue, on A100:
Before this PR:
```
Triton matmul: 0.0167 seconds
Triton matmul compiled: 0.0751 seconds
```
After this PR:
```
Triton matmul: 0.0168 seconds
Triton matmul compiled: 0.0072 seconds
```
Test Plan:
```
$ python test/dynamo/test_triton_kernels.py -k test_triton_kernel_equal_to_1_arg
...
----------------------------------------------------------------------
Ran 3 tests in 3.545s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120579
Approved by: https://github.com/oulgen, https://github.com/jansel, https://github.com/chenyang78
Summary: Previously, in `memory_plan_reuse` we assumed that the generated code is flat, in the sense that it can't have nested scopes. However, with nested control flow codegen-ing, this is no longer the case. This causes bugs where buffers are reused across visibility boundaries in different nested scopes.
In this PR, we add nested planning states in `memory_plan_reuse` on entering and exiting scope in the codegen. This restricts the buffer reusability only to the currently active (peak) scope / planning state.
Test Plan:
```
python test/inductor/test_control_flow.py -k test_subgraphs_with_parameters
...
----------------------------------------------------------------------
Ran 27 tests in 149.413s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120777
Approved by: https://github.com/chenyang78, https://github.com/desertfire, https://github.com/jansel
ghstack dependencies: #120665
Previously, we parametrized some tests to run with both native and py funcol by flipping a global variable. However, some of these tests are multi-threaded tests, and the parametrization mechanism could lead to race conditions.
This PR changes the mechanism to use `mock.patch`, which is applied on a per-thread basis.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120833
Approved by: https://github.com/wconstab
Summary: The current machinery of Inductor's `compile_fx` assumes that the incoming fx graph is flat. As a result, everything before `graph.run` is applied to the outermost graph. This assumption was valid before #119759, but now there is control flow bringing (arbitrarily deeply) nested fx subgraphs to `compile_fx`.
In this PR, we start extending the `compile_fx` machinery to deal with nested fx subgraphs. Namely, we recursively apply Inductor's `pre_grad`, `joint_graph`, and `post_grad` passes to the nested subgraphs in the incoming fx graph.
For the recursive application of the `pre_grad` passes (which require example inputs per subgraph), we don't pass example inputs for the nested subgraphs. A few different attempts to infer the latter via fake tensor prop have led to different side effects in the model. Therefore, to the nested subgraphs we only apply a subset of the `pre_grad` passes that doesn't require example inputs.
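An illustrative sketch (not Inductor's actual code) of the recursive-application idea, where a pass is applied to any nested fx.GraphModule before the outer one:
```python
import torch.fx as fx

def apply_pass_recursively(gm: fx.GraphModule, pass_fn) -> None:
    # Nested control-flow subgraphs show up as GraphModule submodules.
    for _, submod in gm.named_children():
        if isinstance(submod, fx.GraphModule):
            apply_pass_recursively(submod, pass_fn)
    pass_fn(gm)
```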
Test Plan:
```
$ python test/inductor/test_control_flow.py
...
----------------------------------------------------------------------
Ran 26 tests in 59.252s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120665
Approved by: https://github.com/eellison
Fix and test issues with both coalesced and individual send/recv ops
Considered an alternate approach and then ditched it
- alternate approach: #119757
- reason ditched: prefer recording individual collective events inside
coalescing region instead of just the event at the end of the region,
which also would not have tensor sizes or opnames without additional
state variables added
Another approach also ditched
- record events on workEnqueue instead of initWork
- reason ditched: too messy to get input/output shapes tagged on
recording when recording in workEnqueue. Adding the info onto the
Work obj would be possible, but adds to overhead of copying Works
which we do on every collective. We can get info off the input/output
tensors directly in initWork, but we don't want to keep refs to those
tensors alive while the work is Enqueued, so we'd have to specifically
copy size lists or something.
This PR instead avoids creating a work inside pointToPoint when
coalescing is active. Instead, only at endCoalescing() is a work finally
initialized and enqueued. But it adds a record() call inside
pointToPoint() instead of creating a work, during coalescing. This
record() call picks up tensor shapes and op names.
It ALSO changes initWork to accept a 'record' argument. This defaults to
false, and should only be set to true if the caller ensures the work
will be enqueued by workEnqueue, ensuring its cuda events are live when
used by flight recorder's update_state().
The testing uncovers some odd pre-existing behaviors and leaves them alone for now. We could change some of these:
- seq starts off at 1, not 0, for the first op (but this is inconsistent)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120270
Approved by: https://github.com/shuqiangzhang
ghstack dependencies: #120724
In cases where sequence number is shared between events (e.g. coalesced
collectives) we want to ensure a unique (and ordered) ID per record.
Note: the records are already in a list, so their ID could be implicitly
observed. But (1) it's a ring buffer, so absolute ID is lost once the
buffer rolls over once, (2) users may sort or process or filter their
flight records, so having the ID be an explicit member of an entry is
still useful
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120724
Approved by: https://github.com/zdevito
As reported in https://github.com/pytorch/pytorch/issues/119434, `pyhpc_isoneutral_mixing`, `pyhpc_equation_of_state` and `pyhpc_turbulent_kinetic_energy` failed with dynamic shape testing, we propose to skip the dynamic batch size testing of these 3 models in this PR.
* Error msg is
```
File "/localdisk/leslie/torch_inductor_community/pytorch/benchmarks/dynamo/common.py", line 3879, in run
assert marked, f"nothing in example_inputs had a dim with {batch_size}"
AssertionError: nothing in example_inputs had a dim with 1048576
```
* Root cause:
* The benchmark code will only annotate an input's dim as dynamic when its size equals the batch size c617e7b407/benchmarks/dynamo/common.py (L3867-L3871). If it fails to find any such dim, the above error is thrown.
* However, for these 3 models, none of the input dims will equal the input batch size because of the [relationship of dim sizes](26b85eadde/torchbenchmark/models/pyhpc_equation_of_state/__init__.py (L12-L16))
```
shape = (
math.ceil(2 * size ** (1/3)),
math.ceil(2 * size ** (1/3)),
math.ceil(0.25 * size ** (1/3)),
)
```
* Another thing: `pyhpc_isoneutral_mixing` and `pyhpc_equation_of_state` can pass the dynamic batch size accuracy testing because the batch size is set to 4 in accuracy testing (c617e7b407/benchmarks/dynamo/common.py (L3456)) and `math.ceil(2 * size ** (1/3))` happens to equal 4.
* Since the input dim sizes have the above relationship, running these models with dynamic shapes would require annotating `dim[0](s0) = dim[2](s1) * 8`; per the discussion in https://github.com/pytorch/pytorch/issues/117477#issuecomment-1897108756 @avikchaudhuri, it looks like this is not expressible. So, I think we need to skip the dynamic batch size testing for these 3 models.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120599
Approved by: https://github.com/jgong5, https://github.com/desertfire
Previously, torch.export in non-strict mode was failing on str inputs while creating fake inputs for tracing (fakify()) and while using graph nodes to create constraints. This fixes those 2 stages to allow strs to pass.
Failing test case:
```
class Foo(torch.nn.Module):
    def forward(self, a, b, mode):
        return torch.div(a, b, rounding_mode=mode)

foo = Foo()
inps = (torch.randn(4, 4), torch.randn(4), "trunc")
exported = export(foo, inps)
with self.assertRaisesRegex(
    RuntimeError, "to be equal to trunc, but got floor"
):
    _ = exported.module()(torch.randn(4, 4), torch.randn(4), "floor")
self.assertTrue(torch.allclose(exported.module()(*inps), foo(*inps)))
```
Before:
```
(pytorch-local) pianpwk@pianpwk-mbp pytorch % python test/export/test_export_nonstrict.py -k test_runtime_assert_for_prm_str
E
======================================================================
ERROR: test_runtime_assert_for_prm_str_non_strict (__main__.NonStrictExportTestExport.test_runtime_assert_for_prm_str_non_strict)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/pianpwk/Documents/pytorch/torch/testing/_internal/common_utils.py", line 2744, in wrapper
method(*args, **kwargs)
File "/Users/pianpwk/Documents/pytorch/test/export/testing.py", line 40, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/Users/pianpwk/Documents/pytorch/test/export/test_export.py", line 1588, in test_runtime_assert_for_prm_str
exported = export(foo, inps)
^^^^^^^^^^^^^^^^^
File "/Users/pianpwk/Documents/pytorch/test/export/test_export_nonstrict.py", line 16, in mocked_non_strict_export
return export(*args, **kwargs, strict=False)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/pianpwk/Documents/pytorch/torch/export/__init__.py", line 186, in export
return _export(
^^^^^^^^
File "/Users/pianpwk/Documents/pytorch/torch/export/_trace.py", line 541, in wrapper
raise e
File "/Users/pianpwk/Documents/pytorch/torch/export/_trace.py", line 527, in wrapper
ep = fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/Users/pianpwk/Documents/pytorch/torch/export/exported_program.py", line 83, in wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/Users/pianpwk/Documents/pytorch/torch/export/_trace.py", line 707, in _export
) = make_fake_inputs(f, args, kwargs, constraints)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/pianpwk/Documents/pytorch/torch/_export/non_strict_utils.py", line 133, in make_fake_inputs
fake_args, fake_kwargs = tree_map_with_path(
^^^^^^^^^^^^^^^^^^^
File "/Users/pianpwk/Documents/pytorch/torch/utils/_pytree.py", line 1519, in tree_map_with_path
return treespec.unflatten(func(*xs) for xs in zip(*all_keypath_leaves))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/pianpwk/Documents/pytorch/torch/utils/_pytree.py", line 734, in unflatten
leaves = list(leaves)
^^^^^^^^^^^^
File "/Users/pianpwk/Documents/pytorch/torch/utils/_pytree.py", line 1519, in <genexpr>
return treespec.unflatten(func(*xs) for xs in zip(*all_keypath_leaves))
^^^^^^^^^
File "/Users/pianpwk/Documents/pytorch/torch/_export/non_strict_utils.py", line 134, in <lambda>
lambda kp, val: fakify(fake_mode, kp, val, t_constraints, sources),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/pianpwk/Documents/pytorch/torch/_export/non_strict_utils.py", line 68, in fakify
raise ValueError("Only tensors allowed as input")
ValueError: Only tensors allowed as input
To execute this test, run the following from the base repo dir:
python test/export/test_export_nonstrict.py -k test_runtime_assert_for_prm_str_non_strict
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
----------------------------------------------------------------------
Ran 1 test in 0.008s
FAILED (errors=1)
```
After:
```
(pytorch-local) pianpwk@pianpwk-mbp pytorch % python test/export/test_export_nonstrict.py -k test_runtime_assert_for_prm_str
.
----------------------------------------------------------------------
Ran 1 test in 0.237s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120536
Approved by: https://github.com/tugsbayasgalan, https://github.com/zhxchen17, https://github.com/avikchaudhuri, https://github.com/gmagogsfm
This PR is mostly just code movement to make the code review easier - AFAIK it should not change any functionality. The final goal is to remove the xfails for some of the test_fake opinfos for these ops. The opinfos are failing because the outputs can have mixed devices - we need to move them to fake_impls first before we can support mixed device returns.
This PR:
* Move the `_meta_registrations.py` implementations to `fake_impls.py`
* Change the function signature from taking explicit named variables to taking `{args, kwargs}` and normalizing them
* Wrap all the returned tensors in FakeTensors
Tests: relying on opinfos. I also checked `test_fake_*` for these tests (by removing x-fails and patching things until they passed) to verify general correctness.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120682
Approved by: https://github.com/drisspg
Summary: Vulkan gtests were segfaulting on mac because the memory for barriers can get destroyed after the local function (CommandBuffer::insert_barrier) in which it is created exits. Since we provide this barrier pointer to the Vulkan library, it needs to stay alive even after the function exits; otherwise we get crashes.
Test Plan:
See that there is no segfault on mac with fix and tests can run:
Compile gtests:
buck2 build --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output
Crash w/o diff
bash-3.2$ buck-out//v2/gen/fbsource/xplat/caffe2/pt_vulkan_quantized_api_test_binAppleMac
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
[==========] Running 85 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 85 tests from VulkanAPITest
[ RUN ] VulkanAPITest.uniform_buffer_copy
[ OK ] VulkanAPITest.uniform_buffer_copy (88 ms)
[ RUN ] VulkanAPITest.copy_to_buffer
Segmentation fault: 11
With diff there is no crash:
bash-3.2$ buck-out//v2/gen/fbsource/xplat/caffe2/pt_vulkan_quantized_api_test_binAppleMac
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
[==========] Running 85 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 85 tests from VulkanAPITest
[ RUN ] VulkanAPITest.uniform_buffer_copy
[ OK ] VulkanAPITest.uniform_buffer_copy (296 ms)
.....
[ FAILED ] VulkanAPITest.gelu_quint8_self (23 ms)
[----------] 85 tests from VulkanAPITest (1494 ms total)
[----------] Global test environment tear-down
[==========] 85 tests from 1 test suite ran. (1494 ms total)
[ PASSED ] 72 tests.
[ FAILED ] 13 tests, listed below:
[ FAILED ] VulkanAPITest.linear_2d_flat
[ FAILED ] VulkanAPITest.linear_2d_small
[ FAILED ] VulkanAPITest.linear_2d_large
[ FAILED ] VulkanAPITest.linear_3d_flat
[ FAILED ] VulkanAPITest.linear_3d_small
[ FAILED ] VulkanAPITest.linear_3d_large
[ FAILED ] VulkanAPITest.linear_4d_flat
[ FAILED ] VulkanAPITest.linear_4d_small
[ FAILED ] VulkanAPITest.linear_4d_large
[ FAILED ] VulkanAPITest.gelu_qint8
[ FAILED ] VulkanAPITest.gelu_qint8_self
[ FAILED ] VulkanAPITest.gelu_quint8
[ FAILED ] VulkanAPITest.gelu_quint8_self
The above failing tests were failing before as well and are being worked on.
Differential Revision: D54023146
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120337
Approved by: https://github.com/SS-JIA
This adds support for backwards hooks that are *both*:
1) Interior to the graph; and
2) Dynamically generated (e.g. lambdas)
We do this by creating a BackwardState object that is used to register the hooks in the forward, then populated by dynamo *after* the forwards runs.
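A small sketch of the pattern this PR targets: a lambda hook created inside the compiled region and registered on an intermediate tensor.
```python
import torch

@torch.compile
def f(x):
    y = x.sin()
    y.register_hook(lambda g: 2 * g)  # dynamically generated, interior to the graph
    return y.cos()

x = torch.randn(4, requires_grad=True)
f(x).sum().backward()
```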
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120382
Approved by: https://github.com/xmfan
Summary:
`nullptr` is typesafe. `0` and `NULL` are not. In the future, only `nullptr` will be allowed.
This diff helps us embrace the future _now_ in service of enabling `-Wzero-as-null-pointer-constant`.
Test Plan: Sandcastle
Reviewed By: meyering
Differential Revision: D54163060
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120740
Approved by: https://github.com/Skylion007
With the current `Dim`-based dynamic shapes API for export, one can express that the shapes of different inputs must be equal by reusing the same `Dim`. However, non-trivial relationships between such input shapes cannot be expressed.
Recently we have been seeing more and more examples of code that require this additional expressibility, e.g., where a pair of shapes might differ by one, or a shape might be double another (or simply even).
This PR introduces the concept of a "derived" `Dim`, i.e., a linear arithmetic expression over a `Dim`. By using a combination of `Dim`s and derived `Dim`s to specify input shapes, the desired relationships can be expressed naturally. E.g., a pair of shapes might be `dim` and `dim + 1`, or `dim` and `2*dim`, or even `2*dim` and `dim + 1`.
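A minimal sketch of what a derived `Dim` spec looks like, where y's first dimension is constrained to be exactly one larger than x's:
```python
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x, y):
        return x + y[1:]

dim = Dim("dim")
ep = export(
    M(),
    (torch.randn(5), torch.randn(6)),
    dynamic_shapes={"x": {0: dim}, "y": {0: dim + 1}},
)
```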
We extend the current infrastructure that translates `Dim`s to deprecated `dynamic_dim`-based constraints to work with derived `Dim`s. As usual, we raise constraint violation errors when shape guards cannot be verified given a dynamic shapes spec; suggest fixes; and raise runtime errors when future inputs violate the spec.
Importantly, some guards that used to cause forced specializations in the constraint solver because they were deemed "too complex" now do not do so, because they can now be specified as constraints. Since this was what motivated the introduction of a `disable_constraint_solver` flag to some internal APIs, we may not need that flag any more.
Note that shapes of placeholders in exported programs can now contain symbolic expressions and not just symbols.
Differential Revision: D53254587
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118729
Approved by: https://github.com/ezyang
This pull request provides an update on the recent advancements made in the PyTorch profiler with regards to XPU backend support. Following the successful merge of a previous pull request #94502 that established a pathway for the XPU backend within PyTorch, we have now taken steps to enhance the profiler's capabilities for handling and displaying profile data directly related to the XPU backend.
# Motivation
The current pull request builds upon this foundation by refining the profiler's data processing scripts, particularly `profiler_util.py`, to accommodate XPU backend-specific profile data. The aim is to align the handling and presentation of this data with that of the CUDA backend, offering users a consistent experience across different device profiles. This includes generating outputs such as JSON files compatible with Chrome trace tooling, among other formats.
# Principles
1. Minimal Impact: The modifications introduced should support XPU backend data with minimal disruption to the existing profiling scripts.
2. Consistency: Changes should maintain stylistic and functional consistency with existing `CUDA` and `privateuse1` pathways, ensuring no adverse effects on other logic paths.
3. Exclusivity: Ensure that the new XPU pathway does not interfere with or impede other pathways.
# Solutions
### a. Pathway Identification:
Introduction of a `use_xpu` flag within `torch.autograd.profiler.profile` interfaces to distinguish XPU-specific profiling.
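A hedged usage sketch based on the `use_xpu` flag described here (assumes an XPU-enabled build and device):
```python
import torch

x = torch.randn(128, 128, device="xpu")
with torch.autograd.profiler.profile(use_xpu=True) as prof:
    y = x @ x
print(prof.key_averages().table(sort_by="cpu_time_total"))
```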
### b. `use_device` Logic Revision:
With the introduction of the XPU pathway, `use_device` no longer implies a binary relationship with `use_cuda`. Consequently, we have revised related logic to remove implicit assertions and establish independent device distinction.
### c. Kernel List Segregation:
To accommodate the non-binary nature of device pathways, we have enabled kernel lists to identify specific device affiliations through separate list objects.
### d. Formatted Output:
To ensure output consistency, we have employed code duplication and keyword substitution techniques to facilitate the formatting of XPU-related profile data.
# Additional Enhancements
### a. Enumerations in `.pyi` Files:
Added recognition items for `DeviceType` and `ProfilerActivity` specific to XPU.
### b. Correct DeviceType Returns:
Revised `deviceTypeFromActivity` logic to accurately differentiate between device backends, even when they share common flags such as `libkineto::ActivityType::GPU_MEMCPY`.
### c. Bug Fixes in `cuda_corr_map`:
Addressed a corner case where erroneous parent-child event relationships were formed due to shared function event identifiers. The solution involves refining `cuda_corr_map` processing to prevent a function event from being misidentified as both the linker and linkee.
# Further Abstraction
Looking forward, we acknowledge the potential for further abstraction in the codebase. The current changes necessitated by XPU support have highlighted opportunities for reducing redundancy by consolidating naming conventions and utilizing a singular `device` naming system that relies on `DeviceType` attributes or string flags for differentiation. This would involve significant refactoring to replace device-specific flags and variables. This topic needs further discussions about whether we could and when we should deprecate all those flags and variables named with `cuda`.
# Next Pull Request
The next pull request will be contingent on Kineto's adoption of Intel's forthcoming PTI-sdk library, which will enable direct usage of XPU-related tracers. Subsequent modifications to `libkineto_init()` will aim to endow PyTorch running on XPU backends with comprehensive profiling capabilities on XPU devices.
We appreciate your attention to these enhancements and welcome any feedback or questions you may have regarding these developments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120185
Approved by: https://github.com/aaronenyeshi, https://github.com/gujinghui
When compiling a deserialized ExportedProgram, a constant's original_fqn is not populated. The highlighted line below is missing, and a later assertion breaks because original_fqn is missing.
```
constants_info_[0].name = "L__self___w_pre";
constants_info_[0].dtype = static_cast<int32_t>(cached_torch_dtype_float32);
constants_info_[0].offset = 0;
constants_info_[0].data_size = 64;
constants_info_[0].from_folded = false;
constants_info_[0].shape = {4, 4};
constants_info_[0].stride = {4, 1};
// constants_info_[0].original_fqn = "w_pre"; // this line is missing
```
Inductor relies on `dynamo_flat_name_to_original_fqn` to populate the original_fqn field. This field originates from `graph_module.meta["dynamo_flat_name_to_original_fqn"]` and is set during dynamo tracing. However, when compiling a deserialized ExportedProgram, we don't do dynamo tracing, so this field is missing.
As a fix, I maintain AOTI's own mapping for constant tensors' FQNs.
Differential Revision: D54097073
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120524
Approved by: https://github.com/chenyang78
Summary: FBCode CI does not compile torch with CUDA for tests in the dynamo folder; instead of adding a special rule, let's move these tests to the inductor folder.
Test Plan:
```
buck run mode/opt //caffe2/test/inductor/:triton_kernels
```
now works instead of skipping tests
Differential Revision: D54280629
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120738
Approved by: https://github.com/aakhundov
It's a bit annoying to have to read through the test name in verbose mode just to see what the test's sentinel file is actually called when encountering an unexpected success. Now that we have sentinel files, we can directly list the file path from root in the error message.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120766
Approved by: https://github.com/Skylion007
Summary: Expose sequence number to work info. The number can help applications identify a NCCL work more precisely.
Test Plan:
1. pytest test/distributed/test_c10d_nccl.py::WorkHookTest::test_on_completion_hook_seq
2. pytest test/distributed/test_c10d_nccl.py::WorkHookTest
Differential Revision: D54180050
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120596
Approved by: https://github.com/kwen2501
Summary:
We observed that stack nodes have a missing example_value in DPA+FIRST, causing issues for further split cat optimization. Full error log: P1187633689.
pre grad graph: https://www.internalfb.com/intern/everpaste/?color=0&handle=GPUFOBWniTeB6s8DAN8z9sHTadpxbr0LAAAz
We found that it was introduced by the new stack nodes in the group batch fusion, thus we fix the bug to enable further split cat optimization.
Test Plan:
```
buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode split_batch
```
before fix: P1187633689
```
W0221 13:32:09.334000 139773455527936 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: sigmoid_16
W0221 13:32:09.335000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_19
W0221 13:32:09.335000 139773455527936 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_16
W0221 13:32:09.335000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_6
W0221 13:32:09.335000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_5
W0221 13:32:09.336000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_4
W0221 13:32:09.517000 139773455527936 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_20
W0221 13:32:09.518000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_18
W0221 13:32:09.518000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_17
W0221 13:32:09.521000 139773455527936 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_19
W0221 13:32:09.521000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_15
W0221 13:32:09.521000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_14
W0221 13:32:09.522000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_16
W0221 13:32:09.524000 139773455527936 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_18
W0221 13:32:09.525000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_12
W0221 13:32:09.525000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_11
W0221 13:32:09.525000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_13
W0221 13:32:09.527000 139773455527936 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_17
W0221 13:32:09.528000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_9
W0221 13:32:09.528000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_8
W0221 13:32:09.528000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_10
W0221 13:32:09.528000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_7
```
after fix:
P1189491364
```
W0226 13:19:56.542000 139770599518208 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: sigmoid_16
W0226 13:19:56.543000 139770599518208 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_16
W0226 13:19:56.703000 139770599518208 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_20
W0226 13:19:56.707000 139770599518208 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_19
W0226 13:19:56.711000 139770599518208 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_18
W0226 13:19:56.713000 139770599518208 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_17
```
Differential Revision: D54140488
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120655
Approved by: https://github.com/jackiexu1992
Summary:
We ran into the following import loop when testing aps:
```
Traceback (most recent call last):
File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/runtime/lib/python3.10/multiprocessing/forkserver.py", line 274, in main
code = _serve_one(child_r, fds,
File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/runtime/lib/python3.10/multiprocessing/forkserver.py", line 313, in _serve_one
code = spawn._main(child_r, parent_sentinel)
File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/runtime/lib/python3.10/multiprocessing/spawn.py", line 125, in _main
prepare(preparation_data)
File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/runtime/lib/python3.10/multiprocessing/spawn.py", line 234, in prepare
_fixup_main_from_name(data['init_main_from_name'])
File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/runtime/lib/python3.10/multiprocessing/spawn.py", line 258, in _fixup_main_from_name
main_content = runpy.run_module(mod_name,
File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/runtime/lib/python3.10/runpy.py", line 224, in run_module
return _run_module_code(code, init_globals, run_name, mod_spec)
File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/runtime/lib/python3.10/runpy.py", line 96, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/runtime/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/aps_models/ads/icvr/icvr_launcher.py", line 29, in <module>
class ICVRConfig(AdsComboLauncherConfig):
File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/aps_models/ads/common/ads_launcher.py", line 249, in <module>
class AdsComboLauncherConfig(AdsConfig):
File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/aps_models/ads/common/app_config.py", line 16, in <module>
class AdsConfig(RecTrainAppConfig):
File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/apf/rec/config_def.py", line 47, in <module>
class EmbeddingKernelConfig:
File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/apf/rec/config_def.py", line 52, in EmbeddingKernelConfig
cache_algorithm: CacheAlgorithm = CacheAlgorithm.LRU
File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/torchrec/distributed/types.py", line 501, in <module>
class ParameterSharding:
File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/torchrec/distributed/types.py", line 527, in ParameterSharding
sharding_spec: Optional[ShardingSpec] = None
File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/torch/distributed/_shard/sharding_spec/api.py", line 48, in <module>
class ShardingSpec(ABC):
File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/torch/distributed/_shard/sharding_spec/api.py", line 55, in ShardingSpec
tensor_properties: sharded_tensor_meta.TensorProperties,
File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/torch/distributed/_shard/sharded_tensor/__init__.py", line 21, in <module>
def empty(sharding_spec: shard_spec.ShardingSpec,
ImportError: cannot import name 'ShardingSpec' from partially initialized module 'torch.distributed._shard.sharding_spec.api' (most likely due to a circular import) (/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/torch/distributed/_shard/sharding_spec/api.py)
```
Using future annotations to mitigate.
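A minimal illustration of the mitigation (signature abbreviated): with PEP 563 postponed evaluation, the annotation is stored as a string and not resolved at function-definition time, so it no longer trips over the partially initialized module.
```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only imported for type checkers; nothing is executed at runtime.
    from torch.distributed._shard.sharding_spec import ShardingSpec

def empty(sharding_spec: ShardingSpec, *size):
    ...
```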
Test Plan:
```
hg update 1b1b3154616b70fd3325c467db1f7e0f70182a74
CUDA_VISIBLE_DEVICES=1,2 buck2 run @//mode/opt //aps_models/ads/icvr:icvr_launcher -- mode=local_ctr_cvr_rep
```
Differential Revision: D53685582
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119820
Approved by: https://github.com/fegin
Shorthand for `"%(levelname)s:%(name)s:%(message)s"` which is hard to
remember.
I find the default formatter annoying since just the metadata fills up
most of the width of my terminal.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120757
Approved by: https://github.com/ezyang
This adds some initial unit tests for FSDP2 model state dict only.
This PR adds two tests:
1. Add a unit test checking parity of FSDP `model.state_dict()` with distributed_state_dict's `get_model_state_dict`.
2. Add a unit test to make sure `StateDictOptions(full_state_dict=True, cpu_offload=True)` in distributed_state_dict works for FSDP2 model state_dict.
Optimizer state dict will be in follow up PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120680
Approved by: https://github.com/awgu
Overall design: https://docs.google.com/document/d/1CX_hJ0PNy9f3R1y8TJrfkSeLkvGjjjLU84BSXgS2AZ8/edit
How to read the diff:
* Most files are me augmenting pre-existing logging with structured variants. For the most part it's simple (esp FX graphs, which have a canonical string representation); it gets more complicated when I decided to JSON-ify some data structure instead of keeping the ad hoc printing (notably, guards and dynamo output graph sizes)
* torch/_functorch/_aot_autograd/collect_metadata_analysis.py is some unrelated fixes I noticed while auditing artifact logs
* torch/_logging/_internal.py has the actual trace log implementation. The trace logger is implemented as a logger named torch.__trace which is disconnected from the logging hierarchy. It gets its own handler and formatter (TorchLogsFormatter with _is_trace True). `trace_structured` is the main way to emit a trace log (see the sketch after this list). Unusually, there's a separate "metadata" and "payload" field. The metadata field should not be too long (as it is serialized as a single line) and is always JSON (we put contextual things like compile id in it); the payload field can be long and is emitted after the metadata log line and can span multiple lines.
* torch/_logging/structured.py contains some helpers for converting Python data structures into JSON form. Notably, we have a string interning implementation here, which helps reduce the cost of serializing filenames into the log.
* test/dynamo/test_structured_trace.py the tests are cribbed from test_logging.py, but all rewritten to use expect tests on munged versions of what we'd actually output. Payloads are never tested, since they tend not be very stable.
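A hedged sketch of emitting a trace log via `trace_structured`; the exact keyword names are assumptions based on the description above:
```python
from torch._logging import trace_structured

trace_structured(
    "artifact",  # assumed event name
    metadata_fn=lambda: {"name": "my_debug_artifact", "encoding": "string"},
    payload_fn=lambda: "free-form payload\nspanning multiple lines",
)
```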
https://github.com/ezyang/tlparse is a POC Rust program that can interpret these logs.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120289
Approved by: https://github.com/Skylion007
ghstack dependencies: #120712
This updates the nesting of if statements in `nn.Module._apply` such that if `torch.__future__.set_swap_module_params_on_conversion(True)` has been set, we always try to swap regardless of whether
- `torch._has_compatible_shallow_copy_type(param, fn(param))`
- `torch.__future__.set_overwrite_module_params_on_conversion` is set
This means that `meta_module.to_empty(device=...)` can now use the swap_tensors path. cc @awgu
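A small example of the path this enables, using only the public APIs mentioned above:
```python
import torch

torch.__future__.set_swap_module_params_on_conversion(True)

with torch.device("meta"):
    m = torch.nn.Linear(4, 4)

m.to_empty(device="cpu")  # now goes through the swap_tensors path
```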
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120659
Approved by: https://github.com/albanD
cuBLAS has indicated that certain kernels will transition to using the driver API over the CUDA runtime API, which we've observed to break existing tests (e.g., DataParallel) that use multithreading and may not eagerly grab a context via `cudaSetDevice`.
CC @Aidyn-A @ptrblck
Co-authored-by: Aidyn-A <31858918+Aidyn-A@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120131
Approved by: https://github.com/atalman
Tacotron2 causes massive loop unrolling resulting in very large graphs (26k nodes) which was causing inductor (and tracing itself) to choke.
The unrolling size is controlled by the environment variable TORCHDYNAMO_MAX_LOOP_UNROLL_NODES which defaults to the arbitrary value 5000.
This updates the tacotron2 timings as follows:
eager timing: 3m:23s -> 35s
aot_eager timing: 4m:12s -> 39s
inductor timing: 22m:24s ->1m
For reference the big loop in tacotron2 was this one (model.py[405]):
```
while len(mel_outputs) < decoder_inputs.size(0) - 1:
    decoder_input = decoder_inputs[len(mel_outputs)]
    mel_output, gate_output, attention_weights = self.decode(decoder_input)
    mel_outputs += [mel_output.squeeze(1)]
    gate_outputs += [gate_output.squeeze(1)]
    alignments += [attention_weights]
```
which gets unrolled and inlined adding about 36 nodes to the graph per iteration.
Fixes#98467
Relates to #102839 which hopefully will result in a better fix.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120023
Approved by: https://github.com/yanboliang
This PR refactors the tuple strategy handling logic and allows TupleStrategy to have both input/output specs for each OpStrategy child, so that we can further enable operators like foreach norm.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120695
Approved by: https://github.com/awgu
Summary: Process Group config is essential for analyzing collective patterns. We have added this to Execution Trace. Now we expose this information in Kineto as well.
Differential Revision: D53557965
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119443
Approved by: https://github.com/kwen2501
Summary:
When we export in strict mode and turn on preserve_module_call_signature, the following assertion error occurs today:
```
child_split[: len(parent_split)] == parent_split
```
This is because we're monkey-patching the forward call directly, which breaks attribute propagation in the tracer. It's better to implement this with a forward hook, because then we don't have to alter the original module structure at all during export.
Test Plan: CI
Differential Revision: D54102714
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120468
Approved by: https://github.com/ydwu4
In this PR stack, there were unrelated test failures within test_trace_rules.py. It turned out that torch.cuda._get_device_properties should be registered in _dynamo/trace_rules.py; a test failed because it was not.
This is a small fix which tries to get rid of the test failure by manually registering that function.
Note:
I am not sure whether this is the best way to fix this, as I am neither familiar with the trace rules nor with the introduction of torch.cuda._get_device_properties.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120620
Approved by: https://github.com/Skylion007
Currently, when loading with strict=False or strict=True and looking at the
error message, FQNs are garbled with FSDP details such as "_fsdp_wrapped_module".
This makes it tricky for upstream applications to validate the expected set of
missing / unexpected keys (for example with PEFT where the state_dict is loaded
non-strict), and makes the error message more complicated with FSDP details.
This PR cleans those prefixes by using `clean_tensor_name` in FSDP's existing
post load_state_dict hooks. Currently, only the full_state_dict impl is tested; the rest of the impls can be tested as follow-up work.
Differential Revision: [D54182472](https://our.internmc.facebook.com/intern/diff/D54182472/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120600
Approved by: https://github.com/XilunWu, https://github.com/fegin
Fixes #115331.
This PR increases the number of valid GPU devices to 512 (from 64) in order to future-proof PyTorch for providers that offer [single nodes with a large device count](https://www.tensorwave.com/). Until now, `DeviceIndex` was an `int8_t`, thus multiple changes were necessary:
- `DeviceIndex` changed to `int16_t`. Updated consumers that assume it to be an `int8_t`.
- Updated bounds checking for `torch.device()` in the Python frontend. Right now, we allow funny things like `torch.device('cpu', 200).index == -56`, which is undefined behavior. I inserted some checks to only allow values between 0 and `c10::Device::MAX_NUM_DEVICES - 1`.
- Updated the `ArgumentInfo` struct as it hardcodes the device index as 8 bit field [^1]. Might be a breaking change, not sure if users rely on this.
- Introduced `c10::Device::MAX_NUM_DEVICES` as a replacement for the old `C10_COMPILE_TIME_MAX_GPUS`
[^1]: This field was unsigned, so I guess this has also been undef behavior the whole time? Our default device index is -1, so this always wrapped around to 255 when written to the `ArgumentInfo` struct. When I switched the `DeviceIndex` to `int16_t`, it actually stayed 255 after unpacking from `ArgumentInfo` again, as the `DeviceIndex` was now wide enough that it didn't wrap back to -1.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119639
Approved by: https://github.com/cyyever, https://github.com/albanD, https://github.com/huydhn
Summary:
Previously we were renaming constants to `lifted_constant_tensor0` or equivalent. This PR changes things so that the constants retain the same FQN as in the original eager module.
Actually, `symbolic_trace` already is supposed to do this, but the code path is not triggered when used from `make_fx`, since we don't pass an actual `nn.Module` instance to `trace()`, but rather a multiply-wrapped-functionalized-lambda-thing.
So, I reproduced the essential logic outside of make_fx, at the export layer.
Test Plan: added a unit test
Differential Revision: D54221616
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120664
Approved by: https://github.com/SherlockNoMad
Yuzhen Huang was complaining to me that searching for `__recompile`
no longer works. This is because the glog format is filename, not
logger name, so we lost the artifact name. Add it back.
Looks like:
```
V0226 15:56:04.142000 139828992779264 torch/_dynamo/guards.py:1084] [0/2] __guards: ___check_type_id(L['inputs'], 7626144)
```
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120671
Approved by: https://github.com/Skylion007
This enables native functional collectives by default. After this PR:
- The Python APIs remain backward compatible. Users will receive a deprecation warning if they use `(rank, tags)` as process group identifier.
- Collectives will be captured as `_c10d_functional` ops in post-grad fx graphs. The change will not affect end-users, but it will impact `torch-xla` which has implemented an all-reduce backend based on the existing `c10d_functional` IR. This excludes the migration for `torch-xla` use cases, which will be coordinated separately (see communications in #93173).
- Collectives will be lowered to and codegen'd by new Inductor collective IRs (`ir._CollectiveKernel` and `ir._WaitKernel`). This change will not affect end-users.
Testing performed:
- We have been running a set of representative unit tests with both the new native funcol and the old py funcol in CI. These tests will continue to run with the old py funcol after this PR, so they are covered until they are removed.
- Manually verified with e2e llama model training with DTensor + functional collectives (https://github.com/fairinternal/xlformers/tree/pt2_llm/pt2d#create-your-local-development-env).
Fallback mechanism:
- Introduced a temporary environment variable `TORCH_DISABLE_NATIVE_FUNCOL` that allows users to fall back to the previous implementation. We don't expect the migration to break anything; the mechanism is a safety measure to reduce potential disruption in case the PR causes unforeseen breakages.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120370
Approved by: https://github.com/wconstab, https://github.com/yf225
Reduces backend=eager compile time from 33 to 19 seconds for `MobileBertForQuestionAnswering`. This also helps an internal model where the guards.add function takes 124 seconds.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120520
Approved by: https://github.com/mlazos
# Motivation
As mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the last runtime component we would like to upstream is `Generator` which is responsible for the pseudo-random number generation. To facilitate the code review, we split the code changes into 2 PRs. This is one of the 2 PRs and covers the changes under `aten`.
# Design
Following the previous design, `c10::GeneratorImpl` is the device-agnostic abstraction of a random number generator. So we will introduce an XPU generator `XPUGeneratorImpl`, inheriting from `c10::GeneratorImpl`, to manage random states on an Intel GPU device. Intel GPU runtime `Generator` adopts the same algorithm as CPU. The corresponding C++ file should be placed in aten/src/ATen/xpu/ folder and is built in `libtorch_xpu.so`.
This PR provides the following APIs:
- `getDefaultXPUGenerator`
- `createXPUGenerator`
# Additional Context
The 2nd PR will cover `python frontend`.
Differences from CUDA:
The generator-related ATen C++ APIs map 1:1 to CUDA's.
The XPUGeneratorImpl's member functions differ slightly from CUDA's.
The CUDA counterpart APIs listed below have no XPU equivalent:
- capture_prologue
- capture_epilogue
- philox_cuda_state
- reset_rnn_state
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118528
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD
Overall design: https://docs.google.com/document/d/1CX_hJ0PNy9f3R1y8TJrfkSeLkvGjjjLU84BSXgS2AZ8/edit
How to read the diff:
* Most files are me augmenting pre-existing logging with structured variants. For the most part it's simple (esp FX graphs, which have a canonical string representation); it gets more complicated when I decided to JSON-ify some data structure instead of keeping the ad hoc printing (notably, guards and dynamo output graph sizes)
* torch/_functorch/_aot_autograd/collect_metadata_analysis.py is some unrelated fixes I noticed while auditing artifact logs
* torch/_logging/_internal.py has the actual trace log implementation. The trace logger is implemented as a logger named torch.__trace which is disconnected from the logging hierarchy. It gets its own handler and formatter (TorchLogsFormatter with _is_trace True). There's a teensy bit of FB specific code to automatically enable trace logging if a /logs directory exists. `trace_structured` is the main way to emit a trace log. Unusually, there are separate "metadata" and "payload" fields. The metadata field should not be too long (as it is serialized as a single line) and is always JSON (we put contextual things like compile id in it); the payload field can be long and is emitted after the metadata log line and can span multiple lines.
* torch/_logging/structured.py contains some helpers for converting Python data structures into JSON form. Notably, we have a string interning implementation here, which helps reduce the cost of serializing filenames into the log.
* test/dynamo/test_structured_trace.py: the tests are cribbed from test_logging.py, but all rewritten to use expect tests on munged versions of what we'd actually output. Payloads are never tested, since they tend not to be very stable.
https://github.com/ezyang/tlparse is a POC Rust program that can interpret these logs.
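A hedged sketch of emitting a trace log from Python; the `"artifact"` entry name, the metadata keys, and the payload below are illustrative, and the exact `trace_structured` signature may differ between versions:
```python
from torch._logging import trace_structured

# Metadata must stay small and JSON-serializable (it is written on one line);
# the payload may be long and span multiple lines.
trace_structured(
    "artifact",
    metadata_fn=lambda: {"name": "my_debug_artifact", "encoding": "string"},
    payload_fn=lambda: "first line of payload\nsecond line of payload",
)
```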
Testing that the fbcode detection works at https://www.internalfb.com/mlhub/pipelines/runs/fblearner/534553450 (Meta-only)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120289
Approved by: https://github.com/Skylion007
By changing runtime symbolic asserts to use assert_scalar, the asserts can call into `expect_true` and modify the shape env so that we can run through the traced graph module with fake tensors. With assert_async, the asserts only get hit during runtime, but that means if we run the graph module with fake tensors, the asserts will not affect the shape env, so later data dependent calls to the fake tensors may result in GuardOnDataDependentSymNode errors.
https://github.com/pytorch/pytorch/issues/119587
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119608
Approved by: https://github.com/ezyang
This does two things,
1) Short circuit `welford_reduce` on the first iteration to ignore the accumulator (big win for small `rnumel`)
2) Replace division with multiplication by reciprocal
Currently this is not enough to match two pass reduction with bfloat16 but it is still a significant improvement.
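For context, a plain-Python sketch of a Welford-style reduction with both tweaks applied (illustrative only, not the Inductor-generated code):
```python
def welford_combine(mean, m2, weight, value):
    new_weight = weight + 1.0
    delta = value - mean
    # (2) multiply by the reciprocal instead of dividing
    new_mean = mean + delta * (1.0 / new_weight)
    new_m2 = m2 + delta * (value - new_mean)
    return new_mean, new_m2, new_weight

def welford_reduce(values):
    mean, m2, weight = 0.0, 0.0, 0.0
    for i, v in enumerate(values):
        if i == 0:
            # (1) short-circuit the first iteration: seed the accumulator
            # directly instead of combining with the zero-initialized state
            mean, m2, weight = float(v), 0.0, 1.0
        else:
            mean, m2, weight = welford_combine(mean, m2, weight, v)
    return mean, m2 / weight  # running mean and (population) variance

print(welford_reduce([1.0, 2.0, 3.0, 4.0]))  # (2.5, 1.25)
```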
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120330
Approved by: https://github.com/lezcano
Convert from a list/bucket based TD system to just a numbers based TD system. Looks like a massive change but a decent amount of it is tests and removing code.
Main file of interest is interface.py, which Github is collapsing by default due to size
The test files pretty much got rewritten entirely since a lot of the old tests are no longer relevant.
Other notable changes:
* Use Frozenset to make TestRun hashable
* Adds tools/test/heuristics/__init__.py to ensure that unittest can discover the tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119901
Approved by: https://github.com/osalpekar, https://github.com/huydhn
Fixes #114831
Before:
```
(pytorch10) angelayi@devgpu022 ~/local/pytorch [main] $ python test/dynamo/test_trace_rules.py -k test_torch_name_rule_map_updated
F
======================================================================
FAIL: test_torch_name_rule_map_updated (__main__.TraceRuleTests)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/data/users/angelayi/pytorch/torch/testing/_internal/common_utils.py", line 2739, in wrapper
method(*args, **kwargs)
File "/data/users/angelayi/pytorch/test/dynamo/test_trace_rules.py", line 328, in test_torch_name_rule_map_updated
self._check_set_equality(
File "/data/users/angelayi/pytorch/test/dynamo/test_trace_rules.py", line 302, in _check_set_equality
self.assertTrue(len(x) == 0, msg1)
AssertionError: False is not true : New torch objects: {<built-in method _print of type object at 0x7ff477e40ee0>} were not added to `trace_rules.torch_c_binding_in_graph_functions` or `test_trace_rules.ignored_c_binding_in_graph_function_names`. Refer the instruction in `torch/_dynamo/trace_rules.py` for more details.
To execute this test, run the following from the base repo dir:
python test/dynamo/test_trace_rules.py -k test_torch_name_rule_map_updated
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
----------------------------------------------------------------------
Ran 1 test in 0.184s
FAILED (failures=1)
```
After change:
```
(pytorch10) angelayi@devgpu022 ~/local/pytorch [main] $ python test/dynamo/test_trace_rules.py -k test_torch_name_rule_map_updated
.
----------------------------------------------------------------------
Ran 1 test in 0.209s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120533
Approved by: https://github.com/clee2000, https://github.com/yanboliang, https://github.com/huydhn, https://github.com/Skylion007
We need a higher tolerance for GPT2ForSequenceClassification since if I change --bfloat16 in
```
time python benchmarks/dynamo/huggingface.py --accuracy --inference --bfloat16 --backend inductor --disable-cudagraphs --only GPT2ForSequenceClassification
```
to --float16 or --float32 it will pass the accuracy check.
Adding --freezing can also make the test pass for this model. I think that may be due to a different fusion output being generated (depending on whether constant propagation happens, which is controlled by freezing), causing some small numerical differences.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120537
Approved by: https://github.com/jansel
Summary:
Fixes #120386
`_AllGather.backward` assumes that `_ReduceScatter` always updates the output buffer in place. However, when the output buffer is non-contiguous, `_ReduceScatter` allocates and returns a different buffer, causing the gradient to be thrown away.
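A hedged sketch of the fix pattern; the helper names here are hypothetical and only illustrate copying the result back when the collective returns a fresh buffer:
```python
def backward_reduce_scatter(grad_output, out_buffer, reduce_scatter):
    # reduce_scatter is assumed to write into out_buffer when it can, but it
    # may allocate and return a fresh tensor if out_buffer is non-contiguous.
    result = reduce_scatter(grad_output, out_buffer)
    if result is not out_buffer:
        # Copy the reduced gradient back so it is not silently dropped.
        out_buffer.copy_(result)
    return out_buffer
```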
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120582
Approved by: https://github.com/XilunWu
1) Use items stored in torch._tensor_classes to check items passed from the Python side;
2) Add SparsePrivateUse1 in backend_to_string, layout_from_backend and check_base_legacy_new;
3) Use a more general API to get the Python module name in the get_storage_obj and get_name functions.
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119263
Approved by: https://github.com/ezyang
Summary:
## Context
Move Vulkan graph runtime from PyTorch directory to ExecuTorch directory to improve development logistics:
* ExecuTorch delegate changes will no longer require export to PyTorch directory
* Makes it much easier to enable OSS build for Vulkan delegate
Test Plan:
```
LD_LIBRARY_PATH=/home/ssjia/fbsource/third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/executorch/backends/vulkan/test:vulkan_compute_api_test_bin
buck2 run fbcode//executorch/backends/vulkan/test:test_vulkan_delegate
```
Differential Revision: D54133350
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120528
Approved by: https://github.com/manuelcandales
This enables native functional collectives by default. After this PR:
- The Python APIs remain backward compatible. Users will receive a deprecation warning if they use `(rank, tags)` as process group identifier.
- Collectives will be captured as `_c10d_functional` ops in post-grad fx graphs. The change will not affect end-users, but it will impact `torch-xla` which has implemented an all-reduce backend based on the existing `c10d_functional` IR. This excludes the migration for `torch-xla` use cases, which will be coordinated separately (see communications in #93173).
- Collectives will be lowered to and codegen'd by new Inductor collective IRs (`ir._CollectiveKernel` and `ir._WaitKernel`). This change will not affect end-users.
Testing performed:
- We have been running a set of representative unit tests with both the new native funcol and the old py funcol in CI. These tests will continue to run with the old py funcol after this PR, so they are covered until they are removed.
- Manually verified with e2e llama model training with DTensor + functional collectives (https://github.com/fairinternal/xlformers/tree/pt2_llm/pt2d#create-your-local-development-env).
Fallback mechanism:
- Introduced a temporary environment variable `TORCH_DISABLE_NATIVE_FUNCOL` that allows users to fall back to the previous implementation. We don't expect the migration to break anything; the mechanism is a safety measure to reduce potential disruption in case the PR causes unforeseen breakages.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120370
Approved by: https://github.com/wconstab, https://github.com/yf225
The search code to find the dynamo skip files wasn't working properly when used with pytest and multiple files:
```
pytest a.py b.py
```
because pytest would point `__main__` at itself instead of the individual file. (This worked fine when only running a single file test)
Change the scanning code to look for the skip directory relative to its own file first.
While in there add/update some comments and log a warning when the directory wasn't found (instead of a hard crash).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120521
Approved by: https://github.com/oulgen
For graphs with a single output, the expected torch.export / torch.compile graph_module output type is a single torch.Tensor instead of a tuple.
However, after using the `_SplitterBase` partitioner on these graph_modules (obtained from torch.export/torch.compile), the resultant graph module will return a tuple of tensors, in this case `(output,)`.
This PR adds codegen to the graphs produced by the `_SplitterBase` partitioner. Setting this ensures pytree unflatten nodes are added automatically to handle unflattening of the output, so single outputs are returned directly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120361
Approved by: https://github.com/angelayi
Summary:
The current dump timeout logic is a bit cumbersome as it needs two times: 1. a
timeout, 2. a wake-up time. In theory the caller just needs to wait
at most the timeout value for the dump and then declare the dump
either successful or not. We also unify the async calls using std::async
instead of a customized async launch function for each operation.
Test Plan:
Unit tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120331
Approved by: https://github.com/wconstab
Summary: We can only preserve (not decompose) CompositeImplicit custom ops that are functional. From the looks of the implementation, this op is functional, so the fix is just fixing the schema.
Test Plan: CI
Differential Revision: D54019265
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120332
Approved by: https://github.com/zhxchen17
These ops are at the level of the OperatorRegistry from the previous change. All ExecuTorch ops will go here.
```
ATen/native/vulkan/graph/ops
```
They are not to be confused with the general ATen ops from `native_functions.yaml` that will continue to exist. All PyTorch ops are here.
```
ATen/native/vulkan/ops
```
To help think around this split, note that we can actually implement the latter ATen ops with the former OperatorRegistry ops, since it's currently a subset.
Differential Revision: [D54030933](https://our.internmc.facebook.com/intern/diff/D54030933/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120364
Approved by: https://github.com/SS-JIA
ghstack dependencies: #120362, #120363
This PR moves other aspects of torchbench's model configuration (e.g. batch size,
tolerance requirements, etc.) into a new YAML file: `torchbench.yaml`. It also merges the
recently added `torchbench_skip_models.yaml` file inside the `skip` key.
This is an effort so that external consumers are able to easily replicate the performance
results and coverage results from the PyTorch HUD.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120299
Approved by: https://github.com/jansel
Found an error in the docs of `torch.linalg.lu_factor` related to `torch.linalg.lu_solve` (the named tuple returned by `lu_factor` was passed directly to `lu_solve`, which expects the LU tensor and pivots as separate arguments). Also fix a Sphinx issue along the way.
```Python traceback
TypeError: linalg_lu_solve(): argument 'LU' (position 1) must be Tensor, not torch.return_types.linalg_lu_factor
```
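The corrected usage, for reference: unpack the named tuple before calling `lu_solve`.
```python
import torch

A = torch.randn(3, 3)
B = torch.randn(3, 2)

LU, pivots = torch.linalg.lu_factor(A)    # named tuple: (LU, pivots)
X = torch.linalg.lu_solve(LU, pivots, B)  # pass the factors separately

# Passing the named tuple itself, e.g. lu_solve(torch.linalg.lu_factor(A), B),
# raises the TypeError shown above.
torch.testing.assert_close(A @ X, B)
```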
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120484
Approved by: https://github.com/lezcano
By hiding float32 constructors and exposing float16 ones, this allows the compiler to do implicit conversions as needed and, in safe cases, optimize out unneeded upcasts to fp32; see the example [below](https://godbolt.org/z/5TKnY4cos)
```cpp
#include <arm_neon.h>
#ifndef __ARM_FEATURE_FP16_SCALAR_ARITHMETIC
#error Ieeee
#endif
float16_t sum1(float16_t x, float16_t y) {
return x + y;
}
float16_t sum2(float16_t x, float16_t y) {
return static_cast<float>(x) + static_cast<float>(y);
}
```
both sum variants are compiled to a scalar fp16 add, if built for a platform that supports fp16 arithmetic
```
sum1(half, half): // @sum1(half, half)
fadd h0, h0, h1
ret
sum2(half, half): // @sum2(half, half)
fadd h0, h0, h1
ret
```
Fixes a build error after #119483 in some aarch64 configurations which are defined as supporting FP16 but don't define _Float16.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120425
Approved by: https://github.com/mikekgfb, https://github.com/atalman, https://github.com/snadampal
Fixes #119543
- doc fixed so that `reduce` is documented as a kwarg (see below for details)
- doc added another interface `(int dim, Tensor index, Number value, *, str reduce)` where
the full signature in the pyi file after build is
```
def scatter_(self, dim: _int, index: Tensor, value: Union[Number, _complex], *, reduce: str) -> Tensor:
```
This can be further verified in
02fb043522/aten/src/ATen/native/native_functions.yaml (L8014)
Therefore, the value can be int, bool, float, or complex type.
Besides the issue mentioned in #119543, `reduce` should be a kwarg, as shown below:
```
* (int dim, Tensor index, Tensor src)
* (int dim, Tensor index, Tensor src, *, str reduce)
* (int dim, Tensor index, Number value)
* (int dim, Tensor index, Number value, *, str reduce)
```
The test case for the scalar value is already implemented in
70bc3b3be4/test/test_scatter_gather_ops.py (L86)
so no additional test case is required.
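For reference, a small example exercising the scalar-value overload with `reduce` passed as a keyword argument (the shapes and values are arbitrary):
```python
import torch

x = torch.ones(3, 5)
index = torch.tensor([[0, 1, 2, 0, 1]])

# Scalar `value` overload; `reduce` must be given as a keyword argument.
x.scatter_(0, index, 2.0, reduce="multiply")
print(x)  # the scattered positions are multiplied by 2.0, the rest stay 1.0
```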
@mikaylagawarecki @janeyx99
Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120169
Approved by: https://github.com/mikaylagawarecki
Summary:
Previously, I added lastEnqueuedSeq_ and lastCompletedSeq_ to store the states of PG progress,
but we logged them only when a timeout was detected.
We found this is not enough, since the 'straggler' itself might not detect
the timeout and hence there is no log from the 'straggler'.
In this PR, we log these states periodically so that it is
much easier for us to identify the straggler by checking which rank
has the smallest lastEnqueuedSeq_.
Test Plan:
Log adding, build success
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120438
Approved by: https://github.com/wconstab, https://github.com/XilunWu, https://github.com/kwen2501
This does two things,
1) Short circuit `welford_reduce` on the first iteration to ignore the accumulator (big win for small `rnumel`)
2) Replace division with multiplication by reciprocal
Currently this is not enough to match two pass reduction with bfloat16 but it is still a significant improvement.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120330
Approved by: https://github.com/lezcano
The intent of this change was to minimize code differences between CUDA and ROCm while maintaining or improving performance.
Verified new performance using pytorch/benchmarks/operator_benchmark.
```
python -u -m pt.unary_test --tag-filter all --device cuda
python -u -m pt.binary_test --tag-filter all --device cuda
```
On MI200 this improved performance on average 3%.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120101
Approved by: https://github.com/albanD
We have many tests that use CommDebugMode to verify the occurrence of collectives. These tests do so by querying comm_counts with legacy funcol ops as key. For the purpose of native funcol migration, we need these tests to work for both legacy and native funcol. To avoid the need to modify all tests to accommodate the two implementations, we make CommDebugMode translate native funcol ops into legacy funcol ops until the migration finishes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120070
Approved by: https://github.com/wconstab, https://github.com/wanchaol
ghstack dependencies: #120042, #120043
Summary:
While I think it probably makes more sense to only require `all_reduce` input to be non-overlapping and dense, today `ProcessGroupNCCL` requires it to be contiguous. This is also what the `all_reduce` in non-native funcol does.
Also marking a test affected by this with `@run_with_both_funcol_impls`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120042
Approved by: https://github.com/wanchaol
`nn.Module.__setattr__` does not actually call `super().__setattr__()`. If we make this call in our fast path, then we will inadvertently set the parameter as an actual attribute on the module, not just as an entry in the `_parameters` dict. This can lead to a bug where after replacing the parameters on the module (e.g. via `to_empty()` from meta device), we now have both an actual attribute (old) and a new entry in `_parameters` (new). Trying to access the parameter would give the old one since Python only resolves `__getattr__` if normal attribute lookup fails.
The bug was exercised in the following PR. I wanted to land this bug fix separately.
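A plain-Python sketch of the shadowing pitfall this fixes (not the actual nn.Module code):
```python
class Holder:
    def __init__(self):
        self._parameters = {}

    def __getattr__(self, name):
        # Only consulted when normal attribute lookup fails.
        params = self.__dict__.get("_parameters", {})
        if name in params:
            return params[name]
        raise AttributeError(name)

h = Holder()
h.w = "old"                 # what the buggy fast path did: a real instance attribute
h._parameters["w"] = "new"  # the replacement registered in _parameters
print(h.w)                  # "old" -- __getattr__ is never reached
```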
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120340
Approved by: https://github.com/yifuwang
ghstack dependencies: #120231
Summary:
Decompose memory bound mm/bmm.
Linear decomposition result: D53502768
BMM decomposition result: D53148650
We should only decompose when
1)bmm, b is large, m,n,k is relative small
2)mm/addmm. m is large, n and K is relative small. e.g. mm of input gradient in linear backward should not be decomposed since m is small and n is large.
Need to conduct more experiments to see if we can find a better strategy for decomposition. I have tried to use a linear regression model (see the bento results) which does not fit well. For short term, we use heuristics to determine when to decompose.
Test Plan:
```
buck2 test mode/dev-nosan //caffe2/test/inductor:decompose_mem_bound_mm
```
COFFEE APS mc0:
baseline: aps-lsf-0124-bf16-267ccb7a0d
decompose: aps-lsf-0124-bf16-4e3824db40
FIRST AFOC pyper mc1
Differential Revision: D53602514
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120047
Approved by: https://github.com/mengluy0125
Log a few more fields
- num_atomic_add: the perf of kernels using atomic_add is usually data dependent. Our benchmarking code generates all indices as 0, which will result in worse perf than in reality.
- kernel_args_num_gb: an estimate of the amount of reads/writes for kernel args. In-place args will be double counted. If we have a good estimation, this should be the lower bound of the memory access that the GPU performs. Sometimes the GPU will do more memory accesses since a single buffer may be accessed multiple times (e.g. for softmax when the input tensor is quite large; the cache only helps a bit here). With this logged, and if we augment the metadata with the amount of memory the GPU actually accessed, it would be nice to dig into kernels where the GPU accesses more memory.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120274
Approved by: https://github.com/jansel
ghstack dependencies: #120266
Fixes #112389
## About
PyTorch (Kineto) profiler registers with the profiling daemon Dynolog to enable on-demand profiling. The user should only need to set the env variable `KINETO_USE_DAEMON`. To enable this we need to initialize kineto library early rather than lazily on a PyTorch profiler call. This initialization happens in a static initializer.
- Kineto init function basically registers a callback using the CUDA CUPTI library https://github.com/pytorch/kineto/blob/main/libkineto/src/init.cpp#L130-L148
- However, the above needs the dynamic linking to libcupti.so to have taken place.
- I understand now that static initializations of compilation units will be called before the dynamic linking leading to a segfault in #112389

## Workaround
We add a delay to the initialization that can be configured using the env variable `KINETO_DAEMON_INIT_DELAY_S`. This may not be the best fix, but it could help resolve the issue.
## Testing
Tested this out with [linear_model_example.py](https://github.com/facebookincubator/dynolog/blob/main/scripts/pytorch/linear_model_example.py)
First, export the daemon env variable `KINETO_USE_DAEMON`.
### Without any delay
```
>$ python3 linear_model_example.py
INFO:2024-02-21 19:34:50 2366287:2366287 init.cpp:131] Registering daemon config loader, cpuOnly = 1
INFO:2024-02-21 19:34:50 2366287:2366287 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:34:50 2366287:2366287 IpcFabricConfigClient.cpp:93] Setting up IPC Fabric at endpoint: dynoconfigclientb8f91363-d8d6-47a7-9103-197661e28397 status = initialized
INFO:2024-02-21 19:34:50 2366287:2366287 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:34:50 2366287:2366287 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
cpu
99 1385.468505859375
```
### With 5 seconds delay
```
>$ KINETO_DAEMON_INIT_DELAY_S=5 python3 linear_model_example.py
cpu
99 284.82305908203125
10099 8.817167282104492
INFO:2024-02-21 19:34:26 2359155:2359214 init.cpp:131] Registering daemon config loader, cpuOnly = 1
ERROR: External init callback must run in same thread as registerClient (1782580992 != -1922169024)
INFO:2024-02-21 19:34:26 2359155:2359214 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:34:26 2359155:2359214 IpcFabricConfigClient.cpp:93] Setting up IPC Fabric at endpoint: dynoconfigclient49270a3f-e913-4ea6-b9e0-cc90a853a869 status = initialized
INFO:2024-02-21 19:34:26 2359155:2359214 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:34:26 2359155:2359214 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
20099 8.817167282104492
```
### With an invalid delay
```
>$ KINETO_DAEMON_INIT_DELAY_S=abc python3 linear_model_example.py
INFO:2024-02-21 19:35:02 2369647:2369647 init.cpp:131] Registering daemon config loader, cpuOnly = 1
INFO:2024-02-21 19:35:02 2369647:2369647 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:35:02 2369647:2369647 IpcFabricConfigClient.cpp:93] Setting up IPC Fabric at endpoint: dynoconfigclient0e12a349-af7b-4322-901d-1ff22f91fd4c status = initialized
INFO:2024-02-21 19:35:02 2369647:2369647 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:35:02 2369647:2369647 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
cpu
```
### Unit test updated as well.
## Impact
This should not impact any general user. The initialization only occurs if `KINETO_USE_DAEMON` is set in the environment.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120276
Approved by: https://github.com/anupambhatnagar, https://github.com/aaronenyeshi
# Motivation
According to [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the 5th runtime component we would like to upstream is `Guard`. We will cover device guard and stream guard in this PR.
# Design
Device guard is used mainly for the op dispatcher in PyTorch. Currently, PyTorch already has a device guard abstraction, `c10::impl::DeviceGuardImplInterface`. In our design, we will introduce an `XPUGuardImpl` class that inherits from `c10::impl::DeviceGuardImplInterface`. We register `XPUGuardImpl` with PyTorch after we implement the device switch management mechanism in `XPUGuardImpl`. Besides, we will introduce `XPUGuard`, `OptionalXPUGuard`, `XPUStreamGuard`, and `OptionalXPUStreamGuard`. They all follow the design of their CUDA counterparts. The corresponding C++ files should be placed in the c10/xpu/ folder.
# Additional Context
It is unnecessary to add `Guard` code to PyTorch frontend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118523
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: #120315
Summary: RECORD_FUNCTION only captures an argument when it is a Tensor. However, it is very common for users to pass arguments with primitive data types (int, float, index, bool). This diff adds support for non-tensor arguments in RECORD_FUNCTION.
Test Plan:
unit test
buck test mode/dev-nosan caffe2/test:profiler -- test_execution_trace_with_pt2 test_execution_trace_alone test_execution_trace_with_kineto test_execution_trace_start_stop test_execution_trace_repeat_in_loop test_execution_trace_no_capture
Differential Revision: D53674768
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120017
Approved by: https://github.com/soulitzer
Fixes #115331.
This PR increases the number of valid GPU devices to 512 (from 64) in order to future-proof PyTorch for providers that offer [single nodes with a large device count](https://www.tensorwave.com/). Until now, `DeviceIndex` was an `int8_t`, thus multiple changes were necessary:
- `DeviceIndex` changed to `int16_t`. Updated consumers that assume it to be an `int8_t`.
- Updated bounds checking for `torch.device()` in the Python frontend. Right now, we allow funny things like `torch.device('cpu', 200).index == -56`, which is undefined behavior. I inserted some checks to only allow values between 0 and `c10::Device::MAX_NUM_DEVICES - 1`.
- Updated the `ArgumentInfo` struct as it hardcodes the device index as 8 bit field [^1]. Might be a breaking change, not sure if users rely on this.
- Introduced `c10::Device::MAX_NUM_DEVICES` as a replacement for the old `C10_COMPILE_TIME_MAX_GPUS`
[^1]: This field was unsigned, so I guess this has also been undef behavior the whole time? Our default device index is -1, so this always wrapped around to 255 when written to the `ArgumentInfo` struct. When I switched the `DeviceIndex` to `int16_t`, it actually stayed 255 after unpacking from `ArgumentInfo` again, as the `DeviceIndex` was now wide enough that it didn't wrap back to -1.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119639
Approved by: https://github.com/cyyever, https://github.com/albanD
In many places in the code we use `tree_map_only((SymInt, SymBool, SymFloat), foo)` but with nested ints, it is possible to have SymInts that are non-symbolic, so we may want to do something like `tree_map_only(is_symbolic, foo)` instead.
Alternative: wrap nested int SymNodes with something other than SymInt.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119974
Approved by: https://github.com/zou3519
ghstack dependencies: #119661
This PR simplifies the output wrapping handling in op dispatch, making
it easier to understand.
It also enables a new case: if the output DTensorSpec for the result is
None and the result is a scalar tensor, we just return the scalar
tensor instead of wrapping it in a DTensor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120297
Approved by: https://github.com/wz337
Fixes#117794
Fix tripped the assert here: 86dedebeaf/torch/utils/_python_dispatch.py (L216)
From investigation: I found that functionalization of an in-place op (`mul_` in this test case) results in the strides of `TwoTensor`'s `a` / `b` components being mutated to be contiguous. This is not reflected in the outer tensor, causing the assert to be tripped.
After discussion with Brian, I address this in this PR by disallowing input mutations on non-contiguous tensor subclass inputs for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117860
Approved by: https://github.com/bdhirsh
Changes sharding to attempt to put all serial tests on as few shards as possible. Parallel tests are then distributed across all shards, with most of them likely ending up on the non-serial shards.
Example: 8 minutes of serial tests, 20 minutes of parallel tests, 2 proc per machine, 6 machines
-> 8 + 20/2 = 18 total minutes of tests
-> 18 / 6 machines = 3 min per machine
-> all serial tests should fit on 3 machines (3min, 3 min, 2min)
-> majority of parallel tests should go on last 4 machines, one of which is shared with the serial tests
Move serial tests to run first
If I want to move to a purely numbers based sharding, this ensures that parallel tests are run with parallel tests as much as possible instead of interleaving serial + parallel tests, which decreases effectiveness of parallelization, while also ensuring that test reordering is still mostly effective.
See 73e816ee80 for example logs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119078
Approved by: https://github.com/huydhn
Summary:
This PR is mainly about the flight recorder side of the changes: it takes a
map of maps as input and dumps it as a picklable object. It also adds functions that
should be compiled only when NCCL_COMM_DUMP is defined.
Test Plan:
Integration tests with NCCL will be done later; here we only do the
c10d side of the dump test, aka NCCLTraceTest.
Testing the dump function is a bit tricky as we don't have
existing C++ unit tests for them. So we still use the Python NCCLTraceTest with
the python binding of _dump_nccl_trace(), we manually fed the
dump_nccl_trace with a map of test info, and assert the pickle result and
print the converted python dict:
```
(sqzhang_1) [sqzhang@devgpu009.cln1 ~/pytorch (main)]$ python
test/distributed/test_c10d_nccl.py NCCLTraceTest
NCCL version 2.19.3+cuda12.0
[rank0]:[E ProcessGroupNCCL.cpp:1200] [PG 0 Rank 0] ProcessGroupNCCL
preparing to dump debug info.
.NCCL version 2.19.3+cuda12.0
.NCCL version 2.19.3+cuda12.0
{'ncclID2': {'Key2': 'Value2', 'Key1': 'Value1'}, 'ncclID1': {'Key2':
'Value2', 'Key1': 'Value1'}}
{'ncclID2': {'Key2': 'Value2', 'Key1': 'Value1'}, 'ncclID1': {'Key2':
'Value2', 'Key1': 'Value1'}}
.NCCL version 2.19.3+cuda12.0
{'ncclID2': {'Key2': 'Value2', 'Key1': 'Value1'}, 'ncclID1': {'Key2':
'Value2', 'Key1': 'Value1'}}
{'ncclID2': {'Key2': 'Value2', 'Key1': 'Value1'}, 'ncclID1': {'Key2':
'Value2', 'Key1': 'Value1'}}
.NCCL version 2.19.3+cuda12.0
.NCCL version 2.19.3+cuda12.0
.NCCL version 2.19.3+cuda12.0
.NCCL version 2.19.3+cuda12.0
.
----------------------------------------------------------------------
Ran 8 tests in 95.761s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120063
Approved by: https://github.com/wconstab
This PR removes and adds some failures and successes that were hidden in the past week (ish).
https://github.com/pytorch/pytorch/pull/119408 (47182a8f4b5e36e280ca3595ba134f53499d2dc9) accidentally removed environment variables on rerun (see PR body of https://github.com/pytorch/pytorch/pull/120251 for slightly more details).
Testing with dynamo is enabled using an env var, so if a test failed with dynamo, it would rerun without the dynamo env var set, making it pass on retry. Normally, the flaky test bot would catch this and make an issue for the test, but the CI env var controls whether or not xml test reports get made, and that also got removed on rerun, so the xmls weren't made either.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120271
Approved by: https://github.com/DanilBaibak, https://github.com/zou3519
Summary:
Now we can parse statements like
```
%22 = tt.make_tensor_ptr %20, [%21, %c128_i64], [%c2048_i64, %c1_i64], [%1, %c0_i32]
```
Test Plan:
Added new test
```
buck2 test mode/opt //hammer/ops/tests/inductor:ragged_hstu_test
```
now passes again with optimizations
Differential Revision: D53975130
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120263
Approved by: https://github.com/aakhundov, https://github.com/sijiac
There are some tests in this file that are impl specific, e.g. verifying generated code via `FileCheck`. These tests are covered for native funcol in test_c10d_functional_native.py, therefore marking them with `@run_with_legacy_funcol`.
Other tests are marked with `@run_with_both_funcol_impls`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120025
Approved by: https://github.com/wanchaol
ghstack dependencies: #119982
env=None (which is the default) inherits the env from the calling process. Explicitly set the env to the calling process env so that things can be added to it later
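A minimal sketch of the pattern (the test command and the extra `CI` variable are placeholders):
```python
import os
import subprocess

# Start from an explicit copy of the calling process's environment instead of
# relying on env=None, so extra variables can be added before launch.
env = dict(os.environ)
env["CI"] = "1"
subprocess.run(["pytest", "test_example.py"], env=env, check=False)
```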
Tested in: e7b4d8ec88
Checked that test-reports (which depend on the CI env var) get made.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120251
Approved by: https://github.com/huydhn
Implicitly, the return type of `set_extra_state` is `NoReturn` since it always raises an error, and pyright will complain about mismatched return types if you override it with an implementation that doesn't also always raise an error. If we explicitly hint the return type as `None` (how we expect people to override it), we can avoid this error message.
```
Method "set_extra_state" overrides class "Module" in an incompatible manner
Return type mismatch: base method returns type "NoReturn", override returns type "None"
```
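An override in the expected shape, which the `None` hint keeps compatible with type checkers:
```python
from typing import Any

import torch.nn as nn

class MyModule(nn.Module):
    def get_extra_state(self) -> Any:
        return {"version": 1}

    # With the base method hinted as returning None, this ordinary override no
    # longer triggers the NoReturn/None mismatch shown above.
    def set_extra_state(self, state: Any) -> None:
        self.version = state["version"]
```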
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120161
Approved by: https://github.com/mikaylagawarecki
This PR fixes the issue in https://github.com/pytorch/pytorch/issues/120188.
In collect_metadata_analysis.py, handling of input/output mutations was different from handling in other locations. In other locations, MUTATED_OUT_GRAPH was used to indicate that mutation would require returning an output; in collect_metadata_analysis.py, any type of mutation was being handled as if it would require returning an output.
This PR changes collect_metadata_analysis to match other callsites and refactors computation of mutation types so that it is a property of the dataclass instead of something that needs to be computed manually when constructing an InputAliasInfo.
Differential Revision: [D53950998](https://our.internmc.facebook.com/intern/diff/D53950998)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120136
Approved by: https://github.com/bdhirsh
ghstack dependencies: #120141
This PR removes the conditional logic depending on requires_subclass_dispatch for mutation handling.
Inputs are labeled with one of three labels: NOT_MUTATED, MUTATED_IN_GRAPH, or MUTATED_OUT_GRAPH. MUTATED_IN_GRAPH indicates mutation that is allowed in the aot autograd graph; MUTATED_OUT_GRAPH indicates mutation that is not allowed in the graph, so the result is computed, returned, and then assigned back to the input after the graph.
Previously, there was logic to handle subclasses differently, so that MUTATED_IN_GRAPH + subclasses would behave like MUTATED_OUT_GRAPH.
This PR simplifies aot_autograd's handling of mutations so that MUTATED_IN_GRAPH will always be handled in graph, even when subclasses are present. Note that there are still some cases where subclass support won't be handled correctly.
Differential Revision: [D53950999](https://our.internmc.facebook.com/intern/diff/D53950999)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120141
Approved by: https://github.com/bdhirsh
If we already have Python `Stream` objects, then calling `stream1.wait_stream(stream2)` is syntactic sugar for creating an `event: Event`, recording it in `stream2`, and calling `stream1.wait_event(event)`.
~~Getting a Python `Stream` object incurs some CPU overhead, so we prefer to not change other callsites where we do not already have the `Stream` objects.~~
Update: Calling `event.record()` with no stream specified calls `torch.cuda.current_stream()`, so the overhead should be identical.
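For reference, the equivalence in code (requires a CUDA device):
```python
import torch

s1 = torch.cuda.Stream()
s2 = torch.cuda.Stream()

# Sugar form:
s1.wait_stream(s2)

# The explicit form it desugars to:
event = torch.cuda.Event()
event.record(s2)      # record() with no stream argument would use the current stream
s1.wait_event(event)
```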
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120231
Approved by: https://github.com/yifuwang
ghstack dependencies: #118298, #119985
Additional changes: the tests in test_functional_api.py use a multi-threaded pg which is implemented in Python. For the native ops to call into the Python pg implementation, glue code in PyProcessGroup is required for each collective. This PR also adds a few pieces of previously missing glue code, which are necessary for running test_functional_api.py with native funcol.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119982
Approved by: https://github.com/wanchaol
This PR adds a couple of missing words in the Checkpointing documentation, it doesn't have a specific issue number related to it.
Changes are:
- "backward." -> "backward propagation."
- "to be advanced than" -> "to be more advanced than"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120196
Approved by: https://github.com/soulitzer
Pre-emptive test in OSS to ensure that models relying on the "non-overlapping guards" checks do not suffer drastically w.r.t. guard slowness. Current plan is to follow up on this with a "real" fix, to generate a linear number of these guards.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120106
Approved by: https://github.com/mlazos
`CompiledKernel.launch_enter_hook` and `CompiledKernel.launch_exit_hook` are hooks that allow external tools to monitor the execution of Triton kernels and read each kernel's metadata. Initially, these hooks have a value of `None`.
Triton's kernel launcher passes hooks and kernel metadata by default, while Inductor's launcher doesn't. This PR unifies the parameters passed to both launchers so that tools can get information from both handwritten Triton kernels and Inductor-generated Triton kernels.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119450
Approved by: https://github.com/soulitzer
For ROCm/HIP, each stream is lazily initialized rather than creating all streams when the first stream is requested. HIP streams are not as lightweight as CUDA streams; the pooling strategy can affect performance.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119996
Approved by: https://github.com/ezyang
This fixes an internal DTensor enablement bug (I don't have an OSS issue for it)
I finally root-caused this as follows:
(1) we were fakefying a DTensor graph input, that was an autograd non-leaf (it had a grad_fn)
(2) that caused it to go through this `clone()` call during fakeification: https://github.com/pytorch/pytorch/blob/main/torch/_subclasses/meta_utils.py#L549
(3) `clone(torch.preserve_format)` is supposed to return another DTensor with the same strides as the input, but I noticed we were returning a DTensor with contiguous strides incorrectly.
(4) It turns out that DTensor was hashing on the sharding strategy for `aten.clone`, regardless of the `memory_format` kwarg that was passed in.
I could have manually updated the `clone` sharding strategy registration to take `memory_format` into account. But instead, I figured that every aten op with a sharding strategy needs to handle the memory_format kwarg specially - so I tried to generically force DTensor to consider all ATen ops that take a `memory_format` kwarg during hashing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118667
Approved by: https://github.com/wanchaol
ghstack dependencies: #117667, #117666, #118209, #118191
Fix https://github.com/pytorch/pytorch/issues/115260.
This issue is triggered by `FusedSchedulerNode` cases.
We always store the `lowp buffer` to the `store_cache`, then load the `lowp buffer` from the `store_cache` and convert it to float before the compute ops.
Now we will generate a `{key: to(float32)_expr, value: the float32 cse var before to_dtype and store}` entry in `cse.cache`.
Then the `to_dtype(float32)` after the `load` will hit this cache and not generate a new var with cast code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118365
Approved by: https://github.com/jgong5, https://github.com/jansel
It's already a requirement for building PyTorch, but it should also be a
requirement for linking extensions with it, as mixing compilers can lead to runtime
crashes: the `std::optional` template layout is incompatible between
gcc-9 and older compilers.
Also, update the minimum supported clang version to 9.x (used to build Android), as clang-5 is clearly not C++17 compliant.
Fixes https://github.com/pytorch/pytorch/issues/120020
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120126
Approved by: https://github.com/Skylion007
This PR updates the list of benchmarks that should (not) be skipped. Here's a summary of
the changes:
- `detectron2_maskrcnn`: #120115
- `fambench_xlmr`: moved to canary models
- `hf_Bert` and `hf_Bert_large`: pass
- `maml`: pass
- `clip`: renamed to `hf_clip`
- `gat`, `gcn`, and `sage`: moved to canary models
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120117
Approved by: https://github.com/ezyang, https://github.com/lezcano
```
Between the time we switch to the native funcol by default and the time when
we are confident that we can remove the legacy implementation, we want to
ensure that the legacy funcol remains covered by unit tests. This is to
prepare for any potential (but unlikely) reverts. The following utilities
help achieve this goal.
run_with_{native,legacy}_funcol - mark a test to run with only
{native,legacy} funcol. These decorators are for impl specific tests (e.g.
verifying generated code with FileCheck).
run_with_both_funcol_impls - parametrize a test to run with both legacy and
native funcol.
run_with_both_funcol_impls_with_arg - same as run_with_both_funcol_impls, but
passes `enable_native_funcol` to the test so impl specific checks can be
carried out.
```
This PR also marks some tests we want to cover in this fashion. More tests will be marked in subsequent PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119950
Approved by: https://github.com/wanchaol
ghstack dependencies: #119881
Fixes #119768
- #119768
This PR adds a new function `tree_iter` that lazily iterates over the tree leaves. It differs from the `tree_leaves` function, which traverses the whole tree first to build a list of leaves.
```python
for leaf in tree_iter(tree):
...
```
is much more efficient than:
```python
for leaf in tree_leaves(tree):
...
```
where `tree_leaves(tree)` is `list(tree_iter(tree))`.
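A usage sketch, assuming `tree_iter` is exposed from `torch.utils._pytree` alongside `tree_leaves`:
```python
import torch.utils._pytree as pytree

tree = {"a": [1, 2], "b": (3, {"c": 4})}

lazy = list(pytree.tree_iter(tree))   # yields leaves one at a time
eager = pytree.tree_leaves(tree)      # materializes the full list up front
assert lazy == eager == [1, 2, 3, 4]
```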
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120155
Approved by: https://github.com/vmoens
Previously, pad_mm skipped cases where any input tensor has a symbolic
dimension or stride. This is too constraining in practice.
This PR enables this pass to pad non-symbolic dimensions in
the presence of dynamic dims. For example, with this PR, we could
pad the K dimension (i.e. 1921) for torch.mm(A[s0, 1921], B[2048, 1921]).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120073
Approved by: https://github.com/jansel
The test is failing on our internal CI with the error below:
```
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
The purpose of this test is to exercise NCCL, so it doesn't make sense to run it in a single-GPU setting either.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120130
Approved by: https://github.com/wconstab, https://github.com/eqy
Summary: `torch.cond` is already supported in Dynamo and Export: the `true_fn` and `false_fn` subgraphs are traced as child fx graphs of the main graph and passed to the `torch.cond` higher-order operator in the fx graph. However, this breaks in Inductor, as the latter doesn't have a way of dealing with child fx subgraphs and properly lowering and codegen-ing them.
In this PR, we add `torch.cond` support in Inductor. This is achieved by adding subgraph lowering and codegen-ing infrastructure as well as new `Conditional` IR node type weaving the parent graph with the true and false child subgraphs.
Here we only implement `torch.cond` support in JIT Inductor (Python wrapper codegen). The implementation in AOT Inductor (C++ wrapper codegen), including ABI-compatibility mode, will follow.
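A small end-to-end example of the kind of program this enables (the default torch.compile backend is Inductor):
```python
import torch

def true_fn(x):
    return x.sin()

def false_fn(x):
    return x.cos()

@torch.compile
def f(pred, x):
    # Dynamo traces true_fn/false_fn as child fx graphs of torch.cond;
    # Inductor now lowers them via the new Conditional IR node.
    return torch.cond(pred, true_fn, false_fn, (x,))

x = torch.randn(4)
print(f(torch.tensor(True), x))
print(f(torch.tensor(False), x))
```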
Test Plan:
```
$ python test/inductor/test_control_flow.py
...
----------------------------------------------------------------------
Ran 24 tests in 86.790s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119759
Approved by: https://github.com/jansel, https://github.com/eellison
This PR makes the tests for inline and sequential_split stop relying on set_grad_enabled being in the graph, because those calls will be gone once we turn on the replace_set_grad_with_hop_pass in the following diff. Instead, we'll manually insert them into the graph.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119914
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #119732, #119736, #119810, #119913
As titled. Before this PR, after we split and then inline_, there would be getitem calls in the graph that the original graph module doesn't have. This PR removes the additional getitem calls by inlining.
Test Plan:
Added new test cases for graphs that return multiple outputs and takes multiple inputs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119913
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #119732, #119736, #119810
This PR is the first (1/N) in a series of PRs transforming global-state-mutating ops, such as the torch._C._set_grad_enabled calls in the pre-dispatch graph, into a higher-order op so that the graph becomes more functional. We make use of split_module to help us do the transformation.
This PR preserves node.name from the original module by adding a new kwarg `keep_original_node_name` to split_module.
For a graph looks like this:
```python
def forward(self, arg_0):
arg0_1, = fx_pytree.tree_flatten_spec(([arg_0], {}), self._in_spec)
add = torch.ops.aten.add.Tensor(arg0_1, 1); arg0_1 = None
sin = torch.ops.aten.sin.default(add); add = None
sum_1 = torch.ops.aten.sum.default(sin); sin = None
_set_grad_enabled = torch._C._set_grad_enabled(False)
add_1 = torch.ops.aten.add.Tensor(sum_1, 1); sum_1 = None
_set_grad_enabled_1 = torch._C._set_grad_enabled(True)
sub = torch.ops.aten.sub.Tensor(add_1, 1)
return pytree.tree_unflatten((add_1, sub), self._out_spec)
```
Before the change, the split graph returns the following graph and subgraphs (notice the change from `add` -> `add_tensor`, `sin` -> `sin_default`):
```python
def forward(self, arg_0):
arg0_1, = fx_pytree.tree_flatten_spec(([arg_0], {}), self._in_spec)
submod_0 = self.submod_0(arg0_1); arg0_1 = None
submod_1 = self.submod_1(submod_0); submod_0 = None
submod_2 = self.submod_2(submod_1)
return pytree.tree_unflatten((submod_1, submod_2), self._out_spec)
# submod_0
def forward(self, arg0_1):
add_tensor = torch.ops.aten.add.Tensor(arg0_1, 1); arg0_1 = None
sin_default = torch.ops.aten.sin.default(add_tensor); add_tensor = None
sum_default = torch.ops.aten.sum.default(sin_default); sin_default = None
return sum_default
# submod_1
def forward(self, sum_1):
_set_grad_enabled = torch._C._set_grad_enabled(False)
add_tensor = torch.ops.aten.add.Tensor(sum_1, 1); sum_1 = None
return add_tensor
# submod_2
def forward(self, add_1):
_set_grad_enabled = torch._C._set_grad_enabled(True)
sub_tensor = torch.ops.aten.sub.Tensor(add_1, 1); add_1 = None
return sub_tensor
""")
```
After the change, the test produce the following graph, all the node names in original graph module are preserved in sub_modules.
```python
def forward(self, arg_0):
sub, = fx_pytree.tree_flatten_spec(([arg_0], {}), self._in_spec)
submod_0 = self.submod_0(sub); sub = None
submod_1 = self.submod_1(submod_0); submod_0 = None
submod_2 = self.submod_2(submod_1)
return pytree.tree_unflatten((submod_1, submod_2), self._out_spec)
# submod_0
def forward(self, arg0_1):
add = torch.ops.aten.add.Tensor(arg0_1, 1); arg0_1 = None
sin = torch.ops.aten.sin.default(add); add = None
sum_1 = torch.ops.aten.sum.default(sin); sin = None
return sum_1
# submod_1
def forward(self, sum_1):
_set_grad_enabled = torch._C._set_grad_enabled(False)
add_1 = torch.ops.aten.add.Tensor(sum_1, 1); sum_1 = None
return add_1
# submod_2
def forward(self, add_1):
_set_grad_enabled_1 = torch._C._set_grad_enabled(True)
sub = torch.ops.aten.sub.Tensor(add_1, 1); add_1 = None
return sub
```
Note that currently, we call split_module on the graph after pre-dispatch aot. The difference is even larger if we `split_module` the graph module produced by dynamo, where all the original variable names in the user program are preserved after dynamo but would be lost after `split_module` without this change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119732
Approved by: https://github.com/tugsbayasgalan
1. Right now we double-increment the profile counter. The PR avoids that so we don't end up with profile_0, profile_2, profile_4, ...
2. Log the latency of running the passed-in function with profiling on so we can easily skip those _compile calls which return quickly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120100
Approved by: https://github.com/eellison
I want to log metadata for inductor-generated Triton kernels for a couple of purposes:
1. With these metadata, it should be convenient to find unaligned reduction kernels and try the idea here https://github.com/pytorch/pytorch/issues/119929 . I think it's nice to try it on kernels that are used in real models.
2. I'm thinking that, based on the collected kernel metadata, I can build a simple offline tool that benchmarks each kernel with ncu and augments each kernel's metadata with latency, theoretical membw (estimated memory access / latency), and actually achieved membw. Hopefully this can point us to some good optimization opportunities.
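As a rough illustration of point 2, a tiny hypothetical helper (the function and the example numbers are made up, not part of the metric table) for turning a measured latency into an achieved memory bandwidth to compare against the theoretical estimate:
```python
def achieved_membw_gbps(bytes_accessed: int, latency_us: float) -> float:
    # bandwidth = bytes moved / elapsed time, reported in GB/s
    return bytes_accessed / (latency_us * 1e-6) / 1e9

# e.g. a kernel that reads and writes 64 MB total in 500 us achieves ~134 GB/s
print(achieved_membw_gbps(64 * 2**20, 500.0))
```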
Command:
```
TORCHINDUCTOR_CACHE_DIR=`realpath ~/inductor-caches/kernel-metadata-log` TORCHINDUCTOR_ENABLED_METRIC_TABLES=kernel_metadata TORCHINDUCTOR_BENCHMARK_KERNEL=1 TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 time python benchmarks/dynamo/huggingface.py --backend inductor --amp --performance --training
```
The best practice here is to point inductor cache to a folder outside of /tmp so that one can always run the kernel again based on the path stored in kernel metadata. (folders under /tmp may get removed by the system)
Here is first 1000 rows of collected metadata for huggingface: https://gist.github.com/shunting314/cf4ebdaaaa7e852efcaa93524c868e5f
And here are all 10K kernels collected for huggingface. The gist cannot be rendered as a CSV since it's too large: https://gist.github.com/shunting314/7f841528e2debdc2ae05dece4ac591be .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120048
Approved by: https://github.com/jansel
Summary: Added Quantized gelu for vulkan backend.
Test Plan:
**Tested it on "On Demand RL FBSource"**
LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_quantized_api_test_bin -c pt.vulkan_full_precision=1 -- --gtest_filter="VulkanAPITest.gelu_q*"
----------------------------------------------------------------------------------
Note: Google Test filter = VulkanAPITest.gelu_q*
[==========] Running 4 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 4 tests from VulkanAPITest
[ RUN ] VulkanAPITest.gelu_qint8
[ OK ] VulkanAPITest.gelu_qint8 (318 ms)
[ RUN ] VulkanAPITest.gelu_qint8_self
[ OK ] VulkanAPITest.gelu_qint8_self (214 ms)
[ RUN ] VulkanAPITest.gelu_quint8
[ OK ] VulkanAPITest.gelu_quint8 (152 ms)
[ RUN ] VulkanAPITest.gelu_quint8_self
[ OK ] VulkanAPITest.gelu_quint8_self (142 ms)
[----------] 4 tests from VulkanAPITest (828 ms total)
[----------] Global test environment tear-down
[==========] 4 tests from 1 test suite ran. (828 ms total)
[ PASSED ] 4 tests.
Differential Revision: D52985437
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119935
Approved by: https://github.com/jorgep31415
Fixes https://github.com/pytorch/pytorch/issues/117596
This was needed for Float8Tensor. Before this PR, dynamo would sometimes handle attribute access on tensor subclasses correctly, but it would choke on tensor subclasses with no source (it would fall back to using a `GetAttrVariable` to represent the attribute access, which is a problem if the attribute is a tensor that we later want to call tensor methods on).
I supported two cases:
(1) the attribute is a tensor, which is part of the `attrs` returned by the subclass's `__tensor_flatten__`. This creates a `TensorVariable`
(2) the attribute is a constant, which is part of the constant metadata returned by `__tensor_flatten__`. As per the contract of tensor_flatten, this should be a `ConstantVariable`. It could be possible that we allow non-constant metadata in the future, but we don't support that today.
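For reference, a minimal sketch of a wrapper subclass exposing both kinds of attributes; the class and attribute names are illustrative only, a real subclass would also implement `__torch_dispatch__`, and the exact `__tensor_unflatten__` signature varies across PyTorch versions:
```python
import torch

class ScaledTensor(torch.Tensor):
    @staticmethod
    def __new__(cls, data, scale):
        return torch.Tensor._make_wrapper_subclass(
            cls, data.shape, dtype=data.dtype, device=data.device
        )

    def __init__(self, data, scale):
        self._data = data    # tensor attribute -> case (1), a TensorVariable in dynamo
        self._scale = scale  # constant metadata -> case (2), a ConstantVariable in dynamo

    def __tensor_flatten__(self):
        # (names of tensor attributes, constant context/metadata)
        return ["_data"], self._scale

    @staticmethod
    def __tensor_unflatten__(inner_tensors, meta, outer_size, outer_stride):
        return ScaledTensor(inner_tensors["_data"], meta)
```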
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117666
Approved by: https://github.com/zou3519
ghstack dependencies: #117667
### Summary
@LucasLLC recently implemented `broadcast` in funcol. This is not yet available in the native funcol ops. This PR adds support for broadcast for native funcol.
- Added `_c10d_functional::broadcast` and `_c10d_functional::broadcast_`
- Integrated with the Python funcol broadcast and `AsyncCollectiveTensor`
- Implemented Inductor lowering. Verified correctness and buffer reuse behavior
- Validated dynamo traceability
- Validated AOTInductor compile-ability
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119229
Approved by: https://github.com/wanchaol
ghstack dependencies: #119104
`nargs="?"` accepts 0 or 1 argument, but `nargs="*"` accepts any number of arguments (including 0), which is the intended behavior of the tool
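A quick standalone illustration of the difference (not the tool's actual argument setup):
```python
import argparse

parser = argparse.ArgumentParser()
# nargs="?" would accept at most one file; nargs="*" accepts any number of files
parser.add_argument("files", nargs="*")
print(parser.parse_args(["a.cpp", "b.cpp"]).files)  # ['a.cpp', 'b.cpp']
```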
Test plan: Run `python tools/build_with_debinfo.py aten/src/ATen/native/cpu/BlasKernel.cpp aten/src/ATen/native/BlasKernel.cpp` and observe that it generates torch_cpu with those two files containing debug information
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120088
Approved by: https://github.com/Skylion007
Summary: This commit adds the `model_is_exported` util function
for users to be able to easily tell what APIs to call to move
their models between train and eval modes. This has the
additional advantage of hiding the implementation of how we
detect a model is exported, in case the metadata format changes
in the future.
Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_model_is_exported
Differential Revision: [D53812972](https://our.internmc.facebook.com/intern/diff/D53812972)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119726
Approved by: https://github.com/tugsbayasgalan, https://github.com/albanD
We use the fact that we now propagate indexing properly to avoid having
to maintain two different implementations of the op. Doing this we also remove
a spurious guard on this op.
We move the ref into a decomp as we now use advanced indexing.
The only difference we did in the implementation is that we now use
advanced indexing rather than `torch.cat`.
We also remove it from core. Let's see how this goes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119857
Approved by: https://github.com/peterbell10, https://github.com/larryliu0820
ghstack dependencies: #119863, #119864
Summary: If cuda is not initialized before calling attachAllocatorTraceTracker, then the CudaCachingAllocator device_allocator is empty which means that the registration hooks are not setup. This means that a new segment_alloc will not be registered causing an expensive dynamic registration each time the segment is used. The fix is to guarantee that cuda is initialized before attaching the hooks. If cuda is already initialized, then this lazyInitCUDA is a no-op.
Test Plan:
Testing this on fsdp+tp example model where cuda is not initialized before init_process_group.
Job without the fix keeps dynamically registering:
https://www.internalfb.com/mlhub/pipelines/runs/mast/torchx-fsdp_2d_main-j544j0vn7zqh4c?job_attempt=0&version=0&env=PRODUCTION
The following keeps looping:
[0]:2024-02-14T10:48:18.873079 twshared0039:4836:6232 [0] NCCL INFO CTRAN-MAPPER: registered buffer 0x7f6ebe000000 len 608124000, state 1
[0]:2024-02-14T10:48:18.873087 twshared0039:4836:6232 [0] NCCL INFO *dynamicRegist = true
[0]:2024-02-14T10:48:18.903234 twshared0039:4836:6232 [0] NCCL INFO CTRAN-MAPPER: deregister buffer 0x7f6ebe000000 len 608124000, state 1
[0]:2024-02-14T10:48:18.903240 twshared0039:4836:6232 [0] NCCL INFO CTRAN-MAPPER: deregiter buffer 0x7f6ebe000000 len 608124000
Job with the fix does not have this issue:
https://www.internalfb.com/mlhub/pipelines/runs/mast/torchx-fsdp_2d_main-hzm5dwqncr7l7?version=0&env=PRODUCTION
Reviewed By: minsii, kwen2501, xw285cornell
Differential Revision: D53770989
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120052
Approved by: https://github.com/kwen2501
Fixes https://github.com/pytorch/pytorch/issues/119779 by properly graph breaking; a proper, complete fix would be to handle quantized tensors.
If, when generating a fake tensor, an UnsupportedFakeTensorException is thrown, it is handled and converted into an
Unimplemented inside wrap_fake_exception, which is then translated to a graph break.
However, run_node used to convert UnsupportedFakeTensorException into a runtime error, creating runtime
errors instead of graph breaks whenever generating a fake tensor for a quantized tensor failed.
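A repro-style sketch of the kind of quantized-tensor input involved (illustrative only; per the description above, the unsupported fake-tensor path should now surface as a graph break rather than a runtime error):
```python
import torch

@torch.compile
def f(q):
    return torch.dequantize(q) + 1

q = torch.quantize_per_tensor(torch.randn(4), scale=0.1, zero_point=0, dtype=torch.qint8)
print(f(q))  # expected to fall back through a graph break instead of raising
```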
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120026
Approved by: https://github.com/jansel
# Motivation
According to [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842) and [[RFC] Intel GPU Runtime Upstreaming for Allocator](https://github.com/pytorch/pytorch/issues/116322), we will upstream the key functionality of the device `Allocator` dedicated to XPU to PyTorch. Following our design, we will prepare to generalize `Allocator` in parallel.
# Design
In the current design, XPU uses an `XPUAllocator` class, inherited from `c10::Allocator`. `XPUAllocator` is a manager to handle `DeviceCachingAllocator`, which is a per-device implementation of the caching mechanism to manage the already cached or newly allocated memory. The caching mechanism is similar to other backends, like CUDA. We can visualize the design as below.
<p align="center">
<img width="162" alt="image" src="https://github.com/pytorch/pytorch/assets/106960996/6b17b8cf-e7d1-48b4-b684-f830c409d218">
</p>
# Additional Context
We're going to implement our design gradually. This PR covers the device `Allocator` dedicated to XPU. The second PR covers the host `Allocator`.
Besides these PRs, we plan to make the device `Allocator` device-agnostic through another PR.
In this PR, our device `Allocator` has the same memory management mechanism as CUDA, but lacks features such as expandable segments and statistics. We will add these features back in a subsequent PR, which intends to generalize `Allocator`.
The differences from CUDA:
only the key functionality is included; it lacks AsyncAllocator, gpu_trace, history_record, graph functionality, memory snapshot, memory statistics, expandable segments...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118091
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/albanD
ghstack dependencies: #117611, #117619, #117734
Summary:
1. Make sure folded constants generated internally don't get exposed.
2. Add runConstantFolding and related API calls
Test Plan:
```buck2 run mode/opt-split-dwarf -c fbcode.nvcc_arch=v100,a100 caffe2/caffe2/fb/predictor/tests_gpu:pytorch_predictor_container_gpu_test -- --gtest_filter=*PyTorchPredictorContainerTest.LoadAOTInductorModel*
```
The test triggers the added predictor tests `test_aot_inductor_merge_net_file_*.predictor_20240206`,
which would trigger runConstantFolding from predictor's module loading.
Reviewed By: SherlockNoMad
Differential Revision: D53718139
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119823
Approved by: https://github.com/chenyang78
# Motivation
As mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the next runtime component we would like to upstream is `Event` which handles the status of an operation that is being executed. Typically, in some circumstances, we can fine-grain control of the operation execution via `Event`.
# Design
`XPUEvent` is a movable but not copyable wrapper around a sycl event. It should be created lazily on an XPU device when recording an `XPUStream`. Meanwhile, an `XPUEvent` can wait for another `XPUEvent` or for all the submitted kernels on an `XPUStream` to complete. Aligning with the other backends, the C++ files related to `Event` will be placed in the `aten/src/ATen/xpu` folder. For frontend code, the `XPUEvent` runtime API will be bound to Python `torch.xpu.Event`. The corresponding C++ code will be placed in `torch/csrc/xpu/Event.cpp` and the Python code will be placed in `torch/xpu/streams.py` respectively.
# Additional Context
It is worth mentioning that the `elapsed_time` method is temporarily not supported by `XPUEvent`. We will be adding support for it soon. Meanwhile, `XPUEvent` doesn't support IPC across different processes. For the other parts, we have almost a 1:1 mapping with CUDA.
It lacks the below APIs:
- `torch.cuda.Event.ipc_handle`
- `CUDAEvent`'s constructor with `IpcEventHandle`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117734
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: #117611, #117619
Summary: An fbcode test exposed a shortcoming where we serve a FakeTensor from the cache with the wrong inference_mode. Take the current mode into account in the cache key so we only serve entries from the mode we're currently in.
Test Plan: New unit test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119963
Approved by: https://github.com/eellison
Goal: all unit tests are written for eager, but we want to test torch.compile by default.
This PR adds ``@test_compiled_fsdp(compile_compute_on_module=None/TransformerBlock)`` to unit tests. For now it compiles compute only, as follows:
```
module.compile() # include user registered hooks if any
fully_shard(module)
```
torch.compile does not work with the following components yet:
* compiling AC
* compiling reshard_after_forward=2
* delayed_all_gather, delayed_reduce_scatter
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119933
Approved by: https://github.com/awgu, https://github.com/jansel
Sets up the NVIDIA runtime and runs the indexer inside a Docker container.
Verified this works by running the indexer jobs (all the setup is correct; it OOMs for an unrelated reason, for which a fix is on the way).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119923
Approved by: https://github.com/huydhn
The first try reused TensorListMetadata, which caused illegal memory access issues when there were too many tensors in the list. Now we just launch multiple kernels with a simpler version of the struct (to minimize the number of kernels launched).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119927
Approved by: https://github.com/albanD
print_performance previously returned the total execution time for `times` runs, but now it returns the average execution time of a single run. Change the profiler to be consistent with that. Not sure if there is a good way to add a test, though.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119959
Approved by: https://github.com/eellison
Fixes #119722.
1. Added the missing device in
```
max_memory_allocated = torch.cuda.max_memory_allocated()
max_memory_reserved = torch.cuda.max_memory_reserved()
```
2. Fixed the device parameter to device_str. Based on [lines](2bda6b4cb8/torch/profiler/profiler.py (L291)), the input device is a string (device_str) for
```
self.mem_tl.export_memory_timeline_html
self.mem_tl.export_memory_timeline_raw
self.mem_tl.export_memory_timeline
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119751
Approved by: https://github.com/aaronenyeshi
Seeing the error below for c10d tests when running on 1 GPU. Adding a skip when there are insufficient GPUs.
```
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
referring to https://github.com/pytorch/pytorch/pull/84980
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119938
Approved by: https://github.com/eqy, https://github.com/fegin
Summary: `ExportedProgram` is an artifact produced by torch.export, containing the graph that is exported, along with other attributes about the original program such as the graph signature, state dict, and constants. One slightly confusing thing that users run into is that they treat the `ExportedProgram` as a `torch.nn.Module`, since the object is callable. However, as we do not plan to support all features that `torch.nn.Module`s have, like hooks, we want to create a distinction between it and the `ExportedProgram` by removing the `__call__` method. Instead users can create a proper `torch.nn.Module` through `exported_program.module()` and use that as a callable.
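A small sketch of the recommended pattern after this change (assuming the standard `torch.export` entry point):
```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

ep = torch.export.export(M(), (torch.randn(2),))
mod = ep.module()            # materialize a proper torch.nn.Module from the ExportedProgram
print(mod(torch.randn(2)))   # call the module, not the ExportedProgram itself
```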
Test Plan: CI
Differential Revision: D53075378
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119466
Approved by: https://github.com/zhxchen17, https://github.com/thiagocrepaldi
By changing runtime symbolic asserts to use assert_scalar, the asserts can call into `expect_true` and modify the shape env so that we can run through the traced graph module with fake tensors. With assert_async, the asserts only get hit during runtime, but that means if we run the graph module with fake tensors, the asserts will not affect the shape env, so later data-dependent calls to the fake tensors may result in GuardOnDataDependentSymNode errors.
https://github.com/pytorch/pytorch/issues/119587
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119608
Approved by: https://github.com/ezyang
nvcc flag `--generate-dependencies-with-compile` doesn't seem to be supported by `sccache` for now. Builds with this flag enabled will not benefit from sccache.
This PR adds an environment variable that allows users to set this flag and skip those nvcc dependencies to speed up their build with compiler caches. If everything is a "fresh build" in CI, we don't care if there are unnecessary recompiles during incremental builds.
related: https://github.com/pytorch/pytorch/pull/49344
- [ ] todo: raise an issue to sccache
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119936
Approved by: https://github.com/ezyang
Fixes #118862
If libtorch is included multiple times in different sub-folders, linking caffe2::mkl may incur errors like
```
Cannot specify link libraries for target "caffe2::mkl" which is not built
by this project.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119945
Approved by: https://github.com/ezyang
Timestamps the generated embedding indices. Moves the old indices to an `archived/` folder and then uploads the index to a `latest/` folder. There will be a short period in between these operations where there is no index in `latest/`. To handle this case, any workflow fetching the index (such as the retriever) should use a retry with backoff when copying from S3.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119955
Approved by: https://github.com/huydhn
Summary: In some cases the outputs of `softmax` are so small that they are below float16 precision. These values are represented as 0 in float16 and result in `-inf` when log is applied. According to [Wikipedia](https://en.wikipedia.org/wiki/Half-precision_floating-point_format#Exponent_encoding), the minimum strictly positive (subnormal) value is 2^−24 ≈ 5.9605 × 10^−8. Therefore, we add 6 x 10^-8 to the output of softmax to avoid the numerical issue.
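A quick CPU-only sketch of the underflow being worked around (fp32 softmax rounded to fp16 to mimic the half-precision path; numbers are illustrative, not from the Vulkan tests):
```python
import torch

# exp(-30) ~ 9.4e-14 is below the smallest positive fp16 subnormal (~5.96e-8),
# so that softmax output rounds to exactly 0 in half precision.
p = torch.softmax(torch.tensor([0.0, 30.0]), dim=0).half()
print(p)                             # tensor([0., 1.], dtype=torch.float16)
print(torch.log(p.float()))          # tensor([-inf, 0.])
print(torch.log(p.float() + 6e-8))   # finite: roughly [-16.6, 0.]
```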
Test Plan:
We add two tests:
- `log_softmax_underflow_exception` tests the log_softmax without adding epsilon to the output of softmax, so we expect to get nan or -inf. (**NOTE**: this test has passed on both devserver and on Android device, but failed on the `
fbsource//xplat/caffe2:vulkan_ops_testAndroid` test on CI. In this test, `log` of small numbers [even `log 0` shows output -88 instead of `-inf`](https://interncache-cco.fbcdn.net/v/t49.3276-7/379414752_342395058779076_6447867753374424757_n.txt?ccb=1-7&_nc_sid=ce8ad4&efg=eyJ1cmxnZW4iOiJwaHBfdXJsZ2VuX2NsaWVudC9pbnRlcm4vc2l0ZS94L3Rlc3RpbmZyYSJ9&_nc_ht=interncache-cco&oh=00_AfApTdId1WOHUqdoSTc66s6adnrQt1YS0NDT-LDppIvX0g&oe=65D0CC99). We cannot reproduce this error on device now, so we **DISABLE** this test for now to integrate into CI.)
- `log_softmax_underflow` tests the updated implementation of log_softmax, nan and -inf have been removed
## test on devserver
```
luwei@devbig984.prn1 /data/users/luwei/fbsource (9f6b78894)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*log_softmax_underflow*"
File changed: fbcode//caffe2/aten/src/ATen/test/vulkan_api_test.cpp
File changed: fbsource//xplat/caffe2/aten/src/ATen/test/vulkan_api_test.cpp
Buck UI: https://www.internalfb.com/buck2/baaaa683-60da-4dd8-95b9-6848fe1d7d74
Network: Up: 53KiB Down: 1.4MiB (reSessionID-9580ce4f-7e1e-4c65-8497-52443329b796)
Jobs completed: 6. Time elapsed: 24.2s.
Cache hits: 0%. Commands: 2 (cached: 0, remote: 1, local: 1)
BUILD SUCCEEDED
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *log_softmax_underflow*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from VulkanAPITest
[ DISABLED ] VulkanAPITest.DISABLED_log_softmax_underflow_exception
[ RUN ] VulkanAPITest.log_softmax_underflow
[ OK ] VulkanAPITest.log_softmax_underflow (169 ms)
[----------] 1 test from VulkanAPITest (169 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (169 ms total)
[ PASSED ] 1 test.
YOU HAVE 1 DISABLED TEST
```
full test results: P1184164670
```
[----------] 428 tests from VulkanAPITest (21974 ms total)
[----------] Global test environment tear-down
[==========] 428 tests from 1 test suite ran. (21974 ms total)
[ PASSED ] 427 tests.
[ SKIPPED ] 1 test, listed below:
[ SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
YOU HAVE 11 DISABLED TESTS
```
## test on device:
- build
```
[luwei@devbig984.prn1 /data/users/luwei/fbsource (82c91e8da)]$ buck2 build -c ndk.static_linking=true -c pt.enable_qpl=0 --target-platforms=ovr_config//platform/android:arm32-fbsource //xplat/caffe2:pt_vulkan_api_test_binAndroid --show-output
```
- push to device and run
```
[luwei@devbig984.prn1 /data/users/luwei/fbsource (82c91e8da)]$ adb shell /data/local/tmp/pt_vulkan_api_test_binAndroid --gtest_filter="*log_softmax_underflow*"
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *log_softmax_underflow*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from VulkanAPITest
[ DISABLED ] VulkanAPITest.DISABLED_log_softmax_underflow_exception
[ RUN ] VulkanAPITest.log_softmax_underflow
[ OK ] VulkanAPITest.log_softmax_underflow (292 ms)
[----------] 1 test from VulkanAPITest (293 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (294 ms total)
[ PASSED ] 1 test.
YOU HAVE 1 DISABLED TEST
```
Reviewed By: yipjustin
Differential Revision: D53694989
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119898
Approved by: https://github.com/jorgep31415
This would bundle PTXAS into a `bin` folder.
When compiling for Triton, define `TRITON_PTXAS_PATH` if `ptxas` is bundled with PyTorch. This is needed to make PyTorch compiled against CUDA-11.8 usable with an 11.8 driver, as Triton is bundled with the latest ptxas (CUDA-12.3 at the time of the PyTorch-2.2 release).
Needs 5c814e2527 to produce valid binary builds
Test plan:
- Create a dummy ptxas in the `torch/bin` folder and observe `torch.compile` fail with a backtrace in the Triton module.
- Run the following script (to be added to binary tests) against the CUDA-11.8 wheel:
```python
import torch
import triton
@torch.compile
def foo(x: torch.Tensor) -> torch.Tensor:
return torch.sin(x) + torch.cos(x)
x=torch.rand(3, 3, device="cuda")
print(foo(x))
# And check that CUDA versions match
cuda_version = torch.version.cuda
ptxas_version = triton.backends.nvidia.compiler.get_ptxas_version().decode("ascii")
assert cuda_version in ptxas_version, f"CUDA version mismatch: torch build with {cuda_version}, but Triton uses ptxs {ptxas_version}"
```
Fixes https://github.com/pytorch/pytorch/issues/119054
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119750
Approved by: https://github.com/jansel, https://github.com/atalman
Changes sharding to attempt to put all serial tests on as few shards as possible. Parallel tests are then distributed across all shards, with most of them likely ending up on the non-serial shards.
Example: 8 minutes of serial tests, 20 minutes of parallel tests, 2 proc per machine, 6 machines
-> 8 + 20/2 = 18 total minutes of tests
-> 18 / 6 machines = 3 min per machine
-> all serial tests should fit on 3 machines (3min, 3 min, 2min)
-> majority of parallel tests should go on last 4 machines, one of which is shared with the serial tests
Also moves serial tests to run first.
If I want to move to purely numbers-based sharding, this ensures that parallel tests run alongside other parallel tests as much as possible instead of interleaving serial and parallel tests (which decreases the effectiveness of parallelization), while also ensuring that test reordering is still mostly effective.
See 73e816ee80 for example logs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119078
Approved by: https://github.com/huydhn
This PR refactors the sharding cost model to do a more accurate
estimation of redistribute cost, including both collective latency and
communication time.
The previous cost model does not rescale the latency and communication
time, so the latency factor is too small to be counted, and in
the case of small tensors, multiple collectives are preferred over a
single collective, which is wrong.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119897
Approved by: https://github.com/tianyu-l
This PR adds a way to do gradient accumulation without collectives (i.e. reduce-scatter for FSDP and reduce-scatter/all-reduce for HSDP, though HSDP is not yet implemented). Since the `no_sync()` context manager has received some feedback, we simply define a method on the module to set whether the module requires gradient synchronization or not, where this method can recurse or not.
```
# Before with `no_sync()`:
with fsdp_model.no_sync() if not is_last_microbatch else contextlib.nullcontext():
# Forward/backward
# After with a setter:
fsdp_model.set_requires_gradient_sync(not is_last_microbatch)
# Forward/backward
```
Having the method be able to recurse or not also gives some flexibility. For example, some large modules can still reduce-scatter, while some smaller modules can avoid it to save communication bandwidth:
```
fsdp_modules_to_reduce_scatter: Set[nn.Module] = ...
for module in fsdp_model.modules():
if isinstance(module, FSDP) and module not in fsdp_modules_to_reduce_scatter:
module.set_requires_gradient_sync(not is_last_microbatch)
# Forward/backward
```
(Separately, we may expose a helper for `return [module for module in model.modules() if isinstance(module, FSDP)]`.)
---
To show the spirit of this API choice, I also included `set_requires_all_reduce` that would give us the ability to only reduce-scatter but not all-reduce for HSDP (originally from the MiCS paper). If we want to flexibly support heterogeneous sharding where FSDP is applied to some modules and HSDP to others in the same model, then having a module-level method that has the option to not recurse makes sense to me.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118298
Approved by: https://github.com/wconstab, https://github.com/wanchaol
ghstack dependencies: #119550, #118136, #118223, #118755, #119825
Summary:
as title.
The following APIs are logged:
- capture_preautograd_graph
- torch._export.aot_compile
- external usage of _export_to_torch_ir (AOTInductor, Pippy)
- constraints API
- public use of torch._dynamo.export
Test Plan: CI
Differential Revision: D53735599
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119848
Approved by: https://github.com/suo
So far it's been only testing the legacy conversion, rather than the one actually used when `at::Half` is constructed.
Test `fp16` to `fp32` for the whole range of its 65536 values, though skip NaN comparisons, as different algorithms are not guaranteed to yield identical NaN representations and they are different anyway.
Do a small code cleanup: remove extraneous semicolons as well as a named namespace inside an unnamed one.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119892
Approved by: https://github.com/kit1980
Replaces `view_func()` closures with a reified `ViewFunc` data structure. Codegen generates a `ViewFunc` subclass for each view op (e.g. `NarrowViewFunc`) containing state needed to reconstruct the view. The `ViewFunc` API allows for querying and hot-swapping any `SymInt`s or `Tensors` in the state through `get_symints()` / `get_tensors()` / `clone_and_set()`, which will be essential for fake-ification later on.
```cpp
/// Base class for view functions, providing reapplication of a view on a new base.
/// Each view op should get a codegenerated subclass of this class containing
/// any state needed to reconstruct the view. The class also provides convenience
/// accessors for saved SymInts / tensor state. This is useful for e.g. fake-ification,
/// where we want to use symbolic values or fake tensors instead.
struct TORCH_API ViewFunc {
virtual ~ViewFunc() {}
/// Returns any SymInts in the saved state.
virtual std::vector<c10::SymInt> get_symints() const { return {}; }
/// Returns the number of SymInts in the saved state.
virtual size_t num_symints() const { return 0; }
/// Returns any tensors in the saved state.
virtual std::vector<at::Tensor> get_tensors() const { return {}; }
/// Returns the number of tensors in the saved state.
virtual size_t num_tensors() const { return 0; }
/// Reapplies the view on the given base using the saved state.
virtual at::Tensor operator()(const at::Tensor&) const = 0;
/// Returns a clone of this ViewFunc, optionally with the specified saved state.
virtual std::unique_ptr<ViewFunc> clone_and_set(
std::optional<std::vector<c10::SymInt>> = c10::nullopt,
std::optional<std::vector<at::Tensor>> = c10::nullopt) const = 0;
protected:
/// Sets the values of any SymInts in the saved state. The input vector size must
/// match the number of SymInts in the saved state (i.e. the size of the list
/// returned by get_symints()).
virtual void set_symints(std::vector<c10::SymInt>) {}
/// Sets the values of any Tensors in the saved state. The input vector size must
/// match the number of Tensors in the saved state (i.e. the size of the list
/// returned by get_tensors()).
virtual void set_tensors(std::vector<at::Tensor>) {}
};
```
New codegen files:
* `torch/csrc/autograd/generated/ViewFunc.h`
* `torch/csrc/autograd/generated/ViewFuncs.cpp`
The templates for these also contains impls for `ChainedViewFunc` and `ErroringViewFunc` which are used in a few places within autograd.
Example codegen for `slice.Tensor`:
```cpp
// torch/csrc/autograd/generated/ViewFuncs.h
#define SLICE_TENSOR_VIEW_FUNC_AVAILABLE
struct SliceTensorViewFunc : public torch::autograd::ViewFunc {
SliceTensorViewFunc(int64_t dim, c10::optional<c10::SymInt> start, c10::optional<c10::SymInt> end, c10::SymInt step) : dim(dim), start(start), end(end), step(step)
{};
virtual ~SliceTensorViewFunc() override {};
virtual std::vector<c10::SymInt> get_symints() const override;
virtual size_t num_symints() const override;
virtual std::vector<at::Tensor> get_tensors() const override;
virtual size_t num_tensors() const override;
virtual at::Tensor operator()(const at::Tensor&) const override;
virtual std::unique_ptr<ViewFunc> clone_and_set(
std::optional<std::vector<c10::SymInt>> = c10::nullopt,
std::optional<std::vector<at::Tensor>> = c10::nullopt) const override;
protected:
virtual void set_symints(std::vector<c10::SymInt>) override;
virtual void set_tensors(std::vector<at::Tensor>) override;
private:
int64_t dim;
c10::optional<c10::SymInt> start;
c10::optional<c10::SymInt> end;
c10::SymInt step;
};
...
// torch/csrc/autograd/generated/ViewFuncs.cpp
std::vector<c10::SymInt> SliceTensorViewFunc::get_symints() const {
::std::vector<c10::SymInt> symints;
symints.reserve((start.has_value() ? 1 : 0) + (end.has_value() ? 1 : 0) + 1);
if(start.has_value()) symints.insert(symints.end(), *(start));
if(end.has_value()) symints.insert(symints.end(), *(end));
symints.push_back(step);
return symints;
}
size_t SliceTensorViewFunc::num_symints() const {
return static_cast<size_t>((start.has_value() ? 1 : 0) + (end.has_value() ? 1 : 0) + 1);
}
void SliceTensorViewFunc::set_symints(std::vector<c10::SymInt> symints) {
TORCH_INTERNAL_ASSERT(symints.size() == num_symints());
auto i = 0;
if(start.has_value()) start = symints[i];
i += (start.has_value() ? 1 : 0);
if(end.has_value()) end = symints[i];
i += (end.has_value() ? 1 : 0);
step = symints[i];
}
std::vector<at::Tensor> SliceTensorViewFunc::get_tensors() const {
::std::vector<at::Tensor> tensors;
return tensors;
}
size_t SliceTensorViewFunc::num_tensors() const {
return static_cast<size_t>(0);
}
void SliceTensorViewFunc::set_tensors(std::vector<at::Tensor> tensors) {
TORCH_INTERNAL_ASSERT(tensors.size() == num_tensors());
}
at::Tensor SliceTensorViewFunc::operator()(const at::Tensor& input_base) const {
return at::_ops::slice_Tensor::call(input_base, dim, start, end, step);
}
std::unique_ptr<ViewFunc> SliceTensorViewFunc::clone_and_set(
std::optional<std::vector<c10::SymInt>> symints,
std::optional<std::vector<at::Tensor>> tensors) const {
auto output = std::make_unique<SliceTensorViewFunc>(dim, start, end, step);
if (symints.has_value()) {
output->set_symints(std::move(*(symints)));
}
if (tensors.has_value()) {
output->set_tensors(std::move(*(tensors)));
}
return output;
}
```
The `_view_func()` / `_view_func_unsafe()` methods now accept two additional (optional) args for `symint_visitor_fn` / `tensor_visitor_fn`. If these are defined, they are expected to be python callables that operate on a single SymInt / tensor and return a new one. This allows for the hot-swapping needed during fake-ification.
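A rough sketch of what replay with visitor callables could look like, based only on the description above (not the PR's test code, and the exact Python signature may differ):
```python
import torch

base = torch.randn(4, 6, requires_grad=True)
view = base.narrow(1, 1, 3)   # the narrow's dim/start/length live in its ViewFunc state

new_base = torch.randn(4, 6, requires_grad=True)
replayed = view._view_func(
    new_base,
    lambda s: s,   # symint_visitor_fn: could substitute symbolic / fake SymInts here
    lambda t: t,   # tensor_visitor_fn: could substitute fake tensors here
)
print(replayed.shape)  # torch.Size([4, 3])
```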
For testing, there are extensive pre-existing tests, and I added a test to ensure that hot-swapping functions correctly.
```sh
python test/test_autograd.py -k test_view_func_replay
python test/test_ops.py -k test_view_replay
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118404
Approved by: https://github.com/ezyang
Fixes https://github.com/pytorch/pytorch/issues/119436
<s>In essence we need to ensure aliases are run in separate foreach kernels so that they are ordered correctly. Previously, aliases could end up in the same kernel which creates weird scheduling dependencies.</s>
There was a bug in cycle detection/can_fuse which was creating cycles when more than two aliases were used in foreach nodes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119508
Approved by: https://github.com/jansel
When building guards that went through a property, we were analyzing the property using getattr_static, but the guard wasn't built using getattr_static, so if the property was "unusual" it generated misbehaving code that referenced a non-existent `__closure__` field.
Fixes #118786
Note that after this change some of the referenced tests still fail, with a different error, but they get further.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119719
Approved by: https://github.com/oulgen
This PR substantially improves the error reporting for GuardOnDataDependentSymNode in the following ways:
* The GuardOnDataDependentSymNode error message is rewritten for clarity, and contains a link to a new doc on how to resolve these issues https://docs.google.com/document/d/1HSuTTVvYH1pTew89Rtpeu84Ht3nQEFTYhAX3Ypa_xJs/edit#heading=h.44gwi83jepaj
* We support `TORCHDYNAMO_EXTENDED_DEBUG_CREATE_SYMBOL`, which lets you specify a symbol name to get detailed debug information when it is logged (e.g., the full backtrace and user backtrace of the symbol creation). The exact symbols that you may be interested in are now explicitly spelled out in the error message.
* We support `TORCHDYNAMO_EXTENDED_DEBUG_CPP` which enables reporting C++ backtraces whenever we would report a backtrace.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119412
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #117356
Some operations, such as GEMMs, could be implemented using more than one library or more than one technique. For example, a GEMM could be implemented for CUDA or ROCm using either the blas or blasLt libraries. Further, ROCm's rocblas and hipblaslt libraries allow the user to query for all possible algorithms and then choose one. How does one know which implementation is the fastest and should be chosen? That's what TunableOp provides.
See the README.md for additional details.
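As a usage sketch (the environment variable names follow the TunableOp README mentioned above; treat them as assumptions if your build differs):
```python
import os

# Assumed env var names: enable TunableOp and online tuning before any GEMMs run.
os.environ["PYTORCH_TUNABLEOP_ENABLED"] = "1"
os.environ["PYTORCH_TUNABLEOP_TUNING"] = "1"

import torch

a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")
c = a @ b  # the first call tunes over the available GEMM implementations and caches the choice
```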
TunableOp was ported from onnxruntime starting from commit 08dce54266. The content was significantly modified and reorganized for use within PyTorch. The files copied and their approximate new names or source content location within aten/src/ATen/cuda/tunable include the following:
- onnxruntime/core/framework/tunable.h -> Tunable.h
- onnxruntime/core/framework/tuning_context.h -> Tunable.h
- onnxruntime/core/framework/tuning_context_impl.h -> Tunable.cpp
- onnxruntime/core/providers/rocm/tunable/gemm_common.h -> GemmCommon.h
- onnxruntime/core/providers/rocm/tunable/gemm_hipblaslt.h -> GemmHipblaslt.h
- onnxruntime/core/providers/rocm/tunable/gemm_rocblas.h -> GemmRocblas.h
- onnxruntime/core/providers/rocm/tunable/gemm_tunable.cuh -> TunableGemm.h
- onnxruntime/core/providers/rocm/tunable/rocm_tuning_context.cc -> Tunable.cpp
- onnxruntime/core/providers/rocm/tunable/util.h -> StreamTimer.h
- onnxruntime/core/providers/rocm/tunable/util.cc -> StreamTimer.cpp
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114894
Approved by: https://github.com/xw285cornell, https://github.com/jianyuh
Before:
```
[2024-02-13 19:34:50,591] [0/0] torch._dynamo.guards.__guards: [DEBUG] GUARDS:
[2024-02-13 19:34:50,591] [0/0] torch._dynamo.guards.__guards: [DEBUG] ___check_type_id(L['x'], 70049616) # assert x.shape[0] > 2 # b.py:5 in f
[2024-02-13 19:34:50,592] [0/0] torch._dynamo.guards.__guards: [DEBUG] hasattr(L['x'], '_dynamo_dynamic_indices') == False # assert x.shape[0] > 2 # b.py:5 in f
```
After this change, the logs look like this:
```
V0214 07:00:49.354000 139646045393920 torch/_dynamo/guards.py:1023 [0/0] GUARDS:
V0214 07:00:49.354000 139646045393920 torch/_dynamo/guards.py:1039 [0/0] ___check_type_id(L['x'], 70050096) # assert x.shape[0] > 2 # b.py:5 in f
V0214 07:00:49.355000 139646045393920 torch/_dynamo/guards.py:1039 [0/0] hasattr(L['x'], '_dynamo_dynamic_indices') == False # assert x.shape[0] > 2 # b.py:5 in f
```
The main differences from what we had before:
* We don't print DEBUG/INFO/WARNING, instead, we only print a single character. DEBUG, somewhat oddly, maps to V, because it corresponds to glog VERBOSE
* The year is omitted, and a more compact representation for date/month is adopted. Somewhat perplexingly, six digits are allocated for the nanoseconds, even though Python typically doesn't have that level of resolution
* The thread ID is included (in a containerized environment, this thread id will be typically much lower)
* Instead of using the module name, we give a filepath, as well as the line the log message was emitted from. I think the line number is a nice touch and improvement over our old logs, but one downside is we do lose the artifact name in the log message, in case anyone was grepping for that.
* I chose to move the compile id prefix to the very end so as to keep a uniform layout before it, but I do think there are benefits to having it before the filename
Meta only: This format was reverse engineered off of 6b8bbe3b53/supervisor/logging.py and https://www.internalfb.com/code/fbsource/[e6728305a48540110f2bdba198aa74eee47290f9]/fbcode/tupperware/front_end/log_reader/filter/StreamingLogLineFilter.cpp?lines=105-114
Now, I think this may be slightly controversial, but I have chosen to apply this format *by default* in OSS. My reasoning is that many PT2 developers work with the logs in OSS, and keeping the format identical to what we run in prod will make it easier for these skills to transfer.
The non-negotiable portion of the new format is "V0213 19:28:32"; the date string is expected to be in exactly this form or Tupperware will fail to parse it as a date.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119869
Approved by: https://github.com/oulgen, https://github.com/mlazos, https://github.com/Skylion007
As titled. This is a followup to PR #118917 on nll_loss_forward. It also fixes an issue in it: the forward function produces two return values, the loss `result` and the `total_weight`. The previous PR didn't explicitly deal with the `total_weight` part.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119256
Approved by: https://github.com/wanchaol
Summary: When we deserialize nn_module_stack, sometimes the module no longer exists in the python environment so we cannot deserialize it back into the python type and instead it's kept as a string. This causes downstream failures when retracing due to one of our checks in export. This diff just bypasses the check.
Test Plan: CI
Reviewed By: chakriu
Differential Revision: D53527706
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119753
Approved by: https://github.com/zhxchen17
Summary:
Previously, we just stored the char pointer in the entry; the string is a temporary object and will be destructed by the time we want to dump/access it.
A quick fix is to store a copy of the string, without changing the upstream char*.
An alternative is to change every profilingTitle into std::string; this, however, would need a comprehensive overhaul of the code up to the c10d::work layer above workNCCL, RecordFunction, etc.
We chose the first option for this change.
Resolves #119808
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119837
Approved by: https://github.com/zdevito, https://github.com/wconstab
Summary:
When, during an `ExternKernel.realize_input` call, the underlying `ExternKernel.convert_to_reinterpret_view` fails, we currently fall back to `cls.copy_input` here:
31e59766e7/torch/_inductor/ir.py (L3805-L3816)
This creates a `TensorBox(StorageBox(...))` wrapped output, which causes a problem for this assertion:
31e59766e7/torch/_inductor/ir.py (L3479)
Here we add special-case handling to unwrap `x` recursively.
Test Plan:
This local repro:
```
@torch.compile()
def f(a, b, mat1, mat2):
    bias = torch.bmm(a + 3.14, b).permute(0, 2, 1).reshape(3992, -1)
    return torch.addmm(bias, mat1, mat2)

f(
    torch.randn(3992, 20, 40).cuda(),
    torch.randn(3992, 40, 192).cuda(),
    torch.empty(3992, 1024).cuda(),
    torch.empty(1024, 3840).cuda(),
)
```
with this line:
690f54b0f5/torch/_inductor/fx_passes/post_grad.py (L650)
changed to `if cond(*args, **kwargs):` fails before and succeeds after this PR.
Differential Revision: D53743146
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119867
Approved by: https://github.com/xw285cornell
As described in [this talk](https://www.youtube.com/watch?v=I95KmF6KSIA) and [this repo](https://github.com/osalpekar/llm-target-determinator), we are experimenting with using CodeLlama-powered information retrieval for target determination.
The idea is that we create embeddings for PyTorch test functions, and store this index in S3. Then when a new PR comes in, we create embedding(s) for that PR, compare them to the index of test embeddings, and run only the most relevant tests.
This PR creates a workflow that does the indexing part (creating embeddings for functions and store in S3). All the logic for running the indexer is in [osalpekar/llm-target-determinator](https://github.com/osalpekar/llm-target-determinator). This workflow just checks out the relevant repos, installs the dependencies, runs the torchrun command to trigger indexing, and uploads the artifacts to S3.
Co-authored-by: Catherine Lee <csl@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118824
Approved by: https://github.com/izaitsevfb, https://github.com/huydhn
Summary:
Fix for a case where the --run-path option fails to exit if the script exits with a non-error status code.
When there is an error exit code, run-path correctly detects the error and fails when calling spawn.join(). However, for the non-error case, the current behavior is to check the return value of the operation; the fix is to return None so that our MP code detects an exit.
Test Plan:
cat /tmp/script.py
~~~
import sys

def main():
    exit_code = 1
    if len(sys.argv) > 1:
        exit_code = int(sys.argv[1])
    sys.exit(exit_code)

if __name__ == "__main__":
    main()
~~~
Case of exit code with 0 (prior behavior - never exits):
torchrun --run-path /tmp/script.py 0
~~~
[2024-02-12 09:20:57,523] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[2024-02-12 09:20:58,980] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
(conda:pytorch) ➜ workspace echo $?
0
~~~
Existing behavior for non-zero exit code still works:
torchrun --run-path /tmp/script.py
~~~
(conda:pytorch) ➜ workspace torchrun --run-path /tmp/script.py
[2024-02-12 09:16:20,667] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[2024-02-12 09:16:22,197] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 64668) of fn: run_script_path (start_method: spawn)
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR] Traceback (most recent call last):
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR] File "/Users/kurman/workspace/pytorch/torch/distributed/elastic/multiprocessing/api.py", line 441, in _poll
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR] self._pc.join(-1)
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR] File "/Users/kurman/workspace/pytorch/torch/multiprocessing/spawn.py", line 177, in join
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR] raise ProcessExitedException(
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR] torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with exit code 1
Traceback (most recent call last):
File "/Users/kurman/miniconda3/envs/pytorch/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch', 'console_scripts', 'torchrun')())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/kurman/workspace/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/Users/kurman/workspace/pytorch/torch/distributed/run.py", line 812, in main
run(args)
File "/Users/kurman/workspace/pytorch/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/Users/kurman/workspace/pytorch/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/kurman/workspace/pytorch/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
run_script_path FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-02-12_09:16:25
host : kurman-mbp.dhcp.thefacebook.com
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 64668)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
~~~
Differential Revision: D53653874
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119697
Approved by: https://github.com/wconstab
Meta registration wrongly assumes 4D inputs, while the underlying op allows 3D inputs for the `mha_varlen_fwd()` case.
Testing: I added `detach()`es so the NJT test `test_sdpa_compile()` won't fail for a view-related reason. It should pass now with this fix.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119812
Approved by: https://github.com/drisspg
# Summary
Initially reported in https://github.com/pytorch/pytorch/issues/119320
I found that by updating this function the nan values went away. I then created a godbolt link to try and highlight the difference between the two versions:
https://godbolt.org/z/3sKqEqn4M
However, they appear to always produce the same value as the nvcc version is varied, except that for some versions -inf is chosen and for others the correct subnormal is chosen... I am having a hard time finding an isolated test case for this but will keep working on it.
### Update:
I added printf statements to the version and indeed some values/*addr contain -0.0f. Hence why this update fixes the reported issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119577
Approved by: https://github.com/yifuwang
Summary:
This PR is a refactor of semi-structured sparsity support.
**deprecation**:
Before, `torch.sparse.to_sparse_semi_structured` had a kwarg param
`transposed=False`, which has been removed. This kwarg was unused and
now throws a deprecation warning.
Namely, I've taken the subclassing implementation that xFormers has
created and brought it over to PyTorch, as part of our plan to upstream
runtime 2:4 sparsity.
I've also copied over all the op support that Daniel implemented that
did not depend on the fast sparsification routines, into
`_sparse_semi_structured_ops.py`
With this subclass, all of our internal tests pass, as well as those in
xFormers.
The main change is that we now define a base subclass,
`SparseSemiStructuredTensor` that is inherited from for each of the
specific backends.
We can also now arbitrarily override the sparse dispatch table with
`_load_dispatch_table()`, the idea being that this is still general enough
that users don't need to modify PyTorch source code to get their model
working.
This also adds in padding support and stores alg_id and fuse_transpose
as flags on the tensor, instead of hardcoding them.
There still remains two components in xFormers that will need to be
ported over eventually:
- the autograd functions (`Sparsify24`, `Sparsify24_like`)
- fast sparsification routines that they rely on
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117302
Approved by: https://github.com/alexsamardzic, https://github.com/HDCharles
Otherwise, at least on macOS, builds are littered with:
```
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/DeviceAccelerator.h:6:
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/MTIAHooksInterface.h:23:11: warning: '~MTIAHooksInterface' overrides a destructor but is not marked 'override' [-Winconsistent-missing-destructor-override]
virtual ~MTIAHooksInterface() = default;
^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/CUDAHooksInterface.h:65:11: warning: '~CUDAHooksInterface' overrides a destructor but is not marked 'override' [-Winconsistent-missing-destructor-override]
virtual ~CUDAHooksInterface() = default;
^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/AcceleratorHooksInterface.h:15:11: note: overridden virtual function is here
virtual ~AcceleratorHooksInterface() = default;
^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/MPSHooksInterface.h:21:11: warning: '~MPSHooksInterface' overrides a destructor but is not marked 'override' [-Winconsistent-missing-destructor-override]
virtual ~MPSHooksInterface() = default;
^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/AcceleratorHooksInterface.h:15:11: note: overridden virtual function is here
virtual ~AcceleratorHooksInterface() = default;
^
```
Likely introduced by https://github.com/pytorch/pytorch/pull/119329
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119656
Approved by: https://github.com/Skylion007
Improves the performance of inductor when searching large graphs for potential fusions.
Also adds some direct unit tests of find_independent_subset_greedy() to ensure that the rewrite didn't break behavior.
Fixes#98467
Previously find_independent_subset_greedy() was recursive and the example from the issue would cause it to blow out the stack. This changes it to be iterative and also caches some of the computed dependencies (it can't cache all of them because the caller is allowed to change the graph during the iteration).
Fusion is still slow - but at least finishes.
After this change the example given in #98467 has the following backend timings (on one particular CPU):
eager timing: 3m:23s
aot_eager timing: 4m:12s
inductor timing: 22m:24s
Possible future work to improve this further:
1. In dynamo, limit the amount of inlining allowed before falling back to a graph break. This test ends up tracing through 483k bytecodes while generating the graph.
2. In inductor have a limit so we don't exhaustively search the graph for fusion possibilities.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118324
Approved by: https://github.com/oulgen
By just calling `std_mps` and `mean` in sequence.
Move the `var_mean` decomp to `ReduceOps.mm`, as it should be faster to skip dispatching to Python, which one can validate by running the following script:
```python
from timeit import default_timer
import torch
from torch.utils.benchmark import Measurement, Timer
def bench_var_mean(
    m, n, k,
    dtype=torch.float32,
    device: str = "cpu",
) -> Measurement:
    setup = f"""
    x = torch.rand({m}, {n}, {k}, dtype={dtype}, device="{device}")
    """
    t = Timer(
        stmt="torch.var_mean(x, dim=1)", setup=setup, language="python", timer=default_timer
    )
    return t.blocked_autorange()

for x in [100, 1000]:
    rc = bench_var_mean(1000, x, 100, device="mps")
    print(f"{x:5} : {rc.mean*1e6:.2f} usec")
```
which before the change reports 681 and 1268 usec and after the change 668 and 684 usec (which probably means that the GPU is not saturated, but the overhead from switching between the native and interpretable runtimes is shorter).
Fixes https://github.com/pytorch/pytorch/issues/119663
TODOs:
- Refactor the codebase and implement proper composite function (that must be faster)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119777
Approved by: https://github.com/albanD
Tune the grid and block sizes for ROCm. Add a contig kernel separate from aligned+contig.
Verified new performance using pytorch/benchmarks/operator_benchmark.
`python -m pt.cat_test --device=cuda --tag-filter all`
On MI200 this improved performance by 4% on average, and on MI300 by 14%.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118685
Approved by: https://github.com/malfet
Summary:
auto& entry = entries_.at(*id % max_entries_);
entry = entries_.at(*id % max_entries_);
The above code has the unintended consequence of invoking the copy assignment of the entry object, since the reference itself cannot be re-bound. What could also cause the crash is that the entry reference may become invalid if entries_ is resized by another thread, which could result in a 'copy to a garbage location'. The fix is to use a pointer, which can be re-assigned after re-acquiring the lock.
Tests: python test/distributed/test_c10d_nccl.py NCCLTraceTest
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119748
Approved by: https://github.com/wconstab, https://github.com/fegin
Match FxGraphDrawer compat constructor signature to avoid the following failure when `pydot` is not installed:
```
File "/pytorch/torch/_functorch/partitioners.py", line 933, in draw_graph
g = graph_drawer.FxGraphDrawer(
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
TypeError: __init__() got an unexpected keyword argument 'dot_graph_shape'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119767
Approved by: https://github.com/eellison
This PR adds mixed precision configured via `MixedPrecisionPolicy`.
- By default (`cast_forward_inputs=True`), each FSDP module will cast forward floating-point input tensors to `param_dtype` if specified. If the user wants to own the cast, then the user can disable it by passing `False`.
- Symmetrically, by default (`output_dtype=None`) each FSDP module will not cast the forward output. If the user wants to customize the output dtype, then the user can pass a `torch.dtype`.
- `param_dtype` configures the unsharded parameters' dtype for forward/backward computation and hence the all-gather dtype.
- `reduce_dtype` configures the gradient reduction dtype. If `reduce_dtype=None` and `param_dtype is not None`, then `reduce_dtype` inherits from `param_dtype` for simplicity.
We test against a manually implemented reference implementation instead of comparing against existing FSDP since the comparison is more direct to what we want to test.
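A hedged configuration sketch based on the options above; the import path and the `mp_policy` keyword are assumptions about the composable FSDP API this stack adds, and the sharding call itself needs an initialized process group / device mesh:
```python
import torch
import torch.distributed as dist
from torch.distributed._composable.fsdp import fully_shard, MixedPrecisionPolicy

# bf16 compute / all-gather, fp32 gradient reduction; FSDP casts forward inputs for us,
# and forward outputs are left uncast (output_dtype=None).
mp_policy = MixedPrecisionPolicy(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.float32,
    output_dtype=None,
    cast_forward_inputs=True,
)

model = torch.nn.Transformer()
if dist.is_initialized():  # fully_shard requires an initialized process group / device mesh
    fully_shard(model, mp_policy=mp_policy)
```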
---
**Overhead benchmarks to inform design**
The dilemma is as follows:
- The common path for FSDP is bf16 parameter mixed precision, where we cast sharded parameters from fp32 to bf16 before all-gathering them.
- The baseline implementation is to `torch._foreach_copy_` the sharded parameters to the flat `all_gather_input`, which gets passed to `dist.all_gather_into_tensor`.
- The baseline incurs 1 extra fp32 read and 1 extra bf16 write per parameter because `_foreach_copy` takes the slow path, calling `copy_` in a loop, and `copy_` calls `dst.copy_(src.to(bf16))` where `dst` is bf16 and `src` is fp32.
- These `copy_` calls stay in C++ and do not require calling `at::as_strided`.
- The issue with this baseline implementation is that it requires knowing that all parameters in the group will be cast from fp32 to bf16 to do this `_foreach_copy_` from fp32 sources to a bf16 destination.
- We want per-parameter FSDP to support mixed dtype all-gathers, which would involve different parameters providing different dtype all-gather inputs and viewing them as uint8 for a combined flat all-gather input, where this viewing-as-uint8 step is only needed in the mixed dtype case.
- However, this incurs more CPU overhead, so we want to investigate this in more detail.
We consider 150 `nn.Parameter`s with shapes taken from an internal model (where the shapes only affect the copy bandwidth, not the CPU overhead). We focus on world size 128 first. We consider two experiments: (1) run the copy-in with no head start, allowing CPU boundedness affect GPU time, and (2) run the copy-in with a CPU head start, removing CPU overhead from affecting GPU time.
No head start:
- Baseline `torch._foreach_copy_`: 0.525 ms CPU; 0.528 ms GPU
- `.to(bf16)` before `torch._foreach_copy_`: 0.828 ms CPU; 0.836 ms GPU
- `.to(bf16).view(uint8)` before `torch._foreach_copy_`: 0.933 ms CPU; 0.937 ms GPU
Head start (removing CPU boundedness from GPU times):
- Baseline `torch._foreach_copy_`: 0.393 ms GPU
- `.to(bf16)` before `torch._foreach_copy_`: 0.403 ms GPU
- `.to(bf16).view(uint8)` before `torch._foreach_copy_`: 0.403 ms GPU
Some other interesting notes:
- Constructing a set of all all-gather input dtypes: ~0.015 ms -- this would be the overhead cost of checking whether we need to view as uint8 (i.e. whether we have mixed dtype); alternatively, we could always view as uint8 (but that loses the mixed precision policy info from the profiler trace)
- Changing from `[t.to(bf16).view(uint8) for t in ts]` to two list comprehensions like `[t.to(bf16) for t in ts]; [t.view(uint8) for t in ts]` actually reduces CPU overhead 🤔 (by ~0.04 ms)
We see that the main difference is just CPU overhead; the GPU times are almost the same. (Actually, sweeping over world sizes 8, 16, 32, 64, we do see a difference in GPU time inversely proportional to the world size, as expected since smaller world sizes copy more data. However, even at world size 8, the difference is only 0.407 ms vs. 0.445 ms GPU time.) Note though that the CPU overhead differences are exacerbated when the PyTorch profiler is turned on, and how much so seems to depend on the CPU capability.
Seeing these numbers, I am inclined to prefer to just incur the CPU overhead, especially given that if we want to support the mixed dtype case for fp8 all-gather, we will need to incur this anyway. If the CPU overhead becomes a problem on a real workload, then we will need to figure out options then, one possibility being `torch.compile`.
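For concreteness, a minimal sketch of the two copy-in variants compared above (hypothetical shapes, CPU tensors just to show the copy pattern; in FSDP these live on GPU):
```python
import torch

sharded_params = [torch.randn(256, 1024) for _ in range(150)]
numels = [p.numel() for p in sharded_params]

# Baseline: copy fp32 shards straight into a flat bf16 all-gather input
# (copy_ performs the dtype conversion on the slow per-tensor path).
all_gather_input = torch.empty(sum(numels), dtype=torch.bfloat16)
torch._foreach_copy_(list(all_gather_input.split(numels)),
                     [p.view(-1) for p in sharded_params])

# Mixed-dtype-friendly variant: cast each shard first, then view both sides as uint8 bytes.
byte_numels = [n * 2 for n in numels]  # bf16 is 2 bytes per element
flat_u8 = torch.empty(sum(byte_numels), dtype=torch.uint8)
casted = [p.to(torch.bfloat16) for p in sharded_params]
as_u8 = [t.view(torch.uint8).view(-1) for t in casted]
torch._foreach_copy_(list(flat_u8.split(byte_numels)), as_u8)
```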
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118223
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
ghstack dependencies: #119550, #118136
This PR adds tests for autograd (mainly backward hooks), memory, overlap, and frozen parameters.
- Autograd: unused forward output, unused forward module, non-tensor activations (common in internal models)
- Memory: expected GPU memory usage after init, forward, backward, and optimizer step
- Overlap: communication/computation overlap in forward and backward
- Frozen: expected reduce-scatter size, training parity
This PR adds some initial 2D (FSDP + TP) training and model state dict tests. The only change required for model sharded state dict is to make sure parameters are sharded before save and load.
This PR adds tests that `fully_shard` can use `torch.utils.checkpoint`, `_composable.checkpoint`, and `CheckpointWrapper` on a transformer.
(I squashed all of these into one PR now to save CI cost.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118136
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
ghstack dependencies: #119550
Fixes https://github.com/pytorch/pytorch/issues/117268; check this issue for background.
This PR does the following:
* Do not perform a replacement if the expression we're replacing the symbol with has a less refined value range than the original. There's a little bit of trickiness around the handling for values close to INT64_MAX; when checking if a range refines another, I *only* consider the range representable in 64-bit integers. This is enough to prevent us from doing a substitution like `i0 = 10 - i1`, but it appears to still let us do the other substitutions we like, such as `i0 = i1` or `i0 = 12 * i1`
* The test above is order dependent: if we assert an equality BEFORE we have refined a range, we might be willing to do the replacement because there isn't a meaningful range. This means that it's important to mark things as sizes, before you start doing other error checking. `split_with_sizes` is adjusted accordingly. It would be good to raise an error if you get the ordering wrong, but I leave this to future work.
* It turns out this is not enough to fix AOTAutograd, because we lose the size-ness of unbacked SymInts when AOTAutograd retraces the Dynamo graph. So update deferred runtime assert insertion to also insert size-ness and value ranges annotations. Note that, in principle, it shouldn't be necessary to explicitly do the latter; these should just show up as deferred runtime asserts. That's some extra refactoring for a later day.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117356
Approved by: https://github.com/lezcano
Summary: The codegen of `with torch.cuda._DeviceGuard` context manager in the Python wrapper code is implemented via `device_cm_stack: contextlib.ExitStack()`. As the context managers in the stack are `code.indent()`, this means that the whole stack is unindented at once on `device_cm_stack.close()`. This becomes problematic when attempting to codegen indented code (e.g., for control flow in Python and / or nested subgraph codegen-ing).
In this PR, we refactor the device guard codegen-ing in Python by replacing the `device_cm_stack` by explicit indent and unindent calls for entering and exiting the `with torch.cuda._DeviceGuard` context manager. This allows for nested device guard context managers and better aligns with other indented codegen-ing intertwined with it (e.g., for nested subgraph codegen-ing).
This is necessary for the upcoming support for `torch.cond` (and other control flow operators) in Inductor. Before that, the only change in the Python wrapper codegen is that the `return outputs` is now happening outside the `with torch.cuda._DeviceGuard` context manager.
Test Plan: CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119673
Approved by: https://github.com/peterbell10
This PR adds support for for-loop parsing and analysis. While doing so, I ran into some constant value and function name problems so I fixed them as well. Technically, it should be possible to break this into multiple PRs but since these are small, I'm bundling them together.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119730
Approved by: https://github.com/aakhundov
Summary:
In March 2019 D14468816 introduced some infra to mark tests as flaky
while still running them. In July 2019 D15797371 removed the last use of this
feature. Remove the related code as well.
Test Plan: ci
Reviewed By: mlogachev
Differential Revision: D50601204
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112007
Approved by: https://github.com/malfet
There's a bug when converting from TorchVariable to trace rule lookups: in some corner cases the DTensor.from_local call does not match the trace name rule lookup, resulting in a None lookup and falling back to UserFunctionVariable, which makes the tracing silently wrong by tracing into the DTensor.from_local function. Not exactly sure yet why the lookup failed.
This PR fixes the DTensor.from_local tracing to make sure that in every case we hit the InGraphFunctionVariable.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119659
Approved by: https://github.com/yifuwang
Summary:
This PR serves as a follow-up fix to address numerical correctness concerns identified in PR #118197: we should only wait on `AsyncCollectiveTensor`.
Without the change, we occasionally ran into the exception `AttributeError("'Tensor' object has no attribute 'wait'")`.
Test Plan:
**CI**:
Wait for the CI test
**Test with prod model**:
- Tested with models and no longer ran into the exception after checkpoint loading.
Differential Revision: D53680406
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119716
Approved by: https://github.com/fegin, https://github.com/Skylion007, https://github.com/wz337
Fixes #118990
The root cause is that `out_features` of Linear does not match `num_features` of BatchNorm, resulting in a shape mismatch while computing `fused_w` and `fused_b`. This can happen for linear-bn folding because the linear layer operates over the last dim, `(*, H_in)`, while the bn layer operates over the channel dim, `(N, C_in, H, W)`.
To preserve the shapes of the original linear weight and bias in linear-bn folding, check that the linear `out_features` matches the bn `num_features`. If they don't match, the bn `num_features` needs to be 1 to broadcast.
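For reference, a minimal sketch (not the exact code in the PR) of the shape check and the standard linear-bn folding math it protects:
```python
import torch

def can_fold_linear_bn(linear: torch.nn.Linear, bn: torch.nn.BatchNorm1d) -> bool:
    # Linear maps (*, in_features) -> (*, out_features), while BN normalizes the channel
    # dim, so the shapes only line up when out_features == num_features, or when
    # num_features == 1 and can broadcast.
    return linear.out_features == bn.num_features or bn.num_features == 1

def fold_linear_bn(linear, bn):
    # Standard eval-mode folding using the BN running statistics.
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    fused_w = linear.weight * scale.unsqueeze(-1)
    bias = linear.bias if linear.bias is not None else torch.zeros(linear.out_features)
    fused_b = (bias - bn.running_mean) * scale + bn.bias
    return fused_w, fused_b

lin, bn = torch.nn.Linear(8, 4), torch.nn.BatchNorm1d(4)
bn.eval()
assert can_fold_linear_bn(lin, bn)
w, b = fold_linear_bn(lin, bn)
x = torch.randn(2, 8)
assert torch.allclose(bn(lin(x)), torch.nn.functional.linear(x, w, b), atol=1e-5)
```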
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119264
Approved by: https://github.com/eellison
The former is only available on MacOS 14+, but at least on older MacOS versions it would raise an exception rather than returning a non-conjugated tensor
Preliminary step for enabling FFT ops (without it `ifft` would never work)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119669
Approved by: https://github.com/albanD
ghstack dependencies: #119681
Partially addresses https://github.com/pytorch/pytorch/issues/118785
This diff fixes three things:
1. Add get_function to FunctoolsPartialVariable; note that it will be available only if all args are constant, otherwise it would throw unimplemented in the call to asPythonConstant.
2. NamedTupleVariable takes its args unpacked rather than as a list, e.g. NamedTuple(a, b, c) vs NamedTuple([a, b, c]); fix that by specializing asProxy.
3. A call to create_arg from within create_proxy changes a Python NamedTuple to a function call node without associating an example value. Updated get_fake_values_from_nodes to handle this case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119435
Approved by: https://github.com/jansel, https://github.com/anijain2305
ghstack dependencies: #119314
I accidentally disabled this without realizing it. It turns out that
PYTORCH_TEST_WITH_INDUCTOR=1 implies PYTORCH_TEST_WITH_DYNAMO=1, which
activates skipIfTorchDynamo decorators.
Test Plan:
- wait for CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119693
Approved by: https://github.com/bdhirsh
`dv = at::empty_like(k)` and `dv = at::empty_like(v)` can be materially different, because `empty_like` tries to preserve the strides of the input when possible. So if `k` is contiguous but `v` is transposed, then before this PR, `dv` would be computed to be contiguous.
Alternatively, we could change the meta implementation of `aten._scaled_dot_product_flash_attention` to this:
```
grad_q = torch.empty_like(query.transpose(1, 2)).transpose(1, 2)
grad_k = torch.empty_like(key.transpose(1, 2)).transpose(1, 2)
grad_v = torch.empty_like(key.transpose(1, 2)).transpose(1, 2)
return grad_q, grad_k, grad_v
```
But (I think?) the logic in the sdpa backward impl was a typo.
I noticed this because changing the meta formula as above was enough to fix the issue with the `aot_eager` backend in this [link](https://github.com/pytorch/pytorch/issues/116935#issuecomment-1914310523).
A minimal repro that I made looks like this:
```
import torch
# in this repro, "grad_out" and "value" are transposed tensors,
# but "key" and "value" are contiguous
a = torch.randn(2, 513, 16, 64, dtype=torch.float16, device='cuda').transpose(1, 2)
b = torch.randn(2, 16, 513, 64, dtype=torch.float16, device='cuda')
c = torch.randn(2, 16, 513, 64, dtype=torch.float16, device='cuda')
d = torch.randn(2, 513, 16, 64, dtype=torch.float16, device='cuda').transpose(1, 2)
e = torch.randn(2, 16, 513, 64, dtype=torch.float16, device='cuda')
f = torch.randn(2, 16, 513, device='cuda')
g = None
h = None
i = 513
j = 513
k = 0.0
l = False
m = torch.tensor(1, dtype=torch.int64)
n = torch.tensor(1, dtype=torch.int64)
out1_ref, out2_ref, out3_ref = torch.ops.aten._scaled_dot_product_flash_attention_backward(a, b, c, d, e, f, g, h, i, j, k, l, m, n, scale=0.125)
from torch._meta_registrations import meta__scaled_dot_product_flash_backward
out1_test, out2_test, out3_test = meta__scaled_dot_product_flash_backward(a, b, c, d, e, f, g, h, i, j, k, l, m, n, scale=0.125)
# prints True True
print(out1_ref.is_contiguous())
print(out1_test.is_contiguous())
# prints True True
print(out2_ref.is_contiguous())
print(out2_test.is_contiguous())
# prints True False
print(out3_ref.is_contiguous())
print(out3_test.is_contiguous())
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119500
Approved by: https://github.com/drisspg, https://github.com/ezyang, https://github.com/Skylion007
Do not run test ConstantPropagation.CustomClassesCanBePropagated on a platform where QNNPACK is not supported.
For example, this test fails on M1 Mac because QNNPACK is not supported on M1 Mac:
[----------] 1 test from ConstantPropagation
[ RUN ] ConstantPropagation.CustomClassesCanBePropagated
unknown file: Failure
as described in more detail in issue #88613.
After the PR, the test passes successfully as below:
[----------] 1 test from ConstantPropagation
[ RUN ] ConstantPropagation.CustomClassesCanBePropagated
[ OK ] ConstantPropagation.CustomClassesCanBePropagated (0 ms)
[----------] 1 test from ConstantPropagation (0 ms total)
Fixes #88613
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119139
Approved by: https://github.com/jcaip
Recently we made it possible to serialize ExportedPrograms with fake parameters/buffers/etc.
The serialization regime was kind of whacky; basically we serialized a stub and reassembled the FakeTensor using metadata that we had stashed elsewhere in the Graph state.
This was bad for a few reasons:
- Storing the metadata separately from the actual serialized object caused situations where you could have one but not the other. An example case is if you had a FakeTensor contained inside a TorchBind object: there was no obvious place to store the metadata for this. This actually happens; TensorQueue in fbgemm does this.
- It created an annoying cycle: we had to deserialize the Graph's tensor metadata in order to deserialize (potentially faked) constants, but we need constants in order to deserialize the Graph.
This fixes all that. The basic idea is to patch the reducer function for FakeTensor at serialization time, and serialize a copy of the FakeTensor metadata. We already are policing BC for the TensorMeta schema struct so it's not a net increase in the BC surface.
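To illustrate the general "patch the reducer during serialization" pattern (a hedged sketch with stand-in classes; this is not the actual export serializer code):
```python
import copyreg
import pickle
from contextlib import contextmanager

class FakeTensorMeta:
    """Hypothetical stand-in for the serialized tensor metadata (shape/dtype only)."""
    def __init__(self, shape, dtype):
        self.shape, self.dtype = shape, dtype

@contextmanager
def reducer_patched(cls, to_meta):
    # Temporarily register a reducer so pickling `cls` instances emits metadata
    # instead of trying to pickle the (unpicklable) fake storage.
    copyreg.pickle(cls, lambda obj: (FakeTensorMeta, to_meta(obj)))
    try:
        yield
    finally:
        copyreg.dispatch_table.pop(cls, None)

class PretendFakeTensor:  # stand-in for FakeTensor in this sketch
    def __init__(self, shape, dtype):
        self.shape, self.dtype = shape, dtype

with reducer_patched(PretendFakeTensor, lambda t: (t.shape, t.dtype)):
    payload = pickle.dumps(PretendFakeTensor((2, 3), "float32"))
meta = pickle.loads(payload)
assert isinstance(meta, FakeTensorMeta) and meta.shape == (2, 3)
```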
As a bonus, I fixed a weird bug with torchbind tracing where we were accidentally reinterpreting a torch.ScriptObject as a torch.ScriptModule (which was the root cause of some weird behavior @bahuang was seeing last week).
Differential Revision: [D53601251](https://our.internmc.facebook.com/intern/diff/D53601251/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119531
Approved by: https://github.com/zhxchen17
Summary:
This PR tries to resolve issue #119215.
Basically, process group shutdown (and hence ncclCommAbort) is called in the destroy_process_group APIs for the corresponding PGs, and in the destructor of ProcessGroup we avoid calling abort/ncclCommAbort. Instead, the destructor just checks whether the user has already explicitly called destroy_process_group. If not, the destructor will log a warning and encourage/expect users to do so to clean up the PGs' resources.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119250
Approved by: https://github.com/minsii, https://github.com/kwen2501
25 min -> 17 + 13 min, which is still not as fast as I want it to be but I'll take it
Lintrunner provides some parallelism by default, but it's not perfect
Reducing fetch-depth from all to 1 further reduces time by ~2-3 minutes
From the non-clang job's logs:
```
2024-02-09T22:05:39.5297616Z Requirement already satisfied: PyYAML==6.0 in /opt/conda/lib/python3.11/site-packages (6.0)
2024-02-09T22:12:23.6164708Z Collecting black==23.12.1
```
I don't know why this part takes so long, maybe it's just buffering? The clang version doesn't show this issue.
See 5a750c8035
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119575
Approved by: https://github.com/huydhn, https://github.com/malfet
Summary: In dynamo tracing, `index()`'s implementation currently has the default begin index as `0` and the default end index as `-1`, which means that by default we're dropping the last element. Rather, we should be using `None`, which will ensure that the last element is also checked.
Test Plan: CI
Differential Revision: D53392287
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119151
Approved by: https://github.com/yanboliang
We'd like to get auto_functionalized to work with AOTInductor. To get
there, we decompose `output = auto_functionalized(inplace_op, ...)` into its
corresponding aten ops (clones + inplace_op) before the Inductor lowering phase.
This decomposition must happen at the end of the Inductor FX passes
because it introduces in-place operations.
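As a rough illustration of what "clones + inplace_op" means here (hypothetical op name; this is not Inductor's actual pass, just the shape of the rewrite):
```python
import torch

def my_inplace_op(x: torch.Tensor) -> None:
    x.add_(1)  # mutates its input in place, returns nothing

def decomposed_auto_functionalized(t: torch.Tensor):
    # output = auto_functionalized(my_inplace_op, x=t) is a single node whose output is a
    # List; the decomposition turns it back into a clone plus the in-place op, and the
    # List ends up unpacked into its individual tensors.
    x_clone = t.clone()      # keep the original input unmutated
    my_inplace_op(x_clone)   # run the original mutating op on the clone
    return [x_clone]

t = torch.zeros(3)
(out,) = decomposed_auto_functionalized(t)
assert torch.equal(out, torch.ones(3)) and torch.equal(t, torch.zeros(3))
```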
The pattern matcher's "replace this single node with multiple nodes" API
isn't robust enough here. The problem is that `auto_functionalized`
returns a single output (this output is a List), but the decomposition
ends up returning the unpacked List (e.g. it may return two tensors).
Previously, there was an assertion that this was not the case; I fixed
up `replace_with_graph` to handle this.
Future: Not all of the clones are necessary (e.g. if the input's last
usage is this operator, then we don't need to clone it). We can add this
logic later.
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118673
Approved by: https://github.com/oulgen
Summary: As we're growing the user surface of torch.export, we'd like to understand better how people are using our APIs. It's also possible to analyze the usages based on static analysis, but due to the fact that there could be many creative ways to call things in Python, I think just building some logging infra will benefit us in the short term and gain us some insights.
Test Plan:
buck test caffe2/test:test_export
{F1454519846}
Reviewed By: tugsbayasgalan
Differential Revision: D53618220
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119585
Approved by: https://github.com/avikchaudhuri
Reason:
Consumers of ExportedProgram might choose to further lower exported_program.graph_module to something else.
Then, they will need to set up the calling convention to call it.
This refactor concentrates the calling-convention logic in one place so it can be reused.
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119513
Approved by: https://github.com/zhxchen17
This PR makes it so that ops is no longer a dict of RET => OP but rather RET => List[OP], since multiple OPs can now return the same RET. In real execution, only one of these OPs will be executed, so there is no need to worry about renaming. For analysis, we pessimistically assume any one of them could be executed (which is safest for analysis purposes).
Example TTIRs that can now be handled:
```
scf.if %13 {
%14 = tt.get_program_id y : i32 loc(#loc13)
%c0_i32_1 = arith.constant 0 : i32 loc(#loc14)
%15 = arith.cmpi eq, %14, %c0_i32_1 : i32 loc(#loc14)
scf.if %15 {
%16 = arith.addf %8, %11 : tensor<4xf32> loc(#loc16)
%17 = tt.splat %arg2 : (!tt.ptr<f32, 1>) -> tensor<4x!tt.ptr<f32, 1>> loc(#loc17)
%18 = tt.addptr %17, %4 : tensor<4x!tt.ptr<f32, 1>>, tensor<4xi32> loc(#loc17)
tt.store %18, %16, %5 {cache = 1 : i32, evict = 1 : i32} : tensor<4xf32> loc(#loc18)
} else {
} loc(#loc15)
} else {
} loc(#loc12)
```
and
```
%14 = scf.if %13 -> (tensor<4xf32>) {
%17 = arith.addf %8, %11 : tensor<4xf32> loc(#loc13)
scf.yield %17 : tensor<4xf32> loc(#loc13)
} else {
%17 = arith.mulf %8, %11 : tensor<4xf32> loc(#loc14)
scf.yield %17 : tensor<4xf32> loc(#loc14)
} loc(#loc12)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119664
Approved by: https://github.com/aakhundov
Summary: Currently, when a custom (user-written) Triton kernel has a ReinterpretView argument in IR, we're always skipping the alignment checking for this argument when preparing the `signature_of` for the AOT compilation of the Triton kernel (via setting `TensorArg.check_alignment` to `False`). This is problematic for user-written kernels where, albeit reinterpreted, the argument of the Triton kernel (the data pointer) can still be aligned to 16. When we skip alignment checking, the performance of the AOT-compiled internal Triton kernels can degrade 2x--3x.
In this PR, we replace `TensorArg.check_alignment` by `TensorArg.offset`, in which we specify the offset of the `ReinterpretView.layout` relative to the underlying `ir.Buffer` (corresponding to the data pointer before reinterpretation). As the size and stride of the layout don't change the alignment properties, those can be skipped. Importantly, for `ReinterpretView` arguments of custom Triton kernels, we use `arg.data.get_name()` as the buffer name. That, together with the offset, is used to check the alignment.
Bonus: the namedtuples in `codegen/common.py` are refactored as `dataclass`es, with nicer type hints and default values (for the newly added `TensorArg.offset`).
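A hedged sketch of the idea (illustrative field names; only `offset` is taken directly from the description above, and the real dataclass lives in `codegen/common.py`):
```python
from dataclasses import dataclass

ALIGNMENT = 16  # bytes

@dataclass
class TensorArgSketch:
    name: str         # kernel argument name
    buffer: str       # underlying ir.Buffer name (arg.data.get_name() for ReinterpretView)
    elem_size: int    # element size of the dtype, in bytes
    offset: int = 0   # element offset of the ReinterpretView layout into the buffer

def offset_is_aligned(arg: TensorArgSketch) -> bool:
    # Size/stride of the reinterpreted layout don't affect alignment, so the check only
    # needs the byte offset relative to the (16-byte aligned) underlying buffer.
    return (arg.offset * arg.elem_size) % ALIGNMENT == 0

assert offset_is_aligned(TensorArgSketch("in_ptr0", "buf0", elem_size=4, offset=4))
assert not offset_is_aligned(TensorArgSketch("in_ptr1", "buf0", elem_size=4, offset=3))
```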
Test Plan:
```
$ python test/inductor/test_aot_inductor.py -k test_triton_kernel_reinterpret_view
...
----------------------------------------------------------------------
Ran 6 tests in 27.952s
OK (skipped=4)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119649
Approved by: https://github.com/oulgen
It's unnecessary and inefficient to create a `dict` from list indices to list values just to check whether a particular `idx` exists. That approach has `O(N)` time and space complexity, whereas using the `list` directly is `O(1)` in both.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118011
Approved by: https://github.com/Skylion007
There is no need to make a copy of `frame_summary_stack` when it's not modified. The proposed change uses a copy-on-write functional approach that is easy to understand and more efficient when `self.loc_in_frame` is `None`.
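A minimal sketch of the copy-on-write idea (not the exact dynamo code):
```python
def frame_summary_stack_with_loc(frame_summary_stack, loc_in_frame):
    # Only build a new list when there is actually a location to append.
    if loc_in_frame is None:
        return frame_summary_stack                 # unmodified -> no copy
    return frame_summary_stack + [loc_in_frame]    # copy only when extending
```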
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119115
Approved by: https://github.com/Skylion007
This diff adds a few improvements:
* Parsing for multiple return value: `tt.return %1, %arg0`
* Parsing for assignment for multiple values: `%1:2` means %1 has two values
* Parsing for usage of a value with multiple values: `%1#0` means 0th index of %1
* Fixes a bug in memo-cycle detection when multiple tests are executed back to back
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119615
Approved by: https://github.com/aakhundov
ghstack dependencies: #119581
# Motivation
According to [[1/2] Intel GPU Runtime Upstreaming for Stream](https://github.com/pytorch/pytorch/pull/117611), as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the second PR covers the changes under `python frontend`.
# Design
Currently, it primarily offers stream-related APIs, including
- `torch.xpu.StreamContext`
- `torch.xpu.current_stream`
- `torch.xpu.set_stream`
- `torch.xpu.synchronize`
- `torch._C._xpu_getCurrentRawStream`
# Additional Context
We will implement functions like `torch.xpu.Stream.wait_event`, `torch.xpu.Stream.wait_stream`, and `torch.xpu.Stream.record_event` in the next PR related with `Event`.
The differences with CUDA:
There is no default or external stream in XPU, and the following APIs are missing:
- `torch.cuda.ExternalStream`
- `torch.cuda.default_stream`
- `torch.cuda.is_current_stream_capturing`
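For illustration, a hedged usage sketch of the stream APIs listed above (requires a PyTorch build with XPU support and at least one XPU device; `torch.xpu.Stream` is assumed to be available from the preceding runtime PR):
```python
import torch

s = torch.xpu.Stream()                    # assumed from the runtime PR (#117611)
with torch.xpu.StreamContext(s):          # make `s` the current stream in this scope
    x = torch.ones(1024, device="xpu")
    y = x * 2
    cur = torch.xpu.current_stream()      # query the stream active in this context
torch.xpu.set_stream(s)                   # or set the current stream explicitly
torch.xpu.synchronize()                   # block until all queued XPU work finishes
```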
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117619
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/albanD
ghstack dependencies: #117611
This PR makes a couple of improvements to non-strict to bring it closer to strict. (This lets us remove some expected failures from test_export.)
1. Support constant arguments (easy).
2. Support keyword arguments. This forces us to add kwargs to `aot_export_module`. Indeed there is no way to make this work otherwise, because some arguments in a function signature can be keyword-only and thus cannot be simulated by positional arguments alone. Adding kwargs to `aot_export_module` turns out to be fairly routine, but there is a bit of an unsatisfactory fork between how it is called by strict and non-strict: because strict calls it on a graph module, kwargs must be converted to positional arguments. So kwargs in `aot_export_module` really only come into play in non-strict.
Differential Revision: D53600977
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119529
Approved by: https://github.com/zhxchen17, https://github.com/tugsbayasgalan
Summary:
There was a bug in the module name filter for modules whose names already contain an underscore: the underscore was replaced with "dot" notation.
This is because underscores were assumed to always be module separators, which isn't the case for modules whose names contain an underscore.
Test Plan:
Added a unit test. Before this change, that test failed (due to applying the wrong
qscheme). Now it passes.
Differential Revision: D53502771
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119344
Approved by: https://github.com/jerryzh168
Apply the modularization pass to exported program exporting. The only two things that need to be taken care of are (1) the extra call stack generated by `torch.export.export` and (2) lifted placeholders have call stacks (different from the original placeholders).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119498
Approved by: https://github.com/thiagocrepaldi
Added a `torch.Tensor` method that defines how to transform `other`, a value in the state dictionary, to be loaded into `self`, a param/buffer in an `nn.Module` before swapping via `torch.utils.swap_tensors`
* `param.module_load(sd[key])`
This method can be overridden using `__torch_function__`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117913
Approved by: https://github.com/albanD
This PR substantially improves the error reporting for GuardOnDataDependentSymNode in the following ways:
* The GuardOnDataDependentSymNode error message is rewritten for clarity, and contains a link to a new doc on how to resolve these issues https://docs.google.com/document/d/1HSuTTVvYH1pTew89Rtpeu84Ht3nQEFTYhAX3Ypa_xJs/edit#heading=h.44gwi83jepaj
* We support `TORCHDYNAMO_EXTENDED_DEBUG_CREATE_SYMBOL`, which lets you specify a symbol name to get detailed debug information when it is logged (e.g., the full backtrace and user backtrace of the symbol creation). The exact symbols that you may be interested in are now explicitly spelled out in the error message.
* We support `TORCHDYNAMO_EXTENDED_DEBUG_CPP` which enables reporting C++ backtraces whenever we would report a backtrace.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119412
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #117356
FallbackKernel wasn't handling mutable ops correctly: it would not report
them in get_mutation_names or get_alias_names. This would lead to silent
incorrectness -- Inductor would incorrectly reorder the mutable op with other
mutable ops.
This PR fixes that:
- we only support mutable operations that are "auto_functionalizable".
That is, they mutate inputs and do not return aliases of any inputs.
- Following the Triton kernel work, any mutated inputs must be specified
in get_alias_names and processed via mark_node_as_mutating
- We also do some minor cleanup by killing dead code (FallbackKernel no
longer processes OpOverloadPacket) and adding some handling around
HOPs.
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118649
Approved by: https://github.com/eellison, https://github.com/oulgen
Replaces `view_func()` closures with a reified `ViewFunc` data structure. Codegen generates a `ViewFunc` subclass for each view op (e.g. `NarrowViewFunc`) containing state needed to reconstruct the view. The `ViewFunc` API allows for querying and hot-swapping any `SymInt`s or `Tensors` in the state through `get_symints()` / `get_tensors()` / `clone_and_set()`, which will be essential for fake-ification later on.
```cpp
/// Base class for view functions, providing reapplication of a view on a new base.
/// Each view op should get a codegenerated subclass of this class containing
/// any state needed to reconstruct the view. The class also provides convenience
/// accessors for saved SymInts / tensor state. This is useful for e.g. fake-ification,
/// where we want to use symbolic values or fake tensors instead.
struct TORCH_API ViewFunc {
virtual ~ViewFunc() {}
/// Returns any SymInts in the saved state.
virtual std::vector<c10::SymInt> get_symints() const { return {}; }
/// Returns the number of SymInts in the saved state.
virtual size_t num_symints() const { return 0; }
/// Returns any tensors in the saved state.
virtual std::vector<at::Tensor> get_tensors() const { return {}; }
/// Returns the number of tensors in the saved state.
virtual size_t num_tensors() const { return 0; }
/// Reapplies the view on the given base using the saved state.
virtual at::Tensor operator()(const at::Tensor&) const = 0;
/// Returns a clone of this ViewFunc, optionally with the specified saved state.
virtual std::unique_ptr<ViewFunc> clone_and_set(
std::optional<std::vector<c10::SymInt>> = c10::nullopt,
std::optional<std::vector<at::Tensor>> = c10::nullopt) const = 0;
protected:
/// Sets the values of any SymInts in the saved state. The input vector size must
/// match the number of SymInts in the saved state (i.e. the size of the list
/// returned by get_symints()).
virtual void set_symints(std::vector<c10::SymInt>) {}
/// Sets the values of any Tensors in the saved state. The input vector size must
/// match the number of Tensors in the saved state (i.e. the size of the list
/// returned by get_tensors()).
virtual void set_tensors(std::vector<at::Tensor>) {}
};
```
New codegen files:
* `torch/csrc/autograd/generated/ViewFunc.h`
* `torch/csrc/autograd/generated/ViewFuncs.cpp`
The templates for these also contains impls for `ChainedViewFunc` and `ErroringViewFunc` which are used in a few places within autograd.
Example codegen for `slice.Tensor`:
```cpp
// torch/csrc/autograd/generated/ViewFuncs.h
#define SLICE_TENSOR_VIEW_FUNC_AVAILABLE
struct SliceTensorViewFunc : public torch::autograd::ViewFunc {
SliceTensorViewFunc(int64_t dim, c10::optional<c10::SymInt> start, c10::optional<c10::SymInt> end, c10::SymInt step) : dim(dim), start(start), end(end), step(step)
{};
virtual ~SliceTensorViewFunc() override {};
virtual std::vector<c10::SymInt> get_symints() const override;
virtual size_t num_symints() const override;
virtual std::vector<at::Tensor> get_tensors() const override;
virtual size_t num_tensors() const override;
virtual at::Tensor operator()(const at::Tensor&) const override;
virtual std::unique_ptr<ViewFunc> clone_and_set(
std::optional<std::vector<c10::SymInt>> = c10::nullopt,
std::optional<std::vector<at::Tensor>> = c10::nullopt) const override;
protected:
virtual void set_symints(std::vector<c10::SymInt>) override;
virtual void set_tensors(std::vector<at::Tensor>) override;
private:
int64_t dim;
c10::optional<c10::SymInt> start;
c10::optional<c10::SymInt> end;
c10::SymInt step;
};
...
// torch/csrc/autograd/generated/ViewFuncs.cpp
std::vector<c10::SymInt> SliceTensorViewFunc::get_symints() const {
::std::vector<c10::SymInt> symints;
symints.reserve((start.has_value() ? 1 : 0) + (end.has_value() ? 1 : 0) + 1);
if(start.has_value()) symints.insert(symints.end(), *(start));
if(end.has_value()) symints.insert(symints.end(), *(end));
symints.push_back(step);
return symints;
}
size_t SliceTensorViewFunc::num_symints() const {
return static_cast<size_t>((start.has_value() ? 1 : 0) + (end.has_value() ? 1 : 0) + 1);
}
void SliceTensorViewFunc::set_symints(std::vector<c10::SymInt> symints) {
TORCH_INTERNAL_ASSERT(symints.size() == num_symints());
auto i = 0;
if(start.has_value()) start = symints[i];
i += (start.has_value() ? 1 : 0);
if(end.has_value()) end = symints[i];
i += (end.has_value() ? 1 : 0);
step = symints[i];
}
std::vector<at::Tensor> SliceTensorViewFunc::get_tensors() const {
::std::vector<at::Tensor> tensors;
return tensors;
}
size_t SliceTensorViewFunc::num_tensors() const {
return static_cast<size_t>(0);
}
void SliceTensorViewFunc::set_tensors(std::vector<at::Tensor> tensors) {
TORCH_INTERNAL_ASSERT(tensors.size() == num_tensors());
}
at::Tensor SliceTensorViewFunc::operator()(const at::Tensor& input_base) const {
return at::_ops::slice_Tensor::call(input_base, dim, start, end, step);
}
std::unique_ptr<ViewFunc> SliceTensorViewFunc::clone_and_set(
std::optional<std::vector<c10::SymInt>> symints,
std::optional<std::vector<at::Tensor>> tensors) const {
auto output = std::make_unique<SliceTensorViewFunc>(dim, start, end, step);
if (symints.has_value()) {
output->set_symints(std::move(*(symints)));
}
if (tensors.has_value()) {
output->set_tensors(std::move(*(tensors)));
}
return output;
}
```
The `_view_func()` / `_view_func_unsafe()` methods now accept two additional (optional) args for `symint_visitor_fn` / `tensor_visitor_fn`. If these are defined, they are expected to be python callables that operate on a single SymInt / tensor and return a new one. This allows for the hot-swapping needed during fake-ification.
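A hedged usage sketch of the replay/visitor API described above (the visitor argument order follows this description; requires a differentiable view):
```python
import torch

base = torch.randn(4, 6, requires_grad=True)
view = base.narrow(1, 1, 4)              # differentiable view; its ViewFunc saves dim/start/length

new_base = torch.zeros(4, 6, requires_grad=True)
replayed = view._view_func(new_base)     # reapply the same view on a different base
print(replayed.shape)                    # torch.Size([4, 4]), same as view.shape

# Optionally hot-swap saved SymInts / tensors during replay; identity visitors here,
# just to show the call shape.
replayed2 = view._view_func(new_base, lambda s: s, lambda t: t)
```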
For testing, there are extensive pre-existing tests, and I added a test to ensure that hot-swapping functions correctly.
```sh
python test/test_autograd.py -k test_view_func_replay
python test/test_ops.py -k test_view_replay
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118404
Approved by: https://github.com/ezyang
This PR replaces the `_unsafe_preserve_version_counters` context with a simple `torch.no_grad()` context instead. This decreases CPU overhead from (1 context enter/exit + `N` loop over tensors) with just (1 context enter/exit).
This PR also removes a `torch.no_grad()` from `init_unsharded_param` as it helps compiling but does not affect eager.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119550
Approved by: https://github.com/Skylion007
This is going to fix a legacy issue like:
```
torch._dynamo.export(torch.ops.aten.scaled_dot_product_attention, ...)(*inputs,)
```
This is not supported any more; now the top-level `torch.export` only supports `nn.Module`, but there are still some tests using the internal APIs, which caused the `trace_rules.check` assertion error. This PR is going to mitigate such cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119528
Approved by: https://github.com/ydwu4
Due to PR_WINDOW, if the magic string exists in the body but the pr was not updated recently, the query wouldn't find it and would delete the branch. Instead, query separately for branches with the no-delete-branch label, which I created recently.
Might as well query for branches with open PRs while we're at it so PRs with the stale label won't get their branches deleted either
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119399
Approved by: https://github.com/huydhn
When using the Cutlass backend, the compilation
of CUDA source files can totally dominate the runtime required for the benchmarking done
as part of Autotuning.
This change adds a multithreaded precompilation phase, which serves to pre-populate the compilation cache (both in-memory and a possible on-disk sccache).
It also ensures that no unnecessary compilation and benchmarking steps are performed, which was previously the case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119386
Approved by: https://github.com/aakhundov
Add a flag setting that controls a threshold of guards involving a symbol, after which we force a symbol to be specialized. The roll out plan is to enable this on OSS but not fbcode, and then roll out to fbcode after we get some telemetry from the previous PR.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119347
Approved by: https://github.com/lezcano
```
Takes in a function which has been printed with print_readable() and constructs kwargs to run it.
Currently only handles Tensor inputs and a graph module which might have tensor constants.
Example:
Consider a function `forward` defined as follows:
>>> def forward(self, primals_1: "f32[1001, 6]"):
... _tensor_constant0: "i64[4190]" = self._tensor_constant0
... # Further implementation
>>> kwargs = aot_graph_input_parser(forward)
>>> forward(**kwargs)
"""
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119409
Approved by: https://github.com/shunting314
This PR adds a new type of triton kernel in which data is persistent but the
reduction dimension is split over multiple blocks (up to the entire kernel).
Though this is called a reduction dimension, in actuality we only support scans. Because of this limitation, I have to be able to block fusions of split scan operations with reductions, so I chose to add a new `ir.SplitScan` node which is identical but allows for differentiation in the scheduler.
The split scan kernel is also the first to require an additional workspace buffer, which is used to communicate between cuda blocks. This is slightly tricky as the exact scratch space requirement isn't known until the grid size is calculated. Here I work around the issue by setting a minimum rblock size and always allocating for the maximum possible grid size for a given input tensor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117992
Approved by: https://github.com/jansel
ghstack dependencies: #117991
Currently the dimension handling in triton kernels has various special cases e.g.
- handling "r" for non-reduction vs persistent reduction vs non-persistent reduction.
- handling "x" when `no_x_dim` is set
This adds three new properties to the range tree objects which capture the
same information in a more generic way:
- `is_loop`: true for the "r" dimension of a non-persistent reduction
- `tensor_dim`: Optional index of the triton tensor dimension
- `grid_dim`: Optional index of the triton grid dimension
The motivation here is I want to add a new split scan kernel type which is:
- not a persistent reduction, yet has `is_loop=False` for the "r" dimension
- Has a `grid_dim` for the "r" dimension
These flags now only need to be set once in `initialize_range_trees`, instead of having
to infer them throughout the code based on the tree prefix and various other kernel flags.
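Illustratively, a sketch of the three properties (not Inductor's actual range-tree class; the dimension indices below are illustrative):
```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RangeTreeFlags:
    is_loop: bool               # True for the "r" dim of a non-persistent reduction
    tensor_dim: Optional[int]   # index of the triton tensor dimension, if any
    grid_dim: Optional[int]     # index of the triton grid dimension, if any

# "r" tree of a persistent reduction: no loop, lives in the tensor, not in the grid.
persistent_r = RangeTreeFlags(is_loop=False, tensor_dim=1, grid_dim=None)
# "r" tree of the planned split-scan kernel: also no loop, but mapped onto the grid.
split_scan_r = RangeTreeFlags(is_loop=False, tensor_dim=1, grid_dim=1)
```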
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117991
Approved by: https://github.com/lezcano
```
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
RuntimeError: Internal Triton PTX codegen error:
ptxas /tmp/compile-ptx-src-83b319, line 51; error : Feature '.bf16' requires .target sm_80 or higher
ptxas /tmp/compile-ptx-src-83b319, line 51; error : Feature 'cvt with .f32.bf16' requires .target sm_80 or higher
ptxas /tmp/compile-ptx-src-83b319, line 59; error : Feature '.bf16' requires .target sm_80 or higher
ptxas /tmp/compile-ptx-src-83b319, line 59; error : Feature 'cvt with .f32.bf16' requires .target sm_80 or higher
ptxas /tmp/compile-ptx-src-83b319, line 65; error : Feature '.bf16' requires .target sm_80 or higher
ptxas /tmp/compile-ptx-src-83b319, line 65; error : Feature 'cvt.bf16.f32' requires .target sm_80 or higher
ptxas fatal : Ptx assembly aborted due to errors
Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
You can suppress this exception and fall back to eager by setting:
import torch._dynamo
torch._dynamo.config.suppress_errors = True
To execute this test, run the following from the base repo dir:
python test/inductor/test_torchinductor.py -k test_bfloat16_to_int16_cuda
```
Fixed a test failure that uses bfloat16 on pre-SM80 GPUs (V100 is where the failure is seen for this test).
See also #113384
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118449
Approved by: https://github.com/eqy, https://github.com/peterbell10
Summary:
Seems like `kwargs` is already supported in `_infer_argument`, so we don't need the extra assertion `len(kwargs) == 0`.
This optimization ensures compatibility with torch.compile() for LazyModules with kwargs inputs, preventing graph breaks.
Test Plan: Unit test and CI
Differential Revision: D53558778
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119445
Approved by: https://github.com/yanboliang
In some cases where we have TORCH_CHECK in loops, it may cause the host compiler to spend hours optimizing the run_impl function. This PR mitigates the issue by replacing TORCH_CHECK with a custom AOTI_CHECK, where we force the underlying assert function to be noinline.
If forcing noinline causes any serious perf regression, we could either add an option to turn noinline on/off, or add an option to turn AOTI_CHECK into a no-op, similar to the `assert` macro from cassert.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119220
Approved by: https://github.com/hl475, https://github.com/desertfire
Right now, `ModuleInfo.dtypes` defaults to `torch.testing._internal.common_dtype.floating_types()`, and almost no ModuleInfos override this (so only `float32` and `float64` are tested).
This is the first step to clean up/improve dtype testing for `ModuleInfos` and fix #116626.
Follow-up PRs will update `dtypes=` (and perhaps `dtypesIf{Device}`, if it makes sense) for each `ModuleInfo`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119039
Approved by: https://github.com/janeyx99
`CompiledKernel.launch_enter_hook` and `CompiledKernel.launch_exit_hook` are hooks that allow external tools to monitor the execution of Triton kernels and read each kernel's metadata. Initially, these hooks have a value of `None`.
Triton's kernel launcher passes hooks and kernel metadata by default, while Inductor's launcher doesn't. This PR could unify the parameters passed to both launchers so that tools can get information from both handwritten Triton kernels and Inductor-generated Triton kernels.
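A hedged sketch of how a tool might install these hooks (the import path and the hook callback signature can vary across Triton versions; treat the callback arguments as opaque here):
```python
from triton.compiler import CompiledKernel  # location may differ between Triton releases

def enter_hook(*args, **kwargs):
    # Called right before a kernel launch; the arguments carry the launch metadata.
    print("kernel launch: enter")

def exit_hook(*args, **kwargs):
    print("kernel launch: exit")

# Both hooks default to None; a profiling tool can install its own callbacks, and with
# this PR they receive the same parameters for Inductor-generated kernels as for
# handwritten Triton kernels.
CompiledKernel.launch_enter_hook = enter_hook
CompiledKernel.launch_exit_hook = exit_hook
```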
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119450
Approved by: https://github.com/jansel
Before: `softmax` definition uses `jagged_unary_pointwise()` (wrong)
After: `softmax` impl adjusts the `dim` arg to account for the difference in dimensionality between the outer NT and the NT's `_values`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119459
Approved by: https://github.com/soulitzer
It should usually be safe to run pointwise binary ops with >2 inputs, e.g. threshold_backward(tensor, tensor, scalar): we just operate on the values of the nested tensors and pass in the other args as-is.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119419
Approved by: https://github.com/soulitzer
Previously in non-strict mode we would source a FakeTensorMode from existing tensors if available.
It turns out this is problematic, as it means we can't directly control the behavior of this FakeTensorMode. For example, if the user-provided FakeTensorMode does not set `allow_non_fake_inputs=True`, then we get into trouble with constant tensors, etc.
At the moment, we still have to explicitly re-fakeify the module state. @ezyang has recommended against this, but it's necessary because `create_aot_dispatcher_function` calls `detect_fake_mode` on all the inputs, which will error if not all the FakeTensors are on the same mode. We should straighten this out, but that is left for the future.
Differential Revision: [D53559043](https://our.internmc.facebook.com/intern/diff/D53559043/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119446
Approved by: https://github.com/ezyang, https://github.com/zhxchen17
Summary:
This wires the eager-mode operation to the Vulkan shader. We only cover the case where both inputs are Tensor type, which is on par with the existing operators: add, sub, mul, div, floor_div.
It doesn't seem like we can cover [any of the other 8 cases](https://www.internalfb.com/code/fbsource/[e45c04564445b5e67ebb61e6ba53995729686526]/xplat/caffe2/torch/distributed/_tensor/ops/pointwise_ops.py?lines=310-317) right now. We categorize them below and explain what's missing for each.
## Category 1
The other 2/3 "standard" cases requires one of the values to be a scalar,
```
z = torch.pow(x, y)
```
```
aten.pow.Scalar,
aten.pow.Tensor_Scalar,
aten.pow.Tensor_Tensor,
```
which is not currently supported.
```
F 00:00:01.746228 executorch:aten_bridge.cpp:21] In function check_tensor_meta(), assert failed (b.sizes().data() != nullptr): ETensor must have valid sizes array
```
## Category 2
IIUC, these operators require an out argument in the declaration. However, when they are traced they collapse into Category 1, e.g., we obtain `aten.pow.Tensor_Tensor` not `aten.pow.Tensor_Tensor_out`.
This appears in line with current PT-Vulkan, which only [implements the other two categories](https://www.internalfb.com/code/fbsource/[f148c22604b8e409696fd64f814cda89d091fe7a]/xplat/caffe2/aten/src/ATen/native/vulkan/ops/BinaryOp.cpp?lines=533-558).
```
torch.pow(x, y, out=z)
```
```
aten.pow.Scalar_out,
aten.pow.Tensor_Scalar_out,
aten.pow.Tensor_Tensor_out,
```
## Category 3
IIUC, in-place operators are written like this:
```
x.pow_(y)
```
```
aten.pow_.Scalar,
aten.pow_.Tensor,
```
They are not currently supported.
```
File "/data/users/jorgep31415/fbsource/buck-out/v2/gen/fbcode/b007eb344207ad7d/executorch/backends/vulkan/test/__test_vulkan_delegate__/test_vulkan_delegate#link-tree/torch/_export/verifier.py", line 188, in _check_valid_op
raise SpecViolationError(
torch._export.verifier.SpecViolationError: operator 'aten.copy_.default' is not functional
```
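For reference, the three categories above correspond to these eager-mode spellings (only the first, Tensor-Tensor case is wired up in this change):
```python
import torch

x = torch.rand(2, 3)
y = torch.rand(2, 3)

# Category 1: functional form; Tensor**Tensor traces to aten.pow.Tensor_Tensor,
# and the scalar variants arise when one operand is a Python number.
z = torch.pow(x, y)
z_ts = torch.pow(x, 2.0)        # -> aten.pow.Tensor_Scalar
z_st = torch.pow(2.0, x)        # -> aten.pow.Scalar

# Category 2: out= form; when traced, it collapses into the Category 1 op.
out = torch.empty_like(x)
torch.pow(x, y, out=out)

# Category 3: in-place form.
x.pow_(y)
```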
Test Plan:
```
[jorgep31415@devvm15882.vll0 /data/users/jorgep31415/fbsource (fd1ed5f81)]$ buck2 test fbcode//executorch/backends/vulkan/test:test_vulkan_delegate -- test_vulkan_backend_pow
File changed: fbcode//executorch/backends/vulkan/vulkan_preprocess.py
Buck UI: https://www.internalfb.com/buck2/7f9ec9e5-cbac-4618-b8ad-d94d10bb50ff
Test UI: https://www.internalfb.com/intern/testinfra/testrun/562950306906309
Network: Up: 3.2KiB Down: 0B (reSessionID-ea5af789-c131-4170-ba20-5c5c9718276b)
Jobs completed: 7. Time elapsed: 48.5s.
Cache hits: 0%. Commands: 1 (cached: 0, remote: 0, local: 1)
Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. Build failure 0
```
Differential Revision: D53547865
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119423
Approved by: https://github.com/SS-JIA, https://github.com/malfet
- Switch to native complex support if running on MacOS Monterey or newer for binary ops.
- Python complex scalars are always represented in PyTorch as ComplexDouble, but MPS does not yet support double precision types, so downcast them to floats
- Also add `cf` (for complex float) and `ch` (for complex half) to the MPSScalar value union
- Fix complex scalar-to-view promotion by introducing a `legacy_complex_as_view` helper function that casts non-float types to complex and promotes CPU complex scalars to MPS before turning them into a view.
- Add `test_tensor_scalar_binops`
Fixes https://github.com/pytorch/pytorch/issues/119088
Test plan: CI (have quite a lot of tests, see new unexpected successes) + `python -c "import torch;x,y=torch.rand(2, 2, dtype=torch.cfloat, device='mps'),torch.tensor(2+3j,dtype=torch.chalf);print(y+x)"`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119318
Approved by: https://github.com/albanD
Summary:
With the compiled PyTorch module, in execution_trace_observer.cpp, function convertIValue calls TensorImpl->storage_offset(). That function call will trigger a recursive call into recordOperatorStart. It will cause a deadlock on ob.g_mutex.
This DIFF is to fix this deadlock by replacing std::mutex with std::recursive_mutex.
Since PyTorch only has one thread for FWD and one thread for BWD, contention is very low, so performance should NOT be a concern.
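A minimal Python analogue of the problem and the fix (using `threading` locks to stand in for the C++ mutexes; none of the names below come from the actual PyTorch code):
```python
import threading

g_mutex = threading.RLock()  # re-entrant, like std::recursive_mutex; a plain threading.Lock() deadlocks here

def record_operator_start(op):
    with g_mutex:
        return convert_ivalue(op)   # holds the lock ...

def convert_ivalue(op):
    with g_mutex:                   # ... and re-acquires it on the recursive call
        return repr(op)

record_operator_start("aten::add")  # completes with RLock; hangs forever with a non-recursive Lock
```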
Test Plan:
Unit Test
buck test mode/dev-nosan caffe2/test:profiler -- test_execution_trace_with_pt2
Differential Revision: D53533253
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119398
Approved by: https://github.com/aaronenyeshi
This pull request aims to complete most of the support for vectorizing int32 and int64 data types except for indirect indexing and masks. The basic data type support for uint32 and uint64 is also added but without vectorization. More vectorized conversion functions are added between integer and float. In order to support int64 vectors, a new VectorizedN class is introduced to handle vectors of arbitrary length. Below are the details:
1. Complete most of the int32 and int64 vectorization support including load, store, reduction, constant and conversion. The indirect indexing and masks will be addressed in follow-up PRs, after which, the legality checking logic in `CppVecKernelChecker` can be further simplified.
2. Util functions for conversion between integer and float vectors (in cpp_prefix.h and ATen vec). Ideally, we should move them from cpp_prefix.h to ATen vec to simplify cpp_prefix.h; this will be addressed in follow-up PRs.
3. Introduced a new template class VectorizedN, designed to handle vectors of arbitrary length by encapsulating multiple Vectorized<T> instances. This class supports most of the operations of `Vectorized<T>`. It makes the support of int64 vectorization simpler. I will also apply it to bf16/fp16/int8 in the follow-up PRs for better efficiency. For example, bf16 currently only uses half of the vector lanes. With `VectorizedN`, we can use all of the lanes and map the bf16 vector to `VectorizedN<float,2>` on conversion.
4. Basic data type support is added for uint32 and uint64 (in graph.py). Vectorization support will be added later but not of high priority due to fewer usages.
Next steps:
- [ ] Refactor the vector mask handling to support data types other than float. Currently vector masks are implemented with float vectors.
- [ ] Fully utilize vector lanes for bfloat16/float16/int8.
- [ ] Support indirect indexing with vectorized index via scalarization.
- [ ] Clean up `CppVecKernelChecker`.
- [ ] Simplify `cpp_prefix.h` including refactoring vector conversion logic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119001
Approved by: https://github.com/peterbell10, https://github.com/jansel
Fix: #115792
This PR implements 2 virtual functions of `TensorImpl` that are called when setting the
`tensor.data`:
- `shallow_copy_from`: which calls `copy_tensor_metadata`; and
- `copy_tensor_metadata`: which copies all `FunctionalTensorWrapper` metadata and ~calls
`dest->value_.set_data(src->value_)`~ assigns `dest->value_ = src->value_`, so as to copy also the inner tensor using the same
method
Before this PR, the inner tensor of a `FunctionalTensorWrapper` was being ignored.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118783
Approved by: https://github.com/bdhirsh
Fixes #119198.
This PR makes dynamo inline `__iter__` of a user-defined object instead of creating a graph break. Also added a new test, which shows:
1. the loop is unrolled
2. the length of the loop is guarded when inlining `__iter__`
```python
class Mod:
def __init__(self):
self.a = [torch.randn(2, 2), torch.randn(2, 2)]
def __iter__(self):
return iter(self.a)
def f(mod):
ret = []
for x in mod:
ret.append(x + 1)
return ret
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119243
Approved by: https://github.com/jansel
Some APIs like ncclCommAbort can cause nccl kernels to finish even if
they were previously stuck. Because we can gather the trace buffer after
those calls, we can end up seeing some collectives marked completed even though
that completion happened several minutes after they started and clearly after
the timeout. This changes how we record state so that we keep track of the time
we discover a state change, so even if eventually the collective gets marked complete,
we can observe that it happened minutes after it was scheduled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119249
Approved by: https://github.com/wconstab
Summary: This diff optimizes the .item() call by backing the scalar value storage with pinned memory, so we don't create an implicit synchronization with the libcuda library.
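For context, a hedged sketch of the general pattern (not the actual change, which lives inside the `.item()` implementation): reading a GPU scalar back through a pinned host buffer lets the device-to-host copy DMA directly instead of staging through pageable memory.
```python
import torch

if torch.cuda.is_available():
    x = torch.randn((), device="cuda")

    # Plain .item(): the scalar is copied into pageable host memory.
    v1 = x.item()

    # Pinned-memory readback of the same scalar: the destination buffer is
    # page-locked, so the D2H copy can go directly into it.
    host = torch.empty((), pin_memory=True)
    host.copy_(x, non_blocking=True)
    torch.cuda.synchronize()   # the copy still has to finish before reading
    v2 = host.item()

    assert v1 == v2
```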
Test Plan:
# Prod VDD model on H100
Vanguard runs
9.8k qps -> 10.1k qps (~3% improvement)
# .item() Benchmark
1 thread 50k iterations
consistent ~2-3% improvements
With pinned memory
item() took 1.627608060836792 seconds
item() took 1.635591983795166 seconds
item() took 1.6398141384124756 seconds
item() took 1.6378591060638428 seconds
item() took 1.618534803390503 seconds
item() took 1.6467158794403076 seconds
item() took 1.6278800964355469 seconds
item() took 1.6205573081970215 seconds
item() took 1.64951753616333 seconds
item() took 1.6286702156066895 seconds
w/o pinned memory
item() took 1.6783554553985596 seconds
item() took 1.6670520305633545 seconds
item() took 1.6748230457305908 seconds
item() took 1.6708712577819824 seconds
item() took 1.6836023330688477 seconds
item() took 1.6518056392669678 seconds
item() took 1.6769678592681885 seconds
item() took 1.661888837814331 seconds
item() took 1.6627326011657715 seconds
item() took 1.6908581256866455 seconds
Differential Revision: D53431148
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119202
Approved by: https://github.com/xw285cornell
Everything inside the `AT_DISPATCH` block is being compiled 5 times,
so it makes sense to limit it to the only line that uses `scalar_t` which is
the `numeric_limits` query.
Also a small optimization, instead of computing `grad.log()` and `(-grad).log()`
we can compute `grad.abs().log()` which is 2 pointwise ops instead of 3.
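A quick numerical check of the claimed equivalence (a standalone sketch, not the kernel code):
```python
import torch

grad = torch.randn(10_000)
# One abs + one log (2 pointwise ops) ...
combined = grad.abs().log()
# ... versus computing log(grad) and log(-grad) separately (3 pointwise ops)
# and selecting by sign.
split = torch.where(grad > 0, grad.log(), (-grad).log())
assert torch.equal(combined, split)
```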
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119397
Approved by: https://github.com/lezcano, https://github.com/albanD
# Motivation
As mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the second runtime component we would like to upstream is `Stream` which contains the device management functions of Intel GPU's runtime. To facilitate the code review, we split the code changes into 2 PRs. This is one of the 2 PRs and covers the changes under `c10`.
# Design
An Intel GPU stream is a wrapper around a sycl queue, which schedules kernels on a sycl device. In our design, we maintain a sycl queue pool containing 32 queues per priority per device; when a queue is requested, one of these queues is returned round-robin. The corresponding C++ files related to `Device` will be placed in the `c10/xpu` folder. We provide the `c10::xpu::XPUStream` APIs, like
- `XPUStream getStreamFromPool`
- `XPUStream getCurrentXPUStream`
- `void setCurrentXPUStream`
- `void device_synchronize`
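A toy Python model of the pool described above (round-robin hand-out per device and priority; this is an illustration only, not the C++ implementation, and `make_queue` is a stand-in for the SYCL queue constructor):
```python
import itertools
from collections import defaultdict

QUEUES_PER_POOL = 32  # per priority per device, as described above

class ToyStreamPool:
    def __init__(self, make_queue):
        self._make_queue = make_queue            # stand-in for creating a sycl queue
        self._pools = {}                         # (device, priority) -> list of queues
        self._rr = defaultdict(itertools.count)  # round-robin counters

    def get_stream_from_pool(self, device, priority=0):
        key = (device, priority)
        if key not in self._pools:
            self._pools[key] = [self._make_queue(device, priority)
                                for _ in range(QUEUES_PER_POOL)]
        idx = next(self._rr[key]) % QUEUES_PER_POOL
        return self._pools[key][idx]

pool = ToyStreamPool(lambda dev, prio: object())
first = pool.get_stream_from_pool(0)
for _ in range(QUEUES_PER_POOL - 1):
    pool.get_stream_from_pool(0)
assert pool.get_stream_from_pool(0) is first  # wraps around after 32 requests
```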
# Additional Context
In our plan, 2 PRs should be submitted to PyTorch for `Stream`:
1. for c10
2. for python frontend.
The differences from CUDA:
There is no default or external stream in XPU, and the following APIs are missing:
- `getDefaultCUDAStream`
- `getStreamFromExternal`
For CUDA, `cuda::device_synchronize` can sync all streams on the device, but for XPU, `xpu::sync_streams_on_device` only syncs all reserved streams on the device.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117611
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/malfet
To avoid the following error:
```
2024-02-07T12:49:51.8306390Z ld: warning: dylib (/Users/runner/work/_temp/anaconda/envs/wheel_py38/lib/libomp.dylib) was built for newer macOS version (11.1) than being linked (11.0)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119373
Approved by: https://github.com/huydhn
This PR adds explicit backward prefetching to overlap communication and computation in backward (namely, needed for `reshard_after_forward=True` or `reshard_after_forward: int`). We do this by recording the post-forward order and using its reverse to approximate the backward order.
This works for the typical 1 forward / 1 backward training. However, for more complex schedules, this can run into some gaps:
- We need to know the _true end of backward_.
- At the true end of backward, we can clear our recorded post-forward order and pre-backward hook state, and we should wait on gradient reductions.
- There is no easy way to know whether the current backward marks the true end of backward. Therefore, we introduce an API for the user to set this: `fsdp_module.set_is_last_backward(bool)`. For example, for pipeline parallelism's DFS cooldown backward, we can call `fsdp_module.set_is_last_backward(is_last_microbatch)`.
- When the user runs backward through only part of the model, our reverse-post-forward-order heuristic risks _mistargeted prefetches_ for unused modules, which would mean the module's parameters are all-gathered and not freed until the end of backward.
- To error on the side of less memory usage (but no overlap), this PR introduces logic to check whether a module will need its unshard in the current backward (by recording the module's `forward` outputs' `grad_fn`s and querying the autograd engine).
- Note that there may be _no_ overlap in backward for some parts due to no prefetching.
- Note further that when running multiple backwards, if the user does not use `set_is_last_backward`, we may not be able to provide a meaningful error message, as the pre-backward hook could be erroneously cleared on the 1st backward.
- In the future, we may expose more APIs from the autograd engine (similar to `_current_graph_task_execution_order`) to make the prefetching exact. (Currently, `_current_graph_task_execution_order` requires the `with torch.autograd.set_multithreading_enabled(False)`, which is too hard of a constraint as we cannot easily modify users' training loops. We can replace the multi-threading check with a device check. Moreover, in the partial backward case in this PR's unit test, I still hit an [internal assertion](b816760a2f/torch/csrc/autograd/engine.cpp (L476)), so some follow-up is required.)
<details>
<summary> Old Discussion </summary>
For discussion:
- The PR includes a counter `expected_backward_unshard_count` to mitigate mistargeted prefetches in backward. However, it can be seen as a necessary but not sufficient solution.
- If a module's outputs do not require gradient, then we certainly do not need to unshard the module in backward.
- However, if a module's outputs do require gradient, then we still may not need to unshard the module for _this_ backward (e.g. if the module did not contribute to `loss` for the current `loss.backward()`).
- This counter will only address the first case but not the second. If we want to address the second, then we may need more info from the autograd engine.
- For now, I did not include any unit test to cover these behaviors, as I do not have a good example yet.
</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118118
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
ghstack dependencies: #118017
**Summary**
The reducer of `DistributedDataParallel` is implemented in C++, and it is not easy to trace the allreduce launched in the reducer. This PR modifies `DistributedDataParallel` to launch one allreduce per gradient when `compiled_autograd` is enabled. The changes allow us to use `compiled_autograd` to trace the allreduces so that they can later be optimized (fused) in the Inductor.
**Key Logic**
1. If `ddp_python_hook` is True, we assume `compiled_autograd` is used. `DistributedDataParallel` registers `compiled_accum_grad_hook` for all parameters.
2. In the first forward() call, if `DistributedDataParallel` is not compiled, all `compiled_accum_grad_hook` are deregistered. If `DistributedDataParallel` is compiled, all `compiled_accum_grad_hook` will be compiled by `compiled_autograd`.
3. `compiled_accum_grad_hook` launches an allreduce to reduce the gradient of the parameter.
**Bucketing**
The compiled backward is slow because there is no bucketing for the allreduces. We rely on Inductor to bucket the allreduces.
The bucketing is done in a separate PR.
Differential Revision: [D49428482](https://our.internmc.facebook.com/intern/diff/D49428482/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110662
Approved by: https://github.com/wconstab
Partially fixes https://github.com/pytorch/pytorch/issues/105077
Repro:
```python
import tempfile
import torch
from torch._subclasses import fake_tensor
class TheModelClass(torch.nn.Module):
def __init__(self):
super(TheModelClass, self).__init__()
self.fc1 = torch.nn.Linear(5, 10)
def forward(self, x):
return self.fc1(x)
with tempfile.NamedTemporaryFile() as state_dict_file:
# Create state_dict to be loaded later
model = TheModelClass()
torch.save(model.state_dict(), state_dict_file.name)
fake_mode = fake_tensor.FakeTensorMode()
with fake_mode:
# This is where the bug is triggered
state_dict = torch.load(state_dict_file.name)
```
Error:
```bash
Traceback (most recent call last):
File "issue_gh_torch_105077.py", line 22, in <module>
state_dict = torch.load(state_dict_file.name)
File "/opt/pytorch/torch/serialization.py", line 1014, in load
return _load(opened_zipfile,
File "/opt/pytorch/torch/serialization.py", line 1422, in _load
result = unpickler.load()
File "/opt/pytorch/torch/_utils.py", line 205, in _rebuild_tensor_v2
tensor = _rebuild_tensor(storage, storage_offset, size, stride)
File "/opt/pytorch/torch/_utils.py", line 184, in _rebuild_tensor
return t.set_(storage._untyped_storage, storage_offset, size, stride)
File "/opt/pytorch/torch/utils/_stats.py", line 20, in wrapper
return fn(*args, **kwargs)
File "/opt/pytorch/torch/_subclasses/fake_tensor.py", line 1288, in __torch_dispatch__
return self.dispatch(func, types, args, kwargs)
File "/opt/pytorch/torch/_subclasses/fake_tensor.py", line 1468, in dispatch
self.invalidate_written_to_constants(func, flat_arg_fake_tensors, args, kwargs)
File "/opt/pytorch/torch/_subclasses/fake_tensor.py", line 1733, in invalidate_written_to_constants
_, new_kwargs = normalize_function(
File "/opt/pytorch/torch/fx/operator_schemas.py", line 297, in normalize_function
torch_op_schemas = get_signature_for_torch_op(target)
File "/opt/pytorch/torch/fx/operator_schemas.py", line 167, in get_signature_for_torch_op
signatures = [_torchscript_schema_to_signature(schema) for schema in schemas]
File "/opt/pytorch/torch/fx/operator_schemas.py", line 167, in <listcomp>
signatures = [_torchscript_schema_to_signature(schema) for schema in schemas]
File "/opt/pytorch/torch/fx/operator_schemas.py", line 70, in _torchscript_schema_to_signature
arg_type = _torchscript_type_to_python_type(arg.type)
File "/opt/pytorch/torch/fx/operator_schemas.py", line 64, in _torchscript_type_to_python_type
return eval(ts_type.annotation_str, _type_eval_globals)
File "<string>", line 1, in <module>
NameError: name 'Storage' is not defined
```
This PR adds the ability to create fake tensors during `torch.load` by wrapping the `torch.tensor.set_` call in `torch.utils._mode_utils.no_dispatch()` to skip the fake mode dispatcher for it and thus create a real tensor. It later calls `fake_mode.from_tensor(t)` to finally create the fake tensor.
Co-authored-by: Edward Z. Yang <ezyang@mit.edu>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108186
Approved by: https://github.com/ezyang
# Motivation
According to [[1/4] Intel GPU Runtime Upstreaming for Device](https://github.com/pytorch/pytorch/pull/116019), as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), this last PR covers the changes under lazy initialization.
# Design
This PR primarily offers support for multi-processing via lazy initialization. We lazily initialize our runtime, avoiding initializing XPU until the first time it is accessed. In our design, we extend `cuda_lazy_init` to `device_lazy_init`, which is a device-agnostic API that can support any backend, and change `maybe_initialize_cuda` to `maybe_initialize_device` to support lazy initialization for both CUDA and XPU while maintaining scalability.
# Additional Context
We adopt a similar design to CUDA. So we share some code with CUDA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116869
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/malfet
ghstack dependencies: #119248
## This diff
This optimization reduces calls to `texelFetch(uKernel, ...)` by 4.
We borrow MatMul's work to do the re-packing:
https://www.internalfb.com/code/fbsource/[7e8ef1b8adeda224a736f8cc4bf870e0a659df95]/xplat/caffe2/aten/src/ATen/native/vulkan/ops/Mm.cpp?lines=20%2C50
## Future optimizations
We are already batching reads from input/weight tensors, and writes to output tensor.
Here are other ideas, which I won't pursue for now. (2) is the most doable.
1. **Batch reads/writes along the dimension that is most commonly > 1.** For weights, the length dimension is definitely correct here, but input/outputs could potentially leverage the length dimensions too. However, `stride != 1` would complicate this optimization.
2. **Batch an optimal number of reads/writes.** Instead of default-ing to 4 elements (since that corresponds to 1 texel), consider more elements such as MatMul's 4x4 texel tile.
3. **Obscure shader compiler optimizations.** Since MatMul seemed to benefit from several seemingly equivalent ways to write code.
Differential Revision: [D53204674](https://our.internmc.facebook.com/intern/diff/D53204674/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118835
Approved by: https://github.com/SS-JIA, https://github.com/liuk22
Removes raising an error if a device_mesh has a parent.
The comment says that HSDP + TP is not supported, but I'm able to do 2D parallelism + HSDP fine. The only issues are:
- this check
- https://github.com/pytorch/pytorch/pull/118618
- a series of PRs related to checkpointing with 3D meshes that I will open
We currently monkeypatch for the above which I am slowly upstreaming.
I imagine torch will have a better, native integration eventually, but this check seems too aggressive in the meantime given DTensor now lets users do some things themselves (which is amazing 🎉)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118620
Approved by: https://github.com/Skylion007
```
$ python test/inductor/test_aot_inductor.py -k test_triton_kernel_reinterpret_view_mem_leak
# Before
RuntimeError:
Found following user inputs located at [0] are mutated. This is currently banned in the aot_export workflow.
If you need this functionality, please file a github issue.
fw_metadata=ViewAndMutationMeta(input_info=[InputAliasInfo(is_leaf=True, mutates_data=True, mutates_metadata=False, mutations_hidden_from_autograd=True, mutations_under_no_grad_or_inference_mode=False, mutates_storage_metadata=False, requires_grad=False, mutation_type=<MutationType.MUTATED_OUT_GRAPH: 3>),...)
# Now
Ran 6 tests in 13.851s
OK (skipped=4)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119219
Approved by: https://github.com/oulgen
Otherwise, emitting TD stats will fail with the following warning:
```
Emiting td_test_failure_stats
/Users/ec2-user/runner/_work/pytorch/pytorch/tools/testing/target_determination/heuristics/edited_by_pr.py:37: UserWarning: Can't query changed test files due to Command '['git', 'merge-base', 'origin/main', 'HEAD']' returned non-zero exit status 1.
warn(f"Can't query changed test files due to {e}")
```
Test plan: Observe that MPS jobs finish without those warnings
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119401
Approved by: https://github.com/atalman, https://github.com/huydhn
Summary:
## Issue
When there is Unicode non-decodable text in logs, `tail_logger` will stop working afterwards, i.e. f527390102
In the example, the process stopped producing Python logs after 17:20:21 until the job finished
```
[0]:I0201 17:20:21.338000 3429 gen_ai/genie_projects/llm/metaformers/reward_model_score.py:335] Progress: 118 batches out of 512 total batches. 23.05 % | (gpu mem: 25.8GB, free CPU mem: 1387.8GB)
I0201 17:39:14 Stopping twtask-main.service with Service Result: [success] Exit Code: [exited] Exit Status: [0]
```
At the end, a `UnicodeDecodeError` was thrown with no call stack.
## Fix
Use `errors="replace"` to avoid raising an exception when a `UnicodeDecodeError` would otherwise occur.
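A minimal illustration of the change in behavior (plain Python, not the actual tail_logger code):
```python
raw = b"Progress: 118 batches \xff\xfe out of 512"   # bytes that are not valid UTF-8

try:
    raw.decode("utf-8")                # old behavior: raises and kills the log tailer
except UnicodeDecodeError as e:
    print("decode failed:", e)

print(raw.decode("utf-8", errors="replace"))   # new behavior: bad bytes become U+FFFD
```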
Test Plan: f528854819
Differential Revision: D53483644
Co-authored-by: Jack Zhang <jackzh@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119298
Approved by: https://github.com/XilunWu
This PR adds the `reshard_after_forward: Union[bool, int]` arg and a `reshard()` method. The `reshard_after_forward` argument trades off communication and memory.
- `reshard_after_forward=True`: reshard parameters after forward; unshard (all-gather) in backward
- `reshard_after_forward=False`: no reshard of parameters after forward; no unshard (all-gather) in backward
- `reshard_after_forward: int`: reshard parameters to a smaller world size; unshard (all-gather) over small world size in backward
In comparison with DeepSpeed and existing FSDP:
- `reshard_after_forward=True` == `FULL_SHARD` == ZeRO-3
- `reshard_after_forward=False` == `SHARD_GRAD_OP` == ZeRO-2
- `reshard_after_forward=8` == ZeRO++
ZeRO-1 is `reshard_after_forward=False` without gradient reduction (implemented in a later PR). If we need gradient reduction on an iteration, then ZeRO-2 supersedes ZeRO-1.
We prefer a simple state transition between `SHARDED` / `SHARDED_POST_FORWARD` and `UNSHARDED`, where the state directly defines what tensors are registered to the module. In particular, we _do not_ have a state where the sharded parameters are registered but the unsharded parameters are still in GPU memory. This greatly simplifies our state transitions, but it means that parameters may be non-intuitively registered to the module (e.g. if only the root does not reshard after forward, then the root will be the only module without sharded parameters registered). To address this, we introduce a simple `reshard()` method that can force-reshard the parameters. This makes sense to me because the typical case does not care about the registered parameters after forward (in fact, for existing FSDP with `use_orig_params=False`, the unsharded parameters are still registered and are dangling tensors without storage).
I plan to expose a complementary `unshard(async_op: bool = True)` method in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118017
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
Summary:
This is a follow up to https://github.com/pytorch/pytorch/pull/118605 to remove `fold_quantize` flag from
`convert_pt2e`
Test Plan: CI
Differential Revision: D53247301
BC Breaking Note:
The `fold_quantize` flag is now set to True in `convert_pt2e`, and we'll fold the quantize op into the weight by default, so users will see a model size reduction by default after pt2e quantization.
2.2
```
folded_model = convert_pt2e(model, fold_quantize=True)
non_folded_model = convert_pt2e(model)
```
2.3
```
folded_model = convert_pt2e(model)
non_folded_model = convert_pt2e(model, fold_quantize=False)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118701
Approved by: https://github.com/andrewor14, https://github.com/leslie-fang-intel
Added `torch.__future__.{get/set}_swap_module_params_on_conversion`, which defaults to `False` for now, but we probably want to override this and default to `True` in `nn.Module._apply` if the input is a tensor subclass.
From offline discussion, for now we are **not** allowing `swap_tensor` after the first module forward has been run*** if the autograd graph is still alive. The reason being that `torch.utils.swap_tensors(t1, t2)` requires the `use_count` of both `TensorImpl`s associated with `t1` and `t2` to be 1. The first forward pass will install `AccumulateGrad` nodes on each param, which [bump the refcount of the associated TensorImpl](6cf1fc66e3/torch/csrc/autograd/variable.cpp (L307)). **Future work might be to swap the refs that the `AccumulateGrad` nodes hold if it is necessary.**
***From this, it might seem like we don't need to handle gradients. However, I still handle the grads for the edge case that the grads are set via `p.grad = grad` OR the autograd graph is no longer alive because the output has been garbage collected.
If any `swap_tensors` fails on any of the parameters in the `nn.Module` we raise an error.
**`RNNBase` overrides `nn.Module._apply()` and installs weakrefs on some parameters. As a result, all modules that inherit from `RNNBase` (`RNN`, `GRU` and `LSTM`) cannot use the `swap_tensors` path as of now.**
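A hedged usage sketch of the new toggle (nothing here beyond the APIs named above; the `Linear` conversion is just an arbitrary `_apply`-based operation):
```python
import torch
import torch.nn as nn

# Defaults to False, per the note above.
print(torch.__future__.get_swap_module_params_on_conversion())

torch.__future__.set_swap_module_params_on_conversion(True)

m = nn.Linear(4, 4)
m.to(torch.float64)    # goes through nn.Module._apply, now using torch.utils.swap_tensors
print(m.weight.dtype)  # torch.float64
```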
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117167
Approved by: https://github.com/albanD
ghstack dependencies: #118028
So, this is a little awkward, so I don't mind more thoughts on how best to do this.
Let's suppose that you have a graph break inside of an inlined function call. We are not actually going to print this graph break yet; instead, we are going to restart analysis so that we can run up until the inlined function call. When this happens, the only log message we ever get is the log to `graph_break` (seen here) reporting that a graph break has occurred.
In the current code, we don't print the fully formatted exception if you are only using `graph_breaks` logging. So the exception that induced the graph break has its traceback lost forever. For some classes of errors, esp., guard on data-dependent SymInt, this is quite bad.
With this change, we do print the traceback. On this sample program:
```
import torch
import torch._dynamo.config
torch._dynamo.config.capture_scalar_outputs = True
def g(x, y):
y = x.item()
if y < 3:
return x + 2
else:
return x + 3
@torch.compile()
def f(x, y):
y = y * y
return g(x, y)
f(torch.tensor(4), torch.randn(4))
```
It looks like this:
```
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] Graph break: Traceback (most recent call last):
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] File "/data/users/ezyang/b/pytorch/torch/_dynamo/variables/tensor.py", line 878, in evaluate_expr
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] return guard_scalar(self.sym_num)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] File "/data/users/ezyang/b/pytorch/torch/fx/experimental/symbolic_shapes.py", line 414, in guard_scalar
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] return guard_bool(a)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] File "/data/users/ezyang/b/pytorch/torch/fx/experimental/symbolic_shapes.py", line 663, in guard_bool
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] return a.node.guard_bool("", 0) # NB: uses Python backtrace
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] File "/data/users/ezyang/b/pytorch/torch/fx/experimental/sym_node.py", line 366, in guard_bool
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] r = self.shape_env.evaluate_expr(self.expr, self.hint, fx_node=self.fx_node)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] File "/data/users/ezyang/b/pytorch/torch/fx/experimental/recording.py", line 227, in wrapper
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] return fn(*args, **kwargs)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] File "/data/users/ezyang/b/pytorch/torch/fx/experimental/symbolic_shapes.py", line 3670, in evaluate_expr
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] concrete_val = self.size_hint(orig_expr)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] File "/data/users/ezyang/b/pytorch/torch/fx/experimental/symbolic_shapes.py", line 3403, in size_hint
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] raise self._make_data_dependent_error(result_expr, expr)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: It appears that you're trying to get a value out of symbolic int/float whose value is data-dependent (and thus we do not know the true value.) The expression we were trying to evaluate is u0 < 3 (unhinted: u0 < 3). For more information, run with TORCH_LOGS="+dynamic".
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] During handling of the above exception, another exception occurred:
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] Traceback (most recent call last):
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] File "/data/users/ezyang/b/pytorch/torch/_dynamo/symbolic_convert.py", line 469, in wrapper
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] return inner_fn(self, inst)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] File "/data/users/ezyang/b/pytorch/torch/_dynamo/symbolic_convert.py", line 1196, in CALL_FUNCTION
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] self.call_function(fn, args, {})
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] File "/data/users/ezyang/b/pytorch/torch/_dynamo/symbolic_convert.py", line 651, in call_function
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] self.push(fn.call_function(self, args, kwargs))
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] File "/data/users/ezyang/b/pytorch/torch/_dynamo/variables/functions.py", line 279, in call_function
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] return super().call_function(tx, args, kwargs)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] File "/data/users/ezyang/b/pytorch/torch/_dynamo/variables/functions.py", line 87, in call_function
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] return tx.inline_user_function_return(
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] File "/data/users/ezyang/b/pytorch/torch/_dynamo/symbolic_convert.py", line 657, in inline_user_function_return
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] File "/data/users/ezyang/b/pytorch/torch/_dynamo/symbolic_convert.py", line 2262, in inline_call
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] return cls.inline_call_(parent, func, args, kwargs)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] File "/data/users/ezyang/b/pytorch/torch/_dynamo/symbolic_convert.py", line 2372, in inline_call_
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] tracer.run()
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] File "/data/users/ezyang/b/pytorch/torch/_dynamo/symbolic_convert.py", line 787, in run
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] and self.step()
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] File "/data/users/ezyang/b/pytorch/torch/_dynamo/symbolic_convert.py", line 750, in step
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] getattr(self, inst.opname)(inst)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] File "/data/users/ezyang/b/pytorch/torch/_dynamo/symbolic_convert.py", line 431, in inner
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] eval_result = value.evaluate_expr(self.output)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] File "/data/users/ezyang/b/pytorch/torch/_dynamo/variables/tensor.py", line 880, in evaluate_expr
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] raise UserError( # noqa: TRY200
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] torch._dynamo.exc.UserError: Consider annotating your code using torch._constrain_as_*(). It appears that you're trying to get a value out of symbolic int/float whose value is data-dependent (and thus we do not know the true value.) The expression we were trying to evaluate is u0 < 3 (unhinted: u0 < 3). For more information, run with TORCH_LOGS="+dynamic".
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] For more information about this error, see: https://pytorch.org/docs/main/generated/exportdb/index.html#constrain-as-size-example
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] From user code at:
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] File "/data/users/ezyang/b/pytorch/b.py", line 16, in f
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] return g(x, y)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] File "/data/users/ezyang/b/pytorch/b.py", line 8, in g
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] if y < 3:
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]
```
The end of the log at the restarted computation can maybe be improved too. Right now it looks like this:
```
[2024-02-06 10:32:24,338] [0/0_1] torch._dynamo.symbolic_convert: [DEBUG] TRACE CALL_FUNCTION 2 [UserFunctionVariable(), LazyVariableTracker(), TensorVariable()]
[2024-02-06 10:32:24,338] [0/0_1] torch._dynamo.output_graph: [DEBUG] COMPILING GRAPH due to GraphCompileReason(reason='Consider annotating your code using torch._constrain_as_*(). It appears that you\'re trying to get a value out of symbolic int/float whose value is data-dependent (and thus we do not know the true value.) The expression we were trying to evaluate is u0 < 3 (unhinted: u0 < 3). For more information, run with TORCH_LOGS="+dynamic".\n\nFor more information about this error, see: https://pytorch.org/docs/main/generated/exportdb/index.html#constrain-as-size-example', user_stack=[<FrameSummary file /data/users/ezyang/b/pytorch/b.py, line 16 in f>, <FrameSummary file /data/users/ezyang/b/pytorch/b.py, line 8 in g>], graph_break=True)
```
An alternative to doing it this way is that I can make symbolic shapes print a warning log when guarding on an unbacked SymInt itself, so we don't have to worry about Dynamo generating the backtrace well. If, for the most part, the backtrace for other graph breaks is irrelevant, then this would seem to be a more expedient solution.
PTAL and submit your opinions.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119292
Approved by: https://github.com/yanboliang
Summary:
Previously, we were not fakifying module state explicitly in the nonstrict path.
This led to errors when modules were constructed under a fake mode, since the user-provided fake mode was clashing with the one that we had constructed internally to fakify the inputs.
This fixes things to use a single fake mode for everything.
As a side effect, this raised the question of how we ought to serialize state_dicts/constants that might be fake tensors. Naively calling torch.save understandably explodes—so this diff piggybacks on our infra for doing this on meta["val"]. Open to revising this, I'm low confidence that it's the best way to do it.
Test Plan: unit tests
Differential Revision: D53484942
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119297
Approved by: https://github.com/tugsbayasgalan
Note that this increases coverage from 1 config (vanilla SGD) to all the configs (13 optimizers at around 6-7 each). The test time seems fine though!
With the torch cuda synchronization:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (b6093c03)]$ python test/test_optim.py -k test_step_pre_hook -k test_step_post_hook
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
....................................................
----------------------------------------------------------------------
Ran 52 tests in 13.680s
OK
```
Excluding the torch cuda synchronization:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (916f6fe3)]$ python test/test_optim.py -k test_step_pre_hook -k test_step_post_hook
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
....................................................
----------------------------------------------------------------------
Ran 52 tests in 1.038s
OK
```
The old tests:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (916f6fe3)]$ python test/test_optim.py -k test_pre_hook -k test_post_hook
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
..
----------------------------------------------------------------------
Ran 2 tests in 0.518s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119288
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #119283
Since accumulate grad may steal the gradient's `c10::Storage`, we can't reuse the op otherwise the gradient will get overwritten. From benchmarks, using the inductor's codegen'd _empty_strided_cpu/cuda and assigning to it has lower overhead than deep copying the gradient and reusing its buffer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119334
Approved by: https://github.com/jansel
ghstack dependencies: #118817
Finally, we have this PR to merge the allow_in_graph/inline/skip trace rules into ```trace_rules.lookup_inner```, where we can define and look up trace rules at both the function level and the file level. Going forward, this is the central place where we define and consult Dynamo trace rules for any function.
* ```trace_rules.lookup``` is the API that can return allow_in_graph, inline or skip.
* ```skipfiles.check``` is the API that can return inline or skip, since we have multiple places that only do the inline/skip check.
* I'll move ```skipfiles.check``` to ```trace_rules.check``` as one of the follow-ups.
* Both functions consult ```trace_rules.lookup_inner``` to get the tracing rule.
To avoid a single big PR, I left a few items as the follow-ups:
* Remove ```skipfiles.py``` and merge the code into ```trace_rules.py```.
* We do a double check in ```symbolic_convert.check_inlineable```; we will refactor and simplify it. We should only do the inline/skip check before generating ```SkipFilesVariable``` and ```UserFunctionVariable```.
* Rename ```SkipFilesVariable``` as ```SkipFunctionVariable```, since we only handle functions.
* The inline/skip reasons are not logged for some cases, since the new lookup framework doesn't always return inline/skip reasons. I'll refactor logging to record the inline/skip reason in the next step.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118971
Approved by: https://github.com/jansel
Addresses issue https://github.com/pytorch/pytorch/issues/117383
The implementation exposes `--local-ranks-filter`, which filters, by rank, which files we pass to `TailLog` (used in torchrun to determine which logs to output to stdout/stderr).
## Behavior
### with --tee
Currently --tee is implemented as --redirect to a file, and streams the file to the console using `tail`. When --tee is specified, file logs will be unaffected and we will only filter the output to the console.
### with --redirect
When --redirect is specified without --tee, nothing is logged to console, so we no-op.
### with neither
When neither --tee or --redirect are specified, torchrun uses empty string "" to indicate logging to console. We intercept this empty string, and redirect it to "/dev/null" to not print to console.
The API also allows a per-rank configuration for --tee and --redirect, which is also supported by this filter implementation.
## Usage
### without --tee
```
> TORCH_LOGS_FORMAT="%(levelname)s: %(message)s" TORCH_LOGS="graph" torchrun --standalone --nproc_per_node=2 --role rank --local_rank_filter=0 t.py
hello from rank 0 python
DEBUG: TRACED GRAPH
__compiled_fn_0 <eval_with_key>.0 opcode name target args kwargs
------------- ------ ----------------------- --------- --------
placeholder l_x_ L_x_ () {}
call_function mul <built-in function mul> (l_x_, 5) {}
output output output ((mul,),) {}
...
```
### with --tee
```
> TORCH_LOGS_FORMAT="%(levelname)s: %(message)s" TORCH_LOGS="graph" torchrun --standalone --nproc_per_node=2 --role rank --tee 3 --local_rank_filter=0 t.py
[rank0]:hello from rank 0 python
[rank0]:DEBUG: TRACED GRAPH
[rank0]: __compiled_fn_0 <eval_with_key>.0 opcode name target args kwargs
[rank0]:------------- ------ ----------------------- --------- --------
[rank0]:placeholder l_x_ L_x_ () {}
[rank0]:call_function mul <built-in function mul> (l_x_, 5) {}
[rank0]:output output output ((mul,),) {}
...
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118562
Approved by: https://github.com/wconstab, https://github.com/wanchaol
Attempt #2 for https://github.com/pytorch/pytorch/pull/117875 to fix https://github.com/pytorch/pytorch/issues/112090.
Summary of changes:
- ~Changed CacheEntry linked list into a doubly-linked list structure to support deletion.~ (done by C++ refactor)
- Added CacheEntry and ExtraState borrowed references to GuardFn so that GuardFn can tell ExtraState to delete CacheEntry when the GuardFn is invalidated.
- ~Added ExtraState raw reference to CacheEntry so that we can get ExtraState to correctly point to the first CacheEntry if it gets deleted.~ (done by C++ refactor)
- CacheEntry destructor needs to reset GuardFn refs to ExtraState/CacheEntry in order to prevent use-after-free.
- code_context values that are nn.GraphModules need to be weakrefs in order to prevent circular references.
- Added tests that check for memory leaks and cache deletion operations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119107
Approved by: https://github.com/jansel
The previous fix https://github.com/pytorch/pytorch/pull/118981 misses some corner cases. It works when both LazyGraphModule and compiled-autograd are enabled, but it fails with a FakeTensorMode mismatch error again if LazyGraphModule + CompiledAutograd + DynamicShape are all enabled. Note that disabling any of the three does not trigger the issue.
The reason why enabling DynamicShape causes the previous fix to stop working is that we call the bw_compiler here before running the backward pass if there are symints saved for backward: 73f0fdea5b/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py (L382)
The bw_compiler may cause extra GraphModule recompilation on the bw_module, which causes its forward method to become the lazy one again. The fix is to delay applying the previous fix until after the potential extra call to the bw_compiler.
Repro on hf_Whisper:
```
CUDA_VISIBLE_DEVICES=1 time benchmarks/dynamo/torchbench.py -dcuda --training --backend=inductor --disable-cudagraphs --accuracy --only hf_Whisper --repeat 1 --compiled-autograd --dynamic-batch-only
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119311
Approved by: https://github.com/xmfan, https://github.com/jansel
Fix https://github.com/pytorch/pytorch/issues/118787
In the compiled function, calls to random() are replaced with a single function call
to a function that generates all the random variables.
The random calls encountered during compilation used to be tracked inside a variable
stored inside the instruction translator. When there are nested translators, the tracked
calls used to get lost when the inner instruction translator popped out.
This diff fixes that by moving the tracked calls to the output graph, which is shared across translators that are generating the same function.
More details about the issue and why this solution is picked are in the github issue above.
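A hedged sketch of the nested-translator pattern being fixed (not the exact repro from the issue):
```python
import random
import torch

def inner(x):
    # random() seen while inlining `inner` is tracked by the inner translator ...
    return x + random.random()

@torch.compile(backend="eager")
def f(x):
    # ... and must not be lost when control returns to the translator for `f`.
    return inner(x) * random.random()

print(f(torch.ones(2)))
```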
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119218
Approved by: https://github.com/jansel, https://github.com/anijain2305
Summary:
With the compiled PyTorch module, in execution_trace_observer.cpp, function convertIValue calls TensorImpl->storage_offset(). That function call will trigger a recursive call into recordOperatorStart. It will cause a deadlock on ob.g_mutex.
This DIFF is to fix this deadlock by replacing std::mutex with std::recursive_mutex.
Since PyTorch only has one thread for FWD and one thread for BWD, contention is very low, so performance should NOT be a concern.
Test Plan:
Unit Test
buck test mode/dev-nosan caffe2/test:profiler -- test_execution_trace_with_pt2
Differential Revision: D53299183
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119242
Approved by: https://github.com/aaronenyeshi
Summary: This commit adds a util for PT2E quantization users
to call `model.train()` and `model.eval()` without error.
Instead, these will automatically call the equivalent
`move_exported_model_to_train/eval` for the user, which only
switch behavior for special ops like dropout and batchnorm.
This enables users to onboard to the PT2E flow more easily.
Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_allow_exported_model_train_eval
Reviewers: jerryzh168, tugsbayasgalan, zhxchen17
Subscribers: jerryzh168, tugsbayasgalan, zhxchen17, supriyar
Differential Revision: [D53426636](https://our.internmc.facebook.com/intern/diff/D53426636)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119091
Approved by: https://github.com/jerryzh168, https://github.com/tugsbayasgalan, https://github.com/zhxchen17
Fixes https://github.com/pytorch/pytorch/issues/117361
The implementation here slightly diverges from what was proposed in the issue, so I will recap what this PR is doing here. Today, when doing computations involving size-like unbacked SymInts, we assume for all operations that the compile time range of the integer is `[2, inf]`, even though at runtime we also accept zero and one.
This PR removes the carte blanche assumption, and instead does the analysis in a much more limited and controlled fashion: only for guards which we have designated as "size oblivious" are we willing to do the analysis under the assumption that the range of all size-like unbacked SymInts is `[2, inf]`; otherwise, we will faithfully only do analysis with `[0, inf]` (or whatever the user provided) bounds.
The infra pieces of this PR are:
* Remove runtime_var_to_range from torch/fx/experimental/symbolic_shapes.py; modify `_constrain_range_for_size` to refine the range without clamping min to 2, and instead add the symbol to a `size_like` set in the ShapeEnv
* When evaluating an expression, if the expression is requested to be evaluated in a `size_oblivious` way, we attempt to statically compute the value of the expression with the assumption that all symbols in `size_like` are updated to assume that they are `>= 2`.
* Add Python and C++ APIs for guarding on a SymBool in a size-oblivious way. In C++, I also need to add some helpers for performing symbolic comparisons, since the stock comparisons immediately specialize in the "normal" way.
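For the Python side, a hedged sketch of what a size-oblivious guard looks like in framework-style code (the entry point is assumed to be `guard_size_oblivious` from `torch.fx.experimental.symbolic_shapes`; the summary above names the capability but not the function):
```python
import torch
from torch.fx.experimental.symbolic_shapes import guard_size_oblivious

def reports_empty(t: torch.Tensor) -> bool:
    # With a size-like unbacked SymInt numel, a plain `t.numel() == 0` guard can
    # raise a data-dependent error; the size-oblivious form is evaluated assuming
    # size-like symbols are >= 2, so this branch resolves at compile time.
    return guard_size_oblivious(t.numel() == 0)

print(reports_empty(torch.randn(3)))  # plain tensors fall through to the eager answer: False
```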
The rest of the changes of the PR are marking various spots in PyTorch framework code as size oblivious, based on what our current test suite exercises.
As you review the places where we have marked things as size oblivious, it may become clear why I ended up not opting for the "designate a branch as the default branch when it's not statically obvious which way to go": for some of the conditions, this answer is rather non-obvious. I think potentially there is another refinement on top of this PR, which is something like "I don't care if you can't figure it out with ValueRange analysis, go down this path anyway if there are unbacked sizes involved." But even if we add this API, I think we are obligated to attempt the ValueRange analysis first, since it can lead to better outcomes sometimes (e.g., we are able to figure out that something is contiguous no matter what the unbacked size is.)
When is it permissible to mark something as size oblivious? Heuristically, it is OK anywhere in framework code if it gets you past a guard on unbacked SymInt problem. It is somewhat difficult to provide a true semantic answer, however. In particular, these annotations don't have any observational equivalence guarantee; for example, if I have `torch.empty(u0, 1).squeeze()`, we will always produce a `[u0]` size tensor, even though if `u0 == 1` PyTorch will actually produce a `[]` size tensor. The argument that I gave to Lezcano is that we are in fact defining an alternate semantics for a "special" size = 0, 1, for which we have these alternate eager mode semantics. In particular, suppose that we have a constant `special1` which semantically denotes 1, but triggers alternate handling rules. We would define `torch.empty(special1, 1).squeeze()` to always produce a `[special1]` size tensor, making its semantics coincide with unbacked SymInt semantics. In this model, the decision to designate guards as size oblivious is simply a user API question: you put them where ever you need some handling for special1! As we conservatively error out whenever it is not obvious what `special1` semantics should be, it is always valid to expand these semantics to cover more cases (although you can always choose the wrong semantics!)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118579
Approved by: https://github.com/eellison, https://github.com/lezcano
Before this PR, we have a graph break for:
```python
def f():
if torch.cuda.current_stream() is not None:
return torch.randn(2, 2)
torch.compile(f, backend="eager", fullgraph=True)()
```
This PR supports comparison ops between StreamVariable and ConstantVariable by returning a constant.
It's safe to return a constant in this case because the StreamVariable is guarded by ID_MATCH when created.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119199
Approved by: https://github.com/yifuwang, https://github.com/anijain2305, https://github.com/jansel
Fixes https://github.com/pytorch/pytorch/issues/119238
Here's what it looks like now:
```
$ TORCH_LOGS=+torch._dynamo.convert_frame python a.py
[2024-02-05 18:52:07,248] [0/0] torch._dynamo.convert_frame: [DEBUG] torchdynamo start compiling f /data/users/ezyang/b/pytorch/a.py:3, stack (elided 5 frames):
[2024-02-05 18:52:07,248] [0/0] torch._dynamo.convert_frame: [DEBUG] File "/data/users/ezyang/b/pytorch/a.py", line 7, in <module>
[2024-02-05 18:52:07,248] [0/0] torch._dynamo.convert_frame: [DEBUG] f(torch.randn(2))
[2024-02-05 18:52:07,248] [0/0] torch._dynamo.convert_frame: [DEBUG] File "/data/users/ezyang/b/pytorch/torch/_dynamo/eval_frame.py", line 453, in _fn
[2024-02-05 18:52:07,248] [0/0] torch._dynamo.convert_frame: [DEBUG] return fn(*args, **kwargs)
[2024-02-05 18:52:07,248] [0/0] torch._dynamo.convert_frame: [DEBUG]
$ cat a.py
import torch
@torch.compile
def f(x):
return x * 2
f(torch.randn(2))
```
The eval_frame frame is intentionally present, since what happens is you run the torch.compile wrapper, and then you actually hit the user frame to be compiled.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119251
Approved by: https://github.com/yanboliang, https://github.com/mlazos
Fixes Issue #73792
This is a duplicate of pull request #73864. It's a small bugfix that should have happened a long time ago, but it didn't because I didn't actually follow up with the pull request after originally submitting. That's my bad. Trying to remedy the error.
This contains a fix to _pad_mixture_dimension, which intends to count the number of dimensions in its referent tensors, but accidentally counts the number of elements (and can thus end up creating tensors with potentially thousands of dimensions by mistake). Also contains a single test for the fixed behavior.
Co-authored-by: Jeffrey Wan <soulitzer@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118947
Approved by: https://github.com/soulitzer
## Problem
A user-defined Triton kernel grid may use a sympy magic method like `Max`. This comes in the form of a `sympy.Expr`, namely `sympy.core.function.FunctionClass`.
Handling this is not trivial since `user_defined_kernel_grid_fn_code` is used in Eager & Inductor. Eager usage below.
## Approach
Pass in the wrapper when Inductor codegens the grid with ints/sympy.Expr, so we can utilize wrapper functions such as `codegen_shape_tuple()`.
Differential Revision: D53367012
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119165
Approved by: https://github.com/aakhundov
### Descriptions
Following Flash Attention v2, optimize the softmax by moving the division by the sum out of the KV inner loop.
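A self-contained PyTorch sketch of the idea (single head, fp32, no scaling or masking): a running max and sum are maintained across KV blocks, and the division by the sum happens once after the loop rather than inside it.
```python
import torch

def attn_deferred_div(q, k, v, block=4):
    acc = torch.zeros(q.shape[0], v.shape[1])
    row_max = torch.full((q.shape[0], 1), float("-inf"))
    row_sum = torch.zeros(q.shape[0], 1)
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = q @ kb.T
        new_max = torch.maximum(row_max, s.max(dim=-1, keepdim=True).values)
        scale = torch.exp(row_max - new_max)          # rescale previous partial results
        p = torch.exp(s - new_max)
        acc = acc * scale + p @ vb                    # unnormalized accumulator
        row_sum = row_sum * scale + p.sum(dim=-1, keepdim=True)
        row_max = new_max
    return acc / row_sum                              # single division after the KV loop

q, k, v = (torch.randn(8, 16) for _ in range(3))
ref = torch.softmax(q @ k.T, dim=-1) @ v
assert torch.allclose(attn_deferred_div(q, k, v), ref, atol=1e-5)
```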
### Performance
Stable Diffusion V2.1 on GNR
| Version     | Kernel time (s) | Speedup |
|-------------|-----------------|---------|
| BF16 Before | 28.67           | -       |
| BF16 After  | 23.55           | 17.86%  |
| FP32 Before | 54.20           | -       |
| FP32 After  | 49.47           | 8.73%   |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118957
Approved by: https://github.com/jgong5, https://github.com/drisspg
- Added support for serializing the auto_functionalization op, which
required adding the functions `serialize_arbitrary_inputs` and
`serialize_arbitrary_outputs` which will serialize the inputs/outputs
without needing a schema, since HOOs do not have a schema.
- Added support for serializing user input mutations
- Added support for serializing operator inputs. They just get turned
into strings.
Differential Revision: [D53331039](https://our.internmc.facebook.com/intern/diff/D53331039)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118810
Approved by: https://github.com/suo
For code like following:
```python
import torch
def f():
    a = {"a": torch.randn(2, 2)}
    a.clear()
    return a

torch.compile(f, backend="eager", fullgraph=True)()
```
Before this PR, we had a graph break:
```
torch._dynamo.exc.Unsupported: call_method ConstDictVariable() clear [] {}
```
Test Plan:
Added new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119197
Approved by: https://github.com/jansel, https://github.com/anijain2305
Differential Revision: D53398312
## Problem
Currently, if a sympy expression that uses a magic method like `Max` is passed as an argument to ProxyExecutor, then C++ compilation will fail. We need to use `std::max` instead.
```
# What we see
aoti_torch_proxy_executor_call_function(..., std::vector<int64_t>{Max(1025, u1)}.data(), ...);
# What we want
aoti_torch_proxy_executor_call_function(..., std::vector<int64_t>{std::max(1025L, u1)}.data(), ...)
```
## Approach
Use C++ wrapper's expression printer to handle this conversion
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119166
Approved by: https://github.com/aakhundov
The NCCL backend requires CUDA (including devices) to be available. So don't use that backend by default if that isn't the case to avoid the following error when creating a CPU-only device mesh:
> RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
Fixes #117746
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119149
Approved by: https://github.com/kwen2501
Make multi-kernel work with cpp-wrapper. multi-kernel generates two equivalent variants for a reduction, and at runtime the faster one is picked. But cpp-wrapper needs to save the cubin file during codegen, so the two did not work together initially.
Thanks to Jason for suggesting a neat way to integrate them. cpp-wrapper currently does two codegen passes. For the first pass, we still generate multi-kernel code and run it; for the second pass, we load the cubin file for the faster kernel directly. Multi-kernel Python code is not generated for the second pass since it should not be needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117813
Approved by: https://github.com/jansel
Right now, `ModuleInfo.dtypes` defaults to `torch.testing._internal.common_dtype.floating_types()`, and almost no ModuleInfos override this (so only `float32` and `float64` are tested).
This is the first step to clean up/improve dtype testing for `ModuleInfos` and fix#116626.
Follow-up PRs will update `dtypes=` (and perhaps `dtypesIf{Device}`, if it makes sense) for each `ModuleInfo`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119039
Approved by: https://github.com/janeyx99
Currently, HSDP validates that all intra/inter node PGs are the same. This makes sense if you are only using HSDP with no other forms of parallelism and is a nice but not necessary sanity check.
However, if you want to mix HSDP with other forms, say tensor parallelism on the FFN of a transformer block, the intra/inter node PGs will be different for that layer. This check raises errors in this scenario, so we need to remove this assumption.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112435
Approved by: https://github.com/wz337, https://github.com/Skylion007
Example https://github.com/pytorch/pytorch/actions/runs/7562281351/job/20592425611?pr=117079 (The code to delete branches isn't being run, it's just listing the branches it wants to delete)
Internal code: https://fburl.com/code/hdvvbfkj
Threshold for a branch with a PR is 30 days, regardless of whether the PR is merged (compared to 3 days if merged and 30 days if closed). Threshold for a branch without a PR is 1.5 years (same internally).
Threshold of ~400 queries to github so it doesn't hit token usage limits. Currently this leads to about 350 branches deleted per run.
Only query for the last 90 days of updated PRs to reduce token usage, so if a branch has a PR but it was updated 90+ days ago, it will think it doesn't have a PR and will wait for the 1.5 years branch update check instead, regardless of whether the PR is open or closed.
I tested that it could delete my own branch and it worked.
labeled with test-config/crossref because I just want the smallest test config possible to reduce CI usage
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117079
Approved by: https://github.com/malfet
Type checking Python is a pain. Here are my learnings:
* The types for heavily polymorphic code are going to be verbose; no way around it. I was originally hoping I could lean on polymorphism with a bounded TypeVar to compactly write signatures for many of the ValueRanges methods, but I ran into some mypy bugs I couldn't work around. Writing out all the types explicitly and using `@overload` liberally works pretty well (see the sketch after this list), so I recommend people do that instead of trying to do fancy things.
* Sympy is missing annotations for assumptions, because they are all metaprogrammed. I don't really relish maintaining a typeshed for sympy, so I wrote a small mypy plugin to add them in.
* GADT style refinement is... just not a good idea in practice. Mypy easily gets confused about whether a return value from a refined section is allowed for the outer return type. So many of these have been replaced with less informative implementation types and more informative external types via overloads. Hopefully this is good for use sites.
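A tiny illustrative sketch of the "@overload liberally" advice, with hypothetical types standing in for the real ValueRanges signatures:
```python
from typing import Union, overload

import sympy

class VR:
    """Stand-in for ValueRanges: callers see one precise signature per accepted type."""

    @overload
    def add(self, other: "VR") -> "VR": ...
    @overload
    def add(self, other: sympy.Expr) -> "VR": ...
    @overload
    def add(self, other: int) -> "VR": ...

    def add(self, other: Union["VR", sympy.Expr, int]) -> "VR":
        # The implementation signature is the permissive union; the overloads
        # above are what type checkers use at call sites.
        return self
```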
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118870
Approved by: https://github.com/Skylion007, https://github.com/albanD
Update all_gather to support HSDP + TP.
Currently, the `_all_gather_dtensor` function for dtensors only replaces the first dimension with Replicate (the FSDP dimension) and does not touch the second dimension (which is assumed to be the TP dimension). With HSDP, we have two dimensions ahead of the TP dimension instead of one. This PR updates the function to replace all non-TP dimensions with Replicate when running the all-gather.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118638
Approved by: https://github.com/fegin, https://github.com/awgu, https://github.com/wz337
This reverts commit a5a63db3bf937a6eff993d1222fab18cc63f9cb2.
Fixes #ISSUE_NUMBER
Reverts #118368
Got reverted internally, but the branch got deleted so the automation didn't work.
Mildly edited stack trace
```
...
return torch._dynamo.disable(fn, recursive)(*args, **kwargs)
File "torch/_dynamo/eval_frame.py", line 453, in _fn
return fn(*args, **kwargs)
File "torch/_dynamo/external_utils.py", line 25, in inner
return fn(*args, **kwargs)
File "torch/fx/experimental/proxy_tensor.py", line 635, in dispatch_trace
graph = tracer.trace(root, concrete_args)
File "torch/fx/experimental/proxy_tensor.py", line 995, in trace
res = super().trace(root, concrete_args)
File "torch/_dynamo/eval_frame.py", line 453, in _fn
return fn(*args, **kwargs)
File "torch/_dynamo/external_utils.py", line 25, in inner
return fn(*args, **kwargs)
File "torch/fx/_symbolic_trace.py", line 793, in trace
(self.create_arg(fn(*args)),),
File "torch/fx/experimental/proxy_tensor.py", line 665, in wrapped
out = f(*tensors)
File "<string>", line 1, in <lambda>
File "torch/_functorch/_aot_autograd/traced_function_transforms.py", line 357, in _functionalized_f_helper
f_outs = fn(*f_args)
File "torch/_functorch/_aot_autograd/traced_function_transforms.py", line 68, in inner_fn
outs = fn(*args)
File "torch/_functorch/_aot_autograd/utils.py", line 161, in flat_fn
tree_out = fn(*args, **kwargs)
File "torch/_functorch/_aot_autograd/traced_function_transforms.py", line 618, in functional_call
out = PropagateUnbackedSymInts(mod).run(
File "torch/fx/interpreter.py", line 145, in run
self.env[node] = self.run_node(node)
File "torch/_functorch/_aot_autograd/traced_function_transforms.py", line 593, in run_node
result = super().run_node(n)
File "torch/fx/interpreter.py", line 202, in run_node
return getattr(self, n.op)(n.target, args, kwargs)
File "torch/fx/interpreter.py", line 274, in call_function
return target(*args, **kwargs)
File "torch/_ops.py", line 571, in __call__
return self_._op(*args, **kwargs)
File "torch/_subclasses/functional_tensor.py", line 380, in __torch_dispatch__
outs_unwrapped = func._op_dk(
File "torch/utils/_stats.py", line 20, in wrapper
return fn(*args, **kwargs)
File "torch/fx/experimental/proxy_tensor.py", line 744, in __torch_dispatch__
return self.inner_torch_dispatch(func, types, args, kwargs)
File "torch/fx/experimental/proxy_tensor.py", line 779, in inner_torch_dispatch
return proxy_call(self, func, self.pre_dispatch, args, kwargs)
File "torch/fx/experimental/proxy_tensor.py", line 423, in proxy_call
r = maybe_handle_decomp(proxy_mode, func, args, kwargs)
File "torch/fx/experimental/proxy_tensor.py", line 1225, in maybe_handle_decomp
return CURRENT_DECOMPOSITION_TABLE[op](*args, **kwargs)
File "torch/_decomp/decompositions.py", line 4322, in scaled_dot_product_flash_attention_for_cpu
torch._check(
File "torch/__init__.py", line 1133, in _check
_check_with(RuntimeError, cond, message)
File "torch/__init__.py", line 1116, in _check_with
raise error_type(message_evaluated)
RuntimeError: query must be FP32, FP64, BF16 but got torch.float16
While executing %_scaled_dot_product_flash_attention_for_cpu : [num_users=1] = call_function[target=torch.ops.aten._scaled_dot_product_flash_attention_for_cpu.default](args = (%l_q_, %l_k_, %l_v_), kwargs = {attn_mask: %l_attn_mask_})
Original traceback:
File "executorch/backends/xnnpack/partition/graphs/sdpa.py", line 34, in forward
return torch.nn.functional.scaled_dot_product_attention(
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119204
Approved by: https://github.com/kit1980
It's inefficient to split the remaining parts of the module name on '.' just to join them back again. It's more idiomatic and efficient to use `maxsplit=1`, which keeps the remainder intact. This improves best-case time and space complexity, since the scan can terminate at the first `.` and only two parts are returned in the list.
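A small sketch of the idiom, with an illustrative module path:
```python
name = "layer1.block.conv.weight"

# Before: split every component, then re-join the tail.
parts = name.split(".")
head, rest = parts[0], ".".join(parts[1:])

# After: the scan stops at the first '.', and the remainder stays intact.
head, rest = name.split(".", maxsplit=1)
```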
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119145
Approved by: https://github.com/Skylion007
Summary: As titled. Added support for fuse_split_linear_add in pre-grad passes based on the predispatch IR.
Test Plan: TORCH_LOGS=inductor,aot buck2 run mode/opt mode/inplace caffe2/test/inductor/fb:test_split_cat_fx_passes_aten_fb
Differential Revision: D53302168
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118983
Approved by: https://github.com/kflu, https://github.com/chenyang78
Fixes #114285
(However, we still hit a NotImplementedError:
```NotImplementedError: The operator 'aten::_linalg_svd.U' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.```)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114771
Approved by: https://github.com/lezcano
@xmfan and @fegin reported that _LazyGraphModule ( https://github.com/pytorch/pytorch/pull/117911 ) makes nanogpt training fail with compiled autograd.
We have a repro: ``` python benchmarks/dynamo/torchbench.py --training --backend=inductor --disable-cudagraphs --accuracy --only nanogpt --repeat 1 --compiled-autograd ```
but it's still mysterious how to trigger the issue with a toy model.
The error message for the failure is https://gist.github.com/shunting314/6402a6388b3539956090b6bc098952fb . In compile_fx we call `detect_fake_mode`. This function looks for an active FakeTensorMode from both the TracingContext and the example inputs. The error is triggered because we find different FakeTensorModes from these two sources.
Although I don't know what really causes the discrepancy in FakeTensorMode above, the fix here is to force _LazyGraphModule recompilation if compiled autograd is enabled. This does not hurt compilation time most of the time, because we will call the graph module in the backward pass anyway when compiled autograd is enabled: 855d5f144e/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py (L705)
Let me know if we can have a better fix.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118981
Approved by: https://github.com/jansel
Summary:
The subgroup-reuse logic is causing GLOO to time out on two internal modelstore tests (relevant tests in the test plan).
We are temporarily disabling subgroup re-use while root-causing, so that the internal tests can run again; they are currently omitted, as shown in T176426987.
Test Plan:
CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118940
Approved by: https://github.com/wanchaol
UPDATE - I changed the PR because, from discussion with @jansel, it was clear that someone else was holding on to a reference to f_locals. This PR now solves that problem first. I removed the eval_frame.c part because it was failing tests that use `exec` or `eval` with weird errors like `no no locals found when storing 'math'`. I will debug that in a separate PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118447
Approved by: https://github.com/Skylion007, https://github.com/jansel
ghstack dependencies: #118975, #118420
Summary:
Expose an option to enable libuv in TCPStore based rendezvous backend that will allow better scaling.
Libuv support has been added recently and allows scaling for more than 2K nodes.
Test Plan: Unit tests
Differential Revision: D53335860
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118944
Approved by: https://github.com/wconstab
Summary: NativeCachingAllocator has a global lock which shows contention when one process uses multiple GPUs. The lock is required to look up a Block from a pointer. We can make the locking more fine-grained to reduce the contention.
Test Plan: existing unit tests; verified on prod models using eight GPUs, showing double-digit improvements.
Differential Revision: D52493091
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118550
Approved by: https://github.com/albanD
### Summary
Run the relevant tests in `test/distributed/_tensor/test_dtensor_compile.py` and `test/distributed/test_device_mesh.py` both with native funcol enabled and with it disabled.
All tests except `test_tp_compile_comm_reordering` pass. This is expected because the native funcols have slightly different IRs, so the reordering pass needs to be adjusted. This test is disabled for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118437
Approved by: https://github.com/LucasLLC
ghstack dependencies: #118910, #118911
I was just playing around with improving the typing of symbolic_shapes. The PR is not "complete", but I particularly wanted feedback on whether people liked making ValueRanges Generic; it seems that distinguishing whether you have an Expr ValueRange or a SympyBoolean ValueRange is a lot of trouble downstream. Using TypeGuard, we can perform refinements on the generic parameter inside methods, although we still have to cast back to ValueRange[T] due to https://github.com/python/mypy/issues/14425#issuecomment-1914852707
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118529
Approved by: https://github.com/Skylion007
* The TODOs in `test/test_nestedtensor.py` have been mitigated; I keep the issue for reference.
* ~~The TODOs in `test/test_ops_fwd_gradients.py` doesn't apply anymore~~
* The TODOs in `run_test.py` to support disabling C++ tests are probably not going to happen. I have never seen a flaky C++ test that needed to be disabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119113
Approved by: https://github.com/kit1980
These are convenience methods that create a dictionary from placeholders, making the code more compact.
Also added an overloaded `runMPSGraph` function that takes a Placeholder instead of an output dictionary, as the majority of operators have just one output.
Typical change looks as follows
```patch
- NSDictionary<MPSGraphTensor*, MPSGraphTensorData*>* feeds = @{
- selfPlaceholder.getMPSGraphTensor() : selfPlaceholder.getMPSGraphTensorData(),
- };
- NSDictionary<MPSGraphTensor*, MPSGraphTensorData*>* results =
- @{outputPlaceholder.getMPSGraphTensor() : outputPlaceholder.getMPSGraphTensorData()};
- runMPSGraph(stream, cachedGraph->graph(), feeds, results);
+ auto feeds = dictionaryFromPlaceholders(selfPlaceholder);
+ runMPSGraph(stream, cachedGraph->graph(), feeds, outputPlaceholder);
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119077
Approved by: https://github.com/kit1980, https://github.com/albanD
This is part of the work to support cross entropy in dtensor.
This PR doesn't support nll_loss computation with input sharded on the channel dimension yet. In that case, redistribution to Replicate is needed in sharding propagation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118917
Approved by: https://github.com/wanchaol
The torch "fake" ndarray had some mismatches vs numpy.ndarray which caused test_sparse_to_sparse_compressed to fail under dynamo.
This also fixes (because the test now hits it) a problem where unpacking a sequence with the incorrect number of args would assert in dynamo instead of graph breaking (because it would throw an exception). Added a unit test for this condition.
Fixed:
- torch._numpy._ndarray.astype() (actually used by the test)
- torch._numpy._ndarray.put() (drive-by discovery)
- torch._numpy._ndarray.view() (drive-by discovery)
(burndown item 7)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117952
Approved by: https://github.com/yanboliang
ghstack dependencies: #117951
test_compressed_layout_conversions_coverage verifies torch's conversions between different memory layouts using numpy as a reference. Since numpy doesn't support the BSC format, the test previously just skipped it; instead, fake it by using a transposed BSR format.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117951
Approved by: https://github.com/zou3519
Summary:
Error running llama in xplat, where the half type isn't part of c10_mobile targets. See: D53158320
This diff:
- Creates a `torch_mobile_all_ops_et` target, which is the same as `torch_mobile_all_ops`, except with a preprocessor flag (C10_MOBILE_HALF) to support Half type
- Check C10_MOBILE_HALF in LinearAlgebra.cpp and include it
- Use `torch_mobile_all_ops_et` for executorch, instead of `torch_mobile_all_ops`.
Considerations:
- Using `torch_mobile_all_ops_et` across executorch means that our runtime binary size for xplat aten increases (see test plan for increase amount, thanks tarun292 for the pointer). This may be okay, as aten mode isn't used in production.
Test Plan:
Run language llama in xplat:
```
buck2 run xplat/executorch/examples/models/llama2:main_aten -- --model_path llama-models/very_new_checkpoint_h.pte --tokenizer_path llama-models/flores200sacrebleuspm.bin --prompt 'fr Hello' --eos
```
And in fbcode:
```
buck2 run fbcode//executorch/examples/models/llama2:main_aten -- --model_path llama-models/very_new_checkpoint_h.pte --tokenizer_path llama-models/flores200sacrebleuspm.bin --prompt 'fr Hello' --eos
```
Test executor_runner size increase with:
```
buck2 build fbcode//executorch/sdk/fb/runners:executor_runner_aten
```
| | original | this diff (+half dtype) | diff |
|---|---|---|---|
| unstripped | 214975784 | 214976472 | +688 |
| stripped | 71373488 | 71373808 | +320 |
Differential Revision: D53292674
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118831
Approved by: https://github.com/larryliu0820
### Motivation
Despite our plan to reduce gloo usage, it is still widely used as a testing tool (in both the PyTorch CI and user tests) for code that only uses nccl in real-world scenarios. There are some coverage issues around the all-gather and reduce-scatter variants, which are currently worked around in ugly ways (e.g. [this](b9e86bc93d/torch/distributed/_functional_collectives_impl.py (L216-L219)) and [this](b9e86bc93d/torch/distributed/_functional_collectives_impl.py (L262-L272))). For native funcol I ran into the same issues, but I'd rather just fix the coverage.
### This PR
We already have a fallback impl for `_reduce_scatter_base`, which is composed from all-reduce + scatter. The scatter was not necessary: it introduced extra communication and a sync point, and forced the impl to fail on `asyncOp=True`. This PR does the following:
- Simulate reduce-scatter with `allreduce(inp).chunk(world_size)[rank]` (see the sketch after this list). This is still 2x the communication of a real reduce-scatter (since all-reduce = reduce-scatter + all-gather), but it's strictly better than what we have now.
- By doing the above, the comm becomes async and we don't have to fail on `asyncOp=True`.
- The general logic is implemented in `reduce_scatter_tensor_coalesced`. `_reduce_scatter_base` just calls it with single input/output.
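Here is a minimal sketch of that fallback (the standalone function is illustrative; the real logic lives in the backend's `reduce_scatter_tensor_coalesced`):
```python
import torch
import torch.distributed as dist

def reduce_scatter_via_all_reduce(inp: torch.Tensor) -> torch.Tensor:
    # Assumes an initialized process group. Emulate reduce_scatter_tensor by
    # all-reducing the full input and keeping only this rank's chunk: roughly
    # 2x the communication of a real reduce-scatter, but no extra scatter or
    # forced synchronization.
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    buf = inp.clone()
    dist.all_reduce(buf)
    return buf.chunk(world_size)[rank].clone()
```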
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118911
Approved by: https://github.com/shuqiangzhang
ghstack dependencies: #118910
Summary:
We suspected that ncclCommsAbort hung due to a NCCL 2.17 'bug'
triggered by different ranks calling the destructors of different PGs in
different orders. This can be reproed in a NCCL-level test for 2.17.
We need a test case in c10d to continually check that PGs can be destructed
in different orders.
Test Plan:
Run the test and verify that the printed destruction orders are as expected:
```
$ python test/distributed/test_c10d_nccl.py
ProcessGroupNCCLTest.test_close_multi_pg_unordered
NCCL version 2.19.3+cuda12.0
[rank0]:[W ProcessGroupNCCL.cpp:1128] [PG 2 Rank 0] ProcessGroupNCCL
destructor entered.
[rank0]:[W ProcessGroupNCCL.cpp:1147] [PG 2 Rank 0] ProcessGroupNCCL
aborting communicators, check for 'abort finished' logs or look for
abort hang
[rank1]:[W ProcessGroupNCCL.cpp:1128] [PG 1 Rank 1] ProcessGroupNCCL
destructor entered.
[rank1]:[W ProcessGroupNCCL.cpp:1147] [PG 1 Rank 1] ProcessGroupNCCL
aborting communicators, check for 'abort finished' logs or look for
abort hang
[rank0]:[W ProcessGroupNCCL.cpp:1151] [PG 2 Rank 0] ProcessGroupNCCL
abort finished.
[rank0]:[W ProcessGroupNCCL.cpp:1128] [PG 1 Rank 0] ProcessGroupNCCL
destructor entered.
[rank0]:[W ProcessGroupNCCL.cpp:1147] [PG 1 Rank 0] ProcessGroupNCCL
aborting communicators, check for 'abort finished' logs or look for
abort hang
[rank1]:[W ProcessGroupNCCL.cpp:1151] [PG 1 Rank 1] ProcessGroupNCCL
abort finished.
[rank1]:[W ProcessGroupNCCL.cpp:1128] [PG 2 Rank 1] ProcessGroupNCCL
destructor entered.
[rank1]:[W ProcessGroupNCCL.cpp:1147] [PG 2 Rank 1] ProcessGroupNCCL
aborting communicators, check for 'abort finished' logs or look for
abort hang
[rank0]:[W ProcessGroupNCCL.cpp:1151] [PG 1 Rank 0] ProcessGroupNCCL
abort finished.
[rank1]:[W ProcessGroupNCCL.cpp:1151] [PG 2 Rank 1] ProcessGroupNCCL
abort finished.
.
----------------------------------------------------------------------
Ran 1 test in 18.969s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119045
Approved by: https://github.com/yifuwang
Fixes #112389 and https://github.com/facebookincubator/dynolog/issues/208
This PR enables profiler initialization for CPU-only use cases. The main goal is to enable on-demand profiling with a daemon when using the CPU-only mode of PyTorch.
* When CUDA is available the profiler is initialized on first CUDA stream creation (or lazily when profiler is run).
* Since the CUDA stream creation callback does not exist in CPU-only PyTorch, the profiler is never initialized on its own.
* Thus the job does not register with Dynolog when the "KINETO_USE_DAEMON" env variable is set.
Part of the fix is in Kineto https://github.com/pytorch/kineto/pull/861, we point to it in PyTorch.
The change in PyTorch is to correctly set the `cpuOnly` argument.
## Test Plan:
Build PyTorch from source with USE_CUDA=0 so we have a CPU-only build. Git hash = `a40951defd87b9a5e582cf9112bf7a8bd0930c79`
(See instructions in PyTorch repo)
For the setup we run dynolog daemon in another terminal
```
buck2 run dynolog/src:dynolog -- --enable_ipc_monitor &
```
Now run an example model in PyTorch - see [linear_model.py](https://github.com/facebookincubator/dynolog/blob/main/scripts/pytorch/linear_model_example.py) , and set the device to 'cpu' inside the code instead of 'cuda'.
```
export KINETO_USE_DAEMON=1
python linear_model_example.py
```
Output shows the profiler registration with dynolog
```
(pytorch) [bcoutinho@devgpu038.ftw6 ~/local/pytorch (main)]$ python linear_model_example.py
INFO:2024-01-25 11:08:53 1807792:1807792 init.cpp:122] Registering daemon config loader, cpuOnly = 1
INFO:2024-01-25 11:08:53 1807792:1807792 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-01-25 11:08:53 1807792:1807792 IpcFabricConfigClient.cpp:93] Setting up IPC Fabric at endpoint: dynoconfigclient0dc36b8a-e14c-4260-958b-4b2e7d15e986 status = initialized
INFO:2024-01-25 11:08:53 1807792:1807792 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-01-25 11:08:53 1807792:1807792 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
```
We can also collect a trace using
```
[bcoutinho@devgpu038.ftw6 ~/fbsource/fbcode (3bc85f968)]$ buck2 run dynolog/cli:dyno -- gputrace --log-file /tmp/test.json
Kineto config =
ACTIVITIES_LOG_FILE=/tmp/test.json
PROFILE_START_TIME=0
ACTIVITIES_DURATION_MSECS=500
PROFILE_REPORT_INPUT_SHAPES=false
PROFILE_PROFILE_MEMORY=false
PROFILE_WITH_STACK=false
PROFILE_WITH_FLOPS=false
PROFILE_WITH_MODULES=false
response length = 147
response = {"activityProfilersBusy":0,"activityProfilersTriggered":[1807792],"eventProfilersBusy":0,"eventProfilersTriggered":[],"processesMatched":[1807792]}
Matched 1 processes
Trace output files will be written to:
/tmp/test_1807792.json
```
And trace file contains the trace correctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118320
Approved by: https://github.com/aaronenyeshi
`auto_functionalize` currently takes a custom op, a list of mutated argument names, and inputs to the custom op as kwargs. The list of mutated argument names is computed from the schema, and gets created when we're tracing. However, it seems that having the list of mutated argument names is a little unnecessary since we can always recompute it from the schema during runtime.
This also prevents the case where users might incorrectly modify the inputs to this operator, as we will now just recompute the list at runtime. This probably won't affect things too much because inductor will decompose auto_functionalize.
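For reference, a hedged sketch of what "recompute the mutated argument names from the schema" can look like, using public schema introspection on an unrelated in-place op for illustration (not the actual auto_functionalize code):
```python
import torch

op = torch.ops.aten.add_.Tensor  # any op with a mutable argument works for illustration
schema = op._schema
mutated = [
    arg.name
    for arg in schema.arguments
    if arg.alias_info is not None and arg.alias_info.is_write
]
print(mutated)  # ['self']
```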
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119050
Approved by: https://github.com/zou3519
Make multi-kernel work with cpp-wrapper. multi-kernel generates two equivalent variants for a reduction, and at runtime the faster one is picked. But cpp-wrapper needs to save the cubin file during codegen, so the two did not work together initially.
Thanks to Jason for suggesting a neat way to integrate them. cpp-wrapper currently does two codegen passes. For the first pass, we still generate multi-kernel code and run it; for the second pass, we load the cubin file for the faster kernel directly. Multi-kernel Python code is not generated for the second pass since it should not be needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117813
Approved by: https://github.com/jansel
The problem was exposed in https://github.com/pytorch/pytorch/pull/118071, where the control flow tests were always recompiling. The issue turned out to be that the same nonlocal variable used in `true_fn` and `false_fn` was getting lifted twice, thus creating two inputs in the main FX graph. Dynamo's tensor guards do not like this because they want all input tensors to be non-aliased.
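For illustration, a minimal sketch of the pattern that used to recompile, assuming the `torch.cond` control-flow op (the actual test code differs): both branches close over the same nonlocal tensor, which used to be lifted twice.
```python
import torch

def make_fn():
    shared = torch.randn(4)  # nonlocal tensor captured by both branches

    def true_fn(x):
        return x + shared

    def false_fn(x):
        return x - shared

    @torch.compile(backend="eager", fullgraph=True)
    def f(pred, x):
        return torch.cond(pred, true_fn, false_fn, (x,))

    return f

f = make_fn()
out = f(torch.tensor(True), torch.randn(4))
```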
We already have logic, using the side_effects infra, to check whether two different sources (the closure of true_fn and the closure of false_fn) point to the same tensor. But we were restoring side_effects after subtracing the true and false branches, which is not needed anymore: side_effects tracks read-only accesses as well as actual writes to the variables. For higher-order ops, any mutation which is not read-only leads to a graph break and safely exits the tracing; for read-only side effects, it doesn't matter.
This PR removes the restoring of side_effects, which turns on the logic for checking whether two different sources point to the same tensor, and thus lifts the common nonlocal tensor only once in the main graph.
Related discussion at https://github.com/pytorch/pytorch/issues/113235
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118420
Approved by: https://github.com/ydwu4, https://github.com/mlazos, https://github.com/zou3519
ghstack dependencies: #118975
Partially fixes https://github.com/pytorch/pytorch/issues/105077
Repro:
```python
import tempfile
import torch
from torch._subclasses import fake_tensor
class TheModelClass(torch.nn.Module):
    def __init__(self):
        super(TheModelClass, self).__init__()
        self.fc1 = torch.nn.Linear(5, 10)

    def forward(self, x):
        return self.fc1(x)

with tempfile.NamedTemporaryFile() as state_dict_file:
    # Create state_dict to be loaded later
    model = TheModelClass()
    torch.save(model.state_dict(), state_dict_file.name)

    fake_mode = fake_tensor.FakeTensorMode()
    with fake_mode:
        # This is where the bug is triggered
        state_dict = torch.load(state_dict_file.name)
```
Error:
```bash
Traceback (most recent call last):
File "issue_gh_torch_105077.py", line 22, in <module>
state_dict = torch.load(state_dict_file.name)
File "/opt/pytorch/torch/serialization.py", line 1014, in load
return _load(opened_zipfile,
File "/opt/pytorch/torch/serialization.py", line 1422, in _load
result = unpickler.load()
File "/opt/pytorch/torch/_utils.py", line 205, in _rebuild_tensor_v2
tensor = _rebuild_tensor(storage, storage_offset, size, stride)
File "/opt/pytorch/torch/_utils.py", line 184, in _rebuild_tensor
return t.set_(storage._untyped_storage, storage_offset, size, stride)
File "/opt/pytorch/torch/utils/_stats.py", line 20, in wrapper
return fn(*args, **kwargs)
File "/opt/pytorch/torch/_subclasses/fake_tensor.py", line 1288, in __torch_dispatch__
return self.dispatch(func, types, args, kwargs)
File "/opt/pytorch/torch/_subclasses/fake_tensor.py", line 1468, in dispatch
self.invalidate_written_to_constants(func, flat_arg_fake_tensors, args, kwargs)
File "/opt/pytorch/torch/_subclasses/fake_tensor.py", line 1733, in invalidate_written_to_constants
_, new_kwargs = normalize_function(
File "/opt/pytorch/torch/fx/operator_schemas.py", line 297, in normalize_function
torch_op_schemas = get_signature_for_torch_op(target)
File "/opt/pytorch/torch/fx/operator_schemas.py", line 167, in get_signature_for_torch_op
signatures = [_torchscript_schema_to_signature(schema) for schema in schemas]
File "/opt/pytorch/torch/fx/operator_schemas.py", line 167, in <listcomp>
signatures = [_torchscript_schema_to_signature(schema) for schema in schemas]
File "/opt/pytorch/torch/fx/operator_schemas.py", line 70, in _torchscript_schema_to_signature
arg_type = _torchscript_type_to_python_type(arg.type)
File "/opt/pytorch/torch/fx/operator_schemas.py", line 64, in _torchscript_type_to_python_type
return eval(ts_type.annotation_str, _type_eval_globals)
File "<string>", line 1, in <module>
NameError: name 'Storage' is not defined
```
This PR adds the ability to create fake tensors during `torch.load` by wrapping the `torch.Tensor.set_` call in `torch.utils._mode_utils.no_dispatch()` to skip the fake-mode dispatcher for it and thus create a real tensor. It later calls `fake_mode.from_tensor(t)` to finally create the fake tensor.
Co-authored-by: Edward Z. Yang <ezyang@mit.edu>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108186
Approved by: https://github.com/ezyang
Previously we were downloading all of (eager311, dynamo38, dynamo311).
Now we just download what's necessary. This is useful for
update_failures.py because the dynamo tests finish much faster than the
eager tests and it only needs the result from the dynamo tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119027
Approved by: https://github.com/jamesjwu
ghstack dependencies: #118874, #118882, #118931
Previously, you could run update_failures.py (with a commit hash) and it
would add new expected failures and skips for newly failing tests and
remove expected failures for newly passing tests.
This PR teaches update_failures.py to also remove skips for tests that
are now passing without them.
The way we do this is:
- dynamo_test_failures.py doesn't actually skip tests -- it runs the
test and then suppresses the signal.
- if the test actually passed, then the test gets skipped with a special
skip message
- we teach update_failures.py to look for the presence of that skip
message.
Test Plan:
- Used this to generate https://github.com/pytorch/pytorch/pull/118928
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118931
Approved by: https://github.com/yanboliang
ghstack dependencies: #118874, #118882
Fixes https://github.com/pytorch/pytorch/issues/111020
For the following code:
```python
import torch
import torch._higher_order_ops.wrap
glob = []
def f(x):
    glob.append(x)
    return x.clone()

@torch.compile(backend='eager', fullgraph=True)
def g(x):
    return torch.ops.higher_order.wrap(f, x)
x = torch.randn(3)
g(x)
```
The stacktrace now becomes:
```
[2024-02-01 15:23:34,691] [0/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting wrap, we were unable to trace function `f` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] HigherOrderOperator: Mutating a variable not in the current scope (SideEffects)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] Traceback (most recent call last):
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 381, in speculate_subgraph
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] output = f.call_function(tx, args, sub_kwargs)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 278, in call_function
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] return super().call_function(tx, args, kwargs)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 86, in call_function
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] return tx.inline_user_function_return(
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 657, in inline_user_function_return
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2261, in inline_call
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] return cls.inline_call_(parent, func, args, kwargs)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2370, in inline_call_
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] tracer.run()
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 787, in run
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] and self.step()
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 750, in step
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] getattr(self, inst.opname)(inst)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 469, in wrapper
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] return inner_fn(self, inst)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1196, in CALL_FUNCTION
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] self.call_function(fn, args, {})
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 651, in call_function
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] self.push(fn.call_function(self, args, kwargs))
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] File "/home/yidi/local/pytorch/torch/_dynamo/variables/misc.py", line 583, in call_function
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] return self.obj.call_method(tx, self.name, args, kwargs)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] File "/home/yidi/local/pytorch/torch/_dynamo/variables/lists.py", line 330, in call_method
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] return super().call_method(tx, name, args, kwargs)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] File "/home/yidi/local/pytorch/torch/_dynamo/variables/lists.py", line 241, in call_method
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] tx.output.side_effects.mutation(self)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] File "/home/yidi/local/pytorch/torch/_dynamo/side_effects.py", line 325, in mutation
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] self.check_allowed_side_effect(var)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] File "/home/yidi/local/pytorch/torch/_dynamo/side_effects.py", line 157, in check_allowed_side_effect
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] unimplemented(
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] File "/home/yidi/local/pytorch/torch/_dynamo/exc.py", line 190, in unimplemented
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] raise Unsupported(msg)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] torch._dynamo.exc.Unsupported: HigherOrderOperator: Mutating a variable not in the current scope (SideEffects)
Traceback (most recent call last):
File "/home/yidi/local/pytorch/test.py", line 219, in <module>
g(x)
File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 453, in _fn
return fn(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 615, in catch_errors
return callback(frame, cache_entry, hooks, frame_state)
File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 390, in _convert_frame_assert
return _compile(
File "/home/yidi/local/miniconda3/envs/pytorch-3.10/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 650, in _compile
guarded_code = compile_inner(code, one_graph, hooks, transform)
File "/home/yidi/local/pytorch/torch/_dynamo/utils.py", line 248, in time_wrapper
r = func(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 531, in compile_inner
out_code = transform_code_object(code, transform)
File "/home/yidi/local/pytorch/torch/_dynamo/bytecode_transformation.py", line 1033, in transform_code_object
transformations(instructions, code_options)
File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 155, in _fn
return fn(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 496, in transform
tracer.run()
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2125, in run
super().run()
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 787, in run
and self.step()
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 750, in step
getattr(self, inst.opname)(inst)
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 469, in wrapper
return inner_fn(self, inst)
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1196, in CALL_FUNCTION
self.call_function(fn, args, {})
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 651, in call_function
self.push(fn.call_function(self, args, kwargs))
File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 1227, in call_function
p_args, p_kwargs, example_value, body_r, treespec, _ = self.create_wrapped_node(
File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 1190, in create_wrapped_node
) = speculate_subgraph(
File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 453, in speculate_subgraph
raise ex
File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 381, in speculate_subgraph
output = f.call_function(tx, args, sub_kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 278, in call_function
return super().call_function(tx, args, kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 86, in call_function
return tx.inline_user_function_return(
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 657, in inline_user_function_return
return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2261, in inline_call
return cls.inline_call_(parent, func, args, kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2370, in inline_call_
tracer.run()
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 787, in run
and self.step()
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 750, in step
getattr(self, inst.opname)(inst)
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 469, in wrapper
return inner_fn(self, inst)
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1196, in CALL_FUNCTION
self.call_function(fn, args, {})
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 651, in call_function
self.push(fn.call_function(self, args, kwargs))
File "/home/yidi/local/pytorch/torch/_dynamo/variables/misc.py", line 583, in call_function
return self.obj.call_method(tx, self.name, args, kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/variables/lists.py", line 330, in call_method
return super().call_method(tx, name, args, kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/variables/lists.py", line 241, in call_method
tx.output.side_effects.mutation(self)
File "/home/yidi/local/pytorch/torch/_dynamo/side_effects.py", line 325, in mutation
self.check_allowed_side_effect(var)
File "/home/yidi/local/pytorch/torch/_dynamo/side_effects.py", line 157, in check_allowed_side_effect
unimplemented(
File "/home/yidi/local/pytorch/torch/_dynamo/exc.py", line 190, in unimplemented
raise Unsupported(msg)
torch._dynamo.exc.Unsupported: HigherOrderOperator: Mutating a variable not in the current scope (SideEffects)
from user code:
File "/home/yidi/local/pytorch/test.py", line 216, in g
return torch.ops.higher_order.wrap(f, x)
File "/home/yidi/local/pytorch/test.py", line 211, in f
glob.append(x)
Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
You can suppress this exception and fall back to eager by setting:
import torch._dynamo
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118826
Approved by: https://github.com/yanboliang, https://github.com/zou3519
Summary:
X-link: https://github.com/pytorch/executorch/pull/1817
Basic support for non-persistent buffers, which are buffers that do not show up in the state dict.
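For reference, a minimal example of a non-persistent buffer using the standard `nn.Module` API: it participates in forward but does not appear in `state_dict()`.
```python
import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # persistent=False keeps the buffer out of the state dict.
        self.register_buffer("scale", torch.ones(1), persistent=False)

    def forward(self, x):
        return x * self.scale

m = M()
print(list(m.state_dict().keys()))    # [] -- no 'scale' entry
print(list(dict(m.named_buffers())))  # ['scale']
```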
One weird twist is that most of our other systems (FX, aot_export, dynamo) have completely buggy handling of non-persistent buffers. I tried to go on a wild goose chase to fix them all, but it got to be too much. So I introduced some sad rewrite passes in `_export` to make the final state dict correctly align with the original module's state dict.
This exposed some bugs/ambiguous handling of parameters/buffers in existing test code. For example, `TestSaveLoad.test_save_buffer` traced over a module that was not in the root module hierarchy and caused some weird behavior. I think we should error explicitly on use cases like this: https://github.com/pytorch/pytorch/issues/118410. For now I just rewrote the tests or skipped them.
As a side effect, this diff tightened up quite a few sloppy behaviors around state dict handling:
- Tensor attributes were getting promoted to be buffers—bad!
- Tracing through a module not in the children of the root module would add its parameters/buffers to the state dict—bad!
This behavior is unlikely to show up in user code since the model would be totally broken, but did show up in a bunch of tests.
#buildmore
Test Plan:
unit tests
sandcastle
Differential Revision: D53340041
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118969
Approved by: https://github.com/guangy10, https://github.com/huydhn, https://github.com/titaiwangms
This PR fixes several bugs, listed in priority:
1. `load_state_dict` with a nontensor step was incorrect for capturable and fused implementations since we don't create the tensors on the right device in `__setstate__`. This has been fixed.
2. The most recently added capturable implementations forgot the check that all tensors should be on CUDA for eager. We've now added those checks.
3. The most recent change in Adamax only adds capturable for foreach and will silently be incorrect for forloop/single-tensor. I've added erroring and modified testing with many, many skips for that. Honestly, this PR has only further cemented my preference that we should just do the single-tensor and multi-tensor capturable implementations together in the future. @mlazos
4. The conditional for adding cuda-supported configs for the optimizer infos was incorrect! So we hadn't been testing capturable! This also stands rectified and was the trigger for this PR in the first place.
5. In a similar way, the conditional for `_get_optim_inputs_including_global_cliquey_kwargs` was incorrect sometimes as well. This has also been corrected.
The following is not a bug, but is just something to make life simpler by not needing to handle Nones: `optim_input_funcs` must now mandatorily take in a `device`, which could be a string or a torch.device.
Details for posterity:
4. Running the test_foreach_matches_forloop test and printing the configs shows that capturable gets included, which is correct.
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (5d50138f)]$ python test/test_optim.py -k test_foreach_matches_forloop_AdamW_cuda
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
params=None, kwargs={}, desc=default
params=None, kwargs={'lr': 0.01}, desc=non-default lr
params=None, kwargs={'weight_decay': 0.1}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.1, 'maximize': True}, desc=maximize
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True}, desc=amsgrad
params=None, kwargs={'capturable': True}, desc=capturable
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True}, desc=capturable, amsgrad
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True}, desc=Tensor lr with capturable and amsgrad
.
----------------------------------------------------------------------
Ran 1 test in 19.229s
OK
```
5. Running the test_optimizer_can_be_printed test (which calls `_get_optim_inputs_including_global_cliquey_kwargs`) and printing what gets run is also now correct.
```
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
params=None, kwargs={'differentiable': False}, desc=default
params=None, kwargs={'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.01, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.01, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'weight_decay': 0.1, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.1, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'differentiable': True}, desc=maximize & differentiable
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'differentiable': False}, desc=amsgrad
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'differentiable': True}, desc=amsgrad & differentiable
.params=None, kwargs={'foreach': False, 'differentiable': False, 'fused': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False, 'fused': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True, 'fused': False}, desc=default & differentiable
params=None, kwargs={'foreach': False, 'differentiable': False, 'fused': True}, desc=default & fused
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False, 'fused': False}, desc=non-default lr
params=None, kwargs={'lr': 0.01, 'foreach': True, 'differentiable': False, 'fused': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': True, 'fused': False}, desc=non-default lr & differentiable
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False, 'fused': True}, desc=non-default lr & fused
params=None, kwargs={'weight_decay': 0.1, 'foreach': False, 'differentiable': False, 'fused': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.1, 'foreach': True, 'differentiable': False, 'fused': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.1, 'foreach': False, 'differentiable': True, 'fused': False}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.1, 'foreach': False, 'differentiable': False, 'fused': True}, desc=nonzero weight_decay & fused
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=maximize & differentiable
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=maximize & fused
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=amsgrad
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=amsgrad & foreach
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=amsgrad & differentiable
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=amsgrad & fused
params=None, kwargs={'capturable': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=capturable
params=None, kwargs={'capturable': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=capturable & foreach
params=None, kwargs={'capturable': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=capturable & differentiable
params=None, kwargs={'capturable': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=capturable & fused
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=capturable, amsgrad
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=capturable, amsgrad & foreach
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=capturable, amsgrad & differentiable
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=capturable, amsgrad & fused
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=Tensor lr with capturable and amsgrad
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=Tensor lr with capturable and amsgrad & foreach
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=Tensor lr with capturable and amsgrad & differentiable
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=Tensor lr with capturable and amsgrad & fused
.
----------------------------------------------------------------------
Ran 2 tests in 11.112s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118326
Approved by: https://github.com/mlazos
This PR adds the pre- and post-backward logic:
- **Pre-backward hook:** `FSDPState` and `FSDPParamGroup` define this, and `FSDPState` is responsible for registering since its pre-backward should run even if the `FSDPState` does not manage any parameters (in case it is the root).
- **Post-backward hook:** Only `FSDParamGroup` defines this since the post-backward hook reshards parameters and reduce-scatters gradients (functionality only needed with managed parameters). The `FSDPParamGroup` is responsible for registering this.
- **Post-backward final callback:** `FSDPState` defines this, and each `FSDPParamGroup` defines a `finalize_backward()` to call in the final callback.
### Pre-Backward
The pre-backward hook is registered on the module outputs (that require gradient), and it should run when the first such output has its gradient computed. The hook may run multiple times per backward, once per module forward. Specifically, there will be one `(pre-backward, post-backward)` interval for each of the module's `forward()` calls. This is in contrast with the existing FSDP semantics, which only define a single `(pre-backward, post-backward)` interval equivalent to the union of this FSDP's `(pre-backward, post-backward)` intervals. This avoids spiking memory from having multiple modules not resharding and avoids some autograd edge cases.
We implement the pre-backward hook by having a flag that is set upon the first call to disable subsequent calls. This flag could be maintained by FSDP, but for a cleaner design, we augment `register_multi_grad_hook` with a `mode="any"` option and use that instead.
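A minimal standalone sketch of the `mode="any"` behavior described above (illustrative only, not the FSDP hook itself):
```python
import torch
from torch.autograd.graph import register_multi_grad_hook

x = torch.randn(4, requires_grad=True)
out1, out2 = x.sin(), x.cos()

def pre_backward_hook(grad):
    # With mode="any", this fires once, when the first of the registered
    # outputs has its gradient computed.
    print("pre-backward fired")

handle = register_multi_grad_hook((out1, out2), pre_backward_hook, mode="any")
(out1.sum() + out2.sum()).backward()
handle.remove()
```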
### Post-Backward
The post-backward hook is equivalent to a module full backward hook (`nn.Module.register_full_backward_hook`) except it adds pytree logic to work with data structures other than just flat `Tensor` args passed to `nn.Module.forward`. If we were to use `register_full_backward_hook`, then the hook could fire early (before all gradients for the module have been computed). Most internal models use custom data structures as `forward` inputs, and they find that unifying under pytree is an acceptable solution.
Unlike existing FSDP, we are able to reshard the parameters in the post-backward hook _before_ 'concatenating' the autograd-computed gradients, achieving a lower peak memory usage. (Existing FSDP has `SplitWithSizesBackward` that calls a `CatArrayBatched`, and here we have the reduce-scatter copy-in.)
### Final Callback
The final callback runs as a queued callback to the autograd engine, meaning that it runs at the end of backward.
In the future, if we do not want to wait for the reduce-scatter (or similar for CPU offloading), we can augment the final callback. The code is written such that each reduce-scatter can be waited on separately (via CUDA event).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118004
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
ghstack dependencies: #117950, #117955, #117973, #117975
Fix internal failure D53291154
from alban: the change is breaking because the alpha argument is now kwarg-only (via the * marker), while it was previously allowed to be positional for the rsub.Scalar overload
```
_wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "torch/_dynamo/eval_frame.py", line 453, in _fn
return fn(*args, **kwargs)
File "torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "torch/_dynamo/eval_frame.py", line 615, in catch_errors
return callback(frame, cache_entry, hooks, frame_state)
File "torch/_dynamo/convert_frame.py", line 390, in _convert_frame_assert
return _compile(
File "python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "torch/_dynamo/convert_frame.py", line 650, in _compile
guarded_code = compile_inner(code, one_graph, hooks, transform)
File "torch/_dynamo/utils.py", line 248, in time_wrapper
r = func(*args, **kwargs)
File "torch/_dynamo/convert_frame.py", line 531, in compile_inner
out_code = transform_code_object(code, transform)
File "torch/_dynamo/bytecode_transformation.py", line 1033, in transform_code_object
transformations(instructions, code_options)
File "torch/_dynamo/convert_frame.py", line 155, in _fn
return fn(*args, **kwargs)
File "torch/_dynamo/convert_frame.py", line 496, in transform
tracer.run()
File "torch/_dynamo/symbolic_convert.py", line 2125, in run
super().run()
File "torch/_dynamo/symbolic_convert.py", line 787, in run
and self.step()
File "torch/_dynamo/symbolic_convert.py", line 750, in step
getattr(self, inst.opname)(inst)
File "torch/_dynamo/symbolic_convert.py", line 469, in wrapper
return inner_fn(self, inst)
File "torch/_dynamo/symbolic_convert.py", line 1249, in CALL_FUNCTION_KW
self.call_function(fn, args, kwargs)
File "torch/_dynamo/symbolic_convert.py", line 651, in call_function
self.push(fn.call_function(self, args, kwargs))
File "torch/_dynamo/variables/torch.py", line 614, in call_function
tensor_variable = wrap_fx_proxy(
File "torch/_dynamo/variables/builder.py", line 1285, in wrap_fx_proxy
return wrap_fx_proxy_cls(target_cls=TensorVariable, **kwargs)
File "torch/_dynamo/variables/builder.py", line 1370, in wrap_fx_proxy_cls
example_value = get_fake_value(proxy.node, tx, allow_non_graph_fake=True)
File "torch/_dynamo/utils.py", line 1653, in get_fake_value
raise TorchRuntimeError(str(e)).with_traceback(e.__traceback__) from None
File "torch/_dynamo/utils.py", line 1599, in get_fake_value
ret_val = wrap_fake_exception(
File "torch/_dynamo/utils.py", line 1140, in wrap_fake_exception
return fn()
File "torch/_dynamo/utils.py", line 1600, in <lambda>
lambda: run_node(tx.output, node, args, kwargs, nnmodule)
File "torch/_dynamo/utils.py", line 1720, in run_node
raise RuntimeError(fn_str + str(e)).with_traceback(e.__traceback__) from e
File "torch/_dynamo/utils.py", line 1699, in run_node
return node.target(*args, **kwargs)
File "torch/utils/_stats.py", line 20, in wrapper
return fn(*args, **kwargs)
File "torch/_subclasses/fake_tensor.py", line 1637, in __torch_dispatch__
return self.dispatch(func, types, args, kwargs)
File "torch/_subclasses/fake_tensor.py", line 1975, in dispatch
return self._dispatch_impl(func, types, args, kwargs)
File "torch/_subclasses/fake_tensor.py", line 2190, in _dispatch_impl
r = func(*args, **kwargs)
File "torch/_ops.py", line 571, in __call__
return self_._op(*args, **kwargs)
File "torch/_prims_common/wrappers.py", line 252, in _fn
result = fn(*args, **kwargs)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118907
Approved by: https://github.com/lezcano
### Motivation
Despite our plan to reduce gloo usage, it is still widely used as a testing tool (in both the PyTorch CI and user tests) for code that only uses nccl in real-world scenarios. There are some coverage issues around the all-gather and reduce-scatter variants, which are currently worked around in ugly ways (e.g. [this](b9e86bc93d/torch/distributed/_functional_collectives_impl.py (L216-L219)) and [this](b9e86bc93d/torch/distributed/_functional_collectives_impl.py (L262-L272))). For native funcol I ran into the same issues, but I'd rather just fix the coverage.
**I think it's reasonable to think of this as a fix rather than adding new features. This is orthogonal to the potential reduction of gloo usage**.
### This PR
This PR adds `ProcessGroupGloo::allgather_into_tensor_coalesced`. This is very straightforward - `ProcessGroupGloo` already supports `allgather_coalesced`, to which we can funnel `allgather_into_tensor_coalesced`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118910
Approved by: https://github.com/shuqiangzhang
Summary: This is the equivalent API to `model.train()` for
exported models, analogous to `move_exported_model_to_eval`.
Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_move_exported_model_dropout
python test/test_quantization.py TestQuantizePT2E.test_move_exported_model_dropout_inplace
python test/test_quantization.py TestQuantizePT2E.test_move_exported_model_dropout_bn
Reviewers: jerryzh168, kimishpatel
Subscribers: jerryzh168, kimishpatel, supriyar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113492
Approved by: https://github.com/jerryzh168, https://github.com/tugsbayasgalan
Summary: generate_index_put_fallback currently generates something like the following,
```
AtenTensorHandle tensor_handle_array_1[] = {nullptr, nullptr, arg1_1, wrap_with_raii_handle_if_needed(tmp_tensor_handle_0)};
```
The problem is that wrap_with_raii_handle_if_needed creates a RAIIAtenTensorHandle which only lives during this tmp array initialization. After the initialization is done, the RAIIAtenTensorHandle dies and releases the underlying Tensor, and when tensor_handle_array_1 is later passed to aoti_torch_index_put_out, some of its AtenTensorHandle elements are invalid, causing a segfault.
Differential Revision: [D53339348](https://our.internmc.facebook.com/intern/diff/D53339348)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118963
Approved by: https://github.com/aakhundov
Fixes #[117504](https://github.com/pytorch/pytorch/issues/117504)
Re-engineering the Hipify Trie:
(1) Re-engineered the Trie (see the sketch below).
(2) Added more documentation and comments for easier understanding.
(3) Created a set of unit tests (class `TestHipifyTrie`) to test the Trie data structure and its APIs.
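For illustration, a minimal prefix-trie sketch of the kind of longest-match lookup the hipify tool performs (names are illustrative, not the actual `hipify_python.py` API):
```python
class Trie:
    def __init__(self):
        self.children = {}
        self.replacement = None  # set on nodes that terminate a known CUDA token

    def add(self, token: str, replacement: str) -> None:
        node = self
        for ch in token:
            node = node.children.setdefault(ch, Trie())
        node.replacement = replacement

    def longest_match(self, text: str, start: int = 0):
        node, best = self, None
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                break
            if node.replacement is not None:
                best = (text[start:i + 1], node.replacement)
        return best

trie = Trie()
trie.add("cudaMalloc", "hipMalloc")
print(trie.longest_match("cudaMalloc(&ptr, n);"))  # ('cudaMalloc', 'hipMalloc')
```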
Test:
```
root@xxx:/development/pytorch# pytest test/test_utils.py -k TestHipifyTrie
==================================================================================================== test session starts ====================================================================================================
platform linux -- Python 3.9.18, pytest-7.3.2, pluggy-1.3.0
rootdir: /dockerx/development/pytorch
configfile: pytest.ini
plugins: flakefinder-1.1.0, rerunfailures-13.0, xdist-3.3.1, xdoctest-1.1.0, cpp-2.3.0, shard-0.1.2, hypothesis-5.35.1
collected 11453 items / 11445 deselected / 8 selected
Running 8 items in this shard
test/test_utils.py ........ [100%]
============================================================================================ 8 passed, 11445 deselected in 3.84s ============================================================================================
root@xxx:/development/pytorch#
```
Also diffed the contents modified and generated by this tool between the original and the new code of the hipify_python.py script, and verified there is no difference.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118433
Approved by: https://github.com/malfet, https://github.com/jeffdaily
Make variables in dict lazy and remove DICT_KEYS guard.
We build the keys of a dict depth-first and we rely on the guards of
each element in the dict to create the correct guards. This allows us to
remove the rather buggy DICT_KEYS guard and make the guard lazy.
The guards are not completely lazy yet, as we instantiate them in
`_HashableTracker._eq_impl` but it should be possible to make them
truly lazy.
Also, adding new types to the supported types within keys should be less
error prone.
This is marginally less efficient when we graph break, but in turn we
should graph break much less. It also makes the dicts code easier to maintain
(removes `is_hashable_python_var`).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117625
Approved by: https://github.com/jansel, https://github.com/peterbell10, https://github.com/anijain2305
ghstack dependencies: #117982, #118098, #117983
This enables the new way of writing guards for dicts. Before we were
doing things like
```
L['self'].param_groups[0][___dict_keys_getitem(L['self'].param_groups[0], 0)][3] is L['self'].param_groups[0]['params'][3]
```
without knowing whether `L['self'].param_groups[0][___dict_keys_getitem(L['self'].param_groups[0], 0)]` was a list.
On a different note, I'll probably write a pass to recover the previous
way to place guards on dicts via something like `DICT_KEYS` as an
optimisation, as it seems relevant for optimisers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117983
Approved by: https://github.com/mlazos
ghstack dependencies: #117982, #118098
# Motivation
This PR intends to extend `cuda_lazy_init` to `device_lazy_init`, which is a device-agnostic API that can support any backend, and to change `maybe_initialize_cuda` to `maybe_initialize_device` to support lazy initialization for CUDA while maintaining scalability.
# Design
We maintain a flag for each backend to manage the lazy initialization state separately.
# Additional Context
No additional UTs are needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118846
Approved by: https://github.com/malfet
Improvements to shape padding logic in torch/_inductor/pad_mm.py
These changes could lead up to 14% perf improvement for certain Meta internal models in experiments.
Most notably:
* 1.) Use the aten.const_pad_nd operation to pad tensors in a single op instead of multiple steps involving intermediate buffers (see the sketch after this list). This appears to be more performant than the previous logic, as confirmed by profiling & benchmarking results (Meta internal).
* 2.) Make many paddings unnecessary by using an explicitly transposed GEMM when either the M or N dimension is properly aligned but the other is not, configurable via config.shape_pad_use_transpose (default: True).
* 3.) Enable shape padding for the Inductor CUDA / Cutlass backend for all GEMM ops where Cutlass would be enabled, without benchmarking in that case.
* Add config flag to always pad shapes (without benchmarking first), configurable via config.force_shape_pad (default: False )
* Added several new unit tests to ensure tensors are padded such that they meet all alignment requirements after padding.
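Referring back to item 1.), a minimal sketch of the single-op padding idea, assuming zero padding of the K dimension (illustrative only, not the pad_mm.py implementation):
```python
import torch

# Pad the K dimension of both GEMM operands to a multiple of 8 with a single
# constant_pad_nd call each, instead of multiple copies through intermediate buffers.
def matmul_with_padded_k(a: torch.Tensor, b: torch.Tensor, multiple: int = 8) -> torch.Tensor:
    k = a.shape[1]
    pad = -k % multiple
    if pad:
        a = torch.constant_pad_nd(a, (0, pad))        # (M, K) -> (M, K + pad)
        b = torch.constant_pad_nd(b, (0, 0, 0, pad))  # (K, N) -> (K + pad, N)
    return a @ b  # zero padding along K does not change the product

out = matmul_with_padded_k(torch.randn(127, 250), torch.randn(250, 65))
```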
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118522
Approved by: https://github.com/jansel, https://github.com/eellison
When I built PyTorch on Windows with the latest MKL, it reported:
```
sources\pytorch\aten\src\ATen/cpu/vml.h(106): error C2338: static_assert failed: 'MKL_INT is assumed to be int32_t'
```
It should be safe to relax the restriction to int64_t.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118946
Approved by: https://github.com/ezyang
- Moved the dictionary arguments to the node's kwargs as dicts are not
valid inputs.
- Inlined the mutated arguments into the output. Originally, the output of auto_functionalize was the operator output and a list of mutated arguments (e.g. [op_out1, op_out2, [mutated_arg1, mutated_arg2]]). However, this is not easily exportable. Now, it will just be [op_out1, op_out2, mutated_arg1, mutated_arg2].
Differential Revision: [D53331040](https://our.internmc.facebook.com/intern/diff/D53331040)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118809
Approved by: https://github.com/zou3519
Summary:
Whenever we access a constant, we emit a `get_attr` node for it.
The `lift_constants_pass` was lifting every `get_attr` node unconditionally, even if the same target was already lifted. This diff fixes that.
I also took the liberty of adding some infra to make it easier to unit test passes. GraphBuilder lets you declaratively construct graphs with the right metadata; it's pretty useful for directly inducing the pattern you want to test against.
Test Plan: added unit test
Differential Revision: D53278161
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118776
Approved by: https://github.com/angelayi, https://github.com/titaiwangms
Summary:
Add a defined traverse mode for the minimizer.
It takes user-provided start_idx and end_idx, forms a subgraph, and compares results from accelerators vs. CPU.
Differential Revision: D53318292
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118889
Approved by: https://github.com/jfix71
Removes raising error if a device_mesh has a parent.
The comment says that HSDP + TP is not supported, but I'm able to do 2D parallelism + HSDP fine. The only issues are:
- this check
- https://github.com/pytorch/pytorch/pull/118618
- a series of PRs related to checkpointing with 3D meshes that I will open
We currently monkeypatch for the above which I am slowly upstreaming.
I imagine torch will have a better, native integration eventually, but this check seems too aggressive in the meantime given DTensor now lets users do some things themselves (which is amazing 🎉)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118620
Approved by: https://github.com/wz337, https://github.com/wanchaol
Like #110312 but we also run this check when backed symints are in the grid (e.g. s1 / 512)
### Why?
Let's say we lower a model and generate a GPU kernel grid with symbolic shapes, e.g. `s1 / 512`. If at some later point we run the lowered model with inputs such that `s1 = 0`, then we'll launch the kernel with a `0`-sized grid. This surfaces as `CUDA driver error: invalid argument`.
To avoid this, we check for a `0`-sized grid whenever there are symbolic shapes, which includes both backed and unbacked symints.
This adds non-zero overhead to the CPU. However, in return, we get better reliability when encountering this scenario. This scenario happened when serving an internal model.
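An illustrative sketch of the guard (a hypothetical launcher, not the actual generated wrapper code):
```python
def maybe_launch(launch_kernel, s1, *args, block=512):
    # With a symbolic size s1, the grid may evaluate to 0 at runtime, which the
    # CUDA driver rejects ("invalid argument"), so we skip the launch instead.
    grid_x = (s1 + block - 1) // block
    if grid_x == 0:
        return
    launch_kernel(grid_x, *args)
```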
### Test
```
$ python test/inductor/test_aot_inductor.py -k test_zero_grid_with_unbacked_symbols
OK (skipped=3)
$ python test/inductor/test_aot_inductor.py -k test_zero_grid_with_backed_symbols
# Before
Error: CUDA driver error: invalid argument
FAILED (errors=2, skipped=3)
# Now
OK (skipped=3)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118654
Approved by: https://github.com/chenyang78, https://github.com/desertfire
This PR fixes several bugs, listed in priority:
1. `load_state_dict` with a nontensor step was incorrect for capturable and fused implementations since we don't create the tensors on the right device in `__setstate__`. This has been fixed.
2. The most recently added capturable implementations forgot the check that all tensors should be on CUDA for eager. We've now added those checks
3. The most recent change in Adamax only adds capturable for foreach but will silently be incorrect for forloop/single-tensor. I've added erroring and modified testing with many many many skips for that. Honestly my preference after this PR has only been further cemented that we should just do the single tensor and multi tensor capturable implementations together in the future. @mlazos
4. The conditional for adding cuda-supported configs for the optimizer infos was incorrect! So we hadn't been testing capturable! This also stands rectified and was the trigger for this PR in the first place.
5. In a similar way, the conditional for `_get_optim_inputs_including_global_cliquey_kwargs` was incorrect sometimes as well. This has also been corrected.
The following is not a bug, but is just something to make life simpler by not needing to handle Nones: `optim_input_funcs` must now mandatorily take in a `device`, which could be a string or a torch.device.
Details for posterity:
4. Running the test_foreach_matches_forloop test and printing the configs that get printed yields capturable getting included, which is correct.
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (5d50138f)]$ python test/test_optim.py -k test_foreach_matches_forloop_AdamW_cuda
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
params=None, kwargs={}, desc=default
params=None, kwargs={'lr': 0.01}, desc=non-default lr
params=None, kwargs={'weight_decay': 0.1}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.1, 'maximize': True}, desc=maximize
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True}, desc=amsgrad
params=None, kwargs={'capturable': True}, desc=capturable
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True}, desc=capturable, amsgrad
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True}, desc=Tensor lr with capturable and amsgrad
.
----------------------------------------------------------------------
Ran 1 test in 19.229s
OK
```
5. Running the test_optimizer_can_be_printed test (which calls `_get_optim_inputs_including_global_cliquey_kwargs`) and printing what gets run is also now correct.
```
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
params=None, kwargs={'differentiable': False}, desc=default
params=None, kwargs={'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.01, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.01, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'weight_decay': 0.1, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.1, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'differentiable': True}, desc=maximize & differentiable
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'differentiable': False}, desc=amsgrad
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'differentiable': True}, desc=amsgrad & differentiable
.params=None, kwargs={'foreach': False, 'differentiable': False, 'fused': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False, 'fused': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True, 'fused': False}, desc=default & differentiable
params=None, kwargs={'foreach': False, 'differentiable': False, 'fused': True}, desc=default & fused
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False, 'fused': False}, desc=non-default lr
params=None, kwargs={'lr': 0.01, 'foreach': True, 'differentiable': False, 'fused': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': True, 'fused': False}, desc=non-default lr & differentiable
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False, 'fused': True}, desc=non-default lr & fused
params=None, kwargs={'weight_decay': 0.1, 'foreach': False, 'differentiable': False, 'fused': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.1, 'foreach': True, 'differentiable': False, 'fused': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.1, 'foreach': False, 'differentiable': True, 'fused': False}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.1, 'foreach': False, 'differentiable': False, 'fused': True}, desc=nonzero weight_decay & fused
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=maximize & differentiable
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=maximize & fused
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=amsgrad
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=amsgrad & foreach
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=amsgrad & differentiable
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=amsgrad & fused
params=None, kwargs={'capturable': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=capturable
params=None, kwargs={'capturable': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=capturable & foreach
params=None, kwargs={'capturable': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=capturable & differentiable
params=None, kwargs={'capturable': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=capturable & fused
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=capturable, amsgrad
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=capturable, amsgrad & foreach
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=capturable, amsgrad & differentiable
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=capturable, amsgrad & fused
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=Tensor lr with capturable and amsgrad
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=Tensor lr with capturable and amsgrad & foreach
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=Tensor lr with capturable and amsgrad & differentiable
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=Tensor lr with capturable and amsgrad & fused
.
----------------------------------------------------------------------
Ran 2 tests in 11.112s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118326
Approved by: https://github.com/mlazos
Summary: Vulkan Linear op doesn't support 1d tensors. We can unsqueeze 1d tensors to 2d to unblock the functionality.
Test Plan:
`LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*linear_*"`
```
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *linear_*
[==========] Running 11 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 11 tests from VulkanAPITest
[ RUN ] VulkanAPITest.linear_1d_small
[ OK ] VulkanAPITest.linear_1d_small (319 ms)
[ RUN ] VulkanAPITest.linear_1d_large
[ OK ] VulkanAPITest.linear_1d_large (64 ms)
[ RUN ] VulkanAPITest.linear_2d_flat
[ OK ] VulkanAPITest.linear_2d_flat (0 ms)
[ RUN ] VulkanAPITest.linear_2d_small
[ OK ] VulkanAPITest.linear_2d_small (0 ms)
[ RUN ] VulkanAPITest.linear_2d_large
[ OK ] VulkanAPITest.linear_2d_large (129 ms)
[ RUN ] VulkanAPITest.linear_3d_flat
[ OK ] VulkanAPITest.linear_3d_flat (0 ms)
[ RUN ] VulkanAPITest.linear_3d_small
[ OK ] VulkanAPITest.linear_3d_small (1 ms)
[ RUN ] VulkanAPITest.linear_3d_large
[ OK ] VulkanAPITest.linear_3d_large (51 ms)
[ RUN ] VulkanAPITest.linear_4d_flat
[ OK ] VulkanAPITest.linear_4d_flat (0 ms)
[ RUN ] VulkanAPITest.linear_4d_small
[ OK ] VulkanAPITest.linear_4d_small (1 ms)
[ RUN ] VulkanAPITest.linear_4d_large
[ OK ] VulkanAPITest.linear_4d_large (6 ms)
[----------] 11 tests from VulkanAPITest (578 ms total)
[----------] Global test environment tear-down
[==========] 11 tests from 1 test suite ran. (578 ms total)
[ PASSED ] 11 tests.
```
Differential Revision: D53243201
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118690
Approved by: https://github.com/jorgep31415, https://github.com/liuk22
## Context
This changeset is part of a stack that enables memory planning (i.e. sharing memory between intermediate tensors) in the PyTorch Vulkan Compute API. Note that Memory Planning can only be used via the ExecuTorch delegate (currently a WIP) and not Lite Interpreter (which does not collect metadata regarding tensor lifetimes).
This changeset builds upon the [previous PR enabling resource aliasing](https://github.com/pytorch/pytorch/pull/118436) and introduces the `SharedObject` class to `ComputeGraph`, which manages resource aliasing in graph execution mode. `SharedObject` tracks which `vTensor` values in a `ComputeGraph` share the same backing memory, and provides functionality to aggregate memory requirements and bind users to the same memory allocation.
## Notes for Reviewers
The `SharedObject` class is introduced in `Graph.h`. It's fairly simple and provides three functions:
* `add_user()` which adds a `ValueRef` to the list of users of the `SharedObject`, and updates the aggregate memory requirements with the memory requirements of the new user
* `allocate_memory()` creates a `VmaAllocation` with the aggregated memory requirements
* `bind_users()` iterates over the `users` of the `SharedObject` and binds each `vTensor`'s underlying resource to the memory associated with the `SharedObject`.
As for how `SharedObject` is used in `ComputeGraph`:
* `add_tensor()` now has an additional argument `shared_object_idx` which, if `>0`, will construct a `vTensor` without any backing memory and add the new `vTensor` to the `SharedObject` at `shared_object_idx`
* `encode_execute()` will first iterate through the `SharedObject`s of the graph and allocate + bind users before recording the command buffer.
Differential Revision: [D53271486](https://our.internmc.facebook.com/intern/diff/D53271486/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118756
Approved by: https://github.com/jorgep31415, https://github.com/yipjustin
The motivation is fake_tensor is marked as an uninteresting file for the purposes of backtraces, but operator implementations in fake tensor are interesting and I do want them reported.
How did I decide whether or not to move helper functions? It was kind of random, but if they weren't used in fake_tensor generally, I moved them over.
There are no functional code changes, so you only need to review the import changes.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118878
Approved by: https://github.com/eellison
Our current throughput calculations for kernel benchmarks have some issues,
particularly when we slice inputs in the kernel. In such cases, we count
the original inputs as part of the memory traffic passed across the kernel.
This is incorrect because it may result in a much larger throughput
calculation, which can even exceed the theoretical bandwidth.
Instead, we should only count the size of the "slices" that contribute to
the actual memory traffic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118858
Approved by: https://github.com/jansel
Make multi-kernel work with cpp-wrapper. Multi-kernel generates two equivalent variants for a reduction, and at runtime the faster one is picked. But cpp-wrapper needs to save the cubin file during codegen, so the two did not work with each other initially.
Thanks Jason for suggesting a neat way to integrate the two. cpp-wrapper does 2-pass codegen right now. For the first pass, we still generate multi-kernel code and run it; for the second pass, we load the cubin file for the faster kernel directly. Multi-kernel Python code is not generated for the second pass since it should not be needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117813
Approved by: https://github.com/jansel
```
def f():
def g():
return ()
print(g.__name__)
f()
```
The script above should print `g` (with or without torch.compile), but prints `f.<locals>.g` with torch.compile.
The problem looks like we use the co_qualname when reconstructing the
NestedUserFunctionVariable. I switched this over to use the co_name.
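For reference, the difference between the two names can be seen in plain Python, without torch.compile:
```python
def f():
    def g():
        return ()
    return g

g = f()
print(g.__name__)      # 'g'             (backed by co_name)
print(g.__qualname__)  # 'f.<locals>.g'  (the qualified name)
```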
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118768
Approved by: https://github.com/yanboliang, https://github.com/jansel
Summary:
Previously, get_attr nodes were skipped, but consider, for example:
%mul_240 : [num_users=1] = call_function[target=torch_tensorrt.fx.tracer.acc_tracer.acc_ops.mul](args = (), kwargs = {input: %_fx_const_folded_attrs_13, other: %add_143})
where %_fx_const_folded_attrs_13 is int64 but %add_143 is float; skipping the node causes issues such as "unsupported dtype='int64' for alignments".
Differential Revision: D53273467
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118760
Approved by: https://github.com/khabinov
This PR adds the FSDP reduce-scatter (the copy-in/reduce-scatter collective/view-out).
- We use gradient pre- and post-divide factors like existing FSDP (mainly for fp16 reduction).
- We use a separate CUDA stream for the reduce-scatter to conveniently handle additional kernels surrounding the collective as a separate 'thread of execution' (e.g. pre/post-divide and later the D2H gradient offload).
- ~~The implementation in this PR is more complicated to _try_ to reduce CPU overhead by using `torch.split` instead of a Python for-loop. The challenge comes from the fact that the autograd-computed unsharded gradients do not have padding. We prefer to not do an intermediate padding step and instead directly copying to the big reduce-scatter input.~~ For simplicity, I changed the implementation to include intermediate padding steps, as it can still achieve ~250 GB/s, and it avoids any `O(NP)` tensor materialization for world size `N` and `P` `nn.Parameter`s.
<details>
<summary> Recall: Copy-in/All-Gather/Copy-Out Example </summary>
Suppose we have 2 parameters with shapes `(3, 3)` (denoted with `A`s) and `(2, 2)` (denoted with `B`s) and 2 ranks, where `P` represents padding and `E` represents empty:
```
Given:
(3, 3): AAAAAAAAA
(2, 2): BBBB
Sharded parameters/all-gather inputs:
Rank 0: AAAAAA, BB
Rank 1: AAAPPP, BB
Each rank allocate group's all-gather output:
EEEEEEEEEEEEEEEE
Each rank copy-in:
Rank 0: AAAAAABBEEEEEEEE
Rank 1: EEEEEEEEAAAPPPBB
Each rank all-gather:
Rank 0: AAAAAABBAAAPPPBB
Rank 1: AAAAAABBAAAPPPBB
Each rank copy-out:
Rank 0: AAAAAAAAAPPP, BBBB
Rank 1: AAAAAAAAAPPP, BBBB
```
</details>
<details>
<summary> Copy-in/Reduce-Scatter/View-Out Example </summary>
Suppose we have 2 gradients with shapes `(3, 3)` (denoted with `a`s when not-yet-reduced and `A`s after reduced) and `(2, 2)` (denoted with `b`s and `B`s similarly) and 2 ranks, where `E` represents empty:
```
Given from autograd:
(3, 3): aaaaaaaaa
(2, 2): bbbb
Unsharded gradients/reduce-scatter inputs (no padding!):
Rank 0: aaaaaaaaa, bbbb
Rank 1: aaaaaaaaa, bbbb
Each rank allocate group's reduce-scatter input:
EEEEEEEEEEEEEEEE
Each rank copy-in:
Rank 0: aaaaaabbaaaEEEbb
Rank 1: aaaaaabbaaaEEEbb
Each rank reduce-scatter:
Rank 0: AAAAAABBAAAEEEBB
Rank 1: AAAAAABBAAAEEEBB
Each rank view-out:
Rank 0: AAAAAA, BB
Rank 1: AAA, BB
```
</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117975
Approved by: https://github.com/weifengpy, https://github.com/yifuwang
ghstack dependencies: #117950, #117955, #117973
Special values (`NaN`/`+/-Inf`) are not handled correctly during codegen for `ir.Scan` nodes. This
is a fairly minor bugfix that has not come up since the only two scan
ops with lowerings use "normal" values.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118788
Approved by: https://github.com/peterbell10
Description:
- Fixed error in bicubic upsampling aa=false for uint8 input. This is seen in the test suite:
```diff
- self.assertLess(diff.max(), 15)
+ self.assertLess(diff.max(), 5)
```
While reducing the input range we do not fully remove the clipping effect, which is why the threshold is 5 and not around 1.
- Renamed methods
- The error is mostly visible for upsampling (smaller -> larger) mode on the boundary values
More details on the bug:
For uint8 input and antialiasing=False we use a separable algorithm (using temp buffers and interpolating dimensions one by one) where interpolation weights and input indices are computed and stored using index ranges `index_min` and `index_size`; weights outside of `index_size` are zeros. For example, for an output point we can have index_min=10 and index_size=4 with 4 non-zero weights, so the output value is computed as
```
out_value = sum([src[i + index_min] * w for i, w in zip(range(4), weights) ])
```
When computing index ranges and weights for output points near the boundaries, we clamp `index_min` between 0 and input_size, and `index_size` becomes smaller than 4. This approach is OK for antialiasing=True but not for antialiasing=False, where the weights are computed incorrectly:
```
-- output index i= 0
regular float32 approach:
source indices: [-2, -1, 0, 1] -> outbounded values are clamped to boundaries -> [0, 0, 0, 1]
interp weights: [-0.07200000000000006, 0.4600000000000001, 0.72, -0.1080000000000001]
separable uint8 approach:
source indices coming from index ranges (min, size): [0, 1]
incorrect interp weights computed with current implementation : [1.1764705882352944, -0.17647058823529432, 0.0, 0.0]
fixed interp weights in the PR: [1.108, -0.1080000000000001, 0.0, 0.0]
Note: weight value corresponding to source index 0 is 1.108 = -0.07200000000000006 + 0.4600000000000001 + 0.72 and weight value corresponding to source index 1 is -0.1080000000000001 is the same as in f32 approach.
```
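A small pure-Python sketch of the weight folding the fix performs for boundary output points (illustrative only, not the actual C++ kernel):
```python
def fold_boundary_weights(indices, weights, in_size):
    # Weights whose source index falls outside [0, in_size) are accumulated onto
    # the clamped boundary index instead of being recomputed from a shrunk window.
    folded = {}
    for idx, w in zip(indices, weights):
        clamped = min(max(idx, 0), in_size - 1)
        folded[clamped] = folded.get(clamped, 0.0) + w
    return folded

# Reproduces the i=0 example above: indices [-2, -1] fold onto index 0.
print(fold_boundary_weights([-2, -1, 0, 1], [-0.072, 0.46, 0.72, -0.108], 400))
# ~{0: 1.108, 1: -0.108}
```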
Quick benchmark to ensure there is no perf regression:
```
[------------------------------------------------------------------------------------ Resize ------------------------------------------------------------------------------------]
| torch (2.3.0a0+gitfda85a6) PR | torch (2.3.0a0+git0d1e705) Nightly | Speed-up: PR vs Nightly
1 threads: -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
3 torch.uint8 channels_first bilinear (400, 400) -> (224, 224) aa=False | 440.996 (+-2.044) | 470.824 (+-5.927) | 1.068 (+-0.000)
3 torch.uint8 channels_first bicubic (400, 400) -> (224, 224) aa=False | 463.565 (+-1.519) | 497.231 (+-10.825) | 1.073 (+-0.000)
3 torch.uint8 channels_first bilinear (400, 400) -> (700, 700) aa=False | 1717.000 (+-28.589) | 1915.570 (+-43.397) | 1.116 (+-0.000)
3 torch.uint8 channels_first bicubic (400, 400) -> (700, 700) aa=False | 1801.954 (+-22.391) | 1981.501 (+-37.034) | 1.100 (+-0.000)
3 torch.uint8 channels_last bilinear (400, 400) -> (224, 224) aa=False | 199.599 (+-0.851) | 196.535 (+-3.788) | 0.985 (+-0.000)
3 torch.uint8 channels_last bicubic (400, 400) -> (224, 224) aa=False | 243.126 (+-0.681) | 240.695 (+-2.306) | 0.990 (+-0.000)
3 torch.uint8 channels_last bilinear (400, 400) -> (700, 700) aa=False | 686.270 (+-2.870) | 687.769 (+-17.863) | 1.002 (+-0.000)
3 torch.uint8 channels_last bicubic (400, 400) -> (700, 700) aa=False | 899.509 (+-5.377) | 899.063 (+-9.001) | 1.000 (+-0.000)
Times are in microseconds (us).
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118389
Approved by: https://github.com/NicolasHug
ghstack dependencies: #118388
# Motivation
According to [[1/4] Intel GPU Runtime Upstreaming for Device](https://github.com/pytorch/pytorch/pull/116019), As mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), this third PR covers the changes under `libtorch_python`.
# Design
This PR primarily offers device-related APIs in the Python frontend (see the usage sketch after the list), including
- `torch.xpu.is_available`
- `torch.xpu.device_count`
- `torch.xpu.current_device`
- `torch.xpu.set_device`
- `torch.xpu.device`
- `torch.xpu.device_of`
- `torch.xpu.get_device_name`
- `torch.xpu.get_device_capability`
- `torch.xpu.get_device_properties`
- ====================
- `torch.xpu._DeviceGuard`
- `torch.xpu._is_compiled`
- `torch.xpu._get_device`
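A hedged usage sketch of the public APIs above, assuming they mirror their torch.cuda equivalents and a build with XPU support (otherwise `is_available()` returns False):
```python
import torch

if torch.xpu.is_available():
    print(torch.xpu.device_count())
    torch.xpu.set_device(0)
    dev = torch.xpu.current_device()
    print(torch.xpu.get_device_name(dev))
    print(torch.xpu.get_device_capability(dev))
```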
# Additional Context
We will implement the support of lazy initialization in the next PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116850
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/malfet
Summary:
There is an annoying inconsistency in how we pickle custom objs.
`torch.save` will invoke regular pickle, for which we have bound `__setstate__`/`__getstate__` methods on `torch.ScriptObject`: https://fburl.com/code/4howyl4u.
This serializes in a different format than TorchScript does, which uses the TS C++ pickler.
The issue we were facing was using the Python pickler to save, and the C++ pickler to load. If we use the C++ pickler to both save and load (plus some plumbing to get type/object resolution to work correctly), then things should work.
Test Plan:
ran SherlockNoMad's repro
```
buck2 run 'fbcode//mode/dev-nosan' scripts/bahuang:export_torchbind -- --logging DBG
```
Got to a new error, which has to do with how we're initializing the graph, but will leave that for future diffs.
Reviewed By: SherlockNoMad
Differential Revision: D53248454
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118791
Approved by: https://github.com/qxy11, https://github.com/SherlockNoMad, https://github.com/khabinov
Description:
- Lowered error thresholds and added input range for bicubic to make visible the inconsistency error in the implementation for upsampling (smaller -> larger) bicubic aa=false mode for uint8 input dtype
- Updated out-dated comments
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118388
Approved by: https://github.com/NicolasHug
Summary: We only need to deepcopy the graph because we're modifying the graph by unlifting its parameter/buffer inputs. We don't need to deepcopy the graph module state/contents. Deepcopying the whole graph module causes an error when it contains an ExecuTorch LoweredModule which stores tensors.
Test Plan: Fixes the following diff
Differential Revision: D53290077
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118821
Approved by: https://github.com/tugsbayasgalan
Adding an `OpInfo` test for `split_with_sizes_copy` so we can use it to test [CUDA fast path for split_with_sizes_copy.out](https://github.com/pytorch/pytorch/pull/117203). Since the `OpInfo` test doesn't exist yet and introducing it requires modifications to the `CompositeExplicitAutograd` impl, adding the `OpInfo` test in a separate PR to establish a healthy baseline.
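For reference, a small sketch of the op being covered (the `out=` form is the one the follow-up CUDA fast path targets; the Python binding of that overload is assumed to accept a list of destination tensors):
```python
import torch

x = torch.arange(10.)
# Functional form: returns new tensors (copies, not views).
chunks = torch.split_with_sizes_copy(x, [3, 3, 4])
# out= form: writes into preallocated destination tensors.
outs = [torch.empty(3), torch.empty(3), torch.empty(4)]
torch.split_with_sizes_copy(x, [3, 3, 4], out=outs)
```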
Changes made:
- Registered a batching rule for `split_with_sizes_copy`.
- Registered a decomposition for `split_with_sizes_copy`.
- Registered a DTensor prop rule for `split_with_sizes_copy`.
- Added required dtype and device checks to the composite impl.
- Added output resize to the composite impl.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118512
Approved by: https://github.com/albanD
Summary:
Add Runtime Constant-folding for AOTInductor.
This also include the invocation of constant folding at load time.
The constant folding lowering is a 2-step process.
First, we split the graph into 2 modules; one of them is the constant module, which doesn't depend on any input, so the whole module can be inferred (constant-folded) once and reused. The constant module is lowered and codegen-ed as usual and cached (let's call this the constant code). The constant code reuses the whole lowering/profiling/etc. process; the only difference is that we do not generate any headers or initialization for the constant code.
Second, after handling the constant module, we take care of the main module (which is the part that depends on the user input). For the main module, compared with a normal lowering, we take in one additional component: the constant code. The additional step here is that we inject the constant code into the codegen-ed main module and create the caller for the main module to consume the result of the constant module.
Test Plan: Unit tests included in commit.
Differential Revision: D53274382
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118765
Approved by: https://github.com/chenyang78
Summary:
When `version` is missing in the metadata, use `min_val/max_val` as keys instead of `max_vals/min_vals`
## Reasons
1. It's been almost 2 years since this change D30003700, which means now most checkpoints are using the `max_val/min_val` keys
2. most checkpoint dumps using `model.state_dict()` don't have version info, which leads to a spurious `missing keys` error when loading the state_dict
Test Plan: CI
Differential Revision: D53233012
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118659
Approved by: https://github.com/jerryzh168
This PR adds the all-gather and free logic required for forward.
- We define the logical all-gather as two ops: (1) unshard and (2) wait for unshard. This abstraction allows capturing both implicit forward prefetching (using multiple streams and `async_op=False`) and explicit forward prefetching (using `async_op=True`).
- Symmetrically, we define the reshard op to free the unsharded parameters.
Some other notes:
- The `FSDPParamGroup` and its `FSDPParam`s transition their sharded states together. This invariant allows us to reason about the parameters by group rather than individually with respect to whether they are sharded or unsharded.
---
### How Does the Overlap Work for All-Gather?
For context, the all-gather consists of three steps: (1) copy-in, (2) all-gather collective, and (3) copy-out.
<details>
<summary> Example </summary>
Suppose we have 2 parameters with shapes `(3, 3)` (denoted with `A`s) and `(2, 2)` (denoted with `B`s) and 2 ranks, where `P` represents padding and `E` represents empty:
```
Given:
(3, 3): AAAAAAAAA
(2, 2): BBBB
Sharded parameters/all-gather inputs:
Rank 0: AAAAAA, BB
Rank 1: AAAPPP, BB
Each rank allocate group's all-gather output:
EEEEEEEEEEEEEEEE
Each rank copy-in:
Rank 0: AAAAAABBEEEEEEEE
Rank 1: EEEEEEEEAAAPPPBB
Each rank all-gather:
Rank 0: AAAAAABBAAAPPPBB
Rank 1: AAAAAABBAAAPPPBB
Each rank copy-out:
Rank 0: AAAAAAAAAPPP, BBBB
Rank 1: AAAAAAAAAPPP, BBBB
```
</details>
`dist.all_gather_into_tensor()` always has the PG's NCCL stream wait for the current stream before running the collective. `async_op=False` means that the function waits on the work, having the current stream wait for the NCCL stream before returning. `async_op=True` means it returns the `Work` object, which the user can wait on later.
#### Implicit Prefetching
Implicit prefetching achieves communication/computation overlap without changing the CPU issue order:
- We use separate streams for copy-in and for issuing the `dist.all_gather_into_tensor()`. The copy-in stream allows us to overlap the copy-in with all-gather/reduce-scatter in backward, and the all-gather stream allows us to overlap the all-gather with forward compute (issued before it).
- Because `dist.all_gather_into_tensor()` always has the PG's NCCL stream wait for the current stream, we need this "dummy" all-gather stream to prevent the all-gather from waiting on the forward compute with which it should overlap.
- Without the separate copy-in stream, we cannot overlap all-gather copy-in with all-gather in forward.
- We copy-out in the default stream after having the default stream wait for the all-gather. This means that the autograd leaves are allocated in the default stream and autograd will not call `recordStream`.
Implicit prefetching does not require knowing the execution order ahead of time. However, when overlapping the next all-gather with the current compute, there may be a gap from the CPU thread issuing the current compute. If the CPU thread can run ahead, then this is not an issue.
#### Explicit Prefetching
Explicit prefetching achieves communication/computation overlap by changing the CPU issue order, namely by reordering the all-gather to be before the compute with which it should overlap.
- Because we reorder, we do not need any separate streams, and we can use `async_op=False` for overlap.
- We can expose this explicit prefetching as a module-level `unshard()` op (e.g. `module.unshard(async_op: bool)`, and we can use it as a primitive for implementing the explicit forward prefetching in existing FSDP.
Explicit prefetching requires knowing the execution order.
---
Disclaimer: The testing is relatively light in this PR. I did not want to spend too much time writing new forward-only tests. The stream usage will be exercised thoroughly once we have backward too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117973
Approved by: https://github.com/weifengpy, https://github.com/yifuwang
ghstack dependencies: #117950, #117955
Summary:
X-link: https://github.com/pytorch/executorch/pull/1769
Basic support for non-persistent buffers, which are buffers that do not show up in the state dict.
One weird twist is that most of our other systems (FX, aot_export, dynamo) have completely buggy handling of non-persistent buffers. I tried to go on a wild goose chase to fix them all, but it got to be too much. So I introduced some sad rewrite passes in `_export` to make the final state dict correctly align with the original module's state dict.
This exposed some bugs/ambiguous handling of parameters/buffers in existing test code. For example, `TestSaveLoad.test_save_buffer` traced over a module that was not in the root module hierarchy and caused some weird behavior. I think we should error explicitly on use cases like this: https://github.com/pytorch/pytorch/issues/118410. For now I just rewrote the tests or skipped them.
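A minimal sketch of the behavior this enables (a non-persistent buffer is absent from the state dict but still usable by the exported program):
```python
import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer("offset", torch.ones(1), persistent=False)

    def forward(self, x):
        return x + self.offset

m = M()
print("offset" in m.state_dict())           # False: not persisted
ep = torch.export.export(m, (torch.randn(2),))
print(ep.module()(torch.zeros(2)))          # still uses the buffer: tensor([1., 1.])
```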
Test Plan: added a unit test
Differential Revision: D53253905
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118722
Approved by: https://github.com/SherlockNoMad, https://github.com/angelayi
Prior to onnx export, the model is deepcopied to avoid modifications that may affect later performance profiling. However this increases the memory requirement on the device.
This PR modifies the script to deepcopy and export the model on another device when possible.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118710
Approved by: https://github.com/thiagocrepaldi
I was just playing around with improving the typing of symbolic_shapes. The PR is not "complete", but I particularly wanted to get feedback on whether people like making ValueRanges Generic; it seems that distinguishing whether you have an Expr ValueRange or a SympyBoolean ValueRange is a lot of trouble downstream. Using TypeGuard, we can perform refinements on the generic parameter inside methods, although we still have to cast back to ValueRange[T] due to https://github.com/python/mypy/issues/14425#issuecomment-1914852707
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118529
Approved by: https://github.com/Skylion007
Every day I move closer and closer to just using numbers
* number of heuristics that marked it as high, probable, low, none etc
* order of heuristics in the `__init__` file as well as how the heuristic ordered the tests
* put heuristics historical edited files and profiling as not trial mode
* briefly sanity checked that all shards of the larger test files (ex test_ops) exist and there are no dups
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118029
Approved by: https://github.com/huydhn
This PR adds the FSDP all-gather (the copy-in/all-gather collective and the copy-out) and the unsharded parameter concept to `FSDPParam`. This is to prepare for being able to run the forward pass.
- We implement all-gather as two functions: `foreach_all_gather` (copy-in/all-gather collective) and `foreach_all_gather_copy_out` (copy-out).
- In the future, there will be two paths: `async_op=True` in the default stream for explicit prefetching and `async_op=False` in separate streams for implicit prefetching.
- In the future, we will use `torch.split_with_sizes_copy` in the copy-out when it has the CUDA fast path.
- We have the functions operate on `List[FSDPParam]` instead of passing the `torch.Tensor` and metadata mainly so that the `all_gather_input` can be computed under the `all_gather_copy_in_stream`. Since the two functions are specific to FSDP, I did not see motivation for avoiding this at the cost of entering/exiting the `all_gather_copy_in_stream` context twice (which incurs some CPU overhead).
- The `init_all_gather_output()` and `init_unsharded_parameter()` functions may seem unintuitive. The reason we initialize them once and write to them in-place thereafter is for autograd. See the note `[Note: FSDP and autograd]` in the code.
- We expand our 'FSDP tensors' definition to include the all-gather input and all-gather output in addition to the sharded and unsharded parameters. This distinction might seem unnecessary or pedantic, but it enables a language for describing pre- and post-all-gather transformations.
- We use the `_unsafe_preserve_version_counters` context when copying out because otherwise autograd will complain of a version mismatch in backward due to writing to the leaf tensors. (An alternative would be to use `.data`, but we are avoiding that 😄 .)
---
<details>
<summary> Copy-in/All-Gather/Copy-Out Example </summary>
Suppose we have 2 parameters with shapes `(3, 3)` (denoted with `A`s) and `(2, 2)` (denoted with `B`s) and 2 ranks, where `P` represents padding and `E` represents empty:
```
Given:
(3, 3): AAAAAAAAA
(2, 2): BBBB
Sharded parameters/all-gather inputs:
Rank 0: AAAAAA, BB
Rank 1: AAAPPP, BB
Each rank allocate group's all-gather output:
EEEEEEEEEEEEEEEE
Each rank copy-in:
Rank 0: AAAAAABBEEEEEEEE
Rank 1: EEEEEEEEAAAPPPBB
Each rank all-gather:
Rank 0: AAAAAABBAAAPPPBB
Rank 1: AAAAAABBAAAPPPBB
Each rank copy-out:
Rank 0: AAAAAAAAAPPP, BBBB
Rank 1: AAAAAAAAAPPP, BBBB
```
</details>
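For concreteness, a minimal sketch of the copy-in/all-gather/copy-out pattern (illustrative only; the real `foreach_all_gather` also handles padding, dtypes, and the separate copy-in stream):
```python
import torch
import torch.distributed as dist

def all_gather_params(sharded: list, group) -> list:
    world_size = dist.get_world_size(group)
    # Copy-in: flatten this rank's shards into one contiguous all-gather input.
    inp = torch.cat([s.reshape(-1) for s in sharded])
    out = inp.new_empty(world_size * inp.numel())
    dist.all_gather_into_tensor(out, inp, group=group)  # one big collective
    # Copy-out: slice each parameter's shards back out of every rank's chunk
    # and concatenate them into the unsharded flat parameter.
    chunks, result, offset = out.chunk(world_size), [], 0
    for s in sharded:
        n = s.numel()
        result.append(torch.cat([c[offset:offset + n] for c in chunks]))
        offset += n
    return result
```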
---
For context, we use the copy-in/all-gather/copy-out strategy instead of NCCL group coalescing for two reasons:
1. One large NCCL all-gather is still noticeably faster than several NCCL all-gathers using group coalescing of the same total bytes (even after NCCL 2.18.3). We prefer to trade off extra device-to-device copies (using GPU high-bandwidth memory) to save communication time, which does not improve as quickly from hardware generation to generation.
2. Copying out of the all-gather buffer tensor simplifies multi-stream memory handling because there is a constant number of such all-gather tensors alive at once. (The copy-out is done in the default/compute stream.) If we directly used the all-gather tensor memory for computation, then the number of such alive tensors is linear in the module depth and hence dependent on the particular model.
---
Disclaimer: This PR has some extraneous code, but I did not want to simplify too much since that code will be added back soon anyway (e.g. for overlapping, mixed precision, and ZeRO++). Hopefully it does not hinder code review too much.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117950
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
Summary:
The TorchScript interpreter had multiple opcodes whose logic had the potential to access the registers_ array out of bounds.
This change ensures that all registers_ accesses are in bounds or an exception will be thrown.
Test Plan: contbuild + OSS signals
Differential Revision: D49748737
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110300
Approved by: https://github.com/malfet, https://github.com/kimishpatel
```
...
from /home/vinithav/pytorch-build/forks/myforks/jan23/pytorch/aten/src/ATen/test/vec_test_all_types.cpp:1: /home/vinithav/pytorch-build/forks/myforks/jan23/pytorch/aten/src/ATen/cpu/vec/vec256/vsx/vec256_float_vsx.h: In member function 'bool at::vec::DEFAULT::Vectorized::has_inf_nan() const': /home/vinithav/pytorch-build/forks/myforks/jan23/pytorch/aten/src/ATen/cpu/vec/vec256/vsx/vec256_float_vsx.h:244:36: error: no matching function for call to 'at::vec::DEFAULT::Vectorized::_isinf(float&) const' 244 | if(_isnan(_vec0[i]) || _isinf(_vec0[i])) {
| ~~~~~~^~~~~~~~~~
/home/vinithav/pytorch-build/forks/myforks/jan23/pytorch/aten/src/ATen/cpu/vec/vec256/vsx/vec256_float_vsx.h:237:21: note: candidate: 'at::vec::DEFAULT::Vectorized at::vec::DEFAULT::Vectorized::_isinf() const'~ ...
```
Started breaking from
29516bd2a0.
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118516
Approved by: https://github.com/ezyang
Ref: #86340
Fixes #118148
This fixes LBFGS for complex parameters. Complex parameters are handled as R^2.
I also added a test; unfortunately, due to the closure required, I could not use the existing `_test_complex_optimizer` used for all other optimizers.
LBFGS is special, as it will call the objective function multiple times internally, so I felt making a one-off test for LBFGS might be justifiable.
We will test if each step taken internally by the optimizer is the same for R^2 and complex parameters.
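A minimal sketch of the parity being tested (assuming the fix is in place; the real test compares every internal step):
```python
import torch

# An LBFGS step on a complex parameter should match the step taken on its
# real (R^2) view obtained via torch.view_as_real.
p_c = torch.randn(3, dtype=torch.cfloat, requires_grad=True)
p_r = torch.view_as_real(p_c.detach()).clone().requires_grad_(True)
opt_c, opt_r = torch.optim.LBFGS([p_c]), torch.optim.LBFGS([p_r])

def closure(opt, p):
    opt.zero_grad()
    loss = p.abs().pow(2).sum()  # real-valued loss, identical for both views
    loss.backward()
    return loss

opt_c.step(lambda: closure(opt_c, p_c))
opt_r.step(lambda: closure(opt_r, p_r))
# After the fix, this difference should be ~0.
print((torch.view_as_real(p_c.detach()) - p_r.detach()).abs().max())
```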
Let me know if the approach is ok, thanks
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118184
Approved by: https://github.com/janeyx99
Changelog:
- Don't count running PYTORCH_TEST_WITH_DYNAMO=1 on dynamo/ tests in the pass
rate. This was a bug (we were counting all of these as failing, but in
reality, most of these pass). The net effect is that the passrate is (artificially)
6% higher.
- Have the histogram script filter out skips based on the passrate metric.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118752
Approved by: https://github.com/jamesjwu
Summary: Exposes `dynamic_shapes` api at multiple levels so it's easier to replace the old API `dynamic_dim()` with the new API `Dim()`.
Test Plan: CI
Differential Revision: D53246409
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118695
Approved by: https://github.com/ydwu4
Cutlass 3.3 offers the following improvements:
- Adds support for mixed precision GEMMs on Hopper and Ampere
- Adds support for < 16B aligned GEMMs on Hopper
- Enhancements to EVT
- Enhancements to Python interface
- Enhancements to Sub-byte type handling in CuTe
- Several other bug-fixes and performance improvements
This PR also includes a minor doc update.
Test Plan:
CI ( ciflow/trunk, ciflow/inductor )
pytest test/inductor/test_max_autotune.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118629
Approved by: https://github.com/drisspg, https://github.com/Skylion007, https://github.com/khabinov
Addresses issue https://github.com/pytorch/pytorch/issues/117383
The implementation exposes `--filter-ranks`, which filters, by rank, the files we pass to `TailLog` (used in torchrun to determine which logs to output to stdout/stderr).
## Behavior
### with --tee
Currently --tee is implemented as --redirect to file, and streams file to console using `tail`. When --tee is specified, file logs will be unaffected and we will only filter the output to console.
### with --redirect
When --redirect is specified without --tee, nothing is logged to console, so we no-op.
### with neither
When neither --tee nor --redirect is specified, torchrun uses the empty string "" to indicate logging to console. We intercept this empty string and redirect it to "/dev/null" to avoid printing to console.
The API also allows per-rank configuration for --tee and --redirect, which is also supported by this filter implementation.
## Usage
### without --tee
```
> TORCH_LOGS_FORMAT="%(levelname)s: %(message)s" TORCH_LOGS="graph" torchrun --standalone --nproc_per_node=2 --role rank --filter_ranks=0 t.py
hello from rank 0 python
DEBUG: TRACED GRAPH
__compiled_fn_0 <eval_with_key>.0 opcode name target args kwargs
------------- ------ ----------------------- --------- --------
placeholder l_x_ L_x_ () {}
call_function mul <built-in function mul> (l_x_, 5) {}
output output output ((mul,),) {}
...
```
### with --tee
```
> TORCH_LOGS_FORMAT="%(levelname)s: %(message)s" TORCH_LOGS="graph" torchrun --standalone --nproc_per_node=2 --role rank --tee 3 --filter_ranks=0 t.py
[rank0]:hello from rank 0 python
[rank0]:DEBUG: TRACED GRAPH
[rank0]: __compiled_fn_0 <eval_with_key>.0 opcode name target args kwargs
[rank0]:------------- ------ ----------------------- --------- --------
[rank0]:placeholder l_x_ L_x_ () {}
[rank0]:call_function mul <built-in function mul> (l_x_, 5) {}
[rank0]:output output output ((mul,),) {}
...
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118562
Approved by: https://github.com/wconstab, https://github.com/wanchaol
Summary:
This diff implements a mechanism for safely updating the torch.export serialization schema, aka schema.py, which is the API surface with the strongest compatibility guarantee.
The diff consists of 3 changes:
- Added a script to "build" or "materialize" schema.py into a platform-neutral format (yaml), which serves as the committed form of the serialization schema.
- Added a unittest to compare schema.py against schema.yaml, so that developers are forced to execute the updater script when there is a mismatch between the two files.
- Added a checker inside the updater script, so that all compatible changes result in a minor version bump and all incompatible changes result in a major version bump.
torch.export's serialization BC/FC policy is (tentatively) documented here: https://docs.google.com/document/d/1EN7JrHbOPDhbpLDtiYG4_BPUs7PttpXlbZ27FuwKhxg/edit#heading=h.pup7ir8rqjhx , we will update the
As noted in the code doc, people should be able to run the following command to update schema properly from now on:
```
python scripts/export/update_schema.py --prefix <path_to_torch_development_diretory>
or
buck run caffe2:export_update_schema -- --prefix /data/users/$USER/fbsource/fbcode/caffe2/
```
Test Plan:
buck test mode/opt caffe2/test:test_export -- -r test_schema
buck run caffe2:update_export_schema -- --prefix /data/users/$USER/fbsource/fbcode/caffe2/
Differential Revision: D52971020
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118424
Approved by: https://github.com/angelayi
### Summary
- Added `group_name` as the third field in `dim_group_infos`.
- `DeviceMeshTest` now runs both w/ and w/o `_USE_NATIVE_C10D_FUNCTIONAL=1` in CI.
### Other fixes
- Convert `reduceOp` to lower case before passing it into c10d_functional ops.
- Added a finalizer to handle unwaited collectives (this mirrors the treatment for Python functional collective ops).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118423
Approved by: https://github.com/wanchaol, https://github.com/LucasLLC, https://github.com/wconstab
This PR adds the initial `_lazy_init`. Lazy initialization marks the point when the FSDP structure is finalized and is typically the beginning of the first forward. This would be after any meta-device initialization.
- Lazy initialization is distinct from construction time because when processing `fully_shard(module)`, there is no way to know whether a parent of `module` will have `fully_shard` applied as well. This is a consequence of `fully_shard` having to be applied bottom-up.
- At lazy initialization, we now have the concept of a _root_. The root FSDP module is the one whose `forward` runs first and ends last (and hence similarly for its backward). Having a single root simplifies handling logic that should only run "once per forward/backward/iteration". We may consider relaxing this in the future, but it will add more complexity to the design.
- Once we have a root, we can define _fully-qualified names_ (FQNs) for both parameters and modules. To aid debugging, we store `_param_fqn` and `_module_fqn` on `FSDPParam` and `FSDPParamGroup`, respectively. Note that we can have a unique `_module_fqn` for `FSDPParamGroup` since we currently assume a 1:1 relationship.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117881
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
ghstack dependencies: #118525, #117814, #117867, #117877
This setting is problematic in fbcode, where the expected behavior is to match `arc lint`, which has a behavior much like running `lintrunner` without a `--merge-base-with` argument.
Let's try removing this. I also updated the CI message to encourage people to run with `-m origin/main`, which should hopefully cut down on confusion in the absence of defaulting to that behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118677
Approved by: https://github.com/PaliC
This PR adds logic to shard the managed parameters on dim-0. This is like `distribute_tensor()` with two differences:
1. `distribute_tensor()` today cannot accept a `DTensor` and reshard it to the parent mesh (https://github.com/pytorch/pytorch/issues/116101).
2. `DTensor` does not pad its local shard on any `Shard` dimensions (https://github.com/pytorch/pytorch/issues/113045).
As such, the `FSDPParam._init_sharded_param()` derives the global `DTensor` metadata itself and pads the local tensor on dim-0. The padding helps make the all-gather copy-in more efficient since the all-gather buffer will require padding.
---
Some details:
- We free the original parameter manually after constructing the sharded parameter. This lowers the peak memory during construction time slightly (since not _all_ parameters in the group must be sharded before the original parameters are freed) and is not strictly necessary.
- We bypass `nn.Module.__setattr__` because the checks are slow and unnecessary. The drawback is that we would ignore a user-defined override of `__setattr__`; however, since we have never encountered this in practice, I am okay with this. Notably, user calls to `setattr` would still use the override; FSDP only uses `setattr` as a mechanism for switching between sharded and unsharded parameters.
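As a rough illustration of the kind of `setattr` bypass described above (an assumption about the shape of such a helper, not the actual FSDP code):
```python
import torch
import torch.nn as nn

def unsafe_set_param(module: nn.Module, name: str, param: nn.Parameter) -> None:
    # nn.Module keeps parameters in the _parameters dict; writing there directly
    # skips the (slow) bookkeeping checks that nn.Module.__setattr__ performs.
    module._parameters[name] = param

lin = nn.Linear(4, 4)
unsafe_set_param(lin, "weight", nn.Parameter(torch.zeros(4, 4)))
assert lin.weight.sum() == 0
```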
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117877
Approved by: https://github.com/wanchaol
ghstack dependencies: #118525, #117814, #117867
Summary:
This is part of the pass migration efforts. The final target is removing the acc tracer in AOTI.
In this diff, I did a few things:
1. copy and modify the `fx_passes/split_cat.py` passes based on predispatch IR.
2. verify the correctness by copying the `test_split_cat_fx_passes.py` and create a new file `test_split_cat_fx_passes_aten_fb.py` which is executed in AOTI and checked the counters
3. create a util function to execute the pass and compare the before/after graph to give user more information like pass effect and time spent. It will create logs like
```
[2024-01-25 20:26:48,997] torch._inductor.utils: [INFO] [Pre grad(predispatch IR)]Apply split_cat, index: 0, save before/after graph to /tmp/tmpvlpwrklp, graph before/after are the same = False, time elapsed = 0:00:00.001585
[2024-01-25 20:26:49,000] torch._inductor.utils: [INFO] [Pre grad(predispatch IR)]Apply split_cat, index: 1, save before/after graph to /tmp/tmpz_onjfeu, graph before/after are the same = False, time elapsed = 0:00:00.001873
[2024-01-25 20:26:49,002] torch._inductor.utils: [INFO] [Pre grad(predispatch IR)]Apply split_cat, index: 2, save before/after graph to /tmp/tmpgkck8yko, graph before/after are the same = True, time elapsed = 0:00:00.000269
[2024-01-25 20:26:49,007] torch._inductor.utils: [INFO] [Pre grad(predispatch IR)]Apply split_cat, index: 3, save before/after graph to /tmp/tmpquenq06y, graph before/after are the same = False, time elapsed = 0:00:00.003621
[2024-01-25 20:26:49,009] torch._inductor.utils: [INFO] [Pre grad(predispatch IR)]Apply split_cat, index: 4, save before/after graph to /tmp/tmpi8fia0dv, graph before/after are the same = True, time elapsed = 0:00:00.000190
```
Differential Revision: D53171027
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118590
Approved by: https://github.com/kflu, https://github.com/khabinov, https://github.com/chenyang78
This PR (as a followup to #115530) resolves previous issues of not passing `assertEqual()` tests (with small error) when comparing outputs from the single-gpu model and the distributed model, under certain input/model sizes or when certain operations (e.g. weight-tying) are enabled. This is done by simply enabling higher precision computation using `dtype=torch.float64`.
What is not tested: whether or not distributed model training convergence rate is affected using just `torch.float32` precision.
Test plan:
TP: `python test/distributed/tensor/parallel/test_tp_examples.py -k test_transformer_training_is_seq_parallel_False`
TP+SP: `python test/distributed/tensor/parallel/test_tp_examples.py -k test_transformer_training_is_seq_parallel_True`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116436
Approved by: https://github.com/wanchaol
This PR adds the initial `FSDPParamGroup` and `FSDPParam` classes, and it focuses on the `ParamModuleInfo` data structure.
- `ParamModuleInfo` has the info needed to `setattr` a managed parameter, where it must account for shared parameters and shared modules.
```
# Shared parameter
lin1.weight = lin2.weight
# Shared module
mlp.lin1 = mlp.lin2
```
- In order for FSDP to find shared modules' parameters, we must use `remove_duplicate=False`. See https://github.com/pytorch/pytorch/pull/99448/ for the original context. Finding shared modules' parameters is not necessary for the `setattr` logic, but in case we need it in the future (like for existing FSDP's state dict), we include that info for now.
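A standalone example of why `remove_duplicate=False` matters for shared modules (illustrative, not the FSDP code):
```python
import torch.nn as nn

mlp = nn.Module()
mlp.lin1 = nn.Linear(4, 4)
mlp.lin2 = mlp.lin1  # shared module: both names point at the same Linear

print(len(list(mlp.named_parameters())))                        # 2: shared params deduplicated
print(len(list(mlp.named_parameters(remove_duplicate=False))))  # 4: both paths are visible
```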
With this PR, we see the general system architecture:
- 1 `module` : 1 `fully_shard`
- 1 `fully_shard` : 1 `FSDPParamGroup`
- 1 `FSDPParamGroup` : k `FSDPParam`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117867
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
ghstack dependencies: #118525, #117814
Squashed to include https://github.com/pytorch/pytorch/pull/117861, https://github.com/pytorch/pytorch/pull/117852
---
This PR adds `_get_managed_modules()` to determine which modules a `fully_shard(module)` call manages. The rule is defined as:
> `fully_shard(module)` manages all modules in `module.modules()` except those already managed by a nested `fully_shard()` or a nested non-composable API (e.g. `replicate()` or TorchRec).
Practically, this can be implemented as a graph search from `module` that does not proceed into any module with `fully_shard` or a non-composable API applied. Because the non-composable APIs follow the same rule, this rule is correct inductively.
---
This PR adds `_get_managed_states(managed_modules)` to return the managed parameters and buffers given the managed modules.
- Without an extra mechanism to ignore specific parameters or buffers, the rule currently is simply to get the directly managed state (i.e. parameters/buffers) from each managed module while de-duplicating shared ones.
- However, we prefer this translation from managed modules to managed states to accommodate ignoring specific states in the future (which has appeared in various open-source use cases).
---
This PR adds the `mesh` argument to `fully_shard` and some helper data structures specific to FSDP/HSDP that pre-compute useful info like rank/world size for each mesh dim.
- The `mesh` defines the FSDP/HSDP algorithm. 1D mesh means FSDP, and 2D mesh means HSDP, where we assume sharding on the last dimension.
- We can revisit the HSDP sharding-dim assumption if needed in the future.
- The default (if `mesh is None`) is that `fully_shard` calls `init_device_mesh` following the global process group.
- The helper data structures are the various `*MeshInfo`s. I included up to the `HSDPMeshInfo` even though it will not be immediately used to show the spirit of it. We want to tag both the shard and replicate dims.
- The `mesh_info` variable in `fully_shard` is not used for now. It will be passed downstream in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117814
Approved by: https://github.com/wanchaol, https://github.com/wconstab
ghstack dependencies: #118525
This PR introduces the initial `fully_shard` frontend without any distributed logic that will be built into per-parameter-sharding FSDP.
- We design `fully_shard` to be a _module-level_ API (taking in an `nn.Module`), e.g. as opposed to a tensor-level one.
- We define a `FSDP` class and use a dynamic class swap, setting `module.__class__` to a newly created class that subclasses `FSDP` and `type(module)`, to allow FSDP to override and add methods on the module.
- We name this class `FSDP<type(module)>`, e.g. `FSDPLinear` for `Linear` (see the sketch after this list).
- We disable the `deepcopy` because the state object inserted on the module will not be trivially `deepcopy`-able.
- Calling `fully_shard(module)` inserts a state object on `module` but not any of its children. This state object will be used for any FSDP-specific state.
- We raise an error on `ModuleList` or `ModuleDict` since they do not implement `forward()`, and FSDP will rely on `forward()` to insert logic (https://github.com/pytorch/pytorch/issues/113794).
- In the future, we will deprecate the existing `fully_shard` that calls into the same backend logic as `FullyShardedDataParallel` as there is no adoption for that and we prefer to reuse that name.
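A minimal, self-contained sketch of the dynamic class swap described above (illustrative only; the real `fully_shard` also installs a state object, disables `deepcopy`, etc.):
```python
import torch.nn as nn

class FSDP:
    """Mixin carrying the methods FSDP wants to add to the wrapped module."""
    def unshard(self):  # placeholder for an FSDP-added method
        pass

def swap_class(module: nn.Module) -> nn.Module:
    orig_cls = type(module)
    new_cls = type(f"FSDP{orig_cls.__name__}", (FSDP, orig_cls), {})  # e.g. FSDPLinear
    module.__class__ = new_cls
    return module

lin = swap_class(nn.Linear(4, 4))
print(type(lin).__name__)          # FSDPLinear
print(isinstance(lin, nn.Linear))  # True
```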
**Reland details:** I removed `test/distributed/_composable/fsdp/_test_fully_shard_common.py` and moved its contents to the existing `torch/testing/_internal/common_fsdp.py`, which is already a target for internal tests.
Differential Revision: [D53187509](https://our.internmc.facebook.com/intern/diff/D53187509)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118525
Approved by: https://github.com/wanchaol
Summary:
Previously by default we don't generate quantized weight, that is, we'll have fp32 weight, and
`fp32 weight -> q -> dq -> linear -> ...` in the quantized model
After this PR, we'll produce a graph with int8 weight by default after convert_pt2e:
`int8 weight -> dq -> linear -> ...`
We'll remove the fold_quantize flag in the next PR
Test Plan: CI
Differential Revision: D51730862
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118605
Approved by: https://github.com/andrewor14
Fixes https://github.com/pytorch/pytorch/issues/118129
Suppressions automatically added with
```
import re

with open("error_file.txt", "r") as f:
    errors = f.readlines()

error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f"  # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Co-authored-by: Catherine Lee <csl@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533
Approved by: https://github.com/Skylion007, https://github.com/zou3519
Simplifies and optimizes dict construction using the `fromkeys` classmethod ctor. This also makes it really obvious when all the keys will have the same static value, which could be a bug if unintentional. It is also significantly faster than using a dict comprehension. The rule is in preview, but I am adding a forward fix for when it becomes stable.
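For illustration, the kind of rewrite this lint rule prefers (plain Python, not PyTorch-specific):
```python
keys = ["a", "b", "c"]

# Before: dict comprehension with a constant value
d1 = {k: None for k in keys}

# After: classmethod constructor; faster, and makes the shared static value obvious
d2 = dict.fromkeys(keys)      # values default to None
d3 = dict.fromkeys(keys, 0)   # every key maps to the same 0

assert d1 == d2
```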
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118637
Approved by: https://github.com/albanD
Summary:
During debugging of some timed-out jobs, I found it difficult to
identify which rank is at fault even though we have logs of many ranks
reporting a timeout on a specific collective seq.
If we also report lastEnqueuedSeq and lastCompletedSeq, it would be
much easier to identify:
1. whether a rank has not even joined a collective call (not enqueued), or
2. whether it has joined the collective call but not completed it.
For the 1st case, it is most likely a problem in user code;
for the 2nd case, it could be a lower-layer issue.
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118582
Approved by: https://github.com/wconstab
Summary:
Learning Vulkan shaders and realized one of the branches can be easily optimized.
The relevant branch is only taken when we unsqueeze along `dim == 1` for 3D tensors.
1. There's an unnecessary for-loop.
2. There's an unnecessary dependency on the output tensor's number of channels.
## CPU Tensor
```
3D->4D: (c, h, w) -> (c, 0, h, w)
```
## GPU Texture
```
3D->4D: (w, h, c/4)[c%4] -> (w, h, c)[0]
```
Note the GPU Texture's output is always at `[0]` and the output tensor's number of channels is always 1.
We are currently writing the same value `v[p]` to all elements of the texel `out_texel`, but we need only write it to `out_texel[0]`.
Test Plan:
```
[jorgep31415@161342.od /data/sandcastle/boxes/fbsource (ca3b566bc)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*unsqueeze*"
File changed: fbcode//caffe2/aten/src/ATen/native/vulkan/glsl/unsqueeze.glsl
File changed: fbsource//xplat/caffe2/aten/src/ATen/native/vulkan/glsl/unsqueeze.glsl
Buck UI: https://www.internalfb.com/buck2/2c7f1365-e004-41a0-9201-473929a2738a
Network: Up: 174B Down: 0B (reSessionID-c54d25da-f44b-49f7-8bfd-1db4eee50f6d)
Jobs completed: 6. Time elapsed: 14.4s.
Cache hits: 0%. Commands: 1 (cached: 0, remote: 0, local: 1)
BUILD SUCCEEDED
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *unsqueeze*
[==========] Running 10 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 10 tests from VulkanAPITest
[ RUN ] VulkanAPITest.unsqueeze_0dto1d_dim0
[ OK ] VulkanAPITest.unsqueeze_0dto1d_dim0 (60 ms)
[ RUN ] VulkanAPITest.unsqueeze_1dto2d_dim0
[ OK ] VulkanAPITest.unsqueeze_1dto2d_dim0 (0 ms)
[ RUN ] VulkanAPITest.unsqueeze_1dto2d_dim1
[ OK ] VulkanAPITest.unsqueeze_1dto2d_dim1 (132 ms)
[ RUN ] VulkanAPITest.unsqueeze_2dto3d_dim0
[ OK ] VulkanAPITest.unsqueeze_2dto3d_dim0 (20 ms)
[ RUN ] VulkanAPITest.unsqueeze_2dto3d_dim1
[ OK ] VulkanAPITest.unsqueeze_2dto3d_dim1 (66 ms)
[ RUN ] VulkanAPITest.unsqueeze_2dto3d_dim2
[ OK ] VulkanAPITest.unsqueeze_2dto3d_dim2 (3 ms)
[ RUN ] VulkanAPITest.unsqueeze_3dto4d_dim0
[ OK ] VulkanAPITest.unsqueeze_3dto4d_dim0 (19 ms)
[ RUN ] VulkanAPITest.unsqueeze_3dto4d_dim1
[ OK ] VulkanAPITest.unsqueeze_3dto4d_dim1 (1 ms)
[ RUN ] VulkanAPITest.unsqueeze_3dto4d_dim2
[ OK ] VulkanAPITest.unsqueeze_3dto4d_dim2 (1 ms)
[ RUN ] VulkanAPITest.unsqueeze_3dto4d_dim3
[ OK ] VulkanAPITest.unsqueeze_3dto4d_dim3 (1 ms)
[----------] 10 tests from VulkanAPITest (307 ms total)
[----------] Global test environment tear-down
[==========] 10 tests from 1 test suite ran. (307 ms total)
[ PASSED ] 10 tests.
[
```
Differential Revision: D53189637
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118575
Approved by: https://github.com/yipjustin
I don't think we should be unlifting HOO submodules.
What is the contract of unlifting? It is: restore the original calling convention of the module, undoing the transformation in which we lift parameters, buffers, and constants to inputs in the graph.
Unlifting does *not* make any guarantees about what's going on inside the module. It's still a flat module. So why should we unlift the cond/map submodules? It doesn't have anything to do with the contract stated above; it's some internal stuff that doesn't affect how the module will be called.
Further, this code as written modifies the state dict, adding a new buffer that is actually a duplicate of a previous buffer. Modifying the state dict from the original eager module is never correct.
Differential Revision: [D53160713](https://our.internmc.facebook.com/intern/diff/D53160713/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118610
Approved by: https://github.com/zhxchen17
ghstack dependencies: #118607, #118608, #118609
This PR rewrites two paths to use the newly-added keypaths API in pytree:
First: we were hand-rolling a tree_map during fakification because we wanted to track sources. This PR uses keypaths instead, which can do the same thing without needing custom code.
Second: our constraint error formatting was referencing placeholder names in error messages. These placeholder names are not otherwise user-visible, so they are super confusing to users (e.g. "which input does arg1_3 correspond to?"). This diff uses the `keystr` API to format the error message.
This necessitated some small refactors—generating the keystr is expensive so doing it in an f-string was very bad.
It can also be further improved—we can inspect the signature so that instead of `*args[0]` we can give people the actual argument name, which would be the ideal UX. But leaving that for later.
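A small illustration of the keypath idea (assuming the pytree keypath helpers referenced above, `tree_flatten_with_path` and `keystr`, are available under `torch.utils._pytree`; the input dict is made up):
```python
from torch.utils import _pytree as pytree

inputs = {"x": (1, 2), "y": [3]}
leaves_with_paths, _spec = pytree.tree_flatten_with_path(inputs)
for keypath, leaf in leaves_with_paths:
    # e.g. "['x'][0] -> 1", which is far clearer in an error message than "arg1_3"
    print(pytree.keystr(keypath), "->", leaf)
```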
Differential Revision: [D53139358](https://our.internmc.facebook.com/intern/diff/D53139358/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118609
Approved by: https://github.com/zhxchen17
ghstack dependencies: #118607, #118608
tree_flatten_spec is bad; it isn't synced up with `register_pytree_node` so it will not handle arbitrary custom pytrees. It's also not really maintained.
We only use it for two purposes:
- To retain kwarg ordering stability, so that if the user passes in kwargs in a different order things will still work.
- To do "structural" checks that ignore types.
In both cases, tree_flatten_spec is probably *not* the ideal way to implement the desired behavior.
## kwargs ordering
- tree_flatten_spec overrides the behavior of ALL dictionaries, not just kwargs. This is not correct: dictionary ordering is meaningful in Python, and it's pretty trivial to write a program that relies on dict ordering.
- For kwargs, we do sort of expect that the order in which arguments are passed shouldn't matter. BUT there is one exception: `**kwargs`. In fact, [PEP 468](https://peps.python.org/pep-0468/) was introduced specifically to clarify that ordering does matter when the function being called uses `**kwargs`.
In this diff I introduce a utility function that *only* reorders kwargs. This gets us most of the way to correct—dicts are no longer reordered, but kwargs can be passed in any order.
A "fully correct" solution would need fix the corner case from PEP468. We could detect whether the top-level fn being traced uses `**kwargs` (via `inspect`), then serialize a flag for it. In ExportedProgram, we would check that flag and only re-order if `**kwargs` was unused; otherwise error if the key order doesn't match. This is a super corner case though, so I'll file it as a followup task.
## structural equivalence checking
This is another use case where `tree_flatten_spec` is too broad. Generally we want to treat precisely two types as the same, not override comparison behavior globally. So I introduce an `is_equivalent` util for this purpose.
Differential Revision: [D53168420](https://our.internmc.facebook.com/intern/diff/D53168420/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118608
Approved by: https://github.com/zhxchen17
ghstack dependencies: #118607
## Context
This changeset is part of a stack that enables memory planning (i.e. sharing memory between intermediate tensors) in the PyTorch Vulkan Compute API. Note that Memory Planning can only be used via the ExecuTorch delegate (currently a WIP) and not Lite Interpreter (which does not collect metadata regarding tensor lifetimes).
This changeset enables [resource aliasing](https://gpuopen-librariesandsdks.github.io/VulkanMemoryAllocator/html/resource_aliasing.html), a technique that allows two resources (i.e. `VkImage`s or `VkBuffer`s) to bind to the same memory allocation. This is the core feature that allows memory planning to be implemented in PyTorch Vulkan.
## Notes for Reviewers
At a high level, this changeset introduces the `MemoryAllocation` struct which represents a raw `VmaAllocation`. `VulkanImage` and `VulkanBuffer` have been updated to store a `MemoryAllocation` member instead of the raw handle of a `VmaAllocation`.
`vTensor`, `VulkanImage`, and `VulkanBuffer` constructors now have an `allocate_memory` argument which controls whether memory should be allocated on construction. If `false`, then memory must be allocated separately and bound later using `bind_allocation()` before the resource can be used.
Internal:
## Notes for Internal Reviewers
Please refer to [this design doc](https://docs.google.com/document/d/1EspYYdkmzOrfd76mPH2_2BgTDt-sOeFnwTkV3ZsFZr0/edit?usp=sharing) to understand how memory planning will work end-to-end in the Vulkan Delegate.
Differential Revision: [D53136249](https://our.internmc.facebook.com/intern/diff/D53136249/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118436
Approved by: https://github.com/yipjustin
Follow-up to https://github.com/pytorch/pytorch/pull/118359: whether ``src`` and ``dst`` are based on the global pg or the sub pg
* update c10d docstring: ``src`` / ``dst`` are based on the global pg regardless of the ``group`` argument
* communication ops with ``dst`` argument: ``reduce``, ``gather_object``, ``gather``, ``send``, ``isend``
* communication ops with ``src`` argument: ``irecv``, ``recv``, ``broadcast``, ``broadcast_object_list``, ``scatter``, ``scatter_object_list``
* ``pytest test/distributed/test_c10d_nccl.py -k subgroup``
3 collectives are for picklable objects (``gather_object``, ``broadcast_object_list``, ``scatter_object_list``). There are 2 ways to set the device:
* use the device argument: it's implemented in ``broadcast_object_list``; maybe worth implementing in the other 2
* ``torch.cuda.set_device(global_rank)``
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118593
Approved by: https://github.com/wconstab
These operators are not used and have been deprecated since #72690
(Feb 2022).
BC-breaking message:
`TorchScript` models that were exported with the deprecated
`torch.jit.quantized` API will no longer be loadable, as the required
internal operators have been removed.
Please re-export your models using the newer `torch.ao.quantization` API
instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112153
Approved by: https://github.com/jerryzh168
### Summary
Native functional collective ops require the backend to be implemented in C++. This ports `FakeProcessGroup` to C++ so that it can also work for native functional collective ops.
The existing tests involving `FakeProcessGroup` all pass. In addition, `DeviceMeshTest::test_fake_pg_device_mesh` now pass with `_USE_NATIVE_C10D_FUNCTIONAL=1`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118426
Approved by: https://github.com/wanchaol
ghstack dependencies: #113057
Resolves #117749
Summary:
Updated the PR with the following intentions:
1. identify eagerMode init (as opposed to lazy init), in which case we will create NCCL comms without guarantees that they are fully initialized if NONBLOCKING mode is also enabled.
2. Python users can do their other work (e.g., model init) between invoking init_process_group and their first collective call.
3. c10D would guarantee/wait for communicators to be initialized before issuing the first collective call.
4. For NCCL collective calls, the contract between Python users and c10d is not changed much from blocking calls (c10d would wait for the NCCL call to return ncclSuccess, or time out, whichever happens first).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118256
Approved by: https://github.com/kwen2501
Summary:
### Context
It's possible for the args of a user-defined Triton kernel to be codegen-ed twice. But this only happens if the arg is a `ReinterpretView`.
* First via `arg.codegen_reference()` in `define_user_defined_triton_kernel()`
* Second in `self.codegen_kwargs()`.
When using `abi_compatible=True`, the duplicate codegen will look like the code below. The issue in the code is that one of the Tensors, internal to the graph, isn't properly freed. This scenario was eventually exposed as a memory leak when we re-ran an AOTInductor model many times and observed `memory.used` increase after each iteration.
```
auto tmp_tensor_handle_0 = reinterpret_tensor_wrapper(buf1, 2, int_array_0, int_array_1, 0L);
auto tmp_tensor_handle_1 = reinterpret_tensor_wrapper(buf1, 2, int_array_0, int_array_1, 0L);
...
// There's no wrap_with_raii_handle_if_needed() for tmp_tensor_handle_0.
// And there's no reference to tmp_tensor_handle_0.
// Thus, tmp_tensor_handle_0 is left as an AtenTensorHandle which isn't
// automatically cleaned-up like RAIIAtenTensorHandle
CUdeviceptr var_6;
aoti_torch_get_data_ptr(wrap_with_raii_handle_if_needed(tmp_tensor_handle_1), reinterpret_cast<void**>(&var_6));
void* kernel_args_var_2[] = {..., &var_6, ...};
launchKernel(kernels.add_kernel_0, ..., kernel_args_var_2);
```
### Solution
We just need the arg's buffer name when creating the `TensorArg` in `define_user_defined_triton_kernel()`. Thus, just return the buffer's name and avoid any potential side-effects with `arg.codegen_reference()`.
Test Plan:
### Inspect device memory allocated
```
# Before diff
0 device memory 2048
1 device memory 2560
2 device memory 3072
3 device memory 3584
4 device memory 4096
5 device memory 4608
# With diff (memory usage doesn't grow)
0 device memory 1536
1 device memory 1536
2 device memory 1536
3 device memory 1536
4 device memory 1536
5 device memory 1536
```
Reviewed By: jingsh, tissue3
Differential Revision: D53190934
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118569
Approved by: https://github.com/oulgen
Fixes https://github.com/pytorch/pytorch/issues/118129
Suppressions automatically added with
```
import re

with open("error_file.txt", "r") as f:
    errors = f.readlines()

error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f"  # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533
Approved by: https://github.com/Skylion007, https://github.com/zou3519
Moves test discovery into a file that doesn't have import torch so test listing can be done without having torch installed.
Helpful when you don't have torch installed (aka me when I'm feeling lazy)
I want to move TD into its own job that doesn't need to wait for the build to finish, so this is part of that.
The first commit is a nothing more than a copy paste of the selected functions/vars into a new file, the second commit has various changes that should be checked.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118574
Approved by: https://github.com/huydhn
Summary:
We used to skip the verifier when the signature object is not the "correct" one (usually from some deprecated frontend). This was very useful when we wanted to pay a small cost to enable the verifier path to be called everywhere for torch.export.
Now I believe no tests are relying on this behavior so we should remove this weird branch.
Test Plan: CI
Differential Revision: D53024506
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118139
Approved by: https://github.com/suo
This diff introduces an env var `_USE_NATIVE_C10D_FUNCTIONAL` that tells `_functional_collective` to use native `c10d_functional` ops. The Python version and the native version will co-exist until we completely switch to the native version after more testing and verification.
NOTE: `DeviceMesh` support for native `c10d_functional` will be added in a subsequent PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113057
Approved by: https://github.com/LucasLLC, https://github.com/wconstab, https://github.com/wanchaol
When a constant-folded SymInt was used to construct a tensor that was then constant folded, we previously tried to use the sympy symbol, which would error (the call should take a SymInt, not a symbol).
Fix by recording the observed size during constant folding.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118411
Approved by: https://github.com/ezyang
In theory this tells the system that we will access the file sequentially which allows prefetching future blocks. In practice it doubles the read-ahead size on Linux (which effectively doubles the read sizes).
Without this, CUDA uploads of files that aren't already in FS cache, using mmapped files (safetensors) as source, run at ~1 GB/s (from an SSD that has ~7 GB/s read speed...).
With this, they run at ~1.5 GB/s which is still bad but better than before!
It is possible to increase the read performance further by touching the pages from multiple threads; in fact, when the tensors loaded from the file are used by the CPU, we get fairly good load performance (~5 GB/s), which appears to be because multiple threads page fault and trigger more concurrent reads which improves SSD read throughput... however, this is not the case for CUDA uploads, and it is difficult to make that change in a generic way because it's unclear what the usage pattern of the input file is going to be.
All of the numbers above are taken on Samsung 990 Pro SSD, on Linux kernel 6.5 with FS cache cleared between every attempt to load a file. The file is loaded via `safetensors.safe_open` which uses UntypedTensor.from_file to load the file into memory, which in turn uses MapAllocator.cpp.
I felt safe doing this change unconditionally but please let me know if you'd like to see a separate allocator flag for this, propagated through to UntypedTensor. Note that the fadvise API is not available on macOS.
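For reference, a minimal sketch of the hint being described, using Python's `os.posix_fadvise` on Linux rather than the MapAllocator.cpp change itself (the temp file is only there so the sketch runs end to end):
```python
import os
import tempfile

# Create a throwaway 1 MiB file so the sketch runs end to end.
tmp_fd, path = tempfile.mkstemp()
os.write(tmp_fd, b"\0" * (1 << 20))
os.close(tmp_fd)

fd = os.open(path, os.O_RDONLY)
try:
    # Advise the kernel that the file will be read sequentially, which
    # typically enlarges the read-ahead window (Linux only).
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
    data = os.read(fd, 1 << 20)
finally:
    os.close(fd)
    os.remove(path)
```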
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117805
Approved by: https://github.com/mikaylagawarecki
Before the PR, we have a graph break for code like this,
```python
def test_get_device_properties_tensor_device(a):
x = a.to("cuda")
prop = torch.cuda.get_device_properties(x.device)
if prop.major == 8:
return x + prop.multi_processor_count
return x + prop.max_threads_per_multi_processor
```
This PR constant folds the torch.cuda.get_device_properties and we'll get a following dynamo graph:
```python
[2024-01-26 13:28:13,253] [0/0] torch._dynamo.output_graph.__graph: [DEBUG] <eval_with_key>.0 class GraphModule(torch.nn.Module):
[2024-01-26 13:28:13,253] [0/0] torch._dynamo.output_graph.__graph: [DEBUG] def forward(self, L_a_ : torch.Tensor):
[2024-01-26 13:28:13,253] [0/0] torch._dynamo.output_graph.__graph: [DEBUG] l_a_ = L_a_
[2024-01-26 13:28:13,253] [0/0] torch._dynamo.output_graph.__graph: [DEBUG]
[2024-01-26 13:28:13,253] [0/0] torch._dynamo.output_graph.__graph: [DEBUG] # File: /home/yidi/local/pytorch/test/dynamo/test_functions.py:544 in test_get_device_properties_tensor_device, code: x = a.to("cuda")
[2024-01-26 13:28:13,253] [0/0] torch._dynamo.output_graph.__graph: [DEBUG] x = l_a_.to('cuda'); l_a_ = None
[2024-01-26 13:28:13,253] [0/0] torch._dynamo.output_graph.__graph: [DEBUG]
[2024-01-26 13:28:13,253] [0/0] torch._dynamo.output_graph.__graph: [DEBUG] # File: /home/yidi/local/pytorch/test/dynamo/test_functions.py:547 in test_get_device_properties_tensor_device, code: return x + prop.multi_processor_count
[2024-01-26 13:28:13,253] [0/0] torch._dynamo.output_graph.__graph: [DEBUG] add = x + 108; x = None
[2024-01-26 13:28:13,253] [0/0] torch._dynamo.output_graph.__graph: [DEBUG] return (add,)
[2024-01-26 13:28:13,253] [0/0] torch._dynamo.output_graph.__graph: [DEBUG]
```
The signature of get_device_properties is:
```python
def get_device_properties(device: _device_t) -> _CudaDeviceProperties:
```
I think it's safe to constant fold get_device_properties():
1. torch.cuda.get_device_properties(tensor.device). In this case, tensor.device.index is guarded in _check_tensor
2. torch.cuda.get_device_properties(device_int_id). We don't expect the GPU properties for a particular index to change during a torch.compile run, and it makes sense to specialize the properties for a concrete device_int_id.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118422
Approved by: https://github.com/yanboliang, https://github.com/jansel
Before the pr, we have an error for the following code
```python
def k(x):
with torch.inference_mode():
x = x + 1
return x
torch.compile(k, backend="eager", fullgraph=True)(x)
```
error message:
```
Traceback (most recent call last):
....
return InferenceModeVariable.create(tx, args[0].as_python_constant())
torch._dynamo.exc.InternalTorchDynamoError: list index out of range
```
This PR supports the case where `torch.inference_mode` is not provided any argument (i.e. it defaults to True).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118427
Approved by: https://github.com/yanboliang, https://github.com/jansel
Summary:
The current implementation of `str` passes wide types (`wchar_t`, `wchar_t*`, `std::wstring`) directly to `std::ostringstream`. This has the following behavior:
- C++17, `wchar_t` & `wchar_t *`: print the integer representation of the character or the pointer. This is unexpected and almost certainly a (runtime) bug.
- C++17, `std::wstring`: compile-time error.
- C++20, all of the above: compile-time error.
To fix the bug and to enable C++20 migration, this diff performs narrowing on these wide types (assuming UTF-16 encoding) before passing them to `std::ostringstream`. This fixes both the C++20 compile time errors and the C++17 runtime bugs.
This bug surfaced in enabling C++20 windows builds, because windows specific caffe2 code uses `TORCH_CHECK` with wide strings, which references `str` for generating error messages.
Test Plan: CI & https://godbolt.org/z/ecTGd8Ma9
Differential Revision: D52792393
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117531
Approved by: https://github.com/malfet
Instead rely on `GitHubPR.default_branch()` which is the name of the repo's default branch.
Do not pass the branch name when `merge_changes` is called, as it is set to the default branch inside the function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118530
Approved by: https://github.com/clee2000
Dynamo creates Tensors when tracing through numpy ufuncs like np.sin, np.minimum, etc., so at runtime np functions generally return Tensors when run with `torch.compile`. However, when normalizing `out` arguments, we currently require that the input is an ndarray. This causes assertion errors when running torch.compile on any numpy function with an out argument:
```
def test_numpy_ufunc_out(self):
    @torch.compile(backend="eager")
    def foo():
        x = np.arange(5)
        out = np.empty((x.shape[0], x.shape[0]))
        res_out = np.sin(x, out=out)
        assert res_out is out

    foo()
```
Failure with stack trace: https://gist.github.com/jamesjwu/68e217638d735678b3de968584dba23f
Instead, we can wrap tensors in an ndarray in normalize_outarray to handle the case correctly. Fixing this resolves ~220 tests under dynamo_test_failures, but also exposes a followup bug.
In the presence of a graph break, ndarrays don't preserve their id, which can affect assertions and `is` checks between numpy arrays:
```
def test_x_and_out_broadcast(self, ufunc):
    x = self.get_x(ufunc)
    out = np.empty((x.shape[0], x.shape[0]))
    x_b = np.broadcast_to(x, out.shape)
    # ufunc is just np.sin here
    res_out = ufunc(x, out=out)
    res_bcast = ufunc(x_b)
    # passes
    assert res_out is out
    graph_break()
    # fails
    assert res_out is out
```
Regular tensors preserve their id because Dynamo caches their example tensor values across a graph break. However, with ndarrays, we only store their converted tensor values, and construct new ndarrays around those values:
eebe7e1d37/torch/_dynamo/variables/builder.py (L1083)
Added a test with expected failure to showcase this — we can then fix that issue separately.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118248
Approved by: https://github.com/lezcano
Summary:
Adds a JK killswitch check and configures the env for enabling pytorch
nccl flight recorder. Note- this only enables recording events in memory, not
dumping them.
Test Plan: CI test
Reviewed By: zdevito
Differential Revision: D52920092
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118451
Approved by: https://github.com/malfet
Related to #118494, it is not clear to users that the default behavior is to include **all** feasible archs (if the 'TORCH_CUDA_ARCH_LIST' is not set).
In these scenarios, a user may experience a long build time. Adding a print statement to reflect this behavior. [A `verbose` arg is not available, and it does not feel necessary to add one to this function and all its parent functions...]
Co-authored-by: Edward Z. Yang <ezyang@mit.edu>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118503
Approved by: https://github.com/ezyang
dmypy silently ignores follow_imports = skip, so to get parity between
dmypy and mypy we have to suck it up and type: ignore all of the sympy
typing problems.
The suppressions were added automatically with the following script generated by GPT-4:
```
import re

# Read the error file
with open("error_file.txt", "r") as f:
    errors = f.readlines()

# Parse the lines with errors and error types
error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

# Insert ignore comments in the source files
for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f"  # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118469
Approved by: https://github.com/Skylion007
ghstack dependencies: #118414, #118418, #118432, #118467, #118468
The `torch.jit.quantized` interface has been deprecated since #40102 (June 2020).
BC-breaking message:
All functions and classes under `torch.jit.quantized` will now raise an error if
called/instantiated. This API has long been deprecated in favor of
`torch.ao.nn.quantized`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118406
Approved by: https://github.com/jerryzh168
The original motivation for MYPYINDUCTOR was a faster type checking configuration that only checked a subset of files. With the removal of `follow_imports = ignore`, we are now able to use dmypy to do fast incremental typechecking, eliminating the need for this.
Perhaps erroneously, when I teed up this PR I elected to delete the `follow_imports = skip` designations in mypy-inductor.ini. This led to a number of extra type error suppressions that I manually edited. You will need to review.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118432
Approved by: https://github.com/Skylion007
ghstack dependencies: #118414, #118418
I feel it's easier to open a new PR rather than iterating on the previous PR (https://github.com/pytorch/pytorch/pull/105257) since this is more like a rewrite.
In this PR, instead of changing GraphModule directly, which can easily cause BC issues, I create a LazyGraphModule class as Zachary & Jason suggested in comments on the previous PR.
The difference between LazyGraphModule and GraphModule is mainly about how recompilation of the graph module happens. In GraphModule the recompilation happens 'eagerly': constructing a GraphModule causes the recompilation. In LazyGraphModule, we just mark the module as needing recompilation; the real recompilation only happens when absolutely required (e.g. calling the forward method, accessing the code property, etc.). In a lot of cases in torch.compile, the real recompilation is eventually not triggered at all. This can save a few seconds of compilation time.
By default, GraphModule rather than LazyGraphModule is used. `use_lazy_graph_module(True)` context manager can be used to pick LazyGraphModule instead. This has been applied to the torch.compile stack.
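A toy sketch of the eager-vs-lazy recompilation idea described above (not the actual torch.fx implementation; all names are illustrative):
```python
def compile_graph(graph_src: str) -> str:
    # Stand-in for the (expensive) Python codegen + exec step.
    return f"compiled({graph_src})"

class EagerGM:
    def __init__(self, graph_src: str):
        # Recompilation happens eagerly, at construction time.
        self.code = compile_graph(graph_src)

class LazyGM:
    def __init__(self, graph_src: str):
        self._graph_src = graph_src
        self._code = None  # only mark as needing recompilation

    @property
    def code(self) -> str:
        # The real recompilation is deferred until the code is actually needed.
        if self._code is None:
            self._code = compile_graph(self._graph_src)
        return self._code

lazy = LazyGM("x + 1")  # cheap: nothing compiled yet
print(lazy.code)        # compilation is triggered here, on first access
```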
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117911
Approved by: https://github.com/jansel
Addresses #118337 somewhat; we probably need to update docs. Let's first
confirm what behavior we want.
Identifies a couple of confusing things
1) 'dst' arg for many collectives is always in 'global' rank regardless
of whether a subgroup is passed in. This needs a doc update
2) gather_object has a strong dependency on setting the cuda device;
could we make that smoother?
3) gather_object also should be happy with an empty list on the dst
side, imo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118359
Approved by: https://github.com/weifengpy
resolve https://github.com/pytorch/pytorch/issues/117749
Summary:
This is the first step to enable NCCL nonblocking mode.
In NCCL nonblocking mode, ncclInProgress is an expected return value
when checking communicators. Without this relaxation, watchdog thread
would throw NCCL errors during work checking while it is expected.
Test Plan:
Set nonblocking mode in unit tests, and make sure all existing NCCL
tests pass
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118254
Approved by: https://github.com/kwen2501
This is a lot of files changed! Don't panic! Here's how it works:
* Previously, we set `follow_imports = silent` for our mypy.ini configuration. Per https://mypy.readthedocs.io/en/stable/running_mypy.html#follow-imports, what this does is whenever we have an import to a module which is not listed as a file to be typechecked in mypy, we typecheck it as normal but suppress all errors that occurred in that file.
* When mypy is run inside lintrunner, the list of files is precisely the files covered by the glob in lintrunner.toml, but with files in excludes excluded.
* The top-level directive `# mypy: ignore-errors` instructs mypy to typecheck the file as normal, but ignore all errors.
* Therefore, it should be equivalent to set `follow_imports = normal`, if we put `# mypy: ignore-errors` on all files that were previously excluded from the file list.
* Having done this, we can remove the exclude list from .lintrunner.toml, since excluding a file from typechecking is baked into the files themselves.
* torch/_dynamo and torch/_inductor were previously in the exclude list, because they were covered by MYPYINDUCTOR. It is not OK to mark these as `# mypy: ignore-errors` as this will impede typechecking on the alternate configuration. So they are temporarily being checked twice, but I am suppressing the errors in these files as the configurations are not quite the same. I plan to unify the configurations so this is only a temporary state.
* There were some straggler type errors after these changes somehow, so I fixed them as needed. There weren't that many.
In the future, to start type checking a file, just remove the ignore-errors directive from the top of the file.
The codemod was done with this script authored by GPT-4:
```
import glob

exclude_patterns = [
    ...
]

for pattern in exclude_patterns:
    for filepath in glob.glob(pattern, recursive=True):
        if filepath.endswith('.py'):
            with open(filepath, 'r+') as f:
                content = f.read()
                f.seek(0, 0)
                f.write('# mypy: ignore-errors\n\n' + content)
```
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118414
Approved by: https://github.com/thiagocrepaldi, https://github.com/albanD
Mention co-authors in PR body
Modify the `CommitAuthors` query to include the first two commit `authors`, which makes sure that authors from suggested commits are recognized.
Test plan: CI + check `get_authors()` on a few PRs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118347
Approved by: https://github.com/kit1980
Summary:
AVX extension flags are x86-specific, and clang-18 has started to error on them when building targets that are not x86. I couldn't find the upstream change that made these flags an error, but it's fairly clear that these flags do not apply to all architectures.
Most of the flags are already defined in `platform_compiler_flags`. The changes done here:
* Gate the flags under `compiler_flags` with `selects`
* If flags weren't defined in `platform_compiler_flags`, define them there as well
* Remove the `^` and `$` in the platform regex. Not all flavors start with the platform (e.g. `android-x86_64`).
* Some minor formatting changes were also included here.
Test Plan:
Atop D52741786,
```
buck2 build --flagfile 'arvr/mode/android/apk/linux/opt' '//arvr/projects/mixedreality/android/ocean_passthrough_service:ocean_passthrough_mrservice_dev'
```
Differential Revision: D52856224
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117923
Approved by: https://github.com/mcr229
Enables the deduplication of saved entries by load balancing duplicates across ranks.
Tested with existing and modified tests. Additionally tested with the following code snippet, which saves a 20GB DDP model in **~3 seconds on 8 ranks**. Before this PR, the same operation has been measured at ~19 seconds.
```
def run(local_rank, world_size, param_size, num_params, work_dir):
    os.environ["RANK"] = str(local_rank)
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    device = torch.device(f"cuda:{local_rank}")
    torch.cuda.set_device(device)
    dist.init_process_group(backend="nccl", rank=local_rank, world_size=world_size)

    model = Model(param_size=param_size, num_params=num_params)
    model = DistributedDataParallel(model, gradient_as_bucket_view=True)
    _patch_model_state_dict(model)

    sz = sum(t.nelement() * t.element_size() for t in model.parameters())
    rank_0_print(f"Model size: {sz / 1_000_000_000.0} GB")

    rank_0_print("Saving the model with DCP...")
    checkpointer = _FileSystemCheckpointer(
        f"{args.work_dir}/dcp",
        sync_files=False,
        single_file_per_rank=False,
        thread_count=1,
    )
    begin_ts = time.monotonic()
    checkpointer.save(state_dict={"model": model})
    end_ts = time.monotonic()
    rank_0_print(f"Took {end_ts - begin_ts} seconds with DCP")
```
Differential Revision: [D52435926](https://our.internmc.facebook.com/intern/diff/D52435926/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116469
Approved by: https://github.com/fegin, https://github.com/wz337
Relying on object lifetimes in Python is a bad idea due to reference
cycles. Previously, when a torch.library.Library object gets destroyed,
it clears all the registrations associated with it, but it's unclear
when it actually gets destroyed due to the existence of refcycles.
This PR:
- adds torch::Library::clear(), which deterministically releases all of
the RAII registration handles of the torch::Library object
- adds a new `torch.library._scoped_library` context manager, which creates
a library and cleans it up at the end of the scope using the previous item.
All tests (unless they already handle library lifetimes) should use this new API (a usage sketch follows this list).
- Rewrites some flaky tests to use `_scoped_library`.
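A usage sketch of the new context manager (assuming it takes the same `(namespace, kind)` arguments as `torch.library.Library`; the namespace and op below are made up):
```python
import torch

with torch.library._scoped_library("my_test_ns", "FRAGMENT") as lib:
    lib.define("my_op(Tensor x) -> Tensor")
    lib.impl("my_op", lambda x: x + 1, "CPU")
    print(torch.ops.my_test_ns.my_op(torch.ones(2)))  # tensor([2., 2.])
# On exit, every registration made through `lib` is released deterministically.
```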
In the future we'll probably migrate all of our torch.library tests to
use `_scoped_library`, but that's kind of annoying because we have
multiple thousands of LOC
I'm hoping this will deflake those tests; we'll see.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118318
Approved by: https://github.com/albanD
When doing print(f.read().decode() etc.) it prints an extra newline, so manually splitlines and strip to see if that helps.
My guess is Windows line-ending differences.
Also always save log file regardless of success or failure
See 476b81a9bf for what it looks like now
Swapped to opening in text mode instead of binary, seems to be ok now.
42483193bf024983060a234dc0262f4840aef4b8 for example
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118124
Approved by: https://github.com/huydhn
Summary: While turning on .module() for all the export tests, I uncovered some bugs with .module() and while fixing them I ended up rewriting some of the code... Some of the bugs were:
* bad kwargs support on the unlifted module
* no support for user input mutations
* (at the commit hash i was working off of) no support for custom objects
* there were no tests on unlifting weights from cond/map submodules
Test Plan: CI
Differential Revision: D53075380
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118272
Approved by: https://github.com/suo
Summary: When there were optionals with specified default values, the code was improperly handling the number of parameters, causing `IndexError: tuple index out of range`.
Test Plan: New tests.
Reviewed By: zou3519
Differential Revision: D53095812
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118331
Approved by: https://github.com/zou3519
This PR relands #108238 that was closed as stale due to CLA issues and also because the CI check has marked the PR as not mergeable.
Repro 1:
```python
import torch
def f(x):
    return x[x > 0]
jf = torch.jit.trace(f, torch.tensor(2., device="cuda"))
```
Error:
```bash
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/pytorch/torch/jit/_trace.py", line 874, in trace
traced = torch._C._create_function_from_trace(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<stdin>", line 2, in f
RuntimeError: NYI: Named tensors are not supported with the tracer
```
Repro2:
```python
import torch
import torch.nn.functional as F
from torch import nn
import copy
class Net(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, inputs):
        x = copy.deepcopy(inputs)  # RuntimeError: NYI: Named tensors are not supported with the tracer
        x = F.relu(x)
        return x
model = Net()
images = torch.randn(8, 28, 28)
torch.jit.trace(model, images)
```
Error 2:
```bash
Traceback (most recent call last):
File "/opt/pytorch/test_deepcopy.py", line 18, in <module>
File "/opt/pytorch/torch/jit/_trace.py", line 806, in trace
return trace_module(
^^^^^^^^^^^^^
File "/opt/pytorch/torch/jit/_trace.py", line 1074, in trace_module
module._c._create_method_from_trace(
File "/opt/pytorch/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/pytorch/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/pytorch/torch/nn/modules/module.py", line 1501, in _slow_forward
result = self.forward(*input, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/pytorch/test_deepcopy.py", line 12, in forward
x = F.relu(x)
^^^^^^^^^^
File "/opt/conda/envs/ptca/lib/python3.11/copy.py", line 153, in deepcopy
y = copier(memo)
^^^^^^^^^^^^
File "/opt/pytorch/torch/_tensor.py", line 122, in __deepcopy__
new_storage = self._typed_storage()._deepcopy(memo)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/pytorch/torch/storage.py", line 847, in _deepcopy
return self._new_wrapped_storage(copy.deepcopy(self._untyped_storage, memo))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/ptca/lib/python3.11/copy.py", line 153, in deepcopy
y = copier(memo)
^^^^^^^^^^^^
File "/opt/pytorch/torch/storage.py", line 112, in __deepcopy__
new_storage = self.clone()
^^^^^^^^^^^^
File "/opt/pytorch/torch/storage.py", line 126, in clone
return type(self)(self.nbytes(), device=self.device).copy_(self)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: NYI: Named tensors are not supported with the tracer
```
----
#48054 RuntimeError: NYI: Named tensors are not supported with the tracer
#49538 jit tracer doesn't work with unflatten layer
#31591 when i try to export a pytorch model to ONNX, got RuntimeError: output of traced region did not have observable data dependence with trace inputs; this probably indicates your program cannot be understood by the tracer.
- This bug was closed but still exists; multiple comments on it still show the error. This is addressed here.
Likely fixes the following issues (but untested)
#63297 Named tensor in tracer
#2323 [Bug] torch.onnx.errors.UnsupportedOperatorError when convert mask2former to onnx
Fix zero-dimensioned tensors when used with jit.trace. They are currently assigned an empty set for names `{}`; this is not the same as "no name", so jit.trace bails with
"NYI: Named tensors are not supported with the tracer".
This happens when I am trying to save a non-trivial model as ONNX, but the simplest repro I have seen is #48054 above, which has been added as test/jit/test_zero_dim_tensor_trace.py.
Test plan:
New unit test added
Broken scenarios tested locally
CI
Fixes #48054
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118393
Approved by: https://github.com/zou3519
This PR adds support for row-wise sharded embedding by adding a
MaskPartial placement that inherits from the default partial placement
and overrides the Partial contracts to construct the mask and release
the mask after the reduction.
The MaskPartial placement has the potential to support other ops'
sharding computation that requires a mask for semantic correctness.
Currently it lives in the embedding ops, but we can move it to a
common place if needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118080
Approved by: https://github.com/tianyu-l
ghstack dependencies: #118079
This PR rewrites the sharded embedding rule to use OpStrategy instead of the
rule, one step further toward getting rid of rules and consolidating the embedding
operator implementation, in preparation for the row-wise embedding
implementation, which will come in the next PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118079
Approved by: https://github.com/tianyu-l
Let me tell you, this was a *journey.*
* When we repropagate through the FX interpreter in AOTAutograd, this will reallocate unbacked SymInts. We can eliminate all of these fresh allocations by appropriately asserting equalities on them and setting up replacements. See also https://github.com/pytorch/pytorch/issues/111950
* The `inner_fn` of Loops can contain references to unbacked SymInts. We must collect them to prevent DCE.
* Export naughtily accessed `_expr` when it should have accessed `expr` on SymNode. Fixed two sites of this.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117862
Approved by: https://github.com/bdhirsh
Uses case: `_unsafe_view` is used in aot_autograd to create a view that doesn't register as a view:
eebe7e1d37/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py (L470-L476)
If a transposed nested tensor (i.e. NT with ragged_idx != 1) encounters this code path, it previously would fail for two reasons: 1) because `_unsafe_view` isn't registered, and 2) because ragged_idx != 1 is not supported. This PR adds support for `_unsafe_view` (completely reusing the implementation of `view`; this just registers `_unsafe_view` as another op using the same implementation). It also adds support for ragged_idx != 1, but only for trivial cases where inp._size == size (the use case used by aot_autograd).
Tests: verify that the result of `_unsafe_view` doesn't have a `_base`, and that simple views on transposed NTs work.
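For intuition, a small dense-tensor sketch of the `_unsafe_view` contract (not specific to nested tensors):
```python
import torch

x = torch.randn(2, 3)
v = x.view(6)                             # a normal view: tracks its base
uv = torch.ops.aten._unsafe_view(x, [6])  # same data layout, but not registered as a view
assert v._base is x
assert uv._base is None
```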
Differential Revision: [D53096814](https://our.internmc.facebook.com/intern/diff/D53096814)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118317
Approved by: https://github.com/soulitzer
The flag is not correctly set when PyTorch is compiled with GPU support, resulting in failures in
`test_ops.py::test_python_ref_meta__refs_linalg_svd_cpu_complex`.
Use a similar approach to test_meta and skip the check for this function.
Workaround for #105068
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117972
Approved by: https://github.com/lezcano
`_CollectiveKernel.create_inplace` expresses mutation with the newly introduced `MutationOutput` which requires the `layout` of the input. Currently, there's a bug where if the input is a view, `inp.layout` fails. This PR fixes the issue by unwrapping the input if it's a view.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118333
Approved by: https://github.com/wanchaol
Summary:
After D50347338, we already support zero-dim tensor input, which was my original task. As a result, this diff doesn't add or change functionality; it just cleans up the following:
1. Fix TORCH_CHECK to only allow `tensor.dim() <= 3`. Previously, it was a no-op since it didn't use `&&`.
2. Add `tensor.dim() == 0` tests.
3. Address `readability-container-size-empty` and `performance-unnecessary-copy-initialization` linter errors.
Test Plan:
Tested on OD.
```
[jorgep31415@29786.od /data/sandcastle/boxes/fbsource (1d0b920e0)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -c pt.vulkan_full_precision=1 -- --gtest_filter="*stack*"
File changed: fbsource//xplat/caffe2/aten/src/ATen/native/vulkan/ops/Unsqueeze.cpp
File changed: fbsource//xplat/caffe2/aten/src/ATen/native/vulkan/glsl/unsqueeze.glsl
File changed: fbsource//xplat/caffe2/aten/src/ATen/test/vulkan_api_test.cpp
3 additional file change events
Buck UI: https://www.internalfb.com/buck2/98bb3bfa-a1d1-440e-8724-b4990c9cc7ca
Network: Up: 1.4MiB Down: 377KiB (reSessionID-6eccf420-3951-4942-9350-998803589b8d)
Jobs completed: 17. Time elapsed: 42.6s.
Cache hits: 38%. Commands: 8 (cached: 3, remote: 0, local: 5)
BUILD SUCCEEDED
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *stack*
[==========] Running 5 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 5 tests from VulkanAPITest
[ RUN ] VulkanAPITest.stack_invalid_inputs
[ OK ] VulkanAPITest.stack_invalid_inputs (27 ms)
[ RUN ] VulkanAPITest.stack_0d
[ OK ] VulkanAPITest.stack_0d (28 ms)
[ RUN ] VulkanAPITest.stack_1d
[ OK ] VulkanAPITest.stack_1d (1 ms)
[ RUN ] VulkanAPITest.stack_2d
[ OK ] VulkanAPITest.stack_2d (148 ms)
[ RUN ] VulkanAPITest.stack_3d
[ OK ] VulkanAPITest.stack_3d (354 ms)
[----------] 5 tests from VulkanAPITest (561 ms total)
[----------] Global test environment tear-down
[==========] 5 tests from 1 test suite ran. (561 ms total)
[ PASSED ] 5 tests.
```
Reviewed By: copyrightly, liuk22
Differential Revision: D53071188
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118314
Approved by: https://github.com/liuk22
Summary:
After D50347338, we already support zero-dim tensor input, which was my original task. As a result, this diff doesn't add or change functionality; it just cleans up the following:
1. Fix TORCH_CHECK to only allow `tensor.dim() <= 3`. Previously, it was a no-op since it didn't use `&&`.
2. Add 0->1 `tensor.dim()` tests.
3. Remove `dim == 0` case from shader since that path is never executed. The `cpp` code sends the input to `submit_copy` instead.
Test Plan:
Tested on OD.
```
[jorgep31415@29786.od /data/sandcastle/boxes/fbsource (c66693c95)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -c pt.vulkan_full_precision=1 -- --gtest_filter="*unsqueeze*"
File changed: fbcode//caffe2/aten/src/ATen/native/vulkan/glsl/unsqueeze.glsl
File changed: fbsource//xplat/caffe2/aten/src/ATen/test/vulkan_api_test.cpp
File changed: fbsource//xplat/caffe2/aten/src/ATen/native/vulkan/glsl/unsqueeze.glsl
Buck UI: https://www.internalfb.com/buck2/16cf8f59-e535-493b-b123-5952ef8f1453
Network: Up: 21KiB Down: 1.4MiB (reSessionID-1219eefd-e78b-4bfd-aef8-8e4b38da82f8)
Jobs completed: 8. Time elapsed: 37.8s.
Cache hits: 0%. Commands: 3 (cached: 0, remote: 1, local: 2)
BUILD SUCCEEDED
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *unsqueeze*
[==========] Running 10 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 10 tests from VulkanAPITest
[ RUN ] VulkanAPITest.unsqueeze_0dto1d_dim0
[ OK ] VulkanAPITest.unsqueeze_0dto1d_dim0 (61 ms)
[ RUN ] VulkanAPITest.unsqueeze_1dto2d_dim0
[ OK ] VulkanAPITest.unsqueeze_1dto2d_dim0 (0 ms)
[ RUN ] VulkanAPITest.unsqueeze_1dto2d_dim1
[ OK ] VulkanAPITest.unsqueeze_1dto2d_dim1 (110 ms)
[ RUN ] VulkanAPITest.unsqueeze_2dto3d_dim0
[ OK ] VulkanAPITest.unsqueeze_2dto3d_dim0 (16 ms)
[ RUN ] VulkanAPITest.unsqueeze_2dto3d_dim1
[ OK ] VulkanAPITest.unsqueeze_2dto3d_dim1 (58 ms)
[ RUN ] VulkanAPITest.unsqueeze_2dto3d_dim2
[ OK ] VulkanAPITest.unsqueeze_2dto3d_dim2 (2 ms)
[ RUN ] VulkanAPITest.unsqueeze_3dto4d_dim0
[ OK ] VulkanAPITest.unsqueeze_3dto4d_dim0 (16 ms)
[ RUN ] VulkanAPITest.unsqueeze_3dto4d_dim1
[ OK ] VulkanAPITest.unsqueeze_3dto4d_dim1 (1 ms)
[ RUN ] VulkanAPITest.unsqueeze_3dto4d_dim2
[ OK ] VulkanAPITest.unsqueeze_3dto4d_dim2 (1 ms)
[ RUN ] VulkanAPITest.unsqueeze_3dto4d_dim3
[ OK ] VulkanAPITest.unsqueeze_3dto4d_dim3 (1 ms)
[----------] 10 tests from VulkanAPITest (270 ms total)
[----------] Global test environment tear-down
[==========] 10 tests from 1 test suite ran. (270 ms total)
[ PASSED ] 10 tests.
```
Also, to improve my confidence in unit tests, I modified [force_flush.py](https://www.internalfb.com/code/fbsource/[6e606c6f62dafd2121e78ffe14ae12f1b6d8d405]/fbcode/wearables/camera/ml/pytorch_vulkan_native/demo/force_flush.py) to run several combinations of `aten::unsqueeze` on OD.
Verified these work as expected.
```
torch.zeros([])
torch.randn([])
torch.rand([])
torch.ones([])
torch.tensor(0, dtype=torch.float)
```
Found that Vulkan in general does not support the following. That's ok though since it's technically a 1d tensor which is not part of my task.
```
torch.tensor([])
```
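For reference, the distinction being drawn here on plain CPU tensors:
```python
import torch

# the constructors above all produce zero-dim (scalar) tensors ...
assert torch.zeros([]).dim() == 0               # shape: torch.Size([])
assert torch.tensor(0, dtype=torch.float).dim() == 0
# ... while torch.tensor([]) is an empty 1-d tensor, a different thing entirely
assert torch.tensor([]).dim() == 1              # shape: torch.Size([0])
```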
Differential Revision: D53071189
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118311
Approved by: https://github.com/liuk22
# Changes
* introduce `--check-mergeability` trymerge flag that attempts to merge PR locally, using the same merge logic as the mergebot, but requires just a read-only `GITHUB_TOKEN` and git repo.
* change mergeability workflow to utilize the new --check-mergeability logic
# Alternatives considered
1.
> Rewrite `https://github.com/pytorch/test-infra/actions/workflows/pr-dependencies-check.yml` to correctly support partially merged ghstacks.
That would be a slightly better approach, but ROI is lower, as it requires reimplementing trymerge logic and additional effort to consolidate the codebase (trymerge lives in pytorch repo).
`pr-dependencies-check.yml` still produces human-readable results for partially merged ghstack prs (even if it falsely reports them as non-mergeable).
2.
> Instead of introducing new trymerge flag, use existing flags, including `--dry-run`.
That didn't work, as no combination of existing flags skips the rule checks and ROCKSET lookups.
# Testing
1. Manual testing `trymerge.py --check-mergeability` on the regular and ghstack PRs:
```
export GITHUB_TOKEN=
export GIT_REPO_DIR=`pwd`
export GITHUB_REPOSITORY=pytorch/pytorch
export GIT_REMOTE_URL=https://github.com/pytorch/pytorch
# Test 1 (2 prs, 1 is closed)
python3 ../pytorch/.github/scripts/trymerge.py --check-mergeability 117862
Skipping 1 of 2 PR (#117859) as its already been merged
echo $?
0
# Test 2 (3 prs, 1 is closed)
python3 ../pytorch/.github/scripts/trymerge.py --check-mergeability 118125
Skipping 1 of 3 PR (#117859) as its already been merged
echo $?
0
# Test 3 (3 prs, intentional conflicts introduced into `main`):
python3 ../pytorch/.github/scripts/trymerge.py --check-mergeability 118125
Skipping 1 of 3 PR (#117859) as its already been merged
stdout:
Auto-merging torch/_inductor/ir.py
Auto-merging torch/_inductor/lowering.py
CONFLICT (content): Merge conflict in torch/_inductor/lowering.py
error: could not apply 66ba5b8792f... Realize inputs to DynamicScalar before unwrapping
...
RuntimeError: Command `git -C /Users/ivanzaitsev/pytorch2 cherry-pick -x 66ba5b8792fa076c4e512d920651e5b6b7e466f4` returned non-zero exit code 1
```
2. Workflow run:
https://github.com/pytorch/pytorch/actions/runs/7660736172/job/20878651852?pr=118258
<img width="516" alt="image" src="https://github.com/pytorch/pytorch/assets/108101595/28fbf0d2-ac2a-4518-b41d-b32b41373747">
<img width="621" alt="image" src="https://github.com/pytorch/pytorch/assets/108101595/ddbf8566-a417-43ec-9d0e-f623f4a71313">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118258
Approved by: https://github.com/PaliC, https://github.com/huydhn
Otherwise it takes 1+ h to build the CUDA 12.1 Docker image
- Limit UCC builds to just sm_52(M60) and sm_86(A10G), which I think has the biggest impact
- Replace hardcoded `-j6` build parallelism with more dynamic `-j$[$(nproc) - 2]`
- Remove redundant check about Ubuntu-14.04
- Added `DOCKER_BUILDKIT` to parallelize the builds
As a result, the Docker build time drops from 1+ h to 35 min
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118167
Approved by: https://github.com/huydhn
In theory this tells the system that we will access the file sequentially, which allows prefetching of future blocks. In practice it doubles the read-ahead size on Linux (which effectively doubles the read sizes).
Without this, CUDA uploads of files that aren't already in the FS cache, using mmapped files (safetensors) as the source, run at ~1 GB/s (from an SSD that has ~7 GB/s read speed...).
With this, they run at ~1.5 GB/s, which is still bad but better than before!
It is possible to increase the read performance further by touching the pages from multiple threads; in fact, when the tensors loaded from the file are used by the CPU, we get fairly good load performance (~5 GB/s), which appears to be because multiple threads page fault and trigger more concurrent reads, which improves SSD read throughput... however, this is not the case for CUDA uploads, and it is difficult to make that change in a generic way because it's unclear what the usage pattern of the input file is going to be.
All of the numbers above are taken on a Samsung 990 Pro SSD, on Linux kernel 6.5, with the FS cache cleared between every attempt to load a file. The file is loaded via `safetensors.safe_open`, which uses UntypedStorage.from_file to load the file into memory, which in turn uses MapAllocator.cpp.
I felt safe doing this change unconditionally, but please let me know if you'd like to see a separate allocator flag for this, propagated through to UntypedStorage. Note that the fadvise API is not available on macOS.
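For reference, the equivalent hint from user space looks roughly like this (a sketch only; the file name is hypothetical, and `os.posix_fadvise` is not available on macOS, matching the note above):
```python
import os

fd = os.open("model.safetensors", os.O_RDONLY)  # hypothetical file
try:
    # tell the kernel this file will be read sequentially so it can enlarge read-ahead
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
    # ... subsequent reads / mmap of this fd benefit from the larger read-ahead
finally:
    os.close(fd)
```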
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117805
Approved by: https://github.com/mikaylagawarecki
Test [ci-verbose-test-logs] (this worked; the test logs print while running, are interleaved, and are really long).
Settings for no timeout (the step timeout still applies; this only gets rid of the ~30 min timeout for a shard of a test file) and for not piping logs / extra-verbose test logs (good for debugging deadlocks, but results in very long and possibly interleaved logs).
Also allows these to be set via the PR body if the label name is in brackets, e.g. [label name], as in the test above.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117668
Approved by: https://github.com/huydhn
Before the PR, we have a graph break for the following test:
```python
def test_cublas_allow_tf32(x):
if torch.backends.cuda.matmul.allow_tf32:
return x.sin() + 1
return x.cos() - 1
```
In this PR, we first add "torch.backends.cuda" to MOD_INLINELIST to trace through the python binding and get the actual call torch._C._get_cublas_allow_tf32, where it's already a TorchInGraphVariable. Because _get_cublas_allow_tf32 is accessing the same variable as at::globalContext().allowTF32CuBLAS(), which is guarded by dynamo as a global state [here](https://github.com/pytorch/pytorch/blob/main/torch/csrc/dynamo/guards.cpp#L443), we could safely assume it returns a ConstantVariable during tracing.
After this pr, we get the following graph:
```python
[2024-01-24 15:31:01,501] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] <eval_with_key>.0 class GraphModule(torch.nn.Module):
[2024-01-24 15:31:01,501] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] def forward(self, L_x_ : torch.Tensor):
[2024-01-24 15:31:01,501] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] l_x_ = L_x_
[2024-01-24 15:31:01,501] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
[2024-01-24 15:31:01,501] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] # File: /home/yidi/local/pytorch/test/dynamo/test_functions.py:515 in test_cublas_allow_tf32, code: return x.cos() - 1
[2024-01-24 15:31:01,501] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] cos = l_x_.cos(); l_x_ = None
[2024-01-24 15:31:01,501] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] sub = cos - 1; cos = None
[2024-01-24 15:31:01,501] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] return (sub,)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118236
Approved by: https://github.com/yanboliang, https://github.com/anijain2305
Added support for constant outputs. We will just embed the constant directly into the output, like `return (x, 1)`.
Also adds support for None inputs/outputs. For None inputs, we address them the same way we do constants: a placeholder with no users will be inserted into the graph, and the None will be embedded into whatever operator is using it. For None outputs, we also address them the same way we do constants, embedding them into the output, like `return (x, None)`.
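A minimal sketch of the kind of program this enables (module and shapes are made up for illustration):
```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        # the constant 1 and the None are embedded directly into the output
        return x + 1, 1, None

ep = torch.export.export(M(), (torch.randn(3),))
print(ep.graph)  # the constants are inlined into the graph's output
```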
Differential Revision: D52881070
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117894
Approved by: https://github.com/zhxchen17
All single element list types are `Tensor[]` so they will always be Tuple.
I don't know of any way to easily access the pyi type and compare that to a real run so no testing here :(
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118238
Approved by: https://github.com/ezyang
This PR allows pointwise ops to operate on tensors with ragged_idx != 1. It does this by passing the ragged_idx metadata into the construction of the returned NestedTensor when computing pointwise ops. The assumption is that pointwise ops can operate directly on the values tensors, and the resulting tensor should have all the same metadata properties as the input tensors. For binary ops, a test is added to verify that two tensors with different ragged_idx cannot be added.
Previously:
* unary pointwise ops would error out when performed on nested tensors with ragged_idx != 1
* binary pointwise ops would produce tensors with nonsense shapes
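For context, a small sketch of pointwise ops on jagged-layout nested tensors (constructed here with the default ragged_idx):
```python
import torch

nt = torch.nested.nested_tensor(
    [torch.randn(2, 4), torch.randn(3, 4)], layout=torch.jagged
)
# unary and binary pointwise ops operate directly on the underlying values tensor,
# and the output inherits the input's nested metadata (including ragged_idx)
out = nt.sin() * 2 + 1
```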
Differential Revision: [D53032641](https://our.internmc.facebook.com/intern/diff/D53032641)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118157
Approved by: https://github.com/jbschlosser
Dry run for labels so we can run trymerge locally with dryrun without actually affecting the PR.
Make Dr.CI results easier to read (previously a massive json dump, now just the job names + ids, in a nicer format)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118240
Approved by: https://github.com/huydhn
We only want to traverse over each node in the graph exactly once, and we do that by inserting nodes into the "seen" set. The issue is that we forget to check the "seen" set when inserting the root nodes. Typically that is not a problem, because the root nodes come from different outputs and thus usually correspond to different nodes. With split_with_sizes, though, all of the outputs correspond to the same node, and this leads to the node being iterated over 3 times and 3 sets of hooks being attached to the same node.
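To see why split_with_sizes is special, note that all of its outputs hang off a single backward node (a quick sketch):
```python
import torch

x = torch.randn(6, requires_grad=True)
a, b, c = torch.split_with_sizes(x, [1, 2, 3])
# all three outputs come from the same backward node, so a traversal that seeds
# from each output without checking "seen" would visit this node three times
assert a.grad_fn is b.grad_fn is c.grad_fn
```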
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118252
Approved by: https://github.com/zou3519
ghstack dependencies: #117552, #118234, #118249
We are trying to adopt `SparsePrivateUse1` in our code. However, I found that the sparse stubs have not been exposed yet, which makes it impossible for me to implement the stubs and register them. I hope that the header files in this directory can be exposed. @albanD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118274
Approved by: https://github.com/ezyang
Summary: Now that set_ is marked as a view op, this special case is no longer necessary
Test Plan: CI exposed the need for this special case in the first place, so I think we can just rely on the existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118154
Approved by: https://github.com/bdhirsh
The OverlappingCPU Loader is causing a major drop in performance when used with multiple threads. This PR is a temporary fix while we investigate why this is the case.
Benchmarks for save, using a 7.25 GB FSDP model, as per the TSS benchmark. Both benchmarks run on 8 ranks.
- Before this PR: 9.475 s (8 threads)
- After this PR: 1.632 s (8 threads)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118114
Approved by: https://github.com/wz337, https://github.com/fegin
This diff introduce the following changes:
1. Fix sympy_subs to preserve integer and non-negative properties of replaced symbol when replacement is string
Why is this needed?
I was compiling an expression:
`x*abs(y)` where `y = -2`
What happens is that this expression is passed as ``s1*abs(s0)``, and then s0 is replaced with ks0 via a call to sympy_subs.
But sympy_subs used to replace s0 (integer=False, nonnegative=False) with ks0 (integer=True, nonnegative=True),
resulting in ``s1*abs(ks0) = s1*ks0``, which is wrong (see the small sympy sketch after this list).
2. rename sympy_symbol to sympy_index_symbol to make it explicit.
3. add assertion that replaced expression is not passed as string but always a sympy expression.
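A small plain-sympy sketch of the failure mode described in item 1 (symbol names mirror the ones above):
```python
import sympy

s0 = sympy.Symbol("s0")                                    # could be negative, e.g. -2
s1 = sympy.Symbol("s1")
ks0 = sympy.Symbol("ks0", integer=True, nonnegative=True)  # size-like replacement symbol

expr = s1 * sympy.Abs(s0)
# substituting a nonnegative symbol lets sympy drop the Abs(), changing the meaning
print(expr.subs(s0, ks0))  # ks0*s1
```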
Fixes https://github.com/pytorch/pytorch/issues/117757
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118150
Approved by: https://github.com/ezyang
Previously, we generated the grid argument with tree.numel for
a benchmark TritonKernel. This was not correct, because it
didn't match the launch config used for profiling and running.
This PR fixed the issue by emitting the grid value computed
by the kernel's grid_fn, which is used by the profiler and
the kernel's runner.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118202
Approved by: https://github.com/shunting314, https://github.com/jansel
Summary:
Test Plan:
```
lintrunner --take MYPYINDUCTOR --all-files
ok No lint issues.
lintrunner -a
ok No lint issues.
Successfully applied all patches.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116311
Approved by: https://github.com/int3
Summary:
The class FQN is needed when unpacking a CustomObj instance.
For all other Arguments, e.g. Tensor, TensorList, SymInt, we always know their exact type. However, CustomObjArgument had an opaque type.
Adding this field also helps unveil the type of this opaque object.
Test Plan: CI
Differential Revision: D53029847
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118158
Approved by: https://github.com/zhxchen17
Was debugging an export issue, and currently when `key` does not exist in `self.items`, the error message is
```
File "/opt/pytorch/torch/_dynamo/variables/dicts.py", line 208, in getitem_const
return self.items[key]
~~~~~~~~~~^^^^^
torch._dynamo.exc.InternalTorchDynamoError: <torch._dynamo.variables.dicts.ConstDictVariable._HashableTracker object at 0x7fd7697cbf90>
```
This PR changes it to be the following.
```
File "/data/users/angelayi/pytorch/torch/_dynamo/variables/dicts.py", line 199, in getitem_const
raise KeyError(arg.value)
torch._dynamo.exc.InternalTorchDynamoError: shape
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117902
Approved by: https://github.com/williamwen42
Currently, we create a new_group for the sub-group PG during mesh initialization. This PR changes this so we will:
1) re-use the sub-group PG if it exists,
2) create a new sub-group PG if it does not exist.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115716
Approved by: https://github.com/wanchaol
This PR introduces the initial `fully_shard` frontend without any distributed logic that will be built into per-parameter-sharding FSDP.
- We design `fully_shard` to be a _module-level_ API (taking in an `nn.Module`), as opposed to a tensor-level one.
- We define an `FSDP` class and use a dynamic class swap, setting `module.__class__` to a newly created class that subclasses `FSDP` and `type(module)`, to allow FSDP to override and add methods on the module (a minimal sketch of this pattern follows this list).
- We name this class as `FSDP<type(module)>`, e.g. `FSDPLinear` for `Linear`.
- We disable the `deepcopy` because the state object inserted on the module will not be trivially `deepcopy`-able.
- Calling `fully_shard(module)` inserts a state object on `module` but not any of its children. This state object will be used for any FSDP-specific state.
- We raise an error on `ModuleList` or `ModuleDict` since they do not implement `forward()`, and FSDP will rely on `forward()` to insert logic (https://github.com/pytorch/pytorch/issues/113794).
- In the future, we will deprecate the existing `fully_shard` that calls into the same backend logic as `FullyShardedDataParallel` as there is no adoption for that and we prefer to reuse that name.
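A minimal sketch of the dynamic class-swap pattern described above (the names here are illustrative, not the real FSDP classes):
```python
import torch.nn as nn

class _Mixin:                      # stand-in for the FSDP mixin class
    def extra_method(self) -> str:
        return "added by the swap"

def swap_class(module: nn.Module) -> nn.Module:
    orig_cls = type(module)
    # create a new class subclassing (mixin, original) and swap it in
    module.__class__ = type(f"Swapped{orig_cls.__name__}", (_Mixin, orig_cls), {})
    return module

lin = swap_class(nn.Linear(4, 4))
print(type(lin).__name__)   # "SwappedLinear"
print(lin.extra_method())   # "added by the swap"
```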
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117776
Approved by: https://github.com/wconstab, https://github.com/weifengpy, https://github.com/wanchaol
ghstack dependencies: #117994, #118186, #117984
Don't require using it as `@requires_cuda()`; use `@requires_cuda` instead. There is no need for the partial function to be invoked many times.
Split out this change from the initial large refactoring in #117741 to hopefully get merged before conflicts arise
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118281
Approved by: https://github.com/ezyang
The existing warning in `DTensor.__new__()` checks `if requires_grad != local_tensor.requires_grad:` and warns with:
> To construct DTensor from `torch.Tensor`, it's recommended to use `local_tensor.detach()` and make `requires_grad` consistent.
Calling `local_tensor.detach()` will have the returned `Tensor` have `requires_grad=False`, so the error message refers to the case where `local_tensor.requires_grad is True` but the user passed `requires_grad=False` to `to_local()`.
However, there is the converse case, where `local_tensor.requires_grad is False` but the user passed `requires_grad=True`. In this case, the original `if requires_grad != local_tensor.requires_grad:` check succeeds, and the warning is emitted. However, the warning message does not apply in that case.
This can happen via `_prepare_output_fn` -> `redistribute` -> `Redistribute.forward()`, where `output.requires_grad is False` but it passes `requires_grad=input.requires_grad` which can be `True`.
We should not warn in this case since `Redistribute.forward()` is our own framework code, so I was proposing to relax the warning.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118186
Approved by: https://github.com/XilunWu, https://github.com/wanchaol
ghstack dependencies: #117994
When CUDA is not available `c10d.init_process_group("nccl"...)` will fail with
> RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
Hence add a corresponding skip marker to the classes deriving from DynamoDistributedSingleProcTestCase next to the `requires_nccl` marker.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117741
Approved by: https://github.com/ezyang, https://github.com/malfet
resume_in_* code objects show up in user backtraces when failures occur
in code that has been Dynamo processed. It is obvious to me, a PT2
developer, that these are generated by PT2, but it is NOT obvious to a
non-core dev that this has happened. Add an extra torch_dynamo
breadcrumb to help get people to the right place.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118201
Approved by: https://github.com/albanD
Summary: When there were optionals with specified default values the code was improperly handling the number of parameters causing `IndexError: tuple index out of range`
Test Plan: new tests
Differential Revision: D52977644
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118035
Approved by: https://github.com/williamwen42
This test isn't run in CI because the CI runners don't have dill installed.
This fixes the tests so they run for me locally, and in the next PR I add
dill to the CI so we can test it properly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116230
Approved by: https://github.com/jansel
We split install_global_once into two APIs:
- `install_global_by_id(prefix, value) -> name`: installs a global if it hasn't
been installed yet
- `install_global(prefix, value) -> name`: always installs the global (and
generates a unique name for it)
Then, we refactor most callsites of `install_global_unsafe` to one of
the previous. Some callsites cannot be refactored because we create the
global name first, do a lot of stuff with it, and then install it.
This fixes more test flakiness.
Test Plan:
- Existing tests; I can't reliably repro the flakiness
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118100
Approved by: https://github.com/ezyang, https://github.com/mlazos
# Summary
Simplification of Backend Selection
This PR deprecates the `torch.backends.cuda.sdp_kernel` context manager and replaces it with a new context manager, `torch.nn.attention.sdpa_kernel`, which also changes the API.
With `sdp_kernel` one would specify the backend choice by negation, i.e. by turning off the backends other than the one they wanted to run. The purpose of that backend manager was only to be a debugging tool: "turn off the math backend" and see if you can run one of the fused implementations.
Problems:
- This pattern makes sense if majority of users don't care to know anything about the backends that can be run. However, if users are seeking to use this context manager then they are explicitly trying to run a specific backend.
- This is not scalable. We are working on adding the cuDNN backend, and this API means that more implementations will need to be turned off if the user wants to explicitly run a given backend.
- Discoverability of the current context manager. It is somewhat unintuitive that this backend manager lives in backends/cuda/init when it now also controls the CPU fused kernel behavior. I think centralizing it in the attention namespace will be helpful.
Other concerns:
- Typically backends (kernels) for operators are entirely hidden from users and are implementation details of the framework. We have exposed this to users already, albeit not by default and with beta warnings. Does making backend choices even more explicit lead to problems when we potentially want to remove existing backends (perhaps input shapes will get covered by newer backends)?
A nice side effect: now that we aren't using the `BACKEND_MAP` in test_transformers, many, many dynamo failures are passing for the CPU tests.
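Usage of the new context manager looks roughly like this (a sketch assuming a CUDA device that supports flash attention):
```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

q, k, v = (
    torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16) for _ in range(3)
)

# opt in to the backend you want to run, instead of turning the others off
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v)
```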
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114689
Approved by: https://github.com/cpuhrsch
This PR does what it says and more.
1. We increase coverage by a LOT! Previously, complex was not tested for many many configs, including foreach + maximize at the same time. Or the fused impls. Or just random configs people forgot about.
2. I rearranged the maximize conditional and the _view_as_real to preserve list-ness. This is needed for _view_as_real to function properly, I did add a comment in the Files Changed. This new order also just...makes more aesthetic sense.
3. Note that LBFGS and SparseAdam are skipped--they don't support complex and now we know.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118160
Approved by: https://github.com/mikaylagawarecki
The return type for the forward pass of nn.AdaptiveMaxPool1d is specified to be Tensor, but if self.return_indices, then the result type should be tuple[Tensor,Tensor].
For users trying to trace/script this function with indices, the incorrect typing is problematic.
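For reference, the two return shapes in question:
```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 16)
assert isinstance(nn.AdaptiveMaxPool1d(4)(x), torch.Tensor)
out, indices = nn.AdaptiveMaxPool1d(4, return_indices=True)(x)  # a (values, indices) tuple
```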
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118162
Approved by: https://github.com/albanD
Summary:
We observed the following error when launching the e2e AFOC model test:
```
RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.
```
f524190245
Differential Revision: D53011463
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118105
Approved by: https://github.com/jackiexu1992
On Linux and Mac, `int64_t` is an alias for either `long` (Linux) or `long long` (Mac).
Because of that, an attempt to construct a `c10::Scalar` from the other type will fail with `conversion from 'long long int' to 'c10::Scalar' is ambiguous`.
I.e., an attempt to compile:
```cpp
int main() {
c10::Scalar s = 1L;
}
```
on MacOS failed with:
```
foo.cpp:3:15: error: conversion from 'long' to 'c10::Scalar' is ambiguous
c10::Scalar s = 1L;
^ ~~
/Users/nshulga/git/pytorch/pytorch/torch/include/c10/core/Scalar.h:59:7: note: candidate constructor
DEFINE_IMPLICIT_CTOR)
^
/Users/nshulga/git/pytorch/pytorch/torch/include/c10/core/Scalar.h:59:7: note: candidate constructor
/Users/nshulga/git/pytorch/pytorch/torch/include/c10/core/Scalar.h:59:7: note: candidate constructor
/Users/nshulga/git/pytorch/pytorch/torch/include/c10/core/Scalar.h:59:7: note: candidate constructor
/Users/nshulga/git/pytorch/pytorch/torch/include/c10/core/Scalar.h:59:7: note: candidate constructor
/Users/nshulga/git/pytorch/pytorch/torch/include/c10/core/Scalar.h:59:7: note: candidate constructor
/Users/nshulga/git/pytorch/pytorch/torch/include/c10/core/Scalar.h:59:7: note: candidate constructor
/Users/nshulga/git/pytorch/pytorch/torch/include/c10/core/Scalar.h:62:3: note: candidate constructor
Scalar(uint16_t vv) : Scalar(vv, true) {}
^
/Users/nshulga/git/pytorch/pytorch/torch/include/c10/core/Scalar.h:63:3: note: candidate constructor
Scalar(uint32_t vv) : Scalar(vv, true) {}
^
/Users/nshulga/git/pytorch/pytorch/torch/include/c10/core/Scalar.h:64:3: note: candidate constructor
Scalar(uint64_t vv) {
^
```
Prevent this by providing the missing constructors when needed. Alas, one cannot use SFINAE, as template constructors on Scalar mess up a lot of implicit conversions, so I use `static_assert`s to detect early on whether the premise for constructing this class holds.
Add ScalarTest::LongsAndLongLongs, which is essentially a compile-time test.
Discovered while trying to enable AOTI on MacOS.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118149
Approved by: https://github.com/ezyang, https://github.com/albanD
ghstack dependencies: #118077, #118076
- Add `darwin` to the list of supported platform
- Add `#include <sstream>` to `aoti_runtime/model.h`
- Refactor Linux specific constant compilation logic to `_compile_consts_linux`
- Add `_compile_consts_darwin` that converts consts to .S file that is linked into a shared library
- Patch file using magic to avoid converting bytes to large hexadecimal string
- Generate integer constants with `LL` suffix on MacOS (corresponds to int64_t definition)
- Enable test_aot_inductor.py tests on MacOS
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118076
Approved by: https://github.com/desertfire
ghstack dependencies: #118077
By not passing linker flag if `compile_only` is set to `True`
Before that change every invocation of AOTI compiler resulted in emitting at least 4 warnings:
```
clang: warning: -lomp: 'linker' input unused [-Wunused-command-line-argument]
clang: warning: argument unused during compilation: '-shared' [-Wunused-command-line-argument]
clang: warning: argument unused during compilation: '-undefined dynamic_lookup' [-Wunused-command-line-argument]
clang: warning: argument unused during compilation: '-L/Users/nshulga/miniforge3/lib' [-Wunused-command-line-argument]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118077
Approved by: https://github.com/desertfire
Summary:
This diff introduces a caching mechanism to improve the performance of the partitioner in PyTorch. The changes involve adding a cache to store the DFS path of each node in the graph, which can be reused later when trying to find cycles in the graph.
This shows significant improvements for the edge use cases where the ASR model (which is around 6000+ nodes) used to take 26 minutes, but after this it takes around 8 minutes.
Test Plan: Relying on the existing ExecuTorch CI tests that heavily use this partitioning mechanism and also tested out locally via Bento notebooks.
Differential Revision: D51289200
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115943
Approved by: https://github.com/SherlockNoMad
**Summary**
Previously, the DTensor sharding plan filter (i.e. `is_tensor_shardable()`) could not correctly handle the case where the input `DTensor` has a 0-size dimension. This filter should return `True` if the sharding placement on the 0-size dimension is `Replicate`, even if `tensor dim < num of shards` on that dimension, in which case `tensor dim == 0` and `num of shards == 1`.
In this PR we also noticed a behavior discrepancy of `torch.addmm`. See #118131
**Test Plan**
```
pytest test/distributed/_tensor/test_dtensor_ops.py -s -k addmm
pytest test/distributed/_tensor/test_dtensor_ops.py -s -k mm_cpu_float32
CUDA_VISIBLE_DEVICES="" pytest test/distributed/_tensor/test_matrix_ops.py -s -k empty_operand
pytest test/distributed/_tensor/test_matrix_ops.py -s -k empty_operand
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117726
Approved by: https://github.com/wanchaol
Summary: as title, the "canonical" flag is added to sigmoid serializer, so that we can optionally "normalize" the IR to give stable names and orders to IR nodes, which could help with the cases to compare IR definitions.
Test Plan: buck run @//mode/opt //aps_models/ads/config_model_authoring/stability:cli export-generated-module-state-command
Differential Revision: D52431965
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116758
Approved by: https://github.com/avikchaudhuri
## Context
This is an example that runs into an AssertionError while lowering in Inductor.
```
# While lowering, b will be expanded because b.size(1) == 1.
a = torch.zeros([u0, 512])
b = torch.ones([u0, 1])
return a * b
```
Below's the tail-end of the stack trace. Here's the important bits:
1. In _inductor/sizevars.py, we'll call `self.shape_env.defer_runtime_assert(expr, msg, fx_node=V.graph.current_node)`.
2. This leads to the creation of a `ShapeEnvEvent` with an FX node via `kwargs={"fx_node": V.graph.current_node}` ([see](0c9b513470/torch/fx/experimental/recording.py (L245-L247))).
3. Eventually, we try to call `maybe_convert_node()` but it expects translation validation to be on ([see](0c9b513470/torch/fx/experimental/recording.py (L118-L121))).
```
File "pytorch/torch/_inductor/lowering.py", line 221, in transform_args
for i, x in zip(indices, broadcast_tensors(*[args[i] for i in indices])):
File "pytorch/torch/_inductor/lowering.py", line 294, in wrapped
out = decomp_fn(*args, **kwargs)
File "pytorch/torch/_inductor/lowering.py", line 676, in broadcast_tensors
x = expand(x, target)
File "pytorch/torch/_inductor/lowering.py", line 294, in wrapped
out = decomp_fn(*args, **kwargs)
File "pytorch/torch/_inductor/lowering.py", line 793, in expand
return TensorBox(ExpandView.create(x.data, tuple(sizes)))
File "pytorch/torch/_inductor/ir.py", line 1871, in create
new_size = cls._normalize_size(x, new_size)
File "pytorch/torch/_inductor/ir.py", line 1862, in _normalize_size
new_size[i] = V.graph.sizevars.expect_equals(
File "pytorch/torch/_inductor/sizevars.py", line 338, in expect_equals
self.expect_true(sympy.Eq(left, right), msg=msg)
File "pytorch/torch/_inductor/sizevars.py", line 333, in expect_true
self.shape_env.defer_runtime_assert(expr, msg, fx_node=V.graph.current_node) # (1) is here
File "pytorch/torch/fx/experimental/recording.py", line 257, in wrapper
return event.run(self) # (2) happens right before this
File "pytorch/torch/fx/experimental/recording.py", line 155, in run
replacearg(index=3, key="fx_node", fn=maybe_convert_node)
File "pytorch/torch/fx/experimental/recording.py", line 138, in replacearg
kwargs[key] = fn(kwargs[key])
File "pytorch/torch/fx/experimental/recording.py", line 128, in maybe_convert_node
assert hasattr(shape_env, "name_to_node") # (3) is here
```
## Approach
Since [translation validation](c6be5d55a5/torch/fx/experimental/validator.py (L574)) may not be enabled during Inductor lowering, we can check for that case and simply return the FX node's name when it is disabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118066
Approved by: https://github.com/ezyang, https://github.com/peterbell10
This PR intends to fix the following issue when swapping two tensors
```python
>>> import torch
>>> torch.manual_seed(5)
>>> t1 = torch.randn(2)
>>> t2 = torch.randn(3)
>>> t1
tensor([-0.4868, -0.6038])
>>> t2
tensor([-0.5581, 0.6675, -0.1974])
>>> torch.utils.swap_tensors(t1, t2)
>>> t1
tensor([-0.5581, 0.6675, -0.1974])
>>> t2
tensor([-0.4868, -0.6038])
>>> t1.fill_(0.5) # t1 back to its unswapped state :o
tensor([-0.4868, -0.6038])
```
What happens here is that in `THPVariable_Wrap` (which is used when going back from C++ --> Python), we check if the TensorImpl of the tensor to be returned already has a pointer to a PyObject in its PyObject slot. If this is the case then this object is returned.
57491d2046/torch/csrc/autograd/python_variable.cpp (L271-L292)
When we run any operation that returns the same TensorImpl (e.g. inplace op, `t.to(dtype=t.dtype)`, etc.), although `t1` now has `t2`'s TensorImpl, `t2`'s TensorImpl still has a reference to `t2`, so when we do the op on `t1` and `THPVariable_Wrap` attempts to return the pointer to the TensorImpl's PyObject, we return a pointer to `t2` instead.
The TensorImpl should have the PyObjects in their PyObjectSlots swapped as well in `swap_tensors`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116955
Approved by: https://github.com/albanD
This PR is another step towards modernizing our optimizer tests by tackling the simplest foreach tests. The replaced tests are now removed in `test/optim/test_optim.py`.
**Changes in coverage?** Yes!
- This PR _decreases_ coverage (!!!!) by only checking the direction on the forloop implementations vs both the forloop and foreach. Why? I believe it should be sufficient to check the forloop only, as the foreach parity is already checked in the `foreach_matches_forloop` test.
- This PR also _increases_ coverage for SparseAdam with contiguous params on CUDA, which was previously forbidden due to an old old bug that has since been fixed.
What will it take to fully remove `test_basic_cases`?
- We need to flavor the tests with LRSchedulers
- Testing for param groups --> which all just distinguish between lrs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117410
Approved by: https://github.com/albanD
Summary: Adding an experimental API to the FX graph module to place "hooks" every time we change or replace nodes in a graph, so that we can properly update the new name in the graph signature and potentially other places.
Test Plan:
buck test mode/opt -c fbcode.enable_gpu_sections=true caffe2/test/distributed/_tensor/experimental:tp_transform
buck test mode/opt caffe2/test:test_export -- -r test_replace_hook
Differential Revision: D52896531
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117825
Approved by: https://github.com/avikchaudhuri
Summary: `constraints` argument for `torch.export` has been deprecated in favor of the `dynamic_shapes` argument. This PR updates the use of the deprecated API in `caffe2/test/cpp` and `torchrec/distributed/test/test_pt2`.
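For reference, a minimal sketch of the replacement API (the module and dimension name are made up):
```python
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x):
        return x * 2

batch = Dim("batch")
# dynamic_shapes replaces the deprecated constraints= argument
ep = export(M(), (torch.randn(4, 8),), dynamic_shapes={"x": {0: batch}})
```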
Test Plan: CI
Differential Revision: D52977354
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118026
Approved by: https://github.com/chenyang78
Summary: Show the stack when SEGMENT_FREE and SEGMENT_UNMAP occurs. This may be useful for debugging such as when empty_cache() may cause a segment to be freed. If the free context is unavailable, resort to the segment allocation stack.
Test Plan: CI
Differential Revision: D52984953
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118055
Approved by: https://github.com/zdevito
1. I'd like to remove the patching that avoids the profiler hook, but it adds an additional graph break due to nested wrappers. #117767 if interested, see (internal only) paste for [before](P996529232) and [after](P997507449) this PR.
```
I've locally run perf benchmarks for yolov3: Before the speedup is 4.183x, and after it is 4.208x.
I've also run it for resnet50: before, speedup is 3.706x and now it is 3.924x.
```
2. @mlazos I now unwrap twice in the dynamo and inductor tests. This feels like we're testing deficiently--should we add tests to test that tracing through the profiler hook and the use_grad hook are functioning according to expectations (I know there's at least one graph break in one).
3. There's a strange memory thing going on...what is happening? This has been resolved with @voznesenskym's [change](https://github.com/pytorch/pytorch/pull/116169). (for details see below)
<details>
This PR will fail the test_static_address_finalizer test due to a mysterious thing that is happening (idk what, but maybe the dynamo cache or a frame _expecting_ the patching to have been done).
There is no Python refcycle, as the backrefs for `p_ref()` look like:

(so 5 backrefs but none of them python)
And the refs:

</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115772
Approved by: https://github.com/jansel, https://github.com/mlazos
* custom pytest-shard so I can control the verbosity (also index by 1 since it's confusing)
* normal runs (not keep-going) always rerun each failed test 9 times (3 per process, 3 processes). Previously it would only run the entire test file 3 times, so if a test before you segfaulted, you only got 2 tries
Example of quieter log https://github.com/pytorch/pytorch/actions/runs/7481334046/job/20363147497
"items in shard" only gets printed once at the beginning, and the reruns just say how many got skipped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117069
Approved by: https://github.com/huydhn
Fix https://github.com/pytorch/pytorch/issues/97352.
This PR changes the way linking to Intel MKL is done and updates MKL on Windows to mkl-2021.4.0.
For both conda and pip packages there are MKL versions you can link against dynamically: the mkl-devel package contains the static versions of the libraries and the mkl package contains the DLLs needed at runtime. MKL DLLs and static libs starting with 2021.4.0 have the version in their names (for MKL 2023 we have mkl_core.2.dll and for 2021.4.0 we have mkl_core.1.dll), so it's possible to have multiple versions installed and it will work properly.
For the wheel build I added a dependency on the MKL wheel, for conda a dependency on the conda MKL package, and for libtorch I copied the MKL binaries into libtorch.
In order to test this PR I have to use a custom builder: https://github.com/pytorch/builder/pull/1467
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102604
Approved by: https://github.com/IvanYashchuk, https://github.com/malfet
This expands the reinplacing pass to allow reinplacing view-scatter operations.
e.g. if our python code is:
```
a = view1(inp)
b = view2(a)
b.copy_(src)
```
this generates a functionalized graph like:
```python
a = view1(inp)
a_updated = view2_scatter(a, src)
inp_updated = view1_scatter(inp, a_updated)
```
First, the `canonicalize_view_scatter_ops` step rewrites the functionalized graph
in the form:
```python
inp_updated = _generalized_scatter(inp, src, [view1, view2])
a_updated = view1(inp_updated)
```
I then register `_generalized_scatter` as a normal inplacable op which can be
handled by the pre-existing mechanism. Since we've fused the two scatter ops into one,
the reinplacing pass sees only one user of `inp` which allows the entire operation to be
reinplaced if desired (and I add heuristics that sometimes choose not to reinplace).
Finally, there is a decomposition step which decomposes out-of-place or in-place
`_generalized_scatter` operations either back into view_scatter operations, or
into the version with mutations. When introducing mutations, the reinplaced
version is equivalent to the original mutation:
```
a = view1(inp)
b = view2(a)
b.copy_(src)
```
Or when out-of-place we end up with a minor restructuring of the graph:
```
a = view1(inp)
tmp = view2_scatter(a, src)
inp_updated = view1_scatter(inp, tmp)
a_updated = view1(inp_updated)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116899
Approved by: https://github.com/lezcano
ghstack dependencies: #116898, #117121
Currently if you have the code:
```python
idx = torch.arange(10, device=x.device)
src = torch.ones(10, dtype=x.dtype, device=x.device)
x.index_put_((idx,), src)
expand = x.expand((2, x.shape[0]))
```
The `index_put_` cannot be reinplaced under dynamic shapes due to the user
`aten.sym_size(x, 0)`; however, since this function only looks at the tensor
metadata, it is actually fine to reinplace.
Here I ignore these operators in the analysis of the reinplacing pass, so
reinplacing can happen under dynamic shapes as well. I also handle cases
where views are created just to be fed to `sym_size`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117121
Approved by: https://github.com/lezcano
ghstack dependencies: #116898
Previously, if someone wrote a python abstract impl but didn't import
the module it is in, then we would raise an error message suggesting
that the user needs to add an abstract impl for the operator.
In addition to this, we suggest that the user try importing the module
associated with the operator in the pystub (it's not guaranteed that
an abstract impl does exist) to avoid confusion.
Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117770
Approved by: https://github.com/ydwu4, https://github.com/williamwen42
At least if one tries to compile the AOTI code on Darwin, compilation
fails with an "implicit instantiation of undefined template" error:
```
In file included from /Users/nshulga/git/pytorch/pytorch/torch/include/torch/csrc/inductor/aoti_runtime/arrayref_tensor.h:3:
/Users/nshulga/git/pytorch/pytorch/torch/include/torch/csrc/inductor/aoti_runtime/model.h:69:21: error: implicit instantiation of undefined template 'std::basic_stringstream<char>'
std::stringstream ss;
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118075
Approved by: https://github.com/desertfire
ghstack dependencies: #118074
Memory leak detected on ROCm. Skip until it can be addressed.
PYTORCH_TEST_WITH_ROCM=1 PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 python test_eager_transforms.py -k test_compile_vmap_hessian_cuda
See #117642 for moving rocm CI to unstable due to this test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118009
Approved by: https://github.com/jeanschmidt
Summary:
Previously, the heartbeat was incremented once per completed for loop over a list
of in-progress work items, under the assumption that either the processing
would be predictably quick, or it would hang completely.
In fact, there can be CUDA API contention that causes the processing of work items
to slow down arbitrarily but not truly deadlock. To guard against this, we
bump the heartbeat at the smallest unit of progress: one work item being
successfully processed.
Test Plan: CI
Differential Revision: D52973948
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118016
Approved by: https://github.com/shuqiangzhang, https://github.com/kwen2501
XLA CI is currently broken in PyTorch; I think there are two reasons for that:
1. There is an offending PyTorch PR c393b2f1ee. Han is working on a fix in https://github.com/pytorch/xla/pull/6345
2. The commit that the PyTorch pin points to, 2990cb38c17e06d0dbe25437674ca40130d76a8f, is not a valid commit. I think this is because we tried to help them land a breaking PR in https://github.com/pytorch/xla/pull/6307; however, a rebase appears to have removed that commit, and now CI fails with
```
fatal: reference is not a tree: 2990cb38c17e06d0dbe25437674ca40130d76a8f
585
```
Let me first update the pin to master so it at least runs some tests; this way we can discover whether there are any additional issues. I will rebase after @qihqi's fix passes all CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117999
Approved by: https://github.com/clee2000
Fixes https://github.com/pytorch/pytorch/issues/117851
In tests, we ran into an issue where:
- In frame A, Dynamo would install a global
- We call reset()
- reset() did not delete the installed global due to a refcycle
- In frame B, Dynamo would re-use the same global
- Python gc ran, deleting the installed global, leading to the compiled
version of frame B raising NameNotFound
This PR changes the following:
- module globals are now installed at a per-frame basis.
- renames install_global to install_global_unsafe: if the names are not
unique and end up being re-used across frames, then we've got trouble.
Test Plan:
- I tested that this got rid of the test flakiness locally. I'm not sure
how to easily write a test for this, because I don't actually know
what the refcycle in the above is.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117998
Approved by: https://github.com/ezyang, https://github.com/anijain2305
Currently, when the user passes a model state_dict which is not a file,
ONNXProgram.save calls torch.save along with io.BytesIO, which does not
support memory-map. That causes the file stream to be fully allocated in
memory.
This PR removes the torch.save call and passes the dict directly to the
serializer. This is beneficial for the scenario when model_state_dict
is generated by torch.load(..., mmap=True), as the state dict will be
mapped in memory instead of fully loaded in memory.
This PR leverages https://github.com/pytorch/pytorch/pull/102549
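For example, a caller can keep the checkpoint memory-mapped end to end. This is a usage sketch only; the keyword used to pass the state dict to `ONNXProgram.save`, as well as the file paths and the `model`/`example_input` names, are assumptions for illustration:
```python
import torch

# Load the checkpoint memory-mapped instead of materializing it.
state_dict = torch.load("checkpoint.pt", mmap=True)

# Export the model and save, handing the mmap'd state dict to the serializer.
onnx_program = torch.onnx.dynamo_export(model, example_input)
onnx_program.save("model.onnx", model_state_dict=state_dict)  # keyword name assumed
```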
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117863
Approved by: https://github.com/wschin
Today, our param_group testing does the equivalent of pitting weight and bias with different optimizer hyperparams and then checking that the overall result is going in the right direction based on maximize.
This PR introduces two tests to encompass coverage:
1. For every optimizer input (no differentiable), always force bias to have 0 weight_decay, and then check that the direction is expected. This is basically a replica of today's tests, but is more methodical as the test is a real use case.
2. To ensure that the different groups have distinct behavior, I added another test where lr is basically 0 in the default group, and ensure that the param in the default group doesn't move while the loss does.
Together, these tests do a better job of testing param groups than today's tests, **though we do lose some flavors**. For example, RMSProp also pits centered=True vs False across the param_groups, Adadelta has a variation on rho, and ASGD has a variation for t0. I don't think this is really a loss, as the previous test was just testing for direction and our new tests test stronger guarantees.
The leftover param group configs are used in conjunction with LRSchedulers.
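A minimal sketch of the second test idea (names and values here are illustrative, not the actual test code):
```python
import torch

# Default group gets lr=0, the explicit group gets a real lr, so only the
# second parameter should move after a step.
w_default = torch.nn.Parameter(torch.randn(2))
w_tuned = torch.nn.Parameter(torch.randn(2))
opt = torch.optim.SGD(
    [{"params": [w_default]}, {"params": [w_tuned], "lr": 0.1}],
    lr=0.0,
)
(w_default.sum() + w_tuned.sum()).backward()
before = w_default.detach().clone()
opt.step()
assert torch.equal(before, w_default)  # param in the default group stays put
```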
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117675
Approved by: https://github.com/albanD
Summary:
When using a custom deleter, InefficientStdFunctionContext was using a
std::unique_ptr<> to store the pointer and call the deleter - but this failed to
call the deleter if the pointer was null. Since we have a separate holder class
anyway, take out the std::unique_ptr<> and call the deleter directly.
Fixes#117273
Test Plan:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117418
Approved by: https://github.com/wjakob, https://github.com/yanboliang
Summary:
Our upcoming compiler upgrade will require us not to have shadowed variables. Such variables have a _high_ bug rate and reduce readability, so we would like to avoid them even if the compiler was not forcing us to do so.
This codemod attempts to fix an instance of a shadowed variable. Please review with care: if it's failed the result will be a silent bug.
**What's a shadowed variable?**
Shadowed variables are variables in an inner scope with the same name as another variable in an outer scope. Having the same name for both variables might be semantically correct, but it can make the code confusing to read! It can also hide subtle bugs.
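A toy illustration (in Python, not the C++ code this codemod touches):
```python
count = 0  # outer-scope variable

def bump(items):
    count = len(items)  # inner `count` hides the outer name
    return count        # the outer `count` is never updated

bump([1, 2, 3])
print(count)  # still 0 -- the kind of subtle bug shadowing can hide
```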
This diff fixes such an issue by renaming the variable.
- If you approve of this diff, please use the "Accept & Ship" button :-)
Test Plan: Sandcastle
Reviewed By: igorsugak
Differential Revision: D52582853
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117996
Approved by: https://github.com/PaliC, https://github.com/kit1980, https://github.com/malfet
# Context
Let's say we do `View.create(x, sizes)` where `x` is a `SliceView` and `sizes` contains unbacked symints e.g. `sizes = [i14, 256]`. Then we'll run ([this code](7e37f63e5e/torch/_inductor/ir.py (L2058-L2071))), where we:
1. Call `x.realize()` -- SliceView(Pointwise) -> SliceView(ComputedBuffer).
2. Retrieve storage & layout via `as_storage_and_layout(x)`
3. Calculate `new_layout` based off layout & `new_sizes`
4. `return ReinterpretView(storage, new_layout)`
However, (2) will raise `NotImplementedError` ([see](7e37f63e5e/torch/_inductor/ir.py (L1704-L1731))) since `x` is a `SliceView` and that isn't supported.
Thus, I tried adding support for `SliceView` in `as_storage_and_layout`. This worked for my case, but if instead `sizes` had backed symints e.g. `sizes=[s0, 256]` then some benchmarked models lost accuracy.
```
if isinstance(x, SliceView):
return as_storage_and_layout(
x.data,
freeze=freeze,
want_contiguous=want_contiguous,
stride_order=stride_order,
)
```
So instead of the above, I tried unwrapping the `SliceView` via `x = x.unwrap_view()`. This works for my usecase and passes CI but I'm not entirely sure why. If we unwrap our `SliceView` and create a `ReinterpretView`, I'd assume we'd lose the reindexer from `SliceView`. ~~But maybe we can re-create the same indexing from the `ReinterpretView`'s strides?~~ edit: we do lose vital information (like the offset) when we unwrap the `SliceView` and create a `ReinterpretView`, so that's a no-go.
Moving onto the final version of this PR. We call `ExternKernel.realize_input()` (feels a bit weird to use `ExternKernel` but it's exactly what I need). It will go ahead and handle our `SliceView` case ([see](a468b9fbdf/torch/_inductor/ir.py (L3733-L3739))) by converting it to a `ReinterpretView` with the correct offset.
# Test
```
$ python test/inductor/test_unbacked_symints.py
..
----------------------------------------------------------------------
Ran 10 tests in 20.813s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117013
Approved by: https://github.com/jansel, https://github.com/ezyang
Summary:
In `torch.export.export(f, args, kwargs, ..., dynamic_shapes=None, ...)`, `dataclass` is an acceptable type of inputs (for args and kwargs). The `dynamic_shapes` for `dataclass` inputs needs to be the same `dataclass` type, replacing each tensor attribute with the `dynamic_shapes` of the corresponding tensor. (https://github.com/pytorch/pytorch/blob/main/torch/export/dynamic_shapes.py#L375)
However, some `dataclass` may have limitations on the types of attributes (e.g., having to be tensors) such that the same `dataclass` cannot be constructed for dynamic shapes.
For an input of `dataclass` type, this task enables a `dynamic_shapes` of tuple type that specifies dynamic shape specifications for each tensor of the input, in the same order as the input dataclass type's flatten_fn (https://github.com/pytorch/pytorch/blob/main/torch/utils/_pytree.py#L103)
Test Plan: buck test //caffe2/test:test_export
Differential Revision: D52932856
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117917
Approved by: https://github.com/avikchaudhuri
Summary: The `FxNetAccFusionsFinder.recursive_add_node` function can run into an exponential complexity when applied to an fx graph with multiple densely connected layers of nodes. Here we add a `visited` set which reduces the worst case complexity to linear.
In the internal MRS models with the densely connected layer structure, this fix reduces the fx split time from forever to < 100ms, hence unblocking the internal enablement.
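Conceptually the fix looks like the following sketch (names are illustrative, not the actual `_SplitterBase` code); the `visited` set makes the traversal linear in the number of nodes instead of exponential on densely connected graphs:
```python
def recursive_add_node(node, group, visited):
    # Skip nodes we have already expanded; without this check a node reachable
    # through many paths gets re-expanded for every path.
    if node in visited:
        return
    visited.add(node)
    group.add(node)
    for inp in node.all_input_nodes:
        recursive_add_node(inp, group, visited)
```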
P.S. As much as I want to add a unit test, I can't find any existing tests for the `_SplitterBase` infra. Happy to add one if pointed to where. Thanks!
Test Plan: CI
Reviewers:
Subscribers:
Tasks:
Tags:
Differential Revision: [D52951321](https://our.internmc.facebook.com/intern/diff/D52951321)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117969
Approved by: https://github.com/oulgen, https://github.com/khabinov
**Summary**
This PR switches the softmax and log_softmax ops to use OpStrategy instead of rules. This PR also adds support when the softmax dimension is sharded -- a replication is performed before computation.
**Test**
`python test/distributed/_tensor/test_math_ops.py -k test_softmax_fwd`
`python test/distributed/_tensor/test_math_ops.py -k test_softmax_with_bwd`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117723
Approved by: https://github.com/XilunWu
Summary: `constraints` argument for `torch.export` has been deprecated in favor of the `dynamic_shapes` argument. This PR updates the use of the deprecated API in `scripts/sijiac/prototypes` and `test/inductor`.
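For reference, a small sketch of the replacement API (the module and shapes here are illustrative):
```python
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x):
        return x.sin()

# Mark dim 0 of the first positional input as dynamic; this replaces the
# removed `constraints=[dynamic_dim(...)]` style.
batch = Dim("batch")
ep = export(M(), (torch.randn(4, 8),), dynamic_shapes=({0: batch},))
```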
Test Plan: buck test mode/dev-nosan fbcode//caffe2/test/inductor:test_aot_inductor
Differential Revision: D52931743
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117915
Approved by: https://github.com/angelayi
Mark `dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_retracibility_dynamic_shapes` explicitly as slow
I cannot figure out the correct way to do this.
Tested locally
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117896
Approved by: https://github.com/huydhn
This PR:
- refactors the redistribute implementation logic to make it more
sound, by figuring out the transform information first and then applying
the transformations step by step; we also cache the decisions so that they
can be reused
- for uneven sharding, refactors the uneven sharding logic and uses a logical
shape concept for each transform step to fix the uneven sharding
multi-mesh redistribute bug
fixes https://github.com/pytorch/pytorch/issues/115310
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115525
Approved by: https://github.com/XilunWu
As discussed on [slack](https://pytorch.slack.com/archives/C3PDTEV8E/p1703699711772289) adding Andrew Gu's advanced FSDP design notes with a few additions from myself based on our discussion.
I hope I did the RST right, I haven't done RST in a while.
- The first section is Andrew's words verbatim + formatting
- The second section is Andrew's words verbatim + formatting + a few of my additions that were confirmed by Andrew, and which hopefully should help understand the process better.
tagging @albanD as requested.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117323
Approved by: https://github.com/awgu
This PR changes torch.export to require an nn.Module as input, rather than taking an arbitrary callable.
The rationale for this is that we have several invariants on the ExportedProgram that are ambiguous if the top-level object being traced is a function:
1. We "guarantee" that every call_function node has an `nn_module_stack` populated.
2. We offer ways to access the state_dict/parameters/buffers of the exported program.
We'd like torch.export to offer strong invariants—the value proposition of export is that you can trade flexibility for stronger guarantees about your model.
An alternative design would be to implicitly convert the top-level function into a module, rather than require that the user provide a module. I think that's reasonable (it's what we did in TorchScript), but in the spirit of being explicit (another design tenet of export) I avoid that here.
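For callers who were exporting a bare function, a small wrapper module restores the old behavior (a sketch with illustrative names):
```python
import torch

def fn(x):
    return x * 2

# With this change, a plain function must be wrapped in an nn.Module before
# calling torch.export.export.
class Wrapper(torch.nn.Module):
    def forward(self, x):
        return fn(x)

ep = torch.export.export(Wrapper(), (torch.randn(3),))
```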
Differential Revision: [D52789321](https://our.internmc.facebook.com/intern/diff/D52789321/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117528
Approved by: https://github.com/thiagocrepaldi, https://github.com/zhxchen17, https://github.com/avikchaudhuri, https://github.com/tugsbayasgalan
For large complex values the division produces inf or NaN values, which leads other functions to produce them too,
e.g. `torch._refs.sgn` used in a test.
Example:
```
$ python -c 'import torch; print(torch._refs.sgn(torch.complex(torch.tensor([-501]*16, dtype=torch.float32), torch.tensor([-1e20]*16, dtype=torch.float32))))'
tensor([-0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj])
$ python -c 'import torch; t = torch.complex(torch.tensor([-501]*16, dtype=torch.float32), torch.tensor([-1e20]*16, dtype=torch.float32)); print(t / t.abs())'
tensor([-0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj])
```
Implement the same algorithm as used in numpy and x86 (#93277).
The reason is that for a tensor with a component of `1e20`, the abs-squared value used in the division contains a term `1e20 * 1e20` which overflows the dynamic range of float32 (3e38) and yields an "inf", so the division yields "nan".
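A sketch of the overflow-safe idea in Python (the actual fix is in the vectorized C++ kernels; zero inputs are not handled here):
```python
import torch

def safe_sgn(z: torch.Tensor) -> torch.Tensor:
    # Rescale by the larger component magnitude first so the squared terms
    # never overflow float32; zero handling is omitted for brevity.
    a, b = z.real, z.imag
    scale = torch.maximum(a.abs(), b.abs())
    denom = scale * torch.sqrt((a / scale) ** 2 + (b / scale) ** 2)
    return torch.complex(a / denom, b / denom)
```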
Output after change:
```
$ python -c 'import torch; t = torch.complex(torch.tensor([-501]*16, dtype=torch.float32), torch.tensor([-1e20]*16, dtype=torch.float32)); print(torch._refs.sgn(t), t.sgn(), t / t.abs())'
tensor([-5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j,
-5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j,
-5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j,
-5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j]) tensor([-5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j,
-5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j,
-5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j,
-5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j]) tensor([-5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j,
-5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j,
-5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j,
-5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j])
```
CC @quickwritereader who wrote the initial code and @VitalyFedyunin who was involved in the initial review and @lezcano who reviewed #93277
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116972
Approved by: https://github.com/lezcano
Summary: Using faster binding following https://github.com/pytorch/pytorch/pull/117500. torch.utils.cpp_extension.load_inline builds a lot of things and is very slow. With this change, later we can further reduce the included header files using the ABI-compatible mode and thus further speed up the compilation.
Result:
```
python test/inductor/test_cuda_cpp_wrapper.py -k test_relu_cuda_cuda_wrapper
Before: Ran 1 test in 32.843s
After: Ran 1 test in 26.229s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117693
Approved by: https://github.com/jansel
As discussed on [slack](https://pytorch.slack.com/archives/C3PDTEV8E/p1703699711772289) adding Andrew Gu's advanced FSDP design notes with a few additions from myself based on our discussion.
I hope I did the RST right, I haven't done RST in a while.
- The first section is Andrew's words verbatim + formatting
- The second section is Andrew's words verbatim + formatting + a few of my additions that were confirmed by Andrew, and which hopefully should help understand the process better.
tagging @albanD as requested.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117323
Approved by: https://github.com/albanD, https://github.com/awgu
In case no keyword arguments are passed, `**kwargs` would expand just fine without the need for extra overhead of `or {}`. In addition to reducing boilerplate, this also comes with a small perf improvement:
```
In [1]: def null(*args, **kwargs):
...: pass
...:
In [2]: def call1(*args, **kwargs):
...: return null(*args, **(kwargs or {}))
...:
In [3]: def call2(*args, **kwargs):
...: return null(*args, **kwargs)
...:
In [4]: %timeit call1()
145 ns ± 2.07 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
In [5]: %timeit call2()
118 ns ± 2.14 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
In [6]: %timeit call1()
147 ns ± 6.19 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
In [7]: %timeit call2()
117 ns ± 0.846 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117880
Approved by: https://github.com/Skylion007
Summary:
`conv1d` has two arguments `weight` and `bias` which are stored as constant tensors on the CPU and they are transferred to GPU at every inference call. We create a context for this operator to avoid the repeated passing. Specifically, we
- created `Conv1dPackedContext`, `create_conv1d_context` and `run_conv1d_context` in `Convolution.h` and `Convolution.cpp`
- registered them in `Register.cpp`
- rewrote the graph representation of the op in `vulkan_rewrite.cpp`
Test Plan:
## Numerical test
```
[luwei@82308.od /data/sandcastle/boxes/fbsource (8a8d911dc)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*conv1d*"
Buck UI: https://www.internalfb.com/buck2/7760800b-fd75-479a-9368-be5fcd5a7fef
Network: Up: 0B Down: 0B
Jobs completed: 4. Time elapsed: 0.6s.
BUILD SUCCEEDED
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *conv1d*
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from VulkanAPITest
[ RUN ] VulkanAPITest.conv1d_simple
[ OK ] VulkanAPITest.conv1d_simple (159 ms)
[ RUN ] VulkanAPITest.conv1d
[ OK ] VulkanAPITest.conv1d (57 ms)
[----------] 2 tests from VulkanAPITest (217 ms total)
[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (217 ms total)
[ PASSED ] 2 tests.
```
Full test result in P1053644934, summary as below
```
[----------] 419 tests from VulkanAPITest (28080 ms total)
[----------] Global test environment tear-down
[==========] 419 tests from 1 test suite ran. (28080 ms total)
[ PASSED ] 418 tests.
[ SKIPPED ] 1 test, listed below:
[ SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
```
## Graph representation comparison
We created a model using `conv1d` and traced it as below
```
import torch
import torch.nn as nn

# Define a simple model that uses conv1d
class MyModel(torch.nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.conv1d = nn.Conv1d(16, 33, 3)

    def forward(self, x):
        return self.conv1d(x)

# Create an instance of the model
model = MyModel()
# Create a dummy input tensor for tracing
input_tensor = torch.randn(20, 16, 50)
# Use torch.jit.trace to trace the model and generate a graph
traced_model = torch.jit.trace(model, input_tensor)
```
Then we converted the traced model to Vulkan backend using `optimize_for_mobile`
```
from torch.utils import mobile_optimizer
vulkan_model = mobile_optimizer.optimize_for_mobile(
    traced_model, backend="vulkan", preserved_methods=to_preserve
)
```
Next we can print the graph of the `vulkan_model` as `print(vk_model.graph)`
- before this diff: `conv1d` was used
```
graph(%self.1 : __torch__.___torch_mangle_16.MyModel,
%x : Tensor):
%60 : Device = prim::Constant[value="cpu"]()
%self.conv1d.bias : Float(33, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=<Tensor>]()
%37 : bool = prim::Constant[value=0]()
%36 : NoneType = prim::Constant()
%59 : Device = prim::Constant[value="vulkan"]()
%self.conv1d.weight : Float(33, 16, 3, strides=[48, 3, 1], requires_grad=0, device=cpu) = prim::Constant[value=<Tensor>]()
%7 : int = prim::Constant[value=1](), scope: __module.conv1d # /mnt/xarfuse/uid-23453/243f3953-seed-nspid4026532834_cgpid7972545-ns-4026532831/torch/nn/modules/conv.py:306:0
%18 : int[] = prim::Constant[value=[1]]()
%19 : int[] = prim::Constant[value=[0]]()
%39 : Tensor = aten::to(%x, %59, %36, %37, %37)
%20 : Tensor = aten::conv1d(%39, %self.conv1d.weight, %self.conv1d.bias, %18, %19, %18, %7)
%58 : Tensor = aten::to(%20, %60, %36, %37, %37)
return (%58)
```
- after this diff: `conv1d` was replaced with `run_conv1d_context`
```
graph(%self.1 : __torch__.___torch_mangle_6.MyModel,
%x : Tensor):
%85 : Device = prim::Constant[value="cpu"]()
%51 : bool = prim::Constant[value=0]()
%50 : NoneType = prim::Constant()
%84 : Device = prim::Constant[value="vulkan"]()
%53 : Tensor = aten::to(%x, %84, %50, %51, %51)
%prepack_folding_forward._jit_pass_packed_weight_0 : __torch__.torch.classes.vulkan.Conv1dPackedContext = prim::GetAttr[name="prepack_folding_forward._jit_pass_packed_weight_0"](%self.1)
%22 : Tensor = vulkan_prepack::run_conv1d_context(%53, %prepack_folding_forward._jit_pass_packed_weight_0)
%83 : Tensor = aten::to(%22, %85, %50, %51, %51)
return (%83)
```
Differential Revision: D52865379
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117780
Approved by: https://github.com/yipjustin
torch.nn.Linear crashes with internal assert if invoked with 5D tensors,
due to the bug in MPS framework, i.e. invoking
```swift
import MetalPerformanceShadersGraph
let graph = MPSGraph()
let x = graph.constant(1, shape: [2, 1, 2, 1, 2], dataType: .float32)
let y = graph.constant(1, shape: [2, 3], dataType: .float32)
let z = graph.matrixMultiplication(primary: x, secondary: y, name: nil)
let device = MTLCreateSystemDefaultDevice()!
let buf = device.makeBuffer(length: 48)!
let td = MPSGraphTensorData(buf, shape: [2, 1, 2, 1, 3], dataType: .int32)
let cmdBuf = MPSCommandBuffer(from: device.makeCommandQueue()!)
graph.encode(to: cmdBuf, feeds: [:], targetOperations: nil, resultsDictionary: [z:td], executionDescriptor: nil)
cmdBuf.commit()
```
crashes with
```
AppleInternal/Library/BuildRoots/0032d1ee-80fd-11ee-8227-6aecfccc70fe/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSNDArray/Kernels/MPSNDArrayIdentity.mm:813: failed assertion `New volume: 4 should match old volume: 8 [reshapeWithCommandBuffer] MPSNDArrayIdentity.'
zsh: abort ./build/matmul
```
Work around the issue by flattening the forward and backward tensors if the number of dimensions is greater than 4.
Add regression tests to the Linear OpInfo samples.
Fixes https://github.com/pytorch/pytorch/issues/114942
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117837
Approved by: https://github.com/janeyx99
Using ONNX opset 14, the aten scaled_dot_product_attention operator can be implemented with bfloat16 support because Add-14 does support bfloat16.
This PR simply adds bfloat16 to the list of supported types.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117878
Approved by: https://github.com/BowenBao
Summary:
Following the implementation of Softmax, striding over the texture differently based on the desired dimension.
Softmax performs a similar operation as cumsum (generally called "scan") iterating over all items in a dimension, but cumsum only needs to iterate once to collate the sum, compared to softmax which needs to iterate multiple times to collect the max and denominator for the final calculation.
Similar to the softmax implementation, there are likely opportunities to optimize, but this gets all dims < 4 functional first.
Test Plan:
`LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*cumsum*"`:
```
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *cumsum*
[==========] Running 4 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 4 tests from VulkanAPITest
[ RUN ] VulkanAPITest.cumsum_1d
[ OK ] VulkanAPITest.cumsum_1d (93 ms)
[ RUN ] VulkanAPITest.cumsum_2d
[ OK ] VulkanAPITest.cumsum_2d (74 ms)
[ RUN ] VulkanAPITest.cumsum_3d
[ OK ] VulkanAPITest.cumsum_3d (105 ms)
[ RUN ] VulkanAPITest.cumsum_4d
[ OK ] VulkanAPITest.cumsum_4d (73 ms)
[----------] 4 tests from VulkanAPITest (346 ms total)
[----------] Global test environment tear-down
[==========] 4 tests from 1 test suite ran. (346 ms total)
[ PASSED ] 4 tests.
```
Differential Revision: D52814000
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117580
Approved by: https://github.com/yipjustin
Summary: The added test case ends up emitting an inductor IR as the buffer string; let's properly emit the buffer name instead.
Test Plan: added new test
Differential Revision: D52899373
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117838
Approved by: https://github.com/aakhundov
* custom pytest-shard so I can control the verbosity (also index by 1 since it's confusing)
* normal runs (not keep-going) always rerun each failed test 9 times (3 per process, 3 processes). Previously it would only run the entire test file 3 times, so if a test before you segfaulted, you only got 2 tries
Example of quieter log https://github.com/pytorch/pytorch/actions/runs/7481334046/job/20363147497
"items in shard" only gets printed once at the beginning, and the reruns just say how many got skipped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117069
Approved by: https://github.com/huydhn
Attempts to make the input/output mismatch error better by first checking if the inputs/outputs are able to be pytree flattened into supported types (tensors, symints, ...). So if a user passes in some data structure which does not have a pytree flatten registration, this will error with the message "It looks like one of the inputs is with type CustomType is not supported or pytree flatten-able.... please register a pytree flatten/unflatten function using the pytree.register_pytree_node API".
The check inside of produce_matching should now only error if something unexpected happens (dynamo accidentally adds an input or removes an output), and should be considered an internal error.
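For reference, a registration along the lines the message suggests, using a toy `Point` type (the exact keyword arguments of `register_pytree_node` may differ across releases):
```python
import torch.utils._pytree as pytree

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

# Flatten to a list of leaves plus a context, and rebuild from them.
pytree.register_pytree_node(
    Point,
    lambda p: ([p.x, p.y], None),
    lambda leaves, ctx: Point(*leaves),
)
```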
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117598
Approved by: https://github.com/avikchaudhuri, https://github.com/BowenBao
As usual, almost no work on PyTorch side, all changes are on the builder end, namely:
- 8b67d32929 - depend on `blas * mkl` only on x86 machines
- eb78393f1e - install arm64 conda when running on Apple Silicon
- 0d3aea4ee0 - constrain llvmdev-9 to x86 machines only
- 6c6a33b271 - set correct DEVELOPER_DIR path
TODO:
- We should auto-detect this `DEVELOPER_DIR` via `xcode-select`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117801
Approved by: https://github.com/atalman
This PR adds the bare minimum functionality to get torchbind working in an e2e testable way on PT2.
It implements:
* ProxyTensor support
* Simple torch.export support (proxytensor-only path, e.g. non-strict).
* add some tests exercising the path.
Because all this is not fully baked, I hide the functionality behind a feature flag (`enable_torchbind_tracing()`) so it does not affect regular users for now.
Still on the agenda:
* Dynamo support
* Actual FakeMode support
* Mutability support
Hoping to get this first bit in as a standalone, as it will unblock some more extensive experimentation/testing going on internally.
Differential Revision: [D51825372](https://our.internmc.facebook.com/intern/diff/D51825372/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117697
Approved by: https://github.com/SherlockNoMad
Because ANDROID>=21 is assumed in CI tests, it is time to remove old workarounds. math_compat.h contains solely wrapper math functions for ANDROID, so we can remove its usage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116167
Approved by: https://github.com/ezyang
Today watchdog's sleep interval is 1s. That's a bit long compared to modern GPU link's (or network link's) speed.
Take DDP and Ampere for example:
DDP's bucket size = 25 MB
Ampere's NVLink speed = 250 GB/s
25 MB / 250 GB/s = 100 ms.
So we are updating the interval to 100 ms.
Update:
25 MB / 250 GB/s = 0.1 ms
But let's see how it goes before making the checking more aggressive.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117297
Approved by: https://github.com/fduwjj
Fixes#117660
(1) skip dynamic tests for exported program in `test_fx_to_onnx_onnxruntime.py`, as they are not expected to pass anyway.
(2) Move dolly model to runtime, since it's working in exporting, but it is blocked by non-persistent buffers as well.
(3) openai whisper has changed/regressed due to modeling modifications.
(4) Replace OpenLlama with Llama, because OpenLlama is deprecated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117703
Approved by: https://github.com/thiagocrepaldi
* custom pytest-shard so I can control the verbosity (also index by 1 since it's confusing)
* normal runs (not keep-going) always rerun each failed test 9 times (3 per process, 3 processes). Previously it would only run the entire test file 3 times, so if a test before you segfaulted, you only got 2 tries
Example of quieter log https://github.com/pytorch/pytorch/actions/runs/7481334046/job/20363147497
"items in shard" only gets printed once at the beginning, and the reruns just say how many got skipped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117069
Approved by: https://github.com/huydhn
For a persistent reduction, we generate 2 flavors of 'equivalent' kernels at the same time
- persistent reduction
- regular reduction
A MultiKernel wraps these 2 kernels and picks the one with better performance at runtime.
Here I talk more about implementation details:
- Inductor maintains state for generating kernels, e.g. the wrapper code. After we generate code for one kernel, we need to restore the inductor state before we can generate the counterpart.
***There is one thing I need some comments from others***:
There is one tricky thing about kernel arguments. In general, inductor removes a buffer from the argument list if it's only used inside the kernel. But somehow a buffer removed by the persistent reduction kernel may still be kept by the regular (non-persistent) reduction kernel because of some CSE invalidation rule. My current implementation avoids removing buffers if multi_kernel is enabled. This makes sure both flavors of reduction have a consistent argument list. Another idea I have is to generate the multi-kernel definition with the union of arguments from both sub-kernels, and let each sub-kernel pick the subset of arguments it wants. But this would make the code-gen for multi-kernel much more complex.
I'm not sure if there is some easy and clean way to resolve this.
Testing command:
```
TORCHINDUCTOR_MULTI_KERNEL=1 TORCH_LOGS=+torch._inductor.graph TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 python benchmarks/dynamo/huggingface.py --backend inductor --amp --performance --only BertForMaskedLM --training
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103469
Approved by: https://github.com/jansel
Fixes#117685.
This PR only makes ConstantSource preserved for built-in ops when we find all the inputs are either constant tensors or python constants.
It doesn't fundamentally solve the problem of preserving ConstantSource information through all operators that can potentially be constant folded.
For the following code in the issue:
```
class Bob(torch.nn.Module):
    def __init__(self, p, val) -> None:
        super().__init__()
        self.p = p
        self.y = torch.nn.Parameter(torch.tensor(val))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # This only looks dynamic but it's actually a constant value
        if get_y(self.y) < self.p:
            return torch.cat([x,x])
        else:
            return x
```
The graph exported looks like following:
```python
class GraphModule(torch.nn.Module):
    def forward(self, x):
        arg0: "f32[s0, s1]";
        arg0, = fx_pytree.tree_flatten_spec(([x], {}), self._in_spec)
        l_x_ = arg0
        # File: /home/yidi/local/pytorch/test/dynamo/test_export.py:1498 in forward, code: return torch.cat([x, x])
        cat = torch.cat([l_x_, l_x_]); l_x_ = None
        return pytree.tree_unflatten([cat], self._out_spec)
```
Test Plan:
Added a new test for the given repro.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117704
Approved by: https://github.com/jansel, https://github.com/anijain2305
enable this test in meta-internal CI, since it's mildly infuriating to not be able to locally test this when working inside meta
One change:
This test uses `pkgutil.walk_packages`, which ignores namespace packages. A quirk in Meta's internal python packaging system is that it adds `__init__.py` to each source directory. So this test picks up more files to check internally than in the GitHub CI.
So I changed this test from using raw `pkgutil` to a version that also looks into namespace packages, so we're checking the same thing across both CIs.
Differential Revision: [D52857631](https://our.internmc.facebook.com/intern/diff/D52857631/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117712
Approved by: https://github.com/ezyang
Summary:
By default, in LLD 16, .lrodata is placed immediately after .rodata.
However, .lrodata can be very large in our compiled models, which leads to
relocation out-of-range errors for relative relocations. So we place it
after the other sections that are referenced from .text using relative
relocations. This is the default behavior in GNU ld.
Reviewed By: muchulee8, desertfire, khabinov, chenyang78
Differential Revision: D52557846
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117575
Approved by: https://github.com/chenyang78, https://github.com/khabinov
This PR refactors the distributed related variables to use
DistributedVariable for common methods, so that things like
`python_type` works for all distributed variables.
Maybe we can add `as_python_constant` to the DistributedVariable too? I
didn't add in this PR but if that make sense I can update.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117590
Approved by: https://github.com/voznesenskym
Fixes#115922
This PR is prepared to separate existing https://github.com/pytorch/pytorch/pull/116926 and to apply suggestions in the review.
`scalar_t` which is defined as `c10::impl::ScalarTypeToCPPType<ScalarType::Half>::t` appears to be causing the issue with `Visual Studio 2022 17.8.4` (coming with `MSVC 14.38.33130`)
Error message:
```
aten\src\ATen/cpu/vec/vec_base.h(150): fatal error C1001: Internal compiler error.
(compiler file 'D:\a_work\1\s\src\vctools\Compiler\CxxFE\sl\p1\c\toinil.c', line 910)
```
---
Related line was added for a similar issue before as a workaround (`scalar_t` definition) [Fix compile error for vs2022](https://github.com/pytorch/pytorch/pull/85958)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117497
Approved by: https://github.com/ezyang, https://github.com/malfet
# Context
In some cases, we might want to build the `context_fn` with runtime-defined policies. One way of implementing this is to make `context_fn` be a partial, which holds the information that we want to pass. One concrete example is the [automatic policy selection from `xformers`](ad986981b1/xformers/checkpoint.py (L185)).
# The problem
The previous implementation wouldn't work with partials because `FunctoolsPartialVariable` doesn't have a `fn` attribute.
This PR addresses this case, but ideally we could get this solved in a more general fashion, as callable classes and `NestedUserFunctionVariable` are not supported by this PR.
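A minimal sketch of the pattern this PR makes traceable; the `make_contexts` helper and its `policy` argument are made up for illustration (the real motivating case is the xformers policy selection linked above):
```python
import contextlib
import functools
import torch
from torch.utils.checkpoint import checkpoint

def make_contexts(policy):
    # A real context_fn would build policy-dependent context managers here.
    return contextlib.nullcontext(), contextlib.nullcontext()

# The partial binds a runtime-chosen policy; dynamo now handles this.
context_fn = functools.partial(make_contexts, policy="runtime-chosen")

def fn(x):
    return x.sin().cos()

x = torch.randn(4, requires_grad=True)
out = checkpoint(fn, x, use_reentrant=False, context_fn=context_fn)
```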
# Tests
I've added a basic test that mimics the tests around it. The tests could probably be simplified, but I've decided to keep changes to a minimum.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117657
Approved by: https://github.com/yf225
Following the RFC https://github.com/pytorch/pytorch/issues/114856, before upstreaming the Intel XPU Inductor backend, we need to prepare the corresponding Inductor test cases. This PR aims to generalize part of the Inductor test cases so that a new GPU backend can reuse the existing test cases with minimal code change.
This Pull Request preferentially generalizes the test cases that cover Inductor's base functionality as follows:
- test/inductor/test_codecache.py
- test/inductor/test_codegen_triton.py
- test/inductor/test_kernel_benchmark.py
- test/inductor/test_torchinductor.py
- test/inductor/test_torchinductor_codegen_dynamic_shapes.py
- test/inductor/test_torchinductor_dynamic_shapes.py
- test/inductor/test_torchinductor_opinfo.py
- test/inductor/test_triton_heuristics.py
- test/inductor/test_triton_wrapper.py
Feature request: https://github.com/pytorch/pytorch/issues/114856
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117513
Approved by: https://github.com/EikanWang, https://github.com/jansel
Summary: Allow the trainer to explicitly shut down the compile-worker pools to save CPU resources, thereby avoiding QPS degradation.
Test Plan: See the test plan in D52839313
Differential Revision: D52839313
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117664
Approved by: https://github.com/yanboliang
pydocstyle check
averagers.py
Pre
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:1 at module level:
D100: Missing docstring in public module
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:20 in public method `__init__`:
D107: Missing docstring in __init__
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:27 in public method `average_parameters`:
D102: Missing docstring in public method
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:84 in public method `__init__`:
D107: Missing docstring in __init__
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:106 in public method `average_parameters`:
D205: 1 blank line required between summary line and description (found 0)
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:106 in public method `average_parameters`:
D400: First line should end with a period (not '`')
6
Post
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:1 at module level:
D100: Missing docstring in public module
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:20 in public method `__init__`:
D107: Missing docstring in __init__
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:27 in public method `average_parameters`:
D102: Missing docstring in public method
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:84 in public method `__init__`:
D107: Missing docstring in __init__
4
utils.py
Pre
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/utils.py:1 at module level:
D100: Missing docstring in public module
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/utils.py:17 in public function `average_parameters`:
D205: 1 blank line required between summary line and description (found 0)
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/utils.py:45 in public function `get_params_to_average`:
D205: 1 blank line required between summary line and description (found 0)
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/utils.py:45 in public function `get_params_to_average`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/utils.py:68 in public function `average_parameters_or_parameter_groups`:
D200: One-line docstring should fit on one line with quotes (found 3)
5
Post
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/utils.py:1 at module level:
D100: Missing docstring in public module
1
hierarchical_model_averager.py
Pre
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:1 at module level:
D100: Missing docstring in public module
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:16 in public class `HierarchicalModelAverager`:
D205: 1 blank line required between summary line and description (found 0)
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:98 in public method `__init__`:
D107: Missing docstring in __init__
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:137 in private method `_find_process_group`:
D205: 1 blank line required between summary line and description (found 0)
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:137 in private method `_find_process_group`:
D400: First line should end with a period (not ',')
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:137 in private method `_find_process_group`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:151 in public method `average_parameters`:
D205: 1 blank line required between summary line and description (found 0)
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:151 in public method `average_parameters`:
D400: First line should end with a period (not '`')
8
Post
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:1 at module level:
D100: Missing docstring in public module
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:99 in public method `__init__`:
D107: Missing docstring in __init__
2
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117038
Approved by: https://github.com/H-Huang
Summary: Recently, we found that merge splits (D45204109) is not working for the AFOC model, so we patch a fix.
Test Plan:
The error log: P1046934021
# Flows used to local reproduce
### non-first:
f522317780
after the fix: P1047603217
### first:
f522253163
after the fix: P1047764917
Differential Revision: D52856359
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117707
Approved by: https://github.com/jackiexu1992
Fix for https://github.com/pytorch/pytorch/issues/113895
There are three phases to cudagraph trees: warmup, recording, and execution. On recording and execution we are executing under the current_stream. In warmup we execute under a side stream that we also use for cudagraph recording, so as to reuse memory.
After we execute on the side stream we need to sync the current stream to the side stream. Previously there was a `torch.cuda.synchronize` but not a `torch.cuda.current_stream().wait_stream(stream)`. This PR removes the global sync and adds a wait_stream. I have confirmed that it fixes https://github.com/pytorch/pytorch/issues/113895.
It's not entirely clear to me why torch.cuda.synchronize would be insufficient - I would have thought the global sync would encompass the stream-to-stream sync. However, we do have a number of [instances](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/compile_fx.py#L748-L749) throughout the code base where we do a stream->stream sync after the global sync, so clearly I am missing something here. In any case the stream->stream sync is better perf than a global synchronize.
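An illustrative sketch of the warmup synchronization pattern (`run_model` and `inputs` are placeholders):
```python
import torch

# Run warmup on a side stream, then make the current stream wait on it
# instead of calling a global torch.cuda.synchronize().
side_stream = torch.cuda.Stream()
side_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side_stream):
    outputs = run_model(inputs)
torch.cuda.current_stream().wait_stream(side_stream)
```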
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117578
Approved by: https://github.com/zdevito
Attempts to make the input/output mismatch error better by first checking if the inputs/outputs are able to be pytree flattened into supported types (tensors, symints, ...). So if a user passes in some data structure which does not have a pytree flatten registration, this will error with the message "It looks like one of the inputs is with type CustomType is not supported or pytree flatten-able.... please register a pytree flatten/unflatten function using the pytree.register_pytree_node API".
The check inside of produce_matching should now only error if something unexpected happens (dynamo accidentally adds an input or removes an output), and should be considered an internal error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117598
Approved by: https://github.com/avikchaudhuri, https://github.com/BowenBao
Whenever an IR node has reference to an unbacked SymInt, we must
register it as a use of the unbacked SymInt.
This fix isn't complete but the rest of the fix is fairly difficult, so
putting this in to start.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117650
Approved by: https://github.com/lezcano
Work.result() returns a vector of tensors. This signature is problematic as some collectives may just return one tensor (e.g all-reduce), while some others may return multiple tensors (e.g. all-gather).
It would be clearer/easier for users to directly access the result via the tensor/tensorlist passed to the collective APIs.
Deprecating work.result() would also allow us to remove the `outputs_` field in the Work class, avoiding an "artificial" reference to the tensor, which could potentially hold up the tensor's memory.
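An illustrative sketch of the recommended usage (assumes a NCCL process group is already initialized):
```python
import torch
import torch.distributed as dist

# Read the collective result from the tensor passed in, rather than from
# work.result().
t = torch.ones(8, device="cuda")
work = dist.all_reduce(t, async_op=True)
work.wait()
print(t)  # `t` now holds the all-reduced values
```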
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117565
Approved by: https://github.com/wconstab
It should have never been landed, but was landed again, thanks to
ghstack grafting/ungrafting see discussion on https://github.com/pytorch/pytorch/pull/116910
This reverts commit e457b6fb18782425661e8a09d0222d0b29518ad1.
Fixes https://github.com/pytorch/pytorch/issues/114301
Previously, coalesced work (created by `end_coalescing`) was not watched by the watchdog, which results in a silent timeout.
The culprit is that we reset `coalescing_state_` to 0 before checking it to see if we should enqueue a work.
Example:
```
import torch
import torch.distributed as dist
from datetime import timedelta
dist.init_process_group(backend="nccl", timeout=timedelta(seconds=10))
rank = dist.get_rank()
world_size = dist.get_world_size()
device = torch.device(f"cuda:{rank}")
# Create tensors of different sizes to create hang
s = 100 * 1024 * 1024 * (world_size - rank)
with dist._coalescing_manager(device=device):
dist.all_reduce(torch.ones(s, device=device))
dist.broadcast(torch.ones(s, device=device), src=0)
torch.cuda.synchronize()
print(f"{dist.get_rank()} done")
```
Watchdog fires:
```
$ torchrun --nproc-per-node 2 example.py
...
[rank1]:[E ProcessGroupNCCL.cpp:545] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=COALESCED, NumelIn=18446744073709551615, NumelOut=18446744073709551615, Timeout(ms)=10000) ran for 10000 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:545] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=COALESCED, NumelIn=18446744073709551615, NumelOut=18446744073709551615, Timeout(ms)=10000) ran for 10567 milliseconds before timing out.
...
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117682
Approved by: https://github.com/wconstab, https://github.com/fduwjj
Summary: `constraints` argument for `torch.export` has been deprecated in favor of the `dynamic_shapes` argument. This PR updates the use of the deprecated API in `deeplearning/aot_inductor/test/test_custom_ops.py`.
Test Plan: buck test mode/dev-nosan fbcode//deeplearning/aot_inductor/test:test_custom_ops -- test_export_extern_fallback_nodes_dynamic_shape
Differential Revision: D52790332
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117573
Approved by: https://github.com/angelayi
Summary:
AOTInductor currently infers the cuda device index via `cudaGetDevice()`. This assumes the outer runtime calls `cudaSetDevice()` somewhere before invoking the AOTInductor run.
This diff adds an explicit argument for specifying target Device. e.g. compiled on "cuda:0", run on "cuda:1".
todo:
- Are the changes in interface.h BC-breaking, as they change the function signatures in the .so file? We might just need to introduce a new "Create" function.
Test Plan: CI
Differential Revision: D52747132
Privacy Context Container: 368960445142440
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117413
Approved by: https://github.com/chenyang78, https://github.com/desertfire, https://github.com/khabinov
We have one for Dynamo that currently applies to all "compile"
configurations (PYTORCH_TEST_WITH_DYNAMO, PYTORCH_TEST_WITH_INDUCTOR). I
don't want to figure out the inductor situation right now, so we're
going to add another denylist for inductor and work through it later.
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117553
Approved by: https://github.com/voznesenskym
ghstack dependencies: #117409, #116667, #117591, #117500, #116910
Summary:
We saw the following failure when compiling custom triton kernels:
```
RuntimeError: Argument 'getitem_22' of Node 'triton_kernel_wrapper_functional_proxy_3' was used before it has been defined! Please check that Nodes in the graph are topologically ordered
```
The root cause is that, when doing the replacement, the replacement node is itself replaced by another replacement. The fix keeps following the replacement chain until the node is no longer replaced.
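Conceptually, the fix chases the chain like this (illustrative sketch, not the actual inductor code):
```python
def resolve_replacement(node, replacements):
    # Keep following replacements until we land on a node that is not
    # itself scheduled to be replaced.
    while node in replacements:
        node = replacements[node]
    return node
```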
Test Plan:
Added a test case
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117612
Approved by: https://github.com/aakhundov
Fixed#116848
Related to the bug introduced in my previous PR here: https://github.com/pytorch/pytorch/pull/113749/files#diff-a1b077971cddfabfa0071c5162265066e867bc07721816d95b9cbe58431c38e3R3264
Originally, the code was
```python
def upsample_nearestnd(
    x,
    output_size,
    scales_x: Tuple[Optional[float], ...],
    n: int = 2,
    exact: bool = False,
):
    # ...
    scales = [i / o for i, o in zip(i_sizes, o_sizes)]
    for i, scale in enumerate(scales):
        if scale:
            scales[i] = scale
```
which is wrong, as `scales_x` is not used but can be provided by the user. The code was working for cases where the user-provided scale value can be recomputed using the input / output sizes, e.g. scale=2.0. However, this would fail if the input scale is a float value, e.g. 2.3; in this case the recomputed scale is a bit different (e.g. 2.292682926829268, depending on input and output size) and can lead to an inconsistent output.
This problem was "fixed" to the following in my previous PR: https://github.com/pytorch/pytorch/pull/113749
```python
def upsample_nearestnd(
    x,
    output_size,
    scales_x: Tuple[Optional[float], ...],
    n: int = 2,
    exact: bool = False,
):
    # ...
    scales = [i / o for i, o in zip(i_sizes, o_sizes)]
    for i, scale in enumerate(scales_x):
        if scale:
            scales[i] = scale
```
However, this leads to a wrong scale value, as it should be inverted (1 / scale).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117538
Approved by: https://github.com/peterbell10
Attempts to make the input/output mismatch error better by first checking if the inputs/outputs are able to be pytree flattened into supported types (tensors, symints, ...). So if a user passes in some data structure which does not have a pytree flatten registration, this will error with the message "It looks like one of the inputs is with type CustomType is not supported or pytree flatten-able.... please register a pytree flatten/unflatten function using the pytree.register_pytree_node API".
The check inside of produce_matching should now only error if something unexpected happens (dynamo accidentally adds an input or removes an output), and should be considered an internal error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117598
Approved by: https://github.com/avikchaudhuri
Summary: To be used in https://github.com/pytorch/pytorch/pull/113873. Since set_ is effectively an inplace view op, we'll need to skip caching them.
Test Plan: Built pytorch; specifically this step: `/home/slarsen/local/miniconda3/envs/pytorch-3.10/bin/python -m torchgen.gen --source-path /home/slarsen/local/pytorch/cmake/../aten/src/ATen --install_dir /home/slarsen/local/pytorch/build/aten/src/ATen --per-operator-headers --generate sources --output-dependencies /home/slarsen/local/pytorch/build/aten/src/ATen/generated_sources.cmake`
Differential Revision: [D52814561](https://our.internmc.facebook.com/intern/diff/D52814561)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115769
Approved by: https://github.com/bdhirsh
Summary: Need to pass this along
Test Plan:
```
cd ~/fbsource/fbcode/executorch/backends/xnnpack/test
buck test fbcode//mode/dev-nosan :test_xnnpack_ops -- test_fp32_sdpa
buck run fbcode//mode/dev-nosan :test_xnnpack_models -- executorch.backends.xnnpack.test.models.llama2_et_example.TestLlama2ETExample.test_fp32
```
Reviewed By: larryliu0820
Differential Revision: D52812369
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117579
Approved by: https://github.com/larryliu0820
To avoid a potential hang in the watchdog thread, which would prevent us from dumping timeout debugging info, we move the check of global collective timeout signals and the dumping of debugging info to the monitoring thread. We also need to ensure that we don't wait very long to check for the timeout signal from the store; otherwise, we will miss the signal and won't get debugging info dumped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117168
Approved by: https://github.com/wconstab
Reland #117425
Prior to this PR, xfail tests didn't provide (1) a guarantee of the error message/reason (which could be outdated), and (2) execution of the test (xfail_if_model_type_is_not_exportedprogram). Therefore, the tests are less robust with xfail labeled, as we can't be sure if it's still failing for the same reason, and if it's even still failing. This PR fixes the issue with try/except with error message matching to consolidate the xfail truth and reason.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117592
Approved by: https://github.com/BowenBao
This PR introduces a key path API to pytrees, drawing direct inspiration from JAX's [key path API](https://jax.readthedocs.io/en/latest/jax-101/05.1-pytrees.html#key-paths).
I added the 3 APIs described there, and a registry of `flatten_with_keys` fns for each node type, which is a version of `flatten` that also returns `KeyEntry`s describing how to access values from the original pytree.
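A small sketch of the intended usage, mirroring the JAX-style names described above (the exact names on the final torch.utils._pytree surface may differ):
```python
import torch.utils._pytree as pytree

tree = {"weights": [1, 2], "bias": 3}
leaves_with_paths, spec = pytree.tree_flatten_with_path(tree)
for keypath, leaf in leaves_with_paths:
    # keystr renders a key path as a human-readable accessor string.
    print(pytree.keystr(keypath), leaf)
```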
Current use cases for this API:
- Folks would like to do argument traversal over input pytrees to do verification and compatibility enforcement. Keypaths are useful for this—https://fburl.com/code/06p7zrvr is a handrolled pass doing basically the same thing but probably more fragilely.
- In export non-strict mode, we need to figure out a way to track sources for pytree inputs. In strict mode, dynamo handles this for us, but we'd like a decoupled component to handle this when we're not using dynamo.
I'm sure there are places it would be useful.
Some design notes:
- I only implemented the API for the Python pytree impl. optree has some differences in how their keypath APIs are designed (see https://github.com/pytorch/pytorch/issues/113378 for discussion). I have some issues with the proposed typed_path solution in that discussion and prefer JAX's API, but we can hash that out separately.
- The way folks register a `flatten_with_keys` fn is through a new kwarg to `register_pytree_node`. This follows how we do serialization fns, although the list of additional arguments is getting unwieldy.
- The way my impl handles pytrees with an undefined `flatten_with_keys` fn is different from JAX: I will raise an error, while JAX creates a fallback keyentry.
Differential Revision: [D52547850](https://our.internmc.facebook.com/intern/diff/D52547850/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116786
Approved by: https://github.com/voznesenskym
As title. This PR enables dynamic shapes for running llama with ORT. Both forward and backward are captured as a single graph with this PR.
Summary of changes:
- Test llama attention, llama decoder, llama model to ensure (1) no graph breaks (2) models exported with dynamic shapes with onnxrt dynamo backend
- Reshape SymInt to tensor with shape (1,) to align with the cast done for int in fx_onnx_interpreter.py
- Create an util function to map Python types (e.g., float) to ONNX tensor element type (e.g., onnx.TensorProto.FLOAT).
- Return `hint` for torch.Sym* in type promotion pass.
- Remove _replace_to_copy_with_to since exporter supports aten::_to_copy it now.
- Modify _get_onnx_devices to return CPU device for torch.Sym*.
- Introduce _adjust_scalar_from_fx_to_onnx (e.g., change 0 to tensor(0)) and _adjust_scalar_from_onnx_to_fx (e.g., change tensor(0) to 0) for adjusting scalars when passing values to and receive values from ORT.
- Now, ValueInfoProto of graph inputs (i.e., input_value_infos) are stored and used as `ORT-expected type` when calling `_adjust_scalar_from_fx_to_onnx`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117009
Approved by: https://github.com/titaiwangms
- We silently run skipped tests and then raise a skip message with the
error message (if any)
- Instead of raising expectedFailure, we raise a skip message with the
error message (if any)
We log the skip messages in CI, so this will let us read the logs and do
some basic triaging of the failure messages.
Test Plan:
- existing tests. I hope that there are no tests that cause each other
to fail.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117401
Approved by: https://github.com/voznesenskym
ghstack dependencies: #117391, #117400
### Summary
In #85398, while fixing a bug (which was _not caused by, but was exposed by_ AVX512 implementation) in `_vec_logsoftmax_lastdim`, I had made some revisions to use more threads in some cases, but was asked to roll back [those changes](https://github.com/pytorch/pytorch/pull/85398#discussion_r1087680237) during the PR's review.
At the time, landing that PR asap seemed essential, so I agreed to roll back that change.
In some cases, more threads can be used than are being used with the current approach.
<strike>In this PR, I'm reintroducing those changes, which are geared towards more efficient multi-threading.</strike>
On second thought, even for softmax kernels other than `_vec_log_softmax_lastdim` and `_vec_softmax_lastdim`, we could simply use a `grain_size` of 0 or 1 instead of complicating the code: `CHUNK_SIZE` for each thread is already computed by a heuristic, and with a `grain_size` of 0 the work would be distributed equitably among the OpenMP threads (whose number, BTW, stays constant unless explicitly changed, since we don't use the OpenMP `num_threads` clause in PyTorch), yielding a similar speedup to the approach in the first commit of this PR.
I've also added op-level benchmarks pertaining to example input shapes in this PR.
### Benchmarks
Machine - Intel(R) Xeon(R) Platinum 8468H (Xeon 4th gen, formerly codenamed Sapphire Rapids)
One socket of 48 physical cores was used, with & without HyperThreading.
Intel OpenMP & tcmalloc were preloaded.
Softmax benchmarks can be run with the following command, but the relevant benchmarks are the last dim ones -
`KMP_AFFINITY=granularity=fine,compact,1,0 KMP_BLOCKTIME=1 KMP_SETTINGS=1 OMP_NUM_THREADS=48 MKL_NUM_THREADS=48 numactl --membind=0 --cpunodebind=0 python -m pt.softmax_test --tag-filter all`
#### Already existing benchmarks
|Benchmark name (dim is 1, by default) | Previous implementation's latency (in ms) | This implementation's latency (in ms)|Speedup Percentage = (old-new)*100/old | Speedup ratio (old/new)|
|-------------|--------|-------|----------------------------|----------|
|Softmax_N1_C3_H256_W256_cpu|31.364|11.594|63.03% |2.705|
|Softmax_N4_C3_H256_W256_cpu|34.475|24.966| 27.58%|1.380|
|Softmax_N8_C3_H512_W256_cpu|94.044|78.372|16.66%|1.199|
|Softmax2d_N8_C3_H512_W256_cpu|100.195|79.529|20.62%|1.259|
#### Some of the following benchmarks are being added in this PR
|Benchmark name| Previous implementation's latency (in ms) | This implementation's latency (in ms)|Speedup percentage = (old-new)*100/old| Speedup ratio (old/new) |
|-------------|--------|-------|----------------------------|--------------------|
|LogSoftmax_M128_N128_dim1_cpu|7.629|6.475|15.12%| 1.178|
|LogSoftmax_M48_N128_dim1_cpu|6.848|5.969|12.83%| 1.147|
|LogSoftmax_M16_N1024_dim1_cpu|7.004|6.322|9.73%| 1.107|
|LogSoftmax_M32_N1024_dim1_cpu|7.037|6.558|6.80%| 1.073|
|LogSoftmax_M48_N1024_dim1_cpu|7.155|6.773|5.33%|1.056|
|LogSoftmax_M16_N512_dim1_cpu|6.797|5.862|13.75%|1.159|
|LogSoftmax_M32_N512_dim1_cpu|7.223|6.202|14.13%|1.164|
|LogSoftmax_M48_N512_dim1_cpu|7.159|6.301|11.98%|1.136|
|LogSoftmax_M16_N256_dim1_cpu|6.842|5.682|16.95%|1.204|
|LogSoftmax_M32_N256_dim1_cpu|6.840|6.086|11.02%|1.123|
|LogSoftmax_M48_N256_dim1_cpu|7.005|6.031|13.94%|1.161|
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116367
Approved by: https://github.com/jgong5, https://github.com/ezyang
Currently `matrixMultiplicationWithPrimaryTensor:secondaryTensor:` returns incorrect results if one of the matrix dimensions is greater than 32K.
Solve it by providing a very naive matrix multiplication metal shader and calling it if the stride size is greater than 32768 elements, as slicing inside the MPSGraph doesn't work either, since `-sliceTensor:starts:ends:strides:` somehow affects matmul as well, if tiling is done as follows:
```objc
NSMutableArray<MPSGraphTensor*>* rows = [NSMutableArray new];
for (int64_t i = 0; i < M; i += tile_size) {
const auto i_end = std::min(i + tile_size, M);
NSMutableArray<MPSGraphTensor*>* row_chunks = [NSMutableArray new];
for (int64_t j = 0; j < K; j += tile_size) {
const auto j_end = std::min(j + tile_size, K);
MPSGraphTensor* tile = nil;
for (int64_t k = 0; k < N; k += tile_size) {
const auto k_end = std::min(k + tile_size, N);
auto selfChunk = [graph sliceTensor:selfTensor
starts:@[ @(i), @(k) ]
ends:@[ @(i_end), @(k_end) ]
strides:@[ @(1), @(1) ]
name:nil];
auto otherChunk = [graph sliceTensor:otherTensor
starts:@[ @(k), @(j) ]
ends:@[ @(k_end), @(j_end) ]
strides:@[ @(1), @(1) ]
name:nil];
auto chunkMM = [graph matrixMultiplicationWithPrimaryTensor:selfChunk secondaryTensor:otherChunk name:nil];
tile = tile ? [graph additionWithPrimaryTensor:tile secondaryTensor:chunkMM name:nil] : chunkMM;
}
[row_chunks addObject:tile];
}
auto row = row_chunks.count > 1 ? [graph concatTensors:row_chunks dimension:1 name:nil] : row_chunks.firstObject;
[rows addObject:row];
}
return rows.count > 1 ? [graph concatTensors:rows dimension:0 name:nil] : rows.firstObject;
```
One can always use metal MM by defining `PYTORCH_MPS_PREFER_METAL` environment variable
Fixes https://github.com/pytorch/pytorch/issues/116769
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117549
Approved by: https://github.com/kulinseth
Currently, DCP requires the `model.state_dict()` to be materialized before passing it to DCP to load, since DCP uses the pre-allocated storage from the initialized model state_dict. Therefore, even for fine-tuning and distributed inference, users would need to explicitly materialize the model on GPU before `DCP.load_state_dict()`.
Today's flow:
```
with torch.device("meta"):
model2 = parallelize_module(
MLPModule("meta"), tp_mesh, parallelize_plan=parallelize_plan
)
model2.to_empty(device='cuda')
state_dict_to_load = model2.state_dict()
DCP.load_state_dict(
state_dict=state_dict_to_load,
storage_reader=DCP.FileSystemReader(CHECKPOINT_DIR),
)
model2.load_state_dict(state_dict_to_load)
```
This PR adds support for meta tensor loading. In DCP's planner, when encountering tensors/DTensors on the meta device, we initialize the tensor/DTensor on the current device on the fly and replace the meta-device tensor/DTensor in the state_dict. After the change, users no longer need to manually call `model.to_empty()` when loading existing checkpoints for fine-tuning and distributed inference.
Updated user flow:
```
with torch.device("meta"):
model2 = parallelize_module(
MLPModule("meta"), tp_mesh, parallelize_plan=parallelize_plan
)
# no longer need to call model.to_empty(device='cuda')
state_dict_to_load = model2.state_dict()
DCP.load_state_dict(
state_dict=state_dict_to_load,
storage_reader=DCP.FileSystemReader(CHECKPOINT_DIR),
)
model2.load_state_dict(state_dict_to_load, assign=True)
```
Note that for distributed training, it's still the users' responsibility to reset the parameters (`model.reset_parameters()`) as checkpoint might not exist.
Note that we need to loop thru the state_dict to replace meta tensor/DTensor instead of calling `model.to_empty()` since `DCP.load()` only takes in state_dict but not model.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113319
Approved by: https://github.com/fegin, https://github.com/LucasLLC
Introduces a new op `slice_inverse()`. This is used in the reverse view_func for slice and several other ops (e.g. `split_with_sizes`, `chunk`). It's implemented behind the scenes by a call to `as_strided()`, but it's easier for subclasses to implement the more limited `slice_inverse()` than the full `as_strided()`. This PR:
* Introduces the op itself
* Updates all relevant functional inverses to call `slice_inverse()` instead of `as_strided()` directly
* Makes codegen changes to allow `slice_scatter()` to be the copy variant for `slice_inverse()`
* Need to avoid view_copy codegen (assumes if view name ends in inverse, we don't need to gen one, which is possibly a bad assumption)
@albanD / @soulitzer / @bdhirsh: I'm most interested in your thoughts on the codegen changes and whether this is the right way to go.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117041
Approved by: https://github.com/bdhirsh
This should fix remaining errors with Resize op in torchvision: https://github.com/pytorch/vision/actions/runs/7298953575?pr=8127
```
/opt/conda/envs/ci/lib/python3.8/site-packages/torch/nn/functional.py:4072: in interpolate
return torch._C._nn._upsample_bicubic2d_aa(input, output_size, align_corners, scale_factors)
E torch._dynamo.exc.TorchRuntimeError: Failed running call_function <function interpolate at 0x7f4443fe00d0>(*(FakeTensor(..., size=(1, s0, s1, s2)),), **{'size': [s4, floor(s3*s4/floor(s1*s3/s2))], 'mode': 'bicubic', 'align_corners': False, 'antialias': True}):
E aten/src/ATen/RegisterCompositeImplicitAutograd.cpp:5567: SymIntArrayRef expected to contain only concrete integers
E
E from user code:
E File "/pytorch/vision/torchvision/transforms/v2/functional/_geometry.py", line 260, in resize_image
E image = interpolate(
E
E Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
E
E
E You can suppress this exception and fall back to eager by setting:
E import torch._dynamo
E torch._dynamo.config.suppress_errors = True
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117347
Approved by: https://github.com/peterbell10
We have one for Dynamo that currently applies to all "compile"
configurations (PYTORCH_TEST_WITH_DYNAMO, PYTORCH_TEST_WITH_INDUCTOR). I
don't want to figure out the inductor situation right now, so we're
going to add another denylist for inductor and work through it later.
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117553
Approved by: https://github.com/voznesenskym
Prior to this PR, xfail tests provided neither (1) a guarantee of the error message/reason (it **could be outdated**), nor (2) execution of the test (`xfail_if_model_type_is_not_exportedprogram`). As a result, tests labeled xfail were less robust, since we couldn't be sure they were still failing for the same reason, or even still failing at all. This PR fixes the issue with try/except plus error message matching to consolidate the xfail truth and reason.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117425
Approved by: https://github.com/thiagocrepaldi
- passrate.py: compute the pass rate
- update_failures.py: update `dynamo_test_failures.py`
Both of these scripts require you to download the test results from CI
locally. Maybe we can automate this more in the future. Checking these
in for now, with no tests :P.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117400
Approved by: https://github.com/voznesenskym
ghstack dependencies: #117391
Refactor common part between `mm_out_mps` and `addmm_out_mps` into `do_mm` static function.
Change the input placeholder initialization logic so that `addmm` can handle matrix multiplication with an empty dimension.
Add tests for `mm`+`addmm` with empty tensors to OpInfo but skip addmm with empty matrices from onnx tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117223
Approved by: https://github.com/albanD
For training graphs (when inputs require grad), previously, we would speculate the forward and backward graph to determine if there were any graph breaks, side effects, etc., but would not actually use these speculated graphs. We would just insert a call_function node in the graph and later rely on autograd's tracing.
This approach does not work for more generalized graphs, like graphs that include user-defined triton kernels, because autograd is not able to do the higher order function conversion.
This PR speculates the forward and backward functions and emits them in a HOF that later gets used via templating mechanism.
While working on this PR, I have exposed some bugs in the current tracing due to trampoline functions losing the source information resulting in incorrect graphs being produced. I have fixed these source information bugs and killed the trampolines.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116897
Approved by: https://github.com/Skylion007, https://github.com/jansel, https://github.com/voznesenskym
This adds a function `statically_known_true` for `SymBool` that works
like inductor's `is_expr_static_and_true`. That is, it tries to simplify the
expression to a constant or returns `False` if it cannot be simplified.
This is useful in cases that can be optimized if the condition is met;
otherwise it doesn't affect correctness, so we can avoid adding guards.
I also use this new function in inductor for `FakeTensorUpdater` and
`remove_noop_pass` which both generated unexpected guards previously.
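A minimal sketch of how a caller might use it (the import path is an assumption; `can_skip_broadcast` is a hypothetical helper used only for illustration):
```python
from torch.fx.experimental.symbolic_shapes import statically_known_true


def can_skip_broadcast(size_a, size_b):
    # size_a / size_b may be plain ints or SymInts; comparing them yields a
    # SymBool when symbolic sizes are involved. statically_known_true returns
    # True only if the expression simplifies to a constant True, and otherwise
    # returns False without installing a guard, so taking the generic path
    # when the answer is unknown never affects correctness.
    return statically_known_true(size_a == size_b)
```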
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117359
Approved by: https://github.com/lezcano
Fixes#117110
When slicing we can end up with start and end which are out of bounds, which is
handled in python slicing by clamping to the correct bounds. There is also the
case where end < start which should result in an empty slice.
In the isoneutral_mixing failure we have the second case, with `start=2, end=0`
which in `slice_scatter` became `src_size[dim] = -2`.
This PR improves slice's edge case handling and factors the start and end
normalization code out so it can be shared with slice_scatter.
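For reference, a toy sketch of the clamping rules that Python slicing applies and that the shared normalization helper is meant to mirror (this is illustrative, not the actual inductor code):
```python
# Normalize (start, end) for a dimension of size dim_size the way Python
# slicing does: negative indices count from the end, out-of-bounds values are
# clamped, and end < start yields an empty slice rather than a negative length.
def normalize_slice(start, end, dim_size):
    if start < 0:
        start += dim_size
    if end < 0:
        end += dim_size
    start = min(max(start, 0), dim_size)
    end = min(max(end, 0), dim_size)
    end = max(end, start)
    return start, end


assert normalize_slice(2, 0, 5) == (2, 2)       # empty slice, length 0 (not -2)
assert normalize_slice(-100, 100, 5) == (0, 5)  # out-of-bounds clamped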
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117377
Approved by: https://github.com/lezcano
## Motivation
The current code `value in [torch.backends.cudnn, torch.ops]` requires `value` to have an implementation of `__eq__`. If the value is a custom object that does not implement `__eq__`, dynamo will throw an error. For example, with ConvolutionOpContext, the custom 'torch._C.ScriptClass' object registered in IPEX, dynamo throws the following error:
**torch._dynamo.exc.InternalTorchDynamoError: '__eq__' is not implemented for __torch__.torch.classes.ipex_prepack.ConvolutionOpContext**
I think this is a common issue. To avoid it, this PR replaces the current code `value in [torch.backends.cudnn, torch.ops]` with `isinstance(value, (torch.backends.cudnn.CudnnModule, torch._ops._Ops))`
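A toy illustration of the failure mode and why the isinstance check avoids it (`ScriptObjectLike` is a hypothetical stand-in for objects like ConvolutionOpContext):
```python
class ScriptObjectLike:
    def __eq__(self, other):
        raise RuntimeError("'__eq__' is not implemented")


value = ScriptObjectLike()

try:
    # `x in some_list` compares x against every element via __eq__, so the
    # membership test itself blows up for objects without a usable __eq__.
    value in [object(), object()]
except RuntimeError as e:
    print("membership check failed:", e)

# isinstance only inspects the type, so it never calls __eq__.
print(isinstance(value, (int, float)))  # False, no exception
```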
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116856
Approved by: https://github.com/jansel
As Conda binaries are still built on MacOS 12, which renders MPS unusable after https://github.com/pytorch/pytorch/pull/116942
Test plan:
```
% xcrun -sdk macosx metal --std=macos-metal2.3 -Wall -o Index Index.metal
% xcrun -sdk macosx metal --std=macos-metal2.2 -Wall -o Index Index.metal
Index.metal:167:1: error: type 'const constant ulong3 *' is not valid for attribute 'buffer'
REGISTER_INDEX_OP_ALL_DTYPES(select);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Index.metal:159:5: note: expanded from macro 'REGISTER_INDEX_OP_ALL_DTYPES'
REGISTER_INDEX_OP(8bit, idx64, char, INDEX_OP_TYPE, ulong3); \
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
...
```
Fixes https://github.com/pytorch/pytorch/issues/117465
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117472
Approved by: https://github.com/xuzhao9
When working with internal flows, it can sometimes be ambiguous what
version of the code you are working with. In this case, having the
function name available in the stack trace can help identify what you
are looking at.
Example now looks like:
```
[DEBUG] # File: /data/users/ezyang/a/pytorch/a.py:5 in f, code: return x + x
```
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117459
Approved by: https://github.com/Skylion007
Summary: If any of the `TensorBox` arguments of a custom (user-written) Triton kernel in the graph is wrapped into a `BaseView` subclass which is not `ReinterpretView`, this currently conflicts with the cloning (which preserves RVs) and downstream processing (which needs a layout to mark mutation) of the input.
This PR adds conversion of the non-RV views to `ReinterpretView`s by realizing the corresponding inputs to the Triton kernel. As realization happens anyway before the Triton kernel call, this should not affect the perf. But it covers currently missed patterns in the internal models (see the unit test for a repro).
Test Plan:
```
$ python test/dynamo/test_triton_kernels.py -k test_triton_kernel_slice_and_view_input
...
----------------------------------------------------------------------
Ran 1 test in 3.909s
OK
```
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117468
Approved by: https://github.com/oulgen
* This is an old builtin function equivalent to the bool constructor. It is easy enough to add support for.
* I also realized the tests were in the wrong class (the one reserved for testing default args) so I moved them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117463
Approved by: https://github.com/jansel
We can use `__index__` to do this conversion because that will trigger a
guard on data dependent SymInt if the tensor is a fake tensor, but if
we fetch item directly and put it in the Scalar, we may still be able to
make it work out.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117454
Approved by: https://github.com/yanboliang
ghstack dependencies: #117451, #117452
This fastpath is unnecessary because in the logic below we
do the same thing:
```
auto& var = THPVariable_Unpack(obj);
if (var.numel() != 1 ||
!at::isIntegralType(
var.dtype().toScalarType(), /*include_bool*/ true)) {
throw_intlist_exception(this, i, obj, idx);
}
auto scalar = var.item();
TORCH_CHECK(scalar.isIntegral(/*include bool*/ false));
res.push_back(scalar.toSymInt())
```
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117452
Approved by: https://github.com/yanboliang
ghstack dependencies: #117451
The current timeout check frequency relies on the monitoring thread's timeout, which can be too long (even if we set it to 2 minutes), so let's use a separate timeout variable that users can configure. And we only let the default PG check TCPStore, so an even more frequent check should be fine. (Our stress test is performed every half second.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117093
Approved by: https://github.com/wconstab, https://github.com/kwen2501
Summary:
A follow up for #117097. In that PR I didn't add
`_scaled_dot_product_attention_for_cpu` into the core_aten_decomposition
table. This PR does that and also add a unit test.
Test Plan: python test/export/test_export.py -k
test_scaled_dot_product_attention
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117390
Approved by: https://github.com/drisspg
Summary:
## Context
When running large models with a lot of operators, the default descriptor pool allocated by the Vulkan compute API may run out of descriptor sets. This changeset introduces the `VULKAN_DESCRIPTOR_POOL_SIZE` build variable (which will default to `1024u`) which can allow for a larger descriptor pool to be allocated if necessary.
## Notes for Reviewers
This is a simple stopgap solution until we have bandwidth to implement the more general solution, which would be to modify the `DescriptorPool` class defined in `api/Descriptor.[h,cpp]` to automatically allocate a new descriptor pool when memory runs out. However, I would consider this change to be low priority since with a delegate/graph mode of execution, the descriptor pool can often be allocated to exactly fit a model's requirements.
Test Plan:
There should be no functional changes under default build settings. Run `vulkan_api_test` to make sure everything works as before; CI should test for that as well.
```
# On devserver
LD_LIBRARY_PATH=/home/ssjia/Github/swiftshader_prebuilt/swiftshader/build/bin/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*"
```
Reviewed By: yipjustin, jorgep31415
Differential Revision: D52742140
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117398
Approved by: https://github.com/yipjustin
Summary:
Fix the numerical issue with addcmul.
Found that torch.addcmul will generate a different value from torch.add+torch.mul under a 32-bit check. Mini repro: N4823658
Change addcmul to torch.add+torch.mul
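For context, `addcmul(input, t1, t2, value=v)` computes `input + v * t1 * t2`, so the decomposed form is semantically equivalent and any mismatch is purely numerical; a quick sanity check:
```python
import torch

torch.manual_seed(0)
inp, t1, t2 = torch.randn(1024), torch.randn(1024), torch.randn(1024)

fused = torch.addcmul(inp, t1, t2, value=1.0)
decomposed = torch.add(inp, torch.mul(t1, t2))

# The two forms agree mathematically; any difference comes from the
# implementation (fusion / evaluation order), not the semantics.
print(torch.max(torch.abs(fused - decomposed)))
```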
Test Plan:
buck test
before change
```
the diff index is: 0
the diff index is: 1
the diff index is: 6
```
after change numeric on par
Differential Revision: D52745671
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117404
Approved by: https://github.com/mengluy0125
This is a placeholder implementation for reconstructing streams via global storage to unblock FSDP, pending proper stream support design
This PR does a few things:
1) fixes registration for devices with indices. We were only supporting "cuda"; we now support "cuda:k" interfaces where k is the GPU index
2) Changes the stream objects in dynamo to take devices as device types, instead of strings, and updates the string based device APIs to gracefully take device types.
3) Introduces a reconstruct-by-global (using existing cleanup hook structures) to streams as a placeholder impl for now
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117386
Approved by: https://github.com/jansel
Summary:
* In some fx partially-specialized codegen via `concrete_args` on boolean arguments, we extend it to further use the GraphModule on a strongly typed runtime like TorchScript.
* This diff fixes the type annotation for booleans only and preserves the argument mapping for leafing pytree nodes.
Test Plan: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test:fx -- --exact 'caffe2/test:fx - test_partial_trace (test_fx.TestFX)'
Differential Revision: D52667883
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117201
Approved by: https://github.com/houseroad
Summary:
These dtypes are added since we see more demand for these sub byte dtypes, especially with
the popularity of LLMs (https://pytorch.org/blog/accelerating-generative-ai-2/#step-4-reducing-the-size-of-the-weights-even-more-with-int4-quantization-and-gptq-2021-toks)
Note these are just placeholders, the operator support for these dtypes will be implemented with tensor subclass.
e.g. torch.empty(..., dtype=torch.uint1) will return a tensor subclass of uint1 that supports different operations like bitwise ops, add, mul, etc. (will be added later)
Also note that these are not quantized data types; we'll implement quantization logic with tensor subclasses backed by these dtypes as well.
e.g `Int4GroupedQuantization(torch.Tensor)` will be implemented with torch.uint4 Tensors (see https://github.com/pytorch-labs/ao/pull/13 as an example)
Test Plan:
CIs
python test/test_quantization.py -k test_uint1_7_dtype
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117208
Approved by: https://github.com/ezyang
Measures the duration of a collective operation using nccl start/end
events and includes this duration (in ms) in the flight recorder data.
duration_ms will be an optional field, since it only works when
timing is enabled. Currently timing is enabled when flight recorder
is enabled, but this is not a strict requirement. Duration is also
not available for collectives not in a completed state.
Note: computing duration can lead to a hang due to calling cudaEventDuration when
the cuda driver queue is full.
We don't ever want dump() api to hang, since we might want dump to help
debug a hang. Hence, we only query durations from the watchdog thread,
and it's possible during dump() call, some of the most recent
collectives durations won't have been computed yet at time of dump. We
make this tradeoff to ensure that dump() itself will never hang.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114817
Approved by: https://github.com/fduwjj, https://github.com/zdevito
ghstack dependencies: #116905
- Removes an outdated assert that prevents perf tests from running DDP, we now have single node --multiprocess and perf tests are already wrapping the model using `deepcopy_and_maybe_ddp`
- Append rank name to traces to avoid all ranks trying to create the same file
- Renames `deepcopy_and_maybe_ddp` to `deepcopy_and_maybe_parallelize` to include FSDP
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113332
Approved by: https://github.com/H-Huang, https://github.com/wconstab
The problem is that the dynamo_test_failures logic recognizes tests by
their TestClass.test_name. Unfortunately we have duplicate
TestClass.test_name in test_legacy_vmap and test_vmap. This PR
unduplicates them.
Something more robust would have been to include the test file name in
the dynamo_test_failures logic, but... it's a bit too late for that. We
can fix it if it becomes more of a problem in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117320
Approved by: https://github.com/voznesenskym
ghstack dependencies: #117318
Whenever the monitor thread kills the watchdog thread for being stuck, we do so to save cluster time and get a faster failure signal, but we want to know more about why it got stuck.
One possible reason for watchdog stuckness is GIL contention, which could be ruled out or observed by making an attempt to acquire the GIL at exit time.
If we cannot acquire the GIL within a short time window (1s) we abort the attempt and report GIL contention, otherwise we report that GIL was acquired successfully.
Reland: uses a function pointer to avoid destructor ordering issues on dlclose. (Looks like the destructor for the std::function was being run later than the libtorchpython lib was unloaded, leading to a crash).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117312
Approved by: https://github.com/zdevito
Add skips to tests that involve record_context_cpp on ARM, as it is only supported on the Linux x86_64 arch. The error is reported as below:
```
Traceback (most recent call last):
File "/usr/lib/python3.10/unittest/case.py", line 59, in testPartExecutor
yield
File "/usr/lib/python3.10/unittest/case.py", line 591, in run
self._callTestMethod(testMethod)
File "/usr/lib/python3.10/unittest/case.py", line 549, in _callTestMethod
method()
File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 2674, in wrapper
method(*args, **kwargs)
File "/opt/pytorch/pytorch/test/test_cuda.py", line 3481, in test_direct_traceback
c = gather_traceback(True, True, True)
RuntimeError: record_context_cpp is not support on non-linux non-x86_64 platforms
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117344
Approved by: https://github.com/malfet, https://github.com/drisspg
Added clarification for the example provided for the pos_weight parameter in the BCEWithLogitsLoss class, particularly in multi-label binary classification context. This enhancement addresses potential misunderstandings about the application of 'binary' classification, which typically implies two classes, to scenarios involving multiple classes.
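For illustration, a small multi-label sketch (shapes and weights are made up): each of the C outputs is its own binary problem, and pos_weight holds one weight per label rather than one weight per class of a multi-class problem.
```python
import torch
import torch.nn as nn

logits = torch.randn(8, 3)                    # batch of 8, 3 independent binary labels
targets = torch.randint(0, 2, (8, 3)).float()

pos_weight = torch.tensor([1.0, 3.0, 10.0])   # e.g. label 2's positives are 10x rarer
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
loss = criterion(logits, targets)
print(loss)
```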
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117046
Approved by: https://github.com/mikaylagawarecki
It looks like the inductor fallback previously worked with HOPs but no longer
does, so I fixed that:
- all HOPs are exposed under torch.ops.higher_order, so I changed how
inductor looks them up
- the inductor fallback assumed that an operator's signature was (*args,
**kwargs). This is true for all the OpOverloads but not HOPs. I
rewrote the code to not rely on this.
Test Plan:
- existing tests
- new test for auto_functionalized HOP.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117084
Approved by: https://github.com/williamwen42
Support this fallback by converting the jagged layout NT to a strided layout NT, and then converting the result back to a jagged layout NT.
This fallback might not be efficient since it uses unbind, contiguous and split.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116445
Approved by: https://github.com/soulitzer
**Summary**
When a tensor is used as the act of conv and extra input of the binary add node, we shouldn't do conv binary inplace fusion.
```
    a
   / \
conv  |
   \  |
    add
```
**TestPlan**
```
python -u -m pytest -s -v test_mkldnn_pattern_matcher.py -k test_conv2d_binary_inplace_fusion_failed_cpu
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117331
Approved by: https://github.com/jgong5
ghstack dependencies: #117330
# Motivation
As mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), The first runtime component we would like to upstream is `Device` which contains the device management functions of Intel GPU's runtime. To facilitate the code review, we split the code changes into 4 PRs. This is one of the 4 PRs and covers the changes under `c10`.
# Design
An Intel GPU device is a wrapper of a SYCL device on which kernels can be executed. In our design, we will maintain a SYCL device pool containing all the GPU devices of the current machine, and manage the status of the device pool in PyTorch. Thread-local safety is considered in this design. The corresponding C++ files related to `Device` will be placed in the c10/xpu folder. And we provide the c10 device runtime APIs, like
- `c10::xpu::device_count`
- `c10::xpu::set_device`
- ...
# Additional Context
In our plan, 4 PRs should be submitted to PyTorch for `Device`:
1. for c10
2. for aten
3. for python frontend
4. for lazy initialization shared with CUDA
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116019
Approved by: https://github.com/gujinghui, https://github.com/jgong5, https://github.com/EikanWang, https://github.com/malfet
Summary:
Fixes https://github.com/pytorch/pytorch/issues/117033
Sometimes the solution returned by `sympy.solvers.inequalities.reduce_inequalities` can contain sub-expressions of the form `CRootOf(...)`, denoting the complex root of some equation in `x`, where `x` is an arbitrary symbol. We will now gracefully fail when this happens, like we already do when the solver itself fails.
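For illustration, here is how `CRootOf` terms can show up in plain sympy, independent of the dynamic-shapes solver (quintics such as x**5 - x + 1 have no solution in radicals):
```python
import sympy

x = sympy.Symbol("x")
roots = sympy.solve(x**5 - x + 1, x)

# sympy can only represent these roots symbolically, as CRootOf(...) terms.
print(roots[0])                             # CRootOf(x**5 - x + 1, 0)
print(isinstance(roots[0], sympy.CRootOf))  # True: the case we now fail gracefully on
```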
Test Plan: added a test
Differential Revision: D52715578
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117310
Approved by: https://github.com/ezyang
Summary:
rrelu_with_noise() was listed as having default parameters in the schema but the
actual code definition didn't have them.
The failing example was calling rrelu() which DOES have default parameters and
it passes those defaulted values to C++. Under the covers the C++ code was calling
the python version of rrelu_with_noise().
Although the C++ code was passing all the values to the python version of
rrelu_with_noise(), the pytorch C++ -> Python dispatch code looks at the schema
and strips any parameters which match the schema's listed defaults, so if the
schema shows defaults that aren't in the code it will be a problem.
Test Plan:
I added a unit test for this specific case. It would probably be better to write
a more general one to validate all the ops against their schemas - but I haven't
learned enough about the test harness to do that yet.
Fixes#115811
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117141
Approved by: https://github.com/yanboliang, https://github.com/oulgen
During ONNXProgram.save, the implicit/explicit state_dict passed in must
be loaded in memory in order to read each initializer and create an
external tensor proto from it.
This PR ensures torch.load uses a memory map to support large models that
cannot fit in memory.
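A minimal sketch of memory-mapped loading (the path is a placeholder): with `mmap=True`, tensor storages are mapped from the checkpoint file on demand instead of being read fully into RAM up front.
```python
import torch

# Memory-map the checkpoint instead of materializing every tensor in RAM.
state_dict = torch.load("model_state.pt", mmap=True, map_location="cpu")

for name, tensor in state_dict.items():
    # Each initializer can now be read lazily and serialized out as an
    # external tensor proto without holding the whole model in memory.
    pass
```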
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117295
Approved by: https://github.com/BowenBao
ghstack dependencies: #117294
## Context
This change is part of a set of changes that removes all references to the `c10` library in the `api/`, `graph/`, and `impl/` folders of the PyTorch Vulkan codebase. This is to ensure that these components can be built as a standalone library such that they can be used as the foundations of a Android GPU delegate for ExecuTorch.
## Notes for Reviewers
This changeset removes references to `c10::MemoryFormat` in `api/Tensor.[h,cpp]`; when constructing a `vTensor`, the `api::StorageType` (i.e. whether the tensor will be backed by buffer or texture storage) and `api::GPUMemoryLayout` (i.e. which dimension will be the fastest moving dimension) must be specified directly.
Differential Revision: [D52662234](https://our.internmc.facebook.com/intern/diff/D52662234/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117183
Approved by: https://github.com/liuk22, https://github.com/yipjustin
ghstack dependencies: #117176, #117177, #117178, #117179, #117180, #117181
## Context
This change is part of a set of changes that removes all references to the `c10` library in the `api/`, `graph/`, and `impl/` folders of the PyTorch Vulkan codebase. This is to ensure that these components can be built as a standalone library such that they can be used as the foundations of a Android GPU delegate for ExecuTorch.
## Notes for Reviewers
This changeset introduces `api::ScalarType` in `api/Types.h`, which is intended to function the same as `c10::ScalarType`; thus `api/Types.h` is the primary file of interest. The rest of the changes are straightforward replacements of `c10::ScalarType` with `api::ScalarType`.
Differential Revision: [D52662237](https://our.internmc.facebook.com/intern/diff/D52662237/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117181
Approved by: https://github.com/yipjustin
ghstack dependencies: #117176, #117177, #117178, #117179, #117180
## Context
This change is part of a set of changes that removes all references to the `c10` library in the `api/`, `graph/`, and `impl/` folders of the PyTorch Vulkan codebase. This is to ensure that these components can be built as a standalone library such that they can be used as the foundations of a Android GPU delegate for ExecuTorch.
## Notes for Reviewers
This changeset replaces all instances of `c10::ArrayRef<T>` with `std::vector<T>&` and all instances of`c10::IntArrayRef` with `std::vector<int64_t>&`. There are a lot of changes in this changeset but that is simply due to the large number of callsites. All the changes are straightforward replacements.
Differential Revision: [D52662235](https://our.internmc.facebook.com/intern/diff/D52662235/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117179
Approved by: https://github.com/yipjustin
ghstack dependencies: #117176, #117177, #117178
## Context
This change is part of a set of changes that removes all references to the `c10` library in the `api/`, `graph/`, and `impl/` folders of the PyTorch Vulkan codebase. This is to ensure that these components can be built as a standalone library such that they can be used as the foundations of a Android GPU delegate for ExecuTorch.
## Notes for Reviewers
This changeset introduces `api::Error` class in `api/Exception.h`, which is a more barebones copy of the `c10::Error` class from [`c10/util/Exception.h`](https://github.com/pytorch/pytorch/blob/main/c10/util/Exception.h). The macros `VK_CHECK_COND` (equivalent to `TORCH_CHECK(cond, msg)`) and `VK_THROW` (equivalent to `TORCH_CHECK(false, msg)` are introduced as well to replace calls to `TORCH_CHECK()` and similar macros.
Although this is a large diff, the most meaningful changes are in the added files `api/Exception.[h,cpp]` and `api/StringUtil.[h,cpp]` (which is mostly adapted from [`c10/util/StringUtil.h`](https://github.com/pytorch/pytorch/blob/main/c10/util/StringUtil.h)) which implements `api::Error` and the new macros. The rest of the diff is replacing calls to `TORCH_CHECK()` and similar macros with `VK_CHECK_COND()` and `VK_THROW()`.
Differential Revision: [D52662233](https://our.internmc.facebook.com/intern/diff/D52662233/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117178
Approved by: https://github.com/yipjustin
ghstack dependencies: #117176, #117177
## Context
This change is part of a set of changes that removes all references to the `c10` library in the `api/`, `graph/`, and `impl/` folders of the PyTorch Vulkan codebase. This is to ensure that these components can be built as a standalone library such that they can be used as the foundations of a Android GPU delegate for ExecuTorch.
## Notes for Reviewers
The majority of the changes in this changeset are:
* Replacing instances of `ska::flat_hash_map` with `std::unordered_map`
* `ska::flat_hash_map` is an optimized hash map, but the optimizations shouldn't be too impactful so `std::unordered_map` should suffice. Performance regression testing will be done at the final change in this stack to verify this.
* Replacing `c10::get_hash` with `std::hash` where only one variable is getting hashed or the `utils::hash_combine()` function added to `api/Utils.h` (which was copied from `c10/util/hash.h`)
Differential Revision: [D52662231](https://our.internmc.facebook.com/intern/diff/D52662231/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117177
Approved by: https://github.com/yipjustin
ghstack dependencies: #117176
## Context
This change is part of a set of changes that removes all references to the `c10` library in the `api/`, `graph/`, and `impl/` folders of the PyTorch Vulkan codebase. This is to ensure that these components can be built as a standalone library such that they can be used as the foundations of a Android GPU delegate for ExecuTorch.
## Notes for Reviewers
This changeset deprecates various easy-to-replace symbols from the `c10` library with either C++ STL equivalents or by using copying those `c10` symbols as native equivalents. The symbols that were impacted are:
* `c10::irange`
* removed and replaced with standard for loops
* `C10_LIKELY` and `C10_UNLIKELY`
* These macros allow for some branch re-ordering compiler optimizations when building with GCC. They aren't strictly necessary and their impact is likely minimal so these have simply been removed.
* `c10::SmallVector<T, N>`
* My understanding is that `c10::SmallVector<T, N>` is essentially a wrapper around `std::vector<T>` that is optimized for array sizes up to `N`. I don't believe that this optimization is worth creating a native equivalent, so I replaced instances of this symbol with `std::vector<T>`
* `c10::multiply_integers`
* This function is simply a convenient wrapper around `std::accumulate`, so I copied it as a native equivalent in `api/Utils.h`
This changeset comprises entirely of the replacements described above.
Differential Revision: [D52662232](https://our.internmc.facebook.com/intern/diff/D52662232/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117176
Approved by: https://github.com/yipjustin
Adds `--compile-autograd` flag to benchmark suite to run accuracy and performance tests. Also adds autograd_captures and autograd_compiles to dynamo stats
e.g. accuracy_inductor.csv
```
dev,name,batch_size,accuracy,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles
cuda,BERT_pytorch,4,pass,2655,2,8,7,1,1
cuda,Background_Matting,4,pass_due_to_skip,0,0,0,0,0,0
cuda,DALLE2_pytorch,0,eager_fail_to_run,0,0,0,0,0,0
cuda,LearningToPaint,4,pass,639,2,8,7,1,1
...
```
e.g. speedup_inductor.csv
```
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles
cuda,hf_T5,8,1.214311,136.236793,88.350570,0.751322,18.754706,24.962275,3298,2,8,8,1,1
cuda,hf_T5,8,1.226645,135.431856,52.461461,1.040973,18.754706,18.016508,795,1,7,7,0,0
...
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117196
Approved by: https://github.com/jansel
…reference) (#109065)
Summary:
Modify the way we update gc_count in CUDACachingAllocator to make it faster.
Originally D48481557, but reverted due to nullptr dereference in some cases (D49003756). This diff changed to use correct constructor for search key (so avoid nullptr dereference). Also, added nullptr check (and returns 0 if it is) in gc_count functions.
Differential Revision: D49068760
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117064
Approved by: https://github.com/zdevito
Many of our pattern matching replacements are specified as a `search_fn` and a `replacement_fn`. The search_fns are traced out once with static shapes, converted to a pattern, and then matched on every graph compiled with inductor.
The static shape patterns would not match with graphs that are traced out with dynamic shapes because SymInts would be added to the graph as `sym_size` fx nodes which added additional uses and prevented matching. The previous PR partially addresses this by deduping SymInts that are resolvable to graph inputs, as is the calling convention in aot autograd.
This PR adjusts our matching of the `search_fn` by adding SymInts to the arguments we trace out the search_fn with so that their symint accesses are deduped. Later, if we have a match, we will trace out the replacement graph with the correct Tensors and corresponding symbolic shapes that will get added to the graph.
Note: the replacement patterns will insert sym_size uses which could potentially be removed, but I'll leave that for follow up.
Fix for https://github.com/pytorch/pytorch/issues/111190.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115441
Approved by: https://github.com/jansel
ghstack dependencies: #116158
TorchDynamo will guard grad_mode and the local dispatch key set.
3a429423fc/torch/csrc/dynamo/guards.cpp (L13-L16)
When using ThroughputBenchmark, that TLS state will not be initialized to match the main thread's state.
3a429423fc/torch/csrc/utils/throughput_benchmark-inl.h (L64-L94)
Running the following script
```
import torch
linear = torch.nn.Linear(128, 128)
compiled = torch.compile(linear)
x = torch.rand(10, 128)
with torch.no_grad(), torch.cpu.amp.autocast(enabled=True, dtype=torch.bfloat16):
compiled(x)
compiled(x)
from torch._dynamo import config
config.error_on_recompile = True
from torch.utils import ThroughputBenchmark
with torch.no_grad(), torch.cpu.amp.autocast(enabled=True, dtype=torch.bfloat16):
bench = ThroughputBenchmark(compiled)
bench.add_input(x)
stats = bench.benchmark(
num_calling_threads=10,
num_warmup_iters=100,
num_iters=100,
)
print(stats)
```
will lead to 2 re-compile reasons:
```
triggered by the following guard failure(s): ___check_global_state()
triggered by the following guard failure(s): tensor 'x' dispatch key set mismatch.
```
This will trigger a re-compile in TorchDynamo. But since `ThroughputBenchmark` is used for sharing weights across threads, the model should not be changed anymore while running the benchmark. So this PR initializes the TLS state to match the main thread. Then we can use `ThroughputBenchmark` to run TorchDynamo-optimized models.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113246
Approved by: https://github.com/jgong5, https://github.com/desertfire
Fixes#105338
This PR changes the ops consistency tests from manually adding ops to a testing list to automatically testing all ops in the registry. It also spots more complex-dtype bugs in the converter.
Overall, this PR provides:
(1) Whole test coverage on ONNX registry
(2) More complete complex-dtype support
(3) Only test the same dtypes as torchlib
(4) Auto xfail unsupported nodes
Follow-up issue: https://github.com/pytorch/pytorch/issues/117118
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116319
Approved by: https://github.com/justinchuby
We observed that `with torch.no_grad()` in runtime_wrapper introduced ~10% (0.06ms->0.066ms) inference performance regression on lennard_jones on cpu.
For inference tasks in the benchmark, grad has already been disabled, but the current runtime_wrapper enters no_grad again and its overhead is counted in the running time.
Therefore, we add an `is_grad_enabled` check in runtime_wrapper before entering torch.no_grad: if grad has already been disabled, there is no need to enter no_grad again.
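A minimal sketch of the guarded pattern described above (not the actual aot_autograd runtime_wrapper):
```python
import contextlib
import torch


def runtime_wrapper(fn, *args):
    # Only pay for entering no_grad when grad is currently enabled; for
    # inference benchmarks grad is already off, so this becomes a no-op.
    ctx = torch.no_grad() if torch.is_grad_enabled() else contextlib.nullcontext()
    with ctx:
        return fn(*args)
```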
Before this pr:
1.043x
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cpu,lennard_jones,1,**1.043427**,**0.068366**,4.756151,0.941846,45.056819,47.838822,9,1,0,0
After this pr:
1.146x
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cpu,lennard_jones,1,**1.146190**,**0.061844**,4.468380,0.936456,44.427264,47.441920,9,1,0,0
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117089
Approved by: https://github.com/jgong5, https://github.com/bdhirsh
Key vars are strings used as dict keys (e.g. duration_s was the string
"duration_ms").
The _s suffix was confusing because it suggests time in seconds, while duration_s
was a key string and duration_ms is another variable holding a time value.
Now the key variable is named duration_key and holds "duration_ms".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116905
Approved by: https://github.com/zdevito
Summary:
Describes the arguments in more detail. Not in user facing docs for now, but a step towards getting there eventually.
Test Plan: CI
Reviewers:
Subscribers:
Tasks:
Tags:
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117190
Approved by: https://github.com/drisspg
Summary: As titled. #115913 added
`_scaled_dot_product_flash_attention_for_cpu` and the export result of
`scaled_dot_product_attention` includes this op. Adding this
decomposition so that it's being decomposed the same way as
`_scaled_dot_product_attention_math`.
Test Plan:
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117097
Approved by: https://github.com/lezcano
Using `@skipifTorchDynamo` is wrong; the correct usage is
`@skipIfTorchDynamo()` or `@skipIfTorchDynamo("msg")`. The incorrect usage would cause
tests to stop existing.
Added an assertion for this and fixed the incorrect callsites.
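For illustration, the difference between the two usages (the test bodies are placeholders):
```python
from torch.testing._internal.common_utils import skipIfTorchDynamo


class MyTests:
    @skipIfTorchDynamo("flaky under dynamo")  # correct: call the decorator factory
    def test_good(self):
        ...

    # Wrong: `@skipIfTorchDynamo` without parentheses passes the test function
    # itself as the "msg" argument, so the attribute becomes the inner decorator
    # rather than a test, and the test effectively stops existing.
    # @skipIfTorchDynamo
    # def test_bad(self):
    #     ...
```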
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117114
Approved by: https://github.com/voznesenskym
In c10d PG initialization, we wrap TCPStore with multiple layers of PrefixStore, which add layers of prefixes.
One example is:
"default_pg/0//cuda//timeout_dump"
When initializing the default PG, because no store is passed in, we first add the prefix "default_pg" to the TCPStore returned from rendezvous:
bdeaaad70c/torch/distributed/distributed_c10d.py (L1240)
We then add pg_name (aka 0) bdeaaad70c/torch/distributed/distributed_c10d.py (L1376) and device (aka cuda) bdeaaad70c/torch/distributed/distributed_c10d.py (L1387)
to the prefix. Then when we call store_->set("timeout_dump"), the actual key used for writing into TCPStore is "default_pg/0//cuda//timeout_dump".
For sub-PGs, things get even more interesting: we put the store wrapped with the default PG name into a cache:
bdeaaad70c/torch/distributed/distributed_c10d.py (L1517)
And when creating each sub-PG, its PG name is appended right after the cached store's prefix. Example keys are:
'default_pg/0//10//cuda//timeout_dump', 'default_pg/0//12//cuda//timeout_dump', 'default_pg/0//38//cuda//timeout_dump', 'default_pg/0//39//cuda//timeout_dump'. (10, 12, 38 and 39 are all PG names of each subPG created)
The reason why the number in the name is bumped up so high is that for each sub-PG creation, all ranks have to call the API together, and the global variable used for the PG name is bumped up monotonically:
bdeaaad70c/torch/distributed/distributed_c10d.py (L3666)
Similar things happen for using hashing for PG names.
This is a potential issue: each sub-PG has an instance of ProcessGroupNCCL, and if we want to set something global to notify all sub-PGs (and all ranks), this added prefix causes bugs. For example, if on sub-PG 1 we set a value in TCPStore with the key ('default_pg/0//1//cuda//timeout_dump') while we use the default PG instances to check the TCPStore, which use the key ('default_pg/0//cuda//timeout_dump'), the default PG instances will never see the notification signal. So in this PR, we added a new API in PrefixStore that returns the innermost non-PrefixStore, which is used for set and check. The next PR will make changes in the NCCL watchdog.
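A toy sketch of the prefix layering (the separators are an approximation of the real composed keys, and HashStore stands in for the TCPStore):
```python
import torch.distributed as dist

base = dist.HashStore()  # in-process store standing in for the real TCPStore

# Each PrefixStore layer prepends its prefix to every key before the key
# reaches the underlying store, so stacking layers composes the prefixes.
default_pg_view = dist.PrefixStore("cuda", dist.PrefixStore("0", dist.PrefixStore("default_pg", base)))
sub_pg_view = dist.PrefixStore("cuda", dist.PrefixStore("1", dist.PrefixStore("0", dist.PrefixStore("default_pg", base))))

sub_pg_view.set("timeout_dump", "1")
default_pg_view.set("timeout_dump", "2")

# Two distinct composed keys now exist in the underlying store: the sub-PG
# write and the default-PG write do not collide, which is exactly why a
# per-sub-PG write cannot notify the default PG instances.
print(base.num_keys())  # 2
```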
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117074
Approved by: https://github.com/wconstab, https://github.com/H-Huang
From https://pytest.org/en/7.4.x/how-to/assert.html#advanced-assertion-introspection
pytest only rewrites test modules directly discovered by its test collection process, so asserts in supporting modules which are not themselves test modules will not be rewritten.
In CI we usually call the test file (`python test_ops.py`), which then calls run_test which then calls pytest.main, so the test module is already imported as `__main__`, so pytest does not import the test module itself and relies on the already imported module. (#95844)
However, calling `pytest test_ops.py` will rely on pytest to import the module, resulting in asserts being rewritten, so I add --assert=plain by default into the opts so we don't have to worry about this anymore. Another way to make pytest stop assertion rewriting in a file is to include `PYTEST_DONT_REWRITE` somewhere in the file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117060
Approved by: https://github.com/zou3519
**Expected behavior**: when rank 0 has an inf grad, ranks 1...k should get `found_inf=1` after `dist.all_reduce`
**Bug addressed in this PR**: for CPU-offloaded param.grad, when rank 0 has inf, ranks 1...k would not get found_inf=1. This is because `found_inf` was copied before `future.wait()` on the async `dist.all_reduce`
repro the bug using the newly added unit test: `pytest test/distributed/fsdp/test_fsdp_sharded_grad_scaler.py -k test_sharded_grad_scaler_found_inf`
```
File "/data/users/weif/pytorch/test/distributed/fsdp/test_fsdp_sharded_grad_scaler.py", line 320, in _test_sharded_grad_scaler_found_inf
self.assertEqual(
File "/data/users/weif/pytorch/torch/testing/_internal/common_utils.py", line 3576, in assertEqual
raise error_metas.pop()[0].to_error(
AssertionError: Scalars are not close!
Expected 1.0 but got 2.0.
Absolute difference: 1.0 (up to 1e-05 allowed)
Relative difference: 1.0 (up to 1.3e-06 allowed)
rank: 0 iter: 0 expect origin scale 2.0 to be backed off by 0.5 but got 2.0
```
verify the bug is fixed: `pytest test/distributed/fsdp/test_fsdp_sharded_grad_scaler.py -k test_sharded_grad_scaler_found_inf`
```
test/distributed/fsdp/test_fsdp_sharded_grad_scaler.py dist init r=1, world=8
dist init r=3, world=8
dist init r=7, world=8
dist init r=4, world=8
dist init r=6, world=8
dist init r=2, world=8
dist init r=0, world=8
dist init r=5, world=8
NCCL version 2.19.3+cuda12.0
. [100%]
====================================================================== 1 passed, 19 deselected in 27.43s =========================
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115710
Approved by: https://github.com/awgu
This PR adds support for torch.autograd.Function subclasses in compiled autograd. We do this by:
- Creating a uid for all torch.autograd.Function via its metaclass. This uid is used in the compiled autograd key, which is a subset of the cache key to the compiled graph
- "Lifting" the backward/saved_tensors, having them as input arguments in the compiled graph
- Creating proxies to track the backward's inputs and outputs. Since the backward's outputs (grads) have to match the forward's inputs, we pass the node's `input_info` (forward's input sizes) to build the proxies tracking the backward's outputs.
- Use a `FakeContext` class as a replacement for the autograd node's context object (`BackwardCFunction`) during tracing, only support passing saved_tensors from the forward to the backward
- Index each backward, to support multiple torch.autograd.Functions in the same graph
- Special case for `CompiledFunctionBackward`: lifting CompiledFunction will fail 4 tests and requires some skipfiles changes that I'd rather do in a separate PR
Example graph: test_custom_fn_saved_multiple_tensors (eager fw + compiled autograd)
```python
class MyFn(torch.autograd.Function):
@staticmethod
def forward(ctx, x, y):
ctx.save_for_backward(x, y)
return torch.sin(x), torch.sin(y)
@staticmethod
def backward(ctx, gO_x, gO_y):
(x, y) = ctx.saved_tensors
return gO_x * torch.cos(x), gO_y * torch.cos(y)
```
The backwards is lifted via `getitem_5` and `call_backward`
```python
# Compiled autograd graph
===== Compiled autograd graph =====
<eval_with_key>.0 class CompiledAutograd(torch.nn.Module):
def forward(self, inputs, sizes, hooks):
# No stacktrace found for following nodes
getitem: "f32[]" = inputs[0]
getitem_1: "f32[10]" = inputs[1]
getitem_2: "f32[10]" = inputs[2]
getitem_3: "f32[10]" = inputs[3]
getitem_4: "f32[10]" = inputs[4]; inputs = None
expand: "f32[10]" = torch.ops.aten.expand.default(getitem, [10]); getitem = None
mul: "f32[10]" = torch.ops.aten.mul.Tensor(expand, getitem_2); getitem_2 = None
mul_1: "f32[10]" = torch.ops.aten.mul.Tensor(expand, getitem_1); expand = getitem_1 = None
getitem_5 = hooks[0]; hooks = None
call_backward = torch__dynamo_external_utils_call_backward(getitem_5, (getitem_3, getitem_4), mul_1, mul); getitem_5 = mul_1 = mul = None
getitem_6: "f32[10]" = call_backward[0]
getitem_7: "f32[10]" = call_backward[1]; call_backward = None
accumulate_grad_ = torch.ops.inductor.accumulate_grad_.default(getitem_4, getitem_7); getitem_4 = getitem_7 = None
accumulate_grad__1 = torch.ops.inductor.accumulate_grad_.default(getitem_3, getitem_6); getitem_3 = getitem_6 = None
return []
```
then is later inlined by dynamo
```python
# Dynamo graph
===== __compiled_fn_0 =====
<eval_with_key>.1 class GraphModule(torch.nn.Module):
def forward(self, L_inputs_0_ : torch.Tensor, L_inputs_1_ : torch.Tensor, L_inputs_2_ : torch.Tensor, L_inputs_3_ : torch.Tensor, L_inputs_4_ : torch.Tensor):
getitem = L_inputs_0_
getitem_1 = L_inputs_1_
getitem_2 = L_inputs_2_
x = L_inputs_3_
y = L_inputs_4_
# File: <eval_with_key>.0:10, code: expand = torch.ops.aten.expand.default(getitem, [10]); getitem = None
expand = torch.ops.aten.expand.default(getitem, [10]); getitem = None
# File: <eval_with_key>.0:11, code: mul = torch.ops.aten.mul.Tensor(expand, getitem_2); getitem_2 = None
mul = torch.ops.aten.mul.Tensor(expand, getitem_2); getitem_2 = None
# File: <eval_with_key>.0:12, code: mul_1 = torch.ops.aten.mul.Tensor(expand, getitem_1); expand = getitem_1 = None
mul_1 = torch.ops.aten.mul.Tensor(expand, getitem_1); expand = getitem_1 = None
# File: /data/users/xmfan/core/pytorch/test/inductor/test_compiled_autograd.py:412, code: return gO_x * torch.cos(x), gO_y * torch.cos(y)
cos = torch.cos(x)
getitem_6 = mul_1 * cos; mul_1 = cos = None
cos_1 = torch.cos(y)
getitem_7 = mul * cos_1; mul = cos_1 = None
# File: <eval_with_key>.0:17, code: accumulate_grad_ = torch.ops.inductor.accumulate_grad_.default(getitem_4, getitem_7); getitem_4 = getitem_7 = None
accumulate_grad__default = torch.ops.inductor.accumulate_grad_.default(y, getitem_7); y = getitem_7 = None
# File: <eval_with_key>.0:18, code: accumulate_grad__1 = torch.ops.inductor.accumulate_grad_.default(getitem_3, getitem_6); getitem_3 = getitem_6 = None
accumulate_grad__default_1 = torch.ops.inductor.accumulate_grad_.default(x, getitem_6); x = getitem_6 = None
return ()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115573
Approved by: https://github.com/jansel
**Description**
Add dynamic quantization config for x86 inductor backend.
To support the QKV structure in self-attention, we removed an assertion in port-metadata-pass that requires single dequantize node after quantize node.
**Test plan**
```
python test/test_quantization.py -k TestQuantizePT2EX86Inductor.test_dynamic_quant_linear
python test/test_quantization.py -k TestQuantizePT2EX86Inductor.test_qat_dynamic_quant_linear
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115337
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
Previously, kwargs were incorrectly dispatched by passing them as the true kwargs to the torch function call. To fix, the kwargs of the original torch op need to be stored in a dictionary and passed as an argument to the torch function implementation.
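For illustration, a minimal sketch of the protocol being targeted: the kwargs of the original op arrive as a dict argument to `__torch_function__` rather than being splatted as real kwargs.
```python
import torch


class LoggingTensor(torch.Tensor):
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        # kwargs is the dict of the original op's keyword arguments.
        print(f"{func.__name__} called with kwargs dict: {kwargs}")
        return super().__torch_function__(func, types, args, kwargs)


x = torch.randn(3).as_subclass(LoggingTensor)
torch.sum(x, dim=0, keepdim=True)  # kwargs dict: {'dim': 0, 'keepdim': True}
```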
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117083
Approved by: https://github.com/drisspg
Which is faster than a ternary.
The following script
```python
import torch
from timeit import default_timer
global_setup = """
"""
setup = """
c10::SymInt a = c10::SymInt(123);
"""
code = """
-a;
"""
from torch.utils.benchmark import Timer
t = Timer(stmt=code, setup=setup, global_setup=global_setup, language="c++", timer=default_timer)
print(t.blocked_autorange())
```
reports a median time of 4.17 ns before and 3.61 ns after on x86_64 Linux, and 2.02 ns before and 1.91 ns after on Apple M1
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117015
Approved by: https://github.com/albanD
Inductor codegen for `_assert_async` is currently disabled because we don't really understand how to codegen `scalar_to_tensor` on a Sympy expression. I initially tried to see if I could get this to work, but I got into some weird problem involving stride sorting, so I decided to fix it properly by not going through a tensor.
So we introduce an `_assert_scalar` which takes a scalar as an argument, avoiding needing to turn a SymBool into a tensor before asserting on it. I also add `_functional_assert_scalar` for good luck, although this doesn't do anything right now because https://github.com/pytorch/pytorch/pull/104203 still hasn't been landed.
I need to customize the codegen for this operator, so I decided to implement it directly in Inductor rather than trying to treat it as a generic ExternKernel. This leads to the new AssertScalar IR node. This is written carefully so that it doesn't get DCE'd by Inductor.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114148
Approved by: https://github.com/jansel
Summary:
config.is_predispatch is a config to instruct inductor to enable predispatch
tracing (high level pre-dispatch IR). Currently, there is no dedicated pass
for this config.
In this commit, for better pass function management, we created
`predispatch_pass` to hold the pass functions to be run on the high level
pre-dispatch IR-based graphs.
Test Plan: CI
Differential Revision: D52491332
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116788
Approved by: https://github.com/frank-wei
reland of https://github.com/pytorch/pytorch/pull/116559, which was reverted by internal.
The underlying reason for the revert is that torch.dynamo.disable can't be used in the
pytorch codebase, as it conflicts with torch.deploy; although the latter
only runs some inference, it somehow takes a weird dependency on fsdp.
We have seen this issue with our functional collectives: we can't
use any dynamo components, otherwise torch.deploy would complain.
Verified internally that after removing torch.dynamo.disable the test
passes again.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117020
Approved by: https://github.com/awgu
Instead of printing the tensor's data, print its dtype and shape metadata.
```
Executing: <VarMeanBackward0 object at 0x1352d0e20> with grad_outputs: [None,f32[]]
```
This is important in order to avoid doing a cuda sync and also useful to reduce verbosity.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116523
Approved by: https://github.com/albanD
Through the new dynamic_shapes API and using torch.export.Dim, dimensions that are equal will now be represented by the same symbol, so we no longer need to store `equality_constraints`.
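For context, a minimal sketch (toy module assumed) of how reusing one `Dim` object expresses equal dimensions, which is what makes a separate `equality_constraints` store unnecessary:
```python
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x, y):
        return x + y

batch = Dim("batch")
ep = export(
    M(),
    (torch.randn(4, 8), torch.randn(4, 8)),
    # the same Dim object is used for both inputs, so both dims share one symbol
    dynamic_shapes={"x": {0: batch}, "y": {0: batch}},
)
print(ep.range_constraints)
```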
Differential Revision: D52351705
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116979
Approved by: https://github.com/avikchaudhuri
After https://github.com/pytorch/pytorch/pull/116595/files compiling every .cu file results in
```
/home/nshulga/git/pytorch/pytorch/c10/util/Half.h(450): warning #173-D: floating-point value does not fit in required integral type
-static_cast<uint64_t>(f) > static_cast<uint64_t>(limit::max()));
^
detected during:
instantiation of "std::enable_if_t<<expression>, __nv_bool> c10::overflows<To,From>(From, __nv_bool) [with To=float, From=int64_t]" at line 159 of /home/nshulga/git/pytorch/pytorch/c10/util/TypeCast.h
instantiation of "To c10::checked_convert<To,From>(From, const char *) [with To=float, From=int64_t]" at line 122 of /home/nshulga/git/pytorch/pytorch/c10/core/Scalar.h
Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
/home/nshulga/git/pytorch/pytorch/c10/util/Half.h(450): warning #173-D: floating-point value does not fit in required integral type
-static_cast<uint64_t>(f) > static_cast<uint64_t>(limit::max()));
^
detected during:
instantiation of "std::enable_if_t<<expression>, __nv_bool> c10::overflows<To,From>(From, __nv_bool) [with To=float, From=uint64_t]" at line 159 of /home/nshulga/git/pytorch/pytorch/c10/util/TypeCast.h
instantiation of "To c10::checked_convert<To,From>(From, const char *) [with To=float, From=uint64_t]" at line 122 of /home/nshulga/git/pytorch/pytorch/c10/core/Scalar.h
/home/nshulga/git/pytorch/pytorch/c10/util/Half.h(450): warning #173-D: floating-point value does not fit in required integral type
-static_cast<uint64_t>(f) > static_cast<uint64_t>(limit::max()));
^
detected during:
instantiation of "std::enable_if_t<<expression>, __nv_bool> c10::overflows<To,From>(From, __nv_bool) [with To=double, From=int64_t]" at line 159 of /home/nshulga/git/pytorch/pytorch/c10/util/TypeCast.h
instantiation of "To c10::checked_convert<To,From>(From, const char *) [with To=double, From=int64_t]" at line 122 of /home/nshulga/git/pytorch/pytorch/c10/core/Scalar.h
/home/nshulga/git/pytorch/pytorch/c10/util/Half.h(450): warning #173-D: floating-point value does not fit in required integral type
-static_cast<uint64_t>(f) > static_cast<uint64_t>(limit::max()));
^
detected during:
instantiation of "std::enable_if_t<<expression>, __nv_bool> c10::overflows<To,From>(From, __nv_bool) [with To=double, From=uint64_t]" at line 159 of /home/nshulga/git/pytorch/pytorch/c10/util/TypeCast.h
instantiation of "To c10::checked_convert<To,From>(From, const char *) [with To=double, From=uint64_t]" at line 122 of /home/nshulga/git/pytorch/pytorch/c10/core/Scalar.h
/home/nshulga/git/pytorch/pytorch/c10/util/Half.h(450): warning #173-D: floating-point value does not fit in required integral type
-static_cast<uint64_t>(f) > static_cast<uint64_t>(limit::max()));
^
detected during:
instantiation of "std::enable_if_t<<expression>, __nv_bool> c10::overflows<To,From>(From, __nv_bool) [with To=c10::complex<float>, From=int64_t]" at line 159 of /home/nshulga/git/pytorch/pytorch/c10/util/TypeCast.h
instantiation of "To c10::checked_convert<To,From>(From, const char *) [with To=c10::complex<float>, From=int64_t]" at line 122 of /home/nshulga/git/pytorch/pytorch/c10/core/Scalar.h
/home/nshulga/git/pytorch/pytorch/c10/util/Half.h(450): warning #173-D: floating-point value does not fit in required integral type
-static_cast<uint64_t>(f) > static_cast<uint64_t>(limit::max()));
^
detected during:
instantiation of "std::enable_if_t<<expression>, __nv_bool> c10::overflows<To,From>(From, __nv_bool) [with To=c10::complex<float>, From=uint64_t]" at line 159 of /home/nshulga/git/pytorch/pytorch/c10/util/TypeCast.h
instantiation of "To c10::checked_convert<To,From>(From, const char *) [with To=c10::complex<float>, From=uint64_t]" at line 122 of /home/nshulga/git/pytorch/pytorch/c10/core/Scalar.h
/home/nshulga/git/pytorch/pytorch/c10/util/Half.h(450): warning #173-D: floating-point value does not fit in required integral type
-static_cast<uint64_t>(f) > static_cast<uint64_t>(limit::max()));
^
detected during:
instantiation of "std::enable_if_t<<expression>, __nv_bool> c10::overflows<To,From>(From, __nv_bool) [with To=c10::complex<double>, From=int64_t]" at line 159 of /home/nshulga/git/pytorch/pytorch/c10/util/TypeCast.h
instantiation of "To c10::checked_convert<To,From>(From, const char *) [with To=c10::complex<double>, From=int64_t]" at line 122 of /home/nshulga/git/pytorch/pytorch/c10/core/Scalar.h
/home/nshulga/git/pytorch/pytorch/c10/util/Half.h(450): warning #173-D: floating-point value does not fit in required integral type
-static_cast<uint64_t>(f) > static_cast<uint64_t>(limit::max()));
^
detected during:
instantiation of "std::enable_if_t<<expression>, __nv_bool> c10::overflows<To,From>(From, __nv_bool) [with To=c10::complex<double>, From=uint64_t]" at line 159 of /home/nshulga/git/pytorch/pytorch/c10/util/TypeCast.h
instantiation of "To c10::checked_convert<To,From>(From, const char *) [with To=c10::complex<double>, From=uint64_t]" at line 122 of /home/nshulga/git/pytorch/pytorch/c10/core/Scalar.h
```
Fix it by using `if constexpr` to avoid calling `static_cast<uint64_t>` for any floating-point type.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117023
Approved by: https://github.com/albanD
Summary:
In torch/_inductor/lowering.py (https://fburl.com/code/jd58vxpw), we are using
```
fallback_cumsum(x, dim=axis, dtype=dtype)
```
so this will treat `x` as args, `dim` and `dtype` as kwargs from https://fburl.com/code/cikchxp9
The issue has been fixed by D52530506 for OSS but not Meta internal. This diff addresses the Meta internal issue with some refactoring so that both Meta internal and OSS can use the same helper function. The diff also adds some debug logging.
Test Plan:
before
```
aoti_torch_proxy_executor_call_function(proxy_executor, 2, 1, std::vector<int64_t>{torch.int64}.data(), 2, std::vector<AtenTensorHandle>{buf702, buf708}.data());
```
after
```
aoti_torch_proxy_executor_call_function(proxy_executor, 2, 1, std::vector<int64_t>{0}.data(), 2, std::vector<AtenTensorHandle>{buf702, buf708}.data());
```
so `torch.int64` changed to `0`
Differential Revision: D52532031
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116779
Approved by: https://github.com/desertfire, https://github.com/chenyang78
When originally authored, it was not necessary to unconditionally apply
hermetic mode, but I chose to apply it in eager mode to help catch bugs.
Well, multipy is kind of dead, and hermetic mode is causing real
implementation problems for people who want to do fancy Python stuff
from the dispatcher. So let's yank this mode for now.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116996
Approved by: https://github.com/jansel
Improve perf of vec512 bfloat16 (and also float16) load and store with partial vector lanes by using masked load/store instead of going through `memcpy` with an aux buffer. In the inductor CPU backend, we load/store half (16) of the vector lanes for bfloat16 and float16.
Using the following micro-benchmark script for `layernorm + add`:
```python
import torch
import torch.nn as nn
from benchmark_helper import time_with_torch_timer

class AddLayernorm(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.ln = nn.LayerNorm(hidden_size)

    def forward(self, hidden_states):
        return hidden_states + self.ln(hidden_states)

hidden_states = torch.randn(1, 512, 1024).to(torch.bfloat16)
# module instantiation assumed; the hidden size matches the last input dim
add_ln = AddLayernorm(1024).to(torch.bfloat16)

with torch.no_grad():
    compiled_add_ln = torch.compile(add_ln)
    print(time_with_torch_timer(compiled_add_ln, hidden_states, iters=10000))
```
Measured on single-core `Intel(R) Xeon(R) Platinum 8358 CPU`.
Before: 1.39 ms
After: 498.66 us
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116961
Approved by: https://github.com/sanchitintel, https://github.com/leslie-fang-intel
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/116492. `linear` will be decomposed into `bmm` when the input dim exceeds 2 and the input is not contiguous. Fix this issue by converting the pattern back into `qlinear`. This PR focuses on the int8_fp32 case; the int8_bf16 case will follow in the next PR.
**Test Plan**
```
python -u -m pytest -s -v test_mkldnn_pattern_matcher.py -k test_qlinear_input_dim_exceeds_2_and_not_contiguous
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116599
Approved by: https://github.com/jgong5
ghstack dependencies: #116937
This PR is to create a separate CI job for inductor UTs on ROCm. You will need to add `ciflow/inductor` tag on PRs to trigger this job. However, the job will run on its own on any commit merged in main. This job takes around 1.5 hours to run and it is run in parallel to other rocm jobs. It is run only on the MI210 CI runners to ensure maximum inductor functionality is tested.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110544
Approved by: https://github.com/jithunnair-amd, https://github.com/jansel, https://github.com/huydhn
**Summary**
Similar as https://github.com/pytorch/pytorch/pull/115153, when we do the `qconv_binary` fusion with post op sum, we also need to ensure that: all users of the extra input in this pattern should be ancestor nodes of the compute node, except for the binary node connected to the compute node.
Also renamed some variables in this diff:
- Change name of `qconv2d_node_after_weight_prepack` to `compute_node`
- Change name of `extra_input_node` to `extra_input_of_binary_node`
**Test Plan**
```
python -u -m pytest -s -v test_mkldnn_pattern_matcher.py -k test_qconv2d_add_3
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115809
Approved by: https://github.com/jgong5
ghstack dependencies: #115153
Summary: fixed an import problem for test_xnnpack_quantizer so that it can run in CI
Test Plan:
internal CI
sanity check: buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- --exact 'caffe2/test/quantization:test_quantization - test_conv2d (caffe2.test.quantization.pt2e.test_xnnpack_quantizer.TestXNNPACKQuantizer)'
Differential Revision: D52576449
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116911
Approved by: https://github.com/mcr229
Fixes#115595
As an aside, there are currently no tests checking the output of `torch.signal.windows.kaiser` against the output of scipy's implementation, which is what is done with `torch.kaiser_window`. The same goes for the other window functions in `torch.signal.windows`. I did some tests on my end, but I'm not sure what the best practice is, so I haven't included them for now.
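For reference, a test along these lines could compare the two directly (a sketch, assuming scipy is available; tolerances are illustrative):
```python
import numpy as np
import torch
from scipy.signal import windows

t = torch.signal.windows.kaiser(10, beta=12.0, sym=True, dtype=torch.float64)
s = windows.kaiser(10, beta=12.0, sym=True)
np.testing.assert_allclose(t.numpy(), s, rtol=1e-6, atol=1e-6)
```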
@gchanan @mruberry
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116470
Approved by: https://github.com/ezyang
Here's the problem: if we support unsigned integer types, and in particular if we support uint64_t, we need a way to represent these integers in Scalar. However, Scalar currently stores all integral values inside int64_t, which is not wide enough to accommodate all possible uint64_t values. So we need to do something to Scalar to support it.
The obvious thing to do is add a uint64_t field to the union, and use it in some situations. But when should we use it? The proposal is that we use it if and only if the integer in question is not representable in int64_t. The historical precedent for this is our handling of uint8_t. Because this type is representable inside int64_t, we have historically stored it inside Scalar as an int64_t. In general, the concept behind Scalar is that it doesn't know the signedness/unsignedness/bitwidth of its input; in particular, we typically construct Scalar from a Python int, which doesn't have any concept of how wide the integer is! So it doesn't make any sense to allow a small integer like 255 to be representable under both the HAS_i tag and the HAS_u tag. So we forbid the latter case.
Although I have proposed this, the PR as currently written just chokes when you pass it a uint64_t that's too big. There's some more logic that would have to be written out for this. I'm putting this out to start to get some agreement that this is the way to do it.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116595
Approved by: https://github.com/albanD
Adds bfloat16 to fractional_max_pool. If an op supports fp32 and fp16, it really should support bf16 as well, for the most part. Most, but not all, ops satisfy this, so I am adding support for the few that do not.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116950
Approved by: https://github.com/lezcano
Whenever the monitor thread kills the watchdog thread for being stuck,
we do so to save cluster time and get a faster failure signal, but we
want to know more about why it got stuck.
One possible reason for watchdog stuckness is GIL contention, which
could be ruled out or observed by making an attempt to acquire the GIL
at exit time.
If we cannot acquire the GIL within a short time window (1s) we abort
the attempt and report GIL contention, otherwise we report that GIL was
acquired successfully.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116798
Approved by: https://github.com/zdevito
This prepares the PR where we implement sets in terms of dicts.
To do so, rather than storing internally a dictionary that maps literals
to VariableTrackers, it stores (pretty much) a dictionary from VTs to VTs.
To achieve this, keys are wrapped in an opaque internal class _Hashable.
The _Hashable class is opaque on purpose so that it fails hard if
it inadvertently leaks back into user code.
We also found and fixed a number of latent bugs and inconsistencies
in the way dynamo checked what can be a dict key. More generally, we
make it much clearer which things need to be modified to add
a new supported key type to Dicts.
Fixes [#107595](https://www.internalfb.com/tasks?t=107595)
Fixes [#111603](https://www.internalfb.com/tasks?t=111603)
Re-PR of https://github.com/pytorch/pytorch/pull/111196 sadly due to reverts, we could not reuse @lezcano's original PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116785
Approved by: https://github.com/mlazos
Summary:
We intend to preserve autograd ops for predispatch export. Therefore, we
need to exempt the autograd ops in some places, e.g. verifier and
proxy_tensor.py.
Test Plan:
python test/export/test_export.py -k test_predispatch_export_with_autograd_op
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116527
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #116339
Summary:
As current export doesn't support training, grad mode ops don't
make sense. To avoid confusion, we choose to error out early if there
exist grad mode ops.
Test Plan:
python test/export/test_safeguard.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116339
Approved by: https://github.com/tugsbayasgalan
Where the base of the stack targets the default branch rather than `base`. But as the
default branch is likely to have advanced since the PR was made, search for the
merge base before determining whether `base`..`head` are in sync with the `orig` branch.
Also, rather than hardcode default branch name, fetch it from `GitHubPR.default_branch()`
Test Plan: https://github.com/malfet/deleteme/pull/77
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116873
Approved by: https://github.com/ezyang
Use a similar approach with Sleef as in #99550
to improve the precision and extremal value handling of the `abs` function for complex tensors.
This fixes
- test_reference_numerics_extremal__refs_abs_cpu_float64
- test_reference_numerics_extremal__refs_abs_cpu_float128
which failed on PPC.
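The extremal-value semantics exercised by these tests (a complex number with an infinite component has absolute value inf, even if the other component is NaN, matching hypot/cabs behavior) can be checked directly; a small sketch:
```python
import torch

t = torch.tensor(
    [complex(float("inf"), 1.0), complex(float("nan"), float("inf"))],
    dtype=torch.complex128,
)
print(torch.abs(t))  # both entries are expected to be inf
```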
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116859
Approved by: https://github.com/lezcano
**problem**
when prefetching for the next forward, the current forward may be annotated with
`@torch.no_grad`. `param.grad_fn` then stays None during prefetching, so
`_post_backward_hook` never gets triggered.
repro
```pytest test/distributed/fsdp/test_fsdp_freezing_weights.py```
**solution**
this PR enables autograd during prefetching (`_use_unsharded_views`), so
`param.grad_fn` is properly assigned for the next forward.
A longer-term fix would be moving `_use_unsharded_views` out of
`_prefetch_handle` and putting it in `_pre_forward_unshard`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116792
Approved by: https://github.com/awgu
Part 3 of implementation for general [subclass view fake-ification](https://docs.google.com/document/d/1C5taWiplmX7nKiURXDOAZG2W5VNJ2iV0fQFq92H0Cxw).
Changes codegen to generate `view_func()` / `rev_view_func()` by default for python subclasses. With `view_func()` existing more often now, the lazy view rebase logic [here](f10c3f4184/torch/csrc/autograd/variable.cpp (L665-L695)) causes some slight behavior changes for in-place ops on views:
* Additional view nodes are inserted into output graphs, changing their string representation, although they are functionally the same. The extra nodes are removed in AOTAutograd's DCE pass.
* When `t` is a `FunctionalTensor`, calling `t.grad_fn` will now invoke `view_func()`; we need to make sure we're operating in a `FunctionalTensorMode` so the view op calls succeed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116512
Approved by: https://github.com/bdhirsh, https://github.com/soulitzer
ghstack dependencies: #115894
Part 2 of implementation for general [subclass view fake-ification](https://docs.google.com/document/d/1C5taWiplmX7nKiURXDOAZG2W5VNJ2iV0fQFq92H0Cxw).
Details:
* Codegen `rev_view_func()` alongside `view_func()`
* Reverse view_func gives you a "base" from a "view": `rev_view_func(new_view) -> new_base` AKA it plays the original view backwards
* Utilizes the functional inverses defined in `FunctionalInverses.cpp`, passing `InverseReturnMode::AlwaysView`
* Manually implements functional inverses for `narrow()` and `chunk()`
* **NB: Multi-output views now set view_func() / rev_view_func() for each of the output views!**
* Due to this, the `as_view()` overload that operates on a list of views is scrapped in favor of iteration via codegen
Example codegen in `ADInplaceOrViewTypeN.cpp`:
```cpp
at::Tensor narrow(c10::DispatchKeySet ks, const at::Tensor & self, int64_t dim, c10::SymInt start, c10::SymInt length) {
auto _tmp = ([&]() {
at::AutoDispatchBelowADInplaceOrView guard;
return at::_ops::narrow::redispatch(ks & c10::after_ADInplaceOrView_keyset, self, dim, start, length);
})();
std::function<at::Tensor(const at::Tensor&)> func=nullptr;
std::function<at::Tensor(const at::Tensor&)> rev_func=nullptr;
if (false || !self.unsafeGetTensorImpl()->support_as_strided() ||
c10::AutogradState::get_tls_state().get_view_replay_enabled()) {
func = [=](const at::Tensor& input_base) {
return at::_ops::narrow::call(input_base, dim, start, length);
};
rev_func = [=](const at::Tensor& input_view) {
// NB: args from narrow() signature are passed along to the inverse
return at::functionalization::FunctionalInverses::narrow_copy_inverse(self, input_view, at::functionalization::InverseReturnMode::AlwaysView, dim, start, length);
};
}
auto result = as_view(/* base */ self, /* output */ _tmp, /* is_bw_differentiable */ true, /* is_fw_differentiable */ true, /* view_func */ func, /* rev_view_func */ rev_func, /* creation_meta */ InferenceMode::is_enabled() ? CreationMeta::INFERENCE_MODE : (at::GradMode::is_enabled() ? CreationMeta::DEFAULT : CreationMeta::NO_GRAD_MODE));
return result;
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115894
Approved by: https://github.com/soulitzer
As per title. With this, https://github.com/pytorch/pytorch/issues/93697
does not choke, but spits out many of these:
```
[ERROR] Name: "L['self']"
[ERROR] Source: local
[ERROR] Create Function: NN_MODULE
[ERROR] Guard Types: ['ID_MATCH']
[ERROR] Code List: ["___check_obj_id(L['self'], 139962171127504)"]
[ERROR] Object Weakref: <weakref at 0x7f4b72f7c9a0; to
'ActorCriticPolicy' at 0x7f4b7b7df6d0>
[ERROR] Guarded Class Weakref: <weakref at 0x7f4afbd08b30; to
'ABCMeta' at 0x56463a727840 (ActorCriticPolicy)>
[ERROR] Created at:
[ERROR] File
"/home/lezcano/git/pytorch/pytorch/torch/_dynamo/variables/builder.py",
line 248, in __call__
[ERROR] vt = self._wrap(value)
[ERROR] File
"/home/lezcano/git/pytorch/pytorch/torch/_dynamo/variables/builder.py",
line 474, in _wrap
[ERROR] return self.wrap_module(value)
[ERROR] File
"/home/lezcano/git/pytorch/pytorch/torch/_dynamo/variables/builder.py",
line 941, in wrap_module
[ERROR] return self.tx.output.register_attr_or_module(
[ERROR] File
"/home/lezcano/git/pytorch/pytorch/torch/_dynamo/output_graph.py", line
735, in register_attr_or_module
[ERROR] install_guard(source.make_guard(GuardBuilder.NN_MODULE))
[ERROR] Error while creating guard:
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116460
Approved by: https://github.com/jansel
ghstack dependencies: #116459
- Enable Sonoma testing on M2 machines
- Add 70+ ops to the list of supported ones on MacOS Sonoma
- Enable nn.functional.
- Add explicit `TORCH_CHECK` to mark scatter/gather, index_select and linalg ops as not yet supporting Complex, as attempts to call those will crash with various MPS asserts such as:
```
(mpsFileLoc): /AppleInternal/Library/BuildRoots/0032d1ee-80fd-11ee-8227-6aecfccc70fe/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:96:0: error: 'mps.reduction_min' op operand #0 must be tensor of MPS type values or memref of MPS type values, but got 'tensor<5x5xcomplex<f32>>'
(mpsFileLoc): /AppleInternal/Library/BuildRoots/0032d1ee-80fd-11ee-8227-6aecfccc70fe/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:96:0: note: see current operation: %3 = "mps.reduction_min"(%1, %2) <{keep_dims}> : (tensor<5x5xcomplex<f32>>, tensor<2xsi32>) -> tensor<1x1xcomplex<f32>>
```
- Treat bools as int8 to fix regression re-surfaced in `index_fill` (used to be broken in Monterey, then fixed in Ventura and broken in Sonoma again)
- `nn.functional.max_pool2d` results now match CPU output for uint8 dtype in Sonoma
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116764
Approved by: https://github.com/kulinseth, https://github.com/seemethere
Fixes https://github.com/pytorch/pytorch/issues/116385
Don't call `torch._transformer_encoder_layer_fwd` when `bias=False`
`bias=False` was not something that `torch._transformer_encoder_layer_fwd` was meant to work with; it was my bad that this wasn't tested when I approved https://github.com/pytorch/pytorch/pull/101687.
`bias=False` was causing the `tensor_args` in [`TransformerEncoder`](a17de2d645/torch/nn/modules/transformer.py (L663-L677)) to contain `None`s and error on checks for the fastpath like `t.requires_grad for t in tensor_args`.
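A minimal repro sketch of the failing configuration (sizes are illustrative):
```python
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=8, nhead=2, bias=False, batch_first=True)
enc = nn.TransformerEncoder(layer, num_layers=1).eval()
x = torch.randn(2, 4, 8)
with torch.no_grad():
    out = enc(x)  # previously tripped over the None biases in the fastpath checks
```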
Alternative fixes would be to:
1) Pass `torch.zeros_like({*}.weight)` to the kernel when `bias=False` and filter `tensor_args` as appropriate
2) Fix `torch._transformer_encoder_layer_fwd` to take `Optional<Tensor>` for biases and fix the kernels as appropriate
Let me know if these approaches are preferable
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116760
Approved by: https://github.com/jbschlosser
Note about the updates:
This PR:
1. Skips more flash-attention-related UTs on MI200
2. Fixes additional ATen compiling errors after hipification
3. Fixes the author "root" of a specific commit
4. Includes the patch from Nikita in favor of block-level static initialization.
CAVEAT: This revised PR has a commit that modifies the CI to force its running on MI200 nodes. That specific commit must be reverted before merge.
Original PR (https://github.com/pytorch/pytorch/pull/114309) Note:
This pull requests add initial Flash Attention support for AMD/ROCM platform. It added a specialized Triton repository/branch as a compile-time dependency for Flash Attention math library on AMD/ROCM. This triton submodule is not used at runtime and will not be shipped to the final pytorch package. We have the plan to release this specialized Triton as a separate project.
Known limitations:
- Only supports MI200 series GPUs (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`).
- Only supports power of two sequence lengths.
- No support for varlen APIs.
- Only support head dimension 16,32,64,128.
- Performance is still being optimized.
Fixes#112997
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115981
Approved by: https://github.com/malfet
Summary:
If heartbeat monitor times out and kills the process, we want to know why.
It's convenient to use an internal tool for this, but we plan to later
integrate with torchelastic to call into pyspy or something else, which will be
both better (including py stacks) and compatible with OSS.
Test Plan: tested manually, observed c++ stacktraces were dumped
Reviewed By: fduwjj
Differential Revision: D52370243
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116717
Approved by: https://github.com/zdevito
Summary: Now we can allocate an AOTIModelContainerRunner object instead of relying on torch.utils.cpp_extension.load_inline. Also renamed AOTInductorModelRunner to AOTIRunnerUtil in this PR.
Test Plan: CI
Reviewed By: khabinov
Differential Revision: D52339116
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116269
Approved by: https://github.com/khabinov
# Motivation
As mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), The first runtime component we would like to upstream is `Device` which contains the device management functions of Intel GPU's runtime. To facilitate the code review, we split the code changes into 4 PRs. This is one of the 4 PRs and covers the changes under `c10`.
# Design
Intel GPU device is a wrapper of a sycl device on which kernels can be executed. In our design, we will maintain a sycl device pool containing all the GPU devices of the current machine, and manage the status of the device pool in PyTorch. Thread safety is considered in this design. The corresponding C++ files related to `Device` will be placed in the c10/xpu folder. And we provide the c10 device runtime APIs, like
- `c10::xpu::device_count`
- `c10::xpu::set_device`
- ...
# Additional Context
In our plan, 4 PRs should be submitted to PyTorch for `Device`:
1. for c10
2. for aten
3. for python frontend
4. for lazy initialization shared with CUDA
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116019
Approved by: https://github.com/gujinghui, https://github.com/jgong5, https://github.com/EikanWang, https://github.com/malfet
Summary:
Currently we have a very ugly specialization on edge dialect in verifier like the following:
```
# TODO Remove this branch.
if ep.dialect == "EDGE": # !!! Don't change this allowlist. !!!
pass
else:
raise e
```
In this diff we do some additional work to make signature checking also work in exir. We decouple the transformation stacks in torch export and exir so that different layers of the stack can evolve in their own fashion and the team can divide and conquer them separately.
Test Plan: CI
Differential Revision: D52499225
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116705
Approved by: https://github.com/tugsbayasgalan
This is a proof-of-concept implementation of how people can use a marker `mark_strict` to enable torchdynamo while exporting under non-strict mode. The main idea is that `mark_strict` will turn into an HOO which then utilizes dynamo to do correctness analysis in the same way that torch.cond works today. There are some notable limitations:
1. This API is not meant for public use yet
2. Strict region can't work with arbitrary container inputs
3. We don't preserve `nn_module_stack` and other node metadata for the strict region.
4. strict_mode HOO will show up in the final graph. This is undesirable in the long term, but for short term experiments, it should be good enough. Will fix this in the follow up PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114658
Approved by: https://github.com/ydwu4
Summary:
Currently, the code is working. We know this because we observe heartbeat
timeouts.
However, there is a chance that if the code were refactored, the compiler could
optimize away the load of heartbeat_ inside heartbeatMonitor, and we wouldn't
know.
Using atomic here is not really for thread synchronization, but more to ensure
compiler optimizations (hoisting the read outside the loop) can never be
allowed to happen. Again, we know this isn't currently happening because if it
were, it would not be an intermittent failure, it would be a consistent failure
(at least with a fixed compiler/platform).
I previously avoided atomic because we didn't want shared locks between the heartbeat
monitor and watchdog thread. Why? If the watchdog held the lock and hung, the monitor
could also hang. However, this really can't happen (AFAIK) when using an
atomic.
Test Plan: existing CI tests
Differential Revision: D52378257
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116702
Approved by: https://github.com/fduwjj, https://github.com/zdevito
Stacks recorded when tensors are being freed during exit could
try to acquire the GIL. Py_IsInitialized can be used to check if we
are post Python exit and should not attempt to acquire the GIL.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116709
Approved by: https://github.com/aaronenyeshi
**Summary**
Take this Pattern as example
```
# ReLU
# / \
# Conv1
# / \
# Conv2
# \ /
# Add
```
The current `ConvBinaryInplace` check will fail to perform Inplace fusion (using outplace fusion instead) due to `ReLU` having 2 users. However, if all users of `ReLU` are ancestor nodes of `Conv2`, we should be able to proceed with the `ConvBinaryInplace` fusion. This diff relaxes the `ConvBinaryInplace` check accordingly.
**TestPlan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k test_conv2d_binary_inplace_fusion_pass_cpu
python -m pytest test_mkldnn_pattern_matcher.py -k test_conv2d_binary_inplace_fusion_failed_cpu
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115153
Approved by: https://github.com/CaoE, https://github.com/jgong5
We change manually_set_subgraph_inputs to three modes: manual, automatic and flatten_manual. The flatten_manual mode will first flatten the sub_args and then recursively call set_subgraph_inputs = "manual". This allows us to control the order in which the placeholders show up in the graph, which is necessary for map, where we want to keep the mapped arguments before the rest of the positional arguments.
Right now, map only takes a single tensor as the mapped argument, but it becomes pretty easy to match the subgraph inputs to the original proxies if we have a "flatten_manual" option.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115853
Approved by: https://github.com/zou3519
Ran into the following exception during C++ file compilation.
```
error: use of undeclared identifier 'aoti_torch_scatter_reduce_out'
aoti_torch_scatter_reduce_out(buf12, buf12,0,buf13,buf14, "sum",1);
^
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116700
Approved by: https://github.com/aakhundov
This PR increases the read size for the `hub.download_url_to_file` function from 8,192 bytes to 131,072 bytes (128 * 1,024), as reading in larger chunks should be more efficient. The size could probably be larger still, at the expense of the progress bar not getting updated as often.
It re-introduces use of the `READ_DATA_CHUNK` constant that was originally used for this purpose in 4a3baec961 and since forgotten.
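For reference, the function under discussion is used like this (URL and destination are illustrative):
```python
import torch

# downloads in chunks, updating the progress bar as each chunk is read
torch.hub.download_url_to_file(
    "https://example.com/checkpoints/model.pt", "/tmp/model.pt", progress=True
)
```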
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116536
Approved by: https://github.com/NicolasHug
Updated range_constraints to be the union of shape_env.var_to_range and shape_env.runtime_var_to_range, with shape_env.runtime_var_to_range taking priority.
Due to 0/1 specialization, if we bound an unbacked symint to be less than 5, the range of possible values for this symint is actually recorded as [2, 5] in shape_env.var_to_range. To fix this so that users will be able to see a more understandable range of [0, 5], shape_env.runtime_var_to_range was created to store the range of [0, 5]. Since range_constraints is a user-facing attribute to query the ranges of certain symints, we want to use shape_env.runtime_var_to_range to get the unbacked symints ranges, rather than shape_env.var_to_range.
Additionally, run_decompositions() has an issue where it will always add assertions to the graph, even if a previous run has already added the assertions. So, I added a part to the AddRuntimeAssertionsForInlineConstraints which will store which assertions have already been added.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115427
Approved by: https://github.com/zhxchen17
This pull request simplifies and cleans up the code that uses the cuDNN library for convolution, batch normalization, CTC loss, and quantized operations. It removes the unnecessary checks and conditions for older cuDNN versions and the experimental cuDNN v8 API, and ~~replaces them with the stable `cudnn_frontend` API that requires cuDNN v8 or higher. It also adds the dependency and configuration for the `cudnn_frontend` library in the cmake and bazel files.~~ Correction: The v7 API will still be available with this PR, and can still be used, without any changes to the defaults. This change simply always _builds_ the v8 API, and removes the case where _only_ the v7 API is built.
This is a re-land of https://github.com/pytorch/pytorch/pull/91527
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95722
Approved by: https://github.com/malfet, https://github.com/atalman
Fixes#116504
When this API is invoked with an NPU acceleration device, a runtime error occurs: the input tensor is not processed at one branch, so some input tensors end up on the CPU and some on the NPU, and an error is reported.
Here, I adapt to other acceleration devices and move the tensor on the acceleration device to the CPU. It's tested and works.
The details are in the issue:#116504
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116682
Approved by: https://github.com/lezcano
Context: The existing FSDPExtension has a bug when the tensor
unflattening involves compute/communication on a CUDA stream. The
current FSDPExtension logic unflattens the tensor in the unshard
stream, which makes the runtime lose sync with the compute stream;
if there are dependencies between the compute stream and the
unflatten-tensor logic, the sync point is currently lost, which could
possibly lead to NaNs.
This PR makes the FSDPExtension record the compute stream and lets
DTensorExtension directly use the compute stream for unflatten_tensor.
In the long term we might want to make the FSDP runtime logic only
perform the unshard in the unshard stream, and have the unshard views
happen in the compute stream. We currently fix this in the Extension
directly as this is the simplest thing to do without affecting FSDP
runtime logic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116559
Approved by: https://github.com/awgu, https://github.com/fduwjj, https://github.com/yifuwang
ghstack dependencies: #116426
Summary: Sometimes we will get a None value from ops which return a Tensor type in the schema. Allow this case during serialization.
Test Plan: test__scaled_dot_product_flash_attention
Differential Revision: D52491668
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116664
Approved by: https://github.com/SherlockNoMad
Summary:
- add workMetaList_.size() so we know how many outstanding works there
were when killing
- Print our first log before debuginfo dump instead of after, since it
is clearer when reading the logs that we time out and then dump
- Organize the log strings- put them near where they are used
cc mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l yf225
imported-using-ghimport
Test Plan: Imported from OSS
Reviewed By: fduwjj
Differential Revision: D52369167
Pulled By: wconstab
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116661
Approved by: https://github.com/fduwjj
Summary:
Making union types harder to use wrong:
1. Initialize unset fields still with None, but don't assert on the uniqueness of the non-None field, since it's possible to set a real field to None.
2. Raise an error on unset fields in a union, reducing the error surface and enforcing type safety.
3. Serialize a union type with only its tag and omit all the unset fields; this makes the serialized model more readable and debuggable.
Test Plan:
buck test mode/opt caffe2/test:test_export
buck test mode/opt executorch/exir/...
buck test mode/opt mode/inplace aps_models/ads/icvr/tests:export_test
Differential Revision: D52446586
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116511
Approved by: https://github.com/angelayi
Previously, we had the writer registered to each NCCL PG (backend), so for every PG we have an NCCL PG instance; if a customized writer is used when multiple sub-PGs exist, the user needs to register the writer for every backend, which is bad UX. Furthermore, the debug info is global, so it does not make sense to have a writer for each instance. We even have a static mutex in `dumpDebuggingInfo` to ensure we serialize the write, which makes it more obvious that we can make the writer a singleton so that we only have one writer instance for all PG instances.
Although the rationale is clear, the implementation may vary a lot. So this PR is RFC for now to see if this implementation makes sense or not.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116489
Approved by: https://github.com/kwen2501
Summary:
We introduced `node.meta["numeric_debug_handle"]` in https://github.com/pytorch/pytorch/pull/114315 to
indicate the numeric debug handle for values in the graph. In this PR we support preserving this field
in prepare and convert so that we can use it for numerical debugging.
Next: we also want to preserve these in deepcopy of GraphModule as well
Test Plan:
python test/test_quantization.py -k test_quantize_pt2e_preserve_handle
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116477
Approved by: https://github.com/tugsbayasgalan
As titled, this PR adds a gradient hook to the FSDP DTensor extension to check whether any gradients are AsyncCollectiveTensors; if there are, we eagerly wait on them there.
This is needed because sometimes a parameter's gradient might still be pending as an AsyncCollectiveTensor; if we directly feed it to FSDP, then FSDP would use the ACT's storage to do reduce_scatter, which is wrong.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116122
Approved by: https://github.com/awgu, https://github.com/fduwjj
Somehow the log prefix only contains ProcessGroup 0 and the [global rank]. This does not give the expected result, since per the comment it should be "a prefix that is unique to this process group and rank". So this PR fixes it and makes the prefix different for different sub-PGs.
The reason is that we made the prefix static, so it is shared across all NCCL PG instances, and whoever calls this function first sets `rank_` and `uid_` in the prefix. We always initialize PG 0 first, which is why we always see PG[0] + global ranks for all sub-PGs.
<img width="484" alt="image" src="https://github.com/pytorch/pytorch/assets/6937752/7fbb0226-7e25-4306-9cee-22e17b00bc8e">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116520
Approved by: https://github.com/wconstab
ghstack dependencies: #116218
Weight tying is useful when we'd like to share weights (and their gradients) between two modules, e.g. the word/token embedding module and the output linear module in language models. This test demonstrates that with DTensor it can be achieved just as with normal tensor, e.g. using `model.fc.weight = model.embedding.weight`.
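A minimal sketch of the plain-tensor version of weight tying that the test mirrors with DTensor (module sizes are illustrative):
```python
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self, vocab_size=100, dim=16):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, dim)
        self.fc = nn.Linear(dim, vocab_size, bias=False)
        # share the same Parameter (and its gradient) between the two modules
        self.fc.weight = self.embedding.weight

model = TinyLM()
assert model.fc.weight is model.embedding.weight
```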
To test: `python test/distributed/tensor/parallel/test_tp_examples.py -k test_weight_tying`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116475
Approved by: https://github.com/wanchaol, https://github.com/fduwjj
A less general version of this wrapper was used in the keynote on
`torch.compile(numpy)`. We expose a generic version of the wrapper
that works seamlessly with `torch.compile`.
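For context, the general torch.compile-on-NumPy flow looks like the sketch below (this illustrates the feature generally, not the specific wrapper added in this PR):
```python
import numpy as np
import torch

def numpy_fn(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    return np.sum(x * y, axis=0)

compiled_fn = torch.compile(numpy_fn)
out = compiled_fn(np.random.randn(8, 4), np.random.randn(8, 4))
print(out.shape)  # (4,)
```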
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114610
Approved by: https://github.com/albanD
Fixes#96617
There was already an attempt to fix this build issue - see #96618. One commit is reused from this attempt (@zboszor) with adjustments to commit message. Another one differs and takes into account provided review feedback (@ezyang).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115976
Approved by: https://github.com/ezyang
Introduce `mtl_setBuffer` and `mps_dispatch1DJob` and use them to bind
a Tensor to a Metal kernel as well as dispatch a Metal job.
This avoids potential typos/bugs when one tries to bind a tensor to a
Metal kernel but forgets about the storage offset.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116566
Approved by: https://github.com/Skylion007
Fix: #115900, https://github.com/pytorch/xla/issues/6085
This PR adds a last resort for testing for BF16 support on CUDA. This is necessary on GPUs
such as RTX 2060, where `torch.cuda.is_bf16_supported()` returns False, but we can
successfully create a BF16 tensor on CUDA.
Before this PR:
```python
>>> torch.cuda.is_bf16_supported()
False
>>> torch.tensor([1.], dtype=torch.bfloat16, device="cuda")
tensor([...], device='cuda:0', dtype=torch.bfloat16)
```
After this PR:
```python
>>> torch.cuda.is_bf16_supported()
True
>>> torch.tensor([1.], dtype=torch.bfloat16, device="cuda")
tensor([...], device='cuda:0', dtype=torch.bfloat16)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115924
Approved by: https://github.com/jansel
Remove weird logic that designates matrices as transposed if their sizes match (which is always true when square matrices are multiplied with each other), which resulted in `torch.addmm` returning a transposed matrix compared to `torch.mm`; see below:
```
% python -c "import torch;torch.set_default_device('mps');a=torch.eye(2);b=torch.arange(4.0).reshape(2, 2);print(a@b);print(torch.addmm(torch.zeros(2, 2), a,b))"
tensor([[0., 1.],
[2., 3.]], device='mps:0')
tensor([[0., 2.],
[1., 3.]], device='mps:0')
```
Fixes introduced to `torch.mm` in https://github.com/pytorch/pytorch/pull/77462 suggest that this is not needed.
Modify `sample_inputs_addmm` to test `torch.addmm` with square matrices, but skip this config for `test_autograd_dense_output_addmm`, see https://github.com/pytorch/pytorch/issues/116565
TODO: probably tweak tolerances, as `test_output_match_addmm_cpu_float16` fails with 2x2 matrices, but passes using 3x3 ones with errors slightly exceeding the tolerance
Fixes https://github.com/pytorch/pytorch/issues/116331
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116547
Approved by: https://github.com/albanD, https://github.com/Skylion007
Use proper `os.fspath` to better convert `os.PathLike` object to a path.
Replace `pathlib.Path` with `os.PathLike` which is more generic and typing correct. `pathlib.Path` is an instance of `os.PathLike`.
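A quick illustration of why `os.fspath` is the right conversion (paths are illustrative):
```python
import os
import pathlib

p = pathlib.Path("checkpoints") / "model.pt"
print(os.fspath(p))                # 'checkpoints/model.pt' on POSIX
print(os.fspath("model.pt"))       # plain strings pass through unchanged
print(isinstance(p, os.PathLike))  # True: Path is an os.PathLike
```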
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116562
Approved by: https://github.com/malfet
For training graphs (when inputs require grad), previously, we would speculate the forward and backward graphs to determine if there are any graph breaks, side effects, etc., but would not actually use these speculated graphs. We would just insert a call function node on the graph and later rely on autograd's tracing.
This approach does not work for more generalized graphs, like graphs that include user-defined triton kernels, because autograd is not able to do the higher-order function conversion.
This PR speculates the forward and backward functions and emits them in a HOF that later gets used via templating mechanism.
While working on this PR, I have exposed some bugs in the current tracing due to trampoline functions losing the source information resulting in incorrect graphs being produced. I have fixed these source information bugs and killed the trampolines.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116358
Approved by: https://github.com/jansel
Basically we observed that if there are multiple PGs and the timeout happens on one of the sub-PGs, we somehow use the local rank in the dump file name. We realize that:
1. For setting the timeout signal in the store, any watchdog thread from any PG can do it.
2. For checking and dumping, only the watchdog thread of the default PG (which we always create and which contains all ranks, so there is no file-name conflict) is needed, because the store signal and the debug-info dump are both global.
3. Since the dump is global, we want to avoid the case where ranks from a sub-PG pollute logs from global ranks (local rank 0 vs global rank 0). So we use global ranks here to initialize the debug info writer. (Down the road, we are thinking about making it a singleton so that users only register it once for the multi-PG case.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116218
Approved by: https://github.com/wconstab
Turned command sequence mentioned in https://dev-discuss.pytorch.org/t/how-to-get-a-fast-debug-build/1597 and in various discussions into a tool that I use almost daily to debug crashes or correctness issues in the codebase
Essentially it allows one to turn this:
```
Process 87729 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
frame #0: 0x00000001023d55a8 libtorch_python.dylib`at::indexing::impl::applySelect(at::Tensor const&, long long, c10::SymInt, long long, c10::Device const&, std::__1::optional<c10::ArrayRef<c10::SymInt>> const&)
libtorch_python.dylib`at::indexing::impl::applySelect:
-> 0x1023d55a8 <+0>: sub sp, sp, #0xd0
0x1023d55ac <+4>: stp x24, x23, [sp, #0x90]
0x1023d55b0 <+8>: stp x22, x21, [sp, #0xa0]
0x1023d55b4 <+12>: stp x20, x19, [sp, #0xb0]
```
into this
```
Process 87741 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
frame #0: 0x00000001024e2628 libtorch_python.dylib`at::indexing::impl::applySelect(self=0x00000001004ee8a8, dim=0, index=(data_ = 3), real_dim=0, (null)=0x000000016fdfe535, self_sizes= Has Value=true ) at TensorIndexing.h:239:7
236 const at::Device& /*self_device*/,
237 const c10::optional<SymIntArrayRef>& self_sizes) {
238 // See NOTE [nested tensor size for indexing]
-> 239 if (self_sizes.has_value()) {
240 auto maybe_index = index.maybe_as_int();
241 if (maybe_index.has_value()) {
242 TORCH_CHECK_INDEX(
```
while retaining good performance for the rest of the codebase
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116521
Approved by: https://github.com/atalman
Summary:
`_fold_conv_bn_qat` is taking a long time currently, so we skip it when it's not necessary;
we can have follow-up fixes to actually reduce the patterns or cache the patterns if possible.
Test Plan:
uncomment the print in `test_speed`, run
python test/test_quantization.py -k test_speed
and make sure the convert time is low, e.g. 0.1s instead of 8-9 seconds
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116440
Approved by: https://github.com/andrewor14
Summary:
The INTERFACE_CALL opcode for the mobile TorchScript interpreter contained an out of bounds read issue leading to memory corruption.
This change adds an explicit check that the number of inputs passed to the format method called when handling the INTERFACE_CALL opcode is valid and within the bounds of the stack.
Test Plan: contbuild + OSS signals
Differential Revision: D49739450
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110301
Approved by: https://github.com/dbort
Run slow/periodic and inductor workflows on push to release branches
Right now there is no signal from those jobs on release branches at all.
This will run periodic jobs on every commit to a release branch, which is fine, as they are short-lived and have much lower traffic than regular jobs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116516
Approved by: https://github.com/clee2000
This PR adds a very simple mutation tracking mechanism to dynamo which can later be improved to be more thorough. Currently it allows tensors to be in tl.load but if it sees a tensor used anywhere else (including a tl.load), it bails out.
One question about the method: is `ast.NodeVisitor` the best thing to use here? Having to detect mutations with this is kinda pretty since you need to keep setting state at each transition.
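For illustration only (not the actual dynamo implementation), a minimal `ast.NodeVisitor` along these lines keeps a small amount of state to allow tensor names inside `tl.load` calls and bail out on any other use; the kernel source and parameter names are made up:
```python
import ast

SRC = """
def kernel(in_ptr, out_ptr):
    x = tl.load(in_ptr)
    tl.store(out_ptr, x)  # out_ptr used outside tl.load -> bail out
"""

class TensorUseChecker(ast.NodeVisitor):
    def __init__(self, tensor_params):
        self.tensor_params = set(tensor_params)
        self.load_depth = 0          # state toggled at each tl.load transition
        self.must_bail = False

    def _is_tl_load(self, call):
        f = call.func
        return (isinstance(f, ast.Attribute) and f.attr == "load"
                and isinstance(f.value, ast.Name) and f.value.id == "tl")

    def visit_Call(self, node):
        self.load_depth += self._is_tl_load(node)
        self.generic_visit(node)
        self.load_depth -= self._is_tl_load(node)

    def visit_Name(self, node):
        # any tensor name seen outside a tl.load call forces a fallback
        if node.id in self.tensor_params and self.load_depth == 0:
            self.must_bail = True

checker = TensorUseChecker({"in_ptr", "out_ptr"})
checker.visit(ast.parse(SRC))
print(checker.must_bail)  # True -> fall back instead of tracing the kernel
```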
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116466
Approved by: https://github.com/aakhundov
Fixes https://github.com/pytorch/pytorch/issues/107842
Unlike `AdaptiveAvgPool`, `AdaptiveMaxPool` does not have a CUDA kernel for ChannelsLast. We work around this by calling `contiguous()` on the input. However, there is an edge case when the channels dimension has size 1.
```python
>>> t = torch.randn(2, 1, 3, 3)
>>> t.stride()
(9, 9, 3, 1)
>>> t_c = t.to(memory_format=torch.channels_last)
>>> t_c.stride()
(9, 1, 3, 1) # (CHW, 1, CW, C)
>>> t_c.is_contiguous()
True # contiguity check doesn't check strides for singleton dimensions
```
Since the CUDA kernel treats the batch,`B`, and channels,`C`, dimensions as implicitly flattened and increments the data pointer for `input` to the start of the next plane using
669b182d33/aten/src/ATen/native/cuda/AdaptiveMaxPooling2d.cu (L67)
If our input falls into the aforementioned edge case, the `data_ptr` will not be incremented correctly. The simple fix for this is to calculate the stride for the channels dimension using $\prod_{i > 1}size(i)$
Analogous fix for the 3D case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116482
Approved by: https://github.com/albanD
Summary: currently pad_sequence causes symbolic shape specialization in export, which is unintended. Adding a decomp seems to work to avoid the C++ kernel which caused the specialization.
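For reference, the op in question (shapes are illustrative):
```python
import torch
from torch.nn.utils.rnn import pad_sequence

seqs = [torch.randn(3, 4), torch.randn(5, 4)]
# the padded length is the max of the input lengths, which export should keep symbolic
padded = pad_sequence(seqs, batch_first=True)
print(padded.shape)  # torch.Size([2, 5, 4])
```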
Test Plan: buck test mode/opt caffe2/test:test_export -- -r pad_sequence
Reviewed By: SherlockNoMad
Differential Revision: D52345667
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116285
Approved by: https://github.com/peterbell10, https://github.com/lezcano
By adding `get_ghstack_dependent_prs`, which uses `git branch --contains` to
find all PRs containing the stacked branch, selecting the longest one (in
terms of distance between the origin and the default branch) and skipping all
open PRs.
Please note that reverts should be applied in the reverse of the
order in which the PRs were originally landed.
Use a bit of defensive programming, i.e. revert a single PR if the attempt to fetch dependencies fails for some reason.
Test plan:
- Lint
- ```
>>> from trymerge import GitRepo, GitHubPR, get_ghstack_prs, get_ghstack_dependent_prs
>>> pr=GitHubPR("pytorch", "pytorch", 115188)
>>> pr1=GitHubPR("pytorch", "pytorch", 115210)
>>> repo=GitRepo("/Users/nshulga/git/pytorch/pytorch")
>>> get_ghstack_dependent_prs(repo, pr1)
[('22742d93a5357c9b5b45a74f91a6dc5599c9c266', <trymerge.GitHubPR object at 0x100f32f40>)]
>>> get_ghstack_dependent_prs(repo, pr)
[('22742d93a5357c9b5b45a74f91a6dc5599c9c266', <trymerge.GitHubPR object at 0x10102eaf0>), ('76b1d44d576c20be79295810904c589241ca1bd2', <trymerge.GitHubPR object at 0x10102eb50>)]
>>> rc=get_ghstack_dependent_prs(repo, pr)
rc[0]>>> rc[0][1].pr_num
115210
>>> rc[1][1].pr_num
115188
```
- see: https://github.com/malfet/deleteme/pull/59#issuecomment-1869904714 and https://github.com/malfet/deleteme/pull/74#issuecomment-1870542702
Fixes https://github.com/pytorch/test-infra/issues/4845
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116447
Approved by: https://github.com/huydhn
ghstack dependencies: #116446
Prep change for allowing stacked reverts
This is a no-op that factors out some helper function that would be
useful later:
- `get_pr_commit_sha` finds a committed sha for a given PR
- `_revlist_to_prs` converts a revlist to GitHubPRs conditionally
filtering some out
- `do_revert_prs` reverts multiple PRs in a batch, but so far is
invoked with only one PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116446
Approved by: https://github.com/huydhn, https://github.com/seemethere
Fixes#112602
For some reason, I could not get the same output when running the pycodestyle command as indicated in the issue. I manually ran ruff checks, fixing the following issues: `D202`, `D204`, `D205`, `D207`, `D400` and `D401`.
### Requested output
nn.modules.activation:
before: 135
after: 79
nn.modules.batchnorm
before: 21
after: 3
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113531
Approved by: https://github.com/mikaylagawarecki
## Context
**Currently, `*.h` and `*.cpp` produces many lint warnings/errors from `clang-tidy` in the Meta internal Phabricator mirror**. These changes address all the lint warnings in the `api`, `graph`, and `impl` folders in preparation for upcoming planned work.
## Review Guide
* Most changes are the result of automatically applied patches from `clang-tidy`
* However, some warnings had to be manually addressed
* There should be no functional changes
* Many of the `clang-tidy` warnings arose from the `facebook-hte-BadMemberName` rule which checks for compliance with variable naming rules from Meta's internal C++ style guide
* However, the rest of the ATen codebase does not conform to this rule, and PyTorch Vulkan was written to be consistent with ATen's naming conventions; thus, to stay consistent with the rest of ATen, this rule is disabled wherever relevant using `// @lint-ignore-every CLANGTIDY facebook-hte-BadMemberName`
* Lint was disabled entirely for `vulkan_api_test.cpp` since there are too many warnings to address at the moment. Addressing all of them will be a small project of its own; thus, in the interim, lint will be disabled to reduce distracting signals for developers.
Internal:
## Notes for Internal Reviewers
This diff was largely created with
```
cd ~/fbsource/xplat/caffe2/aten/src/ATen/native/vulkan
arc lint -e extra -a --take CLANGTIDY * 2>&1 | tee ~/scratch/lint.txt
```
The above command automatically applied patches suggested by `clang-tidy`, and the rest of the warnings were addressed manually.
To disable `facebook-hte-BadMemberName`, I found that disabling it via a `.clang-tidy` file didn't work with `arc lint`, and the only way that worked was through the adding a comment
```
// @lint-ignore-every CLANGTIDY facebook-hte-BadMemberName
```
Differential Revision: [D50336057](https://our.internmc.facebook.com/intern/diff/D50336057/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116431
Approved by: https://github.com/GregoryComer, https://github.com/kirklandsign
After this refactor:
* ```TorchVariable``` definition and all references are removed.
* All ```is_allowed``` references except one are removed.
- The only one left is in ```torch/_dynamo/decorators:_disallow_in_graph_helper```. It is called when users put the ```disallow_in_graph``` decorator on a function. Since we use the lists in ```trace_rules``` to decide a function's trace rule, the decorator would only be used on user functions rather than torch functions. I'll defer this to a separate decorator refactor PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116312
Approved by: https://github.com/jansel
Not sure if this is a recent API change or what, but `gh_get_labels('malfet', 'deleteme')` used to raise an exception (see https://github.com/malfet/deleteme/actions/runs/7334535266/job/19971328673#step:6:37 )
```
File "/home/runner/work/deleteme/deleteme/.github/scripts/label_utils.py", line 50, in get_last_page_num_from_header
link_info[link_info.rindex(prefix) + len(prefix) : link_info.rindex(suffix)]
AttributeError: 'NoneType' object has no attribute 'rindex'
```
And with this fix it returns the expected list
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116444
Approved by: https://github.com/huydhn
The cuDNN RNNv6 API has been deprecated and support will be dropped in an upcoming release; this PR migrates to the newer API to support newer cuDNN versions that would otherwise break the build.
Note that it may not be tested yet in upstream CI if the upstream CI cuDNN version is less than 8.9.7.
CC @ptrblck @malfet
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115719
Approved by: https://github.com/albanD, https://github.com/malfet
Should allow for use cases mentioned in #110940
This would allow scalars to also be float64s in the foreach implementation. The fused implementation would still create a float32 step on Adam and AdamW. This PR also does NOT worry about performance and is mainly for enablement.
Next steps:
- Relax the constraint on fused adam(w) and allow torch.float64 scalars there
- Allow _performant_ mixed dtypes in foreach (a bigger project in itself).
This PR will conflict with my other PRs, I will figure out a landing order
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115841
Approved by: https://github.com/albanD
When exporting a model with a convolution kernel on cpu, if mkldnn is disabled and nnpack is enabled, export will go down the nnpack-optimized convolution kernel path for certain shapes ([code pointer](cd449e260c/aten/src/ATen/native/Convolution.cpp (L542-L552))). This means that we will automatically create a guard on that certain shape. If users want to export without any restrictions, one option is to disable nnpack. However, no config function exists for this, so this PR adds a config function, similar to the `set_mkldnn_enabled` function.
Original context is in https://fb.workplace.com/groups/1075192433118967/posts/1349589822345892/?comment_id=1349597102345164&reply_comment_id=1349677642337110.
To test the flag, the following script runs successfully:
```
import os

import torch
from torchvision.models import ResNet18_Weights, resnet18

torch.set_float32_matmul_precision("high")

model = resnet18(weights=ResNet18_Weights.DEFAULT)
model.eval()

with torch.no_grad():
    # device = "cuda" if torch.cuda.is_available() else "cpu"
    torch.backends.mkldnn.set_flags(False)
    torch.backends.nnpack.set_flags(False)  # <--- Added config
    device = "cpu"
    model = model.to(device=device)
    example_inputs = (torch.randn(2, 3, 224, 224, device=device),)
    batch_dim = torch.export.Dim("batch", min=2, max=32)
    so_path = torch._export.aot_compile(
        model,
        example_inputs,
        # Specify the first dimension of the input x as dynamic
        dynamic_shapes={"x": {0: batch_dim}},
        # Specify the generated shared library path
        options={
            "aot_inductor.output_path": os.path.join(os.getcwd(), "resnet18_pt2.so"),
            "max_autotune": True,
        },
    )
```
I'm not sure who to add as reviewer, so please feel free to add whoever is relevant!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116152
Approved by: https://github.com/malfet
This PR is the start of integrating PyTorch distributed logs into Torch LOGs. We already have one tag, "distributed", for all distributed components, but distributed is a very large component and we want some hierarchy, giving users the option to only turn on logs for certain submodules. So we also added tags starting with "dist_*" for each submodule. (This PR only adds some of them; we are going to add more down the road.)
Related discussions can be found here: https://github.com/pytorch/pytorch/issues/113544
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116434
Approved by: https://github.com/awgu, https://github.com/wanchaol
get_unbacked_symbol_defs and get_unbacked_symbol_uses inconsistently return dicts vs. sets. The majority of the use cases of these methods use them for set membership, which is deterministic, but set iteration is non-deterministic. Therefore, in the one place where we iterate through unbacked symbols, we sort by the symbol name before iterating to preserve determinism.
Another approach would be to have these functions consistently return dictionaries, where the key of the dictionary is the name of the symbol. I'm happy to take that approach if we think it's likely future code will forget to sort before iteration.
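A toy sketch of the approach described above (sympy symbols stand in for the real unbacked symbols):
```python
import sympy

# Membership tests against a set are deterministic...
unbacked = {sympy.Symbol("u2"), sympy.Symbol("u0"), sympy.Symbol("u1")}
assert sympy.Symbol("u0") in unbacked

# ...but iteration order is not, so sort by symbol name before iterating.
for s in sorted(unbacked, key=lambda s: s.name):
    print(s)  # u0, u1, u2, in a stable order
```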
Fixes#113130
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116421
Approved by: https://github.com/oulgen, https://github.com/aakhundov
**Summary**:
Ops like `native_layer_norm_backward` return a tuple of optional torch.Tensor.
This PR allows to use OpStrategy to represent `native_layer_norm_backward`'s
return value sharding.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115682
Approved by: https://github.com/wanchaol
Removes a part of the sparse adam test and the following three tests: `test_fused_optimizer_raises`, `test_duplicate_params_across_param_groups`, `test_duplicate_params_in_one_param_group`
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (d2d129de)]$ python test/test_optim.py -k test_fused_optimizer_raises -k test_duplicate_params_across_param_groups -k test_duplicate_params_in_one_param_group
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
...
----------------------------------------------------------------------
Ran 3 tests in 0.023s
OK
```
Increases coverage by testing the duplicate param tests on ALL the optims instead of just one each. Also fixes a SparseAdam bug where `torch.unbind` was accidentally invoked by calling `list(...)` on the params instead of putting them in a list. This bug was caught by migrating the weird warning stuff to just one easy warning context manager, which checks that nothing else gets raised.
The new test_errors does not run slower than before, overhead is still king:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (d2d129de)]$ python test/test_optim.py -k test_errors
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
..........................
----------------------------------------------------------------------
Ran 26 tests in 10.337s
OK
```
Compared to test_errors BEFORE my commit :p
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (b47aa696)]$ python test/test_optim.py -k test_errors
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
.............sssssssssssss
----------------------------------------------------------------------
Ran 26 tests in 11.980s
OK (skipped=13)
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (b47aa696)]$
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116315
Approved by: https://github.com/mikaylagawarecki
As the title says, this PR enables vectorization for the situation where the index_expr depends on the vectorized itervar. There are two cases here:
1. The vectorized itervar has constant stride in the index_expr. We vectorize the index_expr with `Vectorized<int32>::arange` for this case.
2. Otherwise, we load the index_expr vector in a non-contiguous way with a loop.
Below is the generated code for the first case from the test `test_concat_inner_vec`. Here `x1` is the index_expr and depends on the vectorized itervar `x1`. It has constant stride 1. We vectorized it with arange. We use `all_zero` to implement a short-cut for masks to avoid unnecessary execution of nested masked regions which are invalid.
Before:
```c++
#pragma omp for collapse(2)
for(long x0=static_cast<long>(0L); x0<static_cast<long>(32L); x0+=static_cast<long>(1L))
{
for(long x1=static_cast<long>(0L); x1<static_cast<long>(155L); x1+=static_cast<long>(1L))
{
auto tmp0 = c10::convert<long>(x1);
auto tmp1 = static_cast<long>(0);
auto tmp2 = tmp0 >= tmp1;
auto tmp3 = static_cast<long>(35);
auto tmp4 = tmp0 < tmp3;
auto tmp5 = [&]
{
auto tmp6 = in_ptr0[static_cast<long>(x1 + (35L*x0))];
return tmp6;
}
;
auto tmp7 = tmp4 ? tmp5() : static_cast<decltype(tmp5())>(0.0);
auto tmp8 = tmp0 >= tmp3;
auto tmp9 = static_cast<long>(155);
auto tmp10 = tmp0 < tmp9;
auto tmp11 = [&]
{
auto tmp12 = in_ptr1[static_cast<long>((-35L) + x1 + (120L*x0))];
return tmp12;
}
;
...
```
After:
```c++
#pragma omp for
for(long x0=static_cast<long>(0L); x0<static_cast<long>(32L); x0+=static_cast<long>(1L))
{
for(long x1=static_cast<long>(0L); x1<static_cast<long>(144L); x1+=static_cast<long>(16L))
{
auto tmp0 = c10::convert<int>(x1);
auto tmp1 = at::vec::Vectorized<int32_t>::arange(tmp0, 1);
auto tmp2 = static_cast<int>(0);
auto tmp3 = at::vec::Vectorized<int>(tmp2);
auto tmp4 = to_float_mask(tmp1 >= tmp3);
auto tmp5 = static_cast<int>(35);
auto tmp6 = at::vec::Vectorized<int>(tmp5);
auto tmp7 = to_float_mask(tmp1 < tmp6);
auto tmp8 = [&]
{
auto tmp9 = masked_load(in_ptr0 + static_cast<long>(x1 + (35L*x0)), to_float_mask(tmp7));
return tmp9;
}
;
auto tmp10 =
[&]
{
if (all_zero(to_float_mask(tmp7)))
{
return at::vec::Vectorized<float>(static_cast<float>(0.0));
}
else
{
return decltype(tmp8())::blendv(at::vec::Vectorized<float>(static_cast<float>(0.0)), tmp8(), to_float_mask(tmp7));
}
}
()
;
...
```
Below is the generated code for the second case from the test case `test_expr_vec_non_contiguous`. Here, the index_expr is `31L + (63L*(c10::div_floor_integer(x1, 32L))) + (c10::div_floor_integer(x2, 32L))` which depends on the vectorized itervar `x2` and doesn't have constant stride. So, we load the index_expr vector with a loop. (In fact, this can be further optimized since the index_expr is invariant with the data points in the range [x2, x2+16). So it can be regarded as a scalar. This will be optimized in the follow-up PR.) The code uses `vector_lane_mask_check` to implement the masked version of non-contiguous load.
Before:
```c++
#pragma omp for collapse(2)
for(long x0=static_cast<long>(0L); x0<static_cast<long>(4L); x0+=static_cast<long>(1L))
{
for(long x1=static_cast<long>(0L); x1<static_cast<long>(1024L); x1+=static_cast<long>(1L))
{
{
float tmp_acc0 = -std::numeric_limits<float>::infinity();
for(long x2=static_cast<long>(0L); x2<static_cast<long>(1024L); x2+=static_cast<long>(1L))
{
auto tmp0 = c10::convert<long>(31L + (63L*(c10::div_floor_integer(x1, 32L))) + (c10::div_floor_integer(x2, 32L)));
auto tmp1 = static_cast<long>(2048);
auto tmp2 = tmp0 < tmp1;
auto tmp3 = [&]
{
auto tmp4 = in_ptr0[static_cast<long>(31L + (63L*(c10::div_floor_integer(x1, 32L))) + (2048L*(static_cast<long>(x1) % static_cast<long>(32L))) + (65536L*x0) + (c10::div_floor_integer(x2, 32L)))];
return tmp4;
}
;
auto tmp5 = tmp2 ? tmp3() : static_cast<decltype(tmp3())>(0.0);
tmp_acc0 = max_propagate_nan(tmp_acc0, tmp5);
}
out_ptr0[static_cast<long>(x1 + (1024L*x0))] = tmp_acc0;
}
}
}
```
After:
```c++
#pragma omp for
for(long x0=static_cast<long>(0L); x0<static_cast<long>(4L); x0+=static_cast<long>(1L))
{
for(long x1=static_cast<long>(0L); x1<static_cast<long>(1024L); x1+=static_cast<long>(16L))
{
{
#pragma omp declare reduction(max:at::vec::Vectorized<float>:omp_out = at::vec::maximum(omp_out, omp_in)) initializer(omp_priv={at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity())})
float tmp_acc0 = -std::numeric_limits<float>::infinity();
at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity());
for(long x2=static_cast<long>(0L); x2<static_cast<long>(1024L); x2+=static_cast<long>(1L))
{
auto tmp0 =
[&]
{
__at_align__ std::array<int, 16> tmpbuf;
#pragma GCC unroll 16
for (long x1_inner = 0; x1_inner < 16; x1_inner++)
{
tmpbuf[x1_inner] = static_cast<long>(31L + (63L*(c10::div_floor_integer((x1 + x1_inner), 32L))) + (c10::div_floor_integer(x2, 32L)));
}
return at::vec::Vectorized<int>::loadu(tmpbuf.data());
}
()
;
auto tmp1 = static_cast<int>(2048);
auto tmp2 = at::vec::Vectorized<int>(tmp1);
auto tmp3 = to_float_mask(tmp0 < tmp2);
auto tmp4 = [&]
{
auto tmp5 =
[&]
{
__at_align__ std::array<float, 16> tmpbuf;
#pragma GCC unroll 16
for (long x1_inner = 0; x1_inner < 16; x1_inner++)
{
if (vector_lane_mask_check(tmp3, x1_inner))
{
tmpbuf[x1_inner] = in_ptr0[static_cast<long>(31L + (63L*(c10::div_floor_integer((x1 + x1_inner), 32L))) + (2048L*(static_cast<long>((x1 + x1_inner)) % static_cast<long>(32L))) + (65536L*x0) + (c10::div_floor_integer(x2, 32L)))];
}
}
return at::vec::Vectorized<float>::loadu(tmpbuf.data());
}
()
;
return tmp5;
}
;
auto tmp6 =
[&]
{
if (all_zero(to_float_mask(tmp3)))
{
return at::vec::Vectorized<float>(static_cast<float>(0.0));
}
else
{
return decltype(tmp4())::blendv(at::vec::Vectorized<float>(static_cast<float>(0.0)), tmp4(), to_float_mask(tmp3));
}
}
()
;
tmp_acc0_vec = at::vec::maximum(tmp_acc0_vec, tmp6);
}
tmp_acc0_vec.store(out_ptr0 + static_cast<long>(x1 + (1024L*x0)));
}
}
}
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114545
Approved by: https://github.com/lezcano
Sometimes, the first statement that sets this variable in the try block fails due to out-of-memory issues, and the finally block then tries to delete the variable even though it was never assigned in the first place.
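A generic sketch of the failure mode and the usual fix (the names here are illustrative, not the actual code that was changed):
```python
def allocate():
    # Stand-in for the statement that can fail, e.g. with an out-of-memory error.
    raise MemoryError("simulated OOM")

buf = None  # initialize before the try so the finally block never sees an unbound name
try:
    try:
        buf = allocate()
    finally:
        if buf is not None:  # without the guard, `del buf` would itself raise here
            del buf
except MemoryError:
    print("allocation failed, but cleanup did not crash")
```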
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116260
Approved by: https://github.com/lezcano
The `initial` argument in `functools.reduce` can be `None`.
```python
initial_missing = object()
def reduce(function, iterable, initial=initial_missing, /):
it = iter(iterable)
if initial is initial_missing:
value = next(it)
else:
value = initial
for element in it:
value = function(value, element)
return value
```
Reference:
- python/cpython#102759
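To make the point concrete: `None` is a legitimate explicit `initial` value, distinct from "no initial supplied", which is why the private sentinel object is used above.
```python
import functools

# reduce must treat initial=None as a real starting value, not as "missing".
result = functools.reduce(lambda acc, x: [*(acc or []), x], [1, 2, 3], None)
print(result)  # [1, 2, 3]
```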
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116398
Approved by: https://github.com/Skylion007
This pull request primarily addresses two issues to resolve the `QConvPointWiseBinaryPT2E` layout problem:
- As the changes made in 611a7457ca, for `QConvPointWiseBinaryPT2E` with post-op `sum`, we should also utilize `NoneLayout` and return `accum` instead of `QConvPointWiseBinaryPT2E`.
- Additionally, this pull request fixes an issue in the `_quantized_convolution_onednn` implementation. Given that we expect `accum` to be changed in place, we should avoid copying `accum` (by changing the memory format or data type) inside the kernel implementation. Instead, we have moved the necessary memory format or data type changes to the lowering of `QConvPointWiseBinaryPT2E`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115613
Approved by: https://github.com/jgong5, https://github.com/oulgen
ghstack dependencies: #116172
**Summary**
Re-land https://github.com/pytorch/pytorch/pull/115329. Open a new PR since the origin branch has been deleted.
Change the QConv2d Binary fusion post op name from `add` to `sum`, since we are actually using OneDNN `post op sum` instead of `Binary_Add` for now.
**TestPlan**
```
python -m pytest test_quantized_op.py -k test_qconv2d_sum_pt2e
python -m pytest test_quantized_op.py -k test_qconv2d_sum_relu_pt2e
python -m pytest test_quantized_op.py -k test_qconv2d_sum_relu_float_output_pt2e
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116172
Approved by: https://github.com/kit1980
Remove ```ExecutionRecorder.MOD_EXCLUDES``` since now torch python modules are wrapped as ```PythonModuleVariable``` after #115724.
This was reported from Meta-internal use cases, where it triggered a failure when replay & record was enabled. The enablement came from ```TORCH_COMPILE_DEBUG=1``` rather than users really needing it; according to conversations with the team members, they are not actually using it. I think we don't maintain replay & record well, so we can probably remove it from our codebase to avoid such issues in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116347
Approved by: https://github.com/jansel
Fixes#114903
Previously, large split variance reductions stored the intermediates in float16 precision, which may lead to overflow as the intermediate result is unnormalized.
In #114903 we see two different `num_split` decisions made based on the
hardware capabilities, one of which has large enough intermediates to cause
overflows.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115181
Approved by: https://github.com/shunting314
**Motivation:**
Denormal numbers are used to store extremely small numbers that are close to 0, and they can incur extra computational cost. To solve the low performance caused by denormal numbers, PyTorch supports flushing denormal numbers to zero via set_flush_denormal().
Currently set_flush_denormal() is only supported on x86 architectures supporting SSE3 (https://pytorch.org/docs/stable/generated/torch.set_flush_denormal.html), and now we want to extend this functionality to the ARM architecture.
**This PR:**
-> Supports set_flush_denormal() on ARM (see the usage sketch below).
-> Datatypes supported and tested: FP64, FP32, BFloat16
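A brief usage sketch (the return value indicates whether the current platform supports flushing denormals; the printed value assumes supported hardware):
```python
import torch

supported = torch.set_flush_denormal(True)
print("flush-to-zero for denormals enabled:", supported)

if supported:
    # A double-precision denormal is flushed to zero while the mode is active.
    print(torch.tensor([1e-323], dtype=torch.float64))  # prints a zero tensor
```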
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115184
Approved by: https://github.com/jgong5
As the [RFC](https://github.com/pytorch/pytorch/issues/114856) mentions, this is step 1 of adding the Intel GPU backend as an alternative Inductor backend.
### Design
Typically, in order to integrate the Intel GPU backend into Inductor, we need to inherit from `WrapperCodegen` and `TritonScheduling` and implement the corresponding subclasses respectively. However, since `WrapperCodegen` and `TritonScheduling` have some device-specific code generation **scattered** in their methods, overriding them in subclasses would introduce a lot of duplicated parent class code.
For example:
2a44034895/torch/_inductor/codegen/wrapper.py (L487)
2a44034895/torch/_inductor/codegen/triton.py (L1996)
So we abstract the device-specific code scattered in WrapperCodegen and TritonScheduling and provide a unified interface, "DeviceOpOverrides". This way, when integrating a new backend, we can maximize the reuse of `WrapperCodegen` and `TritonScheduling` code by inheriting from and implementing this interface for device flexibility.
Currently `DeviceOpOverrides` only covers Python wrapper code generation. We can further extend it to cover Cpp wrapper code generation on demand.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116020
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/jansel
Summary:
- Ported `all_to_all_single` to native c10d_functional
- Added Inductor support for the native `all_to_all_single` via the new collective IR's `create_out_of_place()`
- Since the new collective IR derives from `FallbackKernel` which implements a generic `free_unbacked_symbols`, no additional unbacked symbol handling for all_to_all_single is required
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113438
Approved by: https://github.com/yf225, https://github.com/ezyang
Summary: amd build is broken
Test Plan:
```
buck-out/v2/gen/fbcode/75c2b50d9f8b18d8/caffe2/__fb_libtorch_hipify_gen_eqsb_torch/csrc/distributed/c10d/intra_node_comm.hip__/out/torch/csrc/distributed/c10d/intra_node_comm.hip:37:1: error: non-void function does not return a value [-Werror,-Wreturn-type]
}
^
1 error generated when compiling for gfx90a.
```
Now it's gone
Differential Revision: D52373348
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116291
Approved by: https://github.com/yifuwang
Summary: lifted tensor constants were not being treated the same way as named buffers when unlifting, i.e. getting name correction to convert "." in FQNs to "_" for proper attribute names. Additionally, future torchbind object support will allow objects to be registered, so only call register_buffer for lifted constants if the value is a tensor.
Differential Revision: D52367846
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116266
Approved by: https://github.com/angelayi
Summary:
We have found that the root cause of the hang is NOT the destruction of stores. The hang in check() only happens when the store is of type FileStore.
The file held by each FileStore was a temp file created by Python's tempfile module, which deletes the file by default when it is closed.
Note that the file is opened and closed by every check() in the watchdog and in the constructor of FileStore.
When check() then tried to open the deleted file again, open() would fail after the timeout value (by default 5 mins), hence the hang.
The fix is simple: just avoid the default deletion after the file is closed.
Test Plan:
1. We first reproduce the hang in check() in the existing unit test:
test_init_process_group_for_all_backends by enabling the
DumpOnTimeOut and making the main thread sleep for 2s, to give enough time for tempfile
to be deleted
2. Adding log to check ref count of fileStore and also the sequence of
file opening and closing
3. With the repro, an exception will be thrown as "no such file or
directory' and unit test would fail
4. Verify the tests now passes with the above knob change
5. add an unit test in test_c10d_nccl to cover the fileStore check() code path
python test/distributed/test_c10d_common.py ProcessGroupWithDispatchedCollectivesTests
python test/distributed/test_c10d_nccl.py ProcessGroupNCCLTest.test_file_store_check
Reviewers:
Subscribers:
Tasks:
T173200093
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116297
Approved by: https://github.com/fduwjj
ghstack dependencies: #116296
Fixes#115234
Currently, the unpickling code does not support integers larger than 64 bits. However, handling them is part of Python's own unpickling code.
See `pickle.py` in CPython:
```
def decode_long(data):
r"""Decode a long from a two's complement little-endian binary string.
>>> decode_long(b'')
0
>>> decode_long(b"\xff\x00")
255
>>> decode_long(b"\xff\x7f")
32767
>>> decode_long(b"\x00\xff")
-256
>>> decode_long(b"\x00\x80")
-32768
>>> decode_long(b"\x80")
-128
>>> decode_long(b"\x7f")
127
"""
return int.from_bytes(data, byteorder='little', signed=True)
```
E.g.:
```
>>> int.from_bytes(bytearray(b'\xff\xff\xff\xff\xff\xff\xff\xff\x00'), byteorder='little', signed=True)
18446744073709551615
```
This PR makes it so that integers of arbitrary size are supported with JS BigNums.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115264
Approved by: https://github.com/zdevito
`add`, when passed one sparse and one dense argument, will error if the sparse argument does not have csr layout. This PR modifies the underlying algorithm to be generic over the compressed dimension, handling both csr and csc. The functions are renamed to use the `sparse_compressed` qualifier rather than `sparse_csr`.
Fixes: #114807
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115433
Approved by: https://github.com/cpuhrsch, https://github.com/pearu
ghstack dependencies: #115432
Change default from 2 min to 10 min.
Why? Many cases of heartbeat timeout were reported, but increasing
timeout led to the same job hanging in a different place, suggesting
heartbeat kill was working well and not a false positive. However, some
others reported jobs running fine with increased timeouts. One such
case was investigated below, and suggests that indeed a 2 min timeout is
too aggressive. While we have not fully root caused the issue, it
is better to avoid killing jobs that would otherwise complete.
Current theory is that watchdog is not totally deadlocked, but is slowed
down in its processing of work objs due to some intermittent resource
contention. Hence, allowing more time is more of a workaround than a
fix.
Debug/Analysis:
https://docs.google.com/document/d/1NMNWoTB86ZpP9bqYLZ_EVA9byOlEfxw0wynMVEMlXwM
Differential Revision: [D52368791](https://our.internmc.facebook.com/intern/diff/D52368791)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116267
Approved by: https://github.com/fduwjj
Two recent triton PRs (https://github.com/openai/triton/pull/2701, https://github.com/openai/triton/pull/2756) change the interface of triton.compile; this PR adds the necessary changes on the inductor side to work with both the old and new compile API.
There is also some simplification between the compilation call in the subprocess and the one in the main process:
- previously we passed warm_cache_only=True if the compilation happened in a subprocess, but triton never uses that argument in the currently used pin, so I removed it
- previously we only passed compute_capability if the compilation happened in a subprocess; this PR changes that to always pass compute_capability to triton.compile, regardless of whether the compilation happens in the main or sub process
Updated:
There are more interface changes from the triton side, e.g.:
- tl.math.{min, max} now requires a propagate_nan argument
- JITFunction.run now requires a warmup argument. This affects the benchmarking phase of matmul max-autotune; on the other hand, JITFunction.run now forbids the stream argument. Simply not passing it when benchmarking the matmul triton kernel works for both old and new versions of triton.
- triton Autotuner changed attribute names from 'warmup' to 'num_warmup' and from 'rep' to 'num_rep'. This caused dynamo to fail to handle the triton Autotuner object, since dynamo's TritonKernelVariable makes assumptions about attribute names. It's used in some test cases where a model calls the triton Autotuner directly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115878
Approved by: https://github.com/jansel
Summary: Currently, we [`clone`](19207b9183/torch/_inductor/lowering.py (L5273)) every `TensorBox` argument of custom Triton kernels while lowering them to the Inductor IR, during which the stride information of the kernel inputs is lost. This is problematic in the common case when the strides of a `torch.Tensor` argument are passed as scalars to a custom Triton kernel alongside the tensor itself (due to the underlying Triton code interpreting the tensors as raw pointers, so the contained stride semantics of the `torch.Tensor` is lost).
In this PR, we add an extended version of the existing [`clone` lowering](19207b9183/torch/_inductor/lowering.py (L2289))---`clone_preserve_reinterpret_view`---which carries over the `ir.ReinterpretView` layers (if any) from the source `TensorBox` to the cloned one. The rationale behind adding a new function (and switching to it only in `triton_kernel_wrap` for now) as opposed to extending the existing `clone` is keeping the semantics of the latter untouched, as it is a lowering of `torch.clone` (albeit incomplete, as the `memory_format` is currently ignored). Changing the existing `clone` would change the semantics, which is not necessarily desirable in general. Open to suggestions, though.
Test Plan:
```
$ python test/dynamo/test_functions.py -k test_triton_kernel_strided_input
...
----------------------------------------------------------------------
Ran 1 test in 5.568s
OK
```
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116219
Approved by: https://github.com/jansel
Part 1 of implementation for general [subclass view fake-ification](https://docs.google.com/document/d/1C5taWiplmX7nKiURXDOAZG2W5VNJ2iV0fQFq92H0Cxw).
The following functional inverses are currently implemented scatter-style and thus never return views:
* `as_strided_copy_inverse()`
* `diagonal_copy_inverse()`
* `expand_copy_inverse()`
* `select_copy_int_inverse()`
* `slice_copy_Tensor_inverse()`
* `split_copy_Tensor_inverse()`
* `split_with_sizes_copy_inverse()`
* `unbind_copy_int_inverse()`
* `unfold_copy_inverse()`
We need to get actual views for the introduction of reverse view funcs coming next.
Details:
* Use `as_strided()` to implement actual view inverses for the above (see the sketch after this list)
* Assumes we're given a mutated_view that is actually part of a bigger storage; this isn't really the case for functionalization
* Introduce `InverseReturnMode` enum for customization of functional inverses
* `AlwaysView` - always return an actual view; needed for reverse view_funcs()
* `NeverView` - always do a copy; useful for certain functionalization use cases (e.g. XLA, executorch)
* `ViewOrScatterInverse` - return an actual view in most cases, but prefer scatter inverses when they exist. this avoids the need to implement `as_strided()` for subclasses, which can be difficult or impossible
* Make sure functionalization works as before
* Use `ViewOrScatterInverse` when reapply_views TLS is True or `NeverView` otherwise
* Adds tests to ensure old behavior for above inverses **in functionalization**
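A rough, self-contained illustration of the `as_strided()`-based idea (not the ATen implementation; it assumes the mutated view shares storage with a larger base):
```python
import torch

base = torch.arange(12.).reshape(3, 4)
mutated_view = base.select(0, 1)  # shape (4,), aliases row 1 of `base`

# "View inverse" of select: reinterpret the slice back as a view over the whole
# base via as_strided on the shared storage, instead of scattering a copy.
recovered = mutated_view.as_strided(base.shape, base.stride(), storage_offset=0)

print(recovered.data_ptr() == base.data_ptr())  # True: a real view, no copy
print(torch.equal(recovered, base))             # True
```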
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115893
Approved by: https://github.com/bdhirsh
These tests are failing in CI with this error
```
File "/opt/conda/envs/py_3.11/lib/python3.11/site-packages/torch/_dynamo/variables/builder.py", line 1126, in wrap_numpy_ndarray
value.flags.writeable = True
^^^^^^^^^^^^^^^^^^^^^
torch._dynamo.exc.InternalTorchDynamoError: cannot set WRITEABLE flag to True of this array
```
And it may be related to a `SIGKILL` exception being raised shortly after the
failure.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116263
Approved by: https://github.com/lezcano
Summary:
Original commit changeset: 2a9588cfd51b
Original Phabricator Diff: D52062368
Test Plan: In investigating S386328 and S382826, we found checkpoint loading succeeded after backing out D52062368: S386328_backout_1220_193648
Differential Revision: D52356011
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116243
Approved by: https://github.com/voznesenskym
I found that whether the `async_op` attribute in collective ops is explicitly set to `True` or `False` by users, it always leads to a graph break, because the `async_op` argument is wrapped as a `ConstantVariable(bool)` in dynamo. So here we need to use its `value` for the check.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115921
Approved by: https://github.com/jansel, https://github.com/wconstab
Make ```SkipFilesVariable``` only handle function type, and route skipped classes to ```UserDefinedClassVariable```. The reasons behind this are:
* We'd like to remove ```is_allowed```, so the allowed/disallowed torch classes need a proper place to be handled. We could put them in either ```SkipFilesVariable``` or ```UserDefinedClassVariable``` under the current architecture, but it's confusing to have two places doing one thing.
- Going forward, let's make ```SkipFilesVariable``` only handle functions, and probably I'll rename it to ```SkippedFunctionVariable``` in the following PRs.
- Let's do dispatch by value's type, all torch classes stuff would go to ```UserDefinedClassVariable``` in the next PR.
* We'd like to merge the in_graph/skip/inline trace decisions into the same API, ```trace_rules.lookup```, so we probably have to limit its input to functions only, to better organize the ```VariableBuilder._wrap``` logic.
- Next step, I'll merge ```skipfiles.check``` into ```trace_rules.lookup```, and do the skipfile check before wrapping them into correct variable tracker.
- Though the ```TorchCtxManagerClassVariable``` is decided by ```trace_rules.lookup```, I'll refactor it out in the following PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115963
Approved by: https://github.com/jansel
This is to address issue #115309.
Test plan
`python test/distributed/tensor/parallel/test_tp_examples.py -k test_transformer_training_is_seq_parallel_False`
`python test/distributed/tensor/parallel/test_tp_examples.py -k test_transformer_training_is_seq_parallel_True`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115530
Approved by: https://github.com/wanchaol
As stated. I do notice there is perhaps opportunity to abstract, but the tests as written are also super understandable and more abstraction might not be desirable.
This PR _increases coverage_. The original tests each tested 12 default configs (leaving out Rprop). Now the tests cover ~80 configs, and then foreach + fused on top of that! Test time basically increases over 10-fold, but this test is tiny so we are not worried:
Old:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (5ca9672c)]$ python test/test_optim.py -k test_step_is_noop_when_params_have_no_grad
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
.
----------------------------------------------------------------------
Ran 1 test in 0.028s
OK
```
New (includes the old test):
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (5ca9672c)]$ python test/test_optim.py -k test_step_is_noop_when_params_have_no_grad
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
...........................
----------------------------------------------------------------------
Ran 27 tests in 0.456s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115299
Approved by: https://github.com/albanD
ghstack dependencies: #114802, #115023, #115025
Removing 4 tests:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (7539011b)]$ python test/test_optim.py -v -k test_fused_optimizers_with_large_tensors -k test_fused_optimizers_with_varying_tensors -k test_multi_tensor_optimizers_with_large_tensors -k test_multi_tensor_optimizers_with_varying_tensors
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_fused_optimizers_with_large_tensors (optim.test_optim.TestOptim) ... ok
test_fused_optimizers_with_varying_tensors (optim.test_optim.TestOptim) ... ok
test_multi_tensor_optimizers_with_large_tensors (optim.test_optim.TestOptim) ... ok
test_multi_tensor_optimizers_with_varying_tensors (optim.test_optim.TestOptim) ... ok
----------------------------------------------------------------------
Ran 4 tests in 22.731s
OK
```
For the same 4 but more granular:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (7539011b)]$ python test/test_optim.py -v -k test_fused_large_tensor -k test_fused_mixed_device_dtype -k test_foreach_large_tensor -k test_foreach_mixed_device_dtype
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_foreach_large_tensor_ASGD_cpu_float16 (__main__.TestOptimRenewedCPU) ... skipped 'Only runs on cuda'
....
test_fused_mixed_device_dtype_Adam_cpu_float32 (__main__.TestOptimRenewedCPU) ... skipped 'Only runs on cuda'
test_foreach_large_tensor_ASGD_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_Adadelta_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_Adagrad_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_AdamW_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_Adam_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_NAdam_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_RAdam_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_RMSprop_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_Rprop_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_SGD_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_ASGD_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_Adadelta_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_Adagrad_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_AdamW_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_Adam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_Adamax_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_NAdam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_RAdam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_RMSprop_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_Rprop_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_SGD_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_fused_large_tensor_AdamW_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_fused_large_tensor_Adam_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_fused_mixed_device_dtype_AdamW_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_fused_mixed_device_dtype_Adam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
----------------------------------------------------------------------
Ran 50 tests in 50.785s
OK (skipped=25)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115025
Approved by: https://github.com/albanD
ghstack dependencies: #114802, #115023
Replace the following:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (1bbf1c6f)]$ python test/test_optim.py -k test_peak_mem_multi_tensor_optimizers
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
.
----------------------------------------------------------------------
Ran 1 test in 38.599s
OK
```
with 11 tests (one for each foreach optim :))
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (1bbf1c6f)]$ python test/test_optim.py -k TestOptimRenewedCUDA.test_foreach_memory
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
...........
----------------------------------------------------------------------
Ran 11 tests in 39.293s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115023
Approved by: https://github.com/albanD
ghstack dependencies: #114802
New tests look like:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (af8fca04)]$ python test/test_optim.py -v -k TestOptimRenewedCUDA.test_fused
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_fused_AdamW_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_fused_Adam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
----------------------------------------------------------------------
Ran 2 tests in 34.591s
OK
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (af8fca04)]$ python test/test_optim.py -v -k test_set_default_dtype_works_with_foreach
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_set_default_dtype_works_with_foreach_ASGD_cpu_float64 (__main__.TestOptimRenewedCPU) ... skipped 'Only runs on cuda'
...
test_set_default_dtype_works_with_foreach_ASGD_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_Adadelta_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_Adagrad_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_AdamW_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_Adam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_Adamax_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_NAdam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_RAdam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_RMSprop_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_Rprop_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_SGD_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
----------------------------------------------------------------------
Ran 22 tests in 32.915s
OK (skipped=11)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114802
Approved by: https://github.com/albanD
Added a `--num_workers` option to `server.py` that allows more than 1 worker in the `ThreadPoolWorker` used for model predictions. Each worker uses its own `cuda.Stream()` that is created when the worker thread is initialized.
Ran benchmark for 2-4 workers with `compile=False` (since compile is not thread-safe)
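A minimal sketch of the per-worker-stream idea described above (names and structure are illustrative, not the benchmark's actual `server.py`; assumes a CUDA device is available):
```python
import threading
from concurrent.futures import ThreadPoolExecutor

import torch

_tls = threading.local()

def _init_worker():
    # Each worker thread gets its own CUDA stream when the thread starts.
    _tls.stream = torch.cuda.Stream()

def predict(model, batch):
    # Launch this worker's kernels on its private stream.
    with torch.cuda.stream(_tls.stream), torch.no_grad():
        return model(batch)

pool = ThreadPoolExecutor(max_workers=4, initializer=_init_worker)
# futures = [pool.submit(predict, model, batch) for batch in batches]
```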
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116190
Approved by: https://github.com/albanD
ghstack dependencies: #115286, #116187, #116188, #116189
Added two `ThreadPoolExecutor`s with 1 worker each for D2H and H2D copies. Each uses its own `cuda.Stream`. The purpose is to try to overlap D2H and H2D with compute and allow the worker handling prediction to launch compute kernels without being blocked by D2H/H2D.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116189
Approved by: https://github.com/albanD
ghstack dependencies: #115286, #116187, #116188
Before this PR, each `request_time` was separated by the time for a `torch.randn(...)` to create the fake `data` tensor on CPU. This meant that the gap between `request_times` **scaled with the batch_size**. So the latency comparisons across batch sizes were inaccurate. In this PR we generate all the fake data outside the loop to avoid this.
Other bug fixes:
- Only start polling GPU utilization after warmup event is complete
- Correct calculation of throughput: previously `(num_batches * batch_size) / sum(response_times)`, should have been `(num_batches * batch_size) / (last_response_time - first_request_time)` (see the sketch after this list)
- Make sure that response sent back to frontend is on CPU
- Use a lock to ensure writing to `metrics_dict` in `metrics_thread` and `gpu_utilization_thread` in a thread-safe manner
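A tiny sketch of the corrected throughput formula (function and variable names are illustrative):
```python
def throughput(request_times, response_times, batch_size):
    # Items processed divided by the wall-clock span from the first request to
    # the last response, not by the sum of per-request latencies.
    num_batches = len(response_times)
    return (num_batches * batch_size) / (max(response_times) - min(request_times))

print(throughput([0.0, 1.0, 2.0], [1.5, 2.5, 3.5], batch_size=8))  # ~6.86 items/s
```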
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116188
Approved by: https://github.com/albanD
ghstack dependencies: #115286, #116187
Experiment: run model predictions in the backend in a ThreadPoolExecutor so that each model prediction does not block reading requests from the queue
The baseline is reset in the above PR, which fixes a number of bugs in the metrics calculations, but I kept the metrics here anyway.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115286
Approved by: https://github.com/albanD
Follow-up to https://github.com/pytorch/pytorch/pull/115920
This PR fixes the error with symbolic expression in aten.div:
```python
import torch
aten = torch.ops.aten
def func(x, a):
return aten.div(x * 0.5, a, rounding_mode=None)
cfunc = torch.compile(func, dynamic=True, fullgraph=True)
device = "cpu"
x = 124
a = 33
out = cfunc(x, a)
expected = func(x, a)
torch.testing.assert_close(out, expected)
```
Error message:
```
File "/pytorch/torch/_inductor/graph.py", line 700, in call_function
out = lowerings[target](*args, **kwargs)
File "/pytorch/torch/_inductor/lowering.py", line 293, in wrapped
out = decomp_fn(*args, **kwargs)
File "/pytorch/torch/_inductor/lowering.py", line 4823, in div_mode
return div(a, b)
File "/pytorch/torch/_inductor/lowering.py", line 293, in wrapped
out = decomp_fn(*args, **kwargs)
File "/pytorch/torch/_inductor/lowering.py", line 4857, in div
a, b = promote_constants(
File "/pytorch/torch/_inductor/lowering.py", line 368, in promote_constants
ex = next(x for x in inputs if isinstance(x, (TensorBox, ExpandView)))
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
LoweringException: StopIteration:
target: aten.div.Tensor_mode
args[0]: 1.0*s0
args[1]: s1
kwargs: {'rounding_mode': None}
Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116196
Approved by: https://github.com/peterbell10
Removes an unnecessarily duplicated utility function and just has it rely on itertools. Since the file is low-traffic, I also added the modified files to the UFMT'd files and formatted them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116192
Approved by: https://github.com/malfet
Decorates all NT tests with `@markDynamoStrictTest` to ensure we get the correct signal. Adds xfails where needed to get things passing.
Includes a fix in meta_utils.py for a bug that was breaking several python 3.11 tests. In particular, a dense tensor graph input that is a view of a strided NT would slip past Dynamo's check and break in meta-ification.
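For reference, a hedged usage sketch of the decorator (the import path and the test body are my assumptions, not taken from this PR):
```python
import torch
from torch.testing._internal.common_utils import TestCase, markDynamoStrictTest, run_tests

@markDynamoStrictTest
class TestExample(TestCase):
    def test_basic(self):
        # Placeholder test body; the point is the class-level strict marking.
        t = torch.randn(2, 3)
        self.assertEqual(t.shape, (2, 3))

if __name__ == "__main__":
    run_tests()
```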
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116111
Approved by: https://github.com/soulitzer, https://github.com/zou3519
ghstack dependencies: #115192
Before this change, `-SymInt(std::numeric_limits<int64_t>::min()) == 0` would reliably crash with a null pointer dereference, as `data_` of the SymInt returned by `operator-` would be `0x8000000000000000`, because of the carry/overflow flags set by `negq`.
Before the change, the x86_64 assembly generated for
4f02cc0670/c10/core/SymInt.cpp (L137)
looked as follows:
```
0x7ffff7f2f490 <+115>: movq %rax, %rdx
0x7ffff7f2f493 <+118>: negq %rdx
0x7ffff7f2f496 <+121>: movq %rdx, (%rbp)
0x7ffff7f2f49a <+125>: movabsq $0x4000000000000000, %rdx ; imm = 0x4000000000000000
0x7ffff7f2f4a4 <+135>: cmpq %rdx, %rax
0x7ffff7f2f4a7 <+138>: jle 0x7ffff7f2f520 ; <+259> at SymInt.cpp:141:1
```
`negq %rdx` corresponds to the unary minus, and the `cmpq` against `0x4000000000000000` is the inverted `check_range`
b6d0d0819a/c10/core/SymInt.h (L247-L249)
Flags raised by `negq` will affect the result of `cmpq`, and as a result the value would not be allocated on the heap, but rather preserved as `nullptr`.
Not sure if it's worth benchmarking, but perhaps using `__builtin_sub_overflow` would be faster, as it does not require an extra comparison and just guarantees that the overflow flag is cleared after the op.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116160
Approved by: https://github.com/Skylion007, https://github.com/colesbury
Updated range_constraints to be the union of shape_env.var_to_range and shape_env.runtime_var_to_range, with shape_env.runtime_var_to_range taking priority.
Due to 0/1 specialization, if we bound an unbacked symint to be less than 5, the range of possible values for this symint is actually recorded as [2, 5] in shape_env.var_to_range. To fix this so that users will be able to see a more understandable range of [0, 5], shape_env.runtime_var_to_range was created to store the range of [0, 5]. Since range_constraints is a user-facing attribute to query the ranges of certain symints, we want to use shape_env.runtime_var_to_range to get the unbacked symints ranges, rather than shape_env.var_to_range.
Additionally, run_decompositions() has an issue where it will always add assertions to the graph, even if a previous run has already added them. So, I added a part to the AddRuntimeAssertionsForInlineConstraints pass which stores which assertions have already been added.
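A toy illustration of the "union with `runtime_var_to_range` taking priority" behavior described above (plain dicts stand in for the shape-env mappings):
```python
var_to_range         = {"u0": (2, 5)}  # internal range after 0/1 specialization
runtime_var_to_range = {"u0": (0, 5)}  # user-facing runtime range

# The later mapping wins on key collisions, so users see [0, 5].
range_constraints = {**var_to_range, **runtime_var_to_range}
print(range_constraints)  # {'u0': (0, 5)}
```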
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115427
Approved by: https://github.com/zhxchen17
This replaces a bunch of unnecessary lambdas with the operator package. This is semantically equivalent, but the operator package is faster, and arguably more readable. When the FURB rules are taken out of preview, I will enable it as a ruff check.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116027
Approved by: https://github.com/malfet
* Enable PERF402. Makes code more efficient and succinct by removing useless list copies that could be accomplished either via a list constructor or extend call. All test cases have noqa added since performance is not as sensitive in that folder.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115505
Approved by: https://github.com/malfet
Fixes an incorrect range check when the index is `std::numeric_limits<int64_t>::min()`: the result of a unary minus operation on such a value is undefined, but in practice is equal to self, see https://godbolt.org/z/Wxhh44ocr
The lower bound check was `size >= -index`, which was incorrect if `index` is `INT64_MIN`; it is replaced with `-1 - index`, which for all int64_t values returns a result that also fits into the int64_t range. `- (index + 1)` is more readable and results in identical optimized assembly, see https://godbolt.org/z/3vcnMYf9a , but its intermediate result for `INT64_MAX` is outside of the `int64_t` range, which leads to problems similar to the original example with `INT64_MIN`.
Added a regression test.
Fixes https://github.com/pytorch/pytorch/issues/115415
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116062
Approved by: https://github.com/Skylion007, https://github.com/albanD
Summary:
This change makes two major improvements to PyTorch Vulkan's shader authoring workflow.
## Review Guide
There are a lot of changed files because every GLSL shader had to be touched. The majority of the changes consist of changing
```
#define PRECISION $precision
#define FORMAT $format
```
to
```
#define PRECISION ${PRECISION}
#define FORMAT ${FORMAT}
```
due to changes in how shader templates are processed.
For reviewers, the primary functional changes to review are:
* `gen_vulkan_spv.py`
* Majority of functional changes are in this file, which controls how shader templates are processed.
* `shader_params.yaml`
* controls how shader variants are generated
## Python Codeblocks in Shader Templates
From now on, every compute shader (i.e. `.glsl`) is treated as a shader template. To this effect, the `templates/` folder has been removed and there is now a global `shader_params.yaml` file to describe the shader variants that should be generated for all shader templates.
**Taking inspiration from XNNPACK's [`xngen` tool](https://github.com/google/XNNPACK/blob/master/tools/xngen.py), shader templates can now use Python codeblocks**. One example is:
```
$if not INPLACE:
  layout(set = 0, binding = 0, FORMAT) uniform PRECISION restrict writeonly image3D uOutput;
  layout(set = 0, binding = 1) uniform PRECISION sampler3D uInput;
  layout(set = 0, binding = 2) uniform PRECISION sampler3D uOther;
  layout(set = 0, binding = 3) uniform PRECISION restrict Block {
    ivec4 output_sizes;
    ivec4 input_sizes;
    ivec4 other_sizes;
    float alpha;
  }
  uArgs;
$else:
  layout(set = 0, binding = 0, FORMAT) uniform PRECISION restrict image3D uOutput;
  layout(set = 0, binding = 1) uniform PRECISION sampler3D uOther;
  layout(set = 0, binding = 2) uniform PRECISION restrict Block {
    ivec4 output_sizes;
    ivec4 other_sizes;
    float alpha;
  }
  uArgs;
```
Another is:
```
// PYTHON CODEBLOCK
$if not IS_DIV:
  const int c_index = (pos.z % ((uArgs.output_sizes.z + 3) / 4)) * 4;
  if (uArgs.other_sizes.z != 1 && c_index + 3 >= uArgs.output_sizes.z) {
    ivec4 c_ind = ivec4(c_index) + ivec4(0, 1, 2, 3);
    vec4 mask = vec4(lessThan(c_ind, ivec4(uArgs.output_sizes.z)));
    other_texel = other_texel * mask + vec4(1, 1, 1, 1) - mask;
  }
// PYTHON CODEBLOCK
$if not INPLACE:
  ivec3 input_pos =
      map_output_pos_to_input_pos(pos, uArgs.output_sizes, uArgs.input_sizes);
  const vec4 in_texel =
      load_texel(input_pos, uArgs.output_sizes, uArgs.input_sizes, uInput);
  imageStore(uOutput, pos, OP(in_texel, other_texel, uArgs.alpha));
$else:
  const vec4 in_texel = imageLoad(uOutput, pos);
  imageStore(uOutput, pos, OP(in_texel, other_texel, uArgs.alpha));
```
In addition to making shader templates easier and clearer to write, this allows shaders that previously could not be consolidated, such as the non-inplace and inplace variants of the same shader, to be expressed with a single template.
## `generate_variant_forall` in shader variant YAML configuration
YAML files that describe how shader variants should be generated can now use a `generate_variant_forall` field to iterate over various settings for a specific parameter for each variant defined. Example:
```
unary_op:
  parameter_names_with_default_values:
    OPERATOR: exp(X)
    INPLACE: 0
  generate_variant_forall:
    INPLACE:
      - VALUE: 0
        SUFFIX: ""
      - VALUE: 1
        SUFFIX: "inplace"
  shader_variants:
    - NAME: exp
      OPERATOR: exp(X)
    - NAME: sqrt
      OPERATOR: sqrt(X)
    - NAME: log
      OPERATOR: log(X)
```
Previously, the `inplace` variants needed separate `shader_variants` entries. If there are multiple variables to iterate across, all possible combinations will be generated (a rough sketch of this expansion follows below). It would be good to take a look at how the new YAML configuration works.
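A rough sketch of that expansion (illustrative only; the `DTYPE` axis below is hypothetical and this is not the actual `gen_vulkan_spv.py` code):
```python
from itertools import product

forall = {
    "INPLACE": [{"VALUE": 0, "SUFFIX": ""}, {"VALUE": 1, "SUFFIX": "inplace"}],
    "DTYPE": [{"VALUE": "float", "SUFFIX": "float"}, {"VALUE": "half", "SUFFIX": "half"}],
}

# every combination of the iterated parameters yields one shader variant
for combo in product(*forall.values()):
    values = dict(zip(forall, (c["VALUE"] for c in combo)))
    suffix = "_".join(c["SUFFIX"] for c in combo if c["SUFFIX"])
    print("exp" + (f"_{suffix}" if suffix else ""), values)
# exp_float {'INPLACE': 0, 'DTYPE': 'float'}
# exp_half {'INPLACE': 0, 'DTYPE': 'half'}
# exp_inplace_float {'INPLACE': 1, 'DTYPE': 'float'}
# exp_inplace_half {'INPLACE': 1, 'DTYPE': 'half'}
```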
Test Plan:
There is no functional change in this diff; we only need to make sure that the generated shaders are still correct. Therefore, we only need to run `vulkan_api_test`.
```
# On Mac Laptop
buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 -- --gtest_filter="*"
```
Reviewed By: digantdesai
Differential Revision: D52087084
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115948
Approved by: https://github.com/manuelcandales
Noticed that on many MRS kernels the grid wrapper for autotuning is huge, with a bunch of duplicates because num_warps and num_stages are not needed for the grid calculation. Let's deduplicate these entries.
Previously, we would see wrapper like
```
def grid_wrapper_for_add_kernel_2d_autotuned_0(meta):
    if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
    if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
    if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
    if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
```
now it looks like
```
def grid_wrapper_for_add_kernel_2d_autotuned_0(meta):
    if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
    if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
```
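Conceptually the deduplication boils down to keying each autotuner config by only the meta fields the grid depends on; a rough Python sketch (not the actual inductor code):
```python
def dedupe_grid_configs(configs, grid_keys=("BLOCK_SIZE_X", "BLOCK_SIZE_Y")):
    # keep one entry per combination of grid-relevant meta values,
    # ignoring fields like num_warps / num_stages
    seen, unique = set(), []
    for cfg in configs:
        key = tuple(cfg[k] for k in grid_keys)
        if key not in seen:
            seen.add(key)
            unique.append(cfg)
    return unique

configs = [
    {"BLOCK_SIZE_X": 128, "BLOCK_SIZE_Y": 128, "num_warps": 4},
    {"BLOCK_SIZE_X": 128, "BLOCK_SIZE_Y": 128, "num_warps": 8},  # same grid -> dropped
    {"BLOCK_SIZE_X": 64, "BLOCK_SIZE_Y": 64, "num_warps": 4},
]
assert len(dedupe_grid_configs(configs)) == 2
```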
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115849
Approved by: https://github.com/jansel
Summary: To be used in https://github.com/pytorch/pytorch/pull/113873. Since set_ is effectively an inplace view op, we'll need to skip caching them.
Test Plan: Built pytorch; specifically this step: `/home/slarsen/local/miniconda3/envs/pytorch-3.10/bin/python -m torchgen.gen --source-path /home/slarsen/local/pytorch/cmake/../aten/src/ATen --install_dir /home/slarsen/local/pytorch/build/aten/src/ATen --per-operator-headers --generate sources --output-dependencies /home/slarsen/local/pytorch/build/aten/src/ATen/generated_sources.cmake`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115769
Approved by: https://github.com/bdhirsh
Attempt to surface the segfault that happens on exit by resetting the "pytest last run" cache if pytest succeeds. CI does not rerun on success so we won't hit an infinite loop anywhere, and I don't expect people to rerun on success (unless they're looking for flakes? Either way I highly doubt anyone is using the --sc/--scs flag locally).
This ensures that if pytest succeeds but the process gets a non zero exit code, the rerun will start at beginning instead of skipping all the "succeeding" tests.
This only applies if the --sc/--scs flags are used, custom to pytorch and probably not used anywhere other than CI, not to be confused with --stepwise, which pytest has by default
Here's a list of segfaulting inductor/test_aot_inductor tests, which I added skips for:
```
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_duplicated_params_abi_compatible_cpu_with_stack_allocation
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_fqn_abi_compatible_cpu_with_stack_allocation
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_no_args_abi_compatible_cpu_with_stack_allocation
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_output_misaligned_abi_compatible_cpu_with_stack_allocation
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_pytree_inputs_abi_compatible_cpu_with_stack_allocation
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_seq_abi_compatible_cpu_with_stack_allocation
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_simple_split_abi_compatible_cpu_with_stack_allocation
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_addmm_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_aliased_buffer_reuse_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_buffer_reuse_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_convolution_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_duplicated_params_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_empty_graph_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_fqn_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_large_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_missing_output_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_no_args_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_output_misaligned_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_output_path_1_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_pytree_inputs_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_repeat_interleave_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_return_constant_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_reuse_kernel_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_seq_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_simple_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_simple_split_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_small_constant_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_with_no_triton_profiler_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_with_offset_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_with_profiler_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_zero_size_weight_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115775
Approved by: https://github.com/desertfire
In this PR, we are implementing functionalization on the pre-dispatch graph. Today, every dispatch key except for DispatchKey.Python has a dedicated mode stack in Python. PreDispatch tracing relies on this behaviour by pushing ProxyTorchDispatchMode onto the DispatchKey.PreDispatch mode stack and handling the dispatching logic in Python. To make pre-dispatch functionalization work, we now need to push FunctionalTensorMode onto the DispatchKey.PreDispatch mode stack and make sure it runs before ProxyTorchDispatchMode (this is very similar to how post-dispatch tracing works). Here are some design decisions we made for this flow to work:
1. FunctionalTensorMode internally calls the C++ Functionalize key. Since C++ functionalization runs after PreDispatch, if we are not careful, we will keep re-entering the PreDispatch key. We solve this by directly dispatching to the C++ Functionalize key.
2. We delete the mode_stack_per_key logic because the only realistic case it is exercised for is PreDispatch, and it is in general not safe to have a plain list: the FunctionalTensorMode and ProxyTorchDispatchMode ordering matters and is hard to enforce on a plain list. Instead, we now have a private class that tracks the PreDispatch mode stack.
3. We will still run CompositeImplicitAutograd decomps in this PR, and disable this logic later as a followup.
Some missing bits after this PR:
1. Preserving autograd ops in a functional form. Right now they still show up in the graph but in a "non-functional" way.
2. Turn off CompositeImplicitAutograd decomps
3. Functionalizing HOO
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113728
Approved by: https://github.com/bdhirsh
After # 113734 landed (adding dynamic storage offsets), we found that compilation times increased significantly. The reason: tensors_definitely_do_not_overlap was doing comparisons on storage offsets which were adding guards
626b7dc847/torch/_functorch/_aot_autograd/input_output_analysis.py (L268-L276)
This guard is added on all pairs of tensors which are views of the same source tensor, i.e. the number of guards can be quadratic in the number of input tensors. This PR adds a test to prevent similar regressions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115793
Approved by: https://github.com/yanboliang
Fixes#114310 and supersedes #114748.
There are two reasons why we have quite a few special cases for `round`:
1. `round` is actually two ops. With `ndigits=None` (default), `round` always returns an integer. When `ndigits` is an integer, the returned type is a float (illustrated below).
2. Although `round` takes two arguments, it is a unary function with a parameter rather than a binary one.
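A quick illustration of the two behaviors (plain CPython semantics, independent of this PR):
```python
assert isinstance(round(2.7), int)          # ndigits omitted -> integer
assert isinstance(round(2.7, None), int)    # explicit None behaves the same
assert isinstance(round(2.718, 2), float)   # integer ndigits -> float
assert round(2.718, 2) == 2.72
```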
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115259
Approved by: https://github.com/peterbell10, https://github.com/lezcano
Currently the inductor code for `x.any(-1)` does this strange dance:
```python
tmp0 = tl.load(in_ptr0 + (r1 + (128*x0)), rmask & xmask)
tmp1 = tmp0.to(tl.int64)
tmp2 = (tmp1 != 0)
```
This happens because `register_lowering` is doing type promotion with the
dimension argument, and so promotes to `int64` which we then cast back to bool.
A better fix would be to fix `register_lowering` but for now I just remove
the unnecessary type promotion from `aten.any`.
In the current code we also see:
```python
tmp5 = tl.where(rmask & xmask, tmp3, 0)
```
which promotes the boolean value to int since `0` is an int32 in triton.
This fixes it to generate a boolean constant instead.
Finally there is also a triton bug where the `tl.load` itself upcasts to
`tl.int8`. I fix this by adding an explicit cast to `tl.int1`. The final
kernel code looks like:
```python
tmp0 = tl.load(in_ptr0 + (r1 + (128*x0)), rmask & xmask).to(tl.int1)
tmp1 = tl.broadcast_to(tmp0, [XBLOCK, RBLOCK])
tmp3 = tl.full([1, 1], 0, tl.int1)
tmp4 = tl.where(rmask & xmask, tmp1, tmp3)
tmp5 = triton_helpers.any(tmp4, 1)[:, None]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109913
Approved by: https://github.com/lezcano
Implemented by representing `torch.cfloat`/`torch.chalf` as `float2`/`half2` Metal types and modifying `SCATTER_OPS_TEMPLATE`/`GATHER_OPS_TEMPLATE` to accept a third argument, a fully specialized `cast` function, which is a no-op for regular types but special-cased for float->complex and complex->float conversions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115727
Approved by: https://github.com/kulinseth
This PR fixes two bugs:
1) Constant folding a triton kernel results in the kernel's inputs being returned without any modification. Disable constant folding for triton kernels; this needs more investigation.
2) NoneLayout buffers should not be deleted, as they do not exist.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115908
Approved by: https://github.com/aakhundov, https://github.com/jansel
Make `SkipFilesVariable` handle only function types, and route skipped classes to `UserDefinedClassVariable`. The reasons behind this are:
* We'd like to remove `is_allowed`, so the allowed/disallowed torch classes need a proper place to be handled. We could put them in either `SkipFilesVariable` or `UserDefinedClassVariable` under the current architecture, but it's confusing to have two places doing one thing.
  - Going forward, let's make `SkipFilesVariable` handle only functions; I'll probably rename it to `SkippedFunctionVariable` in a following PR.
  - Let's dispatch by the value's type; all torch class handling will go to `UserDefinedClassVariable` in the next PR.
* We'd like to merge the in_graph/skip/inline trace decision into the same API `trace_rules.lookup`, so we probably have to limit the input to functions only, to better organize the `VariableBuilder._wrap` logic.
  - As a next step, I'll merge `skipfiles.check` into `trace_rules.lookup`, and do the skipfile check before wrapping values into the correct variable tracker.
  - Though `TorchCtxManagerClassVariable` is decided by `trace_rules.lookup`, I'll refactor it out in following PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115963
Approved by: https://github.com/jansel
Summary:
This change makes the backward pass of `DTensor.from_local()` treat `Partial()` placements as pass-through, keeping the gradient as `Replicate()`, for the following reasons:
1. When we run the backward pass of DTensor.from_local, if the target placement is Partial() (i.e. from user manual-overwrite code instead of torch_dispatch), we keep the grad as Replicate(), because converting the gradients back to `Partial()` is meaningless.
2. The current div logic leads to wrong numerical values in the above case.
Test Plan:
**CI**:
CI Tests
**Unit test**:
`buck2 test mode/dev-nosan //caffe2/test/distributed/_tensor:redistribute`
- Passed
**With model training**:
```
# We tested the case where the input tensor is manually overwritten as Partial() and
# the output tensor is manually overwritten to Shard() and then to local.
# Before the change: numerical value not correct
Forward pass:
collective: ReduceScatter
backward pass:
collective: AllGather + div by process group size
# After the change: div is removed as expected.
Forward pass:
collective: ReduceScatter
Backward pass:
collective: AllGather
```
Differential Revision: D52175709
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115967
Approved by: https://github.com/wanchaol
Motivation: it would be nice to be able to test using the metrics in log_compilation_event; currently it dumps logs (or logs to a database in fbcode), which are hard to use in unit tests.
This change:
* always record the information in torch._dynamo.utils.record_compilation_metrics; here, log into a limited-size deque to prevent the list of metrics from getting too long (as sketched below)
* if config.log_compilation_metrics, then call back into the original log_compilation_event function
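A minimal sketch of the recording side, assuming a bounded deque; the constant name and size below are illustrative, not the actual torch._dynamo internals:
```python
from collections import deque

_MAX_RECORDED = 64
_recorded_metrics = deque(maxlen=_MAX_RECORDED)  # old entries drop off automatically

def record_compilation_metrics(metrics):
    _recorded_metrics.append(metrics)

for i in range(1000):
    record_compilation_metrics({"compile_id": i})
assert len(_recorded_metrics) == _MAX_RECORDED   # bounded, never grows too long
```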
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115788
Approved by: https://github.com/yanboliang
## Summary
`zip(inputs, self.input_layouts, self.desired_input_layouts)` is used in `_prepare_input_fn`; similarly for `_prepare_output_fn`. Without an assertion, unmatched dimensions in inputs/outputs will be silently lost, potentially causing unexpected behaviors (illustrated below).
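A small illustration of the failure mode, assuming hypothetical placeholder layouts:
```python
inputs = ("a", "b", "c")
input_layouts = ("Shard(0)", "Replicate()")   # hypothetical: one layout missing

# zip() silently drops the unmatched trailing element -- "c" is lost
assert list(zip(inputs, input_layouts)) == [("a", "Shard(0)"), ("b", "Replicate()")]

# an explicit length check (or zip(..., strict=True) on Python 3.10+) surfaces the bug
try:
    assert len(inputs) == len(input_layouts), "inputs/input_layouts length mismatch"
except AssertionError as e:
    print(e)
```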
## Test plan
`python test/distributed/tensor/parallel/test_tp_style.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115957
Approved by: https://github.com/wanchaol
Summary:
c2_protobuf_rule ([here](https://fburl.com/code/iyiulpmv)) is broken on buck2, ultimately due to the following error:
> .\./caffe2.proto: File does not reside within any path specified using --proto_path (or -I). You must specify a --proto_path which encompasses this file. Note that the proto_path must be an exact prefix of the .proto file names -- protoc is too dumb to figure out when two paths (e.g. absolute and relative) are equivalent (it's harder than you think).
The root cause is differences in how buck1 and buck2 handle `%SRCDIR%` (absolute versus relative paths). This diff fixes the build.
Test Plan:
# Before
```
buck2 build arvr/mode/win/opt //xplat/caffe2:caffe2.pb.h
```
```
More details at https://www.internalfb.com/intern/buck/build/c6550454-ae6d-479e-9d08-016e544ef050
BUILD SUCCEEDED
```
```
Action failed: fbsource//xplat/caffe2:caffe2.pb.h (genrule)
Remote command returned non-zero exit code <no exit code>
Reproduce locally: frecli cas download-action 5df17cf64b7e2fc5ab090c91e1129f2f3cad36dc72c7c182ab052af23d3f32aa:145
stdout:
stderr:
OUTMISS: Missing outputs: buck-out/v2/gen/fbsource/dd87aacb8683145b/xplat/caffe2/caffe2.pb.h/out/caffe2.pb.h
```
# After
Buck1 still works
```
buck1 build arvr/mode/win/opt //xplat/caffe2:caffe2.pb.h
```
Buck2 works
```
buck2 build arvr/mode/win/opt //xplat/caffe2:caffe2.pb.h
```
```
Buck UI: https://www.internalfb.com/buck2/e5dae607-325a-4eab-b0c9-66fe4e9a6254
BUILD SUCCEEDED
```
Differential Revision: D52218365
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115954
Approved by: https://github.com/mcr229
Some typos resulted in the note section not being rendered properly; this wasn't visible from the last PR directly, as the last PR only showed the first commit's documentation :(
Also makes the parallelize_module doc example more concrete.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115974
Approved by: https://github.com/wz337
The PyTorch build breaks when building from tip on ppc64le with the following error: pytorch/aten/src/ATen/native/quantized/cpu/kernels/QuantizedOpKernels.cpp:863:46: error: no matching function for call to 'at::vec::DEFAULT::Vectorized<c10::qint8>::dequantize(at::vec::DEFAULT::Vectorized&, at::vec::DEFAULT::Vectorized&)'
Issue reported in #115165
This patch fixes the build issue.
Fixes#115165
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115729
Approved by: https://github.com/albanD
Summary: This is useful for comparing the Triton kernels generated by two different invocations of torch.compile on the same model (e.g., checking whether serial compile and parallel compile generate identical Triton kernels).
Test Plan:
Unit test:
buck2 test mode/opt //caffe2/torch/fb/module_factory/sync_sgd/tests:test_torchdynamo_wrapper -- --print-passing-details >& ~/tmp/log.test
PyPer Mast job:
https://www.internalfb.com/mast/job/sw-951074659-OfflineTraining_87587a4e
See the *.py files generated in:
pyper_traces/tree/torchinductor_traces/sw-951074659-OfflineTraining_87587a4e/4623
Differential Revision: D52221500
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115979
Approved by: https://github.com/yanboliang
Summary: During inference the intermediate graphs used for optimization are not needed, so the Executor's graph is the only graph we need to keep around; these two flags gate that behavior.
Test Plan:
the FLAGS are all off by default
baseline
```
buck run mode/opt-clang sigrid/predictor/client/localnet:run_model -- --model_id_to_load=951679039 --model_snapshot_to_load=244 --torch_jit_do_not_store_optimized_graph=true
I1212 10:24:20.407408 401092 SigridPredictorLocalModelFactory.cpp:32] Memory usage for 951679039_244 is 182863 Kb
```
```
buck run mode/opt-clang sigrid/predictor/client/localnet:run_model -- --model_id_to_load=951679039 --model_snapshot_to_load=244 --torch_jit_do_not_store_optimized_graph=true --torch_jit_release_profiling_graph_after_optimization=true
I1212 10:31:37.663487 464000 SigridPredictorLocalModelFactory.cpp:32] Memory usage for 951679039_244 is 186127 Kb
```
```
buck run mode/opt-clang sigrid/predictor/client/localnet:run_model -- --model_id_to_load=951679039 --model_snapshot_to_load=244 --torch_jit_do_not_store_optimized_graph=true --torch_jit_release_profiling_graph_after_optimization=true --torch_jit_execution_plan_avoid_extra_graph_copy=true
I1212 10:29:42.848093 447218 SigridPredictorLocalModelFactory.cpp:32] Memory usage for 951679039_244 is 129451 Kb
```
Differential Revision: D52081631
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115657
Approved by: https://github.com/houseroad
As titled: when using SAC + torch.compile, the check currently only looks for functional tensors but does not check for tensor subclasses; therefore SAC under torch.compile would ignore tensor types like tensor subclasses. Fixed in this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115960
Approved by: https://github.com/bdhirsh
Summary:
Refactor the inactive-constant-buffer update to also allow updating the active buffer.
Test Plan:
Existing test to test inactive buffer updates.
UpdateConstantsCuda in cpp test for active buffer updates.
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116001
Approved by: https://github.com/chenyang78
Summary:
This change makes the input tensor contiguous for DTensor reduce-scatter in the case where no padding is needed.
No exception is thrown during training, but we ran into a numerical correctness issue without this change.
Test Plan:
**CI**
CI test
**WHEN model test**:
- Verified loss for each iteration within the expected range.
- Verified NE on-par with this change with 4B training data.
Differential Revision: D52170822
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115847
Approved by: https://github.com/wanchaol
- Add call to `at::globalContext().userEnabledMkldnn()` to `apply_mkldnn_matmul_heur`
- Surround calls to `mkldnn_matmul` with `try {} catch {}`
- Print warning and fall back to BLAS (by calling `at::globalContext().setUserEnabledMkldnn()`) if `mkldnn_matmul()` fails
Test plan: On Linux arm run:
```shell
$ sudo chmod 400 /sys; python -c "import torch;m=torch.nn.Linear(1, 32);print(torch.__version__);print(m(torch.rand(32, 1)))"
Error in cpuinfo: failed to parse the list of possible processors in /sys/devices/system/cpu/possible
Error in cpuinfo: failed to parse the list of present processors in /sys/devices/system/cpu/present
Error in cpuinfo: failed to parse both lists of possible and present processors
2.3.0.dev20231215
bad err=11 in Xbyak::Error
bad err=11 in Xbyak::Error
/home/ubuntu/miniconda3/envs/py311/lib/python3.11/site-packages/torch/nn/modules/linear.py:116: UserWarning: mkldnn_matmul failed, switching to BLAS gemm:internal error (Triggered internally at /pytorch/aten/src/ATen/native/LinearAlgebra.cpp:1509.)
return F.linear(input, self.weight, self.bias)
tensor([[-0.5183, 0.2279, -0.4035, ..., -0.3446, 0.0938, -0.2113],
[-0.5111, 0.2362, -0.3821, ..., -0.3536, 0.1011, -0.2159],
[-0.6387, 0.0894, -0.7619, ..., -0.1939, -0.0282, -0.1344],
...,
[-0.6352, 0.0934, -0.7516, ..., -0.1983, -0.0247, -0.1366],
[-0.4790, 0.2733, -0.2862, ..., -0.3939, 0.1338, -0.2365],
[-0.5702, 0.1682, -0.5580, ..., -0.2796, 0.0412, -0.1782]],
grad_fn=<AddmmBackward0>)
```
Fixes https://github.com/pytorch/pytorch/issues/114750
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115936
Approved by: https://github.com/lezcano
Support for something we need for both FSDP and optimizers. For sourced args that are not inputs (params, etc.) we use the dynamic_getattr flow on tensors. This soundly handles the storage, registration, and guarding downstream of tensor_wrap for the grad values. For non-sourced args (true intermediates), we only support None (the idea being that if we have a true intermediate in the graph with grad, we are already doing something weird).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115898
Approved by: https://github.com/bdhirsh
ghstack dependencies: #115315, #112184
Summary:
This is important for writing aten IR based graph transformation.
```
In [4]: [x.name for x in torch.ops.aten.reshape.default._schema.arguments]
Out[4]: ['self', 'shape']
In [8]: torch.ops.aten.reshape.default(torch.rand(1,2), shape=[2])
Out[8]: tensor([0.7584, 0.4834])
# === CANNOT CALL `self` BY KWARGS ===
In [7]: torch.ops.aten.reshape.default(self=torch.rand(1,2), shape=[2])
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[7], line 1
----> 1 torch.ops.aten.reshape.default(self=torch.rand(1,2), shape=[2])
TypeError: OpOverload.__call__() got multiple values for argument 'self'
```
# Where's the problem?
1. the aten op's first arg is usually named `self` (aten/src/ATen/native/native_functions.yaml)
2. Unfortunately, in `torch._ops.{OpOverload, OpOverloadPacket}.__call__()`, the first arg is (by Python convention) named `self` too.
So when `self` is passed by keyword, `OpOverloadPacket.__call__` receives:
```
OpOverloadPacket.__call__(self, {"self": ...})
```
Python does not allow the same argument to be supplied twice (once positionally and once by keyword), and hence
> TypeError: OpOverload.__call__() got multiple values for argument 'self'
# How to fix?
**Note that**, in the above, `self` is an instance of `OpOverloadPacket`, and the "self" kwarg is the input tensor to the aten op. To fix, we only need to differentiate the two `self`s.
In Python, the first argument of a method does not need to be named `self`, so we change the `__call__` definition to:
```
def __call__(_self, ...):
```
Now the call becomes:
```
OpOverloadPacket.__call__(_self, {"self": ...})
```
where:
* `_self` is the instance to the `OpOverloadPacket`
* `"self"` is the input tensor to the aten op.
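A standalone illustration of the fix (not the actual torch._ops code): renaming the bound-instance parameter frees up "self" as a keyword argument.
```python
class Broken:
    def __call__(self, **kwargs):
        return kwargs

class Fixed:
    def __call__(_self, **kwargs):   # first parameter renamed from `self` to `_self`
        return kwargs

print(Fixed()(self="tensor", shape=[2]))   # {'self': 'tensor', 'shape': [2]}
# Broken()(self="tensor", shape=[2])       # TypeError: got multiple values for argument 'self'
```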
Test Plan:
```
In [4]: [x.name for x in torch.ops.aten.reshape.default._schema.arguments]
Out[4]: ['self', 'shape']
In [3]: torch.ops.aten.reshape.default(self=torch.rand(1,2), shape=[2])
Out[3]: tensor([0.5127, 0.3051])
```
Differential Revision: D51731996
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114920
Approved by: https://github.com/houseroad
A bug fix of a recently merged PR per comment: https://github.com/pytorch/pytorch/pull/101507#discussion_r1271393706
The following test would fail without this bug fix:
```
import torch
def test_erfinv():
    for device in ['cpu', 'mps']:
        x = torch.tensor([0.1, 0.2, 0.3, 0.4, 0.5], device=device)
        y = x[2:].erfinv()
        x2 = torch.tensor([0.3, 0.4, 0.5], device=device)
        y2 = x2.erfinv()
        print(y)
        print(y2)
        torch.testing.assert_close(y, y2)
        print(f"{device} passes.")
test_erfinv()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105801
Approved by: https://github.com/malfet
> **__Note:__** The XNNPACK upgrade is very large, on the order of **40k** files and **10M** lines of code, thus we break the update of the library into multiple parts. All parts [1 - n] must be landed together for it to work. ***This also means that if there is a revert, please revert the entire stack.***
This change contains everything remaining that is required for the XNNPACK version update to work.
@allow-large-files
Differential Revision: [D52099769](https://our.internmc.facebook.com/intern/diff/D52099769/)
---
submodule
(unblock merge to make ShipIt happy)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115714
Approved by: https://github.com/digantdesai
Fixes#77818
I saw that PR #99246 was approved, but no one fixed the rebase conflicts, so I am bringing this up again to be merged.
I am leveraging @mattiaspaul's work. Quoting the description here:
> * this pull request enables 3D convolutions (forward/backward) for MPS (Apple Silicon) within the same Convolution.mm file as conv2d.
> * does not support channel_last (since pytorch doesn't implement channel_last for 3D tensors)
> * does not support conv3d_transpose and treats depth-separable convolutions not as normal case (there are no MPS kernels available for either of those so far)
> * requires MacOS >=13.2 (Ventura)
Please, let me know if there are any other changes needed and I'll be happy to implement them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114183
Approved by: https://github.com/malfet
Helps call attention to any cases where the dump actually times out.
The timeout is likely to hit if we run into slow stacktrace processing.
Log any exceptions encountered in the background thread, but don't raise
them; we're already willing to abandon the debug dump, and want to
proceed with our normal execution (in the case of dumppipe) or shutdown
process (when dumping happens on timeout and shutdown is already
initiated).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115876
Approved by: https://github.com/zdevito
ghstack dependencies: #115807
- It was installed twice: once in the `/usr/local/cuda/lib64` folder and a second time in `/usr/lib64`
- And don't install CuDNN headers thrice, only in `/usr/local/cuda/include`
- Error on unknown CUDA version
- Modify bazel builds to look for cudnn in `/usr/local/cuda` folder
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115872
Approved by: https://github.com/huydhn
## Summary
This PR added 3 intra-node GPU allreduce algorithms to PyTorch:
- One-shot allreduce (inspired by FasterTransformer): all ranks simultaneously read and accumulate data from other ranks.
- Two-shot allreduce (inspired by FasterTransformer): all ranks simultaneously read and accumulate `1 / world_size` of the data from other ranks. Then all ranks read the accumulated data from other ranks (effectively a one-shot reduce-scatter + one-shot all-gather).
- Hybrid cube mesh allreduce (original): a one-shot allreduce variant that avoids transmission over PCIe on HCM topology.
## Micro Benchmarks



## Details
The intra-node algos are organized behind `c10d::IntraNodeComm`, which is responsible for:
- Managing handshaking and cuda IPC handle exchange among ranks.
- Querying NVLink connection and detecting topology.
- Performing algo selection based on available info.
- Launching the selected allreduce kernel.
`c10d::IntraNodeComm` is integrated into `c10d::ProcessGroupNCCL` as follows:
- When the `ENABLE_INTRA_NODE_COMM` environment variable is set, `c10d::ProcessGroupNCCL` initializes a `c10d::IntraNodeComm` for its ranks (see the usage sketch below).
- If the setup is not suitable for intra-node comm (e.g. not all ranks are from the same node), the rendezvous logic guarantees all participants fall back consistently.
- `c10d::ProcessGroupNCCL::allreduce` consults `c10d::IntraNodeComm` whether to use intra-node allreduce and carries out the communication accordingly.
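A hypothetical single-node usage sketch; `ENABLE_INTRA_NODE_COMM` comes from the description above, everything else is standard torch.distributed usage (assumes a torchrun-style launcher sets the rendezvous env vars):
```python
import os
import torch
import torch.distributed as dist

os.environ["ENABLE_INTRA_NODE_COMM"] = "1"   # must be set before process group creation
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank())
t = torch.ones(1024, device="cuda")
dist.all_reduce(t)   # small messages may take the one-shot/two-shot intra-node path
```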
We currently detect two types of topologies from the NVLink connection mesh:
- Fully connected: all GPU pairs have a direct NVLink connection (e.g. NVSwitch or a fully connected sub-set of a hybrid cube mesh)
- `msg <= 256KB`: one-shot allreduce.
- `256KB < msg <= 10MB`: two-shot allreduce.
- `msg > 10MB`: instructs the caller to fallback to NCCL.
- Hybrid cube mesh
- `msg <= 256KB`: one-shot allreduce.
- `msg > 256KB`: instructs the caller to fallback to NCCL.
## Next Steps
- Fine tune algo selection based on GPU model, topology, link speed.
- Potentially optimize the two-shot allreduce impl. According to FasterTransformer, two-shot allreduce is preferred until 50MB. There might be room for improvement, but PyTorch does impose more constraints:
- FasterTransformer uses a single process to drive multiple devices. It can use `cudaDeviceEnablePeerAccess` to enable device-level peer access.
- PyTorch uses multiple processes to drive multiple devices. With cuda IPC, a device can only share a specific region with other devices. This means extra copies may be unavoidable.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114001
Approved by: https://github.com/yf225
**Summary**
Change the QConv2d Binary fusion post op name from `add` to `sum`, since we are actually using OneDNN `post op sum` instead of `Binary_Add` for now.
**TestPlan**
```
python -m pytest test_quantized_op.py -k test_qconv2d_sum_pt2e
python -m pytest test_quantized_op.py -k test_qconv2d_sum_relu_pt2e
python -m pytest test_quantized_op.py -k test_qconv2d_sum_relu_float_output_pt2e
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115329
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
Summary:
Both cuSPARSELt and CUTLASS support 1:2 semi-structured sparsity for fp32, which this PR enables (thanks @alexsamardzic).
Furthermore, this PR also updates the sparse_config to take into account
the different shape constraints for sparse and dense matrices.
Technically, cuSPARSELt supports smaller sparse matrix constraints as it
seems to pad to the CUTLASS constraints under the hood. However, in
practice small sparse matrices are not commonly used and we care more
about the dense constraints for LLM inference.
For now, we keep the CUTLASS constraints in place for both cuSPARSELt
and CUTLASS tensors
This PR also reconnects the _FUSE_TRANSPOSE flag for cuSPARSELt tensors.
Test Plan:
```
python test/test_sparse_semi_structured.py
```
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115550
Approved by: https://github.com/cpuhrsch
Tests that are added to a list in dynamo_test_failures.py will
automatically be marked as expectedFailure when run with
PYTORCH_TEST_WITH_DYNAMO=1. I'm splitting this PR off on its own so that
I can test various things on top of it.
Also added an unMarkDynamoStrictTest that is not useful until we turn
on strict mode by default.
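A rough sketch of the mechanism described above; the names and the exact hook point are illustrative, not the real PyTorch test-infra code:
```python
import os
import unittest

dynamo_expected_failures = {"MyTests.test_known_bad"}   # hypothetical entry

def apply_dynamo_expected_failures(test_cls):
    # when the Dynamo test mode is active, wrap listed tests with expectedFailure
    if os.environ.get("PYTORCH_TEST_WITH_DYNAMO") == "1":
        for name in dir(test_cls):
            if name.startswith("test_") and f"{test_cls.__name__}.{name}" in dynamo_expected_failures:
                setattr(test_cls, name, unittest.expectedFailure(getattr(test_cls, name)))
    return test_cls

@apply_dynamo_expected_failures
class MyTests(unittest.TestCase):
    def test_known_bad(self):
        self.fail("fails under dynamo")   # reported as an expected failure when the env var is set
```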
Test Plan:
- code reading
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115845
Approved by: https://github.com/voznesenskym
Summary:
We do not need to dequantize and quantize again for this op.
With this optimization cunet-enc ops:
vulkan.quantized_max_pool2d_quint8{48, 36, 2} 207532
vulkan.quantized_max_pool2d_quint8{24, 18, 4} 78832
vulkan.quantized_max_pool2d_quint8{12, 9, 8} 49296
Without optimization:
vulkan.quantized_max_pool2d_quint8{48, 36, 2} 234416
vulkan.quantized_max_pool2d_quint8{24, 18, 4} 94380
vulkan.quantized_max_pool2d_quint8{12, 9, 8} 58760
Test Plan:
Ensure all vulkan quantize tests pass:
buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output"
Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
[==========] Running 78 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 78 tests from VulkanAPITest
...
[==========] 78 tests from 1 test suite ran. (1519 ms total)
[ PASSED ] 78 tests.
buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output"
Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
[==========] Running 395 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 395 tests from VulkanAPITest
...
[ SKIPPED ] VulkanAPITest.querypool_flushed_shader_log (0 ms)
[----------] 395 tests from VulkanAPITest (6515 ms total)
[----------] Global test environment tear-down
[==========] 395 tests from 1 test suite ran. (6515 ms total)
[ PASSED ] 394 tests.
[ SKIPPED ] 1 test, listed below:
[ SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
YOU HAVE 5 DISABLED TESTS
Reviewed By: yipjustin, copyrightly
Differential Revision: D50998619
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115690
Approved by: https://github.com/SS-JIA
This removes global (but ROCM_ONLY) `over_arch` and `gcn_arch_override_flag` variables in favor of block level static initialization introduced in C++11
To quote from [ISO/IEC 14882-2014](https://www.open-std.org/jtc1/sc22/wg21/docs/standards)
>The zero-initialization (8.5) of all block-scope variables with static storage duration (3.7.1) or thread storage
> duration (3.7.2) is performed before any other initialization takes place. Constant initialization (3.6.2) of a
> block-scope entity with static storage duration, if applicable, is performed before its block is first entered.
> An implementation is permitted to perform early initialization of other block-scope variables with static or
> thread storage duration under the same conditions that an implementation is permitted to statically initialize
> a variable with static or thread storage duration in namespace scope (3.6.2). Otherwise such a variable is
> initialized the first time control passes through its declaration; such a variable is considered initialized upon
> the completion of its initialization. If the initialization exits by throwing an exception, the initialization
> is not complete, so it will be tried again the next time control enters the declaration. If control enters
> the declaration concurrently while the variable is being initialized, the concurrent execution shall wait for
> completion of the initialization.
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115844
Approved by: https://github.com/huydhn
The mutex was originally added to avoid racing to dump debuginfo,
where a race in this case would result in a corrupted dump file.
The reason a mutex helps is that it forces all dump requests to be
serialized, so that an observer would either see an in-progress file, a
complete file, or no file. Without a mutex, a fourth state is possible
(a file that has been written to by multiple threads and is invalid).
Because the mutex was a ProcessGroupNCCL class member, and each PG
instance has its own watchdog thread that can launch a dump, it was not
doing its job. Making the mutex static shares it between instances of
the class and ensures serialization of dumps triggered by any PG.
(Note: dumps triggered by different PGs have the same, global contents
anyway; there is only one global flight recorder, so it doesn't matter
who triggers it.)
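The same idea expressed as a Python sketch, with a class-level lock shared by every instance instead of a per-instance lock (names are illustrative):
```python
import threading

class FlightRecorderDumper:
    # one lock shared by all instances, analogous to a static C++ class member,
    # so dump requests from any instance are serialized
    _dump_lock = threading.Lock()

    def dump(self, path, contents):
        with FlightRecorderDumper._dump_lock:
            with open(path, "w") as f:
                f.write(contents)
```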
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115803
Approved by: https://github.com/kwen2501
ghstack dependencies: #115771, #115798, #115800, #115801
Adds a PG {process group uid} prefix component to logs.
This is helpful in situations where there are multiple processgroups,
and rank information by itself is confusing. (For example rank0 on PG1
may correspond to rank3 on PG0. People may assume 'rank0' references
the global (PG0) world, but it may reference a sub-pg. Prefacing the PG
helps clarify this.)
Does NOT change logs from inside WorkNCCL functions, since WorkNCCL
doesn't know what PG ID it corresponds to. Will address these logs
separately.
Example:
```
[I ProcessGroupNCCL.cpp:787] [PG 0 Rank 0] ProcessGroupNCCL initialization ...
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115801
Approved by: https://github.com/fduwjj
ghstack dependencies: #115771, #115798, #115800
Put the repeated code that string-formats [Rank {rank}] in one place.
Sets up for the next PR that also adds more info to this prefix.
(Does not change exception messages, which could be done as well; exception messages are not formatted quite the same way.)
Tries instead to avoid changing log behavior (in this PR) and only refactor code.
Did limited testing (some logs were observed to be OK).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115800
Approved by: https://github.com/fduwjj
ghstack dependencies: #115771, #115798
The NCCL flight recorder is per-process (it is shared by all
processgroups), but individual process groups used to construct their
own pipe for being signaled to dump the flight recorder.
This ensures that only one pipe per process is created, by only creating
the pipe on the first ProcessGroup (uid_ == 0) which should be the world
group.
Filenames are still keyed off of rank, but this should now be global
rank instead of sub-pg rank, making the filenames unique across the
whole trainer process.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115798
Approved by: https://github.com/zdevito
ghstack dependencies: #115771
This pull request adds initial Flash Attention support for the AMD/ROCm platform. It adds a specialized Triton repository/branch as a compile-time dependency for the Flash Attention math library on AMD/ROCm. This Triton submodule is not used at runtime and will not be shipped in the final PyTorch package. We plan to release this specialized Triton as a separate project.
Known limitations:
- [ ] Only supports MI200 series GPUs (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`).
- [ ] Only supports power of two sequence lengths.
- [ ] No support for varlen APIs.
- [ ] Only supports head dimensions 16, 32, 64, 128.
- [ ] Performance is still being optimized.
Fixes https://github.com/pytorch/pytorch/issues/112997
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114309
Approved by: https://github.com/jeffdaily, https://github.com/malfet
---------
Co-authored-by: Joseph Groenenboom <joseph.groenenboom@amd.com>
This test isn't actually parametrized by `dtype` so it seems to surface bogus failures where "unsupported" types "work" but in reality fp8 is used every time.
CC @drisspg I'm guessing this doesn't surface in upstream CI because there are no SM9.0 runners yet?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115661
Approved by: https://github.com/drisspg
Description:
- Added non-integer expr support for floordiv in triton codegen
- Added a test
- the cpp test is skipped as it is currently failing; https://github.com/pytorch/pytorch/pull/115647 may fix it
This PR is fixing compilation error with the following code:
```python
import torch
def func(x, a):
    n = (a * 1.234) // 8.234
    y = x + n
    return y
cfunc = torch.compile(func, dynamic=True, fullgraph=True)
device = "cuda"
x = torch.tensor(0, dtype=torch.float32, device=device)
a = 33
out = cfunc(x, a)
expected = func(x, a)
torch.testing.assert_close(out, expected)
```
Error message on Nightly:
```
File "/usr/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
raise self._exception
torch._dynamo.exc.BackendCompilerFailed: backend='compile_fx_wrapper' raised:
CompilationError: at 7:38:def triton_(in_ptr0, out_ptr0, ks0, xnumel, XBLOCK : tl.constexpr):
xoffset = tl.program_id(0) * XBLOCK
xindex = xoffset + tl.arange(0, XBLOCK)[:]
xmask = xindex < xnumel
x0 = xindex
tmp0 = tl.load(in_ptr0 + (x0), xmask)
tmp1 = ((1.23400000000000*ks0) // 8.23400000000000)
^
AssertionError()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115751
Approved by: https://github.com/peterbell10
This diff aims to directly import DeviceMesh from torch.distributed.device_mesh instead of importing it from dist._tensor. This is done to avoid a circular dependency issue. The code changes in each file of the diff are as follows:
- torch/distributed/_functional_collectives.py: import DeviceMesh from torch.distributed instead of dist._tensor.
Overall, this diff aims to improve the code by avoiding circular dependencies and improving the import statements.
==
The above summary is generated by LLM with minor manual fixes. The following summary is by me.
The original import causes some issues when compiling DDP with compiled_autograd. The root cause of compilation failure is not identified but it is good to fix the lazy initialization, which indirectly fixes the compilation issues for DDP.
Differential Revision: [D51857246](https://our.internmc.facebook.com/intern/diff/D51857246/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115649
Approved by: https://github.com/wconstab, https://github.com/wz337
ghstack dependencies: #115523, #115302, #115648
Summary:
This library contains global state, e.g. pytorch_qnnp_params. If we make
it a static library, different shared libraries linking that static
library can end up with their own copies of the global state, leading to
bugs. Make it a shared library instead, to avoid this issue.
Test Plan: buck2 test fbsource//fbandroid/javatests/com/facebook/playground/apps/fb4aplayground/scenarios/pytorchscenario:pytorchscenario -- --run-disabled --regex runBundledInputWithLocalAsset
Differential Revision: D51926024
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115570
Approved by: https://github.com/malfet
Fixes#114903
Previously, large split variance reductions stored the intermediates in float16 precision, which may lead to overflow as the intermediate result is unnormalized.
In #114903 we see two different `num_split` decisions made based on the
hardware capabilities, one of which has large enough intermediates to cause
overflows.
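To see why unnormalized intermediates are dangerous in float16 (a numpy illustration, not the inductor code itself):
```python
import numpy as np

# float16 maxes out around 65504, so an unnormalized partial sum from a split
# reduction can overflow even though the final variance is perfectly ordinary
x = np.full(4096, 100.0, dtype=np.float16)
print(np.add.reduce(x, dtype=np.float16))   # inf (409600 is not representable; numpy may warn)
print(np.add.reduce(x, dtype=np.float32))   # 409600.0 -- float32 intermediates are safe
```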
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115181
Approved by: https://github.com/shunting314
We were only passing a subset of the group creation information to the
NCCL PG. We are specifically missing the information on which global
ranks belong to a particular PG.
This allows the NCCL PG to use this additional information for things
like better trace logging.
Test Plan:
OSS CI
Reviewers:
Subscribers:
Tasks:
Tags:
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114736
Approved by: https://github.com/kwen2501
Summary:
Dynamo's test methodology provides a good example of applying various treatments to the same set of test cases. A pitfall is the global config, which could easily be modified somewhere. Here we change the behavior of the export API by hijacking it with self-defined code.
To support the non-strict test suite, `strict=False` is explicitly passed into the export API whether it is called with or without the strict arg.
Test Plan:
python test/export/test_export_nonstrict.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115399
Approved by: https://github.com/zhxchen17, https://github.com/tugsbayasgalan
Related to #103973#110532#108404#94891
**Context:**
As commented in 6ae0554d11/cmake/Dependencies.cmake (L1198)
Kernel asserts are enabled by default for CUDA and disabled for ROCm.
However it is somewhat broken, and Kernel assert was still enabled for ROCm.
Disabling kernel asserts is also needed for users who do not have PCIe atomics support. These community users have verified that disabling kernel asserts on the PyTorch/ROCm platform fixed their PyTorch workflows, such as a torch.sum script and stable-diffusion (see the related issues).
**Changes:**
This pull request serves the following purposes:
* Refactor and clean up the logic, make it simpler for ROCm to enable and disable Kernel Asserts
* Fix the bug that Kernel Asserts for ROCm was not disabled by default.
Specifically,
- Renamed `TORCH_DISABLE_GPU_ASSERTS` to `C10_USE_ROCM_KERNEL_ASSERT` for the following reasons:
(1) This variable only applies to ROCm.
(2) The new name is more aligned with the #define CUDA_KERNEL_ASSERT function.
(3) With USE_ in front of the name, we can easily control it with an environment variable to turn this feature on and off during the build (e.g. `USE_ROCM_KERNEL_ASSERT=1 python setup.py develop` will enable kernel assert for a ROCm build).
- Got rid of `ROCM_FORCE_ENABLE_GPU_ASSERTS` to simplify the logic and make it easier to understand and maintain
- Added `#cmakedefine` to carry over the CMake variable to C++
**Tests:**
(1) build with default mode and verify that USE_ROCM_KERNEL_ASSERT is OFF(0), and kernel assert is disabled:
```
python setup.py develop
```
Verify CMakeCache.txt has correct value.
```
/xxxx/pytorch/build$ grep USE_ROCM_KERNEL_ASSERT CMakeCache.txt
USE_ROCM_KERNEL_ASSERT:BOOL=0
```
Tested the following code in a ROCm build and a CUDA build, expecting different return codes.
```
subprocess.call([sys.executable, '-c', "import torch;torch._assert_async(torch.tensor(0,device='cuda'));torch.cuda.synchronize()"])
```
This piece of code is adapted from the unit test below to get around the limitation that this unit test is now skipped for ROCm. (We will look into enabling this unit test in the future.)
```
python test/test_cuda_expandable_segments.py -k test_fixed_cuda_assert_async
```
Ran the following script, expecting r == 0 since CUDA_KERNEL_ASSERT is defined as nothing:
```
>> import sys
>>> import subprocess
>>> r=subprocess.call([sys.executable, '-c', "import torch;torch._assert_async(torch.tensor(0,device='cuda'));torch.cuda.synchronize()"])
>>> r
0
```
(2) Enable the kernel assert by building with USE_ROCM_KERNEL_ASSERT=1, or USE_ROCM_KERNEL_ASSERT=ON
```
USE_ROCM_KERNEL_ASSERT=1 python setup.py develop
```
Verify `USE_ROCM_KERNEL_ASSERT` is `1`
```
/xxxx/pytorch/build$ grep USE_ROCM_KERNEL_ASSERT CMakeCache.txt
USE_ROCM_KERNEL_ASSERT:BOOL=1
```
Run the assert test, expecting a return code not equal to 0.
```
>>> import sys
>>> import subprocess
>>> r=subprocess.call([sys.executable, '-c', "import torch;torch._assert_async(torch.tensor(0,device='cuda'));torch.cuda.synchronize()"])
>>>/xxxx/pytorch/aten/src/ATen/native/hip/TensorCompare.hip:108: _assert_async_cuda_kernel: Device-side assertion `input[0] != 0' failed.
:0:rocdevice.cpp :2690: 2435301199202 us: [pid:206019 tid:0x7f6cf0a77700] Callback: Queue 0x7f64e8400000 aborting with error : HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception. code: 0x1016
>>> r
-6
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114660
Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/jithunnair-amd
Currently this skip imports torchvision, so if your torchvision install
is broken, the entire file fails at collection time. With this change,
only the test itself will fail.
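A hedged sketch of the pattern (the test and helper names below are made up for illustration): the availability check avoids importing torchvision at collection time, and the real import happens inside the test body.
```python
import importlib.util
import unittest

def _torchvision_available() -> bool:
    # find_spec only checks that the module can be located; a broken install
    # will only fail when the test itself imports it.
    return importlib.util.find_spec("torchvision") is not None

class ExampleTest(unittest.TestCase):
    @unittest.skipUnless(_torchvision_available(), "torchvision not installed")
    def test_uses_torchvision(self):
        import torchvision  # imported lazily, inside the test
        self.assertTrue(hasattr(torchvision, "models"))
```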
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115677
Approved by: https://github.com/lezcano
This implements an optional alternate interface to the AOTI
generated DSO, intended to increase efficiency for models running on
CPU and requiring minimal overhead. See comment in config.py for more
explanation.
This took a while to get right (e.g., I initially required 1-D
MiniArrayRef<T> for the inputs, but found that multi-dimensional
ArrayRefTensor<T> ended up simplifying the implementation and allowed
test_aot_inductor.py to run) and is somewhat intricate, so I am
anticipating that review will require some back-and-forth.
Differential Revision: [D50699890](https://our.internmc.facebook.com/intern/diff/D50699890/)
**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D50699890/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112800
Approved by: https://github.com/chenyang78
Summary:
`CheckForSignals()` can be called from multiple threads concurrently, e.g. from within `ExecuteStepRecursive()`. This means that `my_sigint_count_` and `my_sighup_count_` can be written concurrently, causing data races.
To fix, use atomic exchange which writes the new value and returns the old value in one atomic operation.
Test Plan: Running TSAN tests that failed before and now pass
Differential Revision: D52018963
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115510
Approved by: https://github.com/malfet
AOTInductor currently relies on export_to_torch_ir to generate a graph and passes it to Inductor to generate the .so. They would like the FQNs to be consistent so that they can easily find/update the weights in the .so.
Note that since export flattens all modules into a single computational graph, we will change the FQNs in the original module by replacing all periods with underscores. For example, `foo.child1param`, which points to a submodule named `foo`'s parameter named `child1param`, will be renamed to `foo_child1param` since we no longer have the submodule `foo`. This is done just by doing `name.replace(".", "_")`.
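A small illustration of the renaming rule (the example FQNs here are ours):
```python
original_fqns = ["foo.child1param", "bar.baz.weight"]
flattened = {name: name.replace(".", "_") for name in original_fqns}
# {'foo.child1param': 'foo_child1param', 'bar.baz.weight': 'bar_baz_weight'}
print(flattened)
```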
Outputted AOTInductor c++ code: https://www.internalfb.com/phabricator/paste/view/P900120950?lines=377-355%2C354
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115462
Approved by: https://github.com/tugsbayasgalan
This PR fixes #104791.
Bitcast requires the source and target to have the same bitwidth.
Because the input tensor's dtype could be promoted, e.g. from float16 to
float, we have to cast the tensor to its original source dtype before
invoking bitcast in such cases. After that, we also need to convert
the bit-casted tensor back to float to make sure we keep using higher
precision values for the rest of the computation.
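A toy illustration of why the cast-back matters (this is not the Inductor codegen itself): after an fp16 to fp32 promotion, the values must be cast back to fp16 before the bitcast so the bit pattern matches a direct fp16 bitcast.
```python
import torch

x = torch.tensor([1.0, 2.0], dtype=torch.float16)
direct = x.view(torch.int16)                               # bitcast at the source width
promoted = x.to(torch.float32)                             # dtype promotion inside the kernel
roundtrip = promoted.to(torch.float16).view(torch.int16)   # cast back, then bitcast
assert torch.equal(direct, roundtrip)
```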
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115619
Approved by: https://github.com/jansel, https://github.com/eellison
`torch._C.has_mkldnn` does not respect cases where users try to disable mkldnn using `torch._C._set_mkldnn_enabled()`. This is relevant to edge use cases, where they do not want decompositions to go to the ATen opset, and do not want the mkldnn operator to appear in the graph.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115448
Approved by: https://github.com/jgong5, https://github.com/ydwu4
Summary:
An internal MRS model was taking over a day's worth of time to compile due to many duplicates in dependency tracking. This PR replaces the list with a custom dedup list.
Normally one could use a set/dict for this purpose; however, the list in question gets elements appended as it is being iterated over, which means we need to keep list semantics (see the sketch below).
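A minimal sketch of such a structure (the class name is ours, not the actual implementation): duplicate appends are skipped, but iteration is index-based so elements appended mid-loop are still visited, preserving list semantics.
```python
class OrderedDedupList:
    """Append-only list that ignores duplicates (assumes items are hashable)."""

    def __init__(self):
        self._items = []
        self._seen = set()

    def append(self, item):
        if item not in self._seen:
            self._seen.add(item)
            self._items.append(item)

    def __iter__(self):
        # Index-based iteration so items appended while iterating are still visited.
        i = 0
        while i < len(self._items):
            yield self._items[i]
            i += 1

    def __len__(self):
        return len(self._items)


deps = OrderedDedupList()
deps.append("a")
deps.append("a")          # duplicate, ignored
for d in deps:
    if d == "a":
        deps.append("b")  # appended mid-iteration, still visited
print(list(deps))         # ['a', 'b']
```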
Test Plan: ad hoc testing
Differential Revision: D52060659
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115609
Approved by: https://github.com/jansel
This PR is intended to fix the following problem:
When using `CTCLoss`, there is a cudnn path gated by a call to [`_use_cudnn_ctc_loss`](
e918461377/aten/src/ATen/native/cudnn/LossCTC.cpp (L73-L101)) which checks some conditions
e918461377/aten/src/ATen/native/LossCTC.cpp (L486-L496)
However, there are more checks in `_cudnn_ctc_loss`
e918461377/aten/src/ATen/native/cudnn/LossCTC.cpp (L122-L130)
some of which are not present in `_use_cudnn_ctc_loss` (e.g. the check that `targets` is on CPU, which will cause a RuntimeError after dispatching to `_cudnn_ctc_loss`). Instead, these checks should be in `_use_cudnn_ctc_loss` so that the normal `_ctc_loss` path is used if the checks are not met.
e.g. Before this PR
```python
>>> import torch
>>> ctcloss = torch.nn.CTCLoss()
>>> log_probs = torch.randn((50, 3, 15), device='cuda').log_softmax(2)
>>> target = torch.randint(1, 15, (30 + 25 + 20,), dtype = torch.int)
>>> input_lengths = torch.tensor((50, 50, 50), device='cuda')
>>> target_lengths = torch.tensor((30, 25, 20), device='cuda')
>>> ctcloss(log_probs, target, input_lengths, target_lengths)
tensor(4.1172, device='cuda:0')
>>> target = target.to('cuda')
>>> ctcloss(log_probs, target, input_lengths, target_lengths)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/data/users/mg1998/pytorch/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/data/users/mg1998/pytorch/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/data/users/mg1998/pytorch/torch/nn/modules/loss.py", line 1779, in forward
return F.ctc_loss(log_probs, targets, input_lengths, target_lengths, self.blank, self.reduction,
File "/data/users/mg1998/pytorch/torch/nn/functional.py", line 2660, in ctc_loss
return torch.ctc_loss(
RuntimeError: Expected tensor to have CPU Backend, but got tensor with CUDA Backend (while checking arguments for cudnn_ctc_loss)
```
After this PR the above snippet runs without error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115617
Approved by: https://github.com/janeyx99
As in the title.
The reproducer for the bug is as follows:
```python
>>> import torch
>>> dtype = torch.bfloat16
>>> D1 = torch.tensor([[1, 2], [3, 4]], dtype=dtype, requires_grad=True)
>>> D2 = torch.tensor([[1, 2], [3, 4]], dtype=dtype, requires_grad=True)
>>> torch.autograd.gradcheck(torch.mm, (D1, D2), fast_mode=True)
```
<details>
```
torch.autograd.gradcheck.GradcheckError: Jacobian mismatch for output 0 with respect to input 0,
numerical:tensor(0., dtype=torch.bfloat16)
analytical:tensor(4.9062, dtype=torch.bfloat16)
The above quantities relating the numerical and analytical jacobians are computed
in fast mode. See: https://github.com/pytorch/pytorch/issues/53876 for more background
about fast mode. Below, we recompute numerical and analytical jacobians in slow mode:
Numerical:
tensor([[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.]], dtype=torch.bfloat16)
Analytical:
tensor([[1., 2., 0., 0.],
[3., 4., 0., 0.],
[0., 0., 1., 2.],
[0., 0., 3., 4.]], dtype=torch.bfloat16)
```
</details>
```python
The max per-element difference (slow mode) is: 4.0.
>>> torch.autograd.gradcheck(torch.mm, (D1, D2), fast_mode=True, eps=1e-1)
```
<details>
```
<snip>
torch.autograd.gradcheck.GradcheckError: Jacobian mismatch for output 0 with respect to input 0,
numerical:tensor(5., dtype=torch.bfloat16)
analytical:tensor(4.9062, dtype=torch.bfloat16)
The above quantities relating the numerical and analytical jacobians are computed
in fast mode. See: https://github.com/pytorch/pytorch/issues/53876 for more background
about fast mode. Below, we recompute numerical and analytical jacobians in slow mode:
Numerical:
tensor([[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.]], dtype=torch.bfloat16)
Analytical:
tensor([[1., 2., 0., 0.],
[3., 4., 0., 0.],
[0., 0., 1., 2.],
[0., 0., 3., 4.]], dtype=torch.bfloat16)
```
</details>
```
The max per-element difference (slow mode) is: 4.0.
```
Notice that changing the `eps` value has no effect on the max per-element difference.
With this PR, increasing the `eps` value leads to sensible results in the numerical Jacobian:
```python
>>> torch.autograd.gradcheck(torch.mm, (D1, D2), fast_mode=True, eps=1e-1)
```
<details>
```
<snip>
torch.autograd.gradcheck.GradcheckError: Jacobian mismatch for output 0 with respect to input 0,
numerical:tensor(5., dtype=torch.bfloat16)
analytical:tensor(4.9062, dtype=torch.bfloat16)
The above quantities relating the numerical and analytical jacobians are computed
in fast mode. See: https://github.com/pytorch/pytorch/issues/53876 for more background
about fast mode. Below, we recompute numerical and analytical jacobians in slow mode:
Numerical:
tensor([[0.9375, 1.8750, 0.0000, 0.0000],
[2.9688, 3.7500, 0.0000, 0.0000],
[0.0000, 0.0000, 1.2500, 2.5000],
[0.0000, 0.0000, 2.5000, 3.7500]], dtype=torch.bfloat16)
Analytical:
tensor([[1., 2., 0., 0.],
[3., 4., 0., 0.],
[0., 0., 1., 2.],
[0., 0., 3., 4.]], dtype=torch.bfloat16)
```
</details>
```
The max per-element difference (slow mode) is: 0.5.
```
Finally:
```python
>>> torch.autograd.gradcheck(torch.mm, (D1, D2), fast_mode=True, eps=1e-1, atol=1)
True
```
that would fail with the current main branch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115634
Approved by: https://github.com/lezcano, https://github.com/soulitzer, https://github.com/albanD
ghstack dependencies: #115536
Summary: The original logic has an incorrect assumption that there is only one object name left when traversing the module tree. This is not correct when the leaf module is wrapped by FSDP.
Test Plan: CI
Differential Revision: D52049293
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115592
Approved by: https://github.com/wz337
Fixes #50051.
This PR is based on #50320 and addresses the last round of feedback.
This PR adds support for overriding the terminate handler in order to log uncaught exceptions in threads. On Windows it is enabled by default; it can be enabled or disabled via the USE_CUSTOM_TERMINATE env variable.
If an exception is thrown and not caught, it will print <Unhandled exception caught in c10/util/AbortHandler.h>.
The point of doing this is that in issue #50051, exceptions were thrown but not logged. With this logging system it will be easier to debug them in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101332
Approved by: https://github.com/albanD, https://github.com/malfet
> **__Note:__** The XNNPACK upgrade is very large, in the range of **40k** files and **10m** lines of code, thus we break the update of the library into multiple parts. All parts [1 - 6/n] must be landed together for it to work. ***This also means that if there is a revert, please revert the entire stack.***
This change contains everything remaining that the new XNNPACK version requires to work.
Differential Revision: [D52044420](https://our.internmc.facebook.com/intern/diff/D52044420/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115587
Approved by: https://github.com/digantdesai
As in the title.
In addition:
- improve the algorithm for finding a minimum of operation timings: break the inner loop early when the next minimum candidate is found
- add tests and fix bugs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115499
Approved by: https://github.com/cpuhrsch
Previously if two calls to cumsum were generated in the same triton kernel
we would generate identical helper functions with different names. Now this
recognizes identical functions and only defines them once. To do this, I defer
choosing the name until after codegen.
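A rough sketch of the deduplication idea (names and structure are ours, not the Inductor code): helper bodies are keyed with a placeholder name so identical bodies map to one definition, and the concrete name is only chosen when a body is first registered.
```python
helper_cache = {}

def define_helper(body_template: str) -> str:
    """body_template uses {name} as a placeholder for the final function name."""
    name = helper_cache.get(body_template)
    if name is None:
        name = f"_helper_fn_{len(helper_cache)}"
        helper_cache[body_template] = name
    return name

a = define_helper("def {name}(a, b):\n    return a + b")
b = define_helper("def {name}(a, b):\n    return a + b")
assert a == b  # identical helper body is only defined once
```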
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115546
Approved by: https://github.com/lezcano
ghstack dependencies: #109132
But limit it to macOS Sonoma+.
Before, calling `torch.cat` with complex types failed; now it works.
Before:
```
% python -c "import torch;print(torch.cat([torch.rand(3, 3, dtype=torch.cfloat).to('mps'), torch.rand(3, 3, dtype=torch.cfloat).to('mps')]))"
TypeError: Trying to convert ComplexFloat to the MPS backend but it does not have support for that dtype.
```
After:
```
% python -c "import torch;print(torch.cat([torch.rand(3, 3, dtype=torch.cfloat).to('mps'), torch.rand(3, 3, dtype=torch.cfloat).to('mps')]))"
tensor([[0.4857+0.0030j, 0.9375+0.8630j, 0.3544+0.9911j],
[0.5293+0.8652j, 0.8440+0.1991j, 0.5152+0.8276j],
[0.0136+0.7469j, 0.1403+0.4761j, 0.2943+0.0896j],
[0.6458+0.0035j, 0.3579+0.4577j, 0.1723+0.1508j],
[0.4420+0.3554j, 0.4396+0.7272j, 0.2479+0.1191j],
[0.3895+0.2292j, 0.7886+0.1613j, 0.9243+0.4180j]], device='mps:0')
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115513
Approved by: https://github.com/kulinseth
ghstack dependencies: #115512
* Try linux.large.arc for stale workflow
* Run stale workflow on PR changes
* Added arc runner label to the list of self-hosted runners
* Added concurrency to linux-job
* Cleanup
* Added workflow_dispatch for testing purposes
Summary:
The Dynamo test methodology provides a good example of applying different
treatments to the same set of test cases. A pitfall is the global config,
which could easily be modified somewhere. Here we change the behavior of
the export API by hijacking it with self-defined code.
To support the non-strict test suite, `strict=False` is explicitly
passed into the export API whether or not the caller supplies a strict arg.
* For existing failed strict test cases, non-strict also fails.
* For passed strict but failed non-strict cases, we mark them as
`@testing.expectedFailureNonStrict`.
* Moreover, I manually checked the failure reasons, and some of them are not related to the nn.Module asserting exception. I mark them as `# Need to fix for non-strict mode`.
Test Plan:
python test/export/test_export_nonstrict.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115399
Approved by: https://github.com/zhxchen17, https://github.com/tugsbayasgalan
This changes cached thread_local tensors to stack-allocated buffers. Since we were incidentally caching output in a thread_local, I had to add manual thread_local caching of outputs, which I implemented by caching a buffer and a Tensor whose storage is that buffer and then just memcpying the result into the cached buffer every time. Ideally, memory planning would be able to identify allocations that are the backing storage for outputs, but this should be good enough in the absence of planning.
Differential Revision: [D50416438](https://our.internmc.facebook.com/intern/diff/D50416438/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112116
Approved by: https://github.com/jansel, https://github.com/desertfire
This pull request adds a tool to visualize sharding. It uses the device_mesh and placement details to construct a visualization of the split of a torch dtensor.
Things to fix:
- [x] This implementation only uses the first element of the placement tuple; when can there be more than one element?
- [x] The calculation of the split is happening here, but maybe it is already done somewhere internally in the Shard class; can we directly call that here?
Fixes #108746
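A hedged usage sketch (the helper's import path is an assumption on our part, and the snippet needs to run under a multi-process launcher such as `torchrun --nproc-per-node 2`):
```python
import torch
from torch.distributed._tensor import distribute_tensor, Shard
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._tensor.debug import visualize_sharding  # assumed location

mesh = init_device_mesh("cpu", (2,))                  # one mesh dim, two ranks
dtensor = distribute_tensor(torch.randn(4, 4), mesh, [Shard(0)])
visualize_sharding(dtensor)                           # prints which rank owns which rows
```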
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114307
Approved by: https://github.com/wanchaol
This PR adds experimental implicit replication support for DTensor to
inter-op with torch.Tensor: under this context manager, DTensor
can work together with torch.Tensor by assuming the torch.Tensor's
sharding layout is replicated.
Note that this is risky for DTensor, so we don't turn it on by default,
but for certain cases where it is for sure replicated, users can use this
to allow DTensor and Tensor computation to work together.
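A hedged sketch of how this might be used (the context manager's name and import path are assumptions on our part; it also needs a real process group, e.g. via `torchrun --nproc-per-node 2`):
```python
import torch
from torch.distributed._tensor import distribute_tensor, Shard
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._tensor.experimental import implicit_replication  # assumed

mesh = init_device_mesh("cpu", (2,))
dt = distribute_tensor(torch.randn(8, 8), mesh, [Shard(0)])
with implicit_replication():
    # The plain torch.Tensor is treated as if it were replicated on the mesh.
    out = dt + torch.ones(8, 8)
```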
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115297
Approved by: https://github.com/awgu
We added a monitor thread in NCCL PG in https://github.com/pytorch/pytorch/pull/112518. To summarize what we are doing in the monitor thread: it listens to the heartbeat from the watchdog thread and detects unhealthy NCCL watchdog hangs (due to several reasons such as nccl/cuda API bugs or unexpected blocking behaviors). This is the last resort to ensure that we don't silently let the training job run for hours.
We didn't enable this feature by default, since we wanted to perform more due diligence and have some customers try it out. So far, we haven't seen any obstacle that blocks turning on this feature, and we have received positive feedback from users. We now decided to turn it on by default in this PR.
If this feature turns out not to work as expected and disturbs one's training process, one can set `TORCH_NCCL_ENABLE_MONITORING=0` to disable it. Please kindly file an issue with us so that we can see if we missed any corner cases during the design.
Differential Revision: [D52045911](https://our.internmc.facebook.com/intern/diff/D52045911)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115577
Approved by: https://github.com/wconstab, https://github.com/kwen2501
Summary:
cuSPARSELt has support for different alg_ids, which are set via
`cusparseLtMatmulAlgSetAttribute`; in total there are 4 different
alg_ids, 0 - 3.
Previously we were just using the default alg_id, since our initial
experiments found that for most shapes the default alg_id is the
fastest and that the alg_id made no difference to numerical correctness, just
performance. From our previous experiments the fastest alg_id seemed to
differ only on small matmul shapes.
danthe3rd found a performance regression when running with
cuSPARSELt v0.4.0 vs v0.5.0, on LLM shapes, which match these
characteristics (activations are small, weights are large).
However it's likely that this is due to the alg_id ordering changing, as
mentioned in the release notes for v0.5.0.
```
cusparseLtMatmulAlgSelectionInit() does not ensure the same ordering of
algorithm id alg as in v0.4.0.
```
This PR adds in the following:
- support for passing in alg_id to _cslt_sparse_mm
- a new op, _cslt_sparse_mm_search, which returns the optimal alg_id for
a given matmul
_cslt_sparse_mm_search has the same function signature as
_cslt_sparse_mm, minus the alg_id parameter.
We are able to achieve v0.4.0 performance with alg_id=1 on the shapes
that daniel provided.
We will address autoselecting the best alg_id in a future PR, possibly
with torch.compile.
Test Plan:
```
python test/test_sparse_semi_structured -k cslt
```
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115178
Approved by: https://github.com/cpuhrsch
InterpreterModule is better than GraphModule codegen; it's more debuggable and
has better stack traces. The only reason we don't use it today is because
torch.compile doesn't work with it.
I work around this by constructing a GraphModule separately for usage during
dynamo tracing, but otherwise using torch.fx.Interpreter.
Differential Revision: [D51971661](https://our.internmc.facebook.com/intern/diff/D51971661/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115436
Approved by: https://github.com/zhxchen17
ghstack dependencies: #115408
UnflattenedModule doesn't really behave like a graph module; we customize `__call__` to do something completely different than what GraphModule does. So, things that test `isinstance(unflattened_module, GraphModule)` and do something with the GraphModule are often broken.
This change makes UnflattenedModule its own thing.
Differential Revision: [D51959097](https://our.internmc.facebook.com/intern/diff/D51959097/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115408
Approved by: https://github.com/zhxchen17
Prerequisite for adding more complex type support and FFT operations.
Check using `conjugateWithTensor:name:` selector defined as follows
```objc
/// Returns the complex conjugate of the input tensor elements.
///
/// - Parameters:
/// - tensor: The input tensor.
/// - name: An optional string which serves as an identifier for the operation..
/// - Returns: A valid `MPSGraphTensor` object containing the elementwise result of the applied operation.
-(MPSGraphTensor *) conjugateWithTensor:(MPSGraphTensor *) tensor
name:(NSString * _Nullable) name
MPS_AVAILABLE_STARTING(macos(14.0), ios(17.0), tvos(17.0))
MPS_SWIFT_NAME( conjugate(tensor:name:) );
```
- Rename `isOnMacOS13orNewer(unsigned minor)` hook to `isOnMacOSorNewer(major, minor)`
- Replace `torch._C.__mps_is_on_macos_13_or_newer` with `torch._C._mps_is_on_macos_or_newer`
- Add `torch.backends.mps.is_macos_or_newer` public API
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115512
Approved by: https://github.com/albanD
In https://github.com/pytorch/pytorch/pull/115449/ somehow after turning on `DUMP_ON_TIMEOUT=1`, some existing tests failed. Upon checking, the failing is because of TCPStore check call within watchdog thread.
1. It's not because TCPStore creation has not completed: even if I make it sleep for a long time, the test still fails. Rather, it's because we query TCPStore after we shut down the PG.
2. The reason is that `std::chrono::steady_clock::now()` in C++ returns a `time_point` object representing the current point in time according to the steady clock. The default unit of this time_point is not directly specified in terms of seconds or nanoseconds; rather, it depends on the internal representation of the steady clock, which can vary between implementations. In reality it is nanoseconds, which makes the delta so big that we check the store every time the watchdog thread wakes up. To make things even worse, `terminateProcessGroup_` might be set to `True` after the outermost while-loop check but before the TCPStore check, so the watchdog gets stuck because we are querying a TCPStore that has already been deleted, while the main thread is still waiting for the watchdog to join.
The solution here is:
1. Add back `std::chrono::duration_cast` to ensure the delta is indeed in milliseconds, so that the timeout check logic works as expected.
2. Also check `terminateProcessGroup_` so that we don't do any dump when the main thread has already marked the process as exited.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115475
Approved by: https://github.com/wconstab
Summary:
In order to get better performance from conv2d pointwise, it is better to read the input together in a batch.
With this optimization, on CUNET-enc ops:

| Kernel Name | Workgroup Size | Duration P50 (ns) |
| ----------- | -------------- | ----------------- |
| vulkan.quantized_conv2d_pw_2x2 | {96, 72, 2} | 891332 |
| vulkan.quantized_conv2d_pw_2x2 | {48, 36, 4} | 528528 |
| vulkan.quantized_conv2d_pw_2x2 | {24, 18, 8} | 557336 |

Without this optimization:

| Kernel Name | Workgroup Size | Duration P50 (ns) |
| ----------- | -------------- | ----------------- |
| vulkan.quantized_conv2d_pw_2x2 | {96, 72, 2} | 1633268 |
| vulkan.quantized_conv2d_pw_2x2 | {48, 36, 4} | 1177228 |
| vulkan.quantized_conv2d_pw_2x2 | {24, 18, 8} | 1343264 |
Test Plan:
Ensure all vulkan quantize tests pass:
buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output"
Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
[==========] Running 78 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 78 tests from VulkanAPITest
[ RUN ] VulkanAPITest.uniform_buffer_copy
...
[----------] Global test environment tear-down
[==========] 78 tests from 1 test suite ran. (1519 ms total)
[ PASSED ] 78 tests.
buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output"
Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
[==========] Running 395 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 395 tests from VulkanAPITest
[ RUN ] VulkanAPITest.zero_size_tensor
[ OK ] VulkanAPITest.zero_size_tensor (83 ms)
...
xplat/caffe2/aten/src/ATen/test/vulkan_api_test.cpp:7593: Skipped
QueryPool is not available
[ SKIPPED ] VulkanAPITest.querypool_flushed_shader_log (0 ms)
[----------] 395 tests from VulkanAPITest (6515 ms total)
[----------] Global test environment tear-down
[==========] 395 tests from 1 test suite ran. (6515 ms total)
[ PASSED ] 394 tests.
[ SKIPPED ] 1 test, listed below:
[ SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
YOU HAVE 5 DISABLED TESTS
Reviewed By: yipjustin
Differential Revision: D50997530
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115221
Approved by: https://github.com/yipjustin
Constant-time access of the first value in a collection. This is a constant-time operation, unlike converting the collection to a list to get the first item, which is linear. The rule is turned on, which automatically autofixes and enforces this.
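For illustration, the difference the rule enforces:
```python
values = {"a": 1, "b": 2, "c": 3}
first_linear = list(values)[0]       # O(n): materializes every key
first_constant = next(iter(values))  # O(1): stops after the first key
assert first_linear == first_constant
```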
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115507
Approved by: https://github.com/malfet
Currently, we place constants in the .so. To avoid cases
where constants are too large (i.e. >2G), we put the
constants into .lrodata, which doesn't have the 2G limit.
Not sure why, but lld still issues errors like the one below even when
those large constant data are stored in the .lrodata section:
"relocation R_X86_64_PC32 out of range: 5459191920 is not in
[-2147483648, 2147483647]"
In contrast, the default GNU ld linker works fine. Let's
switch back to using ld to unblock some internal models.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115478
Approved by: https://github.com/desertfire, https://github.com/htyu
Fixes #113422. Fixes #94575.
This is now possible:
```py
model = Model()
compiled_model = torch.compile(model)
model.load_state_dict(compiled_model.state_dict()) # previously key mismatch!
```
This also makes it much easier to checkpoint and load models that were wrapped like so:
```py
FSDP(torch.compile(model))
# or
DDP(torch.compile(model))
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113423
Approved by: https://github.com/msaroufim
1. Removes calls to `replace_all` and `clone` and makes VTs mutable.
2. Properly handles Tuple Iterator mutation. Previously TupleIterator variables would only be properly reconstructed if they were advanced at least once in a frame. On calls to `next`, the source information would be lost (due to constructing a new iterator without using builder), which would ensure that during codegen the variable would be reconstructed from scratch. Now that VTs are mutated, the source is never lost, so we need to properly track mutation and handle it by replaying calls to `next` at the end of the modified bytecode.
3. Added test for checking iadd side effects, this was missing in our unit test coverage.
4. Fixed two incorrect sources: DelayGraphBreakVariable and UserMethodVariable both relied on setting the source to AttrSource(parent, name) at the callsite of `var_getattr`.
5. Fixed a bug in inplace adding for lists, it would set the resulting VariableTracker's source to `None` which would utilize a different reconstruct path in codegen. Now this is handled explicitly by reconstructing vars when allow_cache=`False`, so that during side effect replay, the mutated var is correctly updated.
In subsequent PRs:
* Refactoring side effect tracking to be significantly simpler (I think we only need an `is_modified` flag)
* Refactor `next_variables` iterator to match the signature of `next`
* Remove all references to `options` in the code
* Refactor VTs representing mutable collections to implement their own mutation update handling
* Remove clone and/or make it specific to lists for creating slices
* Add mutation tracking/replay for sets
* Add mutation tracking/replay for iter.py
* Removing setting source in builder (it's set at the top level after a var is returned)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113725
Approved by: https://github.com/jansel
Re-enable type checking for distributed_c10d.py
Type checking for distributed_c10d.py was inadvertently turned off, and type errors have accumulated since.
Note: the backwards compatibility linter does not like some of these changes. But they were incorrect before. This needs human verification, however.
#suppress-api-compatibility-check
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115223
Approved by: https://github.com/wconstab
Summary: This diff is only for prototype to unblock the TP work. PyTorch distributed team is working on a more generic backward op for `aten.layer_norm`. Will remove this op from the experimental file once it is ready.
Test Plan:
**Local Test**:
Accuracy:
- Dtensor + Checkpoint: first run loss: P884569822 (on-par with baseline: P884213363)
- 2nd by loading saved checkpoint: P884583429 (on-par with baseline: P884271869)
Trace:
- Collective functions are inserted automatically.
- Example: https://fburl.com/perfdoctor/l567ww1x
**MAST Test**:
With: trainer = 128, batch_size=512
- NE on-par:
(see: 4441_ep_bs512_2fsdp_tp_sp_dtensor)
{F1155318138}
Differential Revision: D51490868
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115398
Approved by: https://github.com/wanchaol
Currently (after https://github.com/pytorch/pytorch/pull/114407), the user must pass the original user ``model`` to APIs such as ``ONNXProgram.__call__``, ``ONNXProgram.adapt_torch_inputs_to_onnx`` and ``ONNXProgram.adapt_torch_outputs_to_onnx``.
This was needed because when the model is fakefied, a version of the non-fakefied model is needed so that the initializers, buffers and constants can be extracted from a real model (and used as input to the ONNX model).
That approach brings an unnecessary usability burden to the user when the model is not fakefied, because the model that was already passed to ``torch.onnx.dynamo_export`` could be used to extract the ``state_dict``.
This PR adds an ``ONNXProgram._model_torch`` attribute to store the user model and demotes the ``model`` argument of the aforementioned APIs from required to optional.
As a result, for the fakefied model scenario, the user still needs to pass the model, but for non-fakefied models the persisted model is implicitly used to extract the model state_dict, making the API easier to use.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115281
Approved by: https://github.com/BowenBao
ghstack dependencies: #114407
multiprocessing.Queue relies on, among other things, background threads to send messages between processes. This works in the happy path, but can cause issues if a process exits by bypassing atexit handlers or crashes, because the writer to the Queue can terminate while the reader is blocked reading the queue. The reader sees the queue as non-empty, yet even with a timeout it will actually block forever.
An example of a Queue deadlock is here: https://gist.github.com/chipturner/342f72341f087737befe9df84d0e41ce
Since the error reporting case here is a simple one-shot message from the dying child to the parent, we can just use a file-based rendezvous. This eliminates the deadlock when a large traceback is still being flushed to the network when a child exits.
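A minimal sketch of the idea (the file layout and names here are ours, not the actual torch.multiprocessing implementation): the child writes its traceback to a pre-agreed file and the parent reads it after the child exits, with no background Queue threads involved.
```python
import multiprocessing as mp
import os
import tempfile
import traceback

def child(report_path: str) -> None:
    try:
        raise RuntimeError("boom")
    except Exception:
        with open(report_path, "w") as f:
            f.write(traceback.format_exc())
        os._exit(1)  # exits without running atexit handlers, like a crashing worker

if __name__ == "__main__":
    report = os.path.join(tempfile.mkdtemp(), "error.txt")
    p = mp.Process(target=child, args=(report,))
    p.start()
    p.join()
    if p.exitcode != 0 and os.path.exists(report):
        print(open(report).read())  # parent reads the one-shot error report
```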
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114688
Approved by: https://github.com/suo, https://github.com/yifuwang
Summary:
Most NT operations end with creating a new NestedTensor, which is time-consuming. This tries to reduce overhead during NestedTensor creation.
The ops return a new NestedTensor with the same offsets, so "tensor not in _tensor_symint_registry" would be false in most cases. The "in" (`__contains__`) check takes ~8 us. If we use "get" directly, we save a few microseconds for most NT operations.
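A plain-Python illustration of the micro-optimization (the registry and helper names here are ours, not the NT internals): one `get` lookup replaces the `in` check plus the subsequent indexing on the common hit path.
```python
registry = {}

def get_symint_slow(key, make):
    if key not in registry:      # first lookup
        registry[key] = make()
    return registry[key]         # second lookup on every call

def get_symint_fast(key, make):
    val = registry.get(key)      # single lookup covers the common hit path
    if val is None:
        val = registry[key] = make()
    return val
```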
Test Plan:
Before:
get_tensor_symint take 15us
https://pxl.cl/3XF83
After
get_tensor_symint take 10us
https://pxl.cl/3XFc9
Differential Revision: D51992836
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115450
Approved by: https://github.com/soulitzer
Patch `--save-xml` when `TEST_IN_SUBPROCESS`
When `--save-xml` is given as a unit test argument and the test is handled by a `TEST_IN_SUBPROCESS` handler (e.g., `run_test_with_subprocess` for `distributed/test_c10d_nccl`), the `--save-xml` args are first "consumed" by the argparser in `common_utils.py`. When a subsequent subprocess in this `if TEST_IN_SUBPROCESS:` section starts, there are no `--save-xml` args, thus leaving `args.save_xml` as `None`.
Since the argparser for the `--save-xml` option defaults to `_get_test_report_path()` when the arg is `None`, this is not a problem for GitHub CI runs. It could be an issue when people run those tests without `CI=1`: test reports won't be saved in this case even if they passed `--save-xml=xxx`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115463
Approved by: https://github.com/clee2000
Currently, the ONNX exporter using torch.nn.Module as input can support
FakeTensor because the ONNX model stores all initializers
When using torch.export.ExportedProgram as input, the initializers are
lifted as inputs. In order to execute the ONNX model, we need to pass a
reference to the non-fake model to the
ONNXProgram.adapt_torch_inputs_to_onnx API, so that initializers can be
fetched from the model and fed to the ONNX model as input
ps: https://github.com/pytorch/pytorch/issues/115461 will track the API revision for the cases where additional `model_with_state_dict` are required to produce complete ONNX files exported with fake support. This is also tracked by the umbrella fake tensor issue https://github.com/pytorch/pytorch/issues/105464 FYI @BowenBao
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114407
Approved by: https://github.com/BowenBao
Fixes https://github.com/pytorch/pytorch/issues/113717.
When `preserve_rng_state=True`, we let AOTAutograd trace through `torch.random.fork_rng` op, and the tracing doesn't work under CUDA, hence the original error reported in the issue.
But since we are already doing RNG functionalization at Inductor level, we don't actually need to trace this `fork_rng` op. So we should just rewrite `preserve_rng_state` to False when we are using torch.compile (and let Inductor do its RNG functionalization which it's already been doing).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113718
Approved by: https://github.com/wanchaol
## Summary
Since #97351, CPU ATen kernel for `mean` for BF16 & FP16 dtypes has been unvectorized (it's not even implicitly vectorized).
This PR vectorizes `mean` for BF16 & FP16 on CPU in a `cast_fp32 -> sum -> div -> cast_bf16_or_fp16` fashion.
The perf benefit would be especially pronounced on machines with `AVX512_BF16` and/or `AVX512_FP16` ISA support.
## Benchmarking data for BF16 (collected before & after the change in this PR)
**Machine:** Intel® Xeon® (4th generation series, formerly codenamed Sapphire Rapids) Platinum 8468H
One socket (48 physical cores) - used `numactl --membind=0 --cpunodebind=0`
libtcmalloc & Intel OpenMP were preloaded
Environment variable used -
`KMP_AFFINITY=granularity=fine,compact,1,0 KMP_BLOCKTIME=1 KMP_SETTINGS=1 OMP_NUM_THREADS=48 MKL_NUM_THREADS=48`
**Workload:** E2E performance on BS 32 resnet50 (using BF16 via AMP) inference using oneDNN Graph JIT fuser (`mean` kernel is dispatched to eager mode ATen kernel, and is the bottleneck right now)
| **BEFORE:** Latency with unvectorized mean (lower is better)| **AFTER:** Latency with vectorized mean (lower is better)| Speedup due to vectorizing mean|
|----------------------------|-------------------------|------------|
| 19.1 ms | 10.8 ms | latency reduced by ~43.45% |
**Benchmarking script for BF16 -**
```
import time
import torch
import torchvision

# enable oneDNN Graph JIT fuser
torch.jit.enable_onednn_fusion(True)
# AMP for JIT mode is enabled by default, and is divergent with its eager mode counterpart
torch._C._jit_set_autocast_mode(False)
# sample input should be of the same shape as expected inputs
example_input = torch.rand(32, 3, 224, 224)
# Using resnet50 from torchvision in this example for illustrative purposes,
# but the line below can indeed be modified to use custom models as well.
model = getattr(torchvision.models, "resnet50")().eval()
with torch.no_grad(), torch.cpu.amp.autocast(cache_enabled=False, dtype=torch.bfloat16):
    # Conv-BatchNorm folding for CNN-based Vision Models should be done with ``torch.fx.experimental.optimization.fuse`` when AMP is used
    import torch.fx.experimental.optimization as optimization
    # Please note that optimization.fuse need not be called when AMP is not used
    model = optimization.fuse(model)
    model = torch.jit.trace(model, (example_input))
    model = torch.jit.freeze(model)
    # a couple of warm-up runs
    model(example_input)
    model(example_input)
    # speedup would be observed in subsequent runs
    start = time.time()
    model(example_input)
    end = time.time()
    inference_time = (end - start) * 1000
    print("Inference time is ", inference_time)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114582
Approved by: https://github.com/jgong5, https://github.com/malfet
Summary: This PR does 2 things:
1) Previously this would simply error; now it will ignore any torch.inf values that it receives. Note: the code checks for torch.inf after aminmax, so that if no torch.inf values are found, the perf is relatively unchanged.
2) As mentioned in https://github.com/pytorch/pytorch/issues/100051, values close to (but not quite at) the maximum/minimum float value could overflow to infinity in the course of _adjust_min_max() (when this large value is multiplied by something in the middle of a calculation that would otherwise result in a non-inf value). This was fixed by rearranging the order of operations for the lines in question without altering the actual equations. Specifically, where the operations on lines 1095, 1098 and 1100 multiply and divide large values, it is better to divide the two large values before multiplying, rather than multiplying the two large values together (creating overflow) before dividing as was done previously (see the sketch below).
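A toy example of the order-of-operations fix (the values here are chosen only to demonstrate the overflow and are not taken from the observer code):
```python
import torch

fmax = torch.finfo(torch.float32).max
a = torch.tensor(fmax / 2)
b = torch.tensor(fmax / 3)
c = torch.tensor(fmax / 2)

naive = a * b / c   # a * b overflows to inf, so the whole result is inf
fixed = a / c * b   # divide first: a / c == 1.0, so the result stays finite
print(naive, fixed)
```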
Test Plan: python test/test_quantization.py
TestObserver.test_histogram_observer_ignore_infinity
python test/test_quantization.py TestObserver.test_histogram_observer_handle_close_to_infinity
Reviewers:
Subscribers:
Tasks:
Tags:
Differential Revision: [D51489345](https://our.internmc.facebook.com/intern/diff/D51489345)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103467
Approved by: https://github.com/andrewor14
RE #115301
Decoupling gives us a path to disable timing without disabling the
flight recorder.
Flight recorder is still useful for stuckness analysis without 'timing'.
Disabling timing makes it miss the 'started'
state that comes from using an extra nccl event at the start of each
collective. It will also be missing 'duration_ms' of collectives, which
hasn't been landed yet, but is useful for timing/perf work more than
stuckness analysis.
Hopefully we can enable timing by default and leave both on, but it's
nice to have the flexibility for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115358
Approved by: https://github.com/fduwjj
This PR proposes a new approach to solve the problem of nn/optim being linked only by Python object identity.
The idea is to have a function that can swap the content of two Tensors t1 and t2 while preserving all the old references.
This would allow us to swap the `model.weight` with a new Tensor (can be any subclass of Tensor and any TensorImpl (xla, sparse, nested tensorimpl would work)). The use within nn will be done in a follow up.
This is done by swapping the whole content of the PyObject and then putting back the fields associated with external references (refcount, gc tracking and weakrefs).
Note that we have to properly handle all the cases where there is memory used before the public pointer PyObject* and where the PyObject is bigger due to dict/weakref being inlined (older CPython version) or due to slots.
The main limitation of this approach is that the number of slots need to match for the objects being swapped and thus limit usage of slots in subclasses.
Draft right now to see what @colesbury thinks about doing this?
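A hedged usage sketch (assuming the entry point is `torch.utils.swap_tensors`): after the swap, pre-existing references observe the exchanged contents.
```python
import torch

t1 = torch.zeros(3)
t2 = torch.ones(3)
alias = t1                        # an old reference to t1
torch.utils.swap_tensors(t1, t2)  # assumed entry point
print(alias)                      # tensor([1., 1., 1.]): the old reference sees t2's content
```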
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111747
Approved by: https://github.com/colesbury
Removes always-restore, assuming that a HOP will clean up any leftover state from tracing fwd + bwd.
This required a minor change to the autograd fn variable higher order op: if we are tracing the forward, DON'T add the call_function node into the main graph, since we are only tracing it for the purposes of speculation. Instead, return the result directly to be passed to the backward for speculation. This was the only observable side effect on the output graph that I found.
Test plan:
test_smoke_from_test_autograd in test_autograd_function.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115317
Approved by: https://github.com/voznesenskym, https://github.com/jansel
Summary: Add two pieces of logic:
1. If the custom op returns a `Tensor` but doesn't have an out tensor as input, return an empty tensor.
2. If the custom op returns more than one Tensor and the number of out tensors is not the same as the number of returned Tensors, return a tuple of empty tensors.
Test Plan: Rely on new unit tests
Differential Revision: D51471651
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114143
Approved by: https://github.com/cccclai
Summary:
GraphFunction internally stores the optimized graph after generating it and then passes it into the executor, which makes a copy of it. So we effectively store the optimized graph twice.
This diff adds a flag to not store the optimized graph inside the GraphFunction.
The code is a no-op until the flag is enabled.
Test Plan:
I ran SL with this on raas with good memory savings on the raas server. From the command line:
Example model run
```
buck run mode/opt-clang sigrid/predictor/client/localnet:run_model -- --model_id_to_load=953556500 --model_snapshot_to_load=362
I1207 11:04:58.657143 3556226 SigridPredictorLocalModelFactory.cpp:32] Memory usage for 953556500_362 is 255646 Kb
```
then with flag enabled:
```
buck run mode/opt-clang sigrid/predictor/client/localnet:run_model -- --model_id_to_load=953556500 --model_snapshot_to_load=362 --torch_jit_do_not_store_optimized_graph=true
I1207 11:06:25.245779 3577383 SigridPredictorLocalModelFactory.cpp:32] Memory usage for 953556500_362 is 165167 Kb
```
So collective with this flag and the flag from D51950418
```
buck run mode/opt-clang sigrid/predictor/client/localnet:run_model -- --model_id_to_load=953556500 --model_snapshot_to_load=362 --torch_jit_do_not_store_optimized_graph=true --torch_jit_enable_profiling_graph_executor=false
I1207 11:09:17.502743 3592345 SigridPredictorLocalModelFactory.cpp:32] Memory usage for 953556500_362 is 114848 Kb
```
Differential Revision: D51931895
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115381
Approved by: https://github.com/malfet
Summary:
This is to allow easier extension of the quant workflow in the future, as we are seeing more
diverse ways of doing quantization.
Putting this up for feedback first.
Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_observer_callback
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115001
Approved by: https://github.com/kimishpatel
The test introduced in #102530 has a bug:
Construction of `crow_indices` raises an exception: "value cannot be converted to type int32 without overflow", which is obviously correct.
This makes the test fail, even though it is supposed to check for an overflow in nnz.
Fix by making the construction of `crow_indices` pass, although with an invalid value that would error later but triggers the correct check.
That said, I'm not sure it is even worth checking for an overflow in nnz:
- `crow_indices[..., -1] == nnz` is already enforced
- this can only hold if `crow_indices` is able to hold `nnz` without overflow
- `col_indices` has to be of the same type as `crow_indices`
- Hence the type of `col_indices` has to be able to hold the value of `nnz`
So in conclusion: The situation being checked for cannot reasonably occur
CC @pearu as the test author for additional insight
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114940
Approved by: https://github.com/pearu, https://github.com/cpuhrsch
Summary:
Rename _device_mesh.py to device_mesh.py, update all callsites, add documentation.
We created stubs for the public class and methods in torch.distributed.device_mesh so that torch.distributed.device_mesh can be imported whether or not distributed is available.
Original diff reverted: D51629761
Original PR reverted: https://github.com/pytorch/pytorch/pull/115099
Prior to landing, all CI signals passed. ShipIt added the "ci/trunk" label to the PR, DID NOT wait for it, and went ahead committing. More context can be found in the reverted PR above.
Test Plan: CI.
Differential Revision: D51861018
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115193
Approved by: https://github.com/fegin
Context:
Joel sees that unless he manually writes to the fake tensor memo, fakification seems to produce spurious symbols! Voz (me) objects, saying that not only is directly writing to memo a bad pattern, recursively invoking fakification on tensor subclass elements in dynamo should suffice! Joel says that while he morally agrees, he has a test proving otherwise, a most perplexing situation.
Digging in, I figured out that while *we were* making fake tensors correctly, with properly cached symbols and the like, we were *also* incorrectly creating spurious symbols, leading the test to fail.
Before this PR, we would only cache source->symint. This was generally fine, but meant that you would create a symbol, then potentially throw it out due to symint cache. For example, the cache hit flow was:
make a symbol (ex: s2) -> use it to make a symint -> hit the cache (my_source-s1)
Now, in this example, you have a symbol in your val_to_var/var_to_val (s2) that is unused. This is sound, but wasteful, and furthermore, misleading.
This was causing a test added in a PR in this stack to fail, specifically, because the test was using
```
curr_var_to_val = {
str(k): v for k, v in context.fake_mode.shape_env.var_to_val.items()
}
```
To validate that no new symbols were being created (that is, that recursively creating fake tensors for subclasses was working).
The test is correct, but the implementation of caching would make (by this method of observation) cache hits look like cache misses.
So, the fix here is to move the cache up to be a general symbol cache, rather than only a cache for symints.
The initial implementation did that! But then, it ran into some interesting errors when it came to replay. When replaying symbol creation, behaviors would diverge in the new shape env! How could that be? The answer is because creating a new shape_env resulted in us replaying symbol creation... but with a cache from a different shape env! This was short circuiting symbol creation - and so, adding an extra layer to the cache for id(shape_env) fixes the problem.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115396
Approved by: https://github.com/mlazos
Summary:
[pytorch] Multiprocessing api to use sigkill if sigterm doesn't kill the process
We have seen a handful of training jobs stuck where one of the trainers goes down
while others are stuck in C++ land and hence not handling the SIGTERM.
Test Plan: Manually validated by attaching gdb to one of the processes and sent a kill -9 to another. Saw the log ```WARNING] Unable to shutdown process 4422 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL```
Differential Revision: D51862545
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115219
Approved by: https://github.com/wconstab, https://github.com/fduwjj
Adds a useful high level wrapper for calling `dist.save/load` with the correct storage readers and writers.
Instead of doing:
```
DCP.save(
state_dict={...},
storage_writer=StorageWriter(...)
)
DCP.load(
state_dict={...},
storage_reader=StorageReader(...)
)
```
We can now do:
```
checkpointer = Checkpointer(...)
checkpointer.save(state_dict={...})
checkpointer.load(state_dict={...})
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114603
Approved by: https://github.com/fegin, https://github.com/wz337
I approved https://github.com/pytorch/pytorch/pull/110850 which did the following
Previously:
`num_batches_tracked` not in state_dict when doing `m.load_state_dict(state_dict)` --> always overwrite module's `num_batches_tracked` in `load_from_state_dict` with a 0 cpu tensor
Now:
`num_batches_tracked` not in state_dict loaded when doing `m.load_state_dict(state_dict)` --> only overwrite module's `num_batches_tracked` in `load_from_state_dict` with a 0 cpu tensor if module does not have `num_batches_tracked`
This causes the following issue:
```
with torch.device('meta'):
m = BatchNorm(...)
m.load_state_dict(state_dict, assign=True)
```
If `num_batches_tracked` is not in `state_dict`, then since the module's `num_batches_tracked` is present on the meta device, it is not overwritten with a 0 cpu tensor. When compiling, this error is raised:
```
AssertionError: Does not support mixing cuda+meta
```
I am not sure whether the explicit check for the meta device makes sense as a fix; will add testing if this fix is OK.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115285
Approved by: https://github.com/albanD
After auditing higher_order_ops.py, the graph checkpoints were only being used in the event of an exception, so they are safe to remove because we now restart analysis in that case.
To make this clearer, the current state is the following:
- Checkpoint side effects
- Capture subgraph
  - if graph break: restore as usual
  - else: throw away inlining translator and subgraph tracer
- Restore side effects
This will change to the following after this change:
- Checkpoint side effects
- Capture subgraph
  - if graph break: restart analysis
  - else: throw away inlining translator and subgraph tracer
- Restore side effects
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115321
Approved by: https://github.com/jansel, https://github.com/zou3519
Replaces the "always sleep 30 sec before abort" with "wait up to 30 sec
for the future to complete then abort". The difference in this case is
the abort happens as soon as the dump finishes (up to a maximum wait), instead
of always waiting the maximum.
Allows multiple calls to dump, which will be serialized.
Renames tryWriteDebugInfo to launchAsyncDebugDump in the spirit of the
change to support more than one launch and to always launch rather than
only launching on the first call.
Adds a test for dumping on timeout.
This reverts commit ac7d14baad53fa7d63119418f760190f289d8a01.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115332
Approved by: https://github.com/fduwjj
Summary: Add a toggle to inductor config that will force matmul precision dtypes to match between cublas and triton backends for addmm, bmm, and mm operations.
Test Plan: CI + model launches
Reviewed By: jansel
Differential Revision: D51442001
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115326
Approved by: https://github.com/jansel
This PR enables the fx passes and mkldnn optimizations for aarch64. It improved BERT inference performance by up to 5.8x on an AWS c7g instance when comparing torch.compile() against the no-compile path. This is enabled when PyTorch is built with the USE_MKLDNN_ACL option for aarch64.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115037
Approved by: https://github.com/jgong5, https://github.com/malfet
This PR aims for parity+ compared to the old testing for the simplest foreach test case.
Test coverage increase: we now test foreach optimizers with CPU as well as on GPU.
Before:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (19136605)]$ python test/test_optim.py -v -k test_multi_tensor_optimizers
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_multi_tensor_optimizers (optim.test_optim.TestOptim) ... ok
----------------------------------------------------------------------
Ran 1 test in 7.253s
OK
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (19136605)]$
```
Now, we get granular test cases at the cost of overhead!
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (19136605)]$ python test/test_optim.py -v -k test_foreach
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_foreach_ASGD_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_Adadelta_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_Adagrad_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_AdamW_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_Adam_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_Adamax_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_NAdam_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_RAdam_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_RMSprop_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_Rprop_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_SGD_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_ASGD_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_Adadelta_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_Adagrad_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_AdamW_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_Adam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_Adamax_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_NAdam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_RAdam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_RMSprop_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_Rprop_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_SGD_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
----------------------------------------------------------------------
Ran 22 tests in 30.954s
OK
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (19136605)]$
```
Why the increase in time?
Two reasons:
1. overhead. Any _CUDA_ *Info test (OpInfo, ModuleInfo, OptimizerInfo) will wrap itself with the `CudaNonDefaultStream` policy, and `CudaNonDefaultStream.__enter__` when called for the first time will go through all visible CUDA devices and synchronize each of them, thus forcing the CUDAContext to be init'd. Doing this for all 8 devices takes ~10-15s. Also, test parametrization costs a little overhead too, but not to the level init'ing CUDA context does.
2. We test more! Now, we have 72 configs (in the foreach optimizer world) whereas we only had 59 before.
Next steps for the future:
- consider adding more Tensor LR configs (like a Tensor LR without capturable in the single tensor case)
- this is likely the next PR or 2: migrate all uses of _test_derived_optimizers in test_optim to TestOptimRenewed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114797
Approved by: https://github.com/albanD
Summary: We construct a unified API that makes it easy to add pointwise ops to be batched in the post-grad pass.
Test Plan:
# unit test
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:group_batch_fusion
```
Buck UI: https://www.internalfb.com/buck2/19b3f641-782f-4f94-a953-3ff9ce2cfa7b
Test UI: https://www.internalfb.com/intern/testinfra/testrun/1125900251953016
Network: Up: 67KiB Down: 32KiB (reSessionID-c2a80f26-8227-4f78-89fc-bcbda0ae8353)
Jobs completed: 18. Time elapsed: 1:19.8s.
Cache hits: 0%. Commands: 2 (cached: 0, remote: 0, local: 2)
Tests finished: Pass 6. Fail 0. Fatal 0. Skip 0. Build failure 0
# local reproduce
### cmf
P881792289
### igctr
### dsnn
### icvr
Reviewed By: xuzhao9
Differential Revision: D51332067
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114778
Approved by: https://github.com/xuzhao9
Default should be False because in general, we're interested
in reliability and composability: we want to check that
running PyTorch with and without Dynamo has the same semantics (with
graph breaks allowed).
Test Plan:
Existing tests?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115276
Approved by: https://github.com/voznesenskym
ghstack dependencies: #115267
Due to not all tests in the Dynamo shard actually running in CI, we've
started to bitrot on this implementation. Since our plan is to trace
into the functorch implementations instead of construct a HOP
(which is what capture_func_transforms=True does), let's turn off this
config by default.
Test Plan:
- Tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115267
Approved by: https://github.com/voznesenskym, https://github.com/guilhermeleobas
Summary:
Documenting the `Work` object
For a collective (broadcast, all_reduce, etc.) with async_op=True we return a `Work` object on which users can call `.wait()`, `.is_success()`, and other methods, but this class is not documented.
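For context, a minimal sketch of where a `Work` object shows up (a single-rank gloo group is used here only to make the snippet self-contained):
```python
import os
import torch
import torch.distributed as dist

# Single-rank gloo group, just to make the example runnable standalone.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

t = torch.ones(4)
work = dist.all_reduce(t, op=dist.ReduceOp.SUM, async_op=True)  # returns a Work handle
work.wait()                   # block until the collective has completed
print(work.is_completed())    # True once finished
dist.destroy_process_group()
```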
Test Plan: Preview the docs build in OSS
Differential Revision: D51854974
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115172
Approved by: https://github.com/wconstab
As titled, this PR removes the unnecessary getitem call from the graph that's manipulated in MapHigherOrder, where we want to get the first-dim slice of the original tensor for speculation; using call_method would accidentally create a getitem call in the graph, so we avoid it by calling unpack_var_sequence on the input tensor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115207
Approved by: https://github.com/yanboliang
ghstack dependencies: #115115, #115204, #115205
We want to remove the map_wrapper and replace it with dynamo always on. This is the first step of this plan.
In this PR, we make dynamo directly generate map_impl nodes. This doesn't touch the eager logic yet, so the execution path after this PR looks like: 1. `dynamo -> map_impl` when torch.compile is on (before this PR, it was `dynamo -> map_wrapper -> map_impl`), and 2. `map_wrapper -> map_impl` in eager (this PR didn't touch the logic here).
The added TODO(yidi) is addressed in the following pr.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115205
Approved by: https://github.com/yanboliang
ghstack dependencies: #115115, #115204
# Summary
This PR updates the FlashAttention code from:
02ac572f3f.
Or Tag 2.3.2
To 92dd5703ec
Or tag 2.3.6.
As well, I think this should be cherry-picked into the 2.2.0 release, since there was a temporary ~15% perf regression for causal masking. It is not technically a regression since this Flash version wasn't released yet, but it would be nice to have in the release.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115313
Approved by: https://github.com/Skylion007
Previously we could only use `ncclCommSplit` when we knew all backends were connected on all shards (due to the need to perform a NOCOLOR split), which in practice meant we could only use it for subgroups that were copies of the entire world.
This change allows for specifying a bound device id to `init_process_group` which tells the pg and its backends that the specified device, and the specified device only, will be associated with this rank.
This guarantee lets us do an early connect (which we could not previously do due to how ProcessGroupNCCL infers devices based on tensors and not the rank number). And by doing the early connect, we have the guarantee ranks are connected and can perform nocolor splits when needed.
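A hedged sketch of what binding a device at init time looks like from the user side (assumes a multi-process NCCL launch, e.g. via torchrun):
```python
import os
import torch
import torch.distributed as dist

# Bind this rank to a single device up front so the backend can connect eagerly
# and later use ncclCommSplit when creating subgroups.
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
dist.init_process_group("nccl", device_id=torch.device(f"cuda:{local_rank}"))

# Subgroup creation can now take advantage of the early connection.
subgroup = dist.new_group(ranks=[0])
```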
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114916
Approved by: https://github.com/kwen2501
**Summary**:
#114174 did not test the case where `elementwise_affine=False` (i.e. `weight` and `bias` are `None`), and this test would fail due to cached sharding propagation. The difference in sharding prop between these cases is that when `weight` and `bias` are None, the forward layer norm op is recognized as a "static shape op" and `propagate_op_sharding` is applied rather than `propagate_op_sharding_non_cached`. A fix is to force re-computing sharding when `normalized_shape` changes, by setting the op schema's `RuntimeSchemaInfo.static_argnum` to include `normalized_shape` (i.e. 1).
**Test**:
pytest test/distributed/_tensor/test_math_ops.py -s -k layer_norm
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115250
Approved by: https://github.com/wanchaol
The current non-strict test cases (added in #114697) are already supported by strict mode, so they can't demonstrate the incremental value of non-strict mode. How about adding test cases that fail in strict mode but pass in non-strict mode?
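For reference, a minimal sketch of exercising the non-strict path (the module body here is a trivial placeholder, not the real test case):
```python
import torch
from torch.export import export

class M(torch.nn.Module):
    def forward(self, x):
        # The real test uses code that strict (Dynamo-traced) export rejects;
        # a trivial body is used here just to show the API.
        return x + 1

ep = export(M(), (torch.randn(3),), strict=False)  # strict=True is the default
print(ep.graph)
```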
Test Plan:
python test/export/test_export.py -k test_external_call_non_strict_real_tensor
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115245
Approved by: https://github.com/tugsbayasgalan, https://github.com/zhxchen17
Summary: move matmul precision out of the system info (system hash) and into the cache in preparation for switching precisions during compile
Test Plan: CI
Reviewed By: jansel
Differential Revision: D51442000
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115242
Approved by: https://github.com/jansel
On some systems it is possible to receive a signal that does not have a name. Rare, but possible. This prevents our error handler from crashing and instead properly reports the signal.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114696
Approved by: https://github.com/xmfan
In certain edge cases when using lazy tensors, the base tensor stored in the `FunctionalStorageImpl` and the `value_` tensor stored in the `FunctionalTensorWrapper` diverge. For instance, take this simple example
```python
class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(4, 2, bias=False)

    def forward(self, x):
        return x @ self.fc1.weight.transpose(0, 1)

with torch.device("lazy"):
    model = Model()
    x = torch.ones(4)
    out = model(x)
```
The call to `transpose` on the lazily initialized weight `fc1.weight` applies a view op on the functional tensor which only gets propagated to the functional tensor wrapper and not the base tensor in the storage. Thus, causing them to diverge.
To fix this behaviour, we need to reset the functional tensor's storage. To facilitate this, we add a `reset_storage` method to `FunctionalTensorWrapper` which clears away the old storage and view metas.
CC: @behzad-a @GlebKazantaev @wconstab @bdhirsh
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115235
Approved by: https://github.com/bdhirsh
This PR covers `ExportedProgram` to `test_fx_op_consistency.py`, which helps us identify the necessary but missing io_steps.
Next, we should refactor the tests to actually cover all ops supported by registry.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114886
Approved by: https://github.com/thiagocrepaldi
Replaces the "always sleep 30 sec before abort" with "wait up to 30 sec
for the future to complete then abort". The difference in this case is
the abort happens as soon as the dump finishes up to a maximum, instead
of always waiting the maximum.
Allows multiple calls to dump, which will be serialized.
Renames `tryWriteDebugInfo` to `launchAsyncDebugDump` in the spirit of the
change to support more than one launch and to always launch rather than
only launching on the first call.
Adds a test for dumping on timeout.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115176
Approved by: https://github.com/zdevito
Slight refactor to:
* lazily compute min / max seq_len used for flash. this avoids unnecessary graph breaks / specialization when we're not accessing these
* store min / max seq_len in a general `metadata_cache`. condensing these should make it easier to avoid specializing on these and others we may add in the future
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115212
Approved by: https://github.com/soulitzer, https://github.com/ani300
ghstack dependencies: #114311
Apply a few optimizations to funcol:
- for allgather on a non-0 dim, the resulting tensor already needs to access
  data in order to do torch.cat, so we sync wait here so that we don't
  need to go through ACT dispatch for the chunk + cat altogether
- add fast-return logic for aten.view as it's a commonly hit op for
  view-related ops
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113324
Approved by: https://github.com/XilunWu
Summary: Right now when load_model fails (either because of loading error or validation eager run failure), the result won't be logged in generated csv files. Let's log them in csv so that they are monitored by the expected results checking.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114784
Approved by: https://github.com/malfet
# Summary
This PR introduces a new Tensor subclass that is designed to be used with torch.nn.functional.scaled_dot_product_attention. Currently we have a boolean `is_causal` flag that allows users to do causal masking without the need to actually create the "realized" attention bias and pass it into sdpa. We originally added this flag since there is native support in both fused kernels we support. This provides a big performance gain (the kernels only need to iterate over ~0.5x the sequence), and for very large sequence lengths this can provide very large memory improvements.
The flag was introduced early on in the kernel development, and at the time it was implicitly meant as "upper_left" causal attention. This distinction only matters when the attention_bias is not square. For a more detailed breakdown see: https://github.com/pytorch/pytorch/issues/108108. The kernels' default behavior has since changed, largely due to the rise of autoregressive text generation, and unfortunately this would lead to a BC break. In the long term it may actually be beneficial to change the default meaning of `is_causal` to represent lower_right causal masking.
The larger theme, though, is laid out here: https://github.com/pytorch/pytorch/issues/110681. The thesis is that there is a lot of innovation in SDPA revolving around the attention_bias being used. This is the first of hopefully a few more attention_biases that we would like to add. The next interesting one would be `sliding_window`, which is used by the popular Mistral model family.
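To make the upper-left vs lower-right distinction concrete, here is a hedged sketch that builds both variants of a non-square boolean mask by hand (the new subclass itself is not spelled out here):
```python
import torch
import torch.nn.functional as F

q_len, k_len = 2, 4
# Diagonal anchored at the top-left corner (the original implicit meaning of is_causal).
upper_left = torch.tril(torch.ones(q_len, k_len, dtype=torch.bool))
# Diagonal anchored at the bottom-right corner (useful for autoregressive decoding).
lower_right = torch.tril(torch.ones(q_len, k_len, dtype=torch.bool), diagonal=k_len - q_len)

q = torch.randn(1, 1, q_len, 8)
k = torch.randn(1, 1, k_len, 8)
v = torch.randn(1, 1, k_len, 8)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=lower_right)
print(upper_left, lower_right, out.shape, sep="\n")
```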
Results from benchmarking, I improved the meff_attention perf hence the slightly decreased max perf.
```Shell
+---------+--------------------+------------+-----------+-----------+-----------+-----------+----------------+----------+
| Type | Speedup | batch_size | num_heads | q_seq_len | k_seq_len | embed_dim | dtype | head_dim |
+---------+--------------------+------------+-----------+-----------+-----------+-----------+----------------+----------+
| Average | 1.2388050062214226 | | | | | | | |
| Max | 1.831672915579016 | 128 | 32 | 1024 | 2048 | 2048 | torch.bfloat16 | 64 |
| Min | 0.9430534166730135 | 1 | 16 | 256 | 416 | 2048 | torch.bfloat16 | 128 |
+---------+--------------------+------------+-----------+-----------+-----------+-----------+----------------+----------+
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114823
Approved by: https://github.com/cpuhrsch
Summary:
This work is for PT2 inference. Since the IR from Export will change to pre-grad aten IR in a few months, we need to start this work now. Here is what I do in this diff:
1) Copy the fuse parallel linear pass to the fb folder and adapt it to aten IR. We still want to keep the original `group_batch_fusion.py` because it is still used in training. In the future, when PT2 training decides to retire the torch-IR-based group_batch_fusion, we can remove it. But right now, it's better to keep the torch IR and aten IR versions separately.
Our plan is to gradually transform the existing and important pre-grad passes into aten-IR-based passes.
Differential Revision: D51017854
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114776
Approved by: https://github.com/zhxchen17
Previously we only supported Tensor, Constants, and SymNode. We lift
that restriction (there's not really a good reason for it). HOPs like
torch.cond, torch.map already do input validation (those are the ones
that can only support Tensor, Constant, and SymNode inputs).
Test Plan:
New test for `wrap`, which is a HOP that has
manually_set_subgraph_inputs=False
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115186
Approved by: https://github.com/ydwu4, https://github.com/yanboliang
ghstack dependencies: #115185
**Summary**
Enable the qlinear weight prepack when the input dimension size exceeds 2. There are extra reshape nodes before and after the `addmm` or `mm` node if the input dimension size exceeds 2.
**Test Plan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k input_dim_exceeds_2
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113928
Approved by: https://github.com/jgong5, https://github.com/eellison
ghstack dependencies: #113733, #113912
**Summary**
When decomposing `Linear` to `addmm` or `mm` within Inductor, if the input dimension size exceeds 2, `reshape` nodes are introduced to convert the input into a 2-dimensional form before and after the `addmm` or `mm` node. It is essential to identify and match this pattern during quantization for dequantization promotion. For instance,
```
# quant
# + - - - | - - - +
# | dequant |
# | | |
# | reshape |
# | / \ |
# | node1 node2 |
# + - | - - - | - +
# reshape reshape
# + - | - - - | - +
# quant quant
```
In this PR, we mainly do 2 things:
- Extend support for the dequantization pattern in QLinear when the input dimension size exceeds 2.
- Revise the implementation of the dequant promotion pass, as it now needs to accommodate the matching of four different patterns.
**Test Plan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k input_dim_exceeds_2
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113912
Approved by: https://github.com/jgong5, https://github.com/eellison
ghstack dependencies: #113733
**Summary**
In the previous QLinear implementation, it was assumed that inputs have a dimension of 2. In this update, we have modified QLinear to accept inputs with a dimension greater than 2, incorporating input and output reshaping accordingly.
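A hedged, purely illustrative sketch of the reshape handling this describes (the function name and dense ops are hypothetical stand-ins; the real change lives in the quantized QLinear kernels):
```python
import torch

def linear_nd(x, weight, bias=None):
    # Flatten all leading dims to 2-D around the matmul, then restore them.
    orig_shape = x.shape                      # e.g. (B, T, in_features)
    x2d = x.reshape(-1, orig_shape[-1])       # (B*T, in_features)
    out2d = torch.addmm(bias, x2d, weight.t()) if bias is not None else x2d @ weight.t()
    return out2d.reshape(*orig_shape[:-1], weight.shape[0])

x, w, b = torch.randn(2, 5, 4), torch.randn(3, 4), torch.randn(3)
assert linear_nd(x, w, b).shape == (2, 5, 3)
```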
**Test Plan**
```
python -u -m pytest -s -v test_quantized_op.py -k test_qlinear_pt2e
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113733
Approved by: https://github.com/jgong5, https://github.com/eellison
Likely due to some upstream updates, the previous hack no longer picks up model names; this updates it to use the other, more appropriate variable.
Also fix a bug with an unused argument that was supposed to be removed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115108
Approved by: https://github.com/thiagocrepaldi
We technically allow backends to aot_autograd to pass a config saying "yes I am ok with seeing input mutations in my graph".
With https://github.com/pytorch/pytorch/pull/112906 though, there can be input mutations that show up in the backward (that we need to handle for correctness), that are a large pain to keep out of the graph. The meta-point is that it's been ~a year since we added the config, and it almost always makes sense for backends to support input mutations for performance reasons (inductor does). So I just allow these input mutations in the graph in this rare backward situation, even if the backend didn't explicitly use the config.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115195
Approved by: https://github.com/drisspg
This adds the `ir.Scan` node (currently only supported on CUDA) which re-uses the existing reduction kernel machinery to support different kinds of non-pointwise ops. Just like reductions it supports prologue and epilogue fusions and has both persistent and non-persistent kernel generation.
Currently this doesn't support the equivalent of `Reduction.create_multilayer` and will instead fall back to eager in those cases. This is because splitting into multiple kernel invocations ends up being far slower than cub's single kernel strategy which matches the performance of a copy kernel.
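As a rough illustration, cumulative ops such as `torch.cumsum` are the kind of non-pointwise scan this lowering targets (a sketch, not taken from this PR's tests):
```python
import torch

@torch.compile
def prefix_sum(x):
    # A scan: each output element depends on all preceding inputs.
    return torch.cumsum(x, dim=-1)

if torch.cuda.is_available():
    print(prefix_sum(torch.arange(8, device="cuda", dtype=torch.float32)))
```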
Fixes https://github.com/pytorch/pytorch/issues/93631
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106581
Approved by: https://github.com/lezcano, https://github.com/atalman
Introduce OptimizerInfos + use them to refactor out the error testing.
Why OptimizerInfos?
- cleaner, easier way to test all configs of optimizers
- would plug in well with devicetype to auto-enable tests for devices like MPS, meta
- would allow for more granular testing. currently, lots of functionality is tested in `_test_basic_cases` and some of that should be broken down more.
What did I do for error testing?
- I moved out some error cases from `_test_basic_cases` into a new test_errors parametrized test.
- The new test has to live in TestOptimRenewed (bikeshedding welcome) because the parametrized tests need to take in device and dtype and hook correctly, and not all tests in TestOptim do that.
- TestOptimRenewed also is migrating to the toplevel test/test_optim.py now because importing TestOptimRenewed does not work (because of test instantiation, TestOptimRenewed gets replaced with TestOptimRenewedDevice for CPU, CUDA, and whatever other device).
Is there any change in test coverage?
- INCREASE: The error case where a single Parameter (vs a container of them) is passed in has now expanded to all optims instead of only LBFGS
- DECREASE: Not much. The only thing is we no longer test two error cases for foreach=True AND foreach=False, which I think is redundant. (Highlighted in comments)
Possible but not urgent next step: test ALL possible error cases by going through all the constructors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114178
Approved by: https://github.com/albanD
We can auto-functionalize operators that mutate their inputs as long as
the outputs of the operator do not alias their inputs. The user needs
to provide an abstract impl for the operator if it has non-trivial
returns.
- We update can_auto_functionalize(op) to include ops that return (but
do not alias) Tensors
- We update auto_functionalized(op, mutated_args_names, kwargs) to
return (out, mutated_args), where `out = op(**kwargs)` and
`mutated_args` are the new values of the inputs that would have been
mutated.
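A hedged Python sketch of the calling convention described in the bullets above (not the real implementation, which operates on FX graphs):
```python
def auto_functionalized_sketch(op, mutated_args_names, kwargs):
    # Clone the to-be-mutated inputs so the originals stay untouched, run the
    # op on the clones, and return both the op's output and the post-mutation
    # values of those inputs.
    cloned = {name: kwargs[name].clone() for name in mutated_args_names}
    out = op(**{**kwargs, **cloned})
    mutated_args = tuple(cloned[name] for name in mutated_args_names)
    return out, mutated_args
```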
Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115135
Approved by: https://github.com/bdhirsh
ghstack dependencies: #114955, #114956, #115134
This is in preparation for the next PR up in the stack, which is going to update
"can_auto_functionalize" to support more operators than just ones that
return nothing. We are unable to auto-generate FakeTensor kernels for
operators that return something, but we are able to generate
functionalization kernels for them.
Test Plan:
Existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115134
Approved by: https://github.com/bdhirsh
ghstack dependencies: #114955, #114956
Summary:
This adds functions to the model container for weight swapping with double buffering.
There are 2 parts for double buffering
a) Write constants into inactive buffer
b) Swap active buffer
For (a), we write the constants into the buffer that's currently not in use, and store the information in both constants map and the corresponding constant array to read.
For (b), we obtain the lock, and activate the constant map/constant array that is inactive, and flag the one that's currently in use to inactive.
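A hedged Python sketch of this double-buffering scheme (the real implementation is C++ inside the AOTInductor model container; names here are illustrative):
```python
import threading

class DoubleBufferedConstants:
    def __init__(self):
        self._buffers = [{}, {}]   # two constant maps
        self._active = 0           # index of the buffer readers currently use
        self._lock = threading.Lock()

    def write_inactive(self, new_constants):
        # (a) write constants into the buffer that is currently not in use
        self._buffers[1 - self._active].update(new_constants)

    def swap_active(self):
        # (b) under the lock, flip which buffer is active
        with self._lock:
            self._active = 1 - self._active

    def read(self, name):
        with self._lock:
            return self._buffers[self._active][name]
```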
Test Plan:
test/cpp/aot_inductor/test.cpp
Differential Revision: [D51543732](https://our.internmc.facebook.com/intern/diff/D51543732)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114446
Approved by: https://github.com/chenyang78, https://github.com/eellison
The `bsr_dense_addmm` triton kernel introduced in https://github.com/pytorch/pytorch/pull/114595 is a generalization of the `bsr_dense_mm` triton kernel and a more efficient version of it, because it uses an extra kernel parameter `SPLIT_N` that has a notable effect on performance for r.h.s. operands with a larger number of columns.
This PR eliminates the `bsr_dense_mm` triton kernel in favor of using `bsr_dense_addmm` triton kernel.
The performance increase of `bsr_dense_mm` is as follows (float16, `NVIDIA A100-SXM4-80GB`):
- with 16x16 blocks, the average/maximal speed up is 50/71 %
- with 32x32 blocks, the average/maximal speed up is 30/63 %
- with 64x64 blocks, the average/maximal speed up is 12/26 %
- with 128x128 blocks, the average/maximal speed up is 7/17 %
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115030
Approved by: https://github.com/cpuhrsch
Summary:
This diff fixes the param unflattening when using FSDP together with TP. Currently we hardcode the `reshape_size` to be multiplied by 2, when it should instead be multiplied by the size of the process group.
Before the fix, an example exception: `shape '[257, 514]' is invalid for input of size 264196`, where the process group size is 4 instead of 2.
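A hedged one-liner sketch of the idea (illustrative names, not the actual code):
```python
def unflattened_shape(local_shape, tp_process_group, sharded_dim=-1):
    # Scale the TP-sharded dim by the process group size rather than a hardcoded 2.
    shape = list(local_shape)
    shape[sharded_dim] *= tp_process_group.size()
    return tuple(shape)
```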
Test Plan:
**CI**:
CI test
**Unit test**:
`buck2 test mode/dev-nosan //caffe2/test/distributed/tensor/parallel:fsdp_2d_parallel`
- Passed
**Test model with WHEN**:
- Verified that checkpoint can be saved and resumed successfully;
- Verified the accuracy with window_ne, which is on-par with baseline.
https://pxl.cl/3Wp8w
Differential Revision: D51826120
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115105
Approved by: https://github.com/fegin
Continuation of #112185, following the design in this [doc](https://docs.google.com/document/d/1ipSxcTzEMMOAPvxP-YJlD5JBZZmIGgh8Q34ixtOUCRo).
Summary:
* Introduce `SubclassSymbolicPolicy` containing separate dynamic dim / constraint policies for the outer and inner tensors
* Expand the automatic dynamic algorithm to recurse into inner tensors and produce one of these for a subclass instance
* Maintain legacy behavior for subclasses by recursively calling `mark_dynamic()` on inner tensors *of the same dim as outer* when `mark_dynamic(outer, ...)` is called
* Addresses this: 6a86cf00ad/torch/_dynamo/variables/builder.py (L1750)
* Add `outer_size` and `outer_stride` arguments to `__tensor_unflatten__()` so that you can find out what symbols were allocated for the outer size / stride (you are expected to return a tensor that compares equal to the outer symbols)
* Signatures now:
```python
# attrs is a list of inner tensor attributes on x; inner_tensor = getattr(x, attr)
# ctx is anything useful for rebuilding the class we want to guard on
attrs, ctx = x.__tensor_flatten__()
...
# inner_tensors is a dict of {attr -> tensor}
# ctx is taken unmodified from flattening and (eventually) guarded on
# outer_size is the expected size of the output; possibly symbolic
# outer_stride is the expected strides of the output; possibly symbolic
y = MySubclass.__tensor_unflatten__(inner_tensors, ctx, outer_size, outer_stride)
# at the __tensor_unflatten__() call-site in PT2, we assert y.shape == outer_size and y.stride() == outer_stride
# the assert simplifies symbols when there are relationships between outer and inner symbols
```
* Size info needed for `NestedTensor` at least, stride info needed for `DTensor` at least
* Punting on `outer_storage_offset` because storage_offset handling is horribly broken in PT2 right now
* ~~Add new `__tensor_mark_dynamic__()` to allow overriding the behavior of mark_dynamic on a per-subclass basis~~ (booted to future work)
* ~~Add guards for tensor subclasses by calling `__tensor_flatten__()` in the guard to test equality on `ctx`~~
* Now handled in #114469
* Next PR: add TENSOR_MATCH guards on inner tensors
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114311
Approved by: https://github.com/ezyang, https://github.com/drisspg, https://github.com/voznesenskym, https://github.com/bdhirsh
If TORCH_NCCL_DUMP_ON_TIMEOUT is set, then along with producing a dump
file when a timeout happens, you can trigger a dump by writing to local pipe
`<TORCH_NCCL_DEBUG_INFO_TEMP_FILE>_<rank>.pipe` (by default
/tmp/nccl_trace_rank_<rank>.pipe).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115139
Approved by: https://github.com/wconstab
Contents may be out of date. Our CircleCI workflows are gradually being migrated to Github actions.
Structure of CI
===============
setup job:
1. Does a git checkout
2. Persists CircleCI scripts (everything in `.circleci`) into a workspace. Why?
We don't always do a Git checkout on all subjobs, but we usually
still want to be able to call scripts one way or another in a subjob.
Persisting files this way lets us have access to them without doing a
checkout. This workspace is conventionally mounted on `~/workspace`
(this is distinguished from `~/project`, which is the conventional
working directory that CircleCI will default to starting your jobs
in.)
3. Write out the commit message to `.circleci/COMMIT_MSG`. This is so
we can determine in subjobs if we should actually run the jobs or
not, even if there isn't a Git checkout.
CircleCI configuration generator
================================
One may no longer make changes to the `.circleci/config.yml` file directly.
Instead, one must edit these Python scripts or files in the `verbatim-sources/` directory.
Usage
----------
1. Make changes to these scripts.
2. Run the `regenerate.sh` script in this directory and commit the script changes and the resulting change to `config.yml`.
You'll see a build failure on GitHub if the scripts don't agree with the checked-in version.
Motivation
----------
These scripts establish a single, authoritative source of documentation for the CircleCI configuration matrix.
The documentation, in the form of diagrams, is automatically generated and cannot drift out of sync with the YAML content.
Furthermore, consistency is enforced within the YAML config itself, by using a single source of data to generate
multiple parts of the file.
* Facilitates one-off culling/enabling of CI configs for testing PRs on special targets
Also see https://github.com/pytorch/pytorch/issues/17038
Future direction
----------------
### Declaring sparse config subsets
See comment [here](https://github.com/pytorch/pytorch/pull/17323#pullrequestreview-206945747):
In contrast with a full recursive tree traversal of configuration dimensions,
> in the future I think we actually want to decrease our matrix somewhat and have only a few mostly-orthogonal builds that taste as many different features as possible on PRs, plus a more complete suite on every PR and maybe an almost full suite nightly/weekly (we don't have this yet). Specifying PR jobs in the future might be easier to read with an explicit list when we come to this.
----------------
# How do the binaries / nightlies / releases work?
### What is a binary?
A binary or package (used interchangeably) is a pre-built collection of c++ libraries, header files, python bits, and other files. We build these and distribute them so that users do not need to install from source.
A **binary configuration** is a collection of
* release or nightly
* releases are stable, nightlies are beta and built every night
* python version
* linux: 3.7m (mu is wide unicode or something like that. It usually doesn't matter but you should know that it exists)
* macos: 3.7, 3.8
* windows: 3.7, 3.8
* cpu version
* cpu, cuda 9.0, cuda 10.0
* The supported cuda versions occasionally change
* operating system
* Linux - these are all built on CentOS. There haven't been any problems in the past building on CentOS and using on Ubuntu
* MacOS
* Windows - these are built on Azure pipelines
* devtoolset version (gcc compiler version)
* This only matters on Linux cause only Linux uses gcc. tldr is gcc made a backwards incompatible change from gcc 4.8 to gcc 5, because it had to change how it implemented std::vector and std::string
### Where are the binaries?
The binaries are built in CircleCI. There are nightly binaries built every night at 9pm PST (midnight EST) and release binaries corresponding to Pytorch releases, usually every few months.
We have 3 types of binary packages
* pip packages - nightlies are stored on s3 (pip install -f \<a s3 url\>). releases are stored in a pip repo (pip install torch) (ask Soumith about this)
* conda packages - nightlies and releases are both stored in a conda repo. Nightly packages have a '_nightly' suffix
* libtorch packages - these are zips of all the c++ libraries, header files, and sometimes dependencies. These are c++ only
* shared with dependencies (the only supported option for Windows)
* static with dependencies
* shared without dependencies
* static without dependencies
All binaries are built in CircleCI workflows except Windows. There are checked-in workflows (committed into the .circleci/config.yml) to build the nightlies every night. Releases are built by manually pushing a PR that builds the suite of release binaries (overwrite the config.yml to build the release)
# CircleCI structure of the binaries
Some quick vocab:
* A **workflow** is a CircleCI concept; it is a DAG of '**jobs**'. ctrl-f 'workflows' on https://github.com/pytorch/pytorch/blob/main/.circleci/config.yml to see the workflows.
* **jobs** are a sequence of '**steps**'
* **steps** are usually just a bash script or a builtin CircleCI command. *All steps run in new environments, environment variables declared in one script DO NOT persist to following steps*
* CircleCI has a **workspace**, which is essentially a cache between steps of the *same job* in which you can store artifacts between steps.
## How are the workflows structured?
The nightly binaries have 3 workflows. We have one job (actually 3 jobs: build, test, and upload) per binary configuration
3. For each binary configuration, e.g. linux_conda_3.7_cpu, there is a
   1. smoke_linux_conda_3.7_cpu
      1. Downloads the package from the cloud, e.g. using the official pip or conda instructions
      2. Runs the smoke tests
## How are the jobs structured?
The jobs are in https://github.com/pytorch/pytorch/tree/main/.circleci/verbatim-sources. Jobs are made of multiple steps. There are some shared steps used by all the binaries/smokes. Steps of these jobs are all delegated to scripts in https://github.com/pytorch/pytorch/tree/main/.circleci/scripts .
* Linux jobs: https://github.com/pytorch/pytorch/blob/main/.circleci/verbatim-sources/linux-binary-build-defaults.yml
* Common shared code (shared across linux and macos): https://github.com/pytorch/pytorch/blob/main/.circleci/verbatim-sources/nightly-binary-build-defaults.yml
* binary_checkout.sh - checks out pytorch/builder repo. Right now this also checks out pytorch/pytorch, but it shouldn't. pytorch/pytorch should just be shared through the workspace. This can handle being run before binary_populate_env.sh
* binary_populate_env.sh - parses BUILD_ENVIRONMENT into the separate env variables that make up a binary configuration. Also sets lots of default values, the date, the version strings, the location of folders in s3, all sorts of things. This generally has to be run before other steps.
* binary_install_miniconda.sh - Installs miniconda, cross platform. Also hacks this for the update_binary_sizes job that doesn't have the right env variables
* binary_run_in_docker.sh - Takes a bash script file (the actual test code) from a hardcoded location, spins up a docker image, and runs the script inside the docker image
### **Why do the steps all refer to scripts?**
CircleCI creates a final yaml file by inlining every <<* segment, so if we were to keep all the code in the config.yml itself then the config size would go over 4 MB and cause infra problems.
### **What is binary_run_in_docker for?**
So, CircleCI has several executor types: macos, machine, and docker are the ones we use. The 'machine' executor gives you two cores on some linux vm. The 'docker' executor gives you considerably more cores (nproc was 32 instead of 2 back when I tried in February). Since the dockers are faster, we try to run everything that we can in dockers. Thus
* linux build jobs use the docker executor. Running them on the docker executor was at least 2x faster than running them on the machine executor
* linux test jobs use the machine executor in order for them to properly interface with GPUs since docker executors cannot execute with attached GPUs
* linux upload jobs use the machine executor. The upload jobs are so short that it doesn't really matter what they use
* linux smoke test jobs use the machine executor for the same reason as the linux test jobs
binary_run_in_docker.sh is a way to share the docker start-up code between the binary test jobs and the binary smoke test jobs
### **Why does binary_checkout also checkout pytorch? Why shouldn't it?**
We want all the nightly binary jobs to run on the exact same git commit, so we wrote our own checkout logic to ensure that the same commit was always picked. Later circleci changed that to use a single pytorch checkout and persist it through the workspace (they did this because our config file was too big, so they wanted to take a lot of the setup code into scripts, but the scripts needed the code repo to exist to be called, so they added a prereq step called 'setup' to checkout the code and persist the needed scripts to the workspace). The changes to the binary jobs were not properly tested, so they all broke from missing pytorch code no longer existing. We hotfixed the problem by adding the pytorch checkout back to binary_checkout, so now there's two checkouts of pytorch on the binary jobs. This problem still needs to be fixed, but it takes careful tracing of which code is being called where.
# Code structure of the binaries (circleci agnostic)
## Overview
The code that runs the binaries lives in two places, in the normal [github.com/pytorch/pytorch](http://github.com/pytorch/pytorch), but also in [github.com/pytorch/builder](http://github.com/pytorch/builder), which is a repo that defines how all the binaries are built. The relevant code is
```
# All code needed to set-up environments for build code to run in,
# but only code that is specific to the current CI system
pytorch/pytorch
- .circleci/ # Folder that holds all circleci related stuff
- config.yml # GENERATED file that actually controls all circleci behavior
- verbatim-sources # Used to generate job/workflow sections in ^
- scripts/ # Code needed to prepare circleci environments for binary build scripts
- setup.py # Builds pytorch. This is wrapped in pytorch/builder
- cmake files # used in normal building of pytorch
# All code needed to prepare a binary build, given an environment
# with all the right variables/packages/paths.
pytorch/builder
# Given an installed binary and a proper python env, runs some checks
# to make sure the binary was built the proper way. Checks things like
# the library dependencies, symbols present, etc.
- check_binary.sh
# Given an installed binary, runs python tests to make sure everything
# is in order. These should be de-duped. Right now they both run smoke
# tests, but are called from different places. Usually just call some
# import statements, but also has overlap with check_binary.sh above
- run_tests.sh
- smoke_test.sh
# Folders that govern how packages are built. See paragraphs below
- conda/
- build_pytorch.sh # Entrypoint. Delegates to proper conda build folder
- switch_cuda_version.sh # Switches activate CUDA installation in Docker
- pytorch-nightly/ # Build-folder
- manywheel/
- build_cpu.sh # Entrypoint for cpu builds
- build.sh # Entrypoint for CUDA builds
- build_common.sh # Actual build script that ^^ call into
- wheel/
- build_wheel.sh # Entrypoint for wheel builds
- windows/
- build_pytorch.bat # Entrypoint for wheel builds on Windows
```
Every type of package has an entrypoint build script that handles all the important logic.
## Conda
Linux, MacOS and Windows use the same code flow for the conda builds.
Conda packages are built with conda-build, see https://conda.io/projects/conda-build/en/latest/resources/commands/conda-build.html
Basically, you pass `conda build` a build folder (pytorch-nightly/ above) that contains a build script and a meta.yaml. The meta.yaml specifies in what python environment to build the package in, and what dependencies the resulting package should have, and the build script gets called in the env to build the thing.
tl;dr on conda-build is
1. Creates a brand new conda environment, based off of deps in the meta.yaml
1. Note that environment variables do not get passed into this build env unless they are specified in the meta.yaml
2. If the build fails this environment will stick around. You can activate it for much easier debugging. The “General Python” section below explains what exactly a python “environment” is.
2. Calls build.sh in the environment
3. Copies the finished package to a new conda env, also specified by the meta.yaml
4. Runs some simple import tests (if specified in the meta.yaml)
5. Saves the finished package as a tarball
The build.sh we use is essentially a wrapper around `python setup.py build`, but it also manually copies in some of our dependent libraries into the resulting tarball and messes with some rpaths.
The entrypoint file `builder/conda/build_conda.sh` is complicated because
* It works for Linux, MacOS and Windows
* The mac builds used to create their own environments, since they all used to be on the same machine. There’s now a lot of extra logic to handle conda envs. This extra machinery could be removed
* It used to handle testing too, which adds more logic messing with python environments too. This extra machinery could be removed.
## Manywheels (linux pip and libtorch packages)
Manywheels are pip packages for linux distros. Note that these manywheels are not actually manylinux compliant.
`builder/manywheel/build_cpu.sh` and `builder/manywheel/build.sh` (for CUDA builds) just set different env vars and then call into `builder/manywheel/build_common.sh`
The entrypoint file `builder/manywheel/build_common.sh` is really really complicated because
* This used to handle building for several different python versions at the same time. The loops have been removed, but there's still unnecessary folders and movements here and there.
* The script is never used this way anymore. This extra machinery could be removed.
* This used to handle testing the pip packages too. This is why there’s testing code at the end that messes with python installations and stuff
* The script is never used this way anymore. This extra machinery could be removed.
* This also builds libtorch packages
* This should really be separate. libtorch packages are c++ only and have no python. They should not share infra with all the python specific stuff in this file.
* There is a lot of messing with rpaths. This is necessary, but could be made much much simpler if the above issues were fixed.
## Wheels (MacOS pip and libtorch packages)
The entrypoint file `builder/wheel/build_wheel.sh` is complicated because
* The mac builds used to all run on one machine (we didn’t have autoscaling mac machines till circleci). So this script handled siloing itself by setting-up and tearing-down its build env and siloing itself into its own build directory.
* The script is never used this way anymore. This extra machinery could be removed.
* This also builds libtorch packages
* Ditto the comment above. This should definitely be separated out.
Note that the MacOS Python wheels are still built in conda environments. Some of the dependencies present during build also come from conda.
## Windows Wheels (Windows pip and libtorch packages)
The entrypoint file `builder/windows/build_pytorch.bat` is complicated because
* This used to handle building for several different python versions at the same time. This is why there are loops everywhere
* The script is never used this way anymore. This extra machinery could be removed.
* This used to handle testing the pip packages too. This is why there’s testing code at the end that messes with python installations and stuff
* The script is never used this way anymore. This extra machinery could be removed.
* This also builds libtorch packages
* This should really be separate. libtorch packages are c++ only and have no python. They should not share infra with all the python specific stuff in this file.
Note that the Windows Python wheels are still built in conda environments. Some of the dependencies present during build also come from conda.
## General notes
### Note on run_tests.sh, smoke_test.sh, and check_binary.sh
* These should all be consolidated
* These must run on all OS types: MacOS, Linux, and Windows
* These all run smoke tests at the moment. They inspect the packages some, maybe run a few import statements. They DO NOT run the python tests nor the cpp tests. The idea is that python tests on main and PR merges will catch all breakages. All these tests have to do is make sure the special binary machinery didn’t mess anything up.
* There are separate run_tests.sh and smoke_test.sh because one used to be called by the smoke jobs and one used to be called by the binary test jobs (see circleci structure section above). This is still true actually, but these could be united into a single script that runs these checks, given an installed pytorch package.
### Note on libtorch
Libtorch packages are built in the wheel build scripts: manywheel/build_*.sh for linux and build_wheel.sh for mac. There are several things wrong with this
* It’s confusing. Most of those scripts deal with python specifics.
* The extra conditionals everywhere severely complicate the wheel build scripts
* The process for building libtorch is different from the official instructions (a plain call to cmake, or a call to a script)
### Note on docker images / Dockerfiles
All linux builds occur in docker images. The docker images are
* pytorch/conda-cuda
* Has ALL CUDA versions installed. The script pytorch/builder/conda/switch_cuda_version.sh sets /usr/local/cuda to a symlink to e.g. /usr/local/cuda-10.0 to enable different CUDA builds
* Also used for cpu builds
* pytorch/manylinux-cuda90
* pytorch/manylinux-cuda100
* Also used for cpu builds
The Dockerfiles are available in pytorch/builder, but there is no circleci job or script to build these docker images, and they cannot be run locally (unless you have the correct local packages/paths). Only Soumith can build them right now.
### General Python
* This is still a good explanation of python installations https://caffe2.ai/docs/faq.html#why-do-i-get-import-errors-in-python-when-i-try-to-use-caffe2
# How to manually rebuild the binaries
tl;dr make a PR that looks like https://github.com/pytorch/pytorch/pull/21159
Sometimes we want to push a change to main and then rebuild all of today's binaries after that change. As of May 30, 2019 there isn't a way to manually run a workflow in the UI. You can manually re-run a workflow, but it will use the exact same git commits as the first run and will not include any changes. So we have to make a PR and then force circleci to run the binary workflow instead of the normal tests. The above PR is an example of how to do this; essentially you copy-paste the binarybuilds workflow steps into the default workflow steps. If you need to point the builder repo to a different commit then you'd need to change https://github.com/pytorch/pytorch/blob/main/.circleci/scripts/binary_checkout.sh#L42-L45 to checkout what you want.
## How to test changes to the binaries via .circleci
Writing PRs that test the binaries is annoying, since the default circleci jobs that run on PRs are not the jobs that you want to run. Likely, changes to the binaries will touch something under .circleci/ and require that .circleci/config.yml be regenerated (.circleci/config.yml controls all .circleci behavior, and is generated using `.circleci/regenerate.sh` in python 3.7). But you also need to manually hardcode the binary jobs that you want to test into the .circleci/config.yml workflow, so you should actually make at least two commits, one for your changes and one to temporarily hardcode jobs. See https://github.com/pytorch/pytorch/pull/22928 as an example of how to do this.
```sh
# Update the PR, need to force since the commits are different now
git push origin my_branch --force
```
The advantage of this flow is that you can make new changes to the base commit and regenerate the .circleci without having to re-write which binary jobs you want to test on. The downside is that all updates will be force pushes.
## How to build a binary locally
### Linux
You can build Linux binaries locally easily using docker.
```sh
# Run the docker
# Use the correct docker image, pytorch/conda-cuda used here as an example
#
# -v path/to/foo:path/to/bar makes path/to/foo on your local machine (the
# machine that you're running the command on) accessible to the docker
# container at path/to/bar. So if you then run `touch path/to/bar/baz`
# in the docker container then you will see path/to/foo/baz on your local
# machine. You could also clone the pytorch and builder repos in the docker.
#
# If you know how, add ccache as a volume too and speed up everything
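# Illustrative invocation (adjust the image and the mount paths for your setup;
# the paths below assume you cloned pytorch and builder into the current dir):
docker run -it \
    -v "$PWD/pytorch":/pytorch \
    -v "$PWD/builder":/builder \
    pytorch/conda-cuda bash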
# Export whatever variables are important to you. All variables that you'd
# possibly need are in .circleci/scripts/binary_populate_env.sh
# You should probably always export at least these 3 variables
export PACKAGE_TYPE=conda
export DESIRED_PYTHON=3.7
export DESIRED_CUDA=cpu
# Call the entrypoint
# `|& tee foo.log` just copies all stdout and stderr output to foo.log
# The builds generate lots of output so you probably need this when
# building locally.
/builder/conda/build_pytorch.sh |& tee build_output.log
```
**Building CUDA binaries on docker**
You can build CUDA binaries on CPU only machines, but you can only run CUDA binaries on CUDA machines. This means that you can build a CUDA binary on a docker on your laptop if you so choose (though it’s gonna take a long time).
For Facebook employees, ask about beefy machines that have docker support and use those instead of your laptop; it will be 5x as fast.
### MacOS
There’s no easy way to generate reproducible hermetic MacOS environments. If you have a Mac laptop then you can try emulating the .circleci environments as much as possible, but you probably have packages in /usr/local/, possibly installed by brew, that will probably interfere with the build. If you’re trying to repro an error on a Mac build in .circleci and you can’t seem to repro locally, then my best advice is actually to iterate on .circleci :/
But if you want to try, then I’d recommend
```sh
# Create a new terminal
# Clear your LD_LIBRARY_PATH and trim as much out of your PATH as you
# know how to do
# Install a new miniconda
# First remove any other python or conda installation from your PATH
# Always install miniconda 3, even if building for Python <3
# All MacOS builds use conda to manage the python env and dependencies
# that are built with, even the pip packages
conda create -yn binary python=2.7
conda activate binary
# Export whatever variables are important to you. All variables that you'd
# possibly need are in .circleci/scripts/binary_populate_env.sh
# You should probably always export at least these 3 variables
export PACKAGE_TYPE=conda
export DESIRED_PYTHON=3.7
export DESIRED_CUDA=cpu
# Call the entrypoint you want
path/to/builder/wheel/build_wheel.sh
```
N.B. installing a brand new miniconda is important. This has to do with how conda installations work. See the “General Python” section above, but tldr; is that
1. You make the ‘conda’ command accessible by prepending `path/to/conda_root/bin` to your PATH.
2. You make a new env and activate it, which then also gets prepended to your PATH. Now you have `path/to/conda_root/envs/new_env/bin:path/to/conda_root/bin:$PATH`
3. Now say you (or some code that you ran) call python executable `foo`
1. if you installed `foo` in `new_env`, then `path/to/conda_root/envs/new_env/bin/foo` will get called, as expected.
2. But if you forgot to install `foo` in `new_env` but happened to previously install it in your root conda env (called ‘base’), then unix/linux will still find `path/to/conda_root/bin/foo`. This is dangerous, since `foo` can be a different version than you want; `foo` can even be for an incompatible python version!
Newer conda versions and proper python hygiene can prevent this, but just install a new miniconda to be safe.
### Windows
TODO: fill in
PyTorch migration from CircleCI to github actions has been completed. All continuous integration & deployment workflows are defined in `.github/workflows` folder
#### Before submitting a bug, please make sure the issue hasn't been already addressed by searching through [the
existing and past issues](https://github.com/pytorch/pytorch/issues)
It's likely that your bug will be resolved by checking our FAQ or troubleshooting guide [documentation](https://pytorch.org/docs/main/dynamo/index.html)
- type: textarea
  attributes:
    label: 🐛 Describe the bug
- type: textarea
  attributes:
    label: Minified repro
    description: |
      Please run the minifier on your example and paste the minified code below
      Learn more here https://pytorch.org/docs/main/torch.compiler_troubleshooting.html