Summary:
For the Speech sequential model, there are cases where calling model(data) does not perform the feed-forward pass correctly:
the Speech model uses a different type of Criterion (a.k.a. loss function) that feeds data to individual components such as the encoder, predictor, and joiner.
Hence we need an extra parameter to pass a feed-forward wrapper.
Differential Revision: D57680391
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126891
Approved by: https://github.com/jerryzh168
Summary:
After QAT has completed, or when a pre-tuned weight observer is provided via a tunable PTQ algorithm, the weight observer should not be overwritten again with a given weight; for static QAT this should never happen.
By design, dynamic QAT also does not require re-running the weight observer.
This PR fixes that.
Test Plan: Signals
Differential Revision: D57747749
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127309
Approved by: https://github.com/jerryzh168
This PR adds _foreach_max support, the second reduction foreach op we have :D
I did have to change the autogen slightly for foreach. I can promise that the existing foreach ops' derivative behavior has not changed, as I've added a skip list for the stricter requirement I am setting (that the arg lists should match in length). I needed to add this requirement because another, wrong max (the one that does take a dim for reduction) keeps getting matched first.
Caveats!
- We do not fast path if the usual foreach requirements (matching shapes, dtypes, device) are not met; we fall back to the slow path!
- MORE IMPORTANTLY, we also do not fast path for int8, int16, and bool, but that's really a skill issue on my end as I've hardcoded -INFINITY into the CUDA kernels, and -INFINITY is not defined for small ints. It'd be nice to know how to do this properly, but that work can also come later.
- This does NOT support empty Tensors in the list, because the original max op also does not support empty Tensors. ~I think this should be allowed though, and this PR may come later.~ I understand why this is not allowed.
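A minimal usage sketch (assuming the op is exposed as `torch._foreach_max`, per this PR):
```python
import torch

# _foreach_max reduces each tensor in the list to its max, returning one
# 0-dim tensor per input; the fast path applies to CUDA tensors that meet
# the requirements above, everything else takes the slow path.
tensors = [torch.randn(3, 4), torch.randn(5)]
maxes = torch._foreach_max(tensors)
```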
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127187
Approved by: https://github.com/albanD
Add and test torchao nightly testing workflow.
This workflow will be triggered under the following conditions:
1. The PR has the ciflow/torchao label
2. Manual trigger
It will run the torchao benchmark on torchbench/timm/huggingface model workloads with 5 configs (noquant, autoquant, int8dynamic, int8weightonly, int4weightonly). The output will be uploaded to the PT2 Dashboard: https://hud.pytorch.org/benchmark/compilers
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126885
Approved by: https://github.com/huydhn
This PR excises opcheck's dependency on
torch.testing._internal.common_utils (which comes with dependencies on
expecttest and hypothesis). We do this by moving what we need to
torch.testing._utils and adding a test for it.
Fixes #126870, #126871
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127292
Approved by: https://github.com/williamwen42
ghstack dependencies: #127291
With the current state of export's dynamic shapes, we struggle with guards and constraints that are beyond the current dynamic shapes language, expressed with dims and derived dims. While we can compile and guarantee correctness for guards within the current language (e.g. min/max ranges, linear relationships, integer divisibility) we struggle to dynamically compile guards which extend beyond that.
For these "complex" guards, we typically do one of the following: 1) raise a constraint violation error, along the lines of "not all values of <symbol> in the specified range satisfy <guard>", with or without suggested fixes, 2) specialize to the provided static values and suggest removing dynamism, or 3) fail compilation due to some arbitrary unsupported case. Previous [work](https://github.com/pytorch/pytorch/pull/124949) went towards resolving this by disabling forced specializations, instead allowing the user to fail at runtime with incorrect inputs.
In this PR, relying on [hybrid backed-unbacked symints](https://github.com/pytorch/pytorch/issues/121749), [deferred runtime asserts](https://github.com/pytorch/pytorch/blob/main/torch/fx/passes/runtime_assert.py), and the function [_is_supported_equivalence()](d7de4c9d80/torch/fx/experimental/symbolic_shapes.py (L1824)), we add a flag `_allow_complex_guards_as_runtime_asserts` which allows the user to compile exported programs containing these guards and maintain dynamism, while adding correctness checks as runtime assertions in the graph.
Hybrid backed-unbacked symints allow us to easily bypass "implicit" guards emitted from computation - guards that we ~expect to be true. Popular examples revolve around reshapes:
```
# reshape
def forward(self, x, y):  # x: [s0, s1], y: [s2]
    return x.reshape([-1]) + y  # guard s0 * s1 = s2
```
This leads to the following exported program:
```
class GraphModule(torch.nn.Module):
    def forward(self, x: "f32[s0, s1]", y: "f32[s2]"):
        sym_size_int: "Sym(s2)" = torch.ops.aten.sym_size.int(y, 0)
        mul: "Sym(-s2)" = -1 * sym_size_int; sym_size_int = None
        sym_size_int_1: "Sym(s0)" = torch.ops.aten.sym_size.int(x, 0)
        sym_size_int_2: "Sym(s1)" = torch.ops.aten.sym_size.int(x, 1)
        mul_1: "Sym(s0*s1)" = sym_size_int_1 * sym_size_int_2; sym_size_int_1 = sym_size_int_2 = None
        add: "Sym(s0*s1 - s2)" = mul + mul_1; mul = mul_1 = None
        eq: "Sym(Eq(s0*s1 - s2, 0))" = add == 0; add = None
        _assert_scalar = torch.ops.aten._assert_scalar.default(eq, "Runtime assertion failed for expression Eq(s0*s1 - s2, 0) on node 'eq'"); eq = None
        view: "f32[s0*s1]" = torch.ops.aten.view.default(x, [-1]); x = None
        add_1: "f32[s0*s1]" = torch.ops.aten.add.Tensor(view, y); view = y = None
        return (add_1,)
```
Another case is symbol divisibility:
```
def forward(self, x):  # x: [s0, s1]
    return x.reshape([-1, x.shape[0] - 1])  # Eq(Mod(s0 * s1, s0 - 1), 0)
```
Applying deferred runtime asserts also helps dynamic compilation for "explicit" complex guards that typically cause problems for export. For example we can generate runtime asserts for not-equal guards, and complex conditions like the following:
```
class Foo(torch.nn.Module):
    def forward(self, x, y):
        # check that negation of first guard also shows up as runtime assertion
        if x.shape[0] == y.shape[0]:  # False
            return x + y
        elif x.shape[0] == y.shape[0] ** 3:  # False
            return x + 2, y + 3
        elif x.shape[0] ** 2 == y.shape[0] * 3:  # True
            return x * 2.0, y * 3.0
```
For the above graph we will generate 3 runtime assertions: the negation of the first 2, and the 3rd condition as a guard.
One additional benefit here over the current state of exported programs is that this adds further correctness guarantees - previously with explicit complex guards, if compilation succeeded, the guards would be ignored at runtime, treated as given.
As shown above, the runtime asserts appear as math ops in the graph, generated by the sympy interpreter, resulting in an _assert_scalar call. There is an option to avoid adding these asserts into the graph, by setting `TORCH_DYNAMO_DO_NOT_EMIT_RUNTIME_ASSERTS=1`. This results in the "original" computation graph, with dynamism, and any incorrect inputs will fail on ops during runtime. Further work could go into prettifying the printer, so the majority of the graph isn't guard-related.
Ideally this PR would subsume and remove the recently added [_disable_forced_specializations](https://github.com/pytorch/pytorch/pull/124949) flag, but that flag still handles one additional case of specialization: single-variable equalities where the symbol is solvable for a concrete value: see this [PR](https://github.com/pytorch/pytorch/pull/126925)
This PR doesn't change any behavior around data-dependent errors/unbacked symints yet, that could be further work.
NOTE: will take naming change suggestions for the flag :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127129
Approved by: https://github.com/avikchaudhuri
This PR tries to report some failures at build time. Once the build fails, it generally indicates that we can wrap the code inside some conditional macros, and it is a hint to further reduce the built code size. The sizeof operations are used to ensure that the assertions depend on specific template instantiations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127371
Approved by: https://github.com/ezyang, https://github.com/Skylion007
Definition (Linear Transformation):
A mapping $T : V \to W$ between $F$-vector spaces $V,W$ is called a *linear transformation* if and only if
a) $T(u+v)=T(u)+T(v)$,
b) $T(cv)=cT(v)$
for all $u, v \in V$, $c \in F$.
Consequently, $T(0_V)=0_W$.
Thus $x \mapsto xA^T+b$ for nonzero $b$ is **not** a linear transformation, but is often referred to as an affine linear transformation.
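A quick numerical illustration of the distinction (not part of the original text): `nn.Linear` with a nonzero bias does not map zero to zero, so it is affine rather than linear.
```python
import torch

lin = torch.nn.Linear(3, 2)                  # computes x @ A^T + b
zero = torch.zeros(1, 3)
print(lin(zero))                             # equals b, generally nonzero, so T(0) != 0
print(torch.allclose(lin(zero), lin.bias))   # True: the offset is exactly the bias
```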
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127240
Approved by: https://github.com/soulitzer, https://github.com/albanD
As FindPythonInterp and FindPythonLibs has been deprecated since cmake-3.12
Replace `PYTHON_EXECUTABLE` with `Python_EXECUTABLE` everywhere (CMake variable names are case-sensitive)
This makes PyTorch buildable with python3 binary shipped with XCode on MacOS
TODO: Get rid of `FindNumpy` as its part of Python package
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124613
Approved by: https://github.com/cyyever, https://github.com/Skylion007
Summary: Unlike JIT Inductor, AOTI currently unlifts weights and buffers from input args, so the reinplace pass didn't really work for AOTI because it only checks for mutation on placeholders, which led to excessive memory copies for kv_cache updates in LLM models. This PR removes those memory copies and roughly offers a 2x speedup. In the future, we will revert the unlift logic in AOTI and make the behavior consistent with JIT Inductor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127297
Approved by: https://github.com/peterbell10, https://github.com/chenyang78
Use `typing_extensions.deprecated` for deprecation annotation if possible. Otherwise, add `category=FutureWarning` to `warnings.warn("message")` if the category is missing.
Note that only warnings whose messages contain `[Dd]eprecat(ed|ion)` are updated in this PR.
UPDATE: Use `FutureWarning` instead of `DeprecationWarning`.
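A small sketch of the two patterns described above (the function names are hypothetical, not actual torch APIs):
```python
import warnings
from typing_extensions import deprecated

# Pattern 1: annotate the deprecated callable itself.
@deprecated("old_api() is deprecated, use new_api() instead", category=FutureWarning)
def old_api():
    ...

# Pattern 2: an existing warnings.warn() call gains an explicit category.
def legacy_path():
    warnings.warn(
        "legacy_path() is deprecated, use new_api() instead",
        FutureWarning,
    )
```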
Resolves #126888
- #126888
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126898
Approved by: https://github.com/albanD
1. **Expose seqused_k & alibi_slopes arguments**:
- This can be used when your sequence length k is not the full extent of the tensor. This is useful for kv cache scenarios and was not previously supported in the FA2 TORCH integration. We need these arguments for external xformers lib call to the _flash_attention_forward API.
Before:
```
std::optional<Tensor> seqused_k = c10::nullopt;
std::optional<Tensor> alibi_slopes = c10::nullopt;
```
After:
```
_flash_attention_forward(...
std::optional<Tensor>& seqused_k,
std::optional<Tensor>& alibi_slopes,
```
2. There is a difference between the **TORCH_FA2_flash_api:mha_fwd** and **FA2_flash_api:mha_fwd** (same for **mha_varlen_fwd**) at the query transposition (GQA) step.
The **CHECK_SHAPE** is applied on the original query vs the reshaped query. This causes an error (because of the shape constraint) for such inputs:
```
q = torch.randn([7, 1, 4, 256], dtype=torch.bfloat16, device='cuda')
k = torch.randn([7, 51, 1, 256], dtype=torch.bfloat16, device='cuda')
v = torch.randn([7, 51, 1, 256], dtype=torch.bfloat16, device='cuda')
```

- I've modified the code as little as possible, but if you prefer a more verbose change like the following, don't hesitate to tell me:
```
at::Tensor swapped_q = seqlenq_ngroups_swapped
    ? q.reshape({batch_size, num_heads_k, num_heads / num_heads_k, head_size_og}).transpose(1, 2)
    : q;
if (seqlenq_ngroups_swapped) {
    seqlen_q = num_heads / num_heads_k;
    num_heads = num_heads_k;
}
CHECK_SHAPE(swapped_q, batch_size, seqlen_q, num_heads, head_size_og);
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126520
Approved by: https://github.com/drisspg
As part of #125683, this PR adds the initial bf16/fp16 gemm template support with micro-gemm implemented with fused type casting and fp32 computation. It doesn't provide epilogue fusion support yet which will be added in the next PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126068
Approved by: https://github.com/jansel
ghstack dependencies: #124021, #126019
As of now, the documentation of distributed.new_group() says that it returns `None` when the current rank is not in the newly created process group. However, it actually returns `GroupMember.NON_GROUP_MEMBER`. I have checked the code and think it is more appropriate to fix the documentation.
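For reference, a short sketch of the documented behavior (assuming an initialized multi-rank process group):
```python
import torch.distributed as dist

# On ranks that are not part of the new group, new_group() returns
# GroupMember.NON_GROUP_MEMBER rather than None, so check against that sentinel.
subgroup = dist.new_group(ranks=[0, 1])
if subgroup is dist.GroupMember.NON_GROUP_MEMBER:
    pass  # this rank is not a member of the newly created group
```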
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122703
Approved by: https://github.com/wconstab, https://github.com/kwen2501
Summary:
Adds a "safe" parallel compile implementation that a) Popens a sub-process with an entry point we control, and b) uses a ProcessPoolExecutor in that sub-process to perform parallel compiles. This change essentially squashes these two implementations from jansel, but removes the "thread-based" approach since benchmarking revealed that compile-time performance was poor compared to the existing impl:
https://github.com/pytorch/pytorch/pull/124682
https://github.com/pytorch/pytorch/pull/122941
This PR adds the implementation, but defaults to the existing "fork". I'll submit a separate change to enable it.
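A generic, self-contained sketch of the pattern described above (illustrative only, not the actual Inductor code): the parent launches a subprocess whose entry point we control, and that subprocess hosts a ProcessPoolExecutor for the compile jobs.
```python
import sys
from concurrent.futures import ProcessPoolExecutor

def compile_one(job: str) -> str:
    return f"compiled:{job}"  # stand-in for an actual compile step

def worker_main() -> None:
    # The parent would Popen this script and stream compile jobs over stdin.
    jobs = [line.strip() for line in sys.stdin if line.strip()]
    with ProcessPoolExecutor(max_workers=4) as pool:
        for result in pool.map(compile_one, jobs):
            print(result)

if __name__ == "__main__":
    worker_main()
```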
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126816
Approved by: https://github.com/jansel
As part of #125683, this PR adds the epilogue support for c++ gemm template by reusing the c++ vector codegen on sub-slices of tensors. This is implemented by retracing the epilogue IR nodes with new ranges and offsets. The new `codegen_loop_bodies` and `codegen_functions` methods are added to c++ vector codegen for this purpose. This is leveraged by the `store_output` method of the template kernel for epilogue codegen and store to the final result.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126019
Approved by: https://github.com/jansel
ghstack dependencies: #124021
This pass was broken in a number of ways: we were not generating
asserts whenever we took it, even though we needed to. While fixing this,
we found that the analysis we were using for choosing
whether to generate asserts for dynamic shapes was completely
broken.
Eliminating indirect indexing in this way allows for a number of optimisations.
In particular, we can now fuse against these kernels (indirect indexing disallows fusions).
The new strategy is as follows:
- We always propagate sympy expressions if we can.
- If an expression was an indirect_indexing, we call `check_bounds`
- We also call `check_bounds` within `CSEProxy.indirect_indexing`
- The checks are issued in the buffer where they would go if they were used in a load
- This makes them always be codegen'd before the loads and stores
- In the case of stores, they will be generated potentially much earlier than the stores themselves, which is fine.
We add quite a few asserts to preexisting tests to strengthen them. In particular, we make sure
that issuing an assert plays well with all kinds of C++ vectorisation.
For now, we rely on the logic within `_maybe_evaluate_static` to prove
these bounds. This logic is rather limited though. In the future, we might want
to rely on Z3 here to be able to prove bounds in a more general way.
Supersedes https://github.com/pytorch/pytorch/pull/113068
Fixes https://github.com/pytorch/pytorch/issues/121251
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114471
Approved by: https://github.com/peterbell10
This PR adds the Cpp template infrastructure and the initial FP32 gemm template. See RFC https://github.com/pytorch/pytorch/issues/125683 for more background info.
1. Cpp template infrastructure
Similar template abstractions as the CUTLASS template, i.e., `CppTemplate`, `CppTemplateKernel`, `CppTemplateBuffer`. The MicroGemm micro-kernel abstraction that can be used by Cpp GEMM templates.
2. Initial FP32 gemm template
This involves a GEMM template implementation `CppPackedGemmTemplate` that supports GEMM with a constant weight (`B`) requiring `N` to be a multiple of the register blocking, while allowing static or dynamic sizes for the `M` (batch dim) of `A`. The `B` matrix is prepacked. This is a typical setting for inference workloads. The template handles the thread decomposition (via `thread_blocking`) and cache blocking (via `cache_blocking`). Then it invokes `CppMicroGemm`, which handles register blocking, instruction selection, and other CPU architecture-specific optimizations. A `CppMicroGemmFP32Vec` micro-kernel implementation is provided for fp32 matmuls implemented with the ATen vec abstraction.
3. Correctness and performance
The changes have been validated with fp32 inference on the three benchmark suites (torchbench, huggingface and timm_models) with both static and dynamic shapes. Since this is an initial implementation, we are still working on further performance improvements in follow-up PRs, including optimizations in the kernels as well as fusions. The perf gains are only observed on a select number of models compared to the ATen kernels, which are implemented with MKL. The perf gains are more obvious with dynamic shapes since MKL only supports packed gemm for static shapes. Below are details.
Static shapes
| Benchmark | torchbench | huggingface | timm_models |
|------------|-------------|--------------|--------------|
| Multi-threaded (baseline) | 1.47x | 1.36x | 1.91x |
| Multi-threaded (max-autotune) | 1.47x | 1.36x | 1.92x |
| Single-threaded (baseline) | 1.56x | 1.19x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.52x |
Key models being sped up:
drq: 1.14x
soft_act: 1.12x
cait_m36_384: 1.18x
Dynamic shapes
| Benchmark | torchbench | huggingface | timm_models |
| --- | --- | --- | --- |
| Multi-threaded (baseline) | 1.43x | 1.28x | 1.85x |
| Multi-threaded (max-autotune) | 1.47x | 1.28x | 1.85x |
| Single-threaded (baseline) | 1.55x | 1.20x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.53x |
Key models being sped up:
BERT_pytorch: 1.22x
pyhpc_turbulent: 1.13x
soft_actor_critic: 1.77x
BlenderbotForCausalLM: 1.09x
cait_m36_384: 1.17x
Differential Revision: [D57585365](https://our.internmc.facebook.com/intern/diff/D57585365)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124021
Approved by: https://github.com/jansel
This PR fixes Issue #124391.
There are two root causes.
### Root Cause 1 [better support for stream during cudagraph capture]
When recording a new function, CUDA graph tree records memory block states (e.g., address, size, allocated, etc) via `getCheckpointState`. Let's say the record is called `block_state`.
Later, CUDA graph tree would like to recover exactly the same memory block states via `apply_checkpoint_execution_state_in_allocator`, which a) frees all memory blocks; b) allocates all recorded block states (regardless of `block_state->allocated`); c) frees blocks with `block_state->allocated == False`; and d) checks that block_state matches the remaining blocks (e.g., `block_state->ptr == block->ptr`).
An error may occur when multiple streams exist during recording. [Note](https://github.com/pytorch/pytorch/blob/main/c10/cuda/CUDACachingAllocator.cpp#L2149-L2152) that a block will not be merged with other blocks if it is used by some streams, even if `block->allocated==False`. This may lead to a mismatch between `block_state->ptr` and `block->ptr` in `apply_checkpoint_execution_state_in_allocator`.
This PR solves the issue by not inserting events that come from a stream used during cudagraph capture. The reason is that we know all events or streams used during cudagraph capture must have completed before cudagraph capture finishes.
### Root Cause 2 [fix a bug in checkpoint state]
When we call getCheckpointState, we create the block state. At that time, we do not record block->device, so block_state->device == 0 regardless of the real value of block->device. See [how](https://github.com/pytorch/pytorch/blob/main/c10/cuda/CUDACachingAllocator.cpp#L744-L750) BlockState is created from a block.
When the block state is used during setSegmentStateToCheckpoint, we use [block_state.device (=0)](https://github.com/pytorch/pytorch/blob/main/c10/cuda/CUDACachingAllocator.cpp#L1526), which leads to errors.
We fixed this issue by recording block->device into block_state in getCheckpointState.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126809
Approved by: https://github.com/eellison
Summary:
ProcessGroupNCCL sets up group_name/desc in the c10d log and in NCCL when initializing the NCCL communicator. In eager initialization mode, pg_name and pg_desc are set after communicator initialization, so the information is not available in the PyTorch log or the NCCL communicator.
This PR fixes this by setting pg_name/desc earlier.
Differential Revision: D57759816
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127053
Approved by: https://github.com/wconstab, https://github.com/kwen2501
## Motivation
Resolves #126626 to support TorchTitan.
With this PR, we add back support for cases where a parameter or buffer is used in multiple stages. An example of such usage is in LLaMA (torchtitan), code snippet:
```
for layer in self.layers.values():
    h = layer(h, self.freqs_cis)
```
## Solution
Step 1:
Remove the previous guards of `if len(node.users) == 1`.
Step 2:
Call `move_param_to_callee` multiple times, one for each stage ("callee").
Step 3:
Delay deletion of the `get_attr` node (for getting the param) from root till this param has been sunk into each stage that uses it.
The PR also cleans up the old code around this (dropping the TRANSMIT mode and supporting REPLICATE mode only).
## Test
Changed the `ExampleCode` model to use `mm_param1` in multiple stages.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126653
Approved by: https://github.com/pianpwk
This PR adds a registration function and a global registry for GraphModuleSerializer. After this PR, custom serialization methods can be done through registration instead of subclassing for ease of maintenance.
## Changes
- Add a test case where it injects custom op to test serialization.
- Add custom op handler
- Change allowed op for verifier
Co-authored-by: Zhengxu Chen <zhxchen17@outlook.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126550
Approved by: https://github.com/zhxchen17
Summary:
Expand TorchScript `__init__` annotation warning to `list` and `dict` with reference to GSD task T187638414 and annotation warning reproduction D56834720.
Currently, the TorchScript compiler ignores and throws `UserWarning`s for the following annotation types for empty values within the `__init__` function: `List`, `Dict`, `Optional`. However, the compiler should additionally cover warnings for `list` and `dict`. This diff adds support for `list` and `dict`.
Test Plan:
Added 4 new unit tests:
`test_annotated_empty_list_lowercase` and `test_annotated_empty_dict_lowercase` verify that TorchScript throws UserWarnings for the list and dict type annotations on empty values.
```
(base) [jananisriram@devvm2248.cco0 /data/users/jananisriram/fbsource/fbcode (e4ce427eb)]$ buck2 test @mode/{opt,inplace} //caffe2/test:jit -- --regex test_annotated_empty_list_lowercase
...
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0
```
```
(base) [jananisriram@devvm2248.cco0 /data/users/jananisriram/fbsource/fbcode (e4ce427eb)]$ buck2 test @mode/{opt,inplace} //caffe2/test:jit -- --regex test_annotated_empty_dict_lowercase
...
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0
```
`test_annotated_with_jit_empty_list_lowercase` and `test_annotated_with_jit_empty_dict_lowercase` verify that TorchScript throws UserWarnings for the list and dict type annotations on empty values with the jit annotation.
```
(base) [jananisriram@devvm2248.cco0 /data/users/jananisriram/fbsource/fbcode (e4ce427eb)]$ buck2 test @mode/{opt,inplace} //caffe2/test:jit -- --regex test_annotated_with_jit_empty_list_lowercase
...
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0
```
```
(base) [jananisriram@devvm2248.cco0 /data/users/jananisriram/fbsource/fbcode (e4ce427eb)]$ buck2 test @mode/{opt,inplace} //caffe2/test:jit -- --regex test_annotated_with_jit_empty_dict_lowercase
...
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0
```
Differential Revision: D57752002
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127045
Approved by: https://github.com/davidberard98
The previous fallback ignores any known hint values in the expression and only
looks at the value ranges. By using the `symbolic_hint` we will use both hints
and value ranges.
Also removed the recursive use of `size_hint` on the bounds, since these should
always be constants.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127262
Approved by: https://github.com/lezcano
ghstack dependencies: #127251
Doesn't affect current behavior by default, for #126544
I'm not sure what the exact mechanism is here, but CUDA errors appear to already be thrown in the main process, meaning that the watchdog is separately throwing CUDA errors again. However, this rethrown error causes the process to be terminated, as it cannot be handled from user code (which doesn't have visibility of the watchdog thread).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126587
Approved by: https://github.com/kwen2501
Fixes the following error, which would occur when composing PT2 FSDP and cudagraphs:
> ERROR: expected to be in states [<TrainingState.FORWARD_BACKWARD: 2>] but current state is TrainingState.IDLE
Cudagraphs caches output tensor impls in the fast path, so we were inadvertently accumulating multiple hooks on what should have been fresh allocations.
From the code comment:
```
# this output represents a fresh allocated tensor.
# We return the same TensorImpl from run to run to avoid overhead.
# autograd.Function will reset the Autograd meta of output tensors
# as part of aot_autograd, but _backward_hooks are stored on tensors separately,
# so we need to manually reset hooks.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126914
Approved by: https://github.com/awgu, https://github.com/xmfan
Summary:
The global store may already have been destroyed when we do the check,
which leads to a null pointer exception. This caused a SEV in production.
Stack trace from crash:
```
[trainer2]:# 5 c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&)
[trainer2]:# 6 c10d::ProcessGroupNCCL::heartbeatMonitor()
```
Test Plan:
Will deploy in a small training job with `NCCL_DUMP_ON_TIMEOUT` set.
Job should complete with no exceptions.
Tasks: T190163458
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127030
Approved by: https://github.com/Skylion007, https://github.com/shuqiangzhang
This PR addresses 2 issues with derived dim suggested fixes: 1) newly introduced roots, and 2) root swapping.
1 | Newly introduced roots appear with modulo guards, e.g. Mod(dx, 2) = 0 suggests dx is a derived dim equal to 2 * _dx, introducing a new root _dx. Currently the final suggested fixes handle this correctly, but we can get intermediate results where related derived dims don't rely on a unified root, and are a mixture of min/max range and derived suggestions.
For example:
```
"dx": {"eq": 3*_dx-1, "max": 36}
"dy": {"eq": dx+1}
```
This should lead to suggested fixes:
```
_dx = Dim('_dx', max=12)
dx = 3 * _dx - 1
dy = 3 * _dx
```
This PR prettifies the suggested fixes routine by unifying to a single root, and making each intermediate suggestion either a derived dim or min/max range, not both.
2 | The current suggested fixes for derived dims can lead to root dims/derived dims being swapped, e.g. `dy - 1, dy` -> `dx, dx + 1`. This leads to problematic suggested fixes that look like `dy - 1 = Dim("dy - 1")` since we don't have access to the original variable name.
This PR only adds a suggested fix for the root dim, and removes all other derived suggestions.
For example, with the export test case test_derived_dim_out_of_order_simplified:
```
_dimz = torch.export.Dim("_dimz", min=6, max=8)
dimy = _dimz - 1
dimx = dimy - 1
dimz = torch.export.Dim("dimz", min=6, max=8) # doesn't work, should be = _dimz
class Foo(torch.nn.Module):
    def forward(self, x, y, z):
        return x + y[1:] + z[2:]
foo = Foo()
u, v, w = torch.randn(5), torch.randn(6), torch.randn(7)
export(
    foo,
    (u, v, w),
    dynamic_shapes=({0: dimx}, {0: dimy}, {0: dimz}),
)
```
Before:
```
Suggested fixes:
_dimz = Dim('_dimz', min=3, max=9223372036854775807) # 2 <= _dimz - 1 <= 9223372036854775806
_dimz - 2 = Dim('_dimz - 2', min=4, max=6)
_dimz = Dim('_dimz', min=2, max=9223372036854775806) # 2 <= _dimz <= 9223372036854775806
_dimz - 1 = _dimz - 1
dimz = _dimz
```
New suggested fixes:
```
Suggested fixes:
dimz = _dimz
```
Note: This assumes the specified derived relations between dims are correct. This should be valid because: 1) if the relation is plain wrong (e.g. (dx, dx - 1) provided with inputs (6, 4)), this gets caught beforehand in produce_guards. 2) if the relation is correct but does not match the emitted guard, for example:
```
def forward(self, x, y):
    return x.reshape([-1]) + y  # guard: s0 * 2 = s1
dx = Dim("dx")
export(
    model,
    (torch.randn(6, 2), torch.randn(12)),
    dynamic_shapes={"x": (dx, 2), "y": (dx + 6, )},
)
```
This produces two linear equations, leading to specialization since a) produce_guards is able to solve for a concrete value, and b) the export constraint solver will anyway force specializations due to range constraints.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125543
Approved by: https://github.com/avikchaudhuri
Summary: We want to track how well torch.jit.trace can be converted to export at large scale. As a first step, for all torch.jit.trace unit tests, we log whether we can convert the traced module to an export module OR export the model directly.
Test Plan: CI
Differential Revision: D57629682
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126900
Approved by: https://github.com/SherlockNoMad
## Goal
As title
## Design
Based on the fact that each TorchScript module has a `code` property which provides the original source code of the `forward` function, I implemented a function to recover the `forward` function signature by using the AST parser (a minimal sketch follows the tradeoff list below).
Some other tradeoffs considered:
* Directly parsing src code as string --> will be very buggy
* Directly using `compile` function in Python to get the function object --> raises a lot of exceptions because of missing packages or undefined variable names
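A minimal sketch of the chosen approach (illustrative, not the actual implementation): parse the module's `code` string with Python's `ast` module and pull out the `forward` arguments.
```python
import ast
import torch

def forward_arg_names(scripted: torch.jit.ScriptModule) -> list:
    # `scripted.code` is the TorchScript-generated source of forward();
    # walk its AST to recover the argument names without executing it.
    tree = ast.parse(scripted.code)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == "forward":
            return [a.arg for a in node.args.args if a.arg != "self"]
    return []
```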
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126787
Approved by: https://github.com/angelayi, https://github.com/tugsbayasgalan
The main motivation for this refactor is that today, when generating templates, this is what happens.
```
def_kernel() # registers hook for fully generating function definition
store_output() # registers hook for generating the output store. *also* keeps a number of things generated on `self.body`.
```
Later on, when we codegen the template: f8c4c268da/torch/_inductor/codegen/simd.py (L1402)
```
epilogue_node.codegen() # Also writes to body!
template.finalize() # Calls the above two hooks for def_kernel and store_output, which then reads from the accumulated `self.body`
```
Today, this is fine, as long as `store_output` is the last function called in the template. However, there are a couple of things we probably want to do with kernels that make this annoying.
1. In FlexAttention backwards, we might want a `modification` to be positioned *after* the `store_output` (just logically from a code organization POV). This doesn't work today because `modification` also needs to codegen a subgraph, but writing to `body` here conflicts with `store_output`'s implicit saved state on `self.body`.
2. If we want to support prologue fusion, we need to go through a bunch of contortions today to call the template hook finalization a couple times (https://github.com/pytorch/pytorch/pull/121211/files#diff-73b89475038a5b4705da805f1217783883fb90398ee1164995db392fc4a342c1R322)
3. The current code also makes it quite difficult to support fusion into multiple output nodes.
To resolve this, I do two things:
1. I *remove* the default `self.body` on `TritonTemplateKernel`. Instead, I have a dict of `self.subgraph_bodies`, which can be enabled in a context with `TritonTemplateKernel.set_subgraph_body`. This allows multiple different template functions to write to their own isolated bodies.
2. I add functions that allow you to finalize specific hooks on `PartialRender`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127144
Approved by: https://github.com/jansel
The generated bytecode for the first frame is below; see the inlined comments about the LOAD_ATTR that causes Dynamo to trigger again on `__getattr__`.
~~~
[__bytecode] MODIFIED BYTECODE fn /data/users/anijain/pytorch2/test/dynamo/test_activation_checkpointing.py line 1129
[__bytecode] 1129 0 COPY_FREE_VARS 1
[__bytecode] 2 RESUME 0
[__bytecode] 4 PUSH_NULL
[__bytecode] 6 LOAD_GLOBAL 10 (__compiled_fn_1)
[__bytecode] 18 LOAD_FAST 0 (x)
[__bytecode] 20 LOAD_DEREF 1 (mod)
[__bytecode] 22 LOAD_ATTR 6 (_checkpoint_wrapped_module)
[__bytecode] 32 LOAD_CONST 1 (0)
[__bytecode] 34 BINARY_SUBSCR
[__bytecode] 44 LOAD_ATTR 7 (weight)
[__bytecode] 54 LOAD_DEREF 1 (mod)
[__bytecode] 56 LOAD_ATTR 6 (_checkpoint_wrapped_module)
[__bytecode] 66 LOAD_CONST 1 (0)
[__bytecode] 68 BINARY_SUBSCR
[__bytecode] 78 LOAD_ATTR 8 (bias)
# When this optimized bytecode is executed, these two lines call the __getattr__ of the ActivationWrapper module.
# Dynamo gets invoked on __getattr__.
# If we had inlined __getattr__ during tracing, we would have seen the LOAD_ATTR
# on lower-level data structures like _modules, obviating the need for CPython
# to call the Python-overridden __getattr__. But today, UnspecializedNNModuleVariable
# calls Python getattr at tracing time (instead of inlining it), resulting in LOAD_ATTR
# on the module itself.
# To prevent Dynamo from tracing __getattr__ invoked from the optimized bytecode,
# we can check whether it is the top-level frame and just skip it.
[__bytecode] 88 LOAD_DEREF 1 (mod)
[__bytecode] 90 LOAD_ATTR 0 (a)
[__bytecode] 100 PRECALL 4
[__bytecode] 104 CALL 4
[__bytecode] 114 UNPACK_SEQUENCE 1
[__bytecode] 118 RETURN_VALUE
~~~~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127263
Approved by: https://github.com/yf225
Since we use CUDA 12.1 by default now, it would be better to update the doc.
Many people (including me) want to directly copy-paste the commands from the README 😉 Let's make our lives easier.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122125
Approved by: https://github.com/malfet
Fixes some files in #123062
Run lintrunner on files:
test/test_nnapi.py,
test/test_numba_integration.py,
test/test_numpy_interop.py,
test/test_openmp.py,
test/test_optim.py
```bash
$ lintrunner -a --take UFMT --all-files
ok No lint issues.
Successfully applied all patches.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126845
Approved by: https://github.com/ezyang
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in this PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127126
Approved by: https://github.com/kit1980
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in this PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127126
Approved by: https://github.com/kit1980
ghstack dependencies: #127122, #127123, #127124, #127125
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in this PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127125
Approved by: https://github.com/Skylion007
ghstack dependencies: #127122, #127123, #127124
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in this PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127124
Approved by: https://github.com/Skylion007
ghstack dependencies: #127122, #127123
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in this PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127123
Approved by: https://github.com/Skylion007
ghstack dependencies: #127122
As discussed, this cleans up the code so that create_aot_dispatcher literally chooses an aot_dispatch function and runs it. Moves wrapper logic to jit_compile_runtime_wrappers, and adds aot_dispatch_export to handle export cases in one place.
This also makes aot_dispatch_* always return the same type: a Callable and the forward metadata, instead of returning a different number of arguments in export cases. Callers that don't care about fw_metadata can just ignore it. Added return type hints to enforce the same exact interface among all the aot_dispatch_* functions.
It'd be nice to move the checks from the synthetic base and dedup wrappers that have to do with export outside of those wrappers, but it's probably fine for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126402
Approved by: https://github.com/oulgen, https://github.com/bdhirsh
ghstack dependencies: #126193
Related to https://github.com/pytorch/pytorch/issues/98467
The tacotron2 benchmark creates a lot of nodes which fusion then checks. This improves some of the perf of that checking.
`can_fuse_vertical` calls `fusable_read_and_write` on O(read deps * write deps) combinations - but only cares about write deps that are MemoryDeps - so do the isinstance check outside the inner loop to save O(read deps) when it won't matter anyway.
Also moves `fusable_read_and_write` to an instance method (instead of a closure) since it doesn't actually capture any variables.
I also tried pre-splitting the read deps into `StarDep` vs `MemoryDep` but that didn't actually make any perf difference.
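A toy sketch of the hoisting pattern described above (the classes and predicate below are stand-ins, not the real scheduler types):
```python
class StarDep: ...
class MemoryDep: ...

def fusable_read_and_write(read, write) -> bool:
    return True  # placeholder for the real predicate

def can_fuse_vertical(read_deps, write_deps) -> bool:
    # Only MemoryDep writes matter, so filter them once here instead of
    # running isinstance() inside the O(len(reads) * len(writes)) inner loop.
    memory_writes = [w for w in write_deps if isinstance(w, MemoryDep)]
    return any(
        fusable_read_and_write(r, w) for r in read_deps for w in memory_writes
    )
```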
Testing:
```
time python benchmarks/dynamo/torchbench.py --accuracy --inference --amp --backend inductor --disable-cudagraphs --device cuda --only tacotron2
```
Before this change: 10m15s
After this change: 9m31s
Related to #98467
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127061
Approved by: https://github.com/peterbell10, https://github.com/jansel
ghstack dependencies: #127060
Related to #98467
The tacotron2 benchmark creates a lot of nodes which fusion then checks. This
improves some of the perf of that checking.
`score_fusion_memory` is called O(n^2) times - so by moving the set union, `has_unbacked_symbols` check, and `numbytes_hint` out of the loop we call them O(n) times and the O(n^2) call gets cheaper.
Testing:
```
time python benchmarks/dynamo/torchbench.py --accuracy --inference --amp --backend inductor --disable-cudagraphs --device cuda --only tacotron2
```
Before this change: 12m33s
After this change: 10m15s
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127060
Approved by: https://github.com/peterbell10, https://github.com/jansel
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in this PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127122
Approved by: https://github.com/kit1980
## Description
Fixes https://github.com/pytorch/pytorch/issues/114450. This PR builds upon the work from @imzhuhl done in https://github.com/pytorch/pytorch/pull/114451.
This PR requires https://github.com/pytorch/pytorch/pull/122472 to land first.
We leverage the serialization and deserialization API from oneDNN v3.4.1 to save the opaque MKLDNN tensor during the compilation and restore the opaque tensor when loading the compiled .so.
ideep version is updated so that we won't break any pipeline even if third_party/ideep is not updated at the same time.
### Test plan:
```sh
python -u test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCpu.test_freezing_non_abi_compatible_cpu
python -u test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCpu.test_conv_freezing_non_abi_compatible_cpu
python -u test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCpu.test_deconv_freezing_non_abi_compatible_cpu
python -u test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCpu.test_linear_freezing_non_abi_compatible_cpu
```
### TODOs in follow-up PRs
1. We found that using `AOTI_TORCH_CHECK` causes a performance drop on several models (`DistillGPT2`, `MBartForConditionalGeneration`, `T5ForConditionalGeneration`, `T5Small`) compared with JIT Inductor, which uses `TORCH_CHECK`. This may need further discussion on how to address it (`AOTI_TORCH_CHECK` was introduced in
https://github.com/pytorch/pytorch/pull/119220).
2. Freezing in non-ABI compatible mode will work with the support in this PR. For ABI compatible mode, however, we first need to address this issue: `AssertionError: None, i.e. optional output is not supported`.
6c4f43f826/torch/_inductor/codegen/cpp_wrapper_cpu.py (L2023-L2024)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124350
Approved by: https://github.com/jgong5, https://github.com/desertfire
# Motivation
## for `torch.amp.GradScaler`,
- `torch.cpu.amp.GradScaler(args...)` is completely equivalent to `torch.amp.GradScaler("cpu", args...)`.
- `torch.cuda.amp.GradScaler(args...)` is completely equivalent to `torch.amp.GradScaler("cuda", args...)`.
So, we intend to deprecate them and **strongly recommend** that developers use `torch.amp.GradScaler`.
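A one-line illustration of the equivalence (assuming `torch.amp.GradScaler` is available as described):
```python
import torch

scaler_old = torch.cuda.amp.GradScaler()   # deprecated spelling
scaler_new = torch.amp.GradScaler("cuda")  # recommended, device-agnostic spelling
```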
## for `custom_fwd` and `custom_bwd`,
this is a good solution to make the custom function run with or without effect even in an autocast-enabled region and can be shared by other backends, like CPU and XPU.
So we generalize it to be device-agnostic, put them in `torch/amp/autocast_mode.py`, and re-expose them as `torch.amp.custom_fwd` and `torch.amp.custom_bwd`. Meanwhile, we deprecate `torch.cuda.amp.custom_fwd` and `torch.cuda.amp.custom_bwd`.
# Additional Context
Add UT to cover the deprecated warning.
No need for more UTs to cover the functionality of `torch.amp.custom_f/bwd`, the existing UTs that previously covered the functionality of `torch.cuda.amp.custom_f/bwd` can cover them.
To facilitate the review, we separate these code changes into two PRs. The first PR covers `torch.amp.GradScaler`. The follow-up covers `custom_fwd` and `custom_bwd`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126527
Approved by: https://github.com/jgong5, https://github.com/gujinghui, https://github.com/janeyx99, https://github.com/EikanWang
tlparse prints a failure description like this
> dynamic shape operator: aten._unique2.default; to enable, set torch._dynamo.config.capture_dynamic_output_shape_ops = True
Adding an OS env var to make this easier to set for testing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127017
Approved by: https://github.com/jackiexu1992
This PR moves the post compile portion of aot_dispatch_autograd into runtime_wrappers.py. Completing this allows us to run the post compile section on its own when warm starting.
I considered leaving this thing in jit_compile_runtime_wrappers, but we're gonna run into circular dependency issues later if we don't move it over
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126193
Approved by: https://github.com/bdhirsh
ghstack dependencies: #126907
This continues the full deprecation after https://github.com/pytorch/pytorch/pull/114425. It's been 6 months! And I'm fairly certain no one is going to yell at me as this patch is not really used.
------
# BC Breaking note
As of this PR, SparseAdam will become consistent with the rest of our optimizers in that it will only accept containers of Tensors/Parameters/param groups, fully completing the deprecation of this path. Hitherto, SparseAdam had allowed raw tensors as the params argument to the constructor. Now, if you write the following code, there will be an error similar to every other optim: "params argument given to the optimizer should be an iterable of Tensors or dicts"
```
import torch
param = torch.rand(16, 32)
optimizer = torch.optim.SparseAdam(param)
```
Instead you should replace the last line with
```
optimizer = torch.optim.SparseAdam([param])
```
to no longer error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127081
Approved by: https://github.com/soulitzer
`QualnameMapMixin` was intended to provide a mapping from the new FQNs of the piped model to the FQNs of the original model. It was there because previous tracers and the flattening during tracing would modify the FQNs.
Now that we use the unflattener, the FQNs of the stage modules are the same as the original FQNs. We don't need `QualnameMapMixin` anymore.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127018
Approved by: https://github.com/H-Huang
This adds a bunch of global configurations to the cache key. There's definitely more I haven't added, but this is just an audit of all of the `torch.*` globals that are used in jit_compile_runtime_wrappers, runtime_wrappers, etc.
It also makes the hash details object subclass FXGraphHashDetails, which implements other hashed data like configs inductor depends on.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126907
Approved by: https://github.com/aorenste
PyTorch can't depend on `fbgemm_gpu` as a dependency because `fbgemm_gpu` already has a dependency on PyTorch. So this PR copy / pastes kernels from `fbgemm_gpu`:
* `dense_to_jagged_forward()` as CUDA registration for new ATen op `_padded_dense_to_jagged_forward()`
* `jagged_to_padded_dense_forward()` as CUDA registration for new ATen op `_jagged_to_padded_dense_forward()`
CPU impls for these new ATen ops will be added in a follow-up PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125946
Approved by: https://github.com/davidberard98
## Context
This stack prototypes automatic micro-pipelining of `all-gather -> matmul` and `matmul -> reduce-scatter` via Inductor. The idea originates from the paper [Overlap Communication with Dependent Computation via
Decomposition in Large Deep Learning Models](https://dl.acm.org/doi/pdf/10.1145/3567955.3567959). The implementation and some key optimizations are heavily influenced by @lw's implementation in xformers.
The stack contains several components:
- `ProcessGroupCudaP2P` - a thin wrapper around `ProcessGroupNCCL`. It in addition maintains a P2P workspace that enables SM-free, one-sided P2P communication which is needed for optimal micro-pipelining.
- `fused_all_gather_matmul` and `fused_matmul_reduce_scatter` dispatcher ops.
- Post-grad fx pass that detects `all-gather -> matmul` and `matmul -> reduce-scatter` and replaces them with the fused dispatcher ops.
To enable the prototype feature:
- Set the distributed backend to `cuda_p2p`.
- Set `torch._inductor.config._micro_pipeline_tp` to `True`.
*NOTE: the prototype sets nothing in stone w.r.t to each component's design. The purpose is to have a performant baseline with reasonable design on which each component can be further improved.*
## Benchmark
Setup:
- 8 x H100 (500W) + 3rd gen NVSwitch.
- Llama3 8B training w/ torchtitan.
- 8-way TP. Reduced the number of layers from 32 to 8 for benchmarking purpose.
Trace (baseline): https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_tmpjaz8zgx0
<img width="832" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/4addba77-5abc-4d2e-93ea-f68078587fe1">
Trace (w/ micro pipelining): https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_tmpn073b4wn
<img width="963" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/4f44e78d-8196-43ab-a1ea-27390f07e9d2">
## This PR
`ProcessGroupCudaP2P` is a thin wrapper around `ProcessGroupNCCL`. By default, it routes all collectives to the underlying `ProcessGroupNCCL`. In addition, `ProcessGroupCudaP2P` initializes a P2P workspace that allows direct GPU memory access among the members. The workspace can be used in Python to optimize intra-node communication patterns or to create custom intra-node collectives in CUDA.
`ProcessGroupCudaP2P` aims to bridge the gap where certain important patterns can be better optimized via fine-grained P2P memory access than with collectives in the latest version of NCCL. It is meant to complement NCCL rather than replacing it.
Usage:
```
# Using ProcessGroupCudaP2P
dist.init_process_group(backend="cuda_p2p", ...)
# Using ProcessGroupCudaP2P while specifying ProcessGroupCudaP2P.Options
pg_options = ProcessGroupCudaP2P.Options()
dist.init_process_group(backend="cuda_p2p", pg_options=pg_options, ...)
# Using ProcessGroupCudaP2P while specifying ProcessGroupNCCL.Options
pg_options = ProcessGroupNCCL.Options()
dist.init_process_group(backend="cuda_p2p", pg_options=pg_options, ...)
# Using ProcessGroupCudaP2P while specifying both
# ProcessGroupCudaP2P.Options and ProcessGroupNCCL.Options
pg_options = ProcessGroupCudaP2P.Options()
pg_options.nccl_options = ProcessGroupNCCL.Options()
dist.init_process_group(backend="cuda_p2p", pg_options=pg_options, ...)
# Down-casting the backend to access p2p buffers for cuda_p2p specific
# optimizations
if is_cuda_p2p_group(group):
    backend = get_cuda_p2p_backend(group)
    if required_p2p_buffer_size > backend.get_buffer_size():
        # fallback
    p2p_buffer = backend.get_p2p_buffer(...)
else:
    # fallback
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122163
Approved by: https://github.com/wanchaol
Before this PR, we had a graph break for `callable(nn_module)`:
```python
class M(nn.Module):
    def forward(self, x):
        return x.sin()

def f(m):
    return callable(m)

res = torch.compile(f, fullgraph=True)(M())
```
```
Traceback (most recent call last):
File "/data/users/yidi/pytorch/t.py", line 17, in <module>
out = torch.compile(f, backend="eager", fullgraph=True)(M())
File "/data/users/yidi/pytorch/torch/_dynamo/eval_frame.py", line 414, in _fn
return fn(*args, **kwargs)
File "/data/users/yidi/pytorch/torch/_dynamo/convert_frame.py", line 1077, in catch_errors
return callback(frame, cache_entry, hooks, frame_state, skip=1)
File "/data/users/yidi/pytorch/torch/_dynamo/convert_frame.py", line 456, in _convert_frame_assert
return _compile(
File "/data/users/yidi/pytorch/torch/_utils_internal.py", line 74, in wrapper_function
return function(*args, **kwargs)
File "/home/yidi/.conda/envs/pytorch/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/data/users/yidi/pytorch/torch/_dynamo/convert_frame.py", line 799, in _compile
guarded_code = compile_inner(code, one_graph, hooks, transform)
File "/data/users/yidi/pytorch/torch/_dynamo/utils.py", line 210, in time_wrapper
r = func(*args, **kwargs)
File "/data/users/yidi/pytorch/torch/_dynamo/convert_frame.py", line 618, in compile_inner
out_code = transform_code_object(code, transform)
File "/data/users/yidi/pytorch/torch/_dynamo/bytecode_transformation.py", line 1167, in transform_code_object
transformations(instructions, code_options)
File "/data/users/yidi/pytorch/torch/_dynamo/convert_frame.py", line 177, in _fn
return fn(*args, **kwargs)
File "/data/users/yidi/pytorch/torch/_dynamo/convert_frame.py", line 564, in transform
tracer.run()
File "/data/users/yidi/pytorch/torch/_dynamo/symbolic_convert.py", line 2244, in run
super().run()
File "/data/users/yidi/pytorch/torch/_dynamo/symbolic_convert.py", line 886, in run
while self.step():
File "/data/users/yidi/pytorch/torch/_dynamo/symbolic_convert.py", line 801, in step
self.dispatch_table[inst.opcode](self, inst)
File "/data/users/yidi/pytorch/torch/_dynamo/symbolic_convert.py", line 496, in wrapper
return inner_fn(self, inst)
File "/data/users/yidi/pytorch/torch/_dynamo/symbolic_convert.py", line 1255, in CALL_FUNCTION
self.call_function(fn, args, {})
File "/data/users/yidi/pytorch/torch/_dynamo/symbolic_convert.py", line 739, in call_function
self.push(fn.call_function(self, args, kwargs))
File "/data/users/yidi/pytorch/torch/_dynamo/variables/builtin.py", line 948, in call_function
return handler(tx, args, kwargs)
File "/data/users/yidi/pytorch/torch/_dynamo/variables/builtin.py", line 711, in <lambda>
return lambda tx, args, kwargs: obj.call_function(
File "/data/users/yidi/pytorch/torch/_dynamo/variables/builtin.py", line 948, in call_function
return handler(tx, args, kwargs)
File "/data/users/yidi/pytorch/torch/_dynamo/variables/builtin.py", line 835, in builtin_dipatch
unimplemented(error_msg)
File "/data/users/yidi/pytorch/torch/_dynamo/exc.py", line 216, in unimplemented
raise Unsupported(msg)
torch._dynamo.exc.Unsupported: builtin: callable [<class 'torch._dynamo.variables.nn_module.NNModuleVariable'>] False
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127026
Approved by: https://github.com/jansel
As part of #125683, this PR adds epilogue fusion support for bf16/fp16 gemms. The key changes are as follows:
1. bf16 linear w/ epilogue fusion of some ops was originally supported via ATen oneDNN linear pointwise ops. In order to match the ATen op semantics, in-template epilogue support is added to the cpp gemm template so that we would have: "gemm + in-template epilogues -> template buffer". If the template is chosen for codegen, the in-template epilogues will be concatenated with the out-of-template epilogues that are appended during the scheduling.
2. Support bf16/fp16 legalization for `codegen_loop_bodies` which is used to generate the epilogue loops.
3. We used to leverage the in-place buffer mechanism to handle the in-place buffers in the epilogue codegen, in particular, for the reuses for output buffers of GEMM, template and epilogues. This is not correct since the output buffer is an "output" not an "in-place" buffer of the template kernel itself. Now, we use a dedicated "aliases" dict to manage such buffer reuses and the intermediate aliasing buffers are removed after codegen.
4. Add `localize_buffer` method to `LocalBufferScope` to allow the replacement of a global buffer with a local one in the given inductor IR nodes. This helps the fused loops to work on smaller-sized local buffers for better data locality.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126545
Approved by: https://github.com/jansel
ghstack dependencies: #124021, #126019, #126068
As part of #125683, this PR adds the initial bf16/fp16 gemm template support with micro-gemm implemented with fused type casting and fp32 computation. It doesn't provide epilogue fusion support yet which will be added in the next PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126068
Approved by: https://github.com/jansel
ghstack dependencies: #124021, #126019
As part of #125683, this PR adds the epilogue support for c++ gemm template by reusing the c++ vector codegen on sub-slices of tensors. This is implemented by retracing the epilogue IR nodes with new ranges and offsets. The new `codegen_loop_bodies` and `codegen_functions` methods are added to c++ vector codegen for this purpose. This is leveraged by the `store_output` method of the template kernel for epilogue codegen and store to the final result.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126019
Approved by: https://github.com/jansel
ghstack dependencies: #124021
Previously, the default was that Inductor did not respect strides for
all (builtin and custom) ops unless the op has a
"needs_fixed_stride_order" tag on it. This PR changes it so that:
- inductor doesn't respect strides for builtin ops. To change the
behavior, one can add the "needs_fixed_stride_order" tag
- inductor does respect strides for custom ops. To change the behavior,
one can add the "does_not_need_fixed_stride_order" tag
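As a rough illustration (the op name and implementation are hypothetical, not from this PR, and this assumes `torch.library.define` accepts a `tags` argument), a custom op could opt back out of the new default by attaching the tag at definition time:
```python
import torch

# Hypothetical custom op used only for illustration.
torch.library.define(
    "mylib::add_one",
    "(Tensor x) -> Tensor",
    tags=(torch.Tag.does_not_need_fixed_stride_order,),  # opt out of fixed strides
)

@torch.library.impl("mylib::add_one", "CompositeExplicitAutograd")
def add_one(x):
    return x + 1
```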
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126986
Approved by: https://github.com/ezyang, https://github.com/albanD
This PR adds the Cpp template infrastructure and the initial FP32 gemm template. See RFC https://github.com/pytorch/pytorch/issues/125683 for more background info.
1. Cpp template infrastructure
Similar template abstractions as the CUTLASS template, i.e., `CppTemplate`, `CppTemplateKernel`, `CppTemplateBuffer`. The MicroGemm micro-kernel abstraction that can be used by Cpp GEMM templates.
2. Initial FP32 gemm template
This involves a GEMM template implementation `CppPackedGemmTemplate` that supports GEMM with constant weight (`B`) requiring `N` to be a multiple of register blocking, while allowing static or dynamic sizes for the `M` (batch dim) of `A`. The `B` matrix would be prepacked. This is a typical setting for inference workloads. The template handles the thread decomposition (via `thread_blocking`) and cache blocking (via `cache_blocking`). Then it invokes `CppMicroGemm` which handles register blocking, instruction selection, and other CPU architecture-specific optimizations. A `CppMicroGemmFP32Vec` micro-kernel implementation is provided for fp32 matmuls implemented with the ATen vec abstraction.
3. Correctness and performance
The changes have been validated with fp32 inference on the three benchmark suites (torchbench, huggingface and timm_models) with both static and dynamic shapes. Since it is an initial implementation, we are still working on further performance improvements with follow-up PRs, including optimizations in kernels as well as fusions. The perf gains are only observed on a select number of models compared to the ATen kernels, which are implemented with MKL. The perf gains are more obvious with dynamic shapes, since MKL only supports packed gemm for static shapes. Details are below.
Static shapes
| Benchmark | torchbench | huggingface | timm_models |
|------------|-------------|--------------|--------------|
| Multi-threaded (baseline) | 1.47x | 1.36x | 1.91x |
| Multi-threaded (max-autotune) | 1.47x | 1.36x | 1.92x |
| Single-threaded (baseline) | 1.56x | 1.19x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.52x |
Key models being sped up:
drq: 1.14x
soft_act: 1.12x
cait_m36_384: 1.18x
Dynamic shapes
| Benchmark | torchbench | huggingface | timm_models |
| --- | --- | --- | --- |
| Multi-threaded (baseline) | 1.43x | 1.28x | 1.85x |
| Multi-threaded (max-autotune) | 1.47x | 1.28x | 1.85x |
| Single-threaded (baseline) | 1.55x | 1.20x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.53x |
Key models being sped up:
BERT_pytorch: 1.22x
pyhpc_turbulent: 1.13x
soft_actor_critic: 1.77x
BlenderbotForCausalLM: 1.09x
cait_m36_384: 1.17x
Differential Revision: [D57585365](https://our.internmc.facebook.com/intern/diff/D57585365)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124021
Approved by: https://github.com/jansel
Summary: I found that doubling this significantly improved performance, but doubling again did not, so I stopped here.
Test Plan: CI
Benchmarked with llm_experiments repo as previously in stack; relevant data:
before:
trans_b torch.float16 1396.11 usec (4100)
trans_b torch.float16 1399.54 usec (4104)
after:
trans_b torch.float16 1096.00 usec (4100)
trans_b torch.float16 1093.47 usec (4104)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126877
Approved by: https://github.com/malfet
ghstack dependencies: #126745, #126746, #126793, #126794
# Summary
Updates the modification Jinja template's API so as to specify the output_name for the fixed buffer. Also updates flex-attention's usage to make the algorithm clearer and more closely aligned with the vmap impl.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127035
Approved by: https://github.com/Chillee
The original change was about 9.5% slower than before #122074 .
This improves it to be only about 1.4% slower.
Also touched up some unrelated nits that the linter complained about.
Fixes#126293
Ran torchbench 3 times on each change. Perf values before (stable), after (fix),
and with #122074 backed out (backout):
```
../inductor-tools/scripts/modelbench/inductor_single_run.sh single inference performance torchbench pyhpc_isoneutral_mixing amp first dynamic cpp
stable:
43.948x
45.754x
44.906x
fix:
47.505x
49.987x
47.493x
backout:
48.243x
48.199x
48.192x
../inductor-tools/scripts/modelbench/inductor_single_run.sh single inference performance torchbench pyhpc_equation_of_state amp first static default
stable:
15.224x
13.286x
15.354x
fix:
16.402x
16.370x
16.183x
backout:
16.554x
16.675x
16.787x
../inductor-tools/scripts/modelbench/inductor_single_run.sh single inference performance torchbench lennard_jones float32 first static default
stable:
1.712x
1.651x
1.640x
fix:
1.804x
1.798x
1.792x
backout:
1.864x
1.824x
1.836x
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126996
Approved by: https://github.com/jansel
**Summary**
Added all_reduce_coalesced tracing to CommDebugMode and added a test case to the test_comm_mode test suite.
**Test Plan**
pytest test/distributed/_tensor/debug/test_comm_mode.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127025
Approved by: https://github.com/XilunWu
PyTorch can't depend on `fbgemm_gpu` as a dependency because `fbgemm_gpu` already has a dependency on PyTorch. So this PR copy / pastes kernels from `fbgemm_gpu`:
* `dense_to_jagged_forward()` as CUDA registration for new ATen op `_padded_dense_to_jagged_forward()`
* `jagged_to_padded_dense_forward()` as CUDA registration for new ATen op `_jagged_to_padded_dense_forward()`
CPU impls for these new ATen ops will be added in a follow-up PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125946
Approved by: https://github.com/davidberard98
Summary:
Previously https://github.com/pytorch/pytorch/pull/124949 added the ability to disable forced specializations on dynamic shapes for export, keeping dynamism for complex guards instead of specializing, allowing unsoundness by having the user fail at runtime.
It avoided disabling one case: single-variable equality guards, where a variable is specified as dynamic but is solvable to a concrete value, suggesting the correct behavior is specialization. For example, the guard Eq(s0 // 4, 400) suggests s0 should specialize to 1600.
In debugging, some users (e.g. APS) would like to keep this dynamic, and defer to failing at runtime instead. This PR adds this, so now all forced specializations should be turned off. Mostly this should be used for debugging, since it produces unsoundness, and lets the user proceed with (probably) incorrect dynamism.
Test Plan: export tests
Differential Revision: D57698601
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126925
Approved by: https://github.com/angelayi
Do not inherit parser from common_utils
* I don't think we use any variables in run_test that depend on those, and I think all tests except doctests run in a subprocess, so they will parse the args in common_utils and set the variables. I don't think doctests want any of those variables?
* Parse known args, add the extra args as extra, and pass the extra ones along to the subprocess
* Removes the first instance of `--`
* I think I will miss run_test telling me if an arg is valid or not
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126709
Approved by: https://github.com/ZainRizvi, https://github.com/huydhn, https://github.com/Flamefire
Address the classes of user errors stemming from (possibly)
unintentional dynamic shapes usage or mismatch of configuration time and
run time data shapes/dtypes.
The goal is to ensure a clear error is raised rather than relying on some underlying
error to bubble up when a tensor shape is not compatible, or worse,
having a silent correctness issue.
**Classes of shape/dtype errors**
* (a) error is thrown within the stage-module forward code, but may be
hard to understand/trace back to an input issue
* (b) a silent correctness issue happens inside the stage-module forward, but the expected output shape is still produced
* (c) the stage-module produces an output that is locally correct, but not
matching the expectation of the following stage, leading to a hang or
correctness issue down the line
**How validation helps**
Input shape validation
- improves debugability of case (a)
- guards against case (b)
- only needed on first stage, since subsequent stages use pre-allocated recv
buffers that can't change shape/size even if they wanted to
Output shape validation
- guards against case (c)
Validation of first stage input and all stages' outputs inductively verifies all shapes
Shape/dtype are most critical as they literally affect the number of
bytes on the wire. Strides and other tensor properties may also (?)
matter, and the validation function can be adjusted accordingly if needed.
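A minimal sketch of the kind of check described above (names are illustrative, not the actual pipelining API):
```python
def validate_tensors(tensors, expected_metas, stage_idx, kind):
    """Compare runtime tensors against the shapes/dtypes recorded at configuration time."""
    if len(tensors) != len(expected_metas):
        raise RuntimeError(
            f"stage {stage_idx}: expected {len(expected_metas)} {kind} tensors, got {len(tensors)}"
        )
    for i, (t, meta) in enumerate(zip(tensors, expected_metas)):
        if t.shape != meta.shape or t.dtype != meta.dtype:
            raise RuntimeError(
                f"stage {stage_idx}: {kind}[{i}] is {tuple(t.shape)}/{t.dtype}, "
                f"expected {tuple(meta.shape)}/{meta.dtype}"
            )

# e.g. validate_tensors(args, stage_input_metas, stage_idx=0, kind="input")
```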
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126732
Approved by: https://github.com/kwen2501
Summary:
co-dev reland of https://github.com/pytorch/pytorch/pull/124520, which requires
the removal of some executorch tests.
Before this PR, we didn't check that types in a schema were valid. This
is because TorchScript treats unknown types as type variables.
This PR checks types in a schema for the TORCH_LIBRARY APIs. To do this,
we add an `allow_typevars` flag to parseSchema so that TorchScript can
use allow_typevars=True. We also add some error messages for common
mistakes (e.g. using int64_t or double in schema).
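For reference, schemas use the JIT type names rather than C++ types; a sketch of a valid definition (the library and op names are made up for illustration):
```python
import torch

lib = torch.library.Library("mylib_schema_demo", "DEF")
# OK: "Tensor", "int", "float" are valid schema types.
lib.define("scale(Tensor x, float alpha, int repeat) -> Tensor")
# Would now produce a clearer error: C++ types like int64_t/double are not schema types.
# lib.define("bad(Tensor x, int64_t n) -> Tensor")
```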
Test Plan: Wait for tests
Differential Revision: D57666659
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126861
Approved by: https://github.com/albanD
Recently we added the following warning, which is printed on every rank and makes the log a bit verbose.
This PR dedups certain logs that are identical across ranks and prints them only on head rank of each node.
Resolves https://github.com/pytorch/pytorch/issues/126275
=========================================
[rank0]:[W502 14:06:55.821964708 ProcessGroupNCCL.cpp:1113] WARNING: process group has NOT been destroyed before it is being destructed. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL data transfers have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4
[rank1]:[W502 14:06:57.994276972 ProcessGroupNCCL.cpp:1113] WARNING: process group has NOT been destroyed before it is being destructed. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL data transfers have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4
[rank2]:[W502 14:07:00.353013116 ProcessGroupNCCL.cpp:1113] WARNING: process group has NOT been destroyed before it is being destructed. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL data transfers have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4
[rank3]:[W502 14:07:02.515511670 ProcessGroupNCCL.cpp:1113] WARNING: process group has NOT been destroyed before it is being destructed. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL data transfers have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125432
Approved by: https://github.com/wconstab
Summary:
Initial commit for the TorchScript-to-ExportedProgram converter.
TODO:
- Improve TorchScript IR coverage
- Parameters and buffers should be owned by the output ExportedProgram
- Experiment on conditional op conversion
Test Plan: buck2 run mode/dev-nosan fbcode//caffe2/test:test_export -- -r TestConverter
Differential Revision: D57694784
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126920
Approved by: https://github.com/angelayi, https://github.com/tugsbayasgalan
This is a Meta-only tool that allows users to profile any Python function by annotating it with **strobelight** using the strobelight profiler.
Example:
```
import torch
# `strobelight` is the profiler decorator added by this PR (see
# torch/utils/strobelight/examples/cli_function_profiler_example.py for the full example).

def fn(x, y, z):
    return x * y + z

# use decorator with default profiler.
@strobelight()
@torch.compile()
def work():
    for i in range(100):
        for j in range(5):
            fn(torch.rand(j, j), torch.rand(j, j), torch.rand(j, j))

work()
```
test
```
python torch/utils/strobelight/examples/cli_function_profiler_example.py
strobelight_cli_function_profiler, line 274, 2024-05-20 11:05:41,513, INFO: strobelight run id is: -6222660165281106
strobelight_cli_function_profiler, line 276, 2024-05-20 11:06:08,318, INFO: strobelight profiling running
strobelight_cli_function_profiler, line 257, 2024-05-20 11:06:11,867, INFO: strobelight profiling stopped
strobelight_cli_function_profiler, line 237, 2024-05-20 11:06:16,164, INFO: Total samples: 2470
strobelight_cli_function_profiler, line 237, 2024-05-20 11:06:16,164, INFO: GraphProfiler (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/oiqmyltg
strobelight_cli_function_profiler, line 237, 2024-05-20 11:06:16,164, INFO: Icicle view (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/b10x92x0
strobelight_cli_function_profiler, line 274, 2024-05-20 11:06:18,476, INFO: strobelight run id is: -4112659701221677
strobelight_cli_function_profiler, line 276, 2024-05-20 11:06:45,096, INFO: strobelight profiling running
strobelight_cli_function_profiler, line 257, 2024-05-20 11:06:52,366, INFO: strobelight profiling stopped
strobelight_cli_function_profiler, line 237, 2024-05-20 11:06:56,222, INFO: Total samples: 1260
strobelight_cli_function_profiler, line 237, 2024-05-20 11:06:56,222, INFO: GraphProfiler (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/0yyx6el5
strobelight_cli_function_profiler, line 237, 2024-05-20 11:06:56,223, INFO: Icicle view (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/8m2by4ea
(base) [lsakka@devvm4561.ash0 /data/users/lsakka/pytorch/pytorch (strobelight2)]$ python torch/profiler/strobelight_cli_function_profiler_example.py
strobelight_cli_function_profiler, line 274, 2024-05-20 11:07:26,701, INFO: strobelight run id is: -2373009368202256
strobelight_cli_function_profiler, line 276, 2024-05-20 11:07:53,477, INFO: strobelight profiling running
strobelight_cli_function_profiler, line 257, 2024-05-20 11:07:56,827, INFO: strobelight profiling stopped
strobelight_cli_function_profiler, line 237, 2024-05-20 11:08:01,138, INFO: Total samples: 2372
strobelight_cli_function_profiler, line 237, 2024-05-20 11:08:01,138, INFO: GraphProfiler (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/dk797xg9
strobelight_cli_function_profiler, line 237, 2024-05-20 11:08:01,138, INFO: Icicle view (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/4w6c8vnm
strobelight_cli_function_profiler, line 274, 2024-05-20 11:08:03,235, INFO: strobelight run id is: -1919086123693716
strobelight_cli_function_profiler, line 276, 2024-05-20 11:08:29,848, INFO: strobelight profiling running
strobelight_cli_function_profiler, line 257, 2024-05-20 11:08:37,233, INFO: strobelight profiling stopped
strobelight_cli_function_profiler, line 237, 2024-05-20 11:08:41,138, INFO: Total samples: 1272
strobelight_cli_function_profiler, line 237, 2024-05-20 11:08:41,138, INFO: GraphProfiler (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/43r58aew
strobelight_cli_function_profiler, line 237, 2024-05-20 11:08:41,138, INFO: Icicle view (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/9g52onmw
(base) [lsakka@devvm4561.ash0 /data/users/lsakka/pytorch/pytorch (strobelight2)]$
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126693
Approved by: https://github.com/aorenste
Is this supposed to be bitwise identical? Wasn't sure how to interpret the comment but it seems to be giving mismatches like:
```
Mismatched elements: 1 / 2 (50.0%)
Greatest absolute difference: 4.6372413635253906e-05 at index (1,) (up to 1e-05 allowed)
Greatest relative difference: 3.4600801882334054e-05 at index (1,) (up to 1.3e-06 allowed)
To execute this test, run the following from the base repo dir:
python test/test_optim.py -k test_complex_2d_LBFGS_cpu_complex64
```
on Neoverse-N2 SBSA ARM CPUs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126358
Approved by: https://github.com/lezcano, https://github.com/janeyx99
This test shows that we could always set `reshard_after_forward=False` but manually insert calls to `module.reshard()` to implement the resharding after forward. This is useful for advanced PP schedules.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126892
Approved by: https://github.com/wanchaol
ghstack dependencies: #126887
Add `# mypy: disallow-untyped-defs` to scheduler.py and then fix the resulting fallout.
We probably should eventually add a new node between BaseSchedulerNode and all the non-FusedSchedulerNode types to indicate the split between nodes that have a valid `self.node` and ones that don't. That would cause a lot of the `assert self.node is not None` churn to go away - but was a bigger change because a lot of code makes assumptions about types that aren't reflected in the types themselves.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126656
Approved by: https://github.com/eellison
Summary:
1. Define an explicit `use_agent_store` on rdzv handlers. Handlers that set it to true can share the store.
2. Instead of agent coordinating master_add/master_port values, the logic is now encapsulated by a *rdzv_handler* where `RendezvousInfo` will have `RendezvousStoreInfo` object that handlers must return.
- Depending on the implementation they can either:
- point to an existing store (and are expected to set `use_agent_store` to true; see point 1). Client code will rely on the `TORCHELASTIC_USE_AGENT_STORE` env variable to know if the store is shared.
- build args that `torch.distributed.init_process_group` can bootstrap from by creating a new store.
Additional points:
- When TCPStore is shared, it should be wrapped in PrefixStore to qualify/scope namespace for other usecases.
- `next_rendezvous` signature changed to return instance of `RendezvousInfo` instead of a (store, rank, world_size) tuple for extensibility purposes.
Why:
- Reduce moving parts
- Easier to swap implementations
- Improve tractability
- Addressing perf/debug-ability will benefit all use cases
Test Plan: CI
Differential Revision: D57055235
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125743
Approved by: https://github.com/d4l3k
Rule is enforced by #126103.
The rule:
- If `torch.a.b` defines a public class `C` (i.e. to be exposed in torch API namespace), then `torch.a.b` must be a public path, i.e. no `_`.
- `torch.a.b` should ideally have an `__all__` that defines what should be imported from this file when it is imported.
- All other definitions in `torch.a.b` that you don't want to expose should have a `_` prefix.
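A sketch of a module laid out according to the rule (the path and names are illustrative):
```python
# torch/a/b.py -- a public path (no leading underscore in any component)
__all__ = ["C"]

class C:
    """Public class exposed in the torch API namespace."""

def _internal_helper():
    """Not re-exported; the underscore prefix keeps it out of the public API."""
```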
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126812
Approved by: https://github.com/wconstab
This PR requires a little justification, but let's start with what it does first:
1. When you have a 0d CPU scalar int64/float64 tensor input to a graph, we will preallocate a backed SymInt/SymFloat corresponding to what you would get if you call item() on this tensor. This means you can freely change your input to be a Python int/float or a Tensor with an item() call and end up with exactly the same level of expressivity (specifically, you can guard on the internal SymInt/SymFloat no matter what). By default, the source of the backed SymInt/SymFloat is `L['tensor'].item()`, but if you have promoted a float input into a Tensor, we will cancel out `torch.as_tensor(L['float']).item()` into just `L['float']`.
2. We switch wrap_symfloat to use this, instead of hand crafting the new SymNodeVariable. Everything works out, except that we carefully pass the item() result to tracked fakes (and not the fake Tensor argument)
OK, so why do this at all? There is some marginal benefit where now some item() calls on scalar inputs can be guarded on, but IMO this is a pretty marginal benefit, and if it was the only reason, I wouldn't do this. The real reason for this is that I need to be able to propagate fake tensors through the graphs that are produced by Dynamo, and if I am doing the old custom wrap_symfloat logic, there's no way I can do this, because ordinarily an item() call will cause an unbacked SymInt when I reallocate.
The other obvious way to solve the problem above is to make a HOP alternative to item() that "bakes in" the backed SymInt it's supposed to return. But this strategy seems more parsimonious, and it does have the marginal benefit I mentioned above. The main downside is that what I have to do next is make it so that when I run tensor computation, I also apply the equivalent operations to the SymInt/SymFloat as well. That's the next PR.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126245
Approved by: https://github.com/eellison
ghstack dependencies: #126637
Fixes#71398
Add `__reduce__` and `__setstate__` methods for `torch._C.Generator`.
`__reduce__` returns a tuple of 3 values:
1. `torch.Generator` itself.
2. A one-element tuple containing the `torch.device` to create the `Generator` with, since this cannot be changed after the object is created.
3. The state, a three-element tuple: the initial seed, the offset (or `None` if a CPU `Generator`), and the RNG state tensor.
`__setstate__` calls `manual_seed`, `set_offset` (if not `None`), and `set_state` on each respective element of the state.
Added test demonstrating successful reserialization with cpu and cuda `Generator`s.
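A minimal sketch of the round trip this enables:
```python
import pickle
import torch

g = torch.Generator(device="cpu")
g.manual_seed(42)

blob = pickle.dumps(g)   # __reduce__ captures (device,) and (seed, offset, state)
g2 = pickle.loads(blob)  # __setstate__ replays manual_seed/set_offset/set_state

assert torch.equal(g.get_state(), g2.get_state())
```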
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126271
Approved by: https://github.com/ezyang
If a user accesses an OpOverloadPacket, then creates a new OpOverload,
then uses the OpOverloadPacket, the new OpOverload never gets hit. This
is because OpOverloadPacket caches OpOverloads when it is constructed.
This PR fixes the problem by "refreshing" the OpOverloadPacket if a new
OpOverload gets constructed and the OpOverloadPacket exists.
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126863
Approved by: https://github.com/albanD
## Context
This stack prototypes automatic micro-pipelining of `all-gather -> matmul` and `matmul -> reduce-scatter` via Inductor. The idea originates from the paper [Overlap Communication with Dependent Computation via
Decomposition in Large Deep Learning Models](https://dl.acm.org/doi/pdf/10.1145/3567955.3567959). The implementation and some key optimizations are heavily influenced by @lw's implementation in xformers.
The stack contains several components:
- `ProcessGroupCudaP2P` - a thin wrapper around `ProcessGroupNCCL`. It in addition maintains a P2P workspace that enables SM-free, one-sided P2P communication which is needed for optimal micro-pipelining.
- `fused_all_gather_matmul` and `fused_matmul_reduce_scatter` dispatcher ops.
- Post-grad fx pass that detects `all-gather -> matmul` and `matmul -> reduce-scatter` and replaces them with the fused dispatcher ops.
To enable the prototype feature:
- Set the distributed backend to `cuda_p2p`.
- Set `torch._inductor.config._micro_pipeline_tp` to `True`.
*NOTE: the prototype sets nothing in stone w.r.t to each component's design. The purpose is to have a performant baseline with reasonable design on which each component can be further improved.*
## Benchmark
Setup:
- 8 x H100 (500W) + 3rd gen NVSwitch.
- Llama3 8B training w/ torchtitan.
- 8-way TP. Reduced the number of layers from 32 to 8 for benchmarking purpose.
Trace (baseline): https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_tmpjaz8zgx0
<img width="832" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/4addba77-5abc-4d2e-93ea-f68078587fe1">
Trace (w/ micro pipelining): https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_tmpn073b4wn
<img width="963" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/4f44e78d-8196-43ab-a1ea-27390f07e9d2">
## This PR
`ProcessGroupCudaP2P` is a thin wrapper around `ProcessGroupNCCL`. By default, it routes all collectives to the underlying `ProcessGroupNCCL`. In addition, `ProcessGroupCudaP2P` initializes a P2P workspace that allows direct GPU memory access among the members. The workspace can be used in Python to optimize intra-node communication patterns or to create custom intra-node collectives in CUDA.
`ProcessGroupCudaP2P` aims to bridge the gap where certain important patterns can be better optimized via fine-grained P2P memory access than with collectives in the latest version of NCCL. It is meant to complement NCCL rather than replacing it.
Usage:
```
# Using ProcessGroupCudaP2P
dist.init_process_group(backend="cuda_p2p", ...)
# Using ProcessGroupCudaP2P while specifying ProcessGroupCudaP2P.Options
pg_options = ProcessGroupCudaP2P.Options()
dist.init_process_group(backend="cuda_p2p", pg_options=pg_options, ...)
# Using ProcessGroupCudaP2P while specifying ProcessGroupNCCL.Options
pg_options = ProcessGroupNCCL.Options()
dist.init_process_group(backend="cuda_p2p", pg_options=pg_options, ...)
# Using ProcessGroupCudaP2P while specifying both
# ProcessGroupCudaP2P.Options and ProcessGroupNCCL.Options
pg_options = ProcessGroupCudaP2P.Options()
pg_options.nccl_options = ProcessGroupNCCL.Options()
dist.init_process_group(backend="cuda_p2p", pg_options=pg_options, ...)
# Down-casting the backend to access p2p buffers for cuda_p2p specific
# optimizations
if is_cuda_p2p_group(group):
backend = get_cuda_p2p_backend(group)
if required_p2p_buffer_size > backend.get_buffer_size():
# fallback
p2p_buffer = backend.get_p2p_buffer(...)
else:
# fallback
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122163
Approved by: https://github.com/wanchaol
As titled. I found that there are some issues in the eager-mode SAC where sometimes the recompute would pop from storage for ops that are missing; these ops are detach ops. So this PR refactors the two modes so that they always recompute ignored ops.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126751
Approved by: https://github.com/yf225
Summary:
Capture dtype in flight recorder.
Mismatched dtypes can lead to hangs.
Newly added logs for a job show a mismatching dtype for an op, which affects the data size, even though the sizes match and the dtype does not appear in the FR log.
We end up capturing the type as follows:
```
{'entries': [{'record_id': 0, 'pg_id': 0, 'process_group': ('0', 'default_pg'), 'collective_seq_id': 1, 'p2p_seq_id': 0, 'op_id': 1, 'profiling_name': 'nccl:all_reduce', 'time_created_ns': 1715989097552775261, 'duration_ms': 6.697696208953857, 'input_sizes': [[3, 4]], 'input_dtypes': [6], 'output_sizes': [[3, 4]], 'output_dtypes': [6], 'state': 'completed', 'time_discovered_started_ns': 1715989097593778240, 'time_discovered_completed_ns': 1715989097593778461, 'retired': True,
```
Notice the new fields:
input_dtypes: [6]
output_dtypes: [6]
Test Plan:
unit tests
Pull Request resolved: https://github.com/pytorch/pytorch/issues/126554
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126581
Approved by: https://github.com/wconstab
Fixes#123036
In unit test `DynamicShapesCudaWrapperCudaTests.test_scaled_dot_product_attention_cuda_dynamic_shapes_cuda_wrapper`, computed buffer buf3 is compiled to a fallback kernel `aoti_torch_cuda__scaled_dot_product_flash_attention`. It has 9 outputs whose types are `[MultiOutput, MultiOutput, None, None, s1, s1, MultiOutput, MultiOutput,MultiOutput]`. The type `s1` here is passed from [generate_output](acfe237a71/torch/_inductor/ir.py (L5658)).
The type check for Symbol is missing in fallback kernel output generation. This PR fixes the issue by checking `output.is_Symbol`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126797
Approved by: https://github.com/desertfire
Summary:
We found that the unit tests would hang in only one test,
linux-focal-cuda11.8-py3.9-gcc9 / test (multigpu, 1, 1,
linux.g5.12xlarge.nvidia.gpu),
in which DSA would still be raised, but somehow the process would produce
errors like:
P1369649418
Test Plan:
Run CI tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126701
Approved by: https://github.com/wconstab
ghstack dependencies: #126409
Summary: Instead of an explicit config for users to determine buffer mutation, we automatically detect whether there's buffer mutation in the model and determine which section the constants should be placed in. If the constants are too large and don't fit within the section, we error out directly.
Test Plan: Existing tests for buffer mutation and large weight linking
Differential Revision: D57579800
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126706
Approved by: https://github.com/desertfire
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/126449. For `ops.masked` in CPP backend, when input dtype is `bool`, we actually load it as `VecMask<float, N>`. So, we should unify the type of `other` and `mask` to the same as `VecMask<float, N>` to invoke `blendv` method.
**Test Plan**
```
clear && python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_ops_masked_with_bool_input
clear && PYTORCH_ALL_SAMPLES=1 python -u -m pytest -s -v test/inductor/test_torchinductor_opinfo.py -k test_comprehensive__chunk_cat_cpu_bool
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126662
Approved by: https://github.com/isuruf, https://github.com/jgong5, https://github.com/peterbell10
This file includes `from __future__ import annotations`, which interacts with `compile` by causing type annotations to be populated as strings. Triton does not parse the string annotations correctly. Avoid this behavior by passing `dont_inherit=True` to `compile`.
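A standalone sketch of the underlying `compile` behavior (independent of Triton):
```python
from __future__ import annotations

# Without dont_inherit, compile() would inherit this module's `annotations` future
# flag, and the generated function's annotations would be plain strings.
src = "def f(x: int) -> int:\n    return x\n"
ns = {}
exec(compile(src, "<generated>", "exec", dont_inherit=True), ns)
print(ns["f"].__annotations__)  # {'x': <class 'int'>, 'return': <class 'int'>}
```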
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126454
Approved by: https://github.com/peterbell10
We had a previous PR that added configs for an internal model. Running the below script on output from autotuning, we can prune back the added configs with negligible perf loss: P1365917790.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126570
Approved by: https://github.com/nmacchioni
Summary: This kernel is special-cased on ARM because it's important for LLMs, so let's have test coverage.
Test Plan: Ran locally and it passes. Intentionally broke fp16_gemv_trans and saw it fail, confirming it provides coverage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126700
Approved by: https://github.com/malfet
Found while getting scheduler.py to typecheck - split off to make reviewing easier.
1. is_template: I'm pretty sure this is a bug. Based on the definition of `is_template` I'm pretty sure we want to return the node's `get_template_node()`, not the node itself.
2. can_free: It seems that this was intended to be a raise, not a return.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126610
Approved by: https://github.com/eellison
fixes#126379
This is the easy fix. An additional fix that I did not do is to
deregister the excepthook (or rather, restore the orignal one) when
calling dist.destroy_process_group. This might be a bit complicated in
practice, so landing as is for now.
Also, couldn't figure out a clean way to test this. assertRaisesRegex
wasn't getting a string value, probably because of the stderr
redirection done via the excepthook in the first place.
Output from the original repro is cleaned up with the fix:
```
[rank0]: Traceback (most recent call last):
[rank0]: File "/data/users/whc/pytorch/except.py", line 6, in <module>
[rank0]: raise ZeroDivisionError
[rank0]: ZeroDivisionError
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126739
Approved by: https://github.com/yf225
Addresses follow up comments on #123992 and allows the use case of
writing code that checks `get_node_local_rank(fallback_rank=0)` and
runs correctly whether in the presence of a launcher (e.g. torchrun),
or run locally on a single device.
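A sketch of the pattern this enables (assuming the helper is exposed as `torch.distributed.get_node_local_rank`):
```python
import torch
import torch.distributed as dist

# Works under torchrun (where LOCAL_RANK is set) and also as a plain
# `python script.py` run, where the fallback kicks in.
local_rank = dist.get_node_local_rank(fallback_rank=0)
device = torch.device("cuda", local_rank) if torch.cuda.is_available() else torch.device("cpu")
print(f"using {device}")
```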
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126737
Approved by: https://github.com/shuqiangzhang
Summary:
Split out `seq_id` into `collective_seq_id` and `p2p_seq_id`. The main idea here is that collectives that go to all machines should have identical `collective_seq_id` and therefore it makes it easier to spot if one of machines isn't handling a collective operation.
Next, we can attempt to match up p2p operations to ensure that the sender(s)/receivers(s) are in sync.
Resolves issue: https://github.com/pytorch/pytorch/issues/125173
Test Plan:
Unit tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125727
Approved by: https://github.com/zdevito
Current tolerances fail on RTX 6000 (Ada) with `Mismatched elements: 2 / 144 (1.4%)`
```
AssertionError: Tensor-likes are not close!
Mismatched elements: 2 / 144 (1.4%)
Greatest absolute difference: 0.002197265625 at index (5, 0, 0) (up to 0.001 allowed)
Greatest relative difference: 0.08203125 at index (3, 0, 0) (up to 0.016 allowed)
To execute this test, run the following from the base repo dir:
python test/test_nestedtensor.py -k test_sdpa_with_packed_in_proj_cuda_bfloat16
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
----------------------------------------------------------------------
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126356
Approved by: https://github.com/drisspg
Currently, if an assertion is statically known to be false, Dynamo converts it to `_assert_async`, which Inductor currently ignores. This PR instead graph breaks to raise the original assertion.
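A small repro-style sketch of the user-visible effect (the function and input are made up for illustration):
```python
import torch

@torch.compile
def f(x):
    # Statically false for this input: with this change we graph break and the
    # original AssertionError is raised instead of being lost in an ignored _assert_async.
    assert x.shape[0] > 0, "expected a non-empty batch"
    return x + 1

f(torch.zeros(0, 3))  # raises AssertionError: expected a non-empty batch
```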
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126661
Approved by: https://github.com/ezyang
This adds logging that will mark any invocation of a matmul for particular input shapes and record every template config's performance on it. Then we can parse that with a script which will minimize the total mm execution time given N allowed templates, and in the future, run other experiments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126560
Approved by: https://github.com/nmacchioni, https://github.com/jansel
The padded dense -> jagged conversion op has the signature:
```
_fbgemm_dense_to_jagged_forward(Tensor dense, Tensor[] offsets, SymInt? total_L=None) -> Tensor
```
when `total_L` is not specified, the meta registration has a data-dependent output shape (based on `offsets[0][-1]`). Returning an unbacked SymInt here should work in theory, but traceable wrapper subclass support is missing in later code to handle deferred runtime asserts. This PR fixes this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126198
Approved by: https://github.com/ezyang
Summary:
Add an additional config to allow buffer mutation.
For data that's greater than 2GB, we would need to set it as read-only, otherwise overflow would occur.
This is a temporary solution since it won't handle cases that requires mutable data greater than 2GB.
Test Plan: Included in commit.
Differential Revision: D57514729
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126584
Approved by: https://github.com/chenyang78
I wasn't paying enough attention and didn't notice that LOAD_DEREF is
defined differently for InliningInstructionTranslator. Match it up with
the code there.
This also fixes comptime.print(), which was broken, because closing over
an argument turned it into a cell rather than a regular local.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126637
Approved by: https://github.com/yanboliang
Fixes the error in non-strict export when we're tracing a module that initializes another module in its forward function. This appears in [many huggingface models](https://github.com/search?q=repo%3Ahuggingface%2Ftransformers+CrossEntropyLoss%28%29&type=code&fbclid=IwAR285uKvSevJM6SDbXmb4-monj4iH7wf8opkvnec-li7sKpn4lUMjIvbGKc). It's probably not good practice to do this, but since it appears in so many places, and strict-export supports this, we will also support this.
The approach we'll take for these cases is that we will inline the call to the module. Parameters and buffers initialized as constants (with `torch.tensor`) will be represented as constant tensors, and those initialized with tensor factory functions (`torch.ones`) will show up as an operator in the graph. The module stack for the ops in the inlined module will reflect the toplevel's module stack.
One issue is that strict-export seems to segfault when there is an `nn.Parameter` call in the constructor (https://github.com/pytorch/pytorch/issues/126109). Non-strict export will succeed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125725
Approved by: https://github.com/ydwu4
# Context
Here's a peripheral scenario causing the JIT-pass and AOT-pass to pick different fusions.
```py
# JIT -- buf3 is a MultiTemplateBuffer
V.graph.buffers = [buf0, buf1, buf2, buf3, buf4]
^ ^
# JIT pass calls finalize_multi_template_buffers()
V.graph.buffers = [buf0, buf1, buf2, buf4, *buf3*]
# AOT, note proximity_score(buf2, buf4) is "better" for fusion than JIT
V.graph.buffers = [buf0, buf1, buf2, buf4, *buf3*]
^ ^
```
It happens like this:
* JIT starts with the original set of nodes using V.graph.buffers
* In JIT, finalize_multi_template_buffers() is called which can change the order of the buffers.
* This makes the order of buffers/scheduler nodes different.
* Now, each node's min/max-order is different than before.
* As a result, the proximity between two nodes is different. ad67553c5c/torch/_inductor/scheduler.py (L2316-L2335)
# Error
```
$ TORCH_LOGS="+fusion" python test/inductor/test_max_autotune.py -k test_jit_fusion_matches_aot_fusion
======================================================================
FAIL: test_jit_fusion_matches_aot_fusion (__main__.TestMaxAutotune)
----------------------------------------------------------------------
Traceback (most recent call last):
...
File "/data/users/colinpeppler/pytorch/torch/_inductor/graph.py", line 1718, in compile_to_fn
code, linemap = self.codegen_with_cpp_wrapper()
File "/data/users/colinpeppler/pytorch/torch/_inductor/graph.py", line 1618, in codegen_with_cpp_wrapper
return self.codegen()
File "/data/users/colinpeppler/pytorch/torch/_inductor/graph.py", line 1636, in codegen
self.scheduler.codegen()
File "/data/users/colinpeppler/pytorch/torch/_dynamo/utils.py", line 210, in time_wrapper
r = func(*args, **kwargs)
File "/data/users/colinpeppler/pytorch/torch/_inductor/scheduler.py", line 2602, in codegen
self.get_backend(device).codegen_node(node) # type: ignore[possibly-undefined]
File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/cuda_combined_scheduling.py", line 66, in codegen_node
return self._triton_scheduling.codegen_node(node)
File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 3377, in codegen_node
return self.codegen_node_schedule(node_schedule, buf_accesses, numel, rnumel)
File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 3602, in codegen_node_schedule
final_kernel.call_kernel(final_kernel.kernel_name)
File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 3055, in call_kernel
grid = wrapper.generate_default_grid(name, grid)
File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/cpp_wrapper_cuda.py", line 174, in generate_default_grid
params is not None
AssertionError: cuda kernel parameters for triton_poi_fused_add_0 should already exist at this moment, only found dict_keys(['Placeholder.DESCRIPTIVE_NAME', 'triton_poi_fused_add_mul_0', 'triton_poi_fused_pow_1'])
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126622
Approved by: https://github.com/chenyang78
ghstack dependencies: #125982
The XLA build job uses a Docker image from XLA, which doesn't have sccache installed. The XLA build job just builds PyTorch; XLA itself gets built during the test job. The PyTorch build was taking 1+ hrs; with a warm cache it takes <30 min.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126117
Approved by: https://github.com/malfet
By dispatching it to multiple threads and using a vectorized dot operation (with fp16-to-fp32 upcasts via left shift), this bumps stories110M eval from 22 to 55 tokens/sec using bfloat16.
TODO:
- Refactor tinygemm template and use it here
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126592
Approved by: https://github.com/mikekgfb
`tl.associative_scan` supports non-commutative combine functions but `tl.reduce`
doesn't. This affects non-persistent scans, where we use the reduction from
the previous loop iterations as the base for future iterations.
Here I work around this by taking the last element of the scan output and using
that as the reduced value. This is done using a trick where we create a
mask that is 1 at the desired element and 0 elsewhere, then sum over it.
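A plain-PyTorch sketch of the mask-and-sum trick (the real implementation emits Triton; this only illustrates the idea):
```python
import torch

block = torch.tensor([3.0, 1.0, 4.0, 1.0])
scanned = torch.cumsum(block, dim=0)  # stand-in for the tl.associative_scan output

idx = torch.arange(scanned.numel())
mask = (idx == scanned.numel() - 1).to(scanned.dtype)  # 1 at the last element, 0 elsewhere
carry = (scanned * mask).sum()                         # equals scanned[-1]; usable as the next base
assert carry == scanned[-1]
```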
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126633
Approved by: https://github.com/Chillee, https://github.com/lezcano
This PR is meant to address issue #123451, more specifically, the ```test_graph_optims``` and ```test_graph_scaling_fused_optimizers``` functions in ```test_cuda.py``` have been updated so that they now use the new OptimizerInfo infrastructure.
Lintrunner passed:
```
$ lintrunner test/test_cuda.py
ok No lint issues.
```
Tests passed:
```
>python test_cuda.py -k test_graph_optims
Ran 19 tests in 7.463s
OK (skipped=9)
>python test_cuda.py -k test_graph_scaling_fused_optimizers
Ran 6 tests in 2.800s
OK (skipped=3)
```
Both the functions have been moved to the newly created TestCase class ```TestCudaOptims```. The test is mostly the same except the ```@optims``` decorator is used at the top of the function to implicitly call the function using each of the optimizers mentioned in the decorator instead of explicitly using a for loop to iterate through each of the optimizers.
I was unable to use the ```_get_optim_inputs_including_global_cliquey_kwargs``` to get all kwargs for each of the optimizers since some of the kwargs that are used in the original ```test_graph_optims``` function are not being returned by the new OptimizerInfo infrastructure, more specifically, for the ```torch.optim.rmsprop.RMSprop``` optimizer, the following kwargs are not returned whenever ```_get_optim_inputs_including_global_cliquey_kwargs``` is called:
```
{'foreach': False, 'maximize': True, 'weight_decay': 0}
{ 'foreach': True, 'maximize': True, 'weight_decay': 0}
```
I ran into the same issue for ```test_graph_scaling_fused_optimizers```, for the ```torch.optim.adamw.AdamW``` optimizer, whenever ```optim_info.optim_inputs_func(device=device)``` was called, the following kwarg was not returned:
```
{'amsgrad': True}
```
Due to this issue, I resorted to using a dictionary to store the kwargs for each of the optimizers, I am aware that this is less than ideal. I was wondering whether I should use the OptimizerInfo infrastructure to get all the kwargs regardless of the fact that it lacks some kwargs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125127
Approved by: https://github.com/janeyx99
- log only first node key cache miss
- log existing node key sizes
- log which node's collected sizes became dynamic
e.g.
```
DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to new autograd node: torch::autograd::GraphRoot (NodeCall 0) with key size 39, previous key sizes=[]
...
DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to new autograd node: torch::autograd::AccumulateGrad (NodeCall 5) with key size 32, previous key sizes=[21]
...
DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to dynamic shapes: collected size idx 0 of torch::autograd::GraphRoot (NodeCall 0)
DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to dynamic shapes: collected size idx 2 of SumBackward0 (NodeCall 1)
DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to dynamic shapes: collected size idx 4 of SumBackward0 (NodeCall 1)
DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to dynamic shapes: collected size idx 2 of ReluBackward0 (NodeCall 2)
DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to dynamic shapes: collected size idx 9 of AddmmBackward0 (NodeCall 3)
DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to dynamic shapes: collected size idx 2 of torch::autograd::AccumulateGrad (NodeCall 5)
DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to dynamic shapes: collected size idx 2 of ReluBackward0 (NodeCall 6)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126602
Approved by: https://github.com/jansel
ghstack dependencies: #126144, #126146, #126148, #126483
To remove the disruptive warning:
```
warnings.warn("torch.library.impl_abstract was renamed to "
"torch.library.register_fake. Please use that instead; "
"we will remove torch.library.impl_abstract in a future "
"version of PyTorch.",
DeprecationWarning, stacklevel=2)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126606
Approved by: https://github.com/ezyang
Can't repro this regression, and nothing in the faulty PR range would cause it for only one model. The job is still causing noise, so we should mute it. I think just updating the graph break count is better than skipping the model here, since it's still passing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126414
Approved by: https://github.com/ezyang
Summary: TSIA. The two look the same to me, but buck was failing with the following error when `with torch._inductor.utils.fresh_inductor_cache()` is used:
```
_________________________ ReproTests.test_issue126128 __________________________
self = <caffe2.test.dynamo.test_repros.ReproTests testMethod=test_issue126128>
def test_issue126128(self):
def fn():
x = torch.randn(1, 10)
y = torch.randn(10, 1)
return torch.mm(x, y).sum()
def fn2():
x = torch.randn(10, 100)
y = torch.randn(100, 10)
return torch.mm(x, y).sum()
> with torch._inductor.utils.fresh_inductor_cache():
E AttributeError: module 'torch._inductor' has no attribute 'utils'
```
Test Plan: `buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_repros.py::ReproTests::test_issue126128'`
Differential Revision: D57516676
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126596
Approved by: https://github.com/xmfan
Fixes #123451 (only addresses test_torch.py cases)
This PR solves the specific task to update `test_grad_scaling_autocast` and `test_params_invalidated_with_grads_invalidated_between_unscale_and_step` in `test/test_torch.py` to use the new OptimizerInfo infrastructure.
I have combined tests that call `_grad_scaling_autocast_test` into one called `test_grad_scaling_autocast` and used `_get_optim_inputs_including_global_cliquey_kwargs` to avoid hard-coded configurations.
```
$ lintrunner test/test_cuda.py
ok No lint issues.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125538
Approved by: https://github.com/janeyx99
As discussed with Andrew before, under compile we will register per-tensor backward hook instead of multi-grad hook, because it's difficult for Dynamo to support `register_multi_grad_hook` (or anything `.grad_fn` related). We expect both to have the same underlying behavior, ~~and we will add integration test (in subsequent PR) to show that compile and eager has same numerics.~~
As discussed below, we will change the eager path to use per-tensor backward hooks as well.
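For illustration, a small standalone comparison of the two hook styles mentioned above; this is generic autograd API usage with toy tensors, not the FSDP code itself:
```python
import torch

a = torch.randn(3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

# Multi-grad hook: fires once, after the grads of all listed tensors are ready.
torch.autograd.graph.register_multi_grad_hook(
    (a, b), lambda grads: print("all grads ready:", [g.shape for g in grads])
)

# Per-tensor hooks: one hook per tensor, which is easier for Dynamo to trace.
for t in (a, b):
    t.register_hook(lambda grad: print("one grad ready:", grad.shape))

(a * b).sum().backward()
```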
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126350
Approved by: https://github.com/awgu
Add scalar information to the kernel configuration.
#### Additional Context
Currently, the input parameters are orchestrated by input order in the kernel configuration and loaded/mapped to the kernel at runtime. For example, the cache order of the input parameters of `torch.add(a, b, alpha=2.0)` is `a` first, followed by `b` and then `alpha`. The same order is used for cache loading.
However, the orchestration mechanism does not support kwargs, because kwargs are not positional and their order carries no meaning. For example, the `out` of `aten::gelu.out(Tensor self, *, str approximate='none', Tensor(a!) out) -> Tensor(a!)` may come before `approximate`. We will support this in subsequent PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124070
Approved by: https://github.com/jansel, https://github.com/jgong5
Previously, we made a copy of `torch.export.unflatten` in pippy/_unflatten.py.
But it turned out to be too hard to track bug fixes and improvements in the upstream version. For example, `torch.export.unflatten` recently added support for tied parameters, which is something pipelining needs.
Now that we have moved into PyTorch, we reference `torch.export.unflatten` directly instead of maintaining a copy.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126217
Approved by: https://github.com/H-Huang
Summary:
Added USE_LITE_AOTI cmake flag, which is turned OFF by default.
When it is turned on, the AOTI sources (inductor_core_resources) are included when building the lite interpreter.
Test Plan:
```
ANDROID_ABI=arm64-v8a ./scripts/build_android.sh -DUSE_LITE_AOTI=ON
```
Differential Revision: D57394078
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126572
Approved by: https://github.com/malfet
Summary:
Tool for scouting exportability issues in one shot.
- Collect sample inputs for all submodules by running eager inference with forward_pre_hook.
- Starting from the root module, recursively try exporting child modules whenever the current module fails to export.
Limitations:
- Only works for an nn.Module that contains a tree-like submodule structure; this doesn't work for a flattened GraphModule.
TODO: support dynamic_dims
Sample output: https://docs.google.com/spreadsheets/d/1jnixrqBTYbWO_y6AaKA13XqOZmeB1MQAMuWL30dGoOg/edit?usp=sharing
```
exportability_report =
{
'': UnsupportedOperatorException(func=<OpOverload(op='testlib.op_missing_meta', overload='default')>),
'submod_1': UnsupportedOperatorException(func=<OpOverload(op='testlib.op_missing_meta', overload='default')>),
'submod_2': None
}
```
Test Plan: buck2 run mode/dev-nosan fbcode//caffe2/test:test_export -- -r TestExportTools
Differential Revision: D57466486
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126471
Approved by: https://github.com/zhxchen17
e.g. dist_ddp -> ddp
The 'distributed' shortcut remains unchanged.
Feedback has been that it is not appealing to have the dist_ prefix,
and the main reason for it was to keep the distributed shortcuts grouped
together in the help menu. It's nice to have shorter shortcuts.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126499
Approved by: https://github.com/XilunWu, https://github.com/kwen2501
ghstack dependencies: #126322
Fixes #121188
Prevent Segmentation Fault in 'torch._C._nn.thnn_conv2d'
Previously, calling 'torch._C._nn.thnn_conv2d' with invalid arguments for padding, stride, and kernel_size would result in a segmentation fault. This issue has been resolved by implementing argument validation (using TORCH_CHECK). Now, when invalid arguments are detected, a runtime error is raised with a debug message detailing the correct format.
Additionally, this commit includes tests to cover the three referenced cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121906
Approved by: https://github.com/janeyx99
The `compile` + `exec` workflow is susceptible to behavior drifting from a "normal" import; use importlib instead to avoid this.
In particular, here annotations were being stored as strings due to `from __future__ import annotations` in the scope calling `compile`.
Triton cares about annotations on global variables, and this makes it much easier to reliably code-gen them.
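A standalone illustration of the drift (pure Python, not the Inductor codegen itself): `compile()` inherits future statements from the calling code by default, so annotations in the generated code become strings here, which importlib-based loading avoids.
```python
from __future__ import annotations

src = "x: int = 1"

inherited = {}
exec(compile(src, "<generated>", "exec"), inherited)
print(inherited["__annotations__"]["x"])   # 'int' -- stored as a string

isolated = {}
exec(compile(src, "<generated>", "exec", flags=0, dont_inherit=True), isolated)
print(isolated["__annotations__"]["x"])    # <class 'int'>
```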
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126454
Approved by: https://github.com/peterbell10
Reopened due to a rebase error. Fixes https://github.com/pytorch/pytorch/issues/117599
The reported hanging test, `test_cuda.py::TestCuda::test_grad_scaling_autocast_fused_optimizers`, passes with this PR.
The HSA async copy / host wait on the completion signal is resolved in MultiTensorApply.cuh:
```
:4:command.cpp :347 : 8881368803196 us: [pid:1268211 tid:0x7f5af80d7180] Command (InternalMarker) enqueued: 0xc4e2070
:4:rocvirtual.cpp :556 : 8881368803201 us: [pid:1268211 tid:0x7f5af80d7180] Host wait on completion_signal=0x7f5967df3e00
:3:rocvirtual.hpp :66 : 8881368803209 us: [pid:1268211 tid:0x7f5af80d7180] Host active wait for Signal = (0x7f5967df3e00) for -1 ns
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125456
Approved by: https://github.com/jeffdaily, https://github.com/eqy, https://github.com/janeyx99
Add Execution Trace communication collective metadata.
For specification see https://github.com/pytorch/pytorch/issues/124674
New fields look like
```
{
"id": 80, "name": "record_param_comms", "ctrl_deps": 79,
"inputs": {"values": [[[78,74,0,100,4,"cuda:0"]],21,["0","default_pg"],0,"allreduce",[],[],0,1,2], "shapes": [[[100]],[],[[],[]],[],[],[],[],[],[],[]], "types": ["GenericList[Tensor(float)]","Int","Tuple[String,String]","Int","String","GenericList[]","GenericList[]","Int","Int","Int"]}, "outputs": {"values": [[[78,74,0,100,4,"cuda:0"]]], "shapes": [[[100]]], "types": ["GenericList[Tensor(float)]"]},
"attrs": [{"name": "rf_id", "type": "uint64", "value": 53},{"name": "fw_parent", "type": "uint64", "value": 0},{"name": "seq_id", "type": "int64", "value": -1},{"name": "scope", "type": "uint64", "value": 0},{"name": "tid", "type": "uint64", "value": 2},{"name": "fw_tid", "type": "uint64", "value": 0},{"name": "op_schema", "type": "string", "value": ""},{"name": "kernel_backend", "type": "string", "value": ""},{"name": "kernel_file", "type": "string", "value": ""},
{"name": "collective_name", "type": "string", "value": "allreduce"},
{"name": "dtype", "type": "string", "value": "Float"},
{"name": "in_msg_nelems", "type": "uint64", "value": 100},
{"name": "out_msg_nelems", "type": "uint64", "value": 100},
{"name": "in_split_size", "type": "string", "value": "[]"},
{"name": "out_split_size", "type": "string", "value": "[]"},
{"name": "global_rank_start", "type": "uint64", "value": 0},
{"name": "global_rank_stride", "type": "uint64", "value": 1},
{"name": "pg_name", "type": "string", "value": "0"},
{"name": "pg_desc", "type": "string", "value": "default_pg"},
{"name": "pg_size", "type": "uint64", "value": 2}]
}
```
## Unit Test
Added a new unit test to check that the collected execution trace has the right attributes:
`touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" python test/distributed/test_distributed_spawn.py -v TestDistBackendWithSpawn.test_ddp_profiling_execution_trace`
```
STAGE:2024-05-08 17:39:10 62892:62892 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
STAGE:2024-05-08 17:39:10 62893:62893 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
STAGE:2024-05-08 17:39:11 62892:62892 ActivityProfilerController.cpp:322] Completed Stage: Collection
STAGE:2024-05-08 17:39:11 62893:62893 ActivityProfilerController.cpp:322] Completed Stage: Collection
STAGE:2024-05-08 17:39:11 62892:62892 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
STAGE:2024-05-08 17:39:11 62893:62893 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
[rank1]:[W508 17:39:12.329544411 reducer.cpp:1399] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model
indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[rank0]:[W508 17:39:12.329626774 reducer.cpp:1399] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model
indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[rank0]:[W508 17:39:12.339239982 execution_trace_observer.cpp:825] Enabling Execution Trace Observer
[rank1]:[W508 17:39:12.339364516 execution_trace_observer.cpp:825] Enabling Execution Trace Observer
STAGE:2024-05-08 17:39:12 62892:62892 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
STAGE:2024-05-08 17:39:12 62893:62893 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
[rank1]:[W508 17:39:12.352452400 execution_trace_observer.cpp:837] Disabling Execution Trace Observer
STAGE:2024-05-08 17:39:12 62893:62893 ActivityProfilerController.cpp:322] Completed Stage: Collection
[rank0]:[W508 17:39:12.354019014 execution_trace_observer.cpp:837] Disabling Execution Trace Observer
STAGE:2024-05-08 17:39:12 62893:62893 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
STAGE:2024-05-08 17:39:12 62892:62892 ActivityProfilerController.cpp:322] Completed Stage: Collection
STAGE:2024-05-08 17:39:12 62892:62892 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
Execution trace saved at /tmp/tmpy01ngc3w.et.json
Execution trace saved at /tmp/tmptf8543k4.et.json
ok
----------------------------------------------------------------------
```
Also ran the profiler unit test:
`touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" python test/distributed/test_distributed_spawn.py -v TestDistBackendWithSpawn.test_ddp_profiling_torch_profiler`
```
STAGE:2024-05-08 18:24:22 1926775:1926775 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
STAGE:2024-05-08 18:24:22 1926774:1926774 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
STAGE:2024-05-08 18:24:24 1926774:1926774 ActivityProfilerController.cpp:322] Completed Stage: Collection
STAGE:2024-05-08 18:24:24 1926775:1926775 ActivityProfilerController.cpp:322] Completed Stage: Collection
STAGE:2024-05-08 18:24:24 1926774:1926774 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
STAGE:2024-05-08 18:24:24 1926775:1926775 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
[rank1]:[W508 18:24:24.508622236 reducer.cpp:1399] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[rank0]:[W508 18:24:24.508622241 reducer.cpp:1399] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
STAGE:2024-05-08 18:24:24 1926774:1926774 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
STAGE:2024-05-08 18:24:24 1926775:1926775 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
STAGE:2024-05-08 18:24:24 1926774:1926774 ActivityProfilerController.cpp:322] Completed Stage: Collection
STAGE:2024-05-08 18:24:24 1926775:1926775 ActivityProfilerController.cpp:322] Completed Stage: Collection
STAGE:2024-05-08 18:24:24 1926774:1926774 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
STAGE:2024-05-08 18:24:24 1926775:1926775 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
Trace saved to /tmp/tmpdrw_cmcu.json
Trace saved to /tmp/tmpnio7ec9j.json
ok
----------------------------------------------------------------------
Ran 1 test in 19.772s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126317
Approved by: https://github.com/yoyoyocmu, https://github.com/sanrise
Summary: The PT2E quantization flow does not support unquantized
outputs yet. To work around this, users may wish to remove the
output observer from their graphs. However, this fails currently
in some cases because the `PortNodeMetaForQDQ` pass is too
restrictive, for example:
```
conv -> obs -------> output0
\\-> add -> output1
```
Previously we expected conv to always have exactly 1 user,
which is the observer. When the observer is removed, however,
conv now has 2 users, and this fails the check.
```
conv -------> output0
\\-> add -> output1
```
This commit relaxes the error into a warning to enable
this workaround.
Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_multi_users_without_output_observer
Reviewers: jerryzh168
Subscribers: jerryzh168, supriyar
Differential Revision: [D57472601](https://our.internmc.facebook.com/intern/diff/D57472601)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126487
Approved by: https://github.com/tarun292
#### Conditions for allowlisting tensor subclasses
We allow tensor subclass types that
(1) Do not override `__setstate__`, `__getattr__`, `__setattr__`, `__get__`, `__set__` or `__getattribute__` of `torch.Tensor` (`torch.Tensor` does not have a definition of `__getattr__`, `__get__` or `__set__` so we check that these are `None`)
(2) Use the generic `tp_alloc`
(3) Are in a module that *has been imported by the user*
to be pushed onto the stack as strings by `GLOBAL` instructions, while storing the type in a dict
The strings will be converted to the classes as appropriate when executing `REBUILD` with `_rebuild_from_type_v2`
Note that we use `inspect.getattr_static(sys.modules[module], name)` to get the class/function, as this method claims to perform no code execution.
The rationale for the 3 conditions above is as follows:
The rebuild func provided by `Tensor.__reduce_ex__` is `torch._tensor._rebuild_from_type_v2`, which is defined as such (note the call to `getattr`, `Tensor.__setstate__` and the call to `as_subclass` as well as the call to `_set_obj_state` which calls `setattr`)
4e66aaa010/torch/_tensor.py (L57-L71)
`as_subclass` is implemented with a call to `THPVariable_NewWithVar`
that will eventually call `tp_alloc` here
4e66aaa010/torch/csrc/autograd/python_variable.cpp (L2053)
The `func` arg to `_rebuild_from_type_v2` for wrapper subclasses is `Tensor.rebuild_wrapper_subclass`, which will similarly call into `THPVariable_NewWithVar` and hit the above `tp_alloc`
**Note that we do not call `tp_init` or `tp_new` (i.e. `cls.__init__` or `cls.__new__`) when unpickling**
### How do we check something is a tensor subclass/constraints around imports
In order to check whether `bla` is a tensor subclass in the bytecode `GLOBAL module.name`, we need to do an `issubclass` check, which entails converting the global string to the appropriate type. We *do not* arbitrarily import modules, but will perform this check as long as the given subclass (given by `module.name`) has already been imported by the user (i.e. `module in sys.modules` and `issubclass(getattr(sys.modules[module], name), torch.Tensor)`).
This PR also allowlisted `torch._utils._rebuild_wrapper_subclass` and `torch.device` (used by `_rebuild_wrapper_subclass`)
### API for allow listing
This PR also added `torch.serialization.{add/get/clear}_safe_globals`, which enables users to allowlist globals they have deemed safe and to manipulate this list (for example, they could allowlist a tensor subclass with a custom `__setstate__` if they have checked that it is safe).
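A hedged usage sketch of the new allowlisting API; the subclass here is a toy example, not one of the in-core subclasses:
```python
import torch
from torch.serialization import add_safe_globals, clear_safe_globals, get_safe_globals

class MyTensor(torch.Tensor):
    pass

add_safe_globals([MyTensor])   # declare MyTensor safe to load with weights_only
print(get_safe_globals())      # currently allowlisted globals
clear_safe_globals()           # reset the allowlist
```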
Next steps:
- Add testing and allowlist required classes for all in-core tensor subclasses (e.g. `DTensor`, `FakeTensor` etc.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124331
Approved by: https://github.com/albanD
This is a kind of experiment for uploading test stats during the run, and also for the test dashboard so it can recalculate the info.
Adds a workflow, callable via workflow dispatch, for uploading additional test stats.
Adds a script that only calculates the additional info.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126080
Approved by: https://github.com/ZainRizvi
```
$ INDUCTOR_TEST_DISABLE_FRESH_CACHE=1 python test/inductor/test_unbacked_symints.py -k test_vertical_pointwise_reduction_fusion
File "/data/users/colinpeppler/pytorch/torch/_inductor/scheduler.py", line 1953, in fuse_nodes_once
for node1, node2 in self.get_possible_fusions():
File "/data/users/colinpeppler/pytorch/torch/_inductor/scheduler.py", line 2010, in get_possible_fusions
check_all_pairs(node_grouping)
File "/data/users/colinpeppler/pytorch/torch/_inductor/scheduler.py", line 1997, in check_all_pairs
if self.can_fuse(node1, node2):
File "/data/users/colinpeppler/pytorch/torch/_inductor/scheduler.py", line 2252, in can_fuse
return self.get_backend(device).can_fuse_vertical(node1, node2)
File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/cuda_combined_scheduling.py", line 39, in can_fuse_vertical
return self._triton_scheduling.can_fuse_vertical(node1, node2)
File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 3237, in can_fuse
if not all(
File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 3238, in <genexpr>
TritonKernel.is_compatible((numel2, rnumel2), n.get_ranges())
File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 1543, in is_compatible
cls._split_iteration_ranges(groups, lengths)
File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 1507, in _split_iteration_ranges
while current_group < len(remaining) and sv.size_hint(remaining[current_group]) == 1:
File "/data/users/colinpeppler/pytorch/torch/_inductor/sizevars.py", line 442, in size_hint
return int(out)
File "/home/colinpeppler/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/core/expr.py", line 320, in __int__
raise TypeError("Cannot convert symbols to int")
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
TypeError: Cannot convert symbols to int
```
Where the unbacked symints show up:
```
> /data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py(1506)_split_iteration_ranges()
(Pdb) print(groups)
(1, 512*u0)
(Pdb) print(lengths)
([u0, 32, 16], [])
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125982
Approved by: https://github.com/jansel
Previously, we would default to the config `compile_threads`. That controls the number of forks we use for async compile. It defaults to 1 in fbcode because fork() has known safety issues. In precompilation, we are using threads, which have no safety issues and should strictly improve compile time. There isn't really any reason to reduce the thread count except for testing, and it doesn't make sense to share the same value used for determining forks.
This change defaults it to using as many threads as needed, unless the env variable is set.
Differential Revision: [D57473023](https://our.internmc.facebook.com/intern/diff/D57473023)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126333
Approved by: https://github.com/nmacchioni
Add a switch to change the GEMM autotuning search space between the default (the current set of hardcoded configs) and an exhaustive search space that enumerates all block sizes in [16, 32, 64, 128, 256], stages in [1, 2, 3, 4, 5], and warps in [2, 4, 6].
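A hedged sketch of flipping the switch; the config name below is an assumption about how the switch is exposed as an Inductor config, not confirmed by this summary:
```python
import torch._inductor.config as inductor_config

inductor_config.max_autotune = True
# Assumed knob name; "DEFAULT" keeps the hardcoded configs.
inductor_config.max_autotune_gemm_search_space = "EXHAUSTIVE"
```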
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126220
Approved by: https://github.com/eellison
**Description**
Lower the qlinear binary post op pattern to Inductor. Use post op sum (in-place) if the extra input has the same dtype as output. Otherwise, it uses binary add.
**Supported linear-binary(-unary) patterns**
```
linear(X) extra input
\ /
Add
|
Optional(relu)
|
Y
1. int8-mixed-fp32
+---+---------------+-----------+------------------------------+---------+
| # | Add type | Quant out | Pattern | Post op |
+---+---------------+-----------+------------------------------+---------+
| 1 | In-/out-place | Yes | linear + fp32 -> (relu) -> q | add |
+---+---------------+-----------+------------------------------+---------+
| 2 | In-/out-place | No | linear + fp32 -> (relu) | sum |
+---+---------------+-----------+------------------------------+---------+
2. int8-mixed-bf16
+---+----------+---------------+-----------+--------------------------------------------------+---------+
| # | X2 dtype | Add type | Quant out | Pattern | Post op |
+---+----------+---------------+-----------+--------------------------------------------------+---------+
| 1 | BF16 | In-/out-place | Yes | linear + bf16 -> (relu) -> to_fp32 -> q | add |
+---+----------+---------------+-----------+--------------------------------------------------+---------+
| 2 | BF16 | In-/out-place | No | linear + bf16 -> (relu) | sum |
+---+----------+---------------+-----------+--------------------------------------------------+---------+
| 3 | FP32 | Out-place | Yes | linear + fp32 -> (relu) -> q | add |
| | | In-place right| | | |
+---+----------+---------------+-----------+--------------------------------------------------+---------+
| 4 | FP32 | Out-place | No | linear + fp32 -> (relu) | sum |
| | | In-place right| | | |
+---+----------+---------------+-----------+--------------------------------------------------+---------+
| 5 | FP32 | In-place left | Yes | linear + fp32 -> to_bf16 -> relu -> to_fp32 -> q | add |
+---+----------+---------------+-----------+--------------------------------------------------+---------+
| 6 | FP32 | In-place left | No | linear + fp32 -> to_bf16 -> (relu) | add |
+---+----------+---------------+-----------+--------------------------------------------------+---------+
```
Note
(1) The positions of linear and the extra input can be swapped.
(2) we don't insert q-dq before the extra input of linear-add by recipe. But if q-dq is found at the
extra input, we don't match that pattern because we cannot match all these patterns in 3 passes.
**Test plan**
python test/inductor/test_mkldnn_pattern_matcher.py -k test_qlinear_add
python test/inductor/test_cpu_cpp_wrapper.py -k test_qlinear_add
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122593
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/eellison
As titled. Some ops require adjustment of output shape argument. In rule-based sharding prop, global output shape was inferred in the rule (in `view_ops.py`). In strategy-based sharding prop, it is now obtained from propagated out_tensor_meta (in `sharding_prop.py`).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126011
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
**Overview**
This PR supports constructing an ND mesh with `from_group()` by passing in `group: List[ProcessGroup]` and `mesh: Union[torch.Tensor, "ArrayLike"]` together. The `ndim` of the device mesh returned from `from_group()` is equal to the number of `ProcessGroup`s passed. If the `ndim` is greater than 1, then the `mesh` argument is required (since there is no simple way to recover the `mesh` tensor from the process groups otherwise).
This PR also adds `mesh_dim_names` as an argument to forward to the device mesh for convenience.
<details>
<summary> Old Approach </summary>
**Overview**
- This PR mainly adds `mesh_shape` to `from_group()` so that the user can construct an ND (N > 1) device mesh from a process group. This is to unblock HSDP, where we can pass the overall data parallel process group to `from_group()` with `mesh_shape = (replicate_dim_size, shard_dim_size)` and `from_group()` will construct subgroups for the user. (The user can then get the subgroups from the submeshes.)
- Constructing the 2D `DeviceMesh` from an existing shard process group and replicate process group is hard because we cannot easily recover the array of ranks in their parent group on each rank in general.
- This PR also adds `mesh_dim_names` to `from_group()` so that the user can name the mesh dimensions of the constructed device mesh.
</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126258
Approved by: https://github.com/wanchaol
Copy of #126089, with some additional fixes & tests
Partial fix for #125635: previously, the deepcopy implementation would group together any tensors with any aliasing relationship and assign them to the same tensor. This was sort of good if you have two tensors `b = a.detach()`, because then if you deepcopy `list = [a, b]` to `list2 = list.deepcopy()`, then writes to `list2[0]` will also modify `list2[1]`. But for the most part, it's bad; (1) if you have `b = a.as_strided((4, 4), (16, 1), 16)`, then it'll make `b == a` in the deepcopied implementation, which is completely wrong; and (2) even if you have `b = a.detach()`, these are still initially two different tensors which become the same tensor after the old deepcopy implementation.
The new implementation only groups together tensors that have the same identity. This is a partial fix, but it's more reasonable. What changes:
* (becomes more correct): different views of the same base tensor will no longer all become equal after deepcopying
* (still kind of wrong): views won't actually alias each other after deepcopying.
* (arguably a minor regression): equivalent views of the same tensor will no longer be copied to the same tensor - so they won't alias.
BC breaking: C++ deepcopy interface changes from accepting `IValue::HashAliasedIValueMap memo` to accepting `IValue::HashIdentityIValueMap memo`. If there are objections, we can keep the old API. However, it seems likely that users generally won't try to deepcopy from C++.
Differential Revision: [D57406306](https://our.internmc.facebook.com/intern/diff/D57406306)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126126
Approved by: https://github.com/ezyang
Summary:
This is a step towards upgrading the MKL library and using buckified targets rather than importing from TP2.
- Add new `//third-party/mkl:mkl_xxx` targets that are currently aliases to `third-party//IntelComposerXE:mkl_xxx`.
- Switch usage of `external_deps = [("IntelComposerXE", None, "mkl_xxx")]` to `deps = ["fbsource//third-party/mkl:mkl_xxx"]`
Note that this only changes references to `mkl_xxx` references in `IntelComposerXE` but not references to "svml" or "ipp*".
Test Plan: sandcastle
Differential Revision: D57360438
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126371
Approved by: https://github.com/bertmaher
As discussed before, for now Dynamo is not able to support DTensor constructor, and instead we have to use `DTensor.from_local()`.
This won't affect eager and it's a compile-only change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126346
Approved by: https://github.com/awgu
Fixes #126012.
`from` is a reserved keyword in Python, thus we can't make the C++ impl available with `from` as function parameter. This PR changes the name to `from_` and also adjusts the docs.
If we want to preserve backwards compatibility, we can leave the C++ name as-is and only fix the docs. However, `torch.can_cast(from_=torch.int, to=torch.int)` won't work then.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126030
Approved by: https://github.com/albanD
This PR is part of an effort to speed up torch.onnx.export (https://github.com/pytorch/pytorch/issues/121422).
- For each node that is processed in onnx.export, a check is run to see if all inputs are "reliable" (static shape, etc.). This value does not change, so it is much faster to cache it on the first computation. The caching is added to the ConstantMap state.
- Resolves (6) in #121422.
- Also see #123028 with a similar addition of a cache state.
(partial fix of #121545)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124912
Approved by: https://github.com/justinchuby
**Context**
For FSDP, gradient accumulation across microbatches has two flavors: (1) reduce-scatter or (2) no reduce-scatter. (1) incurs the collective per microbatch backward but saves gradient memory (storing the sharded gradients), while (2) avoids the communication but uses more gradient memory (storing the unsharded gradients).
- FSDP2 offers (1) without any intervention. The user should simply make sure to run the optimizer step after `K` microbatches for `K > 1`.
- FSDP2 offers (2) via `module.set_requires_gradient_sync()` (e.g. `module.set_requires_gradient_sync(is_last_microbatch)`).
For HSDP, since we reduce-scatter and then all-reduce, we have additional flexibility and get three flavors: (1) reduce-scatter and all-reduce, (2) reduce-scatter but no all-reduce, and (3) no reduce-scatter and no all-reduce. This PR adds support for (2).
- FSDP2 offers (1) without any intervention like mentioned above.
- FSDP2 offers (3) via `module.set_requires_gradient_sync()` like mentioned above.
- FSDP2 offers (2) via `module.set_requires_all_reduce()` similar to `set_requires_gradient_sync()`.
**Overview**
For HSDP, to reduce-scatter but not all-reduce during gradient accumulation, the user can do something like:
```
for microbatch_idx, microbatch in enumerate(microbatches):
is_last_microbatch = microbatch_idx == len(microbatches) - 1
model.set_requires_all_reduce(is_last_microbatch)
# Run forward/backward
```
This PR also makes the minor change of making the `recurse: bool` argument in these setter methods keyword-only.
**Developer Notes**
We choose to implement this by saving the partial reduce output to the `FSDPParamGroup` for simplicity, where we assume that the set of parameters that receive gradients does not change across microbatches. An alternative would be to view into the partial reduce output per parameter and save the view to each parameter. We prefer to avoid this alternative for now because it introduces more complexity to do extra viewing when saving the partial reduce output to each parameter, accumulating into them, and accumulating back to the last microbatch's reduce output.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126166
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
ghstack dependencies: #126067, #126070, #126161
This PR adds `torch.ops._c10d_functional.all_gather_into_tensor_out`.
It's important for tracing FSDP2, because FSDP2 pre-allocates the output buffer of AllGather, and makes input buffer an alias of the output buffer, and expects both of them to be used to achieve lower memory usage. If we don't preserve this behavior and instead functionalize the AllGather op, AllGather op will then create a brand-new output buffer (instead of reusing), thus significantly increasing the memory usage.
The expectation is that we will "re-inplace" the AllGather op by switching to the out variant in Inductor post-grad stage via an FX pass, so this API is not expected to be directly used by users.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126334
Approved by: https://github.com/yifuwang, https://github.com/wanchaol
# Motivation
We generalize a device-agnostic API `torch.amp.autocast` in [#125103](https://github.com/pytorch/pytorch/pull/125103). After that,
- `torch.cpu.amp.autocast(args...)` is completely equivalent to `torch.amp.autocast('cpu', args...)`, and
- `torch.cuda.amp.autocast(args...)` is completely equivalent to `torch.amp.autocast('cuda', args...)`
no matter in eager mode or JIT mode.
Based on this, we would like to deprecate `torch.cpu.amp.autocast` and `torch.cuda.amp.autocast` to **strongly recommend** that developers use `torch.amp.autocast`, which is a device-agnostic API.
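For reference, the equivalence in question (requires a CUDA build for the second pair; the bodies are placeholders):
```python
import torch

# Device-agnostic form (recommended):
with torch.amp.autocast("cuda", dtype=torch.float16):
    pass

# Deprecated device-specific spelling, equivalent to the above:
with torch.cuda.amp.autocast(dtype=torch.float16):
    pass
```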
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126062
Approved by: https://github.com/eqy, https://github.com/albanD
Some operations have a scalar input parameter, like `torch.add(a, b, alpha=2.0)`. Currently, AOT compile does not support such a case because it requires the signature of the captured graph to align with the operation's signature. This means that some inputs in the captured graph may be scalars (float, int, bool, etc.). It breaks the assumption of `compile_fx_aot`, as it assumes all the example inputs are tensors - 0f6ce45bcb/torch/_inductor/compile_fx.py (L1048)
This PR intends to support such cases by allowing a non-aligned signature and filtering out the non-Tensor parameters.
Captured graph for `torch.add(a, b, alpha=2.0)`
```
opcode name target args kwargs
------------- -------- --------------- ---------------- --------------
placeholder arg0_1 arg0_1 () {}
placeholder arg1_1 arg1_1 () {}
call_function add aten.add.Tensor (arg0_1, arg1_1) {'alpha': 2.0}
output output_1 output ((add,),) {}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124177
Approved by: https://github.com/jansel, https://github.com/desertfire, https://github.com/jgong5
Summary:
This PR implements sliding window and updates "aten._flash_attention_forward/_flash_attention_backward" to expose the window_size_left and window_size_right arguments. With this kwarg added we can dispatch to the FAv2 impl if the necessary constraints are met.
These arguments will eventually be provided to "aten.sdpa_flash", but for now they are needed by xformers in their effort to directly use the PyTorch FAv2 impl instead of building their own.
Test Plan:
Use the default aten.sdpa_flash tests since we've added optional arguments set to the previous default value: -1, /*window_size_left*/
Using buck2 build --flagfile fbcode//mode/dev-nosan fbcode//caffe2/caffe2/fb/predictor/tests:inference_context_test
Differential Revision: D56938087
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126061
Approved by: https://github.com/drisspg, https://github.com/desertfire
This means that propagate-real-tensor is no longer unsound: if the route we took at compile time diverges from the one taken at runtime, you will get a runtime assert.
Also add structured trace logs for these.
Also fix bug where xreplace with int range is not guaranteed to return
a sympy expression.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126287
Approved by: https://github.com/Skylion007
As part of #125683, this PR adds the initial bf16/fp16 gemm template support with micro-gemm implemented with fused type casting and fp32 computation. It doesn't provide epilogue fusion support yet which will be added in the next PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126068
Approved by: https://github.com/jansel
ghstack dependencies: #126019
The `test_device_guard.py` was improperly set up, so there were failures on multi-GPU machines. By design the `DeviceGuard` should keep `idx` the same even after it was applied.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126240
Approved by: https://github.com/jansel
As part of #125683, this PR adds the epilogue support for c++ gemm template by reusing the c++ vector codegen on sub-slices of tensors. This is implemented by retracing the epilogue IR nodes with new ranges and offsets. The new `codegen_loop_bodies` and `codegen_functions` methods are added to c++ vector codegen for this purpose. This is leveraged by the `store_output` method of the template kernel for epilogue codegen and store to the final result.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126019
Approved by: https://github.com/jansel
Summary: Found a unit test that was causing an assertion failure during an attempt to use unbacked symints in the guards expression, but it turns out unbacked symints can't affect guards anyway, so we can just filter them out. Also in this diff: test_torchinductor_dynamic_shapes.py was not configured to exercise the codecache because the TestCase setUp method was inadvertently skipping the setUp of the immediate parent class.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126251
Approved by: https://github.com/peterbell10
**As title.**
Still, `ep.run_decompositions()` will use `core_aten_decompositions()` by default. Cases like `ep.run_decompositions(get_decompositions([]))` will use an empty table and go with [`aot_autograd_decompositions`](04877dc430/torch/_functorch/aot_autograd.py (L456-459)) only.
**Motivation**
We didn't have a clean way to pass in an empty decomp table. Since we've made `pre_dispatch` export the default and `ep.run_decompositions` remains with `aot_export_module(..., pre_dispatch=False)`, allowing an empty table makes blank control easier.
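A hedged sketch of the empty-table case; the module is a toy example, and `get_decompositions` comes from `torch._decomp`:
```python
import torch
from torch._decomp import get_decompositions

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

ep = torch.export.export(M(), (torch.randn(2),))
ep_default = ep.run_decompositions()                      # core ATen decomps
ep_blank = ep.run_decompositions(get_decompositions([]))  # empty decomp table
```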
**Testing**
CI
Also looked through all the references in fbcode. The only concern I have is whether we should update [this example](04877dc430/torch/onnx/_internal/exporter.py (L817)) or not.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126142
Approved by: https://github.com/angelayi
Internal xref: https://fb.workplace.com/groups/6829516587176185/posts/7228787720582401/
There a few improvements here, which luckily fix some xfails:
* In general, it can be unsafe to call operations on Tensors under a `no_dispatch()` mode that is purely trying to disable ambient modes, because this ALSO disables tensor subclass handling. So we test to see if there is a tensor subclass and don't propagate real tensors if that's the case. Another acceptable outcome might be to try to only disable the ambient fake tensor mode; this would help us propagate real tensors through more exotic tensor types, but I'm not going to do it until someone asks for it.
* We're graph breaking for wrapped tensors too late. Pull it up earlier so we do it before we try to muck around with the real tensor.
* I noticed that occasionally when I do `storage.copy_(real_storage)`, the sizes mismatch. Careful code reading suggests that I should just copy in the real data when the tensor was initially allocated, so that's what I do now, eliminating the need for a storage copy.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126281
Approved by: https://github.com/Skylion007
Summary:
Encountered module import error when running triton kernel file.
The cause seems to be D57215950 which changed "do_bench" to "do_bench_gpu" for torch._inductor.runtime.runtime_utils
However, in the codegen, instead we have "from triton.testing import do_bench", so the line below should be reverted back to "do_bench".
Test Plan:
LOGLEVEL=DEBUG TORCH_COMPILE_DEBUG=1 TORCHINDUCTOR_MAX_AUTOTUNE=0 CUDA_VISIBLE_DEVICES=5 TORCHINDUCTOR_PROFILE=1 TORCHINDUCTOR_PROFILE_OUTPUT='/home/adelesun/mts_profiling/outputs/profile_output.txt' TORCH_LOGS='+inductor,+schedule,output_code' TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 TORCHINDUCTOR_BENCHMARK_KERNEL=1 TORCHINDUCTOR_CACHE_DIR='/home/adelesun/mts_profiling/code' TORCHINDUCTOR_ENABLED_METRIC_TABLES=kernel_metadata buck2 run mode/opt -c=python.package_style=inplace -c fbcode.enable_gpu_sections=true -c fbcode.platform=platform010 -c fbcode.nvcc_arch=v100,a100,h100 -c fbcode.split-dwarf=true caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --local-model /home/adelesun/mts_profiling/inputs/offsite_cvr_model_526372970_793.input.predictor.disagg.gpu.merge --lower-backend AOT_INDUCTOR 2>&1 | tee /home/adelesun/mts_profiling/outputs/benchmark_output.txt
bento console --kernel=aetk --file=/home/adelesun/mts_profiling/code/op/copmbxfunzmywemwmg66lnlcx4apvn2f2vsi3glgisausgfvit4g.py
file ran successfully
Differential Revision: D57345619
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126213
Approved by: https://github.com/shunting314
> previous: Originally, the variables `new_eta` and `new_mu` would be constructed `len(grouped_mus)` times, but their values are all the same and never change. Therefore, this can be simplified using Python list multiplication, which only constructs one tensor.
- [x] Ill-founded assumption that every param will have the same step.
- [x] Different implementation between `foreach=True` and `foreach=False`.
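A generic illustration of the simplification described above; the variable names are made up and this is not the optimizer's internal code:
```python
import torch

value = torch.tensor(0.01)
n = 8

# Before: constructs n identical tensors.
new_etas = [torch.tensor(0.01) for _ in range(n)]

# After: constructs one tensor and repeats the reference n times.
new_etas = [value] * n
```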
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125440
Approved by: https://github.com/janeyx99
This PR is part of an effort to speed up torch.onnx.export (#121422).
- The `auto debug_names = ` incurs a copy, whereas `const auto& debug_names` does not.
- However, this ones requires us to be careful, since calls to `setDebugName` changes `debug_names` and invalidates the `exist_name` iterator. So if we simply change `auto` to `const auto&`, then between that line and `find` we have corrupted the iterator by calling `output[i]->setDebugName`. This change aims to be functionally equivalent to the original, which is why we first get the Value pointer, then call `output[i]->setDebugName`, and finally call `setDebugName` on the found value. It is possible functionally it is OK to simply call `output[i]->setDebugName` first and then find and the second `setDebugName`, but this would not be identical to current behavior.
- Resolves (2) in #121422.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123026
Approved by: https://github.com/justinchuby
unit test: ``pytest test/distributed/_composable/fsdp/test_fully_shard_state_dict.py``
With meta init and CPU offloading, we have meta tensors after `model.load_state_dict(assign=True, strict=False)`. This PR avoids calling `.cpu` on meta tensors; otherwise it's a runtime error.
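A minimal illustration of why the guard is needed (generic tensor code, not the FSDP state-dict path itself):
```python
import torch

t = torch.empty(4, device="meta")
# t.cpu() would raise here ("Cannot copy out of meta tensor"), so skip it:
value = t if t.is_meta else t.cpu()
```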
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126267
Approved by: https://github.com/awgu
After test removal for the Windows CPU + AVX related configs, it's going to be the long pole for trunk.
Just checked: without ROCm, the average TTS for trunk last week was 2.5 hrs; with ROCm it's about 3.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125933
Approved by: https://github.com/ZainRizvi
The current call passes `['/actual/path']` to os.walk, which is a string pointing to no path and thus silently leads to an empty traversal.
There is an unused function just above that handles that, so I guess this is what was supposed to be called.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126103
Approved by: https://github.com/suo
For mm inputs which are not inputs of the graph, assume that we can memory plan them in the aten.cat and exclude the padding cost in the benchmarking comparison. Technically we also have to do a small amount of 0s writing, but that should be relatively small and encompassed in the weighting of the padding time by `1.1`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125780
Approved by: https://github.com/shunting314
ghstack dependencies: #125772, #125773
This field never changes so pre_compile doesn't need to return it again: remove it just for a cleaner refactor.
As @aorenste points out, the fw_metadata passed to post_compile is actually the fw_metadata after all wrapper's pre_compile's have run. I want to make this clear in the code, so I renamed the arg in post_compile.
Wrappers that need the exact metadata that they were passed in pre_compile need to save that fw_metadata properly themselves.
Currently, wrappers come in two categories:
1. Wrappers that modify fw_metadata, but then never use fw_metadata in post compile
2. Wrappers that never modify fw_metadata, and only consume the "final" fw_metadata.
So none of the behaviors will change for the existing wrappers. That said, it might be useful to define a "SimpleCompilerWrapper" subclass which guarantees it does not modify fw_metadata. I'll do that in a separate PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125854
Approved by: https://github.com/aorenste, https://github.com/bdhirsh
This PR is part of an effort to speed up torch.onnx.export (#121422).
- Doing a reverse look-up in `symbol_dim_map` incurs a linear cost in number of symbols. This happens for each node, so incurs a quadratic cost to the whole export.
- Add a reverse look-up `dim_symbol_map` that is kept in parallel of `symbol_dim_map`. This avoids a linear time look-up, which creates a quadratic export time complexity.
- This is a highly pragmatic solution. If someone more familiar with the code base has a better solution, I'm interested to hear about it.
- Resolves (9) in #121422.
(partial fix of #121422)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123029
Approved by: https://github.com/justinchuby
- Only search for magma if it is used (GPU builds)
- Don't report it was not found when it isn't searched for
- Don't report if magma is disabled (currently: "MAGMA not found. Compiling without MAGMA support" is reported)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117858
Approved by: https://github.com/malfet
Or my journey to learn how to write fast Metal kernels (more details would be posted [here](https://github.com/malfet/llm_experiments/tree/main/metal-perf) )
Using gpt-fast as a benchmark (by running `python generate.py --checkpoint_path checkpoints/stories110M/model_int8.pth --device mps`)
Before the change, on M2 Pro I get 50 tokens per sec
After adding a very naive
```metal
template<typename T>
kernel void int8pack_mm(
constant T * A [[buffer(0)]],
constant char * B [[buffer(1)]],
constant T * scales [[buffer(2)]],
device T * outputData [[buffer(3)]],
constant uint3 & sizes [[buffer(4)]],
uint thread_index [[thread_position_in_grid]]) {
const uint lda = sizes.y;
const uint ldc = sizes.z;
const uint m = thread_index / sizes.z; // 0..sizes.x-1
const uint n = thread_index % sizes.z; // 0..sizes.z-1
constant T *A_ptr = A + m * lda;
constant char *B_ptr = B + n * lda;
float rc = 0.0;
for(uint k = 0; k < sizes.y; k++) {
const auto a_val = float(A_ptr[k]);
const auto b_val = float(B_ptr[k]);
rc += a_val * b_val;
}
outputData[thread_index] = T(rc * float(scales[n]));
}
```
Perf dropped down to a sad 15 tokens per second.
Replacing inner loop with vectorized operations
```metal
float rc = 0.0;
for(uint k = 0; k < sizes.y/4; k++) {
const auto a_val = float4(A_ptr[k]);
const auto b_val = float4(B_ptr[k]);
rc += dot(a_val, b_val);
}
```
Perf jumps back up to 53 tokens per second, but it's a bit of a lie when it comes to llama2-7B perf.
The next step in unlocking the performance was to replace the 1D grid with a 2D one, but limit the thread group size to a single row, which results in much better data locality (unfortunately no longer observable with `stories110M`, as its small model size and Python runtime overhead hide the perf gain).
There were several unsuccessful attempts at caching inputs in thread local memory or using `float4x4` to speed up computation. But the key to unlocking the perf was a comment in 631dfbe673/mlx/backend/metal/kernels/gemv.metal (L184)
which hinted at exploiting both SIMD groups and thread local caches, and which resulted in a 5x jump in performance compared to the initial vectorization approach and a 3x perf jump in the end-to-end llama7b test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125704
Approved by: https://github.com/mikekgfb
Summary:
Previously we tried to convert all .to() calls to to_copy in the graph; now some users report that other methods like .float() are not covered: https://github.com/pytorch/PiPPy/issues/1104#issuecomment-2093352734
I think fundamentally .float() should look similar to .to() in export, and this diff tries to expand the coverage of the tensor conversion methods here.
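For reference, a tiny check of the equivalence whose coverage this expands:
```python
import torch

x = torch.randn(2, dtype=torch.float64)
assert torch.equal(x.float(), x.to(torch.float32))
```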
Test Plan: buck run mode/opt caffe2/test:test_export -- -r float_conversion
Differential Revision: D56951634
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125628
Approved by: https://github.com/tugsbayasgalan
We found some Inductor test case failures when enabling Inductor UT for Intel GPU. The root cause is newly introduced Inductor device-biased code from recent community PRs, which causes different behaviors between Intel GPU and CUDA. This PR generalizes that code to align the behaviors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126261
Approved by: https://github.com/EikanWang, https://github.com/peterbell10
Summary: When looking up what backend call to use for a fallback op (see get_backend_index_for_aoti), sometimes we need to search for a NativeFunction's structured delegate. The previous str:NativeFunctionsGroup dict missed some cases, such as aten.index.Tensor, which is why aten.index.Tensor was specified in the fallback_ops list but no C shim entry was generated for it. This PR uses a more robust OperatorName:NativeFunctionsGroup mapping.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125962
Approved by: https://github.com/chenyang78
This PR adds the Cpp template infrastructure and the initial FP32 gemm template. See RFC https://github.com/pytorch/pytorch/issues/125683 for more background info.
1. Cpp template infrastructure
Similar template abstractions as the CUTLASS template, i.e., `CppTemplate`, `CppTemplateKernel`, `CppTemplateBuffer`. The MicroGemm micro-kernel abstraction that can be used by Cpp GEMM templates.
2. Initial FP32 gemm template
This involves a GEMM template implementation `CppPackedGemmTemplate` that supports GEMM with constant weight (`B`) requiring `N` to be a multiple of register blocking while allows the static or dynamic sizes for the `M` (batch dim) of `A`. The `B` matrix would be prepacked. This is a typical setting for inference workloads. The template handles the thread decomposition (via `thread_blocking`) and cache blocking (via `cache_blocking`). Then it invokes `CppMicroGemm` which handles register blocking, instruction selection, and other CPU architecture-specific optimizations. A `CppMicroGemmFP32Vec` micro-kernel implementation is provided for fp32 matmuls implemented with ATen vec abstraction.
3. Correctness and performance
The changes have been validated with fp32 inference on the three benchmark suites (torchbench, huggingface and timm_models) with both static shape and dynamic shapes. Since it is an initial implementation, we are still working on further performance improves with follow-up PRs including the optimizations in kernels as well as fusions. The perf gains are only observed from a selective number of models compared to the ATen kernels which are implemented with MKL. The perf gains are more obvious with dynamic shapes since MKL only supports packed gemm for static shapes. Below are details.
Static shapes
| Benchmark | torchbench | huggingface | timm_models |
|------------|-------------|--------------|--------------|
| Multi-threaded (baseline) | 1.47x | 1.36x | 1.91x |
| Multi-threaded (max-autotune) | 1.47x | 1.36x | 1.92x |
| Single-threaded (baseline) | 1.56x | 1.19x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.52x |
Key models being sped up:
drq: 1.14x
soft_act: 1.12
cait_m36_384: 1.18x
Dynamic shapes
| Benchmark | torchbench | huggingface | timm_models |
| --- | --- | --- | --- |
| Multi-threaded (baseline) | 1.43x | 1.28x | 1.85x |
| Multi-threaded (max-autotune) | 1.47x | 1.28x | 1.85x |
| Single-threaded (baseline) | 1.55x | 1.20x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.53x |
Key models being sped up:
BERT_pytorch: 1.22x
pyhpc_turbulent: 1.13x
soft_actor_critic: 1.77x
BlenderbotForCausalLM: 1.09x
cait_m36_384: 1.17x
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124021
Approved by: https://github.com/jansel
Summary: It seems that most (all?) of our utilities for examining benchmark output expect single-line entries per benchmark. The way the --warm-start-latency flag is currently implemented, it means that we'll see two entries for every benchmark run (one for the warm-up run and one for the actual run). This PR adds a --disable-output flag that we can use for the first run to suppress populating the csv. This way, the existing utilities like `benchmarks/dynamo/check_accuracy.py` will function without any changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125953
Approved by: https://github.com/desertfire
ghstack dependencies: #125917
If I do:
```
xla_device = xm.xla_device()
xla_tensor_0 = torch.tensor(42, dtype=torch.uint32).to(xla_device)
```
I got the error:
```
RuntimeError: false INTERNAL ASSERT FAILED at "/ansible/pytorch/torch/csrc/lazy/core/hash.h":139, please report a bug to PyTorch. Unsupported scalar type:UInt16
```
This PR intends to fix this issue.
The data type can be found in pytorch/c10/core/ScalarType.h.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125972
Approved by: https://github.com/JackCaoG
Summary:
Move const strings to the top of the file. This is in preparation for tooling to
make use of shared constants (e.g. the version string). A non-functional change.
Ideally we want these const strings to be available from both C++ and Python - but I haven't figured out how to correctly share things in PyTorch. I'll do this in a subsequent change.
Test Plan:
python test/distributed/test_c10d_nccl.py NCCLTraceTest
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125640
Approved by: https://github.com/wconstab
Adds trace_subgraph to _MakefxTracer; the motivation is in https://github.com/pytorch/pytorch/pull/122972. Also migrates all existing usage of reenter_make_fx to the new sub-tracer. Previously, the torch function mode for creating torch_fn metadata wouldn't be re-entered when we're in ProxyTensorMode (since it's inside of __torch_function__). This PR reconstructs the torch function mode based on the parent tracer's config and re-enters it so the metadata is shown in the graph.
**Test Plan:**
Existing tests. We have a bunch of make_fx tests for cond, map and while_loop. Also remove expected failure for torch_fn since reenter_make_fx is able to re-construct torch function modes.
Also fixes https://github.com/pytorch/pytorch/issues/124643
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125363
Approved by: https://github.com/Chillee
ghstack dependencies: #125267
Code movement + minor rewrites. We extract the states of make_fx out and encapsulate them into a _MakefxTracer class. This allows us to create a new make_fx_tracer when tracing subgraphs; the actual logic for tracing subgraphs is in the next diff.
Test Plan:
Existing tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125267
Approved by: https://github.com/Chillee
Add `ManualPipelineStage` under `_PipelineStage.py`
Fix some type hints since `args_recv_info` can contain more than one RecvInfo. Previously the hint was `Tuple[InputInfo]` which meant it is a tuple of size 1. This is different from `List[InputInfo]` which can contain any number of items. I needed to update to `Tuple[InputInfo, ...]` to make the number of items flexible.
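For illustration, a minimal sketch of the typing difference described above, with `InputInfo` as a stand-in for the real class:
```python
from typing import List, Tuple

class InputInfo:  # stand-in for the real RecvInfo/InputInfo class
    ...

def takes_exactly_one(infos: Tuple[InputInfo]) -> None: ...      # a 1-tuple only
def takes_any_number(infos: Tuple[InputInfo, ...]) -> None: ...  # variable-length tuple
def takes_list(infos: List[InputInfo]) -> None: ...              # any number of items, mutable
```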
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126123
Approved by: https://github.com/kwen2501
# Summary
I was getting
``` Shell
File "/home/drisspg/meta/pytorch/torch/cuda/__init__.py", line 312, in _lazy_init
raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: invalid literal for int() with base 10: '90a'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126185
Approved by: https://github.com/Skylion007
Observed Problem
---------------------
I noticed that when `torchrun` has finished running the main trainer function (a.k.a. the entrypoint/user function) successfully, it sometimes SIGTERMs the child processes. Then `torchrun` exits successfully.
This results in misleading warning log messages towards the end of the job like the one below:
```
W0510 14:52:48.185934 672413 api.py:513] Closing process 675171 via signal SIGTERM
W0510 14:52:48.185984 672413 api.py:513] Closing process 675172 via signal SIGTERM
W0510 14:52:48.186013 672413 api.py:513] Closing process 675174 via signal SIGTERM
# <---- ^^^ ??? everything runs successfully but child still SIGTERM'ed? ^^^ --->
I0510 14:52:48.229119 672413 api.py:877] [main] worker group successfully finished. Waiting 300 seconds for other agents to finish.
I0510 14:52:48.229161 672413 api.py:922] Local worker group finished (WorkerState.SUCCEEDED). Waiting 300 seconds for other agents to finish
I0510 14:52:48.229395 672413 api.py:936] Done waiting for other agents. Elapsed: 0.0001709461212158203 seconds
I0510 14:52:48.257544 672413 dynamic_rendezvous.py:1131] The node 'localhost_672413_0' has closed the rendezvous 'torchrun_qpfd'.
I0510 14:52:48.568198 672413 distributed.py:200] Deleting temp log directory: /tmp/torchrun_udgp8zoq
I0510 14:52:48.568989 672413 distributed.py:202] Finished running `main`
```
Root Cause
------------------
I noticed that this was due to the incorrect usage of `torch.multiprocessing.ProcessContext.join()` in `torch.distributed.elastic.multiprocessing.api.MultiprocessingContext`.
`torch.multiprocessing.ProcessContext.join()` does not actually wait for ALL child procs to exit, but rather waits for **at-least-one** child proc to exit. If only a subset of the child procs have exited, it returns `False` and if all child procs have exited it returns `True`.
`torch.distributed.elastic.multiprocessing.api.MultiprocessingContext` was assuming that `torch.multiprocessing.ProcessContext.join()` blocks indefinitely until all child procs have exited.
Fix
---------
The fix is simple: just loop, continuing to call `pc.join()` until it returns `True` (see the sketch below).
> **NOTE**: the indefinite blocking is NOT an issue since by the time `torch.distributed.elastic.multiprocessing.api.MultiprocessingContext` calls `pc.join()` it has already done all the checking to validate that the entrypoint functions either returned successfully or that one of them failed. So we are really just waiting for the unix process to exit after running the entrypoint function.
> **NOTE**: since `pc.join()` already blocks until at-least-one child proc exits, there is no need to add a polling interval in the body of the loop, and the debug logging will appear at most `nproc_per_node` times, so no log spamming is observed.
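A minimal sketch of the looping fix, assuming a `ProcessContext` obtained from `torch.multiprocessing.start_processes` (the names here are illustrative, not the actual elastic agent code):
```python
import torch.multiprocessing as mp

def _worker(rank):
    pass  # entrypoint; assumed to have already been validated elsewhere

# start_processes with join=False returns a ProcessContext
pc = mp.start_processes(_worker, nprocs=4, join=False)

# pc.join() returns False as soon as at least one child has exited and True
# only once every child has exited, so keep calling it until it returns True.
while not pc.join():
    pass
```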
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125969
Approved by: https://github.com/d4l3k
Internal xref:
https://fb.workplace.com/groups/6829516587176185/posts/7211398545654652/
Previously I did it in a crappy way using clone_input in the callback,
but this results in tensors that don't have quite the same
size/stride/storage offset and there was an internal test case where
not having completely accurate information was causing a downstream
problem in propagation. So now I make real tensors as similar to their
fake equivalents as much as possible. Though... I don't bother with
autograd lol.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126175
Approved by: https://github.com/albanD
Aliased and unused params are currently an issue for strict-mode export. For a model like this:
```
import torch
import torch.nn as nn

class Foo(nn.Module):
    def __init__(self):
        super().__init__()
        # ...
        self.alpha = nn.Parameter(torch.randn(4))
        self.beta = self.alpha   # alias of alpha
        self.gamma = self.alpha  # alias of alpha, unused in forward

    def forward(self, x):
        return x + self.beta
```
Dynamo will trace only 1 parameter (beta) and assign a dynamo name (e.g. `L__self___beta`) which can be difficult to match to the correct FQN in the original eager module. This leads to export graph signature potentially having the incorrect target FQN for the parameter, leading to downstream issues unflattening (the parameter may be assigned to the wrong target attribute, mismatching the relevant placeholder node in the unflattened module).
This handles aliasing issues by assigning all tensors present in the state dict as module attributes, even if they're unused. Still, only the used tensors will appear in the graph's forward pass.
Another issue is that weight-sharing is not maintained in unflattening (all params/buffers are re-cloned); handle this by checking tensor ids too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125758
Approved by: https://github.com/zhxchen17
This adds a new dispatch mode, PreDispatchSchemaCheckMode, built on top of SchemaCheckMode, used for verifying op schemas for functionalization for PreDispatch IR. More specifically, the mode runs in eager mode on concrete inputs, checking if op schemas incorrectly claim to be functional, but are aliasing or mutating. This mode is pushed to the pre-dispatch mode stack, and run before decompositions.
Current testing is hooked up to OpInfo, containing 1103 tests on 600 unique ops. Below is a list of ops that fail testing. One caveat is we only raise errors on ops that claim to be functional - if an op schema admits aliasing or mutating but fails testing for the other, it still may decompose further and become functional.
List of failed ops:
```
aten.atleast_1d.default
aten.atleast_2d.default
aten.atleast_3d.default
aten.cartesian_prod.default
aten.conj_physical.default
aten.alpha_dropout.default
aten.feature_dropout.default
aten.feature_alpha_dropout.default
aten.unsafe_chunk.default
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125481
Approved by: https://github.com/tugsbayasgalan
This adds a pattern which replaces:
```python
scale(x) - scale(x).amax(dim, keepdim=True)
```
with
```python
scale(x - x.amax(dim, keepdim=True))
```
where `scale` can be either multiplication or division by a scalar,
or a tensor that is broadcast in the `dim` dimension.
We can find this pattern inside of the decomposed graph of:
```python
F.softmax(scale(x), dim=dim)
```
This has the effect of both reducing the chance of hitting the `fma`
issue and also means we avoid recomputing `scale(x)` inside and outside
the reduction which may be significant if we can remove an extra division.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124119
Approved by: https://github.com/lezcano
We've had issues using addr2line. Certain versions of CentOS ship a version with a performance regression that makes it very slow, and even normally it is not that fast, taking several seconds even when parallelized for a typical memory trace dump.
Folly Symbolize or LLVMSymbolize are fast but it requires PyTorch take a dependency on those libraries to do this, and given the number of environments we run stuff in, we end up hitting cases where we fallback to slow addr2line behavior.
This adds a standalone symbolizer to PyTorch similar to the unwinder which has
no external dependencies and is ~20x faster than addr2line for unwinding PyTorch frames.
I've tested this on some memory profiling runs using all combinations of {gcc, clang} x {dwarf4, dwarf5} and it seems to do a good job at getting line numbers and function names right. It is also careful to route all reads of library data through the `CheckedLexer` object, which ensures it is not reading out of bounds of the section. Errors are routed through UnwindError so that those exceptions get caught and we produce a ?? frame rather than crash. I also added a fuzz test which gives all our symbolizer options random addresses in the process to make sure they do not crash.
Differential Revision: [D56828968](https://our.internmc.facebook.com/intern/diff/D56828968)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123966
Approved by: https://github.com/ezyang, https://github.com/aaronenyeshi
Implements forward automatic differentiation support for miopen_batch_norm and unskips the associated unit tests. Also fixes a class of functorch-related unit tests that fail a contiguous-tensor assertion in BatchNorm_miopen.cpp. The solution was to restrict miopen_batch_norm to tensors with at least 3 dimensions. The same restriction already existed in the cudnn path, which is why the tests in question only failed on ROCm.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125069
Approved by: https://github.com/jeffdaily, https://github.com/andrewor14
This PR is part of an effort to speed up torch.onnx.export (#121422).
- The inputs (dynamic inputs and constants) do not change as nodes are added, yet they are expensive to re-compute for every node, so we cache this value to avoid recomputing it per node. Open to entirely different solutions as well.
- Resolves (5) in #121422.
(partial fix of #121545)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123028
Approved by: https://github.com/justinchuby
Summary: Follow-up to https://github.com/pytorch/ao/pull/229.
This resolves the difference between `input.div(scales)` and
`input.mul(1.0 / scales)`, which results in small numerical
discrepancies on some inputs.
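As an illustration of the kind of discrepancy being resolved (not the test itself), dividing by a scale and multiplying by its reciprocal are not bit-identical in floating point:
```python
import torch

x = torch.randn(1024, dtype=torch.float32)
scales = torch.full((1024,), 0.1, dtype=torch.float32)

# The two formulations round differently for many scale values.
print(torch.equal(x.div(scales), x.mul(1.0 / scales)))     # typically False
print((x.div(scales) - x.mul(1.0 / scales)).abs().max())   # small but nonzero
```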
Test Plan:
python test/test_quantization.py TestQuantizedTensor.test_decomposed_quantize_per_channel_group
python test/test_quantization.py TestQuantizedTensor.test_decomposed_quantize_per_token
Reviewers: jerryzh168
Subscribers: jerryzh168, supriyar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125781
Approved by: https://github.com/jerryzh168
This simplifies the test a bit.
**Context**
Option 1: Ref model is data parallel. Each rank's ref model receives local batch. We manually all-reduce gradients and divide them by world size to match DDP/FSDP semantics.
Option 2: Ref model is not data parallel. Each rank's ref model receives the same global batch. We manually divide the ref model's gradients by world size to match DDP/FSDP semantics. (Note that all ranks have the same ref model and same global batch.)
All of our other unit tests are written following Option 1, which is simpler and a more direct comparison to our claimed semantics. This PR switches the gradient accumulation test from Option 2 to Option 1.
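A minimal sketch of the Option 1 reference-model gradient handling described above (a hypothetical helper, not the test code itself):
```python
import torch.distributed as dist

def sync_ref_grads(ref_model, world_size: int) -> None:
    # Match DDP/FSDP semantics: sum gradients across ranks, then average.
    for param in ref_model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad)  # defaults to SUM
            param.grad.div_(world_size)
```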
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126161
Approved by: https://github.com/wanchaol
ghstack dependencies: #126067, #126070
This PR is a follow-up of RFC https://github.com/pytorch/pytorch/issues/115545.
In this PR, we are trying to enable a cache mechanism to accelerate **eager-through-torch.compile**. When **eager-through-torch.compile** is enabled, we will store a persistent config to cache the kernel information for the aten operation.
The persistent config consists of two parts - meta_info and kernel_path.
- meta_info: The input tensors' shape, stride, device type, data type, and symbolic flag.
- kernel_path: The path of the kernel produced by Inductor.
When an aten operation is registered, the `kernel_holder` will load the persistent config and parse it to build the cache map; the meta_info is key, and the kernel library is the value.
Currently, this PR only supports static shape to guard the kernel.
Take a `mul` as an example.
```python
import time
from typing import Any

import torch

device = "cuda"  # the registration below targets CUDA

class MulKernel:
    def __init__(self) -> None:
        pass

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        with torch._C._SetExcludeDispatchKeyGuard(torch._C.DispatchKey.Python, False):
            opt_fn = torch.compile(
                torch.ops.aten.mul,
                dynamic=False,
                options={
                    "aot_inductor.eager_mode": True,
                    "aot_inductor.eager_op_name": "mul_Tensor",
                },
            )
            return opt_fn(*args, **kwargs)

torch_compile_op_lib_impl = torch.library.Library("aten", "IMPL")
_, overload_names = torch._C._jit_get_operation("aten::mul")
for overload_name in overload_names:
    schema = torch._C._get_schema("aten::mul", overload_name)
    reg_name = schema.name
    if schema.overload_name:
        reg_name = f"{reg_name}.{schema.overload_name}"
    torch_compile_op_lib_impl.impl(
        reg_name,
        MulKernel(),
        "CUDA",
        compile_mode=True)

a = torch.randn(1024, 1024, device=device)
b = torch.randn(1024, 1024, device=device)

warm_up_iter = 1000
iter = 10000
fn = torch.mul

# Warm up
for _ in range(warm_up_iter):
    fn(a, b)

# Collect performance
beg = time.time()
for _ in range(iter):
    fn(a, b)
end = time.time()

print(f"E2E run: {end - beg}")
```
It will produce the config as follows.
```json
[
{
"meta_info": [
{
"is_symbolic": false,
"device_type": "cuda",
"dtype": "torch.float32",
"sizes": [1024, 1024],
"strides": [1024, 1]
},
{
"is_symbolic": false,
"device_type": "cuda",
"dtype": "torch.float32",
"sizes": [1024, 1024],
"strides": [1024, 1]
}
],
"kernel_path": "/tmp/torchinductor_eikan/e4/ce4jw46i5l2e7v3tvr2pyglpjmahnp7x3hxaqotrvxwoeh5t6qzc.so"
}
]
```
Performance-wise, we collected mul.Tensor through torch.compile with 10000 runs (e2e). The data is as follows; we will collect more data when we support dynamic shapes.
- Eager: ~266.11ms
- W/O Cache: ~3455.54ms
- W/ Cache and Cache Miss: ~3555.3ms
- W/ Cache and Cache Hit: ~267.12ms
Hardware:
- CPU: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
- GPU: CUDA A10
Software:
- PyTorch Version: 39df084001c54cca5fe3174176f9b0206ddb7dcf
- GPU Driver Version: 525.147.05
- CUDA Version: 12.0
Differential Revision: [D57216427](https://our.internmc.facebook.com/intern/diff/D57216427)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116368
Approved by: https://github.com/jansel, https://github.com/atalman
Fix https://github.com/pytorch/pytorch/issues/125437 .
The Triton matmul template does not work well with non-contiguous inputs and causes misaligned memory access. It happens both for the inductor matmul template and the triton.ops.matmul op. This PR avoids adding `tl.multiple_of` and `tl.max_contiguous` if the input tensors are not contiguous, which works around the issue. We'll follow up and try to figure out the root cause in the GH issue.
The if/else added to the template should be resolved at compile time and does not by itself cause any perf hit.
Test:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 python benchmarks/dynamo/huggingface.py --backend inductor --amp --accuracy --only BertForMaskedLM --training
```
Previously this failed with a misaligned memory access; now it passes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126106
Approved by: https://github.com/htyu
As titled: if _split_tensor does not require padding, or the tensor is evenly sharded on the dim, there is no need to calculate padding and we can simply return. This avoids some unnecessary CPU operations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125810
Approved by: https://github.com/wz337
Fix for https://github.com/pytorch/pytorch/issues/122871. There are two cases where we emit pointwise cat:
- fusing into a pointwise use
- horizontally fusing copy_ kernels
The regression I looked into previously was due to being overly aggressive in the latter case. I've updated the logic there so that we only emit the horizontal fusion in the case where there are no reductions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125772
Approved by: https://github.com/Chillee
The big idea is that floats are treated as Tensors on input/output to the FX graph, but on the inside, we immediately call item() on the synthetic Tensor and record regular float operations on it. Canonicalization to Tensor operations will happen in a standalone FX pass. This behavior is controlled by `specialize_float` config variable when set to False.
The generated graph looks like this for the test `test_unspec_float_output`:
```
def forward(self, L_x_: "f32[3]", L_y_: "f32[]"):
l_x_ = L_x_
l_y_ = L_y_
# File: /data/users/ezyang/a/pytorch/test/dynamo/test_unspec.py:511 in f, code: return x + 1, y * 2
add: "f32[3]" = l_x_ + 1; l_x_ = None
item: "Sym(zf0)" = l_y_.item(); l_y_ = None
mul: "Sym(2*zf0)" = item * 2; item = None
scalar_tensor: "f32[]" = torch.scalar_tensor(mul); mul = None
return (add, scalar_tensor)
```
The ingredients:
* **torch/_dynamo/variables/builder.py** When `specialize_float` is False, we wrap float literals with `wrap_symfloat`. This is an unholy mashup of `wrap_symint` and `wrap_unspecialized_primitive`. The overall strategy is that we first generate a tensor argument (because that's what we want to show up into the FX graph), but then immediately call item() on the tensor argument to get a SymNodeVariable, which we will do the rest of the tracing with. Importantly, this SymNodeVariable is backed with the source of the original float: this means we can guard on the resulting value (something we could NOT do with UnspecializedPythonVariable). This has to be done manually, because if you literally call item() on the tensor, you will end up with an unbacked float. There is a bit of copy paste from wrap_symint and wrap_unspecialized_primitive which we can try to factor out, but this really is its own thing and you should review every line of code in the function.
* **torch/fx/experimental/symbolic_shapes.py** We now can generate guards on float inputs, and these guards are handled inside of ShapeEnv. So we need to be able to allocate (backed!) float symbols, and produce guards for them. Fairly straightforward generalization.
* **torch/_dynamo/codegen.py** I also need to maintain the invariant that there are no float outputs to the FX graph. I chose to do this at codegen time. When we detect a SymNodeVariable on the return stack for a float, we on the fly convert it (via `as_tensor`) to a TensorVariable, which is the true output. We then special case the output bytecode to call item() on it again. The tensor conversion is memoized on SymNodeVariable since we typically run the code generation process twice.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125325
Approved by: https://github.com/lezcano, https://github.com/jansel
This is mainly:
- Fix refcount access macro
- Hide all the Dynamo code that needs update as usual
- Add _PyWeakref_ClearRef as an extern provided by CPython. Including the pycore header that defines it would require raw c include shenanigans that I don't think are worth it.
This allows building with both the regular and the nogil versions of CPython.
Note that this requires the 3.13 branch at least past [d3094744d40de2deefbda9b1996d5029c9ebf0b0](d3094744d4), which we need for the mimalloc include and the weakref function being exposed.
Both debug-only issues in pybind11 with PyMem_MALLOC vs PyObject_MALLOC should be synced either by updating pybind or cpython. @colesbury I can send a PR to ifdef the proper use in pybind if you think that this is the best solution here?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126033
Approved by: https://github.com/colesbury
Summary:
Per AMD, software pipelining is enabled by setting `num_stages=0` and should provide a nice perf boost for GEMMs. The caveat is that `num_stages=1` is preferred for instances of back-to-back GEMMs, but take `num_stages=0` as the better default.
Wait to land until Triton upstream lands in OSS; pipelining does not work well on the fork.
Test Plan: n/a
Reviewed By: xw285cornell, yoyoyocmu
Differential Revision: D56221447
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125858
Approved by: https://github.com/pragupta, https://github.com/yoyoyocmu
Summary:
#125682 (D56586844) added support for lazy symbolization to `Error` and adopted it for internal use cases; this commit adopts it for `get_backtrace()` as well.
Test Plan:
Sandcastle and GH CI.
NOTE: This is a resubmit of D56881683, a spurious copypasted line in the Android implementation broke the build, but this was not surfaced by diff tests.
Reproed the breakage with
```
$ fbpython scripts/build_android_app/build_android_app.py --buck-config-files='@//fbandroid/mode/have_libgflags @//fbandroid/mode/static_linking @//xplat/langtech/mobile/android_opt_buck_config_with_et_boltnn' --build-target='fbsource//xplat/langtech/mobile:transcribe_binAndroid-android-arm64'
```
Verified that the fixed diff builds successfully.
Differential Revision: D57275456
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126064
Approved by: https://github.com/ezyang
Fixes#123698
This PR makes TensorImpl::has_symbolic_sizes_strides return false for NestedTensors.
1. It passes in the actual sizes when we call `_make_wrapper_subclass` - this is the change that makes the subclass register as `has_symbolic_sizes_strides() == True`
2. It adds a field to `_make_wrapper_subclass` where an explicit `numel` can be provided. This allows us to skip the numel computation for the storage, which previously failed due to arithmetic on NestedInts.
3. Implements `aten::numel` for NJT - this is separate from the overridden numel in `make_wrapper_subclass` for now. Note also that this means that we leave `dispatch_sizes_strides_policy="sizes"`, so that we call into the custom `numel` implementation (as well as `sizes` and `strides`), because `numel` cannot currently be computed from `sizes` for NJT.
Note also that this depends on #121361, because calling TensorImpl::set_sizes_and_strides() tries to clone the sizes into the tensor, which means that we need `clone` to be implemented on NestedInt.
Differential Revision: [D57225736](https://our.internmc.facebook.com/intern/diff/D57225736)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124687
Approved by: https://github.com/albanD
Summary: Fixed typo in documentation. Trying to get familiar with the PR workflow for contributing to PyTorch.
Test Plan: None
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125974
Approved by: https://github.com/ezyang
Fix: #125387
This PR keeps track of whether an instantiated `ViewMeta` has symbolic values as input or not. This is used for checking whether to use the AOTAutograd `ViewMeta`-replay execution path, which doesn't support tensors that have `ViewMeta` with symbolic inputs.
In summary, the changes are:
- Add the field `ViewMeta::has_symbolic_inputs` and make it a required constructor
parameter
- Add the field `FunctionalTensorWrapper::is_symbolic_` and the method
`FunctionalTensorWrapper::maybe_mark_symbolic`
- Marks a `FunctionalTensorWrapper` as symbolic iff any of its `ViewMeta` have
symbolic inputs
- Add the plumbing of `FunctionalTensorWrapper::is_symbolic` to the Python API
- Codegen the computation of `ViewMeta::has_symbolic_inputs` for each view operation
- Use the AOTAutograd `ViewMeta`-replay path if:
- `target_functional_tensor` is not `None`; and
- `target_functional_tensor` is not symbolic (instead of using a functorch config)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125876
Approved by: https://github.com/ezyang
Turning on guard_nn_modules adds a large number of guards, so we are bound to take a perf hit, but the perf hit is small. These are the numbers:

First we observe that compared to Python guards, C++ guards give around 6x speedup. This reduces the total time spent in guards. This is shown in the last column (cpp_guards/inductor_optimized_latency). The worst model is around 1.61%, with most of the models below 1%. I think this is good enough signal to turn the config on.
One might also wonder how much guard slowdown occurs with `guard_nn_modules=True`. This is the table

For most models, the guard overhead with nn module guards is under 2x. There are a few outliers where the slowdown is really high, and for those models we spend 1%-2% of the time in C++ guards, as shown in the first table.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125202
Approved by: https://github.com/ezyang
Since we will now support `capturable=False` when it's valid, narrow the eager fallback conditions to the cases where `compile` will fail. The lone case here is when the user deletes the capturable flag while `state_steps` are on CUDA and `capturable` is `False`. Because a CUDA tensor is not supported in the `value` kwarg for foreach ops, this results in an error.
The fallback wrapper is changed to check the device of `state_steps` if `capturable=False`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125825
Approved by: https://github.com/janeyx99
Summary: Since the table caffe2_pytorch_usage_stats has only 1-day retention, which renders it useless for TS migration purposes, we want to build a lightweight counter mechanism to collect usage data about torch JIT APIs so that we can monitor the usage decline over the long term.
Test Plan: CI
Reviewed By: SherlockNoMad
Differential Revision: D57216847
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125986
Approved by: https://github.com/gmagogsfm
_extract_graph_with_inputs_outputs() does membership testing on the input nodes, but that collection is often a list, so each test is O(n). Ensure it's a set before looping over all the nodes (sketched below).
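A minimal sketch of the idea (hypothetical names, not the actual function):
```python
def extract_nodes(all_nodes, input_nodes):
    # Convert once so the per-node membership test is O(1) instead of O(n).
    input_set = input_nodes if isinstance(input_nodes, set) else set(input_nodes)
    return [node for node in all_nodes if node in input_set]
```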
This change speeds up the internal repro (D57090987) by about 18%:
before:
```
708.88user 15.86system 12:16.19elapsed 98%CPU (0avgtext+0avgdata 12898628maxresident)k
0inputs+91968outputs (3major+3532970minor)pagefaults 0swaps
```
after:
```
583.39user 15.98system 10:10.11elapsed 98%CPU (0avgtext+0avgdata 12895108maxresident)k
0inputs+87488outputs (4major+3374582minor)pagefaults 0swaps
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125937
Approved by: https://github.com/oulgen, https://github.com/anijain2305
Add the missing documentation for `initial_accumulator_value` parameter in Adagrad, and update the algorithm description in the documentation (adjusted to reflect the implementation).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125886
Approved by: https://github.com/janeyx99
Summary: Register fake impls for quantized embedding bag ops (e.g. quantized::embedding_bag_4bit_rowwise_offsets) and bypass registration if one has already been registered.
Test Plan:
Before:
```
NotImplementedError: quantized::embedding_bag_4bit_rowwise_offsets: attempted to run this operator with Meta tensors, but there was no fake impl or Meta kernel registered
```
See context here -
https://fb.workplace.com/groups/1075192433118967/permalink/1423106614994212/
After:
Snapsoht was published successfully with PT2Archive.
```
AIMP_DISABLE_PRUNING=false fdb buck2 run mode/opt-split-dwarf -c python.package_style=inplace -c fbcode.enable_gpu_sections=true lego/scripts:lego_cli -- debug-locally --model_entity_id 545861329 --config_version 14 --publish_context OFFLINE_PUBLISH --lego_pipeline aiplatform.modelstore.model_generation.lego.lego_pipeline_builder.gmpp_lego_pipeline --gmpp_config '{"gmpp_pipeline_descriptor": "aiplatform.modelstore.model_generation.v1.ads_pipelines.aimp_pyper_pipeline.model_generation_pipeline", "worker_process_number":24, "worker_thread_per_process_number": 12, "use_work_assignment": true}' --publish_config_overrides '{"gpu_inference_options": "{\"submodules_to_lower\": []}"}' 2>&1 | tee ./gmpp_lc_aimp.txt
```
Reviewed By: ydwu4
Differential Revision: D57172944
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125884
Approved by: https://github.com/ydwu4
Summary: This is for JIT Inductor with cpp wrapper, fixing https://github.com/pytorch/pytorch/issues/117367.
In the backward pass, we don't have real inputs with which to execute the pass and autotune kernels. We have 3 options here: 1) use random tensor inputs; 2) store the forward outputs and feed them to the backward (non-trivial because of parameter re-ordering); 3) autotune each kernel with random inputs in a subprocess (similar to select_algorithm). This PR uses the easiest, option 1. Option 3 is where we are going as the next step, which will simplify the cpp wrapper codegen for the CUDA backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125291
Approved by: https://github.com/chenyang78, https://github.com/angelayi
This adds implementations for:
* _flash_attention_forward
* _efficient_attention_forward
* _flash_attention_backward
* _efficient_attention_backward
These flop counts are implemented as follows:
* Unbind the batch elements
* Calculate flops individually for each element in the batch
* Sum the final result
This means that we are accessing the concrete sequence lengths (which could be slow, and may trigger a GPU/CPU sync); but, the FLOP numbers will vary with the sparsity of the NestedTensor - more accurate than if we just assumed we padded everything.
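A rough sketch of the per-element counting strategy (illustrative only; it counts just the two attention matmuls for the forward pass and ignores softmax and the backward formulas):
```python
def attention_flops_one_element(q_len: int, kv_len: int, head_dim: int) -> int:
    qk = 2 * q_len * kv_len * head_dim  # q @ k^T, multiply-add counted as 2
    av = 2 * q_len * kv_len * head_dim  # attn @ v
    return qk + av

def nested_attention_flops(q_lens, kv_lens, head_dim, num_heads) -> int:
    # Unbind the batch, count each element with its real sequence lengths, sum.
    return num_heads * sum(
        attention_flops_one_element(ql, kl, head_dim)
        for ql, kl in zip(q_lens, kv_lens)
    )
```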
Differential Revision: [D57120139](https://our.internmc.facebook.com/intern/diff/D57120139)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125776
Approved by: https://github.com/Chillee
It's kind of gross that aot_synthetic base requires storing the *old* fw_metadata's InputInfo, but it is what it is. After this change, aot_dispatch_base's runtime wrappers should all be implemented. Next, I'll start working on aot_dispatch_autograd's remaining runtime wrapping changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125764
Approved by: https://github.com/bdhirsh
ghstack dependencies: #125610
yolo
IIRC the a10g/sm86 runners have ~21 GB of memory, so we can increase parallelism on them to 3. This results in about 6 GB of CUDA memory per proc; the previous calculation + 2 procs resulted in about 8 GB.
Also fixes the calc for per-proc memory, assuming that the CUDA context + anything else takes a little under 1 GB of space (the previous calc was .11 on about 7.5-8 GB <= .9 GB).
Times on main are about 1.9-2.5hr per shard
This commit is around 1.6-2hr per shard
Risks: increase in flaky tests due to OOM
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125598
Approved by: https://github.com/huydhn
More details further down, but first a more high-level description of "how do we functionalize storage resizing"
Today, dynamo converts `param.untyped_storage().resize_(x)` calls that it sees from fsdp into a custom op, `ops.inductor.resize_storage_bytes_(x)`
So given this setup, there are 3 main cases that I think we want to handle:
(1) graph input starts with a real storage size, gets resized down to zero in the graph
(2) graph input starts with 0 storage size, gets resized up in the graph
(3) graph input starts with 0 storage size, gets resized up and used in some compute, then resized back down to 0
For case (1) we need to emit a `resize_storage_bytes_` at the end of the graph, similar to how we emit `copy_()` for data mutations.
For case (2), we need to emit a `resize_storage_bytes_` in the graph, and we **also** need to emit a `copy_()` (the input had its storage resized up, and filled in with data, which is we need to reflect as an input mutation)
For case (3), the net effect is that the input had no data on entry and exit of the function, so we don't need to emit any mutable ops in the end of the graph.
The main thing to call out is that: we need to write a functionalization rule for `resize_storage_byte_`, (`FunctionalTensorWrapper::storage_resize_()`) and this rule actually does very little. We would like to **not** emit any new ops in the graph (like say, a functional resize op). Instead, we should expect / rely on the fact that any resize up will be immediately followed by a `copy_()`/`foreach_copy_`/`out=` op, that will fill in the data of the tensor. So `FunctionalTensor` can temporarily live in a state where its data is invalid, until the `x.copy_(y)` "updates" its data with the new tensor.
So effectively, all that this rule does is:
(1) it stores metadata on the storage, indicating that the tensor was resized, as well as the updated storage size. We need this info in AOTAutograd, so it knows whether to emit a mutable resize_() op in the graph epilogue
(2) There is also a corner case: if we are resizing down to zero, but our tensor had **previously** had a zero size storage, then we update `value_` to point to the original value of the tensor. The reason this seems safe is because if we have a zero storage sized tensor `x`, and we resize it up, use it in some compute, resize it back down to zero, and use it somewhere, we would want the functional version of this code to use the original `x` after the second resize. For FSDP, this is important because we end up saving parameters (graph inputs) for backward, and we want to make sure that the thing we save (and the output to the forward graph) is the original, zero-storage-sized parameter, and not the "version 2" of the parameter after the first resize_()
I think a good order to look at changes in this PR would be:
(1) `test_aotdispatch.py` shows the 3 main cases I focused on as well as the expected functionalized graphs
(2) In `FunctionalStorageImpl.h/cpp`, I had to add a notion of "original base", and "original/curr_size". The first is so I can re-use the zero-size tensor after multiple resizes, and the second is so I can tell in AOTAutograd whether any resizes canceled each other out into a no-op
(3) FunctionalTensorWrapper.h/cpp has the new resize functionalizion rule + some extra utils
(4) `_functorch/_autograd`: the main changes in this folder were around adding the logic at trace-time to detect when we need to put a resize_() in the graph. I also have some assertions to check that any inputs that experience storage resizing will **always be in the graph** and not the opaque epilogue, and I also limited the resize_() mutation case so that you can only ever start with zero storage, or end with zero storage (you can't do e.g. `torch.ones(2).storage().resize_(3)`), and banned it on tensor subclasses
(5) `fake_tensor.py`/`meta_utils.py`: we now need to be able to fakeify tensors with zero storage, so I added a quick version of it in meta_utils.py. This also.. has ramifications for fake tensor caching that I need to fix (include the storage size on the cache key, maybe?)
------------------
This PR subsumes https://github.com/pytorch/pytorch/pull/120971.
This PR is enough to **almost** get a simple ppFSDP forward pass tracing with a functionalized resize_() properly. It also attempts to do the updated version from @jansel, where we don't have any notion of `resize_()` in the graph at all, post functionalization. It would probably be good to test it with @yf225 's FSDP changes, and see how many of the FX passes it allows us to remove. I think that in theory, it should allow us to remove all FX passes that affect the forward graph / partitioner, **except** the one that forces views to be recomputed in the backward (more details below).
There are a few things worth calling out:
(1) failed attempt at functionalizing `aten.copy_()`. I originally wanted to get a version takes these operations:
```
param.storage().resize_(all_gather_size)
param.copy_(all_gather_buffer)
out = aten.matmul(param, param)
```
and functionalizes them into:
```
out = aten.matmul(all_gather_buffer, all_gather_buffer)
```
This would involve getting functionalization to turn `x.copy_(y)` into a giant no-op that just returns `y`. Unfortunately, we can't actually do this in a reasonable way within functionalization (instead, there's a functional `aten.copy` in the graph - see the test case graph expecttest for details). Why? In order for that transformation to be safe, `x` and `y` need to have the same metadata. However, it's possible for `x` and `y` to be subclasses of different types. This is not something we can easily tell from within functionalization, and would be a layering violation. So for now I'm leaving it to downstream code to optimize away the `aten.copy` (this is already the case today, so I think inductor can handle this)
(2) The forward doesn't **actually** run successfully in this PR (see the `assertRaisesRegex` in the test). Why?
The final forward graph looks like this:
```
def forward(self, primals_1, primals_2):
_foreach_copy = torch.ops.aten._foreach_copy.default([primals_1], [primals_2]); primals_2 = None
getitem = _foreach_copy[0]; _foreach_copy = None
mm = torch.ops.aten.mm.default(getitem, getitem); getitem = None
t_1 = torch.ops.aten.t.default(primals_1); primals_1 = None
return [mm, t_1]
```
Where `primals_1` starts out as a secretly-zero-storage-size parameter, and gets resized up and back down within the forward (these are functionalized away).
Importantly, the matmul happens on the result of the `foreach_copy`, **but** the activation that we save for backward (`t_1`) is the result of transposing the **original parameter** (the zero-storage-size param). This is exactly the optimization in fsdp that allows us to have good peak memory usage.
The problem is that the min-cut partitioner decides to save `t_1` for backward. Running this code in eager breaks, because the kernel for `aten.permute(x)` is not happy when `x` has secretly-zero-sized-storage.
The real problem here is that in eager mode the `permute` kernel runs during the backward, after backward hooks have properly resized the saved activation. Here, we are running the transpose in the forward.
One option would be to turn off the checks in our view kernels and allow them to work on zero-storage-sized tensors, which feels pretty bad. Another option is to tweak the partitioner (or use one of Will's FX passes) to force the partitioner to not save views for backward, and allow the views to be recomputed in the backward. This seems kind of silly, but is also probably harmless.
(3) The backward is still broken. To be fair, this issue is pretty separable from "functionalizing storage resize calls", and can be fixed later (either by a real fix to our tracing infra, or via another hacky FX pass). More description of this problem is described at issue (8) of my PR description in https://github.com/pytorch/pytorch/pull/120971
(4) I only added support for "full graph" resizing: basically, the limited case where a param starts with zero storage size, and gets resized up and back down. I think we can add support for the graph break case, but I think we can keep that add-on separate from this PR unless we need it immediately. I also added asserts so we should fail loudly when we hit this case
(5) I have a change to FakeTensor creation when inputs have zero storage size that.. is probably ok. But I also removed FakeTensor caching on view ops, which I probably need to fix before I can land this PR
(6) I added a notion of "original_base" to `FunctionalStorageImpl`. More details are in the comments, but my rational for this was that we basically need it to ensure that autograd saves the **original**, zero-storage-sized param for backward, after resizing up and back down
(7) I had to update our eager kernels for `aten.copy` and `aten._foreach_copy`, to handle the case where the `self` argument has secretly-zero-storage. Inductor can probably generate correct code for this case, but we need these ops to work properly in this situation for the `aot_eager` backend to do the right thing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122434
Approved by: https://github.com/jansel
This PR switches export IR from aot-dispatch to pre-dispatch IR.
**What is pre-dispatch IR and why should you care?**
Currently the default IR returned by torch.export can contain only functional ATen operators after ALL pytorch dispatcher decompositions (for example, CompositeImplicitAutograd) run.
In contrast, pre-dispatch IR refers to an IR that can contain all functional ATen operators (i.e., not just from the core subset), before any decomposition happens, as well as operators that manipulate autograd state. Pre-dispatch IR closely resembles eager PyTorch computation, but is still functional and serializable by torch.export. As a result:
- You can train the pre-dispatch IR in eager mode as the IR contains necessary information for the autograd engine to automatically generate a backward graph.
- You can write sound graph transformations more easily as the IR is functional.
- Since it is an ATen IR, it is still normalized. For example, torch.add has multiple overloads, but aten.add.Tensor is unique in this IR.
If you want to get the core aten IR out of torch.export, you will need to:
```
ep = torch.export.export(M(), inputs)
ep_for_core_aten = ep.run_decompositions()
```
Differential Revision: [D57172986](https://our.internmc.facebook.com/intern/diff/D57172986)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125860
Approved by: https://github.com/zhxchen17
Summary: #125682 (D56586844) added support for lazy symbolization to `Error` and adopted it for internal use cases; this commit adopts it for `get_backtrace()` as well.
Test Plan: Sandcastle and GH CI.
Differential Revision: D56881683
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125750
Approved by: https://github.com/ezyang
This PR introduces a tool to dynamically switch between ARC runners and old runners without having to update the PR to the latest version.
There is also a third option - use both runners at the same time (aka shadow deployment). In this case, failed workflows using ARC launchers will not block the merge process.
The GitHub issue is used to control access to ARC launchers - [Access Rules Issue](https://github.com/pytorch/test-infra/issues/5132):
* In the FIRST comment you can specify who will use the ARC runners:
* Add a GitHub username to use ARC runners.
* Add "*" at the beginning to switch ALL users to ARC runners.
* Add "!" at the beginning to switch ALL users to old runners.
* In the SECOND comment you can specify do we need to run ARC runners and old runners at the same time.
* To use both runners, add a second comment with the word "both".
* If we want to use only one type of runners, just remove the second comment.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125680
Approved by: https://github.com/ZainRizvi
Fix for https://github.com/pytorch/pytorch/issues/122871. There are two cases where we emit pointwise cat:
- fusing into a pointwise use
- horizontally fusing copy_ kernels
The regression I looked into previously was due to being overly aggressive in the latter case. I've updated the logic there so that we only emit the horizontal fusion in the case that we would have to emit separate copy_ kernels anyway.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125772
Approved by: https://github.com/Chillee
This PR is part of a series of PRs to significantly speed up torch.onnx.export for models with many nodes (e.g. LLM). See #121422 for more analysis.
- As part of torch.onnx.export, a reverse look-up is made in env. This is done for each node, and the look-up costs time proportional to the graph size, which incurs an overall O(N^2) time complexity.
- A pragmatic solution is simply to keep a separate data structure to make this de facto constant time, so this introduces a set containing all the values of env (see the sketch after this list). Open to other ideas. Ideally `exist_in_env` wouldn't be needed at all, but to preserve the current behavior exactly I'm not sure how that can be done.
- Resolves (4) in #121422.
- This code change and the choice of py::set looks a bit more natural on top of #123063, where the env is changed from a std::unordered_map to a py::dict.
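A minimal sketch of the bookkeeping (hypothetical names; the real change lives in the exporter's env handling):
```python
env: dict = {}           # node -> exported value
env_values: set = set()  # mirror of env's values for O(1) reverse lookup

def record(node, value) -> None:
    env[node] = value
    env_values.add(value)

def exist_in_env(value) -> bool:
    # Previously this required scanning env's values for every node: O(N) each time.
    return value in env_values
```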
Partially fixes #121422
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124909
Approved by: https://github.com/srikris-sridhar, https://github.com/justinchuby
When dispatching a fake tensor op we cache the result with `(op, args)` as the key. There are some args (such as ones with a dynamic output shape) where the output can't be cached. Instead of validating the args every time we compute a cache entry, only validate the args when we first see a new cache key.
18.3% FakeTensor perf win on the microbenchmark (21.7% cumulative)
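A minimal sketch of the caching policy (hypothetical names, not the FakeTensor code itself):
```python
cache: dict = {}
known_uncacheable: set = set()

def cached_dispatch(op, args, compute, can_cache):
    key = (op, args)
    if key in cache:
        return cache[key]
    if key in known_uncacheable:
        return compute(op, args)
    # Run the expensive argument validation only the first time we see this key.
    if not can_cache(op, args):
        known_uncacheable.add(key)
        return compute(op, args)
    result = compute(op, args)
    cache[key] = result
    return result
```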
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124223
Approved by: https://github.com/oulgen, https://github.com/masnesral
ghstack dependencies: #122911
Summary: This change introduces a new flag to perform a "warm start" test from the benchmark harness. The idea is to test a model twice: first with a fresh inductor cache (i.e., a "cold start"), and then a second run in a fresh process with the cache available (i.e. a "warm start"). We can later add this mode to CI runs to collect compile times for warm start.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125353
Approved by: https://github.com/eellison, https://github.com/desertfire
Move the gha artifact download to before any xml parsing is done for upload-test-stats.
Do not download gha artifacts during xml parsing since they got uploaded to s3 in the step above and will be downloaded when all the artifacts are downloaded from s3.
The previous method resulted in dups if you ran the script again.
TODO: write a deduper so we don't have to worry at all
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125609
Approved by: https://github.com/huydhn
Summary: This fix does three things:
1. When we add inputs from the partitioner to the top-level graph module, we insert them in the partitioner's order, which is not guaranteed to be the same as the original graph inputs. This PR fixes that.
2. When we replace autograd ops with HOPs, we create new submodules and access their outputs via getitem calls. As a result, previous node names associated with getitem get updated, resulting in the graph being different from the produced graph signature. So I just update the graph signature accordingly.
3. We run the runtime_assertion pass before the autograd HOP pass because otherwise the constraints won't be populated correctly.
Differential Revision: [D57130314](https://our.internmc.facebook.com/intern/diff/D57130314)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125793
Approved by: https://github.com/zhxchen17
Fixes #69031, #42793
This PR fixes the bug introduced in #54981 where parameters used within a `no_sync` scope are not respected when `find_unused_parameters` is set to `True`. The `local_used_map_` and `numGradHooksTriggeredMap_` variables should be updated regardless of the `no_sync` state.
Tested and verified with fairseq2 and wav2vec2 ASR finetuning recipe. All gradients are correctly synced across workers as expected after applying this fix.
Co-authored-by: Kaushik Ram Sadagopan <kaushikram2811@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124193
Approved by: https://github.com/rohan-varma
Gets rid of the following warning:
```
/Users/shenke/workspace/pytorch/test/test_mps.py:9229: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
if base.storage().data_ptr() != other.storage().data_ptr():
```
(noticed while looking at https://github.com/pytorch/pytorch/issues/96153#issuecomment-2101876484 )
The respective change to view ops landed back in 2022; see https://github.com/pytorch/pytorch/pull/91414
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125838
Approved by: https://github.com/albanD
**Summary**
Per the discussion in https://github.com/pytorch/pytorch/pull/123444, the `decomposed quant/dequant` patterns changed after https://github.com/pytorch/pytorch/pull/123445. We can move the optimization of `decomposed quant/dequant` from inductor decomposition into the lowering phase to avoid these changes. In this way, we can:
- Avoid the pattern matcher failure introduced in https://github.com/pytorch/pytorch/pull/123445
- Make the quantization pattern clearer in the pattern matcher phase, since the `quant/dequant` nodes have not been decomposed.
**Changes in this PR**
- Move optimization of `decomposed quant/dequant` from inductor decomposition into lowering phase.
- Corresponding changes in the quantization pattern matcher to ensure no bc-breaking.
**TestPlan**
```
python -u -m pytest -s -v test/inductor/test_mkldnn_pattern_matcher.py -k test_q
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124041
Approved by: https://github.com/peterbell10, https://github.com/jgong5
Summary:
It seems like the required functions are not available due to the `_MSC_VER` guard. Does anyone have more context on why this functionality has been disabled for Windows?
I'm also unsure how this currently compiles in OSS land on Windows, as there doesn't seem to be any preprocessor protection around `scaled_gemm` getting pulled in.
Test Plan:
Fix compilation errors like this
```
C:\open\fbsource\xplat\caffe2\aten\src\ATen\cuda\tunable\TunableGemm.h(74): error C2039: 'scaled_gemm': is not a member of 'at::cuda::blas'
C:\open\fbsource\xplat\caffe2\aten\src\ATen\cuda\CUDABlas.h(19): note: see declaration of 'at::cuda::blas'
C:\open\fbsource\xplat\caffe2\aten\src\ATen\cuda\tunable\TunableGemm.h(74): note: the template instantiation context (the oldest one first) is
C:\open\fbsource\xplat\caffe2\aten\src\ATen\cuda\tunable\TunableGemm.h(71): note: while compiling class template 'at::cuda::tunable::DefaultScaledGemmOp'
Action failed: fbsource//xplat/caffe2:ATen_cuda_lib_ovrsource (cxx_compile aten/src/ATen/native/cuda/Blas.cpp)
```
Differential Revision: D57087985
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125792
Approved by: https://github.com/malfet, https://github.com/eqy
Summary:
The macros that build `c10::Error` compute the stack trace at the point of throwing, which is then returned as part of `what()`. If `what()` is never called, which is the case for most exceptions (since logging is throttled), the cost of computing the stack trace is wasted.
By far, the most expensive part of computing the stack trace is its symbolization; just unwinding the stack and collecting the instruction addresses is comparatively cheap. We can thus defer the symbolization to first invocation of `what()`.
Test Plan:
Added unit tests exercising the lazy nature of `what()`.
Ran an adfinder canary: https://www.internalfb.com/intern/ads/canary/460118801509424346
We can see that the cost of symbolization is obliterated (meaning that `what()` is virtually never called, as expected):
{F1496627896}
Differential Revision: D57128632
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125787
Approved by: https://github.com/huydhn
**Summary**
Per the discussion in https://github.com/pytorch/pytorch/pull/123444, the `decomposed quant/dequant` patterns changed after https://github.com/pytorch/pytorch/pull/123445, we can move the optimization of `decomposed quant/dequant` from inductor decomposition into lowering phase to avoid the changes. In this way, we can:
- Avoid the pattern matcher failure introduced in https://github.com/pytorch/pytorch/pull/123445
- Make the quantization pattern clearer in the pattern matcher phase, since the `quant/dequant` nodes have not been decomposed.
**Changes in this PR**
- Move optimization of `decomposed quant/dequant` from inductor decomposition into lowering phase.
- Corresponding changes in the quantization pattern matcher to ensure no BC-breaking changes.
**TestPlan**
```
python -u -m pytest -s -v test/inductor/test_mkldnn_pattern_matcher.py -k test_q
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124041
Approved by: https://github.com/peterbell10, https://github.com/jgong5
Enables LRScheduler to handle tensor LRs.
Note on test changes:
For the test modifications I just removed itertools.product and created two loops. This allows us to create a new set of optim_inputs on each iteration to prevent mutations on the tensor LR carrying over across iterations. Nothing else in those tests was modified.
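A minimal usage sketch of what this enables (the optimizer and scheduler choices below are illustrative, not taken from the PR; `foreach=False` is passed just to stay on the single-tensor path in this sketch):
```python
import torch

param = torch.nn.Parameter(torch.randn(4))
# the learning rate is a scalar tensor rather than a float
opt = torch.optim.Adam([param], lr=torch.tensor(0.1), foreach=False)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=2, gamma=0.5)

for _ in range(4):
    opt.zero_grad()
    param.sum().backward()
    opt.step()
    sched.step()
    # the LR stays a tensor and is now stepped by the scheduler (halved every 2 steps)
    print(opt.param_groups[0]["lr"])
```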
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123753
Approved by: https://github.com/janeyx99
ghstack dependencies: #123751, #123752
This PR fixes `torch.backends.xeon.run_cpu` behavior when it is launched from `torchrun` with the `--nproc-per-node` parameter.
As a CPU launcher, `run_cpu` would bind cores to each instance it launches using `numactl`, and assign cores to each instance evenly.
However, we may use `torchrun` to start `run_cpu` and use `--nproc-per-node` to create multiple `run_cpu` processes. In this case, each `run_cpu` process assumes it can use all the CPU cores, which causes the `run_cpu` processes to compete for CPU cores. This results in poor performance.
This PR recognizes the environment variables `LOCAL_WORLD_SIZE` and `LOCAL_RANK` set by `torchrun`, then uses this information to further shard the cores bound to each instance. With this PR, when launched by `torchrun --nproc-per-node ...`, different CPU cores will be bound to different workers, which maximizes CPU utilization and application performance.
The specific use case this PR enables is using TorchServe with DeepSpeed tensor parallelism. In this case, TorchServe runs `torchrun --nproc-per-node <tp_size>` to start the tensor parallel workers it needs. When running TorchServe on a multi-socket CPU server with DeepSpeed tensor parallelism, we need this PR to achieve the best performance.
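For illustration, a rough sketch (not the actual `run_cpu` code) of how the `torchrun`-provided environment variables can be used to shard cores across co-located workers:
```python
import os

def shard_cores(all_cores):
    # torchrun exports these for every local worker it spawns
    local_world_size = int(os.environ.get("LOCAL_WORLD_SIZE", "1"))
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    per_worker = len(all_cores) // local_world_size
    start = local_rank * per_worker
    return all_cores[start:start + per_worker]

# e.g. with 32 cores and `torchrun --nproc-per-node 4`, LOCAL_RANK=1 gets cores 8..15,
# which the launcher can then hand to numactl for binding.
print(shard_cores(list(range(32))))
```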
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123711
Approved by: https://github.com/jingxu10, https://github.com/ezyang
This PR adds the autotuning infrastructure for CPU. It generalizes and extends `BenchmarkRequest` with CPU support and a C++ module loader. A `do_bench_cpu` util function is added for benchmarking functions on CPU; it performs warmups and returns the median over multiple trials.
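A rough sketch of the benchmarking idea behind `do_bench_cpu` (warmup runs, then the median over several timed trials); the function below is illustrative, not the actual implementation, and the warmup/trial counts are made up:
```python
import time
import statistics

def bench_cpu(fn, warmup=5, trials=20):
    for _ in range(warmup):          # warm up caches and any lazy initialization
        fn()
    times_ms = []
    for _ in range(trials):
        t0 = time.perf_counter()
        fn()
        times_ms.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(times_ms)   # median is robust to outlier trials
```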
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125159
Approved by: https://github.com/jansel
Summary:
github issue: https://github.com/pytorch/pytorch/issues/73828
Whenever we transition from RECORD_AND_SAVE to WARMUP in the profiler schedule, we instantiate a new backend profiler which wipes out the last cycle's information. This makes using the `repeat` parameter less useful in the schedule, as you only get the contents of the last cycle/repeat. In this diff, we save the accumulated Function Events before setting the new ones and then merge the two EventLists after post-processing/cleaning is done. This diff only fixes Function Events so that we can get statistics over each cycle within a schedule. A follow-up should be made to accumulate the Chrome traces as well if requested.
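For reference, a minimal illustration of a repeating profiler schedule (the workload here is made up); after this change, the collected Function Events cover every cycle rather than only the last repeat:
```python
import torch
from torch.profiler import profile, schedule, ProfilerActivity

sched = schedule(wait=1, warmup=1, active=2, repeat=3)  # 3 cycles of 4 steps each

with profile(activities=[ProfilerActivity.CPU], schedule=sched) as prof:
    for _ in range(3 * 4):
        torch.randn(64, 64) @ torch.randn(64, 64)
        prof.step()

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```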
Test Plan: Added functional python tests in test_profiler.py that test different schedules and their FunctionEvent counts
Differential Revision: D56956245
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125510
Approved by: https://github.com/aaronenyeshi
```
python test/test_fx.py -k test_public_api_surface
```
was failing with a complaint about infinite recursion. Fixed that and then marked the two API changes from #123681 as private (for `get_example_value`) and backward compatible (for `insert_deferred_runtime_asserts`).
Fixes #104012
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125706
Approved by: https://github.com/BoyuanFeng
## Problem this PR resolves
Today, most of distributed tests are arranged like this:
```
def test_allreduce(self):
    pg = self._create_process_group_nccl(store, self.opts())
    pg.allreduce(tensor)
    ...
```
Thus, we are paying PG creation time **per test**. That's bad. But why were we doing that? Is there a constraint?
If we look deeper, we would find that most of our test cases inherit from `torch.testing._internal.common_distributed.MultiProcessTestCase`. From the name, nothing seems wrong, and probably fits distributed well. But a "problem" exists in its `setUp()` and `tearDown()` methods, which basically do the following:
```
def setUp(self):
    self._spawn_processes()
def tearDown(self):
    for p in self.processes:
        p.terminate()
```
Since `setUp` and `tearDown` are **"test-scope fixtures"**, meaning they are called per test, each test will have brand-new processes. Of course we'd have to recreate the ProcessGroup every time.
## How we are fixing it
First, obviously, we need to put a PG's lifetime into a longer scope. Python `unittest` provides such a helper, called **"class-scope fixtures."** It is embodied by a `setUpClass` method and a `tearDownClass` method (note the name difference), which are called only once for all tests in the same test class. Therefore, we would do:
```
@classmethod
def setUpClass(cls):
    dist.init_process_group(...)
@classmethod
def tearDownClass(cls):
    dist.destroy_process_group()
```
**In this PR, we create a new test template for distributed: `MultiProcContinousTest`, to hold this class-scope fixture.**
Second, we'd need to avoid per-test process spawn and terminate. That's easy, we can either:
1. launch the whole test file with `torchrun --nproc-per-node=...` or
2. use `mp.spawn()` under `if __name__ == "__main__":`.
Point is, launch the processes only once.
## Result
We moved the "positive tests" from test_c10d_nccl.py to test_c10d_ops_nccl.py.
Before this PR:
```
$ python test_c10d_nccl.py -k ProcessGroupNCCLTest
Ran 24 tests in 174.457s
```
After this PR:
```
$ torchrun --nproc-per-node 2 test_c10d_ops_nccl.py
or
$ python test_c10d_ops_nccl.py
Ran 24 tests in 16.247s
```
10X speedup.
## Limitation
For tests intended to test destroy or abort of PGs, we'd need to go back to the old style. So it would make sense to divide our tests into two classes: one for positive tests where we would reuse the PGs, and the other one for abort/destroy and negative tests like watchdog timeout.
## Next step
Migrate the tests of distributed that would fit with this test style!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125648
Approved by: https://github.com/wconstab
I am building PyTorch with the Intel oneAPI 2024.0 compiler and without cuSparseLt, and encountered various type errors of the following forms:
```
[ 63%] Building CUDA object caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu.o
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/ComputeSparseTile.h(87): error: no operator "=" matches these operands
operand types are: cutlass::uint2b_t = int
detected during:
instantiation of "at::native::Indices4x4 at::native::LargestValuesGreedy<Op>::operator()(Tile4x4Accessor) [with Op=at::native::IdentityOp, Tile4x4Accessor=at::native::KernelTypes<cutlass::half_t>::Tile4x4Accessor]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredPack.h(349): here
instantiation of "void at::native::KernelTypes<Element_>::sparse_semi_structured_tile_kernel(at::native::KernelTypes<Element_>::Params, MetadataStore, Algorithm) [with Element_=cutlass::half_t, Algorithm=at::native::LargestValuesGreedy<at::native::IdentityOp>, MetadataStore=at::native::MetadataCutlass]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(201): here
instantiation of "void at::native::sparse_semi_structured_tile_kernel<KT,Metadata,Algorithm>(KT::Params, Metadata, Algorithm) [with KT=at::native::KernelTypes<cutlass::half_t>, Metadata=at::native::MetadataCutlass, Algorithm=at::native::LargestValuesGreedy<at::native::IdentityOp>]"
(177): here
instantiation of "void at::native::named_algorithms(T) [with T=lambda [](auto, const std::__cxx11::string &)->auto]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(265): here
instantiation of "std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor> at::native::sparse_semi_structured_tile_typed<Element,MetadataFormat>(at::Tensor, std::__cxx11::string) [with Element=cutlass::half_t, MetadataFormat=at::native::MetadataCutlass]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(293): here
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/ComputeSparseTile.h(88): error: no operator "=" matches these operands
operand types are: cutlass::uint2b_t = int
detected during:
instantiation of "at::native::Indices4x4 at::native::LargestValuesGreedy<Op>::operator()(Tile4x4Accessor) [with Op=at::native::IdentityOp, Tile4x4Accessor=at::native::KernelTypes<cutlass::half_t>::Tile4x4Accessor]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredPack.h(349): here
instantiation of "void at::native::KernelTypes<Element_>::sparse_semi_structured_tile_kernel(at::native::KernelTypes<Element_>::Params, MetadataStore, Algorithm) [with Element_=cutlass::half_t, Algorithm=at::native::LargestValuesGreedy<at::native::IdentityOp>, MetadataStore=at::native::MetadataCutlass]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(201): here
instantiation of "void at::native::sparse_semi_structured_tile_kernel<KT,Metadata,Algorithm>(KT::Params, Metadata, Algorithm) [with KT=at::native::KernelTypes<cutlass::half_t>, Metadata=at::native::MetadataCutlass, Algorithm=at::native::LargestValuesGreedy<at::native::IdentityOp>]"
(177): here
instantiation of "void at::native::named_algorithms(T) [with T=lambda [](auto, const std::__cxx11::string &)->auto]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(265): here
instantiation of "std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor> at::native::sparse_semi_structured_tile_typed<Element,MetadataFormat>(at::Tensor, std::__cxx11::string) [with Element=cutlass::half_t, MetadataFormat=at::native::MetadataCutlass]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(293): here
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredPack.h(238): error: function "lambda [](cutlass::uint2b_t, cutlass::uint2b_t)->void" cannot be called with the given argument list
argument types are: (int, int)
object type is: lambda [](cutlass::uint2b_t, cutlass::uint2b_t)->void
detected during:
instantiation of "at::native::KernelTypes<Element_>::Tile4x4Packed at::native::KernelTypes<Element_>::pack_4x4(at::native::Indices4x4, at::native::KernelTypes<Element_>::Tile4x4Accessor, uint32_t &, int, __nv_bool) [with Element_=cutlass::half_t]"
(354): here
instantiation of "void at::native::KernelTypes<Element_>::sparse_semi_structured_tile_kernel(at::native::KernelTypes<Element_>::Params, MetadataStore, Algorithm) [with Element_=cutlass::half_t, Algorithm=at::native::LargestValuesGreedy<at::native::IdentityOp>, MetadataStore=at::native::MetadataCutlass]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(201): here
instantiation of "void at::native::sparse_semi_structured_tile_kernel<KT,Metadata,Algorithm>(KT::Params, Metadata, Algorithm) [with KT=at::native::KernelTypes<cutlass::half_t>, Metadata=at::native::MetadataCutlass, Algorithm=at::native::LargestValuesGreedy<at::native::IdentityOp>]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/ComputeSparseTile.h(177): here
instantiation of "void at::native::named_algorithms(T) [with T=lambda [](auto, const std::__cxx11::string &)->auto]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(265): here
instantiation of "std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor> at::native::sparse_semi_structured_tile_typed<Element,MetadataFormat>(at::Tensor, std::__cxx11::string) [with Element=cutlass::half_t, MetadataFormat=at::native::MetadataCutlass]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(293): here
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredPack.h(241): error: function "lambda [](cutlass::uint2b_t, cutlass::uint2b_t)->void" cannot be called with the given argument list
argument types are: (int, int)
object type is: lambda [](cutlass::uint2b_t, cutlass::uint2b_t)->void
detected during:
instantiation of "at::native::KernelTypes<Element_>::Tile4x4Packed at::native::KernelTypes<Element_>::pack_4x4(at::native::Indices4x4, at::native::KernelTypes<Element_>::Tile4x4Accessor, uint32_t &, int, __nv_bool) [with Element_=cutlass::half_t]"
(354): here
instantiation of "void at::native::KernelTypes<Element_>::sparse_semi_structured_tile_kernel(at::native::KernelTypes<Element_>::Params, MetadataStore, Algorithm) [with Element_=cutlass::half_t, Algorithm=at::native::LargestValuesGreedy<at::native::IdentityOp>, MetadataStore=at::native::MetadataCutlass]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(201): here
instantiation of "void at::native::sparse_semi_structured_tile_kernel<KT,Metadata,Algorithm>(KT::Params, Metadata, Algorithm) [with KT=at::native::KernelTypes<cutlass::half_t>, Metadata=at::native::MetadataCutlass, Algorithm=at::native::LargestValuesGreedy<at::native::IdentityOp>]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/ComputeSparseTile.h(177): here
instantiation of "void at::native::named_algorithms(T) [with T=lambda [](auto, const std::__cxx11::string &)->auto]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(265): here
instantiation of "std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor> at::native::sparse_semi_structured_tile_typed<Element,MetadataFormat>(at::Tensor, std::__cxx11::string) [with Element=cutlass::half_t, MetadataFormat=at::native::MetadataCutlass]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(293): here
```
The casts added by this PR get the build working again for me.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124777
Approved by: https://github.com/jcaip
- `FakeContext` hides all fields other than `ctx.saved_tensors`; this causes Dynamo errors when `autograd.Function.backward` uses other attrs on ctx, and it also doesn't allow falling back to eager.
- If we remove it, we still can't fall back to eager: node variables are already freed (ctx.saved_tensors throws).
- However, we can fall back to "pseudo-eager" by using a duck-typed ctx and routing ctx.saved_tensors to the lifted tensors.
- Dynamo tries to inline external_utils.call_backward and treats BackwardCFunction as an AutogradFunctionContextVariable (only used up until we create the fake context: FakeBackwardCFunction).
- We call_function backward from the forward class AutogradFunctionVariable, and we still pass in the fake context as a UserDefinedObjectVariable (we can later use AutogradFunctionContextVariable + HOO graph speculation).
Fixes #125489, #124827
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125661
Approved by: https://github.com/jansel
Summary:
Update the Kineto submodule in PyTorch. The following diffs are included:
- Removed CUPTI overhead track in AMD traces
- Delay logging for CUDA stream wait event until the end
- Changed Chrome trace units to milliseconds, while the data is in ns
- Refactored roctracer to include metadata and improved names.
- Lowered Kineto Stage log level, reducing noisy output
- Changed relative time of ts to quarterly interval for distributed trace alignment
- Fixed Non-risky deprecated use of 0/NULL
- Removed hardcoding of /opt/rocm
- Handling cuLaunchKernelEx better
- Fixed Non-risky missing field initializers and unused variables.
Test Plan: CI and this is running internally.
Differential Revision: D57011897
Pulled By: aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125621
Approved by: https://github.com/sraikund16
As per title.
This ensures that the methods defined in _tensor.py actually exist in all the places where we assume they do.
BC-Breaking: This is bc-breaking as the user cannot subclass this private class anymore.
You should replace any use of _TensorBase with Tensor.
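For example, a minimal illustration of the replacement:
```python
import torch

# If code used to subclass the private torch._C._TensorBase,
# it should subclass the public torch.Tensor instead.
class MyTensor(torch.Tensor):
    pass

t = torch.randn(2).as_subclass(MyTensor)
assert isinstance(t, MyTensor) and isinstance(t, torch.Tensor)
```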
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125558
Approved by: https://github.com/ezyang
Summary:
Block traverse mode.
Assumption: the culprits form a block (start_idx, end_idx) in the topologically sorted graph, and the error will go away if the graph pattern is broken.
Reviewed By: junhanh
Differential Revision: D56799587
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125613
Approved by: https://github.com/jfix71
In case the `dfs_iter_find_cycle` function receives duplicated node entries in the `all_user_nodes` argument, it will still process each one of them. This commit changes the `all_user_nodes` list into a set, so each element is unique, resulting in a shorter execution time of the `propose_partitions` function.
Fixes #125584
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125585
Approved by: https://github.com/Skylion007
Summary:
By default, some inferred dynamic shapes guards/constraints that are not expressible with the current dynamic shapes language will lead to specialization to the concrete input values provided. If disable_forced_specializations is set to True, we will not specialize, and will not perform runtime checks on such produced guards. Instead, we allow the user to specify arbitrary shapes, and fail during runtime if the inputs are invalid. Constraints expressible with the language (e.g. ranges, linear derived dims) will still be enforced, and behavior for all other guards remains the same.
Cases where we typically specialize are reshapes:
```
x: [4, 6] # [s0, s1]
x = x.reshape([x.shape[0] - 1, -1])
# this emits a guard Mod(s0*s1, s0-1) = 0, we specialize on s0=4, s1=6
x: [4, 6], y: [24] # [s0, s1], [s2]
x = x.reshape([-1]) + y
# this emits a guard s0*s1 = s2, we specialize on s0=4, s1=6, s2=24
```
For now only applicable for non-strict mode (need to figure out how to pass this flag into dynamo's call of produce_guards).
Test Plan: Added test case that checks compilation, runtime, and suggested fixes behavior.
Differential Revision: D56361177
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124949
Approved by: https://github.com/avikchaudhuri
Now that efficient attention is supported in dtensor, we can modify the transformer test to use dtensor in SDPA and get rid of the manual num_head adjustments.
Caveat: Efficient attention is supported only with bf16/fp32 (not fp64) and has other constraints. If any of the constraints are not satisfied, the SDPA would fall back to the math decomposed attention, which will break as it does not fully work with dtensor (it creates a `torch.Tensor` mask in the middle). I considered adding some checks like in P1202254918 but that needs to be added everywhere this Transformer is used. Is it necessary if the current CI machines can run efficient attention?
Test files containing this Transformer:
- `test/distributed/tensor/parallel/test_tp_examples.py`
- `test/distributed/_composable/fsdp/test_fully_shard_training.py`
- `test/distributed/_composable/fsdp/test_fully_shard_clip_grad_norm_.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122997
Approved by: https://github.com/XilunWu
ghstack dependencies: #122995, #122996
Previously, the new tensors coming out of the "new factory" ops all become replicated.
With this PR, if the new tensor has the same shape as the old tensor **and** the shape can be evenly sharded, then the old spec is inherited and preferred.
To accommodate this when the old tensor has sharded placements, the input args for local computation (size, stride) need to be adjusted.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122995
Approved by: https://github.com/wanchaol
@wanchaol was seeing the loss eventually become NaN when compiling individual transformer blocks in torchtitan - with this patch I no longer see the NaN loss.
The problem is the following:
(1) It is possible to have graph inputs to a compiled region that are AsyncCollectiveTensors. In particular: when we compile individual transformer blocks in the llama model, the first layer (embedding layer) is run in eager mode, and it outputs an AsyncCollectiveTensor that is fed to the first transformer block
(2) ideally, we would like that AsyncCollectiveTensor graph input to desugar into a `wait_tensor()` op that shows up at the beginning of the graph.
(3) the way this is supposed to happen is: AOTAutograd traces through the __torch_dispatch__ of AsyncCollectiveTensor, tracing out a `wait_tensor()` call before dispatching to any of the other ops in the function we are tracing
(4) however: `trigger_wait()` was getting called in a way where we would ignore its output (and return `self.elem` directly), which would cause the `wait_tensor` ops to get DCE'd.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125677
Approved by: https://github.com/wanchaol, https://github.com/yifuwang
ghstack dependencies: #125676
This is the first PR in a series where I try to organize our runtime wrappers a bit: specifically, I'd like to separate wrappers into objects that have (up to) 2 methods:
A **pre-compile** function, which takes in flat_fn and flat_args (inputs to the compiler) and wraps/modifies them
A **post-compile** function, which takes in a compiled_fn and runtime args and wraps the compiled_function.
Extra metadata necessary to run the compile functions can be stored on the attributes of the class. This way, when we think about caching, the set of attributes on the class should be the exact set of metadata that we need to serialize and save in the cache (along with common data, like fw_metadata)
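A hypothetical sketch of that wrapper shape (the class and method names below are illustrative, not the actual AOTAutograd ones):
```python
from dataclasses import dataclass, field

@dataclass
class RuntimeWrapper:
    # metadata the wrapper needs at runtime; this is exactly what a cache
    # would have to serialize (along with common data like fw_metadata)
    extra_meta: dict = field(default_factory=dict)

    def pre_compile(self, flat_fn, flat_args, fw_metadata):
        # wrap/modify the function and inputs before handing them to the compiler
        return flat_fn, flat_args

    def post_compile(self, compiled_fn, fw_metadata):
        # wrap the compiled function for runtime use
        def wrapped(*runtime_args):
            return compiled_fn(*runtime_args)
        return wrapped
```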
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125595
Approved by: https://github.com/bdhirsh
# Motivation
As discussed in [#124479](https://github.com/pytorch/pytorch/pull/124479), `torch.amp.autocast` can NOT be completely equivalent to `torch.cuda.amp.autocast` and `torch.cpu.amp.autocast`, since `torch.amp.autocast` does NOT have the per-backend default `dtype` (`torch.bfloat16` for CPU, `torch.float16` for CUDA) that those variants have. We would like `torch.amp.autocast` to be more generic to help developers/customers write device-agnostic code, because there are not enough reasons to add a device-specific autocast `torch.xxx.amp.autocast` for each device backend.
# Solution
When `None` is passed to `dtype`, we should use `torch.get_autocast_dtype` to get the related dtype for each backend. Meanwhile, `torch.get_autocast_dtype` needs to be supported in the JIT path for BC.
# Additional Context
With this PR, `torch.amp.autocast(device_type='cuda')` is equivalent to `torch.cuda.amp.autocast`.
Add two new UTs to cover this change in the eager and JIT paths respectively.
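For example, a minimal device-agnostic usage sketch (the tiny model is illustrative):
```python
import torch

device_type = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(8, 8).to(device_type)
x = torch.randn(4, 8, device=device_type)

# dtype is omitted (None), so the backend's default autocast dtype is used
with torch.amp.autocast(device_type=device_type):
    y = model(x)

print(y.dtype)  # torch.float16 on CUDA, torch.bfloat16 on CPU
```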
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125103
Approved by: https://github.com/albanD, https://github.com/jgong5, https://github.com/gujinghui
Summary:
The macros that build `c10::Error` compute the stack trace at the point of throwing, which is then returned as part of the `what()`. If `what()` is never called, which is the case for most exceptions (since logging is throttled), the cost of computing the stack trace was wasted.
By far, the most expensive part of computing the stack trace is its symbolization; just unwinding the stack and collecting the instruction addresses is comparatively cheap. We can thus defer the symbolization to the first invocation of `what()`.
Test Plan:
Added unit tests exercising the lazy nature of `what()`.
Ran an adfinder canary: https://www.internalfb.com/intern/ads/canary/460118801509424346
We can see that the cost of symbolization is obliterated (meaning that `what()` is virtually never called, as expected):
{F1496627896}
Reviewed By: ezyang
Differential Revision: D56586844
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125682
Approved by: https://github.com/ezyang
As titled: for meta tensor ops, we should avoid calling the RNGTracker,
which could potentially alter the current RNG state. Meta tensor ops should be
no-ops, and only the post-`to_empty` init should actually alter the RNG
state.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125693
Approved by: https://github.com/XilunWu
Fixes #125272
## About
(This is a re-spin of PR #106617)
Kineto introduced a new profiler to read performance counters from NVIDIA GPUs (the CUPTI Range Profiler API), added in PR [75616](https://github.com/pytorch/pytorch/pull/75616). Support for the range profiler mode was disabled as we had to link with an NV PerfWorks library (`libnvperf_host.so`). This PR adds that link.
The change includes-
* Updates cmake build files to find `libnvperf_host.so` and set `CUDA_nvperf_host_LIBRARY`
* WIP use the above cmake variable in kineto, will update this PR after kineto PR has landed
See https://github.com/pytorch/kineto/pull/724
## Example usage of CUPTI profiler
The code snippet below shows how to configure the PyTorch profiler in CUPTI Profiler mode. Any code included in the profiling window will be profiled by CUPTI/Kineto. Note how the `_ExperimentalConfig` struct is used to configure profiler metrics.
```
with torch.profiler.profile(
activities=[torch.profiler.ProfilerActivity.CUDA],
record_shapes=True,
on_trace_ready=trace_handler,
experimental_config=torch.profiler._ExperimentalConfig(
profiler_metrics=[
"kineto__tensor_core_insts",
"dram__bytes_read.sum",
"dram__bytes_write.sum"],
profiler_measure_per_kernel=False),
) as prof:
res = train_batch(modeldef)
prof.step()
```
For a full example see this [xor.py](https://gist.github.com/briancoutinho/b1ec7919d8ea2bf1f019b4f4cd50ea80) gist.
### Details of how to configure the CUPTI profiler
The `_ExperimentalConfig` structure can be used to pass metrics to the profiler:
```
profiler_metrics : a list of CUPTI profiler metrics used
to measure GPU performance events. Any metric supported by CUPTI can be used; see:
https://docs.nvidia.com/cupti/r_main.html#r_profiler
There are two special alias metrics `kineto__tensor_core_insts` and `kineto__cuda_core_flops` for FLOPS counting.
profiler_measure_per_kernel (bool) : whether to profile metrics per kernel
or for the entire measurement duration.
```
## Testing
Built from source with kineto [PR](https://github.com/pytorch/kineto/pull/724)
```
$> USE_CUDA=1 python setup.py install
-- CUDA_cupti_LIBRARY = /public/apps/cuda/11.6/extras/CUPTI/lib64/libcupti.so
-- CUDA_nvperf_host_LIBRARY = /public/apps/cuda/11.6/extras/CUPTI/lib64/libnvperf_host.so
```
Then run the example [xor.py](https://gist.github.com/briancoutinho/b1ec7919d8ea2bf1f019b4f4cd50ea80). This works on V100+ GPUs only. Adding logs for debugging etc.
```
>$ export KINETO_LOG_LEVEL=1
>$ python xor.py
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiActivityProfiler.cpp:167] CUDA versions. CUPTI: 16; Runtime: 11060; Driver: 11040
Log file: /tmp/libkineto_activities_1683060.json
Trace start time: 2023-02-11 19:11:47 Trace duration: 500ms
Warmup duration: 0s
Max GPU buffer size: 128MB
Enabled activities: cuda_profiler_range
Cupti Profiler metrics : kineto__tensor_core_insts, dram__bytes_read.sum, dram__bytes_write.sum
Cupti Profiler measure per kernel : 0
Cupti Profiler max ranges : 10
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiActivityProfiler.cpp:638] Enabling GPU tracing
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiActivityProfiler.cpp:567] Running child profiler CuptiRangeProfiler for 500 ms
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiRangeProfiler.cpp:104] Configuring 3 CUPTI metrics
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiRangeProfiler.cpp:109] sm__inst_executed_pipe_tensor.sum
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiRangeProfiler.cpp:109] dram__bytes_read.sum
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiRangeProfiler.cpp:109] dram__bytes_write.sum
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiActivityProfiler.cpp:575] Running child profiler CuptiRangeProfiler for 500 ms
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiActivityProfiler.cpp:672] Tracing starting in 9s
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiActivityProfiler.cpp:677] Tracing will end in 10s
STAGE:2023-02-11 19:11:37 1683060:1683060 ActivityProfilerController.cpp:310] Completed Stage: Warm Up
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiActivityProfiler.cpp:693] Starting child profiler session
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125685
Approved by: https://github.com/sraikund16
Fixes #125526 and [builder #1811](https://github.com/pytorch/builder/issues/1811)
Adopt syntax=docker/dockerfile:1, which has been stable since 2018 and is still best practice to declare in 2024.
- Syntax features dependent upon the [syntax directive version are documented here](https://hub.docker.com/r/docker/dockerfile).
- While you can set a fixed minor version, [Docker officially advises to only pin the major version](https://docs.docker.com/build/dockerfile/frontend/#stable-channel):
```
We recommend using docker/dockerfile:1, which always points to the latest stable release of the version 1 syntax, and receives both "minor" and "patch" updates for the version 1 release cycle.
BuildKit automatically checks for updates of the syntax when performing a build, making sure you are using the most current version.
```
**Support for building with Docker prior to v23 (released on Feb 2023)**
NOTE: 18.06 may not be the accurate minimum version for using docker/dockerfile:1, according to the [DockerHub tag history](https://hub.docker.com/layers/docker/dockerfile/1.0/images/sha256-92f5351b2fca8f7e2f452aa9aec1c34213cdd2702ca92414eee6466fab21814a?context=explore) 1.0 of the syntax seems to be from Dec 2018, which is probably why docker/dockerfile:experimental was paired with it in this file.
Personally, I'd favor only supporting builds with Docker v23. This is only relevant for someone building this Dockerfile locally, the user could still extend the already built and published image from a registry on older versions of Docker without any concern for this directive which only applies to building this Dockerfile, not images that extend it.
However, if you're reluctant, you may want to refer others to [this Docker docs page](https://docs.docker.com/build/buildkit/#getting-started), where they should only need the env var DOCKER_BUILDKIT=1; presumably the requirement for experimental was dropped with syntax=docker/dockerfile:1 in releases of Docker since Dec 2018. Affected users can often quite easily install a newer version of Docker on their OS, as per Docker's official guidance (usually via including an additional repo in the package manager).
**Reference links**
Since one of these was already included in the inline note (now a broken link), I've included relevant links mentioned above. You could alternatively rely on git blame with a commit message referencing the links or this PR for more information.
Feel free to remove any of the reference links, they're mostly only relevant to maintainers to be aware of (which this PR itself has detailed adequately above).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125632
Approved by: https://github.com/malfet
This PR seeks to increase observability of save/load requests. This is accomplished with two main changes:
1. The creation of save_id and load_id:
- a save_id and a load_id are added to the filesystem writer. `save_id` is re-generated on every save call, and `load_id` is re-generated on every load call.
- both of these IDs are stored in a new `StorageMeta` class and saved as part of Metadata. (`load_id` is None when we save, and only set during load.)
2. A new mechanism is implemented in the save path which gives the SavePlanner a chance to inspect the `storage_meta` object. The mechanism mirrors the same metadata exchange in the load path. In the load path, `storage_meta` is added to `metadata` such that the LoadPlanner can also access `storage_meta` before we begin loading.
*If users now wish to access the checkpoint_id in the SavePlanner, they simply need to access the value in `storage_meta` from the `set_up_planner` call*
*Additionally, users now have a generic way of passing data to the SavePlanner from the StorageWriter at the start of the save path, similar to the load path*
This PR has been tested for backwards compatibility -- meaning any checkpoints saved before this PR can continue being loaded after this PR.
One major consideration is that there is limited forwards compatibility. If a checkpoint is generated _past_ this PR, there is no support for loading it using older torch versions. This brings up a fairly important point: since we expect the metadata object (which is saved to the disk) to continue evolving, and we want to support forwards compatibility, we explore patching `pickle` so we can at least add new members to `metadata` and maintain fwd compat.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124772
Approved by: https://github.com/fegin
Fixes [internal error](https://fb.workplace.com/groups/1075192433118967/permalink/1416709435633930/).
The issue is that the assertion nodes added by the `insert_deferred_runtime_assertion` pass do not contain the metadata that the ExportedProgram requires the graph to have. One solution is to retrace the entire module; another is to manually add back this metadata.
This diff implements the latter solution (manually add back the metadata) through hooking into fx.graph's `create_node` function, and adding export-specific metadata for every node that is created. The reason I did this is so that the `insert_deferred_runtime_assertion` does not have to know about what metadata export wants.
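For illustration, a hypothetical sketch of the "hook create_node" idea (not the exact code in the pass; the metadata keys are placeholders):
```python
import contextlib
import torch.fx as fx

@contextlib.contextmanager
def stamp_export_metadata(graph: fx.Graph, default_meta: dict):
    """Temporarily wrap graph.create_node so every new node gets default metadata."""
    orig_create_node = graph.create_node

    def create_node(*args, **kwargs):
        node = orig_create_node(*args, **kwargs)
        for key, value in default_meta.items():
            node.meta.setdefault(key, value)  # e.g. "stack_trace", "nn_module_stack"
        return node

    graph.create_node = create_node
    try:
        yield graph
    finally:
        graph.create_node = orig_create_node  # restore the original method
```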
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125414
Approved by: https://github.com/zhxchen17, https://github.com/BoyuanFeng
Summary: capture_pre_autograd_graph is deprecated, and torch.export won't be able to provide timely fixes for this API. To reduce confusion around this, we should explicitly give users clear warnings.
Test Plan: eyes
Reviewed By: tarun292
Differential Revision: D56955202
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125602
Approved by: https://github.com/angelayi
By defining a `CASE_ISSIGNED` macro that just returns `std::numeric_limits<dtype>::is_signed` for the types where it makes sense, and explicitly coding some types when it does not.
Remove the `default:` case from the switch to avoid regressions like the one reported in https://github.com/pytorch/pytorch/issues/125124 , as [`-Wswitch-enum`](https://clang.llvm.org/docs/DiagnosticsReference.html#wswitch-enum) in combination with `-Werror` will raise an error in case of a missing entry, for example:
```
/Users/nshulga/git/pytorch/pytorch/c10/core/ScalarType.h:518:11: warning: enumeration value 'QInt32' not handled in switch [-Wswitch]
switch (t) {
^
/Users/nshulga/git/pytorch/pytorch/c10/core/ScalarType.h:518:11: note: add missing switch cases
switch (t) {
^
1 warning generated.
```
Fixes https://github.com/pytorch/pytorch/issues/125124
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125637
Approved by: https://github.com/albanD
This resolves a bug in eager where, if an old state dict is loaded (without the capturable flag) but the original dict had the capturable flag, then state_steps would be on cuda but we would take the non-capturable path. We now fall back to eager if capturable=False.
Current design doc and discussion: https://docs.google.com/document/d/1DmmbiaSp16CDZtGw1qzXKHFTY_0gqc0xpnBdviXq0vk/edit#heading=h.871u7bvwz7ze
Note on the actual fallback logic: there was an issue with TorchScript originally not handling *args, **kwargs properly. After rectifying that by using `functools.wraps`, there was an additional bug with scoping which required the single-tensor implementation to be in the global scope at the time the fallback closure was created. I pass the single-tensor function to the `_disable_dynamo_if_unsupported` decorator to work around this bug.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123619
Approved by: https://github.com/janeyx99
Fixes https://github.com/pytorch/pytorch/issues/125109, which is a regression introduced by https://github.com/pytorch/builder/pull/1467 that adds a dynamic dependency on mkl, which, if installed in the user dir, is placed into `sysconfig.get_config_var("userbase") / "Library" / "bin"`.
Fix this by adding the `userbase` folder to the DLL search path.
Testing before this fix:
```
Python 3.12.3 (tags/v3.12.3:f6650f9, Apr 9 2024, 14:05:25) [MSC v.1938 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\Administrator\AppData\Roaming\Python\Python312\site-packages\torch\__init__.py", line 141, in <module>
raise err
OSError: [WinError 126] The specified module could not be found. Error loading "C:\Users\Administrator\AppData\Roaming\Python\Python312\site-packages\torch\lib\shm.dll" or one of its dependencies.
>>> exit()
```
After:
```
c:\Program Files\Python312>python
Python 3.12.3 (tags/v3.12.3:f6650f9, Apr 9 2024, 14:05:25) [MSC v.1938 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> exit()
```
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125684
Approved by: https://github.com/malfet
We save and restore the DynamicLayerStack during frame eval, but since the FX graph has no way to express a try/finally, we just assume it will happen. If we throw an exception between the push and the pop of the stack, then we're left in a state that adversely affects subsequent operations. Make sure that if it's in a bad state, we restore it after frame eval.
Repro:
before:
```
$ rm test/dynamo_skips/TestSparseCPU.test_log1p_cpu_uint8
$ rm test/dynamo_expected_failures/FuncTorchHigherOrderOpTests.test_vmap_free_tensor
$ PYTORCH_TEST_WITH_DYNAMO=1 pytest test/jit/test_sparse.py test/dynamo/test_dynamic_shapes.py test/inductor/test_torchinductor_dynamic_shapes.py test/test_sparse.py -k 'test_log1p_cpu_uint8'
============= 1 passed, 8588 deselected in 9.75s =============
$ PYTORCH_TEST_WITH_DYNAMO=1 pytest test/jit/test_sparse.py test/dynamo/test_dynamic_shapes.py test/inductor/test_torchinductor_dynamic_shapes.py test/test_sparse.py -k
'test_vmap_free_tensor_dynamic_shapes or test_log1p_cpu_uint8'
================== short test summary info ===================
FAILED [0.0632s] test/test_sparse.py::TestSparseCPU::test_log1p_cpu_uint8 - AssertionError: "only Tensors of floating point dtype can require gradients"
does not match "You are attempting to call Tensor.requires_grad_() (or perhaps using torch.autograd.functional.* APIs) inside of a function ...
======= 1 failed, 1 skipped, 8587 deselected in 10.99s =======
```
(Note that adding test_vmap_free_tensor_dynamic_shapes causes test_vmap_free_tensor_dynamic_shapes to fail)
after:
```
$ rm test/dynamo_skips/TestSparseCPU.test_log1p_cpu_uint8
$ rm test/dynamo_expected_failures/FuncTorchHigherOrderOpTests.test_vmap_free_tensor
$ PYTORCH_TEST_WITH_DYNAMO=1 pytest test/jit/test_sparse.py test/dynamo/test_dynamic_shapes.py test/inductor/test_torchinductor_dynamic_shapes.py test/test_sparse.py -k 'test_log1p_cpu_uint8'
============= 1 passed, 8588 deselected in 9.89s =============
$ PYTORCH_TEST_WITH_DYNAMO=1 pytest test/jit/test_sparse.py test/dynamo/test_dynamic_shapes.py test/inductor/test_torchinductor_dynamic_shapes.py test/test_sparse.py -k
'test_vmap_free_tensor_dynamic_shapes or test_log1p_cpu_uint8'
======= 1 passed, 1 skipped, 8587 deselected in 11.34s =======
```
(test_vmap_free_tensor_dynamic_shapes passes either way)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122074
Approved by: https://github.com/oulgen
Summary:
Right now DCP only flattens a mapping (e.g., a dict) if that mapping has tensor objects. This behavior is odd, as users may save different non-tensor objects on different ranks. Without flattening the mappings, we may lose these non-tensor objects. One use case is the dataloader state_dict.
We may also want to do so for lists/tuples, but this would cause extra pickles, so we don't do it for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125335
Approved by: https://github.com/LucasLLC, https://github.com/wz337
ghstack dependencies: #125333, #125501, #125334
This fixes a logic regression introduced by https://github.com/pytorch/pytorch/pull/123247 where
```python
if self.use_device and self.use_device != _get_privateuse1_backend_name():
```
was replaced with
```python
VALID_DEVICE_OPTIONS = ["cuda", "xpu", "privateuseone"]
if self.use_device not in VALID_DEVICE_OPTIONS:
```
That triggers a warning every time code is invoked with `self.use_device` set to None.
This change also skips all the checks which are useless if `use_device` is None to begin with
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125654
Approved by: https://github.com/aaronenyeshi
Looking at the unrelated Windows timeout failure on https://github.com/pytorch/pytorch/pull/125199, it looks like we don't have a timeout value set for C++ tests atm. In this case, a C++ test on Windows timed out after 2+ hours.
```
2024-05-02T23:35:34.0639067Z Running cpp/c10_TypeList_test 1/1 ... [2024-05-02 23:35:34.059021]
2024-05-02T23:35:34.0641108Z Executing ['pytest', 'C:\\actions-runner\\_work\\pytorch\\pytorch\\build\\win_tmp\\build\\torch\\test\\c10_TypeList_test.exe', '-m', 'not serial', '-v', '-vv', '-rfEX', '-n', '2', '--junit-xml-reruns', 'test-reports\\python-pytest\\test\\run_test\\test\\run_test-c898ddeff8f33cbf.xml', '-x', '--reruns=2'] ... [2024-05-02 23:35:34.062137]
2024-05-03T02:45:33.7862004Z Process SpawnPoolWorker-2:
2024-05-03T02:45:33.7927201Z Traceback (most recent call last):
2024-05-03T02:45:33.7928032Z File "C:\Jenkins\Miniconda3\lib\multiprocessing\process.py", line 315, in _bootstrap
2024-05-03T02:45:33.7928722Z self.run()
2024-05-03T02:45:33.7929722Z File "C:\Jenkins\Miniconda3\lib\multiprocessing\process.py", line 108, in run
2024-05-03T02:45:33.7931639Z self._target(*self._args, **self._kwargs)
2024-05-03T02:45:33.7932435Z File "C:\Jenkins\Miniconda3\lib\multiprocessing\pool.py", line 114, in worker
2024-05-03T02:45:33.7933338Z task = get()
2024-05-03T02:45:33.7933946Z File "C:\Jenkins\Miniconda3\lib\multiprocessing\queues.py", line 365, in get
2024-05-03T02:45:33.7935219Z res = self._reader.recv_bytes()
2024-05-03T02:45:33.7935897Z File "C:\Jenkins\Miniconda3\lib\multiprocessing\connection.py", line 221, in recv_bytes
2024-05-03T02:45:33.7936609Z buf = self._recv_bytes(maxlength)
2024-05-03T02:45:33.7937302Z File "C:\Jenkins\Miniconda3\lib\multiprocessing\connection.py", line 310, in _recv_bytes
2024-05-03T02:45:33.7938316Z waitres = _winapi.WaitForMultipleObjects(
2024-05-03T02:45:33.7938766Z KeyboardInterrupt
```
Retrying was working, but it was already too late to finish the job. I'm setting the same default `THRESHOLD * 3` timeout value here for C++ tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125517
Approved by: https://github.com/clee2000
1. Fix the incorrect tests about lazy init for the PrivateUse1 backend named foo
2. Refactor the tests and make them more flexible
3. Temporarily disable the two tests:
- test_open_device_faketensor
- test_open_device_scalar_type_fallback
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125572
Approved by: https://github.com/albanD
The scheduler searches for fusion opportunities by looking for common memory accesses. Two memory accesses are considered common not only when the buffer names match; it also requires more:
- index formula matches
- var_ranges matches
In this PR, I want to log all the fusion failures due to mismatched index formulas or var_ranges. I also want to further categorize the failures. Right now I found the following failure categories:
- rand_seed: the index for rand seed access is an integer and different accesses use different integer offsets
- different numel: this happens for cat operations
- broadcast: e.g. kernel A writes a buffer which is broadcast and read by kernel B
- different loop orders: the major category we want inductor to be able to fuse
- different offset: happens when using a concatenated linear layer to project Q/K/V and then splitting the result. Each split will point to the same buffer with a different offset.
- unknown
My hope is to make sure that, for the models I tested, no fusion failure falls into the unknown category, so all the failures are well understood and categorized. Right now that's true for BertForMaskedLM ( https://gist.github.com/shunting314/6dc2c903629d342fa63ba731a171adc2 ), DistillGPT2 ( https://gist.github.com/shunting314/145176f2e850103c7fad4ad72f0e200e ) and llm.c ( https://gist.github.com/shunting314/cfc64a326312a889ba55f79bd47b2082 )
For BertForMaskedLM, we found 82 instances of fusion failures and the majority of them are due to different loop orders! Studying the log a bit more can help us figure out where all these loop order mismatches come from in real models.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124986
Approved by: https://github.com/eellison, https://github.com/jansel
Check that they are not used by running the following
```
% grep -h "AT_FORALL_SCALAR_TYPES_AND" . -R|grep -v #define|cut -d\( -f1|sort|uniq
AT_FORALL_SCALAR_TYPES_AND3
AT_FORALL_SCALAR_TYPES_AND3
AT_FORALL_SCALAR_TYPES_AND
AT_FORALL_SCALAR_TYPES_AND2
AT_FORALL_SCALAR_TYPES_AND3
AT_FORALL_SCALAR_TYPES_AND7
AT_FORALL_SCALAR_TYPES_AND2
AT_FORALL_SCALAR_TYPES_AND3
AT_FORALL_SCALAR_TYPES_AND7
AT_FORALL_SCALAR_TYPES_AND2
AT_FORALL_SCALAR_TYPES_AND3
AT_FORALL_SCALAR_TYPES_AND7
// AT_FORALL_SCALAR_TYPES / AT_FORALL_SCALAR_TYPES_AND macros below, which are
AT_FORALL_SCALAR_TYPES_AND
AT_FORALL_SCALAR_TYPES_AND2
AT_FORALL_SCALAR_TYPES_AND3
AT_FORALL_SCALAR_TYPES_AND7
using at::Half; // for AT_FORALL_SCALAR_TYPES_AND3
```
or by checking online using https://github.com/search?type=code&q=AT_FORALL_SCALAR_TYPES_AND4+repo%3Apytorch%2Fpytorch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125607
Approved by: https://github.com/albanD
This PR makes libtorch behave the same as PyTorch when loading optimizer state from an archive. With PyTorch, the options of parameter groups are loaded from the archive, which is currently missing in libtorch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125215
Approved by: https://github.com/janeyx99
Fixes #117850
This PR:
* Adds the class name in the repro command
* Fixes the path to the test file for python 3.8 jobs (apparently `inspect.getfile(class_type)` returns a relative path in this older python version)
Before (in python 3.8):
```sh
PYTORCH_TEST_WITH_DYNAMO=1 python test_autograd.py -k test_foo
```
After:
```sh
PYTORCH_TEST_WITH_DYNAMO=1 python test/test_autograd.py -k TestAutograd.test_foo
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125498
Approved by: https://github.com/huydhn, https://github.com/janeyx99
This diff implements a remote caching strategy (memcache for internal and redis for external) for caching the mapping from an Inductor FX graph to the Inductor-generated wrapper file.
It uses the same idea as the autotuning result cache that is currently live.
This will land turned off; before turning it on by default, I will do more testing, including looking at the dynamic shape guards added by Inductor.
Differential Revision: [D56441624](https://our.internmc.facebook.com/intern/diff/D56441624/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124669
Approved by: https://github.com/jansel, https://github.com/eellison
Saw a warning like this:
```
/opt/conda/lib/python3.10/site-packages/torch/utils/hooks.py:86: UserWarning: backward hook functools.partial(<function _pre_backward_hook at 0x7f9a3940fac0>, FullyShardedDataParallel(
....
), <torch.distributed.fsdp.flat_param.FlatParamHandle object at 0x7f25202a9720>) on tensor will not be serialized. If this is expected, you can decorate the function with @torch.utils.hooks.unserializable_hook to suppress this warning
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125464
Approved by: https://github.com/ezyang
This factors out a repeated pattern of creating a library and fetching a function from source.
Typical use case:
```cpp
static MetalShaderLibrary lib(SHADER_SOURCE);
...
id<MTLComputePipelineState> cplState = lib.getPipelineStateForFunc("kernel_name");
```
- Make it possible to use with templated sources
- Add `scalarToMetalTypeString(const Tensor&)` variant to avoid repeated `scalarToMetalTypeString(t.scalar_type())` calls in the code
I.e. it makes no functional changes, but reduces MPS codebase size by 365 lines
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125550
Approved by: https://github.com/kulinseth
Summary:
This shim exports symbols on Windows, which can lead to symbol clashes at link time in the following scenario:
1. A DLL imports libtorch
2. A binary imports libtorch, and also depends on the DLL in (1)
Under that scenario, the symbols exported from `shim.h` can clash at link time.
Given that AOTInductor only works for PyTorch2, and PyTorch2 doesn't currently work for Windows, we can work around this problem by simply removing the symbols export on Windows. In the long term, this will need to be figured out when Windows support is added & tested for PyTorch2.
Differential Revision: D56936696
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125472
Approved by: https://github.com/desertfire
This allows `associative_scan` to take an arbitrary pytree of tensors,
which is flattened to their leaves before calling the `associative_scan`
higher order operator.
I also add support in inductor to generate code for scanning over sequences
of tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122137
Approved by: https://github.com/lezcano, https://github.com/Chillee
ghstack dependencies: #119430
Triton updated the interface for `triton.compile` in 5162346487.
The `target` argument to compile needs to be wrapped in a `GPUTarget` object. Without proper wrapping, we hit an assert in `compile`. If that assert is removed, Triton attempts to read device info from Torch while inside a torch thread, which hits a bad-fork assert. This change is required for compatibility with the latest commits in Triton. The implementation is backwards compatible, so existing versions of Triton that work now continue to work.
Re-submitting this after https://github.com/pytorch/pytorch/pull/125241 was reverted due to an unrelated CI issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125553
Approved by: https://github.com/huydhn
This diff makes sure that a custom exception is thrown when no valid
choices remain during autotuning. This allows us to gracefully fall back
to a default choice, even if that default choice has not been passed to
autotune_select_algorithm.
Additionally, this diff handles RuntimeErrors during autotuning gracefully: if a problematic choice is encountered, the corresponding choice is ignored and does not lead to a compilation failure of the entire model (an error is logged, though).
Test Plan:
CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124928
Approved by: https://github.com/int3
ghstack dependencies: #125406
This PR completely removes the Inductor IR for legacy functional collectives:
- Removed the `CollectiveKernel` hierarchy and `Wait`, as well as the corresponding lowerings. These IRs are target (i.e. Python) specific and don't model node dependencies properly (e.g. they rely on `never_reuse_buffers` for correct behavior). They've been superseded by `ir._CollectiveKernel`.
- Removed `InPlaceHint` and the scheduler logic for handling it. `InPlaceHint` is a codegen-time buffer reuse mechanism controlled by the IR's codegen. It's a bit hacky and overlaps with the default buffer reuse mechanism. Removing it since it is only used by legacy functional collectives.
- Removed `OutputBuffer` and `MultiOutputNoSizeAssert` which are designed for and only used by legacy functional collectives.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124992
Approved by: https://github.com/Chillee, https://github.com/wanchaol
Differential Revision: D56347560
More details in this pytorch issue: https://github.com/pytorch/pytorch/issues/124468
It seems there is a race in the ProcessGroupNCCL shutdown logic. The code is quite simple:
```
for i in range(100):
dist.all_to_all_single(tensor_out, tensor_in)
dist.destroy_process_group()
```
What can happen is this:
1. dist.destroy_process_group() calls into shutdown() and then calls into abort: b2f6cfd9c0/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (L1095)
2. It'll call ncclCommAbort (not graceful afaict), and also set the ncclAsyncErr_ = ncclSystemError; b2f6cfd9c0/torch/csrc/distributed/c10d/NCCLUtils.hpp (L388).
3. The ncclWatchdog thread may not have woken up while all this shutdown happens. And in shutdown we're not waiting for the watchdog thread.
4. The ProcessGroupNCCL dtor is called. It'll wait for the watchdog thread to join.
5. The watchdog will check the work's isCompleted() -> then calls checkAndSetException(). Because ncclAsyncError_ was set to ncclSystemError, it'll error out and make you think it's a NCCL error.
So we can mitigate this issue by checking if the comm was aborted during work.isCompleted/isStarted
Some more longer term discussion in the issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124466
Approved by: https://github.com/shuqiangzhang, https://github.com/yoyoyocmu, https://github.com/kwen2501
Summary:
We remove the assertion for target_func being cat.
The reason is that we have multiple flavors of concat, such as
cat/cat.default/cat_slice/cat_slice_cat/...
The assertion here was causing repeated false positives.
Test Plan: Removing assertion code only.
Differential Revision: D56971387
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125540
Approved by: https://github.com/hl475
While there are some similarities, they are also quite different (one
handles NumPy numbers while the other handles ints). I am also going to
add a wrap_symfloat soon which will have even more different behavior.
So split these out for clarity.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125483
Approved by: https://github.com/lezcano
ghstack dependencies: #125395, #125419
Enable nonzero workspace and Cutlass StreamK for Inductor Cutlass GEMM ops.
This is a simpler rewrite of my original version of #119005 using @peterbell10 's workspace allocation mechanism from #117992
Test Plan:
- Additional unit test in test_cutlass_backend.py which specifically tests StreamK GEMM with workspace requirement
- CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125406
Approved by: https://github.com/jansel
**Context**
We are interested in supporting the case where HSDP reduce-scatters but does not all-reduce in a microbatch backward. This saves communication while still saving memory. Only on the last microbatch do we need to both reduce-scatter and all-reduce. This is not implemented yet and will hopefully come in a future PR.
There is one notable part of doing this. On the last microbatch, we need to perform an accumulation step after reduce-scatter and before all-reduce. If not, then the preceding microbatch's gradients will not be contributed across the replica group. (In other words, we cannot simply accumulate _after_ all-reduce.)
Consider 32 GPUs with 4-way replication and 8-way sharding and 2 microbatches, and focus on global rank 0.
- After the first microbatch, rank 0 will have its shard of $\frac{1}{8} \sum_{i \in S(0)} g_i^{(1)}$, where we define $S(0) = \{0, 1, \dots, 7\}$ to be the ranks in its shard group and we define the $(1)$ superscript to denote the first microbatch.
- Upon the second microbatch, rank 0 after its reduce-scatter will additionally have its shard of $\frac{1}{8} \sum_{i \in S(0)} g_i^{(2)}$. If we only all-reduce this, then this second microbatch's gradients become $\frac{1}{32} \sum_{i=0, 1, \dots, 31} g_i^{(2)}$, so in total, rank 0 has $\frac{1}{8} \sum_{i \in S(0)} g_i^{(1)} + \frac{1}{32} \sum_{i=0, 1, \dots, 31} g_i^{(2)}$, which is wrong.
- Importantly, we must accumulate $\frac{1}{8} \sum_{i \in S(0)} g_i^{(1)} + \frac{1}{8} \sum_{i \in S(0)} g_i^{(2)} = \frac{1}{8}\sum_{i \in S(0)} (g_i^{(1)} + g_i^{(2)})$ first before all-reducing to get $\frac{1}{32} \sum_{i=0, 1, \dots, 31} (g_i^{(1)} + g_i^{(2)})$.
Now, note how under this approach, we want a factor of $\frac{1}{8}$ only (i.e. reciprocal of the shard group size), not $\frac{1}{32}$, for the first microbatch's gradients.
- For bf16/fp32, since we use `ReduceOp.AVG` and we only reduce-scatter on the first microbatch, we correctly have a factor of $\frac{1}{8}$ on the first microbatch.
- For fp16, since we precompute the gradient divide factors at init time assuming always reducing over both shard and replica groups, we incorrectly have a factor of $\frac{1}{32}$ on the first microbatch, deviating from the bf16/fp32 case.
We can address this issue by matching the bf16/fp32 vs. fp16 semantics by computing the divide factors at runtime based on which process groups were passed into the reduction function (`foreach_reduce`).
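As a rough illustration (not the actual `foreach_reduce` code; the helper name and signature are assumptions), computing the factor at runtime from the groups that actually participate could look like:
```python
import torch.distributed as dist

# Hypothetical helper: derive the gradient divide factor from the process groups
# passed into this microbatch's reduction.
def _get_grad_divide_factor(reduce_scatter_group, all_reduce_group=None) -> float:
    factor = dist.get_world_size(reduce_scatter_group)  # shard group size (e.g. 8)
    if all_reduce_group is not None:  # only provided on the last microbatch for HSDP
        factor *= dist.get_world_size(all_reduce_group)  # replica group size (e.g. 4)
    return float(factor)
```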
**Additional Notes**
How to implement the HSDP reduce-scatter but no all-reduce is not entirely clear yet. (What is the cleanest way to do this?) We need to store the partial reduce-scatter output and check for it upon the next backward. We should also be sure to error if the set of parameters receiving gradients changes, in which case we cannot support this easily. Anyway, we will implement this in a follow-up.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125484
Approved by: https://github.com/wanchaol
ghstack dependencies: #125431, #125479
This does a few things that were originally a few PRs but I am on a new machine and don't have ghstack.
If it is too problematic to review, I can re-split, just let me know.
This does:
- Cleanup context manager use in test_flop_counter
- Remove need for mod argument in FlopCounterMode, warning about it
- Re-implement a Module tracker from scratch using global forward Module hooks and multi_grad_hook (we cannot use global backward Module hooks because they don't look for nested Tensors and they're custom-Function based instead of multi_grad_hook based).
- Update FlopCounterMode to use the new ModuleTracker. The entire existing test suite passes as-is (the only changes there are the new tests and the refactoring mentioned above).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125352
Approved by: https://github.com/mikaylagawarecki
You can trigger ciflow tags on main branch commits, so we should be more conservative when checking to see if a workflow is a PR/on the main branch.
get_pr_number checks for the PR number based on the PR_NUMBER env var or a tag of the form `ciflow/workflow/pr number`.
If we fail to find either, assume the workflow is running on the main branch.
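A minimal sketch of that lookup (the helper name and tag format follow the description above; the rest is an assumption, not the actual CI script):
```python
import os
import re
from typing import Optional

def get_pr_number(tags: list[str]) -> Optional[int]:
    # Prefer the PR_NUMBER env var set on pull_request-triggered runs.
    pr_number = os.environ.get("PR_NUMBER")
    if pr_number:
        return int(pr_number)
    # Otherwise look for a tag of the form ciflow/<workflow>/<pr number>.
    for tag in tags:
        match = re.match(r"^ciflow/[^/]+/(\d+)$", tag)
        if match:
            return int(match.group(1))
    return None  # assume the workflow is running on the main branch
```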
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125485
Approved by: https://github.com/huydhn
A re-land of #124239.
This PR fakifies ScriptObject inputs and attributes in export non-strict mode by default.
The basic idea is to only fakify the script object during tracing (i.e. aot_export). After we get the traced graph module, eagerly executing, serializing, or running more passes will use the real script objects. This is essentially treating the script object as constant tensor.
Concretely, we
1. fakify all the script object inputs and module attributes (gathered by constant_attrs),
2. patch the module's attributes with the fakified script objects,
3. right after aot_export, remove the patching (to avoid changing the original module), then modify the exported graph module's attributes back to the real script objects.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125490
Approved by: https://github.com/angelayi
Summary:
We found that some dumps are missing when monitoring thread timeout.
This is likely because multiple PGs could still dump the same records
at the same time, so we should allow only PG0 to actually dump.
Test Plan:
unit test
python test/run_test.py --cpp --verbose -i cpp/ProcessGroupNCCLErrorsTest
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125356
Approved by: https://github.com/c-p-i-o
Triton updated the interface for `triton.compile` 5162346487
The `target` argument to compile needs to be wrapped in a `GPUTarget` object. Without proper wrapping, we hit an assert in `compile`. If that assert is removed, Triton attempts to read device info from Torch while inside a torch thread, which hits an "in bad fork" assert. This change is required for compatibility with the latest commits in Triton. The implementation is backwards compatible, so existing versions of Triton that work now continue to work.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125241
Approved by: https://github.com/jansel
This provides utilities for creating and querying properties on
sympy.Symbol. I want to use this refactor to get a better handle on how
the 's' prefix is being used in Inductor. To start, I only do
symbolic_shapes code because that's what I'm familiar with.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125395
Approved by: https://github.com/Skylion007
This PR shows a simple utility to broadcast the parameters across replicas for HSDP:
```
replicate_group = mesh.get_group("replicate")
for param in model.parameters():
    # E.g. for mesh [[0, 1, 2, 3], [4, 5, 6, 7]] sharding on dim-1 and
    # replicating on dim-0, broadcast with sources 0, 1, 2, 3
    src_rank = dist.get_process_group_ranks(replicate_group)[0]
    torch.distributed.broadcast(
        param.to_local(), src=src_rank, group=replicate_group
    )
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125431
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
Greetings!
Fixes #125403
Please assist me with the testing, as it is possible for my reproducer to miss the error in the code. Several (at least two) threads should enter the same part of the code at the same time to verify that the file lock is actually working.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125404
Approved by: https://github.com/ezyang
Fixes cutlass_utils.get_max_alignment(), which so far was not checking the alignment properly: the method assumed that the passed layout is contiguous and row-major, which does not have to be true.
Test Plan:
CI - test_cutlass_backend.py to prevent regressions
Added unit test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124930
Approved by: https://github.com/int3
ghstack dependencies: #124929
- Implement a very straightforward Metal copy of CPU int4mm kernel
- Implement int8mm kernel by constructing a graph consisting of upcast, transpose and mm
- Add `isCapturing`, `isCaptureEnabled`, `startCapture` and `stopCapture` methods to `MPSProfiler`, which can be used to help one debug/profile Metal kernels by wrapping the calls with the following
```cpp
if (getMPSProfiler().isCaptureEnabled()) {
getMPSProfiler().startCapture(__func__, mpsStream);
}
...
if (getMPSProfiler().isCapturing()) {
getMPSProfiler().stopCapture(mpsStream);
}
```
that, if invoked with the `MTL_CAPTURE_ENABLED` environment variable set to 1, will produce .gputrace files in the current working directory, which can later be loaded and used to debug or profile the kernel
<img width="1093" alt="image" src="https://github.com/pytorch/pytorch/assets/2453524/a2bf27e8-df8a-442c-a525-1df67b8a376a">
- Added `test_int4mm` to TestLinalgMPS, which is mostly copy-n-paste of the test from `test_linalg`
TODOs:
- Add weight pack
- Perf-tune both kernels
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125163
Approved by: https://github.com/mikekgfb
Some toy example:
<img width="998" alt="Screenshot 2024-04-17 at 2 00 05 PM" src="https://github.com/pytorch/pytorch/assets/31054793/b5665a63-beb0-4ca1-92c6-c57a052812fd">
We define `FullyShardedDataParallel._unshard(async_op: bool = False)` that can be used to prefetch all-gathers. The user should make sure:
1. Run lazy init before the first `_unshard` call of training. For example, this can hackily be done via `root_module.check_is_root()` on the root FSDP module `root_module`.
2. Call `root_module._wait_unshard_streams_on_current_stream()` before the first `_unshard` call of the current iteration (just need to call it once after last optimizer step and before first `_unshard` of this iteration).
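A hedged usage sketch of the two requirements above (`root_module`, `submodule`, `batches`, and `optim` are illustrative placeholders, not names from the PR):
```python
# 1. Hackily run lazy init via check_is_root() before the first _unshard of training.
root_module.check_is_root()

for batch in batches:
    # 2. Once per iteration, before the first _unshard call.
    root_module._wait_unshard_streams_on_current_stream()
    submodule._unshard(async_op=True)  # prefetch this submodule's all-gather
    loss = root_module(batch).sum()
    loss.backward()
    optim.step()
    optim.zero_grad()
```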
Differential Revision: [D56262876](https://our.internmc.facebook.com/intern/diff/D56262876)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124304
Approved by: https://github.com/wanchaol
- The original `test_grad_scaling_autocast_fused_optimizers` did not work since there is no "fused" in `optim_inputs`.
- We should use different `grad_scaler`s; they should not share one `scale`. No issue is exposed here because the default `_growth_interval` is 2000, so the scale will not grow, and no inf is found, so it will not be reduced either. The one in `test_cuda.py` should also have this issue.
- I set a manual seed for reproducibility in case there is any numerical failure.
- I use a Tensor tracker here because we failed this UT in the dynamo case: the generated cpp code is not exactly the same for the fused/non-fused kernels.
- I make it check both `cuda` and `cpu`.
- I found an SGD numerical issue with `clang` and fixed it by using `fmadd` instead of `add/mul` in the fused SGD vec kernel.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124904
Approved by: https://github.com/jgong5, https://github.com/janeyx99
Add `PyTorchFileWriter.write_record_metadata(record_name, num_bytes)` that
- writes the zipfile header/end of central directory metadata for an entry*
- reserves `num_bytes` in the zipfile for the payload.
*Since the payload is not provided, the CRC32 computation is skipped and 0s are written in the corresponding entry of the zipfile header
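A rough usage sketch under the signature described above (`PyTorchFileWriter` lives in `torch._C`; the archive name and record name here are placeholders):
```python
from torch._C import PyTorchFileWriter

writer = PyTorchFileWriter("archive.zip")
# Write the zip entry metadata for "data/0" and reserve 1 KiB for its payload;
# since no payload is given, the CRC32 field is written as 0.
writer.write_record_metadata("data/0", 1024)
writer.write_end_of_file()
```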
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125184
Approved by: https://github.com/albanD
Summary:
Fixing the implementation of `_flatten_dynamic_shapes()`, to follow how `_process_dynamic_shapes()` does it. The previous implementation would misinterpret some nested dynamic shapes specs, causing it to miss out on some shapes specs, for example with nested inputs/constant input tuples:
```
inputs = (
    (2, 1),
    (
        torch.randn(2, 1),
        torch.randn(2, 2),
        torch.randn(2, 3),
    )
)
dynamic_shapes = (
    (None, None),
    (
        None,
        None,
        None,
    )
)
```
This would get interpreted as 2 shapes specs for 2d and 3d tensors. Fixing so this doesn't happen.
Test Plan: Existing export tests
Differential Revision: D56894923
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125415
Approved by: https://github.com/angelayi
Summary: Remove the check to make sure all GPU labels are enumerated when CUDA is available. There are some systems where CUDA is available but we do not print any GPU labels (because GPU is not available).
Test Plan: Test in regression with ciflow/periodic label
Differential Revision: D56906893
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125430
Approved by: https://github.com/izaitsevfb
Summary:
The chunk/split ops on the weights/constants are folded in an fx pass, and each output tensor has the same storage size as the original tensor (which is 3x its actual size for chunk(3)). However, the backend calculates the on-device memory size from the tensor shape/stride/dtype. This causes a mismatch when copying weights/constants to the device, as the allocated memory on the device is always smaller than the size of the weights/constants, which results in a runtime error when loading the weight/constant (T172125529).
This diff fixes the issue by cloning the tensors after const folding so that the tensors have the correct storage size.
Test Plan:
Before this change: (18432 = 48 * 64 * 2 * 3)
```
RuntimeError: Failed to load constant getitem_idx0 split (remaining=18432) at fbcode/caffe2/torch/fb/acc_runtime/afg/afg_bindings.cpp:3422: Request failed because an invalid parameter
```
```
buck2 run mode/opt //caffe2/torch/fb/acc_runtime/afg/tests:test_operators-artemis -- -r test_mem_size_mismatch
```
```
Ran 1 test in 7.048s
OK
```
Reviewed By: jfix71
Differential Revision: D56663931
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125199
Approved by: https://github.com/jfix71
`cumsum` and `cumprod` were (are?) buggy for MPS: c8d2a55273/aten/src/ATen/native/mps/operations/UnaryOps.mm (L435-L436)
A workaround casts the input to int32 prior to performing the op to prevent overflow for certain numeric types.
It turns out this issue also affects boolean types:
```python
import torch
print(torch.ones(128, dtype=torch.bool, device="mps").cumsum(0)[-1])
# tensor(-128, device='mps:0')
```
In this PR I'm adding logic to also cast bool dtypes to int32 prior to `cumsum` and `cumprod`, although output is guaranteed not to overflow for the latter with bools. I'm also adding a test to prevent regressions.
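A minimal sketch of the workaround (a hypothetical Python helper for illustration; the actual change lives in the Objective-C++ MPS kernel):
```python
import torch

def mps_safe_cumsum(t: torch.Tensor, dim: int) -> torch.Tensor:
    # Cast bool to int32 first so the running sum cannot overflow into negatives.
    if t.dtype == torch.bool:
        return t.to(torch.int32).cumsum(dim)
    return t.cumsum(dim)
```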
Fixes #96614, #106112, #109166
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125318
Approved by: https://github.com/malfet
`torch.utils.benchmark.Compare` is not directly exposed in torch.utils.benchmark documentation.
I think this is a valuable addition since it can help people embrace the torch benchmark way of doing things and help people build documentation around it.
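For context, a small example of the API being documented (the workload here is arbitrary):
```python
import torch
from torch.utils import benchmark

x = torch.randn(1024, 1024)
results = []
for label, stmt in [("add", "x + x"), ("mul", "x * x")]:
    timer = benchmark.Timer(stmt=stmt, globals={"x": x}, description=label)
    results.append(timer.blocked_autorange(min_run_time=0.2))

# Compare collects the measurements into a single formatted table.
compare = benchmark.Compare(results)
compare.trim_significant_figures()
compare.print()
```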
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125009
Approved by: https://github.com/mikaylagawarecki
~Users may have custom use cases for the `strict` parameter in load. In my mind, if we automatically call `state_dict` and `load_state_dict` in save/load, we need to support the same functionality in `nn.Modules`.~
It turns out this is actually not related to nn.Module's strict param. Since `state_dict` is called inside `dcp.load`, it's actually impossible to create a model such that the following would raise an error:
```
state_dict = module.state_dict()
module.load_state_dict(state_dict, strict=True)
```
The issue is actually just when there are elements in `state_dict` which do not exist in the checkpoint. This PR adds the ability to configure this behavior through the DefaultLoadPlanner (see tests).
Concretely, if module has extra attributes not present in the checkpoint, we will only raise an error if `DefaultLoadPlanner.allow_partial_load==False`
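A hedged sketch of how that configuration could look (`model` and the checkpoint path are placeholders; `allow_partial_load` is the flag described above):
```python
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.default_planner import DefaultLoadPlanner

state_dict = model.state_dict()  # may contain keys not present in the checkpoint
dcp.load(
    state_dict,
    checkpoint_id="path/to/checkpoint",
    # allow_partial_load=True: do not error on state_dict keys missing from the checkpoint.
    planner=DefaultLoadPlanner(allow_partial_load=True),
)
```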
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123869
Approved by: https://github.com/fegin
Summary:
## `-Wmissing-prototypes`
In ATen-Vulkan, we often define functions in `.cpp` files without declaring them in `.h` files and without hiding them in an anonymous namespace.
Example: [`Packing.cpp`'s channel_image_repacking()](f1f142c44f/aten/src/ATen/native/vulkan/impl/Packing.cpp (L299-L348))
On Mac, this results in a `-Wmissing-prototypes` warning, which is disabled in this change.
## `-Wshadow`
In `Adapter.cpp`, we overwrite a variable called `properties`, which we fix in this change as opposed to disabling the warning.
Test Plan: CI
Differential Revision: D56850324
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125361
Approved by: https://github.com/SS-JIA
From my test with an Ads production workload, I found that sometimes kernel_file is None and grid is a tuple. This crashes since ExecutionTraceObserver expects a string for both kernel_file and grid. This PR makes sure kernel_file and grid are always passed down as strings. We still need to find the root cause of why kernel_file is None.
Unit test:
buck test @mode/dev-nosan caffe2/test:profiler -- test_execution_trace_with_pt2
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125384
Approved by: https://github.com/davidberard98, https://github.com/sraikund16
Summary:
Torchscript modules do not support forward hooks and thus can't work with flop_counter context manager for hierarchical output by passing a module to FlopCounterMode on construction.
Currently, any module that includes a script module causes an exception to be thrown, so this adds a try/catch to skip script modules when registering forward hooks.
Test Plan: CI Signals
Differential Revision: D56850661
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125346
Approved by: https://github.com/842974287
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* #124944
* #124939
* __->__ #122965
Differential Revision: [D55493240](https://our.internmc.facebook.com/intern/diff/D55493240/)
*This PR is now ready for merge and is not an RFC*
Major choices are:
-- the introduction of the AsyncStager protocol
-- removed `executor` from param.
-- leave async as a separate method (for now)
This proposal seeks to add extension points to dcp.async_save, allowing users to:
- Specify a specific staging method when calling async_save
- Allow a vehicle for also making the staging method itself async, for cases where we may want to overlap it with the training loop (e.g., overlap d2h with training and only synchronize at the optim.step)
- Potentially specify the execution method for doing async_save in parallel. For example some users may prefer a subprocess over a thread to avoid GIL issues.
A totally reasonable alternative to this entire proposal is to expect users who want this level of customization
to write their own custom async save methods. Here's an example which addresses the issues mentioned
in PR comments.
```
def custom_async_save(...):
    # this step accomplishes staging and includes the usual 'planning' calls (issue 1)
    buffered_writer = CpuBufferedWriter()  # this is stateful, contains a copy of state_dict
    dcp.save(state_dict, storage_writer=buffered_writer)
    final_storage_writer = FileSystemWriter()
    mp.spawn(  # issue 2 is gone, do whatever you want here
        dcp.save,  # or some custom sub-process method which calls dcp.save under the hood
        buffered_writer.state_dict,  # lots of ways to do this, not really the most important part
        checkpoint_id=checkpoint_id,
        storage_writer=storage_writer,
        planner=planner,
        process_group=process_group,  # this actually wouldn't work, but again not the point
    )
    # leaving out the rest of the details for managing your extra special subprocess
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122965
Approved by: https://github.com/daulet-askarov
A common complaint when working with data-dependent code in PyTorch is that it's hard to tell how far you are from the finish line: every time a GuardOnDataDependentSymNode error is hit, you have to somehow fix or workaround it to see the next one.
This PR adds a new mode `torch._functorch.config.fake_tensor_propagate_real_tensors` which modifies fake tensors to also propagate real tensors. This means that when we try to guard on a data-dependent SymNode, we can actually produce a real result. We also produce a warning which you should consult to figure out what the crux points are.
I ran this on vision_maskrcnn. In the baseline (without this mode), the model has 27 graph breaks, resulting in 40 graphs. With this mode on, the model has only 11 graph breaks, resulting in 15 graphs (the remaining graph breaks are due to missing functionality for item() on float tensor and some other Dynamo missing features.) You get a list of things that would have errored like this:
```
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u1) < 2) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u1), 1)) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u0) < 2) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u0), 1)) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u1) < 2) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u1), 1)) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u0) < 2) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u0), 1)) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u1) < 2) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u1), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u0) < 2) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> False
```
Potential later follow ups:
* Improve the warning messages (in particular, should provide user frames)
* GC real tensors when they are no longer needed by tracing. Right now, this will use A LOT of memory, equal to as if your GC was broken and every intermediate tensor was kept live
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125115
Approved by: https://github.com/IvanKobzarev
Minor refactoring:
Remove unused "fused epilogue node" arguments from some method Kernel call signatures.
Test Plan:
Covered by current tests in test_cutlass_backend.py - no functional change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124929
Approved by: https://github.com/eellison
Fixes #121965
This PR hopes to add support for complex numbers in the scatter/gather related kernels. For brevity, I will only include `complex<float>` for now, as `complex<double>`, for example, will be more complicated.
C++ unit tests are currently passing alongside tests in `test_scatter_gather_ops.py`. Python test suites also seem to be passing.
Please keep the following in mind:
1) I think this is my first time using Pytorch.
2) This is my first contribution to Pytorch.
Environment:
3080 & WSL 2. `nvcc` is at 12.4.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124809
Approved by: https://github.com/mikaylagawarecki
In #123319, we guard some behavior behind the `assume_aligned_inputs` config option. If we set this to `False`, then the behavior added in #123319 becomes the default behavior. See the referenced PR for more details about the behavior affected.
Side effects:
* It's possible that this will hurt performance in some scenarios. For example, if an unaligned input is used in a matmul, it might be better to perform the clone to align it first.
* This will occasionally cause recompiles. Specifically: the check we perform (`(storage_offset * get_dtype_size(dtype)) % ALIGNMENT == 0`) can be guarded on if the storage_offset becomes dynamic. storage_offset becomes dynamic during automatic_dynamic_shapes after a shape or stride changes. Previously, this was increasing graph breaks in cpu inductor torchbench tests (but is fixed by more carefully guarding checks on alignment, so that we don't run them and generate guards unless actually needed).
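For reference, the check mentioned above in sketch form (the ALIGNMENT value and function name are assumptions for illustration):
```python
ALIGNMENT = 16  # bytes; the assumed input alignment

def is_aligned(storage_offset: int, dtype_size: int) -> bool:
    # Guardable check: the byte offset of the tensor data must be a multiple of ALIGNMENT.
    return (storage_offset * dtype_size) % ALIGNMENT == 0
```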
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124336
Approved by: https://github.com/eellison
* Fixes https://github.com/pytorch/pytorch/issues/124886
* Kind of similar to https://github.com/pytorch/pytorch/pull/109393
I think what happens is `exit` and `exit /b` propagate the errorlevel correctly, but `exit /b` only exits the currently running batch script and not the entire cmd.exe (or whatever program is running the batch script), so `exit /b` exits with errorlevel 1, but the parent cmd exits with 0, and bash sees cmd's 0.
I think `goto fail` and `exit` are the same thing when the batch script is run from a bash script, so either would work in this case? But the `goto fail` method might be better if someone happens to run the script on the command line.
I assumed that anywhere anyone was exiting after checking the error code, they did want to exit completely, and I'm pretty sure that being inside a parenthesis counts as being a different script, so I changed everything to goto fail just in case; this might be too aggressive?
Logs after this change for a build failure on cuda:
https://github.com/pytorch/pytorch/actions/runs/8912185834/job/24475087535?pr=125306
```
2 errors detected in the compilation of "C:/actions-runner/_work/pytorch/pytorch/aten/src/ATen/native/cuda/AdaptiveMaxPooling3d.cu".
AdaptiveMaxPooling3d.cu
[7599/8420] Linking CXX shared library bin\torch_cpu.dll
ninja: build stopped: subcommand failed.
-- Building version 2.4.0a0+git3171c11
cmake -GNinja -DBUILD_ENVIRONMENT=win-vs2019-cuda11.8-py3 -DBUILD_PYTHON=True -DBUILD_TEST=True -DBUILD_TYPE=release -DBUILD_WHEEL=1 -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_COMPILER=C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.8/bin/nvcc.exe -DCMAKE_CUDA_COMPILER_LAUNCHER=C:/actions-runner/_work/pytorch/pytorch/build/win_tmp/bin/randomtemp.exe;C:/actions-runner/_work/pytorch/pytorch/build/win_tmp\bin\sccache.exe -DCMAKE_CXX_COMPILER_LAUNCHER=sccache -DCMAKE_C_COMPILER_LAUNCHER=sccache -DCMAKE_GENERATOR=Ninja -DCMAKE_INSTALL_PREFIX=C:\actions-runner\_work\pytorch\pytorch\torch -DCMAKE_PREFIX_PATH=C:\Jenkins\Miniconda3\Lib\site-packages -DCUDA_NVCC_EXECUTABLE=C:/actions-runner/_work/pytorch/pytorch/build/win_tmp/bin/nvcc.bat -DCUDNN_LIBRARY=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\lib\x64 -DNUMPY_INCLUDE_DIR=C:\Jenkins\Miniconda3\lib\site-packages\numpy\core\include -DPYTHON_EXECUTABLE=C:\Jenkins\Miniconda3\python.exe -DPYTHON_INCLUDE_DIR=C:\Jenkins\Miniconda3\Include -DPYTHON_LIBRARY=C:\Jenkins\Miniconda3/libs/python39.lib -DTORCH_BUILD_VERSION=2.4.0a0+git3171c11 -DTORCH_CUDA_ARCH_LIST=8.6 -DUSE_CUDA=1 -DUSE_NUMPY=True C:\actions-runner\_work\pytorch\pytorch
cmake --build . --target install --config Release -- -j 8
(base) C:\actions-runner\_work\pytorch\pytorch>if errorlevel 1 goto fail
(base) C:\actions-runner\_work\pytorch\pytorch>exit /b 1
Error: Process completed with exit code 1.
```
vs original
https://github.com/pytorch/pytorch/actions/runs/8910674030/job/24470387612
```
2 errors detected in the compilation of "C:/actions-runner/_work/pytorch/pytorch/aten/src/ATen/native/cuda/AdaptiveMaxPooling3d.cu".
AdaptiveMaxPooling3d.cu
[7604/8420] Linking CXX shared library bin\torch_cpu.dll
ninja: build stopped: subcommand failed.
-- Building version 2.4.0a0+gite09f98c
cmake -GNinja -DBUILD_ENVIRONMENT=win-vs2019-cuda11.8-py3 -DBUILD_PYTHON=True -DBUILD_TEST=True -DBUILD_TYPE=release -DBUILD_WHEEL=1 -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_COMPILER=C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.8/bin/nvcc.exe -DCMAKE_CUDA_COMPILER_LAUNCHER=C:/actions-runner/_work/pytorch/pytorch/build/win_tmp/bin/randomtemp.exe;C:/actions-runner/_work/pytorch/pytorch/build/win_tmp\bin\sccache.exe -DCMAKE_CXX_COMPILER_LAUNCHER=sccache -DCMAKE_C_COMPILER_LAUNCHER=sccache -DCMAKE_GENERATOR=Ninja -DCMAKE_INSTALL_PREFIX=C:\actions-runner\_work\pytorch\pytorch\torch -DCMAKE_PREFIX_PATH=C:\Jenkins\Miniconda3\Lib\site-packages -DCUDA_NVCC_EXECUTABLE=C:/actions-runner/_work/pytorch/pytorch/build/win_tmp/bin/nvcc.bat -DCUDNN_LIBRARY=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\lib\x64 -DNUMPY_INCLUDE_DIR=C:\Jenkins\Miniconda3\lib\site-packages\numpy\core\include -DPYTHON_EXECUTABLE=C:\Jenkins\Miniconda3\python.exe -DPYTHON_INCLUDE_DIR=C:\Jenkins\Miniconda3\Include -DPYTHON_LIBRARY=C:\Jenkins\Miniconda3/libs/python39.lib -DTORCH_BUILD_VERSION=2.4.0a0+gite09f98c -DTORCH_CUDA_ARCH_LIST=8.6 -DUSE_CUDA=1 -DUSE_NUMPY=True C:\actions-runner\_work\pytorch\pytorch
cmake --build . --target install --config Release -- -j 8
(base) C:\actions-runner\_work\pytorch\pytorch>if errorlevel 1 exit /b
+ assert_git_not_dirty
+ [[ win-vs2019-cuda11.8-py3 != *rocm* ]]
+ [[ win-vs2019-cuda11.8-py3 != *xla* ]]
++ git status --porcelain
++ grep -v '?? third_party'
++ true
+ git_status=
+ [[ -n '' ]]
+ echo 'BUILD PASSED'
BUILD PASSED
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125306
Approved by: https://github.com/ZainRizvi, https://github.com/huydhn, https://github.com/atalman
To fix data-dependent errors we want to recommend that people use `torch._check*` APIs. The `constrain_as*` APIs should be fully subsumed by them, and in the future we should kill them entirely.
Differential Revision: D56774333
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125253
Approved by: https://github.com/ezyang
In the given test case, we have a ModuleList of 3 modules (`norm.0`, `norm.1`, `norm.2`) which share the same `weight` and `bias` tensors. However, when we trace, they all end up pointing to one state dict name (e.g. `norm.2`).
```
graph():
%p_norms_0_weight : [num_users=0] = placeholder[target=p_norms_0_weight]
%p_norms_0_bias : [num_users=0] = placeholder[target=p_norms_0_bias]
%p_norms_1_weight : [num_users=0] = placeholder[target=p_norms_1_weight]
%p_norms_1_bias : [num_users=0] = placeholder[target=p_norms_1_bias]
%p_norms_2_weight : [num_users=3] = placeholder[target=p_norms_2_weight]
%p_norms_2_bias : [num_users=3] = placeholder[target=p_norms_2_bias]
%input_ : [num_users=1] = placeholder[target=input_]
%native_layer_norm : [num_users=1] = call_function[target=torch.ops.aten.native_layer_norm.default](args = (%input_, [2, 2, 3], %p_norms_2_weight, %p_norms_2_bias, 1e-05), kwargs = {})
%getitem : [num_users=1] = call_function[target=operator.getitem](args = (%native_layer_norm, 0), kwargs = {})
%native_layer_norm_1 : [num_users=1] = call_function[target=torch.ops.aten.native_layer_norm.default](args = (%getitem, [2, 2, 3], %p_norms_2_weight, %p_norms_2_bias, 1e-05), kwargs = {})
%getitem_3 : [num_users=1] = call_function[target=operator.getitem](args = (%native_layer_norm_1, 0), kwargs = {})
%native_layer_norm_2 : [num_users=1] = call_function[target=torch.ops.aten.native_layer_norm.default](args = (%getitem_3, [2, 2, 3], %p_norms_2_weight, %p_norms_2_bias, 1e-05), kwargs = {})
%getitem_6 : [num_users=1] = call_function[target=operator.getitem](args = (%native_layer_norm_2, 0), kwargs = {})
return (getitem_6,)
```
This causes an error in the unflattener where after constructing the submodules for `norm.0`, it will have the graph pointing to `norm.2.weight` and `norm.2.bias`:
```
graph():
%p_norms_2_bias : [num_users=1] = placeholder[target=p_norms_2_bias]
%p_norms_2_weight : [num_users=1] = placeholder[target=p_norms_2_weight]
%input_ : [num_users=1] = placeholder[target=input_]
%native_layer_norm : [num_users=1] = call_function[target=torch.ops.aten.native_layer_norm.default](args = (%input_, [2, 2, 3], %p_norms_2_weight, %p_norms_2_bias, 1e-05), kwargs = {})
%getitem : [num_users=1] = call_function[target=operator.getitem](args = (%native_layer_norm, 0), kwargs = {})
return getitem
```
Since the attributes are not within the same scope of the graph, (`norm.0` vs. `norm.2`), they will not be added to the subgraph, causing an error.
So this PR handles the duplicate state dict attributes by modifying the `inputs_to_state` dict to map from node names to a list of possible state dict target names.
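Conceptually, the modified mapping looks something like this (a hypothetical before/after based on the example above, not the exact internal data):
```python
# Before: one placeholder name -> a single state-dict target.
inputs_to_state_before = {
    "p_norms_2_weight": "norms.2.weight",
    "p_norms_2_bias": "norms.2.bias",
}

# After: one placeholder name -> every state-dict target it may correspond to,
# so each unflattened submodule (norms.0/1/2) can resolve its own attribute in scope.
inputs_to_state_after = {
    "p_norms_2_weight": ["norms.0.weight", "norms.1.weight", "norms.2.weight"],
    "p_norms_2_bias": ["norms.0.bias", "norms.1.bias", "norms.2.bias"],
}
```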
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125192
Approved by: https://github.com/zhxchen17
The first append not having a space incorrectly merges it into any previous arguments, like `-allow-unsupported-compiler` in my case, which results in a silly error: `unrecognized command-line option '-allow-unsupported-compiler-DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS'`
full log:
```
python setup.py develop
Building wheel torch-2.4.0a0+git75fa54a
-- Building version 2.4.0a0+git75fa54a
cmake3 -GNinja -DBUILD_PYTHON=True -DBUILD_TEST=True -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/code/pytorch/torch -DCMAKE_PREFIX_PATH=/code/pytorch/.venv/lib/python3.12/site-packages;/code/spack/opt/spack/linux-fedora40-zen2/gcc-14.0.1/gcc-13.2.0-noa2f4oqalxzqvsebhuntndewgt4gq4h:/code/spack/opt/spack/linux-fedora40-zen2/gcc-14.0.1/zstd-1.5.6-z3guwm4l5rmmsv4g4wvkej3ri3bppeja:/code/spack/opt/spack/linux-fedora40-zen2/gcc-14.0.1/zlib-ng-2.1.6-kwi4ljobodjgv5eetnga4bow6crdlacl:/code/spack/opt/spack/linux-fedora40-zen2/gcc-14.0.1/mpc-1.3.1-nuwa2snyzm265lsupa2dkmxxyhiqcv7e:/code/spack/opt/spack/linux-fedora40-zen2/gcc-14.0.1/mpfr-4.2.1-wepuwobwttxbtz3nguimxa2mlljjozsi:/code/spack/opt/spack/linux-fedora40-zen2/gcc-14.0.1/gmp-6.2.1-ashy6kiitonxv2f365f4q3beggzf3646:/code/spack/opt/spack/linux-fedora40-zen2/gcc-14.0.1/gcc-runtime-14.0.1-wmogkqrzn7t57dogaake2hmhjbod27gs -DNUMPY_INCLUDE_DIR=/code/pytorch/.venv/lib64/python3.12/site-packages/numpy/core/include -DPYTHON_EXECUTABLE=/code/pytorch/.venv/bin/python -DPYTHON_INCLUDE_DIR=/usr/include/python3.12 -DPYTHON_LIBRARY=/usr/lib64/libpython3.12.so.1.0 -DTORCH_BUILD_VERSION=2.4.0a0+git75fa54a -DUSE_NUMPY=True /code/pytorch
-- /usr/lib64/ccache/c++ /code/pytorch/torch/abi-check.cpp -o /code/pytorch/build/abi-check
-- Determined _GLIBCXX_USE_CXX11_ABI=1
-- Current compiler supports avx2 extension. Will build perfkernels.
-- Current compiler supports avx512f extension. Will build fbgemm.
-- The CUDA compiler identification is NVIDIA 12.4.131
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - failed
-- Check for working CUDA compiler: /usr/local/cuda-12/bin/nvcc
-- Check for working CUDA compiler: /usr/local/cuda-12/bin/nvcc - broken
CMake Error at /usr/share/cmake/Modules/CMakeTestCUDACompiler.cmake:59 (message):
The CUDA compiler
"/usr/local/cuda-12/bin/nvcc"
is not able to compile a simple test program.
It fails with the following output:
Change Dir: '/code/pytorch/build/CMakeFiles/CMakeScratch/TryCompile-mSGoFl'
Run Build Command(s): /code/pytorch/.venv/bin/ninja -v cmTC_ee207
[1/2] /usr/local/cuda-12/bin/nvcc -forward-unknown-to-host-compiler -allow-unsupported-compiler-DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all "--generate-code=arch=compute_52,code=[compute_52,sm_52]" -MD -MT CMakeFiles/cmTC_ee207.dir/main.cu.o -MF CMakeFiles/cmTC_ee207.dir/main.cu.o.d -x cu -c /code/pytorch/build/CMakeFiles/CMakeScratch/TryCompile-mSGoFl/main.cu -o CMakeFiles/cmTC_ee207.dir/main.cu.o
FAILED: CMakeFiles/cmTC_ee207.dir/main.cu.o
/usr/local/cuda-12/bin/nvcc -forward-unknown-to-host-compiler -allow-unsupported-compiler-DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all "--generate-code=arch=compute_52,code=[compute_52,sm_52]" -MD -MT CMakeFiles/cmTC_ee207.dir/main.cu.o -MF CMakeFiles/cmTC_ee207.dir/main.cu.o.d -x cu -c /code/pytorch/build/CMakeFiles/CMakeScratch/TryCompile-mSGoFl/main.cu -o CMakeFiles/cmTC_ee207.dir/main.cu.o
gcc: error: unrecognized command-line option '-allow-unsupported-compiler-DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS'
ninja: build stopped: subcommand failed.
CMake will not be able to correctly generate this project.
Call Stack (most recent call first):
cmake/public/cuda.cmake:47 (enable_language)
cmake/Dependencies.cmake:44 (include)
CMakeLists.txt:758 (include)
-- Configuring incomplete, errors occurred!
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125294
Approved by: https://github.com/albanD
We encountered some model accuracy failures since the tolerance setting is critical here. In general, we align with CUDA practice. This PR adjusts the tolerance for Torchbench models in training mode on Intel GPU devices to align with CUDA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125213
Approved by: https://github.com/desertfire
Summary: In the recent weeks, we have encountered bugs in both the normal synchronous trace and on-demand tracing. This diff on its own does sanity checking to make sure the profiler does not have spans that extend past the boundaries that we expect. It also checks some basic properties of the tracings we expect to see. Right now the sanity tests check some basic properties to make sure that the tracings are not completely broken. Requests/suggestions for other properties are welcome.
Test Plan: Run the tests in OSS and Buck
Reviewed By: aaronenyeshi
Differential Revision: D56374298
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124773
Approved by: https://github.com/aaronenyeshi
Fixes #121965
This PR hopes to add support for complex numbers in the scatter/gather related kernels. For brevity, I will only include `complex<float>` for now, as `complex<double>`, for example, will be more complicated.
C++ unit tests are currently passing alongside tests in `test_scatter_gather_ops.py`. Python test suites also seem to be passing.
Please keep the following in mind:
1) I think this is my first time using Pytorch.
2) This is my first contribution to Pytorch.
Environment:
3080 & WSL 2. `nvcc` is at 12.4.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124809
Approved by: https://github.com/eqy, https://github.com/mikaylagawarecki
As apparently `vshlq_u32` is faster than `vcvt_f32_f16`
Refactor NEON `tinygemm_kernel` to rely on `load_as_float32x4` and `load_as_float32x4x2` and implement them for float16 (using vcvt), bfloat16 (using left shift) and plain float32 (not using anything)
As a result, stories110M runs at 60 tokens/sec with f16, but at 66 tokens/sec with bf16 and 75 tokens/sec with f32, though growing bandwidth demand starts to favor the reduced floating-point types as the model size gets bigger.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125290
Approved by: https://github.com/mikekgfb
Fix the patch failure: we should patch the function where it is used, not where it is defined.
Failure info:
```bash
root@cambricon-PowerEdge-C4140:/workspace# python file_based_timer_test.py -k test_expired_timers
/opt/conda/lib/python3.10/site-packages/torch/_custom_ops.py:253: DeprecationWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
return torch.library.impl_abstract(qualname, func, _stacklevel=2)
E
======================================================================
ERROR: test_expired_timers (__main__.FileTimerServerTest)
tests that a single expired timer on a process should terminate
----------------------------------------------------------------------
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2757, in wrapper
method(*args, **kwargs)
File "/opt/conda/lib/python3.10/unittest/mock.py", line 1376, in patched
with self.decoration_helper(patched,
File "/opt/conda/lib/python3.10/contextlib.py", line 135, in __enter__
return next(self.gen)
File "/opt/conda/lib/python3.10/unittest/mock.py", line 1358, in decoration_helper
arg = exit_stack.enter_context(patching)
File "/opt/conda/lib/python3.10/contextlib.py", line 492, in enter_context
result = _cm_type.__enter__(cm)
File "/opt/conda/lib/python3.10/unittest/mock.py", line 1447, in __enter__
original, local = self.get_original()
File "/opt/conda/lib/python3.10/unittest/mock.py", line 1420, in get_original
raise AttributeError(
AttributeError: <module 'torch.distributed.elastic.timer' from '/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/timer/__init__.py'> does not have the attribute 'log_debug_info_for_expired_timers'
To execute this test, run the following from the base repo dir:
python file_based_timer_test.py -k test_expired_timers
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
----------------------------------------------------------------------
Ran 1 test in 0.792s
FAILED (errors=1)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125144
Approved by: https://github.com/gag1jain
On MacOS 14.4, system Python is configured to point to a non-existing include dir
```
% /usr/bin/python3 -c "import sysconfig;print(sysconfig.get_path('include'))"
/Library/Python/3.9/include
```
Work around the issue by composing the path to the include folder from the `stdlib` config, which points to
```
% /usr/bin/python3 -c "import sysconfig;print(sysconfig.get_path('stdlib'))"
/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125285
Approved by: https://github.com/kit1980
While studying some tlparse output, I noticed that CompilationMetrics was reporting that there was no error for frames that have no nodes. I'm pretty sure we don't actually install a frame in this situation. has_guarded_code will tell us if that's the case, because it says whether the GuardedCode object is None or not.
Actually, while working on this, I was wondering if we can ever trigger the "skip this frame entirely, do not trace it ever again" codepath; as best as I could tell, it's impossible for this to happen by the time we get to the compilation metrics block.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125279
Approved by: https://github.com/yanboliang
Summary: There's a shortcoming in the FX graph cache tests in that they don't fully clear all inductor in-memory caches when testing the cache-hit path: We were previously accessing the FX graph cache correctly, but when loading the source object using the PyCodeCache.load_by_key_path() method, _that_ path was serving entries out of memory. To better mimic what happens during warm start (i.e., a new process), we should clear all in-memory caches.
Test Plan: updated the unit tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125280
Approved by: https://github.com/eellison
1. This PR removes the logic for saving and removing the pre-backward hook handles (which is registered via `register_multi_grad_hook(mode="any")`).
2. This PR removes the logic for _trying_ to guard against mistargeted prefetches that relies on querying if the engine will execute the module output tensors' `grad_fn`s. (See https://github.com/pytorch/pytorch/pull/118118 for original motivation.)
For 1, the logic was error prone since it relied on `set_is_last_backward(False)` being set correctly or else pre-backward hooks could be de-registered too early. We would prefer to match the hook lifetimes with that of the autograd graph. This solves a bug with a 1f1b interleaved schedule.
If we directly remove the manual saving/removing hook handle logic, then we have a ref cycle where the tensors' `grad_fn`s are passed to the hook function. We decide to simply remove this `grad_fn` logic since (1) it cannot perfectly prevent mistargeted prefetches and (2) it introduces undesired complexity. In the future, we may prefer a different mechanism to override the prefetching for more complex/dynamic use cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125269
Approved by: https://github.com/weifengpy
ghstack dependencies: #125190, #125191
Summary: Discovered breakages by enabling codecache by default and doing a CI run. I'll commit these fixes first and eventually enabling caching by default will (hopefully) be a one-liner.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125258
Approved by: https://github.com/eellison
Fix https://github.com/pytorch/pytorch/issues/124900.
When we reconstruct `ContextWrappingVariables`s, we only reconstruct the context class, not the object. Normally, contexts are active (via `with ctx:`) and we initialize the context object in the resume function. But for the case of inactive contexts (contexts declared ahead of time before the `with` block), we do not reconstruct them properly in the optimized bytecode or resume function. So this PR adds initialization for inactive contexts in the resume function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125203
Approved by: https://github.com/jansel
Summary: When CapabilityBasedPartitioner creates the fused subgraph as the call_module node, it didn't populate the node.meta["val"] field.
Test Plan: OSS CI
Differential Revision: D56789259
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125261
Approved by: https://github.com/zhxchen17
Summary: Due to the compatibility issue, we hard-coded the passes to do the pattern optimization. Here, we revisit the method since it has been a while since the changes went into production packages. We instead read from the config to decide whether we do the specific pattern optimization, which makes adding follow-up patterns easier.
Differential Revision: D56659934
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125136
Approved by: https://github.com/jackiexu1992
Fixes two build problems on ROCM 6.1 + Ubuntu 22.04
### Inconsistency value of CMAKE_PREFIX_PATH between `.ci/pytorch/build.sh` and Build Instructions
The current `CMAKE_PREFIX_PATH` points to the base conda environment (commonly `/opt/conda`). However, the conda environment used in the CI should be `/opt/conda/envs/py_<VERSION>`, which is supplied by `$CONDA_PREFIX`.
This divergence may cause libstdc++ version conflicts because the base conda environment may ship a different libstdc++ than the `py_<VERSION>` environment and/or the system default environment. One notable issue is that on our internal CI system this script failed to build the AOTriton library on Ubuntu 22.04 due to libstdc++ version conflicts between the HIP compiler and the conda base environment.
This PR fixes this and makes sure the CI script follows the official build instructions.
### Incorrect `tinfo` was linked on Ubuntu 22.04 due to flaws in parsing of `os-release`
The code to parse /etc/os-release is incorrect and the distribution info was parsed as `PRETTY_Ubuntu` instead of `Ubuntu`. `libtinfo` will not be linked into the binary due to this flaw. Thus, cpp unit tests failed to build because of missing symbols from `libtinfo`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118216
Approved by: https://github.com/jeffdaily, https://github.com/jithunnair-amd, https://github.com/pruthvistony, https://github.com/malfet, https://github.com/atalman
This doesn't introduce any new behavior, but sets up a basic cache key generation mechanism that I can test. From here I will:
- Add checks on the ops in an input FXGraph to make sure they are safe to cache. We'll be conservative in the first version here.
- Add serialization for FX graphs
- Save these FX graphs to disk in the cache
- Support graphs with more complicated ops like higher order ops and specialized nn modules
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124642
Approved by: https://github.com/aorenste
Summary: This commit fixes the pattern matching for conv-bn
during QAT fusion where both weight and bias are quantized per
channel. Previously this failed because weights and biases used
the same example kwargs for their scales and zero points,
causing these qparams to be tied during pattern matching.
Test Plan:
python test/test_quantization.py TestQuantizePT2EQAT_ConvBn1d.test_qat_conv_bn_per_channel_weight_bias
python test/test_quantization.py TestQuantizePT2EQAT_ConvBn2d.test_qat_conv_bn_per_channel_weight_bias
Reviewers: jerryzh168, angelayi
Subscribers: jerryzh168, angelayi, supriyar
Differential Revision: [D56740694](https://our.internmc.facebook.com/intern/diff/D56740694)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125208
Approved by: https://github.com/angelayi
Summary:
Add exclusion list to minimizer:
1. some operations cannot be lowered when constructing subgraphs; this usually happens when they are isolated from operation group
2. exclude them in search strategies for automation
Reviewed By: jimone1
Differential Revision: D56327289
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124504
Approved by: https://github.com/jfix71
This PR fakifies ScriptObject inputs and attributes in export non-strict mode by default.
The basic idea is to `only fakify the script object during tracing (i.e. aot_export)`. After we get the traced graph module, eagerly executing, serializing, or running more passes will use the real script objects. This is essentially treating the script object as constant tensor.
Concretely, we
1. fakify all the script object inputs, and module attributes (gathered by constant_attrs).
2. patch the module's attributes with fakified script object
3. right after aot_export, remove the patching (to avoid changing the original module) then modify the exported graph module's attribute to real script object.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124239
Approved by: https://github.com/zou3519
Fixes #124528
Going over the options for our MapAllocator and what they do, I don't think any of the others need to be piped up to `torch.load`
4f29103749/aten/src/ATen/MapAllocator.h (L8-L16)
~However, I wonder if this `MmapVisibility(Enum)` is a good way to represent "or-ing" together of `mmap` flags if we want to extend it in the future. I looked over the flags for [`mmap(2)`](https://man7.org/linux/man-pages/man2/mmap.2.html), and could not immediately see how most of them would be useful for `torch.load` (would maybe `MAP_LOCKED` (like `mlock`) or `MAP_HUGE` ever be worthwhile?)~
Using the flags provided by the python `mmap` library so that we can extend the allowed flags and pipe them down to the cpp `mmap` call if there is a need for other flags in the future
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124889
Approved by: https://github.com/albanD
This PR fixes an issue where calling `aten.alias(int)` raises a TypeError.
```python
import torch
import torch.autograd.forward_ad as fwAD
def f(x):
    return 4312491 * x

device = "cpu"
with torch._subclasses.fake_tensor.FakeTensorMode():
    with fwAD.dual_level():
        x = torch.randn(3, device=device)
        y = torch.ones_like(x)
        dual = fwAD.make_dual(x, y)
        f(dual)
```
The test case above illustrates this bug.
1) `4312491` turns into a tensor that is a wrapped number
2) Forward mode AD calls `aten::alias` internally
3) The wrapped number (`4312491`) becomes a python integer
4) `aten.alias(int)` raises a `TypeError`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124774
Approved by: https://github.com/albanD, https://github.com/zou3519
This PR introduces a new way of building `dynamic_shapes` for export. The idea is to build up a mapping from input tensors to the dynamic shapes that should be assigned to their corresponding fake tensors.
This mapping is automatically converted to the current form of `dynamic_shapes`, which must exactly match the structure of inputs. We do this by using pytree utils.
With the current `dynamic_shapes`, we had to be careful about user-defined classes that are registered with pytree, since such classes are not necessarily polymorphic containers; they may be fine containing tensors, but not dynamic shapes. Thus we had decided to allow input instances of such classes to be associated with dynamic shapes in flattened form. This decision needs to be mirrored in this PR as well. To make it easier to keep these code paths in sync, we refactor the current recursive procedure for associating inputs with dynamic shapes to use the same pytree utils. This needs minor fixes to a few tests where `dynamic_shapes` were not exactly matching the structure of inputs.
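A hedged sketch of the idea (the dict-of-tensors form and its pytree-based conversion are as described above; the exact names are assumptions):
```python
import torch
from torch.export import Dim

batch = Dim("batch")
x = torch.randn(4, 8)
y = torch.randn(4, 3)

# Associate each input tensor object with the dynamic shapes for its dims...
shapes_by_tensor = {x: {0: batch}, y: {0: batch}}
# ...and let the pytree-based conversion produce the structure-matching form,
# e.g. for inputs (x, y) this becomes:
dynamic_shapes = ({0: batch}, {0: batch})
```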
Differential Revision: D56551992
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124898
Approved by: https://github.com/zhxchen17
Part of a multi-PR work to improve #59168
Meant to complete:
- Write native kernels for AvgPool3d
- Write native kernels for MaxPool3d
- Write native kernels for AdaptiveAvgPool3d
- Write native kernels for AdaptiveMaxPool3d
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116305
Approved by: https://github.com/ezyang
For a reduction kernel and a given block size on AMDGPU, I was seeing that the per-thread vectorization bandwidth (16 bytes) was not fully leveraged, while this was not a problem on NVGPU. It appeared that each thread got less data to process since a whole row was processed by more threads, and the number of elements each thread got was not enough to saturate full vectorization. On AMDGPU, a warp has 64 lanes compared to 32 on the NV side, so I'm tuning down the default number of warps (8 for NV) for AMD. I'm seeing a 10% speedup on an internal benchmark.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125084
Approved by: https://github.com/shunting314
For microbatching use cases (e.g. PP), we may use fp32 reduce-scatter (i.e. `MixedPrecisionPolicy(reduce_dtype=torch.float32)`), where we want to accumulate the unsharded gradients in fp32 across microbatches until reduce-scattering in fp32 upon the last microbatch.
Note that the `unsharded_param` is in bf16, so we must save the fp32 accumulated gradient to an attribute different from `.grad`. Moreover, saving a new attribute on the `torch.Tensor` leads to some annoying type checking issues (where the attribute may not be defined), so this PR prefers to save the attribute on the `FSDPParam` class instead.
One could argue that this behavior should be configurable, but since I think for large-scale training, everyone is leaning toward fp32 accumulation across microbatches, let us avoid adding another argument for now.
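A hedged sketch of the accumulation described above (the helper and attribute names are assumptions, not the actual FSDPParam code):
```python
import torch

def accumulate_unsharded_grad(fsdp_param, unsharded_grad_bf16: torch.Tensor) -> None:
    # Keep the running sum in fp32, on an attribute separate from .grad,
    # since the unsharded parameter itself is bf16.
    grad_fp32 = unsharded_grad_bf16.to(torch.float32)
    existing = getattr(fsdp_param, "_accumulated_grad_fp32", None)
    if existing is None:
        fsdp_param._accumulated_grad_fp32 = grad_fp32
    else:
        existing += grad_fp32
```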
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125191
Approved by: https://github.com/weifengpy
ghstack dependencies: #125190
The unit test for fp32 `param_dtype` and bf16 `reduce_dtype` was disabled. This PR debugs the issue and identifies the root cause as numeric differences between NCCL bf16 all-reduce vs. bf16 reduce-scatter. We address this by having the baseline use reduce-scatter -> all-gather to implement all-reduce.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125190
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
By using `Py_NewRef`
Also, wrap `THPDtype_to_real`/`THPDtype_to_complex` calls with `HANDLE_TH_ERRORS`
Add regression tests for the above issues: call to_complex for integral dtypes, which raises an exception, and track the reference count across repeated to_complex/to_real calls to detect whether a leak is happening.
Replace
```cpp
auto dtype = (PyObject*)torch::getTHPDtype(current_dtype);
Py_INCREF(dtype);
return dtype;
```
with a more compact/streamlined equivalent
```cpp
return Py_NewRef(torch::getTHPDtype(current_dtype));
```
Fixes https://github.com/pytorch/pytorch/issues/124868
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125154
Approved by: https://github.com/Skylion007, https://github.com/albanD
yolo
Also
* Ensure that at least 1 test always gets run (`//` does truncation which results in 0 if you have too few tests discovered)
* Don't run test removal on slow tests - I'm not touching that yet
I am avoiding everything other than pull + trunk workflows, so this is not done on Windows CUDA, which runs on periodic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125049
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
When we do cudagraph warmup, we record which outputs are in the cudagraph pool, so subsequently when we invoke a cudagraph and need to reclaim its memory we can free the prior run's outputs and make them error on access.
In warmup, we detect this by ignoring outputs which are an alias of an input that is not a prior output. We did this by checking data pointer. In very rare situations, a data pointer of a non cudagraph input might get reallocated to a cudagraph pool and causes us to ignore it.
This was happening with gpt-fast error with gemma 2 when coordinate_descent_tuning was set to False.
This PR updates the check so that we detect aliasing with non-cudagraph inputs by looking at the storage pointer.
Unrelated: saw very weird behavior where an output had the same data pointer as a supposedly live input but not the same cdata 🤔 I would think that is not possible.
```
out[0]._cdata in [ref()._cdata for ref in non_cudagraph_inps_storage_refs] # False
out[0].data_ptr() in [ref().data_ptr() for ref in non_cudagraph_inps_storage_refs] # True
```
Differential Revision: [D56607721](https://our.internmc.facebook.com/intern/diff/D56607721)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124981
Approved by: https://github.com/ezyang
This PR uses str for reduce_op directly instead of the c10d enum. Since
our functional collectives already use str, there's no reason to keep
the c10d enum, which requires a conversion.
Also, str hash + eq are significantly faster than the c10d type, so this
somewhat reduces the CPU overhead too.
Some local cpu benchmarks on `1000000` hash operations:
```
Hash performance for string type: 0.039897 seconds
Hash performance for integer type: 0.304665 seconds
```
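A rough sketch of how such a micro-benchmark could be reproduced; the exact methodology behind the numbers above isn't given, and this assumes `ReduceOp.SUM` is hashable, so treat it as illustrative only.
```python
import time
import torch.distributed as dist

def bench_hash(value, n=1_000_000):
    start = time.perf_counter()
    for _ in range(n):
        hash(value)
    return time.perf_counter() - start

print(f"str reduce_op : {bench_hash('sum'):.6f} s")
print(f"c10d ReduceOp : {bench_hash(dist.ReduceOp.SUM):.6f} s")
```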
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125172
Approved by: https://github.com/awgu, https://github.com/XilunWu, https://github.com/tianyu-l
# Summary
This is part one of adding backwards support to FlexAttention.
This PR focuses on the eager implementation and wiring up enough of the templated_attention_backward(name change soon 😉) to get through aot_eager.
Notably this does not actually wire up the triton template just yet in order to make this PR easier to review. That will be the next follow up PR.
#### Structure
We pass both the forward and backward graph to the backwardsHOP since both need to be inlined into the calculation for backwards:
- the forward graph is needed in order to re-compute the scores
- the joint graph is needed in order to construct the correct gradients post softmax_grad calc
### Attached AOT Graph
https://gist.github.com/drisspg/ce4c041f8df8a5a7983c5174705cf2b5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123902
Approved by: https://github.com/Chillee
Summary:
Fixes https://github.com/pytorch/pytorch/issues/122842
Currently, calling ep.module() on an ExportedProgram leads to a GraphModule with a default forward signature (e.g. arg_0, arg_1, ...). This causes the original placeholder names to disappear when retracing/re-exporting.
This is fixed by creating a forward_arg_names field (renaming suggestions welcome) that stores the positional & keyword arg names that are used. These names aren't present in the currently stored call_spec, and adding the field requires a major version bump for the ExportedProgram schema.
Test Plan: Tests exist for export, but names are now changed from generic (e.g. arg_0, arg_1) to follow user inputs (e.g. x, y)
Differential Revision: D56484994
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124765
Approved by: https://github.com/zhxchen17
If there's an exception during collection it can result in the profiler never being stopped properly. As a result all subsequent tests that use profiling will also fail - even if they pass in isolation.
I'm hoping this fixes the flakiness in #124253, #124220, #82720, #119346, #119364, #119490, #119526, #119537 (and the currently closed #82864).
Before:
```
(py312) $ PYTORCH_TEST_WITH_DYNAMO=1 pytest test/profiler/test_profiler.py
===================================================================================================================== FAILURES =====================================================================================================================
============================================================================================================= short test summary info ==============================================================================================================
FAILED test/profiler/test_profiler.py::TestExecutionTrace::test_execution_trace_with_kineto - AssertionError: Element counts were not equal:
FAILED test/profiler/test_profiler.py::TestExperimentalUtils::test_profiler_conv2d_bias_followed_by_batchnorm2d_pattern - RuntimeError: Can't disable Kineto profiler when it's not running
FAILED test/profiler/test_profiler.py::TestExperimentalUtils::test_profiler_extra_cuda_copy_pattern - RuntimeError: Can't disable Kineto profiler when it's not running
FAILED test/profiler/test_profiler.py::TestExperimentalUtils::test_profiler_extra_cuda_copy_pattern_benchmark - AttributeError: 'NoneType' object has no attribute 'profiler'
FAILED test/profiler/test_profiler.py::TestExperimentalUtils::test_profiler_fp32_matmul_pattern - AttributeError: 'NoneType' object has no attribute 'profiler'
FAILED test/profiler/test_profiler.py::TestExperimentalUtils::test_profiler_matmul_dim_fp16_pattern - RuntimeError: Can't disable Kineto profiler when it's not running
FAILED test/profiler/test_profiler.py::TestProfiler::test_kineto_multigpu - torch._dynamo.exc.InternalTorchDynamoError: 'NoneType' object has no attribute 'events'
FAILED test/profiler/test_profiler.py::TestProfiler::test_oom_tracing - AssertionError: RuntimeError not raised
FAILED test/profiler/test_profiler.py::TestProfiler::test_source_multithreaded_basic_work_in_main_thread_False - RuntimeError: Can't disable Kineto profiler when it's not running
FAILED test/profiler/test_profiler.py::TestProfiler::test_source_multithreaded_close_in_scope_work_in_main_thread_False - RuntimeError: Can't disable Kineto profiler when it's not running
FAILED test/profiler/test_profiler.py::TestProfiler::test_source_multithreaded_complex_work_in_main_thread_False - RuntimeError: Can't disable Kineto profiler when it's not running
FAILED test/profiler/test_profiler.py::TestProfiler::test_source_multithreaded_multiple_preexisting_work_in_main_thread_False - RuntimeError: Can't disable Kineto profiler when it's not running
FAILED test/profiler/test_profiler.py::TestProfiler::test_source_multithreaded_open_in_scope_work_in_main_thread_False - RuntimeError: Can't disable Kineto profiler when it's not running
FAILED test/profiler/test_profiler.py::TestTorchTidyProfiler::test_optimizer_parameters_sgd - RuntimeError: Can't disable Kineto profiler when it's not running
FAILED test/profiler/test_profiler.py::TestTorchTidyProfiler::test_refcounts - RuntimeError: Can't disable Kineto profiler when it's not running
FAILED test/profiler/test_profiler.py::TestTorchTidyProfiler::test_sparse_tensors - RuntimeError: Can't disable Kineto profiler when it's not running
==================================================================================================== 16 failed, 26 passed, 53 skipped in 25.51s ====================================================================================================
```
After:
```
(py312) $ PYTORCH_TEST_WITH_DYNAMO=1 pytest test/profiler/test_profiler.py
===================================================================================================================== FAILURES =====================================================================================================================
============================================================================================================= short test summary info ==============================================================================================================
FAILED test/profiler/test_profiler.py::TestExecutionTrace::test_execution_trace_with_kineto - AssertionError: Element counts were not equal:
FAILED test/profiler/test_profiler.py::TestExperimentalUtils::test_profiler_extra_cuda_copy_pattern - RuntimeError: !stack.empty() INTERNAL ASSERT FAILED at "/data/users/aorenste/pytorch/torch/csrc/autograd/profiler_python.cpp":969...
FAILED test/profiler/test_profiler.py::TestExperimentalUtils::test_profiler_extra_cuda_copy_pattern_benchmark - AttributeError: 'NoneType' object has no attribute 'profiler'
FAILED test/profiler/test_profiler.py::TestExperimentalUtils::test_profiler_fp32_matmul_pattern - AttributeError: 'NoneType' object has no attribute 'profiler'
FAILED test/profiler/test_profiler.py::TestProfiler::test_kineto_multigpu - torch._dynamo.exc.InternalTorchDynamoError: 'NoneType' object has no attribute 'events'
FAILED test/profiler/test_profiler.py::TestProfiler::test_oom_tracing - AssertionError: RuntimeError not raised
FAILED test/profiler/test_profiler.py::TestTorchTidyProfiler::test_optimizer_parameters_sgd - RuntimeError: !stack.empty() INTERNAL ASSERT FAILED at "/data/users/aorenste/pytorch/torch/csrc/autograd/profiler_python.cpp":969, please...
==================================================================================================== 7 failed, 35 passed, 53 skipped in 31.51s =====================================================================================================
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125131
Approved by: https://github.com/Skylion007, https://github.com/aaronenyeshi
Fixes #100152
1. Fix the incorrect lazy-init tests for the PrivateUse1 backend named foo
2. Fix the broken backend meta registry mechanism when compiling with clang++ (compiling with g++ works fine); this was introduced by a static variable in an inline function
3. Refactor the tests and make them more flexible
4. Disable the two tests temporarily
- test_open_device_storage_pin_memory
- test_compile_autograd_function_aliasing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124712
Approved by: https://github.com/albanD, https://github.com/malfet
Summary:
fixes two issues:
- when exporting with debug=True, the list of error-causing modules and a dependency path to them is not printed correctly: a missing newline after the path puts the module name for the next error on the wrong line, which makes the output confusing to read
- when a pickled object directly references more than one mocked module, the error message incorrectly repeats the same information, claiming the referenced attribute is present in several different libraries. This happens because the if condition references the last module name seen while walking the pickle ops, not the module name from the enclosing block `for module_name in all_dependencies:`. As a result, one error prints as O(all_dependencies) errors, all with different module names but the same attribute name
Differential Revision: D56578035
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124943
Approved by: https://github.com/JonAmazon, https://github.com/houseroad
This PR renames the `FSDP` class to `FSDPModule`. This is a BC breaking change. The rationale is that `FSDPModule` is more descriptive since `fully_shard` is a module-level API (applied to a `module` arg), so the `FSDP` class will always correspond to a module.
Also, users commonly import `FullyShardedDataParallel` as `FSDP`, so this can help avoid some name conflict in some cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124955
Approved by: https://github.com/wanchaol, https://github.com/wconstab
ghstack dependencies: #124651, #124741, #124767, #124768, #124780, #124787
This completely subsumes https://github.com/pytorch/pytorch/pull/120816
This makes use of the unbacked binding machinery to teach Inductor how to generate deferred runtime asserts directly. There is some back story about why I did it this way, let me explain.
Previously, our strategy for generating runtime asserts was that Dynamo would insert them into the FX graph after finishing tracing, and we would attempt to code generate them based on the FX graph. This is a good strategy for export, where we immediately export the graph. However, this strategy was afflicted by problems in eager, where we reuse the same ShapeEnv as before. In particular, on subsequent graph passes, we would immediately turn all of these assertions into noops, because when we evaluated their expressions, we would see that because we had a deferred runtime assert in the ShapeEnv, we know "oh, of course this expression is True" already. Oops!
So, with this PR, we take the attitude that as long as the ShapeEnv sticks around, the ShapeEnv's list of deferred runtime asserts is the source of truth, and we don't put anything in the graph. So we just need to decide when to actually generate asserts, and the place I picked was Inductor lowering, since we already have an AssertScalar buffer concept, and so I just need to insert them at this point. AssertScalar also uses raw sympy.Expr rather than SymInt/Bool, so it is easier to prevent unrestricted simplification at this point.
There are a few things jumbled together in this PR. I can split them if you want, but some of the changes are before I changed my strategy, but they're useful changes anyway.
**torch/_dynamo/output_graph.py** and **torch/_inductor/lowering.py** - Here, we stop putting deferred runtime asserts in the graph. I also have to make sure we don't DCE unused symbol arguments; we're going to get some goofy graph arguments this way, will be good to restore that optimization eventually. We also just disable codegen for `_assert_scalar` entirely; we assume that ShapeEnv will be good enough to capture all of these.
**torch/_inductor/codegen/wrapper.py** and **torch/_inductor/ir.py** - Add a way to codegen sizevars without forcing simplification
**torch/_inductor/graph.py** - The main logic. Our strategy is to interpose in the same place we are testing that unbacked SymInts are properly showing up in lowered code. The logic is directly analogous to the logic in the existing insert deferred runtime asserts FX pass, but it's simpler because sympy expressions can be directly stored on inductor IR nodes.
**torch/fx/experimental/symbolic_shapes.py** - For extra safety, we have a way of freezing runtime asserts, so that if you try to add more we error. This prevents us from adding runtime asserts after we've done lowering. There's a funny interaction with backwards which there's a comment for in graph.py
**torch/fx/passes/runtime_assert.py** - This is not really needed in this PR, but I rewrote the runtime assert logic to use unbacked_bindings rather than inferring it by looking for unbacked SymInts. Now, keypaths are translated into FX node accessors. Unfortunately, I couldn't delete the old inference code, because you still need it to find backed SymInts from arguments (as this pass may be used on graphs which don't explicitly bind all their shape variables as arguments). There are some new tests exercising this.
TODO: I think we need to generate asserts for replacements too. This is a preexisting problem that the old FX pass had too.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124874
Approved by: https://github.com/jansel
ghstack dependencies: #124864
I want to generate runtime assert nodes during lowering, which means
that I need a finalized list of asserts by the time I start lowering.
This means this runtime assert introduced in
https://github.com/pytorch/pytorch/pull/113839 must go. Fortunately,
this runtime assert was never exercisable, apparently, and the test
still "passes" without it. I replace it with a compile time test. We
can revisit if this assert fails in practice.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124864
Approved by: https://github.com/jansel
Summary: When I was debugging an issue, this silent error made the debugging harder. It is better to error out earlier with a more descriptive error message.
Test Plan: None
Differential Revision: D56312433
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124411
Approved by: https://github.com/zhxchen17
Earlier, globals of inlined functions from other files were not handled correctly: we were not tracking mutations on them, and they were colliding with globals of the same name in the parent function, etc. This PR overrides LOAD_GLOBAL/STORE_GLOBAL for the inline tx and tracks mutations on them separately.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125002
Approved by: https://github.com/jansel
ghstack dependencies: #125097, #125107
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/124286
The TorchBenchmark includes a method called `run_n_iterations`, which runs the model multiple times.
43f4e71daa/benchmarks/dynamo/common.py (L2272-L2276)
https://github.com/pytorch/pytorch/pull/123399 enables tracing into a `UserDefinedObjectVariable` that's an instance method. It will trace the model into the FX graph multiple times within `run_n_iterations`. Then, in Inductor, module-level Conv-BN folding will fuse the same Conv-BN module multiple times in this case, which leads to accuracy failures. This PR addresses the issue by ensuring that each Conv-BN module is fused only once.
**TestPlan**
```
python -u -m pytest -s -v test/inductor/test_inductor_freezing.py -k test_folded_conv_bn_with_module_sharing
python -u -m pytest -s -v test/inductor/test_inductor_freezing.py -k test_folded_conv_functional_bn_with_module_sharing
python -u -m pytest -s -v test/inductor/test_inductor_freezing.py -k test_conv_bn_with_multi_bn_share_conv
python -u -m pytest -s -v test/inductor/test_inductor_freezing.py -k test_conv_functional_bn_with_multi_bn_share_conv
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124808
Approved by: https://github.com/jansel, https://github.com/jgong5
Summary:
LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.
This diff either (a) removes an unused variable and, possibly, its associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, e.g., in an `assert` statement that isn't present in production code.
- If you approve of this diff, please use the "Accept & Ship" button :-)
Test Plan: Sandcastle
Reviewed By: palmje
Differential Revision: D56587751
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125041
Approved by: https://github.com/Skylion007
Summary:
The LLVM warning `-Wmissing-field-initializers` has found one or more structs in this diff's files which were missing field initializers.
This can be unintended such as:
```
my_struct s1 = {0}; // Initializes *only* the first field to zero; others to default values
my_struct s2 = {}; // Initializes *all* fields to default values (often zero)
```
or it may be because only some of the members of a struct are initialized, perhaps because the items were added to the struct but not every instance of it was updated.
To fix the problem, I've either used `{}` to initialize all fields to default or added appropriate default initializations to the missing fields.
Test Plan: Sandcastle
Reviewed By: palmje
Differential Revision: D56614179
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125047
Approved by: https://github.com/Skylion007
Summary:
This is a forward Hotfix for T186742340.
Some recent changes in Pytorch / Inductor ( D56458606) led to aten.addmm operators being inserted twice into the list of choices to select from during autotuning. This appears to have triggered a test failure in fbcode.
This fix prevents the aten operators from being added twice to the list of choices for autotuning.
Test Plan:
* Pytorch CI
* CUDA_LAUNCH_BLOCKING=1 buck2 test 'fbcode//mode/opt' fbcode//accelerators/pytorch/lib/pt2_utils/tests:compile_pt2_test -- --exact 'accelerators/pytorch/lib/pt2_utils/tests:compile_pt2_test - test_compile_pt2 (accelerators.pytorch.lib.pt2_utils.tests.compile_pt2_test.TestCompilePT2)'
Differential Revision: D56642879
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125065
Approved by: https://github.com/eellison
This PR removes the legacy impls of c10d_functional ops which are now irrelevant. For backward compatibility purpose, c10d_functional ops now call into _c10d_functional ops.
We also changed c10d_functional ops to be CompositeExplicitAutograd, so that when traced, only _c10d_functional ops appear in the graph. After this, we'll be able to remove the Inductor IR for the legacy functional collectives.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124979
Approved by: https://github.com/wanchaol
This PR continues to clean clang-tidy warnings in torch/csrc/distributed/c10d, following #124701. In addition, libfmt dependency is added in CMake code to enable using it in the headers. The libfmt has to be added as private dependency to torch_cuda and torch_hip because they include torch/csrc/distributed/c10d/Utils.hpp which uses libfmt.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124987
Approved by: https://github.com/malfet
Summary:
This makes barrier and rank operations linear instead of quadratic with the number of workers. This drastically improves performance for rendezvous when running with over 1000 hosts.
This uses 2 approaches for different areas:
* local rank assignment: each worker does 1 set and 1 get; local ranks are assigned on the rank 0 host in an O(n) operation, which keeps total store operations linear in the number of workers.
* exit_barrier: use a counter and a final flag so each worker has to do max 1 set, 1 get and 1 add.
At 4000 hosts we see torchelastic be able to run in as little as 10 seconds down from 373 seconds.
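A hedged sketch of the counter-plus-flag exit barrier idea using a c10d `TCPStore`; this is not the torchelastic implementation, just an illustration of the O(n) pattern where each worker does at most one `add`, one `set`, and one `wait`.
```python
import datetime
from torch.distributed import TCPStore

def exit_barrier(store: TCPStore, world_size: int) -> None:
    # Each worker bumps a shared counter exactly once.
    count = store.add("exit_barrier/count", 1)
    if count == world_size:
        # The last arriving worker sets the final flag once.
        store.set("exit_barrier/done", "1")
    # Everyone blocks on the flag instead of polling every other worker.
    store.wait(["exit_barrier/done"], datetime.timedelta(seconds=300))
```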
Test Plan:
This is testing using many small tests running on a remote cluster.
{D56549942}
```
torchx run --scheduler mast -- --image=torchelastic_benchmark --j=4000x1
```
Differential Revision: D56605193
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124982
Approved by: https://github.com/kiukchung, https://github.com/kurman
- Sets it as a fake stack trace since we don't have a generic comment feature.
- When verbose is disabled, this still adds a context manager and flag checks; the alternative is to use macros, but that wouldn't be usable with TORCH_LOGS.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124954
Approved by: https://github.com/jansel
## Description
Framework overhead is found to be big for the onednn qconv op (used for quantization with PT2E X86Inductor backend). This PR reduces the integration overhead by modifying the implementation of qconv.
## performance results
Running quantized Resnet50 on an Intel(R) Xeon(R) Platinum 8490H machine
Before
```
Average latency: 8.378 ms.
------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
------------------------- ------------ ------------ ------------ ------------ ------------ ------------
onednn::qconv2d_pointwise 86.54% 6.954ms 87.42% 7.025ms 132.547us 53
```
After
```
Average latency: 6.255 ms.
------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
------------------------- ------------ ------------ ------------ ------------ ------------ ------------
onednn::qconv2d_pointwise 85.05% 6.381ms 85.98% 6.451ms 121.717us 53
```
Test script:
```python
import torch
import torchvision
import time
import copy
import numpy as np
from torch._export import capture_pre_autograd_graph
from torch.ao.quantization.quantize_pt2e import (
prepare_pt2e,
convert_pt2e,
)
import torch.ao.quantization.quantizer.x86_inductor_quantizer as xiq
from torch.ao.quantization.quantizer.x86_inductor_quantizer import X86InductorQuantizer
torch._inductor.config.cpp.enable_kernel_profile=True
torch._inductor.config.profiler_mark_wrapper_call = True
torch._inductor.config.freezing = True
torch._inductor.config.cpp_wrapper = True
def bench_model(model, inputs):
    times = []
    with torch.no_grad():
        for _ in range(5):  # warm-up
            output = model(inputs)
        for _ in range(20):
            start_time = time.time()
            output = model(inputs)
            end_time = time.time()
            times.append(end_time - start_time)
    print('Average latency: %0.3f ms.' % (np.median(times) * 1000.0))
    with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CPU]) as p:
        out_ipex = model(inputs)
    print(p.key_averages().table(sort_by="self_cpu_time_total", row_limit=-1))

def pt2e_ptq(m, example_inputs):
    m = m.eval()
    exported_model = capture_pre_autograd_graph(m, example_inputs)
    quantizer = X86InductorQuantizer()
    quantizer.set_global(xiq.get_default_x86_inductor_quantization_config())
    prepared_model = prepare_pt2e(exported_model, quantizer)
    _ = prepared_model(*example_inputs)
    converted_model = convert_pt2e(prepared_model)
    torch.ao.quantization.move_exported_model_to_eval(converted_model)
    with torch.no_grad():
        optimized_model = torch.compile(converted_model)
        _ = optimized_model(*example_inputs)
        _ = optimized_model(*example_inputs)
    bench_model(optimized_model, *example_inputs)
    return optimized_model

if __name__ == "__main__":
    data = torch.randn(16, 3, 224, 224)
    model_fp = torchvision.models.resnet50(weights=torchvision.models.ResNet50_Weights.DEFAULT)
    pt2e_ptq(copy.deepcopy(model_fp), (data,))
```
Differential Revision: [D56288440](https://our.internmc.facebook.com/intern/diff/D56288440)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123240
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168
Summary:
This diff fixes a bug in PyTorch where creating a tensor from a list of booleans was throwing an error.
This fix resolves that issue. All credit goes to swolchok for identifying the root cause of the issue and suggesting this fix.
Test Plan: Running our model end to end works as expected and no error occurs.
Differential Revision: D55990810
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124899
Approved by: https://github.com/zhxchen17
**Summary**
This PR is attempt to land an experimental feature designed in #103686 . `local_map` is designed to allow users to apply to `DTensor` objects a function that was written to apply to `torch.Tensor`.
As a function, `local_map` takes in 2 required arguments (`func` and `out_placements`) and 3 optional arguments (`device_mesh`, `in_placements`, `redistribute_inputs`). `func` is the function to be applied to each local shard of input `DTensor`. `out_placements` is the sharding specification of output `DTensor`.
`local_map` returns a new function that does the following:
1. Infer `device_mesh` and `in_placements` from the `DTensor` inputs if they're not provided. If `device_mesh` is provided, it must be identical to the device mesh of every `DTensor` input. If `in_placements` is provided, it serves as the required sharding specification of the corresponding `DTensor` input before feeding its local shard into `func`. If it differs from the `DTensor`'s actual sharding specification, an exception is raised when `redistribute_inputs=False`; otherwise the input is resharded to the required sharding.
2. Call `func` with the arguments passed in along with `device_mesh` except `DTensor`s. For `DTensor`, pass in its local shard. This `func` may include collectives.
3. For each output of `func` that has a valid (i.e. not `None`) sharding specification in `out_placements`, construct a new `DTensor` from the output and the specification, and use this `DTensor` as the output (see the sketch below).
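A minimal usage sketch; the import paths and exact argument formats are assumptions based on the description above, and it requires an initialized process group on 4 ranks.
```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._tensor import distribute_tensor, Shard, Replicate
from torch.distributed._tensor.experimental import local_map  # assumed path

def mm_allreduce(a, b):
    # Written against plain torch.Tensor local shards; may contain collectives.
    out = torch.mm(a, b)
    torch.distributed.all_reduce(out)
    return out

mesh = init_device_mesh("cuda", (4,))
A = distribute_tensor(torch.randn(8, 16), mesh, [Shard(1)])  # column-wise shards
B = distribute_tensor(torch.randn(16, 8), mesh, [Shard(0)])  # row-wise shards

dist_mm = local_map(
    mm_allreduce,
    out_placements=[Replicate()],            # output DTensor is replicated
    in_placements=([Shard(1)], [Shard(0)]),  # required input shardings
    redistribute_inputs=True,                # reshard inputs if they differ
)
C = dist_mm(A, B)  # takes and returns DTensor
```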
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123676
Approved by: https://github.com/wanchaol
torch.library.register_fake reports the python module the fake impl is
located in. This is used to check against
`m.set_python_module("foo.bar")` calls in C++.
The module reporting logic was wrong in most cases. This PR fixes it.
Test Plan:
- exhaustive tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125037
Approved by: https://github.com/williamwen42
Improves the Cutlass backend GEMM template:
* Adds code which allows creating stand-alone test runners for Cutlass GEMM kernels, enabling (manual) debugging of, for example, CUDA IMA errors or similar problems which occur in practice. Includes some utility code and tests to actually compile and run these standalone tests.
* Cleans up the GEMM template code through various refactorings
* Eliminates code sections and options that are unnecessary now that epilogue fusions are being removed.
* Limits the scope of a workaround for (flaky) Cutlass issues with bias broadcasting to the necessary cases.
* Puts some CPU runtime checks into #if / #endif blocks, such that it's possible to compile CUTLASS Kernels with lower CPU overhead.
* Add documentation comments
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124577
Approved by: https://github.com/jansel
ghstack dependencies: #124576
Summary:
Previous attempts did not work in the end. D49720297 caused an online training SEV due to extra importing. D56299408 mitigated a tricky bug in the Distributed Shampoo constructor but unfortunately didn't correct the scuba logging either.
see f552546983
Test Plan: {F1491621504}
Differential Revision: D56378270
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124593
Approved by: https://github.com/anijain2305
This PR is to add support for tensor's is_complex method in dynamo. Take the following code as an example:
```python
def test_tensor_is_complex(x):
if x.is_complex():
return x + 1
else:
return x - 1
```
Before this fix, the is_complex() call would cause a graph break ("torch.* op returned non-Tensor bool call_method is_complex"). After this fix, the graph break is avoided.
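A quick way to verify there is no graph break, reusing the `test_tensor_is_complex` function above; `fullgraph=True` makes Dynamo raise on any break.
```python
import torch

compiled = torch.compile(test_tensor_is_complex, fullgraph=True)
print(compiled(torch.randn(4, dtype=torch.complex64)))  # takes the x + 1 branch
print(compiled(torch.randn(4)))                         # takes the x - 1 branch
```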
Fixes #122692
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124927
Approved by: https://github.com/ezyang
A better query for the base commit of a PR.
Some ghstack PRs are not connected to main so git merge-base doesn't work. Instead, use the Github API to query for the base of the PR, which should be more accurate
Sanity checked on one of Ed's ghstack PRs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122214
Approved by: https://github.com/seemethere
I'm restoring the `training` and `inference` options after github.com/pytorch/pytorch/pull/124795 and removing the lesser-known `cppwrapper` option instead, per @desertfire's suggestion. The total number of parameters remains at 10.
Also, the default choices for training and inference are explicitly spelled out when dispatching the workflow manually, to catch devs' attention.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124971
Approved by: https://github.com/ezyang
Test the generic torch.Stream/Event with a fake device guard and hooks. Since we added a fake device backend, it is mutually exclusive with other backends. Tests will be skipped if TEST_CUDA or TEST_ROCM is true.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123614
Approved by: https://github.com/albanD
ghstack dependencies: #123611, #123612
MTIA device has its own Module in PyTorch now.
torch.mtia has following APIs similar to other backends. The lazy_init is also supported.
```
__all__ = [
"init",
"is_available",
"synchronize",
"device_count",
"current_device",
"current_stream",
"default_stream",
"set_stream",
"stream",
"device",
]
```
------------
For device management, we expand AcceleratorHooksInterface to support generic device management, and it can be used in both C++ and Python.
```
def _accelerator_hooks_device_count() -> _int: ...
def _accelerator_hooks_set_current_device(device_index: _int) -> None: ...
def _accelerator_hooks_get_current_device() -> _int : ...
def _accelerator_hooks_exchange_device(device_index: _int) -> _int : ...
def _accelerator_hooks_maybe_exchange_device(device_index: _int) -> _int : ...
```
---------
Adding get_device_module API to retrieve device modules for different device types.
```
def get_device_module(device: Optional[Union[torch.device, str]] = None)
```
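A hedged usage sketch based on the signature above; the no-argument behavior is an assumption from the API description.
```python
import torch

# Returns the backend module for the given device type, e.g. torch.cuda here.
cuda_mod = torch.get_device_module("cuda")
assert cuda_mod is torch.cuda

# With no argument, this is expected to return the module for the current
# default device type (assumption based on the description above).
default_mod = torch.get_device_module()
```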
---------
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123612
Approved by: https://github.com/albanD
ghstack dependencies: #123611
fake_tensor.py had mypy error ignored. That seems less than desirable.
Also added SafePyObjectT<T> which is a tagged wrapper around a SafePyObject but provides static type checking (with no other guarantees).
Used `SafePyObjectT<TorchDispatchModeKey>` on some of the TorchDispatchModeTLS API to ensure that we don't accidentally inject a different type than expected into the stack.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124428
Approved by: https://github.com/malfet
Summary:
Previously, `requires_grad` was not propagated from the original Tensor to the decomposed tensors
Test Plan:
python test/test_parametrization.py -k test_register_parametrization_no_grad
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124888
Approved by: https://github.com/lezcano
Summary: Now that we have reached nanosecond granularity, we can now remove the temporary guards that were previously required for nanosecond precision.
Test Plan: Regression should cover this change
Reviewed By: aaronenyeshi
Differential Revision: D56444570
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124734
Approved by: https://github.com/aaronenyeshi
**Motivation**: There's a Meta-internal use case that deepcopies a bunch of metadata, which includes shapes. When we try to use NestedTensor with this tool, it errors out when we try to deepcopy the metadata, because SymNodes cannot be deepcopied. The change here is to add an implementation of `__deepcopy__`.
**Implementation**:
1. `__deepcopy__` on SymNode calls clone()
2. Implement `clone()` in NestedIntSymNode, which previously didn't have this implemented
**Potential Issues**:
Right now, this works.
But, regarding (2): Eventually we'll have some mapping between the NestedSymIntNode and its corresponding offsets/lengths tensor (cc @soulitzer who is working on this). How should this work with `__deepcopy__`? Should the offsets/lengths tensor also be cloned, or should the new symint reference the same offsets as the old symint?
On one hand, we already have this issue with NestedIntSymNodeImpl::mul(): mul() creates a new NestedIntSymNodeImpl. On the other hand, `__deepcopy__` might imply different semantics.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121361
Approved by: https://github.com/soulitzer
aten.div's output device will be its numerator's device, so it is acceptable to do cuda / cpu type divisions. Post-grad passes operate only on graphs and can't handle runtime graph inputs, so we change user code to move inputs to cuda for cudagraphs. This affects any graph that has cpu tensors as graph inputs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119729
Approved by: https://github.com/eellison
Fixes #123039
In ABI-compatible mode, ExternKernelSchedulerNode generates code using `aoti_torch_tensor_copy_`, which requires an `AtenTensorHandle`, but the allocation generates an ArrayRefTensor to allocate memory on the stack. To fix this issue, this PR prevents ExternKernelSchedulerNode from using stack memory allocation in ABI-compatible mode and creates an AtenTensorHandle instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124037
Approved by: https://github.com/desertfire
Validate that all arguments are on MPS devices and dtypes are expected
Fixes cryptic messages like
```
% python3 -c "import torch;print(torch.nn.functional.linear(torch.rand(32, 32), torch.rand((32, 32), device='mps')))"
RuntimeError: Placeholder storage has not been allocated on MPS device!
```
And hard crashes like
```
% python3 -c "import torch;print(torch.nn.functional.linear(torch.rand(32, 32, device='mps'), torch.randint(-10, 10, (32, 32), dtype=torch.int8, device='mps')))"
```
Fixes https://github.com/pytorch/pytorch/issues/123995
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124952
Approved by: https://github.com/Skylion007
Summary:
There are multiple things implemented incorrectly in non strict for reparametrizing state dict:
1. The same fake tensor should be generated for duplicated weights.
2. We should snapshot state dict in the beginning to always hold the invariant that ep.state_dict == mod.state_dict()
3. We will overwrite real weights with fake weights if we don't restore the weights in LIFO ordering.
4. We don't turn on strict checking, which could silently fail on corner cases.
This diff aims to solve all these issues at once.
Test Plan: CI
Differential Revision: D56505020
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124847
Approved by: https://github.com/pianpwk
When inductor generates triton code, the triton code can either assume that the inputs given to it are aligned or unaligned. If they are aligned, triton can use more efficient instructions (like vectorized loads or tensor cores). However, if we generate "aligned" code and pass in unaligned inputs, the triton code will error out; to fix this, we clone unaligned inputs that are passed to triton kernels that expect aligned inputs. This can lead to excessive clones if we have inputs that are not expected to be aligned.
In this PR, we use the example input to decide whether the generated triton code should assume alignment or not. If the example input is aligned, then we will generate triton code that assumes alignment; if at runtime we receive an unaligned input, we'll make a clone. Meanwhile, if the example input is not aligned, the generated triton code will not assume inputs are aligned and we won't ever need to clone.
Note that the alignment of the inputs is not guarded on; we found that adding guards on tensor offsets (a) was slow in cases where we do a lot of comparisons on tensor offsets, and (b) led to a lot of recompilations.
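An illustrative predicate (not Inductor's actual code) for the alignment decision described above, assuming a 16-byte requirement:
```python
import torch

ALIGNMENT = 16  # bytes; illustrative value

def should_assume_aligned(example_input: torch.Tensor) -> bool:
    # The address of the first element must land on a 16-byte boundary for the
    # generated Triton code to safely use aligned (vectorized) loads.
    return example_input.data_ptr() % ALIGNMENT == 0
```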
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123319
Approved by: https://github.com/eellison
Summary:
D56289438 from OSS breaks test
deeplearning/aot_inductor/cpu/test:cpu_lowering_utils_test - test_cpu_lower_merge_with_ibb_3 (deeplearning.aot_inductor.cpu.test.test_lowering_utils.CPULoweringTest)
The issue is that we use a `partial` for aten.cat, which shouldn't be directly failed out with an assertion
Test Plan:
```
deeplearning/aot_inductor/cpu/test:cpu_lowering_utils_test - test_cpu_lower_merge_with_ibb_3
```
Differential Revision: D56541352
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124882
Approved by: https://github.com/chenyang78
Replace `“important”` with `"important"` and `Taylor’s` with `Taylor's`
Fixes the obvious symptoms of https://github.com/pytorch/pytorch/issues/124897
Test plan: Download [wheel](https://github.com/pytorch/pytorch/actions/runs/8833051644/artifacts/1447995459) and check that generated VF.pyi does not have any unicode characters by running following command:
```
% python3 -c "x=open('_VF.pyi', encoding='utf-8').read();uc=[(i, x[i]) for i in range(len(x)) if ord(x[i])>127];print(uc);assert(len(uc)==0)"
```
The process for populating range_constraints follows separate methods for non-strict (`make_constraints`), and strict (`_process_constraints`). The strict method is somewhat more convoluted, and the analysis that Dynamo performs for strict is already present as part of the non-strict process in make_constraints (produce_guards(), running the export constraint solver).
This PR kills _process_constraints() and replaces calls with make_constraints, without duplicating the work that Dynamo already does.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123985
Approved by: https://github.com/avikchaudhuri
Previously, unbacked SymInts would gradually get larger and larger as we kept rebinding them. Now, we do the replacement to preserve the old symbol.
Actually doing this is a bit tricky. Here’s the order in which things happen when retracing data-dependent code:
1. Run fake tensor prop: allocate new unbacked SymInt
2. Run proxy tensor mode, calculate bindings and associate them with FX node
3. Run PropagateUnbackedSymInts, rename unbacked bindings to their old ones so they are consistent
So the problem is when we calculate bindings in step (2), we don't know
what the original names are yet, we only find out later at (3). But by
the time (3) runs, we've already stuffed some new bindings in
meta["unbacked_bindings"] and we don't know how to update them! To fix
this, I introduce resolve_unbacked_bindings which post facto applies any
of the renamings we discovered in (3).
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124782
Approved by: https://github.com/lezcano
ghstack dependencies: #124310, #124314, #124316, #124394, #124739
This is a subset of changes extracted from https://github.com/pytorch/pytorch/pull/124683/
This PR contains modifications to make Inductor work with unbacked symbol inputs, which can occur when a data-dependent sized tensor is saved for backwards. The problems to be fixed:
* When binding initial symbols, we unconditionally bind unbacked symbols (instead of computing if they are needed, which only looks at backed symbols)
* Benchmark generation code doesn't work with unbacked symints as we have no hints to actually feed in real values. So I pick a random number and you are expected to fix it if it doesn't work
* Need to make sure we don't install dependencies on unbacked SymInt inputs, that puts us down the "promptly deallocate the input" path, but that's pointless for unbacked SymInt
Fixes https://github.com/pytorch/pytorch/issues/124652
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124739
Approved by: https://github.com/jansel
ghstack dependencies: #124310, #124314, #124316, #124394
We guard on key order
1) When a key is a non-constant object
2) When we actually need key order - like .values, .items etc
For dicts/OrderedDicts that do not require key order guarding, we just rely on the usual `GuardManager + DictGetItemGuardAccessor`. This is faster than going through the `list(d.keys())`-based design for OrderedDicts.
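An illustrative example (made-up function) of when key order actually matters to the compiled code and therefore needs guarding:
```python
import torch

@torch.compile
def f(d):
    # .values()/.items() depend on insertion order, so Dynamo must guard on it.
    return sum(v for v in d.values())

print(f({"a": torch.ones(2), "b": torch.full((2,), 2.0)}))
```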
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124779
Approved by: https://github.com/jansel
This PR targets the issue mentioned in #123451, and solves the specific task of updating `test_graph_grad_scaling` in `test/test_cuda.py` to use the new OptimizerInfo infrastructure.
`test_graph_grad_scaling` is moved to a new `TestCase` class called `TestCudaOptims` in order to use `instantiate_device_type_tests`. The test content remained the same. `@onlyCUDA` is applied to the new test; the original use of the wrapper function is also changed to a `@parametrize` decorator for better style.
If we think that this migration is successful, we can delete the original test item under `TestCuda`. Currently it is left untouched to avoid any unexpected issues.
Local linter passed.
```
$ lintrunner test/test_cuda.py
ok No lint issues.
```
Local tests passed.
```
> python .\test\test_cuda.py -k test_graph_grad_scaling
Ran 7 tests in 0.458s
OK (skipped = 3)
```
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123581
Approved by: https://github.com/janeyx99
Summary:
This diff updates opinfo tests to compute more statistics. The results are described in this post:
https://fb.workplace.com/groups/ai.acceleration.team/permalink/825131926110067/
New features:
- Optionally dump kernels to a directory
- Optionally disable block pointers
- Impose a time limit (2 min) on individual tests
- Report a variety of specific error codes when a test fails:
- MIXED
- FALLBACK
- EXPORT_ERROR
- COMPILE_ERROR
- MULTIPLE_KERNELS
- MISSING_KERNELS
- TIMEOUT
- Disable setting the RNG seed inside of opinfo, since Dynamo doesn't like this and it caused a lot of tests to fail which otherwise would be able to generate Triton.
- Check each test's `(op,dtype)` pair against {HuggingFace, TIMM, TorchBench} benchmark logs, to see whether tests are representative of real-world usage.
Test Plan:
`buck2 test @//mode/{dev-nosan,mtia} fbcode//triton_mtia/python/test:` passed locally. This code is also exercised by the CI.
Added a bunch of new unit tests:
- Dumping kernels to a directory
- Disabling block pointers
- Mocking various error conditions in inductor
- No kernels
- Multiple kernels
- ATen fallback
- Partial ATen fallback (mixed Triton + ATen)
- `torch.export` raised exception
- `torch.inductor._compile` raised exception
- Timeout while running test
- Test harness raised uncaught exception
- Check that return code == Success when exceptions were raised
- Checking whether various (op,dtype) combos are in benchmarks
- Check that `aten.add.Tensor` IS in the benchmarks
- Check that a made up op is NOT in them
Differential Revision: D56336160
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124657
Approved by: https://github.com/eellison
This is to mirror autograd.Function's setup_context behavior.
The PyTorch Dispatcher removes default values for "FC/BC reasons", but I
convinced myself there's no FC/BC problem for the setup_context API.
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124852
Approved by: https://github.com/albanD
ghstack dependencies: #124637, #124805, #124806
This PR has a lot of "draw the rest of the fucking owl" energy. Here's how to break it down.
1. **torch/_inductor/graph.py** - We start by tightening unbacked symbol invariants. Specifically, as we lower FX nodes, we check whether or not every unbacked_binding recorded on the FX node meta, actually ends up getting bound (according to get_unbacked_symbol_defs) in all the buffers generated by the lowering. Hopefully this invariant is self evident. This leads to a lot of failures.
2. **torch/_inductor/ir.py** - Problem 1: There is softness in how Inductor computes defs of unbacked symbols in IR node. Previously, we tried to infer it by looking at the output sizes/strides/etc and see if new unbacked symbols popped up that we hadn't seen in the inputs. I don't know exactly what was buggy about the old code, but sometimes we would fail to notice an unbacked symbol had been bound, or rebind an unbacked symbol multiple times. Fortunately, thanks to the earlier PRs in our stack, we now have a nice list of unbacked symbol bindings from FX, so we now just store it directly on ExternKernel and use it directly to report defs. This has to be done twice: once for FallbackKernel (e.g., nonzero) and once for DynamicScalar (e.g., item) (see also **torch/_inductor/lowering.py**, **torch/_inductor/codegen/wrapper.py** and **torch/_inductor/codegen/cpp_wrapper_cpu.py** for the lowering and codegen changes for item)
* **process_kernel** - Sidequest! It turns out that Inductor lowering can reallocate unbacked symbols. This happens specifically when we repropagate fake tensors through the operator in `process_kernel`. This repropagation process is necessary because Inductor may have changed the strides of input tensors, and it must now recompute the strides so that it can continue to appropriately plan the rest of the lowering process. This is fine: we just make sure we do the rebind unbacked + compute_unbacked_bindings dance we've been doing previously in the PR stack. But instead of putting unbacked_bindings on a new FX node, they go straight into our unbacked_bindings on the Inductor IR node.
* **codegen_unbacked_symbol_defs** - Sidequest! FallbackKernel lowering is done in two steps. First, you emit the FallbackKernel buffer. Then, you emit MultiOutput buffers which actually give access to the individual outputs of FallbackKernel, which may have been multi-output. There is a design decision here: does the FallbackKernel bind the unbacked symbols, or the MultiOutput buffer? Historically, we put the binding on MultiOutput buffer, because it's more convenient: the FallbackKernel buffer is fake, in fact, it doesn't even get a name in C++ codegen. But it's kind of inconsistent with the keypath model that we've been tracking unbacked bindings with: if you have a multi-output node, you'd expect a keypath like `[0].size()[0]` representing the first output's first dimension size. That suggests that it's the FallbackKernel that should define the things. So that was my first implementation. Unfortunately, the C++ codegen is too cursed and I could not understand how to make it work in that case. So now we just unsoundly assume you cannot have multi-output data dependent output, and do the codegen in MultiOutput. There are some comments explaining exactly what we are improperly assuming.
3. **_rename_unbacked_to** in **torch/fx/experimental/symbolic_shapes.py** - Previously, when we renamed unbacked symbols, we clobbered any facts we previously knew about them. So for example, if we had a replacement `u0 -> s0` but then we renamed u0 to u1, we would now set up the replacement `u0 -> u1`, clobbering the old replacement. This apparently didn't matter in earlier PRs in the stack, but with Inductor now on the ball, there were some tests that indicated this was a problem. The solution is easy: if u0 had a preexisting replacement, reapply it to u1. However...
* **torch/_functorch/_aot_autograd/collect_metadata_analysis.py** - When we run forward analysis, this triggers fake tensor repropagation and fresh allocations. Previously, we just cleared out the pending symbols when finished the analysis. But with the change above, this would also migrate replacements to the new symbols... which are now dead. So now we explicitly suppress generation of these symbols with `ignore_fresh_unbacked_symbols` so that no rebinding happens at all.
* **torch/_dynamo/eval_frame.py** - same deal; I just searched for all sites we called clear() on pending
4. The last step is fixing the long tail of extra problems that show up, now that unbacked_bindings are load bearing into Inductor
* **torch/_dynamo/eval_frame.py** - Some of the exports are making copies of nodes without repropagating fake tensors, so in this case, it is important to also copy the `unbacked_bindings` (apparently this didn't matter before without the Inductor changes)
* **torch/_export/pass_base.py** - I discover that this is doing fake tensor repropagation via a test suite failure. Do the same playbook as AOTAutograd: PropagateUnbackedSymInts too! Actually, they also have implemented their own tracer as well, so do the same playbook as proxy_tensor: record unbacked_bindings on the newly traced nodes. UGH code duplication.
* **torch/_subclasses/fake_tensor.py**, **torch/_subclasses/fake_impls.py** (with call site updates at **torch/_functorch/_aot_autograd/traced_function_transforms.py** and **torch/fx/passes/fake_tensor_prop.py**) - What's this new epoch thing? I noticed that sometimes I would be retracing, call nonzero() on a fake tensor, and not allocate a new unbacked symbol. This is actually bad, because if I don't get a new unbacked symbol, I don't know there's a binding site, and `unbacked_bindings` is now missing a binding. The reason for this is memoization: if I reuse the exact same fake tensor on my retrace, it will already have an unbacked symint memoized on it and we will short circuit allocation. Well, that's no good. So I associate the memos with a fake tensor epoch, and every time you start a new fake tensor propagation from scratch, you bump the epoch so that I clear all the memos.
* **torch/_inductor/scheduler.py** - I notice in unit tests that V.current_node is not always set when we call process_kernel. So I save it into the IR node and restore it when we are running `get_estimated_runtime`.
* **torch/fx/experimental/symbolic_shapes.py** - A few things
* **rebind_unbacked** (re **_tensor_version**). Ordinarily, when you have an unbacked SymInt, you persistently have it all the way to the end of the program. `_tensor_version` violates this: this generates an unbacked SymInt (for reasons I don't quite understand?) and then gets rid of it later. This triggered an assert violation. I think this op is kind of misusing unbacked SymInt, but I didn't know how to refactor it, so it gets a special case.
* **rebind_unbacked** (re **Simplify SymBool binding**). Ugh, SymBool, what a pain in the butt. I have an assert that you can only rebind unbacked symbol to another unbacked symbol. This assert fails when a boolean is involved, because the result of running keypath on the result is not `u1`, it's `sympy.Piecewise(... sympy.Eq(u1, 1) ...)`. This is actually just `u1`, but Sympy doesn't know it because it doesn't know that `u1` value range is `[0, 1]`. So we manually implement the simplification needed to get the assert to pass.
* **compute_unbacked_bindings** (re **This is pretty fragile**). There is a really funny disaster involving memoization and Inductor process kernel. Ordinarily when I retrace, if there was a memo hit in the old trace, there will be a memo hit in the new trace. However, Inductor process kernel breaks this, because it recreates fake tensor inputs to the operator call from scratch (since they might have different strides), and obviously these tensor inputs don't have the memo from the old one. I tried a little bit to try to manually transplant the memo to the new fake tensor but it seemed hopeless, so I just let the fresh symbol ride, allocating a new unbacked symbol. However, in one of our tests, we rely on knowing that the first nonzero call is equal to the second (memoized) nonzero call. The equality test looked pretty easy to discharge, so I just went ahead and added a deferred runtime assert to this effect and it worked.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124394
Approved by: https://github.com/jansel
ghstack dependencies: #124310, #124314, #124316
Fixes https://github.com/pytorch/pytorch/issues/123854
Important comment:
```
# Never replace unbacked symbols with other unbacked symbols.
# This is error prone because you can cause references to
# unbacked symbols to time travel backwards. E.g.,
#
# u1 = x.item()
# ... use of u1 ...
# u2 = y.item()
# u3 = z.item()
# torch._check(u1 == u2 + u3)
#
# If you replace u1 with u2 + u3, then the use of u1 now
# references u2 and u3 prior to them actually being bound at
# runtime. It's pretty inconvenient to setup control
# dependencies for substitutions, so ban it entirely.
```
This is kind of risky for the internal MRS workstream, because we added these substitutions upon their request in the first place. Fortunately, we still allow substitutions to backed SymInts and constants, and I believe that is what is actually load bearing.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124316
Approved by: https://github.com/ColinPeppler, https://github.com/lezcano
ghstack dependencies: #124310, #124314
This is a partial revert of https://github.com/pytorch/pytorch/pull/124059
Like in #124297, profiling has revealed that testing equality on *every* output is kind of expensive. So we only test equality when we know there is an unbacked binding. This is the same playbook as the previous PR, just on FakeTensorProp instead of PropagateUnbackedSymInts. Note that we also need to populate `unbacked_bindings` in proxy_tensor.py, since we're generating an entirely new graph in that case.
We now have enough propagation that we're able to trigger a bug related to divisibility replacement. In https://github.com/pytorch/pytorch/pull/113165 we allowed to replace `u0` with `u1 * c` for some constant c, when we have determined that u0 is divisible by c. However, where does the binding for u1 come from? What we will have in practice is that there is some node that is supposed to have bound u1, but which actually is getting a `u1 * c` in its output. So, to get u1, we must divide out c. Fortunately, under the divisibility condition, this is always possible (but remember, we must test divisibility at runtime!)
Because we have tightened up asserts, it is now an error to allocate unbacked SymInts and then fail to track them under unbacked_bindings. In torch/_dynamo/eval_frame.py and torch/_functorch/_aot_autograd/collect_metadata_analysis.py there are examples of benign cases where we repropagated fake tensors but then immediately threw away the results. In these cases, it's not appropriate to rebind, since we're still using the old FX graph that has all of the old symbols. So we just manually clear it. It is possible that other cases will need to be updated, so this PR is "risky" from the perspective of hitting fbcode.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124310
Approved by: https://github.com/lezcano
Before this PR, we didn't check that types in a schema were valid. This
is because TorchScript treats unknown types as type variables.
This PR checks types in a schema for the TORCH_LIBRARY APIs. To do this,
we add an `allow_typevars` flag to parseSchema so that TorchScript can
use allow_typevars=True. We also add some error messages for common
mistakes (e.g. using int64_t or double in a schema).
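A hedged sketch of the kind of mistake the check catches, illustrated via the Python `torch.library.define` front end (the `mylib` namespace and op names are made up for this example): C++ scalar type names such as `int64_t`/`double` are not valid schema types, the schema language spells them `int` and `float`.
```python
import torch

# Valid schema: uses the schema language's "int" and "float".
torch.library.define("mylib::good", "(Tensor x, int n, float scale) -> Tensor")

# Something like the following would now fail to parse, with an error message
# suggesting "int"/"float" instead of the C++ type names:
# torch.library.define("mylib::bad", "(Tensor x, int64_t n, double scale) -> Tensor")
```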
Test Plan:
- new tests
Differential Revision: [D56432690](https://our.internmc.facebook.com/intern/diff/D56432690)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124520
Approved by: https://github.com/albanD
This PR is part of an effort to speed up torch.onnx.export (#121422).
- This copies the shape and type from the node to the nodes that are produced by the export. However, for 1-to-N exports, which are very common, this doesn't make much sense and can give the graph in broken shape or type information. As far as I can tell, a shape inference pass is used to propagate the correct shape and type for all interemediate (and final) nodes.
- If there is a situation where this is necessary (shape inference turned off and only 1-to-1 ops are exported ??), perhaps this can be conditionally skipped. It does incur a quadratic cost. Another option is to set a global default for the metadata and
use that for all nodes that get created. Again, this meta data may not make sense for all ops and seems dangerous to do.
- Resolves (8) in #121422.
(partial fix of #121422)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123027
Approved by: https://github.com/BowenBao
The user does not need to return gradients for these args.
We also change how setup_context works to adapt to kwargonly-args. If
the user's op has no kwonly-args, then their setup_context function must
look like `setup_context(ctx, inputs, output)`: we require that the
arguments have the same names.
If the user's op has kwonly-args, then their setup_context function must
look like `setup_context(ctx, inputs, keyword_only_inputs, output)`.
We require that the arguments have the same names.
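A hedged sketch of the two shapes (the op, namespace, and names below are illustrative only; the `register_autograd` call is shown with the keyword `setup_context` form):
```python
import torch
from torch.library import custom_op, register_autograd

@custom_op("mylib::scale", mutates_args=())
def scale(x: torch.Tensor, *, alpha: float = 2.0) -> torch.Tensor:
    return x * alpha

# The op has a kwonly-arg, so setup_context takes (ctx, inputs, keyword_only_inputs, output).
def setup_context(ctx, inputs, keyword_only_inputs, output):
    ctx.alpha = keyword_only_inputs["alpha"]

# Only gradients for the non-kwonly inputs are returned.
def backward(ctx, grad):
    return grad * ctx.alpha

register_autograd("mylib::scale", backward, setup_context=setup_context)
```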
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124806
Approved by: https://github.com/albanD, https://github.com/williamwen42
ghstack dependencies: #124637, #124805
Summary:
Adding function to log additional debug information before killing the expired watchdog timers.
Additional information like stack trace can be added in the debug function using worker process IDs from expired timers.
Test Plan: buck test mode/opt caffe2/test/distributed/elastic/timer:file_based_timer_test
Differential Revision: D56044153
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123883
Approved by: https://github.com/kurman
Fix for https://github.com/pytorch/pytorch/issues/124289.
There was a tensor which had a single, expanded element. inductor generated the strides as all 0, while sdpa expects a dense last dimension `t.stride(-1) == 1`. While these are equivalent, we still hit an error in the kernel. We could make fixes in sdpa, but matching the insignificant strides in inductor also works and I am less aware of the downstream sdpa kernel details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124859
Approved by: https://github.com/drisspg
ghstack dependencies: #124751
This PR adds a `DeviceMesh.from_group()` static method to convert an existing process group to a device mesh.
Motivation: We need `DeviceMesh.from_group()` to allow FSDP2 to interoperate with distributed libraries that do not use `DeviceMesh` for all parallelisms.
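A hedged usage sketch (a single 1-D data-parallel group is shown; the group and mesh names are illustrative):
```python
import torch.distributed as dist
from torch.distributed.device_mesh import DeviceMesh

dist.init_process_group("nccl")
# A process group created by some other library, not via init_device_mesh().
dp_pg = dist.new_group(ranks=list(range(dist.get_world_size())))

# Wrap the existing group as a 1-D DeviceMesh so FSDP2 can consume it.
dp_mesh = DeviceMesh.from_group(dp_pg, device_type="cuda")
```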
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124787
Approved by: https://github.com/wanchaol
ghstack dependencies: #124651, #124741, #124767, #124768, #124780
Summary: We used to skip tensor.to() during tracing when the device is the same. This brings some performance improvement in eager but makes graph capture lose the semantics from the original model. In this diff, we add an additional condition to skip the fast path when we don't have actual data inside a tensor, which is the case when we're using FakeTensor / FunctionalTensor to trace the model. This won't have a perf impact on previous eager models while making sure we can capture the _to_copy() node in the graph.
Test Plan: buck test mode/opt caffe2/test:test_export -- -r device_to
Differential Revision: D55969674
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123732
Approved by: https://github.com/angelayi, https://github.com/tugsbayasgalan
Summary: Before in `move_exported_model_to_train/eval`, we only
switched the CPU versions of the batch norm op. This commit adds
support for the cuda versions of the op too. Note that this fix
is temporary; we won't have to differentiate between these two
cases once we have batch norm consolidation.
Test Plan:
python test/test_quantization.py -k test_move_exported_model_bn
Reviewers: jerryzh168
Subscribers: jerryzh168, leslie-fang-intel, supriyar
Differential Revision: [D56070054](https://our.internmc.facebook.com/intern/diff/D56070054)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123957
Approved by: https://github.com/jerryzh168
When we warm up HIP graphs, we use the cudagraph memory pool to allocate a large part of the memory. We don't necessarily execute the kernels on the GPUs, so we don't want to free up this allocated memory. However, this conflicts with the emptyCache() call happening inside findAlgorithm, where convolution algorithm benchmarking takes place. For benchmarking, we might use large memory allocations to cache algorithm results. As a fix, we just disable the emptyCache() call during cudagraph warmup.
As per the cuDNN PR which did the same thing for CUDA, this does not have a significant effect on memory footprint. a8ff647e42
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124791
Approved by: https://github.com/eellison, https://github.com/jeffdaily
Summary:
It is common for blocks to be missing frames and there are many users asking why.
Let's improve this output message to cover common reasons:
1) block was allocated before _record_memory_history was enabled
2) context or stacks passed to _record_memory_history does not include this block
3) backward events allocated with C++ stack and will not show if stacks = python
Test Plan:
CI and ran it locally.
Differential Revision: D56490921
Pulled By: aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124784
Approved by: https://github.com/zdevito
Summary: Avoid the situation where the graph traversal finds a matmul node with a `get_attr` as its `args[0]` and incorrectly propagates the `get_attr`'s meta to everything downstream.
Test Plan: CI
Differential Revision: D56219120
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124415
Approved by: https://github.com/jerryzh168
MTIA device has its own Module in PyTorch now.
torch.mtia has the following APIs, similar to other backends. lazy_init is also supported.
```
__all__ = [
"init",
"is_available",
"synchronize",
"device_count",
"current_device",
"current_stream",
"default_stream",
"set_stream",
"stream",
"device",
]
```
------------
For device management, we expand AcceleratorHooksInterface to support generic device management; it can be used in both C++ and Python.
```
def _accelerator_hooks_device_count() -> _int: ...
def _accelerator_hooks_set_current_device(device_index: _int) -> None: ...
def _accelerator_hooks_get_current_device() -> _int : ...
def _accelerator_hooks_exchange_device(device_index: _int) -> _int : ...
def _accelerator_hooks_maybe_exchange_device(device_index: _int) -> _int : ...
```
---------
Adding get_device_module API to retrieve device modules for different device types.
```
def get_device_module(device: Optional[Union[torch.device, str]] = None)
```
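A small usage sketch (assuming the API is exposed at the top level as `torch.get_device_module`, per the signature above):
```python
import torch

cuda_mod = torch.get_device_module("cuda")   # returns the torch.cuda module
default_mod = torch.get_device_module()      # module for the current accelerator type
print(cuda_mod.device_count(), default_mod)
```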
---------
Differential Revision: [D56443356](https://our.internmc.facebook.com/intern/diff/D56443356)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123612
Approved by: https://github.com/albanD
ghstack dependencies: #123611
This diff intends to build device generic torch.Stream and torch.Event for newly added accelerators in PyTorch.
------------
**torch.Stream APIs**
```
# Defined in torch/csrc/Stream.cpp
class Stream(_StreamBase):
stream_id: _int # Stream id
device_index: _int
device_type: _int
device: _device # The device of the stream
@overload
def __new__(self, device: Optional[DeviceLikeType] = None, priority: _int = 0) -> Stream: ...
@overload
def __new__(self, stream_id: _int, device_index: _int, device_type: _int, priority: _int = 0) -> Stream: ...
def wait_event(self, event: Event) -> None: ...
def wait_stream(self, other: Stream) -> None: ...
def record_event(self, event: Optional[Event] = None) -> Event: ...
def query(self) -> None: ...
def synchronize(self) -> None: ...
def __hash__(self) -> _int: ...
def __repr__(self) -> str: ...
def __eq__(self, other: object) -> _bool: ...
```
------------------
**torch.Event APIs**:
- IPC related APIs are not implemented, since many device backends don't support it, but we leave interfaces there for future adaption of torch.cuda.Stream.
- currently only enable_timing is supported, since it is the most commonly used flag in other device backends. We would have to refactor the event flag system in PyTorch to support fancier flags.
- elapsedTime API is added to c10::Event
```
# Defined in torch/csrc/Event.cpp
class Event(_EventBase):
device: _device # The device of the Event
event_id: _int # The raw event created by device backend
def __new__(self,
device: Optional[DeviceLikeType] = None,
enable_timing: _bool = False,
blocking: _bool = False,
interprocess: _bool = False) -> Event: ...
@classmethod
def from_ipc_handle(self, device: DeviceLikeType, ipc_handle: bytes) -> Event: ...
def record(self, stream: Optional[Stream] = None) -> None: ...
def wait(self, stream: Optional[Stream] = None) -> None: ...
def query(self) -> _bool: ...
def elapsed_time(self, other: Event) -> _float: ...
def synchronize(self) -> None: ...
def ipc_handle(self) -> bytes: ...
def __repr__(self) -> str: ...
```
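A hedged usage sketch of the device-generic classes (the device is assumed to be "cuda" here; any backend that implements the hooks should work the same way):
```python
import torch

s = torch.Stream(device="cuda")
start = torch.Event(enable_timing=True)
end = torch.Event(enable_timing=True)

start.record(s)
# ... enqueue work on stream s ...
end.record(s)
end.synchronize()
print(start.elapsed_time(end))  # elapsed time in milliseconds
```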
-----------
c10::Event provides new APIs
- calculate **elapsedTime**.
- Get raw event id
- Synchronize event.
```
double elapsedTime(const Event& event) const {
return impl_.elapsedTime(event.impl_);
}
void* eventId() const {
return impl_.eventId();
}
void synchronize() const {
return impl_.synchronize();
}
```
----------
TODO: need to find a good way to test them in PyTorch with API mocks.
Differential Revision: [D56443357](https://our.internmc.facebook.com/intern/diff/D56443357)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123611
Approved by: https://github.com/albanD, https://github.com/jeffdaily
In #124640 I see the error
```
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 887, in load
compiled_graph = FxGraphCache._lookup_graph(key, example_inputs)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 776, in _lookup_graph
write_atomic(artifact_path, graph.source_code)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 412, in write_atomic
with tmp_path.open(write_mode) as f:
File "/opt/conda/envs/py_3.10/lib/python3.10/pathlib.py", line 1119, in open
return self._accessor.open(self, mode, buffering, encoding, errors,
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp02wlik2v/iu/.28383.139931139675904.tmp'
```
Which is fixed by creating the parent directory first. Since this is what you
want to do in most cases, I add an argument to `write_atomic` to do so itself.
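A minimal sketch of the idea (a hypothetical simplification, not the actual codecache implementation; the `make_dirs` flag name is assumed):
```python
import os
from pathlib import Path

def write_atomic(path: str, content: str, make_dirs: bool = False) -> None:
    target = Path(path)
    if make_dirs:
        # Create the parent directory up front so opening the temp file cannot
        # fail with FileNotFoundError.
        target.parent.mkdir(parents=True, exist_ok=True)
    tmp_path = target.parent / f".{os.getpid()}.tmp"
    tmp_path.write_text(content)
    tmp_path.rename(target)  # atomic within the same filesystem
```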
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124646
Approved by: https://github.com/lezcano
If a user accesses an OpOverloadPacket, then creates a new OpOverload,
then uses the OpOverloadPacket, the new OpOverload never gets hit. This
is because OpOverloadPacket caches OpOverloads when it is constructed.
This PR fixes the problem by "refreshing" the OpOverloadPacket if a new
OpOverload gets constructed and the OpOverloadPacket exists.
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124654
Approved by: https://github.com/albanD
Summary:
1. Package public headers of kineto if USE_KINETO so that they can be used by PrivateUse1 users.
2. Add PrivateUse1 key to ActivityType.
3. Support PrivateUse1 key in the functions deviceTypeFromActivity and _supported_activities.
4. Fix some bugs when processing profiler results.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124818
Approved by: https://github.com/aaronenyeshi
This PR makes sure to construct the `DeviceMesh`'s `mesh` tensor on CPU device in `init_device_mesh()`. This means that we can call `init_device_mesh()` under meta-device context and still construct the correct `mesh` tensor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124767
Approved by: https://github.com/wz337
ghstack dependencies: #124651, #124741
Fixes #124546
When setting `use_orig_params = False` and using activation checkpointing, the FQN mapping as retrieved by the `_get_fqns` function is incorrect because the prefix that is added to the name of each activation checkpointed module, `_checkpoint_wrapped_module`, can still be present. I think this is an edge case with the `_get_fqns` function that was not addressed by this previous commit #118119.
Without the change, the list of object names for an activation checkpointed module with FSDP (and `use_orig_params=False`) can be something like:
```
['model', '_fsdp_wrapped_module', 'transformer', 'blocks', '0', '_fsdp_wrapped_module', '_checkpoint_wrapped_module', '_flat_param']
```
Which will incorrectly return just one FQN, `{'model.transformer.blocks.0._flat_param'}`, when all the FQNs of the parameters of the transformer block should be returned.
With the change, the list of object names will now have `_checkpoint_wrapped_module` removed:
```
['model', '_fsdp_wrapped_module', 'transformer', 'blocks', '0', '_fsdp_wrapped_module', '_flat_param']
```
And the FQNs are correctly retrieved and returned in `_get_fqns` when [this condition](ea61c9cb29/torch/distributed/checkpoint/state_dict.py (L168)) is satisfied. The correct FQNs are:
```
{'model.transformer.blocks.0.attn.Wqkv.bias', 'model.transformer.blocks.0.ffn.up_proj.bias',
'model.transformer.blocks.0.attn.out_proj.weight', 'model.transformer.blocks.0.norm_2.weight',
'model.transformer.blocks.0.ffn.down_proj.weight', 'model.transformer.blocks.0.attn.Wqkv.weight',
'model.transformer.blocks.0.norm_2.bias', 'model.transformer.blocks.0.ffn.up_proj.weight',
'model.transformer.blocks.0.ffn.down_proj.bias', 'model.transformer.blocks.0.norm_1.bias',
'model.transformer.blocks.0.norm_1.weight', 'model.transformer.blocks.0.attn.out_proj.bias'}
```
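A toy sketch of the idea (a hypothetical helper, not the actual `_get_fqns` code): both wrapper prefixes are dropped when reconstructing the FQN from the object-name path.
```python
WRAPPER_NAMES = {"_fsdp_wrapped_module", "_checkpoint_wrapped_module"}

def strip_wrappers(obj_names):
    # Drop FSDP / activation-checkpointing wrapper segments from the path.
    return [name for name in obj_names if name not in WRAPPER_NAMES]

names = ['model', '_fsdp_wrapped_module', 'transformer', 'blocks', '0',
         '_fsdp_wrapped_module', '_checkpoint_wrapped_module', '_flat_param']
print(".".join(strip_wrappers(names)))  # model.transformer.blocks.0._flat_param
```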
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124698
Approved by: https://github.com/Skylion007
The diff https://github.com/pytorch/pytorch/pull/122661 introduces a new automatic cache refresh mechanism during all inductor-derived test cases.
But this refresh mechanism seems not to work properly across process boundaries, specifically when using autotune_in_subproc, which many tests in test_cutlass_backend.py rely on.
Solution: Set the env var INDUCTOR_TEST_DISABLE_FRESH_CACHE=1
early during test setup within test_cutlass_backend.py
Test Plan:
This is a change to unit tests only.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124574
Approved by: https://github.com/aakhundov
ghstack dependencies: #121497, #123930, #123932, #121734, #124107
This subsumes https://github.com/pytorch/pytorch/pull/124069
In the original PR, my idea was that when we run PropagateUnbackedSymInts, we check that the sizes before and after are exactly the same. This ended up turning up lots of bugs that I didn't feel like fixing. Separately, Ivan let me know that this pass was quite expensive in terms of compile time, since we spent a lot of time thinking about the equalities.
To kill two birds with one stone, we now only check for equality precisely when an unbacked SymInt was bound (thanks to the previous PR in this stack, we now have this information). Specifically, we look to see if `meta["unbacked_bindings"]` is set on the old node, and if it is, we assert the old value is equal to the new value from the repropagation. Note that the pytree key is used to actually extract the new value from the example value, as it may be nested inside an, e.g., tensor size.
We do something a bit naughty at the end: we use `defer_runtime_assert` to actually teach ShapeEnv about the equality. This is implementationally equivalent to what we used to do, but we're going to change this later soon.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124297
Approved by: https://github.com/lezcano
ghstack dependencies: #124290
The important comment:
```
# Whenever we allocate a fresh unbacked Symbol, we add it to this
# pending list. Unbacked symbol allocation can occur at unpredictable
# points during meta tensor propagation, but at some point, we
# have to know what the binding site for an unbacked symbol is, and
# this is computed when we actually place the node in the graph. The
# important thing is that we always actually handle every unaccounted
# for unbacked symbol, so this list helps us keep track of them and
# then make sure they are all accounted for.
#
# We could potentially give rise to errors earlier by lexically
# scoping when we do propagation, and only allowing unbacked symbols
# to be allocated at this point in time. However this is inconvenient
# to do in Dynamo, because fake tensor propagation is far from when we
# analyze binding sites (set_example_value), so we do it in a more
# mutatey way.
#
# NB: fresh unbacked symbols NEVER get substitutions applied to them,
# they are binding sites!
```
The compute_unbacked_bindings is the other half of the equation: the thing that actually consumes the pending_fresh_unbacked_symbols and does something with them. Important comment:
```
After having run fake tensor propagation and producing example_value
result, traverse example_value looking for freshly bound unbacked
symbols and record their paths for later. It is an error if
we have allocated an unbacked SymInt but it cannot be found in
example_value. (NB: this means if you have a multi-output
function, you must call this on the tuple of tensor output, you
cannot wait!)
```
For example, if I return a tensor with size `[u0, u1]`, and u1 is a fresh unbacked SymInt, then I'll have `{u1: KeyPath(".size(1)")}`, telling me I can get u1 by running `size(1)` on the result of this node. u0 is not fresh (it probably flowed in as an argument), so I don't generate a binding for it.
I eventually intend to propagate this information all the way to Inductor lowering, where extra metadata about unbacked symbol binding will be canonically used for codegen, instead of trying to infer it from defs/uses.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124290
Approved by: https://github.com/lezcano
Summary:
This diff has no logic changes. It updates the variable names to be in sync with the name used in prepare_global_plan in StorageWriter. Pasting func signature for easy reference -
abc.abstractmethod
def prepare_global_plan(self, plans: List[SavePlan]) -> List[SavePlan]:
Differential Revision: D56480396
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124770
Approved by: https://github.com/fegin
This PR adds a private init backend option to tackle issues with sub-mesh
creation:
in device mesh slicing we don't want to create process groups again,
so being able to explicitly turn group creation off is useful.
Also, I think there might be more submesh creation functionality coming, so
having this flag would ensure that no new group is created.
Differential Revision: [D56497780](https://our.internmc.facebook.com/intern/diff/D56497780)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124780
Approved by: https://github.com/awgu
Summary: [#123231](https://github.com/pytorch/pytorch/pull/123231) adds cudagraph support for more types of functions (i.e., cudagraph-managed input mutation). These newly supported functions may have mutated static inputs, leading to assertion errors in some workloads which previously skipped cudagraphs. This diff adds a config to opt in to the new feature.
Test Plan: ci
Differential Revision: D56481353
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124754
Approved by: https://github.com/eellison
Fixes https://github.com/pytorch/test-infra/issues/4468
This is done by updating the filter config script to accept a list of test configs coming from workflow dispatch. For example, having `inductor_huggingface_perf,inductor_timm_perf,inductor_torchbench_perf` will benchmark all 3 datasets, while having `inductor_torchbench_perf` will only run TorchBench. This is exposed via a new string workflow dispatch parameters called `benchmark_configs`.
Note that GH limits the maximum number of workflow dispatch parameters to 10, so I need to consolidate `training` and `inference` into `training_and_inference` to squeeze the new parameter into the list.
### Testing
Run the script manually and confirm that the filtered list of test config is correct.
Also manually dispatch the job with the new parameter https://github.com/pytorch/pytorch/actions/runs/8808159905 and only the selected `inductor_huggingface_perf` is kept https://github.com/pytorch/pytorch/actions/runs/8808159905/job/24176683708#step:11:128
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124795
Approved by: https://github.com/clee2000
The test checks for a substring "loadu" in generated code. On AVX systems that line is:
> auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(i0))
however on non-AVX systems it is
> auto tmp0 = in_ptr0[static_cast<long>(i0)];
the difference depends on `codecache.valid_vec_isa_list()` being non-empty. See torch/_inductor/codegen/cpp.py:2639
Modify the test to account for that.
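A hedged sketch of the adjusted expectation (the helper below is illustrative; the real test inspects the generated C++ source string):
```python
from torch._inductor import codecache

def expected_load_snippet() -> str:
    # With a usable vector ISA the kernel loads via Vectorized<float>::loadu;
    # otherwise it falls back to plain scalar indexing on in_ptr0.
    return "loadu" if codecache.valid_vec_isa_list() else "in_ptr0["
```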
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117272
Approved by: https://github.com/jgong5, https://github.com/jansel
This reduces the default monitor_interval for torchelastic to 0.1s as testing shows negligible load for common use cases. Even at the extreme, 100k processes is only 45.4% CPU util of a single core.
Torchelastic monitor_interval only monitors the processes on a single worker so under typical loads even for huge jobs we expect ~8 subprocesses per machine with one per GPU.
As an external datapoint, Python's wait polls every 50usec-50ms (https://github.com/python/cpython/blob/main/Lib/subprocess.py#L2035).
## Motivation
This setting is used to control how frequently we poll for failed processes in elastic.
* For some jobs of note we run elastic 3 times per try so with the default timeout of 5 seconds we should save ~15 seconds per retry.
* @kiukchung's use case: Apparently this is annoying in notebooks etc since it adds delay to shutdown when testing things
## Results
This is measured in cores (100% is a single core under full load).
| monitor_interval (s) | nproc-per-node | CPU util (highest observed) |
| -------------------- | -------------- | --------------------------- |
| 1.0 | 10 | 0.2% |
| 0.1 | 1 | 0.4% |
| 0.1 | 10 | 0.4% |
| 0.01 | 10 | 0.9% |
| 0.001 | 10 | 4.0% |
| 0.1 | 100 | 0.5% |
| 0.1 | 1000 | 2.2% |
| 0.1 | 10000 | 15.7% |
| 0.1 | 100000 | 45.4% |
## Methodology
```sh
# run command
$ LOGLEVEL=INFO torchrun --nnodes 1 --nproc-per-node 10 --monitor-interval 0.1 ~/wait.py
# wait a few seconds for all processes to start and reach steady state and then run, wait ~30s or 3 prints and take the highest
$ top -b -d 10 -c | rg 'torchrun.*wait'
```
wait.py
```py
import time
time.sleep(10*60)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124692
Approved by: https://github.com/kiukchung, https://github.com/kurman
`from_local` with replicate placement would run mesh_broadcast if `run_check=True`. By default `from_local` has `run_check=True`, but for the FSDP state_dict case we know for sure that these are already replicated on the dp dimension (FSDP + TP), so we don't need to check/force-check it.
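A hedged sketch of the fast path (the mesh/tensor setup below is illustrative of the FSDP + TP state_dict flow):
```python
import torch
import torch.distributed as dist
from torch.distributed._tensor import DTensor, Replicate
from torch.distributed.device_mesh import init_device_mesh

dist.init_process_group("nccl")
mesh = init_device_mesh("cuda", (dist.get_world_size(),))
local = torch.randn(4, 4, device="cuda")  # known to be identical on every rank

# Skip the collective verification that run_check=True would perform.
dt = DTensor.from_local(local, mesh, [Replicate()], run_check=False)
```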
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123802
Approved by: https://github.com/wanchaol
hiprtc doesn't seem to include the null byte automatically in the failure logs, resulting in heap buffer overflow. Initializing the log string with the null byte avoids the problem.
Found by rocm address sanitizer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121865
Approved by: https://github.com/malfet
**Summary**
Lift up the quant node before view-like nodes. This can benefit the performance of Attention-like blocks. For example, we have the pattern:
```
DQ
DQ LINEAR
LINEAR VIEW
VIEW PERMUTE
PERMUTE TRANSPOSE
Q Q
DQ DQ
Matmul
DIV
ADD
SOFTMAX
```
We want to lift up the quant nodes from `matmul` to before the view-like nodes, so they sit at the output of the Linear node.
```
DQ
DQ LINEAR
LINEAR Q
Q VIEW
VIEW PERMUTE
PERMUTE TRANSPOSE
DQ DQ
Matmul
DIV
ADD
SOFTMAX
```
It produces a `DQ->LINEAR->Q` pattern which can be fused by backend.
**Test Plan**
```
python -m pytest quantization/pt2e/test_x86inductor_quantizer.py -k test_attention_block
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122777
Approved by: https://github.com/jerryzh168, https://github.com/jgong5
ghstack dependencies: #122776
This PR switches export IR from aot-dispatch to pre-dispatch IR.
**What is pre-dispatch IR and why should you care?**
Currently the default IR returned by torch.export can contain only functional ATen operators after ALL pytorch dispatcher decompositions (for example, CompositeImplicitAutograd) run.
In contrast, pre-dispatch IR refers to an IR that can contain all functional ATen operators (i.e., not just from the core subset), before any decomposition happens, as well as operators that manipulate autograd state. Pre-dispatch IR closely resembles eager PyTorch computation, but is still functional and serializable by torch.export. As a result:
- You can train the pre-dispatch IR in eager mode as the IR contains necessary information for the autograd engine to automatically generate a backward graph.
- You can write sound graph transformations more easily as the IR is functional.
- Since it is an ATen IR, it is still normalized. For example, torch.add has multiple overloads, but aten.add.Tensor is unique in this IR.
If you want to get the core aten IR out of `torch.export`, you will need to:
```
ep = torch.export.export(M(), inputs)
ep_for_core_aten = ep.run_decompositions()
```
Differential Revision: [D56273267](https://our.internmc.facebook.com/intern/diff/D56273267)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123573
Approved by: https://github.com/gmagogsfm
Summary:
Include improvements such as:
- AMD: roctracer crash fix and roctracer external correlations
- NCCL metadata: process group id to process group name
- Complete nanosecond transition for Kineto
- Remove PrivateUse1Type function causing gpu track to be above cpu tracks
- Use relative time and fix gpu user annotation causing events to overlap
Test Plan: CI and Github CI (full suite)
Reviewed By: sraikund16
Differential Revision: D56475055
Pulled By: aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124737
Approved by: https://github.com/davidberard98, https://github.com/malfet
This PR includes two things:
1. Changes to support `load_state_dict(assign=True)`
- These changes are not ideal, but until we have `DTensor` padding the local tensor and general `swap_tensors` adoption, we may need to make do.
2. Example of how to convert a full state dict on rank 0 to sharded state dict on all ranks via broadcast
- ~~To-do: check for `recordStream` from the funcol broadcast; if being called, remediate either via `async_op=False` c10d broadcast or use `TORCH_NCCL_AVOID_RECORD_STREAMS=1`~~ switched to using c10d `async_op=False` broadcast
- To-do: check for broadcast latency since not using any coalescing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124651
Approved by: https://github.com/wanchaol
Summary:
The libuv backend isn't enabled in PTD by default now. Add an option to enable the libuv backend to improve scaling of the rendezvous process.
This tries not to make assumptions about the default libuv settings in TCPStore since they may change in the next release.
Test Plan: CI
Differential Revision: D56435815
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124684
Approved by: https://github.com/d4l3k, https://github.com/XilunWu
This PR:
- exposes torch.testing._internal.optests.opcheck as
torch.library.opcheck
- Adds support for CustomOpDef (aka functions decorated with
torch.library.custom_op) to opcheck.
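A hedged usage sketch of the newly exposed entry point (the op below is made up for illustration):
```python
import torch
from torch.library import custom_op, opcheck

@custom_op("mylib::my_sin", mutates_args=())
def my_sin(x: torch.Tensor) -> torch.Tensor:
    return torch.sin(x)

@my_sin.register_fake
def _(x):
    return torch.empty_like(x)

# opcheck accepts the CustomOpDef produced by the decorator directly.
opcheck(my_sin, (torch.randn(3),))
```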
Test Plan:
- Updated tests
- We validated opcheck's design internally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124496
Approved by: https://github.com/williamwen42
Summary:
The current _AddRuntimeAssertionsForInlineConstraintsPass has 2 known issues caused by its use of torch.fx.Interpreter:
1. SymInt-related ops (e.g. item()) are executed, causing new Unbacked SymInts to appear in the graph during the pass.
2. The graph is reconstructed, and node names/indices can be different from before, causing mismatches with `module_call_graph`, and leading to issues during unflattening.
This refactors the pass to use PassBase instead of _ExportPassBaseDeprecatedDoNotUse, only constructing new nodes for assertions.
Test Plan: This pass is called on all strict-mode export calls with range_constraints, test that behavior remains unchanged.
Differential Revision: D56360137
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124503
Approved by: https://github.com/zhxchen17
The `function_events` in `_parse_kineto_results` is used to contain all function events from the result. It contains 2 kinds of events. One is frontend function events whose correlation id is 0, for example, `aten::add`, `aten::mul`. They are on the top level of the profile results. The other is the backend events, which are associated with the frontend events and its correlation id is > 0, for example, `at::native::vectorized_elementwise_kernel`, it should be the backend event of a frontend element-wise op. They have the device execution duration for the related frontend op.
In the following post processing code, the **frontend function events** should be iterated to find its correlated backend events in `device_corr_map`, instead of iterating all function events, because `device_corr_map` is designed as a dict, whose key is the id of the frontend function event.
3af12447f8/torch/autograd/profiler.py (L543-L560)
3af12447f8/torch/autograd/profiler.py (L537-L540)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124596
Approved by: https://github.com/aaronenyeshi
Part of https://github.com/pytorch/ci-infra/issues/113
This table is only located in one AWS account, but the ARC account
also needs to access it, so explicitly specify the account name for the
table.
We've had issues using addr2line. Certain versions of
CentOS ship a version that has a performance regression making it very slow,
and even normally it is not that fast, taking several seconds even when parallelized
for a typical memory trace dump.
Folly Symbolize or LLVMSymbolize are fast but it requires PyTorch take a dependency on those libraries to do this, and given the number of environments we run stuff in, we end up hitting cases where we fallback to slow addr2line behavior.
This adds a standalone symbolizer to PyTorch similar to the unwinder which has
no external dependencies and is ~20x faster than addr2line for unwinding PyTorch frames.
I've tested this on some memory profiling runs using all combinations of {gcc, clang} x {dwarf4, dwarf5} and it seems to do a good job at getting line numbers and function names right. It is also careful to route all reads of library data through the `CheckedLexer` object, which ensures it is not reading out of bounds of the section. Errors are routed through UnwindError so that those exceptions get caught and we produce a ?? frame rather than crash. I also added a fuzz test which gives all our symbolizer options random addresses in the process to make sure they do not crash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123966
Approved by: https://github.com/ezyang
This commit enables float8_e5m2 and float8_e4m3fn dtypes in fx quantization and PT2E.
Motivation for using fp8 quantization instead of int8:
- it works better to run inference with the same datatype the model was trained with,
- fp8 can handle outliers better, which is one of the problems in LLMs activations.
The numerical recipe we want to use it for is fp8 inference:
- bgemms/gemms running in float8_e4m3fn,
- Per-Tensor-Quantization/Scaling,
- amax observer for measurement with input_backoff and weight_backoff.
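A minimal illustration of the two float8 dtypes now accepted by the quantization flow (plain dtype conversion shown; the observer/quantizer wiring is omitted):
```python
import torch

x = torch.randn(16, 16)
w_e4m3 = x.to(torch.float8_e4m3fn)  # typical choice for weights/activations in fp8 inference
a_e5m2 = x.to(torch.float8_e5m2)    # wider dynamic range, fewer mantissa bits
print(w_e4m3.dtype, a_e5m2.dtype)
```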
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123161
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
If a Kernel does not return in a reasonable amount of time during autotuning, it can delay inductor compilation a lot. This change introduces soft / hard kill timeouts and a mechanism to kill Kernels being profiled in subprocesses if they take too long.
Correspondingly, a few new config options are introduced within _inductor/config.py - all of them with inline docs.
Test Plan:
Existing tests within test_max_autotune.py and test_cutlass_backend.py ) cover the new codepaths.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123932
Approved by: https://github.com/jansel
ghstack dependencies: #121497, #123930
Support fused_sgd_kernel support for CPU.
## Bench result:
32 core/sockets ICX
Test Scripts:
https://gist.github.com/zhuhaozhe/688763e17e93e4c5e12f25f676ec90d9
https://gist.github.com/zhuhaozhe/ad9938694bc7fae8b66d376f4dffc6c9
```
Tensor Size: 262144, Num Tensor 4, Num Threads: 1
_single_tensor_sgd time: 0.2301 seconds
_fused_sgd time: 0.0925 seconds
Tensor Size: 4194304, Num Tensor 32, Num Threads: 32
_single_tensor_sgd time: 2.6195 seconds
_fused_sgd time: 1.7543 seconds
```
## Test Plan:
```
python test_optim.py -k test_fused_matches_forloop
python test_optim.py -k test_fused_large_tensor
python test_optim.py -k test_can_load_older_state_dict
python test_optim.py -k test_grad_scaling_autocast_fused_optimizers
python test_torch.py -k test_grad_scaling_autocast_fused
python test_torch.py -k test_params_invalidated_with_grads_invalidated_between_unscale_and_step
```
Looks like we already have some PRs under issue https://github.com/pytorch/pytorch/issues/123451 to unify the UTs, so I did not modify the UTs in this PR.
Co-authored-by: Jane Xu <janeyx@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123629
Approved by: https://github.com/jgong5, https://github.com/janeyx99
Adds the following to allowed globals for the `weights_only` unpickler
- [x] `torch._utils._rebuild_qtensor` and qtensor related types
- [x] `torch._utils._rebuild_parameter_with_state` (used deserializing a parameter that has user-defined attributes like `Param.foo`)
The remaining rebuild functions that have not been allowlisted are
- [x] `torch._utils._rebuild_wrapper_subclass` (allowlisted in above PR)
- [ ] `torch._utils._rebuild_device_tensor_from_numpy`
- [ ] `torch._utils._rebuild_xla_tensor` (legacy)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124330
Approved by: https://github.com/albanD
By slicing `copyFromBuffer:sourceOffset:toBuffer:destinationOffset:size:` into 2Gb chunks
Add regression test, but limit it to machines with 12Gb of RAM or more, and MacOS 14+, as on MacOS 13 attempt to alloc 4Gb tensor fails with:
```
/AppleInternal/Library/BuildRoots/c651a45f-806e-11ed-a221-7ef33c48bc85/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Types/MPSNDArray.mm:724: failed assertion `[MPSNDArray initWithDevice:descriptor:] Error: total bytes of NDArray > 2**32'
```
Fixes https://github.com/pytorch/pytorch/issues/124335
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124635
Approved by: https://github.com/kulinseth
Summary:
For motivation behind the overall stack of diffs see D56218385 summary.
This particular diff makes cpp_dumper take a custom printer function to log callstacks one group at a time, and as such it no longer runs into the 30K character limit of `LOG(INFO)`.
Test Plan:
```
[romanmal@46150.od /data/sandcastle/boxes/fbsource/fbcode (520a7b7b5)]$ buck2 test //caffe2/torch/csrc/distributed/c10d/...
File changed: fbcode//common/base/ThreadStackTrace.cpp
File changed: fbsource//xplat/caffe2/torch/csrc/distributed/c10d/fb/TraceUtils.cpp
File changed: fbcode//caffe2/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp
4 additional file change events
Buck UI: https://www.internalfb.com/buck2/d8ceae86-7d6f-4779-ad0c-8e37eddcff98
Network: Up: 0B Down: 0B
Jobs completed: 2. Time elapsed: 1.5s.
Tests finished: Pass 0. Fail 0. Fatal 0. Skip 0. Build failure 0
NO TESTS RAN
[romanmal@46150.od /data/sandcastle/boxes/fbsource/fbcode (520a7b7b5)]$
```
Tested to print the stack trace:
P1220109730
Differential Revision: D56218360
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124628
Approved by: https://github.com/wconstab
Summary: if tqdm is not shut down properly, it will leave the monitor thread alive. This causes an issue in the multithreading test because we check all events in that test against their tids. The events that correspond to these lingering threads all have a TID of (uint64_t)(-1), which is invalid. The workaround is turning off the monitor thread when tqdm is loaded. Since these are unit tests, it is safe to turn off the monitor thread.
Test Plan: buck test mode/dev-sand caffe2/test:profiler
Differential Revision: D56310301
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124409
Approved by: https://github.com/aaronenyeshi
Currently, if initializers are available, they are included in the ONNX model. If they are not available, the model is serialized without them.
However, there are times when the initializers are available, but the user prefers not to include them in the model, say for visualizing it on Netron or because the initializers will be specified along with the inputs in the ONNX runtime of choice.
This PR allows users to pass `include_initializers` to the `ONNXProgram.save()` API.
Fixes #100996
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121904
Approved by: https://github.com/titaiwangms
Following the example of PyTorch supporting a preferred Linalg library (cusolver or magma), this PR introduces a preferred blas library selector of either cublas or cublaslt for CUDA and hipblas or hipblaslt for ROCm via normal hipification of sources.
The default blas implementation remains cublas or hipblas. cublaslt or hipblaslt can be enabled using environment variable TORCH_BLAS_PREFER_CUBLASLT=1 (or TORCH_BLAS_PREFER_HIPBLASLT=1 as an alias) or by calling `torch.backends.cuda.preferred_blas_library(backend="cublaslt")` or as an alias `backend="hipblaslt"`.
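A hedged usage sketch of the new selector (either the call or the environment variable can be used; querying with no argument is assumed to return the current preference, mirroring `preferred_linalg_library`):
```python
import torch

torch.backends.cuda.preferred_blas_library(backend="cublaslt")  # or "cublas" / "hipblaslt"
print(torch.backends.cuda.preferred_blas_library())             # query the current preference
```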
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122106
Approved by: https://github.com/lezcano
Summary: When unbacked SymInts are used only in the grid of a user-written Triton kernel call, there is no dependency between the Triton kernel's buffer and those unbacked SymInts. As a result, the definitions of the unbacked SymInts are not codegen-ed and the code using them in the grid definition breaks.
Here we add the unbacked SymInts used in the grid to the `get_unbacked_symbol_uses` returned by the `UserDefinedTritonKernel` alongside those used in the `kwargs` (returned by `ExternKernel`).
Test Plan:
```
$ python test/inductor/test_aot_inductor.py -k test_triton_kernel_unbacked_symint
...
----------------------------------------------------------------------
Ran 24 tests in 155.764s
OK (skipped=16)
```
Reviewers:
Subscribers:
Tasks:
Tags:
Differential Revision: [D56406991](https://our.internmc.facebook.com/intern/diff/D56406991)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124594
Approved by: https://github.com/oulgen
I previously added @skipIfRocm as a class annotation within test/inductor/test_cutlass_backend.py - turns out this annotation always skips if applied at class level, so I need to skip Cutlass tests on ROCm differently.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123930
Approved by: https://github.com/jansel
ghstack dependencies: #121497
# Motivation
This PR is a part of RFC #114848. This PR depends on oneDNN compilation in #117098, basic integration support in #117112, and Conv integration code in #117512. Some runtime support is needed in #116019.
This PR implements the convolution and deconvolution operators for XPU that should be defined in `aten` libraries. Backward is also supported.
With this PR, the conv-related operators should be functionally ready.
Co-authored-by: xiaolil1 <xiaoli.liu@intel.com>
Co-authored-by: lei,zhenyuan <zhenyuan.lei@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117529
Approved by: https://github.com/EikanWang, https://github.com/malfet
ghstack dependencies: #117512
# Motivation
This PR is a part of RFC #114848. This PR depends on oneDNN compilation in #117098 and basic integration support in #117112. Some runtime support is needed in #116019.
This PR provides the oneDNN integration code for Convolution and Deconvolution related operators. All aten convolution operators (conv, deconv, and conv-pointwise fusion) will go through this layer before executing the oneDNN primitive. The integration code is responsible for providing the correct memory description for the primitive, accompanied by the primitive attribute description.
With this PR landed, we add Conv-related operators along with their registration.
Co-authored-by: xiaolil1 <xiaoli.liu@intel.com>
Co-authored-by: lei,zhenyuan <zhenyuan.lei@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117512
Approved by: https://github.com/EikanWang, https://github.com/malfet
Previously, when the Cutlass backend was enabled, using dynamic shapes could lead to exceptions during JIT.
With this change, there are guards in place to just disable the Cutlass backend if dynamic dimensions are involved.
In addition, if no choices for a GEMM are available using the selected backends, then an ATen Kernel is used as fallback, even if the ATen backend is not enabled.
Test:
CI
Additional unit test in test_cutlass_backend.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121497
Approved by: https://github.com/jansel
Fixes #123039
In ABI mode, ExternKernelSchedulerNode generates code using `aoti_torch_tensor_copy_`, which requires an `AtenTensorHandle`, but the allocation generates an ArrayRefTensor to allocate memory on the stack. To fix this issue, this PR prevents ExternKernelSchedulerNode from using stack memory allocation in ABI mode and creates an AtenTensorHandle instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124037
Approved by: https://github.com/desertfire
This PR unifies CUDA, XPU and PrivateUse1 in the torch profiler. Now CUDA, XPU and PrivateUse1 can all use the string field `use_device` to distinguish each other, and they share one device path for calculating Kineto time durations and memory statistics for post processing.
#suppress-api-compatibility-check
Co-authored-by: Aaron Enye Shi <enye.shi@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123247
Approved by: https://github.com/aaronenyeshi
Adds a ruff lint rule to ban raising raw exceptions. Most of these should at the very least be runtime errors, value errors, type errors or some other error type. There are hundreds of instances of these bad exception types already in the codebase, so I have noqa'd most of them. Hopefully this error code will get committers to rethink what exception type they should raise when they submit a PR.
I also encourage people to gradually go and fix all the existing noqas that have been added so they can be removed over time and our exception typing can be improved.
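An illustration of what the rule is getting at (the function is made up; the exact ruff rule code is not restated here):
```python
def load_config(path: str) -> dict:
    if not path.endswith(".yaml"):
        raise Exception("bad config path")  # flagged: raw Exception hides intent
    return {}

def load_config_better(path: str) -> dict:
    if not path.endswith(".yaml"):
        raise ValueError(f"expected a .yaml config, got {path!r}")  # specific type
    return {}
```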
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124570
Approved by: https://github.com/ezyang
Update ruff to 0.4.1.
This version fixes a lot false negatives/false positives, is 20-40% faster, and has various other bug fixes.
Below is a before and after table showing the execution time of ruff lint and ruff format in milliseconds courtesy of https://astral.sh/blog/ruff-v0.4.0
| Repository | Linter (v0.3) | Linter (v0.4) | Formatter (v0.3) | Formatter (v0.4) |
|----------------------------------------------------|---------------|---------------|------------------|------------------|
| [pytorch/pytorch](https://github.com/pytorch/pytorch) | 328.7 | 251.8 | 351.1 | 274.9 |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124549
Approved by: https://github.com/ezyang
Fixes the GELU, LeakyReLU and Mish activation functions on non-contiguous tensors (for instance, when a transpose operation was applied to the tensors prior to the MPS operator), for both forward and backward passes.
I also extended the tests on the 3 activation functions to check: full-precision and half-precision, contiguous and non-contiguous, and several tensor dims: scalars, 1D, empty, 2D, > 3D.
I had issues with the Mish and GELU activations when asserting the gradients vs. CPU with sum() in some cases, so I reverted to the previous setup by setting a gradient parameter on .backward().
This PR also fixes an issue with LeakyReLU on empty tensors.
Fixes #98212, huggingface/transformers#22468, huggingface/transformers#19353
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123049
Approved by: https://github.com/kulinseth
In certain **rare** scenarios, inductor can generate a reduction kernel with really bad perf. E.g., if
- the reduction kernel contains a reduction node followed by a pointwise node
- and the pointwise node uses a transposed layout,
- the reduction node is an inner reduction,
- and rnumel <= 1024,
then inductor will generate a persistent reduction kernel and it causes really bad perf when doing tl.store for the pointwise node since we use a very skinny tile `(XBLOCK=1, RBLOCK=next_power_of_2(rnumel))` .
I've tried a few version of fix.
- The first version was: if any pointwise node in a reduction kernel uses a non-contiguous dependency, we use ReductionHint.DEFAULT. This caused an 8s compilation time increase for huggingface with no perf wins... The reason is ReductionHint.DEFAULT does more autotunings.
- Then I changed the code to be more specific. We change the hint from INNER to DEFAULT only if we are sure that the pointwise kernel would use a >1 stride for the lowest dimension. Kernels meeting this condition should mostly have really bad perf anyways.
The situation mentioned above is rare. But it's reported by internal users. I'll also run one more perf test.
Testing script: https://gist.github.com/shunting314/9d3389891fa43633b49b8b7564ad6d8b . Something equivalent is also added as a unit test.
For this specific test from user reports, we improve the mentioned reduction kernels perf by **4.14x** (451us -> 109us)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124131
Approved by: https://github.com/jansel
Summary: This is actually quite noisy and my logs are full of this soft assertion msg. Maybe making it log once?
Test Plan:
On AMD GPU side, I got a lot of those warnings:
```
W0415 01:40:45.109864 917160 collection.cpp:602] Warning: Memcpy ? (? -> ?) (function operator())
```
So just suppress the excessive logs
Reviewed By: aaronenyeshi, yoyoyocmu
Differential Revision: D55602788
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124469
Approved by: https://github.com/aaronenyeshi
Motivations:
- this is pretty redundant with test_aot_dispatch_dynamic.
- The user story for opcheck is that a user should use opcheck to see
if their operator was "registered correctly". If a user's custom op
only supports dynamic shapes, then it's a bit awkward for
one of the tests (e.g. `test_aot_dispatch_static`) to fail.
- We've already stopped running test_aot_dispatch_static in all of
our opcheck tests.
Test Plan:
- wait for CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124495
Approved by: https://github.com/williamwen42
ghstack dependencies: #124180, #124200, #124299, #124134, #124199, #124403, #124414
This PR adds a `set_reshard_after_backward` method to allow disabling resharding after backward. `reshard_after_backward=False` can be used with `reshard_after_forward=False` to implement "ZeRO-1", where there is only all-gather on the first microbatch forward and reduce-scatter on the last microbatch backward.
```
for microbatch_idx, microbatch in dataloader:
is_last_microbatch = microbatch_idx == num_microbatches - 1
model.set_requires_gradient_sync(is_last_microbatch)
model.set_reshard_after_backward(is_last_microbatch)
model.set_is_last_backward(is_last_microbatch)
microbatch_fwd_bwd(model, microbatch, microbatch_idx)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124319
Approved by: https://github.com/weifengpy
Summary:
With pre-dispatch export and ep.run_decompositions(), range constraints are updated by looking at ShapeEnv.var_to_range. However, the lower bounds on these may be incorrect - analysis on un-specialized symbols is done with a lower bound of 2, which mismatches user-specified bounds (which may be 0 or 1).
This updates `_get_updated_range_constraints()` to use the old range constraints if possible.
Test Plan: Existing pre-dispatch/dynamic shapes test case.
Differential Revision: D55899872
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123602
Approved by: https://github.com/tugsbayasgalan
Biggest movement is 4% HF inference, 9% TIMM inference. Note, this is max-autotune mode so we are more tolerant of compilation increases. We could improve compilation time by limiting:
```
# Take how many of the top triton kernels to benchmark epilogue
max_epilogue_benchmarked_choices = 3
```
There is a hf_Whisper failure which you can repro on main without this stack with `TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS=TRITON TORCHINDUCTOR_MAX_AUTOTUNE=1 python benchmarks/dynamo/torchbench.py --backend inductor --amp --accuracy --training --only hf_Whisper`. When you turn off epilogue fusion, it fixes the accuracy. I bisected the failure to an epilogue; however, when you compare the results of that epilogue with the corresponding separate kernels, the outputs are equivalent.
Inference:
<img width="1686" alt="image" src="https://github.com/pytorch/pytorch/assets/11477974/0b240080-cd33-4c08-89d3-583103b1fb0c">
Training:
<img width="1329" alt="Screenshot 2024-04-16 at 6 16 30 PM" src="https://github.com/pytorch/pytorch/assets/11477974/db0afcc9-7288-4c27-84ce-4fc1a5690788">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124031
Approved by: https://github.com/Chillee, https://github.com/shunting314
ghstack dependencies: #124030, #122642, #123229, #122825
gcc is more stringent than clang when equivalently sized NEON registers are cast to each other. In particular, at one point a `uint16x4_t` was cast to an `int16x4_t`, which gcc does not allow. Added `vreinterpret_s16_u16` (which is a no-op) to solve this and tested in https://godbolt.org/z/sYb4ThM6M
Test plan: Build aarch64 wheels
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124511
Approved by: https://github.com/mikekgfb
Two changes:
- Make the flag for multi-template buffers independent from benchmark fusion. While benchmark fusion can be useful, the compilation time/performance trade-offs are different from those for just templates, which we'd like to enable by default.
- Don't do MultiTemplateBuffers/benchmark fusion for templates which have custom input gen fns (which currently only exist internally). Threading the custom input gen fns to benchmark fusion is NYI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122825
Approved by: https://github.com/shunting314
ghstack dependencies: #124030, #122642, #123229
Summary: It seems super confusing that if we set DISABLE_ADDMM_HIP_LT + PYTORCH_TUNABLEOP_ENABLED, the former takes priority. This is because the former goes through gemm_and_bias while tunable op is integrated with the gemm path. Until we can integrate tunable op with gemm_and_bias, we'll just let tunable op take priority.
Test Plan: Run a simple linear program and verified.
Differential Revision: D56183954
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124161
Approved by: https://github.com/jeffdaily, https://github.com/nmacchioni
Summary:
This is not great, but our ATen-cpu is not completely GPU agnostic. Previously we worked on D54453492 (https://github.com/pytorch/pytorch/pull/121082) and D54528255, but there are a few things we haven't resolved, and it's exploding here. So we'll continue to fix them until all are gone.
This ROCm block is for 4.3, which is very old; I don't think it should be supported any more. So let's just kill this macro.
Test Plan: CI
Differential Revision: D56172660
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124158
Approved by: https://github.com/jeffdaily, https://github.com/nmacchioni
old: `register_autograd(setup_context, backward, /)`
new: `register_autograd(backward, /, *, setup_context=None)`
Motivations:
- We introduce these APIs as "give us a backward and use setup_context
to save things for backward".
- setup_context isn't always necessary.
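For illustration, a minimal sketch of the new calling convention using the new-style custom op API (the op and its body here are hypothetical, not from this PR):
```python
import torch
from torch.library import custom_op

@custom_op("mylib::my_sin", mutates_args=())
def my_sin(x: torch.Tensor) -> torch.Tensor:
    return torch.sin(x)

def setup_context(ctx, inputs, output):
    # Optional: only needed if backward must save something.
    (x,) = inputs
    ctx.save_for_backward(x)

def backward(ctx, grad):
    (x,) = ctx.saved_tensors
    return grad * x.cos()

# new: backward is positional; setup_context is keyword-only and optional
my_sin.register_autograd(backward, setup_context=setup_context)
```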
Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124403
Approved by: https://github.com/albanD
ghstack dependencies: #124180, #124200, #124299, #124134, #124199
Two small fixes:
- preserve rng around compile_fx_inner
- Now that we precompile in the background while lowering multiple templates in parallel, we can no longer allocate inputs at the beginning of the function, because we would have multiple sets of inputs allocated at the same time. Instead, allocate them when needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123229
Approved by: https://github.com/shunting314
ghstack dependencies: #124030, #122642
Two changes:
- in epilogue benchmark fusion, only take top 6 choices. There were basically no choices taken after this in HF.
- Share a single precompilation function among matmuls with same key.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122642
Approved by: https://github.com/shunting314
ghstack dependencies: #124030
We override the `__call__` method and register fake, functional, and proxy default dispatch mode implementations in its python_key_mode_table.
The idea is:
1. when the inputs contain a FakeScriptObject, we dispatch through the _get_dispatch mechanism. We implement the dispatch mode keys automatically in the operator's constructor.
2. when the inputs are not fakified, we dispatch through the original C++ dispatcher.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123367
Approved by: https://github.com/zou3519
Motivation:
- The API is used for registering an implementation for a specific
device type.
- "impl" is ambiguous and can be confused with Library.impl.
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124200
Approved by: https://github.com/albanD
ghstack dependencies: #124180
This PR adds in fast semi-structured sparsification kernels to PyTorch.
These kernels allow for accelerated semi-structured sparsification
kernels in PyTorch.
The kernels have been added as aten native functions
In particular, three new functions have been added:
* `torch._sparse_semi_structured_tile`
This function will return the packed representation and metadata for
both X and X', as well as the thread masks. Note that this applies 2:4
sparsity in a 4x4 tile instead of a 1x4 strip as usual.
* `torch._sparse_semi_structured_apply`
This function takes in an input tensor and thread masks from the above
function and returns a packed representation and metadata from applying
thread masks to the input tensor.
* `torch._sparse_semi_structured_apply_dense`
This function does the same thing as above but instead of returning the
tensor in the sparse representation it returns it in the dense
representation
The subclasses have also been updated to add a new
`prune_dense_static_sort`
classmethod to create sparse tensors with this format. I've added some
additional documentation on how to calculate the compressed tensors
needed to create a SparseSemiStructuredTensor oneself.
To this end, there are two new helper functions added:
`sparse_semi_structured_tile`
`compute_compressed_swizzled_bitmask`
Differential Revision: [D56190801](https://our.internmc.facebook.com/intern/diff/D56190801)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122350
Approved by: https://github.com/cpuhrsch
Summary:
```
ncclGroupStart()
ncclCommInit(..)
ncclGroupEnd()
```
The above pattern is only needed when we have a *single thread* managing multiple GPUs.
In our case, we always have 1 process managing 1 GPU, so we don't need the group operation.
Test Plan: CI
Differential Revision: D56274975
Co-authored-by: Cen Zhao <cenzhao@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124416
Approved by: https://github.com/shuqiangzhang
On par with `CUDA` implementation.
For `autocast` logic, same with `CUDA` + `Fused Adam`:
- check inf in `gradscalar.step`
- In fused kernel, if there is `inf`, do nothing. If not, unscale the grad ( also write back) and update the param.
**TestPlan**:
```
# extend CUDA only test for CPU fused adagrad
python test_optim.py -k test_fused_matches_forloop
python test_optim.py -k test_fused_large_tensor
python test_torch.py -k test_grad_scaling_autocast_fused
# extend fused test
python test_torch.py -k test_params_invalidated_with_grads_invalidated_between_unscale_and_step
python test_optim.py -k test_can_load_older_state_dict
# newly added test (follow 6b1f13ea2f/test/test_cuda.py (L1108))
python test_optim.py -k test_grad_scaling_autocast_fused_optimizers
```
**Benchmark**:
**5.1x** on 56 core SPR
**Parameter-size=1M**
**Nparams=10**
[test script](https://gist.github.com/zhuhaozhe/ef9a290ad3f8f4067b3373a3bdaa33e7)
```
numactl -C 0-55 -m 0 python bench_adam.py
non-fused 6.0174267292022705 s
fused 1.1787631511688232 s
```
**Note: Fused kernel accuracy**
The accuracy failure in CI shows a difference a little higher than the default tolerance:
```
2024-04-02T06:09:16.2213887Z Mismatched elements: 21 / 64 (32.8%)
2024-04-02T06:09:16.2214339Z Greatest absolute difference: 1.5735626220703125e-05 at index (6, 6) (up to 1e-05 allowed)
2024-04-02T06:09:16.2214813Z Greatest relative difference: 1.0073336852656212e-05 at index (4, 1) (up to 1.3e-06 allowed)
```
I have debugged it step by step and, unfortunately, we may not be able to make the `fused kernel` exactly match the `non fused` one due to compiler optimizations.
For example, in the non-fused impl:
```
exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj(), value=1 - beta2)
```
and in the fused impl:
```
exp_avg_sq_ptr[d] = scalar_t(beta2) * exp_avg_sq_ptr[d];
// std::cout << "exp_avg_sq " << exp_avg_sq_ptr[d] << std::endl;
exp_avg_sq_ptr[d] = exp_avg_sq_ptr[d] +
scalar_t(exp_avg_sq_grad_coefficient) * grad_val * grad_val;
```
If I keep the `std::cout`, I get exactly the same results in the UT:
```
===============param
0.6796758770942688
0.6796758770942688
```
But when I comment it out, there is a difference:
```
===============param
0.6796758770942688
0.6796759366989136
```
So I will make the tolerance a little higher than the default one.
Co-authored-by: Jane Xu <janeyx@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123074
Approved by: https://github.com/jgong5, https://github.com/janeyx99
# PR
This PR supports mutating inputs in cudagraph trees, if these inputs are outputs from previous cudagraph. Please check #121861 for more details.
# Note on Optimistic Mutation Check
To determine whether to apply cudagraph, we need to check input mutations, which fall into four categories: a) no mutation, b) mutation on parameters/buffers, c) mutation on cudagraph-recorded tensors, d) mutation on non-cudagraph-recorded tensors. We can apply cudagraph for types a, b, c but cannot for type d. The input mutation type depends on the function, the current_node, and the inputs.
Since `check_for_mutation` is slow, there is a trade-off on making type c or d faster.
- To make type d) faster, we want to `check_for_mutation` and call eager function early. However, this adds unnecessary overhead to type a, b, c due to the extra check.
- To make type c) faster, we want to skip `check_for_mutation` at the beginning and only `check_for_mutation` before `record_function` for a new function. This removes the overhead of `check_for_mutation` for type a, b, c. However, this adds extra overhead to type d due to `check_invariants` for all children nodes.
Instead, we design an optimistic mutation check. The assumption is that, given a function and a node, the input mutation type usually remains the same across inputs. So, if we have ever detected a function on a node as type d, we will never detect it as type c. The detailed design is:
- [Slow Path] On the first invocation of a function on a node, we run `check_for_mutation` once and cache the input mutation type as `non_cudagraph_managed_mutation[node_id][func_id]`.
- [Fast Path] On subsequent invocations of a function on a node, we skip `check_for_mutation`. If `non_cudagraph_managed_mutation[node_id][func_id]` is true, we directly call the eager function. Otherwise, we `check_invariants` and call the cudagraph function.
- [Slow Path] Before `record_function`, we run `check_for_mutation` again.
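As a rough, self-contained sketch of this caching scheme (the helper callables here are stand-ins, not the actual cudagraph-trees implementation):
```python
from collections import defaultdict

# node_id -> {func_id: True if mutation on non-cudagraph-recorded tensors (type d)}
non_cudagraph_managed_mutation = defaultdict(dict)

def maybe_run(func_id, node_id, inputs, *, check_for_mutation, check_invariants,
              run_eager, run_cudagraph):
    cache = non_cudagraph_managed_mutation[node_id]
    if func_id not in cache:
        # Slow path: first invocation of this function on this node.
        cache[func_id] = check_for_mutation(inputs)
    if cache[func_id]:
        # Type d: fall back to the eager (or inductor-optimized) function.
        return run_eager(inputs)
    # Fast path: skip check_for_mutation; invariants still guard correctness.
    check_invariants(inputs)
    return run_cudagraph(inputs)
```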
**Q1: Would there be overhead for type a,b,c,d?**
A: No. We only check input mutation types for the first invocation of a function on a node.
**Q2: If a function happens to be type c during the first invocation on a node, could we detect it as type d in the future?**
A: Yes. This is done by `check_invariants` and guarantees the correctness.
**Q3: If a function happens to be type d during the first invocation on a node, could it still be recognized as type c in the future?**
A: No. But this should happen rarely according to our assumption. In the rare case that it happens, there would not be any correctness issues and the performance is the same as the eager (or inductor optimized) function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123231
Approved by: https://github.com/eellison
Closes #114966
Frozen field assignment in `__init__` in Python 3.8-3.9:
f5bd65ed37/Lib/dataclasses.py (L402-L411)
```python
import builtins
BUILTINS = builtins
def _field_assign(frozen, name, value, self_name):
# If we're a frozen class, then assign to our fields in __init__
# via object.__setattr__. Otherwise, just use a simple
# assignment.
#
# self_name is what "self" is called in this function: don't
# hard-code "self", since that might be a field name.
if frozen:
return f'BUILTINS.object.__setattr__({self_name},{name!r},{value})'
return f'{self_name}.{name}={value}'
```
Frozen field assignment in `__init__` in Python 3.10+:
812245ecce/Lib/dataclasses.py (L436-L445)
```python
__dataclass_builtins_object__ = object
def _field_assign(frozen, name, value, self_name):
# If we're a frozen class, then assign to our fields in __init__
# via object.__setattr__. Otherwise, just use a simple
# assignment.
#
# self_name is what "self" is called in this function: don't
# hard-code "self", since that might be a field name.
if frozen:
return f'__dataclass_builtins_object__.__setattr__({self_name},{name!r},{value})'
return f'{self_name}.{name}={value}'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124393
Approved by: https://github.com/jansel
Fixes a bug where a reference to `_ProcessGroupWrapper` is used without first checking whether gloo is available. This fails on pytorch builds that do not include gloo because `_ProcessGroupWrapper` is only pybinded when building with `USE_GLOO=1`. Therefore, creation of a new process group fails with a `NameError` when only NCCL is available as the backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124233
Approved by: https://github.com/rohan-varma, https://github.com/d4l3k
This PR unifies the CUDA, XPU and PrivateUse1 in the torch profiler. Now CUDA, XPU and PrivateUse1 can together use string object `use_device` to distinguish each other and share one device path for calculating kineto time durations and memory statistics for post processing.
#suppress-api-compatibility-check
Co-authored-by: Aaron Enye Shi <enye.shi@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123247
Approved by: https://github.com/aaronenyeshi, https://github.com/gujinghui
Also partially fixes #122109
This PR:
- We add a C++ flag (only_lift_cpu_tensors) to toggle the
torch.tensor(1, device='cuda') ctor strategy.
When false (default), it does the current PyTorch behavior
of unconditionally constructing a concrete CUDA tensor then calling
lift_fresh on it. When true, we instead construct a concrete CPU
tensor, call lift_fresh, and then call Tensor.to(device) (under any ambient
modes).
- FakeTensorMode flips this flag depending on if CUDA is available or
not. We don't unconditionally set the flag to True because that is
likely BC-breaking.
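As a rough illustration of the two strategies (written with public ops; the flag itself lives in C++ and is not shown here, and running this requires a CUDA build/device):
```python
import torch

# only_lift_cpu_tensors = False (default): build a concrete CUDA tensor, then lift it.
t1 = torch.ops.aten.lift_fresh(torch.tensor(1, device="cuda"))

# only_lift_cpu_tensors = True: build on CPU, lift, then move (under any ambient modes).
t2 = torch.ops.aten.lift_fresh(torch.tensor(1)).to("cuda")
```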
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124413
Approved by: https://github.com/eellison
This adds a templated version of the ring attention forwards function as well as tests it with memory efficient attention. This doesn't add support for memory efficient attention in DTensor. That will be added in a follow up PR.
This templating is also a POC of how to support other attention ops such as Jagged/nested tensor and as well how to implement striped attention in a scalable way.
Misc changes:
* Fixes all_to_all_single autograd implementation with CUDA + adds NCCL test
* Adds compile support to the ring attention implementations (required some tweaks to process groups)
Test plan:
```
pytest test/distributed/_tensor/test_attention.py
pytest test/distributed/test_functional_api.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124215
Approved by: https://github.com/wanchaol
Summary: In AOTInductor generated CPU model code, there can be direct references to some aten/c10 utility functions and data structures, e.g. at::vec and c10::Half. These are performance critical and thus it doesn't make sense to create a C shim for them. Instead, we make sure they are implemented in a header-only way, and use this set of tests to guard future changes.
There are more header files to be updated, but we will do it in other followup PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123848
Approved by: https://github.com/jansel
ghstack dependencies: #123847
Summary: AOTInductor generated code for CPU models may have direct reference to these c10-implemented data types, see _inductor/codegen/cpp_prefix.h. To make sure the AOTI generated code is ABI backward compatible, we need to change those headers to a header-only implementation. The next PR in this stack will add tests to use those data types without linking against libtorch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123847
Approved by: https://github.com/jansel
Summary:
This env was introduced to safely roll out the behavior change in destroy
process group (e.g., calling ncclCommsAbort). Now that this behavior change
has already been rolled out, we no longer need this env, so we should clean
it up to keep our code cleaner.
Test Plan:
Modified/existing ut pass
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124334
Approved by: https://github.com/wconstab
Previously, we didn't expand the shape of the example_value of map to match the inputs (edit: the first mapped dimension). This PR fixes this bug. To make this easier, we change _call_function_and_unflatten_output to accept example_values directly instead of retrieving them from the variable trackers.
Also remove a redundant call function node in strict_mode higher order op in dynamo.
Test Plan:
existing tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124203
Approved by: https://github.com/ezyang, https://github.com/zou3519
#121313 changed precompiled patterns so they are more integrated with the pattern matching code. This resulted in a list of "known" patterns (with their example data) being stored globally. Unfortunately, since small FakeTensors store the original tensor as a constant, this meant that we leaked cuda tensors in the example data.
Fix this by clearing out the constant storage for the example data that we keep around.
Fixes #124081
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124345
Approved by: https://github.com/xuzhao9
MTIA device has its own Module in PyTorch now.
torch.mtia has the following APIs, similar to other backends. Lazy init is also supported.
```
__all__ = [
"init",
"is_available",
"synchronize",
"device_count",
"current_device",
"current_stream",
"default_stream",
"set_stream",
"stream",
"device",
]
```
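A hedged usage sketch of these APIs (requires an MTIA build and device; mirrors the other backend modules):
```python
import torch

if torch.mtia.is_available():
    torch.mtia.init()
    print(torch.mtia.device_count(), torch.mtia.current_device())
    torch.mtia.synchronize()
```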
------------
For device management, we expand AcceleratorHooksInterface to support generic device management, and it can be used in both C++ and Python.
```
def _accelerator_hooks_device_count() -> _int: ...
def _accelerator_hooks_set_current_device(device_index: _int) -> None: ...
def _accelerator_hooks_get_current_device() -> _int : ...
def _accelerator_hooks_exchange_device(device_index: _int) -> _int : ...
def _accelerator_hooks_maybe_exchange_device(device_index: _int) -> _int : ...
```
---------
Adding get_device_module API to retrieve device modules for different device types.
```
def get_device_module(device: Optional[Union[torch.device, str]] = None)
```
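A hedged usage sketch (assuming a CUDA build; anything beyond the signature above is an assumption):
```python
import torch

mod = torch.get_device_module("cuda")  # returns the backend module, e.g. torch.cuda
print(mod.device_count())
mod.synchronize()
```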
---------
@exported-using-ghexport
Differential Revision: [D52923602](https://our.internmc.facebook.com/intern/diff/D52923602/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123612
Approved by: https://github.com/albanD
ghstack dependencies: #123611
This diff intends to build device generic torch.Stream and torch.Event for newly added accelerators in PyTorch.
------------
**torch.Stream APIs**
```
# Defined in torch/csrc/Stream.cpp
class Stream(_StreamBase):
stream_id: _int # Stream id
device_index: _int
device_type: _int
device: _device # The device of the stream
@overload
def __new__(self, device: Optional[DeviceLikeType] = None, priority: _int = 0) -> Stream: ...
@overload
def __new__(self, stream_id: _int, device_index: _int, device_type: _int, priority: _int = 0) -> Stream: ...
def query(self) -> _bool: ...
def synchronize(self) -> None: ...
def wait_event(self, event: Event) -> None: ...
def wait_stream(self, other: Stream) -> None: ...
def record_event(self, event: Optional[Event] = None) -> Event: ...
def query(self) -> None: ...
def synchronize(self) -> None: ...
def __hash__(self) -> _int: ...
def __repr__(self) -> str: ...
def __eq__(self, other: object) -> _bool: ...
```
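A hedged usage sketch based on the stub above (assuming a CUDA build; any supported accelerator backend should work the same way):
```python
import torch

s = torch.Stream(device="cuda")
e = s.record_event()   # returns a torch.Event
s.synchronize()
assert e.query()
```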
------------------
**torch.Event APIs**:
- IPC-related APIs are not implemented, since many device backends don't support them, but we leave the interfaces there for future adaptation of torch.cuda.Stream.
- currently only enable_timing is supported, since it is the most commonly used flag in other device backends. We would have to refactor the event flag system in PyTorch to support fancier flags.
- elapsedTime API is added to c10::Event
```
# Defined in torch/csrc/Event.cpp
class Event(_EventBase):
device: _device # The device of the Event
event_id: _int # The raw event created by device backend
def __new__(self,
device: Optional[DeviceLikeType] = None,
enable_timing: _bool = False,
blocking: _bool = False,
interprocess: _bool = False) -> Event: ...
@classmethod
def from_ipc_handle(self, device: DeviceLikeType, ipc_handle: bytes) -> Event: ...
def record(self, stream: Optional[Stream] = None) -> None: ...
def wait(self, stream: Optional[Stream] = None) -> None: ...
def query(self) -> _bool: ...
def elapsed_time(self, other: Event) -> _float: ...
def synchronize(self) -> None: ...
def ipc_handle(self) -> bytes: ...
def __repr__(self) -> str: ...
```
-----------
c10::Event provides new APIs
- calculate **elapsedTime**.
- Get raw event id
- Synchronize event.
```
double elapsedTime(const Event& event) const {
return impl_.elapsedTime(event.impl_);
}
void* eventId() const {
return impl_.eventId();
}
void synchronize() const {
return impl_.synchronize();
}
```
----------
TODO: need to find a good way to test them in PyTorch with API mocks.
Differential Revision: [D55351839](https://our.internmc.facebook.com/intern/diff/D55351839/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123611
Approved by: https://github.com/albanD
Differential Revision: D56200666
Previously, when we hit the Functionalize kernel for lift_fresh_copy, we directly dispatched self.clone() to proxy dispatch. As a result, we end up receiving a functional tensor at proxy dispatch. As a workaround, I unwrap self manually. Not sure why it works OK in aot-dispatch, though.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124198
Approved by: https://github.com/bdhirsh
By creating constants using the input tensor's dtype.
One line reproducer:
```
python -c "import torch; x=torch.arange(3, dtype=torch.float16,device='mps');print(torch.nn.functional.binary_cross_entropy(x, x))"
```
Before the change
```
loc("mps_subtract"("(mpsFileLoc): /AppleInternal/Library/BuildRoots/ce725a5f-c761-11ee-a4ec-b6ef2fd8d87b/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm":233:0)): error: input types 'tensor<f32>' and 'tensor<3xf16>' are not broadcast compatible
LLVM ERROR: Failed to infer result type(s).
```
After
```
tensor(-33.7812, device='mps:0', dtype=torch.float16)
```
Fixes https://github.com/pytorch/pytorch/issues/124252
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124258
Approved by: https://github.com/kulinseth
The `recurse` argument was not being respected for `set_requires_gradient_sync`. This PR fixes that.
The previous unit test did not have nested FSDP modules with managed parameters, so `recurse=False` was not being exercised. We augment the unit test to try disabling gradient sync only for the root module and not its children.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124318
Approved by: https://github.com/weifengpy
ghstack dependencies: #120952, #124293
A kernel has "dispatcher convention" if there is an additional keyset
arg at the beginning of the argument list. This PR:
- adds a way to register kernels with dispatcher_convention using
Library.impl (pass dispatcher_convention = True)
- adds OpOverload.redispatch
We use both of the above in the new custom ops API: we register the
autograd kernel in dispatcher convention so that we can actually call
redispatch like how pytorch built-in ops do it.
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124089
Approved by: https://github.com/albanD
ghstack dependencies: #123937, #124064, #124065, #124066, #124071
We allow it to accept:
- a string with the op name
- an opoverload
- a new-style custom op
If any of these are referring to a new-style custom op (created with the
custom_op decorator), then we dispatch to CustomOpDef.register_fake.
Otherwise, we do what we previously did.
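A minimal sketch of the string-name form (the operator "mylib::foo" is hypothetical and assumed to be defined elsewhere):
```python
import torch

@torch.library.register_fake("mylib::foo")
def _(x):
    # The fake impl only needs to describe output metadata.
    return torch.empty_like(x)
```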
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124066
Approved by: https://github.com/albanD
ghstack dependencies: #123937, #124064, #124065
Summary:
We explicitly set the cublas workspace even though CUDA 12.2+ fixed the issue where memory usage increased during graph capture. Original issue: https://github.com/pytorch/pytorch/pull/83461
This is because in CUDA 12.2+, the use of cudaMallocAsync in cublas allocates memory dynamically (even if the allocations are cheap) outside PyTorch's CUDA caching allocator. It's possible that the CCA has used up all the memory and cublas's cudaMallocAsync will return OOM.
Test Plan: CI
Differential Revision: D56226746
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124250
Approved by: https://github.com/houseroad, https://github.com/eqy
Fixes https://github.com/pytorch/pytorch/issues/119607 for 3.11+.
In 3.11+, `_PyFrame_FastToLocalsWithError` could implicitly run `COPY_FREE_VARS` on the original frame, leading to double increfs since the dynamo shadow frame can rerun `COPY_FREE_VARS`. So the solution is to skip the first `COPY_FREE_VARS` instruction in the shadow frame if it was already executed in the original frame.
Also move the location for clearing the original frame in 3.12 to handle error cases more thoroughly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124238
Approved by: https://github.com/jansel
Measuring peak memory on the first run can capture cases where compiled artifacts leak into runtime, but it also introduces a lot of noise from cudnn/triton autotuning which generally uses as much memory as it can. Setting this flag as a default will need some discussion, so I will only add it to unblock compiled backward benchmarking (where all autotuning memory use is exposed)
```
e.g. resnet50
# without --warm-peak-memory
memory: eager: 1.95 GB, dynamo: 6.68 GB, ratio: 0.29
# with --warm-peak-memory
memory: eager: 1.96 GB, dynamo: 2.06 GB, ratio: 0.95
```

This issue may also affect large models. Here's an example case of cudnn_convolution_backward autotuning allocating 30GB to tune a model otherwise using 5GB memory:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124326
Approved by: https://github.com/jansel
ghstack dependencies: #119411
# Motivation
According to [[RFC] Intel GPU Upstreaming](https://github.com/pytorch/pytorch/issues/114723), we would like to upstream amp autocast policy to facilitate the functionality and accuracy of `torch.compile` on e2e benchmarks.
# Solution
The first PR aims to make macro `KERNEL` to be generic. It accepts two types of inputs, like `(DISPATCH, OP, POLICY)` and `(DISPATCH, OP, OVERLOAD, POLICY)`.
The second PR intends to refactor CUDA's autocast policy to make it can be shared with `XPU` backend.
The final PR would like to support XPU autocast policy which shares the same recipe with `CUDA` backend.
# Additional Context
Another motivation is we would like to unify autocast API and provide the generic APIs, like:
- `torch.get_autocast_dtype(device_type)`
- `torch.set_autocast_dtype(device_type)`
- `torch.is_autocast_enabled(device_type)`
- `torch.set_autocast_enabled(device_type)`
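A hedged usage sketch of the proposed generic form (API names are taken from the list above; the value arguments and exact availability are assumptions):
```python
import torch

device_type = "xpu"  # or "cuda", "cpu", ...
torch.set_autocast_enabled(device_type, True)
torch.set_autocast_dtype(device_type, torch.float16)
assert torch.is_autocast_enabled(device_type)
assert torch.get_autocast_dtype(device_type) == torch.float16
```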
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124050
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD
Removed a bunch of skips. I also updated test_forloop_goes_right_direction to *not* use the closure when dynamo is tracing. The reason for this is that testing the disabled optimizer doesn't actually test anything.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123322
Approved by: https://github.com/janeyx99
ghstack dependencies: #123498
Fix part of https://github.com/pytorch/pytorch/issues/123603.
Example traceback on branch https://github.com/pytorch/vision/compare/main...wwen/custom_ops_test:
```
running my_custom_op!
Traceback (most recent call last):
File "/data/users/williamwen/torchvision/playground.py", line 13, in <module>
print(opt_fn1(torch.randn(3, 3)))
File "/data/users/williamwen/pytorch2/torch/_dynamo/eval_frame.py", line 387, in _fn
return fn(*args, **kwargs)
File "/data/users/williamwen/pytorch2/torch/_dynamo/convert_frame.py", line 977, in catch_errors
return callback(frame, cache_entry, hooks, frame_state, skip=1)
File "/data/users/williamwen/pytorch2/torch/_dynamo/convert_frame.py", line 818, in _convert_frame
result = inner_convert(
File "/data/users/williamwen/pytorch2/torch/_dynamo/convert_frame.py", line 411, in _convert_frame_assert
return _compile(
File "/data/users/williamwen/pytorch2/torch/_utils_internal.py", line 70, in wrapper_function
return function(*args, **kwargs)
File "/data/users/williamwen/py310-env/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/data/users/williamwen/pytorch2/torch/_dynamo/convert_frame.py", line 700, in _compile
guarded_code = compile_inner(code, one_graph, hooks, transform)
File "/data/users/williamwen/pytorch2/torch/_dynamo/utils.py", line 266, in time_wrapper
r = func(*args, **kwargs)
File "/data/users/williamwen/pytorch2/torch/_dynamo/convert_frame.py", line 568, in compile_inner
out_code = transform_code_object(code, transform)
File "/data/users/williamwen/pytorch2/torch/_dynamo/bytecode_transformation.py", line 1116, in transform_code_object
transformations(instructions, code_options)
File "/data/users/williamwen/pytorch2/torch/_dynamo/convert_frame.py", line 173, in _fn
return fn(*args, **kwargs)
File "/data/users/williamwen/pytorch2/torch/_dynamo/convert_frame.py", line 515, in transform
tracer.run()
File "/data/users/williamwen/pytorch2/torch/_dynamo/symbolic_convert.py", line 2237, in run
super().run()
File "/data/users/williamwen/pytorch2/torch/_dynamo/symbolic_convert.py", line 875, in run
while self.step():
File "/data/users/williamwen/pytorch2/torch/_dynamo/symbolic_convert.py", line 790, in step
self.dispatch_table[inst.opcode](self, inst)
File "/data/users/williamwen/pytorch2/torch/_dynamo/symbolic_convert.py", line 492, in wrapper
return inner_fn(self, inst)
File "/data/users/williamwen/pytorch2/torch/_dynamo/symbolic_convert.py", line 1260, in CALL_FUNCTION
self.call_function(fn, args, {})
File "/data/users/williamwen/pytorch2/torch/_dynamo/symbolic_convert.py", line 730, in call_function
self.push(fn.call_function(self, args, kwargs))
File "/data/users/williamwen/pytorch2/torch/_dynamo/variables/torch.py", line 747, in call_function
tensor_variable = wrap_fx_proxy(
File "/data/users/williamwen/pytorch2/torch/_dynamo/variables/builder.py", line 1425, in wrap_fx_proxy
return wrap_fx_proxy_cls(target_cls=TensorVariable, **kwargs)
File "/data/users/williamwen/pytorch2/torch/_dynamo/variables/builder.py", line 1510, in wrap_fx_proxy_cls
example_value = get_fake_value(proxy.node, tx, allow_non_graph_fake=True)
File "/data/users/williamwen/pytorch2/torch/_dynamo/utils.py", line 1804, in get_fake_value
raise TorchRuntimeError(str(e)).with_traceback(e.__traceback__) from None
File "/data/users/williamwen/pytorch2/torch/_dynamo/utils.py", line 1736, in get_fake_value
ret_val = wrap_fake_exception(
File "/data/users/williamwen/pytorch2/torch/_dynamo/utils.py", line 1251, in wrap_fake_exception
return fn()
File "/data/users/williamwen/pytorch2/torch/_dynamo/utils.py", line 1737, in <lambda>
lambda: run_node(tx.output, node, args, kwargs, nnmodule)
File "/data/users/williamwen/pytorch2/torch/_dynamo/utils.py", line 1872, in run_node
raise RuntimeError(make_error_message(e)).with_traceback(
File "/data/users/williamwen/pytorch2/torch/_dynamo/utils.py", line 1854, in run_node
return node.target(*args, **kwargs)
File "/data/users/williamwen/pytorch2/torch/_ops.py", line 870, in __call__
return self_._op(*args, **(kwargs or {}))
torch._dynamo.exc.TorchRuntimeError: Failed running call_function torchvision.my_custom_op1(*(FakeTensor(..., size=(3, 3)),), **{}):
The tensor has a non-zero number of elements, but its data is not allocated yet. Caffe2 uses a lazy allocation, so you will need to call mutable_data() or raw_mutable_data() to actually allocate memory.
If you're using torch.compile/export/fx, it is likely that we are erroneously tracing into a custom kernel. To fix this, please wrap the custom kernel into an opaque custom op. Please see the following for details: https://docs.google.com/document/d/1W--T6wz8IY8fOI0Vm8BF44PdBgs283QvpelJZWieQWQ
from user code:
File "/data/users/williamwen/torchvision/playground.py", line 5, in fn1
return torch.ops.torchvision.my_custom_op1(x)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124240
Approved by: https://github.com/zou3519
Motivations:
- This makes things more consistent: using a Library object, you should
be able to do all of the registration APIs that tie registrations to
the lifetime of the Library.
- I need this for the next PR up in the stack, where we will have
torch.library.register_fake support both CustomOpDef (from the new
custom ops API) and other custom ops.
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124065
Approved by: https://github.com/albanD
ghstack dependencies: #123937, #124064
Previously, if someone used `register_fake` to add a fake impl for an
operator defined in C++, we would require them to add a
`m.set_python_module(<module>)` call to C++. This was to avoid
situations where a user imported the C++ operator without importing the
fake impl.
This "breaks" open registration: there's no way to add a fake impl
outside of a repository that defines an operator, so we want to turn
this behavior off by default in open source.
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124064
Approved by: https://github.com/albanD
ghstack dependencies: #123937
# Motivation
This PR is a part of RFC #114848, and it is a successor PR of #116249 and #116019. This PR would depend on oneDNN compilation in #116249. Some runtime support is needed in #116019.
Aten operators like `addmm`, `baddmm` are defined in `Blas.cpp` in `aten/src/ATen/native/mkldnn/xpu/`.
Alongside these files providing the core functionality, `BlasImpl.h`, `Utils.h` and other files provide basic utilities for them. For instance, `Utils.h` provides common memory descriptor query utils for `Matmul.h`, and these utility functions will also be used in other primitives, like `convolution`. `BlasImpl.h` is a header file that provides helpers for handling shape info processing in matmul-related operators. It helps not only basic GEMM operators like `addmm`, `baddmm` but also fusion operators used in `torch.compile`, like `linear_pointwise` in #117824.
In the next stage, we will continue to complete the oneDNN support by enabling the `matmul fusion` and `convolution` related code.
Co-authored-by: xiaolil1 <xiaoli.liu@intel.com>
Co-authored-by: lei,zhenyuan <zhenyuan.lei@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117202
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: #117098, #117112
# Motivation
As proposed in https://github.com/pytorch/pytorch/issues/114848 and https://github.com/pytorch/pytorch/issues/114723, oneDNN library is an important component for Intel GPU software ecosystem.
The current PR is based on #117098, where the oneDNN library for Intel GPU should be ready. This PR is the integration code from aten to oneDNN. GEMM integration code is the core part of this PR. Alongside GEMM, more basic support like runtime (device, stream) and primitive attrs is also included.
We put the oneDNN integration code in the directory `aten/src/ATen/native/mkldnn/xpu/detail`. We add a namespace `at::native::xpu::onednn` for the oneDNN integration.
The code in this PR will be used in following PRs, where aten operators will call the functions in this integration code. We separate the PRs because the oneDNN integration is logically separable from the aten operator implementation. This also eases the burden of reviewing by avoiding too much code in a single PR.
Co-authored-by: xiaolil1 <xiaoli.liu@intel.com>
Co-authored-by: lei,zhenyuan <zhenyuan.lei@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117112
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/albanD
TF32 causes issues with the tolerances here; we might also consider migrating some of the `with_tf32_off` tests in this file to `tf32_on_and_off` in case it would be useful to get signal for TF32.
CC @malfet @atalman
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124104
Approved by: https://github.com/zou3519
By unrolling the middle loop by 16 elements and using NEON to decode packed int4 to float32.
Unrolling the entire `n` loop actually makes it a tad slower, probably because ARM has a smaller register file than x86.
Before/after performance running stories110M on M2Pro
| eager (before) | eager (after) | compile(before) | compile (after) |
| ---- | --- | -- | -- |
| 28 | 57 | 31 | 104 |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124257
Approved by: https://github.com/mikekgfb
A unit test within test_cutlass_backend.py can fail with CUDA illegal memory accesses due to the fact that some CUTLASS Kernels contain bugs.
By using autotuning in subprocesses, such a CUDA illegal memory access simply leads to the buggy CUTLASS kernels being filtered out, instead of bringing down the entire process.
Test Plan:
This is a change to a unit test. It's recommended to use autotune_in_subproc when using the Cutlass backend anyway.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124106
Approved by: https://github.com/eellison
This PR:
- adds a new torch.library.register_fake and deprecates
torch.library.impl_abstract. The motivation is that we have a lot of
confusion around the naming so we are going to align the naming with
the actual subsystem (FakeTensor).
- renames `m.impl_abstract_pystub("fbgemm_gpu.sparse_ops")` to
`m.has_python_registration("fbgemm_gpu.sparse_ops")`. No deprecation
here yet; I need to test how this works with static initialization.
- Renames a bunch of internals to match (e.g. abstractimplpystub ->
pystub)
I'm scared to rename the Python-side internal APIs (e.g.
torch._library.abstract_impl) because of torch.package concerns. I'll do
that in its own isolated PR next just in case it causes problems.
DEPRECATION NOTE: torch.library.impl_abstract was renamed to to
torch.library.register_fake. Please use register_fake. We'll delete
impl_abstract in a future version of PyTorch.
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123937
Approved by: https://github.com/albanD
We can't get information about `ami-id`, `instance-id`, `instance-type` for the ARC runners:
```
2024-04-16T11:10:17.0098276Z curl: (22) The requested URL returned error: 401
2024-04-16T11:10:17.0110775Z ami-id:
2024-04-16T11:10:17.0159131Z curl: (22) The requested URL returned error: 401
2024-04-16T11:10:17.0167378Z instance-id:
2024-04-16T11:10:17.0219464Z curl: (22) The requested URL returned error: 401
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124171
Approved by: https://github.com/malfet, https://github.com/ZainRizvi, https://github.com/zxiiro
Summary: When applying FSDP-2 to the FM-FB benchmark with the FullModel model, we ran into an error where one of the output tensors of a forward pass is None. I double-checked that the same output tensor is also None in FSDP-1. So, we just need to handle the None properly here.
Test Plan:
See that in the internal diff.
Differential Revision: D56087956
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123988
Approved by: https://github.com/awgu
This PR adds an `unshard(async_op: bool = False)` API to manually unshard the parameters via all-gather. This can be used for reordering the all-gather with other collectives (e.g. all-to-all).
This currently requires the user to set `TORCH_NCCL_AVOID_RECORD_STREAMS=1` to avoid `recordStream` from `ProcessGroupNCCL` and get expected memory behaviors.
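A hedged usage sketch (assuming `model` has been wrapped with FSDP2's `fully_shard` and that the async variant returns a waitable handle):
```python
def forward_with_manual_unshard(model, inp, other_collective):
    # `other_collective` is any callable we want to overlap with the all-gather.
    handle = model.unshard(async_op=True)  # kick off the parameter all-gather early
    other_collective()                     # e.g. an all-to-all
    handle.wait()                          # parameters are unsharded after this
    return model(inp)
```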
Differential Revision: [D56148725](https://our.internmc.facebook.com/intern/diff/D56148725)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120952
Approved by: https://github.com/wanchaol
- Add a serial marker for individual tests so the test file can be removed from the CI serial list
- Run serial-marked tests first, in serial
- Run all other tests afterwards, in parallel
- Slowly reduce the list and mark individual tests as serial instead
- Hope the number of serial tests is small so sharding evenness doesn't get too messed up
- Hopefully we can do 3 procs for sm86 and cpu?
- "serial" no longer looks like a real word to me
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124085
Approved by: https://github.com/seemethere, https://github.com/malfet
Summary: https://github.com/pytorch/pytorch/pull/123452 added
backward support to this op by turning it into
CompositeImplicitAutograd, which meant it gets decomposed during
export/compile. However, this is not desirable behavior for the
PTQ case when we try to lower the model. This commit enables
QAT without breaking PTQ by refactoring the impl into a separate
op that does have backward support.
Test Plan:
python test/test_quantization.py -k test_decomposed_choose_qparams_per_token_asymmetric_backward
Reviewers: jerryzh168, digantdesai, zou3519
Subscribers: jerryzh168, digantdesai, zou3519, supriyar
Differential Revision: [D56192116](https://our.internmc.facebook.com/intern/diff/D56192116)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124178
Approved by: https://github.com/digantdesai
**Summary**
We wrap DTensor's local tensor in `LocalShardsWrapper` for torchrec's table-wise sharding. The exception is non-participating ranks: there, the local tensor is an empty torch.Tensor object. The reason for this design is to avoid the complexity of supporting the empty tensor case in `LocalShardsWrapper`.
**Test**
`torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/torchrec_sharding_example.py -e table-wise`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122853
Approved by: https://github.com/wz337
ghstack dependencies: #120265, #121392, #122843
**Summary**
Always wrap the local tensor into a `LocalShardsWrapper`. This is for uniformity and makes it easier to adopt DTensor as a wrapper for the local shard(s) representation. To support more tensor ops over `LocalShardsWrapper`, users need to extend its `__torch_dispatch__`.
**Test**
`torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/torchrec_sharding_example.py -e row-wise-even`
**Result**
```
Row-wise even sharding example in DTensor
Col 0-15
------- ----------
Row 0-1 cuda:0
Row 2-3 cuda:1
Row 4-5 cuda:2
Row 6-7 cuda:3
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122843
Approved by: https://github.com/wz337
ghstack dependencies: #120265, #121392
**Summary**
This PR serves as a start of this effort by adding an example test that represents TorchRec's `ShardingType.TABLE_WISE` using DTensor.
**Test**
`torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/torchrec_sharding_example.py -e table-wise`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120265
Approved by: https://github.com/wanchaol
Fixes https://github.com/pytorch/pytorch/issues/98921
There were two issues detected:
- `MultiStepLR`: issue is described in https://github.com/pytorch/pytorch/issues/98921, this is resolved by allowlisting `collections.Counter`
- `OneCycleLR`: `state_dict['anneal_func']` is either `<function OneCycleLR._annealing_cos at 0x7f364186f5b0>` or
`<function OneCycleLR._annealing_linear at 0x7f39aa483640>` depending on the `anneal_func` kwarg.
This leads to `WeightsUnpickler error: Unsupported class __builtin__.getattr` from the `weights_only` Unpickler.
Fixed the above in a BC-compatible manner by adding `OneCycleLR._anneal_func_type` as a string attribute and removing `OneCycleLR.anneal_func`.
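A hedged repro sketch for the MultiStepLR path (the file name is illustrative):
```python
import torch
from torch.optim.lr_scheduler import MultiStepLR

opt = torch.optim.SGD([torch.nn.Parameter(torch.zeros(1))], lr=0.1)
sched = MultiStepLR(opt, milestones=[10, 20])
torch.save(sched.state_dict(), "sched.pt")
# Previously raised because MultiStepLR stores milestones as a collections.Counter.
state = torch.load("sched.pt", weights_only=True)
sched.load_state_dict(state)
```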
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123775
Approved by: https://github.com/albanD, https://github.com/malfet
Summary:
Pass Process Group Name and Desc to NCCL communicator in order to access pg information in NCCL layer.
The information is passed as a commDesc string (i.e. "<pg_desc>:<pg_name>").
The function is only valid when NCCL_COMM_DESCRIPTION is defined.
Differential Revision: D55703310
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124149
Approved by: https://github.com/shuqiangzhang
Fixes #104729
This improves the compiled-mode performance of Softmax (by 20%) and of other operations (like batchnorm) that invoke the reduce_all function, thereby also improving BERT inference by around 8%.
Tested on a graviton 3 instance (c7g.4xl). Tests were run in a single-threaded manner.
Script attached below.
Command: `OMP_NUM_THREADS=1 LRU_CACHE_CAPACITY=1024 DNNL_DEFAULT_FPMATH_MODE=BF16 python TestSoftmax.py`
[TestSoftmax.txt](https://github.com/pytorch/pytorch/files/14910754/TestSoftmax.txt)
```python
import torch
import torch.nn as nn
from torch.profiler import profile, record_function, ProfilerActivity

model = nn.Softmax().eval()
compiled_model = torch.compile(model)
inputs = torch.randn(1024, 1024)

with torch.set_grad_enabled(False):
    for _ in range(50):
        compiled_model(inputs)  # Warmup
    print("Warmup over")
    with profile(activities=[ProfilerActivity.CPU]) as prof:
        with record_function("model_inference"):
            for _ in range(100):
                compiled_model(inputs)
    print(prof.key_averages().table(sort_by="self_cpu_time_total"))
    # Check if the compiled model inference and the eager model inference are similar using torch.allclose
    print(torch.allclose(compiled_model(inputs), model(inputs)))
```
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123584
Approved by: https://github.com/jgong5, https://github.com/malfet
I found that returning the copy is actually useful in situations where you might do something like:
```
ret = _copy_state_dict(obj, cache)
ret.update(some_other_values)
```
and would like `cache` not to change structure from `ret.update(some_other_values)`. Open to notes here: not returning a copy might force the user to do some additional copies for this case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123567
Approved by: https://github.com/wz337
Summary: `matrix_instr_nonkdim` and `waves_per_eu` are AMD specific launch configs that can't be treated as fn input args
Test Plan:
HIP_VISIBLE_DEVICES=7 numactl --cpunodebind=1 --membind=1 buck2 run mode/{opt,amd-gpu} -c fbcode.triton_backend=amd -c fbcode.enable_gpu_sections=true -c fbcode.rocm_arch=mi300 //hammer/modules/sequential/encoders/tests:hstu_bench -- --torch-compile=True
the E2E works well on the magic model
Differential Revision: D56165438
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124146
Approved by: https://github.com/aakhundov
The current implementation drops the negative frequency components even when the user doesn't ask for the one-sided transform. The tests for the negative frequency components seem to have worked by accident due to internal implementation details, but the issue becomes evident in macOS 14.4.
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123274
Approved by: https://github.com/malfet
Summary: With the merge of D55925068, we have introduced an overflow issue when recording a trace using dyno gputrace. This is because it is possible for TorchOPs to be enumerated but not have an end time since they were running as the recording ended. By default these events have an end time set to INT_MIN. When finding the duration() for such events using end-start, we get an overflow resulting in a very long duration. This was avoided before because we were dividing the INT_MIN by 1000 because we were trying to convert uS to nS. This change introduces a patch for TorchOps and a future PR will be added to create a more universal guard in kineto.
Test Plan:
Trace recorded using resnet test.
Trace:
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/0/1713199267/localhost/libkineto_activities_2247224.json.gz&bucket=gpu_traces
Differential Revision: D56144914
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124080
Approved by: https://github.com/aaronenyeshi
Summary:
As part of the work of unifying the process group identifier, log <group_name, group_desc> instead of the pg uid in the profiler.
- group_name remains as the unique identifier, e.g. “0”, "1"
- group_desc will be the user specified name, e.g. "fsdp".
Reviewed By: aaronenyeshi, kwen2501
Differential Revision: D55610682
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124035
Approved by: https://github.com/aaronenyeshi
Fixes #121200
This PR introduces AcceleratorOutOfMemoryError for all privateuse1 backends. For Python, there is a PyError object which will be set only when privateuse1 is registered. All privateuse1 backends can then use this error for memory errors. Maybe more error types will be added in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121702
Approved by: https://github.com/guangyey, https://github.com/albanD
**Overview**
This PR adds pre/post-all-gather extensions to FSDP2.
- The pre/post-all-gather extensions are specified at the tensor-level on the `sharded_param._local_tensor` (i.e. the tensor wrapped by the sharded `DTensor`). If the user has a tensor-subclass parameter on the module passed to FSDP that preserves the subclass through the sharding ops (e.g. `new_zeros`, `chunk`, etc.), then the `sharded_param._local_tensor` will naturally be of that subclass.
- The pre-all-gather function has signature:
```
def fsdp_pre_all_gather(self) -> Tuple[Tuple[torch.Tensor, ...], Any]
```
- The first return value is a `Tuple[torch.Tensor, ...]` of the all-gather inputs. It is a tuple since a subclass could contribute >1 inner tensors.
- The second return value is any optional metadata needed to pass through to the post-all-gather.
- The post all-gather function has signature:
```
def fsdp_post_all_gather(
self,
all_gather_outputs: Tuple[torch.Tensor, ...],
metadata: Any,
param_dtype: torch.dtype,
*,
out: Optional[torch.Tensor] = None,
) -> Union[Tuple[torch.Tensor, Tuple[torch.Tensor, ...]], None]:
```
- The `all_gather_outputs` are exactly the all-gathered versions of the `fsdp_pre_all_gather` 1st return value (representing the all-gather inputs). We make sure to unflatten these back to ND for the user.
- The `metadata` is the `fsdp_pre_all_gather` 2nd return value, untouched.
- The `param_dtype` is the parameter dtype based on the passed-in `MixedPrecisionPolicy`. Namely, if no policy is passed in, then `param_dtype` is the original dtype, and otherwise, it is the `MixedPrecisionPolicy.param_dtype`.
- If `out` is not specified, then the return value has type `Tuple[torch.Tensor, Tuple[torch.Tensor, ...]]`. The first tuple item is the unsharded parameter (e.g. re-wrapping into some subclass). The second tuple item is a tuple of unsharded inner tensors that FSDP should free during reshard. These should be derived from the all-gather outputs.
- The `out` argument is required due to FSDP's `resize_` usage. We require an in-place variant for the backward all-gather. Here, `out` will be exactly the object returned as the first tuple item in the out-of-place variant mentioned before. The unsharded inner tensors will be allocated before calling `fsdp_post_all_gather`. When `out` is specified, the `fsdp_post_all_gather` should return `None`. If the post-all-gather does not do any out-of-place ops, then the `out` variant can just be a no-op since the unsharded inner tensors will be the same as the all-gather outputs, which FSDP directly writes to after all-gather. (E.g., this is the case for both float8 and `NF4Tensor`.)
- We check for `fsdp_pre_all_gather` and `fsdp_post_all_gather` directly via `hasattr` to accommodate monkey patching so that we do not strictly require the user to use a tensor subclass. The monkey patch must happen after the local tensors have been finalized (after applying FSDP and after any meta-device init).
- For now, we require that all gradients in one FSDP parameter group share the same dtype. This is fine for float8 and `NF4Tensor` use cases. If this requirement is too strict, then in the future we can issue 1 reduce-scatter per dtype per group.
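To make the contract above concrete, here is a minimal sketch of the two hooks for a hypothetical wrapper subclass whose only inner tensor is `self._inner` (illustrative only, not code from this PR; the signatures mirror the ones listed above):
```python
import torch
from typing import Any, Optional, Tuple, Union

def fsdp_pre_all_gather(self) -> Tuple[Tuple[torch.Tensor, ...], Any]:
    # 1st return: the all-gather inputs; 2nd return: optional metadata.
    return (self._inner,), None

def fsdp_post_all_gather(
    self,
    all_gather_outputs: Tuple[torch.Tensor, ...],
    metadata: Any,
    param_dtype: torch.dtype,
    *,
    out: Optional[torch.Tensor] = None,
) -> Union[Tuple[torch.Tensor, Tuple[torch.Tensor, ...]], None]:
    (unsharded_inner,) = all_gather_outputs
    if out is not None:
        # In-place variant (backward): FSDP already wrote into the preallocated
        # inner tensors, so no out-of-place work is needed here.
        return None
    # Out-of-place variant: return the unsharded parameter and the inner
    # tensors FSDP should free at reshard time.
    return unsharded_inner.to(param_dtype), (unsharded_inner,)
```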
**Design Notes**
- We assume that the `sharded_param._local_tensor` is padded on dim-0.
- This assumption should not block immediate use cases, and when we pad the `DTensor._local_tensor` by default, this assumption will always be true.
- This assumption allows us to call `sharded_param._local_tensor.fsdp_pre_all_gather()`; i.e. it tells us from which tensor object to invoke `fsdp_pre_all_gather()`.
- Suppose we want to compose with CPU offloading. Then, CPU offloading's H2D copy should run first, i.e. `sharded_param._local_tensor.to("cuda").fsdp_pre_all_gather()`, where `_local_tensor.to("cuda")` should return an instance of the subclass so that it still defines `fsdp_pre_all_gather()`. Note that in this case, the subclass instance on GPU is a temporary, which means caching values on it would not be possible. One possibility would be to have `.to("cuda")` move any cached values too.
- `fsdp_post_all_gather` can either return an unsharded parameter that aliases with the all-gather output or does not alias, but there is no way to know a priori.
- If the unsharded parameter aliases with the all-gather output, then we should _not_ free the all-gather output in `unshard`.
- If the unsharded parameter does not alias with the all-gather output, then we prefer to free the all-gather output in `unshard` to avoid holding the unneeded temporary.
- One approach is for eager-mode to check for this alias (by comparing data pointers). However, this might be adversarial to full-graph compilation. The compromise for simplicity can be to always free the all-gather output in `reshard`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122908
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
ghstack dependencies: #119302
This PR is part of the FSDP extensions work. For subclasses such as for QLoRA's `NF4Tensor` (using block-wise quantization) that have multiple inner tensors per parameter, we must generalize to allow each parameter to contribute >1 all-gather inputs and hence have >1 all-gather outputs.
This PR does this generalization by converting `FSDPParam.all_gather_input: torch.Tensor` to `FSDPParam.all_gather_inputs: List[torch.Tensor]`. Unfortunately, since we need to preserve the mapping from all-gather inputs/outputs to their source parameter, we have to introduce `List[List]` instead of simply `List` in several places. Furthermore, we still require the flattened 1D `List` for `torch.split` calls, introducing some redundancy between data structures. Nonetheless, I do not see a way to avoid this if we want the generalization.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119302
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
- Update `WORKSPACE` to actually use Python-3.10 as job name claims it is
- Get rid of unneeded `future` and `six` dependencies (Removed long time ago)
- Update `requests`, `typing-extensions` and `setuptools` to the latest releases
- Mark `tools/build/bazel/requirements.txt` as a generated file
This also updates idna to 3.7, which contains a fix for [CVE-2024-3651](https://github.com/advisories/GHSA-jjg7-2v4v-x38h), though as we are not shipping a binary with it, it does not expose the CI system to any actual risk.
TODOs:
- Add periodic job that runs `pip compile` to update those to the latest version
- Unify the various requirements.txt files (i.e. the bazel requirements and requirements-ci should be one and the same)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124076
Approved by: https://github.com/seemethere, https://github.com/DanilBaibak
Summary: Modify fresh_inductor_cache() to clear cached state before mocking the toplevel cache_dir directory. Any lru_caches (or otherwise) can use the @clear_on_fresh_inductor_cache decorator to register the cache for clearing. Also change the base inductor TestCase class to use fresh_inductor_cache(). Previously that TestCase was only mocking the subdirectory within the toplevel cache dir designated for the FX graph cache artifacts.
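A minimal sketch of the pattern, assuming the decorator and context manager live in `torch._inductor.utils` (the summary only names the decorator):
```python
import functools
from torch._inductor.utils import clear_on_fresh_inductor_cache, fresh_inductor_cache

@clear_on_fresh_inductor_cache      # registered so fresh_inductor_cache() clears it too
@functools.lru_cache(None)
def cached_probe(key):
    return key * 2                  # stand-in for an expensive, cacheable computation

with fresh_inductor_cache():        # mocks the toplevel cache_dir and clears registered caches
    ...                             # tests here start from an empty Inductor cache
```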
Test Plan:
- New unit test
- All existing inductor tests will exercise fresh_inductor_cache()
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122661
Approved by: https://github.com/oulgen
Fixes#123597
There's a sizable comment in the PR about why this is needed, but essentially the launch path is really really perf sensitive (running `launch` is ~30 microseconds, and according to the linked issue, regressing it to 33us is worth 6% overall on torchbench). The `bin.launch_metadata` call doesn't look super expensive, but microseconds matter, and this is only useful when we have a launch hook installed (which seems pretty rare?). This change is worth about 2us, and when combined with the other diff in the stack seems to completely eliminate the torchbench regression.
Differential Revision: [D56046347](https://our.internmc.facebook.com/intern/diff/D56046347)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123841
Approved by: https://github.com/jansel, https://github.com/shunting314
Summary:
note: breaking the original diff D55225818 into 3 parts (top-level renaming, higher-order-op subgraphs, constant input de/serialization) because of its size.
Stacked PR to restore original names to placeholder nodes, replacing the default names arg0_1, arg1_1, ...
This PR supports constant argument placeholder (e.g. forward(self, x, y=1)) names and de/serialization, by adding a name field for ConstantArguments in the graph signature, and ConstantInputSpec in the input specs for serialization.
Test Plan: verification checks on placeholder names for all export() calls, unit test in test/export/test_export.py
Differential Revision: D55506949
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123590
Approved by: https://github.com/angelayi, https://github.com/zhxchen17
This PR adds the ability to pad tensor strides during lowering. The goal is to make sure (if possible) that tensors with unfavorable shapes can have aligned strides so the GPU can access the memory more efficiently.
By testing BlenderbotSmallForConditionalGeneration I already see a 2.5ms speedup.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120758
Approved by: https://github.com/jansel
Fixes https://github.com/pytorch/pytorch/issues/122459, https://github.com/pytorch/torchtrain/issues/61
Even with the previous PR ("support DTensor/subclass constructors directly in the graph"), I still see some errors when running the repro above, with logs showing that dynamo is inlining `__new__`.
I noticed that putting `@torch._dynamo.disable` on DTensor's `__new__` makes the entire repro pass.
Why does having dynamo try to inline `Subclass.__new__` run into problems? Morally, dynamo probably shouldn't be inlining __new__ ("creating a subclass" is a blackbox operation that AOTAutograd can trace through anyway). But concretely, we can end up with a node in the dynamo FX graph that has a "partially initialized tensor subclass" as its example value, because the subclass has been created but its fields have not been assigned to yet.
This breaks a bunch of invariants throughout dynamo: there are many places where if we have a tensor subclass node, we want to look at its inner tensors, to see if they are FakeTensors, what their FakeTensorMode is, and if they have dynamic shapes.
One option is to decide that "uninitialized subclass" is a first-class thing that anyone looking at the FX node examples values on the dynamo graph needs to handle, but this seems like a lot of work when in reality we don't need dynamo to trace the __new__ at all. Hence the `torch._dynamo.disable`.
I still wasn't very satisfied, since it was unclear to me **why** dynamo was inlining the `__new__` call, instead of interposing on the `DTensor()` constructor directly. After a long chat with @anijain2305, he explained that with code like this:
```
@torch._dynamo.disable(recursive=False)
def f(x):
    out = SubclassConstructor(x)
```
Dynamo will never get the chance to interpose on the subclass constructor. Instead, what will happen is:
(1) Dynamo hands back control to cpython to run `f()`, since we disabled that frame
(2) `SubclassConstructor(x)` is run in eager mode
(3) `SubclassConstructor(x)` eventually calls `SubclassConstructor__new__`
(4) this is a new frame, that cpython then allows dynamo to intercept and start compiling
So it looks like we are basically forced to handle the situation where dynamo might directly start compiling `Subclass.__new__`
All of the above does not explain the story for `__torch_dispatch__` though. Empirically, I have a repro in torchtrain where looking at the dynamo logs, we see dynamo try to inline `__torch_dispatch__`.
```
[rank0]:DEBUG: Skipping frame because no content in function call _prepare_output_fn /data/users/hirsheybar/b/pytorch/torch/distributed/tensor/parallel/style.py 318
[rank0]:DEBUG: torchdynamo start compiling __torch_dispatch__ /data/users/hirsheybar/b/pytorch/torch/distributed/_tensor/api.py:297, stack (elided 5 frames):
```
I haven't been able to create a smaller repro of the problem (even using `_dynamo.disable(recursive=False)`), although in theory, if there is a `torch.*` op that you were to inline (where one of the inputs is a subclass), the next frame would likely be `__torch_dispatch__`. Dynamo always treats `torch.*` operations as not-inlinable though, so in theory we shouldn't ever see dynamo inline `__torch_dispatch__`, but a `_dynamo.disable()` fixes the problem.
I asked Animesh if we can have dynamo automatically apply this behavior to subclasses instead of needing it to be added explicitly. He pointed out that for `disable(recursive=False)`, we can't really do this within dynamo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123347
Approved by: https://github.com/zou3519
ghstack dependencies: #122502, #122751, #123348
Summary: The Triton compiler adds a constant argument 1 to `equal_to_1` [only when it's an int](8c5e33c77e/python/triton/runtime/jit.py (L275)). Here we restrict Inductor's `equal_to_1` in the same way.
Test Plan:
```
$ python test/inductor/test_triton_kernels.py -k test_triton_kernel_equal_to_1_float_arg
...
----------------------------------------------------------------------
Ran 1 test in 6.528s
OK
$ python test/inductor/test_triton_kernels.py -k test_triton_kernel_equal_to_1_arg
...
----------------------------------------------------------------------
Ran 2 tests in 10.142s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123886
Approved by: https://github.com/oulgen
ghstack dependencies: #123703
Some changes to how we handle blocks in 3.11+:
- We only keep track of with blocks that are not enclosed in a try block
- We do not compile partial graphs if we are in a block that is not in a tracked with block - i.e. any block enclosed in some non-with try/except/etc. block
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123978
Approved by: https://github.com/jansel
Before this PR we would pass generated source code over a pipe to the compile worker then the compile worker would write out the file. Doing it this way is faster and results in smaller messages to the workers (and lets us skip creating the workers in the warm start case).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123409
Approved by: https://github.com/desertfire
This PR ensures that assignment of attributes of primitive type work without needing any code changes in non-strict mode. (In a previous PR we banned attribute assignments of tensor type unless such attributes are registered as buffers.)
While strict mode errors on (all) attribute assignments, non-strict doesn't care, so one might assume that this kind of attribute assignment should already work in non-strict. However, there's a problem: we run through the program once for metadata collection and then run through it again for tracing, so the values observed during tracing (and potentially burned into the graph) do not reflect what should have been observed had the metadata collection pass not run.
So the only thing this PR needs to do is restore values of assigned attributes of primitive type once the metadata collection pass has run. We do this by moving the attribute assignment detecting context manager from the overall `aot_export` call in `_trace.py` to the metadata collection pass in `aot_autograd.py`, and extending it. The rest of the PR moves some utils around.
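A toy sketch of the pattern this enables (illustrative module, not taken from the PR; whether a given model exports cleanly still depends on the rest of the stack):
```python
import torch

class Counter(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.calls = 0              # primitive-typed attribute (not a buffer)

    def forward(self, x):
        self.calls = self.calls + 1  # attribute assignment of primitive type
        return x + self.calls

# non-strict export runs a metadata pass and then traces; this PR restores
# the primitive attribute values in between the two passes
ep = torch.export.export(Counter(), (torch.randn(2),), strict=False)
```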
Differential Revision: D56047952
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123898
Approved by: https://github.com/angelayi
## Summary
After this PR, the functional collective Python APIs will stop honoring `TORCH_DISABLE_NATIVE_FUNCOL` and only use native funcol ops. Specifically, this PR:
- Removed `use_native_funcol()`.
- Removed the code path in the Python APIs when `use_native_funcol()` is `False`.
- Changed the CI tests that runs on both native funcol and legacy funcol through the Python API to only run with native funcol.
## Test Changes
`test_functional_api.py`
- Removed the tests where only one of output_split_sizes or input_split_sizes is specified. This behavior is unreliable and has been removed from the native funcol.
- Removed `TestWaitiness` which tests an implementation detail of the legacy funcol. We have equivalent tests for native funcol in `test/distributed/test_c10d_functional_native.py` b7fac76fc2/test/distributed/test_c10d_functional_native.py (L114-L116)
`test/distributed/_tensor/test_dtensor.py`
`test/distributed/_tensor/test_dtensor_compile.py`
`test/distributed/test_device_mesh.py`
`test/distributed/_tensor/experimental/test_tp_transform.py`
`test/distributed/_tensor/test_matrix_ops.py`
`test/distributed/test_inductor_collectives.py`
- All these tests were double running with both native funcol and legacy funcol. Changed to only run with native funcol.
`test/distributed/test_c10d_functional_native.py`
- Removed the `run_with_native_funcol` decorators.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123777
Approved by: https://github.com/wanchaol
ghstack dependencies: #123776
Summary: This DIFF passes Triton kernel information, such as the kernel Python file, kernel type, grid, and stream, to record_function. With this information, Execution Trace can capture the Triton kernel and replay it in PARAM.
Test Plan:
unit test
buck2 test caffe2/test:profiler -- test_record_function_fast
Differential Revision: D56021651
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123871
Approved by: https://github.com/sraikund16
Fixes#123916
Due to MultiThreadedTestCase we're leaking is_fx_tracing_flag to other tests which causes any dynamo based tests to fail. The test execution order is arbitrary which caused this to not be caught in development.
Test plan:
```sh
pytest --random-order test/distributed/test_functional_api.py -k 'TestMakeFx or test_all_to_all_single_compile_True'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123958
Approved by: https://github.com/yifuwang
Summary:
without this the `ProcessGroupNCCL` lib would try to infer the device id and emit a warning.
This doesn't change the behavior just makes it explicit.
> ProcessGroupNCCL.cpp:3720] [PG 0 Rank 1] using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
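For reference, a minimal sketch of making the device explicit (assumes an initialized NCCL process group; the rank-to-GPU mapping here is illustrative):
```python
import torch
import torch.distributed as dist

# assumes dist.init_process_group("nccl") has already run in this process
local_rank = dist.get_rank() % torch.cuda.device_count()
dist.barrier(device_ids=[local_rank])   # explicit device: no inference, no warning
```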
Test Plan: CI
Differential Revision: D55998175
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123866
Approved by: https://github.com/awgu
Summary: Currently, the `device` context manager from the device interface is used in inductor, albeit only in one place. This PR creates an inductor-specific `DeviceGuard` class for use in these cases, which keeps a reference to the `DeviceInterface` class which is defined and added out of tree. This then offloads the device-specific work to the device interface, instead of having to define this logic on the device class, which isn't strictly necessary for inductor.
Ideally I would have used the existing `DeviceGuard` class, but these are defined per device and don't work well with inductor's device agnostic/ out of tree compatible design. With the existing classes in mind, I am happy to take suggestions on the renaming of this class.
Whilst I was there, I also took the opportunity to rename `gpu_device` to `device_interface` to clarify this is not necessarily a GPU.
Test Plan: None currently, happy to add some.
Co-authored-by: Matthew Haddock <matthewha@graphcore.ai>
Co-authored-by: Adnan Akhundov <adnan.akhundov@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123338
Approved by: https://github.com/aakhundov
Summary:
We should be using CppPrinter in the cpp wrapper codegen, not the ExprPrinter (which prints expressions for Python)
Not really a memory-planning-specific bug, but exposed by mem planning because it tends to emit more complicated expressions
Differential Revision: D56025683
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123867
Approved by: https://github.com/hl475, https://github.com/chenyang78
Fix: #120336
This PR fixes an issue on AOTAutograd, specifically on backends that don't support views
by themselves (e.g. XLA). Previously, AOTAutograd tried to reconstruct output views by
calling `as_strided` on the concrete bases using sizes and strides of the outputs that
aliased them. Since backends such as XLA don't support tensor aliasing, the sizes and
strides would be that of a contiguous tensor (not a view tensor). Because of that, calling
`as_strided` would error, since the output tensor would be bigger than its base. Instead,
this PR applies the sequence of `ViewMeta` gathered for each output during the
functionalization phase.
**Note:** we intentionally don't support base tensors that went through metadata mutation,
i.e. in-place view operations.
In summary, this PR:
- Introduces one `FunctionalTensorWrapper` member function alongside its Python APIs
- `apply_view_metas(base)`: applies the `ViewMeta` sequence of the given instance onto
another base
- Introduces a `OutputAliasInfo.functional_tensor` field
- Saves the `FunctionalTensorWrapper` instance collected by the functionalization phase
- Wraps it with a new `FunctionalTensorMetadataEq` class for comparing only the
metadata of the tensors
- Plumbs `OutputAliasInfo.functional_tensor` to `gen_alias_from_base` function
- Applies the `ViewMeta` sequence of the saved `FunctionalTensor` onto `aliased_base_tensor`
- Propagates `OutputAliasInfo.functional_tensor` when updating `fw_metadata`
(this PR description was updated in order to reflect the most recent changes)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121007
Approved by: https://github.com/bdhirsh
Summary:
Currently, torch.export (through AOTAutograd) compiles with a torch.no_grad() wrapper, which affects the presence of `set_grad_enabled` nodes in pre-dispatch export graphs. This changes the wrapper to nullcontext (i.e. enable grad) if `pre_dispatch=True`.
An example that previously failed without `with torch.no_grad()` is below:
```
class Model(torch.nn.Module):
    def forward(self, x, y):
        with torch.enable_grad():
            x = x + y
        return x

model = Model()
exported_program = torch.export._trace._export(
    model,
    (torch.tensor(2), torch.tensor(3)),
    dynamic_shapes=None,
    pre_dispatch=True,
    strict=False
)
```
The pass would inline the add call, but then try to construct a HOO subgraph with no inputs/outputs:
```
def forward(self):
    _set_grad_enabled_1 = torch._C._set_grad_enabled(False)
```
Test Plan: Test case checking that nullcontext & no_grad wrappers lead to different export behaviors (regarding set grad subgraphs).
Reviewed By: tugsbayasgalan
Differential Revision: D55777804
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123671
Approved by: https://github.com/tugsbayasgalan
Summary: Need to have temporary flag in Kineto so the correct JSON output is used. Will delete all temporary flags afterwards
Test Plan: Tested traces using updated hash. Values matched expected order of magnitude/general range that is expected.
Differential Revision: D56045866
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123885
Approved by: https://github.com/aaronenyeshi
Summary:
1. Package public headers of kineto if USE_KINETO so that they can be used by PrivateUse1 users.
2. Add PrivateUse1 key to ActivityType.
3. Support PrivateUse1 key in the functions deviceTypeFromActivity and _supported_activities.
4. Fix some bugs when processing profiler results.
Co-authored-by: albanD <desmaison.alban@gmail.com>
Co-authored-by: Aaron Shi <enye.shi@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120556
Approved by: https://github.com/aaronenyeshi
We should eventually make the non-overlapping checks faster when dynamic shapes are enabled, but this is pretty difficult to do. So for now this PR adds a config that lets us fail fast when this situation happens, instead of causing compile times to secretly come to a crawl.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123455
Approved by: https://github.com/ezyang
The current codegen is problematic if __compiled_fn_0 clears the inputs list, since we need it for assignment afterwards
```python
def forward(inputs):
    __compiled_fn_0 = ... # The actual function needs to be provided
    graph_out_0 = __compiled_fn_0(inputs) # clears inputs
    temp_list = []
    temp_list.append(graph_out_0[0])
    inputs[4].grad = graph_out_0[1] # inputs is empty, index error
    inputs[7].grad = graph_out_0[2]
    inputs[8].grad = graph_out_0[3]
    inputs[9].grad = graph_out_0[3]
    del graph_out_0
    return temp_list
```
With this fix, we use aliases to keep the tensors alive
```python
def forward(inputs):
    __compiled_fn_0 = ... # The actual function needs to be provided
    inputs_ref_1 = inputs[9]
    inputs_ref_2 = inputs[4]
    inputs_ref_3 = inputs[8]
    inputs_ref_4 = inputs[7]
    graph_out_0 = __compiled_fn_0(inputs)
    temp_list = []
    temp_list.append(graph_out_0[0])
    inputs_ref_2.grad = graph_out_0[1]
    inputs_ref_4.grad = graph_out_0[2]
    inputs_ref_3.grad = graph_out_0[3]
    inputs_ref_1.grad = graph_out_0[3]
    del graph_out_0
    return temp_list
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123359
Approved by: https://github.com/jansel
ghstack dependencies: #123630, #123674, #122353
### Context
In today's Dynamo, we lift all tensors encountered during tracing to be individual graph inputs, even when they were in a container.
And [Dynamo generates](fdc281f258/torch/_dynamo/codegen.py (L371)) the runtime function's signature using the graph's graphargs.
This means that the generated function will have each grapharg as an argument, which is problematic if we want to free the inputs in inductor codegen. See [python function arguments are kept alive for the duration of the function call](https://github.com/pytorch/pytorch/pull/83137#issuecomment-1211320670).
```python
# original code
def forward(inputs):
    a, b, c, d, e = inputs
    inputs.clear()
    out = a
    out += b
    del b # frees memory
    out += c
    del c # frees memory
    out += d
    del d # frees memory
    out += e
    del e # frees memory
    return out

# compiled code:
def forward(a, b, c, d, e):
    # b, c, d, e can't be freed before end of function
```
This isn't a concern when compiling forward because a, b, c, d, e are all from user code, and should be kept alive. But when compiling backwards, a, b, c, d, e may be intermediate results i.e. activations, that we DO want to clear ASAP to remain on par with eager peak memory.
### Solution
We have encountered similar memory problems in AOTAutograd before, where we adopted the boxed calling convention (wrapping to-be-freed objects in a list), adding list clearing to inductor codegen, and being careful about holding references to elements in the input list. We need to do something similar, but for inputs from the user program (compiled autograd fx graph in this case).
This PR support lists as graphargs/placeholder nodes. When tracing a list of tensors, we create a node for it, and pre-emptively initialize variable trackers for its elements before they are used in the user program. Subsequent uses of those variables will find hits in the lookup table `input_source_to_var`.
With the inputs as a list in the graph args, our compiled code can free inputs just like in the eager case.
```python
def forward(inputs):
    # a, b, c, d, e can be freed within the function now
```
Currently, AOT/Inductor flattens list input via [flatten_graph_inputs wrapper](597f479643/torch/_inductor/compile_fx.py (L1454-L1478)), which is why this PR's CI can be green. Additional changes are needed to its runtime wrapper, done in the next PR. The next step is to ensure that we are careful in forwarding the list to inductor codegen without holding additional references.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122353
Approved by: https://github.com/jansel
ghstack dependencies: #123630, #123674
Summary:
We need a way to allow users to set a customized description for a process group, e.g. FSDP, PP.
Here are several use cases of a user-specified group_desc:
- Logging: we can easily match a log line and understand what this collective/PG is used for.
- PyTorch traces (e.g. Kineto, Execution Trace) can benefit from the PG desc since trace analyses and benchmarks will be able to easily differentiate PG purposes like FSDP, PP.
- Lower-layer collectives (e.g. NCCL) debugging: we will be able to expose the PG desc to the NCCL communicator so NCCL-layer operations can be easily correlated to a PG.
Solution: Add a group_desc field to c10d
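A sketch of how this might surface to users, assuming group_desc is plumbed through new_group (the summary only states that a field is added to c10d):
```python
import torch.distributed as dist

# assumes a default process group is already initialized; group_desc is the
# human-readable purpose tag described above (e.g. "FSDP", "PP")
fsdp_pg = dist.new_group(ranks=list(range(dist.get_world_size())), group_desc="FSDP")
```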
Differential Revision: D55781850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123472
Approved by: https://github.com/kwen2501
Part of: #123062
Ran lintrunner on:
- `test/jit_hooks`
- `test/lazy`
- `test/linear.py`
- `test/load_torchscript_model.py`
- `test/mkl_verbose.py`
- `test/mkldnn_verbose.py`
with command:
```bash
lintrunner -a --take UFMT --all-files
```
Co-authored-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123807
Approved by: https://github.com/ezyang
This PR adds in fast semi-structured sparsification kernels to PyTorch.
These kernels allow for accelerated semi-structured sparsification
kernels in PyTorch.
The kernels have been added as aten native functions
In particular, three new functions have been added:
* `torch._sparse_semi_structured_tile`
This function will return the packed representation and metadata for
both X and X', as well as the thread masks. Note that this applies 2:4
sparsity in a 4x4 tile instead of a 1x4 strip as usual.
* `torch._sparse_semi_structured_apply`
This function takes in an input tensor and thread masks from the above
function and returns a packed representation and metadata from applying
thread masks to the input tensor.
* `torch._sparse_semi_structured_apply_dense`
This function does the same thing as above but instead of returning the
tensor in the sparse representation it returns it in the dense
representation
The subclasses have also been updated to add a new
`prune_dense_static_sort`
classmethod to create sparse tensors with this format. I've added some
additional documentation on how to calculate the compressed tensors
needed to create a SparseSemiStructuredTensor oneself.
To this end, there are two new helper functions added:
`sparse_semi_structured_tile`
`compute_compressed_swizzled_bitmask`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122350
Approved by: https://github.com/cpuhrsch
Fixes#121758
**TL;DR**: When profiling is turned on, the dispatcher will sometimes attach the autograd sequence number to the recorded profiler event. This PR expands the set of profiler events onto which we attach sequence numbers. Before, we'd only attach a sequence number if the current dispatch key was an Autograd dispatch key. Now, we attach a sequence number if the current dispatch key **set** contains Autograd.
**Context**:
The use case for this is torch.profiler for python subclasses.
Autograd attaches a "sequence number" to all ops that it encounters during the forward pass. Then, the corresponding sequence number can be associated with a backward kernel when backward is executed. This is used by the profiler to associate the forward ops to the backward ops; a forward op and a backward op with the same sequence number are "linked" in some post-processing step.
Prior to this PR, this profiler feature didn't work for python subclasses. The reason is that we don't collect profiler information for all the dispatches for a given kernel; we only dispatch the initial `call`, and not the subsequent `redispatch` invocations. Normally, an Autograd key (if we're running with autograd) is the highest dispatch key, so the initial `call` that we profile is an Autograd key, and we collect the sequence number. But when we're dealing with a python subclass, the first dispatch key is PythonTLSSnapshot, which eventually redispatches into Autograd. We don't record the Autograd sequence number in that case because we don't record redispatches.
To fix this, this PR collects a sequence number whenever the dispatch key **set** contains an Autograd key. That means we might sometimes collect multiple events with the same sequence number, or possibly attach sequence numbers when we won't actually use them? (e.g. maybe if the initial dispatch key handler removes Autograd for some reason). Although this might be a bit confusing for users looking directly at the sequence_nr directly, I think the main use case is for the profiler to create fwd-bwd links; and those should still be generated correctly in these cases.
Differential Revision: [D55724190](https://our.internmc.facebook.com/intern/diff/D55724190)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123304
Approved by: https://github.com/soulitzer
This adds the differentiable collective -- all_to_all_single_grad. This is the initial proof of concept PR and I will be adding the remaining collectives in follow up PRs.
This adds a new function called `all_to_all_single_autograd` which is the autograd variant of `all_to_all_single`. For backwards compatibility + initial testing we wanted to make the autograd variant separate to avoid regressions.
This uses `autograd::Function` to register an Autograd op that calls the original `_c10d_functional::all_to_all_single` via the dispatcher. This works with compile and inductor as opposed to the previous Python implementation that had issues. As this uses the existing `_c10d_functional` ops we don't need to register any meta functions or lowering.
To avoid cudaStream issues this explicitly calls `wait_tensor` in the backward method to ensure it runs under the same stream as the async operation. This hurts performance but can be alleviated potentially using `compile`.
Related work: https://github.com/pytorch/torchrec/blob/main/torchrec/distributed/comm_ops.py
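A usage sketch, assuming the autograd variant lives next to the existing functional collectives and mirrors the all_to_all_single argument order (output split sizes, input split sizes, group):
```python
import torch
import torch.distributed as dist
import torch.distributed._functional_collectives as funcol

# assumes an initialized default process group; None split sizes mean an even split
x = torch.randn(8, 16, requires_grad=True)
out = funcol.all_to_all_single_autograd(x, None, None, dist.group.WORLD)
out.sum().backward()   # gradients flow back through the collective
```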
Test plan:
```
pytest test/distributed/test_functional_api.py -k test_all_to_all_single_compile
pytest test/distributed/test_functional_api.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123599
Approved by: https://github.com/yifuwang
Fixes https://github.com/pytorch/pytorch/issues/123298
I was also seeing some crashes in torchtrain due to dynamic shapes, even when I set `compile(dynamic=False)` (cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @wanchaol). This doesn't fix the underlying dynamic shape issues with compile + DTensor, but it does prevent dynamic shapes from leaking in.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123348
Approved by: https://github.com/ezyang
ghstack dependencies: #122502, #122751
Fixes https://github.com/pytorch/pytorch/issues/122379
It looks like `iter_contains()` in dynamo expects to take in something like `iter_contains(List[VariableTracker], VariableTracker])`. Previously, when we called this function where the list in question was a `RangeVariable`, we would pass in `RangeVariable.items` as our list.
This is wrong, though since `RangeVariable.items` just contains the underlying [start, stop, step]. It looks like `unpack_var_sequence` does the right thing of "materializing" the range into a list of `VariableTrackers`, so I used that instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122751
Approved by: https://github.com/anijain2305, https://github.com/jansel
ghstack dependencies: #122502
Fixes https://github.com/pytorch/pytorch/issues/104505
I was originally going to ban all usages of as_strided + mutation in functionalization. But I'm pretty sure that as_strided + mutation is fine when we are calling as_strided on a base tensor.
So in this PR I added a slightly more conservative check: if we see an as_strided + mutation, where the input to an as_strided was **another** view op, then I error loudly in functionalization and link to the github issue above (in case anyone runs into this in the real world)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122502
Approved by: https://github.com/ezyang, https://github.com/albanD
Summary: If in a custom (user-written) Triton kernel an externally imported symbol is used directly, we need to codegen the corresponding import outside the kernel body in the Python wrapper. E.g., if the user code has this:
```
from triton.language.extra.cuda.libdevice import fast_dividef
@triton.jit
def my_kernel(...):
    ...
    x = fast_dividef(...)
    ...
```
The `from triton.language.extra.cuda.libdevice import fast_dividef` line needs to be carried over together with the `my_kernel` function. The PR adds this.
Test Plan:
```
$ python test/inductor/test_triton_kernels.py
...
----------------------------------------------------------------------
Ran 464 tests in 113.512s
OK
```
Differential Revision: [D55953241](https://our.internmc.facebook.com/intern/diff/D55953241)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123703
Approved by: https://github.com/jansel, https://github.com/oulgen
Summary:
note: breaking the original diff [D55225818](https://www.internalfb.com/diff/D55225818) into 3 parts (top-level renaming, higher-order-op subgraphs, constant input de/serialization) because of its size.
Stacked PR to restore original names to placeholder nodes, replacing the default names arg0_1, arg1_1, ...
This PR propagates node names to higher-order-op subgraph placeholders, retaining the top-level names and handling naming collisions by suffixing other non-placeholder nodes in the subgraph with an index. This is the same handling as in fx.Graph/fx.Node, but implemented separately as a pass.
Since the input schemas of HOO subgraphs are very different, they are enumerated in _name_hoo_subgraph_placeholders(). Currently cond, map_impl, and wrap_with_set_grad_enabled are handled, but other ops can be easily added.
Test Plan: verification checks on placeholder names for all export() calls, unit test in test/export/test_export.py
Differential Revision: D55456749
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123587
Approved by: https://github.com/angelayi
`FSDPState` only uses `TrainingState.PRE_BACKWARD` as a backward training state, not `TrainingState.POST_BACKWARD`, because the FSDP state itself does not run post-backward (only its `FSDPParamGroup`, which may not exist if the state does not manage any parameters).
This meant that when `is_last_backward=False`, the `FSDPState` was incorrectly still in `TrainingState.PRE_BACKWARD`, and the next `_pre_forward` would not run due to the early return logic for activation checkpointing:
7c451798cc/torch/distributed/_composable/fsdp/_fsdp_state.py (L148-L151)
We fix this by always transitioning to `TrainingState.IDLE` at the end of the current backward task, regardless of `is_last_backward`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123857
Approved by: https://github.com/weifengpy
Fixes some issues with `_load_state_dict_keys`, including:
* updates broken test, which was failing due to incorrect parameters
* adds support for specifying nested keys, e.g. load state dict keys can now specify something like `"optimizer.state"`, which loads all keys under `optimizer.state`
* updates call site to use the private implementation of `_load_state_dict`, which properly handles empty state dicts (otherwise the keys are ignored)
Big shout out to @diego-urgell who not only identified current issues, but recommended the right solutions!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123679
Approved by: https://github.com/diego-urgell, https://github.com/wz337
Summary:
Build a hook for an external mechanism to monitor the health of the torch elastic launcher. The health check server depends on FileTimerServer to check whether the launcher is healthy; it will always report healthy if FileTimerServer is disabled.
The implementation of start_healthcheck_server is left unsupported; however, a TCP/HTTP server can be started on a specific port to monitor the liveness of worker_watchdog and take action accordingly.
Test Plan: buck test mode/opt caffe2/test/distributed/elastic/agent/server/test:local_agent_test
Differential Revision: D55837899
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123504
Approved by: https://github.com/kurman
We have some (limited) support for `set_()` input mutations in `torch.compile`, but one restriction today is that we force them to run outside of the graph, in the opaque runtime epilogue.
This is a problem for ppFSDP. Why? The usage pattern of ppFSDP forward graphs look something like this:
```
def forward_fsdp(sacrificial_param, sharded_param, inp):
    allgathered_param = allgather(sharded_param)
    sacrificial_param.set_(allgathered_param) # hidden in an autograd.Function that we trace
    out = matmul(sacrificial_param, inp)
    sacrificial_param.untyped_storage().resize_(0)
    return out
```
When we functionalize this graph, `sacrificial_param` sees two distinct types of input mutations, that we must preserve: a `set_`, and a `resize_`. Importantly, at runtime the `set_()` must run **before** the `resize_()`. Why? the `set_()` updates the storage of our sacrificial param to the allgather'd data, which allows the call to `sacrificial_param.resize_()` to free the allgathered data later. If we run the two mutations in reverse order, we will never free the allgathered data.
We want to put the `resize_()` mutation op inside of the graph (see next PR, also there's a much longer description in that PR for anyone interested). However, this will require us to put `set_()` in the graph as well, in order for them to run in the correct order.
In order to do this, I had to add some extra restrictions: You are now required to run `set_()` under `no_grad()` if you use it with `torch.compile`, and if you perform any other mutations to the input, those must be under no_grad as well (otherwise, the mutations may mutate the `grad_fn` of the input, making it no longer safe to keep in the graph). These restrictions are hopefully reasonable, since `set_()` doesn't see much usage today (and the original impetus for adding set_() support a few months ago was for fsdp anyway)
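A sketch of the shape of code the new restriction targets (illustrative names only, not FSDP's actual traced code):
```python
import torch

@torch.compile
def swap_and_use(param, allgathered, inp):
    with torch.no_grad():          # set_() (and any other mutation of this input) must be under no_grad
        param.set_(allgathered)
    return param @ inp
```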
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122981
Approved by: https://github.com/jansel
ghstack dependencies: #122433, #123646
In non-strict, assignment of attributes in a model causes their state to contain fake tensors post-tracing, which leads to incorrect results on running the exported model. We now error when this happens, asking the user to use buffers instead.
Next, we add support for assignment of buffers. The final values of the buffers turn into outputs of the graph. Since the buffers are already lifted as inputs and populated with the initial values when the model is run, this leads to a simple programming model where the driver of the model can feed the outputs back as inputs for successive runs.
Differential Revision: D55146852
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122337
Approved by: https://github.com/bdhirsh, https://github.com/tugsbayasgalan
Previously, we'd just check `has_symbolic_sizes_strides()` to know whether a tensor has symbolic sizes or strides; if does, we skip some profiler logic. But sometimes `has_symbolic_sizes_strides()` returns false, but we do actually have symbolic sizes or strides.
So in this change, we add `may_have_symbolic_sizes_strides()` - which should never return false if the tensor has symbolic sizes and strides
Why not change `has_symbolic_sizes_strides()`? It seems like there's preexisting logic that assumes that "if has_symbolic_sizes_strides(), then we can assume that this tensor is guaranteed to have symbolic sizes or strides". In this case, we have python-implemented sizes or strides, which should follow a different code path.
Differential Revision: [D55947660](https://our.internmc.facebook.com/intern/diff/D55947660/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123696
Approved by: https://github.com/aaronenyeshi, https://github.com/soulitzer
Summary: When running the backward for this op, we get the error:
```
RuntimeError: derivative for aten::aminmax is not implemented
```
This commit replaces this call with separate amin and amax
calls instead, which do have implemented derivatives.
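For illustration, the shape of the replacement (not the literal decomposition code):
```python
import torch

x = torch.randn(4, 8, requires_grad=True)
# torch.aminmax(x, dim=-1) has no registered derivative, so compute amin/amax
# separately; both are differentiable
min_val, max_val = x.amin(dim=-1), x.amax(dim=-1)
(max_val - min_val).sum().backward()   # works; the aminmax version would raise
```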
Test Plan:
python test/test_quantization.py -k test_decomposed_choose_qparams_per_token_asymmetric_backward
Reviewers: jerryzh168, digantdesai
Subscribers: jerryzh168, digantdesai, supriyar
Differential Revision: [D55805170](https://our.internmc.facebook.com/intern/diff/D55805170)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123452
Approved by: https://github.com/digantdesai, https://github.com/jerryzh168
I found it helpful to be able to see, given some inductor output code, which AOT graph it came from. When you have large models with multiple graphs floating around this can be difficult, so I added the aot_config.aot_id to the printed inductor output.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118647
Approved by: https://github.com/ezyang
We are taking API feedback. Changes:
- I removed some of the default values (they weren't being used).
- I was unable to convert the last op (which is essentially an
autograd.Function registered as CompositeImplicitAutograd). That one
is "incorrectly registered"; I punt fixing it to the future.
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123454
Approved by: https://github.com/andrewor14
ghstack dependencies: #123453, #123578
If a user accesses an OpOverloadPacket, then creates a new OpOverload,
then uses the OpOverloadPacket, the new OpOverload never gets hit. This
is because OpOverloadPacket caches OpOverloads when it is constructed.
This PR fixes the problem by "refreshing" the OpOverloadPacket if a new
OpOverload gets constructed and the OpOverloadPacket exists.
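For illustration, a sketch of the failure mode (the namespace and op names are made up):
```python
import torch
from torch.library import Library

lib = Library("mylib", "DEF")                 # made-up test namespace
lib.define("foo(Tensor x) -> Tensor")
packet = torch.ops.mylib.foo                  # OpOverloadPacket constructed (and cached) here

lib.define("foo.alt(Tensor x, Tensor y) -> Tensor")
# before this fix, the cached packet could miss the newly registered overload;
# now the packet is refreshed when a new OpOverload is constructed
overload = torch.ops.mylib.foo.alt
```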
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123578
Approved by: https://github.com/albanD
ghstack dependencies: #123453
This commit introduces a meta function for the `channel_shuffle` operation, enabling PyTorch to perform shape inference and optimizations related to this operation without actual computation. The meta function assumes input shape (*, C, H, W) and validates that the number of channels (C) is divisible by the specified number of groups.
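A rough sketch of the shape logic (not the actual meta registration added by this PR):
```python
import torch

def channel_shuffle_meta(x, groups):
    c = x.shape[-3]                              # input is (*, C, H, W)
    torch._check(c % groups == 0,
                 lambda: f"channels ({c}) must be divisible by groups ({groups})")
    return torch.empty_like(x)                   # channel_shuffle preserves the input shape
```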
Fixes#122771
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123033
Approved by: https://github.com/ezyang, https://github.com/mikaylagawarecki
**Summary**
Add `matmul` in the quantization recipes, noting that it's not a general recipe but tailored to meet accuracy criteria for specific models. `matmul` recipe is disabled by default.
**Test Plan**
```
python -m pytest quantization/pt2e/test_x86inductor_quantizer.py -k test_attention_block
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122776
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
ghstack dependencies: #122775
**Summary**
Default recipes are enabled in `X86InductorQuantizer` and request comes to customize recipes based on these defaults.
- Avoid annotation propagation and restrict annotation only to annotate `conv`/`linear`.
- Add `matmul` in the quantization recipes, noting that it's not a general recipe but tailored to meet accuracy criteria for specific models.
To meet these requests, we made changes in this PR by introducing interface as `set_function_type_qconfig` and `set_module_type_qconfig`
- `set_function_type_qconfig` accepts functional input as `torch.nn.functional.linear` or `torch.matmul`; `set_module_type_qconfig` accepts nn.Module input as `torch.nn.Conv2d`.
- To disable the recipe for this operator, user can simply exclude it from the list of operations as `quantizer.set_function_type_qconfig(op, None)`.
- To modify or extend the recipe for this operator with default recipe, user can customize as `quantizer.set_function_type_qconfig(op, config)`.
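A usage sketch of the new interface (the default-config helper name is assumed):
```python
import torch
import torch.ao.quantization.quantizer.x86_inductor_quantizer as xiq

quantizer = xiq.X86InductorQuantizer()
default_config = xiq.get_default_x86_inductor_quantization_config()
quantizer.set_global(default_config)

# disable the default recipe for functional linear
quantizer.set_function_type_qconfig(torch.nn.functional.linear, None)
# opt in to the (disabled-by-default) matmul recipe with the default config
quantizer.set_function_type_qconfig(torch.matmul, default_config)
# customize a module type
quantizer.set_module_type_qconfig(torch.nn.Conv2d, default_config)
```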
**Test Plan**
```
python -m pytest quantization/pt2e/test_x86inductor_quantizer.py -k test_filter_conv2d_recipe
python -m pytest quantization/pt2e/test_x86inductor_quantizer.py -k test_filter_linear_recipe
python -m pytest quantization/pt2e/test_x86inductor_quantizer.py -k test_filter_maxpool2d_recipe
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122775
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
Summary:
We separated the FR dump logic from the desync debug logic,
so we no longer set collectiveDebugInfoMode_ to true when we just need the FR
dump. That's why the monitor thread did not sleep and tried to kill the
process without waiting for the dump.
The fix is simple: we should sleep whenever shouldDump_ is true
Test Plan:
Existing unit tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123788
Approved by: https://github.com/wconstab
I want runtime_wrapper args to be stealable by call_func_at_runtime_with_args, since the args may contain activations which we don't want to hold alive in this scope.
The args to runtime_wrapper **should always be** from a list created within aot_autograd, so it **should always be** safe to steal them: a4a49f77b8/torch/_functorch/aot_autograd.py (L928-L932)
There are some accesses after we execute the compiled_fn, but those index accesses are already inferred at compile time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123674
Approved by: https://github.com/jansel, https://github.com/bdhirsh
ghstack dependencies: #123630
`runtime_wrapper` unpacking the arguments as a Tuple[arg] will prevent them from being freed within its scope. This is problematic if inductors wants to free those inputs, which could be activations in the compiled backwards case. This PR only changes the signature to pass as list, but does not clear it, keeping same refcount as before.
Also adding some mypy annotations. Ideally, instead of `Any`, I would want a type to describe single arg which seems to be usually Tensor or int.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123630
Approved by: https://github.com/jansel, https://github.com/bdhirsh
The weights are appended to the end of the shared library and mmapped afterwards
Disabled by default, but overridable by `config.aot_inductor.force_mmap_weights`
Implemented by adding `USE_MMAP_SELF` define to `inductor/aoti_runtime/model.h` which is defined when weights are appended to the binary. In that case, shared library name is determined by calling `dladdr`, mmaped and finally checked against random magic number embedded at the end of the weights as well as in const section of the library in question
Added unit tests to validate that it works as expected
TODO:
- Extend support to CUDA
- munmap region if the same library is reused
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123002
Approved by: https://github.com/jansel, https://github.com/desertfire, https://github.com/mikekgfb
Summary:
Kineto traces use microsecond-level granularity because chrome tracing defaults to that precision. Fix by adding a preprocessor flag to TARGETS and BUCK files. Also remove any unnecessary ns-to-us conversions made in the profiler itself.
This diff contains profiler changes only. Libkineto changes found in D54964435.
Test Plan:
Check JSON and chrome tracing to make sure values are as expected. Tracing with flags enabled should have ns precision. Tracings without flags should be same as master.
Zoomer: https://www.internalfb.com/intern/zoomer/?profiling_run_fbid=796886748550189
Ran key_averages() to make sure the FunctionEvent code is working as expected:
| Name | Self CPU % | Self CPU | CPU total % | CPU total | CPU time avg | Self CUDA | Self CUDA % | CUDA total | CUDA time avg | # of Calls |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ProfilerStep* | 0.74% | 3.976ms | 64.40% | 346.613ms | 69.323ms | 0.000us | 0.00% | 61.710ms | 12.342ms | 5 |
| Optimizer.zero_grad#SGD.zero_grad | 0.76% | 4.109ms | 0.76% | 4.109ms | 821.743us | 0.000us | 0.00% | 0.000us | 0.000us | 5 |
| ## forward ## | 6.89% | 37.057ms | 27.19% | 146.320ms | 29.264ms | 0.000us | 0.00% | 58.708ms | 11.742ms | 5 |
| aten::conv2d | 0.22% | 1.176ms | 7.74% | 41.658ms | 157.199us | 0.000us | 0.00% | 27.550ms | 103.962us | 265 |
| aten::convolution | 0.79% | 4.273ms | 7.52% | 40.482ms | 152.762us | 0.000us | 0.00% | 27.550ms | 103.962us | 265 |
| aten::_convolution | 0.69% | 3.688ms | 6.73% | 36.209ms | 136.637us | 0.000us | 0.00% | 27.550ms | 103.962us | 265 |
| aten::cudnn_convolution | 6.04% | 32.520ms | 6.04% | 32.520ms | 122.719us | 27.550ms | 8.44% | 27.550ms | 103.962us | 265 |
| aten::add_ | 2.42% | 13.045ms | 2.42% | 13.045ms | 30.694us | 12.700ms | 3.89% | 12.700ms | 29.882us | 425 |
| aten::batch_norm | 0.19% | 1.027ms | 8.12% | 43.717ms | 164.971us | 0.000us | 0.00% | 16.744ms | 63.185us | 265 |
| aten::_batch_norm_impl_index | 0.31% | 1.646ms | 7.93% | 42.691ms | 161.096us | 0.000us | 0.00% | 16.744ms | 63.185us | 265 |
Differential Revision: D55925068
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123650
Approved by: https://github.com/aaronenyeshi
The optimization causes regression in some torchbench benchmarks and
with some older versions of nvcc. The regression is preventable, but it
might require additional template specialization which would increase the
binary size.
Reverting it for now to re-evaluate.
Keeping the introduced tests and cuda-to-hip-mappings since these are
not specific to the optimization.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123763
Approved by: https://github.com/janeyx99
https://github.com/pytorch/pytorch/pull/123164 removed the below code (so that constants are not readonly) to support module buffer mutation:
a9a9ce6d9c/torch/_inductor/codecache.py (L1685-L1691)
However, it may cause relocation overflow when the `.data` section is large.
Below is part of the output from `ld --verbose` (`GNU ld (GNU Binutils for Ubuntu) 2.38`). `.data` is in between `.text` and `.bss`. When `.data` is too large, during linking, the relocation of `.text` against `.bss` may overflow. Rename it to `.ldata` (perhaps that's why previously `.lrodata` instead of `.rodata` was used) so that it won't be in between the `.text` and `.bss` sections
```
.text
.rodata
.data
.bss
.lrodata
.ldata
```
We met this issue when fixing https://github.com/pytorch/pytorch/issues/114450 and running the below models on CPU:
- AlbertForMaskedLM
- AlbertForQuestionAnswering
- BlenderbotForCausalLM
- DebertaV2ForMaskedLM
- DebertaV2ForQuestionAnswering
- XGLMForCausalLM
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123639
Approved by: https://github.com/jgong5, https://github.com/desertfire
Fixes#122989. (Note that while the missing symbol issue is fixed, the
test itself is still disabled, because the test runner now segfaults on
`atexit()`; but I think that issue is unrelated to the missing symbol.)
In addition to defining the specializations, I also `= delete`d the
default un-specialized version of `aoti_torch_dtype`, so future missing
dtype references will show up as compile-time instead of link-time
errors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123692
Approved by: https://github.com/chenyang78
In the next PR I force `set_()` input mutation to require always been in the graph.
It's a lot easier to do this if we make our other debugging backends allow input mutations in the graph. Input mutations are relatively hardened at this point, so I'd rather just have our debugging backends consistently allow input mutations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123646
Approved by: https://github.com/ezyang
ghstack dependencies: #122433
Fixes https://github.com/pytorch/pytorch/issues/122436.
The problem was that even though we were detecting when mutations happen under no_grad or not, we were recording these mutations when they happened to the FunctionalTensor - we should really just be recording them on the underlying storage.
In particular, what would happen is that we would mutate an alias under no_grad (marking the mutation as under no_grad properly), but if we use the base tensor outside of the no_grad region, we would lazily regenerate the base at this point, propagate the mutation to the base, and at that point mark that the base witnessed a mutation (outside of the no_grad region)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122433
Approved by: https://github.com/ezyang
Fixes https://github.com/pytorch/pytorch/issues/123651
Previously, when we performed a size oblivious test, we would only modify the lower bound, e.g., if we knew something had range `[0, 100]`, the size oblivious test would do `[2, 100]`. But what if your original range was `[0, 1]`? Naively intersecting this with `[2, sympy.oo]` would result in an empty set: that's a big no no. And in general, this intersection is kind of questionable: if your original range was `[0, 2]`, do we really want to assume that this quantity is exactly equal to 2 in the size oblivious test?
So here's an idea: when we're doing a size oblivious test, just forget about the max bound entirely. The idea is that the max bound probably wasn't actually helping you discharge the size oblivious test (because size oblivious tests are all about "well, if we can assume thing isn't zero or one, we know what the static value is.") So you can use the max bound OR you can use the size oblivious bound, but you're not allowed to use both at the same time. (It doesn't actually seem necessary to use the max bound, but it would be easy to permit this without using the size oblivious refinement.)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123675
Approved by: https://github.com/PaulZhang12
ARC uses dind-rootless which causes bind mounts to always be mounted as the "root" user inside the container rather than the "jenkins" user as expected. We run chown to ensure that the workspace gets mapped to the jenkins user as well as a trap to ensure this change gets reverted when the script ends for any reason. This is the same workaround as in #122922 but adapted for onnx tests.
Issue: pytorch/ci-infra#112
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123641
Approved by: https://github.com/jeanschmidt, https://github.com/seemethere
Recently there has been work in an experimental repo to start implementing the intrinsics necessary handle F8 workloads. (see: https://github.com/pytorch-labs/float8_experimental)
A recent PR was submitted to add support for AMD F8 types (fnuz). This PR uncovered a bug in the rocm code that caused unit tests to fail due to numerical inaccuracy. This PR fixes that bug by swapping `abs_()` with `abs()` as the former performs elementwise absolute value on the tensor in-place causing the final assertion to fail due to the tensor only containing positive values.
Important to note, this fix is part of a workaround as hipblasLT does not yet support amax (HIPBLASLT_MATMUL_DESC_AMAX_D_POINTER). This functionality has been implemented internally and is going through the proper channels to propagate to the community.
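For illustration, the difference that mattered here:
```python
import torch

t = torch.tensor([-3.0, 1.0, -2.0])
amax = t.abs().max()      # out-of-place: t keeps its original signed values
# t.abs_().max() gives the same amax but rewrites t to [3., 1., 2.] in place,
# which is what broke the later numerical assertion in the test
```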
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123275
Approved by: https://github.com/drisspg, https://github.com/jeffdaily
Summary: To unblock training where upsamplenearest2d involves input or output tensors which are larger than 2^31. Comes up frequently in image & video applications.
Test Plan:
```
buck2 test mode/opt //caffe2/test:test_nn_cuda -- test_upsamplingnearest2d_backward_64bit_indexing
```
Benchmarking (N5207417):
```
device_ms, cpu_ms, gb/device_ms*1000
# before changes
118.03993721008301 124.09385920000001 98.72685525972494
# after changes
118.05780944824218 124.10893509999994 98.71190944734577
```
Differential Revision: D55625666
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123682
Approved by: https://github.com/ezyang
If we throw an exception in the "wrong" place we can end up with the dispatch state being in a weird state which can cause all future dispatching to fail. Preserve and restore it as part of `preserve_global_state` so we know it's sane after that.
Also fake_tensor's in_kernel_invocation_manager() was leaving a bit set in the dispatcher (DispatchKey.Dense) which affected follow-on code. Fixed that to reset after as well.
Repro:
before:
```
$ rm test/dynamo_skips/TestSparseCPU.test_to_dense_with_gradcheck_sparse_cpu_complex64
$ PYTORCH_TEST_WITH_DYNAMO=1 pytest -s test/dynamo/test_export.py test/test_sparse.py -k 'test_to_dense_with_gradcheck_sparse_cpu_complex64'
======== 1 passed, 6173 deselected in 5.21s =============
$ PYTORCH_TEST_WITH_DYNAMO=1 pytest -s test/dynamo/test_export.py test/test_sparse.py -k 'test_torch_inference_mode_ctx or test_to_dense_with_gradcheck_sparse_cpu_complex64'
========= 1 skipped, 6172 deselected, 1 error in 5.29s =========
```
(note that test_to_dense_with_gradcheck_sparse_cpu_complex64 passes on its own but failed when including the skipped test_export.py tests)
after:
```
$ rm test/dynamo_skips/TestSparseCPU.test_to_dense_with_gradcheck_sparse_cpu_complex64
$ PYTORCH_TEST_WITH_DYNAMO=1 pytest -s test/dynamo/test_export.py test/test_sparse.py -k 'test_to_dense_with_gradcheck_sparse_cpu_complex64'
===================== 1 passed, 6173 deselected in 5.42s =====================
$ PYTORCH_TEST_WITH_DYNAMO=1 pytest -s test/dynamo/test_export.py test/test_sparse.py -k 'test_torch_inference_mode_ctx or test_to_dense_with_gradcheck_sparse_cpu_complex64'
===================== 1 passed, 1 skipped, 6172 deselected in 7.30s ======================
```
(note that test_to_dense_with_gradcheck_sparse_cpu_complex64 passes in both runs)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122073
Approved by: https://github.com/zou3519
Added a C file that provides the symbols _PyOpcode_Deopt and _PyOpcode_Caches, since they are not exported by the Python lib on Windows (though they are available on Linux); this fixes linking issues on Windows with Python 3.11.
Fixes#93854
Test by running on python 3.11 `python test/functorch/test_dims.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118977
Approved by: https://github.com/ezyang
# Motivation
According to [[RFC] Intel GPU Runtime Upstreaming for Allocator](https://github.com/pytorch/pytorch/issues/116322), we would like to generalize the device and host allocators to be device-agnostic. We prioritize the host allocator as it is simpler and more native than the device allocator. In this PR, we intend to refactor the host allocator so that it can be shared across different backends. In the 2nd PR, we will support the host allocator on the XPU backend.
# Design
The previous design:
- `CUDAHostAllocatorWrapper` inherits from `c10::Allocator`, and `CUDAHostAllocator` is an implementation of `CUDAHostAllocatorWrapper`.
The design in this PR:
- `CachingHostAllocatorImpl` is an interface that implements the caching host allocator logic that can be sharable across each backend.
- `CachingHostAllocatorInterface` inherits from `c10::Allocator` as an interface and accepts `CachingHostAllocatorImpl` as its implementation.
- `CUDACachingHostAllocator` is a CUDA host allocator whose implementation is `CUDACachingHostAllocatorImpl` which is specialized from `CachingHostAllocatorImpl`.
This design can
- share most code of caching mechanism across different backends, and
- keep the flexibility to expand its exclusive feature on each backend.
# Additional Context
In addition, we will continue to generalize the device allocator in the next stage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123079
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/albanD, https://github.com/gujinghui
In FSDP2, we have this:
```python
if all_gather_work is not None: # async op
all_gather_work.wait()
```
In eager, there are only two possible values for `all_gather_work`:
1. `distributed_c10d.Work` object (when `async_op=True`)
2. `None` (when `async_op=False`)
So the existing `if` statement is sufficient for eager mode.
In compile, there is one additional possible value for `all_gather_work`, which is a `FakeTensor` object (not None), because we return a regular tensor for collective calls in compile mode. If we use the existing `if` statement as-is, we will always call `.wait()` on `all_gather_work`, which does not match the eager semantics.
There are a few ways to fix this:
Option 1: Properly support `distributed_c10d.Work` in Dynamo. This is the best long-term fix but it will take much more time to make it work.
Option 2: Allow calling `.wait()` on FakeTensor in compile mode (and just return None there) - this seems hacky because FakeTensor wouldn't normally have this method.
Option 3: Check whether `all_gather_work` is `distributed_c10d.Work` before calling `.wait()` on it. **<-- This PR**
Option 3 is chosen in this PR because it seems to also make the eager program semantics clearer (we don't need to think about whether `all_gather_work` can be `.wait()` on in all scenarios, as long as we know `distributed_c10d.Work` can be waited on).
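For illustration, a minimal sketch of the Option 3 check (the helper name and surrounding code are hypothetical, not the actual FSDP2 source):
```python
from torch.distributed import distributed_c10d

def maybe_wait(all_gather_work):
    # Only a real async Work handle is waited on; None (async_op=False) and the
    # FakeTensor seen under compile fall through without calling .wait().
    if isinstance(all_gather_work, distributed_c10d.Work):
        all_gather_work.wait()

maybe_wait(None)  # no-op in the sync case
```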
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123491
Approved by: https://github.com/awgu
Summary:
We want to bundle both ATen-VK and ET-VK in one library. There's a lot of copied code between the two libraries and most of it's fine since it is guarded by different namespaces. This function is the one exception and so we delete it from ATen-VK.
```
Action failed: fbsource//xplat/wearable/wrist/ml:wristmlcore (cxx_link libwristmlcore.so)
Local command returned non-zero exit code 1
Reproduce locally: `env -- 'BUCK_SCRATCH_PATH=buck-out/v2/tmp/fbsource/c81367d319075390/xplat/wearable/wrist/ml/__wristm ...<omitted>... /fbsource/c81367d319075390/xplat/wearable/wrist/ml/__wristmlcore__/libwristmlcore.so.linker.argsfile (run `buck2 log what-failed` to get the full command)`
stdout:
stderr:
ld.lld: error: duplicate symbol: operator<<(std::__ndk1::basic_ostream<char, std::__ndk1::char_traits<char>>&, VmaTotalStatistics)
>>> defined at Resource.cpp:6 (xplat/caffe2/aten/src/ATen/native/vulkan/api/Resource.cpp:6)
>>> Resource.cpp.pic.o:(operator<<(std::__ndk1::basic_ostream<char, std::__ndk1::char_traits<char>>&, VmaTotalStatistics)) in archive buck-out/v2/gen/fbsource/c81367d319075390/xplat/caffe2/__torch_vulkan_api__/libtorch_vulkan_api.pic.a
>>> defined at Resource.cpp:14 (xplat/executorch/backends/vulkan/runtime/api/Resource.cpp:14)
>>> Resource.cpp.pic.o:(.text._ZlsRNSt6__ndk113basic_ostreamIcNS_11char_traitsIcEEEE18VmaTotalStatistics+0x1) in archive buck-out/v2/gen/fbsource/c81367d319075390/xplat/executorch/backends/vulkan/__vulkan_compute_api__/libvulkan_compute_api.pic.a
clang: error: linker command failed with exit code 1 (use -v to see invocation)
Buck UI: https://www.internalfb.com/buck2/fc1cf878-690d-48ab-acdb-ece2c48dab42
Network: Up: 43MiB Down: 391MiB (reSessionID-830cf8b1-c9c8-474a-b8ed-45c37fceb21b)
Jobs completed: 9227. Time elapsed: 1:31.3s.
Cache hits: 22%. Commands: 5665 (cached: 1261, remote: 4002, local: 402)
BUILD FAILED
Failed to build 'fbsource//xplat/wearable/wrist/ml:wristmlcore (ovr_config//platform/android:arm32-clang-r21e-api29-opt-malibu#c81367d319075390)'
```
Test Plan:
```
LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin
```
Reviewed By: copyrightly
Differential Revision: D55926906
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123659
Approved by: https://github.com/SS-JIA
Summary: Previously, `torch.while_loop` was supported only in JIT inductor (added in https://github.com/pytorch/pytorch/pull/122069). Here we extend the support to AOT Inductor.
Test Plan:
```
$ python test/inductor/test_aot_inductor.py -k test_while_loop
...
----------------------------------------------------------------------
Ran 24 tests in 129.236s
OK (skipped=8)
$ python test/inductor/test_control_flow.py
...
----------------------------------------------------------------------
Ran 50 tests in 136.199s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123586
Approved by: https://github.com/jansel, https://github.com/chenyang78
But appending them to the end of the shared library and mmapping afterwards
Disabled by default, but overridable by `config._force_mmap_aoti_weights`
Implemented by adding `USE_MMAP_SELF` define to `inductor/aoti_runtime/model.h` which is defined when weights are appended to the binary. In that case, shared library name is determined by calling `dladdr`, mmaped and finally checked against random magic number embedded at the end of the weights as well as in const section of the library in question
Added unit tests to validate that it works as expected
TODO:
- Extend support to CUDA
- munmap region if the same library is reused
Co-authored-by: Michael Gschwind <61328285+mikekgfb@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123002
Approved by: https://github.com/jansel, https://github.com/desertfire, https://github.com/mikekgfb
Summary: In `codecache.py` pass the device_interface directly to `_worker_compile()` instead of calling `get_device_interface()` from inside the function.
If the device_interface is registered by an out-of-tree module then it will only be registered inside the main process and not inside the worker process. This change fixes that issue. Happy to add a test if required.
Test plan:
No tests added
Co-authored-by: brothergomez <brothergomez@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122492
Approved by: https://github.com/ezyang
This PR adds a linker script optimization based on prioritized symbols that can be extracted from the profiles of popular workloads. The present linker script was generated to target ARM+CUDA and later can be extended if necessary. The reason we target ARM is shown below:
> PyTorch and other applications that access more than 24x 2MB code regions in quick succession can result in performance bottlenecks in the CPU front-end. The link-time optimization improves executable code locality and improve performance. We recommend turning on the optimization always for PyTorch and other application that behaves similarly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121975
Approved by: https://github.com/ptrblck, https://github.com/atalman
This fixes#123176, and partially addresses #121814 too. #123176 uses an
optional device arg while #121814 uses an optional list arg.
For optional arguments that have auxiliary info -- specifically, tuples
/ lists with their length parameter, and device types with their device
index -- we need to hoist out the extra argument. E.g. when passing a
device with ID 1, we want to emit
```
auto var_0 = cached_torch_device_type_cpu;
aoti_torch_foo(..., &var_0, 1);
```
instead of the (syntactically incorrect)
```
auto var_0 = cached_torch_device_type_cpu,1;
aoti_torch_foo(..., &var_0);
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123613
Approved by: https://github.com/desertfire
Summary:
As title.
Before this change, we used the benchmark result saved in the cache and printed it out every time we called a kernel, even though the information is the same. Let's just print it out at the first iteration.
Test Plan: Local test.
Differential Revision: D55878382
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123568
Approved by: https://github.com/jackiexu1992
The check_fn portion of pattern_matcher was retracing the pattern even if a pre-traced pattern was provided.
I think that as long as the patterns don't have control flow based on their inputs then this should be safe.
For this benchmark
```
python benchmarks/dynamo/huggingface.py --training --amp --performance --only MobileBertForQuestionAnswering --backend=inductor
```
this improves the performance of `joint_graph_passes` from about 9s down to 3s.
In the performance dashboard it seems to be a small win - most of the compilation times dropped by a couple seconds:
Torchbench 126s -> 124s
Huggingface 114s -> 110s
TIMM models 209s -> 208s
Dynamic 44s -> 43s
Blueberries 84s -> 81s
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121314
Approved by: https://github.com/eellison
ghstack dependencies: #121313
Make it easier to serialize patterns by adding `pattern_matcher.gen_register_replacement()` which is like `pattern_matcher.register_replacement()` but also requires the replacement to be precompiled.
To precompile patterns (and save to disk) run:
```
torchgen/fuse_attention_patterns/gen_attention_patterns.py
```
- Updated the sfdp patterns to use `gen_register_replacement`.
- Add serialized patterns for mm_pattern and bmm_pattern (The 'misc' patterns don't serialize cleanly so can't be added).
- Updated the testing so it checked the round-trip patterns match and not just that it serialized the same way.
- Checking that the patterns round-trip properly found that the `users` field wasn't being serialized properly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121313
Approved by: https://github.com/eellison
Sorry to add this to your plate but I hope it helps. I find it ambiguous what "missing keys" and "unexpected keys" are, and the documentation does not add clarity. Today I realized I've been second-guessing myself on this for years.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123637
Approved by: https://github.com/mikaylagawarecki
FSDP2 has this pattern of using user-defined object instance method as hook, and it will throw this error under compile:
`torch._dynamo.exc.Unsupported: call_function UserDefinedObjectVariable(_pre_forward) [FSDPManagedNNModuleVariable(), TupleVariable(), ConstDictVariable()] {}`
This PR adds support for it by always allowing to trace into a UserDefinedObjectVariable that's an instance method (i.e. `MethodType`).
Supersedes https://github.com/pytorch/pytorch/pull/123320.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123399
Approved by: https://github.com/jansel
# Motivation
Previously, `xpu_event` became a dangling pointer because the variable on the stack is destroyed when the scope ends. As a result, the event-related functions (`destroyEvent`, `record`, `block`, and `queryEvent`) used in `c10/core/impl/InlineEvent.h`, which serves `c10::Event`, do not work correctly.
# Solution
Allocate `xpu_event` on the heap with `new` to avoid the dangling pointer.
# Additional Context
Add a UT to cover this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123523
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/albanD
Summary: https://github.com/pytorch/pytorch/issues/123174 causes some internal tests to fail, because when the generated model.so uses the MinimalArrayRefInterface, inputs are in ArrayRefTensor which still need to be converted using convert_arrayref_tensor_to_tensor. So let's bring back the relevant code with an enhanced way to detect numbers.
Test Plan: CI
Differential Revision: D55823570
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123481
Approved by: https://github.com/chenyang78
## Pitch
Fixes https://github.com/pytorch/pytorch/issues/122489.
Don't change the `stride_order` of `FlexibleLayout` if it already has the stride with the order required.
## Description
For a layout that's both contiguous and channels last contiguous (for example `size=[s0, 1, 28, 28]`, `stride=[784, 784, 28, 1]` where the `C` dim is `1`), the behavior of calling [require_stride_order](069270db60/torch/_inductor/ir.py (L4053)) (where the order is specified as channels last) on it is different when it's a `FixedLayout` or a `FlexibleLayout`.
- For a `FixedLayout`, the size and stride is unchanged after the call: `size=[s0, 1, 28, 28]`, `stride=[784, 784, 28, 1]`.
- For a `FlexibleLayout`, it will become `size=[s0, 1, 28, 28]`, `stride=[784, 1, 28, 1])`.
When weight is not prepacked (in dynamic shapes cases), the Conv extern kernel returns output in channels **first** for input with `size=[s0, 1, 28, 28]`, `stride=[784, 784, 28, 1]` but output in channels **last** for `size=[s0, 1, 28, 28]`, `stride=[784, 1, 28, 1])`.
In this PR, for a `FlexibleLayout`, we add a check to see if it already has the stride in the required order. If that's the case, we don't change its stride order when freezing it. This makes the behavior of calling [require_stride_order](069270db60/torch/_inductor/ir.py (L4053)) aligned for `FixedLayout` and `FlexibleLayout`.
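As a quick illustration of the case above (plain PyTorch, not Inductor internals), a tensor of this size/stride is simultaneously contiguous and channels-last contiguous, so its existing stride already satisfies the required order:
```python
import torch

x = torch.empty(2, 1, 28, 28)  # size=[s0, 1, 28, 28], stride=[784, 784, 28, 1]
print(x.stride())                                           # (784, 784, 28, 1)
print(x.is_contiguous())                                    # True
print(x.is_contiguous(memory_format=torch.channels_last))   # True
```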
## Additional context
For a `FixedLayout`, when calling [require_stride_order](069270db60/torch/_inductor/ir.py (L4053)), it will firstly run into [x.get_layout().is_stride_ordered(order)](069270db60/torch/_inductor/ir.py (L4067-L4070)) to check if it's already ordered as expected.
If it is a `FlexibleLayout`, when calling [require_stride_order](069270db60/torch/_inductor/ir.py (L4053)), it runs into [as_storage_and_layout](069270db60/torch/_inductor/ir.py (L4063-L4065)), which will always [freeze_layout_with_stride_order](069270db60/torch/_inductor/ir.py (L1805)) and will always call [as_stride_order](069270db60/torch/_inductor/ir.py (L2909)), without checking if the default stride of this `FlexibleLayout` (which has been realized before) is already as expected ([link](069270db60/torch/_inductor/ir.py (L2693-L2700))).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122945
Approved by: https://github.com/jgong5, https://github.com/jansel
Summary: This was a micro optimization that I thought would save time but it is not correct. For example, we cannot compare fake tensors.
Test Plan:
```
buck2 run 'fbcode//mode/opt' fbcode//langtech/edge/ns/tools/tests:test_ns_jit_traced_model_all_optimization_f328819347_portal_ns
```
now passes
Differential Revision: D55904083
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123616
Approved by: https://github.com/aakhundov
There are some issues in the previous PR (https://github.com/pytorch/pytorch/pull/120985) that added the int8 WOQ mm pattern matcher. This PR further optimizes it.
1. New patterns are added to match int8 woq mm in gpt-fast model, due to different input layouts.
2. In constant folding, `int8_weight -> dq -> bf16_weight` should be kept for pattern match.
3. Currently, GPT-Fast enables `coordinate_descent_tuning` for CPU. This flag is only useful for CUDA, but it could change the graph: from the non-decomposed fallback pass to the decomposed one. We will disable the flag in GPT-Fast script for CPU, in order to have neat patterns. @yanbing-j
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122955
Approved by: https://github.com/jgong5, https://github.com/jansel
**Note**: This is a reopen of https://github.com/pytorch/pytorch/pull/122288, which was merged by `ghstack land` to its base (not main) by mistake.
**Description**
Add qlinear_binary op for X86Inductor backend of quantization PT2E. It only supports `add` and `add_relu` now.
It will use post op sum if the extra input has the same dtype as output. Otherwise, it uses binary add.
```
+-------------------+--------------+---------------+
| Extra input dtype | Output dtype | Post op |
+-------------------+--------------+---------------+
| Fp32/bf16 | fp32/bf16 | sum or add* |
+-------------------+--------------+---------------+
| Fp32/bf16 | int8 | add |
+-------------------+--------------+---------------+
| int8 | fp32/bf16 | not supported |
+-------------------+--------------+---------------+
| int8 | int8 | sum |
+-------------------+--------------+---------------+
*Use sum if extra input and output have the same dtype; otherwise use add.
```
**Test plan**
python test_quantization.py -k test_qlinear_add_pt2e
python test_quantization.py -k test_qlinear_add_relu_pt2e
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123144
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168
Accessing co_lnotab causes a deprecation warning to be issued, causing some dynamo-wrapped tests to fail. We do not need to remove co_lnotab from tests as of now, as they are still useful as an additional check for linetable correctness, but we will need to deal with co_lnotab removal by 3.14.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123577
Approved by: https://github.com/jansel
When minifying, the after-aot minifier ignores non-floating values by default but does check them when running the initial graph dump step.
This means we may capture a graph that doesn't fail the tester and doesn't have
any meaningful divergence.
For example, the derivative of `elu(x)` depends on `x > 0` so this value is
saved for backwards and so becomes a graph output. However, the difference
between `FLT_MIN` and `0` in `x` is now enough to trigger an accuracy failure.
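A small, hypothetical illustration of that divergence: the boolean mask saved for elu's backward flips between FLT_MIN and 0 even though the forward outputs are essentially identical.
```python
import torch
import torch.nn.functional as F

x_ref = torch.tensor([1.1754944e-38])   # roughly FLT_MIN
x_min = torch.tensor([0.0])
print(F.elu(x_ref), F.elu(x_min))       # both ~0, no meaningful divergence
print(x_ref > 0, x_min > 0)             # tensor([True]) vs tensor([False])
```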
I fix this by adding a config variable and an environment variable to ignore these non-floating-point values.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123006
Approved by: https://github.com/ezyang
ghstack dependencies: #123005
Given the following code/dynamo graph:
```
class GraphModule(torch.nn.Module):
def forward(self, L_x_ : torch.Tensor):
l_x_ = L_x_
_print = torch.ops.aten._print('moo')
res = l_x_ + l_x_; l_x_ = None
_print_1 = torch.ops.aten._print('moo')
return (res,)
```
AOTAutograd will trace the following program, threading tokens from the inputs, through the effectful operator calls (torch.ops.aten._print), and as an output:
```
class <lambda>(torch.nn.Module):
def forward(self, arg0_1: "f32[0]", arg1_1: "f32[2, 3]"):
with_effects = torch._higher_order_ops.effects.with_effects(arg0_1, torch.ops.aten._print.default, 'moo'); arg0_1 = None
getitem: "f32[0]" = with_effects[0]; with_effects = None
add: "f32[2, 3]" = torch.ops.aten.add.Tensor(arg1_1, arg1_1); arg1_1 = None
with_effects_1 = torch._higher_order_ops.effects.with_effects(getitem, torch.ops.aten._print.default, 'moo'); getitem = None
getitem_2: "f32[0]" = with_effects_1[0]; with_effects_1 = None
return (getitem_2, add)
```
However when we get to inductor, since we want the inductor generated code to not have any token inputs/outputs for better readability, we want to modify the aten graph by removing the tokens from inputs, and creating them through `torch.ops.aten._make_dep_token`, and sinking them through the `torch.ops.aten._sink_tokens` operators.
This has to be done *after* the partitioner, otherwise the partitioner will add the make_token/sink_token operators to the backwards graph.
```
class <lambda>(torch.nn.Module):
def forward(self, arg1_1: "f32[2, 3]"):
_make_dep_token_default: "f32[0]" = torch.ops.aten._make_dep_token.default()
with_effects = torch._higher_order_ops.effects.with_effects(_make_dep_token_default, torch.ops.aten._print.default, 'moo'); _make_dep_token_default = None
getitem: "f32[0]" = with_effects[0]; with_effects = None
add: "f32[2, 3]" = torch.ops.aten.add.Tensor(arg1_1, arg1_1); arg1_1 = None
with_effects_1 = torch._higher_order_ops.effects.with_effects(getitem, torch.ops.aten._print.default, 'moo'); getitem = None
getitem_2: "f32[0]" = with_effects_1[0]; with_effects_1 = None
_sink_tokens_default = torch.ops.aten._sink_tokens.default((getitem_2,)); getitem_2 = None
return (add,)
```
When doing inductor lowering, we convert `with_effects` calls to an `EffectfulKernel`, which just a `FallbackKernel` but with a pointer to previous effectful operator's call. During scheduling, we will create a `StarDep` between the EffectfulKernel and its previous EffectfulKernel so that they don't get reordered. The inductor generated python code looks like:
```
def call(args):
arg1_1, = args
args.clear()
assert_size_stride(arg1_1, (2, 3), (3, 1))
# Source Nodes: [_print], Original ATen: []
buf2 = aten._print.default('moo')
# Source Nodes: [_print_1], Original ATen: []
buf3 = aten._print.default('moo')
buf4 = empty_strided_cpu((2, 3), (3, 1), torch.float32)
cpp_fused_add_0(arg1_1, buf4)
del arg1_1
return (buf4, )
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122347
Approved by: https://github.com/bdhirsh
There are many existing ProcessGroupNCCL features controlled by env vars. This PR adds TORCH_NCCL_HIGH_PRIORITY to force the use of high-priority CUDA or HIP streams for the NCCL or RCCL kernels, respectively.
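A minimal usage sketch, assuming the environment variable is read when the NCCL process group is set up (the init call below is illustrative):
```python
import os
os.environ["TORCH_NCCL_HIGH_PRIORITY"] = "1"  # request high-priority NCCL/RCCL streams

import torch.distributed as dist
# dist.init_process_group("nccl", ...)        # created after the env var is set
```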
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122830
Approved by: https://github.com/kwen2501
This PR is part of an effort to speed up torch.onnx.export (#121422).
- Building vals_to_params_map costs linear time in N (number of nodes), when instead we can index into this dictionary directly.
- No need to call HasField on the final else, since c10::nullopt is the default returned value if a field does not exist.
- Resolves (3) in #121422.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123025
Approved by: https://github.com/BowenBao, https://github.com/thiagocrepaldi
Summary: This test is actually broken and probably succeeding by mistake because of a cache hit. Forcing a fresh cache or removing the errant setting causes a consistent failure. Disabling for now until we have time to investigate further.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123211
Approved by: https://github.com/desertfire
Summary:
Previously we were serializing namedtuple treespecs incorrectly:
```python
Point = namedtuple("Point", ["x", "y"])
p = Point(1, 2)
flat, spec = pytree.tree_flatten(p)
print(flat) # [1, 2]
print(spec) # TreeSpec(type=namedtuple, context=Point, children=[*, *])
dumped_spec = pytree.treespec_dumps(spec)
print(dumped_spec)
"""
We only serialize the name of the class and the fields of the namedtuple:
TreeSpec {
type='collections.namedtuple',
context={class_name='Point', class_fields={'x', 'y'}},
children=[Leaf, Leaf]
}
"""
reconstructed_spec = pytree.treespec_loads(dumped_spec)
print(reconstructed_spec)
"""
When we load, we create a new namedtuple class containing the same fields as before,
but the class is now a completely different class from the original one:
TreeSpec(type=namedtuple, context=torch.utils._pytree.Point, children=[*, *])
"""
spec == reconstructed_spec # False
```
So, we introduce a new API called `pytree._register_namedtuple` where users can pass in the serialized name for each namedtuple class:
```python
Point = namedtuple("Point", ["x", "y"])
pytree._register_namedtuple(Point, "Point")
p = Point(1, 2)
flat, spec = pytree.tree_flatten(p)
print(flat) # [1, 2]
print(spec) # TreeSpec(type=namedtuple, context=Point, children=[*, *])
dumped_spec = pytree.treespec_dumps(spec)
print(dumped_spec)
"""
TreeSpec {
type='collections.namedtuple',
context='Point',
children=[Leaf, Leaf]
}
"""
reconstructed_spec = pytree.treespec_loads(dumped_spec)
print(reconstructed_spec) # TreeSpec(type=namedtuple, context=Point, children=[*, *])
spec == reconstructed_spec # True
```
Test Plan: `python test/test_pytree.py`
Differential Revision: D55771058
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123388
Approved by: https://github.com/zou3519
After we codegen a triton kernel in the triton codegen backend,
we cache the generated triton source code in the wrapper to avoid
producing multiple triton kernels with the same content.
In AOTI compilation flow, this caching mechanism imposes a strong requirement
on the codegen that we must generate the same triton source code
for the same schedule node in both python and cpp codegen phases.
Otherwise, we would end up with a mismatch between the kernel name
formed in the cpp codegen and the cuda kernel key produced from
the python codegen. Consequently, we would hit a missing-cuda-kernel
error.
The precomputed symbol replacements saved in V.graph.sizevars
can cause such source-code inconsistency related to the code for indexing
tensors. For example, let's say in the python codegen phase,
we produce "ks2\*48" as part of indexing an input for schedule
node A while yielding a replacement pair "ks0 -> ks2\*48" in
the precomputed replacements. In the second cpp codegen phase,
we would produce "ks0" for the same indexing code of schedule
node A due to the "ks0 -> ks2*48" replacement pair.
This PR fixed the issue by clearing precomputed_replacements
and inv_precomputed_replacements before cpp wrapper codegen.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123136
Approved by: https://github.com/desertfire
This would have saved me a few hours while debugging an internal model. We could not support a LOAD_ATTR bytecode because it was a property, and the inlining failed due to a skip. Since LOAD_ATTR does not support a continuation function, we fall back to eager for the whole frame, aka skip. But we should also log this as a graph break. This PR does that.
Bonus - removes skip from a test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122684
Approved by: https://github.com/ezyang
We also add the usual comment where we note that we don't handle
negative values in mod properly.
We should also fix this in the definition of ModularIndexing. I'll do that
in a later PR, as for that one I'll also need to fix a number of tests that
are testing an incorrect behaviour.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123253
Approved by: https://github.com/peterbell10
FSDP2 creates CUDA streams outside of compile region in its 1st iteration eager run, and then torch.compile will attempt to record method calls on these streams (e.g. `stream.record_event()`) in >1st iteration compiled run.
Before this PR, the stream proxy was None, which caused a "None doesn't have attribute record_event" error when we tried to call `record_event()` on it. After this PR, the stream proxy has the correct value, which makes calling methods on it possible.
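A minimal sketch of the pattern being described (hypothetical code, requires a CUDA device): the stream is created eagerly and its methods are later traced under torch.compile.
```python
import torch

if torch.cuda.is_available():
    stream = torch.cuda.Stream()          # created outside the compiled region

    @torch.compile
    def step(x):
        stream.record_event()             # method call on the captured stream
        return x + 1

    step(torch.ones(4, device="cuda"))
```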
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123487
Approved by: https://github.com/jansel
Summary:
Kineto traces use microsecond-level granularity because chrome tracing defaults to that precision. Fix by adding a preprocessor flag to TARGETS and BUCK files. Also remove any unnecessary ns-to-us conversions made in the profiler itself.
This diff contains profiler changes only. Libkineto changes found in D54964435.
Test Plan:
Check JSON and chrome tracing to make sure values are as expected. Tracing with flags enabled should have ns precision. Tracings without flags should be same as master.
Tracing with flags enabled: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Mar_18_14_37_22.4155151.pt.trace.json.gz&bucket=gpu_traces
Tracing without flags enabled: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Mar_18_14_39_15.4166047.pt.trace.json.gz&bucket=gpu_traces
Tracing on main: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Mar_18_14_42_43.4177559.pt.trace.json.gz&bucket=gpu_traces
Ran key_averages() to make sure the FunctionEvent code is working as expected:
```
Name                                 Self CPU %   Self CPU   CPU total %   CPU total   CPU time avg   Self CUDA   Self CUDA %   CUDA total   CUDA time avg   # of Calls
ProfilerStep*                             0.74%    3.976ms        64.40%   346.613ms       69.323ms     0.000us         0.00%     61.710ms        12.342ms            5
Optimizer.zero_grad#SGD.zero_grad         0.76%    4.109ms         0.76%     4.109ms      821.743us     0.000us         0.00%      0.000us         0.000us            5
## forward ##                             6.89%   37.057ms        27.19%   146.320ms       29.264ms     0.000us         0.00%     58.708ms        11.742ms            5
aten::conv2d                              0.22%    1.176ms         7.74%    41.658ms      157.199us     0.000us         0.00%     27.550ms       103.962us          265
aten::convolution                         0.79%    4.273ms         7.52%    40.482ms      152.762us     0.000us         0.00%     27.550ms       103.962us          265
aten::_convolution                        0.69%    3.688ms         6.73%    36.209ms      136.637us     0.000us         0.00%     27.550ms       103.962us          265
aten::cudnn_convolution                   6.04%   32.520ms         6.04%    32.520ms      122.719us    27.550ms         8.44%     27.550ms       103.962us          265
aten::add_                                2.42%   13.045ms         2.42%    13.045ms       30.694us    12.700ms         3.89%     12.700ms        29.882us          425
aten::batch_norm                          0.19%    1.027ms         8.12%    43.717ms      164.971us     0.000us         0.00%     16.744ms        63.185us          265
aten::_batch_norm_impl_index              0.31%    1.646ms         7.93%    42.691ms      161.096us     0.000us         0.00%     16.744ms        63.185us          265
```
Differential Revision: D55087993
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122425
Approved by: https://github.com/aaronenyeshi
Debugging is happening in https://github.com/pytorch/pytorch/issues/123126 .
Upgrading triton causes accuracy failures for mixer_b16_224 and levit_128.
mixer_b16_224 was debugged specifically. The failure is due to extra FMA instructions being used in a single kernel. That kernel itself only introduces a small numerical difference. We conclude that this is not a 'real' accuracy issue and we should raise the tolerance to unblock the triton pin update.
The tolerance is picked such that the CI accuracy test can pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123484
Approved by: https://github.com/jansel
Summary:
Even with the changes in D55347133, it is still possible to OOM in the histogram observer, because the size of the allocated tensor also depends on *downsample_rate*.
For example, I still see OOM due to an attempt to allocate a 10GB+ histogram tensor in a multi-task model.
To fix the OOM issue more robustly, we use a *try-catch* clause to avoid OOM.
Empirically, we cap the size of a single histogram tensor at 1 GB.
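A minimal, hypothetical sketch of the guard (not the actual observer code): cap the histogram allocation at 1 GB and catch allocation failures as a fallback.
```python
import torch

MAX_HISTOGRAM_BYTES = 1 << 30  # the empirical 1 GB cap mentioned above

def try_allocate_histogram(numel, dtype=torch.float32):
    if numel * torch.finfo(dtype).bits // 8 > MAX_HISTOGRAM_BYTES:
        return None                  # would exceed the cap; skip the upscale
    try:
        return torch.zeros(numel, dtype=dtype)
    except RuntimeError:             # allocation can still fail (OOM)
        return None
```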
Test Plan: Test the change for Multi-Task model (depth + segmentation)
Differential Revision: D55567292
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123309
Approved by: https://github.com/jerryzh168
Summary: Now that we can pass shapes as input args to RecordFunctionFast, let's add that to the triton heuristics. Also, let's add the ability to pass a tuple into the RecordFunctionFast constructor.
Test Plan:
Ran both the _inductor/test_profile.py and profiler/test_profiler.py unit tests. Also added tuple based unit test to profiler/test_profiler.py
Ran record_function_fast.py from the following branch
https://github.com/pytorch/pytorch/compare/sraikund/record_funct_test?expand=1
No shape or args: tests function fast with no args and profile without record_shapes
With shape tests: tests function fast with args and profile with record_shapes true
Args no shape: tests function fast with args inputted but record_shapes set to false
Args shape tuple: tests function fast with args inputted in form of tuple and record_shapes true
Stdout:
No shape or args:: 1.8491458892822266 us
With shape:: 2.211381196975708 us
Args no shape:: 1.9212646484375 us
With shape tuple:: 2.245788335800171 us
Differential Revision: D55809967
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123459
Approved by: https://github.com/davidberard98
For some reason, adding a `TYPE_CHECK` in DATA_PTR_MATCH guard in https://github.com/pytorch/pytorch/issues/123302 increases optimizer guard overhead for `MT5ForConditionalGeneration` by 10x. There is nothing special about MT5. As we are going to move towards the CPP guards soon, there is no reason to investigate this deeper.
We can use `ID_MATCH` instead of `DATA_PTR` match. Today neither can be serialized, so there is no preference of one over the other.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123485
Approved by: https://github.com/mlazos
`JUMP_BACKWARD` in 3.12+ may not be in the exception table even though it should be considered a part of the block. Also fix an issue where we didn't propagate the exception table entry to new instructions when expanding the `POP_JUMP_IF_[NOT_]NONE` instruction.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123392
Approved by: https://github.com/jansel
Summary:
Build a hook for an external mechanism to monitor the health of the torch elastic launcher. The health check server takes a dependency on FileTimerServer to check whether the launcher is healthy; it is always considered healthy if FileTimerServer is disabled.
The implementation of start_healthcheck_server is unsupported here, but a tcp/http server can be started on a specific port to monitor the aliveness of worker_watchdog and take action accordingly.
Test Plan: buck test mode/opt caffe2/test/distributed/elastic/agent/server/test:local_agent_test
Differential Revision: D55108182
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122750
Approved by: https://github.com/kurman
Speeds up the guard-overhead microbenchmark by around 10% normalized to main-branch CPP guards
~~~
import torch

@torch.compile(backend="eager")
def fn(x, lst):
    for l in lst:
        x = x + l
    return x

n = 1000
lst = [i for i in range(n)]
x = torch.randn(4)
print(fn(x, lst))
print("Success")
~~~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123396
Approved by: https://github.com/jansel
ghstack dependencies: #123285, #123302, #123303
The user provides a `setup_context` and a `backward_function`. These
get put into a torch.autograd.Function that gets registered as the
custom op's autograd implementation.
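A hedged sketch using the present-day torch.library custom-op API; the exact names and signatures may differ from the state of the API in this stack.
```python
import torch

@torch.library.custom_op("mylib::my_sin", mutates_args=())
def my_sin(x: torch.Tensor) -> torch.Tensor:
    return torch.sin(x)

def setup_context(ctx, inputs, output):
    (x,) = inputs
    ctx.save_for_backward(x)

def backward(ctx, grad):
    (x,) = ctx.saved_tensors
    return grad * x.cos()

# The user-provided setup_context/backward are wrapped into a torch.autograd.Function
# that is registered as the custom op's autograd implementation.
my_sin.register_autograd(backward, setup_context=setup_context)

my_sin(torch.randn(3, requires_grad=True)).sum().backward()
```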
Test Plan:
- we update custom ops in the custom_op_db to use the new
register_autograd API.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123110
Approved by: https://github.com/albanD
ghstack dependencies: #123108, #123109
Previously it worked with torchgen.model.FunctionSchema. This PR extends
it to work with torch._C._FunctionSchema by making
torchgen.model.FunctionSchema look more like torch._C._FunctionSchema.
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123108
Approved by: https://github.com/albanD
Previously, `node.meta["nn_module_stack"]` had type `Dict[str, Tuple[str, class]]` when exported, and later `Dict[str, Tuple[str, str]]` after de/serialization. This PR changes it to consistently be `Dict[str, Tuple[str, str]]` for round-trippability, i.e.
```
{..., 'L__self___conv': ('conv', 'torch.nn.modules.conv.Conv2d')}
```
`source_fn_stack` is left untouched in this PR.
note: the `Union[type, str]` type annotations in ONNX are because ONNX goes through both `export.export()` and `_dynamo.export()` (which still has the original `Dict[str, Tuple[str, class]]` format). nn_module_stack from `export.export()` should consistently have the new format, and we verify/test for that in `_trace.py`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123308
Approved by: https://github.com/zhxchen17, https://github.com/thiagocrepaldi
Summary: CPP wrapper compilation is currently done in two passes: in the first pass, Python wrapper is generated and run to compile Triton kernels as a side effect, in the second pass C++ wrapper is generated and compiled. When model inputs are mutated, running the Python wrapper in the first pass mutates the inputs, although the first pass (including the Python wrapper run) is strictly a part of the compilation process, hence must not introduce any side effects on the example inputs.
In this PR, we clone mutated inputs in the first pass to avoid input mutation.
Fixes https://github.com/pytorch/pytorch/issues/117364.
Test Plan:
```
$ TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor.py -k test_inductor_layout_optimization_input_mutations_cuda
...
.
----------------------------------------------------------------------
Ran 1 test in 6.368s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123316
Approved by: https://github.com/jansel, https://github.com/chenyang78, https://github.com/desertfire
`backend_aot_accuracy_fails` reruns `compile_fx_inner` on the real inputs which
means the graph is recompiled with static shapes. This meant accuracy failures
related to dynamic shapes would never be captured by `REPRO_AFTER=aot`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123005
Approved by: https://github.com/ezyang
Summary:
Previously the validation logic assumed the sharded tensors' global ranks range from `[0 .. WS]`.
This is true if we do 1d flat sharding.
But once we get into 2d+, the ranks may not be contiguous any more.
e.g.
```
[0, 2]
[1, 3]
```
The group size is 2 but ranks may be >= 2.
Going forward, the ST will be replaced by DTensor so it's less of an issue but this is just to make it work for stacks still relying on ST (like torchrec).
Test Plan:
added UT
CI
Differential Revision: D55671872
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123230
Approved by: https://github.com/kwen2501
Summary:
Pass python c10d group_name to c++ ProcessGroupNCCL so that the pg name will be consistent across different layers.
Also record pg_name in flight recorder entry.
Differential Revision: D55597200
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123117
Approved by: https://github.com/wconstab
Summary:
This PR restores original names to placeholder nodes, replacing the default names arg0_1, arg1_1, and so on.
User inputs now follow the signature of mod.forward(), for example forward(x, y) produces nodes x, y. If the tensors are nested in dictionaries, lists, tuples, or dataclasses, the names are a concatenation of the path to the tensor, e.g. x = {'a': torch.randn(4), 'b': [torch.randn(4), torch.randn(4)]} produces nodes x_a, x_b_0, x_b_1.
Parameters, buffers, constants, and custom objects follow the FQN of the object, prefixed by "p", "b", "c", and "obj" respectively. For example, self.bar.l0.weight gets you p_bar_l0_weight.
Effect tokens are named token_1, token_2, and so on, since they are not grounded in model inputs or named attributes.
note: breaking the original diff into 3 parts (top-level renaming, higher-order-op subgraphs, constant input de/serialization) because of its size.
Examples:
```python
# params, buffers, constants, inputs, torch.cond
ExportedProgram:
class GraphModule(torch.nn.Module):
def forward(self, p_l0_weight: "f32[4, 4]", p_l0_bias: "f32[4]", c_alpha: "f32[4]", b_beta: "f32[4]", x_0_a: "f32[4, 4]", y: "f32[4, 4]"):
# No stacktrace found for following nodes
mul: "f32[4, 4]" = torch.ops.aten.mul.Tensor(x_0_a, x_0_a)
t: "f32[4, 4]" = torch.ops.aten.t.default(p_l0_weight); p_l0_weight = None
addmm: "f32[4, 4]" = torch.ops.aten.addmm.default(p_l0_bias, y, t); p_l0_bias = y = t = None
return addmm
# model code
class Bar(torch.nn.Module):
def forward(self, x):
return x * x
class Foo(torch.nn.Module):
def __init__(self):
super().__init__()
self.bar = Bar()
self.l0 = torch.nn.Linear(4, 4)
self.alpha = torch.randn(4)
self.register_buffer('beta', torch.randn(4))
def forward(self, x, y):
x = x[0]['a']
mul = self.bar(x)
z1 = self.l0(y)
return z1
# custom objects, dataclasses, tokens, constant inputs
ExportedProgram:
class GraphModule(torch.nn.Module):
def forward(self, token_1: "f32[0]", obj_attr, data_x: "f32[4, 4]", data_y: "f32[4, 4]", mode):
# No stacktrace found for following nodes
mul: "f32[4, 4]" = torch.ops.aten.mul.Scalar(data_x, 30); data_x = None
div: "f32[4, 4]" = torch.ops.aten.div.Tensor_mode(data_y, 1.0, rounding_mode = 'floor'); data_y = None
add: "f32[4, 4]" = torch.ops.aten.add.Tensor(mul, div); mul = div = None
with_effects = torch._higher_order_ops.effects.with_effects(token_1, torch.ops._TorchScriptTesting.takes_foo.default, obj_attr, add); token_1 = obj_attr = add = None
getitem: "f32[0]" = with_effects[0]
getitem_1: "f32[4, 4]" = with_effects[1]; with_effects = None
return (getitem, getitem_1)
# model code
class Foo(torch.nn.Module):
def __init__(self):
super().__init__()
self.attr = torch.classes._TorchScriptTesting._Foo(10, 20)
def forward(self, data, a=1.0, mode="floor"):
x = self.attr.add_tensor(data.x) + torch.div(data.y, a, rounding_mode=mode)
x = torch.ops._TorchScriptTesting.takes_foo(self.attr, x)
return x
dataclass
class DataClass:
x: Tensor
y: Tensor
register_dataclass_as_pytree_node(
DataClass,
serialized_type_name="test.DataClass"
)
args = (DataClass(x=torch.randn(4, 4), y=torch.randn(4, 4)), )
kwargs = {'mode': 'floor'}
ep = torch.export.export(Foo(), args, kwargs, strict=False)
```
Test Plan: verification checks on placeholder names for all export() calls, unit test in test/export/test_export.py
Differential Revision: D55456418
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122904
Approved by: https://github.com/angelayi, https://github.com/thiagocrepaldi
As the design in RFC https://github.com/pytorch/pytorch/issues/114856, this PR implemented Intel GPU Inductor backend by:
- Reuse WrapperCodegen and TritonScheduling for python wrapper and kernel code generation. And implenented device-specific code generation in XPUDeviceOpOverrides
- Reuse fx_pass, lowering, codecache, triton kernel auto-tuning, and compilation.
For the test case, this PR provided test/inductor/test_xpu_basic.py for basic inductor backend functionality testing.
We'll reuse all the existing Inductor test case in the next PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121895
Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/desertfire
**Summary**
Refactor the `Scheduler.fuse_nodes` changes in https://github.com/pytorch/pytorch/pull/121625. In the previous implementation of `Scheduler.fuse_nodes` in https://github.com/pytorch/pytorch/pull/121625, we use the `enable_outer_loop_fusion` context to ensure `OuterLoopFusion` happens after all the norm fusions.
And there is a discussion in https://github.com/pytorch/pytorch/pull/121625/files#r1527177141 about reusing the current `score_fusion` mechanism. However, given that [fuse_nodes](f4ff063c33/torch/_inductor/scheduler.py (L1679-L1698)) will invoke `fuse_nodes_once` 10 times, we are concerned that the score approach may disrupt pairs of regular fusion nodes in the 2nd invocation of `fuse_nodes_once` if they have been picked up by the outer loop fusion in the 1st invocation of `fuse_nodes_once`.
In this PR, we propose adding an abstract of `filter_possible_fusions_by_priority`. In each invoking of `fuse_nodes_once`, the possible fusions will be grouped by their priority from the backend. And only the group of possible fusions with highest priority will be fused in this invocation. In this way, we can ensure `OuterLoopFusion` happens after all the norm fusions.
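A hypothetical sketch of that abstraction (illustrative names only, not the actual Scheduler code): in each round, group the candidate fusions by the priority reported by the backend and keep only the highest-priority group.
```python
def filter_possible_fusions_by_priority(possible_fusions, get_priority):
    groups = {}
    for pair in possible_fusions:
        groups.setdefault(get_priority(pair), []).append(pair)
    # assume a smaller value means a higher priority, so outer-loop fusion
    # (lower priority) only runs once the regular fusions are exhausted
    return groups[min(groups)]
```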
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123067
Approved by: https://github.com/lezcano, https://github.com/jgong5
ghstack dependencies: #121625
Fixes https://github.com/pytorch/pytorch/issues/123068
Fixes https://github.com/pytorch/pytorch/issues/111256
While investigating the flaky doc build failure w.r.t. the duplicated `torch.ao.quantization.quantize` docstring warning, i.e. https://github.com/pytorch/pytorch/actions/runs/8532187126/job/23376591356#step:10:1260, I discovered an old but still open bug in Sphinx https://github.com/sphinx-doc/sphinx/issues/4459. These warnings have always been there, but they were hidden because we are using `-j auto` to build docs with multiple threads. It's just by chance that they have started to surface now.
The issue can be reproduced by removing `-j auto` from https://github.com/pytorch/pytorch/blob/main/docs/Makefile#L5 and run `make html` locally. Then, these warnings shows up consistently. As `make html` treats warnings as errors, they will fail the build.
```
...
/data/users/huydo/conda/py3.8/lib/python3.8/site-packages/torch/ao/quantization/quantize.py:docstring of torch.ao.quantization.quantize.quantize:1: WARNING: duplicate object description of torch.ao.quantization.quantize, other instance in quantization, use :noindex: for one of them
/data/users/huydo/conda/py3.8/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py:docstring of torch.nn.parallel.data_parallel.data_parallel:1: WARNING: duplicate object description of torch.nn.parallel.data_parallel, other instance in nn, use :noindex: for one of them
/data/users/huydo/conda/py3.8/lib/python3.8/site-packages/torch/nn/utils/spectral_norm.py:docstring of torch.nn.utils.spectral_norm.spectral_norm:1: WARNING: duplicate object description of torch.nn.utils.spectral_norm, other instance in nn, use :noindex: for one of them
/data/users/huydo/conda/py3.8/lib/python3.8/site-packages/torch/nn/utils/weight_norm.py:docstring of torch.nn.utils.weight_norm.weight_norm:1: WARNING: duplicate object description of torch.nn.utils.weight_norm, other instance in nn, use :noindex: for one of them
/data/users/huydo/github/pytorch/docs/source/nn.rst:579: WARNING: duplicate object description of torch.nn.parallel.data_parallel, other instance in generated/torch.nn.functional.torch.nn.parallel.data_parallel, use :noindex: for one of them
/data/users/huydo/github/pytorch/docs/source/nn.rst:594: WARNING: duplicate object description of torch.nn.utils.spectral_norm, other instance in generated/torch.nn.utils.spectral_norm, use :noindex: for one of them
/data/users/huydo/github/pytorch/docs/source/nn.rst:595: WARNING: duplicate object description of torch.nn.utils.weight_norm, other instance in generated/torch.nn.utils.weight_norm, use :noindex: for one of them
/data/users/huydo/github/pytorch/docs/source/quantization.rst:1348: WARNING: duplicate object description of torch.ao.quantization.quantize, other instance in generated/torch.ao.quantization.quantize, use :noindex: for one of them
...
```
The fix is just to clean up those duplicated placeholder py:module docs, which were there because these modules didn't have any docs originally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123244
Approved by: https://github.com/andrewor14, https://github.com/malfet
Summary:
https://fb.workplace.com/groups/1075192433118967/permalink/1402410947063779/
Some investigation: the large-K pattern degrades performance, so we remove those patterns. Though the mmt pattern indeed shows some gain when compiling a single operator (P1201328502 and P1201328722), it conflicts with other optimizations in Inductor and results in a slowdown.
Test Plan:
Some results from the benchmark, with strides held manually:
```
import torch
import torch._inductor.config as inductor_config
import triton
inductor_config.trace.enabled = True
m1 = torch.rand(9388864, 2, device="cuda", dtype=torch.bfloat16)
m2 = torch.rand(9388864, 12, device="cuda", dtype=torch.bfloat16)
print(f"m1.stride {m1.stride()}")
print(f"m2.stride {m2.stride()}")
@torch.compile
def fake_mm(a, b):
    return torch.sum(a[:, :, None] * b[:, None, :], dim=0)
tmp = fake_mm(m1, m2)
print(tmp.shape)
s = triton.testing.do_bench(lambda: fake_mm(m1, m2))
print(f"fake mm{s}")
tmp2 = torch.mm(m1.permute(1, 0), m2)
s = triton.testing.do_bench(lambda: torch.mm(m1.permute(1, 0), m2))
print(print(f"mm{s}"))
m3 = m1.permute(1, 0).contiguous()
s = triton.testing.do_bench(lambda: torch.mm(m1.permute(1, 0).contiguous(), m2))
print(print(f"mm without permute{s}"))
result:
fake mm14.968459129333496
mm507.6383972167969
mm without permute0.7466956973075867
```
single kernel can be speed up from 5ms->3ms
{F1477685597}
{F1477685813}
Differential Revision: D55759235
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123371
Approved by: https://github.com/mengluy0125
Summary:
Deserialization of metadata could encounter a bug where commas are used in valid metadata names. This specifically occurs when a split of a `torch.nn.Sequential` stack is used, but may have other possible triggers. Because the deserialization relies on a comma-based string split, such names trigger an error. This change uses a simple regular expression to ignore commas within parentheses to avoid the issue.
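A generic illustration of the failure mode (the names and the regex below are illustrative, not necessarily the exact ones used in the fix): a plain comma split breaks names containing commas inside parentheses, while a parenthesis-aware split keeps them intact.
```python
import re

names = "getattr(L['self'], '0'),getattr(L['self'], '1')"
print(names.split(","))
# ["getattr(L['self']", " '0')", "getattr(L['self']", " '1')"]
print(re.split(r",(?![^(]*\))", names))
# ["getattr(L['self'], '0')", "getattr(L['self'], '1')"]
```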
I add a test that constructs one such problematic sequential stack and show that it can be properly round-tripped with the improved splitting.
Similarly, deserialization could fail when outputs are not a tensor type. Although such outputs like None or constants are not very useful, they do show up in graphs and export should be able to support them. This change improves output node parsing and adds a corresponding test.
Test Plan: buck test //caffe2/test:test_export -- TestSerialize
Differential Revision: D55391674
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122793
Approved by: https://github.com/zhxchen17
Summary: Fixing https://github.com/pytorch/pytorch/issues/123174. There are two problems here,
* Incorrectly calling convert_arrayref_tensor_to_tensor on int arguments. Removing relevant code since we don't use ArrayRef when there is a fallback op.
* codegen_kwargs generates an argument for the out parameter of ExternKernelOut. The fix is to leave that logic to corresponding wrapper codegen.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123346
Approved by: https://github.com/chenyang78
This addresses 2 issues with stack_trace metadata:
- stack_trace is currently missing from nodes in non-strict export
- in strict mode, stack_trace is populated for placeholder nodes, which may not be well-defined (with multiple uses)
We filter the call stack during tracing for calls from forward() methods, or ops in `torch.__init__.py` (e.g. sym_size_int, sym_constrain_range, etc.) to populate stack_trace. A node-level check is also added to _export_non_strict().
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121034
Approved by: https://github.com/angelayi
In Triton's [softmax tutorial](https://triton-lang.org/main/getting-started/tutorials/02-fused-softmax.html), native performance is significantly lower than Triton's. We accelerated the native code as follows:
--> Wrote a CUDA kernel `cunn_SoftMaxForwardSmem` for softmax forward that caches the inputs in shared memory. Currently the maximum usable shared memory is 48KB to preserve compatibility with older generation Kepler GPUs but we can increase this. This kernel uses vectorized loads and stores and runs on problem sizes that fit in shared memory and use aligned buffers.
--> Modified the default implementation's intra thread block reduction to use warp shuffles as the first step in reduction and use shared memory only to reduce across warps.
--> Simplified the `WriteFpropResults` code because the loop unrolling brought no benefits but had a potentially detrimental effect on register usage.
We can observe that there is still an advantage in the Triton implementation. We were able to recover the gap by using native `__expf` but we decided to leave `std::exp` to avoid affecting numerical stability.
```
Tests are ran on an A100 GPU using the benchmark in the Triton tutorial.
Before
softmax-performance:
N Triton Torch (native) Torch (jit)
0 256.0 336.946021 595.781814 241.830261
1 384.0 737.741110 762.046555 297.890900
2 512.0 884.128199 860.899863 362.829080
3 640.0 936.228605 901.458039 376.211253
4 768.0 1005.024893 973.306952 384.187594
.. ... ... ... ...
93 12160.0 1336.034308 858.096595 330.642735
94 12288.0 1339.248830 837.047196 331.146707
95 12416.0 1338.877891 839.317673 329.113513
96 12544.0 1335.383669 835.342136 330.067106
97 12672.0 1339.402120 821.690012 329.854051
After
softmax-performance:
N Triton Torch (native) Torch (jit)
0 256.0 375.833684 602.629893 237.019883
1 384.0 312.572329 739.127852 301.777431
2 512.0 495.546303 863.736375 368.438508
3 640.0 520.953881 884.426455 369.633391
4 768.0 677.374681 975.722054 385.317013
.. ... ... ... ...
93 12160.0 1337.253933 1300.589124 330.655916
94 12288.0 1336.333052 1188.412588 331.116192
95 12416.0 1337.610105 1209.703474 329.232825
96 12544.0 1338.723893 1232.849225 330.003484
97 12672.0 1340.232227 1236.057117 329.925347
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122970
Approved by: https://github.com/malfet
Today, we error out on FakeTensor.data_ptr under torch.compile. This PR
moves to error out on FakeTensor.data_ptr under eager mode to avoid
diverging behavior.
We do this by adding another bit onto FakeTensor that we'll remove after
the deprecation cycle.
Test Plan:
- tested locally
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123292
Approved by: https://github.com/eellison
ghstack dependencies: #123261, #123282, #123291
This PR adds fast semi-structured sparsification kernels to PyTorch, allowing for accelerated semi-structured sparsification.
The kernels have been added as aten native functions.
In particular, three new functions have been added:
* `torch._sparse_semi_structured_tile`
This function will return the packed representation and metadata for
both X and X', as well as the thread masks. Note that this applies 2:4
sparsity in a 4x4 tile instead of a 1x4 strip as usual.
* `torch._sparse_semi_structured_apply`
This function takes in an input tensor and thread masks from the above
function and returns a packed representation and metadata from applying
thread masks to the input tensor.
* `torch._sparse_semi_structured_apply_dense`
This function does the same thing as above but instead of returning the
tensor in the sparse representation it returns it in the dense
representation
The subclasses have also been updated to add a new
`prune_dense_static_sort`
classmethod to create sparse tensors with this format. I've added some
additional documentation on how to calculate the compressed tensors
needed to create a SparseSemiStructuredTensor oneself.
To this end, there are two new helper functions added:
`sparse_semi_structured_tile`
`compute_compressed_swizzled_bitmask`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122350
Approved by: https://github.com/cpuhrsch
For internal purposes, this PR reverts the use of real views in SDPA -> autograd.Function "views" (i.e. `ViewBufferFromNested` and `ViewNestedFromBuffer`). This is a temporary fix to get the FIRST model launched and working.
**Note: this breaks some other Dynamo tests related to SDPA that rely on real views, but the breakage there isn't expected to be likely in a real-world scenario.**
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123215
Approved by: https://github.com/YuqingJ
Summary:
`-Wunused-exception-parameter` has identified an unused exception parameter. This diff removes it.
This:
```
try {
...
} catch (exception& e) {
// no use of e
}
```
should instead be written as
```
} catch (exception&) {
```
If the code compiles, this is safe to land.
Test Plan: Sandcastle
Reviewed By: palmje
Differential Revision: D55548497
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123056
Approved by: https://github.com/Skylion007
I'm not sure what "TORCH_DOCTEST_LIBRARY" is, but it prevented these
tests from running under xdoctest. This PR fixes the docstrings and
makes them actually run under xdoctest.
Test Plan:
- wait for CI
- I verified locally that the docstrings are now being tested.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123282
Approved by: https://github.com/williamwen42
ghstack dependencies: #123261
Previously, it suggested that a user add a manual functionalization
kernel. However, since we have auto_functionalize now, the user's first
course of action should be to modify their op into the form that
auto_functionalize accepts (this is possible in the majority of custom
ops).
Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123261
Approved by: https://github.com/williamwen42
Since ARC runners use dind-rootless mode, setting the ulimit in the docker run command is not possible, as the dind-rootless container does not have sufficient permissions to do that.
This setting looks like it came from a migration from another CI system, so perhaps it's not necessary anymore.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122629
Approved by: https://github.com/jeanschmidt
If any graph input has overlapping memory, inductor disables cudagraphs. But the function `complex_memory_overlap` that detects memory overlap can report false positives.
E.g. for the tensor `rand_strided((8, 1500, 1), (1504, 1, 1), device=self.device)` the function previously reported overlap. This is caused by the size-1 dimension. The fix is to squeeze the tensor before running the detection algorithm.
This fixes the perf regression for hf_Whisper and timm_efficientdet when we do padding. For these models cudagraphs were dynamically disabled when doing padding due to the issue discussed here, causing the perf regression.
This may help the dashboard if this is a common thing.
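A tiny illustration of the false positive, using the shape/strides from the repro above (illustrative only, not the actual inductor helper):
```py
import torch

t = torch.empty_strided((8, 1500, 1), (1504, 1, 1))
# The trailing size-1 dimension is what confused the overlap check.
print(t.shape, t.stride())      # torch.Size([8, 1500, 1]) (1504, 1, 1)
# Squeezing removes the size-1 dimension, and the layout is clearly non-overlapping.
s = t.squeeze()
print(s.shape, s.stride())      # torch.Size([8, 1500]) (1504, 1)
```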
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123327
Approved by: https://github.com/Chillee
## Context
Suppose we have two symbols: `u0` and `s0` where we know that `u0 = s0`. Now, let's say we tried to look up the size hint for `u0 + 1`.
* Before this PR, we would use a fallback hint if one was provided.
3f6acf65fd/torch/_inductor/sizevars.py (L406-L407)
* With this PR, we would try to replace `u0` with `s0` via `simplify()` before using a fallback hint. 3f6acf65fd/torch/_inductor/sizevars.py (L46-L47)
## Concrete Example
A scenario where this is useful is when we're running autotuning benchmarking on bmm with two input nodes: one that has `s0` as the batch size and one that has `u0` as the batch size. During benchmarking, we'll create two example input tensors, where the input with `u0` has to use a fallback hint for its batch size. This leads to a mismatch.
e3d80f2fa9/torch/_inductor/select_algorithm.py (L991-L997)
Using the fallback hint (i.e. 8192) leads to a batch size mismatch.
```
# Note: s0 = 7 and u0 = 7 and fallback hint is 8192.
LoweringException: ErrorFromChoice: Expected size for first two dimensions of batch2 tensor to be: [7, 30] but got: [8192, 30].
From choice ExternKernelCaller(extern_kernels.bmm)
```
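A rough stand-alone analogue of the fix, written in plain sympy rather than the actual `sizevars` code (variable names mirror the example above):
```py
import sympy

u0, s0 = sympy.symbols("u0 s0", integer=True, positive=True)
expr = u0 + 1
expr = expr.subs({u0: s0})   # analogous to the simplify() step this PR adds
print(expr.subs({s0: 7}))    # 8, instead of benchmarking with the 8192 fallback hint
```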
Differential Revision: D55619331
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123140
Approved by: https://github.com/aakhundov
We should use ID_MATCH guards sparingly. When it comes to performance, ID_MATCH is much faster than DATA_PTR_MATCH for Python guards. However, the difference is very small in C++. So, it's worth just using DATA_PTR_MATCH.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123302
Approved by: https://github.com/mlazos
ghstack dependencies: #123285
Summary: Modify fresh_inductor_cache() to clear cached state before mocking the toplevel cache_dir directory. Any lru_caches (or otherwise) can use the @clear_on_fresh_inductor_cache decorator to register the cache for clearing. Also change the base inductor TestCase class to use fresh_inductor_cache(). Previously that TestCase was only mocking the subdirectory within the toplevel cache dir designated for the FX graph cache artifacts.
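A sketch of the intended usage pattern (import locations are assumed to be `torch._inductor.utils`, and `expensive_lookup` is a made-up stand-in):
```py
import functools
from torch._inductor.utils import clear_on_fresh_inductor_cache, fresh_inductor_cache

@clear_on_fresh_inductor_cache      # register this cache so it is cleared too
@functools.lru_cache(None)
def expensive_lookup(key):
    return key * 2                  # placeholder for real cached work

with fresh_inductor_cache():
    # runs with a temporary toplevel cache dir and all registered caches cleared
    expensive_lookup(21)
```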
Test Plan:
- New unit test
- All existing inductor tests will exercise fresh_inductor_cache()
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122661
Approved by: https://github.com/oulgen
By limiting `VecConvertTests` subtest cases to positive numbers when converting to unsigned types.
What `static_cast<unsigned int>(-3.0f)` does is compiler/architecture specific, as one can observe by running
```cpp
#include <cstdlib>
#include <iostream>

unsigned int convert(float x) {
  return static_cast<unsigned int>(x);
}

int main(int argc, const char* argv[]) {
  auto inp = std::atof(argc > 1 ? argv[1] : "-3.0");
  std::cout << "cvt(" << inp << ")=" << convert(inp) << std::endl;
  return 0;
}
```
on x86 it would print `cvt(-3)=4294967293`, but on ARM it would convert to `0`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123258
Approved by: https://github.com/atalman
Fixes #122016 and #123178. This regression is related to an OS-side change that requires a slight adjustment from us on the PyTorch side to restore the previous behavior. Additionally, we cleared out pre-MacOS 13 related workarounds.
Before the fix on MacOS 14.4:
```
python -c "import torch;x=torch.zeros(3, device='mps');x[1] = 1; x[2] = 3; print(x)"
tensor([0., 3., 3.], device='mps:0')
```
After the fix:
```
python -c "import torch;x=torch.zeros(3, device='mps');x[1] = 1; x[2] = 3; print(x)"
tensor([0., 1., 3.], device='mps:0')
```
This also fixes complex number initialization and as such makes `nn.functional.rms_norm` pass on MacOS-14+
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123234
Approved by: https://github.com/malfet, https://github.com/kulinseth
Summary: RECORD_FUNCTION in C++ and torch.profiler.record_function already support recording inputs. Let's do the same for RecordFunctionFast.
Test Plan: Add tests in test_profiler.py that take args and also do not take args so we can support it being an optional parameter
Differential Revision: D55648870
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123208
Approved by: https://github.com/davidberard98
ARC Runners will provide working Nvidia drivers through the host configuration so this step is no longer necessary in the workflow as the ARC container is not able to install packages at the host level.
Also simplify the setup-linux condition on whether we are running in ARC, as we can achieve the same result without needing an extra shell step via the hashFiles() function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122890
Approved by: https://github.com/seemethere, https://github.com/jeanschmidt
Summary: Previously, we only enabled the group batch fusion customization; now we also enable the split cat customization.
Test Plan:
```
buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "cmf" --flow_id 524546542
```
P1196013839
Differential Revision: D54861682
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121915
Approved by: https://github.com/jackiexu1992
By querying `sysctl hw.perflevel0.physicalcpu` instead of
`std::thread::hardware_concurrency()`, which returns the total number of
cores, i.e. the sum of performance and efficiency ones.
As lots of parallel algorithms in ATen divide the parallel task into even regions, this ends up in faster code execution compared to when all cores are used by default.
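A quick way to compare the two counts on an Apple Silicon machine (illustrative Python, not the ATen change itself):
```py
import os
import subprocess

perf_cores = int(subprocess.check_output(["sysctl", "-n", "hw.perflevel0.physicalcpu"]))
total_cores = os.cpu_count()  # roughly what std::thread::hardware_concurrency() reports
print(perf_cores, total_cores)
```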
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123038
Approved by: https://github.com/albanD
We add an additional_inputs argument to the HOP while_loop and rename the operands to carried_inputs, based on offline discussion with @zou3519. This allows us to support closures, parameters and buffers.
The alternative is to pass the lifted inputs directly to the outputs of body_fn. But since we want body_fn's outputs to not alias its inputs, we'd need to copy the inputs and remove the copies later, which is a bit more work.
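A hedged sketch of the closure case this enables; the import path and eager-mode semantics are assumptions, so treat it as illustrative rather than the exact API surface:
```py
import torch
from torch._higher_order_ops.while_loop import while_loop

scale = torch.tensor(2.0)  # closed-over tensor, lifted as an additional input

def cond_fn(i, x):
    return i < 5

def body_fn(i, x):
    return i + 1, x * scale

out_i, out_x = while_loop(cond_fn, body_fn, (torch.tensor(0), torch.ones(3)))
```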
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123018
Approved by: https://github.com/aakhundov
ghstack dependencies: #123217
This PR updates the error message in autograd when an input tensor does not have `requires_grad` set. The original message does not contain the index info, making it hard for users to debug.
The error message style is consistent with that on lines 105-109.
Co-authored-by: Jeffrey Wan <soulitzer@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123154
Approved by: https://github.com/soulitzer
Summary:
Remove unused/ignore/export TS logging because they do not represent independent TS usage and lead to overloading scribe.
Log tupperware job's oncall information so that we have better attribution of who launched the job.
Test Plan: manual testing
Reviewed By: davidberard98
Differential Revision: D55610844
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123133
Approved by: https://github.com/clee2000
This is the entrypoint for defining an opaque/blackbox (e.g. PyTorch will
never peek into it) custom op. In this PR, you can specify backend impls
and the abstract impl for this op.
NB: most of this PR is docstrings, please don't be intimidated by the
line count.
There are a number of interesting features:
- we infer the schema from type hints. In a followup I add the ability
to manually specify a schema.
- name inference. The user needs to manually specify an op name for now.
In a followup we add the ability to automatically infer a name (this
is a little tricky).
- custom_op registrations can override each other. This makes them
more pleasant to work with in environments like colab.
- we require that the outputs of the custom_op do not alias any inputs
or each other. We enforce this via a runtime check, but can relax this
into an opcheck test if it really matters in the future.
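Not the literal entrypoint added here, but a hedged sketch in the shape of the public `torch.library.custom_op` interface this line of work grew into; names and decorators are assumptions where they differ from this PR:
```py
import numpy as np
import torch

@torch.library.custom_op("mylib::numpy_sin", mutates_args=())
def numpy_sin(x: torch.Tensor) -> torch.Tensor:
    # PyTorch never peeks inside this body; the output does not alias the input.
    return torch.from_numpy(np.sin(x.numpy()))

@numpy_sin.register_fake
def _(x):
    return torch.empty_like(x)  # abstract impl used under compile/export

print(numpy_sin(torch.randn(3)))
```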
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122344
Approved by: https://github.com/ezyang, https://github.com/albanD
Summary: We probably don't need
`torch._C._AutoDispatchBelowAutograd()`, which is to prevent
infinite recursion if the implementation calls itself. Let's
remove it and see if anything breaks. The other major change
is registering the op to the more general Autograd dispatch
key so it can be used on cuda as well.
Test Plan:
python test/inductor/test_cpu_repro.py -k test_decomposed_fake_quant_per_channel
Reviewers: zou3519, bdhirsh
Subscribers: zou3519, bdhirsh, jerryzh168, leslie-fang-intel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123186
Approved by: https://github.com/zou3519, https://github.com/leslie-fang-intel
Summary: This test is actually broken and probably succeeding by mistake because of a cache hit. Forcing a fresh cache or removing the errant setting causes a consistent failure. Disabling for now until we have time to investigate further.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123211
Approved by: https://github.com/desertfire
When we enter map_autograd, we try to trace through fwd/bwd of a map operator that is wrapped in ctx.functionalize wrapper. This forces us to go through PreDispatch functionalization again (only the python part). As a result, it revealed our previous bug where pre-dispatch mode handling doesn't actually manage the local dispatch key set. (If there is no active mode, we need to turn off PreDispatch key). This PR fixes that. Also I shuffled some APIs around so that there is less code duplication as the setting/unsetting logic is quite hard to get it right.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121444
Approved by: https://github.com/bdhirsh
* Adds a configurable GEMM size threshold for the usage of Cutlass GEMM Kernels **_inductor.config.cutlass_backend_min_gemm_size**
* During GEMM algorithm choice generation: **if no viable choices can be generated using the configured backends, the ATen backend will be used as a fallback backend**, even if it is not enabled in **_inductor.config.max_autotune_gemm_backends**
Test plan:
CI
Additional unit test in test_cutlass_backend.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121491
Approved by: https://github.com/jansel
ghstack dependencies: #121490
Ring attention support for _scaled_dot_product_flash_attention with DTensor.
This assumes the query and key/value are sharded along the sequence length dimension. See the tests for example usage with PT Transformer as well as direct usage with _scaled_dot_product_flash_attention.
## Notable caveats
* Numerical accuracy: The backwards pass doesn't match numerically with the non-chunked version but the forwards pass does. I assume this is due to accumulated errors. I've added a chunked version that uses autograd to verify that the distributed version matches the chunked version.
* nn.Linear has incorrect behavior when running on a sharded tensor of size (bs, heads, seq_len, dim) with `Shard(2)`: it does an unnecessary accumulate, which requires `Replicate()` on QKV as a workaround when using `nn.MultiheadAttention`.
* If enabled, it forces sequence parallelism and doesn't interop with tensor parallelism.
## SDPA usage
```py
with attention_context_parallel(), sdpa_kernel(backends=[SDPBackend.FLASH_ATTENTION]):
    dquery = distribute_tensor(query, device_mesh, [Shard(2)])
    dkey = distribute_tensor(key, device_mesh, [Shard(2)])
    dvalue = distribute_tensor(value, device_mesh, [Shard(2)])
    dout: DTensor = torch.nn.functional.scaled_dot_product_attention(
        dquery, dkey, dvalue, is_causal=is_causal
    )
    out = dout.to_local()
```
## Transformer usage
```py
with attention_context_parallel(), sdpa_kernel(backends=[SDPBackend.FLASH_ATTENTION]):
    encoder_layer = nn.TransformerEncoderLayer(
        d_model=dim,
        nhead=nheads,
        dim_feedforward=dim,
        batch_first=True,
    ).to(dtype)
    encoder_layer = parallelize_module(
        module=encoder_layer,
        device_mesh=device_mesh,
        parallelize_plan={
            "self_attn": ContextParallel(),
        },
    )
    model = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
```
## Test plan
```
pytest test/distributed/_tensor/test_attention.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122460
Approved by: https://github.com/drisspg, https://github.com/wanchaol
Summary: In profiler.export_stacks there was a comment suggesting that the export was compatible with FlameGraph even though it isn't. We should remove this so that users are not confused.
Test Plan: Removed comment
Reviewed By: aaronenyeshi
Differential Revision: D55501792
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123102
Approved by: https://github.com/aaronenyeshi
Summary:
For the original clone we did for outputs, we only clone when the
corresponding tensor is a constant. We need this because we have to
make sure the constants' ownership stays in the Model. However, we
hadn't covered the case where the output is a view of a constant.
Test Plan:
Included in commit
test_aot_inductor::test_return_view_constant
Reviewed By: frank-wei, desertfire
Differential Revision: D55645636
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123200
Approved by: https://github.com/chenyang78
This PR only adds abstract class registration logic without touching existing tests so they still trace with real script object. The added tests are only for registration APIs and test error messages.
Our design is that the abstract implementation should be in Python. This is much better in terms of usability. But this also has implications for custom op that takes script object as input, which is detailed later in this stack.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122622
Approved by: https://github.com/zou3519
ghstack dependencies: #122619, #122620, #122621
Summary: In some cases we don't have information from the old IR about submodule ordering; in this case the unflattener should still work in best-effort mode.
Differential Revision: D55642005
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123192
Approved by: https://github.com/angelayi
This is the last of the old TestOptim! With this change, everything will be migrated to use OptimizerInfo. Our sparse support is...well, sparse, and the tests try to best encapsulate which configs actually work. Note that support_sparse is actually just supports sparse grads...we don't test sparse params.
1. This PR fixes a bug in Adagrad multi_tensor with maximize by passing the correct value of maximize (vs False everytime) when sparse values are present.
2. This PR does improve coverage. There used to only be 2 configs each, and now we have the following configs for:
Adagrad:
```
python test/test_optim.py -k test_rosenbrock_sparse_with_lrsched_False_Adagrad
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
{'maximize': True, 'lr': 0.1}
{'initial_accumulator_value': 0.1, 'lr': 0.1} <--- this and above are CPU
.{'foreach': False, 'lr': 0.1}
{'foreach': True, 'lr': 0.1}
{'maximize': True, 'foreach': False, 'lr': 0.1}
{'maximize': True, 'foreach': True, 'lr': 0.1}
{'initial_accumulator_value': 0.1, 'foreach': False, 'lr': 0.1}
{'initial_accumulator_value': 0.1, 'foreach': True, 'lr': 0.1}
.
----------------------------------------------------------------------
Ran 2 tests in 227.744s
OK
```
SGD
```
(pytorch-3.10) [janeyx@devgpu023.odn1 /data/users/janeyx/pytorch (bff23193)]$ python test/test_optim.py -k test_rosenbrock_sparse_with_lrsched_False_SGD
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
{'dampening': 0.5, 'lr': 0.0048}
.{'foreach': False, 'lr': 0.0048}
{'foreach': True, 'lr': 0.0048}
{'dampening': 0.5, 'foreach': False, 'lr': 0.0048}
{'dampening': 0.5, 'foreach': True, 'lr': 0.0048}
.
----------------------------------------------------------------------
Ran 2 tests in 112.801s
OK
```
SparseAdam
```
(pytorch-3.10) [janeyx@devgpu023.odn1 /data/users/janeyx/pytorch (bff23193)]$ python test/test_optim.py -k test_rosenbrock_sparse_with_lrsched_False_Sparse
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
{'maximize': True, 'lr': 0.04}
.{'maximize': True, 'lr': 0.04}
.
----------------------------------------------------------------------
Ran 2 tests in 35.113s
OK
```
Fixes #103322. A side quest in this migration was to re-enable and track dynamo issues as they trigger on the optim tests, which is complete as of this PR. New tests may add more things to track in dynamo, but there is now an established system for doing so, and dynamo is either enabled or a bug is tracked for every migrated test in TestOptimRenewed.
Next steps:
Remove the hyperparameter constraints in common_optimizer.py defined by metadata_for_sparse (other than LR, which seems handpicked for the tests to actually pass). Doing this requires adding more sparse functionality.
Add more tests!
Maybe add more optimizers!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123146
Approved by: https://github.com/albanD
ghstack dependencies: #123134, #123139
Existing `innermost_fn` handling of `functools.wraps` is not ideal, but I'm not sure if there's a good fix. This can manifest for GmWrapper (used to handle list inputs from Dynamo -> AOTAutograd) where we don't call the unflatten wrapper at runtime.
Since core parts of Dynamo rely on the attribute check for `_torchdynamo_orig_callable`, I'm adding a test to cover it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123007
Approved by: https://github.com/jansel
ghstack dependencies: #122691, #122746
Fixes #122807
The work handle of the coalescing job will be populated:
```python
with dist._coalescing_manager(group=pg_nccl, device=device, async_ops=True) as cm:
    dist.all_reduce(a)
    dist.all_reduce(b)
print(len(cm.works))  # prints 1
cm.wait()  # actually waits
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122849
Approved by: https://github.com/kwen2501
Adding wildcard support for TP's `parallelize_module` API.
Example patterns:
`layers.*.linear`: any characters
`layers.?.linear`: single character
`layers.[1-2]`: digit range, matches `layers.1` and `layers.2`
Example use case:
A model have multiple layers, and we want to parallelize the linear module `lin` inside each layer.
```
model_tp = parallelize_module(
    model,
    device_mesh,
    {
        "layers.*.lin": ColwiseParallel(),
    },
)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122968
Approved by: https://github.com/XilunWu, https://github.com/wz337, https://github.com/wanchaol
ghstack dependencies: #122919
Summary: This PR reduces the difference between strict and non-strict exported programs by supporting inline_constraints for non-strict exported programs.
Test Plan: CI
Differential Revision: D55547830
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123017
Approved by: https://github.com/angelayi
This supersedes the previous `Guards Overview" as a more comprehensive
approach to most of the main topics within Dynamo.
In the future, we could add specific sections for each of the topics
discussed here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122305
Approved by: https://github.com/msaroufim
After we codegen a triton kernel in the triton codegen backend,
we cache the generated triton source code in the wrapper to avoid
producing multiple triton kernels with the same content.
In AOTI compilation flow, this caching mechanism imposes a strong requirement
on the codegen that we must generate the same triton source code
for the same schedule node in both python and cpp codegen phases.
Otherwise, we would end up with a mismatch between the kernel name
formed in the cpp codegen and the cuda kernel key produced from
the python codegen. Consequently, we would hit a missing-cuda-kernel
error.
The precomputed symbol replacements saved in V.graph.sizevars
can cause such source-code inconsistency related to the code for indexing
tensors. For example, let's say in the python codegen phase,
we produce "ks2\*48" as part of indexing an input for schedule
node A while yielding a replacement pair "ks0 -> ks2\*48" in
the precomputed replacements. In the second cpp codegen phase,
we would produce "ks0" for the same indexing code of schedule
node A due to the "ks0 -> ks2*48" replacement pair.
This PR fixed the issue by clearing precomputed_replacements
and inv_precomputed_replacements before cpp wrapper codegen.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123136
Approved by: https://github.com/desertfire
I am ok if people don't want this PR to be merged.
For optimizers, we know that the state dict and param_group have same parameters. So, I think its ok to skip TENSOR_MUST_ALIAS guards.
Similarly for state tensors, all of them are different. Therefore, we can skip the tensor aliasing guards.
With this PR, these are the numbers for Megatron which has 394 parameters
<img width="290" alt="image" src="https://github.com/pytorch/pytorch/assets/13822661/0ce75dc6-4299-46bb-bf3c-7989ebc7cfc4">
C++ numbers jump a lot because of 2 reasons
1) We are now not doing INCREF/DECREF for a large number of tensors.
2) For python guards, we can expect higher numbers but that requires some more plumbing because the Python tensor guards are all collapsed into one.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123044
Approved by: https://github.com/jansel, https://github.com/mlazos
Summary:
The existing flight recorder dumping logic is: dump only on timeout, but not
on NCCL error. This resulted in the faulty ranks missing dumps when an NCCL
error happens.
So in this PR, we revise the dump logic such that records are dumped
when any exception is detected. An exception could be 1. an NCCL async error, or
2. a watchdog timeout.
Also, the existing code tends to mix the logic of flight recorder dump
and desync debug, which is not desirable. We dump the desync debug
report only when a timeout is detected.
Test Plan:
Added a new unit test to trigger nccl error and dump, and make sure the
dump is triggered by the error.
Also existing dump on timeout tests should still pass.
sqzhang_1) [sqzhang@devgpu009.cln1 ~/pytorch (84bf9d4c)]$ python
test/distributed/test_c10d_nccl.py NcclErrorDumpTest
NCCL version 2.19.3+cuda12.0
[E329 19:15:11.775879730 ProcessGroupNCCL.cpp:565] [Rank 0] Watchdog
caught collective operation timeout: WorkNCCL(SeqNum=2,
OpType=ALLREDUCE, NumelIn=10, NumelOut=10, Timeout(ms)=10000) ran for
10028 milliseconds before timing out.
[E329 19:15:11.777459894 ProcessGroupNCCL.cpp:1561] [PG 0 Rank 0]
Exception hit in NCCL work: 2
[E329 19:15:12.660717323 ProcessGroupNCCL.cpp:1332] [PG 0 Rank 0]
Received a timeout signal from this local rank and will start to dump
the debug info. Last enqueued NCCL work: 2, last completed NCCL work: 1.
[E329 19:15:12.660932242 ProcessGroupNCCL.cpp:1167] [PG 0 Rank 0]
ProcessGroupNCCL preparing to dump debug info.
[E329 19:15:12.661192990 ProcessGroupNCCL.cpp:1174] [PG 0 Rank 0]
ProcessGroupNCCL dumping nccl trace to /tmp/tmp06psqil3/trace_0
[F329 19:15:12.661485601 ProcessGroupNCCL.cpp:1185] [PG 0 Rank 0] [PG 0
Rank 0] ProcessGroupNCCL's watchdog detected a collective timeout from
the local rank. This is most likely caused by incorrect usages of
collectives, e.g., wrong sizes used across ranks, the order of
collectives is not same for all ranks or the scheduled collective, for
some reason, didn't run. Additionally, this can be caused by GIL
deadlock or other reasons such as network errors or bugs in the
communications library (e.g. NCCL), etc. We tried our best to dump the
debug info into the storage to help you debug the issue.
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123023
Approved by: https://github.com/wconstab
show specific inlining reasons with ``TORCH_LOGS="+dynamo" TORCHDYNAMO_VERBOSE=1``
* before, ``INLINING <code...>, inlined according trace_rules.lookup``
* after, ``INLINING <code...> inlined according trace_rules.lookup MOD_INLINELIST``
this can distinguish between inlining by default or by MOD_INLINELIST (specific rule)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123014
Approved by: https://github.com/jansel
ghstack dependencies: #123013
User-defined triton kernels call `ASTSource.make_ir`. Triton recently added an extra required argument to this API, which makes the call in PyTorch's user-defined triton kernel related code fail. This PR makes PyTorch work with both the old and new versions of the API.
Test:
```
python test/inductor/test_aot_inductor.py -k test_triton_kernel_equal_to_1_arg_abi_compatible_cuda
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123124
Approved by: https://github.com/oulgen, https://github.com/jansel
ghstack dependencies: #123076
Currently scan has an `init` argument which must be the identity of the
combine function. This isn't strictly necessary if we are more careful about
keeping track of the first element and avoiding combining it with anything.
This does additionally require that there are no active load masks, since we can't
do the `where_cond` any more. However, this shouldn't be possible anyway since
scans are always realized and only fused via the scheduler.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119727
Approved by: https://github.com/lezcano
Reset the guard state at the end of RootGuardManager, even if the result is true. Earlier we reset only when the result was False, but that required extra bookkeeping in each guard. This PR gives a tiny improvement.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123046
Approved by: https://github.com/jansel
Downloading CUDA sometimes fails and breaks the build process, but AOTriton does not need these packages for its own Triton fork. This commit comments out the related downloading scripts.
The actual changes from Triton can be found at: 9b73a543a5
Fixes the following building error
```
[2/6] cd /var/lib/jenkins/workspace/build/aotriton/src/third_party/triton/python && /opt/conda/envs/py_3.8/bin/cmake -E env VIRTUAL_ENV=/var/lib/jenkins/workspace/build/aotriton/build/venv PATH="/var/lib/jenkins/workspace/build/aotriton/build/venv/bin:/opt/cache/bin:/opt/rocm/llvm/bin:/opt/rocm/opencl/bin:/opt/rocm/hip/bin:/opt/rocm/hcc/bin:/opt/rocm/bin:/opt/conda/envs/py_3.8/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" TRITON_BUILD_DIR=/var/lib/jenkins/workspace/build/aotriton/build/triton_build python setup.py develop
FAILED: CMakeFiles/aotriton_venv_triton /var/lib/jenkins/.local/lib/python3.8/site-packages/triton/_C/libtriton.so /var/lib/jenkins/workspace/build/aotriton/build/CMakeFiles/aotriton_venv_triton
cd /var/lib/jenkins/workspace/build/aotriton/src/third_party/triton/python && /opt/conda/envs/py_3.8/bin/cmake -E env VIRTUAL_ENV=/var/lib/jenkins/workspace/build/aotriton/build/venv PATH="/var/lib/jenkins/workspace/build/aotriton/build/venv/bin:/opt/cache/bin:/opt/rocm/llvm/bin:/opt/rocm/opencl/bin:/opt/rocm/hip/bin:/opt/rocm/hcc/bin:/opt/rocm/bin:/opt/conda/envs/py_3.8/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" TRITON_BUILD_DIR=/var/lib/jenkins/workspace/build/aotriton/build/triton_build python setup.py develop
downloading and extracting https://conda.anaconda.org/nvidia/label/cuda-12.1.1/linux-64/cuda-nvcc-12.1.105-0.tar.bz2 ...
downloading and extracting https://conda.anaconda.org/nvidia/label/cuda-12.1.1/linux-64/cuda-cuobjdump-12.1.111-0.tar.bz2 ...
Traceback (most recent call last):
File "/var/lib/jenkins/workspace/build/aotriton/src/third_party/triton/python/setup.py", line 325, in <module>
download_and_copy(
File "/var/lib/jenkins/workspace/build/aotriton/src/third_party/triton/python/setup.py", line 151, in download_and_copy
ftpstream = urllib.request.urlopen(url)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/urllib/request.py", line 215, in urlopen
return opener.open(url, data, timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/urllib/request.py", line 521, in open
response = meth(req, response)
^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/urllib/request.py", line 630, in http_response
response = self.parent.error(
^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/urllib/request.py", line 559, in error
return self._call_chain(*args)
^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/urllib/request.py", line 492, in _call_chain
result = func(*args)
^^^^^^^^^^^
File "/opt/conda/lib/python3.12/urllib/request.py", line 639, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 524:
ninja: build stopped: subcommand failed.
```
Example of failed build log: https://github.com/pytorch/pytorch/actions/runs/8483953034/job/23245996425
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122982
Approved by: https://github.com/jansel
Summary:
Removes `using namespace` from a header file. Having `using namespace` in a header file is *always* a bad idea. A previous raft of diffs provided appropriate qualifications to everything that relied on this `using namespace`, so it is now safe to remove it in this separate diff.
Helps us enable `-Wheader-hygiene`.
Test Plan: Sandcastle
Reviewed By: dmm-fb
Differential Revision: D54838298
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121847
Approved by: https://github.com/Skylion007
As titled: previously we could possibly return an expected input spec
that is shared by multiple args. This is not OK since different args might
have different tensor metas; the reason it worked before is that
redistribute in these cases becomes a no-op.
This PR fixes it by making each expected input spec shallow-clone the
corresponding input metadata.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122949
Approved by: https://github.com/tianyu-l
ghstack dependencies: #122929
This PR refactors the schema_suggestions in OutputSharding to be a single
OpSchema instead of a list of schemas, since in practice we only ever have one;
for the multiple-resharding case we also moved to OpStrategy, so there's
no case that needs it to be a list.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122929
Approved by: https://github.com/tianyu-l
Summary: After we migrate to torch.export, we won't see ops like add_ and mul_ due to functionalization. We are rolling out pre dispatch export, so for now we just skip those mutating ops in tests.
Test Plan: buck run mode/opt caffe2/test/quantization:test_quantization
Reviewed By: tugsbayasgalan
Differential Revision: D55442019
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122863
Approved by: https://github.com/clee2000
When fakifying a grad tracking tensor, if the level is -2 (sentinel
value) we can just unwrap the grad tensor and return a fake version of
it. In this PR, we update `assert_metadata_eq` to not compare whether
the grad tensor and the unwrapped one are leaves, as this may
not always be true.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122728
Approved by: https://github.com/zou3519
Fixes #114844
In the linked issue we have
```
compiled_module = torch.compile(module)
compiled_module.x = ...
compiled_module(...) # Mutates self.x
```
Where since the module mutates `self.x` you would expect `compiled_module.x`
to be updated but actually `compiled_module.x = ...` sets an attribute "x"
on the `OptimizedModule` object while the forward method of the module mutates
`module.x`.
This gives the expected behavior by forwarding `compiled_module.__setattr__`
down to `module.__setattr__`. There is already a corresponding `__getattr__`
so now `compiled_module.x` becomes an alias for `module.x`.
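A hedged repro of the intended behavior (the module and attribute are made up for illustration):
```py
import torch

class Counter(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.calls = 0

    def forward(self, x):
        self.calls += 1       # forward mutates self.calls
        return x + self.calls

m = Counter()
cm = torch.compile(m)
cm.calls = 10                 # now forwarded to m.__setattr__
cm(torch.zeros(()))
print(cm.calls, m.calls)      # both read the same mutated attribute: 11 11
```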
Co-authored-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122098
Approved by: https://github.com/ezyang, https://github.com/lezcano
Summary:
Added support for quantized linear on CPU with fbgemm.
Specifically, for torch.ops.quantized.linear_unpacked_dynamic_fp16, we
decompose it into two steps: packing the weight, and calling fbgemm's qlinear with the
packed weight.
Test Plan:
Included in commit.
test_aot_inductor::test_quantized_linear
Reviewers:
Subscribers:
Tasks:
Tags:
Differential Revision: [D55577959](https://our.internmc.facebook.com/intern/diff/D55577959)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123069
Approved by: https://github.com/hl475
Enable VEC on Windows OS.
1. Fix some type definition gaps between Windows and Linux.
2. Fix some operators not supported on Windows, such as [] and /.
3. Enable static sleef library build on Windows.
4. Disable unsupported function overloading on MSVC.
5. Upgrade submodule sleef lib, which fixed build issue on Windows.
6. Fixed bazel build issues.
7. Fix test app not link to sleef on Windows.
Note: If the rebuild fails after pulling this PR, please sync the `sleef` submodule by running:
```cmd
git submodule sync
git submodule update --init --recursive
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118980
Approved by: https://github.com/jgong5, https://github.com/ezyang, https://github.com/malfet
Fixes #118849
Add a map for parent_to_child_mappings in _mesh_resources so we can cache and reuse submesh slicing results, so that we avoid recreating submeshes and the underlying sub-process-groups repeatedly, which could lead to funky behaviors.
We will follow up with reusing pg from the parent_mesh during submesh creation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122975
Approved by: https://github.com/wanchaol
Summary:
This would otherwise yield
> ValueError: ('Manual wrapping with ShardingStrategy.HYBRID_SHARD', 'requires explicit specification of process group or device_mesh.')
which is odd.
Remove the extra trailing commas.
Test Plan: CI
Differential Revision: D55549851
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123019
Approved by: https://github.com/Skylion007
Inference for the vision_maskrcnn model fails when max-autotune is enabled.
Repro:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/torchbench.py --accuracy --inference --bfloat16 --backend inductor --only vision_maskrcnn
```
It turns out that the MA code receives an empty input tensor for convolution, and some places in MA-related code do not handle this corner case properly. This PR enhances that, and now the accuracy test above passes.
Regarding why the input tensor is empty, I think it's probably because no objects are detected in the input images (random data?).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123008
Approved by: https://github.com/jansel
This PR adds a new metadata, `torch_fn` which is meant to replace `source_fn_stack` as `source_fn_stack` is not entirely well defined between strict/nonstrict. Previous discussion [here](https://docs.google.com/document/d/1sPmmsmh6rZFWH03QBOe49MaXrQkP8SxoG8AOMb-pFk4/edit#heading=h.anmx9qknhvm).
`torch_fn` represents the torch function that a particular aten operator came from. For example, `torch.nn.Linear` goes down to the `torch.nn.functional.linear` at the `__torch_function__` layer, and then `aten.t/aten.addmm` in the `__torch_dispatch__` layer. So the nodes `aten.t/aten.addmm` will now have the `torch_fn` metadata containing the `torch.nn.functional.linear`.
The `torch_fn` metadata is a tuple of 2 strings: a unique identifier for each torch function call, and the actual torch function `f"{fn.__class__}.{fn.__name__}"`. The purpose of the first value is to distinguish between 2 consecutive calls to the same function. For example, if we had 2 calls to `torch.nn.Linear`, the nodes and corresponding metadata would look something like:
```
aten.t - ("linear_1", "builtin_function_or_method.linear"),
aten.addmm - ("linear_1", "builtin_function_or_method.linear"),
aten.t - ("linear_2", "builtin_function_or_method.linear"),
aten.addmm - ("linear_2", "builtin_function_or_method.linear"),
```
Higher order ops -- currently we can get the torch_fn metadata for nodes within the HOO's subgraph, but after retracing, this becomes the `(cond, higher_order_op.cond)` :( This is because `fx_traceback.set_current_meta` points to the cond node in the toplevel graph, rather than the original node in the subgraph. I think this is because `fx.Interpreter` does not go into the cond subgraphs. (will discuss with Yidi more ab this)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122693
Approved by: https://github.com/tugsbayasgalan
fixes https://github.com/pytorch/pytorch/issues/122826
# Problem
When the model returns multiple outputs which alias the same tensor, we get a SEGFAULT. Because we try to release the same buffer twice.
```
def forward(x):
    x_out = x + 1
    contig = x_out.contiguous()  # alias of the same tensor as x_out
    return x_out, contig

run_impl() {
    output_handles[0] = buf0.release();
    output_handles[1] = buf0.release();  # SEGFAULT
}

# If we try to work around this by assigning aliases without creating a new tensor,
# then we'll get a double-free error during handle clean-up.
output_handles[1] = output_handles[0];  # assign without creating a new tensor
...
alloc_tensors_by_stealing_from_handles() {
    aoti_torch_delete_tensor_object(handles[0]);
    aoti_torch_delete_tensor_object(handles[1]);  # Double free
}
```
# Solution
~~Instead, we use the first `output_handle` that shares the same tensor and alias it.~~
```
output_handles[0] = buf0.release();
aoti_torch_alias_tensor(output_handles[0], &output_handles[1]); # No SEGFAULT & No double free!
```
A simpler approach is to figure out which handles are duplicates. Then we simply copy all duplicates except the last one. The last one will use `std::move` and free the tensor owned by the model instance.
```
output_handles[0] = buf0.release();
output_handles[1] = output_handles[0];
```
Differential Revision: [D55455344](https://our.internmc.facebook.com/intern/diff/D55455344)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122846
Approved by: https://github.com/desertfire, https://github.com/chenyang78, https://github.com/jingsh
The batch size for this model was 64 previously. Later on we changed that to 256, which causes OOM in the cudagraphs setting. This PR tunes the batch size down to 128.
Sharing more logs from my local run:
```
cuda,res2net101_26w_4s,128,1.603578,110.273572,335.263494,1.042566,11.469964,11.001666,807,2,7,6,0,0
cuda,res2net101_26w_4s,256,1.714980,207.986155,344.013071,1.058278,22.260176,21.034332,807,2,7,6,0,0
```
The log shows that torch.compile uses 11GB for batch size 128 and 21GB for batch size 256. I guess the benchmark script has extra overhead, causing the model to OOM at batch size 256 in the dashboard run.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122977
Approved by: https://github.com/Chillee
# Motivation
Add some attributes to `XPUDeviceProp` and expose them via `torch.xpu.get_device_properties` and `torch.xpu.get_device_capability`. They can be used in `torch.compile` or directly passed to triton to generate more optimized code based on device properties.
# Additional Context
expose the following attributes to `torch.xpu.get_device_properties`:
- `has_fp16` (newly added)
- `has_fp64` (newly added)
- `has_atomic64` (newly added)
- `driver_version`
- `vendor`
- `version`
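A hedged query sketch (requires a PyTorch build with XPU support; attribute names are the ones listed above):
```py
import torch

if torch.xpu.is_available():
    props = torch.xpu.get_device_properties(0)
    print(props.has_fp16, props.has_fp64, props.has_atomic64)
    print(props.driver_version, props.vendor, props.version)
    print(torch.xpu.get_device_capability(0))
```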
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121898
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/malfet, https://github.com/albanD, https://github.com/atalman
Summary:
When a rank detects a timeout from the TCPStore and triggers the dump, it's good to have more info about the source rank which detected the
collective timeout locally. We just need to put the source rank as the
value in the kvstore.
Test Plan:
In unit test, we triggered the timeout on rank 0 and rank 1 should get
the timeout signal from store and log the correct source rank:
```
(sqzhang_1) [sqzhang@devgpu009.cln1 ~/pytorch (34d27652)]$ python
test/distributed/test_c10d_nccl.py NCCLTraceTestTimeoutDumpOnStuckRanks
NCCL version 2.19.3+cuda12.0
[rank0]:[E327 17:04:16.986381360 ProcessGroupNCCL.cpp:565] [Rank 0]
Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2,
OpType=ALLREDUCE, NumelIn=12, NumelOut=12, Timeout(ms)=1000) ran for
1099 milliseconds before timing out.
[rank0]:[E327 17:04:16.988036373 ProcessGroupNCCL.cpp:1582] [PG 0 Rank
0] Timeout at NCCL work: 2, last enqueued NCCL work: 2, last completed
NCCL work: 1.
[rank0]:[E327 17:04:16.182548526 ProcessGroupNCCL.cpp:1346] [PG 0
Rank 0] Received a timeout signal from this local rank and will start
to dump the debug info. Last enqueued NCCL work: 2, last completed
NCCL work: 1.
[rank0]:[E327 17:04:16.247574460 ProcessGroupNCCL.cpp:1167] [PG 0
Rank 0] ProcessGroupNCCL preparing to dump debug info.
[rank1]:[E327 17:04:16.273332178 ProcessGroupNCCL.cpp:1346] [PG 0
Rank 1] Received a global timeout from another rank 0, and will start
to dump the debug info. Last enqueued NCCL work: 1, last completed
NCCL work: 1.
[rank1]:[E327 17:04:16.273565177 ProcessGroupNCCL.cpp:1167] [PG 0
Rank 1] ProcessGroupNCCL preparing to dump debug info.
[rank1]:[F327 17:04:16.274256512 ProcessGroupNCCL.cpp:1185] [PG 0
Rank 1] [PG 0 Rank 1] ProcessGroupNCCL's watchdog detected a
collective timeout from another rank 0 and notified the current rank.
This is most likely caused by incorrect usages of collectives, e.g.,
wrong sizes used across ranks, the order of collectives is not same
for all ranks or the scheduled collective, for some reason, didn't
run. Additionally, this can be caused by GIL deadlock or other
reasons such as network errors or bugs in the communications library
(e.g. NCCL), etc. We tried our best to dump the debug info into the
storage to help you debug the issue.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122850
Approved by: https://github.com/wconstab
This PR unified the vectorized conversion with `at::vec::convert` for all vectorized data types. The intrinsics implementations are implemented as a specialization and moved to their own arch-specific files. The vectorized conversion logic in cpp Inductor is simplified.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119979
Approved by: https://github.com/jansel, https://github.com/malfet
Partially addresses #122160
In the module `torch.utils.tensorboard.summary`, the `hparams` method does not depend on any utilities from pytorch as it uses only the utilities from `tensorboard`. Thus, I think it will be safe to delete the test for `hparams` method as it does not depend on pytorch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122556
Approved by: https://github.com/huydhn
Summary:
Replacing `torch._export.aot_compile` callsites with
```
ep = torch.export._trace._export(.., predispatch=True) # Traces the given program into predispatch IR
so_path = torch._inductor.aot_compile_ep(ep, ...) # Takes an exported program and compiles it into a .so
```
This allows us to explicitly split up the export step from AOTInductor. We can later modify tests to do `export + serialize + deserialize + inductor` to mimic internal production use cases better.
Test Plan: CI
Differential Revision: D54808612
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122225
Approved by: https://github.com/SherlockNoMad, https://github.com/khabinov
Sympy simplifications don't obey floating point semantics, so don't
use Sympy for this. Keep them as is, only evaluate with the reference
implementations when all arguments are known.
This may end up getting subsumed by some other changes later, but I
wanted to understand if this was easy and it seems to be easy.
This doesn't actually depend on the earlier diffs on the stack and I can detach it.
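A stand-alone illustration of the mismatch (nothing PyTorch-specific, just sympy's exact arithmetic vs IEEE-754 doubles):
```py
import sympy

x = sympy.Symbol("x")
print(sympy.simplify((x + sympy.Float(1e20)) - sympy.Float(1e20)))  # x, by exact arithmetic
print((0.5 + 1e20) - 1e20)                                          # 0.0 with floating point
```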
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122823
Approved by: https://github.com/lezcano
Summary: Pre-grad fx passes expect information from shape propagation to be present. D55221119 ensured that `pass_execution_and_save` invokes shape propagation, and this diff adds a covering unit test to prevent regression.
Test Plan: New UT passes locally.
Differential Revision: D55440240
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122897
Approved by: https://github.com/khabinov, https://github.com/Skylion007
This PR fixes the two major issues that were discovered after the initial merge of PR #121561:
1. The Flash Attention support it added has severe performance regressions on regular shapes (power-of-two head dimensions and sequence lengths) compared with PR #115981. Its performance is worse than the math backend and it only has numerical stability advantages. This PR fixes this problem.
2. There is a flaw in the memory storage handling of PR #121561 which does not copy the gradients back to the designated output tensor. This PR removes the deprecated `TensorStorageSanitizer` class, which is unnecessary due to the more flexible backward kernel shipped by PR #121561.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122857
Approved by: https://github.com/jeffdaily, https://github.com/drisspg
This significantly speeds up real-world applications, such as LLMs.
Before this change llama2-7b fp16 inference ran at 1.5 tokens per sec;
after it, it runs at almost 6 tokens per sec.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122951
Approved by: https://github.com/ezyang
Dynamo skips user-defined modules from `torch/testing/_internal` (e.g. MLP, Transformer). This PR adds `torch/testing/_internal/...` to `manual_torch_name_rule_map`. It ensures FSDP CI + torch.compile are meaningfully tested.
A unit test shows frame count = 0 before and frame count > 0 after:
```pytest test/dynamo/test_trace_rules.py -k test_module_survive_skip_files```
Some FSDP unit tests actually start to compile modules with this change; add a triton availability check or disable those tests for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122851
Approved by: https://github.com/jansel
Summary:
Minor logging cleanup in distributed library
1. Don't use "f" formatted strings - address linter issues.
2. Nits: Make use of unused `e` (error) in a few logs.
3. Change info->debug as asked in issue #113545
4. Nit: rename log -> logger in a few files for consistency
5. Fix a linter error.
Test Plan:
1. Local build passes.
2. Linter is happy.
Reviewers: wanchaol
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122921
Approved by: https://github.com/wanchaol
Summary: When we migrate to torch.export, we won't put L['self'] as the prefix for all the fqn in nn_module_stack. This diff adds the branch to handle the new case.
Test Plan: buck test mode/opt caffe2/test/quantization:test_quantization -- -r set_module_name
Differential Revision: D55436617
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122819
Approved by: https://github.com/tugsbayasgalan
After we codegen a triton kernel in the triton codegen backend,
we cache the generated triton source code in the wrapper to avoid
producing multiple triton kernels with the same content.
In AOTI compilation flow, this caching mechanism imposes a strong requirement
on the codegen that we must generate the same triton source code
for the same schedule node in both python and cpp codegen phases.
Otherwise, we would end up with a mismatch between the kernel name
formed in the cpp codegen and the cuda kernel key produced from
the python codegen. Consequently, we would hit a missing-cuda-kernel
error.
The precomputed symbol replacements saved in V.graph.sizevars
can cause such source-code inconsistency related to the code for indexing
tensors. For example, let's say in the python codegen phase,
we produce "ks2\*48" as part of indexing an input for schedule
node A while yielding a replacement pair "ks0 -> ks2\*48" in
the precomputed replacements. In the second cpp codegen phase,
we would produce "ks0" for the same indexing code of schedule
node A due to the "ks0 -> ks2*48" replacement pair.
This PR fixed the issue by clearing precomputed_replacements
and inv_precomputed_replacements before cpp wrapper codegen.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122882
Approved by: https://github.com/desertfire
Summary: Vulkan rewrite so that quantized transpose 2d ops can run in a model
Test Plan:
Run vulkan api test:
# buck2 build --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output"
# buck-out//v2/gen/fbsource/xplat/caffe2/pt_vulkan_api_test_binAppleMac
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
[==========] Running 418 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 418 tests from VulkanAPITest
....
[----------] Global test environment tear-down
[==========] 418 tests from 1 test suite ran. (4510 ms total)
[ PASSED ] 417 tests.
[ SKIPPED ] 1 test, listed below:
[ SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
YOU HAVE 9 DISABLED TESTS
Run quantized vulkan api test: Note the linear quantized are failing but all the convolution tests still pass. Linear failures are being debugged.
# buck2 build --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output"
# buck-out//v2/gen/fbsource/xplat/caffe2/pt_vulkan_quantized_api_test_binAppleMac
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
[==========] Running 86 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 86 tests from VulkanAPITest
...
[ PASSED ] 77 tests.
[ FAILED ] 9 tests, listed below:
[ FAILED ] VulkanAPITest.linear_2d_flat
[ FAILED ] VulkanAPITest.linear_2d_small
[ FAILED ] VulkanAPITest.linear_2d_large
[ FAILED ] VulkanAPITest.linear_3d_flat
[ FAILED ] VulkanAPITest.linear_3d_small
[ FAILED ] VulkanAPITest.linear_3d_large
[ FAILED ] VulkanAPITest.linear_4d_flat
[ FAILED ] VulkanAPITest.linear_4d_small
[ FAILED ] VulkanAPITest.linear_4d_large
9 FAILED TESTS
YOU HAVE 8 DISABLED TESTS
# Run CUNET quantized model on hibiki board.
Reviewed By: manuelcandales
Differential Revision: D52344263
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122547
Approved by: https://github.com/manuelcandales, https://github.com/copyrightly, https://github.com/yipjustin
Summary:
We add a new op quantized.linear_unpacked_dynamic_fp16, which is essentially linear_dynamic_fp16 with a different (unpacked) weight/bias format.
This op does packing on the fly for each call with standard at::Tensor weight & bias.
Test Plan:
Included in commit.
test_quantized_op::test_unpacked_qlinear_dynamic_fp16
Differential Revision: [D55433203](https://our.internmc.facebook.com/intern/diff/D55433203)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122762
Approved by: https://github.com/jerryzh168
Summary:
We allow CPU to use the config use_runtime_constant_folding.
Changes include
1. Rearrange USE_CUDA flags. Add CPU sections that consume memory directly.
2. Codegen changes to accommodate cpp fusions for CPU only. Specifically, we shouldn't generate 2 headers that would cause re-declaration.
Test Plan: Activate tests that were deactivated for CPU before.
Reviewed By: khabinov
Differential Revision: D55234300
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122563
Approved by: https://github.com/chenyang78
Summary:
Original commit changeset: ebda663a196b
Original Phabricator Diff: D55271788
Test Plan: Some models are failing torch compile with this, retrying the tests
Reviewed By: colinchan15
Differential Revision: D55374457
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122709
Approved by: https://github.com/huydhn
In this PR, we add a systematic way to test that all HOPs are exportable, as the export team has been running into various bugs related to newly added HOPs due to a lack of tests. We do this by creating:
- hop_db -> a list of HOP OpInfo tests which are then used inside various flows, including export functionalities: aot-export, pre-dispatch export, retrace, and ser/der.
For now, we also create an allowlist so that people can bypass the failures. But we should discourage people from doing that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122265
Approved by: https://github.com/ydwu4, https://github.com/zou3519
Fixes#120794
Torch creates a cache of compiled kernels at $HOME/.cache/torch/kernels. The names used to save and select the cached kernels use cuda_major and cuda_minor to identify the GPU architecture for which the kernels were compiled. On ROCm this is insufficient: cudaDeviceProp's cuda_major and cuda_minor are mapped to hipDeviceProp_t::major and hipDeviceProp_t::minor, which correspond to the first and second numbers of the LLVM target for the architecture in question:
GFX1030 is major = 10, minor = 3
GFX1032 is major = 10, minor = 3
GFX900 is major = 9, minor = 0
GFX906 is major = 9, minor = 0
GFX908 is major = 9, minor = 0
Thus hipDeviceProp_t::major and hipDeviceProp_t::minor are insufficient to uniquely identify the ROCm architecture. This causes the ROCm runtime to raise an error when an operation uses a cached kernel that was first cached on an architecture with the same hipDeviceProp_t::major and hipDeviceProp_t::minor but a different LLVM target.
The solution provided in this PR is to replace the use of hipDeviceProp_t::major/hipDeviceProp_t::minor with hipDeviceProp_t::gcnArchName when PyTorch is compiled for ROCm, since gcnArchName contains a string identical to the LLVM target of the architecture in question.
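A minimal sketch of the distinction, assuming a visible GPU and a device-properties object that exposes `gcnArchName` (ROCm builds); the fallback branch covers builds where it is absent:
```
import torch

props = torch.cuda.get_device_properties(0)
# major/minor alone cannot tell gfx900, gfx906 and gfx908 apart (all 9.0),
# so the kernel-cache key should prefer the full LLVM target string when present.
arch_key = getattr(props, "gcnArchName", f"{props.major}.{props.minor}")
print(arch_key)  # e.g. "gfx90a:sramecc+:xnack-" on ROCm
```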
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121401
Approved by: https://github.com/jeffdaily, https://github.com/hongxiayang, https://github.com/malfet
Summary: Previous work `https://github.com/pytorch/pytorch/pull/120742` to enable `matrix_instr_nonkdim` only dealt with the autotuner benchmarking, but failed to enable the parameter in Triton meta for real runs. `matrix_instr_nonkdim` needs to be visible to the compiler driver to set up the optimization pipeline, so it's unlike other kernel parameters such as `BLOCK_N` that can be just set inside the kernel itself.
Test Plan:
P1201466917
triton_heuristics.template(
    num_stages=1,
    num_warps=4,
    triton_meta={'signature': {0: '*fp32', 1: '*fp32', 2: '*fp32'}, 'device': 0, 'device_type': 'cuda', 'constants': {}, 'configs': [instance_descriptor(divisible_by_16=(0, 1, 2), equal_to_1=(), ids_of_folded_args=(), divisible_by_8=())], 'matrix_instr_nonkdim': 16},
    inductor_meta={'kernel_name': 'triton_tem_fused_mm_0', 'backend_hash': None},
)
Perf:
Before: 1.693ms 0.134GB 79.28GB/s
After: 1.577ms 0.134GB 85.12GB/s
Differential Revision: D55456401
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122852
Approved by: https://github.com/xw285cornell
For some reason, if we construct `class Handle(RemovableHandle)` inside `register_multi_grad_hook`, then over time the call to `RemovableHandle.__init__` slows down more and more (when we have GC disabled). Perhaps this is related to the class attribute `next_id: int = 0`. Python experts: please let me know if you have thoughts 😅
I am open to any suggestions on if how we should deal with this `Handle` class. For now, I changed it to a private `_MultiHandle`.
<details>
<summary> Experiment Script </summary>
```
import gc
import time
import torch
NUM_TENSORS = int(5e4)
ts = [torch.empty(1, requires_grad=True) for _ in range(NUM_TENSORS)]
def hook(grad) -> None:
    return

gc.disable()

times = []
for i, t in enumerate(ts):
    start_time = time.time()
    torch.autograd.graph.register_multi_grad_hook([t], hook)
    end_time = time.time()
    times.append(end_time - start_time)
print([f"{t * 1e6:.3f} us" for t in times[1:6]])  # print first few times
print([f"{t * 1e6:.3f} us" for t in times[-5:]])  # print last few times

times = []
for i, t in enumerate(ts):
    start_time = time.time()
    t.register_hook(hook)
    end_time = time.time()
    times.append(end_time - start_time)
print([f"{t * 1e6:.3f} us" for t in times[1:6]])  # print first few times
print([f"{t * 1e6:.3f} us" for t in times[-5:]])  # print last few times
```
</details>
<details>
<summary> Results </summary>
Before fix:
```
['23.603 us', '19.550 us', '15.497 us', '12.875 us', '13.828 us']
['327.110 us', '341.177 us', '329.733 us', '332.832 us', '341.177 us']
['318.050 us', '315.189 us', '319.719 us', '311.613 us', '308.990 us']
['374.317 us', '394.821 us', '350.714 us', '337.362 us', '331.402 us']
```
Calling `register_multi_grad_hook` makes subsequent calls to itself and to `register_hook` slower (in fact, any call to `RemovableHandle.__init__`).
After fix:
```
['13.590 us', '9.060 us', '12.875 us', '7.153 us', '8.583 us']
['4.530 us', '5.245 us', '6.437 us', '4.768 us', '5.007 us']
['2.623 us', '1.907 us', '1.431 us', '1.669 us', '1.192 us']
['1.431 us', '1.431 us', '1.192 us', '1.192 us', '1.431 us']
```
</details>
Update: from @soulitzer
> Your suspicion about next_id is right. I think what is happening is that whenever a class attribute is set, it needs to invalidate some cached data for the subclasses one-by-one. eefff682f0/Objects/typeobject.c (L845)
And this PR fixes the issue by avoiding creating many subclasses dynamically. Changing next_id to something like List[int] or incrementing a global instead also fixes this.
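A standalone sketch (not the torch code) of the mechanism described above: bumping a class attribute in `__init__` while new subclasses keep being created forces CPython to invalidate cached data for every live subclass, so construction gets progressively slower.
```
import gc
import time

class Base:
    next_id: int = 0

    def __init__(self):
        # Mimics RemovableHandle.__init__: writes a class attribute on Base.
        self.id = Base.next_id
        Base.next_id += 1

class _MultiHandle(Base):  # post-fix pattern: one subclass, defined once
    pass

def reuse_subclass():
    return _MultiHandle()

def subclass_per_call():
    class Handle(Base):  # pre-fix pattern: a fresh subclass on every call
        pass
    return Handle()

gc.disable()  # as in the experiment above, so dead subclasses accumulate
for fn in (reuse_subclass, subclass_per_call):
    start = time.time()
    for _ in range(50_000):
        fn()
    print(f"{fn.__name__}: {time.time() - start:.3f}s")
```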
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122847
Approved by: https://github.com/soulitzer
ghstack dependencies: #122726
Fixes `During handling of the above exception, another exception occurred: [...] torch._dynamo.exc.Unsupported: generator`. `traceback.format_exc` uses generators, which aren't supported by dynamo yet.
<details>
<summary>current error message</summary>
```
======================================================================
ERROR: test_custom_fn_saved_tensors (__main__.TestCompiledAutograd)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/xmfan/core/pytorch/torch/fx/graph_module.py", line 307, in __call__
return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
File "/home/xmfan/core/pytorch/torch/nn/modules/module.py", line 1527, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/xmfan/core/pytorch/torch/nn/modules/module.py", line 1537, in _call_impl
return forward_call(*args, **kwargs)
File "<eval_with_key>.0", line 4, in forward
def forward(self, inputs, sizes, hooks):
IndexError: list index out of range
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/xmfan/core/pytorch/torch/testing/_internal/common_utils.py", line 2741, in wrapper
method(*args, **kwargs)
File "/home/xmfan/core/pytorch/test/inductor/test_compiled_autograd.py", line 499, in test_custom_fn_saved_tensors
self.check_output_and_recompiles(fn, 1)
File "/home/xmfan/core/pytorch/test/inductor/test_compiled_autograd.py", line 61, in check_output_and_recompiles
actual = list(opt_fn())
File "/home/xmfan/core/pytorch/test/inductor/test_compiled_autograd.py", line 495, in fn
loss.backward()
File "/home/xmfan/core/pytorch/torch/_tensor.py", line 534, in backward
torch.autograd.backward(
File "/home/xmfan/core/pytorch/torch/autograd/__init__.py", line 267, in backward
_engine_run_backward(
File "/home/xmfan/core/pytorch/torch/autograd/graph.py", line 766, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/home/xmfan/core/pytorch/torch/nn/modules/module.py", line 1527, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/xmfan/core/pytorch/torch/nn/modules/module.py", line 1537, in _call_impl
return forward_call(*args, **kwargs)
File "/home/xmfan/core/pytorch/torch/_dynamo/eval_frame.py", line 397, in _fn
res = fn(*args, **kwargs)
File "/home/xmfan/core/pytorch/torch/fx/graph_module.py", line 741, in call_wrapped
return self._wrapped_call(self, *args, **kwargs)
File "/home/xmfan/core/pytorch/torch/fx/graph_module.py", line 315, in __call__
_WrappedCall._generate_error_message(topmost_framesummary),
File "/home/xmfan/core/pytorch/torch/fx/graph_module.py", line 289, in _generate_error_message
tb_repr = get_traceback()
File "/home/xmfan/core/pytorch/torch/fx/graph_module.py", line 288, in get_traceback
return traceback.format_exc()
File "/home/xmfan/.conda/envs/benchmarks/lib/python3.10/traceback.py", line 183, in format_exc
return "".join(format_exception(*sys.exc_info(), limit=limit, chain=chain))
File "/home/xmfan/.conda/envs/benchmarks/lib/python3.10/traceback.py", line 136, in format_exception
return list(te.format(chain=chain))
File "/home/xmfan/core/pytorch/torch/_dynamo/convert_frame.py", line 941, in catch_errors
return callback(frame, cache_entry, hooks, frame_state, skip=1)
File "/home/xmfan/core/pytorch/torch/_dynamo/convert_frame.py", line 348, in _convert_frame_assert
unimplemented("generator")
File "/home/xmfan/core/pytorch/torch/_dynamo/exc.py", line 199, in unimplemented
raise Unsupported(msg)
torch._dynamo.exc.Unsupported: generator
```
</details>
With this change, we get back the descriptive error message:
<details>
<summary>post-fix error message</summary>
```
Traceback (most recent call last):
File "/home/xmfan/core/pytorch/torch/fx/graph_module.py", line 307, in __call__
return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
File "/home/xmfan/core/pytorch/torch/nn/modules/module.py", line 1527, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/xmfan/core/pytorch/torch/nn/modules/module.py", line 1537, in _call_impl
return forward_call(*args, **kwargs)
File "<eval_with_key>.0", line 4, in forward
def forward(self, inputs, sizes, hooks):
IndexError: list index out of range
Call using an FX-traced Module, line 4 of the traced Module's generated forward function:
def forward(self, inputs, sizes, hooks):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
getitem = inputs[0]
getitem_1 = inputs[1]; inputs = None
```
</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122746
Approved by: https://github.com/jansel, https://github.com/anijain2305
ghstack dependencies: #122691
Currently, when we create proxies for a list's elements in wrap_fx_proxy_cls, we create them using the same source as the list itself, e.g. `LocalSource(inputs)` instead of `GetItemSource(LocalSource(inputs), index=i)`. This results in invalid guards when the tensors the list contains become dynamic, and the guard system thinks the list is a tensor:
```
Malformed guard:
L['sizes'][0] == L['inputs'].size()[0]
Malformed guard:
2 <= L['inputs'].size()[0]
Traceback [...]
AttributeError: 'list' object has no attribute 'size'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122691
Approved by: https://github.com/jansel, https://github.com/anijain2305
In ARC Runners we are using dind-rootless to run docker-in-docker and
in rootless mode volume mounts always mount as root but are mapped to
the local `runner` user in ARC. This causes the build.sh and test.sh
scripts to fail because they run as the `jenkins` user and expect to
be able to write to the workspace path that's being mounted.
Signed-off-by: Thanh Ha <thanh.ha@linuxfoundation.org>
misc-include-cleaner was introduced in clang-tidy-17 as a way to check for missing and unused includes. However, there are lots of transitive headers in PyTorch, and it would take an enormous effort to add the annotations needed to guide this checker. For this reason, it's better to disable it for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122855
Approved by: https://github.com/cpuhrsch
This fixes a bug when casting a module that has DTensor parameters. The old behavior swaps the .data field of the Tensor subclass, which is incorrect when dealing with tensor subclasses that may have multiple child tensors.
This uses the `swap_tensors` method to swap all of the tensors not just the .data field.
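A hedged sketch of the idea with plain tensors; `torch.utils.swap_tensors` is the public helper, and the DTensor-specific wiring inside the module-casting path is not reproduced here.
```
import torch
from torch.utils import swap_tensors

a = torch.ones(2, 2, dtype=torch.float32)
b = a.to(torch.bfloat16)

# Assigning to .data would keep `a`'s Python object and only replace its storage,
# silently dropping any subclass state (e.g. a DTensor's sharding metadata).
# swap_tensors exchanges the two tensor objects wholesale instead.
swap_tensors(a, b)
print(a.dtype, b.dtype)  # torch.bfloat16 torch.float32
```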
Test plan:
```
pytest test/distributed/_tensor/test_api.py -k 'test_distribute_module_casting'
python test/distributed/fsdp/test_wrap.py -k test_auto_wrap_smoke_test_cuda_init_mode1_cpu_offload0_use_device_id_True
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122755
Approved by: https://github.com/wanchaol, https://github.com/mikaylagawarecki
This patch addresses the major limitations in our previous [PR #115981](https://github.com/pytorch/pytorch/pull/115981) through the new dedicated repository [AOTriton](https://github.com/ROCm/aotriton)
- [x] Only supports MI200 series GPU (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`).
* MI300X is supported. More architectures will be added once Triton supports them.
- [x] Only supports power of two sequence lengths.
* Now it supports arbitrary sequence lengths
- [ ] No support for varlen APIs.
* varlen API will be supported in future release of AOTriton
- [x] Only support head dimension 16,32,64,128.
* Now it supports arbitrary head dimensions <= 256
- [x] Performance is still being optimized.
* Kernel is selected according to autotune information from Triton.
Other improvements from AOTriton include
* Allow more flexible Tensor storage layout
* More flexible API
This is a more extensive fix to #112997
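As a hedged illustration (requires a ROCm build on a GPU supported by AOTriton), the public SDPA entry point can now be fed shapes that the old limitations ruled out:
```
import torch
import torch.nn.functional as F

# seq_len=1000 and head_dim=80 are neither powers of two nor in {16, 32, 64, 128}
q = torch.randn(2, 8, 1000, 80, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 1000, 80])
```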
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121561
Approved by: https://github.com/huydhn
`dynamo.explain()` was updated to return a structure but the docs weren't updated to match.
- Update the docs to use the new API
- Remove some dead code left when `explain` was updated.
- Drive-by: Fix some `nopython` uses that I noticed
- Drive-by: I noticed an ignored error coming from CleanupHook on shutdown - make it check the global before setting it.
Fixes#122573
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122745
Approved by: https://github.com/jansel
This PR reduces the difference between strict and non-strict exported program by
- Support `inline_constraints` for non-strict exported program
- Add runtime assertions for range constraints to non-strict exported program
After this PR, the following unit tests are no longer `expectedFailureNonStrict`:
- test_automatic_constrain_size
- test_export_with_inline_constraints
- test_redundant_asserts
- test_constrain_size_with_constrain_value
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122722
Approved by: https://github.com/pianpwk
Python 3.12 changed a few things with how `_PyInterpreterFrame`s are allocated and freed:
- Frames are now required to be placed on the Python frame stack. In 3.11, we could allocate frames anywhere in memory. In 3.12, we now need to use `THP_PyThreadState_BumpFramePointerSlow`/`push_chunk`/`allocate_chunk`. This method of allocating/freeing frames is also compatible with 3.11.
- The eval frame function is now responsible for clearing the frame (see https://docs.python.org/3/whatsnew/changelog.html#id128, the point about "...which now clear the frame.")
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122146
Approved by: https://github.com/jansel
Previously, we were checking `len(device_types)`, where `device_types` is a `list`. This meant that if there were multiple inputs, we would see something like `device_types = ["cuda", "cuda"]` and emit a false-positive warning. We should check `len(set(device_types))` instead.
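A minimal illustration of the difference (the surrounding warning logic is paraphrased):
```
device_types = ["cuda", "cuda"]  # two inputs, same device type
print(len(device_types))         # 2 -> the old check saw "more than one" and warned
print(len(set(device_types)))    # 1 -> the new check counts distinct types; no warning
```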
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122726
Approved by: https://github.com/soulitzer
Summary:
Right now we don't insert additional observers (i.e., share observers) if qspec.dtype and qspec.is_dynamic match exactly.
Since fixed qparams quantization spec and derived quantization spec do not currently have an is_dynamic field, observer sharing does not happen between them and quantization spec; in this PR we fix the issue by
adding is_dynamic to all quantization specs.
Note: SharedQuantizationSpec should probably be its own type in the future
TODO later:
(1). move all these fields (dtype, is_dynamic, quant_min, quant_max etc.) to QuantizationSpecBase,
(2). make SharedQuantizationSpec a separate type
(3). add quant_min/quant_max in observer sharing checking in pt2e/prepare.py
Test Plan:
python test/test_quantization.py -k test_fixed_qparams_qspec_observer_dedup
Differential Revision: [D55396546](https://our.internmc.facebook.com/intern/diff/D55396546)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122734
Approved by: https://github.com/andrewor14
This PR adds the vectorized indirect indexing so that we can further simplify the `CppVecKernelChecker` (done in the later PR #119734) and remove the check that throws `CppVecUnsupportedError`. A boundary assertion check is added on vectorized indices and via the new `indirect_assert` method on `Kernel` - the base implementation is for scalar indices, overridden in `CppVecKernel` for vectorized indices.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119655
Approved by: https://github.com/jansel
ghstack dependencies: #119654
Summary: It looks like this target has stopped working; let's fix it.
Test Plan:
```
buck2 run mode/opt //caffe2/benchmarks/dynamo/:test
```
now works
Differential Revision: D55389546
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122735
Approved by: https://github.com/xmfan
Vectorized boolean values in CPU Inductor were modeled with `Vectorized<float>` which cannot work for operations with other data types. This PR generalizes it with the new `VecMask` template class that can work for masks on any vectorized data types. The intrinsics implementation in `cpp_prefix.h` for mask conversion, cast and masked load are now implemented as the specialization for `VecMask` and moved to corresponding header files.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119654
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
This PR:
- disallows FakeTensor.data_ptr when it is called inside PT2 or fx tracing.
- disallows FunctionalTensor.data_ptr (python FunctionalTensor is only used in
PT2)
The motivation behind this is that the leading cause of segfaults when
using custom ops with PT2 is calling .data_ptr on FunctionalTensor or
FakeTensor.
This change is BC-breaking. If your code broke as a result of this, it's
because there was a bug in it (these .data_ptr should never be
accessed!). You can either fix the bug (recommended) or get the previous
behavior back with:
```
from torch._subclasses.fake_tensor import FakeTensor
from torch._subclasses.functional_tensor import FunctionalTensor
data_ptr = 0 if isinstance(tensor, (FakeTensor, FunctionalTensor)) else tensor.data_ptr()
```
Test Plan:
- existing tests
Differential Revision: [D55366199](https://our.internmc.facebook.com/intern/diff/D55366199)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122514
Approved by: https://github.com/ezyang, https://github.com/albanD, https://github.com/yifuwang, https://github.com/kurtamohler
Fixes#118795
This is a graph breaking partial fix for #120914. We still need -actual- module parametrization tracing support, but at least it doesn't blow up hard now.
**Background**: Module parametrization injects a property as the module parameter attribute that calls a `nn.Module` whose forward takes in a module parameter and returns a reparametrized module parameter.
Example:
```
import torch.nn as nn
from torch.nn.utils.parametrize import register_parametrization

class MyParametrization(nn.Module):
    def forward(self, X):
        # This reparametrization just negates the original parameter value
        return -X

m = nn.Linear(...)
p = MyParametrization()
register_parametrization(m, "weight", p)
# Accessing the "weight" attribute will invoke p's forward() on m's original weight
# and return the output as the new weight.
# m.weight here is now an injected property that does the above instead of an actual Parameter.
# This property is defined in torch/nn/utils/parametrize.py.
m.weight
# NB: Parametrization changes the module type (e.g. torch.nn.utils.parametrize.ParametrizedLinear)
print(type(m))
```
**Problem 1**: Dynamo has special tracing rules for things in `torch.nn`. Parametrizing a module changes the type of the module and the parametrized attribute, so now these rules wrongly affect tracing here. To fix this:
* For parametrized modules, call `convert_to_unspecialized()` to restart analysis where Dynamo starts inlining the module.
**Problem 2**: The issue seen in #118795 is that Dynamo will see a dynamically constructed tensor when `m.weight` is called and introduce that to its `tensor_weakref_to_sizes_strides` cache during fake-ification. This tensor is also made to be a graph input, since it's a module parameter. When guards are created for this module parameter input, the logic calls `m.weight` again and tries to look the result up in the cache, but this is a different tensor now, giving the `KeyError` symptom. To fix this:
* Replace Dynamo's `tensor_weakref_to_sizes_strides` cache with a `input_source_to_sizes_strides` cache.
* This cache was originally introduced in #100128.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121041
Approved by: https://github.com/anijain2305
Summary:
During tracing, some constants (tensor_constant{idx}) are generated internally.
Those constants are neither parameters nor buffers, and users have zero control over them.
To accommodate this, we should allow users not to pass in those internally generated constants while still being able to use the constants in the model.
Test Plan:
Included in commit.
```
build/bin/test_aot_inductor
```
Reviewed By: zoranzhao
Differential Revision: D55354548
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122690
Approved by: https://github.com/khabinov
By using `vcvt_f16_f32` and back
According to [benchmark_convert.py](d3279637ca) this makes float32 to float16 tensor conversion roughly 3 times faster: the time to convert a 4096x4096 float32 tensor drops from 5.23 msec to 1.66 msec on M2 Pro
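A rough timing sketch of the conversion being accelerated (this is not the linked benchmark_convert.py, which is not reproduced here):
```
import time
import torch

x = torch.randn(4096, 4096, dtype=torch.float32)
x.to(torch.float16)  # warm up once
start = time.time()
for _ in range(100):
    y = x.to(torch.float16)
print(f"{(time.time() - start) / 100 * 1e3:.2f} ms per float32 -> float16 conversion")
```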
Test plan: run `vector_test_all_types` + CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122702
Approved by: https://github.com/kimishpatel
`CXX_AVX[2|512]_FOUND` flags should indicate whether compiler supports generating code for given instruction set, rather than whether host machine can run the generated code.
This fixes a weird problem that surfaced after https://github.com/pytorch/pytorch/pull/122503 when builder can sometimes be dispatched to an old CPU architecture, that can not run AVX512 instructions, but can compile for those just fine
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122708
Approved by: https://github.com/jeanschmidt
Fixes https://github.com/pytorch/pytorch/issues/122404
Previously, when rewriting c10d collectives, if the group argument is
unspecified or None, we create a world pg variable out of thin air and
pass it to the rewrite target. The approach was problematic, as it
assumes the symbol `torch` is available in the scope (see #122404).
After #120560, dynamo can now trace dist.group.WORLD. If the group
argument is unspecified, we can just set it with dist.group.WORLD in the
rewrite target.
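A hedged sketch of the traced pattern (it needs an initialized process group, e.g. under torchrun, so treat it as illustrative):
```
import torch
import torch.distributed as dist

@torch.compile
def allreduce_times_two(t):
    # group is left unspecified; the rewrite now substitutes dist.group.WORLD,
    # which dynamo can trace after #120560, instead of synthesizing a world pg.
    dist.all_reduce(t)
    return t * 2
```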
Testing
pytest test/distributed/test_inductor_collectives.py -k test_dynamo_rewrite_dist_allreduce
Also verified with the repro provided in #122404
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122561
Approved by: https://github.com/wconstab
ghstack dependencies: #120560
Summary:
This diff
* Refactors triton and autotune caches to be child classes of the original memcache based cache infra
* Swaps scuba table for autotune
* Adds autotune time spent/saved to scuba table
Test Plan:
Local testing using:
```
buck run mode/opt fbcode//caffe2/test/inductor/:max_autotune -- -r test_max_autotune_remote_caching_dynamic_False
```
and
```
TORCH_INDUCTOR_AUTOTUNE_REMOTE_CACHE=1 buck2 run mode/opt //scripts/oulgen:runner
```
Differential Revision: D55332620
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122637
Approved by: https://github.com/jamesjwu
Summary:
The test is unstable at the moment. We need to make sure both the ATen
and Triton kernels work before reactivating the test.
Test Plan:
Disabling test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122682
Approved by: https://github.com/clee2000
This started as a re-land of https://github.com/pytorch/pytorch/pull/105590 focused on enabling it on MacOS, but it quickly turned into landing very limited platform-specific acceleration at this time (i.e., this PR does not add any NEON-accelerated code at all; it just enables vectorized compilation for the existing abstractions).
Enabling the test harness uncovered a number of latent issues in the CPU inductor that were fixed in the following PRs:
- https://github.com/pytorch/pytorch/pull/122511
- https://github.com/pytorch/pytorch/pull/122513
- https://github.com/pytorch/pytorch/pull/122580
- https://github.com/pytorch/pytorch/pull/122608
The following was added/changed to enable the vectorization code to work on MacOS:
- Added a VecNEON class to `_inductor/codecache.py` that is supported on all Apple Silicon Macs
- Added `Vectorized::loadu_one_fourth` to `vec_base.h`, limited to 8-bit types
- Changed the 64-bit integral type mappings to `int64_t`/`uint64_t` to align with the rest of the code, as on MacOS `int64_t` is a `long long` rather than `long` (see https://github.com/pytorch/pytorch/pull/118149 for more details)
See table below for perf changes with and without torch.compile using [gpt-fast](https://github.com/pytorch-labs/gpt-fast) running `stories15M` on M2 Pro:
| dtype | Eager | Compile (before) | Compile (after) |
| ------ | ------ | --------- | --------- |
| bfloat16 | 120 tokens/sec | 130 tokens/sec | 156 tokens/sec |
| float32 | 158 tokens/sec | 140 tokens/sec | 236 tokens/sec |
| float16 | 235 tokens/sec | 81 tokens/sec | 58 tokens/sec |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122217
Approved by: https://github.com/jansel
Summary:
`torch.export` is a powerful tool for creating a structured and shareable package from arbitrary pytorch code. One great use case of `torch.export` is sharing models or subgraphs in a way that allows results to be easily replicated. However, in the current implementation of `export`, the `example_inputs` field is thrown out. When trying to replicate bugs, benchmarks, or behaviors, losing the original input shapes and values makes the process much messier.
This change adds saving and loading for the `example_inputs` attribute of an `ExportedProgram` when using `torch.export.save` and `torch.export.load`. This simple addition makes `ExportedPrograms`s a fantastic tool for performance and accuracy replication. For example, with this change we enable the following workflow:
```
# Script to create a reproducible accuracy issue with my model.
kwargs = {"fastmath_mode": True}
exp_program = export(my_model, sample_inputs, kwargs)
result = exp_program.module()(*sample_inputs, **kwargs)
# Uh oh, I don't like that result, let's send the module to a colleague to take a look.
torch.export.save(exp_program, "my_model.pt2")
```
My colleague can then easily reproduce my results like so:
```
# Script to load and reproduce results from a saved ExportedProgram.
loaded_program = torch.export.load("my_model.pt2")
# The following line is enabled by this Diff, we pull out the arguments
# and options that caused the issue.
args, kwargs = loaded_program.example_inputs
reproduced_result = loaded_program.module()(*args, **kwargs)
# Oh, I see what happened here, let's fix it.
```
Being able to share exact inputs and arguments makes `ExportedPrograms` much
cleaner and more powerful with little downside. The main potential issue with this change
is that it does slightly increase the size of saved programs. However, the size of
inputs will be much smaller than parameters in most cases. I am curious to hear
discussion on saved file size though.
The deserialization of `example_inputs` is currently implemented as `Optional`. Although this won't affect users of `export.save` and `export.load`, it does give backwards compatibility to any direct users of `serialize` and `deserialize`.
Test Plan:
This diff includes a new test which exercises the save / load flow with multiple args and kwargs.
```
buck test //caffe2/test:test_export -- TestSerialize
```
Differential Revision: D55294614
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122618
Approved by: https://github.com/zhxchen17