**Motivation**
Enable SVE vectorization with `torch.compile`
Extends PR: #119571
* This PR enables vectorization for codegen part using SVE-256 (vec length)
* The changes can be extended to other SVE vec lengths
I've done some comparisons against existing NEON implementation with SVE vectorization enabled route for `torch.compile`
Test results are for 8 cores on ARM Neoverse_V1
<img width="359" alt="Screenshot 2024-08-28 at 16 02 07" src="https://github.com/user-attachments/assets/6961fbea-8285-4ca3-b92e-934a2db50ee2">
It's worth mentioning, for standalone `SiLU op` there's a `~1.8x` speedup with `torch.compile`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134672
Approved by: https://github.com/jgong5, https://github.com/malfet
The change in pytorch/pytorch#136785 enabled these jobs to run on LF runners however we saw a sudden large spike in cost once that happened last week that would have caused us to over use our available AWS credits. This change hardlocks the tests for these jobs to Meta runners. We need this at least until we can figure out how to handle the additional spend caused by these jobs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137616
Approved by: https://github.com/Skylion007, https://github.com/seemethere
Summary: We hipify NCCLUtils.h from nccl.h to rccl/rccl.h. This follows the format of the rocm rpm suite (the header is in include/rccl/rccl.h), however the source code is just src/rccl.h. Using the rccl/rccl.h will make us find the rpm's header but not the src code's header.
Test Plan:
buck run mode/opt-amd-gpu -c hpc_comms.use_rccl=develop -c fbcode.split-dwarf=True --config rccl.build_rdma_core=true --config rccl.adhoc_brcm=true //aps_models/ads/icvr:icvr_launcher -- mode=local_ctr_cvr_cmf_rep_1000x_v1_no_atom data_loader.dataset.table_ds=[2024-09-04] data_loader.dataset.batch_size=512 max_ind_range=10
w/o this diff, it'll show 2.18 nccl version
Differential Revision: D62371434
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135472
Approved by: https://github.com/jeffdaily, https://github.com/cenzhaometa
Summary: Fixes a couple of problems: constants didn't have metadata before creating graph signatures, and graph signatures weren't updated when lifting constants.
Test Plan: fixed test
Differential Revision: D64081786
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137547
Approved by: https://github.com/tugsbayasgalan
**Summary**
Previously, we assumed the packed weight for (`MKL/MKLDNN`) linear operations was at `new_input_nodes[1]`. However, this is not the case for `MKL linear`, where `new_input_nodes[1]` contains the original weight instead of the packed weight. To generalize the code, in this PR, we identify nodes that are present in `input_nodes` but not in `new_input_nodes`—indicating they are no longer used by the GEMM template and can be considered candidates for deletion.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135101
Approved by: https://github.com/jgong5, https://github.com/jansel
It seems that there's a bug in `TensorMaker` - it would treat `storage_offset` as bytes when calculating the storage size, but as numel when setting the tensor `storage_offset`. This seems to be causing tensors returned by get_buffer() with non-0 offset to report wrong storage size.
Will look into the `TensorMaker` issue further. But for `get_buffer()`, it seems more natural to just incorporate the offset into the data pointer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137569
Approved by: https://github.com/weifengpy
ghstack dependencies: #137567
during auto_functionalize_v2 if we encounter a view such that size() stride() and storage_offset() matches the base
we create a view that is regenerated by calling aten.alias instead of as_strided for better performance.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137149
Approved by: https://github.com/zou3519
Remove all the keyword static for constants of vec registers in exp_u20 implementation. With the bf16 input shape of BertLarge, the SDPA kernel improves from 5.1ms to 4.7ms on SPR 56 threads.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137571
Approved by: https://github.com/jgong5
Fixes#115725. Note that the github issue title is misleading. Read the comments to understand what the problem is really about.
The PR improves the documentation and CMake's behavior for ROCM builds.
- Documentation: There were two environment variables for ROCm builds that are now documented. `ROCM_PATH` and `PYTORCH_ROCM_ARCH`.
- CMake: Improved diagnostic messaging and error handling with respect to `ROCM_PATH`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137308
Approved by: https://github.com/pruthvistony, https://github.com/jithunnair-amd, https://github.com/jeffdaily
The test also add singpost log for the benchmarks that pass.
to test run I ran python check_results.py test_check_result/expected_test.csv test_check_result/result_test.csv out.csv
results
```
WIN: benchmark ('a', 'instruction count') failed, actual result 90 is -18.18% lower than expected 110 ±1.00% please update the expected results.
REGRESSION: benchmark ('b', 'memory') failed, actual result 200 is 100.00% higher than expected 100 ±+10.00% if this is an expected regression, please update the expected results.
PASS: benchmark ('c', 'something') pass, actual result 107 +7.00% is within expected 100 ±10.00%
MISSING REGRESSION TEST: benchmark ('d', 'missing-test') does not have a regression test enabled for it.
You can use the new reference expected result stored at path: out.csv.
a,instruction count,90,0.01
b,memory,200,0.1
c,something,100,0.1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137551
Approved by: https://github.com/aorenste
When non-blocking mode is enabled, we need to make sure `ncclComm_` is ready before calling NCCL APIs on it.
`NCCLComm::getNcclComm` help us do that (thanks to a wait function inside), thus is preferred than directly using `ncclComm_`.
To prevent `ncclComm_` from being directly used outside, e.g. in `ProcessGroupNCCL`, we also move it as a private member of `NCCLComm` class -- the external-facing wrapper.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137573
Approved by: https://github.com/Skylion007, https://github.com/shuqiangzhang, https://github.com/c-p-i-o
ghstack dependencies: #137572
synchronization
Summary:
Barrier is essentially intended to block CPU thread (instead of GPU
streams). Before we used 2 stream synchronizations (1. current stream
blocked by nccl stream end event, 2. CPU thread blocked on current
stream). This is unnecessary as we already have CPU thread blocking
logic in wait(). Also, adding barrier specific code block in the general
GPU synchronize() API is intrusive and confusing.
This PR cleans this.
Test Plan:
CI
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137516
Approved by: https://github.com/fduwjj, https://github.com/kwen2501
- Previously the detection would fail before user calling APIs such as `torch.cuda.set_device()`. This is because the detection logic requires nvml initialization. In this PR, we added explicit nvml initialization (which idempotent).
- Previously any nvml issue occurred in the detection logic would result in fatal error. Now we issue an informative warning and return a topology assuming no NVLink connectivity.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137530
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472, #137473, #137474, #137475, #137529
## This Stack
Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them.
## This PR
- Replaces one-shot all-reduce with `symm_mem::one_shot_all_reduce_out`
- Replaces two-shot all-reduce with `symm_mem::two_shot_all_reduce_`
- Removes HCM all-reduce (at least for now). Due to the nature of its accumulation order, we can't guarantee the numerical consistency across all ranks.
- Removes the `IntraNodeComm` python binding (its original purpose is superceded by `SymmetricMemory`).
- Removes methods that were made for the python binding.
- Replaces nvlink detection logic with `DMAConnectivityDetector`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137475
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472, #137473, #137474
## This Stack
Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them.
## This PR
Implement `symm_mem::multimem_one_shot_all_reduce_out`. The out-variant is more suitable for `IntraNodeComm` integration.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137474
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472, #137473
Before this changes, tests for operators like `eye` or `triu_indices` were essentially a test that respective CPU operators are stable, as cpu_sample and mps_sample were the same
Moved the logic to `transform_opinfo_sample_to_mps` whicih in addition to copying tensors is also tweaks `kwargs`
Discovered that:
- `torch.randn` and `torch.randint` fall into the same undefined category
- `torch.logspace` is not implemented for MPS
- Allow 1.0 absolute tolerance for all `torch.linspace` calls over integral input as rounding is wrong on the MPS side
- `torch.triu_indices` are not implemented (PR is coming, this is how I've discovered this problem)
- `torch.signal.windows.kaiser` fails because `aten::i0` is not implemented
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137601
Approved by: https://github.com/albanD
This is a utility to aid the torch.compile debugging. You provide a function that returns True on success, False on failure, or do something out of process and run bisect_helper `good | bad`.
The bisector will first go through backends - `eager`, `aot_eager`, `aot_eager_decomp_partition`, `inductor` to find the first failing backend. Then, it will go through subsystems within the backend - currently limited but could be expanded - and try to find the first subsystem for which disabling fixes the problem. Once it has found the failing subsystem, it will find the number of times the subsystem is applied, and then bisect through it.
An example usage of how to hook it up for aot_eager_decomp_partition and decomposition subsystem is :
```
from torch._inductor.bisect_helper import BisectionManager
if op in CURRENT_DECOMPOSITION_TABLE:
if BisectionManager.disable_subsystem("aot_eager_decomp_partition", "decomposition", lambda: repr(op)):
return NotImplemented
```
Once it has discovered the problematic change, it will print out the associated debug info, and you can set the same limits with `TORCH_BISECT_BACKEND` `TORCH_BISECT_SUBSYSTEM` and `TORCH_BISECT_MAX`.
We could add further options as an automated way of going through a check list for checking divergence - e.g., the mode to emulate amp casts.
Fix for https://github.com/pytorch/pytorch/issues/126546
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131936
Approved by: https://github.com/ezyang
Related: https://github.com/pytorch/xla/issues/7799#issuecomment-2375818263
Follow ups: Do the same for maia and mtia
## Motivation
With the move to `weights_only` by default, we are making an explicit decision not to allowlist GLOBALs required to deserialize `numpy` tensors by default. The implication is that backends relying on numpy for serialization will fail loudly when `torch.load` flips `weights_only`.
However, we make the observation that this dependency on numpy was legacy and is not actually needed anymore. So we can remove it, which aligns with our weights_only strategy.
## Why is this ok?
The following comment on why numpy is necessary for serialization is legacy
c87c9f0a01/torch/_tensor.py (L303-L312)
We no longer do the following, though it was the case 5 years ago in the PR that added this
> CPU storage is reconstructed with randomly initialized data, moved onto backend device, and then storage is updated to the serialized content
**Instead what now happens is that CPU storage is constructed with data from the file **and then** moved onto backend device.**
Old behavior (`legacy_load`): 67adda891a/torch/serialization.py (L620)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137444
Approved by: https://github.com/albanD
## Overview
This PR adds a `shard_placement_fn: Optional[Callable[[nn.Parameter], Optional[Shard]]` arg to `fully_shard` that allows users to specify FSDP sharding on a nonzero tensor dim. If doing so, then the tensor dim size must be divisible by the FSDP shard world size.
```
# Example:
def shard_placement_fn(param: nn.Parameter) -> Optional[Shard]:
largest_dim = largest_dim_size = -1
for dim, dim_size in enumerate(param.shape):
if dim_size > largest_dim_size:
largest_dim = dim
largest_dim_size = dim_size
return Shard(largest_dim)
fully_shard(module, shard_placement_fn=shard_placement_fn)
```
## Follow-Ups
- **Copy kernels:** For all-gather copy-out, we currently copy-out to temporaries and then chunk-dim-0 -> cat-shard-dim, incurring an extra copy for parameters sharded on nonzero tensor dim. Similarly, for reduce-scatter copy-in, we currently chunk-shard-dim -> cat-dim-0, incurring an extra copy for gradients sharded on nonzero tensor dim. @yifuwang has ideas for adding additional split size args to the copy ops that allows fusing these extra copies into the existing all-gather copy-out and reduce-scatter copy-in.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137496
Approved by: https://github.com/weifengpy
ghstack dependencies: #137593
Summary:
NCCL 2.23.4 provides the profiler plugin feature, which traces collective, p2p, proxyOps, and other events.
The diff supports the following feature: when NCCL times out, the flight recorder can also dump traces in the profiler plugin.
Test Plan:
```
tensor = torch.tensor([dist.get_rank()], dtype=torch.int32, device=dev)
# Create a list with same number of elements as world size (aka no. of ranks)
# During allgather this list is going to be populated with tensors from all ranks (aka all gather)
gathered_tensors = [torch.zeros_like(tensor) for _ in range(WORLD_SIZE)]
# get collective from all ranks
if i <= 10 or RANK != 0:
dist.all_gather(gathered_tensors, tensor)
```
My script triggers flight recoder.
```
trainer/0 [0]:E0927 12:07:22.643702 1012209 ProcessGroupNCCL.cpp:1356] [PG ID 0 PG GUID 0(default_pg) Rank 0] ProcessGroupNCCL preparing to dump debug info.
trainer/0 [0]:I0927 12:07:22.643784 1012209 ProcessGroupNCCL.cpp:392] NCCL_PROFILER_PLUGIN: /data/users/zhiyongww/fbsource/fbcode/scripts/nbahl/libnccl_profiler_plugin.so
trainer/0 [0]:I0927 12:07:22.643805 1012209 plugin.cpp:559] Profiler start dump
trainer/0 [0]:I0927 12:07:22.645249 1012209 ProcessGroupNCCL.cpp:1363] [PG ID 0 PG GUID 0(default_pg) Rank 0] ProcessGroupNCCL dumping nccl trace to /tmp/nccl_trace_rank_0
trainer/0 [0]:I0927 12:07:22.645418 1012209 NCCLUtils.cpp:348] Finished writing NCCLPG debug info to /tmp/nccl_trace_rank_0
```
Content from /tmp/nccl_trace_rank_0: P1614645283
Differential Revision: D61929401
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137523
Approved by: https://github.com/c-p-i-o
See `test_inline_closure_returned_by_another_function_and_captures` and #136814 for more context.
In #90286, we introduced an optimization so that for captured cells that are unmodified during a Dynamo trace, `UserFunctionVariable` will represent them as variable of the cell's actual value, rather than a `NewCellVariable`.
Later on we introduced more mechanisms to model such cells across function calls (#104222), and across function calls where `NestedUserFunctionVariable::bind_args` need to look up further in the parent frames (#106491) to find these cells' values.
This patch removes `InlinedClosureVariable` in favor of a simpler modelling, which is also more consistent with what was introduced in #90286, i.e., just model these cells as their contents, in `symbolic_locals`.
This fixes#136814 because resolution of `InlinedClosureVariable` to the underlying cell content value happens in
`NestedUserFunctionVariable::bind_args`, which requires Dynamo to have the value in scope at the function call site (when Dynamo does inlining), but's not always the case (as the test case shows). However, if we model the cells in `symbolic_locals`, we never need such resolution, and the values are directly stored into the `NestedUserFunctionVariable::closure` upon the function creation, at which point Dynamo always has the cell value in `symbolic_locals` for look up.
Fixes#136814.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137510
Approved by: https://github.com/williamwen42
Summary:
# context
* enable the `_get_user_embeddings` function
* run failed at P1610151892
```
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
GuardOnDataDependentSymNode: Could not guard on data-dependent expression u22 <= 0 (unhinted: u22 <= 0). (Size-like symbols: u22)
ATTENTION: guard_size_oblivious would fix the error, evaluating expression to False.
Maybe you need to add guard_size_oblivious to framework code, see doc below for more guidance.
Potential framework code culprit (scroll up for full backtrace):
File "/data/users/hhy/fbsource/buck-out/v2/gen/fbcode/38472faba4e3e6c1/aps_models/ads/icvr/__icvr_launcher_live__/icvr_launcher_live#link-tree/torch/_decomp/decompositions.py", line 1692, in native_layer_norm_backward
if M <= 0 or N <= 0:
```
```
N = prod(inner_dims) # type: ignore[arg-type]
M = prod(outer_dims) # type: ignore[arg-type]
if M <= 0 or N <= 0:
return (
input.new_zeros(input_shape) if output_mask[0] else None,
input.new_zeros(input_shape[axis:]) if output_mask[1] else None,
input.new_zeros(input_shape[axis:]) if output_mask[2] else None,
)
```
# changes
* use guard_size_oblivious since the new_zeros return is kind of optimization, shouldn't impact the correctness of the follow up code logic.
* the size `ret[i][j]` could be zero, so the change in V1 isn't valid
* for more details: [post](https://fb.workplace.com/groups/6829516587176185/permalink/8003616173099548/)
```
from torch.fx.experimental.symbolic_shapes import guard_size_oblivious
if guard_size_oblivious(M <= 0) or guard_size_oblivious(N <= 0):
```
# past
* found `u22` was introduced at
```
def _wait_impl(self) -> List[List[int]]:
# Can not use is_torchdynamo_compiling(), as every such condition should be independent for compilation with graph breaks.
if isinstance(self._splits_awaitable, dist.Work):
self._splits_awaitable.wait()
ret = self._output_tensor.view(self.num_workers, -1).T.tolist() # <------ u22 introduced here
if not torch.jit.is_scripting() and is_torchdynamo_compiling():
for i in range(len(ret)):
for j in range(len(ret[i])):
torch._check_is_size(ret[i][j]) # <---------- my question: why the _check_is_size isn't enough??
torch._check(ret[i][j] > 0) # <------ added by diff V1
```
Test Plan:
# run command
```
TORCH_SHOW_CPP_STACKTRACES=1 TORCHDYNAMO_EXTENDED_DEBUG_CPP=1 TORCH_LOGS="+graph_code,output_code,dynamic,aot,guards,verbose_guards,recompiles,graph_breaks" TORCH_TRACE=/var/tmp/tt buck2 run fbcode//mode/opt fbcode//aps_models/ads/icvr:icvr_launcher_live -- mode=fmc/local_ig_fm_v4_mini training.pipeline_type=pt2 2>&1 | tee -a `tagT`.`tagH`.log
```
# results
* before
**without enabling `_get_user_embeddings`**
[14 Failures and Restarts](https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmp2eNI7p/failures_and_restarts.html)
log: P1610151892
{F1889387940}
* V1
enable `_get_user_embeddings`
with `torch._check(ret[i][j] > 0)`
[13 Failures and Restarts](https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmp6J1iY9/failures_and_restarts.html)
{F1889388378}
* V2
enable `_get_user_embeddings`
with `if guard_size_oblivious(M <= 0) or guard_size_oblivious(N <= 0):`
[tlparse](https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpFhZZyC/index.html)
if guard_size_oblivious(M <= 0) or guard_size_oblivious(N <= 0):
Differential Revision: D63424929
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136798
Approved by: https://github.com/ezyang
Summary: Previously triton_reshape will generate code with `Min` keyword in it, which is incorrect. This diff updates the triton_reshape function to properly expand `Min` keyword to `<`.
Test Plan:
```
buck2 run @//mode/{opt,mtia,inplace} //glow/fb/fx/fba/tests:test_fba_inductor -- -r test_Min_keyword_in_block_shape
```
Differential Revision: D63850158
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137357
Approved by: https://github.com/blaine-rister, https://github.com/eellison
This PR doesn't change the logic of `test_graph_input_is_async` - it just adds an additional check to the graph input type to ensure it's always `AsyncCollectiveTensor` as expected. It would potentially make it easier to show to users that we already support `AsyncCollectiveTensor` as graph input.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137253
Approved by: https://github.com/bdhirsh
Fixes#137422
Add parameter type definition in API docs to clarify allowed value type, eliminate users pass `None` as `dim` value directly.
```python
>>> import torch
>>> x = torch.randn(3,1,2)
>>> x.squeeze(dim=None)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: Please look up dimensions by name, got: name = None.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137485
Approved by: https://github.com/albanD
The test runs all its combination (512) sequentially, so it takes more than 30 minutes to finish or timeout on ASAN after one hour. Parametrizing it will break it up, so individual tests can finish and aren't need to be marked as slow anymore.
Also, the test seems to run OOM on a 2xlarge with `std::bad_alloc` memory error. Maybe, this would also fix the issue (pending CI testing)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137447
Approved by: https://github.com/albanD, https://github.com/malfet
## This Stack
Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them.
## This PR
Implement `symm_mem::two_shot_all_reduce_`. Later we'll replace the two-shot all-reduce in `IntraNodeComm` with these.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137473
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472
## This Stack
Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them.
## This PR
Implement `symm_mem::one_shot_all_reduce` and `symm_mem::one_shot_all_reduce_out`. Later we'll replace the one-shot all-reduce in `IntraNodeComm` with these.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137472
Approved by: https://github.com/Chillee, https://github.com/weifengpy
ghstack dependencies: #137471
## This Stack
Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them.
## This PR
Refine the corss-device synchronization primitives to make it clearer when to use which synchronization pattern.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137471
Approved by: https://github.com/Chillee, https://github.com/weifengpy
Moves `TransformGetItemToIndex` to a file where dynamo stores other traceable HOP concepts. (We don't trace through torch.* modules by default)
Tracing through the mode required fixing a bug in dynamo autograd function, which fixed a graph break, which caused the autograd test failures (skipping for now and will file an issue)
Previously those tests were in essence running in eager, because dynamo would fallback due to an arg mismatch error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137120
Approved by: https://github.com/yanboliang, https://github.com/malfet
ghstack dependencies: #137114, #137115, #137116, #137117
This PR implements tracing of with contexts with TorchFunction modes which have the default enter/exit behavior (ie pushing/popping the mode)
Typically the bytecode for a context manager looks like this during a graph break:
1. graph call
2. enter context
3. unsupported code
4. exit context
5. resume call
resume fn structure:
1. enter context
2. jump
...
3. exit context
The issue with torch function modes is that side effects will replay any mutations to the torch function stack performed during tracing. So, we do not need to enter and exit around the unsupported code in the original function (doing so would result in a duplicate torch function mode entry during execution of the unsupported code), and we don't need to enter again in the resume function (the mode that was pushed from the side effects bytecode would still be on the stack).
So for torch function modes the structure of our output code is this:
1. graph call
2. mutate tf mode stack to replay mutations
4. unsupported code
5. on exception restore stack
6. resume function
Then our resume fn looks like this:
1. no-op enter torch function mode
2. jump
3. exit tf mode
To implement the no-op enter of the torch function mode I added torch function mode in polyfill which no-op enters, but normally exits. This is needed because we still want to trace the with context in the resume function, and exit properly (the exit instructions will still be in the function, so we need to generate instructions to set up the context).
Separately from the bytecode, dynamo also tracks contexts on the block stack, which is how the SETUP_* instructions are implemented. Naturally at a graph break, we exit these block stacks to properly reset the contexts entirely, so that we can re-enter around the unsupported code soundly. However once again, in the torch function mode case, in the event of a graph we do not want to perform any exit side effects because we want to preserve the state of the mode stack as is so that we will properly update the stack with bytecode mentioned in the first section. If we exited here, dynamo would pop the mode off of the symbolic stack, and not update the true python torch function mode stack with the suffix bytecode. All in all, for torch function modes we enter exactly once, update the global torch function mode stack with side effects bytecode, re-read this stack when compiling the resume function, and exit exactly once in the resume function. This matches the semantics of eager exactly.
Approved by: https://github.com/williamwen42
ghstack dependencies: #134732, #133137, #135443, #135444
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137114
Approved by: https://github.com/yanboliang
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/135106. For AOTI, there is the Inductor IR of weight
```
ReinterpretView(
StorageBox(
ConstantBuffer(name='L__self___mlp_0_weight', layout=FixedLayout('cpu', torch.float32, size=[64, 128], stride=[128, 1]))
),
FixedLayout('cpu', torch.float32, size=[128, 64], stride=[1, 128]),
origins=OrderedSet([addmm])
)
```
In the post-processing step of the GEMM template, the used weight was before permutation, leading to correctness issues. In this PR, we address this by reshaping the weight to the expected size and stride before the weight prepack.
**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_aot_inductor.py -k test_misc_1_max_autotune_True_non_abi_compatible_cpu
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_aoti_linear
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_aoti_linear_multi_view_operations
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136421
Approved by: https://github.com/jgong5, https://github.com/desertfire
This PR is for supporting calling `parallelize_module` from within a model definition, making the model a parallel one.
Calling `parallelize_module` is an alternative to maintaining a set of `ColumnWiseLinear`, `RowWiseLinear`, etc, while still being able to directly author a parallel model.
(The motivation for authoring a parallel model is that there may be other distributed operations, which may not be easily captured by any module, see the forward function below. Alternatively speaking, the purpose is to exploit the expressiveness of DTensor -- we need to first create DTensors before calling ops on them. Having parallelized modules in model is one way of creating DTensors.)
For example:
```
class FeedForward(nn.Module):
def __init__(self, config: TransformerArgs) -> None:
super().__init__()
w1 = nn.Linear(config.dim, config.hidden_dim, bias=False)
w2 = nn.Linear(config.hidden_dim, config.dim, bias=False)
w3 = nn.Linear(config.dim, config.hidden_dim, bias=False)
self.w1 = parallelize_module(w1, Colwise)
self.w2 = parallelize_module(w2, Rowwise)
self.w3 = parallelize_module(w3, Colwise)
def forward(self, x: Tensor) -> Tensor:
y: DTensor = self.w2(F.silu(self.w1(x)) * self.w3(x))
# y is a DTensor with Partial placement; we can return it as is.
return y
# Or we can convert it to Replicate -- there is modeling flexibility here.
return y.redistribute(Replicate())
with device_mesh:
model = FeedForward(config)
# Now model is a model parallelized onto device_mesh
y = model(x)
```
The `device_mesh` actually used for `parallelize_module` would be retrieved from the ambient context.
Calling `parallelize_module` from within model hierarchy also saves the use of *FQNs* as in the out-of-model annotation case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134247
Approved by: https://github.com/tianyu-l
When we are autotuning matmuls the aten.mm and the triton template choices take in an externally allocated tensor that can be a view into a pre-planned aten.cat. So long as the output shape and stride of the matmul matches the slice of the cat we're planning, we can realize the mm directly into the cat.
Discussion for reviewers:
It feels a little bit odd that in the existing code we set the output of aten.mm as [FlexibleLayout](bcac71517c/torch/_inductor/kernel/mm.py (L156)). While is this correct, it might lead to passing non performant output strides to cublas.. I guess this is better than a copy ? Not sure. We could also introduce a Layout that denotes a Fixed shape and stride which we control allocation
```
class AllocatedFixedLayout(FixedLayout)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132554
Approved by: https://github.com/jansel
For `autograd.Function`, the engine will try to allocate correctly-shaped zeros for `None` grads (i.e. in the case where the output isn't used downstream). It determines the shape of these zeros from the `VariableInfo` entry, which is derived from the forward output shape. For the NJT forward output case, the size info stored will contain a nested int, and calling `zeros()` with this size throws:
```
RuntimeError: .../build/aten/src/ATen/RegisterCPU.cpp:5260: SymIntArrayRef expected to contain only concrete integers
```
This PR fixes this by storing the full tensor in the `VariableInfo` for the nested case and calling `zeros_like()` to allocate correctly-shaped zeros. This is pretty inefficient; ideally we would want to save just the NJT shape and be able to construct zeros from it, but this requires factory function support for nested ints (WIP). So this is a short-term fix until we have that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136875
Approved by: https://github.com/soulitzer
compile time benchmark for the min cut partitioner. I'm hoping that this is a reasonable benchmark because:
(1) it consists of a single input + many weights that are used sequentially
(2) contains a mix of recompute vs non-recomputed ops (matmul + sin)
(3) it is relatively simple
from running locally:
```
collecting compile time instruction count for aotdispatcher_partitioner_cpu
compile time instruction count for iteration 0 is 21764219181
compile time instruction count for iteration 1 is 12475020009
compile time instruction count for iteration 2 is 12463710140
compile time instruction count for iteration 3 is 12455676489
compile time instruction count for iteration 4 is 12451344330
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136760
Approved by: https://github.com/ezyang
ghstack dependencies: #136759
this adds a few compile time benchmarks for some disjoint paths in AOTDispatcher:
(1) inference vs training code paths
(2) "subclasses" vs "no subclasses" codepaths
Also see https://github.com/pytorch/pytorch/pull/136760 for a partitioner benchmark (I'm not sure why ghstack didn't display the stack nicely)
I ran locally, and got these numbers on the 4 paths:
```
collecting compile time instruction count for aotdispatcher_inference_nosubclass_cpu
compile time instruction count for iteration 0 is 11692348671
compile time instruction count for iteration 1 is 3026287204
compile time instruction count for iteration 2 is 3011467318
compile time instruction count for iteration 3 is 3004485935
compile time instruction count for iteration 4 is 3003087410
collecting compile time instruction count for aotdispatcher_training_nosubclass_cpu
compile time instruction count for iteration 0 is 6068003223
compile time instruction count for iteration 1 is 5585418102
compile time instruction count for iteration 2 is 5581856618
compile time instruction count for iteration 3 is 5581651794
compile time instruction count for iteration 4 is 5578742619
collecting compile time instruction count for aotdispatcher_inference_subclass_cpu
compile time instruction count for iteration 0 is 8634984264
compile time instruction count for iteration 1 is 8633467573
compile time instruction count for iteration 2 is 8632182092
compile time instruction count for iteration 3 is 8632056925
compile time instruction count for iteration 4 is 8632543871
collecting compile time instruction count for aotdispatcher_training_subclass_cpu
compile time instruction count for iteration 0 is 14737239311
compile time instruction count for iteration 1 is 14734346427
compile time instruction count for iteration 2 is 14736493730
compile time instruction count for iteration 3 is 14734121272
compile time instruction count for iteration 4 is 14733852882
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136759
Approved by: https://github.com/laithsakka
Instead, callback to a missing handler when needed. This greatly speeds things up with the value ranges dict is large. The missing handler is needed because nested ints don't have VRs, but symbolic sizes involving them occasionally show up in compute.
```
TORCHDYNAMO_EXTENDED_DEBUG_CREATE_SYMBOL="s11" TORCH_LOGS=dynamic PYTORCH_TEST_WITH_DYNAMO=1 python test/test_nestedtensor.py TestNestedTensorAutogradCPU.test_dropout_backward_jagged_cpu
```
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136667
Approved by: https://github.com/isuruf
ghstack dependencies: #136429
Partially addresses https://github.com/pytorch/pytorch/issues/128150
When you have big sums of values, we end up computing long chains of
binary addition in our FX graph representation. Not only is this ugly,
it also is quadratic, as the sympy.Add constructor is O(N) in number
of arguments. Instead, ensure that we maintain the summation as a
single FX node so we can do the entire addition all in one go.
update_hint_regression benchmark, before and after:
```
update_hint_regression,compile_time_instruction_count,2648328980
update_hint_regression,compile_time_instruction_count,2563748678
```
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136429
Approved by: https://github.com/isuruf
Summary: There are two instances of AppendOnlyLists that don't get cleared after we have finished iterating through the forward lists. This can be potentially dangerous since they can last for the entirety of the lifespan of the profiler. We have also seen crashes during the destructor of these variables when the profiler is exiting. This could possibly be related to the fact that the default constructor assumes some valid state of these lists rather than whatever state they are in when profiler is exiting.
Test Plan: Ran with profile_memory=True to make sure allocations queue gets cleared correctly and trace+workload ran successfully
Differential Revision: D64010911
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137450
Approved by: https://github.com/aaronenyeshi
Works around #136543.
This fix solves the issue only in the context of the ONNX exporter but this issue happens in other context.
The bug happens when method `run_decompositions` is called. The failing pattern is assumed to be ``view(transpose(x, ...))``. This pattern is replaced by ``view(flatten(transpose(x, ..)))``. By changing the dimensions, the strides are updated as well and `run_decompositions` does not fail anymore. It would be inefficient on a 1D tensor but then transpose would not be used. The extra node appears in the final onnx graph but is removed after optimization. The final onnx graph should not be impacted and no performance loss should be observed for the onnx model.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137340
Approved by: https://github.com/justinchuby
Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
The test runs all its combination (512) sequentially, so it takes more than 30 minutes to finish or timeout on ASAN after one hour. Parametrizing it will break it up, so individual tests can finish and aren't need to be marked as slow anymore.
Also, the test seems to run OOM on a 2xlarge with `std::bad_alloc` memory error. Maybe, this would also fix the issue (pending CI testing)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137447
Approved by: https://github.com/albanD, https://github.com/malfet
code
Summary:
This PR should not change the existing behavior of work.wait(), just
separate the stream synchronization code from the CPU busy wait code.
Also, remove the need of a private synchronization function.
In a longer term, we would like to give user the flexibility of bypassing the watchdog thread and handle the collective error by themselves.
Test Plan:
python test/distributed/test_c10d_nccl.py NcclErrorHandlingTest
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137295
Approved by: https://github.com/kwen2501
Summary: It seems like the import path is different from FBCode & OSS. Wondering how to consolidate them.
Test Plan:
```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:cutlass_backend
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 33. Build failure 0
```
Differential Revision: D63991961
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137435
Approved by: https://github.com/jovianjaison
When we are autotuning matmuls the aten.mm and the triton template choices take in an externally allocated tensor that can be a view into a pre-planned aten.cat. So long as the output shape and stride of the matmul matches the slice of the cat we're planning, we can realize the mm directly into the cat.
Discussion for reviewers:
It feels a little bit odd that in the existing code we set the output of aten.mm as [FlexibleLayout](bcac71517c/torch/_inductor/kernel/mm.py (L156)). While is this correct, it might lead to passing non performant output strides to cublas.. I guess this is better than a copy ? Not sure. We could also introduce a Layout that denotes a Fixed shape and stride which we control allocation
```
class AllocatedFixedLayout(FixedLayout)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132554
Approved by: https://github.com/jansel
This PR fixes a bug in `search_end_matrix_indices_cuda_kernel` causing an illegal memory access when calling `bmm_sparse_cuda` on a sparse matrix with no non-zero values in the first batch dimension. Reproducible example:
```py
import torch
ind = torch.tensor([[1], [0], [0]], device="cuda")
val = torch.tensor([1.], device="cuda")
A = torch.sparse_coo_tensor(ind, val, size=(2, 1, 1))
B = torch.zeros((2, 1, 1), device="cuda")
C = torch.bmm(A, B)
```
## Details
In the previous code, we may for example end up with the following situation:
```
i : indices_1D[i]
------------------------------------------
0 : 1 <- start_idx, mid_idx
1 : 1 <- end_idx
...
```
When `target_mat_num = 0`, the next iteration of the while loop will assign `-1` to `end_idx` and thus `(0 + (-1)) >> 1 = -1` to `mid_idx`, causing an access error on line 703. The updated code maintains the invariant `start_idx <= end_idx` and will not go out of bounds.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131977
Approved by: https://github.com/amjames, https://github.com/pearu, https://github.com/nikitaved
This patch adds logging for all frames Dynamo traced, during each invocation of a Dynamo-optimized function.
Example:
```python
import torch
@torch.compile
def foo():
x = torch.ones([10])
def bar():
y = x + x
torch._dynamo.graph_break()
z = y * x
return z
return bar(), bar
foo()
foo()
```
Running `TORCH_LOGS="dynamo" python` on the above dumps the following near the very end.
```
......
I1003 12:18:31.058000 177 torch/_dynamo/eval_frame.py:486] starting from foo /Users/ryanguo99/Documents/work/scratch/test.py:4, torchdynamo attempted to trace the following frames: [
I1003 12:18:31.058000 177 torch/_dynamo/eval_frame.py:486] * foo /Users/ryanguo99/Documents/work/scratch/test.py:4
I1003 12:18:31.058000 177 torch/_dynamo/eval_frame.py:486] * bar /Users/ryanguo99/Documents/work/scratch/test.py:7
I1003 12:18:31.058000 177 torch/_dynamo/eval_frame.py:486] ]
I1003 12:18:31.064000 177 torch/_dynamo/eval_frame.py:486] starting from foo /Users/ryanguo99/Documents/work/scratch/test.py:4, torchdynamo attempted to trace the following frames: []
......
```
Fixes#118262.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137297
Approved by: https://github.com/williamwen42
Moves `TransformGetItemToIndex` to a file where dynamo stores other traceable HOP concepts. (We don't trace through torch.* modules by default)
Tracing through the mode required fixing a bug in dynamo autograd function, which fixed a graph break, which caused the autograd test failures (skipping for now and will file an issue)
Previously those tests were in essence running in eager, because dynamo would fallback due to an arg mismatch error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137120
Approved by: https://github.com/yanboliang
ghstack dependencies: #137114, #137115, #137116, #137117
This PR implements tracing of with contexts with TorchFunction modes which have the default enter/exit behavior (ie pushing/popping the mode)
Typically the bytecode for a context manager looks like this during a graph break:
1. graph call
2. enter context
3. unsupported code
4. exit context
5. resume call
resume fn structure:
1. enter context
2. jump
...
3. exit context
The issue with torch function modes is that side effects will replay any mutations to the torch function stack performed during tracing. So, we do not need to enter and exit around the unsupported code in the original function (doing so would result in a duplicate torch function mode entry during execution of the unsupported code), and we don't need to enter again in the resume function (the mode that was pushed from the side effects bytecode would still be on the stack).
So for torch function modes the structure of our output code is this:
1. graph call
2. mutate tf mode stack to replay mutations
4. unsupported code
5. on exception restore stack
6. resume function
Then our resume fn looks like this:
1. no-op enter torch function mode
2. jump
3. exit tf mode
To implement the no-op enter of the torch function mode I added torch function mode in polyfill which no-op enters, but normally exits. This is needed because we still want to trace the with context in the resume function, and exit properly (the exit instructions will still be in the function, so we need to generate instructions to set up the context).
Separately from the bytecode, dynamo also tracks contexts on the block stack, which is how the SETUP_* instructions are implemented. Naturally at a graph break, we exit these block stacks to properly reset the contexts entirely, so that we can re-enter around the unsupported code soundly. However once again, in the torch function mode case, in the event of a graph we do not want to perform any exit side effects because we want to preserve the state of the mode stack as is so that we will properly update the stack with bytecode mentioned in the first section. If we exited here, dynamo would pop the mode off of the symbolic stack, and not update the true python torch function mode stack with the suffix bytecode. All in all, for torch function modes we enter exactly once, update the global torch function mode stack with side effects bytecode, re-read this stack when compiling the resume function, and exit exactly once in the resume function. This matches the semantics of eager exactly.
Approved by: https://github.com/williamwen42
ghstack dependencies: #134732, #133137, #135443, #135444
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137114
Approved by: https://github.com/yanboliang
If there is an `unshard` (top-half) without a `wait_for_unshard` (bottom-half), then the next iteration's `unshard` will be a no-op. This can unexpectedly not propagate the optimizer update on the sharded parameters to the unsharded parameters, so it is better to clear that `unshard` at the end of backward.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137348
Approved by: https://github.com/weifengpy
Summary:
- Further added more types for debug value dumping.
- Add a test case for symint inputs for debug printer. in real prod model use cases, "unbacked symints" (those 'u0', 's0', etc.) can be helpful if we can examine their value.
Test Plan:
```
AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=2 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_aoti_debug_printer_sym_inputs_abi_compatible_cuda
```
Differential Revision: D63864708
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137323
Approved by: https://github.com/chenyang78
Summary: Implement Remote AOTAutogradCache. It uses all the same tech as Remote FXGraphCache, just with its own name.
Test Plan:
Run benchmark:
TORCHINDUCTOR_AUTOGRAD_REMOTE_CACHE=1 TORCHINDUCTOR_FX_GRAPH_REMOTE_CACHE=1 TORCHINDUCTOR_AUTOGRAD_CACHE=0 TORCHINDUCTOR_FX_GRAPH_CACHE=0 TORCH_LOGS=+torch._functorch._aot_autograd.autograd_cache buck run mode/opt benchmarks/dynamo:torchbench -- --training --backend=inductor --only nanogpt --repeat 5 --performance --cold-start-latency
See that it cache hits even with local cache removed.
Results show up in remote cache logs https://fburl.com/scuba/pt2_remote_cache/5893dbaj
New unit tests
Reviewed By: oulgen
Differential Revision: D63323958
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137278
Approved by: https://github.com/oulgen
Summary: Prototyping the custom op meta kernel generation. Rest of the changes are in fbcode/scripts/angelayi
Test Plan: followup diff (D63837739)
Differential Revision: D63837740
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137277
Approved by: https://github.com/zou3519
This PR fixes an issue where when running `python setup.py develop`, the `open_registration_extension` self contained example will not build due to the following:
```
error: 'synchronizeStream' overrides a member function but is not marked 'override'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137355
Approved by: https://github.com/albanD, https://github.com/spzala
When one process fails, others are immediately killed. This prevents other processes to do necessary cleanups, or dump debug information (in particular, the NCCL flight recorder).
This PR adds a grace period. Default behavior is unchanged.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131278
Approved by: https://github.com/albanD
Added an optimization pass to the swap function which removes extraneous pytrees. Currently it removes the pytree flatten/unflatten calls between modules in very specific scenarios (all the inputs of one module go into the other).
Future work can be to remove the input pytree.flatten if the inputs go directly into an unflatten, and output pytree unflatten if the outputs are directly from a pytree.flatten.
Differential Revision: [D62879820](https://our.internmc.facebook.com/intern/diff/D62879820)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136191
Approved by: https://github.com/avikchaudhuri
currently FSDP2 support only CUDA, for other backends that need to use FSDP2 it won’t work as stream and events are based on CUDA. To support other backends, use
_get_device_handle by device type to get the class and use this
for stream and events.
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136843
Approved by: https://github.com/awgu
Summary: In unflatten, when we generate module calls when their signature has been preserved, we do not pass the original constant args. This can cause strange effects, e.g., if the module is swapped out with itself, we may suddenly go down a different path than the original, or even crash.
Test Plan: added a test
Reviewed By: angelayi
Differential Revision: D63913750
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137363
Approved by: https://github.com/angelayi
Fixes#136720
the result in this case says:
```
Traceback (most recent call last):
File "/Users/shenke/workspace/pytorch/mytest.py", line 9, in <module>
result = torch.bincount(input)
^^^^^^^^^^^^^^^^^^^^^
RuntimeError: maximum value of input overflowed, it should be < 9223372036854775807 but got 9223372036854775807
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136745
Approved by: https://github.com/Skylion007
Thanks @awgu for raising this issue and the small repro
From offline discussion with @albanD, in the case where a forward returns multiple outputs with different devices, we'd want to select the ready queue based on the device of the first one. Even though this is somewhat arbitrary, we prefer this over deciding which ready queue to push based on whichever input buffer's we happen to compute last, which can vary depending on more factors and thus be harder to reason about. This is in theory bc-breaking, but it seems unlikely that someone would depend on this behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135633
Approved by: https://github.com/albanD
This change modifies the `hipify_python.py` script to properly detect all directories, `include` and `ignore` paths during hipification process on Windows, by changing the path syntax convention to a UNIX-like one.
Since in many places the script assumes a UNIX-like convention by using paths with forward slashes `/`, I decided to accommodate for it by converting Windows paths to UNIX-like ones. By doing it so, the number of changes to the file is limited. Moreover this early-on unification allows for the rest of the code to have a battle-tested linux-like behaviour.
Another option would be to use `Path` object from `pathlib` to represent all paths in the script, however, it would impact a broader share of a code and would hence require a more meticulous evaluation in terms of non-altered logic and edge cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135360
Approved by: https://github.com/jeffdaily, https://github.com/jithunnair-amd
Summary:
X-link: https://github.com/pytorch/executorch/pull/5720
For smaller models the overhead of profiling ops might be prohibitively large (distorting the inference execution time significantly) so we provide users an option to disable op profiling and essentially only profile the important events such as inference execution time.
To disable operator profiling users need to do:
```
etdump_gen.set_event_tracer_profiling_level(executorch::runtime::EventTracerProfilingLevel::kNoOperatorProfiling);
```
Test Plan: Added test case.
Differential Revision: D61883224
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136838
Approved by: https://github.com/dbort
if the function is
```func(a, b, c) ```
and is called as
```func(a=1, b=.., c=..)```
before this change we do not iterate on the a, b, c, since those appear in kwargs this diff fix that issue.
This function is used in _inductor/ir.py to iterate over custom op arguments and when a custom pass does changes
and pass arguments as kwargs, we do not process them.
```
for info, arg in torch._library.utils.zip_schema(schema, args, kwargs):
handle_aliasing_and_mutation(info, arg)
```
Fix https://github.com/pytorch/pytorch/issues/137057
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137311
Approved by: https://github.com/zou3519
Fixes https://github.com/pytorch/pytorch/issues/136358
The bug here is that the Tensor object is actually 2 classes: `Tensor` from `_tensor.py` and `TensorBase` from c++.
Before this PR, they have the following gc methods:
Tensor:
- tp_clear subtype_clear
- tp_traverse THPVariable_subclass_traverse
- tp_dealloc THPVariable_subclass_dealloc
TensorBase:
- tp_clear THPVariable_clear
- tp_traverse THPFunction_traverse (fake function that is just an error)
- tp_dealloc object_dealloc
The problem is that when clear is called on the Tensor, subtype_clear is going to clear the things owned by the "Tensor" type, in particular, its `__dict__` attribute, before delegating to the TensorBase clear where we detect that resurrection needs to happen and skip it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137267
Approved by: https://github.com/ezyang, https://github.com/kshitij12345
This PR relaxes the even sharding requirement for the all-gather extensions.
The `fsdp_pre_all_gather` now expects signature:
```diff
def fsdp_pre_all_gather(
self,
mesh: DeviceMesh,
+ outer_size: torch.Size,
+ outer_stride: Tuple[int, ...],
module: nn.Module,
mp_policy: MixedPrecisionPolicy,
) -> Tuple[Tuple[torch.Tensor, ...], Any]:
```
- Since no one is using this new signature yet, we should be safe to change it.
- Currently, the `outer_stride` will always be contiguous strides since FSDP2 only supports contiguous strides for now.
- For the uneven sharding case, the user is responsible to return a padded sharded tensor from `fsdp_pre_all_gather`. This is risky territory because if the user does not do so, then this may manifest as a NCCL timeout, as only the ranks with padding will error out. However, I am not aware of any way around this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137005
Approved by: https://github.com/weifengpy
Summary: We had attribute assignment detection and handling of registered buffer assignments when using `aot_autograd`, but not when using just `make_fx`. Fixed.
Test Plan: expanded coverage of `test_state_tensors` to use `export` instead of `torch.export.export`
Differential Revision: D63802576
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137240
Approved by: https://github.com/tugsbayasgalan
Adds lowering for `aten.searchsorted`. This entails:
1. Adding support for multi-dimensional bucket tensors to `ops.bucketize`.
2. Adding support for striding to `ops.bucketize`.
3. Adding support for sorting tensors to `ops.bucketize`.
4. Adding a lowering for `aten.searchsorted.Tensor`.
5. Adding a basic decomposition for `aten.searchsorted.Scalar` that calls into the lowering for tensors.
6. Updating the meta-function for `aten.searchsorted` to properly check some of the sizing conditions.
Closes#135873
Differential Revision: [D63766514](https://our.internmc.facebook.com/intern/diff/D63766514)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135701
Approved by: https://github.com/amjames, https://github.com/eellison, https://github.com/davidberard98
Summary:
as title
also change it in `prepare_pt2e()` docstring
Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:quantization_pt2e_qat
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization
```
Differential Revision: D63345059
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137233
Approved by: https://github.com/tugsbayasgalan
Summary:
Special autotuning configs like `num_warps` and `num_stages` can be passed to the kernel as parameters. The `config.all_kwargs()` call [here](762a7d197c/python/triton/runtime/autotuner.py (L106)) in the Trtion code includes those special configs (names and values) into the potential arguments to the kernel. [Here](762a7d197c/python/triton/runtime/jit.py (L613)) some of those may be included in actual kenrel arguments, given that their names are present among the kernel parameters.
This PR replicates this behavior in user-defined Triton kernel compilation in PT2. Resolves#136550.
Test Plan:
```
$ python test/inductor/test_triton_kernels.py -k test_triton_kernel_special_params
inductor []
inline_call []
stats [('calls_captured', 2), ('unique_graphs', 1)]
aot_autograd [('total', 1), ('ok', 1)]
.inductor []
inline_call []
stats [('calls_captured', 2), ('unique_graphs', 1)]
.inductor [('fxgraph_cache_bypass', 1), ('pattern_matcher_count', 1), ('pattern_matcher_nodes', 1), ('extern_calls', 1), ('possibly_missed_reinplacing_opportunities', 0), ('possibly_missed_reinplacing_bytes', 0)]
inline_call []
stats [('calls_captured', 2), ('unique_graphs', 1)]
aot_autograd [('total', 1), ('ok', 1)]
.inductor []
inline_call []
stats [('calls_captured', 2), ('unique_graphs', 1)]
aot_autograd [('total', 1), ('ok', 1)]
.inductor []
inline_call []
stats [('calls_captured', 2), ('unique_graphs', 1)]
.inductor [('benchmarking.TritonBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_bypass', 1), ('pattern_matcher_count', 1), ('pattern_matcher_nodes', 1), ('extern_calls', 1), ('benchmarking.TritonBenchmarker.triton_do_bench', 1), ('possibly_missed_reinplacing_opportunities', 0), ('possibly_missed_reinplacing_bytes', 0)]
inline_call []
stats [('calls_captured', 2), ('unique_graphs', 1)]
aot_autograd [('total', 1), ('ok', 1)]
.
----------------------------------------------------------------------
Ran 6 tests in 6.283s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137236
Approved by: https://github.com/zou3519
Summary:
When we handle dynamic shapes markers like `Dim.AUTO, Dim.DYNAMIC`, we use dynamo decorators, attaching set attributes to the export input tensors, e.g. `x._dynamo_dynamic_indices = set()`.
I thought this was fine, since it's done all the time with torch.compile, but it breaks some PT2Inference tests, specifically because unpickling a set attribute isn't possible with the C++ torch::jit::pickle_load call.
We've agreed that the PT2Inference side will clone sample inputs & pickle the original inputs to be safe, but this still establishes a nice invariant that user-facing decorators are both ignored & cleaned out in the lifecycle of an export call.
Test Plan: test_export
Differential Revision: D63773534
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137230
Approved by: https://github.com/avikchaudhuri
When the stub file `nn/parallel/distributed.pyi` was removed (#88701), some types that existed are no longer available. This pull request adds them back.
Just for reference, these types are used in pytorch-lightning's LightningCLI. Command line interfaces are created automatically, and having type hints make them nicer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136835
Approved by: https://github.com/kwen2501
We didn't support multiple levels of vmap. The main problem is, during
the batching rule, we need to exclude the vmap dispatch key
(FuncTorchBatched) like how our C++ batching rules do it.
Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137306
Approved by: https://github.com/Chillee
This PR adds new meta functions for `lerp`, `addcmul`, and `addcdiv` (including their
respective inplace versions).
These functions only had refs implementations, which was being the root cause of a
significant overhead ([issue][1]) when running `AdamW` optimizer step on PyTorch/XLA
backend. Running the meta functions resulted in the following improvements:
- `lerp` calls: 1,550ms to 140ms (10x)
- `addcdiv` calls: 640ms to 350ms (1.8x)
- `addcmul` calls: 620ms to 300ms (2.05x)
[1]: https://github.com/pytorch/xla/issues/7923
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136909
Approved by: https://github.com/jansel
One-shot all-reduce did not have a barrier at the end. It was possible for a rank to write to its p2p buffer for the next collective before another rank finished reading it for the previous collective.
Also removing the fuse-input-copy optimization. The synchronization complexity probably outweighs the saving.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137257
Approved by: https://github.com/Chillee
This PR contains multiple fixes for issue https://github.com/pytorch/pytorch/issues/135279:
## First part:
Moves the GPU guard (`cudaSetDevice`) before the `currentStreamCaptureStatusMayInitCtx` call.
As its name suggests, it May Init Ctx.
## Second part:
Even with the above fix, additional contexts are still observed during Work object destruction, e.g.
```
work = dist.all_reduce(tensor, async_op=True)
time.sleep(5) <-- no additional context yet
del work <-- additional context shows up
```
### Debug process
Chasing it down to destruction of a `Future` object -- a member variable of `Work`.
Then further down to the following member of `Future`:
```
std::vector<c10::Event> events_;
```
When the `events_` are destroyed, we hit the road down to:
1f3a793790/c10/cuda/impl/CUDAGuardImpl.h (L106-L121)
When there is no "preset" CUDA context (**which is the case for python garbage collector**), line 112: `c10::cuda::GetDevice(&orig_device)` will set `orig_device` to 0. Then, at line 120, `c10::cuda::SetDevice(orig_device)` will "officially" set the context to device 0 --
**that's where rank 1, 2, ... can create extra context on device 0!**
### Solution
This PR adds an explicit destructor to `Future`. In this destructor, destroy each event with a device guard.
## Test
Added test_extra_cuda_context, implemented via
- `pynvml` (if available), or
- memory consumption check.
`python test/distributed/test_c10d_nccl.py -k test_extra_cuda_context`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135273
Approved by: https://github.com/fduwjj, https://github.com/wconstab, https://github.com/eqy
This should be a no-op change, i.e. it runs the same code, but replaces verbose ObjectiveC invocation with helper function from OperationUtils.h, which this example already depends on
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137313
Approved by: https://github.com/atalman
Previously, all integer inputs to user-defined triton kernels were assumed to be int32. This would result in errors if your input was actually an int64.
This PR checks the value to determine which dtype to use for indexing: if it is known to be < int_max, then use int32 (and add guards if relevant); if we can't check (e.g. unbacked symint), then use int64.
Differential Revision: [D63797975](https://our.internmc.facebook.com/intern/diff/D63797975)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137234
Approved by: https://github.com/eellison
As title, when testing on an internal case, we found that we have very similar output for the error when certain ranks does not join one collective. This is because we didn't put all ranks into `candidate_ranks` so that they didn't get wiped out from entries and gets checked again.
Ideally for the given case, we should report this is an out of order case, because rank 0, 1 calls all-to-all while all the rest ranks call all-gather-base. But when we select entries to compare, we don't have global view of the entries.
In the specific case, on rank 0 and 1, it has collective of PG 7 on entry 1130 with seq ID = 1130. However, on other ranks, they have collective of PG 0 on entry 1130 with seq ID = 2. It's hard to use entry idx to do the match because if we later consider p2p, this assumption will collapse, so we now still defer it for users or further down debugging stream to figure it out. To make the message clearer, I also include both seqID and record_id (aka, entry index) in the message. (That does not mean this is not possible to implement in the code, for example, we can let all record_id to minus the maximum p2p seq id before it; but users will easily see the wrong order, so we don't think it's necessary to have that logic now)
P1626755348
Differential Revision: [D63815335](https://our.internmc.facebook.com/intern/diff/D63815335/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137256
Approved by: https://github.com/c-p-i-o
Summary: We had a report of crashes in parallel compile subprocesses linked to reading justknobs. See https://fburl.com/workplace/14a4mcbh internally. This is a known issue with justknobs. It looks like we don't have a lot of control over evaluating knobs. Some are read in inductor (`"pytorch/remote_cache:autotune_memcache_version`), but many are read by the triton compiler. According to this advice https://fburl.com/workplace/imx9lsx3, we can import thread_safe_fork which installs some functionality to destroy some singletons before forking and re-enable them after. This apporach works for the failing workload.
Test Plan: See D63719673 where the reporting user was kind enough to provide us with a local repro. Without the relevant import, we can reproduce the crash. With the import, the training runs successfully to completion.
Differential Revision: D63736829
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137155
Approved by: https://github.com/xmfan, https://github.com/eellison
When we populate unlifted graph module, we actually only "unlift" constant tensor inputs which is problematic because export de-duplicates aliasing constants. As a result, we only register one constant instead of two constants. This PR fixes that by querying ep.constants table instead of ep.graph_signature.lifted_tensor_constants.
Differential Revision: [D63743111](https://our.internmc.facebook.com/intern/diff/D63743111)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137162
Approved by: https://github.com/pianpwk
Summary: This just adds a config option and JK for turning on remote AOTAutogradCache. It does not implement anything with the new options being passed in. That will come next diff.
This PR also changes the command for turning on the local AOTAutogradCache to be more consistent to that of FXGraphCache: TORCHINDUCTOR_AUTOGRAD_CACHE
Test Plan: Existing tests should pass and should build
Reviewed By: oulgen
Differential Revision: D63321965
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137011
Approved by: https://github.com/oulgen
This greatly reduces compile time; TorchBench models that were previously 50-100x slower (vs the cpp backend) are now ~20x slower. More work needs to be done on the Triton side, but smaller block sizes will still be helpful.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136612
Approved by: https://github.com/desertfire
ghstack dependencies: #135342
Function `_get_pg_default_device` is being used outside of `distributed_c10d.py`.
A concern is that people may not be aware of what it actually does, due to bad naming of this function:
`Return the device to use with ``group`` for control flow usage (object collectives, barrier).`
The remediation is as follows:
- Added a deprecation warning to `_get_pg_default_device`;
- Added a private function `_get_object_coll_device` to undertake what it does;
- Added a `_device_capability` function for users who want to query the device support of a PG -- it returns a plain list, no more "default" choice.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136790
Approved by: https://github.com/H-Huang
Previously we were making a fairly restrictive assumption when unflattening an exported program: for any submodule, we would assert that the graph of every call to that submodule must be the same. This assertion is load-bearing, i.e., if we simply remove the assertion then we can get incorrect results, as shown by the following example.
```
class N(torch.nn.Module):
def forward(self, x, b):
if b:
return x + 1
else:
return x + 2
class M(torch.nn.Module):
def __init__(self):
super().__init__()
self.n = N()
def forward(self, x):
x0 = x + 3
x1 = self.n(x0, True)
x2 = x1 + 4
x3 = self.n(x2, False)
return x3 + 5
m = M()
inp = (torch.ones(1),)
print(m(*inp)) # tensor([16.])
ep = torch.export.export(m, inp)
print(ep.module()(*inp)) # tensor([16.])
unflattened = torch.export.unflatten(ep)
print(unflattened(*inp)) # tensor([15.])
```
However, this goes against the spirit of specializing graphs when exporting: we should *expect* that for every call to a submodule we *might* generate a different graph. The goal of this PR is to fix unflattening to handle multiple specialized graphs corresponding to multiple calls to the same submodule.
The idea is simple: for every call to a child module `foo`, we will create potentially different child modules `foo`, `foo@1`, `foo@2`, etc. and use those names as targets in `callmodule` instructions in the parent graph. An immediate consequence of this is that the list of fqns in an unflattened module may not be the same as an exported module. Note that all these variants share the same parameters / buffers, so that multiple calls to the same submodule can share state as expected.
However, as described so far this scheme may end up with needlessly too many submodules. Thus, between calls to the same submodule, if graphs are equal then we optimize away the extra submodules and reuse call names as much as possible. Moreover, when submodules are shared across fqns, we also try to de-duplicate graphs corresponding to their calls as much as possible. Note that no matter what, information about which submodule was called is still preserved, so that if a submodule has to be swapped with another, one can still find all calls to the former submodule and replace them with calls to the latter.
A note on the choice of naming scheme for call names: instead of generating "sibling" modules `foo@1`, `foo@2`, etc. for `foo`, we had considered generating "children" modules `foo._1`, `foo._2`, etc. of `foo`. However this can cause spurious cycles when de-duplicating graphs. E.g., suppose that `foo` is an alias for `bar._1` and `foo._1` is an alias for `bar`, then we must either introduce a cycle or drop the opportunity to optimize. Another idea would be to make `foo` a dummy module that contains `foo._0` corresponding to the first call, but this necessitates too many changes to existing tests and hurts the common case.
Differential Revision: D63642479
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137013
Approved by: https://github.com/pianpwk
* Upload_metrics function to upload to ossci-raw-job-status bucket instead of dynamo
* Moves all added metrics to a field called "info" so ingesting into database table with a strict schema is easier
* Removes the dynamo_key field since it is no longer needed
* Removes the concept of reserved metrics, since they cannot be overwritten by user added metrics anymore
* Moves s3 resource initialization behind a function so import is faster
---
Tested by emitting a metric during run_test and seeing that documents got added to s3
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136799
Approved by: https://github.com/ZainRizvi
Fixes#129366
Since NJT has custom serialization logic, we need an NJT-specific fix to clear out cached sizes / strides PyCapsules. Eventually, we should switch NJT to use the default serialization logic, but this depends on #125622 being addressed.
This PR also makes serialization more complete by explicitly handling `lengths`, `ragged_idx`, and the `metadata_cache`, ensuring working operation for both contiguous and non-contiguous NJTs,
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137031
Approved by: https://github.com/soulitzer
ghstack dependencies: #137030
Summary:
Given an op, with a pair (output buffer, input buffer) from that op, we consider marking the output buffer as inline. However, if the parent of input buffer and the current op are going to be fused, then we don't want to mark the output buffer as inline. This change checks that criterion, and skips inlining if it is so.
Test Plan:
New unit test "layer_norm_should_not_inplace" runs LayerNorm and checks for no "in_out" pointers.
Fixes#120217
Here's a diagram of the issue:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137042
Approved by: https://github.com/eellison
Fixes#130154
This PR takes the strategy outlined in the above issue and clears out any cached sizes / strides PyCapsules before serialization. This affects the default subclass serialization logic.
The PyCapsule issue also affects `deepcopy`, so that's fixed here as well.
Note: I originally tried utilizing a context manager to remove / restore cached PyCapsules after serialization, but in practice the state returned from `_reduce_ex_internal()` references the actual `tensor.__dict__()`, so the problem persists once the cached values are restored. Instead, we have to be careful to remove the cached values in the right place so they're not re-cached when pulling out size / stride information for serialization.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137030
Approved by: https://github.com/albanD
this PR unblocks unit test with single Float8Linear module. It fixes following error
```
torch._foreach_copy_(foreach_copy_dsts, all_gather_inputs)
[rank0]:E0913 13:44:29.829000 2179476 torch/testing/_internal/common_distributed.py:671] RuntimeError: "foreach_tensor_copy" not implemented for 'Float8_e4m3fn'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135955
Approved by: https://github.com/vkuzo, https://github.com/eqy
Summary:
Tests in test_mkldnn_pattern_matcher.py can take too long to finish. Splitting them into smaller tests, using `parametrize`.
I guess this means this test file has some refactoring opportunities as well. Next time would be the parametrize the add functions.
Differential Revision: D63723925
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137153
Approved by: https://github.com/desertfire
For the reusable action checkout-pytorch, skips cleaning workspace when running from a container environment.
The motivation for this change is twofold:
* There is no need for cleanup when running in ephemeral containers, as any changes will be discarded when the docker container is terminated;
* In the specific case of GITHUB_WORKSPACE, to enable sharing this between multiple containers, it need to be mounted with `-v`. This prevents the possibility of running `rm -r` and deleting this mount path;
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137168
Approved by: https://github.com/huydhn
Previously if we had a graph like:
```
triton_kernel_wrapper_functional_proxy = triton_kernel_wrapper_functional(...)
getitem: "f32[3][1]cuda:0" = triton_kernel_wrapper_functional_proxy['out_ptr']
getitem_1: "f32[3][1]cuda:0" = triton_kernel_wrapper_functional_proxy['out2_ptr']
sigmoid: "f32[3][1]cuda:0" = torch.ops.aten.sigmoid.default(getitem_1)
mul: "f32[3][1]cuda:0" = torch.ops.aten.mul.Tensor(tangents_1, sigmoid)
```
The partitioner would assume that the `sigmoid()` could be fused into either its user (the pointwise mul), or its producer (the user triton kernel). This could lead to a bad partitioning:
(1) If the partitioner thinks we can fuse the sigmoid with its producer triton kernel, we would keep the sigmoid compute in the forward, and have to generate two separate kernels in the forward (user triton kernel, dedicated sigmoid kernel)
(2) if the partitioner puts the sigmoid in the backward instead, we could fuse it with an existing backward kernel (the mul with a tangent)
Reviewed By: embg
Differential Revision: D63551393
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136878
Approved by: https://github.com/zou3519
Summary:
# Context
Goal: Enable CK for Inductor in FBCode
We split this stack into three diffs to help with review & in case we need to revert anything.
# This Diff
* Gets us to have CK kernels as an option for GEMM autotuning in Inductor.
Reviewed By: zjing14
Differential Revision: D62662705
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136234
Approved by: https://github.com/tenpercent, https://github.com/chenyang78
'set_requires_grad' dict appears to be always full of "False" values,
and we always set requires_grad based on the value of 'has_backward'
setting of required_grad field was being repeatedly done during
get_fwd_recv_ops, but it should be done just once, so move it to the
function that creates recv buffers in the first place.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136804
Approved by: https://github.com/kwen2501
Summary:
We skip the save_gpu_kernel if kernel is being saved already.
This would give us a more accurate Triton profiling result. The
following trace shows before/after the change for a benchmarking of a
trivial addmm:
Before:
<img width="1255" alt="Screenshot 2024-09-23 at 10 26 53 AM" src="https://github.com/user-attachments/assets/5aea05ef-6ef0-464c-8da9-17b31c97b43a">
After:
<img width="910" alt="Screenshot 2024-09-23 at 10 27 03 AM" src="https://github.com/user-attachments/assets/488b7d4f-268f-41cf-8553-cb16ceeae118">
We can see that before the change, the benchmarking includes two parts,
(1) The overhead of our triton_heuristic call, which includes the
save/get, and the (expensive) hash computation.
(2) The exact computation of Triton kernel.
We see that (1) accounts >50% of time, which makes kernel selection
for profiling choosing aten kernels over Triton kernels.
Test Plan:
Existing OSS CI
python test/inductor/test_cuda_cpp_wrapper.py
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137073
Approved by: https://github.com/desertfire
For Traceable FSDP2, the most common use case is to have `fullgraph=False` for forward pass (to allow user-level graph breaks), and `fullgraph=True` for compiled autograd backward pass (required for queue_callback support).
With `torch._dynamo.compiled_autograd=True`, previously we are not able to set different `fullgraph` config value for forward vs. backward pass, since `rebuild_ctx` just reuses the forward compile config as-is. This PR adds `torch._dynamo.config.compiled_autograd_kwargs_override` config to allow forcing `fullgraph=True` for CA Dynamo tracing.
With this PR, we can remove standalone compiled autograd ctx manager usage in Traceable FSDP2 unit tests, and consolidate on using `torch._dynamo.compiled_autograd=True`.
Test commands:
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor_fullgraph_True`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136967
Approved by: https://github.com/xmfan
Fixes#127920
This commit addresses a build failure occurring with GCC 12 and above due to the -Werror=nonnull flag. The error manifests in the test_api target.
**Issue:**
When building with GCC 12+, the following error occurs:
```
error: argument 1 null where non-null expected [-Werror=nonnull]
431 | __builtin_memmove(__result, __first, sizeof(_Tp) * _Num);
| ~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```
This change ensures that:
1. The flag is only added for GCC 12 or higher
2. The flag is only added if it's supported by the compiler
3. The flag is added specifically to the test_api target, not globally
By disabling this specific error, we allow the build to proceed while maintaining other compiler warnings.
**Test Plan:**
- Verified successful build with GCC 12 and above
- Ensured no regression in builds with earlier GCC versions and other compilers
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137092
Approved by: https://github.com/malfet
`json.dumps(float("inf"))` returns `Infinity`, which is technically invalid json
This is fine if you json.load, but ClickHouse cannot handle it
Solution here: cast inf and nan to string (which ClickHouse is able to cast back to float)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136877
Approved by: https://github.com/huydhn
Summary:
# Why
We want this to run internally
# What
- fix python path issue on the test
- reenable the test
# Background
(copied from similar issue resolved earlier)
It appears that the parent process does not pass the entire path down to the child process. Namely, if there is some setup that makes the sys.path effectively look different than, say, PYTHONPATH or something like this, the child will not inherit this setup. To avoid needing to keep track of specific setups, we pass the effective `sys.path` from the parent to the child through the PYTHONPATH env variable
Test Plan: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:kernel_benchmark
Differential Revision: D63498897
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136876
Approved by: https://github.com/henrylhtsang
By even further reducing precisions of imprecise FP16 ops, introducing new BF16_LOW_PRECISION_OPS category and marking BF16 tests as xfail for `divfloor_rounding`, `floor_divide` and `remainder`.
I guess the nature of low-precision results, is that MPSGraph, unlike the rest of the PyTorch does not do accumulation over fp32 for reduction operations
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136987
Approved by: https://github.com/albanD
ghstack dependencies: #137070
Related to #107302.
When built and tested with NumPy 2 the following unit tests failed.
```
=========================================================== short test summary info ============================================================
FAILED [0.0026s] test/test_linalg.py::TestLinalgCPU::test_householder_product_cpu_complex128 - TypeError: expected np.ndarray (got Tensor)
FAILED [0.0024s] test/test_linalg.py::TestLinalgCPU::test_householder_product_cpu_complex64 - TypeError: expected np.ndarray (got Tensor)
FAILED [0.0025s] test/test_linalg.py::TestLinalgCPU::test_householder_product_cpu_float32 - TypeError: expected np.ndarray (got Tensor)
FAILED [0.0024s] test/test_linalg.py::TestLinalgCPU::test_householder_product_cpu_float64 - TypeError: expected np.ndarray (got Tensor)
FAILED [0.0016s] test/test_linalg.py::TestLinalgCPU::test_nuclear_norm_axes_small_brute_force_old_cpu - ValueError: Unable to avoid copy while creating an array as requested.
FAILED [0.0054s] test/test_linalg.py::TestLinalgCPU::test_solve_cpu_complex128 - AssertionError: The values for attribute 'shape' do not match: torch.Size([0, 0]) != torch.Size([0, 0, 0]).
FAILED [0.0055s] test/test_linalg.py::TestLinalgCPU::test_solve_cpu_complex64 - AssertionError: The values for attribute 'shape' do not match: torch.Size([0, 0]) != torch.Size([0, 0, 0]).
FAILED [0.0048s] test/test_linalg.py::TestLinalgCPU::test_solve_cpu_float32 - AssertionError: The values for attribute 'shape' do not match: torch.Size([0, 0]) != torch.Size([0, 0, 0]).
FAILED [0.0054s] test/test_linalg.py::TestLinalgCPU::test_solve_cpu_float64 - AssertionError: The values for attribute 'shape' do not match: torch.Size([0, 0]) != torch.Size([0, 0, 0]).
=========================================== 9 failed, 1051 passed, 118 skipped in 152.51s (0:02:32) ============================================
```
This PR fixes them. The test is now compatible with both NumPy 1 & 2.
Some more details:
1. The `np.linalg.solve` has changed its behavior. So I added an adapt function in the unit test to keep its behavior the same no matter it is NumPy 1 or Numpy 2.
2. The cause of the failure is when passing a `torch.Tensor` to `np.linalg.qr`, the return type in NumPy 1 is `(np.ndarray, np.ndarray)`, while it is `(torch.Tensor, torch.Tensor)` in NumPy 2.
3. NumPy 2 does not allow `np.array(obj, copy=False)`, but recommended to use `np.asarray(obj)` instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136800
Approved by: https://github.com/lezcano
# Motivation
This PR intends to make device-specific Event inherit from the generic torch.Event. The benefit is providing a generic abstract class `torch.Event` for different devices, like `torch.Stream`. This make it easier for Dynamo to capture the Event of different devices, like torch.cuda.Event and torch.xpu.Event.
And the next PR would like to remove previous useless base class `_StreamBase` and `_EventBase` to avoid multiple Inheritance.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134845
Approved by: https://github.com/albanD, https://github.com/EikanWang
inductor mutates the aot backward graph. a solution could be to copy the graph, but since we don't know if compiled autograd is applied or not, it would be expensive to always clone it
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136741
Approved by: https://github.com/jansel
ghstack dependencies: #135663
Summary: Problem is, when gpu is not big, we will omit the test cases in the test class. We expect the test to be skipped, but due to fbcode ci it can throw an error. This causes the test to be flaky.
Test Plan: ci
Differential Revision: D62037908
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137055
Approved by: https://github.com/masnesral
Fixes#136440
**Issue:**
When building PyTorch in debug mode on aarch64 architecture using GCC, we encounter relocation errors due to the R_AARCH64_CALL26 relocation limit. This occurs because debug builds with -O0 optimization generate larger code sizes, potentially exceeding the range limit for these relocations.
**Fix:**
Apply -Og optimization instead of -O0 for aarch64 GCC debug builds. This slightly reduces code size while maintaining debuggability, bringing function calls back within the range of R_AARCH64_CALL26 relocations.
The fix is implemented by conditionally setting compiler and linker flags in CMakeLists.txt:
- For aarch64 GCC debug builds: use -Og
- For all other debug builds: retain -O0
This change affects only debug builds on aarch64 with GCC, leaving other configurations unchanged.
**Testing:**
Verified that the build succeeds without relocation errors on aarch64 systems with GCC in debug mode. Ensured that debugging information is still available and useful for debugging purposes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136990
Approved by: https://github.com/malfet
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Since our implementation currently assumes contiguous strides, let us add an explicit check and raise an error at construction time if the parameter is not contiguous.
We can try to support this in the future. Mainly, I want to first learn more about how DTensor support for non-contiguous memory formats works.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137000
Approved by: https://github.com/weifengpy
## The problem.
[A commit from three weeks ago](82d00acfee) appears to have broken five tests but was not caught by CI.
[A later commit](https://github.com/pytorch/pytorch/commit/e05ea2b1797) which added a decomposition of `transpose_copy` added another broken test, also seemingly not detected, making six total (listed below).
They came to my attention when I updated some pending decomposition pull requests which passed CI, and started getting failures like [this](https://hud.pytorch.org/pr/134319) for a test unrelated to any of these pull requests, `TestCommonCPU.test_out__refs_transpose_copy_cpu_float32`
Running `python test/test_ops.py -k _copy` on `viable/strict` found failures for six `_refs` ops: `copysign`, `expand_copy`, `index_copy`, `t_copy`, `transpose_copy`, `view_copy`
## The solution
The original commit did actually cause breakage by slightly changing user-visible behavior (in a special case involving scalar tensors being copied between different devices).
This pull request fixes that breakage in a reasonable way, but I don't understand why this error didn't appear in CI until I made later changes in the same area.
## To reproduce
To reproduce the six cases in your own client:
```
PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=5 python test/test_ops.py TestCommonCPU.test_out__refs_view_copy_cpu_float32
PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=2 python test/test_ops.py TestCommonCPU.test_out__refs_t_copy_cpu_float32
PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=0 python test/test_ops.py TestCommonCPU.test_out__refs_index_copy_cpu_float32
PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=7 python test/test_ops.py TestCommonCPU.test_out__refs_expand_copy_cpu_float32
PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=0 python test/test_ops.py TestCommonCPU.test_out__refs_copysign_cpu_float32
PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=4 python test/test_ops.py TestCommonCPU.test_out__refs_transpose_copy_cpu_float32
```
@amjames
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136653
Approved by: https://github.com/zou3519
This PR is for supporting calling `parallelize_module` from within a model definition, making the model a parallel one.
Calling `parallelize_module` is an alternative to maintaining a set of `ColumnWiseLinear`, `RowWiseLinear`, etc, while still being able to directly author a parallel model.
(The motivation for authoring a parallel model is that there may be other distributed operations, which may not be easily captured by any module, see the forward function below. Alternatively speaking, the purpose is to exploit the expressiveness of DTensor -- we need to first create DTensors before calling ops on them. Having parallelized modules in model is one way of creating DTensors.)
For example:
```
class FeedForward(nn.Module):
def __init__(self, config: TransformerArgs) -> None:
super().__init__()
w1 = nn.Linear(config.dim, config.hidden_dim, bias=False)
w2 = nn.Linear(config.hidden_dim, config.dim, bias=False)
w3 = nn.Linear(config.dim, config.hidden_dim, bias=False)
self.w1 = parallelize_module(w1, Colwise)
self.w2 = parallelize_module(w2, Rowwise)
self.w3 = parallelize_module(w3, Colwise)
def forward(self, x: Tensor) -> Tensor:
y: DTensor = self.w2(F.silu(self.w1(x)) * self.w3(x))
# y is a DTensor with Partial placement; we can return it as is.
return y
# Or we can convert it to Replicate -- there is modeling flexibility here.
return y.redistribute(Replicate())
with device_mesh:
model = FeedForward(config)
# Now model is a model parallelized onto device_mesh
y = model(x)
```
The `device_mesh` actually used for `parallelize_module` would be retrieved from the ambient context.
Calling `parallelize_module` from within model hierarchy also saves the use of *FQNs* as in the out-of-model annotation case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134247
Approved by: https://github.com/tianyu-l
compile time benchmark for the min cut partitioner. I'm hoping that this is a reasonable benchmark because:
(1) it consists of a single input + many weights that are used sequentially
(2) contains a mix of recompute vs non-recomputed ops (matmul + sin)
(3) it is relatively simple
from running locally:
```
collecting compile time instruction count for aotdispatcher_partitioner_cpu
compile time instruction count for iteration 0 is 21764219181
compile time instruction count for iteration 1 is 12475020009
compile time instruction count for iteration 2 is 12463710140
compile time instruction count for iteration 3 is 12455676489
compile time instruction count for iteration 4 is 12451344330
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136760
Approved by: https://github.com/ezyang
ghstack dependencies: #136670, #136759
this adds a few compile time benchmarks for some disjoint paths in AOTDispatcher:
(1) inference vs training code paths
(2) "subclasses" vs "no subclasses" codepaths
Also see https://github.com/pytorch/pytorch/pull/136760 for a partitioner benchmark (I'm not sure why ghstack didn't display the stack nicely)
I ran locally, and got these numbers on the 4 paths:
```
collecting compile time instruction count for aotdispatcher_inference_nosubclass_cpu
compile time instruction count for iteration 0 is 11692348671
compile time instruction count for iteration 1 is 3026287204
compile time instruction count for iteration 2 is 3011467318
compile time instruction count for iteration 3 is 3004485935
compile time instruction count for iteration 4 is 3003087410
collecting compile time instruction count for aotdispatcher_training_nosubclass_cpu
compile time instruction count for iteration 0 is 6068003223
compile time instruction count for iteration 1 is 5585418102
compile time instruction count for iteration 2 is 5581856618
compile time instruction count for iteration 3 is 5581651794
compile time instruction count for iteration 4 is 5578742619
collecting compile time instruction count for aotdispatcher_inference_subclass_cpu
compile time instruction count for iteration 0 is 8634984264
compile time instruction count for iteration 1 is 8633467573
compile time instruction count for iteration 2 is 8632182092
compile time instruction count for iteration 3 is 8632056925
compile time instruction count for iteration 4 is 8632543871
collecting compile time instruction count for aotdispatcher_training_subclass_cpu
compile time instruction count for iteration 0 is 14737239311
compile time instruction count for iteration 1 is 14734346427
compile time instruction count for iteration 2 is 14736493730
compile time instruction count for iteration 3 is 14734121272
compile time instruction count for iteration 4 is 14733852882
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136759
Approved by: https://github.com/laithsakka
ghstack dependencies: #136670
Fixes https://github.com/pytorch/pytorch/issues/136640
Today, inductor has some logic to figure out when it needs to do broadcasting during lowering, which just checks if any of the input shapes have sizes equal to 1.
In particular: we should already have this information by the time we get to inductor, because our FakeTensor compute will have branched/guarded on whether any ops performed broadcasting, appropriately.
In particular, if we have a tensor with a size value of `(64//((2048//(s3*((s2//s3)))))))`, and it happens to be equal to one (and it is used in an op that requires this dim to be broadcasted), FakeTensorProp will have generated a guard:
```
Eq((64//((2048//(s3*((s2//s3))))))), 1)
```
I chose the simplest possible way to beef up inductor's checks to know when a given size is equal to 1: loop over the existing shape env guards, and if our current size is a sympy expression on the LHS of one of our `Eq(LHS, 1)` guards, then return True.
I'm hoping for feedback on whether or not this approach is reasonable. One better option I could imagine is that our symbolic reasoning should have automatically simplified the size of our tensor down to a constant as part of evaluating that guard. I was originally going to try to do this directly in the shape env, but I ran into a few issues:
(1) I wanted to call some version of `set_replacement(expr, 1)`. But `set_replacement()` only accepts plain symbols on the LHS, not expressions
(2) in theory I could get this to work if I could rework the above expression to move everything that is not a free variable to the RHS, e.g. `Eq(s2, 32)`. It looks like our existing `try_solve()` logic is... [not quite able](https://github.com/pytorch/pytorch/blob/main/torch/utils/_sympy/solve.py#L27) to do this generally though.
Checking the guards feels pretty simple-and-easy. Are we worried that it is too slow to iterate over all the guards? I could also cache the lookup so we only need to iterate over guards that are of the form `Eq(LHS, 1)`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136670
Approved by: https://github.com/ezyang
To see the payoff, look at test/dynamo/test_logging.py
The general idea is to refactor produce_guards into produce_guards_verbose which also returns verbose code parts, which have our annotations.
The rest of the logic is plumbing around SLocs to the places they need to be so we can print them. Guards are easy; value ranges and duck sizing take more care.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136917
Approved by: https://github.com/anijain2305
1. example of failing diff
https://github.com/pytorch/pytorch/pull/136740
2. test this by running
python check_results.py test_check_result/expected_test.csv test_check_result/result_test.csv
results
```
WIN: benchmark ('a', ' instruction count') failed, actual result 90 is 18.18% lower than expected 110 ±1.00% please update the expected results.
REGRESSION: benchmark ('b', ' memory') failed, actual result 200 is 100.00% higher than expected 100 ±10.00% if this is an expected regression, please update the expected results.
MISSING REGRESSION TEST: benchmark ('d', ' missing-test') does not have a regression test enabled for it
```
MISSING REGRESSION TEST does not fail but its logged.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136551
Approved by: https://github.com/ezyang
ghstack dependencies: #136383
* This fixes a major CMake/Bazel configuration bug where we were leaving CUTLASS performance on the table, especially with FlashAttention. This now enables using MMA instructions on SM90+, which should close the gap between SDPA and the external FA2. Note these operations only affect H100 and newer GPUs. Thankfully, this seems to have been updated recently into being a noop on the CUTLASS side. Still better set the CMake variable properly.
* Also enables additional new shape kernels added in the recent CUTLASS 3.5.1+ update. This was the original motivatin of the PR before I realized the basic MMA kernels were accidentally disabled since we didn't go through the submodule's CMake/Bazels.
* Adds a bit to compile time and code size, but well worth it considering it speeds up our internal flash attention significantly on H100s at the cost of some minor additional compile time.
* These kernels and settings will be needed for Flash Attention 3 whenever we add that too.
Fixes#133695
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133686
Approved by: https://github.com/ezyang
Summary: In D60803317, we added CompileContext (trace_id) information to Kineto traces using caching when a CompileContext exits. As pointed out by some users, this gives innaccurate IDs because we are not getting the context that we is being looked up within the eval_frame. For this reason, we decided to revert that change, and go with an approach that involves getting the trace_id associated with a given CacheEntry. To do this, we add a trace_id to the GuardedCode so that it can be passed onto a CacheEntry. Then, we change the lookup function to return said trace_id alongside the code so that we can pass both into our eval function. Once we get to a Torch-Compiled Region, we can just append the context information to the name of the annotation thus bypassing any need for kwargs.
Test Plan: Added more comprehensive unit test. Saw that all the trace_ids appeared within the graph.
Differential Revision: D63138786
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136460
Approved by: https://github.com/ezyang
This is to avoid cache confusion between normal vs pydebug vs nogil builds in cpp extensions which can lead to catastrophic ABI issues.
This is rare today for people to run both normal and pydebug on the same machine, but we expect quite a few people will run normal and nogil on the same machine going forward.
This is tested locally by running each version alternatively.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136890
Approved by: https://github.com/colesbury
In #136512, we fixed handling for tl.constexpr and dynamic shapes: if a symint is passed to tl.constexpr, you should specialize on it, because tl.constexpr implies needing to know the concrete value at compile time.
However, when using triton_op, capture_triton, or non-strict export, the regression remains (and #136512 might technically regress some specific export scenarios) - see [Richard's comment](https://github.com/pytorch/pytorch/pull/136512/files#r1775999871).
This fixes these scenarios: implement the handling differently depending on whether we're expecting a SymNodeVariable or a SymInt(/SymBool/SymFloat)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136686
Approved by: https://github.com/zou3519
int8_t = DeviceIndex is interpreted by cout as a char, which then shows up as a control character in logs (eg. ^A) etc.
Explicitly casting to int to have the numbers printed out correctly.
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135405
Approved by: https://github.com/wconstab
Fixes#134714 (or attempts to, idk how to test yet)
For posterity, how one can test:
1. make sure you have USE_PTHREADPOOL=1 or pull a packaged binary
2. run gdb --args python, with `r` to enter, `Ctrl-C` to pause, and `c` to get back into Python
3. import torch
4. torch.set_num_threads(1), make sure this does not trigger any additional threads getting created.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136793
Approved by: https://github.com/albanD
Summary:
This PR adds `torch.float8e4m3fn` support to cuSPARSELt and `to_sparse_semi_structured`.
This will let users to run fp8 + 2:4 sparse matmuls on Hopper GPUs with
cusparselt >= 0.6.2, via to `scaled_mm` API.
```
A = rand_sparse_semi_structured_mask(256, 128, dtype=torch.float16)
B = torch.rand(dense_input_shape, device=device).to(torch.float16).t()
A_fp8, A_scale = to_float8(A)
B_fp8, B_scale = to_float8(B)
dense_result = torch._scaled_mm(
A_fp8, B_fp8,
scale_a=A_scale, scale_b=B_scale,
out_dtype=out_dtype
)
A_fp8_sparse = to_sparse_semi_structured(A_fp8)
sparse_result = torch._scaled_mm(
A_fp8_sparse, B_fp8,
scale_a=A_scale, scale_b=B_scale,
out_dtype=out_dtype
)
```
Note that to keep this consistent with normal torch behavior, calling
`torch.mm(A_fp8_sparse, B_fp8)` will raise a NotImplementedError.
I also turned on cuSPARSELt by default and added CUSPARSELT_MAX_ID to the
backend to make the tests a bit cleaner
Test Plan:
```
python test/test_sparse_semi_structured -k scaled_mm
python test/test_sparse_semi_structured -k fp8
```
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136397
Approved by: https://github.com/drisspg
Summary: Previously is_fbcode just checked whether the checkout was git or not. This is extremely error prone. Lets make it fool-proof.
Test Plan: unit tests
Reviewed By: masnesral
Differential Revision: D63545169
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136871
Approved by: https://github.com/masnesral
Before that attempt to run something like
```
% python -c "import torch;dev,dt='mps',torch.int; print(torch.normal(mean=torch.arange(1., 11., device=dev, dtype=dt), std=torch.arange(10, 0, -1, device=dev, dtype=dt)))"
```
Resulted in hard error
```
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.multiply' op requires the same element type for all operands and results
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %5 = "mps.multiply"(%2, %arg1) : (tensor<10xf32>, tensor<10xsi32>) -> tensor<*xf32>
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.multiply' op requires the same element type for all operands and results
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %5 = "mps.multiply"(%2, %arg1) : (tensor<10xf32>, tensor<10xsi32>) -> tensor<*xf32>
/AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphExecutable.mm:953: failed assertion `original module failed verification'
```
After the change, it raises a nice type error
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136863
Approved by: https://github.com/Skylion007
ghstack dependencies: #136754, #136755, #136821, #136822
Fixes https://github.com/pytorch/pytorch/issues/136494
Currently, CUDASymmetricMemory::rendezvous() initializes a multicast address if multicast support is present. However, if we believe multicast support is present but cuMulticastCreate still fails for some reason, we do not fallback gracefully.
- In addition to CUDART and driver version check, query CU_DEVICE_ATTRIBUTE_MULTICAST_SUPPORTED to determine multicast support for a rank/device.
- Before initializing multicast for a block, ensure all ranks/devices have multicast support.
- This is unlikely, but if cuMulticastCreate still fails on rank 0, print the corresponding driver error message as a warning, and gracefully skip multicast initialization for the block.
- Introduced an environment variable (TORCH_SYMM_MEM_DISABLE_MULTICAST) to allow users to explicitly disable multicast support as a workaround.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136577
Approved by: https://github.com/Chillee, https://github.com/eqy
- also makes scales and zp dtype reconcile with meta impl as well as other
quantized ops representation of scales and zero point
- make sure qunatize_per_token's output_dtype is respected
There are a few places where we need to reconcile on scale and zero point dtype
but that will come later. This fixes are mainly being done to enable quantized
kv cache though ET stack
Differential Revision: [D62301840](https://our.internmc.facebook.com/intern/diff/D62301840/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136807
Approved by: https://github.com/jerryzh168
Removing `_transform_shapes_for_default_dynamic` and `assume_static_by_default=False` as added in https://github.com/pytorch/pytorch/pull/133620.
This reverts back to `assume_static_by_default=True` with the use of dynamo decorators (e.g. `maybe_mark_dynamic, mark_static`, instead) for handling Dim.AUTO & Dim.STATIC instead. This is easier to maintain, as it doesn't requiring reasoning about "inverting" the dynamic_shapes specs, and also opens up usage of other decorators (`mark_dynamic, mark_unbacked`).
On the user side this change has no effect, but internally this means dynamic behavior is determined only by the `dynamic_shapes` specs (ignoring user-side input decorators following https://github.com/pytorch/pytorch/pull/135536), but transferring this information for _DimHints via decorators, for Dynamo/non-strict to create symbolic_contexts accordingly, e.g. 7c6d543a5b/torch/_dynamo/variables/builder.py (L2646-L2666)
One caveat is we don't raise errors for dynamic decorators on the user side, since we don't know if they're from user markings, or from re-exporting with inputs we've previously marked.
Differential Revision: D63358628
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136591
Approved by: https://github.com/avikchaudhuri
Summary:
This diff logs the time_taken_ns for the forward and backward graphs in AOTAutogradCache, saving it into the cache entry.
This information is helpful later when I remotify the cache, and also is just useful to have in tlparse and chromium events.
Test Plan: Run benchmark, see that the times are in the chromium events.
Reviewed By: aorenste
Differential Revision: D62590077
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136529
Approved by: https://github.com/oulgen
In #136512, we fixed handling for tl.constexpr and dynamic shapes: if a symint is passed to tl.constexpr, you should specialize on it, because tl.constexpr implies needing to know the concrete value at compile time.
However, when using triton_op, capture_triton, or non-strict export, the regression remains (and #136512 might technically regress some specific export scenarios) - see [Richard's comment](https://github.com/pytorch/pytorch/pull/136512/files#r1775999871).
This fixes these scenarios: implement the handling differently depending on whether we're expecting a SymNodeVariable or a SymInt(/SymBool/SymFloat)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136686
Approved by: https://github.com/zou3519
I think this could help many teams, especially compile/export teams (/cc @ezyang), to let end user/bug reporters to quickly test WIP PR when reporting a related bug.
This could quickly run in an official nightly Docker container or in a nightly venv/coda env.
Let me know what do you think.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136535
Approved by: https://github.com/ezyang
This was a stupid cast error that caused MPSGraph to crash with the following exception
```
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.multiply' op requires the same element type for all operands and results
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %3 = "mps.multiply"(%2, %arg1) : (tensor<1x3x9x9xf16>, tensor<1xf32>) -> tensor<*xf32>
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.multiply' op requires the same element type for all operands and results
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %3 = "mps.multiply"(%2, %arg1) : (tensor<1x3x9x9xf16>, tensor<1xf32>) -> tensor<*xf32>
/AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphExecutable.mm:953: failed assertion `original module failed verification'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136822
Approved by: https://github.com/Skylion007
ghstack dependencies: #136754, #136755, #136821
This is a retry of https://github.com/pytorch/pytorch/pull/136594, which is having trouble landing.
Summary: We have an internal report of a Triton compiler error `ValueError: Cannot broadcast, rank mismatch: [1], [1, 2048]` coming from a line like this:
`tmp25 = tl.broadcast_to(((tl.full([1], 1.00000000000000, tl.float64)) + ((ks0 // 3278).to(tl.float64))) / (((tl.full([1], 0.500000000000000, tl.float64))*(libdevice.sqrt((1 + ((ks0 // 3278)*(ks0 // 3278)) + ((-2)*(ks0 // 3278))).to(tl.float64).to(tl.float32)))) + ((tl.full([1], 0.500000000000000, tl.float64))*((1 + (ks0 // 3278)).to(tl.float64)))), [XBLOCK, RBLOCK])`
https://github.com/pytorch/pytorch/pull/135260 is the cause, presumably because we turn a constant into a 1-element tensor with: `(tl.full([1], const, tl.float64))`. It looks like changing the syntax to `(tl.full([], const, tl.float64))` gives us what we want?
Differential Revision: [D63540693](https://our.internmc.facebook.com/intern/diff/D63540693)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136858
Approved by: https://github.com/atalman
Original Issue: https://github.com/pytorch/pytorch/issues/134644
We assume trace_tangents to have the same memory_format as inputs, outputs, intermediate during first tracing.
=>
Tracing time:
- Store trace_tangents_memory_formats in metadata
- Coerce tangents to deduced memory_format
Runtime:
- Coerce tangents to tracing memory format from metadata
Subclasses logic:
- Previously coercing tangents logic did not handle nested subclasses case, fixing this.
For Subclasses we deduce memory format for subclass_tensor first, then for each element of subclass:
[subclass_tensor_memory_format, subclass_tensor_elem0_memory_format, ... ]
If subclass element (__tensor_flatten__[0] tensors) is also subclass => on its place we will have a nested list of the same structure.
The recursive traversal of subclass tree is expensive. So we do memory format deduction and coercing at the same time, to keep only one traverse for this. With this approach there is no regression in comparison with previous logic which also does one traversal. (`coerce_tangent_and_suggest_memory_format` method).
Other small change:
Remove duplicated not-related comment.
Testing
```
python test/functorch/test_aotdispatch.py -k test_channels_last_grads_no_force_contiguous
```
Benchmarking:
After change:
```
└─ $ PYTORCH_AOTD_DEBUG_PROFILE=1 python test/functorch/test_aotdispatch.py -k test_benchmark_grads_no_force_contiguous
Benchmark SUBCLASS avg_bwd_duration:4.059906005859375 ms
Benchmark NO_SUBCLASS avg_bwd_duration:3.1563830375671387 ms
```
Before change:
```
BEFORE_CHANGE SUBCLASS 4.1194
```
No siginificant changes in processing time.
(We do single traverse of subclass tree for collecting memory_formats and coercing during tracing.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135225
Approved by: https://github.com/bdhirsh
Fixes the max-autotune failure of `soft_actor_critic` of Torchbench in FP32 single thread dynamic shape case:
```log
File "/home/user/inductor/pytorch/torch/_inductor/codegen/cpp_micro_gemm.py", line 136, in codegen_call
C_ptr = f"&({kernel.index(C, [0, 0])})"
File "/home/user/inductor/pytorch/torch/_inductor/codegen/cpp_template_kernel.py", line 135, in index
else self.args.input(node.get_name())
File "/home/user/inductor/pytorch/torch/_inductor/codegen/common.py", line 1251, in input
assert name not in V.graph.removed_buffers, name
AssertionError: buf_GemmOut
```
The 1st and 2nd linear does not need to use local buffer while the 3rd linear needs to use local buffer.
The 3rd linear which uses local buffer will add its global buffer (named as `buf_GemmOut`) into `V.graph.removed_buffers`.
When scheduling the nodes, the 1st linear (won't use local buffer) will get its output buffer (also named as `buf_GemmOut`) from the input and found that it's in the `V.graph.removed_buffers` and raise AssertionError. The issue is that the output buffer of all these linears are all names with `buf_GemmOut`, which have a conflict.
Rename these buffers by adding the name of the `template_buffer` as the prefix.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136419
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #136418, #136518
Cleanup:
1/ We do not need to unwrap_subclasses() in freezing wrapper, as it will be wrapped by AOTD wrappers which inclused SubclassesWrapper
2/ No need to use weakreferences for unwrapped list, dynamo optimizers need to clean unwrapped list along with original params_flat.
Verfified fbcode tests compiled_optimizers
Differential Revision: [D63393651](https://our.internmc.facebook.com/intern/diff/D63393651)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136549
Approved by: https://github.com/bdhirsh
Fixes#135439
This PR adds support for the `is_inference` method on torch tensors which successfully compiles the following example fn without graph breaks:
```python
def fn_simple(x):
if x.is_inference():
return x.sum()
else:
return x.min()
```
I've also tried to add guards on the tensor to guard against `is_inference`. I wasn't 100% sure where these should go so please don't hesitate to correct me.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136450
Approved by: https://github.com/ezyang
Related issue: #125077
### Feature
Inductor tries to remove dimensions with stride 0 from block pointers. Rather than loading with stride 0, it's more efficient to load a smaller block pointer, then use `tl.broadcast_to` to broadcast it up to the desired size. This already worked for simpler block pointers, but it was disabled for more complex block pointers which used `tl.reshape` to change the dimensionality after loading.
This PR generalizes the approach to work for all block pointers. The idea is to first reshape, adding singleton dimensions, then broadcast those singletons up to something larger, then reshape again to the final output shape. For readability, we emit this code only if it actually does something. Simpler loads will just have `tl.load`.
Here's an example of a complicated kernel that uses `reshape` -> `load` -> `reshape`. (The first reshape is actually the slice `[None,None,:]`).
```
@triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, xnumel, XBLOCK : tl.constexpr):
xnumel = 64
xoffset = tl.program_id(0) * XBLOCK
xindex = xoffset + tl.arange(0, XBLOCK)[:]
xmask = xindex < xnumel
x2 = xindex
x1 = (xindex // 8)
tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[64], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), boundary_check=[0])
tmp1 = tl.reshape(tl.broadcast_to(tl.load(tl.make_block_ptr(in_ptr1, shape=[8], strides=[8], block_shape=[((7 + XBLOCK) // 8)], order=[0], offsets=[(xoffset // 8)]), boundary_check=[0], eviction_policy='evict_last')[:, None, None], [((7 + XBLOCK) // 8), ((1) * ((1) <= (((7 + XBLOCK) // 8))) + (((7 + XBLOCK) // 8)) * ((((7 + XBLOCK) // 8)) < (1))), ((8) * ((8) <= (XBLOCK)) + (XBLOCK) * ((XBLOCK) < (8)))]), [XBLOCK])
tmp2 = tmp0 + tmp1
tl.store(tl.make_block_ptr(out_ptr0, shape=[64], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tmp2.to(tl.float32), boundary_check=[0])
''', device_str='cuda')
```
Before this PR, we would have stride-0 dimensions:
```
@triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, xnumel, XBLOCK : tl.constexpr):
xnumel = 64
xoffset = tl.program_id(0) * XBLOCK
xindex = xoffset + tl.arange(0, XBLOCK)[:]
xmask = xindex < xnumel
x2 = xindex
x1 = (xindex // 8)
tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[64], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), boundary_check=[0])
tmp1 = tl.reshape(tl.load(tl.make_block_ptr(in_ptr1, shape=[8, 1, 8], strides=[8, 0, 0], block_shape=[((7 + XBLOCK) // 8), ((1) * ((1) <= (((7 + XBLOCK) // 8))) + (((7 + XBLOCK) // 8)) * ((((7 + XBLOCK) // 8)) < (1))), ((8) * ((8) <= (XBLOCK)) + (XBLOCK) * ((XBLOCK) < (8)))], order=[2, 1, 0], offsets=[(xoffset // 8), 0, xoffset % 8]), boundary_check=[0], eviction_policy='evict_last'), [XBLOCK])
tmp2 = tmp0 + tmp1
tl.store(tl.make_block_ptr(out_ptr0, shape=[64], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tl.broadcast_to(tmp2, [XBLOCK]).to(tl.float32), boundary_check=[0])
''', device_str='cuda')
```
Here's a simpler example where we use 2D tiling. In this case we don't actually need the broadcast. The broadcast is implied via a slice adding a new singleton dimension. This code is not changed by this PR, but it's important to know that we don't accidentally insert unnecessary broadcasts.
```
@triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, ynumel, xnumel, YBLOCK : tl.constexpr, XBLOCK : tl.constexpr):
ynumel = 8
xnumel = 8
yoffset = tl.program_id(1) * YBLOCK
yindex = yoffset + tl.arange(0, YBLOCK)[None, :]
ymask = yindex < ynumel
xoffset = tl.program_id(0) * XBLOCK
xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
xmask = xindex < xnumel
x1 = xindex
y0 = yindex
tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[8, 8], strides=[1, 8], block_shape=[XBLOCK, YBLOCK], order=[1, 0], offsets=[xoffset, yoffset]), boundary_check=[0, 1])
tmp1 = tl.load(tl.make_block_ptr(in_ptr1, shape=[8], strides=[8], block_shape=[YBLOCK], order=[0], offsets=[yoffset]), boundary_check=[0], eviction_policy='evict_last')[None, :]
tmp2 = tmp0 + tmp1
tl.store(tl.make_block_ptr(out_ptr0, shape=[8, 8], strides=[1, 8], block_shape=[XBLOCK, YBLOCK], order=[1, 0], offsets=[xoffset, yoffset]), tmp2.to(tl.float32), boundary_check=[0, 1])
''', device_str='cuda')
```
### Test Plan
Added a new expecttest to check the emitted code for broadcast addition. Looking at the test, we can see that stride 0 dimensions are removed. (This test generated the example kernels in the previous section.)
This change also removed a stride-0 dimension in an existing block pointer test. I updated the expected code accordingly.
Bonus: I noticed that the test parametrization for `config.prefer_nd_tiling` wasn't working as intended. It ended up always setting this option to `True`. Fixed it so we get the intended test coverage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135557
Approved by: https://github.com/shunting314, https://github.com/jansel
Co-authored-by: Yueming Hao <yhao@meta.com>
Summary: We have an internal report of a Triton compiler error `ValueError: Cannot broadcast, rank mismatch: [1], [1, 2048]` coming from a line like this:
`tmp25 = tl.broadcast_to(((tl.full([1], 1.00000000000000, tl.float64)) + ((ks0 // 3278).to(tl.float64))) / (((tl.full([1], 0.500000000000000, tl.float64))*(libdevice.sqrt((1 + ((ks0 // 3278)*(ks0 // 3278)) + ((-2)*(ks0 // 3278))).to(tl.float64).to(tl.float32)))) + ((tl.full([1], 0.500000000000000, tl.float64))*((1 + (ks0 // 3278)).to(tl.float64)))), [XBLOCK, RBLOCK])
`
https://github.com/pytorch/pytorch/pull/135260 is the cause, presumably because we turn a constant into a 1-element tensor with: `(tl.full([1], const, tl.float64))`. It looks like changing the syntax to `(tl.full([], const, tl.float64))` gives us what we want?
Differential Revision: [D63465169](https://our.internmc.facebook.com/intern/diff/D63465169)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136594
Approved by: https://github.com/mengluy0125, https://github.com/jansel
2024-09-27 04:01:09 +00:00
913 changed files with 24600 additions and 10152 deletions
@ -208,6 +208,8 @@ If you want to compile with ROCm support, install
- [AMD ROCm](https://rocm.docs.amd.com/en/latest/deploy/linux/quick_start.html) 4.0 and above installation
- ROCm is currently supported only for Linux systems.
By default the build system expects ROCm to be installed in `/opt/rocm`. If ROCm is installed in a different directory, the `ROCM_PATH` environment variable must be set to the ROCm installation directory. The build system automatically detects the AMD GPU architecture. Optionally, the AMD GPU architecture can be explicitly set with the `PYTORCH_ROCM_ARCH` environment variable [AMD GPU architecture](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html#supported-gpus)
If you want to disable ROCm support, export the environment variable `USE_ROCM=0`.
Other potentially useful environment variables may be found in `setup.py`.
@ -77,6 +77,31 @@ default, now called through TunableOp. Any call to at::cuda::blas::gemm() or ::b
when enabled. Calling gemm() for a given set of input arguments (transa, transb, m, n, k) will attempt to use the
fastest available implementation across both rocblas and hipblaslt.
## Offline Tuning
### Motivation
Basically it is used for workload with high-memory utilization where one might run out of memory with regular tuning.
### Workflow
There are basically two steps:
1) Set the environment variables to collect the untuned GEMM and this will generate `tunableop_untuned?.csv` ("?" is placeholder for the GPU ID), like:
```
PYTORCH_TUNABLEOP_ENABLED=1
PYTORCH_TUNABLEOP_TUNING=0
PYTORCH_TUNABLEOP_RECORD_UNTUNED=1
...
```
2) Run a Python script that reads the `tunableop_untuned?.csv` and generates the `tunableop_results?.csv`, like:
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.