Summary: Also check the module's named buffers and parameters when resolving name collision
Test Plan:
```
buck2 run mode/dev-nosan caffe2/test/inductor:test_aot_inductor -- -r aoti_constant_tensor_name_collision
```
Differential Revision: D73264885
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151684
Approved by: https://github.com/angelayi
Triton codegen currently [sorts vars by divisor](ae6f6b8efb/torch/_inductor/codegen/simd.py (L233-L237)). When there are two vars with the same divisor, the order is undecided.
```python
nodes.sort(
key=lambda x: V.graph.sizevars.size_hint(
x.divisor, fallback=config.unbacked_symint_fallback
)
)
```
The test case leads to the following nodes:
```
(Pdb) nodes[0]
IterationRangesEntry(x1, ((s37 + 127)//128), 2, (xindex//ps0), {x0: ((s37 + 127)//128), x1: 2, x2: ((s12 + 127)//128), x4: 2*(((s12 + 127)//128))*(((s37 + 127)//128)), x5: 0, x6: 2, x7: (((s12 + 127)//128))*(((s37 + 127)//128))})
(Pdb) nodes[1]
IterationRangesEntry(x0, 1, ((s37 + 127)//128), ModularIndexing(xindex, 1, ps0), {x0: ((s37 + 127)//128), x1: 2, x2: ((s12 + 127)//128), x4: 2*(((s12 + 127)//128))*(((s37 + 127)//128)), x5: 0, x6: 2, x7: (((s12 + 127)//128))*(((s37 + 127)//128))})
(Pdb) nodes[2]
IterationRangesEntry(x2, 2*(((s37 + 127)//128)), ((s12 + 127)//128), (xindex//(2*(((s37 + 127)//128)))), {x0: ((s37 + 127)//128), x1: 2, x2: ((s12 + 127)//128), x4: 2*(((s12 + 127)//128))*(((s37 + 127)//128)), x5: 0, x6: 2, x7: (((s12 + 127)//128))*(((s37 + 127)//128))})
(Pdb) V.graph.sizevars.statically_known_equals(nodes[0].length, 2)
True
(Pdb) V.graph.sizevars.statically_known_equals(nodes[1].length, 1)
True
(Pdb) V.graph.sizevars.statically_known_equals(nodes[2].length, 1)
True
(Pdb) V.graph.sizevars.statically_known_equals(nodes[0].divisor, 1)
True
(Pdb) V.graph.sizevars.statically_known_equals(nodes[1].divisor, 1)
True
(Pdb) V.graph.sizevars.statically_known_equals(nodes[2].divisor, 2)
True
```
Since x1 and x0 both have divisor 1, the relative order is random across runs.
In some runs, we have order [x1, x0, x2] with divisors as [1,1,2] and lengths as [2,1,1]. After x1, we have [divisor = divisor * node.length](ae6f6b8efb/torch/_inductor/codegen/simd.py (L246)) = 1 * 2 = 2. Then, when processing x0, we have node.divisor=1, divisor=2, and [FloorDiv(node.divisor, divisor)](ae6f6b8efb/torch/_inductor/codegen/simd.py (L251)) = 0, which indicates an iteration length of 0 and leads errors later.
The fix is to sort by both divisor and length_is_one. So for two nodes with the same divisor, we process the node with length=1 first.
Fixes#149789
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151634
Approved by: https://github.com/Skylion007, https://github.com/drisspg
Summary: There are a few issues I'm solving:.
1. It's too hard to measure total pt2 overhead using the dynamo_compile table because users need to know the columns representing all the top-level events (dynamo_cumulative_compile_time_us, etc.). Instead, let's populate the existing duration_us field for all top-level events. The complication is that runtime events in particular (Triton autotuning, cudagraphify) can be collapsed into a single row, with gaps in between, so we can't simply use `end_time - start_time` in all cases. Instead, we'll sum durations for all outer events when updating the compile-time or runtime metrics context. Introduce a 'depth' counter in TLS to track the nesting of CompilationMetrics events.
2. The existing implementation relies on callers of dynamo_timed to specify whether the event is a runtime or compile-time event. That doesn't work because some methods can be called in both situations, e.g., `CachingAutotuner.benchmark_all_configs`. For example `TORCHINDUCTOR_BENCHMARK_FUSION=1` enables benchmarking during compile-time. Instead, we can figure out automatically whether we're measuring a compile-time or runtime event and log accordingling.
3. If `log_compilation_events` were to throw an exception, we'd fail to clear the aggregated counters for runtime logs and they could be attributed to the wrong compile ID. I didn't actually find evidence of this in practice, but I added exception handling for extra safety.
Test Plan:
Ran internal models and compared dynamo_compile to pt2_compile_events:
`TORCHINDUCTOR_BENCHMARK_FUSION=0`
* tlparse: https://fburl.com/itciwnxc
* dynamo_compile: https://fburl.com/scuba/dynamo_compile/yvkif5vb
* pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/segijet7
`TORCHINDUCTOR_BENCHMARK_FUSION=1`
* tlparse: https://fburl.com/jgurcvkw
* dynamo_compile: https://fburl.com/scuba/dynamo_compile/uum91ceb
* pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/x4xnisez
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151749
Approved by: https://github.com/Skylion007
# Motivation
This PR aims to deprecate the host allocator legacy API and recommend users to use the unified API `getHostAllocator(device_type)` APIs, such as:
```cpp
at::getHostAllocator(device_type)->allocate(...);
at::getHostAllocator(device_type)->empty_cache();
at::getHostAllocator(device_type)->record_event(...);
at::getHostAllocator(device_type)->get_stats();
at::getHostAllocator(device_type)->reset_accumulated_stats();
at::getHostAllocator(device_type)->reset_peak_stats();
```
# Additional Context
TODO:
- [ ] Move is_pinned from `AcceleratorHookInterface` to `HostAllocator`
- [ ] Deprecate `getPinnedMemoryAllocator` inside `AcceleratorHookInterface` and recommend using `getHostAllocator` instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151437
Approved by: https://github.com/EikanWang, https://github.com/albanD
ghstack dependencies: #151403, #151431
This somewhat complicated PR does a few things:
- It separates out a lot of the guard checking logic into its own class, GuardedCache[T]
- It adds a new `check_guard_hit` lambda to FXGraphCache._lookup_graph, which allows callers to define their own guard checking logic
- It then uses these two combined parts to lift guard checking to AOTAutogradCache. This means that AOTAutogradCache stores its own guard expressions and evaluates them.
- FXGraphCache's guard checking logic is completely unchanged, just refactored. As part of the work, I'm able to extend a bit of the logging functionality of AOTAutogradCache into FXGraphCache, so that you can know if FXGraphCache missed due to a guard failure or a full cache miss.
# Why do this?
Lifting guards to AOTAutogradCache has a few benefits:
- First, it fixes a long standing bug in guard checking logic. Backward passes can have different symint inputs than forward passes depending on forward output, if AOTAutograd chooses to store symints for the backward. These symint inputs have the same underlying symbols as the forward, but on AOTAutogradCache hit, we don't have access to the hints backing these exact symints (we only have hints for the symints on the forward function). By lifting guard checking logic to AOTAutogradCache, we no longer need to check the backward guards, as they'll be included in the AOTAutogradCache guard expression. **I've added a unit test that failed before my diff, and now passes, as an example of this**
- Secondly, this is the first step necessary to bundle CompiledFxGraph into AOTAutogradCache. Doing so will simplify our cache logic significantly, and also make precompile logic simpler, as precompiles will only need to store AOTAutogradCacheEntrys, without needing to match them up with inductor FXGraphCache entries.
- Finally, adding guard checking logic to AOTAutogradCache my allow us in the future to handle more complicated cases like a single forward with multiple backwards, as guard checks are now storable on the cache entry itself.
# Guard checking logic of AOTAutogradCache
When AOTAutogradCache evaluates guard expressions, it no longer needs to evaluate the forward/backward guards in the FXGraphCacheEntry (since the AOTAutogradCache guard expressions will encompass them). Because of this, we still need a way for AOTAutogradCache to distinguish between multiple FXGraphCache local entries. To do so, AOTAutogradCache stores the guard string from FXGraphCache, which it uses as a second "cache key". It doesn't need to **evaluate** these guards, it just needs to find the cache entry from FXGraphCache that had the same guards as when it was stored.
After this, I will work on putting the FXGraphCache entries directly into AOTAutogradCache. If I can put CompiledFxGraphs in the cache directly, I no longer need this complicated `check_guard_hit` overriding logic.
## Test Plan
Added a new unit test. There are comprehensive guard checking unit tests in `test_aot_autograd_cache` already, and those pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151563
Approved by: https://github.com/oulgen
Summary:
This PR introduces additional autotuning configurations for the persistent+TMA version of Triton `mm` and `addmm` operations. The new configurations are as follows:
* `(128, 128, 64, 5, 8)`
* `(256, 128, 64, 4, 8)`
* `(128, 128, 64, 5, 4)`
These configurations were selected based on exhaustive autotuning performed on commonly used shapes from an internal foundational model.
While these new configs are generally more performant across the board, we see notable gains a few specific cases:
* In scenarios where `n >> m, k`, the configurations `(128, 128, 64, 5, 8)` and `(256, 128, 64, 4, 8)` tend to produce an additional 5-10% speedup over the aten baseline compared to the original configurations.
* Similarly, the configuration `(128, 128, 64, 5, 4)` yields approximately an 8% improvement in scenarios where k >> m, n.
These enhancements are expected to provide performance benefits across diverse use cases, particularly when compared to the original set of configurations.
Test Plan:
contbuild & OSS CI
Reviewers: paulzhan
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150587
Approved by: https://github.com/PaulZhang12, https://github.com/drisspg, https://github.com/eellison
Chatting with Bob the goal of this is to const fold the floats that where tensorified by calling
guard_scalar(val) on them and then replacing their usages by their values.
Hence we do not need to do this for nodes with no float symbols.
We do not want todo proper const folding because we need to preserve statements that deferred
runtime asserts depend on. (see the added test)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151494
Approved by: https://github.com/bobrenjc93
When bicubic interpolation was added to grid_sampler in #44780, `GridSampleFuncOptions` was not updated to allow a user to use bicubic mode in LibTorch, even though the function could handle it. This PR fixes the parity such that LibTorch's `torch::nn::functional::grid_sample` behaves the same as PyTorch's `torch.nn.functional.grid_sample`.
Existing users can directly use `torch::grid_sampler` but must know what int to pass for the interpolation (2 for bicubic) and padding mode parameters, which is not ideal.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150817
Approved by: https://github.com/Skylion007
Summary:
Similar to what we did previously in D70033166
Previously, for cpu we decompose addmm if
```
check_device(mat1, mat2, device="cpu")
and statically_known_true(mat1.shape[0] == 1)
and statically_known_true(mat2.shape[0] <= 64)
and statically_known_true(mat2.shape[1] <= 512)
```
We have a new case where `mat1.shape[0] = 80`, and benchmark shows that it will beneficial if we decompose, so update the condition to
```
check_device(mat1, mat2, device="cpu")
and statically_known_true(mat1.shape[0] == 1)
and statically_known_true(mat2.shape[0] <= 128)
and statically_known_true(mat2.shape[1] <= 512)
```
Differential Revision: D73292985
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151730
Approved by: https://github.com/kflu, https://github.com/houseroad
Fixes#135998
Adds support for fp8. These are compared bitwise, without atol and rtol. The implementation uses the same comparison functions, just with atol and rtol forced to zero. The error message is different from the default case; it only tells the user the first mismatch. This is to avoid triggering the error from #135998.
Test Plan:
New unit test covers new code paths.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150002
Approved by: https://github.com/cyyever, https://github.com/zou3519
By constructing tensor on that device, because it does not call `self.common` but rather executes test directly.
Otherwise `test_add_complex3_mps` will test CPU inductor, rather than MPS one
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151732
Approved by: https://github.com/dcci
Although torch.cuda.Event and torch.xpu.Event have cuda_event and sycl_event fields respectively, the event_id exposed from the base class torch.Event is always 0, which can confuse users.
The memory of torch.Event is not useful to torch.cuda.Event and torch.xpu.Event, but we still need to inherit from torch.Event because CPython will check it.
Repro with cuda:
```
>>> import torch
>>> event = torch.cuda.Event()
>>> event.cuda_event
0
>>> event.event_id
0
>>> event.record()
>>> event.cuda_event
127982096
>>> event.event_id
0
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151226
Approved by: https://github.com/albanD
This PR adds support for submatrices in offline tuning for:
- GEMM
- GEMM and bias
- ScaledGEMM
- Batch Strided GEMM
New UTs to cover submatrices. Submatrices for strided batch API is not part of this PR and will be done seperately.
There is also a bug fix for offline tuning for full matrix for GEMM and bias in the `NT` case. Offline and online UTs were updated to cover this corner case.
To improve code readability, swapped definition of transA and transB.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151138
Approved by: https://github.com/jeffdaily
By constructing tensor on that device, because it does not call `self.common` but rather executes test directly.
Otherwise `test_add_complex3_mps` will test CPU inductor, rather than MPS one
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151732
Approved by: https://github.com/dcci
By building wheel with USE_DISTRIBUTED=1
Otherwise attempt to run
```
python3 benchmarks/dynamo/torchbench.py --performance --only hf_T5 --backend inductor --inference --devices mps
```
wil fail with
```
File "/Users/nshulga/Library/Python/3.10/lib/python/site-packages/transformers/modeling_utils.py", line 40, in <module>
import torch.distributed.tensor
File "/Users/nshulga/git/pytorch/pytorch/torch/distributed/tensor/__init__.py", line 4, in <module>
import torch.distributed.tensor._ops # force import all built-in dtensor ops
File "/Users/nshulga/git/pytorch/pytorch/torch/distributed/tensor/_ops/__init__.py", line 2, in <module>
from ._conv_ops import * # noqa: F403
File "/Users/nshulga/git/pytorch/pytorch/torch/distributed/tensor/_ops/_conv_ops.py", line 5, in <module>
from torch.distributed.tensor._dtensor_spec import DTensorSpec, TensorMeta
File "/Users/nshulga/git/pytorch/pytorch/torch/distributed/tensor/_dtensor_spec.py", line 6, in <module>
from torch.distributed.tensor.placement_types import (
File "/Users/nshulga/git/pytorch/pytorch/torch/distributed/tensor/placement_types.py", line 8, in <module>
import torch.distributed._functional_collectives as funcol
File "/Users/nshulga/git/pytorch/pytorch/torch/distributed/_functional_collectives.py", line 9, in <module>
import torch.distributed.distributed_c10d as c10d
File "/Users/nshulga/git/pytorch/pytorch/torch/distributed/distributed_c10d.py", line 23, in <module>
from torch._C._distributed_c10d import (
ModuleNotFoundError: No module named 'torch._C._distributed_c10d'; 'torch._C' is not a package
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151721
Approved by: https://github.com/wdvr, https://github.com/dcci, https://github.com/huydhn
Summary: Further testing the script, we found that we shouldn't always assume rank 0 is the first rank, so we need to check all entries and see if it P2P op for this coalesced group.
Test Plan: Directly test with corner case.
Differential Revision: D73266257
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151683
Approved by: https://github.com/fegin
The name was updated by https://github.com/pytorch/pytorch/pull/151155. The benchmark results weren't updated on the dashboard otherwise.
For PT2 compiler perf benchmark, we are still relying on this old workflow. To get rid of this, we need to update PT2 benchmark dashboard to use the new benchmark database (cc @yangw-dev)
The results are there on the new database:
```
SELECT
*
FROM
oss_ci_benchmark_v3
WHERE
workflow_id = 14510035576
```
but not on the old database:
```
SELECT
*
FROM
inductor_torch_dynamo_perf_stats
WHERE
workflow_id = 14510035576
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151698
Approved by: https://github.com/seemethere, https://github.com/atalman
Summary: See internal Diff for more details.
In ExternKernel, the FakeTensors do not have associated real tensors, because they are just created from ir.Node's shape and stride.
Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r aoti_data_dependent_ex
buck2 run mode/dev-nosan fbcode//caffe2/test/inductor:aot_inductor_arrayref_cpu -- -r data_dependent_extern_kernel_op
```
Differential Revision: D73002775
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151377
Approved by: https://github.com/angelayi
Summary: It looks symmetric memory only supports cuda12.3+. We do have the definition w/ 12.3- but we don't have implementation. So maybe a good idea to even disable the definition.
Test Plan: CI
Reviewed By: jianyuh, houseroad, ngimel, jiawenliu64
Differential Revision: D72936993
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151203
Approved by: https://github.com/ngimel, https://github.com/houseroad
This PR fixes two things.
The first problem is that in the vLLM style standalone_compile is
called from within a custom torch.compile backend. If there already is a
FakeTensorMode (which there is), we shouldn't create a new
FakeTensorMode with the same shape_env, instead we should just reuse the
same FakeTensorMode.
The second thing is that compile_fx can mutate the passed in gm, so we
deepcopy (since standalone_compile should be standalone)
Test Plan:
- new test
- updated old tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151502
Approved by: https://github.com/oulgen
ghstack dependencies: #151501, #151551
We were only returning the first one. There's an edge case on what to do
if the original function returns a single Tensor. capture(f) returns a
function that returns a tuple of one Tensor in this case and we were
originally converting this back to one single Tensor. I think it's fine
to return a tuple of one Tensor (that is what the graph passed to
standalone_compile asked for!) but we can revisit.
fine
Test Plan:
- modified one test to used multiple outputs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151551
Approved by: https://github.com/Skylion007, https://github.com/oulgen
ghstack dependencies: #151501
To follow pattern set by CPU and CUDA impls: define common_dtype and optionally casts `elements` and `test_elements` to common dtype if needed
- Add regression test, though skip it on MacOS-13, as `isin` seems to produce garbage there even for same dtypes
```
>>> import torch
>>> x=torch.arange(4.0, device='mps')
>>> y=torch.arange(1.0, 3.0, device='mps')
>>> x, y, torch.isin(x, y), torch.isin(y, x)
(tensor([0., 1., 2., 3.], device='mps:0'), tensor([1., 2.], device='mps:0'), tensor([False, True, False, False], device='mps:0'), tensor([False, False], device='mps:0'))
>>> torch.__version__
'2.6.0'
```
- Cleanup code a bit
Fixes https://github.com/pytorch/pytorch/issues/151443
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151600
Approved by: https://github.com/Skylion007, https://github.com/dcci, https://github.com/kulinseth
If pip is not installed:
### Before
```console
> python3 torch/utils/collect_env.py
Collecting environment information...
Traceback (most recent call last):
File "/Users/Adam/pytorch/torch/utils/collect_env.py", line 694, in <module>
main()
~~~~^^
File "/Users/Adam/pytorch/torch/utils/collect_env.py", line 677, in main
output = get_pretty_env_info()
File "/Users/Adam/pytorch/torch/utils/collect_env.py", line 672, in get_pretty_env_info
return pretty_str(get_env_info())
~~~~~~~~~~~~^^
File "/Users/Adam/pytorch/torch/utils/collect_env.py", line 497, in get_env_info
pip_version, pip_list_output = get_pip_packages(run_lambda)
~~~~~~~~~~~~~~~~^^^^^^^^^^^^
File "/Users/Adam/pytorch/torch/utils/collect_env.py", line 450, in get_pip_packages
for line in out.splitlines()
^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'splitlines'
```
### After
```console
> python3 torch/utils/collect_env.py
Collecting environment information...
PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: N/A
OS: macOS 15.4 (arm64)
GCC version: Could not collect
Clang version: 20.1.0
CMake version: version 3.31.6
Libc version: N/A
Python version: 3.13.2 (main, Apr 8 2025, 15:27:33) [Clang 17.0.0 (clang-1700.0.13.3)] (64-bit runtime)
Python platform: macOS-15.4-arm64-arm-64bit-Mach-O
Is CUDA available: N/A
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: N/A
CPU:
Apple M2 Pro
Versions of relevant libraries:
[pip3] Could not collect
[conda] Could not collect
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151607
Approved by: https://github.com/malfet
This case creates subprocess in a subprocess. In Windows it can't load function at this scenario hence I have to skip it
```
File "C:\ProgramData\miniforge3\envs\lfq\lib\multiprocessing\spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "C:\ProgramData\miniforge3\envs\lfq\lib\multiprocessing\spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
AttributeError: Can't get attribute 'run_model' on <module '__main__' (built-in)>
Traceback (most recent call last):
File "<string>", line 25, in <module>
File "<string>", line 16, in test_multi_process
AssertionError
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150999
Approved by: https://github.com/guangyey, https://github.com/EikanWang
Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
So far it's only for `gather`, but we'll move index_select and index to this implementation too. Torchtitan and fbgemm have noticed that gather/index_select perf is bad, this PR brings core implementation to be on par with those customized implementations. Added benefits: all dtypes are supported, a bit less strict on the tensor dimensions/contiguity because we pick the fast path after TensorIterator collapsed the dimensions.
Biggest part of this PR is not even the kernel (it's dumb, just vectorized loads are enough), but moving utilities for vectorized loads and stores from SymmetricMemory to be generally accessible in MemoryAccess.cuh.
Additional tests are coming to make sure this implementation doesn't break anything
`gather` is equivalent to x[indices] for 1d indices via
```
def fn_gather(x, indices):
return torch.gather(x, dim=0, index=indices.unsqueeze(1).expand(-1, x.shape[1]))
def fn_index(x, indices):
return x[indices]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151490
Approved by: https://github.com/Skylion007, https://github.com/eqy
# Motivation
This PR introduces a unified parent class `HostAllocator` with the following goals:
1. Enable backend-specific host allocator registration, including support for out-of-tree backends.
2. Provide a unified and extensible API surface for host memory management across all backends, especially accelerators.
The new interface includes:
- `at::getHostAllocator()->allocate`
- `at::getHostAllocator()->empty_cache`
- `at::getHostAllocator()->record_event`
- `at::getHostAllocator()->get_stats`
- `at::getHostAllocator()->reset_accumulated_stats`
- `at::getHostAllocator()->reset_peak_stats`
# Additional Context
We plan to deprecate legacy APIs such as `at::cuda::CachingHostAllocator_emptyCache` and recommend users migrate to the new backend-specific API, for example:
```cpp
at::getHostAllocator(at::kCUDA)->empty_cache();
```
This refactor will help standardize host memory management across devices and simplify backend integration in the future.
Another key improvement I am going to do is move the `is_pinned` functionality into the `HostAllocator` class, which enables centralized pinned memory verification through calls like `at::getHostAllocator(at::kCUDA)->is_pinned(ptr)`.
Benefits include:
- Consistent host memory handling across all device backends
- Decouple pinned memory functionality with `AcceleratorHooksInterface` in a more modular way
- Clearer separation between device memory allocation and pinned host memory management
This architecture makes the system more maintainable and extensible for future device support.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151431
Approved by: https://github.com/albanD
ghstack dependencies: #151403
This adds a non-blocking mode to queue_pop. This allows for workers to poll if work is ready without blocking the main loop. This is useful for the case where you want to have a GPU have maximum utilization when something only periodically is sent on the queue.
We also expose a `torch.distributed.QueueEmptyError` so users can catch the error and handle it accordingly.
Test plan:
```
pytest test/distributed/test_store.py -k queue -v -s -x
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151485
Approved by: https://github.com/fduwjj, https://github.com/tianfengfrank
collective APIs accept either group or global rank for src/dst rank.
We provide a helper `_canonicalize_group_rank` which converts from maybe
group or maybe global to one particular format (defined by the kwarg
return_global: bool=False).
In this PR we stop performing the mapping lookup that converts group to
global or global to group in the case that the caller wants us to return
the same value that was passed in. The PR should be functionally
equivalent, except in cases where the mapping itself would raise an
exception but the mapping was not necessary in the first place.
This has come up in cases where people create new process groups outside
of 'init_process_group' APIs and group-specific ranks may not have a
valid mapping to the 'global' rank.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151373
Approved by: https://github.com/xunnanxu, https://github.com/d4l3k
Summary: add one option to allow skipping all reduce unused parameters, this could help improve training throughput significantly when the number of unused parameters is large in the model.
Test Plan: unit tests, CI
Differential Revision: D72282069
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151503
Approved by: https://github.com/mrshenli
This is part of splitting up https://github.com/pytorch/pytorch/pull/150558 into smaller chunks, please see that for more context
Use the binary docker build action from https://github.com/pytorch/pytorch/pull/151471
Change the workflow trigger to be all of .ci/docker so it will make a new image + tag whenever it changes.
build script:
* change to be independent of the CUDA_VERSION env var, since all the info should be in the imagename:tag
* remove docker push parts since that will happen during the workflow
* clean up a bit
* make the build script more like the CI build script (use a temp image name)
I don't think this image is actually used anywhere
Also push docker image to imagename:tag, I got rid of it in the PR making the reusable workflow since I thought it was not in the original scripts but it actually is there
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151483
Approved by: https://github.com/ZainRizvi
If a custom operator does not contain a fake impl, currently draft-export will use the real-tensor propagation to get an output for the operator and continue tracing. However if we retrace the exported model using `ep.run_decompositions`, or `export`, or run the exported program with fake tensors, we'll still fail because there's no fake impl.
With this PR, after draft-export we will generate an operator profile for each operator call that we encounter, and store this on the report attached to the exported program `ep._report.op_profiles`. Users can then use `torch._library.fake_profile.register_fake_profile` to temporarily generate and register a fake impl based on these operator profiles. This way future fake tensor retracing will work.
The workflow would look something like:
```python
class M(torch.nn.Module):
def forward(self, a, b):
res = torch.ops.mylib.foo8(a, b) # no fake impl
return res
ep = export(M(), (torch.ones(3, 4), torch.ones(3, 4)) # this fails bc no fake impl
ep = draft_export(M(), (torch.ones(3, 4), torch.ones(3, 4))
ep.run_decompositions() # this fails bc no fake impl
# this registers fake impls based on the profiles
with torch._library.fake_profile.register_fake_profile(ep._report.op_profiles):
decomp = ep.run_decompositions() # this works
new_inp = (
torch.ones(2, 3, 4),
torch.ones(2, 3, 4),
)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150809
Approved by: https://github.com/zou3519
This implements epilogue visitor tree argument generation (example type [here](3fe62887d8/include/cutlass/epilogue/fusion/sm90_callbacks_tma_warpspecialized.hpp (L332))).
Details:
The codegen task here is to implement a function which can generate a tree of C++ structs and properly extract the correct properties from Inductor buffers and write them to the correct locations in the generated struct. To implement this with the minimum amount of code, I generate the cutlass DAGIR (the EVT internal represenation) which specifically has a pass, [pass_argument_type.py ](5e497243f7/python/cutlass/backend/evt/passes/pass_argument_type.py (L4)) which generates a nested tree of custom argument types for each node in the DAGIR. This nested tree of constructors is then passed kwargs to fill in the proper values, where the node's name is used to differentiate between different values in the kwarg dictionary. This however is non-customizable; the nested tree of EVT args is a nested tree of ctypes which looks for *actual values* so that this object can be passed directly to the cutlass-python C++ runner. Inductor on the other hand needs to fill this struct with string C++ expressions representing the values (or extracting the values from kernel launcher args). So `_render_argument_type` implements this: it iterates over the tree of types created by pass_argument_type.py and generates a string representing the nested structs, filling in C++ expressions representing the different fields.
Long term plan:
Long term, I will ask the nvidia to provide an overridable [visitor_factory](5e497243f7/python/cutlass/backend/evt/passes/pass_argument_type.py (L82)) which could allow us to override the behavior of pass_argument_type.py to generate the string we would like during DAGIR generation.
Previously merged:
* #150346
* #150345
* #150344
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150903
Approved by: https://github.com/henrylhtsang, https://github.com/eellison
This is similar to how we handle protobufs and it makes it more convenient for bazel users to handle their version of flatbuffers. (Flatbuffers is very picky about the generated code matching the runtime). Instead of using the checked in generated code, we generate it on the fly.
This is relevant to https://github.com/pytorch/pytorch/issues/112903, because having the version of flatbuffers tied to pytorch will make pytorch difficult to use as an external workspace.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151364
Approved by: https://github.com/malfet
I have noticed that there are some errors like
```
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x95 in position 169302: invalid start byte
```
I havent been able to repro this locally yet, this change should fix the encoding issues
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151472
Approved by: https://github.com/masnesral
Summary:
We add states in the constant folding process for AOTInductor.
Basically, there's 3 states, which is
(1) None: The state when no constants are loaded and uninitialized.
(2) Initialized: The state when constants are loaded, but not yet
folded.
(3) Folded: The state where the model is fully ready with folded
constants.
Note that even if constant folding is not enabled, we still only run
when state is FOLDED, this is okay because without constant folding, the
transition from INITIALIZED to FOLDED is just a pass-throught.
Test Plan:
python test/inductor/test_aot_inductor.py -k test_constant_folding_with_update
Reviewers:
Subscribers:
Tasks:
Tags:
Differential Revision: [D73002538](https://our.internmc.facebook.com/intern/diff/D73002538)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151273
Approved by: https://github.com/jingsh, https://github.com/desertfire
Now that we have bc_lint in CI, this script is no longer needed (nor has it ever been conclusive). I've already updated the Runbook to not need this script.
Suppressing bc_lint as this script is not shipped as a part of torch--it is not user facing! For context, this script is (rarely) used by the release notes manager to ensure BC across releases. It had been broken for at least since 2.6.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151453
Approved by: https://github.com/albanD, https://github.com/jbschlosser
Which is a regression, introduced by https://github.com/pytorch/pytorch/issues/150629#issue-2970312779 which I should have reviewed more thoroughly.
- Defined `_fused_rms_norm`, added MPS-only implementation for it and dispatch from `rms_norm_symint`, which is registered as `CompositeImplicitAutograd`, i.e. it is not supposed to do any computations over Tensor, only dispatch to other ops
-
- Register `_fused_rms_norm` as a fallback in `torch/_inductor/lowering.py`
- Added unit test to avoid those regressions in the future
TODO:
- Get rid of this op, change `rms_norm_symint` definition to `CompositeExplicitAutograd` and implement backward function in `tools/autograd/derivatives.yaml`
- Benchmark compiler and re-enable decomp as follows when compiled code is faster
```python
@register_decomposition(aten._rms_norm_fused)
def rms_norm_fused(
self: torch.Tensor, ndim: int, weight: torch.Tensor, eps: float
) -> torch.Tensor:
dtr = [self.dim() - i - 1 for i in range(ndim)]
return self * weight * (self.pow(2).mean(dtr, keepdim=True).add(eps).rsqrt())
```
Fixes https://github.com/pytorch/pytorch/issues/150629
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150661
Approved by: https://github.com/manuelcandales, https://github.com/jansel
As the title stated.
**Changes Before:**
```C++
[999/1526] Building CUDA object caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/int4mm.cu.o
/root/Git.d/pytorch/pytorch/aten/src/ATen/native/cuda/int4mm.cu(142): warning #177-D: variable "at::native::kWarpSize" was declared but never referenced
constexpr int32_t kWarpSize = 32;
^
Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151427
Approved by: https://github.com/Skylion007, https://github.com/malfet
Summary: In most instances, this action would take ~33% of the total run time, which means that our benchmark would previously differ from the end results by a lot.
Test Plan:
We can compare the benchmark results for
```
CUDA_VISIBLE_DEVICES=4,5 buck run mode/opt -c python.package_style=inplace -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100a //caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --model-snapshot-id=672308665_0 --lower-backend=AOT_INDUCTOR --node-replacement-dict="{'torch.nn.Linear':{'(autotune)': 'fp8_float_model_dynamic_quantization_rowwise'}}" --trace-aot-inductor-module=True --disable-acc-tracer=False --batch-size=1024
```
before and after the diff, and notice that on average, the benchmark results decrease by ~0.1ms per iteration, which is more closely aligned with the lowered modules.
Differential Revision: D72469845
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150696
Approved by: https://github.com/frank-wei
`has_triton()` returns True if Triton is present on the system and supports _any_ backend we care about. In this case, that means we _always_ check gradients, even though the intended behavior is to skip gradients when testing on CPU.
Fixes a bug from #146911.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151435
Approved by: https://github.com/masnesral
Which is a regression, introduced by https://github.com/pytorch/pytorch/issues/150629#issue-2970312779 which I should have reviewed more thoroughly.
- Defined `_fused_rms_norm`, added MPS-only implementation for it and dispatch from `rms_norm_symint`, which is registered as `CompositeImplicitAutograd`, i.e. it is not supposed to do any computations over Tensor, only dispatch to other ops
-
- Register `_fused_rms_norm` as a fallback in `torch/_inductor/lowering.py`
- Added unit test to avoid those regressions in the future
TODO:
- Get rid of this op, change `rms_norm_symint` definition to `CompositeExplicitAutograd` and implement backward function in `tools/autograd/derivatives.yaml`
- Benchmark compiler and re-enable decomp as follows when compiled code is faster
```python
@register_decomposition(aten._rms_norm_fused)
def rms_norm_fused(
self: torch.Tensor, ndim: int, weight: torch.Tensor, eps: float
) -> torch.Tensor:
dtr = [self.dim() - i - 1 for i in range(ndim)]
return self * weight * (self.pow(2).mean(dtr, keepdim=True).add(eps).rsqrt())
```
Fixes https://github.com/pytorch/pytorch/issues/150629
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150661
Approved by: https://github.com/manuelcandales, https://github.com/jansel
Summary:
When we have an exported program that looks like this:
```
ExportedProgram:
class GraphModule(torch.nn.Module):
def forward(self, b__tensor_constant0: "f32[1]", ... c_lifted_tensor_0: "i64[925]", …. , tupleized_input_0_0: "f32[10, 2139]",
clone: "i64[925]" = torch.ops.aten.clone.default(c_lifted_tensor_0); c_lifted_tensor_0 = None
index_select: "f32[10, 925]" = torch.ops.aten.index_select.default(tupleized_input_0_0, 1, clone); clone = None
```
The graph after `aot_export_module` could have a name collision, notice that `_tensor_constant0` arg of `clone` is different from the `_tensor_constant0` in the input module .
```
def forward(self):
arg9_1: "f32[10, 2139]"
_tensor_constant0: "f32[1]" = self._tensor_constant0 # this should be int64, conflicted with the original _tensor_constant0, had a clone on this constant before lifting
index: "f32[10, 925]" = torch.ops.aten.index.Tensor(arg9_1, [None, _tensor_constant0]); _tensor_constant0 = None
```
This caused the `tensors used as indices must binary, int...` aoti error on PT2I dashboard because later we used `clone` as index.
We had this error because we created a new `_tensor_constant0` at [here](https://github.com/pytorch/pytorch/blob/main/torch/fx/_symbolic_trace.py#L403-L412), and the new `_tensor_constant0` overrides the original `_tensor_constant0` on the input Module in `_unlift_graph`. The `arg` for `clone` is created at `create_proxy` in `proxy.py`.
To fix this, we do a graph pass before we unlift the graph inputs to avoid name collision
Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r aot_compile_constant_folding
buck2 run mode/dev-nosan caffe2/test/inductor:test_aot_inductor -- -r aoti_constant_tensor_name_collision
```
Differential Revision: D72761937
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151123
Approved by: https://github.com/tugsbayasgalan, https://github.com/jingsh
# Motivation
This stack of PRs aims to generalize and improve PyTorch host allocator code.
This PR introduces a `DeleterFnPtr` template parameter to `CachingHostAllocatorInterface` to resolve circular dependency issues. This change allows for better code reuse and simplifies the implementation of host allocators.
# Additional Context
TODO:
- [ ] Unify host allocator related API
- [ ] Deprecate those device-specific legacy API
- [ ] Move `is_pinned` to host allocator
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151403
Approved by: https://github.com/gujinghui, https://github.com/albanD
This PR reduces #graph partitions by reordering nodes when the `should_partition` nodes have simple dependencies. Specifically, for `should_partition` nodes:
a. If a node has no dependency or only depends on graph inputs: move to the front. Use case is when we move symints to cuda tensor for PaddedTensorSubclass
b. If the only user of a node is OutputNode: move it to the end.
#### Example
The following example shows a padded tensor subclass use case where we copy symint to a cuda tensor (aka mask) in the middle of function. Reordering still generates 1 cudagraph by moving the mask to the front.
```python
import torch
torch._inductor.config.graph_partition = True
# Two reasons for this:
# 1. We want to reuse the same mask for many masked_fill calls
# 2. Prevent inductor from fusing this op into other ops (e.g. masked_fill)
# so we can still reorder in scheduler
@torch.library.custom_op("mylib::create_mask", mutates_args=(), tags=(torch._C.Tag.cudagraph_unsafe,))
def create_mask(padded_size: int, original_size: int, device: torch.device) -> torch.Tensor:
mask = torch.zeros((padded_size,), dtype=torch.bool, device=device)
mask[original_size:] = True
return mask
@create_mask.register_fake
def _(padded_size, original_size, device):
return torch.empty((padded_size,), dtype=torch.bool, device=device)
def f(padded_tensor, original_tensor, weight):
original_size = original_tensor.size()[0]
padded_size = padded_tensor.size()[0]
# element wise op so we don't care padding value
padded_tensor = padded_tensor + 1
padded_tensor = torch.nn.functional.relu(padded_tensor)
# dot product requires padding with 0
dot_res = padded_tensor.dot(weight)
padded_tensor += dot_res
# min requires padding with inf, so we create mask now
mask = create_mask(padded_size, original_size, padded_tensor.device)
min_res = torch.min(
torch.ops.aten.masked_fill(padded_tensor, mask, float("inf"))
)
# max requires padding with inf. we can reuse previous mask
max_res = torch.max(
torch.ops.aten.masked_fill(padded_tensor, mask, -float("inf"))
)
return min_res+max_res+padded_tensor
compiled_f = torch.compile(f, mode="reduce-overhead")
def run(padded_size, original_size):
padded_tensor = torch.randn(padded_size, device="cuda")
padded_tensor[original_size:] = 0
original_tensor = torch.randn(original_size, device="meta")
weight = torch.randn(padded_size, device="cuda")
eager_out = f(padded_tensor, original_tensor, weight)
compiled_out = compiled_f(padded_tensor, original_tensor, weight)
assert torch.allclose(eager_out[0], compiled_out[0])
assert torch.allclose(eager_out[1], compiled_out[1])
# new cudagraph
run(8, 4)
# new cudagraph due to recompile
run(8, 6)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150814
Approved by: https://github.com/eellison
# Feature
This fixes a bug related to block pointer stores. Since Triton's block pointer stores don't support implicit broadcasting, in certain cases we need to generate a `reshape->broadcast->reshape` pattern to ensure that the tensor being stored has the same shape as the block pointer. This happens when the block indexing expression involves strides of 0 or dimensions of 1, both of which we eliminate from the block pointer.
The existing logic missed an important edge case. We may need a broadcast prior to the first `reshape` of this pattern, in case the tensor comes from a load with implicit broadcasting. For example, if the range trees have shape `[YBLOCK, XBLOCK]`, but the load has a shape `[1, XBLOCK]`, we need to broadcast this to `[YBLOCK, XBLOCK]` prior to storing. See the example kernel below, which comes from `expand` -> `clone` with 3D tiling. The load has an implicit broadcast, and the store has a reshape. Thus, we need to insert an explicit broadcast between them.
```
@triton.jit
def triton_poi_fused_clone_0(in_ptr0, out_ptr0, znumel, ynumel, xnumel, ZBLOCK : tl.constexpr, YBLOCK : tl.constexpr, XBLOCK : tl.constexpr):
znumel = 32
ynumel = 1
xnumel = 32
zoffset = tl.program_id(2) * ZBLOCK
zindex = zoffset + tl.arange(0, ZBLOCK)[:, None, None]
zmask = zindex < znumel
yoffset = tl.program_id(1) * YBLOCK
yindex = yoffset + tl.arange(0, YBLOCK)[None, :, None]
ymask = tl.full([ZBLOCK, YBLOCK, XBLOCK], True, tl.int1)
xoffset = tl.program_id(0) * XBLOCK
xindex = xoffset + tl.arange(0, XBLOCK)[None, None, :]
xmask = xindex < xnumel
x1 = xindex
z0 = zindex
tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[32], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), boundary_check=[0], eviction_policy='evict_last')[None, None, :]
tl.store(tl.make_block_ptr(out_ptr0, shape=[32, 32], strides=[32, 1], block_shape=[ZBLOCK, XBLOCK], order=[1, 0], offsets=[zoffset, xoffset]), tl.reshape(tl.broadcast_to(tmp0, [ZBLOCK, YBLOCK, XBLOCK]), [ZBLOCK, XBLOCK]).to(tl.float32), boundary_check=[0, 1])
''', device_str='cuda')
```
The tricky part is that we don't want to emit redundant broadcasts in the store. This PR reworks the logic a bit to make sure we don't emit a second broadcast unless it actually changes the shape.
# Test plan
Added a CI test for this case, which would fail on trunk. Checked that only one broadcast was emitted.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151399
Approved by: https://github.com/jansel, https://github.com/eellison
Given an exception in torch.export, I want to try/catch it to add the message "hey try out draft-export!". Currently I only add this message for errors that draft-export is known to fix, like DataDependentErrors, ConstraintViolationErrors, and no fake impl.
Originally the error message looks like:
```
File "/data/users/angelayi/pytorch/torch/_library/custom_ops.py", line 626, in fake_impl
raise RuntimeError(
RuntimeError: There was no fake impl registered for <CustomOpDef(mylib::foo2)>. This is necessary for torch.compile/export/fx tracing to work. Please use `foo2_impl.register_fake` to add an fake impl.
```
Now, the error msg now looks something like:
```
File "/data/users/angelayi/pytorch/torch/_library/custom_ops.py", line 626, in fake_impl
raise RuntimeError(
RuntimeError: There was no fake impl registered for <CustomOpDef(mylib::foo2)>. This is necessary for torch.compile/export/fx tracing to work. Please use `foo2_impl.register_fake` to add an fake impl.
The error above occurred when calling torch.export.export. If you would like to view some more information about this error, and get a list of all other errors that may occur in your export call, you can rerun your program with the `DRAFT_EXPORT=1` envvar, or replace your `export()` call with `draft_export()`.
```
In python versions >= 3.11, we can use `exception.add_note` to add to the error message. However with previous versions I did a hack to modify `e.args`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151065
Approved by: https://github.com/pianpwk
ghstack dependencies: #151051
As the title stated.
**Changes:**
- Remove unnecessary header file
- Remove unnecessary registry logic about PrivateUse1HooksRegistry,such as TORCH_DECLARE_REGISTRY, C10_DEFINE_REGISTRY, etc,.
- using static + global variable to do initialization instead of call_one
**Next Step:**
Enable test_openreg.py in CI/CD to guard the quality of PrivateUse1
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151005
Approved by: https://github.com/albanD
Implement traceable config patching for Dynamo: enables restricted patching of Dynamo config where user can use a context manager/decorator to change tracing behavior for parts of the code.
The new `dont_skip_tracing` decorator/context manager for ignoring most trace rules is easily implemented with this more generic traceable config patching feature.
Implementation:
- Create a new specialized context manager class representing a wrapper around torch._dynamo.config.patch
- Dynamo doesn't trace into the context manager but updates config at compile time
- Correctness is based on our correctness for handling supported context managers
- Implementation is inspired by how `GradModeVariable` is implemented.
Previous attempts: https://github.com/pytorch/pytorch/pull/148736 (decorator-only global approach) and https://github.com/pytorch/pytorch/pull/149439 (decorator-only traceback approach)
See https://docs.google.com/document/d/1vWNwKL_jpg-PLopifcaSa338wks3GqSVF4GHRguybGg/edit?tab=t.0 for more details on implementation - including previous approaches.
NOTE: this PR fixes a bug where skipped code objects were not tracked by convert_frame.py, leading to cases where code objects would be automatically skipped even after `torch._dynamo.reset()`. This exposed some latent dynamo-wrapped test failures in CI that previously passed in CI but not locally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150586
Approved by: https://github.com/jansel, https://github.com/zou3519, https://github.com/anijain2305
Summary:
D72906445 seemed to cause a SIGABRT when running the test in the test plan. The change I narrowed it down to was where in fake_impls the [`deregister_fake_kernel` no longer calls `lib.destroy`](https://github.com/pytorch/pytorch/pull/150806/files#diff-7fd3f4222276c63b91f3a895530bb5efe137fd23165b48f25afcf3c06a5d2a8fL65-L69).
Calling `lib.destroy` in that handle results in a maximum recursion error where someone calls library.destroy which calls the handle which calls back to library.destroy.
So I compared the implementation of this _del_library and lib.destroy and it seemed like the main thing different was deleting `self.m`. So adding that fixed my issue!
Side note, I feel like we can combine `_del_library` and `library._destroy`? But I won't do it in this diff to make sure we don't break too many things 😅
Test Plan:
`buck test 'fbcode//mode/opt' fbcode//aiplatform/gmpp/bulk_eval/reader/service/tests:reader_service_handler_tests -- --exact 'aiplatform/gmpp/bulk_eval/reader/service/tests:reader_service_handler_tests - aiplatform.gmpp.bulk_eval.reader.service.tests.reader_service_handler_tests.ReaderServiceHandlerTests: test_add_preproc_output_into_queue'`
https://www.internalfb.com/intern/testinfra/testrun/10977524170296078
Differential Revision: D73017613
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151299
Approved by: https://github.com/zou3519
Currently, `torch._chunk_cat` only supports contiguous inputs (due to `.view()` usage in `_pad_chunk()` supporting only contiguous tensor). This doesn't work for internal models where there can be non-contiguous input tensors:
- size=[8192, 16416], stride=[16448, 1] # stride[0] is larger than size[1]
- size=[1152, 384], stride=[1, 1152] # column-major tensor
In this PR, we relax the assumption on contiguous input tensor, by switching from `.view()` to `.reshape()`. Note that since `.reshape()` will try to use `.view()` under the hood whenever possible, this should not cause regression to existing use cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151263
Approved by: https://github.com/BoyuanFeng
After https://github.com/pytorch/pytorch/pull/150652, we still see some ranks missing dumps. Upon looking further, the case is that FR dump timed out for its first attempt:
watchdog thread: notify FR dump -> wait for 1 mins -> throw watchdog timeout -> notify elastic to kill process
FR dump thread: received FR dump signal -> timeout after 1 mins with first attempt -> started 2nd attempt -> got killed.
So we want to make the FR dump timeout shorter, in reality, the log shows that the dump finished within one sec. Even if we consider a very slow speed like 200K/s the usual size FR (1MB at most) takes around 5 secs, so 15 secs is like 3 times buffer.
Also we still let watchdog sleep for 1 min so that we can wait enough time for two dump to timeout and the following check like GIL checker to execute.
Also, if we get stuck in getting GIL or cuda hang, 15 seconds should be enough to detect the hang.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151329
Approved by: https://github.com/fegin
Finishes up the work started in #121686 + adds test
Update: this was not as straightforward as I originally imagined. Context below.
**TL;DR:** `TestFoo{CPU, CUDA}` now actually derive from `TestFoo`! Also, `{CPU, CUDA}TestBase` setup / teardown logic is now always called (it is required to set the primary device), regardless of whether `super().setUpClass()` / `super().tearDownClass()` are called or not.
**Background:** The typical way to get device-specific tests is to write a generic `TestFoo` and call `instantiate_device_type_tests(TestFoo, locals())` to get `TestFooCPU`, `TestFooCUDA`, etc. After this, generic tests (e.g. `TestFoo.test_bar()`) become `TestFooCPU.test_bar_cpu()` / `TestFooCUDA.test_bar_cuda()`.
Behind the scenes, this was historically accomplished by creating a `TestFooCUDA` that derives from both a `CUDATestBase` and an *empty class* called `TestFoo_base`. This `TestFoo_base` has the same bases as `TestFoo`, but none of the test functions (e.g. `test_bar()`). The documented reason for this is to avoid things like a derived `TestFooCUDA.test_bar()` being discovered in addition to the real device-specific test `TestFooCUDA.test_bar_cuda()`.
(1) A reason this matters is because it should be possible to call e.g. `super().setUpClass()` from a custom setup / teardown classmethod. If the generated TestFooCUDA does not derive from TestFoo, but instead derives from the empty class described above, this syntax does not work; in fact there is no way to form a proper `super()` call that works across the device-specific test variants. Here's an example that breaks in the OpInfo tests:
070f389745/test/test_ops.py (L218-L221)
(2) Further, there is some precedent within a custom `setUpClass()` impl for storing things on the `cls` object to be accessed at test time. This must be the device-specific test class (`TestFooCUDA`) and not `TestFoo` for this to work. As an example, the open device registration tests load a module during setup and use it in the test logic:
070f389745/test/test_cpp_extensions_open_device_registration.py (L63-L77)070f389745/test/test_cpp_extensions_open_device_registration.py (L79-L80)
To accomplish both (1) and (2) at the same time, I decided to revisit the idea of utilizing a proper inheritance hierarchy for `TestFoo` -> `{TestFooCPU, TestFooCUDA}`. That is: have TestFooCPU / TestFooCUDA **actually** derive from `TestFoo`. This achieves both (1) and (2). The only thing left is to make sure the generic tests (e.g. `TestFoo.test_bar()`) are not discoverable, as was the stated reason for diverging from this in the first place. It turns out we can simply `delattr()` these generic tests from `TestFoo` once `TestFooCPU` / `TestFooCUDA` have been setup with the device-specific variants, and all works well. The `instantiate_device_type_tests(...)` logic already deletes `TestFoo` from scope, so I don't see a problem with deleting generic tests from this base class as well (CI will prove me right or wrong ofc).
**Side note:** I was encountering a weird race condition where sometimes the custom `setUpClass()` / `tearDownClass()` defined & swapped in [here](4a47dd9b3f/torch/testing/_internal/common_device_type.py (L940-L955)) would be used, and sometimes it wouldn't. This non-deterministic behavior was called out previously by @ngimel here:
4a47dd9b3f/test/inductor/test_torchinductor_dynamic_shapes.py (L128-L130)
To address this, I moved this block of logic to before the first call to `instantiate_test()`, as that method queries for the primary device, and the primary device identification logic may manually invoke `setUpClass()` (see [here](4a47dd9b3f/torch/testing/_internal/common_device_type.py (L381-L384))). Goal: define the `setUpClass()` / `tearDownClass()` we want for correctness before they're ever called. This seems to work and the behavior is deterministic now AFAICT.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151129
Approved by: https://github.com/janeyx99, https://github.com/masnesral, https://github.com/malfet
Introduced by https://github.com/pytorch/pytorch/pull/149512
Before this change, following warning was generated
```
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/transformers/attention.cpp:452:71: warning: extra ';' outside of a function is incompatible with C++98 [-Wc++98-compat-extra-semi]
452 | REGISTER_HPU_DISPATCH(_fused_sdp_choice_stub, &_fused_sdp_choice_meta);
| ^
1 warning generated.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151367
Approved by: https://github.com/drisspg
This PR adds standalone_compile API that does precompilation via caching to support vLLM use case in the short term while we work on the longer term precompilation solution.
```
standalone_compile(gm, example_inputs, options) -> CompiledArtifact
CompiledArtifact.save(path, format: binary|unpacked = binary)
CompiledArtifact.load(path, format: binary|unpacked = binary)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150670
Approved by: https://github.com/jamesjwu, https://github.com/zou3519
By suppressing `missing-template-arg-list-after-template-kw` warning, which seems to be required to compile Google's libnop, which is in a semi-abandoned state now
```
In file included from /Users/malfet/git/pytorch/pytorch/third_party/tensorpipe/third_party/libnop/include/nop/base/variant.h:21:
/Users/malfet/git/pytorch/pytorch/third_party/tensorpipe/third_party/libnop/include/nop/types/variant.h:241:30: error: a template argument list is expected after a name prefixed by the template keyword [-Wmissing-template-arg-list-after-template-kw]
241 | index_ = value_.template Construct(std::forward<Args>(args)...);
| ^
/Users/malfet/git/pytorch/pytorch/third_party/tensorpipe/third_party/libnop/include/nop/types/variant.h:258:26: error: a template argument list is expected after a name prefixed by the template keyword [-Wmissing-template-arg-list-after-template-kw]
258 | if (!value_.template Assign(TypeTag<T>{}, index_, std::forward<U>(value))) {
| ^
/Users/malfet/git/pytorch/pytorch/third_party/tensorpipe/third_party/libnop/include/nop/types/variant.h:265:26: error: a template argument list is expected after a name prefixed by the template keyword [-Wmissing-template-arg-list-after-template-kw]
265 | if (!value_.template Assign(index_, std::forward<T>(value))) {
| ^
3 errors generated.
```
Fixes https://github.com/pytorch/pytorch/issues/151316
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151344
Approved by: https://github.com/ZainRizvi, https://github.com/seemethere
Summary:
This fixes the error in https://fb.workplace.com/groups/1075192433118967/permalink/1640802133224658/
I tried really hard but I couldn't come up with a test case to repro the issue, but I confirmed with the OP that this issue has been fixed.
```
Traceback (most recent call last):
File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/compile_fx.py", line 746, in _compile_fx_inner
mb_compiled_graph = fx_codegen_and_compile(
File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/compile_fx.py", line 1343, in fx_codegen_and_compile
return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/compile_fx.py", line 1232, in codegen_and_compile
compiled_module = graph.compile_to_module()
File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/graph.py", line 2087, in compile_to_module
return self._compile_to_module()
File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/graph.py", line 2095, in _compile_to_module
self.codegen_with_cpp_wrapper() if self.cpp_wrapper else self.codegen()
File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/graph.py", line 2002, in codegen
self._update_scheduler()
File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/graph.py", line 1996, in _update_scheduler
self.scheduler = Scheduler(self.operations)
File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/scheduler.py", line 1954, in __init__
self._init(nodes)
File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/scheduler.py", line 1974, in _init
self.update_zero_dim_cpu_tensor()
File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/scheduler.py", line 4433, in update_zero_dim_cpu_tensor
and buffer.get_size() == []
File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/ir.py", line 3903, in get_size
return [*self.get_layout().size]
File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/ir.py", line 3914, in get_layout
raise NotImplementedError(type(self.layout).__name__)
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
NotImplementedError: NoneLayout
```
Test Plan: OP said the issue is fixed
Differential Revision: D72575808
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151321
Approved by: https://github.com/BoyuanFeng
We revisited how coalesced collective is working in https://github.com/pytorch/pytorch/pull/151243 and we now want to enable the script to work for slow path. The change is indeed bc-breaking but this is needed to make it work and the API is an internal use API. It is not user facing. For slow path the individual has input-sizes and output sizes recorded but no state. The final one has the state ready. We check the correctness of each individual collective one by one but we don't check the state match for these collectives, we can only check the state match for the last one which is the work item with coalesced label.
Added more unit test for slow path.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151247
Approved by: https://github.com/d4l3k, https://github.com/XilunWu
Sometimes the python script didn't exit normally and the lock file remains in the path. In this case, the `file_baton.py` may sleep forever waiting for the lock file to release. This PR will add a warning to show the existing lock file path, let the user better understand which file to delete when the waiting time is too long.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149382
Approved by: https://github.com/soulitzer
Finishes up the work started in #121686 + adds test
Update: this was not as straightforward as I originally imagined. Context below.
**TL;DR:** `TestFoo{CPU, CUDA}` now actually derive from `TestFoo`! Also, `{CPU, CUDA}TestBase` setup / teardown logic is now always called (it is required to set the primary device), regardless of whether `super().setUpClass()` / `super().tearDownClass()` are called or not.
**Background:** The typical way to get device-specific tests is to write a generic `TestFoo` and call `instantiate_device_type_tests(TestFoo, locals())` to get `TestFooCPU`, `TestFooCUDA`, etc. After this, generic tests (e.g. `TestFoo.test_bar()`) become `TestFooCPU.test_bar_cpu()` / `TestFooCUDA.test_bar_cuda()`.
Behind the scenes, this was historically accomplished by creating a `TestFooCUDA` that derives from both a `CUDATestBase` and an *empty class* called `TestFoo_base`. This `TestFoo_base` has the same bases as `TestFoo`, but none of the test functions (e.g. `test_bar()`). The documented reason for this is to avoid things like a derived `TestFooCUDA.test_bar()` being discovered in addition to the real device-specific test `TestFooCUDA.test_bar_cuda()`.
(1) A reason this matters is because it should be possible to call e.g. `super().setUpClass()` from a custom setup / teardown classmethod. If the generated TestFooCUDA does not derive from TestFoo, but instead derives from the empty class described above, this syntax does not work; in fact there is no way to form a proper `super()` call that works across the device-specific test variants. Here's an example that breaks in the OpInfo tests:
070f389745/test/test_ops.py (L218-L221)
(2) Further, there is some precedent within a custom `setUpClass()` impl for storing things on the `cls` object to be accessed at test time. This must be the device-specific test class (`TestFooCUDA`) and not `TestFoo` for this to work. As an example, the open device registration tests load a module during setup and use it in the test logic:
070f389745/test/test_cpp_extensions_open_device_registration.py (L63-L77)070f389745/test/test_cpp_extensions_open_device_registration.py (L79-L80)
To accomplish both (1) and (2) at the same time, I decided to revisit the idea of utilizing a proper inheritance hierarchy for `TestFoo` -> `{TestFooCPU, TestFooCUDA}`. That is: have TestFooCPU / TestFooCUDA **actually** derive from `TestFoo`. This achieves both (1) and (2). The only thing left is to make sure the generic tests (e.g. `TestFoo.test_bar()`) are not discoverable, as was the stated reason for diverging from this in the first place. It turns out we can simply `delattr()` these generic tests from `TestFoo` once `TestFooCPU` / `TestFooCUDA` have been setup with the device-specific variants, and all works well. The `instantiate_device_type_tests(...)` logic already deletes `TestFoo` from scope, so I don't see a problem with deleting generic tests from this base class as well (CI will prove me right or wrong ofc).
**Side note:** I was encountering a weird race condition where sometimes the custom `setUpClass()` / `tearDownClass()` defined & swapped in [here](4a47dd9b3f/torch/testing/_internal/common_device_type.py (L940-L955)) would be used, and sometimes it wouldn't. This non-deterministic behavior was called out previously by @ngimel here:
4a47dd9b3f/test/inductor/test_torchinductor_dynamic_shapes.py (L128-L130)
To address this, I moved this block of logic to before the first call to `instantiate_test()`, as that method queries for the primary device, and the primary device identification logic may manually invoke `setUpClass()` (see [here](4a47dd9b3f/torch/testing/_internal/common_device_type.py (L381-L384))). Goal: define the `setUpClass()` / `tearDownClass()` we want for correctness before they're ever called. This seems to work and the behavior is deterministic now AFAICT.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151129
Approved by: https://github.com/janeyx99, https://github.com/masnesral, https://github.com/malfet
This PR intends to rework the dispatching of the autograd key.
I.e., currently the DispatchKey.Autograd of the HOPs was triggered, even if non of the operands of the HOP have `requires_grad=True`. With this rework, the autograd is bypassed if non of the operands require gradients and only invoked if any of the operands require gradients.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151107
Approved by: https://github.com/ydwu4
There are only 118 failures atm, mark them all with xfail to avoid new regressions
Add `xfail_if_mps_unimplemented` decorator to distinguish between tests that call unimplemented eager op vs ones that fail for some other reason.
Added `aten._scaled_dot_product_attention_math_for_mps` fallback to make test behavior consistent between MacOS-15 (where falback is in place) and MacOS-14
Weird MacOS-14 specific skips:
- test_torchinductor.py::GPUTests::test_cat_extern_kernel_mps
- test_torchinductor.py::GPUTests::test_sort_transpose_mps (likely an eager bug)
- test_torchinductor.py::GPUTests::test_unaligned_input_mps
Numerous MacOS-13 skips, including few eager hard crashes, for example running `test_torchinductor.py::GPUTests::test_scatter5_mps` causes
```
/AppleInternal/Library/BuildRoots/c651a45f-806e-11ed-a221-7ef33c48bc85/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSNDArray/Kernels/MPSNDArrayScatter.mm:309: failed assertion `Rank of destination array (1) must be greater than or equal to inner-most dimension of indices array (3)'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150821
Approved by: https://github.com/ZainRizvi, https://github.com/dcci
ghstack dependencies: #151224, #151246, #151272, #151282, #151288
Summary:
as title
`export._trace._WrapperModule` is used to wrap functions into a Module so we can export the function.
We add `export._wrapper_utils` to `dynamo`'s `MOD_INLINELIST` so dynamo traces into `_WrapperModule`
Fixes https://github.com/pytorch/pytorch/issues/146867
Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test:test_export -- -r wrapper_module
```
Differential Revision: D72986826
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151264
Approved by: https://github.com/angelayi
Preparatory refactor for https://github.com/pytorch/pytorch/pull/146942.
# Feature
This PR refactors the existing wrapper codegen into `WrapperLine` subclasses, extending the existing Memory Planning IR into a fully-fledged Wrapper IR. See the diagram below.

The IR currently supports the following ops:
- All existing memory planning IR ops (`AllocateLine`, `FreeIfNotReusedLine`, etc.)
- Reinterpret views (`ReinterpretLine`)
- Kernel definitions (`KernelDefinitionLine`)
- Calls to defined kernels (`KernelCallLine`)
- Calls to extern kernels (`ExternKernelLine`, `ExternKernelAllocLine`)
- Ops with multiple outputs (`MultiOutputLine`)
- Tensor cleanup at the end of a graph (`FreeLine`)
- Leaving comments in code (`CommentLine`)
There are two main motivations for this refactor:
1. Unlike free-form C++ and and Python code, Wrapper IR lines provide structured information about what the wrapper code does. This serves as a natural extension point for other types of wrapper codegen. For example, the parent PR generates FX IR from Wrapper IR. Wrapper IR aims to give new backends enough information to generate wrapper code without needing to modify core Inductor files such as `ir.py`.
2. This design will hopefully promote stronger modularity and encapsulation.
a. Inductor's core compilation passes don't need to worry about whether they're targeting Python, C++, FX or anything else. They can simply focus on generating Wrapper IR, and target-specific code can be refactored into the various backends.
b. Backends do not need to know about all the details and internal state of `V.graph` IR. For example, they don't need to consider whether a buffer has been removed from the graph when generating code. Wrapper IR will hopefully provide a simpler interface for generating wrapper code, which abstracts away the details of device code.
# Implementation details
The implementation mainly consists of separating direct C++/Python codegen into two phases:
1. Emit Wrapper IR lines describing what the wrapper code is supposed to do.
2. Inside the `codegen()` method of each `WrapperLine`, call backend methods which generate pure Python/C++ code using the information stored in the Wrapper IR line. For example, `KernelCallLine` calls `wrapper._generate_kernel_call_helper`, which is overriden by the various Python and C++ backends to generate the final wrapper code.
The main difficulty in implementing this is that we need to be careful that code is generated in the correct order. Wrapper codegen happens in two passes: first we write code into `self.lines` which mainly contains wrapper IR, but can also contain raw Python or C++ lines in some situations. Then, we convert the wrapper IR into the final Python/C++ code in `self.wrapper_call`. Since the same macros may be used in both passes, it's difficult to ensure that code is written to the correct buffer. The easiest solution for this was to implement a context manager overriding the `writeline` method to write to `self.wrapper_call` after memory planning is finished. This way, `writeline` writes to `self.lines` in the first pass, and `self.wrapper_call` in the second. This obviated the need to pass `code` or `writeline` variables all the way through the call stack, which would have touched most of the existing macros.
# Test plan
Since this refactor touches all the existing wrapper codegen classes, the existing CI provides good coverage.
The parent PR introduces new tests for the FX IR backend. Among other things, these tests assert that `self.lines` only contains Wrapper IR lines, and no free-form code. While this would not be true of all programs today, the tests suggests that the IR implemented in this PR is sufficient to cover basic PyTorch usage.
# Future directions
These two goals are only partially realized by this PR. These are several important steps which still undergo direct Python/C++ codegen in core files:
- User-defined Triton kernels.
- Reinterpret views on outputs, from `gen_output_refs()`. (In the parent PR, the FX converter has a custom way of handling this. This can eventually be ported into Wrapper IR.)
- Fallback ops with custom `codegen()` methods, e.g. `ScatterFallback`.
- Misc. C++ lines emitted by the various cpp backends, e.g. declaring constants.
These cases will gradually be handled in subsequent PRs, as the Inductor->FX converter expands its coverage. Given that these refactors are pretty tricky to do, it seems wiser to execute them in stages, as opposed to porting everything to Wrapper IR at once.Some Python and codegen still lives in core files such as `ir.py`, as described in previous sections. Hopefully, this PR will serve as a starting point which moves the codebase towards a more modular design. Over time, we can gradually refactor the remaining codegen (mainly in `ir.py`) into backend classes.
One limitation of this PR is that codegen still happens in two phases during `PythonWrapperCodegen`. First, we generate Wrapper IR into `self.lines`, and from there we generate Python or C++ code into `self.wrapper_call`, `self.header`, etc. In the long term, it would be cleaner to split wrapper IR into its own class which doesn't deal with Python/C++ codegen at all. (See the diagram at the top.) That would strictly enforce the boundary between Wrapper IR and Python/C++ wrapper code. However, this would probably be a much larger refactor.
Another limitation of the current code is that the helper functions have a lot of call args. It's also possible to clean this up by passing Wrapper IR ops e.g. `KernelCallLine` into helper functions like `_generate_kernel_call_helper`, since they store all the arguments. However, that change would likely be prone to merge conflicts, so I would like to save it for follow-up PRs if possible.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150458
Approved by: https://github.com/eellison
PT2 benchmark scripts has a pattern like:
```
def forward_and_backward_pass(self, mod, inputs, collect_outputs=True):
cloned_inputs = clone_inputs(inputs)
self.optimizer_zero_grad(mod)
with self.autocast(**self.autocast_arg):
pred = mod(**cloned_inputs)
loss = self.compute_loss(pred)
self.grad_scaler.scale(loss).backward()
self.optimizer_step()
if collect_outputs:
return collect_results(mod, pred, loss, cloned_inputs)
return None
```
for training.
The collect_outputs argument is True only for accuracy testing and it's false for performance testing.
For HF benchmark suite, a model usually returns tuple (loss, logits). For performance testing, even though the logits is never used anywhere, dynamo has to keep it due to the control flow.
A few bad things if we keep logits here
1. the peak memory will be higher since the logits is large and we can not release its memory earlier.
2. we can not do optimization like chunking for the logits because the tensor needs to be returned from the pre-grad graph
Actually I think it's fine to not return logits at all.
- For training cases, checking loss and gradients for accuracy is good enough. It's hard to see two runs have mismatch logits but matching loss/gradients.
- Also, discarding logits as soon as possible for perf benchmarking makes it more fair for us.
On the other hand, it may be interesting to let dynamo support something like dynamo.constexpr (similar to tl.constexpr). A variable annotated as dynamo.constexpr will be specialized at compile time and we can do more optimization (DCE e.g.) at compile time. (A small [repro](https://gist.github.com/shunting314/0912a8947028a904c34f361021b8024d))
Benchmark results here [link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Fri%2C%2004%20Apr%202025%2018%3A03%3A26%20GMT&stopTime=Fri%2C%2011%20Apr%202025%2018%3A03%3A26%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=gh/shunting314/204/head&lCommit=fe25dab3f65e1b0e9db0af03f7664af70fcc9c66&rBranch=main&rCommit=55e62ff74ad5614faf80b060c7bfc551e3b7af5a)
- HF 15% (1.51 -> 1.66 compression ratio) peak memory improvement
- I also see 5% (2.74 -> 2.79x) perf win for HF. It could be true. We may generate more efficient kernels since we don't need keep logits and return it from the pre-grad graph. But I'll double check
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151075
Approved by: https://github.com/eellison, https://github.com/jansel
Retry of https://github.com/pytorch/pytorch/pull/150957, which was reverted due to internal meta failures
Credit to @mgmtea who wrote the initial version of this PR: https://github.com/pytorch/pytorch/pull/146604
Context: CUPTI is the NVIDIA library that Kineto uses for collecting GPU-side info during profiling. The intended usage is to register a callback while you want profiling to occur, and then unregister the callback when you want profiling to stop. But a bug would cause crashes if CUPTI callbacks were de-registered when used with cudagraphs. The workaround was to disable "CUPTI_LAZY_REINIT" and "CUPTI_TEARDOWN" in Kineto - which prevents crashes, but can result in slower execution after profiling has occurred and completed.
This bug is believed to be fixed in CUDA >= 12.6, so this PR qualifies that DISABLE_CUPTI_LAZY_REINIT=1 and CUPTI_TEARDOWN=0 should only be applied if CUDA >= 12.6. Additionally, `profiler_allow_cudagraph_cupti_lazy_reinit_cuda12()` is added as an escape hatch so that we can add a killswitch in case we see more crashes related to this.
Differential Revision: [D72842114](https://our.internmc.facebook.com/intern/diff/D72842114/)
**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D72842114/)!
Differential Revision: [D72842114](https://our.internmc.facebook.com/intern/diff/D72842114)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151124
Approved by: https://github.com/sraikund16
If optree is less than the minimum version, we should pretend it doesn't
exist.
The problem right now is:
- Install optree==0.12.1
- `import torch._dynamo`
- This raise an error "min optree version is 0.13.0"
The fix is to pretend optree doesn't exist if it is less than the min
version.
There are ways to clean up this PR more (e.g. have a single source of
truth for the version, some of the variables are redundant), but I am
trying to reduce the risk as much as possible for this to go into 2.7.
Test Plan:
I verified the above problem was fixed. Also tried some other things,
like the following, which now gives the expected behavior.
```py
>>> import torch
>>> import optree
>>> optree.__version__
'0.12.1'
>>> import torch._dynamo
>>> import torch._dynamo.polyfills.pytree
>>> import torch.utils._pytree
>>> import torch.utils._cxx_pytree
ImportError: torch.utils._cxx_pytree depends on optree, which is
an optional dependency of PyTorch. To u
se it, please upgrade your optree package to >= 0.13.0
```
I also audited all non-test callsites of optree and torch.utils._cxx_pytree.
Follow along with me:
optree imports
- torch.utils._cxx_pytree. This is fine.
- [guarded by check] f76b7ef33c/torch/_dynamo/polyfills/pytree.py (L29-L31)
_cxx_pytree imports
- [guarded by check] torch.utils._pytree (changed in this PR)
- [guarded by check] torch/_dynamo/polyfills/pytree.py (changed in this PR)
- [guarded by try-catch] f76b7ef33c/torch/distributed/_functional_collectives.py (L17)
- [guarded by try-catch] f76b7ef33c/torch/distributed/tensor/_op_schema.py (L15)
- [guarded by try-catch] f76b7ef33c/torch/distributed/tensor/_dispatch.py (L35)
- [guarded by try-catch] f76b7ef33c/torch/_dynamo/variables/user_defined.py (L94)
- [guarded by try-catch] f76b7ef33c/torch/distributed/tensor/experimental/_func_map.py (L14)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151257
Approved by: https://github.com/malfet, https://github.com/XuehaiPan
sdp on xpu will fallback to math path in some cases (i.e. training). In dynamo benchmark, we prefer to use fp16 for better performance. Although `allow_fp16_bf16_reduction_math_sdp` is under backends.cuda, its implementation is for all device.
I didn't add if device == xpu here, I suppose cuda devices will not run into math path anyway
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150996
Approved by: https://github.com/drisspg, https://github.com/EikanWang
Goal is to have a way to compare if a change make it better or worse.
```
Average edge over aten (max(-edge, 0), higher is better):
triton: 8.596507086950552 (from 6 valid values)
triton_persistent_tma: 9.517193693923307 (from 6 valid values)
cutlass_lvl_default: 3.3234737908691785 (from 6 valid values)
cutlass_lvl_1111: 7.088173348313991 (from 6 valid values)
cutlass_lvl_2222: 7.291869722320318 (from 6 valid values)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150639
Approved by: https://github.com/ColinPeppler
This PR is to enable FR for all coalesce ops for fast path. (batch p2p is enabled in the current script, so we will mainly focus on non-P2P ops). To explain what is fast path, let's revisit how coalesced collective is working today:
For non-P2P coalesced ops, there are are several ways to call it (due to legendary reasons):
- Way one: Directly call python api like all_reduce_coalesced in python, this will be deprecated soon.
- Way two: Directly call api inside PGNCCL like allreduce_coalesced. The way case 1 will eventually call into this. This is not deprecated and will not be deprecated, IIUC.
- Way three: Using _coalescing_manager in python, like:
```
with _coalescing_manager():
for i in range(num_colls):
dist.all_reduce(tensors[i])
```
This way has two path:
- Fast path: when users call all-reduce, all-gather-into-tensor or reduce-scatter, we will only launch one big collective by calling the api from case 1.
- Slow path: we call startCoalescing() in the beginning and then a bunch of collectives (each one will generate a FR entry) and then endCoalescing(). Inside startCoalescing(), groupStart() is called and inside endCoalescing(), groupEnd() is then called. So although this is going to be one collective, we call into PGNCCL for each collective coalesced in the slow path case.
- For uneven all-gather (allgather_v) and reduce-scatter, it follows the pattern mention in slow path. It directly call cpp api inside PGNCCL.
This PR addressed the fast path because this is just an easy case, we store the collectives info on the python side, and we will only call into PGNCCL once so there will only be one work and one FR entry. We can just treat them as regular coalesced collective.
We add some e2e unit test for build_db function so that the change to FR is more thoroughly tested.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151243
Approved by: https://github.com/d4l3k, https://github.com/wz337
Fixes#151216, #151215
Previously I forgot to revert the timeout after setting it for the timeout test.
To prevent this in the future I split the test into 3 different tests so timeout testing is isolated.
Test plan:
Stress tested
```
pytest test/distributed/test_store.py -k queue -v -s --minutes 10
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151252
Approved by: https://github.com/XilunWu
By using Metal `as_type` which according to documentation does exactly
that:
> Metal adds an as_type<type-id> operator to allow any scalar or vector data type (that is not
a pointer) to be reinterpreted as another scalar or vector data type of the same size. The bits in
the operand are returned directly without modification as the new type. The usual type
promotion for function arguments is not performed.
Using `reinterpret_cast` created a potential silent correctness error when dtypes of different sizes were bitcast to each other
Add expicit cast to src_type to avoid errors due to type promotion (i.e.
soemthing like (x+1).view(dtype=torch.float16) would work correctly in
eager mode for int16 dtype, but would fail in compile, as arithmetic
operations will promote int16 to int32
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151272
Approved by: https://github.com/dcci
ghstack dependencies: #151224, #151246
This PR adds standalone_compile API that does precompilation via caching to support vLLM use case in the short term while we work on the longer term precompilation solution.
```
standalone_compile(gm, example_inputs, options) -> CompiledArtifact
CompiledArtifact.save(path, format: binary|unpacked = binary)
CompiledArtifact.load(path, format: binary|unpacked = binary)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150670
Approved by: https://github.com/jamesjwu, https://github.com/zou3519
To avoid accuracy issues when small reductions are unrolled, cast half to float during the `load` op
As `op_math_t<half>` is indeed float
This fixes `test_unroll_small_reduction` for reduced precision types
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151246
Approved by: https://github.com/dcci
ghstack dependencies: #151224
This PR implements _allgather_base, reduce_scatter, and _reduce_scatter_base in the MPI backend (ProcessGroupMPI), enabling support for Fully Sharded Data Parallel (FSDP) in environments that use MPI for distributed communication.
### Context
As noted in https://github.com/pytorch/pytorch/issues/85628, FSDP currently supports only the NCCL backend. Due to this limitation, FSDP cannot run on legacy HPC environments or clusters that rely on MPI.
By implementing just these three collective operations, we can enable FSDP to work with the MPI backend. These collectives are implemented in a similar manner to existing operations such as allgather.
### Testing
We validated this PR using pytorch/build/bin/ProcessGroupMPITest with OpenMPI, and all tests passed successfully.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150162
Approved by: https://github.com/H-Huang
Added a context manager, `torch._library.fake_profile.register_fake_profile(op_profiles)`, where given an operator profile, it will generate and register a fake impl for the operator based on the operator profile.
The input to `register_fake_profile` is a dictionary mapping operator name to a set of profiles which describe the input and outputs of the operator. Here's an example of a profile for `mylib.foo.default`:
```
"mylib.foo.default": {
OpProfile(
args_profile=(
TensorMetadata(rank=2, dtype=torch.float32, device=torch.device("cpu"), layout=torch.strided,),
TensorMetadata(rank=2, dtype=torch.float32, device=torch.device("cpu"), layout=torch.strided,),
),
out_profile=TensorMetadata(rank=2, dtype=torch.float32, device=torch.device("cpu"), layout=torch.strided,),
)
}
```
`foo`'s profile contains only one profile, which says that for 2 input tensors of rank 2, dtype float32, device cpu, we will return one tensor of rank 2, dtype float32, and device cpu.
This will then generate a fake kernel where given 2 input tensors of rank 2 (and the other tensor metadata), we will output one tensor of rank 2 (and the other tensor metadata). If the operator also supports other input ranks, then we can add to the profile for the fake impl to support more input types.
This profile can either be manually written or created by draft-export, and then checked into the codebase.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150807
Approved by: https://github.com/zou3519
ghstack dependencies: #150806
- Added a test to guard bfloat16. The optimizer incorrectly turns bfloat16 initializers into uint16, but this is not relevant to export logic.
- Fix bfloat16 support in onnx_program callable
Tested with the following with cuda
```py
import torch
class BfloatModel(torch.nn.Module):
def __init__(self):
super().__init__()
self.param = torch.nn.Parameter(torch.tensor(2.0, dtype=torch.bfloat16))
def forward(self, x):
return x * torch.tensor(1.0, dtype=torch.bfloat16) * self.param
input = torch.randn(1, 10, dtype=torch.bfloat16)
model = BfloatModel()
onnx_program = torch.onnx.export(model, (input,), dynamo=True, optimize=False, verify=True)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151121
Approved by: https://github.com/titaiwangms
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Fixes#147846. Previously there is no error out under out variant of`tensordot` while `requires_grad=True`. This can cause potential issue when out tensor is part of a computation graph.
Enforces the out variant of tensordot to run without setting `requries_grad=True`. Change same to #117067
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150270
Approved by: https://github.com/soulitzer
Summary:
There are a number of places in the code checking for the existence of `_boxed_call` instead of checking for a `True` value. This is somewhat dangerous because one would assume that setting it to `None` or `False` would be the same as not setting it (output_code.py does this, for example).
Change `hasattr()` to `getattr(..., False)` for these cases.
Test Plan: unit tests pass
Differential Revision: D72806693
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151130
Approved by: https://github.com/Skylion007
This has been pretty helpful for the size-oblivious rewrite. Wanted the variadic args version to avoid `sym_or(a, sym_or(b, sym_or(c, d)))` in favor of `sym_or(a, b, c, d)`. Happy to change this to ban the 1-arg version.
This is better than plain and/or because the whole symbolic expression gets preserved, and if we guard on it or defer as a runtime assert, we preserve all branches.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150456
Approved by: https://github.com/laithsakka
By adding `pass` in front of the comment for fake set_device call
Which fixes `TestGPU.test_zero_element_mutation_mps`, which previously
failed with
```
torch._inductor.exc.InductorError: RuntimeError: Failed to import /var/folders/sc/2thx6_x95h7_h9qs8s48yh140000gn/T/tmp2emka_sx/7k/c7kmnwhb363ysalhewglr3cwtej6tiz3t4ppqa4bvhubaokmlprw.py
IndentationError: expected an indented block after 'with' statement on line 38 (c7kmnwhb363ysalhewglr3cwtej6tiz3t4ppqa4bvhubaokmlprw.py, line 40)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151224
Approved by: https://github.com/Skylion007, https://github.com/jansel, https://github.com/dcci
This PR adds standalone_compile API that does precompilation via caching to support vLLM use case in the short term while we work on the longer term precompilation solution.
```
standalone_compile(gm, example_inputs, options) -> CompiledArtifact
CompiledArtifact.save(path, format: binary|unpacked = binary)
CompiledArtifact.load(path, format: binary|unpacked = binary)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150670
Approved by: https://github.com/jamesjwu, https://github.com/zou3519
Motivation: for the following script:
```
// demo.py
import torch
import json
from transformers import BertModel, BertConfig
CONFIG = """
{
"architectures": [
"BertForMaskedLM"
],
"attention_probs_dropout_prob": 0.1,
"gradient_checkpointing": false,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"layer_norm_eps": 1e-12,
"max_position_embeddings": 512,
"model_type": "bert",
"num_attention_heads": 12,
"num_hidden_layers": 12,
"pad_token_id": 0,
"position_embedding_type": "absolute",
"transformers_version": "4.6.0.dev0",
"type_vocab_size": 2,
"use_cache": true,
"vocab_size": 30522
}
"""
config = json.loads(CONFIG)
bloom_config = BertConfig(**config)
model = BertModel(bloom_config).half().cuda()
torch.compiler.reset()
torch.cuda.empty_cache()
compiled_fn = torch.compile(model)
vocab_size = 30522
for b in range(1, 3):
for s in range(1, 10):
print(f"🚀 {b} {s}")
input_ids = torch.randint(0, vocab_size, (b, s)).cuda()
attention_mask = torch.ones(b, s).cuda()
with torch.no_grad():
out = compiled_fn(input_ids, attention_mask).last_hidden_state
```
when we run it with:
```
time TORCH_LOGS=recompiles python demo.py
```
We can see there are 7 recompilations and it takes 2 mins (fresh build) or 1 min (cached build) in my machine.
One root cause of the recompilations is, there are guards to check the alignments of the inputs (see the patch). So there are unexpected recompilations for `(1, 4)`, `(1, 8)`, `(2, 4)` and `(2, 8)` inputs.
In this patch, we always try to always pad the inputs if we don't know its shape at compilation to avoid the guards on alignment. It is fine to always pad the tensor. It won't change the semantics.
Now there are only 3 recompilations and it takes 1 min (fresh build) and 17s (cached build) in my machine.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150403
Approved by: https://github.com/drisspg
This change enables basic NestedTensor operations on HPU,
fixing the runtime error when creating a NestedTensor on HPU.
- Extended `NestedTensorImpl` to recognize `hpu` as a valid storage device.
- Added `NestedTensorHPU` to `DispatchKey` parsing in `DispatchKey.cpp`.
- Updated `torchgen/model.py` to include `NestedTensorHPU` in `dispatch_keys`.
- Modified `native_functions.yaml` to enable `NestedTensorHPU` support for various ops.
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148659
Approved by: https://github.com/jeromean, https://github.com/albanD, https://github.com/sujoysaraswati
This can save ~0.2ms on non cuda devices by skip calling `amp_definitely_not_available()`. It can improve small models in torchbench like lennard_jones on xpu 10% on both eager and inductor in dynamo benchmarks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151111
Approved by: https://github.com/soulitzer
As titled, this pr adds additional `forward_dtype` and `backward_dtype` conversion in DTensor `redistribute` API to enable SimpleFSDP's mixed precision training.
In this forward pass, the DTensor can be configured to be cast to `forward_dtype`; in the backward pass, the DTensor can be configured to be cast to `backward_dtype`.
1. **Correctness**: The end-to-end SimpleFSDP mixed precision training integration has been proved to work properly in the PR from this fork: https://github.com/tianyu-l/pytorch_intern24/pull/20. We are now migrating the code to official PyTorch DTensor.
2. **Example Usage**: There is an example in TorchTian's SimpleFSDP implementation: https://github.com/pytorch/torchtitan/pull/1060.
In the example below, a DTensor `x` is all-gather'ed along the `self.compute_placements`, with datatype cast to `self.param_dtype`. In the backward pass, additionally, the computed gradients are reduce-scatter'ed along the `self.grad_placements`, with datatype cast to `self.reduce_dtype`.
```python
output = x.redistribute(
placements=self.compute_placements,
forward_dtype=self.param_dtype,
backward_dtype=self.reduce_dtype,
).to_local(grad_placements=self.grad_placements)
```
Under the hood, in `class Redistribute(torch.autograd.Function):`, the `forward` function first takes `x`'s local tensor, convert it to `forward_dtype`, before all-gather `x`.
The `backward` function take `grad_output` and convert it to `backward_dtype`, before reduce-scatter `grad_output`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150740
Approved by: https://github.com/tianyu-l
Summary: CK doesn't support FP32 attention, but aotriton does. If we prefer CK, and the input dtype is FP32, we'll select mem efficient attention but CK doesn't support it. So we'll exclude mem eff attention and pick math.
Differential Revision: D72880985
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151132
Approved by: https://github.com/yoyoyocmu
I want to format and refactor the csrc file of pytorch_openreg. To make the code review clearer and easier to understand, I divide the code refactoring into two parts:
- Part 1: Code formatting
- Part 2: Code refactoring and optimization (Next PR)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151004
Approved by: https://github.com/albanD
ghstack dependencies: #151000
Summary:
as title
`export._trace._WrapperModule` is used to wrap functions into a Module so we can export the function.
We add `export._wrapper_utils` to `dynamo`'s `MOD_INLINELIST` so dynamo traces into `_WrapperModule`
Fixes https://github.com/pytorch/pytorch/issues/146867
Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test:test_export -- -r wrapper_module
```
Differential Revision: D69434316
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146919
Approved by: https://github.com/angelayi
Prior to this PR, `rng_state` is in `V.graph.graph_inputs` but not in read_writes of any IRNode. As a result, it is not identified as a partition inputs:
```python
def partition_0(args):
primals_2, primals_1 = args
...
buf0 = torch.ops.higher_order.graphsafe_run_with_rng_state(torch.ops.aten.rand.default, [4, 4], dtype=torch.float32, device=device(type='cuda', index=1), pin_memory=False, rng_state=fwd_rng_state_0)
# <----- access fwd_rng_state_0 but it's not an input
...
def call(self, args):
primals_1, primals_2, fwd_rng_state_0 = args
...
partition0_args = [primals_2, primals_1]
(buf2, primals_2, primals_1) = self.partitions[0](partition0_args)
# <---- fwd_rng_state_0 is graph_inputs but is not passed to partitions[0]
...
```
This PR fixes this issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150958
Approved by: https://github.com/eellison
Added a flag, `allow_override`, to allow overriding existing kernel implementations in `torch.library.register_fake` `library.impl`. The default is false, where if a user tries to register a kernel to a dispatch key that already contains a kernel, it will error. This flag doesn't apply to CustomOpDefs, where overriding a fake kernel is already allowed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150806
Approved by: https://github.com/zou3519
Preparatory refactor for https://github.com/pytorch/pytorch/pull/146942.
# Feature
This PR refactors the existing wrapper codegen into `WrapperLine` subclasses, extending the existing Memory Planning IR into a fully-fledged Wrapper IR. See the diagram below.

The IR currently supports the following ops:
- All existing memory planning IR ops (`AllocateLine`, `FreeIfNotReusedLine`, etc.)
- Reinterpret views (`ReinterpretLine`)
- Kernel definitions (`KernelDefinitionLine`)
- Calls to defined kernels (`KernelCallLine`)
- Calls to extern kernels (`ExternKernelLine`, `ExternKernelAllocLine`)
- Ops with multiple outputs (`MultiOutputLine`)
- Tensor cleanup at the end of a graph (`FreeLine`)
- Leaving comments in code (`CommentLine`)
There are two main motivations for this refactor:
1. Unlike free-form C++ and and Python code, Wrapper IR lines provide structured information about what the wrapper code does. This serves as a natural extension point for other types of wrapper codegen. For example, the parent PR generates FX IR from Wrapper IR. Wrapper IR aims to give new backends enough information to generate wrapper code without needing to modify core Inductor files such as `ir.py`.
2. This design will hopefully promote stronger modularity and encapsulation.
a. Inductor's core compilation passes don't need to worry about whether they're targeting Python, C++, FX or anything else. They can simply focus on generating Wrapper IR, and target-specific code can be refactored into the various backends.
b. Backends do not need to know about all the details and internal state of `V.graph` IR. For example, they don't need to consider whether a buffer has been removed from the graph when generating code. Wrapper IR will hopefully provide a simpler interface for generating wrapper code, which abstracts away the details of device code.
# Implementation details
The implementation mainly consists of separating direct C++/Python codegen into two phases:
1. Emit Wrapper IR lines describing what the wrapper code is supposed to do.
2. Inside the `codegen()` method of each `WrapperLine`, call backend methods which generate pure Python/C++ code using the information stored in the Wrapper IR line. For example, `KernelCallLine` calls `wrapper._generate_kernel_call_helper`, which is overriden by the various Python and C++ backends to generate the final wrapper code.
The main difficulty in implementing this is that we need to be careful that code is generated in the correct order. Wrapper codegen happens in two passes: first we write code into `self.lines` which mainly contains wrapper IR, but can also contain raw Python or C++ lines in some situations. Then, we convert the wrapper IR into the final Python/C++ code in `self.wrapper_call`. Since the same macros may be used in both passes, it's difficult to ensure that code is written to the correct buffer. The easiest solution for this was to implement a context manager overriding the `writeline` method to write to `self.wrapper_call` after memory planning is finished. This way, `writeline` writes to `self.lines` in the first pass, and `self.wrapper_call` in the second. This obviated the need to pass `code` or `writeline` variables all the way through the call stack, which would have touched most of the existing macros.
# Test plan
Since this refactor touches all the existing wrapper codegen classes, the existing CI provides good coverage.
The parent PR introduces new tests for the FX IR backend. Among other things, these tests assert that `self.lines` only contains Wrapper IR lines, and no free-form code. While this would not be true of all programs today, the tests suggests that the IR implemented in this PR is sufficient to cover basic PyTorch usage.
# Future directions
These two goals are only partially realized by this PR. These are several important steps which still undergo direct Python/C++ codegen in core files:
- User-defined Triton kernels.
- Reinterpret views on outputs, from `gen_output_refs()`. (In the parent PR, the FX converter has a custom way of handling this. This can eventually be ported into Wrapper IR.)
- Fallback ops with custom `codegen()` methods, e.g. `ScatterFallback`.
- Misc. C++ lines emitted by the various cpp backends, e.g. declaring constants.
These cases will gradually be handled in subsequent PRs, as the Inductor->FX converter expands its coverage. Given that these refactors are pretty tricky to do, it seems wiser to execute them in stages, as opposed to porting everything to Wrapper IR at once.Some Python and codegen still lives in core files such as `ir.py`, as described in previous sections. Hopefully, this PR will serve as a starting point which moves the codebase towards a more modular design. Over time, we can gradually refactor the remaining codegen (mainly in `ir.py`) into backend classes.
One limitation of this PR is that codegen still happens in two phases during `PythonWrapperCodegen`. First, we generate Wrapper IR into `self.lines`, and from there we generate Python or C++ code into `self.wrapper_call`, `self.header`, etc. In the long term, it would be cleaner to split wrapper IR into its own class which doesn't deal with Python/C++ codegen at all. (See the diagram at the top.) That would strictly enforce the boundary between Wrapper IR and Python/C++ wrapper code. However, this would probably be a much larger refactor.
Another limitation of the current code is that the helper functions have a lot of call args. It's also possible to clean this up by passing Wrapper IR ops e.g. `KernelCallLine` into helper functions like `_generate_kernel_call_helper`, since they store all the arguments. However, that change would likely be prone to merge conflicts, so I would like to save it for follow-up PRs if possible.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150458
Approved by: https://github.com/eellison
`compute_local_shape_and_global_offset` util computes the local shape of
a particular shard of a DTensor, and the global offset (which describes
how the shard fits into the global tensor).
When the tensor dim does not evenly divide into the mesh dim, uneven
sharding occurs. In some cases, uneven sharding results in an empty
shard.
e.g.
tensor dim size: 4096
mesh dim size: 30
ranks 0..27 have local size 18
rank 28 has local size 8
rank 29 has local size 0 <--- empty shard
The global offset for an empty shard was previously undefined and
returned values that were computed based on logic that assumes no empty
shards. This caused DCP to fail to save a checkpoint, becuase
deduplication logic could 'throw away' real (non-empty) shards thinking
they were duplicates of zero-sized shards with the same offset.
Now, we define the global offset of an empty shard to be the dim-size,
which is out of bounds of the tensor and can't overlap with any
non-empty shards.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150862
Approved by: https://github.com/teja-rao, https://github.com/XilunWu
[dynamo] Deprecate enable_cpp_framelocals_guard_eval config variable - default: True
Reading the feature enabling param `enable_cpp_framelocals_guard_eval `at the CPP level is time consuming and slows down the operation of the dynamo as it is done every time the function using this param is called. Reading the value only once at init isn’t an option as it would disable the modification of this param at the runtime. Since this feature is enabled by default for some time and it doesn’t cause known issues, the `enable_cpp_framelocals_guard_eval `configuration param will be deprecated by this commit and its value is hardcoded to true.
Local microbenchmark dynamo_guard_eval.py:
- 931.9 us -> 538.9 us (3.10)
@williamwen42 @jansel @anijain2305
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151008
Approved by: https://github.com/williamwen42
Summary:
When creating an `OpaqueTensorImpl`, currently there's only an option to create it for a non-view tensor, but it can be useful to create one for view tensors as well.
View tensors should contain the same autograd parameters as the original tensor, whereas non-view tensors get created with whatever `inference_mode` option is currently enabled. For this reason, `TensorImpl` has a special view constructor that takes `TensorImpl::ImplType` as its first parameter, so adding a new constructor to `OpaqueTensorImpl` that does the same thing allows us to create views with it.
Test Plan: CI
Reviewed By: scottxu0730
Differential Revision: D71748460
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151028
Approved by: https://github.com/scottxu0730, https://github.com/chaos5958
This adds queue operations as described in https://github.com/pytorch/pytorch/issues/150943.
This works by adding two new operations `queue_push` and `queue_pop`. The semantics are designed to be blocking with a timeout. Pushing will always succeed as the queue is infinite size. Popping will first call `wait` until the key is ready and then pop the value from the queue.
This implements queues for only: HashStore, TCPStore w/ libuv. FileStore and the legacy backends are not supported.
`wait` and `check` work for queue operations though queue_push will only wake up the first waiter rather than all of them.
This also has a few cleanups to error types/documentation in related code.
Example trace:
```
[I409 16:51:43.963833529 TCPStoreLibUvBackend.cpp:829] [c10d - trace] validate magic:1015412686 address:[localhost]:55816
[I409 16:51:43.963845838 TCPStoreLibUvBackend.cpp:842] [c10d - trace] ping nonce:2840795 address:[localhost]:55816
[I409 16:51:43.963902914 TCPStoreLibUvBackend.cpp:911] [c10d - trace] add key:init/ val:1 address:[localhost]:55816
[I409 16:51:43.963939389 TCPStoreLibUvBackend.cpp:977] [c10d - trace] wait key_count:1 keys[0]:init/ address:[localhost]:55816
[I409 16:51:43.963974842 TCPStoreLibUvBackend.cpp:893] [c10d - trace] get key:init/ address:[localhost]:55816
[I409 16:51:43.964071909 TCPStoreLibUvBackend.cpp:1121] [c10d - trace] queue_push key:/test_prefix/test_queue_support address:[localhost]:55816
[I409 16:51:43.964080221 TCPStoreLibUvBackend.cpp:940] [c10d - trace] check key_count:1 keys[0]:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964108584 TCPStoreLibUvBackend.cpp:1121] [c10d - trace] queue_push key:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964123207 TCPStoreLibUvBackend.cpp:1121] [c10d - trace] queue_push key:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964128194 TCPStoreLibUvBackend.cpp:940] [c10d - trace] check key_count:1 keys[0]:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964156347 TCPStoreLibUvBackend.cpp:977] [c10d - trace] wait key_count:1 keys[0]:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964187493 TCPStoreLibUvBackend.cpp:977] [c10d - trace] wait key_count:1 keys[0]:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964217709 TCPStoreLibUvBackend.cpp:1133] [c10d - trace] queue_pop key:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964324300 TCPStoreLibUvBackend.cpp:977] [c10d - trace] wait key_count:1 keys[0]:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964354495 TCPStoreLibUvBackend.cpp:1133] [c10d - trace] queue_pop key:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964416299 TCPStoreLibUvBackend.cpp:940] [c10d - trace] check key_count:1 keys[0]:/test_prefix/foo address:[localhost]:55816
[I409 16:51:43.964458733 TCPStoreLibUvBackend.cpp:977] [c10d - trace] wait key_count:1 keys[0]:/test_prefix/non_existant address:[localhost]:55816
[W409 16:51:43.974516585 socket.cpp:460] [c10d] waitForInput: poll for socket SocketImpl(fd=75, addr=[localhost]:55816, remote=[localhost]:46641) returned 0, likely a timeout
[W409 16:51:43.974559169 socket.cpp:485] [c10d] waitForInput: socket SocketImpl(fd=75, addr=[localhost]:55816, remote=[localhost]:46641) timed out after 10ms
[I409 16:51:43.974600451 TCPStoreLibUvBackend.cpp:1101] [c10d - trace] cancel_wait address:[localhost]:55816
```
Test plan:
```
$ pytest test/distributed/test_store.py -k queue -v -s
test/distributed/test_store.py::FileStoreTest::test_queues SKIPPED [0.4351s] (Store does not support queues)
test/distributed/test_store.py::HashStoreTest::test_queues PASSED [0.0009s]
test/distributed/test_store.py::PrefixFileStoreTest::test_queues SKIPPED [0.0006s] (Store does not support queues)
test/distributed/test_store.py::TCPStoreTest::test_queues SKIPPED [0.0012s] (Store does not support queues)
test/distributed/test_store.py::LibUvTCPStoreTest::test_queues PASSED [0.0014s]
test/distributed/test_store.py::PrefixTCPStoreTest::test_queues PASSED [0.0014s]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150969
Approved by: https://github.com/XilunWu, https://github.com/fduwjj
Add the option for providing a Subgraph as an autotuning choice in Inductor. This is crucial for implementing the split-k optimization for GEMMs by decomposing a mm -> bmm. https://github.com/pytorch/pytorch/pull/150654 uses these changes to add decomposeK as a default autotuning choice for aten.mm in Inductor.
Using https://github.com/pytorch/pytorch/pull/150654 and a simple script:
```
import torch
def f(a, b):
return torch.matmul(a, b)
def decompose_func(a_in, b_in):
M, K = a_in.shape
K, N = b_in.shape
# TODO: Ideally we want to autotune over this parameter
kPartitions = 256
assert K % kPartitions == 0, "K must be divisible by Kmini"
B = K // kPartitions
a_reshaped = a_in.reshape(M, B, kPartitions).transpose(
0, 1
) # Shape: (B, M, kPartitions)
b_reshaped = b_in.reshape(B, kPartitions, N) # Shape: (B, kPartitions, N)
result = torch.bmm(a_reshaped, b_reshaped) # Shape: (B, M, N)
return result.sum(dim=0).to(torch.float16) # Sum over B dimension, Shape: (M, N)
for k in [4096, 8192, 12288, 16384, 20480, 24576, 28672, 32768]:
a = torch.randn(32, k, dtype=torch.float16, device="cuda", requires_grad=True)
b = torch.randn(k, 32, dtype=torch.float16, device="cuda", requires_grad=True)
compiled_res = torch.compile(f, dynamic=False)(a, b)
decompose_res = decompose_func(a, b)
print(f"Compiled mm result close to aten: {torch.allclose(f(a, b), compiled_res, atol=1e-5, rtol=0.5)}")
print(f"Compiled mm result close to decompose: {torch.allclose(decompose_res, compiled_res, atol=1e-5, rtol=0.5)}")
```
we are able to autotune the decomposeK optimization to aten and the traditional Triton templates in Inductor. DecomposeK is faster than aten by about ~10% on average and > 4x speedup over the best Triton templates on an H100 machine, e.g.:
```
AUTOTUNE mm(32x28672, 28672x32)
decompose_k_mm 0.0126 ms 100.0%
mm 0.0144 ms 87.5%
triton_mm_69 0.0579 ms 21.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=4
triton_mm_75 0.0677 ms 18.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, num_stages=4, num_warps=4
triton_mm_76 0.0850 ms 14.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=4
triton_mm_68 0.1444 ms 8.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=4
triton_mm_72 0.1546 ms 8.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, num_stages=3, num_warps=4
triton_mm_74 0.1819 ms 6.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, num_stages=4, num_warps=4
triton_mm_67 0.1917 ms 6.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, num_stages=2, num_warps=4
triton_mm_73 0.2766 ms 4.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, num_stages=3, num_warps=4
```
https://pastebin.com/g3FMaauT is the generated code from Inductor containing the subgraph decomposition for aten.mm.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150653
Approved by: https://github.com/eellison
This is the final PR, where everything comes together.
The problem I'm trying to solve is the following: when we register a MemPool with the NCCL ProcessGroup, it calls `ncclCommRegister` on all the allocations that are _currently_ in the pool. However, any later allocation will _not_ be registered with the NCCL communicator!
This is terribly inconvenient, because it means that every piece of code that allocates a tensor must be changed to become aware of whether it's doing so within a private pool, and it must become aware of NCCL and of all the PGs in existence, in order to re-register that pool with them.
Moreover, I believe there can be performance implications because allocating tensors is usually done in the critical path (i.e., during the forward and backward of every step of a training), whereas registering memory is a slow operation that should be done once at init time.
With this PR, once the user registers a Mempool with the NCCL PG, we install some hooks into the CachingAllocator in order to listen for all future memory allocations and, if they belong to the pool, we automatically call `ncclCommRegister` on them! (In fact, we reuse the hooks that already exist for `TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK`).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150684
Approved by: https://github.com/kwen2501
ghstack dependencies: #150683
In the NCCL ProcessGroup we want to support being able to "register" with NCCL all the allocations that belong to a certain private MemPool. In order to do so on-the-fly for every new allocation, we register a hook for the CachingAllocator's TraceEvents. However, we were lacking a way to know whether a given TraceEvent belonged to the MemPool that we cared about or not. With this PR, we add a MempoolId_t field to the TraceEvents.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150683
Approved by: https://github.com/syed-ahmed, https://github.com/kwen2501
Differential Revision: [D72760205](https://our.internmc.facebook.com/intern/diff/D72760205/)
We hardcoded to only use GEMM anyway.
This also raises the problem with high instantiation level. As the instantiation level goes higher (here it is 3333), the time it takes to list the configs might be long already (here it is >3 minutes).
If we know exactly what configs we care, we should have a way to generate them without calling generators. But let's see if we need that.
using this script
```
import os
os.environ["TORCH_LOGS"] = "inductor"
import torch
import torch._inductor.config
torch._inductor.config.max_autotune = True
torch._inductor.config.force_disable_caches = True
torch._inductor.config.max_autotune_gemm_backends = "Aten,CUTLASS"
# intentionally use no cutlass ops
torch._inductor.config.cuda.cutlass_max_profiling_configs = 0
torch._inductor.config.cuda.cutlass_instantiation_level = "3333"
def main():
M = 128
dtype = torch.float16
A = torch.randn(M, M, device="cuda", dtype=dtype)
B = torch.randn(M, M, device="cuda", dtype=dtype)
compiled_model = torch.compile(torch.mm)
_ = compiled_model(A, B)
print("done")
if __name__ == "__main__":
main()
```
before, with logs:
```
CUTLASS library generated 7 operations in 235.03 seconds
Got cutlass configs: total number of ops: 4753. Filtering took 10.51 seconds
```
after:
```
CUTLASS library generated 1 operations in 207.39 seconds
Got cutlass configs: total number of ops: 4753. Filtering took 9.53 seconds
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150973
Approved by: https://github.com/ColinPeppler
These two events are really common, and also make up a huge portion of logs (~70%) we get internally in PT2 Compile Events. I don't think it's actually that useful to aggregate them, so instead of logging them to PT2 Compile Events, lets just only log them to chromium.
These two events will still be visible from tlparse: they just won't be in our internal tables. Please let me know if folks disagree.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151053
Approved by: https://github.com/oulgen, https://github.com/masnesral
### Compilation error
The issue is that u0 (an unbacked symint) can come from a smaller int dtype e.g. int16, int32.
```
error: no matching function for call to ‘min(int64_t&, short int&)’
759 | call_add_kernel_with_scaling_0(... std::min(100L, s97, u0) ...);
```
### Diff
The fix is to explicitly specify `int64_t` in the std::min template.
```
int64_t s97 = arg0_1_size[0];
int16_t u0_raw; # not a long
auto u0 = u0_raw;
# Before
std::min({100L, s97, u0})
# After
std::min<int64_t>({100L, s97, u0})
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150894
Approved by: https://github.com/desertfire
This PR is the duplicated one for https://github.com/pytorch/pytorch/pull/139975.
This PR is to add torch._scaled_mm for CPU backend.
_scaled_mm_out_cpu and _scaled_mm_cpu are new added and included in torch._scaled_mm CPU dispatch. We also add _scaled_mm_out_cpu_emulated as a fallback function if the current platform cannot run FP8 matmul using oneDNN. And this PR also updates the various UTs related to FP8 to support CPU tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150410
Approved by: https://github.com/atalman
If input is channels last than MPS will return a channels last output
This fixed `GPUTests.test_convolution_4_mps` from test_torchinductor.py
That previous failed with
```
AssertionError: expected size 3==3, stride 1==192 at dim=1; expected size 12==12, stride 48==16 at dim=2; expected size 16==16, stride 3==1 at dim=3
```
As FakeTensor implementation of conv returned `Contiguous`, rather than `ChannelLast` layout on MacOS-15 or later.
This doesn't seem to be very well documented, so will try to document the call path for `ExternKernel` invocation for `aten::convolution`:
- First inductor decomp defined here is called
c93e4b8290/torch/_inductor/kernel/conv.py (L424-L425)
- Then it goes thru FakeTensor decomposition implemented here
320914f1b6/torch/_subclasses/fake_impls.py (L739-L740)
- Finally it goes down to convolution meta registrations implemented here
320914f1b6/torch/_meta_registrations.py (L2416-L2417)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151042
Approved by: https://github.com/dcci
Summary:
- Flip default value of `strict` argument from True to False on torch.export.export_for_training API
- All callsites have been updated to provide this argument explicitly to avoid behavior change.
- If you see any breakages, that means you may have a new callsite that is missed, please set `strict=True` explicitly to the callsite to mitigage.
Test Plan: CI
Differential Revision: D72724975
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150941
Approved by: https://github.com/ydwu4
Summary:
The reference quantized modules for linear / conv / etc fail to torchscript due to two issues
(1) The type of torch.qscheme doesn't script
(2) The "_DTYPE_TO_QVALUE_BOUNDS" values were resolving to union[float, int] instead of just int. We fix that with a hard cast.
See: <internal post> + comments for more context
Test Plan: unit tests + fixing this NB N6923590
Differential Revision: D72652616
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150870
Approved by: https://github.com/jerryzh168
While fixing the memory leak in https://github.com/pytorch/pytorch/pull/145757, we accidentally close the socket for the case when nread == 0 and thought it is the case when connection is closed. This is not true. According to libuv doc: https://docs.libuv.org/en/v1.x/stream.html#c.uv_read_cb.
> nread might be 0, which does not indicate an error or EOF. This is equivalent to EAGAIN or EWOULDBLOCK under read(2).
We found this bug when debugging a broken pipe issue when users first call a set and then wait for all keys right afterwards on 128 ranks. This might also cause other broken pipe issues we have seen in the prod jobs recently.
Added a unit test to test this case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150987
Approved by: https://github.com/d4l3k, https://github.com/XilunWu
## Summary of changes
1. Change assertion to a warning, when no all gather or reduce scatter patterns are found, and remove the corresponding unit test. It seems some valid TP graphs may not have any pattern matches, from what I can see.
2. Fix wrong variable name being referenced (`A_with_scatter_dim_0` instead of just `A`)
3. Simplify reshaping to target output shape (don't need to recalculate output shape)
4. When "A" tensor is 2D, so we are doing doing a 2D x 2D scaled mm, we need to fix our handling of the case where the scatter dim is 0. When scatter dim is 0 for the 2D scaled mm output shape, this is actually dim 1 in the unreduced stacked partial scaled mm outputs, which has a (logical) shape of `(group_size, M//group_size, N)`. To summarize:
- Unreduced stacked partials are of shape `(M, N)`
- We view as `(group size, M//group_size, N)` and reduce along the scatter dim (`group_size` / dim 0).
- Reduced output (`reduced_out`) has shape (M//group_size, N)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150935
Approved by: https://github.com/lw
This modification is to support XPU kernels for depthwise_conv2d and depthwise_conv3d.
Currently, when running depthwise_conv on XPU devices, it is calculated with Mkldnn via the ConvBackend::Overrideable path.
After this modification, depthwise_conv will be calculated directly using XpuDepthwise3d when the Mkldnn backend is disabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149114
Approved by: https://github.com/guangyey, https://github.com/albanD
Credit to @mgmtea who wrote the initial version of this PR: https://github.com/pytorch/pytorch/pull/146604
Context: CUPTI is the NVIDIA library that Kineto uses for collecting GPU-side info during profiling. The intended usage is to register a callback while you want profiling to occur, and then unregister the callback when you want profiling to stop. But a bug would cause crashes if CUPTI callbacks were de-registered when used with cudagraphs. The workaround was to disable "CUPTI_LAZY_REINIT" and "CUPTI_TEARDOWN" in Kineto - which prevents crashes, but can result in slower execution after profiling has occurred and completed.
This bug is believed to be fixed in CUDA >= 12.6, so this PR qualifies that DISABLE_CUPTI_LAZY_REINIT=1 and CUPTI_TEARDOWN=0 should only be applied if CUDA >= 12.6. Additionally, `profiler_allow_cudagraph_cupti_lazy_reinit_cuda12()` is added as an escape hatch so that we can add a killswitch in case we see more crashes related to this.
Differential Revision: [D72745929](https://our.internmc.facebook.com/intern/diff/D72745929)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150957
Approved by: https://github.com/aaronenyeshi, https://github.com/Skylion007
This PR remove the usage of guard_size_oblivious in vector_norm by inlining it in the runtime check,
this prevent any data dependent error from ever appearing here at the locations where guard_size_oblivious
used to exist. Before this PR it used to break potentially. This is NOT BC breaking or changing of semantics from eager.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148809
Approved by: https://github.com/bobrenjc93
# Root cause
The barrier timeout set to 0.1 is too short, some threads may not have enough time to reach the barrier.
# How to reproduce
Adding some sleep will be easy to reproduce.
```python
def test_barrier_timeout_rank_tracing(self):
N = 3
store = dist.HashStore()
def run_barrier_for_rank(i: int):
if i != 0:
import time;time.sleep(1) # Let some thread sleep for a while
try:
store_util.barrier(
store,
N,
key_prefix="test/store",
barrier_timeout=0.1,
rank=i,
rank_tracing_decoder=lambda x: f"Rank {x} host",
trace_timeout=0.01,
)
except Exception as e:
return str(e)
return ""
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150768
Approved by: https://github.com/d4l3k
PR https://github.com/pytorch/pytorch/pull/149665 did a change to the optimized_add that is causing an issue internally.
In general make_optimized should be only be called with valid new_args, new_args can become None
when elements already exists also, we should break out of the loop in that case.
Note that I also only maintained the optimized summation when both lhs and rhs lengths are <=2.
This is ok because the optimization is based on the inductive property of adding one symbol at a time.
the [2]+[2] here is serving as base case ( i feel we can also remove it ) .
Note that keeping it for all sizes while correct, I am not sure if tis as efficient (we will do N log(n) insertions).
there is no current justification for that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150955
Approved by: https://github.com/Mingming-Ding, https://github.com/atalman, https://github.com/bobrenjc93
# Motivation
Adapt `torch.accelerator.device_count` for multi-process usage. For example, `torch.cuda.device_count` avoids poisoning fork, then `torch.accelerator.device_count` should meet the same requirement.
Now that `torch.get_device_module(device).device_count` supports this, `torch.accelerator.device_count` should align with this behavior as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149924
Approved by: https://github.com/albanD
ghstack dependencies: #147507
This adds a new `clone()` method to Store which will return a new Store instance that can be used from a different thread.
This is intended to better support multiple threads with stores such as when ProcessGroupNCCL needs a store to do error propagation.
Related issue: https://github.com/pytorch/pytorch/issues/150943
Test plan:
```
pytest test/distributed/test_store.py -k PythonStore
pytest test/distributed/test_store.py -k clone
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150966
Approved by: https://github.com/fduwjj
Summary:
- We are saying the minimum version of pytree that PyTorch can use is
0.13.0
- If a user imports torch.utils._cxx_pytree, it will raise an
ImportError if optree doesn't exist or exists and is less than the
minimum version.
Fixes https://github.com/pytorch/pytorch/issues/150889. There are
actually two parts to that issue:
1. dtensor imports torch.utils._cxx_pytree, but the optree installed in
the environment might be too old. Instead, raising ImportError in
torch.utils._cxx_pytree solves the issue.
2. We emit an "optree too low version" warning. I've deleted the
warning in favor of the more explicit ImportError.
Test Plan:
- code reading
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150956
Approved by: https://github.com/albanD, https://github.com/atalman, https://github.com/XuehaiPan
If you try to use torch in c++ using modules then it will not compile due to static function not being supported in MSVC when using modules https://developercommunity.visualstudio.com/t/10323558.
It's also aligned with [C++20 standard](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/n4849.pdf) (ISO/IEC 14882:2020) 10.2.7 Export declaration [module.interface]: "Exported names have either external linkage or no linkage".
Fixes https://github.com/pytorch/pytorch/issues/71309
Tested using the following code.
```c++
export module testModule;
import <torch/torch.h>;
import <memory>;
import <string>;
import <tuple>;
import <iostream>;
export namespace testModule
{
export void test()
{
torch::Tensor tensor1 = torch::rand({ 2, 3 });
torch::Tensor tensor2 = torch::rand({ 3, 2 });
// Perform tensor multiplication
torch::Tensor result = torch::matmul(tensor1, tensor2);
// Print the tensors
std::cout << "Tensor 1: " << tensor1 << std::endl;
std::cout << "Tensor 2: " << tensor2 << std::endl;
std::cout << "Result of multiplication: " << result << std::endl;
}
}
```
```c++
import testModule;
int main()
{
testModule::test();
return 0;
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148675
Approved by: https://github.com/albanD, https://github.com/malfet
Co-authored-by: mantaionut <ionut@janeasystems.com>
Summary:
We add the functionality to allow users to directly pass in a at::Tensor
into AOTInductor, that would be used as the constant.
This user managed buffer skips the copying step in AOTInductor, and let
users to directly manage the memory usage themselve.
Test Plan:
LD_LIBRARY_PATH=/data/users/$USER/pytorch/build/lib
/data/users/$USER/pytorch/build/bin/test_aoti_inference
Reviewers:
Subscribers:
Tasks:
Tags:
Differential Revision: [D72589514](https://our.internmc.facebook.com/intern/diff/D72589514)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150276
Approved by: https://github.com/chenyang78, https://github.com/desertfire
Summary: We need real_tensor on the FakeTensor in node.meta["val"] in order to aot_compile the draft exported programs. Otherwise, we cannot propagate real tensors even when fake_mode.propagate_real_tensors = True.
This also fixes real tensor propagation in `run_decomposition()`.
Test Plan:
```
buck2 run @mode/dev-nosan caffe2/test:test_export -- -r test_dedup_data_dependent_failure
```
Differential Revision: D72732714
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150948
Approved by: https://github.com/angelayi
This enables using FSDP+TP on parameters with dimensions that aren't
evenly divisible by the DP/TP mesh sizes.
- this may not support all possible combinations of strided shardings
and shardings, but the support before this PR is not complete anyway
This contains several fixes for different aspects of DTensor behavior
relating to uneven strided sharding:
- original creation of the strided tensor requires fixes in
StridedShard._split_tensor
- full_tensor() reconstruction requries fixes in
StridedShard._to_replicate_tensor to correctly reshuffle the data into
the original pre-sharded order
- Distributed Checkpointing support requires correct computation of the
compute_local_shape_and_global_offset util so it knows how a local
shard maps to the global tensor, for reconstruction during
load/reshard.
This PR also adds a util `_explicit_order_placements` which converts a list of
placements with StridedSharding into a list of placements with only
regular sharding, with the order shuffled such that it is equivalent.
Builds on and completes the work started in https://github.com/pytorch/pytorch/pull/148894
Uneven Sharding Example
-------
(copied from _StridedShard._to_replicate_tensor docstring)
mesh = (DP=2, TP=2)
original = torch.arange(5)
**Applying Sharding**
Step 1 - Apply TP sharding
`tp = distribute_tensor(x, world_mesh['tp'], [Shard(0)])`
local_tensors:
rank0: [0,1,2] rank1: [3,4]
rank1: [0,1,2] rank3: [3,4]
Step 2 - Apply FSDP sharding
`dp_tp = ...` (the process of creating a strided-shard tensor is skipped over as it is hacky and complicated)
dp_tp has placement (_StridedShard(0, split_factor=2), Shard(0))
local_tensors:
rank0: [0,1] rank1: [3]
rank1: [2] rank3: [4]
**Reconstructing the Full Tensor**
Now, say someone wants to reconstruct dp_tp's full tensor. This will invoke 'redistribute' to replicate.
redistribute will first replicate the "Shard(0)" placement on the rightmost mesh dim, then replicate the
StridedShard placement second, which is implemented by this function.
So our starting point (`local_tensor` arg) is the result of replicating the Shard(0) placement across the
TP dim, which looks like this.
Note the discrepancy with the 'tp sharded tensor' line above! We'll fix it by locally shuffling data.
local_tensors:
rank0: [0,1,3] rank1: [0,1,3]
rank1: [2,4] rank3: [2,4]
Step 1: replicate over the DP dimension. Afterwards, each rank can locally sort the values.
note: we need padding to do this allgather, and we'll need to keep track of the padding amount for later
local_tensors:
rank0: [0,1,3,2,4] rank1: [0,1,3,2,4]
rank1: [0,1,3,2,4] rank3: [0,1,3,2,4]
Step 2: chunk and shuffle values around to account for the wrong order of operations above
and get the original tensor content back
01324# <- our allgather includes padding, if padding was applied in step 1
01324 <- Remove the padding
013, 24 <- chunk once, 'undoing' the DP allgather
01, 3, 2, 4 <- chunk each chunk, 'undoing' the initial (wrong) TP allgather performed by Shard(0)->Replicate()
012, 34 <- interleave with stride=TP mesh dim size
01234 <- concatenate
Co-authored-by: Luca Wehrstedt <lw@meta.com>
Co-authored-by: Will Constable <whc@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150490
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
## Improvements to `docstring_linter`
* Add a "grandfather list" of existing undocumented classes and functions (`--grandfather`, `--grandfather-tolerance`, `--no-grandfather`, `--write-grandfather`)
* In classes, now just one of the class itself or its `__init__()` method needs to be documented (`--lint-init` turns the old behavior back on)
* Now classes and functions defined local to other functions do not need to be documented (`--lint-local` turns the old behavior back on)
* New `--report` flag produces a compact report of long, undocumented classes or function definitions: see attached example run over all pytorch: [pytorch-docs.json](https://github.com/user-attachments/files/18455981/pytorch-docs.json)
## Help text
```
$ python tools/linter/adapters/docstring_linter.py --help
usage: docstring_linter.py [-h] [-l] [-v] [--grandfather GRANDFATHER] [--grandfather-tolerance GRANDFATHER_TOLERANCE] [--lint-init]
[--lint-local] [--lint-protected] [--max-class MAX_CLASS] [--max-def MAX_DEF]
[--min-docstring MIN_DOCSTRING] [--no-grandfather] [--report] [--write-grandfather]
[files ...]
`docstring_linter` reports on long functions, methods or classes without docstrings
positional arguments:
files A list of files or directories to lint
optional arguments:
-h, --help show this help message and exit
-l, --lintrunner Run for lintrunner and print LintMessages which aren't edits
-v, --verbose Print more debug info
--grandfather GRANDFATHER, -g GRANDFATHER
Set the grandfather list
--grandfather-tolerance GRANDFATHER_TOLERANCE, -t GRANDFATHER_TOLERANCE
Tolerance for grandfather sizes, in percent
--lint-init, -i Lint __init__ and class separately
--lint-local, -o Lint definitions inside other functions
--lint-protected, -p Lint functions, methods and classes that start with _
--max-class MAX_CLASS, -c MAX_CLASS
Maximum number of lines for an undocumented class
--max-def MAX_DEF, -d MAX_DEF
Maximum number of lines for an undocumented function
--min-docstring MIN_DOCSTRING, -s MIN_DOCSTRING
Minimum number of characters for a docstring
--no-grandfather, -n Disable the grandfather list
--report, -r Print a report on all classes and defs
--write-grandfather, -w
Rewrite the grandfather list
```
---
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145834
Approved by: https://github.com/amjames, https://github.com/eellison
This adds lazy initialization support to ProcessGroupGloo via `TORCH_GLOO_LAZY_INIT` or via `create_device(..., lazy_init=True)`
This is still a draft PR as there's one race condition when doing coalesced operations that needs to be fixed upstream in Gloo first. Depends on https://github.com/facebookincubator/gloo/pull/427 landing first
This also updates the gloo submodule to include the required changes.
Test plan:
added lazy init test variants
```
pytest -v test/distributed/test_c10d_gloo.py -k Lazy
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150801
Approved by: https://github.com/fduwjj
Summary:
When we divide a FakeTensor by an integer using the fast op implementation, the type promotion should be `ELEMENTWISE_TYPE_PROMOTION_KIND.INT_TO_FLOAT` so we get a float when dividing an int FakeTensor by an integer.
```
FAST = get_fast_op_impls()
fast_div = FAST[torch.ops.aten.div.Tensor]
fast_div(fake_tensor, some_int)
```
Test Plan:
```
python test/test_fake_tensor.py -k test_fast_div
```
Differential Revision: D72667430
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150874
Approved by: https://github.com/angelayi
Summary:
Sometimes we get `MetadataMismatchError` in aoti compilation because draft export uses the flag below to infer the fake kernel when there’s a mismatch, but aoti doesn’t have this flag turned on.
https://fburl.com/code/9qzytl6q
torch._functorch.config.generate_fake_kernels_from_real_mismatches
If we set this flag to True, then aoti compilation would work.
Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r aoti_runtime_asserts
```
Differential Revision: D72345085
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150651
Approved by: https://github.com/angelayi
The util converts a list of placements in the traditional DTensor format
(e.g. [_StridedShard(0), Shard(0)], where list position is mesh_dim and sharding
is always applied left-to-right (from dim 0 to higher dims))
to a more explicitly ordered format, also replacing '_StridedShard' with
simple 'Shard' placements in the process.
(e.g. the above becomes [(1, Shard(0)), (0, Shard(0)] where the first
item in the tuple is the mesh_dim and the ordering of the tuples is the
sharding order.
This is useful so far as a helper for fixing local shape computation for
strided sharding in the uneven shape case, in the following PR- but may
also be useful more broadly if we can use explicit orderings to simplify
other parts of DTensor logic.
This skips implementing some combinations of _StridedSharding that are
not currently used in the wild today, but could be supported easily.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150493
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
This PR creates two utils for generating a schema for hops from example inputs and use base hop as an exmaple.
1. HopArgumentInfoGen creates an argument or an output schema with mutation information.
2. CFuncitonSchemaGen piece together the argument info of inputs and outputs and produces torch._C.FunctionSchema.
is_write attribute of argument info can be computed. Note that the is_write annotation only works when the inputs are flattened (e.g. cannot support mutation inside tuple). We need special handling the case where we have tuple inputs like cond.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149688
Approved by: https://github.com/zou3519
While looking at enabling FR analysis for coalesced collectives, I found that for the slow-path coalescing (cols which are not all-gather, all-reduce or reduce-scatter), we still record start event for them. This is wrong and we should do the same thing as endEvent recodring.
And I made the profiler title more visible when we pass in the opType for coalesced all-gather and reduce-scatter.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150863
Approved by: https://github.com/eqy, https://github.com/d4l3k, https://github.com/kwen2501
Currently for HPU device we don't have any support for _fused_sdp_choice_stub dispatcher function, so for `scaled_dot_product_attention` function by default selecting the `MATH Backend` using `_fused_sdp_choice_stub` for HPU device. With this PR we have enabled support for `_fused_sdp_choice_stub` dispatcher function, so that we can invoke any backend (for example math, flash_attention, efficient_attention, cudnn_attention, overrideable) according to user choice for HPU device.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149512
Approved by: https://github.com/drisspg
Observing failure in release workflow:
https://github.com/pytorch/pytorch/actions/runs/14346340202/job/40216804374
```
Traceback (most recent call last):
File "/opt/python/cp311-cp311/lib/python3.11/site-packages/wheel/bdist_wheel.py", line 11, in <module>
from setuptools.command.bdist_wheel import bdist_wheel as bdist_wheel
ModuleNotFoundError: No module named 'setuptools.command.bdist_wheel'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/tmp/tmppwpqef_x/triton/python/setup.py", line 27, in <module>
from wheel.bdist_wheel import bdist_wheel
File "/opt/python/cp311-cp311/lib/python3.11/site-packages/wheel/bdist_wheel.py", line 13, in <module>
raise ImportError(ERROR) from exc
ImportError: The 'wheel.bdist_wheel' module has been removed.
Please update your setuptools to v70.1 or later.
If you're explicitly importing 'wheel.bdist_wheel', please update your import to point to 'setuptools.command.bdist_wheel' instead.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150931
Approved by: https://github.com/Skylion007
``modernize-use-default-member-init`` prefers initialisation in class members, that make more ``= default`` constructors possible. Some violations or modernize rules have been fixed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149046
Approved by: https://github.com/zou3519
Summary:
att
regular weight has the type of torch.nn.parameter.Parameter
buffer and tensor constant has the type of torch.Tensor
both types are valid.
Test Plan: CI
Differential Revision: D72657275
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150867
Approved by: https://github.com/zhxchen17
This bug was crazy hard to reproduce, so I can't seem to get a unit test written that isn't the internal one I used for debugging.
Here's a short TLDR of the bug:
- Due to D71983456(OSS: https://github.com/pytorch/pytorch/pull/149910), we cache CachingAutotuners in memory.
- Importantly: **Saving stuff in PyCodeCache in memory is not semantically equivalent to writing to disk**. By saving it in memory, CachingAutotuners do not reset global state.
- It's possible through recompiles for different dynamo frames to compile down to exactly the same inductor output code. This involves models that run multiple times, but differ very subtley, or in ways that cause a dynamo guard failure but not a different inductor output code.
- Because of this, we reuse CachingAutotuners for a second compile (with different example inputs, just the same triton kernel code)
- CachingAutotuners have a Coordinate Descent class on them, which has a cache: https://fburl.com/code/4igrsams (OSS: aafc4b6188/torch/_inductor/runtime/coordinate_descent_tuner.py (L69))
- Because we are caching these in memory and not on disk, this cache is **not cleared** between runs.
- However, this variable is *not* saved on the class, and is reinitialized every time we do autotuning: https://fburl.com/code/n2o8tmje
(OSS: aafc4b6188/torch/_inductor/runtime/triton_heuristics.py (L933))
- `config2launcher` is added when we call `benchmark_one_config`, but on a CoorDesc *cache hit*, we never call `benchmark_one_config`! So we end up returning None, and erroring with:
```
AttributeError: 'NoneType' object has no attribute 'store_cubin'
```
This fixes the problem for now by just recompiling the launcher. Technically, we might be able to save config2launcher on the class to avoid this, but I don't want to risk another weird cache safety bug here, so taking the simpler approach for now.
Note that this error only reproduces if:
- None of AOTAutogradCache, FXgraphCache hit on the second entry: otherwise, the CachingAutotuner will go through a pickling and then not be saved in memory
- We haven't spawned parallel compile workers. If there are parallel compile workers, we pickle the autotuner on the way from the worker to the parent process, once again resetting the Autotuner.
- The autotune cache doesn't already have the best config stored in it
So it was extraordinarily hard to debug/reproduce. Because of this, I have a complicated internal unit test but no OSS test that can trigger the exact problem. I'll work on a separate test later, but this needs to go in to fix a sev, so we're landing it based on an internal test only.
Differential Revision: [D72655382](https://our.internmc.facebook.com/intern/diff/D72655382/)
**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D72655382/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150860
Approved by: https://github.com/oulgen
To workaround limitation of 32-arguments per kernel and being able to eventually compile something like
```python
import torch
def foo(*args):
rc = torch.empty_like(args[0])
for arg in args:
rc += arg
return rc
tensors = torch.rand(100, 32, device='mps').unbind(0)
print(torch.compile(foo)(*tensors))
```
For now, introduce `at::native:🤘:get_tensor_gpu_address` and use it from both C++ test and compile_shader to convert list of tensors to list of pointers valid on GPU.
Initially this binding were done via `id< MTLArgumentEncoder>`, but according to [Improving CPU Performance by Using Argument Buffers](https://developer.apple.com/documentation/metal/improving-cpu-performance-by-using-argument-buffers?language=objc#Encode-Resources-into-Argument-Buffers) article, this is not necessary when targeting Tier2-only devices (which is true of all devices on MacOS-13 or newer):
> To directly encode the argument buffer resources on these Tier 2 devices, write the [MTLBuffer](https://developer.apple.com/documentation/metal/mtlbuffer?language=objc).[gpuAddress](https://developer.apple.com/documentation/metal/mtlbuffer/gpuaddress?language=objc) property — and for other resource types (samplers, textures, and acceleration structures), the [gpuResourceID](https://developer.apple.com/documentation/metal/mtlcomputepipelinestate/gpuresourceid?language=objc) property — into the corresponding structure member. To encode offsets, treat these property values as uint64 types and add the offset to them.
Add both C++ and PyThon unittests that validate that this works.
Please note, that using either ArgumentEncoder or directly encoding the data does not guarantee buffer will not be freed until shader execution is complete. On the other hand, this should already be guaranteed by MPSCachingAllocator that would only free the memory after all streams completed its execution.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150780
Approved by: https://github.com/dcci
This PR:
- cleans up some existing comments that don't make sense anymore
- hooks up the "custom_op_default_layout_constraint" back (that seems to
have broken)
- cleans up the "lazy registration path" which seems to never get hit
anymore
- adds dislike_padding to nodes that require exact strides
Test Plan:
- tests + CI
disable padding
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148104
Approved by: https://github.com/shunting314, https://github.com/eellison
ghstack dependencies: #150495
Summary: Added a field `protocol` to `ExternKernelNodes` and all the lowering pass will always use the oss schema to serialize external kernel nodes from now on.
Test Plan: CI
Differential Revision: D72020444
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150197
Approved by: https://github.com/zhxchen17
Includes ATen native transformers hipified sources in ROCm+Windows build. This was removed due to Trinton not being available on Windows, but this causes further linker errors. Setting `USE_FLASH_ATTENTION=0` and `USE_MEM_EFF_ATTENTION=0` during the build will mitigate the missing headers, but also not cause any linker errors, so we will use this approach for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150521
Approved by: https://github.com/jeffdaily
I still don't really understand the original purpose of that env var, but it appears that its usage is completely disconnected from MemPools and from `ncclMemAlloc`/`Free`. In fact, when that env var is set, we invoke `ncclCommRegister` for _all_ NCCL communicators for _all_ the memory segments managed by the allocator (both the global ones, allocated with `cudaMalloc`, and the ones in private MemPools), and we do that both for the segments that already exist when the PG is initialized and for all segments that will be allocated later.
I'm reworking the code a bit, by using a few helper functions, whose name should make this behavior clearer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150682
Approved by: https://github.com/kwen2501
ghstack dependencies: #150681
This consists mainly in two changes:
- ensure we can reliably obtain the device from a `NCCLComm` object (there was one constructor which didn't set the device)
- use a RAII pattern for acquiring the lock to the global dictionary of `NCCLComms` (which ensures the lock is released in case of exceptions)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150681
Approved by: https://github.com/kwen2501
This PR is to resolve issue reported in https://github.com/intel/torch-xpu-ops/issues/1478
There are two cases failing in our Windows CI enabling.
- **test_xpu.py::TestXpuXPU::test_lazy_init_xpu** Needs to add `if __name__ == '__main__':` for Windows when using multiprocess. Refer to https://stackoverflow.com/a/18205006
```
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
Traceback (most recent call last):
File "C:\Users\sdp\lufengqing\torch-xpu-ops\test\xpu\xpu_test_utils.py", line 24, in <module>
test_multi_process(model, input)
File "C:\Users\sdp\lufengqing\torch-xpu-ops\test\xpu\xpu_test_utils.py", line 16, in test_multi_process
assert p.exitcode == 0
AssertionError
```
- **test_xpu.py::TestXpuXPU::test_wrong_xpu_fork_xpu** is a linux only test case, we should skip it on Windows. Refer to 248487f455/test/test_multiprocessing.py (L609)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150520
Approved by: https://github.com/guangyey, https://github.com/EikanWang
This PR extracts some test cases from TestPatternMatcher into a newly created TestPatternMatcherGeneric, and uses instantiate_device_type_tests to make them reusable across multiple devices.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150286
Approved by: https://github.com/jansel
2025-04-08 05:42:44 +00:00
843 changed files with 33763 additions and 12243 deletions
"at::cuda::CachingHostAllocator_resetAccumulatedStats() is deprecated. Please use at::getHostAllocator(at::kCUDA)->reset_accumulated_stats() instead.")
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.