Compare commits

376 Commits

Author SHA1 Message Date
254a955f20 Autoupdate min_lrs for ReduceLROnPlateau if possible, fixes #104361 (#137637)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137637
Approved by: https://github.com/albanD
ghstack-source-id: 9a7b26eadb1b3e5d6749cd98434ba7c4c6b5e402
2024-10-09 19:18:56 -07:00
d50d5df2fb Add warning for non static grads in optimizer variable (#137554)
Fixes https://github.com/pytorch/pytorch/issues/112548

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137554
Approved by: https://github.com/williamwen42
2024-10-10 01:23:21 +00:00
f301f6544b fix bug for fill_empty_deterministic_ not support complex half (#137488)
Fixes #133157

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137488
Approved by: https://github.com/ezyang
2024-10-10 01:21:32 +00:00
361046718d Generate new expected results file when there is failures in diff time benchmarks (#137551)
The test also adds a signpost log for the benchmarks that pass.
To test this, I ran `python check_results.py test_check_result/expected_test.csv test_check_result/result_test.csv out.csv`.
Results:
```
WIN: benchmark ('a', 'instruction count') failed, actual result 90 is -18.18% lower than expected 110 ±1.00% please update the expected results.

REGRESSION: benchmark ('b', 'memory') failed, actual result 200 is 100.00% higher than expected 100 ±+10.00% if this is an expected regression, please update the expected results.

PASS: benchmark ('c', 'something') pass, actual result 107 +7.00% is within expected 100 ±10.00%

MISSING REGRESSION TEST: benchmark ('d', 'missing-test') does not have a regression test enabled for it.

You can use the new reference expected result stored at path: out.csv.

a,instruction count,90,0.01
b,memory,200,0.1
c,something,100,0.1
```
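
The classification shown in the output above can be sketched as follows. This is a minimal illustration, not the code in `check_results.py`: the function names `classify` and `refreshed_expected` are hypothetical, and the sketch assumes that the third CSV column is the expected value and the fourth is a relative noise margin.

```python
def classify(actual: float, expected: float, noise_margin: float) -> str:
    """Classify a benchmark result against its expected value.

    Assumption: noise_margin is relative (0.1 means +/-10%).
    """
    low = expected * (1 - noise_margin)
    high = expected * (1 + noise_margin)
    if actual < low:
        return "WIN"         # better than expected: expected results should be updated
    if actual > high:
        return "REGRESSION"  # worse than expected
    return "PASS"


def refreshed_expected(actual: float, expected: float, noise_margin: float) -> float:
    """Value to write to the new expected-results file (hypothetical helper).

    Passing benchmarks keep their old expectation; failing ones adopt the
    actual result, matching the out.csv contents shown above.
    """
    if classify(actual, expected, noise_margin) == "PASS":
        return expected
    return actual


for name, actual, expected, margin in [("a", 90, 110, 0.01), ("b", 200, 100, 0.1), ("c", 107, 100, 0.1)]:
    print(name, classify(actual, expected, margin), refreshed_expected(actual, expected, margin))
```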

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137551
Approved by: https://github.com/aorenste
2024-10-10 01:09:15 +00:00
d9f4a7d3f9 Simplify find_localzeros (#133325)
Instead of doing an N^2 connected thing, only do simplifications for binary max/min, and for very simple situations.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Differential Revision: [D64135230](https://our.internmc.facebook.com/intern/diff/D64135230)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133325
Approved by: https://github.com/albanD
2024-10-10 00:52:50 +00:00
4f45c76806 [PGNCCL] Limit access to ncclComm_ (#137573)
When non-blocking mode is enabled, we need to make sure `ncclComm_` is ready before calling NCCL APIs on it.
`NCCLComm::getNcclComm` helps us do that (thanks to a wait function inside), and is thus preferred over directly using `ncclComm_`.

To prevent `ncclComm_` from being used directly outside the class, e.g. in `ProcessGroupNCCL`, we also make it a private member of the `NCCLComm` class, the external-facing wrapper.
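
The access pattern can be illustrated with a small Python analogy. This is only a sketch of the idea (a private handle reachable solely through a getter that waits for readiness); the class `CommWrapper` and its methods are hypothetical and are not the C++ `NCCLComm` implementation.

```python
import threading


class CommWrapper:
    """Hypothetical stand-in for a wrapper that owns a communicator handle."""

    def __init__(self):
        self._comm = None                 # private: callers must use get_comm()
        self._ready = threading.Event()   # set once initialization has finished

    def finish_init(self, handle):
        """Called when (non-blocking) initialization completes."""
        self._comm = handle
        self._ready.set()

    def get_comm(self, timeout=None):
        """Wait until the handle is ready, then return it."""
        if not self._ready.wait(timeout):
            raise TimeoutError("communicator is not ready")
        return self._comm


wrapper = CommWrapper()
# Simulate initialization finishing a little later on another thread.
threading.Timer(0.05, wrapper.finish_init, args=("comm-handle",)).start()
print(wrapper.get_comm(timeout=5))
```

Because every caller goes through `get_comm()`, none of them can observe the handle before it is ready.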

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137573
Approved by: https://github.com/Skylion007, https://github.com/shuqiangzhang, https://github.com/c-p-i-o
ghstack dependencies: #137572
2024-10-10 00:34:05 +00:00
cyy
0739efbd1f Remove reference of gcc7 from CI scripts (#137339)
Because gcc7 can't be used to build PyTorch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137339
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-10-10 00:29:29 +00:00
47a515d260 [c10d] simplify barrier implementation and further decouple CPU/GPU synchronization (#137516)
Summary:
Barrier is essentially intended to block the CPU thread (instead of GPU
streams). Previously we used two stream synchronizations (1. the current stream
blocked by the NCCL stream end event, 2. the CPU thread blocked on the current
stream). This is unnecessary, as we already have CPU thread blocking
logic in wait(). Also, adding a barrier-specific code block to the general
GPU synchronize() API is intrusive and confusing.

This PR cleans this up.

Test Plan:
CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137516
Approved by: https://github.com/fduwjj, https://github.com/kwen2501
2024-10-09 23:55:28 +00:00
51c33c0b72 Increase the runner size of AVX* jobs to 4xlarge (#137633)
The failing test was recently moved back from slow, and it requires more RAM than is available on a 2xlarge runner. It looks OK to increase the instance size to 4xlarge instead. I missed the periodic jobs in https://github.com/pytorch/pytorch/pull/137447

Example periodic failures de4c2a3b4e (test_cpu_repro)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137633
Approved by: https://github.com/seemethere, https://github.com/malfet
2024-10-09 23:43:49 +00:00
4304c68a4c In Inductor, be willing to generate deferred runtime asserts when unbacked (#137097)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137097
Approved by: https://github.com/angelayi
ghstack dependencies: #137091
2024-10-09 23:34:35 +00:00
6908d8d450 Enable python dispatcher for reinplacing pass (#137091)
Arguably this should be put somewhere higher up in the stack?  Not sure.

Xref: https://fb.workplace.com/groups/6829516587176185/permalink/8042762615851570/

There is a repro but I need to fix more bugs before it can be checked in

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137091
Approved by: https://github.com/bdhirsh
2024-10-09 23:34:35 +00:00
31e334ad9e [unwind] replace LONG_LONG_MAX by the portable LLONG_MAX (#125043)
This fixes a compilation error on systems with the musl c library.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125043
Approved by: https://github.com/aaronenyeshi
2024-10-09 23:34:16 +00:00
aafa02506e [CudaDMAConnectivityDetector] improve the detection robustness (#137530)
- Previously, the detection would fail if it ran before the user called APIs such as `torch.cuda.set_device()`. This is because the detection logic requires nvml initialization. This PR adds explicit nvml initialization (which is idempotent).
- Previously, any nvml issue that occurred in the detection logic would result in a fatal error. Now we issue an informative warning and return a topology that assumes no NVLink connectivity.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137530
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472, #137473, #137474, #137475, #137529
2024-10-09 23:30:16 +00:00
fbaf9b62de [SymmetricMemoryOps] use float32 as the accumulator type when accumulating bfloat16 with multimem.ld_reduce (#137529)
This provides better accuracy without additional cost.

Also added documentation to `multimem_one_shot_all_reduce` to note the numerical caveats.
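
Why a wider accumulator helps can be seen in a small pure-Python sketch. It emulates bfloat16 rounding with `struct` and sums on a single process, so it only illustrates the numerical effect; it is not the `multimem.ld_reduce` kernel, and the helper names are hypothetical. A Python float stands in for the wider (float32) accumulator.

```python
import struct


def round_to_bfloat16(x: float) -> float:
    """Round a value to the nearest bfloat16 (ties to even), returned as a Python float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round-to-nearest-even on the low 16 bits
    bits &= 0xFFFF0000                   # keep sign, exponent, and 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def sum_with_bf16_accumulator(values):
    acc = 0.0
    for v in values:
        acc = round_to_bfloat16(acc + v)  # every partial sum is rounded to bfloat16
    return acc


def sum_with_wide_accumulator(values):
    acc = 0.0
    for v in values:
        acc += v                          # partial sums keep the wider precision
    return round_to_bfloat16(acc)         # round once, at the end


ones = [1.0] * 512
print(sum_with_bf16_accumulator(ones))  # stalls once the spacing between bfloat16 values exceeds 1
print(sum_with_wide_accumulator(ones))
```

With a bfloat16 accumulator the running sum stops growing at 256, because 257 is not representable and rounds back to 256; the wide accumulator reaches 512.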
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137529
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472, #137473, #137474, #137475
2024-10-09 23:30:16 +00:00
39c5122a4f [IntraNodeComm] replace all-reduce kernels with corresponding symm_mem ops (#137475)
## This Stack

Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them.

## This PR
- Replaces one-shot all-reduce with `symm_mem::one_shot_all_reduce_out`
- Replaces two-shot all-reduce with `symm_mem::two_shot_all_reduce_`
- Removes HCM all-reduce (at least for now). Due to the nature of its accumulation order, we can't guarantee the numerical consistency across all ranks.
- Removes the `IntraNodeComm` python binding (its original purpose is superseded by `SymmetricMemory`).
- Removes methods that were made for the python binding.
- Replaces nvlink detection logic with `DMAConnectivityDetector`.
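
The numerical-consistency concern behind removing HCM all-reduce comes from floating-point addition not being associative. A minimal sketch, using plain Python floats rather than any all-reduce kernel: if two ranks add the same contributions in a different order, they can end up with different results.

```python
values = [0.1, 0.2, 0.3]

# Same three contributions, accumulated in two different orders.
order_a = (values[0] + values[1]) + values[2]
order_b = values[0] + (values[1] + values[2])

print(order_a == order_b)  # False: the two orders round differently
print(order_a, order_b)
```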

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137475
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472, #137473, #137474
2024-10-09 23:30:16 +00:00
e6edfe3928 [SymmetricMemoryOps] create an out-variant for multimem_one_shot_all_reduce (#137474)
## This Stack

Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them.

## This PR

Implement `symm_mem::multimem_one_shot_all_reduce_out`. The out-variant is more suitable for `IntraNodeComm` integration.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137474
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472, #137473
2024-10-09 23:30:16 +00:00
b22749712c type _inductor/optimize_indexing.py (#137599)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137599
Approved by: https://github.com/Skylion007, https://github.com/eellison
2024-10-09 23:29:47 +00:00
d67b4f9e5f type _inductor/quantized_lowerings.py (#137598)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137598
Approved by: https://github.com/Skylion007
2024-10-09 23:29:26 +00:00
9b01d17b8d Use MetaProxy more pervasively (#137588)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137588
Approved by: https://github.com/ezyang
ghstack dependencies: #136674
2024-10-09 23:22:03 +00:00
13cf8360d8 [MPS] Fix testing for generator operators (#137601)
Before this change, tests for operators like `eye` or `triu_indices` were essentially tests that the respective CPU operators are stable, as `cpu_sample` and `mps_sample` were the same.

Moved the logic to `transform_opinfo_sample_to_mps`, which in addition to copying tensors also tweaks `kwargs`.

Discovered that:
 - `torch.randn` and `torch.randint` fall into the same undefined category
 - `torch.logspace` is not implemented for MPS
 - Allow 1.0 absolute tolerance for all `torch.linspace` calls over integral input, as rounding is wrong on the MPS side
 - `torch.triu_indices` are not implemented (PR is coming, this is how I've discovered this problem)
 - `torch.signal.windows.kaiser` fails because `aten::i0` is not implemented
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137601
Approved by: https://github.com/albanD
2024-10-09 23:17:11 +00:00
48fe0d56d6 Type _inductor/exc.py (#137595)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137595
Approved by: https://github.com/Skylion007
2024-10-09 23:15:06 +00:00
7408742b67 Make ignore_fresh_unbacked_symbols reentrant (#137605)
I have a test but it requires some other feature work that isn't fully baked.  Maybe this will fix an xfail.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137605
Approved by: https://github.com/albanD
2024-10-09 23:08:05 +00:00
5516ac5c21 [ROCm] Tunableop record untuned (#128813)
When TunableOp is enabled, it is easy to run out of memory, since the application usually needs a large amount of video memory, for example when running an LLM for inference. So we need an offline mode to tune the GEMMs. This PR provides an offline mode for TunableOp:

- Record untuned GEMMs to a file.

- A Python API named `tune_gemm_in_file` is added to read the untuned file and tune the GEMMs in it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128813
Approved by: https://github.com/jeffdaily, https://github.com/hongxiayang, https://github.com/naromero77amd

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-10-09 21:59:03 +00:00
839d3568b0 [compiled autograd] fix -Wuninitialized (#137539)
https://github.com/pytorch/pytorch/pull/135663#discussion_r1792408353

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137539
Approved by: https://github.com/isuruf, https://github.com/Skylion007
2024-10-09 21:16:26 +00:00
38027b9b47 [SymmetricMemory] fix a bug where numel calculation overflows when the tensor size is large (#137567)
Fixes https://github.com/pytorch/pytorch/issues/137145

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137567
Approved by: https://github.com/Chillee, https://github.com/weifengpy
2024-10-09 20:45:57 +00:00
a93ea617b5 [FSDP2] Required mesh_dim_names for HSDP (#137436)
Two changes:
1. Require `mesh_dim_names` if using HSDP
2. Pass only the shard mesh to `fsdp_pre_all_gather`

Change 1 is technically BC breaking, but it should not be hard to fix on the user side.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137436
Approved by: https://github.com/weifengpy, https://github.com/wz337
2024-10-09 20:35:09 +00:00
47af7cc962 Add compiler bisector (#131936)
This is a utility to aid the torch.compile debugging. You provide a function that returns True on success, False on failure, or do something out of process and run bisect_helper `good | bad`.

The bisector will first go through backends - `eager`, `aot_eager`, `aot_eager_decomp_partition`, `inductor` to find the first failing backend. Then, it will go through subsystems within the backend - currently limited but could be expanded - and try to find the first subsystem for which disabling fixes the problem. Once it has found the failing subsystem, it will find the number of times the subsystem is applied, and then bisect through it.

An example of how to hook it up for the `aot_eager_decomp_partition` backend and the `decomposition` subsystem:

```
    from torch._inductor.bisect_helper import BisectionManager
    if op in CURRENT_DECOMPOSITION_TABLE:
        if BisectionManager.disable_subsystem("aot_eager_decomp_partition", "decomposition", lambda: repr(op)):
            return NotImplemented
```

Once it has discovered the problematic change, it will print out the associated debug info, and you can set the same limits with `TORCH_BISECT_BACKEND` `TORCH_BISECT_SUBSYSTEM` and `TORCH_BISECT_MAX`.
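
The final step, bisecting over the number of times a subsystem is applied, can be sketched as an ordinary binary search. This is a generic illustration and not the `BisectionManager` code: `bisect_first_bad` and `passes_with_limit` are hypothetical names, and the sketch assumes that the run passes with zero applications, fails with all of them, and that a single application is responsible.

```python
from typing import Callable


def bisect_first_bad(total: int, passes_with_limit: Callable[[int], bool]) -> int:
    """Return the 1-based index of the first application that breaks the run.

    passes_with_limit(k) is assumed to run the program with only the first k
    applications of the subsystem enabled and report whether it succeeds.
    """
    lo, hi = 0, total  # invariant: passes with limit lo, fails with limit hi
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if passes_with_limit(mid):
            lo = mid
        else:
            hi = mid
    return hi


# Toy check: pretend the 37th application out of 100 is the culprit.
print(bisect_first_bad(100, lambda limit: limit < 37))
```

The search needs only about log2(total) runs, which matters when each run is a full compile.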

We could add further options as an automated way of going through a check list for checking divergence - e.g., the mode to emulate amp casts.

Fix for https://github.com/pytorch/pytorch/issues/126546

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131936
Approved by: https://github.com/ezyang
2024-10-09 20:34:11 +00:00
cfe970260a Clarify opt-einsum usage, fix #127109 (#137596)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137596
Approved by: https://github.com/albanD
2024-10-09 20:31:24 +00:00
c73d2634b9 Revert "Log chromium event for automatic dynamic reasons (#137491)"
This reverts commit 3c1ab9367885fdb0ead5fcc14a22d6934070ca92.

Reverted https://github.com/pytorch/pytorch/pull/137491 on behalf of https://github.com/jovianjaison due to breaking internal tests ([comment](https://github.com/pytorch/pytorch/pull/137491#issuecomment-2403360486))
2024-10-09 20:24:12 +00:00
16a2c2cfd4 Revert "Introduce torch.sym_sum (#136429)"
This reverts commit 90bed32b986ab1356dc376df3985497cedbe8a29.

Reverted https://github.com/pytorch/pytorch/pull/136429 on behalf of https://github.com/ezyang due to fails internal stuff ([comment](https://github.com/pytorch/pytorch/pull/136429#issuecomment-2403335147))
2024-10-09 20:08:01 +00:00
572f506f9c [c10d] Improve split_group test (#137572)
Fix 1:
`backend1 = pg._get_backend`, here `pg` should be `ng1`.

Fix 2:
`dist.broadcast` should be called by ranks of subgroup `ng1` only.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137572
Approved by: https://github.com/Skylion007
2024-10-09 19:43:57 +00:00
70288c3c2d Remove dependency on numpy for serialization for XLA/open registration devices without numpy (#137444)
Related: https://github.com/pytorch/xla/issues/7799#issuecomment-2375818263

Follow ups: Do the same for maia and mtia

## Motivation

With the move to `weights_only` by default, we are making an explicit decision not to allowlist GLOBALs required to deserialize `numpy` tensors by default. The implication is that backends relying on numpy for serialization will fail loudly when `torch.load` flips `weights_only`.

However, we make the observation that this dependency on numpy was legacy and is not actually needed anymore. So we can remove it, which aligns with our weights_only strategy.

## Why is this ok?

The following comment on why numpy is necessary for serialization is legacy

c87c9f0a01/torch/_tensor.py (L303-L312)

We no longer do the following, though it was the case 5 years ago in the PR that added this
> CPU storage is reconstructed with randomly initialized data, moved onto backend device, and then storage is updated to the serialized content

**Instead, what now happens is that CPU storage is constructed with data from the file *and then* moved onto the backend device.**

Old behavior (`legacy_load`): 67adda891a/torch/serialization.py (L620)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137444
Approved by: https://github.com/albanD
2024-10-09 19:35:55 +00:00
aa61e251d4 [FSDP2] Added shard_placement_fn arg (#137496)
## Overview
This PR adds a `shard_placement_fn: Optional[Callable[[nn.Parameter], Optional[Shard]]]` arg to `fully_shard` that allows users to specify FSDP sharding on a nonzero tensor dim. If doing so, then the tensor dim size must be divisible by the FSDP shard world size.

```
# Example:
def shard_placement_fn(param: nn.Parameter) -> Optional[Shard]:
    largest_dim = largest_dim_size = -1
    for dim, dim_size in enumerate(param.shape):
        if dim_size > largest_dim_size:
            largest_dim = dim
            largest_dim_size = dim_size
    return Shard(largest_dim)

fully_shard(module, shard_placement_fn=shard_placement_fn)
```
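
Since the chosen dim size must be divisible by the FSDP shard world size, a placement function may want to check that before picking a dim. The following is a sketch of that selection logic on plain shape tuples, so it runs without PyTorch; `pick_shard_dim` is a hypothetical helper, and in a real `shard_placement_fn` the result would presumably be wrapped as `Shard(dim)`.

```python
from typing import Optional, Sequence


def pick_shard_dim(shape: Sequence[int], shard_world_size: int) -> Optional[int]:
    """Return the largest dim whose size is divisible by shard_world_size, or None."""
    best_dim, best_size = None, -1
    for dim, dim_size in enumerate(shape):
        if dim_size % shard_world_size == 0 and dim_size > best_size:
            best_dim, best_size = dim, dim_size
    return best_dim


print(pick_shard_dim((3, 8), 4))   # dim 1 is the only divisible dim
print(pick_shard_dim((3, 5), 4))   # no dim is divisible
```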

## Follow-Ups
- **Copy kernels:** For all-gather copy-out, we currently copy-out to temporaries and then chunk-dim-0 -> cat-shard-dim, incurring an extra copy for parameters sharded on nonzero tensor dim. Similarly, for reduce-scatter copy-in, we currently chunk-shard-dim -> cat-dim-0, incurring an extra copy for gradients sharded on nonzero tensor dim. @yifuwang has ideas for adding additional split size args to the copy ops that allow fusing these extra copies into the existing all-gather copy-out and reduce-scatter copy-in.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137496
Approved by: https://github.com/weifengpy
ghstack dependencies: #137593
2024-10-09 19:13:32 +00:00
36133f39db Tensorify compute on Python scalars (#136674)
Signed-off-by: Bob Ren <bobrenfb.com>

Commandeered from https://github.com/pytorch/pytorch/pull/130228 as I'm helping @ezyang w/ shipping dynamic float arguments in PT2. This starts with supporting torch.ops.aten.mul. I'll stack on top support for other operators in subsequent PRs to keep this scoped to the mechanics of the fx pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136674
Approved by: https://github.com/ezyang
2024-10-09 18:51:41 +00:00
f15edb291a type _dynamo/trace_wrapped_higher_order_op.py (#137354)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137354
Approved by: https://github.com/Skylion007, https://github.com/jansel
2024-10-09 18:35:28 +00:00
9a957e2842 [NCCL][Profiler] Add functionality to call dump function of NCCL profiler plugin (#137523)
Summary:
NCCL 2.23.4 provides the profiler plugin feature, which traces collective, p2p, proxyOps, and other events.

The diff supports the following feature: when NCCL times out, the flight recorder can also dump traces in the profiler plugin.

Test Plan:
```
        tensor = torch.tensor([dist.get_rank()], dtype=torch.int32, device=dev)
        # Create a list with same number of elements as world size (aka no. of ranks)
        # During allgather this list is going to be populated with tensors from all ranks (aka all gather)
        gathered_tensors = [torch.zeros_like(tensor) for _ in range(WORLD_SIZE)]
        # get collective from all ranks
        if i <= 10 or RANK != 0:
            dist.all_gather(gathered_tensors, tensor)
```
My script triggers the flight recorder.
```
trainer/0 [0]:E0927 12:07:22.643702 1012209 ProcessGroupNCCL.cpp:1356] [PG ID 0 PG GUID 0(default_pg) Rank 0] ProcessGroupNCCL preparing to dump debug info.
trainer/0 [0]:I0927 12:07:22.643784 1012209 ProcessGroupNCCL.cpp:392] NCCL_PROFILER_PLUGIN: /data/users/zhiyongww/fbsource/fbcode/scripts/nbahl/libnccl_profiler_plugin.so
trainer/0 [0]:I0927 12:07:22.643805 1012209 plugin.cpp:559] Profiler start dump
trainer/0 [0]:I0927 12:07:22.645249 1012209 ProcessGroupNCCL.cpp:1363] [PG ID 0 PG GUID 0(default_pg) Rank 0] ProcessGroupNCCL dumping nccl trace to /tmp/nccl_trace_rank_0
trainer/0 [0]:I0927 12:07:22.645418 1012209 NCCLUtils.cpp:348] Finished writing NCCLPG debug info to /tmp/nccl_trace_rank_0
```
Content from /tmp/nccl_trace_rank_0: P1614645283

Differential Revision: D61929401

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137523
Approved by: https://github.com/c-p-i-o
2024-10-09 18:19:33 +00:00
394c143e4e [dynamo] Fix error when inlining certain nested closure returned by another function (#137510)
See `test_inline_closure_returned_by_another_function_and_captures` and #136814 for more context.

In #90286, we introduced an optimization so that for captured cells that are unmodified during a Dynamo trace, `UserFunctionVariable` will represent them as variable of the cell's actual value, rather than a `NewCellVariable`.

Later on we introduced more mechanisms to model such cells across function calls (#104222), and across function calls where `NestedUserFunctionVariable::bind_args` need to look up further in the parent frames (#106491) to find these cells' values.

This patch removes `InlinedClosureVariable` in favor of a simpler modelling, which is also more consistent with what was introduced in #90286, i.e., just model these cells as their contents, in `symbolic_locals`.

This fixes #136814 because resolution of `InlinedClosureVariable` to the underlying cell content value happens in
`NestedUserFunctionVariable::bind_args`, which requires Dynamo to have the value in scope at the function call site (when Dynamo does inlining), but that's not always the case (as the test case shows). However, if we model the cells in `symbolic_locals`, we never need such resolution, and the values are directly stored into the `NestedUserFunctionVariable::closure` upon the function creation, at which point Dynamo always has the cell value in `symbolic_locals` for look up.
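
For readers unfamiliar with the shape of code involved, here is a plain-Python illustration of a nested closure that is returned by another function and captures cells. It is not the test case referenced above and does not use Dynamo; the function names are made up for illustration. It only shows that the captured values live in the closure's cells, independent of the scope at the later call site.

```python
def make_adder(offset):
    def add(x):
        return x + offset          # `offset` is captured in a closure cell
    return add


def outer(x):
    add = make_adder(10)           # closure created inside another function
    def inner(y):
        return add(y) + x          # nested closure capturing `add` and `x`
    return inner


fn = outer(1)
# The captured values travel with the function object itself.
captured = dict(zip(fn.__code__.co_freevars, (c.cell_contents for c in fn.__closure__)))
print(sorted(captured))            # names of the captured variables
print(fn(2))                       # 2 + 10 + 1
```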

Fixes #136814.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137510
Approved by: https://github.com/williamwen42
2024-10-09 18:13:57 +00:00
018dabff20 [ONNX] Implement patch for jit.isinstance (#137592)
Patch `torch.jit.isinstance` for users so that models are dynamo exportable. Replaces https://github.com/pytorch/pytorch/pull/137487.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137592
Approved by: https://github.com/titaiwangms, https://github.com/xadupre
2024-10-09 18:06:52 +00:00
ceb2fcc5db [FSDP2] Fixed incorrect tensor meta after .to(dtype) (#137593)
This fixes https://github.com/pytorch/pytorch/issues/137522. After a method that changes module parameters (like `.to(torch.float64)`), we need to update the `DTensorSpec`, whose `TensorMeta`'s dtype may have changed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137593
Approved by: https://github.com/Skylion007
2024-10-09 17:57:11 +00:00
bae8d5853e [TorchRec][PT2 compile] enable dynamo in _get_user_embeddings (#136798)
Summary:
# context
* enable the `_get_user_embeddings` function
* run failed at P1610151892
```
  torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
  GuardOnDataDependentSymNode: Could not guard on data-dependent expression u22 <= 0 (unhinted: u22 <= 0).  (Size-like symbols: u22)

  ATTENTION: guard_size_oblivious would fix the error, evaluating expression to False.
  Maybe you need to add guard_size_oblivious to framework code, see doc below for more guidance.

  Potential framework code culprit (scroll up for full backtrace):
    File "/data/users/hhy/fbsource/buck-out/v2/gen/fbcode/38472faba4e3e6c1/aps_models/ads/icvr/__icvr_launcher_live__/icvr_launcher_live#link-tree/torch/_decomp/decompositions.py", line 1692, in native_layer_norm_backward
      if M <= 0 or N <= 0:
```
```
    N = prod(inner_dims)  # type: ignore[arg-type]
    M = prod(outer_dims)  # type: ignore[arg-type]
    if M <= 0 or N <= 0:
        return (
            input.new_zeros(input_shape) if output_mask[0] else None,
            input.new_zeros(input_shape[axis:]) if output_mask[1] else None,
            input.new_zeros(input_shape[axis:]) if output_mask[2] else None,
        )
```
# changes
* use guard_size_oblivious, since the early `new_zeros` return is a kind of optimization and shouldn't impact the correctness of the follow-up code logic.
* the size `ret[i][j]` could be zero, so the change in V1 isn't valid
* for more details: [post](https://fb.workplace.com/groups/6829516587176185/permalink/8003616173099548/)
```
    from torch.fx.experimental.symbolic_shapes import guard_size_oblivious
    if guard_size_oblivious(M <= 0) or guard_size_oblivious(N <= 0):
```

# past
* found `u22` was introduced at
```
    def _wait_impl(self) -> List[List[int]]:
        # Can not use is_torchdynamo_compiling(), as every such condition should be independent for compilation with graph breaks.
        if isinstance(self._splits_awaitable, dist.Work):
            self._splits_awaitable.wait()

        ret = self._output_tensor.view(self.num_workers, -1).T.tolist()  # <------ u22 introduced here

        if not torch.jit.is_scripting() and is_torchdynamo_compiling():
            for i in range(len(ret)):
                for j in range(len(ret[i])):
                    torch._check_is_size(ret[i][j])   # <----------  my question: why the _check_is_size isn't enough??
                    torch._check(ret[i][j] > 0)   # <------ added by diff V1
```

Test Plan:
# run command
```
TORCH_SHOW_CPP_STACKTRACES=1 TORCHDYNAMO_EXTENDED_DEBUG_CPP=1 TORCH_LOGS="+graph_code,output_code,dynamic,aot,guards,verbose_guards,recompiles,graph_breaks" TORCH_TRACE=/var/tmp/tt buck2 run fbcode//mode/opt fbcode//aps_models/ads/icvr:icvr_launcher_live -- mode=fmc/local_ig_fm_v4_mini training.pipeline_type=pt2 2>&1 | tee -a `tagT`.`tagH`.log
```

# results
* before
**without enabling `_get_user_embeddings`**
[14 Failures and Restarts](https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmp2eNI7p/failures_and_restarts.html)
log: P1610151892
{F1889387940}
* V1
enable `_get_user_embeddings`
with `torch._check(ret[i][j] > 0)`
[13 Failures and Restarts](https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmp6J1iY9/failures_and_restarts.html)
{F1889388378}
* V2
enable `_get_user_embeddings`
with `if guard_size_oblivious(M <= 0) or guard_size_oblivious(N <= 0):`
[tlparse](https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpFhZZyC/index.html)
if guard_size_oblivious(M <= 0) or guard_size_oblivious(N <= 0):

Differential Revision: D63424929

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136798
Approved by: https://github.com/ezyang
2024-10-09 17:19:45 +00:00
4d45536e92 Save aot graph code in AOTAutogradCache for logging purposes (#137432)
Save the string graph code from print_readable

Differential Revision: [D63985711](https://our.internmc.facebook.com/intern/diff/D63985711/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137432
Approved by: https://github.com/bdhirsh
ghstack dependencies: #137431
2024-10-09 16:59:08 +00:00
b71d0ac3b1 remove unused variable (#137565)
per title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137565
Approved by: https://github.com/Skylion007
2024-10-09 16:31:43 +00:00
ae03c0cff3 Add microbenchmark for FxGraphHashDetails.debug_lines (#137506)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137506
Approved by: https://github.com/jamesjwu
2024-10-09 16:15:05 +00:00
e945b6600d Support 3.8 compile again (#137587)
This is not going to be very reliable since we don't have CI though...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137587
Approved by: https://github.com/Skylion007
2024-10-09 15:54:52 +00:00
1d15dd7891 Fix triton_reshape to properly expand Min keyword in triton codegen (#137357)
Summary: Previously triton_reshape will generate code with `Min` keyword in it, which is incorrect. This diff updates the triton_reshape function to properly expand `Min` keyword to `<`.

Test Plan:
```
buck2 run @//mode/{opt,mtia,inplace} //glow/fb/fx/fba/tests:test_fba_inductor -- -r test_Min_keyword_in_block_shape
```

Differential Revision: D63850158

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137357
Approved by: https://github.com/blaine-rister, https://github.com/eellison
2024-10-09 15:53:45 +00:00
de4c2a3b4e Add AsyncCollectiveTensor isinstance check to test_graph_input_is_async (#137253)
This PR doesn't change the logic of `test_graph_input_is_async` - it just adds an additional check to the graph input type to ensure it's always `AsyncCollectiveTensor` as expected. It would potentially make it easier to show to users that we already support `AsyncCollectiveTensor` as graph input.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137253
Approved by: https://github.com/bdhirsh
2024-10-09 08:06:16 +00:00
ac8954d1ca [pattern match][SDPA] remove contiguous in sdpa replacement (#136930)
Fixes a perf issue which is found internally.
In this case, we see query(size=[1, 16, 384, 64], stride=[393216, 64, 1024, 1]) in model code. However, before entering SDPA, it becomes query(size=[1, 16, 384, 64], stride=[393216, 24576, 64, 1]). This is caused by the [SDPA pattern match](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/fx_passes/fuse_attention.py#L130-L132), which applies contiguous to the inputs in the replacement. This is not necessary, as the contiguous doesn't exist in the pattern. Furthermore, it can sometimes cause perf issues. If needed, we can do the additional contiguous in the kernel implementation.
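
The two stride patterns quoted above can be reproduced with a few lines of plain Python. This sketch only does the stride arithmetic (no tensors are created), and `contiguous_strides` is a hypothetical helper name. It shows that the second stride list is what a contiguous layout gives for size [1, 16, 384, 64], while the first is what one gets by swapping dims 1 and 2 of a contiguous [1, 384, 16, 64] layout, i.e. a transposed view.

```python
def contiguous_strides(size):
    """Row-major (contiguous) strides for the given size."""
    strides = []
    running = 1
    for dim in reversed(size):
        strides.append(running)
        running *= dim
    return list(reversed(strides))


# Layout after `contiguous` is applied in the replacement.
after = contiguous_strides([1, 16, 384, 64])

# Layout seen in model code: a contiguous [1, 384, 16, 64] tensor with dims 1 and 2 swapped.
base = contiguous_strides([1, 384, 16, 64])
before = [base[0], base[2], base[1], base[3]]

print(before)
print(after)
```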

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136930
Approved by: https://github.com/Skylion007, https://github.com/drisspg, https://github.com/jgong5
2024-10-09 07:52:38 +00:00
72ad1b8c6c Make Context to be Device-agnostic Step by Step (2/N) (#136526)
- Add new methods (`getDefaultGenerator`, `getNewGenerator`) to `AcceleratorHooksInterface`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136526
Approved by: https://github.com/ezyang, https://github.com/EikanWang
ghstack dependencies: #136519
2024-10-09 07:34:30 +00:00
a02093e824 fix test_export_constraints_error_not_in_range (#137500)
Test Plan: fixed

Differential Revision: D64052011

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137500
Approved by: https://github.com/tugsbayasgalan
2024-10-09 05:48:14 +00:00
abb00efc14 Add torch.squeeze parameter description to declare allowed type (#137485)
Fixes #137422

Add a parameter type definition to the API docs to clarify the allowed value types, so that users do not pass `None` as the `dim` value directly.

```python
>>> import torch
>>> x = torch.randn(3,1,2)
>>> x.squeeze(dim=None)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: Please look up dimensions by name, got: name = None.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137485
Approved by: https://github.com/albanD
2024-10-09 05:29:13 +00:00
df114a447e Parametrize test_lstm_packed (#137447)
The test runs all of its combinations (512) sequentially, so it takes more than 30 minutes to finish, or times out on ASAN after one hour. Parametrizing it breaks it up, so individual tests can finish and no longer need to be marked as slow.

Also, the test seems to run out of memory on a 2xlarge runner with a `std::bad_alloc` memory error. Maybe this would also fix that issue (pending CI testing).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137447
Approved by: https://github.com/albanD, https://github.com/malfet
2024-10-09 05:13:53 +00:00
2fff990c16 Revert "[AutoAC] Backward Pass Aware AC - changes to partitioner to acommodate SOLVER as a callable (#137314)"
This reverts commit 932b9945c0bc61a11a7db2f52c974cf283d5a2ed.

Reverted https://github.com/pytorch/pytorch/pull/137314 on behalf of https://github.com/huydhn due to The failure shows up in trunk ([comment](https://github.com/pytorch/pytorch/pull/137314#issuecomment-2401311719))
2024-10-09 04:53:30 +00:00
972822dea1 Minorly reorder optim kwargs in docs, fixes #137391 (#137531)
Closes #137391

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137531
Approved by: https://github.com/albanD
2024-10-09 04:14:45 +00:00
4628fcf41a Fix ir._WaitKernel (#137401)
In ABI-compatible mode, AOTInductor could not compile _WaitKernel due to
an incorrect outputs list.  Add the correct set of outputs, as done in
ir._CollectiveKernel.create_out_of_place.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137401
Approved by: https://github.com/desertfire
ghstack dependencies: #136924
2024-10-09 04:02:30 +00:00
0414aeacd9 AOTInductor: silence linker warnings about executable stacks (#136924)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136924
Approved by: https://github.com/desertfire
2024-10-09 04:02:30 +00:00
ddc7b6d0b4 Removes confusing note, addresses #38006 (#137535)
Fixes #38006

The note was originally added in https://github.com/pytorch/pytorch/pull/30257, which tried to ensure that the gradient wasn't modified in the optimizer. This note creates more confusion than is helpful, so removing it is better than leaving it in, especially because most uses of closure that I know of _do_ modify the grads.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137535
Approved by: https://github.com/albanD
2024-10-09 04:00:38 +00:00
d3edf4ebf4 [SymmetricMemoryOps] implement two-shot all-reduce (#137473)
## This Stack

Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them.

## This PR

Implement `symm_mem::two_shot_all_reduce_`. Later we'll replace the two-shot all-reduce in `IntraNodeComm` with these.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137473
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472
2024-10-09 03:49:42 +00:00
82e55b624f [SymmetricMemoryOps] implement one_shot_all_reduce (#137472)
## This Stack

Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them.

## This PR

Implement `symm_mem::one_shot_all_reduce` and `symm_mem::one_shot_all_reduce_out`. Later we'll replace the one-shot all-reduce in `IntraNodeComm` with these.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137472
Approved by: https://github.com/Chillee, https://github.com/weifengpy
ghstack dependencies: #137471
2024-10-09 03:49:42 +00:00
5d83ee3e32 [SymmetricMemoryOps] refine cross-device barriers (#137471)
## This Stack

Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them.

## This PR

Refine the cross-device synchronization primitives to make it clearer when to use which synchronization pattern.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137471
Approved by: https://github.com/Chillee, https://github.com/weifengpy
2024-10-09 03:49:42 +00:00
5f1759a025 [Dynamo] add flex attention mode test (#137121)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137121
Approved by: https://github.com/yanboliang, https://github.com/anijain2305
ghstack dependencies: #137114, #137115, #137116, #137117, #137120, #137227, #137119
2024-10-09 02:29:40 +00:00
d5785d4295 [Dynamo] Handle torch function subclass/mode dispatch on generic tensor methods (#137119)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137119
Approved by: https://github.com/williamwen42, https://github.com/anijain2305
ghstack dependencies: #137114, #137115, #137116, #137117, #137120, #137227
2024-10-09 02:29:40 +00:00
0a304d9048 [Dynamo] Handle extracted unbound tensor methods (#137227)
fixes2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137227
Approved by: https://github.com/williamwen42, https://github.com/anijain2305
ghstack dependencies: #137114, #137115, #137116, #137117, #137120
2024-10-09 02:29:40 +00:00
b3f30c9bc3 [Dynamo] Move flex attention torch function mode to traceable HOP file (#137120)
Moves `TransformGetItemToIndex` to a file where dynamo stores other traceable HOP concepts.  (We don't trace through torch.* modules by default)

Tracing through the mode required fixing a bug in dynamo's autograd function handling. That fix removed a graph break, which in turn exposed autograd test failures (skipped for now; an issue will be filed).

Previously those tests were in essence running in eager, because dynamo would fall back due to an arg mismatch error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137120
Approved by: https://github.com/yanboliang, https://github.com/malfet
ghstack dependencies: #137114, #137115, #137116, #137117
2024-10-09 02:29:40 +00:00
27dee935af [Dynamo] Ensure torch function modes are dispatched on builtin ops (#137117)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137117
Approved by: https://github.com/yanboliang, https://github.com/williamwen42
ghstack dependencies: #137114, #137115, #137116
2024-10-09 02:29:40 +00:00
38afac2917 [Dynamo] Remove ignored modes from torch function mode stack guard (#135503) (#137116)
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137, #135443, #135444, #135422, #135502

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137116
Approved by: https://github.com/yanboliang
ghstack dependencies: #137114, #137115
2024-10-09 02:29:40 +00:00
108b469f78 [Dynamo] Remove ignored modes workaround (#135502) (#137115)
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137, #135443, #135444, #135422

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137115
Approved by: https://github.com/yanboliang
ghstack dependencies: #137114
2024-10-09 02:29:40 +00:00
e41dffbedd [Dynamo] Trace enter/exit of TorchFunctionModes (#135422) (#137114)
This PR implements tracing of `with` contexts that use TorchFunction modes with the default enter/exit behavior (i.e. pushing/popping the mode).

Typically the bytecode for a context manager looks like this during a graph break:
1. graph call
2. enter context
3. unsupported code
4. exit context
5. resume call

resume fn structure:
1. enter context
2. jump
...
3. exit context

The issue with torch function modes is that side effects will replay any mutations to the torch function stack performed during tracing. So, we do not need to enter and exit around the unsupported code in the original function (doing so would result in a duplicate torch function mode entry during execution of the unsupported code), and we don't need to enter again in the resume function (the mode that was pushed from the side effects bytecode would still be on the stack).

So for torch function modes the structure of our output code is this:

1. graph call
2. mutate tf mode stack to replay mutations
3. unsupported code
4. on exception restore stack
5. resume function

Then our resume fn looks like this:

1. no-op enter torch function mode
2. jump
3. exit tf mode

To implement the no-op enter of the torch function mode, I added a torch function mode in polyfill that enters as a no-op but exits normally. This is needed because we still want to trace the with context in the resume function, and exit properly (the exit instructions will still be in the function, so we need to generate instructions to set up the context).

Separately from the bytecode, dynamo also tracks contexts on the block stack, which is how the SETUP_* instructions are implemented. Naturally at a graph break, we exit these block stacks to properly reset the contexts entirely, so that we can re-enter around the unsupported code soundly. However once again, in the torch function mode case, in the event of a graph break we do not want to perform any exit side effects, because we want to preserve the state of the mode stack as is so that we will properly update the stack with the bytecode mentioned in the first section. If we exited here, dynamo would pop the mode off of the symbolic stack, and not update the true python torch function mode stack with the suffix bytecode. All in all, for torch function modes we enter exactly once, update the global torch function mode stack with side effects bytecode, re-read this stack when compiling the resume function, and exit exactly once in the resume function. This matches the semantics of eager exactly.
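
The enter-once/exit-once behavior described above can be modelled with a plain Python list standing in for the torch function mode stack. This is only an illustrative sketch: `mode_stack`, `Mode`, `NoOpEnterMode`, and the three functions are hypothetical names, not dynamo's actual implementation or generated bytecode.

```python
mode_stack = []  # stand-in for the global torch function mode stack


class Mode:
    def __init__(self, name):
        self.name = name

    def __enter__(self):
        mode_stack.append(self.name)
        return self

    def __exit__(self, *exc):
        mode_stack.pop()
        return False


class NoOpEnterMode(Mode):
    """Enter does nothing (the mode is already on the stack); exit pops as usual."""

    def __enter__(self):
        return self


def compiled_prefix(log):
    # Steps 1-2: run the graph, then replay the stack mutation seen during tracing.
    mode_stack.append("m")
    log.append(("after replay", list(mode_stack)))


def unsupported_code(log):
    # Step 3: runs with the mode already pushed, without a second enter.
    log.append(("unsupported", list(mode_stack)))


def resume_fn(log):
    # Resume function: no-op enter, then a normal exit.
    with NoOpEnterMode("m"):
        log.append(("resume", list(mode_stack)))
    log.append(("after exit", list(mode_stack)))


def run():
    log = []
    compiled_prefix(log)
    try:
        unsupported_code(log)
    except Exception:
        mode_stack.pop()  # step 4: restore the stack on exception
        raise
    resume_fn(log)
    return log
```

Swapping `NoOpEnterMode` for `Mode` in `resume_fn` would leave two copies of `"m"` on the stack inside the `with` block, which is the duplicate entry the commit message warns about.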
Approved by: https://github.com/williamwen42
ghstack dependencies: #134732, #133137, #135443, #135444

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137114
Approved by: https://github.com/yanboliang
2024-10-09 02:29:40 +00:00
0b8048c78a Fix AOTI CPP GEMM Template issue without freezing (#136421)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/135106. For AOTI, the Inductor IR of the weight is:
```
ReinterpretView(
  StorageBox(
    ConstantBuffer(name='L__self___mlp_0_weight', layout=FixedLayout('cpu', torch.float32, size=[64, 128], stride=[128, 1]))
  ),
  FixedLayout('cpu', torch.float32, size=[128, 64], stride=[1, 128]),
  origins=OrderedSet([addmm])
)
```
In the post-processing step of the GEMM template, the weight used was the one before permutation, leading to correctness issues. In this PR, we address this by reshaping the weight to the expected size and stride before the weight prepack.
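
To see why the size and stride matter, here is a small standard-library sketch (not Inductor code) that reads the same flat buffer with two layouts. A 2x3 buffer stands in for the `[64, 128]` weight; `read_strided` is a hypothetical helper. Reading with the view's strides yields the transposed matrix, so using the underlying buffer's layout instead gives different values at the same indices.

```python
def read_strided(buf, size, stride):
    """Materialize a 2-D view of a flat buffer from its size and stride."""
    rows, cols = size
    return [
        [buf[i * stride[0] + j * stride[1]] for j in range(cols)]
        for i in range(rows)
    ]


buf = list(range(6))

# Underlying buffer layout: size [2, 3], stride [3, 1] (contiguous).
base = read_strided(buf, (2, 3), (3, 1))

# Reinterpreted layout: size [3, 2], stride [1, 3], i.e. the transpose.
view = read_strided(buf, (3, 2), (1, 3))
```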

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_aot_inductor.py -k test_misc_1_max_autotune_True_non_abi_compatible_cpu
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_aoti_linear
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_aoti_linear_multi_view_operations
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136421
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-10-09 02:19:07 +00:00
be0b75256a Make Context to be Device-agnostic Step by Step (1/N) (#136519)
- make init device-agnostic and move it to AcceleratorHooksInterface
- refactor context code related to device initialization

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136519
Approved by: https://github.com/ezyang, https://github.com/EikanWang, https://github.com/guangyey
2024-10-09 02:13:36 +00:00
384ddab294 [c10d] fix sequence numbers for coalesced operations (#135132)
Summary:
We were erroneously incrementing seq_collective for p2p operations.
Fixes issue #134833

Test Plan:
Unit tests.
TODO: add more unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135132
Approved by: https://github.com/fduwjj
2024-10-09 01:38:12 +00:00
8cbb58cff6 [inductor] Limit cpu copies in autotuning to CUDA devices (#137509)
Summary: Missed in https://github.com/pytorch/pytorch/pull/136701#discussion_r1792328849: we should perform this optimization only for mutated args on cuda devices

Test Plan: `python benchmarks/dynamo/timm_models.py --performance --inductor --device cuda --inference --bfloat16 --print-compilation-time --print-memory --cold-start-latency --only fbnetc_100`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137509
Approved by: https://github.com/int3, https://github.com/eellison
2024-10-09 01:31:58 +00:00
932b9945c0 [AutoAC] Backward Pass Aware AC - changes to partitioner to acommodate SOLVER as a callable (#137314)
Summary: making it so that the config can pass `config.activation_memory_budget_solver` as a callable method and then that callable is invoked to determine the set of saved/recomputed nodes.

Test Plan: tbd

Reviewed By: Chillee, basilwong

Differential Revision: D63714905

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137314
Approved by: https://github.com/eellison, https://github.com/basilwong

Co-authored-by: Parikshit Shah <parikshit@meta.com>
2024-10-09 00:39:29 +00:00
23c531b3e9 Allow parallelize_module to get device_mesh from ambient context (#134247)
This PR is for supporting calling `parallelize_module` from within a model definition, making the model a parallel one.

Calling `parallelize_module` is an alternative to maintaining a set of `ColumnWiseLinear`, `RowWiseLinear`, etc, while still being able to directly author a parallel model.

(The motivation for authoring a parallel model is that there may be other distributed operations, which may not be easily captured by any module, see the forward function below. Alternatively speaking, the purpose is to exploit the expressiveness of DTensor -- we need to first create DTensors before calling ops on them. Having parallelized modules in model is one way of creating DTensors.)

For example:
```
class FeedForward(nn.Module):
    def __init__(self, config: TransformerArgs) -> None:
        super().__init__()
        w1 = nn.Linear(config.dim, config.hidden_dim, bias=False)
        w2 = nn.Linear(config.hidden_dim, config.dim, bias=False)
        w3 = nn.Linear(config.dim, config.hidden_dim, bias=False)
        self.w1 = parallelize_module(w1, Colwise)
        self.w2 = parallelize_module(w2, Rowwise)
        self.w3 = parallelize_module(w3, Colwise)

    def forward(self, x: Tensor) -> Tensor:
        y: DTensor = self.w2(F.silu(self.w1(x)) * self.w3(x))
        # y is a DTensor with Partial placement; we can return it as is.
        return y
        # Or we can convert it to Replicate -- there is modeling flexibility here.
        return y.redistribute(Replicate())

with device_mesh:
    model = FeedForward(config)
    # Now model is a model parallelized onto device_mesh

y = model(x)

```

The `device_mesh` actually used for `parallelize_module` would be retrieved from the ambient context.

Calling `parallelize_module` from within model hierarchy also saves the use of *FQNs* as in the out-of-model annotation case.
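
One common way to provide such an ambient context is a thread-local stack that a context manager pushes to and pops from. The sketch below uses only the standard library; `Mesh`, `current_mesh`, and `parallelize` are hypothetical stand-ins and are not claimed to match how `DeviceMesh` or `parallelize_module` are implemented.

```python
import threading

_state = threading.local()


class Mesh:
    """Hypothetical stand-in for a device mesh usable as a context manager."""

    def __init__(self, name):
        self.name = name

    def __enter__(self):
        stack = getattr(_state, "stack", None)
        if stack is None:
            stack = _state.stack = []
        stack.append(self)
        return self

    def __exit__(self, *exc):
        _state.stack.pop()
        return False


def current_mesh():
    # Innermost active mesh, or an error if none is active.
    stack = getattr(_state, "stack", [])
    if not stack:
        raise RuntimeError("no ambient mesh; pass one explicitly or use `with mesh:`")
    return stack[-1]


def parallelize(module, mesh=None):
    # Fall back to the ambient mesh when none is passed explicitly.
    mesh = mesh if mesh is not None else current_mesh()
    return (module, mesh.name)
```

Inside `with Mesh("tp"):`, a call such as `parallelize("w1")` picks up the mesh without it being threaded through every constructor.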

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134247
Approved by: https://github.com/tianyu-l
2024-10-09 00:19:03 +00:00
de7f32a205 openreg add pin_memory (#135339)
According to `Next steps` in test/cpp_extensions/open_registration_extension/README.md, add pinned memory and HostAllocator.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135339
Approved by: https://github.com/albanD
2024-10-09 00:07:59 +00:00
8893881867 Invalidate StorageImpl instances when tensor is overwritten with cudagraphs (#125264)
Fixes #104435

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125264
Approved by: https://github.com/ezyang

Co-authored-by: eellison <elias.ellison@gmail.com>
2024-10-09 00:05:52 +00:00
eqy
cba3f4f5e3 [CUDA] Clean up asserts in test_cuda.py (#137034)
Switch some `assertTrue` tests to `assertEqual` etc for debuggability in logs
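
A small standard-library illustration of the debuggability difference, using made-up values rather than anything from `test_cuda.py`: `assertTrue(a == b)` only reports that the expression was false, while `assertEqual(a, b)` reports both operands. `_Probe` and `failure_message` are hypothetical helpers.

```python
import unittest


class _Probe(unittest.TestCase):
    def runTest(self):
        pass


def failure_message(check):
    # Run one assertion and return the text of the failure, if any.
    try:
        check(_Probe())
    except AssertionError as exc:
        return str(exc)
    return None


msg_true = failure_message(lambda t: t.assertTrue(3 == 4))
msg_equal = failure_message(lambda t: t.assertEqual(3, 4))
```

Only `msg_equal` contains the two values that differed, which is what makes a CI log actionable.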

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137034
Approved by: https://github.com/Skylion007
2024-10-08 23:16:19 +00:00
b16167874d Minor SGD docs clarification fixing #137356, #137352 (#137528)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137528
Approved by: https://github.com/albanD
2024-10-08 23:05:08 +00:00
2a1829d728 Error message for allow_in_graph decorator and arbitrary function combo (#135972)
Fixes #103615

Quick error message for non-allowed allow_in_graph decorator and arbitrary function combo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135972
Approved by: https://github.com/anijain2305
2024-10-08 22:48:38 +00:00
4aed81c0db Add support for cat memory planning mms with max autotune (#132554)
When we are autotuning matmuls, the aten.mm and the Triton template choices take in an externally allocated tensor that can be a view into a pre-planned aten.cat. So long as the output shape and stride of the matmul match the slice of the cat we're planning, we can realize the mm directly into the cat.
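
The shape-and-stride condition can be sketched with plain integer arithmetic. This is an illustration under assumptions, not Inductor's actual check: the helpers below are hypothetical, and the cat output is assumed to be contiguous.

```python
def contiguous_strides(size):
    # Row-major strides for a contiguous tensor of the given size.
    strides, step = [], 1
    for dim in reversed(size):
        strides.append(step)
        step *= dim
    return list(reversed(strides))


def cat_slice_layout(cat_size, dim, start, length):
    # Size, stride, and storage offset of cat_out.narrow(dim, start, length).
    size = list(cat_size)
    size[dim] = length
    strides = contiguous_strides(cat_size)
    return size, strides, start * strides[dim]


def can_write_into_slice(out_size, out_stride, cat_size, dim, start):
    # The mm can be realized in place only if its layout matches the slice.
    size, strides, _ = cat_slice_layout(cat_size, dim, start, out_size[dim])
    return list(out_size) == size and list(out_stride) == strides
```

For a cat of a `[4, 8]` and a `[4, 16]` tensor along dim 1, the first slice has size `[4, 8]` and stride `[24, 1]`, so an mm output with the contiguous stride `[8, 1]` would not match and would need a copy.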

Discussion for reviewers:

It feels a little bit odd that in the existing code we set the output of aten.mm as [FlexibleLayout](bcac71517c/torch/_inductor/kernel/mm.py (L156)). While this is correct, it might lead to passing non-performant output strides to cuBLAS. I guess this is better than a copy? Not sure. We could also introduce a Layout that denotes a fixed shape and stride for which we control allocation:

```
class AllocatedFixedLayout(FixedLayout)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132554
Approved by: https://github.com/jansel
2024-10-08 22:36:46 +00:00
02013da038 Lift restriction on training IR for unflatten (#137470)
Differential Revision: [D64025578](https://our.internmc.facebook.com/intern/diff/D64025578)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137470
Approved by: https://github.com/avikchaudhuri
2024-10-08 22:30:24 +00:00
81c8a8ada6 [ONNX] Bump onnxscript in CI (#137497)
To 0.1.0.dev20241008
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137497
Approved by: https://github.com/titaiwangms
2024-10-08 21:56:30 +00:00
76ab1ab665 Fix autograd.Function + NJT when an output grad is None (#136875)
For `autograd.Function`, the engine will try to allocate correctly-shaped zeros for `None` grads (i.e. in the case where the output isn't used downstream). It determines the shape of these zeros from the `VariableInfo` entry, which is derived from the forward output shape. For the NJT forward output case, the size info stored will contain a nested int, and calling `zeros()` with this size throws:
```
RuntimeError: .../build/aten/src/ATen/RegisterCPU.cpp:5260: SymIntArrayRef expected to contain only concrete integers
```

This PR fixes this by storing the full tensor in the `VariableInfo` for the nested case and calling `zeros_like()` to allocate correctly-shaped zeros. This is pretty inefficient; ideally we would want to save just the NJT shape and be able to construct zeros from it, but this requires factory function support for nested ints (WIP). So this is a short-term fix until we have that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136875
Approved by: https://github.com/soulitzer
2024-10-08 21:01:36 +00:00
5e3e1c0151 Revert "[FSDP2] Required mesh_dim_names for HSDP (#137436)"
This reverts commit 5fb30df7d6ecc25cc7c4c17a8a33d14ddaa7c279.

Reverted https://github.com/pytorch/pytorch/pull/137436 on behalf of https://github.com/malfet due to Looks like it broke distributed testing, see https://github.com/pytorch/pytorch/actions/runs/11239761070/job/31249854217 ([comment](https://github.com/pytorch/pytorch/pull/137436#issuecomment-2400794929))
2024-10-08 20:50:49 +00:00
b499083a91 Get rid of quadratic tests to has_same_metadata (#136857)
Fixes https://github.com/pytorch/pytorch/issues/136852

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136857
Approved by: https://github.com/isuruf, https://github.com/bdhirsh
2024-10-08 20:49:23 +00:00
d34b617bb9 Revert "[Dynamo] Trace enter/exit of TorchFunctionModes (#135422) (#137114)"
This reverts commit 51bc839b94829f176e3c1b7f62e3448d6028c480.

Reverted https://github.com/pytorch/pytorch/pull/137114 on behalf of https://github.com/huydhn due to The top of the stack has been reverted but it leaves trunk in a broken state, so I try to revert the rest of the stack ([comment](https://github.com/pytorch/pytorch/pull/137114#issuecomment-2400765603))
2024-10-08 20:33:17 +00:00
8c937445ee Revert "[Dynamo] Remove ignored modes workaround (#135502) (#137115)"
This reverts commit b1fd7708bd81d8d52908bf4459ed024471abd803.

Reverted https://github.com/pytorch/pytorch/pull/137115 on behalf of https://github.com/huydhn due to The top of the stack has been reverted but it leaves trunk in a broken state, so I try to revert the rest of the stack ([comment](https://github.com/pytorch/pytorch/pull/137114#issuecomment-2400765603))
2024-10-08 20:33:17 +00:00
e5f9131327 Revert "[Dynamo] Remove ignored modes from torch function mode stack guard (#135503) (#137116)"
This reverts commit f9d69cde88ad972ee8fc24413dd0740f4e21562d.

Reverted https://github.com/pytorch/pytorch/pull/137116 on behalf of https://github.com/huydhn due to The top of the stack has been reverted but it leaves trunk in a broken state, so I try to revert the rest of the stack ([comment](https://github.com/pytorch/pytorch/pull/137114#issuecomment-2400765603))
2024-10-08 20:33:17 +00:00
2d18c2d5e7 Revert "[Dynamo] Ensure torch function modes are dispatched on builtin ops (#137117)"
This reverts commit 941be418d8ec3290d0e3bae0e16a443be26b3075.

Reverted https://github.com/pytorch/pytorch/pull/137117 on behalf of https://github.com/huydhn due to The top of the stack has been reverted but it leaves trunk in a broken state, so I try to revert the rest of the stack ([comment](https://github.com/pytorch/pytorch/pull/137114#issuecomment-2400765603))
2024-10-08 20:33:17 +00:00
cc75ac084f Add test for https://github.com/pytorch/pytorch/issues/137087 (#137090)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137090
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-10-08 20:17:03 +00:00
5349ee2934 Revert "Parametrize test_lstm_packed (#137447)"
This reverts commit d5493ed579ba41015ffef981832a3f04f94bb6f8.

Reverted https://github.com/pytorch/pytorch/pull/137447 on behalf of https://github.com/huydhn due to Need to up few more instance to 4xlarge, revert to reland ([comment](https://github.com/pytorch/pytorch/pull/137447#issuecomment-2400737602))
2024-10-08 20:15:24 +00:00
3c1ab93678 Log chromium event for automatic dynamic reasons (#137491)
Log a chromium event so that we can see the reasons for invoking automatic dynamic shapes in aggregate internally.

Run the following code:
```
import torch
@torch.compile(backend="eager")
def foo(t, x):
    return t.sin() + x

torch._dynamo.config.automatic_dynamic_shapes = True
torch._dynamo.config.assume_static_by_default = True
# Change size
x = torch.randn([1,2])
foo(x, 2)
x = torch.randn([2,2])
foo(x, 2)
torch._dynamo.reset()
# Change dimensionality
x = torch.randn([1,2])
foo(x, 2)
x = torch.randn([1,2,3])
foo(x, 2)
torch._dynamo.reset()
# Change stride
x = torch.randn([3,3])
foo(x, 2)
x = torch.as_strided(x, [3,3], [2,2])
foo(x, 2)
torch._dynamo.reset()
# Change scalar
x = torch.randn([1,2])
foo(x, 2)
foo(x, 3)
```

Internal link to perfetto:
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom%2Fchromium_events.json#!/viewer?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom%2Fchromium_events.json&local_cache_key

The events look like this:
<img width="639" alt="image" src="https://github.com/user-attachments/assets/23916333-7f24-47c7-934b-201f33aebeab">
<img width="638" alt="image" src="https://github.com/user-attachments/assets/9f927c8d-04bb-4431-8802-685b032df656">
<img width="640" alt="image" src="https://github.com/user-attachments/assets/342e9e11-0dfc-422d-bd0b-01a8574d38ba">
<img width="635" alt="image" src="https://github.com/user-attachments/assets/dc2c97cd-7180-4069-b019-d6e63ee490bc">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137491
Approved by: https://github.com/Skylion007, https://github.com/oulgen
2024-10-08 19:53:12 +00:00
cyy
a2396b2dd8 [2/N] Fix extra warnings brought by clang-tidy-17 (#137459)
Follows #137407

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137459
Approved by: https://github.com/Skylion007
2024-10-08 19:05:02 +00:00
b41fc14072 compile time benchmarks for AOTDispatcher (partitioner) (#136760)
compile time benchmark for the min cut partitioner. I'm hoping that this is a reasonable benchmark because:

(1) it consists of a single input + many weights that are used sequentially
(2) contains a mix of recompute vs non-recomputed ops (matmul + sin)
(3) it is relatively simple

from running locally:
```
collecting compile time instruction count for aotdispatcher_partitioner_cpu
compile time instruction count for iteration 0 is 21764219181
compile time instruction count for iteration 1 is 12475020009
compile time instruction count for iteration 2 is 12463710140
compile time instruction count for iteration 3 is 12455676489
compile time instruction count for iteration 4 is 12451344330
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136760
Approved by: https://github.com/ezyang
ghstack dependencies: #136759
2024-10-08 18:44:13 +00:00
48b8f818b2 compile time benchmarks for AOTDispatcher (inference/training/subclasses) (#136759)
this adds a few compile time benchmarks for some disjoint paths in AOTDispatcher:

(1) inference vs training code paths
(2) "subclasses" vs "no subclasses" codepaths

Also see https://github.com/pytorch/pytorch/pull/136760 for a partitioner benchmark (I'm not sure why ghstack didn't display the stack nicely)

I ran locally, and got these numbers on the 4 paths:
```
collecting compile time instruction count for aotdispatcher_inference_nosubclass_cpu
compile time instruction count for iteration 0 is 11692348671
compile time instruction count for iteration 1 is 3026287204
compile time instruction count for iteration 2 is 3011467318
compile time instruction count for iteration 3 is 3004485935
compile time instruction count for iteration 4 is 3003087410
collecting compile time instruction count for aotdispatcher_training_nosubclass_cpu
compile time instruction count for iteration 0 is 6068003223
compile time instruction count for iteration 1 is 5585418102
compile time instruction count for iteration 2 is 5581856618
compile time instruction count for iteration 3 is 5581651794
compile time instruction count for iteration 4 is 5578742619
collecting compile time instruction count for aotdispatcher_inference_subclass_cpu
compile time instruction count for iteration 0 is 8634984264
compile time instruction count for iteration 1 is 8633467573
compile time instruction count for iteration 2 is 8632182092
compile time instruction count for iteration 3 is 8632056925
compile time instruction count for iteration 4 is 8632543871
collecting compile time instruction count for aotdispatcher_training_subclass_cpu
compile time instruction count for iteration 0 is 14737239311
compile time instruction count for iteration 1 is 14734346427
compile time instruction count for iteration 2 is 14736493730
compile time instruction count for iteration 3 is 14734121272
compile time instruction count for iteration 4 is 14733852882
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136759
Approved by: https://github.com/laithsakka
2024-10-08 18:44:13 +00:00
53af729a66 add meta for _segment_reduce_backward (#137442)
reland of https://github.com/pytorch/pytorch/pull/124988

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137442
Approved by: https://github.com/albanD
2024-10-08 18:40:06 +00:00
1aac1ffce1 Don't generate implicit value ranges for missing symbols. (#136667)
Instead, call back to a missing handler when needed. This greatly speeds things up when the value ranges dict is large. The missing handler is needed because nested ints don't have VRs, but symbolic sizes involving them occasionally show up in compute.
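
The general pattern, sketched with the standard library only: rather than eagerly inserting an entry for every symbol that lacks one, look symbols up lazily and fall back to a handler on a miss. `make_range_lookup` and `unbounded` are hypothetical names, and returning an unbounded range is an assumption made for illustration; the commit message does not say what the real handler returns.

```python
def make_range_lookup(var_ranges, missing_handler):
    """Return a lookup that consults the dict first and the handler on a miss."""
    def lookup(symbol):
        try:
            return var_ranges[symbol]
        except KeyError:
            # No entry is written back, so the dict is never copied or grown.
            return missing_handler(symbol)
    return lookup


calls = []


def unbounded(symbol):
    # Assumed fallback: treat a symbol without a recorded range as unbounded.
    calls.append(symbol)
    return (float("-inf"), float("inf"))


lookup = make_range_lookup({"s0": (2, 64)}, unbounded)
```

The handler runs only for the rare symbols that are actually missing, so the common path costs a single dict lookup.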

```
TORCHDYNAMO_EXTENDED_DEBUG_CREATE_SYMBOL="s11" TORCH_LOGS=dynamic PYTORCH_TEST_WITH_DYNAMO=1 python test/test_nestedtensor.py TestNestedTensorAutogradCPU.test_dropout_backward_jagged_cpu
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136667
Approved by: https://github.com/isuruf
ghstack dependencies: #136429
2024-10-08 18:12:57 +00:00
90bed32b98 Introduce torch.sym_sum (#136429)
Partially addresses https://github.com/pytorch/pytorch/issues/128150

When you have big sums of values, we end up computing long chains of
binary addition in our FX graph representation.  Not only is this ugly,
it also is quadratic, as the sympy.Add constructor is O(N) in number
of arguments.  Instead, ensure that we maintain the summation as a
single FX node so we can do the entire addition all in one go.
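
A back-of-the-envelope cost model for the claim above, assuming (as the message states) that constructing a sum of k arguments costs O(k), and further assuming that each step of the chain builds a flattened sum of all terms seen so far. The function names are hypothetical and no sympy calls are involved.

```python
def chained_cost(n):
    # ((x1 + x2) + x3) + ...: step k builds a sum with k arguments,
    # so the total is 2 + 3 + ... + n, which grows quadratically.
    return sum(range(2, n + 1))


def single_cost(n):
    # One n-ary sum touches each argument once.
    return n
```

For `n = 1000` the chain costs 500499 units against 1000 for the single n-ary sum, which is the gap that keeping the summation as one FX node avoids.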

update_hint_regression benchmark, before and after:

```
update_hint_regression,compile_time_instruction_count,2648328980
update_hint_regression,compile_time_instruction_count,2563748678
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136429
Approved by: https://github.com/isuruf
2024-10-08 18:12:57 +00:00
3bf6594d13 Log compile ids to pt2_remote_cache and pt2_compile_events (#137431)
Log the current compilation id for all relevant samples for these two tables, so we can have a 1:1 analog with dynamo_compile.

Differential Revision: [D63900826](https://our.internmc.facebook.com/intern/diff/D63900826/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137431
Approved by: https://github.com/oulgen
2024-10-08 18:04:48 +00:00
758dbac308 Add type check for ord in torch.linalg.vector_norm() and torch.linalg.matrix_norm() (#137463)
fixes #137424, fixes #137460
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137463
Approved by: https://github.com/lezcano
2024-10-08 17:53:56 +00:00
d87835ac32 [Profiler] Clear Out Dangling AppendOnlyLists (#137450)
Summary: There are two instances of AppendOnlyLists that don't get cleared after we have finished iterating through the forward lists. This can be potentially dangerous since they can last for the entirety of the lifespan of the profiler. We have also seen crashes during the destructor of these variables when the profiler is exiting. This could possibly be related to the fact that the default constructor assumes some valid state of these lists rather than whatever state they are in when profiler is exiting.

Test Plan: Ran with profile_memory=True to make sure allocations queue gets cleared correctly and trace+workload ran successfully

Differential Revision: D64010911

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137450
Approved by: https://github.com/aaronenyeshi
2024-10-08 17:48:59 +00:00
7e8dace0de Revert "[ROCm] remove caffe2 from hipify (#137157)"
This reverts commit 40d826074546558f6665a4c118335a7725503cac.

Reverted https://github.com/pytorch/pytorch/pull/137157 on behalf of https://github.com/xw285cornell due to this is breaking internal where we still use caffe2 ([comment](https://github.com/pytorch/pytorch/pull/137157#issuecomment-2400466131))
2024-10-08 17:45:45 +00:00
a8047564ff Revert "[FlexAttention] Support training bias for eager (#136910)"
This reverts commit 711dacf9845cbc9ea8b3b0fa257309930106712f.

Reverted https://github.com/pytorch/pytorch/pull/136910 on behalf of https://github.com/malfet due to torch.library.custom_op looks weird here and it breaks some internal workloads ([comment](https://github.com/pytorch/pytorch/pull/136910#issuecomment-2400434833))
2024-10-08 17:29:02 +00:00
0b5ade8a12 Revert "[Dynamo] Move flex attention torch function mode to traceable HOP file (#137120)"
This reverts commit 68151fd2889c9752348c2dfdc7c175ee201c0cd3.

Reverted https://github.com/pytorch/pytorch/pull/137120 on behalf of https://github.com/malfet due to Need to revert to be able to revert https://github.com/pytorch/pytorch/pull/136910 ([comment](https://github.com/pytorch/pytorch/pull/137120#issuecomment-2400429265))
2024-10-08 17:26:19 +00:00
2570d77a26 Revert "type _dynamo/trace_wrapped_higher_order_op.py (#137354)"
This reverts commit a9f7b905de2217eedee6723b0eb83b3ac7406c26.

Reverted https://github.com/pytorch/pytorch/pull/137354 on behalf of https://github.com/malfet due to Need to revert to be able to revert https://github.com/pytorch/pytorch/pull/136910 ([comment](https://github.com/pytorch/pytorch/pull/137354#issuecomment-2400424669))
2024-10-08 17:22:40 +00:00
76c5bdd2cc Revert "[Dynamo] Handle extracted unbound tensor methods (#137227)"
This reverts commit 14eabd69152e31d059444310979625542db2aece.

Reverted https://github.com/pytorch/pytorch/pull/137227 on behalf of https://github.com/malfet due to Need to revert to be able to revert https://github.com/pytorch/pytorch/pull/136910 ([comment](https://github.com/pytorch/pytorch/pull/137227#issuecomment-2400406384))
2024-10-08 17:12:41 +00:00
c88c0e6c65 Revert "[Dynamo] Handle torch function subclass/mode dispatch on generic tensor methods (#137119)"
This reverts commit d255b34c0ac6208633ed5e71d019fa9ae061e1fc.

Reverted https://github.com/pytorch/pytorch/pull/137119 on behalf of https://github.com/malfet due to Need to revert to be able to revert https://github.com/pytorch/pytorch/pull/136910 ([comment](https://github.com/pytorch/pytorch/pull/137119#issuecomment-2400401262))
2024-10-08 17:09:26 +00:00
cc10ef4645 Revert "[Dynamo] add flex attention mode test (#137121)"
This reverts commit 144665d772f7ec014a4a23f460a632a4a4774f4a.

Reverted https://github.com/pytorch/pytorch/pull/137121 on behalf of https://github.com/malfet due to Need to revert to be able to revert https://github.com/pytorch/pytorch/pull/136910 ([comment](https://github.com/pytorch/pytorch/pull/137121#issuecomment-2400389882))
2024-10-08 17:03:34 +00:00
11192ceca4 Revert "[FlexAttention] only calculate grads for buffers that require_grad (#137451)"
This reverts commit 9f9d252971ea1de04d349a0460e39e3bfe824eae.

Reverted https://github.com/pytorch/pytorch/pull/137451 on behalf of https://github.com/malfet due to Need to revert it in order to be able to backout https://github.com/pytorch/pytorch/pull/136910 ([comment](https://github.com/pytorch/pytorch/pull/137451#issuecomment-2400385858))
2024-10-08 17:00:59 +00:00
8184e202d8 Update mutation checking in pattern matcher (#137448)
Fix for https://github.com/pytorch/pytorch/issues/137229

The current mutation checking is complicated because it has to work on pre-grad IR. Once the pre-grad IR has been traced to OpOverloads, checking is much easier. I am also special-casing the auto functional HOP, although, as discussed with @zou3519, it would be nice to have a way of querying HOPs that mimic schemas.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137448
Approved by: https://github.com/zou3519
2024-10-08 16:56:40 +00:00
28493efe6e fix silly mapping issue with torch.Size (#137465)
Test Plan: added test

Differential Revision: D64022949

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137465
Approved by: https://github.com/yushangdi, https://github.com/angelayi
2024-10-08 16:53:15 +00:00
7267363844 [ONNX] Insert contiguous node between transpose and view before calling run_decompositions (#137340)
Works around #136543.

This fix solves the issue only in the context of the ONNX exporter, but the issue also happens in other contexts.

The bug happens when method `run_decompositions` is called. The failing pattern is assumed to be ``view(transpose(x, ...))``. This pattern is replaced by ``view(flatten(transpose(x, ..)))``. By changing the dimensions, the strides are updated as well and `run_decompositions` does not fail anymore. It would be inefficient on a 1D tensor but then transpose would not be used. The extra node appears in the final onnx graph but is removed after optimization. The final onnx graph should not be impacted and no performance loss should be observed for the onnx model.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137340
Approved by: https://github.com/justinchuby

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2024-10-08 16:45:59 +00:00
5fb30df7d6 [FSDP2] Required mesh_dim_names for HSDP (#137436)
Two changes:
1. Require `mesh_dim_names` if using HSDP
2. Pass only the shard mesh to `fsdp_pre_all_gather`

Change 1 is technically BC breaking, but it should not be hard to fix on the user side.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137436
Approved by: https://github.com/weifengpy, https://github.com/wz337
2024-10-08 16:31:18 +00:00
0bfedb13e7 Remove aoti_torch_zero_ codegen (#137371)
Summary: aoti_torch_zero_ codegen breaks AOTI FC, see discussion in D63281798.

Test Plan: CI

Differential Revision: D63916320

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137371
Approved by: https://github.com/jingsh
2024-10-08 15:57:41 +00:00
c04b35a5ae [AOTI] Add standalone version of TORCH_CHECK (#136873)
Summary: In the standalone mode, TORCH_CHECK throws std::runtime_error, instead of c10::Error. The goal is to cut dependency on libtorch. Specifically, AOTI generates CPU code which may call ATen vectorization ops and we need to make sure those ops are self-contained.

Differential Revision: [D63911928](https://our.internmc.facebook.com/intern/diff/D63911928)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136873
Approved by: https://github.com/albanD, https://github.com/chenyang78
2024-10-08 15:30:01 +00:00
d5493ed579 Parametrize test_lstm_packed (#137447)
The test runs all of its combinations (512) sequentially, so it takes more than 30 minutes to finish, or times out on ASAN after one hour.  Parametrizing it breaks it up, so individual tests can finish and no longer need to be marked as slow.

Also, the test seems to run OOM on a 2xlarge with a `std::bad_alloc` memory error.  Maybe this would also fix the issue (pending CI testing).
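
A minimal stdlib-only sketch of the idea of splitting one looping test into one test per combination. It does not use the PyTorch test decorators, and everything in it is an assumption for illustration: the class name, the helper names, and the choice of nine boolean flags (picked only because 2**9 = 512).

```python
import itertools
import unittest

NUM_FLAGS = 9  # assumption: nine on/off options, i.e. 2**9 = 512 combinations


def check_combination(flags):
    """Stand-in for the real test body; here it only validates the input."""
    assert len(flags) == NUM_FLAGS


class LstmPackedSketch(unittest.TestCase):
    """Hypothetical test class that receives one method per combination."""


def _make_test(flags):
    # Bind `flags` per test so each generated method checks one combination.
    def test(self):
        check_combination(flags)
    return test


for combo in itertools.product((False, True), repeat=NUM_FLAGS):
    suffix = "".join("1" if flag else "0" for flag in combo)
    setattr(LstmPackedSketch, f"test_combo_{suffix}", _make_test(combo))

suite = unittest.defaultTestLoader.loadTestsFromTestCase(LstmPackedSketch)
```

Because each combination is its own test, a runner can shard, time out, or retry them individually instead of treating the whole sweep as one slow test.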

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137447
Approved by: https://github.com/albanD, https://github.com/malfet
2024-10-08 15:26:27 +00:00
3e2f276a14 Fix to() on non-contiguous NJTs (#137124)
Called out via torchrec integration: `lengths` is not handled properly.

Future work (not related to non-contiguous NJTs): #137275
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137124
Approved by: https://github.com/soulitzer
ghstack dependencies: #137030, #137031
2024-10-08 15:11:05 +00:00
a77bb8527c Make index check in applySelect support deferred runtime assert (#137046)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137046
Approved by: https://github.com/albanD
2024-10-08 14:31:47 +00:00
9b2e453e24 Migrate ARM64 Linux binary jobs to runner determinator (#136666)
Updates ARM64 Linux binary jobs to use the runner determinator.

Issue: pytorch/ci-infra#265
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136666
Approved by: https://github.com/ZainRizvi
2024-10-08 12:14:06 +00:00
76dca1fef3 [c10d] separate the codes for GPU stream synchronization and CPU thread synchronization (#137295)
code
Summary:
This PR should not change the existing behavior of work.wait(), just
separate the stream synchronization code from the CPU busy wait code.

Also, remove the need of a private synchronization function.

In the longer term, we would like to give users the flexibility to bypass the watchdog thread and handle collective errors themselves.

Test Plan:
python test/distributed/test_c10d_nccl.py NcclErrorHandlingTest

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137295
Approved by: https://github.com/kwen2501
2024-10-08 08:53:47 +00:00
9f9d252971 [FlexAttention] only calculate grads for buffers that require_grad (#137451)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137451
Approved by: https://github.com/Chillee
2024-10-08 07:36:38 +00:00
59cdd8ddf1 Bump optree version to 0.13.0 to enable Python 3.13 and Python 3.13t support (#137396)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137396
Approved by: https://github.com/albanD
2024-10-08 06:49:04 +00:00
493d0eeef3 Revert "Add support for cat memory planning mms with max autotune (#132554)"
This reverts commit d558ec07300defee24dd4a83ab4b387a39ea2176.

Reverted https://github.com/pytorch/pytorch/pull/132554 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it is failing on ROCm ([comment](https://github.com/pytorch/pytorch/pull/132554#issuecomment-2398946854))
2024-10-08 06:21:06 +00:00
8ca15e87f5 Update torchbind expecttest from landrace (#137453)
Update expecttest from torch function mode PR landrace (torch function mode changes output code slightly)

Attempted to revert the stack but there were conflicts
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137453
Approved by: https://github.com/huydhn
2024-10-08 06:01:29 +00:00
bb31e3f57e Add original forward names to schema so that prettify pass works (#136887)
When we run_decomp, we retrace if it is training IR. As a result, we need to reliably store the original forward names when we run decomp.

Differential Revision: [D63064453](https://our.internmc.facebook.com/intern/diff/D63064453/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136887
Approved by: https://github.com/angelayi
2024-10-08 04:21:02 +00:00
46525abb71 OpenReg: support multiple executors (#136249)
In PR https://github.com/pytorch/pytorch/pull/135646 we split the daemon into driver/executor. However, the current executor stands for all devices and allocates memory for all of them together. To better simulate device behavior, this PR supports multiple executors, where each executor stands for one device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136249
Approved by: https://github.com/FFFrog, https://github.com/albanD
2024-10-08 01:37:08 +00:00
395e098209 type _dynamo/mutation_guard.py (#137350)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137350
Approved by: https://github.com/Skylion007
2024-10-08 00:04:34 +00:00
52ba40c6f6 [ROCm][AOTI] add CK backend (#135641)
Companion to #134379

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135641
Approved by: https://github.com/ColinPeppler, https://github.com/chenyang78

Co-authored-by: Colin Peppler <colinpeppler@meta.com>
2024-10-07 23:53:58 +00:00
2c0b11c79b forward-fix D63916220 breaking test_cutlass_backend in FBCode (#137435)
Summary: It seems like the import path is different from FBCode & OSS. Wondering how to consolidate them.

Test Plan:
```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:cutlass_backend

Tests finished: Pass 2. Fail 0. Fatal 0. Skip 33. Build failure 0
```

Differential Revision: D63991961

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137435
Approved by: https://github.com/jovianjaison
2024-10-07 23:44:04 +00:00
812f286d4a Delete duplicate bindings in torch/csrc/autograd/python_torch_functions_manual.cpp (#136711)
This change deletes the duplicate binding of `torch._functionalize_mark_mutation_hidden_from_autograd()`; the other definition is here:

5c78c6b05a/torch/csrc/autograd/python_torch_functions_manual.cpp (L630-L636)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136711
Approved by: https://github.com/soulitzer
2024-10-07 23:19:06 +00:00
d558ec0730 Add support for cat memory planning mms with max autotune (#132554)
When we are autotuning matmuls the aten.mm and the triton template choices take in an externally allocated tensor that can be a view into a pre-planned aten.cat. So long as the output shape and stride of the matmul matches the slice of the cat we're planning, we can realize the mm directly into the cat.

Discussion for reviewers:

It feels a little bit odd that in the existing code we set the output of aten.mm as [FlexibleLayout](bcac71517c/torch/_inductor/kernel/mm.py (L156)). While this is correct, it might lead to passing non-performant output strides to cuBLAS. I guess this is better than a copy? Not sure. We could also introduce a Layout that denotes a fixed shape and stride whose allocation we control:

```
class AllocatedFixedLayout(FixedLayout)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132554
Approved by: https://github.com/jansel
2024-10-07 22:49:29 +00:00
01bf350967 Fix bmm_sparse_cuda illegal memory access (#131977)
This PR fixes a bug in `search_end_matrix_indices_cuda_kernel` causing an illegal memory access when calling `bmm_sparse_cuda` on a sparse matrix with no non-zero values in the first batch dimension. Reproducible example:
```py
import torch

ind = torch.tensor([[1], [0], [0]], device="cuda")
val = torch.tensor([1.], device="cuda")
A = torch.sparse_coo_tensor(ind, val, size=(2, 1, 1))
B = torch.zeros((2, 1, 1), device="cuda")
C = torch.bmm(A, B)
```

## Details

In the previous code, we may for example end up with the following situation:
```
i : indices_1D[i]
------------------------------------------
0 : 1                <- start_idx, mid_idx
1 : 1                <- end_idx
...
```
When `target_mat_num = 0`, the next iteration of the while loop will assign `-1` to `end_idx` and thus `(0 + (-1)) >> 1 = -1` to `mid_idx`, causing an access error on line 703. The updated code maintains the invariant `start_idx <= end_idx` and will not go out of bounds.
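
For illustration only, a plain-Python sketch of a binary search that keeps `start_idx <= end_idx` as its loop condition, so `mid_idx` can never become negative. This is not the CUDA kernel from this PR; the function name and the "return -1 when absent" convention are assumptions.

```python
def search_end_index(indices_1d, target_mat_num):
    """Return the last position whose value equals target_mat_num, or -1.

    `indices_1d` must be sorted in non-decreasing order.
    """
    start_idx, end_idx = 0, len(indices_1d) - 1
    found = -1
    # The loop only runs while the window is non-empty, so mid_idx is
    # always a valid position in indices_1d.
    while start_idx <= end_idx:
        mid_idx = (start_idx + end_idx) >> 1
        mid_val = indices_1d[mid_idx]
        if mid_val == target_mat_num:
            found = mid_idx          # remember the hit, keep looking to the right
            start_idx = mid_idx + 1
        elif mid_val < target_mat_num:
            start_idx = mid_idx + 1
        else:
            end_idx = mid_idx - 1    # may reach -1, which simply ends the loop
    return found
```

With the situation from the table above (`indices_1d = [1, 1]`, `target_mat_num = 0`), `end_idx` drops to `-1` after the first iteration and the loop exits without reading out of bounds.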

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131977
Approved by: https://github.com/amjames, https://github.com/pearu, https://github.com/nikitaved
2024-10-07 22:47:34 +00:00
a6707a7303 [dynamo] log all graph breaks to graph_breaks logging artifact (#137244)
We were previously not logging all graph breaks (e.g. data dependent jumps) to the graph_breaks logging artifact.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137244
Approved by: https://github.com/jansel
2024-10-07 22:34:27 +00:00
a9f7b905de type _dynamo/trace_wrapped_higher_order_op.py (#137354)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137354
Approved by: https://github.com/Skylion007, https://github.com/jansel
2024-10-07 21:57:06 +00:00
796c3c3415 Revert "Disallow FakeTensor.data_ptr access in eager mode (#137221)"
This reverts commit 7e13e7dd7e5fc20c0420605aeecb0f902af5326c.

Reverted https://github.com/pytorch/pytorch/pull/137221 on behalf of https://github.com/jovianjaison due to failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/137221#issuecomment-2397957081))
2024-10-07 21:46:13 +00:00
319eda9dfd [inductor] Add API to make post_grad_custom passes cache-able (#137298)
Summary: See https://github.com/pytorch/pytorch/issues/130772

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137298
Approved by: https://github.com/oulgen, https://github.com/eellison
2024-10-07 21:11:54 +00:00
8aa110cb00 [ROCm] Enable int_mm_error tests for rocm 6.0+ (#124999)
This pull request enables the int_mm_error tests for ROCm 6.0+, since #122431 has landed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124999
Approved by: https://github.com/jeffdaily, https://github.com/malfet
2024-10-07 21:10:18 +00:00
46abaa3b0f Increase parallelnative shards to 4 (#137433)
The job times out flakily in trunk as its duration is approaching 3.5h https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=parallelnative

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137433
Approved by: https://github.com/wdvr, https://github.com/malfet
2024-10-07 21:06:34 +00:00
c87c9f0a01 [inductor] Conditionally copy args to cpu to minimize memory overhead of autotuning (#136701)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136701
Approved by: https://github.com/eellison
2024-10-07 19:47:04 +00:00
900f57216f [dynamo] Log a summary of frames Dynamo traced (#137297)
This patch adds logging for all frames Dynamo traced, during each invocation of a Dynamo-optimized function.

Example:
```python
import torch

@torch.compile
def foo():
    x = torch.ones([10])
    def bar():
        y = x + x
        torch._dynamo.graph_break()
        z = y * x
        return z

    return bar(), bar

foo()
foo()
```

Running `TORCH_LOGS="dynamo" python` on the above dumps the following near the very end.
```
......
I1003 12:18:31.058000 177 torch/_dynamo/eval_frame.py:486] starting from foo /Users/ryanguo99/Documents/work/scratch/test.py:4, torchdynamo attempted to trace the following frames: [
I1003 12:18:31.058000 177 torch/_dynamo/eval_frame.py:486]   * foo /Users/ryanguo99/Documents/work/scratch/test.py:4
I1003 12:18:31.058000 177 torch/_dynamo/eval_frame.py:486]   * bar /Users/ryanguo99/Documents/work/scratch/test.py:7
I1003 12:18:31.058000 177 torch/_dynamo/eval_frame.py:486] ]
I1003 12:18:31.064000 177 torch/_dynamo/eval_frame.py:486] starting from foo /Users/ryanguo99/Documents/work/scratch/test.py:4, torchdynamo attempted to trace the following frames: []
......
```

Fixes #118262.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137297
Approved by: https://github.com/williamwen42
2024-10-07 19:44:41 +00:00
f33ffd01f2 [export] fix joint graph metadata (#136011)
Differential Revision: D62652832

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136011
Approved by: https://github.com/tugsbayasgalan
2024-10-07 19:36:44 +00:00
08b84afda9 [inductor] Fix alignment hint for WorkspaceArg (#137429)
Alignment hints refer to the base ptr, not the size.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137429
Approved by: https://github.com/eellison
2024-10-07 19:32:33 +00:00
fe44b6a67f Revert "Add back DistributedDataParallel types that were lost when pyi was removed (#136835)"
This reverts commit 40b09edd87fcbe4e63c4db6399ec758d5c34e1b1.

Reverted https://github.com/pytorch/pytorch/pull/136835 on behalf of https://github.com/jovianjaison due to this pr is causing typecheck errors internally ([comment](https://github.com/pytorch/pytorch/pull/136835#issuecomment-2397661940))
2024-10-07 18:59:41 +00:00
144665d772 [Dynamo] add flex attention mode test (#137121)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137121
Approved by: https://github.com/yanboliang
ghstack dependencies: #137114, #137115, #137116, #137117, #137120, #137227, #137119
2024-10-07 18:55:26 +00:00
d255b34c0a [Dynamo] Handle torch function subclass/mode dispatch on generic tensor methods (#137119)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137119
Approved by: https://github.com/williamwen42
ghstack dependencies: #137114, #137115, #137116, #137117, #137120, #137227
2024-10-07 18:55:26 +00:00
14eabd6915 [Dynamo] Handle extracted unbound tensor methods (#137227)
fixes2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137227
Approved by: https://github.com/williamwen42
ghstack dependencies: #137114, #137115, #137116, #137117, #137120
2024-10-07 18:55:26 +00:00
68151fd288 [Dynamo] Move flex attention torch function mode to traceable HOP file (#137120)
Moves `TransformGetItemToIndex` to a file where dynamo stores other traceable HOP concepts.  (We don't trace through torch.* modules by default)

Tracing through the mode required fixing a bug in dynamo autograd function, which fixed a graph break, which caused the autograd test failures (skipping for now and will file an issue)

Previously those tests were in essence running in eager, because dynamo would fallback due to an arg mismatch error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137120
Approved by: https://github.com/yanboliang
ghstack dependencies: #137114, #137115, #137116, #137117
2024-10-07 18:55:26 +00:00
941be418d8 [Dynamo] Ensure torch function modes are dispatched on builtin ops (#137117)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137117
Approved by: https://github.com/yanboliang, https://github.com/williamwen42
ghstack dependencies: #137114, #137115, #137116
2024-10-07 18:55:26 +00:00
f9d69cde88 [Dynamo] Remove ignored modes from torch function mode stack guard (#135503) (#137116)
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137, #135443, #135444, #135422, #135502

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137116
Approved by: https://github.com/yanboliang
ghstack dependencies: #137114, #137115
2024-10-07 18:55:26 +00:00
b1fd7708bd [Dynamo] Remove ignored modes workaround (#135502) (#137115)
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137, #135443, #135444, #135422

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137115
Approved by: https://github.com/yanboliang
ghstack dependencies: #137114
2024-10-07 18:55:26 +00:00
51bc839b94 [Dynamo] Trace enter/exit of TorchFunctionModes (#135422) (#137114)
This PR implements tracing of with contexts with TorchFunction modes which have the default enter/exit behavior (ie pushing/popping the mode)

Typically the bytecode for a context manager looks like this during a graph break:
1. graph call
2. enter context
3. unsupported code
4. exit context
5. resume call

resume fn structure:
1. enter context
2. jump
...
3. exit context

The issue with torch function modes is that side effects will replay any mutations to the torch function stack performed during tracing. So, we do not need to enter and exit around the unsupported code in the original function (doing so would result in a duplicate torch function mode entry during execution of the unsupported code), and we don't need to enter again in the resume function (the mode that was pushed from the side effects bytecode would still be on the stack).

So for torch function modes the structure of our output code is this:

1. graph call
2. mutate tf mode stack to replay mutations
3. unsupported code
4. on exception restore stack
5. resume function

Then our resume fn looks like this:

1. no-op enter torch function mode
2. jump
3. exit tf mode

To implement the no-op enter of the torch function mode I added torch function mode in polyfill which no-op enters, but normally exits. This is needed because we still want to trace the with context in the resume function, and exit properly (the exit instructions will still be in the function, so we need to generate instructions to set up the context).

Separately from the bytecode, dynamo also tracks contexts on the block stack, which is how the SETUP_* instructions are implemented. Naturally, at a graph break we exit these block stacks to properly reset the contexts entirely, so that we can re-enter around the unsupported code soundly. However, once again, in the torch function mode case, in the event of a graph break we do not want to perform any exit side effects, because we want to preserve the state of the mode stack as is so that we will properly update the stack with the bytecode mentioned in the first section. If we exited here, dynamo would pop the mode off of the symbolic stack and not update the true Python torch function mode stack with the suffix bytecode. All in all, for torch function modes we enter exactly once, update the global torch function mode stack with side effects bytecode, re-read this stack when compiling the resume function, and exit exactly once in the resume function. This matches the semantics of eager exactly.
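
The "enter exactly once, exit exactly once" behavior can be sketched with a toy model in plain Python. This is not Dynamo code and does not use torch; `mode_stack`, `ToyMode`, `NoOpEnter`, and both functions are hypothetical names that only mimic the structure described above.

```python
mode_stack = []  # toy stand-in for the global torch function mode stack


class ToyMode:
    """A mode with the default behavior: push on enter, pop on exit."""

    def __enter__(self):
        mode_stack.append(self)
        return self

    def __exit__(self, *exc):
        mode_stack.pop()
        return False


class NoOpEnter:
    """Wrapper that does nothing on enter but exits the wrapped mode normally."""

    def __init__(self, mode):
        self.mode = mode

    def __enter__(self):
        return self.mode  # the mode is already on the stack

    def __exit__(self, *exc):
        return self.mode.__exit__(*exc)


def original_function_prefix(mode):
    # "Replay mutations": the push that happened during tracing is applied
    # to the real stack instead of entering the context a second time.
    mode_stack.append(mode)
    return len(mode_stack)  # depth seen by the unsupported code


def resume_function(mode):
    with NoOpEnter(mode):
        depth_inside = len(mode_stack)
    return depth_inside  # the exit above popped the mode exactly once


mode = ToyMode()
depth_during_unsupported = original_function_prefix(mode)
depth_in_resume = resume_function(mode)
```

In this model the mode appears on the stack once while the unsupported code runs, once inside the resume function, and the stack is empty afterwards, which is what a single eager `with` block would produce.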
Approved by: https://github.com/williamwen42
ghstack dependencies: #134732, #133137, #135443, #135444

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137114
Approved by: https://github.com/yanboliang
2024-10-07 18:55:26 +00:00
ff95ff5d38 type _dynamo/profiler.py (#137351)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137351
Approved by: https://github.com/Skylion007
2024-10-07 18:54:33 +00:00
aa145dead8 [FSDP2] Fixed mistargeted backward prefetch (#137348)
If there is an `unshard` (top-half) without a `wait_for_unshard` (bottom-half), then the next iteration's `unshard` will be a no-op. This can unexpectedly not propagate the optimizer update on the sharded parameters to the unsharded parameters, so it is better to clear that `unshard` at the end of backward.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137348
Approved by: https://github.com/weifengpy
2024-10-07 18:10:09 +00:00
01c07e7864 Revert "[BE][Ez]: Update cudnn_frontend submodule to v1.7.0 (#136920)"
This reverts commit 8dddd456794f82db5b4e807e9aed1919d3a832da.

Reverted https://github.com/pytorch/pytorch/pull/136920 on behalf of https://github.com/drisspg due to Breaks sdpa with bias support, will switch to newer patch version when released ([comment](https://github.com/pytorch/pytorch/pull/136920#issuecomment-2397548622))
2024-10-07 17:56:57 +00:00
cyy
0c0d8c8ff0 [1/N] Fix extra warnings brought by clang-tidy-17 (#137407)
Before we can use clang-tidy-17
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137407
Approved by: https://github.com/Skylion007, https://github.com/aaronenyeshi
2024-10-07 17:53:59 +00:00
ceb4ed8450 [AOTI][Tooling][10/n] Add scalar and symbolic type input debug printing support (#137323)
Summary:
- Further added more types for debug value dumping.

- Add a test case for symint inputs for the debug printer. In real production model use cases, "unbacked symints" (those 'u0', 's0', etc.) can be helpful if we can examine their values.

Test Plan:
```
AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=2  TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_aoti_debug_printer_sym_inputs_abi_compatible_cuda
```

Differential Revision: D63864708

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137323
Approved by: https://github.com/chenyang78
2024-10-07 17:41:40 +00:00
04e48ac562 [inductor] Refactor prefix to make it easy to create subclass of PythonWrapper (#137198)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137198
Approved by: https://github.com/jansel
ghstack dependencies: #137191, #137193
2024-10-07 17:20:58 +00:00
e2b72348d0 [inductor] Reuse the subgraph if accessed via same get_attr node (#137193)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137193
Approved by: https://github.com/jansel
ghstack dependencies: #137191
2024-10-07 17:20:58 +00:00
7a5eaecd92 [inductor] Correctly keep track of the graph_input_names (#137191)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137191
Approved by: https://github.com/jansel
2024-10-07 17:20:53 +00:00
14b4099521 [FSDP2] support torch._foreach_copy_(float8) for fully_shard(Float8Linear) (#135955)
This PR unblocks the unit test with a single Float8Linear module. It fixes the following error:
```
torch._foreach_copy_(foreach_copy_dsts, all_gather_inputs)
[rank0]:E0913 13:44:29.829000 2179476 torch/testing/_internal/common_distributed.py:671] RuntimeError: "foreach_tensor_copy" not implemented for 'Float8_e4m3fn'
```

Differential Revision: [D63961071](https://our.internmc.facebook.com/intern/diff/D63961071)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135955
Approved by: https://github.com/vkuzo, https://github.com/eqy
2024-10-07 16:36:31 +00:00
33461592e2 [TLParse] Include cache hit/miss/bypass in the report name (#137282)
Makes it easier to tell at a glance

https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmp1xoGc1/index.html

<img width="398" alt="image" src="https://github.com/user-attachments/assets/7ed111cb-46d8-4442-a1b2-037d0a8decd8">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137282
Approved by: https://github.com/jamesjwu
2024-10-07 16:00:00 +00:00
4db199f15f Implement Remote AOTAutogradCache (#137278)
Summary: Implement Remote AOTAutogradCache. It uses all the same tech as Remote FXGraphCache, just with its own name.

Test Plan:
Run benchmark:
TORCHINDUCTOR_AUTOGRAD_REMOTE_CACHE=1 TORCHINDUCTOR_FX_GRAPH_REMOTE_CACHE=1 TORCHINDUCTOR_AUTOGRAD_CACHE=0 TORCHINDUCTOR_FX_GRAPH_CACHE=0 TORCH_LOGS=+torch._functorch._aot_autograd.autograd_cache buck run mode/opt benchmarks/dynamo:torchbench -- --training --backend=inductor --only nanogpt --repeat 5 --performance --cold-start-latency

See that it cache hits even with local cache removed.

Results show up in remote cache logs https://fburl.com/scuba/pt2_remote_cache/5893dbaj

New unit tests

Reviewed By: oulgen

Differential Revision: D63323958

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137278
Approved by: https://github.com/oulgen
2024-10-07 15:38:54 +00:00
f80ed0b831 [export] Custom op meta kernel generation (two pass) (#137277)
Summary: Prototyping the custom op meta kernel generation. Rest of the changes are in fbcode/scripts/angelayi

Test Plan: followup diff (D63837739)

Differential Revision: D63837740

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137277
Approved by: https://github.com/zou3519
2024-10-07 15:34:19 +00:00
e20e7a8c38 Fixed developer setup issue in open_registration_extension (#137355)
This PR fixes an issue where, when running `python setup.py develop`, the `open_registration_extension` self-contained example does not build due to the following:

```
error: 'synchronizeStream' overrides a member function but is not marked 'override'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137355
Approved by: https://github.com/albanD, https://github.com/spzala
2024-10-07 15:25:37 +00:00
8c3ab21490 multiprocessing.spawn: allow a grace period when shutdown (#131278)
When one process fails, the others are immediately killed. This prevents the other processes from doing necessary cleanup or dumping debug information (in particular, the NCCL flight recorder).

This PR adds a grace period. Default behavior is unchanged.
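
A minimal sketch of what a grace period on shutdown can look like, assuming process objects with the `multiprocessing.Process` interface (`is_alive`, `join`, `terminate`, `kill`). The function names and the exact escalation order are assumptions for illustration, not the code added by this PR.

```python
import time


def _join_all(processes, timeout):
    """Wait up to `timeout` seconds in total for all processes to exit."""
    deadline = time.monotonic() + timeout
    for proc in processes:
        proc.join(max(0.0, deadline - time.monotonic()))


def shutdown_with_grace(processes, grace_period=None):
    """Stop the remaining processes after a peer has failed.

    With grace_period=None there is no waiting. Otherwise the peers first
    get `grace_period` seconds to exit on their own, are then asked to
    terminate, and are killed only if they outlive a second grace period.
    """
    if grace_period is not None:
        _join_all(processes, grace_period)
    for proc in processes:
        if proc.is_alive():
            proc.terminate()
    _join_all(processes, grace_period or 0.0)
    for proc in processes:
        if proc.is_alive():
            proc.kill()
            proc.join()
```

The first wait is what lets a healthy process finish its own cleanup or dump debug state before any signal is sent.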
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131278
Approved by: https://github.com/albanD
2024-10-07 12:37:34 +00:00
a063a82c8b [redo] Fp8 support for item() with cuda, index_select, and fill_ cpu (#137341)
Summary:

Redo of https://github.com/pytorch/pytorch/pull/128780, easier to copy-paste.

Test Plan: CI

Reviewers:

Subscribers:

Tasks:

Tags:

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137341
Approved by: https://github.com/eqy
2024-10-07 00:58:51 +00:00
d1b87e26e5 [BE] Delete empty files (#137376)
Discovered by running
```
 % find aten -type f -size 0
aten/src/ATen/native/quantized/cpu/qnnpack/wrappers/dummy.c
aten/src/ATen/native/vulkan/api/StringUtil.cpp
aten/src/ATen/native/LegacyBridge.cpp
aten/src/ATen/function_wrapper.py
aten/src/ATen/cudnn/Exceptions.h
```

Most of them were added by b774ce54f8

Remove reference to LegacyBridge.cpp from `aten_native_source_non_codegen_list`:
f42f63ee86/build_variables.bzl (L1317)

And reference to `native/quantized/cpu/qnnpack/wrappers/dummy.c` from f42f63ee86/aten/src/ATen/native/quantized/cpu/qnnpack/buckbuild.bzl (L440)
Which seems to be a bug from some ancient Android toolchain

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137376
Approved by: https://github.com/kit1980, https://github.com/eqy, https://github.com/seemethere, https://github.com/jianyuh, https://github.com/Skylion007
2024-10-06 18:59:04 +00:00
0eba7e5451 Revert runtime numeric check in inductor due to increased compilation time (#137324)
Summary:
This diff reverts D63438718.
It caused a latency regression on multiple models.

Test Plan: NA

Differential Revision: D63872515

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137324
Approved by: https://github.com/xuzhao9
2024-10-06 05:23:24 +00:00
1dc1b85714 [export] Move swap to a different file (#137134)
Refactor so that unflattener doesn't become too messy

Differential Revision: [D63719648](https://our.internmc.facebook.com/intern/diff/D63719648/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137134
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #136191, #137102
2024-10-06 04:28:18 +00:00
fa9cd46d12 [export] Update swap's forward function (#137102)
Downstream APS code was failing to run the previously swapped module because of some fx.GraphModule forward function weirdness (P1594789677). So to fix this, I just attached a custom forward function which matches the unflattened module's forward function.

Differential Revision: [D63683422](https://our.internmc.facebook.com/intern/diff/D63683422/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137102
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #136191
2024-10-06 04:25:36 +00:00
52d7704b32 [export] Add optimization passes (#136191)
Added an optimization pass to the swap function which removes extraneous pytrees. Currently it removes the pytree flatten/unflatten calls between modules in very specific scenarios (all the inputs of one module go into the other).

Future work can be to remove the input pytree.flatten if the inputs go directly into an unflatten, and output pytree unflatten if the outputs are directly from a pytree.flatten.

Differential Revision: [D62879820](https://our.internmc.facebook.com/intern/diff/D62879820)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136191
Approved by: https://github.com/avikchaudhuri
2024-10-06 04:22:42 +00:00
ad4e91acfe [fsdp2] based on device, use stream and Event (#136843)
Currently FSDP2 supports only CUDA. Other backends that need FSDP2 do not work, because streams and events are based on CUDA. To support other backends, use
`_get_device_handle` with the device type to get the class, and use that class
for streams and events.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136843
Approved by: https://github.com/awgu
2024-10-06 04:17:47 +00:00
4061910ba2 Have Triton CPU backend respect max_autotune setting (#137276)
We would previously do it regardless of the setting's value.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137276
Approved by: https://github.com/jansel, https://github.com/desertfire
2024-10-06 03:03:33 +00:00
711dacf984 [FlexAttention] Support training bias for eager (#136910)
Add training bias eager implementation, take over the original POC from https://github.com/pytorch/pytorch/pull/136076

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136910
Approved by: https://github.com/Chillee
2024-10-05 19:34:57 +00:00
d073223663 turn CompilationCallbackHandler into dataclass (#137312)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137312
Approved by: https://github.com/Skylion007
ghstack dependencies: #137181
2024-10-05 19:03:28 +00:00
f54e142c58 Remove references to Rockset in trymerge (#137207)
For the migration to ClickHouse

But also Rockset is not used in trymerge anymore
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137207
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2024-10-05 12:53:22 +00:00
40d8260745 [ROCm] remove caffe2 from hipify (#137157)
- Remove all "MasqueradingAsCUDA" files and classes.
- Do not rename "CUDA" classes to "HIP".

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137157
Approved by: https://github.com/eqy
2024-10-05 12:48:54 +00:00
ca38f28543 [FlexAttention] Adjust BlockMask if reusing the one created at larger seqlen (#137255)
Fixes #136232

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137255
Approved by: https://github.com/Chillee
2024-10-05 07:31:32 +00:00
4830bd0dd4 [Doc] Clarify that NaNs are not equal to each other (#137386)
Fixes https://github.com/pytorch/pytorch/issues/137337

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137386
Approved by: https://github.com/janeyx99, https://github.com/huydhn, https://github.com/kit1980
2024-10-05 06:19:12 +00:00
17718209ea fix specialization bug in unflatten + preserve_module_call_signature (#137363)
Summary: In unflatten, when we generate module calls when their signature has been preserved, we do not pass the original constant args. This can cause strange effects, e.g., if the module is swapped out with itself, we may suddenly go down a different path than the original, or even crash.

Test Plan: added a test

Reviewed By: angelayi

Differential Revision: D63913750

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137363
Approved by: https://github.com/angelayi
2024-10-05 04:26:02 +00:00
6d0d7b6e37 [CI][BE] Restore cuda memory allocator setting (#137383)
By adding `finally:` clause at the end of the test
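
A minimal sketch of the pattern, assuming a hypothetical `settings` dict that stands in for the allocator configuration (the actual test and the setting it restores are not shown in this message):

```python
# Hypothetical global setting; stands in for the real allocator configuration.
settings = {"allocator": "native"}


def run_with_temporary_allocator(name, body):
    """Run body() under a temporary setting and always restore the old one."""
    previous = settings["allocator"]
    settings["allocator"] = name
    try:
        return body()
    finally:
        # Executes even if body() raises, so later tests see the original value.
        settings["allocator"] = previous
```

Without the `finally:` clause, a failing test body would leave the temporary value in place for every test that runs afterwards.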

Might fix https://github.com/pytorch/pytorch/issues/137098#issuecomment-2389172552

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137383
Approved by: https://github.com/ngimel
2024-10-05 04:16:38 +00:00
0067f586ba [audio hash update] update the pinned audio hash (#136968)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136968
Approved by: https://github.com/pytorchbot
2024-10-05 04:08:59 +00:00
4d8b845797 Fix overflow error when torch.bincount() handles a large tensor (#136745)
Fixes #136720

the result in this case says:

```
Traceback (most recent call last):
  File "/Users/shenke/workspace/pytorch/mytest.py", line 9, in <module>
    result = torch.bincount(input)
             ^^^^^^^^^^^^^^^^^^^^^
RuntimeError: maximum value of input overflowed, it should be < 9223372036854775807 but got 9223372036854775807
```
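
The message suggests that the guard sits in front of the `max + 1` computation that determines the number of bins. A rough pure-Python sketch under that assumption (`checked_num_bins` is a hypothetical helper used only for illustration, not the ATen code):

```python
INT64_MAX = 2**63 - 1  # 9223372036854775807


def checked_num_bins(values, minlength=0):
    """Return the number of bins a bincount-style op would need.

    Illustrative only: Python integers do not overflow, so the int64
    limit is checked explicitly.
    """
    if not values:
        return minlength
    largest = max(values)
    if largest < 0:
        raise ValueError("bincount-style ops need non-negative values")
    if largest >= INT64_MAX:
        # largest + 1 would not fit in a signed 64-bit integer.
        raise RuntimeError(
            f"maximum value of input overflowed, it should be < {INT64_MAX} "
            f"but got {largest}"
        )
    return max(largest + 1, minlength)
```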

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136745
Approved by: https://github.com/Skylion007
2024-10-05 04:04:48 +00:00
d6f340f66c Determine autograd engine ready queue based on InputMetadata instead of InputBuffer (#135633)
Thanks @awgu for raising this issue and the small repro

From offline discussion with @albanD, in the case where a forward returns multiple outputs with different devices, we'd want to select the ready queue based on the device of the first one. Even though this is somewhat arbitrary, we prefer this over deciding which ready queue to push to based on whichever input buffer we happen to compute last, which can vary depending on more factors and thus be harder to reason about. This is in theory bc-breaking, but it seems unlikely that someone would depend on this behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135633
Approved by: https://github.com/albanD
2024-10-04 23:59:46 +00:00
79562f3af8 [ROCm] Modify hipify script to work with Windows paths (#135360)
This change modifies the `hipify_python.py` script to properly detect all directories, `include` and `ignore` paths during the hipification process on Windows, by changing the path syntax convention to a UNIX-like one.

Since the script assumes a UNIX-like convention in many places by using paths with forward slashes `/`, I decided to accommodate it by converting Windows paths to UNIX-like ones. Doing so limits the number of changes to the file. Moreover, this early unification lets the rest of the code keep its battle-tested Linux-like behaviour.

Another option would be to use the `Path` object from `pathlib` to represent all paths in the script. However, that would affect a broader share of the code and would hence require a more meticulous evaluation to confirm that the logic and edge cases are unaltered.
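
The conversion idea can be sketched as follows. This is an illustration under the assumption that a plain separator replacement is enough; `to_unix_style` is a hypothetical name and the actual script may normalize paths differently:

```python
def to_unix_style(path: str) -> str:
    """Rewrite backslash separators as forward slashes (hypothetical helper)."""
    return path.replace("\\", "/")


# Normalizing once, early on, lets later string checks that assume "/" keep working.
include_dir = to_unix_style(r"C:\src\pytorch\aten\src\ATen")
print(include_dir.startswith("C:/src/pytorch/"))
```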

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135360
Approved by: https://github.com/jeffdaily, https://github.com/jithunnair-amd
2024-10-04 23:43:43 +00:00
8b6774d381 Clarify comment for error handling of dict getattr (#137381)
Just a small nit
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137381
Approved by: https://github.com/malfet
2024-10-04 23:40:21 +00:00
f42f63ee86 Add option to disable operator profiling (#136838)
Summary:
X-link: https://github.com/pytorch/executorch/pull/5720

For smaller models the overhead of profiling ops might be prohibitively large (distorting the inference execution time significantly) so we provide users an option to disable op profiling and essentially only profile the important events such as inference execution time.

To disable operator profiling users need to do:
```
etdump_gen.set_event_tracer_profiling_level(executorch::runtime::EventTracerProfilingLevel::kNoOperatorProfiling);
```

Test Plan: Added test case.

Differential Revision: D61883224

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136838
Approved by: https://github.com/dbort
2024-10-04 22:56:00 +00:00
f2d174c051 Update CODEOWNERS (#136278)
Swap @gokulavasan for @divyanshk as codeowner of torch/utils/data/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136278
Approved by: https://github.com/divyanshk, https://github.com/janeyx99, https://github.com/jansel
2024-10-04 22:42:05 +00:00
88e54de219 More nogil unsafe API fix (#137142)
Covers the PyDict APIs and confirms that no update is needed for the PyModule one.
The rest was already covered in https://github.com/pytorch/pytorch/pull/136899

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137142
Approved by: https://github.com/eqy, https://github.com/Skylion007
2024-10-04 21:56:34 +00:00
e27c0048db Enable additional tests for MPS CI runs (#134356)
As part of the follow up for https://github.com/pytorch/pytorch/issues/133520, adapting existing unused tests for use in MPS CI runs. Focusing on nhwc & other memory formatting tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134356
Approved by: https://github.com/malfet, https://github.com/eqy, https://github.com/huydhn
2024-10-04 21:52:38 +00:00
7c1d93944e Proper handling of arguments passed by in kwargs inside zip_schema (#137311)
If the function is

```func(a, b, c) ```
and is called as
```func(a=1, b=.., c=..)```
then before this change we did not iterate over `a`, `b`, and `c`, since they appear in kwargs. This diff fixes that issue.

This function is used in _inductor/ir.py to iterate over custom op arguments. When a custom pass made changes
and passed arguments as kwargs, we did not process them.
```
        for info, arg in torch._library.utils.zip_schema(schema, args, kwargs):
            handle_aliasing_and_mutation(info, arg)
```
Fix https://github.com/pytorch/pytorch/issues/137057
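
A simplified sketch of the intended iteration, using plain parameter names instead of schema objects. `zip_params` is a hypothetical stand-in; the real `torch._library.utils.zip_schema` operates on schema arguments and may treat defaults differently:

```python
def zip_params(param_names, args, kwargs):
    """Yield (name, value) pairs for positional and keyword arguments."""
    # Positional arguments line up with the leading parameters.
    for name, value in zip(param_names, args):
        yield name, value
    # Remaining parameters may have been passed by keyword; visit them in
    # declaration order and skip any that were not supplied.
    for name in param_names[len(args):]:
        if name in kwargs:
            yield name, kwargs[name]


# func(a, b, c) called as func(a=1, b=2, c=3): all three are visited.
pairs = list(zip_params(["a", "b", "c"], (), {"a": 1, "b": 2, "c": 3}))
```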

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137311
Approved by: https://github.com/zou3519
2024-10-04 21:50:31 +00:00
c0deec120f Fix resurrection logic to trigger early enough (#137267)
Fixes https://github.com/pytorch/pytorch/issues/136358

The bug here is that the Tensor object is actually 2 classes: `Tensor` from `_tensor.py` and `TensorBase` from c++.

Before this PR, they have the following gc methods:
Tensor:
 - tp_clear subtype_clear
 - tp_traverse THPVariable_subclass_traverse
 - tp_dealloc THPVariable_subclass_dealloc

TensorBase:
- tp_clear THPVariable_clear
- tp_traverse THPFunction_traverse (fake function that is just an error)
- tp_dealloc object_dealloc

The problem is that when clear is called on the Tensor, subtype_clear is going to clear the things owned by the "Tensor" type, in particular, its `__dict__` attribute, before delegating to the TensorBase clear where we detect that resurrection needs to happen and skip it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137267
Approved by: https://github.com/ezyang, https://github.com/kshitij12345
2024-10-04 21:13:54 +00:00
bd48933323 Run docker builds on Meta account for now (#137358)
To fix
```
arn:aws:sts::391835788720:assumed-role/ghci-lf-github-action-runners-runner-role/i-096a3e2616140518b is not authorized to perform: ecr:InitiateLayerUpload on resource: arn:aws:ecr:us-east-1:308535385114:repository/pytorch/pytorch-linux-jammy-py3-clang18-asan because no resource-based policy allows the ecr:InitiateLayerUpload action
```
This seems to be doing the trick, see https://github.com/pytorch/pytorch/actions/runs/11185419440/job/31098258344
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137358
Approved by: https://github.com/huydhn
2024-10-04 20:39:56 +00:00
7b3378a39a [FSDP2] Relaxed even sharding requirement for all-gather extensions (#137005)
This PR relaxes the even sharding requirement for the all-gather extensions.

The `fsdp_pre_all_gather` now expects signature:
```diff
def fsdp_pre_all_gather(
    self,
    mesh: DeviceMesh,
+    outer_size: torch.Size,
+    outer_stride: Tuple[int, ...],
    module: nn.Module,
    mp_policy: MixedPrecisionPolicy,
) -> Tuple[Tuple[torch.Tensor, ...], Any]:
```
- Since no one is using this new signature yet, we should be safe to change it.
- Currently, the `outer_stride` will always be contiguous strides since FSDP2 only supports contiguous strides for now.
- For the uneven sharding case, the user is responsible to return a padded sharded tensor from `fsdp_pre_all_gather`. This is risky territory because if the user does not do so, then this may manifest as a NCCL timeout, as only the ranks with padding will error out. However, I am not aware of any way around this.
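
To illustrate what a padded shard means here, the sketch below splits a flat list across ranks so that every rank holds the same number of elements. It assumes chunking along the first dimension with trailing padding; `padded_shards` and `pad_value` are hypothetical, and the exact layout FSDP2 expects is not specified in this message:

```python
import math


def padded_shards(values, world_size, pad_value=0):
    """Split values into world_size shards of equal, padded length."""
    shard_len = math.ceil(len(values) / world_size)
    shards = []
    for rank in range(world_size):
        chunk = values[rank * shard_len:(rank + 1) * shard_len]
        # Ranks at or past the end of the data receive padding.
        chunk = chunk + [pad_value] * (shard_len - len(chunk))
        shards.append(chunk)
    return shards
```

If one rank skipped the padding, its all-gather input would be smaller than its peers', which is the kind of mismatch the description above warns about.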

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137005
Approved by: https://github.com/weifengpy
2024-10-04 20:34:20 +00:00
f4b415da11 type _dynamo/replay_record.py (#137183)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137183
Approved by: https://github.com/Skylion007
2024-10-04 20:29:24 +00:00
6a6a8b17b8 handle state tensors in training ir path (#137240)
Summary: We had attribute assignment detection and handling of registered buffer assignments when using `aot_autograd`, but not when using just `make_fx`. Fixed.

Test Plan: expanded coverage of `test_state_tensors` to use `export` instead of `torch.export.export`

Differential Revision: D63802576

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137240
Approved by: https://github.com/tugsbayasgalan
2024-10-04 20:23:48 +00:00
f0ef7fddde Add ignored/unmaintained comment for capture_autograd_function flag (#137309)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137309
Approved by: https://github.com/jansel
ghstack dependencies: #136961
2024-10-04 20:02:37 +00:00
0878739b11 [AOTI] Add C shim for MKLDNN _convolution_pointwise (#137269)
Differential Revision: [D63875271](https://our.internmc.facebook.com/intern/diff/D63875271)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137269
Approved by: https://github.com/chenyang78, https://github.com/hl475
2024-10-04 19:42:05 +00:00
a968576777 Add lowering for aten.searchsorted (#135701)
Adds lowering for `aten.searchsorted`. This entails:

1. Adding support for multi-dimensional bucket tensors to `ops.bucketize`.
2. Adding support for striding to `ops.bucketize`.
3. Adding support for sorting tensors to `ops.bucketize`.
4. Adding a lowering for `aten.searchsorted.Tensor`.
5. Adding a basic decomposition for `aten.searchsorted.Scalar` that calls into the lowering for tensors.
6. Updating the meta-function for `aten.searchsorted` to properly check some of the sizing conditions.
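
Step 5 above, the scalar variant calling into the tensor path, can be pictured with a small pure-Python analogue built on `bisect`. This is only a sketch of the idea, not the Inductor decomposition; both function names are hypothetical:

```python
from bisect import bisect_left, bisect_right


def searchsorted_1d(sorted_seq, values, right=False):
    """Return insertion indices for each value in a sorted 1-D sequence."""
    find = bisect_right if right else bisect_left
    return [find(sorted_seq, v) for v in values]


def searchsorted_scalar(sorted_seq, value, right=False):
    """Scalar variant: wrap the value and reuse the sequence implementation."""
    return searchsorted_1d(sorted_seq, [value], right=right)[0]
```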

Closes #135873

Differential Revision: [D63766514](https://our.internmc.facebook.com/intern/diff/D63766514)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135701
Approved by: https://github.com/amjames, https://github.com/eellison, https://github.com/davidberard98
2024-10-04 19:26:05 +00:00
58ec6a360c force contiguity for all reduce (#137345)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137345
Approved by: https://github.com/xmfan
2024-10-04 19:16:38 +00:00
c0a930b104 Change to export_for_training in quantize_pt2e tests (#137233)
Summary:
as title

also change it in `prepare_pt2e()` docstring

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:quantization_pt2e_qat

buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization
```

Differential Revision: D63345059

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137233
Approved by: https://github.com/tugsbayasgalan
2024-10-04 18:33:02 +00:00
22e19bd2d7 Add link to torch.compile the missing manual in troubleshooting (#137301)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137301
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2024-10-04 18:19:30 +00:00
79195b9453 [inductor] Add kwargs to bypass unexpected keyword argument error (#137329)
Summary:
I tried `TORCHINDUCTOR_PROFILE=1 TORCHINDUCTOR_PROFILE_OUTPUT=~/fbcode/profile.txt`.

TypeError: DebugAutotuner.run() got an unexpected keyword argument 'benchmark_run'

Test Plan: ci

Differential Revision: D63876103

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137329
Approved by: https://github.com/muchulee8
2024-10-04 18:17:56 +00:00
d2d14d14e3 [RELAND] Fix unlift to preserve aliased constants (#137310)
Differential Revision: [D63864743](https://our.internmc.facebook.com/intern/diff/D63864743)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137310
Approved by: https://github.com/avikchaudhuri
2024-10-04 18:15:52 +00:00
8b9cbf22c2 Enable regression test for add loop benchmarks (#136573)
The red dotted line is 1.5

<img width="1607" alt="Screenshot 2024-09-24 at 11 50 41 AM" src="https://github.com/user-attachments/assets/719a9a86-89af-4c58-8723-80a28f9bb517">

expected taken from the average.
<img width="850" alt="Screenshot 2024-09-24 at 2 33 27 PM" src="https://github.com/user-attachments/assets/0f25e855-35ae-4031-86ef-1452ef6598de">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136573
Approved by: https://github.com/ezyang
2024-10-04 18:12:08 +00:00
ad240018f2 [PT2][Inductor][Reliability] Add back unit test for pad_mm with BF16 (#137231)
Summary: We add back the unit test for the recently added pad_mm pattern in customized optimus (D63040455), which resolves the long computation kernel issue for BF16 on A100.

Test Plan:
```
buck2 test mode/opt //caffe2/test/inductor:pad_mm -- test_pad_mm_bf16
```

Buck UI: https://www.internalfb.com/buck2/4dd4c90c-4a2a-4859-923c-a4008f78a1cd
Test UI: https://www.internalfb.com/intern/testinfra/testrun/9851624237127136
Network: Up: 100KiB  Down: 4.3GiB  (reSessionID-87f11454-d920-47af-9af5-39ca0572b7c6)
Jobs completed: 7079. Time elapsed: 3:34.3s.
Cache hits: 99%. Commands: 7061 (cached: 7024, remote: 19, local: 18)
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

Differential Revision: D63794727

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137231
Approved by: https://github.com/henrylhtsang
2024-10-04 17:49:55 +00:00
b2979f4382 Allow autocast in training ir export (#137287)
Summary: hardcode "val" field for autocast (similar to set_grad_enabled), to bypass the verifier check.

Test Plan: CI

Differential Revision: D63345767

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137287
Approved by: https://github.com/angelayi
2024-10-04 17:38:51 +00:00
42adadf2f2 [aotinductor] enable CUTLASS backend (#134379)
### Context
This PR allows CUTLASS kernels to be used in AOTI. It does this by:
* For any CUTLASS kernels that win during autotuning, compile them as a .so & .o
* When creating the final model .so, link all the CUTLASS kernels .o files
* Make sure we codegen things correctly (argument dtypes and specify extern "C" linking for the CUTLASS kernel)

### Example
https://gist.github.com/ColinPeppler/e834fa2255c37e9444b6d540bf7bd04d#file-model-cpp-L548-L549

```
TORCH_LOGS="+output_code" python test/inductor/test_cutlass_backend.py -v -k test_max_autotune_cutlass_backend_regular_mm
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134379
Approved by: https://github.com/tenpercent, https://github.com/chenyang78
2024-10-04 17:32:41 +00:00
c7b0d4b148 raw_alloc ignores PYTORCH_NO_CUDA_MEMORY_CACHING (#131114)
raw_alloc is used by cudnn, miopen, thrust, and tunableop.  Without this PR, the env var for disabling the caching allocator will only partially work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131114
Approved by: https://github.com/eqy, https://github.com/houseroad, https://github.com/albanD

Co-authored-by: Nichols A. Romero <nick.romero@amd.com>
2024-10-04 15:36:29 +00:00
cyy
67908e9111 Enable clang-tidy on torch/csrc/distributed/rpc (#137320)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137320
Approved by: https://github.com/Skylion007
2024-10-04 15:34:05 +00:00
15c3479db7 [AOTI] Fix _scaled_mm ABI-compatible codegen (#137132)
Summary: Similar to https://github.com/pytorch/pytorch/pull/137008, but for supporting _scaled_mm in the ABI-compatible mode.

Differential Revision: [D63757729](https://our.internmc.facebook.com/intern/diff/D63757729)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137132
Approved by: https://github.com/chenyang78
ghstack dependencies: #137008
2024-10-04 14:05:18 +00:00
5d24ea81d3 [AOTI] Fix cpp wrapper codegen for _scaled_mm (#137008)
Summary: Fixes https://github.com/pytorch/pytorch/issues/136209. Because _scaled_mm has an out variant, the generated cpp fallback call should call _scaled_mm_out. The ABI-compatible mode needs more work.

Differential Revision: [D63757728](https://our.internmc.facebook.com/intern/diff/D63757728)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137008
Approved by: https://github.com/hl475
2024-10-04 14:02:46 +00:00
f56f7476d3 Revert "Add meta functions for lerp, addcmul, and addcdiv. (#136909)"
This reverts commit e4b98b11493914769d15ca8b124c0b5fa1fdd364.

Reverted https://github.com/pytorch/pytorch/pull/136909 on behalf of https://github.com/albanD due to breaks trunk jobs ([comment](https://github.com/pytorch/pytorch/pull/136909#issuecomment-2393774694))
2024-10-04 14:01:54 +00:00
cd17b2645c Revert "[Distributed] Fix extra context on device 0 (#135273)"
This reverts commit a93d3873e97973fbc0009245579ee4e4fa7f155a.

Reverted https://github.com/pytorch/pytorch/pull/135273 on behalf of https://github.com/albanD due to Broken trunk distributed ci ([comment](https://github.com/pytorch/pytorch/pull/135273#issuecomment-2393772987))
2024-10-04 13:58:57 +00:00
5509207543 Revert "[PyTorch] Port ExecuTorch bfdot improvement back to ATen BlasKernel (#136331)"
This reverts commit 592e3a3d4069029946ec4c8d103a313806c53a88.

Reverted https://github.com/pytorch/pytorch/pull/136331 on behalf of https://github.com/albanD due to Breaks aarch64 builds, see link below ([comment](https://github.com/pytorch/pytorch/pull/136331#issuecomment-2393760135))
2024-10-04 13:52:37 +00:00
e80f47fb4d Pass special arguments to user-defined Triton kernels if required (#137236)
Summary:

Special autotuning configs like `num_warps` and `num_stages` can be passed to the kernel as parameters. The `config.all_kwargs()` call [here](762a7d197c/python/triton/runtime/autotuner.py (L106)) in the Triton code includes those special configs (names and values) in the potential arguments to the kernel. [Here](762a7d197c/python/triton/runtime/jit.py (L613)) some of those may be included in the actual kernel arguments, given that their names are present among the kernel parameters.

This PR replicates this behavior in user-defined Triton kernel compilation in PT2. Resolves #136550.
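
The filtering rule described above can be sketched in plain Python with `inspect`. This is an illustration only: `select_special_kwargs` and both kernel stubs are hypothetical, and the real logic lives in Triton and in the PT2 user-defined kernel path:

```python
import inspect


def select_special_kwargs(kernel_fn, special_configs):
    """Keep only the special configs that kernel_fn declares as parameters."""
    params = inspect.signature(kernel_fn).parameters
    return {k: v for k, v in special_configs.items() if k in params}


def kernel_with_warps(x_ptr, n_elements, num_warps):  # hypothetical signature
    pass


def kernel_plain(x_ptr, n_elements):  # hypothetical signature
    pass


special = {"num_warps": 4, "num_stages": 2}
forwarded = select_special_kwargs(kernel_with_warps, special)
```

Here only `num_warps` is forwarded to `kernel_with_warps`, and nothing is forwarded to `kernel_plain`, since neither special name appears among its parameters.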

Test Plan:

```
$ python test/inductor/test_triton_kernels.py -k test_triton_kernel_special_params
inductor []
inline_call []
stats [('calls_captured', 2), ('unique_graphs', 1)]
aot_autograd [('total', 1), ('ok', 1)]
.inductor []
inline_call []
stats [('calls_captured', 2), ('unique_graphs', 1)]
.inductor [('fxgraph_cache_bypass', 1), ('pattern_matcher_count', 1), ('pattern_matcher_nodes', 1), ('extern_calls', 1), ('possibly_missed_reinplacing_opportunities', 0), ('possibly_missed_reinplacing_bytes', 0)]
inline_call []
stats [('calls_captured', 2), ('unique_graphs', 1)]
aot_autograd [('total', 1), ('ok', 1)]
.inductor []
inline_call []
stats [('calls_captured', 2), ('unique_graphs', 1)]
aot_autograd [('total', 1), ('ok', 1)]
.inductor []
inline_call []
stats [('calls_captured', 2), ('unique_graphs', 1)]
.inductor [('benchmarking.TritonBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_bypass', 1), ('pattern_matcher_count', 1), ('pattern_matcher_nodes', 1), ('extern_calls', 1), ('benchmarking.TritonBenchmarker.triton_do_bench', 1), ('possibly_missed_reinplacing_opportunities', 0), ('possibly_missed_reinplacing_bytes', 0)]
inline_call []
stats [('calls_captured', 2), ('unique_graphs', 1)]
aot_autograd [('total', 1), ('ok', 1)]
.
----------------------------------------------------------------------
Ran 6 tests in 6.283s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137236
Approved by: https://github.com/zou3519
2024-10-04 07:36:55 +00:00
cyy
6327a71880 [Environment Variable][2/N] Use thread-safe setenv wrapper (#124485)
This follows #119449 to make setenv thread-safe.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124485
Approved by: https://github.com/eqy
2024-10-04 07:30:51 +00:00
6dcd773c57 [export] clean up dynamic markers from tensors (#137230)
Summary:
When we handle dynamic shapes markers like `Dim.AUTO, Dim.DYNAMIC`, we use dynamo decorators, attaching set attributes to the export input tensors, e.g. `x._dynamo_dynamic_indices = set()`.

I thought this was fine, since it's done all the time with torch.compile, but it breaks some PT2Inference tests, specifically because unpickling a set attribute isn't possible with the C++ torch::jit::pickle_load call.

We've agreed that the PT2Inference side will clone sample inputs & pickle the original inputs to be safe, but this still establishes a nice invariant that user-facing decorators are both ignored & cleaned out in the lifecycle of an export call.

Test Plan: test_export

Differential Revision: D63773534

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137230
Approved by: https://github.com/avikchaudhuri
2024-10-04 06:50:45 +00:00
a408cfcbf1 [torch.compile] torch.vmap supports dynamic shapes + enable flex attention create_block_mask dynamic shapes (#137163)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137163
Approved by: https://github.com/Chillee
2024-10-04 05:16:04 +00:00
40b09edd87 Add back DistributedDataParallel types that were lost when pyi was removed (#136835)
When the stub file `nn/parallel/distributed.pyi` was removed (#88701), some types that existed are no longer available. This pull request adds them back.

Just for reference, these types are used in pytorch-lightning's LightningCLI. Command line interfaces are created automatically, and having type hints makes them nicer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136835
Approved by: https://github.com/kwen2501
2024-10-04 04:44:20 +00:00
97634e4f82 Rollout infra for executorch migration to training IR (#132703)
Title

Differential Revision: [D60432217](https://our.internmc.facebook.com/intern/diff/D60432217/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132703
Approved by: https://github.com/tarun292
2024-10-04 04:33:08 +00:00
f500cb43bb Fix torch.library.register_vmap (#137306)
We didn't support multiple levels of vmap. The main problem is, during
the batching rule, we need to exclude the vmap dispatch key
(FuncTorchBatched) like how our C++ batching rules do it.

Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137306
Approved by: https://github.com/Chillee
2024-10-04 03:46:35 +00:00
cfc51c858a type _dynamo/callback.py (#137181)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137181
Approved by: https://github.com/Skylion007
2024-10-04 03:28:52 +00:00
9670e9e5b0 Revert "Mark PyTorch module as no-gil valid and pythoncapi_compat.h (#136899)"
This reverts commit 4f93de895138cc3cb8c4383b480a2d0ecf407e1b.

Reverted https://github.com/pytorch/pytorch/pull/136899 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/136899#issuecomment-2392721534))
2024-10-04 03:28:31 +00:00
e4b98b1149 Add meta functions for lerp, addcmul, and addcdiv. (#136909)
This PR adds new meta functions for `lerp`, `addcmul`, and `addcdiv` (including their
respective inplace versions).

These functions only had refs implementations, which was the root cause of a
significant overhead ([issue][1]) when running `AdamW` optimizer step on PyTorch/XLA
backend. Running the meta functions resulted in the following improvements:

- `lerp` calls: 1,550ms to 140ms (10x)
- `addcdiv` calls: 640ms to 350ms (1.8x)
- `addcmul` calls: 620ms to 300ms (2.05x)

[1]: https://github.com/pytorch/xla/issues/7923

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136909
Approved by: https://github.com/jansel
2024-10-04 02:47:25 +00:00
a1f1f585ab clean up error_on_nested_jit_trace flag (#136961)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136961
Approved by: https://github.com/jansel
2024-10-04 02:07:54 +00:00
d32696249a [IntraNodeComm] fix a race condition in one-shot all-reduce (#137257)
One-shot all-reduce did not have a barrier at the end. It was possible for a rank to write to its p2p buffer for the next collective before another rank finished reading it for the previous collective.

Also removing the fuse-input-copy optimization. The synchronization complexity probably outweighs the saving.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137257
Approved by: https://github.com/Chillee
2024-10-04 01:41:14 +00:00
3d3b394e94 [MTIA](3/n) Implement CPU pins functions for MTIA hooks (#137283)
Summary: Link CPU pins function in MTIA hooks to the host allocator implementation

Test Plan:
signals
unit test in D63709424

Differential Revision: D63352770

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137283
Approved by: https://github.com/egienvalue
2024-10-04 01:26:21 +00:00
15e127bc3b [numpy2.0 compat] Fix test_parse_numpy_int_overflow for NumPy 2.0 (#137135)
NumPy now throws an OverflowError when trying to create np.uint64(-1)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137135
Approved by: https://github.com/Skylion007
2024-10-04 01:21:12 +00:00
13ec343afe clean up capture_func_transforms flag (#136960)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136960
Approved by: https://github.com/guilhermeleobas, https://github.com/jansel
2024-10-04 01:10:52 +00:00
6b9b2a126e Build clang18 image for ASAN tests (#128763)
Use the latest clang.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128763
Approved by: https://github.com/malfet
2024-10-04 00:53:56 +00:00
a93d3873e9 [Distributed] Fix extra context on device 0 (#135273)
This PR contains multiple fixes for issue https://github.com/pytorch/pytorch/issues/135279:

## First part:
Moves the GPU guard (`cudaSetDevice`) before the `currentStreamCaptureStatusMayInitCtx` call.
As its name suggests, it May Init Ctx.

## Second part:
Even with the above fix, additional contexts are still observed during Work object destruction, e.g.
```
work = dist.all_reduce(tensor, async_op=True)
time.sleep(5)  <-- no additional context yet
del work  <-- additional context shows up
```
### Debug process
Chasing it down to destruction of a `Future` object -- a member variable of `Work`.
Then further down to the following member of `Future`:
```
std::vector<c10::Event> events_;
```
When the `events_` are destroyed, we hit the road down to:
1f3a793790/c10/cuda/impl/CUDAGuardImpl.h (L106-L121)

When there is no "preset" CUDA context (**which is the case for python garbage collector**), line 112: `c10::cuda::GetDevice(&orig_device)` will set `orig_device` to 0. Then, at line 120, `c10::cuda::SetDevice(orig_device)` will "officially" set the context to device 0 --
**that's where rank 1, 2, ... can create extra context on device 0!**
### Solution
This PR adds an explicit destructor to `Future`. In this destructor, destroy each event with a device guard.

## Test
Added test_extra_cuda_context, implemented via
- `pynvml` (if available), or
- memory consumption check.

`python test/distributed/test_c10d_nccl.py -k test_extra_cuda_context`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135273
Approved by: https://github.com/fduwjj, https://github.com/wconstab, https://github.com/eqy
2024-10-04 00:44:02 +00:00
88e338f4dd [AOTI] Add C shim for MKLDNN _linear_pointwise (#136999)
Differential Revision: [D63851216](https://our.internmc.facebook.com/intern/diff/D63851216)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136999
Approved by: https://github.com/leslie-fang-intel, https://github.com/chenyang78, https://github.com/hl475
2024-10-04 00:35:10 +00:00
57c02e5a00 [BE] Use helper functions in mps_extension (#137313)
This should be a no-op change, i.e. it runs the same code, but replaces verbose ObjectiveC invocation with helper function from OperationUtils.h, which this example already depends on
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137313
Approved by: https://github.com/atalman
2024-10-04 00:26:38 +00:00
bc916a5537 [easy] for test_ck_backend enable RE & activate remaining tests for FBCode (#137305)
Differential Revision: D63859208

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137305
Approved by: https://github.com/muchulee8, https://github.com/chenyang78
2024-10-04 00:22:35 +00:00
cyy
60d19cb59e Enable clang-tidy on torch/csrc/distributed/autograd/* (#137180)
Enable clang-tidy on `torch/csrc/distributed/autograd/*` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137180
Approved by: https://github.com/Skylion007
2024-10-03 23:49:23 +00:00
7e13e7dd7e Disallow FakeTensor.data_ptr access in eager mode (#137221)
Previously we raised a deprecation warning (beginning PyTorch 2.4). Now
that we are on 2.6, we're completing the deprecation and disallowing
this behavior.

Test Plan:
- tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137221
Approved by: https://github.com/albanD, https://github.com/eellison
2024-10-03 23:47:55 +00:00
cfcd0e1fe9 [ONNX] Update the faketensor documentation (#137292)
Update the faketensor documentation to reflect current usage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137292
Approved by: https://github.com/shubhambhokare1, https://github.com/sdpython
2024-10-03 23:27:11 +00:00
4096ed7dc2 Migrate to training ir in quantization_pt2e_qat unittests (#137232)
Summary: Change capture_pre_autograd_graph to export_for_training in unit tests.

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:quantization_pt2e_qat
```

Reviewed By: tugsbayasgalan

Differential Revision: D63336660

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137232
Approved by: https://github.com/angelayi
2024-10-03 22:57:04 +00:00
b44f25e1ba [CI] Move s390 binary build to its own workflow (#137304)
It was added by https://github.com/pytorch/pytorch/pull/125399 and takes 3 hours to finish.
Given the limited number of runners, it often causes queueing; see:
<img width="402" alt="image" src="https://github.com/user-attachments/assets/5c67c1d6-af4c-4453-a089-aa1174513bfa">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137304
Approved by: https://github.com/kit1980, https://github.com/huydhn, https://github.com/atalman
2024-10-03 22:31:36 +00:00
54094c0c26 [inductor][user triton] Check size hints to determine indexing dtype (#137234)
Previously, all integer inputs to user-defined triton kernels were assumed to be int32. This would result in errors if your input was actually an int64.

This PR checks the value to determine which dtype to use for indexing: if it is known to be < int_max, then use int32 (and add guards if relevant); if we can't check (e.g. unbacked symint), then use int64.
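
The selection rule described above can be sketched roughly as follows. This is a minimal illustration, not inductor's actual code: `choose_index_dtype` and `INT32_MAX` are hypothetical names, and `None` stands in for a size hint that cannot be checked (e.g. an unbacked symint). Guard installation is omitted.

```python
from typing import Optional

INT32_MAX = 2**31 - 1  # largest value representable in a signed 32-bit integer


def choose_index_dtype(size_hint: Optional[int]) -> str:
    """Pick the dtype for an integer kernel argument (hypothetical helper)."""
    if size_hint is None:
        # The value cannot be checked, so fall back to the wider type.
        return "int64"
    if size_hint < INT32_MAX:
        # Known to fit in 32 bits.
        return "int32"
    return "int64"


print(choose_index_dtype(1024))   # small, known value
print(choose_index_dtype(2**40))  # too large for int32
print(choose_index_dtype(None))   # unknown value
```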

Differential Revision: [D63797975](https://our.internmc.facebook.com/intern/diff/D63797975)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137234
Approved by: https://github.com/eellison
2024-10-03 22:07:26 +00:00
c83178d894 Change to export_for_training in XNNPACK tests (#137238)
Summary: as title

Test Plan: CI

Differential Revision: D63344674

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137238
Approved by: https://github.com/tugsbayasgalan
2024-10-03 21:28:05 +00:00
ce14f1f0c9 [aoti] Accept constant inputs (#137197)
Fixes https://fb.workplace.com/groups/1028545332188949/posts/1056788036031345/?comment_id=1056790162697799&reply_comment_id=1057501845959964

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137197
Approved by: https://github.com/henrylhtsang, https://github.com/desertfire, https://github.com/hl475
2024-10-03 20:59:33 +00:00
eqy
46f158bfbc [cuDNN] Check shapes during graph capture in cuDNN CTCLoss (#130071)
Found out from #125952 about the existence of `_assert_async`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130071
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-10-03 20:10:28 +00:00
592e3a3d40 [PyTorch] Port ExecuTorch bfdot improvement back to ATen BlasKernel (#136331)
ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. Supersedes https://github.com/pytorch/pytorch/pull/127488 . Includes https://github.com/pytorch/executorch/pull/5444 .

Differential Revision: [D63045939](https://our.internmc.facebook.com/intern/diff/D63045939/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136331
Approved by: https://github.com/malfet, https://github.com/albanD
ghstack dependencies: #136445
2024-10-03 18:18:37 +00:00
c8a7da305b [PyTorch] Add attribute version of C10_ALWAYS_INLINE (#136445)
Sometimes (such as on a lambda), you need `__attribute__((always_inline))` but not `inline`.

Differential Revision: [D63266917](https://our.internmc.facebook.com/intern/diff/D63266917/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136445
Approved by: https://github.com/malfet
2024-10-03 18:18:37 +00:00
525f6715bc Revert "Fix unlift to unblock training IR + run_decomp on aliasing constants (#137162)"
This reverts commit f96020c246aec8514b945d530879635a03294f70.

Reverted https://github.com/pytorch/pytorch/pull/137162 on behalf of https://github.com/jovianjaison due to Sorry for reverting your changes but many jobs are failing with NameError: name _recursive_getattr is not defined + a Lint job fails ([comment](https://github.com/pytorch/pytorch/pull/137162#issuecomment-2392036062))
2024-10-03 18:17:56 +00:00
c7714b8d8d [FR] Fix duplicate output for the case when not all ranks join on collective (#137256)
As the title says: when testing on an internal case, we found nearly duplicate error output when certain ranks do not join a collective. This is because we didn't put all ranks into `candidate_ranks`, so they were not wiped out from the entries and got checked again.

Ideally, for the given case, we should report this as an out-of-order case, because ranks 0 and 1 call all-to-all while all the remaining ranks call all-gather-base. But when we select entries to compare, we don't have a global view of the entries.

In this specific case, ranks 0 and 1 have a collective of PG 7 on entry 1130 with seq ID = 1130, whereas the other ranks have a collective of PG 0 on entry 1130 with seq ID = 2. It's hard to use the entry index to do the match, because this assumption collapses once we consider p2p, so for now we defer to users or to further downstream debugging to figure it out. To make the message clearer, I also include both the seq ID and the record_id (a.k.a. entry index) in the message. (That does not mean this is impossible to implement in the code; for example, we could subtract from every record_id the maximum p2p seq ID before it. But users will easily see the wrong order, so we don't think that logic is necessary now.)

P1626755348

Differential Revision: [D63815335](https://our.internmc.facebook.com/intern/diff/D63815335/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137256
Approved by: https://github.com/c-p-i-o
2024-10-03 18:06:45 +00:00
adc48a5b52 Python CAPI cleanup (#137266)
This is unrelated to anything else, but as I was going through the code, I fixed some bad patterns and a refcount bug (which is unlikely to cause any real issue, to be honest).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137266
Approved by: https://github.com/Skylion007
2024-10-03 17:55:48 +00:00
8bb8c3997b [inductor] parallel compile: add import of thread_safe_fork for internal (#137155)
Summary: We had a report of crashes in parallel compile subprocesses linked to reading justknobs. See https://fburl.com/workplace/14a4mcbh internally. This is a known issue with justknobs. It looks like we don't have a lot of control over evaluating knobs. Some are read in inductor (`"pytorch/remote_cache:autotune_memcache_version"`), but many are read by the triton compiler. According to this advice https://fburl.com/workplace/imx9lsx3, we can import thread_safe_fork, which installs some functionality to destroy some singletons before forking and re-enable them after. This approach works for the failing workload.

Test Plan: See D63719673 where the reporting user was kind enough to provide us with a local repro. Without the relevant import, we can reproduce the crash. With the import, the training runs successfully to completion.

Differential Revision: D63736829

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137155
Approved by: https://github.com/xmfan, https://github.com/eellison
2024-10-03 17:37:21 +00:00
f96020c246 Fix unlift to unblock training IR + run_decomp on aliasing constants (#137162)
When we populate the unlifted graph module, we actually only "unlift" constant tensor inputs, which is problematic because export de-duplicates aliasing constants. As a result, we only register one constant instead of two. This PR fixes that by querying the `ep.constants` table instead of `ep.graph_signature.lifted_tensor_constants`.

Differential Revision: [D63743111](https://our.internmc.facebook.com/intern/diff/D63743111)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137162
Approved by: https://github.com/pianpwk
2024-10-03 17:28:53 +00:00
4d3c0fc061 [AOTAutogradCache] add config for AOTAutograd remote cache (#137011)
Summary: This just adds a config option and JK for turning on remote AOTAutogradCache. It does not implement anything with the new options being passed in. That will come in the next diff.

This PR also changes the command for turning on the local AOTAutogradCache to be more consistent with that of FXGraphCache: TORCHINDUCTOR_AUTOGRAD_CACHE

Test Plan: Existing tests should pass and should build

Reviewed By: oulgen

Differential Revision: D63321965

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137011
Approved by: https://github.com/oulgen
2024-10-03 16:03:47 +00:00
a569a8ac4c type _dynamo/external_utils.py (#137185)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137185
Approved by: https://github.com/Skylion007
2024-10-03 15:18:53 +00:00
b6cb174816 Fix serialization for torch.uint16, torch.uint32, torch.uint64 (#137184)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137184
Approved by: https://github.com/albanD
2024-10-03 14:56:11 +00:00
89b7a5d128 Implement AcceleratorHooksInterface's virtual functions deviceCount() and getCurrentDevice() for CUDA and XPU (#136752)
Fixes #136751

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136752
Approved by: https://github.com/albanD
2024-10-03 14:44:58 +00:00
63bbf712d8 Add py3.13t linux wheel build (#137127)
Builder PR required: https://github.com/pytorch/builder/pull/2001
Test PR: https://github.com/pytorch/pytorch/pull/136490/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137127
Approved by: https://github.com/albanD
2024-10-03 13:13:48 +00:00
38114ec860 [async-tp] fix a race condition that can cause silent correctness issue (#137199)
Details described in https://github.com/pytorch/pytorch/issues/137171:

![image](https://github.com/user-attachments/assets/8247b4f1-7805-4585-9d72-05e9475f218b)

Fix: we introduce the following invariants in `_pipelined_all_gather_and_consume` and `_pipelined_produce_and_all2all`:
- Before any stream writes to/reads from p2p buffers, perform a barrier on channel 0 on the launch stream.
- After all streams completed writing to/reading from p2p buffers, perform a barrier on channel 0 on the launch stream.

NOTE: This fix only focuses on addressing the race condition. Some barriers are exposed, which can be hidden by computation, and we'll optimize them in subsequent PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137199
Approved by: https://github.com/weifengpy
2024-10-03 10:42:37 +00:00
f166d62764 Avoid __ne__ weakref comparison and use identity instead in cache_size.py (#135000)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135000
Approved by: https://github.com/anijain2305
2024-10-03 07:43:58 +00:00
bd9517c1ee cond_batch_rule with boolean pred (#135009)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135009
Approved by: https://github.com/guilhermeleobas, https://github.com/jansel, https://github.com/zou3519
2024-10-03 07:43:30 +00:00
0d1701f310 Revert "raw_alloc ignores PYTORCH_NO_CUDA_MEMORY_CACHING (#131114)"
This reverts commit 70019074806920f95976fedad775d7570294f635.

Reverted https://github.com/pytorch/pytorch/pull/131114 on behalf of https://github.com/PaliC due to failing internal builds ([comment](https://github.com/pytorch/pytorch/pull/131114#issuecomment-2390615007))
2024-10-03 06:22:55 +00:00
87bf2a8428 [compiled autograd] initialize cudagraph tls from context manager (#136735)
FIXES https://github.com/pytorch/pytorch/issues/126934. Cudagraphs TLS is initialized on module import, but compiled autograd codepaths might not import it. This causes problems because autograd/compiled autograd will restore TLS state, and in this case will restore the TLS to an uninitialized state

Should fix flaky cudagraph tests: https://github.com/pytorch/pytorch/issues/131663, https://github.com/pytorch/pytorch/issues/132108

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136735
Approved by: https://github.com/BoyuanFeng, https://github.com/eellison
ghstack dependencies: #136059
2024-10-03 06:22:11 +00:00
b86269fab5 Unify cpp_extension build directory removal (#136059)
Keeps existing default directory clearing logic, even though it fails when TORCH_EXTENSIONS_DIR is set. To properly clear, we'd need to track all the folders we compiled the extensions to.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136059
Approved by: https://github.com/ezyang, https://github.com/albanD
2024-10-03 06:22:11 +00:00
55c343fa3a [DTensor] Register replication strategy for a few upsampling interpolate ops (#137201)
To unblock Llama 3.2 vision's use case to resize positional embeddings for fine-tuning. Context in [workplace post](https://fb.workplace.com/groups/319878845696681/permalink/1271172040567352/).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137201
Approved by: https://github.com/XilunWu
2024-10-03 03:45:39 +00:00
84cac3585d Move _is_static_problem to mm_common (#137150)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137150
Approved by: https://github.com/eellison
2024-10-03 02:55:43 +00:00
5c0ce8d0a6 Skip Flaky Test: for #134602 (#137226)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137226
Approved by: https://github.com/cpuhrsch
2024-10-03 01:53:59 +00:00
b3953ff25e [inductor] Reduce block sizes when using Triton CPU backend (#136612)
This greatly reduces compile time; TorchBench models that were previously 50-100x slower (vs the cpp backend) are now ~20x slower. More work needs to be done on the Triton side, but smaller block sizes will still be helpful.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136612
Approved by: https://github.com/desertfire
ghstack dependencies: #135342
2024-10-03 01:48:32 +00:00
4513fb5c53 [Inductor] Use parametrize to break down some unit tests (#137156)
Summary: To address the issue that some tests are marked as slow, see https://github.com/pytorch/pytorch/issues/136940#issuecomment-2387227598

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137156
Approved by: https://github.com/eellison
2024-10-03 01:43:36 +00:00
7631a04081 [c10d] Fix the device query story of ProcessGroup (#136790)
Function `_get_pg_default_device` is being used outside of `distributed_c10d.py`.

A concern is that people may not be aware of what it actually does, due to bad naming of this function:
`Return the device to use with ``group`` for control flow usage (object collectives, barrier).`

The remediation is as follows:
- Added a deprecation warning to `_get_pg_default_device`;
- Added a private function `_get_object_coll_device` to undertake what it does;
- Added a `_device_capability` function for users who want to query the device support of a PG -- it returns a plain list, no more "default" choice.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136790
Approved by: https://github.com/H-Huang
2024-10-03 01:36:22 +00:00
cd5d1fe015 unflatten with specialized graphs per submodule call (#137013)
Previously we were making a fairly restrictive assumption when unflattening an exported program: for any submodule, we would assert that the graph of every call to that submodule must be the same. This assertion is load-bearing, i.e., if we simply remove the assertion then we can get incorrect results, as shown by the following example.

```
    class N(torch.nn.Module):
        def forward(self, x, b):
            if b:
                return x + 1
            else:
                return x + 2

    class M(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.n = N()

        def forward(self, x):
            x0 = x + 3
            x1 = self.n(x0, True)
            x2 = x1 + 4
            x3 = self.n(x2, False)
            return x3 + 5

    m = M()
    inp = (torch.ones(1),)
    print(m(*inp))  # tensor([16.])
    ep = torch.export.export(m, inp)
    print(ep.module()(*inp))  # tensor([16.])

    unflattened = torch.export.unflatten(ep)
    print(unflattened(*inp))  # tensor([15.])
```

However, this goes against the spirit of specializing graphs when exporting: we should *expect* that for every call to a submodule we *might* generate a different graph. The goal of this PR is to fix unflattening to handle multiple specialized graphs corresponding to multiple calls to the same submodule.

The idea is simple: for every call to a child module `foo`, we will create potentially different child modules `foo`, `foo@1`, `foo@2`, etc. and use those names as targets in `call_module` instructions in the parent graph. An immediate consequence of this is that the list of fqns in an unflattened module may not be the same as in an exported module. Note that all these variants share the same parameters / buffers, so that multiple calls to the same submodule can share state as expected.

However, as described so far, this scheme may end up with needlessly many submodules. Thus, between calls to the same submodule, if graphs are equal then we optimize away the extra submodules and reuse call names as much as possible. Moreover, when submodules are shared across fqns, we also try to de-duplicate graphs corresponding to their calls as much as possible. Note that no matter what, information about which submodule was called is still preserved, so that if a submodule has to be swapped with another, one can still find all calls to the former submodule and replace them with calls to the latter.
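
The naming and reuse rule for calls to a single fqn can be pictured with a small sketch. This is only an illustration under simplifying assumptions, not the unflattener's code: `assign_call_names` is a hypothetical helper, each graph is represented by a plain string key, and de-duplication across different fqns is not modelled.

```python
from typing import Dict, List, Tuple


def assign_call_names(calls: List[Tuple[str, str]]) -> List[str]:
    """Map each (fqn, graph_key) call to a call name such as "foo" or "foo@1".

    `graph_key` is a hypothetical stand-in for "the graph of this call";
    two calls with equal keys are treated as having equal graphs.
    """
    seen: Dict[str, Dict[str, str]] = {}  # fqn -> graph_key -> call name
    names: List[str] = []
    for fqn, graph_key in calls:
        variants = seen.setdefault(fqn, {})
        if graph_key not in variants:
            # The first variant keeps the plain fqn; later ones get an @N suffix.
            n = len(variants)
            variants[graph_key] = fqn if n == 0 else f"{fqn}@{n}"
        # Calls with an already-seen graph reuse the existing call name.
        names.append(variants[graph_key])
    return names


# Submodule `n` called with two different specializations, then the first again.
print(assign_call_names([("n", "x + 1"), ("n", "x + 2"), ("n", "x + 1")]))
# prints ['n', 'n@1', 'n']
```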

A note on the choice of naming scheme for call names: instead of generating "sibling" modules `foo@1`, `foo@2`, etc. for `foo`, we had considered generating "children" modules `foo._1`, `foo._2`, etc. of `foo`. However this can cause spurious cycles when de-duplicating graphs. E.g., suppose that `foo` is an alias for `bar._1` and `foo._1` is an alias for `bar`, then we must either introduce a cycle or drop the opportunity to optimize. Another idea would be to make `foo` a dummy module that contains `foo._0` corresponding to the first call, but this necessitates too many changes to existing tests and hurts the common case.

Differential Revision: D63642479

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137013
Approved by: https://github.com/pianpwk
2024-10-03 00:55:44 +00:00
6241006c28 Fix dependency on filesystem on Linux (#137209)
Similar to: https://github.com/pytorch/pytorch/pull/134494
We are seeing a recurrence of https://github.com/pytorch/pytorch/issues/133437 due to the use of `filesystem` on Linux.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137209
Approved by: https://github.com/kit1980, https://github.com/malfet
2024-10-03 00:18:28 +00:00
235f7e06f4 [CI] upload_metrics function to upload to s3 instead of dynamo (#136799)
* Upload_metrics function to upload to ossci-raw-job-status bucket instead of dynamo
* Moves all added metrics to a field called "info" so ingesting into database table with a strict schema is easier
* Removes the dynamo_key field since it is no longer needed
* Removes the concept of reserved metrics, since they cannot be overwritten by user added metrics anymore
* Moves s3 resource initialization behind a function so import is faster
---
Tested by emitting a metric during run_test and seeing that documents got added to s3
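
As a rough picture of the "info" layout described in the list above, the following sketch nests user-added metrics under one field. The helper name `build_metric_document` and all field names are hypothetical; the real document schema is not shown in this message.

```python
from typing import Any, Dict


def build_metric_document(
    job_fields: Dict[str, Any], user_metrics: Dict[str, Any]
) -> Dict[str, Any]:
    """Nest user-added metrics under "info" so they cannot clash with top-level fields."""
    doc = dict(job_fields)
    doc["info"] = dict(user_metrics)
    return doc


doc = build_metric_document(
    {"repo": "pytorch/pytorch", "workflow": "pull"},      # hypothetical top-level fields
    {"repo": "user-supplied value", "duration_s": 12.5},  # hypothetical user metrics
)
print(doc["repo"])          # the top-level field is untouched
print(doc["info"]["repo"])  # the user metric lives under "info"
```

Because user metrics never share a namespace with the top-level fields in this layout, no list of reserved names is needed.
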
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136799
Approved by: https://github.com/ZainRizvi
2024-10-02 23:19:28 +00:00
2c9e194e23 Revert "[FSDP2] support torch._foreach_copy_(float8) for fully_shard(Float8Linear) (#135955)"
This reverts commit b50b3b32191e7192a28c54a417891f24df4e4dda.

Reverted https://github.com/pytorch/pytorch/pull/135955 on behalf of https://github.com/PaliC due to breaking internal tests ([comment](https://github.com/pytorch/pytorch/pull/135955#issuecomment-2389810936))
2024-10-02 22:46:31 +00:00
bb03ef7aca [FlexAttention] Fix max-autotune when captured buffers are View nodes (#137204)
## Summary

Originally reported in https://github.com/pytorch-labs/attention-gym/issues/45

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137204
Approved by: https://github.com/Chillee, https://github.com/BoyuanFeng
2024-10-02 22:19:33 +00:00
759cd73adb [Profiler] Update Kineto Submodule (#137137)
Summary: Updating commits from Aug 7, 2024 to Sep 26, 2024

Test Plan: Phabricator + OSS CI

Differential Revision: D63723255

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137137
Approved by: https://github.com/aaronenyeshi
2024-10-02 22:19:28 +00:00
e9e5d767b6 [inductor] Fix build_paths usage in config.py (#137187)
Summary: In https://github.com/pytorch/pytorch/pull/136234 we accidentally used the old version of `build_paths`, but in https://github.com/pytorch/pytorch/pull/136952 the API slightly changed. This diff addresses that issue by updating the API usage.

Test Plan: CI

Reviewed By: ColinPeppler

Differential Revision: D63764809

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137187
Approved by: https://github.com/ColinPeppler
2024-10-02 22:06:02 +00:00
e95b230fd8 Fix NJT serialization (#137031)
Fixes #129366

Since NJT has custom serialization logic, we need an NJT-specific fix to clear out cached sizes / strides PyCapsules. Eventually, we should switch NJT to use the default serialization logic, but this depends on #125622 being addressed.

This PR also makes serialization more complete by explicitly handling `lengths`, `ragged_idx`, and the `metadata_cache`, ensuring correct operation for both contiguous and non-contiguous NJTs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137031
Approved by: https://github.com/soulitzer
ghstack dependencies: #137030
2024-10-02 21:41:35 +00:00
eqy
be423a8480 [SDPA] Bump grad_query fudge factor for Flash Attention (#135711)
Tolerance issue on small GPUs (e.g., A16, A2)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135711
Approved by: https://github.com/Skylion007, https://github.com/drisspg
2024-10-02 21:35:00 +00:00
36fb342ffd Check for fused kernel before inplace update (#137042)
Summary:
Given an op, with a pair (output buffer, input buffer) from that op, we consider marking the output buffer as inline. However, if the parent of input buffer and the current op are going to be fused, then we don't want to mark the output buffer as inline. This change checks that criterion, and skips inlining if it is so.

Test Plan:
New unit test "layer_norm_should_not_inplace" runs LayerNorm and checks for no "in_out" pointers.

Fixes #120217

Here's a diagram of the issue:
![Inline+Fusion](https://github.com/user-attachments/assets/c03308d8-fdbf-40a0-a46d-964ece5f9e6d)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137042
Approved by: https://github.com/eellison
2024-10-02 21:14:34 +00:00
a3f3773477 Make PT2E work with both IR simultaneously (#135769)
Summary: as title

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:quantization_pt2e_qat
```

Differential Revision: D62449830

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135769
Approved by: https://github.com/angelayi
2024-10-02 21:05:22 +00:00
4a9225fa1f improve get_schedule_class() (#137103)
Small change to make `get_schedule_class()` accept case-insensitive schedule names
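
A case-insensitive lookup of this kind could look roughly like the sketch below. It is not the actual implementation: `lookup_schedule`, `_SCHEDULES`, and the two placeholder classes are hypothetical names used only for illustration.

```python
from typing import Dict


class ScheduleA:
    """Placeholder schedule class (hypothetical)."""


class ScheduleB:
    """Placeholder schedule class (hypothetical)."""


# Hypothetical registry keyed by the canonical, mixed-case names.
_SCHEDULES: Dict[str, type] = {"ScheduleA": ScheduleA, "ScheduleB": ScheduleB}


def lookup_schedule(name: str) -> type:
    """Return the schedule class for `name`, ignoring letter case."""
    by_lower = {key.lower(): cls for key, cls in _SCHEDULES.items()}
    try:
        return by_lower[name.lower()]
    except KeyError:
        raise ValueError(f"Unknown schedule name: {name!r}") from None


print(lookup_schedule("schedulea") is ScheduleA)
print(lookup_schedule("SCHEDULEB") is ScheduleB)
```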

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137103
Approved by: https://github.com/kwen2501
2024-10-02 20:08:25 +00:00
2d465e4d1d [non ghstack] Init threadpool with user defined num_threads before default (#137051)
Very similar to https://github.com/pytorch/pytorch/pull/136793, but adds back `pool->set_thread_count` call as it is still necessary (I am guessing due to the mutex)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137051
Approved by: https://github.com/albanD
2024-10-02 20:02:30 +00:00
59d7cf7342 Add _dynamo.config inline_inbuilt_nn_modules and specialize_float logging (#137139)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137139
Approved by: https://github.com/ezyang
2024-10-02 19:58:38 +00:00
2b329d3bf1 Fix typo in _normalize ref (#137079)
I think this should basically make no difference numerically, but it does have some ramifications on things like CSE.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137079
Approved by: https://github.com/Skylion007
ghstack dependencies: #136826, #137043, #137049, #137065
2024-10-02 19:06:48 +00:00
6374a19a6e Fix wrapper subclass serialization with custom sizes / strides (#137030)
Fixes #130154

This PR takes the strategy outlined in the above issue and clears out any cached sizes / strides PyCapsules before serialization. This affects the default subclass serialization logic.

The PyCapsule issue also affects `deepcopy`, so that's fixed here as well.

Note: I originally tried utilizing a context manager to remove / restore cached PyCapsules after serialization, but in practice the state returned from `_reduce_ex_internal()` references the actual `tensor.__dict__`, so the problem persists once the cached values are restored. Instead, we have to be careful to remove the cached values in the right place so they're not re-cached when pulling out size / stride information for serialization.
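
The general idea of keeping an unpicklable cached value out of the serialized state can be shown with a plain-Python analogy. This is not the tensor code path and uses a different mechanism (filtering a copy of the state in `__getstate__`): the `Wrapper` class and the `_sizes_cache` / `_strides_cache` keys are hypothetical, and a `memoryview` stands in for a PyCapsule because it likewise cannot be pickled.

```python
import copy
import pickle


class Wrapper:
    """Toy object whose __dict__ may hold an unpicklable cached value."""

    _CACHE_KEYS = ("_sizes_cache", "_strides_cache")  # hypothetical cache slots

    def __init__(self, sizes):
        self.sizes = tuple(sizes)

    def cached_sizes(self):
        # Lazily cache a memoryview; like a PyCapsule, it cannot be pickled.
        if "_sizes_cache" not in self.__dict__:
            self.__dict__["_sizes_cache"] = memoryview(bytes(self.sizes))
        return self.__dict__["_sizes_cache"]

    def __getstate__(self):
        # Serialize a filtered copy so the live object keeps its cache.
        state = dict(self.__dict__)
        for key in self._CACHE_KEYS:
            state.pop(key, None)
        return state


w = Wrapper((2, 3))
w.cached_sizes()                          # populate the cache
restored = pickle.loads(pickle.dumps(w))  # works because the cache is dropped
cloned = copy.deepcopy(w)                 # deepcopy goes through __getstate__ too
print(restored.sizes, cloned.sizes)
```
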
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137030
Approved by: https://github.com/albanD
2024-10-02 18:55:03 +00:00
8962610247 [BE][clang-format] make macro PyObject_HEAD_INIT(type) and PyVarObject_HEAD_INIT(type, size) have its own line (#136949)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136949
Approved by: https://github.com/albanD, https://github.com/eqy
ghstack dependencies: #136945
2024-10-02 18:39:22 +00:00
89c37be6b7 [BE][clang-format] make macro PyObject_HEAD have its own line (#136945)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136945
Approved by: https://github.com/albanD
2024-10-02 18:39:21 +00:00
54f50f19eb [dtensor][experimental] expose DTensor Context Parallel API (#137038)
**Summary**
Expose the experimental Context Parallel API `torch.distributed.tensor.experimental._attention.context_parallel` in the `torch.distributed.tensor.experimental` module.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137038
Approved by: https://github.com/wz337, https://github.com/fegin
2024-10-02 18:00:23 +00:00
4559cddaf9 Revert "Add py3.13t linux wheel build (#137127)"
This reverts commit 6b7adc12140d3073c5700cc1c48998556489857e.

Reverted https://github.com/pytorch/pytorch/pull/137127 on behalf of https://github.com/jovianjaison due to Sorry for reverting your changes but 2 jobs are failing ([comment](https://github.com/pytorch/pytorch/pull/137127#issuecomment-2389250791))
2024-10-02 17:44:42 +00:00
b50b3b3219 [FSDP2] support torch._foreach_copy_(float8) for fully_shard(Float8Linear) (#135955)
This PR unblocks the unit test with a single Float8Linear module. It fixes the following error:
```
torch._foreach_copy_(foreach_copy_dsts, all_gather_inputs)
[rank0]:E0913 13:44:29.829000 2179476 torch/testing/_internal/common_distributed.py:671] RuntimeError: "foreach_tensor_copy" not implemented for 'Float8_e4m3fn'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135955
Approved by: https://github.com/vkuzo, https://github.com/eqy
2024-10-02 17:26:45 +00:00
c318bafe9c [inductor mkldnn test][BE] Use parametrize to shorten test run time (#137153)
Summary:
Tests in test_mkldnn_pattern_matcher.py can take too long to finish. Splitting them into smaller tests, using `parametrize`.

I guess this means this test file has some refactoring opportunities as well. A next step would be to parametrize the add functions.

Differential Revision: D63723925

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137153
Approved by: https://github.com/desertfire
2024-10-02 17:20:27 +00:00
466623fb51 [CI] Support for CI GPU test and benchmark on containers (#137169)
Renames the arc references to container, and adds the changes required so that CI jobs that require a GPU can run on containers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137169
Approved by: https://github.com/huydhn
2024-10-02 17:10:59 +00:00
e3fd4d796f [CI] Skip sccache for nvcc builds when building for A100 (#137170)
There is an unknown issue with nvcc builds and sccache; it crashes with:

```
      /opt/cache/bin/sccache /usr/local/cuda-12.1/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_MPI -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_py_EXPORTS -I/tmp/pip-install-893ub5fd/fbgemm-gpu_f79a3c2737924c478e50ea29fedfa172/fbgemm_gpu -I/tmp/pip-install-893ub5fd/fbgemm-gpu_f79a3c2737924c478e50ea29fedfa172/fbgemm_gpu/include -I/tmp/pip-install-893ub5fd/fbgemm-gpu_f79a3c2737924c478e50ea29fedfa172/fbgemm_gpu/../include -I/tmp/pip-install-893ub5fd/fbgemm-gpu_f79a3c2737924c478e50ea29fedfa172/fbgemm_gpu/../third_party/asmjit/src -I/tmp/pip-install-893ub5fd/fbgemm-gpu_f79a3c2737924c478e50ea29fedfa172/fbgemm_gpu/../third_party/cpuinfo/include -isystem /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/include -isystem /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.1/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -D_GLIBCXX_USE_CXX11_ABI=1 --expt-relaxed-constexpr -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -MD -MT CMakeFiles/fbgemm_gpu_py.dir/src/sparse_ops/sparse_index_select.cu.o -MF CMakeFiles/fbgemm_gpu_py.dir/src/sparse_ops/sparse_index_select.cu.o.d -x cu -c /tmp/pip-install-893ub5fd/fbgemm-gpu_f79a3c2737924c478e50ea29fedfa172/fbgemm_gpu/src/sparse_ops/sparse_index_select.cu -o CMakeFiles/fbgemm_gpu_py.dir/src/sparse_ops/sparse_index_select.cu.o
      sccache: error: failed to execute compile
      sccache: caused by: error reading compile response from server
      sccache: caused by: Failed to read response header
      sccache: caused by: failed to fill whole buffer
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137170
Approved by: https://github.com/huydhn
2024-10-02 17:07:24 +00:00
d4cf90d282 [BE] [CI] Skip clean gha workspace if CI is running in a container for checkout-pytorch (#137168)
For the reusable action checkout-pytorch, skips cleaning workspace when running from a container environment.

The motivation for this change is twofold:
* There is no need for cleanup when running in ephemeral containers, as any changes will be discarded when the docker container is terminated;
* In the specific case of GITHUB_WORKSPACE, sharing it between multiple containers requires mounting it with `-v`, which makes it impossible to run `rm -r` on the mount path and delete it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137168
Approved by: https://github.com/huydhn
2024-10-02 17:04:50 +00:00
af3e25fea7 remove capture_autograd_function flag (#136959)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136959
Approved by: https://github.com/jansel
2024-10-02 16:59:19 +00:00
bcaa0f5ee9 [CI] Remove nanogpt from perf smoke test (#137176)
Summary: nanogpt's performance is not stable. Remove it from the perf smoke test. We may want to use another test instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137176
Approved by: https://github.com/eellison
2024-10-02 16:35:04 +00:00
7001907480 raw_alloc ignores PYTORCH_NO_CUDA_MEMORY_CACHING (#131114)
raw_alloc is used by cudnn, miopen, thrust, and tunableop.  Without this PR, the env var for disabling the caching allocator will only partially work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131114
Approved by: https://github.com/eqy, https://github.com/houseroad, https://github.com/albanD

Co-authored-by: Nichols A. Romero <nick.romero@amd.com>
2024-10-02 16:27:15 +00:00
a954a9ea75 [Inductor] External callable registration API for Matmul tuning candidates (#130774)
Fixes #[130769](https://github.com/pytorch/pytorch/issues/130769)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130774
Approved by: https://github.com/jansel

Co-authored-by: Jason Ansel <jansel@meta.com>
2024-10-02 15:38:10 +00:00
af86a6fdba [dynamo][user-defined-class] Fallback when object.__new__ fails (#137044)
Seen in https://github.com/vllm-project/vllm/pull/8949

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137044
Approved by: https://github.com/jansel
2024-10-02 14:15:36 +00:00
d29094888b Use torch.Stream & torch.Event for Dynamo capture (#134850)
# Motivation
This PR aims to solve the multiple inheritance problem.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134850
Approved by: https://github.com/yf225, https://github.com/EikanWang
2024-10-02 14:15:33 +00:00
bf73af4b4e dont let partitioner think it can fuse pointwise ops into user triton kernels (#136878)
Previously if we had a graph like:
```
        triton_kernel_wrapper_functional_proxy = triton_kernel_wrapper_functional(...)
        getitem: "f32[3][1]cuda:0" = triton_kernel_wrapper_functional_proxy['out_ptr']
        getitem_1: "f32[3][1]cuda:0" = triton_kernel_wrapper_functional_proxy['out2_ptr']
        sigmoid: "f32[3][1]cuda:0" = torch.ops.aten.sigmoid.default(getitem_1)
        mul: "f32[3][1]cuda:0" = torch.ops.aten.mul.Tensor(tangents_1, sigmoid)
```

The partitioner would assume that the `sigmoid()` could be fused into either its user (the pointwise mul), or its producer (the user triton kernel). This could lead to a bad partitioning:

(1) If the partitioner thinks we can fuse the sigmoid with its producer triton kernel, we would keep the sigmoid compute in the forward, and have to generate two separate kernels in the forward (user triton kernel, dedicated sigmoid kernel)

(2) if the partitioner puts the sigmoid in the backward instead, we could fuse it with an existing backward kernel (the mul with a tangent)

Reviewed By: embg

Differential Revision: D63551393

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136878
Approved by: https://github.com/zou3519
2024-10-02 13:52:44 +00:00
5c2c3ca10b [Inductor] Fix test_conv2d_unary_cpu_cpp_wrapper failure (#137158)
Summary: test_conv2d_unary_cpu_cpp_wrapper is failing on ciflow/slow because of mis-handling of inf. This PR fixes that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137158
Approved by: https://github.com/chenyang78
2024-10-02 13:21:35 +00:00
d117ec1d6e [3/3][Inductor] Make CK work in FBCode (#136234)
Summary:
# Context
Goal: Enable CK for Inductor in FBCode

We split this stack into three diffs to help with review & in case we need to revert anything.

# This Diff
* Gets us to have CK kernels as an option for GEMM autotuning in Inductor.

Reviewed By: zjing14

Differential Revision: D62662705

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136234
Approved by: https://github.com/tenpercent, https://github.com/chenyang78
2024-10-02 12:17:38 +00:00
6b7adc1214 Add py3.13t linux wheel build (#137127)
Builder PR required: https://github.com/pytorch/builder/pull/2001
Test PR: https://github.com/pytorch/pytorch/pull/136490/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137127
Approved by: https://github.com/albanD
2024-10-02 11:59:33 +00:00
8c29a0dd0e [pipelining] Clean up dead code (#136804)
'set_requires_grad' dict appears to be always full of "False" values,
and we always set requires_grad based on the value of 'has_backward'

Setting of the requires_grad field was being done repeatedly during
get_fwd_recv_ops, but it should be done just once, so move it to the
function that creates the recv buffers in the first place.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136804
Approved by: https://github.com/kwen2501
2024-10-02 11:26:31 +00:00
cyy
862029a1ef [Distributed] [15/N] Fix clang-tidy warnings in torch/csrc/distributed/ (#137072)
Follows #136848

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137072
Approved by: https://github.com/kwen2501
2024-10-02 10:56:15 +00:00
ed02309232 type _dynamo/create_parameter_op.py (#136958)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136958
Approved by: https://github.com/jansel
2024-10-02 10:23:37 +00:00
52d29a2b94 [reland #136389] Skip kernel saving if already existed (#137073)
Summary:
We skip save_gpu_kernel if the kernel has already been saved.
This gives us a more accurate Triton profiling result. The
following traces show the benchmarking of a trivial addmm before
and after the change:

Before:
<img width="1255" alt="Screenshot 2024-09-23 at 10 26 53 AM" src="https://github.com/user-attachments/assets/5aea05ef-6ef0-464c-8da9-17b31c97b43a">

After:
<img width="910" alt="Screenshot 2024-09-23 at 10 27 03 AM" src="https://github.com/user-attachments/assets/488b7d4f-268f-41cf-8553-cb16ceeae118">

We can see that before the change, the benchmarking includes two parts,
   (1) The overhead of our triton_heuristic call, which includes the
   save/get, and the (expensive) hash computation.
   (2) The exact computation of Triton kernel.

   We see that (1) accounts for >50% of the time, which causes kernel
   selection during profiling to choose aten kernels over Triton kernels.
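
The early-exit idea can be sketched in a few lines. This is an illustration only, not the Inductor implementation: `KernelStore`, `_expensive_hash`, and the kernel name are hypothetical stand-ins for the save path and the hash computation mentioned above.

```python
import hashlib


class KernelStore:
    """Toy store that saves each kernel's source at most once (hypothetical)."""

    def __init__(self):
        self._saved = {}
        self.hash_calls = 0  # counts how often the costly step actually runs

    def _expensive_hash(self, source):
        # Stand-in for the (expensive) hash computation on the kernel source.
        self.hash_calls += 1
        return hashlib.sha256(source.encode()).hexdigest()

    def save(self, name, source):
        # Early exit: the kernel is already stored, so skip the hash entirely.
        if name in self._saved:
            return self._saved[name]
        digest = self._expensive_hash(source)
        self._saved[name] = digest
        return digest


store = KernelStore()
for _ in range(100):  # e.g. repeated benchmark iterations of the same kernel
    store.save("triton_addmm", "def kernel(): ...")
print(store.hash_calls)
```

With the check in place, the costly step runs once no matter how many times the benchmark loop calls `save`, so the measured time is dominated by the kernel itself.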

Test Plan:
Existing OSS CI
python test/inductor/test_cuda_cpp_wrapper.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137073
Approved by: https://github.com/desertfire
2024-10-02 09:27:08 +00:00
e374d6850a [distributed][test] Remove unused variable and fix doc typo (#136943)
Refactor distributed test code:
- Fix TODO: Remove unused variable
- Fix doc typo
- Migrate deprecated method call `load_state_dict` and `save_state_dict`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136943
Approved by: https://github.com/H-Huang
2024-10-02 08:31:53 +00:00
e9a55b43a1 [inductor] Support lists of tensors in operatorbench (#136911)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136911
Approved by: https://github.com/eellison
2024-10-02 06:41:06 +00:00
a89e3c2490 Add compiled_autograd_kwargs_override Dynamo config (#136967)
For Traceable FSDP2, the most common use case is to have `fullgraph=False` for forward pass (to allow user-level graph breaks), and `fullgraph=True` for compiled autograd backward pass (required for queue_callback support).

With `torch._dynamo.compiled_autograd=True`, we were previously unable to set a different `fullgraph` config value for the forward vs. backward pass, since `rebuild_ctx` just reuses the forward compile config as-is. This PR adds the `torch._dynamo.config.compiled_autograd_kwargs_override` config to allow forcing `fullgraph=True` for CA Dynamo tracing.

With this PR, we can remove standalone compiled autograd ctx manager usage in Traceable FSDP2 unit tests, and consolidate on using `torch._dynamo.compiled_autograd=True`.

Test commands:
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor_fullgraph_True`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136967
Approved by: https://github.com/xmfan
2024-10-02 06:23:59 +00:00
b51d22b8bb [BE] [NEON] Use vshlq_n_u32 instead of vshlq_u32 (#137122)
As compiler optimizes it away anyway

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137122
Approved by: https://github.com/kit1980
2024-10-02 06:18:11 +00:00
2854d157de Add type annotations for higher order ops/flex_attention (#137065)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137065
Approved by: https://github.com/drisspg, https://github.com/Skylion007
ghstack dependencies: #136826, #137043, #137049
2024-10-02 04:39:25 +00:00
3b8511dadf Remove python 3.8 from triton builds (#137141)
All jobs have switched to Python 3.9. These 3.8 builds are no longer necessary.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137141
Approved by: https://github.com/albanD
2024-10-02 03:36:54 +00:00
8e39f2a4a5 [Inductor] Enable a SDPA pattern matching for CUDA (#137085)
Summary: Fixes https://github.com/pytorch/pytorch/issues/122429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137085
Approved by: https://github.com/eellison
2024-10-02 03:22:11 +00:00
18525e185e Fix rendezvous error due to EtcdStore get method not waiting in some cases (#137056)
Fixes #132950

This fixes an issue in `torch/distributed/elastic/rendezvous/etcd_store.py` where the [get method](https://github.com/pytorch/pytorch/blob/v2.4.0/torch/distributed/elastic/rendezvous/etcd_store.py#L60) does not wait as expected when no keys have been written under the store prefix yet (and therefore the store prefix key does not exist). This was because the `_try_wait_get` method would error out immediately [here](https://github.com/alenawang/pytorch/blob/main/torch/distributed/elastic/rendezvous/etcd_store.py#L179) if the prefix was not found instead of continuing to the etcd watch.

This was causing upstream issues where distributed jobs using etcd-v2 could not get past the initial rendezvous at all (details in issue #132950).

We added a test demonstrating this issue and the fix. Without the fix the test fails with `etcd.EtcdKeyNotFound: Key not found : /torch/elastic/store` instead of waiting for the first key to be written; with the fix the test waits properly.
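
The intended behavior (a `get` that waits for the first write instead of failing when nothing exists yet) can be illustrated with a toy in-memory store. This is a sketch under stated assumptions, not the EtcdStore code: `WaitingStore` and the key name are hypothetical, and a `threading.Condition` stands in for the etcd watch.

```python
import threading


class WaitingStore:
    """Toy in-memory key-value store whose get() blocks until the key exists."""

    def __init__(self):
        self._data = {}
        self._cond = threading.Condition()

    def set(self, key, value):
        with self._cond:
            self._data[key] = value
            self._cond.notify_all()  # wake any waiting readers

    def get(self, key, timeout=5.0):
        with self._cond:
            # A missing key (or an entirely empty store) is not an error here:
            # keep waiting until a writer shows up or the timeout expires.
            if not self._cond.wait_for(lambda: key in self._data, timeout):
                raise TimeoutError(f"key {key!r} was not set within {timeout}s")
            return self._data[key]


store = WaitingStore()
# The writer arrives after the reader has already started waiting.
writer = threading.Timer(0.1, store.set, args=("rank0", b"ready"))
writer.start()
value = store.get("rank0")
writer.join()
print(value)
```

The reader only fails once the timeout expires, which mirrors the difference between erroring out immediately and continuing to the watch.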

Co-authored-by: tarat44 <32471142+tarat44@users.noreply.github.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137056
Approved by: https://github.com/fduwjj
2024-10-02 01:45:00 +00:00
f108f88c40 [logging/debugging] handle None (constant) args in debug log (#137032)
Summary:
# Why

The arguments are filtered out as they are just const in the compiled graph, but the logger still expects a non-None type

# What

When passing a filtered-out arg (None) to the debug logger, just log that it's a filtered-out argument instead of throwing a TypeError.
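
A minimal sketch of that handling, assuming a helper that formats one argument per log line. `describe_arg` and the argument names are hypothetical and do not come from the PR.

```python
def describe_arg(name, arg):
    """Return a debug-log line for one kernel argument (hypothetical helper)."""
    if arg is None:
        # Constant args are filtered out of the compiled graph and arrive as None.
        return f"{name}: <filtered out (constant) argument>"
    return f"{name}: type={type(arg).__name__} value={arg!r}"


lines = [describe_arg(n, a) for n, a in [("x", 1.5), ("scale", None)]]
for line in lines:
    print(line)
```

Checking for `None` before touching any attribute of the argument is what avoids the type error; everything else is logged as before.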

# Background

https://github.com/pytorch/pytorch/pull/131594

Test Plan: - execute repro from https://github.com/pytorch/pytorch/issues/135584#issue-2516944089 with and without the edits

Differential Revision: D63652564

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137032
Approved by: https://github.com/angelayi
2024-10-02 01:43:22 +00:00
f984b88718 Ensure noncontiguous tensor creation tests offsetting (#136396)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136396
Approved by: https://github.com/amjames, https://github.com/eellison
ghstack dependencies: #136055
2024-10-02 00:40:43 +00:00
c7638da558 Lowerings: remove restriction on TensorBox keyword arguments (#136055)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136055
Approved by: https://github.com/eellison
2024-10-02 00:40:43 +00:00
63d6908da0 fix build error with gcc 12+ (#137092)
Fixes #127920

This commit addresses a build failure occurring with GCC 12 and above due to the -Werror=nonnull flag. The error manifests in the test_api target.

**Issue:**
When building with GCC 12+, the following error occurs:
```
error: argument 1 null where non-null expected [-Werror=nonnull]
  431 |             __builtin_memmove(__result, __first, sizeof(_Tp) * _Num);
      |             ~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```

This change ensures that:
1. The flag is only added for GCC 12 or higher
2. The flag is only added if it's supported by the compiler
3. The flag is added specifically to the test_api target, not globally

By disabling this specific error, we allow the build to proceed while maintaining other compiler warnings.

**Test Plan:**
- Verified successful build with GCC 12 and above
- Ensured no regression in builds with earlier GCC versions and other compilers

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137092
Approved by: https://github.com/malfet
2024-10-02 00:37:15 +00:00
d725758210 [ts_converter] Fix prim::If buffer names (#136648)
Summary:
We previously incorrectly handled the following graph, specifically for the node `w.3` in `block0`:
```
 graph(%x.1 : Float(3, strides=[1], requires_grad=0, device=cpu),
       %y.1 : int):
   %2 : __torch__.___torch_mangle_1.M = prim::CreateObject()
   %3 : int = prim::Constant[value=20](), scope: M:: # /data/users/angelayi/pytorch/test/export/test_converter.py:747:34
   %4 : int = prim::Constant[value=10](), scope: M:: # /data/users/angelayi/pytorch/test/export/test_converter.py:746:34
   %5 : int = prim::Constant[value=1](), scope: M::
   %w.1 : int = prim::GetAttr[name="w"](%2), scope: M::
   %7 : int = aten::mul(%w.1, %4), scope: M:: # /data/users/angelayi/pytorch/test/export/test_converter.py:746:25
    = prim::SetAttr[name="w"](%2, %7), scope: M::
   %h.1 : int = prim::GetAttr[name="h"](%2), scope: M::
   %9 : int = aten::mul(%h.1, %3), scope: M:: # /data/users/angelayi/pytorch/test/export/test_converter.py:747:25
    = prim::SetAttr[name="h"](%2, %9), scope: M::
   %10 : bool = aten::gt(%y.1, %4), scope: M:: # /data/users/angelayi/pytorch/test/export/test_converter.py:749:19
   %res.37 : Tensor = prim::If(%10), scope: M:: # /data/users/angelayi/pytorch/test/export/test_converter.py:749:16
     block0():
       %w.3 : int = prim::GetAttr[name="w"](%2), scope: M::
       %res.1 : Tensor = aten::add(%x.1, %w.3, %5), scope: M:: # <string>:5:9
       -> (%res.1)
     block1():
       %h.3 : int = prim::GetAttr[name="h"](%2), scope: M::
       %res.3 : Tensor = aten::add(%x.1, %h.3, %5), scope: M:: # <string>:5:9
       -> (%res.3)
   %16 : bool = aten::lt(%y.1, %4), scope: M:: # /data/users/angelayi/pytorch/test/export/test_converter.py:754:19
   %res : Tensor = prim::If(%16), scope: M:: # /data/users/angelayi/pytorch/test/export/test_converter.py:754:16
     block0():
       %w : int = prim::GetAttr[name="w"](%2), scope: M::
       %res.15 : Tensor = aten::add(%res.37, %w, %5), scope: M:: # <string>:5:9
       -> (%res.15)
     block1():
       %h : int = prim::GetAttr[name="h"](%2), scope: M::
       %res.21 : Tensor = aten::add(%res.37, %h, %5), scope: M:: # <string>:5:9
       -> (%res.21)
   return (%res)
```

Test Plan: CI

Differential Revision: D63399064

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136648
Approved by: https://github.com/SherlockNoMad
2024-10-02 00:07:47 +00:00
8765804542 Continue on error for pytorch autolint (#137104)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137104
Approved by: https://github.com/huydhn, https://github.com/atalman
2024-10-01 22:30:36 +00:00
f0fa460c60 [BE] Add script to keep the runner-determinator scripts in sync (#136794)
Whenever we update runner_determinator.py it needs to be copied over into _runner-determinator.yml.

This is a quick utility script to make that process less tedious
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136794
Approved by: https://github.com/zxiiro, https://github.com/jeanschmidt
2024-10-01 22:26:28 +00:00
4f93de8951 Mark PyTorch module as no-gil valid and pythoncapi_compat.h (#136899)
Uses of PyList_GetItem have been audited, but other APIs have not yet (they will be done in a follow-up PR to keep this one small enough).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136899
Approved by: https://github.com/colesbury, https://github.com/atalman
2024-10-01 22:05:35 +00:00
6baee60e3c upload test stats: remove nan/inf when uploading (#136877)
`json.dumps(float("inf"))` returns `Infinity`, which is technically invalid json

This is fine if you json.load, but ClickHouse cannot handle it

Solution here: cast inf and nan to string (which ClickHouse is able to cast back to float)
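
A minimal sketch of the casting step, assuming the stats are plain dicts and lists. The helper name `sanitize_non_finite` and the sample record are hypothetical; this is not the actual upload code.

```python
import json
import math


def sanitize_non_finite(value):
    """Recursively replace inf/nan floats with their string form."""
    if isinstance(value, float) and not math.isfinite(value):
        return str(value)  # "inf", "-inf", or "nan"
    if isinstance(value, dict):
        return {k: sanitize_non_finite(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [sanitize_non_finite(v) for v in value]
    return value


record = {"time": float("inf"), "ratio": float("nan"), "runs": [1.5, float("-inf")]}
# allow_nan=False makes json.dumps raise if a non-finite float slipped through.
payload = json.dumps(sanitize_non_finite(record), allow_nan=False)
print(payload)
```

Passing `allow_nan=False` is a cheap guard: serialization fails loudly instead of emitting `Infinity` or `NaN` tokens that a strict parser would reject.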
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136877
Approved by: https://github.com/huydhn
2024-10-01 21:47:46 +00:00
0788d016d6 Update incompatible cudagraph ops skip message (#137015)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137015
Approved by: https://github.com/BoyuanFeng
2024-10-01 21:23:36 +00:00
34c18887ad [FlexAttention] Remove restriction on QK headdim > V headdim (#135884)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135884
Approved by: https://github.com/Chillee
2024-10-01 21:17:54 +00:00
99eb47fb6d Add CI for Triton CPU backend (#135342)
Where possible, I have marked failing tests (which we intend to fix or triage) as `@xfail_if_triton_cpu`. This will help us track progress of the Triton CPU backend over time. Tests that I don't think we need to address, or that are flaky, have been marked as skips.

Successful CI run: https://github.com/pytorch/pytorch/actions/runs/10822238062/job/30028284549

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135342
Approved by: https://github.com/jansel, https://github.com/desertfire, https://github.com/malfet
2024-10-01 20:43:10 +00:00
86b715c5f6 Revert "Skip kernel saving if already existed. (#136389)"
This reverts commit 2521cd387482a70d30e4ea922fa4fe3b488c9f6d.

Reverted https://github.com/pytorch/pytorch/pull/136389 on behalf of https://github.com/muchulee8 due to Issue #136940  ([comment](https://github.com/pytorch/pytorch/pull/136389#issuecomment-2386950623))
2024-10-01 20:04:43 +00:00
b53ab8b86a Revert "[dtensor][experimental] expose DTensor Context Parallel API (#137038)"
This reverts commit e23e766cc089b568aa4c0ebf0747ff9b504b8915.

Reverted https://github.com/pytorch/pytorch/pull/137038 on behalf of https://github.com/huydhn due to Sorry for reverting your changes but the doc build failure looks legit ([comment](https://github.com/pytorch/pytorch/pull/137038#issuecomment-2386902253))
2024-10-01 19:49:28 +00:00
a00f0d5db8 [PT2][Inductor] Add runtime numeric check for the post grad pass (#136724)
Summary: Similar to D51838043, we further add post grad runtime numeric check since some graph passes are performed at aten-level.

Differential Revision: D63438718

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136724
Approved by: https://github.com/Yuzhen11
2024-10-01 18:56:01 +00:00
d61e45283e Properly interpolate sloc here (#137088)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137088
Approved by: https://github.com/Skylion007
2024-10-01 18:33:03 +00:00
c2dee8ea9c enable lazy init for MTIA (#136902)
Summary: As title.

Test Plan: OSS and Internal CIs

Reviewed By: nautsimon, hanzlfs

Differential Revision: D63434511

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136902
Approved by: https://github.com/nautsimon
2024-10-01 18:30:56 +00:00
1f3a793790 Fix PyTorch builds on MacOS-13 (#137095)
By including SonomaOps header

Fixes https://github.com/pytorch/pytorch/issues/137094

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137095
Approved by: https://github.com/atalman, https://github.com/ZainRizvi
2024-10-01 17:56:35 +00:00
e23e766cc0 [dtensor][experimental] expose DTensor Context Parallel API (#137038)
**Summary**
expose experimental Context Parallel API `torch.distributed.tensor.experimental._attention.context_parallel` to module `torch.distributed.tensor.experimental`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137038
Approved by: https://github.com/wz337, https://github.com/fegin
2024-10-01 17:41:28 +00:00
73b07df042 Preserve custom ops via run_decomps (#136882)
This is a re-apply of https://github.com/pytorch/pytorch/pull/136773.

Note that this doesn't completely remove the _preserve_ops list from export, mainly because we want a small change to address the failing executorch tests. All the complications included in this PR are deleted in the next PR.

Differential Revision: [D63553086](https://our.internmc.facebook.com/intern/diff/D63553086/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136882
Approved by: https://github.com/bdhirsh
2024-10-01 17:38:00 +00:00
b1b6816e05 [testing] reenable kernel_benchmark.py tests (#136876)
Summary:
# Why

We want this to run internally

# What

- fix python path issue on the test
- reenable the test

# Background

(copied from similar issue resolved earlier)

It appears that the parent process does not pass the entire path down to the child process. Namely, if there is some setup that makes the sys.path effectively look different than, say, PYTHONPATH or something like this, the child will not inherit this setup. To avoid needing to keep track of specific setups, we pass the effective `sys.path` from the parent to the child through the PYTHONPATH env variable
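
The approach can be sketched as follows. This is an illustration under stated assumptions, not the test's actual code: `run_child_with_parent_path` is a hypothetical helper, and a temporary directory stands in for whatever setup altered `sys.path` in the parent.

```python
import os
import subprocess
import sys
import tempfile


def run_child_with_parent_path(code):
    """Run `code` in a child interpreter that sees the parent's sys.path."""
    env = dict(os.environ)
    # Serialize the parent's effective sys.path; empty entries are dropped.
    env["PYTHONPATH"] = os.pathsep.join(p for p in sys.path if p)
    return subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    ).stdout


# Simulate setup that changed sys.path without touching PYTHONPATH.
with tempfile.TemporaryDirectory() as extra_dir:
    sys.path.append(extra_dir)
    try:
        child_path = run_child_with_parent_path(
            "import sys; print(*sys.path, sep='\\n')"
        ).splitlines()
    finally:
        sys.path.remove(extra_dir)

print(extra_dir in child_path)
```

Building `PYTHONPATH` from `sys.path` at launch time means the child does not need to know which mechanism modified the parent's path.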

Test Plan: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:kernel_benchmark

Differential Revision: D63498897

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136876
Approved by: https://github.com/henrylhtsang
2024-10-01 17:16:21 +00:00
3d0cb81594 [MPS] Enable bfloat16 testing (#136987)
By further reducing the precision of imprecise FP16 ops, introducing a new BF16_LOW_PRECISION_OPS category, and marking BF16 tests as xfail for `divfloor_rounding`, `floor_divide` and `remainder`.
I guess the low-precision results arise because MPSGraph, unlike the rest of PyTorch, does not accumulate in fp32 for reduction operations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136987
Approved by: https://github.com/albanD
ghstack dependencies: #137070
2024-10-01 17:10:07 +00:00
cc2a66c55e [export] hook up mark_dynamic to export Dims (#137029)
Adds Dim.DYNAMIC which calls torch._dynamo.mark_dynamic() in the backend. Similar to Dim.AUTO in that it does automatic inference for ranges & relations, but errors out for specializations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137029
Approved by: https://github.com/avikchaudhuri
2024-10-01 17:05:09 +00:00
ef6fd3d780 Fix adaptive_max_pool2d fallback (#136367)
Fixes #136332
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136367
Approved by: https://github.com/amjames, https://github.com/eellison
2024-10-01 16:20:34 +00:00
8f4f7bed5d [MPS] Fix bfloat to complex casts (#137070)
For Metal cast ops to complex, one needs to explicitly cast to/from `bfloat`, unlike for other dtypes

Tested in https://github.com/pytorch/pytorch/pull/136987
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137070
Approved by: https://github.com/Skylion007
2024-10-01 15:47:29 +00:00
696d01aef3 Revert "inductor: use previous guards to know if a size is 1 for broadcasting (#136670)"
This reverts commit dfdda2f6a603ae9245f38a3e8f6365c3cb6d49ac.

Reverted https://github.com/pytorch/pytorch/pull/136670 on behalf of https://github.com/ZainRizvi due to Something in this stack seems to be causing tests to fail on trunk. See functorch/test_control_flow.py::TestControlFlow::test_associative_scan_dim_reverse_True_combine_mode_generic_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/11107079955/job/30872132411) [HUD commit link](c010c6099b) ([comment](https://github.com/pytorch/pytorch/pull/136670#issuecomment-2386303362))
2024-10-01 15:23:55 +00:00
951107e8c2 Revert "compile time benchmarks for AOTDispatcher (inference/training/subclasses) (#136759)"
This reverts commit b17cd264d38ca3381391c449bdaf9f03381caf35.

Reverted https://github.com/pytorch/pytorch/pull/136759 on behalf of https://github.com/ZainRizvi due to Something in this stack seems to be causing tests to fail on trunk. See functorch/test_control_flow.py::TestControlFlow::test_associative_scan_dim_reverse_True_combine_mode_generic_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/11107079955/job/30872132411) [HUD commit link](c010c6099b) ([comment](https://github.com/pytorch/pytorch/pull/136670#issuecomment-2386303362))
2024-10-01 15:23:55 +00:00
923410193b Revert "compile time benchmarks for AOTDispatcher (partitioner) (#136760)"
This reverts commit c010c6099bf304bbb681af534b9f3996c33ce582.

Reverted https://github.com/pytorch/pytorch/pull/136760 on behalf of https://github.com/ZainRizvi due to Something in this stack seems to be causing tests to fail on trunk. See functorch/test_control_flow.py::TestControlFlow::test_associative_scan_dim_reverse_True_combine_mode_generic_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/11107079955/job/30872132411) [HUD commit link](c010c6099b) ([comment](https://github.com/pytorch/pytorch/pull/136670#issuecomment-2386303362))
2024-10-01 15:23:55 +00:00
8f5c2b5f17 type _dynamo/test_case.py (#136957)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136957
Approved by: https://github.com/Skylion007
2024-10-01 14:36:22 +00:00
d4cc2aaf1e type _dynamo/logging.py (#136956)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136956
Approved by: https://github.com/Skylion007
2024-10-01 14:35:54 +00:00
7303716005 Revert "Simplify find_localzeros (#133325)"
This reverts commit 99f90c379ed214ab30882a87bdb3924ed6d6c899.

Reverted https://github.com/pytorch/pytorch/pull/133325 on behalf of https://github.com/ezyang due to https://fb.workplace.com/groups/gpuinference/permalink/2921405651341417/ ([comment](https://github.com/pytorch/pytorch/pull/133325#issuecomment-2385832600))
2024-10-01 13:25:03 +00:00
6bd9d37266 Remove allow-untyped-defs from torch.fx.experimental.symbolic_shapes (#137019)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137019
Approved by: https://github.com/Skylion007
ghstack dependencies: #136934, #136935, #136972
2024-10-01 13:22:10 +00:00
cc8f1cddd4 Turn on type-checking in torch.fx.experimental.symbolic_shapes (#136972)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136972
Approved by: https://github.com/Skylion007
ghstack dependencies: #136934, #136935
2024-10-01 13:22:10 +00:00
b85f21fc1d Add decomposition for squeeze_copy (#130941)
* Extracted from #128416

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130941
Approved by: https://github.com/amjames, https://github.com/eellison
ghstack dependencies: #136653
2024-10-01 10:23:22 +00:00
083921852b set FlexAttention devices properly during tracing (#137049)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137049
Approved by: https://github.com/zou3519, https://github.com/drisspg, https://github.com/yanboliang
ghstack dependencies: #136826, #137043
2024-10-01 09:08:08 +00:00
34cef1eaa7 Allow automatic dynamic shapes for closures and set current node properly in flexattention subgraph lowering (#137043)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137043
Approved by: https://github.com/drisspg
ghstack dependencies: #136826
2024-10-01 09:08:08 +00:00
37dd924c2d Fix test/test_linalg.py for NumPy 2 (#136800)
Related to  #107302.

When built and tested with NumPy 2 the following unit tests failed.

```
=========================================================== short test summary info ============================================================
FAILED [0.0026s] test/test_linalg.py::TestLinalgCPU::test_householder_product_cpu_complex128 - TypeError: expected np.ndarray (got Tensor)
FAILED [0.0024s] test/test_linalg.py::TestLinalgCPU::test_householder_product_cpu_complex64 - TypeError: expected np.ndarray (got Tensor)
FAILED [0.0025s] test/test_linalg.py::TestLinalgCPU::test_householder_product_cpu_float32 - TypeError: expected np.ndarray (got Tensor)
FAILED [0.0024s] test/test_linalg.py::TestLinalgCPU::test_householder_product_cpu_float64 - TypeError: expected np.ndarray (got Tensor)
FAILED [0.0016s] test/test_linalg.py::TestLinalgCPU::test_nuclear_norm_axes_small_brute_force_old_cpu - ValueError: Unable to avoid copy while creating an array as requested.
FAILED [0.0054s] test/test_linalg.py::TestLinalgCPU::test_solve_cpu_complex128 - AssertionError: The values for attribute 'shape' do not match: torch.Size([0, 0]) != torch.Size([0, 0, 0]).
FAILED [0.0055s] test/test_linalg.py::TestLinalgCPU::test_solve_cpu_complex64 - AssertionError: The values for attribute 'shape' do not match: torch.Size([0, 0]) != torch.Size([0, 0, 0]).
FAILED [0.0048s] test/test_linalg.py::TestLinalgCPU::test_solve_cpu_float32 - AssertionError: The values for attribute 'shape' do not match: torch.Size([0, 0]) != torch.Size([0, 0, 0]).
FAILED [0.0054s] test/test_linalg.py::TestLinalgCPU::test_solve_cpu_float64 - AssertionError: The values for attribute 'shape' do not match: torch.Size([0, 0]) != torch.Size([0, 0, 0]).
=========================================== 9 failed, 1051 passed, 118 skipped in 152.51s (0:02:32) ============================================
```

This PR fixes them. The test is now compatible with both NumPy 1 & 2.

Some more details:

1. `np.linalg.solve` has changed its behavior, so I added an adapter function in the unit test to keep the behavior the same under both NumPy 1 and NumPy 2.
2. When a `torch.Tensor` is passed to `np.linalg.qr`, the return type is `(np.ndarray, np.ndarray)` in NumPy 1 but `(torch.Tensor, torch.Tensor)` in NumPy 2, which caused the failure.
3. NumPy 2 no longer allows `np.array(obj, copy=False)` when a copy cannot be avoided and recommends `np.asarray(obj)` instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136800
Approved by: https://github.com/lezcano
2024-10-01 07:53:24 +00:00
df5bbc09d1 Make device-specific event inherits from torch.Event (#134845)
# Motivation
This PR makes the device-specific Event inherit from the generic torch.Event. The benefit is a generic abstract class `torch.Event` for different devices, like `torch.Stream`. This makes it easier for Dynamo to capture the Event of different devices, such as torch.cuda.Event and torch.xpu.Event.
The next PR will remove the now-unneeded base classes `_StreamBase` and `_EventBase` to avoid multiple inheritance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134845
Approved by: https://github.com/albanD, https://github.com/EikanWang
2024-10-01 06:28:41 +00:00
cyy
47a78daf91 [Environment Variable][1/N] Use thread-safe env variable API in c10 (#119449)
This PR is the first step toward wrapping the thread-unsafe getenv and set_env functions in a RW mutex.
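
The idea can be sketched in Python, with the caveat that the PR itself changes C++ code in c10. The helper names `get_env` and `set_env` below are illustrative only, and a plain re-entrant lock stands in for the RW mutex because the Python standard library has no reader-writer lock.

```python
import os
import threading

# Assumption: a single process-wide lock guards every environment access.
# An RLock stands in for the RW mutex described above.
_env_lock = threading.RLock()


def get_env(name, default=None):
    """Read an environment variable while holding the shared lock."""
    with _env_lock:
        return os.environ.get(name, default)


def set_env(name, value):
    """Write an environment variable while holding the shared lock."""
    with _env_lock:
        os.environ[name] = value
```

The protection only holds if every reader and writer goes through the wrappers; code that touches the environment directly bypasses the lock.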
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119449
Approved by: https://github.com/malfet, https://github.com/albanD, https://github.com/eqy
2024-10-01 06:24:30 +00:00
be169f743b [Dynamo] Mark config.dead_code_elimination as deprecated (#136933)
part of #136862

For reviewers, all call sites are here: https://github.com/search?q=repo%3Apytorch%2Fpytorch+dead_code_elimination+language%3APython&type=code&l=Python

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136933
Approved by: https://github.com/williamwen42, https://github.com/anijain2305
2024-10-01 03:51:59 +00:00
6e10f7d8c1 [compiled autograd] undo view_to_reshape inductor fx pass in node name matching (#136741)
Inductor mutates the AOT backward graph. One solution would be to copy the graph, but since we don't know whether compiled autograd will be applied, always cloning it would be expensive.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136741
Approved by: https://github.com/jansel
ghstack dependencies: #135663
2024-10-01 03:22:49 +00:00
40157db5a7 [compiled autograd] log placeholder origin in verbose (#135663)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135663
Approved by: https://github.com/jansel
2024-10-01 03:22:49 +00:00
6966811da6 [test] skip not omit big gpu tests for cuda_cpp_wrapper (#137055)
Summary: When the GPU is not big enough, we currently omit the test cases from the test class. We expect the tests to be skipped instead, but under fbcode CI the omission can raise an error, which makes the test flaky.
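
The difference between omitting and skipping can be sketched with plain `unittest`. This is not the code from the PR; `HAS_BIG_GPU`, the class names and `summarize` are hypothetical stand-ins for the real capability check and test classes.

```python
import unittest

HAS_BIG_GPU = False  # assumption: stand-in for a real hardware capability check


class OmittedWhenSmall(unittest.TestCase):
    # Omitting: the test method is only defined on big GPUs, so on small
    # machines the class is empty and a runner may report "no tests found".
    if HAS_BIG_GPU:
        def test_large_kernel(self):
            self.assertTrue(True)


class SkippedWhenSmall(unittest.TestCase):
    # Skipping: the test always exists and is reported as skipped when the
    # requirement is not met.
    @unittest.skipUnless(HAS_BIG_GPU, "requires a big GPU")
    def test_large_kernel(self):
        self.assertTrue(True)


def summarize(case):
    """Return (number of tests discovered, number of tests skipped)."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(case)
    result = unittest.TestResult()
    suite.run(result)
    return suite.countTestCases(), len(result.skipped)
```

With `HAS_BIG_GPU` false, the omitted class yields no tests at all, while the skipped class yields one test that is recorded as skipped.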

Test Plan: ci

Differential Revision: D62037908

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137055
Approved by: https://github.com/masnesral
2024-10-01 03:03:27 +00:00
cyy
17455695d6 [Distributed] [14/N] Fix clang-tidy warnings in torch/csrc/distributed/ (#136848)
Follows #136713

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136848
Approved by: https://github.com/H-Huang
2024-10-01 02:01:13 +00:00
951af3d3d8 Format torch.fx.experimental.validator (#136935)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136935
Approved by: https://github.com/Skylion007
ghstack dependencies: #136934
2024-10-01 01:47:17 +00:00
33c2d3232f Format torch.fx.experimental.symbolic_shapes with PYFMT (#136934)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136934
Approved by: https://github.com/Skylion007
2024-10-01 01:47:16 +00:00
d9c400bd9f Added some tests to prevent regressions in partitioning and flexattention (#136826)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136826
Approved by: https://github.com/yanboliang, https://github.com/drisspg
2024-10-01 01:08:44 +00:00
3f457ee1f6 Fix AOT Graph capture not propagating non_blocking copy parameter to … (#136513)
…inductor codegen.

Fixes #136260

**Note**: this is my first code contribution to torch so please let me know if there's anything I need to fix/some other convention I should follow.

Regarding the bug, re-running the issue's reproduction code:
```
import torch

def fn(x):
    return x.to(device="cuda", non_blocking=True)

inp = torch.randn(3, 4)

torch.compile(fn)(inp)
```

We now have the non_blocking being passed on to codegen properly:

```
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code] TRACED GRAPH
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code]  ===== pre insert_deferred_runtime_asserts __compiled_fn_1 =====
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code]  <eval_with_key>.0 class GraphModule(torch.nn.Module):
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code]     def forward(self, L_x_: "f32[3, 4]"):
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code]         l_x_ = L_x_
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code]
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code]          # File: /home/niklasz/Desktop/pytorch/temp/reproduction.py:4 in fn, code: return x.to(device="cuda", non_blocking=True)
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code]         to: "f32[3, 4]" = l_x_.to(device = 'cuda', non_blocking = True);  l_x_ = None
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code]         return (to,)
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code]
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code]
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code] TRACED GRAPH
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code]  ===== __compiled_fn_1 =====
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code]  /home/niklasz/Desktop/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code]     def forward(self, L_x_: "f32[3, 4][4, 1]cpu"):
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code]         l_x_ = L_x_
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code]
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code]          # File: /home/niklasz/Desktop/pytorch/temp/reproduction.py:4 in fn, code: return x.to(device="cuda", non_blocking=True)
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code]         to: "f32[3, 4][4, 1]cuda:0" = l_x_.to(device = 'cuda', non_blocking = True);  l_x_ = None
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code]         return (to,)
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code]
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code]
V0922 20:33:25.404000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:114] [0/0] [__aot_graphs] aot_config id: 0, fw_metadata=ViewAndMutationMeta(input_info=[InputAliasInfo(is_leaf=True, mutates_data=False, mutates_metadata=False, mutations_hidden_from_autograd=True, mutations_under_no_grad_or_inference_mode=False, mutation_inductor_storage_resize=False, mutates_storage_metadata=False, requires_grad=False, keep_input_mutations=True)], output_info=[OutputAliasInfo(output_type=<OutputType.non_alias: 1>, raw_type=<class 'torch._subclasses.functional_tensor.FunctionalTensor'>, base_idx=None, dynamic_dims=set(), requires_grad=False, functional_tensor=None)], num_intermediate_bases=0, keep_input_mutations=True, traced_tangents=[], subclass_inp_meta=[0], subclass_fw_graph_out_meta=[0], subclass_tangent_meta=[], is_train=False, traced_tangent_metas=None, num_symints_saved_for_bw=None, grad_enabled_mutation=None, deterministic=None, static_input_indices=[], tokens={}, indices_of_inputs_that_requires_grad_with_mutations_in_bw=[], bw_donated_idxs=None, num_backward_tokens=0),subclass_metadata=None
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs] TRACED GRAPH
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs]  ===== Forward graph 0 =====
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs]  /home/niklasz/Desktop/pytorch/torch/fx/_lazy_graph_module.py class <lambda>(torch.nn.Module):
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs]     def forward(self, arg0_1: "f32[3, 4][4, 1]cpu"):
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs]          # File: /home/niklasz/Desktop/pytorch/temp/reproduction.py:4 in fn, code: return x.to(device="cuda", non_blocking=True)
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs]         device_put: "f32[3, 4][4, 1]cuda:0" = torch.ops.prims.device_put.default(arg0_1, device(type='cuda', index=0), True);  arg0_1 = None
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs]         convert_element_type: "f32[3, 4][4, 1]cuda:0" = torch.ops.prims.convert_element_type.default(device_put, torch.float32);  device_put = None
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs]         return (convert_element_type,)
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs]
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs]
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1134] [0/0] [__output_code] Output code written to: /tmp/torchinductor_niklasz/ha/chaai264g6ribfw3q2qhl6ayjtaqaavku5wivxtzw4nabgd6htsv.py
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] Output code:
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] # AOT ID: ['0_inference']
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from ctypes import c_void_p, c_long, c_int
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] import torch
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] import math
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] import random
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] import os
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] import tempfile
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from math import inf, nan
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from torch._inductor.hooks import run_intermediate_hooks
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from torch._inductor.utils import maybe_profile
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from torch._inductor.codegen.memory_planning import _align as align
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from torch import device, empty_strided
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from torch._inductor.async_compile import AsyncCompile
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from torch._inductor.select_algorithm import extern_kernels
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from torch._inductor.codegen.multi_kernel import MultiKernelCall
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] aten = torch.ops.aten
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] inductor_ops = torch.ops.inductor
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] _quantized = torch.ops._quantized
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] assert_size_stride = torch._C._dynamo.guards.assert_size_stride
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] empty_strided_cpu = torch._C._dynamo.guards._empty_strided_cpu
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] empty_strided_cuda = torch._C._dynamo.guards._empty_strided_cuda
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] empty_strided_xpu = torch._C._dynamo.guards._empty_strided_xpu
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] reinterpret_tensor = torch._C._dynamo.guards._reinterpret_tensor
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] alloc_from_pool = torch.ops.inductor._alloc_from_pool
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] async_compile = AsyncCompile()
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] async_compile.wait(globals())
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] del async_compile
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] def call(args):
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     arg0_1, = args
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     args.clear()
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     assert_size_stride(arg0_1, (3, 4), (4, 1))
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     with torch.cuda._DeviceGuard(0):
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]         torch.cuda.set_device(0)
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]         buf0 = empty_strided_cuda((3, 4), (4, 1), torch.float32)
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]         buf0.copy_(arg0_1, True)
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]         del arg0_1
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     return (buf0, )
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] def benchmark_compiled_module(times=10, repeat=10):
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     from torch._dynamo.testing import rand_strided
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     from torch._inductor.utils import print_performance
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     arg0_1 = rand_strided((3, 4), (4, 1), device='cpu', dtype=torch.float32)
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     fn = lambda: call([arg0_1])
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     return print_performance(fn, times=times, repeat=repeat)
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] if __name__ == "__main__":
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     from torch._inductor.wrapper_benchmark import compiled_module_main
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     compiled_module_main('None', benchmark_compiled_module)
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]
```
Note the line `buf0.copy_(arg0_1, True)` in the output code above. The log setting used was `export TORCH_LOGS="graph_code,aot_graphs,output_code"`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136513
Approved by: https://github.com/eellison
2024-10-01 00:32:47 +00:00
19a4d68224 Add missing mappings to support torch.uint16 in quantization and export (#136547)
Test Plan: CI.

Differential Revision: D63142844

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136547
Approved by: https://github.com/angelayi
2024-10-01 00:01:01 +00:00
18e707645c Substitute unbacked symints in expressions (#137020)
Differential Revision: [D63647095](https://our.internmc.facebook.com/intern/diff/D63647095)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137020
Approved by: https://github.com/ezyang
2024-09-30 23:07:22 +00:00
af64c44b56 Revert "Don't uselessly recompute axiom dict every static eval call (#135429)"
This reverts commit 1d6e0412f5205b1cd709e034526d7f21d6f2d56f.

Reverted https://github.com/pytorch/pytorch/pull/135429 on behalf of https://github.com/ezyang due to try again ([comment](https://github.com/pytorch/pytorch/pull/135429#issuecomment-2384288879))
2024-09-30 22:29:13 +00:00
c07ebaf430 [triton] Try to use triton.language.extra.libdevice when possible (#136997)
Summary:
X-link: https://github.com/facebookresearch/generative-recommenders/pull/90

In view of https://github.com/triton-lang/triton/pull/3825 we should try to use `triton.language.extra.libdevice` instead of `triton.language.extra.cuda.libdevice`.
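
A minimal sketch of the "try the generic path first" pattern, assuming only the two module paths named above. The helper `import_first_available` and the constant `LIBDEVICE_CANDIDATES` are illustrative names, not part of Triton or PyTorch, and the actual change may be structured differently.

```python
import importlib


def import_first_available(candidates):
    """Return the first module in `candidates` that imports; raise ImportError otherwise."""
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            # Covers ModuleNotFoundError too; move on to the next candidate.
            continue
    raise ImportError(f"none of {candidates!r} could be imported")


# Preference order described above: the generic path, then the CUDA-specific one.
LIBDEVICE_CANDIDATES = (
    "triton.language.extra.libdevice",
    "triton.language.extra.cuda.libdevice",
)
```

Calling `import_first_available(LIBDEVICE_CANDIDATES)` would return whichever module the installed Triton provides, and raise `ImportError` if Triton is absent.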

Test Plan: CI

Reviewed By: bertmaher, karthik-man

Differential Revision: D63583965

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136997
Approved by: https://github.com/bertmaher
2024-09-30 21:52:44 +00:00
b3972ee19a [triton] Unify build_paths.py for NV & AMD, fix typing (#136952)
Summary: Some build improvements.

Test Plan: CI

Differential Revision: D63583959

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136952
Approved by: https://github.com/bertmaher
2024-09-30 21:51:45 +00:00
66a269afe8 Revert "Format torch.fx.experimental.symbolic_shapes with PYFMT (#136934)"
This reverts commit cf1a7eab250ea37ca8fda0327e8e38693c3c5c1a.

Reverted https://github.com/pytorch/pytorch/pull/136934 on behalf of https://github.com/ezyang due to merge conflict revert ([comment](https://github.com/pytorch/pytorch/pull/136934#issuecomment-2384195881))
2024-09-30 21:44:44 +00:00
c94536ae74 Revert "Format torch.fx.experimental.validator (#136935)"
This reverts commit 377e4bc877a3ac4cd6d073aa513a309159ade991.

Reverted https://github.com/pytorch/pytorch/pull/136935 on behalf of https://github.com/ezyang due to merge conflict revert ([comment](https://github.com/pytorch/pytorch/pull/136934#issuecomment-2384195881))
2024-09-30 21:44:44 +00:00
8982906502 Revert "Turn on type-checking in torch.fx.experimental.symbolic_shapes (#136972)"
This reverts commit 3ff2d93d9f72fd26503ef0cf5c5956edad4c52e6.

Reverted https://github.com/pytorch/pytorch/pull/136972 on behalf of https://github.com/ezyang due to need to back out for merge conflict ([comment](https://github.com/pytorch/pytorch/pull/136972#issuecomment-2384182244))
2024-09-30 21:35:08 +00:00
b825848d85 Fix aarch64 debug build with GCC (#136990)
Fixes #136440

**Issue:**
When building PyTorch in debug mode on aarch64 architecture using GCC, we encounter relocation errors due to the R_AARCH64_CALL26 relocation limit. This occurs because debug builds with -O0 optimization generate larger code sizes, potentially exceeding the range limit for these relocations.

**Fix:**
Apply -Og optimization instead of -O0 for aarch64 GCC debug builds. This slightly reduces code size while maintaining debuggability, bringing function calls back within the range of R_AARCH64_CALL26 relocations.

The fix is implemented by conditionally setting compiler and linker flags in CMakeLists.txt:
- For aarch64 GCC debug builds: use -Og
- For all other debug builds: retain -O0

This change affects only debug builds on aarch64 with GCC, leaving other configurations unchanged.

**Testing:**
Verified that the build succeeds without relocation errors on aarch64 systems with GCC in debug mode. Ensured that debugging information is still available and useful for debugging purposes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136990
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-09-30 21:11:50 +00:00
866a64ce9a [FSDP2] Added check for contiguous parameters (#137000)
Since our implementation currently assumes contiguous strides, let us add an explicit check and raise an error at construction time if the parameter is not contiguous.

We can try to support this in the future. Mainly, I want to first learn more about how DTensor support for non-contiguous memory formats works.
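
To make "contiguous strides" concrete, here is a pure-Python sketch of a row-major contiguity test and a construction-time check built on it. It is not the FSDP2 implementation, which would more likely call `Tensor.is_contiguous()`; the function names and the error message are hypothetical.

```python
def is_contiguous(shape, strides):
    """Row-major contiguity: each stride equals the product of the sizes to its right."""
    if 0 in shape:
        return True  # tensors with no elements are treated as contiguous
    expected = 1
    for size, stride in zip(reversed(shape), reversed(strides)):
        # Strides of size-1 dimensions never affect the memory layout.
        if size != 1 and stride != expected:
            return False
        expected *= size
    return True


def check_contiguous(name, shape, strides):
    """Raise at construction time if a parameter is not contiguous (hypothetical helper)."""
    if not is_contiguous(shape, strides):
        raise NotImplementedError(
            f"parameter {name!r} with shape {shape} and strides {strides} "
            "is not contiguous"
        )
```

For example, a `(3, 4)` tensor with strides `(4, 1)` passes, while its transpose, shape `(4, 3)` with strides `(1, 4)`, is rejected.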

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137000
Approved by: https://github.com/weifengpy
2024-09-30 21:10:47 +00:00
66e3186a48 Revert "Init threadpool with user defined num_threads before default (#136793)"
This reverts commit adbcaee950afa6697c04962096344bf0962a542f.

Reverted https://github.com/pytorch/pytorch/pull/136793 on behalf of https://github.com/janeyx99 due to Caused internal Oculus crash, and internal force landed a diff without exporting to GH =.= ([comment](https://github.com/pytorch/pytorch/pull/136793#issuecomment-2384148132))
2024-09-30 21:10:12 +00:00
bc6adb9596 [EZ][BE] Delete ISSUE_TEMPALTE.md (#137040)
As it has been superseded by the [ISSUE_TEMPLATE](https://github.com/pytorch/pytorch/tree/main/.github/ISSUE_TEMPLATE) folder, per https://docs.github.com/en/communities/using-templates-to-encourage-useful-issues-and-pull-requests/configuring-issue-templates-for-your-repository#creating-issue-forms

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137040
Approved by: https://github.com/ZainRizvi
2024-09-30 21:04:32 +00:00
d46ebcb31b Enable experiments for protected branches (#136785)
This allows the protected branches (like `main` and `nightly`) to also run on the LF fleet, now that we've migrated over.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136785
Approved by: https://github.com/jeanschmidt
2024-09-30 20:58:28 +00:00
2ef1454189 Revert "Add int1 to int7 dtypes (#136301)"
This reverts commit bfa16a161d5089a9ba008f5e665f29b58dc16526.

Reverted https://github.com/pytorch/pytorch/pull/136301 on behalf of https://github.com/PaliC due to causing internal failures ([comment](https://github.com/pytorch/pytorch/pull/136301#issuecomment-2384119600))
2024-09-30 20:50:49 +00:00
766 changed files with 19129 additions and 8617 deletions

View File

@ -355,6 +355,12 @@ case "$image" in
CONDA_CMAKE=yes
VISION=yes
;;
pytorch-linux-jammy-py3-clang18-asan)
ANACONDA_PYTHON_VERSION=3.10
CLANG_VERSION=18
CONDA_CMAKE=yes
VISION=yes
;;
pytorch-linux-jammy-py3.9-gcc11)
ANACONDA_PYTHON_VERSION=3.9
GCC_VERSION=11
@ -381,6 +387,13 @@ case "$image" in
HALIDE=yes
TRITON=yes
;;
pytorch-linux-jammy-py3.12-triton-cpu)
CUDA_VERSION=12.4
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=11
CONDA_CMAKE=yes
TRITON_CPU=yes
;;
pytorch-linux-focal-linter)
# TODO: Use 3.9 here because of this issue https://github.com/python/mypy/issues/13627.
# We will need to update mypy version eventually, but that's for another day. The task
@ -510,6 +523,7 @@ docker build \
--build-arg "UCC_COMMIT=${UCC_COMMIT}" \
--build-arg "CONDA_CMAKE=${CONDA_CMAKE}" \
--build-arg "TRITON=${TRITON}" \
--build-arg "TRITON_CPU=${TRITON_CPU}" \
--build-arg "ONNX=${ONNX}" \
--build-arg "DOCS=${DOCS}" \
--build-arg "INDUCTOR_BENCHMARKS=${INDUCTOR_BENCHMARKS}" \

View File

@ -0,0 +1 @@
6a333f1b05671f6fada4ba7bbfae4a02a9d96f4f

View File

@ -13,11 +13,17 @@ if [ -n "$CLANG_VERSION" ]; then
elif [[ $UBUNTU_VERSION == 22.04 ]]; then
# work around ubuntu apt-get conflicts
sudo apt-get -y -f install
wget --no-check-certificate -O - https://apt.llvm.org/llvm-snapshot.gpg.key | sudo apt-key add -
if [[ $CLANG_VERSION == 18 ]]; then
apt-add-repository "deb http://apt.llvm.org/jammy/ llvm-toolchain-jammy-18 main"
fi
fi
sudo apt-get update
apt-get install -y --no-install-recommends clang-"$CLANG_VERSION"
apt-get install -y --no-install-recommends llvm-"$CLANG_VERSION"
apt-get install -y --no-install-recommends clang-"$CLANG_VERSION" llvm-"$CLANG_VERSION"
if [[ $CLANG_VERSION == 18 ]]; then
apt-get install -y --no-install-recommends libomp-18-dev
fi
# Install dev version of LLVM.
if [ -n "$LLVMDEV" ]; then

View File

@ -32,7 +32,7 @@ pip_install coloredlogs packaging
pip_install onnxruntime==1.18.1
pip_install onnx==1.16.2
pip_install onnxscript==0.1.0.dev20240831 --no-deps
pip_install onnxscript==0.1.0.dev20241008 --no-deps
# required by onnxscript
pip_install ml_dtypes

View File

@ -15,8 +15,11 @@ conda_reinstall() {
if [ -n "${XPU_VERSION}" ]; then
TRITON_REPO="https://github.com/intel/intel-xpu-backend-for-triton"
TRITON_TEXT_FILE="triton-xpu"
elif [ -n "${TRITON_CPU}" ]; then
TRITON_REPO="https://github.com/triton-lang/triton-cpu"
TRITON_TEXT_FILE="triton-cpu"
else
TRITON_REPO="https://github.com/openai/triton"
TRITON_REPO="https://github.com/triton-lang/triton"
TRITON_TEXT_FILE="triton"
fi
@ -44,9 +47,10 @@ chown -R jenkins /var/lib/jenkins/triton
chgrp -R jenkins /var/lib/jenkins/triton
pushd /var/lib/jenkins/
as_jenkins git clone ${TRITON_REPO} triton
as_jenkins git clone --recursive ${TRITON_REPO} triton
cd triton
as_jenkins git checkout ${TRITON_PINNED_COMMIT}
as_jenkins git submodule update --init --recursive
cd python
# TODO: remove patch setup.py once we have a proper fix for https://github.com/triton-lang/triton/issues/4527

View File

@ -139,9 +139,9 @@ opt-einsum==3.3
#Pinned versions: 3.3
#test that import: test_linalg.py
optree==0.12.1
optree==0.13.0
#Description: A library for tree manipulation
#Pinned versions: 0.12.1
#Pinned versions: 0.13.0
#test that import: test_vmap.py, test_aotdispatch.py, test_dynamic_shapes.py,
#test_pytree.py, test_ops.py, test_control_flow.py, test_modules.py,
#common_utils.py, test_eager_transforms.py, test_python_dispatch.py,

View File

@ -147,6 +147,13 @@ COPY ci_commit_pins/triton.txt triton.txt
RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi
RUN rm install_triton.sh common_utils.sh triton.txt
ARG TRITON_CPU
COPY ./common/install_triton.sh install_triton.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/triton-cpu.txt triton-cpu.txt
RUN if [ -n "${TRITON_CPU}" ]; then bash ./install_triton.sh; fi
RUN rm install_triton.sh common_utils.sh triton-cpu.txt
ARG EXECUTORCH
# Build and install executorch
COPY ./common/install_executorch.sh install_executorch.sh

View File

@ -178,7 +178,7 @@ fi
# sccache will fail for CUDA builds if all cores are used for compiling
# gcc 7 with sccache seems to have intermittent OOM issue if all cores are used
if [ -z "$MAX_JOBS" ]; then
if { [[ "$BUILD_ENVIRONMENT" == *cuda* ]] || [[ "$BUILD_ENVIRONMENT" == *gcc7* ]]; } && which sccache > /dev/null; then
if { [[ "$BUILD_ENVIRONMENT" == *cuda* ]]; } && which sccache > /dev/null; then
export MAX_JOBS=$(($(nproc) - 1))
fi
fi
@ -218,10 +218,6 @@ if [[ "${BUILD_ENVIRONMENT}" == *-pch* ]]; then
export USE_PRECOMPILED_HEADERS=1
fi
if [[ "${BUILD_ENVIRONMENT}" == *linux-focal-py3.7-gcc7-build* ]]; then
export USE_GLOO_WITH_OPENSSL=ON
fi
if [[ "${BUILD_ENVIRONMENT}" != *android* && "${BUILD_ENVIRONMENT}" != *cuda* ]]; then
export BUILD_STATIC_RUNTIME_BENCHMARK=ON
fi

View File

@ -191,9 +191,22 @@ function install_torchrec_and_fbgemm() {
pip_uninstall torchrec-nightly
pip_uninstall fbgemm-gpu-nightly
pip_install setuptools-git-versioning scikit-build pyre-extensions
# TODO (huydhn): I still have no clue on why sccache doesn't work with only fbgemm_gpu here, but it
# seems to be an sccache-related issue
if [[ "$IS_A100_RUNNER" == "1" ]]; then
unset CMAKE_CUDA_COMPILER_LAUNCHER
sudo mv /opt/cache/bin /opt/cache/bin-backup
fi
# See https://github.com/pytorch/pytorch/issues/106971
CUDA_PATH=/usr/local/cuda-12.1 pip_install --no-use-pep517 --user "git+https://github.com/pytorch/FBGEMM.git@${fbgemm_commit}#egg=fbgemm-gpu&subdirectory=fbgemm_gpu"
pip_install --no-use-pep517 --user "git+https://github.com/pytorch/torchrec.git@${torchrec_commit}"
if [[ "$IS_A100_RUNNER" == "1" ]]; then
export CMAKE_CUDA_COMPILER_LAUNCHER=/opt/cache/bin/sccache
sudo mv /opt/cache/bin-backup /opt/cache/bin
fi
}
function clone_pytorch_xla() {

View File

@ -403,7 +403,7 @@ pr_time_benchmarks() {
PYTHONPATH=$(pwd)/benchmarks/dynamo/pr_time_benchmarks source benchmarks/dynamo/pr_time_benchmarks/benchmark_runner.sh "$TEST_REPORTS_DIR/pr_time_benchmarks_results.csv" "benchmarks/dynamo/pr_time_benchmarks/benchmarks"
echo "benchmark results on current PR: "
cat "$TEST_REPORTS_DIR/pr_time_benchmarks_results.csv"
PYTHONPATH=$(pwd)/benchmarks/dynamo/pr_time_benchmarks python benchmarks/dynamo/pr_time_benchmarks/check_results.py "benchmarks/dynamo/pr_time_benchmarks/expected_results.csv" "$TEST_REPORTS_DIR/pr_time_benchmarks_results.csv"
PYTHONPATH=$(pwd)/benchmarks/dynamo/pr_time_benchmarks python benchmarks/dynamo/pr_time_benchmarks/check_results.py "benchmarks/dynamo/pr_time_benchmarks/expected_results.csv" "$TEST_REPORTS_DIR/pr_time_benchmarks_results.csv" "$TEST_REPORTS_DIR/new_expected_results.csv"
}
if [[ "${TEST_CONFIG}" == *pr_time_benchmarks* ]]; then
@ -606,6 +606,11 @@ test_inductor_halide() {
assert_git_not_dirty
}
test_inductor_triton_cpu() {
python test/run_test.py --include inductor/test_triton_cpu_backend.py --verbose
assert_git_not_dirty
}
test_dynamo_benchmark() {
# Usage: test_dynamo_benchmark huggingface 0
TEST_REPORTS_DIR=$(pwd)/test/test-reports
@ -660,15 +665,6 @@ test_inductor_torchbench_smoketest_perf() {
# The threshold value needs to be actively maintained to make this check useful
python benchmarks/dynamo/check_perf_csv.py -f "$TEST_REPORTS_DIR/inductor_training_smoketest.csv" -t 1.4
TORCHINDUCTOR_ABI_COMPATIBLE=1 python benchmarks/dynamo/torchbench.py --device cuda --performance --bfloat16 --inference \
--export-aot-inductor --only nanogpt --output "$TEST_REPORTS_DIR/inductor_inference_smoketest.csv"
# The threshold value needs to be actively maintained to make this check useful
# The perf number of nanogpt seems not very stable, e.g.
# https://github.com/pytorch/pytorch/actions/runs/7158691360/job/19491437314,
# and thus we lower its threshold to reduce flakiness. If this continues to be a problem,
# we switch to use some other model.
python benchmarks/dynamo/check_perf_csv.py -f "$TEST_REPORTS_DIR/inductor_inference_smoketest.csv" -t 4.9
# Check memory compression ratio for a few models
for test in hf_Albert timm_vision_transformer; do
python benchmarks/dynamo/torchbench.py --device cuda --performance --backend inductor --amp --training \
@ -1439,6 +1435,8 @@ elif [[ "${TEST_CONFIG}" == *inductor_distributed* ]]; then
test_inductor_distributed
elif [[ "${TEST_CONFIG}" == *inductor-halide* ]]; then
test_inductor_halide
elif [[ "${TEST_CONFIG}" == *inductor-triton-cpu* ]]; then
test_inductor_triton_cpu
elif [[ "${TEST_CONFIG}" == *inductor-micro-benchmark* ]]; then
test_inductor_micro_benchmark
elif [[ "${TEST_CONFIG}" == *huggingface* ]]; then
@ -1462,7 +1460,7 @@ elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then
# https://github.com/opencv/opencv-python/issues/885
pip_install opencv-python==4.8.0.74
if [[ "${TEST_CONFIG}" == *inductor_torchbench_smoketest_perf* ]]; then
checkout_install_torchbench hf_Bert hf_Albert nanogpt timm_vision_transformer
checkout_install_torchbench hf_Bert hf_Albert timm_vision_transformer
PYTHONPATH=$(pwd)/torchbench test_inductor_torchbench_smoketest_perf
elif [[ "${TEST_CONFIG}" == *inductor_torchbench_cpu_smoketest_perf* ]]; then
checkout_install_torchbench timm_vision_transformer phlippe_densenet basic_gnn_edgecnn \


@ -26,7 +26,7 @@ fi
export SCRIPT_HELPERS_DIR=$SCRIPT_PARENT_DIR/win-test-helpers
set +ex
grep -E -R 'PyLong_(From|As)(Unsigned|)Long\(' --exclude=python_numbers.h --exclude=eval_frame.c torch/
grep -E -R 'PyLong_(From|As)(Unsigned|)Long\(' --exclude=python_numbers.h --exclude=pythoncapi_compat.h --exclude=eval_frame.c torch/
PYLONG_API_CHECK=$?
if [[ $PYLONG_API_CHECK == 0 ]]; then
echo "Usage of PyLong_{From,As}{Unsigned}Long API may lead to overflow errors on Windows"


@ -27,12 +27,11 @@ if [[ "$PACKAGE_TYPE" == conda ]]; then
source activate testenv >/dev/null
elif [[ "$PACKAGE_TYPE" != libtorch ]]; then
python_path="/opt/python/cp\$python_nodot-cp\${python_nodot}"
# Prior to Python 3.8 paths were suffixed with an 'm'
if [[ -d "\${python_path}/bin" ]]; then
export PATH="\${python_path}/bin:\$PATH"
elif [[ -d "\${python_path}m/bin" ]]; then
export PATH="\${python_path}m/bin:\$PATH"
if [[ "\$python_nodot" = *t ]]; then
python_digits="\$(echo $DESIRED_PYTHON | tr -cd [:digit:])"
python_path="/opt/python/cp\$python_digits-cp\${python_digits}t"
fi
export PATH="\${python_path}/bin:\$PATH"
fi
EXTRA_CONDA_FLAGS=""
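The path selection in the hunk above can be sketched in Python to make the free-threaded ("t") case easier to follow. This is only an illustration: the function name `manylinux_python_path` is hypothetical, and it assumes `python_nodot` is simply the desired version with its dots removed (the hunk does not show where that variable is set).

```python
def manylinux_python_path(desired_python: str) -> str:
    """Return the /opt/python prefix for a version such as "3.12" or "3.13t".

    Hypothetical helper mirroring the shell logic in the hunk above.
    """
    # Assumption: python_nodot is the version string without dots ("3.13t" -> "313t").
    python_nodot = desired_python.replace(".", "")
    if python_nodot.endswith("t"):
        # Free-threaded build: the first tag uses digits only,
        # while the second tag keeps the trailing "t".
        digits = "".join(ch for ch in desired_python if ch.isdigit())
        return f"/opt/python/cp{digits}-cp{digits}t"
    # Regular build: both tags use the same dotless version.
    return f"/opt/python/cp{python_nodot}-cp{python_nodot}"


if __name__ == "__main__":
    for version in ("3.12", "3.13", "3.13t"):
        print(version, "->", manylinux_python_path(version))
```

Note that the new shell code no longer probes for the pre-3.8 `m`-suffixed directory, so this sketch does not model it either.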


@ -44,7 +44,9 @@ ContinuationIndentWidth: 4
Cpp11BracedListStyle: true
DerivePointerAlignment: false
DisableFormat: false
ForEachMacros: [ FOR_EACH_RANGE, FOR_EACH, ]
ForEachMacros:
- FOR_EACH_RANGE
- FOR_EACH
IncludeCategories:
- Regex: '^<.*\.h(pp)?>'
Priority: 1
@ -58,6 +60,24 @@ IndentWrappedFunctionNames: false
KeepEmptyLinesAtTheStartOfBlocks: false
MacroBlockBegin: ''
MacroBlockEnd: ''
Macros:
- >-
PyObject_HEAD_INIT(type)={
/* this is not exactly match with PyObject_HEAD_INIT in Python source code
* but it is enough for clang-format */
{ 0xFFFFFFFF },
(type)
},
- >-
PyVarObject_HEAD_INIT(type, size)={
{
/* manually expand PyObject_HEAD_INIT(type) above
* because clang-format do not support recursive expansion */
{ 0xFFFFFFFF },
(type)
},
(size)
},
MaxEmptyLinesToKeep: 1
NamespaceIndentation: None
PenaltyBreakBeforeFirstCallParameter: 1
@ -79,7 +99,11 @@ SpacesInContainerLiterals: true
SpacesInCStyleCastParentheses: false
SpacesInParentheses: false
SpacesInSquareBrackets: false
Standard: Cpp11
Standard: c++17
StatementMacros:
- PyObject_HEAD
- PyObject_VAR_HEAD
- PyException_HEAD
TabWidth: 8
UseTab: Never
---


@ -1,38 +0,0 @@
If you have a question or would like help and support, please ask at our
[forums](https://discuss.pytorch.org/).
If you are submitting a feature request, please preface the title with [feature request].
If you are submitting a bug report, please fill in the following details.
## Issue description
Provide a short description.
## Code example
Please try to provide a minimal example to repro the bug.
Error messages and stack traces are also helpful.
## System Info
Please copy and paste the output from our
[environment collection script](https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py)
(or fill out the checklist below manually).
You can get the script and run it with:
```
wget https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
```
- PyTorch or Caffe2:
- How you installed PyTorch (conda, pip, source):
- Build command you used (if compiling from source):
- OS:
- PyTorch version:
- Python version:
- CUDA/cuDNN version:
- GPU models and configuration:
- GCC version (if compiling from source):
- CMake version:
- Versions of any other relevant libraries:


@ -18,8 +18,14 @@ inputs:
runs:
using: composite
steps:
- name: Check if in a container runner
shell: bash
id: check_container_runner
run: echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT"
- name: Clean workspace
shell: bash
if: ${{ steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false' }}
env:
NO_SUDO: ${{ inputs.no-sudo }}
run: |


@ -85,15 +85,25 @@ runs:
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
- name: Check if in a ARC runner
- name: Check if in a container runner
shell: bash
id: check_arc_runner
run: echo "IN_ARC_RUNNER=$([ -f /.inarc ] && echo true || echo false)" >> "$GITHUB_OUTPUT"
id: check_container_runner
run: echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT"
- name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG
id: install-nvidia-driver
uses: pytorch/test-infra/.github/actions/setup-nvidia@main
if: ${{ contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu') && steps.check_arc_runner.outputs.IN_ARC_RUNNER == 'false' }}
if: ${{ contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu') && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false' }}
- name: Setup GPU_FLAG for docker run
id: setup-gpu-flag
run: echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"
if: ${{ contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu') && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'true' }}
- name: Setup SCCACHE_SERVER_PORT environment for docker run when on container
id: setup-sscache-port-flag
run: echo "SCCACHE_SERVER_PORT_DOCKER_FLAG=-e SCCACHE_SERVER_PORT=$((RUNNER_UID + 4226))" >> "${GITHUB_ENV}"
if: ${{ steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'true' }}
- name: Lock NVIDIA A100 40GB Frequency
shell: bash
@ -101,7 +111,7 @@ runs:
sudo nvidia-smi -pm 1
sudo nvidia-smi -ac 1215,1410
nvidia-smi
if: contains(matrix.runner, 'a100')
if: ${{ contains(matrix.runner, 'a100') && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false' }}
- name: Start monitoring script
id: monitor-script
@ -172,6 +182,7 @@ runs:
NO_TD: ${{ steps.keep-going.outputs.ci-no-td }}
TD_DISTRIBUTED: ${{ steps.keep-going.outputs.ci-td-distributed }}
SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2
SCCACHE_REGION: us-east-1
SCCACHE_S3_KEY_PREFIX: ${{ github.workflow }}
SHM_SIZE: ${{ contains(inputs.build-environment, 'cuda') && '2g' || '1g' }}
DOCKER_IMAGE: ${{ inputs.docker-image }}
@ -181,6 +192,9 @@ runs:
PYTORCH_TEST_RERUN_DISABLED_TESTS: ${{ matrix.rerun_disabled_tests && '1' || '0' }}
DASHBOARD_TAG: ${{ inputs.dashboard-tag }}
HUGGING_FACE_HUB_TOKEN: ${{ inputs.HUGGING_FACE_HUB_TOKEN }}
SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }}
IS_A100_RUNNER: ${{ contains(matrix.runner, 'a100') && '1' || '0' }}
shell: bash
run: |
set -x
@ -199,6 +213,7 @@ runs:
# shellcheck disable=SC2086,SC2090
container_name=$(docker run \
${GPU_FLAG:-} \
${SCCACHE_SERVER_PORT_DOCKER_FLAG:-} \
-e BUILD_ENVIRONMENT \
-e PR_NUMBER \
-e GITHUB_ACTIONS \
@ -227,6 +242,7 @@ runs:
-e PR_LABELS \
-e MAX_JOBS="$(nproc --ignore=2)" \
-e SCCACHE_BUCKET \
-e SCCACHE_REGION \
-e SCCACHE_S3_KEY_PREFIX \
-e XLA_CUDA \
-e XLA_CLANG_CACHE_S3_BUCKET_NAME \
@ -234,7 +250,9 @@ runs:
-e PYTORCH_TEST_RERUN_DISABLED_TESTS \
-e SKIP_SCCACHE_INITIALIZATION=1 \
-e HUGGING_FACE_HUB_TOKEN \
-e SCRIBE_GRAPHQL_ACCESS_TOKEN \
-e DASHBOARD_TAG \
-e IS_A100_RUNNER \
--env-file="/tmp/github_env_${GITHUB_RUN_ID}" \
--security-opt seccomp=unconfined \
--cap-add=SYS_PTRACE \
@ -305,7 +323,7 @@ runs:
- name: Teardown Linux
uses: pytorch/test-infra/.github/actions/teardown-linux@main
if: always()
if: always() && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false'
# NB: We are currently having an intermittent GPU-related issue on G5 runners with
# A10G GPU. Once this happens, trying to reset the GPU as done in setup-nvidia does


@ -28,14 +28,14 @@ runs:
echo "instance-type: $(get_ec2_metadata instance-type)"
echo "system info $(uname -a)"
- name: Check if in a ARC runner
- name: Check if in a container runner
shell: bash
id: check_arc_runner
run: echo "IN_ARC_RUNNER=$([ -f /.inarc ] && echo true || echo false)" >> $GITHUB_OUTPUT
id: check_container_runner
run: echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT"
- name: Start docker if docker deamon is not running
shell: bash
if: ${{ steps.check_arc_runner.outputs.IN_ARC_RUNNER == 'false' }}
if: ${{ steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false' }}
run: |
if systemctl is-active --quiet docker; then
echo "Docker daemon is running...";
@ -73,7 +73,7 @@ runs:
env | grep '^CI' >> "/tmp/github_env_${GITHUB_RUN_ID}"
- name: Kill any existing containers, clean up images
if: ${{ steps.check_arc_runner.outputs.IN_ARC_RUNNER == 'false' }}
if: ${{ steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false' }}
shell: bash
run: |
# ignore expansion of "docker ps -q" since it could be empty
@ -116,7 +116,7 @@ runs:
- name: Check that the docker daemon is running
shell: bash
continue-on-error: true
if: ${{ steps.check_arc_runner.outputs.IN_ARC_RUNNER == 'true' }}
if: ${{ steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'true' }}
run: |
set +x


@ -1 +1 @@
ba696ea3dfec4cbe693bf06a84c75dc196077f5b
3f0569939c4369bec943fc27d1c9d8dfbc828c26


@ -16,6 +16,7 @@ ciflow_push_tags:
- ciflow/nightly
- ciflow/periodic
- ciflow/rocm
- ciflow/s390
- ciflow/slow
- ciflow/trunk
- ciflow/unstable


@ -1,4 +1,4 @@
# iOS simulator requirements
coremltools==5.0b5
protobuf==3.20.2
optree==0.12.1
optree==0.13.0


@ -27,7 +27,7 @@ pytest-cpp==2.3.0
rockset==1.0.3
z3-solver==4.12.2.0
tensorboard==2.13.0
optree==0.12.1
optree==0.13.0
# NB: test_hparams_* from test_tensorboard is failing with protobuf 5.26.0 in
# which the stringify metadata is wrong when escaping double quote
protobuf==3.20.2


@ -333,7 +333,7 @@ def generate_wheels_matrix(
package_type = "manywheel"
if python_versions is None:
python_versions = FULL_PYTHON_VERSIONS + ["3.13"]
python_versions = FULL_PYTHON_VERSIONS + ["3.13", "3.13t"]
if arches is None:
# Define default compute architectures
@ -369,7 +369,13 @@ def generate_wheels_matrix(
# TODO: Enable python 3.13 on rocm, aarch64, windows
if (
gpu_arch_type == "rocm" or (os != "linux" and os != "linux-s390x")
) and python_version == "3.13":
) and (python_version == "3.13" or python_version == "3.13t"):
continue
# TODO: Enable python 3.13t on xpu and cpu-s390x
if (
gpu_arch_type == "xpu" or gpu_arch_type == "cpu-s390x"
) and python_version == "3.13t":
continue
if use_split_build and (
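The two skip conditions added to `generate_wheels_matrix` in this hunk can be read as a single predicate. The sketch below is an illustration only: `should_skip` is a hypothetical name, and it models just the two Python 3.13 / 3.13t checks shown above, not the rest of the matrix generation.

```python
def should_skip(python_version: str, gpu_arch_type: str, os_name: str) -> bool:
    """Return True when a (python, arch, os) combination is skipped.

    Hypothetical predicate mirroring the two `continue` branches in the hunk.
    """
    is_313 = python_version in ("3.13", "3.13t")
    # Python 3.13 and 3.13t are not yet enabled on rocm or on
    # any OS other than linux / linux-s390x.
    if (
        gpu_arch_type == "rocm" or (os_name != "linux" and os_name != "linux-s390x")
    ) and is_313:
        return True
    # The free-threaded build is additionally skipped on xpu and cpu-s390x.
    if gpu_arch_type in ("xpu", "cpu-s390x") and python_version == "3.13t":
        return True
    return False


if __name__ == "__main__":
    print(should_skip("3.13t", "xpu", "linux"))  # skipped: 3.13t on xpu
    print(should_skip("3.13", "xpu", "linux"))   # kept: plain 3.13 on xpu
```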


@ -1,5 +1,9 @@
# flake8: noqa: G004
# Note: Copies of this script in runner_determinator.py and _runner-determinator.yml
# must be kept in sync. You can do it easily by running the following command:
# python .github/scripts/update_runner_determinator.py
"""
This runner determinator is used to determine which set of runners to run a
GitHub job on. It uses the first comment of a GitHub issue (by default
@ -79,6 +83,9 @@ class Experiment(NamedTuple):
rollout_perc: float = (
0 # Percentage of workflows to experiment on when user is not opted-in.
)
all_branches: bool = (
False # If True, the experiment is also enabled on the exception branches
)
# Add more fields as needed
@ -212,7 +219,7 @@ def get_potential_pr_author(
def is_exception_branch(branch: str) -> bool:
"""
Branches that get opted out of all experiments and should always use Meta runners
Branches that get opted out of experiments by default, until they're explicitly enabled.
"""
return branch.split("/")[0] in {"main", "nightly", "release", "landchecks"}
@ -338,7 +345,10 @@ def is_user_opted_in(user: str, user_optins: UserOptins, experiment_name: str) -
def get_runner_prefix(
rollout_state: str, workflow_requestors: Iterable[str], is_canary: bool = False
rollout_state: str,
workflow_requestors: Iterable[str],
branch: str,
is_canary: bool = False,
) -> str:
settings = parse_settings(rollout_state)
user_optins = parse_users(rollout_state)
@ -348,6 +358,12 @@ def get_runner_prefix(
for experiment_name, experiment_settings in settings.experiments.items():
enabled = False
if not experiment_settings.all_branches and is_exception_branch(branch):
log.info(
f"Branch {branch} is an exception branch. Not enabling experiment {experiment_name}."
)
continue
# Is any workflow_requestor opted in to this experiment?
opted_in_users = [
requestor
@ -407,35 +423,34 @@ def get_rollout_state_from_issue(github_token: str, repo: str, issue_num: int) -
def main() -> None:
args = parse_args()
if args.github_ref_type == "branch" and is_exception_branch(args.github_branch):
log.info(
f"Exception branch: '{args.github_branch}', using Meta runners and no experiments."
runner_label_prefix = DEFAULT_LABEL_PREFIX
try:
rollout_state = get_rollout_state_from_issue(
args.github_token, args.github_issue_repo, args.github_issue
)
runner_label_prefix = DEFAULT_LABEL_PREFIX
else:
try:
rollout_state = get_rollout_state_from_issue(
args.github_token, args.github_issue_repo, args.github_issue
)
username = get_potential_pr_author(
args.github_token,
args.github_repo,
args.github_actor,
args.github_ref_type,
args.github_branch,
)
username = get_potential_pr_author(
args.github_token,
args.github_repo,
args.github_actor,
args.github_ref_type,
args.github_branch,
)
is_canary = args.github_repo == "pytorch/pytorch-canary"
is_canary = args.github_repo == "pytorch/pytorch-canary"
runner_label_prefix = get_runner_prefix(
rollout_state, (args.github_issue_owner, username), is_canary
)
runner_label_prefix = get_runner_prefix(
rollout_state,
(args.github_issue_owner, username),
args.github_branch,
is_canary,
)
except Exception as e:
log.error(
f"Failed to get issue. Defaulting to Meta runners and no experiments. Exception: {e}"
)
except Exception as e:
log.error(
f"Failed to get issue. Defaulting to Meta runners and no experiments. Exception: {e}"
)
set_github_output(GH_OUTPUT_KEY_LABEL_TYPE, runner_label_prefix)
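A minimal sketch of the new branch gate may help when reading this change. It reuses the `Experiment` fields and the `is_exception_branch` body shown in the diff, but `eligible_experiments` is a hypothetical helper written for illustration; it models only the `all_branches` check and leaves out user opt-ins and percentage rollout.

```python
from typing import Dict, List, NamedTuple


class Experiment(NamedTuple):
    rollout_perc: float = 0
    all_branches: bool = False  # If True, also enabled on exception branches


def is_exception_branch(branch: str) -> bool:
    # Same check as in the diff: only the first path component matters.
    return branch.split("/")[0] in {"main", "nightly", "release", "landchecks"}


def eligible_experiments(experiments: Dict[str, Experiment], branch: str) -> List[str]:
    """Hypothetical helper: names of experiments that pass the branch gate."""
    return [
        name
        for name, exp in experiments.items()
        if exp.all_branches or not is_exception_branch(branch)
    ]


if __name__ == "__main__":
    experiments = {
        "lf": Experiment(rollout_perc=100, all_branches=True),
        "otherExp": Experiment(rollout_perc=100),
    }
    print(eligible_experiments(experiments, "main"))
    print(eligible_experiments(experiments, "somebranch"))
```

Moving the check from `main()` into the per-experiment loop is what lets a single experiment opt in to exception branches while the others stay disabled there.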


@ -4,6 +4,10 @@ from unittest.mock import Mock, patch
import runner_determinator as rd
USER_BRANCH = "somebranch"
EXCEPTION_BRANCH = "main"
class TestRunnerDeterminatorIssueParser(TestCase):
def test_parse_settings(self) -> None:
settings_text = """
@ -66,6 +70,40 @@ class TestRunnerDeterminatorIssueParser(TestCase):
"otherExp settings not parsed correctly",
)
def test_parse_all_branches_setting(self) -> None:
settings_text = """
```
experiments:
lf:
rollout_perc: 25
all_branches: true
otherExp:
all_branches: True
rollout_perc: 0
```
---
Users:
@User1,lf
@User2,lf,otherExp
"""
settings = rd.parse_settings(settings_text)
self.assertTupleEqual(
rd.Experiment(rollout_perc=25, all_branches=True),
settings.experiments["lf"],
"lf settings not parsed correctly",
)
self.assertTrue(settings.experiments["otherExp"].all_branches)
self.assertTupleEqual(
rd.Experiment(rollout_perc=0, all_branches=True),
settings.experiments["otherExp"],
"otherExp settings not parsed correctly",
)
def test_parse_users(self) -> None:
settings_text = """
experiments:
@ -119,7 +157,7 @@ class TestRunnerDeterminatorGetRunnerPrefix(TestCase):
@User2,lf,otherExp
"""
prefix = rd.get_runner_prefix(settings_text, ["User1"])
prefix = rd.get_runner_prefix(settings_text, ["User1"], USER_BRANCH)
self.assertEqual("lf.", prefix, "Runner prefix not correct for User1")
def test_opted_in_user_two_experiments(self) -> None:
@ -136,7 +174,7 @@ class TestRunnerDeterminatorGetRunnerPrefix(TestCase):
@User2,lf,otherExp
"""
prefix = rd.get_runner_prefix(settings_text, ["User2"])
prefix = rd.get_runner_prefix(settings_text, ["User2"], USER_BRANCH)
self.assertEqual("lf.otherExp.", prefix, "Runner prefix not correct for User2")
@patch("random.uniform", return_value=50)
@ -154,7 +192,7 @@ class TestRunnerDeterminatorGetRunnerPrefix(TestCase):
@User2,lf,otherExp
"""
prefix = rd.get_runner_prefix(settings_text, ["User3"])
prefix = rd.get_runner_prefix(settings_text, ["User3"], USER_BRANCH)
self.assertEqual("", prefix, "Runner prefix not correct for user")
@patch("random.uniform", return_value=10)
@ -174,7 +212,7 @@ class TestRunnerDeterminatorGetRunnerPrefix(TestCase):
"""
# User3 is opted out, but is pulled into both experiments by the 10% rollout
prefix = rd.get_runner_prefix(settings_text, ["User3"])
prefix = rd.get_runner_prefix(settings_text, ["User3"], USER_BRANCH)
self.assertEqual("lf.otherExp.", prefix, "Runner prefix not correct for user")
def test_lf_prefix_always_comes_first(self) -> None:
@ -192,7 +230,7 @@ class TestRunnerDeterminatorGetRunnerPrefix(TestCase):
"""
prefix = rd.get_runner_prefix(settings_text, ["User2"])
prefix = rd.get_runner_prefix(settings_text, ["User2"], USER_BRANCH)
self.assertEqual("lf.otherExp.", prefix, "Runner prefix not correct for user")
def test_ignores_commented_users(self) -> None:
@ -210,7 +248,7 @@ class TestRunnerDeterminatorGetRunnerPrefix(TestCase):
"""
prefix = rd.get_runner_prefix(settings_text, ["User1"])
prefix = rd.get_runner_prefix(settings_text, ["User1"], USER_BRANCH)
self.assertEqual("", prefix, "Runner prefix not correct for user")
def test_ignores_extra_experiments(self) -> None:
@ -229,9 +267,44 @@ class TestRunnerDeterminatorGetRunnerPrefix(TestCase):
"""
prefix = rd.get_runner_prefix(settings_text, ["User1"])
prefix = rd.get_runner_prefix(settings_text, ["User1"], USER_BRANCH)
self.assertEqual("lf.otherExp.", prefix, "Runner prefix not correct for user")
def test_disables_experiment_on_exception_branches_when_not_explicitly_opted_in(
self,
) -> None:
settings_text = """
experiments:
lf:
rollout_perc: 100
---
Users:
@User,lf,otherExp
"""
prefix = rd.get_runner_prefix(settings_text, ["User1"], EXCEPTION_BRANCH)
self.assertEqual("", prefix, "Runner prefix not correct for user")
def test_allows_experiment_on_exception_branches_when_explicitly_opted_in(
self,
) -> None:
settings_text = """
experiments:
lf:
rollout_perc: 100
all_branches: true
---
Users:
@User,lf,otherExp
"""
prefix = rd.get_runner_prefix(settings_text, ["User1"], EXCEPTION_BRANCH)
self.assertEqual("lf.", prefix, "Runner prefix not correct for user")
if __name__ == "__main__":
main()


@ -12,7 +12,7 @@ import json
import os
import warnings
from hashlib import sha256
from typing import Any, Dict, List, Optional
from typing import Any, List, Optional
from unittest import main, mock, skip, TestCase
from urllib.error import HTTPError
@ -24,7 +24,6 @@ from trymerge import (
find_matching_merge_rule,
get_classifications,
get_drci_classifications,
get_rockset_results,
gh_get_team_members,
GitHubPR,
JobCheckState,
@ -42,7 +41,6 @@ if "GIT_REMOTE_URL" not in os.environ:
os.environ["GIT_REMOTE_URL"] = "https://github.com/pytorch/pytorch"
GQL_MOCKS = "gql_mocks.json.gz"
ROCKSET_MOCKS = "rockset_mocks.json.gz"
DRCI_MOCKS = "drci_mocks.json.gz"
@ -77,16 +75,11 @@ def mock_query(
if err.code == 401 or err.code == 403:
err_msg = f"If you are seeing this message during workflow run, please make sure to update {file_name}"
err_msg += f" locally, by deleting it and running {os.path.basename(__file__)} with"
err_msg += " GitHub Personal Access Token passed via GITHUB_TOKEN,"
err_msg += " the rockset api key passed via ROCKSET_API_KEY,"
err_msg += " GitHub Personal Access Token passed via GITHUB_TOKEN"
err_msg += " and drci api key passed via DRCI_BOT_KEY environment variables"
if (
os.getenv("GITHUB_TOKEN") is None
or os.getenv("ROCKSET_API_KEY") is None
or os.getenv("DRCI_BOT_KEY") is None
):
if os.getenv("GITHUB_TOKEN") is None or os.getenv("DRCI_BOT_KEY") is None:
err_msg = (
"Failed to update cached queries as GITHUB_TOKEN or ROCKSET_API_KEY or DRCI_BOT_KEY "
"Failed to update cached queries as GITHUB_TOKEN or DRCI_BOT_KEY "
+ "is not defined. "
+ err_msg
)
@ -110,16 +103,6 @@ def mocked_gh_graphql(query: str, **kwargs: Any) -> Any:
return mock_query(gh_graphql_wrapper, GQL_MOCKS, key_function, query, kwargs)
def mocked_rockset_results(head_sha: str, merge_base: str, num_retries: int = 3) -> Any:
return mock_query(
get_rockset_results,
ROCKSET_MOCKS,
lambda x, y: f"{x} {y}",
head_sha,
merge_base,
)
def mocked_drci_classifications(pr_num: int, project: str, num_retries: int = 3) -> Any:
return mock_query(
get_drci_classifications,
@ -273,10 +256,6 @@ def xla_merge_rules(repo: Any, org: str, project: str) -> List[MergeRule]:
]
def empty_rockset_results(head_sha: str, merge_base: str) -> List[Dict[str, Any]]:
return []
class DummyGitRepo(GitRepo):
def __init__(self) -> None:
super().__init__(get_git_repo_dir(), get_git_remote_name())
@ -288,7 +267,6 @@ class DummyGitRepo(GitRepo):
return "super awsome commit message"
@mock.patch("trymerge.get_rockset_results", side_effect=empty_rockset_results)
@mock.patch("trymerge.gh_graphql", side_effect=mocked_gh_graphql)
@mock.patch(
"trymerge.get_drci_classifications", side_effect=mocked_drci_classifications
@ -604,7 +582,6 @@ class TestTryMerge(TestCase):
mocked_gh_fetch_merge_base.assert_called_once()
@mock.patch("trymerge.get_rockset_results", side_effect=mocked_rockset_results)
@mock.patch("trymerge.gh_graphql", side_effect=mocked_gh_graphql)
@mock.patch("trymerge.gh_fetch_merge_base", return_value="")
@mock.patch(
@ -843,7 +820,7 @@ class TestBypassFailures(TestCase):
checks = pr.get_checkrun_conclusions()
# Known flaky failure takes precedence over ignore current (need to set the
# merge base here to get the results from Rockset, and that categorize the
# merge base here to get the results from Dr. CI, and that categorize the
# broken trunk failure too
checks = get_classifications(
pr.pr_num,
@ -929,7 +906,6 @@ class TestBypassFailures(TestCase):
)
@mock.patch("trymerge.get_rockset_results", side_effect=mocked_rockset_results)
@mock.patch("trymerge.gh_graphql", side_effect=mocked_gh_graphql)
@mock.patch("trymerge.gh_fetch_merge_base", return_value="")
@mock.patch("trymerge.get_drci_classifications", return_value={})
@ -1008,7 +984,6 @@ class TestBypassFailuresOnSandCastle(TestCase):
self.assertTrue(len(failed) == 2)
@mock.patch("trymerge.get_rockset_results", side_effect=mocked_rockset_results)
@mock.patch("trymerge.gh_graphql", side_effect=mocked_gh_graphql)
@mock.patch("trymerge.gh_fetch_merge_base", return_value="")
@mock.patch(


@ -452,8 +452,6 @@ RE_DIFF_REV = re.compile(r"^Differential Revision:.+?(D[0-9]+)", re.MULTILINE)
CIFLOW_LABEL = re.compile(r"^ciflow/.+")
CIFLOW_TRUNK_LABEL = re.compile(r"^ciflow/trunk")
MERGE_RULE_PATH = Path(".github") / "merge_rules.yaml"
ROCKSET_MERGES_COLLECTION = "merges"
ROCKSET_MERGES_WORKSPACE = "commons"
REMOTE_MAIN_BRANCH = "origin/main"
DRCI_CHECKRUN_NAME = "Dr.CI"
INTERNAL_CHANGES_CHECKRUN_NAME = "Meta Internal-Only Changes Check"
@ -1180,7 +1178,7 @@ class GitHubPR:
merge_commit_sha = repo.rev_parse(name=self.default_branch())
if comment_id and self.pr_num:
# Finally, upload the record to Rockset. The list of pending and failed
# Finally, upload the record to s3. The list of pending and failed
# checks are at the time of the merge
save_merge_record(
comment_id=comment_id,
@ -1202,7 +1200,7 @@ class GitHubPR:
ignore_current=bool(ignore_current_checks),
)
else:
print("Missing comment ID or PR number, couldn't upload to Rockset")
print("Missing comment ID or PR number, couldn't upload to s3")
# Usually Github will see that the commit has "resolves <pr_num>" in the
# commit message and close the PR, but sometimes it doesn't, leading to
@ -1481,7 +1479,7 @@ def find_matching_merge_rule(
# Categorize all checks when skip_mandatory_checks (force merge) is set. Do it here
# where the list of checks is readily available. These records will be saved into
# Rockset merge records
# s3 merge records
(
pending_mandatory_checks,
failed_mandatory_checks,
@ -1568,7 +1566,7 @@ def save_merge_record(
This saves the merge records as a json, which can later be uploaded to s3
"""
# Prepare the record to be written into Rockset
# Prepare the record to be written into s3
data = [
{
"comment_id": comment_id,
@ -1590,7 +1588,8 @@ def save_merge_record(
"ignore_current": ignore_current,
"error": error,
# This is a unique identifier for the record for deduping purposes
# in rockset. Any unique string would work
# in Rockset. Any unique string would work. This will not be used
# after we migrate off Rockset
"_id": f"{project}-{pr_num}-{comment_id}-{os.environ.get('GITHUB_RUN_ID')}",
}
]
@ -1600,36 +1599,6 @@ def save_merge_record(
json.dump(data, f)
@retries_decorator(rc=[])
def get_rockset_results(head_sha: str, merge_base: str) -> List[Dict[str, Any]]:
query = f"""
SELECT
w.name as workflow_name,
j.id,
j.name,
j.conclusion,
j.completed_at,
j.html_url,
j.head_sha,
j.torchci_classification.captures as failure_captures,
LENGTH(j.steps) as steps,
FROM
commons.workflow_job j join commons.workflow_run w on w.id = j.run_id
where
j.head_sha in ('{head_sha}','{merge_base}')
"""
try:
import rockset # type: ignore[import]
res = rockset.RocksetClient(
host="api.usw2a1.rockset.com", api_key=os.environ["ROCKSET_API_KEY"]
).sql(query)
return cast(List[Dict[str, Any]], res.results)
except ModuleNotFoundError:
print("Could not use RockSet as rocket dependency is missing")
return []
@retries_decorator()
def get_drci_classifications(pr_num: int, project: str = "pytorch") -> Any:
"""
@ -2067,7 +2036,7 @@ def categorize_checks(
pending_checks: List[Tuple[str, Optional[str], Optional[int]]] = []
failed_checks: List[Tuple[str, Optional[str], Optional[int]]] = []
# failed_checks_categorization is used to keep track of all ignorable failures when saving the merge record on Rockset
# failed_checks_categorization is used to keep track of all ignorable failures when saving the merge record on s3
failed_checks_categorization: Dict[str, List[Any]] = defaultdict(list)
# If required_checks is not set or empty, consider all names are relevant
@ -2126,7 +2095,7 @@ def categorize_checks(
):
failed_checks = failed_checks + flaky_or_broken_trunk
# The list of failed_checks_categorization is returned so that it can be saved into the Rockset merge record
# The list of failed_checks_categorization is returned so that it can be saved into the s3 merge record
return (pending_checks, failed_checks, failed_checks_categorization)
@ -2410,7 +2379,7 @@ def main() -> None:
handle_exception(e)
if args.comment_id and args.pr_num:
# Finally, upload the record to Rockset, we don't have access to the
# Finally, upload the record to s3, we don't have access to the
# list of pending and failed checks here, but they are not really
# needed at the moment
save_merge_record(
@ -2433,7 +2402,7 @@ def main() -> None:
error=str(e),
)
else:
print("Missing comment ID or PR number, couldn't upload to Rockset")
print("Missing comment ID or PR number, couldn't upload to s3")
finally:
if not args.check_mergeability:
gh_remove_label(

.github/scripts/update_runner_determinator.py (executable file)

@ -0,0 +1,31 @@
#!/usr/bin/env python3
import re
# Read the contents of runner_determinator.py
with open(".github/scripts/runner_determinator.py") as script_file:
script_content = script_file.read()
# Indent the script content by 10 spaces to match destination indentation
indented_script_content = "\n".join(
[" " * 10 + line if line else line for line in script_content.splitlines()]
)
# Read the contents of _runner-determinator.yml
with open(".github/workflows/_runner-determinator.yml") as yml_file:
yml_content = yml_file.read()
# Replace the content between the markers
new_yml_content = re.sub(
r"(cat <<EOF > runner_determinator.py\n)(.*?)(\n\s+EOF)",
lambda match: match.group(1) + indented_script_content + match.group(3),
yml_content,
flags=re.DOTALL,
)
# Save the modified content back to _runner-determinator.yml
with open(".github/workflows/_runner-determinator.yml", "w") as yml_file:
yml_file.write(new_yml_content)
print("Updated _runner-determinator.yml with the contents of runner_determinator.py")
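The heredoc replacement performed by this script can be tried on an in-memory string without touching any files. The sketch below is an illustration under stated assumptions: `splice_script` is a hypothetical wrapper, and the sample YAML fragment is made up; only the regular expression, the 10-space indentation, and the `re.DOTALL` flag are taken from the script above.

```python
import re


def splice_script(yml_content: str, script_content: str) -> str:
    """Hypothetical wrapper: replace the heredoc body with the indented script."""
    # Indent non-empty lines by 10 spaces; keep empty lines empty.
    indented = "\n".join(
        " " * 10 + line if line else line for line in script_content.splitlines()
    )
    return re.sub(
        r"(cat <<EOF > runner_determinator.py\n)(.*?)(\n\s+EOF)",
        # A function replacement is used so that backslashes in the script
        # are inserted literally instead of being parsed as escapes.
        lambda match: match.group(1) + indented + match.group(3),
        yml_content,
        flags=re.DOTALL,
    )


if __name__ == "__main__":
    # Made-up workflow fragment for demonstration purposes only.
    sample_yml = (
        "          cat <<EOF > runner_determinator.py\n"
        "          old line\n"
        "          EOF\n"
        "          python3 runner_determinator.py\n"
    )
    print(splice_script(sample_yml, 'print("hello")\n\nprint("world")'))
```

The lazy `.*?` together with `re.DOTALL` makes the match stop at the first indented `EOF` line, so text after the heredoc is left untouched.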


@ -68,6 +68,7 @@ jobs:
needs: get-label-type
with:!{{ upload.binary_env_as_input(config) }}
{%- if "aarch64" in build_environment %}
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
{%- elif "s390x" in build_environment %}
@ -102,6 +103,7 @@ jobs:
build_name: !{{ config["build_name"] }}
build_environment: !{{ build_environment }}
{%- if "aarch64" in build_environment %}
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.2xlarge
ALPINE_IMAGE: "arm64v8/alpine"
{%- elif "s390x" in build_environment %}


@ -91,14 +91,14 @@ jobs:
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
- name: Check if in a ARC runner
- name: Check if in a container runner
shell: bash
id: check_arc_runner
run: echo "IN_ARC_RUNNER=$([ -f /.inarc ] && echo true || echo false)" >> "$GITHUB_OUTPUT"
id: check_container_runner
run: echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT"
- name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG
uses: pytorch/test-infra/.github/actions/setup-nvidia@main
if: ${{ inputs.cuda-version != 'cpu' && steps.check_arc_runner.outputs.IN_ARC_RUNNER == 'false' }}
if: ${{ inputs.cuda-version != 'cpu' && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false' }}
- name: Output disk space left
run: |


@ -114,22 +114,32 @@ jobs:
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
- name: Check if in a ARC runner
- name: Check if in a container runner
shell: bash
id: check_arc_runner
run: echo "IN_ARC_RUNNER=$([ -f /.inarc ] && echo true || echo false)" >> "$GITHUB_OUTPUT"
id: check_container_runner
run: echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT"
- name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG
id: install-nvidia-driver
uses: pytorch/test-infra/.github/actions/setup-nvidia@main
if: ${{ contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu') && steps.check_arc_runner.outputs.IN_ARC_RUNNER == 'false' }}
if: ${{ contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu') && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false' }}
- name: Setup GPU_FLAG for docker run
id: setup-gpu-flag
run: echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"
if: ${{ contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu') && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'true' }}
- name: Setup SCCACHE_SERVER_PORT environment for docker run when on container
id: setup-sscache-port-flag
run: echo "SCCACHE_SERVER_PORT_DOCKER_FLAG=-e SCCACHE_SERVER_PORT=$((RUNNER_UID + 4226))" >> "${GITHUB_ENV}"
if: ${{ steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'true' }}
- name: Lock NVIDIA A100 40GB Frequency
run: |
sudo nvidia-smi -pm 1
sudo nvidia-smi -ac 1215,1410
nvidia-smi
if: contains(matrix.runner, 'a100')
if: ${{ contains(matrix.runner, 'a100') && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false' }}
- name: Start monitoring script
id: monitor-script
@ -208,6 +218,7 @@ jobs:
NO_TD: ${{ steps.keep-going.outputs.ci-no-td }}
TD_DISTRIBUTED: ${{ steps.keep-going.outputs.ci-td-distributed }}
SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2
SCCACHE_REGION: us-east-1
SCCACHE_S3_KEY_PREFIX: ${{ github.workflow }}
SHM_SIZE: ${{ contains(inputs.build-environment, 'cuda') && '2g' || '1g' }}
DOCKER_IMAGE: ${{ inputs.docker-image }}
@ -218,6 +229,7 @@ jobs:
DASHBOARD_TAG: ${{ inputs.dashboard-tag }}
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }}
IS_A100_RUNNER: ${{ contains(matrix.runner, 'a100') && '1' || '0' }}
run: |
set -x
@ -236,6 +248,7 @@ jobs:
# shellcheck disable=SC2086,SC2090
container_name=$(docker run \
${GPU_FLAG:-} \
${SCCACHE_SERVER_PORT_DOCKER_FLAG:-} \
-e BUILD_ENVIRONMENT \
-e PR_NUMBER \
-e GITHUB_ACTIONS \
@ -265,6 +278,7 @@ jobs:
-e PR_LABELS \
-e MAX_JOBS="$(nproc --ignore=2)" \
-e SCCACHE_BUCKET \
-e SCCACHE_REGION \
-e SCCACHE_S3_KEY_PREFIX \
-e XLA_CUDA \
-e XLA_CLANG_CACHE_S3_BUCKET_NAME \
@ -274,6 +288,7 @@ jobs:
-e HUGGING_FACE_HUB_TOKEN \
-e SCRIBE_GRAPHQL_ACCESS_TOKEN \
-e DASHBOARD_TAG \
-e IS_A100_RUNNER \
--env-file="/tmp/github_env_${GITHUB_RUN_ID}" \
--security-opt seccomp=unconfined \
--cap-add=SYS_PTRACE \
@ -343,7 +358,7 @@ jobs:
- name: Teardown Linux
uses: pytorch/test-infra/.github/actions/teardown-linux@main
if: always()
if: always() && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false'
# NB: We are currently having an intermittent GPU-related issue on G5 runners with
# A10G GPU. Once this happens, trying to reset the GPU as done in setup-nvidia does


@ -59,6 +59,10 @@ jobs:
cat <<EOF > runner_determinator.py
# flake8: noqa: G004
# Note: Copies of this script in runner_determinator.py and _runner-determinator.yml
# must be kept in sync. You can do it easily by running the following command:
# python .github/scripts/update_runner_determinator.py
"""
This runner determinator is used to determine which set of runners to run a
GitHub job on. It uses the first comment of a GitHub issue (by default
@ -138,6 +142,9 @@ jobs:
rollout_perc: float = (
0 # Percentage of workflows to experiment on when user is not opted-in.
)
all_branches: bool = (
False # If True, the experiment is also enabled on the exception branches
)
# Add more fields as needed
@ -271,7 +278,7 @@ jobs:
def is_exception_branch(branch: str) -> bool:
"""
Branches that get opted out of all experiments and should always use Meta runners
Branches that get opted out of experiments by default, until they're explicitly enabled.
"""
return branch.split("/")[0] in {"main", "nightly", "release", "landchecks"}
@ -397,7 +404,10 @@ jobs:
def get_runner_prefix(
rollout_state: str, workflow_requestors: Iterable[str], is_canary: bool = False
rollout_state: str,
workflow_requestors: Iterable[str],
branch: str,
is_canary: bool = False,
) -> str:
settings = parse_settings(rollout_state)
user_optins = parse_users(rollout_state)
@ -407,6 +417,12 @@ jobs:
for experiment_name, experiment_settings in settings.experiments.items():
enabled = False
if not experiment_settings.all_branches and is_exception_branch(branch):
log.info(
f"Branch {branch} is an exception branch. Not enabling experiment {experiment_name}."
)
continue
# Is any workflow_requestor opted in to this experiment?
opted_in_users = [
requestor
@ -466,35 +482,34 @@ jobs:
def main() -> None:
args = parse_args()
if args.github_ref_type == "branch" and is_exception_branch(args.github_branch):
log.info(
f"Exception branch: '{args.github_branch}', using Meta runners and no experiments."
runner_label_prefix = DEFAULT_LABEL_PREFIX
try:
rollout_state = get_rollout_state_from_issue(
args.github_token, args.github_issue_repo, args.github_issue
)
runner_label_prefix = DEFAULT_LABEL_PREFIX
else:
try:
rollout_state = get_rollout_state_from_issue(
args.github_token, args.github_issue_repo, args.github_issue
)
username = get_potential_pr_author(
args.github_token,
args.github_repo,
args.github_actor,
args.github_ref_type,
args.github_branch,
)
username = get_potential_pr_author(
args.github_token,
args.github_repo,
args.github_actor,
args.github_ref_type,
args.github_branch,
)
is_canary = args.github_repo == "pytorch/pytorch-canary"
is_canary = args.github_repo == "pytorch/pytorch-canary"
runner_label_prefix = get_runner_prefix(
rollout_state, (args.github_issue_owner, username), is_canary
)
runner_label_prefix = get_runner_prefix(
rollout_state,
(args.github_issue_owner, username),
args.github_branch,
is_canary,
)
except Exception as e:
log.error(
f"Failed to get issue. Defaulting to Meta runners and no experiments. Exception: {e}"
)
except Exception as e:
log.error(
f"Failed to get issue. Defaulting to Meta runners and no experiments. Exception: {e}"
)
set_github_output(GH_OUTPUT_KEY_LABEL_TYPE, runner_label_prefix)
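
The branch gate introduced in the hunks above can be read in isolation. What follows is a minimal sketch, not the script itself: `is_exception_branch` is copied from the diff, while the `Experiment` dataclass and the `experiment_applies` helper are hypothetical names introduced only for illustration. It assumes the only inputs that matter for the gate are the `all_branches` flag and the branch name.

```python
from dataclasses import dataclass

# Branch prefixes taken from is_exception_branch in the diff.
EXCEPTION_BRANCH_PREFIXES = {"main", "nightly", "release", "landchecks"}


def is_exception_branch(branch: str) -> bool:
    """Branches that are opted out of experiments unless explicitly enabled."""
    return branch.split("/")[0] in EXCEPTION_BRANCH_PREFIXES


@dataclass
class Experiment:
    # Hypothetical stand-in for the per-experiment settings in the diff.
    rollout_perc: float = 0
    all_branches: bool = False


def experiment_applies(experiment: Experiment, branch: str) -> bool:
    """Hypothetical helper: False when the diff's loop would `continue`."""
    # The diff skips an experiment when
    #   not all_branches and is_exception_branch(branch)
    # so the experiment is still considered in the complementary case.
    return experiment.all_branches or not is_exception_branch(branch)


if __name__ == "__main__":
    for branch in ("main", "release/2.5", "gh/someone/feature"):
        for exp in (Experiment(), Experiment(all_branches=True)):
            print(branch, exp.all_branches, experiment_applies(exp, branch))
```

Passing the gate only means the experiment moves on to the opt-in and rollout-percentage checks; it does not by itself enable the experiment.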

View File

@ -189,7 +189,7 @@ jobs:
run: |
pushd "${PYTORCH_FINAL_PACKAGE_DIR}"
# shellcheck disable=SC2046,SC2102
python3 -mpip install $(echo *.whl)[opt-einsum,optree] optree==0.12.1
python3 -mpip install $(echo *.whl)[opt-einsum,optree] optree==0.13.0
popd
.ci/pytorch/win-test.sh


@ -43,7 +43,7 @@ jobs:
strategy:
fail-fast: false
matrix:
py_vers: [ "3.8", "3.9", "3.10", "3.11", "3.12" ]
py_vers: [ "3.9", "3.10", "3.11", "3.12" ]
device: ["cuda", "rocm", "xpu"]
include:
- device: "rocm"
@ -91,9 +91,6 @@ jobs:
# Determine python executable for given version
case $PY_VERS in
3.8)
PYTHON_EXECUTABLE=/opt/python/cp38-cp38/bin/python
;;
3.9)
PYTHON_EXECUTABLE=/opt/python/cp39-cp39/bin/python
;;
@ -214,7 +211,7 @@ jobs:
strategy:
fail-fast: false
matrix:
py_vers: [ "3.8", "3.9", "3.10", "3.11", "3.12" ]
py_vers: [ "3.9", "3.10", "3.11", "3.12" ]
timeout-minutes: 40
env:
DOCKER_IMAGE: pytorch/conda-builder:cpu
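
The `case $PY_VERS` block in this file maps each matrix version to an interpreter under `/opt/python`. The same lookup can be sketched in Python, assuming the `cpXY-cpXY` directory pattern visible for 3.8 and 3.9 also holds for the remaining matrix entries (the diff does not show those branches). `SUPPORTED_PY_VERS` and `python_executable` are hypothetical names used only for this illustration.

```python
# Matrix values after this change; 3.8 is no longer listed.
SUPPORTED_PY_VERS = ("3.9", "3.10", "3.11", "3.12")


def python_executable(py_vers: str) -> str:
    """Return the assumed interpreter path for a matrix version string."""
    if py_vers not in SUPPORTED_PY_VERS:
        # Roughly what a fall-through branch of the case statement would catch.
        raise ValueError(f"Unsupported python version {py_vers}")
    major, minor = py_vers.split(".")
    tag = f"cp{major}{minor}"
    # Assumption: the path layout follows the 3.9 entry shown in the diff.
    return f"/opt/python/{tag}-{tag}/bin/python"


if __name__ == "__main__":
    for version in SUPPORTED_PY_VERS:
        print(version, python_executable(version))
```

Under that assumption, dropping 3.8 from the matrix and from the case statement together keeps the two lists consistent, which is what the hunks above do.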


@ -67,6 +67,7 @@ jobs:
pytorch-linux-jammy-py3.12-halide,
pytorch-linux-jammy-xpu-2024.0-py3,
pytorch-linux-jammy-py3-clang15-asan,
pytorch-linux-jammy-py3-clang18-asan,
pytorch-linux-focal-py3-clang10-onnx,
pytorch-linux-focal-linter,
pytorch-linux-jammy-cuda11.8-cudnn9-py3.9-linter,
@ -78,7 +79,9 @@ jobs:
- docker-image-name: pytorch-linux-jammy-aarch64-py3.10-gcc11-inductor-benchmarks
runner: linux.arm64.m7g.4xlarge
timeout-minutes: 600
runs-on: "${{ needs.get-label-type.outputs.label-type }}${{ matrix.runner }}"
# Docker uploads fail from LF runners, see https://github.com/pytorch/pytorch/pull/137358
# runs-on: "${{ needs.get-label-type.outputs.label-type }}${{ matrix.runner }}"
runs-on: "${{ matrix.runner }}"
env:
DOCKER_IMAGE_BASE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/${{ matrix.docker-image-name }}
steps:


@ -60,6 +60,7 @@ jobs:
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cpu-aarch64-main
use_split_build: False
DESIRED_PYTHON: "3.9"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_9-cpu-aarch64
@ -86,6 +87,7 @@ jobs:
DESIRED_PYTHON: "3.9"
build_name: manywheel-py3_9-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.2xlarge
ALPINE_IMAGE: "arm64v8/alpine"
secrets:
@ -130,6 +132,7 @@ jobs:
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.9"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_9-cuda-aarch64
@ -177,6 +180,7 @@ jobs:
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cpu-aarch64-main
use_split_build: False
DESIRED_PYTHON: "3.10"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_10-cpu-aarch64
@ -203,6 +207,7 @@ jobs:
DESIRED_PYTHON: "3.10"
build_name: manywheel-py3_10-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.2xlarge
ALPINE_IMAGE: "arm64v8/alpine"
secrets:
@ -247,6 +252,7 @@ jobs:
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.10"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_10-cuda-aarch64
@ -294,6 +300,7 @@ jobs:
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cpu-aarch64-main
use_split_build: False
DESIRED_PYTHON: "3.11"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_11-cpu-aarch64
@ -320,6 +327,7 @@ jobs:
DESIRED_PYTHON: "3.11"
build_name: manywheel-py3_11-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.2xlarge
ALPINE_IMAGE: "arm64v8/alpine"
secrets:
@ -364,6 +372,7 @@ jobs:
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.11"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_11-cuda-aarch64
@ -411,6 +420,7 @@ jobs:
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cpu-aarch64-main
use_split_build: False
DESIRED_PYTHON: "3.12"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_12-cpu-aarch64
@ -437,6 +447,7 @@ jobs:
DESIRED_PYTHON: "3.12"
build_name: manywheel-py3_12-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.2xlarge
ALPINE_IMAGE: "arm64v8/alpine"
secrets:
@ -481,6 +492,7 @@ jobs:
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.12"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_12-cuda-aarch64


@ -3324,3 +3324,353 @@ jobs:
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_13t-cpu-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu
DOCKER_IMAGE: pytorch/manylinux-builder:cpu-main
use_split_build: False
DESIRED_PYTHON: "3.13t"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13t-cpu
build_environment: linux-binary-manywheel
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cpu-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_13t-cpu-build
- get-label-type
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu
DOCKER_IMAGE: pytorch/manylinux-builder:cpu-main
use_split_build: False
DESIRED_PYTHON: "3.13t"
build_name: manywheel-py3_13t-cpu
build_environment: linux-binary-manywheel
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cpu-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_13t-cpu-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu
DOCKER_IMAGE: pytorch/manylinux-builder:cpu-main
use_split_build: False
DESIRED_PYTHON: "3.13t"
build_name: manywheel-py3_13t-cpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_13t-cpu-cxx11-abi-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu-cxx11-abi
GPU_ARCH_TYPE: cpu-cxx11-abi
DOCKER_IMAGE: pytorch/manylinuxcxx11-abi-builder:cpu-cxx11-abi-main
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.13t"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13t-cpu-cxx11-abi
build_environment: linux-binary-manywheel
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cpu-cxx11-abi-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_13t-cpu-cxx11-abi-build
- get-label-type
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu-cxx11-abi
GPU_ARCH_TYPE: cpu-cxx11-abi
DOCKER_IMAGE: pytorch/manylinuxcxx11-abi-builder:cpu-cxx11-abi-main
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.13t"
build_name: manywheel-py3_13t-cpu-cxx11-abi
build_environment: linux-binary-manywheel
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cpu-cxx11-abi-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_13t-cpu-cxx11-abi-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu-cxx11-abi
GPU_ARCH_TYPE: cpu-cxx11-abi
DOCKER_IMAGE: pytorch/manylinuxcxx11-abi-builder:cpu-cxx11-abi-main
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.13t"
build_name: manywheel-py3_13t-cpu-cxx11-abi
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_13t-cuda11_8-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu118
GPU_ARCH_VERSION: 11.8
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.8-main
use_split_build: False
DESIRED_PYTHON: "3.13t"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13t-cuda11_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu11==11.8.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu11==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu11==11.11.3.6; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu11==10.9.0.58; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu11==11.4.1.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu11==11.7.5.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu11==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu11==11.8.86; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda11_8-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_13t-cuda11_8-build
- get-label-type
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu118
GPU_ARCH_VERSION: 11.8
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.8-main
use_split_build: False
DESIRED_PYTHON: "3.13t"
build_name: manywheel-py3_13t-cuda11_8
build_environment: linux-binary-manywheel
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda11_8-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_13t-cuda11_8-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu118
GPU_ARCH_VERSION: 11.8
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.8-main
use_split_build: False
DESIRED_PYTHON: "3.13t"
build_name: manywheel-py3_13t-cuda11_8
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_13t-cuda12_1-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu121
GPU_ARCH_VERSION: 12.1
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.1-main
use_split_build: False
DESIRED_PYTHON: "3.13t"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13t-cuda12_1
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda12_1-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_13t-cuda12_1-build
- get-label-type
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu121
GPU_ARCH_VERSION: 12.1
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.1-main
use_split_build: False
DESIRED_PYTHON: "3.13t"
build_name: manywheel-py3_13t-cuda12_1
build_environment: linux-binary-manywheel
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda12_1-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_13t-cuda12_1-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu121
GPU_ARCH_VERSION: 12.1
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.1-main
use_split_build: False
DESIRED_PYTHON: "3.13t"
build_name: manywheel-py3_13t-cuda12_1
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_13t-cuda12_4-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.4-main
use_split_build: False
DESIRED_PYTHON: "3.13t"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13t-cuda12_4
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda12_4-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_13t-cuda12_4-build
- get-label-type
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.4-main
use_split_build: False
DESIRED_PYTHON: "3.13t"
build_name: manywheel-py3_13t-cuda12_4
build_environment: linux-binary-manywheel
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda12_4-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_13t-cuda12_4-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.4-main
use_split_build: False
DESIRED_PYTHON: "3.13t"
build_name: manywheel-py3_13t-cuda12_4
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml


@ -1514,3 +1514,283 @@ jobs:
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_13t-cuda11_8-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu118
GPU_ARCH_VERSION: 11.8
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.8-main
use_split_build: True
DESIRED_PYTHON: "3.13t"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13t-cuda11_8
build_environment: linux-binary-manywheel-split
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu11==11.8.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu11==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu11==11.11.3.6; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu11==10.9.0.58; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu11==11.4.1.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu11==11.7.5.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu11==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu11==11.8.86; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda11_8-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_13t-cuda11_8-build
- get-label-type
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu118
GPU_ARCH_VERSION: 11.8
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.8-main
use_split_build: True
DESIRED_PYTHON: "3.13t"
build_name: manywheel-py3_13t-cuda11_8
build_environment: linux-binary-manywheel-split
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda11_8-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_13t-cuda11_8-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu118
GPU_ARCH_VERSION: 11.8
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.8-main
use_split_build: True
DESIRED_PYTHON: "3.13t"
build_name: manywheel-py3_13t-cuda11_8
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_13t-cuda12_1-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu121
GPU_ARCH_VERSION: 12.1
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.1-main
use_split_build: True
DESIRED_PYTHON: "3.13t"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13t-cuda12_1
build_environment: linux-binary-manywheel-split
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda12_1-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_13t-cuda12_1-build
- get-label-type
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu121
GPU_ARCH_VERSION: 12.1
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.1-main
use_split_build: True
DESIRED_PYTHON: "3.13t"
build_name: manywheel-py3_13t-cuda12_1
build_environment: linux-binary-manywheel-split
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda12_1-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_13t-cuda12_1-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu121
GPU_ARCH_VERSION: 12.1
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.1-main
use_split_build: True
DESIRED_PYTHON: "3.13t"
build_name: manywheel-py3_13t-cuda12_1
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_13t-cuda12_4-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.4-main
use_split_build: True
DESIRED_PYTHON: "3.13t"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13t-cuda12_4
build_environment: linux-binary-manywheel-split
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda12_4-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_13t-cuda12_4-build
- get-label-type
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.4-main
use_split_build: True
DESIRED_PYTHON: "3.13t"
build_name: manywheel-py3_13t-cuda12_4
build_environment: linux-binary-manywheel-split
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda12_4-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_13t-cuda12_4-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.4-main
use_split_build: True
DESIRED_PYTHON: "3.13t"
build_name: manywheel-py3_13t-cuda12_4
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_13t-cpu-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu
DOCKER_IMAGE: pytorch/manylinux-builder:cpu-main
use_split_build: True
DESIRED_PYTHON: "3.13t"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13t-cpu
build_environment: linux-binary-manywheel-split
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cpu-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_13t-cpu-build
- get-label-type
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu
DOCKER_IMAGE: pytorch/manylinux-builder:cpu-main
use_split_build: True
DESIRED_PYTHON: "3.13t"
build_name: manywheel-py3_13t-cpu
build_environment: linux-binary-manywheel-split
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cpu-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_13t-cpu-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu
DOCKER_IMAGE: pytorch/manylinux-builder:cpu-main
use_split_build: True
DESIRED_PYTHON: "3.13t"
build_name: manywheel-py3_13t-cpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
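
Note on `PYTORCH_EXTRA_INSTALL_REQUIREMENTS` in the build job above: the value packs several pinned requirements into one string, separated by ` | `, each followed by an environment marker after `;`. A minimal sketch of how such a string could be taken apart, assuming only that separator convention. The helper name `split_requirements` is hypothetical and is not taken from the PyTorch build scripts.

```python
def split_requirements(value: str) -> list[tuple[str, str]]:
    """Split a '|'-separated requirement string into (requirement, marker) pairs.

    Assumption: '|' never appears inside a requirement or marker,
    which holds for the pins listed in the workflow above.
    """
    pairs = []
    for item in value.split("|"):
        item = item.strip()
        if not item:
            continue
        # Everything before the first ';' is the pinned requirement,
        # everything after it is the environment marker (may be empty).
        requirement, _, marker = item.partition(";")
        pairs.append((requirement.strip(), marker.strip()))
    return pairs


# Shortened example using two of the pins from the workflow above.
example = (
    "nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | "
    "nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64'"
)
for requirement, marker in split_requirements(example):
    print(requirement, "->", marker)
```

Splitting on the bare `|` and stripping whitespace keeps the sketch tolerant of missing spaces around the separator.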

@ -120,6 +120,28 @@ jobs:
test-matrix: ${{ needs.linux-jammy-cpu-py3_12-inductor-halide-build.outputs.test-matrix }}
secrets: inherit
linux-jammy-cpu-py3_12-inductor-triton-cpu-build:
name: linux-jammy-cpu-py3.12-gcc11-inductor-triton-cpu
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
build-environment: linux-jammy-py3.12-gcc11
docker-image-name: pytorch-linux-jammy-py3.12-triton-cpu
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
test-matrix: |
{ include: [
{ config: "inductor-triton-cpu", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.12xlarge" },
]}
linux-jammy-cpu-py3_12-inductor-triton-cpu-test:
name: linux-jammy-cpu-py3.12-gcc11-inductor-triton-cpu
uses: ./.github/workflows/_linux-test.yml
needs: linux-jammy-cpu-py3_12-inductor-triton-cpu-build
with:
build-environment: linux-jammy-py3.12-gcc11
docker-image: ${{ needs.linux-jammy-cpu-py3_12-inductor-triton-cpu-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-cpu-py3_12-inductor-triton-cpu-build.outputs.test-matrix }}
linux-focal-cuda12_4-py3_10-gcc9-inductor-build:
# Should be synced with the one in inductor-periodic.yml but this only runs inductor_timm
name: cuda12.4-py3.10-gcc9-sm86

@ -11,7 +11,6 @@ jobs:
contents: read
pull-requests: write
runs-on: lf.linux.2xlarge
continue-on-error: true
if: ${{ github.repository_owner == 'pytorch' }}
steps:
- name: Checkout pytorch
@ -31,10 +30,12 @@ jobs:
bash .github/scripts/lintrunner.sh
- name: Check for changes
id: git-check
continue-on-error: true
run: |
git diff --exit-code || echo "changes=true" >> "$GITHUB_OUTPUT"
- name: Suggest changes
if: steps.git-check.outputs.changes == 'true'
continue-on-error: true
uses: parkerbxyz/suggest-changes@v1
with:
comment: "Please commit the suggested changes from pytorch's linter."

@ -223,6 +223,7 @@ jobs:
cache: pip
- name: Install dependencies
run: |
python3 -m pip install --upgrade pip
pip install pytest-rerunfailures==11.1.* pytest-flakefinder==1.1.* pytest-xdist==3.3.* expecttest==0.2.* fbscribelogger==0.1.* numpy==1.24.*
pip install torch --pre --index-url https://download.pytorch.org/whl/nightly/cpu/
- name: Run run_test.py (nonretryable)

@ -57,10 +57,10 @@ jobs:
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9
test-matrix: |
{ include: [
{ config: "nogpu_AVX512", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "nogpu_AVX512", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "nogpu_NO_AVX2", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "nogpu_NO_AVX2", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "nogpu_AVX512", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "nogpu_AVX512", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "nogpu_NO_AVX2", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "nogpu_NO_AVX2", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "jit_legacy", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
]}
linux-focal-cuda12_1-py3_10-gcc9-test:
@ -89,10 +89,10 @@ jobs:
{ config: "default", shard: 3, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 4, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 5, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
{ config: "nogpu_AVX512", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "nogpu_AVX512", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "nogpu_NO_AVX2", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "nogpu_NO_AVX2", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "nogpu_AVX512", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "nogpu_AVX512", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "nogpu_NO_AVX2", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "nogpu_NO_AVX2", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "jit_legacy", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
]}
@ -118,9 +118,10 @@ jobs:
docker-image-name: pytorch-linux-jammy-py3.9-gcc11
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 2, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 3, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 1, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 2, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 3, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 4, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
]}
parallelnative-linux-jammy-py3_9-gcc11-test:
@ -339,10 +340,10 @@ jobs:
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9
test-matrix: |
{ include: [
{ config: "nogpu_AVX512", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "nogpu_AVX512", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "nogpu_NO_AVX2", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "nogpu_NO_AVX2", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "nogpu_AVX512", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "nogpu_AVX512", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "nogpu_NO_AVX2", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "nogpu_NO_AVX2", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "jit_legacy", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
]}
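
Note on the shard changes in this file: each `test-matrix` lists one entry per shard, numbered from 1 to `num_shards`, so raising the `default` shard count from 3 to 4 for the `pytorch-linux-jammy-py3.9-gcc11` image means rewriting every existing entry and adding one more. A minimal sketch of that structure, assuming a hypothetical helper `shard_entries` (the workflows in this diff list their entries explicitly, and the `label-type` runner prefix is omitted here):

```python
import json


def shard_entries(config: str, num_shards: int, runner: str) -> list[dict]:
    """Build one matrix entry per shard, with shards numbered from 1."""
    return [
        {"config": config, "shard": shard, "num_shards": num_shards, "runner": runner}
        for shard in range(1, num_shards + 1)
    ]


# The four-shard "default" configuration shown in the hunk above.
matrix = {"include": shard_entries("default", 4, "linux.2xlarge")}
print(json.dumps(matrix, indent=2))
```

Generating the entries from a single `num_shards` value avoids the mismatch that arises when one entry is added but the others still carry the old count.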

@ -185,10 +185,10 @@ jobs:
docker-image-name: pytorch-linux-focal-py3.9-clang10
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 2, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 3, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 4, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 1, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "default", shard: 2, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "default", shard: 3, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "default", shard: 4, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "crossref", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "crossref", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "dynamo", shard: 1, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
@ -217,10 +217,10 @@ jobs:
docker-image-name: pytorch-linux-focal-py3.11-clang10
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 2, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 3, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 4, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 1, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "default", shard: 2, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "default", shard: 3, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "default", shard: 4, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "crossref", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "crossref", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "dynamo", shard: 1, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
@ -251,10 +251,10 @@ jobs:
docker-image-name: pytorch-linux-focal-py3.12-clang10
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 2, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 3, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 4, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 1, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "default", shard: 2, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "default", shard: 3, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "default", shard: 4, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "dynamo", shard: 1, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "dynamo", shard: 2, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "dynamo", shard: 3, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
@ -588,9 +588,9 @@ jobs:
docker-image-name: pytorch-linux-focal-py3.12-clang10
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 3, runner: "linux.2xlarge" },
{ config: "default", shard: 2, num_shards: 3, runner: "linux.2xlarge" },
{ config: "default", shard: 3, num_shards: 3, runner: "linux.2xlarge" },
{ config: "default", shard: 1, num_shards: 3, runner: "linux.4xlarge" },
{ config: "default", shard: 2, num_shards: 3, runner: "linux.4xlarge" },
{ config: "default", shard: 3, num_shards: 3, runner: "linux.4xlarge" },
{ config: "dynamo", shard: 1, num_shards: 3, runner: "linux.2xlarge" },
{ config: "dynamo", shard: 2, num_shards: 3, runner: "linux.2xlarge" },
{ config: "dynamo", shard: 3, num_shards: 3, runner: "linux.2xlarge" },

.github/workflows/s390.yml (new file, 24 lines)

@ -0,0 +1,24 @@
name: s390
on:
push:
branches:
- main
tags:
- ciflow/s390/*
workflow_dispatch:
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }}
cancel-in-progress: true
permissions: read-all
jobs:
linux-manylinux-2_28-py3-cpu-s390x-build:
name: linux-manylinux-2_28-py3-cpu-s390x
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-s390x-binary-manywheel
docker-image-name: pytorch/manylinuxs390x-builder:cpu-s390x-main
runner: linux.s390x

@ -266,10 +266,10 @@ jobs:
docker-image-name: pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9
test-matrix: |
{ include: [
{ config: "nogpu_AVX512", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "nogpu_AVX512", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "nogpu_NO_AVX2", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "nogpu_NO_AVX2", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "nogpu_AVX512", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "nogpu_AVX512", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "nogpu_NO_AVX2", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "nogpu_NO_AVX2", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "jit_legacy", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 1, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 2, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
@ -316,11 +316,3 @@ jobs:
build-environment: linux-focal-cuda11.8-py3.10-gcc9-experimental-split-build
docker-image: ${{ needs.linux-focal-cuda11_8-py3_10-gcc9-experimental-split-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda11_8-py3_10-gcc9-experimental-split-build.outputs.test-matrix }}
linux-manylinux-2_28-py3-cpu-s390x-build:
name: linux-manylinux-2_28-py3-cpu-s390x
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-s390x-binary-manywheel
docker-image-name: pytorch/manylinuxs390x-builder:cpu-s390x-main
runner: linux.s390x

@ -28,7 +28,7 @@ jobs:
check-latest: false
cache: pip
architecture: x64
- run: pip install pyyaml==6.0 rockset==1.0.3
- run: pip install pyyaml==6.0
- name: Setup committer id
run: |
@ -43,7 +43,6 @@ jobs:
COMMENT_ID: ${{ github.event.client_payload.comment_id }}
REBASE: ${{ github.event.client_payload.rebase }}
IGNORE_CURRENT: ${{ github.event.client_payload.ignore_current }}
ROCKSET_API_KEY: ${{ secrets.ROCKSET_API_KEY }}
DRCI_BOT_KEY: ${{ secrets.DRCI_BOT_KEY }}
GITHUB_RUN_ID: ${{ github.run_id }}
run: |

@ -153,7 +153,7 @@ init_command = [
'junitparser==2.1.1',
'rich==10.9.0',
'pyyaml==6.0.1',
'optree==0.12.1',
'optree==0.13.0',
]
[[linter]]
@ -216,6 +216,10 @@ include_patterns = [
'torch/csrc/*.cpp',
'torch/csrc/**/*.h',
'torch/csrc/**/*.cpp',
'torch/csrc/distributed/autograd/**/*.cpp',
'torch/csrc/distributed/autograd/**/*.h',
'torch/csrc/distributed/rpc/**/*.cpp',
'torch/csrc/distributed/rpc/**/*.h',
'torch/csrc/jit/serialization/*.h',
'torch/csrc/jit/serialization/*.cpp',
]

@ -1083,8 +1083,16 @@ if(NOT MSVC)
append_cxx_flag_if_supported("-Wno-unused-but-set-variable" CMAKE_CXX_FLAGS)
append_cxx_flag_if_supported("-Wno-maybe-uninitialized" CMAKE_CXX_FLAGS)
append_cxx_flag_if_supported("-fstandalone-debug" CMAKE_CXX_FLAGS_DEBUG)
string(APPEND CMAKE_CXX_FLAGS_DEBUG " -fno-omit-frame-pointer -O0")
string(APPEND CMAKE_LINKER_FLAGS_DEBUG " -fno-omit-frame-pointer -O0")
if(CMAKE_SYSTEM_PROCESSOR MATCHES "aarch64" AND CMAKE_CXX_COMPILER_ID MATCHES "GNU")
if(CMAKE_BUILD_TYPE MATCHES Debug)
message(WARNING "Applying -Og optimization for aarch64 GCC debug build to workaround ICE")
endif()
string(APPEND CMAKE_CXX_FLAGS_DEBUG " -fno-omit-frame-pointer -Og")
string(APPEND CMAKE_LINKER_FLAGS_DEBUG " -fno-omit-frame-pointer -Og")
else()
string(APPEND CMAKE_CXX_FLAGS_DEBUG " -fno-omit-frame-pointer -O0")
string(APPEND CMAKE_LINKER_FLAGS_DEBUG " -fno-omit-frame-pointer -O0")
endif()
append_cxx_flag_if_supported("-fno-math-errno" CMAKE_CXX_FLAGS)
append_cxx_flag_if_supported("-fno-trapping-math" CMAKE_CXX_FLAGS)
append_cxx_flag_if_supported("-Werror=format" CMAKE_CXX_FLAGS)

@ -121,7 +121,7 @@ torch/profiler/ @aaronenyeshi @sraikund16
test/functorch/test_aotdispatch.py @ezyang @Chillee
# Dataloader
torch/utils/data/ @andrewkho @gokulavasan
torch/utils/data/ @andrewkho @divyanshk
# hipify
torch/utils/hipify/ @jeffdaily @jithunnair-amd

@ -39,25 +39,16 @@ class TORCH_API Context {
const Generator& defaultGenerator(Device device) {
c10::DeviceType device_type = device.type();
initCUDAIfNeeded(device_type);
initHIPIfNeeded(device_type);
lazyInitDevice(device_type);
if (device_type == at::kCPU) {
return at::detail::getDefaultCPUGenerator();
} else if (device_type == at::kCUDA) {
return at::detail::getCUDAHooks().getDefaultCUDAGenerator(device.index());
} else if (device_type == at::kMPS) {
return at::detail::getMPSHooks().getDefaultMPSGenerator();
} else if (device_type == at::kXPU) {
return at::detail::getXPUHooks().getDefaultXPUGenerator(device.index());
} else if (device_type == at::kIPU) {
return at::detail::getIPUHooks().getDefaultIPUGenerator(device.index());
} else if (device_type == at::kPrivateUse1) {
return at::detail::getPrivateUse1Hooks().getDefaultGenerator(
device.index());
} else {
AT_ERROR(c10::DeviceTypeName(device_type), " device type not enabled.");
return getAcceleratorHooksInterface(device_type)
.getDefaultGenerator(device.index());
}
}
const AcceleratorHooksInterface& getAcceleratorHooksInterface(
std::optional<c10::DeviceType> opt_device_type = std::nullopt) {
c10::DeviceType device_type = opt_device_type.has_value()
@ -80,10 +71,10 @@ class TORCH_API Context {
c10::DeviceTypeName(device_type), " device type not an accelerator.");
}
}
Device getDeviceFromPtr(void* data, c10::DeviceType device_type) {
initCUDAIfNeeded(device_type);
initHIPIfNeeded(device_type);
initXPUIfNeeded(device_type);
lazyInitDevice(device_type);
if (device_type == at::kCPU) {
return c10::DeviceType::CPU;
} else if (device_type == at::kCUDA) {
@ -96,6 +87,7 @@ class TORCH_API Context {
AT_ERROR(c10::DeviceTypeName(device_type), " device type not enabled.");
}
}
bool isPinnedPtr(
const void* data,
std::optional<c10::DeviceType> device_type = std::nullopt) {
@ -106,13 +98,22 @@ class TORCH_API Context {
opt_device_type.value())) { // passed device not an accelerator
return false;
}
return getAcceleratorHooksInterface(opt_device_type.value())
.isPinnedPtr(data);
return getAcceleratorHooksInterface(opt_device_type).isPinnedPtr(data);
}
Allocator* getPinnedMemoryAllocator(
std::optional<c10::DeviceType> device_type = std::nullopt) {
return getAcceleratorHooksInterface(device_type).getPinnedMemoryAllocator();
}
void lazyInitDevice(c10::DeviceType device_type) {
if (device_type != at::kCPU) {
c10::call_once(init_[static_cast<int8_t>(device_type)], [&] {
getAcceleratorHooksInterface(device_type).init();
});
}
}
static bool hasOpenMP();
static bool hasMKL();
static bool hasLAPACK();
@ -165,27 +166,6 @@ class TORCH_API Context {
static bool hasMAIA() {
return c10::impl::hasDeviceGuardImpl(c10::DeviceType::MAIA);
}
// defined in header so that getNonVariableType has ability to inline
// call_once check. getNonVariableType is called fairly frequently
void lazyInitCUDA() {
c10::call_once(thc_init, [&] { detail::getCUDAHooks().initCUDA(); });
}
void lazyInitHIP() {
c10::call_once(thh_init, [&] { detail::getHIPHooks().initHIP(); });
}
void lazyInitXPU() {
c10::call_once(thx_init, [&] { detail::getXPUHooks().initXPU(); });
}
void lazyInitMTIA() {
c10::call_once(th_mtia_init, [&] { detail::getMTIAHooks().initMTIA(); });
}
void lazyInitPrivateUse1() {
c10::call_once(thp_init, [&] {
if (isPrivateUse1HooksRegistered()) {
at::detail::getPrivateUse1Hooks().initPrivateUse1();
}
});
}
static const at::cuda::NVRTC& getNVRTC() {
return detail::getCUDAHooks().nvrtc();
}
@ -361,27 +341,8 @@ class TORCH_API Context {
void setAllowFP16ReductionCPU(bool);
private:
void initCUDAIfNeeded(c10::DeviceType p) {
if (p == c10::DeviceType::CUDA) {
lazyInitCUDA();
}
}
void initHIPIfNeeded(c10::DeviceType p) {
if (p == c10::DeviceType::HIP) {
lazyInitHIP();
}
}
void initXPUIfNeeded(c10::DeviceType p) {
if (p == c10::DeviceType::XPU) {
lazyInitXPU();
}
}
static bool checkCuBLASConfigDeterministic();
c10::once_flag thc_init;
c10::once_flag thh_init;
c10::once_flag thx_init;
c10::once_flag th_mtia_init;
c10::once_flag thp_init;
std::array<c10::once_flag, at::COMPILE_TIME_MAX_DEVICE_TYPES> init_;
bool enabled_cudnn = true;
bool deterministic_cudnn = false;
bool deterministic_mkldnn = false;
@ -513,7 +474,7 @@ inline size_t getNumGPUs() {
"to be CUDA (e.g., when you say CUDA, on a HIP build of ATen, this actually "
"means HIP. Rebuild PyTorch with one or the other disabled.");
} else if (hasCUDA()) {
return detail::getCUDAHooks().getNumGPUs();
return detail::getCUDAHooks().deviceCount();
} else if (hasHIP()) {
return detail::getHIPHooks().getNumGPUs();
} else {
@ -550,7 +511,7 @@ inline void manual_seed(uint64_t seed) {
}
// NB: Sometimes we build with CUDA, but we don't have any GPUs
// available. In that case, we must not seed CUDA; it will fail!
const auto cuda_num_gpus = detail::getCUDAHooks().getNumGPUs();
const auto cuda_num_gpus = detail::getCUDAHooks().deviceCount();
if (hasCUDA() && cuda_num_gpus > 0) {
for (const auto i : c10::irange(cuda_num_gpus)) {
auto cuda_gen = globalContext().defaultGenerator(
@ -563,7 +524,7 @@ inline void manual_seed(uint64_t seed) {
}
}
const auto xpu_num_gpus = detail::getXPUHooks().getNumGPUs();
const auto xpu_num_gpus = detail::getXPUHooks().deviceCount();
if (hasXPU() && xpu_num_gpus) {
for (const auto i : c10::irange(xpu_num_gpus)) {
auto xpu_gen = globalContext().defaultGenerator(
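
Note on the `Context` changes above: the separate `lazyInitCUDA`, `lazyInitHIP`, `lazyInitXPU`, `lazyInitMTIA` and `lazyInitPrivateUse1` functions, each with its own `once_flag`, are folded into a single `lazyInitDevice` that indexes an array of flags by device type. The pattern can be sketched in Python as follows. This is an illustration only: `OnceFlag`, the reduced `DeviceType` enum and the `init_hooks` mapping are hypothetical stand-ins, not PyTorch APIs.

```python
import threading
from enum import IntEnum


class DeviceType(IntEnum):
    # Illustrative subset; the real c10::DeviceType enum has more entries.
    CPU = 0
    CUDA = 1
    XPU = 2


class OnceFlag:
    """Rough stand-in for c10::once_flag combined with c10::call_once."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._done = False

    def call_once(self, fn) -> None:
        with self._lock:
            if not self._done:
                fn()
                # Only mark as done if fn returned without raising.
                self._done = True


class Context:
    def __init__(self, init_hooks) -> None:
        # One flag per device type, replacing the separate
        # thc_init, thh_init, thx_init, ... members.
        self._init = {device: OnceFlag() for device in DeviceType}
        # Hypothetical mapping: DeviceType -> initialization callable.
        self._init_hooks = init_hooks

    def lazy_init_device(self, device: DeviceType) -> None:
        # CPU needs no lazy initialization, mirroring the
        # `device_type != at::kCPU` check in the diff.
        if device != DeviceType.CPU:
            self._init[device].call_once(self._init_hooks[device])


calls = []
ctx = Context({
    DeviceType.CUDA: lambda: calls.append("cuda"),
    DeviceType.XPU: lambda: calls.append("xpu"),
})
ctx.lazy_init_device(DeviceType.CUDA)
ctx.lazy_init_device(DeviceType.CUDA)  # second call is a no-op
ctx.lazy_init_device(DeviceType.CPU)   # skipped entirely
print(calls)  # ['cuda']
```

Keying the flags by device type means a new accelerator needs no new member or function in `Context`, only an entry in the hooks interface.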

@ -22,13 +22,6 @@ DLDataType getDLDataType(const Tensor& t) {
case ScalarType::UInt64:
dtype.code = DLDataTypeCode::kDLUInt;
break;
case ScalarType::Int1:
case ScalarType::Int2:
case ScalarType::Int3:
case ScalarType::Int4:
case ScalarType::Int5:
case ScalarType::Int6:
case ScalarType::Int7:
case ScalarType::Char:
dtype.code = DLDataTypeCode::kDLInt;
break;

@ -112,12 +112,12 @@
// Ensure we never have too many scalar types for the expansion here to
// support. To bump this, you must regenerate the macros below.
static_assert(static_cast<int>(c10::ScalarType::NumOptions) < 60);
static_assert(static_cast<int>(c10::ScalarType::NumOptions) < 45);
// Python code to regenerate the generated code below:
#if 0
num_args = 60
num_args = 45
nums = ', '.join(str(i) for i in reversed(range(num_args+1)))
args = ', '.join(f'_{i}' for i in range(1, num_args+1))
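
Note on the regeneration script: only its first lines are visible in this hunk. The sketch below is a reconstruction that reproduces the format of the generated `AT_NUM_ARGS`, `AT_NUM_ARGS_AUX` and `AT_AP<n>` lines in this diff. The function name `generate_macros` and the loop body are assumptions, not the script as committed.

```python
def generate_macros(num_args: int) -> list[str]:
    """Return the macro definitions for dispatching over up to num_args types."""
    nums = ', '.join(str(i) for i in reversed(range(num_args + 1)))
    args = ', '.join(f'_{i}' for i in range(1, num_args + 1))
    lines = [
        # Counting trick: the descending number list shifts so that N
        # lands on the number of arguments actually passed.
        f'#define AT_NUM_ARGS(...) AT_EXPAND(AT_NUM_ARGS_AUX(__VA_ARGS__, {nums}))',
        f'#define AT_NUM_ARGS_AUX({args}, N, ...) N',
    ]
    for i in range(1, num_args + 1):
        params = ', '.join(f'_{j}' for j in range(1, i + 1))
        cases = ' '.join(f'AT_DISPATCH_CASE(_{j}, N)' for j in range(1, i + 1))
        lines.append(f'#define AT_AP{i}(N, {params}) {cases}')
    return lines


# A small num_args keeps the output readable; the diff uses 45.
for line in generate_macros(2):
    print(line)
```

With `num_args = 45` the same function yields `AT_AP1` through `AT_AP45`, which is why lowering the limit from 60 removes the `AT_AP46` and later definitions.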
@ -135,8 +135,8 @@ for i in range(1, num_args+1):
// Begin generated code
// clang-format off
#define AT_NUM_ARGS(...) AT_EXPAND(AT_NUM_ARGS_AUX(__VA_ARGS__, 60, 59, 58, 57, 56, 55, 54, 53, 52, 51, 50, 49, 48, 47, 46, 45, 44, 43, 42, 41, 40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0))
#define AT_NUM_ARGS_AUX(_1, _2, _3, _4, _5, _6, _7, _8, _9, _10, _11, _12, _13, _14, _15, _16, _17, _18, _19, _20, _21, _22, _23, _24, _25, _26, _27, _28, _29, _30, _31, _32, _33, _34, _35, _36, _37, _38, _39, _40, _41, _42, _43, _44, _45, _46, _47, _48, _49, _50, _51, _52, _53, _54, _55, _56, _57, _58, _59, _60, N, ...) N
#define AT_NUM_ARGS(...) AT_EXPAND(AT_NUM_ARGS_AUX(__VA_ARGS__, 45, 44, 43, 42, 41, 40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0))
#define AT_NUM_ARGS_AUX(_1, _2, _3, _4, _5, _6, _7, _8, _9, _10, _11, _12, _13, _14, _15, _16, _17, _18, _19, _20, _21, _22, _23, _24, _25, _26, _27, _28, _29, _30, _31, _32, _33, _34, _35, _36, _37, _38, _39, _40, _41, _42, _43, _44, _45, N, ...) N
#define AT_AP1(N, _1) AT_DISPATCH_CASE(_1, N)
#define AT_AP2(N, _1, _2) AT_DISPATCH_CASE(_1, N) AT_DISPATCH_CASE(_2, N)
#define AT_AP3(N, _1, _2, _3) AT_DISPATCH_CASE(_1, N) AT_DISPATCH_CASE(_2, N) AT_DISPATCH_CASE(_3, N)
@ -182,21 +182,5 @@ for i in range(1, num_args+1):
#define AT_AP43(N, _1, _2, _3, _4, _5, _6, _7, _8, _9, _10, _11, _12, _13, _14, _15, _16, _17, _18, _19, _20, _21, _22, _23, _24, _25, _26, _27, _28, _29, _30, _31, _32, _33, _34, _35, _36, _37, _38, _39, _40, _41, _42, _43) AT_DISPATCH_CASE(_1, N) AT_DISPATCH_CASE(_2, N) AT_DISPATCH_CASE(_3, N) AT_DISPATCH_CASE(_4, N) AT_DISPATCH_CASE(_5, N) AT_DISPATCH_CASE(_6, N) AT_DISPATCH_CASE(_7, N) AT_DISPATCH_CASE(_8, N) AT_DISPATCH_CASE(_9, N) AT_DISPATCH_CASE(_10, N) AT_DISPATCH_CASE(_11, N) AT_DISPATCH_CASE(_12, N) AT_DISPATCH_CASE(_13, N) AT_DISPATCH_CASE(_14, N) AT_DISPATCH_CASE(_15, N) AT_DISPATCH_CASE(_16, N) AT_DISPATCH_CASE(_17, N) AT_DISPATCH_CASE(_18, N) AT_DISPATCH_CASE(_19, N) AT_DISPATCH_CASE(_20, N) AT_DISPATCH_CASE(_21, N) AT_DISPATCH_CASE(_22, N) AT_DISPATCH_CASE(_23, N) AT_DISPATCH_CASE(_24, N) AT_DISPATCH_CASE(_25, N) AT_DISPATCH_CASE(_26, N) AT_DISPATCH_CASE(_27, N) AT_DISPATCH_CASE(_28, N) AT_DISPATCH_CASE(_29, N) AT_DISPATCH_CASE(_30, N) AT_DISPATCH_CASE(_31, N) AT_DISPATCH_CASE(_32, N) AT_DISPATCH_CASE(_33, N) AT_DISPATCH_CASE(_34, N) AT_DISPATCH_CASE(_35, N) AT_DISPATCH_CASE(_36, N) AT_DISPATCH_CASE(_37, N) AT_DISPATCH_CASE(_38, N) AT_DISPATCH_CASE(_39, N) AT_DISPATCH_CASE(_40, N) AT_DISPATCH_CASE(_41, N) AT_DISPATCH_CASE(_42, N) AT_DISPATCH_CASE(_43, N)
#define AT_AP44(N, _1, _2, _3, _4, _5, _6, _7, _8, _9, _10, _11, _12, _13, _14, _15, _16, _17, _18, _19, _20, _21, _22, _23, _24, _25, _26, _27, _28, _29, _30, _31, _32, _33, _34, _35, _36, _37, _38, _39, _40, _41, _42, _43, _44) AT_DISPATCH_CASE(_1, N) AT_DISPATCH_CASE(_2, N) AT_DISPATCH_CASE(_3, N) AT_DISPATCH_CASE(_4, N) AT_DISPATCH_CASE(_5, N) AT_DISPATCH_CASE(_6, N) AT_DISPATCH_CASE(_7, N) AT_DISPATCH_CASE(_8, N) AT_DISPATCH_CASE(_9, N) AT_DISPATCH_CASE(_10, N) AT_DISPATCH_CASE(_11, N) AT_DISPATCH_CASE(_12, N) AT_DISPATCH_CASE(_13, N) AT_DISPATCH_CASE(_14, N) AT_DISPATCH_CASE(_15, N) AT_DISPATCH_CASE(_16, N) AT_DISPATCH_CASE(_17, N) AT_DISPATCH_CASE(_18, N) AT_DISPATCH_CASE(_19, N) AT_DISPATCH_CASE(_20, N) AT_DISPATCH_CASE(_21, N) AT_DISPATCH_CASE(_22, N) AT_DISPATCH_CASE(_23, N) AT_DISPATCH_CASE(_24, N) AT_DISPATCH_CASE(_25, N) AT_DISPATCH_CASE(_26, N) AT_DISPATCH_CASE(_27, N) AT_DISPATCH_CASE(_28, N) AT_DISPATCH_CASE(_29, N) AT_DISPATCH_CASE(_30, N) AT_DISPATCH_CASE(_31, N) AT_DISPATCH_CASE(_32, N) AT_DISPATCH_CASE(_33, N) AT_DISPATCH_CASE(_34, N) AT_DISPATCH_CASE(_35, N) AT_DISPATCH_CASE(_36, N) AT_DISPATCH_CASE(_37, N) AT_DISPATCH_CASE(_38, N) AT_DISPATCH_CASE(_39, N) AT_DISPATCH_CASE(_40, N) AT_DISPATCH_CASE(_41, N) AT_DISPATCH_CASE(_42, N) AT_DISPATCH_CASE(_43, N) AT_DISPATCH_CASE(_44, N)
#define AT_AP45(N, _1, _2, _3, _4, _5, _6, _7, _8, _9, _10, _11, _12, _13, _14, _15, _16, _17, _18, _19, _20, _21, _22, _23, _24, _25, _26, _27, _28, _29, _30, _31, _32, _33, _34, _35, _36, _37, _38, _39, _40, _41, _42, _43, _44, _45) AT_DISPATCH_CASE(_1, N) AT_DISPATCH_CASE(_2, N) AT_DISPATCH_CASE(_3, N) AT_DISPATCH_CASE(_4, N) AT_DISPATCH_CASE(_5, N) AT_DISPATCH_CASE(_6, N) AT_DISPATCH_CASE(_7, N) AT_DISPATCH_CASE(_8, N) AT_DISPATCH_CASE(_9, N) AT_DISPATCH_CASE(_10, N) AT_DISPATCH_CASE(_11, N) AT_DISPATCH_CASE(_12, N) AT_DISPATCH_CASE(_13, N) AT_DISPATCH_CASE(_14, N) AT_DISPATCH_CASE(_15, N) AT_DISPATCH_CASE(_16, N) AT_DISPATCH_CASE(_17, N) AT_DISPATCH_CASE(_18, N) AT_DISPATCH_CASE(_19, N) AT_DISPATCH_CASE(_20, N) AT_DISPATCH_CASE(_21, N) AT_DISPATCH_CASE(_22, N) AT_DISPATCH_CASE(_23, N) AT_DISPATCH_CASE(_24, N) AT_DISPATCH_CASE(_25, N) AT_DISPATCH_CASE(_26, N) AT_DISPATCH_CASE(_27, N) AT_DISPATCH_CASE(_28, N) AT_DISPATCH_CASE(_29, N) AT_DISPATCH_CASE(_30, N) AT_DISPATCH_CASE(_31, N) AT_DISPATCH_CASE(_32, N) AT_DISPATCH_CASE(_33, N) AT_DISPATCH_CASE(_34, N) AT_DISPATCH_CASE(_35, N) AT_DISPATCH_CASE(_36, N) AT_DISPATCH_CASE(_37, N) AT_DISPATCH_CASE(_38, N) AT_DISPATCH_CASE(_39, N) AT_DISPATCH_CASE(_40, N) AT_DISPATCH_CASE(_41, N) AT_DISPATCH_CASE(_42, N) AT_DISPATCH_CASE(_43, N) AT_DISPATCH_CASE(_44, N) AT_DISPATCH_CASE(_45, N)
#define AT_AP46(N, _1, _2, _3, _4, _5, _6, _7, _8, _9, _10, _11, _12, _13, _14, _15, _16, _17, _18, _19, _20, _21, _22, _23, _24, _25, _26, _27, _28, _29, _30, _31, _32, _33, _34, _35, _36, _37, _38, _39, _40, _41, _42, _43, _44, _45, _46) AT_DISPATCH_CASE(_1, N) AT_DISPATCH_CASE(_2, N) AT_DISPATCH_CASE(_3, N) AT_DISPATCH_CASE(_4, N) AT_DISPATCH_CASE(_5, N) AT_DISPATCH_CASE(_6, N) AT_DISPATCH_CASE(_7, N) AT_DISPATCH_CASE(_8, N) AT_DISPATCH_CASE(_9, N) AT_DISPATCH_CASE(_10, N) AT_DISPATCH_CASE(_11, N) AT_DISPATCH_CASE(_12, N) AT_DISPATCH_CASE(_13, N) AT_DISPATCH_CASE(_14, N) AT_DISPATCH_CASE(_15, N) AT_DISPATCH_CASE(_16, N) AT_DISPATCH_CASE(_17, N) AT_DISPATCH_CASE(_18, N) AT_DISPATCH_CASE(_19, N) AT_DISPATCH_CASE(_20, N) AT_DISPATCH_CASE(_21, N) AT_DISPATCH_CASE(_22, N) AT_DISPATCH_CASE(_23, N) AT_DISPATCH_CASE(_24, N) AT_DISPATCH_CASE(_25, N) AT_DISPATCH_CASE(_26, N) AT_DISPATCH_CASE(_27, N) AT_DISPATCH_CASE(_28, N) AT_DISPATCH_CASE(_29, N) AT_DISPATCH_CASE(_30, N) AT_DISPATCH_CASE(_31, N) AT_DISPATCH_CASE(_32, N) AT_DISPATCH_CASE(_33, N) AT_DISPATCH_CASE(_34, N) AT_DISPATCH_CASE(_35, N) AT_DISPATCH_CASE(_36, N) AT_DISPATCH_CASE(_37, N) AT_DISPATCH_CASE(_38, N) AT_DISPATCH_CASE(_39, N) AT_DISPATCH_CASE(_40, N) AT_DISPATCH_CASE(_41, N) AT_DISPATCH_CASE(_42, N) AT_DISPATCH_CASE(_43, N) AT_DISPATCH_CASE(_44, N) AT_DISPATCH_CASE(_45, N) AT_DISPATCH_CASE(_46, N)
#define AT_AP47(N, _1, _2, _3, _4, _5, _6, _7, _8, _9, _10, _11, _12, _13, _14, _15, _16, _17, _18, _19, _20, _21, _22, _23, _24, _25, _26, _27, _28, _29, _30, _31, _32, _33, _34, _35, _36, _37, _38, _39, _40, _41, _42, _43, _44, _45, _46, _47) AT_DISPATCH_CASE(_1, N) AT_DISPATCH_CASE(_2, N) AT_DISPATCH_CASE(_3, N) AT_DISPATCH_CASE(_4, N) AT_DISPATCH_CASE(_5, N) AT_DISPATCH_CASE(_6, N) AT_DISPATCH_CASE(_7, N) AT_DISPATCH_CASE(_8, N) AT_DISPATCH_CASE(_9, N) AT_DISPATCH_CASE(_10, N) AT_DISPATCH_CASE(_11, N) AT_DISPATCH_CASE(_12, N) AT_DISPATCH_CASE(_13, N) AT_DISPATCH_CASE(_14, N) AT_DISPATCH_CASE(_15, N) AT_DISPATCH_CASE(_16, N) AT_DISPATCH_CASE(_17, N) AT_DISPATCH_CASE(_18, N) AT_DISPATCH_CASE(_19, N) AT_DISPATCH_CASE(_20, N) AT_DISPATCH_CASE(_21, N) AT_DISPATCH_CASE(_22, N) AT_DISPATCH_CASE(_23, N) AT_DISPATCH_CASE(_24, N) AT_DISPATCH_CASE(_25, N) AT_DISPATCH_CASE(_26, N) AT_DISPATCH_CASE(_27, N) AT_DISPATCH_CASE(_28, N) AT_DISPATCH_CASE(_29, N) AT_DISPATCH_CASE(_30, N) AT_DISPATCH_CASE(_31, N) AT_DISPATCH_CASE(_32, N) AT_DISPATCH_CASE(_33, N) AT_DISPATCH_CASE(_34, N) AT_DISPATCH_CASE(_35, N) AT_DISPATCH_CASE(_36, N) AT_DISPATCH_CASE(_37, N) AT_DISPATCH_CASE(_38, N) AT_DISPATCH_CASE(_39, N) AT_DISPATCH_CASE(_40, N) AT_DISPATCH_CASE(_41, N) AT_DISPATCH_CASE(_42, N) AT_DISPATCH_CASE(_43, N) AT_DISPATCH_CASE(_44, N) AT_DISPATCH_CASE(_45, N) AT_DISPATCH_CASE(_46, N) AT_DISPATCH_CASE(_47, N)
#define AT_AP48(N, _1, _2, _3, _4, _5, _6, _7, _8, _9, _10, _11, _12, _13, _14, _15, _16, _17, _18, _19, _20, _21, _22, _23, _24, _25, _26, _27, _28, _29, _30, _31, _32, _33, _34, _35, _36, _37, _38, _39, _40, _41, _42, _43, _44, _45, _46, _47, _48) AT_DISPATCH_CASE(_1, N) AT_DISPATCH_CASE(_2, N) AT_DISPATCH_CASE(_3, N) AT_DISPATCH_CASE(_4, N) AT_DISPATCH_CASE(_5, N) AT_DISPATCH_CASE(_6, N) AT_DISPATCH_CASE(_7, N) AT_DISPATCH_CASE(_8, N) AT_DISPATCH_CASE(_9, N) AT_DISPATCH_CASE(_10, N) AT_DISPATCH_CASE(_11, N) AT_DISPATCH_CASE(_12, N) AT_DISPATCH_CASE(_13, N) AT_DISPATCH_CASE(_14, N) AT_DISPATCH_CASE(_15, N) AT_DISPATCH_CASE(_16, N) AT_DISPATCH_CASE(_17, N) AT_DISPATCH_CASE(_18, N) AT_DISPATCH_CASE(_19, N) AT_DISPATCH_CASE(_20, N) AT_DISPATCH_CASE(_21, N) AT_DISPATCH_CASE(_22, N) AT_DISPATCH_CASE(_23, N) AT_DISPATCH_CASE(_24, N) AT_DISPATCH_CASE(_25, N) AT_DISPATCH_CASE(_26, N) AT_DISPATCH_CASE(_27, N) AT_DISPATCH_CASE(_28, N) AT_DISPATCH_CASE(_29, N) AT_DISPATCH_CASE(_30, N) AT_DISPATCH_CASE(_31, N) AT_DISPATCH_CASE(_32, N) AT_DISPATCH_CASE(_33, N) AT_DISPATCH_CASE(_34, N) AT_DISPATCH_CASE(_35, N) AT_DISPATCH_CASE(_36, N) AT_DISPATCH_CASE(_37, N) AT_DISPATCH_CASE(_38, N) AT_DISPATCH_CASE(_39, N) AT_DISPATCH_CASE(_40, N) AT_DISPATCH_CASE(_41, N) AT_DISPATCH_CASE(_42, N) AT_DISPATCH_CASE(_43, N) AT_DISPATCH_CASE(_44, N) AT_DISPATCH_CASE(_45, N) AT_DISPATCH_CASE(_46, N) AT_DISPATCH_CASE(_47, N) AT_DISPATCH_CASE(_48, N)
#define AT_AP49(N, _1, _2, _3, _4, _5, _6, _7, _8, _9, _10, _11, _12, _13, _14, _15, _16, _17, _18, _19, _20, _21, _22, _23, _24, _25, _26, _27, _28, _29, _30, _31, _32, _33, _34, _35, _36, _37, _38, _39, _40, _41, _42, _43, _44, _45, _46, _47, _48, _49) AT_DISPATCH_CASE(_1, N) AT_DISPATCH_CASE(_2, N) AT_DISPATCH_CASE(_3, N) AT_DISPATCH_CASE(_4, N) AT_DISPATCH_CASE(_5, N) AT_DISPATCH_CASE(_6, N) AT_DISPATCH_CASE(_7, N) AT_DISPATCH_CASE(_8, N) AT_DISPATCH_CASE(_9, N) AT_DISPATCH_CASE(_10, N) AT_DISPATCH_CASE(_11, N) AT_DISPATCH_CASE(_12, N) AT_DISPATCH_CASE(_13, N) AT_DISPATCH_CASE(_14, N) AT_DISPATCH_CASE(_15, N) AT_DISPATCH_CASE(_16, N) AT_DISPATCH_CASE(_17, N) AT_DISPATCH_CASE(_18, N) AT_DISPATCH_CASE(_19, N) AT_DISPATCH_CASE(_20, N) AT_DISPATCH_CASE(_21, N) AT_DISPATCH_CASE(_22, N) AT_DISPATCH_CASE(_23, N) AT_DISPATCH_CASE(_24, N) AT_DISPATCH_CASE(_25, N) AT_DISPATCH_CASE(_26, N) AT_DISPATCH_CASE(_27, N) AT_DISPATCH_CASE(_28, N) AT_DISPATCH_CASE(_29, N) AT_DISPATCH_CASE(_30, N) AT_DISPATCH_CASE(_31, N) AT_DISPATCH_CASE(_32, N) AT_DISPATCH_CASE(_33, N) AT_DISPATCH_CASE(_34, N) AT_DISPATCH_CASE(_35, N) AT_DISPATCH_CASE(_36, N) AT_DISPATCH_CASE(_37, N) AT_DISPATCH_CASE(_38, N) AT_DISPATCH_CASE(_39, N) AT_DISPATCH_CASE(_40, N) AT_DISPATCH_CASE(_41, N) AT_DISPATCH_CASE(_42, N) AT_DISPATCH_CASE(_43, N) AT_DISPATCH_CASE(_44, N) AT_DISPATCH_CASE(_45, N) AT_DISPATCH_CASE(_46, N) AT_DISPATCH_CASE(_47, N) AT_DISPATCH_CASE(_48, N) AT_DISPATCH_CASE(_49, N)
#define AT_AP50(N, _1, _2, _3, _4, _5, _6, _7, _8, _9, _10, _11, _12, _13, _14, _15, _16, _17, _18, _19, _20, _21, _22, _23, _24, _25, _26, _27, _28, _29, _30, _31, _32, _33, _34, _35, _36, _37, _38, _39, _40, _41, _42, _43, _44, _45, _46, _47, _48, _49, _50) AT_DISPATCH_CASE(_1, N) AT_DISPATCH_CASE(_2, N) AT_DISPATCH_CASE(_3, N) AT_DISPATCH_CASE(_4, N) AT_DISPATCH_CASE(_5, N) AT_DISPATCH_CASE(_6, N) AT_DISPATCH_CASE(_7, N) AT_DISPATCH_CASE(_8, N) AT_DISPATCH_CASE(_9, N) AT_DISPATCH_CASE(_10, N) AT_DISPATCH_CASE(_11, N) AT_DISPATCH_CASE(_12, N) AT_DISPATCH_CASE(_13, N) AT_DISPATCH_CASE(_14, N) AT_DISPATCH_CASE(_15, N) AT_DISPATCH_CASE(_16, N) AT_DISPATCH_CASE(_17, N) AT_DISPATCH_CASE(_18, N) AT_DISPATCH_CASE(_19, N) AT_DISPATCH_CASE(_20, N) AT_DISPATCH_CASE(_21, N) AT_DISPATCH_CASE(_22, N) AT_DISPATCH_CASE(_23, N) AT_DISPATCH_CASE(_24, N) AT_DISPATCH_CASE(_25, N) AT_DISPATCH_CASE(_26, N) AT_DISPATCH_CASE(_27, N) AT_DISPATCH_CASE(_28, N) AT_DISPATCH_CASE(_29, N) AT_DISPATCH_CASE(_30, N) AT_DISPATCH_CASE(_31, N) AT_DISPATCH_CASE(_32, N) AT_DISPATCH_CASE(_33, N) AT_DISPATCH_CASE(_34, N) AT_DISPATCH_CASE(_35, N) AT_DISPATCH_CASE(_36, N) AT_DISPATCH_CASE(_37, N) AT_DISPATCH_CASE(_38, N) AT_DISPATCH_CASE(_39, N) AT_DISPATCH_CASE(_40, N) AT_DISPATCH_CASE(_41, N) AT_DISPATCH_CASE(_42, N) AT_DISPATCH_CASE(_43, N) AT_DISPATCH_CASE(_44, N) AT_DISPATCH_CASE(_45, N) AT_DISPATCH_CASE(_46, N) AT_DISPATCH_CASE(_47, N) AT_DISPATCH_CASE(_48, N) AT_DISPATCH_CASE(_49, N) AT_DISPATCH_CASE(_50, N)
#define AT_AP51(N, _1, _2, _3, _4, _5, _6, _7, _8, _9, _10, _11, _12, _13, _14, _15, _16, _17, _18, _19, _20, _21, _22, _23, _24, _25, _26, _27, _28, _29, _30, _31, _32, _33, _34, _35, _36, _37, _38, _39, _40, _41, _42, _43, _44, _45, _46, _47, _48, _49, _50, _51) AT_DISPATCH_CASE(_1, N) AT_DISPATCH_CASE(_2, N) AT_DISPATCH_CASE(_3, N) AT_DISPATCH_CASE(_4, N) AT_DISPATCH_CASE(_5, N) AT_DISPATCH_CASE(_6, N) AT_DISPATCH_CASE(_7, N) AT_DISPATCH_CASE(_8, N) AT_DISPATCH_CASE(_9, N) AT_DISPATCH_CASE(_10, N) AT_DISPATCH_CASE(_11, N) AT_DISPATCH_CASE(_12, N) AT_DISPATCH_CASE(_13, N) AT_DISPATCH_CASE(_14, N) AT_DISPATCH_CASE(_15, N) AT_DISPATCH_CASE(_16, N) AT_DISPATCH_CASE(_17, N) AT_DISPATCH_CASE(_18, N) AT_DISPATCH_CASE(_19, N) AT_DISPATCH_CASE(_20, N) AT_DISPATCH_CASE(_21, N) AT_DISPATCH_CASE(_22, N) AT_DISPATCH_CASE(_23, N) AT_DISPATCH_CASE(_24, N) AT_DISPATCH_CASE(_25, N) AT_DISPATCH_CASE(_26, N) AT_DISPATCH_CASE(_27, N) AT_DISPATCH_CASE(_28, N) AT_DISPATCH_CASE(_29, N) AT_DISPATCH_CASE(_30, N) AT_DISPATCH_CASE(_31, N) AT_DISPATCH_CASE(_32, N) AT_DISPATCH_CASE(_33, N) AT_DISPATCH_CASE(_34, N) AT_DISPATCH_CASE(_35, N) AT_DISPATCH_CASE(_36, N) AT_DISPATCH_CASE(_37, N) AT_DISPATCH_CASE(_38, N) AT_DISPATCH_CASE(_39, N) AT_DISPATCH_CASE(_40, N) AT_DISPATCH_CASE(_41, N) AT_DISPATCH_CASE(_42, N) AT_DISPATCH_CASE(_43, N) AT_DISPATCH_CASE(_44, N) AT_DISPATCH_CASE(_45, N) AT_DISPATCH_CASE(_46, N) AT_DISPATCH_CASE(_47, N) AT_DISPATCH_CASE(_48, N) AT_DISPATCH_CASE(_49, N) AT_DISPATCH_CASE(_50, N) AT_DISPATCH_CASE(_51, N)
#define AT_AP52(N, _1, _2, _3, _4, _5, _6, _7, _8, _9, _10, _11, _12, _13, _14, _15, _16, _17, _18, _19, _20, _21, _22, _23, _24, _25, _26, _27, _28, _29, _30, _31, _32, _33, _34, _35, _36, _37, _38, _39, _40, _41, _42, _43, _44, _45, _46, _47, _48, _49, _50, _51, _52) AT_DISPATCH_CASE(_1, N) AT_DISPATCH_CASE(_2, N) AT_DISPATCH_CASE(_3, N) AT_DISPATCH_CASE(_4, N) AT_DISPATCH_CASE(_5, N) AT_DISPATCH_CASE(_6, N) AT_DISPATCH_CASE(_7, N) AT_DISPATCH_CASE(_8, N) AT_DISPATCH_CASE(_9, N) AT_DISPATCH_CASE(_10, N) AT_DISPATCH_CASE(_11, N) AT_DISPATCH_CASE(_12, N) AT_DISPATCH_CASE(_13, N) AT_DISPATCH_CASE(_14, N) AT_DISPATCH_CASE(_15, N) AT_DISPATCH_CASE(_16, N) AT_DISPATCH_CASE(_17, N) AT_DISPATCH_CASE(_18, N) AT_DISPATCH_CASE(_19, N) AT_DISPATCH_CASE(_20, N) AT_DISPATCH_CASE(_21, N) AT_DISPATCH_CASE(_22, N) AT_DISPATCH_CASE(_23, N) AT_DISPATCH_CASE(_24, N) AT_DISPATCH_CASE(_25, N) AT_DISPATCH_CASE(_26, N) AT_DISPATCH_CASE(_27, N) AT_DISPATCH_CASE(_28, N) AT_DISPATCH_CASE(_29, N) AT_DISPATCH_CASE(_30, N) AT_DISPATCH_CASE(_31, N) AT_DISPATCH_CASE(_32, N) AT_DISPATCH_CASE(_33, N) AT_DISPATCH_CASE(_34, N) AT_DISPATCH_CASE(_35, N) AT_DISPATCH_CASE(_36, N) AT_DISPATCH_CASE(_37, N) AT_DISPATCH_CASE(_38, N) AT_DISPATCH_CASE(_39, N) AT_DISPATCH_CASE(_40, N) AT_DISPATCH_CASE(_41, N) AT_DISPATCH_CASE(_42, N) AT_DISPATCH_CASE(_43, N) AT_DISPATCH_CASE(_44, N) AT_DISPATCH_CASE(_45, N) AT_DISPATCH_CASE(_46, N) AT_DISPATCH_CASE(_47, N) AT_DISPATCH_CASE(_48, N) AT_DISPATCH_CASE(_49, N) AT_DISPATCH_CASE(_50, N) AT_DISPATCH_CASE(_51, N) AT_DISPATCH_CASE(_52, N)
#define AT_AP53(N, _1, _2, _3, _4, _5, _6, _7, _8, _9, _10, _11, _12, _13, _14, _15, _16, _17, _18, _19, _20, _21, _22, _23, _24, _25, _26, _27, _28, _29, _30, _31, _32, _33, _34, _35, _36, _37, _38, _39, _40, _41, _42, _43, _44, _45, _46, _47, _48, _49, _50, _51, _52, _53) AT_DISPATCH_CASE(_1, N) AT_DISPATCH_CASE(_2, N) AT_DISPATCH_CASE(_3, N) AT_DISPATCH_CASE(_4, N) AT_DISPATCH_CASE(_5, N) AT_DISPATCH_CASE(_6, N) AT_DISPATCH_CASE(_7, N) AT_DISPATCH_CASE(_8, N) AT_DISPATCH_CASE(_9, N) AT_DISPATCH_CASE(_10, N) AT_DISPATCH_CASE(_11, N) AT_DISPATCH_CASE(_12, N) AT_DISPATCH_CASE(_13, N) AT_DISPATCH_CASE(_14, N) AT_DISPATCH_CASE(_15, N) AT_DISPATCH_CASE(_16, N) AT_DISPATCH_CASE(_17, N) AT_DISPATCH_CASE(_18, N) AT_DISPATCH_CASE(_19, N) AT_DISPATCH_CASE(_20, N) AT_DISPATCH_CASE(_21, N) AT_DISPATCH_CASE(_22, N) AT_DISPATCH_CASE(_23, N) AT_DISPATCH_CASE(_24, N) AT_DISPATCH_CASE(_25, N) AT_DISPATCH_CASE(_26, N) AT_DISPATCH_CASE(_27, N) AT_DISPATCH_CASE(_28, N) AT_DISPATCH_CASE(_29, N) AT_DISPATCH_CASE(_30, N) AT_DISPATCH_CASE(_31, N) AT_DISPATCH_CASE(_32, N) AT_DISPATCH_CASE(_33, N) AT_DISPATCH_CASE(_34, N) AT_DISPATCH_CASE(_35, N) AT_DISPATCH_CASE(_36, N) AT_DISPATCH_CASE(_37, N) AT_DISPATCH_CASE(_38, N) AT_DISPATCH_CASE(_39, N) AT_DISPATCH_CASE(_40, N) AT_DISPATCH_CASE(_41, N) AT_DISPATCH_CASE(_42, N) AT_DISPATCH_CASE(_43, N) AT_DISPATCH_CASE(_44, N) AT_DISPATCH_CASE(_45, N) AT_DISPATCH_CASE(_46, N) AT_DISPATCH_CASE(_47, N) AT_DISPATCH_CASE(_48, N) AT_DISPATCH_CASE(_49, N) AT_DISPATCH_CASE(_50, N) AT_DISPATCH_CASE(_51, N) AT_DISPATCH_CASE(_52, N) AT_DISPATCH_CASE(_53, N)
#define AT_AP54(N, _1, _2, _3, _4, _5, _6, _7, _8, _9, _10, _11, _12, _13, _14, _15, _16, _17, _18, _19, _20, _21, _22, _23, _24, _25, _26, _27, _28, _29, _30, _31, _32, _33, _34, _35, _36, _37, _38, _39, _40, _41, _42, _43, _44, _45, _46, _47, _48, _49, _50, _51, _52, _53, _54) AT_DISPATCH_CASE(_1, N) AT_DISPATCH_CASE(_2, N) AT_DISPATCH_CASE(_3, N) AT_DISPATCH_CASE(_4, N) AT_DISPATCH_CASE(_5, N) AT_DISPATCH_CASE(_6, N) AT_DISPATCH_CASE(_7, N) AT_DISPATCH_CASE(_8, N) AT_DISPATCH_CASE(_9, N) AT_DISPATCH_CASE(_10, N) AT_DISPATCH_CASE(_11, N) AT_DISPATCH_CASE(_12, N) AT_DISPATCH_CASE(_13, N) AT_DISPATCH_CASE(_14, N) AT_DISPATCH_CASE(_15, N) AT_DISPATCH_CASE(_16, N) AT_DISPATCH_CASE(_17, N) AT_DISPATCH_CASE(_18, N) AT_DISPATCH_CASE(_19, N) AT_DISPATCH_CASE(_20, N) AT_DISPATCH_CASE(_21, N) AT_DISPATCH_CASE(_22, N) AT_DISPATCH_CASE(_23, N) AT_DISPATCH_CASE(_24, N) AT_DISPATCH_CASE(_25, N) AT_DISPATCH_CASE(_26, N) AT_DISPATCH_CASE(_27, N) AT_DISPATCH_CASE(_28, N) AT_DISPATCH_CASE(_29, N) AT_DISPATCH_CASE(_30, N) AT_DISPATCH_CASE(_31, N) AT_DISPATCH_CASE(_32, N) AT_DISPATCH_CASE(_33, N) AT_DISPATCH_CASE(_34, N) AT_DISPATCH_CASE(_35, N) AT_DISPATCH_CASE(_36, N) AT_DISPATCH_CASE(_37, N) AT_DISPATCH_CASE(_38, N) AT_DISPATCH_CASE(_39, N) AT_DISPATCH_CASE(_40, N) AT_DISPATCH_CASE(_41, N) AT_DISPATCH_CASE(_42, N) AT_DISPATCH_CASE(_43, N) AT_DISPATCH_CASE(_44, N) AT_DISPATCH_CASE(_45, N) AT_DISPATCH_CASE(_46, N) AT_DISPATCH_CASE(_47, N) AT_DISPATCH_CASE(_48, N) AT_DISPATCH_CASE(_49, N) AT_DISPATCH_CASE(_50, N) AT_DISPATCH_CASE(_51, N) AT_DISPATCH_CASE(_52, N) AT_DISPATCH_CASE(_53, N) AT_DISPATCH_CASE(_54, N)
#define AT_AP55(N, _1, _2, _3, _4, _5, _6, _7, _8, _9, _10, _11, _12, _13, _14, _15, _16, _17, _18, _19, _20, _21, _22, _23, _24, _25, _26, _27, _28, _29, _30, _31, _32, _33, _34, _35, _36, _37, _38, _39, _40, _41, _42, _43, _44, _45, _46, _47, _48, _49, _50, _51, _52, _53, _54, _55) AT_DISPATCH_CASE(_1, N) AT_DISPATCH_CASE(_2, N) AT_DISPATCH_CASE(_3, N) AT_DISPATCH_CASE(_4, N) AT_DISPATCH_CASE(_5, N) AT_DISPATCH_CASE(_6, N) AT_DISPATCH_CASE(_7, N) AT_DISPATCH_CASE(_8, N) AT_DISPATCH_CASE(_9, N) AT_DISPATCH_CASE(_10, N) AT_DISPATCH_CASE(_11, N) AT_DISPATCH_CASE(_12, N) AT_DISPATCH_CASE(_13, N) AT_DISPATCH_CASE(_14, N) AT_DISPATCH_CASE(_15, N) AT_DISPATCH_CASE(_16, N) AT_DISPATCH_CASE(_17, N) AT_DISPATCH_CASE(_18, N) AT_DISPATCH_CASE(_19, N) AT_DISPATCH_CASE(_20, N) AT_DISPATCH_CASE(_21, N) AT_DISPATCH_CASE(_22, N) AT_DISPATCH_CASE(_23, N) AT_DISPATCH_CASE(_24, N) AT_DISPATCH_CASE(_25, N) AT_DISPATCH_CASE(_26, N) AT_DISPATCH_CASE(_27, N) AT_DISPATCH_CASE(_28, N) AT_DISPATCH_CASE(_29, N) AT_DISPATCH_CASE(_30, N) AT_DISPATCH_CASE(_31, N) AT_DISPATCH_CASE(_32, N) AT_DISPATCH_CASE(_33, N) AT_DISPATCH_CASE(_34, N) AT_DISPATCH_CASE(_35, N) AT_DISPATCH_CASE(_36, N) AT_DISPATCH_CASE(_37, N) AT_DISPATCH_CASE(_38, N) AT_DISPATCH_CASE(_39, N) AT_DISPATCH_CASE(_40, N) AT_DISPATCH_CASE(_41, N) AT_DISPATCH_CASE(_42, N) AT_DISPATCH_CASE(_43, N) AT_DISPATCH_CASE(_44, N) AT_DISPATCH_CASE(_45, N) AT_DISPATCH_CASE(_46, N) AT_DISPATCH_CASE(_47, N) AT_DISPATCH_CASE(_48, N) AT_DISPATCH_CASE(_49, N) AT_DISPATCH_CASE(_50, N) AT_DISPATCH_CASE(_51, N) AT_DISPATCH_CASE(_52, N) AT_DISPATCH_CASE(_53, N) AT_DISPATCH_CASE(_54, N) AT_DISPATCH_CASE(_55, N)
#define AT_AP56(N, _1, _2, _3, _4, _5, _6, _7, _8, _9, _10, _11, _12, _13, _14, _15, _16, _17, _18, _19, _20, _21, _22, _23, _24, _25, _26, _27, _28, _29, _30, _31, _32, _33, _34, _35, _36, _37, _38, _39, _40, _41, _42, _43, _44, _45, _46, _47, _48, _49, _50, _51, _52, _53, _54, _55, _56) AT_DISPATCH_CASE(_1, N) AT_DISPATCH_CASE(_2, N) AT_DISPATCH_CASE(_3, N) AT_DISPATCH_CASE(_4, N) AT_DISPATCH_CASE(_5, N) AT_DISPATCH_CASE(_6, N) AT_DISPATCH_CASE(_7, N) AT_DISPATCH_CASE(_8, N) AT_DISPATCH_CASE(_9, N) AT_DISPATCH_CASE(_10, N) AT_DISPATCH_CASE(_11, N) AT_DISPATCH_CASE(_12, N) AT_DISPATCH_CASE(_13, N) AT_DISPATCH_CASE(_14, N) AT_DISPATCH_CASE(_15, N) AT_DISPATCH_CASE(_16, N) AT_DISPATCH_CASE(_17, N) AT_DISPATCH_CASE(_18, N) AT_DISPATCH_CASE(_19, N) AT_DISPATCH_CASE(_20, N) AT_DISPATCH_CASE(_21, N) AT_DISPATCH_CASE(_22, N) AT_DISPATCH_CASE(_23, N) AT_DISPATCH_CASE(_24, N) AT_DISPATCH_CASE(_25, N) AT_DISPATCH_CASE(_26, N) AT_DISPATCH_CASE(_27, N) AT_DISPATCH_CASE(_28, N) AT_DISPATCH_CASE(_29, N) AT_DISPATCH_CASE(_30, N) AT_DISPATCH_CASE(_31, N) AT_DISPATCH_CASE(_32, N) AT_DISPATCH_CASE(_33, N) AT_DISPATCH_CASE(_34, N) AT_DISPATCH_CASE(_35, N) AT_DISPATCH_CASE(_36, N) AT_DISPATCH_CASE(_37, N) AT_DISPATCH_CASE(_38, N) AT_DISPATCH_CASE(_39, N) AT_DISPATCH_CASE(_40, N) AT_DISPATCH_CASE(_41, N) AT_DISPATCH_CASE(_42, N) AT_DISPATCH_CASE(_43, N) AT_DISPATCH_CASE(_44, N) AT_DISPATCH_CASE(_45, N) AT_DISPATCH_CASE(_46, N) AT_DISPATCH_CASE(_47, N) AT_DISPATCH_CASE(_48, N) AT_DISPATCH_CASE(_49, N) AT_DISPATCH_CASE(_50, N) AT_DISPATCH_CASE(_51, N) AT_DISPATCH_CASE(_52, N) AT_DISPATCH_CASE(_53, N) AT_DISPATCH_CASE(_54, N) AT_DISPATCH_CASE(_55, N) AT_DISPATCH_CASE(_56, N)
#define AT_AP57(N, _1, _2, _3, _4, _5, _6, _7, _8, _9, _10, _11, _12, _13, _14, _15, _16, _17, _18, _19, _20, _21, _22, _23, _24, _25, _26, _27, _28, _29, _30, _31, _32, _33, _34, _35, _36, _37, _38, _39, _40, _41, _42, _43, _44, _45, _46, _47, _48, _49, _50, _51, _52, _53, _54, _55, _56, _57) AT_DISPATCH_CASE(_1, N) AT_DISPATCH_CASE(_2, N) AT_DISPATCH_CASE(_3, N) AT_DISPATCH_CASE(_4, N) AT_DISPATCH_CASE(_5, N) AT_DISPATCH_CASE(_6, N) AT_DISPATCH_CASE(_7, N) AT_DISPATCH_CASE(_8, N) AT_DISPATCH_CASE(_9, N) AT_DISPATCH_CASE(_10, N) AT_DISPATCH_CASE(_11, N) AT_DISPATCH_CASE(_12, N) AT_DISPATCH_CASE(_13, N) AT_DISPATCH_CASE(_14, N) AT_DISPATCH_CASE(_15, N) AT_DISPATCH_CASE(_16, N) AT_DISPATCH_CASE(_17, N) AT_DISPATCH_CASE(_18, N) AT_DISPATCH_CASE(_19, N) AT_DISPATCH_CASE(_20, N) AT_DISPATCH_CASE(_21, N) AT_DISPATCH_CASE(_22, N) AT_DISPATCH_CASE(_23, N) AT_DISPATCH_CASE(_24, N) AT_DISPATCH_CASE(_25, N) AT_DISPATCH_CASE(_26, N) AT_DISPATCH_CASE(_27, N) AT_DISPATCH_CASE(_28, N) AT_DISPATCH_CASE(_29, N) AT_DISPATCH_CASE(_30, N) AT_DISPATCH_CASE(_31, N) AT_DISPATCH_CASE(_32, N) AT_DISPATCH_CASE(_33, N) AT_DISPATCH_CASE(_34, N) AT_DISPATCH_CASE(_35, N) AT_DISPATCH_CASE(_36, N) AT_DISPATCH_CASE(_37, N) AT_DISPATCH_CASE(_38, N) AT_DISPATCH_CASE(_39, N) AT_DISPATCH_CASE(_40, N) AT_DISPATCH_CASE(_41, N) AT_DISPATCH_CASE(_42, N) AT_DISPATCH_CASE(_43, N) AT_DISPATCH_CASE(_44, N) AT_DISPATCH_CASE(_45, N) AT_DISPATCH_CASE(_46, N) AT_DISPATCH_CASE(_47, N) AT_DISPATCH_CASE(_48, N) AT_DISPATCH_CASE(_49, N) AT_DISPATCH_CASE(_50, N) AT_DISPATCH_CASE(_51, N) AT_DISPATCH_CASE(_52, N) AT_DISPATCH_CASE(_53, N) AT_DISPATCH_CASE(_54, N) AT_DISPATCH_CASE(_55, N) AT_DISPATCH_CASE(_56, N) AT_DISPATCH_CASE(_57, N)
#define AT_AP58(N, _1, _2, _3, _4, _5, _6, _7, _8, _9, _10, _11, _12, _13, _14, _15, _16, _17, _18, _19, _20, _21, _22, _23, _24, _25, _26, _27, _28, _29, _30, _31, _32, _33, _34, _35, _36, _37, _38, _39, _40, _41, _42, _43, _44, _45, _46, _47, _48, _49, _50, _51, _52, _53, _54, _55, _56, _57, _58) AT_DISPATCH_CASE(_1, N) AT_DISPATCH_CASE(_2, N) AT_DISPATCH_CASE(_3, N) AT_DISPATCH_CASE(_4, N) AT_DISPATCH_CASE(_5, N) AT_DISPATCH_CASE(_6, N) AT_DISPATCH_CASE(_7, N) AT_DISPATCH_CASE(_8, N) AT_DISPATCH_CASE(_9, N) AT_DISPATCH_CASE(_10, N) AT_DISPATCH_CASE(_11, N) AT_DISPATCH_CASE(_12, N) AT_DISPATCH_CASE(_13, N) AT_DISPATCH_CASE(_14, N) AT_DISPATCH_CASE(_15, N) AT_DISPATCH_CASE(_16, N) AT_DISPATCH_CASE(_17, N) AT_DISPATCH_CASE(_18, N) AT_DISPATCH_CASE(_19, N) AT_DISPATCH_CASE(_20, N) AT_DISPATCH_CASE(_21, N) AT_DISPATCH_CASE(_22, N) AT_DISPATCH_CASE(_23, N) AT_DISPATCH_CASE(_24, N) AT_DISPATCH_CASE(_25, N) AT_DISPATCH_CASE(_26, N) AT_DISPATCH_CASE(_27, N) AT_DISPATCH_CASE(_28, N) AT_DISPATCH_CASE(_29, N) AT_DISPATCH_CASE(_30, N) AT_DISPATCH_CASE(_31, N) AT_DISPATCH_CASE(_32, N) AT_DISPATCH_CASE(_33, N) AT_DISPATCH_CASE(_34, N) AT_DISPATCH_CASE(_35, N) AT_DISPATCH_CASE(_36, N) AT_DISPATCH_CASE(_37, N) AT_DISPATCH_CASE(_38, N) AT_DISPATCH_CASE(_39, N) AT_DISPATCH_CASE(_40, N) AT_DISPATCH_CASE(_41, N) AT_DISPATCH_CASE(_42, N) AT_DISPATCH_CASE(_43, N) AT_DISPATCH_CASE(_44, N) AT_DISPATCH_CASE(_45, N) AT_DISPATCH_CASE(_46, N) AT_DISPATCH_CASE(_47, N) AT_DISPATCH_CASE(_48, N) AT_DISPATCH_CASE(_49, N) AT_DISPATCH_CASE(_50, N) AT_DISPATCH_CASE(_51, N) AT_DISPATCH_CASE(_52, N) AT_DISPATCH_CASE(_53, N) AT_DISPATCH_CASE(_54, N) AT_DISPATCH_CASE(_55, N) AT_DISPATCH_CASE(_56, N) AT_DISPATCH_CASE(_57, N) AT_DISPATCH_CASE(_58, N)
#define AT_AP59(N, _1, _2, _3, _4, _5, _6, _7, _8, _9, _10, _11, _12, _13, _14, _15, _16, _17, _18, _19, _20, _21, _22, _23, _24, _25, _26, _27, _28, _29, _30, _31, _32, _33, _34, _35, _36, _37, _38, _39, _40, _41, _42, _43, _44, _45, _46, _47, _48, _49, _50, _51, _52, _53, _54, _55, _56, _57, _58, _59) AT_DISPATCH_CASE(_1, N) AT_DISPATCH_CASE(_2, N) AT_DISPATCH_CASE(_3, N) AT_DISPATCH_CASE(_4, N) AT_DISPATCH_CASE(_5, N) AT_DISPATCH_CASE(_6, N) AT_DISPATCH_CASE(_7, N) AT_DISPATCH_CASE(_8, N) AT_DISPATCH_CASE(_9, N) AT_DISPATCH_CASE(_10, N) AT_DISPATCH_CASE(_11, N) AT_DISPATCH_CASE(_12, N) AT_DISPATCH_CASE(_13, N) AT_DISPATCH_CASE(_14, N) AT_DISPATCH_CASE(_15, N) AT_DISPATCH_CASE(_16, N) AT_DISPATCH_CASE(_17, N) AT_DISPATCH_CASE(_18, N) AT_DISPATCH_CASE(_19, N) AT_DISPATCH_CASE(_20, N) AT_DISPATCH_CASE(_21, N) AT_DISPATCH_CASE(_22, N) AT_DISPATCH_CASE(_23, N) AT_DISPATCH_CASE(_24, N) AT_DISPATCH_CASE(_25, N) AT_DISPATCH_CASE(_26, N) AT_DISPATCH_CASE(_27, N) AT_DISPATCH_CASE(_28, N) AT_DISPATCH_CASE(_29, N) AT_DISPATCH_CASE(_30, N) AT_DISPATCH_CASE(_31, N) AT_DISPATCH_CASE(_32, N) AT_DISPATCH_CASE(_33, N) AT_DISPATCH_CASE(_34, N) AT_DISPATCH_CASE(_35, N) AT_DISPATCH_CASE(_36, N) AT_DISPATCH_CASE(_37, N) AT_DISPATCH_CASE(_38, N) AT_DISPATCH_CASE(_39, N) AT_DISPATCH_CASE(_40, N) AT_DISPATCH_CASE(_41, N) AT_DISPATCH_CASE(_42, N) AT_DISPATCH_CASE(_43, N) AT_DISPATCH_CASE(_44, N) AT_DISPATCH_CASE(_45, N) AT_DISPATCH_CASE(_46, N) AT_DISPATCH_CASE(_47, N) AT_DISPATCH_CASE(_48, N) AT_DISPATCH_CASE(_49, N) AT_DISPATCH_CASE(_50, N) AT_DISPATCH_CASE(_51, N) AT_DISPATCH_CASE(_52, N) AT_DISPATCH_CASE(_53, N) AT_DISPATCH_CASE(_54, N) AT_DISPATCH_CASE(_55, N) AT_DISPATCH_CASE(_56, N) AT_DISPATCH_CASE(_57, N) AT_DISPATCH_CASE(_58, N) AT_DISPATCH_CASE(_59, N)
#define AT_AP60(N, _1, _2, _3, _4, _5, _6, _7, _8, _9, _10, _11, _12, _13, _14, _15, _16, _17, _18, _19, _20, _21, _22, _23, _24, _25, _26, _27, _28, _29, _30, _31, _32, _33, _34, _35, _36, _37, _38, _39, _40, _41, _42, _43, _44, _45, _46, _47, _48, _49, _50, _51, _52, _53, _54, _55, _56, _57, _58, _59, _60) AT_DISPATCH_CASE(_1, N) AT_DISPATCH_CASE(_2, N) AT_DISPATCH_CASE(_3, N) AT_DISPATCH_CASE(_4, N) AT_DISPATCH_CASE(_5, N) AT_DISPATCH_CASE(_6, N) AT_DISPATCH_CASE(_7, N) AT_DISPATCH_CASE(_8, N) AT_DISPATCH_CASE(_9, N) AT_DISPATCH_CASE(_10, N) AT_DISPATCH_CASE(_11, N) AT_DISPATCH_CASE(_12, N) AT_DISPATCH_CASE(_13, N) AT_DISPATCH_CASE(_14, N) AT_DISPATCH_CASE(_15, N) AT_DISPATCH_CASE(_16, N) AT_DISPATCH_CASE(_17, N) AT_DISPATCH_CASE(_18, N) AT_DISPATCH_CASE(_19, N) AT_DISPATCH_CASE(_20, N) AT_DISPATCH_CASE(_21, N) AT_DISPATCH_CASE(_22, N) AT_DISPATCH_CASE(_23, N) AT_DISPATCH_CASE(_24, N) AT_DISPATCH_CASE(_25, N) AT_DISPATCH_CASE(_26, N) AT_DISPATCH_CASE(_27, N) AT_DISPATCH_CASE(_28, N) AT_DISPATCH_CASE(_29, N) AT_DISPATCH_CASE(_30, N) AT_DISPATCH_CASE(_31, N) AT_DISPATCH_CASE(_32, N) AT_DISPATCH_CASE(_33, N) AT_DISPATCH_CASE(_34, N) AT_DISPATCH_CASE(_35, N) AT_DISPATCH_CASE(_36, N) AT_DISPATCH_CASE(_37, N) AT_DISPATCH_CASE(_38, N) AT_DISPATCH_CASE(_39, N) AT_DISPATCH_CASE(_40, N) AT_DISPATCH_CASE(_41, N) AT_DISPATCH_CASE(_42, N) AT_DISPATCH_CASE(_43, N) AT_DISPATCH_CASE(_44, N) AT_DISPATCH_CASE(_45, N) AT_DISPATCH_CASE(_46, N) AT_DISPATCH_CASE(_47, N) AT_DISPATCH_CASE(_48, N) AT_DISPATCH_CASE(_49, N) AT_DISPATCH_CASE(_50, N) AT_DISPATCH_CASE(_51, N) AT_DISPATCH_CASE(_52, N) AT_DISPATCH_CASE(_53, N) AT_DISPATCH_CASE(_54, N) AT_DISPATCH_CASE(_55, N) AT_DISPATCH_CASE(_56, N) AT_DISPATCH_CASE(_57, N) AT_DISPATCH_CASE(_58, N) AT_DISPATCH_CASE(_59, N) AT_DISPATCH_CASE(_60, N)
// End generated code
// clang-format on

View File

@ -18,6 +18,8 @@ c10::Allocator* GetCPUAllocatorMaybePinned(bool pin_memory) {
// To properly support this, see https://github.com/pytorch/pytorch/issues/14560
if (at::globalContext().hasCUDA()) {
return at::detail::getCUDAHooks().getPinnedMemoryAllocator();
} else if (at::globalContext().hasMTIA()) {
return at::detail::getMTIAHooks().getPinnedMemoryAllocator();
} else if (at::globalContext().hasXPU()) {
return at::detail::getXPUHooks().getPinnedMemoryAllocator();
} else if(at::isPrivateUse1HooksRegistered()) {

View File

@ -420,15 +420,15 @@ inline c10::MaybeOwned<Tensor> expand_size(
inline std::vector<Tensor> expand_outplace(TensorList to_expand) {
// expands a list of Tensors; ignores undefined (null) tensors
bool first = true;
DimVector sizes;
SymDimVector sizes;
for (const auto i : c10::irange(to_expand.size())) {
if (!to_expand[i].defined()) {
continue;
} else if (first) {
sizes = to_expand[i].sizes();
sizes = to_expand[i].sym_sizes();
first = false;
} else {
sizes = infer_size_dimvector(sizes, to_expand[i].sizes());
sizes = infer_size_symdimvector(sizes, to_expand[i].sym_sizes());
}
}
@ -436,10 +436,10 @@ inline std::vector<Tensor> expand_outplace(TensorList to_expand) {
for (const auto i : c10::irange(to_expand.size())) {
if (!to_expand[i].defined()) {
continue;
} else if (to_expand[i].sizes().equals(sizes)) {
} else if (to_expand[i].sym_sizes().equals(sizes)) {
result[i] = to_expand[i];
} else {
result[i] = to_expand[i].expand(sizes);
result[i] = to_expand[i].expand_symint(sizes);
}
}
return result;
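
Reviewer note: the hunk above swaps concrete sizes for symbolic ones but keeps the same broadcasting logic. As a rough illustration of what a size-inference step like `infer_size_symdimvector` has to compute, here is a minimal sketch over plain `int64_t` shapes. The name `broadcast_shape` is hypothetical and this is not ATen's implementation; symbolic sizes are not modeled.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper: broadcast two shapes, aligning dimensions from the
// right. Missing leading dimensions are treated as size 1.
std::vector<std::int64_t> broadcast_shape(
    const std::vector<std::int64_t>& a,
    const std::vector<std::int64_t>& b) {
  const std::size_t ndim = std::max(a.size(), b.size());
  std::vector<std::int64_t> out(ndim, 1);
  for (std::size_t i = 0; i < ndim; ++i) {
    // i counts from the trailing dimension toward the leading one.
    const std::int64_t da = i < a.size() ? a[a.size() - 1 - i] : 1;
    const std::int64_t db = i < b.size() ? b[b.size() - 1 - i] : 1;
    if (da != db && da != 1 && db != 1) {
      throw std::invalid_argument("shapes are not broadcastable");
    }
    // A size-1 dimension stretches to match the other operand.
    out[ndim - 1 - i] = (da == 1) ? db : da;
  }
  return out;
}
```

Folding this helper over every defined tensor's shape gives the common target size, which is what the first loop in `expand_outplace` accumulates in `sizes`.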

View File

@ -209,8 +209,8 @@ void init_num_threads() {
}
void set_num_threads(int nthreads) {
TORCH_CHECK(nthreads > 0, "Expected positive number of threads");
#ifndef C10_MOBILE
TORCH_CHECK(nthreads > 0, "Expected positive number of threads");
int no_value = NOT_SET;
if (!num_intraop_threads.compare_exchange_strong(no_value, nthreads)) {
// num_intraop_threads either stores a positive integer or CONSUMED,
@ -229,8 +229,9 @@ void set_num_threads(int nthreads) {
}
}
#else
caffe2::PThreadPool* const pool = caffe2::pthreadpool(nthreads);
caffe2::PThreadPool* const pool = caffe2::pthreadpool();
TORCH_INTERNAL_ASSERT(pool, "Invalid thread pool!");
pool->set_thread_count(nthreads);
#endif // C10_MOBILE
}

View File

@ -19,7 +19,7 @@ Tensor& scalar_fill(Tensor& self, const Scalar& value) {
AT_DISPATCH_V2(
self.scalar_type(), "fill_out", AT_WRAP([&]() {
fill_inplace<scalar_t>(self, value);
}), kComplexHalf, kHalf, kBool, kBFloat16, AT_EXPAND(AT_ALL_TYPES_AND_COMPLEX), AT_EXPAND(AT_BAREBONES_UNSIGNED_TYPES));
}), kComplexHalf, kHalf, kBool, kBFloat16, AT_EXPAND(AT_ALL_TYPES_AND_COMPLEX), AT_EXPAND(AT_FLOAT8_TYPES), AT_EXPAND(AT_BAREBONES_UNSIGNED_TYPES));
return self;
}

View File

@ -144,8 +144,8 @@ class CheckSparseTensorInvariants {
bool old_state;
public:
CheckSparseTensorInvariants(bool state) {
old_state = at::globalContext().checkSparseTensorInvariants();
CheckSparseTensorInvariants(bool state)
: old_state(at::globalContext().checkSparseTensorInvariants()) {
at::globalContext().setCheckSparseTensorInvariants(state);
}
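
Reviewer note: the change above moves the read of the previous state into the member initializer list. A self-contained sketch of the same save-and-set guard pattern follows. `sketch::invariant_flag` and `sketch::FlagGuard` are hypothetical stand-ins, and the restoring destructor is an assumption about how such a guard is normally completed, since the hunk does not show it.

```cpp
namespace sketch {

// Hypothetical stand-in for a process-wide setting.
inline bool& invariant_flag() {
  static bool flag = false;
  return flag;
}

class FlagGuard {
  bool old_state_;

 public:
  // The previous value is captured in the initializer list, so the member
  // is never left uninitialized, and only then is the new value installed.
  explicit FlagGuard(bool state) : old_state_(invariant_flag()) {
    invariant_flag() = state;
  }

  // Assumed behavior: restore the saved value when the guard goes out of scope.
  ~FlagGuard() {
    invariant_flag() = old_state_;
  }

  FlagGuard(const FlagGuard&) = delete;
  FlagGuard& operator=(const FlagGuard&) = delete;
};

} // namespace sketch
```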

View File

@ -255,7 +255,9 @@ inline Tensor applySelect(
// the other hand, indexing wrapping is valid for all negative int64_t
// values, as x[INT64_MIN] is the same as x[INT64_MAX]
TORCH_CHECK_INDEX(
size > -1 - index && size > index,
size.sym_gt(-1 - index)
.sym_and(size.sym_gt(index))
.expect_true(__FILE__, __LINE__),
"index ",
index,
" is out of bounds for dimension ",

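Reviewer note: the symbolic condition in the hunk above encodes the same predicate as the plain-integer expression it replaces, namely `-size <= index < size`. A minimal sketch with ordinary `int64_t` values follows; the helper name `index_in_bounds` is hypothetical.

```cpp
#include <cstdint>

// Hypothetical helper mirroring the plain-integer form of the check.
// `size > -1 - index` is equivalent to `index >= -size`, and `-1 - index`
// cannot overflow for any int64_t index, unlike `-index`.
inline bool index_in_bounds(std::int64_t index, std::int64_t size) {
  return size > -1 - index && size > index;
}
```
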
View File

@ -82,7 +82,7 @@ class TORCH_API ThreadLocalState {
!defined(BUILD_LITE_INTERPRETER)
// TLS for autocast dtypes
std::array<at::ScalarType, at::COMPILE_TIME_MAX_DEVICE_TYPES>
autocast_dtypes_;
autocast_dtypes_{};
#endif
friend class ThreadLocalStateGuard;
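
Reviewer note: adding `{}` to the member declaration value-initializes the array, so every element starts from a defined value instead of an indeterminate one. A small sketch with `int` elements follows; the struct name `WithInit` is hypothetical.

```cpp
#include <array>

// With the `{}` default member initializer, every element is
// value-initialized (zero for arithmetic types) when an object is created.
// Without it, a default-initialized object would leave the elements
// indeterminate, and reading them would be undefined behavior.
struct WithInit {
  std::array<int, 4> values{};
};
```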

View File

@ -13,8 +13,6 @@
#include <ATen/core/Array.h>
#include <c10/macros/Macros.h>
#include <c10/util/Exception.h>
#include <c10/util/Half.h>
#include <cmath>
#include <cstdint>

View File

@ -306,11 +306,10 @@ struct VecConvert<float, 1, BFloat16, 1> {
const VectorizedN<BFloat16, 1>& src) {
VectorizedN<float, 1> result;
uint16x8_t u16_8 = vld1q_u16(reinterpret_cast<const uint16_t*>(&src[0]));
int32x4_t shift = vdupq_n_s32(16);
auto u16_low1 = vget_low_u16(u16_8);
auto u16_high1 = vget_high_u16(u16_8);
float32x4_t f32x4_0 = vreinterpretq_f32_u32(vshlq_u32(vmovl_u16(u16_low1), shift));
float32x4_t f32x4_1 = vreinterpretq_f32_u32(vshlq_u32(vmovl_u16(u16_high1), shift));
float32x4_t f32x4_0 = vreinterpretq_f32_u32(vshlq_n_u32(vmovl_u16(u16_low1), 16));
float32x4_t f32x4_1 = vreinterpretq_f32_u32(vshlq_n_u32(vmovl_u16(u16_high1), 16));
result[0] = {f32x4_0, f32x4_1};
return result;
}
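
Reviewer note: the NEON code above widens each 16-bit lane to 32 bits and shifts it left by 16, now with an immediate shift instead of a vector of shift counts. The scalar equivalent is sketched below, assuming the usual bfloat16 layout (the upper 16 bits of an IEEE-754 binary32 value). The function name `bf16_bits_to_float` is hypothetical.

```cpp
#include <cstdint>
#include <cstring>

// Scalar sketch of the per-lane conversion: place the 16 bfloat16 bits in
// the upper half of a 32-bit word and reinterpret that word as a float.
inline float bf16_bits_to_float(std::uint16_t bits) {
  const std::uint32_t widened = static_cast<std::uint32_t>(bits) << 16;
  float out;
  static_assert(sizeof(out) == sizeof(widened), "float must be 32 bits");
  std::memcpy(&out, &widened, sizeof(out)); // bit-level reinterpretation
  return out;
}
```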

View File

@ -75,7 +75,7 @@ inline __m256i pack_saturate_and_clamp<int32_t>(
int32_t /*min_val*/,
int32_t /*max_val*/) {
// This function is for linkage only, will not be used
AT_ERROR("pack_saturate_and_clamp<int32_t> is not supported");
TORCH_CHECK(false, "pack_saturate_and_clamp<int32_t> is not supported");
}
template <>

View File

@ -77,7 +77,7 @@ inline __m512i pack_saturate_and_clamp<int32_t>(
int32_t min_val [[maybe_unused]],
int32_t max_val [[maybe_unused]]) {
// This function is for linkage only, will not be used
AT_ERROR("pack_saturate_and_clamp<int32_t> is not supported");
TORCH_CHECK(false, "pack_saturate_and_clamp<int32_t> is not supported");
return __m512i{};
}

View File

@ -125,7 +125,7 @@ void CUDAGraph::capture_begin(MempoolId_t pool/*=0*/, cudaStreamCaptureMode capt
// due to the capture status being updated _after_ a capture had already started.
c10::cuda::CUDACachingAllocator::beginAllocateToPool(capture_dev_, mempool_id_, [this](cudaStream_t stream) {
cudaStreamCaptureStatus status;
CaptureId_t stream_capture_id;
CaptureId_t stream_capture_id = 0;
AT_CUDA_CHECK(cudaStreamGetCaptureInfo(stream, &status, &stream_capture_id));
return status == cudaStreamCaptureStatus::cudaStreamCaptureStatusActive && stream_capture_id == capture_id_;
});

View File

@ -10,7 +10,7 @@ TensorBase empty_cuda(
ScalarType dtype,
std::optional<Device> device_opt,
std::optional<c10::MemoryFormat> memory_format_opt) {
at::globalContext().lazyInitCUDA();
at::globalContext().lazyInitDevice(c10::DeviceType::CUDA);
const auto device = device_or_default(device_opt);
TORCH_INTERNAL_ASSERT(device.is_cuda());
const DeviceGuard device_guard(device);
@ -50,7 +50,7 @@ TensorBase empty_strided_cuda(
IntArrayRef stride,
ScalarType dtype,
std::optional<Device> device_opt) {
at::globalContext().lazyInitCUDA();
at::globalContext().lazyInitDevice(c10::DeviceType::CUDA);
const auto device = device_or_default(device_opt);
TORCH_INTERNAL_ASSERT(device.is_cuda());
const DeviceGuard device_guard(device);

View File

@ -34,7 +34,7 @@ void init_p2p_access_cache(int64_t num_devices) {
} // namespace detail
bool get_p2p_access(int dev, int dev_to_access) {
at::globalContext().lazyInitCUDA();
at::globalContext().lazyInitDevice(c10::DeviceType::CUDA);
TORCH_CHECK(dev >= 0 || dev < num_devices_,
dev, " is not a device");

View File

@ -14,6 +14,7 @@
#include <ATen/detail/CUDAHooksInterface.h>
#include <ATen/native/cuda/CuFFTPlanCache.h>
#include <c10/util/Exception.h>
#include <c10/util/env.h>
#include <c10/cuda/CUDACachingAllocator.h>
#include <c10/cuda/CUDAFunctions.h>
#include <c10/util/irange.h>
@ -79,30 +80,19 @@ struct _Initializer {
} initializer;
} // anonymous namespace
// Sets the CUDA_MODULE_LOADING environment variable
// if it's not set by the user.
void maybe_set_cuda_module_loading(const std::string &def_value) {
auto value = std::getenv("CUDA_MODULE_LOADING");
if (!value) {
#ifdef _WIN32
auto env_var = "CUDA_MODULE_LOADING=" + def_value;
_putenv(env_var.c_str());
#else
setenv("CUDA_MODULE_LOADING", def_value.c_str(), 1);
#endif
}
}
// NB: deleter is dynamic, because we need it to live in a separate
// compilation unit (alt is to have another method in hooks, but
// let's not if we don't need to!)
void CUDAHooks::initCUDA() const {
void CUDAHooks::init() const {
C10_LOG_API_USAGE_ONCE("aten.init.cuda");
// Force the update to enable unit testing. This code gets executed before unit tests
// have a chance to enable vitals.
at::vitals::VitalsAPI.setVital("CUDA", "used", "true", /* force = */ true);
maybe_set_cuda_module_loading("LAZY");
// Sets the CUDA_MODULE_LOADING environment variable
// if it's not set by the user.
c10::utils::set_env("CUDA_MODULE_LOADING", "LAZY", false);
const auto num_devices = c10::cuda::device_count_ensure_non_zero();
c10::cuda::CUDACachingAllocator::init(num_devices);
at::cuda::detail::init_p2p_access_cache(num_devices);
@ -113,7 +103,7 @@ void CUDAHooks::initCUDA() const {
#endif
}
const Generator& CUDAHooks::getDefaultCUDAGenerator(DeviceIndex device_index) const {
const Generator& CUDAHooks::getDefaultGenerator(DeviceIndex device_index) const {
return at::cuda::detail::getDefaultCUDAGenerator(device_index);
}
@ -241,6 +231,9 @@ DeviceIndex current_device() {
return -1;
}
/**
* DEPRECATED: use getCurrentDevice() instead
*/
DeviceIndex CUDAHooks::current_device() const {
return at::cuda::detail::current_device();
}
@ -436,10 +429,21 @@ void CUDAHooks::cuFFTClearPlanCache(DeviceIndex device_index) const {
at::native::detail::cufft_clear_plan_cache_impl(device_index);
}
/**
* DEPRECATED: use deviceCount() instead
*/
int CUDAHooks::getNumGPUs() const {
return at::cuda::device_count();
}
DeviceIndex CUDAHooks::deviceCount() const {
return at::cuda::device_count();
}
DeviceIndex CUDAHooks::getCurrentDevice() const {
return at::cuda::detail::current_device();
}
#ifdef USE_ROCM
bool CUDAHooks::isGPUArch(DeviceIndex device_index, const std::vector<std::string>& archs) const {
hipDeviceProp_t* prop = at::cuda::getDeviceProperties(device_index);


@ -19,10 +19,11 @@ TORCH_CUDA_CPP_API void set_magma_init_fn(void (*magma_init_fn)());
// The real implementation of CUDAHooksInterface
struct CUDAHooks : public at::CUDAHooksInterface {
CUDAHooks(at::CUDAHooksArgs) {}
void initCUDA() const override;
void init() const override;
Device getDeviceFromPtr(void* data) const override;
bool isPinnedPtr(const void* data) const override;
const Generator& getDefaultCUDAGenerator(DeviceIndex device_index = -1) const override;
const Generator& getDefaultGenerator(
DeviceIndex device_index = -1) const override;
bool hasCUDA() const override;
bool hasMAGMA() const override;
bool hasCuDNN() const override;
@ -49,6 +50,9 @@ struct CUDAHooks : public at::CUDAHooksInterface {
int64_t cuFFTGetPlanCacheSize(DeviceIndex device_index) const override;
void cuFFTClearPlanCache(DeviceIndex device_index) const override;
int getNumGPUs() const override;
DeviceIndex deviceCount() const override;
DeviceIndex getCurrentDevice() const override;
#ifdef USE_ROCM
bool isGPUArch(DeviceIndex device_index, const std::vector<std::string>& archs) const override;
#endif


@ -77,6 +77,31 @@ default, now called through TunableOp. Any call to at::cuda::blas::gemm() or ::b
when enabled. Calling gemm() for a given set of input arguments (transa, transb, m, n, k) will attempt to use the
fastest available implementation across both rocblas and hipblaslt.
## Offline Tuning
### Motivation
Offline tuning is intended for workloads with high memory utilization, where regular tuning might run out of memory.
### Workflow
There are two steps:
1) Set the environment variables to record the untuned GEMMs. This generates `tunableop_untuned?.csv` ("?" is a placeholder for the GPU ID), for example:
```
PYTORCH_TUNABLEOP_ENABLED=1
PYTORCH_TUNABLEOP_TUNING=0
PYTORCH_TUNABLEOP_RECORD_UNTUNED=1
...
```
2) Run a Python script that reads the `tunableop_untuned?.csv` and generates the `tunableop_results?.csv`, like:
```
import torch.cuda.tunable as tunable
import os
os.putenv('PYTORCH_TUNABLEOP_ENABLED', '1')
os.putenv('PYTORCH_TUNABLEOP_TUNING', '1')
os.putenv('PYTORCH_TUNABLEOP_RECORD_UNTUNED', '0')
tunable.tune_gemm_in_file("tunableop_untuned?.csv")
```
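The `?` placeholder in the filenames above stands for the GPU ID. A minimal sketch of that naming, assuming the ID is inserted before the last `.` of the configured filename (or appended when there is no `.`), which is what the `GetUntunedFile()` change in this diff does. The helper name `per_device_filename` is hypothetical and is not part of `torch.cuda.tunable`:

```python
def per_device_filename(filename: str, device: int) -> str:
    """Hypothetical helper: derive the per-GPU filename from a base name.

    Inserts the device index before the last '.', or appends it when the
    name contains no '.' at all.
    """
    dot = filename.rfind(".")
    if dot != -1:
        return filename[:dot] + str(device) + filename[dot:]
    # No extension found, so just append the device index.
    return filename + str(device)


print(per_device_filename("tunableop_untuned.csv", 0))  # tunableop_untuned0.csv
```

With the default `PYTORCH_TUNABLEOP_UNTUNED_FILENAME`, GPU 0 would therefore write to `tunableop_untuned0.csv` under this assumption.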
## Tuning Context
The behavior of TunableOp is currently manipulated through environment variables, the C++ interface of
at::cuda::tunable::getTuningContext(), or the `torch.cuda.tunable` python interfaces. The environment variables take
@ -90,6 +115,8 @@ programmatically since the settings become fixed. Use the C++ or Python APIs ins
| -------------------- | ----------- |
| PYTORCH_TUNABLEOP_ENABLED | Default is 0. Set to 1 to enable. |
| PYTORCH_TUNABLEOP_TUNING | Default is 1. Set to 0 to disable. |
| PYTORCH_TUNABLEOP_RECORD_UNTUNED | Default is 0. Set to 1 to enable. |
| PYTORCH_TUNABLEOP_UNTUNED_FILENAME | Default is 'tunableop_untuned.csv'. |
| PYTORCH_TUNABLEOP_VERBOSE | Default is 0. Set to 1 to enable basic logging. 2 for basic tuning status. 3 for full trace. |
| PYTORCH_TUNABLEOP_VERBOSE_FILENAME | Default is "err" for stderr. Set to "out" for stdout or a filename for capturing verbose logging. |
| PYTORCH_TUNABLEOP_FILENAME | Default is 'tunableop_results.csv'. |
@ -112,6 +139,8 @@ All python APIs exist in the `torch.cuda.tunable` module.
| is_enabled() -> bool | |
| tuning_enable(val: bool = True) -> None | Default is True. |
| tuning_is_enabled() -> bool | |
| record_untuned_enable(val: bool = True) -> None | Default is True. |
| record_untuned_is_enabled() -> bool | |
| set_max_tuning_duration(duration: int) -> None | |
| get_max_tuning_duration() -> int | |
| set_max_tuning_iterations(iterations: int) -> None | |
@ -123,6 +152,7 @@ All python APIs exist in the `torch.cuda.tunable` module.
| write_file_on_exit(val: bool) -> None | Default is True. |
| write_file(filename: Optional[str] = None) -> None | If filename not given, it will call get_filename(). |
| read_file(filename: Optional[str] = None) -> None | If filename not given, it will call get_filename(). |
| tune_gemm_in_file(filename: str) -> None | Read an untuned file and tune the GEMMs in it. |
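As a rough illustration of what recording untuned GEMMs amounts to: each (operator signature, parameter signature) pair is written to the untuned file only once. The sketch below mirrors that deduplication in plain Python. The class name and the signature strings are made up for illustration; the real logic lives in the C++ `TuningResultsManager::RecordUntuned` added by this change.

```python
import io


class UntunedRecorder:
    """Hypothetical stand-in for the C++ record-untuned bookkeeping."""

    def __init__(self, out):
        self._out = out
        self._seen = {}  # op_signature -> set of params_signature

    def record(self, op_signature: str, params_signature: str) -> bool:
        """Write one CSV line per new pair; return True if a line was written."""
        params = self._seen.setdefault(op_signature, set())
        if params_signature in params:
            return False  # already recorded, skip the duplicate
        params.add(params_signature)
        self._out.write(f"{op_signature},{params_signature}\n")
        return True


buf = io.StringIO()
rec = UntunedRecorder(buf)
rec.record("ExampleGemmOp", "nt_128_64_32")  # illustrative signatures only
rec.record("ExampleGemmOp", "nt_128_64_32")  # duplicate, not written again
rec.record("ExampleGemmOp", "nt_256_64_32")
print(buf.getvalue(), end="")
```

Keeping a set per operator keeps the untuned file small even when the same GEMM shape is called many times during a run.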
### C++ Interface
Example:


@ -112,6 +112,32 @@ void TuningResultsManager::Add(const std::string& op_signature, const std::strin
AddImpl(op_signature, params_signature, best, it->second);
}
void TuningResultsManager::RecordUntuned( std::ofstream& untuned_file, const std::string& op_signature, const std::string& params_signature) {
std::scoped_lock l{lock_};
if (!untuned_file.good()) {
TORCH_WARN_ONCE("failed to open file for writing; untuned gemm will not be saved");
return;
} else {
bool isNew = false;
auto it = untuned_results_.find(op_signature);
if (it == untuned_results_.end()) {
it = untuned_results_.insert({op_signature, {}}).first;
isNew = true;
}
auto it_kernel_map = it->second.find(params_signature);
if (it_kernel_map == it->second.end()) {
it->second.insert(params_signature);
isNew = true;
}
if (isNew) {
untuned_file << op_signature << "," << params_signature << std::endl;
TUNABLE_LOG3("Untuned,", op_signature, ",", params_signature);
}
}
}
void TuningResultsManager::Delete(const std::string& op_signature, const std::string& params_signature) {
std::scoped_lock l{lock_};
@ -359,6 +385,7 @@ TuningStatus TuningResultsValidator::ValidatePyTorchVersion(const std::string& v
TuningContext::TuningContext() :
enable_{false},
tuning_enable_{true},
record_untuned_enable_{false},
manager_initialized_{false},
write_file_on_exit_{true},
numerics_check_enable_{false},
@ -369,6 +396,7 @@ TuningContext::TuningContext() :
icache_flush_{true},
rotating_buffer_size_{-1},
filename_{},
untuned_file_{},
results_count_from_input_file_{0}
{
}
@ -394,6 +422,10 @@ TuningContext::~TuningContext() {
}
}
}
if (untuned_file_.good()) {
untuned_file_.close();
}
}
void TuningContext::EnableTunableOp(bool value) {
@ -424,6 +456,15 @@ void TuningContext::EnableTuning(bool value) {
}
}
void TuningContext::EnableRecordUntuned(bool value) {
record_untuned_enable_ = value;
if (value) {
TUNABLE_LOG1("Enable Record Untuned for TunableOp");
} else {
TUNABLE_LOG1("Disable Record Untuned for TunableOp");
}
}
bool TuningContext::IsTuningEnabled() const {
static const char *env = std::getenv("PYTORCH_TUNABLEOP_TUNING");
if (env != nullptr && strcmp(env, "0") == 0) {
@ -432,6 +473,33 @@ bool TuningContext::IsTuningEnabled() const {
return tuning_enable_;
}
bool TuningContext::IsRecordUntunedEnabled() const {
static const char *env = std::getenv("PYTORCH_TUNABLEOP_RECORD_UNTUNED");
if (env != nullptr && strcmp(env, "1") == 0) {
return true;
}
return record_untuned_enable_;
}
std::ofstream& TuningContext::GetUntunedFile(){
if (!untuned_file_.is_open()) {
const char *env = std::getenv("PYTORCH_TUNABLEOP_UNTUNED_FILENAME");
std::string filename = (env == nullptr) ? "tunableop_untuned.csv" : env;
std::string device = c10::str(int(c10::cuda::current_device()));
std::size_t found = filename.rfind(".");
if (found != std::string::npos) {
filename.insert(found, device);
} else {
// all else fails, just append
filename.append(device);
}
untuned_file_ = std::ofstream(filename, std::ios::out | std::ios::trunc);
}
return untuned_file_;
}
void TuningContext::WriteFileOnExit(bool value) {
write_file_on_exit_ = value;
}
@ -545,7 +613,7 @@ TuningResultsManager& TuningContext::GetTuningResultsManager() {
SetFilename(filename, true);
}
auto filename = GetFilename();
if (!filename.empty()) {
if (!filename.empty() && !IsRecordUntunedEnabled()) {
ReadFile(filename);
// attempt immediately to open file for writing to catch errors early
std::ofstream file(filename, std::ios::out | std::ios::app);


@ -19,6 +19,7 @@
#include <string>
#include <type_traits>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>
@ -87,6 +88,7 @@ class TORCH_CUDA_CPP_API ResultEntry {
typedef std::unordered_map<std::string, ResultEntry> KernelMap;
typedef std::unordered_map<std::string, KernelMap> ResultsMap;
typedef std::unordered_map<std::string, std::unordered_set<std::string>> UntunedMap;
struct TORCH_CUDA_CPP_API TuningResults {
// Validates if these results are compatible with the libraries
@ -129,9 +131,12 @@ class TORCH_CUDA_CPP_API TuningResultsManager {
size_t GetSize();
void RecordUntuned( std::ofstream& untuned_file, const std::string& op_signature, const std::string& params_signature);
private:
std::mutex lock_;
ResultsMap results_;
UntunedMap untuned_results_;
};
class TORCH_CUDA_CPP_API TuningResultsValidator {
@ -173,6 +178,10 @@ class TORCH_CUDA_CPP_API TuningContext {
void EnableTuning(bool value);
bool IsTuningEnabled() const;
void EnableRecordUntuned(bool value);
bool IsRecordUntunedEnabled() const;
std::ofstream& GetUntunedFile();
void EnableNumericsCheck(bool value);
bool IsNumericsCheckEnabled() const;
@ -213,6 +222,7 @@ class TORCH_CUDA_CPP_API TuningContext {
private:
bool enable_;
bool tuning_enable_;
bool record_untuned_enable_;
bool manager_initialized_;
bool write_file_on_exit_;
bool numerics_check_enable_;
@ -226,6 +236,7 @@ class TORCH_CUDA_CPP_API TuningContext {
mutable c10::once_flag manager_init_once_;
TuningResultsValidator validator_;
std::string filename_;
std::ofstream untuned_file_;
size_t results_count_from_input_file_;
};


@ -54,9 +54,15 @@ class TunableOp {
auto params_sig = params->Signature();
result = mgr.Lookup(op_sig, params_sig);
// If no previous tuning result has been found, we do the tuning iff tuning is enabled
if (result == ResultEntry::Null() && ctx->IsTuningEnabled()) {
result = FindFastest(params);
mgr.Add(op_sig, params_sig, result);
if (result == ResultEntry::Null()) {
if (ctx->IsTuningEnabled()) {
result = FindFastest(params);
mgr.Add(op_sig, params_sig, result);
}
else if (ctx->IsRecordUntunedEnabled()) {
// or record the gemm into file
mgr.RecordUntuned(ctx->GetUntunedFile(), op_sig, params_sig);
}
}
}
else {


@ -1,9 +1,13 @@
#pragma once
#include <ATen/core/Generator.h>
#include <c10/core/Allocator.h>
#include <c10/core/Device.h>
#include <c10/core/Stream.h>
#include <c10/core/Allocator.h>
C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wunused-parameter")
namespace at {
// AcceleratorHooksInterface is a shared interface provided by all
@ -19,6 +23,10 @@ struct TORCH_API AcceleratorHooksInterface {
// Whether the device at device_index is fully initialized or not.
virtual bool hasPrimaryContext(DeviceIndex device_index) const = 0;
virtual void init() const {
TORCH_CHECK(false, "Backend doesn't support init()");
}
virtual DeviceIndex deviceCount() const {
return 0;
}
@ -50,7 +58,18 @@ struct TORCH_API AcceleratorHooksInterface {
TORCH_CHECK(false, "Backend doesn't support getPinnedMemoryAllocator()");
return nullptr;
}
virtual const Generator& getDefaultGenerator(
C10_UNUSED DeviceIndex device_index = -1) const {
TORCH_CHECK(false, "Backend doesn't support getDefaultGenerator()");
}
virtual Generator getNewGenerator(
C10_UNUSED DeviceIndex device_index = -1) const {
TORCH_CHECK(false, "Backend doesn't support getNewGenerator()");
}
};
} // namespace at
C10_DIAGNOSTIC_POP()


@ -6,16 +6,13 @@
#include <ATen/detail/AcceleratorHooksInterface.h>
// Forward-declares at::Generator and at::cuda::NVRTC
// NB: Class must live in `at` due to limitations of Registry.h.
namespace at {
struct Generator;
// Forward-declares at::cuda::NVRTC
namespace cuda {
struct NVRTC;
} // namespace cuda
} // namespace at
// NB: Class must live in `at` due to limitations of Registry.h.
namespace at {
#ifdef _MSC_VER
constexpr const char* CUDA_HELP =
@ -65,12 +62,16 @@ struct TORCH_API CUDAHooksInterface : AcceleratorHooksInterface {
~CUDAHooksInterface() override = default;
// Initialize THCState and, transitively, the CUDA state
virtual void initCUDA() const {
void init() const override {
TORCH_CHECK(false, "Cannot initialize CUDA without ATen_cuda library. ", CUDA_HELP);
}
virtual const Generator& getDefaultCUDAGenerator(C10_UNUSED DeviceIndex device_index = -1) const {
TORCH_CHECK(false, "Cannot get default CUDA generator without ATen_cuda library. ", CUDA_HELP);
const Generator& getDefaultGenerator(
C10_UNUSED DeviceIndex device_index = -1) const override {
TORCH_CHECK(
false,
"Cannot get default CUDA generator without ATen_cuda library. ",
CUDA_HELP);
}
virtual Device getDeviceFromPtr(void* /*data*/) const {


@ -1,19 +1,13 @@
#pragma once
#include <c10/core/Allocator.h>
#include <c10/core/GeneratorImpl.h>
#include <c10/util/Exception.h>
#include <c10/util/Registry.h>
#include <ATen/detail/AcceleratorHooksInterface.h>
#include <memory>
namespace at {
class Context;
}
// NB: Class must live in `at` due to limitations of Registry.h.
namespace at {
@ -26,13 +20,13 @@ struct TORCH_API HIPHooksInterface : AcceleratorHooksInterface {
// squelch -Werror=non-virtual-dtor
~HIPHooksInterface() override = default;
// Initialize the HIP library state
virtual void initHIP() const {
AT_ERROR("Cannot initialize HIP without ATen_hip library.");
void init() const override {
TORCH_CHECK(false, "Cannot initialize HIP without ATen_hip library.");
}
virtual std::unique_ptr<c10::GeneratorImpl> initHIPGenerator(Context*) const {
AT_ERROR("Cannot initialize HIP generator without ATen_hip library.");
const Generator& getDefaultGenerator(
C10_UNUSED DeviceIndex device_index = -1) const override {
TORCH_CHECK(false, "Cannot initialize HIP without ATen_hip library.");
}
virtual bool hasHIP() const {
@ -51,10 +45,6 @@ struct TORCH_API HIPHooksInterface : AcceleratorHooksInterface {
AT_ERROR("Pinned memory requires HIP.");
}
virtual void registerHIPTypes(Context*) const {
AT_ERROR("Cannot registerHIPTypes() without ATen_hip library.");
}
virtual int getNumGPUs() const {
return 0;
}


@ -1,25 +1,33 @@
#pragma once
#include <ATen/core/Generator.h>
#include <ATen/detail/AcceleratorHooksInterface.h>
#include <c10/core/Allocator.h>
#include <c10/util/Exception.h>
#include <c10/util/Registry.h>
namespace at {
struct TORCH_API IPUHooksInterface {
virtual ~IPUHooksInterface() = default;
struct TORCH_API IPUHooksInterface : AcceleratorHooksInterface {
~IPUHooksInterface() override = default;
virtual const Generator& getDefaultIPUGenerator(
DeviceIndex device_index [[maybe_unused]] = -1) const {
AT_ERROR(
"Cannot get the default IPU generator: the IPU backend is not "
"available.");
void init() const override {
TORCH_CHECK(false, "Cannot initialize IPU without ATen_ipu library.");
}
virtual Generator newIPUGenerator(DeviceIndex device_index [[maybe_unused]] = -1) const {
AT_ERROR(
"Cannot create a new IPU generator: the IPU backend is not available.");
bool hasPrimaryContext(DeviceIndex device_index) const override {
TORCH_CHECK(false, "Cannot initialize IPU without ATen_ipu library.");
return false;
}
const Generator& getDefaultGenerator(
C10_UNUSED DeviceIndex device_index = -1) const override {
TORCH_CHECK(false, "Cannot initialize IPU without ATen_ipu library.");
}
Generator getNewGenerator(
DeviceIndex device_index [[maybe_unused]] = -1) const override {
TORCH_CHECK(false, "Cannot initialize IPU without ATen_ipu library.");
}
};


@ -3,13 +3,24 @@
#include <c10/util/Exception.h>
#include <c10/util/Registry.h>
#include <ATen/detail/AcceleratorHooksInterface.h>
// NB: Class must live in `at` due to limitations of Registry.h.
namespace at {
struct TORCH_API MAIAHooksInterface {
struct TORCH_API MAIAHooksInterface : AcceleratorHooksInterface {
// This should never actually be implemented, but it is used to
// squelch -Werror=non-virtual-dtor
virtual ~MAIAHooksInterface() = default;
~MAIAHooksInterface() override = default;
void init() const override {
TORCH_CHECK(false, "Cannot initialize MAIA without ATen_maia library.");
}
bool hasPrimaryContext(DeviceIndex device_index) const override {
TORCH_CHECK(false, "Cannot initialize MAIA without ATen_maia library.");
return false;
}
virtual std::string showConfig() const {
TORCH_CHECK(false, "Cannot query detailed MAIA version information.");


@ -2,9 +2,9 @@
#pragma once
#include <c10/core/Allocator.h>
#include <ATen/core/Generator.h>
#include <ATen/detail/AcceleratorHooksInterface.h>
#include <c10/core/Allocator.h>
#include <c10/util/Exception.h>
#include <c10/util/Registry.h>
@ -22,7 +22,7 @@ struct TORCH_API MPSHooksInterface : AcceleratorHooksInterface {
~MPSHooksInterface() override = default;
// Initialize the MPS library state
virtual void initMPS() const {
void init() const override {
FAIL_MPSHOOKS_FUNC(__func__);
}
virtual bool hasMPS() const {
@ -31,7 +31,8 @@ struct TORCH_API MPSHooksInterface : AcceleratorHooksInterface {
virtual bool isOnMacOSorNewer(unsigned major = 13, unsigned minor = 0) const {
FAIL_MPSHOOKS_FUNC(__func__);
}
virtual const Generator& getDefaultMPSGenerator() const {
const Generator& getDefaultGenerator(
C10_UNUSED DeviceIndex device_index = -1) const override {
FAIL_MPSHOOKS_FUNC(__func__);
}
virtual Allocator* getMPSDeviceAllocator() const {


@ -31,7 +31,7 @@ struct TORCH_API MTIAHooksInterface : AcceleratorHooksInterface {
~MTIAHooksInterface() override = default;
virtual void initMTIA() const {
void init() const override {
// Avoid logging here, since MTIA needs init devices first then it will know
// how many devices are available. Make it as no-op if mtia extension is not
// dynamically loaded.


@ -1,18 +1,20 @@
#pragma once
#include <ATen/core/Generator.h>
#include <ATen/detail/AcceleratorHooksInterface.h>
#include <c10/core/Allocator.h>
#include <c10/core/Device.h>
#include <c10/core/Storage.h>
#include <c10/util/Exception.h>
C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wunused-parameter")
namespace at {
struct TORCH_API PrivateUse1HooksInterface : AcceleratorHooksInterface {
~PrivateUse1HooksInterface() override = default;
virtual const at::Generator& getDefaultGenerator(
c10::DeviceIndex device_index) const {
const at::Generator& getDefaultGenerator(
c10::DeviceIndex device_index) const override {
TORCH_CHECK_NOT_IMPLEMENTED(
false,
"You should register `PrivateUse1HooksInterface` for PrivateUse1 before call `getDefaultGenerator`.");
@ -24,23 +26,23 @@ struct TORCH_API PrivateUse1HooksInterface : AcceleratorHooksInterface {
"You should register `PrivateUse1HooksInterface` for PrivateUse1 before call `getDeviceFromPtr`.");
}
virtual bool isPinnedPtr(const void* data) const override {
bool isPinnedPtr(const void* data) const override {
return false;
}
virtual Allocator* getPinnedMemoryAllocator() const override {
Allocator* getPinnedMemoryAllocator() const override {
TORCH_CHECK(
false,
"You should register `PrivateUse1HooksInterface` for PrivateUse1 before call `getPinnedMemoryAllocator`.");
}
virtual bool hasPrimaryContext(DeviceIndex device_index) const override {
bool hasPrimaryContext(DeviceIndex device_index) const override {
TORCH_CHECK_NOT_IMPLEMENTED(
false,
"You should register `PrivateUse1HooksInterface` for PrivateUse1 before call `hasPrimaryContext`.");
}
virtual void initPrivateUse1() const {}
void init() const override {}
virtual void resizePrivateUse1Bytes(
const c10::Storage& storage,
size_t newsize) const {


@ -4,7 +4,6 @@
#include <c10/util/Exception.h>
#include <c10/util/Registry.h>
#include <ATen/core/Generator.h>
#include <ATen/detail/AcceleratorHooksInterface.h>
C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wunused-parameter")
@ -14,10 +13,8 @@ namespace at {
struct TORCH_API XPUHooksInterface : AcceleratorHooksInterface{
~XPUHooksInterface() override = default;
virtual void initXPU() const {
TORCH_CHECK(
false,
"Cannot initialize XPU without ATen_xpu library.");
void init() const override {
TORCH_CHECK(false, "Cannot initialize XPU without ATen_xpu library.");
}
virtual bool hasXPU() const {
@ -34,12 +31,15 @@ struct TORCH_API XPUHooksInterface : AcceleratorHooksInterface{
TORCH_CHECK(false, "Cannot get XPU global device index without ATen_xpu library.");
}
virtual Generator getXPUGenerator(C10_UNUSED DeviceIndex device_index = -1) const {
TORCH_CHECK(false, "Cannot get XPU generator without ATen_xpu library.");
const Generator& getDefaultGenerator(
C10_UNUSED DeviceIndex device_index = -1) const override {
TORCH_CHECK(
false, "Cannot get default XPU generator without ATen_xpu library.");
}
virtual const Generator& getDefaultXPUGenerator(C10_UNUSED DeviceIndex device_index = -1) const {
TORCH_CHECK(false, "Cannot get default XPU generator without ATen_xpu library.");
Generator getNewGenerator(
C10_UNUSED DeviceIndex device_index = -1) const override {
TORCH_CHECK(false, "Cannot get XPU generator without ATen_xpu library.");
}
virtual DeviceIndex getNumGPUs() const {


@ -362,6 +362,7 @@ static std::tuple<Tensor,Tensor,Tensor> convolution_backward_plumbing(
const Tensor& grad_output_, const Tensor& input_, const Tensor& weight_,
const c10::OptionalArrayRef<SymInt> bias_sizes_opt,
c10::SymIntArrayRef stride, c10::SymIntArrayRef padding, c10::SymIntArrayRef dilation, bool transposed,
// NOLINTNEXTLINE(performance-unnecessary-value-param)
c10::SymIntArrayRef output_padding, c10::SymInt groups, std::array<bool, 3> output_mask) {
const auto maybe_layer = maybeCurrentDynamicLayer();
vmap_check_escaped(maybe_layer, "convolution_backward_plumbing");


@ -458,6 +458,16 @@ inline int64_t get_bdim_size2(
TORCH_INTERNAL_ASSERT(false);
}
inline c10::SymInt get_bdim_size2_symint(
const Tensor& a_value, std::optional<int64_t> a_bdim,
const Tensor& b_value, std::optional<int64_t> b_bdim) {
if (a_bdim)
return a_value.sym_size(*a_bdim);
if (b_bdim)
return b_value.sym_size(*b_bdim);
TORCH_INTERNAL_ASSERT(false);
}
// [start, start + 1, ..., stop - 1]
inline VmapDimVector range(int64_t start, int64_t stop) {
TORCH_INTERNAL_ASSERT(stop >= start);


@ -8,7 +8,7 @@
#include <ATen/core/dispatch/Dispatcher.h>
#include <ATen/functorch/BatchRulesHelper.h>
namespace at { namespace functorch {
namespace at::functorch {
#define OP_DECOMPOSE(op) m.impl(#op, static_cast<decltype(&ATEN_FN(op))>(native::op));
#define OP_DECOMPOSE2(op, overload) m.impl(#op"."#overload, static_cast<decltype(&ATEN_FN2(op, overload))>(native::op));
@ -20,4 +20,4 @@ TORCH_LIBRARY_IMPL(aten, FuncTorchBatched, m) {
OP_DECOMPOSE(_unsafe_masked_index_put_accumulate);
}
}}
}


@ -226,7 +226,7 @@ static Tensor one_hot_decomposition_hack(const Tensor &self, int64_t num_classes
if (num_classes <= 0) {
AT_ERROR("Can not infer total number of classes from empty tensor.");
} else {
shape.push_back(num_classes);
shape.emplace_back(num_classes);
return at::empty_symint(shape, self.options());
}
}
@ -246,7 +246,7 @@ static Tensor one_hot_decomposition_hack(const Tensor &self, int64_t num_classes
// TORCH_CHECK(num_classes > self.max().item().toLong(), "Class values must be smaller than num_classes.");
// }
shape.push_back(num_classes);
shape.emplace_back(num_classes);
Tensor ret = at::zeros_symint(shape, self.options());
return ret.scatter(-1, self.unsqueeze(-1), 1);
}


@ -213,7 +213,7 @@ static std::tuple<Tensor,Tensor> native_dropout_batching_rule(const Tensor& tens
return std::make_tuple(output, mask);
}
static Tensor multinomial_batching_rule(const Tensor& self, const int64_t num_samples, const bool replacement, const std::optional<Generator> generator) {
static Tensor multinomial_batching_rule(const Tensor& self, const int64_t num_samples, const bool replacement, std::optional<Generator> generator) {
c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchVmapMode);
auto maybe_layer = maybeCurrentDynamicLayer();
const auto cur_level = maybe_layer->layerId();
@ -237,7 +237,7 @@ static Tensor multinomial_batching_rule(const Tensor& self, const int64_t num_sa
if (is_2D_case) {
self_value = reshape_dim_into(0, 0, self_value);
}
auto out = multinomial(self_value, num_samples, replacement, generator);
auto out = multinomial(self_value, num_samples, replacement, std::move(generator));
if (is_2D_case) {
out = reshape_dim_outof_symint(0, maybe_layer->batchSize(), out);
}
@ -249,7 +249,7 @@ static Tensor multinomial_batching_rule(const Tensor& self, const int64_t num_sa
// Must be same randomness with unbatched input
// 1D case: S -> multinomial(S) -> S
// 2D case: MS -> multinomial(MS) -> MS
return multinomial(self_value, num_samples, replacement, generator);
return multinomial(self_value, num_samples, replacement, std::move(generator));
}
template <typename A, A a, typename C>


@ -58,7 +58,7 @@ static int64_t get_max_index_logical_dim(
static std::vector<std::optional<Tensor>> batchIndices(
ArrayRef<std::optional<Tensor>> indices,
ArrayRef<std::optional<int64_t>> indices_bdims,
int64_t batch_size,
const c10::SymInt& batch_size,
std::optional<int64_t> self_bdim,
std::optional<int64_t> values_bdim = std::nullopt) {
// There are 3 main cases:
@ -89,7 +89,7 @@ static std::vector<std::optional<Tensor>> batchIndices(
for (size_t i = 0; i < indices.size(); i++) {
auto index = indices[i];
if (index.has_value() && index->numel() != 0) {
if (index.has_value() && index->sym_numel() != 0) {
const auto idx_bdim = indices_bdims[i];
indices_.emplace_back(maybePadToLogicalRank(moveBatchDimToFront(index.value(), idx_bdim), idx_bdim, maxLogicalRank));
if (index.value().dtype() == kBool && indices_bdims[i].has_value()) {
@ -346,10 +346,10 @@ namespace {
// Code is mostly duplicated from
// https://github.com/pytorch/pytorch/blob/fb0e27d38a8fdab4e1c14d6378c9e41cb30fd6a3
// /aten/src/ATen/native/TensorAdvancedIndexing.cpp#L294-L312
VmapDimVector compute_indexed_shape(const Tensor &src, TensorList indices_list)
VmapSymDimVector compute_indexed_shape(const Tensor &src, TensorList indices_list)
{
int64_t dims_before = 0, dims_indexed = 0;
IntArrayRef replacement_shape;
SymIntArrayRef replacement_shape;
for (const auto dim : c10::irange(indices_list.size())) {
if (!indices_list[dim].defined()) {
if (dims_indexed == 0) {
@ -357,7 +357,7 @@ namespace {
}
} else {
dims_indexed++;
replacement_shape = indices_list[dim].sizes();
replacement_shape = indices_list[dim].sym_sizes();
}
}
@ -365,7 +365,7 @@ namespace {
// The offset in these dimensions is computed by the kernel using the index tensor's
// values and the stride of src. The new shape is not meaningful. It's used to make
// the shape compatible with the result tensor.
auto shape = VmapDimVector(src.sizes());
auto shape = VmapSymDimVector(src.sym_sizes());
int64_t end = dims_before + dims_indexed;
shape.erase(shape.begin() + dims_before, shape.begin() + end);
shape.insert(shape.begin() + dims_before, replacement_shape.begin(), replacement_shape.end());
@ -375,7 +375,7 @@ namespace {
// Code is mostly duplicated from
// https://github.com/pytorch/pytorch/blob/fb0e27d38a8fdab4e1c14d6378c9e41cb30fd6a3
// /aten/src/ATen/native/TensorAdvancedIndexing.cpp#L379-L405
VmapDimVector get_indexed_shape(Tensor self, const torch::List<std::optional<at::Tensor>> &orig)
VmapSymDimVector get_indexed_shape(Tensor self, const torch::List<std::optional<at::Tensor>> &orig)
{
at::native::checkIndexTensorTypes(orig, /*allow_int*/ true);
// first expand BoolTensor (masks) or ByteTensor (masks) into 1 or more LongTensors
@ -406,13 +406,13 @@ namespace {
ArrayRef<std::optional<int64_t>> indices_bdims,
const Tensor &values,
std::optional<int64_t> values_bdim,
std::optional<int64_t> opt_batch_size = {}) {
std::optional<c10::SymInt> opt_batch_size = {}) {
Tensor self_ = moveBatchDimToFront(self, self_bdim);
Tensor values_ = moveBatchDimToFront(values, values_bdim);
// for inplace variants `index_put_` and `_index_put_impl_` we find the batch_size
// here while for `index_put` does it outside of this function.
const auto batch_size = opt_batch_size ? opt_batch_size.value() : self_.size(0);
const auto batch_size = opt_batch_size ? opt_batch_size.value() : self_.sym_size(0);
self_ = ensure_has_bdim(self_, self_bdim.has_value(), batch_size);
values_ = ensure_has_bdim(values_, values_bdim.has_value(), batch_size);
TORCH_INTERNAL_ASSERT(indices.size() == indices_bdims.size());
@ -431,7 +431,7 @@ namespace {
// number of unit dims (for broadcasting value to indexed_shape)
auto n_unit_dims = indexed_shape.size() - values_sizes.size();
VmapDimVector new_values_shape(values_sizes.size() + n_unit_dims);
VmapSymDimVector new_values_shape(values_sizes.size() + n_unit_dims);
// add the batch-dim
new_values_shape[0] = batch_size;
@ -445,7 +445,7 @@ namespace {
// since batch and unit dims are already filled.
new_values_shape[idx + n_unit_dims] = values_sizes[idx];
}
values_ = values_.view(new_values_shape);
values_ = values_.view_symint(new_values_shape);
}
return std::make_tuple(self_, indices_, values_);
@ -613,14 +613,14 @@ std::tuple<Tensor, std::optional<int64_t>> index_put_batch_rule(
TORCH_INTERNAL_ASSERT(indices.size() == indices_bdims.size());
// find the batch_size
int64_t batch_size = 0;
c10::SymInt batch_size = 0;
if (self_bdim || values_bdim) {
batch_size = get_bdim_size2(self, self_bdim, values, values_bdim);
batch_size = get_bdim_size2_symint(self, self_bdim, values, values_bdim);
} else {
// one or more of the indices is batched.
for (size_t i = 0; i < indices.size(); i++) {
if (indices_bdims[i] && indices[i].has_value()) {
batch_size = indices[i].value().size(*indices_bdims[i]);
batch_size = indices[i].value().sym_size(*indices_bdims[i]);
break;
}
}


@ -102,7 +102,7 @@ static Tensor moveDimToFrontAndExpand(Tensor tensor, std::optional<int64_t> dim,
} else {
tensor = tensor.unsqueeze(0);
auto expanded_sizes = tensor.sym_sizes().vec();
expanded_sizes[0] = size;
expanded_sizes[0] = std::move(size);
tensor = tensor.expand_symint(expanded_sizes);
}
return tensor;


@ -4,7 +4,6 @@
#include <ATen/WrapDimUtils.h>
#include <ATen/functorch/TensorWrapper.h>
#include <ATen/functorch/BatchedTensorImpl.h>
#include <ATen/ATen.h>
#include <ATen/Dispatch.h>
#include <c10/util/irange.h>
#include <ATen/NamedTensorUtils.h>


@ -12,14 +12,15 @@ namespace at::mps {
// The real implementation of MPSHooksInterface
struct MPSHooks : public at::MPSHooksInterface {
MPSHooks(at::MPSHooksArgs) {}
void initMPS() const override;
void init() const override;
// MPSDevice interface
bool hasMPS() const override;
bool isOnMacOSorNewer(unsigned major, unsigned minor) const override;
// MPSGeneratorImpl interface
const Generator& getDefaultMPSGenerator() const override;
const Generator& getDefaultGenerator(
DeviceIndex device_index = -1) const override;
// MPSStream interface
void deviceSynchronize() const override;


@ -10,7 +10,7 @@
namespace at::mps {
void MPSHooks::initMPS() const {
void MPSHooks::init() const {
C10_LOG_API_USAGE_ONCE("aten.init.mps");
// TODO: initialize MPS devices and streams here
}
@ -59,7 +59,7 @@ Allocator* MPSHooks::getMPSDeviceAllocator() const {
return at::mps::GetMPSAllocator();
}
const Generator& MPSHooks::getDefaultMPSGenerator() const {
const Generator& MPSHooks::getDefaultGenerator([[maybe_unused]] DeviceIndex device_index) const {
return at::mps::detail::getDefaultMPSGenerator();
}

View File

@ -207,6 +207,7 @@ TORCH_META_FUNC(mm)(const Tensor & self, const Tensor & mat2) {
TORCH_META_FUNC(linalg_vector_norm)(const Tensor& self, const Scalar& scalar_ord, OptionalIntArrayRef opt_dim, bool keepdim, std::optional<ScalarType> opt_dtype) {
at::native::checkFloatingOrComplex(self, "linalg.vector_norm");
TORCH_CHECK(!at::isComplexType(scalar_ord.type()), "linalg.vector_norm: Expected a non-complex scalar as the order of norm.");
auto dim = opt_dim.value_or(IntArrayRef{});
// Casting a large integer to a double will just introduce an error for
@ -2892,6 +2893,7 @@ Tensor linalg_matrix_norm(
bool keepdim,
std::optional<ScalarType> opt_dtype) {
// Check ord first as it will be used in the dtype check of A
TORCH_CHECK(!at::isComplexType(scalar_ord.type()), "linalg.matrix_norm: Expected a non-complex scalar as the order of norm.");
auto ord = scalar_ord.toDouble();
auto abs_ord = std::abs(ord);
TORCH_CHECK(abs_ord == 2. || abs_ord == 1. || abs_ord == INFINITY, "linalg.matrix_norm: Order ", ord, " not supported.");

View File

@ -2291,7 +2291,7 @@ bool cpu_equal(const Tensor& self, const Tensor& other) {
other_data += strides[1];
}
});
}), kBool, kBFloat16, kHalf, AT_EXPAND(AT_ALL_TYPES_AND_COMPLEX), AT_EXPAND(AT_BAREBONES_UNSIGNED_TYPES));
}), kBool, kBFloat16, kHalf, AT_EXPAND(AT_ALL_TYPES_AND_COMPLEX), AT_EXPAND(AT_FLOAT8_TYPES), AT_EXPAND(AT_BAREBONES_UNSIGNED_TYPES));
return result.load();
}

View File

@ -5,6 +5,8 @@
#include <ATen/Dispatch.h>
#include <c10/util/irange.h>
#include <limits>
#ifndef AT_PER_OPERATOR_HEADERS
#include <ATen/Functions.h>
#include <ATen/NativeFunctions.h>
@ -33,6 +35,17 @@ Tensor _bincount_cpu_template(
AT_ERROR("bincount only supports 1-d non-negative integral inputs.");
}
// Ensure max_val < 2 ^ 63 - 1 (9223372036854775807)
auto max_val = *self.max().data_ptr<input_t>();
if (max_val >= std::numeric_limits<int64_t>::max()) {
AT_ERROR(
"maximum value of input overflowed, it should be < ",
std::numeric_limits<int64_t>::max(),
" but got ",
max_val
);
}
bool has_weights = weights.defined();
if (has_weights && (weights.dim() != 1 || weights.size(0) != self.size(0))) {
AT_ERROR("weights should be 1-d and have the same length as input");
@ -40,7 +53,7 @@ Tensor _bincount_cpu_template(
Tensor output;
int64_t self_size = self.size(0);
int64_t nbins = static_cast<int64_t>(*self.max().data_ptr<input_t>()) + 1L;
int64_t nbins = static_cast<int64_t>(max_val) + 1L;
nbins = std::max(nbins, minlength); // at least minlength # of bins
const input_t* self_p = self.const_data_ptr<input_t>();
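
The guard added in this hunk can be illustrated outside ATen. The following is a minimal sketch, not the PyTorch implementation: `checked_nbins` is a hypothetical helper name, and it returns `std::nullopt` where the real code calls `AT_ERROR`. It shows why the maximum must be checked before computing `max_val + 1`, since the addition overflows `int64_t` when the maximum equals `std::numeric_limits<int64_t>::max()`.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>

// Hypothetical helper (not part of ATen): computes the number of bins for a
// non-negative maximum input value. Returns std::nullopt when max_val + 1
// would not fit in int64_t, instead of raising an error as the diff does.
inline std::optional<int64_t> checked_nbins(int64_t max_val, int64_t minlength) {
  assert(max_val >= 0 && "bincount inputs are non-negative");
  if (max_val >= std::numeric_limits<int64_t>::max()) {
    return std::nullopt;  // max_val + 1 would overflow
  }
  const int64_t nbins = max_val + 1;
  // At least minlength bins, mirroring std::max(nbins, minlength) in the diff.
  return nbins < minlength ? minlength : nbins;
}
```

For example, `checked_nbins(4, 0)` yields 5 bins, `checked_nbins(4, 10)` yields 10, and passing `std::numeric_limits<int64_t>::max()` as the maximum yields an empty optional.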

View File

@ -1435,8 +1435,8 @@ Tensor & index_select_out_cpu_(const Tensor & self, int64_t dim, const Tensor &
});
});
} else {
AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND4(ScalarType::ComplexHalf, ScalarType::Half, ScalarType::Bool, ScalarType::BFloat16,
self.scalar_type(), "index_select", [&index_contig, &self, &result, &dim, &numel] {
AT_DISPATCH_V2(
self.scalar_type(), "index_select", AT_WRAP([&index_contig, &self, &result, &dim, &numel] {
auto self_stride = self.dim() == 0 ? 1 : self.stride(dim);
auto result_stride = result.dim() == 0 ? 1 : result.stride(dim);
@ -1453,7 +1453,7 @@ Tensor & index_select_out_cpu_(const Tensor & self, int64_t dim, const Tensor &
*(result_data_ptr + i * result_stride) = *self_ip;
}
});
});
}), AT_EXPAND(AT_ALL_TYPES_AND_COMPLEX), ScalarType::ComplexHalf, ScalarType::Half, ScalarType::Bool, ScalarType::BFloat16, AT_EXPAND(AT_FLOAT8_TYPES));
}
}

View File

@ -106,7 +106,7 @@ inline Tensor& fill_empty_deterministic_(Tensor& tensor) {
AT_DISPATCH_V2(
tensor.scalar_type(), "fill_empty_deterministic_", AT_WRAP([&]() {
tensor.fill_(std::numeric_limits<scalar_t>::quiet_NaN());
}), AT_EXPAND(AT_FLOATING_TYPES), AT_EXPAND(AT_COMPLEX_TYPES), AT_EXPAND(AT_FLOAT8_TYPES), kBFloat16, kHalf);
}), AT_EXPAND(AT_FLOATING_TYPES), AT_EXPAND(AT_COMPLEX_TYPES), AT_EXPAND(AT_FLOAT8_TYPES), kBFloat16, kHalf, kComplexHalf);
} else {
AT_DISPATCH_V2(
tensor.scalar_type(), "fill_empty_deterministic_", AT_WRAP([&]() {

View File

@ -44,7 +44,7 @@ Scalar _local_scalar_dense_cuda(const Tensor& self) {
cudaStream_t stream = at::cuda::getCurrentCUDAStream();
at::cuda::memcpy_and_sync((void *)value.const_data_ptr<scalar_t>(), self.const_data_ptr<scalar_t>(), sizeof(scalar_t), cudaMemcpyDeviceToHost, stream);
r = Scalar(*value.const_data_ptr<scalar_t>());
}), AT_EXPAND(AT_ALL_TYPES_AND_COMPLEX), kComplexHalf, kHalf, kBool, kBFloat16, AT_EXPAND(AT_BAREBONES_UNSIGNED_TYPES));
}), AT_EXPAND(AT_ALL_TYPES_AND_COMPLEX), kComplexHalf, kHalf, kBool, kBFloat16, AT_EXPAND(AT_FLOAT8_TYPES), AT_EXPAND(AT_BAREBONES_UNSIGNED_TYPES));
#if defined(USE_ROCM)
} else {
auto cpu_self = self.cpu();

View File

@ -285,44 +285,64 @@ struct Copy<dst_t, c10::complex<float>> {
}
};
#define AT_DISPATCH_SOURCE_TYPES(TYPE, NAME, ...) \
AT_DISPATCH_SWITCH( \
TYPE, \
NAME, \
AT_PRIVATE_CASE_TYPE_USING_HINT( \
at::ScalarType::Byte, src_t, __VA_ARGS__) \
AT_PRIVATE_CASE_TYPE_USING_HINT( \
at::ScalarType::Char, src_t, __VA_ARGS__) \
AT_PRIVATE_CASE_TYPE_USING_HINT( \
at::ScalarType::Long, src_t, __VA_ARGS__) \
AT_PRIVATE_CASE_TYPE_USING_HINT( \
at::ScalarType::Short, src_t, __VA_ARGS__) \
AT_PRIVATE_CASE_TYPE_USING_HINT( \
at::ScalarType::Int, src_t, __VA_ARGS__) \
AT_PRIVATE_CASE_TYPE_USING_HINT( \
at::ScalarType::Double, src_t, __VA_ARGS__) \
AT_PRIVATE_CASE_TYPE_USING_HINT( \
at::ScalarType::Float, src_t, __VA_ARGS__) \
AT_PRIVATE_CASE_TYPE_USING_HINT( \
at::ScalarType::ComplexDouble, \
src_t, \
__VA_ARGS__) \
AT_PRIVATE_CASE_TYPE_USING_HINT( \
at::ScalarType::ComplexFloat, \
src_t, \
__VA_ARGS__) \
AT_PRIVATE_CASE_TYPE_USING_HINT( \
at::ScalarType::Half, \
src_t, \
__VA_ARGS__) \
AT_PRIVATE_CASE_TYPE_USING_HINT( \
at::ScalarType::BFloat16, \
src_t, \
__VA_ARGS__) \
AT_PRIVATE_CASE_TYPE_USING_HINT( \
at::ScalarType::Bool, \
src_t, \
__VA_ARGS__))
#define AT_DISPATCH_SOURCE_TYPES(TYPE, NAME, ...) \
AT_DISPATCH_SWITCH( \
TYPE, \
NAME, \
AT_PRIVATE_CASE_TYPE_USING_HINT( \
at::ScalarType::Byte, \
src_t, \
__VA_ARGS__) AT_PRIVATE_CASE_TYPE_USING_HINT(at::ScalarType::Char, src_t, __VA_ARGS__) \
AT_PRIVATE_CASE_TYPE_USING_HINT( \
at::ScalarType::Long, src_t, __VA_ARGS__) \
AT_PRIVATE_CASE_TYPE_USING_HINT( \
at::ScalarType::Short, src_t, __VA_ARGS__) \
AT_PRIVATE_CASE_TYPE_USING_HINT( \
at::ScalarType::Int, src_t, __VA_ARGS__) \
AT_PRIVATE_CASE_TYPE_USING_HINT( \
at::ScalarType::Double, src_t, __VA_ARGS__) \
AT_PRIVATE_CASE_TYPE_USING_HINT( \
at::ScalarType::Float, src_t, __VA_ARGS__) \
AT_PRIVATE_CASE_TYPE_USING_HINT( \
at::ScalarType::ComplexDouble, \
src_t, \
__VA_ARGS__) \
AT_PRIVATE_CASE_TYPE_USING_HINT( \
at::ScalarType::ComplexFloat, \
src_t, \
__VA_ARGS__) \
AT_PRIVATE_CASE_TYPE_USING_HINT( \
at::ScalarType::Half, \
src_t, \
__VA_ARGS__) \
AT_PRIVATE_CASE_TYPE_USING_HINT( \
at::ScalarType::BFloat16, \
src_t, \
__VA_ARGS__) \
AT_PRIVATE_CASE_TYPE_USING_HINT( \
at::ScalarType::Bool, \
src_t, \
__VA_ARGS__) \
AT_PRIVATE_CASE_TYPE_USING_HINT( \
at::ScalarType:: \
Float8_e4m3fn, \
src_t, \
__VA_ARGS__) \
AT_PRIVATE_CASE_TYPE_USING_HINT( \
at::ScalarType:: \
Float8_e4m3fnuz, \
src_t, \
__VA_ARGS__) \
AT_PRIVATE_CASE_TYPE_USING_HINT( \
at::ScalarType:: \
Float8_e5m2, \
src_t, \
__VA_ARGS__) \
AT_PRIVATE_CASE_TYPE_USING_HINT( \
at::ScalarType:: \
Float8_e5m2fnuz, \
src_t, \
__VA_ARGS__))
namespace {
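
The macro above extends a switch over scalar types, binding `src_t` in each case, with four Float8 cases. As a rough illustration of that pattern only, here is a self-contained sketch that does not use ATen. The names `DType`, `TypeTag`, `dispatch_source_type`, and `element_size` are hypothetical, and the real `AT_DISPATCH_SWITCH` machinery is macro-based and differs in detail.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Hypothetical stand-in for at::ScalarType, reduced to four cases.
enum class DType { Byte, Int, Float, Double };

// Carries a C++ type as a value so a generic lambda can recover it.
template <typename T>
struct TypeTag {
  using type = T;
};

// Runtime-to-compile-time dispatch: each case invokes f with a tag for the
// matching C++ type. Supporting a new dtype means adding one case, which is
// what the diff does for the Float8 variants.
template <typename F>
std::size_t dispatch_source_type(DType t, F&& f) {
  switch (t) {
    case DType::Byte:
      return f(TypeTag<std::uint8_t>{});
    case DType::Int:
      return f(TypeTag<std::int32_t>{});
    case DType::Float:
      return f(TypeTag<float>{});
    case DType::Double:
      return f(TypeTag<double>{});
  }
  throw std::invalid_argument("unsupported dtype");
}

// Example body: src_t plays the role of the type hint bound by the macro.
inline std::size_t element_size(DType t) {
  return dispatch_source_type(t, [](auto tag) -> std::size_t {
    using src_t = typename decltype(tag)::type;
    return sizeof(src_t);
  });
}
```

A quick check such as `assert(element_size(DType::Float) == sizeof(float));` exercises the dispatch. A dtype with no case falls through to the exception, which loosely corresponds to the "not implemented for" error a dispatch macro raises for an unlisted type.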
@ -410,10 +430,14 @@ void foreach_tensor_copy_list_kernel_cuda_(
std::vector<std::vector<at::Tensor>> tensor_lists{src.vec(), self.vec()};
AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3(
AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND7(
ScalarType::Half,
ScalarType::BFloat16,
ScalarType::Bool,
ScalarType::Float8_e4m3fn,
ScalarType::Float8_e4m3fnuz,
ScalarType::Float8_e5m2,
ScalarType::Float8_e5m2fnuz,
self[0].scalar_type(),
"foreach_tensor_copy",
[&]() {

View File

@ -1518,7 +1518,7 @@ Tensor& index_select_out_cuda(
out.scalar_type(),
"index_select_cuda",
AT_WRAP([&] { index_select_out_cuda_impl<scalar_t>(out, self, (uint64_t) dim, index); }),
AT_EXPAND(AT_ALL_TYPES_AND_COMPLEX), AT_EXPAND(AT_BAREBONES_UNSIGNED_TYPES),
AT_EXPAND(AT_ALL_TYPES_AND_COMPLEX), AT_EXPAND(AT_BAREBONES_UNSIGNED_TYPES), AT_EXPAND(AT_FLOAT8_TYPES),
kComplexHalf,
kHalf,
kBool,

Some files were not shown because too many files have changed in this diff.