Commit Graph

87740 Commits

Author SHA1 Message Date
7d39e73c57 Fix more URLs (#153277)
Or ignore them.
Found by running the lint_urls.sh script locally with https://github.com/pytorch/pytorch/pull/153246

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153277
Approved by: https://github.com/malfet
2025-05-14 16:23:50 +00:00
de92296bbb [Intel GPU] undo broadcast on zero stride tensor for SDPA (#151976)
Fix https://github.com/pytorch/pytorch/issues/152290.

The model **hubert** uses aten::expand to build its attention mask by broadcasting. PyTorch represents a broadcast dimension with strides[d]=0, which is not supported by oneDNN. This PR handles that scenario.
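A minimal illustration of the stride-0 broadcast pattern (general PyTorch behavior, not the PR's oneDNN-side fix):

```python
import torch

# expand() broadcasts without copying: every expanded dim gets stride 0.
mask = torch.zeros(1, 1, 128)       # e.g. an attention mask
expanded = mask.expand(8, 16, 128)  # what aten::expand produces
print(expanded.stride())            # (0, 0, 1) -- zero strides

# A backend that cannot consume stride-0 tensors (oneDNN here) needs the
# broadcast undone, e.g. by materializing the tensor:
dense = expanded.contiguous()
print(dense.stride())               # (2048, 128, 1)
```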

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151976
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/drisspg
2025-05-14 16:09:03 +00:00
1f48bab377 Update torch-xpu-ops commit pin (#153445)
Update the torch-xpu-ops commit pin to 207105038963e5f9f012f1a0cfd3b9f57b2ab5b0, which includes:

- Improve the accuracy of `upsample_bilinear2d_backward`
- Enhance the performance of `avg_pool2d`
- Update the implementation of scatter-gather and indexing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153445
Approved by: https://github.com/guangyey, https://github.com/EikanWang
2025-05-14 15:34:47 +00:00
2e440e39a6 [nativert] Move Placement to pytorch core (#152953)
Summary:
Move Placement to pytorch core.

Using `torch::nativert::isSameDevice` explicitly in code to avoid confusion with the `isSameDevice` in the torch namespace.

Test Plan:
```
buck run fbcode//mode/dev-nosan  //caffe2/test/cpp/nativert:placement_test

./bin/test_nativert
```

OSS and internal CI

Differential Revision: D74190745

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152953
Approved by: https://github.com/Skylion007, https://github.com/swolchok, https://github.com/zhxchen17, https://github.com/cyyever
2025-05-14 15:26:54 +00:00
eqy
ced90d23d3 [CUDA][CUDNN] Dispatch to cuDNN for non-batch-splittable 64-bit NCHW convolutions (#153101)
For #152816

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153101
Approved by: https://github.com/Skylion007
2025-05-14 15:22:47 +00:00
0ce941f994 [audio hash update] update the pinned audio hash (#153507)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153507
Approved by: https://github.com/pytorchbot
2025-05-14 15:16:35 +00:00
cd119ddd7c Add matching against hypothetical (new) ghstack pull-request trailer (#153528)
I would like to change ghstack to use a new trailer
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153528
Approved by: https://github.com/malfet
2025-05-14 14:07:01 +00:00
8f3d7972ad [dynamo][compile-time] Cache the function signature to speedup inlining (#153396)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153396
Approved by: https://github.com/jansel, https://github.com/StrongerXi
ghstack dependencies: #153333
2025-05-14 14:01:46 +00:00
2344eca5eb Revert "Fix skipIfXpu and skipIfHpu disables tests when used on class (#151315)"
This reverts commit ee096b89f63394b2c18826288783eef241f3959c.

Reverted https://github.com/pytorch/pytorch/pull/151315 on behalf of https://github.com/jeanschmidt due to Seems to have introduced internal regressions, see [D74668899](https://www.internalfb.com/diff/D74668899). @malfet may you help the author get this PR merged? ([comment](https://github.com/pytorch/pytorch/pull/151315#issuecomment-2880203323))
2025-05-14 13:15:03 +00:00
2c1912452d Revert "Rewrite autograd producer consumer stream sync logic (#151079)"
This reverts commit f78e4529a9d446deb77c6ac38184582f6ab9167a.

Reverted https://github.com/pytorch/pytorch/pull/151079 on behalf of https://github.com/jeanschmidt due to Seems to have introduced regressions in internal signals, see [D74648937](https://www.internalfb.com/diff/D74648937) ([comment](https://github.com/pytorch/pytorch/pull/151079#issuecomment-2880176879))
2025-05-14 13:07:12 +00:00
a628efd1e8 Revert "Enable accelerator to perform streaming backward (#153412)"
This reverts commit d5d26ce43641a19c3e36a751b59b7fa3825cea83.

Reverted https://github.com/pytorch/pytorch/pull/153412 on behalf of https://github.com/jeanschmidt due to Need to revert in order to revert https://github.com/pytorch/pytorch/pull/151079 ([comment](https://github.com/pytorch/pytorch/pull/153412#issuecomment-2880169739))
2025-05-14 13:04:27 +00:00
e8f7a97e2e [Refactor] Explicitly spell out the namespace for device() function (#153248)
Summary: To prepare for the upcoming header-only file change. The same files have been using a mixed style of at::device() and device(). Given these .cpp files are not in the at namespace, it makes sense to spell the namespace out explicitly.

Differential Revision: [D74577412](https://our.internmc.facebook.com/intern/diff/D74577412)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153248
Approved by: https://github.com/cyyever, https://github.com/albanD, https://github.com/janeyx99
2025-05-14 12:00:47 +00:00
0ef5ba43a6 Fix negative dim issue in the parallel loss context manager (#152785)
This hits a similar issue to #152016; the fix follows @tianyu-l's solution.
Fixes #152016

Tagging @tianyu-l @atalman for review.
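For context, a minimal sketch of the usual fix for this class of bug (a hypothetical helper; the actual change follows @tianyu-l's suggestion in #152016):

```python
def normalize_dim(dim: int, ndim: int) -> int:
    # Map a negative dim such as -1 to its positive equivalent so that
    # downstream logic sees a canonical dimension index.
    return dim + ndim if dim < 0 else dim

assert normalize_dim(-1, 3) == 2
```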

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152785
Approved by: https://github.com/tianyu-l
2025-05-14 10:43:27 +00:00
864a5f4434 [dynamo][compile-time] Cache the cleaned instructions while inlining (#153333)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153333
Approved by: https://github.com/StrongerXi, https://github.com/jansel, https://github.com/williamwen42
2025-05-14 09:26:26 +00:00
0139ce9303 Add skip_dtype_check_in_meta_registrations config to torch/fx/experimental/_config (#153513)
Helion relies on torch/fx/experimental's fake_tensor tracing but does its own dtype checking, which conflicts with some meta kernels' existing dtype checks. This PR adds a config so that we skip those dtype checks in meta kernels and rely on the calling system to do the dtype checking.

Currently it only applies to `baddbmm`, but I expect similar changes will need to be made to other meta kernels in the future.
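A sketch of how a caller such as Helion might flip the flag (usage assumed from the module path in the title):

```python
from torch.fx.experimental import _config as fx_experimental_config

# With this set, meta kernels that honor the flag (currently baddbmm)
# skip their dtype checks and defer validation to the calling system.
fx_experimental_config.skip_dtype_check_in_meta_registrations = True
```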

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153513
Approved by: https://github.com/jansel
2025-05-14 09:14:11 +00:00
4015166e5d [ROCm] Maxpool backward NHWC Perf Improvement targeting Resnet scenarios (#152267)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152267
Approved by: https://github.com/jeffdaily
2025-05-14 06:59:29 +00:00
4c5cf18ee0 [device_mesh] improve device selection logic (#150897)
As titled, this PR improves the device selection logic when the user did not
set the device before calling the DeviceMesh constructor. As a device
manager, DeviceMesh should try to set the device for users in a sensible
way.

The set_device behavior before:

* If the user calls init_process_group to init a world process group, we assume they already called set_device, so we don't set the device for them.
* If the user does not init a world process group themselves, we init one for them and follow a heuristic to set the device. This is OK, but sometimes the set_device heuristic doesn't work well (e.g. if the user sets TORCH_CUDA_VISIBLE_DEVICES).

So this PR improves the device selection logic to the following (see the sketch below):

* If the default CUDA context is initialized by the time we init DeviceMesh, we assume the user has already run some CUDA operation and therefore selected the device themselves.
* Otherwise, we check whether the launcher (i.e. torchrun) set the "LOCAL_RANK" and "WORLD_SIZE" env vars; if so, we use "LOCAL_RANK" to set the device for the current process, which is standard practice. (This solves the TORCH_CUDA_VISIBLE_DEVICES issue.)
* Otherwise, we warn the user about the situation and fall back to the old heuristic.
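A condensed sketch of that selection order (a hypothetical helper, not the actual DeviceMesh code):

```python
import os
import warnings

import torch

def _select_device_sketch():
    # 1. A CUDA context already exists: trust the user's earlier choice.
    if torch.cuda.is_initialized():
        return torch.cuda.current_device()
    # 2. Launched via torchrun: LOCAL_RANK is the standard device mapping.
    if "LOCAL_RANK" in os.environ and "WORLD_SIZE" in os.environ:
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)
        return local_rank
    # 3. No hint available: warn and fall back to the old heuristic.
    warnings.warn("no device hint found; falling back to the old heuristic")
    return None
```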
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150897
Approved by: https://github.com/tianyu-l
ghstack dependencies: #150898
2025-05-14 06:29:16 +00:00
0f891cad5a Enable ruff check for torch/utils/data/*.ipynb (#148654)
Fixes part of #146411

Enable ruff check for `torch/utils/data/*.ipynb` files

## Test Result

```bash
lintrunner -a --take RUFF torch/utils/data/*.ipynb
```

![image](https://github.com/user-attachments/assets/88fddc91-3f19-4704-9aef-2cabd2cdc96e)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148654
Approved by: https://github.com/Skylion007
2025-05-14 06:21:47 +00:00
f7798d8645 Checks kv pair indexing in OrderedPreservingDictTest.test_range_insert (#148136)
`OrderedPreservingDictTest.test_range_insert` has an [unused loop variable `j`](https://github.com/pytorch/pytorch/blob/main/c10/test/util/ordered_preserving_dict_test.cpp#L186), likely taken from the range-insert test case of the [project that inspired it](https://github.com/pytorch/pytorch/blob/main/c10/test/util/ordered_preserving_dict_test.cpp#L165), which [checks kv pair indexing/order](https://github.com/Tessil/ordered-map/blob/master/tests/ordered_map_tests.cpp#L136) for the ordered dict.

This just adds that check to the test case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148136
Approved by: https://github.com/eellison
2025-05-14 06:05:23 +00:00
11c64b7cf8 [dynamo][compile-time] Cache whether a function is inlineable (#153192)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153192
Approved by: https://github.com/StrongerXi, https://github.com/jansel, https://github.com/williamwen42
ghstack dependencies: #153458
2025-05-14 05:40:25 +00:00
e2ce17c6ef [SymmMem][a2av] Use more CTAs for intra-node case (#153509)
Previously, we launched the a2av kernel with at most 8 blocks in intra-node cases, which turns out to saturate only 57 GB/s of bandwidth.

This PR adds more blocks for intra-node cases, up to 8 per peer, pumping up data parallelism. The kernel now achieves 350 GB/s SOL on Hopper. See figure.

It also uses a simple input-size-based tuning to avoid jumping to 8 CTAs directly (i.e. 1, 2, 4, then 8).

For inter-node, we keep the cap at 8 blocks, since 57 GB/s already exceeds regular NIC bandwidths (400 Gb/s ≈ 50 GB/s).
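A sketch of the size-based ramp (thresholds here are illustrative assumptions, not the kernel's actual cutoffs):

```python
def blocks_per_peer(bytes_per_peer: int) -> int:
    # Ramp 1 -> 2 -> 4 -> 8 CTAs per peer with input size instead of
    # jumping straight to 8; inter-node stays capped at 8 blocks total.
    for blocks, threshold in ((1, 64 << 10), (2, 256 << 10), (4, 1 << 20)):
        if bytes_per_peer <= threshold:
            return blocks
    return 8
```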

![all_to_all_vdev Performance on 8xH100](https://github.com/user-attachments/assets/d4b841e6-4c42-4a2e-aa9f-2bc116ba9d25)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153509
Approved by: https://github.com/ngimel
ghstack dependencies: #153483
2025-05-14 04:24:32 +00:00
20dbe644c7 [CD] Fix the libgomp twice load issue (#150084)
Fixes #149422

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150084
Approved by: https://github.com/malfet, https://github.com/leslie-fang-intel, https://github.com/atalman

Co-authored-by: LifengWang <lifeng.a.wang@intel.com>
2025-05-14 04:06:18 +00:00
316c15297c [MemoryZ] Show the current and max entries rendered (#153446)
Summary: as title

Test Plan: {F1977904091}

Differential Revision: D74626081

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153446
Approved by: https://github.com/sraikund16
2025-05-14 03:16:12 +00:00
c797f1285c [dynamo][compile-time] Handle builtins first in LOAD_GLOBAL (#153458)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153458
Approved by: https://github.com/jansel
2025-05-14 03:04:38 +00:00
33a5179269 [AOTI][reland2] Remove typedef for half and bfloat16 (#153467)
Summary:
Reland https://github.com/pytorch/pytorch/pull/151109 after fixing cutlass AOTI build issues.

typedefs are prone to name collisions. Explicitly spell out the actual aten types, as needed for the standalone AOTI codegen.

Differential Revision: D74398762

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153467
Approved by: https://github.com/jingsh, https://github.com/henrylhtsang, https://github.com/cyyever
2025-05-14 02:37:18 +00:00
9ad9a04ca7 Add TensorLR variant for fused Adagrad on CPU (#153078)
This PR adds a tensor LR variant for the CPU Adagrad(fused=True).

I copied the behavior from the tensor LR variant of CPU Adam(fused=True), where `lr.item()` is cast to a double and passed to the default implementation.
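A minimal sketch of what this enables (assuming this PR's support for a 0-dim tensor LR on fused CPU Adagrad):

```python
import torch

param = torch.randn(4, requires_grad=True)
param.grad = torch.randn(4)

# A 0-dim tensor LR: internally lr.item() is cast to a double and
# forwarded to the default fused kernel, mirroring fused Adam on CPU.
lr = torch.tensor(0.01)
opt = torch.optim.Adagrad([param], lr=lr, fused=True)
opt.step()
```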
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153078
Approved by: https://github.com/janeyx99
2025-05-14 02:23:33 +00:00
d51bc27378 [export] Make draft_export public (#153219)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153219
Approved by: https://github.com/pianpwk
2025-05-14 02:18:36 +00:00
b15b870903 [BE] remove outdated torch/README.md (#153500)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153500
Approved by: https://github.com/albanD, https://github.com/cyyever
2025-05-14 02:10:30 +00:00
d759a517af Update the heuristic for AArch64 bmm/baddbmm (#149122)
Updates the heuristic for bmm/baddbmm and consolidates all heuristic logic in a single location.

 - The goal of the consolidation is to improve maintainability and readability of the heuristic logic. Instead of different parts scattered across two files, this patch centralizes everything inside `Matmul.cpp`, where there already exists heuristic-based selection for mkldnn.
 - The logic of the check itself doesn't change (existing code is reused where possible), but a separate heuristic threshold for bmm/baddbmm is introduced based on newer benchmarking data. Use the script below to see the performance improvement for bmm from the new heuristic:
 ```
import torch
import time

# Set below to True to use cases selected by only one of the heuristics.
USE_ONLY_DIVERGENT_TEST_CASES = True
BATCH_SIZES  = [ 1, 8, 32, 64, 128, 256 ]
M_DIMS       = [ 4, 8, 16, 32, 64, 256, 512 ]
N_DIMS       = [ 4, 8, 16, 32, 64, 256, 512 ]
K_DIMS       = [ 4, 8, 16, 32, 64, 256, 512 ]
ITERS = 50

def old_heuristic(m, n, k):
    is_above_min_dims = m > 8 and n > 8 and k > 8
    is_above_min_size = m*n*k > 8_192
    return is_above_min_dims and is_above_min_size

def new_heuristic(b, m, n, k):
    return b*b*m*n*k >= 4_194_304

def generate_test_cases():
    test_cases = []
    for b in BATCH_SIZES:
        for m in M_DIMS:
            for n in N_DIMS:
                for k in K_DIMS:
                    if USE_ONLY_DIVERGENT_TEST_CASES:
                        if old_heuristic(m, n, k) != new_heuristic(b, m, n, k):
                            test_cases.append([b, m, n, k])
                    else:
                        test_cases.append([b, m, n, k])
    return test_cases

def test(x, y):
    for _ in range(5):
        torch.bmm(x, y)
    perf = 0.0
    for _ in range(ITERS):
        start = time.time()
        torch.bmm(x, y)
        end = time.time()
        perf += (end - start) / ITERS
    return perf

def main():
    print(f"{'b':<10}{'m':<10}{'n':<10}{'k':<10}{'time (s)':10}")
    cumulative_mean_time = 0.0
    for b, m, n, k in generate_test_cases():
        mean_time = test(torch.rand(b, m, n), torch.rand(b, n, k))
        cumulative_mean_time += mean_time
        print(f"{b:<10}{m:<10}{n:<10}{k:<10}{mean_time:10.3e}")
    print(f"Cumulative mean time = {cumulative_mean_time:.4f} s")

if __name__ == "__main__":
    main()
```

From the script we see that cumulative mean time from all test cases (at 16 threads) is:
 - 1.6195 s for the old heuristic
 - 0.7012 s for the new heuristic

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149122
Approved by: https://github.com/fadara01, https://github.com/aditew01, https://github.com/malfet
2025-05-14 02:03:50 +00:00
e8662e836a Remove std::is_arithmetic specialization from c10/util/strong_type.h (#153424)
Specializing std::is_arithmetic has undefined behavior (and breaks builds with -Winvalid-specialization). Should fix #150901

Differential Revision: [D74614724](https://our.internmc.facebook.com/intern/diff/D74614724/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153424
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-05-14 02:01:32 +00:00
clr
85f97b5a8c compile_fx: make a compile event that corresponds to the fx_compile waitcounter (#152983)
This is a pretty minor change, but by having exact correspondence, we can
easily confirm data differences between Perfetto and wait counters.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152983
Approved by: https://github.com/jansel, https://github.com/masnesral
2025-05-14 01:54:42 +00:00
90001554bf [SymmMem][a2av] Fix TODO: change stride unit (#153483)
The previous kernel implementation assumed float elements. This PR makes it general by passing strides in units of bytes.
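For illustration, the dtype-agnostic conversion this implies (general PyTorch behavior, not the kernel source):

```python
import torch

t = torch.empty(8, 16, dtype=torch.bfloat16)
stride_elems = t.stride(0)                      # 16, counted in elements
stride_bytes = stride_elems * t.element_size()  # 32, valid for any dtype
```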

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153483
Approved by: https://github.com/fegin, https://github.com/ngimel
2025-05-14 01:47:54 +00:00
eqy
9386701b51 [cuDNN][SDPA] cuDNN SDPA refactor/cleanup, nested tensor backward, test priority bump for sm90, sm100 (#149282)
Clean up tuple/tensor boilerplate in cuDNN SDPA, in preparation for nested/ragged tensor backward.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149282
Approved by: https://github.com/drisspg
2025-05-14 01:39:24 +00:00
8521a690f7 [dynamo] fix potential circular import error in decorators.py (#153217)
Differential Revision: [D74442043](https://our.internmc.facebook.com/intern/diff/D74442043)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153217
Approved by: https://github.com/jansel
2025-05-14 01:01:57 +00:00
e6a9067260 [ROCm] Maxpool forward NHWC Perf Improvement targeting Resnet scenarios (#151727)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151727
Approved by: https://github.com/jeffdaily

Co-authored-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com>
2025-05-14 00:58:00 +00:00
7f79222992 Upgrade to NCCL 2.26.5 for CUDA 12 (#152810)
Upgrade NCCL to latest 2.26.5

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152810
Approved by: https://github.com/eqy, https://github.com/albanD, https://github.com/nWEIdia, https://github.com/atalman, https://github.com/cyyever
2025-05-14 00:52:50 +00:00
8739a8c288 elastic: do not shutdown rendezvous on leaving workers (#152525)
In #117066, shutting down the rendezvous was added when a worker shuts down. This is incorrect, because the rendezvous is actually shut down in [this file](fa6f9eb2be/torch/distributed/launcher/api.py (L290)), but it should not be shut down when a signal is received. See also [this pull request](https://github.com/pytorch/pytorch/pull/67749).

#124819 then tried to remediate the situation by fixing the faulty shutdown for the restart case. But that fix only triggers if the agent restarts the training, not if the rendezvous was already shut down before.

Removing both these changes restores the original behavior. The rendezvous should only be shut down when a run completes or fails, not when a single worker leaves.

Fixes #150916
Fixes #147064

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152525
Approved by: https://github.com/kiukchung
2025-05-14 00:44:10 +00:00
8ac82c3e72 [export] support functools.partial forward (non-strict) (#153408)
Fixes #153086
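A minimal sketch of the now-supported pattern (hypothetical module; non-strict mode per the title):

```python
import functools

import torch

class M(torch.nn.Module):
    def forward(self, x, scale=1.0):
        return x * scale

m = M()
# Binding an argument of forward via functools.partial used to break
# export; non-strict export now handles it.
m.forward = functools.partial(m.forward, scale=2.0)
ep = torch.export.export(m, (torch.randn(3),), strict=False)
```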

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153408
Approved by: https://github.com/tugsbayasgalan
2025-05-13 23:30:13 +00:00
40b719c97d [nativert] move executor config to torch (#153087)
Summary:
nativert RFC: https://github.com/zhxchen17/rfcs/blob/master/RFC-0043-torch-native-runtime.md

To land the runtime into PyTorch core, we will gradually land logical parts of the code into the Github issue and get each piece properly reviewed.

This diff moves the executor config to torch. Since it's header-only, this requires some changes to the libtorch build configs.

Test Plan: CI

Differential Revision: D74278789

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153087
Approved by: https://github.com/zhxchen17
2025-05-13 23:26:00 +00:00
3498201e57 GPU lowering uses aoti_call_delegate (#153282)
Summary: Skip custom objects when serializing the weight nodes of the `aoti_call_delegate` hop, as they are not consumed by the runtime.

Test Plan: CI

Reviewed By: SherlockNoMad

Differential Revision: D73704385

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153282
Approved by: https://github.com/dolpm, https://github.com/SherlockNoMad
2025-05-13 23:23:27 +00:00
81719ebde3 [caffe2] Make c10::str works with scoped enum (#152705) (#152714)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/152705

Test Plan:
```
buck2 test fbcode//caffe2/c10/test:util_base_tests --fail-fast
```

Differential Revision: D74087796

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152714
Approved by: https://github.com/Skylion007
2025-05-13 21:05:36 +00:00
e8596c291b Fix misleadingly high AOT Inductor dashboard performance (#153060)
Fixes misleadingly high AOTInductor performance benchmark numbers in scenarios where a model updates internal parameters during `torch.export.export`. Since `FakeTensorMode` is enabled during export, all such parameters become `FakeTensor`s, substantially slowing down future eager-mode runs of that model. This, in turn, produces misleading performance stats, where the slowness of eager mode makes `AOTInductor` look _very_ good.

An [example benchmark](https://hud.pytorch.org/benchmark/timm_models/inductor_aot_inductor?dashboard=torchinductor&startTime=Wed%2C%2030%20Apr%202025%2015%3A54%3A04%20GMT&stopTime=Wed%2C%2007%20May%202025%2015%3A54%3A04%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=main&lCommit=1dd36ad2d440a4f3faf724b3a8e13925e3180c24&rBranch=main&rCommit=cc7346bf19c019255dcb4484694a75850ed74d5a&model=convit_base) with this issue. The equivalent `cpp_wrapper` benchmark run shows a 2x performance gain, not 20x.

Only two benchmarks we regularly run are affected by this, both in the TIMM set.
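A hypothetical sketch of the failure mode described above (stashing a traced value on the module during export):

```python
import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.cache = {}

    def forward(self, x):
        # Export traces under FakeTensorMode, so whatever gets stashed
        # here during tracing is fake, not a real tensor.
        self.cache["last"] = x.sum()
        return x + 1

m = M()
torch.export.export(m, (torch.randn(4),), strict=False)
print(type(m.cache["last"]))  # a fake tensor -- later eager runs slow down
```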

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153060
Approved by: https://github.com/desertfire
2025-05-13 20:59:59 +00:00
a13c8f2ecb [EZ/Profiler] Replace manual GIL calls with pybind GIL calls (#153415)
Summary: Use pybind11::gil_scoped_acquire instead of the old implementation, as it automatically takes care of error handling. In the original implementation we missed releasing the GIL on some error paths, which could deadlock the program.

Test Plan: Induced error manually and saw that GIL was released

Differential Revision: D74593564

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153415
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-05-13 20:47:52 +00:00
5ff2cb8587 Add justknobs for static cuda launcher (#153400)
Summary:
This diff adds a justknobs check for the static cuda launcher. In particular, it supports a fractional rollout where each MAST job/version is consistently enrolled with the config on or off.

It also adds a set_feature_use so we can track whether static cuda launcher is enabled on a given dynamo compile.

Test Plan: Existing unit tests. The justknobs in question are set to be disabled right now, so this diff does not launch the feature yet.

Differential Revision: D74599203

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153400
Approved by: https://github.com/oulgen
2025-05-13 20:10:13 +00:00
clr
20ba8fe7e6 inductor: Log a pt2 compile event + waitcounter for node fusing. (#153270)
This appears to be slow in production (potentially a quadratic explosion), and
logging it explicitly in pt2_compile_events and wait_counters makes it a lot
easier to see how bad an issue this is.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153270
Approved by: https://github.com/masnesral
2025-05-13 19:02:36 +00:00
8ac82a1d20 [dynamo] Add test to ensure we don't print fx graph upon data dependent graph break (#153416)
This adds a regression test for #149831, also as part of getting that fix
cherry-picked into 2.7.1.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153416
Approved by: https://github.com/atalman
2025-05-13 18:28:02 +00:00
9df9d9ded0 [device_mesh] replace dim_group_info with group_name (#150898)
As titled, there's no need to maintain a dim_group_info anymore; we can
simply maintain a list of group names instead. This simplifies the
logic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150898
Approved by: https://github.com/tianyu-l, https://github.com/fegin
2025-05-13 17:16:45 +00:00
9c3cef437c gloo: support ibverbs in cmake (#153425)
This updates the gloo submodule to a version that supports the new ibverbs backend, which can now be used with PyTorch.

Test plan:

```
sudo dnf install rdma-core-devel
USE_GLOO_IBVERBS=ON python setup.py develop
torchrun --nproc_per_node 2 ~/scripts/gloo_ibverbs_test.py
```

```py
"""
run with:

torchrun --nproc_per_node 2 ~/scripts/gloo_ibverbs_test.py
"""

import os

os.environ["GLOO_DEVICE_TRANSPORT"] = "IBVERBS"

import torch
import torch.distributed as dist

dist.init_process_group("gloo")

rank = dist.get_rank()

if rank == 0:
    device = "cpu"
else:
    device = "cuda"

print(device)

t = torch.full((10, 100), fill_value=(rank+1), device=device)
target = torch.full((10, 100), fill_value=3, device=device)

dist.all_reduce(t)

torch.testing.assert_close(t, target)

t = torch.full((10, 100), fill_value=(rank+1), device=device)

if rank == 0:
    dist.send(t, dst=1)
else:
    dist.recv(t, src=0)
    torch.testing.assert_close(t, torch.full_like(t, 1))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153425
Approved by: https://github.com/fduwjj
2025-05-13 17:09:00 +00:00
dde705864a Fix test broken by D73809989 (#153413)
Summary: I forgot to remove this unused field in D73809989.

Test Plan: `buck test 'fbcode//mode/opt' fbcode//caffe2/test:fbonly -- --exact 'caffe2/test:fbonly - test_compilation_metrics_logger_in_sync (caffe2.test.fb.test_fb.TestFBOnly)'`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153413
Approved by: https://github.com/c00w
2025-05-13 16:44:30 +00:00
216e28f7e9 [ca] run xfails up until their last passing backend (#153279)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153279
Approved by: https://github.com/jansel
ghstack dependencies: #153193, #153222
2025-05-13 16:42:10 +00:00