Compare commits


105 Commits

Author SHA1 Message Date
b1abd9ec11 Test myst-markdown in docstrings 2025-07-23 09:32:38 -07:00
00da8e63eb CI for Windows Arm64 (#148753)
This pull request adds a new CI workflow for Windows Arm64, named win-arm64-build-test.yml.
It can be triggered on any pull request by including the ciflow/win-arm64 tag.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148753
Approved by: https://github.com/malfet
2025-07-23 16:12:20 +00:00
576253c476 [math] Trace float.fromhex (#156976)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156976
Approved by: https://github.com/zou3519
ghstack dependencies: #156975, #156977
2025-07-23 16:12:08 +00:00
f5314f89c8 [struct] Add struct.pack and struct.unpack polyfills (#156977)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156977
Approved by: https://github.com/XuehaiPan, https://github.com/jansel
ghstack dependencies: #156975
2025-07-23 16:12:08 +00:00
671e22a951 [math] Raise exception in Dynamo if constant fold call fail (#156975)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156975
Approved by: https://github.com/zou3519
2025-07-23 16:12:08 +00:00
d3d9bc1c31 [inductor] Allow backends to register their own custom config object (#158254)
An out of tree backend can have its own configuration options that the user can enable to control inductor compilation. These config options need to be taken into account when calculating the key that is used to determine cache miss / hits. This PR allows out of tree backends to specify a custom config module that has the same type as `torch._inductor.config` that can be used to control codegen (in addition to the default config), and will be used when creating the cache key.
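As a rough sketch of what such a config module can look like (names and values here are hypothetical, and the registration hook added by this PR is not shown):

```py
# my_backend/config.py -- hypothetical out-of-tree backend config module,
# built the same way torch._inductor.config is.
import sys

from torch.utils._config_module import install_config_module

# Backend-specific codegen knobs that should participate in the cache key.
enable_fancy_fusion: bool = False
tile_size: int = 64

# Turns this module into the same config-module type as torch._inductor.config.
install_config_module(sys.modules[__name__])
```

The backend would then hand this module to Inductor through the registration added in this PR so that these values are folded into the cache key alongside the default config.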

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158254
Approved by: https://github.com/eellison
2025-07-23 15:56:06 +00:00
7d296d5c19 [aoti][mps] Enable more tests (#158703)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158703
Approved by: https://github.com/malfet, https://github.com/desertfire
ghstack dependencies: #158349, #158350, #158351
2025-07-23 15:38:56 +00:00
2a60b8fc97 [export][ez] Fix packaging (#158855)
Summary: as title; appears to be a typo fix.

Test Plan:
CI

Rollback Plan:

Differential Revision: D78758466

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158855
Approved by: https://github.com/henryoier
2025-07-23 15:36:14 +00:00
d898d0d437 [Precompile] Various small bugfixes, add CachingPrecompile to torchbench (#158847)
This PR addresses a few small bugfixes needed to make NanoGPT inference work, and also adds a new `--caching-precompile` argument to torchbench. With `--caching-precompile`, after every benchmark we save precompile artifacts to DynamoCache, allowing us to test caching precompile on all existing benchmarks.

The following bugfixes are in this PR to make all of this work:
- Fix global variables being pruned with DUPLICATE_INPUT guards. DUPLICATE_INPUT guards have additional vars from the second input, which we track with additional_local_vars, but we never tracked additional global variables. This fixes the issue. (See torch/_dynamo/guards.py changes)
- Return None from PrecompileContext.serialize() if no new dynamo compiles occurred. There's no reason to save artifacts (e.g. autotuning artifacts) if no dynamo_compile occurred, so we return None early. As a TODO, we may later want to support editing existing dynamo artifacts.
- log `dynamo_start` on CompilePackage.load: This is only needed so that tlparse doesn't ignore TORCH_TRACE logs generated when caching precompile hits. If there are no actual compiles, we never log a "dynamo_start" entry, which makes internal tlparse ignore the TORCH_TRACE file.

## Test Plan

After this PR, the following now works:
```
TORCH_LOGS=dynamo tlp python benchmarks/dynamo/torchbench.py --only nanogpt --performance  --inference --backend inductor  --caching-precompile --warm-start-latency
```
tlparse result (internal):
Cold Start (6 seconds):
https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpAWe0zD/dedicated_log_torch_trace_vk9nkp4m.log/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

Warm Start (~1 s):
https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpAWe0zD/dedicated_log_torch_trace_5l4iwrpm.log/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

The ~1 second of warm start can still be improved: the cost is mostly in starting up workers and Triton and initializing CUDA, much of which should not count toward compile time in real-world scenarios where these are already loaded before training begins.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158847
Approved by: https://github.com/zhxchen17
2025-07-23 15:06:54 +00:00
5998cd4eaa [MPS] Speedup torch.full for 1-byte types (#158874)
By using [`fillBuffer:range:value:`](https://developer.apple.com/documentation/metal/mtlblitcommandencoder/fillbuffer:range:value:?language=objc) rather than an MPSGraph op, which should be faster and also does not have the INT_MAX limit.

This in turn fixes the `test_index_put_accumulate_large_tensor_mps` test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158874
Approved by: https://github.com/dcci
2025-07-23 14:00:40 +00:00
57024913c4 Fix decorators skipping NCCL tests (#158846)
Avoid failures caused by tests exiting via `sys.exit` instead of `unittest.skip`.

In particular, it no longer starts the test (forking subprocesses) just to stop them again (killing the subprocesses), which is what happened when the skip was done in the test setup.

Using `unittest.skip` decorators avoids starting the test in the first place.
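A minimal sketch of the difference (illustrative test and condition, not the actual NCCL test code):

```py
import sys
import unittest

import torch

# Preferred: the decorator skips before setUp runs, so no subprocesses are
# ever forked for a test that cannot run on this machine.
@unittest.skipIf(torch.cuda.device_count() < 2, "requires at least 2 GPUs")
class NcclSmokeTest(unittest.TestCase):
    def test_allreduce(self):
        ...

# Problematic: the check runs inside the test body, after setup has already
# forked subprocesses, and sys.exit is reported as an exit, not a skip.
class NcclSmokeTestOld(unittest.TestCase):
    def test_allreduce(self):
        if torch.cuda.device_count() < 2:
            sys.exit(0)
```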

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158846
Approved by: https://github.com/Skylion007
2025-07-23 13:31:21 +00:00
ee72338f0c [Inductor] MSVC use pointer when generating temporary array pointer (#158913)
MSVC cannot implicitly convert a const iterator to a const pointer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158913
Approved by: https://github.com/desertfire

Co-authored-by: Xu Han <xu.han@outlook.com>
2025-07-23 13:19:11 +00:00
c665594c1e [AOTI] fix extract file failed on Windows. (#158702)
Changes:
1. Rename the zip index filename and keep it out of path normalization.
2. Normalize the output path for file extraction.

Extract files successful:
(screenshot attached to the PR showing the files being extracted successfully)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158702
Approved by: https://github.com/angelayi
2025-07-23 08:00:14 +00:00
255a04baf1 [pt2 event logging] send autotuning data for strides and hinted shapes (#158852)
Summary:
# Why

capture relevant data for offline lookup table generation

# What

report the hinted sizes not just the symbolic sizes

Test Plan:
```
buck2 run mode/opt scripts/coconutruben/torchmm:experiment 2>&1 | tee /tmp/epx040
```

This only validates that this change does not break anything, as the schema is not on scuba yet (not actualized)

Rollback Plan:

Reviewed By: stashuk-olek

Differential Revision: D77837548

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158852
Approved by: https://github.com/jingsh
2025-07-23 06:44:27 +00:00
1d302eaee8 [vllm] add vllm test base docker image (#158755)
# description
Add base docker image for vllm.

It seems like we use the base docker image for both the pytorch build and tests. Configure a base image for vllm against pytorch CI.

# Others
Added a README describing how the base docker images are used and how to add one; it also explains which file is the right one to modify.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158755
Approved by: https://github.com/seemethere, https://github.com/huydhn
2025-07-23 05:42:44 +00:00
a6b7bea244 [inductor] support linear & layer_norm unbacked (#155267)
### What
- Use `statically_known_true` over `guard_size_oblivious` in cases where we're checking an optimization path (as sketched below). Otherwise it would raise a data-dependent error (DDE) instead of letting us take the safe/slower path.
- For broadcast checks, use `fallback=False` if we encounter a DDE. Typically, unbackeds would be ≥2, and that falls in line with size-oblivious reasoning (i.e. when `size_oblivious=True`).
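
A minimal sketch of the first point (illustrative only; the real call sites in `_inductor/lowering.py` and `_inductor/ir.py` are more involved):

```py
from torch.fx.experimental.symbolic_shapes import statically_known_true

def pick_path(size_a, size_b):
    # statically_known_true returns False instead of guarding (or raising a
    # data-dependent error) when the comparison involves unbacked symbols,
    # so the optimization is skipped and the safe/slower path is taken.
    if statically_known_true(size_a == size_b):
        return "optimized path"
    return "safe path"
```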

### Example DDE
```
torch._inductor.exc.InductorError: LoweringException: GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq((u0//387), 1) (unhinted: Eq((u0//387), 1)).  (Size-like symbols: u0)

Caused by: (_inductor/lowering.py:488 in broadcast_symbolic_shapes)
```
```
torch._inductor.exc.InductorError: LoweringException: GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq((u0//387), 1) (unhinted: Eq((u0//387), 1)).  (Size-like symbols: u0)

Caused by: (_inductor/ir.py:2797 in create)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155267
Approved by: https://github.com/eellison
2025-07-23 05:42:01 +00:00
be72bcf828 [vllm hash update] update the pinned vllm hash (#158806)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158806
Approved by: https://github.com/pytorchbot
2025-07-23 04:41:53 +00:00
f80f97d192 [audio hash update] update the pinned audio hash (#158807)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158807
Approved by: https://github.com/pytorchbot
2025-07-23 04:39:50 +00:00
42a69f7c2b [MTIA Aten Backend] Migrate addmm.out / baddbmm.out / bmm.out (#158749)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

 Migrate addmm.out / baddbmm.out / bmm.out to in-tree.

Differential Revision: [D78578483](https://our.internmc.facebook.com/intern/diff/D78578483/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158749
Approved by: https://github.com/albanD, https://github.com/nautsimon
ghstack dependencies: #158748
2025-07-23 03:45:28 +00:00
b87471e66f [MTIA Aten Backend] Migrate addcdiv.out / addcmul.out / eq.Tensor_out / eq.Scalar_out (#158748)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

 Migrate addcdiv.out / addcmul.out / eq.Tensor_out / eq.Scalar_out to in-tree.

Differential Revision: [D78568103](https://our.internmc.facebook.com/intern/diff/D78568103/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158748
Approved by: https://github.com/albanD, https://github.com/nautsimon
2025-07-23 03:45:20 +00:00
f10e4430e2 [AOTI] normalize path and process model files. (#158705)
Continuation of https://github.com/pytorch/pytorch/pull/158702; split `zip_filename_str` from the real file path.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158705
Approved by: https://github.com/desertfire
2025-07-23 02:58:21 +00:00
2dccff7dcf [inductor] pass_fds not supported on Windows, skip them on Windows. (#158830)
(screenshot attached to the PR showing the failing unit tests)

Almost all UTs fail with `AssertionError: pass_fds not supported on Windows.`; let's skip them on Windows.
TODO: I will also debug and confirm `pass_fds` on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158830
Approved by: https://github.com/jansel
2025-07-23 02:24:35 +00:00
dec0d3101c [export] fix unbacked range deserialization (#158681)
Fixes https://github.com/pytorch/pytorch/issues/151809, by reading shape assertion nodes into ShapeEnv, and deferring instantiation of node example values, to be done node-by-node.

Differential Revision: D78588406

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158681
Approved by: https://github.com/ydwu4, https://github.com/avikchaudhuri
2025-07-23 02:13:11 +00:00
9df0f56597 Fix Triton GEMM templates with k=1 (#158650)
Thanks to @davidberard98 for much of the analysis here. For GEMMs of K=1, the hints, `tl.multiple_of` and `tl.max_contiguous` apply completely, as the indices to the loads are only dependent on `offs_m` and `offs_n`. For shapes like `(97x1), (1x97)`, this results in misaligned address errors, due to the fact that for all BLOCK_M and BLOCK_N sizes, the last tile is not a contiguous load. With K > 1 case, the hint is not as strict given the dependency on the k indices for the load as well. In the K=1 case, only `offs_m` and `offs_n` are used and broadcasted to the index shape.
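A small repro-style sketch of that shape class (assumes a CUDA device; `mode="max-autotune"` is what exercises the Triton GEMM templates):

```py
import torch

# K == 1 with M, N not multiples of 8: the case where the contiguity hints
# led to misaligned-address errors before this fix.
a = torch.randn(97, 1, device="cuda", dtype=torch.float16)
b = torch.randn(1, 97, device="cuda", dtype=torch.float16)

compiled_mm = torch.compile(torch.mm, mode="max-autotune")
torch.testing.assert_close(compiled_mm(a, b), a @ b, rtol=1e-3, atol=1e-3)
```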

One can say these hints are "wrong", but in various cases where the hints are wrong, such as with the shape `9999x4, 4x9999`, there is a substantial performance improvement with the hint.

For nice shapes with K=1, where M and N are multiples of 8 so that these hints are fine and there is no misaligned address, no performance regression is observed on H100:
(benchmark screenshot attached to the PR)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158650
Approved by: https://github.com/davidberard98
2025-07-23 02:05:57 +00:00
91602a9254 Cleanup old caffe2 scripts (#158475)
Testing on this one is grep-based: if I could find no reference to a script, I deleted it.
We can easily add any of these back if needed!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158475
Approved by: https://github.com/seemethere, https://github.com/huydhn, https://github.com/cyyever
2025-07-23 01:21:31 +00:00
cc372ad557 [aoti][mps] Improve tabbing in cpp generation (#158351)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158351
Approved by: https://github.com/desertfire, https://github.com/malfet
ghstack dependencies: #158349, #158350
2025-07-23 00:54:53 +00:00
84058d1179 [aoti][mps] Fix cpu kernel generation (#158350)
In the case where we have both MPS and CPU code which can be Inductor-compiled, we need to branch on the device -- this requires the device field to be correctly passed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158350
Approved by: https://github.com/malfet
ghstack dependencies: #158349
2025-07-23 00:54:53 +00:00
096dc35d77 [aoti][mps] Fix update constants buffer (#158349)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158349
Approved by: https://github.com/malfet
2025-07-23 00:54:52 +00:00
56d07d0bde Add merge_rules category for Dynamo; add guilhermeleobas (#158620)
Adds guilhermeleobas to merge_rules for Dynamo and functorch.
Guilherme has done good work on both of these subsystems and I am tired
of him approving my PRs and me not being able to merge them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158620
Approved by: https://github.com/anijain2305
2025-07-23 00:44:27 +00:00
39b54b78d7 [export] runtime asserts for while HOP subgraphs (#158467)
Differential Revision: D78431075

For #158366
- Calls runtime asserts pass for HOP subgraphs (in reenter_make_fx)
- For while_loop only (can be expanded), clones input tensors for subgraph tracing, so unbacked memos (item, nonzero, etc.) aren't reused

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158467
Approved by: https://github.com/ydwu4
2025-07-23 00:34:18 +00:00
3703dabe42 [ROCm] delete un-needed workaround for tensor.item() (#158486)
Deleting unused workaround per discussion here:
https://github.com/pytorch/pytorch/pull/158165#discussion_r2207968880

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158486
Approved by: https://github.com/jeffdaily, https://github.com/houseroad
2025-07-23 00:31:57 +00:00
d3f9107d68 Remove top limit for cpython version and fix lint appropriately. (#158853)
As per title.
Sorry for the churn in the main commit.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158853
Approved by: https://github.com/seemethere, https://github.com/Skylion007, https://github.com/jingsh, https://github.com/malfet, https://github.com/ZainRizvi
2025-07-22 23:59:00 +00:00
cab96b5879 [tests] Reduce sizes of unnecessarily large tensors to reduce OOM flakes (#158456)
Downsizes several tensors that were massively oversized to test the problem at hand, to reduce test flaking.

Fixes #126867

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158456
Approved by: https://github.com/desertfire
2025-07-22 23:41:48 +00:00
6100ed457c [ROCm] Improve Type Safety of C10_WARP_SIZE (#158271)
# Background

`C10_WARP_SIZE`, although always `32` on the CUDA platform, varies across different AMD GPUs.
Therefore, to refer to this value correctly, host code must use a variable instead of a macro-defined literal or a `constexpr int`.

This PR may cause more compiler errors for third-party code on AMD GPUs, which is intentional. Having a fixed `C10_WARP_SIZE` value in host code for AMD GPUs only defers a compile-time error to runtime.

This PR is recommended to be included as part of Release Notes to describe an API change for whoever uses this macro.

Users are recommended to use `C10_WARP_SIZE` directly, which adapts to the various scenarios, or to define a macro in terms of `C10_WARP_SIZE`. Assigning this macro to symbols shared by host/device code causes problems on the ROCm platform. (See the fix at `aten/src/ATen/native/cuda/layer_norm_kernel.cu` for a concrete example.)

# Behaviors

* If compiling with HIPCC (i.e `defined(__HIPCC__)`):
  + Define `C10_WARP_SIZE` to be non-`constexpr` `at::cuda::warp_size()` for host-compilation pass (as compared to `static constexpr int C10_WARP_SIZE = 1;` set in 04bd7e6850e8efec77994963ffee87549555b9c3)
  + Define `C10_WARP_SIZE` to be a function returning `constexpr int` `64` for `__GFX9__`, and `32` otherwise, for device-compilation pass
    - `__GFX8__` is also 64 but we do not support any GFX8 GPU.
* If not compiling with HIPCC:
  + Define `C10_WARP_SIZE` to be non-constexpr `at::cuda::warp_size()`

# `constexpr` variant for host code

For host-compilation cases where a `constexpr` value is needed for warp size (eg. launch bounds), use `C10_WARP_SIZE_STATIC`, which is defined as `64`. This macro follows the pre 04bd7e6850e8efec77994963ffee87549555b9c3 behavior of `C10_WARP_SIZE`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158271
Approved by: https://github.com/jeffdaily

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
2025-07-22 23:19:38 +00:00
badfebf29e Revert "[Inductor] Expose decomposeK knobs as envvars (#158745)"
This reverts commit eac777c4f46b381106f2f2b78fe05b506f8c558c.

Reverted https://github.com/pytorch/pytorch/pull/158745 on behalf of https://github.com/jeffdaily due to sorry but rocm CI is broken due to this PR ([comment](https://github.com/pytorch/pytorch/pull/158745#issuecomment-3105071170))
2025-07-22 23:04:16 +00:00
fc5a404eb1 [gtest][listing] fixing caffe2:verify_api_visibility - main (#158229)
Summary: Remove the custom main from this test file

Test Plan:
https://www.internalfb.com/intern/testinfra/testrun/9570149303161031

Rollback Plan:

Reviewed By: patskovn

Differential Revision: D78015676

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158229
Approved by: https://github.com/Skylion007
2025-07-22 22:45:28 +00:00
04a393507b Fused RMSNorm implementation (#153666)
Relevant #72643

Benchmarked versus unfused torch implementation and torch.compile implementation. Around 9x speedup vs unfused implementation on cuda and slightly faster vs inductor compile on 5090.

```py
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        norm_x = x.norm(2, dim=-1, keepdim=True)
        rms_x = norm_x * torch.rsqrt(torch.tensor(x.shape[-1], dtype=x.dtype))
        x_normed = x / (rms_x + self.eps)
        return self.scale * x_normed

def benchmark_rmsnorm_cuda(input_shape, normalized_dim, num_iterations=100, warmup_iterations=10, dtype=torch.float16):
    rms_norm_layer = torch.nn.RMSNorm(normalized_dim, device='cuda', dtype=dtype)
    input_data = torch.randn(input_shape, device='cuda', dtype=dtype)

    for _ in range(warmup_iterations):
        _ = rms_norm_layer(input_data)
    torch.cuda.synchronize()

    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    for _ in range(num_iterations):
        _ = rms_norm_layer(input_data)

    end_event.record()
    torch.cuda.synchronize()
    elapsed_time_ms = start_event.elapsed_time(end_event)
    avg_time_ms = elapsed_time_ms / num_iterations

    print(f"--- RMSNorm CUDA Benchmark ---")
    print(f"Input Shape: {input_shape}")
    print(f"Normalized Dimension: {normalized_dim}")
    print(f"Benchmark Iterations: {num_iterations}")
    print(f"--- Fused Implementation ---")
    print(f"Average Time per Iteration: {avg_time_ms:.4f} ms")
    print(f"Total Time for {num_iterations} Iterations: {elapsed_time_ms:.3f} ms")

    compiled_rms_norm = torch.compile(RMSNorm(dim=normalized_dim)).cuda()
    for _ in range(warmup_iterations):
        _ = compiled_rms_norm(input_data)
    torch.cuda.synchronize()

    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    for _ in range(num_iterations):
        _ = compiled_rms_norm(input_data)
    end_event.record()
    torch.cuda.synchronize()
    elapsed_time_ms = start_event.elapsed_time(end_event)
    avg_time_ms = elapsed_time_ms / num_iterations

    print(f"--- TorchCompile Implementation ---")
    print(f"Average Time per Iteration: {avg_time_ms:.4f} ms")
    print(f"Total Time for {num_iterations} Iterations: {elapsed_time_ms:.3f} ms")

    print("-" * 50)

if __name__ == '__main__':
    parameter_sets = [
        {'batch_size': 16, 'sequence_length': 256, 'hidden_features': 512, 'dtype': torch.float16},
        {'batch_size': 32, 'sequence_length': 512, 'hidden_features': 768, 'dtype': torch.float16},
        {'batch_size': 64, 'sequence_length': 1024, 'hidden_features': 1024, 'dtype': torch.float16},
        {'batch_size': 32, 'sequence_length': 512, 'hidden_features': 768, 'dtype': torch.float32},
        {'batch_size': 8, 'sequence_length': 2048, 'hidden_features': 2048, 'dtype': torch.float16},
    ]

    num_benchmark_iterations = 200
    num_warmup_iterations = 20

    for params in parameter_sets:
        batch_size = params['batch_size']
        sequence_length = params['sequence_length']
        hidden_features = params['hidden_features']
        data_type = params.get('dtype', torch.float16)

        shape = (batch_size, sequence_length, hidden_features)
        norm_dim_to_normalize = hidden_features

        print(f"Benchmarking with: BS={batch_size}, SeqLen={sequence_length}, Hidden={hidden_features}, DType={data_type}")
        benchmark_rmsnorm_cuda(input_shape=shape,
                               normalized_dim=norm_dim_to_normalize,
                               num_iterations=num_benchmark_iterations,
                               warmup_iterations=num_warmup_iterations,
                               dtype=data_type)
```

Here are the triton compile tests ran on a 5090 (comparing this branch vs main)
```py
import torch
import torch.nn as nn
from torch._inductor.utils import run_and_get_code, run_fw_bw_and_get_code

torch.manual_seed(0)

device = torch.device("cuda")

for batch in range(0, 9):
    for i in range(9, 16):
        normalized_shape_arg = (2**batch, 2**i)
        input_tensor = torch.randn(2**batch, 2**i, device=device, requires_grad=True)
        weight_tensor = torch.randn(2**batch, 2**i,device=device, requires_grad=True)

        model = torch.nn.functional.rms_norm
        compiled_model = torch.compile(model)
        loss = torch.randn_like(input_tensor)

        num_iter = 5
        for j in range(num_iter):
            output = compiled_model(input_tensor, normalized_shape_arg, weight_tensor)
            output.backward(loss)

        start_event = torch.cuda.Event(enable_timing=True)
        end_event = torch.cuda.Event(enable_timing=True)
        start_event.record()
        num_iter = 10
        for j in range(num_iter):
            output = compiled_model(input_tensor, normalized_shape_arg, weight_tensor)
            output.backward(loss)

        end_event.record()
        torch.cuda.synchronize()

        elapsed_time_ms = start_event.elapsed_time(end_event)
        avg_time_ms = round(elapsed_time_ms / num_iter, 5)
        print(2**batch, 2**i, avg_time_ms)
```
main
```
32 512 0.1812
32 1024 0.19021
32 2048 0.18871
32 4096 0.17019
32 8192 0.21944
32 16384 0.38871
32 32768 0.83282
64 512 0.14705
64 1024 0.13987
64 2048 0.14111
64 4096 0.21699
64 8192 0.43141
64 16384 0.90652
64 32768 2.18573
128 512 0.19361
128 1024 0.1963
128 2048 0.20122
128 4096 0.38888
128 8192 0.93795
128 16384 2.23437
128 32768 5.50079
256 512 0.16722
256 1024 0.22856
256 2048 0.39421
256 4096 0.96621
256 8192 2.48746
256 16384 5.53571
256 32768 11.97932
```
current branch
```
32 512 0.16328
32 1024 0.18104
32 2048 0.15508
32 4096 0.14356
32 8192 0.20111
32 16384 0.45974
32 32768 0.94799
64 512 0.16874
64 1024 0.18701
64 2048 0.16107
64 4096 0.20152
64 8192 0.46568
64 16384 0.96599
64 32768 2.21661
128 512 0.14982
128 1024 0.15565
128 2048 0.22241
128 4096 0.46128
128 8192 0.88883
128 16384 2.3097
128 32768 5.84448
256 512 0.14346
256 1024 0.2007
256 2048 0.45927
256 4096 0.87876
256 8192 2.10571
256 16384 5.73948
256 32768 12.98581
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153666
Approved by: https://github.com/ngimel, https://github.com/albanD
2025-07-22 22:25:44 +00:00
a626dc8f16 [AOTI] windows package load dev (#158671)
Changes:
1. Add an extract-file failure handler for Windows development.
2. Normalize more file paths.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158671
Approved by: https://github.com/angelayi, https://github.com/desertfire
2025-07-22 21:35:57 +00:00
fd47401536 [doc] Updates to distributed.md for XCCL backend (#155834)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155834
Approved by: https://github.com/guangyey, https://github.com/AlannaBurke, https://github.com/d4l3k

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-07-22 21:01:43 +00:00
e44e05f7ae [dynamo] Move skipIf decorator to class level in test_fx_graph_runnable (#157594)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157594
Approved by: https://github.com/xmfan
ghstack dependencies: #157162
2025-07-22 20:41:49 +00:00
ddd74d10fc More fixes to MakeTensor::computeStorageSize() (#158813)
Followup after https://github.com/pytorch/pytorch/pull/158690 that fixes similar logic when `strides` are not explicitly specified.
Expanded testing to cover both cases

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158813
Approved by: https://github.com/ZainRizvi, https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #158690
2025-07-22 20:36:12 +00:00
823e223893 [ROCm] logsumexp on ROCm needs scaling back to natural base. (#156903)
Fixes #156012

This is a temporary solution that makes context parallelism work until the logsumexp behavior changes land in AOTriton.

After discussion we are not going to release AOTriton 0.10.1 to fix this due to
* Even if the interface is not changed, changing the behavior of returned logsumexp tensor should still be considered as an ABI break. Such changes do not fall into the "ABI compatible" category and should be postponed to next release.
* AOTriton 0.11 is scheduled to be released before the end of July, which is less than five weeks away

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156903
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-07-22 20:32:34 +00:00
6499420e45 [DeviceMesh] Make the repr shorter when debug ENV not set (#158822)
Users want a shorter repr, so this PR addresses that when TORCH_DISTRIBUTED_DEBUG is not set to DETAIL. Feedback and discussion are welcome. I found that torch.set_printoptions is global, so I am hesitant to use it.

Now the print looks like the screenshots attached to the PR (three shorter variants for different mesh configurations).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158822
Approved by: https://github.com/wz337, https://github.com/wconstab, https://github.com/xmfan
2025-07-22 20:31:44 +00:00
e17538022a Making input dynamically adjust. (#157324)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157324
Approved by: https://github.com/Skylion007, https://github.com/d4l3k
2025-07-22 20:14:05 +00:00
37ded2ac90 Using torch.accelerator in comm_mode_features_example.py and visualize_sharding_example.py (#157317)
Continuation of https://github.com/pytorch/pytorch/pull/153213  .

 @guangyey
 @kwen2501

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157317
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/d4l3k

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-07-22 19:58:48 +00:00
767791943d [ONNX] Set default opset to 20 (#158802)
Bump the default opset to 20, which is a newer opset and the maximum the TorchScript exporter supports.
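A minimal sketch of an export picking up the new default (passing `opset_version` explicitly still works as before):

```py
import torch

model = torch.nn.Linear(4, 2)
dummy = torch.randn(1, 4)

# With this change the TorchScript-based exporter defaults to opset 20;
# the explicit argument below is therefore optional.
torch.onnx.export(model, (dummy,), "linear.onnx", opset_version=20)
```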
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158802
Approved by: https://github.com/titaiwangms
2025-07-22 19:55:05 +00:00
c917c63282 [ROCm][tunableop] UT tolerance increase for matmul_small_brute_force_tunableop at FP16 (#158788)
TunableOp will sometimes find a less precise solution due to the small input vectors used in this UT. Bumping op tolerance to eliminate flakiness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158788
Approved by: https://github.com/jeffdaily
2025-07-22 19:45:35 +00:00
659bfbf443 Revert "We do support 3.14" (#158856)
Reverting to fix lint
This reverts commit 2a249f1967d29626fe6ac6a07f28440348d1cc93.

An emergency fix since the change needed to fix this is a little more complex than expected (see https://github.com/pytorch/pytorch/pull/158853 for reference)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158856
Approved by: https://github.com/Camyll, https://github.com/atalman
2025-07-22 19:40:53 +00:00
832ab990c9 Use init_device_mesh API for select tests where possible (#158675)
This addresses reviews made for:
#158538
#108749

It replaced all the specific DeviceMesh constructor calls with the API provided by the test cases, to improve abstraction.
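
For reference, a minimal sketch of the preferred API (illustrative mesh shape; assumes the distributed environment is set up for 8 ranks):

```py
from torch.distributed.device_mesh import init_device_mesh

# Instead of calling the DeviceMesh constructor directly in each test,
# build the mesh through the public helper.
mesh_2d = init_device_mesh("cuda", mesh_shape=(2, 4), mesh_dim_names=("dp", "tp"))
dp_mesh = mesh_2d["dp"]  # slice out a sub-mesh by dimension name
```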

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158675
Approved by: https://github.com/wconstab
2025-07-22 19:28:42 +00:00
56df025d51 Add caching for _rename_without_collisions (#158594)
Fixes #158357

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158594
Approved by: https://github.com/pianpwk
2025-07-22 19:19:13 +00:00
55ff4f85e9 [FP8][CUTLASS] xFail honor_sm_carveout on sm100 (#152378)
CUTLASS only supports SM carveout via green contexts on `sm100`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152378
Approved by: https://github.com/Skylion007, https://github.com/albanD, https://github.com/nWEIdia
2025-07-22 18:39:50 +00:00
7d2ceaff21 [dynamo] skip tracing functions registered in sys.monitoring (#158171)
Fixes https://github.com/pytorch/pytorch/issues/158164

This was fixed by applying `skip_code_recursive` to any function registered to `sys.monitoring` (via `PyThreadState_GET()->interp->monitoring_callables`). This check is done whenever we attempt to set the eval frame callback from Python.
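
For illustration, a minimal sketch of the kind of callable this covers (Python 3.12+ `sys.monitoring`; Dynamo now skips tracing such callbacks):

```py
import sys

TOOL_ID = sys.monitoring.PROFILER_ID

def on_py_start(code, instruction_offset):
    # Runs inside the interpreter's monitoring machinery; after this change
    # Dynamo applies skip_code_recursive to it instead of trying to trace it.
    return None

sys.monitoring.use_tool_id(TOOL_ID, "example_profiler")
sys.monitoring.register_callback(TOOL_ID, sys.monitoring.events.PY_START, on_py_start)
sys.monitoring.set_events(TOOL_ID, sys.monitoring.events.PY_START)
```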

Microbenchmark: `benchmarks/dynamo/microbenchmarks/overheads.py`:

BEFORE:
```
requires_grad=False
eager    7.1us (warmup=0.0s)
compiled 24.6us (warmup=10.0s)

requires_grad=True
eager    8.9us (warmup=0.0s)
compiled 57.8us (warmup=0.1s)

inference_mode()
eager    6.5us (warmup=0.0s)
compiled 23.4us (warmup=0.1s)
```

AFTER:
```
requires_grad=False
eager    7.0us (warmup=0.0s)
compiled 23.2us (warmup=15.2s)

requires_grad=True
eager    9.0us (warmup=0.0s)
compiled 55.1us (warmup=0.1s)

inference_mode()
eager    6.4us (warmup=0.0s)
compiled 22.2us (warmup=0.1s)
```

Followup thought: how do we let users know that a frame is skipped because the code object is a callable registered to sys.monitoring? (or any other reason?)

Differential Revision: [D78530528](https://our.internmc.facebook.com/intern/diff/D78530528)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158171
Approved by: https://github.com/jansel
2025-07-22 18:02:30 +00:00
2a249f1967 We do support 3.14
This was added a while back.
2025-07-22 10:40:18 -07:00
52c294008e [hop] allow non fake inputs when check input alias and mutation (#158798)
https://github.com/pytorch/pytorch/pull/154193 was reverted due to a test failure. The root cause: an executorch pass turns int inputs into a scalar tensor in cond's subgraph. The pass has been on the critical path of executorch for two years, and changing it would be difficult. So we just allow non-fake inputs when checking input mutation and aliasing, which shouldn't affect the correctness of the analysis.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158798
Approved by: https://github.com/pianpwk
2025-07-22 17:22:37 +00:00
0971637c11 Fix torch.tensor warning in ONNX symbolic_opset10 export (#158835)
Fix PyTorch tensor copying warning in ONNX export

## Problem

The PyTorch ONNX exporter was generating a warning about an incorrect tensor-copying method:

```
UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
```
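
For context, a minimal sketch of the pattern that triggers the warning and the recommended replacement:

```py
import torch

src = torch.tensor([1.0, 2.0, 3.0])

# Triggers the UserWarning quoted above:
copied = torch.tensor(src)

# Recommended replacements:
copied = src.clone().detach()
copied_grad = src.clone().detach().requires_grad_(True)
```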

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158835
Approved by: https://github.com/justinchuby
2025-07-22 16:32:49 +00:00
7d6f340238 Revert "[AOTI] Add more default options to compile_standalone (#158560)"
This reverts commit a991e285ae35159680b0ad4be24669906a6fa256.

Reverted https://github.com/pytorch/pytorch/pull/158560 on behalf of https://github.com/jeffdaily due to broke rocm CI, no test signal was available from rocm ciflow/trunk, need to add ciflow/rocm to reland ([comment](https://github.com/pytorch/pytorch/pull/158560#issuecomment-3103633964))
2025-07-22 16:20:17 +00:00
4060f30042 [AOTI] Convert C-struct zip handling to RAII container (#158687)
Attempts to fix a memory leak reported in #158614 by wrapping manually managed MiniZ C-structs in an RAII container. I have been unable to reproduce the reported leak, but this seems like the most likely candidate.

Fixes #158614 (hopefully)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158687
Approved by: https://github.com/desertfire
2025-07-22 16:01:51 +00:00
9a28e23d97 Revert "removed zero dim cpu logic from fake_tensor.py (#147501)"
This reverts commit 9e0473b56621162bd85e94943a516be4727e5651.

Reverted https://github.com/pytorch/pytorch/pull/147501 on behalf of https://github.com/ZainRizvi due to Seems to have broken ROCm. See inductor/test_aot_inductor_package.py::TestAOTInductorPackageCpp_cuda::test_compile_standalone_cos [GH job link](https://github.com/pytorch/pytorch/actions/runs/16428359564/job/46426243808) [HUD commit link](a991e285ae) ([comment](https://github.com/pytorch/pytorch/pull/147501#issuecomment-3103494041))
2025-07-22 15:45:34 +00:00
d0c00d9a69 [MPS] Do not crash if tensor dim > INT_MAX (#158824)
Looks like all MPS operations will crash if one of the tensor dimensions is greater than `2**31-1`.

Change it into a structured exception by checking the tensor size before attempting to create the MPS Tensor.

Add a regression test for it. Before this change, running the following would abort with an exception:
```
% python3 -c "import torch; torch.randint(0, 10, (2**31,), dtype=torch.uint8, device='mps')"
/AppleInternal/Library/BuildRoots/1c8f7852-1ca9-11f0-b28b-226177e5bb69/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Types/MPSNDArray.mm:829: failed assertion `[MPSNDArray initWithDevice:descriptor:isTextureBacked:] Error: NDArray dimension length > INT_MAX'
zsh: abort      python3 -c
```

Skip the test on MacOS-13, as it crashes somewhere deep in MPSGraph framework with
```
/AppleInternal/Library/BuildRoots/c651a45f-806e-11ed-a221-7ef33c48bc85/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Types/MPSNDArray.mm:724: failed assertion `[MPSTemporaryNDArray initWithDevice:descriptor:] Error: total bytes of NDArray > 2**32'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158824
Approved by: https://github.com/dcci
ghstack dependencies: #158690, #158823
2025-07-22 15:12:26 +00:00
371ffaf415 [bucketing] Support case of several pgs in graph (#158632)
Main changes:
- Bucket collectives only from the same process_group, keyed by group_name.
- Support groups like [0,2,4,6], [0,1,3,5] by using `rank_idx_dict` for in-pass operations such as slice indices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158632
Approved by: https://github.com/wconstab
2025-07-22 14:50:39 +00:00
1b772de397 Still run TritonBundler with BundledAOTAutogradCache, save autotune results (#158048)
When running BundledAOTAutogradCache with precompile, we still need to run triton bundling so that the precompiled CompiledFxGraph has triton cuda kernels. We also pre-save the autotune results in the precompile artifact.

It would be even better to pre-trim the cuda kernels on save and apply them, which we can work on later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158048
Approved by: https://github.com/zhxchen17
2025-07-22 14:12:21 +00:00
8e99714204 [EZ][BE][MPS] Remove unused ndArrayFromTensor (#158823)
And `printTensorNDArray`, both of which according to https://github.com/search?type=code&q=ndArrayFromTensor+org%3Apytorch are not used anywhere
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158823
Approved by: https://github.com/dcci
ghstack dependencies: #158690
2025-07-22 14:06:42 +00:00
9b4d938f04 [dynamo][fsdp] Consistent behavior of int attributes (#157262)
Reimpl of https://github.com/pytorch/pytorch/pull/150954

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157262
Approved by: https://github.com/bdhirsh
2025-07-22 11:26:54 +00:00
0142d5f4e2 Revert "Remove is_arvr_mode() from xnnpack.buck.bzl (#158682)"
This reverts commit f09a484b8164aaadd57a79354f0ccf47733f365e.

Reverted https://github.com/pytorch/pytorch/pull/158682 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/158682#issuecomment-3101648365))
2025-07-22 08:33:08 +00:00
91b69deeb0 [ROCm][CI] update fbgemm_gpu hash used by inductor tests (#158602)
fbgemm_gpu build started failing with asmjit errors.  Moving to latest tip of fbgemm for inductor tests resolves the build failures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158602
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-07-22 08:04:59 +00:00
392fa75411 Change from import trace to import config (#158796)
Summary:
For this particular instance, we're doing

`from torch._inductor.config import trace`

`...trace.provenance_tracking...`

but for all other call sites, we're doing

`from torch._inductor import config`
`...config.trace.provenance_tracking...`

Test Plan:
CI

Rollback Plan:

Differential Revision: D78699876

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158796
Approved by: https://github.com/c00w
2025-07-22 06:10:38 +00:00
3a67bf9c62 [PGNCCLx] Bring split and merge for PGNCCLx (#158790)
Summary: We added group split in D78300794 and remote_group_merge in D78450094. We first want to upstream this change to PGNCCLx as well so that NCCLx can use this new API and we can continue our c10d clean up in https://github.com/pytorch/pytorch/pull/158488.

Test Plan:
CI

```
buck test -c hpc_comms.use_ncclx=stable comms/ncclx/pg/tests:test_c10d_ncclx -- test_group_split_and_merge
```

Rollback Plan:

Differential Revision: D78521060

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158790
Approved by: https://github.com/d4l3k
2025-07-22 06:05:00 +00:00
d984143a74 [ci][cutlass backend] Add ci for cutlass backend tests (#156626)
redo of https://github.com/pytorch/pytorch/pull/156136

Differential Revision: [D77327309](https://our.internmc.facebook.com/intern/diff/D77327309)

I want to try to land the full version first. If CI takes too long, we can revert back to testing only a few names:
```
 -k 'test_max_autotune_cutlass_backend_regular_mm and not test_max_autotune_cutlass_backend_regular_mm_streamk'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156626
Approved by: https://github.com/huydhn, https://github.com/mlazos
2025-07-22 05:18:13 +00:00
21c97bd565 [reland] Transfer "stack_trace" in post_grad passes (#158752)
Summary:
We transfer stack trace in post_grad passes.

We shouldn't add "stack_trace" to _COPY_META_FIELDS because _COPY_META_FIELDS is used in proxy.py where stack_trace is explicitly set.

Since the stack_trace is being used by more and more debugging tools, we should also start testing it more rigorously. This PR starts by adding a first test checking that the stack trace is preserved through post_grad passes.

Test Plan:
```
buck run mode/dev-nosan fbcode//caffe2/test/inductor:provenance_tracing -- -r test_pattern_matcher_transfer_meta

buck run mode/dev-nosan fbcode//caffe2/test/inductor:auto_functionalize -- --rcaffe2/test/inductor:auto_functionalize_old
```

Rollback Plan:

Differential Revision: D78669729

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158752
Approved by: https://github.com/jingsh
2025-07-22 03:49:13 +00:00
a155f742ad [benchmark] allow default mode for compile (#158792)
Allow default mode for compile when users cannot run "max-autotune-no-cudagraphs" due to compilation time. Overall, "default" mode is slower than "[max-autotune-no-cudagraphs](https://github.com/pytorch/pytorch/pull/158536)" depending on input shapes.
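
For reference, the two modes being compared, as a minimal sketch:

```py
import torch

def f(x):
    return torch.nn.functional.softmax(x, dim=-1)

f_default = torch.compile(f)                                   # "default" mode
f_tuned = torch.compile(f, mode="max-autotune-no-cudagraphs")  # longer compile, usually faster kernels
```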

(Benchmark plots attached to the PR: CrossEntropyBackward, CrossEntropyForward, LayerNormBackward, LayerNormForward, RMSNormBackward, RMSNormForward, SoftmaxBackward, SoftmaxForward.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158792
Approved by: https://github.com/zou3519
2025-07-22 03:07:22 +00:00
cyy
3639d29ea1 Fix warnings of unused-variable (#158627)
Fixes
```
/var/lib/jenkins/workspace/test/cpp/tensorexpr/test_kernel.cpp:42:22: error: unused variable 'verification_pattern' [-Werror,-Wunused-variable]
```
and also extra semicolons.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158627
Approved by: https://github.com/albanD
2025-07-22 02:49:06 +00:00
aee8a2e985 Remove duplicated installation for python dependencies. (#158339)
As the title states.

The `Common` section already installs the Python dependencies:
1b389025ba/README.md (L247)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158339
Approved by: https://github.com/ezyang
2025-07-22 02:39:28 +00:00
eac777c4f4 [Inductor] Expose decomposeK knobs as envvars (#158745)
Fix up decomposeK autotuning by removing the condition to return more than `k_splits_limit` and setting the default to 10 instead of 5. Allow `k_splits_limit` to be configured by the user via `TORCHINDUCTOR_NUM_DECOMPOSE_K_SPLITS`, and also allow the user to configure the threshold for using decompose_k via `TORCHINDUCTOR_DECOMPOSE_K_THRESHOLD`.
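
A minimal sketch of using the new knobs (values below are illustrative; the variables must be set before Inductor reads its config):

```py
import os

# Illustrative values -- tune per workload.
os.environ["TORCHINDUCTOR_NUM_DECOMPOSE_K_SPLITS"] = "10"
os.environ["TORCHINDUCTOR_DECOMPOSE_K_THRESHOLD"] = "32"

import torch

compiled_mm = torch.compile(torch.mm, mode="max-autotune")
```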

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158745
Approved by: https://github.com/eellison
2025-07-22 01:59:51 +00:00
1a6b21c59f [AOTI] fix load_pt2 split wrong model name on Windows (#158711)
fix load_pt2 split wrong model name on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158711
Approved by: https://github.com/jansel
2025-07-22 01:54:44 +00:00
abe0c9538a [BE] Fix extra-semi warnings (#158730)
And prevent new ones from appearing by removing `-Wno-error=extra-semi` (not sure what was the reason behind adding the warning but not erroring on it when building with -Werror, introduced by https://github.com/pytorch/pytorch/pull/140236).

300+ violations of that rule were fixed by running `sed -i -e "s/});/})/" /` against `torch/nativert`
Other 3p deps that need updates:
 - TensorPipe
 - LLVM
 - FBGEMM

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158730
Approved by: https://github.com/Skylion007
2025-07-22 01:05:03 +00:00
95b658427d Revert "Add DeviceAllocator as the base device allocator (#138222)"
This reverts commit 1179e333237b02ed8fe2ba10cb9a23adf98d7d7a.

Reverted https://github.com/pytorch/pytorch/pull/138222 on behalf of https://github.com/ZainRizvi due to Very sorry but this is still breaking internally. @albanD would you be able to help get this past the finish line? D78496124 has more details on the failure and the workaround might be to do something like what's in D78684669. To validate the fixes internally, you can follow the instructions here to ghimport the changes: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/138222#issuecomment-3100195370))
2025-07-22 01:01:41 +00:00
6341311333 Revert "Add unified memory APIs for torch.accelerator (#152932)"
This reverts commit 2ad5c25cfc603c3656e6699d6137419dbb009495.

Reverted https://github.com/pytorch/pytorch/pull/152932 on behalf of https://github.com/ZainRizvi due to Very sorry but this is still breaking internally. @albanD would you be able to help get this past the finish line? D78496124 has more details on the failure and the workaround might be to do something like what's in D78684669. To validate the fixes internally, you can follow the instructions here to ghimport the changes: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/138222#issuecomment-3100195370))
2025-07-22 01:01:41 +00:00
350d6af52c [AOTI] add windows support for get_cpp_compile_command (#158732)
add windows support for `get_cpp_compile_command`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158732
Approved by: https://github.com/desertfire
2025-07-22 00:23:10 +00:00
9281625a9b Revert "Setup TorchBench in Docker (#158613)"
This reverts commit cab28330f8c49cdb66d6a299755dc09c87c14a9d.

Reverted https://github.com/pytorch/pytorch/pull/158613 on behalf of https://github.com/ZainRizvi due to Seems to have broken trunk. See [GH job link](https://github.com/pytorch/pytorch/actions/runs/16429779764/job/46430634676) [HUD commit link](b3c868d603) ([comment](https://github.com/pytorch/pytorch/pull/158613#issuecomment-3100023071))
2025-07-22 00:12:49 +00:00
2c37acfd89 [AOTI][CPU] Consider bias=None case for fbgemm_linear_fp16_weight (#158535)
Test Plan:

Rollback Plan:

Differential Revision: D78458214

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158535
Approved by: https://github.com/houseroad, https://github.com/henryoier, https://github.com/jingsh
2025-07-21 23:42:44 +00:00
08540b13c6 Use cuda error code instead of error text in get_cuda_error_help (#158688)
Use cudaError_t and switch over the enum to avoid being affected by upstream changes in wording.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158688
Approved by: https://github.com/q10, https://github.com/aorenste
2025-07-21 23:34:50 +00:00
187c2deb40 Fix clamp(min/max) strategy (#158619)
Part of plan https://github.com/pytorch/pytorch/issues/157495.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158619
Approved by: https://github.com/wanchaol
2025-07-21 23:26:08 +00:00
67be2f27e1 [CI][lintrunner] Only run on non deleted changed files (#158794)
My PR was failing lint because I removed a file; lintrunner would then try to run on the deleted file and error. This changes how the changed files are retrieved so that only changed files that have not been removed are included.

I don't think this is possible through `gh pr view`, so instead it uses `gh api`

Testing: https://github.com/pytorch/pytorch/pull/158795
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158794
Approved by: https://github.com/seemethere
2025-07-21 23:22:37 +00:00
d293022c47 [cutass backend] memorize parts of cache key to reduce general overhead (#158311)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158311
Approved by: https://github.com/ColinPeppler
ghstack dependencies: #156781
2025-07-21 23:21:12 +00:00
ee5a434f8c Revert "[BE] remove torch deploy - conditionals (#158288)"
This reverts commit 1a4268b8113d5160d71225bab980f03c2318a0a4.

Reverted https://github.com/pytorch/pytorch/pull/158288 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally, see D78496147 for details. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/158288#issuecomment-3099826158))
2025-07-21 23:17:39 +00:00
4c18e85300 Revert "[BE] Remove torch deploy | remove torch deploy specific files (#158290)"
This reverts commit a6de309ca15cda6b2792fc74e82814dc8d2f9dd9.

Reverted https://github.com/pytorch/pytorch/pull/158290 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally, see D78496147 for details. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/158288#issuecomment-3099826158))
2025-07-21 23:17:39 +00:00
920f26c761 Revert "[BE] Remove __reduce_deploy__ (#158291)"
This reverts commit 0b9fb91f17edfbc51ae36584dcb8350b2d8bb23b.

Reverted https://github.com/pytorch/pytorch/pull/158291 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally, see D78496147 for details. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/158288#issuecomment-3099826158))
2025-07-21 23:17:38 +00:00
99cc3633f6 Revert "[BE] Modify PyObjectSlot the assume only a single interpreter is in use (#158407)"
This reverts commit d9426a81d2ab54f809a3b32a6ab2e606073fe66f.

Reverted https://github.com/pytorch/pytorch/pull/158407 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally, see D78496147 for details. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/158288#issuecomment-3099826158))
2025-07-21 23:17:38 +00:00
15a50dcf1c Revert "[BE] Make PyObjectSlot use a global PyInterpreter and remove (#158427)"
This reverts commit eb7365072315be2bc4259114e25e269801441748.

Reverted https://github.com/pytorch/pytorch/pull/158427 on behalf of https://github.com/ZainRizvi due to Reverting this as part of reverting the stack for https://github.com/pytorch/pytorch/pull/158288 ([comment](https://github.com/pytorch/pytorch/pull/158427#issuecomment-3099815367))
2025-07-21 23:14:57 +00:00
1227ed6674 [dynamic shapes] fix _maybe_evaluate_static axioms bug (#158672)
Summary: couldn't get a minimal repro, but xref for size change during dict iteration error: https://fb.workplace.com/groups/1075192433118967/posts/1709439696360901

Test Plan:
-

Rollback Plan:

Differential Revision: D78047846

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158672
Approved by: https://github.com/bobrenjc93
2025-07-21 23:14:19 +00:00
2bb684304d Fix the typos in the right nav by pulling the latest theme (#158746)
This will fix broken links in the right nav.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158746
Approved by: https://github.com/malfet
2025-07-21 22:51:07 +00:00
f09a484b81 Remove is_arvr_mode() from xnnpack.buck.bzl (#158682)
Summary:
**Changes**
*   Deleted function import from build definition utilities
    *   Removed `load("//tools/build_defs:fbsource_utils.bzl", "is_arvr_mode")`
*   Replaced is_arvr_mode() function calls with direct references to configuration flags
    *  Changed from `is_arvr_mode()` to `"ovr_config//build_mode:arvr_mode"`
*   Changed conditional expressions to Buck `select()` statements

Test Plan:
Check if CI passes

Rollback Plan:

Differential Revision: D78520947

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158682
Approved by: https://github.com/malfet
2025-07-21 22:49:26 +00:00
feaa02f9ad Revert "[build] pin setuptools>=77 to enable PEP 639 (#158104)"
This reverts commit a78fb63dbdf98a1db219095293de1a11005e0390.

Reverted https://github.com/pytorch/pytorch/pull/158104 on behalf of https://github.com/malfet due to It still breaks inductor-perf-nightly, see https://github.com/pytorch/pytorch/actions/runs/16425364208/job/46417088208, I'm going to dismiss all previous reviews ([comment](https://github.com/pytorch/pytorch/pull/158104#issuecomment-3099706457))
2025-07-21 22:46:53 +00:00
b3c868d603 [vllm]Add vllm.txt for pinned commit (#158754)
It seems nightly.yml won't auto-generate the txt file when it does not exist, so I added the file with the latest merged commit from vllm:

[vllm commit](https://github.com/vllm-project/vllm/commits/main)

Error:
https://github.com/pytorch/pytorch/actions/runs/16405915719/job/46351847504
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158754
Approved by: https://github.com/huydhn
2025-07-21 22:41:07 +00:00
cab28330f8 Setup TorchBench in Docker (#158613)
This reduces the time spent setting up TorchBench on A100/H100 by another half an hour.

### Testing

* H100 benchmark https://github.com/pytorch/pytorch/actions/runs/16396172453.  Once this done, I will review the results on [HUD](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Fri%2C%2011%20Jul%202025%2023%3A01%3A24%20GMT&stopTime=Fri%2C%2018%20Jul%202025%2023%3A01%3A24%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=gh/huydhn/6/head&lCommit=14a38c719b29a19f518239b5edb084838ac5d2fb&rBranch=main&rCommit=0a99b026d6bd0f67dc2c0a20fe3228ddc4144854) to confirm that all models are there
* A100 benchmark https://github.com/pytorch/pytorch/actions/runs/16396173932

Signed-off-by: Huy Do <huydhn@gmail.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158613
Approved by: https://github.com/janeyx99
2025-07-21 22:34:08 +00:00
4366610f5a [c10d] block_current_stream: correctness fixes (#158757)
This fixes a number of issues that were present in https://github.com/pytorch/pytorch/pull/156883 as pointed out by @ngimel

Test plan:

Expanded tests to cover use after free behavior + non-default stream

```
pytest test/distributed/test_c10d_pypg.py -v -k block_current_stream
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158757
Approved by: https://github.com/ngimel
2025-07-21 22:23:44 +00:00
dd0adc9386 [SymmMem] Add NVSHMEM broadcast support into Triton (#158514)
Adds a broadcast collective operation for distributing data from the root PE to all other PEs in NVSHMEM Triton kernels.

Tests: `python test/distributed/test_nvshmem_triton.py -k test_triton_broadcast`
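For intuition only, here is a minimal host-side sketch of the same "root PE data goes to every PE" semantics using `torch.distributed.broadcast`; it is an analogy, not the NVSHMEM Triton device API added by this change:

```python
# Hedged sketch: host-side analogue of broadcast semantics with torch.distributed.
import torch
import torch.distributed as dist

def main() -> None:
    dist.init_process_group(backend="gloo")  # gloo so the sketch runs without GPUs
    rank = dist.get_rank()
    nelems = 4
    # Rank 0 plays the role of the root PE holding the payload; other ranks start
    # with placeholder values, mirroring the -999 sentinels in the log below.
    if rank == 0:
        buf = torch.tensor([100, 101, 102, 103])
    else:
        buf = torch.full((nelems,), -999)
    dist.broadcast(buf, src=0)  # afterwards every rank holds [100, 101, 102, 103]
    print(f"[Rank {rank}] destination buffer: {buf.tolist()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()  # e.g. torchrun --nproc-per-node 2 broadcast_sketch.py
```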
<details>
<summary> Quick debug print for sanity check </summary>

```markdown
============================================================
[Rank 0] Starting broadcast test with world_size=2
============================================================
[Rank 0] Configuration:
  - nelems: 4
  - dtype: torch.int64, element_size: 8 bytes
  - nelems_bytes: 32
============================================================
[Rank 1] Starting broadcast test with world_size=2
============================================================
[Rank 1] Configuration:
  - nelems: 4
  - dtype: torch.int64, element_size: 8 bytes
  - nelems_bytes: 32
[Rank 1] Non-root source data: [-1, -1, -1, -1]
[Rank 0] Root source data: [100, 101, 102, 103]
[Rank 1] Initial destination: [-999, -999, -999, -999]
[Rank 0] Initial destination: [-999, -999, -999, -999]
[Rank 0] Executing broadcast operation...
[Rank 1] Executing broadcast operation...
[Rank 0] Broadcast operation completed
/data/users/suryasub/pytorch/torch/distributed/distributed_c10d.py:4809: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
  warnings.warn(  # warn only once
[Rank 1] Broadcast operation completed
/data/users/suryasub/pytorch/torch/distributed/distributed_c10d.py:4809: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
  warnings.warn(  # warn only once
[Rank 1] Results after broadcast:
[Rank 0] Results after broadcast:
[Rank 1] Destination buffer: [100, 101, 102, 103]
[Rank 1] Expected: [100, 101, 102, 103]
[Rank 0] Destination buffer: [100, 101, 102, 103]
[Rank 0] Expected: [100, 101, 102, 103]
[Rank 1] Match: ✓
[Rank 0] Match: ✓
[Rank 1] ============================================================
[Rank 1] Broadcast test PASSED ✓
[Rank 1] Summary: Root PE 0 broadcasted [100, 101, 102, 103] to all PEs
[Rank 1] ============================================================
[Rank 0] ============================================================
[Rank 0] Broadcast test PASSED ✓
[Rank 0] Summary: Root PE 0 broadcasted [100, 101, 102, 103] to all PEs
[Rank 0] ============================================================
```

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158514
Approved by: https://github.com/fduwjj, https://github.com/mandroid6
ghstack dependencies: #158511, #158512, #158513
2025-07-21 22:23:26 +00:00
734826d88e Revert "[AOTI] windows package load dev (#158671)"
This reverts commit d42c40976727fed4c9908d4194f26917d0a3da66.

Reverted https://github.com/pytorch/pytorch/pull/158671 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. @angelayi can you please help them validate the fixes internally? You can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/158671#issuecomment-3099570374))
2025-07-21 22:20:46 +00:00
5a56e6a72b Revert "[AOTI] fix extract file failed on Windows. (#158702)"
This reverts commit 7cc1a9546c135f8e7635e0d38aa2bba797f8907d.

Reverted https://github.com/pytorch/pytorch/pull/158702 on behalf of https://github.com/ZainRizvi due to Sorry but I had to revert this PR in order to revert https://github.com/pytorch/pytorch/pull/158671 ([comment](https://github.com/pytorch/pytorch/pull/158702#issuecomment-3099556215))
2025-07-21 22:18:19 +00:00
e8af168ee0 Revert "[AOTI] normalize path and process model files. (#158705)"
This reverts commit ff0da08f4bc5ee135b495926cd58a36a1c0e1a5b.

Reverted https://github.com/pytorch/pytorch/pull/158705 on behalf of https://github.com/ZainRizvi due to Sorry but I had to revert this PR in order to revert https://github.com/pytorch/pytorch/pull/158671 ([comment](https://github.com/pytorch/pytorch/pull/158705#issuecomment-3099532516))
2025-07-21 22:16:03 +00:00
97d7dc197f Revert "[AOTI] Convert C-struct zip handling to RAII container (#158687)"
This reverts commit 8ed5e1844c77d952bcea89ca7d0225d876fec4e8.

Reverted https://github.com/pytorch/pytorch/pull/158687 on behalf of https://github.com/ZainRizvi due to Sorry but I had to revert this PR in order to revert https://github.com/pytorch/pytorch/pull/158671 ([comment](https://github.com/pytorch/pytorch/pull/158687#issuecomment-3099515618))
2025-07-21 22:13:26 +00:00
9498d95b9c [Dynamo][BetterEngineering] Type trace_rules.py (#158679)
As part of better engineering week, we would like to improve our type support to improve the dev experience in Dynamo.

This PR adds strict typing support to a core file, `trace_rules.py`
Running
```
mypy torch/_dynamo/trace_rules.py   --linecount-report /tmp/coverage_log
```
| | Lines Annotated | Lines Total | % lines covered | Funcs Annotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main  |  2564 | 3997 | 64.15% | 34 | 53 | 64.15% |
| This PR | 4022 | 4022 | 100.00% | 53 | 53 | 100.00% |
| Delta    | +1458 | +25 | +35.85% | +19 | 0 | +35.85% |
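As a hypothetical illustration of the kind of annotation this involves (the function below is invented, not taken from `trace_rules.py`):

```python
from typing import Optional

# Hypothetical example only; trace_rules.py defines its own functions.
# Before: `def lookup_rule(obj): ...` carried no annotations, so mypy counted
# its lines and the function itself as uncovered.
def lookup_rule(qualified_name: str) -> Optional[str]:
    """Return the trace rule registered for `qualified_name`, or None."""
    rules: dict[str, str] = {}  # placeholder registry for the sketch
    return rules.get(qualified_name)
```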

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158679
Approved by: https://github.com/williamwen42
2025-07-21 22:12:59 +00:00
0e46f54286 [ROCm][CI] update HIP patch for 6.4.1 (#158651)
Patch is intended to fix hipGraph capture for some MIOpen kernels.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158651
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-07-21 22:09:36 +00:00
216ba6e5f2 Fix MaskedTensor to device ignored mask (#151205)
Fixes #147140

## Changes

- Add a `to` implementation in `MaskedTensor` to support moving `mask` to the target device

## Test Result

```python
In [1]: import torch
   ...: from torch.masked import as_masked_tensor
   ...: data = torch.tensor([1,2,3])
   ...: mask = torch.tensor([True,False,True])
   ...: mt = as_masked_tensor(data, mask).to('cuda')
   ...: mt.get_data().device, mt.get_mask().device
/home/zong/code/pytorch/torch/masked/maskedtensor/core.py:247: UserWarning: The PyTorch API of MaskedTensors is in prototype stage and will change in the near future. Please open a Github issue for features requests and see our documentation on the torch.masked module for further information about the project.
  return MaskedTensor(data, mask)
/home/zong/code/pytorch/torch/masked/maskedtensor/_ops_refs.py:354: UserWarning: The PyTorch API of MaskedTensors is in prototype stage and will change in the near future. Please open a Github issue for features requests and see our documentation on the torch.masked module for further information about the project.
  return MaskedTensor(new_data, _maybe_get_mask(args[0]))
Out[1]: (device(type='cuda', index=0), device(type='cuda', index=0))

In [2]: mt.sum(dim=0)
/home/zong/code/pytorch/torch/masked/maskedtensor/core.py:247: UserWarning: The PyTorch API of MaskedTensors is in prototype stage and will change in the near future. Please open a Github issue for features requests and see our documentation on the torch.masked module for further information about the project.
  return MaskedTensor(data, mask)
Out[2]: MaskedTensor(4, True)

```

```bash
pytest test/test_maskedtensor.py -vv
```

![image](https://github.com/user-attachments/assets/640b809c-b4f0-4aca-a09e-04049017a745)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151205
Approved by: https://github.com/ezyang
2025-07-21 21:44:49 +00:00
c774180e59 Bump requests from 2.32.2 to 2.32.4 in /tools/build/bazel (#158006)
Bumps [requests](https://github.com/psf/requests) from 2.32.2 to 2.32.4.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a href="https://github.com/psf/requests/releases">requests's releases</a>.</em></p>
<blockquote>
<h2>v2.32.4</h2>
<h2>2.32.4 (2025-06-10)</h2>
<p><strong>Security</strong></p>
<ul>
<li>CVE-2024-47081 Fixed an issue where a maliciously crafted URL and trusted
environment will retrieve credentials for the wrong hostname/machine from a
netrc file. (<a href="https://redirect.github.com/psf/requests/issues/6965">#6965</a>)</li>
</ul>
<p><strong>Improvements</strong></p>
<ul>
<li>Numerous documentation improvements</li>
</ul>
<p><strong>Deprecations</strong></p>
<ul>
<li>Added support for pypy 3.11 for Linux and macOS. (<a href="https://redirect.github.com/psf/requests/issues/6926">#6926</a>)</li>
<li>Dropped support for pypy 3.9 following its end of support. (<a href="https://redirect.github.com/psf/requests/issues/6926">#6926</a>)</li>
</ul>
<h2>v2.32.3</h2>
<h2>2.32.3 (2024-05-29)</h2>
<p><strong>Bugfixes</strong></p>
<ul>
<li>Fixed bug breaking the ability to specify custom SSLContexts in sub-classes of
HTTPAdapter. (<a href="https://redirect.github.com/psf/requests/issues/6716">#6716</a>)</li>
<li>Fixed issue where Requests started failing to run on Python versions compiled
without the <code>ssl</code> module. (<a href="https://redirect.github.com/psf/requests/issues/6724">#6724</a>)</li>
</ul>
</blockquote>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a href="https://github.com/psf/requests/blob/main/HISTORY.md">requests's changelog</a>.</em></p>
<blockquote>
<h2>2.32.4 (2025-06-10)</h2>
<p><strong>Security</strong></p>
<ul>
<li>CVE-2024-47081 Fixed an issue where a maliciously crafted URL and trusted
environment will retrieve credentials for the wrong hostname/machine from a
netrc file.</li>
</ul>
<p><strong>Improvements</strong></p>
<ul>
<li>Numerous documentation improvements</li>
</ul>
<p><strong>Deprecations</strong></p>
<ul>
<li>Added support for pypy 3.11 for Linux and macOS.</li>
<li>Dropped support for pypy 3.9 following its end of support.</li>
</ul>
<h2>2.32.3 (2024-05-29)</h2>
<p><strong>Bugfixes</strong></p>
<ul>
<li>Fixed bug breaking the ability to specify custom SSLContexts in sub-classes of
HTTPAdapter. (<a href="https://redirect.github.com/psf/requests/issues/6716">#6716</a>)</li>
<li>Fixed issue where Requests started failing to run on Python versions compiled
without the <code>ssl</code> module. (<a href="https://redirect.github.com/psf/requests/issues/6724">#6724</a>)</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a href="021dc729f0"><code>021dc72</code></a> Polish up release tooling for last manual release</li>
<li><a href="821770e822"><code>821770e</code></a> Bump version and add release notes for v2.32.4</li>
<li><a href="59f8aa2adf"><code>59f8aa2</code></a> Add netrc file search information to authentication documentation (<a href="https://redirect.github.com/psf/requests/issues/6876">#6876</a>)</li>
<li><a href="5b4b64c346"><code>5b4b64c</code></a> Add more tests to prevent regression of CVE 2024 47081</li>
<li><a href="7bc45877a8"><code>7bc4587</code></a> Add new test to check netrc auth leak (<a href="https://redirect.github.com/psf/requests/issues/6962">#6962</a>)</li>
<li><a href="96ba401c12"><code>96ba401</code></a> Only use hostname to do netrc lookup instead of netloc</li>
<li><a href="7341690e84"><code>7341690</code></a> Merge pull request <a href="https://redirect.github.com/psf/requests/issues/6951">#6951</a> from tswast/patch-1</li>
<li><a href="6716d7c9f2"><code>6716d7c</code></a> remove links</li>
<li><a href="a7e1c745dc"><code>a7e1c74</code></a> Update docs/conf.py</li>
<li><a href="c799b8167a"><code>c799b81</code></a> docs: fix dead links to kenreitz.org</li>
<li>Additional commits viewable in <a href="https://github.com/psf/requests/compare/v2.32.2...v2.32.4">compare view</a></li>
</ul>
</details>
<br />

[![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=requests&package-manager=pip&previous-version=2.32.2&new-version=2.32.4)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/pytorch/pytorch/network/alerts).

</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158006
Approved by: https://github.com/Skylion007

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-07-21 21:35:38 +00:00
358 changed files with 8626 additions and 7660 deletions

View File

@ -12,7 +12,7 @@ fi
SCRIPTPATH="$( cd "$(dirname "$0")" ; pwd -P )"
source $SCRIPTPATH/../manywheel/set_desired_python.sh
pip install -q numpy==${NUMPY_VERSION} pyyaml==6.0.2 scons==4.7.0 ninja==1.11.1.4 patchelf==0.17.2
pip install -q numpy==${NUMPY_VERSION} pyyaml==6.0.2 scons==4.7.0 ninja==1.11.1 patchelf==0.17.2
for tool in python python3 pip pip3 ninja scons patchelf; do
ln -sf ${DESIRED_PYTHON_BIN_DIR}/${tool} /usr/local/bin;

View File

@ -36,3 +36,105 @@ See `build.sh` for valid build environments (it's the giant switch).
# Set flags (see build.sh) and build image
sudo bash -c 'TRITON=1 ./build.sh pytorch-linux-bionic-py3.8-gcc9 -t myimage:latest'
```
## [Guidance] Adding a New Base Docker Image
### Background
The base Docker images in directory `.ci/docker/` are built by the `docker-builds.yml` workflow. Those images are used throughout the PyTorch CI/CD pipeline. You should only create or modify a base Docker image if you need specific environment changes or dependencies before building PyTorch on CI.
1. **Automatic Rebuilding**:
- The Docker image building process is triggered automatically when changes are made to files in the `.ci/docker/*` directory
- This ensures all images stay up-to-date with the latest dependencies and configurations
2. **Image Reuse in PyTorch Build Workflows** (example: linux-build):
- The images generated by `docker-builds.yml` are reused in `_linux-build.yml` through the `calculate-docker-image` step
- The `_linux-build.yml` workflow:
- Pulls the Docker image determined by the `calculate-docker-image` step
- Runs a Docker container with that image
- Executes `.ci/pytorch/build.sh` inside the container to build PyTorch
3. **Usage in Test Workflows** (example: linux-test):
- The same Docker images are also used in `_linux-test.yml` for running tests
- The `_linux-test.yml` workflow follows a similar pattern:
- It uses the `calculate-docker-image` step to determine which Docker image to use
- It pulls the Docker image and runs a container with that image
- It installs the wheels from the artifacts generated by PyTorch build jobs
- It executes test scripts (like `.ci/pytorch/test.sh` or `.ci/pytorch/multigpu-test.sh`) inside the container
### Understanding File Purposes
#### `.ci/docker/build.sh` vs `.ci/pytorch/build.sh`
- **`.ci/docker/build.sh`**:
- Used for building base Docker images
- Executed by the `docker-builds.yml` workflow to pre-build Docker images for CI
- Contains configurations for different Docker build environments
- **`.ci/pytorch/build.sh`**:
- Used for building PyTorch inside a Docker container
- Called by workflows like `_linux-build.yml` after the Docker container is started
- Builds PyTorch wheels and other artifacts
#### `.ci/docker/ci_commit_pins/` vs `.github/ci_commit_pins`
- **`.ci/docker/ci_commit_pins/`**:
- Used for pinning dependency versions during base Docker image building
- Ensures consistent environments for building PyTorch
- Changes here trigger base Docker image rebuilds
- **`.github/ci_commit_pins`**:
- Used for pinning dependency versions during PyTorch building and tests
- Ensures consistent dependencies for PyTorch across different builds
- Used by build scripts running inside Docker containers
### Step-by-Step Guide for Adding a New Base Docker Image
#### 1. Add Pinned Commits (If Applicable)
We use pinned commits for build stability. The `nightly.yml` workflow checks and updates pinned commits for certain repository dependencies daily.
If your new Docker image needs a library installed from a specific pinned commit or built from source:
1. Add the repository you want to track in `nightly.yml` and `merge-rules.yml`
2. Add the initial pinned commit in `.ci/docker/ci_commit_pins/`. The text filename should match the one defined in step 1
#### 2. Configure the Base Docker Image
1. **Add new Base Docker image configuration** (if applicable):
Add the configuration in `.ci/docker/build.sh`. For example:
```bash
pytorch-linux-jammy-cuda12.8-cudnn9-py3.12-gcc11-new1)
CUDA_VERSION=12.8.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=11
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
TRITON=yes
NEW_ARG_1=yes
;;
```
2. **Add build arguments to Docker build command**:
If you're introducing a new argument to the Docker build, make sure to add it in the Docker build step in `.ci/docker/build.sh`:
```bash
docker build \
....
--build-arg "NEW_ARG_1=${NEW_ARG_1}"
```
3. **Update Dockerfile logic**:
Update the Dockerfile to use the new argument. For example, in `ubuntu/Dockerfile`:
```dockerfile
ARG NEW_ARG_1
# Set up environment for NEW_ARG_1
RUN if [ -n "${NEW_ARG_1}" ]; then bash ./do_something.sh; fi
```
4. **Add the Docker configuration** in `.github/workflows/docker-builds.yml`:
The `docker-builds.yml` workflow pre-builds the Docker images whenever changes occur in the `.ci/docker/` directory. This includes the
pinned commit updates.

View File

@ -160,6 +160,17 @@ case "$tag" in
UCC_COMMIT=${_UCC_COMMIT}
TRITON=yes
;;
pytorch-linux-jammy-cuda12.8-cudnn9-py3.12-gcc11-vllm)
CUDA_VERSION=12.8.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=11
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
TRITON=yes
;;
pytorch-linux-jammy-cuda12.6-cudnn9-py3-gcc9-inductor-benchmarks)
CUDA_VERSION=12.6
CUDNN_VERSION=9
@ -276,7 +287,7 @@ case "$tag" in
NINJA_VERSION=1.9.0
TRITON=yes
;;
pytorch-linux-jammy-py3.9-gcc11-inductor-benchmarks)
pytorch-linux-jammy-py3.9-gcc11-inductor-benchmarks)
ANACONDA_PYTHON_VERSION=3.9
GCC_VERSION=11
VISION=yes

View File

@ -87,7 +87,7 @@ EOF
if [[ $(ver $ROCM_VERSION) -ge $(ver 6.4) ]] && [[ $(ver $ROCM_VERSION) -lt $(ver 7.0) ]]; then
if [[ $(ver $ROCM_VERSION) -eq $(ver 6.4.1) ]]; then
HIP_BRANCH=release/rocm-rel-6.4
CLR_HASH=ca18eb3f77fa09292fcda62bc60c3e565d752ada # branch release/rocm-rel-6.4.1-statco-hotfix
CLR_HASH=606bc820b4b1f315d135da02a1f0b176ca50a92c # branch release/rocm-rel-6.4.1-statco-hotfix
elif [[ $(ver $ROCM_VERSION) -eq $(ver 6.4) ]]; then
HIP_BRANCH=release/rocm-rel-6.4
CLR_HASH=600f5b0d2baed94d5121e2174a9de0851b040b0c # branch release/rocm-rel-6.4-statco-hotfix

View File

@ -128,7 +128,7 @@ ENV PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
# Install setuptools and wheel for python 3.12/3.13
RUN for cpython_version in "cp312-cp312" "cp313-cp313" "cp313-cp313t"; do \
/opt/python/${cpython_version}/bin/python -m pip install "setuptools>=77.0.0" "packaging>=24.2" wheel; \
/opt/python/${cpython_version}/bin/python -m pip install setuptools wheel; \
done;

View File

@ -124,9 +124,10 @@ RUN python3 -mpip install cmake==3.28.0
# install newest flatbuffers version first:
# for some reason old version is getting pulled in otherwise.
# packaging package is required for onnxruntime wheel build.
RUN pip3 install 'setuptools>=77.0' 'packaging>=24.2' && \
pip3 install flatbuffers cython 'pkgconfig>=1.5.5' 'numpy<2.3.0' && \
RUN pip3 install flatbuffers && \
pip3 install cython 'pkgconfig>=1.5.5' 'setuptools>=77' 'numpy<2.3.0' && \
pip3 install --no-build-isolation h5py==3.11.0 && \
pip3 install packaging && \
git clone https://github.com/microsoft/onnxruntime && \
cd onnxruntime && git checkout v1.21.0 && \
git submodule update --init --recursive && \

View File

@ -50,7 +50,7 @@ flatbuffers==24.12.23
hypothesis==5.35.1
# Pin hypothesis to avoid flakiness: https://github.com/pytorch/pytorch/issues/31136
#Description: advanced library for generating parametrized tests
#Pinned versions: 5.35.1
#Pinned versions: 3.44.6, 4.53.2
#test that import: test_xnnpack_integration.py, test_pruning_op.py, test_nn.py
junitparser==2.1.1
@ -104,10 +104,10 @@ networkx==2.8.8
#Pinned versions: 2.8.8
#test that import: functorch
ninja==1.11.1.4
ninja==1.11.1.3
#Description: build system. Used in some tests. Used in build to generate build
#time tracing information
#Pinned versions: 1.11.1.4
#Pinned versions: 1.11.1.3
#test that import: run_test.py, test_cpp_extensions_aot.py,test_determination.py
numba==0.49.0 ; python_version < "3.9"
@ -307,7 +307,7 @@ pytest-cpp==2.3.0
#Pinned versions: 2.3.0
#test that import:
z3-solver==4.15.1.0
z3-solver==4.12.6.0
#Description: The Z3 Theorem Prover Project
#Pinned versions:
#test that import:
@ -363,10 +363,9 @@ pwlf==2.2.1
# To build PyTorch itself
packaging>=24.2
pyyaml
pyzstd
setuptools>=77.0.0
setuptools>=70.1.0
six
scons==4.5.2 ; platform_machine == "aarch64"

View File

@ -4,8 +4,8 @@ sphinx==5.3.0
-e git+https://github.com/pytorch/pytorch_sphinx_theme.git@pytorch_sphinx_theme2#egg=pytorch_sphinx_theme2
# TODO: sphinxcontrib.katex 0.9.0 adds a local KaTeX server to speed up pre-rendering
# but it doesn't seem to work and hangs around idly. The initial thought it is probably
# something related to Docker setup. We can investigate this later
# but it doesn't seem to work and hangs around idly. The initial thought that it is probably
# something related to Docker setup. We can investigate this later.
sphinxcontrib.katex==0.8.6
#Description: This is used to generate PyTorch docs

View File

@ -269,9 +269,6 @@ if [[ "$BUILD_ENVIRONMENT" == *-bazel-* ]]; then
tools/bazel build --config=no-tty "${BAZEL_MEM_LIMIT}" "${BAZEL_CPU_LIMIT}" //...
fi
else
# install build-system requirements before running setup.py commands
python -m pip install -r requirements-build.txt
# check that setup.py would fail with bad arguments
echo "The next three invocations are expected to fail with invalid command error messages."
( ! get_exit_code python setup.py bad_argument )

View File

@ -204,8 +204,32 @@ function install_torchrec_and_fbgemm() {
pip_build_and_install "git+https://github.com/pytorch/torchrec.git@${torchrec_commit}" dist/torchrec
pip_uninstall fbgemm-gpu-nightly
# If ROCM_HOME isn't set, use ROCM_PATH if set, otherwise /opt/rocm
ROCM_HOME="${ROCM_HOME:-${ROCM_PATH:-/opt/rocm}}"
# Find rocm_version.h header file for ROCm version extract
rocm_version_h="${ROCM_HOME}/include/rocm-core/rocm_version.h"
if [ ! -f "$rocm_version_h" ]; then
rocm_version_h="${ROCM_HOME}/include/rocm_version.h"
fi
# Error out if rocm_version.h not found
if [ ! -f "$rocm_version_h" ]; then
echo "Error: rocm_version.h not found in expected locations." >&2
exit 1
fi
# Extract major, minor and patch ROCm version numbers
MAJOR_VERSION=$(grep 'ROCM_VERSION_MAJOR' "$rocm_version_h" | awk '{print $3}')
MINOR_VERSION=$(grep 'ROCM_VERSION_MINOR' "$rocm_version_h" | awk '{print $3}')
PATCH_VERSION=$(grep 'ROCM_VERSION_PATCH' "$rocm_version_h" | awk '{print $3}')
ROCM_INT=$((MAJOR_VERSION * 10000 + MINOR_VERSION * 100 + PATCH_VERSION))
echo "ROCm version: $ROCM_INT"
export BUILD_ROCM_VERSION="$MAJOR_VERSION.$MINOR_VERSION"
pip_install tabulate # needed for newer fbgemm
pip_install patchelf # needed for rocm fbgemm
pushd /tmp
local wheel_dir=dist/fbgemm_gpu
local found_whl=0
@ -223,7 +247,7 @@ function install_torchrec_and_fbgemm() {
pushd fbgemm/fbgemm_gpu
git checkout "${fbgemm_commit}"
python setup.py bdist_wheel \
--package_variant=rocm \
--build-variant=rocm \
-DHIP_ROOT_DIR="${ROCM_PATH}" \
-DCMAKE_C_FLAGS="-DTORCH_USE_HIP_DSA" \
-DCMAKE_CXX_FLAGS="-DTORCH_USE_HIP_DSA"
@ -240,6 +264,7 @@ function install_torchrec_and_fbgemm() {
done
rm -rf fbgemm
popd
else
pip_build_and_install "git+https://github.com/pytorch/torchrec.git@${torchrec_commit}" dist/torchrec
pip_build_and_install "git+https://github.com/pytorch/FBGEMM.git@${fbgemm_commit}#subdirectory=fbgemm_gpu" dist/fbgemm_gpu

View File

@ -201,7 +201,7 @@ fi
if [[ "$BUILD_ENVIRONMENT" != *-bazel-* ]] ; then
# JIT C++ extensions require ninja.
pip_install "ninja==1.11.1.4"
pip_install "ninja==1.10.2"
# ninja is installed in $HOME/.local/bin, e.g., /var/lib/jenkins/.local/bin for CI user jenkins
# but this script should be runnable by any user, including root
export PATH="$HOME/.local/bin:$PATH"
@ -345,6 +345,12 @@ test_h100_symm_mem() {
assert_git_not_dirty
}
test_h100_cutlass_backend() {
# cutlass backend tests for H100
TORCHINDUCTOR_CUTLASS_DIR=$(realpath "./third_party/cutlass") python test/run_test.py --include inductor/test_cutlass_backend -k "not addmm" $PYTHON_TEST_EXTRA_OPTION --upload-artifacts-while-running
TORCHINDUCTOR_CUTLASS_DIR=$(realpath "./third_party/cutlass") python test/run_test.py --include inductor/test_cutlass_evt $PYTHON_TEST_EXTRA_OPTION --upload-artifacts-while-running
}
test_lazy_tensor_meta_reference_disabled() {
export TORCH_DISABLE_FUNCTIONALIZATION_META_REFERENCE=1
echo "Testing lazy tensor operations without meta reference"
@ -1769,6 +1775,8 @@ elif [[ "${TEST_CONFIG}" == h100_distributed ]]; then
test_h100_distributed
elif [[ "${TEST_CONFIG}" == "h100-symm-mem" ]]; then
test_h100_symm_mem
elif [[ "${TEST_CONFIG}" == h100_cutlass_backend ]]; then
test_h100_cutlass_backend
else
install_torchvision
install_monkeytype

View File

@ -0,0 +1,34 @@
# If you want to rebuild, run this with $env:REBUILD=1
# If you want to build with CUDA, run this with $env:USE_CUDA=1
# If you want to build without CUDA, run this with $env:USE_CUDA=0
# Check for setup.py in the current directory
if (-not (Test-Path "setup.py")) {
Write-Host "ERROR: Please run this build script from PyTorch root directory."
exit 1
}
# Get the script's parent directory
$ScriptParentDir = Split-Path -Parent $MyInvocation.MyCommand.Definition
# Set TMP_DIR and convert to Windows path
$env:TMP_DIR = Join-Path (Get-Location) "build\win_tmp"
$env:TMP_DIR_WIN = $env:TMP_DIR # Already in Windows format, no cygpath needed
# Set final package directory with default fallback
if (-not $env:PYTORCH_FINAL_PACKAGE_DIR) {
$env:PYTORCH_FINAL_PACKAGE_DIR = "C:\w\build-results"
}
# Create the final package directory if it doesn't exist
if (-not (Test-Path $env:PYTORCH_FINAL_PACKAGE_DIR)) {
New-Item -Path $env:PYTORCH_FINAL_PACKAGE_DIR -ItemType Directory -Force | Out-Null
}
# Set script helpers directory
$env:SCRIPT_HELPERS_DIR = Join-Path $ScriptParentDir "win-test-helpers\arm64"
# Run the main build script
& "$env:SCRIPT_HELPERS_DIR\build_pytorch.ps1"
Write-Host "BUILD PASSED"

View File

@ -0,0 +1,24 @@
#!/bin/bash
set -ex -o pipefail
SCRIPT_PARENT_DIR=$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )
# shellcheck source=./common.sh
source "$SCRIPT_PARENT_DIR/common.sh"
run_tests() {
echo Running smoke_test.py...
python ./.ci/pytorch/smoke_test/smoke_test.py --package torchonly
echo Running test_autograd.py, test_nn.py, test_modules.py...
cd test
CORE_TEST_LIST=("test_autograd.py" "test_nn.py" "test_modules.py")
for t in "${CORE_TEST_LIST[@]}"; do
echo "Running test: $t"
python "$t" --verbose --save-xml --use-pytest -vvvv -rfEsxXP -p no:xdist
done
}
run_tests
echo "TEST PASSED"

View File

@ -0,0 +1,98 @@
# TODO: we may be able to use the existing build_pytorch.bat for arm64
if ($env:DEBUG -eq "1") {
$env:BUILD_TYPE = "debug"
} else {
$env:BUILD_TYPE = "release"
}
# This inflates our log size slightly, but it is REALLY useful to be
# able to see what our cl.exe commands are. (since you can actually
# just copy-paste them into a local Windows setup to just rebuild a
# single file.)
# log sizes are too long, but leaving this here in case someone wants to use it locally
# $env:CMAKE_VERBOSE_MAKEFILE = "1"
$env:INSTALLER_DIR = Join-Path $env:SCRIPT_HELPERS_DIR "installation-helpers"
cd ..
# Environment variables
$env:SCCACHE_IDLE_TIMEOUT = "0"
$env:SCCACHE_IGNORE_SERVER_IO_ERROR = "1"
$env:CMAKE_BUILD_TYPE = $env:BUILD_TYPE
$env:CMAKE_C_COMPILER_LAUNCHER = "sccache"
$env:CMAKE_CXX_COMPILER_LAUNCHER = "sccache"
$env:libuv_ROOT = Join-Path $env:DEPENDENCIES_DIR "libuv\install"
$env:MSSdk = "1"
if ($env:PYTORCH_BUILD_VERSION) {
$env:PYTORCH_BUILD_VERSION = $env:PYTORCH_BUILD_VERSION
$env:PYTORCH_BUILD_NUMBER = "1"
}
$env:CMAKE_POLICY_VERSION_MINIMUM = "3.5"
# Set BLAS type
if ($env:ENABLE_APL -eq "1") {
$env:BLAS = "APL"
$env:USE_LAPACK = "1"
} elseif ($env:ENABLE_OPENBLAS -eq "1") {
$env:BLAS = "OpenBLAS"
$env:OpenBLAS_HOME = Join-Path $env:DEPENDENCIES_DIR "OpenBLAS\install"
}
# Change to source directory
Set-Location $env:PYTORCH_ROOT
# Copy libuv.dll
Copy-Item -Path (Join-Path $env:libuv_ROOT "lib\Release\uv.dll") -Destination "torch\lib\uv.dll" -Force
# Create virtual environment
python -m venv .venv
.\.venv\Scripts\Activate.ps1
where.exe python
# Python install dependencies
python -m pip install --upgrade pip
pip install setuptools pyyaml
pip install -r requirements.txt
# Set after installing psutil
$env:DISTUTILS_USE_SDK = "1"
# Print all environment variables
Get-ChildItem Env:
# Start and inspect sccache
sccache --start-server
sccache --zero-stats
sccache --show-stats
# Build the wheel
python setup.py bdist_wheel
if ($LASTEXITCODE -ne 0) { exit 1 }
# Install the wheel locally
$whl = Get-ChildItem -Path "dist\*.whl" | Select-Object -First 1
if ($whl) {
python -mpip install --no-index --no-deps $whl.FullName
}
# Copy final wheel
robocopy "dist" "$env:PYTORCH_FINAL_PACKAGE_DIR" *.whl
# Export test times
python tools/stats/export_test_times.py
# Copy additional CI files
robocopy ".additional_ci_files" "$env:PYTORCH_FINAL_PACKAGE_DIR\.additional_ci_files" /E
# Save ninja log
Copy-Item -Path "build\.ninja_log" -Destination $env:PYTORCH_FINAL_PACKAGE_DIR -Force
# Final sccache stats and stop
sccache --show-stats
sccache --stop-server
exit 0

View File

@ -126,11 +126,6 @@ if "%USE_CUDA%"=="1" (
set CMAKE_CUDA_COMPILER_LAUNCHER=%TMP_DIR%/bin/randomtemp.exe;%TMP_DIR%\bin\sccache.exe
)
:: Install build-system requirements before running setup.py commands
python -m pip install -r requirements-build.txt
if errorlevel 1 goto fail
if not errorlevel 0 goto fail
:: Print all existing environment variable for debugging
set

View File

@ -41,7 +41,7 @@ fi
python -m pip install pytest-rerunfailures==10.3 pytest-cpp==2.3.0 tensorboard==2.13.0 protobuf==5.29.4 pytest-subtests==0.13.1
# Install Z3 optional dependency for Windows builds.
python -m pip install z3-solver==4.15.1.0
python -m pip install z3-solver==4.12.2.0
# Install tlparse for test\dynamo\test_structured_trace.py UTs.
python -m pip install tlparse==0.3.30

View File

@ -18,5 +18,5 @@ start /wait "" python-amd64.exe /quiet InstallAllUsers=1 PrependPath=0 Include_t
if errorlevel 1 exit /b 1
set "PATH=%CD%\Python\Scripts;%CD%\Python;%PATH%"
%PYTHON_EXEC% -m pip install --upgrade pip "setuptools>=77.0.0" "packaging>=24.2" wheel
%PYTHON_EXEC% -m pip install --upgrade pip setuptools packaging wheel
if errorlevel 1 exit /b 1

View File

@ -7,9 +7,6 @@ call "internal\install_python.bat"
%PYTHON_EXEC% --version
set "PATH=%CD%\Python\Lib\site-packages\cmake\data\bin;%CD%\Python\Scripts;%CD%\Python;%PATH%"
%PYTHON_EXEC% -m pip install "setuptools>=77.0.0" "packaging>=24.2"
if "%DESIRED_PYTHON%" == "3.13t" %PYTHON_EXEC% -m pip install numpy==2.2.1 cmake
if "%DESIRED_PYTHON%" == "3.13" %PYTHON_EXEC% -m pip install numpy==2.1.2 cmake
if "%DESIRED_PYTHON%" == "3.12" %PYTHON_EXEC% -m pip install numpy==2.0.2 cmake
@ -19,7 +16,7 @@ if "%DESIRED_PYTHON%" == "3.9" %PYTHON_EXEC% -m pip install numpy==2.0.2 cmake
%PYTHON_EXEC% -m pip install pyyaml
%PYTHON_EXEC% -m pip install mkl-include mkl-static
%PYTHON_EXEC% -m pip install boto3 ninja typing-extensions
%PYTHON_EXEC% -m pip install boto3 ninja typing_extensions setuptools==72.1.0
where cmake.exe

View File

@ -127,7 +127,7 @@ export INSTALL_TEST=0 # dont install test binaries into site-packages
export MACOSX_DEPLOYMENT_TARGET=10.15
export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
SETUPTOOLS_PINNED_VERSION="==77.0.0"
SETUPTOOLS_PINNED_VERSION="==70.1.0"
PYYAML_PINNED_VERSION="=5.3"
EXTRA_CONDA_INSTALL_FLAGS=""
CONDA_ENV_CREATE_FLAGS=""
@ -135,7 +135,7 @@ RENAME_WHEEL=true
case $desired_python in
3.13t)
echo "Using 3.13 deps"
SETUPTOOLS_PINNED_VERSION=">=77.0.0"
SETUPTOOLS_PINNED_VERSION=">=70.1.0"
PYYAML_PINNED_VERSION=">=6.0.1"
NUMPY_PINNED_VERSION="=2.1.0"
CONDA_ENV_CREATE_FLAGS="python-freethreading"
@ -145,31 +145,31 @@ case $desired_python in
;;
3.13)
echo "Using 3.13 deps"
SETUPTOOLS_PINNED_VERSION=">=77.0.0"
SETUPTOOLS_PINNED_VERSION=">=70.1.0"
PYYAML_PINNED_VERSION=">=6.0.1"
NUMPY_PINNED_VERSION="=2.1.0"
;;
3.12)
echo "Using 3.12 deps"
SETUPTOOLS_PINNED_VERSION=">=77.0.0"
SETUPTOOLS_PINNED_VERSION=">=70.1.0"
PYYAML_PINNED_VERSION=">=6.0.1"
NUMPY_PINNED_VERSION="=2.0.2"
;;
3.11)
echo "Using 3.11 deps"
SETUPTOOLS_PINNED_VERSION=">=77.0.0"
SETUPTOOLS_PINNED_VERSION=">=70.1.0"
PYYAML_PINNED_VERSION=">=5.3"
NUMPY_PINNED_VERSION="=2.0.2"
;;
3.10)
echo "Using 3.10 deps"
SETUPTOOLS_PINNED_VERSION=">=77.0.0"
SETUPTOOLS_PINNED_VERSION=">=70.1.0"
PYYAML_PINNED_VERSION=">=5.3"
NUMPY_PINNED_VERSION="=2.0.2"
;;
3.9)
echo "Using 3.9 deps"
SETUPTOOLS_PINNED_VERSION=">=77.0.0"
SETUPTOOLS_PINNED_VERSION=">=70.1.0"
PYYAML_PINNED_VERSION=">=5.3"
NUMPY_PINNED_VERSION="=2.0.2"
;;

View File

@ -1 +1 @@
00b0c91db92c51a11356249262577b9fa26c18c5
b6a3368a45aaafe05f1a6a9f10c68adc5e944d9e

View File

@ -1 +1 @@
5fb5024118e9bb9decf96c2b0b1a8f0010bf56be
7f1de94a4c2d14f59ad4ca84538c36084ea6b2c8

.github/ci_commit_pins/vllm.txt vendored (new file, +1 line)
View File

@ -0,0 +1 @@
b77c7d327f2a463bb9ef8be36f30e920bc066502

View File

@ -76,8 +76,8 @@
- .github/ci_commit_pins/audio.txt
- .github/ci_commit_pins/vision.txt
- .github/ci_commit_pins/torchdynamo.txt
- .github/ci_commit_pins/vllm.txt
- .ci/docker/ci_commit_pins/triton.txt
- .ci/docker/ci_commit_pins/vllm.txt
approved_by:
- pytorchbot
mandatory_checks_name:
@ -492,6 +492,19 @@
- srossross
- chillee
- zou3519
- guilhermeleobas
mandatory_checks_name:
- EasyCLA
- Lint
- pull
- name: Dynamo
patterns:
- torch/_dynamo/**
- torch/csrc/dynamo/**
- test/dynamo/**
approved_by:
- guilhermeleobas
mandatory_checks_name:
- EasyCLA
- Lint

View File

@ -31,7 +31,9 @@ ciflow_push_tags:
- ciflow/pull
- ciflow/h100
- ciflow/h100-distributed
- ciflow/win-arm64
- ciflow/h100-symm-mem
- ciflow/h100-cutlass-backend
retryable_workflows:
- pull
- trunk

View File

@ -8,7 +8,7 @@
boto3==1.35.42
jinja2==3.1.6
lintrunner==0.10.7
ninja==1.11.1.4
ninja==1.10.0.post1
nvidia-ml-py==11.525.84
pyyaml==6.0
requests==2.32.4

View File

@ -7,12 +7,12 @@ hypothesis==6.56.4
librosa>=0.6.2
mpmath==1.3.0
networkx==2.8.7
ninja==1.11.1.4
ninja==1.10.2.4
numba==0.59.0
numpy==1.26.4
opt-einsum>=3.3
optree==0.13.0
packaging==25.0
packaging==23.1
parameterized==0.8.1
pillow==10.3.0
protobuf==5.29.4
@ -26,11 +26,11 @@ pytest-xdist==3.3.1
pytest==7.3.2
pyyaml==6.0.2
scipy==1.12.0
setuptools==80.9.0
setuptools==72.1.0
sympy==1.13.3
tlparse==0.3.30
tensorboard==2.13.0
typing-extensions==4.12.2
unittest-xml-reporting<=3.2.0,>=2.0.0
xdoctest==1.1.0
z3-solver==4.15.1.0
z3-solver==4.12.2.0

View File

@ -2,7 +2,7 @@
set -ex
# Use uv to speed up lintrunner init
python3 -m pip install -U uv setuptools
python3 -m pip install uv==0.1.45 setuptools
CACHE_DIRECTORY="/tmp/.lintbin"
# Try to recover the cached binaries

View File

@ -10,7 +10,7 @@ if "%PY_VERS%" == "3.13t" (
call conda create -n %PYTHON_PREFIX% -y -c=conda-forge python=%PY_VERS%
)
:: Fix cmake version for issue https://github.com/pytorch/pytorch/issues/150480
call conda run -n %PYTHON_PREFIX% pip install wheel pybind11 certifi cython cmake==3.31.6 setuptools==78.1.1 ninja
call conda run -n %PYTHON_PREFIX% pip install wheel pybind11 certifi cython cmake==3.31.6 setuptools==72.1.0 ninja
dir "%VC_INSTALL_PATH%"

View File

@ -27,7 +27,7 @@ jobs:
PR_NUMBER="${{ github.event.number }}"
# Use gh CLI to get changed files in the PR with explicit repo
CHANGED_FILES=$(gh pr view "$PR_NUMBER" --repo "${{ github.repository }}" --json files --jq '.files[].path' | tr '\n' ' ' | sed 's/ $//')
CHANGED_FILES=$(gh api repos/${{ github.repository }}/pulls/$PR_NUMBER/files --paginate --jq '.[] | select(.status != "removed") | .filename' | tr '\n' ' ' | sed 's/ $//')
if [ -z "$CHANGED_FILES" ]; then
echo "No changed files found, setting to '*'"
@ -40,4 +40,4 @@ jobs:
else
echo "Not in PR context, setting changed files to '*'"
echo "changed-files=*" >> "$GITHUB_OUTPUT"
fi
fi

View File

@ -80,11 +80,6 @@ jobs:
run: |
sysctl machdep.cpu.brand_string kern.osproductversion
- name: Install build toolchain
run: |
brew update --quiet
brew install --formula cmake ninja
- name: Clean up leftover processes on MacOS pet runner
continue-on-error: true
run: |

View File

@ -50,6 +50,7 @@ jobs:
runner: [linux.12xlarge]
docker-image-name: [
pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11,
pytorch-linux-jammy-cuda12.8-cudnn9-py3.12-gcc11-vllm,
pytorch-linux-jammy-cuda12.6-cudnn9-py3-gcc9-inductor-benchmarks,
pytorch-linux-jammy-cuda12.6-cudnn9-py3.12-gcc9-inductor-benchmarks,
pytorch-linux-jammy-cuda12.6-cudnn9-py3.13-gcc9-inductor-benchmarks,

View File

@ -0,0 +1,58 @@
name: Limited CI for CUTLASS backend on H100
on:
pull_request:
paths:
- .github/workflows/h100-cutlass-backend.yml
workflow_dispatch:
schedule:
- cron: 22 9 * * * # every 24 hours about 2:22am PDT
push:
tags:
- ciflow/h100-cutlass-backend/*
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }}
cancel-in-progress: true
permissions:
id-token: write
contents: read
jobs:
get-label-type:
if: github.repository_owner == 'pytorch'
name: get-label-type
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
linux-jammy-cuda12_8-py3_10-gcc11-sm90-build-cutlass-backend:
name: linux-jammy-cuda12.8-py3.10-gcc11-sm90-cutlass-backend
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-jammy-cuda12.8-py3.10-gcc11-sm90-cutlass-backend
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11
cuda-arch-list: '9.0'
test-matrix: |
{ include: [
{ config: "h100_cutlass_backend", shard: 1, num_shards: 1, runner: "linux.aws.h100", owners: ["oncall:pt2"] },
]}
secrets: inherit
linux-jammy-cuda12_8-py3_10-gcc11-sm90-test:
name: linux-jammy-cuda12.8-py3.10-gcc11-sm90-cutlass-backend
uses: ./.github/workflows/_linux-test.yml
needs:
- linux-jammy-cuda12_8-py3_10-gcc11-sm90-build-cutlass-backend
with:
build-environment: linux-jammy-cuda12.8-py3.10-gcc11-sm90-cutlass-backend
docker-image: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc11-sm90-build-cutlass-backend.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc11-sm90-build-cutlass-backend.outputs.test-matrix }}
secrets: inherit

View File

@ -86,7 +86,7 @@ jobs:
- repo-name: vllm
repo-owner: vllm-project
branch: main
pin-folder: .ci/docker/ci_commit_pins
pin-folder: .github/ci_commit_pins
# Allow this to be triggered on either a schedule or on workflow_dispatch to allow for easier testing
if: github.repository_owner == 'pytorch' && (github.event_name == 'schedule' || github.event_name == 'workflow_dispatch')
steps:

View File

@ -315,21 +315,6 @@ jobs:
test-matrix: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc11-build.outputs.test-matrix }}
secrets: inherit
linux-jammy-py3-clang18-mobile-build:
name: linux-jammy-py3-clang18-mobile-build
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-jammy-py3-clang12-mobile-build
docker-image-name: ci-image:pytorch-linux-jammy-py3-clang18-asan
build-generates-artifacts: false
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 1 },
]}
secrets: inherit
linux-jammy-cuda12_8-cudnn9-py3_9-clang12-build:
name: linux-jammy-cuda12.8-cudnn9-py3.9-clang12
uses: ./.github/workflows/_linux-build.yml

View File

@ -0,0 +1,187 @@
name: windows-arm64-build-test
on:
push:
tags:
- ciflow/win-arm64/*
env:
GIT_DEFAULT_BRANCH: ${{ github.event.repository.default_branch }}
PYTHON_VERSION: "3.12"
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
DOWNLOADS_DIR: c:\temp\downloads
DEPENDENCIES_DIR: c:\temp\dependencies
ENABLE_APL: 1
ENABLE_OPENBLAS: 0
BUILD_TYPE: release
permissions:
id-token: write
contents: read
jobs:
build:
# Don't run on forked repos.
if: github.repository_owner == 'pytorch'
runs-on: "windows-11-arm64-preview"
timeout-minutes: 240
steps:
- name: configure aws credentials
id: aws_creds
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: arn:aws:iam::308535385114:role/gha_workflow_sscache
aws-region: us-east-1
role-duration-seconds: 18000
- name: Enable long paths
shell: cmd
run: |
git config --system --get core.longpaths || echo "core.longpaths is not set, setting it now"
git config --system core.longpaths true
- name: Git checkout PyTorch
uses: actions/checkout@v4
with:
path: pytorch
submodules: recursive
- name: Bootstrap Python
shell: cmd
run: |
"pytorch/.ci/pytorch/windows/arm64/bootstrap_python.bat"
- name: Parse ref
id: parse-ref
shell: bash
run: python pytorch/.github/scripts/parse_ref.py
- name: Get workflow job id
shell: bash
id: get-job-id
run: |
set -eux
python pytorch/.github/scripts/get_workflow_job_id.py "${GITHUB_RUN_ID}" "${RUNNER_NAME}"
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
- name: Bootstrap APL
shell: cmd
run: |
"pytorch/.ci/pytorch/windows/arm64/bootstrap_apl.bat"
- name: Bootstrap Rust
shell: cmd
run: |
"pytorch/.ci/pytorch/windows/arm64/bootstrap_rust.bat"
- name: Bootstrap sccache
shell: cmd
run: |
"pytorch/.ci/pytorch/windows/arm64/bootstrap_sccache.bat"
- name: Bootstrap Libuv
shell: cmd
run: |
"pytorch/.ci/pytorch/windows/arm64/bootstrap_libuv.bat"
- name: Build
id: build
shell: cmd
env:
PYTORCH_FINAL_PACKAGE_DIR: C:/${{ github.run_id }}/build-results/
BRANCH: ${{ steps.parse-ref.outputs.branch }}
BUILD_WHEEL: 1
MAX_JOBS: 8
PYTHON_VERSION: "3.12"
SCCACHE_BUCKET: "ossci-compiler-cache"
SCCACHE_S3_KEY_PREFIX: ${{ github.workflow }}
SCCACHE_REGION: us-east-1
VC_PRODUCT: "BuildTools"
VC_VERSION: ""
ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine"
AWS_DEFAULT_REGION: us-east-1
USE_CUDA: '0'
USE_XPU: '0'
OUR_GITHUB_JOB_ID: ${{ steps.get-job-id.outputs.job-id }}
run: |
cd pytorch
call "C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Auxiliary\Build\vcvarsall.bat" arm64
powershell -ExecutionPolicy Bypass -File ".ci/pytorch/win-arm64-build.ps1"
- name: Upload artifacts
uses: actions/upload-artifact@v4.4.0
if: always()
with:
name: torch-wheel-win-arm64-py3-12
retention-days: 14
if-no-files-found: error
path: C:\${{ github.run_id }}\build-results
test:
if: github.repository_owner == 'pytorch'
strategy:
fail-fast: false
runs-on: "windows-11-arm64-preview"
needs: build
steps:
- name: Enable long paths
shell: cmd
run: |
git config --system --get core.longpaths || echo "core.longpaths is not set, setting it now"
git config --system core.longpaths true
- name: Git checkout PyTorch
uses: actions/checkout@v4
with:
path: pytorch
submodules: recursive
- name: Bootstrap Python
shell: cmd
run: |
"pytorch/.ci/pytorch/windows/arm64/bootstrap_python.bat"
- name: Bootstrap Rust
shell: cmd
run: |
"pytorch/.ci/pytorch/windows/arm64/bootstrap_rust.bat"
- name: Get workflow job id
shell: bash
id: get-job-id
run: |
set -eux
python pytorch/.github/scripts/get_workflow_job_id.py "${GITHUB_RUN_ID}" "${RUNNER_NAME}"
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
- name: Download Build Artifacts
uses: actions/download-artifact@v4.1.7
with:
name: torch-wheel-win-arm64-py3-12
path: C:\${{ github.run_id }}\build-results
- name: Test
id: test
shell: cmd
env:
USE_CUDA: '0'
INSTALL_WINDOWS_SDK: 1
PYTHON_VERSION: "3.12"
VC_PRODUCT: "BuildTools"
AWS_DEFAULT_REGION: us-east-1
GITHUB_REPOSITORY: ${{ github.repository }}
GITHUB_WORKFLOW: ${{ github.workflow }}
GITHUB_JOB: ${{ github.job }}
GITHUB_RUN_ID: ${{ github.run_id }}
GITHUB_RUN_NUMBER: ${{ github.run_number }}
GITHUB_RUN_ATTEMPT: ${{ github.run_attempt }}
JOB_ID: ${{ steps.get-job-id.outputs.job-id }}
JOB_NAME: ${{ steps.get-job-id.outputs.job-name }}
PYTORCH_FINAL_PACKAGE_DIR: C:/${{ github.run_id }}/build-results/
run: |
mkdir "%PYTORCH_FINAL_PACKAGE_DIR%"
call pytorch/.ci/pytorch/windows/arm64/bootstrap_tests.bat
set GIT_BASH=C:\Program Files\Git\usr\bin\bash.exe
"%GIT_BASH%" -c "bash --noprofile --norc .ci/pytorch/win-arm64-test.sh"

View File

@ -294,14 +294,12 @@ Install PyTorch
```bash
export CMAKE_PREFIX_PATH="${CONDA_PREFIX:-'$(dirname $(which conda))/../'}:${CMAKE_PREFIX_PATH}"
python -m pip install -r requirements-build.txt
python -m pip install --no-build-isolation -v -e .
```
**On macOS**
```bash
python -m pip install -r requirements-build.txt
python -m pip install --no-build-isolation -v -e .
```

View File

@ -14,7 +14,9 @@
#include <ATen/cpu/FlushDenormal.h>
#ifdef USE_FBGEMM
C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wextra-semi")
#include <fbgemm/Fbgemm.h>
C10_DIAGNOSTIC_POP()
#endif // USE_FBGEMM
#if defined(__aarch64__) && !defined(C10_MOBILE)
#include <cpuinfo.h>

View File

@ -1,6 +1,5 @@
#pragma once
#include <c10/core/CachingDeviceAllocator.h>
#include <c10/core/DeviceType.h>
#include <c10/macros/Macros.h>
@ -73,27 +72,6 @@ TORCH_API c10::DeviceIndex exchangeDevice(c10::DeviceIndex device_index);
// original device index that was active before the change.
TORCH_API c10::DeviceIndex maybeExchangeDevice(c10::DeviceIndex device_index);
TORCH_API inline void emptyCache() {
const auto device_type = getAccelerator(true).value();
at::getDeviceAllocator(device_type)->emptyCache();
}
TORCH_API inline at::CachingDeviceAllocator::DeviceStats getDeviceStats(
c10::DeviceIndex device_index) {
const auto device_type = getAccelerator(true).value();
return at::getDeviceAllocator(device_type)->getDeviceStats(device_index);
}
TORCH_API inline void resetAccumulatedStats(c10::DeviceIndex device_index) {
const auto device_type = getAccelerator(true).value();
at::getDeviceAllocator(device_type)->resetAccumulatedStats(device_index);
}
TORCH_API inline void resetPeakStats(c10::DeviceIndex device_index) {
const auto device_type = getAccelerator(true).value();
at::getDeviceAllocator(device_type)->resetPeakStats(device_index);
}
} // namespace at::accelerator
namespace at {

View File

@ -2,6 +2,7 @@
#include <ATen/cuda/CUDAGraph.h>
#include <ATen/cuda/Exceptions.h>
#include <ATen/Functions.h>
#include <c10/cuda/CUDACachingAllocator.h>
#include <c10/cuda/CUDAFunctions.h>
#include <cstddef>

View File

@ -2,7 +2,6 @@
#include <ATen/Tensor.h>
#include <c10/core/Device.h>
#include <c10/cuda/CUDACachingAllocator.h>
#include <c10/cuda/CUDAGraphsC10Utils.h>
#include <c10/cuda/CUDAStream.h>
#include <c10/util/flat_hash_map.h>

View File

@ -258,7 +258,7 @@ DECLARE_HOST_ALLOCATOR(
CUDACachingHostAllocator,
CUDACachingHostAllocatorImpl,
raw_local_deleter,
caching_host_allocator);
caching_host_allocator)
REGISTER_HOST_ALLOCATOR(at::kCUDA, &caching_host_allocator)

View File

@ -158,6 +158,7 @@ TORCH_LIBRARY_IMPL(aten, FuncTorchBatchedDecomposition, m) {
OP_DECOMPOSE(kron);
OP_DECOMPOSE(l1_loss);
m.impl("layer_norm", native::layer_norm_symint);
m.impl("_fused_rms_norm", native::rms_norm_composite);
OP_DECOMPOSE2(ldexp, Tensor);
OP_DECOMPOSE2(less_equal, Tensor );
OP_DECOMPOSE2(less, Tensor );

View File

@ -1,6 +1,6 @@
#pragma once
#include <c10/core/CachingDeviceAllocator.h>
#include <c10/core/Allocator.h>
#include <c10/core/DeviceType.h>
// Use of c10::hip namespace here makes hipification easier, because
@ -10,10 +10,10 @@ namespace c10::hip {
// Takes a valid HIPAllocator (of any sort) and turns it into
// an allocator pretending to be a CUDA allocator. See
// Note [Masquerading as CUDA]
class HIPAllocatorMasqueradingAsCUDA final : public DeviceAllocator {
DeviceAllocator* allocator_;
class HIPAllocatorMasqueradingAsCUDA final : public Allocator {
Allocator* allocator_;
public:
explicit HIPAllocatorMasqueradingAsCUDA(DeviceAllocator* allocator)
explicit HIPAllocatorMasqueradingAsCUDA(Allocator* allocator)
: allocator_(allocator) {}
DataPtr allocate(size_t size) override {
DataPtr r = allocator_->allocate(size);
@ -26,24 +26,6 @@ public:
void copy_data(void* dest, const void* src, std::size_t count) const final {
allocator_->copy_data(dest, src, count);
}
bool initialized() override {
return allocator_->initialized();
}
void emptyCache(MempoolId_t mempool_id = {0, 0}) {
allocator_->emptyCache(mempool_id);
}
void recordStream(const DataPtr& ptr, c10::Stream stream) {
allocator_->recordStream(ptr, stream);
}
CachingDeviceAllocator::DeviceStats getDeviceStats(c10::DeviceIndex device) {
return allocator_->getDeviceStats(device);
}
void resetAccumulatedStats(c10::DeviceIndex device) {
allocator_->resetAccumulatedStats(device);
}
void resetPeakStats(c10::DeviceIndex device) {
allocator_->resetPeakStats(device);
}
};
} // namespace c10::hip

View File

@ -4,9 +4,8 @@
namespace c10 { namespace hip {
namespace HIPCachingAllocatorMasqueradingAsCUDA {
static HIPAllocatorMasqueradingAsCUDA allocator(HIPCachingAllocator::get());
Allocator* get() {
static HIPAllocatorMasqueradingAsCUDA allocator(HIPCachingAllocator::get());
return &allocator;
}
@ -14,9 +13,5 @@ void recordStreamMasqueradingAsCUDA(const DataPtr& ptr, HIPStreamMasqueradingAsC
HIPCachingAllocator::recordStream(ptr, stream.hip_stream());
}
// Register this HIP allocator as CUDA allocator to enable access through both
// c10::GetAllocator(kCUDA) and c10::getDeviceAllocator(kCUDA) APIs
REGISTER_ALLOCATOR(kCUDA, &allocator)
} // namespace HIPCachingAllocatorMasqueradingAsCUDA
}} // namespace c10::hip

View File

@ -36,8 +36,10 @@
#endif
#ifdef USE_FBGEMM
C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wextra-semi")
#include <fbgemm/Fbgemm.h>
#include <fbgemm/FbgemmConvert.h>
C10_DIAGNOSTIC_POP()
#endif
namespace {

View File

@ -14,8 +14,10 @@
#include <c10/util/Half.h>
#ifdef USE_FBGEMM
C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wextra-semi")
#include <fbgemm/Fbgemm.h>
#include <fbgemm/FbgemmConvert.h>
C10_DIAGNOSTIC_POP()
#else
#include <caffe2/perfkernels/embedding_lookup_idx.h>
#endif

View File

@ -25,9 +25,11 @@
#include <c10/util/irange.h>
#ifdef USE_FBGEMM
C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wextra-semi")
#include <fbgemm/Fbgemm.h>
#include <fbgemm/FbgemmFP16.h>
#include <fbgemm/QuantUtils.h>
C10_DIAGNOSTIC_POP()
#endif // USE_FBGEMM
namespace caffe2 {
@ -409,7 +411,7 @@ Tensor fbgemm_pack_gemm_matrix_fp16(const Tensor& weight) {
Tensor fbgemm_linear_fp16_weight_fp32_activation(
const Tensor& input,
const Tensor& packed_weight,
const Tensor& bias) {
const std::optional<Tensor>& bias) {
TORCH_WARN_ONCE("fbgemm_linear_fp16_weight_fp32_activation is deprecated "
"and will be removed in a future PyTorch release.")
@ -430,7 +432,6 @@ Tensor fbgemm_linear_fp16_weight_fp32_activation(
TORCH_CHECK(input.size(input.dim() - 1) == packed_weight_fp16.numRows())
TORCH_CHECK(input.dim() >= 2);
TORCH_CHECK(bias.dim() == 1);
// NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-narrowing-conversions)
const int64_t M = size_to_dim_(input.dim() - 1, input.sizes());
@ -449,7 +450,12 @@ Tensor fbgemm_linear_fp16_weight_fp32_activation(
output.data_ptr<float>());
// Add bias term
output.add_(bias);
c10::MaybeOwned<Tensor> bias_maybe_owned = at::borrow_from_optional_tensor(bias);
const Tensor& bias_ = *bias_maybe_owned;
if (bias_.defined()) {
TORCH_CHECK(bias_.dim() == 1);
output.add_(bias_);
}
return output;
}
@ -551,7 +557,7 @@ Tensor fbgemm_pack_gemm_matrix_fp16(const Tensor& weight) {
Tensor fbgemm_linear_fp16_weight_fp32_activation(
const Tensor& input,
const Tensor& packed_weight,
const Tensor& bias) {
const std::optional<Tensor>& bias) {
TORCH_WARN_ONCE("fbgemm_linear_fp16_weight_fp32_activation is deprecated "
"and will be removed in a future PyTorch release.")

View File

@ -4,9 +4,11 @@
#include <c10/core/QScheme.h>
#ifdef USE_FBGEMM
C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wextra-semi")
#include <fbgemm/Fbgemm.h>
#include <fbgemm/FbgemmSparse.h>
#include <ATen/native/ao_sparse/quantized/cpu/packed_params.h>
C10_DIAGNOSTIC_POP()
namespace ao::sparse {

View File

@ -6,7 +6,9 @@
#include <c10/util/llvmMathExtras.h>
#ifdef USE_FBGEMM
C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wextra-semi")
#include <fbgemm/Fbgemm.h>
C10_DIAGNOSTIC_POP()
#endif
namespace at::native {

View File

@ -11,25 +11,11 @@
#include <ATen/cuda/CUDAContext.h>
#if defined(USE_ROCM)
// TODO(lufang): Tensor.item() on AMD HIP is not synced in the Recsys models.
// This is just a short term workaround. Issue is tracked as FBA-388 on the AMD side.
namespace {
bool use_sync_mode() {
static const bool sync_mode = c10::utils::check_env("HIP_DOUBLE_SYNC_ON_LOCAL_SCALE_DENSE") == true;
return sync_mode;
}
}
#endif
namespace at::native {
Scalar _local_scalar_dense_cuda(const Tensor& self) {
Scalar r;
TORCH_CHECK(self.numel() > 0, "_local_scalar_dense: Empty tensor not supported");
#if defined(USE_ROCM)
if (!use_sync_mode()){
#endif
AT_DISPATCH_V2(
self.scalar_type(), "_local_scalar_dense_cuda", AT_WRAP([&] {
// Create pinned memory for the scalar value to avoid implicit
@ -46,15 +32,6 @@ Scalar _local_scalar_dense_cuda(const Tensor& self) {
at::cuda::memcpy_and_sync((void *)value.const_data_ptr<scalar_t>(), self.const_data_ptr<scalar_t>(), sizeof(scalar_t), cudaMemcpyDeviceToHost, stream);
r = Scalar(*value.const_data_ptr<scalar_t>());
}), AT_EXPAND(AT_ALL_TYPES_AND_COMPLEX), kComplexHalf, kHalf, kBool, kBFloat16, AT_EXPAND(AT_FLOAT8_TYPES), AT_EXPAND(AT_BAREBONES_UNSIGNED_TYPES));
#if defined(USE_ROCM)
} else {
auto cpu_self = self.cpu();
AT_DISPATCH_V2(
self.scalar_type(), "_local_scalar_dense_hip", AT_WRAP([&] {
r = Scalar(*cpu_self.const_data_ptr<scalar_t>());
}), AT_EXPAND(AT_ALL_TYPES_AND_COMPLEX), kComplexHalf, kHalf, kBool, kBFloat16, AT_EXPAND(AT_BAREBONES_UNSIGNED_TYPES));
}
#endif
return r;
}

File diff suppressed because it is too large.

View File

@ -261,30 +261,11 @@ std::tuple<Tensor, Tensor, Tensor> math_native_layer_norm(
return outputs;
}
Tensor rms_norm_symint(
std::tuple<Tensor, Tensor> rms_norm_composite(
const Tensor& input,
c10::SymIntArrayRef normalized_shape,
IntArrayRef normalized_shape,
const std::optional<Tensor>& weight_opt /* optional */,
std::optional<double> eps) {
// See [Note: hacky wrapper removal for optional tensor]
c10::MaybeOwned<Tensor> weight_maybe_owned = at::borrow_from_optional_tensor(weight_opt);
const Tensor& weight = *weight_maybe_owned;
_check_rms_norm_inputs_symint(input, normalized_shape, weight);
#ifdef USE_MPS
if (input.device().type() == DeviceType::MPS && weight_opt.has_value()) {
const Tensor weight = weight_opt.value();
const bool any_nested = input.is_nested() || weight.is_nested();
const bool any_inputs_require_grad = input.requires_grad() || weight.requires_grad();
const bool is_input_fp = isFloatingType(input.scalar_type());
const bool is_weight_fp = isFloatingType(weight.scalar_type());
if (!(GradMode::is_enabled() && any_inputs_require_grad) && !any_nested && is_input_fp && is_weight_fp) {
auto eps_val = eps.value_or(std::numeric_limits<double>::epsilon());
return at::_fused_rms_norm(input.contiguous(), normalized_shape.size(), weight.contiguous(), eps_val);
}
}
#endif
std::vector<int64_t> dims_to_reduce;
for (const auto i : c10::irange(normalized_shape.size())) {
@ -321,10 +302,67 @@ Tensor rms_norm_symint(
upcasted_result = upcasted_result.mul(weight_opt.value());
}
return upcasted_result;
// if nested do not make contiguous
if(input.is_nested() || (weight_opt.has_value() && weight_opt.value().is_nested())){
return std::make_tuple(upcasted_result, rqrst_input);
}
if(input.suggest_memory_format() == c10::MemoryFormat::ChannelsLast || input.suggest_memory_format() == c10::MemoryFormat::ChannelsLast3d){
return std::make_tuple(upcasted_result, rqrst_input);
}
return std::make_tuple(upcasted_result.contiguous(), rqrst_input.contiguous());
});
return result.type_as(input);
return std::make_tuple(
std::get<0>(result).type_as(input), // Cast normalized result to original input type
std::get<1>(result) // rsqrt_val
);
}
Tensor rms_norm_symint(
const Tensor& input,
c10::SymIntArrayRef normalized_shape,
const std::optional<Tensor>& weight_opt /* optional */,
const std::optional<double> eps) {
c10::MaybeOwned<Tensor> weight_maybe_owned = at::borrow_from_optional_tensor(weight_opt);
const Tensor& weight = *weight_maybe_owned;
_check_rms_norm_inputs_symint(input, normalized_shape, weight);
// composite fallback for channels last
if(input.suggest_memory_format() == c10::MemoryFormat::ChannelsLast || input.suggest_memory_format() == c10::MemoryFormat::ChannelsLast3d){
return std::get<0>(rms_norm_composite(input, IntArrayRef(reinterpret_cast<const int64_t*>(normalized_shape.data()), normalized_shape.size()), weight_opt, eps));
}
// composite fallback for complex datatypes
if(input.is_complex()){
return std::get<0>(rms_norm_composite(input, IntArrayRef(reinterpret_cast<const int64_t*>(normalized_shape.data()), normalized_shape.size()), weight_opt, eps));
}
if (weight_opt.has_value() && weight_opt.value().defined() && weight_opt.value().dtype() != input.dtype()) {
TORCH_WARN_ONCE(
"Mismatch dtype between input and module: input dtype = ", input.dtype(),
", module dtype = ", weight_opt.value().dtype(), ", Can not dispatch to fused implementation"
);
return std::get<0>(rms_norm_composite(input, IntArrayRef(reinterpret_cast<const int64_t*>(normalized_shape.data()), normalized_shape.size()), weight_opt, eps));
}
#ifdef USE_MPS
if (input.device().type() == DeviceType::MPS && weight_opt.has_value()) {
const Tensor weight = weight_opt.value();
const bool any_inputs_require_grad = input.requires_grad() || weight.requires_grad();
if (!(GradMode::is_enabled() && any_inputs_require_grad)) {
return std::get<0>(at::_fused_rms_norm(input.contiguous(), IntArrayRef(reinterpret_cast<const int64_t*>(normalized_shape.data()), normalized_shape.size()), weight_opt, eps));
}
}
if (input.device().type() == DeviceType::MPS){
return std::get<0>(rms_norm_composite(input, IntArrayRef(reinterpret_cast<const int64_t*>(normalized_shape.data()), normalized_shape.size()), weight_opt, eps));
}
#endif
return std::get<0>(at::_fused_rms_norm(input, IntArrayRef(reinterpret_cast<const int64_t*>(normalized_shape.data()), normalized_shape.size()), weight_opt, eps));
}
} // namespace at::native
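For orientation, a minimal Python sketch of how this dispatch is reached through the public API; the shapes and eps are illustrative, and whether the fused kernel or the composite fallback runs depends on the device, the build, and the checks above (channels-last, complex, or mismatched-dtype inputs take the composite path).

```python
import torch
import torch.nn.functional as F

x = torch.randn(8, 16, 32)
w = torch.randn(32)
y = F.rms_norm(x, normalized_shape=(32,), weight=w, eps=1e-6)

# A channels-last input is routed to rms_norm_composite per the check above.
x_cl = torch.randn(2, 3, 4, 4).to(memory_format=torch.channels_last)
y_cl = F.rms_norm(x_cl, normalized_shape=(4,), weight=torch.randn(4))
```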

View File

@ -106,6 +106,12 @@ void layer_norm_cpu_out(
int64_t M,
int64_t N);
std::tuple<Tensor, Tensor> rms_norm_composite(
const Tensor& input,
IntArrayRef normalized_shape,
const std::optional<Tensor>& weight_opt /* optional */,
std::optional<double> eps);
Tensor rms_norm_symint(
const Tensor& input,
c10::SymIntArrayRef normalized_shape,

View File

@ -145,8 +145,6 @@ MPSGraphTensorData* getMPSGraphTensorData(MPSGraph* mpsGraph, MPSStream* mpsStre
MPSGraphTensorData* getMPSGraphTensorFromScalar(MPSStream* mpsStream, MPSScalar& scalar);
MPSGraph* make_mps_graph();
void printTensorNDArray(const TensorBase& t);
MPSNDArray* ndArrayFromTensor(const TensorBase& tensor, MPSShape* shape, MPSDataType mpsType);
MPSGraphTensor* mpsGraphUnrankedPlaceHolder(MPSGraph* mpsGraph, MPSDataType dataType);
MPSGraphTensor* mpsGraphRankedPlaceHolder(MPSGraph* mpsGraph, MPSDataType dataType, MPSShape* mpsShape);

View File

@ -377,36 +377,6 @@ MPSShape* getMPSShape(IntArrayRef sizes, c10::MemoryFormat memory_format) {
return [NSArray arrayWithObjects:numbers.data() count:numbers.size()];
}
void printTensorNDArray(const TensorBase& t) {
if (!t.is_mps())
return;
if (t.numel() == 0)
return;
// Get shape and data type
auto selfShape = getMPSShape(t);
auto selfDType = getMPSDataType(t.scalar_type());
// Initialize data
id<MTLBuffer> selfBuf = getMTLBufferStorage(t);
MPSGraphTensorData* tdata = [[[MPSGraphTensorData alloc] initWithMTLBuffer:selfBuf shape:selfShape
dataType:selfDType] autorelease];
C10_CLANG_DIAGNOSTIC_PUSH()
#if C10_CLANG_HAS_WARNING("-Wobjc-method-access")
C10_CLANG_DIAGNOSTIC_IGNORE("-Wobjc-method-access")
#endif
[tdata printNDArray];
C10_CLANG_DIAGNOSTIC_POP()
}
MPSNDArray* ndArrayFromTensor(const TensorBase& tensor, MPSShape* shape, MPSDataType mpsType) {
id<MTLBuffer> buffer = getMTLBufferStorage(tensor);
MPSGraphTensorData* tmpGraphTensorData = [[[MPSGraphTensorData alloc] initWithMTLBuffer:buffer
shape:shape
dataType:mpsType] autorelease];
return [tmpGraphTensorData mpsndarray];
}
static std::vector<int64_t> getSortedStrides(const IntArrayRef& s) {
std::vector<int64_t> idx(s.size());
iota(idx.begin(), idx.end(), 0);
@ -457,12 +427,22 @@ static MPSNDArray* permuteNDArray(MPSNDArray* inArray, const std::vector<int64_t
return result;
}
// Should be called before initWithBuffer to prevent hard crashes with
// '[MPSNDArray initWithDevice:descriptor:isTextureBacked:] Error: NDArray dimension length > INT_MAX'
static void check_mps_shape(MPSShape* shape) {
for (NSNumber* elem in shape) {
const auto val = [elem longValue];
TORCH_CHECK(val <= std::numeric_limits<int32_t>::max(), "MPSGraph does not support tensor dims larger than INT_MAX");
}
}
MPSNDArray* getMPSNDArray(const TensorBase& t, MPSShape* sizes, MPSShape* strides) {
id<MTLBuffer> srcBuf = getMTLBufferStorage(t);
MPSDataType mpsDataType = getMPSDataType(t.scalar_type());
MPSNDArrayDescriptor* srcTensorDesc = [MPSNDArrayDescriptor descriptorWithDataType:mpsDataType shape:sizes];
srcTensorDesc.preferPackedRows = YES;
check_mps_shape(sizes);
MPSNDArray* srcNDArray = [[[MPSNDArray alloc] initWithBuffer:srcBuf
offset:t.storage_offset() * t.element_size()
descriptor:srcTensorDesc] autorelease];
@ -572,9 +552,9 @@ Placeholder::Placeholder(MPSGraphTensor* mpsGraphTensor,
// Tensor is contiguous and has no storage offset.
// Wrap it directly inside MPSGraphTensorData
if ((_tensor.is_contiguous() && !_tensor.storage_offset()) || !useMPSStridedAPI || !is_macOS_15_0_or_newer) {
_value = [[[MPSGraphTensorData alloc] initWithMTLBuffer:srcBuf
shape:mpsShape_ ? mpsShape_ : getMPSShape(_tensor)
dataType:dataType] autorelease];
auto shape = mpsShape_ ? mpsShape_ : getMPSShape(_tensor);
check_mps_shape(shape);
_value = [[[MPSGraphTensorData alloc] initWithMTLBuffer:srcBuf shape:shape dataType:dataType] autorelease];
} else {
IntArrayRef view_shape;
if (mpsShape_) {
@ -583,8 +563,11 @@ Placeholder::Placeholder(MPSGraphTensor* mpsGraphTensor,
MPSShape* mpsShape = getMPSShape(_tensor);
MPSShape* mpsStrides = getMPSShape(_tensor.strides());
check_mps_shape(mpsShape);
auto storage_numel = src.storage().nbytes() / src.element_size();
TORCH_CHECK(storage_numel <= std::numeric_limits<int32_t>::max(),
"MPSGaph does not support tensor dims larger than INT_MAX");
MPSNDArrayDescriptor* srcTensorDesc = [MPSNDArrayDescriptor descriptorWithDataType:dataType
shape:@[ @(storage_numel) ]];
srcTensorDesc.preferPackedRows = YES;

View File

@ -62,15 +62,12 @@ static Tensor& fill_scalar_mps_impl(Tensor& self, const Scalar& value) {
return self;
}
// returns false if tensor cannot be filled with fillBuffer()
static bool fill_mps_tensor_(Tensor& self, uint8_t value) {
if (self.is_contiguous()) {
MPSStream* stream = getCurrentMPSStream();
auto storage_byte_offset = self.storage_offset() * self.itemsize();
stream->fill(mps::getMTLBufferStorage(self), value, self.nbytes(), storage_byte_offset);
return true;
}
return false;
static Tensor& fill_mps_tensor_(Tensor& self, uint8_t value) {
TORCH_INTERNAL_ASSERT(self.is_contiguous());
const auto stream = getCurrentMPSStream();
auto storage_byte_offset = self.storage_offset() * self.itemsize();
stream->fill(mps::getMTLBufferStorage(self), value, self.nbytes(), storage_byte_offset);
return self;
}
Tensor& fill_scalar_mps(Tensor& self, const Scalar& value) {
@ -89,8 +86,20 @@ Tensor& fill_scalar_mps(Tensor& self, const Scalar& value) {
return self;
}
// check if it's possible to use fillBuffer() to fill the Tensor's storage
if (value.toDouble() == 0.0 && fill_mps_tensor_(self, 0) == true)
return self;
if (self.is_contiguous()) {
if (value.toDouble() == 0.0) {
return fill_mps_tensor_(self, 0);
}
if (self.scalar_type() == kBool) {
return fill_mps_tensor_(self, value.toBool());
}
if (self.scalar_type() == kByte) {
return fill_mps_tensor_(self, value.toByte());
}
if (self.scalar_type() == kChar) {
return fill_mps_tensor_(self, value.toChar());
}
}
return fill_scalar_mps_impl(self, value);
}
@ -101,8 +110,6 @@ Tensor& fill_tensor_mps_(Tensor& self, const Tensor& value) {
value.dim(),
" dimensions.");
Scalar scalar_value = value.item();
if (scalar_value.toDouble() == 0.0 && fill_mps_tensor_(self, 0) == true)
return self;
return fill_scalar_mps(self, scalar_value);
}
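A hedged sketch of the fills the new fast path targets, assuming an MPS-capable build: contiguous tensors filled with zero, bool, byte, or char values can be served by fillBuffer() instead of a graph kernel.

```python
import torch

if torch.backends.mps.is_available():
    t = torch.empty(1024, device="mps")
    t.fill_(0)                                    # zero fill: fast-path candidate
    b = torch.empty(16, dtype=torch.bool, device="mps")
    b.fill_(True)                                 # bool fill: fast-path candidate
    u = torch.empty(16, dtype=torch.uint8, device="mps")
    u.fill_(7)                                    # byte fill: fast-path candidate
```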

View File

@ -19,7 +19,14 @@ static auto& lib = MetalShaderLibrary::getBundledLibrary();
#include <ATen/native/mps/RMSNorm_metallib.h>
#endif
Tensor _fused_rms_norm_mps(const Tensor& input, const int64_t normalized_ndim, const Tensor& weight, const double eps) {
std::tuple<Tensor, Tensor> _fused_rms_norm_mps(const Tensor& input,
IntArrayRef normalized_shape,
const std::optional<Tensor>& weight_opt,
const std::optional<double> eps) {
const Tensor weight = weight_opt.value().contiguous();
const int64_t normalized_ndim = normalized_shape.size();
auto eps_val = eps.value_or(std::numeric_limits<double>::epsilon());
TORCH_CHECK(input.is_contiguous() && weight.is_contiguous(), "Expected contiguous input and weight tensors");
auto output = at::empty_like(input);
const auto input_shape = input.sizes();
@ -41,7 +48,7 @@ Tensor _fused_rms_norm_mps(const Tensor& input, const int64_t normalized_ndim, c
const std::string kernel = fmt::format("{}_{}", name, scalarToMetalTypeString(output));
id<MTLComputePipelineState> rms_norm_pso = lib.getPipelineStateForFunc(kernel);
[computeEncoder setComputePipelineState:rms_norm_pso];
mtl_setArgs(computeEncoder, input, weight, output, eps, N, 1);
mtl_setArgs(computeEncoder, input, weight, output, eps_val, N, 1);
const auto maxThreadsPerGroup = static_cast<size_t>([rms_norm_pso maxTotalThreadsPerThreadgroup]);
size_t threadgroup_size = maxThreadsPerGroup;
@ -58,7 +65,7 @@ Tensor _fused_rms_norm_mps(const Tensor& input, const int64_t normalized_ndim, c
}
});
return output;
return std::make_tuple(output, Tensor());
}
} // namespace at::native

View File

@ -1067,6 +1067,7 @@
CUDA: baddbmm_out_cuda
MPS: baddbmm_out_mps
XPU: baddbmm_out_xpu
MTIA: baddbmm_out_mtia
SparseCsrCUDA: baddbmm_out_sparse_csr_cuda
- func: baddbmm.dtype(Tensor self, Tensor batch1, Tensor batch2, ScalarType out_dtype, *, Scalar beta=1, Scalar alpha=1) -> Tensor
@ -1376,6 +1377,7 @@
CUDA: bmm_out_cuda
MPS: bmm_out_mps
XPU: bmm_out_xpu
MTIA: bmm_out_mtia
SparseCPU: bmm_out_sparse_cpu
SparseCUDA: bmm_out_sparse_cuda
SparseCsrCUDA: bmm_out_sparse_csr_cuda
@ -3314,9 +3316,15 @@
dispatch:
CompositeImplicitAutograd: rms_norm_symint
- func: _fused_rms_norm(Tensor input, int normalized_shape_ndim, Tensor weight, float eps) -> Tensor
- func: _fused_rms_norm(Tensor input, int[] normalized_shape, Tensor? weight, float? eps) -> (Tensor, Tensor)
dispatch:
CUDA: _fused_rms_norm_cuda
MPS: _fused_rms_norm_mps
CompositeImplicitAutograd: rms_norm_composite
- func: _fused_rms_norm_backward(Tensor grad_out, Tensor input, int[] normalized_shape, Tensor rstd, Tensor? weight, bool[2] output_mask) -> (Tensor, Tensor)
dispatch:
CUDA: _fused_rms_norm_backward_cuda
- func: nan_to_num(Tensor self, float? nan=None, float? posinf=None, float? neginf=None) -> Tensor
variants: function, method
@ -3432,7 +3440,7 @@
- func: _wrapped_quantized_linear_prepacked(Tensor input, Tensor input_scale, Tensor input_zero_point, Tensor packed_weight, Tensor output_scale, Tensor output_zero_point, int out_channel) -> Tensor
- func: fbgemm_linear_fp16_weight_fp32_activation(Tensor input, Tensor packed_weight, Tensor bias) -> Tensor
- func: fbgemm_linear_fp16_weight_fp32_activation(Tensor input, Tensor packed_weight, Tensor? bias) -> Tensor
- func: fbgemm_linear_fp16_weight(Tensor input, Tensor packed_weight, Tensor bias) -> Tensor
@ -7059,6 +7067,7 @@
CUDA: addmm_out_cuda
MPS: addmm_out_mps
XPU: addmm_out_xpu
MTIA: addmm_out_mtia
SparseCPU: addmm_out_sparse_dense_cpu
SparseCUDA: addmm_out_sparse_dense_cuda
SparseCsrCPU: addmm_out_sparse_compressed_cpu
@ -8962,7 +8971,7 @@
structured_inherits: TensorIteratorBase
device_check: NoCheck # TensorIterator
dispatch:
CPU, CUDA: eq_Scalar_out
CPU, CUDA, MTIA: eq_Scalar_out
MPS: eq_scalar_out_mps
QuantizedCPU: eq_out_quantized_cpu
tags: pointwise
@ -8981,7 +8990,7 @@
structured_inherits: TensorIteratorBase
device_check: NoCheck # TensorIterator
dispatch:
CPU, CUDA: eq_Tensor_out
CPU, CUDA, MTIA: eq_Tensor_out
MPS: eq_tensor_out_mps
QuantizedCPU: eq_out_quantized_cpu
tags: pointwise
@ -9374,7 +9383,7 @@
structured_inherits: TensorIteratorBase
device_check: NoCheck # TensorIterator
dispatch:
CPU, CUDA: addcmul_out
CPU, CUDA, MTIA: addcmul_out
MPS: addcmul_out_mps
tags: pointwise
@ -9395,7 +9404,7 @@
structured_inherits: TensorIteratorBase
device_check: NoCheck # TensorIterator
dispatch:
CPU, CUDA: addcdiv_out
CPU, CUDA, MTIA: addcdiv_out
MPS: addcdiv_out_mps
tags: pointwise

View File

@ -7,11 +7,13 @@
#include <c10/util/irange.h>
#ifdef USE_FBGEMM
C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wextra-semi")
#include <fbgemm/Fbgemm.h>
C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Winconsistent-missing-destructor-override")
#include <fbgemm/FbgemmFP16.h>
C10_DIAGNOSTIC_POP()
#include <fbgemm/QuantUtils.h>
C10_DIAGNOSTIC_POP()
// The struct for the packed weight matrix (PackBMatrix) and the corresponding
// column offsets used for the fully connect layer, which are both prepared in

View File

@ -888,7 +888,7 @@ class QLinearUnpackedDynamicFp16 final {
static at::Tensor run(
at::Tensor input,
const at::Tensor& weight,
const at::Tensor& bias) {
const std::optional<at::Tensor>& bias) {
// We make a strong guarantee that models using these operators will have
// the same numerics across different machines. Therefore, we do not provide
// a fallback path and rather fail loudly if we cannot run FBGEMM.
@ -908,7 +908,7 @@ class QLinearUnpackedDynamicFp16 final {
static at::Tensor meta(
at::Tensor input,
const at::Tensor& weight,
const at::Tensor& bias) {
const std::optional<at::Tensor>& bias) {
// We make a strong guarantee that models using these operators will have
// the same numerics across different machines. Therefore, we do not provide
// a fallback path and rather fail loudly if we cannot run FBGEMM.
@ -929,7 +929,7 @@ class QLinearUnpackedDynamicFp16 final {
static at::Tensor run(
at::Tensor /* input */,
const at::Tensor& weight,
const at::Tensor& bias) {
const std::optional<at::Tensor>& bias) {
// We make a strong guarantee that models using these operators will have
// the same numerics across different machines. Therefore, we do not provide
// a fallback path and rather fail loudly if we cannot run FBGEMM.
@ -940,7 +940,7 @@ class QLinearUnpackedDynamicFp16 final {
static at::Tensor meta(
at::Tensor /* input */,
const at::Tensor& weight,
const at::Tensor& bias) {
const std::optional<at::Tensor>& bias) {
TORCH_CHECK(
false, "This PyTorch installation was not built with FBGEMM operators");
}

View File

@ -142,7 +142,7 @@ TORCH_LIBRARY(quantized, m) {
m.def(TORCH_SELECTIVE_SCHEMA("quantized::linear_dynamic(Tensor X, __torch__.torch.classes.quantized.LinearPackedParamsBase W_prepack, bool reduce_range=False) -> Tensor Y"), {at::Tag::pt2_compliant_tag});
m.def(TORCH_SELECTIVE_SCHEMA("quantized::linear_relu_dynamic(Tensor X, __torch__.torch.classes.quantized.LinearPackedParamsBase W_prepack, bool reduce_range=False) -> Tensor Y"), {at::Tag::pt2_compliant_tag});
m.def(TORCH_SELECTIVE_SCHEMA("quantized::linear_dynamic_fp16(Tensor X, __torch__.torch.classes.quantized.LinearPackedParamsBase W_prepack) -> Tensor Y"), {at::Tag::pt2_compliant_tag});
m.def(TORCH_SELECTIVE_SCHEMA("quantized::linear_dynamic_fp16_unpacked_weight(Tensor X, Tensor weight, Tensor bias) -> Tensor Y"), {at::Tag::pt2_compliant_tag});
m.def(TORCH_SELECTIVE_SCHEMA("quantized::linear_dynamic_fp16_unpacked_weight(Tensor X, Tensor weight, Tensor? bias) -> Tensor Y"), {at::Tag::pt2_compliant_tag});
m.def(TORCH_SELECTIVE_SCHEMA("quantized::linear_relu_dynamic_fp16(Tensor X, __torch__.torch.classes.quantized.LinearPackedParamsBase W_prepack) -> Tensor Y"), {at::Tag::pt2_compliant_tag});
m.def(TORCH_SELECTIVE_SCHEMA("quantized::linear_leaky_relu(Tensor X, __torch__.torch.classes.quantized.LinearPackedParamsBase W_prepack, float Y_scale_i, int Y_zero_point_i, float negative_slope) -> Tensor Y"), {at::Tag::pt2_compliant_tag});
m.def(TORCH_SELECTIVE_SCHEMA("quantized::linear_tanh(Tensor X, __torch__.torch.classes.quantized.LinearPackedParamsBase W_prepack, float Y_scale_i, int Y_zero_point_i) -> Tensor Y"), {at::Tag::pt2_compliant_tag});

View File

@ -242,7 +242,11 @@ __global__ void coalesceValuesKernel(
// `if constexpr` when CUDA codes will be compiled under C++-17, see
// gh-56055 for blockers.
template<typename Dtype>
#ifdef USE_ROCM
C10_LAUNCH_BOUNDS_1(C10_WARP_SIZE_STATIC*4)
#else
C10_LAUNCH_BOUNDS_1(C10_WARP_SIZE*4)
#endif
__global__ void coalesceValuesKernel(
int64_t *segment_offsets, int64_t *value_indices,
bool *values, bool *newValues,

View File

@ -32,7 +32,9 @@
#endif
C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wextra-semi")
#include <cutlass/numeric_types.h>
C10_DIAGNOSTIC_POP()
#include <flash.h>

View File

@ -75,7 +75,7 @@ Tensor TensorMaker::make_tensor() {
}
auto storage_size = size * itemsize;
if (storage_offset_) {
storage_size += storage_offset_.value();
storage_size += storage_offset_.value() * itemsize;
}
return storage_size;
}

View File

@ -519,6 +519,15 @@ TEST(BasicTest, BasicStdTestCPU) {
}
TEST(BasicTest, TestForBlobResizeCPU) {
// Checks that for_blob can correctly create tensors with non-empty offset and resize them
std::array<int32_t, 6> storage;
std::iota(storage.begin(), storage.end(), 1);
auto t = at::for_blob(storage.data(), {3,}).storage_offset(3).options(c10::TensorOptions(kInt)).make_tensor();
auto te = *at::expand_size(t, {3, 3});
ASSERT_EQ(te[1][1].item<int32_t>(), 5);
}
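A hedged Python analogue of the test above, using as_strided to reproduce the element-offset arithmetic: skipping 3 elements of the 6-element storage leaves [4, 5, 6], and expanding to 3x3 repeats that row, so element [1][1] is 5.

```python
import torch

storage = torch.arange(1, 7, dtype=torch.int32)                   # [1, 2, 3, 4, 5, 6]
t = storage.as_strided(size=(3,), stride=(1,), storage_offset=3)  # [4, 5, 6]
te = t.expand(3, 3)
assert te[1][1].item() == 5
```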
TEST(BasicTest, TestForBlobStridesResizeCPU) {
// Checks that for_blob can correctly create tensors with non-empty offset and resize them
std::array<int32_t, 6> storage;
std::iota(storage.begin(), storage.end(), 1);

View File

@ -20,4 +20,8 @@
#error "CAFFE2_STATIC_LINK_CUDA should not be visible in public headers"
#endif
auto main() -> int {}
#include <gtest/gtest.h>
TEST(VerifyApiVisibility, Test) {
ASSERT_EQ(1, 1);
}

View File

@ -3264,6 +3264,12 @@ def parse_args(args=None):
instead of deleting it and creating a new one.",
)
parser.add_argument(
"--caching-precompile",
action="store_true",
help="Enables caching precompile, serializing artifacts to DynamoCache between runs",
)
group_latency = parser.add_mutually_exclusive_group()
group_latency.add_argument(
"--cold-start-latency",
@ -3414,6 +3420,29 @@ def parse_args(args=None):
return parser.parse_args(args)
def process_caching_precompile():
"""
After every process_entry, save precompile artifacts to DynamoCache
"""
assert torch._dynamo.config.caching_precompile, (
"Caching precompile should be enabled with --caching-precompile"
)
from torch._dynamo.precompile_context import PrecompileContext
# Serialize all callables, clear PrecompileContext
# TODO: put this under torch.compiler API once ready
serialized = PrecompileContext.serialize()
PrecompileContext.clear()
if serialized is not None:
artifacts, info = serialized
print(
f"Saving {len(info.precompile_dynamo_artifacts)} Precompile Artifact(s)..."
)
results = PrecompileContext.deserialize(artifacts)
assert results is not None
PrecompileContext.populate_caches(results)
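For context, a hedged sketch of the same round-trip outside the benchmark harness: enable caching precompile, compile something, then serialize and re-populate the caches, mirroring the helper above.

```python
import torch
from torch._dynamo.precompile_context import PrecompileContext

torch._dynamo.config.caching_precompile = True

@torch.compile
def f(x):
    return x * 2 + 1

f(torch.randn(4))

serialized = PrecompileContext.serialize()   # None when no new dynamo compile occurred
if serialized is not None:
    artifacts, info = serialized
    PrecompileContext.populate_caches(PrecompileContext.deserialize(artifacts))
```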
def process_entry(rank, runner, original_dir, args):
args.rank = rank
with maybe_init_distributed(
@ -3422,7 +3451,10 @@ def process_entry(rank, runner, original_dir, args):
world_size=args.world_size,
port=args.distributed_master_port,
):
return run(runner, args, original_dir)
result = run(runner, args, original_dir)
if args.caching_precompile:
process_caching_precompile()
return result
def maybe_fresh_cache(args):
@ -3458,6 +3490,10 @@ def main(runner, original_dir=None, args=None):
)
with maybe_fresh_cache(args):
if args.caching_precompile:
os.environ["TORCH_CACHING_PRECOMPILE"] = "1"
torch._dynamo.config.caching_precompile = True
args.init_distributed = args.only and args.multiprocess
if args.init_distributed:
# NB: Do NOT query device count before CUDA initialization; we're

View File

@ -56,7 +56,11 @@ def list_benchmarks():
print(f"Available benchmarks: {list(BENCHMARK_REGISTRY.keys())}")
def run_benchmark(benchmark_name: str, should_visualize: bool = False):
def run_benchmark(
benchmark_name: str,
should_visualize: bool = False,
compile_mode: str = "max-autotune-no-cudagraphs",
):
"""Run a specific benchmark."""
if benchmark_name not in BENCHMARK_REGISTRY:
print(f"Error: Unknown benchmark '{benchmark_name}'")
@ -64,10 +68,11 @@ def run_benchmark(benchmark_name: str, should_visualize: bool = False):
return False
print(f"Running benchmark: {benchmark_name}")
print(f"Torch compile mode: {compile_mode}")
print("=" * 60)
benchmark_class = BENCHMARK_REGISTRY[benchmark_name]
benchmark = benchmark_class()
benchmark = benchmark_class(compile_mode)
benchmark.benchmark()
if should_visualize:
benchmark.visualize()
@ -75,14 +80,15 @@ def run_benchmark(benchmark_name: str, should_visualize: bool = False):
return True
def run_all_benchmarks(should_visualize: bool = False):
def run_all_benchmarks(should_visualize: bool = False, compile_mode: str = "default"):
"""Run all available benchmarks."""
print("Running all benchmarks...")
print(f"Torch compile mode: {compile_mode}")
print("=" * 60)
for name, cls in BENCHMARK_REGISTRY.items():
print(f"\n{'=' * 20} {name.upper()} {'=' * 20}")
benchmark = cls()
benchmark = cls(compile_mode)
benchmark.benchmark()
if should_visualize:
benchmark.visualize()
@ -124,6 +130,13 @@ Examples:
help="Visualize results after running benchmarks",
)
parser.add_argument(
"--compile-mode",
choices=["default", "max-autotune-no-cudagraphs"],
default="max-autotune-no-cudagraphs",
help="Torch compile mode to use (default: default)",
)
args = parser.parse_args()
# Handle list option
@ -133,7 +146,7 @@ Examples:
# Handle all option
if args.all:
run_all_benchmarks(args.visualize)
run_all_benchmarks(args.visualize, args.compile_mode)
return
# Handle specific benchmarks
@ -144,7 +157,7 @@ Examples:
sys.exit(1)
for benchmark_name in args.benchmarks:
run_benchmark(benchmark_name, args.visualize)
run_benchmark(benchmark_name, args.visualize, args.compile_mode)
print() # Add spacing between benchmarks
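A hedged sketch of calling the updated helpers directly with the new compile_mode parameter, assuming they are imported from the benchmark runner module; the benchmark name is illustrative and must be a key of BENCHMARK_REGISTRY.

```python
# hypothetical import path; the helpers live in the benchmark runner script
# from run_benchmarks import run_benchmark, run_all_benchmarks

run_benchmark("softmax_forward", should_visualize=False,
              compile_mode="max-autotune-no-cudagraphs")
run_all_benchmarks(should_visualize=True, compile_mode="default")
```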

View File

@ -9,8 +9,8 @@ import torch.nn.functional as F
class CrossEntropyForward(BenchmarkKernel):
def __init__(self):
super().__init__()
def __init__(self, compile_mode: str = "max-autotune-no-cudagraphs"):
super().__init__(compile_mode)
self.available_backends = ["eager", "compiled", "quack", "liger"]
def get_shapes(self) -> tuple[tuple[int, ...], ...]:
@ -52,7 +52,8 @@ class CrossEntropyForward(BenchmarkKernel):
# More discussion: https://github.com/pytorch/pytorch/issues/158455
compiled_cross_entropy = torch.compile(
lambda x, target: F.cross_entropy(x, target, reduction="none"),
mode="max-autotune-no-cudagraphs",
mode=self.compile_mode,
fullgraph=True,
)
return lambda: compiled_cross_entropy(x, target)
@ -105,8 +106,8 @@ class CrossEntropyForward(BenchmarkKernel):
class CrossEntropyBackward(BenchmarkKernel):
def __init__(self):
super().__init__()
def __init__(self, compile_mode: str = "max-autotune-no-cudagraphs"):
super().__init__(compile_mode)
self.available_backends = ["eager", "compiled", "quack", "liger"]
def get_shapes(self) -> tuple[tuple[int, ...], ...]:
@ -149,7 +150,8 @@ class CrossEntropyBackward(BenchmarkKernel):
compiled_cross_entropy = torch.compile(
lambda x, target: F.cross_entropy(x, target, reduction="none"),
mode="max-autotune-no-cudagraphs",
mode=self.compile_mode,
fullgraph=True,
)
loss = compiled_cross_entropy(x, target)
return lambda: torch.autograd.grad(
@ -192,8 +194,8 @@ class CrossEntropyBackward(BenchmarkKernel):
class SoftmaxForward(BenchmarkKernel):
def __init__(self):
super().__init__()
def __init__(self, compile_mode: str = "max-autotune-no-cudagraphs"):
super().__init__(compile_mode)
self.available_backends = ["eager", "compiled", "quack", "liger"]
def get_shapes(self) -> tuple[tuple[int, ...], ...]:
@ -229,7 +231,7 @@ class SoftmaxForward(BenchmarkKernel):
torch._dynamo.mark_dynamic(x, 0)
compiled_softmax = torch.compile(
lambda x: F.softmax(x, dim=-1), mode="max-autotune-no-cudagraphs"
lambda x: F.softmax(x, dim=-1), mode=self.compile_mode, fullgraph=True
)
return lambda: compiled_softmax(x)
@ -257,8 +259,8 @@ class SoftmaxForward(BenchmarkKernel):
class SoftmaxBackward(BenchmarkKernel):
def __init__(self):
super().__init__()
def __init__(self, compile_mode: str = "max-autotune-no-cudagraphs"):
super().__init__(compile_mode)
self.available_backends = ["eager", "compiled", "quack", "liger"]
def get_shapes(self) -> tuple[tuple[int, ...], ...]:
@ -292,7 +294,7 @@ class SoftmaxBackward(BenchmarkKernel):
assert kwargs is None
x, dy = args
compiled_softmax = torch.compile(
lambda x: F.softmax(x, dim=-1), mode="max-autotune-no-cudagraphs"
lambda x: F.softmax(x, dim=-1), mode=self.compile_mode, fullgraph=True
)
y = compiled_softmax(x)
return lambda: torch.autograd.grad(y, x, grad_outputs=dy, retain_graph=True)
@ -327,8 +329,8 @@ class SoftmaxBackward(BenchmarkKernel):
class RMSNormForward(BenchmarkKernel):
def __init__(self):
super().__init__()
def __init__(self, compile_mode: str = "max-autotune-no-cudagraphs"):
super().__init__(compile_mode)
self.available_backends = ["eager", "compiled", "quack", "liger"]
def get_shapes(self) -> tuple[tuple[int, ...], ...]:
@ -372,7 +374,7 @@ class RMSNormForward(BenchmarkKernel):
torch._dynamo.mark_dynamic(x, 0)
compiled_rms_norm = torch.compile(
self.rms_norm_ref, mode="max-autotune-no-cudagraphs"
self.rms_norm_ref, mode=self.compile_mode, fullgraph=True
)
return lambda: compiled_rms_norm(x, w)
@ -402,8 +404,8 @@ class RMSNormForward(BenchmarkKernel):
class RMSNormBackward(BenchmarkKernel):
def __init__(self):
super().__init__()
def __init__(self, compile_mode: str = "max-autotune-no-cudagraphs"):
super().__init__(compile_mode)
self.available_backends = ["eager", "compiled", "quack", "liger"]
def get_shapes(self) -> tuple[tuple[int, ...], ...]:
@ -445,7 +447,9 @@ class RMSNormBackward(BenchmarkKernel):
def compiled(self, args, kwargs=None) -> Any:
assert kwargs is None
x, w, dy = args
y = torch.compile(self.rms_norm_ref, mode="max-autotune-no-cudagraphs")(x, w)
y = torch.compile(self.rms_norm_ref, mode=self.compile_mode, fullgraph=True)(
x, w
)
return lambda: torch.autograd.grad(
y, [x, w], grad_outputs=dy, retain_graph=True
)
@ -485,8 +489,8 @@ class RMSNormBackward(BenchmarkKernel):
class LayerNormForward(BenchmarkKernel):
def __init__(self):
super().__init__()
def __init__(self, compile_mode: str = "max-autotune-no-cudagraphs"):
super().__init__(compile_mode)
self.available_backends = ["eager", "compiled", "quack", "liger"]
def get_shapes(self) -> tuple[tuple[int, ...], ...]:
@ -526,7 +530,7 @@ class LayerNormForward(BenchmarkKernel):
torch._dynamo.mark_dynamic(x, 0)
compiled_layernorm = torch.compile(
self.layernorm_ref, mode="max-autotune-no-cudagraphs"
self.layernorm_ref, mode=self.compile_mode, fullgraph=True
)
return lambda: compiled_layernorm(x, w, eps=1e-6)
@ -559,8 +563,8 @@ class LayerNormForward(BenchmarkKernel):
class LayerNormBackward(BenchmarkKernel):
def __init__(self):
super().__init__()
def __init__(self, compile_mode: str = "max-autotune-no-cudagraphs"):
super().__init__(compile_mode)
self.available_backends = ["eager", "compiled", "liger"]
def get_shapes(self) -> tuple[tuple[int, ...], ...]:
@ -603,7 +607,7 @@ class LayerNormBackward(BenchmarkKernel):
assert kwargs is None
x, w, dy = args
compiled_layernorm = torch.compile(
self.layernorm_ref, mode="max-autotune-no-cudagraphs"
self.layernorm_ref, mode=self.compile_mode, fullgraph=True
)
y = compiled_layernorm(x, w)
return lambda: torch.autograd.grad(

View File

@ -13,7 +13,8 @@ def benchmark_kernel_in_milliseconds(func: Callable, *args, **kwargs) -> float:
# warmup
for _ in range(5):
func(*args, **kwargs)
return benchmarker.benchmark_gpu(lambda: func(*args, **kwargs))
with torch.compiler.set_stance("fail_on_recompile"):
return benchmarker.benchmark_gpu(lambda: func(*args, **kwargs))
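A hedged sketch of what the "fail_on_recompile" stance guards against: any recompilation inside the timed region raises instead of silently re-tracing, so the measured kernel is the one that was warmed up.

```python
import torch

compiled = torch.compile(lambda x: x.relu())
compiled(torch.randn(8))                      # warm up / compile once
with torch.compiler.set_stance("fail_on_recompile"):
    compiled(torch.randn(8))                  # same shape: no recompile
    # compiled(torch.randn(16, 16))           # a rank change here would raise
```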
@dataclass
@ -41,9 +42,10 @@ class Performance:
class BenchmarkKernel:
def __init__(self):
def __init__(self, compile_mode: str = "max-autotune-no-cudagraphs"):
self.name = self.__class__.__name__
self.available_backends: list[str] = []
self.compile_mode: str = compile_mode
# mapping from backend to list of performance results
self.profiling_results: defaultdict[str, list[Performance]] = defaultdict(list)

View File

@ -864,7 +864,6 @@ libtorch_python_core_sources = [
"torch/csrc/QScheme.cpp",
"torch/csrc/Module.cpp",
"torch/csrc/PyInterpreter.cpp",
"torch/csrc/PyInterpreterHooks.cpp",
"torch/csrc/python_dimname.cpp",
"torch/csrc/Size.cpp",
"torch/csrc/Storage.cpp",

View File

@ -1,10 +0,0 @@
#include <c10/core/CachingDeviceAllocator.h>
namespace c10 {
// Ensures proper DLL export of this pure virtual base class on Windows,
// since it's mainly used in other DLLs outside c10.dll.
DeviceAllocator::DeviceAllocator() = default;
DeviceAllocator::~DeviceAllocator() = default;
} // namespace c10

View File

@ -1,7 +1,6 @@
#pragma once
#include <c10/core/Allocator.h>
#include <c10/core/Stream.h>
namespace c10::CachingDeviceAllocator {
@ -60,55 +59,3 @@ struct DeviceStats {
};
} // namespace c10::CachingDeviceAllocator
namespace c10 {
using CaptureId_t = unsigned long long;
// first is set if the instance is created by Graph mode capture_begin.
// second is set if the instance is created by Graph mode graph_pool_handle.
using MempoolId_t = std::pair<CaptureId_t, CaptureId_t>;
struct C10_API DeviceAllocator : public c10::Allocator {
DeviceAllocator();
~DeviceAllocator() override;
// Returns true if the allocator has been properly initialized and is ready
// for use
virtual bool initialized() = 0;
// Releases all cached device memory from the specified memory pool back to
// the system
virtual void emptyCache(MempoolId_t mempool_id = {0, 0}) = 0;
// Associates a memory allocation with a stream to establish dependency
// tracking. Prevents memory reuse until all operations on the specified
// stream complete
virtual void recordStream(const DataPtr& ptr, c10::Stream stream) = 0;
// Retrieves comprehensive memory statistics for the specified device,
// including allocation patterns, usage metrics
virtual CachingDeviceAllocator::DeviceStats getDeviceStats(
c10::DeviceIndex device) = 0;
// Resets cumulative allocation statistics for the specified device to zero
virtual void resetAccumulatedStats(c10::DeviceIndex device) = 0;
// Resets peak memory usage statistics for the specified device
virtual void resetPeakStats(c10::DeviceIndex device) = 0;
};
// This function is used to get the DeviceAllocator for a specific device type
// and keep backward compatibility with c10::GetAllocator.
C10_API inline DeviceAllocator* getDeviceAllocator(const DeviceType& t) {
TORCH_CHECK(
t != DeviceType::CPU,
"getDeviceAllocator is not supported for CPU device type.");
auto* allocator = c10::GetAllocator(t);
auto* device_allocator = dynamic_cast<DeviceAllocator*>(allocator);
TORCH_INTERNAL_ASSERT(
device_allocator, "Allocator for ", t, " is not a DeviceAllocator.");
return device_allocator;
}
} // namespace c10

View File

@ -240,4 +240,24 @@ struct C10_API PyInterpreter {
void disarm() noexcept;
};
// PyInterpreterStatus describes what the state of its interpreter tag
// is, relative to the thread currently holding the GIL.
enum class PyInterpreterStatus {
// We just allocated the Tensor, it hasn't escaped to other threads,
// we know that it definitely hasn't been tagged to be associated
// with an interpreter.
DEFINITELY_UNINITIALIZED,
// We queried the interpreter field and it looked uninitialized. But
// another thread may have raced with us to tag it with some other
// interpreter id. So we will have to do a CEX to make sure we can
// actually nab it.
MAYBE_UNINITIALIZED,
// We queried the interpreter field and it was tagged to belong to us.
// This means we have sole write access (as we hold the GIL for this
// interpreter)
TAGGED_BY_US,
// Someone else tagged this. We can't use this TensorImpl from Python.
TAGGED_BY_OTHER,
};
} // namespace c10::impl

View File

@ -1,32 +0,0 @@
#include <c10/core/impl/PyInterpreterHooks.h>
namespace c10::impl {
// Define the registry
C10_DEFINE_REGISTRY(
PyInterpreterHooksRegistry,
PyInterpreterHooksInterface,
PyInterpreterHooksArgs)
const PyInterpreterHooksInterface& getPyInterpreterHooks() {
auto create_impl = [] {
#if !defined C10_MOBILE
auto hooks = PyInterpreterHooksRegistry()->Create(
"PyInterpreterHooks", PyInterpreterHooksArgs{});
if (hooks) {
return hooks;
}
#endif
// Return stub implementation that will throw errors when methods are called
return std::make_unique<PyInterpreterHooksInterface>();
};
static auto hooks = create_impl();
return *hooks;
}
// Main function to get global PyInterpreter
PyInterpreter* getGlobalPyInterpreter() {
return getPyInterpreterHooks().getPyInterpreter();
}
} // namespace c10::impl

View File

@ -1,39 +0,0 @@
#pragma once
#include <c10/core/impl/PyInterpreter.h>
#include <c10/macros/Export.h>
#include <c10/util/Registry.h>
#include <memory>
namespace c10::impl {
// Minimal interface for PyInterpreter hooks
struct C10_API PyInterpreterHooksInterface {
virtual ~PyInterpreterHooksInterface() = default;
// Get the PyInterpreter instance
// Stub implementation throws error when Python is not available
virtual PyInterpreter* getPyInterpreter() const {
TORCH_CHECK(
false,
"PyTorch was compiled without Python support. "
"Cannot access Python interpreter from C++.");
}
};
struct C10_API PyInterpreterHooksArgs{};
C10_DECLARE_REGISTRY(
PyInterpreterHooksRegistry,
PyInterpreterHooksInterface,
PyInterpreterHooksArgs);
#define REGISTER_PYTHON_HOOKS(clsname) \
C10_REGISTER_CLASS(PyInterpreterHooksRegistry, clsname, clsname)
// Get the global PyInterpreter hooks instance
C10_API const PyInterpreterHooksInterface& getPyInterpreterHooks();
C10_API PyInterpreter* getGlobalPyInterpreter();
} // namespace c10::impl

View File

@ -34,12 +34,29 @@ PyObject* PyObjectSlot::_unchecked_untagged_pyobj() const {
reinterpret_cast<uintptr_t>(pyobj_) & ~0x1ULL);
}
void PyObjectSlot::unchecked_clear_pyobj(PyInterpreter* interpreter) {
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(interpreter == pyobj_interpreter_.load());
pyobj_ = nullptr;
}
PyInterpreter& PyObjectSlot::load_pyobj_interpreter() const {
auto interpreter = pyobj_interpreter_.load(std::memory_order_acquire);
if (interpreter) {
return *interpreter;
}
TORCH_CHECK(false, "cannot access PyObject for Tensor - no interpreter set");
TORCH_CHECK(
false,
"cannot access PyObject for Tensor on interpreter ",
(*pyobj_interpreter_.load())->name());
}
bool PyObjectSlot::check_interpreter(PyInterpreter* interpreter) {
return interpreter == pyobj_interpreter();
}
bool PyObjectSlot::has_pyobj_nonhermetic() {
return check_pyobj(pyobj_interpreter(), /*ignore_hermetic_tls=*/true)
.has_value();
}
bool PyObjectSlot::owns_pyobj() {

View File

@ -2,7 +2,6 @@
#include <c10/core/impl/HermeticPyObjectTLS.h>
#include <c10/core/impl/PyInterpreter.h>
#include <c10/core/impl/PyInterpreterHooks.h>
#include <c10/util/python_stub.h>
#include <optional>
@ -25,9 +24,52 @@ struct C10_API PyObjectSlot {
//
// NB: THIS FUNCTION CAN RAISE AN EXCEPTION. Make sure to clean up after
// PyObject if necessary!
void init_pyobj(PyObject* pyobj) {
pyobj_interpreter_.store(
getGlobalPyInterpreter(), std::memory_order_relaxed);
void init_pyobj(
PyInterpreter* self_interpreter,
PyObject* pyobj,
PyInterpreterStatus status) {
impl::PyInterpreter* expected = nullptr;
switch (status) {
case impl::PyInterpreterStatus::DEFINITELY_UNINITIALIZED:
// caller guarantees there is no multithreaded access; if there is
// no data race OK to do a relaxed store
pyobj_interpreter_.store(self_interpreter, std::memory_order_relaxed);
break;
case impl::PyInterpreterStatus::TAGGED_BY_US:
// no tagging is necessary, the tag is already correct
break;
case impl::PyInterpreterStatus::MAYBE_UNINITIALIZED:
// attempt to claim this TensorImpl with the specified interpreter
// tag
if (pyobj_interpreter_.compare_exchange_strong(
expected, self_interpreter, std::memory_order_acq_rel)) {
break;
}
// test if, actually, it was already tagged by us! this situation can't
// be caused by a race, but it could be caused by a situation
// where someone conservatively tagged the tensor as MAYBE_UNINITIALIZED
// (because they didn't pre-check the tag) when actually it was
// owned by the interpreter
if (expected == self_interpreter) {
break;
}
// fallthrough, we lost the race. We are guaranteed not to lose the
// race with ourself, as calls to init_pyobj with the same interpreter
// ID must be sequentialized by the GIL
[[fallthrough]];
case impl::PyInterpreterStatus::TAGGED_BY_OTHER:
TORCH_CHECK(
false,
"cannot allocate PyObject for Tensor on interpreter ",
self_interpreter,
" that has already been used by another torch deploy interpreter ",
pyobj_interpreter_.load());
}
// we are the ONLY thread that can have gotten to this point. It is not
// possible to conflict with another zero interpreter as access is protected
// by GIL
// NB: owns_pyobj tag is initially false
pyobj_ = pyobj;
}
@ -52,25 +94,49 @@ struct C10_API PyObjectSlot {
//
// NB: this lives in header so that we can avoid actually creating the
// std::optional
// @todo alban: I'm not too sure what's going on here, we can probably delete
// it but it's worthwhile making sure
std::optional<PyObject*> check_pyobj(bool ignore_hermetic_tls = false) const {
std::optional<PyObject*> check_pyobj(
PyInterpreter* self_interpreter,
bool ignore_hermetic_tls = false) const {
// Note [Memory ordering on Python interpreter tag]
impl::PyInterpreter* interpreter =
pyobj_interpreter_.load(std::memory_order_acquire);
if (interpreter == nullptr) {
// NB: This never returns DEFINITELY_UNINITIALIZED because there is
// always the possibility that another thread races to initialize
// after we query here. The only time when we can conclude a tensor
// is definitely uninitialized is when we have just allocated it and
// it cannot have escaped to other threads yet
return std::nullopt;
}
if (!ignore_hermetic_tls && c10::impl::HermeticPyObjectTLS::get_state()) {
return std::nullopt;
} else if (interpreter == self_interpreter) {
// NB: pyobj_ could still be null!
if (!ignore_hermetic_tls && c10::impl::HermeticPyObjectTLS::get_state()) {
return std::nullopt;
} else {
return _unchecked_untagged_pyobj();
}
} else {
return _unchecked_untagged_pyobj();
TORCH_CHECK(
false,
"cannot access PyObject for Tensor on interpreter ",
(*self_interpreter)->name(),
" that has already been used by another torch deploy interpreter ",
(*pyobj_interpreter_.load())->name());
}
}
// Clear the PyObject field for an interpreter, in situations where we
// statically know the tensor is tagged with our interpreter.
void unchecked_clear_pyobj(PyInterpreter* interpreter);
PyInterpreter& load_pyobj_interpreter() const;
// Check if the PyObjectSlot's interpreter is the same as the specified
// interpreter
bool check_interpreter(PyInterpreter* interpreter);
// Check if the PyObjectSlot is holding a PyObject, owned or non-owned
bool has_pyobj_nonhermetic();
bool owns_pyobj();
void set_owns_pyobj(bool b);

View File

@ -4179,7 +4179,6 @@ struct BackendStaticInitializer {
BackendStaticInitializer() {
auto r = parseEnvForBackend();
at::SetAllocator(kCUDA, r, 0);
allocator.store(r);
}
};

View File

@ -202,24 +202,25 @@ struct ShareableHandle {
std::string handle;
};
class CUDAAllocator : public DeviceAllocator {
class CUDAAllocator : public Allocator {
public:
virtual void* raw_alloc(size_t nbytes) = 0;
virtual void* raw_alloc_with_stream(size_t nbytes, cudaStream_t stream) = 0;
virtual void raw_delete(void* ptr) = 0;
virtual void init(int device_count) = 0;
virtual bool initialized() = 0;
virtual double getMemoryFraction(c10::DeviceIndex device) = 0;
virtual void setMemoryFraction(double fraction, c10::DeviceIndex device) = 0;
virtual void emptyCache(MempoolId_t mempool_id = {0, 0}) = 0;
virtual void enable(bool value) = 0;
virtual bool isEnabled() const = 0;
virtual void cacheInfo(c10::DeviceIndex device, size_t* largestBlock) = 0;
virtual void* getBaseAllocation(void* ptr, size_t* size) = 0;
// Keep for BC only
virtual void recordStream(const DataPtr& ptr, CUDAStream stream) = 0;
void recordStream(const DataPtr& ptr, c10::Stream stream) override {
CUDAStream cuda_stream = CUDAStream(stream);
recordStream(ptr, cuda_stream);
}
virtual void recordStream(const DataPtr&, CUDAStream stream) = 0;
virtual c10::CachingDeviceAllocator::DeviceStats getDeviceStats(
c10::DeviceIndex device) = 0;
virtual void resetAccumulatedStats(c10::DeviceIndex device) = 0;
virtual void resetPeakStats(c10::DeviceIndex device) = 0;
virtual SnapshotInfo snapshot(MempoolId_t mempool_id = {0, 0}) = 0;
virtual void beginAllocateToPool(
c10::DeviceIndex device,
@ -524,10 +525,6 @@ inline void enablePeerAccess(
namespace c10::cuda {
// Keep BC only
using c10::CaptureId_t;
using c10::MempoolId_t;
// MemPool represents a pool of memory in a caching allocator. Currently,
// it's just the ID of the pool object maintained in the CUDACachingAllocator.
//

View File

@ -30,7 +30,7 @@ void c10_cuda_check_implementation(
check_message.append("CUDA error: ");
const char* error_string = cudaGetErrorString(cuda_error);
check_message.append(error_string);
check_message.append(c10::cuda::get_cuda_error_help(error_string));
check_message.append(c10::cuda::get_cuda_error_help(cuda_error));
check_message.append(c10::cuda::get_cuda_check_suffix());
check_message.append("\n");
if (include_device_assertions) {

View File

@ -9,6 +9,12 @@
namespace c10::cuda {
using CaptureId_t = unsigned long long;
// first is set if the instance is created by CUDAGraph::capture_begin.
// second is set if the instance is created by at::cuda::graph_pool_handle.
using MempoolId_t = std::pair<CaptureId_t, CaptureId_t>;
// RAII guard for "cudaStreamCaptureMode", a thread-local value
// that controls the error-checking strictness of a capture.
struct C10_CUDA_API CUDAStreamCaptureModeGuard {

View File

@ -1,5 +1,6 @@
#include <c10/cuda/CUDAMiscFunctions.h>
#include <c10/util/env.h>
#include <cuda_runtime.h>
#include <cstring>
#include <string>
@ -7,11 +8,19 @@ namespace c10::cuda {
// Explain common CUDA errors
// NOLINTNEXTLINE(bugprone-exception-escape,-warnings-as-errors)
std::string get_cuda_error_help(const char* error_string) noexcept {
std::string get_cuda_error_help(cudaError_t error) noexcept {
std::string help_text;
if (strstr(error_string, "invalid device ordinal")) {
help_text.append(
"\nGPU device may be out of range, do you have enough GPUs?");
switch (error) {
case cudaErrorInvalidDevice:
help_text.append(
"\nGPU device may be out of range, do you have enough GPUs?");
break;
default:
help_text.append("\nSearch for `")
.append(cudaGetErrorName(error))
.append(
"' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.");
break;
}
return help_text;
}

View File

@ -3,12 +3,13 @@
// CUDAExceptions.h
#include <c10/cuda/CUDAMacros.h>
#include <cuda_runtime.h>
#include <mutex>
#include <string>
namespace c10::cuda {
C10_CUDA_API std::string get_cuda_error_help(const char*) noexcept;
C10_CUDA_API std::string get_cuda_error_help(cudaError_t) noexcept;
C10_CUDA_API const char* get_cuda_check_suffix() noexcept;
C10_CUDA_API std::mutex* getFreeMutex();
} // namespace c10::cuda

View File

@ -540,7 +540,7 @@ class DeviceCachingAllocator {
static void local_raw_delete(void* ptr);
class XPUAllocator : public DeviceAllocator {
class XPUAllocator : public Allocator {
private:
std::mutex mutex;
ska::flat_hash_map<void*, Block*> allocated_blocks;
@ -576,10 +576,6 @@ class XPUAllocator : public DeviceAllocator {
}
}
bool initialized() override {
return !device_allocators.empty();
}
void malloc(
void** devPtr,
DeviceIndex device,
@ -614,13 +610,13 @@ class XPUAllocator : public DeviceAllocator {
}
}
void emptyCache(MempoolId_t mempool_id [[maybe_unused]] = {0, 0}) override {
void emptyCache() {
for (auto& da : device_allocators) {
da->emptyCache();
}
}
void recordStream(const DataPtr& ptr, c10::Stream stream) override {
void recordStream(const DataPtr& ptr, XPUStream stream) {
if (!ptr.get()) {
return;
}
@ -630,8 +626,7 @@ class XPUAllocator : public DeviceAllocator {
Block* block = get_allocated_block(ptr.get());
TORCH_CHECK(block, "No allocated block can be found.");
c10::xpu::XPUStream xpu_stream{stream};
device_allocators[block->device]->recordStream(block, xpu_stream);
device_allocators[block->device]->recordStream(block, stream);
}
DataPtr allocate(size_t size) override {
@ -684,17 +679,17 @@ class XPUAllocator : public DeviceAllocator {
": did you call init?");
}
DeviceStats getDeviceStats(DeviceIndex device) override {
DeviceStats getDeviceStats(DeviceIndex device) {
assertValidDevice(device);
return device_allocators[device]->getStats();
}
void resetPeakStats(DeviceIndex device) override {
void resetPeakStats(DeviceIndex device) {
assertValidDevice(device);
device_allocators[device]->resetPeakStats();
}
void resetAccumulatedStats(DeviceIndex device) override {
void resetAccumulatedStats(DeviceIndex device) {
assertValidDevice(device);
device_allocators[device]->resetAccumulatedStats();
}

View File

@ -394,7 +394,7 @@ function(torch_compile_options libname)
list(APPEND private_compile_options -Wredundant-move)
endif()
if(CMAKE_CXX_COMPILER_ID MATCHES "Clang")
list(APPEND private_compile_options -Wextra-semi -Wno-error=extra-semi -Wmove)
list(APPEND private_compile_options -Wextra-semi -Wmove)
else()
list(APPEND private_compile_options
# Considered to be flaky. See the discussion at

View File

@ -25,26 +25,3 @@
synchronize
device_index
```
```{eval-rst}
.. automodule:: torch.accelerator.memory
```
```{eval-rst}
.. currentmodule:: torch.accelerator.memory
```
## Memory management
```{eval-rst}
.. autosummary::
:toctree: generated
:nosignatures:
empty_cache
max_memory_allocated
max_memory_reserved
memory_allocated
memory_reserved
memory_stats
reset_accumulated_memory_stats
reset_peak_memory_stats
```

View File

@ -1086,6 +1086,7 @@ coverage_ignore_functions = [
"z3op",
"z3str",
# torch.fx.graph_module
"reduce_deploy_graph_module",
"reduce_graph_module",
"reduce_package_graph_module",
# torch.fx.node

docs/source/deploy.md (new file, 8 added lines)
View File

@ -0,0 +1,8 @@
---
orphan: true
---
# torch::deploy has been moved to pytorch/multipy <!-- codespell:ignore -->
``torch::deploy`` has been moved to its new home at [https://github.com/pytorch/multipy](https://github.com/pytorch/multipy). <!-- codespell:ignore -->

View File

@ -20,39 +20,41 @@ for a brief introduction to all features related to distributed training.
## Backends
`torch.distributed` supports three built-in backends, each with
`torch.distributed` supports four built-in backends, each with
different capabilities. The table below shows which functions are available
for use with CPU / CUDA tensors.
for use with a CPU or GPU for each backend. For NCCL, GPU refers to a CUDA GPU,
while for XCCL it refers to an XPU GPU.
MPI supports CUDA only if the implementation used to build PyTorch supports it.
```{eval-rst}
+----------------+-----------+-----------+-----------+
| Backend | ``gloo`` | ``mpi`` | ``nccl`` |
+----------------+-----+-----+-----+-----+-----+-----+
| Device | CPU | GPU | CPU | GPU | CPU | GPU |
+================+=====+=====+=====+=====+=====+=====+
| send | ✓ | ✘ | ✓ | ? | ✘ | ✓ |
+----------------+-----+-----+-----+-----+-----+-----+
| recv | ✓ | ✘ | ✓ | ? | ✘ | ✓ |
+----------------+-----+-----+-----+-----+-----+-----+
| broadcast | ✓ | ✓ | ✓ | ? | ✘ | ✓ |
+----------------+-----+-----+-----+-----+-----+-----+
| all_reduce | ✓ | ✓ | ✓ | ? | ✘ | ✓ |
+----------------+-----+-----+-----+-----+-----+-----+
| reduce | ✓ | ✓ | ✓ | ? | ✘ | ✓ |
+----------------+-----+-----+-----+-----+-----+-----+
| all_gather | ✓ | ✓ | ✓ | ? | ✘ | ✓ |
+----------------+-----+-----+-----+-----+-----+-----+
| gather | ✓ | ✓ | ✓ | ? | ✘ | ✓ |
+----------------+-----+-----+-----+-----+-----+-----+
| scatter | ✓ | ✓ | ✓ | ? | ✘ | ✓ |
+----------------+-----+-----+-----+-----+-----+-----+
| reduce_scatter | ✓ | ✓ | ✘ | ✘ | ✘ | ✓ |
+----------------+-----+-----+-----+-----+-----+-----+
| all_to_all | ✓ | ✓ | ✓ | ? | ✘ | ✓ |
+----------------+-----+-----+-----+-----+-----+-----+
| barrier | ✓ | ✘ | ✓ | ? | ✘ | ✓ |
+----------------+-----+-----+-----+-----+-----+-----+
+----------------+-----------+-----------+-----------+-----------+
| Backend | ``gloo`` | ``mpi`` | ``nccl`` | ``xccl`` |
+----------------+-----+-----+-----+-----+-----+-----+-----+-----+
| Device | CPU | GPU | CPU | GPU | CPU | GPU | CPU | GPU |
+================+=====+=====+=====+=====+=====+=====+=====+=====+
| send | ✓ | ✘ | ✓ | ? | ✘ | ✓ | ✘ | ✓ |
+----------------+-----+-----+-----+-----+-----+-----+-----+-----+
| recv | ✓ | ✘ | ✓ | ? | ✘ | ✓ | ✘ | ✓ |
+----------------+-----+-----+-----+-----+-----+-----+-----+-----+
| broadcast | ✓ | ✓ | ✓ | ? | ✘ | ✓ | ✘ | ✓ |
+----------------+-----+-----+-----+-----+-----+-----+-----+-----+
| all_reduce | ✓ | ✓ | ✓ | ? | ✘ | ✓ | ✘ | ✓ |
+----------------+-----+-----+-----+-----+-----+-----+-----+-----+
| reduce | ✓ | ✓ | ✓ | ? | ✘ | ✓ | ✘ | ✓ |
+----------------+-----+-----+-----+-----+-----+-----+-----+-----+
| all_gather | ✓ | ✓ | ✓ | ? | ✘ | ✓ | ✘ | ✓ |
+----------------+-----+-----+-----+-----+-----+-----+-----+-----+
| gather | ✓ | ✓ | ✓ | ? | ✘ | ✓ | ✘ | ✓ |
+----------------+-----+-----+-----+-----+-----+-----+-----+-----+
| scatter | ✓ | ✓ | ✓ | ? | ✘ | ✓ | ✘ | ✓ |
+----------------+-----+-----+-----+-----+-----+-----+-----+-----+
| reduce_scatter | ✓ | ✓ | ✘ | ✘ | ✘ | ✓ | ✘ | ✓ |
+----------------+-----+-----+-----+-----+-----+-----+-----+-----+
| all_to_all | ✓ | ✓ | ✓ | ? | ✘ | ✓ | ✘ | ✓ |
+----------------+-----+-----+-----+-----+-----+-----+-----+-----+
| barrier | ✓ | ✘ | ✓ | ? | ✘ | ✓ | ✘ | ✓ |
+----------------+-----+-----+-----+-----+-----+-----+-----+-----+
```
### Backends that come with PyTorch
@ -81,8 +83,9 @@ In the past, we were often asked: "which backend should I use?".
- Rule of thumb
- Use the NCCL backend for distributed **GPU** training
- Use the Gloo backend for distributed **CPU** training.
- Use the NCCL backend for distributed training with CUDA **GPUs**.
- Use the XCCL backend for distributed training with XPU **GPUs**.
- Use the Gloo backend for distributed training on **CPU** (see the sketch below).
- GPU hosts with InfiniBand interconnect
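A minimal sketch of the rule of thumb above; RANK, WORLD_SIZE, and MASTER_ADDR are assumed to be provided by the launcher (e.g. torchrun), and the XCCL branch assumes an XPU-enabled build.

```python
import torch
import torch.distributed as dist

if torch.cuda.is_available():
    backend = "nccl"
elif hasattr(torch, "xpu") and torch.xpu.is_available():
    backend = "xccl"
else:
    backend = "gloo"

dist.init_process_group(backend=backend)
# ... collectives go here ...
dist.destroy_process_group()
```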

File diff suppressed because it is too large.

View File

@ -2,12 +2,13 @@
[build-system]
requires = [
# 70.1.0: min version for integrated bdist_wheel command from wheel package
# 77.0.0: min version for SPDX expression support for project.license
"setuptools>=77.0.0,<80.0",
"setuptools>=70.1.0,<80.0",
"cmake>=3.27",
"ninja",
"numpy",
"packaging>=24.2",
"packaging",
"pyyaml",
"requests",
"six", # dependency chain: NNPACK -> PeachPy -> six
@ -19,8 +20,12 @@ build-backend = "setuptools.build_meta"
name = "torch"
description = "Tensors and Dynamic neural networks in Python with strong GPU acceleration"
readme = "README.md"
requires-python = ">=3.9,<3.14"
license = "BSD-3-Clause"
requires-python = ">=3.9"
# TODO: change to `license = "BSD-3-Clause"` and enable PEP 639 after pinning setuptools>=77
# FIXME: As of 2025.06.20, it is hard to ensure the minimum version of setuptools in our CI environment.
# TOML-table-based license deprecated in setuptools>=77, and the deprecation warning will be changed
# to an error on 2026.02.18. See also: https://github.com/pypa/setuptools/issues/4903
license = { text = "BSD-3-Clause" }
authors = [{ name = "PyTorch Team", email = "packages@pytorch.org" }]
keywords = ["pytorch", "machine learning"]
classifiers = [

View File

@ -1,9 +1,9 @@
# Build System requirements
setuptools>=77.0.0,<80.0 # setuptools develop deprecated on 80.0
setuptools>=70.1.0,<80.0 # setuptools develop deprecated on 80.0
cmake>=3.27
ninja
numpy
packaging>=24.2
packaging
pyyaml
requests
six # dependency chain: NNPACK -> PeachPy -> six

View File

@ -1,40 +1 @@
This directory contains useful tools.
## build_android.sh
This script is to build PyTorch/Caffe2 library for Android. Take the following steps to start the build:
- set ANDROID_NDK to the location of ndk
```bash
export ANDROID_NDK=YOUR_NDK_PATH
```
- run build_android.sh
```bash
#in your PyTorch root directory
bash scripts/build_android.sh
```
If succeeded, the libraries and headers would be generated to build_android/install directory. You can then copy these files from build_android/install to your Android project for further usage.
You can also override the cmake flags via command line, e.g., following command will also compile the executable binary files:
```bash
bash scripts/build_android.sh -DBUILD_BINARY=ON
```
## build_ios.sh
This script is to build PyTorch/Caffe2 library for iOS, and can only be performed on macOS. Take the following steps to start the build:
- Install Xcode from App Store, and configure "Command Line Tools" properly on Xcode.
- Install the dependencies:
```bash
brew install cmake automake libtool
```
- run build_ios.sh
```bash
#in your PyTorch root directory
bash scripts/build_ios.sh
```
If succeeded, the libraries and headers would be generated to build_ios/install directory. You can then copy these files to your Xcode project for further usage.

View File

@ -1 +0,0 @@
cat apache_header.txt $1 > _add_apache_header.txt && mv _add_apache_header.txt $1

View File

@ -1,15 +0,0 @@
/**
* Copyright (c) 2016-present, Facebook, Inc.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

View File

@ -1,14 +0,0 @@
# Copyright (c) 2016-present, Facebook, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
##############################################################################

View File

@ -1,189 +0,0 @@
#!/bin/bash
##############################################################################
# Example command to build the android target.
##############################################################################
#
# This script shows how one can build a Caffe2 binary for the Android platform
# using android-cmake. A few notes:
#
# (1) This build also does a host build for protobuf. You will need autoconf
# to carry this out. If autoconf is not available, you will need to provide
# a pre-built protoc binary that is the same version as the protobuf
# version under third_party.
# If you are building on Mac, you might need to install automake and
# libtool. The easiest way is via homebrew:
# brew install automake
# brew install libtool
# (2) You will need to have the Android NDK installed. The current script assumes
# that you set ANDROID_NDK to the location of the NDK.
# (3) The toolchain and the build target platform can be specified with the
# cmake arguments below. For more details, check out android-cmake's doc.
set -e
# Android specific flags
if [ -z "$ANDROID_ABI" ]; then
ANDROID_ABI="armeabi-v7a with NEON"
fi
ANDROID_NATIVE_API_LEVEL="21"
echo "Build with ANDROID_ABI[$ANDROID_ABI], ANDROID_NATIVE_API_LEVEL[$ANDROID_NATIVE_API_LEVEL]"
CAFFE2_ROOT="$( cd "$(dirname "$0")"/.. ; pwd -P)"
if [ -z "$ANDROID_NDK" ]; then
echo "ANDROID_NDK not set; please set it to the Android NDK directory"
exit 1
fi
if [ ! -d "$ANDROID_NDK" ]; then
echo "ANDROID_NDK not a directory; did you install it under $ANDROID_NDK?"
exit 1
fi
if [ -z "$PYTHON" ]; then
PYTHON=python
PYTHON_VERSION_MAJOR=$($PYTHON -c 'import sys; print(sys.version_info[0])')
if [ "${PYTHON_VERSION_MAJOR}" -le 2 ]; then
echo "Default python executable is Python-2, trying to use python3 alias"
PYTHON=python3
fi
fi
ANDROID_NDK_PROPERTIES="$ANDROID_NDK/source.properties"
[ -f "$ANDROID_NDK_PROPERTIES" ] && ANDROID_NDK_VERSION=$(sed -n 's/^Pkg.Revision[^=]*= *\([0-9]*\)\..*$/\1/p' "$ANDROID_NDK_PROPERTIES")
echo "Bash: $(/bin/bash --version | head -1)"
echo "Python: $($PYTHON -c 'import sys; print(sys.version)')"
echo "Caffe2 path: $CAFFE2_ROOT"
echo "Using Android NDK at $ANDROID_NDK"
echo "Android NDK version: $ANDROID_NDK_VERSION"
CMAKE_ARGS=()
# Build PyTorch mobile
CMAKE_ARGS+=("-DCMAKE_PREFIX_PATH=$($PYTHON -c 'import sysconfig; print(sysconfig.get_path("purelib"))')")
CMAKE_ARGS+=("-DPython_EXECUTABLE=$($PYTHON -c 'import sys; print(sys.executable)')")
CMAKE_ARGS+=("-DBUILD_CUSTOM_PROTOBUF=OFF")
# custom build with selected ops
if [ -n "${SELECTED_OP_LIST}" ]; then
SELECTED_OP_LIST="$(cd "$(dirname "$SELECTED_OP_LIST")"; pwd -P)/$(basename "$SELECTED_OP_LIST")"
echo "Using SELECTED_OP_LIST file: $SELECTED_OP_LIST"
if [ ! -r "${SELECTED_OP_LIST}" ]; then
echo "Error: SELECTED_OP_LIST file ${SELECTED_OP_LIST} not found."
exit 1
fi
CMAKE_ARGS+=("-DSELECTED_OP_LIST=${SELECTED_OP_LIST}")
fi
# If Ninja is installed, prefer it to Make
if [ -x "$(command -v ninja)" ]; then
CMAKE_ARGS+=("-GNinja")
fi
# Use android-cmake (the NDK's CMake toolchain file) to build the Android project.
CMAKE_ARGS+=("-DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake")
if [ -z "$BUILD_MOBILE_BENCHMARK" ]; then
BUILD_MOBILE_BENCHMARK=0
fi
if [ -z "$BUILD_MOBILE_TEST" ]; then
BUILD_MOBILE_TEST=0
fi
# Don't build artifacts we don't need
CMAKE_ARGS+=("-DBUILD_TEST=OFF")
CMAKE_ARGS+=("-DBUILD_BINARY=OFF")
# If the env variable exists and equals 0, build the full JIT interpreter.
# Default behavior is to build the lite interpreter
# cmd: BUILD_LITE_INTERPRETER=0 ./scripts/build_android.sh
if [ "${BUILD_LITE_INTERPRETER}" == 0 ]; then
CMAKE_ARGS+=("-DBUILD_LITE_INTERPRETER=OFF")
else
CMAKE_ARGS+=("-DBUILD_LITE_INTERPRETER=ON")
fi
if [ "${TRACING_BASED}" == 1 ]; then
CMAKE_ARGS+=("-DTRACING_BASED=ON")
else
CMAKE_ARGS+=("-DTRACING_BASED=OFF")
fi
if [ "${USE_LIGHTWEIGHT_DISPATCH}" == 1 ]; then
CMAKE_ARGS+=("-DUSE_LIGHTWEIGHT_DISPATCH=ON")
CMAKE_ARGS+=("-DSTATIC_DISPATCH_BACKEND=CPU")
else
CMAKE_ARGS+=("-DUSE_LIGHTWEIGHT_DISPATCH=OFF")
fi
CMAKE_ARGS+=("-DBUILD_MOBILE_BENCHMARK=$BUILD_MOBILE_BENCHMARK")
CMAKE_ARGS+=("-DBUILD_MOBILE_TEST=$BUILD_MOBILE_TEST")
CMAKE_ARGS+=("-DBUILD_PYTHON=OFF")
CMAKE_ARGS+=("-DBUILD_SHARED_LIBS=OFF")
if (( "${ANDROID_NDK_VERSION:-0}" < 18 )); then
CMAKE_ARGS+=("-DANDROID_TOOLCHAIN=gcc")
else
CMAKE_ARGS+=("-DANDROID_TOOLCHAIN=clang")
fi
# Disable unused dependencies
CMAKE_ARGS+=("-DUSE_CUDA=OFF")
CMAKE_ARGS+=("-DUSE_ITT=OFF")
CMAKE_ARGS+=("-DUSE_GFLAGS=OFF")
CMAKE_ARGS+=("-DUSE_OPENCV=OFF")
CMAKE_ARGS+=("-DUSE_MPI=OFF")
CMAKE_ARGS+=("-DUSE_OPENMP=OFF")
# Only toggle if VERBOSE=1
if [ "${VERBOSE:-}" == '1' ]; then
CMAKE_ARGS+=("-DCMAKE_VERBOSE_MAKEFILE=1")
fi
# Android specific flags
CMAKE_ARGS+=("-DANDROID_NDK=$ANDROID_NDK")
CMAKE_ARGS+=("-DANDROID_ABI=$ANDROID_ABI")
CMAKE_ARGS+=("-DANDROID_NATIVE_API_LEVEL=$ANDROID_NATIVE_API_LEVEL")
CMAKE_ARGS+=("-DANDROID_CPP_FEATURES=rtti exceptions")
if [ "${ANDROID_STL_SHARED:-}" == '1' ]; then
CMAKE_ARGS+=("-DANDROID_STL=c++_shared")
fi
if [ "${ANDROID_DEBUG_SYMBOLS:-}" == '1' ]; then
CMAKE_ARGS+=("-DANDROID_DEBUG_SYMBOLS=1")
fi
if [ -n "${USE_VULKAN}" ]; then
CMAKE_ARGS+=("-DUSE_VULKAN=ON")
if [ -n "${USE_VULKAN_FP16_INFERENCE}" ]; then
CMAKE_ARGS+=("-DUSE_VULKAN_FP16_INFERENCE=ON")
fi
if [ -n "${USE_VULKAN_RELAXED_PRECISION}" ]; then
CMAKE_ARGS+=("-DUSE_VULKAN_RELAXED_PRECISION=ON")
fi
fi
# User-specified CMake arguments go last to allow overriding defaults
CMAKE_ARGS+=("$@")
# Patch pocketfft (Android does not have aligned_alloc even when compiled with C++17)
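# (replacing the version check with 0 forces pocketfft onto its pre-C++17 code path, which avoids aligned_alloc)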
if [ -f third_party/pocketfft/pocketfft_hdronly.h ]; then
sed -i -e "s/__cplusplus >= 201703L/0/" third_party/pocketfft/pocketfft_hdronly.h
fi
# Now, actually build the Android target.
BUILD_ROOT=${BUILD_ROOT:-"$CAFFE2_ROOT/build_android"}
INSTALL_PREFIX=${BUILD_ROOT}/install
mkdir -p $BUILD_ROOT
cd $BUILD_ROOT
cmake "$CAFFE2_ROOT" \
-DCMAKE_INSTALL_PREFIX=$INSTALL_PREFIX \
-DCMAKE_BUILD_TYPE=Release \
"${CMAKE_ARGS[@]}"
# Cross-platform parallel build
if [ -z "$MAX_JOBS" ]; then
if [ "$(uname)" == 'Darwin' ]; then
MAX_JOBS=$(sysctl -n hw.ncpu)
else
MAX_JOBS=$(nproc)
fi
fi
echo "Will install headers and libs to $INSTALL_PREFIX for further Android project usage."
cmake --build . --target install -- "-j${MAX_JOBS}"
echo "Installation completed, now you can copy the headers/libs from $INSTALL_PREFIX to your Android project directory."

@@ -1,102 +0,0 @@
#!/usr/bin/env bash
set -eux -o pipefail
env
echo "BUILD_ENVIRONMENT:$BUILD_ENVIRONMENT"
export ANDROID_NDK_HOME=/opt/ndk
export ANDROID_NDK=/opt/ndk
export ANDROID_HOME=/opt/android/sdk
# Must be in sync with GRADLE_VERSION in docker image for android
# https://github.com/pietern/pytorch-dockerfiles/blob/master/build.sh#L155
export GRADLE_VERSION=6.8.3
export GRADLE_HOME=/opt/gradle/gradle-$GRADLE_VERSION
export GRADLE_PATH=$GRADLE_HOME/bin/gradle
# touch gradle cache files to prevent expiration
while IFS= read -r -d '' file
do
touch "$file" || true
done < <(find /var/lib/jenkins/.gradle -type f -print0)
# Patch pocketfft (Android does not have aligned_alloc even when compiled with C++17)
if [ -f ~/workspace/third_party/pocketfft/pocketfft_hdronly.h ]; then
sed -i -e "s/__cplusplus >= 201703L/0/" ~/workspace/third_party/pocketfft/pocketfft_hdronly.h
fi
export GRADLE_LOCAL_PROPERTIES=~/workspace/android/local.properties
rm -f $GRADLE_LOCAL_PROPERTIES
echo "sdk.dir=/opt/android/sdk" >> $GRADLE_LOCAL_PROPERTIES
echo "ndk.dir=/opt/ndk" >> $GRADLE_LOCAL_PROPERTIES
echo "cmake.dir=/usr/local" >> $GRADLE_LOCAL_PROPERTIES
retry () {
$* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
}
# Run custom build script
if [[ "${BUILD_ENVIRONMENT}" == *-gradle-custom-build* ]]; then
# Install torch & torchvision - used to download & dump used ops from test model.
retry pip install torch torchvision --progress-bar off
exec "$(dirname "${BASH_SOURCE[0]}")/../android/build_test_app_custom.sh" armeabi-v7a
fi
# Run default build
BUILD_ANDROID_INCLUDE_DIR_x86=~/workspace/build_android/install/include
BUILD_ANDROID_LIB_DIR_x86=~/workspace/build_android/install/lib
BUILD_ANDROID_INCLUDE_DIR_x86_64=~/workspace/build_android_install_x86_64/install/include
BUILD_ANDROID_LIB_DIR_x86_64=~/workspace/build_android_install_x86_64/install/lib
BUILD_ANDROID_INCLUDE_DIR_arm_v7a=~/workspace/build_android_install_arm_v7a/install/include
BUILD_ANDROID_LIB_DIR_arm_v7a=~/workspace/build_android_install_arm_v7a/install/lib
BUILD_ANDROID_INCLUDE_DIR_arm_v8a=~/workspace/build_android_install_arm_v8a/install/include
BUILD_ANDROID_LIB_DIR_arm_v8a=~/workspace/build_android_install_arm_v8a/install/lib
PYTORCH_ANDROID_SRC_MAIN_DIR=~/workspace/android/pytorch_android/src/main
JNI_INCLUDE_DIR=${PYTORCH_ANDROID_SRC_MAIN_DIR}/cpp/libtorch_include
mkdir -p $JNI_INCLUDE_DIR
JNI_LIBS_DIR=${PYTORCH_ANDROID_SRC_MAIN_DIR}/jniLibs
mkdir -p $JNI_LIBS_DIR
ln -s ${BUILD_ANDROID_INCLUDE_DIR_x86} ${JNI_INCLUDE_DIR}/x86
ln -s ${BUILD_ANDROID_LIB_DIR_x86} ${JNI_LIBS_DIR}/x86
if [[ "${BUILD_ENVIRONMENT}" != *-gradle-build-only-x86_32* ]]; then
ln -s ${BUILD_ANDROID_INCLUDE_DIR_x86_64} ${JNI_INCLUDE_DIR}/x86_64
ln -s ${BUILD_ANDROID_LIB_DIR_x86_64} ${JNI_LIBS_DIR}/x86_64
ln -s ${BUILD_ANDROID_INCLUDE_DIR_arm_v7a} ${JNI_INCLUDE_DIR}/armeabi-v7a
ln -s ${BUILD_ANDROID_LIB_DIR_arm_v7a} ${JNI_LIBS_DIR}/armeabi-v7a
ln -s ${BUILD_ANDROID_INCLUDE_DIR_arm_v8a} ${JNI_INCLUDE_DIR}/arm64-v8a
ln -s ${BUILD_ANDROID_LIB_DIR_arm_v8a} ${JNI_LIBS_DIR}/arm64-v8a
fi
GRADLE_PARAMS="-p android assembleRelease --debug --stacktrace"
if [[ "${BUILD_ENVIRONMENT}" == *-gradle-build-only-x86_32* ]]; then
GRADLE_PARAMS+=" -PABI_FILTERS=x86"
fi
if [ -n "${GRADLE_OFFLINE:-}" ]; then
GRADLE_PARAMS+=" --offline"
fi
$GRADLE_PATH $GRADLE_PARAMS
find . -type f -name "*.a" -exec ls -lh {} \;
while IFS= read -r -d '' file
do
echo
echo "$file"
ls -lah "$file"
zipinfo -l "$file"
done < <(find . -type f -name '*.aar' -print0)
find . -type f -name '*.aar' -print | xargs tar cfvz ~/workspace/android/artifacts.tgz

@@ -1,59 +0,0 @@
#!/bin/bash
##############################################################################
# Builds the protoc compiler for the host platform.
##############################################################################
# This script builds the protoc compiler for the host platform, which is needed
# for any cross-compilation as we will need to convert the protobuf source
# files to .cc files.
#
# --other-flags accepts flags that should be passed to cmake. Optional.
#
# After the execution of the file, one should be able to find the host protoc
# binary at build_host_protoc/bin/protoc.
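#
# Example invocation (the script path and the extra flag are illustrative):
#   scripts/build_host_protoc.sh --other-flags -DCMAKE_BUILD_TYPE=Release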
set -e
CAFFE2_ROOT="$( cd "$(dirname -- "$0")"/.. ; pwd -P)"
BUILD_ROOT=${BUILD_ROOT:-"$CAFFE2_ROOT/build_host_protoc"}
mkdir -p $BUILD_ROOT/build
cd $BUILD_ROOT/build
CMAKE_ARGS=()
CMAKE_ARGS+=("-DCMAKE_INSTALL_PREFIX=$BUILD_ROOT")
CMAKE_ARGS+=("-Dprotobuf_BUILD_TESTS=OFF")
# If Ninja is installed, prefer it to Make
if [ -x "$(command -v ninja)" ]; then
CMAKE_ARGS+=("-GNinja")
fi
while true; do
case "$1" in
--other-flags)
shift;
CMAKE_ARGS+=("$@")
break ;;
"")
break ;;
*)
echo "Unknown option passed as argument: $1"
break ;;
esac
done
# Use ccache if available (this path is where Homebrew installs ccache symlinks)
if [ "$(uname)" == 'Darwin' ] && [ -d /usr/local/opt/ccache/libexec ]; then
CMAKE_ARGS+=("-DCMAKE_C_COMPILER=/usr/local/opt/ccache/libexec/gcc")
CMAKE_ARGS+=("-DCMAKE_CXX_COMPILER=/usr/local/opt/ccache/libexec/g++")
fi
cmake "$CAFFE2_ROOT/third_party/protobuf/cmake" ${CMAKE_ARGS[@]}
if [ -z "$MAX_JOBS" ]; then
if [ "$(uname)" == 'Darwin' ]; then
MAX_JOBS=$(sysctl -n hw.ncpu)
else
MAX_JOBS=$(nproc)
fi
fi
cmake --build . -- "-j${MAX_JOBS}" install

Some files were not shown because too many files have changed in this diff.