Compare commits

...

110 Commits

Author SHA1 Message Date
2eacbe792a [dynamo] Revert C++-fying of symbolic shape guards
Moving symbolic shape guards to C++ causes compile time issues. This
basically boils down to a tradeoff question.

For models that have large amount of dynamic shape guards, this flag
will help reduce guard latency. But for most of the models, that have a
very few dynamic shape guards, the guard lantecy is anyways small. These
models will still see a high compile time hit because of calling gcc
during the compile.

So a good default value seems to be False. We can write a doc to give
guidance on reducing guard latency.
2025-10-28 11:13:24 -07:00
8110ce02a2 Add a skill for writing skills (#166266)
Apparently, if you just ask Claude to write a skill it doesn't follow the
correct rules.  So this one is just the official docs for skills.

Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166266
Approved by: https://github.com/Skylion007
ghstack dependencies: #166265
2025-10-28 16:49:27 +00:00
43c30f607e Use correct layout convention for skills (#166265)
Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166265
Approved by: https://github.com/Skylion007
2025-10-28 16:49:27 +00:00
5ebf74a655 [2/2] Move scaled_mm routines to their own file (#166314)
Summary:

* Further simplify `ATen/native/cuda/Blas.cpp` by moving `_scaled_mm`,
  `_scaled_mm_v2` and supporting methods to a new file,
  `ATen/native/cuda/ScaledBlas.cpp`

Test Plan:

```
pytest -svv test/test_matmul_cuda.py
pytest -svv test/test_scaled_matmul_cuda.py
```

Reviewers:

Subscribers:

Tasks:

Tags:
Signed-off-by: Simon Layton <simonlayton@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166314
Approved by: https://github.com/eqy
ghstack dependencies: #166313
2025-10-28 16:35:32 +00:00
acd936cc1a [1/2] Split cublasCommonArgs into its own file (#166313)
Summary:

* Factor out `cublasCommonArgs` struct
* Necessary for factoring out scaled mm routines

Test Plan:

```
pytest -svv test/test_matmul_cuda.py
pytest -svv test/test_scaled_matmul_cuda.py
```

Reviewers:

Subscribers:

Tasks:

Tags:
Signed-off-by: Simon Layton <simonlayton@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166313
Approved by: https://github.com/eqy, https://github.com/Skylion007
2025-10-28 16:35:32 +00:00
a4a0378e6b Revert "[cuDNN] Smoke-test runtime cuDNN version matches compile time version in CI (#165922)"
This reverts commit 2a5f87decf34b3d0ea7670238e2fd4620ed19e9f.

Reverted https://github.com/pytorch/pytorch/pull/165922 on behalf of https://github.com/atalman due to cudnn update started to fail, see https://github.com/pytorch/pytorch/pull/165913#issuecomment-3457293475 ([comment](https://github.com/pytorch/pytorch/pull/165922#issuecomment-3457389406))
2025-10-28 16:29:29 +00:00
ac841267a1 [ROCm] skip AsyncTP test class as AsyncTP is not supported on ROCm (#166316)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166316
Approved by: https://github.com/jeffdaily
2025-10-28 16:23:46 +00:00
0eacd934bc Revert "Update cuDNN 9.10.2 in Manylinux 2.28 Docker files (#165913)"
This reverts commit 840d63c12d255dac1ae3c5e442c6ea6eb87a7256.

Reverted https://github.com/pytorch/pytorch/pull/165913 on behalf of https://github.com/clee2000 due to I think something here is causing CI tests to segfault at exit on cuda, ex [GH job link](https://github.com/pytorch/pytorch/actions/runs/18857880394/job/53811917713) [HUD commit link](9a91486e45) says no tests failed but it segfaulted afterwards.  I can't tell if it's because of this change, or an unpinned dependency in docker that got triggered by this.  Note to self, would have been bad TD except trunk didn't run either ([comment](https://github.com/pytorch/pytorch/pull/165913#issuecomment-3457293475))
2025-10-28 16:11:07 +00:00
5016e7b2eb [FlexAttention] Add mechanism to get optimal autotune decision (#165817)
Script: https://github.com/meta-pytorch/attention-gym/pull/169

Feels directionally okay but there is some bike shedding / this could be quite prone to collision of keys depending on mask mod and score mod changes and simple cache key.

Usecase: https://github.com/meta-pytorch/attention-gym/pull/169

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165817
Approved by: https://github.com/Chillee
2025-10-28 15:50:12 +00:00
544b443ea1 [CD] Upgrade to CUDA 13.0.2 for nightly binaries (#165470)
13.0.U2 is posted, adding to nightlies
Why we want to upgrade: CUDA 13.0.U2 included a new release from cuBLAS that
1. Enabled opt-in fixed-point emulation for FP64 matmuls (D/ZGEMM) which improves performance and power-efficiency.
2. Improved performance on NVIDIA [DGX Spark](https://www.nvidia.com/en-us/products/workstations/dgx-spark/) for FP16/BF16 and FP8 GEMMs.
3. adds BF16x9 FP32 emulation support for SYRK and HERK routines.
Reference: https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#cublas-release-13-0-update-2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165470
Approved by: https://github.com/atalman
2025-10-28 15:14:43 +00:00
3041ede082 Improve eig tests in preparation for new eig backends (#166322)
### Summary
Improves validation of `torch.linalg.eig` results by verifying the eigen decomposition identity **A v − v λ = 0**.

### Motivation
Eigenvectors are not unique, and numerical differences between backends (cuSOLVER, MAGMA, CPU)
can cause false test failures. This PR replaces direct elementwise comparisons with a mathematical
identity check, improving robustness across devices.

### Details
- Introduces `fulfills_eigen_decomposition_identity()` in `test_eig_compare_backends()` to validate the eigen equation.
- Uses CPU matmul for high-precision verification.
- Handles zero-sized matrices explicitly.
- Tolerances derived from numerical comparisons between cuSOLVER and NumPy.
  See discussion: [dev-discuss.pytorch.org link](https://dev-discuss.pytorch.org/t/cusolver-dnxgeev-faster-cuda-eigenvalue-calculations/3248/6)

### Impact
- Improves test stability and correctness across eig backends.
- No change to public API.
- All tests pass; lintrunner reports no issues.
- Enables introduction of new eig backends without false test failures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166322
Approved by: https://github.com/lezcano
2025-10-28 14:42:47 +00:00
34d6ef7022 Update gm.print_readable to include Annotation (#165397)
Sample output
```
[rank0]:        # Annotation: {'compile_with_inductor': 'flex_attention'} File: /data/users/bahuang/pytorch/torch/nn/attention/flex_attention.py:1490 in flex_attention, code: out, lse, max_scores = flex_attention_hop(
[rank0]:        score_mod_2 = self.score_mod_2
[rank0]:        mask_fn_2 = self.mask_fn_2
[rank0]:        flex_attention_1 = torch.ops.higher_order.flex_attention(xq_5, xk_5, xv_3, score_mod_2, (2048, 2048, g____import_torchtitan_dot_models_dot_attention___flex_attention_block_masks___block_causal___none___kv_num_blocks, g____import_torchtitan_dot_models_dot_attention___flex_attention_block_masks___block_causal___none___kv_indices, g____import_torchtitan_dot_models_dot_attention___flex_attention_block_masks___block_causal___none___full_kv_num_blocks, g____import_torchtitan_dot_models_dot_attention___flex_attention_block_masks___block_causal___none___full_kv_indices, g____import_torchtitan_dot_models_dot_attention___flex_attention_block_masks___block_causal___none___q_num_blocks, g____import_torchtitan_dot_models_dot_attention___flex_attention_block_masks___block_causal___none___q_indices, g____import_torchtitan_dot_models_dot_attention___flex_attention_block_masks___block_causal___none___full_q_num_blocks, g____import_torchtitan_dot_models_dot_attention___flex_attention_block_masks___block_causal___none___full_q_indices, 128, 128, mask_fn_2), 0.25, {'PRESCALE_QK': False, 'ROWS_GUARANTEED_SAFE': False, 'BLOCKS_ARE_CONTIGUOUS': False, 'WRITE_DQ': True, 'OUTPUT_LOGSUMEXP': True, 'OUTPUT_MAX': False}, (), (g____import_torchtitan_dot_models_dot_attention___flex_attention_block_masks___block_causal___none___mask_mod___closure___0_cell_contents,));  xq_5 = xk_5 = xv_3 = score_mod_2 = mask_fn_2 = None
[rank0]:        out_2: "bf16[8, 4, 2048, 16]" = flex_attention_1[0];  flex_attention_1 = None
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165397
Approved by: https://github.com/yushangdi, https://github.com/anijain2305, https://github.com/mlazos
2025-10-28 13:54:38 +00:00
110efe4df4 Revert "[inductor][choices] lookup table choices 1/3 (#164978)"
This reverts commit b44423bbb43860c1e340cbebc9d101dc18031ecb.

Reverted https://github.com/pytorch/pytorch/pull/164978 on behalf of https://github.com/atalman due to failing internal test on newly added tests: Test when there's no lookup table entry with different autotune modes ([comment](https://github.com/pytorch/pytorch/pull/164978#issuecomment-3456400126))
2025-10-28 13:12:55 +00:00
e137cd0a10 docs: fix typos (#164879)
Correct typos in the comments

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164879
Approved by: https://github.com/Lucaskabela, https://github.com/mlazos, https://github.com/cyyever
2025-10-28 12:00:36 +00:00
be28329710 [Pytorch] Update Kineto Submodule (#166317)
Summary: Update Submodule

Test Plan: CI

Differential Revision: D85579130

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166317
Approved by: https://github.com/Skylion007
2025-10-28 10:41:17 +00:00
85a7c745aa [triton][nativert] Add num_cpu_threads for triton-cpu (#166255)
Summary:
The new triton-cpu has `num_cpu_threads` like `num_warps`, which are auto-tunable. This diff adds `num_cpu_threads` to NativeRT.

Differential Revision: D85515240

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166255
Approved by: https://github.com/XueningXu
2025-10-28 08:40:04 +00:00
32fe4f681e [dynamo] fix keyerror in resume_execution (again) (#166040)
Fixes https://github.com/pytorch/pytorch/issues/166176

The error I attempted to fix in https://github.com/pytorch/pytorch/pull/162318 was still appearing internally.

Surprised that this wasn't caught anywhere 😰

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166040
Approved by: https://github.com/Lucaskabela
ghstack dependencies: #166036
2025-10-28 07:04:29 +00:00
ebb2b2e894 [dynamo] fix store attr graph break in with block (#166036)
Fixes https://github.com/pytorch/pytorch/issues/166033

Differential Revision: [D85198055](https://our.internmc.facebook.com/intern/diff/D85198055)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166036
Approved by: https://github.com/Lucaskabela
2025-10-28 07:04:29 +00:00
13413b3b07 [AMP][Refactor] Autocast dtype handling to simplify device-specific c… (#165221)
This PR refactors the autocast context manager in autocast_mode.py to simplify and centralize the logic for checking supported dtypes for each device. The previous implementation repeated similar checks for multiple device types. Now, a single mapping device_supported_dtypes is used to associate device types with their supported dtypes, and the validation logic is unified.

**The former PR #163446 was merged but reverted due to failed CI test on `openreg` related tests.**

This RR additionally slightly modified some test assertions for passing the CI tests. CI failed due to assertion for the exactly same error message. For example:
```
File "/var/lib/jenkins/workspace/test/cpp_extensions/open_registration_extension/torch_openreg/tests/test_autocast.py", line 9, in test_autocast_with_unsupported_type
    with self.assertWarnsRegex(
        AssertionError: "In openreg autocast, but the target dtype torch.float32 is not supported." does not match "In openreg autocast, but the target dtype is not supported. Disabling autocast."
```

Sorry for the inconvenience again.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165221
Approved by: https://github.com/albanD
2025-10-28 06:21:29 +00:00
5d0b3e28dc [inductor] generate fused rms/layer norm bwd (#165370)
RMS/Layer norm backward would generated 2 kind of reductions:
- the reduction computing dx which reduce across the hidden dimension (in the context of transformer)
- the reduction computing dw/db which reduce across the BxT (batch size , sequence length) dimension.

These 2 set of reductions have common input buffers but inductor can not fuse them because of different loop orders.

There are multiple sources of custom kernels that implement fused version of such kernel (Liger-Kernel, quack, Paul Zhang's internal post). This PR enable Inductor to generate such kernels automatically.

The generated kernel is very similar to 33924d20b6/src/liger_kernel/ops/rms_norm.py (L114) .

To make the implementation simple and performing, we enable such fusion only if the inner reduction (computing dx) is a persistent reduction. This should be true for representative inputs. Persistent reduction is critical for the perf here to make sure a loaded tensor does not need to be reload.

To make sure the inner reduction (computing dx) and outer reductions (computing dw/db) being fusible, the PR does the following:
1. convert the outer reductions to pointwise by replacing 'reduction' & 'store_reduction' node with a new type of node 'parital_accumulate'. The new node will collect the reduction type, buffer name, input of reduction etc, which is essential for proper codegening.
2. do loop reordering (rely on the earlier loop ordering after fusion work) to reorder the loops of the converted pointwise so it can be fused with the inner reduction
3. there can be epilogues that need to be added in the end. E.g. the outer reduction may be followed by a division for mean , or followed by a down cast if dw/db is in low precision (fp16/bf16).

Some early benchmarking on H100 shows about 2X speedup for both RMSNorm and LayerNorm backward for shape (1152 * 500, 384 ) used in some internal model. Note that, I manually disable split reduction in this benchmarking since otherwise the fusion will be skipped right now. The next PR will make the mix-order-reduction compose better with split reduction

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165370
Approved by: https://github.com/jansel
ghstack dependencies: #166204
2025-10-28 05:53:52 +00:00
9139368b64 [PyTorch] Use events from pool in copy_device_to_device (#165647)
Summary: In this diff, we add a event pool so that we dont have to create/destroy events all the time, instead re-use the events from the pool.

Test Plan: contbuild

Differential Revision: D84685495

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165647
Approved by: https://github.com/bbus
2025-10-28 05:19:05 +00:00
02095cc09d [dynamo] Dont guard on getset descriptors for torch_function (#166346)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166346
Approved by: https://github.com/mlazos
ghstack dependencies: #166329
2025-10-28 04:33:56 +00:00
65868156c6 [dynamo] Guard selectively on the torch APIs (#166329)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166329
Approved by: https://github.com/Lucaskabela
2025-10-28 04:33:56 +00:00
f93ea7dab1 [export] Update dynamo_graph_capture_for_export to return GraphModule. (#166091)
Make dynamo_graph_capture_for_export return a more compatible GraphModule object which is closer the the original behavior of dynamo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166091
Approved by: https://github.com/tugsbayasgalan
2025-10-28 04:23:28 +00:00
a77f5d9a00 [ROCm] Use a ROCm version string without hash. (#166336)
Fixes #166068

Use the ROCm version string that does not contain a hash. The string is set in LoadHIP.cmake.

Tested on repro provided by reporter.

For a ROCm 7.0 docker container, we get `7.0.0`.

For a ROCm 7.0.2 docker container, we get `7.0.2`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166336
Approved by: https://github.com/jeffdaily
2025-10-28 03:53:55 +00:00
ff46d5a79b [Inductor][Triton][FP8] Support deepseek-style scaling in Inductor (#164404)
Summary:
Support deepseek-style scaling in Inductor Triton for FP8 GEMMs. DeepSeek-style scaling is a colloquial term for a fine-grained mixed precision framework using FP8 to train [Deepseek-V3](https://arxiv.org/pdf/2412.19437), DeepSeek AI's recent MoE (Mixture of Experts) model. DeepSeek-style scaling effectively extends the dynamic range of FP8 by mitigating dequantization overhead under increased-precision accumulation, which is key to achieving more accurate FP8 GEMM results.

DeepSeek-style scaling on matmul `A @ B` leverages two different types of scaling strategies to preserve a balance between numerical stability and training efficiency:
- Activations (input tensor `A`): tile-wise (1x128 across shape `(M, K)`)
- Weights (input tensor `B`): block-wise (128x128 across shape `(N, K)`)

This diff enables Inductor users to replicate past successes with deepseek-style scaling and achieve higher numerical stability while increasing training efficiency.

NOTE: Block-wise 128x128 scaling is only supported in CUDA 12.9+; therefore, deepseek-style scaling is currently unsupported in `fbcode` (CUDA 12.4). Use OSS PyTorch to run deepseek-style scaling.

NOTE: Accuracy for FP8 is unstable, even with high tolerances, which is why TritonBench benchmarks are unlikely to be accurate against a `torch` implementation.

Test Plan:
In OSS PyTorch, run
```
TORCHINDUCTOR_CACHE_DIR=~/personal/cache_dir_inductor CUDA_LAUNCH_BLOCKING=1 TORCH_USE_CUDA_DSA=1 TRITON_PRINT_AUTOTUNING=1 TRITON_ALWAYS_COMPILE=1 TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 ENABLE_PERSISTENT_TMA_MATMUL=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 python run.py --op fp8_gemm --only torch_fp8_gemm,pt2_fp8_gemm --metrics tflops,accuracy --m 4096 --n 768 --k 512 --output="{output_dir}/deepseek_bench.csv" --scaling_deepseek --atol=1e-2 --rtol=0.5 2>&1 | tee ~/personal/deepseek_style/deepseek_bench.log
```

Differential Revision: D83609850

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164404
Approved by: https://github.com/slayton58
2025-10-28 03:38:54 +00:00
f452edd782 [dynamo, 3.14] fix misc. bugs to get most dynamo unittests passing locally in 3.14 (#164631)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164631
Approved by: https://github.com/Lucaskabela, https://github.com/mlazos
2025-10-28 03:24:22 +00:00
ea698e8bfc [dynamo, nested graph breaks] disallow nested graph breaks in HOPs (#166016)
As discussed offline with @ydwu4, we should not allow nested graph breaks in HOPs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166016
Approved by: https://github.com/Lucaskabela
ghstack dependencies: #166013, #166015, #165808, #165809
2025-10-28 03:03:38 +00:00
7f7a28046b [dynamo, nested graph breaks] disable nested graph breaks in generators; enable nested_graph_breaks in test_ctx_manager.py and test_generator.py (#165809)
Generators should not support nested graph breaks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165809
Approved by: https://github.com/Lucaskabela, https://github.com/guilhermeleobas
ghstack dependencies: #166013, #166015, #165808
2025-10-28 03:03:37 +00:00
d8283a317a [dynamo, nested graph breaks] fix RETURN_VALUE tx skipping in nested graph breaks (#165808)
Previously, we would completely skip building and calling any resume function if the leaf frame's resume instruction was RETURN_VALUE/RETURN_CONST. Now, we only skip building/calling resume functions for frames that are resuming on RETURN_VALUE.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165808
Approved by: https://github.com/Lucaskabela
ghstack dependencies: #166013, #166015
2025-10-28 03:03:37 +00:00
e0ca3049c0 [dynamo, nested graph breaks] remove _dynamo.utils.counter patch on inlined tx'es (#166015)
This `patch.dict(counters, ...` appears to be ancient code that doesn't really seem to be doing anything? It causes issues in nested graph breaks because the patch cleanup clears out the record of the nested graph break. Removing the patch to see if it's even needed in the first place.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166015
Approved by: https://github.com/Lucaskabela
ghstack dependencies: #166013
2025-10-28 03:03:37 +00:00
8417981c96 [dynamo, nested graph breaks] add TestCaseWithNestedGraphBreaks subclass (#166013)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166013
Approved by: https://github.com/Lucaskabela
2025-10-28 03:03:37 +00:00
06e71c8558 [hop] local_map MoE: fix unbacked symints during tracing and symint activations order in the wrapper (#165551)
This PR fixes 2 issues with local_mapping token-choice moe. Splits from the fw token dispatch result in tensors with unbacked shapes and these unbacked shapes are fully contained in the a2as, and should not leak outside of the joint graph. The HOP body fw and bw are expected to coerce back to static shapes (due to adding it with shared experts output) before returning.
```python
routed_output: "bf16[u0 + u1 + u10 + u11 + u12 + u13 + u14 + u15 + u16 + u17 + u18 + u19 + u2 + u20 + u21 + u22 + u23 + u24 + u25 + u26 + u27 + u28 + u29 + u3 + u30 + u31 + u32 + u33 + u34 + u35 + u36 + u37 + u38 + u39 + u4 + u40 + u41 + u42 + u43 + u44 + u45 + u46 + u47 + u48 + u49 + u5 + u50 + u51 + u52 + u53 + u54 + u55 + u56 + u57 + u58 + u59 + u6 + u60 + u61 + u62 + u63 + u7 + u8 + u9, 2048]" = torch.ops.higher_order.autograd_function_apply(fwd_body_1, bwd_body_1, out_1, item, item_1, item_2, item_3, item_4, item_5, item_6, item_7, item_8, item_9, item_10, item_11, item_12, item_13, item_14, item_15, item_16, item_17, item_18, item_19, item_20, item_21, item_22, item_23, item_24, item_25, item_26, item_27, item_28, item_29, item_30, item_31, item_32, item_33, item_34, item_35, item_36, item_37, item_38, item_39, item_40, item_41, item_42, item_43, item_44, item_45, item_46, item_47, item_48, item_49, item_50, item_51, item_52, item_53, item_54, item_55, item_56, item_57, item_58, item_59, item_60, item_61, item_62, item_63, item_64, item_65, item_66, item_67, item_68, item_69, item_70, item_71, item_72, item_73, item_74, item_75, item_76, item_77, item_78, item_79, item_80, item_81, item_82, item_83, item_84, item_85, item_86, item_87, item_88, item_89, item_90, item_91, item_92, item_93, item_94, item_95, item_96, item_97, item_98, item_99, item_100, item_101, item_102, item_103, item_104, item_105, item_106, item_107, item_108, item_109, item_110, item_111, item_112, item_113, item_114, item_115, item_116, item_117, item_118, item_119, item_120, item_121, item_122, item_123, item_124, item_125, item_126, item_127, args_tensor_mask = [True, False, False, False], non_differentiable_idx = []);  fwd_body_1 = bwd_body_1 = out_1 = item = item_1 = item_2 = item_3 = item_4 = item_5 = item_6 = item_7 = item_8 = item_9 = item_10 = item_11 = item_12 = item_13 = item_14 = item_15 = item_16 = item_17 = item_18 = item_19 = item_20 = item_21 = item_22 = item_23 = item_24 = item_25 = item_26 = item_27 = item_28 = item_29 = item_30 = item_31 = item_32 = item_33 = item_34 = item_35 = item_36 = item_37 = item_38 = item_39 = item_40 = item_41 = item_42 = item_43 = item_44 = item_45 = item_46 = item_47 = item_48 = item_49 = item_50 = item_51 = item_52 = item_53 = item_54 = item_55 = item_56 = item_57 = item_58 = item_59 = item_60 = item_61 = item_62 = item_63 = item_64 = item_65 = item_66 = item_67 = item_68 = item_69 = item_70 = item_71 = item_72 = item_73 = item_74 = item_75 = item_76 = item_77 = item_78 = item_79 = item_80 = item_81 = item_82 = item_83 = item_84 = item_85 = item_86 = item_87 = item_88 = item_89 = item_90 = item_91 = item_92 = item_93 = item_94 = item_95 = item_96 = item_97 = item_98 = item_99 = item_100 = item_101 = item_102 = item_103 = item_104 = item_105 = item_106 = item_107 = item_108 = item_109 = item_110 = item_111 = item_112 = item_113 = item_114 = item_115 = item_116 = item_117 = item_118 = item_119 = item_120 = item_121 = item_122 = item_123 = item_124 = item_125 = item_126 = item_127 = None

# File: /home/xmfan/core/a/autoparallel/examples/example_ds3_local_map.py:777 in local_mapped_region, code: torch._check(routed_output.shape[0] == shape[0] * shape[1])
size_3 = routed_output.size()
getitem_139 = size_3[1];  size_3 = getitem_139 = None

# File: /home/xmfan/core/a/autoparallel/examples/example_ds3_local_map.py:779 in local_mapped_region, code: routed_output = routed_output.view(shape)
routed_output_1: "bf16[4, 6144, 2048]" = routed_output.view((4, 6144, 2048));  routed_output = None

# File: /home/xmfan/core/a/autoparallel/examples/example_ds3_local_map.py:781 in local_mapped_region, code: out = out.scatter_add(dim=1, index=token_indices_experts_sorted, src=routed_output)
out_3: "bf16[4, 1024, 2048]" = out_2.scatter_add(dim = 1, index = token_indices_experts_sorted_2, src = routed_output_1);  out_2 = token_indices_experts_sorted_2 = routed_output_1 = None
```

## 1. Unbacked symints contained within the HOP body

Based on 9b2974e812 and 36030e0315.

We disable proxy mode so that unbacked symints that are contained within the HOP subgraph aren't proxied:
```python
[rank0]: RuntimeError: u576 + u577 + u578 + u579 + u580 + u581 + u582 + u583 + u584 + u585 + u586 + u587 + u588 + u589 + u590 + u591 + u592 + u593 + u594 + u595 + u596 + u597 + u598 + u599 + u600 + u601 + u602 + u603 + u604 + u605 + u606 + u607 + u608 + u609 + u610 + u611 + u612 + u613 + u614 + u615 + u616 + u617 + u618 + u619 + u620 + u621 + u622 + u623 + u624 + u625 + u626 + u627 + u628 + u629 + u630 + u631 + u632 + u633 + u634 + u635 + u636 + u637 + u638 + u639 + 1 (140667108386064)is not tracked with proxy for <torch.fx.experimental.proxy_tensor.PythonKeyTracer object at 0x7fef9d44f950>
```
And we ensure that no unbacked symints leak outside of the region.

## 2. Saved symint activations

local_map is using the partitioned backward, and needs to follow the partitioner's desired ordering, this is the same order as AOTAutograd runtime wrapper uses in `_backward_prologue_functional` where we pass symints first: d2c82bafb7/torch/_functorch/_aot_autograd/runtime_wrappers.py (L1702-L1704)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165551
Approved by: https://github.com/bobrenjc93, https://github.com/bdhirsh
ghstack dependencies: #164780
2025-10-28 02:52:41 +00:00
a76b59cc45 [dynamo] local_map error message for reordered inputs (#164780)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164780
Approved by: https://github.com/mlazos
2025-10-28 02:52:41 +00:00
74336f8c77 Revert "[CD] Upgrade to CUDA 13.0.2 for nightly binaries (#165470)"
This reverts commit 5e769ff86780a7ffd561615dbf4b0defe80cfbb9.

Reverted https://github.com/pytorch/pytorch/pull/165470 on behalf of https://github.com/atalman due to Sorry reverting for now, to restore trunk health ([comment](https://github.com/pytorch/pytorch/pull/165470#issuecomment-3454166879))
2025-10-28 02:21:48 +00:00
236ce736a1 [reland] Add provenance to inductor IR nodes created after graph.run (#164255) (#164746)
Summary:

as title

- Some IR nodes are created during `finalize_multi_template_buffers()` in Scheduler. This PR adds provenance (`origin_node` and `origins`) for those nodes.

- Extract `assign_origin_node` function

Test Plan:
```
buck run mode/opt fbcode//caffe2/test/inductor:provenance_tracing -- -r  test_deferred_triton_kernels
```

Differential Revision: D83979975

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164746
Approved by: https://github.com/mlazos
2025-10-28 02:20:20 +00:00
17bdb232e1 [GR v0] AOTI Enablement - Fix GR model AOTI inplace update by skipping empty named (#165970) (#166037)
Summary:

Add a gflag to allow us skip empty constant named parameter during
dense loading. In [vm_parameters.py](https://fburl.com/code/7xr9ihwy), there is
a constant _empty_tensor parameter used for the model. This constant parameter
is skipped in XL weights during model publish because it is empty. This will
break model inplace update later because it will be reported by the AOTI
container but cannot be found from the model merge weights. This diff will
allow us to solve the problem.

Test Plan: Verified inplace update in job https://www.internalfb.com/vanguard/serving_test_cases/1165842932095688

Reviewed By: muchulee8, joannec3634

Differential Revision: D85082330

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166037
Approved by: https://github.com/muchulee8, https://github.com/jcwchen
2025-10-28 01:50:36 +00:00
add37bacda [MPS] Better error checking for FFT ops (#166272)
Namely, error out rather than crash when out dtype is of an unexpected type
Resize output tensor to the expected size in `_out` operation, to prevent crash when tensor of an unexpected size is passed.
Preserve symbolic shapes whenever possible

Test plan: Run `python test_ops.py -v -k test_out_warning_fft_hfft_mps` for MPS device, without this change it crashes with `Error: Invalid KernelDAG, equalShape for destination failed'`, run `python ../test/test_ops.py -v -k test_dtypes_stft_mps`, without this change it crashes with `A complex mlir::Type does not have a corresponding complex MPSDataType"`, when input dtype is bfloat16
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166272
Approved by: https://github.com/kulinseth
2025-10-28 01:31:47 +00:00
1425b40f29 [inductor] Fix argmin/argmax returning incorrect indices for non-contiguous tensor (#165983)
Fixes #163929

Fixes argmin/argmax operations to return correct logical indices instead of physical memory offsets when applied to transposed/permuted tensors.  When `argmin()` or `argmax()` is called on a transposed tensor, Inductor was returning physical memory indices instead of logical row-major indices. This caused incorrect results that don't match eager mode behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165983
Approved by: https://github.com/shunting314
2025-10-28 01:23:24 +00:00
8af9ed0824 [torchfuzz] split, chunk, stack, cat, expand, gather, cumsum, clamp, index_select, split (#166221)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166221
Approved by: https://github.com/pianpwk
ghstack dependencies: #166187, #166188, #166220, #166189, #166190
2025-10-28 01:21:07 +00:00
7045aab143 [torchfuzz] add mhaf operator (#166190)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166190
Approved by: https://github.com/pianpwk
ghstack dependencies: #166187, #166188, #166220, #166189
2025-10-28 01:21:07 +00:00
7ae8aaf4c0 [torchfuzz] add sdpa operator (#166189)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166189
Approved by: https://github.com/pianpwk
ghstack dependencies: #166187, #166188, #166220
2025-10-28 01:20:58 +00:00
f2450798cd [torchfuzz] make pointwise subclasses defined torch_op_name (#166220)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166220
Approved by: https://github.com/pianpwk
ghstack dependencies: #166187, #166188
2025-10-28 01:08:34 +00:00
46d17e8871 [Symm mem] Add a unit test for mempool tensor with dist collective (#166206)
We haven't tried to see if tensors on nvshmem calling c10d collectives work or not. This PR is adding a show case for it inside UT.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166206
Approved by: https://github.com/ngimel
2025-10-28 00:41:47 +00:00
dc011d3203 [inductor][ez] add overridable env var for disabling fx graph cache (#166138)
I set TORCHINDUCTOR_FX_GRAPH_CACHE=0 a lot to make sure the compilation
happens by disabling fx graph caching. I even put this in my .bashrc.
But this cause a simple vllm script fail:
https://gist.github.com/shunting314/4253b2b5ab5e7d1b0fc9516c84054904

Error log:
https://gist.github.com/shunting314/1d04bbeb58bc486f975684f56d65615d

The root cause is,
1. vllm patch inductor_config.fx_graph_cache to True here:
   e255d92990/vllm/compilation/compiler_interface.py (L308)

   The code in vllm relies fx graph cache is on (unless
   VLLM_DISABLE_COMPILE_CACHE is overriden to false)
2. setting TORCHINDUCTOR_FX_GRAPH_CACHE=0 will cause
   inductor_config.fx_graph_cache not overridable.

I add TORCHINDUCTOR_FX_GRAPH_CACHE_DEFAULT so that we can still use it to skip fx
graph cache while still allow project like vllm to override it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166138
Approved by: https://github.com/eellison
2025-10-28 00:27:19 +00:00
e95920e3e6 [Optimus] Rename the post_grad_graph tlparse log (#166109)
Summary:
ezyang observed a cache miss issue, see details in https://github.com/pytorch/pytorch/issues/166012

We thus rename the post_grad_graph tlparse log name to resolve the cache issue.

Differential Revision: D85309891

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166109
Approved by: https://github.com/jamesjwu
2025-10-28 00:23:01 +00:00
5e769ff867 [CD] Upgrade to CUDA 13.0.2 for nightly binaries (#165470)
13.0.U2 is posted, adding to nightlies
Why we want to upgrade: CUDA 13.0.U2 included a new release from cuBLAS that
1. Enabled opt-in fixed-point emulation for FP64 matmuls (D/ZGEMM) which improves performance and power-efficiency.
2. Improved performance on NVIDIA [DGX Spark](https://www.nvidia.com/en-us/products/workstations/dgx-spark/) for FP16/BF16 and FP8 GEMMs.
3. adds BF16x9 FP32 emulation support for SYRK and HERK routines.
Reference: https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#cublas-release-13-0-update-2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165470
Approved by: https://github.com/atalman
2025-10-28 00:21:47 +00:00
0ae3e30621 [torchfuzz] fix group norm operator (#166188)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166188
Approved by: https://github.com/pianpwk
ghstack dependencies: #166187
2025-10-28 00:11:04 +00:00
47f50cfd45 [torchfuzz] check in more ignore regexes (#166187)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166187
Approved by: https://github.com/pianpwk
2025-10-27 23:58:54 +00:00
a51f877287 Enable local tensor mode for another set of DTensor tests (#166105)
Enable local tensor mode DTensor tests for the optimizers, op strategy,  matrix ops,
math ops, init ops, experimental ops, embedding ops, dynamic, convolution ops, main api.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166105
Approved by: https://github.com/ezyang
2025-10-27 23:58:24 +00:00
b44423bbb4 [inductor][choices] lookup table choices 1/3 (#164978)
\# why

- enable users to control which choices get used on which inputs
- reduce lowering time, and pin kernel selection, by selecting
  them for the inputs

\# what

- a new InductorChoices subclass that implements a lookup table
- a README explaining the usage
- corresponding testing

- currently only supports templates that go through
  `V.choices.get_template_configs`

\# testing

```
python3 -bb -m pytest test/inductor/test_lookup_table.py -v
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164978
Approved by: https://github.com/PaulZhang12, https://github.com/eellison
2025-10-27 23:45:16 +00:00
8e1e4ee8e0 [reland][dynamo][easy] Support torch.accelerator.current_accelerator (#166327)
Reland https://github.com/pytorch/pytorch/pull/165734

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166327
Approved by: https://github.com/Lucaskabela
2025-10-27 23:41:43 +00:00
1e836bc769 [MPS] fix large matmul test device (#166271)
PR is self explanatory
Test was introduced by https://github.com/pytorch/pytorch/pull/143095 and was always running on CPU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166271
Approved by: https://github.com/kulinseth, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-10-27 22:56:59 +00:00
9a91486e45 [Inductor-FX] Don't flatten constant args (#166144)
Summary: Fallback kernels are created with flattened constant args and an `unflatten` utility to unflatten them when needed. Apply it in FXConverter to preserve the original structure

Test Plan: added new CI tests

Differential Revision: D85347589

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166144
Approved by: https://github.com/blaine-rister
2025-10-27 22:33:37 +00:00
92381a5aa7 [ROCm] Custom OpenBLAS library name (#166333)
- TheRock build system for ROCm builds OpenBLAS from source and uses a custom name for the library.
- Following existing conventions in `FindOpenBLAS.cmake` to support finding a custom named version of OpenBLAS.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166333
Approved by: https://github.com/jeffdaily
2025-10-27 22:13:05 +00:00
2a5f87decf [cuDNN] Smoke-test runtime cuDNN version matches compile time version in CI (#165922)
Fix and regression test for https://github.com/pytorch/pytorch/issues/165801

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165922
Approved by: https://github.com/malfet, https://github.com/atalman, https://github.com/Skylion007, https://github.com/drisspg

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-10-27 22:10:45 +00:00
840d63c12d Update cuDNN 9.10.2 in Manylinux 2.28 Docker files (#165913)
Fixes https://github.com/pytorch/pytorch/issues/165801
Smoke test: https://github.com/pytorch/pytorch/pull/165922/files

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165913
Approved by: https://github.com/Camyll, https://github.com/Skylion007
2025-10-27 22:08:06 +00:00
2ce894bb1d [dynamo] Dont guard on numpy Cython functions (#166328)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166328
Approved by: https://github.com/Lucaskabela
2025-10-27 22:01:10 +00:00
47ec1e9990 Support regional inductor with custom config (#166269)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166269
Approved by: https://github.com/anijain2305
2025-10-27 21:46:02 +00:00
904abfc2ca Export flex attention with kwargs and DTensor (#166045)
Fixes #165948

Adding registration of the MaskBlock makes flex attention with kwargs exportable.

Also modified unittests to accept kwargs

```
python test/distributed/tensor/test_dtensor_export.py -k test_flex_attention_dtensor_export

python test/inductor/test_flex_attention.py -k test_pytree_
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166045
Approved by: https://github.com/drisspg, https://github.com/SherlockNoMad

Co-authored-by: fduwjj <fduwjj@gmail.com>
2025-10-27 21:40:40 +00:00
7d16fcf2df Re-re-re-re-apply "C++-accessible Placements via pybind11 (#163030)" (#166132)
Was reverted (again!) due to a merge conflict that crept in sometime during the "export to github -> land internally -> merge on github" process.

D85096233

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166132
Approved by: https://github.com/Skylion007, https://github.com/ezyang, https://github.com/malfet
2025-10-27 21:19:32 +00:00
483845a9c4 [DTensor][Op] fix for DTensor ops with Partial placements (#165962)
**Summary:** When operations are done on partial placements, we use sharding logic to incorrectly determine whether we should redistribute the tensor to replicate. By delaying the redistribution, we do the operation first, and then the partial reduction. This leads to incorrect results for max, min, gradient norm clipping, and more. We solve this by setting reduction_linear to False when there is a Partial placement to force the redistribution before completing the op.

**Test Cases**
1. pytest test/distributed/tensor/test_math_ops.py -k test_partial_reduction_ops
2. pytest test/distributed/tensor/test_math_ops.py -k test_matching_partial_reduction_ops

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165962
Approved by: https://github.com/wconstab
2025-10-27 21:17:13 +00:00
60bcb4ee88 [pipeline][be] refactored pipeline composability tests (#165701)
**Summary:** The first thing I did was increase the world size to 8 because test_3d_with_tp_dp_pp wouldn't actually do fully shard as tp = 2, pp = 2, leaving dp = 1. The second thing was refactoring the tests using both single and multi stage schedules so that their logic is largely combined. This was accomplished by using the logic in test_replicate_pp_grad multi-stage schedule to determine the start and end indices for a partial model, but setting virtual_stage to 1 if we are using single stage schedules. Even if this approach isn't approved, multistage schedule logic in test_3d_with_tp_dp_pp and test_replicate_pp should be changed as the logic used is incorrect.

**Test Case**
1. pytest test/distributed/_composable/test_composability/test_pp_composability.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165701
Approved by: https://github.com/H-Huang
2025-10-27 21:08:57 +00:00
ee7434be82 [dynamo][guards] 1/N Guard selectively for DTensor (#165824)
A few internal jobs are observing very high guard overhead for DTensor.
Since we own DTensor, we can make those guards way faster.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165824
Approved by: https://github.com/Lucaskabela, https://github.com/bdhirsh
2025-10-27 20:35:40 +00:00
d049ed2cb1 [BE] Fix metal compilation warnings (#166315)
- Fixes `s/#pragma onces/#pragma once` typoe

All methods in the headers must be inline, otherwise one gets barrage of following warnings
```
/Users/malfet/git/pytorch/pytorch/c10/metal/utils.h:337:7: warning: unused function 'conj<half __attribute__((ext_vector_type(2)))>' [-Wunused-function]
half2 conj(half2 a) {
      ^
/Users/malfet/git/pytorch/pytorch/c10/metal/utils.h:342:8: warning: unused function 'conj<float __attribute__((ext_vector_type(2)))>' [-Wunused-function]
float2 conj(float2 a) {
       ^
2 warnings generated.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166315
Approved by: https://github.com/seemethere, https://github.com/atalman
2025-10-27 20:17:10 +00:00
9901d44418 [torch/utils][Code Clean] Clean asserts in torch/utils/*.py (#165410)
Including:
- `torch/utils/*.py`

Fixes part of #164878

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165410
Approved by: https://github.com/albanD, https://github.com/cyyever
2025-10-27 19:48:55 +00:00
6096c0fc74 Export should use aot_export_joint_with_descriptors (#165931)
This diff moves export run_decompositions to use aot_export_joint_with_descriptors instead of aot_export_module. Doing so, i ran into 2 main bugs:
1) aot_export_joint_with_descriptors don't correctly pass in record_nn_module_stack flag that is needed to populate nn_module_stack by switching the internal tracer.
2) When creating symint with negative inputs, we need to pass in positive=False. This didn't matter before because aot_autograd directly returns integer inputs instead of creating symint.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165931
Approved by: https://github.com/zhxchen17
2025-10-27 19:33:33 +00:00
f6951cb8ea [dynamo] Fix recompilation error message to point to new programming model docs (#165260)
Fixes #163496

Updated troubleshooting_url in torch/_dynamo/utils.py to point to the new programming model documentation.

Changed:
- Old: https://pytorch.org/docs/main/torch.compiler_troubleshooting.html
- New: https://pytorch.org/docs/main/compile/programming_model.recompilation.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165260
Approved by: https://github.com/Lucaskabela, https://github.com/williamwen42
2025-10-27 19:31:11 +00:00
8887a33ede [PyTorch] Improve conversion from/to FP16 on aarch64+sve (#166306)
Summary:
Conversion from/to float16 was not getting covered by conversion templates, because these used float16_t as data type instead of the custom at::Half.

We are adding a shim that makes conversion routines use autovec code for float16

We observed the following performance improvements when compiling targeting armv9-a+sve2+fp16

before:

float16_t->uint8->float16_t ===> 657.489us
float16_t->int8->float16_t ===> 656.518us
float16_t->int16->float16_t ===> 668.998us
float16_t->int64->float16_t ===> 618.444us
float16_t->double->float16_t ===> 439.728us

after

float16_t->uint8->float16_t ===> 181.216us  ----> 263% higher throughput
float16_t->int8->float16_t ===> 179.821us  -----> 265% higher throughput
float16_t->int16->float16_t ===> 183.417us  ----> 265% higher throughput
float16_t->int64->float16_t ===> 459.897us  ----> 35% higher throughput
float16_t->double->float16_t ===> 351.276us  ---> 25% higher throughput

Test Plan:
Correctness:

buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch

Performance:

buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test

Differential Revision: D85533271

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166306
Approved by: https://github.com/mcfi, https://github.com/ezyang
2025-10-27 19:07:44 +00:00
36a48e7e6d Fix existing pyrefly errors on main (#166312)
Silences existing errors on main to keep errors and noise from the type checker to a minimum

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166312
Approved by: https://github.com/Skylion007
2025-10-27 19:03:06 +00:00
c6a02eae5b Add XLAHooksInterface to bazel file (#166179)
Differential Revision: D85446553

Internal builds failing after https://github.com/pytorch/pytorch/pull/161369

```
buck-headers/ATen/Context.h:22:10: fatal error: 'ATen/detail/XLAHooksInterface.h' file not found
   22 | #include <ATen/detail/XLAHooksInterface.h>
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.
```

Changes similar to that PR also change the build_variables file, which I've done here.  I'm not sure why this wasn't caught by the bazel build we have?

Sanity checked that some of the previously failing builds pass after this change
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166179
Approved by: https://github.com/Camyll
2025-10-27 18:47:06 +00:00
6ecd6b23b6 Document limitations of weights_only in SECURITY.md and torch.load doc (#165645)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165645
Approved by: https://github.com/albanD
2025-10-27 18:20:50 +00:00
3f69b4d9b4 [ROCm][tunableop] Fixes flaky test issue (#166084)
Fixes #165603

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166084
Approved by: https://github.com/naromero77amd, https://github.com/jeffdaily
2025-10-27 18:13:30 +00:00
a04edcb27a [inductor] a few workspace api change (#166204)
A few workspace API changes:
1. return outer name when creating. Usually a use case does not care about outer name. But for mix-order-reduction (stacked PR), we need it to do the next-layer of reduction on the workspace tensor
2. be able to override workspace tensor dtype
3. be able to delay the deallocation of workspace tensors in TritonKernel.call_kernel since they may be used after the call. The lifetime of the workspace tensors are only enlarged a little bit. They would be deallocated once the next layer reduction is done.

Test with the stacked PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166204
Approved by: https://github.com/jansel
2025-10-27 18:10:23 +00:00
eb2bad5bb5 [Inductor] Make combo kernel MAX_NUM_ARGS configurable (#166274)
The MAX_NUM_ARGS of ComboKernel is currently a fixed number. We need to tune this number to avoid large fusion for MTIA, thus making it configurable.

Differential Revision: [D85509352](https://our.internmc.facebook.com/intern/diff/D85509352/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166274
Approved by: https://github.com/eellison
2025-10-27 18:06:38 +00:00
a076b4d7ac Use std::min for #166021 (#166195)
Summary:
Attempting to forward fix failures from D85405167 (PR
https://github.com/pytorch/pytorch/pull/166021)

This is devmates suggestion and seems to work, but idk if it's a good idea or not.  Devmate says it's getting resolved to at::min which is host only, and it doesn't happen in OSS is likely because `AT_PER_OPERATOR_HEADERS` is defined in OSS but not internally.

```
In file included from .../ATen/native/hip/Normalization.hip:11:
.../ATen/native/hip/Normalization.cuh:302:37: error: no matching function for call to 'min'
  302 |         v_[u] = input[batch][plane][min(x+u*blockDim.x, input.size(2)-1)];
      |                                     ^~~
```

Differential Revision: D85463674

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166195
Approved by: https://github.com/Camyll, https://github.com/malfet, https://github.com/eqy
2025-10-27 17:57:44 +00:00
a988510c33 Revert "Simplify the CUPTI CMake check for kineto (#161370)"
This reverts commit e67e3d95f3ab863c8b44a6b16ce9f25d40b0d517.

Reverted https://github.com/pytorch/pytorch/pull/161370 on behalf of https://github.com/atalman due to Sorry this is failing libtorch nightly builds [pytorch/pytorch/actions/runs/18800131287/job/53653414136](https://github.com/pytorch/pytorch/actions/runs/18800131287/job/53653414136) ([comment](https://github.com/pytorch/pytorch/pull/161370#issuecomment-3452400982))
2025-10-27 17:05:59 +00:00
99e07c39ec [dynamo][misc] Replace UserFunctionVariable with VariableTracker build (#165707)
Audit: To prevent future issues with functools.partial or callable
objects.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165707
Approved by: https://github.com/Lucaskabela
ghstack dependencies: #166251
2025-10-27 16:47:32 +00:00
610c09f8f4 [dynamo] Fix python_type for UserDefinedClassExceptionVariable (#166251)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166251
Approved by: https://github.com/Lucaskabela
2025-10-27 16:47:32 +00:00
61bad3c1ea [dynamo] Move some FUNCTION_MATCH to CLOSURE_MATCH (#166244)
Closure match is more relaxed than FUNCTION_MATCH (which is ID_MATCH)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166244
Approved by: https://github.com/Lucaskabela
2025-10-27 16:43:46 +00:00
f89a7e9fe8 [1/N][Fix] Fix typo in aten folder (#166126)
Fix typo in aten folder.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166126
Approved by: https://github.com/cyyever, https://github.com/slayton58
2025-10-27 15:34:39 +00:00
f2c81635c8 [DeviceMesh][2D] Use concatenate for 2D (FSDP+TP) instead of getting from root mesh (#165492)
With concatenate API, we can directly combine two meshes together rather than getting the spmd mesh from root.

Differential Revision: [D85409698](https://our.internmc.facebook.com/intern/diff/D85409698)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165492
Approved by: https://github.com/fegin
ghstack dependencies: #163358
2025-10-27 15:33:21 +00:00
e214af6ae8 [Pytorch] Improve float32 erf() on aarch64 (#166262)
Summary:
The float32 data type has a vectorized routine that computes erf(). Such function currently calls std::exp() individually for each float on the vector being processed.
We now use sleef's vectorized routine to compute exp, improving performance of erf.

AVX2/AVX512 also have a custom erf implementation, which uses sleef to compute exp.

We've observed a throughput increase of 25%, when tested on tensors containing 1M elements

Before:
f32 erf: 3175.977us

After:
f32 erf: 2539.446us

Test Plan:
Correctness:

buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch

Performance:

buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test

Differential Revision: D85522651

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166262
Approved by: https://github.com/fadara01, https://github.com/jgong5, https://github.com/aditew01
2025-10-27 14:55:38 +00:00
7ce723d21c [AOTI] Remove c10 as linked library (#165489)
Summary: AOTI compilation doesn't depend on c10 now. It should only depend on C shim symbols which live in libtorch_cpu or libtorch_cuda.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165489
Approved by: https://github.com/yushangdi
2025-10-27 13:53:44 +00:00
4295a9a158 [xla hash update] update the pinned xla hash (#165895)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165895
Approved by: https://github.com/pytorchbot
2025-10-27 11:47:29 +00:00
90d7be35e9 Update slow tests (#165894)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165894
Approved by: https://github.com/pytorchbot
2025-10-27 11:42:14 +00:00
8d4e48831e Remove JITFunction constexpr and some arg_names (#166280)
https://github.com/triton-lang/triton/pull/8536 breaks torch.compile integration. This PR attempts to fix it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166280
Approved by: https://github.com/jansel
2025-10-27 09:29:03 +00:00
90b30ebf7e Update torch-xpu-ops commit pin (#166129)
Update the torch-xpu-ops commit to [intel/torch-xpu-ops@8d373b](8d373ba272), includes:

- Add CONFIGURE_DEPENDS in install_xpu_headers macro to track these headers
- Add check to ensure P2P Tensors are dense
- Switch philox_engine_inputs usage to philox_xpu_state per XPU graph request
- Add vectorization path for maxpool backward channel last
- Fix SYCL_PRINT macro usable on Windows
- Eliminate unnecessary warning if no AOT enabled

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166129
Approved by: https://github.com/EikanWang
2025-10-27 08:17:03 +00:00
173bcda436 Quick fix of torch.save memory leak (#165204)
Fix the memory leak shown in https://github.com/pytorch/pytorch/issues/149846#issuecomment-3392634572
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165204
Approved by: https://github.com/ezyang
2025-10-27 07:50:58 +00:00
6530bc70fb [DeviceMesh] Implement a device mesh concatenate api for submesh and SPMD use case (#163358)
Today FSDP needs to slicing out spmd mesh from root mesh here: https://github.com/pytorch/pytorch/blob/main/torch/distributed/fsdp/_fully_shard/_fsdp_param.py#L301. But essentially, users want is a concatenate of some submesh into a big mesh and used as a spmd mesh. This PR is tentatively trying to implement this API for users.

One thing to note is that, all sub-mesh needs to slicing/flatten or unflatten from same root mesh otherwise the indices make no sense when it comes to mesh indexing and device allocation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163358
Approved by: https://github.com/fegin
2025-10-27 07:39:21 +00:00
4c38887346 [rfc] add debug mode to print meta in fx graphs (#165874)
quite useful in debugging things like unbacked bindings (and presumably other mechanisms that dependent on meta including activation checkpointing and stack trace printing)

<img width="3996" height="748" alt="CleanShot 2025-10-21 at 09 41 54@2x" src="https://github.com/user-attachments/assets/8b885a36-54a5-48b4-a23c-80b39ac7eb12" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165874
Approved by: https://github.com/ezyang
ghstack dependencies: #165893
2025-10-27 07:20:28 +00:00
81fa4a204c Enable Intel GPU on 4 unit test cases (#165405)
For https://github.com/pytorch/pytorch/issues/114850, we will port some aten unit tests to Intel GPU. We could enable Intel GPU with following methods and try the best to keep the original code styles:

1. Replaced onlyCUDA with onlyOn(['cuda', 'xpu']) for supported tests
2. Added allow_xpu=True for supported test class in test parameterization.
3. Use torch.accelerator to extend cude specific test to XPU if needed.
4. Enabled 'xpu' for some test pathes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165405
Approved by: https://github.com/guangyey, https://github.com/ezyang
2025-10-27 06:06:07 +00:00
4e6afa8c07 [BE][Opinfo] Mark [c]double as unsupported for MPS (#166213)
Test plan: Run `python ../test/test_ops.py -v -k test_dtypes___radd___mps` when TestCommon parametrization is enabled for MPS
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166213
Approved by: https://github.com/kulinseth, https://github.com/Skylion007
2025-10-27 05:38:36 +00:00
79aa88cc5d Remove old ROCm version checks and branches (#166111)
This PR removes outdated ROCm version checks and their branches. While there is no explicit mention of minimum supported version. ROCm 6.4 is listed in the installation page and the CI yaml files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166111
Approved by: https://github.com/ezyang
2025-10-27 05:32:54 +00:00
fa4cb91846 add support for ir scalar literal parsing for inf/-inf/True/False (#163924)
Currently the ir parser doesn't support parse ir like
```
graph():
  %12 : float = prim::Constant[value=-inf]()
  %13 : float = prim::Constant[value=inf]()
  %14 : bool = prim::Constant[value=True]()
  %15 : bool = prim::Constant[value=False]()
  return (%12)
```

So the python script below will throw error.

```
#!/bin/env python
import torch

def test():
    return [True, False]
f = torch.jit.script(test)
torch._C._jit_pass_constant_propagation(f.graph)
ts_str = f.graph.__repr__()
print(ts_str)
ts = torch.parse_ir(ts_str)
func = torch._C._create_function_from_graph("forward", ts)
ret = func()
assert ret == [True, False]

def test():
    return [float("inf"), float("-inf")]
f = torch.jit.script(test)
torch._C._jit_pass_constant_propagation(f.graph)
ts_str = f.graph.__repr__()
print(ts_str)
ts = torch.parse_ir(ts_str)
func = torch._C._create_function_from_graph("forward", ts)
ret = func()
assert ret == [float("inf"), float("-inf")]
```

I add "inf" and bool cases for IRParser::parseScalarLiteral in irparser.cpp.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163924
Approved by: https://github.com/ezyang
2025-10-27 05:10:21 +00:00
c58d0ad85d Propose Out-of-tree Backend Integration (PrivateUse1) as a module and FFFrog as the maintainer (#165958)
I'd like to propose a new module `Out-of-tree Backend Integration` via `PrivateUse1` device key. The out-of-tree backend integration via `PrivateUse1` device key has been a recommended mechanism of plug-in third-party accelerator devices into PyTorch. There are already quite a few documents/tutorials on the usage with the primary one as https://docs.pytorch.org/docs/main/accelerator/index.html.

We also saw more and more HW vendors to leverage the `PrivateUse1` mechanism to support their accelerators. For example:
1. Ascend NPU
2. Microsoft MAIA
3. MooreThreads MUSA
4. Cambricon MLU

The scope of `PrivateUse1` based out-of-tree backend integration is composed of two parts:
1. `PrivateUse1` device as an out-of-tree backend that involves:
    (a) make `PrivateUse1` a function-complete device as other in-tree devices: i.e., device runtime, autograd, autocast, profiling, distributed, quantization etc.
    (b) a pluggable design to allow out-of-tree integration to extend the functionality of `PrivateUse1` such as a backend registration mechanism that allows user-friendly device naming, runtime extension points with either C++ and Python for third-party to plug-in their runtime implementation, customizable tensor implementation for third-party to add extra info/functionality to the tensor and their serialization.
2. OpenReg: A test suite and documentation effort to guarantee the functional correctness of `PrivateUse1` mechanism and to guide HW vendors with the right implementation.

I'm also proposing @FFFrog as the module maintainer for this new module due to his continuous contribution to the design and implementation both parts of the module. Below are the RFCs/Feature Proposals @FFFrog was working on:
1. [An improvement of PrivateUse1 mechanism, facilitating third-party backend integration](https://docs.google.com/document/d/1_2EO5A2Ww3xDwqbhIvs9Nk65-jV0oNYg3XAmNUsHdAY/edit?tab=t.0#heading=h.5vt8c1vo4dc7)
2. [The interoperability Standard of Third-party Backend Integration Mechanism](9bd181e742/RFC-0037-Interoperability-Standard-of-3rd-Backend-Integration-Mechanism.md)
3. [PyTorch Backend Accelerator Integration Verification and Guidance](f6048cbd4f/RFC-0045-PyTorch-Accelerator-Integration-Enhancements.md)

@FFFrog contributed 240+ PRs and a majority of them is related to `PrivateUse1`. (https://github.com/pytorch/pytorch/pulls?q=is%3Apr+author%3Afffrog+). He also reviewed 50+ PRs related to this area. He is also the primary author of OpenReg.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165958
Approved by: https://github.com/albanD, https://github.com/malfet, https://github.com/ezyang
2025-10-27 05:00:15 +00:00
000f49551b [DeviceMesh] Use _flatten_rank_map to replace _flatten_mesh_list so that we don't need to compare root mesh (#166003) (#166264)
Summary:

Since we are already share a flattened tensor `_rank_map` across all meshes from a same root mesh, we can just use a flattened list of it to replace the comparison of root_mesh and flattened_mesh_list (because with same _rank_map and layout, the mesh tensor is guaranteed to be the same). This way we can also give back the CPU overhead added in https://github.com/pytorch/pytorch/pull/164510 and further simply the code.

We do have a more ambitious universe-based change here: https://github.com/pytorch/pytorch/pull/165680 but it needs more discussions and would lead to BC breaking. We might eventually merge that PR but probably not now and this is a change which is not BC breaking and will help concatenate and 2D integration with concatenate.

cc H-Huang awgu wanchaol fegin wz337 wconstab d4l3k pragupta msaroufim dcci

imported-using-ghimport

Test Plan: Imported from OSS

Differential Revision: D85526705

Pulled By: fduwjj

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166264
Approved by: https://github.com/XilunWu
2025-10-27 03:15:15 +00:00
9940e894ea Fix pyrefly ignore syntax in _inductor (#166247)
Ensures pyrefly ignores only ignore the intended error code.

pyrefly check
lintrunner

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166247
Approved by: https://github.com/oulgen
2025-10-27 02:48:42 +00:00
27302a4932 Fix error suppression syntax in onnx, jit, _dynamo (#166249)
Ensures pyrefly will only silence one specific error code

pyrefly check
lintrunner

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166249
Approved by: https://github.com/oulgen
2025-10-27 02:01:54 +00:00
507614ba43 Add GraphModule.recompile_submodules, use for regional inductor (#166002)
This makes it so that `GraphModule.recompile()` will also recompile any submodules that are also graph modules, which allows us to pass all existing regional inductor tests without skipping.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166002
Approved by: https://github.com/oulgen
ghstack dependencies: #165996
2025-10-27 01:40:51 +00:00
86f9f1d0ab Enable local tensor model for DTensor redistribute tests (#166081)
Redistribute test exercise extensively various sharding schemes and
redistribution between them. These tests uncovered more edge cases
that were not supported by the local tensor primarily different flavors
of uneven sharding. In order to handle these cases this change implements
missing functional collectives and adds support for uneven sharding
case where sharding group (ranks) is larger than the size of the dimension
being sharded. In the latter case the "missing" shards are represented
by zero sized tensors so that the rest of the local tensor machinery
can stay oblivious to this special case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166081
Approved by: https://github.com/ezyang
2025-10-26 22:21:43 +00:00
154e4d36e9 Fix pyrelfy ignore syntax in distributions and ao (#166248)
Ensures existing pyrefly ignores only ignore the intended error code

pyrefly check
lintrunner

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166248
Approved by: https://github.com/oulgen
2025-10-26 22:13:48 +00:00
a2b6afeac5 [dynamo][guards] CLASS_MATCH guard for readability (#166217)
We were using FUNCTION_MATCH guard for classes. This was very confusing
(although correct).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166217
Approved by: https://github.com/jansel
2025-10-26 18:35:27 +00:00
262830d86c [dynamo] Repro for 166238 (#166252)
xfail repro for https://github.com/pytorch/pytorch/issues/166238

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166252
Approved by: https://github.com/XuehaiPan, https://github.com/jansel
2025-10-26 18:34:22 +00:00
e4c01011c2 Mark FlexAttentionBackward as cacheable (#165996)
This probably should have been marked cacheable a long time ago, no reason that it isn't.

Test Plan:
New regional inductor tests for test_flex_attention now are serializable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165996
Approved by: https://github.com/oulgen, https://github.com/zou3519, https://github.com/drisspg
2025-10-26 14:39:17 +00:00
a60d9e1f6d Fix flake8 B028 warnings (#166224)
This PR fixes flake8 B028 warning by specifying stacklevel=2 in `warnings.warn`. The advantage is that users can know more contextual information about PyTorch warnings.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166224
Approved by: https://github.com/ezyang
2025-10-26 06:18:55 +00:00
f863550192 [dtensor] fix incorrect norm calculation for Partial DTensors (#159856)
The sharding strategies for `aten.linalg_vector_norm` and the optimized `aten._foreach_norm.Scalar` incorrectly assumes the norm operation is always "reduction linear" with respect to its inputs. This bug causes the norm to be computed on local, incomplete data for DTensors with a `Partial(sum)` placement, leading to an inflated result (a sum of norms, rather than the correct norm of the sum).

The error can be reproduced with the following script:
```python
import os
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import DTensor, Partial, Replicate, Shard

def setup_distributed():
    """Initializes the distributed environment."""
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    world_size = int(os.environ["WORLD_SIZE"])

    dist.init_process_group("nccl")
    torch.cuda.set_device(local_rank)

    print(f"Initialized process {rank}/{world_size} on GPU {local_rank}")
    return rank, world_size

rank, world_size = setup_distributed()
assert world_size == 2, "Please run with exactly 2 GPUs for this minimal repro."

mesh = init_device_mesh("cuda", (world_size,))

if rank == 0:
    local_partial = torch.tensor([1.0, 3.0], dtype=torch.float32)
else:
    local_partial = torch.tensor([2.0, 1.0], dtype=torch.float32)

partial_dtensor = DTensor.from_local(local_partial, mesh, [Partial("sum")])
partial_result = torch.linalg.vector_norm(partial_dtensor)
print(
    f"[Rank {rank}] partial_result: {partial_result}, full_tensor: {partial_result.full_tensor()}"
)

shard_dtensor = partial_dtensor.redistribute(mesh, [Shard(0)])
shard_result = torch.linalg.vector_norm(shard_dtensor)
print(
    f"[Rank {rank}] shard_result: {shard_result}, full_tensor {shard_result.full_tensor()}"
)

replicate_dtensor = partial_dtensor.redistribute(mesh, [Replicate()])
replicate_result = torch.linalg.vector_norm(replicate_dtensor)
print(
    f"[Rank {rank}] replicate_result: {replicate_result}, full_tensor {replicate_result.full_tensor()}"
)

full_tensor = partial_dtensor.full_tensor()
full_result = torch.linalg.vector_norm(full_tensor)
print(f"[Rank {rank}] correct_result: {full_result}")
```

Run results show that the norm is `sqrt(1**2 + 3**2) + sqrt(2**2 + 1**2) = sqrt(10) + sqrt(5) = 5.398` instead of `sqrt(3**2 + 4**2) = 5`.
```
$ torchrun --local-ranks-filter 0 --nproc-per-node 2 script.py
Initialized process 0/2 on GPU 0
[Rank 0] partial_result: DTensor(local_tensor=3.1622776985168457, device_mesh=DeviceMesh('cuda', [0, 1]), placements=(Partial(sum),)), full_tensor: 5.398345947265625
[Rank 0] shard_result: DTensor(local_tensor=3.0, device_mesh=DeviceMesh('cuda', [0, 1]), placements=(_NormPartial(reduce_op='sum', norm_type=2),)), full_tensor 5.0
[Rank 0] replicate_result: DTensor(local_tensor=5.0, device_mesh=DeviceMesh('cuda', [0, 1]), placements=(Replicate(),)), full_tensor 5.0
[Rank 0] correct_result: 5.0
$ torchrun --local-ranks-filter 1 --nproc-per-node 2 script.py
Initialized process 1/2 on GPU 1
[Rank 1] partial_result: DTensor(local_tensor=2.2360680103302, device_mesh=DeviceMesh('cuda', [0, 1]), placements=(Partial(sum),)), full_tensor: 5.398345947265625
[Rank 1] shard_result: DTensor(local_tensor=4.0, device_mesh=DeviceMesh('cuda', [0, 1]), placements=(_NormPartial(reduce_op='sum', norm_type=2),)), full_tensor 5.0
[Rank 1] replicate_result: DTensor(local_tensor=5.0, device_mesh=DeviceMesh('cuda', [0, 1]), placements=(Replicate(),)), full_tensor 5.0
[Rank 1] correct_result: 5.0
```

This fix simply forces `reduction_linear=False` for partial placements. The output becomes:
```
$ python -m torch.distributed.run --local-ranks-filter 0 --nproc-per-node 2 script.py
Initialized process 0/2 on GPU 0
[Rank 0] partial_result: DTensor(local_tensor=5.0, device_mesh=DeviceMesh((2,), device: 'cuda', stride: (1,)), placements=(Replicate(),)), full_tensor: 5.0
[Rank 0] shard_result: DTensor(local_tensor=3.0, device_mesh=DeviceMesh((2,), device: 'cuda', stride: (1,)), placements=(_NormPartial(reduce_op='sum', norm_type=2),)), full_tensor 5.0
[Rank 0] replicate_result: DTensor(local_tensor=5.0, device_mesh=DeviceMesh((2,), device: 'cuda', stride: (1,)), placements=(Replicate(),)), full_tensor 5.0
[Rank 0] correct_result: 5.0
$ python -m torch.distributed.run --local-ranks-filter 1 --nproc-per-node 2 script.py
Initialized process 1/2 on GPU 1
[Rank 1] partial_result: DTensor(local_tensor=5.0, device_mesh=DeviceMesh((2,), device: 'cuda', stride: (1,)), placements=(Replicate(),)), full_tensor: 5.0
[Rank 1] shard_result: DTensor(local_tensor=4.0, device_mesh=DeviceMesh((2,), device: 'cuda', stride: (1,)), placements=(_NormPartial(reduce_op='sum', norm_type=2),)), full_tensor 5.0
[Rank 1] replicate_result: DTensor(local_tensor=5.0, device_mesh=DeviceMesh((2,), device: 'cuda', stride: (1,)), placements=(Replicate(),)), full_tensor 5.0
[Rank 1] correct_result: 5.0
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159856
Approved by: https://github.com/ezyang
2025-10-26 05:58:44 +00:00
84b14f3a10 Fix error suppression syntax in utils and nn (#166242)
Fixes syntax for pyrefly : ignores so they only ignore a specific category. No functional changes

pyrefly check
lintrunner

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166242
Approved by: https://github.com/oulgen, https://github.com/cyyever
2025-10-26 05:21:07 +00:00
5121499f6b Fix pyrefly ignore syntax in /tools/... (#166240)
Second PR for this - only adjusts the syntax used for the ignores so the suppressions hide only one category of pyrefly errors.

test:
pyrefly check
lintrunner

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166240
Approved by: https://github.com/oulgen
2025-10-26 04:20:16 +00:00
8f80892359 Use correct pyrefly syntax in suppressions distributed/... (#166241)
Updates the pyrefy-ignores in the torch/distributed directory to use the correct syntax. No functional changes.

pyrefly check
lintrunner

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166241
Approved by: https://github.com/oulgen
2025-10-26 04:16:41 +00:00
787 changed files with 12127 additions and 5484 deletions

View File

@ -150,7 +150,7 @@ function install_130 {
CUDNN_VERSION=9.13.0.50
echo "Installing CUDA 13.0 and cuDNN ${CUDNN_VERSION} and NVSHMEM and NCCL and cuSparseLt-0.7.1"
# install CUDA 13.0 in the same container
install_cuda 13.0.0 cuda_13.0.0_580.65.06_linux
install_cuda 13.0.2 cuda_13.0.2_580.95.05_linux
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
install_cudnn 13 $CUDNN_VERSION

View File

@ -1,3 +1,8 @@
---
name: docstring
description: Write docstrings for PyTorch functions and methods following PyTorch conventions. Use when writing or updating docstrings in PyTorch code.
---
# PyTorch Docstring Writing Guide
This skill describes how to write docstrings for functions and methods in the PyTorch project, following the conventions in `torch/_tensor_docs.py` and `torch/nn/functional.py`.

View File

@ -0,0 +1,385 @@
---
name: skill-writer
description: Guide users through creating Agent Skills for Claude Code. Use when the user wants to create, write, author, or design a new Skill, or needs help with SKILL.md files, frontmatter, or skill structure.
---
# Skill Writer
This Skill helps you create well-structured Agent Skills for Claude Code that follow best practices and validation requirements.
## When to use this Skill
Use this Skill when:
- Creating a new Agent Skill
- Writing or updating SKILL.md files
- Designing skill structure and frontmatter
- Troubleshooting skill discovery issues
- Converting existing prompts or workflows into Skills
## Instructions
### Step 1: Determine Skill scope
First, understand what the Skill should do:
1. **Ask clarifying questions**:
- What specific capability should this Skill provide?
- When should Claude use this Skill?
- What tools or resources does it need?
- Is this for personal use or team sharing?
2. **Keep it focused**: One Skill = one capability
- Good: "PDF form filling", "Excel data analysis"
- Too broad: "Document processing", "Data tools"
### Step 2: Choose Skill location
Determine where to create the Skill:
**Personal Skills** (`~/.claude/skills/`):
- Individual workflows and preferences
- Experimental Skills
- Personal productivity tools
**Project Skills** (`.claude/skills/`):
- Team workflows and conventions
- Project-specific expertise
- Shared utilities (committed to git)
### Step 3: Create Skill structure
Create the directory and files:
```bash
# Personal
mkdir -p ~/.claude/skills/skill-name
# Project
mkdir -p .claude/skills/skill-name
```
For multi-file Skills:
```
skill-name/
├── SKILL.md (required)
├── reference.md (optional)
├── examples.md (optional)
├── scripts/
│ └── helper.py (optional)
└── templates/
└── template.txt (optional)
```
### Step 4: Write SKILL.md frontmatter
Create YAML frontmatter with required fields:
```yaml
---
name: skill-name
description: Brief description of what this does and when to use it
---
```
**Field requirements**:
- **name**:
- Lowercase letters, numbers, hyphens only
- Max 64 characters
- Must match directory name
- Good: `pdf-processor`, `git-commit-helper`
- Bad: `PDF_Processor`, `Git Commits!`
- **description**:
- Max 1024 characters
- Include BOTH what it does AND when to use it
- Use specific trigger words users would say
- Mention file types, operations, and context
**Optional frontmatter fields**:
- **allowed-tools**: Restrict tool access (comma-separated list)
```yaml
allowed-tools: Read, Grep, Glob
```
Use for:
- Read-only Skills
- Security-sensitive workflows
- Limited-scope operations
### Step 5: Write effective descriptions
The description is critical for Claude to discover your Skill.
**Formula**: `[What it does] + [When to use it] + [Key triggers]`
**Examples**:
✅ **Good**:
```yaml
description: Extract text and tables from PDF files, fill forms, merge documents. Use when working with PDF files or when the user mentions PDFs, forms, or document extraction.
```
✅ **Good**:
```yaml
description: Analyze Excel spreadsheets, create pivot tables, and generate charts. Use when working with Excel files, spreadsheets, or analyzing tabular data in .xlsx format.
```
❌ **Too vague**:
```yaml
description: Helps with documents
description: For data analysis
```
**Tips**:
- Include specific file extensions (.pdf, .xlsx, .json)
- Mention common user phrases ("analyze", "extract", "generate")
- List concrete operations (not generic verbs)
- Add context clues ("Use when...", "For...")
### Step 6: Structure the Skill content
Use clear Markdown sections:
```markdown
# Skill Name
Brief overview of what this Skill does.
## Quick start
Provide a simple example to get started immediately.
## Instructions
Step-by-step guidance for Claude:
1. First step with clear action
2. Second step with expected outcome
3. Handle edge cases
## Examples
Show concrete usage examples with code or commands.
## Best practices
- Key conventions to follow
- Common pitfalls to avoid
- When to use vs. not use
## Requirements
List any dependencies or prerequisites:
```bash
pip install package-name
```
## Advanced usage
For complex scenarios, see [reference.md](reference.md).
```
### Step 7: Add supporting files (optional)
Create additional files for progressive disclosure:
**reference.md**: Detailed API docs, advanced options
**examples.md**: Extended examples and use cases
**scripts/**: Helper scripts and utilities
**templates/**: File templates or boilerplate
Reference them from SKILL.md:
```markdown
For advanced usage, see [reference.md](reference.md).
Run the helper script:
\`\`\`bash
python scripts/helper.py input.txt
\`\`\`
```
### Step 8: Validate the Skill
Check these requirements:
✅ **File structure**:
- [ ] SKILL.md exists in correct location
- [ ] Directory name matches frontmatter `name`
✅ **YAML frontmatter**:
- [ ] Opening `---` on line 1
- [ ] Closing `---` before content
- [ ] Valid YAML (no tabs, correct indentation)
- [ ] `name` follows naming rules
- [ ] `description` is specific and < 1024 chars
✅ **Content quality**:
- [ ] Clear instructions for Claude
- [ ] Concrete examples provided
- [ ] Edge cases handled
- [ ] Dependencies listed (if any)
✅ **Testing**:
- [ ] Description matches user questions
- [ ] Skill activates on relevant queries
- [ ] Instructions are clear and actionable
### Step 9: Test the Skill
1. **Restart Claude Code** (if running) to load the Skill
2. **Ask relevant questions** that match the description:
```
Can you help me extract text from this PDF?
```
3. **Verify activation**: Claude should use the Skill automatically
4. **Check behavior**: Confirm Claude follows the instructions correctly
### Step 10: Debug if needed
If Claude doesn't use the Skill:
1. **Make description more specific**:
- Add trigger words
- Include file types
- Mention common user phrases
2. **Check file location**:
```bash
ls ~/.claude/skills/skill-name/SKILL.md
ls .claude/skills/skill-name/SKILL.md
```
3. **Validate YAML**:
```bash
cat SKILL.md | head -n 10
```
4. **Run debug mode**:
```bash
claude --debug
```
## Common patterns
### Read-only Skill
```yaml
---
name: code-reader
description: Read and analyze code without making changes. Use for code review, understanding codebases, or documentation.
allowed-tools: Read, Grep, Glob
---
```
### Script-based Skill
```yaml
---
name: data-processor
description: Process CSV and JSON data files with Python scripts. Use when analyzing data files or transforming datasets.
---
# Data Processor
## Instructions
1. Use the processing script:
\`\`\`bash
python scripts/process.py input.csv --output results.json
\`\`\`
2. Validate output with:
\`\`\`bash
python scripts/validate.py results.json
\`\`\`
```
### Multi-file Skill with progressive disclosure
```yaml
---
name: api-designer
description: Design REST APIs following best practices. Use when creating API endpoints, designing routes, or planning API architecture.
---
# API Designer
Quick start: See [examples.md](examples.md)
Detailed reference: See [reference.md](reference.md)
## Instructions
1. Gather requirements
2. Design endpoints (see examples.md)
3. Document with OpenAPI spec
4. Review against best practices (see reference.md)
```
## Best practices for Skill authors
1. **One Skill, one purpose**: Don't create mega-Skills
2. **Specific descriptions**: Include trigger words users will say
3. **Clear instructions**: Write for Claude, not humans
4. **Concrete examples**: Show real code, not pseudocode
5. **List dependencies**: Mention required packages in description
6. **Test with teammates**: Verify activation and clarity
7. **Version your Skills**: Document changes in content
8. **Use progressive disclosure**: Put advanced details in separate files
## Validation checklist
Before finalizing a Skill, verify:
- [ ] Name is lowercase, hyphens only, max 64 chars
- [ ] Description is specific and < 1024 chars
- [ ] Description includes "what" and "when"
- [ ] YAML frontmatter is valid
- [ ] Instructions are step-by-step
- [ ] Examples are concrete and realistic
- [ ] Dependencies are documented
- [ ] File paths use forward slashes
- [ ] Skill activates on relevant queries
- [ ] Claude follows instructions correctly
## Troubleshooting
**Skill doesn't activate**:
- Make description more specific with trigger words
- Include file types and operations in description
- Add "Use when..." clause with user phrases
**Multiple Skills conflict**:
- Make descriptions more distinct
- Use different trigger words
- Narrow the scope of each Skill
**Skill has errors**:
- Check YAML syntax (no tabs, proper indentation)
- Verify file paths (use forward slashes)
- Ensure scripts have execute permissions
- List all dependencies
## Examples
See the documentation for complete examples:
- Simple single-file Skill (commit-helper)
- Skill with tool permissions (code-reviewer)
- Multi-file Skill (pdf-processing)
## Output format
When creating a Skill, I will:
1. Ask clarifying questions about scope and requirements
2. Suggest a Skill name and location
3. Create the SKILL.md file with proper frontmatter
4. Include clear instructions and examples
5. Add supporting files if needed
6. Provide testing instructions
7. Validate against all requirements
The result will be a complete, working Skill that follows all best practices and validation rules.

View File

@ -1 +1 @@
0fa6e3129e61143224663e1ec67980d12b7ec4eb
df6798dfb931ce7c7fe5bed2447cd1092a5981af

View File

@ -22,7 +22,7 @@ CUDA_ARCHES_FULL_VERSION = {
"12.6": "12.6.3",
"12.8": "12.8.1",
"12.9": "12.9.1",
"13.0": "13.0.0",
"13.0": "13.0.2",
}
CUDA_ARCHES_CUDNN_VERSION = {
"12.6": "9",
@ -96,21 +96,21 @@ PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {
"nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux'"
),
"13.0": (
"nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | "
"nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | "
"nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | "
"nvidia-cuda-nvrtc==13.0.88; platform_system == 'Linux' | "
"nvidia-cuda-runtime==13.0.96; platform_system == 'Linux' | "
"nvidia-cuda-cupti==13.0.85; platform_system == 'Linux' | "
"nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | "
"nvidia-cublas==13.0.0.19; platform_system == 'Linux' | "
"nvidia-cufft==12.0.0.15; platform_system == 'Linux' | "
"nvidia-cublas==13.1.0.3; platform_system == 'Linux' | "
"nvidia-cufft==12.0.0.61; platform_system == 'Linux' | "
"nvidia-curand==10.4.0.35; platform_system == 'Linux' | "
"nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | "
"nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | "
"nvidia-cusolver==12.0.4.66; platform_system == 'Linux' | "
"nvidia-cusparse==12.6.3.3; platform_system == 'Linux' | "
"nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | "
"nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | "
"nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | "
"nvidia-nvtx==13.0.39; platform_system == 'Linux' | "
"nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | "
"nvidia-cufile==1.15.0.42; platform_system == 'Linux'"
"nvidia-nvtx==13.0.85; platform_system == 'Linux' | "
"nvidia-nvjitlink==13.0.88; platform_system == 'Linux' | "
"nvidia-cufile==1.15.1.6; platform_system == 'Linux'"
),
"xpu": (
"intel-cmplr-lib-rt==2025.2.1 | "

View File

@ -270,7 +270,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_10-cuda-aarch64-13_0
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.88; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.96; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.85; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.1.0.3; platform_system == 'Linux' | nvidia-cufft==12.0.0.61; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.4.66; platform_system == 'Linux' | nvidia-cusparse==12.6.3.3; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.85; platform_system == 'Linux' | nvidia-nvjitlink==13.0.88; platform_system == 'Linux' | nvidia-cufile==1.15.1.6; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -519,7 +519,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_11-cuda-aarch64-13_0
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.88; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.96; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.85; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.1.0.3; platform_system == 'Linux' | nvidia-cufft==12.0.0.61; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.4.66; platform_system == 'Linux' | nvidia-cusparse==12.6.3.3; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.85; platform_system == 'Linux' | nvidia-nvjitlink==13.0.88; platform_system == 'Linux' | nvidia-cufile==1.15.1.6; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -768,7 +768,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_12-cuda-aarch64-13_0
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.88; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.96; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.85; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.1.0.3; platform_system == 'Linux' | nvidia-cufft==12.0.0.61; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.4.66; platform_system == 'Linux' | nvidia-cusparse==12.6.3.3; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.85; platform_system == 'Linux' | nvidia-nvjitlink==13.0.88; platform_system == 'Linux' | nvidia-cufile==1.15.1.6; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -1017,7 +1017,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_13-cuda-aarch64-13_0
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.88; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.96; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.85; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.1.0.3; platform_system == 'Linux' | nvidia-cufft==12.0.0.61; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.4.66; platform_system == 'Linux' | nvidia-cusparse==12.6.3.3; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.85; platform_system == 'Linux' | nvidia-nvjitlink==13.0.88; platform_system == 'Linux' | nvidia-cufile==1.15.1.6; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -1266,7 +1266,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_13t-cuda-aarch64-13_0
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.88; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.96; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.85; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.1.0.3; platform_system == 'Linux' | nvidia-cufft==12.0.0.61; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.4.66; platform_system == 'Linux' | nvidia-cusparse==12.6.3.3; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.85; platform_system == 'Linux' | nvidia-nvjitlink==13.0.88; platform_system == 'Linux' | nvidia-cufile==1.15.1.6; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -1515,7 +1515,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_14-cuda-aarch64-13_0
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.88; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.96; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.85; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.1.0.3; platform_system == 'Linux' | nvidia-cufft==12.0.0.61; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.4.66; platform_system == 'Linux' | nvidia-cusparse==12.6.3.3; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.85; platform_system == 'Linux' | nvidia-nvjitlink==13.0.88; platform_system == 'Linux' | nvidia-cufile==1.15.1.6; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -1764,7 +1764,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_14t-cuda-aarch64-13_0
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.88; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.96; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.85; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.1.0.3; platform_system == 'Linux' | nvidia-cufft==12.0.0.61; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.4.66; platform_system == 'Linux' | nvidia-cusparse==12.6.3.3; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.85; platform_system == 'Linux' | nvidia-nvjitlink==13.0.88; platform_system == 'Linux' | nvidia-cufile==1.15.1.6; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}

View File

@ -325,7 +325,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_10-cuda13_0
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.88; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.96; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.85; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.1.0.3; platform_system == 'Linux' | nvidia-cufft==12.0.0.61; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.4.66; platform_system == 'Linux' | nvidia-cusparse==12.6.3.3; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.85; platform_system == 'Linux' | nvidia-nvjitlink==13.0.88; platform_system == 'Linux' | nvidia-cufile==1.15.1.6; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_10-cuda13_0-test: # Testing
@ -991,7 +991,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_11-cuda13_0
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.88; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.96; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.85; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.1.0.3; platform_system == 'Linux' | nvidia-cufft==12.0.0.61; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.4.66; platform_system == 'Linux' | nvidia-cusparse==12.6.3.3; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.85; platform_system == 'Linux' | nvidia-nvjitlink==13.0.88; platform_system == 'Linux' | nvidia-cufile==1.15.1.6; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cuda13_0-test: # Testing
@ -1657,7 +1657,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_12-cuda13_0
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.88; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.96; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.85; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.1.0.3; platform_system == 'Linux' | nvidia-cufft==12.0.0.61; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.4.66; platform_system == 'Linux' | nvidia-cusparse==12.6.3.3; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.85; platform_system == 'Linux' | nvidia-nvjitlink==13.0.88; platform_system == 'Linux' | nvidia-cufile==1.15.1.6; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-cuda13_0-test: # Testing
@ -2323,7 +2323,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13-cuda13_0
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.88; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.96; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.85; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.1.0.3; platform_system == 'Linux' | nvidia-cufft==12.0.0.61; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.4.66; platform_system == 'Linux' | nvidia-cusparse==12.6.3.3; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.85; platform_system == 'Linux' | nvidia-nvjitlink==13.0.88; platform_system == 'Linux' | nvidia-cufile==1.15.1.6; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13-cuda13_0-test: # Testing
@ -2989,7 +2989,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13t-cuda13_0
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.88; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.96; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.85; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.1.0.3; platform_system == 'Linux' | nvidia-cufft==12.0.0.61; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.4.66; platform_system == 'Linux' | nvidia-cusparse==12.6.3.3; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.85; platform_system == 'Linux' | nvidia-nvjitlink==13.0.88; platform_system == 'Linux' | nvidia-cufile==1.15.1.6; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda13_0-test: # Testing
@ -3655,7 +3655,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_14-cuda13_0
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.88; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.96; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.85; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.1.0.3; platform_system == 'Linux' | nvidia-cufft==12.0.0.61; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.4.66; platform_system == 'Linux' | nvidia-cusparse==12.6.3.3; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.85; platform_system == 'Linux' | nvidia-nvjitlink==13.0.88; platform_system == 'Linux' | nvidia-cufile==1.15.1.6; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_14-cuda13_0-test: # Testing
@ -4321,7 +4321,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_14t-cuda13_0
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.88; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.96; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.85; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.1.0.3; platform_system == 'Linux' | nvidia-cufft==12.0.0.61; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.4.66; platform_system == 'Linux' | nvidia-cusparse==12.6.3.3; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.85; platform_system == 'Linux' | nvidia-nvjitlink==13.0.88; platform_system == 'Linux' | nvidia-cufile==1.15.1.6; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_14t-cuda13_0-test: # Testing

View File

@ -31,9 +31,9 @@ Be careful when running untrusted models. This classification includes models cr
**Prefer to execute untrusted models within a secure, isolated environment such as a sandbox** (e.g., containers, virtual machines). This helps protect your system from potentially malicious code. You can find further details and instructions in [this page](https://developers.google.com/code-sandboxing).
**Be mindful of risky model formats**. Give preference to share and load weights with the appropriate format for your use case. [safetensors](https://huggingface.co/docs/safetensors/en/index) gives the most safety but is the most restricted in what it supports. [`torch.load`](https://pytorch.org/docs/stable/generated/torch.load.html#torch.load) with `weights_only=True` is also secure to our knowledge even though it offers significantly larger surface of attack. Loading un-trusted checkpoint with `weights_only=False` MUST never be done.
**Be mindful of risky model formats**. Give preference to share and load weights with the appropriate format for your use case. [safetensors](https://huggingface.co/docs/safetensors/en/index) gives the most safety but is the most restricted in what it supports. [`torch.load`](https://pytorch.org/docs/stable/generated/torch.load.html#torch.load) has a significantly larger surface of attack but is more flexible in what it can serialize. See the documentation for more details.
Even for more secure serialization formats, unexpected inputs to the downstream system can cause diverse security threats (e.g. denial of service, out of bound reads/writes) and thus we recommend extensive validation of any untrusted inputs.
Important Note: The trustworthiness of a model is not binary. You must always determine the proper level of caution depending on the specific model and how it matches your use case and risk tolerance.

View File

@ -38,7 +38,7 @@ set_bool(AT_HIPSPARSELT_ENABLED CAFFE2_USE_HIPSPARSELT)
configure_file(Config.h.in "${CMAKE_CURRENT_SOURCE_DIR}/Config.h")
# TODO: Do not generate CUDAConfig.h for ROCm BUILDS
# At the moment, `jit_macors.h` include CUDAConfig.h for both CUDA and HIP builds
# At the moment, `jit_macros.h` include CUDAConfig.h for both CUDA and HIP builds
if(USE_CUDA OR USE_ROCM)
configure_file(cuda/CUDAConfig.h.in "${CMAKE_CURRENT_SOURCE_DIR}/cuda/CUDAConfig.h")
endif()

View File

@ -122,7 +122,7 @@ void FunctionalTensorWrapper::freeze_storage() const {
// | have their own storages, but backends like functorch |
// \/ are allowed to re-alias underneath the pass \/
// . - - - - - - - - - - - - - . . - - - - - - - - - - - - - - - .
// | underyling_storage | | underyling_storage |
// | underlying_storage | | underlying_storage |
// . - - - - - - - - - - - - - . . - - - - - - - - - - - - - - - .
//
// This constructor is only used by view ops.

View File

@ -1534,7 +1534,7 @@ void TensorIteratorBase::build(TensorIteratorConfig& config) {
// XLA and lazy tensors don't have storage, so they don't have an underlying data pointer.
// Nothing beyond this point is important for meta functions, so it's fine to exit early here.
// Extend the condition to MAIA tesnors as MAIA tensors also don't have storage.
// Extend the condition to MAIA tensors as MAIA tensors also don't have storage.
if (privateuse1_without_storage ||
common_device_.type() == DeviceType::XLA ||
common_device_.type() == DeviceType::IPU ||

View File

@ -94,11 +94,11 @@ struct PinnedReserveSegment {
struct TORCH_API HostStats {
// COUNT: total allocations (active)
Stat active_requests;
// SUM: bytes allocated/reserved by this memory alocator. (active)
// SUM: bytes allocated/reserved by this memory allocator. (active)
Stat active_bytes;
// COUNT: total allocations (active + free)
Stat allocations;
// SUM: bytes allocated/reserved by this memory alocator. This accounts
// SUM: bytes allocated/reserved by this memory allocator. This accounts
// for both free and in-use blocks.
Stat allocated_bytes;
@ -127,7 +127,7 @@ struct alignas(hardware_destructive_interference_size) HostStatsStaged {
// COUNT: total allocations (active + free)
// LOCK: access to this stat is protected by the allocator's blocks_mutex_
Stat allocations;
// SUM: bytes allocated/reserved by this memory alocator. This accounts
// SUM: bytes allocated/reserved by this memory allocator. This accounts
// for both free and in-use blocks.
Stat allocated_bytes;
// COUNT: number of allocations per bucket (active)
@ -455,7 +455,7 @@ struct CachingHostAllocatorImpl {
}
void resetAccumulatedStats() {
// Reseting accumulated memory stats requires concurrently holding both the
// Resetting accumulated memory stats requires concurrently holding both the
// free list mutexes and the blocks mutex. Previously, this was only done in
// empty_cache function.
for (size_t i = 0; i < free_list_.size(); ++i) {
@ -482,7 +482,7 @@ struct CachingHostAllocatorImpl {
}
void resetPeakStats() {
// Reseting peak memory stats requires concurrently holding both the
// Resetting peak memory stats requires concurrently holding both the
// free list mutexes and the blocks mutex. Previously, this was only done in
// empty_cache function.
for (size_t i = 0; i < free_list_.size(); ++i) {

View File

@ -148,7 +148,7 @@ struct TORCH_API ClassType : public NamedType {
void checkNotExist(const std::string& name, const std::string& what) const;
// Attributes are stored in a specific slot at runtime for effiency.
// Attributes are stored in a specific slot at runtime for efficiency.
// When emitting instructions we specify the slot so that attribute access is
// a constant lookup
std::optional<size_t> findAttributeSlot(const std::string& name) const {
@ -412,7 +412,7 @@ struct TORCH_API ClassType : public NamedType {
// Holds method attributes
std::weak_ptr<CompilationUnit> compilation_unit_;
// Holds all atrributes, attribute details are found on ClassAttribute
// Holds all attributes, attribute details are found on ClassAttribute
std::vector<ClassAttribute> attributes_;
// Construct mirroring attributes_, only around due to the fact that `containedTypes()` method returns an ArrayRef.
// Never fill this without using the appropriate provideNewClassAttribute method

View File

@ -537,7 +537,7 @@ int64_t Dispatcher::sequenceNumberForRunningRecordFunction(DispatchKey dispatchK
// Note: this records a sequence number for both Autograd keys, and for
// non-Autograd keys where the dispatchKeySet still contains an autograd key.
// This means that we might collect the same sequence nubmer two different
// This means that we might collect the same sequence number two different
// events if they all occurred above Autograd and still had the Autograd
// dispatch key in the dispatch key set.
// However, this usually doesn't happen: normally the first call will

View File

@ -585,7 +585,7 @@ class TORCH_API OperatorHandle {
// We need to store this iterator in order to make
// Dispatcher::cleanup() fast -- it runs a lot on program
// termination (and presuambly library unloading).
// termination (and presumably library unloading).
std::list<Dispatcher::OperatorDef>::iterator operatorIterator_;
};

View File

@ -365,7 +365,7 @@ std::pair<const AnnotatedKernel&, const char*> OperatorEntry::computeDispatchTab
// For autograd keys, we only use kernel from CompositeImplicitAutograd when there's no direct registration
// to its corresponding backend key or CompositeExplicitAutograd. See Note [CompositeExplicitAutograd and CompositeImplicitAutograd].
// For AutogradOther, we eagerly return ambiguousAutogradOtherKernel() if there's registration to any of
// its backends and ask backend extender to request a decicated Autograd key for the backend.
// its backends and ask backend extender to request a dedicated Autograd key for the backend.
// See Note [Ambiguity in AutogradOther kernel] for more details.
// A CompositeExplicitAutograd kernel prevents CompositeImplicitAutograd kernel being used for Autograd keys, but it doesn't
// cause confusion for AutogradOther. It's pretty straightforward to use Autograd (if available)

View File

@ -261,7 +261,7 @@ std::ostream& operator<<(std::ostream& out, const FunctionSchema& schema) {
//
// There are 2 cases
// 1. something like 'aten::items.str(Dict(str, t) self) -> ((str, t)[])'.
// without the extra parenthesis, the c++ schem parser can not parse it.
// without the extra parenthesis, the c++ scheme parser can not parse it.
// 2. something like '-> ((str, str))'. Need extra parenthesis so the return
// type is a single tuple rather than two strings.
// PR (https://github.com/pytorch/pytorch/pull/23204) has more context about

View File

@ -1176,7 +1176,7 @@ struct TORCH_API IValue final {
using HashIdentityIValueMap =
std::unordered_map<IValue, IValue, HashIdentityIValue, CompIdentityIValues>;
// Chechs if this and rhs has a subvalues in common.
// Checks if this and rhs has a subvalues in common.
// [t1,t2] and [t2, t3] returns true.
bool overlaps(const IValue& rhs) const;

View File

@ -1501,7 +1501,7 @@ struct C10_EXPORT ivalue::Object final : c10::intrusive_ptr_target {
// However, the CompilationUnit holds ownership of the type's graphs, so
// inserting a constant object into a Graph would create a reference cycle if
// that constant object held a shared_ptr to its CU. For these objects we
// instatiate them with non-owning references to its CU
// instantiate them with non-owning references to its CU
Object(WeakOrStrongTypePtr type, size_t numSlots) : type_(std::move(type)) {
slots_.resize(numSlots);
}

View File

@ -373,7 +373,7 @@ struct TORCH_API SymbolicShape {
// Unranked shape constructor.
SymbolicShape() : dims_(std::nullopt) {}
// Known rank but unknown dimentions.
// Known rank but unknown dimensions.
SymbolicShape(std::optional<size_t> rank) : dims_(std::nullopt) {
if(!rank) {
return;
@ -884,9 +884,9 @@ struct TORCH_API ListType
// global singleton
// Given an inner type T and an identifier,
// this function wil return the global singleton type pointer
// this function will return the global singleton type pointer
// the type List<T>.
// The extra "identifier" argument is needed beccause we have multiple container types
// The extra "identifier" argument is needed because we have multiple container types
// that all re-use this function (List<T>, array<T, N>, etc.)
static TypePtr get(const std::string& identifier, TypePtr inner);

View File

@ -21,7 +21,7 @@ namespace c10 {
namespace detail {
// The first argument of the schema might be of type DispatchKeySet, in which case we remove it.
// We do this because every argument in a function schema is expected to be convertable
// We do this because every argument in a function schema is expected to be convertible
// to an ivalue, but DispatchKeySet is not a type we want the jit to be aware of.
// See Note [Plumbing Keys Through The Dispatcher]
template<class KernelFunctor>

View File

@ -251,7 +251,7 @@ TEST(OperatorRegistrationTest, whenRegisteringCPUTensorType_thenCanOnlyCallUnbox
callOpUnboxedWithPrecomputedDispatchKeySet<void, Tensor>(*op, c10::DispatchKeySet(c10::DispatchKey::CPU), dummyTensor(c10::DispatchKey::CUDA));
EXPECT_TRUE(called_kernel_cpu);
// Ensure that disptach key from tensor is not used here.
// Ensure that dispatch key from tensor is not used here.
called_kernel_cpu = false;
expectThrows<c10::Error>([&] {
callOpUnboxedWithPrecomputedDispatchKeySet<void, Tensor>(*op, c10::DispatchKeySet(c10::DispatchKey::CUDA), dummyTensor(c10::DispatchKey::CPU));

View File

@ -172,7 +172,7 @@ VaryingShape<Stride> TensorType::computeStrideProps(
// The logic below follows what TensorIterator uses in its logic:
// 1. Fast_set_up is the short-cut to identify a. channels_last and
// b. contiguous format, which is what we have in the below logic.
// 2. In more generla cases, it does best effort to preserve permutatoin.
// 2. In more general cases, it does best effort to preserve permutatoin.
if (is_channels_last_strides_2d(sizes, strides) || is_channels_last_strides_3d(sizes, strides)) {
// case 1.a. short cut channels last
std::iota(stride_indices.rbegin() + 1, stride_indices.rend() - 1, 2);

View File

@ -77,21 +77,36 @@ CONVERT_TEMPLATE(double, int64_t)
CONVERT_TEMPLATE(double, float)
CONVERT_TEMPLATE(double, double)
#ifdef __ARM_FEATURE_FP16_VECTOR_ARITHMETIC
CONVERT_TEMPLATE(float16_t, uint8_t)
CONVERT_TEMPLATE(float16_t, int8_t)
CONVERT_TEMPLATE(float16_t, int16_t)
CONVERT_TEMPLATE(float16_t, int32_t)
CONVERT_TEMPLATE(float16_t, int64_t)
CONVERT_TEMPLATE(float16_t, float16_t)
CONVERT_TEMPLATE(float16_t, float)
CONVERT_TEMPLATE(float16_t, double)
CONVERT_TEMPLATE(uint8_t, float16_t)
CONVERT_TEMPLATE(int8_t, float16_t)
CONVERT_TEMPLATE(int16_t, float16_t)
CONVERT_TEMPLATE(int32_t, float16_t)
CONVERT_TEMPLATE(int64_t, float16_t)
CONVERT_TEMPLATE(float, float16_t)
CONVERT_TEMPLATE(double, float16_t)
#define CONVERT_FROM_FP16_TEMPLATE(to_type) \
template <> \
inline void convert(const at::Half* src, to_type* dst, int64_t n) { \
const float16_t* srcPtr = reinterpret_cast<const float16_t*>(src); \
return convertImpl<float16_t, to_type>(srcPtr, dst, n); \
}
#define CONVERT_TO_FP16_TEMPLATE(from_type) \
template <> \
inline void convert(const from_type* src, at::Half* dst, int64_t n) { \
float16_t* dstPtr = reinterpret_cast<float16_t*>(dst); \
return convertImpl<from_type, float16_t>(src, dstPtr, n); \
}
CONVERT_FROM_FP16_TEMPLATE(uint8_t)
CONVERT_FROM_FP16_TEMPLATE(int8_t)
CONVERT_FROM_FP16_TEMPLATE(int16_t)
CONVERT_FROM_FP16_TEMPLATE(int32_t)
CONVERT_FROM_FP16_TEMPLATE(int64_t)
CONVERT_FROM_FP16_TEMPLATE(float16_t)
CONVERT_FROM_FP16_TEMPLATE(float)
CONVERT_FROM_FP16_TEMPLATE(double)
CONVERT_TO_FP16_TEMPLATE(uint8_t)
CONVERT_TO_FP16_TEMPLATE(int8_t)
CONVERT_TO_FP16_TEMPLATE(int16_t)
CONVERT_TO_FP16_TEMPLATE(int32_t)
CONVERT_TO_FP16_TEMPLATE(int64_t)
CONVERT_TO_FP16_TEMPLATE(float)
CONVERT_TO_FP16_TEMPLATE(double)
#endif
#ifdef __ARM_FEATURE_BF16
CONVERT_TEMPLATE(bfloat16_t, uint8_t)

View File

@ -634,8 +634,7 @@ inline Vectorized<float> Vectorized<float>::erf() const {
// - exp(- x * x)
auto pow_2 = (*this) * (*this);
auto neg_pow_2 = pow_2 ^ neg_zero_vec;
auto tmp4 = neg_pow_2.map(
std::exp); // This can be swapped for a faster implementation of exp.
auto tmp4 = neg_pow_2.exp();
auto tmp5 = tmp4 ^ neg_zero_vec;
// erf(x) = sign(x) * (1 - r * t * exp(- x * x))
auto tmp6 = t * tmp5;

View File

@ -234,7 +234,7 @@ class Vectorized<c10::Half> : public Vectorized16<
vshlq_u16(vandq_u16(is_zero_vec, vdupq_n_u16(1)), shift);
return vaddvq_u16(bits_vec);
#else // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC
// use known working implmentation.
// use known working implementation.
__at_align__ value_type tmp[size()];
store(tmp);
int mask = 0;

View File

@ -1740,7 +1740,7 @@ Vectorized<int16_t> inline shift_256_16(
// Control masks for shuffle operation, treating 256 bits as an
// array of 16-bit elements, and considering pairs of neighboring
// elements. Specifially, a mask named "ctl_M_N" (M,N in [0,1], and
// elements. Specifically, a mask named "ctl_M_N" (M,N in [0,1], and
// M!=N) is set so that shuffle will move element with index M from
// input pair into element with index N in output pair, and element
// with index M in output pair will be set to all 0s.
@ -1875,7 +1875,7 @@ Vectorized<T> inline shift_256_8(
// Control masks for shuffle operation, treating 256 bits as an
// array of 8-bit elements, and considering quadruples of
// neighboring elements. Specifially, a mask named "ctl_M_N" (M,N
// neighboring elements. Specifically, a mask named "ctl_M_N" (M,N
// in [0,1,2,3], and M!=N) is set so that shuffle will move element
// with index M from input quadruple into element with index N in
// output quadruple, and other elements in output quadruple will be

View File

@ -143,7 +143,7 @@ class Vectorized<double> {
const Vectorized<double>& a,
const Vectorized<double>& b,
const Vectorized<double>& mask) {
// the mask used here returned by comparision of vec256
// the mask used here returned by comparison of vec256
return {
vec_sel(a._vec0, b._vec0, mask._vecb0),

View File

@ -142,7 +142,7 @@ class Vectorized<float> {
const Vectorized<float>& a,
const Vectorized<float>& b,
const Vectorized<float>& mask) {
// the mask used here returned by comparision of vec256
// the mask used here returned by comparison of vec256
// assuming this we can use the same mask directly with vec_sel
return {
vec_sel(a._vec0, b._vec0, mask._vecb0),

View File

@ -202,7 +202,7 @@ class Vectorized<int16_t> {
const Vectorized<int16_t>& a,
const Vectorized<int16_t>& b,
const Vectorized<int16_t>& mask) {
// the mask used here returned by comparision of vec256
// the mask used here returned by comparison of vec256
// assuming this we can use the same mask directly with vec_sel
// warning intel style mask will not work properly
return {

View File

@ -155,7 +155,7 @@ class Vectorized<int32_t> {
const Vectorized<int32_t>& a,
const Vectorized<int32_t>& b,
const Vectorized<int32_t>& mask) {
// the mask used here returned by comparision of vec256
// the mask used here returned by comparison of vec256
// assuming this we can use the same mask directly with vec_sel
// warning intel style mask will not work properly
return {

View File

@ -119,7 +119,7 @@ class Vectorized<int64_t> {
const Vectorized<int64_t>& a,
const Vectorized<int64_t>& b,
const Vectorized<int64_t>& mask) {
// the mask used here returned by comparision of vec256
// the mask used here returned by comparison of vec256
return {
vec_sel(a._vec0, b._vec0, mask._vecb0),

View File

@ -397,7 +397,7 @@ inline Vectorized<bool> operator&&(
const __m512i* other_ = reinterpret_cast<const __m512i*>(other.as_bytes());
__m512i out = _mm512_and_si512(*self_, *other_);
Vectorized<bool> ret;
// We do not have a constructer that takes __m512i, so we need to memcpy
// We do not have a constructor that takes __m512i, so we need to memcpy
std::memcpy(ret, &out, ret.size() * sizeof(bool));
return ret;
}

View File

@ -1852,7 +1852,7 @@ Vectorized<T> inline shift_512_8(
// Control masks for shuffle operation, treating 512 bits as an
// array of 8-bit elements, and considering pairs of neighboring
// elements. Specifially, a mask named "ctl_M_N" (M,N in [0,1], and
// elements. Specifically, a mask named "ctl_M_N" (M,N in [0,1], and
// M!=N) is set so that shuffle will move element with index M from
// input pair into element with index N in output pair, and element
// with index M in output pair will be set to all 0s.

View File

@ -634,7 +634,7 @@ struct Vectorized {
}
Vectorized<T> neg() const {
// NB: the trailing return type is needed because we need to coerce the
// return value back to T in the case of unary operator- incuring a
// return value back to T in the case of unary operator- incurring a
// promotion
return map([](T x) -> T { return -x; });
}

View File

@ -1958,7 +1958,7 @@ void scaled_gemm(
ScalarType result_dtype,
bool use_fast_accum,
const std::optional<Tensor>& alpha) {
// Note: see `cublasCommonArgs` for various non-intuitive manupulations
// Note: see `cublasCommonArgs` for various non-intuitive manipulations
// of input arguments to this function.
const auto computeType = CUBLAS_COMPUTE_32F;
const auto scaleType = CUDA_R_32F;

View File

@ -2,10 +2,10 @@
#include <ATen/cuda/ATenCUDAGeneral.h>
#include <ATen/cuda/CUDAContext.h>
#include <c10/core/impl/GPUTrace.h>
#include <c10/cuda/CUDAStream.h>
#include <c10/cuda/CUDAGuard.h>
#include <ATen/cuda/Exceptions.h>
#include <c10/core/impl/GPUTrace.h>
#include <c10/cuda/CUDAGuard.h>
#include <c10/cuda/CUDAStream.h>
#include <c10/util/Exception.h>
#include <cuda_runtime_api.h>
@ -246,4 +246,79 @@ private:
}
};
// EventPool - Thread-safe pool of CUDA events to avoid expensive cudaEventCreate
// calls. cudaEventCreate when concurrently invoked from multiple threads can be
// very expensive (especially on certain device/driver combinations).
using CUDAEventPtr =
std::unique_ptr<CUDAEvent, std::function<void(CUDAEvent*)>>;
class EventPool {
public:
EventPool() : pools_(at::cuda::device_count()) {}
CUDAEventPtr get(const DeviceIndex device) {
// If the device is invalid, return a default event and no pooling
if (device < 0 || device >= (DeviceIndex)pools_.size()) {
auto deleter = [](CUDAEvent* event) {
delete event;
};
return CUDAEventPtr(
std::make_unique<CUDAEvent>(cudaEventDisableTiming).release(), deleter);
}
auto& pool = pools_[device];
// Create a destructor that returns the event to the appropriate device pool
auto destructor = [&pool](CUDAEvent* event) noexcept {
if (event != nullptr) {
std::lock_guard<std::mutex> lock(pool.mutex_);
pool.event_pool_.emplace_back(event);
}
};
{
std::lock_guard<std::mutex> lock(pool.mutex_);
if (!pool.event_pool_.empty()) {
auto event = std::move(pool.event_pool_.back());
pool.event_pool_.pop_back();
return CUDAEventPtr(event.release(), destructor);
}
}
return CUDAEventPtr(
std::make_unique<CUDAEvent>(cudaEventDisableTiming).release(),
destructor);
}
void empty_cache() {
for (auto& pool : pools_) {
std::lock_guard<std::mutex> lock(pool.mutex_);
pool.event_pool_.clear();
}
}
void init_num_events(const size_t num_events) {
for (DeviceIndex device_idx = 0; device_idx < at::cuda::device_count(); ++device_idx) {
CUDAGuard device_guard(device_idx);
std::vector<CUDAEventPtr> temp_events;
temp_events.reserve(num_events);
for (size_t i = 0; i < num_events; ++i) {
auto event = get(device_idx);
// Record the event to ensure it's properly initialized
event->record();
temp_events.emplace_back(std::move(event));
}
// Events will be returned to pool when temp_events is destroyed
}
}
private:
struct alignas(64) PerDevicePool {
alignas(64) std::mutex mutex_;
std::vector<std::unique_ptr<CUDAEvent>> event_pool_;
};
std::vector<PerDevicePool> pools_;
};
} // namespace at::cuda

View File

@ -168,11 +168,9 @@ void CUDAGraph::instantiate() {
// https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__GRAPH.html#group__CUDART__GRAPH_1g1accfe1da0c605a577c22d9751a09597
// cudaGraphInstantiateWithFlags
// https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__GRAPH.html#group__CUDART__GRAPH_1ga2c652a24ba93e52b99a47bec0888233
#if !defined(USE_ROCM) || ROCM_VERSION >= 60200
int version = 0;
AT_CUDA_CHECK(cudaDriverGetVersion(&version));
if (version < 11040) {
#endif
// Trailing NULL, NULL, 0 arguments were recommended by Cuda driver people,
// who prefer not to report error message through these arguments moving forward
// (they prefer return value, or errors on api calls internal to the capture)
@ -183,13 +181,11 @@ void CUDAGraph::instantiate() {
#endif
//Since ROCm 6.2, we want to go down this path as hipGraphExecDestroy in the destructor will not immediately free the memory.
//It will wait for the next sync operation. cudaGraphInstantiateFlagAutoFreeOnLaunch will add async frees after graph launch.
#if !defined(USE_ROCM) || ROCM_VERSION >= 60200
} else {
AT_CUDA_CHECK(cudaGraphInstantiateWithFlags(&graph_exec_,
graph_,
cudaGraphInstantiateFlagAutoFreeOnLaunch));
}
#endif
has_graph_exec_ = true;
}
@ -311,7 +307,7 @@ CUDAGraph::~CUDAGraph() {
// There are recent HIP changes where hipGraphExecDestroy doesn't immediately free memory.
// They wait for next sync point in order to free the memory, this is to ensure that all
// hipGraphLaunch are finished before we release any memory. This feature was enabled in rocm6.2.
// We need to ensure all async opreations finish before deleting the object.
// We need to ensure all async operations finish before deleting the object.
#if (defined(USE_ROCM) && ROCM_VERSION >= 60200)
if (capture_dev_ != UNDEFINED_DEVICE) // check if capture_dev_ contains the real device id
{

View File

@ -137,7 +137,7 @@ struct CUDACachingHostAllocatorImpl
void free_block_slowpath(Block* block) {
auto start = std::chrono::steady_clock::now();
// Users may change the allocator config at will. torch unit tests do this.
// However, allocations using cudaHostRegister should use corresonding
// However, allocations using cudaHostRegister should use corresponding
// cudaHostUnregister and similarly for cudaHostAlloc / cudaFreeHost.
void* ptr = block->ptr_;
bool use_register = false;

View File

@ -4,7 +4,7 @@
#include <ATen/cuda/CUDAConfig.h>
// NOTE: These templates are intentionally not defined in this header,
// which aviods re-compiling them for each translation unit. If you get
// which avoids re-compiling them for each translation unit. If you get
// a link error, you need to add an explicit instantiation for your
// types in cub.cu

View File

@ -38,7 +38,7 @@ GemmTunableOp_float_NT,nt_25088_4096_64,1219,1.262
GemmTunableOp_float_NT,nt_4096_4096_64,1216,0.033
```
Note the "Validator" lines. If you change a library verison, or ROCm version, or PyTorch version, TunableOp will detect
Note the "Validator" lines. If you change a library version, or ROCm version, or PyTorch version, TunableOp will detect
this and reject the tunings file because the prior tunings are likely affected by other software changes.
The remaining lines are the tuned solutions for each TunableOp encountered during your execution. Each line consists of

View File

@ -235,7 +235,7 @@ class TunableOp {
// numeric check option is controlled by non-static env var, so check it once per tuned operator
bool do_numerics_check = ctx->IsNumericsCheckEnabled();
// calcaulte a reference answer for numerical check
// calculate a reference answer for numerical check
if (do_numerics_check) {
reference_params = params->DeepCopy(false);
TORCH_CHECK(ops_[ResultEntry::Default()]->Call(reference_params) == OK);

View File

@ -12,7 +12,7 @@ namespace at {
// AcceleratorHooksInterface is a shared interface provided by all
// accelerators to allow generic code.
// This inferface is hook-based as it corresponds to all the functions
// This interface is hook-based as it corresponds to all the functions
// that are going to be called in a generic way from the CPU code.
struct TORCH_API AcceleratorHooksInterface {

View File

@ -38,7 +38,7 @@ struct TORCH_API PrivateUse1HooksInterface : AcceleratorHooksInterface {
Generator getNewGenerator(
[[maybe_unused]] DeviceIndex device_index = -1) const override {
// TODO(FFFrog): Perserved for BC and will be removed in the future.
// TODO(FFFrog): Preserved for BC and will be removed in the future.
if (at::GetGeneratorPrivate().has_value())
return at::GetGeneratorForPrivateuse1(device_index);

View File

@ -283,7 +283,7 @@ inline void boxed_existing_bdim_all_batch_rule(
// Use when all tensors arguments accept one (normal) batch dim.
// This batching rule expands the batch dim on all Tensors, reshapes it into
// dim 0, calls the op, and then reshapes the batch dim out of dim 0.
// This is not the most efficient thing; if there are alternatives, plese try
// This is not the most efficient thing; if there are alternatives, please try
// to use them. Use this only as a last resort.
#define EXISTING_BDIM_ALL_BOXED(op) \
m.impl(#op, torch::CppFunction::makeFromBoxedFunction<boxed_existing_bdim_all_batch_rule>());

View File

@ -384,7 +384,7 @@ fourOutputs solve_ex_batch_rule(
// NOTE [ solve_ex Batch Rule Contiguity ]
// A determines whether or not linalg_solve takes an optimized path. We need the check on A_ to match the one run on
// A as BatchedTensor since it might have been saved by autograd (specifically by the jvp) and the autograd behvaior
// A as BatchedTensor since it might have been saved by autograd (specifically by the jvp) and the autograd behavior
// differs based on whether or not the optimized path was taken
const auto batched_A_was_contiguous = A_bdim.has_value() ? at::select(A, *A_bdim, 0).is_contiguous() : A.is_contiguous();
if (batched_A_was_contiguous && !A.is_complex()) {

View File

@ -282,7 +282,7 @@ static std::tuple<Tensor, std::optional<int64_t>> _softmax_backward_batch_rule(
dim = getPhysicalDim(output_, /*has_batch_dim*/true, dim);
// Not sure why output_ needs to be marked as .contiguous(). Someting must
// Not sure why output_ needs to be marked as .contiguous(). Something must
// have changed in PyTorch (and output of softmax is probably always contiguous)
return std::make_tuple(at::_softmax_backward_data(grad_output_, output_.contiguous(), dim, input_dtype), 0);
}

View File

@ -224,7 +224,7 @@ static Tensor safeStack(TensorList tensors) {
// is possible for the backward function to return an undefined grad for some
// grad_input for each example. In that case, we return an undefined grad.
//
// It is theoretically posssible for *some* of the examples to produce an
// It is theoretically possible for *some* of the examples to produce an
// undefined grad (a kernel could peek at the gradient values and return an
// undefined tensor if it determines the gradient is full of zeros). We
// could handle this by treating the undefined grad as a zero-filled tensor

View File

@ -113,7 +113,7 @@ SymIntArrayRef BatchedTensorImpl::sym_sizes_custom() const {
return sym_sizes_default();
}
// The following are publically exposed as methods of Tensor
// The following are publicly exposed as methods of Tensor
IntArrayRef BatchedTensorImpl::strides_custom() const {
return strides_default();

View File

@ -37,7 +37,7 @@ namespace at::functorch {
// how to perform the transform.
//
// TODO: we can excise DynamicLayer in favor of Interpreter,
// But I am going to leave it for now as a compatiblity shim to avoid
// But I am going to leave it for now as a compatibility shim to avoid
// needing to refactor a lot of callsites...
struct TORCH_API DynamicLayer {
explicit DynamicLayer(

View File

@ -88,7 +88,7 @@ std::ostream& operator<<(std::ostream& os, const TransformType& t);
// >>> VmapInterpreterPtr(&interpreter).batchSize()
//
// Finally, Interpreter::process switches on the type of the interpreter
// and calls one of {Transform}Intepreter::processImpl under the hood.
// and calls one of {Transform}Interpreter::processImpl under the hood.
// Same for Interpreter::sendToNextInterpreter :)
struct VmapInterpreterMeta {

View File

@ -733,7 +733,7 @@ TORCH_LIBRARY_IMPL(_, FuncTorchBatched, m) {
}
TORCH_LIBRARY_IMPL(aten, FuncTorchBatched, m) {
// still legacy b/c teturns multiple tensors
// still legacy b/c returns multiple tensors
m.impl("split.Tensor", split_batching_rule);
m.impl("split_with_sizes", split_with_sizes_batching_rule);
m.impl("split_with_sizes_copy", split_with_sizes_copy_batching_rule);

View File

@ -158,7 +158,7 @@ void MPSStream::fill(id<MTLBuffer> buffer, uint8_t value, size_t length, size_t
endKernelCoalescing();
id<MTLBlitCommandEncoder> blitEncoder = [commandBuffer() blitCommandEncoder];
// For some reason fillBufferfor stopped working for lengh > 4Gb on MacOS 26
// For some reason fillBufferfor stopped working for length > 4Gb on MacOS 26
// See https://github.com/pytorch/pytorch/issues/163962
// Workaround by batching copy commands into 4Gb chunks
constexpr size_t max_copy_size = 0x100000000; // 4GB

View File

@ -148,7 +148,7 @@ inline void checkInputsSolver(const Tensor& A,
inline bool is_row_or_column_contiguous(const Tensor& t) {
// This could be made more general, similar to how it's checked in matmul, which would allow to
// ellide the copy with strides such as (6, 12, 1, 3) or (3, 1, 9), but this is quite tricky.
// elide the copy with strides such as (6, 12, 1, 3) or (3, 1, 9), but this is quite tricky.
// We choose to be conservative for simplicity
return t.is_contiguous() || t.transpose(-2, -1).is_contiguous();
}

View File

@ -21,7 +21,7 @@ enum class fft_norm_mode {
// NOTE [ Fourier Transform Conjugate Symmetry ]
//
// Real-to-complex Fourier transform satisfies the conjugate symmetry. That is,
// assuming X is the transformed K-dimensionsal signal, we have
// assuming X is the transformed K-dimensional signal, we have
//
// X[i_1, ..., i_K] = X[j_i, ..., j_K]*,
//

View File

@ -128,7 +128,7 @@ at::Tensor PackedLinearWeight::apply_impl(
auto* input_tr_ptr =
reinterpret_cast<uint8_t*>(input_tr.data_ptr<c10::quint8>());
// TODO: Activation transpose before and after the kernel can be removed if we
// keep activation tensor always tranposed.
// keep activation tensor always transposed.
fbgemm::transpose_simd<uint8_t>(
batch_size, K, input_ptr, K, input_tr_ptr, batch_size);

View File

@ -520,7 +520,7 @@ cpu_adaptive_avg_pool3d_channels_last(
scalar_t* out = output_data + i * channels;
int64_t size = channels;
// Note: For oridinary usage scenario, each out lane should
// Note: For ordinary usage scenario, each out lane should
// fit in L1 cache; otherwise consider block dim C.
// Pass I: zero the out lane
int64_t d1 = 0;

View File

@ -34,7 +34,7 @@ struct Dist {
// finish : This tells what to do with the aggregated value to compute
// the norm. Generally this is the result of val ^ (1 / p).
// backward : This is the gradient for that norm. Arguments are pretty
// self explanitory.
// self explanatory.
//
// There are a few cases where these aren't used. The 0 norm has no backward,
// because it's always 0, so that's shortcircuited earlier. There's a special

View File

@ -30,7 +30,7 @@ vec::Vectorized<scalar_t> is_nan_vec(vec::Vectorized<scalar_t> vec) {
return vec.isnan();
}
// TODO: use is_integeral/is_same to check the scalar_t and simplify the implementation
// TODO: use is_integral/is_same to check the scalar_t and simplify the implementation
// currently it does not work
template <>
vec::Vectorized<unsigned char> is_nan_vec<unsigned char>(vec::Vectorized<unsigned char> vec) {

View File

@ -74,7 +74,7 @@ it to sum up the entire array into a single value.
`ReduceOpsKernel.cpp` uses the `CPU_CAPABILITY_*` macros to "know" under which
compiler flags it is currently compiled. This allows the programmer to write
generic code, which will be compiled under multipled compilation settings.
generic code, which will be compiled under multiplied compilation settings.
`../ReduceOps.cpp` now includes the header `ReduceOpsKernel.h`, which contains
a generic definition of `sumImplAll`. This function allows the user to reduce

View File

@ -889,7 +889,7 @@ void ImagingResampleHorizontalConvolution8u(
_mm_loadu_si128((__m128i *) (lineIn_min + stride * i))),
_mm_loadu_si128((__m128i *) (lineIn_min + stride * (i + 4))), 1);
// Extract lower part of each lane, cast to epi16 and reoder RGBARGBA -> RRGGBBAA
// Extract lower part of each lane, cast to epi16 and reorder RGBARGBA -> RRGGBBAA
// RGBA: pix1 = [
// r0 0 r1 0 g0 0 g1 0 b0 0 b1 0 a0 0 a1 0
// r4 0 r5 0 g4 0 g5 0 b4 0 b5 0 a4 0 a5 0

View File

@ -240,7 +240,7 @@ _PS256_CONST(coscof_p2, 4.166664568298827E-002);
_PS256_CONST(cephes_FOPI, 1.27323954473516); // 4 / M_PI
/* evaluation of 8 sines at onces using AVX intrinsics
/* evaluation of 8 sines at once using AVX intrinsics
The code is the exact rewriting of the cephes sinf function.
Precision is excellent as long as x < 8192 (I did not bother to

View File

@ -311,7 +311,7 @@ void GroupNormKernelImplChannelsLastInternal(
const bool gamma_null = (gamma_data == nullptr);
const bool beta_null = beta_data == nullptr;
// NB: About algorithm choosen:
// NB: About algorithm chosen:
//
// On channels last, GroupNorm has a input shape of {N, H, W, GD},
// Mean and rstd are collected per each n and g, which involves reduction

View File

@ -930,7 +930,7 @@ void ref_dyn_quant_matmul_4bit_channelwise_kernel(
}
};
// Dynamically Quantize the float32 input to 8 bit assymetric
// Dynamically Quantize the float32 input to 8 bit asymmetric
input_quant_pack_8bit_channelwise(m, k, lhs_f32, (int8_t*)lhs_qa8dx);
const size_t lhs_stride =
@ -1163,7 +1163,7 @@ void dyn_quant_matmul_4bit_kernel(
const int64_t weight_packed_size =
kleidiai::kai_pack_rhs_int4_size(N, K, block_size);
if (weight_packed_size == packed_weights.numel()) {
// KleidiAI interface intenally handles the Channelwise and groupwise
// KleidiAI interface internally handles the Channelwise and groupwise
// distinction
kleidiai::kai_quant_pack_lhs_int4_mm(
output, inp, packed_weights, M, N, K, block_size);

File diff suppressed because it is too large Load Diff

View File

@ -1,11 +1,11 @@
#define TORCH_ASSERT_ONLY_METHOD_OPERATORS
#include <ATen/core/Tensor.h>
#include <ATen/Context.h>
#include <ATen/Dispatch.h>
#include <ATen/Dispatch_v2.h>
#include <ATen/cuda/CachingHostAllocator.h>
#include <ATen/core/Tensor.h>
#include <ATen/cuda/CUDAContext.h>
#include <ATen/cuda/CUDAEvent.h>
#include <ATen/cuda/CachingHostAllocator.h>
#include <ATen/cuda/PeerToPeerAccess.h>
#include <ATen/native/Copy.h>
#include <ATen/native/TensorIterator.h>
@ -27,6 +27,24 @@
namespace at::native {
namespace {
// Initial pool size for CUDA events per device.
constexpr size_t kInitialEventPoolSize = 8;
at::cuda::CUDAEventPtr getEventFromPool(const at::DeviceIndex device_idx) {
static auto* event_pool = []() {
auto* pool = new at::cuda::EventPool();
// Pre-populate the pool with events to avoid stalls in creating events
pool->init_num_events(kInitialEventPoolSize);
return pool;
}();
return event_pool->get(device_idx);
}
} // namespace
void neg_kernel_cuda(TensorIteratorBase &iter);
void conj_kernel_cuda(TensorIteratorBase &iter);
@ -263,12 +281,14 @@ void copy_device_to_device(TensorIterator& iter,
// write-after-read dependencies on the destination side are handled, so
// that no one is operating on the dst memory when we perform the copy.
// src waits on dst barrier (src already waits on src)
CUDAEvent dst_ready;
// Use event pool for better performance instead of creating new events
auto dst_ready = getEventFromPool(dst_device.index());
device_guard.set_device(dst_device);
dst_ready.record(getCurrentCUDAStream(dst_device.index()));
dst_ready->record(getCurrentCUDAStream(dst_device.index()));
device_guard.set_device(src_device);
dst_ready.block(copy_stream);
dst_ready->block(copy_stream);
}
if (memcpy_eligible) {
@ -307,11 +327,11 @@ void copy_device_to_device(TensorIterator& iter,
// operate on dst's copy until the copy is complete.
// Still on src_device, record stream event
CUDAEvent src_ready;
src_ready.record(copy_stream);
auto src_ready = getEventFromPool(src_device.index());
src_ready->record(copy_stream);
device_guard.set_device(dst_device);
src_ready.block(getCurrentCUDAStream(dst_device.index()));
src_ready->block(getCurrentCUDAStream(dst_device.index()));
}
AT_CUDA_CHECK(cudaGetLastError());

View File

@ -494,7 +494,7 @@ void uniform_kernel(TensorIteratorBase& iter, double from_, double to_, RNG gen)
auto value = static_cast<scalar_t>(rand * range + from);
// reverse the bounds of curand4 from (0, 1] to [0, 1)
// Note that this method is from legacy THCTensorRandom and is likely to give
// you more 0-s, since, the probability of gettings 1-s is higher than 0-s and
// you more 0-s, since, the probability of getting 1-s is higher than 0-s and
// by reversing the bounds, we are flipping the probabilities of 1-s and 0-s.
// BEFORE TOUCHING THIS CODE READ: https://github.com/pytorch/pytorch/issues/16706
auto reverse_bound_value = value == to ? from : value;

View File

@ -6,7 +6,7 @@
#endif
// ROCm 6.3 is planned to have these functions, but until then here they are.
#if defined(USE_ROCM) && ROCM_VERSION >= 60201
#if defined(USE_ROCM)
#include <device_functions.h>
#include <hip/hip_fp16.h>
#include <hip/hip_bf16.h>
@ -115,9 +115,7 @@ __device__ __forceinline__ void fastSpecializedAtomicAdd(
index_t index,
const index_t numel,
scalar_t value) {
#if ( \
(defined(USE_ROCM) && ROCM_VERSION < 60201) || \
(defined(__CUDA_ARCH__) && (__CUDA_ARCH__ < 700)))
#if (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ < 700))
gpuAtomicAddNoReturn(
reinterpret_cast<at::Half*>(tensor) + index,
static_cast<at::Half>(value));
@ -160,9 +158,7 @@ __device__ __forceinline__ void fastSpecializedAtomicAdd(
index_t index,
const index_t numel,
scalar_t value) {
#if ( \
(defined(USE_ROCM) && ROCM_VERSION < 60201) || \
(defined(__CUDA_ARCH__) && (__CUDA_ARCH__ < 800)))
#if (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ < 800))
gpuAtomicAddNoReturn(
reinterpret_cast<at::BFloat16*>(tensor) + index,
static_cast<at::BFloat16>(value));

View File

@ -154,7 +154,7 @@ REGISTER_CUDA_DISPATCH(lstsq_stub, &lazy_lstsq_kernel)
// Old style dispatches
// torch_cuda_linalg dynamic library should have a global constructor
// that calls regiserLinaglDispatch so in order ot lazy bind
// that calls registerLinalgDispatch so in order ot lazy bind
// old style dispatch all one have to do is to load library and call disp.func_name
// Protect from infinite recursion by initializing dispatch to self and checking
// that values are different after linalg library were loaded

View File

@ -311,7 +311,7 @@ __global__ void batch_norm_collect_statistics_kernel(
stat_accscalar_t v_[UNRL];
for (int x = threadIdx.x; x < input.size(2); x += blockDim.x*UNRL) {
for (int u = 0; u < UNRL; u++)
v_[u] = input[batch][plane][min(x+u*blockDim.x, input.size(2)-1)];
v_[u] = input[batch][plane][std::min(x+u*blockDim.x, input.size(2)-1)];
for (int u = 0; u < UNRL; u++) {
if (x+u*blockDim.x < input.size(2)) {
stat_accscalar_t d1 = v_[u] - avg;

File diff suppressed because it is too large Load Diff

View File

@ -0,0 +1,171 @@
#pragma once
#include <ATen/core/Tensor.h>
namespace at::native {
using at::blas::ScalingType;
using at::blas::SwizzleType;
namespace {
// TODO: https://github.com/pytorch/pytorch/pull/59380#pullrequestreview-725310492
c10::MaybeOwned<Tensor> inline resolve_conj_if_indicated(const Tensor& tensor, bool resolve_conj) {
if (resolve_conj && tensor.is_conj()) {
return c10::MaybeOwned<Tensor>::owned(tensor.resolve_conj());
} else {
return c10::MaybeOwned<Tensor>::borrowed(tensor);
}
}
c10::MaybeOwned<Tensor> inline prepare_matrix_for_cublas(const Tensor& tensor, bool& transpose_tensor, bool transpose_result) {
if (tensor.is_non_overlapping_and_dense()) { // common case
transpose_tensor = tensor.is_contiguous();
return resolve_conj_if_indicated(tensor, transpose_result ? transpose_tensor : !transpose_tensor);
}
IntArrayRef tensor_strides = tensor.strides();
IntArrayRef tensor_sizes = tensor.sizes();
if ((tensor_strides[0] == 1) && (tensor_strides[1] >= std::max<int64_t>(1, tensor_sizes[0]))) {
transpose_tensor = false;
return resolve_conj_if_indicated(tensor, !transpose_result);
} else if ((tensor_strides[1] == 1) && (tensor_strides[0] >= std::max<int64_t>(1, tensor_sizes[1]))) {
transpose_tensor = true;
return resolve_conj_if_indicated(tensor, transpose_result);
} else {
transpose_tensor = true;
return c10::MaybeOwned<Tensor>::owned(tensor.clone(at::MemoryFormat::Contiguous));
}
}
c10::MaybeOwned<Tensor> inline prepare_matrix_for_cublas(const Tensor& tensor, bool& transpose_tensor) {
if (tensor.is_non_overlapping_and_dense()) { // common case
transpose_tensor = tensor.is_contiguous();
return resolve_conj_if_indicated(tensor, true);
}
IntArrayRef tensor_strides = tensor.strides();
IntArrayRef tensor_sizes = tensor.sizes();
if ((tensor_strides[0] == 1) && (tensor_strides[1] >= std::max<int64_t>(1, tensor_sizes[0]))) {
transpose_tensor = false;
return resolve_conj_if_indicated(tensor, true);
} else if ((tensor_strides[1] == 1) && (tensor_strides[0] >= std::max<int64_t>(1, tensor_sizes[1]))) {
transpose_tensor = true;
return resolve_conj_if_indicated(tensor, true);
} else {
transpose_tensor = true;
return c10::MaybeOwned<Tensor>::owned(tensor.clone(at::MemoryFormat::Contiguous));
}
}
} // namespace
/**
* @brief Prepares matrices for CUBLAS operation
*
* This constructor prepares tensors for CUBLAS
* The main difference is that PyTorch uses row-major as the default and
* CUBLAS expects column-major.
*
* @details
* To enable row-major output while using CUBLAS,
* we use the mathematical identity that (A × B)^T = B^T × A^T.
*
* Transpose in this context refers to Cublas's(Fortran) definition of transpose (row-major)
* T = row-major, N = col-major
*
* Example:
* For matrices A (M×K)(row-major) and B (K×N)(row-major):
* - Standard multiplication: A × B = (M×K) × (K×N) = M×N result (row-major)
* - Using our transpose trick: (B^T × A^T) = (N×K)(T) × (K×M)(T) = N×M(N)
* - However, since the output form cublas is column-major this is
* - equivalent to an output of size MxN row-major as expected
*
* The transpose flags are derived from the layouts of the passed in tensors
*
* If the operands are in packed float4 format, `k`, `lda` and `ldb` are adjusted
* to their unpacked values to match what cuBLAS expects.
*
* @param mat1 First input matrix
* @param mat2 Second input matrix
* @param c Output matrix (result)
* @param scale_a Optional scaling factor for first matrix
* @param scale_b Optional scaling factor for second matrix
* @param scale_result Optional scaling factor for result
*/
struct cublasCommonArgs {
cublasCommonArgs(
const Tensor& mat1,
const Tensor& mat2,
Tensor& c,
const std::optional<Tensor>& scale_a = std::nullopt,
const std::optional<Tensor>& scale_b = std::nullopt,
const std::optional<Tensor>& scale_result = std::nullopt,
const std::optional<ScalingType>& scaling_choice_a = std::nullopt,
const std::optional<ScalingType>& scaling_choice_b = std::nullopt) {
bool transpose_result = false, transpose_a = false, transpose_b = false;
result = prepare_matrix_for_cublas(c, transpose_result);
mata = prepare_matrix_for_cublas(transpose_result ? mat2 : mat1, transpose_a, transpose_result);
matb = prepare_matrix_for_cublas(transpose_result ? mat1 : mat2, transpose_b, transpose_result);
// Handle scale tensors if provided
if (scale_a && scale_b) {
// By default since we return in row-major we run the gemm
// as B.T @ A.T, check transpose_result to determine if we flip the scales
scale_mata_ptr = transpose_result ? scale_b->data_ptr() : scale_a->data_ptr();
scale_mata_dtype = transpose_result ? scale_b->scalar_type() : scale_a->scalar_type();
scaling_mata_type = transpose_result ? scaling_choice_b : scaling_choice_a;
scale_matb_ptr = transpose_result ? scale_a->data_ptr() : scale_b->data_ptr();
scale_matb_dtype = transpose_result ? scale_a->scalar_type() : scale_b->scalar_type();
scaling_matb_type = transpose_result ? scaling_choice_a : scaling_choice_b;
}
if (scale_result) {
scale_result_ptr = scale_result->data_ptr();
scale_result_dtype = scale_result->scalar_type();
}
// Update transpose flags
if (transpose_result) {
transpose_a = !transpose_a;
transpose_b = !transpose_b;
}
auto sizes_a = mata->sizes();
auto sizes_b = matb->sizes();
m = sizes_a[transpose_result ? 1 : 0];
k = sizes_a[transpose_result ? 0 : 1];
n = sizes_b[transpose_result ? 0 : 1];
lda = mata->stride((transpose_a == transpose_result) ? 1 : 0);
ldb = matb->stride((transpose_b == transpose_result) ? 1 : 0);
result_ld = result->stride(transpose_result ? 0 : 1);
transa = transpose_a ? mata->is_conj() ? 'c' : 't' : 'n';
transb = transpose_b ? matb->is_conj() ? 'c' : 't' : 'n';
// cuBLAS expects unpacked values of `k`, `lda` and `ldb`, adjust for 4x2 packing
// if the gemm operands are in packed float4
if (mat1.dtype() == at::kFloat4_e2m1fn_x2 && mat2.dtype() == at::kFloat4_e2m1fn_x2) {
k = k * 2;
lda = lda * 2;
ldb = ldb * 2;
}
}
// Matrix members
char transa, transb;
int64_t m, n, k;
int64_t lda, ldb, result_ld;
c10::MaybeOwned<Tensor> mata, matb, result;
// Scale members
void* scale_mata_ptr = nullptr;
void* scale_matb_ptr = nullptr;
void* scale_result_ptr = nullptr;
std::optional<c10::ScalarType> scale_mata_dtype;
std::optional<ScalingType> scaling_mata_type;
std::optional<c10::ScalarType> scale_matb_dtype;
std::optional<ScalingType> scaling_matb_type;
std::optional<c10::ScalarType> scale_result_dtype;
};
} // namespace at::native

View File

@ -1,4 +1,4 @@
#if (defined(USE_ROCM) && ROCM_VERSION >= 50700) || ((defined(CUDA_VERSION) && CUDA_VERSION >= 12000) && (!defined(__CUDA_ARCH__) || (__CUDA_ARCH__ >= 800)))
#if defined(USE_ROCM) || ((defined(CUDA_VERSION) && CUDA_VERSION >= 12000) && (!defined(__CUDA_ARCH__) || (__CUDA_ARCH__ >= 800)))
#include <cuda_bf16.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
@ -133,7 +133,7 @@ inline __host__ __device__ uint32_t getAlignmentRoundUp(const void* p) {
#define CDNA2_OR_LATER 0
#endif
#if (defined(USE_ROCM) && ROCM_VERSION >= 50700) || ((defined(CUDA_VERSION) && CUDA_VERSION >= 12000) && (!defined(__CUDA_ARCH__) || (__CUDA_ARCH__ >= 800)))
#if defined(USE_ROCM) || ((defined(CUDA_VERSION) && CUDA_VERSION >= 12000) && (!defined(__CUDA_ARCH__) || (__CUDA_ARCH__ >= 800)))
#if defined(USE_ROCM)
// TODO: Support RDNA
@ -1161,7 +1161,7 @@ at::Tensor _weight_int4pack_mm_cuda(
auto C_final = at::empty(
{m, n}, at::TensorOptions().dtype(at::kBFloat16).device(A.device()));
#if (defined(USE_ROCM) && ROCM_VERSION >= 50700) || ((defined(CUDA_VERSION) && CUDA_VERSION >= 12000) && (!defined(__CUDA_ARCH__) || (__CUDA_ARCH__ >= 800)))
#if defined(USE_ROCM) || ((defined(CUDA_VERSION) && CUDA_VERSION >= 12000) && (!defined(__CUDA_ARCH__) || (__CUDA_ARCH__ >= 800)))
auto stream = at::cuda::getCurrentCUDAStream();
#define RUN_GEMM(WARPS, K_TILES_PER_WARP, Q_GROUP_SIZE, REDUCE_TYPE) \
do { \
@ -1327,7 +1327,7 @@ at::Tensor _convert_weight_to_int4pack_cuda(
{nTilesTensor, kSuperTiles, 32, innerKTiles / 2},
at::TensorOptions().dtype(at::kInt).device(in.device()));
#if (defined(USE_ROCM) && ROCM_VERSION >= 50700) || ((defined(CUDA_VERSION) && CUDA_VERSION >= 12000) && (!defined(__CUDA_ARCH__) || (__CUDA_ARCH__ >= 800)))
#if defined(USE_ROCM) || ((defined(CUDA_VERSION) && CUDA_VERSION >= 12000) && (!defined(__CUDA_ARCH__) || (__CUDA_ARCH__ >= 800)))
auto stream = at::cuda::getCurrentCUDAStream();
dim3 grid(kSuperTiles, nTiles);

View File

@ -1532,7 +1532,7 @@ NvrtcFunction jit_pwise_function(
std::string file_path;
if (cache_dir.has_value()) {
// Attemps to read from the cache.
// Attempts to read from the cache.
// Cubin name is <kernel name>_arch<major>.<minor>_nvrtc<major>.<minor>_<ptx or sass>_<program length>_<string hash>
// Note that the SHA1 hash used in the file name is NOT the SHA1 hash of the file's contents,
// because we hash on the CUDA code, but we save the compiled ptx or sass

View File

@ -1346,7 +1346,7 @@ void cholesky_helper_magma(const Tensor& input, bool upper, const Tensor& info)
});
if (input.dim() > 2) {
// if upper=true we need to tranpose and conjugate the result tensor
// if upper=true we need to transpose and conjugate the result tensor
// because the cholesky decomposition is stored in the lower triangular part
if (upper) {
input.copy_(result.mH());
@ -1857,7 +1857,7 @@ void geqrf_kernel(const Tensor& input, const Tensor& tau) {
auto preferred_backend = at::globalContext().linalgPreferredBackend();
switch (preferred_backend) {
// TODO Investigate whether the following magma bug is still occuring.
// TODO Investigate whether the following magma bug is still occurring.
// It may be the case that geqrf followed by orgqr is wrong for the magma backend
// geqrf_magma currently uses geqrf2_gpu
//

View File

@ -82,7 +82,7 @@ void lu_factor_looped_cusolver(const Tensor& self, const Tensor& pivots, const T
#if defined(BUILD_LAZY_CUDA_LINALG)
namespace cuda { namespace detail {
// This is only used for an old-style dispatches
// Please do not add any new entires to it
// Please do not add any new entries to it
struct LinalgDispatch {
Tensor (*cholesky_solve_helper)(const Tensor& self, const Tensor& A, bool upper);
};

View File

@ -147,7 +147,7 @@ static void check_shape_forward(const Tensor& input,
// blocked format will propagate between layers. Input, output will be in blocked format.
//
// For inference case, weight can be prepacked into blocked format by
// (so as to save weight reoder overhead):
// (so as to save weight reorder overhead):
// model = torch.utils.mkldnn.to_mkldnn(model)
//
// For training case, grad_output can be CPU tensor or MKLDNN tensor,
@ -723,7 +723,7 @@ Tensor _mkldnn_convolution_transpose(
ideep::tensor w = itensor_from_tensor(weight, /*from_const_data_ptr*/true);
if (!weight.is_mkldnn()) {
// mkldnn transposed convolution has weight in logical order of OIHW or OIDHW,
// while PyTorch has IOHW or IODHW, `._tranpose()` switches strides (no memory copy).
// while PyTorch has IOHW or IODHW, `._transpose()` switches strides (no memory copy).
w.transpose_(0, 1);
}

View File

@ -540,7 +540,7 @@ static void _mkldnn_matmul_i8i8i32_with_primitive(
args.insert({DNNL_ARG_WEIGHTS, expected_weight});
args.insert({DNNL_ARG_DST, dst});
args.insert({DNNL_ARG_SCRATCHPAD, scratchpad});
// Create primitve and execute
// Create primitive and execute
auto primitive = dnnl::matmul(prim_desc);
primitive.execute(ideep::stream::default_stream(), args);
}

View File

@ -439,7 +439,7 @@ std::tuple<Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor> mkldnn_rnn_la
// I. Memory Formats
// a. mkldnn will use plain formats for input, hx/cx, output, hy/cy
// and possibly use blocked formats for weights depending shape info.
// b. All mkldnn memorys are created (in plain format) as views on ATen tensor,
// b. All mkldnn memories are created (in plain format) as views on ATen tensor,
// the weight reorder(if any) is handed automatically inside ideep (mkldnn bridge)
//
// II. MKLDNN Primitive Mapping

View File

@ -39,7 +39,7 @@ void check_mkldnn_binary_fusion_inputs(
inline std::vector<int64_t> padding_r(
IntArrayRef padding, IntArrayRef output_padding)
{
// ConvTranpose padding adjustment
// ConvTranspose padding adjustment
//
// PyTorch uses padding/output_padding:
// osize = (isize - 1) * stride - 2 * padding + dilation * (kernel_size - 1) + output_padding + 1

View File

@ -75,7 +75,7 @@ bool can_use_overrideable_attention(sdp::sdp_params const& params, bool debug) {
}
bool can_use_flash_attention(sdp::sdp_params const& params, bool debug) {
// Currently, XPU fallbacks flash attention to overrideable
// Currently, XPU fallbacks flash attention to overridable
return can_use_overrideable_attention(params, debug);
}
@ -115,7 +115,7 @@ sdp::SDPBackend select_sdp_backend_xpu(sdp::sdp_params const& kernel_params) {
// 1. Flash Attention
// 2. Math fallback
auto& ctx = at::globalContext();
// use overrideable linked to onednn as overrideable implementation
// use overridable linked to onednn as overridable implementation
if (!ctx.userEnabledMathSDP() && !ctx.userEnabledOverrideableSDP() &&
!ctx.userEnabledFlashSDP()) {
return sdp::SDPBackend::error;
@ -165,7 +165,7 @@ sdp::SDPBackend select_sdp_backend_xpu(sdp::sdp_params const& kernel_params) {
}
}
// If we have gotten to this point then two things have happened:
// 1. can_use_overrideable_attention did not satisfy the constraints to be ran
// 1. can_use_overridable_attention did not satisfy the constraints to be ran
// 2. The user has explicitly disabled the math kernel
// We then re-run the kernel checks with debug enabled to print out the
// reason why the kernel was not selected

View File

@ -215,7 +215,7 @@ partition create_sdpa_graph_partition(
// For optional additive mask
std::optional<op> mask_add;
// For optional implicite causal mask
// For optional implicit causal mask
std::optional<op> mask_gen_idx_row;
std::optional<logical_tensor> mask_row_idx;
std::optional<op> mask_gen_idx_col;
@ -556,7 +556,7 @@ partition create_sdpa_backward_graph_partition(
// For optional additive mask
std::optional<op> mask_add;
// For optional implicite causal mask
// For optional implicit causal mask
std::optional<op> mask_gen_idx_row;
std::optional<logical_tensor> mask_row_idx;
std::optional<op> mask_gen_idx_col;

View File

@ -345,7 +345,7 @@ class Attr {
dnnl::memory binary_m;
auto binary = ops_params_[i].binary_;
auto md = ops_params_[i].meta_;
// qeury expected_md to achieve peak performance
// query expected_md to achieve peak performance
auto expected_md = pd.query_md(
dnnl::query::exec_arg_md,
DNNL_ARG_ATTR_MULTIPLE_POST_OP(i) | DNNL_ARG_SRC_1);

View File

@ -301,7 +301,7 @@ bool is_onednn_matmul_strides(const at::Tensor& tensor) {
return false;
}
// the overlaped cases are not supported
// the overlapped cases are not supported
dnnl::memory::dims strides = get_onednn_strides(tensor);
int64_t storage_size = 1;
for (size_t dim = 0; dim < tensor_dim; ++dim)

View File

@ -29,7 +29,7 @@
secondaryTensor:(MPSGraphTensor*)secondaryTensor
name:(NSString*)name {
// As of MacOS-15.1 m..imumWithNanPropagation is only defined for floating types and calling it with integral
// agruments results in
// arguments results in
// /AppleInternal/Library/BuildRoots/c7c74b64-74b4-11ef-aeda-9635a580fe0d/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Utility/MPSKernelDAG.mm:805:
// failed assertion `Error getting visible function: (null) Function isNaN_u8_i8 was not found in the library'
if (([primaryTensor dataType] & MPSDataTypeFloatBit) == 0) {
@ -42,7 +42,7 @@
secondaryTensor:(MPSGraphTensor*)secondaryTensor
name:(NSString*)name {
// As of MacOS-15.1 m..imumWithNanPropagation is only defined for floating types and calling it with integral
// agruments results in
// arguments results in
// /AppleInternal/Library/BuildRoots/c7c74b64-74b4-11ef-aeda-9635a580fe0d/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Utility/MPSKernelDAG.mm:805:
// failed assertion `Error getting visible function: (null) Function isNaN_u8_i8 was not found in the library'
if (([primaryTensor dataType] & MPSDataTypeFloatBit) == 0) {
@ -539,7 +539,7 @@ Placeholder::Placeholder(MPSGraphTensor* mpsGraphTensor,
static const bool is_macOS_15_0_or_newer = is_macos_13_or_newer(MacOSVersion::MACOS_VER_15_0_PLUS);
// Use gather kernel to solve strides for macOS < 15.0
// Starting with macOS 15.0, MPS supports native strides direclty in the kernels
// Starting with macOS 15.0, MPS supports native strides directly in the kernels
if (!is_macOS_15_0_or_newer || !useMPSStridedAPI) {
if ((!src.is_contiguous() || src.storage_offset()) && gatherTensorData) {
Tensor emptyShell = Tensor();

View File

@ -1,4 +1,4 @@
#pragma onces
#pragma once
#include <c10/metal/common.h>
template <unsigned N = c10::metal::max_ndim>

View File

@ -1,3 +1,5 @@
#define TORCH_ASSERT_ONLY_METHOD_OPERATORS
#include <ATen/native/Resize.h>
#include <ATen/native/SpectralOpsUtils.h>
#include <ATen/native/mps/OperationUtils.h>
@ -37,25 +39,12 @@ NSArray<NSNumber*>* IntArrayToNSArray(IntArrayRef arr) {
} // anonymous namespace
Tensor _fft_c2r_mps(const Tensor& self, IntArrayRef dim, int64_t normalization, int64_t last_dim_size) {
TORCH_CHECK(self.is_complex());
auto in_sizes = self.sizes();
DimVector out_sizes(in_sizes.begin(), in_sizes.end());
out_sizes[dim.back()] = last_dim_size;
auto out = at::empty(out_sizes, self.options().dtype(c10::toRealValueType(self.scalar_type())));
auto out = at::empty({}, self.options().dtype(c10::toRealValueType(self.scalar_type())));
return _fft_c2r_mps_out(self, dim, normalization, last_dim_size, out);
}
Tensor _fft_r2c_mps(const Tensor& self, IntArrayRef dim, int64_t normalization, bool onesided) {
TORCH_CHECK(self.is_floating_point());
auto input_sizes = self.sizes();
DimVector out_sizes(input_sizes.begin(), input_sizes.end());
auto last_dim = dim.back();
auto last_dim_halfsize = (input_sizes[last_dim]) / 2 + 1;
if (onesided) {
out_sizes[last_dim] = last_dim_halfsize;
}
auto out = at::empty(out_sizes, self.options().dtype(c10::toComplexType(self.scalar_type())));
auto out = at::empty({}, self.options().dtype(c10::toComplexType(self.scalar_type())));
return _fft_r2c_mps_out(self, dim, normalization, onesided, out);
}
@ -72,6 +61,17 @@ using namespace mps;
// TODO: Investigate numerical discrepancies see https://github.com/pytorch/pytorch/issues/120237
Tensor& _fft_r2c_mps_out(const Tensor& self, IntArrayRef dim, int64_t normalization, bool onesided, Tensor& out) {
TORCH_CHECK(self.scalar_type() == kFloat || self.scalar_type() == kHalf, "Only float and half dtypes are supported");
TORCH_CHECK(out.scalar_type() == c10::toComplexType(self.scalar_type()));
const auto input_sizes = self.sym_sizes();
SymDimVector out_sizes(input_sizes.begin(), input_sizes.end());
auto last_dim = dim.back();
auto last_dim_halfsize = (input_sizes[last_dim]) / 2 + 1;
if (onesided) {
out_sizes[last_dim] = last_dim_halfsize;
}
at::native::resize_output_symint(out, out_sizes);
auto key = __func__ + getTensorsStringKey({self, out}) + ":" + getArrayRefString(dim) + ":" +
std::to_string(normalization) + ":" + std::to_string(onesided);
@autoreleasepool {
@ -112,6 +112,12 @@ Tensor& _fft_c2r_mps_out(const Tensor& self,
int64_t normalization,
int64_t last_dim_size,
Tensor& out) {
TORCH_CHECK(self.is_complex(), "Input must be complex");
TORCH_CHECK(out.scalar_type() == c10::toRealValueType(self.scalar_type()), "Unexpected output type");
const auto in_sizes = self.sym_sizes();
SymDimVector out_sizes(in_sizes.begin(), in_sizes.end());
out_sizes[dim.back()] = last_dim_size;
at::native::resize_output_symint(out, out_sizes);
auto key = __func__ + getTensorsStringKey({self}) + ":" + getArrayRefString(dim) + ":" +
std::to_string(normalization) + ":" + std::to_string(last_dim_size);
@autoreleasepool {

View File

@ -158,7 +158,7 @@ static void reduction_out_mps(const Tensor& input_t,
IntArrayRef dim = opt_dim.value();
for (const auto dim_val : dim) {
auto wrap_dim = maybe_wrap_dim(dim_val, input_shape.size());
// canSqueeze logic is broken when dim is negative, it introduces off-by-one-erros or crashes
// canSqueeze logic is broken when dim is negative, it introduces off-by-one-errors or crashes
// See https://github.com/pytorch/pytorch/issues/136132#issuecomment-2354482608
if (wrap_dim >= 4 || dim_val < 0) {
canSqueezeLastDim = false;
@ -1282,7 +1282,7 @@ static void all_any_common_impl_mps(const Tensor& input_t,
auto inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, input_t);
auto castInputTensor = castToIHFTypes(mpsGraph, inputTensor, input_t);
// reductionOrWithTensor:axis: will throw an internal assert if number of dimentions is more than 4
// reductionOrWithTensor:axis: will throw an internal assert if number of dimensions is more than 4
// See https://github.com/pytorch/pytorch/issues/95538
MPSGraphTensor* outputTensor = nil;
if (input_t.ndimension() > 4) {
@ -1352,7 +1352,7 @@ TORCH_IMPL_FUNC(any_all_out_mps)(const Tensor& input_t, const Tensor& output_t)
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
auto inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, input_t);
auto castInputTensor = castToIHFTypes(mpsGraph, inputTensor, input_t);
// reductionOrWithTensor:axes: will throw an internal assert if number of dimentions is more than 4
// reductionOrWithTensor:axes: will throw an internal assert if number of dimensions is more than 4
// See https://github.com/pytorch/pytorch/issues/95538
if (input_t.dim() > 4) {
castInputTensor = [mpsGraph reshapeTensor:castInputTensor withShape:@[ @-1 ] name:nil];
@ -1400,7 +1400,7 @@ TORCH_IMPL_FUNC(all_all_out_mps)(const Tensor& input_t, const Tensor& output_t)
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
auto inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, input_t);
auto castInputTensor = castToIHFTypes(mpsGraph, inputTensor, input_t);
// reductionAndWithTensor:axes: will throw an internal assert if number of dimentions is more than 4
// reductionAndWithTensor:axes: will throw an internal assert if number of dimensions is more than 4
// See https://github.com/pytorch/pytorch/issues/95538
if (input_t.ndimension() > 4) {
castInputTensor = [mpsGraph reshapeTensor:castInputTensor withShape:@[ @-1 ] name:nil];

View File

@ -19,7 +19,7 @@ namespace at::native::mps {
// For both scatter and gather kernels, there are 4 specized ones (for 1D to 4D tensor)
// and one generic, for 5+D ones. Assumption (to be tested) about specialized kernels
// is that reduction of n-dimentional vector, where n is 2, should be slower
// is that reduction of n-dimensional vector, where n is 2, should be slower
// than reduction of 2D one, as n is not known at compiler time, therefore compiler
// could not do loop unrolls, that is
// float sum(float* v, int n) {

View File

@ -53,7 +53,7 @@ C10_ALWAYS_INLINE std::pair<int64_t, int64_t> _check_nested_layer_norm_inputs(
normalized_shape);
// Check that the normalized_shape has the exact same sizes as the last dimensions from the NestedTensor input
// Also, compute M and N considering the idiosyncracies of NestedTensors
// Also, compute M and N considering the idiosyncrasies of NestedTensors
int64_t N = 1;
for (const auto i: c10::irange(normalized_ndim)) {
TORCH_CHECK(

View File

@ -95,7 +95,7 @@ std::vector<Tensor> chunk_nested_tensor(const Tensor& self, int64_t chunks, int6
for (const auto split_idx : c10::irange(chunks)) {
auto new_sizes = sizes.clone();
auto new_strides = strides.clone();
// This copys offsets so we are safe to move
// This copies offsets so we are safe to move
auto new_offsets = offsets.clone();
int64_t *size_ptr = new_sizes.data_ptr<int64_t>();
int64_t *new_offsets_ptr = new_offsets.data_ptr<int64_t>();

View File

@ -245,7 +245,7 @@ int64_t get_nnz(const Tensor& nestedtensor) {
// this is because needs_broadcast indicates that the batch_size is 1
// and hence there is only 1 value for seq_len
// (2) The cum_seq_lens are given by [0, {*}_t.size(1), 2 * {*}_t.size(1),
// ..., outut_batch_size * {*}_t.size(1)] (3) Nnz_{*} is given by
// ..., output_batch_size * {*}_t.size(1)] (3) Nnz_{*} is given by
// output_batch_size * {*}_t.size(1);
int64_t max_seqlen_batch_q = 0, Nnz_q = 0;

View File

@ -575,24 +575,9 @@ void spmm(
cusparseOperation_t opB = transpose_B ? CUSPARSE_OPERATION_TRANSPOSE
: CUSPARSE_OPERATION_NON_TRANSPOSE;
// CUDA < 11.0 doesn't support 64-bit indices and doesn't raise an error about this
// silently returning incorrect results
#if defined(USE_ROCM) && (ROCM_VERSION < 60300)
auto mat1_32 = at::native::_sparse_csr_tensor_unsafe(
mat1.crow_indices().to(kInt),
mat1.col_indices().to(kInt),
mat1.values(),
mat1.sizes(),
mat1.scalar_type(),
mat1.layout(),
mat1.device());
auto descA = at::cuda::sparse::CuSparseSpMatCsrDescriptor(mat1_32);
auto algorithm = CUSPARSE_MM_ALG_DEFAULT;
#else // defined(USE_ROCM) && (ROCM_VERSION < 60300)
// TODO: update this to support COO sparse layout
auto descA = at::cuda::sparse::CuSparseSpMatCsrDescriptor(mat1);
auto algorithm = CUSPARSE_SPMM_CSR_ALG2;
#endif // defined(USE_ROCM) && (ROCM_VERSION < 60300)
auto descB = at::cuda::sparse::CuSparseConstDnMatDescriptor(
transpose_B ? mat2_->mT() : *mat2_);

View File

@ -193,12 +193,12 @@ vTensor pack_biases_quantized_weights(
src_kw_sz = b_sizes[Layout::BatchMatrices::width];
src_kh_sz = b_sizes[Layout::BatchMatrices::height];
} else if (bias.sizes().size() == 2) {
// skip batch dim for boardcasting; index -1
// skip batch dim for broadcasting; index -1
src_kb_sz = 1;
src_kw_sz = b_sizes[Layout::BatchMatrices::height];
src_kh_sz = b_sizes[Layout::BatchMatrices::batch];
} else {
// skip batch & height dim for boardcasting; index -2
// skip batch & height dim for broadcasting; index -2
src_kb_sz = 1;
src_kw_sz = b_sizes[Layout::BatchMatrices::batch];
src_kh_sz = 1;
@ -327,13 +327,13 @@ bool available_check_with_batch(
weight.size(Layout::BatchMatrices::batch) ||
bias->size(Layout::BatchMatrices::batch) == 1);
} else if (bias->ndimension() == 2) {
// skip batch dim for boardcasting; index -1
// skip batch dim for broadcasting; index -1
bias_available &=
(bias->size(Layout::BatchMatrices::height) ==
weight.size(Layout::BatchMatrices::width) ||
bias->size(Layout::BatchMatrices::height) == 1);
} else {
// skip batch & height dim for boardcasting; index -2
// skip batch & height dim for broadcasting; index -2
bias_available &=
(bias->size(Layout::BatchMatrices::batch) ==
weight.size(Layout::BatchMatrices::width) ||

View File

@ -158,7 +158,7 @@ class TORCH_API Tensor: public TensorBase {
// will only lead to trouble and dangling references.
c10::MaybeOwned<Tensor> expect_contiguous(MemoryFormat memory_format=MemoryFormat::Contiguous) && = delete;
// The following overloads are very intruiging. Consider the following
// The following overloads are very intriguing. Consider the following
// program:
//
// x[1] = 3;

View File

@ -6894,7 +6894,7 @@ TEST_F(VulkanAPITest, slice_height_success) {
{2, {2, 3, 40, 50}}, // 4D tensors with dim=height
{1, {3, 40, 50}}, // 3D tensors with dim=height
{0, {40, 50}}, // 2D tensors with dim=height
// 1D tesnors don't have height dim for test
// 1D tensors don't have height dim for test
};
// Act/Assert
@ -6906,7 +6906,7 @@ TEST_F(VulkanAPITest, slice_feature_success) {
std::unordered_map<int64_t, std::vector<int64_t>> dim2sizes {
{1, {2, 40, 13, 14}}, // 4D tensors with dim=feature(channel)
{0, {40, 13, 14}}, // 3D tensors with dim=feature(channel)
// 1D and 2D tesnors don't have feature(channel) dim for test
// 1D and 2D tensors don't have feature(channel) dim for test
};
// Act/Assert
@ -6917,7 +6917,7 @@ TEST_F(VulkanAPITest, slice_batch_success) {
// Arrange
std::unordered_map<int64_t, std::vector<int64_t>> dim2sizes {
{0, {40, 3, 13, 14}}, // 4D tensors with dim=batch
// 1D, 2D and 3D tesnors don't have batch dim for test
// 1D, 2D and 3D tensors don't have batch dim for test
};
// Act/Assert

View File

@ -916,6 +916,7 @@ libtorch_python_core_sources = [
"torch/csrc/autograd/python_torch_functions_manual.cpp",
"torch/csrc/autograd/python_variable.cpp",
"torch/csrc/autograd/python_variable_indexing.cpp",
"torch/csrc/distributed/python_placement.cpp",
"torch/csrc/dynamo/python_compiled_autograd.cpp",
"torch/csrc/dynamo/cache_entry.cpp",
"torch/csrc/dynamo/cpp_shim.cpp",
@ -1073,6 +1074,7 @@ aten_cpu_non_globed_sources = [
"aten/src/ATen/detail/MPSHooksInterface.cpp",
"aten/src/ATen/detail/MAIAHooksInterface.cpp",
"aten/src/ATen/detail/PrivateUse1HooksInterface.cpp",
"aten/src/ATen/detail/XLAHooksInterface.cpp",
"aten/src/ATen/detail/XPUHooksInterface.cpp",
"aten/src/ATen/detail/MTIAHooksInterface.cpp",
"aten/src/ATen/detail/IPUHooksInterface.cpp",
@ -1091,6 +1093,7 @@ aten_cpu_non_globed_headers = [
"aten/src/ATen/detail/HPUHooksInterface.h",
"aten/src/ATen/detail/MAIAHooksInterface.h",
"aten/src/ATen/detail/PrivateUse1HooksInterface.h",
"aten/src/ATen/detail/XLAHooksInterface.h",
"aten/src/ATen/detail/XPUHooksInterface.h",
"aten/src/ATen/detail/MTIAHooksInterface.h",
"aten/src/ATen/detail/IPUHooksInterface.h",

View File

@ -329,17 +329,17 @@ struct pair {
};
template <typename T>
static T conj(T a) {
inline T conj(T a) {
return a;
}
template <>
half2 conj(half2 a) {
inline half2 conj(half2 a) {
return half2(a.x, -a.y);
}
template <>
float2 conj(float2 a) {
inline float2 conj(float2 a) {
return float2(a.x, -a.y);
}

View File

@ -1638,7 +1638,38 @@ if(USE_KINETO)
message(STATUS " KINETO_LIBRARY_TYPE = ${KINETO_LIBRARY_TYPE}")
if(NOT LIBKINETO_NOCUPTI)
if(TARGET CUDA::cupti)
set(CUDA_SOURCE_DIR "${CUDA_TOOLKIT_ROOT_DIR}" CACHE STRING "")
message(STATUS " CUDA_SOURCE_DIR = ${CUDA_SOURCE_DIR}")
message(STATUS " CUDA_INCLUDE_DIRS = ${CUDA_INCLUDE_DIRS}")
if(NOT MSVC)
if(USE_CUPTI_SO)
set(CUPTI_LIB_NAME "libcupti.so")
else()
set(CUPTI_LIB_NAME "libcupti_static.a")
endif()
else()
set(CUPTI_LIB_NAME "cupti.lib")
endif()
find_library(CUPTI_LIBRARY_PATH ${CUPTI_LIB_NAME} PATHS
${CUDA_SOURCE_DIR}
${CUDA_SOURCE_DIR}/extras/CUPTI/lib64
${CUDA_SOURCE_DIR}/lib
${CUDA_SOURCE_DIR}/lib64
NO_DEFAULT_PATH)
find_path(CUPTI_INCLUDE_DIR cupti.h PATHS
${CUDA_SOURCE_DIR}/extras/CUPTI/include
${CUDA_INCLUDE_DIRS}
${CUDA_SOURCE_DIR}
${CUDA_SOURCE_DIR}/include
NO_DEFAULT_PATH)
if(CUPTI_LIBRARY_PATH AND CUPTI_INCLUDE_DIR)
message(STATUS " CUPTI_INCLUDE_DIR = ${CUPTI_INCLUDE_DIR}")
set(CUDA_cupti_LIBRARY ${CUPTI_LIBRARY_PATH})
message(STATUS " CUDA_cupti_LIBRARY = ${CUDA_cupti_LIBRARY}")
message(STATUS "Found CUPTI")
set(LIBKINETO_NOCUPTI OFF CACHE STRING "" FORCE)
@ -1651,7 +1682,7 @@ if(USE_KINETO)
if(NOT APPLE)
set(CMAKE_REQUIRED_LIBRARIES ${CMAKE_REQUIRED_LIBRARIES} "dl" "pthread")
endif()
set(CMAKE_REQUIRED_LIBRARIES ${CMAKE_REQUIRED_LIBRARIES} $<LINK_LIBRARY:WHOLE_ARCHIVE,CUDA::cupti_static>)
set(CMAKE_REQUIRED_LINK_OPTIONS "-Wl,--whole-archive,${CUPTI_LIBRARY_PATH},--no-whole-archive")
check_cxx_source_runs("#include <stdexcept>
int main() {
try {

View File

@ -29,10 +29,15 @@ SET(Open_BLAS_LIB_SEARCH_PATHS
$ENV{OpenBLAS}/lib
$ENV{OpenBLAS_HOME}
$ENV{OpenBLAS_HOME}/lib
)
)
SET(Open_BLAS_LIB_NAME openblas)
IF(DEFINED ENV{OpenBLAS_LIB_NAME})
SET(Open_BLAS_LIB_NAME $ENV{OpenBLAS_LIB_NAME})
ENDIF()
FIND_PATH(OpenBLAS_INCLUDE_DIR NAMES cblas.h PATHS ${Open_BLAS_INCLUDE_SEARCH_PATHS})
FIND_LIBRARY(OpenBLAS_LIB NAMES openblas PATHS ${Open_BLAS_LIB_SEARCH_PATHS})
FIND_LIBRARY(OpenBLAS_LIB NAMES ${Open_BLAS_LIB_NAME} PATHS ${Open_BLAS_LIB_SEARCH_PATHS})
SET(OpenBLAS_FOUND ON)

Some files were not shown because too many files have changed in this diff Show More