Compare commits


189 Commits

Author SHA1 Message Date
bad5269194 Change python doc push script to print the undocumented modules 2025-10-08 09:28:48 -07:00
7bc13c802b Update 2025-10-08 09:04:44 -07:00
d872529792 Test 2025-10-08 09:04:43 -07:00
eqy
0d39ecb2ce [cuDNN][RNN] cuDNN RNN supports BFloat16 inputs since 9.13 (#164411)
seems to work

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164411
Approved by: https://github.com/Skylion007
2025-10-08 15:26:50 +00:00
90c0825e2d [GHF] Allow reverts from pytorch-auto-revert app (#164911)
This is a bit weird: author_login is not a unique field, but author_url is.

Explicitly allow https://github.com/apps/pytorch-auto-revert to issue revert commands

Update mocks by running
```
sed -i -e s/8e262b0495bd934d39dda198d4c09144311c5ddd6cca6a227194bd48dbfe7201/47860a8f57a214a426d1150c29893cbc2aa49507f12b731483b1a1254bca3428/ gql_mocks.json
```

Test plan: Run
```python
from trymerge import GitHubPR
pr=GitHubPR("pytorch", "pytorch", 164660)
print(pr.get_last_comment().author_url, pr.get_comment_by_id(3375785595).author_url)
```
that should produce
```
https://github.com/pytorch-auto-revert https://github.com/apps/pytorch-auto-revert
```
Also added a regression test that checks two particular comments for revert validity.

`pytorch-auto-revert` user is my alter ego :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164911
Approved by: https://github.com/jeanschmidt
2025-10-08 15:15:45 +00:00
fd4bde430a Revert "list_stored_sd_metadata API. (#160610)"
This reverts commit da903b6a8be422529d47649e89c0d50bb95c37ca.

Reverted https://github.com/pytorch/pytorch/pull/160610 on behalf of https://github.com/jeffdaily due to broke ROCm CI, but flaky also on CUDA CI https://hud.pytorch.org/failure?name=periodic%20%2F%20linux-jammy-rocm-py3.10%20%2F%20test%20(distributed%2C%202%2C%203%2C%20linux.rocm.gpu.mi250.4%2C%20module%3Arocm%2C%20oncall%3Adistributed)&jobName=undefined&failureCaptures=distributed%2Fcheckpoint%2Ftest_list_stored_state_dict.py%3A%3ATestListStateDict%3A%3Atest_list_stored_sd_metadata ([comment](https://github.com/pytorch/pytorch/pull/160610#issuecomment-3382023022))
2025-10-08 15:10:38 +00:00
b5e93ffdcf Revert "Limit path search within range (#164581)"
This reverts commit 415e641572473479fc9d9eaea12762e1a223a9e0.

Reverted https://github.com/pytorch/pytorch/pull/164581 on behalf of https://github.com/eellison due to merge sets makes this trickier ([comment](https://github.com/pytorch/pytorch/pull/164581#issuecomment-3381955240))
2025-10-08 14:56:21 +00:00
f8d0d65ddc Revert "Add memory estimator (#164738)"
This reverts commit ab01a0d7d352e7fd07989b8d6bf035bf82aea74e.

Reverted https://github.com/pytorch/pytorch/pull/164738 on behalf of https://github.com/eellison due to merge sets makes this trickier ([comment](https://github.com/pytorch/pytorch/pull/164581#issuecomment-3381955240))
2025-10-08 14:56:21 +00:00
f46ddb1e65 [ROCm][CI] add gfx1150 gfx1151 to docker images for binary builds (#164854)
Fixes #164346.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164854
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-08 14:34:22 +00:00
20082d7136 Revert "fix flex attention eager bwd: more rounding (#164317)"
This reverts commit 41808b2ba9a61ab2f4c7af394c1668d09a4a0331.

Reverted https://github.com/pytorch/pytorch/pull/164317 on behalf of https://github.com/jeffdaily due to inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_builtin_score_mods_seqlen_lt_custom_sparse_block_size_score_mod4_cuda_float16 [GH job link](https://github.com/pytorch/pytorch/actions/runs/18330774537/job/52207370954) [HUD commit link](41808b2ba9) ([comment](https://github.com/pytorch/pytorch/pull/164317#issuecomment-3381812090))
2025-10-08 14:29:10 +00:00
7158aa22e8 remove more (#164753)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164753
Approved by: https://github.com/aorenste, https://github.com/mlazos
ghstack dependencies: #164664, #164665, #164667, #164668
2025-10-08 14:23:38 +00:00
2035f6b2e6 use check_size instead of check_is_size in ops.py (#164668)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164668
Approved by: https://github.com/angelayi
ghstack dependencies: #164664, #164665, #164667
2025-10-08 14:23:38 +00:00
2b58adc3bd [inductor][templates] Distinguish between kernel input nodes and codegen input nodes (#163752)
If there is a single autotuner choice, the wrong type of input node is used to instantiate `TritonTemplateBuffer` through `TritonTemplateCaller.output_node`. This PR distinguishes the input nodes used in `AlgorithmSelectorCache.__call__` between the actual inputs passed to the kernel at runtime, vs the possibly viewed inputs that influence scheduling behaviour (e.g. `MemoryDeps`) and codegen. See the added unit test for more detail.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163752
Approved by: https://github.com/eellison
2025-10-08 14:12:14 +00:00
322091d8d8 [opaque_obj] Add make_fx tracing support (#163278)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163278
Approved by: https://github.com/zou3519
ghstack dependencies: #163279, #163277
2025-10-08 09:09:16 +00:00
2bb4e6876c [opaque obj] Error for torch.library.custom_op infer_schema (#163277)
It's unclear how infer_schema can infer the ScriptObject type from just the type annotation, so for now we just error clearly and ask users to specify a schema.
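For illustration, a hedged sketch of the kind of workaround this error points users toward: passing an explicit `schema` string to `torch.library.custom_op` instead of relying on annotation-based inference. The op name, script-object class, and schema string below are made up for this example.
```python
import torch

# Hypothetical custom op taking a ScriptObject: since infer_schema cannot
# derive the script-object type from the Python annotation, spell out the
# schema explicitly (class name and op name here are illustrative only).
@torch.library.custom_op(
    "mylib::accumulate",
    mutates_args=(),
    schema="(__torch__.torch.classes.mylib.Accumulator acc, Tensor x) -> Tensor",
)
def accumulate(acc, x: torch.Tensor) -> torch.Tensor:
    return x.clone()
```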

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163277
Approved by: https://github.com/zou3519
ghstack dependencies: #163279
2025-10-08 09:09:16 +00:00
56ef7743fc [opaque_obj] Add __eq__ and __deepcopy__ (#163279)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163279
Approved by: https://github.com/zou3519
2025-10-08 09:09:16 +00:00
64108bdbed [BC-Breaking] Remove long-deprecated casting functions from native_functions.yaml (#164641)
This PR removes `torch._cast_XXX` from the generated ops. They were deprecated in PyTorch 1.
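As a reminder of the migration path (a minimal sketch; the removed names follow the `torch._cast_XXX` pattern mentioned above):
```python
import torch

x = torch.arange(4)

# Previously generated ops such as torch._cast_Float(x) are now removed;
# the long-standing replacements are the regular dtype conversions:
y = x.to(torch.float32)
z = x.float()
assert y.dtype == z.dtype == torch.float32
```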

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164641
Approved by: https://github.com/albanD, https://github.com/justinchuby
2025-10-08 08:27:58 +00:00
c855f8632e Pyrefly suppressions 7/n (#164913)
Adds suppressions so that pyrefly will typecheck clean: https://github.com/pytorch/pytorch/issues/163283

Almost there!

Test plan:
dmypy restart && python3 scripts/lintrunner.py -a
pyrefly check

step 1: delete lines in the pyrefly.toml file from the project-excludes field
step 2: run pyrefly check
step 3: add suppressions, clean up unused suppressions
before: https://gist.github.com/maggiemoss/4b3bf2037014e116bc00706a16aef199

after:
 INFO 0 errors (6,884 ignored)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164913
Approved by: https://github.com/oulgen
2025-10-08 07:27:17 +00:00
12d2ef557f Update round size with 1 division behavior (#162203)
Have round size return the nearest power of 2 greater than or equal to the size, computed with a single division.

Fixes #161139
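A minimal sketch of the described rounding behavior (an illustration only, not the allocator's actual code):
```python
def round_size(size: int) -> int:
    # Return the nearest power of two greater than or equal to `size`,
    # matching the "1 division" rounding described above.
    if size <= 1:
        return 1
    return 1 << (size - 1).bit_length()

assert round_size(5) == 8
assert round_size(8) == 8
assert round_size(9) == 16
```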

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162203
Approved by: https://github.com/ezyang
2025-10-08 06:41:46 +00:00
65aa62d50d Use codegen for the boxed interpreters (#164573)
Authored with Claude Code. The arg parsing is kind of horrible; open
to more suggestions.

Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164573
Approved by: https://github.com/albanD, https://github.com/jansel
2025-10-08 06:27:44 +00:00
6a09f9306c Fix #164742, all header-impl'd userfacing functions should be inline (#164871)
It is as @mxmpl pointed out; we are missing an inline.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164871
Approved by: https://github.com/mikaylagawarecki
2025-10-08 05:57:19 +00:00
19bf67be32 multimem reduce (#164517)
Modified `multimem_one_shot_all_reduce_out` function to accept a `root` argument, making it a `multimem_reduce` op.

The original `multimem_one_shot_all_reduce` op becomes a caller of the `multimem_reduce`, with each rank providing its own rank id as root.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164517
Approved by: https://github.com/ngimel
2025-10-08 05:25:16 +00:00
1927783aa3 Revert "Reland vision pinned commit hash update (#164492)"
This reverts commit 6861a270624b44954826688f8dad668eb0154452.

Reverted https://github.com/pytorch/pytorch/pull/164492 on behalf of https://github.com/izaitsevfb due to see autorevert msg above, inductor breakage is legit ([comment](https://github.com/pytorch/pytorch/pull/164492#issuecomment-3379537888))
2025-10-08 04:38:26 +00:00
184817c7a8 locks + unit tests (#164636)
Test Plan:
```
buck test fbcode//mode/opt caffe2/test/inductor:caching
```

Reviewed By: aorenste

D83714690

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164636
Approved by: https://github.com/aorenste
2025-10-08 04:34:22 +00:00
da903b6a8b list_stored_sd_metadata API. (#160610)
Summary:
1. Certain checkpoint load use cases are not aware of the properties of the data/tensors they want to load.
2. These use cases include data loader checkpoints and reading data for post-processing (when the original model definition is not available).
3. There, we have to use the saved checkpoint (metadata) as our source of truth.
4. This RFC proposal exposes the checkpoint metadata using a public API.

In this proposal we expose the stored state-dict metadata (minus the associated storage/chunk metadata).

Chunk/storage details should not be exposed to users; they are an implementation detail of the storage writer/reader.

Test Plan:
UT.

Rollback Plan:

Differential Revision: D80231457

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160610
Approved by: https://github.com/saumishr
2025-10-08 04:33:51 +00:00
f76fdcaaf8 [Benchmark] cleanup huggingface models (#164815)
Prune models from the TorchInductor dashboard to reduce CI cost. This PR prunes Hugging Face models according to the [doc](https://docs.google.com/document/d/1nLPNNAU-_M9Clx9FMrJ1ycdPxe-xRA54olPnsFzdpoU/edit?tab=t.0), reducing the count from 46 to 27 models.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164815
Approved by: https://github.com/anijain2305, https://github.com/seemethere, https://github.com/huydhn, https://github.com/malfet
2025-10-08 03:21:04 +00:00
608792153f [inductor][codecache] Print bytes in codecache debug output (#164898)
Summary: We have an internal request to help understand why the hash of `post_grad_custom_post_pass` is changing between attempts. We don't get useful info from the debug output, because we just print "<bytes>". Instead, attempt to print at least _some_ of the value in case it contains readable characters.
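A rough sketch of the idea (not the actual codecache implementation): decode a bounded prefix of the bytes so readable content shows up in the debug log instead of a bare placeholder.
```python
def preview_value(value, limit: int = 64) -> str:
    # For bytes, show a decoded prefix (unreadable characters are replaced),
    # so values like b"HelloWorld!..." are at least partially visible.
    if isinstance(value, bytes):
        return value[:limit].decode("utf-8", errors="replace")
    return str(value)

print(preview_value(b"HelloWorld!" + bytes(range(256))))
```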

Test Plan:
Registered a dummy post_grad_custom_pass and printed codecache debug output
`TORCH_LOGS=+torch._inductor.codecache python ~/foo.py`

Yields something like:
```
V1007 16:41:19.024000 3546009 /data/users/slarsen/pytorch-3.10_4/torch/_inductor/codecache.py:989] [0/0] [law2ujt2wzjb5tyiu6jh64r2lxpvl62yvxcsmdouhg3qyelhhdv] post_grad_custom_post_pass: HelloWorld!������������...
```

Differential Revision: [D84108770](https://our.internmc.facebook.com/intern/diff/D84108770)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164898
Approved by: https://github.com/oulgen
2025-10-08 02:45:20 +00:00
086dec3235 Pyrefly suppressions 6/n (#164877)
Adds suppressions so that pyrefly will typecheck clean: https://github.com/pytorch/pytorch/issues/163283

Almost there!

Test plan:
dmypy restart && python3 scripts/lintrunner.py -a
pyrefly check

step 1: delete lines in the pyrefly.toml file from the project-excludes field
step 2: run pyrefly check
step 3: add suppressions, clean up unused suppressions
before: https://gist.github.com/maggiemoss/4b3bf2037014e116bc00706a16aef199

after:

INFO 0 errors (5,064 ignored)

Only four directories left to enable

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164877
Approved by: https://github.com/oulgen
2025-10-08 02:30:57 +00:00
ad7b2bebc6 Use tuples to have a deterministic ordering. (#164851)
When debugging I noticed some non-deterministic behavior and tracked it down to this literal set. Changed it to a tuple for determinism. Also changed two other small literal sets, since using a set for a small lookup like that is slow.
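A small illustration of why this matters (the element names are made up, not the actual set from this PR): iteration order of a set of strings can change between interpreter runs due to hash randomization, while a tuple is always iterated in source order.
```python
# Run this script twice with different PYTHONHASHSEED values to see the
# set's iteration order change, while the tuple's order stays fixed.
BACKENDS_SET = {"cpp", "triton", "halide"}
BACKENDS_TUPLE = ("cpp", "triton", "halide")

print(list(BACKENDS_SET))    # order may differ across runs
print(list(BACKENDS_TUPLE))  # always ['cpp', 'triton', 'halide']

# For a membership test over three items, a tuple scan also avoids the
# hashing overhead of a set lookup.
assert "triton" in BACKENDS_TUPLE
```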

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164851
Approved by: https://github.com/bobrenjc93, https://github.com/bdhirsh
2025-10-08 02:12:03 +00:00
d444384003 [SymmMem] Tiled reduce (#162243)
Added op: `tile_reduce(Tensor input, Tensor(a!) out, int root, str group_name)`

For now supports only:
- NVSHMEM backed symmetric tensor;
- 2D tensor and tile;
- torch.float.

Testing on the bottom-right quadrant:
```
rank 0:
tensor([[0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 1., 1., 1., 1.],
        [0., 0., 0., 0., 1., 1., 1., 1.],
        [0., 0., 0., 0., 1., 1., 1., 1.],
        [0., 0., 0., 0., 1., 1., 1., 1.]], device='cuda:0')
PASSED
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162243
Approved by: https://github.com/ngimel
2025-10-08 02:03:04 +00:00
3040a5d294 Revert "[dynamo] Support torch.fx.traceback.annotate (#164678)"
This reverts commit 801e282f39e9ef4424dfd3ecfd2b550a44595229.

Reverted https://github.com/pytorch/pytorch/pull/164678 on behalf of https://github.com/izaitsevfb due to breaks executorch internally, see [D84068062](https://www.internalfb.com/diff/D84068062?entry_point=16) ([comment](https://github.com/pytorch/pytorch/pull/164678#issuecomment-3379281844))
2025-10-08 01:49:34 +00:00
97463d4cf3 Revert "Fix double dispatch to Python for detach (#163671)"
This reverts commit c32118dc3e50505fd285e6e448a90883fce11535.

Reverted https://github.com/pytorch/pytorch/pull/163671 on behalf of https://github.com/izaitsevfb due to breaks export tests ([comment](https://github.com/pytorch/pytorch/pull/163671#issuecomment-3379281422))
2025-10-08 01:46:45 +00:00
c813617c53 [PP] Migrate other schedules to use PipelineScheduleRuntime (#164777)
Second fix for https://github.com/pytorch/pytorch/issues/164756

It has been a TODO to make all schedules execute using the same runtime. After this change, schedules use the same `_PipelineScheduleRuntime` logic, which adds `UNSHARD` and `RESHARD` operations to the schedules and fixes the issue mentioned above.

<img width="920" height="406" alt="image" src="https://github.com/user-attachments/assets/a4d5bcd0-7dac-43cd-96f9-8ca33cfd8b91" />

A test is failing after the conversion:
- Fixed a gradient scaling issue for dWeight

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164777
Approved by: https://github.com/fegin
ghstack dependencies: #164775
2025-10-08 01:45:57 +00:00
e659661ffa [PP] Fix FSDP unshard/reshard (#164775)
First fix for https://github.com/pytorch/pytorch/issues/164756

In the pipeline IR we call `UNSHARD` and `RESHARD`, but there is a bug: calling `module.unshard()` does not recurse into nested FSDP modules, which sometimes leads to all-gather being called just before the module forward.

Since we want the pipeline IR to explicitly handle this, we can call `group.unshard` instead which ensures that all the modules are unsharded.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164775
Approved by: https://github.com/weifengpy
2025-10-08 01:45:57 +00:00
41808b2ba9 fix flex attention eager bwd: more rounding (#164317)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164317
Approved by: https://github.com/drisspg
ghstack dependencies: #163986
2025-10-08 01:17:45 +00:00
c0510dc447 [ContextParallel] add _LoadBalancer classes, and load-balance interface to Context Parallel APIs (#161062)
**Summary**
This PR provides an interface for users to specify how to load-balance the attention
input. The load-balance is essentially a rearrangement of the input tensor(s) over the
seq_dim before sharding and can be specified via an index tensor `rearrange` such
that Q[rearrange] is the balanced Q users want (i.e. `rearrange[i] == j` where `i` is the new
index of `Q[j]` in the balanced Q). An example is the `_generate_round_robin_indices()` added
in https://github.com/pytorch/pytorch/pull/155442.

**New `_LoadBalancer` classes**
The new `_LoadBalancer` class (defined in `torch/distributed/tensor/experimental/_load_balancer.py`)
provides one interface for defining load-balance behavior: `_generate_indices(self, restore: bool = False)`.

When `restore == False`, this method should output an index Tensor (namely `rearrange_idx`) such
that QKV will be transformed into Q' K' V' in a way that `Q'[i] == Q[rearrange_idx[i]]` (same applies
to K and V).

When `restore == True`, this method outputs an index Tensor (namely `restore_idx`) such that
`Q'[restore_idx] == Q` (same applies to K and V).

**Impact**
2 public CP APIs and 1 private CP API are modified. This PR should be backward-compatible because:
- For use with SDPA, existing users must be using the `context_parallel()` API, which does not
take the extra `load_balancer` argument and determines the behavior solely from the global var
`_cp_options.enable_load_balance`.
- For new users, including those who want to try `flex_attention()`, we require the new API
`_context_parallel_buffers` to explicitly shard the QKV input instead of using `context_parallel()`,
because we no longer rely on TorchDispatchMode or TorchFunctionMode for op replacement. We
also require users to explicitly pass in a `load_balancer` argument if load-balancing is desired.

**Load-Balance Behavior**
The `context_parallel_unshard()` and `create_cp_block_mask()` APIs now take an extra optional argument
`load_balancer`. This argument is optional for backward compatibility, but we require new users
to explicitly pass in a `load_balancer` if load-balancing is desired:
- if `load_balancer == None` and `_cp_options.enable_load_balance == False`, CP performs
no load-balancing on input Tensors.
- if `load_balancer == None` and `_cp_options.enable_load_balance == True`, CP performs
head-tail load-balancing (e.g. split a Tensor into 2*N chunks; the first N are called head and
the rest are called tail. Place the first head chunk and the last tail chunk on rank 0, the second
head chunk along with the second-to-last tail chunk on rank 1, and so on); a minimal sketch of
this rearrangement appears right after this list.
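A minimal sketch of the head-tail rearrangement described above (an illustration, not the `_HeadTailLoadBalancer` implementation):
```python
import torch

def head_tail_rearrange_indices(seq_len: int, world_size: int) -> torch.Tensor:
    # Split the sequence into 2 * world_size chunks; rank r gets the r-th
    # "head" chunk plus the mirrored "tail" chunk, so contiguous sharding of
    # the rearranged tensor balances causal-attention work across ranks.
    n_chunks = 2 * world_size
    assert seq_len % n_chunks == 0
    chunks = torch.arange(seq_len).view(n_chunks, -1)
    order = []
    for r in range(world_size):
        order.append(chunks[r])                 # head chunk for rank r
        order.append(chunks[n_chunks - 1 - r])  # matching tail chunk
    return torch.cat(order)

rearrange = head_tail_rearrange_indices(seq_len=16, world_size=2)
# Q_balanced = Q[rearrange] before sharding along the sequence dimension.
```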

`_context_parallel_buffers()` also takes the extra optional argument `load_balancer`, but the behavior
is slightly different from the other 2 APIs -- it doesn't branch on `_cp_options.enable_load_balance`:
- if `load_balancer == None`, no load-balancing will be performed
- otherwise, apply load-balancing using `load_balancer._generate_indices()` before sharding.

**Changes**
This PR moves the index Tensor generation logic into a set of LoadBalancer classes and
makes LoadBalancer the common interface for Context Parallel APIs that leverage
load-balancing:
* _context_parallel_buffers
* context_parallel_unshard
* create_cp_block_mask

The `_LoadBalancer` classes added are:
- `_LoadBalancer`: the abstract base class that provides the `_generate_indices` interface for index Tensor generation.
- `_HeadTailLoadBalancer`: Implements head-tail balancing logic.
- `_PerDocumentHeadTailLoadBalancer`: Supports per-document head-tail balancing for batched sequences.

**Test**
`pytest test/distributed/tensor/test_attention.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161062
Approved by: https://github.com/fegin
2025-10-08 01:09:14 +00:00
9ec10dc26a utils + unit tests (#164551)
Test Plan:
```
buck test fbcode//mode/opt caffe2/test/inductor:caching
```

Reviewed By: aorenste

Differential Revision: D83714691

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164551
Approved by: https://github.com/aorenste
2025-10-08 01:05:45 +00:00
43fc859625 Don't return values in void functions (#164809)
This PR fixes returning values in void C++ functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164809
Approved by: https://github.com/janeyx99
2025-10-08 01:04:14 +00:00
f713abab16 Revert "Enable all flake8-logging-format rules (#164655)"
This reverts commit e98c4e835b1db22092fc93b49d2cddd7b3537d1f.

Reverted https://github.com/pytorch/pytorch/pull/164655 on behalf of https://github.com/malfet due to Looks like it broke lint in trunk, see bd3b98a8a5/1 ([comment](https://github.com/pytorch/pytorch/pull/164655#issuecomment-3379209309))
2025-10-08 00:55:17 +00:00
bd3b98a8a5 [dynamic shapes] make backed_size_oblivious behavior consistent b/w symbolic_shapes/inductor (#164796)
Summary: call guard_or_ directly to enable backed_size_obl in inductor calls to guard_or

Test Plan: CI and unit test added.

Differential Revision: D84009392

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164796
Approved by: https://github.com/laithsakka
2025-10-08 00:19:06 +00:00
e98c4e835b Enable all flake8-logging-format rules (#164655)
These rules are enabled by removing existing suppressions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164655
Approved by: https://github.com/janeyx99
2025-10-08 00:16:13 +00:00
7b15534434 [export] Fix weight sharing when there is no complete tensor (#164857)
Summary: As titled.

Test Plan: CI

Differential Revision: D84079625

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164857
Approved by: https://github.com/yushangdi
2025-10-07 23:40:13 +00:00
c32118dc3e Fix double dispatch to Python for detach (#163671)
This fixes #71725.

Differential Revision: [D83857880](https://our.internmc.facebook.com/intern/diff/D83857880)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163671
Approved by: https://github.com/ezyang, https://github.com/albanD
2025-10-07 23:34:37 +00:00
e3ae80fc03 [PP] Let PP split BlockMask into micro-BlockMask (#164111)
BlockMask carries batch-dimension information, so PP has to split it just like all other tensors. All the tensors in BlockMask have the batch dimension, so we can split them without too many issues. However, `mask_mod` takes the batch index as input, and its value changes after the split, so we have to wrap it inside a closure that adjusts the batch index.
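A hedged sketch of the closure idea (the offset semantics are an assumption; the real splitting logic lives in the PP code):
```python
def shift_mask_mod(mask_mod, batch_offset: int):
    # flex_attention mask_mods take (b, h, q_idx, kv_idx); after splitting the
    # BlockMask into micro-batches, translate the micro-batch-local batch index
    # back to the index the original mask_mod expects.
    def wrapped(b, h, q_idx, kv_idx):
        return mask_mod(b + batch_offset, h, q_idx, kv_idx)
    return wrapped
```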

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164111
Approved by: https://github.com/H-Huang
2025-10-07 23:25:34 +00:00
483f4e0db9 CUDA 13.0 builds fix on Amazon Linux 2023 (#164870)
During 2.9 RC testing I am seeing an issue on Amazon Linux 2023 with CUDA 13.0 builds.

This is related to:
 https://github.com/pytorch/pytorch/issues/152756

Workflow: https://github.com/pytorch/test-infra/actions/runs/18324074610/job/52184079262

Error:
```
WARNING: There was an error checking the latest version of pip.
+ python3.11 .ci/pytorch/smoke_test/smoke_test.py --package torchonly
Traceback (most recent call last):
  File "/usr/local/lib64/python3.11/site-packages/torch/__init__.py", line 333, in _load_global_deps
    ctypes.CDLL(global_deps_lib_path, mode=ctypes.RTLD_GLOBAL)
  File "/usr/lib64/python3.11/ctypes/__init__.py", line 376, in __init__
    self._handle = _dlopen(self._name, mode)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: libcudart.so.13: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/pytorch/pytorch/.ci/pytorch/smoke_test/smoke_test.py", line 12, in <module>
    import torch
  File "/usr/local/lib64/python3.11/site-packages/torch/__init__.py", line 425, in <module>
    _load_global_deps()
  File "/usr/local/lib64/python3.11/site-packages/torch/__init__.py", line 383, in _load_global_deps
    _preload_cuda_deps(lib_folder, lib_name)
  File "/usr/local/lib64/python3.11/site-packages/torch/__init__.py", line 317, in _preload_cuda_deps
    raise ValueError(f"{lib_name} not found in the system path {sys.path}")
Traceback (most recent call last):
ValueError: libnvToolsExt.so.*[0-9] not found in the system path ['/pytorch/pytorch/.ci/pytorch/smoke_test', '/usr/lib64/python311.zip', '/usr/lib64/python3.11', '/usr/lib64/python3.11/lib-dynload', '/usr/local/lib64/python3.11/site-packages', '/usr/local/lib/python3.11/site-packages', '/usr/lib64/python3.11/site-packages', '/usr/lib/python3.11/site-packages']
  File "/home/ec2-user/actions-runner/_work/test-infra/test-infra/test-infra/.github/scripts/run_with_env_secrets.py", line 102, in <module>
    main()
  File "/home/ec2-user/actions-runner/_work/test-infra/test-infra/test-infra/.github/scripts/run_with_env_secrets.py", line 98, in main
    run_cmd_or_die(f"docker exec -t {container_name} /exec")
  File "/home/ec2-user/actions-runner/_work/test-infra/test-infra/test-infra/.github/scripts/run_with_env_secrets.py", line 39, in run_cmd_or_die
    raise RuntimeError(f"Command {cmd} failed with exit code {exit_code}")
RuntimeError: Command docker exec -t 7d9c5bd403cac9a9ee824d63a1d6f6057ecce89a7daa94a81617dbf8eff0ff2e /exec failed with exit code 1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164870
Approved by: https://github.com/Camyll

Co-authored-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com>
2025-10-07 22:52:53 +00:00
d1a62c8036 [BE][Ez]: Enable RUF007 Prefer itertools.pairwise over zip slicing (#164856)
Now that our min version is 3.10 we can support this rule. This is more concise, readable, and efficient than the previous zip slicing.
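For reference, the pattern the rule rewrites:
```python
from itertools import pairwise

xs = [1, 2, 4, 8]

old = list(zip(xs[:-1], xs[1:]))  # previous zip-slicing style
new = list(pairwise(xs))          # preferred since Python 3.10 (RUF007)
assert old == new == [(1, 2), (2, 4), (4, 8)]
```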

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164856
Approved by: https://github.com/williamwen42
2025-10-07 22:51:17 +00:00
6861a27062 Reland vision pinned commit hash update (#164492)
Redo https://github.com/pytorch/pytorch/pull/154694

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164492
Approved by: https://github.com/yangw-dev
2025-10-07 22:45:05 +00:00
955f21dc2c [ROCm][CI] Add support for gfx1100 in rocm workflow + test skips (#148355)
This PR adds infrastructure support for gfx1100 in the rocm workflow. Nodes have been allocated for this effort.
@dnikolaev-amd contributed all the test skips.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148355
Approved by: https://github.com/jeffdaily

Co-authored-by: Dmitry Nikolaev <dmitry.nikolaev@amd.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-07 22:36:25 +00:00
9f5e1beaf3 [multi-kernel] base tensor sizes for shape cache key (#164499)
to match shape key in 3ca09d65f1/torch/_inductor/select_algorithm.py (L3571)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164499
Approved by: https://github.com/ColinPeppler
2025-10-07 21:27:40 +00:00
2e027e8742 [inductor] Improve bound on the number of dims to match for the block (#163755)
- Removes redundant broadcast code when `len(kernel.range_tree_nodes)` is much larger than `len(range_tree.nodes)`. For example:
```python
# before, the broadcast is to [1, 1, XBLOCK, R0_BLOCK]
tmp0 = tl.reshape(tl.broadcast_to(tl.load(block_ptr0, boundary_check=[2], padding_option='zero', eviction_policy='evict_last')[:, None, :, :], [(511 + XBLOCK) // 512, ((1) * ((1) <= ((511 + XBLOCK) // 512)) + ((511 + XBLOCK) // 512) * (((511 + XBLOCK) // 512) < (1))), ((512) * ((512) <= (XBLOCK)) + (XBLOCK) * ((XBLOCK) < (512))), R0_BLOCK]), [XBLOCK, R0_BLOCK])
# after
tmp0 = tl.reshape(tl.load(block_ptr0, boundary_check=[2], padding_option='zero', eviction_policy='evict_last'), [XBLOCK, R0_BLOCK])
```
- Fix: also save range_tree_nodes per subgraph

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163755
Approved by: https://github.com/eellison, https://github.com/blaine-rister
2025-10-07 21:02:37 +00:00
1e42fde45e Revert "[CUDA] Add experimental green context support for SM carveout (#159104)"
This reverts commit 746fe78ecd52f3e9cfddda41f0ac82dada7bdd0b.

Reverted https://github.com/pytorch/pytorch/pull/159104 on behalf of https://github.com/malfet due to Breaks Windows CD build ([comment](https://github.com/pytorch/pytorch/pull/159104#issuecomment-3378675515))
2025-10-07 20:51:22 +00:00
f505caa71b Revert "multimem reduce (#164517)"
This reverts commit d1cbb74fb16406488a174832e1b58b7c242f418d.

Reverted https://github.com/pytorch/pytorch/pull/164517 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/164517#issuecomment-3378529654))
2025-10-07 20:12:38 +00:00
65f10becdf Support OVERLAP_F_B in schedule (#161072)
Previously, we converted `overlap_f_b` into separate forward and backward operations in the plan. This is a small change that includes it in the plan and handles it in the runtime.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161072
Approved by: https://github.com/fegin, https://github.com/wconstab
2025-10-07 19:55:10 +00:00
df640df68a Revert "Reapply "C++-accessible Placements via pybind11 (#163030)" (#164519)"
This reverts commit 8c0bc879b97bc580aaa0777b2d266bdd068cb528.

Reverted https://github.com/pytorch/pytorch/pull/164519 on behalf of https://github.com/malfet due to Still breaks internal workflows ([comment](https://github.com/pytorch/pytorch/pull/164519#issuecomment-3378469432))
2025-10-07 19:46:17 +00:00
4c3c0ef2f1 [precompile] Load source cache for AOT compile as well. (#164773)
Add source_get_cache to the AOT compile case as well. Since the guard manager loader code can be shared between AOT and caching, we added a new function load_guard_manager to avoid code duplication between the two workflows for loading guards.

Test Plan: test_guard_serialization.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164773
Approved by: https://github.com/yiming0416, https://github.com/dolpm
2025-10-07 18:47:09 +00:00
bc33b10202 fix copy_ for scalar in inductor (#164167)
Fixes #158437

### Summary

- TorchInductor was not properly handling scalar copy operations (`tensor.copy_(scalar_value)`)
- Ensured scalar sources are converted to appropriate tensor representations with correct dtype and device

### Impact

- Enables compilation of models using `tensor.copy_(scalar)` patterns (see the sketch after this list)
- module: inductor
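A minimal repro-style sketch of the pattern this enables (assuming eager already accepts the scalar form, per the summary above):
```python
import torch

def fill_with(t: torch.Tensor, value: float) -> torch.Tensor:
    # The tensor.copy_(scalar) pattern that previously tripped up Inductor.
    t.copy_(value)
    return t

compiled = torch.compile(fill_with)
out = compiled(torch.empty(4), 3.0)
print(out)  # tensor([3., 3., 3., 3.])
```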

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164167
Approved by: https://github.com/shunting314
2025-10-07 18:31:37 +00:00
2855a045b3 Use sym_eq and sym_and on symbolic shapes in common_meta_baddbmm_bmm (#164781)
Differential Revision: D84005053

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164781
Approved by: https://github.com/Skylion007
2025-10-07 18:25:00 +00:00
9ecd092bd9 Add python bindings for NCCL CTA policies (#164309)
NCCLConfig can now be constructed with non-default [cta policies][1]

```python
import torch
from torch.distributed import ProcessGroupNCCL as nccl

config = nccl.NCCLConfig()
config.cta_policy = nccl.NCCL_CTA_POLICY_ZERO  # NCCL version >= 2.28
```

[1]: https://docs.nvidia.com/deeplearning/nccl/archives/nccl_2283/user-guide/docs/api/flags.html#nccl-communicator-cta-policy-flags

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164309
Approved by: https://github.com/eqy
2025-10-07 18:16:20 +00:00
078d475d3b move partition and compiler fns from stage 1 to stage 2 (#164765)
Differential Revision: D83995689

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164765
Approved by: https://github.com/zhxchen17
2025-10-07 18:02:03 +00:00
f37a6523ef Move version.h to torch/headeronly (#164381)
Differential Revision: [D83685392](https://our.internmc.facebook.com/intern/diff/D83685392)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164381
Approved by: https://github.com/janeyx99
2025-10-07 17:47:30 +00:00
b13cd141b3 Add pyrefly suppressions (#164748)
Adds suppressions so that pyrefly will typecheck clean: https://github.com/pytorch/pytorch/issues/163283

Test plan:
dmypy restart && python3 scripts/lintrunner.py -a
pyrefly check

step 1: delete lines in the pyrefly.toml file from the `project-excludes` field
step 2: run pyrefly check
step 3: add suppressions, clean up unused suppressions
before: https://gist.github.com/maggiemoss/4b3bf2037014e116bc00706a16aef199

after:

0 errors (4,263 ignored)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164748
Approved by: https://github.com/oulgen
2025-10-07 17:31:18 +00:00
5e47b4dd60 Remove device_id param from DeviceCachingAllocator::malloc (#164798)
The `malloc` call in DeviceCachingAllocator accepts a DeviceIndex param, which
can be confusing because the allocator can only allocate memory for the device
it corresponds to. This associated device is fixed at construction time,
so the runtime param can be misleading.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164798
Approved by: https://github.com/ngimel, https://github.com/cyyever, https://github.com/eqy
2025-10-07 16:42:04 +00:00
ee5389d520 Enable batch samples in sparse tests (#164677)
The test cases are enabled because the issue was fixed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164677
Approved by: https://github.com/albanD
2025-10-07 15:58:37 +00:00
ab01a0d7d3 Add memory estimator (#164738)
Original work by @ShatianWang, with lints applied. I am going to make a few changes and add tests in subsequent PRs, but I want to preserve the original commit first.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164738
Approved by: https://github.com/IvanKobzarev
ghstack dependencies: #164568, #164569, #164581
2025-10-07 15:32:27 +00:00
801e282f39 [dynamo] Support torch.fx.traceback.annotate (#164678)
Builds on top of https://github.com/pytorch/pytorch/pull/163673 and https://github.com/pytorch/pytorch/pull/164174. This will be used in the followup PRs to apply regional inductor compilation.

The existing implementation let Dynamo trace into `torch.fx.traceback.annotate`, but that's not what we want. We want Dynamo to essentially run the torch.fx.traceback.annotate function in eager, so that every FX node created in the Dynamo FX graph has the custom meta node.

This does not work with graph breaks yet. But we can solve that problem, if needed, in a separate PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164678
Approved by: https://github.com/SherlockNoMad, https://github.com/jansel, https://github.com/xmfan
2025-10-07 14:54:26 +00:00
87c9fbda22 Follow up to PR 163980 for s390x (#164464)
Now with same updates propagated to s390x it works on s390x runners too.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164464
Approved by: https://github.com/atalman
2025-10-07 12:02:29 +00:00
3cc8af2d67 torch.topk: refactor global histogram/cumsum into a dedicated kernel to eliminate redundant memory access (#164459)
# TLDR
This PR removes the regression in torch.topk introduced in torch 2.7.0 and delivers much better performance for large inputs.

The table below reports execution times on H20 for various input sizes with float32 data, extracting the top-100 values. Results indicate that this PR restores and improves performance, especially on large inputs.
| Input Shape    | torch2.6.0 (ms) | torch2.8.0 (ms) | 2.8.0+this PR (ms) |
| -------------- | --------------- | --------------- | ------------------ |
| (1, 1B)        | 36.6            | 1564.1          | 25.6               |
| (1, 100M)      | 3.56            | 17.4            | 2.54               |
| (1, 1,000,000) | 0.135           | 0.145           | 0.098              |
| (512, 128000)  | 1.33            | 1.33            | 1.32               |
| (8192, 128000) | 19.6            | 19.6            | 19.4               |
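A rough timing sketch along the lines of the (1, 100M) row above (assumes a CUDA device is available; absolute numbers depend on the GPU, and the table was measured on an H20):
```python
import time

import torch

x = torch.randn(1, 100_000_000, device="cuda")
torch.cuda.synchronize()
start = time.perf_counter()
values, indices = torch.topk(x, k=100, dim=-1)
torch.cuda.synchronize()
print(f"topk took {(time.perf_counter() - start) * 1e3:.2f} ms")
```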

# Background
After upgrading PyTorch from 2.6.0 to 2.7.0, we observed a significant GPU performance regression in `torch.topk` on NVIDIA GPUs. For instance, extracting the top-1000 largest values from one billion floats on an NVIDIA H20 increased from **36 ms** to **1.6 s**.

Profiling with Nsight Compute indicates that the slowdown is caused by redundant memory accesses introduced in [PR #145536](https://github.com/pytorch/pytorch/pull/145536).

# Analysis

`torch.topk` relies on **RadixSelect** to find the target values. Each radix pass requires computing a histogram of the input values. For large inputs, histogram computation is split into two stages:

1. **Local histogram**: Each CUDA block processes a subset of the input and writes its local histogram to global memory.
2. **Global reduction**: A single CUDA block reads all local histograms from global memory and reduces them into the final global histogram.

Before [PR #145536](https://github.com/pytorch/pytorch/pull/145536), both stages ran inside a single kernel (`radixFindKthValues`), using a semaphore to ensure that all local histograms were completed before reduction.

In PR #145536, the global histogram computation was merged with subsequent top-k calculations into a single kernel (`computeBlockwiseKthCounts`) to avoid the semaphore. While this simplifies synchronization, it introduces **redundant memory reads**:

- `computeBlockwiseKthCounts` launches `numInputSlices * blocks_per_slice` blocks.
- For each row (slice), `blocks_per_slice` CUDA blocks redundantly reload the same local histograms from global memory.

# This PR

To address this inefficiency, we introduce the following optimizations:

1. **Dedicated kernel**: Refactor global histogram and cumsum computation into a separate GPU kernel, `computeDigitCumSum`.
2. **Loop unrolling**: Apply loop unrolling in `computeDigitCumSum` to speed up local histogram reads.

# Performance
We benchmarked torch.topk on NVIDIA H20 with float32 inputs, extracting the top-100 values across different input sizes. The results in the table below demonstrate that this PR effectively eliminates the performance regression introduced in 2.7.0 and delivers substantial improvements on large inputs.

| Input Shape    | torch2.6.0 (ms) | torch2.8.0 (ms) | 2.8.0+this PR (ms) |
| -------------- | --------------- | --------------- | ------------------ |
| (1, 1B)        | 36.6            | 1564.1          | 25.6               |
| (1, 100M)      | 3.56            | 17.4            | 2.54               |
| (1, 1,000,000) | 0.135           | 0.145           | 0.098              |
| (512, 128000)  | 1.33            | 1.33            | 1.32               |
| (8192, 128000) | 19.6            | 19.6            | 19.4               |

I have also verified the correctness of this PR with different inputs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164459
Approved by: https://github.com/ngimel, https://github.com/Skylion007
2025-10-07 11:04:03 +00:00
1fb072ac2a exceptions + unit tests (#164550)
Test Plan:
```
buck test fbcode//mode/opt caffe2/test/inductor:caching
```

Reviewed By: aorenste

Differential Revision: D83714688

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164550
Approved by: https://github.com/aorenste
2025-10-07 10:04:58 +00:00
cac5e13e13 [dynamo] Inline nn module calls using __call__ methods (#164817)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164817
Approved by: https://github.com/SherlockNoMad, https://github.com/mlazos
2025-10-07 08:57:20 +00:00
68350660ee Increase timeout for nightly macOS performance tests to 300 minutes (#164793)
The Test step time recently went up slightly.

Hopefully this fixes https://github.com/pytorch/alerting-infra/issues/263
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164793
Approved by: https://github.com/seemethere
2025-10-07 08:44:07 +00:00
ef7e2ca77e remove check_is_size from test_misc.py (#164667)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164667
Approved by: https://github.com/angelayi
ghstack dependencies: #164664, #164665
2025-10-07 07:33:50 +00:00
cdaaf3e4a3 remove size-like based size-oblivious special max simplifications (#164665)
Since we removed guard_size_oblivious, this simplification is no longer relevant. This is part of
the deprecation process for guard_size_oblivious and its dependencies.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164665
Approved by: https://github.com/aorenste
ghstack dependencies: #164664
2025-10-07 07:33:50 +00:00
0ea59c3c55 do not suggest torch._check_is_size() (#164664)
The size-like concept for data dependency is no longer relevant since we removed all guard_size_oblivious calls.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164664
Approved by: https://github.com/angelayi, https://github.com/mlazos
2025-10-07 07:33:50 +00:00
8f705d019a context + unit tests (#164549)
Summary:
The context module provides configurable context selection + isolation key hashing.

Context selection is broken into runtime and compile context. Runtime context is decided at call time (inductor configs, precision configs, etc.) and compile context is decided at compile time (hardware type, software hashes).

Callees will be given access to SelectedRuntimeContext and SelectedCompileContext, which they can use to determine and select what context is necessary with regard to the function being cached.

These selected contexts are wrapped in an IsolationSchema, which denotes what context should be taken into consideration when producing an isolation key. The isolation key is essentially a salt of the function signature key: it says that a given function-signature-key result is valid under a given context (isolation schema).
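A hypothetical sketch of the isolation-key idea described above (names and hashing scheme are assumptions, not the module's actual code): the function-signature key is salted with the selected runtime and compile context.
```python
import hashlib
import json

def isolation_key(signature_key: str, runtime_ctx: dict, compile_ctx: dict) -> str:
    # Serialize the selected contexts deterministically and salt the
    # function-signature key with them, so a cached result is only reused
    # under the same isolation schema.
    payload = json.dumps(
        {"runtime": runtime_ctx, "compile": compile_ctx}, sort_keys=True
    )
    return hashlib.sha256((signature_key + payload).encode()).hexdigest()

key = isolation_key(
    "fn_sig:abc123",
    runtime_ctx={"max_autotune": True},
    compile_ctx={"device": "cuda", "arch": "sm90"},
)
```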

Test Plan:
```
buck test fbcode//mode/opt caffe2/test/inductor:caching
```

Reviewed By: aorenste

 D83714689

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164549
Approved by: https://github.com/aorenste
2025-10-07 06:02:10 +00:00
4bcc05777e [torchfuzz] synthesize inputs for data dependent ops (#164716)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164716
Approved by: https://github.com/pianpwk
ghstack dependencies: #164432, #164434, #164514, #164646, #164647, #164649, #164687, #164688, #164693, #164694, #164715
2025-10-07 05:40:32 +00:00
2a6cdba6e5 [torchfuzz] various edge case fixes (#164715)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164715
Approved by: https://github.com/pianpwk
ghstack dependencies: #164432, #164434, #164514, #164646, #164647, #164649, #164687, #164688, #164693, #164694
2025-10-07 05:30:46 +00:00
53f6cc7529 [torchfuzz] make ops_fuzzer deterministic (#164694)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164694
Approved by: https://github.com/pianpwk
ghstack dependencies: #164432, #164434, #164514, #164646, #164647, #164649, #164687, #164688, #164693
2025-10-07 05:30:46 +00:00
ac901bf79a [torchfuzz] consolidate on a base implementation of args_codegen (#164693)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164693
Approved by: https://github.com/pianpwk
ghstack dependencies: #164432, #164434, #164514, #164646, #164647, #164649, #164687, #164688
2025-10-07 05:20:28 +00:00
c965d6dbb2 [torchfuzz] move into experimental dir (#164688)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164688
Approved by: https://github.com/pianpwk
ghstack dependencies: #164432, #164434, #164514, #164646, #164647, #164649, #164687
2025-10-07 05:09:08 +00:00
ac08556f67 [torchfuzz] support more unbacked functions (#164687)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164687
Approved by: https://github.com/pianpwk
ghstack dependencies: #164432, #164434, #164514, #164646, #164647, #164649
2025-10-07 05:00:03 +00:00
5fe7f29b9e [torchfuzz] add support for operator weights (#164649)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164649
Approved by: https://github.com/pianpwk
ghstack dependencies: #164432, #164434, #164514, #164646, #164647
2025-10-07 05:00:03 +00:00
ded099ecbf [torchfuzz] don't use the first gpu in multi process fuzzer (#164647)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164647
Approved by: https://github.com/pianpwk
ghstack dependencies: #164432, #164434, #164514, #164646
2025-10-07 04:59:56 +00:00
63fcc3e6c4 [torchfuzz] update README.md (#164646)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164646
Approved by: https://github.com/pianpwk
ghstack dependencies: #164432, #164434, #164514
2025-10-07 04:59:50 +00:00
fd3e15c14f Fix typo in class definition of bytecodedispatchtable (#164762)
ghstack-source-id: 84f0d7bb7e3780ca75473782abfae530010be56e
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164761

Fixes the typo in the naming of bytecodedispatchtable

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164762
Approved by: https://github.com/StrongerXi, https://github.com/williamwen42
2025-10-07 04:36:09 +00:00
ff5faa744a Remove unused THPXXX macros (#164660)
These macros are not used in OSS.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164660
Approved by: https://github.com/albanD
2025-10-07 04:04:21 +00:00
4725871a81 Return fake mode from export graph capture API (#164730)
This PR temporarily unblocks various experiments that re-use the fake mode Dynamo creates. Note that this is still not what we want as the end state. The end state should look something like:
```
out = fulllgraph_capture(mod, inputs)
fake_mode = out.backend_inputs.fake_mode
gm  = out.module()
```
This doesn't work today because export requires wrapping the original module to set up a flat module to trace, for easier pytree handling. As a result, we would need to carry an export-specific flag in fullgraph_capture, which seems not ideal.
Regardless, the end state is that we need to give the downstream user a graph module and a fake mode in some form, so I think having _dynamo_graph_capture_for_export return the fake mode within the graph module itself via gm.meta is acceptable for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164730
Approved by: https://github.com/avikchaudhuri
2025-10-07 03:42:46 +00:00
bcd96cc6ff [annotate] Copy fwd to bwd metadata for subgraphs as well (#164795)
The test is in the next PR. My older PR on dynamo annotate (https://github.com/pytorch/pytorch/pull/164678) is getting reverted for unknown reasons, so it is difficult to add a test in this PR right now. When I reland, I can add a test for this as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164795
Approved by: https://github.com/yushangdi
ghstack dependencies: #164772
2025-10-07 02:42:47 +00:00
50e077beaa Fix outdated info in requirements-ci.txt (#164441)
Fixes installation instructions and descriptions for `numba` and `scikit-image`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164441
Approved by: https://github.com/albanD
2025-10-07 02:10:41 +00:00
56d66ac0d7 Make custom op alias check consistent (#164576)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164576
Approved by: https://github.com/soulitzer
2025-10-07 02:05:09 +00:00
49f7d8d19d [ROCm] Fix test_cuda_synchronize failure on ROCm (#164735)
This PR skips the hipify step of torch/csrc/jit/ir/ir.h to avoid a build-time error for the JIT cuda namespace.  This fixes two skipped tests in test/jit/test_cuda.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164735
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-07 01:14:24 +00:00
afee8062d5 Revert "Fix mesh.get_local_rank when it is > 1d (#164473)"
This reverts commit 83d71dfb2fd993a6242372b8123549acaa85ffdb.

Reverted https://github.com/pytorch/pytorch/pull/164473 on behalf of https://github.com/izaitsevfb due to appears to be causing vision_maskrcnn regression ([comment](https://github.com/pytorch/pytorch/pull/164473#issuecomment-3374738997))
2025-10-07 00:37:41 +00:00
e89d12bf5d Numpy zerotensor handling (#164487)
Fixes #89034

Updated tensor_to_numpy() function in tensor_numpy.cpp to handle ZeroTensors by throwing an error if force=False and returning an array full of zeros if force=True.
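A small usage sketch of the new behavior (the private `_efficientzerotensor` factory is used here only to obtain a ZeroTensor, and the exact exception type is an assumption):
```python
import torch

zt = torch._efficientzerotensor((2, 3))  # ZeroTensor with no real storage

print(zt.numpy(force=True))  # returns a NumPy array full of zeros

try:
    zt.numpy()               # force=False: now raises a clear error
except Exception as e:
    print("expected error:", e)
```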

@ngimel, I just saw that you mentioned PyTorch is not too concerned with this issue but I had already worked on it so I figured I would push it anyways and see what you thought. Feel free to close the PR if you think it is not worth merging.

@albanD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164487
Approved by: https://github.com/izaitsevfb
2025-10-07 00:34:14 +00:00
d4752bc7f6 [caffe2] tweak Unpickler::readInstruction handling TUPLE (#164764)
Summary: Creating the vector was a bit awkward. Use the natural iterator-pair constructor with move-iterators.

Test Plan: CI.

Reviewed By: dolpm

Differential Revision: D83995108

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164764
Approved by: https://github.com/drisspg
2025-10-07 00:18:10 +00:00
44a5d41993 [ROCm] add gfx1150 gfx1151 to supported gemm lists (#164744)
This is one of a few PRs needed to address https://github.com/pytorch/pytorch/pull/164744 fully.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164744
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-07 00:02:23 +00:00
361c5d362c [fx][traceback] Actually disable preservation of node metadata when enable=False (#164772)
This will come in handy when we run graph passes that add new nodes, and
create_proxy can add seq_nr meta.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164772
Approved by: https://github.com/SherlockNoMad
2025-10-06 23:39:12 +00:00
1fc71d1b57 Revert "Numpy zerotensor handling (#164487)"
This reverts commit f7ad6dbad67161333a1473d1e0b478b7475a0ec1.

Reverted https://github.com/pytorch/pytorch/pull/164487 on behalf of https://github.com/malfet due to Did it break torchbench?, see 8c728e129d/1 ([comment](https://github.com/pytorch/pytorch/pull/164487#issuecomment-3374635051))
2025-10-06 23:32:12 +00:00
8f54e27e5d [ROCm][CI] rebuild magma binary for gfx1150 gfx1151 (#164782)
After #164763 added gfx1150 gfx1151 to list of targets, this PR will trigger rebuild of magma binary for ROCm 7 with the new targets.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164782
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-06 23:29:21 +00:00
8c0bc879b9 Reapply "C++-accessible Placements via pybind11 (#163030)" (#164519)
This makes Placement data representation available in C++ via pybind11. Reapply with fix for internal errors.

D83788896

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164519
Approved by: https://github.com/Skylion007, https://github.com/ezyang
2025-10-06 23:19:14 +00:00
746fe78ecd [CUDA] Add experimental green context support for SM carveout (#159104)
Low-level PyTorch APIs should be usable/stable enough at this point but we might move the underlying driver API usage a bit from here...

Built on top of @drisspg 's branch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159104
Approved by: https://github.com/ngimel

Co-authored-by: drisspg <drisspguessous@gmail.com>
2025-10-06 23:11:23 +00:00
b63bbe1661 Remove old ROCm version check in tests (#164245)
This PR removes ROCm<6 version checks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164245
Approved by: https://github.com/jeffdaily
2025-10-06 22:42:01 +00:00
3912ba3e94 Revert "Fix refine_ranges corner case (#164075)"
This reverts commit 27234792add2ee9bedd84ca02dbf34f8f244bc5c.

Reverted https://github.com/pytorch/pytorch/pull/164075 on behalf of https://github.com/izaitsevfb due to fails executorch builds, see [D83938444](https://www.internalfb.com/diff/D83938444) ([comment](https://github.com/pytorch/pytorch/pull/164075#issuecomment-3374430964))
2025-10-06 22:09:39 +00:00
cfc5cc17dc Revert "[dynamo] Support torch.fx.traceback.annotate (#164678)"
This reverts commit 2883b5ab773daf5861d43ff0b65be49a441ab3f9.

Reverted https://github.com/pytorch/pytorch/pull/164678 on behalf of https://github.com/izaitsevfb due to fails inductor:max_autotune tests internally, see D83948169 ([comment](https://github.com/pytorch/pytorch/pull/164678#issuecomment-3374407009))
2025-10-06 22:03:42 +00:00
fdc8ccc5bc Make Adam, AdamW work with nonzero-dim Tensor betas (#149939)
Fixes #147921

## Changes

- Convert tensor `betas` using `_to_scalar` (see the usage sketch after this list)
- Change annotation of `betas` param
- Change param type in docs
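A minimal usage sketch of what the change enables (model and shapes are arbitrary):
```python
import torch

model = torch.nn.Linear(4, 2)
# Nonzero-dim, single-element tensor betas are now accepted and converted
# internally via _to_scalar.
opt = torch.optim.Adam(
    model.parameters(),
    lr=1e-3,
    betas=(torch.tensor([0.9]), torch.tensor([0.999])),
)
loss = model(torch.randn(8, 4)).sum()
loss.backward()
opt.step()
```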

## Test Result

```bash
pytest -s test/test_optim.py -k test_tensor_lr -vv
```

![image](https://github.com/user-attachments/assets/312ee045-1e8b-4789-aa6e-ba63e6df7e81)

![image](https://github.com/user-attachments/assets/7e6ec274-645b-46b9-b1a6-2b340a685203)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149939
Approved by: https://github.com/janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-10-06 22:03:25 +00:00
48b54b45d6 Replace pynvml with nvidia-ml-py in win-test.sh (#164681)
pynvml was deprecated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164681
Approved by: https://github.com/Aidyn-A, https://github.com/eqy
2025-10-06 21:57:26 +00:00
6861fa43e5 [CUDA] Cleanup persistent cuBLASLt workspaces before compile-regions test (#163299)
Fixes some tests that seemed to start flaking out as reported in #163202, due to cuBLASLt workspaces becoming persistent following that change.

It's relatively obvious why the workspaces/allocations corresponding to them should be cleaned up for `test_memory_snapshot_script`, but less obvious for `test_memory_plots_free_segment_stack`. Why does not cleaning up the workspace prevent `empty_cache` from showing up?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163299
Approved by: https://github.com/albanD
2025-10-06 21:13:03 +00:00
c1f40d33c8 Fix docker build issue after 164575 (#164774)
Looks like https://github.com/pytorch/pytorch/pull/164575 introduced an issue.
The command is wrong:
```
conda install -c "whl/nightly" -y python=3.11 conda=25.7.0
```
Should be just using default conda channel:
```
conda install  -y python=3.11 conda=25.7.0
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164774
Approved by: https://github.com/Camyll
2025-10-06 20:28:20 +00:00
7e7ac2039d [ROCm][CI] add gfx1150 gfx1151 to almalinux image (#164763)
First PR necessary to address missing gfx1151 reported in https://github.com/pytorch/pytorch/issues/164346.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164763
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-06 20:19:43 +00:00
23ab6a45e5 [precompile][ez] Add instrumentation for guard loading/building. (#164602)
Summary: as title.

Test Plan: CI

Differential Revision: D83868533

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164602
Approved by: https://github.com/dolpm
2025-10-06 20:16:09 +00:00
b558c986e8 Add regression test for get_root_mesh with multiple independent meshes (#164731)
Fixes #163330

I tried to reproduce the bug with my 4-GPU setup (the original issue used 8 GPUs). I created several different test scenarios, trying to trigger the bug by:
- creating two different device meshes
- slicing them in various ways
- checking if get_root_mesh() would get confused

but the bug didn't show up! Everything worked correctly in `2.10`. I found that there was a massive refactoring of the `DeviceMesh` code (PR #163213) that landed on October 2nd. That PR completely rewrote how `DeviceMesh` tracks relationships between parent meshes and submeshes. It seems like this refactoring fixed the bug, but I added a regression test to make sure it doesn't come back. The test (`test_get_root_mesh_multiple_independent_meshes`) does exactly what the bug report described:
  - creates two independent meshes
  - slices them both
  - verifies that each submesh correctly points back to its real parent
  - makes sure submeshes from mesh1 don't incorrectly claim mesh2 as their parent

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164731
Approved by: https://github.com/fduwjj
2025-10-06 18:52:25 +00:00
415e641572 Limit path search within range (#164581)
When checking whether two nodes are dependent, limit the path search to the range bounded by their node indices.
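
An illustrative sketch of the pruning idea in plain Python, not the inductor implementation: in a topologically ordered graph, any path between two nodes can only pass through nodes whose indices lie between the endpoints, so the search can discard everything outside that window. The `users` map below is a stand-in for the real graph structure:

```python
# Illustrative sketch (not the actual inductor code): when checking whether
# `src` can reach `dst`, only walk through nodes whose topological index lies
# between the two endpoints, since any path must stay inside that window.
def depends_on(src: int, dst: int, users: dict[int, list[int]]) -> bool:
    """`users[i]` lists the node indices that consume node i's output."""
    lo, hi = min(src, dst), max(src, dst)
    stack, seen = [src], {src}
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        for nxt in users.get(node, []):
            # Prune: a path from src to dst can never leave [lo, hi].
            if lo <= nxt <= hi and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False

# Example graph: 0 -> 1 -> 3 and 2 -> 3
users = {0: [1], 1: [3], 2: [3]}
assert depends_on(0, 3, users)        # path exists within [0, 3]
assert not depends_on(1, 2, users)    # node 3 lies outside [1, 2], so it is pruned
```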

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164581
Approved by: https://github.com/ezyang
ghstack dependencies: #164568, #164569
2025-10-06 18:29:27 +00:00
11f5f65686 Use PyObject_GetOptionalAttrString in PyObject_FastGetAttrString when available (#164624)
Python 3.13 added PyObject_GetOptionalAttrString. I'm not 100% certain that it is strictly better than the old approach in all cases, but based on documentation/comments it seems to be meant for exactly this kind of use, and it's faster when I profile torchtitan training (which hits the "check for the `__torch_function__` attr on some object" part of maybe_has_torch_function frequently enough to notice, and previously wasted a bunch of time generating exceptions that we then suppressed here).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164624
Approved by: https://github.com/Skylion007
2025-10-06 18:26:09 +00:00
af32d16a71 Add pure view support in autograd Function (#164736)
This is the same as https://github.com/pytorch/pytorch/pull/164467
But it needs to be co-deved due to internal insanity.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164736
Approved by: https://github.com/soulitzer
2025-10-06 18:21:05 +00:00
ba480d6bf7 torch.compile: Increase subprocess parent death check interval to lower cpu (#164594)
Summary:
This check is a good idea (we could potentially do it with prctl). However,
we're seeing elevated CPU usage in idle worker threads. This causes issues on production jobs, producing a large amount of spikiness in QPS.

Test Plan:
Tested on a prod job with caches force disabled via
TORCH_COMPILE_FORCE_DISABLE_CACHES=1

Baseline
<img width="454" height="403" alt="image" src="https://github.com/user-attachments/assets/b88583a1-5b99-48cb-b03d-cd9b69546579" />

With this diff -
<img width="426" height="403" alt="image" src="https://github.com/user-attachments/assets/431217f1-0ed0-4f6e-9d81-6428bf34e0e3" />

Differential Revision: D83803302

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164594
Approved by: https://github.com/masnesral
2025-10-06 18:15:21 +00:00
4a6abba0d9 [ROCm][CI] test_convolution.py uses miopen immediate mode (#164598)
This should help stabilize some flaky test behavior where miopen would pick different solutions for different parts of the same test and the test expects bitwise identical results.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164598
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-06 17:48:50 +00:00
96181d6f76 [BE][cutlass backend] BE changes post cutlass_cppgen name change (#164589)
Differential Revision: D83809105

Handle reviews from https://github.com/pytorch/pytorch/pull/164159

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164589
Approved by: https://github.com/Skylion007
2025-10-06 17:22:08 +00:00
2164b66121 [export] Better state_dict and constant dedup in torch.export.save (#164196)
Summary:

Previously, weight deduplication was done by simply grouping tensors with their untyped storage and saving the first tensor in the group.

A more rigorous approach would be to find a complete tensor that covers the storage and store that tensor. This is particularly important for GPU weights because when saving to raw bytes, we move the weight to CPU first, and if the weight being saved is not a complete one, it will lose the storage information during the copy to CPU.

In this diff, we reuse code in `_package_weights.py` for better weights and constants deduplication in `torch.export.save`.
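
A hedged sketch of the idea rather than the `_package_weights.py` code: group tensors by untyped storage and prefer one that covers the whole storage, so the CPU copy keeps enough bytes for every view built on it. The helper names here are made up for illustration:

```python
# Hedged sketch of the dedup idea, not the _package_weights.py implementation:
# group tensors by untyped storage and prefer one that covers the whole storage.
import torch

def covers_storage(t: torch.Tensor) -> bool:
    return (
        t.storage_offset() == 0
        and t.is_contiguous()
        and t.numel() * t.element_size() == t.untyped_storage().nbytes()
    )

def dedup_by_storage(named: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
    groups: dict[int, list[str]] = {}
    for name, t in named.items():
        groups.setdefault(t.untyped_storage().data_ptr(), []).append(name)

    kept: dict[str, torch.Tensor] = {}
    for names in groups.values():
        complete = [n for n in names if covers_storage(named[n])]
        chosen = complete[0] if complete else names[0]  # fall back to old behavior
        kept[chosen] = named[chosen]
    return kept

full = torch.randn(8)
half = full[:4]                       # a view sharing full's storage
print(dedup_by_storage({"half": half, "full": full}).keys())  # keeps only "full"
```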

Test Plan: buck2 run mode/dev-nosan caffe2/test:test_export -- -r test_weight_sharing_gpu

Differential Revision: D83523690

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164196
Approved by: https://github.com/angelayi
2025-10-06 17:03:15 +00:00
bde18c445d [Max Autotune][B200] Relax absolute tolerance for MM+MM test (#164022)
Summary: Relax absolute tolerance from 1e-2 to 1e-1 for `test_non_contiguous_input_mm_plus_mm` in `test_max_autotune.py`.

Test Plan: `test_max_autotune.py`

Differential Revision: D83391942

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164022
Approved by: https://github.com/eellison
2025-10-06 16:29:07 +00:00
f3e43ff2d7 [Max Autotune][B200] Fix decompose_k test failure (#164021)
Summary:
Fix decompose_k test failure (`test_max_autotune_decompose_k`) in `test_max_autotune.py` on B200s by setting `torch._inductor.config` patches for variables `comprehensive_padding` and `shape_padding`. Initial failure was `AssertionError: False is not true : Could not find a split in {3, 9, 2187, 81, 243, 729, 27} in # AOT ID: ['6_forward']`.

Refactor decompose_k test to follow patch semantics when setting all environment variables within a test.
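
A hedged sketch of that patch semantics; `torch._inductor.config.patch` is the standard mechanism, but the concrete values for `comprehensive_padding`/`shape_padding` below are assumptions for illustration, not necessarily what the fixed test sets:

```python
# Hedged sketch of patch semantics inside a test; the values below are
# assumptions for illustration, not necessarily what the fixed test sets.
import torch
import torch._inductor.config as inductor_config

@inductor_config.patch(comprehensive_padding=False, shape_padding=False)
def run_decompose_k_case():
    # The patched values apply only while this function runs and are rolled
    # back on exit, so they cannot leak into other tests in the same process.
    a = torch.randn(32, 2187)
    b = torch.randn(2187, 32)
    return torch.compile(lambda x, y: x @ y)(a, b)

out = run_decompose_k_case()
```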

Test Plan:
`test_max_autotune.py`:
```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:max_autotune -c fbcode.nvcc_arch=b200a -c fbcode.enable_gpu_sections=true -c fbcode.platform010_cuda_version=12.8 -c fbcode.re_gpu_tests=False -- test_max_autotune_decompose_k
```

Differential Revision: D83390563

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164021
Approved by: https://github.com/njriasan, https://github.com/mlazos, https://github.com/eellison
2025-10-06 16:28:23 +00:00
39d0c06ed0 [torchfuzz] check in some more xfail repros (#164619)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164619
Approved by: https://github.com/ezyang
2025-10-06 16:20:44 +00:00
4ab847bbc7 Pyrefly suppressions 4/n (#164615)
Adds suppressions so that pyrefly will typecheck clean: https://github.com/pytorch/pytorch/issues/163283

Test plan:
dmypy restart && python3 scripts/lintrunner.py -a
pyrefly check

step 1: uncomment lines in the pyrefly.toml file
step 2: run pyrefly check
step 3: add suppressions, clean up unused suppressions
before: https://gist.github.com/maggiemoss/356645cf8cfe33123d9a27f23b30f7b1

after:

0 errors (2,753 ignored)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164615
Approved by: https://github.com/oulgen
2025-10-06 16:14:36 +00:00
4bd1505f84 [precompile][ez] Inline type definition for dynamo cache entry. (#164580)
Summary: as title. DynamoCaptureOutput in package.py is not actively used in other files. Inline it to reduce confusion.

Test Plan: CI

Differential Revision: D83846957

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164580
Approved by: https://github.com/dolpm
2025-10-06 16:00:59 +00:00
1f9614cef8 [ROCm][CI] Change rocm periodic workflow label to linux.rocm.gpu.mi250.4 (#164616)
Testing done on this PR: https://github.com/pytorch/pytorch/pull/156491

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164616
Approved by: https://github.com/jeffdaily, https://github.com/huydhn
2025-10-06 15:51:07 +00:00
35f66b83f8 respect aten planned overlap in inductor (#164569)
Now that we have a hop to add implicit deps, use those deps for comm/compute overlap.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164569
Approved by: https://github.com/ezyang, https://github.com/IvanKobzarev
ghstack dependencies: #164568
2025-10-06 15:47:55 +00:00
4a39820e5e Add hop for additional control dependencies (#164568)
Adds [control_deps](https://en.wikipedia.org/wiki/Control_dependency) higher-order operator to enforce explicit scheduling dependencies in FX graphs. This prevents unwanted operation reordering/fusion by giving nodes additional dependencies, which we also respect in inductor by adding weakdeps on the additional dependencies.

This can be generally useful (such as for ordering collectives) but in this case I am using it so that fusions do not interfere with aten planned comm-compute overlap.

There's definitely some similarity with the `with_effects` hop. Talked with @angelayi  - when @zou3519  is back we will figure out how we want to consolidate.

The implementation needs to be a subgraph (as opposed to `with_effects`) because inductor relies on `V.graph.current_node`. Changing the signature of the node with `with_effects`  breaks this, and additionally, also breaks striding constraints on the wrapped node - see this [TODO](aed66248a0/torch/fx/experimental/proxy_tensor.py (L1246-L1249)). By maintaining the node with its original calling structure in subgraph this all works.

Example transformation:

Before:
```
%add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%arg0_1, 1), kwargs = {})
%mm : [num_users=1] = call_function[target=torch.ops.aten.mm.default](args = (%arg1_1, %arg1_1), kwargs = {})
%mul : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%add, 2), kwargs = {})
```
After:
```
add: "f32[256, 256]" = torch.ops.aten.add.Tensor(arg0_1, 1)
mm: "f32[256, 256]" = torch.ops.higher_order.control_deps((add,), subgraph_mm, arg1_1, arg1_1)
mul: "f32[256, 256]" = torch.ops.higher_order.control_deps((mm,), subgraph_mul, add)
```

The mm operation now explicitly depends on add completing first, and mul depends on mm, with original operations preserved in subgraphs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164568
Approved by: https://github.com/ezyang, https://github.com/IvanKobzarev
2025-10-06 15:47:55 +00:00
600267ea56 Add num_store to inductor_meta and use it to scale persistent reduction x block (#162446)
Scale up XBLOCK for contiguous persistent reductions based on rnumel and number of loads + stores

<img width="928" height="656" alt="Screenshot 2025-09-18 at 5 02 57 PM" src="https://github.com/user-attachments/assets/ec3c561f-2a3f-4459-9e14-653715898da3" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162446
Approved by: https://github.com/v0i0, https://github.com/eellison, https://github.com/shunting314
ghstack dependencies: #162296
2025-10-06 14:29:07 +00:00
f11ac803d7 Update slow tests (#164726)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164726
Approved by: https://github.com/pytorchbot
2025-10-06 12:57:29 +00:00
ea42517e45 [xla hash update] update the pinned xla hash (#164727)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164727
Approved by: https://github.com/pytorchbot
2025-10-06 11:54:10 +00:00
91c211fb8c AC should work with pre-dispatch IR (#164505)
Previously we had to rely on turning off the export verifier because the AC body was torch IR instead of aten IR. This PR makes it so that we create an IR that is export-compatible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164505
Approved by: https://github.com/ydwu4, https://github.com/xmfan
2025-10-06 11:05:22 +00:00
660e369a68 [FSDP2] check storage equal and consider data_ptr() == 0 (#164595)
resolve https://github.com/pytorch/pytorch/issues/164554

unit test
* `pytest -s test/distributed/_composable/fsdp/test_fully_shard_state_dict.py -k test_cached_state_dict`
* `pytest -s test/distributed/_composable/fsdp/test_fully_shard_init.py -k test_meta_device_1d_init`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164595
Approved by: https://github.com/fegin
2025-10-06 08:44:38 +00:00
2883b5ab77 [dynamo] Support torch.fx.traceback.annotate (#164678)
Builds on top of https://github.com/pytorch/pytorch/pull/163673 and https://github.com/pytorch/pytorch/pull/164174. This will be used in the followup PRs to apply regional inductor compilation.

The existing implementation let Dynamo trace into `torch.fx.traceback.annotate`, but that's not what we want. We want Dynamo to essentially run the torch.fx.traceback.annotate function in eager, so that every FX node created in the Dynamo FX graph carries the custom node meta.

This does not work with graph breaks yet. But we can solve that problem, if needed, in a separate PR.
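
A hedged usage sketch, assuming a build that includes this change; the `inspect_backend` helper is made up for illustration, and the "custom" meta key is the one described in this stack, so treat it as an implementation detail:

```python
# Hedged usage sketch: annotate a region under torch.compile and read the
# resulting "custom" node metadata from a custom backend.
import torch
import torch.fx.traceback as fx_traceback

def inspect_backend(gm, example_inputs):
    for node in gm.graph.nodes:
        print(node.op, node.name, node.meta.get("custom"))
    return gm.forward

@torch.compile(backend=inspect_backend, fullgraph=True)
def fn(x):
    with fx_traceback.annotate({"pp_stage": 0}):
        x = x + 1
    return x * 2

fn(torch.randn(4))
```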

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164678
Approved by: https://github.com/SherlockNoMad, https://github.com/jansel, https://github.com/xmfan
2025-10-06 02:59:24 +00:00
9fff8155c3 [2/N] Fix clang-tidy readability checks (#164652)
This PR applies clang-tidy readability checks to jit sources and all headers in the code base.
`readability-redundant-inline-specifier` is suppressed because it incurs too many changes. `readability-redundant-inline-specifier` is used to detect redundant inline specifiers on function and variable declarations. There are many in-class method definitions that are marked inline.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164652
Approved by: https://github.com/Skylion007
2025-10-06 01:06:01 +00:00
331191ce4b Revert "[BE] Make PyObjectSlot use a global PyInterpreter (#162659)"
This reverts commit 29cbcbac4215e0d9070a1b7a07ddaec9a36bbd08.

Reverted https://github.com/pytorch/pytorch/pull/162659 on behalf of https://github.com/izaitsevfb due to reverted internally, see [D83214133](https://www.internalfb.com/diff/D83214133) ([comment](https://github.com/pytorch/pytorch/pull/162659#issuecomment-3369348172))
2025-10-05 21:39:57 +00:00
2c5ed6e7c0 Revert "[2/N] Fix clang-tidy readability checks (#164652)"
This reverts commit 3c5ca685d6f5b6f3971c0cd20a054aa355610419.

Reverted https://github.com/pytorch/pytorch/pull/164652 on behalf of https://github.com/izaitsevfb due to need to revert due to a conflict with revert of https://github.com/pytorch/pytorch/pull/162659 ([comment](https://github.com/pytorch/pytorch/pull/164652#issuecomment-3369346707))
2025-10-05 21:36:57 +00:00
5d7360bb03 Revert "Enable all SIM rules except disabled ones (#164645)"
This reverts commit 321e6026925f6b6e8a36e3a8b7c0295cd7541911.

Reverted https://github.com/pytorch/pytorch/pull/164645 on behalf of https://github.com/izaitsevfb due to causes lint failures ([comment](https://github.com/pytorch/pytorch/pull/164645#issuecomment-3369274351))
2025-10-05 19:32:21 +00:00
321e602692 Enable all SIM rules except disabled ones (#164645)
`SIM` rules are useful for simplifying boolean expressions and enhances code readability.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164645
Approved by: https://github.com/ezyang
2025-10-05 07:38:25 +00:00
3c5ca685d6 [2/N] Fix clang-tidy readability checks (#164652)
This PR applies clang-tidy readability checks to jit sources and all headers in the code base.
`readability-redundant-inline-specifier` is suppressed because it incurs too many changes. `readability-redundant-inline-specifier` is used to detect redundant inline specifiers on function and variable declarations. There are many in-class method definitions that are marked inline.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164652
Approved by: https://github.com/Skylion007
2025-10-05 07:05:11 +00:00
5178d0a480 [Compile] Fix Compile Warning for Capture Id (#163898)
```bash
DEBUG /data/vllm-community-homes/vllm-user-6/pytorch/aten/src/ATen/cuda/CUDAGraph.h(59): warning #68-D: integer conversion resulted in a change of sign
DEBUG     CaptureId_t capture_id_ = -1;
DEBUG                               ^
DEBUG
DEBUG Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
DEBUG
DEBUG /data/vllm-community-homes/vllm-user-6/pytorch/aten/src/ATen/cuda/CUDAGraph.h(59): warning #68-D: integer conversion resulted in a change of sign
DEBUG     CaptureId_t capture_id_ = -1;
DEBUG                               ^
DEBUG
DEBUG Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
DEBUG
DEBUG /data/vllm-community-homes/vllm-user-6/pytorch/aten/src/ATen/cuda/CUDAGraph.h(59): warning #68-D: integer conversion resulted in a change of sign
DEBUG     CaptureId_t capture_id_ = -1;
DEBUG                               ^
```

CUDA won't use 0 as a capture id, so it is safe to initialize with 0, which also matches the initialization in `pytorch/aten/src/ATen/native/cudnn/RNN.cpp:2362`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163898
Approved by: https://github.com/houseroad
2025-10-05 06:51:33 +00:00
cf0a00d4f3 Enable ruff FURB161 rule (#164654)
This PR enables FURB161 in ruff.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164654
Approved by: https://github.com/Skylion007
2025-10-04 23:26:28 +00:00
5ed4270440 remove more no longer needed torch._check_is_size calls 1 (#164630)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164630
Approved by: https://github.com/Skylion007
ghstack dependencies: #164627
2025-10-04 22:06:04 +00:00
8c728e129d remove no longer needed torch._check_is_size calls from test_dynamic_shapes (#164627)
No longer needed in those tests to prevent DDE

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164627
Approved by: https://github.com/ezyang
2025-10-04 22:06:04 +00:00
9fc2c6446d remove guard_size_oblivious from is_contiguous python eager eval path. (#164622)
Summary: this should not be needed anymore; we should already have explicit is_contiguous_or_false calls where appropriate!

Test Plan: run existing tests.

Differential Revision: D83884977

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164622
Approved by: https://github.com/bobrenjc93
2025-10-04 21:02:39 +00:00
409aece3f9 [dynamo, 3.14] prevent StackRef compilation in 3.14 Windows (#164400)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164400
Approved by: https://github.com/Camyll, https://github.com/atalman
2025-10-04 18:38:08 +00:00
b116c51330 torch.cond on DTensor triggers an internal assert, add xfail for this. (#164389)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164389
Approved by: https://github.com/albanD
2025-10-04 18:12:06 +00:00
2e1742dd63 Revert "Add device argument to torch.random.get_rng_state (#163034)"
This reverts commit 9580539e2f73d68e89544c713ff460bea3038701.

Reverted https://github.com/pytorch/pytorch/pull/163034 on behalf of https://github.com/cyyever due to It cased partially initialised torch module ([comment](https://github.com/pytorch/pytorch/pull/163034#issuecomment-3368349209))
2025-10-04 15:25:45 +00:00
f7ad6dbad6 Numpy zerotensor handling (#164487)
Fixes #89034

Updated the tensor_to_numpy() function in tensor_numpy.cpp to handle ZeroTensors by throwing an error if force=False and returning an array full of zeros if force=True.
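
A hedged sketch of the resulting behavior; `torch._efficientzerotensor` is a private constructor used here just to obtain a ZeroTensor, and the exact exception type is an assumption:

```python
# Hedged sketch of the behavior (torch._efficientzerotensor is a private
# constructor; the exception type below is an assumption).
import torch

z = torch._efficientzerotensor((2, 3))   # ZeroTensor with no materialized storage

try:
    z.numpy()                            # force=False: should raise
except (RuntimeError, TypeError) as e:
    print("numpy() refused:", e)

print(z.numpy(force=True))               # force=True: plain array of zeros
```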

@ngimel, I just saw that you mentioned PyTorch is not too concerned with this issue but I had already worked on it so I figured I would push it anyways and see what you thought. Feel free to close the PR if you think it is not worth merging.

@albanD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164487
Approved by: https://github.com/ngimel, https://github.com/albanD
2025-10-04 12:03:48 +00:00
f46bb04dcc Revert "Add pure view support in autograd Function (#164467)"
This reverts commit 10335ffb2cce26c99958d055f415a16c1d14bc35.

Reverted https://github.com/pytorch/pytorch/pull/164467 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/164467#issuecomment-3368152304))
2025-10-04 11:42:46 +00:00
6f6a919366 Revert "Make custom op alias check consistent (#164576)"
This reverts commit e438db254602cf39ba536aed0590b4144c019ee8.

Reverted https://github.com/pytorch/pytorch/pull/164576 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/164467#issuecomment-3368152304))
2025-10-04 11:42:45 +00:00
83d71dfb2f Fix mesh.get_local_rank when it is > 1d (#164473)
Previously, we would not take the arguments passed to get_local_rank into account. This meant that we couldn't trace this call if we had a device_mesh with more than one dimension.
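
A hedged usage sketch, assuming a 4-rank torchrun launch: on a mesh with more than one dimension, `get_local_rank` can be addressed by mesh-dim name or index, which is what this fix makes traceable:

```python
# Hedged usage sketch (assumes a 4-rank torchrun launch): on a >1-D mesh,
# get_local_rank takes the mesh dimension by name or by index.
from torch.distributed.device_mesh import init_device_mesh

mesh = init_device_mesh("cuda", (2, 2), mesh_dim_names=("dp", "tp"))
dp_rank = mesh.get_local_rank("dp")   # rank along the "dp" dimension
tp_rank = mesh.get_local_rank(1)      # rank along the second ("tp") dimension
print(dp_rank, tp_rank)
```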

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164473
Approved by: https://github.com/xmfan, https://github.com/Skylion007
2025-10-04 11:27:55 +00:00
5103ecc5d8 [1/N] Fix clang-tidy readability checks (#164561)
Thoroughly check all `.cpp` files, except `jit` files, for readability.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164561
Approved by: https://github.com/Skylion007
2025-10-04 09:40:38 +00:00
9580539e2f Add device argument to torch.random.get_rng_state (#163034)
Fixes #162812

Adds support for either passing a device directly into get_rng_state, or passing in a string or int (which is then wrapped into a device inside, as in torch.cuda.get_rng_state).

I wasn't exactly sure where tests for this should go, please let me know. I used this script for testing:
```python
import torch

# note: when running with CUDA GPU, first three tests will give the same result,
# as will the last two

# test with no device specified
print(torch.get_rng_state())

# test with CPU
cpu_device = torch.device("cpu")
print(torch.get_rng_state(cpu_device))

# test with direct name
print(torch.get_rng_state("cpu"))

# test with CUDA
cuda_device = torch.device("cuda:0")
print(torch.get_rng_state(cuda_device))

# test with integer
print(torch.get_rng_state(0))
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163034
Approved by: https://github.com/ezyang, https://github.com/cyyever
2025-10-04 06:48:39 +00:00
a11a66ef32 Remove CUDA 11 branches for sparse code (#164531)
This PR removes outdated CUDA version checks from sparse code in aten/src/ATen/cuda.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164531
Approved by: https://github.com/eqy
2025-10-04 06:07:49 +00:00
6b768e1890 Support propagating custom meta field to backward graph nodes (#164174)
# Propagate custom meta data to backward

Support propagating the user annotation tags to the backward graph by extending the `copy_fwd_metadata_to_bw_nodes` utils (recommended by @xmfan, thanks!).

Example annotation API (added in https://github.com/pytorch/pytorch/pull/163673):

```
class M(torch.nn.Module):
    def forward(self, x):
        with fx_traceback.annotate({"pp_stage": 0}):
            with fx_traceback.annotate({"fdsp_bucket": 0}):
                x = x + 1
            x = x - 2
            with fx_traceback.annotate({"cuda_stream": 2, "fsdp_bucket": 1}):
                x = x * 2
        x = x / 3
        return x
```

Assumptions (some inherited from https://github.com/pytorch/pytorch/pull/126573):

- I am trusting the seq_nr mapping introduced to aot_autograd nodes in https://github.com/pytorch/pytorch/pull/103129
- I am also trusting that the forward is single threaded, since seq_nr is thread local.  If this isn't always true, we'll need to also plumb thread_id through the same machinery which is populating seq_nr.
- **(This is changed in this PR!) I assume all backward graph nodes have "is_backward" for 'partitioner_tag', and all other nodes are forward graph nodes**. If we don't run export before `aot_export_join_with_descriptors`, then none of the nodes has "nn_module_stack" in node meta. If we do run export first, then we don't need this change.
- I copy "custom" node meta from forward to backward graph nodes.

Question:
- Is it a good idea to copy all "custom" node meta? Or should we create a dedicated key in custom node meta to be copied? @SherlockNoMad
- Do we expect people to run export before using `aot_export_join_with_descriptors`?
- Can we assume the following for the graph produced by `aot_export_join_with_descriptors`: "all backward graph nodes have "is_backward" for 'partitioner_tag', and all other nodes are forward graph nodes"? Maybe this is a question for @ezyang

```
python test/functorch/test_aot_joint_with_descriptors.py -k test_preserve_
python test/export/test_export.py -k preserve_anno
python test/distributed/tensor/test_dtensor_export.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164174
Approved by: https://github.com/xmfan, https://github.com/SherlockNoMad
2025-10-04 05:03:32 +00:00
35c4130fd1 [2/N] Fix ruff warnings (#164460)
Apply ruff `SIM` rules.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164460
Approved by: https://github.com/ezyang
2025-10-04 03:40:32 +00:00
34042a9145 Change intra-graph offset dtype to uint64_t (#164515)
Even though `offset_intragraph_` only tracks RNG consumption within a single graph replay, we have observed that the 32bit storage for these offsets is easy to overshoot, especially for cases with big CUDA graph captures including kernels that are generating a large amount of random numbers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164515
Approved by: https://github.com/eee4017, https://github.com/eqy
2025-10-04 03:39:09 +00:00
Ken
9d1ab4f4bb [CI] Limit Numba CUDA-13 patch to CUDA environments only (#164607)
The patch introduced in https://github.com/pytorch/pytorch/pull/163111 caused issues in ROCm environments. This change guards the patching logic to CUDA environments only, thus ameliorating test failures in ROCm environments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164607
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-04 02:39:07 +00:00
3e0826c9d7 Update disabling fast-path for strict-export inside MultiheadAttention (#164544)
For some reason, executorch needs the slow path. But the original flag doesn't work for the new export because we inline torch modules even before getting into make_fx. We still have to keep the old flag because a lot of code assumes it exists... grr

Differential Revision: [D83810733](https://our.internmc.facebook.com/intern/diff/D83810733)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164544
Approved by: https://github.com/anijain2305, https://github.com/mikaylagawarecki
2025-10-04 02:20:55 +00:00
86c789849e [fr] Re-order mismatch check in fr analysis script (#164606)
In reality we found that the current mismatch-check order does not match the actual error distribution, so we reorder it a bit as follows:
1. We do collective type check first
2. Then size check (excluding all2all)
3. dtype check
4. state check

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164606
Approved by: https://github.com/VieEeEw
2025-10-04 01:16:15 +00:00
f3afbcf340 [ONNX] Bump tested onnxruntime to 1.23.0 and onnxscript to 0.5.2 (#164440)
Performs tests on the latest ONNX environment.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164440
Approved by: https://github.com/justinchuby, https://github.com/albanD
2025-10-04 01:10:47 +00:00
40b25578e4 [Inductor] deterministic mode (#163589)
Add a deterministic mode to skip the on-device benchmarking that we know can affect numerics. This includes:
- pad-mm
- dynamic rblock scaling
- template autotuning
- coordinate descent tuning for reduction
- reduction config autotuning in CachingAutotuner. For reductions, both RBLOCK and num_warps can affect numerics; XBLOCK does not, so we can still autotune XBLOCK for reductions.
- benchmarking for computation communication reordering pass

The mode definitely has a perf hit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163589
Approved by: https://github.com/v0i0
2025-10-04 01:05:08 +00:00
412c6d28ec [ROCm][CI] additional dynamo benchmarks for inductor-periodic (#164279)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164279
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-04 00:55:17 +00:00
7d570129e0 Fix custom autograd Function memory leak when saving mutated view (#164407)
Fixes https://github.com/pytorch/pytorch/issues/160317
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164407
Approved by: https://github.com/albanD
2025-10-04 00:47:12 +00:00
97ca21106d move fw|bw compiler args in aot joint with descriptors (#164584)
Summary: Minor refactor that moves some args in the aot joint-with-descriptors workflow that are not used in the export stage to the compile stage, where they are actually used.

Test Plan: existing tests should pass

Differential Revision: D83850316

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164584
Approved by: https://github.com/tugsbayasgalan
2025-10-04 00:24:46 +00:00
27234792ad Fix refine_ranges corner case (#164075)
address https://github.com/pytorch/pytorch/issues/161360

u0 > 0 should update the range of u0 to start from [1, ..]; this fixes it. It was not doing that before.
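
A toy illustration of the refinement, not the ShapeEnv implementation; `IntRange` and `refine_gt` are made-up names:

```python
# Toy illustration of the refinement (not the ShapeEnv implementation): a strict
# integer lower bound u0 > c implies u0 >= c + 1.
from dataclasses import dataclass

@dataclass
class IntRange:
    lower: int
    upper: int  # a large sentinel stands in for "unbounded" here

def refine_gt(r: IntRange, c: int) -> IntRange:
    return IntRange(max(r.lower, c + 1), r.upper)

u0 = IntRange(0, 2**63 - 1)     # e.g. an unbacked size assumed non-negative
print(refine_gt(u0, 0))         # IntRange(lower=1, upper=9223372036854775807)
```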

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164075
Approved by: https://github.com/ColinPeppler
2025-10-03 23:30:46 +00:00
b6b7a44dec Fix common typos and misspellings (#164413)
Summary:
This commit fixes numerous typos and misspellings found throughout the codebase. The fixes improve code readability and documentation consistency across C++, Python, CUDA, and documentation files.

## Typos Fixed

| Before | After | Occurrences |
|--------|-------|-------------|
| occured | occurred | 14 |
| accross | across | 9 |
| lenght/lenghts | length/lengths | 8 |
| unneccessary | unnecessary | 5 |
| Peform | Perform | 4 |
| furture | future | 3 |
| paritioned | partitioned | 2 |
| desireable | desirable | 2 |
| registerations | registrations | 2 |
| seperated | separated | 2 |
| intialized | initialized | 2 |
| capatibility | compatibility | 2 |
| peformed | performed | 2 |
| Exmple | Example | 2 |
| comma_seperated | comma_separated | 2 |
| cumsuming | consuming | 2 |
| neccessary | necessary | 1 |
| ParamterMetadataTable | ParameterMetadataTable | 1 |
| matached | matched | 1 |
| conaitner | container | 1 |
| reivew | review | 1 |
| prioriry | priority | 1 |
| Alocated | Allocated | 1 |
| opportunixtically | opportunistically | 1 |
| peformance | performance | 1 |
| equavalent | equivalent | 1 |
| asssumed | assumed | 1 |
| valdiation | validation | 1 |
| apprear | appear | 1 |
| consectuve | consecutive | 1 |
| dependending | depending | 1 |
| copnversion | conversion | 1 |
| weigted | weighted | 1 |
| repreesenting | representing | 1 |
| finialize | finalize | 1 |
| unintialized | uninitialized | 1 |
| conbined | combined | 1 |
| tesnor | tensor | 1 |
| desugared | discarded | 1 |
| behaviour | behavior | 1 |
| paramerizaitons | parametrizations | 1 |
| compute_output_lenghths_kernel | compute_output_lengths_kernel | 1 |

Test Plan: N/A - mostly comments - waiting on CI

Differential Revision: D83695665

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164413
Approved by: https://github.com/eqy, https://github.com/larryliu0820
2025-10-03 23:19:41 +00:00
3ddf2018d0 Revert "Support setting grad_dtype on leaf tensors (#162815)"
This reverts commit dca73982c53e9f99f96246b5d9ed9bab83c7423f.

Reverted https://github.com/pytorch/pytorch/pull/162815 on behalf of https://github.com/yangw-dev due to break internal test D83850533, see more details below ([comment](https://github.com/pytorch/pytorch/pull/162815#issuecomment-3367498501))
2025-10-03 23:14:28 +00:00
fac6f20ae3 [CI] Add another win shard (#164605)
Since it's timing out: 0b4f2b46d9/1

The first shard is disproportionately long because of the cpp tests. I'm trying to figure that out, but for now we can do this or increase the timeout.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164605
Approved by: https://github.com/seemethere, https://github.com/malfet
2025-10-03 22:51:09 +00:00
1894082000 UT/Examples for resharding checkpoint save/loads for distributed tensors with uneven shards. (#160533)
1\ The DTensor abstraction on its own does not support arbitrary-length shards in its distributed tensor representation. It supports a single uneven shard, but it has to be the last shard in the sharding dimension.

2\ However, DCP supports an API called Checkpointable. This API allows you to define your custom shardable tensor structure. I have given a UT example (look for CheckpointableDistTensor). Therefore, one option is to use CheckpointableDistTensor to save/load uneven shards.

3\ While exploring this path, I also noticed that the torch.rec module encountered a similar problem while working with DTensor. They work around it by implementing the Checkpointable API in DTensor and introducing an auxiliary structure called LocalShardsWrapper. This is the second option we can use to unblock the data loader resharding work.

In summary:
Use LocalShardsWrapper + DTensor as the first option to unblock.
The second preference is to use a new implementation of the Checkpointable API (similar to the CheckpointableDistTensor I have introduced in this example).

Differential Revision: D80182564

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160533
Approved by: https://github.com/saumishr
2025-10-03 22:15:02 +00:00
5a66ff4915 [dynamo, 3.14] fix _detect_and_normalize_assert_statement for 3.14 (#164005)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164005
Approved by: https://github.com/anijain2305, https://github.com/atalman
2025-10-03 22:07:54 +00:00
abadea70f3 [inductor] thread hint_override in more kernel args (#164494)
ensure hint_override is threaded in benchmarking args

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164494
Approved by: https://github.com/bobrenjc93
2025-10-03 22:07:12 +00:00
f414aa8e0d Add pyrefly suppressions (3/n) (#164588)
Adds suppressions so that pyrefly will typecheck clean: https://github.com/pytorch/pytorch/issues/163283

Test plan:
dmypy restart && python3 scripts/lintrunner.py -a
pyrefly check

step 1: uncomment lines in the pyrefly.toml file
step 2: run pyrefly check
step 3: add suppressions, clean up unused suppressions
before: https://gist.github.com/maggiemoss/bb31574ac8a59893c9cf52189e67bb2d

after:

 0 errors (1,970 ignored)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164588
Approved by: https://github.com/oulgen
2025-10-03 22:03:03 +00:00
e438db2546 Make custom op alias check consistent (#164576)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164576
Approved by: https://github.com/soulitzer
ghstack dependencies: #164467
2025-10-03 21:42:11 +00:00
10335ffb2c Add pure view support in autograd Function (#164467)
Fix https://github.com/pytorch/pytorch/issues/73604

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164467
Approved by: https://github.com/ezyang, https://github.com/soulitzer
2025-10-03 21:42:11 +00:00
f006aee601 Speed up FP precision lookup (#164044)
This commit simplifies the precision lookup and setting logic
by reducing the number of branches and using a custom hash
function. Fixes #161822. The issue described in #163709 still
persists. This is meant as a short term fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164044
Approved by: https://github.com/ngimel, https://github.com/eqy
2025-10-03 21:35:20 +00:00
8d53d788fe lint: add .pyi to changed files on .pyi.in changes (#164603)
We were observing issues where the lint on trunk vs. PRs would be different
due to missing .pyi files. This change adds the .pyi files to the changed files
list when .pyi.in files are changed.

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164603
Approved by: https://github.com/atalman, https://github.com/malfet, https://github.com/Skylion007
2025-10-03 21:30:54 +00:00
0b4f2b46d9 Revert "[inductor] require shape in TritonCSEVariable (#162275)"
This reverts commit f465ea6752c91498de63eb57439a74f4836e568a.

Reverted https://github.com/pytorch/pytorch/pull/162275 on behalf of https://github.com/yangw-dev due to break interal test, see more details in next comment ([comment](https://github.com/pytorch/pytorch/pull/162275#issuecomment-3367213941))
2025-10-03 21:07:00 +00:00
960c4b9937 [inductor] Enable triton kernels with unbacked inputs (#164509)
Summary:
We need to pass in a fallback value to avoid converting symbols to int.

original failure log in onefeed Slimper MB - P1973406565
`raise TypeError("Cannot convert symbols to int")`

Test Plan:
If not passing in a fallback value:
https://www.internalfb.com/intern/everpaste/?handle=GGeAoh_M11kEGOECAFELOaq8ooRCbswMAAAz
`raise TypeError("Cannot convert symbols to int")`

```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:unbacked_symints -- test_triton_kernel_with_unbacked_symint_fallback --print-passing-details --env TORCHDYNAMO_EXTENDED_DEBUG_CPP=1 --env TORCHDYNAMO_EXTENDED_DEBUG_GUARD_ADDED="Eq(u0, 0)"
```
Buck UI: https://www.internalfb.com/buck2/4d27cd49-770b-40de-8c65-9ee04c5dd687
Test UI: https://www.internalfb.com/intern/testinfra/testrun/9570149324695031
Network: Up: 0B  Down: 16MiB  (reSessionID-8e8b07a2-e31c-402d-bf6a-ebb92253e654)
Executing actions. Remaining     0/6                                                              5.0s exec time total
Command: test.     Finished 2 cache (100% hit)                                                    5.0s exec time cached (100%)
Time elapsed: 33.8s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

Differential Revision: D83684260

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164509
Approved by: https://github.com/ColinPeppler
2025-10-03 21:05:18 +00:00
1f8ee5da11 [TorchGen] Remove unused variables and function imports (#164538)
This PR removes unused code in torchgen.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164538
Approved by: https://github.com/Skylion007, https://github.com/albanD
2025-10-03 20:49:36 +00:00
da49a57d34 [ROCm] Enabled JIT UTs on ROCm (#164582)
This PR enables the following tests on ROCm:

test/test_jit.py::TestBackends::test_save_load
test/test_jit.py::TestBackends::test_execution
test/test_jit.py::TestBackends::test_errors
test/test_jit.py::TestCUDA::test_current_stream

Verified that the tests pass on the AMD gfx90a and gfx942 architectures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164582
Approved by: https://github.com/jeffdaily
2025-10-03 20:16:41 +00:00
8ec8c14ace Revert "[CUDA] Add experimental green context support for SM carveout (#159104)"
This reverts commit 3c59351c6ea2fc29d346903e28e95c5f4d0ccdbb.

Reverted https://github.com/pytorch/pytorch/pull/159104 on behalf of https://github.com/clee2000 due to failed lint, pyfmt not caught pyi file, I think they need special handling since theyre not in the changed files list? ([comment](https://github.com/pytorch/pytorch/pull/159104#issuecomment-3367077208))
2025-10-03 20:15:56 +00:00
2d50678dcc Fix -Wno-duplicate-decl-specifier is valid for C/ObjC but not for C++ (#164552)
Fixes #99715
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164552
Approved by: https://github.com/Skylion007
2025-10-03 20:12:49 +00:00
3ca09d65f1 [ROCm] Enable several distributed UTs (#164390)
Increase the tolerance for the following UTs as there was a slight mismatch seen on MI200.
    - test_data_parallel.py:test_strided_grad_layout
    - test_c10d_nccl.py:test_grad_layout_1devicemodule_1replicaperprocess

Skip for MI200:
    - test_fully_shard_training.py:test_2d_mlp_with_nd_mesh
    - test_2d_composability.py:test_train_parity_2d_mlp
    - test_fully_shard_overlap.py:test_fully_shard_training_overlap

Fixes #159489
Fixes #159488
Fixes #152700
Fixes #125555
Fixes #134139

Working as is on both MI200 and MI300:
Fixes #125991
Fixes #125918

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164390
Approved by: https://github.com/jeffdaily
2025-10-03 19:52:51 +00:00
1bb68271b7 Stop building nativert in OSS (#164463)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164463
Approved by: https://github.com/albanD, https://github.com/Skylion007
2025-10-03 19:41:15 +00:00
9eb89a4ad5 Add missing TypeIs to torch/_inductor/ir.py (#164489)
This should be a TypeIs here

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164489
Approved by: https://github.com/mlazos
2025-10-03 19:34:20 +00:00
15d726005d Enable several unit tests on ROCm (#163087)
Code change enables:
test_nn::TestNNDeviceTypeCUDA::test_transformerencoderlayer_cuda_float16
test_nn::TestNNDeviceTypeCUDA::test_transformerencoderlayer_cuda_float32
test_nn::TestNNDeviceTypeCUDA::test_transformerencoderlayer_cuda_float64
test_nn::TestNNDeviceTypeCUDA::test_transformerencoderlayer_gelu_cuda_float16
test_linalg::TestLinalgCUDA::test_eigh_svd_illcondition_matrix_input_should_not_crash_cuda_float32
test_linalg::TestLinalgCUDA::test_eigh_svd_illcondition_matrix_input_should_not_crash_cuda_float64
test_ops::TestCommonCUDA::test_complex_half_reference_testing_as_strided_scatter_cuda_complex32

Fixes https://github.com/pytorch/pytorch/issues/134687
Fixes https://github.com/pytorch/pytorch/issues/78205

Closing github issues:
inductor/test_gpu_cpp_wrapper unit tests:
Fixes https://github.com/pytorch/pytorch/issues/157084

test_nn unit tests:
Fixes https://github.com/pytorch/pytorch/issues/157167
Fixes https://github.com/pytorch/pytorch/issues/157119
Fixes https://github.com/pytorch/pytorch/issues/157118
Fixes https://github.com/pytorch/pytorch/issues/157115
Fixes https://github.com/pytorch/pytorch/issues/157081
Fixes https://github.com/pytorch/pytorch/issues/155216
Fixes https://github.com/pytorch/pytorch/issues/157259
Fixes https://github.com/pytorch/pytorch/issues/157166
Fixes https://github.com/pytorch/pytorch/issues/157165
Fixes https://github.com/pytorch/pytorch/issues/157164
Fixes https://github.com/pytorch/pytorch/issues/157117
Fixes https://github.com/pytorch/pytorch/issues/157116
Fixes https://github.com/pytorch/pytorch/issues/157114
Fixes https://github.com/pytorch/pytorch/issues/157113
Fixes https://github.com/pytorch/pytorch/issues/157082
Fixes https://github.com/pytorch/pytorch/issues/157080
Fixes https://github.com/pytorch/pytorch/issues/157079
Fixes https://github.com/pytorch/pytorch/issues/157078

test_linalg unit tests:
Fixes https://github.com/pytorch/pytorch/issues/157427
Fixes https://github.com/pytorch/pytorch/issues/157414
Fixes https://github.com/pytorch/pytorch/issues/157369
Fixes https://github.com/pytorch/pytorch/issues/157349
Fixes https://github.com/pytorch/pytorch/issues/157348
Fixes https://github.com/pytorch/pytorch/issues/157337
Fixes https://github.com/pytorch/pytorch/issues/157336
Fixes https://github.com/pytorch/pytorch/issues/157297
Fixes https://github.com/pytorch/pytorch/issues/157281
Fixes https://github.com/pytorch/pytorch/issues/157260
Fixes https://github.com/pytorch/pytorch/issues/157171
Fixes https://github.com/pytorch/pytorch/issues/157169
Fixes https://github.com/pytorch/pytorch/issues/157168
Fixes https://github.com/pytorch/pytorch/issues/157125
Fixes https://github.com/pytorch/pytorch/issues/157124
Fixes https://github.com/pytorch/pytorch/issues/157123
Fixes https://github.com/pytorch/pytorch/issues/157089
Fixes https://github.com/pytorch/pytorch/issues/157088
Fixes https://github.com/pytorch/pytorch/issues/157087
Fixes https://github.com/pytorch/pytorch/issues/157068
Fixes https://github.com/pytorch/pytorch/issues/157067
Fixes https://github.com/pytorch/pytorch/issues/157066
Fixes https://github.com/pytorch/pytorch/issues/157047
Fixes https://github.com/pytorch/pytorch/issues/157046
Fixes https://github.com/pytorch/pytorch/issues/157045
Fixes https://github.com/pytorch/pytorch/issues/157044
Fixes https://github.com/pytorch/pytorch/issues/156997
Fixes https://github.com/pytorch/pytorch/issues/156996
Fixes https://github.com/pytorch/pytorch/issues/156995
Fixes https://github.com/pytorch/pytorch/issues/156994
Fixes https://github.com/pytorch/pytorch/issues/156993
Fixes https://github.com/pytorch/pytorch/issues/156991
Fixes https://github.com/pytorch/pytorch/issues/156990
Fixes https://github.com/pytorch/pytorch/issues/156989
Fixes https://github.com/pytorch/pytorch/issues/105118
Fixes https://github.com/pytorch/pytorch/issues/157415
Fixes https://github.com/pytorch/pytorch/issues/157282
Fixes https://github.com/pytorch/pytorch/issues/157261
Fixes https://github.com/pytorch/pytorch/issues/157170
Fixes https://github.com/pytorch/pytorch/issues/157126

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163087
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony
2025-10-03 19:30:59 +00:00
16f9bef642 [precompile] Fix guard serialization loading bugs. (#164490)
Summary: Added a set of fixes triggered by fm training job. Overall the theme here is that we should get rid of saved objects as much as possible when they are not used in guard reconstruction. Sometimes for objects that cannot be saved (like local functions) we still try our best to save their closures.

Test Plan:
test_guard_serialization.py
test_lazy_awaitable.py

Differential Revision: D83766926

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164490
Approved by: https://github.com/jamesjwu
2025-10-03 19:20:07 +00:00
3c59351c6e [CUDA] Add experimental green context support for SM carveout (#159104)
Low-level PyTorch APIs should be usable/stable enough at this point but we might move the underlying driver API usage a bit from here...

Built on top of @drisspg 's branch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159104
Approved by: https://github.com/ngimel

Co-authored-by: drisspg <drisspguessous@gmail.com>
2025-10-03 18:59:12 +00:00
7eb1eb4313 ci: Removing ROCm tests from trunk. (#164585)
Had a conversation with the AMD team today and I think we are all in
agreement that the current state of queueing for AMD is beyond where
we'd like to be for there to be blocking CI for ROCm.

Moving the representative testing jobs for this into the ciflow/rocm
workflow.

I'd love for these to be back in trunk if we can get to a state where
our queueing metrics are below an hour for ROCm infrastructure.

Dashboards:
* ROCm Queueing (>60mins) ([link](https://hud.pytorch.org/queue_time_analysis?dateRange=30&startDate=2025-09-03T16%3A06%3A45.025Z&endDate=2025-10-03T16%3A06%3A45.025Z&granularity=week&chartType=bar&repos=pytorch%2Fpytorch&category=machine_type&machineTypes=linux.rocm.gpu.2&machineTypes=linux.rocm.gpu.4&machineTypes=linux.rocm.gpu.mi250&machineTypes=linux.rocm.gpu.gfx942.1&machineTypes=linux.rocm.gpu.mi250.4&machineTypes=linux.rocm.gpu.gfx942.4&machineTypes=linux.rocm.gpu.mi355.2&machineTypes=linux.rocm.gpu.gfx942.4.test&machineTypes=linux.rocm.gpu.mi250.1&machineTypes=linux.rocm.gpu.gfx942.1.test&machineTypes=linux.rocm.gpu.gfx90a.1&machineTypes=linux.rocm.gpu.gfx90a.4&items=linux.rocm.gpu.2&items=linux.rocm.gpu.4&items=linux.rocm.gpu.mi250&items=linux.rocm.gpu.gfx942.1&items=linux.rocm.gpu.mi250.4&items=linux.rocm.gpu.gfx942.4&items=linux.rocm.gpu.mi355.2&items=linux.rocm.gpu.gfx942.4.test&items=linux.rocm.gpu.mi250.1&items=linux.rocm.gpu.gfx942.1.test&items=linux.rocm.gpu.gfx90a.1&items=linux.rocm.gpu.gfx90a.4))
* NVIDIA queueing (<5mins) ([link](https://hud.pytorch.org/queue_time_analysis?dateRange=30&startDate=2025-09-03T16%3A05%3A08.000Z&endDate=2025-10-03T16%3A05%3A08.000Z&granularity=week&chartType=bar&repos=pytorch%2Fpytorch&category=machine_type&machineTypes=lf.linux.g4dn.4xlarge.nvidia.gpu&machineTypes=linux.g4dn.12xlarge.nvidia.gpu&machineTypes=linux.g4dn.metal.nvidia.gpu&machineTypes=linux.g5.4xlarge.nvidia.gpu&machineTypes=lf.linux.g4dn.12xlarge.nvidia.gpu&machineTypes=lf.linux.g5.12xlarge.nvidia.gpu&machineTypes=lf.linux.g5.4xlarge.nvidia.gpu&machineTypes=lf.linux.g6.4xlarge.experimental.nvidia.gpu&machineTypes=linux.g6.4xlarge.experimental.nvidia.gpu&machineTypes=linux.4xlarge.nvidia.gpu&machineTypes=linux.g5.12xlarge.nvidia.gpu&machineTypes=linux.g4dn.4xlarge.nvidia.gpu&machineTypes=lf.linux.4xlarge.nvidia.gpu&machineTypes=linux.g6.12xlarge.nvidia.gpu&items=lf.linux.g4dn.4xlarge.nvidia.gpu&items=linux.g4dn.12xlarge.nvidia.gpu&items=linux.g4dn.metal.nvidia.gpu&items=linux.g5.4xlarge.nvidia.gpu&items=lf.linux.g4dn.12xlarge.nvidia.gpu&items=lf.linux.g5.12xlarge.nvidia.gpu&items=lf.linux.g5.4xlarge.nvidia.gpu&items=lf.linux.g6.4xlarge.experimental.nvidia.gpu&items=linux.g6.4xlarge.experimental.nvidia.gpu&items=linux.4xlarge.nvidia.gpu&items=linux.g5.12xlarge.nvidia.gpu&items=linux.g4dn.4xlarge.nvidia.gpu&items=lf.linux.4xlarge.nvidia.gpu&items=linux.g6.12xlarge.nvidia.gpu))

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164585
Approved by: https://github.com/malfet, https://github.com/yangw-dev, https://github.com/atalman, https://github.com/jeffdaily
2025-10-03 18:19:24 +00:00
f39789cdab [PyTorch Pinned Allocator] Add support of reserved pinned memory segment to avoid slow paths (#164501)
Summary:
This diff adds the feature of allocating a large pinned memory segment upfront based on the provided config. This large segment is then used to serve all the small pinned memory requests to avoid expensive device level APIs (slow paths).

Example:

PYTORCH_CUDA_ALLOC_CONF=pinned_reserve_segment_size_mb:2048

This reserves a 2GB pinned memory segment for the process; all incoming small requests are then served from this segment and no cudaHostAlloc/cudaHostRegister APIs are called.
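
A minimal usage sketch, assuming a CUDA build; the config key is the one shown above, and setting it before importing torch is the conventional way to make sure the allocator sees it:

```python
# Minimal usage sketch (assumes a CUDA build): set the allocator config before
# importing torch so the reserved pinned segment is picked up at startup.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "pinned_reserve_segment_size_mb:2048"

import torch

# Small pinned allocations should now be served from the reserved segment
# instead of calling cudaHostAlloc/cudaHostRegister each time.
bufs = [torch.empty(1 << 20, pin_memory=True) for _ in range(8)]
```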

Differential Revision: D83779074

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164501
Approved by: https://github.com/yangw-dev
2025-10-03 18:11:27 +00:00
3d9d41c801 Remove old workaround in launch_logcumsumexp_cuda_kernel (#164567)
Remove the workaround for CUDA 11.4.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164567
Approved by: https://github.com/Aidyn-A, https://github.com/Skylion007
2025-10-03 18:07:02 +00:00
921 changed files with 12055 additions and 6045 deletions

View File

@ -37,9 +37,9 @@ case ${DOCKER_TAG_PREFIX} in
rocm*)
BASE_TARGET=rocm
PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201"
# add gfx950 conditionally starting in ROCm 7.0
# add gfx950, gfx115x conditionally starting in ROCm 7.0
if [[ "$ROCM_VERSION" == *"7.0"* ]]; then
PYTORCH_ROCM_ARCH="${PYTORCH_ROCM_ARCH};gfx950"
PYTORCH_ROCM_ARCH="${PYTORCH_ROCM_ARCH};gfx950;gfx1150;gfx1151"
fi
EXTRA_BUILD_ARGS="${EXTRA_BUILD_ARGS} --build-arg PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH}"
;;

View File

@ -344,7 +344,7 @@ docker build \
--build-arg "NINJA_VERSION=${NINJA_VERSION:-}" \
--build-arg "KATEX=${KATEX:-}" \
--build-arg "ROCM_VERSION=${ROCM_VERSION:-}" \
--build-arg "PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH:-gfx90a;gfx942}" \
--build-arg "PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH:-gfx90a;gfx942;gfx1100}" \
--build-arg "IMAGE_NAME=${IMAGE_NAME}" \
--build-arg "UCX_COMMIT=${UCX_COMMIT}" \
--build-arg "UCC_COMMIT=${UCC_COMMIT}" \

View File

@ -19,8 +19,8 @@ pip_install \
transformers==4.36.2
pip_install coloredlogs packaging
pip_install onnxruntime==1.22.1
pip_install onnxscript==0.4.0
pip_install onnxruntime==1.23.0
pip_install onnxscript==0.5.3
# Cache the transformers model to be used later by ONNX tests. We need to run the transformers
# package to download the model. By default, the model is cached at ~/.cache/huggingface/hub/

View File

@ -46,9 +46,9 @@ case ${DOCKER_TAG_PREFIX} in
BASE_TARGET=rocm
GPU_IMAGE=rocm/dev-ubuntu-22.04:${GPU_ARCH_VERSION}-complete
PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201"
# add gfx950 conditionally starting in ROCm 7.0
# add gfx950, gfx115x conditionally starting in ROCm 7.0
if [[ "$GPU_ARCH_VERSION" == *"7.0"* ]]; then
PYTORCH_ROCM_ARCH="${PYTORCH_ROCM_ARCH};gfx950"
PYTORCH_ROCM_ARCH="${PYTORCH_ROCM_ARCH};gfx950;gfx1150;gfx1151"
fi
DOCKER_GPU_BUILD_ARG="--build-arg PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH} --build-arg ROCM_VERSION=${GPU_ARCH_VERSION}"
;;

View File

@ -115,6 +115,9 @@ RUN env GRPC_PYTHON_BUILD_SYSTEM_OPENSSL=True pip3 install grpcio
# cmake-3.28.0 from pip for onnxruntime
RUN python3 -mpip install cmake==3.28.0
ADD ./common/patch_libstdc.sh patch_libstdc.sh
RUN bash ./patch_libstdc.sh && rm patch_libstdc.sh
# build onnxruntime 1.21.0 from sources.
# it is not possible to build it from sources using pip,
# so just build it from upstream repository.

View File

@ -84,9 +84,9 @@ case ${image} in
DEVTOOLSET_VERSION="11"
GPU_IMAGE=rocm/dev-almalinux-8:${GPU_ARCH_VERSION}-complete
PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201"
# add gfx950 conditionally starting in ROCm 7.0
# add gfx950, gfx115x conditionally starting in ROCm 7.0
if [[ "$GPU_ARCH_VERSION" == *"7.0"* ]]; then
PYTORCH_ROCM_ARCH="${PYTORCH_ROCM_ARCH};gfx950"
PYTORCH_ROCM_ARCH="${PYTORCH_ROCM_ARCH};gfx950;gfx1150;gfx1151"
fi
DOCKER_GPU_BUILD_ARG="--build-arg ROCM_VERSION=${GPU_ARCH_VERSION} --build-arg PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH} --build-arg DEVTOOLSET_VERSION=${DEVTOOLSET_VERSION}"
;;

View File

@ -120,9 +120,8 @@ ninja==1.11.1.4
numba==0.55.2 ; python_version == "3.10" and platform_machine != "s390x"
numba==0.60.0 ; python_version == "3.12" and platform_machine != "s390x"
#Description: Just-In-Time Compiler for Numerical Functions
#Pinned versions: 0.54.1, 0.49.0, <=0.49.1
#Pinned versions: 0.55.2, 0.60.0
#test that import: test_numba_integration.py
#For numba issue see https://github.com/pytorch/pytorch/issues/51511
#Need release > 0.61.2 for s390x due to https://github.com/numba/numba/pull/10073
#numpy
@ -242,10 +241,9 @@ pygments==2.15.0
#Pinned versions: 14.1.0
#test that import:
scikit-image==0.19.3 ; python_version < "3.10"
scikit-image==0.22.0 ; python_version >= "3.10"
scikit-image==0.22.0
#Description: image processing routines
#Pinned versions:
#Pinned versions: 0.22.0
#test that import: test_nn.py
#scikit-learn
@ -341,7 +339,7 @@ onnx==1.18.0
#Pinned versions:
#test that import:
onnxscript==0.4.0
onnxscript==0.5.3
#Description: Required by mypy and test_public_bindings.py when checking torch.onnx._internal
#Pinned versions:
#test that import:

View File

@ -1,15 +1,11 @@
sphinx==5.3.0
sphinx==7.2.6
#Description: This is used to generate PyTorch docs
#Pinned versions: 5.3.0
#Pinned versions: 7.2.6
standard-imghdr==3.13.0; python_version >= "3.13"
#Description: This is needed by Sphinx, so it needs to be added here.
# The reasons are as follows:
# 1) This module has been removed from the Python standard library since Python 3.13(https://peps.python.org/pep-0594/#imghdr);
# 2) The current version of Sphinx (5.3.0) is not compatible with Python 3.13.
# Once Sphinx is upgraded to a version compatible with Python 3.13 or later, we can remove this dependency.
pytorch_sphinx_theme2==0.1.0
#Description: This is needed to generate PyTorch docs
#Pinned versions: 0.1.0
-e git+https://github.com/pytorch/pytorch_sphinx_theme.git@71e55749be14ceb56e7f8211a9fb649866b87ad4#egg=pytorch_sphinx_theme2
# TODO: sphinxcontrib.katex 0.9.0 adds a local KaTeX server to speed up pre-rendering
# but it doesn't seem to work and hangs around idly. The initial thought that it is probably
# something related to Docker setup. We can investigate this later.
@ -36,17 +32,17 @@ tensorboard==2.18.0 ; python_version >= "3.13"
#Description: This is used to generate PyTorch docs
#Pinned versions: 2.13.0
breathe==4.34.0
breathe==4.36.0
#Description: This is used to generate PyTorch C++ docs
#Pinned versions: 4.34.0
#Pinned versions: 4.36.0
exhale==0.2.3
exhale==0.3.7
#Description: This is used to generate PyTorch C++ docs
#Pinned versions: 0.2.3
#Pinned versions: 0.3.7
docutils==0.16
docutils==0.20
#Description: This is used to generate PyTorch C++ docs
#Pinned versions: 0.16
#Pinned versions: 0.20
bs4==0.0.1
#Description: This is used to generate PyTorch C++ docs
@ -56,13 +52,13 @@ IPython==8.12.0
#Description: This is used to generate PyTorch functorch docs
#Pinned versions: 8.12.0
myst-nb==0.17.2
myst-nb==1.3.0
#Description: This is used to generate PyTorch functorch and torch.compile docs.
#Pinned versions: 0.17.2
#Pinned versions: 1.3.0
# The following are required to build torch.distributed.elastic.rendezvous.etcd* docs
python-etcd==0.4.5
sphinx-copybutton==0.5.0
sphinx-design==0.4.0
sphinx-design==0.6.1
sphinxcontrib-mermaid==1.0.0
myst-parser==0.18.1
myst-parser==4.0.1

View File

@ -5,7 +5,7 @@ DESIRED_ROCM ?= 7.0
DESIRED_ROCM_SHORT = $(subst .,,$(DESIRED_ROCM))
PACKAGE_NAME = magma-rocm
# inherit this from underlying docker image, do not pass this env var to docker
#PYTORCH_ROCM_ARCH ?= gfx900;gfx906;gfx908;gfx90a;gfx942;gfx950;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201
#PYTORCH_ROCM_ARCH ?= gfx900;gfx906;gfx908;gfx90a;gfx942;gfx950;gfx1030;gfx1100;gfx1101;gfx1102;gfx1150;gfx1151;gfx1200;gfx1201
DOCKER_RUN = set -eou pipefail; ${DOCKER_CMD} run --rm -i \
-v $(shell git rev-parse --show-toplevel)/.ci:/builder \
@ -18,7 +18,6 @@ DOCKER_RUN = set -eou pipefail; ${DOCKER_CMD} run --rm -i \
.PHONY: all
all: magma-rocm70
all: magma-rocm64
all: magma-rocm63
.PHONY:
clean:
@ -34,8 +33,3 @@ magma-rocm70:
magma-rocm64: DESIRED_ROCM := 6.4
magma-rocm64:
$(DOCKER_RUN)
.PHONY: magma-rocm63
magma-rocm63: DESIRED_ROCM := 6.3
magma-rocm63:
$(DOCKER_RUN)

View File

@ -67,7 +67,7 @@ fi
# wheels with cxx11-abi
echo "Checking that the gcc ABI is what we expect"
if [[ "$(uname)" != 'Darwin' && "$(uname -m)" != "s390x" ]]; then
if [[ "$(uname)" != 'Darwin' ]]; then
# We also check that there are cxx11 symbols in libtorch
#
echo "Checking that symbols in libtorch.so have the right gcc abi"

View File

@ -102,8 +102,18 @@ if [ "$is_main_doc" = true ]; then
echo coverage output not found
exit 1
elif [ $undocumented -gt 0 ]; then
echo undocumented objects found:
echo "======================================"
echo "ERROR: $undocumented undocumented objects found!"
echo "======================================"
echo ""
echo "Full coverage report:"
cat build/coverage/python.txt
echo ""
echo "======================================"
echo "Undocumented modules/objects (lines after TOTAL):"
tail -n +$((lines - undocumented + 1)) build/coverage/python.txt
echo "======================================"
echo ""
echo "Make sure you've updated relevant .rsts in docs/source!"
echo "You can reproduce locally by running 'cd docs && make coverage && cat build/coverage/python.txt'"
exit 1
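As a minimal sketch of the tail arithmetic used above (illustrative only; the script itself is bash), `tail -n +$((lines - undocumented + 1))` prints exactly the last `undocumented` lines of the coverage report, assuming `lines` holds the report's total line count:

```python
# Hypothetical Python model of the tail arithmetic (not part of the script):
# the undocumented entries are the last `undocumented` lines of the report.
def undocumented_tail(report_lines: list, undocumented: int) -> list:
    lines = len(report_lines)
    start = lines - undocumented + 1   # tail -n +START prints from line START (1-indexed)
    return report_lines[start - 1:]

report = ["Module docs", "TOTAL: 2 undocumented", "torch.foo.bar", "torch.baz.qux"]
print(undocumented_tail(report, 2))    # ['torch.foo.bar', 'torch.baz.qux']
```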

View File

@ -34,12 +34,14 @@ fi
# Patch numba to avoid CUDA-13 crash, see https://github.com/pytorch/pytorch/issues/162878
NUMBA_CUDA_DIR=$(python -c "import os;import numba.cuda; print(os.path.dirname(numba.cuda.__file__))" 2>/dev/null || true)
if [ -n "$NUMBA_CUDA_DIR" ]; then
NUMBA_PATCH="$(dirname "$(realpath "${BASH_SOURCE[0]}")")/numba-cuda-13.patch"
pushd "$NUMBA_CUDA_DIR"
patch -p4 <"$NUMBA_PATCH"
popd
if [[ "$BUILD_ENVIRONMENT" == *cuda* ]]; then
NUMBA_CUDA_DIR=$(python -c "import os;import numba.cuda; print(os.path.dirname(numba.cuda.__file__))" 2>/dev/null || true)
if [ -n "$NUMBA_CUDA_DIR" ]; then
NUMBA_PATCH="$(dirname "$(realpath "${BASH_SOURCE[0]}")")/numba-cuda-13.patch"
pushd "$NUMBA_CUDA_DIR"
patch -p4 <"$NUMBA_PATCH"
popd
fi
fi
echo "Environment variables:"
@ -884,7 +886,7 @@ test_inductor_torchbench_smoketest_perf() {
done
# Perform some "warm-start" runs for a few huggingface models.
for test in AlbertForQuestionAnswering AllenaiLongformerBase DistilBertForMaskedLM DistillGPT2 GoogleFnet YituTechConvBert; do
for test in AllenaiLongformerBase DistilBertForMaskedLM DistillGPT2 GoogleFnet YituTechConvBert; do
python benchmarks/dynamo/huggingface.py --accuracy --training --amp --inductor --device cuda --warm-start-latency \
--only $test --output "$TEST_REPORTS_DIR/inductor_warm_start_smoketest_$test.csv"
python benchmarks/dynamo/check_accuracy.py \

View File

@ -38,7 +38,7 @@ if [[ "$BUILD_ENVIRONMENT" == *cuda* ]]; then
fi
# TODO: Move this to .ci/docker/requirements-ci.txt
python -m pip install "psutil==5.9.1" "pynvml==11.4.1" "pytest-shard==0.1.2"
python -m pip install "psutil==5.9.1" nvidia-ml-py "pytest-shard==0.1.2"
run_tests() {
# Run nvidia-smi if available

View File

@ -66,6 +66,7 @@ readability-simplify-subscript-expr,
readability-string-compare,
-readability-redundant-access-specifiers,
-readability-redundant-control-flow,
-readability-redundant-inline-specifier,
'
HeaderFilterRegex: '^(aten/|c10/|torch/).*$'
WarningsAsErrors: '*'

View File

@ -1 +1 @@
0fc62aa26a30ed7ca419d285f285cb5ba02c4394
2a9138a26ee257fef05310ad3fecf7c55fe80d73

Binary file not shown.

View File

@ -18,6 +18,7 @@ class GitHubComment:
body_text: str
created_at: str
author_login: str
author_url: Optional[str]
author_association: str
editor_login: Optional[str]
database_id: int

Binary file not shown.

View File

@ -38,6 +38,7 @@ def mock_get_comments() -> list[GitHubComment]:
body_text="mock_body_text",
created_at="",
author_login="",
author_url=None,
author_association="",
editor_login=None,
database_id=1,
@ -48,6 +49,7 @@ def mock_get_comments() -> list[GitHubComment]:
body_text=" #" + LABEL_ERR_MSG_TITLE.replace("`", ""),
created_at="",
author_login=BOT_AUTHORS[1],
author_url=None,
author_association="",
editor_login=None,
database_id=2,

View File

@ -32,6 +32,7 @@ from trymerge import (
main as trymerge_main,
MandatoryChecksMissingError,
MergeRule,
PostCommentError,
RE_GHSTACK_DESC,
read_merge_rules,
remove_job_name_suffix,
@ -588,6 +589,23 @@ class TestTryMerge(TestCase):
self.assertEqual(mock_merge_base, pr.get_merge_base())
mocked_gh_fetch_merge_base.assert_called_once()
def test_app_can_revert(self, *args: Any) -> None:
pr = GitHubPR("pytorch", "pytorch", 164660)
repo = DummyGitRepo()
app_comment_id, impostor_comment_id = 3375785595, 3377647892
# Check that app can revert
self.assertIsNotNone(validate_revert(repo, pr, comment_id=app_comment_id))
# But an impostor cannot
self.assertRaises(
PostCommentError,
lambda: validate_revert(repo, pr, comment_id=impostor_comment_id),
)
# Despite its name being the name of the bot
self.assertEqual(
pr.get_comment_by_id(impostor_comment_id).author_login,
"pytorch-auto-revert",
)
@mock.patch("trymerge.gh_graphql", side_effect=mocked_gh_graphql)
@mock.patch("trymerge.gh_fetch_merge_base", return_value="")

View File

@ -234,6 +234,7 @@ query ($owner: String!, $name: String!, $number: Int!) {
createdAt
author {
login
url
}
authorAssociation
editor {
@ -1093,6 +1094,7 @@ class GitHubPR:
body_text=node["bodyText"],
created_at=node["createdAt"] if "createdAt" in node else "",
author_login=node["author"]["login"],
author_url=node["author"].get("url", None),
author_association=node["authorAssociation"],
editor_login=editor["login"] if editor else None,
database_id=node["databaseId"],
@ -2029,6 +2031,11 @@ def validate_revert(
# For some reason, one cannot be a member of a private repo, only a CONTRIBUTOR
if pr.is_base_repo_private():
allowed_reverters.append("CONTRIBUTOR")
# Special-case the pytorch-auto-revert app, which does not have an association
# but should be able to issue revert commands
if comment.author_url == "https://github.com/apps/pytorch-auto-revert":
allowed_reverters.append("NONE")
if author_association not in allowed_reverters:
raise PostCommentError(
f"Will not revert as @{author_login} is not one of "

View File

@ -40,6 +40,15 @@ jobs:
# Use gh CLI to get changed files in the PR with explicit repo
CHANGED_FILES=$(gh api repos/${{ github.repository }}/pulls/$PR_NUMBER/files --paginate --jq '.[] | select(.status != "removed") | .filename' | tr '\n' ' ' | sed 's/ $//')
# See https://github.com/pytorch/pytorch/pull/134215#issuecomment-2332128790
PYI_FILES_TO_ADD=""
for file in ${CHANGED_FILES}; do
if [[ "${file}" == *".pyi.in" ]]; then
PYI_FILES_TO_ADD="${PYI_FILES_TO_ADD} ${file//.in/}"
fi
done
CHANGED_FILES="${CHANGED_FILES}${PYI_FILES_TO_ADD}"
if [ -z "$CHANGED_FILES" ]; then
echo "No changed files found, setting to '*'"
CHANGED_FILES="*"

View File

@ -63,6 +63,7 @@ jobs:
# Same as the build job
python-version: 3.12.7
test-matrix: ${{ needs.macos-perf-py3-arm64-build.outputs.test-matrix }}
timeout-minutes: 300
disable-monitor: false
monitor-log-interval: 15
monitor-data-collect-interval: 4

View File

@ -106,6 +106,16 @@ jobs:
{ config: "dynamic_aot_eager_huggingface", shard: 1, num_shards: 1, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "dynamic_aot_eager_timm", shard: 1, num_shards: 2, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "dynamic_aot_eager_timm", shard: 2, num_shards: 2, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "dynamic_inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "dynamic_inductor_timm", shard: 1, num_shards: 2, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "dynamic_inductor_timm", shard: 2, num_shards: 2, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "dynamic_inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "dynamic_inductor_torchbench", shard: 2, num_shards: 2, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "aot_inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "aot_inductor_timm", shard: 1, num_shards: 2, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "aot_inductor_timm", shard: 2, num_shards: 2, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "aot_inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "aot_inductor_torchbench", shard: 2, num_shards: 2, runner: "linux.rocm.gpu.gfx942.1" },
]}
secrets: inherit

View File

@ -213,9 +213,9 @@ jobs:
docker-image-name: ci-image:pytorch-linux-jammy-rocm-n-py3
test-matrix: |
{ include: [
{ config: "distributed", shard: 1, num_shards: 3, runner: "linux.rocm.gpu.4", owners: ["module:rocm", "oncall:distributed"] },
{ config: "distributed", shard: 2, num_shards: 3, runner: "linux.rocm.gpu.4", owners: ["module:rocm", "oncall:distributed"] },
{ config: "distributed", shard: 3, num_shards: 3, runner: "linux.rocm.gpu.4", owners: ["module:rocm", "oncall:distributed"] },
{ config: "distributed", shard: 1, num_shards: 3, runner: "linux.rocm.gpu.mi250.4", owners: ["module:rocm", "oncall:distributed"] },
{ config: "distributed", shard: 2, num_shards: 3, runner: "linux.rocm.gpu.mi250.4", owners: ["module:rocm", "oncall:distributed"] },
{ config: "distributed", shard: 3, num_shards: 3, runner: "linux.rocm.gpu.mi250.4", owners: ["module:rocm", "oncall:distributed"] },
]}
secrets: inherit

View File

@ -127,8 +127,6 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
# More memory is needed to build with asan
runner: linux.2xlarge.memory
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-jammy-py3.10-clang18-asan
docker-image-name: ci-image:pytorch-linux-jammy-py3-clang18-asan

View File

@ -59,3 +59,29 @@ jobs:
docker-image: ${{ needs.linux-jammy-rocm-py3_10-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-rocm-py3_10-build.outputs.test-matrix }}
secrets: inherit
linux-jammy-rocm-py3_10-gfx1100-test:
if: ${{ github.event_name == 'push' && github.ref == 'refs/heads/main' }}
permissions:
id-token: write
contents: read
name: linux-jammy-rocm-py3_10-gfx1100
uses: ./.github/workflows/_rocm-test.yml
needs:
- linux-jammy-rocm-py3_10-build
- target-determination
with:
build-environment: linux-jammy-rocm-py3.10
docker-image: ${{ needs.linux-jammy-rocm-py3_10-build.outputs.docker-image }}
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 2, runner: "linux.rocm.gpu.gfx1100" },
{ config: "default", shard: 2, num_shards: 2, runner: "linux.rocm.gpu.gfx1100" },
]}
tests-to-include: >
test_nn test_torch test_cuda test_ops test_unary_ufuncs test_binary_ufuncs
test_autograd inductor/test_torchinductor inductor/test_kernel_benchmark
inductor/test_pad_mm inductor/test_benchmark_fusion inductor/test_aot_inductor
inductor/test_torchinductor inductor/test_decompose_mem_bound_mm
inductor/test_flex_attention inductor/test_max_autotune
secrets: inherit

View File

@ -140,8 +140,6 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
# More memory is needed to build with asan
runner: linux.2xlarge.memory
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-jammy-py3.10-clang18-asan
docker-image-name: ci-image:pytorch-linux-jammy-py3-clang18-asan

View File

@ -160,9 +160,10 @@ jobs:
runner: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral"
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral" },
{ config: "default", shard: 2, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral" },
{ config: "default", shard: 3, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral" },
{ config: "default", shard: 1, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral" },
{ config: "default", shard: 2, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral" },
{ config: "default", shard: 3, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral" },
{ config: "default", shard: 4, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral" },
]}
secrets: inherit
@ -189,41 +190,6 @@ jobs:
runner: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral"
secrets: inherit
linux-jammy-rocm-py3_10-build:
if: ${{ startsWith(github.event.ref, 'refs/tags/ciflow/trunk') }}
name: linux-jammy-rocm-py3.10
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-jammy-rocm-py3.10
docker-image-name: ci-image:pytorch-linux-jammy-rocm-n-py3
sync-tag: rocm-build
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 2, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "default", shard: 2, num_shards: 2, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "distributed", shard: 1, num_shards: 1, runner: "linux.rocm.gpu.gfx942.4" },
]}
secrets: inherit
linux-jammy-rocm-py3_10-test:
if: ${{ startsWith(github.event.ref, 'refs/tags/ciflow/trunk') }}
permissions:
id-token: write
contents: read
name: linux-jammy-rocm-py3.10
uses: ./.github/workflows/_rocm-test.yml
needs:
- linux-jammy-rocm-py3_10-build
- target-determination
with:
build-environment: linux-jammy-rocm-py3.10
docker-image: ${{ needs.linux-jammy-rocm-py3_10-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-rocm-py3_10-build.outputs.test-matrix }}
tests-to-include: "test_nn test_torch test_cuda test_ops test_unary_ufuncs test_binary_ufuncs test_autograd inductor/test_torchinductor distributed/test_c10d_common distributed/test_c10d_nccl"
secrets: inherit
inductor-build:
name: inductor-build
uses: ./.github/workflows/_linux-build.yml

.gitignore vendored
View File

@ -88,7 +88,7 @@ torch_compile_debug/
# Listed manually because some files in this directory are not generated
torch/testing/_internal/generated/annotated_fn_args.py
torch/testing/_internal/data/*.pt
torch/csrc/api/include/torch/version.h
torch/headeronly/version.h
torch/csrc/cudnn/cuDNN.cpp
torch/csrc/generated
torch/csrc/generic/TensorMethods.cpp

View File

@ -28,7 +28,7 @@ exclude_patterns = [
'torch/lib/**',
'venv/**',
'**/*.pyi',
"tools/experimental/dynamic_shapes/torchfuzz/**",
"tools/experimental/torchfuzz/**",
'tools/test/test_selective_build.py',
]
command = [
@ -198,7 +198,7 @@ exclude_patterns = [
'tools/test/gen_operators_yaml_test.py',
'tools/test/gen_oplist_test.py',
'tools/test/test_selective_build.py',
'tools/experimental/dynamic_shapes/torchfuzz/**',
'tools/experimental/torchfuzz/**',
]
command = [
'python3',
@ -1573,6 +1573,7 @@ exclude_patterns = [
'torch/_inductor/fx_passes/serialized_patterns/**',
'torch/_inductor/autoheuristic/artifacts/**',
'test/dynamo/cpython/**',
'test/test_torchfuzz_repros.py',
'scripts/**',
'third_party/**',
'fb/**',

View File

@ -13,6 +13,9 @@ load(":build_variables.bzl", "jit_core_sources", "lazy_tensor_ts_sources", "libt
load(":ufunc_defs.bzl", "aten_ufunc_generated_cpu_kernel_sources", "aten_ufunc_generated_cpu_sources", "aten_ufunc_generated_cuda_sources")
load("//:tools/bazel.bzl", "rules")
# Export files for use by torch/headeronly (where version.h generation now lives)
exports_files(["version.txt"])
define_targets(rules = rules)
COMMON_COPTS = [
@ -690,7 +693,9 @@ cc_library(
"torch/csrc/*/generated/*.h",
"torch/csrc/jit/serialization/mobile_bytecode_generated.h",
] + torch_cuda_headers,
) + GENERATED_AUTOGRAD_CPP + [":version_h"],
) + GENERATED_AUTOGRAD_CPP + [
"//torch/headeronly:version_h",
],
includes = [
"third_party/kineto/libkineto/include",
"torch/csrc",

View File

@ -53,7 +53,7 @@ ARG CUDA_PATH=cu121
ARG INSTALL_CHANNEL=whl/nightly
# Automatically set by buildx
# pinning the conda version here, see: https://github.com/pytorch/pytorch/issues/164574
RUN /opt/conda/bin/conda install -c "${INSTALL_CHANNEL}" -y python=${PYTHON_VERSION} conda=25.7.0
RUN /opt/conda/bin/conda install -y python=${PYTHON_VERSION} conda=25.7.0
ARG TARGETPLATFORM

View File

@ -40,41 +40,6 @@ namespace {
->conv
->rnn
*/
const std::map<std::string, std::vector<std::string>> _fp32_precisions = {
{"generic", {{"ieee", "tf32", "bf16", "none"}}},
{"mkldnn", {{"ieee", "tf32", "bf16", "none"}}},
{"cuda", {{"ieee", "tf32", "none"}}}};
// Check whether the backend and op are legal
void check_fp32_prec_backend_and_op(
const std::string& backend,
const std::string& op) {
static std::vector<std::string> backends = {"generic", "mkldnn", "cuda"};
static std::vector<std::string> operators = {"conv", "matmul", "rnn", "all"};
TORCH_CHECK(
std::find(backends.begin(), backends.end(), backend) != backends.end(),
"Invalid backend: ",
backend);
TORCH_CHECK(
std::find(operators.begin(), operators.end(), op) != operators.end(),
"Invalid operator: ",
op);
if (backend == "generic") {
TORCH_CHECK(op == "all", "Invalid operation for generic backend: ", op);
}
}
// Return whether the precision is supported by backends
bool validate_fp32_prec(
const std::string& backend,
const std::string& precision) {
auto iterp = _fp32_precisions.find(backend);
TORCH_CHECK(iterp != _fp32_precisions.end());
auto precisions = iterp->second;
bool valid = std::find(precisions.begin(), precisions.end(), precision) !=
precisions.end();
return valid;
}
C10_ALWAYS_INLINE void warn_deprecated_fp32_precision_api(){
TORCH_WARN_ONCE(
@ -86,6 +51,54 @@ void check_fp32_prec_backend_and_op(
}
} // namespace
Float32Backend str2backend(const std::string& name) {
if (name == "generic")
return Float32Backend::GENERIC;
else if (name == "cuda")
return Float32Backend::CUDA;
else if (name == "mkldnn")
return Float32Backend::MKLDNN;
TORCH_CHECK(false, "Unknown backend: ", name);
}
Float32Op str2op(const std::string& name) {
if (name == "all")
return Float32Op::ALL;
else if (name == "conv")
return Float32Op::CONV;
else if (name == "rnn")
return Float32Op::RNN;
else if (name == "matmul")
return Float32Op::MATMUL;
TORCH_CHECK(false, "Unknown op: ", name);
}
Float32Precision str2precision(const std::string& name) {
if (name == "none")
return Float32Precision::NONE;
else if (name == "ieee")
return Float32Precision::IEEE;
else if (name == "tf32")
return Float32Precision::TF32;
else if (name == "bf16")
return Float32Precision::BF16;
TORCH_CHECK(false, "Unknown precision: ", name);
}
std::string precision2str(Float32Precision prec) {
switch (prec) {
case Float32Precision::NONE:
return "none";
case Float32Precision::IEEE:
return "ieee";
case Float32Precision::TF32:
return "tf32";
case Float32Precision::BF16:
return "bf16";
}
TORCH_CHECK(false, "Invalid enum Float32Precision(", static_cast<int>(prec), ")");
}
Context::Context() = default;
// TODO: This could be bad juju if someone calls globalContext() in the
@ -179,10 +192,10 @@ void Context::setUserEnabledNNPACK(bool e) {
enabled_nnpack = e;
}
bool Context::allowTF32CuDNN(const std::string& op) const {
if (op.empty()){
bool allow_tf32_rnn = float32Precision("cuda", "rnn") == "tf32";
bool allow_tf32_conv = float32Precision("cuda", "conv") == "tf32";
bool Context::allowTF32CuDNN(std::optional<Float32Op> op) const {
if (!op.has_value()) {
bool allow_tf32_rnn = float32Precision(Float32Backend::CUDA, Float32Op::RNN) == Float32Precision::TF32;
bool allow_tf32_conv = float32Precision(Float32Backend::CUDA, Float32Op::CONV) == Float32Precision::TF32;
TORCH_CHECK(
allow_tf32_rnn == allow_tf32_conv && allow_tf32_rnn == allow_tf32_cudnn,
"PyTorch is checking whether allow_tf32 is enabled for cuDNN without a specific operator name,",
@ -191,15 +204,15 @@ bool Context::allowTF32CuDNN(const std::string& op) const {
"We suggest only using the new API to set the TF32 flag(s). See also: ",
"https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices");
} else {
return float32Precision("cuda", op) == "tf32";
return float32Precision(Float32Backend::CUDA, op.value()) == Float32Precision::TF32;
}
warn_deprecated_fp32_precision_api();
return allow_tf32_cudnn;
}
void Context::setAllowTF32CuDNN(bool b) {
setFloat32Precision("cuda", "rnn", b ? "tf32" : "none");
setFloat32Precision("cuda", "conv", b ? "tf32" : "none");
setFloat32Precision(Float32Backend::CUDA, Float32Op::RNN, b ? Float32Precision::TF32 : Float32Precision::NONE);
setFloat32Precision(Float32Backend::CUDA, Float32Op::CONV, b ? Float32Precision::TF32 : Float32Precision::NONE);
allow_tf32_cudnn = b;
warn_deprecated_fp32_precision_api();
}
@ -305,7 +318,7 @@ void Context::setImmediateMiopen(bool b) {
bool Context::allowTF32CuBLAS() const {
bool legacy_allow_tf32 = float32_matmul_precision != at::Float32MatmulPrecision::HIGHEST;
bool allow_tf32_new = float32Precision("cuda", "matmul") == "tf32";
bool allow_tf32_new = float32Precision(Float32Backend::CUDA, Float32Op::MATMUL) == Float32Precision::TF32;
TORCH_CHECK(
legacy_allow_tf32 == allow_tf32_new,
"PyTorch is checking whether allow_tf32_new is enabled for cuBlas matmul,",
@ -318,17 +331,17 @@ bool Context::allowTF32CuBLAS() const {
void Context::setAllowTF32CuBLAS(bool b) {
float32_matmul_precision = b ? at::Float32MatmulPrecision::HIGH : at::Float32MatmulPrecision::HIGHEST;
setFloat32Precision("cuda", "matmul", b ? "tf32" : "ieee");
setFloat32Precision(Float32Backend::CUDA, Float32Op::MATMUL, b ? Float32Precision::TF32 : Float32Precision::IEEE);
}
Float32MatmulPrecision Context::float32MatmulPrecision() const {
bool invalid = float32Precision("cuda", "matmul") == "tf32" &&
bool invalid = float32Precision(Float32Backend::CUDA, Float32Op::MATMUL) == Float32Precision::TF32 &&
float32_matmul_precision == at::Float32MatmulPrecision::HIGHEST;
invalid = invalid ||
(float32Precision("mkldnn", "matmul") == "bf16" &&
(float32Precision(Float32Backend::MKLDNN, Float32Op::MATMUL) == Float32Precision::BF16 &&
float32_matmul_precision != at::Float32MatmulPrecision::MEDIUM);
invalid = invalid ||
(float32Precision("mkldnn", "matmul") == "tf32" &&
(float32Precision(Float32Backend::MKLDNN, Float32Op::MATMUL) == Float32Precision::TF32 &&
float32_matmul_precision != at::Float32MatmulPrecision::HIGH);
TORCH_CHECK(
!invalid,
@ -340,15 +353,26 @@ Float32MatmulPrecision Context::float32MatmulPrecision() const {
return float32_matmul_precision;
}
std::string Context::float32Precision(const std::string& backend, const std::string& op) const {
check_fp32_prec_backend_and_op(backend, op);
auto precision = fp32_precision.find(backend)->second.find(op)->second;
if (precision == "none")
precision = fp32_precision.find(backend)->second.find("all")->second;
if (precision == "none")
precision = fp32_precision.find("generic")->second.find("all")->second;
bool valid_prec = validate_fp32_prec(backend, precision);
return valid_prec ? precision : "none";
Float32Precision Context::float32Precision(Float32Backend backend, Float32Op op) const {
std::pair<Float32Backend, Float32Op> key{backend, op};
auto it = fp32_precision.find(key);
TORCH_CHECK(it != fp32_precision.end(), "Invalid (backend, op) pair: (", backend, ", ", op, ")");
Float32Precision precision = it->second;
if (precision == Float32Precision::NONE) {
key.second = Float32Op::ALL;
precision = fp32_precision.find(key)->second;
}
if (precision == Float32Precision::NONE) {
key.first = Float32Backend::GENERIC;
precision = fp32_precision.find(key)->second;
}
// "cuda" does not support "bf16"
if (backend == Float32Backend::CUDA && precision == Float32Precision::BF16) {
return Float32Precision::NONE;
}
return precision;
}
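A minimal Python model of the lookup order implemented above, using the default table from Context.h with the "highest" matmul setting (illustrative only; the real map is keyed by the Float32Backend/Float32Op enums):

```python
# Fallback order: (backend, op) -> (backend, "all") -> ("generic", "all"),
# with cuda never reporting bf16.
NONE, IEEE, TF32, BF16 = "none", "ieee", "tf32", "bf16"

fp32_precision = {
    ("generic", "all"): NONE,
    ("cuda", "all"): NONE, ("cuda", "conv"): TF32, ("cuda", "rnn"): TF32, ("cuda", "matmul"): NONE,
    ("mkldnn", "all"): NONE, ("mkldnn", "conv"): NONE, ("mkldnn", "rnn"): NONE, ("mkldnn", "matmul"): NONE,
}

def float32_precision(backend: str, op: str) -> str:
    prec = fp32_precision[(backend, op)]
    if prec == NONE:                        # fall back to the backend-wide setting
        prec = fp32_precision[(backend, "all")]
    if prec == NONE:                        # then to the generic setting
        prec = fp32_precision[("generic", "all")]
    if backend == "cuda" and prec == BF16:  # "cuda" does not support "bf16"
        return NONE
    return prec

print(float32_precision("cuda", "conv"))    # tf32
```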
void Context::setFloat32MatmulPrecision(const std::string &s) {
@ -357,18 +381,18 @@ void Context::setFloat32MatmulPrecision(const std::string &s) {
// TODO: consider if CuDNN field needs to also be set for potential future CuDNN ops like multi-headed attention
if (s_ == "highest") {
float32_matmul_precision = at::Float32MatmulPrecision::HIGHEST;
setFloat32Precision("cuda", "matmul", "ieee");
setFloat32Precision("mkldnn", "matmul", "ieee");
setFloat32Precision(Float32Backend::CUDA, Float32Op::MATMUL, Float32Precision::IEEE);
setFloat32Precision(Float32Backend::MKLDNN, Float32Op::MATMUL, Float32Precision::IEEE);
return true;
} else if (s_ == "high") {
float32_matmul_precision = at::Float32MatmulPrecision::HIGH;
setFloat32Precision("cuda", "matmul", "tf32");
setFloat32Precision("mkldnn", "matmul", "tf32");
setFloat32Precision(Float32Backend::CUDA, Float32Op::MATMUL, Float32Precision::TF32);
setFloat32Precision(Float32Backend::MKLDNN, Float32Op::MATMUL, Float32Precision::TF32);
return true;
} else if (s_ == "medium") {
float32_matmul_precision = at::Float32MatmulPrecision::MEDIUM;
setFloat32Precision("cuda", "matmul", "tf32");
setFloat32Precision("mkldnn", "matmul", "bf16");
setFloat32Precision(Float32Backend::CUDA, Float32Op::MATMUL, Float32Precision::TF32);
setFloat32Precision(Float32Backend::MKLDNN, Float32Op::MATMUL, Float32Precision::BF16);
return true;
}
return false;
@ -382,25 +406,16 @@ void Context::setFloat32MatmulPrecision(const std::string &s) {
"setFloat32MatmulPrecision call has no effect.");
}
void Context::setFloat32Precision(const std::string& backend, const std::string& op, const std::string& p) {
check_fp32_prec_backend_and_op(backend, op);
if (validate_fp32_prec(backend, p)) {
fp32_precision[backend][op] = p;
} else {
std::string msg;
auto iterp = _fp32_precisions.find(backend);
TORCH_CHECK(iterp != _fp32_precisions.end());
for (const auto& p : iterp->second) {
msg += p;
msg += " ";
}
TORCH_WARN(
"you have set wrong precision for backend:",
backend,
" setFloat32Precision call has no effect.",
"Please choose precision from: ",
msg);
}
void Context::setFloat32Precision(Float32Backend backend, Float32Op op, Float32Precision p) {
auto it = fp32_precision.find(std::make_pair(backend, op));
TORCH_CHECK(
it != fp32_precision.end(),
"Invalid (backend, op) pair: (", backend, ", ", op, ")");
TORCH_CHECK(
!(backend == Float32Backend::CUDA && p == Float32Precision::BF16),
"backend 'cuda' does not support precision 'bf16'");
it->second = p;
}
at::LinalgBackend Context::linalgPreferredBackend() const {
@ -468,8 +483,8 @@ at::BlasBackend Context::blasPreferredBackend() {
#if ROCM_VERSION >= 60300
"gfx1100", "gfx1101", "gfx1200", "gfx1201", "gfx908",
#endif
#if ROCM_VERSION >= 60500
"gfx950"
#if ROCM_VERSION >= 70000
"gfx950", "gfx1150", "gfx1151"
#endif
};
for (auto index: c10::irange(detail::getCUDAHooks().deviceCount())) {

View File

@ -25,17 +25,27 @@
#include <c10/util/CallOnce.h>
#include <c10/util/Exception.h>
#include <c10/util/env.h>
#include <c10/util/hash.h>
#include <c10/util/irange.h>
#include <cstdint>
#include <map>
#include <mutex>
#include <unordered_map>
namespace at {
class Tensor;
enum class TORCH_API Float32MatmulPrecision { HIGHEST, HIGH, MEDIUM };
enum class TORCH_API Float32Backend { GENERIC, CUDA, MKLDNN };
enum class TORCH_API Float32Op { ALL, CONV, RNN, MATMUL };
enum class TORCH_API Float32Precision { NONE, IEEE, TF32, BF16 };
TORCH_API Float32Backend str2backend(const std::string& name);
TORCH_API Float32Op str2op(const std::string& name);
TORCH_API Float32Precision str2precision(const std::string& name);
TORCH_API std::string precision2str(Float32Precision prec);
class TORCH_API Context {
public:
@ -336,19 +346,17 @@ class TORCH_API Context {
void setFloat32MatmulPrecision(const std::string& s);
void setFloat32Precision(
const std::string& backend,
const std::string& op,
const std::string& s);
bool allowTF32CuDNN(const std::string& op = std::string()) const;
Float32Backend backend,
Float32Op op,
Float32Precision p);
bool allowTF32CuDNN(std::optional<Float32Op> op = std::nullopt) const;
void setAllowTF32CuDNN(bool);
bool allowTF32OneDNN() const;
void setAllowTF32OneDNN(bool);
bool allowTF32CuBLAS() const;
void setAllowTF32CuBLAS(bool);
Float32MatmulPrecision float32MatmulPrecision() const;
std::string float32Precision(
const std::string& backend,
const std::string& op) const;
Float32Precision float32Precision(Float32Backend backend, Float32Op op) const;
bool allowFP16ReductionCuBLAS() const;
void setAllowFP16ReductionCuBLAS(bool);
bool allowBF16ReductionCuBLAS() const;
@ -475,21 +483,20 @@ class TORCH_API Context {
bool enable_sparse_tensor_invariant_checks = false;
bool allow_fp16_reduction_cpu = false;
std::map<std::string, std::map<std::string, std::string>> fp32_precision = {
{"generic", {{"all", "none"}}},
{"mkldnn",
{{"matmul", "none"},
{"conv", "none"},
{"rnn", "none"},
{"all", "none"}}},
{"cuda",
{{"matmul",
float32_matmul_precision == at::Float32MatmulPrecision::HIGHEST
? "none"
: "tf32"},
{"conv", "tf32"},
{"rnn", "tf32"},
{"all", "none"}}},
using Key = std::pair<Float32Backend, Float32Op>;
std::unordered_map<Key, Float32Precision, c10::hash<Key>> fp32_precision = {
{{Float32Backend::GENERIC, Float32Op::ALL}, Float32Precision::NONE},
{{Float32Backend::MKLDNN, Float32Op::ALL}, Float32Precision::NONE},
{{Float32Backend::MKLDNN, Float32Op::CONV}, Float32Precision::NONE},
{{Float32Backend::MKLDNN, Float32Op::RNN}, Float32Precision::NONE},
{{Float32Backend::MKLDNN, Float32Op::MATMUL}, Float32Precision::NONE},
{{Float32Backend::CUDA, Float32Op::ALL}, Float32Precision::NONE},
{{Float32Backend::CUDA, Float32Op::CONV}, Float32Precision::TF32},
{{Float32Backend::CUDA, Float32Op::RNN}, Float32Precision::TF32},
{{Float32Backend::CUDA, Float32Op::MATMUL},
float32_matmul_precision == at::Float32MatmulPrecision::HIGHEST
? Float32Precision::NONE
: Float32Precision::TF32},
};
Allocator* prev_allocator_ptr_{nullptr};
@ -671,5 +678,4 @@ struct TORCH_API ROCmBackwardPassGuard {
~ROCmBackwardPassGuard();
static bool is_backward_pass();
};
} // namespace at

View File

@ -179,7 +179,7 @@ void propagate_names_except(const Tensor& result, const Tensor& src, IntArrayRef
return;
}
const auto src_names = src.names();
const auto result_dim = static_cast<int64_t>(result.dim());
const auto result_dim = result.dim();
const auto src_dim = static_cast<int64_t>(src_names.size());
const auto excluded_dim = static_cast<int64_t>(excluded_idxs.size());
TORCH_INTERNAL_ASSERT(src_dim - excluded_dim == result_dim);

View File

@ -229,14 +229,14 @@ struct TORCH_API SparseTensorImpl : public TensorImpl {
}
void resize_(int64_t sparse_dim, int64_t dense_dim, ArrayRef<int64_t> size) {
return _resize_(sparse_dim, dense_dim, size);
_resize_(sparse_dim, dense_dim, size);
}
void resize_(
int64_t sparse_dim,
int64_t dense_dim,
ArrayRef<c10::SymInt> size) {
return _resize_(sparse_dim, dense_dim, size);
_resize_(sparse_dim, dense_dim, size);
}
// NOTE: this function will resize the sparse tensor and also set `indices`

View File

@ -59,7 +59,7 @@ static inline void set_item(const Tensor& self, ArrayRef<TensorIndex> indices, c
}
}
return set_item(self, indices, value);
set_item(self, indices, value);
}
} // namespace indexing

View File

@ -214,7 +214,7 @@ inline Tensor applySlice(
"step must be greater than zero");
// See NOTE [nested tensor size for indexing]
if (self_sizes.has_value() && self_sizes.value().size() > 0) {
if (self_sizes.has_value() && !self_sizes.value().empty()) {
// Skip this optimization if we are tracing, as the trace may be polymorphic
// over the shape of the `self` tensor, and we still want to record
// the slice.

View File

@ -765,7 +765,8 @@ void TensorIteratorBase::for_each(loop2d_t loop, int64_t grain_size) {
if (numel == 0) {
return;
} else if (numel < grain_size || at::get_num_threads() == 1) {
return serial_for_each(loop, {0, numel});
serial_for_each(loop, {0, numel});
return;
} else {
at::parallel_for(0, numel, grain_size, [&](int64_t begin, int64_t end) {
serial_for_each(loop, {begin, end});

View File

@ -273,11 +273,11 @@ void checkLayout(CheckedFrom c, at::ArrayRef<Tensor> tensors, at::Layout layout)
}
void * maybe_data_ptr(const Tensor& tensor) {
return tensor.defined() ? (void *)tensor.data_ptr() : nullptr;
return tensor.defined() ? tensor.data_ptr() : nullptr;
}
void * maybe_data_ptr(const TensorArg& tensor) {
return tensor->defined() ? (void *)tensor->data_ptr() : nullptr;
return tensor->defined() ? tensor->data_ptr() : nullptr;
}
void check_dim_size(

View File

@ -50,6 +50,46 @@ namespace {
constexpr size_t MAX_SIZE_INDEX = 64;
}
// A large reserved pinned memory segment that is created in advance and used
// to serve small pinned memory requests, avoiding calls into expensive APIs.
// We never free this memory; the pointer is bumped forward as new blocks are
// allocated, and freed blocks are cached in the free lists.
struct PinnedReserveSegment {
PinnedReserveSegment(void *start, size_t size) : start_(start), size_(size),
current_ptr_(start_), initialized_(true) {}
PinnedReserveSegment() : start_(nullptr), size_(0), current_ptr_(nullptr), initialized_(false) {}
bool initialized() {
return initialized_;
}
void* allocate(size_t bytes) {
std::lock_guard<std::mutex> guard(mutex_);
// Round the requested size up to a 4KB boundary for all requests, including small ones.
size_t rounded_bytes = (bytes + 4096 - 1) & ~(4096 - 1);
if (((uint8_t*)current_ptr_ + rounded_bytes) > ((uint8_t*)start_ + size_)) {
return nullptr;
}
void* ptr = current_ptr_;
current_ptr_ = (uint8_t*)current_ptr_ + rounded_bytes;
return ptr;
}
bool owns(void* ptr) {
return ptr >= start_ && ptr < (uint8_t*)start_ + size_;
}
std::mutex mutex_;
void* start_;
size_t size_;
void* current_ptr_;
bool initialized_;
};
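A rough Python model of the bump-pointer behavior above (illustrative only, not the allocator's API): sizes round up to 4 KB, allocation merely advances an offset, and the segment never returns memory:

```python
PAGE = 4096

class ReserveSegment:
    """Toy model of PinnedReserveSegment; offsets stand in for pointers."""
    def __init__(self, size: int):
        self.size = size
        self.offset = 0

    def allocate(self, nbytes: int):
        rounded = (nbytes + PAGE - 1) & ~(PAGE - 1)   # round up to a 4 KB boundary
        if self.offset + rounded > self.size:
            return None                               # caller falls back to the slow path
        ptr = self.offset
        self.offset += rounded                        # bump the pointer; never freed
        return ptr

    def owns(self, ptr: int) -> bool:
        return 0 <= ptr < self.size

seg = ReserveSegment(256 * 1024 * 1024)
print(seg.allocate(1))      # 0     (consumes a full 4 KB page)
print(seg.allocate(5000))   # 4096  (request rounded up to 8 KB)
```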
// Struct containing memory allocator summary statistics for host.
struct TORCH_API HostStats {
// COUNT: total allocations (active)
@ -203,17 +243,6 @@ struct CachingHostAllocatorImpl {
// background.
if (!pinned_use_background_threads()) {
process_events();
} else {
// Launch the background thread and process events in a loop.
static bool background_thread_flag [[maybe_unused]] = [this] {
getBackgroundThreadPool()->run([&]() {
while (active_) {
process_events();
std::this_thread::sleep_for(std::chrono::microseconds(100));
}
});
return true;
}();
}
// Round up the allocation to the nearest power of two to improve reuse.
@ -226,6 +255,21 @@ struct CachingHostAllocatorImpl {
return {block->ptr_, reinterpret_cast<void*>(block)};
}
// Check the recently freed blocks with pending events to see if we
// can reuse them. Call get_free_block again after processing events
if (pinned_use_background_threads()) {
// Launch the background thread and process events in a loop.
static bool background_thread_flag [[maybe_unused]] = [this] {
getBackgroundThreadPool()->run([&]() {
while (active_) {
process_events();
std::this_thread::sleep_for(std::chrono::microseconds(100));
}
});
return true;
}();
}
// Slow path: if we can't allocate from the cached free list, we need
// to create a new block.
void* ptr = nullptr;

View File

@ -49,7 +49,7 @@ static void check_unique_names(DimnameList names) {
}
void check_names_valid_for(const TensorBase& tensor, DimnameList names) {
return impl::check_names_valid_for(tensor.unsafeGetTensorImpl(), names);
impl::check_names_valid_for(tensor.unsafeGetTensorImpl(), names);
}
void check_names_valid_for(size_t tensor_dim, DimnameList names) {

View File

@ -138,7 +138,7 @@ void Tensor::_backward(TensorList inputs,
const std::optional<Tensor>& gradient,
std::optional<bool> keep_graph,
bool create_graph) const {
return impl::GetVariableHooks()->_backward(*this, inputs, gradient, keep_graph, create_graph);
impl::GetVariableHooks()->_backward(*this, inputs, gradient, keep_graph, create_graph);
}
const TensorBase& TensorBase::requires_grad_(bool _requires_grad) const {
@ -173,12 +173,4 @@ unsigned TensorBase::_register_hook(std::function<TensorBase(const TensorBase&)>
return impl::GetVariableHooks()->_register_hook(*this, std::move(hook));
}
std::optional<ScalarType> TensorBase::grad_dtype() const {
return impl::GetVariableHooks()->grad_dtype(*this);
}
void TensorBase::set_grad_dtype(const std::optional<ScalarType>& grad_dtype) const {
return impl::GetVariableHooks()->set_grad_dtype(*this, grad_dtype);
}
} // namespace at

View File

@ -930,10 +930,6 @@ public:
const TensorBase& requires_grad_(bool _requires_grad=true) const;
std::optional<ScalarType> grad_dtype() const;
void set_grad_dtype(const std::optional<ScalarType>& grad_dtype) const;
// View Variables
//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

View File

@ -117,7 +117,7 @@ C10_HOST_DEVICE inline T cauchy(T val, T median, T sigma) {
template <>
C10_HOST_DEVICE inline double cauchy(double val, double median, double sigma) {
// https://en.wikipedia.org/wiki/Cauchy_distribution#Cumulative_distribution_function
return median + sigma * at::tan(c10::pi<double> * (val - static_cast<double>(0.5)));
return median + sigma * at::tan(c10::pi<double> * (val - 0.5));
}
/**

View File

@ -68,8 +68,6 @@ struct TORCH_API VariableHooksInterface {
const c10::OperatorHandle& op,
c10::DispatchKeySet dispatch_keys,
torch::jit::Stack* stack) const = 0;
virtual std::optional<c10::ScalarType> grad_dtype(const TensorBase&) const = 0;
virtual void set_grad_dtype(const TensorBase&, const std::optional<c10::ScalarType>&) const = 0;
};
TORCH_API void SetVariableHooks(VariableHooksInterface* hooks);

View File

@ -2,7 +2,7 @@
namespace c10 {
inline BoxedKernel::BoxedKernel() : functor_(), boxed_kernel_func_(nullptr) {}
inline BoxedKernel::BoxedKernel() : boxed_kernel_func_(nullptr) {}
inline BoxedKernel::BoxedKernel(
std::unique_ptr<OperatorKernel> functor,

View File

@ -20,9 +20,7 @@ make_unique_base(Args&&... args) {
} // namespace detail
inline KernelFunction::KernelFunction()
: boxed_kernel_func_(),
unboxed_kernel_func_(nullptr),
sym_unboxed_kernel_func_(nullptr) {}
: unboxed_kernel_func_(nullptr), sym_unboxed_kernel_func_(nullptr) {}
inline KernelFunction::~KernelFunction() {
if (tokens_) {

View File

@ -76,13 +76,7 @@ void _print_dispatch_trace(const std::string& label, const std::string& op_name,
OpRegistrationListener::~OpRegistrationListener()= default;
Dispatcher::Dispatcher()
: operators_()
, operatorLookupTable_()
, backendFallbackKernels_()
, listeners_(std::make_unique<detail::RegistrationListenerList>())
, cond_var_()
, guard_(std::make_shared<Guard>())
Dispatcher::Dispatcher(): backendFallbackKernels_(), listeners_(std::make_unique<detail::RegistrationListenerList>()), guard_(std::make_shared<Guard>())
{}
Dispatcher::~Dispatcher() {

View File

@ -96,7 +96,7 @@ class TORCH_API Dispatcher final {
friend class TypedOperatorHandle;
struct Guard final {
Guard() : alive(true), mutex() {}
Guard() : alive(true) {}
std::atomic<bool> alive;
std::mutex mutex;
};
@ -496,7 +496,7 @@ class TORCH_API OperatorHandle {
}
void checkInvariants() const {
return operatorDef_->op.checkInvariants();
operatorDef_->op.checkInvariants();
}
c10::ArrayRef<at::Tag> getTags() const {
@ -932,7 +932,7 @@ inline void Dispatcher::redispatchBoxed(
}
#endif
const auto& kernel = entry.lookup(dispatchKeySet);
return kernel.callBoxed(op, dispatchKeySet, stack);
kernel.callBoxed(op, dispatchKeySet, stack);
}
} // namespace c10

View File

@ -62,17 +62,7 @@ static const auto& getDispatchTableIndexToKey() {
}
OperatorEntry::OperatorEntry(OperatorName&& operator_name)
: name_(std::move(operator_name))
, schema_()
#ifndef C10_MOBILE
, tags_()
#endif
, dispatchTable_()
, dispatchKeyExtractor_(DispatchKeyExtractor::makeUninitialized())
, kernels_()
, cpp_signature_()
, sym_cpp_signature_()
, is_observed_(ObservedOperators::isObserved(name_))
: name_(std::move(operator_name)), dispatchTable_(), dispatchKeyExtractor_(DispatchKeyExtractor::makeUninitialized()), is_observed_(ObservedOperators::isObserved(name_))
{
// Pick up any backend fallbacks that were registered prior to this
// OperatorEntry being created.

View File

@ -114,7 +114,7 @@ constexpr bool allowlist_contains(std::string_view allowlist, std::string_view i
}
next++;
} else {
if (allowlist.substr(cur).compare(item) == 0) {
if (allowlist.substr(cur) == item) {
return true;
}
break;

View File

@ -73,7 +73,7 @@ c10::FunctionSchema RegisterOperators::inferSchemaFromKernels_(
std::optional<FunctionSchema> inferred_schema = std::nullopt;
for (const auto& kernel : options.kernels) {
if (nullptr != kernel.inferred_function_schema.get()) {
if (nullptr != kernel.inferred_function_schema) {
if (!inferred_schema.has_value()) {
inferred_schema = *kernel.inferred_function_schema;
break;

View File

@ -411,7 +411,6 @@ public:
Options()
: schemaOrName_(std::nullopt)
, kernels()
, aliasAnalysisKind_(std::nullopt)
{}
@ -420,7 +419,6 @@ public:
struct KernelRegistrationConfig final {
KernelRegistrationConfig()
: dispatch_key(std::nullopt)
, func()
, cpp_signature(std::nullopt)
, inferred_function_schema(nullptr)
{}

View File

@ -905,7 +905,7 @@ class Vectorized8 : public Vectorizedi {
// Because loadu(const void* ptr, T count) requires zero initialization for
// upper 128 bits. However, by using _mm256_castsi128_si256, the upper 128
// bits of the result are undefined.
// TODO<leslie> We can use _mm256_zextsi128_si256 in the furture,
// TODO<leslie> We can use _mm256_zextsi128_si256 in the future,
// since gcc 9.3 doesn't support it now.
__m128i input_128 = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(ptr));
return _mm256_castsi128_si256(input_128);
@ -1844,7 +1844,7 @@ Vectorized<int16_t> inline shift_256_16(
c0 = _mm256_srav_epi32(a0, b0);
c0 = _mm256_shuffle_epi8(c0, ctl_1_0);
// Peform shifting the same way for input array elements with
// Perform shifting the same way for input array elements with
// idx%2==1.
__m256i a1 = _mm256_and_si256(a, keep_1);
__m256i b1 = _mm256_shuffle_epi8(b, ctl_1_0);
@ -2180,7 +2180,7 @@ Vectorized<T> inline shift_256_8(
c0 = _mm256_srlv_epi32(a0, b0);
c0 = _mm256_shuffle_epi8(c0, ctl_3_0);
// Peform shifting the same way for input array elements with
// Perform shifting the same way for input array elements with
// idx%4==1.
__m256i a1 = _mm256_shuffle_epi8(a, ctl_1_3);
__m256i b1 = _mm256_shuffle_epi8(b, ctl_1_0);
@ -2193,7 +2193,7 @@ Vectorized<T> inline shift_256_8(
c1 = _mm256_srlv_epi32(a1, b1);
c1 = _mm256_shuffle_epi8(c1, ctl_3_1);
// Peform shifting the same way for input array elements with
// Perform shifting the same way for input array elements with
// idx%4==2.
__m256i a2 = _mm256_shuffle_epi8(a, ctl_2_3);
__m256i b2 = _mm256_shuffle_epi8(b, ctl_2_0);
@ -2206,7 +2206,7 @@ Vectorized<T> inline shift_256_8(
c2 = _mm256_srlv_epi32(a2, b2);
c2 = _mm256_shuffle_epi8(c2, ctl_3_2);
// Peform shifting the same way for input array elements with
// Perform shifting the same way for input array elements with
// idx%4==3.
__m256i a3 = _mm256_and_si256(a, keep_3);
__m256i b3 = _mm256_shuffle_epi8(b, ctl_3_0);

View File

@ -1088,7 +1088,7 @@ class Vectorized8 : public Vectorizedi {
// Because loadu(const void* ptr, T count) requires zero initialization for
// upper 384 bits. However, by using _mm512_castsi128_si512, the upper 384
// bits of the result are undefined.
// TODO<leslie> We can use _mm512_zextsi128_si512 in the furture,
// TODO<leslie> We can use _mm512_zextsi128_si512 in the future,
// since gcc 9.3 doesn't support it now.
__m128i input_128 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr));
return _mm512_castsi128_si512(input_128);
@ -2022,7 +2022,7 @@ Vectorized<T> inline shift_512_8(
c0 = _mm512_srlv_epi16(a0, b0);
c0 = _mm512_shuffle_epi8(c0, ctl_1_0);
// Peform shifting the same way for input array elements with
// Perform shifting the same way for input array elements with
// idx%2==1.
__m512i a1 = _mm512_and_si512(a, keep_1);
__m512i b1 = _mm512_shuffle_epi8(b, ctl_1_0);

View File

@ -323,7 +323,7 @@ class CuBlasLtMatmulDescriptor : public CuBlasLtDescriptor<
descriptor_.reset(raw_descriptor);
}
template <typename T>
inline void setAttribute(cublasLtMatmulDescAttributes_t attr, const T value) {
void setAttribute(cublasLtMatmulDescAttributes_t attr, const T value) {
// NOLINTNEXTLINE(bugprone-sizeof-expression)
TORCH_CUDABLAS_CHECK(::cublasLtMatmulDescSetAttribute(descriptor(), attr, &value, sizeof(value)));
}
@ -345,7 +345,7 @@ class CuBlasLtMatrixLayout : public CuBlasLtDescriptor<
descriptor_.reset(raw_descriptor);
}
template <typename T>
inline void setAttribute(cublasLtMatrixLayoutAttribute_t attr, const T value) {
void setAttribute(cublasLtMatrixLayoutAttribute_t attr, const T value) {
TORCH_CUDABLAS_CHECK(::cublasLtMatrixLayoutSetAttribute(descriptor(), attr, &value, sizeof(T)));
}
};
@ -360,7 +360,7 @@ class CuBlasLtMatmulPreference : public CuBlasLtDescriptor<
descriptor_.reset(raw_descriptor);
}
template <typename T>
inline void setAttribute(cublasLtMatmulPreferenceAttributes_t attr, const T value) {
void setAttribute(cublasLtMatmulPreferenceAttributes_t attr, const T value) {
TORCH_CUDABLAS_CHECK(::cublasLtMatmulPreferenceSetAttribute(descriptor(), attr, &value, sizeof(T)));
}
};
@ -395,7 +395,7 @@ static inline bool bgemm_internal_cublaslt(CUDABLAS_BGEMM_ARGTYPES_AND_C_DTYPE(D
computeType = CUBLAS_COMPUTE_64F;
scaleType = CUDA_R_64F;
} else if constexpr (std::is_same_v<Dtype, float>) {
if (at::globalContext().float32Precision("cuda", "matmul") == "tf32") {
if (at::globalContext().float32Precision(at::Float32Backend::CUDA, at::Float32Op::MATMUL) == at::Float32Precision::TF32) {
computeType = CUBLAS_COMPUTE_32F_FAST_TF32;
}
} else if constexpr (std::is_same_v<Dtype, c10::complex<double>>) {
@ -1270,7 +1270,7 @@ void gemm_internal<float>(CUDABLAS_GEMM_ARGTYPES(float))
}
#if defined(USE_ROCM) && defined(USE_ROCM_CK_GEMM)
else if (at::globalContext().blasPreferredBackend() == BlasBackend::Ck) {
if (at::detail::getCUDAHooks().isGPUArch({"gfx1100"})) { //no CK GEMM version for gfx1100
if (at::detail::getCUDAHooks().isGPUArch({"gfx11", "gfx12"})) { //no CK GEMM version
gemm_internal_cublaslt<float>(CUDABLAS_GEMM_ARGS(float));
} else{
at::native::gemm_internal_ck<float>(CUDABLAS_GEMM_ARGS(float));
@ -1559,7 +1559,7 @@ bool gemm_and_bias(
computeType = CUBLAS_COMPUTE_64F;
scaleType = CUDA_R_64F;
} else if constexpr (std::is_same_v<Dtype, float>) {
if (at::globalContext().float32Precision("cuda", "matmul") == "tf32") {
if (at::globalContext().float32Precision(at::Float32Backend::CUDA, at::Float32Op::MATMUL) == at::Float32Precision::TF32) {
computeType = CUBLAS_COMPUTE_32F_FAST_TF32;
}
} else if constexpr (std::is_same_v<Dtype, at::Half>) {

View File

@ -109,7 +109,7 @@ void CUDAGeneratorState::increase(uint64_t increment) {
offset_intragraph_ % 4 == 0, "RNG offset must be a multiple of 4.");
// Ensures the increment does not cause overflow.
TORCH_INTERNAL_ASSERT(
offset_intragraph_ <= std::numeric_limits<uint32_t>::max() - increment,
offset_intragraph_ <= std::numeric_limits<uint64_t>::max() - increment,
"Increment causes overflow in the offset value.");
offset_intragraph_ += increment;
} else {
@ -461,7 +461,7 @@ void CUDAGeneratorImpl::unregister_graph(cuda::CUDAGraph* graph) {
*/
PhiloxCudaState CUDAGeneratorImpl::philox_cuda_state(uint64_t increment) {
if (at::cuda::currentStreamCaptureStatus() != at::cuda::CaptureStatus::None) {
uint32_t offset = state_->offset_intragraph_;
uint64_t offset = state_->offset_intragraph_;
state_->increase(increment);
return PhiloxCudaState(
state_->seed_extragraph_.data_ptr<int64_t>(),

View File

@ -96,16 +96,16 @@ struct CUDAGraph;
struct CUDAGeneratorState : public c10::intrusive_ptr_target {
uint64_t seed_;
uint64_t philox_offset_per_thread_;
uint32_t offset_intragraph_;
uint64_t offset_intragraph_;
bool capturing_{};
std::unordered_set<cuda::CUDAGraph*> registered_graphs_;
at::TensorBase seed_extragraph_{};
at::TensorBase offset_extragraph_{};
at::TensorBase seed_extragraph_;
at::TensorBase offset_extragraph_;
CUDAGeneratorState(
uint64_t seed = default_rng_seed_val,
uint64_t philox_offset_per_thread = 0,
uint32_t offset_intragraph = 0)
uint64_t offset_intragraph = 0)
: seed_(seed),
philox_offset_per_thread_(philox_offset_per_thread),
offset_intragraph_(offset_intragraph) {}
@ -167,7 +167,7 @@ struct TORCH_CUDA_CPP_API CUDAGeneratorImpl : public c10::GeneratorImpl {
CUDAGeneratorImpl* clone_impl() const override;
c10::intrusive_ptr<CUDAGeneratorState> state_;
std::atomic_flag no_reset_rnn_state_{};
std::atomic_flag no_reset_rnn_state_;
};
namespace cuda::detail {

View File

@ -56,7 +56,7 @@ struct TORCH_CUDA_CPP_API CUDAGraph {
// the ID assigned by cuda during graph capture,
// used to identify when a stream is participating in capture
CaptureId_t capture_id_ = -1;
CaptureId_t capture_id_ = 0;
// uuid used to request a particular private mempool from CUDACachingAllocator.
// By default, this will be set to {id_, 0}.

View File

@ -6,43 +6,15 @@
#define HIPSPARSE_VERSION ((hipsparseVersionMajor*100000) + (hipsparseVersionMinor*100) + hipsparseVersionPatch)
#endif
// cuSparse Generic API added in CUDA 10.1
// Windows support added in CUDA 11.0
#if defined(CUDART_VERSION) && defined(CUSPARSE_VERSION) && ((CUSPARSE_VERSION >= 10300) || (CUSPARSE_VERSION >= 11000 && defined(_WIN32)))
#define AT_USE_CUSPARSE_GENERIC_API() 1
#else
#define AT_USE_CUSPARSE_GENERIC_API() 0
#endif
// cuSparse Generic API descriptor pointers were changed to const in CUDA 12.0
#if defined(CUDART_VERSION) && defined(CUSPARSE_VERSION) && \
(CUSPARSE_VERSION < 12000)
#define AT_USE_CUSPARSE_NON_CONST_DESCRIPTORS() 1
#else
#define AT_USE_CUSPARSE_NON_CONST_DESCRIPTORS() 0
#endif
#if defined(CUDART_VERSION) && defined(CUSPARSE_VERSION) && \
(CUSPARSE_VERSION >= 12000)
#define AT_USE_CUSPARSE_CONST_DESCRIPTORS() 1
#else
#define AT_USE_CUSPARSE_CONST_DESCRIPTORS() 0
#endif
#if defined(USE_ROCM)
// hipSparse const API added in v2.4.0
#if HIPSPARSE_VERSION >= 200400
#define AT_USE_HIPSPARSE_CONST_DESCRIPTORS() 1
#define AT_USE_HIPSPARSE_NON_CONST_DESCRIPTORS() 0
#define AT_USE_HIPSPARSE_GENERIC_API() 1
#else
#define AT_USE_HIPSPARSE_CONST_DESCRIPTORS() 0
#define AT_USE_HIPSPARSE_NON_CONST_DESCRIPTORS() 1
#define AT_USE_HIPSPARSE_GENERIC_API() 1
#endif
#else // USE_ROCM
#define AT_USE_HIPSPARSE_CONST_DESCRIPTORS() 0
#define AT_USE_HIPSPARSE_NON_CONST_DESCRIPTORS() 0
#define AT_USE_HIPSPARSE_GENERIC_API() 0
#endif // USE_ROCM

View File

@ -12,8 +12,6 @@ cusparseStatus_t destroyConstDnMat(const cusparseDnMatDescr* dnMatDescr) {
return cusparseDestroyDnMat(const_cast<cusparseDnMatDescr*>(dnMatDescr));
}
#if AT_USE_CUSPARSE_GENERIC_API() || AT_USE_HIPSPARSE_GENERIC_API()
namespace {
// If a specific GPU model does not provide native support for a given data
@ -210,6 +208,4 @@ CuSparseSpMatCsrDescriptor::CuSparseSpMatCsrDescriptor(const Tensor& input, int6
descriptor_.reset(raw_descriptor);
}
#endif // AT_USE_CUSPARSE_GENERIC_API() || AT_USE_HIPSPARSE_GENERIC_API()
} // namespace at::cuda::sparse

View File

@ -35,7 +35,6 @@ class CuSparseDescriptor {
std::unique_ptr<T, CuSparseDescriptorDeleter<T, destructor>> descriptor_;
};
#if AT_USE_CUSPARSE_CONST_DESCRIPTORS() || AT_USE_HIPSPARSE_CONST_DESCRIPTORS()
template <typename T, cusparseStatus_t (*destructor)(const T*)>
struct ConstCuSparseDescriptorDeleter {
void operator()(T* x) {
@ -58,7 +57,6 @@ class ConstCuSparseDescriptor {
protected:
std::unique_ptr<T, ConstCuSparseDescriptorDeleter<T, destructor>> descriptor_;
};
#endif // AT_USE_CUSPARSE_CONST_DESCRIPTORS || AT_USE_HIPSPARSE_CONST_DESCRIPTORS
#if defined(USE_ROCM)
using cusparseMatDescr = std::remove_pointer_t<hipsparseMatDescr_t>;
@ -123,39 +121,8 @@ class TORCH_CUDA_CPP_API CuSparseBsrsm2Info
#endif // AT_USE_HIPSPARSE_TRIANGULAR_SOLVE
#if AT_USE_CUSPARSE_GENERIC_API() || AT_USE_HIPSPARSE_GENERIC_API()
cusparseIndexType_t getCuSparseIndexType(const c10::ScalarType& scalar_type);
#if AT_USE_CUSPARSE_NON_CONST_DESCRIPTORS() || AT_USE_HIPSPARSE_NON_CONST_DESCRIPTORS()
class TORCH_CUDA_CPP_API CuSparseDnMatDescriptor
: public CuSparseDescriptor<cusparseDnMatDescr, &cusparseDestroyDnMat> {
public:
explicit CuSparseDnMatDescriptor(const Tensor& input, int64_t batch_offset = -1);
};
class TORCH_CUDA_CPP_API CuSparseConstDnMatDescriptor
: public CuSparseDescriptor<const cusparseDnMatDescr, &destroyConstDnMat> {
public:
explicit CuSparseConstDnMatDescriptor(const Tensor& input, int64_t batch_offset = -1);
cusparseDnMatDescr* unsafe_mutable_descriptor() const {
return const_cast<cusparseDnMatDescr*>(descriptor());
}
cusparseDnMatDescr* unsafe_mutable_descriptor() {
return const_cast<cusparseDnMatDescr*>(descriptor());
}
};
class TORCH_CUDA_CPP_API CuSparseDnVecDescriptor
: public CuSparseDescriptor<cusparseDnVecDescr, &cusparseDestroyDnVec> {
public:
explicit CuSparseDnVecDescriptor(const Tensor& input);
};
class TORCH_CUDA_CPP_API CuSparseSpMatDescriptor
: public CuSparseDescriptor<cusparseSpMatDescr, &cusparseDestroySpMat> {};
#elif AT_USE_CUSPARSE_CONST_DESCRIPTORS() || AT_USE_HIPSPARSE_CONST_DESCRIPTORS()
class TORCH_CUDA_CPP_API CuSparseDnMatDescriptor
: public ConstCuSparseDescriptor<
cusparseDnMatDescr,
@ -194,7 +161,6 @@ class TORCH_CUDA_CPP_API CuSparseSpMatDescriptor
: public ConstCuSparseDescriptor<
cusparseSpMatDescr,
&cusparseDestroySpMat> {};
#endif // AT_USE_CUSPARSE_CONST_DESCRIPTORS() || AT_USE_HIPSPARSE_CONST_DESCRIPTORS()
class TORCH_CUDA_CPP_API CuSparseSpMatCsrDescriptor
: public CuSparseSpMatDescriptor {
@ -283,6 +249,4 @@ class TORCH_CUDA_CPP_API CuSparseSpGEMMDescriptor
}
};
#endif // AT_USE_CUSPARSE_GENERIC_API() || AT_USE_HIPSPARSE_GENERIC_API()
} // namespace at::cuda::sparse

View File

@ -9,7 +9,6 @@
#include <cuda_runtime_api.h>
#include <future>
#include <unordered_map>
namespace at::cuda {
namespace {
@ -72,9 +71,20 @@ using Block = HostBlock<CUDAStream>;
struct CUDACachingHostAllocatorImpl
: public CachingHostAllocatorImpl<CUDAStream, EventPool::Event> {
private:
std::unordered_map<void*, bool> use_host_register;
ska::flat_hash_map<void*, bool> use_host_register;
void allocate_host_memory(size_t size, void** ptr) override {
// try allocating from reserve segment first before calling into expensive APIs
if (get_reserve_segment().initialized()) {
*ptr = get_reserve_segment().allocate(size);
if (*ptr != nullptr) {
return;
}
}
allocate_host_memory_slowpath(size, ptr);
}
void allocate_host_memory_slowpath(size_t size, void** ptr) {
// Pinned memory pointers allocated by any device can be directly used by
// any other device, regardless of the current device at the time of
// allocation, since we assume unified addressing. So we grab any existing
@ -113,6 +123,18 @@ struct CUDACachingHostAllocatorImpl
}
void free_block(Block* block) override {
// We never free blocks from the reserve segment
if (get_reserve_segment().initialized()) {
// Check if the block is from the reserve segment
if (get_reserve_segment().owns(block->ptr_)) {
return;
}
}
free_block_slowpath(block);
}
void free_block_slowpath(Block* block) {
auto start = std::chrono::steady_clock::now();
// Users may change the allocator config at will. torch unit tests do this.
// However, allocations using cudaHostRegister should use corresponding
@ -172,6 +194,20 @@ struct CUDACachingHostAllocatorImpl
return event_pool->get(idx);
}
PinnedReserveSegment& get_reserve_segment() {
static auto reserve_segment = [&]() {
if (c10::cuda::CUDACachingAllocator::CUDAAllocatorConfig::pinned_reserve_segment_size_mb() > 0) {
void *ptr;
size_t sz = c10::cuda::CUDACachingAllocator::CUDAAllocatorConfig::pinned_reserve_segment_size_mb() * 1024 * 1024;
allocate_host_memory_slowpath(sz, &ptr);
return PinnedReserveSegment(ptr, sz);
} else {
return PinnedReserveSegment();
}
} ();
return reserve_segment;
}
TaskThreadPool* getThreadPool() {
static TaskThreadPool* pool = new TaskThreadPool(
static_cast<int>(c10::cuda::CUDACachingAllocator::CUDAAllocatorConfig::
@ -186,15 +222,15 @@ struct CUDACachingHostAllocatorImpl
size_t numThreads,
size_t pageSize) {
uintptr_t start = (uintptr_t)ptr + (size * i / numThreads);
uintptr_t end = (uintptr_t)start + (size / numThreads);
uintptr_t end = start + (size / numThreads);
if (i == (numThreads - 1)) {
end = (uintptr_t)ptr + size;
}
// pre-fault/map the pages by setting the first byte of the page
uintptr_t alignedStart =
(((uintptr_t)start + pageSize - 1) & ~(pageSize - 1));
for (uintptr_t p = alignedStart; p < ((uintptr_t)end); p += pageSize) {
((start + pageSize - 1) & ~(pageSize - 1));
for (uintptr_t p = alignedStart; p < (end); p += pageSize) {
// NOLINTNEXTLINE(performance-no-int-to-ptr)
memset((void*)p, 0, 1);
}

View File

@ -310,7 +310,7 @@ cublasHandle_t getCurrentCUDABlasHandle() {
// FP32 data type calculations based on the value of the allow_tf32 flag.
// To enable TF32, set the math mode of the handle to CUBLAS_TF32_TENSOR_OP_MATH.
if (!NoTF32Guard::should_disable_tf32() &&
at::globalContext().float32Precision("cuda", "matmul") == "tf32") {
at::globalContext().float32Precision(at::Float32Backend::CUDA, at::Float32Op::MATMUL) == at::Float32Precision::TF32) {
TORCH_CUDABLAS_CHECK(cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH));
} else {
TORCH_CUDABLAS_CHECK(cublasSetMathMode(handle, CUBLAS_DEFAULT_MATH));
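
This hunk, like the rocBLAS and hipBLASLt tuning hunks later in the diff, replaces the stringly-typed float32Precision("cuda", "matmul") == "tf32" check with typed enums. As a hedged illustration of what that buys, here is a reduced, self-contained version of such a precision registry; the CUDA/MATMUL/CONV/TF32 names mirror the ones used in the diff, while the remaining enumerators and the Context class itself are placeholders, not PyTorch's actual definitions.

```cpp
#include <iostream>

// Reduced sketch of a typed float32-precision setting. With enums, a typo such
// as "matmlu" or "tf-32" becomes a compile error instead of a silent mismatch
// that a string-compare API would simply treat as "not TF32".
enum class Float32Backend { GENERIC, CUDA, MKLDNN };
enum class Float32Op { ALL, CONV, RNN, MATMUL };
enum class Float32Precision { NONE, IEEE, TF32, BF16 };

class Context {
 public:
  void setFloat32Precision(Float32Backend backend, Float32Op op,
                           Float32Precision p) {
    if (backend == Float32Backend::CUDA && op == Float32Op::MATMUL) {
      cuda_matmul_ = p;
    }
  }
  Float32Precision float32Precision(Float32Backend backend,
                                    Float32Op op) const {
    return (backend == Float32Backend::CUDA && op == Float32Op::MATMUL)
        ? cuda_matmul_
        : Float32Precision::IEEE;
  }

 private:
  Float32Precision cuda_matmul_ = Float32Precision::IEEE;
};

int main() {
  Context ctx;
  ctx.setFloat32Precision(Float32Backend::CUDA, Float32Op::MATMUL,
                          Float32Precision::TF32);
  // The gate that getCurrentCUDABlasHandle() performs, minus the cuBLAS call.
  bool use_tf32 =
      ctx.float32Precision(Float32Backend::CUDA, Float32Op::MATMUL) ==
      Float32Precision::TF32;
  std::cout << "TF32 math mode requested: " << std::boolalpha << use_tf32 << "\n";
}
```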

View File

@ -326,6 +326,23 @@ bool CUDAHooks::supportsBFloat16ConvolutionWithCuDNNv8() const {
#endif
}
bool CUDAHooks::supportsBFloat16RNNWithCuDNN() const {
#if AT_CUDNN_ENABLED() && (CUDNN_VERSION >= 91300)
if (!hasCUDA()) {
return false;
}
cudaDeviceProp* prop = at::cuda::getCurrentDeviceProperties();
// Check for compute capability 8.0 (Ampere) or newer
if (prop->major >= 8) {
return true;
} else {
return false;
}
#else
return false;
#endif
}
long CUDAHooks::versionCuDNN() const {
#if AT_CUDNN_ENABLED()
return CUDNN_VERSION;
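
The new supportsBFloat16RNNWithCuDNN() hook layers a runtime compute-capability check on top of a compile-time cuDNN version gate (9.13.0, i.e. CUDNN_VERSION >= 91300). A self-contained sketch of the same gating shape follows; FakeDeviceProp and supports_bf16_rnn are stand-ins so the snippet compiles without the CUDA toolkit, and only the 91300 constant and the SM 8.0 floor are taken from the hunk above.

```cpp
#include <iostream>

// Compile-time stand-in for cuDNN's version macro (9.13.0 -> 91300).
#ifndef CUDNN_VERSION
#define CUDNN_VERSION 91300
#endif

// Minimal stand-in for cudaDeviceProp; only the field the gate reads.
struct FakeDeviceProp {
  int major;  // compute capability major version
};

// Mirrors the shape of supportsBFloat16RNNWithCuDNN(): a compile-time version
// gate wrapped around a runtime compute-capability gate (SM 8.0+, i.e. Ampere+).
bool supports_bf16_rnn(const FakeDeviceProp& prop) {
#if CUDNN_VERSION >= 91300
  return prop.major >= 8;
#else
  (void)prop;
  return false;
#endif
}

int main() {
  FakeDeviceProp ampere{8}, volta{7};
  std::cout << "SM 8.0: " << supports_bf16_rnn(ampere) << "\n";  // 1
  std::cout << "SM 7.0: " << supports_bf16_rnn(volta) << "\n";   // 0
}
```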

View File

@ -45,6 +45,7 @@ struct CUDAHooks : public at::CUDAHooksInterface {
bool supportsDilatedConvolutionWithCuDNN() const override;
bool supportsDepthwiseConvolutionWithCuDNN() const override;
bool supportsBFloat16ConvolutionWithCuDNNv8() const override;
bool supportsBFloat16RNNWithCuDNN() const override;
bool hasCUDART() const override;
long versionCUDART() const override;
long versionCuDNN() const override;

View File

@ -122,7 +122,7 @@ struct DeviceThreadHandlePool : public std::enable_shared_from_this<DeviceThread
// Called by the destructor. Releases this thread's handles back into the pool.
void release() {
if(my_handles.size() > 0) {
if(!my_handles.empty()) {
auto parent = weak_parent.lock();
if (!parent) {
// If this thread exits after atexit handlers have completed, the

View File

@ -19,7 +19,7 @@ struct PhiloxCudaState {
// Called if graph capture is underway
PhiloxCudaState(int64_t* seed,
int64_t* offset_extragraph,
uint32_t offset_intragraph) {
uint64_t offset_intragraph) {
seed_.ptr = seed;
offset_.ptr = offset_extragraph;
offset_intragraph_ = offset_intragraph;
@ -36,7 +36,7 @@ struct PhiloxCudaState {
Payload seed_{};
Payload offset_{};
uint32_t offset_intragraph_ = 0;
uint64_t offset_intragraph_ = 0;
bool captured_ = false;
};
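
This change widens offset_intragraph_ (and the matching constructor parameter) from uint32_t to uint64_t, presumably so that Philox offsets accumulated while a CUDA graph capture is underway can exceed the 32-bit range without wrapping. The sketch below keeps only that field, once at each width, to show the difference; the per-launch increment and the capture length are made-up numbers, not values from the PR.

```cpp
#include <cstdint>
#include <iostream>

// Trimmed version of the graph-capture branch of PhiloxCudaState: only the
// intragraph offset field is kept, at the old width and at the new one.
struct PhiloxStateNarrow { uint32_t offset_intragraph = 0; };
struct PhiloxStateWide   { uint64_t offset_intragraph = 0; };

int main() {
  // Each Philox-based kernel launch advances the offset by roughly the number
  // of random blocks it consumes. A long capture can accumulate more than 2^32.
  const uint64_t per_launch = 1u << 20;  // hypothetical increment
  const uint64_t launches = 5'000'000;   // hypothetical capture length

  PhiloxStateNarrow n;
  PhiloxStateWide w;
  for (uint64_t i = 0; i < launches; ++i) {
    n.offset_intragraph += per_launch;  // silently wraps past 2^32
    w.offset_intragraph += per_launch;
  }
  std::cout << "uint32_t offset: " << n.offset_intragraph << "\n";
  std::cout << "uint64_t offset: " << w.offset_intragraph << "\n";
}
```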

View File

@ -162,7 +162,7 @@ inline std::string ComputeTypeFor() {
// ROCBLAS and hipBLASLt.
template <>
inline std::string ComputeTypeFor<float>() {
if (at::globalContext().float32Precision("cuda", "matmul") != "tf32") {
if (at::globalContext().float32Precision(at::Float32Backend::CUDA, at::Float32Op::MATMUL) != at::Float32Precision::TF32) {
return "f32_r";
} else {
return "xf32_r";

View File

@ -506,7 +506,7 @@ class HipblasltGemmOp : public Callable<ParamsT> {
}
hipblasComputeType_t computeType = HIPBLAS_COMPUTE_32F;
if (at::globalContext().float32Precision("cuda", "matmul") == "tf32") {
if (at::globalContext().float32Precision(at::Float32Backend::CUDA, at::Float32Op::MATMUL) == at::Float32Precision::TF32) {
computeType = HIPBLAS_COMPUTE_32F_FAST_TF32;
}
HipBlasLtMatmulDescriptor matmul(computeType, HIP_R_32F);

View File

@ -141,7 +141,7 @@ class RocblasGemmOp : public Callable<GemmParams<T>> {
TuningStatus Call(const GemmParams<T>* params) override {
auto input_output_type = RocBlasDataTypeFor<T>();
if (at::globalContext().float32Precision("cuda", "matmul") == "tf32" && input_output_type == rocblas_datatype_f32_r)
if (at::globalContext().float32Precision(at::Float32Backend::CUDA, at::Float32Op::MATMUL) == at::Float32Precision::TF32 && input_output_type == rocblas_datatype_f32_r)
return FAIL; // no support for TF32 in rocBLAS
auto compute_type = RocBlasComputeTypeFor<T>();
auto h_a = DoCastForHalfOrBfloat16(params->alpha);
@ -209,7 +209,7 @@ class RocblasGemmStridedBatchedOp : public Callable<GemmStridedBatchedParams<T>>
TuningStatus Call(const GemmStridedBatchedParams<T>* params) override {
auto input_output_type = RocBlasDataTypeFor<T>();
if (at::globalContext().float32Precision("cuda", "matmul") == "tf32" && input_output_type == rocblas_datatype_f32_r)
if (at::globalContext().float32Precision(at::Float32Backend::CUDA, at::Float32Op::MATMUL) == at::Float32Precision::TF32 && input_output_type == rocblas_datatype_f32_r)
return FAIL; // no support for TF32 in rocBLAS
auto compute_type = RocBlasComputeTypeFor<T>();
auto h_a = DoCastForHalfOrBfloat16(params->alpha);

View File

@ -404,8 +404,6 @@ TuningContext::TuningContext() :
max_warmup_iterations_{0},
icache_flush_{true},
rotating_buffer_size_{-1},
filename_{},
untuned_file_{},
results_count_from_input_file_{0},
is_shutting_down_{false}
{

View File

@ -141,7 +141,7 @@ void FilterDescriptor::set(const at::Tensor &t, const at::MemoryFormat memory_fo
size[i] = (int) t.size(i);
}
for (const auto i : c10::irange(dim, pad)) {
size[i] = (int) 1;
size[i] = 1;
}
dim = std::max(dim, pad);
cudnnTensorFormat_t filter_format{};

View File

@ -166,6 +166,10 @@ struct TORCH_API CUDAHooksInterface : AcceleratorHooksInterface {
return false;
}
virtual bool supportsBFloat16RNNWithCuDNN() const {
return false;
}
virtual long versionCuDNN() const {
TORCH_CHECK(false, "Cannot query cuDNN version without ATen_cuda library. ", CUDA_HELP);
}

View File

@ -176,7 +176,7 @@ struct LinalgCheckMatrixUnaryRuleHelper;
template <char const *op_name, typename F, F Func, typename A, typename... T>
struct LinalgCheckMatrixUnaryRuleHelper<op_name, F, Func, typelist<A, T...>> {
static inline Tensor check_and_reshape_input(const Tensor& tensor, std::optional<int64_t> batch_dim) {
static Tensor check_and_reshape_input(const Tensor& tensor, std::optional<int64_t> batch_dim) {
TORCH_CHECK(rankWithoutBatchDim(tensor, batch_dim) >= 2, op_name, ": The input tensor A must have at least 2 dimensions.");
return moveBatchDimToFront(tensor, batch_dim);
}
@ -222,7 +222,7 @@ struct LinalgCheckMatrixBinaryRuleHelper;
template <char const *op_name, typename F, F Func, typename A, typename B, typename... T>
struct LinalgCheckMatrixBinaryRuleHelper<op_name, F, Func, typelist<A, B, T...>> {
static inline std::tuple<Tensor, Tensor> check_inputs_and_reshape_inputs(
static std::tuple<Tensor, Tensor> check_inputs_and_reshape_inputs(
const Tensor& first, std::optional<int64_t> first_bdim,
const Tensor& second, std::optional<int64_t> second_bdim) {
TORCH_CHECK(rankWithoutBatchDim(first, first_bdim) >= 2,

View File

@ -465,11 +465,11 @@ static void dynamicLayerBack(const c10::OperatorHandle& op, torch::jit::Stack* s
// used for functions that have aliasing operations but should be treated like they're out of place (i.e. lift_fresh)
static void dynamicLayerBackGradSpecialCase(const c10::OperatorHandle& op, torch::jit::Stack* stack) {
return dynamicLayerBack(op, stack, true);
dynamicLayerBack(op, stack, true);
}
static void dynamicLayerBackFallback(const c10::OperatorHandle& op, torch::jit::Stack* stack) {
return dynamicLayerBack(op, stack, false);
dynamicLayerBack(op, stack, false);
}
TORCH_LIBRARY_IMPL(_, FuncTorchDynamicLayerFrontMode, m) {

View File

@ -58,7 +58,7 @@ scalar_t dot_impl(int64_t n, const scalar_t *x, int64_t incx, const scalar_t *y,
template<typename scalar_t>
scalar_t vdot_impl(int64_t n, const scalar_t *x, int64_t incx, const scalar_t *y, int64_t incy);
static constexpr inline bool lda_cond(int64_t m, int64_t n, int64_t lda) {
static constexpr bool lda_cond(int64_t m, int64_t n, int64_t lda) {
return n == 1 || lda >= std::max<int64_t>(1L, m);
}

View File

@ -375,7 +375,7 @@ static void bf16_gemv_trans(
const at::BFloat16 beta,
at::BFloat16* y,
const int incy) {
return bf16_gemv_trans_stub(kCPU, m, n, alpha, a, lda, x, incx, beta, y, incy);
bf16_gemv_trans_stub(kCPU, m, n, alpha, a, lda, x, incx, beta, y, incy);
}
template <>

View File

@ -70,7 +70,7 @@ inline void searchsorted_maybe_trim_input_tensors(
const Tensor& raw_boundaries) {
Tensor trimmed_sorter;
Tensor raw_sorter;
return searchsorted_maybe_trim_input_tensors(
searchsorted_maybe_trim_input_tensors(
trimmed_input,
trimmed_boundaries,
trimmed_sorter,

View File

@ -991,7 +991,7 @@ std::size_t UnsafeUkernelKeyHasher<PackKey>::operator()(const PackKey& key) cons
template <typename key_t, typename value_t>
struct KernelCache {
using kstore_t = std::unordered_map<key_t, std::shared_ptr<value_t>, UnsafeUkernelKeyHasher<key_t>>;
static inline std::shared_ptr<value_t>&& fetch_or_create(
static std::shared_ptr<value_t>&& fetch_or_create(
const key_t& key,
const std::function<std::shared_ptr<value_t>()>& callback) {
auto&& search = get_store().find(key);
@ -1003,7 +1003,7 @@ struct KernelCache {
}
}
static inline kstore_t& get_store() {
static kstore_t& get_store() {
static thread_local kstore_t cache_kernels;
return cache_kernels;
}
@ -1067,7 +1067,7 @@ struct GemmHelper {
struct Brgemm : public KernelCache <BrgemmKey, GemmHelper> {
// Fetch/create GemmHelper object and execute brgemm with batch size = 1
template <typename scalar_t_a, typename scalar_t_b, typename scalar_t_c>
static inline void call(
static void call(
int64_t M,
int64_t N,
int64_t K,
@ -1118,12 +1118,12 @@ struct Brgemm : public KernelCache <BrgemmKey, GemmHelper> {
.execute(A, B, (*value).A_B_offsets, C, (*value).scratchpad.data());
}
static inline std::shared_ptr<GemmHelper>& get_current() {
static std::shared_ptr<GemmHelper>& get_current() {
static thread_local std::shared_ptr<GemmHelper> current;
return current;
}
static inline bool device_check(ScalarType dtype) {
static bool device_check(ScalarType dtype) {
if (!at::globalContext().userEnabledMkldnn()) {
return false;
}
@ -1153,7 +1153,7 @@ using pack_t = dnnl::ukernel::brgemm_pack_B;
using pack_t = dnnl::ukernel::transform;
#endif
struct Pack : public KernelCache <PackKey, pack_t> {
static inline void call(
static void call(
int64_t K,
int64_t N,
int64_t ld_in,
@ -1182,7 +1182,7 @@ struct Pack : public KernelCache <PackKey, pack_t> {
}
}
static inline bool could_pack(ScalarType dtype) {
static bool could_pack(ScalarType dtype) {
if (!at::globalContext().userEnabledMkldnn()) {
return false;
}
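
Most of the hunks in this file, like the similar static inline → static changes earlier in this diff, simply drop inline from static member functions. Member functions defined inside a class body are implicitly inline, so the keyword was redundant; the standalone snippet below demonstrates just that language rule and has nothing kernel-cache specific in it.

```cpp
#include <iostream>

struct Cache {
  // Defined inside the class body, so this is implicitly inline already;
  // writing "static inline void call()" would change nothing.
  static void call() { std::cout << "call()\n"; }

  // Same for a function returning a reference to a function-local static:
  // every translation unit that includes this definition sees the same object.
  static int& counter() {
    static int value = 0;
    return value;
  }
};

int main() {
  Cache::call();
  Cache::counter() += 3;
  std::cout << "counter = " << Cache::counter() << "\n";  // 3
}
```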

View File

@ -702,7 +702,7 @@ static void check_shape_forward(const at::Tensor& input,
// If kernel size is incorrect
std::ostringstream input_ss;
std::ostringstream kernel_ss;
std::string separator = "";
std::string separator;
for (int i = 0, len = input_shape.size(); i < len; ++i) {
input_ss << separator << input_shape[i];
@ -1019,7 +1019,7 @@ static Tensor convolution_same(
if (symmetric_padding) {
// All backends handle symmetric padding natively
SymDimVector output_padding(static_cast<size_t>(dim));
SymDimVector output_padding(dim);
return at::convolution_symint(input, weight, bias, stride, padding_l, dilation,
false, output_padding, groups);
}
@ -1039,7 +1039,7 @@ static Tensor convolution_same(
}
}
auto padded_input = at::constant_pad_nd_symint(input, pad_nd, 0);
SymDimVector output_padding(static_cast<size_t>(dim));
SymDimVector output_padding(dim);
return at::convolution_symint(padded_input, weight, bias, stride, padding_l,
dilation, false, output_padding, groups);
}
@ -1174,7 +1174,7 @@ at::Tensor convolution(
bool deterministic = ctx.deterministicCuDNN() || ctx.deterministicAlgorithms();
return at::_convolution(input, weight, bias, stride, padding, dilation,
transposed, output_padding, groups,
ctx.benchmarkCuDNN(), deterministic, ctx.userEnabledCuDNN(), ctx.allowTF32CuDNN("conv"));
ctx.benchmarkCuDNN(), deterministic, ctx.userEnabledCuDNN(), ctx.allowTF32CuDNN(at::Float32Op::CONV));
}
at::Tensor convolution_overrideable(
@ -1319,7 +1319,7 @@ ConvBackend select_conv_backend(
params.benchmark = ctx.benchmarkCuDNN();
params.deterministic = ctx.deterministicCuDNN() || ctx.deterministicAlgorithms();
params.cudnn_enabled = ctx.userEnabledCuDNN();
params.allow_tf32 = ctx.allowTF32CuDNN("conv");
params.allow_tf32 = ctx.allowTF32CuDNN(at::Float32Op::CONV);
auto input = input_r;
auto weight = weight_r;
@ -1699,7 +1699,7 @@ at::Tensor _convolution(
c10::MaybeOwned<Tensor> bias_r_maybe_owned = at::borrow_from_optional_tensor(bias_r_opt);
const Tensor& bias_r = *bias_r_maybe_owned;
return at::_convolution(input_r, weight_r, bias_r, stride_, padding_, dilation_, transposed_, output_padding_, groups_, benchmark, deterministic, cudnn_enabled, at::globalContext().allowTF32CuDNN("conv"));
return at::_convolution(input_r, weight_r, bias_r, stride_, padding_, dilation_, transposed_, output_padding_, groups_, benchmark, deterministic, cudnn_enabled, at::globalContext().allowTF32CuDNN(at::Float32Op::CONV));
}
std::tuple<Tensor, Tensor, Tensor> convolution_backward_overrideable(
@ -1997,7 +1997,7 @@ std::tuple<Tensor, Tensor, Tensor> convolution_backward(
params.benchmark = ctx.benchmarkCuDNN();
params.deterministic = ctx.deterministicCuDNN() || ctx.deterministicAlgorithms();
params.cudnn_enabled = ctx.userEnabledCuDNN();
params.allow_tf32 = ctx.allowTF32CuDNN("conv");
params.allow_tf32 = ctx.allowTF32CuDNN(at::Float32Op::CONV);
// Validate inputs.
check_shape_backward(input, weight.sizes(), params);

View File

@ -1,6 +1,5 @@
#define TORCH_ASSERT_ONLY_METHOD_OPERATORS
#include <ATen/native/Copy.h>
#include <ATen/native/Copy.h>
#include <ATen/core/Tensor.h>
#include <ATen/Dispatch.h>

View File

@ -93,6 +93,12 @@ inline bool cond_cudnn_grid_sampler(
const TensorBase& input,
const TensorBase& grid
) {
auto st = input.scalar_type();
if (!(st == kDouble || st == kFloat || st == kHalf))
return false;
st = grid.scalar_type();
if (!(st == kDouble || st == kFloat || st == kHalf))
return false;
return (
at::native::cudnn_is_acceptable(input) &&
at::native::cudnn_is_acceptable(grid) &&

View File

@ -70,7 +70,7 @@ Tensor constant_pad_nd(const Tensor& self, IntArrayRef pad, const Scalar& value)
new_shape.emplace_back(input_sizes[i]);
}
for (const auto i : c10::irange((size_t)l_pad)) {
for (const auto i : c10::irange(l_pad)) {
auto pad_idx = pad.size() - ((i + 1) * 2);
auto new_dim = input_sizes[l_diff + i] + pad[pad_idx] + pad[pad_idx + 1];
TORCH_CHECK(new_dim >= 0, "The input size ", input_sizes[l_diff + i], ", plus negative padding ",

View File

@ -108,6 +108,13 @@ bool use_mkldnn(const Tensor& input, TensorList params, TensorList hx) {
return false;
}
bool use_cudnn(const Tensor& t) {
bool acceptable = at::cudnn_is_acceptable(t);
auto st = t.scalar_type();
bool bfloat16_cond = st == kBFloat16 && at::detail::getCUDAHooks().supportsBFloat16RNNWithCuDNN();
return acceptable && (bfloat16_cond || st == kDouble || st == kFloat || st == kHalf);
}
template<typename T>
using pair_of = std::pair<T, T>;
@ -1200,7 +1207,7 @@ std::tuple<Tensor, Tensor, Tensor, Tensor, Tensor> _thnn_fused_lstm_cell_backwar
bool train, \
bool bidirectional, \
bool batch_first) { \
if (at::cudnn_is_acceptable(_input)) { \
if (use_cudnn(_input)) { \
Tensor output, hy; \
NAME##_cudnn_stub( \
_input.device().type(), \
@ -1262,7 +1269,7 @@ std::tuple<Tensor, Tensor, Tensor, Tensor, Tensor> _thnn_fused_lstm_cell_backwar
double dropout_p, \
bool train, \
bool bidirectional) { \
if (at::cudnn_is_acceptable(data)) { \
if (use_cudnn(data)) { \
Tensor output, hy; \
NAME##_packed_cudnn_stub( \
data.device().type(), \
@ -1430,7 +1437,7 @@ std::tuple<Tensor, Tensor, Tensor> lstm(
TensorList _params, bool has_biases,
int64_t num_layers, double dropout_p, bool train, bool bidirectional, bool batch_first) {
TORCH_CHECK(hx.size() == 2, "lstm expects two hidden states");
if (at::cudnn_is_acceptable(_input)) {
if (use_cudnn(_input)) {
Tensor output, hy, cy;
lstm_cudnn_stub(_input.device().type(), output, hy, cy, _input, hx, _params, has_biases,
num_layers, dropout_p, train, bidirectional, batch_first);
@ -1491,7 +1498,7 @@ std::tuple<Tensor, Tensor, Tensor> lstm(
TensorList _params, bool has_biases,
int64_t num_layers, double dropout_p, bool train, bool bidirectional) {
TORCH_CHECK(hx.size() == 2, "lstm expects two hidden states");
if (at::cudnn_is_acceptable(data)) {
if (use_cudnn(data)) {
Tensor output, hy, cy;
lstm_packed_cudnn_stub(data.device().type(), output, hy, cy, data, batch_sizes, hx,
_params, has_biases, num_layers, dropout_p, train, bidirectional);
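
Because the scalar-type filter is being removed from cudnn_is_acceptable() (see the cudnn_is_acceptable hunk later in this diff), the RNN path now does its own dtype gating: the classic double/float/half set plus bfloat16 whenever the new cuDNN hook reports support. A self-contained sketch of that gate follows; use_cudnn_rnn and the trimmed ScalarType enum are stand-ins for the real use_cudnn() helper and at::ScalarType.

```cpp
#include <iostream>

// Stand-ins for the ScalarType values the gate looks at.
enum class ScalarType { Double, Float, Half, BFloat16, Int };

// Mirrors the shape of use_cudnn() in the hunk above: the generic acceptability
// check no longer filters dtypes, so the RNN path re-adds the classic float
// types and admits bfloat16 only when the cuDNN hook says the current
// device/version combination supports BF16 RNNs.
bool use_cudnn_rnn(bool cudnn_acceptable, ScalarType st,
                   bool supports_bf16_rnn) {
  bool bfloat16_cond = (st == ScalarType::BFloat16) && supports_bf16_rnn;
  return cudnn_acceptable &&
      (bfloat16_cond || st == ScalarType::Double ||
       st == ScalarType::Float || st == ScalarType::Half);
}

int main() {
  std::cout << std::boolalpha;
  std::cout << use_cudnn_rnn(true, ScalarType::Float, false) << "\n";     // true
  std::cout << use_cudnn_rnn(true, ScalarType::BFloat16, false) << "\n";  // false
  std::cout << use_cudnn_rnn(true, ScalarType::BFloat16, true) << "\n";   // true
  std::cout << use_cudnn_rnn(false, ScalarType::Float, true) << "\n";     // false
}
```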

View File

@ -47,7 +47,7 @@ int64_t compute_arange_size(const Scalar& start, const Scalar& end, const Scalar
int64_t sgn = (xstep > 0) - (xstep < 0);
size_d = std::ceil((xend - xstart + xstep - sgn) / xstep);
} else {
size_d = std::ceil(static_cast<double>(end.to<double>() - start.to<double>())
size_d = std::ceil((end.to<double>() - start.to<double>())
/ step.to<double>());
}

View File

@ -107,11 +107,6 @@ void resize_bytes_cpu(StorageImpl* storage, size_t size_bytes) {
storage->set_nbytes(size_bytes);
}
// Call the sparse implementation in SparseTensor.cpp directly.
// A dynamic dispatch here is NOT necessary, so I didn't put
// this function in native_functions.yaml
const Tensor& resize_as_sparse_(const Tensor& self, const Tensor& src);
// TODO(VitalyFedyunin): Move it to HTML docs.
//
// Strides of the output tensor of `resize_as_` operator is defined by input

View File

@ -145,12 +145,6 @@
#include <utility>
#include <vector>
namespace at::native {
AdvancedIndex make_info(Tensor self, IOptTensorListRef orig);
} // namespace at::native
namespace at::meta {
TORCH_META_FUNC(gather)

View File

@ -73,7 +73,6 @@
#include <ATen/ops/where_native.h>
#include <ATen/ops/zeros_like.h>
#include <iostream>
#include <utility>
#endif

View File

@ -23,14 +23,6 @@
#include <ATen/Functions.h>
#include <ATen/NativeFunctions.h>
#else
#include <ATen/ops/_cast_Byte_native.h>
#include <ATen/ops/_cast_Char_native.h>
#include <ATen/ops/_cast_Double_native.h>
#include <ATen/ops/_cast_Float_native.h>
#include <ATen/ops/_cast_Half_native.h>
#include <ATen/ops/_cast_Int_native.h>
#include <ATen/ops/_cast_Long_native.h>
#include <ATen/ops/_cast_Short_native.h>
#include <ATen/ops/_dim_arange_native.h>
#include <ATen/ops/_efficientzerotensor_native.h>
#include <ATen/ops/_empty_affine_quantized.h>

View File

@ -91,9 +91,6 @@ bool cudnn_is_acceptable(const TensorBase& self) {
return false;
if (!self.is_cuda())
return false;
auto st = self.scalar_type();
if (!(st == kDouble || st == kFloat || st == kHalf))
return false;
if (!detail::getCUDAHooks().compiledWithCuDNN())
return false;
// cuDNN functions like grid_sampler return CUDNN_STATUS_BAD_PARAM on empty

View File

@ -124,7 +124,7 @@ struct IsUnique {};
template <typename scalar_t>
struct IsUnique<scalar_t, false> {
inline bool operator() (scalar_t* data_ptr, int64_t i) {
bool operator() (scalar_t* data_ptr, int64_t i) {
if (i == 0) { return true; }
return c10::load(&data_ptr[i]) != c10::load(&data_ptr[i - 1]);
}
@ -132,7 +132,7 @@ struct IsUnique<scalar_t, false> {
template <typename scalar_t>
struct IsUnique<scalar_t, true> {
inline bool operator() (scalar_t* data_ptr, int64_t i) {
bool operator() (scalar_t* data_ptr, int64_t i) {
if (i == 0) { return true; }
return (c10::load(&data_ptr[i]) != c10::load(&data_ptr[i - 1]))
&& !(_isnan(data_ptr[i]) && _isnan(data_ptr[i - 1]));

View File

@ -4,7 +4,6 @@
#include <ATen/OpMathType.h>
#include <ATen/TensorUtils.h>
#include <ATen/OpMathType.h>
#include <ATen/core/Tensor.h>
#include <ATen/cpu/vec/functional.h>
#include <ATen/cpu/vec/vec.h>

View File

@ -25,11 +25,11 @@
namespace at::native {
void _backward(const Tensor& self, TensorList inputs, const std::optional<Tensor>& gradient_opt, std::optional<bool> keep_graph, bool create_graph) {
return self._backward(inputs, gradient_opt, keep_graph, create_graph);
self._backward(inputs, gradient_opt, keep_graph, create_graph);
}
void set_data(Tensor& self, const Tensor& new_data) {
return self.set_data(new_data);
self.set_data(new_data);
}
Tensor data(const Tensor& self) {
@ -54,7 +54,7 @@ Tensor& requires_grad_(Tensor& self, bool _requires_grad) {
}
void retain_grad(Tensor& self) {
return self.retain_grad();
self.retain_grad();
}
bool retains_grad(const Tensor& self) {

View File

@ -17,7 +17,7 @@
namespace ao::sparse {
int register_linear_params();
#ifdef USE_FBGEMM

View File

@ -20,7 +20,7 @@
namespace ao::sparse {
int register_linear_params();
#ifdef USE_FBGEMM
namespace {

View File

@ -16,7 +16,7 @@
#endif
namespace ao::sparse {
int register_linear_params();
#ifdef USE_FBGEMM

View File

@ -22,7 +22,7 @@ static inline void cpu_atomic_add_float(float* dst, float fvalue)
old_value.floatV = *dst;
new_value.floatV = old_value.floatV + fvalue;
unsigned* old_intV = (unsigned*)(&old_value.intV);
unsigned* old_intV = &old_value.intV;
while (!std::atomic_compare_exchange_strong(dst_intV, old_intV, new_value.intV)) {
#ifdef __aarch64__
__asm__ __volatile__("yield;" : : : "memory");
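
cpu_atomic_add_float() emulates an atomic float add by reinterpreting the value as an unsigned word and retrying a compare-exchange until no other thread has raced the update (with a yield hint on aarch64); the hunk above only tidies the pointer cast. A self-contained variant of the same CAS loop, built on std::atomic<uint32_t> and memcpy instead of the union-and-cast approach, is sketched below; atomic_add_float here is a stand-in name, not PyTorch's function.

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>
#include <iostream>
#include <thread>
#include <vector>

// Same idea as cpu_atomic_add_float: read the current bits, add in float
// space, and try to publish the new bits with a CAS; retry if another thread
// got there first.
void atomic_add_float(std::atomic<uint32_t>* dst_bits, float value) {
  uint32_t old_bits = dst_bits->load(std::memory_order_relaxed);
  for (;;) {
    float old_val;
    std::memcpy(&old_val, &old_bits, sizeof(float));
    float new_val = old_val + value;
    uint32_t new_bits;
    std::memcpy(&new_bits, &new_val, sizeof(float));
    // On failure, old_bits is refreshed with the current value and we retry.
    if (dst_bits->compare_exchange_weak(old_bits, new_bits,
                                        std::memory_order_acq_rel,
                                        std::memory_order_relaxed)) {
      return;
    }
  }
}

int main() {
  std::atomic<uint32_t> acc_bits{0};  // bit pattern of 0.0f is all zeros
  std::vector<std::thread> threads;
  for (int t = 0; t < 4; ++t) {
    threads.emplace_back([&] {
      for (int i = 0; i < 100000; ++i) atomic_add_float(&acc_bits, 1.0f);
    });
  }
  for (auto& th : threads) th.join();
  uint32_t bits = acc_bits.load();
  float result;
  std::memcpy(&result, &bits, sizeof(float));
  std::cout << "sum = " << result << "\n";  // 400000, exactly representable
}
```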

Some files were not shown because too many files have changed in this diff.