98b80bb7ab 
					 
					
						
						
							
							Simplify SingletonOrSharedTypePtr  
						
						
						
						
						@neildhar pointed out at PTC yesterday that the assumption SingletonOrSharedTypePtr makes about shared_ptr's pointers being either both null or both non-null is incorrect because of the aliasing constructor, and furthermore that SingletonOrSharedTypePtr needn't be as fancy as it is because said constructor exists. (See also https://github.com/pytorch/pytorch/issues/166152  .)
Differential Revision: [D85458769](https://our.internmc.facebook.com/intern/diff/D85458769/ )
[ghstack-poisoned] 
						
						
					 
					
						2025-10-24 12:20:11 -07:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						b146ea411e 
					 
					
						
						
							
							Save GitHub env variables on ROCm ( #165821 )  
						
						
						
						
						As `.github/actions/setup-rocm/action.yml` is now used on `linux_job_v2` to set up ROCm, we need to have this step here to save the list of GitHub env variables.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165821 
Approved by: https://github.com/atalman  
						
						
					 
					
						2025-10-23 22:13:37 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						8625ffbd45 
					 
					
						
						
							
							[DeviceMesh] Use _flatten_rank_map to replace _flatten_mesh_list so that we don't need to compare root mesh ( #166003 )  
						
						
						
						
						Since we already share a flattened tensor `_rank_map` across all meshes from the same root mesh, we can just use a flattened list of it to replace the comparison of root_mesh and flattened_mesh_list (because with the same _rank_map and layout, the mesh tensor is guaranteed to be the same). This way we can also give back the CPU overhead added in https://github.com/pytorch/pytorch/pull/164510 and further simplify the code.
We do have a more ambitious universe-based change here: https://github.com/pytorch/pytorch/pull/165680 but it needs more discussion and would be BC-breaking. We might eventually merge that PR, but probably not now. This change is not BC-breaking and will help concatenate and 2D integration with concatenate.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166003 
Approved by: https://github.com/Skylion007 , https://github.com/fegin  
						
						
					 
					
						2025-10-23 20:49:59 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						0977cc4474 
					 
					
						
						
							
							[lint] Extend workflowsync linter to more files ( #166082 )  
						
						
						
						
						And fix the lint issues found
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166082 
Approved by: https://github.com/izaitsevfb , https://github.com/atalman  
						
						
					 
					
						2025-10-23 20:29:29 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						d9a55faccc 
					 
					
						
						
							
							[Pytorch] Add NEON Vectorized<double>  translation layers ( #166092 )  
						
						
						
						
						Summary:
Adding NEON specializations of Vectorized<double>
Correctness has been checked using test_ops.py and by running the torch tests
Test Plan:
Correctness:
buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch
Performance:
Added torch.float64 as data type to test within binary_test.py
Reviewed By: mcfi
Differential Revision: D84924406
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166092 
Approved by: https://github.com/malfet  
						
						
					 
					
						2025-10-23 20:20:48 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						75b8295868 
					 
					
						
						
							
							Revert "Warn if AccumulateGrad stream does not match producer node stream ( #165065 )"  
						
						
						
						
						This reverts commit 12f742941d6aecb72c18d8e602f90ac9b4f00af0.
Reverted https://github.com/pytorch/pytorch/pull/165065  on behalf of https://github.com/clee2000  due to broke internal builds D85273204 usages of TORCH_API void add need to be updated? ([comment](https://github.com/pytorch/pytorch/pull/165065#issuecomment-3438061854 )) 
						
						
					 
					
						2025-10-23 17:02:49 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						defb6a80d8 
					 
					
						
						
							
							Enable torch.Generator to support pytorch/xla generator implementation  ( #161369 )  
						
						
						
						
						Currently, the implementation of `torch.Generator` only supports the "cpu" and "cuda" device types: https://github.com/pytorch/pytorch/blob/main/torch/csrc/Generator.cpp#L55-L61
This change enables `torch.Generator` to support more device types by allowing any device backend to register its own generator factory through a Generator Registry. This is similar to what the "DeviceGuardImpl registry" does today.
# Key Changes:
## New registry API:
* Added GeneratorRegistry.h and GeneratorRegistry.cpp in c10/core/impl.
* API supports registerGenerator(DeviceType, GeneratorFactory), unregisterGenerator(DeviceType), and getGeneratorFactory(DeviceType).
* Uses c10::DeviceType as the key and stores a factory function returning c10::intrusive_ptr<c10::GeneratorImpl>.
## Python/C++ integration:
* The registry is consulted in the torch.Generator constructor path for non-CPU/CUDA devices.
* If a factory is registered for the requested device, it constructs the appropriate generator; otherwise, raises an error.
## Backend extensibility:
* Out-of-tree backends (e.g., torch_xla, torch-directml, torch_npu) can now register their custom generator implementation at module load via a static registrar object.
Example usage:
```cpp
namespace {
  struct Registrar {
    Registrar() {
      at::detail::registerGenerator(c10::DeviceType::XLA, &CreateXlaGenerator);
    }
  } registrar_instance;
}
```
This allows torch.Generator(device='xla') to return an XlaGeneratorImpl when the torch_xla extension is imported.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161369 
Approved by: https://github.com/FFFrog , https://github.com/qihqi , https://github.com/albanD  
						
						
					 
					
						2025-10-23 16:49:28 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						f8fccb1e48 
					 
					
						
						
							
							[Code Clean] Clean asserts in torch/optim. ( #165629 )  
						
						
						
						
						Replaces 50 assert statements across 15 files in torch.optim with explicit if-checks that raise AssertionError, so the checks cannot be disabled by Python's -O flag.
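The pattern described above can be sketched as follows. This is a minimal illustration, not code from the PR: `check_lr` and its message are hypothetical names chosen for the example.

```python
def check_lr(lr: float) -> float:
    """Validate a learning rate (hypothetical helper, for illustration only)."""
    # Before: `assert lr >= 0.0, f"Invalid learning rate: {lr}"`
    # That statement is removed entirely when Python runs with -O.
    # After: an explicit check that still raises AssertionError but always runs.
    if not lr >= 0.0:
        raise AssertionError(f"Invalid learning rate: {lr}")
    return lr
```

Raising AssertionError keeps the exception type the same for callers that already catch it, while making the check independent of the interpreter's optimization level.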
Partially fixes #164878
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165629 
Approved by: https://github.com/albanD  
						
						
					 
					
						2025-10-23 15:56:29 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						5aac4cfce4 
					 
					
						
						
							
							Use is rather than == to work around slow enum comparison in _ops.py ( #165936 )  
						
						
						
						
						This shows up (under _are_we_tracing) in DTensor dispatch. I have some work in flight to speed up enum comparison in pybind11, but `is` is just much faster and easy to use.
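A minimal sketch of the idea, using Python's standard-library `enum` as a stand-in (the PR concerns pybind11-bound enums, which are not reproduced here). `Mode` and `is_tracing` are hypothetical names. Standard-library enum members are singletons, so an identity check is valid for them and skips the `__eq__` call.

```python
import enum


class Mode(enum.Enum):  # hypothetical stand-in for a bound enum type
    EAGER = 1
    TRACING = 2


def is_tracing(mode: Mode) -> bool:
    # `mode == Mode.TRACING` would dispatch to Enum.__eq__;
    # `is` compares object identity directly, with no method call.
    return mode is Mode.TRACING
```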
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165936 
Approved by: https://github.com/Skylion007 , https://github.com/zou3519  
						
						
					 
					
						2025-10-23 15:01:55 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						baf91bbbfc 
					 
					
						
						
							
							Revert "[inductor][choices] lookup table choices 1/3 ( #164978 )"  
						
						
						
						
						This reverts commit ab9e466928e7a37844c4f2a8bf90c76d16ac3c34.
Reverted https://github.com/pytorch/pytorch/pull/164978 on behalf of https://github.com/malfet due to Looks like it broke slow tests, see cbcb4f7768/1 ([comment](https://github.com/pytorch/pytorch/pull/164978#issuecomment-3437424559))
						
						
					 
					
						2025-10-23 14:47:07 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						cbcb4f7768 
					 
					
						
						
							
							[pytorch][torchelastic] Duplicate stdout and stderr and apply custom filter in torchrun ( #160712 )  
						
						
						
						
						Summary:
Part of an effort to extract some important error logs (e.g. [#157996](https://github.com/pytorch/pytorch/pull/157996)) that were `tee`'ed to `stdout` and `stderr`.
The general idea is to:
- Duplicate the `tee`s on `stdout` and `stderr` to a separate file, `filtered_stdout.log` and `filtered_stderr.log`, respectively.
- In these files, as their names suggest, only log lines matching a customizable filter.
- Later on in another PR, append the contents of these files to the reply file.
Outline of changes in this PR:
- Enhance `TailLog` to be able to 1) stream to a file, and 2) only write when the line matches the passed filter.
- Add `filtered_stdout` and `filtered_stderr` to `LogsDest` and have `LogsSpecs` `reify` them.
- In `start_processes()` and `PContext`, add params `duplicate_stdout_filters` and `duplicate_stderr_filters` to filter and write the duplicated stream to the files above. When no filters are passed in, no duplicated streams are created.
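The filtering behavior outlined above can be sketched roughly as below. This is not the `TailLog` implementation: `duplicate_filtered` is a hypothetical helper, and treating a filter as a plain substring is an assumption made for the example.

```python
import io
from typing import Iterable, Sequence, TextIO


def duplicate_filtered(lines: Iterable[str], dst: TextIO, filters: Sequence[str]) -> int:
    """Copy lines that contain any filter substring to dst; return how many were written."""
    if not filters:
        # Mirrors the stated behavior: no filters means no duplicated stream.
        return 0
    written = 0
    for line in lines:
        if any(f in line for f in filters):
            dst.write(line)
            written += 1
    return written


# Usage: only the matching line reaches the "filtered" destination.
buf = io.StringIO()
count = duplicate_filtered(["ok\n", "ERROR: boom\n", "fine\n"], buf, ["ERROR"])
```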
Test Plan:
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:api_test
```
```
Buck UI: https://www.internalfb.com/buck2/f5c6b7da-217d-4a0b-872a-c7cd3d05587f 
Test UI: https://www.internalfb.com/intern/testinfra/testrun/4222124951617688 
Network: Up: 398B  Down: 44MiB  (reSessionID-a489a961-b602-45be-b851-3490ebb7a26a)
Analyzing targets. Remaining     0/200
Executing actions. Remaining     0/12856                                                                                                                                        0.1s exec time total
Command: test.     Finished 1 local
Time elapsed: 17:37.9s
Tests finished: Pass 52. Fail 0. Fatal 0. Skip 0. Build failure 0
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:tail_log_test
```
```
Buck UI: https://www.internalfb.com/buck2/d6d5c1c1-db98-4d9c-b608-7ba6fbb5e3ee 
Test UI: https://www.internalfb.com/intern/testinfra/testrun/13510798985149262 
Network: Up: 94KiB  Down: 417MiB  (reSessionID-27b46fba-d31c-4c04-8ede-a506454e6922)
Analyzing targets. Remaining     0/3                                                                                                                                            536 actions, 555 artifacts declared
Executing actions. Remaining     0/186                                                                                                                                          1:05.5s exec time total
Command: test.     Finished 7 local, 1 remote, 115 cache (93% hit)                                                                                                              37.0s exec time cached (56%)
Time elapsed: 1:11.5s
Tests finished: Pass 7. Fail 0. Fatal 0. Skip 0. Build failure 0
```
Rollback Plan:
Differential Revision: D80188995
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160712 
Approved by: https://github.com/fduwjj  
						
						
					 
					
						2025-10-23 14:22:21 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						2b93d5b450 
					 
					
						
						
							
							[FlexAttention][CUDA] Add flex configs for Blackwell ( #165760 )  
						
						
						
						
						This PR fixes ULFs on `max_autotune` mode for high head-dim sizes on B200. Closes https://github.com/pytorch/torchtitan/issues/1791 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165760 
Approved by: https://github.com/syed-ahmed , https://github.com/drisspg  
						
						
					 
					
						2025-10-23 10:22:06 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						6b7cd48e7e 
					 
					
						
						
							
							[ROCm] Deserialize loads in planer sum portion of reduce() of norm. ( #165927 )  
						
						
						
						
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165927 
Approved by: https://github.com/jeffdaily  
						
						
					 
					
						2025-10-23 09:45:01 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						bf5aa9e42e 
					 
					
						
						
							
							[dynamo] Remove ID guard on method object ( #166096 )  
						
						
						
						
						Pull Request resolved: https://github.com/pytorch/pytorch/pull/166096 
Approved by: https://github.com/tugsbayasgalan  
						
						
					 
					
						2025-10-23 06:22:49 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						b1eb6dede5 
					 
					
						
						
							
							[vision hash update] update the pinned vision hash ( #166046 )  
						
						
						
						
						This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml ).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166046 
Approved by: https://github.com/pytorchbot  
						
						
					 
					
						2025-10-23 04:27:44 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						673060beae 
					 
					
						
						
							
							[inductor] turn Inductor deterministic mode on with torch.use_deterministic_algorithms ( #165950 )  
						
						
						
						
						Pull Request resolved: https://github.com/pytorch/pytorch/pull/165950 
Approved by: https://github.com/v0i0 , https://github.com/eellison  
						
						
					 
					
						2025-10-23 02:48:42 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						2e8e9a59a8 
					 
					
						
						
							
							Revert "[dynamo][easy] Support torch.accelerator.current_accelerator ( #165734 )" ( #166094 )  
						
						
						
						
						This reverts commit c18ddfc5721dd91bf29c769e850a99c4fdb6f380.
The original PR uncovered some latent issues that cause internal failures. Will fix those issues first and resend the PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166094 
Approved by: https://github.com/bdhirsh  
						
						
					 
					
						2025-10-23 01:24:46 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						fb277a5916 
					 
					
						
						
							
							Enable new tracer by default ( #165332 )  
						
						
						
						
						Differential Revision: [D84516080](https://our.internmc.facebook.com/intern/diff/D84516080 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165332 
Approved by: https://github.com/avikchaudhuri 
ghstack dependencies: #165582 , #163580  
						
						
					 
					
						2025-10-23 00:40:29 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						73fa0d0c63 
					 
					
						
						
							
							test for  #165446  ( #165853 )  
						
						
						
						
						Per title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165853 
Approved by: https://github.com/drisspg  
						
						
					 
					
						2025-10-23 00:08:18 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						36c21cc84e 
					 
					
						
						
							
							state dict staging fixes ( #166025 )  
						
						
						
						
						Summary:
This PR contains three changes -
1. We were losing the non-blocking flag value and defaulting to False during the deep_copy. This introduced a CUDA synchronize after each tensor, which slowed down staging.
2. Adds the capability to skip pinning for scalar tensors to reduce the initial staging buffer creation cost. The threshold is set to 65 by default to avoid pinning small tensors.
3. Tensors can share storage, but each storage needs to be processed only once in the deep_copy with offloading logic, so the memoization table is used to cache storage ids.
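The third point can be illustrated with a small sketch. It uses plain Python objects rather than tensors: `FakeTensor` and `stage` are hypothetical, and keying the memo table on `id(storage)` is an assumption about how "storage ids" might be cached.

```python
class FakeTensor:
    """Hypothetical stand-in for a tensor that is a view over some storage."""

    def __init__(self, storage: bytearray):
        self.storage = storage


def stage(tensors, memo=None):
    """Copy each distinct storage once, even when several tensors share it."""
    memo = {} if memo is None else memo
    staged = []
    for t in tensors:
        key = id(t.storage)
        if key not in memo:
            # First time this storage is seen: do the (expensive) copy.
            memo[key] = bytearray(t.storage)
        # Later tensors over the same storage reuse the cached copy.
        staged.append(FakeTensor(memo[key]))
    return staged, memo


shared = bytearray(b"abcd")
staged, memo = stage([FakeTensor(shared), FakeTensor(shared), FakeTensor(bytearray(b"xy"))])
```

Here three tensors produce only two copies, and the two tensors that shared storage before staging still share it afterwards.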
Test Plan:
1. Verified non-blocking copies via kineto profile.
2. Ran A/B jobs with the old staging and the new staging with fixes, configured to crash after every 2 checkpoints and restart, for several hours; the loss curves are exactly identical.
3. tests
Differential Revision: D85180484
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166025 
Approved by: https://github.com/pradeepfn  
						
						
					 
					
						2025-10-22 23:32:41 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						0b68814b44 
					 
					
						
						
							
							Forward fix to D80948073 ( #166023 )  
						
						
						
						
						Summary:
realize tensor before accessing layout.
Differential Revision: D85172267
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166023 
Approved by: https://github.com/laithsakka  
						
						
					 
					
						2025-10-22 22:00:53 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						e64a814ae7 
					 
					
						
						
							
							[CUDA] Add experimental green context support for SM carveout ( #159104 )  
						
						
						
						
						Low-level PyTorch APIs should be usable/stable enough at this point but we might move the underlying driver API usage a bit from here...
Built on top of @drisspg 's branch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159104 
Approved by: https://github.com/ngimel , https://github.com/malfet , https://github.com/kwen2501 
Co-authored-by: drisspg <drisspguessous@gmail.com >
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com > 
						
						
					 
					
						2025-10-22 21:38:52 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						0b58d87aec 
					 
					
						
						
							
							[Submodule] Bump FBGEMM to latest ( #165544 )  
						
						
						
						
						Summary:
* FBGEMM submodule updated to main
* CMake updated to reflect necessary changes
* Notably pulls in NVFP4 grouped gemm kernels
Signed-off-by: Simon Layton <simonlayton@meta.com >
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165544 
Approved by: https://github.com/cyyever , https://github.com/jeffdaily  
						
						
					 
					
						2025-10-22 20:57:15 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						757975ad50 
					 
					
						
						
							
							[export] Unified graph capture with fullgraph_capture. ( #165562 )  
						
						
						
						
						Summary:
_dynamo_graph_capture_for_export in its current form has a compatibility issue
with the main torch.compile() path even though we reuse fullgraph_capture as the
bytecode tracer. The reason is that we flip on many export-specific flags
and even trace with a wrapped function, which causes divergence from
torch.compile() again.
This PR instead creates a new implementation of dynamo_graph_capture_for_export
which 100% relies on fullgraph capture and post-processing on CaptureOutput so
that we can avoid the inversion of phases in PT2 compiler stack.
This also benefits precompile workflow since we want to have a feature that
only accepts pytree inputs and ship portable python wrappers in package. In
other words, I think the code here is sharable between export and precompile
for exporting portable graph.
Test Plan:
===================================================================== test session starts =====================================================================
platform linux -- Python 3.12.11, pytest-7.3.2, pluggy-1.6.0
rootdir: /data/users/zhxchen17/pytorch
configfile: pytest.ini
plugins: xdoctest-1.1.0, hypothesis-5.35.1, xdist-3.3.1, subtests-0.13.1, rerunfailures-14.0, flakefinder-1.1.0, cpp-2.3.0, anyio-4.10.0
collected 9 items
Running 9 items in this shard
test/distributed/tensor/test_dtensor_export.py ........x                                                                                                [100%]
================================================================ 8 passed, 1 xfailed in 11.42s ================================================================
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165562 
Approved by: https://github.com/tugsbayasgalan  
						
						
					 
					
						2025-10-22 20:44:55 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						291712026b 
					 
					
						
						
							
							[dynamo][user_defined] Replace UserFunctionVariable with VariableTracker build ( #165706 )  
						
						
						
						
						Audit: To prevent future issues with functools.partial or callable
objects.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165706 
Approved by: https://github.com/Lucaskabela , https://github.com/williamwen42  
						
						
					 
					
						2025-10-22 19:28:27 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						3e77a2b478 
					 
					
						
						
							
							[PyTorch] Improve aarch64 performance of bfloat16 ops ( #166028 )  
						
						
						
						
						Summary:
This PR allows the compiler to better optimize some bfloat16-based operations when run on NEON.
Benchmarks show measurable improvements:
Before:
bfloat16 add: 250.503us
bfloat16 sub: 245.674us
bfloat16 neg: 113.945us
After:
bfloat16 add: 203.862us ---> 23% higher throughput
bfloat16 sub: 201.526us ---> 22% higher throughput
bfloat16 neg: 74.986us ---> 52% higher throughput
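The throughput percentages follow from the timings if throughput is taken as inversely proportional to time per run, so the gain is `before / after - 1`. The short check below reproduces the quoted figures under that assumption; `throughput_gain_percent` is a name introduced only for this example.

```python
def throughput_gain_percent(before_us: float, after_us: float) -> float:
    """Relative throughput gain, assuming throughput is proportional to 1 / time."""
    return (before_us / after_us - 1.0) * 100.0


# (before, after) timings in microseconds, copied from the benchmark above.
timings = {
    "add": (250.503, 203.862),
    "sub": (245.674, 201.526),
    "neg": (113.945, 74.986),
}
gains = {op: round(throughput_gain_percent(b, a)) for op, (b, a) in timings.items()}
```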
Test Plan:
Correctness:
buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch
Performance:
binary_test.py has been updated to run bfloat16 benchmarks using basic arithmetic functions
Differential Revision: D85186786
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166028 
Approved by: https://github.com/Skylion007  
						
						
					 
					
						2025-10-22 19:25:33 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						82ef1b5db3 
					 
					
						
						
							
							[DebugMode] refactor logs into _DebugCalls ( #165376 )  
						
						
						
						
						Refactors `DebugMode.operators` to be more structured `_DebugCall` objects, instead of (op, args, kwargs, call_depth) tuples. Useful going forward for attaching more information (e.g. output info, call metadata).
Is BC-breaking, but attaches an `__iter__` method for `_OpCall` and `_RedistributeCall` so previous tuple usage is accessible.
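The compatibility approach can be sketched as below. This is a simplified, hypothetical `OpCall` class, not the actual `_OpCall`; it only shows how an `__iter__` method lets a structured object still be unpacked like the old (op, args, kwargs, call_depth) tuple.

```python
from dataclasses import dataclass, field


@dataclass
class OpCall:
    """Hypothetical simplification of a structured call record."""

    op: str
    args: tuple = ()
    kwargs: dict = field(default_factory=dict)
    call_depth: int = 0

    def __iter__(self):
        # Yield the fields in the legacy tuple order so old unpacking code keeps working.
        yield from (self.op, self.args, self.kwargs, self.call_depth)


call = OpCall("aten.add", (1, 2), {}, 0)
op, args, kwargs, depth = call  # tuple-style usage still works
```

Indexing such as `call[0]` would not work with `__iter__` alone, which is one way a change like this can remain BC-breaking for some callers.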
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165376 
Approved by: https://github.com/yushangdi  
						
						
					 
					
						2025-10-22 19:01:56 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						5f370f5c42 
					 
					
						
						
							
							inductor_provenance: Correctly handle null provenance ( #166019 )  
						
						
						
						
						Summary:
If the provenance is null, we're getting crashes of the form
```
[trainers0]:E1021 10:51:31.990525  2752 PythonApi.h:87] Exception caught in
GeneratedDynamoCompileLoggerConfig: <class
'dsi.logger.py3.GeneratedDynamoCompile.LogEntry.thrift_types.GeneratedDynamoCompileLogEntryThriftBase'>:
error initializing Thrift struct field 'inductor_provenance_thrift_safe':
Cannot create internal string data representation. Expected type <class 'str'>,
got: <class 'NoneType'>.
```
Also fixed a type signature that wasn't being enforced. (It's still not
enforced, but it's accurate).
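One way to avoid this class of error is to normalize a missing value before it reaches a string-typed field. The sketch below is an assumption about the general shape of such a fix, not the code from this PR; `provenance_to_str` is a hypothetical helper, and using an empty string for the null case is a choice made for the example.

```python
import json
from typing import Optional


def provenance_to_str(provenance: Optional[dict]) -> str:
    """Return a string for a str-typed log field, even when provenance is None."""
    if provenance is None:
        # A str-typed field rejects None, so substitute an empty string.
        return ""
    return json.dumps(provenance, sort_keys=True)
```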
Test Plan:
Added a new test which reproduces the logging issue
Differential Revision: D85173596
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166019 
Approved by: https://github.com/ppanchalia , https://github.com/yushangdi  
						
						
					 
					
						2025-10-22 18:21:57 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						05b2e02cb4 
					 
					
						
						
							
							Revert "[lint] workflow consistency linter to look at all files instead of just changed files ( #165171 )"  
						
						
						
						
						This reverts commit c746feb86a1459db5f6294730d1d72ed15f16dd3.
Reverted https://github.com/pytorch/pytorch/pull/165171 on behalf of https://github.com/clee2000 due to broke lint [GH job link](https://github.com/pytorch/pytorch/actions/runs/18723760085/job/53402955955) [HUD commit link] (c746feb86a) ([comment](https://github.com/pytorch/pytorch/pull/165171#issuecomment-3433501457))
						
						
					 
					
						2025-10-22 17:47:29 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						12f742941d 
					 
					
						
						
							
							Warn if AccumulateGrad stream does not match producer node stream ( #165065 )  
						
						
						
						
						Pull Request resolved: https://github.com/pytorch/pytorch/pull/165065 
Approved by: https://github.com/ngimel  
						
						
					 
					
						2025-10-22 17:33:27 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						35180fafee 
					 
					
						
						
							
							Allow GraphPickler to pickle graph modules containing AOTCompiled subgraphs ( #165844 )  
						
						
						
						
						This PR allows GraphPickler to pickle aot_eager graph modules that have regional inductor bits in them, with a few exceptions:
- FlexAttentionBackward isn't marked cacheable, so those tests don't work immediately since we're not sure how to serialize it. But it's safe to serialize/cache, so the next PR fixes those unit tests.
- It seems that when reloading a GraphPickled object, we don't recompile subgraphs. Will investigate this in a future PR
All unit tests in test_regional_inductor are parameterized so that we try serializing and deserializing the returned graph module before returning.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165844 
Approved by: https://github.com/oulgen 
ghstack dependencies: #165843  
						
						
					 
					
						2025-10-22 17:03:49 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						c746feb86a 
					 
					
						
						
							
							[lint] workflow consistency linter to look at all files instead of just changed files ( #165171 )  
						
						
						
						
						As in title
If you change only one workflow file, lintrunner (default arg, also the one in CI since it only inputs changed files) won't look at other files in the repo, but the sync-tag might come from those other files
This makes it so that it looks at all workflow files so it will catch those failures
Pros:
catches errors
Cons:
unusual behavior (getting around what lintrunner says the linter should run on)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165171 
Approved by: https://github.com/malfet  
						
						
					 
					
						2025-10-22 16:57:59 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						c5f26db5bf 
					 
					
						
						
							
							fix   #166057 : add tmp ptr to avoid gcc internal compiler error ( #165717 )  
						
						
						
						
						Fixes  #166057 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165717 
Approved by: https://github.com/malfet  
					
						2025-10-22 16:38:26 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						18e99b6d45 
					 
					
						
						
							
							[dirsync] Switch to top-level xplat/third-party/pthreadpool ( #165995 )  
						
						
						
						
						Summary: `fbcode//xplat/third-party/pthreadpool:` just redirects to the xplat version. Switch to the real location
Test Plan: This should be a no-op, so CI?
Differential Revision: D83999534
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165995 
Approved by: https://github.com/bigfootjon , https://github.com/Skylion007  
						
						
					 
					
						2025-10-22 16:18:23 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						ab9e466928 
					 
					
						
						
							
							[inductor][choices] lookup table choices 1/3 ( #164978 )  
						
						
						
						
						\# why
- enable users to control which choices get used on which inputs
- reduce lowering time, and pin kernel selection, by selecting
  them for the inputs
\# what
- a new InductorChoices subclass that implements a lookup table
- a README explaining the usage
- corresponding testing
- currently only supports templates that go through
  `V.choices.get_template_configs`
\# testing
```
python3 -bb -m pytest test/inductor/test_lookup_table.py -v
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164978 
Approved by: https://github.com/PaulZhang12 , https://github.com/eellison  
						
						
					 
					
						2025-10-22 16:11:31 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						af4ba78543 
					 
					
						
						
							
							[scan x vmap] support scan in vmap ( #165580 )  
						
						
						
						
						This is required by the chunked_with_scan work where two nested vmap(vmap) with chunk sizes > 1 are invoked, which produces a scan-> vmap -> scan -> vmap chain and we need to handle the case of vmap(scan) and scan(vmap).
The way we handle vmap(scan) is to turn it into scan(vmap(combine_fn)). The idea is that the combine step no longer applies combine_fn to a single slice; instead it vmaps over combine_fn and performs multiple combine_fn calls in one step. We need to know how combine_fn propagates the batched tensor and what the batched dims of the output are. For this purpose, we use restore_vmap to give us the out_dims information.
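The rewrite from vmap(scan) to scan(vmap(combine_fn)) can be checked on a toy example. The sketch below uses nested Python lists and hypothetical `scan` and `vmap` helpers with simplified signatures (the combine function returns only the next carry); it does not use the real torch APIs.

```python
def scan(combine_fn, init, xs):
    """Simplified scan: thread a carry through xs and collect every intermediate carry."""
    carry, outs = init, []
    for x in xs:
        carry = combine_fn(carry, x)
        outs.append(carry)
    return outs


def vmap(fn):
    """Simplified vmap: apply fn elementwise over a leading batch dimension of lists."""
    def batched(*batched_args):
        return [fn(*args) for args in zip(*batched_args)]
    return batched


def add(carry, x):
    return carry + x


xs_batch = [[1, 2, 3], [10, 20, 30]]  # batch of 2 sequences, each of length 3
inits = [0, 100]

# vmap(scan): one independent scan per batch element.
per_example = [scan(add, init, xs) for init, xs in zip(inits, xs_batch)]

# scan(vmap(combine_fn)): a single scan over time whose step handles the whole batch.
xs_time_major = [list(step) for step in zip(*xs_batch)]
batched_steps = scan(vmap(add), inits, xs_time_major)
batch_major = [list(seq) for seq in zip(*batched_steps)]
```

Both formulations yield the same values once the time-major result is transposed back, which is the equivalence the transformation relies on.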
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165580 
Approved by: https://github.com/zou3519 
ghstack dependencies: #165675  
						
						
					 
					
						2025-10-22 09:46:00 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						282f39a4bc 
					 
					
						
						
							
							[vmap][dynamo] use create_proxy instead of create_node in vmap increment nesting ctx manager ( #165675 )  
						
						
						
						
						create_node doesn't do automatic closure lifting, which causes problems when the context manager is used in a HOP region. Switch to create_proxy instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165675 
Approved by: https://github.com/zou3519 , https://github.com/guilhermeleobas  
						
						
					 
					
						2025-10-22 09:46:00 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						a479769488 
					 
					
						
						
							
							[dynamo] Clean up assert in dynamo [2/N] ( #165745 )  
						
						
						
						
						Extend from #165430 
* #165903(Clean up for graph break)
* ->#165745
* #165430 
One main refactor from the previous PR:
* For assertions like checking `len(args)` or `len(kwargs)`, using `raise_args_mismatch` instead of `raise_type_error_exc`
I am also considering moving `raise_type_error_exc` into `utils.py` for consistency.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165745 
Approved by: https://github.com/Lucaskabela  
						
						
					 
					
						2025-10-22 07:12:37 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						26c7375477 
					 
					
						
						
							
							Remove the branch of IS_CUSPARSE11_AVAILABLE is False ( #166048 )  
						
						
						
						
						This PR removes the branch taken when `IS_CUSPARSE11_AVAILABLE` is 0. Note that the condition `ROCM_VERSION >= 60300` currently always holds, as the minimum supported ROCm version is 6.3.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166048 
Approved by: https://github.com/Skylion007  
						
						
					 
					
						2025-10-22 07:10:11 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						d01f15152c 
					 
					
						
						
							
							Move toUnderlying to headeronly ( #165694 )  
						
						
						
						
						As in the title. Required in upper PRs of this ghstack.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165694 
Approved by: https://github.com/janeyx99  
						
						
					 
					
						2025-10-22 05:31:16 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						4fae6968b1 
					 
					
						
						
							
							Move toString(ScalarType) and ScalarType ostream operator to headeronly ( #164405 ) ( #166018 )  
						
						
						
						
						This PR is created to replace the reverted PR https://github.com/pytorch/pytorch/pull/164405 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166018 
Approved by: https://github.com/janeyx99  
						
						
					 
					
						2025-10-22 05:16:58 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						f9953e0f61 
					 
					
						
						
							
							Enable PLC0414 on ruff ( #165828 )  
						
						
						
						
						This PR enables `PLC0414` that fixes redundant import aliases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165828 
Approved by: https://github.com/albanD  
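As a rough illustration of what a redundant import alias looks like, here is a small detector built on the standard `ast` module. It is a simplified sketch, not ruff's implementation: the function name `redundant_aliases` and the sample source are assumptions made for the example.

```python
import ast


def redundant_aliases(source):
    """Return (line, name) pairs for imports whose alias equals the imported name."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                # `import os as os` has name == asname; `import json as js` does not.
                if alias.asname is not None and alias.asname == alias.name:
                    found.append((node.lineno, alias.name))
    return sorted(found)


sample = (
    "import os as os\n"
    "import sys\n"
    "from math import pi as pi\n"
    "import json as js\n"
)
hits = redundant_aliases(sample)
```

Here `hits` lists the two aliases that rename nothing (`os` on line 1 and `pi` on line 3); the plain import and the genuine rename are left alone.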
						
						
					 
					
						2025-10-22 04:56:52 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						34ed7a8f0d 
					 
					
						
						
							
							[ROCm] Skip test_blockwise_nvfp4_with_global_scale ( #165968 )  
						
						
						
						
						Disable the fp4 global_scale test till the feature is enabled on ROCm.
Fixes  #166027 .
Not really, but we're trading an issue for a test skip decorator since the test is parameterized.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165968 
Approved by: https://github.com/jeffdaily , https://github.com/drisspg  
						
						
					 
					
						2025-10-22 04:23:05 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						2fde10d914 
					 
					
						
						
							
							[ROCm] fix test_allocator_backend ( #166035 )  
						
						
						
						
						Fixes  #165872 .
Forward fix for PR #165298 : hipify was causing some symbols to be replaced.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166035 
Approved by: https://github.com/jeffdaily 
Co-authored-by: Jeff Daily <jeff.daily@amd.com > 
					
						2025-10-22 03:46:23 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						0a93295da0 
					 
					
						
						
							
							Update doc ( #166024 )  
						
						
						
						
						Pull Request resolved: https://github.com/pytorch/pytorch/pull/166024 
Approved by: https://github.com/yiming0416  
						
						
					 
					
						2025-10-22 03:41:31 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						4b898b51b9 
					 
					
						
						
							
							[12/n][take2] : Remove fbandroid_compiler_flags platform args ( #165916 )  
						
						
						
						
						Summary: This diff removes `fbandroid_compiler_flags`, merges its content into `compiler_flags`, and wraps it in an Android select. My first attempt at this was reverted (D84626885).
Test Plan:
CI and failing builds are now passing
```
buck2 build --target-universe fbsource//fbandroid/apps/wearable/system/healthservices:healthservices_target30_mosnative_xhdpi_arm64_release_debug_keystore_redex_postprocessed_repack_resign @//fbandroid/mode/nosan @//fbandroid/mode/opt @//fbandroid/mode/milan_build_rdk @//fbandroid/mode/relr-relocations fbsource//fbandroid/apps/wearable/system/healthservices:healthservices_target30_mosnative_xhdpi_arm64_release_debug_keystore_redex_postprocessed_repack_resign fbsource//fbandroid/apps/wearable/system/healthservices:healthservices_target30_mosnative_xhdpi_arm64_release_debug_keystore_redex_genrule fbsource//fbandroid/apps/wearable/system/healthservices:healthservices_target30_mosnative_xhdpi_arm64_release_debug_keystore-mobileconfig-definition-resource-gen fbsource//fbandroid/apps/wearable/system/healthservices:healthservices_target30_mosnative_xhdpi_arm64_release_debug_keystore
File changed: fbsource//tools/build_defs/fb_xplat_cxx_library.bzl
Buck UI: https://www.internalfb.com/buck2/509c0b7b-ada3-421a-8c32-2f1d3a7babdd 
Network: Up: 1.3MiB  Down: 293MiB  (reSessionID-17f73b81-3c34-4c01-9f6c-2b4f3c8332e3)
Loading targets.   Remaining     0/1311                                                                                                                                                                                                292986 targets declared
Analyzing targets. Remaining     0/13515                                                                                                                                                                                               216715 actions, 359204 artifacts declared
Executing actions. Remaining     0/40415                                                                                                                                                                                               6:33.3s exec time total
Command: build.    Finished 40 local, 790 remote
Time elapsed: 32.0s
BUILD SUCCEEDED
```
Reviewed By: jaejunku
Differential Revision: D84868234
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165916 
Approved by: https://github.com/malfet  
						
						
					 
					
						2025-10-22 03:01:55 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						550e3e6efb 
					 
					
						
						
							
							[dynamo] Fix MATCH_KEYS for dict pattern matching ( #165956 )  
						
						
						
						
						Pull Request resolved: https://github.com/pytorch/pytorch/pull/165956 
Approved by: https://github.com/guilhermeleobas , https://github.com/cyyever  
						
						
					 
					
						2025-10-22 02:52:07 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						715449ca76 
					 
					
						
						
							
							[MPS] Fix parity between CPU and MPS on singular matrices in linalg.lu_factor ( #165871 )  
						
						
						
						
						Fixes  #165870 . Follow up from #165254 .
This PR [a] removes the MPS specific version of `lu_factor` in favor of the version in BatchedLinearAlgebra.cpp which uses `lu_factor_ex`, and [b] updates `lu_factor_ex` error codes to match expectations.
When `lu_factor` was first implemented for MPS (#99269 ), it bypassed the implementation in BatchedLinearAlgebra.cpp since we did not have `lu_factor_ex`. Since #144651  implements `lu_factor_ex`, we can now remove the MPS specific wrapper.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165871 
Approved by: https://github.com/kulinseth , https://github.com/albanD  
					
						2025-10-22 02:48:40 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						84d8d06fc3 
					 
					
						
						
							
							Fixes floating point exception in torch.nn.PixelShuffle ( #163154 )  
						
						
						
						
						Fixes  #162251 
**Previous Output:**
`Floating point exception (core dumped)`
**Now Output:**
`RuntimeError: upscale factor is too large, (upscale_factor}^2 overflowed: upscale_factor=545460846592`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163154 
Approved by: https://github.com/cyyever , https://github.com/albanD  
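The kind of guard implied by the new error can be sketched in Python, where integers never overflow and the check has to be explicit. That the real check is against a signed 64-bit limit is an assumption, and `checked_square` is a hypothetical helper, not the function in the fix.

```python
INT64_MAX = 2**63 - 1


def checked_square(upscale_factor):
    """Square upscale_factor, raising if the result would not fit in a signed 64-bit integer."""
    squared = upscale_factor * upscale_factor  # exact: Python ints are arbitrary precision
    if squared > INT64_MAX:
        raise RuntimeError(
            f"upscale factor is too large: upscale_factor={upscale_factor}"
        )
    return squared


# The value from the report is 127 * 2**32, so its square is a multiple of 2**64.
reported = 545460846592
wrapped = (reported * reported) % 2**64  # what a wrapping 64-bit multiply would produce
```

With this input `wrapped` is 0. One plausible reading, not confirmed by the commit message, is that a wrapped square of zero was later used as a divisor, which would match the reported floating point exception.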
					
						2025-10-22 02:22:16 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						60992d98b2 
					 
					
						
						
							
							[dynamo][remaining] Replace UserFunctionVariable with VariableTracker build ( #165896 )  
						
						
						
						
						Audit: To prevent future issues with functools.partial or callable objects.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165896 
Approved by: https://github.com/Lucaskabela  
						
						
					 
					
						2025-10-22 02:13:00 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						59e015e3a1 
					 
					
						
						
							
							Remove outdated CUB macros ( #164656 )  
						
						
						
						
						This PR removes `CUB_SUPPORTS_NV_BFLOAT16` and `CUB_SUPPORTS_FUTURE_VALUE` because they are always true on CUDA >=12 installations with its CUB version. Their branches are also removed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164656 
Approved by: https://github.com/albanD , https://github.com/eqy , https://github.com/jeffdaily  
						
						
					 
					
						2025-10-22 02:02:50 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						8904a5a7c9 
					 
					
						
						
							
							Move allocation size config to AllocatorConfig for cross-allocator sharing ( #159553 )  
						
						
						
						
						# Motivation
Make CUDA and XPU share the same config and code. And allow the other backends to reuse them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159553 
Approved by: https://github.com/albanD 
ghstack dependencies: #160067  
						
						
					 
					
						2025-10-22 01:48:56 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						f5df9ca03a 
					 
					
						
						
							
							Fix creation of BINARY_SUBSCR in Python 3.14+ ( #165864 )  
						
						
						
						
						Python 3.14 replaced `BINARY_SUBSCR` with `BINARY_OP(opcode=NB_SUBSCR)`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165864 
Approved by: https://github.com/williamwen42  
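To see which form the running interpreter uses, the standard `dis` module can list the instructions compiled for a subscript expression. This is only an inspection sketch with a hypothetical helper name; it is not the dynamo code touched by this PR.

```python
import dis


def subscript_opnames():
    """Opcode names the running interpreter emits for the expression `a[b]`."""
    code = compile("a[b]", "<sketch>", "eval")
    return [ins.opname for ins in dis.get_instructions(code)]


names = subscript_opnames()
uses_binary_op = "BINARY_OP" in names
uses_binary_subscr = "BINARY_SUBSCR" in names
```

On interpreters before 3.14 `uses_binary_subscr` should be true, and on 3.14 and later `uses_binary_op` should be true instead, so code that builds bytecode has to branch on the version.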
						
						
					 
					
						2025-10-22 01:43:03 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						2998abd777 
					 
					
						
						
							
							[Code Clean] Better error handling in torch/csrc/distributed ( #165053 )  
						
						
						
						
						Replace the runtime_error of the vanilla C++ exceptions with TORCH_CHECK
Including:
torch/csrc/distributed/*
Partially fixes #148114 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165053 
Approved by: https://github.com/FFFrog , https://github.com/albanD  
						
						
					 
					
						2025-10-22 01:40:36 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						e13580e41c 
					 
					
						
						
							
							[AMD] Run int4_mm tests only for compatible arch ( #165630 )  
						
						
						
						
						These tests should be skipped on all other architectures, including gfx1100 (Navi3x).
Fixes the CI HUD for gfx1100.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165630 
Approved by: https://github.com/jeffdaily 
Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com > 
						
						
					 
					
						2025-10-22 01:38:55 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						f3b8e15f20 
					 
					
						
						
							
							[AMD][gfx1100] test_decompose_mem_bound_mm.py tolerance increase ( #165625 )  
						
						
						
						
						test_decompose_mem_bound_mm.py tolerance increase for navi3x(gfx11x)
(cherry picked from commit 03c7da05f61890bbf5ae41e23c8df6d5f6805bac) from
Fixes for CI HUD for gfx1100
Signed-off-by: Artem Kuzmitckii <artem.kuzmitckii@amd.com >
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165625 
Approved by: https://github.com/jeffdaily 
Co-authored-by: iupaikov-amd <Iurii.Paikov@amd.com >
Co-authored-by: Dmitry Nikolaev <139769634+dnikolaev-amd@users.noreply.github.com >
Co-authored-by: Jeff Daily <jeff.daily@amd.com > 
						
						
					 
					
						2025-10-22 01:38:48 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						5211f4c108 
					 
					
						
						
							
							[MPS] Fix SDPA fp16 overflow ( #165961 )  
						
						
						
						
						Do not cast the intermediate result back to lower precision data until
softmax is finished; otherwise it might produce NaN
Adjust the test to use 256 as filler value rather than 64
Fixes https://github.com/pytorch/pytorch/issues/160841 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165961 
Approved by: https://github.com/dcci , https://github.com/Skylion007 
ghstack dependencies: #165960  
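The failure mode can be reproduced in plain Python by rounding scores through IEEE half precision with the `struct` module's `e` format. This is a hedged sketch: `to_fp16` and `softmax` are illustrative helpers, and the score 65536 is chosen only because it exceeds 65504, the largest finite fp16 value; it is not taken from the MPS kernel.

```python
import math
import struct


def to_fp16(x):
    """Round x to IEEE half precision; values beyond the fp16 range become inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


scores = [256.0 * 256.0] * 4  # 65536.0, just above the fp16 maximum of 65504

# Softmax on the full-precision scores is well defined.
good = softmax(scores)

# Casting to fp16 first turns every score into inf, and inf - inf is NaN.
bad = softmax([to_fp16(s) for s in scores])
```

Keeping the scores in higher precision until after the softmax gives a uniform distribution, while the early cast yields NaN in every position.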
						
						
					 
					
						2025-10-22 01:29:42 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						ad9027b80d 
					 
					
						
						
							
							[BE] Remove unused 'rows' parameter from spmm_bmm_coo_rows_grouped ( #166041 )  
						
						
						
						
						To fix the following compilation warnings:
```
Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/sparse/mps/kernels/Mul.metal:76:14: warning: unused variable 'B' [-Wunused-variable]
  const uint B = dims.x;
             ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/sparse/mps/kernels/Mul.metal:65:26: warning: unused parameter 'rows' [-Wunused-parameter]
    device const long*   rows      [[buffer(0)]],
                         ^
2 warnings generated.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166041 
Approved by: https://github.com/Skylion007  
						
						
					 
					
						2025-10-22 00:59:41 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						a1005427bf 
					 
					
						
						
							
							[xpu] Support high stream for ProcessGroupXCCL ( #163049 )  
						
						
						
						
						Add high priority stream support for ProcessGroupXCCL. Just like CUDA, XPU streams also support execution with higher priority compared to other streams. The implementation is in https://github.com/intel/torch-xpu-ops/pull/1715 ; this PR adds the registration here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163049 
Approved by: https://github.com/guangyey , https://github.com/gujinghui , https://github.com/EikanWang , https://github.com/albanD  
						
						
					 
					
						2025-10-22 00:54:25 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						35153d0846 
					 
					
						
						
							
							Simplify c10::guts::apply ( #164566 )  
						
						
						
						
						There is only one call site of `c10::guts::apply`, and it can be replaced by `std::apply` except on ROCm. This PR therefore simplifies the implementation of `c10::guts::apply`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164566 
Approved by: https://github.com/Aidyn-A , https://github.com/albanD  
						
						
					 
					
						2025-10-22 00:47:43 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						7773a22cdb 
					 
					
						
						
							
							Revert "[AMP][Refactor] Autocast dtype handling to simplify device-specific c… ( #165221 )"  
						
						
						
						
						This reverts commit 4be1e3bf926b8e798fede3be6a3051560e9e00c5.
Reverted https://github.com/pytorch/pytorch/pull/165221  on behalf of https://github.com/clee2000  due to I think this broke test_openreg [GH job link](https://github.com/pytorch/pytorch/actions/runs/18698271058/job/53322459496 ) [HUD commit link](4be1e3bf92https://github.com/pytorch/pytorch/pull/165221#issuecomment-3430012693 )) 
						
						
					 
					
						2025-10-22 00:26:57 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						7cb467a169 
					 
					
						
						
							
							[CI] Update ONNX CI packages to latest ( #165883 )  
						
						
						
						
						This PR updates ONNX related packages to their latest versions used in CI environments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165883 
Approved by: https://github.com/justinchuby , https://github.com/albanD  
						
						
					 
					
						2025-10-22 00:25:35 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						12aac12b8d 
					 
					
						
						
							
							[Code Clean] Replace std::runtime_error with TORCH_CHECK ( #165209 )  
						
						
						
						
						Including:
1. `aten/src/ATen/core`
2. `c10/core`
Fixes part of #148114 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165209 
Approved by: https://github.com/FFFrog , https://github.com/albanD  
						
						
					 
					
						2025-10-22 00:05:22 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						2b748d0a56 
					 
					
						
						
							
							Add operator name to output json  ( #164583 )  
						
						
						
						
						On the dashboard, the benchmarks' model_name needs to be grouped with operator_name. This PR passes an additional argument, operator_name, to the output JSON for grouping.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164583 
Approved by: https://github.com/yangw-dev  
						
						
					 
					
						2025-10-21 23:58:39 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						16745a882a 
					 
					
						
						
							
							[aoti][win] add support for a list of shim libraries ( #165914 )  
						
						
						
						
						As title, support passing in a list of shim libraries when cross compiling artifacts
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165914 
Approved by: https://github.com/desertfire  
						
						
					 
					
						2025-10-21 22:55:17 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						8daef35cf1 
					 
					
						
						
							
							Revert "[Code Clean] Clean asserts in torch/ao/quantization (root, quantizer, backend_config) ( #165433 )"  
						
						
						
						
						This reverts commit df64c0c4649984093bd1a46f1e9c658c72018200.
Reverted https://github.com/pytorch/pytorch/pull/165433  on behalf of https://github.com/clee2000  due to I think this broke some quantization tests ([comment](https://github.com/pytorch/pytorch/pull/165433#issuecomment-3429741770 )) 
						
						
					 
					
						2025-10-21 22:10:19 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						51319ca090 
					 
					
						
						
							
							[Pytorch] Add NEON Vectorized<uint> family of translation layers ( #165690 )  
						
						
						
						
						Summary:
Adding NEON specializations of Vectorized<T> for uint8, uint16, uint32 and uint64.
Correctness has been checked using test_ops.py
operator_benchmark_test.py, which uses the PyTorch API, shows significant enhancements in some operations:
Before:
uint8 mul: 1460.751us
uint8 add: 2359.565us
uint8 lsl: 2151.206us
After:
uint8 mul: 194.792us ---> 650% higher throughput
uint8 add: 195.609us ---> 1100% higher throughput
uint8 lsl: 186.249us ---> 1055% higher throughput
Test Plan:
Correctness:
buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch
Performance:
buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test
Reviewed By: mcfi
Differential Revision: D84770153
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165690 
Approved by: https://github.com/malfet  
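The throughput percentages quoted above follow from the before/after timings, assuming throughput is inversely proportional to latency. A short check, using a hypothetical helper name:

```python
def throughput_gain_percent(before_us, after_us):
    """Percent increase in throughput when latency drops from before_us to after_us."""
    return (before_us / after_us - 1.0) * 100.0


# (before, after) latencies in microseconds, copied from the summary above.
timings = {
    "mul": (1460.751, 194.792),
    "add": (2359.565, 195.609),
    "lsl": (2151.206, 186.249),
}
gains = {op: throughput_gain_percent(b, a) for op, (b, a) in timings.items()}
```

The computed gains come out at roughly 650%, 1106%, and 1055%, in line with the rounded figures in the summary.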
						
						
					 
					
						2025-10-21 21:46:55 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						d311a3d1dc 
					 
					
						
						
							
							A temporary fix to autotune out of range and related IMA ( #165943 )  
						
						
						
						
						Summary:
Autotune issue during lowering w/ AOTI:
```
setStorage: sizes [1536, 32, 8192], strides [8192, 8192, 1], storage offset 0, and itemsize 2 requiring a storage size of 25673728 are out of bounds for storage of size 25362432
```
This needs a hack that creates a new base tensor with sufficient storage.
Test Plan: The e2e test finally passes on CI. See the detailed Test Plan in D83520844
Differential Revision: D84872792
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165943 
Approved by: https://github.com/laithsakka  
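The storage size in the error message can be reproduced from the sizes, strides, offset, and itemsize: the last addressable element sits at the offset plus the sum of `(size - 1) * stride`, and the storage must cover one element past it. A sketch with a hypothetical helper name, written in plain Python rather than calling into torch:

```python
def required_storage_bytes(sizes, strides, storage_offset, itemsize):
    """Bytes of storage needed to back a strided view with the given geometry."""
    if any(n == 0 for n in sizes):
        return 0  # an empty tensor touches no storage
    last_index = storage_offset + sum((n - 1) * st for n, st in zip(sizes, strides))
    return (last_index + 1) * itemsize


# Geometry copied from the error message above.
needed = required_storage_bytes([1536, 32, 8192], [8192, 8192, 1], 0, 2)
available = 25362432
shortfall = needed - available
```

`needed` evaluates to 25673728 bytes, matching the message, so the existing storage is 311296 bytes too small for the view that autotuning tries to create.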
						
						
					 
					
						2025-10-21 21:40:20 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						04adfe5ba9 
					 
					
						
						
							
							Make Backend::setGroupUid virtual ( #165957 )  
						
						
						
						
						As titled, so that we may customize this function in custom backends
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165957 
Approved by: https://github.com/d4l3k  
						
						
					 
					
						2025-10-21 21:33:24 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						4be1e3bf92 
					 
					
						
						
							
							[AMP][Refactor] Autocast dtype handling to simplify device-specific c… ( #165221 )  
						
						
						
						
						This PR refactors the autocast context manager in autocast_mode.py to simplify and centralize the logic for checking supported dtypes for each device. The previous implementation repeated similar checks for multiple device types. Now, a single mapping device_supported_dtypes is used to associate device types with their supported dtypes, and the validation logic is unified.
**The former PR #163446  was merged but reverted due to failed CI test on `openreg` related tests.**
This PR additionally makes slight modifications to some test assertions so that the CI tests pass. CI failed because the assertions required exactly the same error message. For example:
```
File "/var/lib/jenkins/workspace/test/cpp_extensions/open_registration_extension/torch_openreg/tests/test_autocast.py", line 9, in test_autocast_with_unsupported_type
    with self.assertWarnsRegex(
        AssertionError: "In openreg autocast, but the target dtype torch.float32 is not supported." does not match "In openreg autocast, but the target dtype is not supported. Disabling autocast."
```
Sorry for the inconvenience again.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165221 
Approved by: https://github.com/FFFrog , https://github.com/albanD  
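The table-driven validation described above could look roughly like the following. This is a sketch under stated assumptions: the table entries are hypothetical, strings stand in for torch dtypes, and `autocast_enabled` is an illustrative function, not the code in autocast_mode.py.

```python
import warnings

# Hypothetical table: device type -> dtypes accepted for autocast.
device_supported_dtypes = {
    "cpu": ("bfloat16", "float16"),
    "cuda": ("bfloat16", "float16"),
}


def autocast_enabled(device_type, dtype, enabled=True):
    """Return whether autocast stays enabled for this device type and target dtype."""
    supported = device_supported_dtypes.get(device_type, ())
    if enabled and dtype not in supported:
        # One shared check replaces a separate branch per device type.
        warnings.warn(
            f"In {device_type} autocast, but the target dtype {dtype} "
            "is not supported. Disabling autocast."
        )
        return False
    return enabled
```

Supporting a new device type then means adding one entry to the table instead of another copy of the check.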
						
						
					 
					
						2025-10-21 21:32:12 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						e7592f4005 
					 
					
						
						
							
							[CI] Move the periodic debug tests to newer runner ( #165158 )  
						
						
						
						
						Previously g3 = NVIDIA Tesla M60
Now g6 = NVIDIA L4
Also change cuda arch list accordingly
Pros:
More memory, newer GPU
Cons:
That was one of the few remaining tests on g3 runners, so we probably lost coverage?
We can probably run more tests in parallel now but I'm not going to do that here
Disabled a bunch of sparse tests and nestedtensor tests that were previously skipped due to not having sufficient hardware?  They are now failing with
```
Traceback (most recent call last):
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3293, in wrapper
    method(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3292, in wrapper
    with policy():
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2532, in __enter__
    self.beforeStreams[-1].synchronize()
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/streams.py", line 105, in synchronize
    super().synchronize()
torch.AcceleratorError: CUDA error: device-side assert triggered
Search for `cudaErrorAssert' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html  for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from stream_synchronize at /var/lib/jenkins/workspace/c10/cuda/CUDAFunctions.h:120 (most recent call first):
C++ CapturedTraceback:
#4  std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
#5  c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
#6  c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) [clone .cold] from CUDAException.cpp:0
#7  THCPStream_synchronize(_object*, _object*) from Stream.cpp:0
#8  cfunction_vectorcall_NOARGS from /usr/local/src/conda/python-3.10.14/Objects/methodobject.c:489
#9  _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114
#10  _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46
#11  _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114
#12  _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46
```
when run with cuda launch blocking I got a ton of stuff like
```
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [5,3,0], thread: [2,7,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [5,3,0], thread: [3,7,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [0,0,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [1,0,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [2,0,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [3,0,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [0,1,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [1,1,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [3,1,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [0,2,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [2,2,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [3,2,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [0,3,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [1,3,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [1,4,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [3,4,0] Assertion `value < upper_bound` failed.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165158 
Approved by: https://github.com/seemethere  
						
						
					 
					
						2025-10-21 21:28:12 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						d334c3649d 
					 
					
						
						
							
							[CUDA] fix reflection padding for large batch size ( #165942 )  
						
						
						
						
						Fixes [#165861 ](https://github.com/pytorch/pytorch/issues/165861 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165942 
Approved by: https://github.com/eqy  
						
						
					 
					
						2025-10-21 21:07:38 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						9f82535c5a 
					 
					
						
						
							
							[ROCm] [Normalization] Update block size ( #165941 )  
						
						
						
						
						* Seeing up to 6x improvement
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165941 
Approved by: https://github.com/jeffdaily  
						
						
					 
					
						2025-10-21 20:53:05 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						5b35fc8777 
					 
					
						
						
							
							Support multiple commits on push events in trunk tagging workflow ( #165937 )  
						
						... 
						
						
						
						Context:
* this workflow is used to create tags like `trunk/{sha}` for all `main` commits
* those tags are used by [autorevert](https://github.com/pytorch/test-infra/blob/main/aws/lambda/pytorch-auto-revert/README.md ) to rerun selected workflows
Problem: currently the workflow creates only a single tag per push event, while ghstack pushes multiple commits per single push.
This PR supports tag creation for all commits in the push event.
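A minimal sketch of the idea, not the workflow's actual implementation: given a GitHub push event payload, produce one `trunk/{sha}` tag name per commit instead of one per push. The helper name `tags_for_push` is hypothetical; the `commits[].id` and `after` fields are those of the GitHub push webhook payload.

```python
def tags_for_push(event: dict) -> list[str]:
    """Return one trunk/{sha} tag name per commit in a push event payload."""
    commits = event.get("commits") or []
    shas = [commit["id"] for commit in commits]
    # Assumption: fall back to the head commit ("after") if no commits are listed.
    if not shas and event.get("after"):
        shas = [event["after"]]
    return [f"trunk/{sha}" for sha in shas]


# A ghstack-style push carrying three commits yields three tags.
event = {"after": "ccc", "commits": [{"id": "aaa"}, {"id": "bbb"}, {"id": "ccc"}]}
print(tags_for_push(event))
```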
Complementary autorevert PR: https://github.com/pytorch/test-infra/pull/7291 
---
### Testing
I created an identical copy of this workflow in my personal repo: https://github.com/izaitsevfb/pr-head-test/actions/workflows/trunk-tagging.yml 
See action runs there.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165937 
Approved by: https://github.com/huydhn  
						
						
					 
					
						2025-10-21 20:52:34 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						2f38eece7c 
					 
					
						
						
							
							[CUDA][cuBLAS] addmm -- some refactoring for easier navigation between the Lt and non-Lt paths ( #163955 )  
						
						... 
						
						
						
						As per title. Additionally, some Lt selection conditions are revisited, and some redundancy removed (especially in the ROCm vs non-ROCm paths).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163955 
Approved by: https://github.com/ngimel , https://github.com/eqy  
						
						
					 
					
						2025-10-21 20:48:12 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						830e789a55 
					 
					
						
						
							
							[dynamo][annotate] Graph break cleanly on fx.traceback.annotate reconstruction ( #166006 )  
						
						... 
						
						
						
						This avoids generating bad bytecode, which leads to a really confusing
error. I am not sure why we can't reconstruct cleanly; it has to do with
the input being a dict, while other supported ctx managers take bools.
Fixing that is for another day. Let's give a good error message for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166006 
Approved by: https://github.com/yushangdi , https://github.com/SherlockNoMad  
						
						
					 
					
						2025-10-21 20:48:04 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						ad4dc52bf6 
					 
					
						
						
							
							Revert "shrink_group implementation to expose ncclCommShrink API ( #164518 )"  
						
						... 
						
						
						
						This reverts commit 4e643422f63a3cdd71bd141615f98de6bb54d15f.
Reverted https://github.com/pytorch/pytorch/pull/164518  on behalf of https://github.com/albanD  due to Breaks lint ([comment](https://github.com/pytorch/pytorch/pull/164518#issuecomment-3429426503 )) 
						
						
					 
					
						2025-10-21 20:24:14 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						dac9ed9790 
					 
					
						
						
							
							Bump uv from 0.8.6 to 0.9.5 in /.ci/lumen_cli ( #166017 )  
						
						... 
						
						
						
						Bumps [uv](https://github.com/astral-sh/uv ) from 0.8.6 to 0.9.5.
- [Release notes](https://github.com/astral-sh/uv/releases )
- [Changelog](https://github.com/astral-sh/uv/blob/main/CHANGELOG.md )
- [Commits](https://github.com/astral-sh/uv/compare/0.8.6...0.9.5 )
---
updated-dependencies:
- dependency-name: uv
  dependency-version: 0.9.5
  dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> 
						
						
					 
					
						2025-10-21 13:16:30 -07:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						1c7fe8f861 
					 
					
						
						
							
							[BugFix] chunk_size should always be int64_t ( #165971 )  
						
						... 
						
						
						
						Inspired by https://github.com/pytorch/pytorch/pull/156872 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165971 
Approved by: https://github.com/albanD  
						
						
					 
					
						2025-10-21 19:52:47 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						4e643422f6 
					 
					
						
						
							
							shrink_group implementation to expose ncclCommShrink API ( #164518 )  
						
						... 
						
						
						
						Closes  #164529 
To expose the new [ncclCommShrink](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/comms.html#ncclcommshrink ) API to PyTorch.
This is useful when you need to exclude certain GPUs or nodes from a collective operation, for example in fault tolerance scenarios or when dynamically adjusting resource utilization.
For more info:  [Shrinking a communicator](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/communicators.html#shrinking-a-communicator )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164518 
Approved by: https://github.com/kwen2501  
					
						2025-10-21 19:47:33 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						3c3b278872 
					 
					
						
						
							
							[reland][fx] Move Node._prepend/Node._remove_from_list to C++ ( #165882 )  
						
						... 
						
						
						
						Relands #148261  that was reverted by #150542 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165882 
Approved by: https://github.com/ezyang  
						
						
					 
					
						2025-10-21 19:43:55 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						0bd12c1168 
					 
					
						
						
							
							[CI] Extend test_transfomers to MPS ( #165960 )  
						
						... 
						
						
						
						Just skip grad_checks as they need float64
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165960 
Approved by: https://github.com/Skylion007  
						
						
					 
					
						2025-10-21 19:27:44 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						ce8a7764e2 
					 
					
						
						
							
							Revert "[dynamo][misc] Replace UserFunctionVariable with VariableTracker build ( #165707 )"  
						
						... 
						
						
						
						This reverts commit 1290b077f26543a34262587137ef64ca9ca5e17d.
Reverted https://github.com/pytorch/pytorch/pull/165707  on behalf of https://github.com/clee2000  due to failing internal tests D85160820 ([comment](https://github.com/pytorch/pytorch/pull/165707#issuecomment-3429084393 )) 
						
						
					 
					
						2025-10-21 19:25:03 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						d1269a0434 
					 
					
						
						
							
							update fr trace analysis ( #165994 )  
						
						... 
						
						
						
						Summary:
- allow empty entries from ranks
- allow only a subset of ranks to provide a dump
---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com ). Best reviewed with [ReviewStack](https://reviewstack.dev/pytorch/pytorch/pull/165994 ).
* #165638 
* #165640 
* #165642 
* __->__ #165994 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165994 
Approved by: https://github.com/fduwjj  
						
						
					 
					
						2025-10-21 19:14:33 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						c87cf1be32 
					 
					
						
						
							
							Update workaround to old CUDA bug ( #164354 ) ( #165984 )  
						
						... 
						
						
						
						The workaround cannot be removed because of BC. Here we'll
update PyTorch code base to not use the workaround.
See https://github.com/pytorch/pytorch/pull/164354  for the BC breakage issue.
Resolves https://github.com/pytorch/pytorch/issues/164348 .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165984 
Approved by: https://github.com/janeyx99  
						
						
					 
					
						2025-10-21 19:09:43 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						2fc5e45a41 
					 
					
						
						
							
							better error message when there is no pytree impl ( #165955 )  
						
						... 
						
						
						
						Differential Revision: [D85117597](https://our.internmc.facebook.com/intern/diff/D85117597 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165955 
Approved by: https://github.com/avikchaudhuri  
						
						
					 
					
						2025-10-21 18:49:22 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						f9022ba93b 
					 
					
						
						
							
							[PyTorch] Add user_metadata display to memory visualizer ( #165939 )  
						
						... 
						
						
						
						Summary: Enhanced the PyTorch CUDA memory visualizer to display user_metadata alongside stack frames when inspecting allocations. The user_metadata field is now shown in all views (Allocator State History, Active Memory Timeline, etc.) with consistent formatting. The implementation handles both string and object metadata types, displaying strings directly and objects as key-value pairs.
Test Plan:
1. Generate a memory snapshot with user_metadata
2. Open the memory visualizer in a browser
3. Load the snapshot file
4. Verify user_metadata appears
5. Test with both string metadata ("testing") and object metadata ({"key": "value"})
6. Verify formatting shows "User Metadata:\n  <value>" for strings
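The formatting rule in the test plan can be sketched as follows. This is an illustration only: the real visualizer runs in the browser, the function name `format_user_metadata` is hypothetical, and the `key: value` layout for object metadata is an assumption, since the summary only says objects are shown as key-value pairs.

```python
def format_user_metadata(metadata) -> str:
    """Render user_metadata the way the test plan describes (sketch)."""
    if metadata is None or metadata == "":
        return ""
    if isinstance(metadata, str):
        # Strings are displayed directly under the heading.
        return f"User Metadata:\n  {metadata}"
    if isinstance(metadata, dict):
        # Assumed layout: one indented "key: value" line per entry.
        lines = [f"  {key}: {value}" for key, value in metadata.items()]
        return "User Metadata:\n" + "\n".join(lines)
    return f"User Metadata:\n  {metadata!r}"


print(format_user_metadata("testing"))
print(format_user_metadata({"key": "value"}))
```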
 {F1982860439}
Differential Revision: D85095152
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165939 
Approved by: https://github.com/yushangdi  
						
						
					 
					
						2025-10-21 18:48:33 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						ff8be889ad 
					 
					
						
						
							
							Remove unused exception parameter from some files, to work with -Wunused-exception-parameter ( #165770 )  
						
						... 
						
						
						
						Summary: Address compiler complaints that were coming up, to unblock the build.
Test Plan:
before the change
```
aten/src/ATen/native/LinearAlgebra.cpp:3623:36: error: unused exception parameter 'e' [-Werror,-Wunused-exception-parameter]
 3623 |     } catch (const std::exception& e) {
      |
```
after: targets build with `-Wunused-exception-parameter`
Differential Revision: D84876246
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165770 
Approved by: https://github.com/Skylion007 , https://github.com/cyyever 
Co-authored-by: Tony Targonski <tony.targonski@meta.com > 
						
						
					 
					
						2025-10-21 18:30:29 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						292454942e 
					 
					
						
						
							
							[CD] Introduce windows.12xlarge runners for CD Windows build ( #165287 )  
						
						... 
						
						
						
						Follows https://github.com/pytorch/test-infra/pull/7174 . Windows CD build times compare as follows:
|Runner|cpu|cuda|xpu|
|-|-|-|-|
|windows.4xlarge|1.5h| 4.0h| 5.5h|
|windows.12xlarge|0.5h|1.5h|2.5h|
Fixes  #162962 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165287 
Approved by: https://github.com/zxiiro , https://github.com/malfet , https://github.com/seemethere  
						
						
					 
					
						2025-10-21 18:28:23 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						6c4412f72b 
					 
					
						
						
							
							Revert "[Inductor] support masked vectorization for the tail_loop for float64 datatype ( #163316 )"  
						
						... 
						
						
						
						This reverts commit e9d89734274a4a2640fa77b898c800a87d1d874e.
Reverted https://github.com/pytorch/pytorch/pull/163316  on behalf of https://github.com/clee2000  due to seems to have broken some no_gpu tests? test/inductor/test_cpu_repro.py::CPUReproTests::test_double_reduction_vec [GH job link](https://github.com/pytorch/pytorch/actions/runs/18689033019/job/53290772740 ) HUD commit link: e9d8973427 ([comment](https://github.com/pytorch/pytorch/pull/163316#issuecomment-3428210509 )) 
						
						
					 
					
						2025-10-21 17:44:42 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						78bf6186f2 
					 
					
						
						
							
							Revert "[Inductor] support masked vectorization for the tail_loop for fp8 datatype ( #163324 )"  
						
						... 
						
						
						
						This reverts commit e8cb34dd52c063a130f3e659576c313bbe4b4981.
Reverted https://github.com/pytorch/pytorch/pull/163324  on behalf of https://github.com/clee2000  due to seems to have broken some no_gpu tests? test/inductor/test_cpu_repro.py::CPUReproTests::test_double_reduction_vec [GH job link](https://github.com/pytorch/pytorch/actions/runs/18689033019/job/53290772740 ) HUD commit link: e9d8973427 ([comment](https://github.com/pytorch/pytorch/pull/163316#issuecomment-3428210509 )) 
						
						
					 
					
						2025-10-21 17:44:42 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						c40048472c 
					 
					
						
						
							
							Remove AOTI cross compilation time from internal CI ( #165935 )  
						
						... 
						
						
						
						Summary: as title
Test Plan: CI
Differential Revision: D85088451
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165935 
Approved by: https://github.com/desertfire  
						
						
					 
					
						2025-10-21 16:58:28 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						3dfd0c7584 
					 
					
						
						
							
							Improve PATH hints in FindvecLib.cmake ( #165881 )  
						
						... 
						
						
						
						Change `/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk` to `/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk` in `cmake/Modules/FindvecLib.cmake`, which is more general (MacOSX10.9 is no longer supported). Otherwise, vecLib can't be found on macOS 26.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165881 
Approved by: https://github.com/ezyang  
						
						
					 
					
						2025-10-21 16:44:12 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						e6ba4d0725 
					 
					
						
						
							
							Back out "Do not decompose in functionalization/proxy tensor if autograd wouldn't have decomposed ( #164939 )" ( #165910 )  
						
						... 
						
						
						
						Summary:
Original commit changeset: d6d62d0c96dd
Original Phabricator Diff: D84468451 and D84613184
D84468451 caused CUDA OutOfMemoryError in model.
Test Plan:
D84468451 was found through bisect.  Also double checked on recent trunk 9866939225248c2adc307be7a804b26db0b9b555: f815887517
With this diff that backs out D84468451 and D84613184 : f816114560
Differential Revision: D85025378
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165910 
Approved by: https://github.com/clee2000  
						
						
					 
					
						2025-10-21 16:36:38 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						bdf7cb9d9c 
					 
					
						
						
							
							Revert "[torch/utils][Code Clean] Clean asserts in torch/utils/*.py ( #165410 )"  
						
						... 
						
						
						
						This reverts commit e20c9bf2889b9252ac45ae6af35c93c795eab701.
Reverted https://github.com/pytorch/pytorch/pull/165410  on behalf of https://github.com/clee2000  due to sorry I'm going to revert this since I want to try to back out some other things that are conflicting with this, there is nothing wrong with this PR, rebasing and resolving the merge conflicts should be enough, sorry for the churn ([comment](https://github.com/pytorch/pytorch/pull/165410#issuecomment-3427532373 )) 
						
						
					 
					
						2025-10-21 16:27:54 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						6aed378958 
					 
					
						
						
							
							[export] Handle kwargs better in aot_export_joint_with_descriptors ( #165334 )  
						
						... 
						
						
						
						fx.Interpreter doesn't handle kwargs... not sure how this code worked previously
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165334 
Approved by: https://github.com/tugsbayasgalan , https://github.com/ezyang  
						
						
					 
					
						2025-10-21 15:53:05 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						8b3dc0d1b0 
					 
					
						
						
							
							Better error handling in torch/csrc/jit/runtime/*  ( #165118 )  
						
						... 
						
						
						
						Refactor error handling by using TORCH_CHECK for improved clarity in constants and scope management in some files in torch/csrc/jit/runtime/*
Fixes some parts of ISSUE https://github.com/pytorch/pytorch/issues/148114 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165118 
Approved by: https://github.com/FFFrog , https://github.com/albanD  
						
						
					 
					
						2025-10-21 15:22:49 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						06773663b5 
					 
					
						
						
							
							Implement an AOT precompile mode for standalone_compile ( #165843 )  
						
						... 
						
						
						
						This PR introduces an `aot` flag to standalone_compile that uses BundledAOTAutogradCacheEntry, and then allows regional_inductor to use this so that we can start aot compiling regional compiler graphs. The diff above this will attempt to allow GraphPickler to fully serialize graphs that have regionally compiled subgraphs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165843 
Approved by: https://github.com/oulgen  
						
						
					 
					
						2025-10-21 15:02:45 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						0bff65503c 
					 
					
						
						
							
							Move hardware_destructive_interference_size to c10/core/alignment.h ( #160067 )  
						
						... 
						
						
						
						# Motivation
Move `hardware_destructive_interference_size` to `c10/core/alignment.h`, which gives a chance to reuse it across different accelerators.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160067 
Approved by: https://github.com/Skylion007 , https://github.com/EikanWang  
						
						
					 
					
						2025-10-21 14:39:46 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						21131a2444 
					 
					
						
						
							
							Revert "[ROCm][CI] Update rocm.yml workflow to use 1 GPU ARC runners ( #165481 )"  
						
						... 
						
						
						
						This reverts commit ffa90d46e61650834d5f926008f48f50c6a7e87a.
Reverted https://github.com/pytorch/pytorch/pull/165481  on behalf of https://github.com/jeffdaily  due to timeouts after merge ([comment](https://github.com/pytorch/pytorch/pull/165481#issuecomment-3426898171 )) 
						
						
					 
					
						2025-10-21 14:15:55 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						1009790ad8 
					 
					
						
						
							
							[pytree][dynamo] trace on native optree functions for community pytree support ( #165860 )  
						
						... 
						
						
						
						Resolves  #164972 
- #164972 
All `torch.utils._cxx_pytree` functions are based on `optree` functions with hardcoded `none_is_leaf=True` and `namespace="torch"`. This PR changes the polyfills to generic `optree` functions with those arguments unhardcoded. This means `torch.utils._cxx_pytree` functions are still traceable while the community `optree` usages can get dynamo support additionally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165860 
Approved by: https://github.com/Lucaskabela  
					
						2025-10-21 14:13:08 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						410e6a4321 
					 
					
						
						
							
							Better error handling in torch/csrc/jit/frontend/* ( #165213 )  
						
						... 
						
						
						
						Refactor error handling by using TORCH_CHECK for improved clarity in constants and scope management in some files in torch/csrc/jit/frontend/*
Fixes some parts of ISSUE https://github.com/pytorch/pytorch/issues/148114 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165213 
Approved by: https://github.com/FFFrog , https://github.com/albanD  
						
						
					 
					
						2025-10-21 13:54:59 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						23c55c5b66 
					 
					
						
						
							
							[Code Clean]Replace assert statements with explicit if/raise patterns ( #165735 )  
						
						... 
						
						
						
						Fix part of #164878 
Replace 75 assert statements with explicit if/raise patterns in `torch/ao/ns`, including:
- `torch/ao/ns/_numeric_suite_fx.py`  - 5 asserts
- `torch/ao/ns/fx/graph_matcher.py` - 6 asserts
- `torch/ao/ns/fx/graph_passes.py` - 12 asserts
- `torch/ao/ns/fx/n_shadows_utils.py` - 20 asserts
- `torch/ao/ns/fx/pattern_utils.py` - 2 asserts
- `torch/ao/ns/fx/utils.py` - 21 asserts
- `torch/ao/ns/fx/weight_utils.py` - 19 asserts
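The pattern can be sketched as below. The function names and the choice of `ValueError` are assumptions made for illustration, not taken from the PR. The practical difference is that `assert` statements are removed when Python runs with `-O`, while an explicit `if`/`raise` always executes.

```python
def scale_with_assert(value: float, factor: float) -> float:
    # Before: this check disappears entirely under `python -O`.
    assert factor > 0, f"factor must be positive, got {factor}"
    return value * factor


def scale_with_raise(value: float, factor: float) -> float:
    # After: the check runs regardless of interpreter optimization flags.
    if factor <= 0:
        raise ValueError(f"factor must be positive, got {factor}")
    return value * factor


try:
    scale_with_raise(2.0, -1.0)
except ValueError as err:
    print(f"rejected: {err}")
```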
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165735 
Approved by: https://github.com/albanD  
						
						
					 
					
						2025-10-21 11:21:57 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						1290b077f2 
					 
					
						
						
							
							[dynamo][misc] Replace UserFunctionVariable with VariableTracker build ( #165707 )  
						
						... 
						
						
						
						Audit: To prevent future issues with functools.partial or callable
objects.
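As a plain-Python illustration (independent of Dynamo internals), a `functools.partial` and an instance with `__call__` are both callable, yet neither is a plain function object, so code that assumes a plain function can mishandle them. The helper `is_plain_function` is hypothetical.

```python
import functools
import types


def add(a, b):
    return a + b


class Adder:
    def __call__(self, a, b):
        return a + b


def is_plain_function(obj) -> bool:
    """True only for objects created by `def` or `lambda`."""
    return isinstance(obj, types.FunctionType)


add_one = functools.partial(add, 1)

for obj in (add, add_one, Adder()):
    # All three are callable, but only `add` is a plain function.
    print(type(obj).__name__, callable(obj), is_plain_function(obj))
```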
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165707 
Approved by: https://github.com/Lucaskabela  
						
						
					 
					
						2025-10-21 09:27:41 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						9f9ab881b2 
					 
					
						
						
							
							[ROCm][inductor] heuristic improvements for reduction kernels ( #161280 )  
						
						... 
						
						
						
						Improvements to reduction kernel heuristics for MI350.
Contributions from several members of the AMD Inductor and Triton teams: @jataylo @iupaikov-amd @AmdSampsa @xiaohuguo2023
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161280 
Approved by: https://github.com/jansel , https://github.com/PaulZhang12 , https://github.com/eellison , https://github.com/jeffdaily  
						
						
					 
					
						2025-10-21 07:48:54 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						f2bb22ff84 
					 
					
						
						
							
							[Inductor-FX] Support Tensor.item ( #165599 )  
						
						... 
						
						
						
						# Feature
This PR supports compiling `Tensor.item` with Inductor's FX backend. This maps to a custom WrapperCodeGen method called `codegen_dynamic_scalar`.
# Implementation
The implementation is fairly mechanical, following the usual flow for these types of PRs.
1. Introduce a new Wrapper IR line for this, called `DynamicScalarLine`.
2. Split `PythonWrapperCodegen.codegen_dynamic_scalar` into 2 parts: a public method which generates the Wrapper IR line, and a private one generating Python from Wrapper IR.
3. Implement an FX codegen method for the wrapper IR line. This one calls `aten.where.Scalar` to handle code like `1 if x.item() else 0`, which is a bit tricky. It also calls `aten.item.default` to convert tensors to scalars.
# Test plan
Added CI tests mirroring the AOTI ones. They test float, int and bool types, the latter taking a distinct codegen path.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165599 
Approved by: https://github.com/angelayi , https://github.com/jansel  
						
						
					 
					
						2025-10-21 07:09:56 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						03f3f7899c 
					 
					
						
						
							
							[ATen] Add reduction tag to reduction operators ( #165155 )  
						
						... 
						
						
						
						Add a new 'reduction' tag to tags.yaml and apply it to 98 reduction
operator variants across 21 operator families (sum, mean, min, max,
argmin, argmax, amin, amax, aminmax, prod, all, any, norm, var, std,
std_mean, var_mean, nansum, logsumexp, count_nonzero, linalg_vector_norm).
This tag categorizes operators that perform reduction operations,
computing aggregate values across one or more dimensions of input
tensor(s).
Based on PR #153342  - co-written with @AlonSardas.
Just like the existing pointwise tag, this can be useful for compiler passes or for opting into sharding rules.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165155 
Approved by: https://github.com/ezyang , https://github.com/zou3519 , https://github.com/mlazos  
						
						
					 
					
						2025-10-21 04:35:03 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						771170807b 
					 
					
						
						
							
							[dynamo][nn_module] Replace UserFunctionVariable with VariableTracker build ( #165708 )  
						
						... 
						
						
						
						Audit: To prevent future issues with functools.partial or callable objects.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165708 
Approved by: https://github.com/Lucaskabela  
						
						
					 
					
						2025-10-21 04:13:12 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						ffa90d46e6 
					 
					
						
						
							
							[ROCm][CI] Update rocm.yml workflow to use 1 GPU ARC runners ( #165481 )  
						
						... 
						
						
						
						* Moving rocm.yml from using persistent non-ARC runners from the combined MI2xx (MI210 + MI250) cluster to the ARC runners from the MI250 cluster. This halves the number of nodes, but provides access to approximately 4 times the runners, since every 8-GPU MI250 node now provides 8 1-GPU runners. This should help with concurrent capacity and queueing on the MI2xx jobs.
Tested here successfully: https://github.com/pytorch/pytorch/actions/runs/18620814622/job/53092469720 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165481 
Approved by: https://github.com/jeffdaily 
Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com > 
						
						
					 
					
						2025-10-21 04:02:04 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						0e083942cc 
					 
					
						
						
							
							Enable PLW0127 in ruff ( #165851 )  
						
						... 
						
						
						
						This PR enables `PLW0127` in ruff, which checks self-assignment of variables with the form `var=var`.
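A small example of the kind of code the rule is expected to flag, with hypothetical function names. The self-assignment has no effect and usually signals a leftover or a typo.

```python
def normalize_flagged(name: str) -> str:
    name = name  # PLW0127: self-assignment of a variable, does nothing
    return name.strip().lower()


def normalize_fixed(name: str) -> str:
    # The redundant assignment is simply dropped; behavior is unchanged.
    return name.strip().lower()


print(normalize_fixed("  Torch "))
```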
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165851 
Approved by: https://github.com/Lucaskabela  
						
						
					 
					
						2025-10-21 03:30:57 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						ce1fcff03e 
					 
					
						
						
							
							[ROCm] Keep amdgpu-coerce-illegal-types flag if rocm version is less than 7.2 ( #165789 )  
						
						... 
						
						
						
						The `-amdgpu-coerce-illegal-types=1` flag is for the LLVM that ships in ROCm 6.3, 6.4, 7.0, and 7.1. It will not be in ROCm 7.2. It was added to enable performance improvements for composable kernel. ROCm 7.2 and newer changed the compiler so that the flag isn't needed to achieve those performance improvements. Keeping the flag with ROCm 7.2 breaks the PyTorch build.
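The version gate described above can be sketched in Python. The actual change lives in the build configuration; the helper names `parse_version` and `extra_compiler_flags` are hypothetical, and only the flag string and the 7.2 cutoff come from the commit message.

```python
COERCE_FLAG = "-amdgpu-coerce-illegal-types=1"


def parse_version(version: str) -> tuple[int, int]:
    """Extract (major, minor) from a version string such as '7.1.0'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def extra_compiler_flags(rocm_version: str) -> list[str]:
    # Keep the flag only for ROCm versions older than 7.2.
    if parse_version(rocm_version) < (7, 2):
        return [COERCE_FLAG]
    return []


for version in ("6.4.1", "7.1.0", "7.2.0"):
    print(version, extra_compiler_flags(version))
```

Comparing `(major, minor)` tuples avoids the string-comparison pitfall where `"7.10"` would sort before `"7.2"`.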
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165789 
Approved by: https://github.com/jithunnair-amd , https://github.com/jeffdaily  
						
						
					 
					
						2025-10-21 03:17:33 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						a238a9a100 
					 
					
						
						
							
							Add clang-tidy misc-definitions-in-headers check ( #164959 )  
						
						... 
						
						
						
						Pull Request resolved: https://github.com/pytorch/pytorch/pull/164959 
Approved by: https://github.com/Skylion007 , https://github.com/mikaylagawarecki 
ghstack dependencies: #164882 , #164956  
						
						
					 
					
						2025-10-21 02:59:46 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						fe69a2bbbd 
					 
					
						
						
							
							Move from/to to torch::stable::detail ( #164956 )  
						
						... 
						
						
						
						To not pollute the global namespace, we should move the `from`/`to` APIs into torch::stable::detail. We are also following our normal deprecation cycle and choosing to continue exposing the global `from`/`to` for the time being as people who onboard their extensions onto 2.9 would not be able to build with 2.10 otherwise.
Note that this means that within libtorch, we do not get the luxury of tacking on a `using torch::stable::detail::from` because then it leads to build time ambiguous calls --> both the global and namespace APIs are exposed, which one do I want? So that is why you see every local site is updated.
Note that the update is _not_ necessary from a custom op writer point of view. FA3 can continue to build on torch nightlies without changing any code. (Since this is a header change, this PR has no implication on runtime, a previously built FA3 ABI stable wheel will continue to work fine with newer torch versions after this PR.)
Once TORCH_BOX lands, we would be free to remove these global APIs when the deprecation cycle is up (April 2026) and encourage people to use TORCH_BOX and avoid from/to entirely.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164956 
Approved by: https://github.com/malfet 
ghstack dependencies: #164882  
						
						
					 
					
						2025-10-21 02:59:46 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						0be0de4ffa 
					 
					
						
						
							
							Add type suppressions to _inductor/runtime ( #165918 )  
						
						... 
						
						
						
						The original PR that did this was reverted due to merge conflicts.
Trying it again.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165918 
Approved by: https://github.com/oulgen  
						
						
					 
					
						2025-10-21 02:54:22 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						7406d2e665 
					 
					
						
						
							
							[DeviceMesh] Clean up the call into mesh_resouces to get root mesh ( #165787 )  
						
						... 
						
						
						
						We moved the method to get the root mesh into the class in https://github.com/pytorch/pytorch/pull/164510 . This PR further cleans up the code.
Differential Revision: [D85090191](https://our.internmc.facebook.com/intern/diff/D85090191 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165787 
Approved by: https://github.com/fegin  
						
						
					 
					
						2025-10-21 02:54:04 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						303c9cf048 
					 
					
						
						
							
							Save Python refcount bump on each arg in maybe_handle_torch_function ( #164625 )  
						
						... 
						
						
						
						Pybind's API entails a small unnecessary overhead when working with args. (Similarly, we should probably be using vectorcall, but that's a bigger change for both us and pybind11.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164625 
Approved by: https://github.com/albanD 
ghstack dependencies: #164624  
						
						
					 
					
						2025-10-21 02:40:12 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						d7d4bb7c51 
					 
					
						
						
							
							Add XPU part for persons_of_interest ( #165920 )  
						
						
						
						
						Pull Request resolved: https://github.com/pytorch/pytorch/pull/165920 
Approved by: https://github.com/albanD  
						
						
					 
					
						2025-10-21 01:57:17 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						0b1c462979 
					 
					
						
						
							
							Making Numpy dependency in Local Tensor optional to fix broken Torchao CI ( #165938 )  
						
						
						
						
						A recent change to LocalTensor introduced a dependency on Numpy, which broke Torchao CI.
This dependency can be made optional and required only when Local Tensor is used.
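The optional-dependency pattern described above can be sketched as follows. This is a minimal illustration, not the code from the PR: the helper `_require_numpy` and the function `to_rank_array` are hypothetical names, and the sketch assumes a guarded probe at import time plus a check at the point of use.

```python
import importlib.util

# Probe for NumPy without importing it, so that importing this module
# never fails when NumPy is absent.
HAS_NUMPY = importlib.util.find_spec("numpy") is not None


def _require_numpy():
    """Import NumPy lazily; fail with a clear error only when it is needed."""
    if not HAS_NUMPY:
        raise ModuleNotFoundError(
            "NumPy is required for this feature; install it with `pip install numpy`."
        )
    import numpy as np

    return np


def to_rank_array(ranks):
    """Hypothetical feature that needs NumPy only when it is actually called."""
    np = _require_numpy()
    return np.asarray(sorted(ranks))
```

With this layout, code paths that never call `to_rank_array` run without NumPy installed, which is the property a downstream CI without NumPy would rely on.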
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165938 
Approved by: https://github.com/atalman  
						
						
					 
					
						2025-10-21 01:46:53 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						4a6cf0a93e 
					 
					
						
						
							
							Fix dynamo stack trace ( #165930 )  
						
						
						
						
						Fixes #165911 
- Add a message to the `AttributeError` so we see `Developer debug context: raised exception AttributeError(["'Linear' object has no attribute 'w'"])` instead of just `Developer debug context: raised exception AttributeError([])`
- Add the stack trace to `ObservedException` so we display the innermost error's stack trace back to user code
Output:
```
/data/users/shangdiy/pytorch/torch/__init__.py:2641: UserWarning: You are calling torch.compile inside torch.export region. To capture an useful graph, we will implicitly switch to torch.compile(backend=eager)
  warnings.warn(
Traceback (most recent call last):
  File "/data/users/shangdiy/pytorch/torch/_dynamo/variables/user_defined.py", line 1385, in var_getattr
    subobj = self._getattr_static(name)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/shangdiy/pytorch/torch/_dynamo/variables/user_defined.py", line 1256, in _getattr_static
    subobj = type(self.value).__getattribute__(self.value, name)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'Linear' object has no attribute 'w'
During handling of the above exception, another exception occurred:
torch._dynamo.exc.ObservedAttributeError: 'Linear' object has no attribute 'w'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/data/users/shangdiy/pytorch/test.py", line 34, in <module>
    mod = torch._dynamo.functional_export._dynamo_graph_capture_for_export(Model())(x)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/shangdiy/pytorch/torch/_dynamo/functional_export.py", line 481, in inner
    out = fullgraph_capture(
          ^^^^^^^^^^^^^^^^^^
  File "/data/users/shangdiy/pytorch/torch/_dynamo/convert_frame.py", line 1053, in fullgraph_capture
    return _fullgraph_capture_frame(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/shangdiy/pytorch/torch/_dynamo/convert_frame.py", line 1115, in _fullgraph_capture_frame
    raise e.with_traceback(None) from e.__cause__  # User compiler error
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch._dynamo.exc.Unsupported: Observed exception
  Explanation: Dynamo found no exception handler at the top-level compiled function when encountering an exception. Exception will propagate outside the compiled region.
  Hint: Dynamo has detected that tracing the code will result in an error when running in eager. Please double check that your code doesn't contain a similar error when actually running eager/uncompiled.
  Hint: It may be possible to write Dynamo tracing rules for this code. Please report an issue to PyTorch if you encounter this graph break often and it is causing performance issues.
  Developer debug context: raised exception AttributeError(["'Linear' object has no attribute 'w'"])
 For more details about this graph break, please visit: https://meta-pytorch.github.io/compile-graph-break-site/gb/gb0088.html 
from user code:
   File "/data/users/shangdiy/pytorch/torch/_dynamo/functional_export.py", line 171, in forward
    res = self._export_root(*args, **kwargs)
  File "/data/users/shangdiy/pytorch/test.py", line 31, in forward
    weight = self.linear.w
Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
```
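The "direct cause" chaining visible in this output is standard Python exception chaining. The sketch below reproduces the shape of the report in plain Python; it is not Dynamo's implementation, and `ObservedError`, `UserFacingError`, `read_attr`, `capture` and `format_chain` are hypothetical names used only for illustration.

```python
import traceback


class ObservedError(Exception):
    """Hypothetical stand-in for an internal 'observed' exception."""


class UserFacingError(Exception):
    """Hypothetical stand-in for the error finally shown to the user."""


class Linear:
    """Empty class, so reading `.w` raises AttributeError."""


def read_attr(obj, name):
    try:
        return getattr(obj, name)
    except AttributeError as err:
        # Carry the original message along and keep the original error as __cause__.
        raise ObservedError(str(err)) from err


def capture(obj, name):
    try:
        return read_attr(obj, name)
    except ObservedError as err:
        # Chain the user-facing error to the innermost error, so the
        # innermost stack trace is printed first in the report.
        raise UserFacingError(f"Observed exception: {err}") from err.__cause__


def format_chain():
    """Return the full chained traceback as a string."""
    try:
        capture(Linear(), "w")
    except UserFacingError as exc:
        return "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
```

Calling `print(format_chain())` shows the `AttributeError` traceback first, then the "direct cause" separator, then the user-facing error with the original message preserved.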
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165930 
Approved by: https://github.com/anijain2305  
					
						2025-10-21 01:32:23 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						4c963a68d7 
					 
					
						
						
							
							Use inline instead of anon namespace for stableivalue from/to ( #164882 )  
						
						
						
						
						Fixes https://github.com/pytorch/pytorch/issues/163343 .
After some consideration, I propose we remove the anonymous namespace around from/to in favor of:
1. Adding inline to the function implementations, assuming that they will not change in the near future
2. If we decide to change them, we will wrap the code in inline versioned namespaces such that the implementations within any versioned namespace will be guaranteed identical.
Note that:
- We eventually intend to abstract away usage of `from`/`to` (related: @lw's TORCH_BOX work)
- The from/to implementations are now powered through class template specializations, where adding a specialization does not change the from/to signatures.
Consequently, I plan to deprecate top-level from/to in favor of torch::stable::details::from/to. This way we can stop polluting the global namespace.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164882 
Approved by: https://github.com/lw , https://github.com/albanD  
						
						
					 
					
						2025-10-21 00:12:15 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						b20deec3d1 
					 
					
						
						
							
							[PP] Add optional argument to not save outputs ( #165822 )  
						
						
						
						
						Fix https://github.com/pytorch/pytorch/issues/159251 
Add an optional argument `return_outputs` to the schedule `step`
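A toy sketch of what such a flag can look like. Only the argument name `return_outputs` and the method name `step` come from the commit message; the class `ToySchedule`, its constructor, and the default value `True` are assumptions made for illustration and are not the PyTorch pipeline-parallel API.

```python
class ToySchedule:
    """Hypothetical stand-in for a pipeline schedule; not the PyTorch class."""

    def __init__(self, stage_fn, n_microbatches):
        self.stage_fn = stage_fn
        self.n_microbatches = n_microbatches

    def step(self, batch, return_outputs=True):
        # Split the batch into equally sized microbatches.
        size = len(batch) // self.n_microbatches
        outputs = [] if return_outputs else None
        for i in range(self.n_microbatches):
            out = [self.stage_fn(x) for x in batch[i * size:(i + 1) * size]]
            if return_outputs:
                outputs.extend(out)
            # Otherwise each microbatch result is dropped right away,
            # so nothing accumulates across microbatches.
        return outputs
```

With `return_outputs=False` the caller gets `None` back and the per-microbatch results are never retained, which is the memory saving the option is aiming at.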
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165822 
Approved by: https://github.com/wconstab  
						
						
					 
					
						2025-10-21 00:09:31 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						51d0d8ee67 
					 
					
						
						
							
							[ATen] Fix CUDA reduction warp shuffle order ( #164790 )  
						
						
						
						
						Typical warp shuffle reduction has the following pattern:
<img width="1138" height="501" alt="image" src="https://github.com/user-attachments/assets/3bd176dc-0ad2-4df6-90c7-06e467337166 " />
which is exhibited in Triton generated by torch.compile:
<img width="663" height="403" alt="image" src="https://github.com/user-attachments/assets/7f9f36cd-b9eb-44c1-879e-b469668a2ea8 " />
Switch the warp shuffle order to make bitwise equivalence between the 2 easier.
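To see why the shuffle order matters for bitwise equivalence, here is a small pure-Python simulation of a warp reduction. It assumes a butterfly (XOR) exchange across 32 lanes and simply compares two offset orders; the commit message does not spell out which order is the old one and which is the new one, so the names `DESCENDING` and `ASCENDING` are illustrative only.

```python
WARP_SIZE = 32


def warp_reduce_sum(values, offsets):
    """Simulate a butterfly (XOR) shuffle reduction over one 32-lane warp."""
    assert len(values) == WARP_SIZE
    lanes = list(values)
    for offset in offsets:
        # Every lane adds its partner's value from the previous step;
        # building a new list models all lanes exchanging simultaneously.
        lanes = [lanes[i] + lanes[i ^ offset] for i in range(WARP_SIZE)]
    return lanes


DESCENDING = [16, 8, 4, 2, 1]
ASCENDING = [1, 2, 4, 8, 16]
```

For integers both orders leave the same total in every lane. For floating-point inputs the two orders pair the operands differently, and because floating-point addition is not associative the results can differ in the low bits. Two kernels that must match bit for bit therefore need to use the same order.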
The PTX diff between old and new shows a few extra instructions: https://www.diffchecker.com/h6ly3INC/ 
Comparing performance across different reduction operations, we see minimal differences. "New" is the warp shuffle order introduced in this PR; "Old" is the previous warp shuffle order:
```
Tensor Shape              Operation            New all dims (ms)       New dim=0 (ms)      New dim=1 (ms)     Old all dims (ms)    Old dim=0 (ms)      Old dim=1 (ms)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(1024, 1024)              mean                 0.015817             0.016259             0.013642             0.015990             0.016258             0.013631
(1024, 1024)              sum                  0.015917             0.015906             0.013359             0.015707             0.016266             0.013226
(1024, 1024)              min                  0.016021             0.024625             0.015631             0.015761             0.024485             0.015317
(1024, 1024)              max                  0.016349             0.024971             0.015972             0.015771             0.025001             0.015314
(1024, 1024)              argmin               0.018070             0.024448             0.015578             0.018135             0.025370             0.015322
(1024, 1024)              argmax               0.018427             0.024859             0.015932             0.018164             0.024452             0.015639
(1024, 1024)              var                  0.020078             0.026413             0.020295             0.020199             0.026381             0.020214
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(2048, 2048)              mean                 0.023826             0.023726             0.022273             0.023236             0.023776             0.022248
(2048, 2048)              sum                  0.023840             0.023355             0.021974             0.023294             0.023354             0.021884
(2048, 2048)              min                  0.024519             0.041263             0.024620             0.023292             0.041491             0.024358
(2048, 2048)              max                  0.024509             0.041670             0.024277             0.023334             0.041231             0.024395
(2048, 2048)              argmin               0.026125             0.041282             0.024567             0.026772             0.041773             0.024296
(2048, 2048)              argmax               0.026117             0.041487             0.024572             0.026412             0.041477             0.024273
(2048, 2048)              var                  0.026603             0.048581             0.031308             0.027587             0.048603             0.030860
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(4096, 4096)              mean                 0.053927             0.057070             0.054073             0.053028             0.057544             0.053935
(4096, 4096)              sum                  0.053604             0.057410             0.054451             0.053076             0.057033             0.054266
(4096, 4096)              min                  0.054293             0.109122             0.058363             0.053821             0.108689             0.058382
(4096, 4096)              max                  0.054258             0.108035             0.058703             0.053492             0.110552             0.058376
(4096, 4096)              argmin               0.056805             0.111167             0.058301             0.056836             0.112325             0.058292
(4096, 4096)              argmax               0.056488             0.110958             0.058636             0.056844             0.111000             0.057928
(4096, 4096)              var                  0.058936             0.141755             0.068693             0.059735             0.141284             0.068500
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(8192, 8192)              mean                 0.145552             0.148082             0.138647             0.145364             0.147818             0.138207
(8192, 8192)              sum                  0.145985             0.147900             0.138714             0.145755             0.148031             0.138616
(8192, 8192)              min                  0.146566             0.205359             0.192739             0.145611             0.205237             0.182335
(8192, 8192)              max                  0.146526             0.204844             0.193050             0.146073             0.205457             0.182697
(8192, 8192)              argmin               0.150190             0.206605             0.192543             0.150654             0.206847             0.182007
(8192, 8192)              argmax               0.150481             0.206368             0.192535             0.150845             0.206430             0.182022
(8192, 8192)              var                  0.150884             0.184546             0.203900             0.151594             0.184172             0.197983
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(1, 1024, 128)            mean                 0.014293             0.008119             0.014533             0.013861             0.008022             0.014449
(1, 1024, 128)            sum                  0.014039             0.007877             0.014111             0.014219             0.008227             0.014045
(1, 1024, 128)            min                  0.014159             0.011354             0.023493             0.014271             0.010862             0.023644
(1, 1024, 128)            max                  0.014154             0.011027             0.023368             0.014259             0.011234             0.023692
(1, 1024, 128)            argmin               0.016403             0.005677             0.023328             0.016273             0.005683             0.024073
(1, 1024, 128)            argmax               0.016734             0.005675             0.023437             0.016580             0.005318             0.023331
(1, 1024, 128)            var                  0.018338             0.009549             0.025538             0.018528             0.009391             0.024777
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(5, 1024, 128)            mean                 0.014873             0.010131             0.015546             0.015123             0.010131             0.015481
(5, 1024, 128)            sum                  0.015334             0.009673             0.015824             0.014736             0.009671             0.015438
(5, 1024, 128)            min                  0.015047             0.013252             0.024573             0.014803             0.013163             0.024551
(5, 1024, 128)            max                  0.015050             0.013339             0.024197             0.014810             0.013525             0.024230
(5, 1024, 128)            argmin               0.017341             0.012737             0.024306             0.017471             0.012379             0.024991
(5, 1024, 128)            argmax               0.017345             0.012411             0.024421             0.017422             0.012471             0.024237
(5, 1024, 128)            var                  0.019973             0.011453             0.026188             0.020050             0.011438             0.026282
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(10, 1024, 128)           mean                 0.016976             0.011575             0.016831             0.016722             0.011927             0.017173
(10, 1024, 128)           sum                  0.017039             0.011841             0.017159             0.016385             0.011860             0.016753
(10, 1024, 128)           min                  0.017036             0.015331             0.026770             0.016944             0.015205             0.027166
(10, 1024, 128)           max                  0.017369             0.015348             0.027077             0.016531             0.015716             0.026819
(10, 1024, 128)           argmin               0.019203             0.014447             0.026813             0.018994             0.014497             0.027313
(10, 1024, 128)           argmax               0.019563             0.014795             0.027140             0.019460             0.014912             0.026733
(10, 1024, 128)           var                  0.020529             0.014316             0.030405             0.020719             0.013960             0.029964
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(100, 1024, 128)          mean                 0.045046             0.039168             0.046082             0.044839             0.039217             0.045782
(100, 1024, 128)          sum                  0.045094             0.039150             0.045777             0.044496             0.039542             0.046083
(100, 1024, 128)          min                  0.045768             0.054466             0.076244             0.044915             0.053943             0.076599
(100, 1024, 128)          max                  0.045748             0.054459             0.076188             0.044931             0.053949             0.076856
(100, 1024, 128)          argmin               0.048275             0.054046             0.076647             0.048694             0.054105             0.077004
(100, 1024, 128)          argmax               0.048267             0.054395             0.077401             0.048691             0.054131             0.076751
(100, 1024, 128)          var                  0.049710             0.043254             0.083077             0.050971             0.043251             0.082378
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(1000, 1000, 100)         mean                 0.202312             0.196723             0.197765             0.201774             0.196641             0.197459
(1000, 1000, 100)         sum                  0.202651             0.196682             0.197736             0.202175             0.196313             0.197523
(1000, 1000, 100)         min                  0.203022             0.264762             0.269200             0.202729             0.264129             0.268694
(1000, 1000, 100)         max                  0.202864             0.264396             0.269388             0.202486             0.263896             0.268720
(1000, 1000, 100)         argmin               0.226727             0.263781             0.268651             0.226597             0.264676             0.268983
(1000, 1000, 100)         argmax               0.226412             0.264469             0.269090             0.226570             0.264595             0.269178
(1000, 1000, 100)         var                  0.243223             0.204079             0.216096             0.241942             0.204079             0.215925
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(10000, 100)              mean                 0.016193             0.020277             0.014316             0.016152             0.020324             0.013712
(10000, 100)              sum                  0.016289             0.020237             0.014034             0.016168             0.020265             0.013708
(10000, 100)              min                  0.016046             0.030872             0.019609             0.016208             0.030867             0.018627
(10000, 100)              max                  0.016369             0.030835             0.019257             0.016218             0.030861             0.018209
(10000, 100)              argmin               0.017957             0.031171             0.019517             0.018050             0.031556             0.018077
(10000, 100)              argmax               0.017961             0.031658             0.019521             0.018060             0.031564             0.018087
(10000, 100)              var                  0.020393             0.035652             0.019339             0.020144             0.035987             0.019171
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(100000, 10)              mean                 0.015718             0.016576             0.016555             0.015999             0.016246             0.014869
(100000, 10)              sum                  0.015833             0.016247             0.016572             0.016007             0.016627             0.014872
(100000, 10)              min                  0.015888             0.020510             0.023920             0.015671             0.020821             0.021417
(100000, 10)              max                  0.015889             0.020479             0.023918             0.016077             0.020386             0.021421
(100000, 10)              argmin               0.018233             0.020863             0.023647             0.017574             0.020864             0.021103
(100000, 10)              argmax               0.017896             0.020527             0.023296             0.017569             0.020447             0.021098
(100000, 10)              var                  0.020005             0.024198             0.024372             0.020075             0.024167             0.022415
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(1023, 1023, 1023)        mean                 1.874816             1.963506             1.903909             1.873279             1.963859             1.903230
(1023, 1023, 1023)        sum                  1.875030             1.965716             1.902458             1.873566             1.960730             1.901642
(1023, 1023, 1023)        min                  1.878563             2.473455             2.179092             1.875174             2.482086             2.183027
(1023, 1023, 1023)        max                  1.879128             2.474803             2.178895             1.874831             2.482253             2.183884
(1023, 1023, 1023)        argmin               1.921800             2.476629             2.174831             1.923987             2.472641             2.170453
(1023, 1023, 1023)        argmax               1.922605             2.476688             2.177927             1.923366             2.472808             2.172979
(1023, 1023, 1023)        var                  1.972606             3.088695             2.758797             1.978679             3.095658             2.762243
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(1023, 1023, 255)         mean                 0.489984             0.500954             0.492957             0.489891             0.500654             0.491971
(1023, 1023, 255)         sum                  0.490228             0.500764             0.492289             0.489624             0.501089             0.492824
(1023, 1023, 255)         min                  0.491457             0.563560             0.553334             0.490355             0.564709             0.554754
(1023, 1023, 255)         max                  0.491396             0.563628             0.553345             0.490017             0.565004             0.554947
(1023, 1023, 255)         argmin               0.503666             0.561512             0.551831             0.503845             0.560972             0.551017
(1023, 1023, 255)         argmax               0.503602             0.561185             0.551407             0.504328             0.561267             0.551448
(1023, 1023, 255)         var                  0.510844             0.709452             0.701630             0.512693             0.710365             0.701965
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(1023, 1023, 377)         mean                 0.707439             0.727646             0.712019             0.706769             0.727101             0.711632
(1023, 1023, 377)         sum                  0.707780             0.727453             0.711554             0.706807             0.726656             0.711729
(1023, 1023, 377)         min                  0.709423             0.819809             0.794379             0.707847             0.822086             0.796664
(1023, 1023, 377)         max                  0.709297             0.819780             0.794308             0.707566             0.821913             0.796690
(1023, 1023, 377)         argmin               0.725028             0.817088             0.791695             0.726039             0.816445             0.790828
(1023, 1023, 377)         argmax               0.725301             0.817011             0.791420             0.726040             0.816917             0.791143
(1023, 1023, 377)         var                  0.740859             1.034165             1.006712             0.743413             1.035506             1.007638
```
Differential Revision: [D85022826](https://our.internmc.facebook.com/intern/diff/D85022826 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164790 
Approved by: https://github.com/ngimel , https://github.com/eqy  
						
						
					 
					
						2025-10-21 00:09:13 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						70592c6819 
					 
					
						
						
							
							[ROCm][CI] Move gfx1100 workflows to own yaml file ( #165699 )  
						
						
						
						
						This should allow us to move gfx1100 workflow to a lower frequency and also allow it to be triggered on PRs via a dedicated label, for any PRs that target Navi fixes such as [this](https://github.com/pytorch/pytorch/pull/165630 ) or [this](https://github.com/pytorch/pytorch/pull/165625 ).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165699 
Approved by: https://github.com/jeffdaily  
						
						
					 
					
						2025-10-20 23:52:48 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						259cb945f5 
					 
					
						
						
							
							[stage 2c] make autograd and inference functions ( #165668 )  
						
						
						
						
						Add final stage of aot_stage2_compile for autograd and inference.
Differential Revision: D84844699
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165668 
Approved by: https://github.com/zhxchen17 , https://github.com/tugsbayasgalan  
						
						
					 
					
						2025-10-20 23:50:31 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						e20c9bf288 
					 
					
						
						
							
							[torch/utils][Code Clean] Clean asserts in torch/utils/*.py ( #165410 )  
						
						
						
						
						Including:
- `torch/utils/*.py`
Fixes part of #164878 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165410 
Approved by: https://github.com/albanD  
						
						
					 
					
						2025-10-20 23:29:17 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						99c8640b5d 
					 
					
						
						
							
							[1/N] Change C-style casts to static_cast or reinterpret_cast ( #165750 )  
						
						
						
						
						This series of changes converts C-style casts into their C++ alternatives.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165750 
Approved by: https://github.com/Skylion007  
						
						
					 
					
						2025-10-20 23:27:13 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						96b0e7aaa6 
					 
					
						
						
							
							[Code Clean] Clean asserts in torch/ao/quantization/experimental/* and torch/ao/quantization/pt2e/* ( #165317 )  
						
						
						
						
						Replace assert statements with explicit if/raise patterns in:
- torch/ao/quantization/experimental/* (11 errors)
- torch/ao/quantization/pt2e/* (68 errors)
Partially fixes #164878 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165317 
Approved by: https://github.com/albanD  
						
						
					 
					
						2025-10-20 23:07:11 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						850ba8c96d 
					 
					
						
						
							
							[Code Clean] Clean asserts in torch/autograd. ( #165627 )  
						
						
						
						
						Replaces 78 assert statements across 10 files in torch.autograd with explicit if-checks raising AssertionError to prevent assertions from being disabled with Python -O flag. This ensures error checking remains active in optimized builds.
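The difference between the two forms can be sketched like this. The function names and the check itself are hypothetical and are not taken from `torch.autograd`.

```python
def checked_with_assert(n):
    # Stripped entirely when Python runs with -O, so the check vanishes.
    assert n >= 0, "n must be non-negative"
    return n


def checked_with_raise(n):
    # Ordinary control flow: still executed under -O.
    if not n >= 0:
        raise AssertionError("n must be non-negative")
    return n
```

Under a normal interpreter both functions raise `AssertionError` for a negative input. Under `python -O` only `checked_with_raise` still does, and it keeps the same exception type, so existing `except AssertionError` handlers continue to work.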
Partially fixes #164878 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165627 
Approved by: https://github.com/albanD  
						
						
					 
					
						2025-10-20 23:03:47 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						1bcd736f91 
					 
					
						
						
							
							fix bad merge duplicate pre pass ( #165917 )  
						
						
						
						
						Fix for https://github.com/pytorch/pytorch/issues/165624 : the pre-pass was being applied multiple times.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165917 
Approved by: https://github.com/bdhirsh  
						
						
					 
					
						2025-10-20 22:54:36 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						df64c0c464 
					 
					
						
						
							
							[Code Clean] Clean asserts in torch/ao/quantization (root, quantizer, backend_config) ( #165433 )  
						
						
						
						
						Replace assert statements with explicit if/raise patterns in:
- torch/ao/quantization/~
- torch/ao/quantization/quantizer/
- torch/ao/quantization/backend_config/
Partially fixes #164878 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165433 
Approved by: https://github.com/albanD  
						
						
					 
					
						2025-10-20 22:42:51 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						1891239a1d 
					 
					
						
						
							
							[Graph Partition] fix graph partition input signature for fallback kernels ( #165815 )  
						
						
						
						
						The scheduler relies on `node.last_usage` to free buffers. `last_usage` may contain a buffer that was allocated in a previous graph partition and is not directly accessed in the current graph partition.
## Example
```python
def f(x):
    y = x + 1
    z = torch.ops.aten.view.dtype(y, torch.float8_e4m3fn)
    z_cpu = z.cpu()
    u_cuda = z_cpu.cuda()
    return u_cuda
```
In the generated code, we have
```
def partition_0(args):
    ...
    # Topologically Sorted Source Nodes: [y, z], Original ATen: [aten.add, aten.view]
    buf1 = torch.ops.aten.view.dtype(buf0, torch.float8_e4m3fn) # < ------ buf1 is a view of buf0
    buf2 = buf1 # <------- buf2 is buf1
    assert_size_stride(buf2, (8, ), (1, ), 'torch.ops.aten.view.dtype')
    assert_alignment(buf2, 16, 'torch.ops.aten.view.dtype')
    return (buf2, )
def call(self, args):
    ...
    (buf2,) = self.partitions[0](partition0_args)
    ...
    buf3.copy_(buf2, False)
    del buf0
    del buf1
    del buf2  # <---- `del buf2` leads to `del buf0`. BUT `buf0` is not returned from partition_0.
    ...
```
Note: view is treated as a fallback kernel due to its special dtype.
de09bab4b6/torch/_inductor/lowering.py (L841-L843)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165815 
Approved by: https://github.com/eellison  
						
						
					 
					
						2025-10-20 22:23:29 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						cf280ca1e8 
					 
					
						
						
							
							Revert "[Inductor] Naive foreach autotune support ( #162053 )"  
						
						... 
						
						
						
						This reverts commit 779296a3fce5db0829377c792f13a8eafe537b30.
Reverted https://github.com/pytorch/pytorch/pull/162053  on behalf of https://github.com/pytorch-auto-revert  due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/162053#issuecomment-3423808492 )) 
						
						
					 
					
						2025-10-20 21:36:44 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						efc277cac7 
					 
					
						
						
							
							[annotation] add logging for debugging annotation ( #165797 )  
						
						... 
						
						
						
						Add logging for debugging annotation bugs. Log will show with `TORCH_LOGS="+annotation" `
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165797 
Approved by: https://github.com/ezyang , https://github.com/Skylion007 , https://github.com/SherlockNoMad  
						
						
					 
					
						2025-10-20 21:27:38 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						4f7f43253d 
					 
					
						
						
							
							Revert "[ROCm][CI] Update rocm.yml workflow to use 1 GPU ARC runners ( #165481 )"  
						
						... 
						
						
						
						This reverts commit 8700d68fef855850e2e0aa65056a77b8f80adbdb.
Reverted https://github.com/pytorch/pytorch/pull/165481  on behalf of https://github.com/malfet  due to Broke lint somehow, see 8f06a1308f/1 ([comment](https://github.com/pytorch/pytorch/pull/165481#issuecomment-3423642456 )) 
						
						
					 
					
						2025-10-20 20:39:56 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						779296a3fc 
					 
					
						
						
							
							[Inductor] Naive foreach autotune support ( #162053 )  
						
						... 
						
						
						
						Initial autotuning support for foreach kernels, giving a 4x improvement for some kernels in an internal workload. More improvements can surely be made here in the future. Removes num_warps from the definition to enable autotune support in the generated wrapper code.
Before:
triton_for_fused_18.kd 🔍  | 4.986 ms | 4.986 ms | 2.493 ms | 2 |
triton_for_fused_6.kd 🔍  | 0.098 ms | 0.098 ms | 0.049 ms | 2 |
triton_for_fused_7.kd 🔍  | 0.036 ms | 0.036 ms | 0.018 ms | 2 |
After:
triton_for_fused_18.kd 🔍  | 1.273 ms | 1.273 ms | 0.636 ms | 2 |
triton_for_fused_6.kd 🔍  | 0.044 ms | 0.044 ms | 0.022 ms | 2 |
triton_for_fused_7.kd 🔍  | 0.024 ms | 0.024 ms | 0.012 ms | 2 |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162053 
Approved by: https://github.com/mlazos , https://github.com/naromero77amd  
						
						
					 
					
						2025-10-20 20:39:04 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						8f06a1308f 
					 
					
						
						
							
							[MPS] slightly faster cholesky ( #165867 )  
						
						... 
						
						
						
						Slightly faster cholesky, removed one redundant simdgroup_multiply
<img width="721" height="593" alt="Screenshot 2025-10-19 at 22 00 19" src="https://github.com/user-attachments/assets/e3a9005b-9347-4e62-a24d-16ba5e28849a " />
Benchmarks were generated with the following script (measured on an M1 Pro):
```
import torch
import numpy as np
import time
import csv
matrix_sizes = [512, 1024, 2048, 4096]
batch_sizes = [1, 2, 4, 8, 16]
num_runs = 10
warmup_runs = 3
def create_spd_matrix(n, batch_size):
    torch.manual_seed(42)
    A = torch.randn(batch_size, n, n, dtype=torch.float32)
    return A @ A.transpose(-2, -1) + n * torch.eye(n).expand(batch_size, -1, -1)
def run_cholesky_mps(A):
    torch.mps.synchronize()
    start = time.perf_counter()
    b = torch.linalg.cholesky(A, upper=False)
    torch.mps.synchronize()
    end = time.perf_counter()
    return b, end - start
results = {
    'N': [],
    'batch_size': [],
    'mean_time': [],
    'std_time': []
}
for n in matrix_sizes:
    for batch_size in batch_sizes:
        print(f"\nBenchmarking N={n}, batch_size={batch_size}")
        try:
            A_cpu = create_spd_matrix(n, batch_size)
            A_mps = A_cpu.to("mps")
            for _ in range(warmup_runs):
                _, _ = run_cholesky_mps(A_mps)
            times = []
            for _ in range(num_runs):
                _, t = run_cholesky_mps(A_mps)
                times.append(t)
            mean_time = np.mean(times)
            std_time = np.std(times)
            results['N'].append(n)
            results['batch_size'].append(batch_size)
            results['mean_time'].append(mean_time)
            results['std_time'].append(std_time)
            print(f"Mean time: {mean_time:.4f}s ± {std_time:.4f}s")
        except RuntimeError as e:
            print(f"Error for N={n}, batch_size={batch_size}: {e}")
            continue
with open('cholesky_benchmark_times.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['N', 'batch_size', 'mean_time', 'std_time'])
    for i in range(len(results['N'])):
        writer.writerow([
            results['N'][i],
            results['batch_size'][i],
            results['mean_time'][i],
            results['std_time'][i]
        ])
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165867 
Approved by: https://github.com/malfet  
						
						
					 
					
						2025-10-20 18:56:17 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						240c13394e 
					 
					
						
						
							
							Revert "[inductor] require shape in TritonCSEVariable ( #162275 )"  
						
						... 
						
						
						
						This reverts commit 3af2f0c12accc6bd10ef2b76fb5c51aa0f6b73a3.
Reverted https://github.com/pytorch/pytorch/pull/162275  on behalf of https://github.com/clee2000  due to still failing due to the above D84932446 ([comment](https://github.com/pytorch/pytorch/pull/162275#issuecomment-3423153819 )) 
						
						
					 
					
						2025-10-20 17:55:54 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						150682ba7f 
					 
					
						
						
							
							Revert "Remove workaround to old CUDA bug ( #164354 )"  
						
						... 
						
						
						
						This reverts commit 26f38034332a99f2bdcc67ce1f4ba9403d420e52.
Reverted https://github.com/pytorch/pytorch/pull/164354  on behalf of https://github.com/facebook-github-bot  due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/164354#issuecomment-3423132083 )) 
						
						
					 
					
						2025-10-20 17:48:08 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						ca7360e996 
					 
					
						
						
							
							Revert "Move toString(ScalarType) and ScalarType ostream operator to headeronly ( #164405 )"  
						
						... 
						
						
						
						This reverts commit ca8bd5dbedb5b46f78026e0378b0f47500ddba38.
Reverted https://github.com/pytorch/pytorch/pull/164405  on behalf of https://github.com/facebook-github-bot  due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/164354#issuecomment-3423132083 )) 
						
						
					 
					
						2025-10-20 17:48:08 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						0bf604320f 
					 
					
						
						
							
							Revert "[dynamo][user_defined] Replace UserFunctionVariable with VariableTracker build ( #165706 )"  
						
						... 
						
						
						
						This reverts commit 1dc9a05d0323ee3c7a20945c62463959d40f1a51.
Reverted https://github.com/pytorch/pytorch/pull/165706  on behalf of https://github.com/clee2000  due to breaking internal tests D84961097 ([comment](https://github.com/pytorch/pytorch/pull/165706#issuecomment-3423059867 )) 
						
						
					 
					
						2025-10-20 17:28:58 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						9875e70da8 
					 
					
						
						
							
							Revert "[dynamo][misc] Replace UserFunctionVariable with VariableTracker build ( #165707 )"  
						
						... 
						
						
						
						This reverts commit 630520b346b8883db7821562e589ccde7d12687a.
Reverted https://github.com/pytorch/pytorch/pull/165707  on behalf of https://github.com/clee2000  due to breaking internal tests D84961097 ([comment](https://github.com/pytorch/pytorch/pull/165706#issuecomment-3423059867 )) 
						
						
					 
					
						2025-10-20 17:28:58 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						69a4bfe8bb 
					 
					
						
						
							
							Revert "Refactor out headeronly ArrayRef ( #164991 )"  
						
						... 
						
						
						
						This reverts commit 3806e9767b03d06edc317cb90a3a996abdf192a0.
Reverted https://github.com/pytorch/pytorch/pull/164991  on behalf of https://github.com/clee2000  due to breaking internal tests D84961075 ([comment](https://github.com/pytorch/pytorch/pull/164991#issuecomment-3423058017 )) 
						
						
					 
					
						2025-10-20 17:26:42 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						62a263b8d4 
					 
					
						
						
							
							Revert "Widen ops support to take in IntHOArrayRef vs only std::vec ( #165152 )"  
						
						... 
						
						
						
						This reverts commit e4454947e2c692db1a249591121f8583fefe7df1.
Reverted https://github.com/pytorch/pytorch/pull/165152  on behalf of https://github.com/clee2000  due to breaking internal tests D84961075 ([comment](https://github.com/pytorch/pytorch/pull/164991#issuecomment-3423058017 )) 
						
						
					 
					
						2025-10-20 17:26:42 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						0da1f911dc 
					 
					
						
						
							
							Revert "[Submodule] Bump FBGEMM to latest ( #165544 )"  
						
						... 
						
						
						
						This reverts commit 23417ae50f5d9bc02e988d916c103ff3a03c5903.
Reverted https://github.com/pytorch/pytorch/pull/165544  on behalf of https://github.com/clee2000  due to failing in internal D84996252, probably needs some sort of update to fbgemm internally? ([comment](https://github.com/pytorch/pytorch/pull/165544#issuecomment-3422993703 )) 
						
						
					 
					
						2025-10-20 17:06:07 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						8700d68fef 
					 
					
						
						
							
							[ROCm][CI] Update rocm.yml workflow to use 1 GPU ARC runners ( #165481 )  
						
						... 
						
						
						
						* Moving rocm.yml from using persistent non-ARC runners from the combined MI2xx (MI210 + MI250) cluster to the ARC runners from the MI250 cluster. This halves the number of nodes, but provides access to approximately 4 times the runners, since every 8-GPU MI250 node now provides 8 1-GPU runners. This should help with concurrent capacity and queueing on the MI2xx jobs.
Tested here successfully: https://github.com/pytorch/pytorch/actions/runs/18620814622/job/53092469720 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165481 
Approved by: https://github.com/jeffdaily , https://github.com/pruthvistony , https://github.com/albanD 
Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com > 
						
						
					 
					
						2025-10-20 16:06:37 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						ab82456c16 
					 
					
						
						
							
							Revert "[1/N] Change C-style casts to static_cast or reinterpret_cast ( #165750 )"  
						
						... 
						
						
						
						This reverts commit e1e8491b316df810388d9fa24f135cdba27ab40e.
Reverted https://github.com/pytorch/pytorch/pull/165750  on behalf of https://github.com/pytorch-auto-revert  due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/165750#issuecomment-3422413890 )) 
						
						
					 
					
						2025-10-20 14:51:58 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						b23f4687fd 
					 
					
						
						
							
							[Inductor][CuTeDSL] Move load_template up two directories ( #165868 )  
						
						... 
						
						
						
						Summary:
This is a reland of https://github.com/pytorch/pytorch/pull/165347 
Moves the function used to load CuTeDSL Jinja templates up one level out of the flex attention folder. This way it can be used for more general Inductor templates in the future.
Test Plan: test/inductor/test_flex_flash
Differential Revision: D85013024
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165868 
Approved by: https://github.com/jananisriram  
						
						
					 
					
						2025-10-20 12:14:38 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						2705937080 
					 
					
						
						
							
							[CI] Add rocm CI back to trunk for pre-submit/PR jobs ( #165674 )  
						
						... 
						
						
						
						Only adding single-GPU shards for now, to observe how current capacity handles it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165674 
Approved by: https://github.com/jeffdaily  
						
						
					 
					
						2025-10-20 12:14:06 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						c1eda348be 
					 
					
						
						
							
							[cuda] fix triu/tril int32 overflow for large matrices ( #164705 )  
						
						... 
						
						
						
						Fixes  #136611 
Cast blockIdx.x to int64_t before multiplication to prevent overflow when computing linear_idx for matrices larger than 2^31 elements.
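A rough illustration of the overflow, modeled in plain Python rather than CUDA. The helper names (`wrap_int32`, `linear_idx_32bit`, `linear_idx_64bit`) and the sample block and thread values are assumptions for this sketch, and two's-complement wraparound is used as a simplified model of what a 32-bit product does. The kernel's exact expression is not shown in this message:

```python
INT32_MIN = -2**31


def wrap_int32(value):
    """Reduce an integer to what a 32-bit two's-complement register would hold."""
    return (value - INT32_MIN) % 2**32 + INT32_MIN


def linear_idx_32bit(block_idx, block_dim, thread_idx):
    # The product is formed in 32 bits, so it can wrap before being widened.
    return wrap_int32(wrap_int32(block_idx * block_dim) + thread_idx)


def linear_idx_64bit(block_idx, block_dim, thread_idx):
    # Widening block_idx first keeps the whole computation in 64 bits.
    return block_idx * block_dim + thread_idx


# 5_000_000 * 512 exceeds 2**31, as it can for matrices with more than 2**31 elements.
print(linear_idx_32bit(5_000_000, 512, 7))
print(linear_idx_64bit(5_000_000, 512, 7))
```

The first call yields a negative index while the second yields the intended one, which is why the cast has to happen before the multiplication rather than after it.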
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164705 
Approved by: https://github.com/eqy , https://github.com/ngimel  
					
						2025-10-20 07:17:41 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						ba93d5636e 
					 
					
						
						
							
							[cuda] fix nll_loss2d backward bounds check with reduction=none ( #165247 )  
						
						... 
						
						
						
						Fixes  #49882 
Add missing bounds check in nll_loss2d backward kernel with reduction=none. Forward kernel already had CUDA_KERNEL_ASSERT for target bounds, now backward kernel matches.
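A toy, pure-Python model of why the backward pass needs the same check as the forward pass. It assumes a single image, no class weights, and nested lists in place of tensors. The function name is hypothetical, and raising `IndexError` stands in for the device-side `CUDA_KERNEL_ASSERT`:

```python
def nll_loss2d_backward_none(grad_output, target, n_classes, ignore_index=-100):
    """Backward of NLL loss with reduction=none for one image (toy version)."""
    height, width = len(target), len(target[0])
    grad_input = [[[0.0] * width for _ in range(height)] for _ in range(n_classes)]
    for h in range(height):
        for w in range(width):
            t = target[h][w]
            if t == ignore_index:
                continue
            # Bounds check: without it, an out-of-range target would index
            # outside grad_input (an out-of-bounds write in a real kernel).
            if not 0 <= t < n_classes:
                raise IndexError(f"target {t} is out of bounds for {n_classes} classes")
            # loss = -input[t], so only the target class receives a gradient.
            grad_input[t][h][w] = -grad_output[h][w]
    return grad_input
```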
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165247 
Approved by: https://github.com/ngimel  
					
						2025-10-20 06:25:11 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						722b2b86c9 
					 
					
						
						
							
							[dynamo] Remove duplicated guards ( #165806 )  
						
						... 
						
						
						
						These duplicates were found by looking at a tlparse of an internal job. A deeper audit is still needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165806 
Approved by: https://github.com/jansel  
						
						
					 
					
						2025-10-20 05:50:33 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						e1e8491b31 
					 
					
						
						
							
							[1/N] Change C-style casts to static_cast or reinterpret_cast ( #165750 )  
						
						... 
						
						
						
						This series of changes tries to convert C-style casts into their C++ alternatives.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165750 
Approved by: https://github.com/Skylion007  
						
						
					 
					
						2025-10-20 04:36:19 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						767199fd9b 
					 
					
						
						
							
							[flex_attention] replace sliced BlockMask noop with helpful error ( #164702 )  
						
						... 
						
						
						
						Fixes part of #163314 
After slicing BlockMask with `[]`, mask_mod was silently replaced with noop_mask. This caused silent incorrect results when users applied transformations to `sliced_mask.mask_mod`.
Replace noop with `_sliced_mask_mod_error` that raises RuntimeError with guidance to use `base_mask.mask_mod` instead.
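The idea can be sketched without torch. `ToyBlockMask`, `sliced_mask_mod_error`, and the error text below are hypothetical stand-ins, not the real `BlockMask` implementation. Only the `(b, h, q_idx, kv_idx)` call shape follows the usual `mask_mod` convention:

```python
def noop_mask(b, h, q_idx, kv_idx):
    # Old placeholder: accepts every position, hiding the loss of the real mask_mod.
    return True


def sliced_mask_mod_error(b, h, q_idx, kv_idx):
    # New placeholder: fails loudly instead of silently masking nothing.
    raise RuntimeError(
        "mask_mod is not usable on a sliced mask; use base_mask.mask_mod instead"
    )


class ToyBlockMask:
    """Hypothetical stand-in for BlockMask; only mask_mod handling is modeled."""

    def __init__(self, mask_mod, blocks):
        self.mask_mod = mask_mod
        self.blocks = blocks

    def __getitem__(self, index):
        # Slicing keeps the block data but no longer offers a usable mask_mod.
        return ToyBlockMask(sliced_mask_mod_error, self.blocks[index])


def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx


base_mask = ToyBlockMask(causal, blocks=[0, 1, 2, 3])
sliced_mask = base_mask[1:3]
```

With the no-op placeholder, a transformation built on `sliced_mask.mask_mod` would run and quietly attend to everything. With the error placeholder, the first call fails and points the user back to `base_mask.mask_mod`.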
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164702 
Approved by: https://github.com/drisspg , https://github.com/BoyuanFeng  
						
						
					 
					
						2025-10-20 03:46:16 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						602ace5eb4 
					 
					
						
						
							
							Revert "[ATen] Fix CUDA reduction warp shuffle order ( #164790 )"  
						
						... 
						
						
						
						This reverts commit 36371b8ec7a1baed255c18451b2c716386a54c95.
Reverted https://github.com/pytorch/pytorch/pull/164790  on behalf of https://github.com/clee2000  due to was reverted due to failing internal tests after merge D84992607 ([comment](https://github.com/pytorch/pytorch/pull/164790#issuecomment-3420373755 )) 
						
						
					 
					
						2025-10-20 03:06:52 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						47804ce467 
					 
					
						
						
							
							Revert "12/n : Remove fbandroid_compiler_flags ( #165558 )"  
						
						... 
						
						
						
						This reverts commit aead9270f56ebc7302c7f5fa7e5dff959f26608e.
Reverted https://github.com/pytorch/pytorch/pull/165558  on behalf of https://github.com/clee2000  due to Diff was actually reverted internally D84832629 ([comment](https://github.com/pytorch/pytorch/pull/165558#issuecomment-3420367955 )) 
						
						
					 
					
						2025-10-20 03:03:13 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						e8cb34dd52 
					 
					
						
						
							
							[Inductor] support masked vectorization for the tail_loop for fp8 datatype ( #163324 )  
						
						... 
						
						
						
						**Summary:**
Support masked vectorization for the tail_loop for fp8 datatype.
**Example:**
```
import torch
def fn(
    x,
    scale,
    zero_point,
    quant_min,
    quant_max,
    dtype,
):
    x = torch.ops.quantized_decomposed.dequantize_per_tensor(
        x,
        scale,
        zero_point,
        quant_min,
        quant_max,
        dtype,
    )
    x = torch.relu(x)
    x = torch.ops.quantized_decomposed.quantize_per_tensor(
        x, scale, zero_point, quant_min, quant_max, dtype
    )
    return x
quant_min = -128
quant_max = 127
dtype = torch.float8_e4m3fn
x = torch.clamp(torch.randn((1, 7, 7, 9), dtype=torch.float32) * 100, quant_min, quant_max).to(dtype)
zero_point = 100
scale = 0.01
with torch.no_grad():
    compiled_fn = torch.compile(fn)
    compiled_fn(x, scale, zero_point, quant_min, quant_max, dtype)
```
**Generated code:**
- Before
```
cpp_fused_dequantize_per_tensor_quantize_per_tensor_relu_0 = async_compile.cpp_pybinding(['const at::Float8_e4m3fn*', 'at::Float8_e4m3fn*'], r'''
#include <torch/csrc/inductor/cpp_prefix.h>
extern "C"  void  kernel(const at::Float8_e4m3fn* in_ptr0,
                       at::Float8_e4m3fn* out_ptr0)
{
    {
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(441L); x0+=static_cast<int64_t>(16L))
        {
            {
                if(C10_LIKELY(x0 >= static_cast<int64_t>(0) && x0 < static_cast<int64_t>(432L)))
                {
                    auto tmp0 = at::vec::Vectorized<at::Float8_e4m3fn>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16));
                    auto tmp1 = at::vec::convert<float>(tmp0);
                    auto tmp2 = static_cast<float>(100.0);
                    auto tmp3 = at::vec::Vectorized<float>(tmp2);
                    auto tmp4 = tmp1 - tmp3;
                    auto tmp5 = static_cast<float>(0.01);
                    auto tmp6 = at::vec::Vectorized<float>(tmp5);
                    auto tmp7 = tmp4 * tmp6;
                    auto tmp8 = (tmp7);
                    auto tmp9 = at::vec::clamp_min(tmp8, decltype(tmp8)(0));
                    auto tmp10 = tmp9 * tmp3;
                    auto tmp11 = tmp10.round();
                    auto tmp12 = tmp11 + tmp3;
                    auto tmp13 = static_cast<float>(-128.0);
                    auto tmp14 = at::vec::Vectorized<float>(tmp13);
                    auto tmp15 = at::vec::maximum(tmp12, tmp14);
                    auto tmp16 = static_cast<float>(127.0);
                    auto tmp17 = at::vec::Vectorized<float>(tmp16);
                    auto tmp18 = at::vec::minimum(tmp15, tmp17);
                    auto tmp19 = at::vec::convert<at::Float8_e4m3fn>(tmp18);
                    tmp19.store(out_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16));
                }
                if(C10_UNLIKELY(x0 >= static_cast<int64_t>(432L) && x0 < static_cast<int64_t>(441L)))
                {
                    for (int64_t x0_tail = static_cast<int64_t>(432L);x0_tail < static_cast<int64_t>(441L); x0_tail++)
                    {
                        auto tmp0 = in_ptr0[static_cast<int64_t>(x0_tail)];
                        auto tmp1 = c10::convert<float>(tmp0);
                        auto tmp2 = static_cast<float>(100.0);
                        auto tmp3 = float(tmp1 - tmp2);
                        auto tmp4 = static_cast<float>(0.01);
                        auto tmp5 = float(tmp3 * tmp4);
                        auto tmp6 = c10::convert<float>(tmp5);
                        auto tmp7 = std::max(tmp6, decltype(tmp6)(0));
                        auto tmp8 = float(tmp7 * tmp2);
                        auto tmp9 = std::nearbyint(tmp8);
                        auto tmp10 = float(tmp9 + tmp2);
                        auto tmp11 = static_cast<float>(-128.0);
                        auto tmp12 = max_propagate_nan(tmp10, tmp11);
                        auto tmp13 = static_cast<float>(127.0);
                        auto tmp14 = min_propagate_nan(tmp12, tmp13);
                        auto tmp15 = c10::convert<at::Float8_e4m3fn>(tmp14);
                        out_ptr0[static_cast<int64_t>(x0_tail)] = tmp15;
                    }
                }
            }
        }
    }
}
''')
async_compile.wait(globals())
del async_compile
class Runner:
    def __init__(self, partitions):
        self.partitions = partitions
    def recursively_apply_fns(self, fns):
        new_callables = []
        for fn, c in zip(fns, self.partitions):
            new_callables.append(fn(c))
        self.partitions = new_callables
    def call(self, args):
        arg0_1, = args
        args.clear()
        assert_size_stride(arg0_1, (1, 7, 7, 9), (441, 63, 9, 1))
        buf0 = empty_strided_cpu((1, 7, 7, 9), (441, 63, 9, 1), torch.float8_e4m3fn)
        # [Provenance debug handles] cpp_fused_dequantize_per_tensor_quantize_per_tensor_relu_0:1
        cpp_fused_dequantize_per_tensor_quantize_per_tensor_relu_0(arg0_1, buf0)
        del arg0_1
        return (buf0, )
```
- After
```
cpp_fused_dequantize_per_tensor_quantize_per_tensor_relu_0 = async_compile.cpp_pybinding(['const at::Float8_e4m3fn*', 'at::Float8_e4m3fn*'], r'''
#include <torch/csrc/inductor/cpp_prefix.h>
extern "C"  void  kernel(const at::Float8_e4m3fn* in_ptr0,
                       at::Float8_e4m3fn* out_ptr0)
{
    {
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(441L); x0+=static_cast<int64_t>(16L))
        {
            {
                if(C10_LIKELY(x0 >= static_cast<int64_t>(0) && x0 < static_cast<int64_t>(432L)))
                {
                    auto tmp0 = at::vec::Vectorized<at::Float8_e4m3fn>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16));
                    auto tmp1 = at::vec::convert<float>(tmp0);
                    auto tmp2 = static_cast<float>(100.0);
                    auto tmp3 = at::vec::Vectorized<float>(tmp2);
                    auto tmp4 = tmp1 - tmp3;
                    auto tmp5 = static_cast<float>(0.01);
                    auto tmp6 = at::vec::Vectorized<float>(tmp5);
                    auto tmp7 = tmp4 * tmp6;
                    auto tmp8 = (tmp7);
                    auto tmp9 = at::vec::clamp_min(tmp8, decltype(tmp8)(0));
                    auto tmp10 = tmp9 * tmp3;
                    auto tmp11 = tmp10.round();
                    auto tmp12 = tmp11 + tmp3;
                    auto tmp13 = static_cast<float>(-128.0);
                    auto tmp14 = at::vec::Vectorized<float>(tmp13);
                    auto tmp15 = at::vec::maximum(tmp12, tmp14);
                    auto tmp16 = static_cast<float>(127.0);
                    auto tmp17 = at::vec::Vectorized<float>(tmp16);
                    auto tmp18 = at::vec::minimum(tmp15, tmp17);
                    auto tmp19 = at::vec::convert<at::Float8_e4m3fn>(tmp18);
                    tmp19.store(out_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16));
                }
                if(C10_UNLIKELY(x0 >= static_cast<int64_t>(432L) && x0 < static_cast<int64_t>(441L)))
                {
                    auto tmp0 = at::vec::Vectorized<at::Float8_e4m3fn>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(9L));
                    auto tmp1 = at::vec::convert<float>(tmp0);
                    auto tmp2 = static_cast<float>(100.0);
                    auto tmp3 = at::vec::Vectorized<float>(tmp2);
                    auto tmp4 = tmp1 - tmp3;
                    auto tmp5 = static_cast<float>(0.01);
                    auto tmp6 = at::vec::Vectorized<float>(tmp5);
                    auto tmp7 = tmp4 * tmp6;
                    auto tmp8 = (tmp7);
                    auto tmp9 = at::vec::clamp_min(tmp8, decltype(tmp8)(0));
                    auto tmp10 = tmp9 * tmp3;
                    auto tmp11 = tmp10.round();
                    auto tmp12 = tmp11 + tmp3;
                    auto tmp13 = static_cast<float>(-128.0);
                    auto tmp14 = at::vec::Vectorized<float>(tmp13);
                    auto tmp15 = at::vec::maximum(tmp12, tmp14);
                    auto tmp16 = static_cast<float>(127.0);
                    auto tmp17 = at::vec::Vectorized<float>(tmp16);
                    auto tmp18 = at::vec::minimum(tmp15, tmp17);
                    auto tmp19 = at::vec::convert<at::Float8_e4m3fn>(tmp18);
                    tmp19.store(out_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(9L));
                }
            }
        }
    }
}
''')
async_compile.wait(globals())
del async_compile
class Runner:
    def __init__(self, partitions):
        self.partitions = partitions
    def recursively_apply_fns(self, fns):
        new_callables = []
        for fn, c in zip(fns, self.partitions):
            new_callables.append(fn(c))
        self.partitions = new_callables
    def call(self, args):
        arg0_1, = args
        args.clear()
        assert_size_stride(arg0_1, (1, 7, 7, 9), (441, 63, 9, 1))
        buf0 = empty_strided_cpu((1, 7, 7, 9), (441, 63, 9, 1), torch.float8_e4m3fn)
        # [Provenance debug handles] cpp_fused_dequantize_per_tensor_quantize_per_tensor_relu_0:1
        cpp_fused_dequantize_per_tensor_quantize_per_tensor_relu_0(arg0_1, buf0)
        del arg0_1
        return (buf0, )
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163324 
Approved by: https://github.com/Xia-Weiwen , https://github.com/mingfeima , https://github.com/jansel 
ghstack dependencies: #163316  
						
						
					 
					
						2025-10-20 01:56:00 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						e9d8973427 
					 
					
						
						
							
							[Inductor] support masked vectorization for the tail_loop for float64 datatype ( #163316 )  
						
						... 
						
						
						
						**Summary:**
Support masked vectorization for the tail_loop for float64 datatype.
**Example:**
```
import torch
def fn(x):
    return x * x
x = torch.randn((22, 22), dtype=torch.double)
with torch.no_grad():
    compiled_fn = torch.compile(fn)
    compiled_fn(x)
```
**Generated code:**
- Before
```
cpp_fused_mul_0 = async_compile.cpp_pybinding(['const double*', 'double*'], r'''
#include <torch/csrc/inductor/cpp_prefix.h>
extern "C"  void  kernel(const double* in_ptr0,
                       double* out_ptr0)
{
    {
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(484L); x0+=static_cast<int64_t>(16L))
        {
            {
                if(C10_LIKELY(x0 >= static_cast<int64_t>(0) && x0 < static_cast<int64_t>(480L)))
                {
                    auto tmp0 = at::vec::VectorizedN<double,2>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16));
                    auto tmp1 = tmp0 * tmp0;
                    tmp1.store(out_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16));
                }
                if(C10_UNLIKELY(x0 >= static_cast<int64_t>(480L) && x0 < static_cast<int64_t>(484L)))
                {
                    for (int64_t x0_tail = static_cast<int64_t>(480L);x0_tail < static_cast<int64_t>(484L); x0_tail++)
                    {
                        auto tmp0 = in_ptr0[static_cast<int64_t>(x0_tail)];
                        auto tmp1 = double(tmp0 * tmp0);
                        out_ptr0[static_cast<int64_t>(x0_tail)] = tmp1;
                    }
                }
            }
        }
    }
}
''')
async_compile.wait(globals())
del async_compile
class Runner:
    def __init__(self, partitions):
        self.partitions = partitions
    def recursively_apply_fns(self, fns):
        new_callables = []
        for fn, c in zip(fns, self.partitions):
            new_callables.append(fn(c))
        self.partitions = new_callables
    def call(self, args):
        arg0_1, = args
        args.clear()
        assert_size_stride(arg0_1, (22, 22), (22, 1))
        buf0 = empty_strided_cpu((22, 22), (22, 1), torch.float64)
        # [Provenance debug handles] cpp_fused_mul_0:1
        cpp_fused_mul_0(arg0_1, buf0)
        del arg0_1
        return (buf0, )
```
- After
```
cpp_fused_mul_0 = async_compile.cpp_pybinding(['const double*', 'double*'], r'''
#include <torch/csrc/inductor/cpp_prefix.h>
extern "C"  void  kernel(const double* in_ptr0,
                       double* out_ptr0)
{
    {
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(484L); x0+=static_cast<int64_t>(16L))
        {
            {
                if(C10_LIKELY(x0 >= static_cast<int64_t>(0) && x0 < static_cast<int64_t>(480L)))
                {
                    auto tmp0 = at::vec::VectorizedN<double,2>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16));
                    auto tmp1 = tmp0 * tmp0;
                    tmp1.store(out_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16));
                }
                if(C10_UNLIKELY(x0 >= static_cast<int64_t>(480L) && x0 < static_cast<int64_t>(484L)))
                {
                    auto tmp0 = at::vec::VectorizedN<double,2>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(4L));
                    auto tmp1 = tmp0 * tmp0;
                    tmp1.store(out_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(4L));
                }
            }
        }
    }
}
''')
async_compile.wait(globals())
del async_compile
class Runner:
    def __init__(self, partitions):
        self.partitions = partitions
    def recursively_apply_fns(self, fns):
        new_callables = []
        for fn, c in zip(fns, self.partitions):
            new_callables.append(fn(c))
        self.partitions = new_callables
    def call(self, args):
        arg0_1, = args
        args.clear()
        assert_size_stride(arg0_1, (22, 22), (22, 1))
        buf0 = empty_strided_cpu((22, 22), (22, 1), torch.float64)
        # [Provenance debug handles] cpp_fused_mul_0:1
        cpp_fused_mul_0(arg0_1, buf0)
        del arg0_1
        return (buf0, )
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163316 
Approved by: https://github.com/mingfeima , https://github.com/jansel  
						
						
					 
					
						2025-10-20 01:41:38 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						61d9a5180e 
					 
					
						
						
							
							[Fix XPU CI] [Inductor UT] Fix test cases broken by community.  ( #165714 )  
						
						... 
						
						
						
						Fixes  #165719 , Fixes  #165771 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165714 
Approved by: https://github.com/jansel  
					
						2025-10-19 23:59:04 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						8a8329b51f 
					 
					
						
						
							
							[ATen] Switch order of blocked reduce when vectorize loads ( #165178 )  
						
						... 
						
						
						
						Performance benchmarking, perf neutral:
```
================================================================================================================================================================================================================================================
Tensor Shape         Operation    Full reduce (ms)     Non-Contig dim (ms)    Contig dim (ms)      Full reduce (ms)     Non-Contig dim (ms)    Contig dim (ms)      Full diff %     Non-Contig diff %    Contig diff %
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(256, 256)           mean         0.015684             0.017056               0.008287             0.016015             0.016929               0.008170                      -2.07%               +0.75%          +1.43%
(256, 256)           sum          0.015774             0.016638               0.007926             0.015811             0.016935               0.008330                      -0.23%               -1.75%          -4.85%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(512, 512)           mean         0.013385             0.025742               0.008629             0.013046             0.026005               0.008924                      +2.60%               -1.01%          -3.31%
(512, 512)           sum          0.013390             0.026059               0.009116             0.013054             0.025696               0.008952                      +2.57%               +1.41%          +1.83%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(1024, 1024)         mean         0.014213             0.015467               0.010334             0.013862             0.015082               0.010318                      +2.53%               +2.55%          +0.16%
(1024, 1024)         sum          0.014179             0.015446               0.010774             0.014132             0.015073               0.010350                      +0.33%               +2.47%          +4.10%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(2048, 2048)         mean         0.018234             0.019487               0.014812             0.018482             0.019397               0.014802                      -1.34%               +0.46%          +0.07%
(2048, 2048)         sum          0.018202             0.019529               0.015195             0.018122             0.019485               0.015129                      +0.44%               +0.23%          +0.44%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(4096, 4096)         mean         0.033582             0.039378               0.030751             0.033810             0.039673               0.031019                      -0.67%               -0.74%          -0.86%
(4096, 4096)         sum          0.033604             0.039777               0.030809             0.033530             0.039386               0.031113                      +0.22%               +0.99%          -0.98%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(8192, 8192)         mean         0.085824             0.091133               0.084200             0.085431             0.091364               0.084303                      +0.46%               -0.25%          -0.12%
(8192, 8192)         sum          0.085763             0.091442               0.084180             0.085508             0.091419               0.084595                      +0.30%               +0.03%          -0.49%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(8192, 16384)        mean         0.146480             0.147666               0.138807             0.146515             0.147987               0.138930                      -0.02%               -0.22%          -0.09%
(8192, 16384)        sum          0.146446             0.147593               0.138559             0.146151             0.147982               0.139120                      +0.20%               -0.26%          -0.40%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(8192, 32768)        mean         0.266047             0.265386               0.253837             0.265648             0.265885               0.253652                      +0.15%               -0.19%          +0.07%
(8192, 32768)        sum          0.266093             0.265421               0.253890             0.265458             0.265591               0.253567                      +0.24%               -0.06%          +0.13%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(8192, 65536)        mean         0.498632             0.508976               0.481865             0.498237             0.508777               0.481476                      +0.08%               +0.04%          +0.08%
(8192, 65536)        sum          0.498917             0.508202               0.481883             0.498104             0.508016               0.481972                      +0.16%               +0.04%          -0.02%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(8192, 131072)       mean         0.957633             0.968519               0.938172             0.956766             0.968267               0.938196                      +0.09%               +0.03%          -0.00%
(8192, 131072)       sum          0.956972             0.968140               0.937741             0.957365             0.968404               0.938056                      -0.04%               -0.03%          -0.03%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(8192, 262144)       mean         1.906661             1.928377               1.861846             1.907327             1.928811               1.862083                      -0.03%               -0.02%          -0.01%
(8192, 262144)       sum          1.905976             1.928362               1.862399             1.907098             1.928844               1.861782                      -0.06%               -0.02%          +0.03%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(4096, 262144)       mean         0.956852             0.970101               0.936524             0.957263             0.969809               0.936965                      -0.04%               +0.03%          -0.05%
(4096, 262144)       sum          0.957117             0.969933               0.936247             0.956675             0.969451               0.936395                      +0.05%               +0.05%          -0.02%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(2048, 262144)       mean         0.498813             0.511299               0.483415             0.498567             0.511482               0.483376                      +0.05%               -0.04%          +0.01%
(2048, 262144)       sum          0.498813             0.510834               0.483641             0.498875             0.511036               0.483338                      -0.01%               -0.04%          +0.06%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(1024, 262144)       mean         0.266157             0.276751               0.255192             0.265966             0.276808               0.255544                      +0.07%               -0.02%          -0.14%
(1024, 262144)       sum          0.266133             0.276709               0.255528             0.265658             0.276685               0.255287                      +0.18%               +0.01%          +0.09%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(512, 131072)        mean         0.085941             0.081184               0.087931             0.085591             0.080832               0.088008                      +0.41%               +0.44%          -0.09%
(512, 131072)        sum          0.085962             0.081107               0.088045             0.085882             0.081160               0.088024                      +0.09%               -0.07%          +0.02%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(1000, 1000)         mean         0.014203             0.045859               0.010310             0.013885             0.046132               0.010621                      +2.29%               -0.59%          -2.93%
(1000, 1000)         sum          0.014180             0.046165               0.010756             0.013893             0.046109               0.010338                      +2.07%               +0.12%          +4.04%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(1024, 129)          mean         0.012953             0.016751               0.008536             0.012977             0.016714               0.008916                      -0.18%               +0.22%          -4.26%
(1024, 129)          sum          0.013356             0.016806               0.008722             0.013003             0.017071               0.008611                      +2.71%               -1.55%          +1.29%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(1024, 257)          mean         0.013075             0.016787               0.009102             0.013116             0.016769               0.008679                      -0.31%               +0.11%          +4.87%
(1024, 257)          sum          0.013092             0.016842               0.008786             0.013126             0.017128               0.008771                      -0.26%               -1.67%          +0.17%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(1024, 587)          mean         0.013662             0.017412               0.010055             0.013659             0.017019               0.010033                      +0.02%               +2.31%          +0.22%
(1024, 587)          sum          0.013636             0.017473               0.010163             0.013642             0.017363               0.010101                      -0.04%               +0.63%          +0.61%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(2048, 977)          mean         0.015276             0.027873               0.012531             0.015241             0.027783               0.012467                      +0.23%               +0.32%          +0.51%
(2048, 977)          sum          0.015345             0.027949               0.012192             0.015255             0.027839               0.012485                      +0.59%               +0.40%          -2.35%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(1024, 128)          mean         0.012806             0.014020               0.008291             0.013137             0.014309               0.007908                      -2.52%               -2.02%          +4.84%
(1024, 128)          sum          0.012769             0.014308               0.007924             0.012788             0.014236               0.008038                      -0.15%               +0.51%          -1.42%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(8192, 128)          mean         0.014145             0.023049               0.009143             0.014104             0.023298               0.009501                      +0.29%               -1.07%          -3.77%
(8192, 128)          sum          0.014132             0.023082               0.009638             0.014107             0.023331               0.009244                      +0.18%               -1.07%          +4.26%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(1024, 130)          mean         0.013420             0.025834               0.008949             0.013368             0.025724               0.008918                      +0.39%               +0.43%          +0.35%
(1024, 130)          sum          0.013300             0.025940               0.009113             0.013266             0.025419               0.008922                      +0.26%               +2.05%          +2.14%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(8192, 130)          mean         0.013993             0.017883               0.009661             0.014275             0.018220               0.009596                      -1.98%               -1.85%          +0.68%
(8192, 130)          sum          0.014026             0.018297               0.010066             0.014326             0.018257               0.009659                      -2.09%               +0.22%          +4.21%
================================================================================================================================================================================================================================================
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165178 
Approved by: https://github.com/ngimel 
ghstack dependencies: #165494 , #164790 , #165055  
						
						
					 
					
						2025-10-19 23:39:05 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						6b80c94901 
					 
					
						
						
							
							[FlexAttention] Fix dynamic shaped heads flex_flash check ( #165866 )  
						
						... 
						
						
						
						Pull Request resolved: https://github.com/pytorch/pytorch/pull/165866 
Approved by: https://github.com/BoyuanFeng 
ghstack dependencies: #165729  
						
						
					 
					
						2025-10-19 23:10:16 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						8951df03de 
					 
					
						
						
							
							test_scaled_matmul_cuda: fix infer_scale_swizzle ( #165788 )  
						
						... 
						
						
						
						Extend #165747  fix to other cases.
Add parentheses to clarify operator precedence.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165788 
Approved by: https://github.com/jeffdaily , https://github.com/slayton58  
						
						
					 
					
						2025-10-19 21:42:01 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						8139f33fa5 
					 
					
						
						
							
							[dynamo] Add recompile reason for set_stance fail_on_recompile ( #165445 )  
						
						... 
						
						
						
						Fixes  #163500 
### Summary:
Failures raised under `set_stance("fail_on_recompile")` now report the reason why the recompilation occurred.
### Impacts:
module: dynamo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165445 
Approved by: https://github.com/williamwen42  
					
						2025-10-19 21:12:19 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						a88587348b 
					 
					
						
						
							
							[dynamo] Clean up assert in dynamo [1/N] ( #165430 )  
						
						... 
						
						
						
						Partially fixes #162852  and #164878 . The two issues are related.
* __->__ #165430 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165430 
Approved by: https://github.com/Lucaskabela , https://github.com/williamwen42 
Co-authored-by: Lucas Kabela <lucasakabela@gmail.com > 
						
						
					 
					
						2025-10-19 21:00:05 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						633a3b7f67 
					 
					
						
						
							
							Revert "shrink_group implementation to expose ncclCommShrink API ( #164518 )"  
						
						... 
						
						
						
						This reverts commit fa0db212e717b6cb225159cb32ea3d83baa52381.
Reverted https://github.com/pytorch/pytorch/pull/164518  on behalf of https://github.com/pytorch-auto-revert  due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/164518#issuecomment-3419893217 )) 
						
						
					 
					
						2025-10-19 19:20:45 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						fa0db212e7 
					 
					
						
						
							
							shrink_group implementation to expose ncclCommShrink API ( #164518 )  
						
						... 
						
						
						
						Closes  #164529 
To expose the new [ncclCommShrink](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/comms.html#ncclcommshrink ) API to PyTorch.
This is useful when you need to exclude certain GPUs or nodes from a collective operation, for example in fault tolerance scenarios or when dynamically adjusting resource utilization.
For more info:  [Shrinking a communicator](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/communicators.html#shrinking-a-communicator )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164518 
Approved by: https://github.com/kwen2501  
					
						2025-10-19 18:00:08 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						15ff1cd28b 
					 
					
						
						
							
							Remove E721 suppression in flake8 ( #165855 )  
						
						... 
						
						
						
						Currently all files pass the E721 check.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165855 
Approved by: https://github.com/albanD  
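For context, E721 is the pycodestyle check that flags type comparisons written with `==`. The sketch below shows the pattern it reports and the two accepted alternatives. The function names are made up for illustration and are not taken from the PyTorch code base.

```python
def is_exact_int_flagged(value):
    # E721 reports this line: two types are compared with ==.
    return type(value) == int


def is_exact_int(value):
    # Accepted form for an exact type check: identity comparison.
    return type(value) is int


def is_int_like(value):
    # Accepted form when subclasses should match as well.
    return isinstance(value, int)


# bool is a subclass of int, so the exact and instance checks disagree on True.
print(is_exact_int(True), is_int_like(True))
```

Which replacement is appropriate depends on whether subclasses are meant to pass the check.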
						
						
					 
					
						2025-10-19 17:51:12 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						c73f5080de 
					 
					
						
						
							
							Migrating some more callsites ( #163580 )  
						
						... 
						
						
						
						Pull Request resolved: https://github.com/pytorch/pytorch/pull/163580 
Approved by: https://github.com/avikchaudhuri 
ghstack dependencies: #165582  
						
						
					 
					
						2025-10-19 15:52:17 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						22ae059d32 
					 
					
						
						
							
							AOTI util deprecated flow using the new tracer ( #165582 )  
						
						... 
						
						
						
						Reapply of https://github.com/pytorch/pytorch/pull/163260 
AOTI utils sometimes expect a free function, so this adjusts the export API to handle that case; no methods have been seen getting exported so far. Some AOTI flows also require dynamo_flat_name_to_original_fqn to be populated, which is done here the same way as in eval_frame.py. This also cleans up how export_root is removed and fixes some overcomplicated nn_module_stack handling in the export code. The logic is simpler now thanks to @anijain2305 .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165582 
Approved by: https://github.com/anijain2305  
						
						
					 
					
						2025-10-19 15:52:16 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						1b121d636e 
					 
					
						
						
							
							Fix AllocatorConfig parse roundup division bug ( #165304 )  
						
						... 
						
						
						
						* #165288 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165304 
Approved by: https://github.com/albanD 
ghstack dependencies: #165288 , #165289 , #165291 , #165298  
						
						
					 
					
						2025-10-19 15:34:44 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						1ba808dd97 
					 
					
						
						
							
							Refine CUDA BackendStaticInitializer for allocator select ( #165298 )  
						
						... 
						
						
						
						* #165288 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165298 
Approved by: https://github.com/albanD 
ghstack dependencies: #165288 , #165289 , #165291  
						
						
					 
					
						2025-10-19 15:34:44 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						b2f5c25b27 
					 
					
						
						
							
							Introduce a generic API torch._C._accelerator_setAllocatorSettings ( #165291 )  
						
						... 
						
						
						
						Pull Request resolved: https://github.com/pytorch/pytorch/pull/165291 
Approved by: https://github.com/albanD 
ghstack dependencies: #165288 , #165289  
						
						
					 
					
						2025-10-19 15:34:36 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						a1114beed2 
					 
					
						
						
							
							Deprecate overlapped functions in CUDAAllocatorConfig ( #165289 )  
						
						... 
						
						
						
						Pull Request resolved: https://github.com/pytorch/pytorch/pull/165289 
Approved by: https://github.com/albanD 
ghstack dependencies: #165288  
						
						
					 
					
						2025-10-19 15:34:26 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						4888ed440e 
					 
					
						
						
							
							Refine Allocator Config error message friendly ( #165288 )  
						
						... 
						
						
						
						* __->__ #165288 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165288 
Approved by: https://github.com/albanD  
						
						
					 
					
						2025-10-19 15:34:17 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						5d62b63a76 
					 
					
						
						
							
							[BE] Use Python-3.14 GE build ( #165804 )  
						
						... 
						
						
						
						3.14 reached general availability on Oct 7th 2025, so we can remove all pre-release workarounds
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165804 
Approved by: https://github.com/yangw-dev , https://github.com/Skylion007 , https://github.com/cyyever  
						
						
					 
					
						2025-10-19 11:45:10 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						57ba575242 
					 
					
						
						
							
							[BE][Ez]: Update torch.is_tensor documentation ( #165841 )  
						
						... 
						
						
						
						TypeIs propagates the isinstance check to the typing system. They are now equivalent.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165841 
Approved by: https://github.com/albanD  
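As a rough illustration of what a `TypeIs` return annotation conveys, the sketch below uses a hypothetical `StandInTensor` class instead of `torch.Tensor` so that it runs without PyTorch. The helper names are assumptions made for this example, and the annotation is written as a string so that nothing beyond the standard library is needed at runtime.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only type checkers need TypeIs (typing in 3.13+, typing_extensions otherwise).
    from typing_extensions import TypeIs


class StandInTensor:
    """Hypothetical placeholder for torch.Tensor."""

    def numel(self) -> int:
        return 0


def is_stand_in_tensor(obj: object) -> "TypeIs[StandInTensor]":
    # At runtime this is a plain isinstance check.
    return isinstance(obj, StandInTensor)


def element_count(obj: object) -> int:
    if is_stand_in_tensor(obj):
        # A type checker narrows obj to StandInTensor here,
        # the same way it would after a direct isinstance call.
        return obj.numel()
    return -1
```

With this annotation, a type checker treats a call to the predicate like the `isinstance` call it wraps, in both the true and the false branch.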
						
						
					 
					
						2025-10-19 09:24:11 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						ceb11a584d 
					 
					
						
						
							
							[BE]: Update kleidai submodule to v1.15.0 ( #165842 )  
						
						... 
						
						
						
						This mostly adds a few new kernels, fixes some IMA issues, and improves the performance of existing kernels. It also improves compiler support.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165842 
Approved by: https://github.com/albanD  
						
						
					 
					
						2025-10-19 08:25:03 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						33adb276fe 
					 
					
						
						
							
							[BE][Ez]: Update Eigen to 5.0.0. C++14 support and more! ( #165840 )  
						
						... 
						
						
						
						Update the Eigen pin to 5.0.0, which brings many new features and perf improvements. Most importantly, it raises the minimum from C++03 to C++14, enabling performance optimizations such as properly implemented move operators and simplified code. Vectorization is also improved, particularly on ARM. We really only use this library as a fallback for sparse operators, but it is still useful to update it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165840 
Approved by: https://github.com/albanD  
						
						
					 
					
						2025-10-19 08:00:06 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						e939651972 
					 
					
						
						
							
							[audio hash update] update the pinned audio hash ( #165807 )  
						
						... 
						
						
						
						This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml ).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165807 
Approved by: https://github.com/pytorchbot  
						
						
					 
					
						2025-10-19 04:45:20 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						3255e7872b 
					 
					
						
						
							
							Enable all flake8-logging-format rules ( #164655 )  
						
						... 
						
						
						
						These rules are enabled by removing existing suppressions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164655 
Approved by: https://github.com/janeyx99 , https://github.com/mlazos  
						
						
					 
					
						2025-10-19 00:59:28 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						c4f6619330 
					 
					
						
						
							
							Enable more DTensor tests in local tensor mode and fix more integration issues ( #165716 )  
						
						... 
						
						
						
						- During op dispatch, local tensor is supposed to collect the rng state from CPU and CUDA
devices so that it can be reset before the op executes for each rank, such that ops
with randomness produce the same result for all ranks (note that we are planning a
separate change to add support for per-rank rng state). Previously we relied on
op input arguments to deduce which devices to get the rng state from, which doesn't work
for factory functions such as torch.randn. Hence this change switches to unconditionally
collecting the rng state from all devices.
- Fix per-rank computations in _MaskedPartial and Shard placements discovered
during test enablement.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165716 
Approved by: https://github.com/ezyang  
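The reset-before-each-rank idea from the first bullet can be sketched with Python's standard `random` module standing in for the CPU and CUDA generators. This is only an illustration under that assumption, not the local tensor implementation, and `run_op_on_all_ranks` and `random_factory` are hypothetical names.

```python
import random


def run_op_on_all_ranks(op, num_ranks):
    """Run `op` once per simulated rank, restoring the same RNG state each time."""
    # Capture the state unconditionally, without inspecting op's arguments,
    # so that argument-free factory-style ops are covered as well.
    saved_state = random.getstate()
    results = []
    for _rank in range(num_ranks):
        random.setstate(saved_state)
        results.append(op())
    return results


def random_factory():
    # Stand-in for a factory function such as torch.randn: it takes no inputs
    # from which a device could be deduced.
    return [random.random() for _ in range(3)]


per_rank = run_op_on_all_ranks(random_factory, num_ranks=4)
```

Because the state is restored before every call, each simulated rank draws the same values.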
						
						
					 
					
						2025-10-18 23:33:24 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						f18041cca8 
					 
					
						
						
							
							Fix missing closing quote  in __init__.py documentation ( #165827 )  
						
						... 
						
						
						
						Title says it all.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165827 
Approved by: https://github.com/Skylion007  
						
						
					 
					
						2025-10-18 22:09:18 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						35e51893bd 
					 
					
						
						
							
							Remove CUDA 11 workarounds for CUB_SUPPORTS_SCAN_BY_KEY and CUB_SUPPORTS_UNIQUE_BY_KEY ( #164637 )  
						
						... 
						
						
						
						`CUB_SUPPORTS_SCAN_BY_KEY` and `CUB_SUPPORTS_UNIQUE_BY_KEY` have been true since CUDA 12. This PR removes the old branches and source files.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164637 
Approved by: https://github.com/ezyang  
						
						
					 
					
						2025-10-18 20:05:54 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						1f43d17ce6 
					 
					
						
						
							
							Fix self assignment ( #165816 )  
						
						... 
						
						
						
						This PR removes assignments of the form `var=var`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165816 
Approved by: https://github.com/jansel  
						
						
					 
					
						2025-10-18 18:51:52 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						032bed95cd 
					 
					
						
						
							
							Various C++ code fixes in LSAN integration ( #165818 )  
						
						... 
						
						
						
						This PR extracts the C++ code fixes from #154584 that are needed to enable LSAN.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165818 
Approved by: https://github.com/ezyang  
						
						
					 
					
						2025-10-18 17:59:23 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						d14cbb4476 
					 
					
						
						
							
							Add NVFP4 two-level scaling to scaled_mm ( #165774 )  
						
						... 
						
						
						
						Summary:
* Add second-level scaling dispatch to scaled_mm, tying into optional `alpha` passing
* Add two-level tests
Test Plan:
```
pytest -svv -k "nvfp4_global_scale" test/test_scaled_matmul_cuda.py
```
Signed-off-by: Simon Layton <simonlayton@meta.com >
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165774 
Approved by: https://github.com/drisspg  
						
						
					 
					
						2025-10-18 13:06:04 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						f510d0dbc0 
					 
					
						
						
							
							Clarifying input output angle unit in the docs for trigonometric fun… ( #161248 )  
						
						... 
						
						
						
						…ctions
Fixes #[160995](https://github.com/pytorch/pytorch/issues/160995 )
Modified the docs to clarify that input tensor values for torch.sin, torch.cos and torch.tan should be in radians, and that the output tensor values for torch.acos, torch.asin and torch.atan are in radians.
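A small sketch of the radians convention, using Python's standard `math` module as a stand-in (an assumption made so the snippet runs without PyTorch; the torch functions named above follow the same convention according to the updated docs):

```python
import math

angle_deg = 90.0
# Forward functions expect radians, so convert degrees first.
angle_rad = math.radians(angle_deg)
sine = math.sin(angle_rad)
print(sine)  # approximately 1.0

# Inverse functions return radians; convert back if degrees are wanted.
recovered_deg = math.degrees(math.asin(sine))
print(recovered_deg)  # approximately 90.0
```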
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161248 
Approved by: https://github.com/isuruf 
Co-authored-by: Isuru Fernando <isuruf@gmail.com > 
						
						
					 
					
						2025-10-18 11:53:48 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						beb6b62e8c 
					 
					
						
						
							
							Revert "Enable more DTensor tests in local tensor mode and fix more integration issues ( #165716 )"  
						
						... 
						
						
						
						This reverts commit 1b397420f22b22f90a1093233ecd9167656e50cb.
Reverted https://github.com/pytorch/pytorch/pull/165716  on behalf of https://github.com/pytorch-auto-revert  due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/165716#issuecomment-3418083391 )) 
						
						
					 
					
						2025-10-18 09:15:49 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						4740ce7787 
					 
					
						
						
							
							[CP] Fix load balancer incorrectly assuming batch dimension exists ( #165792 )  
						
						... 
						
						
						
						https://github.com/pytorch/pytorch/pull/163617  removed the if/else statement that checked whether the input buffers have the batch dimension.
This PR fixes the issue and also adds a test.
In the future, we should explicitly ask users to unsqueeze the batch dimension. This breaks BC with the existing contract, but implicitly inferring whether the batch dimension exists is not safe.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165792 
Approved by: https://github.com/XilunWu  
					
						2025-10-18 09:11:16 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						ad67170c8b 
					 
					
						
						
							
							[MPS] sparse matmuls ( #165232 )  
						
						... 
						
						
						
						Implements matmuls for sparse tensors. With this commit most of the core sparse operations should be implemented. Fixes:
https://github.com/pytorch/pytorch/issues/156540 
https://github.com/pytorch/pytorch/issues/129842 
Should be merged after:
https://github.com/pytorch/pytorch/pull/165102 
To compare MPS and CPU, you can use this script:
```python
import torch
import time
import matplotlib.pyplot as plt
B, I, J, K = 8, 20000, 20000, 20000
num_iterations = 500
nnz_values = [10, 50, 100, 200, 500, 1000, 2000, 5000, 10000, 20000, 100000]
speedups = []
for nnz in nnz_values:
    indices = torch.stack([
        torch.randint(0, B, (nnz,)),
        torch.randint(0, I, (nnz,)),
        torch.randint(0, J, (nnz,)),
    ])
    values = torch.rand(nnz)
    sparse = torch.sparse_coo_tensor(indices, values, size=(B, I, J), device="mps").coalesce()
    dense = torch.randn(B, J, 200, device="mps")
    t1 = time.time()
    for _ in range(num_iterations):
        result = torch.bmm(sparse, dense)
    torch.mps.synchronize()
    t2 = time.time()
    mps_time = (t2 - t1) / num_iterations
    sparse_cpu = sparse.cpu()
    dense_cpu = dense.cpu()
    t1 = time.time()
    for _ in range(num_iterations):
        result_cpu = torch.bmm(sparse_cpu, dense_cpu)
    t2 = time.time()
    cpu_time = (t2 - t1) / num_iterations
    speedup = cpu_time / mps_time
    speedups.append(speedup)
    print(f"nnz={nnz}: MPS={mps_time:.6f}s, CPU={cpu_time:.6f}s, Speedup={speedup:.2f}x")
plt.figure(figsize=(10, 6))
plt.plot(nnz_values, speedups, marker='o', linewidth=2, markersize=8)
plt.xlabel('Number of Non-Zero Elements (nnz)', fontsize=12)
plt.ylabel('Speedup (CPU time / MPS time)', fontsize=12)
plt.title('MPS vs CPU Speedup for Sparse-Dense BMM', fontsize=14)
plt.grid(True, alpha=0.3)
plt.axhline(y=1, color='r', linestyle='--', alpha=0.5)
plt.xscale('log')
plt.tight_layout()
plt.show()
```
## Tested on M1 Pro
<img width="1000" height="600" alt="Figure_1" src="https://github.com/user-attachments/assets/4a2402ec-3dc4-402d-8196-a0426906ca3d " />
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165232 
Approved by: https://github.com/malfet  
						
						
					 
					
						2025-10-18 09:04:42 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						fdab48a7c1 
					 
					
						
						
							
							Enable all PIE rules on ruff ( #165814 )  
						
						... 
						
						
						
						This PR enables all PIE rules on ruff. Some rules from this family are already enabled; the newly added rules are:
```
PIE796  Enum contains duplicate value: {value}
PIE808  Unnecessary start argument in range
```
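A minimal sketch of the patterns these two rules flag, written as plain Python so the underlying behavior is visible (the `Color` enum is a made-up example):

```python
from enum import Enum


# PIE796 would flag this: two members share a value, so BLUE silently
# becomes an alias of RED instead of a distinct member.
class Color(Enum):
    RED = 1
    BLUE = 1


print(Color.BLUE is Color.RED)  # True: BLUE is only an alias
print(len(Color))               # 1: only one real member exists

# PIE808 would flag range(0, n): the start argument 0 is already the default.
print(list(range(0, 3)) == list(range(3)))  # True
```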
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165814 
Approved by: https://github.com/ezyang  
						
						
					 
					
						2025-10-18 07:36:18 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						a0948d4d23 
					 
					
						
						
							
							[ROCm][inductor] autotune support for persistent reduction kernels ( #163908 )  
						
						... 
						
						
						
						After the removal of want_no_x_dim for persistent reduction kernels, we can improve the autotuning setup for persistent reduction kernels.
Currently, even with tuning enabled, filtering will only try a single config in many cases. Avoid filtering in autotune mode, and override the MAX_BLOCK limit. Also, we always include tiny_config when autotuning is enabled.
Contributions from several members of the AMD Inductor and Triton teams: @jataylo @iupaikov-amd @AmdSampsa @xiaohuguo2023
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163908 
Approved by: https://github.com/jansel , https://github.com/PaulZhang12  
						
						
					 
					
						2025-10-18 07:33:24 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						0bbdd6b8db 
					 
					
						
						
							
							[ROCm][inductor] heuristic improvements for pointwise kernels ( #163197 )  
						
						... 
						
						
						
						Heuristic improvements for pointwise kernels for MI350.
Contributions from several members of the AMD Inductor and Triton teams:
@jataylo @AmdSampsa @iupaikov-amd @xiaohuguo2023
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163197 
Approved by: https://github.com/PaulZhang12 , https://github.com/eellison , https://github.com/jansel 
Co-authored-by: AmdSampsa <sampsa.riikonen@amd.com >
Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com > 
						
						
					 
					
						2025-10-18 07:23:41 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						24520b8386 
					 
					
						
						
							
							Revert "Enable all PIE rules on ruff ( #165814 )"  
						
						... 
						
						
						
						This reverts commit c79dfdc6550e872783aa5cb5fc9e86589bf18872.
Reverted https://github.com/pytorch/pytorch/pull/165814  on behalf of https://github.com/cyyever  due to Need to cover more files ([comment](https://github.com/pytorch/pytorch/pull/165814#issuecomment-3417931863 )) 
						
						
					 
					
						2025-10-18 07:21:08 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						c79dfdc655 
					 
					
						
						
							
							Enable all PIE rules on ruff ( #165814 )  
						
						... 
						
						
						
						This PR enables all PIE rules on ruff. Some rules from this family are already enabled; the newly added rules are:
```
PIE796  Enum contains duplicate value: {value}
PIE808  Unnecessary start argument in range
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165814 
Approved by: https://github.com/ezyang  
						
						
					 
					
						2025-10-18 06:40:12 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						e595136187 
					 
					
						
						
							
							Enable PLC1802 on ruff ( #165813 )  
						
						... 
						
						
						
						This PR enables ruff check `PLC1802`, which detects len calls on sequences in a boolean test context.
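A short sketch of the pattern the check targets and its usual rewrite (the function `describe` is a hypothetical example):

```python
def describe(items):
    # PLC1802 would flag `if len(items):` here.
    # For built-in sequences, truthiness already means "non-empty".
    if items:
        return f"{len(items)} item(s)"
    return "empty"


print(describe([1, 2, 3]))
print(describe([]))
```

The rewrite is only equivalent for objects whose truthiness reflects emptiness, such as lists, tuples and strings. Objects like multi-element tensors define truthiness differently, so such call sites need a closer look.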
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165813 
Approved by: https://github.com/ezyang  
						
						
					 
					
						2025-10-18 05:44:14 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						aaac8cb0f5 
					 
					
						
						
							
							[1/N] Add strict parameter to Python zip calls  ( #165531 )  
						
						... 
						
						
						
						Add `strict=True/False` to zip calls in test utils. `strict=True` is passed when possible.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165531 
Approved by: https://github.com/Skylion007  
						
						
					 
					
						2025-10-18 05:26:33 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						0f0b4bf029 
					 
					
						
						
							
							[1/N] Remove unused header inclusion ( #165763 )  
						
						... 
						
						
						
						This PR removes unused header inclusion in C++ files.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165763 
Approved by: https://github.com/Skylion007  
						
						
					 
					
						2025-10-18 05:23:11 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						b8194268a6 
					 
					
						
						
							
							Remove unnecessary noqa suppressions  ( #164106 )  
						
						... 
						
						
						
						This PR removes unused `noqa` suppressions in Python code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164106 
Approved by: https://github.com/albanD  
						
						
					 
					
						2025-10-18 04:52:41 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						f02e3947f6 
					 
					
						
						
							
							Expand type checking to mypy strict files ( #165697 )  
						
						... 
						
						
						
						Expands Pyrefly type checking to cover the files listed in the mypy-strict.ini configuration file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165697 
Approved by: https://github.com/ezyang  
						
						
					 
					
						2025-10-18 04:34:45 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						9095a9dfae 
					 
					
						
						
							
							[CD] Apply the fix from  #162455  to aarch64+cu129 build ( #165794 )  
						
						... 
						
						
						
						When trying to bring cu129 back in https://github.com/pytorch/pytorch/pull/163029 , I mainly looked at https://github.com/pytorch/pytorch/pull/163029  and missed another tweak coming from https://github.com/pytorch/pytorch/pull/162455 
I discovered this issue when testing aarch64+cu129 builds in https://github.com/pytorch/test-infra/actions/runs/18603342105/job/53046883322?pr=7373 .  Surprisingly, there is no test running for the aarch64 CUDA build from what I see in 79a37055e7
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165794 
Approved by: https://github.com/malfet  
						
						
					 
					
						2025-10-18 04:16:24 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						d9f94e0d7d 
					 
					
						
						
							
							[dynamo] Support fx.traceback.annotate as decorator ( #165805 )  
						
						... 
						
						
						
						Pull Request resolved: https://github.com/pytorch/pytorch/pull/165805 
Approved by: https://github.com/Lucaskabela , https://github.com/SherlockNoMad , https://github.com/yushangdi  
						
						
					 
					
						2025-10-18 03:58:11 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						23417ae50f 
					 
					
						
						
							
							[Submodule] Bump FBGEMM to latest ( #165544 )  
						
						... 
						
						
						
						Summary:
* FBGEMM submodule updated to main
* CMake updated to reflect necessary changes
* Notably pulls in NVFP4 grouped gemm kernels
Signed-off-by: Simon Layton <simonlayton@meta.com >
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165544 
Approved by: https://github.com/cyyever , https://github.com/jeffdaily  
						
						
					 
					
						2025-10-18 03:58:08 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						e4d6c56ffb 
					 
					
						
						
							
							Improve dynamo graph capture stack trace for custom ops ( #165693 )  
						
						... 
						
						
						
						For a custom op
```
import torch

@torch.library.custom_op("my_lib::foo", mutates_args={})
def foo(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    return x + y
```
Users can call `torch.ops.my_lib.foo()` or call `foo()` directly in the `forward` of an `nn.Module`.
These two calling conventions will lead to the same node in the output graph, but different stack traces.
When directly calling `foo()`, the displayed stack_trace in the graph will be
```
# File: .../pytorch/torch/_library/custom_ops.py:687 in __call__, code: return self._opoverload(*args, **kwargs)
```
This is not useful so we filter it out.
```
python test/functorch/test_aot_joint_with_descriptors.py -k test_custom_op_stack_trace
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165693 
Approved by: https://github.com/SherlockNoMad , https://github.com/williamwen42  
						
						
					 
					
						2025-10-18 03:48:18 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						017d2985f3 
					 
					
						
						
							
							set unbacked bindings in reinplace pass for newly created nodes during generalize_scatter decomp ( #164948 )  
						
						... 
						
						
						
						Two fixes:
1. In the reinplace pass, set unbacked bindings for newly created nodes.
2. In Inductor, ComputeBuffer used to miss some used symbols; this is now fixed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164948 
Approved by: https://github.com/bobrenjc93 
ghstack dependencies: #164341  
						
						
					 
					
						2025-10-18 03:20:30 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						c6a8db0b9a 
					 
					
						
						
							
							Fix issues with generalized_scatter  and setitem allocated unbacked symbols. ( #164341 )  
						
						... 
						
						
						
						Three fixes:
1. When doing t[u0] += 1 where u0 is unbacked, we could allocate a new unbacked symbol during the indexing of t[u0] (when we fake trace setitem), because meta_select allocates a new unbacked symbol for the storage offset when we do not know whether u0 >= 0 or u0 < 0. However, the output size/stride of setitem() does not depend on that new symbol. It is consumed within setitem itself, so we ignore it.
2. Also, when we trace through generalized_scatter, applying the views could allocate unbacked symints, but those do not affect the final output, so we ignore them as well.
3. Before accessing strides in lowering, we materialize first.
Addresses https://github.com/pytorch/pytorch/issues/114293  and https://github.com/pytorch/pytorch/issues/131911 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164341 
Approved by: https://github.com/bobrenjc93  
						
						
					 
					
						2025-10-18 03:20:30 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						de09bab4b6 
					 
					
						
						
							
							[BE]: Update cudnn frontend submodule to 1.15.0 ( #165776 )  
						
						... 
						
						
						
						Update cudnn frontend submodule to 1.15.0
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165776 
Approved by: https://github.com/eqy  
						
						
					 
					
						2025-10-18 02:23:27 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						c137e222d4 
					 
					
						
						
							
							.venv/ in .gitignore  ( #165418 )  
						
						... 
						
						
						
						`uv venv` creates venv in `.venv/` directory. So, it's useful to have `.venv/` in `.gitignore`, since perhaps more people are using `uv` in their work. As per comment 3592f5f4e5 (diff-bc37d034bad564583790a46f19d807abfe519c5671395fd494d8cce506c42947)https://docs.astral.sh/uv/pip/environments/#using-arbitrary-python-environments 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165418 
Approved by: https://github.com/ezyang  
						
						
					 
					
						2025-10-18 02:00:52 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						cf3a787bbc 
					 
					
						
						
							
							[annotate] Annotate bw nodes before eliminate dead code ( #165782 )  
						
						... 
						
						
						
						Fixes https://github.com/pytorch/torchtitan/pull/1907 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165782 
Approved by: https://github.com/SherlockNoMad  
						
						
					 
					
						2025-10-18 01:54:31 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						de3da77cf7 
					 
					
						
						
							
							Thread deterministic config vars to subproc compilation ( #165729 )  
						
						... 
						
						
						
						# Summary
TIL (AFTER WAYYYY TOO MUCH INSANITY), that we do not serialize the full set of configs for the subproc compilation.
I found this while working on Flex-attention determinism: https://github.com/meta-pytorch/attention-gym/pull/168 
might be good to audit if we need to thread through any more
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165729 
Approved by: https://github.com/shunting314 , https://github.com/eellison  
						
						
					 
					
						2025-10-18 01:25:50 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						543ddbf44c 
					 
					
						
						
							
							[ONNX] Support renaming in dynamic axes to shapes conversion ( #165769 )  
						
						... 
						
						
						
						Discovered in #165748
This PR also deprecates the conversion. The ONNX exporter team does not intend to maintain the conversion in the long term.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165769 
Approved by: https://github.com/justinchuby  
						
						
					 
					
						2025-10-18 01:11:20 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						e9f4999985 
					 
					
						
						
							
							[Code Clean] Replace std::runtime_error with TORCH_CHECK ( #165305 )  
						
						... 
						
						
						
						Fixes part of #148114 
Including:
- torch/csrc/distributed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165305 
Approved by: https://github.com/FFFrog , https://github.com/albanD  
						
						
					 
					
						2025-10-18 01:08:44 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						29b029648e 
					 
					
						
						
							
							Fixed issue with GradTrackingTensor not properly propagating sparse layout ( #165765 )  
						
						... 
						
						
						
						Fixes  #164286 
Fixed issue with GradTrackingTensor not properly propagating sparse layout.
@ezyang @jcaip
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165765 
Approved by: https://github.com/ezyang  
					
						2025-10-18 01:00:53 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						a25a649e70 
					 
					
						
						
							
							[Mem Snapshot] Add Metadata Field ( #165490 )  
						
						... 
						
						
						
						Summary:
The implementation adds the ability to:
- Set custom metadata strings that will be attached to all subsequent allocations
- Clear or change the metadata at any point
- View the metadata in memory snapshots via _dump_snapshot()
Test Plan: Added test in test_cuda.py and check manually in snapshot to see that metadata was added.
Differential Revision: D84654933
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165490 
Approved by: https://github.com/yushangdi  
						
						
					 
					
						2025-10-17 23:46:02 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						69c33898fa 
					 
					
						
						
							
							Revert "[Inductor][CuTeDSL] Move load_template up two directories ( #165347 ) ( #165576 )"  
						
						... 
						
						
						
						This reverts commit febb60323018948b2b9d2cff35b3cc4e0d0c55c8.
Reverted https://github.com/pytorch/pytorch/pull/165576  on behalf of https://github.com/seemethere  due to This was actually reverted internally, current PR is linked to a stale diff so diff train tools think that this is landed via co-dev when it was actually reverted ([comment](https://github.com/pytorch/pytorch/pull/165576#issuecomment-3417510146 )) 
						
						
					 
					
						2025-10-17 23:33:17 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						1b397420f2 
					 
					
						
						
							
							Enable more DTensor tests in local tensor mode and fix more integration issues ( #165716 )  
						
						... 
						
						
						
- During op dispatch, local tensor is supposed to collect RNG state from CPU and CUDA
devices so that it can be reset before executing the op for each rank, so that ops
with randomness produce the same result on all ranks (note that we are planning a
separate change to add support for per-rank RNG state). Previously we relied on
op input arguments to deduce which devices to get RNG state from, which doesn't work
for factory functions such as torch.randn. Hence this change switches to unconditionally
collecting RNG state from all devices.
- Fixing per rank specific computations in _MaskedPartial and Shard placements discovered
during test enablement.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165716 
Approved by: https://github.com/ezyang  
						
						
					 
					
						2025-10-17 23:28:22 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						fe80f03726 
					 
					
						
						
							
							Add B200 files to labeler and update codeowners ( #165767 )  
						
						... 
						
						
						
						Pull Request resolved: https://github.com/pytorch/pytorch/pull/165767 
Approved by: https://github.com/slayton58  
						
						
					 
					
						2025-10-17 23:24:17 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						e50dc40d28 
					 
					
						
						
							
							Revert "Update gm.print_readable to include Annotation ( #165397 )"  
						
						... 
						
						
						
						This reverts commit 7a657700131f31577544e93587eb339618677e97.
Reverted https://github.com/pytorch/pytorch/pull/165397  on behalf of https://github.com/malfet  due to I don't know how/why, but it breaks windows tests, see 2e22b1a61e/1https://github.com/pytorch/pytorch/pull/165397#issuecomment-3417428128 )) 
						
						
					 
					
						2025-10-17 22:35:50 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						2e22b1a61e 
					 
					
						
						
							
							[pytorch] Composite backend potential fix for is_backend_available ( #165061 )  
						
						... 
						
						
						
						Summary: `is_backend_available` takes a string and expects it to be only a backend name; if it is given a composite (device:backend) string, it fails.
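One way to handle composite strings is to strip the optional device prefix before checking each backend name. The sketch below is an assumption about the general approach, not the code in this PR: the helper `extract_backend_names` is hypothetical, and the comma-separated form `cpu:gloo,cuda:nccl` is assumed as the composite format.

```python
def extract_backend_names(backend_str):
    """Return backend names from 'gloo' or 'cpu:gloo,cuda:nccl' style strings."""
    names = []
    for part in backend_str.split(","):
        # partition() yields an empty separator when no 'device:' prefix exists.
        head, sep, tail = part.strip().partition(":")
        names.append(tail if sep else head)
    return [name.lower() for name in names]


print(extract_backend_names("cpu:gloo,cuda:nccl"))
print(extract_backend_names("NCCL"))
```

Each extracted name could then be passed to a per-backend availability check, so plain and composite inputs follow the same path.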
Reviewed By: prashrock
Differential Revision: D81886736
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165061 
Approved by: https://github.com/H-Huang  
						
						
					 
					
						2025-10-17 22:06:36 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						616c6bdf8f 
					 
					
						
						
							
							[dynamo][ac] Config flag to allow eager and compile AC divergence for side-effects ( #165775 )  
						
						... 
						
						
						
						Eager AC/SAC reapplies the mutations (like global dict mutations) in the backward during the recomputation of forward. torch.compile has no easy way to reapply python mutations in the backward. But many users might be ok to skip reapplication of side effects in the backward. They can set this config flag to accept this eager and compile divergence.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165775 
Approved by: https://github.com/zou3519 
ghstack dependencies: #165734  
						
						
					 
					
						2025-10-17 22:04:19 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						c18ddfc572 
					 
					
						
						
							
							[dynamo][easy] Support torch.accelerator.current_accelerator ( #165734 )  
						
						... 
						
						
						
						Pull Request resolved: https://github.com/pytorch/pytorch/pull/165734 
Approved by: https://github.com/Skylion007  
						
						
					 
					
						2025-10-17 22:04:19 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						86ebce1766 
					 
					
						
						
							
							[precompile] Pass tensor_to_context to backend. ( #165702 )  
						
						... 
						
						
						
						Summary:
Fixing a VLLM issue https://github.com/vllm-project/vllm/issues/27040  where
aot precompile fails on some models using symbolic shapes in inductor.
Test Plan:
pp HF_HUB_DISABLE_XET=1 VLLM_ENABLE_V1_MULTIPROCESSING=0 VLLM_USE_AOT_COMPILE=1 vllm bench latency --model microsoft/DialoGPT-small --input-len 128 --output-len 256 --num-iters 50 --dtype float16
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165702 
Approved by: https://github.com/tugsbayasgalan  
						
						
					 
					
						2025-10-17 21:52:04 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						8cb2fb44f2 
					 
					
						
						
							
							[Inductor] Support fallback for all gemm like ops ( #165755 )  
						
						... 
						
						
						
						Summary: Fill op_override field for bmm aten ops so they can be converted properly in the wrapper_fxir backend
Reviewed By: StellarrZ
Differential Revision: D84840948
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165755 
Approved by: https://github.com/blaine-rister  
						
						
					 
					
						2025-10-17 21:08:29 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						ab65498d71 
					 
					
						
						
							
							Fix _StridedShard incorrect split ( #165533 )  
						
						... 
						
						
						
						https://github.com/pytorch/pytorch/pull/164820  introduced a bug where `_StridedShard` calls the parent class `Shard`'s `split_tensor` method, which results in incorrect data locality. (I think @ezyang spotted this issue, but we have no test to capture it.)
Meanwhile, I notice another bug that when we normalize a `_StridedShard`'s placement, it will also trigger parent class `Shard`'s `split_tensor` method because it will create a Shard class [here](0c14f55de6/torch/distributed/tensor/_api.py (L783)https://github.com/pytorch/pytorch/pull/165533 
Approved by: https://github.com/XilunWu  
					
						2025-10-17 20:54:46 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						06d324365c 
					 
					
						
						
							
							Revert "Escaped html tags name and target to appear as strings ( #165543 )"  
						
						... 
						
						
						
						This reverts commit 080365b7d82a3c99c995cab6dc912b7dfe22aa41.
Reverted https://github.com/pytorch/pytorch/pull/165543  on behalf of https://github.com/pytorch-auto-revert  due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/165543#issuecomment-3417102048 )) 
						
						
					 
					
						2025-10-17 20:45:48 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						6c9c6e0936 
					 
					
						
						
							
							Enable C407 of flake8 ( #165046 )  
						
						... 
						
						
						
						This PR enables C407 on flake8. The description of `C407` is `Unnecessary list comprehension - ‘<builtin>’ can take a generator`.
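A minimal sketch of the pattern and its rewrite (the data is made up):

```python
values = [3, 1, 4, 1, 5]

# Flagged by C407: the list comprehension builds a temporary list first.
total_from_list = sum([v * v for v in values])

# Preferred: pass a generator expression directly to the builtin.
total_from_generator = sum(v * v for v in values)

print(total_from_list == total_from_generator)  # True
```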
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165046 
Approved by: https://github.com/albanD  
						
						
					 
					
						2025-10-17 20:15:39 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						2bcd892c86 
					 
					
						
						
							
							[distributed] Replace assert statements in distributed checkpoint with explicit checks ( #165256 )  
						
						... 
						
						
						
						Fixes partially #164878 
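A sketch of the general rewrite, assuming the usual motivation that `assert` statements are stripped when Python runs with `-O`. The function `load_shard` and its arguments are hypothetical and not taken from the distributed checkpoint code.

```python
def load_shard(shard_index, num_shards):
    # Before: `assert 0 <= shard_index < num_shards`, which vanishes under `python -O`.
    # After: an explicit check that always runs and raises a specific exception.
    if not 0 <= shard_index < num_shards:
        raise ValueError(
            f"shard_index {shard_index} out of range [0, {num_shards})"
        )
    return f"shard-{shard_index}"


print(load_shard(1, 4))
```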
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165256 
Approved by: https://github.com/albanD  
						
						
					 
					
						2025-10-17 20:14:35 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						75e2a9fae3 
					 
					
						
						
							
							[annotate] add annotate_fn function decorator ( #165703 )  
						
						... 
						
						
						
						Example usage:
```
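import torch.fx.traceback as fx_traceback
import torch.nn as nn

# Every node traced from this function is annotated with pp_stage 1.
@fx_traceback.annotate_fn({"pp_stage": 1})
def example_function(x):
    return x * x

class SimpleLinear(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(3, 2)

    def forward(self, x):
        # The context-manager form annotates only the enclosed ops.
        with fx_traceback.annotate({"pp_stage": 0}):
            y = self.linear(x)
        y = example_function(y)
        return y - 1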
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165703 
Approved by: https://github.com/SherlockNoMad  
						
						
					 
					
						2025-10-17 20:10:53 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						a16fd6b488 
					 
					
						
						
							
							[NVSHMEM][Triton] Fix NVSHMEM triton test for wacky world sizes ( #165704 )  
						
						
						
						
						The test currently seems to assume a world size divisible by 4.
The new setup is not as slick as the old setup code, but it is more general.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165704 
Approved by: https://github.com/Skylion007 , https://github.com/kwen2501  
						
						
					 
					
						2025-10-17 19:33:26 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						382b0150de 
					 
					
						
						
							
							[docs] Add usage examples to ConvTranspose1d docstring ( #165618 )  
						
						
						
						
						Fixes  #165615 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165618 
Approved by: https://github.com/mikaylagawarecki  
					
						2025-10-17 19:11:57 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						a664b299ac 
					 
					
						
						
							
							Update docs for torch.mode ( #165614 )  
						
						
						
						
						Currently the docs for `torch.mode` include a note:
`This function is not defined for torch.cuda.Tensor yet.`
However with `torch==2.7.1+cu126` when I try to get the mode of a Tensor that is in cuda memory, I do not face any issues:
```
>>> a = torch.tensor([0, 2, 1, 1, 1, 3, 3])
>>> a.mode()
torch.return_types.mode(
values=tensor(1),
indices=tensor(4))
>>> a.cuda().mode()
torch.return_types.mode(
values=tensor(1, device='cuda:0'),
indices=tensor(4, device='cuda:0'))
```
Am I misunderstanding the note? If not, I suggest removing it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165614 
Approved by: https://github.com/mikaylagawarecki  
						
						
					 
					
						2025-10-17 19:06:33 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						9c12651417 
					 
					
						
						
							
							Improve error message for non-positive groups in convolution ( #165669 )  
						
						
						
						
						Prevents a segmentation fault when an invalid `groups` value is passed to convolution.
Fixes  #142835 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165669 
Approved by: https://github.com/mikaylagawarecki  
						
						
					 
					
						2025-10-17 19:06:05 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						08c97b4a1f 
					 
					
						
						
							
							Don't run compile inside kernel invocation ( #165687 )  
						
						
						
						
						When torch.compile is called during fake tensor prop, we shouldn't actually compile, because we can't guarantee that the compiled artifact can itself be fake tensor prop-d (for example, with the inductor backend). Instead, we should just skip compiling. The inner compile is still triggered when the code is executed at runtime.
Fixes: https://github.com/pytorch/pytorch/issues/151328 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165687 
Approved by: https://github.com/zou3519  
						
						
					 
					
						2025-10-17 19:03:57 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						fae74cd52f 
					 
					
						
						
							
							Revert "shrink_group implementation to expose ncclCommShrink API ( #164518 )"  
						
						
						
						
						This reverts commit a032510db38e8331afa08f7635d146f9cefdd0ab.
Reverted https://github.com/pytorch/pytorch/pull/164518  on behalf of https://github.com/pytorch-auto-revert  due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/164518#issuecomment-3416718767 )) 
						
						
					 
					
						2025-10-17 18:55:53 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						7a65770013 
					 
					
						
						
							
							Update gm.print_readable to include Annotation ( #165397 )  
						
						
						
						
						Sample output
```
[rank0]:        # Annotation: {'compile_with_inductor': 'flex_attention'} File: /data/users/bahuang/pytorch/torch/nn/attention/flex_attention.py:1490 in flex_attention, code: out, lse, max_scores = flex_attention_hop(
[rank0]:        score_mod_2 = self.score_mod_2
[rank0]:        mask_fn_2 = self.mask_fn_2
[rank0]:        flex_attention_1 = torch.ops.higher_order.flex_attention(xq_5, xk_5, xv_3, score_mod_2, (2048, 2048, g____import_torchtitan_dot_models_dot_attention___flex_attention_block_masks___block_causal___none___kv_num_blocks, g____import_torchtitan_dot_models_dot_attention___flex_attention_block_masks___block_causal___none___kv_indices, g____import_torchtitan_dot_models_dot_attention___flex_attention_block_masks___block_causal___none___full_kv_num_blocks, g____import_torchtitan_dot_models_dot_attention___flex_attention_block_masks___block_causal___none___full_kv_indices, g____import_torchtitan_dot_models_dot_attention___flex_attention_block_masks___block_causal___none___q_num_blocks, g____import_torchtitan_dot_models_dot_attention___flex_attention_block_masks___block_causal___none___q_indices, g____import_torchtitan_dot_models_dot_attention___flex_attention_block_masks___block_causal___none___full_q_num_blocks, g____import_torchtitan_dot_models_dot_attention___flex_attention_block_masks___block_causal___none___full_q_indices, 128, 128, mask_fn_2), 0.25, {'PRESCALE_QK': False, 'ROWS_GUARANTEED_SAFE': False, 'BLOCKS_ARE_CONTIGUOUS': False, 'WRITE_DQ': True, 'OUTPUT_LOGSUMEXP': True, 'OUTPUT_MAX': False}, (), (g____import_torchtitan_dot_models_dot_attention___flex_attention_block_masks___block_causal___none___mask_mod___closure___0_cell_contents,));  xq_5 = xk_5 = xv_3 = score_mod_2 = mask_fn_2 = None
[rank0]:        out_2: "bf16[8, 4, 2048, 16]" = flex_attention_1[0];  flex_attention_1 = None
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165397 
Approved by: https://github.com/yushangdi , https://github.com/anijain2305  
						
						
					 
					
						2025-10-17 18:35:18 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						e4454947e2 
					 
					
						
						
							
							Widen ops support to take in IntHOArrayRef vs only std::vec ( #165152 )  
						
						
						
						
						Pull Request resolved: https://github.com/pytorch/pytorch/pull/165152 
Approved by: https://github.com/mikaylagawarecki 
ghstack dependencies: #164991  
						
						
					 
					
						2025-10-17 18:32:39 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						3806e9767b 
					 
					
						
						
							
							Refactor out headeronly ArrayRef ( #164991 )  
						
						
						
						
						Pull Request resolved: https://github.com/pytorch/pytorch/pull/164991 
Approved by: https://github.com/swolchok  
						
						
					 
					
						2025-10-17 18:32:39 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						b08d8c2e50 
					 
					
						
						
							
							Revert "[DebugMode][2/N] add nn.Module tracking ( #165498 )"  
						
						
						
						
						This reverts commit 45afaf08a14ab760d86ea80dea6d50cec8626513.
Reverted https://github.com/pytorch/pytorch/pull/165498  on behalf of https://github.com/seemethere  due to First part of the stack was reverted so will need to revert this too ([comment](https://github.com/pytorch/pytorch/pull/165498#issuecomment-3416618198 )) 
						
						
					 
					
						2025-10-17 18:22:48 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						ca5b7f8ded 
					 
					
						
						
							
							torch.compile: populate compiler_config ( #165581 )  
						
						
						
						
						Summary: This starts writing the compiler_config metadata into logger
Test Plan:
Modified existing test case to make sure this is not null.
(Also eyeballed what we're logging to make sure it's reasonable.)
Reviewed By: masnesral
Differential Revision: D84014636
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165581 
Approved by: https://github.com/masnesral  
						
						
					 
					
						2025-10-17 18:21:18 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						9a71d96256 
					 
					
						
						
							
							Revert "[DebugMode][1/N] refactor logs into _DebugCalls ( #165376 )"  
						
						
						
						
						This reverts commit 556fc09a9f67f24ca5591ec049c5d0c347c5f62a.
Reverted https://github.com/pytorch/pytorch/pull/165376  on behalf of https://github.com/seemethere  due to This is failing for internal tests, see D84877379 for more context ([comment](https://github.com/pytorch/pytorch/pull/165376#issuecomment-3416570407 )) 
						
						
					 
					
						2025-10-17 18:08:59 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						0d4c2b71e8 
					 
					
						
						
							
							[DeviceMesh] Simplify unflatten method ( #165556 )  
						
						
						
						
						By adding a few small helpers (e.g., a `splice` method to `_MeshLayout`, and making `_init_process_groups` static and thus stateless) we can substantially shorten the definition of the unflatten method, and help readability.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165556 
Approved by: https://github.com/fduwjj 
ghstack dependencies: #165554 , #165555  
						
						
					 
					
						2025-10-17 17:57:51 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						d659bbde62 
					 
					
						
						
							
							[DeviceMesh] Introduce private constructor instead of _create_mesh_from_ranks ( #165555 )  
						
						
						
						
						The refactoring of DeviceMesh is heavily constrained by the signature of its constructor, a public API that contains some "legacy" concepts we'd love to get rid of, such as an explicit/materialized `mesh` Tensor.
In other languages the solution to this would be to add a private overload of the constructor. Python doesn't natively allow this, but in this PR I managed to build something that approximates it.
This new private constructor basically only takes `_layout`, `_global_rank_permutation`, and `mesh_dim_names`.
With such a constructor we can effectively simplify a lot of callsites and get rid of the `_create_mesh_from_ranks` helper method. That's a good thing because it was instantiating many DeviceMeshes in a for loop, which always felt unnecessary.
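One common way to approximate a private constructor overload in Python is to route internal construction through a classmethod that bypasses the public `__init__`. The sketch below only illustrates that general idea: the `Mesh` class, `_from_layout`, and its fields are hypothetical and are not the mechanism or names used in this PR.

```python
class Mesh:
    """Toy class with a public constructor and a private alternative."""

    def __init__(self, ranks, dim_names=None):
        # Public API: accepts an explicit, materialized list of ranks.
        self._init_fields((len(ranks),), list(ranks), dim_names)

    @classmethod
    def _from_layout(cls, layout, rank_map, dim_names=None):
        # "Private overload": skip __init__ and fill the fields directly.
        self = cls.__new__(cls)
        self._init_fields(layout, rank_map, dim_names)
        return self

    def _init_fields(self, layout, rank_map, dim_names):
        self.layout = tuple(layout)
        self.rank_map = rank_map  # shared with the caller, not copied
        self.dim_names = dim_names


root = Mesh([0, 1, 2, 3], dim_names=("dp",))
child = Mesh._from_layout((2, 2), root.rank_map, dim_names=("dp", "tp"))
```

Internal call sites can then build derived objects through `_from_layout` without going through the legacy public signature.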
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165555 
Approved by: https://github.com/fduwjj , https://github.com/fegin 
ghstack dependencies: #165554  
						
						
					 
					
						2025-10-17 17:57:51 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						58879bfafa 
					 
					
						
						
							
							[DeviceMesh] Prefer using _layout over _mesh for all sorts of things ( #165554 )  
						
						
						
						
						The goal of this PR is to avoid storing the explicit `mesh` Tensor inside each DeviceMesh, and instead compute it on-the-fly when the end user needs it, and try to replace all of its internal usages with `_layout` and the newly-introduced `_global_rank_permutation` Tensor. The name of this attribute is up for debate. The advantage of the `_global_rank_permutation` Tensor is that it is _the same_ Tensor for the root mesh and all its children, so it doesn't need to be copied/reallocated.
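To make the idea concrete, here is a minimal pure-Python sketch of deriving the mesh on demand from a layout and a shared flat rank list. The `LazyMesh` class, its `(sizes, strides)` layout representation, and the `offset` field are assumptions made for illustration; the actual `_layout` type is not described in this message.

```python
class LazyMesh:
    """Stores a layout plus a shared flat rank list; builds the mesh lazily."""

    def __init__(self, sizes, strides, rank_map, offset=0):
        self.sizes = tuple(sizes)
        self.strides = tuple(strides)
        self.rank_map = rank_map  # the same object for root and children
        self.offset = offset

    @property
    def mesh(self):
        # Computed on the fly: only the layout is stored per mesh.
        def build(dim, base):
            if dim == len(self.sizes):
                return self.rank_map[base]
            return [build(dim + 1, base + i * self.strides[dim])
                    for i in range(self.sizes[dim])]
        return build(0, self.offset)


rank_map = list(range(8))
root = LazyMesh((2, 4), (4, 1), rank_map)
column = LazyMesh((2,), (4,), rank_map, offset=1)  # slice along dim 0
```

Here `root` and `column` hold the same `rank_map` object, so creating a child mesh allocates nothing beyond its small layout description.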
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165554 
Approved by: https://github.com/fduwjj  
						
						
					 
					
						2025-10-17 17:57:51 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						a032510db3 
					 
					
						
						
							
							shrink_group implementation to expose ncclCommShrink API ( #164518 )  
						
						
						
						
						Closes  #164529 
Exposes the new [ncclCommShrink](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/comms.html#ncclcommshrink) API to PyTorch.
This is useful when you need to exclude certain GPUs or nodes from a collective operation, for example in fault tolerance scenarios or when dynamically adjusting resource utilization.
For more info:  [Shrinking a communicator](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/communicators.html#shrinking-a-communicator )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164518 
Approved by: https://github.com/Skylion007 , https://github.com/syed-ahmed , https://github.com/kwen2501  
					
						2025-10-17 17:55:03 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						39e0a832c9 
					 
					
						
						
							
							Fix B200 test fails in scaled_mm ( #165747 )  
						
						
						
						
						Summary:
PR #165528  changes some scale/swizzle inference behavior in scaled_mm
tests - mxfp8 tests on Blackwell can get incorrectly classified,
resulting in failures.
Fix the scale/swizzle inference code to prevent this.
Fixes https://github.com/pytorch/pytorch/issues/165743 
Test Plan:
```
pytest -svv test/test_scaled_matmul_cuda.py
```
Reviewers:
@jagadish-amd @jeffdaily @drisspg
Subscribers:
@Aidyn-A
Tasks:
Tags:
Signed-off-by: Simon Layton <simonlaytonmeta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165747 
Approved by: https://github.com/eqy , https://github.com/drisspg , https://github.com/jeffdaily  
						
						
					 
					
						2025-10-17 17:52:19 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						dd3b48e85d 
					 
					
						
						
							
							Fix bug with serialization after AOTAutogradCache hit ( #165474 )  
						
						
						
						
						Fixes  #165447 
On AOTAutogradCache load, the serialization function we pick is just lambda: self, because the object itself is an AOTAutogradCacheEntry. However, this isn't safe, because `wrap_post_compile` will make `self` unserializable, since it needs to load triton kernels and stuff!
So instead, on AOTAutogradCache load, we preserve the bytes that were used to load the object to begin with, and return that object on a call to serialize(). This effectively makes it so that we save a copy of the pre-hydrated artifact, without needing to do an eager copy until someone actually calls `serialize`.
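The approach can be sketched in plain Python with `pickle`. This is an illustration only: `CacheEntry`, `load`, `hydrate`, and `serialize` are hypothetical names, and the lambda stands in for whatever unserializable state `wrap_post_compile` attaches.

```python
import pickle


class CacheEntry:
    def __init__(self, state, loaded_bytes=None):
        self.state = state                 # plain, picklable data
        self._loaded_bytes = loaded_bytes  # set only on a cache hit

    @classmethod
    def load(cls, data):
        # Keep a reference to the bytes we were loaded from; no copy is made.
        return cls(pickle.loads(data), loaded_bytes=data)

    def hydrate(self):
        # Stand-in for post-compile wrapping: a lambda is not picklable.
        self.state["run"] = lambda x: x + self.state["bias"]

    def serialize(self):
        if self._loaded_bytes is not None:
            return self._loaded_bytes  # the pre-hydration artifact
        return pickle.dumps(self.state)


original = CacheEntry({"bias": 1}).serialize()
entry = CacheEntry.load(original)
entry.hydrate()
restored = CacheEntry.load(entry.serialize())
```

After `hydrate()`, pickling `entry.state` directly would fail, but `serialize()` still succeeds because it hands back the bytes saved at load time.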
Test Plan:
Run
```py
import torch
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = torch.nn.Linear(2, 4)
        self.relu = torch.nn.ReLU()
        self.linear2 = torch.nn.Linear(4, 8)
    def forward(self, x):
        return self.linear2(self.relu(self.linear1(x)))
device = "cuda"
m = M().to(device)
sample_inputs = (torch.randn(2, 2, device=device),)
eager_out = m(*sample_inputs)
with torch._dynamo.config.patch("enable_aot_compile", True):
    compiled_fn_path = "./m.pt"
    compiled_fn = torch.compile(
        m,
        fullgraph=True
    ).forward.aot_compile((sample_inputs, {}))
    compiled_fn.save_compiled_function(compiled_fn_path)
    torch._dynamo.reset()
    with torch.compiler.set_stance("fail_on_recompile"):
        with open(compiled_fn_path, "rb") as f:
            loaded_fn = torch.compiler.load_compiled_function(f)
assert loaded_fn is not None
compiled_out = loaded_fn(m, *sample_inputs)
assert torch.allclose(eager_out, compiled_out)
```
twice, see that it succeeds.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165474 
Approved by: https://github.com/yiming0416 , https://github.com/zhxchen17  
					
						2025-10-17 17:47:24 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						cff1b20771 
					 
					
						
						
							
							Patch the flex_attention._get_mod_type to not use inspect.signature when computing num_positional_args (an alternative fix for flex attention graph break on create_block_mask) ( #164923 )  
						
						
						
						
						The initial fix for inspect.signature did not take the right approach (https://github.com/pytorch/pytorch/pull/164349#pullrequestreview-3306614010). As @williamwen42 suggests (https://github.com/pytorch/pytorch/pull/164349#issuecomment-3379222885), for now we can just get rid of the `inspect.signature` call in flex_attention to resolve this high-priority issue (https://github.com/pytorch/pytorch/issues/164247#issuecomment-3378673179). This PR does exactly that: it limits the scope of the fix to computing `num_positional_args` in `flex_attention._get_mod_type` based on properties returned by `NestedUserFunctionVariable.const_getattr` (some were missing, so I added them).
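For plain Python functions, the number of required positional arguments can be read from function attributes without calling `inspect.signature`. The sketch below is an assumption about the general technique, not the PR's code: `count_required_positional` is a hypothetical helper that relies only on the standard `__code__.co_argcount` and `__defaults__` attributes, and the two example functions are illustrative.

```python
def count_required_positional(fn):
    """Count positional parameters that have no default value."""
    num_positional = fn.__code__.co_argcount  # excludes *args and kw-only
    num_defaults = len(fn.__defaults__ or ())
    return num_positional - num_defaults


def score_mod(score, b, h, q_idx, kv_idx):
    return score


def mask_mod(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx


def with_default(a, b=1, *args, c):
    return a
```

The two callables can then be told apart by their counts, and a function with defaults or keyword-only parameters is handled by the same arithmetic.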
Fixes  #164247 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164923 
Approved by: https://github.com/williamwen42  
						
						
					 
					
						2025-10-17 17:44:45 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						da8517fa63 
					 
					
						
						
							
							[ROCm][CI] upgrade wheels to 7.0.2 and 6.4.4 patch release ( #165756 )  
						
						
						
						
						Pull Request resolved: https://github.com/pytorch/pytorch/pull/165756 
Approved by: https://github.com/jeffdaily 
Co-authored-by: Jeff Daily <jeff.daily@amd.com > 
						
						
					 
					
						2025-10-17 17:41:19 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						45afaf08a1 
					 
					
						
						
							
							[DebugMode][2/N] add nn.Module tracking ( #165498 )  
						
						
						
						
						Uses ModTracker to record nn.Module entries, much like CommDebugMode.
Can be switched on with `DebugMode(record_nn_module=True)`:
```
    [nn.Mod] Bar
      [nn.Mod] Bar.abc
        [nn.Mod] Bar.abc.l1
          aten::t(t: f32[4, 4])
          aten::addmm(t: f32[4], t: f32[4, 4], t: f32[4, 4])
        [nn.Mod] Bar.abc.l2
          aten::t(t: f32[4, 4])
          aten::addmm(t: f32[4], t: f32[4, 4], t: f32[4, 4])
      [nn.Mod] Bar.xyz
        aten::t(t: f32[4, 4])
        aten::addmm(t: f32[4], t: f32[4, 4], t: f32[4, 4])
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165498 
Approved by: https://github.com/SherlockNoMad 
ghstack dependencies: #165376  
						
						
					 
					
						2025-10-17 17:39:48 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						080365b7d8 
					 
					
						
						
							
							Escaped html tags name and target to appear as strings ( #165543 )  
						
						
						
						
						Fixes a small typo in a Markdown documentation file by adding escape characters before the tag pattern.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165543 
Approved by: https://github.com/mikaylagawarecki  
						
						
					 
					
						2025-10-17 17:35:18 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						2928c5c572 
					 
					
						
						
							
							Revert "Pyrefly suppressions 2 ( #165692 )"  
						
						
						
						
						This reverts commit 43d78423ac224cce432bf34ed9627035169d5433.
Reverted https://github.com/pytorch/pytorch/pull/165692  on behalf of https://github.com/seemethere  due to This is causing merge conflicts when attempting to land internally, see D84890919 for more details ([comment](https://github.com/pytorch/pytorch/pull/165692#issuecomment-3416397240 )) 
						
						
					 
					
						2025-10-17 17:13:04 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						630520b346 
					 
					
						
						
							
							[dynamo][misc] Replace UserFunctionVariable with VariableTracker build ( #165707 )  
						
						
						
						
						Audit: To prevent future issues with functools.partial or callable
objects.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165707 
Approved by: https://github.com/Lucaskabela 
ghstack dependencies: #165683 , #165706  
						
						
					 
					
						2025-10-17 17:02:18 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						1dc9a05d03 
					 
					
						
						
							
							[dynamo][user_defined] Replace UserFunctionVariable with VariableTracker build ( #165706 )  
						
						
						
						
						Audit: To prevent future issues with functools.partial or callable
objects.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165706 
Approved by: https://github.com/Lucaskabela 
ghstack dependencies: #165683  
						
						
					 
					
						2025-10-17 17:02:18 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						bfcdbd0a97 
					 
					
						
						
							
							fix wrong accuracy_status when exception. ( #165731 )  
						
						
						
						
						While debugging an `XPU` accuracy issue, I found that the script outputs the wrong accuracy_status.
When the `try` block raises an exception, we should process the exception rather than return `fail_accuracy`.
Before the fix, it returned `fail_accuracy`:
<img width="1109" height="216" alt="image" src="https://github.com/user-attachments/assets/385c354f-fbf6-48e4-a1be-3e37e987341b " />
After the fix, it returns the exception message:
<img width="1101" height="292" alt="image" src="https://github.com/user-attachments/assets/f18c0e3c-8358-4ec7-a6bb-c2e01b69d27f " />
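The intended control flow can be sketched as follows. The function name, the comparison callback, and the status strings other than `fail_accuracy` are hypothetical; the sketch only illustrates reporting the exception instead of folding it into `fail_accuracy`.

```python
def check_accuracy(run_model, expected, same):
    """Return an accuracy status, reporting exceptions separately."""
    try:
        actual = run_model()
    except Exception as exc:
        # Surface the failure itself rather than mislabeling it.
        return f"fail_to_run: {type(exc).__name__}: {exc}"
    return "pass" if same(actual, expected) else "fail_accuracy"


def broken_model():
    raise RuntimeError("device error")
```

With this shape, `fail_accuracy` is reserved for runs that complete but produce mismatching results.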
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165731 
Approved by: https://github.com/Stonepia , https://github.com/chuanqi129 , https://github.com/Lucaskabela  
						
						
					 
					
						2025-10-17 16:37:06 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						faff826a46 
					 
					
						
						
							
							Revert "[ROCm] new implementation of upsample_bilinear2d_backward ( #164572 )"  
						
						
						
						
						This reverts commit 53f9ae0e50d4dcc47f2ca4bf854803f9d4f875ae.
Reverted https://github.com/pytorch/pytorch/pull/164572  on behalf of https://github.com/seemethere  due to Looks like this is failing in our internal builds, will post a suggestion for a fix but want you to double verify that this behavior is correct ([comment](https://github.com/pytorch/pytorch/pull/164572#issuecomment-3416262676 )) 
						
						
					 
					
						2025-10-17 16:27:59 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						85c5433d38 
					 
					
						
						
							
							Revert "Fix _StridedShard incorrect split ( #165533 )"  
						
						
						
						
						This reverts commit dfc8a1c5ddc8401197e9ab546e03b0f745edc27b.
Reverted https://github.com/pytorch/pytorch/pull/165533  on behalf of https://github.com/seemethere  due to Causing a merge conflict internally, see D84829161 ([comment](https://github.com/pytorch/pytorch/pull/165533#issuecomment-3416143176 )) 
						
						
					 
					
						2025-10-17 15:57:01 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						935ccdbe75 
					 
					
						
						
							
							[MPS] Fix internal assertion in torch.linalg.solve for singular matrices ( #165254 )  
						
						
						
						
						Fixes  #163962  by special casing MPS in the negative status code branch in `_linalg_check_errors`.
Checks if info is [`MPSMatrixDecompositionStatus.singular`](https://developer.apple.com/documentation/metalperformanceshaders/mpsmatrixdecompositionstatus/singular ) (which has a raw value of -2). I didn't find an official Apple source with this raw value (besides printing the enum value), so I'm not sure if we can (or should) depend on it? Is there a way to directly get the Objective-C enum value in C++?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165254 
Approved by: https://github.com/malfet  
					
						2025-10-17 15:35:49 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						3af2f0c12a 
					 
					
						
						
							
							[inductor] require shape in TritonCSEVariable ( #162275 )  
						
						
						
						
						Pull Request resolved: https://github.com/pytorch/pytorch/pull/162275 
Approved by: https://github.com/mlazos 
ghstack dependencies: #164158  
						
						
					 
					
						2025-10-17 14:47:45 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						6ece527fc5 
					 
					
						
						
							
							[CI] Add aarch64 operator benchmark ( #165585 )  
						
						
						
						
						Running on Graviton4
Skip ConvTranspose1d benchmarks if PyTorch is compiled with ACL, due to https://github.com/pytorch/pytorch/issues/165654 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165585 
Approved by: https://github.com/huydhn  
						
						
					 
					
						2025-10-17 14:42:14 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						ce29d0d796 
					 
					
						
						
							
							[ATen] Vectorize 8 elements on 16 bit data types for sum/mean ( #165055 )  
						
						
						
						
						Benchmarks cover a full reduction and a reduction on the contiguous dimension. Vectorized loads do not occur on the non-contiguous dimension. Benchmarking was done for FP16/BF16: ~6% improvement on average across shapes, up to ~24% for a single reduction on the contiguous dimension and 46% for a full reduce:
**BF16**
```
Tensor Shape         Operation    Full before (ms)     Contig. before (ms)  Full after (ms)      Contig. after (ms)   Full reduce diff %   Contiguous diff %
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(256, 256)           mean         0.022686             0.008263             0.015498             0.008117                          +46.38%               +1.80%
(256, 256)           sum          0.022769             0.008269             0.015628             0.008185                          +45.69%               +1.03%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(512, 512)           mean         0.014116             0.009545             0.012892             0.008839                           +9.49%               +7.99%
(512, 512)           sum          0.014110             0.009892             0.012891             0.008878                           +9.46%              +11.42%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(1024, 1024)         mean         0.014727             0.012642             0.014061             0.010519                           +4.74%              +20.18%
(1024, 1024)         sum          0.014376             0.012636             0.014069             0.010595                           +2.18%              +19.26%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(2048, 2048)         mean         0.018663             0.018294             0.018171             0.014678                           +2.71%              +24.64%
(2048, 2048)         sum          0.018638             0.017931             0.018142             0.014713                           +2.73%              +21.87%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(4096, 4096)         mean         0.034216             0.036953             0.033520             0.030585                           +2.08%              +20.82%
(4096, 4096)         sum          0.034196             0.036942             0.033518             0.030676                           +2.02%              +20.43%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(8192, 8192)         mean         0.087763             0.095201             0.085439             0.084960                           +2.72%              +12.05%
(8192, 8192)         sum          0.088079             0.095592             0.085353             0.084632                           +3.19%              +12.95%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(8192, 16384)        mean         0.148174             0.149705             0.146274             0.138865                           +1.30%               +7.81%
(8192, 16384)        sum          0.147820             0.149371             0.146419             0.138752                           +0.96%               +7.65%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(8192, 32768)        mean         0.266144             0.260807             0.265953             0.253330                           +0.07%               +2.95%
(8192, 32768)        sum          0.266572             0.261163             0.265729             0.253294                           +0.32%               +3.11%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(8192, 65536)        mean         0.502034             0.486312             0.498417             0.481246                           +0.73%               +1.05%
(8192, 65536)        sum          0.501597             0.486351             0.497735             0.481579                           +0.78%               +0.99%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(8192, 131072)       mean         0.971178             0.942988             0.957164             0.938316                           +1.46%               +0.50%
(8192, 131072)       sum          0.971189             0.943232             0.956814             0.937816                           +1.50%               +0.58%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(8192, 262144)       mean         1.953728             1.877648             1.904937             1.861692                           +2.56%               +0.86%
(8192, 262144)       sum          1.953969             1.877538             1.905990             1.862547                           +2.52%               +0.80%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(4096, 262144)       mean         0.970408             0.940965             0.957871             0.936732                           +1.31%               +0.45%
(4096, 262144)       sum          0.970919             0.941652             0.957765             0.936676                           +1.37%               +0.53%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(2048, 262144)       mean         0.501477             0.486976             0.497964             0.483570                           +0.71%               +0.70%
(2048, 262144)       sum          0.501955             0.487213             0.498210             0.483218                           +0.75%               +0.83%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(1024, 262144)       mean         0.266536             0.257111             0.265642             0.255439                           +0.34%               +0.65%
(1024, 262144)       sum          0.266613             0.257096             0.265427             0.255472                           +0.45%               +0.64%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(512, 131072)        mean         0.087805             0.091200             0.085818             0.087851                           +2.32%               +3.81%
(512, 131072)        sum          0.087788             0.091249             0.085373             0.087944                           +2.83%               +3.76%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(1000, 1000)         mean         0.014503             0.012328             0.013663             0.010190                           +6.15%              +20.98%
(1000, 1000)         sum          0.014545             0.012378             0.013662             0.010579                           +6.46%              +17.01%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(1024, 129)          mean         0.014163             0.008371             0.012893             0.008828                           +9.85%               -5.18%
(1024, 129)          sum          0.014132             0.008751             0.013234             0.008868                           +6.79%               -1.32%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(1024, 257)          mean         0.014296             0.009101             0.013334             0.008563                           +7.21%               +6.28%
(1024, 257)          sum          0.014302             0.009058             0.013020             0.008672                           +9.85%               +4.45%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(1024, 587)          mean         0.014127             0.010997             0.013443             0.009944                           +5.09%              +10.59%
(1024, 587)          sum          0.014471             0.011373             0.013123             0.010354                          +10.27%               +9.84%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(2048, 977)          mean         0.015607             0.013566             0.015089             0.012152                           +3.43%              +11.64%
(2048, 977)          sum          0.015953             0.013580             0.015039             0.011861                           +6.08%              +14.49%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(1024, 128)          mean         0.013982             0.008058             0.012747             0.008139                           +9.69%               -1.00%
(1024, 128)          sum          0.013967             0.008071             0.012726             0.007859                           +9.75%               +2.70%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(8192, 128)          mean         0.014378             0.009627             0.013712             0.009395                           +4.86%               +2.47%
(8192, 128)          sum          0.014389             0.009965             0.013718             0.009521                           +4.89%               +4.66%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(1024, 130)          mean         0.014156             0.008267             0.012895             0.008833                           +9.78%               -6.41%
(1024, 130)          sum          0.013797             0.008277             0.012903             0.008512                           +6.93%               -2.76%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(8192, 130)          mean         0.014977             0.010026             0.013911             0.009876                           +7.66%               +1.52%
(8192, 130)          sum          0.014994             0.010043             0.014235             0.009604                           +5.33%               +4.57%
====================================================================================================================================================================================
```
**FP16**
```
Tensor Shape         Operation    Full reduce (ms)     Contiguous dim (ms)  Full reduce (ms)     Contiguous dim (ms)  Full reduce diff %   Contiguous diff %
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(256, 256)           mean         0.022804             0.008298             0.015888             0.007848                          +43.53%               +5.73%
(256, 256)           sum          0.023215             0.008328             0.015677             0.007850                          +48.08%               +6.09%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(512, 512)           mean         0.013777             0.009988             0.012884             0.008512                           +6.93%              +17.34%
(512, 512)           sum          0.013775             0.009622             0.012870             0.009028                           +7.03%               +6.58%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(1024, 1024)         mean         0.014740             0.012322             0.013708             0.010239                           +7.53%              +20.34%
(1024, 1024)         sum          0.014762             0.012756             0.013722             0.010307                           +7.58%              +23.76%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(2048, 2048)         mean         0.018700             0.018364             0.018135             0.015078                           +3.12%              +21.79%
(2048, 2048)         sum          0.018276             0.018415             0.018471             0.015127                           -1.06%              +21.74%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(4096, 4096)         mean         0.034518             0.037000             0.033838             0.030617                           +2.01%              +20.85%
(4096, 4096)         sum          0.034569             0.037448             0.033842             0.031100                           +2.15%              +20.41%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(8192, 8192)         mean         0.087675             0.095176             0.085328             0.084105                           +2.75%              +13.16%
(8192, 8192)         sum          0.088102             0.095211             0.085707             0.084090                           +2.79%              +13.23%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(8192, 16384)        mean         0.147800             0.149263             0.146388             0.138390                           +0.96%               +7.86%
(8192, 16384)        sum          0.148147             0.148957             0.146439             0.138801                           +1.17%               +7.32%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(8192, 32768)        mean         0.266316             0.260294             0.265829             0.253411                           +0.18%               +2.72%
(8192, 32768)        sum          0.266562             0.260717             0.265744             0.253308                           +0.31%               +2.92%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(8192, 65536)        mean         0.502035             0.486077             0.498139             0.481374                           +0.78%               +0.98%
(8192, 65536)        sum          0.501571             0.485733             0.498353             0.481350                           +0.65%               +0.91%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(8192, 131072)       mean         0.971343             0.943016             0.956600             0.938622                           +1.54%               +0.47%
(8192, 131072)       sum          0.971463             0.942991             0.957352             0.938334                           +1.47%               +0.50%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(8192, 262144)       mean         1.952722             1.877165             1.906406             1.861455                           +2.43%               +0.84%
(8192, 262144)       sum          1.952634             1.876388             1.904677             1.861282                           +2.52%               +0.81%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(4096, 262144)       mean         0.970697             0.941298             0.956964             0.936160                           +1.44%               +0.55%
(4096, 262144)       sum          0.969981             0.941078             0.957016             0.936260                           +1.35%               +0.51%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(2048, 262144)       mean         0.501577             0.487208             0.498422             0.483493                           +0.63%               +0.77%
(2048, 262144)       sum          0.502029             0.487124             0.497854             0.483643                           +0.84%               +0.72%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(1024, 262144)       mean         0.266416             0.257383             0.265928             0.255140                           +0.18%               +0.88%
(1024, 262144)       sum          0.266434             0.257081             0.265817             0.255143                           +0.23%               +0.76%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(512, 131072)        mean         0.087858             0.091296             0.085816             0.087745                           +2.38%               +4.05%
(512, 131072)        sum          0.088144             0.091314             0.085664             0.087864                           +2.90%               +3.93%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(1000, 1000)         mean         0.014977             0.012393             0.014141             0.010614                           +5.91%              +16.76%
(1000, 1000)         sum          0.014589             0.012804             0.014118             0.010320                           +3.34%              +24.07%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(1024, 129)          mean         0.014208             0.008383             0.013273             0.008440                           +7.04%               -0.68%
(1024, 129)          sum          0.013804             0.008863             0.013265             0.009003                           +4.06%               -1.56%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(1024, 257)          mean         0.014378             0.009109             0.013037             0.009038                          +10.29%               +0.79%
(1024, 257)          sum          0.014387             0.009113             0.013396             0.008698                           +7.40%               +4.77%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(1024, 587)          mean         0.014207             0.011037             0.013182             0.010391                           +7.78%               +6.22%
(1024, 587)          sum          0.014588             0.011453             0.013539             0.010049                           +7.75%              +13.97%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(2048, 977)          mean         0.016024             0.013614             0.015448             0.011845                           +3.73%              +14.93%
(2048, 977)          sum          0.015990             0.014033             0.015406             0.012278                           +3.79%              +14.29%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(1024, 128)          mean         0.014037             0.007804             0.013143             0.008242                           +6.80%               -5.31%
(1024, 128)          sum          0.014041             0.007847             0.012759             0.007850                          +10.05%               -0.04%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(8192, 128)          mean         0.014361             0.009644             0.014075             0.009061                           +2.03%               +6.43%
(8192, 128)          sum          0.014366             0.010032             0.013702             0.009181                           +4.85%               +9.27%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(1024, 130)          mean         0.014226             0.008696             0.012894             0.008835                          +10.33%               -1.57%
(1024, 130)          sum          0.013830             0.008740             0.013288             0.008989                           +4.08%               -2.77%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(8192, 130)          mean         0.015036             0.010019             0.013917             0.009538                           +8.04%               +5.04%
(8192, 130)          sum          0.014652             0.010403             0.013900             0.009565                           +5.41%               +8.76%
====================================================================================================================================================================================
```
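The two diff columns appear to be the relative change of the first pair of timing columns with respect to the second pair. This reading is inferred from the numbers themselves and is not stated in the PR. A minimal Python sketch that reproduces two FP16 rows (the helper name `diff_percent` is hypothetical):

```python
def diff_percent(first_ms: float, second_ms: float) -> float:
    """Relative difference of first_ms with respect to second_ms, in percent."""
    return (first_ms - second_ms) / second_ms * 100.0


# (256, 256) mean, full reduce: 0.022804 ms vs 0.015888 ms
print(f"{diff_percent(0.022804, 0.015888):+.2f}%")  # table reports +43.53%

# (1024, 128) mean, contiguous dim: 0.007804 ms vs 0.008242 ms
print(f"{diff_percent(0.007804, 0.008242):+.2f}%")  # table reports -5.31%
```

Under this reading, a positive value means the first measurement was slower than the second.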
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165055 
Approved by: https://github.com/ngimel 
ghstack dependencies: #165494 , #164790  
						
						
					 
					
						2025-10-17 13:39:36 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						7231118db3 
					 
					
						
						
							
							Turn some const variables into constexpr in C++ code ( #165401 )  
						
						
						
						
						This PR checks the C++ code and turns some const variables into constexpr.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165401 
Approved by: https://github.com/Skylion007  
						
						
					 
					
						2025-10-17 13:24:46 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						5d4da26ed0 
					 
					
						
						
							
							Revert "[export] preserve_node_meta by default ( #165524 )"  
						
						
						
						
						This reverts commit fdd560afd1d413a9f814cbf7cc2a72e0d39b0117.
Reverted https://github.com/pytorch/pytorch/pull/165524  on behalf of https://github.com/lw  due to test/functorch/test_control_flow.py::TestControlFlowTraced::test_cond_symint_closure [GH job link](https://github.com/pytorch/pytorch/actions/runs/18586312291/job/52991654051) HUD commit fdd560afd1 ([comment](https://github.com/pytorch/pytorch/pull/165524#issuecomment-3415352522))
						
						
					 
					
						2025-10-17 12:27:17 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						574c9fc950 
					 
					
						
						
							
							Revert "Remove torch.serialization entries from the doc ignore list ( #160224 )"  
						
						
						
						
						This reverts commit 9fe3b2afbeff12080b483af1ee23e1c9d9fb0421.
Reverted https://github.com/pytorch/pytorch/pull/160224  on behalf of https://github.com/lw  due to [GH job link](https://github.com/pytorch/pytorch/actions/runs/18588004962/job/52997748336) HUD commit 9fe3b2afbe ([comment](https://github.com/pytorch/pytorch/pull/160224#issuecomment-3415345175))
						
						
					 
					
						2025-10-17 12:24:08 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						80d2ca7566 
					 
					
						
						
							
							Revert "[annotate] add annotate_fn function decorator ( #165703 )"  
						
						
						
						
						This reverts commit f1d882212afc3a73ce1e319d80b6406f9dc4a0c8.
Reverted https://github.com/pytorch/pytorch/pull/165703  on behalf of https://github.com/lw  due to [GH job link](https://github.com/pytorch/pytorch/actions/runs/18585518705/job/52989521797) HUD commit f1d882212a ([comment](https://github.com/pytorch/pytorch/pull/165703#issuecomment-3415073467))
						
						
					 
					
						2025-10-17 11:23:13 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						4a22139eea 
					 
					
						
						
							
							[MPS][BE] Fix unused variable warning ( #165726 )  
						
						
						
						
						Namely this one
```
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/Shape.metal:19:18: warning: unused variable 'output_sizes' [-Wunused-variable]
  constant auto& output_sizes = shared_params.output_sizes;
                 ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/Shape.metal:85:1: note: in instantiation of function template specialization 'cat<long, float, float>' requested here
REGISTER_CAT_FOR_INDEX_TYPE(int64_t);
^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/Shape.metal:69:3: note: expanded from macro 'REGISTER_CAT_FOR_INDEX_TYPE'
  REGISTER_CAT_OP_ALL_INPUT_TYPES(I, float);  \
  ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/Shape.metal:55:3: note: expanded from macro 'REGISTER_CAT_OP_ALL_INPUT_TYPES'
  REGISTER_CAT_OP(I, float, T_out);               \
  ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/Shape.metal:47:15: note: expanded from macro 'REGISTER_CAT_OP'
  kernel void cat<I, T_in, T_out>(                               \
```
Repeated about 20-30 times
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165726 
Approved by: https://github.com/Skylion007  
						
						
					 
					
						2025-10-17 11:16:21 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						cb6e4d7d82 
					 
					
						
						
							
							User-passed alpha to scaled_gemm ( #165563 )  
						
						
						
						
						Summary:
Add optional user-passed `alpha` argument to
`at::cuda::blas::scaled_gemm`, necessary for two-level-scaled NVFP4 gemm
calls (where the global de-scales are folded into the `alpha` argument).
Global de-scales are naturally device tensors, but using cuBLAS's
device-pointer mode for `alpha`/`beta` has an interesting lifetime
implication - the `alpha` tensor must be valid & correct until the end
of the matmul call, *not* just the launch (as for host values). To
enable this, I added device-constant memory for `one` and `zero`, along
with a statically-held single-fp32-value tensor, which is valid from the
first passed-`alpha` invocation of `scaled_gemm` to the end of the
program. User-passed values are copied into this perpetual buffer to
ensure lifetime requirements are met.
Test Plan:
Reviewers:
Subscribers:
Tasks:
Tags:
Signed-off-by: Simon Layton <simonlayton@meta.com >
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165563 
Approved by: https://github.com/drisspg , https://github.com/eqy  
						
						
					 
					
						2025-10-17 09:42:33 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						202f83dc4e 
					 
					
						
						
							
							[ROCm][layer_norm] Use __builtin_amdgcn_rcpf(x) instead of 1.f/x ( #165589 )  
						
						
						
						
						Replace the (more) exact calculation with a hardware approximation.
Benefits:
- Reduced code size.
- Improved performance for certain scenarios.
Experiments show a small reduction in precision.
Experiments show no significant performance regressions. bfloat16- and float16-related calculations may benefit substantially from this change.
Co-author: @mhalk @amd-hhashemi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165589 
Approved by: https://github.com/jeffdaily  
						
						
					 
					
						2025-10-17 09:12:30 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						9fe3b2afbe 
					 
					
						
						
							
							Remove torch.serialization entries from the doc ignore list ( #160224 )  
						
						
						
						
						Follows the approach taken in #158581 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160224 
Approved by: https://github.com/janeyx99  
						
						
					 
					
						2025-10-17 09:06:09 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						d0c24b392c 
					 
					
						
						
							
							[APF Logging][Error Trait] To fill the errorTraits for ChildFailedError with signal abort (re-attempt of  #165476 ) ( #165688 )  
						
						
						
						
						**Summary**
Land @guoding83128 's PR https://github.com/pytorch/pytorch/pull/165476  on his behalf due to EasyCLA blocking.
Refer to his original PR for details. In short, elastic leaves 'errorTraits' as unknown when the error dump file is missing;
this PR adds a "system terminated error" for such cases so that the internal scuba table can aggregate correctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165688 
Approved by: https://github.com/fduwjj  
						
						
					 
					
						2025-10-17 08:23:27 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						b44fb14906 
					 
					
						
						
							
							Remove unused parameter when query extension attribute ( #165623 )  
						
						
						
						
						# Motivation
This code is no longer needed since SYCL compiler 2025.0. We are now using compiler 2025.2 (two tool uplifts later), so it can be safely removed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165623 
Approved by: https://github.com/EikanWang 
ghstack dependencies: #165622  
						
						
					 
					
						2025-10-17 08:16:13 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						51348c0219 
					 
					
						
						
							
							Give a friendly message for older Intel GPU ( #165622 )  
						
						
						
						
						# Motivation
Notify the user if the GPU is older than officially supported. This provides a friendly warning that the GPU may work, but the experience could be unstable.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165622 
Approved by: https://github.com/EikanWang  
						
						
					 
					
						2025-10-17 08:16:13 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						fdd560afd1 
					 
					
						
						
							
							[export] preserve_node_meta by default ( #165524 )  
						
						
						
						
						Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165524 
Approved by: https://github.com/malaybag  
						
						
					 
					
						2025-10-17 07:55:28 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						e925dfcc6b 
					 
					
						
						
							
							Enable all SIM rules except disabled ones ( #164645 )  
						
						
						
						
						`SIM` rules are useful for simplifying boolean expressions and enhancing code readability.
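As an illustration of the kind of rewrite these rules suggest, here is a small Python sketch. The functions are hypothetical and not taken from the PyTorch codebase, and the rule codes in the comments (SIM103, SIM118) follow the flake8-simplify numbering as commonly documented.

```python
def is_adult_verbose(age: int) -> bool:
    # Style flagged by SIM103: an if/else that only returns True or False.
    if age >= 18:
        return True
    else:
        return False


def is_adult(age: int) -> bool:
    # Simplified form: return the condition directly.
    return age >= 18


def has_key_verbose(config: dict, key: str) -> bool:
    # Style flagged by SIM118: membership test against .keys().
    return key in config.keys()


def has_key(config: dict, key: str) -> bool:
    # Simplified form: test membership on the dict itself.
    return key in config
```

Each simplified function behaves the same as its verbose counterpart; only the form changes.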
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164645 
Approved by: https://github.com/ezyang , https://github.com/mlazos  
						
						
					 
					
						2025-10-17 07:27:11 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						f1d882212a 
					 
					
						
						
							
							[annotate] add annotate_fn function decorator ( #165703 )  
						
						
						
						
						Example usage:
```
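import torch.fx.traceback as fx_traceback
import torch.nn as nn


@fx_traceback.annotate_fn({"pp_stage": 1})
def example_function(x):
    return x * x


class SimpleLinear(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(3, 2)

    def forward(self, x):
        # Ops inside this context are annotated with pp_stage 0.
        with fx_traceback.annotate({"pp_stage": 0}):
            y = self.linear(x)
        # example_function carries its own annotation via the decorator.
        y = example_function(y)
        return y - 1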
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165703 
Approved by: https://github.com/SherlockNoMad  
						
						
					 
					
						2025-10-17 07:18:47 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						24879f0de9 
					 
					
						
						
							
							[dynamo] Use Variable Builder to build the property fget object ( #165683 )  
						
						
						
						
						Pull Request resolved: https://github.com/pytorch/pytorch/pull/165683 
Approved by: https://github.com/ezyang , https://github.com/williamwen42  
						
						
					 
					
						2025-10-17 06:29:24 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						9e94ec76b8 
					 
					
						
						
							
							Revert "Turn some const variables into constexpr in C++ code ( #165401 )"  
						
						
						
						
						This reverts commit 5b2afe4c5dc87786ca65bf22ca9a78f7c21a33a4.
Reverted https://github.com/pytorch/pytorch/pull/165401  on behalf of https://github.com/seemethere  due to This is breaking test/distributions/test_distributions.py::TestDistributions::test_binomial_sample on HUD, see 5b2afe4c5d ([comment](https://github.com/pytorch/pytorch/pull/165401#issuecomment-3414023134))
						
						
					 
					
						2025-10-17 06:14:09 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						364624e209 
					 
					
						
						
							
							[codemod][lowrisk] Remove unused exception parameter from some files ( #165700 )  
						
						
						
						
						Summary:
`-Wunused-exception-parameter` has identified an unused exception parameter. This diff removes it.
This:
```
try {
    ...
} catch (exception& e) {
    // no use of e
}
```
should instead be written as
```
} catch (exception&) {
```
If the code compiles, this is safe to land.
Test Plan: Sandcastle
Differential Revision: D84868162
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165700 
Approved by: https://github.com/Skylion007  
						
						
					 
					
						2025-10-17 05:30:06 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						7e150467f7 
					 
					
						
						
							
							allow providing full fr trace path ( #165639 )  
						
						
						
						
						Summary:
- allow users to specify the full path instead of having fr append the rank id as a suffix
- this will be used by torchft to provide the global rank id across all replicas
- we can't just prefix the replica id because the analysis tool expects the file name to provide a unique integer
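The naming choice described in the list above can be sketched roughly as follows. This is only an illustration in Python: `dump_file_name`, its parameters, the `/tmp/fr_trace_` prefix, and the replica layout are all hypothetical and do not reflect the actual flight recorder code.

```python
from typing import Optional


def dump_file_name(prefix: str, rank: int, full_path: Optional[str] = None) -> str:
    """Pick the file a rank writes its trace to (hypothetical helper)."""
    if full_path is not None:
        # The caller supplies the complete path, for example one ending in a
        # rank id that is unique across all replicas.
        return full_path
    # Default behaviour: append the rank id to the prefix.
    return f"{prefix}{rank}"


# Hypothetical layout: replicas of 4 ranks each; this is rank 2 of replica 1.
ranks_per_replica = 4
replica_id, local_rank = 1, 2
global_rank = replica_id * ranks_per_replica + local_rank

print(dump_file_name("/tmp/fr_trace_", local_rank))
print(dump_file_name("/tmp/fr_trace_", local_rank,
                     full_path=f"/tmp/fr_trace_{global_rank}"))
```

With the default naming, rank 2 of every replica would produce the same file name; passing a full path keeps the trailing integer unique across replicas.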
---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com ). Best reviewed with [ReviewStack](https://reviewstack.dev/pytorch/pytorch/pull/165639 ).
* #165638 
* #165640 
* #165677 
* #165642 
* __->__ #165639 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165639 
Approved by: https://github.com/fduwjj  
						
						
					 
					
						2025-10-17 04:43:44 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						43d78423ac 
					 
					
						
						
							
							Pyrefly suppressions 2 ( #165692 )  
						
						... 
						
						
						
						This is the last directory to opt in for the regular mypy.ini file. Will put up a diff to remove unused ignores before making sure we're also type checking all the files in the mypy strict configurations
Test plan:
dmypy restart && python3 scripts/lintrunner.py -a
pyrefly check
step 1: delete lines in the pyrefly.toml file from the project-excludes field
step 2: run pyrefly check
step 3: add suppressions, clean up unused suppressions
before: https://gist.github.com/maggiemoss/4b3bf2037014e116bc00706a16aef199 
after:
INFO 0 errors (6,884 ignored)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165692 
Approved by: https://github.com/oulgen  
						
						
					 
					
						2025-10-17 04:15:25 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						fcbde24c1c 
					 
					
						
						
							
							[ONNX] Remove common imports from torchlib ( #165156 )  
						
						... 
						
						
						
						The Rank and IsScalar functions are no longer used in the torchlib. Requires onnxscript v0.5.4
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165156 
Approved by: https://github.com/Skylion007 , https://github.com/cyyever  
						
						
					 
					
						2025-10-17 03:25:34 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						861cdb887b 
					 
					
						
						
							
							use statically_known_leq & *=2 instead of bound_sympy in persistent rblock ( #165657 )  
						
						... 
						
						
						
						While these should be equivalent, we've found instances where they are not, which caused an error. Update until we figure out the underlying issue.
Differential Revision: [D84835898](https://our.internmc.facebook.com/intern/diff/D84835898 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165657 
Approved by: https://github.com/bobrenjc93  
						
						
					 
					
						2025-10-17 02:48:03 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						3154482072 
					 
					
						
						
							
							[CUDA][cuBLAS] Only xFail addmm with reduced precision reductions on non-RTX skus ( #165379 )  
						
						... 
						
						
						
						RTX Blackwells don't behave quite like their datacenter counterparts here
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165379 
Approved by: https://github.com/Skylion007  
						
						
					 
					
						2025-10-17 02:45:07 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						9fccbdd4f0 
					 
					
						
						
							
							Fix incorrect function signature in template ( #165567 )  
						
						... 
						
						
						
						Summary:
In https://github.com/pytorch/pytorch/pull/148305  we refactored the grid
argument out, but it's not reflected in our template.
Test Plan:
Included in commit.
python test/inductor/test_aot_inductor.py
AOTInductorTestABICompatibleGpu.test_cond_symint_input_disable_one_pass_cuda
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165567 
Approved by: https://github.com/desertfire  
						
						
					 
					
						2025-10-17 02:40:56 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						7dabfb07cb 
					 
					
						
						
							
							[torchfuzz] add support for --stop-at-first-failure flag ( #165529 )  
						
						... 
						
						
						
						Pull Request resolved: https://github.com/pytorch/pytorch/pull/165529 
Approved by: https://github.com/pianpwk 
ghstack dependencies: #164749  
						
						
					 
					
						2025-10-17 02:18:07 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						d0add0be43 
					 
					
						
						
							
							[torchfuzz] check in some more ignore regexes ( #164749 )  
						
						... 
						
						
						
						Pull Request resolved: https://github.com/pytorch/pytorch/pull/164749 
Approved by: https://github.com/pianpwk  
						
						
					 
					
						2025-10-17 02:18:07 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						11e2084308 
					 
					
						
						
							
							Revert "[Mem Snapshot] Add Metadata Field ( #165490 )"  
						
						... 
						
						
						
						This reverts commit 5b3ea758951558e7d9f681ae784acb57eaa07910.
Reverted https://github.com/pytorch/pytorch/pull/165490  on behalf of https://github.com/pytorch-auto-revert  due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/165490#issuecomment-3413491091 )) 
						
						
					 
					
						2025-10-17 02:01:53 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						9726553653 
					 
					
						
						
							
							[BE][Ez]: Use sys.executable instead of hardcoded Python ( #165679 )  
						
						... 
						
						
						
						Handles an edge case to ensure the proper interpreter is called. Inspired by #165633
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165679 
Approved by: https://github.com/FindHao  
						
						
					 
					
						2025-10-17 01:07:40 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						d82527b32a 
					 
					
						
						
							
							[Windows] Add AOTI cross-compilation CI ( #165573 )  
						
						... 
						
						
						
						Pull Request resolved: https://github.com/pytorch/pytorch/pull/165573 
Approved by: https://github.com/malfet 
ghstack dependencies: #165560  
						
						
					 
					
						2025-10-17 01:05:35 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						5d9b024276 
					 
					
						
						
							
							Add mingw to docker ( #165560 )  
						
						... 
						
						
						
						Add mingw to `pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11` docker image to support AOTI cross-compilation
This PR will trigger a docker container rebuild and upgrade the Python version from 3.13.7 to 3.13.8. It relies on https://github.com/pytorch/pytorch/pull/165667
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165560 
Approved by: https://github.com/malfet  
						
						
					 
					
						2025-10-17 00:47:01 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						5b2afe4c5d 
					 
					
						
						
							
							Turn some const variables into constexpr in C++ code ( #165401 )  
						
						... 
						
						
						
						This PR checks the C++ code and turns some const variables into constexpr.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165401 
Approved by: https://github.com/Skylion007  
						
						
					 
					
						2025-10-17 00:40:11 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						b2953f5643 
					 
					
						
						
							
							[9/N] Apply ruff UP035 rule ( #165515 )  
						
						... 
						
						
						
						This is a follow-up to #165214 to continue applying the ruff UP035 rule to the code base.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165515 
Approved by: https://github.com/Lucaskabela  
						
						
					 
					
						2025-10-17 00:09:51 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						470e2f61c3 
					 
					
						
						
							
							Revert "[Fix] Use sys.executable instead of hardcoded python ( #165633 )"  
						
						... 
						
						
						
						This reverts commit 37f3ba274a8ccebc6b3409f52cf068a8b23617d4.
Reverted https://github.com/pytorch/pytorch/pull/165633 on behalf of https://github.com/malfet due to Looks like it broke test_collect_callgrind in slow workflows, see e0fe37fa68/1 ([comment](https://github.com/pytorch/pytorch/pull/165633#issuecomment-3413290813))
						
						
					 
					
						2025-10-17 00:06:40 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						e0fe37fa68 
					 
					
						
						
							
							[MPS] Move torch.cat impl to Metal ( #165373 )  
						
						... 
						
						
						
						After this change, all of the cases tested in [this performance measurement script](10de64c5ac/cat/perf0.py)
Fixes #165350
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165373 
Approved by: https://github.com/kulinseth , https://github.com/malfet  
						
						
					 
					
						2025-10-17 00:03:04 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						d2c82bafb7 
					 
					
						
						
							
							Revert "158232  Fix autocast cache incorrectly retaining no_grad state ( #165068 )"  
						
						... 
						
						
						
						This reverts commit 5daef30b26b794d237fbbc399c1d47ec0380200a.
Reverted https://github.com/pytorch/pytorch/pull/165068 on behalf of https://github.com/jeffdaily due to This broke ROCm CI. test/test_transformers.py::TestTransformersCUDA::test_transformerencoder_fastpath_use_torchscript_False_enable_nested_tensor_True_use_autocast_True_d_model_256_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/18572589089/job/52952074008) [HUD commit link](5daef30b26) ([comment](https://github.com/pytorch/pytorch/pull/165068#issuecomment-3413184445))
						
						
					 
					
						2025-10-16 23:08:27 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						98a488c9aa 
					 
					
						
						
							
							Start recording inductor provenance ( #162669 )  
						
						... 
						
						
						
						Summary:
This stores information on where fx graphs come from, which makes it
significantly easier to debug.
One outstanding question
1) I only stored the kernel stack traces, do we also want the node mappings?
Test Plan:
I wrote an explicit logging test which makes a module, fx traces it, compiles it, and makes sure the logging information shows up.
```
clr@devvm17763 ~/fbsource/fbcode/caffe2/test/dynamo
 % buck2 test @//mode/opt fbcode//caffe2/test/dynamo:test_dynamo -- test_utils
File changed: fbsource//xplat/caffe2/test/dynamo/test_utils.py
File changed: fbcode//caffe2/test/dynamo/test_utils.py
Buck UI: https://www.internalfb.com/buck2/528dea32-2416-4a62-a1ec-39f3c0efdd2e 
Test UI: https://www.internalfb.com/intern/testinfra/testrun/13229324015574003 
Network: Up: 0B  Down: 0B
Executing actions. Remaining     0/2
Command: test.
Time elapsed: 17.3s
Tests finished: Pass 16. Fail 0. Fatal 0. Skip 0. Build failure 0
```
Rollback Plan:
Differential Revision: D82037582
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162669 
Approved by: https://github.com/yushangdi  
						
						
					 
					
						2025-10-16 23:05:31 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						5b3ea75895 
					 
					
						
						
							
							[Mem Snapshot] Add Metadata Field ( #165490 )  
						
						... 
						
						
						
						Summary:
The implementation adds the ability to:
Set custom metadata strings that will be attached to all subsequent allocations
Clear or change the metadata at any point
View the metadata in memory snapshots via _dump_snapshot()
Test Plan: Added test in test_cuda.py and check manually in snapshot to see that metadata was added.
Differential Revision: D84654933
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165490 
Approved by: https://github.com/yushangdi  
						
						
					 
					
						2025-10-16 22:54:27 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						556fc09a9f 
					 
					
						
						
							
							[DebugMode][1/N] refactor logs into _DebugCalls ( #165376 )  
						
						... 
						
						
						
						Pull Request resolved: https://github.com/pytorch/pytorch/pull/165376 
Approved by: https://github.com/SherlockNoMad  
						
						
					 
					
						2025-10-16 22:43:52 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						ce109b3f79 
					 
					
						
						
							
							Add torch.backends.mkldnn.is_acl_available() method ( #165678 )  
						
						... 
						
						
						
						That tells whether or not PyTorch was compiled with Arm Compute Library
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165678 
Approved by: https://github.com/Skylion007 , https://github.com/atalman , https://github.com/albanD 
ghstack dependencies: #165583 , #165584 , #165676  
						
						
					 
					
						2025-10-16 22:34:21 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						4d833f859b 
					 
					
						
						
							
							[BE] [CI] Fix aarch64 arch checks ( #165676 )  
						
						... 
						
						
						
						Instead of relying on the `TEST_CONFIG` environment variable to contain `aarch64`, which is prone to errors, use the output of `$(uname -m)`, which is equal to `aarch64` on Linux ARM systems
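For illustration only, the same check can be sketched in Python. This is not the CI script from the PR; `is_aarch64` is a hypothetical helper, and it assumes `platform.machine()` reports the same machine string as `uname -m` on Linux.

```python
import platform
from typing import Optional


def is_aarch64(machine: Optional[str] = None) -> bool:
    """Return True when the reported machine type is 'aarch64'.

    `machine` can be passed explicitly for testing; by default the value
    comes from the OS rather than from a job-name-like environment variable.
    """
    if machine is None:
        # On Linux this is the machine field that `uname -m` prints.
        machine = platform.machine()
    return machine == "aarch64"


if is_aarch64():
    print("running on a Linux ARM (aarch64) host")
else:
    print("not an aarch64 host")
```

Asking the OS avoids the failure mode where a configuration string happens to contain, or omit, the substring `aarch64`.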
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165676 
Approved by: https://github.com/huydhn , https://github.com/atalman 
ghstack dependencies: #165583 , #165584  
						
						
					 
					
						2025-10-16 22:19:53 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						d7e275d4b4 
					 
					
						
						
							
							[CI][CUDA] Add periodic b200 distributed job ( #159323 )  
						
						... 
						
						
						
						1. Run distributed job with B200 runner, periodically.
2. Discovered a generic distributed test issue: certain unit tests hard-code ranks, which calls for a require_exact_world_size(world_size) API instead of require_world_size(world_size).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159323 
Approved by: https://github.com/eqy 
Co-authored-by: Aidyn-A <aidyn.b.aitzhan@gmail.com > 
						
						
					 
					
						2025-10-16 21:54:04 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						d5db3aee0d 
					 
					
						
						
							
							[CI] Use 1-GPU runners for rocm-mi355.yml ( #165658 )  
						
						... 
						
						
						
						Should only need 1-GPU runners for rocm-mi355.yml since it runs `default` test config which only needs 1 GPU
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165658 
Approved by: https://github.com/jeffdaily  
						
						
					 
					
						2025-10-16 21:53:22 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						5641de7b6b 
					 
					
						
						
							
							Add suppressions for _inductor/codegen ( #165659 )  
						
						... 
						
						
						
						Adds suppressions so that pyrefly will typecheck clean: https://github.com/pytorch/pytorch/issues/163283
Test plan:
dmypy restart && python3 scripts/lintrunner.py -a
pyrefly check
step 1: delete lines in the pyrefly.toml file from the project-excludes field
step 2: run pyrefly check
step 3: add suppressions, clean up unused suppressions
before: https://gist.github.com/maggiemoss/4b3bf2037014e116bc00706a16aef199 
after:
INFO 0 errors (6,884 ignored)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165659 
Approved by: https://github.com/oulgen  
						
						
					 
					
						2025-10-16 21:37:37 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						cbc08c8993 
					 
					
						
						
							
							Add NEON acceleration for Vectorized<int[8|16|32|64]> ( #165273 )  
						
						... 
						
						
						
						Summary:
Adding NEON specializations of Vectorized<T> for int8, int16, int32 and int64.
Correctness has been checked using test_ops.py and the comprehensive torch test.
operator_benchmark_test.py has been enhanced by adding cases of bitwise operations, boolean ops and integer ops.
The benchmark, which uses the PyTorch API, shows significant enhancements in a wide variety of operations:
Before:
bitwise xor: 779.882us
boolean any: 636.209us
boolean all: 538.621us
integer mul: 304.457us
integer asr: 447.997us
After:
bitwise xor: 680.221us ---> 15% higher throughput
boolean any: 391.468us ---> 63% higher throughput
boolean all: 390.189us ---> 38% higher throughput
integer mul: 193.532us ---> 57% higher throughput
integer asr: 179.929us ---> 149% higher throughput
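The percentages above are consistent with reading "higher throughput" as `before / after - 1`. A small sketch that recomputes them from the reported timings (the dictionary and function names are illustrative, not part of the benchmark):

```python
# Timings in microseconds, copied from the Before/After lists above.
before_us = {
    "bitwise xor": 779.882,
    "boolean any": 636.209,
    "boolean all": 538.621,
    "integer mul": 304.457,
    "integer asr": 447.997,
}
after_us = {
    "bitwise xor": 680.221,
    "boolean any": 391.468,
    "boolean all": 390.189,
    "integer mul": 193.532,
    "integer asr": 179.929,
}


def throughput_gain_percent(before: float, after: float) -> float:
    """Throughput is proportional to 1/time, so the gain is before/after - 1."""
    return (before / after - 1.0) * 100.0


gains = {
    op: round(throughput_gain_percent(before_us[op], after_us[op]))
    for op in before_us
}

for op, gain in gains.items():
    print(f"{op}: {gain}% higher throughput")
```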
Test Plan:
Correctness:
buck2 test @mode/opt //caffe2/test:test_ops
buck2 test @mode/opt //caffe2/test:torch
buck2 test @mode/opt //caffe2/test/distributed/launcher/fb:fb_run_test
Performance:
buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test
Differential Revision: D84424638
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165273 
Approved by: https://github.com/malfet  
						
						
					 
					
						2025-10-16 21:35:13 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						1a54d3333d 
					 
					
						
						
							
							[easy] Fix graph_capture in aot_joint_with_descriptors test ( #165660 )  
						
						... 
						
						
						
						When `with_export=True`, `aot_export_joint_with_descriptors` should take the graph produced by `_dynamo_graph_capture_for_export`.
```
python test/functorch/test_aot_joint_with_descriptors.py -k test_preserve_annotate_simple
python test/functorch/test_aot_joint_with_descriptors.py -k test_preserve_annotate_flex_attention
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165660 
Approved by: https://github.com/yushangdi  
						
						
					 
					
						2025-10-16 21:10:11 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						4c1c341fa0 
					 
					
						
						
							
							FakeTensorMode shouldn't cache syms when tracing ( #164718 )  
						
						... 
						
						
						
						Improve FakeTensor cache to handle SymNode and tracing properly.
For now, when we're proxy tracing just don't bother caching operations that contain SymNodes in the output. The problem is that the proxy tracer relies on SymNode identity and our cache doesn't preserve that. It can be fixed (and I left some notes in _validate_symbolic_output_for_caching() how) but it's not worth it for now.
If we aren't proxy tracing then caching is fine.
Thus these changes:
1. Our cache key needs to include whether we were actively tracing or not - this way if we create a cache entry when we weren't tracing and then we try to use it when we ARE tracing it gets rerun.
2. If there's a SymNode in the output while we are tracing then bypass caching.
3. Some general cleanup of the output validation - we were unnecessarily doing it as a two-step process when it could just be a single step (it's still two parts internally but only a single outer try/except).
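The first two points can be illustrated with a toy cache. This is a simplified sketch, not the FakeTensorMode implementation: `Sym` and `ToyOpCache` are hypothetical stand-ins, and the only ideas taken from the description are that the tracing state is part of the cache key and that symbolic outputs are not cached while tracing.

```python
from typing import Any, Callable, Dict, Tuple


class Sym:
    """Hypothetical stand-in for a symbolic value whose identity matters."""

    def __init__(self, name: str) -> None:
        self.name = name


class ToyOpCache:
    def __init__(self) -> None:
        self._entries: Dict[Tuple[str, Tuple[Any, ...], bool], Any] = {}
        self.hits = 0

    def run(
        self,
        op_name: str,
        fn: Callable[..., Any],
        args: Tuple[Any, ...],
        tracing: bool,
    ) -> Any:
        # The tracing flag is part of the key, so an entry created while not
        # tracing is never reused while tracing (and vice versa).
        key = (op_name, args, tracing)
        if key in self._entries:
            self.hits += 1
            return self._entries[key]
        out = fn(*args)
        # While tracing, a symbolic output must keep its identity, so the
        # result is returned without being stored.
        if tracing and isinstance(out, Sym):
            return out
        self._entries[key] = out
        return out


cache = ToyOpCache()
cache.run("add", lambda a, b: a + b, (1, 2), tracing=False)  # miss, stored
cache.run("add", lambda a, b: a + b, (1, 2), tracing=False)  # hit
cache.run("add", lambda a, b: a + b, (1, 2), tracing=True)   # new key, rerun
print(cache.hits)
```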
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164718 
Approved by: https://github.com/bobrenjc93 
ghstack dependencies: #165266 , #164717  
						
						
					 
					
						2025-10-16 20:57:07 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						5f21cc786a 
					 
					
						
						
							
							Teach ProxyTorchDispatchMode how to decompose sympy.Expr into known inputs ( #164717 )  
						
						... 
						
						
						
						In a training library we hit a weird conflict between dtensor, dynamic shapes, and proxy tensor.
The problem is occurring because in sharding_prop we use FakeTensors to compute an operation size (so we don't have to use the full "real" data). We turn off proxy tracing while we're doing that because we don't want the FakeTensor ops to end up in the graph. We then use that size when doing later operations.
Normally this is no problem - but when those sizes are dynamic shapes then we have a problem - the proxy tracer wants to track the provenance of all shape operations (`s1*s2`) but since tracing is disabled it doesn't see the operation and when we then use the result shape later on the proxy tracer gets all confused (because the SymNode appeared out of nowhere).
At first we were thinking to never disable shape tracing - but that caused a slew of other downstream problems (lots of code that actually needs the shape tracing to be disabled) so instead we enable having a "sym tracing override" and surgically when we disable proxy tracing we leave shape tracing enabled.
After this change the dtensor embedding is "fixed" but then runs afoul of a FakeTensor cache bug - which is fixed in the next PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164717 
Approved by: https://github.com/bobrenjc93 , https://github.com/ezyang 
ghstack dependencies: #165266  
						
						
					 
					
						2025-10-16 20:57:06 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						e86942f422 
					 
					
						
						
							
							minor proxy_tensor reorg ( #165266 )  
						
						... 
						
						
						
						Moving some code around in proxy_tensor in preparation for the next PR. There were
no actual changes (other than simple relabeling such as `self.tracer` ->
`tracer`):
- Move _compute_proxy() out of ProxyTorchDispatchMode.
- Give `sympy_expr_tracker` a structured type instead of `object`.
- Split SymNode registration out of ProxyTorchDispatchMode.__sym_dispatch__() so
  it can be reused.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165266 
Approved by: https://github.com/ezyang , https://github.com/mlazos  
						
						
					 
					
						2025-10-16 20:57:06 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						2cd5fd1588 
					 
					
						
						
							
							Enable local tensor mode on DTensor view ops test ( #165596 )  
						
						... 
						
						
						
						While enabling this test we discovered a lack of support for sub meshes. Added limited support
for sub meshes by properly computing rank coordinates for a given sub mesh. The implementation
follows a similar approach to collectives. We infer all sub meshes for the given dimensions and
compute each rank's coordinates with respect to its sub mesh.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165596 
Approved by: https://github.com/ezyang  
						
						
					 
					
						2025-10-16 20:52:06 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						7d0f872cb3 
					 
					
						
						
							
							Use union syntax in torch/_inductor runtime and fx_passes ( #165652 )  
						
						... 
						
						
						
						Pull Request resolved: https://github.com/pytorch/pytorch/pull/165652 
Approved by: https://github.com/aorenste  
						
						
					 
					
						2025-10-16 20:51:59 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						fb06e49ce8 
					 
					
						
						
							
							Revert "[inductor] print 0.0 as 0 for triton ( #164291 )"  
						
						... 
						
						
						
						This reverts commit 99b32a6750bfd0cfe2bc84a47823e1da34802b7b.
Reverted https://github.com/pytorch/pytorch/pull/164291 on behalf of https://github.com/malfet due to Broke slow job, see aba8c43594/1 ([comment](https://github.com/pytorch/pytorch/pull/164291#issuecomment-3412768915))
						
						
					 
					
						2025-10-16 20:44:29 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						27a98e6ae9 
					 
					
						
						
							
							Revert "[DeviceMesh] Prefer using _layout over _mesh for all sorts of things ( #165554 )"  
						
						... 
						
						
						
						This reverts commit d61a9b88cf3be04a29c5a7d6e9622ae5e8d51de3.
Reverted https://github.com/pytorch/pytorch/pull/165554 on behalf of https://github.com/malfet due to Looks like it broke serialization test, see aba8c43594/1 ([comment](https://github.com/pytorch/pytorch/pull/165554#issuecomment-3412765681))
						
						
					 
					
						2025-10-16 20:41:37 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						b10f463b1a 
					 
					
						
						
							
							Revert "[DeviceMesh] Introduce private constructor instead of _create_mesh_from_ranks ( #165555 )"  
						
						... 
						
						
						
						This reverts commit 99097b6d89c927c15180ff4683c38be01f9955f6.
Reverted https://github.com/pytorch/pytorch/pull/165555 on behalf of https://github.com/malfet due to Looks like it broke serialization test, see aba8c43594/1 ([comment](https://github.com/pytorch/pytorch/pull/165554#issuecomment-3412765681))
						
						
					 
					
						2025-10-16 20:41:37 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						431c13cf61 
					 
					
						
						
							
							Revert "[DeviceMesh] Simplify unflatten method ( #165556 )"  
						
						... 
						
						
						
						This reverts commit 86fd4fc23e697e275d37c36e3cbe521f156434fd.
Reverted https://github.com/pytorch/pytorch/pull/165556 on behalf of https://github.com/malfet due to Looks like it broke serialization test, see aba8c43594/1 ([comment](https://github.com/pytorch/pytorch/pull/165554#issuecomment-3412765681))
						
						
					 
					
						2025-10-16 20:41:37 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						aead9270f5 
					 
					
						
						
							
							12/n : Remove fbandroid_compiler_flags ( #165558 )  
						
						... 
						
						
						
						Summary:
Currently `get_c2_fbandroid_xplat_compiler_flags()` is reading the `caffe2.strip_glog` buckconfig which we want to get rid of.
This diff removes the `fbandroid_compiler_flags` arg and merges it with compiler_flags with a nested select and the select version of the method
The goal is to get rid of all the usages of `get_c2_fbandroid_xplat_compiler_flags()` so that we can get rid of the `caffe2.strip_glog` buckconfig
Test Plan: CI
Differential Revision: D84626885
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165558 
Approved by: https://github.com/malfet  
						
						
					 
					
						2025-10-16 20:41:24 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						9bf5b38c14 
					 
					
						
						
							
							[Inductor][Triton][FP8] Refactor scaled_mm template to accept scaling mode ( #164318 )  
						
						... 
						
						
						
						Summary: Refactor `scaled_mm` Inductor template to support template choice based on scaling mode. This modification sets up the infrastructure for adding new templates based on new scaling modes, such as deepseek-style scaling (a follow-up diff), as new scaling modes (deepseek, block, group) scale before the accumulation (as opposed to per-tensor and per-row scaling, which apply scaling after accumulation). This modification also further enables Inductor to infer a scaling type based on the shape of the scaling tensors, which makes existing infrastructure more extensible to new scaling modes.
Test Plan:
```
TORCHINDUCTOR_CACHE_DIR=~/personal/cache_dir_inductor CUDA_LAUNCH_BLOCKING=1 TORCH_USE_CUDA_DSA=1 TRITON_PRINT_AUTOTUNING=1 TRITON_ALWAYS_COMPILE=1 TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 ENABLE_PERSISTENT_TMA_MATMUL=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 buck2 run mode/{opt,inplace} pytorch/tritonbench:run -- --op fp8_gemm --only torch_fp8_gemm,pt2_fp8_gemm --metrics tflops,accuracy --m 256 --n 768 --k 512 --output="/home/jananisriram/personal/random_bench.csv" --scaling_rowwise --atol=20 --rtol=2 2>&1 | tee ~/personal/random.log
```
Differential Revision: D83591083
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164318 
Approved by: https://github.com/drisspg , https://github.com/slayton58  
						
						
					 
					
						2025-10-16 20:40:45 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						aba8c43594 
					 
					
						
						
							
							Register var for MTIA ( #165382 )  
						
						... 
						
						
						
						Summary: Registers variance kernel
Reviewed By: srsuryadev
Differential Revision: D84546250
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165382 
Approved by: https://github.com/malfet  
						
						
					 
					
						2025-10-16 20:35:15 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						37f3ba274a 
					 
					
						
						
							
							[Fix] Use sys.executable instead of hardcoded python ( #165633 )  
						
						... 
						
						
						
						Replace hardcoded "python" string with sys.executable to ensure correct Python interpreter is used. This fixes failures on systems with multiple Python runtimes or where "python" is not in PATH.
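A minimal sketch of the pattern described here, not the code touched by the PR (`run_python_snippet` is a hypothetical helper): the child process is launched through `sys.executable`, so it uses the same interpreter as the parent even when no `python` binary is on PATH.

```python
import subprocess
import sys


def run_python_snippet(code: str) -> str:
    """Run `code` in a child process using the parent's own interpreter."""
    # sys.executable is the absolute path of the running interpreter, which
    # avoids picking up a different runtime named "python" from PATH.
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


print(run_python_snippet("import sys; print(sys.version_info[:2])"))
```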
Similar to pytorch/pytorch#155918 
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165633 
Approved by: https://github.com/Skylion007  
						
						
					 
					
						2025-10-16 20:26:10 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						585b9dbb5e 
					 
					
						
						
							
							[async_tp] Support ag+mm with gather_dim lastdim of mat_A ( #163068 )  
						
						... 
						
						
						
						Adds ag+mm support for the case when gather_dim is the last dim of mat_A (the reduction dim of the matmul).
When we decompose the matmul along the reduction dimension, we end up with partials that need an additional reduction,
so we allocate memory for an accumulator.
The decomposition should not produce small (thin) mms that cannot efficiently load the GPU, so the minimal shard size is limited to 1024 (found empirically by testing in torchtitan).
scaled_mm is not supported yet for this case.
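Why splitting along the reduction dimension needs an accumulator can be shown with a plain-Python sketch. This is only the arithmetic, not the async_tp implementation; the function names are illustrative and the matrices are tiny integer lists.

```python
from typing import List

Matrix = List[List[int]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Reference matmul: a is (m x k), b is (k x n)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def matmul_by_k_shards(a: Matrix, b: Matrix, num_shards: int) -> Matrix:
    """Split the reduction dim k into shards and sum the partial products."""
    k = len(b)
    if k % num_shards != 0:
        raise ValueError("k must be divisible by num_shards")
    shard = k // num_shards
    rows, cols = len(a), len(b[0])
    # Accumulator for the partial results; each shard yields a full (m x n)
    # matrix that still has to be reduced with the others.
    acc = [[0] * cols for _ in range(rows)]
    for s in range(num_shards):
        lo, hi = s * shard, (s + 1) * shard
        a_shard = [row[lo:hi] for row in a]  # columns lo..hi of A
        b_shard = b[lo:hi]                   # rows lo..hi of B
        partial = matmul(a_shard, b_shard)
        for i in range(rows):
            for j in range(cols):
                acc[i][j] += partial[i][j]
    return acc


a = [[1, 2, 3, 4], [5, 6, 7, 8]]
b = [[1, 0], [0, 1], [2, 3], [4, 5]]
print(matmul_by_k_shards(a, b, num_shards=2) == matmul(a, b))
```

Each shard produces a thin matmul over `k / num_shards` columns, which is why a lower bound on the shard size matters for GPU efficiency.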
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163068 
Approved by: https://github.com/ngimel  
						
						
					 
					
						2025-10-16 20:14:39 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						d795fb225a 
					 
					
						
						
							
							[RFC] Add pyrefly to lintrunner ( #165179 )  
						
						... 
						
						
						
						This will add pyrefly to lint runner as a warning only - and allow us to collect feedback about the tool before switching to pyrefly as the main type checker.
References the steps outlined here: https://github.com/pytorch/pytorch/issues/163283
test plan:
`lintrunner init`
`lintrunner`
confirm when pyrefly errors are present results look like: https://gist.github.com/maggiemoss/e6cb2d015dd1ded560ae1329098cf33f 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165179 
Approved by: https://github.com/ezyang  
						
						
					 
					
						2025-10-16 20:07:09 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						7df9aca529 
					 
					
						
						
							
							[ROCm][Windows] Enable AOTriton runtime compile on Windows ( #165538 )  
						
						... 
						
						
						
						AOTriton uses prebuilt runtime binaries if the user's ROCm version matches the ones used to generate the prebuilt runtime. However, since there's no prebuilt runtime available for Windows, this check needs to be bypassed for Windows. This PR enables it by changing condition to always build AOTriton runtime from source on Windows.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165538 
Approved by: https://github.com/xinyazhang , https://github.com/jeffdaily  
						
						
					 
					
						2025-10-16 19:51:43 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						d4a713cd9c 
					 
					
						
						
							
							Change forkserver test to only run below 3.13.8 ( #165667 )  
						
						
						
						
						A multiprocessing bug is fixed in 3.13.8, see [https://docs.python.org/3.13/whatsnew/changelog.html](https://docs.python.org/3.13/whatsnew/changelog.html), [gh-126631](https://github.com/python/cpython/issues/126631)
So this test will fail when we update to Python 3.13.8.
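For illustration only: one way such a version gate can be written with `unittest`. The test class and method names are hypothetical, and the real test body is not shown here.

```python
import sys
import unittest

# The multiprocessing fix is reported for Python 3.13.8 (gh-126631).
FIXED_IN = (3, 13, 8)


def runs_on(version_info=sys.version_info) -> bool:
    """True for interpreters older than the release containing the fix."""
    return tuple(version_info[:3]) < FIXED_IN


class ForkserverBehaviorTest(unittest.TestCase):  # hypothetical test class
    @unittest.skipIf(not runs_on(), "multiprocessing bug is fixed in Python 3.13.8")
    def test_forkserver(self):
        # Placeholder body; the real assertions depend on the old behavior.
        self.assertTrue(runs_on())
```

Comparing tuples rather than version strings keeps the check correct for releases such as 3.13.10.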
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165667 
Approved by: https://github.com/malfet  
						
						
					 
					
						2025-10-16 19:34:10 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						5daef30b26 
					 
					
						
						
							
							158232  Fix autocast cache incorrectly retaining no_grad state ( #165068 )  
						
						
						
						
						Fixes  #158232 
The autocast caching heuristic in `aten/src/ATen/autocast_mode.cpp:139` did not account for gradient mode state when deciding whether to cache. FSDP2 is not directly related.
~~This PR adds `GradMode::is_enabled()` check to caching condition. Caching is now disabled in `no_grad()` contexts to prevent storing tensors with incorrect gradient state. Ensures correctness at the cost of using cache.~~
This PR proposes separate caches for gradient-enabled and gradient-disabled modes.
Adds tests.
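For illustration only: a toy Python model of keeping separate caches per gradient mode. The real cache is implemented in C++ in `aten/src/ATen/autocast_mode.cpp`; the class and method names below are hypothetical.

```python
class GradAwareCastCache:
    """Toy cast cache that keeps one table per gradient mode."""

    def __init__(self):
        # One table for grad-enabled lookups, one for grad-disabled lookups.
        self._tables = {True: {}, False: {}}

    def get_or_cast(self, key, grad_enabled, cast_fn):
        """Return the cached cast for `key`, computing it on first use."""
        table = self._tables[bool(grad_enabled)]
        if key not in table:
            table[key] = cast_fn()
        return table[key]

    def clear(self):
        for table in self._tables.values():
            table.clear()
```

Because the tables are separate, an entry created under `no_grad()` is never returned to a caller that has gradients enabled, while caching stays available in both modes.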
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165068 
Approved by: https://github.com/ngimel , https://github.com/janeyx99  
					
						2025-10-16 19:32:01 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						6dedd34c31 
					 
					
						
						
							
							[CD] Skip 12.9 build on Windows ( #165665 )  
						
						
						
						
						Per title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165665 
Approved by: https://github.com/Camyll , https://github.com/malfet  
						
						
					 
					
						2025-10-16 19:11:27 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						a303d6dda9 
					 
					
						
						
							
							[inductor] don't try to reorder loops for template ( #165601 )  
						
						
						
						
						Fixes https://github.com/pytorch/pytorch/issues/165579 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165601 
Approved by: https://github.com/yushangdi  
						
						
					 
					
						2025-10-16 19:05:21 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						7669ac9402 
					 
					
						
						
							
							[ROCm] Add scaled_mm v2 support. ( #165528 )  
						
						
						
						
						Add MXFP4 support in Blas.cpp.
Update the scale_kernel_dispatch array and the ScaledGemmImplementation enum to include MXFP4 support.
Modify the tests under test_scaled_matmul_cuda accordingly.
`PYTORCH_TEST_WITH_ROCM=1 python test/test_scaled_matmul_cuda.py -v -k test_blockwise`
115 tests passed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165528 
Approved by: https://github.com/jeffdaily  
						
						
					 
					
						2025-10-16 18:36:41 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						86fd4fc23e 
					 
					
						
						
							
							[DeviceMesh] Simplify unflatten method ( #165556 )  
						
						
						
						
						By adding a few small helpers (e.g., a `splice` method to `_MeshLayout`, and making `_init_process_groups` static and thus stateless) we can substantially shorten the definition of the unflatten method, and help readability.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165556 
Approved by: https://github.com/fduwjj 
ghstack dependencies: #165554 , #165555  
						
						
					 
					
						2025-10-16 18:36:16 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						99097b6d89 
					 
					
						
						
							
							[DeviceMesh] Introduce private constructor instead of _create_mesh_from_ranks ( #165555 )  
						
						
						
						
						The refactoring of DeviceMesh is heavily constrained by the signature of its constructor, which is a public API which contains some "legacy" concepts which we'd love to get rid of, such as an explicit/materialized `mesh` Tensor.
In other languages the solution to this would be to add a private overload of the constructor. Python doesn't natively allow this, but in this PR I managed to build something that approximates it.
This new private constructor basically only takes `_layout`, `_global_rank_permutation`, and `mesh_dim_names`.
With such a constructor we can effectively simplify a lot of callsites and get rid of the `_create_mesh_from_ranks` helper method. That's a good thing because it was instantiating many DeviceMeshes in a for loop, which always felt unnecessary.
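For illustration only: one common Python idiom for approximating a private constructor overload, using a classmethod that bypasses the public `__init__`. This is not necessarily the mechanism used in this PR, and `Grid` and all of its attributes are hypothetical stand-ins rather than the DeviceMesh API.

```python
from typing import Optional, Sequence, Tuple


class Grid:
    """Hypothetical class with a public constructor and a private one."""

    def __init__(self, ranks: Sequence[Sequence[int]],
                 dim_names: Optional[Tuple[str, ...]] = None):
        # Public signature: takes an explicit, materialized 2D list of ranks.
        shape = (len(ranks), len(ranks[0]) if ranks else 0)
        flat = tuple(r for row in ranks for r in row)
        self._setup(shape, flat, dim_names)

    @classmethod
    def _from_layout(cls, shape, flat_ranks, dim_names=None):
        # Private "overload": allocate without running the public __init__.
        self = cls.__new__(cls)
        self._setup(tuple(shape), tuple(flat_ranks), dim_names)
        return self

    def _setup(self, shape, flat_ranks, dim_names):
        # Shared initialization used by both construction paths.
        if shape[0] * shape[1] != len(flat_ranks):
            raise ValueError("shape does not match the number of ranks")
        self.shape = shape
        self.flat_ranks = flat_ranks
        self.dim_names = dim_names
```

Internal call sites can then build instances directly from a layout, while the public signature stays unchanged for existing users.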
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165555 
Approved by: https://github.com/fduwjj , https://github.com/fegin 
ghstack dependencies: #165554  
						
						
					 
					
						2025-10-16 18:36:16 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						a214371008 
					 
					
						
						
							
							[FP8] Add other Blackwell compute-capabilities to expected fail test_honor_sm_carveout ( #165159 )  
						
						
						
						
						The CUTLASS SM hint also isn't working on other Blackwell GPUs; a green context is needed for the carveout.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165159 
Approved by: https://github.com/Skylion007  
						
						
					 
					
						2025-10-16 18:35:06 +00:00 
						 
				 
			
				
					
						
					 
					
						
						
							
						
						7d87d7052e 
					 
					
						
						
							
							[inductor][bucketing] Fx collectives bucketing of multiple dtypes ( #162470 )  
						
						
						
						
						Buckets collectives of multiple dtypes so that they are processed in one bucketed collective.
The first target is bucketing bf16 and f32, but it can already be used with other dtypes.
For now, multi-dtype bucketing is only supported in "custom_ops" mode.
Non-custom_ops modes need additional work on the inductor side.
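For illustration only: a standard-library sketch of the general idea of packing values of different dtypes into one byte buffer and recovering them afterwards. It is not the inductor implementation; the helper names are hypothetical, and IEEE half precision (`"e"`) stands in for bf16 because the `struct` module has no bf16 format.

```python
import struct


def pack_bucket(chunks):
    """Pack (format_char, values) chunks into one little-endian byte buffer.

    Returns the payload and a layout of (format_char, count, byte_offset)
    entries needed to split the payload again.
    """
    payload = bytearray()
    layout = []
    for fmt, values in chunks:
        layout.append((fmt, len(values), len(payload)))
        payload += struct.pack(f"<{len(values)}{fmt}", *values)
    return bytes(payload), layout


def unpack_bucket(payload, layout):
    """Recover the per-dtype value lists from a packed payload."""
    return [
        list(struct.unpack_from(f"<{count}{fmt}", payload, offset))
        for fmt, count, offset in layout
    ]
```

A single buffer like this is what would be handed to one collective, with the layout used on the receiving side to restore each dtype's values.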
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162470 
Approved by: https://github.com/eellison  
						
						
					 
					
						2025-10-16 18:31:43 +00:00