This PR replaces all instances of 'pytorch-labs' with 'meta-pytorch' in this repository now that the 'pytorch-labs' org has been renamed to 'meta-pytorch'
## Changes Made
- Replaced all occurrences of 'pytorch-labs' with 'meta-pytorch'
- Only modified files with extensions: .py, .md, .sh, .rst, .cpp, .h, .txt, .yml
- Skipped binary files and files larger than 1MB due to GitHub api payload limits in the script to cover all repos in this org. Will do a more manual second pass later to cover any larger files
## Files Modified
This PR updates files that contained the target text.
Generated by automated script on 2025-08-12T20:41:29.888681+00:00Z
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160459
Approved by: https://github.com/huydhn, https://github.com/clee2000, https://github.com/atalman, https://github.com/malfet
In this PR we will port all distributed pipeline test files.
We could enable Intel GPU with following methods and try the best to keep the original code styles:
1. instantiate_device_type_tests()
2. use "torch.accelerator.current_accelerator()" to determine the accelerator backend
3. use "requires_accelerator_dist_backend()" to replace requires_nccl()
4. use "get_default_backend_for_device()" to get backend
5. enabled XPU for some test path
6. add TEST_MULTIACCELERATOR in common_utils for all backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159033
Approved by: https://github.com/guangyey, https://github.com/d4l3k
Co-authored-by: Daisy Deng <daisy.deng@intel.com>
By adding `addmm` kernel, which is a logical continuation of `mm` one. The only tricking part are how alpha and beta constants are handled, which are passed as `optmath_t`, i.e. that it could be, int64, int32 or float
Unified all MM flavors instantiations thru `INSTANTIATE_MM_OPS` and tested that `addmm` metal kernel works as expected for floating types as well by testing it via
```
PYTORCH_MPS_PREFER_METAL=1 python test/test_mps.py -v -k test_output_match_addmm_mps_
```
Fixes https://github.com/pytorch/pytorch/issues/154901
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160270
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #160228, #160234
Automatically replaces split with rsplit when relevant and only performs the split up to the first ( or last value). This allows early return of the split function and improve efficiency.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160107
Approved by: https://github.com/albanD
The current implementation assumes test functions are resolved as test_module.TestClass.test_fn, however this would not work for modules nested in directories e.g. inductor.test_torchinductor.TestClass.test_fn
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158637
Approved by: https://github.com/jbschlosser
Also migrate `test_common_rules.py` since it was a short file
`python test/distributed/tensor/test_common_rules.py`
Before:
Ran 10 tests in 91.516s
After:
Ran 10 tests in 5.604s
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159885
Approved by: https://github.com/ezyang
This would allow me to enable more opinfo tests against MPS device eventually and supposed to be a very simple test, but actually required minor adjustments to lots of test files, namely:
- Introduce `all_mps_types_and` that is very similar to `all_types_and`, but skips `float64`
- Decorate lots of tests with `@dtypesIfMPS(*all_mps_types())`
- Skip `test_from_dlpack_noncontinguous` as it currently crashes (need to be fixed)
- Add lots of `expectedFailureIfMPS`
- Delete all `@onlyNativeDeviceTypesAnd("mps")`
<sarcasm> I love how well documented this variable are </sarcasm>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153835
Approved by: https://github.com/Skylion007
Hi team,
Please help review this patch.
This PR https://github.com/pytorch/pytorch/pull/150370 tried to fix the "Empty C Call Queue" problem on Python 3.12. It added C calls for each starting Python event with a callable.
I found the root cause is not that we cannot get C function frames by `PyFrame_GetBack` when PythonTracer is filling start frames, but the c call event loss problem bug on Python 3.12.0-3.12.4. And that problem was fixed by 257c413cd1 on 3.12.5.
So I think the https://github.com/pytorch/pytorch/pull/150370 cannot fix the problem, this patch reverts the change of it.
There are solutions to fix the problem correctly, such as we can add a new monitoring callback to compensate call events of methods with C function or we can override the callback registered by `PyEval_SetProfile`. These solutions may make the code hard to maintain.
~~Since upgrading the micro version of Python is not difficult for users, we can just ignore C functions and suggest user upgrade.~~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155446
Approved by: https://github.com/sraikund16
Hi team,
Please help review this patch.
This PR https://github.com/pytorch/pytorch/pull/150370 tried to fix the "Empty C Call Queue" problem on Python 3.12. It added C calls for each starting Python event with a callable.
I found the root cause is not that we cannot get C function frames by `PyFrame_GetBack` when PythonTracer is filling start frames, but the c call event loss problem bug on Python 3.12.0-3.12.4. And that problem was fixed by 257c413cd1 on 3.12.5.
So I think the https://github.com/pytorch/pytorch/pull/150370 cannot fix the problem, this patch reverts the change of it.
There are solutions to fix the problem correctly, such as we can add a new monitoring callback to compensate call events of methods with C function or we can override the callback registered by `PyEval_SetProfile`. These solutions may make the code hard to maintain.
~~Since upgrading the micro version of Python is not difficult for users, we can just ignore C functions and suggest user upgrade.~~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155446
Approved by: https://github.com/sraikund16
cuBLAS added support for them in CUDA 12.9. It's rather easy to call into them, the hardest thing is allowing the lhs and rhs operands to have different scaling types, as that changes the whole callstack.
The scaling format is still detected from the sizes of the scale tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158037
Approved by: https://github.com/eqy, https://github.com/drisspg
Added `torch.hash_tensor` reduction function with a `mode` argument that defaults to reduction with xor.
- The hash is always uint64.
- Integers will be casted to uint64 before performing the xor_sum reduction
- Floats will be upcasted to double and then bitcasted to uint64 before performing the xor_sum reduction
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154149
Approved by: https://github.com/albanD
Hi team,
Please help review this patch.
This PR https://github.com/pytorch/pytorch/pull/150370 tried to fix the "Empty C Call Queue" problem on Python 3.12. It added C calls for each starting Python event with a callable.
I found the root cause is not that we cannot get C function frames by `PyFrame_GetBack` when PythonTracer is filling start frames, but the c call event loss problem bug on Python 3.12.0-3.12.4. And that problem was fixed by 257c413cd1 on 3.12.5.
So I think the https://github.com/pytorch/pytorch/pull/150370 cannot fix the problem, this patch reverts the change of it.
There are solutions to fix the problem correctly, such as we can add a new monitoring callback to compensate call events of methods with C function or we can override the callback registered by `PyEval_SetProfile`. These solutions may make the code hard to maintain.
~~Since upgrading the micro version of Python is not difficult for users, we can just ignore C functions and suggest user upgrade.~~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155446
Approved by: https://github.com/sraikund16, https://github.com/cyyever
An out of tree backend can have its own configuration options that the user can enable to control inductor compilation. These config options need to be taken into account when calculating the key that is used to determine cache miss / hits. This PR allows out of tree backends to specify a custom config module that has the same type as `torch._inductor.config` that can be used to control codegen (in addition to the default config), and will be used when creating the cache key.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158254
Approved by: https://github.com/eellison
Avoid failures caused by tests exiting via sys.exit instead of `unittest.skip`
In particular it will not try to start the test (causing forks into subprocess) just to stop them (killing the subprocess) which is done in the test setup
Using `unittest.skip` decorators avoids the starting of the test in the first place.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158846
Approved by: https://github.com/Skylion007
### Description
This PR is to enable TF32 as fp32 internal precision for matmul/linear/conv in `mkldnn backend`. Since we have refined fp32 precision API in https://github.com/pytorch/pytorch/pull/125888, we can easily extend the API to support TF32 for `mkldnn backend`.
```
torch.backends.mkldnn.matmul.fp32_precision = 'tf32'
torch.backends.mkldnn.conv.fp32_precision = "tf32"
```
Related kernel update and UTs update are done. And the wrapper `bf32_on_and _off` is updated to `reduced_f32_on_and_off`, and it can run tests 3 times, one is reduced_f32 OFF, the other two are reduced_f32 ON (including `bf32 ON` and `tf32 ON`).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157520
Approved by: https://github.com/mingfeima, https://github.com/jansel
That fixes `index_put(..., accumulate=True)` for all dtypes
int64 operation is not really atomic, but eventually consistent from the `index_put_accumulate` kernel point of view: i.e. by the end of the operation results in the global memory are indeed accumulation of the operands at given indices
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158179
Approved by: https://github.com/dcci, https://github.com/Skylion007
ghstack dependencies: #158064, #158178
Move `MetalShaderLibrary::bind_tensors` private method to OperatorUtils.h and extract `iter_tensor_offset` method, that returns an offset from the start of the storage associated with given tensor inside the iterator
Migrated `index`, `index_put[_accumulate][_serial]` to the new paradigm that does not require additional tensor for indices nor special handling for 32 vs 64-bit offset, which resulted in almost 2x perf gain for 2000x2000 tensor, see results below before
```
[------------------------------------------------------------ -----------------------------------------------------------]
| 11x50x50 | 11x100x100 | 11x500x500 | 11x1000x1000 | 11x2000x2000
1 threads: ----------------------------------------------------------------------------------------------------------------
__getitem__ (torch.int8, torch.int64) | 383.5 | 379.8 | 470.9 | 1232.9 | 4410.3
__getitem__ (torch.float16, torch.int64) | 379.6 | 354.5 | 533.2 | 1290.3 | 4442.2
__getitem__ (torch.float32, torch.int64) | 360.8 | 338.6 | 478.6 | 1348.9 | 4870.4
Times are in microseconds (us).
```
and after
```
[------------------------------------------------------------ -----------------------------------------------------------]
| 11x50x50 | 11x100x100 | 11x500x500 | 11x1000x1000 | 11x2000x2000
1 threads: ----------------------------------------------------------------------------------------------------------------
__getitem__ (torch.int8, torch.int64) | 349.8 | 330.5 | 432.6 | 764.5 | 1961.2
__getitem__ (torch.float16, torch.int64) | 342.5 | 330.7 | 434.7 | 741.0 | 1969.4
__getitem__ (torch.float32, torch.int64) | 332.2 | 326.1 | 445.4 | 751.3 | 1972.6
Times are in microseconds (us).
```
While migrating also fixed index_put_accumulate for boolean types, by using compare_and_exchange trick over uint
Fixes https://github.com/pytorch/pytorch/issues/153560
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158064
Approved by: https://github.com/dcci