fdc622b513
[CMake] Remove LLVM link code ( #134940 )
...
This handling is not needed on recent LLVM APIs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134940
Approved by: https://github.com/ezyang , https://github.com/malfet
2025-10-08 18:39:16 +00:00
f37a6523ef
Move version.h to torch/headeronly ( #164381 )
...
Differential Revision: [D83685392](https://our.internmc.facebook.com/intern/diff/D83685392 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164381
Approved by: https://github.com/janeyx99
2025-10-07 17:47:30 +00:00
9fff8155c3
[2/N] Fix clang-tidy readability checks ( #164652 )
...
This PR applies clang-tidy readability checks to jit sources and all headers in the code base.
`readability-redundant-inline-specifier` is suppressed because it incurs too many changes. `readability-redundant-inline-specifier` is used to detect redundant inline specifiers on function and variable declarations. There are many in-class method definitions that are marked inline.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164652
Approved by: https://github.com/Skylion007
2025-10-06 01:06:01 +00:00
2c5ed6e7c0
Revert "[2/N] Fix clang-tidy readability checks ( #164652 )"
...
This reverts commit 3c5ca685d6f5b6f3971c0cd20a054aa355610419.
Reverted https://github.com/pytorch/pytorch/pull/164652 on behalf of https://github.com/izaitsevfb due to need to revert due to a conflict with revert of https://github.com/pytorch/pytorch/pull/162659 ([comment](https://github.com/pytorch/pytorch/pull/164652#issuecomment-3369346707 ))
2025-10-05 21:36:57 +00:00
3c5ca685d6
[2/N] Fix clang-tidy readability checks ( #164652 )
...
This PR applies clang-tidy readability checks to jit sources and all headers in the code base.
`readability-redundant-inline-specifier` is suppressed because it incurs too many changes. `readability-redundant-inline-specifier` is used to detect redundant inline specifiers on function and variable declarations. There are many in-class method definitions that are marked inline.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164652
Approved by: https://github.com/Skylion007
2025-10-05 07:05:11 +00:00
1bb68271b7
Stop building nativert in OSS ( #164463 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164463
Approved by: https://github.com/albanD , https://github.com/Skylion007
2025-10-03 19:41:15 +00:00
31681bcacc
[PyTorch] Pull ARM's box-cox ( #164152 )
...
Summary:
ARM has provided an SVE128 box-cox implementation.
It uses the same underlying algorithm as the previous version, but it has better log and exp implementations.
These supplied mathematical functions have switches to adjust the precision/speed trade-off.
We've noted a slight precision improvement, along with a performance increase of about 5%.
Before:
ZeroLambda1 61.66ns 16.22M
NonZeroLambda1 125.73ns 7.95M
NonZeroLambdaManyColumns 1.84ms 542.11
NonZeroLambdaEigenColumnar 262.31us 3.81K
NonZeroLambdaEigenRowMajor 275.17us 3.63K
NonZeroLambdaWithPyTorchColumnar 97.43us 10.26K
NonZeroLambdaWithPyTorchRowMajor 90.82us 11.01K
NonZeroLambdaWithPyTorchRowMajorFullBatch 96.96us 10.31K
NonZeroLambdaBatch 151.84us 6.59K
After:
ZeroLambda1 57.85ns 17.29M
NonZeroLambda1 118.85ns 8.41M
NonZeroLambdaManyColumns 1.82ms 548.16
NonZeroLambdaEigenColumnar 261.67us 3.82K
NonZeroLambdaEigenRowMajor 274.53us 3.64K
NonZeroLambdaWithPyTorchColumnar 89.12us 11.22K
NonZeroLambdaWithPyTorchRowMajor 83.49us 11.98K
NonZeroLambdaWithPyTorchRowMajorFullBatch 88.79us 11.26K
NonZeroLambdaBatch 144.74us 6.91K
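For readers unfamiliar with the transform being benchmarked, here is a minimal scalar sketch. It assumes the standard two-parameter Box-Cox definition, which the `ZeroLambda` / `NonZeroLambda` benchmark names suggest; the function name `box_cox` and its signature are hypothetical and are not taken from the SVE implementation.

```python
import math


def box_cox(x: float, lambda1: float, lambda2: float = 0.0) -> float:
    """Scalar reference for the two-parameter Box-Cox transform (illustrative only)."""
    shifted = x + lambda2
    if shifted <= 0.0:
        raise ValueError("x + lambda2 must be positive")
    if lambda1 == 0.0:
        # Zero-lambda case: the transform reduces to a logarithm.
        return math.log(shifted)
    # Non-zero-lambda case: a power transform, which needs both pow (or exp/log).
    return (math.pow(shifted, lambda1) - 1.0) / lambda1


zero_case = box_cox(math.e, 0.0)
nonzero_case = box_cox(4.0, 0.5)
```

The non-zero case is the one that depends on the quality of the log and exp routines, which is consistent with the summary's focus on those functions.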
Test Plan:
Correctness:
buck2 test @//mode/opt //koski/functions_contrib/df4ai/tests:batch_box_cox_test
Performance:
buck2 run @//mode/opt //koski/functions_contrib/df4ai/benchmark:boxcox_benchmark
Differential Revision:
D83485704
Privacy Context Container: L1196524
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164152
Approved by: https://github.com/ezyang
2025-10-01 15:00:03 +00:00
7441a1b9b1
Update ruff to 0.13.1 ( #163744 )
...
Update ruff to 0.13.1 so that we can remove `UP038` from `pyproject.toml` because it has been removed from supported rules of ruff.
There are some fixes, the most notable one is [(PYI059)](https://docs.astral.sh/ruff/rules/generic-not-last-base-class/#generic-not-last-base-class-pyi059 )
```
Checks for classes inheriting from typing.Generic[] where Generic[] is not the last base class in the bases tuple.
```
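As an illustration of what PYI059 asks for, the sketch below shows a class that lists `Generic[T]` last among its bases. The class names `Base` and `Container` are hypothetical and unrelated to `OrderedSet`.

```python
from typing import Generic, TypeVar

T = TypeVar("T")


class Base:
    """Hypothetical non-generic base class."""


# PYI059 would flag this ordering, because Generic[T] is not the last base:
#     class Container(Generic[T], Base): ...


class Container(Base, Generic[T]):
    """Ordering accepted by PYI059: Generic[T] comes last in the bases tuple."""

    def __init__(self, item: T) -> None:
        self.item = item


box = Container[int](3)
```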
A BC-breaking change is introduced to change the typing of `OrderedSet.storage`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163744
Approved by: https://github.com/Skylion007 , https://github.com/jingsh
2025-09-26 10:12:21 +00:00
00059db034
Revert "[RELAND] Always build USE_DISTRIBUTED ( #160449 ) and Make distributed modules importable even when backend not built ( #159889 ) ( #162594 )"
...
This reverts commit 09cb34c1dce8fe1b880bbf3115d8ddad3401d871.
Reverted https://github.com/pytorch/pytorch/pull/162594 on behalf of https://github.com/malfet due to reverted internally and now can be safely reverted in OSS ([comment](https://github.com/pytorch/pytorch/pull/162594#issuecomment-3334176367 ))
2025-09-25 13:47:46 +00:00
7d710403b0
Reapply "Make functionalization ViewMeta serializable with pickle. ( #143712 )" ( #163769 )
...
### Summary:
NOTE: This is a re-export of https://github.com/pytorch/pytorch/pull/161994 ; the changes between these two PRs are exclusively to the buck/build files
(Summary from #161994 )
Attempted rebase of https://github.com/pytorch/pytorch/pull/143712 .
This reverts commit 6c713ccb5e0df227dd5b630057cbccd373cbe7d6.
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang amjames Lucaskabela
imported-using-ghimport
Test Plan: Imported from OSS
Differential Revision: D81524507
Pulled By: Lucaskabela
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163769
Approved by: https://github.com/dolpm
Co-authored-by: Brian Hirsh <hirsheybar@fb.com >
2025-09-25 10:27:37 +00:00
09cb34c1dc
[RELAND] Always build USE_DISTRIBUTED ( #160449 ) and Make distributed modules importable even when backend not built ( #159889 ) ( #162594 )
...
Summary:
Original: D81957844 and D81957923
Also, https://github.com/pytorch/pytorch/pull/162142 is patched in as well
#buildall
Test Plan:
sandcastle and oss ci
Rollback Plan:
Reviewed By: H-Huang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162594
Approved by: https://github.com/H-Huang , https://github.com/dcci
2025-09-22 21:12:18 +00:00
f0078941cf
Revert "[RELAND] Always build USE_DISTRIBUTED ( #160449 ) and Make distributed modules importable even when backend not built ( #159889 ) ( #162594 )"
...
This reverts commit 6c334885d48725197b5d35e2c1543efc0f4198d0.
Reverted https://github.com/pytorch/pytorch/pull/162594 on behalf of https://github.com/wdvr due to reverted internally - @ezyang see D82281294 ([comment](https://github.com/pytorch/pytorch/pull/162594#issuecomment-3317017530 ))
2025-09-22 05:39:07 +00:00
eb11d172e3
[Caffe2] Improve SVE batch box cox by 2% ( #163360 )
...
Summary:
Improve bound checking on exp computation, decreasing the longest dependency chain by 1.
Box-cox benchmarks show about 2% of improved throughput.
Precision remains unaltered.
before:
NonZeroLambdaBatch 155.30us 6.44K
after:
NonZeroLambdaBatch 151.78us 6.59K
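The "about 2%" figure can be checked directly from the two latencies above. This is only a back-of-the-envelope sketch; the variable names are illustrative.

```python
before_us = 155.30  # NonZeroLambdaBatch latency before the change
after_us = 151.78   # NonZeroLambdaBatch latency after the change

# Relative reduction in time per iteration.
latency_reduction = (before_us - after_us) / before_us

# Equivalent relative gain in iterations per second.
throughput_gain = before_us / after_us - 1.0

print(f"latency reduction: {latency_reduction:.2%}")
print(f"throughput gain:   {throughput_gain:.2%}")
```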
Test Plan:
Correctness:
buck2 test @//mode/opt //koski/functions_contrib/df4ai/tests:batch_box_cox_test
Performance:
buck2 run @//mode/opt //koski/functions_contrib/df4ai/benchmark:boxcox_benchmark
Differential Revision:
D82847111
Privacy Context Container: L1208939
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163360
Approved by: https://github.com/Skylion007
2025-09-20 06:42:26 +00:00
a3b68c7c57
Revert "Fix boxcox to return same result for same input in one batch ( #162772 )"
...
This reverts commit 49d30f9a234f0816a1ece278c8450d119e417714.
Reverted https://github.com/pytorch/pytorch/pull/162772 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/162772#issuecomment-3313213011 ))
2025-09-19 17:58:29 +00:00
794b48c9f4
[PyTorch] Compile SVE's box-cox only when building targeting SVE ( #163078 )
...
Summary:
Internally, we are building PyTorch on the compat layer.
We need to avoid compiling SVE's box-cox there, as SVE is not marked as a build target.
Rollback Plan:
Reviewed By: rraometa, YifanYuan3
Differential Revision:
D82544412
Privacy Context Container: L1208939
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163078
Approved by: https://github.com/Skylion007 , https://github.com/malfet
2025-09-17 03:35:11 +00:00
2459da4a64
[Caffe2] Add float batch box cox SVE128 implementation ( #159778 )
...
Introduce SVE128 SIMD batch box-cox computation.
We've seen about 65% throughput improvement.
Privacy Context Container: L1196524
This is a no-op from the OSS point of view, so it could be landed without tests (see the precedent set by https://github.com/pytorch/pytorch/pull/143627 ), but we should delete those at some point
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159778
Approved by: https://github.com/malfet
2025-09-16 07:25:04 +00:00
49d30f9a23
Fix boxcox to return same result for same input in one batch ( #162772 )
...
Summary:
The SIMD path is using SLEEF version of `pow` which is slightly different from `std::pow`. The fix is to use the same vectorized code (with partial load and store) for the trailing data as well to ensure consistency between results.
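The idea can be sketched in plain Python, with a fixed-width list standing in for a SIMD register. Everything here is a hypothetical illustration: `LANES`, `vector_kernel`, and `boxcox_row` are invented names, and the real code uses SLEEF-based vector intrinsics rather than `math.pow`.

```python
import math

LANES = 4  # hypothetical vector width


def vector_kernel(lanes, lambda1):
    """Stand-in for the SIMD kernel: always operates on exactly LANES values."""
    assert len(lanes) == LANES
    return [(math.pow(x, lambda1) - 1.0) / lambda1 for x in lanes]


def boxcox_row(xs, lambda1):
    """Apply the same kernel to full chunks and to the trailing partial chunk."""
    out = []
    for start in range(0, len(xs), LANES):
        chunk = xs[start:start + LANES]
        valid = len(chunk)
        # "Partial load": pad the tail with a harmless value to fill the lanes.
        padded = chunk + [1.0] * (LANES - valid)
        # "Partial store": keep only the lanes that hold real data.
        out.extend(vector_kernel(padded, lambda1)[:valid])
    return out


row = [2.0, 3.0, 5.0, 7.0, 2.0, 3.0]  # last two values fall in the tail
result = boxcox_row(row, 0.5)
```

Because the tail goes through the same kernel as the full chunks, equal inputs produce identical outputs regardless of their position in the batch.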
Rollback Plan:
Differential Revision: D82265247
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162772
Approved by: https://github.com/swolchok
2025-09-13 03:57:35 +00:00
b2553a6ec4
[AOTI] raise PyTorchStreamWriter open failed error code on windows ( #162799 )
...
While debugging the AOTI UT `TestAOTInductorPackage_cpu::test_add`, I found that it does not output a verbose error code when PyTorchStreamWriter fails to open a file.
This PR adds verbose error code output for debugging. A local test shows the following:
<img width="1124" height="653" alt="image" src="https://github.com/user-attachments/assets/01cb1a51-2982-4106-8b5b-c608ac26a075 " />
The error code is 32, we can check the Windows error code 32 at https://learn.microsoft.com/en-us/windows/win32/debug/system-error-codes--0-499-
```
ERROR_SHARING_VIOLATION
32 (0x20)
The process cannot access the file because it is being used by another process.
```
This issue occurs because the file is already opened by another process. I fixed the same issue for zip open in PR https://github.com/pytorch/pytorch/pull/162617 , but I still have no idea how to open a file with shared access in `std::ofstream`. I will continue researching it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162799
Approved by: https://github.com/jansel
2025-09-13 01:41:14 +00:00
65d642d6db
[ROCm] enable aoti tests, forward fix 162353 ( #162827 )
...
Forward fix for tests added by #162353 . Enables aoti tests on rocm.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162827
Approved by: https://github.com/dolpm , https://github.com/huydhn
2025-09-12 20:05:50 +00:00
6c334885d4
[RELAND] Always build USE_DISTRIBUTED ( #160449 ) and Make distributed modules importable even when backend not built ( #159889 ) ( #162594 )
...
Summary:
Original: D81957844 and D81957923
Also, https://github.com/pytorch/pytorch/pull/162142 is patched in as well
#buildall
Test Plan:
sandcastle and oss ci
Rollback Plan:
Reviewed By: H-Huang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162594
Approved by: https://github.com/H-Huang , https://github.com/dcci
2025-09-12 10:54:42 +00:00
6b59a19242
Revert "[RELAND] Always build USE_DISTRIBUTED ( #160449 ) and Make distributed modules importable even when backend not built ( #159889 ) ( #162594 )"
...
This reverts commit 6e8f17c58029e5fa6bc222b2445ebbc0cbdc17c7.
Reverted https://github.com/pytorch/pytorch/pull/162594 on behalf of https://github.com/huydhn due to Reverted internally ([comment](https://github.com/pytorch/pytorch/pull/162594#issuecomment-3283985880 ))
2025-09-12 06:52:03 +00:00
6e8f17c580
[RELAND] Always build USE_DISTRIBUTED ( #160449 ) and Make distributed modules importable even when backend not built ( #159889 ) ( #162594 )
...
Summary:
Original: D81957844 and D81957923
Also, https://github.com/pytorch/pytorch/pull/162142 is patched in as well
#buildall
Test Plan:
sandcastle and oss ci
Rollback Plan:
Reviewed By: H-Huang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162594
Approved by: https://github.com/H-Huang , https://github.com/dcci
2025-09-12 03:56:18 +00:00
1c16c18a53
[nativert][triton] improve hardware registration ( #162499 )
...
Summary: att
Test Plan:
ci
Rollback Plan:
Differential Revision: D82031814
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162499
Approved by: https://github.com/angelayi
2025-09-10 04:52:57 +00:00
dda071587f
Revert "Make distributed modules importable even when backend not built ( #159889 )" ( #162568 )
...
This reverts commit a0d026688cd69583d5a4e0c6f3e5fda141a7f4a9.
Revert "Always build USE_DISTRIBUTED. (#160449 )"
This reverts commit d80297a6846f1f2c36fd4f19e22919f2abe8fcea.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162568
Approved by: https://github.com/huydhn
2025-09-10 04:29:42 +00:00
2281d009e5
Revert "[ROCm] Add specific compile options for CK SDPA ( #161759 )"
...
This reverts commit d22d916719eb7daff8455a01d216d65f81899a9e.
Reverted https://github.com/pytorch/pytorch/pull/161759 on behalf of https://github.com/huydhn due to Sorry for reverting your change but this seems to break internal ROCm jobs ([comment](https://github.com/pytorch/pytorch/pull/161759#issuecomment-3272807726 ))
2025-09-10 00:44:30 +00:00
d22d916719
[ROCm] Add specific compile options for CK SDPA ( #161759 )
...
Updates CK version and adds CK specific compilation options
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161759
Approved by: https://github.com/jeffdaily
2025-09-09 20:04:19 +00:00
bdbe931d58
[build] Add LeakSanitizer option to CMake ( #158686 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158686
Approved by: https://github.com/eellison
2025-09-09 18:41:20 +00:00
d80297a684
Always build USE_DISTRIBUTED. ( #160449 )
...
Signed-off-by: Edward Yang <ezyang@meta.com >
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160449
Approved by: https://github.com/wconstab , https://github.com/albanD , https://github.com/dcci
2025-09-08 19:10:36 +00:00
1e0656f063
Revert "Always build USE_DISTRIBUTED. ( #160449 )"
...
This reverts commit de893e96c775023aa3be895060848fac3296772c.
Reverted https://github.com/pytorch/pytorch/pull/160449 on behalf of https://github.com/jeanschmidt due to internal changes breaks import checks, see [D81845053](https://www.internalfb.com/diff/D81845053 ) ([comment](https://github.com/pytorch/pytorch/pull/160449#issuecomment-3264887002 ))
2025-09-08 07:04:36 +00:00
b6d0a9ea90
MXFP8 grouped GEMM support for torch._scaled_grouped_mm + submodule bump ( #162209 )
...
## Summary
- We just landed 2d-2d support for mxfp8 grouped gemm in FBGEMM: https://github.com/pytorch/FBGEMM/pull/4816
- This is needed for backward pass of mxfp8 MoE training with grouped gemms
- Changes:
- Add dispatching + input validation for mxfp8 grouped gemm in `torch._scaled_grouped_mm`
- Add meta registration input validation for mxfp8 grouped gemm, for composability with compile
- Add unit tests exercising torch._scaled_grouped_mm with mxfp8 inputs
- Bump FBGEMM third party submodule to include:
- https://github.com/pytorch/FBGEMM/pull/4816
- https://github.com/pytorch/FBGEMM/pull/4820
- https://github.com/pytorch/FBGEMM/pull/4821
- https://github.com/pytorch/FBGEMM/pull/4823
#### How fbgemm dependency was bumped
Documenting this since I haven't found it documented elsewhere:
- `cd ~/pytorch/third_party/fbgemm`
- `git fetch`
- `git checkout <hash>`
- `cd ~/pytorch`
- `git add third_party/fbgemm`
## Test plan
#### Test build
```
USE_FBGEMM_GENAI=1 python -m pip install --no-build-isolation -v -e .
...
Successfully installed torch-2.9.0a0+gitf5070f3
```
[full build log](https://www.internalfb.com/phabricator/paste/view/P1933787581 )
#### Unit tests
```
pytest test/test_matmul_cuda.py -k test_mxfp8_scaled_grouped_mm_
...
test/test_matmul_cuda.py ......... [100%]
============================================================== 9 passed, 1668 deselected in 5.34s ===============================================================
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162209
Approved by: https://github.com/ngimel
2025-09-06 15:25:30 +00:00
de893e96c7
Always build USE_DISTRIBUTED. ( #160449 )
...
Signed-off-by: Edward Yang <ezyang@meta.com >
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160449
Approved by: https://github.com/wconstab , https://github.com/albanD , https://github.com/dcci
2025-09-05 20:15:11 +00:00
adae7f66aa
Revert "Always build USE_DISTRIBUTED. ( #160449 )"
...
This reverts commit c37103234afc832dcad307e9016230810957c9d5.
Reverted https://github.com/pytorch/pytorch/pull/160449 on behalf of https://github.com/jeanschmidt due to Breaking internal build rules, see D81756619 ([comment](https://github.com/pytorch/pytorch/pull/160449#issuecomment-3259430011 ))
2025-09-05 18:58:47 +00:00
c37103234a
Always build USE_DISTRIBUTED. ( #160449 )
...
Signed-off-by: Edward Yang <ezyang@meta.com >
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160449
Approved by: https://github.com/wconstab , https://github.com/albanD , https://github.com/dcci
2025-09-04 19:43:17 +00:00
b7dad7dd49
Revert "Always build USE_DISTRIBUTED. ( #160449 )"
...
This reverts commit 90b08643c3a6eb1f3265b7d1388bd76660759f46.
Reverted https://github.com/pytorch/pytorch/pull/160449 on behalf of https://github.com/jeanschmidt due to Already discussed with @ezyang about the internal quirks and errors ([comment](https://github.com/pytorch/pytorch/pull/160449#issuecomment-3254219358 ))
2025-09-04 15:25:07 +00:00
90b08643c3
Always build USE_DISTRIBUTED. ( #160449 )
...
Signed-off-by: Edward Yang <ezyang@meta.com >
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160449
Approved by: https://github.com/wconstab , https://github.com/albanD , https://github.com/dcci
2025-09-03 07:33:55 +00:00
4e42aa8ffc
Revert "Always build USE_DISTRIBUTED. ( #160449 )"
...
This reverts commit b7034e9c924412bfbe8ee25a22d7e95239b5ca65.
Reverted https://github.com/pytorch/pytorch/pull/160449 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, can't be landed with forward fix due to internal tooling problems ([comment](https://github.com/pytorch/pytorch/pull/160449#issuecomment-3246689684 ))
2025-09-02 20:28:42 +00:00
b7034e9c92
Always build USE_DISTRIBUTED. ( #160449 )
...
Signed-off-by: Edward Yang <ezyang@meta.com >
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160449
Approved by: https://github.com/wconstab , https://github.com/albanD , https://github.com/dcci
2025-09-01 23:00:21 +00:00
61e18b5304
[2/N][SymmMem] Add MemPool allocator and tests ( #161471 )
...
(Porting most of #161008 )
Hooking SymmetricMemory Allocator to MemPool so that user can create symmetric tensors with regular `torch.zeros`, `torch.arange` etc factories. Also so that our ops can have functional variants that create `out` tensors on symmetric memory.
To end users, this PR supports a python UI as follows:
```
allocator = symm_mem.get_mempool_allocator(device)
mempool = torch.cuda.MemPool(allocator)
with torch.cuda.use_mem_pool(mempool):
    tensor = torch.arange(numel, dtype=dtype, device=device)
```
Added tests for both use cases above.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161471
Approved by: https://github.com/ngimel
ghstack dependencies: #161470
2025-08-31 18:08:57 +00:00
fb2d5ea697
Revert "[2/N][SymmMem] Add MemPool allocator and tests ( #161471 )"
...
This reverts commit b291dc9684d00396239a0c7786b7aac71bf69c05.
Reverted https://github.com/pytorch/pytorch/pull/161471 on behalf of https://github.com/atalman due to Multiple internal failures on PR #https://github.com/pytorch/pytorch/pull/161471 will need to land it via co-dev ([comment](https://github.com/pytorch/pytorch/pull/161471#issuecomment-3239283585 ))
2025-08-30 14:00:29 +00:00
303f514d5b
[CI] Add basic CUDA 13.0 periodic test ( #161013 )
...
https://github.com/pytorch/pytorch/issues/159779
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161013
Approved by: https://github.com/atalman
Co-authored-by: Andrey Talman <atalman@fb.com >
Co-authored-by: Aidyn-A <31858918+Aidyn-A@users.noreply.github.com >
2025-08-29 17:56:33 +00:00
b291dc9684
[2/N][SymmMem] Add MemPool allocator and tests ( #161471 )
...
(Porting most of #161008 )
Hooking SymmetricMemory Allocator to MemPool so that user can create symmetric tensors with regular `torch.zeros`, `torch.arange` etc factories. Also so that our ops can have functional variants that create `out` tensors on symmetric memory.
To end users, this PR supports a python UI as follows:
```
allocator = symm_mem.get_mempool_allocator(device)
mempool = torch.cuda.MemPool(allocator)
with torch.cuda.use_mem_pool(mempool):
    tensor = torch.arange(numel, dtype=dtype, device=device)
```
Added tests for both use cases above.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161471
Approved by: https://github.com/ngimel
ghstack dependencies: #161470
2025-08-28 06:31:29 +00:00
903181bb6f
Revert "[2/N][SymmMem] Add MemPool allocator and tests ( #161471 )"
...
This reverts commit 4ed71d5412d58746d23f16689cab61da0e8149ef.
Reverted https://github.com/pytorch/pytorch/pull/161471 on behalf of https://github.com/atalman due to failing internal builds ([comment](https://github.com/pytorch/pytorch/pull/161471#issuecomment-3230069186 ))
2025-08-27 23:18:36 +00:00
4ed71d5412
[2/N][SymmMem] Add MemPool allocator and tests ( #161471 )
...
(Porting most of #161008 )
Hooking SymmetricMemory Allocator to MemPool so that user can create symmetric tensors with regular `torch.zeros`, `torch.arange` etc factories. Also so that our ops can have functional variants that create `out` tensors on symmetric memory.
To end users, this PR supports a python UI as follows:
```
allocator = symm_mem.get_mempool_allocator(device)
mempool = torch.cuda.MemPool(allocator)
with torch.cuda.use_mem_pool(mempool):
    tensor = torch.arange(numel, dtype=dtype, device=device)
```
Added tests for both use cases above.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161471
Approved by: https://github.com/ngimel
ghstack dependencies: #161470
2025-08-27 00:49:06 +00:00
0254646654
harden fabric checks for symmetric memory ( #160790 )
...
Now we check only that fabric allocation succeeded, but sometimes we fail during export or import afterwards, with no recourse. Check the full cycle before attempting to allocate memory with the fabric.
TODO: move it to c10/cuda so that it can be used from CUDACachingAllocator too
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160790
Approved by: https://github.com/Skylion007
2025-08-18 22:35:50 +00:00
10e3514c96
Remove tensorexpr tests ( #158928 )
...
The tests are not maintained.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158928
Approved by: https://github.com/albanD , https://github.com/malfet
2025-08-09 02:21:22 +00:00
5f5f508aa8
[ROCm] Ck backend UX refactor ( #152951 )
...
Refactors how the enablement/disablement of CK Gemms and SDPA works.
- Adds USE_ROCM_CK_GEMM compile flag for enabling CK gemms.
- USE_ROCM_CK_GEMM is set to True by default on Linux
- Updates USE_CK_FLASH_ATTENTION to USE_ROCM_CK_SDPA.
- USE_ROCM_CK_SDPA is set to False by default
- (USE_CK_FLASH_ATTENTION still works for now, but will be deprecated in a future release)
- Prevents these CK libraries from being used unless pytorch has been built specifically with the functionality AND is running on a system architecture that supports it.
- the getters for these library backends will also do some validity checking in case the user used an environment variable to change the backend. If invalid, (i.e. one of the cases mentioned above is false) the backend will be set as the current non-CK default
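The fallback behaviour described in the last bullet can be sketched as follows. This is a hypothetical illustration only: `resolve_backend` and its parameters are invented names, not PyTorch APIs, and the real getters live in the C++ backend-selection code.

```python
def resolve_backend(requested: str, built_with_ck: bool, arch_supports_ck: bool,
                    default: str = "default") -> str:
    """Return the backend to use, falling back when CK is not usable.

    CK is honoured only if the build enabled it AND the running architecture
    supports it; otherwise the current non-CK default is returned.
    """
    if requested == "ck" and not (built_with_ck and arch_supports_ck):
        return default
    return requested


# A user forces CK through an environment variable, but the build lacks it.
chosen = resolve_backend("ck", built_with_ck=False, arch_supports_ck=True)
```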
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152951
Approved by: https://github.com/eqy , https://github.com/jeffdaily , https://github.com/m-gallus
Co-authored-by: Jeff Daily <jeff.daily@amd.com >
Co-authored-by: Jithun Nair <jithun.nair@amd.com >
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com >
2025-08-08 18:40:17 +00:00
e2a5c42e7e
[BE][MPS] Build metal kernels of MacOS-14+ ( #159733 )
...
Which makes `#if __METAL_VERSION__ >= 310` guards for `bfloat` use support unnecessary.
Rename `kernels_bfloat.metallib` into `kernels_basic` and remove custom build/selection logic.
Part of https://github.com/pytorch/pytorch/issues/159275
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159733
Approved by: https://github.com/dcci
ghstack dependencies: #159731 , #159732
2025-08-03 20:53:58 +00:00
c400c8e2e0
[ROCm] Add FP8 rowwise support to _scaled_grouped_mm + Submodule update ( #159075 )
...
Summary:
In this PR we integrate the [FBGEMM AMD FP8 rowwise scaling grouped GEMM kernel](https://github.com/pytorch/FBGEMM/tree/main/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped ) to add support for the `_scaled_grouped_mm` API on AMD. `_scaled_grouped_mm` is [currently supported on Nvidia](9faef3d17c/aten/src/ATen/native/cuda/Blas.cpp (L1614)); this PR aims to bring parity to AMD. Related: [[RFC]: PyTorch Low-Precision GEMMs Public API](https://github.com/pytorch/pytorch/issues/157950#top ) #157950 .
The kernel is developed using the Composable Kernel framework. Only MI300X is currently supported. In the near future we plan to add support for MI350X as well. For data types we support FP8 e4m3.
The kernel support will be gated with the `USE_FBGEMM_GENAI` flag. We hope to enable this by default for relevant AMD builds.
Note we also update submodule `third_party/fbgemm` to 0adf62831 for the required updates from fbgemm.
Test Plan:
**Hipify & build**
```
python tools/amd_build/build_amd.py
USE_FBGEMM_GENAI=1 python setup.py develop
```
**Unit tests**
```
python test/test_matmul_cuda.py -- TestFP8MatmulCUDA
Ran 488 tests in 32.969s
OK (skipped=454)
```
**Performance Sample**
| G | M | N | K | Runtime Ms | GB/S | TFLOPS |
| -- | -- | -- | -- | -- | -- | -- |
| 128 | 1 | 2048 | 5120 | 0.37| 3590 | 7.17 |
| 128 | 64 | 2048 | 5120 | 0.51| 2792 | 338.34 |
| 128 | 128 | 2048 | 5120 | 0.66| 2272 | 522.72 |
| 128 | 1 | 5120 | 1024 | 0.21| 3224 | 6.43 |
| 128 | 64 | 5120 | 1024 | 0.29| 2590 | 291.40 |
| 128 | 128 | 5120 | 1024 | 0.40| 2165 | 434.76 |
| 128 | 1 | 4096 | 4096 | 0.69| 3126 | 6.25 |
| 128 | 64 | 4096 | 4096 | 0.85| 2655 | 324.66 |
| 128 | 128 | 4096 | 4096 | 1.10| 2142 | 501.40 |
| 128 | 1 | 8192 | 8192 | 2.45| 3508 | 7.01 |
| 128 | 64 | 8192 | 8192 | 3.27| 2692 | 336.74 |
| 128 | 128 | 8192 | 8192 | 4.04| 2224 | 543.76 |
| 16 | 1 | 2048 | 5120 | 0.04| 3928 | 7.85 |
| 16 | 64 | 2048 | 5120 | 0.05| 3295 | 399.29 |
| 16 | 128 | 2048 | 5120 | 0.07| 2558 | 588.69 |
| 16 | 1 | 5120 | 1024 | 0.03| 3119 | 6.23 |
| 16 | 64 | 5120 | 1024 | 0.03| 2849 | 320.62 |
| 16 | 128 | 5120 | 1024 | 0.05| 2013 | 404.11 |
| 16 | 1 | 4096 | 4096 | 0.06| 4512 | 9.02 |
| 16 | 64 | 4096 | 4096 | 0.09| 3124 | 381.95 |
| 16 | 128 | 4096 | 4096 | 0.13| 2340 | 547.67 |
| 16 | 1 | 8192 | 8192 | 0.32| 3374 | 6.75 |
| 16 | 64 | 8192 | 8192 | 0.42| 2593 | 324.28 |
| 16 | 128 | 8192 | 8192 | 0.53| 2120 | 518.36 |
- Using ROCm 6.4.1
- Collected through `triton.testing.do_bench_cudagraph`
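As a sanity check on the table, the TFLOPS column appears consistent with the conventional FLOP count of 2·G·M·N·K for a grouped GEMM divided by the runtime. The sketch below assumes that convention; the function name `grouped_gemm_tflops` is hypothetical and is not part of the benchmark script.

```python
def grouped_gemm_tflops(g: int, m: int, n: int, k: int, runtime_ms: float) -> float:
    """Estimate throughput assuming 2*G*M*N*K floating-point operations."""
    flops = 2 * g * m * n * k          # one multiply and one add per MAC
    seconds = runtime_ms * 1e-3
    return flops / seconds / 1e12      # convert FLOP/s to TFLOPS


# Row G=128, M=128, N=8192, K=8192 with a reported runtime of 4.04 ms.
tflops = grouped_gemm_tflops(128, 128, 8192, 8192, 4.04)
print(f"{tflops:.1f} TFLOPS")
```

The result lands within about 1% of the 543.76 reported for that row; the small gap is expected because the runtime in the table is rounded to two decimals.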
**Binary size with gfx942 arch**
Before: 116103856 Jul 23 14:12 build/lib/libtorch_hip.so
After: 118860960 Jul 23 14:29 build/lib/libtorch_hip.so
The difference is 2757104 bytes (~2.6 MiB).
Reviewers: @drisspg @ngimel @jwfromm @jeffdaily
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159075
Approved by: https://github.com/drisspg
2025-07-30 23:53:58 +00:00
e288c258f7
Revert "Remove tensorexpr tests ( #158928 )"
...
This reverts commit d742a2896c571a535003d5928fe80397325575a5.
Reverted https://github.com/pytorch/pytorch/pull/158928 on behalf of https://github.com/yangw-dev due to this breaks bunch of internal dependency since some tests are still using the deleted test files from this pr, the internal reviewer please help fix this using codev ([comment](https://github.com/pytorch/pytorch/pull/158928#issuecomment-3134378616 ))
2025-07-29 23:32:07 +00:00
d742a2896c
Remove tensorexpr tests ( #158928 )
...
The tests are not maintained.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158928
Approved by: https://github.com/albanD , https://github.com/malfet
2025-07-27 07:13:27 +00:00