a0ab243c3a
Revert "Generalize poison fork logic for each device backend ( #144664 )"
...
This reverts commit 83bd0b63b55f224fada6d5f6dd7eb5b4cb3072fb.
Reverted https://github.com/pytorch/pytorch/pull/144664 on behalf of https://github.com/atalman due to failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/144664#issuecomment-2795157082 ))
2025-04-10 21:02:14 +00:00
83bd0b63b5
Generalize poison fork logic for each device backend ( #144664 )
...
# Motivation
Generalize the poison_fork code to make it reusable across different devices.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144664
Approved by: https://github.com/EikanWang , https://github.com/albanD
2025-04-10 02:34:53 +00:00
bf1132c196
Revert "Generalize poison fork logic for each device backend ( #144664 )"
...
This reverts commit d86c14156d875b782b82dda96842a1f77910f010.
Reverted https://github.com/pytorch/pytorch/pull/144664 on behalf of https://github.com/atalman due to failing periodic test: python test/test_cpp_extensions_mtia_backend.py TestCppExtensionMTIABackend.test_device_context ([comment](https://github.com/pytorch/pytorch/pull/144664#issuecomment-2784506104 ))
2025-04-07 20:09:53 +00:00
d86c14156d
Generalize poison fork logic for each device backend ( #144664 )
...
# Motivation
Generalize the poison_fork code to make it reusable across different devices.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144664
Approved by: https://github.com/EikanWang , https://github.com/albanD
2025-04-07 02:06:21 +00:00
a11538aa46
[GPU Snapshot] Add Clear History Flag ( #149352 )
...
Summary:
Oftentimes, users complain that a bunch of extra events are prepended to their desired GPU snapshot. This happens because they usually attach an OOM logger without realizing it, and when they go to collect the actual snapshot, it includes all of the OOM logger contents. Since the OOM logger and the regular snapshot use the same backend, we currently don't have the infrastructure in place to split these snapshots.
As a solution, we add a flag to the snapshot frontend that clears out the history when auto-trace memory history recording starts.
A more thorough solution would be to have the user pass in a handle and keep snapshots per handle to separate the events. However, this would likely be complicated and more work than it is worth, as we would have to change the callbacks in the caching allocator and pass these objects between Python and C++.
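A minimal usage sketch of the new flag, assuming it is exposed as a `clear_history` keyword on `torch.cuda.memory._record_memory_history` (the exact keyword name is an assumption, not taken from this log):
```
import torch

# An OOM logger attached elsewhere may already have started recording history.
torch.cuda.memory._record_memory_history(max_entries=100000)

# ... OOM-logger events would accumulate here ...

# Start the snapshot trace, dropping whatever history was already recorded.
torch.cuda.memory._record_memory_history(max_entries=100000, clear_history=True)

x = torch.randn(4096, 4096, device="cuda")  # workload under investigation
y = x @ x

snapshot = torch.cuda.memory._snapshot()  # contains only events recorded after the clear
```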
Test Plan:
See diff below
Differential Revision: D71159720
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149352
Approved by: https://github.com/eqy , https://github.com/aaronenyeshi
2025-03-19 21:44:20 +00:00
c65ee728f0
Initial implementation of host memory stats ( #147660 )
...
This is an initial attempt to provide some statistics for the pinned host memory allocations flowing through CachingHostAllocator. Many times in the past we have had inexplicable slowdowns that would be much easier to diagnose if we had some host memory characteristics.
This change tries very hard not to disrupt the original design of the allocator, and it uses the existing locking mechanism, whenever possible, to gather statistics "for free". The only deviation is on the "slow path", where we incur CUDA calls anyway, so taking a short lock is not going to hurt performance much, especially in the steady state where most allocations come from the cache.
As mentioned before, this first PR introduces the concept so we can see whether it fits the right paradigm. We can always add more later.
Metrics that would require more involved changes to the code base and its locks, such as requested memory, have been punted for now. I also tried to reuse the Stat structure from the CUDA caching allocator in order to maintain symmetry.
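A hedged sketch of how these statistics might be read from Python, assuming they are surfaced through a `torch.cuda.host_memory_stats()` accessor mirroring `torch.cuda.memory_stats()`; the accessor name and key layout are assumptions, not confirmed by this log:
```
import torch

# Pinned host allocations flow through the CachingHostAllocator.
x = torch.empty(1024, 1024, pin_memory=True)

stats = torch.cuda.host_memory_stats()  # assumed accessor returning a dict of counters
for key, value in sorted(stats.items()):
    print(f"{key}: {value}")
```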
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147660
Approved by: https://github.com/ngimel
2025-03-05 16:13:19 +00:00
a983b2b11a
Revert "Initial implementation of host memory stats ( #147660 )"
...
This reverts commit 945e359fc1afe6c0bb6129ed9607b237fa19cd98.
Reverted https://github.com/pytorch/pytorch/pull/147660 on behalf of https://github.com/mradmila due to There is an issue with ambiguous definition of Stat structure when different C++ tools are used. Backing out for now. ([comment](https://github.com/pytorch/pytorch/pull/147660#issuecomment-2692346379 ))
2025-03-01 18:05:45 +00:00
945e359fc1
Initial implementation of host memory stats ( #147660 )
...
This is an initial attempt to provide some statistics for the pinned host memory allocations flowing through CachingHostAllocator. Many times in the past we have had inexplicable slowdowns that would be much easier to diagnose if we had some host memory characteristics.
This change tries very hard not to disrupt the original design of the allocator, and it uses the existing locking mechanism, whenever possible, to gather statistics "for free". The only deviation is on the "slow path", where we incur CUDA calls anyway, so taking a short lock is not going to hurt performance much, especially in the steady state where most allocations come from the cache.
As mentioned before, this first PR introduces the concept so we can see whether it fits the right paradigm. We can always add more later.
Metrics that would require more involved changes to the code base and its locks, such as requested memory, have been punted for now. I also tried to reuse the Stat structure from the CUDA caching allocator in order to maintain symmetry.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147660
Approved by: https://github.com/ngimel
2025-02-28 18:36:44 +00:00
25aa7ca62d
Cleanup CallOnce.h ( #146700 )
...
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146700
Approved by: https://github.com/albanD
2025-02-07 16:44:45 +00:00
29f52e3972
[2/N] Remove unnecessary once flag usage ( #145057 )
...
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145057
Approved by: https://github.com/albanD
2025-01-23 09:48:46 +00:00
3848de55ed
Add get_stream_from_external API for CUDA backend ( #143799 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143799
Approved by: https://github.com/albanD , https://github.com/EikanWang
ghstack dependencies: #142347 , #141119 , #141123
2024-12-31 11:15:59 +00:00
c0a39ad35a
[ROCm] Fix TunableOp UTs: Rotating Buffer ( #143172 )
...
TunableOp's rotating buffer feature cannot be properly tested because the environment variable that controls it is sticky. A Python API is introduced to modify this value (see the sketch after the list below).
Additional items in this PR:
* UT for rotating buffer API
* Clean up UTs that were setting the rotating buffer via the environment variable
* Align behavior of environment variable and Python API when a negative value (< 0) is set.
* Update documentation.
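A hedged sketch of how a test might drive the new knob, assuming the API landed as `set_rotating_buffer_size` / `get_rotating_buffer_size` in `torch.cuda.tunable`; the exact names, the MB unit, and the environment variable spelling are assumptions:
```
import torch

torch.cuda.tunable.enable(True)

# The PYTORCH_TUNABLEOP_ROTATING_BUFFER_SIZE env var is only read once ("sticky"),
# so tests toggle the value through the Python API instead.
torch.cuda.tunable.set_rotating_buffer_size(64)            # assumed to be in MB
assert torch.cuda.tunable.get_rotating_buffer_size() == 64

# A negative value gets the same treatment as a negative env var value.
torch.cuda.tunable.set_rotating_buffer_size(-1)
```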
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143172
Approved by: https://github.com/jeffdaily
2024-12-14 06:18:11 +00:00
96c3b2c388
Expose remaining sharedMem cudaDeviceProps to python ( #143226 )
...
I was a bit too fast with my earlier PR: `sharedMemPerMultiprocessor` includes some memory that is reserved for the system. The amount a kernel can actually use is limited by `sharedMemPerBlockOptin`.
I also expose `sharedMemPerBlock` for completeness.
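A hedged sketch of reading the newly exposed properties, assuming the Python attributes follow the snake_case convention of existing fields such as `regs_per_multiprocessor`; the exact attribute names are an assumption:
```
import torch

props = torch.cuda.get_device_properties(0)
print(props.shared_memory_per_block)           # default per-block limit
print(props.shared_memory_per_block_optin)     # what a kernel can actually opt in to use
print(props.shared_memory_per_multiprocessor)  # includes system-reserved memory
```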
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143226
Approved by: https://github.com/ezyang
2024-12-14 06:13:28 +00:00
82a45d19b4
Expose sharedMemPerMultiprocessor device property to python ( #143119 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143119
Approved by: https://github.com/ezyang
2024-12-13 16:53:57 +00:00
4959784dac
Add API query for available per-process CUDA memory ( #140620 )
...
Certain `cpp_wrapper`-enabled tests were OOM-ing in the CI pipeline, with error messages suggesting that sufficient memory was accessible. This ultimately resulted from an internal memory limitation that was not queryable in the API. This PR adds querying for that limit.
Additionally, the failing tests had incorrect memory availability checks, and are updated with measured memory requirements.
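An illustration of the kind of availability check the updated tests perform, using the existing `torch.cuda.mem_get_info` query; the accessor this PR adds for the internal per-process limit is not named in this log, so it is not shown:
```
import torch

REQUIRED_BYTES = 8 * 1024**3  # example measured requirement for a cpp_wrapper test

free, total = torch.cuda.mem_get_info(0)  # free/total device memory in bytes
if free < REQUIRED_BYTES:
    print("skipping test: not enough free device memory")
```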
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140620
Approved by: https://github.com/malfet , https://github.com/eqy
ghstack dependencies: #141367
2024-12-03 00:24:03 +00:00
f95c71867e
[9/N] Fix extra warnings brought by clang-tidy-17 ( #139286 )
...
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139286
Approved by: https://github.com/ezyang
2024-10-31 05:20:31 +00:00
456c87c8a2
[8/N] Fix extra warnings brought by clang-tidy-17 ( #139151 )
...
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139151
Approved by: https://github.com/ezyang
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com >
2024-10-30 14:20:08 +00:00
b14269dcfb
Make Context to be Device-agnostic Step by Step (1/N) ( #136519 ) ( #138155 )
...
Summary:
- make init device-agnostic and move it to AcceleratorHooksInterface
- refactor the context code related to device initialization
Original pull request: https://github.com/pytorch/pytorch/pull/136519
Test Plan: contbuild & OSS CI, see 4a8e49389c
Reviewed By: malfet
Differential Revision: D64471142
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138155
Approved by: https://github.com/malfet , https://github.com/bobrenjc93
2024-10-17 20:58:56 +00:00
d4d687ffb2
Revert "Make Context to be Device-agnostic Step by Step (1/N) ( #136519 )"
...
This reverts commit 4a8e49389c33934234dc89616fd17a58e760e2e7.
Reverted https://github.com/pytorch/pytorch/pull/136519 on behalf of https://github.com/clee2000 due to breaking internal tests related to MITA, @ezyang has a forward fix? ([comment](https://github.com/pytorch/pytorch/pull/136519#issuecomment-2414588302 ))
2024-10-15 17:19:16 +00:00
4a8e49389c
Make Context to be Device-agnostic Step by Step (1/N) ( #136519 )
...
----
- make init device-agnostic and move it to AcceleratorHooksInterface
- refactor the context code related to device initialization
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136519
Approved by: https://github.com/ezyang , https://github.com/EikanWang , https://github.com/guangyey
2024-10-13 12:38:02 +00:00
079f909263
Revert "Make Context to be Device-agnostic Step by Step (1/N) ( #136519 )"
...
This reverts commit be0b75256a7e516217b059ef273901b95c022fe7.
Reverted https://github.com/pytorch/pytorch/pull/136519 on behalf of https://github.com/jovianjaison due to this pr is causing errors internally ([comment](https://github.com/pytorch/pytorch/pull/136519#issuecomment-2405781093 ))
2024-10-10 18:32:17 +00:00
5516ac5c21
[ROCm] Tunableop record untuned ( #128813 )
...
When TunableOp is enabled, it is easy to hit OOM since the application usually needs a large amount of GPU memory, for example when running an LLM for inference. So we need an offline mode to tune the GEMMs. This PR provides an offline mode for TunableOp (see the usage sketch below):
- record untuned GEMMs to a file.
- add a Python API named tune_gemm_in_file that reads the untuned file and tunes the GEMMs it contains.
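A hedged sketch of the offline flow; `tune_gemm_in_file` is named above, while the recording environment variable and the untuned-results filename shown here are assumptions about how the recording step is switched on:
```
import torch

# Step 1 (assumed): run the application with untuned-GEMM recording enabled, e.g.
#   PYTORCH_TUNABLEOP_ENABLED=1 PYTORCH_TUNABLEOP_RECORD_UNTUNED=1 python app.py
# which writes the GEMM shapes it encounters to an "untuned" file instead of
# tuning them online next to the model's own memory footprint.

# Step 2: tune the recorded GEMMs offline, once the model is no longer resident.
torch.cuda.tunable.enable(True)
torch.cuda.tunable.tune_gemm_in_file("tunableop_untuned0.csv")  # filename is an assumption
torch.cuda.tunable.write_file()  # persist tuned results for subsequent runs
```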
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128813
Approved by: https://github.com/jeffdaily , https://github.com/hongxiayang , https://github.com/naromero77amd
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com >
2024-10-09 21:59:03 +00:00
be0b75256a
Make Context to be Device-agnostic Step by Step (1/N) ( #136519 )
...
- make init device-agnostic and move it to AcceleratorHooksInterface
- refactor the context code related to device initialization
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136519
Approved by: https://github.com/ezyang , https://github.com/EikanWang , https://github.com/guangyey
2024-10-09 02:13:36 +00:00
8893881867
Invalidate StorageImpl instances when tensor is overwritten with cudagraphs ( #125264 )
...
Fixes #104435
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125264
Approved by: https://github.com/ezyang
Co-authored-by: eellison <elias.ellison@gmail.com >
2024-10-09 00:05:52 +00:00
a2396b2dd8
[2/N] Fix extra warnings brought by clang-tidy-17 ( #137459 )
...
Follows #137407
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137459
Approved by: https://github.com/Skylion007
2024-10-08 19:05:02 +00:00
88e54de219
More nogil unsafe API fix ( #137142 )
...
Covers the PyDict APIs and confirms that no update is needed for the PyModule one.
The rest was already covered in https://github.com/pytorch/pytorch/pull/136899
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137142
Approved by: https://github.com/eqy , https://github.com/Skylion007
2024-10-04 21:56:34 +00:00
c7b0d4b148
raw_alloc ignores PYTORCH_NO_CUDA_MEMORY_CACHING ( #131114 )
...
raw_alloc is used by cudnn, miopen, thrust, and tunableop. Without this PR, the env var for disabling the caching allocator will only partially work.
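For reference, a sketch of how the environment variable is typically used when debugging allocator behavior; it must be set before CUDA is initialized, and the workload shown is illustrative:
```
import os

# Must be set before torch initializes CUDA for it to take effect.
os.environ["PYTORCH_NO_CUDA_MEMORY_CACHING"] = "1"

import torch

# With caching disabled, allocations go straight to cudaMalloc/cudaFree, which
# (after this PR) also covers the raw_alloc path used by cuDNN, MIOpen, Thrust,
# and TunableOp.
x = torch.randn(1024, 1024, device="cuda")
del x
torch.cuda.synchronize()
```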
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131114
Approved by: https://github.com/eqy , https://github.com/houseroad , https://github.com/albanD
Co-authored-by: Nichols A. Romero <nick.romero@amd.com >
2024-10-04 15:36:29 +00:00
0d1701f310
Revert "raw_alloc ignores PYTORCH_NO_CUDA_MEMORY_CACHING ( #131114 )"
...
This reverts commit 70019074806920f95976fedad775d7570294f635.
Reverted https://github.com/pytorch/pytorch/pull/131114 on behalf of https://github.com/PaliC due to failing internal builds ([comment](https://github.com/pytorch/pytorch/pull/131114#issuecomment-2390615007 ))
2024-10-03 06:22:55 +00:00
7001907480
raw_alloc ignores PYTORCH_NO_CUDA_MEMORY_CACHING ( #131114 )
...
raw_alloc is used by cudnn, miopen, thrust, and tunableop. Without this PR, the env var for disabling the caching allocator will only partially work.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131114
Approved by: https://github.com/eqy , https://github.com/houseroad , https://github.com/albanD
Co-authored-by: Nichols A. Romero <nick.romero@amd.com >
2024-10-02 16:27:15 +00:00
a15774563b
[ROCm] Enable ROCm support for inductor's dynamic_rblock_scaling ( #129663 )
...
As of ROCm 6.1, [hipDeviceProp_t::regsPerMultiprocessor](https://rocm.docs.amd.com/projects/HIP/en/latest/doxygen/html/structhip_device_prop__t.html#a7390d5b180d63978c81aa971060270b4 ) is now available, allowing us to enable this attribute on ROCm.
```
>>> torch.cuda.get_device_properties(0)
_CudaDeviceProperties(name='AMD Instinct MI250X/MI250', major=9, minor=0, gcnArchName='gfx90a:sramecc+:xnack-', total_memory=65520MB, multi_processor_count=104)
>>> torch.cuda.get_device_properties(0).regs_per_multiprocessor
65536
```
With https://github.com/triton-lang/triton/pull/3962 we can extract n_regs and n_spills from a Triton binary with the AMD backend, allowing us to enable inductor's dynamic_rblock_scaling on ROCm, initially implemented in https://github.com/pytorch/pytorch/pull/115094
Leaving this in draft until the following PRs have landed:
- https://github.com/pytorch/pytorch/pull/129361 to bump the triton commit pin
- https://github.com/pytorch/pytorch/pull/128449 to allow us to grab warp_size from device properties instead of hard coding 64 on ROCm.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129663
Approved by: https://github.com/jansel , https://github.com/shunting314
2024-09-13 16:45:39 +00:00
6c1da66407
[Reland] Refactor caching device allocator utils ( #130923 )
...
# Motivation
Following [[RFC] Intel GPU Runtime Upstreaming for Allocator ](https://github.com/pytorch/pytorch/issues/116322 ), this PR aims to refactor the caching device allocator utils to improve code reuse.
This is the first PR; follow-up PRs can continue refactoring the device caching allocator.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130923
Approved by: https://github.com/EikanWang , https://github.com/gujinghui , https://github.com/albanD , https://github.com/eqy
2024-09-07 11:14:17 +00:00
e55c0f59e5
Revert "[Reland] Refactor caching device allocator utils ( #130923 )"
...
This reverts commit 9809080b9ed657a8c0ea0383be7cbdce3a26e05e.
Reverted https://github.com/pytorch/pytorch/pull/130923 on behalf of https://github.com/kit1980 due to breaking internal builds - Error: Relocation overflow has occured ([comment](https://github.com/pytorch/pytorch/pull/130923#issuecomment-2332640961 ))
2024-09-05 21:16:14 +00:00
9809080b9e
[Reland] Refactor caching device allocator utils ( #130923 )
...
# Motivation
Following [[RFC] Intel GPU Runtime Upstreaming for Allocator ](https://github.com/pytorch/pytorch/issues/116322 ), this PR aims to refactor the caching device allocator utils to improve code reuse.
This is the first PR; follow-up PRs can continue refactoring the device caching allocator.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130923
Approved by: https://github.com/EikanWang , https://github.com/gujinghui , https://github.com/albanD , https://github.com/eqy
2024-09-04 05:31:08 +00:00
c25b64a057
expose host_emptyCache to python, fix a bug in freeing cudaHostRegistered memory ( #134919 )
...
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134919
Approved by: https://github.com/eqy
2024-09-01 09:07:25 +00:00
29b7852dc1
drop gil in a couple of places (leads to deadlocks) ( #134910 )
...
Per title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134910
Approved by: https://github.com/eqy
2024-09-01 00:05:53 +00:00
4655eb3ee2
Uses MemPoolContext to route allocations from CUDACachingAllocator ( #134685 )
...
Re-open of https://github.com/pytorch/pytorch/pull/133599 that was mistakenly closed by issuing `ghstack land`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134685
Approved by: https://github.com/ezyang
2024-08-29 03:56:31 +00:00
2c88a923a7
Revert "Refactor caching device allocator utils ( #130923 )"
...
This reverts commit c45ca8092dddf718563a1a754de798ad25eae1ee.
Reverted https://github.com/pytorch/pytorch/pull/130923 on behalf of https://github.com/ZainRizvi due to Sorry but this appears to be causing internal tests to fail with errors like `error: no type named 'DeviceStats' in namespace 'xxx::xxx:xxxAllocator'; did you mean 'DeviceStatus'?` ([comment](https://github.com/pytorch/pytorch/pull/130923#issuecomment-2315730155 ))
2024-08-28 15:56:08 +00:00
c45ca8092d
Refactor caching device allocator utils ( #130923 )
...
# Motivation
Following [[RFC] Intel GPU Runtime Upstreaming for Allocator ](https://github.com/pytorch/pytorch/issues/116322 ), this PR aims to refactor the caching device allocator utils to improve code reuse.
This is the first PR; follow-up PRs can continue refactoring the device caching allocator.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130923
Approved by: https://github.com/EikanWang , https://github.com/gujinghui , https://github.com/albanD , https://github.com/eqy
2024-08-28 01:35:23 +00:00
255cd75a97
[sparse] Add cuSPARSELt as a backend ( #128534 )
...
Summary:
This PR adds in cuSPARSELt as a backend to PyTorch.
It is now possible to check whether cuSPARSELt is available, and its version if it is, with:
```
torch.backends.cusparselt.is_available()
torch.backends.cusparselt.version()
```
Test Plan:
```
python test/test_sparse_semi_structured.py -k test_cusparselt_backend
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128534
Approved by: https://github.com/cpuhrsch , https://github.com/eqy , https://github.com/syed-ahmed
2024-08-21 22:06:07 +00:00
018e48c337
[Reland] Add wrappers for synchronous GPUDirect Storage APIs ( #133489 )
...
Reland #130633
USE_CUFILE turned off by default in this version
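A hedged sketch of what using the synchronous wrappers might look like, assuming they are exposed as a `torch.cuda.gds.GdsFile` class with `save_storage` / `load_storage` methods; the exact Python surface is an assumption, and the build must have USE_CUFILE enabled:
```
import os
import torch

src = torch.randn(1024, device="cuda")
dst = torch.empty_like(src)

# Assumed wrapper: open a file for GPUDirect Storage reads/writes.
f = torch.cuda.gds.GdsFile("/tmp/gds_demo.bin", os.O_CREAT | os.O_RDWR)
f.save_storage(src.untyped_storage(), offset=0)  # device -> file, no host bounce buffer
f.load_storage(dst.untyped_storage(), offset=0)  # file -> device

torch.testing.assert_close(src, dst)
```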
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133489
Approved by: https://github.com/albanD
2024-08-15 17:11:52 +00:00
260e7cb143
Make CUDA device properties' __repr__ output actually printable ( #132863 )
...
Previously we would write the UUID bytes directly, leading to 'invalid UTF-8 sequence' errors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132863
Approved by: https://github.com/Skylion007 , https://github.com/eqy
2024-08-07 21:08:43 +00:00
527f104a69
add L2 cache size to device properties ( #132819 )
...
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132819
Approved by: https://github.com/eellison
2024-08-07 04:55:06 +00:00
e191b83462
Revert "Add wrappers for synchronous GPUDirect Storage APIs ( #130633 )"
...
This reverts commit 709ddf7a9dcfa1268848b72f6f56b55afa6728d6.
Reverted https://github.com/pytorch/pytorch/pull/130633 on behalf of https://github.com/clee2000 due to still failing internally D60265673 ([comment](https://github.com/pytorch/pytorch/pull/130633#issuecomment-2253239607 ))
2024-07-26 18:08:20 +00:00
709ddf7a9d
Add wrappers for synchronous GPUDirect Storage APIs ( #130633 )
...
Based in part on https://github.com/NVIDIA/apex/pull/1774
Differential Revision: [D60155434](https://our.internmc.facebook.com/intern/diff/D60155434 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130633
Approved by: https://github.com/albanD
2024-07-25 22:23:38 +00:00
fddb1bcdea
[CCA][Memory Snapshot] Move user_defined annotations to Native Caching Allocator ( #130964 )
...
Summary: Instead of embedding the user_defined TraceEntry records inside device_traces, which causes issues when some threads may not have the proper device id set, save them in an external_annotations field backed by a RingBuffer<AnnotationEntry> called annotation_buffer, owned by the NativeCachingAllocator.
Test Plan: CI, resnet run, and FBR model.
Differential Revision: D59703213
Pulled By: aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130964
Approved by: https://github.com/zdevito
2024-07-25 14:06:52 +00:00
e4b5645f83
Revert "Add wrappers for synchronous GPUDirect Storage APIs ( #130633 )"
...
This reverts commit 5b5e0698a5f560decb9bbdd150ed7b0622eb7777.
Reverted https://github.com/pytorch/pytorch/pull/130633 on behalf of https://github.com/clee2000 due to breaking a lot of jobs and build rules internally D60085885, possibly needs to update some bazel build? ([comment](https://github.com/pytorch/pytorch/pull/130633#issuecomment-2245806738 ))
2024-07-23 17:19:34 +00:00
5b5e0698a5
Add wrappers for synchronous GPUDirect Storage APIs ( #130633 )
...
Based in part on https://github.com/NVIDIA/apex/pull/1774
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130633
Approved by: https://github.com/albanD
2024-07-22 14:51:24 +00:00
7c299b46ca
Revert "Invalidate StorageImpl instances when tensor is overwritten with cudagraphs ( #125264 )"
...
This reverts commit 8390843eba6271dcdbec7d048e9fa4e56d4479d8.
Reverted https://github.com/pytorch/pytorch/pull/125264 on behalf of https://github.com/izaitsevfb due to breaks internal tests ([comment](https://github.com/pytorch/pytorch/pull/125264#issuecomment-2240516202 ))
2024-07-19 22:58:51 +00:00
5f981388ec
Revert "[ROCm] Enable ROCm support for inductor's dynamic_rblock_scaling ( #129663 )"
...
This reverts commit d7a78ec8b938a61297221912464f5afef288b823.
Reverted https://github.com/pytorch/pytorch/pull/129663 on behalf of https://github.com/atalman due to Breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/129663#issuecomment-2240011143 ))
2024-07-19 19:46:26 +00:00
d7a78ec8b9
[ROCm] Enable ROCm support for inductor's dynamic_rblock_scaling ( #129663 )
...
As of ROCm 6.1, [hipDeviceProp_t::regsPerMultiprocessor](https://rocm.docs.amd.com/projects/HIP/en/latest/doxygen/html/structhip_device_prop__t.html#a7390d5b180d63978c81aa971060270b4 ) is now available, allowing us to enable this attribute on ROCm.
```
>>> torch.cuda.get_device_properties(0)
_CudaDeviceProperties(name='AMD Instinct MI250X/MI250', major=9, minor=0, gcnArchName='gfx90a:sramecc+:xnack-', total_memory=65520MB, multi_processor_count=104)
>>> torch.cuda.get_device_properties(0).regs_per_multiprocessor
65536
```
With https://github.com/triton-lang/triton/pull/3962 we can extract n_regs and n_spills from a Triton binary with the AMD backend, allowing us to enable inductor's dynamic_rblock_scaling on ROCm, initially implemented in https://github.com/pytorch/pytorch/pull/115094
Leaving this in draft until the following PRs have landed:
- https://github.com/pytorch/pytorch/pull/129361 to bump the triton commit pin
- https://github.com/pytorch/pytorch/pull/128449 to allow us to grab warp_size from device properties instead of hard coding 64 on ROCm.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129663
Approved by: https://github.com/jansel , https://github.com/shunting314
2024-07-19 09:45:03 +00:00