Commit Graph

808 Commits

Author SHA1 Message Date
68c12ecfe2 Move get accelerator to use build time flags when possible (#146098)
This PR does two main things (they are in a single PR to show how the newly added APIs are used).

- Add isBuilt and isAvailable APIs to the AcceleratorHook interface. See inline doc for their exact semantic
- Use the newly added isBuilt for accelerator check to ensure it does not poison fork

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146098
Approved by: https://github.com/ngimel, https://github.com/malfet, https://github.com/EikanWang, https://github.com/jeromean

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-03-10 13:17:58 +00:00
b246cd7b82 Revert "Move get accelerator to use build time flags when possible (#146098)"
This reverts commit 17302b4bc837af079d2f6480f07ea2c99b93fb4b.

Reverted https://github.com/pytorch/pytorch/pull/146098 on behalf of https://github.com/albanD due to Still fails with cuda build on a non-gpu machine ([comment](https://github.com/pytorch/pytorch/pull/146098#issuecomment-2707191770))
2025-03-07 18:59:58 +00:00
17302b4bc8 Move get accelerator to use build time flags when possible (#146098)
This PR does two main things (they are in a single PR to show how the newly added APIs are used).

- Add isBuilt and isAvailable APIs to the AcceleratorHook interface. See inline doc for their exact semantic
- Use the newly added isBuilt for accelerator check to ensure it does not poison fork

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146098
Approved by: https://github.com/ngimel, https://github.com/malfet, https://github.com/EikanWang, https://github.com/jeromean

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-03-07 15:19:34 +00:00
5f392ae560 Throws error when using torch.cuda.MemPool with expandable segments (#148378)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148378
Approved by: https://github.com/ngimel, https://github.com/eqy
ghstack dependencies: #148374
2025-03-07 05:22:03 +00:00
c65ee728f0 Initial implementation of host memory stats (#147660)
This is an initial attempt to provide some statistics for the pinned host memory allocations flowing through CachingHostAllocator. Many times in the past we have had inexplicable slowdowns that would be much easier to diagnose if we had some host memory characteristics.

This change tries very hard not to disrupt the initial design of the allocator, and it uses existing locking mechanism, whenever possible, to gather statistics "for free". Only deviation from that is on the "slow path" where we incur CUDA calls anyway, so taking a short lock is not going to hurt the performance much, especially in the steady state where most allocations will come from cache.

As mentioned before, this is the first PR, to introduce the concept and to see if it fits the right paradigm. We can always add more later.

Metrics that would require more involved changes to the code base and locks, like requested memory, have been punted for now. I also tried to reuse the Stat structure used in CUDA caching allocator, in order to maintain symmetry.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147660
Approved by: https://github.com/ngimel
2025-03-05 16:13:19 +00:00
cyy
ec2805ada8 Remove outdated CUDA version check (#148142)
Since Torch requires CUDA>=11, some checks can be removed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148142
Approved by: https://github.com/janeyx99, https://github.com/eqy
2025-03-04 03:33:44 +00:00
a983b2b11a Revert "Initial implementation of host memory stats (#147660)"
This reverts commit 945e359fc1afe6c0bb6129ed9607b237fa19cd98.

Reverted https://github.com/pytorch/pytorch/pull/147660 on behalf of https://github.com/mradmila due to There is an issue with ambiguous definition of Stat structure when different C++ tools are used. Backing out for now. ([comment](https://github.com/pytorch/pytorch/pull/147660#issuecomment-2692346379))
2025-03-01 18:05:45 +00:00
493cd97af5 add skips to test_notifies_oom and test_set_per_process_memory_fraction (#148134)
Tests fail in NVIDIA internal CI since we do not support nvml on Jetson, but nvml is required for OOM reporting to work properly, so we are skipping the failing tests for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148134
Approved by: https://github.com/eqy
2025-03-01 02:59:48 +00:00
945e359fc1 Initial implementation of host memory stats (#147660)
This is an initial attempt to provide some statistics for the pinned host memory allocations flowing through CachingHostAllocator. Many times in the past we have had inexplicable slowdowns that would be much easier to diagnose if we had some host memory characteristics.

This change tries very hard not to disrupt the initial design of the allocator, and it uses existing locking mechanism, whenever possible, to gather statistics "for free". Only deviation from that is on the "slow path" where we incur CUDA calls anyway, so taking a short lock is not going to hurt the performance much, especially in the steady state where most allocations will come from cache.

As mentioned before, this is the first PR, to introduce the concept and to see if it fits the right paradigm. We can always add more later.

Metrics that would require more involved changes to the code base and locks, like requested memory, have been punted for now. I also tried to reuse the Stat structure used in CUDA caching allocator, in order to maintain symmetry.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147660
Approved by: https://github.com/ngimel
2025-02-28 18:36:44 +00:00
cyy
b0dfd242fa Remove NO_MULTIPROCESSING_SPAWN checks (#146705)
py 3.9 has spawn.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146705
Approved by: https://github.com/colesbury
2025-02-28 05:53:19 +00:00
926b7b5027 Revert "Remove NO_MULTIPROCESSING_SPAWN checks (#146705)"
This reverts commit 40ad5e01dff05c7d64e070fb01683820e678f788.

Reverted https://github.com/pytorch/pytorch/pull/146705 on behalf of https://github.com/cyyever due to Broke lint?, I guess land race with rufff update ([comment](https://github.com/pytorch/pytorch/pull/146705#issuecomment-2689603077))
2025-02-28 03:04:38 +00:00
40ad5e01df Remove NO_MULTIPROCESSING_SPAWN checks (#146705)
py 3.9 has spawn.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146705
Approved by: https://github.com/colesbury
2025-02-28 00:15:32 +00:00
0ea5d1067b ROCm: Remove static specifier for allow_tf32 variable. (#147186)
Since the env variable HIPBLASLT_ALLOW_TF32 can change, remove static type for allow_tf32 variable so that it captures the current value of env variable HIPBLASLT_ALLOW_TF32.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147186
Approved by: https://github.com/jeffdaily, https://github.com/naromero77amd
2025-02-26 18:24:02 +00:00
de80b6f0d3 Updated test_cuda.py to rerun tests (#147040)
Initially test_cuda::TestCudaMallocAsync::test_clock_speed and test_cuda::TestCudaMallocAsync::test_power_draw are skipped in this [commit](d4871750d9).

Pulled ROCm nightly image and verified these two tests run fine locally. Filed this PR to enable them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147040
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
2025-02-25 19:58:42 +00:00
fb73b0c7c5 Revert "use copy2d in h2d/d2h copy when possible (#146256)"
This reverts commit 0bc036a9e98d2cc92ff9dd367342b1f2efcc15f0.

Reverted https://github.com/pytorch/pytorch/pull/146256 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/146256#issuecomment-2680868627))
2025-02-25 07:06:38 +00:00
e8fbc86de0 Make torch.cuda.gds APIs public (#147120)
Follow up to https://github.com/pytorch/pytorch/pull/145748 that turned USE_CUFILE on for CUDA 12.6 and 12.8 binaries

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147120
Approved by: https://github.com/albanD
2025-02-14 17:06:50 +00:00
861bf892fb Set USE_CUFILE=1 by default and add pypi package to binary build matrix (#145748)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145748
Approved by: https://github.com/atalman
2025-02-11 15:49:01 +00:00
9ee506bd93 [CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441)
Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441
Approved by: https://github.com/Chillee, https://github.com/malfet
2025-02-06 19:04:50 +00:00
f27220e32a Revert "Move get accelerator to use build time flags when possible (#146098)"
This reverts commit 157d81c201715f84ead21d0ee420669ab7f58c04.

Reverted https://github.com/pytorch/pytorch/pull/146098 on behalf of https://github.com/atalman due to Failing internally, sorry need to revert ([comment](https://github.com/pytorch/pytorch/pull/146098#issuecomment-2637443675))
2025-02-05 16:39:37 +00:00
157d81c201 Move get accelerator to use build time flags when possible (#146098)
This PR does two main things (they are in a single PR to show how the newly added APIs are used).

- Add isBuilt and isAvailable APIs to the AcceleratorHook interface. See inline doc for their exact semantic
- Use the newly added isBuilt for accelerator check to ensure it does not poison fork

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146098
Approved by: https://github.com/ngimel, https://github.com/malfet, https://github.com/EikanWang

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-02-04 18:23:24 +00:00
0bc036a9e9 use copy2d in h2d/d2h copy when possible (#146256)
A rewrite of #138964
In addition to rewriting the conditions for using copy2d, this PR fixes a few other problems with #138964:
1) gpu-gpu copies when peer access is disabled shouldn't rely on copy2d
2) copy2d should record even for the host pinned memory, like the regular copy does
3) copy2d shouldn't pretend that it's synchronizing (for the purposes of cuda sanitizer tracer) when it's non-blocking

In this PR copy2d behaves in exactly the same way as copy does wrt to those additional syncs, except it calls a different underlying cuda call.

Tests for multiple cases going through copy2d and avoiding copy2d pattern due to unsatisfied conditions are added.
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146256
Approved by: https://github.com/eqy, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-03 23:07:54 +00:00
c39c679813 Revert "Tensor .cuda() very slow with specific array sizes (#138964)"
This reverts commit 98f87edd233ea69cee5f3e73e9eb4b5ab77aa744.

Reverted https://github.com/pytorch/pytorch/pull/138964 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but some slow test start failing after this lands ([comment](https://github.com/pytorch/pytorch/pull/138964#issuecomment-2628455198))
2025-01-31 21:48:51 +00:00
c3f71eb61b Revert "[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441)"
This reverts commit e2917245fb0c0b6aab216e7a0a254b80e7a9e78f.

Reverted https://github.com/pytorch/pytorch/pull/144441 on behalf of https://github.com/ZainRizvi due to Sorry but this still fails internally with the same error.  @Chillee or @malfet, can you please help the change get tested? (See D68783351) ([comment](https://github.com/pytorch/pytorch/pull/144441#issuecomment-2627886999))
2025-01-31 17:43:09 +00:00
98f87edd23 Tensor .cuda() very slow with specific array sizes (#138964)
### **Pull Request: Optimized Non-Contiguous Tensor Copy for CPU to GPU in PyTorch**

#### **Summary**
This PR addresses the performance issue identified in [#111570](https://github.com/pytorch/pytorch/issues/111570), where non-contiguous tensors took significantly longer to transfer from CPU to GPU. Through detailed tracing of the call flow, we identified that PyTorch was creating temporary contiguous buffers for non-contiguous tensor transfers, which introduced unnecessary overhead.

#### **Tracing the Issue**
To pinpoint the cause of the slowdown, we followed the call flow from Python’s `tensor.cuda()` method through PyTorch’s backend, ultimately identifying `copy_kernel_cuda` as the key function responsible for CPU-to-GPU tensor transfers. Here’s a summary of the tracing process:

1. **Python Call: `tensor.cuda()`**
   - Starting from Python, the `cuda()` method initiates the tensor transfer to the GPU.

2. **`TensorBody.h: cuda()`**
   - The `cuda()` method calls `to()`, specifying the target device as CUDA.

3. **`Tensor.cpp: TensorBase::to()`**
   - The `to()` function prepares device and data type options before invoking `_ops::to_dtype_layout::call()`.

4. **Operator Call: `_ops::to_dtype_layout::call()`**
   - This operator dispatches the request to the backend-specific function responsible for managing the transfer.

5. **`Copy.cpp: copy_()`**
   - The `copy_()` function performs preliminary checks (e.g., zero-tensor immutability) and proceeds to call `copy_impl()`.

6. **`Copy.cpp: copy_impl()`**
   - This function sets up a tensor iterator and dispatches the copy operation to the appropriate backend through `copy_stub`.

7. **Dispatch to CUDA: `copy_stub`**
   - The dispatch mechanism routes the call to the CUDA-specific function, `copy_kernel_cuda`.

8. **`Copy.cu: copy_kernel_cuda()`**
   - Here, we identified that PyTorch was creating temporary contiguous buffers for 1D and 2D non-contiguous tensors, which slowed down the copy process. This behavior is managed by the `copy_requires_temporaries()` function.

#### **Solution**
To address this, we modified `copy_kernel_cuda` to handle non-contiguous 1D and 2D tensors directly by using `cudaMemcpy2DAsync`, which allows efficient, stride-aware memory transfers without temporary buffers. Here’s why this approach improves performance:

- **Efficiency of `cudaMemcpy2DAsync`**: This CUDA function is optimized for pitched (stride-based) memory transfers, allowing it to handle non-contiguous data layouts effectively by specifying memory strides for source and destination tensors.
- **Reduction of Overhead**: By directly copying non-contiguous tensors without intermediate buffers, we eliminate extra memory allocation and achieve faster CPU-to-GPU transfers.
- **Asynchronous Execution**: `cudaMemcpy2DAsync` enables asynchronous transfer on the CUDA stream, further improving performance by taking advantage of CUDA's optimized memory handling for non-contiguous layouts.

#### **Performance Results**

In my testing, I created tensors of size `327680 x 2000` and used slices for transfer performance measurements. The tests show that the average time for transferring a non-contiguous slice (e.g., rows 10,000 to 50,000) from CPU to GPU now closely matches the contiguous case. This improvement indicates that the updated implementation effectively addresses the performance discrepancy. Below are the measured times and validation checks:

```plaintext
Average time for contiguous slice (rows 10,000-50,000): 66 ms
Average time for non-contiguous slice (rows 10,000-50,000): 66 ms

Validation of contiguous and non-contiguous tensor copies:
 PASS: Tensor shapes match.
 PASS: Tensor contiguity matches.
 PASS: Tensor contents match.
 PASS: Tensor data types match.

 Success: Both contiguous and non-contiguous tensors were copied correctly to the GPU.
```

#### **Conclusion**
This PR resolves the identified performance issue by eliminating the need for temporary buffers in non-contiguous 1D and 2D tensor transfers, ensuring faster and more efficient copies from CPU to GPU. Future optimizations could further enhance performance for higher-dimensional non-contiguous tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138964
Approved by: https://github.com/jeffdaily

Co-authored-by: Natalia Gimelshein <ngimel@gmail.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-01-31 17:05:02 +00:00
e2917245fb [CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441)
Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441
Approved by: https://github.com/Chillee, https://github.com/malfet
2025-01-30 22:33:50 +00:00
08ff11e9d0 initialize device when pinning memory on this device, short circuit i… (#145752)
…s_pinned if device is not initialized
Do not land
RFC
potential fix for #144687

Now `.is_pinned(device="cuda")` does not initialize device and thus doesn't poison the fork (but it complains about `device` arg being deprecated). To not need `device=` arg we'd need to fix get_accelerator to not initialize device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145752
Approved by: https://github.com/albanD

Co-authored-by: albanD <albandes@fb.com>
2025-01-30 21:37:29 +00:00
6967ef1b07 [ROCm] fix test_cublas_workspace_explicit_allocation for gfx12 (#145227)
gfx12 passes the condition `torch.cuda.get_device_capability() >= (9, 4)` and uses `default_workspace_size=128MB`, but it required only for MI300
Fix condition to use `("gfx94" in gcn_arch)` instead of `torch.cuda.get_device_properties()` to detect MI300.
Now `default_workspace_size=32MB` is used for gfx12 and the test passes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145227
Approved by: https://github.com/jeffdaily, https://github.com/eqy
2025-01-28 16:19:27 +00:00
c986eba560 Revert "[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441)"
This reverts commit abf28982a8cb43342e7669d859de9543fd804cc9.

Reverted https://github.com/pytorch/pytorch/pull/144441 on behalf of https://github.com/ZainRizvi due to Sorry but this is failing internally. @Chillee can you please help change get remerged? See  D68720562 ([comment](https://github.com/pytorch/pytorch/pull/144441#issuecomment-2616726406))
2025-01-27 19:38:26 +00:00
abf28982a8 [CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441)
Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441
Approved by: https://github.com/Chillee
2025-01-27 18:05:23 +00:00
FEI
615bdd9c81 Improve the caching allocator test for raw alloc (#145269)
1 Prevent block allocated by torch._C._cuda_cudaCachingAllocator_raw_alloc from affecting torch.cuda.empty_cache() in other unit tests
2 Additionally, tested the changes to raw_delete in https://github.com/pytorch/pytorch/pull/131114

@jeffdaily @albanD @houseroad @eqy @aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145269
Approved by: https://github.com/albanD, https://github.com/eqy, https://github.com/jeffdaily
2025-01-24 21:07:17 +00:00
dad9bc3461 Revert "[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441)"
This reverts commit de945d78da9198e58df7c19c53b737d0f987ddff.

Reverted https://github.com/pytorch/pytorch/pull/144441 on behalf of https://github.com/izaitsevfb due to unused variables again :( ([comment](https://github.com/pytorch/pytorch/pull/144441#issuecomment-2611182461))
2025-01-23 22:59:25 +00:00
de945d78da [CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441)
Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441
Approved by: https://github.com/Chillee
2025-01-22 22:42:48 +00:00
4ea189422d Revert "[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441)"
This reverts commit a6763b7b81cd1a55c8316dfdb5bca19819a1429a.

Reverted https://github.com/pytorch/pytorch/pull/144441 on behalf of https://github.com/kit1980 due to breaking internal builds: unused variable 'halpha' ([comment](https://github.com/pytorch/pytorch/pull/144441#issuecomment-2596895865))
2025-01-16 21:12:41 +00:00
eqy
a6763b7b81 [CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441)
Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441
Approved by: https://github.com/Chillee
2025-01-15 18:37:55 +00:00
64bcf39180 Revert "[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441)"
This reverts commit 388b75edec09182131be0dfe1abeafc5c3b91adf.

Reverted https://github.com/pytorch/pytorch/pull/144441 on behalf of https://github.com/kit1980 due to breaking internal builds: unused variable 'halpha' ([comment](https://github.com/pytorch/pytorch/pull/144441#issuecomment-2588517060))
2025-01-14 00:48:28 +00:00
eqy
388b75edec [CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441)
Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441
Approved by: https://github.com/Chillee
2025-01-11 15:30:38 +00:00
b80ecc4457 Revert "Fix poision child process issue when call getAccelerator() (#144368)"
This reverts commit 2583d831d40d6fa64f0b637d5bc7598e484a3283.

Reverted https://github.com/pytorch/pytorch/pull/144368 on behalf of https://github.com/clee2000 due to broke internal tests D68023262, probably the same problem as noted in the issue this PR is mentioned above ([comment](https://github.com/pytorch/pytorch/pull/144368#issuecomment-2584848568))
2025-01-10 23:36:43 +00:00
2583d831d4 Fix poision child process issue when call getAccelerator() (#144368)
# Motivation
fix https://github.com/pytorch/pytorch/issues/144152

# Solution

- Align `at::globalContext()::hasXXX` to determine if accelerator XXX is built with PyTorch or an extension already registered to PyTorch.
- Define `at::hasXXX` to determine if accelerator XXX is available at runtime.
- Use `at::globalContext()::hasXXX` in `getAccelerator` rather than `at::hasXXX` to avoid initializing the XXX runtime (which can poison child processes) while detecting the current accelerator.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144368
Approved by: https://github.com/albanD, https://github.com/atalman, https://github.com/gujinghui
2025-01-10 09:28:27 +00:00
6de110b862 Support with statement on torch.Stream (#140138)
# Motivation
We propose to support Python with statement on `torch.Stream`. This is a benefit for all accelerators when writing device-agnostic code. The device-specific stream will also be supported because they are generally derived from `torch.Stream`.

With this PR, we can do like this
```python
s1= torch.Stream()
# Set s1 to the current stream
torch.accelerator.set_stream(s1)
with torch.Stream() as s2:
    # Inside with statement, we set s2 to the current stream
    assert torch.accelerator.current_stream() == s2
# Here the current stream should be s1
assert torch.accelerator.current_stream() == s1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140138
Approved by: https://github.com/albanD
2025-01-10 02:05:19 +00:00
d4871750d9 [ROCm] Enable post-merge trunk workflow on MI300 runners; skip and fix MI300 related failed tests (#143673)
This PR
* makes changes to the workflow files and scripts so we can run CI workflows on the MI300 runners
* skips and fixes several tests, failed on MI300, observed in https://github.com/pytorch/pytorch/pull/140989

Skipped due to unsupported Float8_e4m3fn data type on MI300 (need to update test code to use datatypes supported by MI300):
- distributed.tensor.parallel.test_micro_pipeline_tp.py::MicroPipelineTPTest::test_fuse_all_gather_scaled_matmul_A_dims_\*_gather_dim_\* (24 tests across inductor/distributed configs)
- distributed.tensor.parallel.test_micro_pipeline_tp.py::test_fuse_scaled_matmul_reduce_scatter_A_dims_\*_scatter_dim_\* (12 tests across inductor/distributed configs))
- inductor.test_loop_ordering::LoopOrderingTest::test_fp8_cast_and_t
- inductor.test_loop_ordering::LoopOrderingTest::test_fp8_pattern_2

Skipped due to AssertionError on MI300:
- inductor.test_mkldnn_pattern_matcher.py::test_qconv2d_int8_mixed_bf16
- distributed._tools.test_sac_ilp::TestSACILP::test_sac_ilp_case1

Skipped:
- test_cuda.py::TestCudaMallocAsync::test_clock_speed
- test_cuda.py::TestCudaMallocAsync::test_power_draw
- test_torch.py::TestTorchDeviceTypeCUDA::test_deterministic_cumsum_cuda

Skipped flaky tests on MI300:
- distributed.test_c10d_gloo.py::ProcessGroupGlooTest::test_gather_stress_cuda
- inductor.test_cpu_repro::CPUReproTests::test_lstm_packed_unbatched_False* (256 tests)

Fixed:
- test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_float8_basics_cuda

Features:
- inductor/test_fp8.py - declare a new function to convert FP8 datatypes to ROCm supported FP8 datatypes. It keeps test names for CUDA and ROCm and allows to enable Inductor FP8 tests on CPU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143673
Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/pruthvistony

Co-authored-by: saienduri <saimanas.enduri@amd.com>
Co-authored-by: Jithun Nair <jithun.nair@amd.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-01-09 05:18:57 +00:00
3d3a07963f [reland][attempt2][AMD] Turn on TF32 for aten::mm (#144145)
Summary:
https://github.com/pytorch/pytorch/pull/143549 was reverted due to some
internal/oss tooling issue. Relanding.

hipblaslt supports TF32, so adding the support.
Original PR https://github.com/pytorch/pytorch/pull/139869

Test Plan: CI

Differential Revision: D67785496

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144145
Approved by: https://github.com/jianyuh
2025-01-06 00:37:01 +00:00
cyy
df458be4e5 [4/N] Apply py39 ruff and pyupgrade fixes (#143257)
```torch/fx/passes/annotate_getitem_nodes.py``` was changed to support the new type hinting annotations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143257
Approved by: https://github.com/justinchuby, https://github.com/albanD
2025-01-04 10:47:51 +00:00
c0d710634f Respect ROCR_VISIBLE_DEVICES on AMD GPU device discovery (#142292)
Reland of #140320 after failing test on trunk. Fixes potential environment clobbering in test, makes ROCr+HIP devices (if specified together) more robust to index errors.

Fixes #140318

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142292
Approved by: https://github.com/jataylo, https://github.com/huydhn, https://github.com/jeffdaily

Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2024-12-25 02:37:11 +00:00
448c16ac87 Revert "[reland][AMD] Turn on TF32 for aten::mm (#143549)"
This reverts commit 41cdc7f73552cc8a0dbf2d3cb55440c0d6b548ea.

Reverted https://github.com/pytorch/pytorch/pull/143549 on behalf of https://github.com/malfet due to It breaks ROCM testing, see 06b4b96b34/1 ([comment](https://github.com/pytorch/pytorch/pull/143549#issuecomment-2559016960))
2024-12-23 06:47:36 +00:00
07fa6e2c8b Fix torch.accelerator api abort when passing invaild device (#143550)
# Motivation
Fix https://github.com/pytorch/pytorch/issues/143543

# Solution
We should raise python exception instead of aborting...

# Additional Context
without this PR:
```python
>>> import torch
>>> torch.accelerator.current_stream(torch.accelerator.device_count())
terminate called after throwing an instance of 'c10::Error'
  what():  device is out of range, device is 2, total number of device is 2.
Exception raised from check_device_index at /home/dvrogozh/git/pytorch/pytorch/c10/xpu/XPUFunctions.h:36 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xac (0x7f30707eb95c in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xf3 (0x7f307078fc57 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10.so)
frame #2: <unknown function> + 0x19a3e (0x7f3070c2ba3e in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so)
frame #3: c10::xpu::getCurrentXPUStream(signed char) + 0x2f (0x7f3070c2c83f in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so)
frame #4: <unknown function> + 0x1ca35 (0x7f3070c2ea35 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so)
frame #5: <unknown function> + 0x653f15 (0x7f3083391f15 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x39e5f2 (0x7f30830dc5f2 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libtorch_python.so)
<omitting python frames>
frame #20: <unknown function> + 0x29d90 (0x7f308b19bd90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #21: __libc_start_main + 0x80 (0x7f308b19be40 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)
```
with this PR:
```python
>>> import torch
>>> torch.accelerator.current_stream(torch.accelerator.device_count())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/pt-gpu/4T-4652/guangyey/stock-pytorch/torch/accelerator/__init__.py", line 123, in current_stream
    return torch._C._accelerator_getStream(device_index)
RuntimeError: The device index is out of range. It must be in [0, 2), but got 2.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143550
Approved by: https://github.com/EikanWang, https://github.com/dvrogozh, https://github.com/albanD
2024-12-23 03:44:22 +00:00
41cdc7f735 [reland][AMD] Turn on TF32 for aten::mm (#143549)
Summary:
hipblaslt supports TF32, so adding the support.

Original PR https://github.com/pytorch/pytorch/pull/139869

Test Plan: CI

Differential Revision: D67431681

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143549
Approved by: https://github.com/eqy
2024-12-22 21:05:05 +00:00
d8c8ba2440 Fix unused Python variables in test/[e-z]* (#136964)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136964
Approved by: https://github.com/justinchuby, https://github.com/albanD
2024-12-18 23:02:30 +00:00
7ab3177776 Revert "[AMD] Turn on TF32 for aten::mm (#139869)"
This reverts commit e0bdae7884aed09d9e3f1a3f7a53c095e74a9aff.

Reverted https://github.com/pytorch/pytorch/pull/139869 on behalf of https://github.com/jeffdaily due to causing ROCm CI failures, need to investigate, revert for now ([comment](https://github.com/pytorch/pytorch/pull/139869#issuecomment-2546127069))
2024-12-16 16:46:48 +00:00
45ac4ebf15 [RELAND] Add UTs for accelerator device-agnostic runtime APIs (#133572)
# Motivation
This PR intends to add UTs for accelerator device-agnostic APIs.

# Additional Context
This PR is relanded. It is reverted because `torch.Event` doesn't support mps backend. We have fixed it in https://github.com/pytorch/pytorch/pull/142468. The previous commit is 952514f0c8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133572
Approved by: https://github.com/EikanWang, https://github.com/albanD
ghstack dependencies: #143171
2024-12-16 02:18:41 +00:00
e0bdae7884 [AMD] Turn on TF32 for aten::mm (#139869)
Summary: hipblaslt supports TF32, so adding the support.

Test Plan: CI

Differential Revision: D65435392

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139869
Approved by: https://github.com/leitian
2024-12-15 10:02:29 +00:00