b44fb14906
Remove unused parameter when querying extension attribute ( #165623 )
...
# Motivation
This code is no longer needed since SYCL compiler 2025.0. We are now using compiler 2025.2 (two tool uplifts later), so it can be safely removed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165623
Approved by: https://github.com/EikanWang
ghstack dependencies: #165622
2025-10-17 08:16:13 +00:00
51348c0219
Give a friendly message for older Intel GPU ( #165622 )
...
# Motivation
Notify the user if the GPU is older than officially supported. This provides a friendly warning that the GPU may work, but the experience could be unstable.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165622
Approved by: https://github.com/EikanWang
2025-10-17 08:16:13 +00:00
d0c32971b4
Refine XPU allocator message when OOM ( #165509 )
...
# Motivation
Provide more information and align with other backends to enhance the user experience.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165509
Approved by: https://github.com/EikanWang
ghstack dependencies: #165508
2025-10-16 05:47:49 +00:00
66b75693ae
Reuse kLargeBuffer in XPUCachingAllocator ( #165508 )
...
# Motivation
Reuse the shared `kLargeBuffer` constant instead of duplicating it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165508
Approved by: https://github.com/EikanWang
2025-10-16 04:12:52 +00:00
59ad8f1ac6
[XPU] Enhance XPUGeneratorImpl functionality to support XPUGraph ( #163332 )
...
As the [XPUGraph RFC](https://github.com/pytorch/pytorch/issues/162143) describes, this PR enhances `XPUGeneratorImpl` to support XPUGraph.
In this PR, we add `XPUGeneratorState` and `PhiloxXpuState`, which lets XPUGraph update the Philox state correctly during graph capture and replay.
XPUGraph PR submission plan:
- [ ] 1. Enhance XPUGenerator functionality; add `XPUGeneratorState` and Philox state
- [ ] 2. Implement XPUGraph capture_begin/capture_end/instantiate functionality
- [ ] 3. Implement XPUGraph replay/debug_dump/reset functionality
- [ ] 4. Python APIs: is_current_stream_capturing/graph_pool_handle/graph
- [ ] 5. Python APIs: make_graphed_callables
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163332
Approved by: https://github.com/gujinghui , https://github.com/EikanWang , https://github.com/albanD
2025-10-13 02:10:41 +00:00
f8746b878d
Add uuid to XPU device properties ( #161392 )
...
# Motivation
Fix https://github.com/intel/torch-xpu-ops/issues/1955
Refer to https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/supported/sycl_ext_intel_device_info.md#device-uuid , `ext::intel::info::device::uuid` returns `std::array<unsigned char, 16>` as the UUID.
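As a usage illustration (a minimal sketch, assuming at least one XPU device is visible; `uuid` is the field this PR adds):
```python
import torch

# Read the new `uuid` field from the XPU device properties;
# it is a 16-byte value backed by std::array<unsigned char, 16>.
props = torch.xpu.get_device_properties(0)
print(props.uuid)
```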
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161392
Approved by: https://github.com/EikanWang , https://github.com/albanD
2025-09-02 06:41:32 +00:00
b7b9fb9962
Revert "Deprecate overleap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead ( #156165 )" ( #161627 )
...
This reverts commit c1145852a5eac96f5551b5d1805109ce4dc5e1fa.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161627
Approved by: https://github.com/atalman
ghstack dependencies: #161625 , #161626
2025-08-27 21:37:14 +00:00
06ddaf1e0a
Revert "Back out "Deprecate overleap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead ( #156165 )" ( #160999 )" ( #161625 )
...
This reverts commit a818fa77e3a72271f144514ef349c5a666313205.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161625
Approved by: https://github.com/atalman
2025-08-27 21:34:12 +00:00
a818fa77e3
Back out "Deprecate overleap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead ( #156165 )" ( #160999 )
...
Summary: reverting this diff since it caused S551328. Please see D80217492 for details.
Test Plan:
NA
Rollback Plan:
Differential Revision: D80553314
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160999
Approved by: https://github.com/izaitsevfb , https://github.com/jingsh
2025-08-20 15:04:36 +00:00
d7114f05b1
Add DeviceAllocator as the base device allocator ( #138222 )
...
# Motivation
In line with the RFC [A device-agnostic Python device memory related API design for stream-based accelerators](https://github.com/pytorch/pytorch/issues/134978), some memory-related APIs are widely used in popular repositories such as HuggingFace Accelerate, which needs [many if-else conditional branches](https://github.com/search?q=repo%3Ahuggingface%2Faccelerate%20torch.cuda.empty_cache&type=code) to dispatch to each backend. We would like to introduce a generic API set under the `torch.accelerator` namespace to generalize these use cases.
<div align="center">
<table>
<tr>
<td> Device-specific memory APIs torch.xxx.foo</td> <td> Device-agnostic memory APIs torch.accelerator.foo</td>
</tr>
<tr>
<td>
```python
torch.xxx.empty_cache
```
</td>
<td>
```python
torch.accelerator.empty_cache
```
</td>
</tr>
<tr>
<td>
```python
torch.xxx.reset_peak_memory_stats
```
</td>
<td>
```python
torch.accelerator.reset_peak_memory_stats
```
</td>
</tr>
<tr>
<td>
```python
torch.xxx.reset_accumulated_memory_stats
```
</td>
<td>
```python
torch.accelerator.reset_accumulated_memory_stats
```
</td>
</tr>
<tr>
<td>
```python
torch.xxx.memory_stats
```
</td>
<td>
```python
torch.accelerator.memory_stats
```
</td>
</tr>
<tr>
<td>
```python
torch.xxx.memory_allocated
```
</td>
<td>
```python
torch.accelerator.memory_allocated
```
</td>
</tr>
<tr>
<td>
```python
torch.xxx.max_memory_allocated
```
</td>
<td>
```python
torch.accelerator.max_memory_allocated
```
</td>
</tr>
<tr>
<td>
```python
torch.xxx.memory_reserved
```
</td>
<td>
```python
torch.accelerator.memory_reserved
```
</td>
</tr>
<tr>
<td>
```python
torch.xxx.max_memory_reserved
```
</td>
<td>
```python
torch.accelerator.max_memory_reserved
```
</td>
</tr>
</table>
</div>
# Solution
This design follows a similar pattern to `HostAllocator`. We're introducing a base class `DeviceAllocator`, from which `CUDAAllocator` and `XPUAllocator` will inherit. This allows us to provide a unified call path like: `torch.accelerator.empty_cache()` -> `GetDeviceAllocator(allocator)->empty_cache()`.
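As an illustration, the unified entry point replaces backend-specific branching (a minimal sketch using `empty_cache`; the other APIs in the table follow the same pattern):
```python
import torch

# Before: backend-specific branches
if torch.cuda.is_available():
    torch.cuda.empty_cache()
elif torch.xpu.is_available():
    torch.xpu.empty_cache()

# After: one device-agnostic call, routed through the DeviceAllocator
# registered for the current accelerator
torch.accelerator.empty_cache()
```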
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138222
Approved by: https://github.com/albanD , https://github.com/Camyll
2025-08-08 17:41:10 +00:00
f3a4d742ec
Revert "Add DeviceAllocator as the base device allocator ( #138222 )"
...
This reverts commit f7a66da5f9f6b8b75119b1ee8ce9ddc23e15570e.
Reverted https://github.com/pytorch/pytorch/pull/138222 on behalf of https://github.com/jithunnair-amd due to Broke ROCm periodic runs on MI300 e.g. https://github.com/pytorch/pytorch/actions/runs/16764977800/job/47470050573 ([comment](https://github.com/pytorch/pytorch/pull/138222#issuecomment-3164941815 ))
2025-08-07 16:34:36 +00:00
f7a66da5f9
Add DeviceAllocator as the base device allocator ( #138222 )
...
# Motivation
In line with the RFC [A device-agnostic Python device memory related API design for stream-based accelerators](https://github.com/pytorch/pytorch/issues/134978), some memory-related APIs are widely used in popular repositories such as HuggingFace Accelerate, which needs [many if-else conditional branches](https://github.com/search?q=repo%3Ahuggingface%2Faccelerate%20torch.cuda.empty_cache&type=code) to dispatch to each backend. We would like to introduce a generic API set under the `torch.accelerator` namespace to generalize these use cases.
<div align="center">
<table>
<tr>
<td> Device-specific memory APIs torch.xxx.foo</td> <td> Device-agnostic memory APIs torch.accelerator.foo</td>
</tr>
<tr>
<td>
```python
torch.xxx.empty_cache
```
</td>
<td>
```python
torch.accelerator.empty_cache
```
</td>
</tr>
<tr>
<td>
```python
torch.xxx.reset_peak_memory_stats
```
</td>
<td>
```python
torch.accelerator.reset_peak_memory_stats
```
</td>
</tr>
<tr>
<td>
```python
torch.xxx.reset_accumulated_memory_stats
```
</td>
<td>
```python
torch.accelerator.reset_accumulated_memory_stats
```
</td>
</tr>
<tr>
<td>
```python
torch.xxx.memory_stats
```
</td>
<td>
```python
torch.accelerator.memory_stats
```
</td>
</tr>
<tr>
<td>
```python
torch.xxx.memory_allocated
```
</td>
<td>
```python
torch.accelerator.memory_allocated
```
</td>
</tr>
<tr>
<td>
```python
torch.xxx.max_memory_allocated
```
</td>
<td>
```python
torch.accelerator.max_memory_allocated
```
</td>
</tr>
<tr>
<td>
```python
torch.xxx.memory_reserved
```
</td>
<td>
```python
torch.accelerator.memory_reserved
```
</td>
</tr>
<tr>
<td>
```python
torch.xxx.max_memory_reserved
```
</td>
<td>
```python
torch.accelerator.max_memory_reserved
```
</td>
</tr>
</table>
</div>
# Solution
This design follows a similar pattern to `HostAllocator`. We're introducing a base class `DeviceAllocator`, from which `CUDAAllocator` and `XPUAllocator` will inherit. This allows us to provide a unified call path like: `torch.accelerator.empty_cache()` -> `GetDeviceAllocator(allocator)->empty_cache()`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138222
Approved by: https://github.com/albanD , https://github.com/Camyll
2025-08-06 00:40:29 +00:00
c1145852a5
Deprecate overlap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead ( #156165 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156165
Approved by: https://github.com/albanD
ghstack dependencies: #159629 , #150312
2025-08-05 04:08:42 +00:00
90f13f3b2a
Revert "Deprecate overleap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead ( #156165 )"
...
This reverts commit 1fc010a9d8ea95bb74e54b31d17eba56ef16c27c.
Reverted https://github.com/pytorch/pytorch/pull/156165 on behalf of https://github.com/guangyey due to Static initialization order issue impacting the downstream repo ([comment](https://github.com/pytorch/pytorch/pull/150312#issuecomment-3142035444 ))
2025-08-01 03:24:54 +00:00
1fc010a9d8
Deprecate overlap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead ( #156165 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156165
Approved by: https://github.com/albanD
ghstack dependencies: #149601 , #157908 , #150312
2025-07-30 06:37:15 +00:00
95b658427d
Revert "Add DeviceAllocator as the base device allocator ( #138222 )"
...
This reverts commit 1179e333237b02ed8fe2ba10cb9a23adf98d7d7a.
Reverted https://github.com/pytorch/pytorch/pull/138222 on behalf of https://github.com/ZainRizvi due to Very sorry but this is still breaking internally. @albanD would you be able to help get this past the finish line? D78496124 has more details on the failure and the workaround might be to do something like what's in D78684669. To validate the fixes internally, you can follow the instructions here to ghimport the changes: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/138222#issuecomment-3100195370 ))
2025-07-22 01:01:41 +00:00
1179e33323
Add DeviceAllocator as the base device allocator ( #138222 )
...
# Motivation
In line with the RFC [A device-agnostic Python device memory related API design for stream-based accelerators](https://github.com/pytorch/pytorch/issues/134978), some memory-related APIs are widely used in popular repositories such as HuggingFace Accelerate, which needs [many if-else conditional branches](https://github.com/search?q=repo%3Ahuggingface%2Faccelerate%20torch.cuda.empty_cache&type=code) to dispatch to each backend. We would like to introduce a generic API set under the `torch.accelerator` namespace to generalize these use cases.
<div align="center">
<table>
<tr>
<td> Device-specific memory APIs torch.xxx.foo</td> <td> Device-agnostic memory APIs torch.accelerator.foo</td>
</tr>
<tr>
<td>
```python
torch.xxx.empty_cache
```
</td>
<td>
```python
torch.accelerator.empty_cache
```
</td>
</tr>
<tr>
<td>
```python
torch.xxx.reset_peak_memory_stats
```
</td>
<td>
```python
torch.accelerator.reset_peak_memory_stats
```
</td>
</tr>
<tr>
<td>
```python
torch.xxx.reset_accumulated_memory_stats
```
</td>
<td>
```python
torch.accelerator.reset_accumulated_memory_stats
```
</td>
</tr>
<tr>
<td>
```python
torch.xxx.memory_stats
```
</td>
<td>
```python
torch.accelerator.memory_stats
```
</td>
</tr>
<tr>
<td>
```python
torch.xxx.memory_allocated
```
</td>
<td>
```python
torch.accelerator.memory_allocated
```
</td>
</tr>
<tr>
<td>
```python
torch.xxx.max_memory_allocated
```
</td>
<td>
```python
torch.accelerator.max_memory_allocated
```
</td>
</tr>
<tr>
<td>
```python
torch.xxx.memory_reserved
```
</td>
<td>
```python
torch.accelerator.memory_reserved
```
</td>
</tr>
<tr>
<td>
```python
torch.xxx.max_memory_reserved
```
</td>
<td>
```python
torch.accelerator.max_memory_reserved
```
</td>
</tr>
</table>
</div>
# Solution
This design follows a similar pattern to `HostAllocator`. We're introducing a base class `DeviceAllocator`, from which `CUDAAllocator` and `XPUAllocator` will inherit. This allows us to provide a unified call path like: `torch.accelerator.empty_cache()` -> `GetDeviceAllocator(allocator)->empty_cache()`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138222
Approved by: https://github.com/albanD , https://github.com/Camyll
2025-07-17 01:56:01 +00:00
ea5f88dca6
Revert "Deprecate overleap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead ( #156165 )"
...
This reverts commit e40ade5182233f548b25f2732effe3719d16e9ad.
Reverted https://github.com/pytorch/pytorch/pull/156165 on behalf of https://github.com/huydhn due to Sorry for reverting your change but because https://github.com/pytorch/pytorch/pull/157908 has been reverted + this PR caused issue earlier, I think it is better to revert the whole stack and reland it from scratch to be sure ([comment](https://github.com/pytorch/pytorch/pull/150312#issuecomment-3074897532 ))
2025-07-15 18:24:36 +00:00
e40ade5182
Deprecate overlap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead ( #156165 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156165
Approved by: https://github.com/albanD
ghstack dependencies: #150312
2025-07-15 10:14:35 +00:00
e8cca7bac7
Revert "Deprecate overleap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead ( #156165 )"
...
This reverts commit 85857181ebca86e9c709e9922a9d9ef41a9c4ef9.
Reverted https://github.com/pytorch/pytorch/pull/156165 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing to build PyTorch internally ([comment](https://github.com/pytorch/pytorch/pull/150312#issuecomment-3070218901 ))
2025-07-14 16:33:48 +00:00
85857181eb
Deprecate overlap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead ( #156165 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156165
Approved by: https://github.com/albanD
ghstack dependencies: #149601 , #157908 , #150312
2025-07-11 11:41:34 +00:00
5cc4e856fd
Add device_id to XPU device properties ( #156481 )
...
# Motivation
Some older Intel iGPUs may share the same device name across different hardware products (see this [device name example](aaa01c06f9/shared/source/dll/devices/devices_base.inl (L190-L199))). To help disambiguate which specific iGPU product is being used, we introduce a [device id](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/supported/sycl_ext_intel_device_info.md#device-id). This device id corresponds to the Device ID in the [official Intel product specification](https://www.intel.com/content/www/us/en/products/sku/232155/intel-core-i71360p-processor-18m-cache-up-to-5-00-ghz/specifications.html) and enables more accurate identification and troubleshooting of user issues.
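As a usage illustration (a minimal sketch; `device_id` is the new field, and formatting it as hex is an assumption for readability):
```python
import torch

# `device_id` disambiguates iGPU products that share a device name;
# it matches the Device ID field in Intel's product specifications.
props = torch.xpu.get_device_properties(0)
print(hex(props.device_id))
```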
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156481
Approved by: https://github.com/EikanWang , https://github.com/albanD
2025-07-03 01:22:11 +00:00
3dd872e6d5
Revert "Add DeviceAllocator as the base device allocator ( #138222 )"
...
This reverts commit 92409b6c89fbfbd3caa79c81b1e3d9e7917d3bc7.
Reverted https://github.com/pytorch/pytorch/pull/138222 on behalf of https://github.com/Camyll due to internal build failures ([comment](https://github.com/pytorch/pytorch/pull/138222#issuecomment-3002206756 ))
2025-06-25 00:11:35 +00:00
92409b6c89
Add DeviceAllocator as the base device allocator ( #138222 )
...
# Motivation
In line with the RFC [A device-agnostic Python device memory related API design for stream-based accelerators](https://github.com/pytorch/pytorch/issues/134978), some memory-related APIs are widely used in popular repositories such as HuggingFace Accelerate, which needs [many if-else conditional branches](https://github.com/search?q=repo%3Ahuggingface%2Faccelerate%20torch.cuda.empty_cache&type=code) to dispatch to each backend. We would like to introduce a generic API set under the `torch.accelerator` namespace to generalize these use cases.
<div align="center">
<table>
<tr>
<td> Device-specific memory APIs torch.xxx.foo</td> <td> Device-agnostic memory APIs torch.accelerator.foo</td>
</tr>
<tr>
<td>
```python
torch.xxx.empty_cache
```
</td>
<td>
```python
torch.accelerator.empty_cache
```
</td>
</tr>
<tr>
<td>
```python
torch.xxx.reset_peak_memory_stats
```
</td>
<td>
```python
torch.accelerator.reset_peak_memory_stats
```
</td>
</tr>
<tr>
<td>
```python
torch.xxx.reset_accumulated_memory_stats
```
</td>
<td>
```python
torch.accelerator.reset_accumulated_memory_stats
```
</td>
</tr>
<tr>
<td>
```python
torch.xxx.memory_stats
```
</td>
<td>
```python
torch.accelerator.memory_stats
```
</td>
</tr>
<tr>
<td>
```python
torch.xxx.memory_allocated
```
</td>
<td>
```python
torch.accelerator.memory_allocated
```
</td>
</tr>
<tr>
<td>
```python
torch.xxx.max_memory_allocated
```
</td>
<td>
```python
torch.accelerator.max_memory_allocated
```
</td>
</tr>
<tr>
<td>
```python
torch.xxx.memory_reserved
```
</td>
<td>
```python
torch.accelerator.memory_reserved
```
</td>
</tr>
<tr>
<td>
```python
torch.xxx.max_memory_reserved
```
</td>
<td>
```python
torch.accelerator.max_memory_reserved
```
</td>
</tr>
</table>
</div>
# Solution
This design follows a similar pattern to `HostAllocator`. We're introducing a base class `DeviceAllocator`, from which `CUDAAllocator` and `XPUAllocator` will inherit. This allows us to provide a unified call path like: `torch.accelerator.empty_cache()` -> `GetDeviceAllocator(allocator)->empty_cache()`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138222
Approved by: https://github.com/albanD
2025-06-23 08:49:30 +00:00
402ae09e41
[BE] fix typos in c10/ ( #156078 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156078
Approved by: https://github.com/malfet , https://github.com/cyyever
2025-06-18 10:24:44 +00:00
db01f1032f
Support XPU in memory tracker ( #150703 )
...
This PR adds support for XPU devices to the distributed MemoryTracker tool, including a unit test for XPU.
Specifically, it adds tracking of a few alloc-related statistics to XPUCachingAllocator and adapts the existing memory tracker tool to be device agnostic by fetching the device module and recording the necessary memory stats. (It fetches the device module instead of using `torch.accelerator` methods, as that API is still in progress.)
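A minimal sketch of the device-module pattern described above (the device-type selection and the particular stat calls are illustrative, not the tracker's actual code):
```python
import torch

# Resolve the backend module from the device type instead of
# hard-coding torch.cuda, so the same tracker logic runs on XPU.
device = torch.device("xpu" if torch.xpu.is_available() else "cuda")
mod = getattr(torch, device.type)  # torch.xpu or torch.cuda

mod.reset_peak_memory_stats()
x = torch.randn(1024, 1024, device=device)
print(mod.max_memory_allocated())
```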
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150703
Approved by: https://github.com/EikanWang , https://github.com/guangyey , https://github.com/gujinghui , https://github.com/d4l3k
2025-06-12 21:33:52 +00:00
3c74a72ea0
Keep XPU compatible with toolchain 2025.2 ( #154359 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154359
Approved by: https://github.com/EikanWang , https://github.com/cyyever
2025-05-29 11:12:07 +00:00
fe49b11e09
Add memory reporting for XPU to Memory Profiler ( #152842 )
...
Adds support for XPU `profile_memory` in the PyTorch Profiler.
Currently, when `profile_memory=True` is passed to `torch.profiler.profile`, no XPU memory is reported. For example, the profiling table printed by the code below is missing any `XPU Mem` columns:
<details><summary>profiling.py</summary>
<p>
```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.profiler import profile, ProfilerActivity
class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.conv1 = nn.Conv1d(20, 20, 15, padding="same")
        self.flatten = nn.Flatten()
        self.net1 = nn.Linear(2048, 4096)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(4096, 5)

    def forward(self, x):
        res = self.conv1(x)
        res = self.flatten(res)
        res = self.net1(res)
        return self.net2(self.relu(res))

def demo_basic():
    model = ToyModel().to("xpu")
    loss_fn = nn.MSELoss().to("xpu")
    optimizer = optim.SGD(model.parameters(), lr=0.001)
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.XPU], profile_memory=True) as prof:
        for epoch in range(10):
            optimizer.zero_grad()
            outputs = model(torch.randn(20, 2048).to("xpu"))
            labels = torch.randn(20, 5).to("xpu")
            loss_fn(outputs, labels).backward()
            optimizer.step()
    print(prof.key_averages().table(max_name_column_width=100, sort_by="xpu_time_total", row_limit=100))

if __name__ == "__main__":
    demo_basic()
```
</p>
</details>
```
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self XPU Self XPU % XPU total XPU time avg CPU Mem Self CPU Mem # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
gemm_kernel 0.00% 0.000us 0.00% 0.000us 0.000us 1.501ms 44.73% 1.501ms 25.024us 0 b 0 b 60
autograd::engine::evaluate_function: AddmmBackward0 0.12% 1.067ms 30.47% 260.929ms 13.046ms 0.000us 0.00% 1.009ms 50.448us 0 b 0 b 20
AddmmBackward0 0.09% 744.983us 15.99% 136.944ms 6.847ms 0.000us 0.00% 784.640us 39.232us 0 b 0 b 20
aten::mm 15.41% 131.956ms 15.79% 135.167ms 3.379ms 784.640us 23.37% 784.640us 19.616us 0 b 0 b 40
aten::linear 0.02% 156.361us 20.58% 176.187ms 8.809ms 0.000us 0.00% 741.760us 37.088us 0 b 0 b 20
aten::addmm 20.25% 173.371ms 20.52% 175.723ms 8.786ms 741.760us 22.10% 741.760us 37.088us 0 b 0 b 20
Optimizer.step#SGD.step 0.40% 3.429ms 5.55% 47.509ms 4.751ms 0.000us 0.00% 488.960us 48.896us 0 b 0 b 10
aten::_foreach_add_ 4.81% 41.162ms 5.15% 44.080ms 4.408ms 488.960us 14.57% 488.960us 48.896us 0 b 0 b 10
at::native::xpu::MultiTensorApplyKernelFunctor<at::n... 0.00% 0.000us 0.00% 0.000us 0.000us 422.880us 12.60% 422.880us 42.288us 0 b 0 b 10
autograd::engine::evaluate_function: ConvolutionBack... 0.03% 280.041us 4.36% 37.328ms 3.733ms 0.000us 0.00% 356.320us 35.632us 0 b 0 b 10
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 856.227ms
Self XPU time total: 3.357ms
```
This PR updates XPUCachingAllocator.cpp to report allocation events to the profiler, which causes the `XPU Mem` columns to be printed in the table:
```
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self XPU Self XPU % XPU total XPU time avg CPU Mem Self CPU Mem XPU Mem Self XPU Mem # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
gemm_kernel 0.00% 0.000us 0.00% 0.000us 0.000us 1.436ms 43.64% 1.436ms 23.939us 0 b 0 b 0 b 0 b 60
autograd::engine::evaluate_function: AddmmBackward0 0.13% 1.186ms 29.92% 262.875ms 13.144ms 0.000us 0.00% 1.005ms 50.272us 0 b 0 b 320.94 Mb -4.69 Mb 20
AddmmBackward0 0.09% 815.288us 16.48% 144.802ms 7.240ms 0.000us 0.00% 790.720us 39.536us 0 b 0 b 325.47 Mb 0 b 20
aten::mm 15.86% 139.342ms 16.26% 142.875ms 3.572ms 790.720us 24.03% 790.720us 19.768us 0 b 0 b 325.47 Mb 325.47 Mb 40
aten::linear 0.02% 182.856us 20.46% 179.775ms 8.989ms 0.000us 0.00% 669.440us 33.472us 0 b 0 b 3.13 Mb 0 b 20
aten::addmm 20.10% 176.607ms 20.40% 179.210ms 8.961ms 669.440us 20.34% 669.440us 33.472us 0 b 0 b 3.13 Mb 3.13 Mb 20
Optimizer.step#SGD.step 0.42% 3.692ms 5.61% 49.267ms 4.927ms 0.000us 0.00% 486.640us 48.664us 0 b 0 b 0 b 0 b 10
aten::_foreach_add_ 4.83% 42.439ms 5.19% 45.574ms 4.557ms 486.640us 14.79% 486.640us 48.664us 0 b 0 b 0 b -20.00 Kb 10
at::native::xpu::MultiTensorApplyKernelFunctor<at::n... 0.00% 0.000us 0.00% 0.000us 0.000us 420.960us 12.79% 420.960us 42.096us 0 b 0 b 0 b 0 b 10
autograd::engine::evaluate_function: ConvolutionBack... 0.04% 310.719us 4.47% 39.279ms 3.928ms 0.000us 0.00% 339.520us 33.952us 0 b 0 b -2.89 Mb -3.12 Mb 10
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 878.627ms
Self XPU time total: 3.291ms
```
These XPU memory numbers match the corresponding profiling results on CUDA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152842
Approved by: https://github.com/guangyey , https://github.com/sraikund16
2025-05-21 01:19:19 +00:00
9d3b6ee4c1
[submodule] Update gtest to v1.17.0 ( #153618 )
...
And remove some outdated CMake code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153618
Approved by: https://github.com/malfet
2025-05-16 01:24:19 +00:00
9e106019f6
[XPU] Add an implict conversion from XPUStream to sycl::queue* ( #148646 )
...
# Motivation
Currently, in PyTorch XPU, `cudaStream_t` is mapped to `sycl::queue&`, so an implicit cast from `XPUStream` to `sycl::queue&` is provided, just like `CUDAStream` has an implicit cast to `cudaStream_t`.
On the SYCLomatic side, however, we migrate `cudaStream_t` to `sycl::queue*` rather than `sycl::queue&`. (One reason is that `cudaStream_t` is actually a pointer, so users can do anything with that value. Another is that the early `sycl::queue` was not implemented as a pointer, so copying by value was not desirable.)
Without this PR:
```
cudaStream_t a = getCurrentCUDAStream();
cudaStream_t b = getCurrentCUDAStream().stream();
```
needs to be migrated to:
```
queue_ptr a = &(sycl::queue&)getCurrentXPUStream();
queue_ptr b = &(getCurrentXPUStream().queue());
```
With this PR:
```
queue_ptr a = getCurrentXPUStream();
queue_ptr b = &(getCurrentXPUStream().queue());
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148646
Approved by: https://github.com/guangyey , https://github.com/EikanWang
2025-04-03 08:12:38 +00:00
86fbbe44cc
Improve error message for CUDAGuardImpl, MPSGuardImpl, XPUGuardImpl ( #149838 )
...
Fixes #149822
Will get:
```
RuntimeError: t == DeviceType::CUDA INTERNAL ASSERT FAILED at "/home/jyh/workspace/pytorch/c10/cuda/impl/CUDAGuardImpl.h":28, please report a bug to PyTorch. CUDAGuardImpl initialized with non-CUDA DeviceType: cpu
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149838
Approved by: https://github.com/Skylion007 , https://github.com/guangyey
2025-03-25 07:29:53 +00:00
1096443467
Use torch_compile_options for c10 libraries ( #147821 )
...
c10, c10_cuda, c10_hip and c10_xpu are given additional compile options by torch_compile_options, which are more restrictive and can help reveal potential bugs inside the code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147821
Approved by: https://github.com/guangyey , https://github.com/malfet
2025-03-18 01:54:23 +00:00
8fa81a6066
Enable misc-use-internal-linkage check and apply fixes ( #148948 )
...
Enables the clang-tidy rule [`misc-use-internal-linkage`](https://clang.llvm.org/extra/clang-tidy/checks/misc/use-internal-linkage.html). This check was introduced in Clang-Tidy 18 and is available thanks to the recent update to Clang-Tidy 19.
The check marks functions and variables used only in their translation unit as static. Undesired symbols are therefore not leaked into other units, more link-time optimisations are possible, and the resulting binaries may be smaller.
The detected violations were mostly fixed by marking symbols static. In other cases, the symbols were indeed consumed by other files, so their declaring headers were included instead. A few declarations were also wrong and have been fixed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148948
Approved by: https://github.com/Skylion007
2025-03-12 14:22:56 +00:00
c65ee728f0
Initial implementation of host memory stats ( #147660 )
...
This is an initial attempt to provide some statistics for the pinned host memory allocations flowing through CachingHostAllocator. Many times in the past we have had inexplicable slowdowns that would have been much easier to diagnose with some host memory characteristics at hand.
This change tries very hard not to disrupt the original design of the allocator, and it uses the existing locking mechanism, whenever possible, to gather statistics "for free". The only deviation is on the "slow path", where we incur CUDA calls anyway, so taking a short lock will not hurt performance much, especially in the steady state where most allocations come from the cache.
This first PR introduces the concept so we can see whether it fits the right paradigm; we can always add more later.
Metrics that would require more involved changes to the code base and locks, such as requested memory, have been punted for now. I also tried to reuse the Stat structure from the CUDA caching allocator, in order to maintain symmetry.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147660
Approved by: https://github.com/ngimel
2025-03-05 16:13:19 +00:00
a983b2b11a
Revert "Initial implementation of host memory stats ( #147660 )"
...
This reverts commit 945e359fc1afe6c0bb6129ed9607b237fa19cd98.
Reverted https://github.com/pytorch/pytorch/pull/147660 on behalf of https://github.com/mradmila due to There is an issue with ambiguous definition of Stat structure when different C++ tools are used. Backing out for now. ([comment](https://github.com/pytorch/pytorch/pull/147660#issuecomment-2692346379 ))
2025-03-01 18:05:45 +00:00
945e359fc1
Initial implementation of host memory stats ( #147660 )
...
This is an initial attempt to provide some statistics for the pinned host memory allocations flowing through CachingHostAllocator. Many times in the past we have had inexplicable slowdowns that would have been much easier to diagnose with some host memory characteristics at hand.
This change tries very hard not to disrupt the original design of the allocator, and it uses the existing locking mechanism, whenever possible, to gather statistics "for free". The only deviation is on the "slow path", where we incur CUDA calls anyway, so taking a short lock will not hurt performance much, especially in the steady state where most allocations come from the cache.
This first PR introduces the concept so we can see whether it fits the right paradigm; we can always add more later.
Metrics that would require more involved changes to the code base and locks, such as requested memory, have been punted for now. I also tried to reuse the Stat structure from the CUDA caching allocator, in order to maintain symmetry.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147660
Approved by: https://github.com/ngimel
2025-02-28 18:36:44 +00:00
3b3aac0cde
Filter out iGPU if dGPU is found on XPU ( #144378 )
...
# Motivation
for https://github.com/pytorch/pytorch/issues/143914
On Windows, there are two separate SYCL platforms for iGPU and dGPU. To simplify the logic, we exclude iGPUs when a dGPU is present. This ensures that all XPU devices enumerated by PyTorch share the same SYCL context.
The logic is generalized as follows (see the sketch after the list):
1. Find the first Level Zero platform containing at least one dGPU and enumerate all dGPUs of that platform.
2. If no dGPU is found, find the first Level Zero platform containing an iGPU and enumerate all iGPUs of that platform.
3. Otherwise, no GPU is found (neither iGPU nor dGPU).
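A minimal sketch of that policy in Python-style pseudocode (the data structures are illustrative, not the real C10 implementation):
```python
def enumerate_xpu_devices(l0_platforms):
    """l0_platforms: one device list per Level Zero platform; each device
    carries a boolean `is_discrete` flag (an assumed, illustrative shape)."""
    # 1. The first platform containing a dGPU wins; keep only its dGPUs.
    for devices in l0_platforms:
        dgpus = [d for d in devices if d.is_discrete]
        if dgpus:
            return dgpus
    # 2. Otherwise, the first platform containing an iGPU; keep only its iGPUs.
    for devices in l0_platforms:
        igpus = [d for d in devices if not d.is_discrete]
        if igpus:
            return igpus
    # 3. No GPU found (neither iGPU nor dGPU).
    return []
```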
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144378
Approved by: https://github.com/EikanWang , https://github.com/gujinghui
2025-01-29 15:53:16 +00:00
ed015143ef
Set RUNPATH on CUDA and XPU tests ( #144305 )
...
#136627 fixed most cases where test binaries' RUNPATH was not set correctly, but a few cases remained.
This PR fixes the rest.
The binaries are found by `auditwheel repair` a wheel built with `BUILD_TEST=1`.
@malfet
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144305
Approved by: https://github.com/malfet
2025-01-26 08:40:22 +00:00
a68c0ca497
Add low priority XPU Stream ( #141119 )
...
# Motivation
Because an external SYCL queue can have a low priority, we need to support low-priority SYCL queues for native XPU streams to maintain consistency.
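A Python-side usage sketch (hedged: this assumes `torch.xpu.Stream` exposes a `priority` argument mirroring `torch.cuda.Stream`, with the integer-to-priority mapping following the backend's convention):
```python
import torch

# Create a stream with an explicit priority and run work on it.
s = torch.xpu.Stream(priority=0)
with torch.xpu.stream(s):
    y = torch.randn(256, 256, device="xpu").sum()
torch.xpu.synchronize()
```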
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141119
Approved by: https://github.com/gujinghui , https://github.com/albanD
ghstack dependencies: #142347
2024-12-31 11:15:45 +00:00
39450ae655
Refine XPU external Stream ( #142347 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142347
Approved by: https://github.com/gujinghui , https://github.com/albanD
2024-12-31 11:15:38 +00:00
07fa6e2c8b
Fix torch.accelerator API abort when passing invalid device ( #143550 )
...
# Motivation
Fix https://github.com/pytorch/pytorch/issues/143543
# Solution
We should raise a Python exception instead of aborting.
# Additional Context
without this PR:
```python
>>> import torch
>>> torch.accelerator.current_stream(torch.accelerator.device_count())
terminate called after throwing an instance of 'c10::Error'
what(): device is out of range, device is 2, total number of device is 2.
Exception raised from check_device_index at /home/dvrogozh/git/pytorch/pytorch/c10/xpu/XPUFunctions.h:36 (most recent call first):
frame #0 : c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xac (0x7f30707eb95c in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10.so)
frame #1 : c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xf3 (0x7f307078fc57 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10.so)
frame #2 : <unknown function> + 0x19a3e (0x7f3070c2ba3e in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so)
frame #3 : c10::xpu::getCurrentXPUStream(signed char) + 0x2f (0x7f3070c2c83f in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so)
frame #4 : <unknown function> + 0x1ca35 (0x7f3070c2ea35 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so)
frame #5 : <unknown function> + 0x653f15 (0x7f3083391f15 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libtorch_python.so)
frame #6 : <unknown function> + 0x39e5f2 (0x7f30830dc5f2 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libtorch_python.so)
<omitting python frames>
frame #20 : <unknown function> + 0x29d90 (0x7f308b19bd90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #21 : __libc_start_main + 0x80 (0x7f308b19be40 in /lib/x86_64-linux-gnu/libc.so.6)
Aborted (core dumped)
```
with this PR:
```python
>>> import torch
>>> torch.accelerator.current_stream(torch.accelerator.device_count())
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/pt-gpu/4T-4652/guangyey/stock-pytorch/torch/accelerator/__init__.py", line 123, in current_stream
return torch._C._accelerator_getStream(device_index)
RuntimeError: The device index is out of range. It must be in [0, 2), but got 2.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143550
Approved by: https://github.com/EikanWang , https://github.com/dvrogozh , https://github.com/albanD
2024-12-23 03:44:22 +00:00
3d227ae315
[Intel GPU] Support getStreamFromExternal for XPU. ( #140268 )
...
In the AOT Inductor scenario, a GPU stream can be created outside of the `XPUStream` pool, and we need to create an `XPUStream` that refers to this stream for the common logic of AOTI; for example, a stream guard is a guard for `XPUStream`. So we add `getStreamFromExternal`, following the design of CUDAStream.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140268
Approved by: https://github.com/desertfire , https://github.com/jansel , https://github.com/EikanWang
2024-12-07 19:22:04 +00:00
77748ed8ec
fix c10::Event UT failure on XPU backend ( #141800 )
...
# Motivation
Fix the UT failure introduced by https://github.com/pytorch/pytorch/pull/140865. An unrelated failure had been suppressing this UT failure; it started to surface once https://github.com/pytorch/pytorch/pull/141546 landed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141800
Approved by: https://github.com/EikanWang
2024-12-03 01:34:42 +00:00
d905f1350a
Gracefully catch exceptions when failing to initialize XPU devices ( #141658 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141658
Approved by: https://github.com/EikanWang
2024-11-28 05:17:08 +00:00
b556549357
Use default context on Windows for Intel GPU ( #138049 )
...
# Motivation
Use the default context on Windows to keep consistency with Linux. This makes it easier to interact with external libraries like `dlpack`.
# Additional Context
This PR depends on Intel GPU oneAPI 2025.0.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138049
Approved by: https://github.com/gujinghui
2024-11-28 02:49:46 +00:00
b1a8be6b0a
Support torch.Event elapsed_time method on XPU ( #140865 )
...
# Motivation
This PR aims to support the c10::Event/torch.Event `elapsed_time` method on XPU. We create a profiling-tagged event when the timing flag is enabled.
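A minimal usage sketch (assuming an XPU device is available; passing `enable_timing=True` is what triggers the profiling-tag path described above):
```python
import torch

start = torch.Event(device="xpu", enable_timing=True)
end = torch.Event(device="xpu", enable_timing=True)

start.record()
a = torch.randn(1024, 1024, device="xpu")
b = a @ a  # some work on the current XPU stream
end.record()
torch.xpu.synchronize()

print(f"{start.elapsed_time(end):.3f} ms")  # elapsed time in milliseconds
```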
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140865
Approved by: https://github.com/Samkm0084 , https://github.com/gujinghui
2024-11-28 02:41:11 +00:00
62d2c5b667
Revert "Enable XPUEvent elapsed_time function ( #134666 )" ( #140872 )
...
# Motivation
The reverted PR introduced an internal UT failure on XPU.
This reverts commit 4bbd6da33101a8d709f1d2921ad8ae6f9b0dc166.
# Additional Context
refer to https://github.com/pytorch/pytorch/issues/140814
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140872
Approved by: https://github.com/EikanWang
2024-11-18 02:58:05 +00:00
ebeab262d9
Refine XPU device prop and fix typo ( #140661 )
...
# Motivation
`architecture` is an experimental attribute that might be used by Triton AOT codegen. It should not be in `__repr__`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140661
Approved by: https://github.com/EikanWang
2024-11-14 11:18:01 +00:00
4bbd6da331
Enable XPUEvent elapsed_time function ( #134666 )
...
# Motivation
This PR aims to enable the `elapsed_time` function for `XPUEvent`.
# Additional Context
This PR depends on toolchain oneAPI 2025.0.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134666
Approved by: https://github.com/EikanWang , https://github.com/ezyang
2024-11-13 04:32:50 +00:00
659d2132be
Add architecture to XPU device property ( #138186 )
...
# Motivation
Add `architecture` to XPU device property.
In some cases, low-level application code can use special features or apply specific optimizations depending on the device architecture; this PR enables such applications.
Modified from https://github.com/pytorch/pytorch/pull/129675/files
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138186
Approved by: https://github.com/ezyang
2024-11-13 03:35:13 +00:00