Commit Graph

35 Commits

Author SHA1 Message Date
cyy
8fa81a6066 Enable misc-use-internal-linkage check and apply fixes (#148948)
Enables the clang-tidy rule [`misc-use-internal-linkage`](https://clang.llvm.org/extra/clang-tidy/checks/misc/use-internal-linkage.html). This check was introduced in Clang-Tidy 18 and became available with the recent update to Clang-Tidy 19.

The check flags functions and variables that are used only within their translation unit so that they can be marked static. As a result, unwanted symbols do not leak into other units, more link-time optimisations are possible, and the resulting binaries may be smaller.
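As a rough illustration of what the check asks for, the sketch below gives a helper internal linkage in the two usual ways. The names `clamp_to_byte`, `call_count`, and `to_byte` are hypothetical and do not come from the PyTorch sources.

```cpp
// Hypothetical helper used only inside this translation unit.
// `static` gives it internal linkage, so no symbol is exported for it.
static int clamp_to_byte(int value) {
  if (value < 0) return 0;
  if (value > 255) return 255;
  return value;
}

// An unnamed namespace gives the same internal linkage to what it contains.
namespace {
int call_count = 0;
}  // namespace

// Assumed to be called from other translation units, so it keeps
// external linkage and would be declared in a header.
int to_byte(int value) {
  ++call_count;
  return clamp_to_byte(value);
}
```

Only `to_byte` is visible to the linker here; the helper and the counter stay private to the file.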

Most detected violations were fixed by adding static. In other cases the symbols were in fact used by other files, so their declaring headers were included. Some declarations were also wrong and have been fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148948
Approved by: https://github.com/Skylion007
2025-03-12 14:22:56 +00:00
c65ee728f0 Initial implementation of host memory stats (#147660)
This is an initial attempt to provide some statistics for the pinned host memory allocations flowing through CachingHostAllocator. Many times in the past we have had inexplicable slowdowns that would be much easier to diagnose if we had some host memory characteristics.

This change tries very hard not to disrupt the initial design of the allocator, and it uses the existing locking mechanism, whenever possible, to gather statistics "for free". The only deviation from that is on the "slow path", where we incur CUDA calls anyway, so taking a short lock is not going to hurt performance much, especially in the steady state where most allocations come from the cache.

As mentioned before, this is the first PR; it introduces the concept to see whether it fits the right paradigm. We can always add more later.

Metrics that would require more involved changes to the code base and locks, like requested memory, have been punted for now. I also tried to reuse the Stat structure used in CUDA caching allocator, in order to maintain symmetry.
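
The idea of updating counters while a lock is already held can be sketched as follows. This is a minimal sketch, not the code from the PR: the `StatSketch` fields (`current`, `peak`, `allocated`, `freed`) are an assumption modelled on the kind of counters a caching allocator tracks, and `HostStatsSketch` is a hypothetical name.

```cpp
#include <algorithm>
#include <cstdint>
#include <mutex>

// Assumed counter layout; the real Stat structure may differ.
struct StatSketch {
  std::int64_t current = 0;    // bytes currently in use
  std::int64_t peak = 0;       // highest value `current` has reached
  std::int64_t allocated = 0;  // total bytes ever handed out
  std::int64_t freed = 0;      // total bytes ever returned

  void increase(std::int64_t amount) {
    current += amount;
    peak = std::max(peak, current);
    allocated += amount;
  }

  void decrease(std::int64_t amount) {
    current -= amount;
    freed += amount;
  }
};

// Hypothetical holder: every update happens under one short lock.
class HostStatsSketch {
 public:
  void on_alloc(std::int64_t bytes) {
    std::lock_guard<std::mutex> guard(mutex_);
    allocated_bytes_.increase(bytes);
  }

  void on_free(std::int64_t bytes) {
    std::lock_guard<std::mutex> guard(mutex_);
    allocated_bytes_.decrease(bytes);
  }

  // Returns a copy so that callers never read while a writer is active.
  StatSketch snapshot() const {
    std::lock_guard<std::mutex> guard(mutex_);
    return allocated_bytes_;
  }

 private:
  mutable std::mutex mutex_;
  StatSketch allocated_bytes_;
};
```

In an allocator that already takes a lock on its fast path, the `increase` and `decrease` calls would sit inside that existing critical section rather than taking a lock of their own.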

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147660
Approved by: https://github.com/ngimel
2025-03-05 16:13:19 +00:00
a983b2b11a Revert "Initial implementation of host memory stats (#147660)"
This reverts commit 945e359fc1afe6c0bb6129ed9607b237fa19cd98.

Reverted https://github.com/pytorch/pytorch/pull/147660 on behalf of https://github.com/mradmila due to There is an issue with ambiguous definition of Stat structure when different C++ tools are used. Backing out for now. ([comment](https://github.com/pytorch/pytorch/pull/147660#issuecomment-2692346379))
2025-03-01 18:05:45 +00:00
945e359fc1 Initial implementation of host memory stats (#147660)
This is an initial attempt to provide some statistics for the pinned host memory allocations flowing through CachingHostAllocator. Many times in the past we have had inexplicable slowdowns that would be much easier to diagnose if we had some host memory characteristics.

This change tries very hard not to disrupt the initial design of the allocator, and it uses the existing locking mechanism, whenever possible, to gather statistics "for free". The only deviation from that is on the "slow path", where we incur CUDA calls anyway, so taking a short lock is not going to hurt performance much, especially in the steady state where most allocations come from the cache.

As mentioned before, this is the first PR; it introduces the concept to see whether it fits the right paradigm. We can always add more later.

Metrics that would require more involved changes to the code base and locks, like requested memory, have been punted for now. I also tried to reuse the Stat structure used in CUDA caching allocator, in order to maintain symmetry.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147660
Approved by: https://github.com/ngimel
2025-02-28 18:36:44 +00:00
aa20b4b6cf Friendly handle mem_get_info's runtime error message (#146899)
# Motivation
Handle the runtime error message gracefully if the device doesn't support querying the available free memory. See https://github.com/intel/torch-xpu-ops/issues/1352
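
The general pattern, checking a capability up front and failing with a descriptive message, might look like the sketch below. `DeviceInfoSketch` and `mem_get_info_sketch` are hypothetical stand-ins, and the error text is illustrative rather than the message PyTorch emits.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical description of a device; not a PyTorch or SYCL type.
struct DeviceInfoSketch {
  std::string name;
  bool supports_free_memory_query;
  std::size_t free_bytes;
  std::size_t total_bytes;
};

// Returns {free, total} in bytes, or throws a readable error when the
// device cannot report its free memory.
std::pair<std::size_t, std::size_t> mem_get_info_sketch(
    const DeviceInfoSketch& device) {
  if (!device.supports_free_memory_query) {
    throw std::runtime_error(
        "The device (" + device.name +
        ") doesn't support querying the available free memory.");
  }
  return {device.free_bytes, device.total_bytes};
}
```

Checking before the query lets the caller see which device is affected instead of an opaque low-level runtime failure.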

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146899
Approved by: https://github.com/EikanWang
2025-02-13 06:26:19 +00:00
cyy
25aa7ca62d Cleanup CallOnce.h (#146700)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146700
Approved by: https://github.com/albanD
2025-02-07 16:44:45 +00:00
cyy
29f52e3972 [2/N] Remove unnecessary once flag usage (#145057)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145057
Approved by: https://github.com/albanD
2025-01-23 09:48:46 +00:00
8f6c4d1732 Add get_stream_from_external API for XPU backend (#141123)
# Motivation
This PR aims to introduce `torch.xpu.ExternalStream`, which wraps a SYCL queue created in other libraries so that it can be used in PyTorch.

# Additional Context

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141123
Approved by: https://github.com/albanD, https://github.com/EikanWang
ghstack dependencies: #142347, #141119
2024-12-31 11:15:52 +00:00
8dd4673cea Support torch.xpu.mem_get_info API (#141230)
# Motivation
Fix https://github.com/pytorch/pytorch/issues/130599
This PR adds a new API, `torch.xpu.mem_get_info`, which is widely used in popular model workloads.
For example, [here](403c0714d1/src/accelerate/utils/modeling.py (L721)) we need to get the current GPU memory usage to split or load the model.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141230
Approved by: https://github.com/EikanWang, https://github.com/albanD
2024-12-05 08:17:25 +00:00
ebeab262d9 Refine XPU device prop and fix typo (#140661)
# Motivation
`architecture` is an experimental attribute that might be used by triton AOT codegen. It should not be in `__repr__`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140661
Approved by: https://github.com/EikanWang
2024-11-14 11:18:01 +00:00
659d2132be Add architecture to XPU device property (#138186)
# Motivation
Add `architecture` to XPU device property.
In some cases, low-level application code can use special features or do specific optimizations depending on the device architecture, and this PR enables such applications.
Modified from https://github.com/pytorch/pytorch/pull/129675/files

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138186
Approved by: https://github.com/ezyang
2024-11-13 03:35:13 +00:00
42994234a6 std::value/std::type -> std::_v/std::_t (#138746)
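
For readers unfamiliar with the shorthand in the title, the sketch below shows a C++17 `_v` and `_t` spelling next to the longer C++11 form. The function names `twice` and `twice_verbose` are hypothetical.

```cpp
#include <type_traits>

// C++17 shorthand: `_v` for ::value and `_t` for ::type.
template <typename T>
std::enable_if_t<std::is_integral_v<T>, T> twice(T value) {
  return value + value;
}

// Equivalent pre-C++17 spelling of the same constraint.
template <typename T>
typename std::enable_if<std::is_integral<T>::value, T>::type twice_verbose(
    T value) {
  return value + value;
}

// Both spellings name the same type.
static_assert(
    std::is_same_v<std::enable_if_t<true, int>,
                   typename std::enable_if<true, int>::type>,
    "the alias and the nested typedef are the same type");
```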
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138746
Approved by: https://github.com/cyyever, https://github.com/malfet
2024-10-26 20:59:24 +00:00
8cda774a03 Add torch.xpu.get_arch_list and torch.xpu.get_gencode_flags for XPU (#137773)
# Motivation
Add `torch.xpu.get_arch_list()` and `torch.xpu.get_gencode_flags()` methods that return architecture list and AOT flags to preserve what flags PyTorch XPU was built with.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137773
Approved by: https://github.com/EikanWang, https://github.com/albanD
2024-10-18 02:28:08 +00:00
b14269dcfb Make Context to be Device-agnostic Step by Step (1/N) (#136519) (#138155)
Summary:
- make init to be device-agnostic and move it to AcceleratorHooksInterface
- refactoring context related to device initialization

Original pull request: https://github.com/pytorch/pytorch/pull/136519

Test Plan: contbuild & OSS CI, see 4a8e49389c

Reviewed By: malfet

Differential Revision: D64471142

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138155
Approved by: https://github.com/malfet, https://github.com/bobrenjc93
2024-10-17 20:58:56 +00:00
d4d687ffb2 Revert "Make Context to be Device-agnostic Step by Step (1/N) (#136519)"
This reverts commit 4a8e49389c33934234dc89616fd17a58e760e2e7.

Reverted https://github.com/pytorch/pytorch/pull/136519 on behalf of https://github.com/clee2000 due to breaking internal tests related to MITA, @ezyang has a forward fix? ([comment](https://github.com/pytorch/pytorch/pull/136519#issuecomment-2414588302))
2024-10-15 17:19:16 +00:00
4a8e49389c Make Context to be Device-agnostic Step by Step (1/N) (#136519)
----

- make init to be device-agnostic and move it to AcceleratorHooksInterface
- refactoring context related to device initialization

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136519
Approved by: https://github.com/ezyang, https://github.com/EikanWang, https://github.com/guangyey
2024-10-13 12:38:02 +00:00
079f909263 Revert "Make Context to be Device-agnostic Step by Step (1/N) (#136519)"
This reverts commit be0b75256a7e516217b059ef273901b95c022fe7.

Reverted https://github.com/pytorch/pytorch/pull/136519 on behalf of https://github.com/jovianjaison due to this pr is causing errors internally ([comment](https://github.com/pytorch/pytorch/pull/136519#issuecomment-2405781093))
2024-10-10 18:32:17 +00:00
be0b75256a Make Context to be Device-agnostic Step by Step (1/N) (#136519)
- make init to be device-agnostic and move it to AcceleratorHooksInterface
- refactoring context related to device initialization

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136519
Approved by: https://github.com/ezyang, https://github.com/EikanWang, https://github.com/guangyey
2024-10-09 02:13:36 +00:00
8962610247 [BE][clang-format] make macro PyObject_HEAD_INIT(type) and PyVarObject_HEAD_INIT(type, size) have its own line (#136949)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136949
Approved by: https://github.com/albanD, https://github.com/eqy
ghstack dependencies: #136945
2024-10-02 18:39:22 +00:00
df5bbc09d1 Make device-specific event inherits from torch.Event (#134845)
# Motivation
This PR makes the device-specific Event inherit from the generic torch.Event. The benefit is a generic abstract class `torch.Event` for different devices, like `torch.Stream`. This makes it easier for Dynamo to capture the Event of different devices, such as torch.cuda.Event and torch.xpu.Event.
The next PR will remove the now-unneeded base classes `_StreamBase` and `_EventBase` to avoid multiple inheritance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134845
Approved by: https://github.com/albanD, https://github.com/EikanWang
2024-10-01 06:28:41 +00:00
b53d97c7be [Intel GPU] Add XPU memory-related APIs (#129919)
# Motivation
According to https://github.com/pytorch/pytorch/issues/116322, we will help unify the device allocator, so we first introduced a simple XPU device allocator with only the key functionality, expecting to add memory-statistics functionality after the unification.
However, some memory-statistics APIs listed in https://github.com/pytorch/pytorch/issues/127929 have now been requested, and unifying the device allocator needs more time. To improve the user experience, we intend to support these memory-statistics APIs before the unification.

# Additional Context
Fixes: #127929

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129919
Approved by: https://github.com/dvrogozh, https://github.com/abhilash1910, https://github.com/gujinghui, https://github.com/EikanWang, https://github.com/albanD
ghstack dependencies: #130923
2024-09-07 11:15:17 +00:00
fbd020fce6 Add new prop to _XpuDevicePropertie for triton gemm optimization (#131738)
# Motivation
This PR aims to add new properties to `_XpuDevicePropertie` for triton gemm optimization.

# Additional Context
`ext_oneapi_supports_cl_extension` is not an ABI-neutral API. It depends on compiler 2025.0. For more details, see https://github.com/intel/llvm/pull/13212

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131738
Approved by: https://github.com/gujinghui
2024-08-18 08:32:30 +00:00
e0d3e4a498 remove unused code for XPU (#131856)
# Motivation
This PR aims to remove unused code in PyTorch for XPU, following https://github.com/pytorch/pytorch/pull/128179
Without this PR, CI will be blocked.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131856
Approved by: https://github.com/EikanWang
2024-07-26 02:57:12 +00:00
cyy
29861779ce [2/N] Change #include <c10/util/Optional.h> to #include <optional> (#130236)
Follows #128301. The changes were made with grep and sed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130236
Approved by: https://github.com/ezyang
2024-07-09 03:17:24 +00:00
sdp
b4a0161449 Build SYCL kernels for ATen XPU ops on Native Windows (take 2) (#127390)
Original PR https://github.com/pytorch/pytorch/pull/126725 is closed due to bad rebase.

-------
As proposed in https://github.com/pytorch/pytorch/issues/126719, we are enabling PyTorch XPU on Native Windows on Intel GPU.

This PR enables XPU build on Windows as the first step of #126719:

- Enable `USE_XPU` build on Windows using MSVC as host compiler. The use of MSVC as host compiler seamlessly aligns with the existing PyTorch build on Windows.
- Build oneDNN GPU library on Windows.

Co-authored-by: Yu, Guangye <guangye.yu@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127390
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/ezyang
2024-06-06 01:41:06 +00:00
f4ff063c33 Add attributes to xpu device prop (#121898)
# Motivation
Add some attributes to `XPUDeviceProp` and expose them via `torch.xpu.get_device_properties` and `torch.xpu.get_device_capability`. They can be used in `torch.compile` or directly passed to triton to generate more optimized code based on device properties.

# Additional Context
Expose the following attributes to `torch.xpu.get_device_properties`:
- `has_fp16` (newly added)
- `has_fp64` (newly added)
- `has_atomic64` (newly added)
- `driver_version`
- `vendor`
- `version`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121898
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/malfet, https://github.com/albanD, https://github.com/atalman
2024-03-30 00:25:39 +00:00
12995a5d9d [2/2] Intel GPU Runtime Upstreaming for Generator (#118613)
# Motivation
According to [[1/2] Intel GPU Runtime Upstreaming for Generator](https://github.com/pytorch/pytorch/pull/118528), as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the second PR covers the changes under `python frontend`.

# Design
Currently, it primarily offers generator-related APIs, including

- `torch.xpu.default_generators`
- `torch.xpu.get_rng_state`
- `torch.xpu.get_rng_state_all`
- `torch.xpu.initial_seed`
- `torch.xpu.manual_seed`
- `torch.xpu.manual_seed_all`
- `torch.xpu.seed`
- `torch.xpu.seed_all`
- `torch.xpu.set_rng_state`
- `torch.xpu.set_rng_state_all`

# Additional Context
The differences with CUDA:
The generator-related frontend Python APIs map 1:1 to CUDA.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118613
Approved by: https://github.com/gujinghui, https://github.com/EikanWang, https://github.com/jgong5, https://github.com/albanD
2024-02-28 05:28:11 +00:00
1aa9099839 [CLANGTIDY] Enable clang-tidy in torch/csrc/xpu (#120616)
# Motivation
Refer to [#118504](https://github.com/pytorch/pytorch/pull/118504); this PR enables clang-tidy in `torch/csrc/xpu`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120616
Approved by: https://github.com/albanD
2024-02-28 01:35:25 +00:00
cyy
3cd6a21e8f [DeviceIndex][6/N] Use DeviceIndex in more places (#120133)
This PR follows the series of patches beginning with #119142 and fixes various XPU and Python-related methods to use DeviceIndex.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120133
Approved by: https://github.com/Skylion007
2024-02-21 06:24:23 +00:00
8f9f12c068 Intel GPU Runtime Upstreaming for Device Allocator (#118091)
# Motivation
According to [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842) and [[RFC] Intel GPU Runtime Upstreaming for Allocator](https://github.com/pytorch/pytorch/issues/116322), we will upstream the key functionality of the device `Allocator` dedicated to XPU to PyTorch. Following our design, we are preparing to generalize `Allocator` in parallel.

# Design
In the current design, XPU uses an `XPUAllocator` class, inherited from `c10::Allocator`. `XPUAllocator` is a manager to handle `DeviceCachingAllocator`, which is a per-device implementation of the caching mechanism to manage the already cached or newly allocated memory. The caching mechanism is similar to other backends, like CUDA. We can visualize the design as below.
<p align="center">
<img width="162" alt="image" src="https://github.com/pytorch/pytorch/assets/106960996/6b17b8cf-e7d1-48b4-b684-f830c409d218">
</p>
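
The manager-plus-per-device-cache arrangement described in this design can be sketched as follows. This is a heavily simplified sketch under stated assumptions: `std::malloc` and `std::free` stand in for device memory calls, the class names carry a `Sketch` suffix because they are hypothetical, and real caching allocators also round sizes, split blocks, and synchronize with streams.

```cpp
#include <cstddef>
#include <cstdlib>
#include <map>
#include <memory>
#include <vector>

// Per-device cache of freed blocks, keyed by block size.
class DeviceCachingAllocatorSketch {
 public:
  DeviceCachingAllocatorSketch() = default;
  DeviceCachingAllocatorSketch(const DeviceCachingAllocatorSketch&) = delete;
  DeviceCachingAllocatorSketch& operator=(
      const DeviceCachingAllocatorSketch&) = delete;

  ~DeviceCachingAllocatorSketch() {
    for (auto& entry : cached_) std::free(entry.second);
    for (auto& entry : live_) std::free(entry.first);
  }

  void* allocate_block(std::size_t size) {
    // Reuse the smallest cached block that is large enough.
    auto it = cached_.lower_bound(size);
    if (it != cached_.end()) {
      void* ptr = it->second;
      live_[ptr] = it->first;
      cached_.erase(it);
      return ptr;
    }
    // Cache miss: fall back to a fresh allocation.
    void* ptr = std::malloc(size);
    if (ptr != nullptr) live_[ptr] = size;
    return ptr;
  }

  void release_block(void* ptr) {
    auto it = live_.find(ptr);
    if (it == live_.end()) return;  // unknown pointer: ignore
    // Keep the block for reuse instead of returning it to the system.
    cached_.emplace(it->second, ptr);
    live_.erase(it);
  }

 private:
  std::multimap<std::size_t, void*> cached_;  // size -> free block
  std::map<void*, std::size_t> live_;         // block in use -> size
};

// Manager that routes each request to the cache of the chosen device.
class XPUAllocatorSketch {
 public:
  explicit XPUAllocatorSketch(std::size_t device_count) {
    for (std::size_t i = 0; i < device_count; ++i) {
      per_device_.push_back(std::make_unique<DeviceCachingAllocatorSketch>());
    }
  }

  void* allocate(std::size_t device, std::size_t size) {
    return per_device_.at(device)->allocate_block(size);
  }

  void release(std::size_t device, void* ptr) {
    per_device_.at(device)->release_block(ptr);
  }

 private:
  std::vector<std::unique_ptr<DeviceCachingAllocatorSketch>> per_device_;
};
```

Releasing a block and then requesting a block of the same or smaller size on the same device returns the cached pointer, which is the behaviour the caching layer exists to provide.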

# Additional Context
We're going to implement our design gradually. This PR covers the device `Allocator` dedicated to XPU. The second PR covers the host `Allocator`.
Besides these PRs, we plan to make the device `Allocator` device-agnostic through another PR.
In this PR, our device `Allocator` has the same memory management mechanism as CUDA, but lacks features such as expandable segments and statistics. We will add these features back in a subsequent PR that generalizes `Allocator`.

The differences with CUDA:
Only the key functionality is implemented; AsyncAllocator, gpu_trace, history_record, graph functionality, memory snapshot, memory statistics, expandable segments, and so on are missing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118091
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/albanD
ghstack dependencies: #117611, #117619, #117734
2024-02-16 06:46:00 +00:00
4dc75f9084 Intel GPU Runtime Upstreaming for Event (#117734)
# Motivation
As mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the next runtime component we would like to upstream is `Event`, which tracks the status of an operation that is being executed. In some circumstances, `Event` gives us fine-grained control over operation execution.

# Design
`XPUEvent` is a movable but non-copyable wrapper around a SYCL event. It is created lazily on an XPU device when it is recorded on an `XPUStream`. An `XPUEvent` can also wait for another `XPUEvent`, or for all kernels submitted on an `XPUStream`, to complete. In line with the other backends, the C++ files related to `Event` are placed in the `aten/src/ATen/xpu` folder. For frontend code, the `XPUEvent` runtime API is bound to Python as `torch.xpu.Event`; the corresponding C++ code is placed in `torch/csrc/xpu/Event.cpp` and the Python code in `torch/xpu/streams.py`.
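
A movable, non-copyable wrapper with lazy creation can be sketched as below. `FakeDeviceEvent` is a hypothetical stand-in for the underlying SYCL event, and `EventSketch` is not the actual `XPUEvent` class; the sketch only illustrates the ownership rules described above.

```cpp
#include <memory>
#include <type_traits>

// Hypothetical stand-in for the native event handle.
struct FakeDeviceEvent {
  int stream_id;
};

class EventSketch {
 public:
  EventSketch() = default;  // no native event is created yet

  // Copying is forbidden; moving transfers ownership of the handle.
  EventSketch(const EventSketch&) = delete;
  EventSketch& operator=(const EventSketch&) = delete;
  EventSketch(EventSketch&&) noexcept = default;
  EventSketch& operator=(EventSketch&&) noexcept = default;

  bool is_created() const { return event_ != nullptr; }

  // The native event only comes into existence on the first record.
  void record(int stream_id) {
    event_ = std::make_unique<FakeDeviceEvent>(FakeDeviceEvent{stream_id});
  }

  // Returns -1 when nothing has been recorded yet.
  int recorded_stream() const { return event_ ? event_->stream_id : -1; }

 private:
  std::unique_ptr<FakeDeviceEvent> event_;
};

static_assert(!std::is_copy_constructible_v<EventSketch>,
              "the wrapper must not be copyable");
static_assert(std::is_move_constructible_v<EventSketch>,
              "the wrapper must be movable");
```

Holding the handle in a `std::unique_ptr` makes the compiler enforce single ownership, and a moved-from wrapper simply reports that no event has been created.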

# Additional Context
It is worth mentioning that the `elapsed_time` method is temporarily not supported by `XPUEvent`. We will add support for it soon. `XPUEvent` also does not support IPC across processes. For the other parts, we have almost a 1:1 mapping with CUDA.

The following APIs are missing:
- `torch.cuda.Event.ipc_handle`
- `CUDAEvent`'s constructor with `IpcEventHandle`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117734
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: #117611, #117619
2024-02-16 06:28:26 +00:00
cyy
cb0886ecf2 [DeviceIndex][4/N] Use DeviceIndex in more places (#119741)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119741
Approved by: https://github.com/aaronenyeshi, https://github.com/ezyang
2024-02-14 00:29:10 +00:00
8fd11cb307 [2/2] Intel GPU Runtime Upstreaming for Stream (#117619)
# Motivation
According to [[1/2] Intel GPU Runtime Upstreaming for Stream](https://github.com/pytorch/pytorch/pull/117611), as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the second PR covers the changes under `python frontend`.

# Design
Currently, it primarily offers stream-related APIs, including
 - `torch.xpu.StreamContext`
 - `torch.xpu.current_stream`
 - `torch.xpu.set_stream`
 - `torch.xpu.synchronize`
 - `torch._C._xpu_getCurrentRawStream`

# Additional Context
We will implement functions like `torch.xpu.Stream.wait_event`, `torch.xpu.Stream.wait_stream`, and `torch.xpu.Stream.record_event` in the next PR, which is related to `Event`.

The differences with CUDA:
XPU has no default or external stream, and the following APIs are missing:
- `torch.cuda.ExternalStream`
- `torch.cuda.default_stream`
- `torch.cuda.is_current_stream_capturing`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117619
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/albanD
ghstack dependencies: #117611
2024-02-10 03:39:42 +00:00
9a992b0918 [4/4] Intel GPU Runtime Upstreaming for Device (#116869)
# Motivation
According to [[1/4] Intel GPU Runtime Upstreaming for Device](https://github.com/pytorch/pytorch/pull/116019), as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), this last PR covers the changes under lazy initialization.

# Design
This PR primarily adds support for multiprocessing via lazy initialization. We initialize our runtime lazily, deferring XPU initialization until the first time it is accessed. In our design, we extend `cuda_lazy_init` to `device_lazy_init`, a device-agnostic API that can support any backend, and change `maybe_initialize_cuda` to `maybe_initialize_device` to support lazy initialization for both CUDA and XPU while maintaining scalability.
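
A device-agnostic lazy initializer of this kind could be sketched with one flag per device type, as below. This is a sketch under assumptions, not the PyTorch implementation: `DeviceTypeSketch` and `LazyInitRegistrySketch` are hypothetical names, and the real code has additional concerns such as calling into Python.

```cpp
#include <array>
#include <cstddef>
#include <functional>
#include <mutex>

// Hypothetical device-type enum; the values double as array indices.
enum class DeviceTypeSketch : std::size_t { CUDA = 0, XPU = 1 };

class LazyInitRegistrySketch {
 public:
  // Runs `init` the first time a device type is requested and does
  // nothing on later calls for the same type.
  void device_lazy_init(DeviceTypeSketch type,
                        const std::function<void()>& init) {
    std::lock_guard<std::mutex> guard(mutex_);
    bool& done = initialized_[static_cast<std::size_t>(type)];
    if (done) return;
    init();
    done = true;  // set only after init succeeded
  }

 private:
  std::mutex mutex_;
  std::array<bool, 2> initialized_{};  // all false at start
};
```

Because the flag is set only after `init` returns, an initializer that throws leaves the device type uninitialized and a later call can retry.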

# Additional Context
We adopt a design similar to CUDA's, so we share some code with CUDA.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116869
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/malfet
ghstack dependencies: #119248
2024-02-08 03:01:21 +00:00
a205e7bf56 [3/4] Intel GPU Runtime Upstreaming for Device (#116850)
# Motivation
According to [[1/4] Intel GPU Runtime Upstreaming for Device](https://github.com/pytorch/pytorch/pull/116019), as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), this third PR covers the changes under `libtorch_python`.

# Design
This PR primarily offers device-related APIs in the Python frontend, including
- `torch.xpu.is_available`
- `torch.xpu.device_count`
- `torch.xpu.current_device`
- `torch.xpu.set_device`
- `torch.xpu.device`
- `torch.xpu.device_of`
- `torch.xpu.get_device_name`
- `torch.xpu.get_device_capability`
- `torch.xpu.get_device_properties`
- ====================
- `torch.xpu._DeviceGuard`
- `torch.xpu._is_compiled`
- `torch.xpu._get_device`

# Additional Context
We will implement the support of lazy initialization in the next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116850
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/malfet
2024-02-01 12:31:26 +00:00