pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-21 05:34:18 +08:00

Author	SHA1	Message	Date
cyy	8fa81a6066	Enable misc-use-internal-linkage check and apply fixes (#148948 ) Enables clang-tidy rule [`misc-use-internal-linkage`](https://clang.llvm.org/extra/clang-tidy/checks/misc/use-internal-linkage.html). This new check was introduced in Clang-Tidy 18 and is available due to the recent update to Clang-Tidy 19. The check marks functions and variables used only in the translation unit as static. Therefore, undesired symbols are not leaked into other units, more link-time optimisations are possible, and the resulting binaries may be smaller. The detected violations were mostly fixed by using static. In other cases, the symbols were indeed consumed by other files, so their declaring headers were included. Some declarations were still wrong and have been fixed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148948 Approved by: https://github.com/Skylion007	2025-03-12 14:22:56 +00:00
Marko Radmilac	c65ee728f0	Initial implementation of host memory stats (#147660 ) This is an initial attempt to provide some statistics for the pinned host memory allocations flowing through CachingHostAllocator. Many times in the past we have had inexplicable slowdowns that would be much easier to diagnose if we had some host memory characteristics. This change tries very hard not to disrupt the initial design of the allocator, and it uses existing locking mechanism, whenever possible, to gather statistics "for free". Only deviation from that is on the "slow path" where we incur CUDA calls anyway, so taking a short lock is not going to hurt the performance much, especially in the steady state where most allocations will come from cache. As mentioned before, this is the first PR, to introduce the concept and to see if it fits the right paradigm. We can always add more later. Metrics that would require more involved changes to the code base and locks, like requested memory, have been punted for now. I also tried to reuse the Stat structure used in CUDA caching allocator, in order to maintain symmetry. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147660 Approved by: https://github.com/ngimel	2025-03-05 16:13:19 +00:00
PyTorch MergeBot	a983b2b11a	Revert "Initial implementation of host memory stats (#147660 )" This reverts commit 945e359fc1afe6c0bb6129ed9607b237fa19cd98. Reverted https://github.com/pytorch/pytorch/pull/147660 on behalf of https://github.com/mradmila due to There is an issue with ambiguous definition of Stat structure when different C++ tools are used. Backing out for now. ([comment](https://github.com/pytorch/pytorch/pull/147660#issuecomment-2692346379))	2025-03-01 18:05:45 +00:00
Marko Radmilac	945e359fc1	Initial implementation of host memory stats (#147660 ) This is an initial attempt to provide some statistics for the pinned host memory allocations flowing through CachingHostAllocator. Many times in the past we have had inexplicable slowdowns that would be much easier to diagnose if we had some host memory characteristics. This change tries very hard not to disrupt the initial design of the allocator, and it uses existing locking mechanism, whenever possible, to gather statistics "for free". Only deviation from that is on the "slow path" where we incur CUDA calls anyway, so taking a short lock is not going to hurt the performance much, especially in the steady state where most allocations will come from cache. As mentioned before, this is the first PR, to introduce the concept and to see if it fits the right paradigm. We can always add more later. Metrics that would require more involved changes to the code base and locks, like requested memory, have been punted for now. I also tried to reuse the Stat structure used in CUDA caching allocator, in order to maintain symmetry. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147660 Approved by: https://github.com/ngimel	2025-02-28 18:36:44 +00:00
Yu, Guangye	aa20b4b6cf	Friendly handle mem_get_info's runtime error message (#146899 ) # Motivation Friendly handle the runtime error message if the device doesn't support querying the available free memory. See https://github.com/intel/torch-xpu-ops/issues/1352 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146899 Approved by: https://github.com/EikanWang	2025-02-13 06:26:19 +00:00
cyy	25aa7ca62d	Cleanup CallOnce.h (#146700 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146700 Approved by: https://github.com/albanD	2025-02-07 16:44:45 +00:00
cyy	29f52e3972	[2/N] Remove unnecessary once flag usage (#145057 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/145057 Approved by: https://github.com/albanD	2025-01-23 09:48:46 +00:00
Yu, Guangye	8f6c4d1732	Add get_stream_from_external API for XPU backend (#141123 ) # Motivation This PR aims to introduce `torch.xpu.ExternalStream` to be used to wrap SYCL queue created in other libraries to PyTorch. # Additional Context Pull Request resolved: https://github.com/pytorch/pytorch/pull/141123 Approved by: https://github.com/albanD, https://github.com/EikanWang ghstack dependencies: #142347, #141119	2024-12-31 11:15:52 +00:00
Yu, Guangye	8dd4673cea	Support torch.xpu.mem_get_info API (#141230 ) # Motivation Fix https://github.com/pytorch/pytorch/issues/130599 This PR intends to add a new API, `torch.xpu.mem_get_info`, which is widely used in popular model workloads. For example, [here](`403c0714d1/src/accelerate/utils/modeling.py (L721)`) we need to get the current GPU memory usage to split or load the model. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141230 Approved by: https://github.com/EikanWang, https://github.com/albanD	2024-12-05 08:17:25 +00:00
Yu, Guangye	ebeab262d9	Refine XPU device prop and fix typo (#140661 ) # Motivation `architecture` is an experimental attribute that might be used by triton AOT codegen. It should not be in `__repr__`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140661 Approved by: https://github.com/EikanWang	2024-11-14 11:18:01 +00:00
Yu, Guangye	659d2132be	Add architecture to XPU device property (#138186 ) # Motivation Add `architecture` to XPU device property. In some cases, low-level application code can use special features or do specific optimizations depending on the device architecture, and this PR enables such applications. Modified from https://github.com/pytorch/pytorch/pull/129675/files Pull Request resolved: https://github.com/pytorch/pytorch/pull/138186 Approved by: https://github.com/ezyang	2024-11-13 03:35:13 +00:00
Richard Barnes	42994234a6	std::value/std::type -> std::_v/std::_t (#138746 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138746 Approved by: https://github.com/cyyever, https://github.com/malfet	2024-10-26 20:59:24 +00:00
Yu, Guangye	8cda774a03	Add torch.xpu.get_arch_list and torch.xpu.get_gencode_flags for XPU (#137773 ) # Motivation Add `torch.xpu.get_arch_list()` and `torch.xpu.get_gencode_flags()` methods that return architecture list and AOT flags to preserve what flags PyTorch XPU was built with. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137773 Approved by: https://github.com/EikanWang, https://github.com/albanD	2024-10-18 02:28:08 +00:00
Edward Yang	b14269dcfb	Make Context to be Device-agnostic Step by Step (1/N) (#136519 ) (#138155 ) Summary: - make init to be device-agnostic and move it to AcceleratorHooksInterface - refactoring context related to device initialization Original pull request: https://github.com/pytorch/pytorch/pull/136519 Test Plan: contbuild & OSS CI, see `4a8e49389c` Reviewed By: malfet Differential Revision: D64471142 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138155 Approved by: https://github.com/malfet, https://github.com/bobrenjc93	2024-10-17 20:58:56 +00:00
PyTorch MergeBot	d4d687ffb2	Revert "Make Context to be Device-agnostic Step by Step (1/N) (#136519 )" This reverts commit 4a8e49389c33934234dc89616fd17a58e760e2e7. Reverted https://github.com/pytorch/pytorch/pull/136519 on behalf of https://github.com/clee2000 due to breaking internal tests related to MITA, @ezyang has a forward fix? ([comment](https://github.com/pytorch/pytorch/pull/136519#issuecomment-2414588302))	2024-10-15 17:19:16 +00:00
FFFrog	4a8e49389c	Make Context to be Device-agnostic Step by Step (1/N) (#136519 ) ---- - make init to be device-agnostic and move it to AcceleratorHooksInterface - refactoring context related to device initialization Pull Request resolved: https://github.com/pytorch/pytorch/pull/136519 Approved by: https://github.com/ezyang, https://github.com/EikanWang, https://github.com/guangyey	2024-10-13 12:38:02 +00:00
PyTorch MergeBot	079f909263	Revert "Make Context to be Device-agnostic Step by Step (1/N) (#136519 )" This reverts commit be0b75256a7e516217b059ef273901b95c022fe7. Reverted https://github.com/pytorch/pytorch/pull/136519 on behalf of https://github.com/jovianjaison due to this pr is causing errors internally ([comment](https://github.com/pytorch/pytorch/pull/136519#issuecomment-2405781093))	2024-10-10 18:32:17 +00:00
FFFrog	be0b75256a	Make Context to be Device-agnostic Step by Step (1/N) (#136519 ) - make init to be device-agnostic and move it to AcceleratorHooksInterface - refactoring context related to device initialization Pull Request resolved: https://github.com/pytorch/pytorch/pull/136519 Approved by: https://github.com/ezyang, https://github.com/EikanWang, https://github.com/guangyey	2024-10-09 02:13:36 +00:00
Xuehai Pan	8962610247	[BE][clang-format] make macro `PyObject_HEAD_INIT(type)` and `PyVarObject_HEAD_INIT(type, size)` have its own line (#136949 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136949 Approved by: https://github.com/albanD, https://github.com/eqy ghstack dependencies: #136945	2024-10-02 18:39:22 +00:00
Yu, Guangye	df5bbc09d1	Make device-specific event inherits from torch.Event (#134845 ) # Motivation This PR intends to make device-specific Event inherit from the generic torch.Event. The benefit is providing a generic abstract class `torch.Event` for different devices, like `torch.Stream`. This makes it easier for Dynamo to capture the Event of different devices, like torch.cuda.Event and torch.xpu.Event. The next PR intends to remove the now-redundant base classes `_StreamBase` and `_EventBase` to avoid multiple inheritance. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134845 Approved by: https://github.com/albanD, https://github.com/EikanWang	2024-10-01 06:28:41 +00:00
Yu, Guangye	b53d97c7be	[Intel GPU] Add XPU memory-related APIs (#129919 ) # Motivation According to https://github.com/pytorch/pytorch/issues/116322, we will help unify the device allocator. So we introduce a simple xpu device allocator only with the key functionality first. And expect to add some memory statistics-related functionality after the unification. But now, some memory statistic-related APIs listed in https://github.com/pytorch/pytorch/issues/127929 are requested. We need more time to unify the device allocator. In order to facilitate the user experience, we expect to support these memory statistic-related APIs before the unification. # Additional Context Fixes: #127929 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129919 Approved by: https://github.com/dvrogozh, https://github.com/abhilash1910, https://github.com/gujinghui, https://github.com/EikanWang, https://github.com/albanD ghstack dependencies: #130923	2024-09-07 11:15:17 +00:00
Yu, Guangye	fbd020fce6	Add new prop to _XpuDevicePropertie for triton gemm optimization (#131738 ) # Motivation This PR aims to add new properties to `_XpuDevicePropertie` for triton gemm optimization. # Additional Context `ext_oneapi_supports_cl_extension` is not an ABI-neutral API. It depends on compiler 2025.0. For more details, see https://github.com/intel/llvm/pull/13212 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131738 Approved by: https://github.com/gujinghui	2024-08-18 08:32:30 +00:00
Yu, Guangye	e0d3e4a498	remove unused code for XPU (#131856 ) # Motivation This PR aims to remove unused code in PyTorch for XPU, following https://github.com/pytorch/pytorch/pull/128179 Otherwise, CI will block without this PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131856 Approved by: https://github.com/EikanWang	2024-07-26 02:57:12 +00:00
cyy	29861779ce	[2/N] Change #include <c10/util/Optional.h> to #include <optional> (#130236 ) Follows #128301. The changes were made by grep and sed Pull Request resolved: https://github.com/pytorch/pytorch/pull/130236 Approved by: https://github.com/ezyang	2024-07-09 03:17:24 +00:00
sdp	b4a0161449	Build SYCL kernels for ATen XPU ops on Native Windows (take 2) (#127390 ) Original PR https://github.com/pytorch/pytorch/pull/126725 is closed due to bad rebase. ------- As proposed in https://github.com/pytorch/pytorch/issues/126719, we are enabling PyTorch XPU on Native Windows on Intel GPU. This PR enables XPU build on Windows as the first step of #126719: - Enable `USE_XPU` build on Windows using MSVC as host compiler. The use of MSVC as host compiler seamlessly aligns with the existing PyTorch build on Windows. - Build oneDNN GPU library on Windows. Co-authored-by: Yu, Guangye <guangye.yu@intel.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127390 Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/ezyang	2024-06-06 01:41:06 +00:00
Yu, Guangye	f4ff063c33	Add attributes to xpu device prop (#121898 ) # Motivation Add some attributes to `XPUDeviceProp` and expose them via `torch.xpu.get_device_properties` and `torch.xpu.get_device_capability`. They can be used in `torch.compile` or directly passed to triton to generate more optimized code based on device properties. # Additional Context Expose the following attributes to `torch.xpu.get_device_properties`: - `has_fp16` (newly added) - `has_fp64` (newly added) - `has_atomic64` (newly added) - `driver_version` - `vendor` - `version` Pull Request resolved: https://github.com/pytorch/pytorch/pull/121898 Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/malfet, https://github.com/albanD, https://github.com/atalman	2024-03-30 00:25:39 +00:00
Yu, Guangye	12995a5d9d	[2/2] Intel GPU Runtime Upstreaming for Generator (#118613 ) # Motivation According to [[1/2] Intel GPU Runtime Upstreaming for Generator](https://github.com/pytorch/pytorch/pull/118528), as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the second PR covers the changes under `python frontend`. # Design Currently, it primarily offers generator-related APIs, including - `torch.xpu.default_generators` - `torch.xpu.get_rng_state` - `torch.xpu.get_rng_state_all` - `torch.xpu.initial_seed` - `torch.xpu.manual_seed` - `torch.xpu.manual_seed_all` - `torch.xpu.seed` - `torch.xpu.seed_all` - `torch.xpu.set_rng_state` - `torch.xpu.set_rng_state_all` # Additional Context The differences with CUDA: The generator-related frontend python APIs are 1:1 mapping with CUDA. Pull Request resolved: https://github.com/pytorch/pytorch/pull/118613 Approved by: https://github.com/gujinghui, https://github.com/EikanWang, https://github.com/jgong5, https://github.com/albanD	2024-02-28 05:28:11 +00:00
Yu, Guangye	1aa9099839	[CLANGTIDY] Enable clang-tidy in torch/csrc/xpu (#120616 ) # Motivation refer to [#118504](https://github.com/pytorch/pytorch/pull/118504), enabling clang-tidy in `torch/csrc/xpu`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120616 Approved by: https://github.com/albanD	2024-02-28 01:35:25 +00:00
cyy	3cd6a21e8f	[DeviceIndex][6/N] Use DeviceIndex in more places (#120133 ) This PR follows the series of patches beginning with #119142 and fixes various XPU and python related methods to use DeviceIndex. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120133 Approved by: https://github.com/Skylion007	2024-02-21 06:24:23 +00:00
Yu, Guangye	8f9f12c068	Intel GPU Runtime Upstreaming for Device Allocator (#118091 ) # Motivation According to [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842) and [[RFC] Intel GPU Runtime Upstreaming for Allocator](https://github.com/pytorch/pytorch/issues/116322), we will upstream the key functionality of device `Allocator` dedicated for XPU to PyTorch. Following our design, we are preparing to generalize `Allocator` in parallel. # Design In the current design, XPU uses an `XPUAllocator` class, inherited from `c10::Allocator`. `XPUAllocator` is a manager to handle `DeviceCachingAllocator`, which is a per-device implementation of the caching mechanism to manage the already cached or newly allocated memory. The caching mechanism is similar to other backends, like CUDA. We can visualize the design as below. <p align="center"> <img width="162" alt="image" src="https://github.com/pytorch/pytorch/assets/106960996/6b17b8cf-e7d1-48b4-b684-f830c409d218"> </p> # Additional Context We're going to implement our design gradually. This PR covers the device `Allocator` dedicated to XPU. The second PR covers the host `Allocator`. Besides these PRs, we plan to make the device `Allocator` device-agnostic through another PR. In this PR, our device `Allocator` has the same memory management mechanism as CUDA, but lacks features such as expandable segments and statistics. We will add these features back in the subsequent PR, which intends to generalize `Allocator`. The differences with CUDA: only key functionality, and lack of AsyncAllocator, gpu_trace, history_record, graph functionality, memory snapshot, memory statistics, expandable segment... Pull Request resolved: https://github.com/pytorch/pytorch/pull/118091 Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/albanD ghstack dependencies: #117611, #117619, #117734	2024-02-16 06:46:00 +00:00
Yu, Guangye	4dc75f9084	Intel GPU Runtime Upstreaming for Event (#117734 ) # Motivation As mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the next runtime component we would like to upstream is `Event` which handles the status of an operation that is being executed. Typically, in some circumstances, we can exercise fine-grained control over operation execution via `Event`. # Design `XPUEvent` is a movable but not a copyable wrapper around sycl event. It should be created lazily on an XPU device when recording an `XPUStream`. Meanwhile, `XPUEvent` can wait for another `XPUEvent` or all the submitted kernels on an `XPUStream` to complete. In line with the other backends, the C++ files related to `Event` will be placed in `aten/src/ATen/xpu` folder. For frontend code, `XPUEvent` runtime API will be bound to Python `torch.xpu.Event`. The corresponding C++ code will be placed in `torch/csrc/xpu/Event.cpp` and Python code will be placed in `torch/xpu/streams.py` respectively. # Additional Context It is worth mentioning that the `elapsed_time` method is temporarily not supported by `XPUEvent`. We will be adding support for it soon. Meanwhile `XPUEvent` doesn't support IPC from different processes. For the other parts, we have almost a 1:1 mapping with CUDA. The following APIs are lacking: - `torch.cuda.Event.ipc_handle` - `CUDAEvent`'s constructor with `IpcEventHandle` Pull Request resolved: https://github.com/pytorch/pytorch/pull/117734 Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/malfet ghstack dependencies: #117611, #117619	2024-02-16 06:28:26 +00:00
cyy	cb0886ecf2	[DeviceIndex][4/N] Use DeviceIndex in more places (#119741 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/119741 Approved by: https://github.com/aaronenyeshi, https://github.com/ezyang	2024-02-14 00:29:10 +00:00
Yu, Guangye	8fd11cb307	[2/2] Intel GPU Runtime Upstreaming for Stream (#117619 ) # Motivation According to [[1/2] Intel GPU Runtime Upstreaming for Stream](https://github.com/pytorch/pytorch/pull/117611), as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the second PR covers the changes under `python frontend`. # Design Currently, it primarily offers stream-related APIs, including - `torch.xpu.StreamContext` - `torch.xpu.current_stream` - `torch.xpu.set_stream` - `torch.xpu.synchronize` - `torch._C._xpu_getCurrentRawStream` # Additional Context We will implement functions like `torch.xpu.Stream.wait_event`, `torch.xpu.Stream.wait_stream`, and `torch.xpu.Stream.record_event` in the next PR related to `Event`. The differences with CUDA: no default and external stream in XPU and lack of below APIs: - `torch.cuda.ExternalStream` - `torch.cuda.default_stream` - `torch.cuda.is_current_stream_capturing` Pull Request resolved: https://github.com/pytorch/pytorch/pull/117619 Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/albanD ghstack dependencies: #117611	2024-02-10 03:39:42 +00:00
Yu, Guangye	9a992b0918	[4/4] Intel GPU Runtime Upstreaming for Device (#116869 ) # Motivation According to [[1/4] Intel GPU Runtime Upstreaming for Device](https://github.com/pytorch/pytorch/pull/116019), as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), this last PR covers the changes under lazy initialization. # Design This PR primarily offers the support of multi-processing via lazy initialization. We lazily initialize our runtime avoiding initializing XPU until the first time it is accessed. In our design, we extend `cuda_lazy_init` to `device_lazy_init` which is a device-agnostic API that can support any backend. And change `maybe_initialize_cuda` to `maybe_initialize_device` to support lazy initialization for both CUDA and XPU while maintaining scalability. # Additional Context We adopt a similar design to CUDA. So we share some code with CUDA. Pull Request resolved: https://github.com/pytorch/pytorch/pull/116869 Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/malfet ghstack dependencies: #119248	2024-02-08 03:01:21 +00:00
Yu, Guangye	a205e7bf56	[3/4] Intel GPU Runtime Upstreaming for Device (#116850 ) # Motivation According to [[1/4] Intel GPU Runtime Upstreaming for Device](https://github.com/pytorch/pytorch/pull/116019), As mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), this third PR covers the changes under `libtorch_python`. # Design This PR primarily offers device-related APIs in python frontend, including - `torch.xpu.is_available` - `torch.xpu.device_count` - `torch.xpu.current_device` - `torch.xpu.set_device` - `torch.xpu.device` - `torch.xpu.device_of` - `torch.xpu.get_device_name` - `torch.xpu.get_device_capability` - `torch.xpu.get_device_properties` - ==================== - `torch.xpu._DeviceGuard` - `torch.xpu._is_compiled` - `torch.xpu._get_device` # Additional Context We will implement the support of lazy initialization in the next PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/116850 Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/malfet	2024-02-01 12:31:26 +00:00

35 Commits