Using good old IOKit to read the `gpu-core-count` property from the device implementing the `AGXAccelerator` service.
Expose it as `torch.backends.mps.get_core_count()` and make it accessible to inductor via `MpsInterface`.
Test Plan: Run `python3 -c "import torch;print(torch.backends.mps.get_name(), torch.backends.mps.get_core_count())"` and compare it against the output of `system_profiler SPDisplaysDataType|head -n10`:
```
% python3 -c "import torch;print(torch.backends.mps.get_name(), torch.backends.mps.get_core_count())"
Apple M1 Pro 16
% system_profiler SPDisplaysDataType|head -n10
Graphics/Displays:
    Apple M1 Pro:
      Chipset Model: Apple M1 Pro
      Type: GPU
      Bus: Built-In
      Total Number of Cores: 16
      Vendor: Apple (0x106b)
      Metal Support: Metal 3
```
This would significantly improve occupancy for torch.compile-generated kernels.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160414
Approved by: https://github.com/dcci
Adds the `is_triton_capable` and `raise_if_triton_unavailable` class methods to the device interface, to allow device types to run their own checks for Triton _capability_ (whether a device can support Triton in the first place) and _availability_ (whether the correct Triton backend is installed and functional for the device).
Using the device interface lets us perform these checks in a device-agnostic way and lets external backends attest to their Triton support simply by implementing these methods. The intention is for this to back utilities like `has_triton`.
This has been split from #139171.
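A rough sketch of what an out-of-tree backend might implement, assuming the base class and registration helper are importable from `torch._dynamo.device_interface` and that both hooks accept an optional device argument (the `my_accelerator` device type and the method bodies are illustrative):

```python
from torch._dynamo.device_interface import DeviceInterface, register_interface_for_device


class MyAcceleratorInterface(DeviceInterface):
    @classmethod
    def is_triton_capable(cls, device=None) -> bool:
        # Capability: can this device run Triton kernels at all?
        return True

    @classmethod
    def raise_if_triton_unavailable(cls, device=None) -> None:
        # Availability: is a working Triton backend installed for this device?
        try:
            import triton  # noqa: F401
        except ImportError as exc:
            raise RuntimeError("Triton is not installed for 'my_accelerator'") from exc


register_interface_for_device("my_accelerator", MyAcceleratorInterface)
```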
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152529
Approved by: https://github.com/jansel
This allows each device type to check the current devices for Triton compatibility and to ensure its Triton backend is present.
This PR replaces the `has_triton()` global method, which was previously used for this task, and moves the initial check for each Inductor backend onto its associated `BaseScheduler` subclass. This means that other backends, such as Halide, can also implement their own availability checks.
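As a rough illustration of the shape of such a backend-side check (the `raise_if_unavailable` hook and the class name are hypothetical stand-ins for the actual scheduler API):

```python
import torch
from torch._dynamo.device_interface import get_interface_for_device


class MyTritonScheduling:  # stand-in for an Inductor BaseScheduler subclass
    @classmethod
    def raise_if_unavailable(cls, device: torch.device) -> None:
        # Delegate the capability/availability checks to the device interface
        # instead of relying on a global has_triton() helper.
        device_interface = get_interface_for_device(device.type)
        if not device_interface.is_triton_capable():
            raise RuntimeError(f"{device.type} devices cannot run Triton kernels")
        device_interface.raise_if_triton_unavailable()
```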
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139171
Approved by: https://github.com/jansel
----
- Use `is_dtype_supported` to skip the dtype-promotion portion of the test on unsupported devices
- Extend the test to use `torch.float16` so promotions can still be checked there
- Implement `CpuInterface.is_bfloat16_supported` to return True (which appears to be the case, even if bfloat16 is supported via emulation)
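For context, a minimal sketch of the kind of check involved, with the device-interface query abstracted into a boolean argument (the helper name is illustrative):

```python
import unittest

import torch


def check_promotion(device: str, dtype: torch.dtype, dtype_supported: bool) -> None:
    # `dtype_supported` would come from the device interface's dtype-support query;
    # skip the promotion check when the device cannot represent the dtype.
    if not dtype_supported:
        raise unittest.SkipTest(f"{dtype} not supported on {device}")
    lhs = torch.ones(4, dtype=dtype, device=device)
    rhs = torch.ones(4, dtype=torch.float32, device=device)
    # e.g. float16 + float32 promotes to float32
    assert (lhs + rhs).dtype == torch.float32
```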
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144795
Approved by: https://github.com/Skylion007
ghstack dependencies: #144509, #144798
The original PR #122396 used the CPU device since at that point in time
there was no actual Triton CPU backend. After #133408, this is no longer
the case, so we now have multiple backends getting registered for the
CPU. The test still works in OSS but fails internally due to different
test runners initializing the backends in a different order.
This PR doesn't actually end up fixing the test internally because
cpp_extension -- needed to implement the privateuseone device -- isn't
supported there, so we simply skip it instead. However, it still makes the
OSS test independent of initialization order, which is good.
Differential Revision: [D63838169](https://our.internmc.facebook.com/intern/diff/D63838169/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137611
Approved by: https://github.com/henrylhtsang
Just something I noticed while implementing a new DeviceInterface.
I had to add `# type: ignore[assignment]` because mypy thinks
`DeviceInterface.get_raw_stream` is a `Callable` and therefore
incompatible with a `staticmethod`.
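A condensed illustration of the pattern (simplified; the real base class declares more than this):

```python
from typing import Callable


class DeviceInterface:
    # Declared as an attribute, so mypy sees a plain `Callable` here.
    get_raw_stream: Callable[[int], int]


class MyDeviceInterface(DeviceInterface):
    # Overriding the Callable attribute with a staticmethod trips mypy's
    # assignment check, hence the suppression.
    @staticmethod
    def get_raw_stream(device_idx: int) -> int:  # type: ignore[assignment]
        return 0
```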
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134187
Approved by: https://github.com/jansel
Summary: Currently the `device` context manager from the device interface is used in inductor, although only in one place. This PR creates an inductor-specific `DeviceGuard` class for these cases, which keeps a reference to the `DeviceInterface` class that is defined and added out of tree. This offloads the device-specific work to the device interface, instead of having to define this logic on the device class, which isn't strictly necessary for inductor.
Ideally I would have used the existing `DeviceGuard` classes, but these are defined per device and don't work well with inductor's device-agnostic, out-of-tree-compatible design. With the existing classes in mind, I am happy to take suggestions on renaming this class.
Whilst I was there, I also took the opportunity to rename `gpu_device` to `device_interface` to clarify that this is not necessarily a GPU.
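As a rough sketch of the idea, assuming the device interface exposes `exchange_device`/`maybe_exchange_device` style hooks (names and details may differ from the actual inductor class):

```python
from typing import Optional


class DeviceGuard:
    """Context manager that sets the current device on entry and restores the
    previous one on exit, delegating the device-specific work to a (possibly
    out-of-tree) DeviceInterface instead of a per-device guard class."""

    def __init__(self, device_interface, index: Optional[int]):
        self.device_interface = device_interface
        self.idx = index
        self.prev_idx = -1

    def __enter__(self):
        if self.idx is not None:
            self.prev_idx = self.device_interface.exchange_device(self.idx)
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        if self.idx is not None:
            self.device_interface.maybe_exchange_device(self.prev_idx)
        return False
```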
Test Plan: None currently, happy to add some.
Co-authored-by: Matthew Haddock <matthewha@graphcore.ai>
Co-authored-by: Adnan Akhundov <adnan.akhundov@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123338
Approved by: https://github.com/aakhundov
This is a placeholder implementation for reconstructing streams via global storage to unblock FSDP, pending a proper stream-support design.
This PR does a few things:
1) Fixes registration for devices with indices. We previously only supported "cuda"; we now support "cuda:k" interfaces, where k is the GPU index.
2) Changes the stream objects in dynamo to take devices as device types instead of strings, and updates the string-based device APIs to gracefully accept device types.
3) Introduces reconstruct-by-global (using the existing cleanup hook structures) for streams as a placeholder implementation for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117386
Approved by: https://github.com/jansel
One small runtime change: `get_interface_for_device()` now throws
instead of returning None when an interface is not found. Inspecting all
the call sites in the codebase shows that none of them actually check
whether the return value is None, so I think this is safe.
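A minimal illustration of the new behavior, assuming the helper is importable from `torch._dynamo.device_interface` (the device string is deliberately bogus):

```python
from torch._dynamo.device_interface import get_interface_for_device

try:
    get_interface_for_device("made_up_device")
except Exception as exc:
    # Previously this returned None for an unknown device type; now it raises.
    print(f"no device interface registered: {exc}")
```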
I also silenced a bunch of mypy errors around method assignment; mypy
seems unable to handle the subtype checks correctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112974
Approved by: https://github.com/eellison
ghstack dependencies: #112130, #112970, #112971, #112972, #112973
This PR implements two things:
1. Support for capturing device-agnostic stream and runtime APIs in dynamo.
2. Support for capturing stream methods (including events) in dynamo.
Here are the details for the first.
Previously the stream captured in dynamo was tightly bound to CUDA. Here we implement a global singleton container named `StreamMethodContainer` through which different backends register their stream methods with dynamo. When the backend's package is imported, the stream operations can be registered directly by calling:
```python
device_stream_method = {'current_stream': method_1,
                        'create_stream_context': method_2,
                        'set_stream': method_3,
                        'set_stream_by_id': method_4}
torch._dynamo.stream.register_stream_method(device_name, device_stream_method)
```
Stream methods need to be passed to this API according to the precise semantics represented by the keys of `device_stream_method`. After registration, these methods can be used by dynamo to capture stream operations in the user's script, for example getting the current stream or setting a specific stream. Additionally, the wrapped stream variable and the stream context variable are now device-agnostic; their proxy functions are assigned from the associated methods in the container.
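For example, once the CUDA stream methods are registered, dynamo can trace user code like the following (stock `torch.cuda` stream APIs; whether every call stays in a single graph depends on the dynamo version):

```python
import torch


@torch.compile
def fn(x):
    side_stream = torch.cuda.Stream()
    # The `with` block is handled via the registered set_stream/current_stream methods.
    with torch.cuda.stream(side_stream):
        y = x * 2
    torch.cuda.current_stream().wait_stream(side_stream)
    return y + 1


if torch.cuda.is_available():
    print(fn(torch.ones(4, device="cuda")))
```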

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108312
Approved by: https://github.com/jansel, https://github.com/jgong5
# Motivation
@jansel As discussed before, we would like to generalize some CUDA-specific code. This makes inductor friendlier to third-party backends, so that they can leverage inductor code as much as possible.
# Solution
To implement this, we introduce a device runtime abstraction. We wrap the runtime APIs inside `DeviceInterface` and use `register_interface_for_device` to register each kind of device with inductor, then use `get_interface_for_device` to fetch the corresponding runtime for a device type. Usage looks like this:
```python
device_interface = get_interface_for_device("xpu")
device_interface.is_available()  # check whether XPU is available
device_interface.device_count()  # check how many XPU devices are available
```
`DeviceInterface` is a simple abstraction that enables third-party backends implementing CUDA-like semantics to be integrated with inductor. It prevents third-party backends from having to monkey-patch utility functions, like `decode_device`, that are hard-coded for CUDA.
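A hedged sketch of the registration side for an out-of-tree backend, assuming the abstraction is importable from `torch._dynamo.device_interface` (the `my_backend` package and device type are hypothetical):

```python
from torch._dynamo.device_interface import DeviceInterface, register_interface_for_device

import my_backend  # hypothetical out-of-tree package exposing CUDA-like runtime APIs


class MyBackendInterface(DeviceInterface):
    @staticmethod
    def is_available() -> bool:
        return my_backend.is_available()

    @staticmethod
    def device_count() -> int:
        return my_backend.device_count()

    @staticmethod
    def current_device() -> int:
        return my_backend.current_device()

    @staticmethod
    def set_device(device: int) -> None:
        my_backend.set_device(device)


# Make the runtime discoverable by inductor for the "my_backend" device type.
register_interface_for_device("my_backend", MyBackendInterface)
```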
# Additional Context
The main code changes:
- Make `AsyncCompile` device-agnostic so it can be leveraged by other backends
- Make some utility functions device-agnostic to avoid monkey patches
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109486
Approved by: https://github.com/jansel, https://github.com/jgong5, https://github.com/EikanWang