Summary: We currently have foreach kernel implementations for MTIA, and for when we don't we internally decompose the ops. Anyone using this list for compatibility checks should be sending through the foreach kernels.
Reviewed By: egienvalue, scottxu0730
Differential Revision: D77751248
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157583
Approved by: https://github.com/egienvalue
This change adds "hpu" to the list of device types that support fused kernels in the optimizer, ensuring
compatibility with HPU backend.
Without this change, when `test_all_gather_extension_outer_size_stride` of `pytorch/test/distributed/_composable/fsdp/test_fully_shard_extensions.py` is run on 'hpu' backend, it fails with:
RuntimeError: fused=True requires all the params to be floating point Tensors
of supported devices: ['mps', 'cuda', 'xpu', 'cpu', 'privateuseone']
but torch.float32 and hpu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148666
Approved by: https://github.com/albanD
Now torch.dtype can pass through pybind11, so modify function _group_tensors_by_device_and_dtype to using scalar type. And without convert torch.dtype and string in python and c++ side.
@ezyang @bdhirsh
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127869
Approved by: https://github.com/ezyang
Support fused_sgd_kernel support for CPU.
## Bench result:
32 core/sockets ICX
Test Scripts:
https://gist.github.com/zhuhaozhe/688763e17e93e4c5e12f25f676ec90d9https://gist.github.com/zhuhaozhe/ad9938694bc7fae8b66d376f4dffc6c9
```
Tensor Size: 262144, Num Tensor 4, Num Threads: 1
_single_tensor_sgd time: 0.2301 seconds
_fused_sgd time: 0.0925 seconds
Tensor Size: 4194304, Num Tensor 32, Num Threads: 32
_single_tensor_sgd time: 2.6195 seconds
_fused_sgd time: 1.7543 seconds
```
## Test Plan:
```
python test_optim.py -k test_fused_matches_forloop
python test_optim.py -k test_fused_large_tensor
python test_optim.py -k test_can_load_older_state_dict
python test_optim.py -k test_grad_scaling_autocast_fused_optimizers
python test_torch.py -k test_grad_scaling_autocast_fused
python test_torch.py -k test_params_invalidated_with_grads_invalidated_between_unscale_and_step
```
Looks like we already have some PRs under this issue https://github.com/pytorch/pytorch/issues/123451 to unified the UTs, I did not modified UT in this PR.
Co-authored-by: Jane Xu <janeyx@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123629
Approved by: https://github.com/jgong5, https://github.com/janeyx99
This PR adds support for `clip_grad_norm_(foreach=True)` by implementing `aten._foreach_norm.Scalar` and `aten._foreach_mul_.Tensor`. `foreach=True` is required to get competitive performance with `DTensor`.
`foreach=True` reduces CPU overhead for Llama-7B from 388 ms to 63 ms. Existing flat-parameter FSDP's `clip_grad_norm_` takes 3 ms on CPU 😢 .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120910
Approved by: https://github.com/wanchaol, https://github.com/janeyx99
ghstack dependencies: #120238
We want to add xpu support for foreach kernels, so we add the "xpu" devices to the support list.
Besides, for fused kernels in Adam and AdamW, the devices check is enabled by the support list in adam.py (lines 44-46) and adamw.py (lines 60-64), so we remove the repetitive check for cuda devices as it will block the other devices in the support list.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106021
Approved by: https://github.com/janeyx99
This is a reland of https://github.com/pytorch/pytorch/pull/100007 with a build fix for Windows debug builds.
`at::native::ParamsHash` only works on structs with standard layout, but `std::string` isn't one in Visual C++ debug builds, which one can easily verified by running something like:
```cpp
#define _DEBUG
#include <type_traits>
#include <string>
static_assert(std::is_standard_layout_v<std::string>, "Oh noes");
```
If above conditon is not met, instead of printing a static_assert output, VC++ raises a very cryptic compilation errors, see https://github.com/pytorch/pytorch/pull/100007#discussion_r1227116292 for more detail.
Also, using `std::hash` for string should result in a faster hash function.
(cherry picked from commit 74b7a6c75e698378882d30958908073407f97fb3)
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 5914771</samp>
This pull request introduces a new function `_group_tensors_by_device_and_dtype` that can group tensors by their device and dtype, and updates the `foreach` utilities and several optimizers to use this function. The goal is to improve the performance, readability, and compatibility of the code that handles tensors with different properties. The pull request also adds a test case and type annotations for the new function, and some error checks for the `fused` argument in Adam and AdamW.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103912
Approved by: https://github.com/janeyx99