Update Eigen pin to 5.0.0 . Tons of new features and perf improvements. Most importantly updates minimum from C++03 to C++14 giving a ton of performance optimizations like properly implemented move operators, simplified code, etc. Also improved vectorization particularily on ARM. We really only use this library as a fallback for sparse operators, but still useful to update it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165840
Approved by: https://github.com/albanD
Update the torch-xpu-ops commit to [intel/torch-xpu-ops@ce9db1](ce9db15136), includes:
- Fix test_barrier hang by using static global rank in ProcessGroupXCCL
- Update install_xpu_headers only when content should change to speedup recompilation
- Add global rank information to communication logging
- Remove duplicate normalization from FFT methods
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165321
Approved by: https://github.com/EikanWang
Update the torch-xpu-ops commit to [intel/torch-xpu-ops@229e8b](229e8ba104), includes:
- Revert tracking of Work status for FlightRecorder in ProcessGroupXCCL to fix memory leak
- Enable SYCL warnings on Linux
- Fix accuracy issues with CTC loss
- Enable aten::nonzero_static on XPU backend
- Stop recursive calculations in polynomial kernels if tensor has NaNs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163758
Approved by: https://github.com/EikanWang
Update the torch-xpu-ops commit to 24fab67b6e, includes:
- Clean up getDeviceIndexOfCurrentQueue
- Fix hardswish gradients corner case
- Fix xccl contiguous check
- Move checks from nonzero kernel to operator
- support high priority stream for xccl
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163244
Approved by: https://github.com/EikanWang
Update the torch-xpu-ops commit to [intel/torch-xpu-ops@d8c3ee](d8c3eefc29), includes:
- Optimize adaptive average pool for channel-last memory format
- Add unregister wait_tensor
- Replace deprecated `[[intel::reqd_sub_group_size(SgSize)]]` with `[[sycl::reqd_sub_group_size(SIMD)]]` and remove unnecessary attributes
- Revert "Roll back to original usage of sycl::get_kernel_bundle"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162804
Approved by: https://github.com/EikanWang
Fixes a few bugs introduced to CUDNN 1.11 which affects all our CUDA13 builds. Also adds support for new CUDNN features whenever we choose to update. @eqy pretty sure this addresses the concern you had over the previous upgrade since that bugfix is now merged. This is a simple header only update.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162347
Approved by: https://github.com/eqy, https://github.com/atalman
## Summary
- We just landed 2d-2d support for mxfp8 grouped gemm in FBGEMM: https://github.com/pytorch/FBGEMM/pull/4816
- This is needed for backward pass of mxfp8 MoE training with grouped gemms
- Changes:
- Add dispatching + input validation for mxfp8 grouped gemm in `torch._scaled_grouped_mm`
- Add meta registration input validation for mxfp8 grouped gemm, for composability with compile
- Add unit tests exercising torch._scaled_grouped_mm with mxfp8 inputs
- Bump FBGEMM third party submodule to include:
- https://github.com/pytorch/FBGEMM/pull/4816
- https://github.com/pytorch/FBGEMM/pull/4820
- https://github.com/pytorch/FBGEMM/pull/4821
- https://github.com/pytorch/FBGEMM/pull/4823
#### How fbgemm dependency was bumped
Documenting this since I haven't found it documented elsewhere:
- `cd ~/pytorch/third_party/fbgemm`
- `git fetch`
- `git checkout <hash>`
- `cd ~/pytorch`
- `git add third_party/fbgemm`
## Test plan
#### Test build
```
USE_FBGEMM_GENAI=1 python -m pip install --no-build-isolation -v -e .
...
Successfully installed torch-2.9.0a0+gitf5070f3
```
[full build log](https://www.internalfb.com/phabricator/paste/view/P1933787581)
#### Unit tests
```
pytest test/test_matmul_cuda.py -k test_mxfp8_scaled_grouped_mm_
...
test/test_matmul_cuda.py ......... [100%]
============================================================== 9 passed, 1668 deselected in 5.34s ===============================================================
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162209
Approved by: https://github.com/ngimel
Update the torch-xpu-ops commit to [intel/torch-xpu-ops@83c5a5](83c5a5a551), includes:
- Revert "Disable xccl timer avoid drlm hang" because XPU time event issue has been fixed
- Fallback lu_factor kernel to CPU for single batch
- Enable aten::linalg_inv and aten::linalg_inv_ex on XPU
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162062
Approved by: https://github.com/EikanWang
Update the torch-xpu-ops commit to [8b58040ee32689487f660462f655085f31506dab](8b58040ee3), includes:
- Add vectorization path on maxpool forward channel last
- Add FlightRecorder support for ProcessGroupXCCL
- Fix random build failure on codegen
- Suppress dllexport warning on Windows
- Make torch-xpu-ops build depend on ATen XPU
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161152
Approved by: https://github.com/EikanWang
Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
Update the torch-xpu-ops commit to [77cc792cd265179745d335579d233e6d4f9a2667](77cc792cd2), includes:
- Ensures that the XPU cache is cleared before creating tensors during the test
- Add unused variable warning
- Fix test_linalg and test_torch issue with bf32_on_and_off updates
- Fix deterministic indexing with broadcast
- Fix dist.gather with noncontiguous tensor
- Improve accuracy of index put deterministic kernel
- Add generate file rely avoid build before generate
- optimize embedding bag
Fixes#160661
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160062
Approved by: https://github.com/EikanWang
Merge the recent commits of FBGEMM and remove unnecessary CMake code.
Specifically, we
1. enable `fbgemm_autovec` since the target is now correctly handled.
2. remove option `USE_FAKELOWP` which is not used.
3. remove `CAFFE2_COMPILER_SUPPORTS_AVX512_EXTENSIONS` check.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158210
Approved by: https://github.com/q10
No functional changes, just:
- Update C++ standard to C++17
- Update `cmake` min version to 3.18
- Update `libuv` dependency to 1.51 (to move its cmake min version to 3.10)
- Replace boost optional implementation with `std::optional` wrapper
- Make it compilable with gcc-14.x plus by including `cstddef` in few headers
- Avoid using deprecated enums for MacOS builds
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159834
Approved by: https://github.com/Skylion007
Update the torch-xpu-ops commit to [intel/torch-xpu-ops@1f7a57](1f7a57f507) includes:
- Add Template Parameter to the function `gpu_kernel` for Controlling Broadcasting Vectorization
- Add optional NaN checks to XCCL
- Fix NllLossForwardReduce2DKernelFunctor accuracy
- Extend the existing communication logging to include the reduction operation for collective calls
- [Reland] Install xpu codegen header to torch/include
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159621
Approved by: https://github.com/EikanWang
This PR consists of all the changes required to enable PyTorch ROCm CI on MI355X nodes.
- Rework aotriton cmake configuration to rely on `HIP_VERSION` instead of `ROCM_VERSION` as aotriton depnds on hip. Hip loosely track the rocm major version, but the two are not actually synchronized as observed in the ROCm 7 alpha build.
- Bump composable-kernel submodule to [df6023e305f389bbf7249b0c4414e649f3ad6598](df6023e305) for mi350 compatibility.
- Extend the change docker permissions step to the MI355x runners as well. This step is included to apply the required permission change to the test folder for a successful upload of artifacts in k8s docker.
- Create new rocm-mi355 workflow to trigger core PyTorch tests on a nightly basis at 2:30 am PST.
- Successfully tested running the test suites listed in rocm-mi355.yml on MI355 runners by temporarily hacking rocm-mi300.yml: ca7d5fae11 (rocm-mi300)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158889
Approved by: https://github.com/jeffdaily
Summary:
**Changes**
* Deleted function import from build definition utilities
* Removed `load("//tools/build_defs:fbsource_utils.bzl", "is_arvr_mode")`
* Replaced is_arvr_mode() function calls with direct references to configuration flags
* Changed from `is_arvr_mode()` to `"ovr_config//build_mode:arvr_mode"`
* Changed conditional expressions to Buck `select()` statements
Test Plan:
Check if CI passes
Rollback Plan:
Differential Revision: D78520947
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158682
Approved by: https://github.com/malfet