Commit Graph

7752 Commits

Author SHA1 Message Date
f11d7a5978 [ROCm] Update spack includes (#152569)
* Cleans up code in `caffe2/CMakeLists.txt` to remove individual ROCm library include paths and use `ROCM_INCLUDE_DIRS` CMake var instead
* `ROCM_INCLUDE_DIRS` CMake var is set in `cmake/public/LoadHIP.cmake` by adding all the ROCm packages that PyTorch depends on
* `rocm_version.h` is provided by the `rocm-core` package, so use the include directory for that component to be compliant with Spack
* Move `find_package_and_print_version(hip REQUIRED CONFIG)` earlier so that `hip_version.h` can be located in the hip package include dir for Spack
* `list(REMOVE_DUPLICATES ROCM_INCLUDE_DIRS)` to remove duplicate `/opt/rocm/include` entries in the non-Spack case
* Remove user-provided env var `ROCM_INCLUDE_DIRS` since `ROCM_PATH` already exists as a user-provided env var, which should be sufficient to locate the include directories for ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152569
Approved by: https://github.com/renjithravindrankannath, https://github.com/jeffdaily

Co-authored-by: Renjith Ravindran <Renjith.RavindranKannath@amd.com>
2025-05-09 21:36:38 +00:00
5bf0c3518c Detect NVSHMEM location (#153010)
### Changes
- Detect the NVSHMEM install location via `sysconfig.get_path("purelib")`, which typically resolves to `<conda_env>/lib/python/site-packages`; the NVSHMEM include and lib directories live under `nvidia/nvshmem` inside it (see the sketch after this list)
- Added link dir via `target_link_directories`
- Removed direct dependency on mlx5
- Added preload rule (following other NVIDIA libs)
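
A minimal sketch of the detection step described above, run outside the build system; the `nvidia/nvshmem` layout with `include/` and `lib/` subdirectories is an assumption about the wheel layout:

```python
import os
import sysconfig

# site-packages of the active environment, e.g. <conda_env>/lib/python3.x/site-packages
purelib = sysconfig.get_path("purelib")

# Assumed wheel layout: nvidia/nvshmem/{include,lib} under site-packages
nvshmem_root = os.path.join(purelib, "nvidia", "nvshmem")
nvshmem_include = os.path.join(nvshmem_root, "include")
nvshmem_lib = os.path.join(nvshmem_root, "lib")

found = os.path.isdir(nvshmem_include) and os.path.isdir(nvshmem_lib)
print(f"NVSHMEM {'found' if found else 'not found'} at {nvshmem_root}")
```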

### Plan of Record
1. End user experience: link against NVSHMEM dynamically (the NVSHMEM lib is 100 MB, similar to NCCL, thus we'd rather users `pip install nvshmem` than have torch carry the bits)
2. Developer experience: at compile time, prefer the wheel dependency over a Git submodule
General rule: use a submodule for a small lib that torch can statically link with
If users pip install a lib, our CI build process should do the same, rather than building from a Git submodule (just for its headers, for example)
3. Keep `USE_NVSHMEM` to gate non-Linux platforms, like Windows and Mac
4. At configuration time, we should be able to detect whether NVSHMEM is available; if not, we don't build `NVSHMEMSymmetricMemory` at all.

For now, we have a symbol dependency on two particular libs from NVSHMEM:
- libnvshmem_host.so: contains host side APIs;
- libnvshmem_device.a: contains device-side global variables AND device function impls.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153010
Approved by: https://github.com/ngimel, https://github.com/fduwjj, https://github.com/Skylion007
2025-05-07 23:35:04 +00:00
1d7728056b [nativert] Move TensorMeta to pytorch core (#152475)
Summary:
Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72

This diff moves `TensorMeta.cpp` and `TensorMeta.h` to PyTorch core under `torch/nativert/graph/`

Existing `torch::_export::TensorMeta` in `torch/csrc/utils/generated_serialization_types.h` is auto-generated from the export serde schema and therefore only contains the most basic serializable types. We need the newly added `TensorMeta.cpp` to deserialize the metadata into an in-memory class with c10 types so that it can be consumed by the runtime later.

Test Plan:

Added test under `test/cpp/nativert/test_tensor_meta.cpp`

Differential Revision: D73820548

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152475
Approved by: https://github.com/albanD
2025-05-06 01:50:46 +00:00
cyy
45efa1aaa8 [3/N] Use internal linkage in C++ files (#151297)
Follows #151070.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151297
Approved by: https://github.com/Skylion007
2025-05-05 17:48:39 +00:00
a7f1ddc184 [SymmMem] Experimental NVSHMEM integration (#151261)
Adding NVSHMEM as a backend for `SymmetricMemory`, the implementation of which is in `NVSHMEMSymmetricMemory.cu`.

Moving some helper functions in `CUDASymmetricMemory.cu` to `CUDASymmetricMemoryUtils.cpp`, so that they can be shared by `NVSHMEMSymmetricMemory`. These functions are mostly side-band exchange helpers (`store_all_gather`, `IpcChannel`, etc).

Adding `TORCH_SYMMEM` to control which implementation to use for CUDA tensors; currently supported values are `CUDA` (the in-house impl) and `NVSHMEM`.

The NVSHMEM feature is gated by the build-time flag `USE_NVSHMEM=1`, and setting `NVSHMEM_HOME` is required (TODO).
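
A minimal sketch of the runtime selection, using only the two knobs named above (whether the variable must be set before importing torch is an assumption here):

```python
import os

# Runtime selection: TORCH_SYMMEM picks the SymmetricMemory implementation
# used for CUDA tensors ("CUDA" = the in-house impl, "NVSHMEM" = this backend).
os.environ["TORCH_SYMMEM"] = "NVSHMEM"

# Build-time gating (from the description above): the NVSHMEM backend is only
# compiled in when PyTorch is built with USE_NVSHMEM=1 and NVSHMEM_HOME set.
```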

Ported most code from #146593.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151261
Approved by: https://github.com/fegin, https://github.com/fduwjj
2025-05-01 05:24:50 +00:00
0e015ef116 [ROCm][Windows] Fix HIP Caffe2 Tests (#152014)
Solves the following problems when building caffe2 HIP tests on Windows:
1. HIP tests are now built with `hip_add_executable`, which uses a custom command to invoke the HIP compiler, due to the lack of CMake support for HIP in 3.18 (the version currently used).
2. Fixes the "Command line too long" failure, which resulted from `hip_add_executable` adding the same flags to `HIP_HIPCC_FLAGS` over and over with every test added.
3. Disables the `HasSameArgTypes` test on Windows, as `at::native::modern::detail` is nowhere to be found in the codebase (I think it must be a legacy thing). Perhaps the whole test should be removed or rewritten?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152014
Approved by: https://github.com/jeffdaily
2025-04-26 01:35:46 +00:00
a6182903cd Update PyTorchStreamReader API to take cpu allocator override (#150439)
Summary: Add an allocator parameter to `getRecord`

Test Plan:
newly added UT
```
buck test caffe2/caffe2/serialize:inline_container_test
```

Differential Revision: D72252585

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150439
Approved by: https://github.com/albanD
2025-04-18 01:53:14 +00:00
6fcffd8cd1 Optimize SVE embedding performance (#150176)
Changes the loop-unrolling strategy. Previously, the script only unrolled the inner loop over block_size when the block size was a multiple of the vector length. This version instead unrolls the outer loop, which reduces the number of loads/stores for accumulation into the output array and improves performance for cases where the block size is not a multiple of the vector length.

Benchmarking script:
```python
# SPDX-FileCopyrightText: Copyright 2025 Arm Limited and/or its affiliate <open-source-office@arm.com>
# SPDX-License-Identifier: BSD-3-Clause
import torch
import torch.nn as nn
import numpy as np
import time
import sys

np.random.seed(0)
torch.manual_seed(0)

num_embeddings = 400000
embedding_dim = int(sys.argv[1])
multi_hot = 100
batch_size = 400
nrun = 1000

class SimpleEmbeddingBagModel(nn.Module):
    def __init__(self, num_embeddings, embedding_dim):
        super(SimpleEmbeddingBagModel, self).__init__()

        weights = torch.from_numpy((np.random.random_sample((num_embeddings, embedding_dim)) + 1).astype(np.float32)).to(torch.float16)

        # Defining the EmbeddingBag layer
        self.embedding_bag = torch.nn.EmbeddingBag(num_embeddings, embedding_dim, _weight=weights,
                                                   mode='sum', include_last_offset=True, dtype=torch.float32)

    def forward(self, input, offsets):
        # Forward pass through the EmbeddingBag layer
        result32 = self.embedding_bag(input, offsets, per_sample_weights=None)
        return result32

# Instantiate the model
model = SimpleEmbeddingBagModel(num_embeddings=num_embeddings, embedding_dim=embedding_dim)
model.eval()

# Example input
input_tensor = torch.randint(0, num_embeddings, (batch_size * multi_hot,), dtype=torch.long)

offsets = torch.tensor(range(0, batch_size * multi_hot + 1, multi_hot))

with torch.no_grad():
    # warm up
    output32 = model(input_tensor, offsets)

    ti = time.time_ns()
    for i in range(nrun):
        _ = model(input_tensor, offsets)
    tf = time.time_ns()
    print("{:3d} {:.3E}".format(embedding_dim, (tf-ti)/nrun/1.e6))
```
Speedup on NEOVERSEV1 with 1 thread
![embedding](https://github.com/user-attachments/assets/16e567ed-b9a5-4db3-90b8-dec66d5414a7)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150176
Approved by: https://github.com/digantdesai, https://github.com/malfet
2025-04-07 18:01:54 +00:00
440c07e56a Fix detection of GPU multicast (#150563)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150563
Approved by: https://github.com/kwen2501
2025-04-03 15:31:15 +00:00
5a7588f183 [Build] Remove pre-CXX11 ABI logic from build script (#149888)
Only keep the one check in `check_binary_symbols` to make sure there are no pre-CXX11 ABI symbols in the library.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149888
Approved by: https://github.com/atalman, https://github.com/seemethere
ghstack dependencies: #149887
2025-03-25 03:17:16 +00:00
536c0c7a47 [codemod][lowrisk] Remove unused exception parameter from caffe2/aten/src/ATen/cuda/CUDABlas.cpp (#149328)
Summary:
`-Wunused-exception-parameter` has identified an unused exception parameter. This diff removes it.

This:
```
try {
    ...
} catch (exception& e) {
    // no use of e
}
```
should instead be written as
```
} catch (exception&) {
```

If the code compiles, this is safe to land.

Test Plan: Sandcastle

Reviewed By: dtolnay

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149328
Approved by: https://github.com/Skylion007, https://github.com/eqy
2025-03-19 02:05:33 +00:00
60523540f1 Force build to conform C++ standard on windows by adding /permissive- flag (#149035)
Fixes #147366

1. Add `/permissive-` to the `torch_compile_options` for the build to conform to the C++ standard.
2. Fix the error when trying to assign a string literal to a non-const ptr.

The `/permissive-` flag can be found at https://learn.microsoft.com/en-us/cpp/build/reference/permissive-standards-conformance?view=msvc-170

From the above [doc](https://learn.microsoft.com/en-us/cpp/build/reference/permissive-standards-conformance?view=msvc-170#remarks),
>  By default, the /permissive- option is set in new projects created by Visual Studio 2017 version 15.5 and later versions.
> The /permissive- option is implicitly set by the /std:c++latest option starting in Visual Studio 2019 version 16.8, and in version 16.11 by the /std:c++20 option.

Thus, it is reasonable to add this flag to the existing project.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149035
Approved by: https://github.com/guangyey, https://github.com/malfet
2025-03-18 01:51:46 +00:00
9ad6265d04 [AOTI][XPU] Fix: model_container_runner_xpu.cpp is not built into libtorch_xpu.so (#149175)
The omission of model_container_runner_xpu.cpp causes a compilation failure when users build a C++ inference application on XPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149175
Approved by: https://github.com/jansel
2025-03-15 00:30:04 +00:00
d1f21d8ec3 Enable Direct Use of Arm Compute Library (ACL) in ATen (#148584)
ACL is already built with PyTorch as a shared library when USE_MKLDNN_ACL is set.
Currently, it is only used indirectly in ATen via oneDNN for AArch64 targets. However, there are cases where it makes sense to utilize ACL directly without oneDNN as an intermediary, e.g. quantization. See #145942, #147337, #146620.
This patch enables such use cases by exposing ACL to ATen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148584
Approved by: https://github.com/malfet
2025-03-10 18:29:51 +00:00
be0ceee1c3 Make record/storage alignment in torch.save configurable (#147788)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147788
Approved by: https://github.com/albanD
ghstack dependencies: #147786, #147787
2025-03-06 12:04:46 +00:00
cyy
1433bc1455 Remove CAFFE2_USE_EXCEPTION_PTR (#147247)
The check is for older compilers and is now always true.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147247
Approved by: https://github.com/janeyx99
2025-03-06 02:56:23 +00:00
3ecfe6be25 [Submodule] Turning flash-attention integration into 3rd party submod (#144120) (#146372)
Summary:

# Summary

### Sticky points

CUDA-graph RNG handling has changed/deviated from the original implementation. We are left with a dangling 'offset' value and confusing naming due to BC.

## Dependencies
- Flash PR: https://github.com/Dao-AILab/flash-attention/pull/1419

### Other Points
- The BC linter is complaining about losing generate.py and its functions, which is not a real BC surface
cc albanD

imported-using-ghimport

Test Plan:
Imported from OSS

Building in dev
`buck build @//mode/dev-nosan -c fbcode.nvcc_arch=h100a  //caffe2:ATen-cu --show-full-output    `

Running `nm` on the .so, I do see that the flash symbols are correctly named:
```
0000000001c3dfb0 t pytorch_flash::run_mha_bwd(pytorch_flash::Flash_bwd_params&, CUstream_st*)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#7}::operator()() const
0000000001c36080 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#2}::operator()() const::{lambda()#1}::operator()() const::{lambda()#6}::operator()() const
0000000001c360e0 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#2}::operator()() const::{lambda()#1}::operator()() const::{lambda()#7}::operator()() const
0000000001c35fc0 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#6}::operator()() const
0000000001c36020 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#7}::operator()() const
```

Reviewed By: vkuzo

Differential Revision: D68502879

Pulled By: drisspg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146372
Approved by: https://github.com/jbschlosser
2025-02-26 00:10:59 +00:00
6ff3383157 Enable CUPTI on Windows (#141454)
Fixes:
- https://github.com/pytorch/pytorch/issues/93855

The PR enables CUPTI on Windows and enables unit tests to check CUDA profiling events.
Additionally, the changes can be verified using the following script:

```python
import torch
from torch.profiler import profile, ProfilerActivity

def check_cupti_enabled():
    # Check if CUDA is available
    if not torch.cuda.is_available():
        print("CUDA is not available on this system.")
        return False

    # Create a simple CUDA tensor
    x = torch.randn(1000, 1000, device="cuda")
    y = torch.randn(1000, 1000, device="cuda")

    try:
        # Use PyTorch profiler to perform a basic check
        with profile(activities=[ProfilerActivity.CUDA]) as prof:
            z = x @ y  # Simple CUDA operation

        # Print profiling results
        print("CUPTI is enabled and profiling works.")
        print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
        return True
    except RuntimeError as e:
        # If profiling fails, CUPTI is likely not set up correctly
        print("Error: CUPTI might not be enabled or accessible.")
        print(f"Details: {e}")
        return False

if __name__ == "__main__":
    if check_cupti_enabled():
        print("CUPTI is properly configured in PyTorch.")
    else:
        print("CUPTI is not configured correctly. Check your CUDA installation.")
```

Sample output:
```
CUPTI is enabled and profiling works.
---------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
---------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
     sgemm_128x128x8_NN_vec         0.00%       0.000us         0.00%       0.000us       0.000us       2.086ms       100.00%       2.086ms       2.086ms             1
                   cudaFree         9.67%       9.816ms         9.67%       9.816ms       9.816ms       0.000us         0.00%       0.000us       0.000us             1
     cudaDeviceGetAttribute         0.01%      10.000us         0.01%      10.000us       0.476us       0.000us         0.00%       0.000us       0.000us            21
    cudaGetDriverEntryPoint         0.00%       1.700us         0.00%       1.700us       0.850us       0.000us         0.00%       0.000us       0.000us             2
       cudaGetSymbolAddress        85.15%      86.438ms        85.15%      86.438ms      86.438ms       0.000us         0.00%       0.000us       0.000us             1
                 cudaMalloc         0.43%     433.300us         0.43%     433.300us     144.433us       0.000us         0.00%       0.000us       0.000us             3
           cudaLaunchKernel         2.61%       2.648ms         2.61%       2.648ms       2.648ms       0.000us         0.00%       0.000us       0.000us             1
      cudaDeviceSynchronize         2.13%       2.163ms         2.13%       2.163ms       2.163ms       0.000us         0.00%       0.000us       0.000us             1
---------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 101.511ms
Self CUDA time total: 2.086ms

CUPTI is properly configured in PyTorch.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141454
Approved by: https://github.com/malfet
2025-02-06 15:58:20 +00:00
001e355a56 Add option to serialization config to reduce random reads from get_record_offset when loading with mmap=True (#143880)
## Background

This PR adds `torch.utils.serialization.config.load.calculate_storage_offsets`. This option relies on the previous PR in this stack, where the storage order was changed to be non-lexicographical. A `.format_version` entry was added to the zipfile, and `calculate_storage_offsets` will only work on checkpoints with `.format_version`.

When this is turned on, for `torch.load(mmap=True)`, the offset of each storage record (other than the 0th storage) will be calculated instead of relying on `miniz` APIs to determine it.

The existing APIs issue multiple random reads (reading the end-of-central-directory record, then reading the zipfile header for the record) to determine the storage offset where the record starts. This can greatly degrade `torch.load(mmap=True)` performance for non-filesystem cases.

6aaae9d78f/caffe2/serialize/inline_container.cc (L589-L605)
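
A minimal usage sketch, assuming the config module is importable under the dotted path named above and that `checkpoint.pt` is a new-format checkpoint (the path is hypothetical):

```python
import torch
import torch.utils.serialization.config as serialization_config  # path taken from the option name above

# Compute storage offsets instead of issuing per-record random reads.
serialization_config.load.calculate_storage_offsets = True

# mmap-load a (hypothetical) checkpoint; only new-format checkpoints
# carrying a .format_version entry take the new code path.
state_dict = torch.load("checkpoint.pt", mmap=True)
```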

## How does this work

The format for the checkpoint is as such

```
archive_name/
|_ data.pkl
|_.format_version
|_byteorder
|_data/
  |_ 0
  |_ 1
  |_ 2
  |_ ...
|_
```

Each `data/i` record represents a storage, where storages are written in the order that the Pickler encounters them.

For each storage, our `persistent_load` logic saves the following metadata to the pickle file: `dtype, numel, key, location`, where `numel` is the number of bytes in the storage.

Note that we always use the `miniz` writer in zip64 mode, per [here](7796e308d0/caffe2/serialize/inline_container.cc (L701)). A zipfile record written by miniz looks as follows (a sketch of the resulting offset arithmetic follows the list below):

```
 ---------------- ----------------- ------------------- ---------------- --------- ------------------------------
| 30 byte header | n byte filename | zip64_extra_data | m byte padding | storage | 16 or 24 byte local dir footer  |
 ---------------- ----------------- ------------------- ---------------- --------- ------------------------------
```

- The header size (30) is given by [`MZ_ZIP_LOCAL_DIR_HEADER_SIZE`](https://github.com/pytorch/pytorch/blob/main/third_party/miniz-3.0.2/miniz.c?fbclid=IwZXh0bgNhZW0CMTEAAR2O8Vysd--UoSCxW70gabXIS1dbz733oHwuUQ5_Ff1hY2WU6PL2i6CSH4A_aem_J9oaU2HpDeWtJKOU9EnVqw#L3290)
- filename will be `"{archive_name}/{filepath}"`

- `zip64_extra_data` is determined by [`mz_zip_writer_create_zip64_extra_data`](7796e308d0/third_party/miniz-3.0.2/miniz.c (L6202)). Note that [we only create zip64_extra_data if storage_size >= 0xFFFFFFFF or the offset of the start of the header >= 0xFFFFFFFF](7796e308d0/third_party/miniz-3.0.2/miniz.c (L6519-L6524))
- `m` is determined by [`getPadding`](7796e308d0/caffe2/serialize/inline_container.cc (L254)), which accounts for the filename and zip64_extra_data to determine `m` such that the start of `storage` is aligned to 64 bytes. The `m` bytes will always start with `F B padding_size` as the first 4 bytes
- The local dir footer size is determined based on [this snippet ](7796e308d0/third_party/miniz-3.0.2/miniz.c (L6610-L6632)): if the buffer size is 0 it is skipped. If the zip64_extra_data was created, it is 24, otherwise it is 16.
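
A hedged sketch of the resulting offset arithmetic (not the actual `getPadding`/`get_record_offset_no_read` implementation; it ignores the exact contents of the padding bytes):

```python
# Assumed constants from the layout described above.
MZ_ZIP_LOCAL_DIR_HEADER_SIZE = 30
ALIGNMENT = 64

def next_offsets(header_start, filename, storage_nbytes, zip64_extra_nbytes):
    """Given where a record's local header starts, return
    (storage_start, next_header_start) per the layout above."""
    # Header, filename, and zip64 extra data precede the padding.
    prefix = header_start + MZ_ZIP_LOCAL_DIR_HEADER_SIZE + len(filename) + zip64_extra_nbytes
    # m bytes of padding so that the storage starts at a 64-byte boundary
    # (the real getPadding also writes an "F B padding_size" marker).
    padding = (-prefix) % ALIGNMENT
    storage_start = prefix + padding
    # Local dir footer: skipped for empty storages, 24 bytes when zip64 extra
    # data was created, 16 bytes otherwise.
    if storage_nbytes == 0:
        footer = 0
    else:
        footer = 24 if zip64_extra_nbytes else 16
    next_header_start = storage_start + storage_nbytes + footer
    return storage_start, next_header_start

# Example: a record named "archive_name/data/1" whose header starts at byte 4096.
storage_start, cursor = next_offsets(4096, "archive_name/data/1", 1024, 0)
```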

When `torch.utils.serialization.config.load.calculate_storage_offsets` is set we do the following
- We keep track of where the "cursor" is in the file using `current_offset`; after each `persistent_load` call, it will be at the offset where the header for the next record starts
- for the 0th storage, "data/0", we use the regular get_record_offset to determine the start of the storage
- for any other storage, (where the storages will be in order encountered by the unpickler, 0, 1, 2, 3, ...) we use `get_record_offset_no_read`, which re-uses the `getPadding` logic to determine the offset of the storage
- Note that `load_tensor` will only ever be called again with the same key if the storage's `._data_ptr()` is 0 [[pointer1](https://github.com/pytorch/pytorch/blob/main/torch/serialization.py#L1917-L1918)][[pointer2](https://github.com/pytorch/pytorch/blob/main/torch/serialization.py#L1936-L1937)], so we cache the offsets for this edge case
- After each storage, if the storage is non-zero, we account for the local dir footer based on the logic described above

## Testing strategy

The agreed upon testing strategy was as follows:
- Add debug code gated by an environment flag `TORCH_SERIALIZATION_DEBUG` that will run this offset calculation logic and verify it against getRecordOffset for each storage (when mmap=False)
- This flag is set throughout CI, which means that every time `torch.load` is called, the offset calculation logic is implicitly being tested.

Differential Revision: [D67673026](https://our.internmc.facebook.com/intern/diff/D67673026)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143880
Approved by: https://github.com/albanD
ghstack dependencies: #143879
2025-01-31 17:09:20 +00:00
cyy
116af809eb Use std::string_view (#145906)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145906
Approved by: https://github.com/albanD
2025-01-30 03:14:27 +00:00
9010649292 Revert "Add option to serialization config to reduce random reads from get_record_offset when loading with mmap=True (#143880)"
This reverts commit db3685a35cdce32622ab89f6c92e09d52210ff53.

Reverted https://github.com/pytorch/pytorch/pull/143880 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but either this PR or the base PR breaks distributed tests ([comment](https://github.com/pytorch/pytorch/pull/143880#issuecomment-2617743403))
2025-01-28 03:07:17 +00:00
db3685a35c Add option to serialization config to reduce random reads from get_record_offset when loading with mmap=True (#143880)
## Background

This PR adds `torch.utils.serialization.config.load.calculate_storage_offsets`. This option relies on the previous PR in this stack, where the storage order was changed to be non-lexicographical. A `.format_version` entry was added to the zipfile, and `calculate_storage_offsets` will only work on checkpoints with `.format_version`.

When this is turned on, for `torch.load(mmap=True)`, the offset of each storage record (other than the 0th storage) will be calculated instead of relying on `miniz` APIs to determine it.

The existing APIs issue multiple random reads (reading the end-of-central-directory record, then reading the zipfile header for the record) to determine the storage offset where the record starts. This can greatly degrade `torch.load(mmap=True)` performance for non-filesystem cases.

6aaae9d78f/caffe2/serialize/inline_container.cc (L589-L605)

## Testing strategy

The agreed upon testing strategy was as follows:
- Add debug code gated by an environment flag `TORCH_SERIALIZATION_DEBUG` that will run this offset calculation logic and verify it against getRecordOffset for each storage (when mmap=False)
- This flag is set throughout CI, which means that every time `torch.load` is called, the offset calculation logic is implicitly being tested.

Differential Revision: [D67673026](https://our.internmc.facebook.com/intern/diff/D67673026)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143880
Approved by: https://github.com/albanD
ghstack dependencies: #143879
2025-01-27 23:57:30 +00:00
ed015143ef Set RUNPATH on CUDA and XPU tests (#144305)
#136627 almost fixed the issue of test binaries' RUNPATH not being set correctly, with a few cases left.

This PR fixes the rest.

The affected binaries were found by running `auditwheel repair` on a wheel built with `BUILD_TEST=1`.

@malfet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144305
Approved by: https://github.com/malfet
2025-01-26 08:40:22 +00:00
66bf7da446 Enable sleef for Win Arm64 (#144876)
The Sleef module was disabled for Windows Arm64 in b021486405.
This PR enables it again since the original issue is no longer valid.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144876
Approved by: https://github.com/albanD, https://github.com/malfet

Co-authored-by: Ozan Aydin <148207261+ozanMSFT@users.noreply.github.com>
2025-01-23 19:22:58 +00:00
6c713ccb5e Revert "Make functionalization ViewMeta serializable with pickle. (#143712)"
This reverts commit b8abdaa286fd161af48af57a675827f4f849914d.

Reverted https://github.com/pytorch/pytorch/pull/143712 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/143712#issuecomment-2597205261))
2025-01-17 00:52:50 +00:00
b8abdaa286 Make functionalization ViewMeta serializable with pickle. (#143712)
Fix: #141974

This PR makes the `ViewMeta` sequence, present in functional tensors,
serializable with pickle. In order to accomplish that, it makes
`ViewMeta` an abstract class with overridable `forward` and `reverse`
functions. In this context, each operation that once instantiated
`ViewMeta` should now create a new specialized class that inherits from
`ViewMeta`. Therefore, this PR also uses codegen for creating these
specializations.

In summary, these are the changes this PR introduces:

- `ViewMeta` is turned into an abstract class (see
  _FunctionalStorageImpl.cpp_). `forward` and `reverse` are pure virtual
  functions that need to be implemented. `to_out_index` should be
  implemented by operations that might return more than 1 output.

- New `ViewMeta` specializations for `resize_` and `_unsafe_view` are
  created (see _FunctionalizeFallbackKernel.h_).

- New templates _ViewMetaClasses.{cpp,h}_ are created. They hold the
  declaration and definition of the `ViewMeta` specializations, which
  are automatically generated in the ATen codegen (see _gen.py_).

- New `_functionalization` Python sub-module is created (see
  _Module.cpp_). It serves as namespace for the `ViewMeta`
  specializations and `InverseReturnMode` enum.

- New template _ViewMetaClassesPythonBinding.cpp_ is created. It holds
  the automatically generated Python bindings for the `ViewMeta`
  specialization, which are generated in the torch codegen (see
  _generate_code.py_).

Note that this PR makes use of codegen at 2 different moments:

- ATen codegen (_gen.py_): generates the `ViewMeta` specialized classes.
- Torch codegen (_generate_code.py_): generates the Python bindings for
  them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143712
Approved by: https://github.com/bdhirsh
2025-01-16 19:41:41 +00:00
c3b28491c8 [caffe2] Add AVX512 support for box_cox operator (#143627)
Summary:
Reuse the templatized implementation of the box_cox caffe2 operator.
* Duplicate .cc file of AVX2
* change intrinsics functions to use AVX512 instructions
* override templates
* extend the caller to use new methods
* guard AVX512 with a gflag to allow a smooth transition

Differential Revision: D67433457

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143627
Approved by: https://github.com/hl475
2025-01-07 09:54:39 +00:00
aa14fcd96c Revert "export AOTI_TORCH_EXPORT on Windows. (#140030)"
This reverts commit e141cb9c34e5e96ca47ea69b565bc4fd9c8f34c1.

Reverted https://github.com/pytorch/pytorch/pull/140030 on behalf of https://github.com/clee2000 due to still failing internally D67556174, see D67866123 for link to error ([comment](https://github.com/pytorch/pytorch/pull/140030#issuecomment-2573652459))
2025-01-06 18:15:52 +00:00
cyy
f9bf9057ef Fix ruff warnings in caffe2 and functorch (#144182)
In preparation for upgrading ruff config to py3.9.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144182
Approved by: https://github.com/malfet
2025-01-04 04:15:01 +00:00
0a94bb432e [ROCm] CK Flash Attention Backend (#143695)
Replace https://github.com/pytorch/pytorch/pull/138947 for re-import.

Replaces https://github.com/ROCm/pytorch/pull/1592

This PR contains the initial implementation of SDPA with the composable_kernel backend. The CK path can be forced by simply calling `torch.backends.cuda.preferred_rocm_fa_library("ck")`. Similarly, you can force the incumbent aotriton implementation by passing in "aotriton" or "default". As you'd expect, not setting this option will result in aotriton being used as the backend. In the case of CK, if pytorch deems flash attention usable, then it will use the CK path in all the same places aotriton would have been used. This PR makes no changes to the heuristics which select which attention scheme to use (i.e. flash attention vs memory efficient attention vs math, etc.). It only gets called when flash attention is both enabled (via `USE_FLASH_ATTENTION`) and is selected at runtime by the existing heuristics.

Files located in pytorch/aten/src/ATen/native/transformers/hip/flash_attn/ck/mha* have been pulled from https://github.com/Dao-AILab/flash-attention courtesy of the hard work of @tridao, who is a co-author.

NOTE: In order to use this backend, the user MUST set USE_CK_FLASH_ATTENTION=1 in their environment when they build PyTorch.
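
A minimal usage sketch, assuming a ROCm build with `USE_FLASH_ATTENTION=1` and `USE_CK_FLASH_ATTENTION=1`; the tensor shapes are illustrative:

```python
import torch
import torch.nn.functional as F

# Force the composable_kernel path described above; "aotriton" or "default"
# keeps the incumbent backend.
torch.backends.cuda.preferred_rocm_fa_library("ck")

# (batch, heads, seq_len, head_dim) half-precision inputs on the GPU
q = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Dispatches to the CK flash-attention kernels when the existing heuristics
# select flash attention.
out = F.scaled_dot_product_attention(q, k, v)
```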

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143695
Approved by: https://github.com/malfet

Co-authored-by: Andy Lugo <Andy.LugoReyes@amd.com>
Co-authored-by: Jithun Nair <jithun.nair@amd.com>
2025-01-03 22:01:36 +00:00
e141cb9c34 export AOTI_TORCH_EXPORT on Windows. (#140030)
Fixes #139954

reproduce UT:
```cmd
pytest test/inductor/test_torchinductor_codegen_dynamic_shapes.py -k test_device_assert_dynamic_shapes_cpu
```
Issue:
<img width="856" alt="image" src="https://github.com/user-attachments/assets/5fc501a9-54e5-45ac-9fb3-509ec11a7abe">

After fixing:
![Image](https://github.com/user-attachments/assets/883846fb-8e92-4b9c-9400-daab32382a3a)

Reland:
1. Declare export on Windows explicitly.
2. Support cpu, cuda and xpu devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140030
Approved by: https://github.com/jgong5, https://github.com/desertfire
2025-01-03 05:41:06 +00:00
b77406a9ec [BE][CI] bump ruff to 0.8.4 (#143753)
Changes:

1. Bump `ruff` from 0.7.4 to 0.8.4
2. Change `%`-formatted strings to f-strings
3. Change arguments with the `__`-prefix to positional-only arguments with the `/` separator in the function signature (both changes are illustrated below).
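
A small illustration of changes 2 and 3 (the snippets are illustrative, not taken from the diff):

```python
# 2. %-formatting rewritten as an f-string
tool, version = "ruff", "0.8.4"
old_msg = "bump %s to %s" % (tool, version)
new_msg = f"bump {tool} to {version}"

# 3. __-prefixed (positional-only by convention) parameters rewritten as
#    true positional-only parameters using the / separator
def clamp_old(__value, __low, __high):
    return max(__low, min(__value, __high))

def clamp_new(value, low, high, /):
    return max(low, min(value, high))

assert old_msg == new_msg
assert clamp_old(5, 0, 3) == clamp_new(5, 0, 3) == 3
```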

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143753
Approved by: https://github.com/Skylion007
2024-12-24 12:24:10 +00:00
e15442a9b2 Revert "export AOTI_TORCH_EXPORT on Windows. (#140030)"
This reverts commit 6733045a4aaef7a8d9fb1f9f8b80f4f5f4ef1f4f.

Reverted https://github.com/pytorch/pytorch/pull/140030 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but my first attempt to fix internal build does not fix all the cases, so let us try again ([comment](https://github.com/pytorch/pytorch/pull/140030#issuecomment-2558043056))
2024-12-21 08:06:19 +00:00
6733045a4a export AOTI_TORCH_EXPORT on Windows. (#140030)
Fixes #139954

reproduce UT:
```cmd
pytest test/inductor/test_torchinductor_codegen_dynamic_shapes.py -k test_device_assert_dynamic_shapes_cpu
```
Issue:
<img width="856" alt="image" src="https://github.com/user-attachments/assets/5fc501a9-54e5-45ac-9fb3-509ec11a7abe">

After fixing:
![Image](https://github.com/user-attachments/assets/883846fb-8e92-4b9c-9400-daab32382a3a)

Reland:
1. Declare export on Windows explicitly.
2. Support cpu, cuda and xpu devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140030
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-12-20 11:42:09 +00:00
2def1f6f74 [caffe2] Move vectorized templates into a separate file for box_cox operator (#143556)
Summary: No functional changes in this diff; the code is moved into a separate file to be reused by the AVX512 version in the follow-up diff.

Test Plan: buck build //caffe2/caffe2/perfkernels:perfkernels

Differential Revision: D67433115

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143556
Approved by: https://github.com/hl475
2024-12-19 22:02:23 +00:00
969b07b96f Revert "[ROCm] CK Flash Attention Backend (#138947)"
This reverts commit 500d02921bcf1619e268196866ddf099a4b94080.

Reverted https://github.com/pytorch/pytorch/pull/138947 on behalf of https://github.com/atalman due to Breaks default windows checkout ([comment](https://github.com/pytorch/pytorch/pull/138947#issuecomment-2548998359))
2024-12-17 16:46:57 +00:00
500d02921b [ROCm] CK Flash Attention Backend (#138947)
Replaces https://github.com/ROCm/pytorch/pull/1592

This PR contains the initial implementation of SDPA with the composable_kernel backend. The CK path can be forced by simply calling `torch.backends.cuda.preferred_rocm_fa_library("ck")`. Similarly, you can force the incumbent aotriton implementation by passing in "aotriton" or "default". As you'd expect, not setting this option will result in aotriton being used as the backend. In the case of CK, if pytorch deems flash attention usable, then it will use the CK path in all the same places aotriton would have been used. This PR makes no changes to the heuristics which select which attention scheme to use (i.e. flash attention vs memory efficient attention vs math, etc.). It only gets called when flash attention is both enabled (via `USE_FLASH_ATTENTION`) and is selected at runtime by the existing heuristics.

Files located in pytorch/aten/src/ATen/native/transformers/hip/flash_attn/ck/mha* have been pulled from https://github.com/Dao-AILab/flash-attention courtesy of the hard work of @tridao, who is a co-author.

NOTE: In order to use this backend, the user MUST set USE_CK_FLASH_ATTENTION=1 in their environment when they build PyTorch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138947
Approved by: https://github.com/pruthvistony, https://github.com/xw285cornell, https://github.com/leitian

Co-authored-by: Xiaodong Wang <xw285@cornell.edu>
2024-12-17 02:18:07 +00:00
cyy
201cb8834f Enable more C++ warnings (#143099)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143099
Approved by: https://github.com/albanD
2024-12-17 02:03:39 +00:00
cyy
2903cf0ad8 Re-enable some C++ warnings (#142332)
This enables some C++ warnings since the code base is fairly clean. Meanwhile, `-Wextra-semi` is disabled on CUDA-generated code since there is no way to fix those warnings without the cooperation of the CUDA team.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142332
Approved by: https://github.com/albanD, https://github.com/eqy
2024-12-12 04:02:12 +00:00
6680a83e89 [AOTI XPU] Support AOT Inductor for Intel GPU. (#140269)
This PR adds XPU support for AOT Inductor and reuses the corresponding UTs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140269
Approved by: https://github.com/desertfire, https://github.com/EikanWang
ghstack dependencies: #140268

Co-authored-by: Bin Bao <binbao@meta.com>
2024-12-10 05:05:08 +00:00
5d6acd5a31 Register Intel distributed Backend (XCCL) in PyTorch distributed package (#141856)
### Motivation:

As the design illustrated in the Intel distributed support RFC https://github.com/pytorch/pytorch/issues/141741 shows, two sections are needed to enable Intel distributed backend (`XCCL`) support in PyTorch.
1. Intel GPU distributed Backend integration in PyTorch `torch-xpu-ops`.
2. **Intel distributed Backend registration in the PyTorch distributed package**. This PR contributes the section 2 change.

### Example:
Here is a simple example of using spawn to launch the XCCL backend and perform an allreduce on XPU tensors.
```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def setup(rank, world_size):
    # Single-node rendezvous over TCP on localhost
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()

def run_allreduce(rank, world_size):
    setup(rank, world_size)
    # One XPU device per rank
    device = torch.device('xpu:{}'.format(rank))
    x = torch.randn([2, 2], device=device)
    # Sum-reduce x across all ranks in place
    dist.all_reduce(x)
    cleanup()

if __name__ == '__main__':
    world_size = 2
    mp.spawn(run_allreduce, args=(world_size,), nprocs=world_size, join=True)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141856
Approved by: https://github.com/kwen2501, https://github.com/gujinghui, https://github.com/albanD
2024-12-10 01:58:06 +00:00
219e9c83a5 Revert "[AOTI XPU] Support AOT Inductor for Intel GPU. (#140269)"
This reverts commit 854d83133bd4b0bca8ba19477c56ef2dd896dfc7.

Reverted https://github.com/pytorch/pytorch/pull/140269 on behalf of https://github.com/clee2000 due to breaks forward compatibility?  D66937097 ([comment](https://github.com/pytorch/pytorch/pull/140269#issuecomment-2528828555))
2024-12-09 17:33:28 +00:00
90fc2b42e3 Revert "export AOTI_TORCH_EXPORT on Windows. (#140030)"
This reverts commit 82544bd3a2f71e5995e6b035433139fad884e277.

Reverted https://github.com/pytorch/pytorch/pull/140030 on behalf of https://github.com/clee2000 due to still has failures internally when building, D66923759 ([comment](https://github.com/pytorch/pytorch/pull/140030#issuecomment-2528760716))
2024-12-09 17:04:20 +00:00
854d83133b [AOTI XPU] Support AOT Inductor for Intel GPU. (#140269)
This PR adds XPU support for AOT Inductor and reuses the corresponding UTs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140269
Approved by: https://github.com/desertfire, https://github.com/EikanWang
ghstack dependencies: #140268
2024-12-07 19:22:04 +00:00
82544bd3a2 export AOTI_TORCH_EXPORT on Windows. (#140030)
Fixes #139954

reproduce UT:
```cmd
pytest test/inductor/test_torchinductor_codegen_dynamic_shapes.py -k test_device_assert_dynamic_shapes_cpu
```
Issue:
<img width="856" alt="image" src="https://github.com/user-attachments/assets/5fc501a9-54e5-45ac-9fb3-509ec11a7abe">

After fixing:
![Image](https://github.com/user-attachments/assets/883846fb-8e92-4b9c-9400-daab32382a3a)

Reland:
1. Declare export on Windows explicitly.
2. Support cpu, cuda and xpu devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140030
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-12-07 15:23:38 +00:00
db13bd9ac2 Revert "export AOTI_TORCH_EXPORT on Windows. (#140030)"
This reverts commit b8eb4b56d8dbcd07570bec616f7ea58e9dd58fb4.

Reverted https://github.com/pytorch/pytorch/pull/140030 on behalf of https://github.com/atalman due to Break internal tests see errors like: csrc\inductor\aoti_torch\shim_common.cpp(481): error C2491: 'aoti_torch__embedding_bag': definition of dllimport function not allowed ([comment](https://github.com/pytorch/pytorch/pull/140030#issuecomment-2523968128))
2024-12-06 19:04:04 +00:00
b8eb4b56d8 export AOTI_TORCH_EXPORT on Windows. (#140030)
Fixes #139954

reproduce UT:
```cmd
pytest test/inductor/test_torchinductor_codegen_dynamic_shapes.py -k test_device_assert_dynamic_shapes_cpu
```
Issue:
<img width="856" alt="image" src="https://github.com/user-attachments/assets/5fc501a9-54e5-45ac-9fb3-509ec11a7abe">

After fixing:
![Image](https://github.com/user-attachments/assets/883846fb-8e92-4b9c-9400-daab32382a3a)

Reland:
1. Declare export on Windows explicitly.
2. Support cpu, cuda and xpu devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140030
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-12-06 04:54:42 +00:00
41952c1876 Revert "export AOTI_TORCH_EXPORT on Windows. (#140030)"
This reverts commit 38e0f72274cfad88e0f2ca40f27c79cd49413f5e.

Reverted https://github.com/pytorch/pytorch/pull/140030 on behalf of https://github.com/malfet due to This broke sm89 builds ([comment](https://github.com/pytorch/pytorch/pull/140030#issuecomment-2521290457))
2024-12-05 20:07:29 +00:00
38e0f72274 export AOTI_TORCH_EXPORT on Windows. (#140030)
Fixes #139954

reproduce UT:
```cmd
pytest test/inductor/test_torchinductor_codegen_dynamic_shapes.py -k test_device_assert_dynamic_shapes_cpu
```
Issue:
<img width="856" alt="image" src="https://github.com/user-attachments/assets/5fc501a9-54e5-45ac-9fb3-509ec11a7abe">

After fixing:
![Image](https://github.com/user-attachments/assets/883846fb-8e92-4b9c-9400-daab32382a3a)

Reland:
1. Declare export on Windows explicitly.
2. Support cpu, cuda and xpu devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140030
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-12-05 11:25:55 +00:00
cyy
bffaddf9ea Format caffe2/serialize (#141850)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141850
Approved by: https://github.com/cpuhrsch
2024-12-04 01:14:24 +00:00