Commit Graph

7696 Commits

Author SHA1 Message Date
1172a10574 [Build] Do not regenerate code endlessly without XPU (#140438)
Before this change, if one builds PyTorch without XPU build process will
be perpetually regenerating code because of the reference to non-existing
file, that will make autograd codegened files always out of date, see part of the `ninja -d explain torch_cpu` output:
```
ninja explain: output ../torch/csrc/inductor/aoti_torch/generated/c_shim_xpu.cpp doesn't exist
ninja explain: output third_party/kineto/libkineto/CMakeFiles/libkineto_defs.bzl of phony edge with no inputs doesn't exist
ninja explain: third_party/kineto/libkineto/CMakeFiles/libkineto_defs.bzl is dirty
ninja explain: /Users/malfet/git/pytorch/pytorch/torch/csrc/autograd/generated/Functions.cpp is dirty
```

This is a regression introduced by https://github.com/pytorch/pytorch/pull/139025.

After this change, incremental rebuilds with no changes cause no build actions:
```
% ninja -j1 -v -d explain -n torch_cpu
ninja explain: output third_party/kineto/libkineto/CMakeFiles/libkineto_defs.bzl of phony edge with no inputs doesn't exist
ninja explain: third_party/kineto/libkineto/CMakeFiles/libkineto_defs.bzl is dirty
ninja: no work to do.
```

Test plan: Wait for at least on XPU build to finish...

Fixes https://github.com/pytorch/pytorch/issues/140432

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140438
Approved by: https://github.com/kit1980, https://github.com/huydhn
2024-11-12 20:19:28 +00:00
191971e01d [AOTI] Introduce an extensibility mechanism for the c shim codegen to make it easy to produce c shims for out-of-tree OP kernels as well. Add c_shim for XPU. (#136742)
[AOTI] Introduce an extensibility mechanism for the c shim codegen to make it easy to produce c shims for out-of-tree OP kernels as well. Add c shim for XPU.

### Motivation
Since the current c shim codegen will only produce C wrappers for Op's registered in `aten/src/ATen/native/native_functions.yaml`, for the same backend, when a portion of out-of-tree OP's are not registered in that file, but are registered externally. For example, `third_party/torch-xpu-ops/yaml/native_functions.yaml` , in this case, the existing codegen can't fulfill the need to do extensions for the c shims from the out-of-tree OPs for the in-tree that has already been produced.

### Design
To extend the c shim with more OP for a backend from out-of-tree.
The PR provided a bool option `--aoti-extend` to indicate the codegen is to extend c shim from out-of-tree.
The generated c shim is stored in the `extend` subdirectory , for example:
```
torch/include/torch/csrc/inductor/aoti_torch/generated/c_shim_xpu.h
torch/include/torch/csrc/inductor/aoti_torch/generated/c_shim_xpu.cpp
torch/include/torch/csrc/inductor/aoti_torch/generated/extend/c_shim_xpu.h
torch/include/torch/csrc/inductor/aoti_torch/generated/extend/c_shim_xpu.cpp
```
example usage:
`python -m torchgen.gen --source-path third_party/torch-xpu-ops/yaml/ --xpu --aoti-extend --update-aoti-c-shim  `
`--xpu`:  generate c shim for XPU
`--aoti-extend `: this is an out-of-tree OPs(defined in `third_party/torch-xpu-ops/yaml/native_functions.yaml`)  extend for in-tree ops(defined in `aten/src/ATen/native/native_functions.yaml`)
`--update-aoti-c-shim`: always generate c_shim_xpu.h for the extend c_shim.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136742
Approved by: https://github.com/EikanWang, https://github.com/desertfire
ghstack dependencies: #139025
2024-11-09 13:19:52 +00:00
929a647363 [Intel GPU] Support RegisterXPU.cpp codegen and compile for the in-tree XPU structured GEMM OPs. (#139025)
[Intel GPU] Support RegisterXPU.cpp codegen and compile for the in-tree XPU structured GEMM ops.

Motivation: There are two parts of aten ops for XPU, one is in-tree ops like GEMM related OPs and the other is out-off-tree ops in torch-xpu-ops. For the in-tree part,since Pytorch uses native_functions.yaml registration and is equipped with convenient codegen capabilities, we want to take advantage of these benefits as well.
At the same time, since AOT Inductor also uses native_functions.yaml to generate c shim wrappers, we also need to enable this mechanism for XPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139025
Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/desertfire
2024-11-09 13:09:27 +00:00
a02e88d19c [miniz] Bump miniz version to 3.0.2 and add patch for zip64 (#140041)
Summary:
Bump miniz version from 2.1.0 to 3.0.2 and apply these patches:

* #79636 patches internal BUCK and bazel build
* #138959 adds `bool compute_crc32` argument
* miniz PR: https://github.com/richgel999/miniz/pull/324 to support
  zip64

Anyone bumping miniz version again, please apply these patches as well.

Test Plan:
Rely on unit test

Imported from OSS

Differential Revision: D65586230

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140041
Approved by: https://github.com/mikaylagawarecki
2024-11-09 00:13:16 +00:00
1659e241c8 [experimental] async-tp impl with cutlass-based, progress aware kernel (#139227)
This PR introduces the following:

### torch.ops.symm_mem._async_input_mm

`_async_input_mm(Tensor a, Tensor b, Tensor a_chunk_signals, int a_chunk_pivot) -> Tensor`

An mm impl that supports consuming asynchronous input. It guarantees the following rasterization order, and that the corresponding signal arrives before an input chunk is consumed.
```
num_chunks = a_chunks_signals.numel()
for chunk_idx in range(a_chunk_pivot, num_chunks + a_chunk_pivot):
    chunk_idx = chunk_idx % num_chunks
    wait_signal(a_chunk_signals, chunk_idx)
    # Compute output tiles that consumes the input chunk
```

### PersistentAsyncInputScheduler

This is a forked version of PersistentScheduler that supports consuming asynchronous input. This tile scheduler introduces the following arguments:

- `tiles_per_chunk_m` – Specifies the size of an M chunk. Chunks are the granularity at which the asynchronous input becomes ready. It must be an interger multiple of the size of an M tile.
- `chunk_signals` – `chunk_signals[i] == 1` indicates that chunk i is ready. Before returning a work tile, get_current_work() waits for the signal to ensure that the corresponding chunk is ready.
- `tile_idx_pivot_m` – After applying swizzling, apply `pivot(m) => (m + tile_idx_pivot_m) % tiles_m` to `m`. In a distributed setting, this allows different ranks to process different m indices at the same time, thus avoiding communication hotspots.

Note that this scheduler currently only supports the `KernelTmaWarpSpecializedCooperative` kernel schedule. This is enforced via the template argument `KernelSchedule`.

Usage:
```
using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
   Shape<int, int, int, int>,
   CollectiveMainloop,
   CollectiveEpilogue,
   cutlass::gemm::PersistentAsyncInputScheduler<KernelSchedule>>;
```

### _fused_all_gather_matmul_native
An ag-mm impl that combines `torch.ops.symm_mem._async_input_mm` and progress-aware all-gather. This is not yet enabled via the async-tp passes. We will use it as a backend to optimize the current decomposition-based async-tp impl.

## Benchmarks

### 4096x3584x8192
- cublas + nccl: 539us
- decomp-based async-tp w/o cuda graph: 694us
- decomp-based async-tp w/ cuda graph: 478us
- new cutlass kernel: 408us

<img width="478" alt="image" src="https://github.com/user-attachments/assets/39f316ab-36c5-4b41-af77-07854a385dfc">

### 2048x3584x8192
- cublas + nccl: 301us
- decomp-based async-tp w/o cuda graph: 687us
- decomp-based async-tp w/ cuda graph: 356us
- new cutlass kernel: 276us

<img width="441" alt="image" src="https://github.com/user-attachments/assets/9e23ce21-863b-43dd-a562-fb05d3a5a144">

## Next Steps
- Add tuning logic
- Use `_fused_all_gather_matmul_native` as a backend for the decomp-based async-tp impl

Differential temp Revision: [D65623152](https://our.internmc.facebook.com/intern/diff/D65623152)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139227
Approved by: https://github.com/weifengpy, https://github.com/Chillee
2024-11-08 23:28:25 +00:00
0b7a2d4aef [Windows XPU] Fix MSVC ambiguous symbol error (#138727)
PT master build with XPU will fail due to MSVC issue of ambiguous symbol error 'std', previously fixed it with MSVC flag in torch-xpu-ops https://github.com/intel/torch-xpu-ops/pull/946/files, but the error is observed in PT master too after 2.5 and oneAPI update.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138727
Approved by: https://github.com/guangyey, https://github.com/ezyang
2024-11-08 08:29:36 +00:00
36e0f119d0 Revert "[experimental] async-tp impl with cutlass-based, progress aware kernel (#139227)"
This reverts commit 5203138483e97141ad96a8906f1c6f8b7ff8adc6.

Reverted https://github.com/pytorch/pytorch/pull/139227 on behalf of https://github.com/yifuwang due to Need to address internal build failure D65605027 ([comment](https://github.com/pytorch/pytorch/pull/139227#issuecomment-2463204467))
2024-11-07 21:01:36 +00:00
5203138483 [experimental] async-tp impl with cutlass-based, progress aware kernel (#139227)
This PR introduces the following:

### torch.ops.symm_mem._async_input_mm

`_async_input_mm(Tensor a, Tensor b, Tensor a_chunk_signals, int a_chunk_pivot) -> Tensor`

An mm impl that supports consuming asynchronous input. It guarantees the following rasterization order, and that the corresponding signal arrives before an input chunk is consumed.
```
num_chunks = a_chunks_signals.numel()
for chunk_idx in range(a_chunk_pivot, num_chunks + a_chunk_pivot):
    chunk_idx = chunk_idx % num_chunks
    wait_signal(a_chunk_signals, chunk_idx)
    # Compute output tiles that consumes the input chunk
```

### PersistentAsyncInputScheduler

This is a forked version of PersistentScheduler that supports consuming asynchronous input. This tile scheduler introduces the following arguments:

- `tiles_per_chunk_m` – Specifies the size of an M chunk. Chunks are the granularity at which the asynchronous input becomes ready. It must be an interger multiple of the size of an M tile.
- `chunk_signals` – `chunk_signals[i] == 1` indicates that chunk i is ready. Before returning a work tile, get_current_work() waits for the signal to ensure that the corresponding chunk is ready.
- `tile_idx_pivot_m` – After applying swizzling, apply `pivot(m) => (m + tile_idx_pivot_m) % tiles_m` to `m`. In a distributed setting, this allows different ranks to process different m indices at the same time, thus avoiding communication hotspots.

Note that this scheduler currently only supports the `KernelTmaWarpSpecializedCooperative` kernel schedule. This is enforced via the template argument `KernelSchedule`.

Usage:
```
using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
   Shape<int, int, int, int>,
   CollectiveMainloop,
   CollectiveEpilogue,
   cutlass::gemm::PersistentAsyncInputScheduler<KernelSchedule>>;
```

### _fused_all_gather_matmul_native
An ag-mm impl that combines `torch.ops.symm_mem._async_input_mm` and progress-aware all-gather. This is not yet enabled via the async-tp passes. We will use it as a backend to optimize the current decomposition-based async-tp impl.

## Benchmarks

### 4096x3584x8192
- cublas + nccl: 539us
- decomp-based async-tp w/o cuda graph: 694us
- decomp-based async-tp w/ cuda graph: 478us
- new cutlass kernel: 408us

<img width="478" alt="image" src="https://github.com/user-attachments/assets/39f316ab-36c5-4b41-af77-07854a385dfc">

### 2048x3584x8192
- cublas + nccl: 301us
- decomp-based async-tp w/o cuda graph: 687us
- decomp-based async-tp w/ cuda graph: 356us
- new cutlass kernel: 276us

<img width="441" alt="image" src="https://github.com/user-attachments/assets/9e23ce21-863b-43dd-a562-fb05d3a5a144">

## Next Steps
- Add tuning logic
- Use `_fused_all_gather_matmul_native` as a backend for the decomp-based async-tp impl

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139227
Approved by: https://github.com/weifengpy, https://github.com/Chillee
2024-11-07 03:43:12 +00:00
ee42a99745 [SymmetricMemory] introduce a binding for cuMemset32Async (#138755)
## This Stack

This stack does the following things to support `xformers`-style, comm-aware Triton kernels:
- Exposes `signal_pad`s as tensors in Python
- Adds a binding for `cuMemsetAsync`

These in combination aims to provide users with more flexibility to express custom signaling/synchronization patterns.

## This PR
Make `cuMemset32Async` available via `_SymmetricMemory.memset32`. We chose `cuMemset32Async` over `cudaMemsetAsync` because it allows for `uint32_t`-wise memset. This provides users with better flexibility.

To enable this, we also added the following cuda driver APIs in `c10::cuda::DriverAPI`:
- `cuDevicePrimaryCtxRetain` - for obtaining the primary context of a device in the form of `CUcontext`.
- `cuCtxGetCurrent`/`cuCtxSetCurrent` - for setting and restoring the context for cuda driver APIs such as `cuMemset32Async`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138755
Approved by: https://github.com/weifengpy, https://github.com/eqy, https://github.com/lw
2024-11-05 18:47:24 +00:00
356fc41ae0 [Intel GPU] Avoid target_link_libraries twice for torch_xpu_ops which will potentially cause multiple definition symbol linker error. (#139024)
[Intel GPU] Avoid target_link_libraries twice for torch_xpu_ops which will potentially cause multiple definition symbol linker error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139024
Approved by: https://github.com/EikanWang, https://github.com/fengyuan14, https://github.com/jansel
2024-11-05 08:18:09 +00:00
99413cd1a8 [CMake] Fix local MPS builds (#139651)
Not sure how it works on some machines, but clean build fails for me after https://github.com/pytorch/pytorch/pull/138636 was landed, even though it works fine on another machine.

Solution is to create an empty file when one adds a dependency, but later this dependency will be updated by the build rule

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139651
Approved by: https://github.com/atalman
2024-11-04 19:43:53 +00:00
3ca794783f Revert "[SymmetricMemory] introduce a binding for cuMemset32Async (#138755)"
This reverts commit 924e726c3a2566125f55cdbff4dff054d3db3232.

Reverted https://github.com/pytorch/pytorch/pull/138755 on behalf of https://github.com/ZainRizvi due to Sorry but this breaks internally.  Can you please fix this PR so it works internally and re-merge it? See D65401876 for more details ([comment](https://github.com/pytorch/pytorch/pull/138755#issuecomment-2455173596))
2024-11-04 16:34:34 +00:00
cyy
419a7e197d [6/N] Fix Wextra-semi warning (#139605)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139605
Approved by: https://github.com/ezyang
2024-11-04 13:43:16 +00:00
924e726c3a [SymmetricMemory] introduce a binding for cuMemset32Async (#138755)
## This Stack

This stack does the following things to support `xformers`-style, comm-aware Triton kernels:
- Exposes `signal_pad`s as tensors in Python
- Adds a binding for `cuMemsetAsync`

These in combination aims to provide users with more flexibility to express custom signaling/synchronization patterns.

## This PR
Make `cuMemset32Async` available via `_SymmetricMemory.memset32`. We chose `cuMemset32Async` over `cudaMemsetAsync` because it allows for `uint32_t`-wise memset. This provides users with better flexibility.

To enable this, we also added the following cuda driver APIs in `c10::cuda::DriverAPI`:
- `cuDevicePrimaryCtxRetain` - for obtaining the primary context of a device in the form of `CUcontext`.
- `cuCtxGetCurrent`/`cuCtxSetCurrent` - for setting and restoring the context for cuda driver APIs such as `cuMemset32Async`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138755
Approved by: https://github.com/weifengpy, https://github.com/eqy, https://github.com/lw
2024-11-03 21:37:31 +00:00
a1f854f270 [MPS] Compile kernels into Metallib (#138636)
PyTorch MPS backend for the most part relies on MPSGraph to provide specific operations, but recently more and more often one had to implement custom kernel here that were simply embedded in the operator codebase and were compiled directly using [`- id<MTLLibrary>newLibraryWithSource:options:error:`](https://developer.apple.com/documentation/metal/mtldevice/1433431-newlibrarywithsource) (first metal kernel to MPS backend was added in https://github.com/pytorch/pytorch/pull/82307 )
Later on, as number of operator grew, those were refactored into `MetalShaderLibrary` convenience class (see  https://github.com/pytorch/pytorch/pull/125550 )

But as number of kernels keeps growing, it's time to make a next step and properly compile them into `.metalib`

This PR does exactly that by:
 - Moving shader sources into separate .metal files
 - Adds check on whether full Xcode installed or just DeveloperTools
 - If full Xcode is installed, compiles and links shaders into .metallib for Metal-3.0(Available on MacOS 13) and Metal-3.1 standard (available on MacOS 14, can use bfloat) and bundles both using `-sectcreate` linker option and `getsectiondata` API call. `metallib_dummy.cpp` file is used to properly express dependencies between metallib build and torch_cpu link stages. Logic for generating metallibraries is loosely based on https://github.com/ml-explore/mlx/blob/main/mlx/backend/metal/kernels/CMakeLists.txt.
 - If only DeveloperTools CLI is installed, automatically wraps .metal into `_metallib.h` that contains shader source wrapped in `MetalShaderLibrary`

Bulk of changes introduced in this PR are just moving code around. I.e. for every file that contains non-templated shader definition in `aten/src/ATen/native/mps/operators` folder, corresponding `.metal` file is created in `aten/src/ATen/native/mps/kernels` folder and embedded shader definition is replaced with the following
```cpp
#ifndef PYTORCH_JIT_COMPILE_SHADERS
static auto& lib = MetalShaderLibrary::getBundledLibrary();
#else
#include <ATen/native/mps/OpName_metallib.h>
#endif
```

Some historical stats:
| PyTorch Version  | Number of shaders in MPS | Ops added |
| ------------- | ------------- | ---- |
| 1.12  | 0  | |
| 1.13  | 2  | bitwise_ops and  index.out |
| 2.0  | 4  | cross repeat and view)  |
| 2.1  | 9   | unary_ops, histogram, renorm, binary_ops |
| 2.2  | 11   | gamma and bucketization |
| 2.3  | 12  | naive_matmul (to workaround crash) |
| 2.4 | 13 | quantized_mm |
| 2.5 | 14 | fused_adam |

Pros:
  - Better code structure/readability
  - Eventually allows one to use shared headers (and implement something like `TensorIterator`)
  - Faster runtime (as compilation is done ahead of time) and perhaps better optimized compiled kernels

Cons:
  - Build process is a bit more complicated that it used to be
  - Need to maintain two codepath (as our CI builders only has DeveloperTools installed)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138636
Approved by: https://github.com/manuelcandales
2024-11-01 21:47:20 +00:00
300ca6368f Remove depracated alias macro(2/3) (#137559)
**Detailed Descriptions:**
- Remove AT_ASSERTM Macro
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137559
Approved by: https://github.com/ezyang
2024-11-01 06:17:57 +00:00
8ace3e8023 Add sv starts/ends_with (#139261)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139261
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-11-01 01:17:42 +00:00
cyy
1dd503c6fb [4/N] Fix Wextra-semi warning (#139256)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139256
Approved by: https://github.com/ezyang
2024-10-31 03:01:14 +00:00
ec5fbee6c0 Revert "Drop caffe2 string_utils (#139217)"
This reverts commit 1797a2035d92d25d3dcc46fd8facdd6569b30c53.

Reverted https://github.com/pytorch/pytorch/pull/139217 on behalf of https://github.com/huydhn due to Chatting with @r-barnes, this is still used in lots of place internally ([comment](https://github.com/pytorch/pytorch/pull/139217#issuecomment-2448568071))
2024-10-30 22:23:32 +00:00
1797a2035d Drop caffe2 string_utils (#139217)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139217
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2024-10-30 01:13:16 +00:00
cyy
f4f0f2995d Fix Wextra-semi warnings (#139000)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139000
Approved by: https://github.com/ezyang
2024-10-28 21:48:51 +00:00
bae3426af7 reimport pr137735 due to merging check issues (#138959)
This is  a cherry-pick from #137735 by @mikaylagawarecki , that cannot be merged due to a (wrongly) failing check for codev

@diff-train-skip-merge

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138959
Approved by: https://github.com/mikaylagawarecki
2024-10-27 16:31:34 +00:00
42994234a6 std::value/std::type -> std::_v/std::_t (#138746)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138746
Approved by: https://github.com/cyyever, https://github.com/malfet
2024-10-26 20:59:24 +00:00
22d2e2d9a0 Set RUNPATH so installed tests can find the required shared libraries (#136627)
This change fixes the RUNPATH of installed c++ tests so that the linker can find the shared libraries they depend on.

For example, currently:
```bash
venv/lib/python3.10/site-packages/torch $ ./bin/test_lazy
./bin/test_lazy: error while loading shared libraries: libtorch.so: cannot open shared object file: No such file or directory
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136627
Approved by: https://github.com/malfet
2024-10-25 09:38:08 +00:00
b021486405 Enable Windows Arm64 (#133088)
This PR enables Pytorch for Windows on Arm64 - CPU only.
Currently, there aren't any checks in place to build and test for Windows on Arm64, but we're working to implement those as soon as possible.
We recommend using [Arm Performance Libraries (APL)](https://developer.arm.com/Tools%20and%20Software/Arm%20Performance%20Libraries) as a BLAS option, which is introduced in this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133088
Approved by: https://github.com/malfet

Co-authored-by: cristian panaite <panaite.cristian2000@gmail.com>
Co-authored-by: Stefan-Alin Pahontu <56953855+alinpahontu2912@users.noreply.github.com>
Co-authored-by: Ozan Aydin <148207261+ozanMSFT@users.noreply.github.com>
2024-10-24 16:10:44 +00:00
cyy
f881094366 Use Wmissing-prototypes on torch_cuda (#136080)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136080
Approved by: https://github.com/ezyang
2024-10-22 02:04:19 +00:00
af0bc75460 Remove deprecated alias macro(1/3) (#137556)
**Detailed Descriptions:**
- Remove AT_ERROR Macro

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137556
Approved by: https://github.com/ezyang
2024-10-21 17:32:32 +00:00
cyy
af8bd323e8 Remove legacy Caffe2 pthreadpool from CMake (#134936)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134936
Approved by: https://github.com/ezyang
2024-10-17 05:22:08 +00:00
dd32a32cb6 Revert "Expose option to disable CRC-32 computation during torch.save (#137735)"
This reverts commit 534fa96f2d9a4feb1dcdfaecb3d73990db60f819.

Reverted https://github.com/pytorch/pytorch/pull/137735 on behalf of https://github.com/clee2000 due to failing internally D64438525, probably needs gating ([comment](https://github.com/pytorch/pytorch/pull/137735#issuecomment-2417412264))
2024-10-16 17:03:06 +00:00
534fa96f2d Expose option to disable CRC-32 computation during torch.save (#137735)
Option only works in open source, not internal

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137735
Approved by: https://github.com/albanD
2024-10-15 19:30:02 +00:00
cd292908e5 Revert "Make c10::string_view an alias of std::string_view (#130417)"
This reverts commit c48fe8901114aa2b0a9c2d77f915a2ad8ab2098b.

Reverted https://github.com/pytorch/pytorch/pull/130417 on behalf of https://github.com/clee2000 due to breaking some internal tests, probably usages of string_view that need to be changed? ([comment](https://github.com/pytorch/pytorch/pull/130417#issuecomment-2414775064))
2024-10-15 18:55:09 +00:00
e1e6417d4c Add SVE implementation of embedding_lookup_idx (#133995)
Adds an accelerated version of the embedding_lookup_idx perfkernels. This is done via a python codegen file similarly to `caffe2/perfkernels/hp_emblookup_codegen.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133995
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-10-15 18:52:44 +00:00
dac0b4e62b Revert "Add SVE implementation of embedding_lookup_idx (#133995)"
This reverts commit 770c134998d3422bc2fa3b90baa235ed0c409e62.

Reverted https://github.com/pytorch/pytorch/pull/133995 on behalf of https://github.com/clee2000 due to breaking internal tests, I wondering if this just needs a targets change for buck? ([comment](https://github.com/pytorch/pytorch/pull/133995#issuecomment-2414596554))
2024-10-15 17:23:50 +00:00
eqy
914c90dcea [NCCL][CUDA] Set PYTORCH_C10_DRIVER_API_SUPPORTED in ProcessGroupNCCL.cpp compilation (#137828)
Otherwise `expandable_segments()` is hardcoded to false in `CUDAAllocatorConfig.h`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137828
Approved by: https://github.com/yifuwang, https://github.com/Skylion007
2024-10-14 19:38:23 +00:00
770c134998 Add SVE implementation of embedding_lookup_idx (#133995)
Adds an accelerated version of the embedding_lookup_idx perfkernels. This is done via a python codegen file similarly to `caffe2/perfkernels/hp_emblookup_codegen.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133995
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-10-14 10:17:27 +00:00
cyy
c48fe89011 Make c10::string_view an alias of std::string_view (#130417)
In order to facilitate the mitigation from c10::string_view to std::string_view, the old c10::string_view was renamed to c10::string_view_ext.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130417
Approved by: https://github.com/ezyang
2024-10-14 09:28:04 +00:00
bd63ec4f45 [ROCm] LoadHIP CMake cleanup (#137112)
Should help mitigate issues reported here: https://github.com/pytorch/pytorch/issues/128313

While working on https://github.com/pytorch/pytorch/pull/136700, we realized that some of the ROCm CMake can be streamlined.

This PR does not fix any bugs or provide any new functionality. Strictly clean-up.

The remaining `${ROCM_ROCTX_LIB}` will be removed when we transition to the rocprofiler-sdk (to be done in a separate PR).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137112
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
2024-10-13 00:06:41 +00:00
eea1f79a1d [AMD] use rccl.h instead of rccl/rccl.h (#135472)
Summary: We hipify NCCLUtils.h from nccl.h to rccl/rccl.h. This follows the format of the rocm rpm suite (the header is in include/rccl/rccl.h), however the source code is just src/rccl.h. Using the rccl/rccl.h will make us find the rpm's header but not the src code's header.

Test Plan:
buck run mode/opt-amd-gpu -c hpc_comms.use_rccl=develop -c fbcode.split-dwarf=True  --config rccl.build_rdma_core=true --config rccl.adhoc_brcm=true  //aps_models/ads/icvr:icvr_launcher -- mode=local_ctr_cvr_cmf_rep_1000x_v1_no_atom   data_loader.dataset.table_ds=[2024-09-04]   data_loader.dataset.batch_size=512  max_ind_range=10

w/o this diff, it'll show 2.18 nccl version

Differential Revision: D62371434

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135472
Approved by: https://github.com/jeffdaily, https://github.com/cenzhaometa
2024-10-10 08:55:57 +00:00
2d465e4d1d [non ghstack] Init threadpool with user defined num_threads before default (#137051)
Very similar to https://github.com/pytorch/pytorch/pull/136793, but adds back `pool->set_thread_count` call as it is still necessary (I am guessing due to the mutex)

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137051
Approved by: https://github.com/albanD
2024-10-02 20:02:30 +00:00
8962610247 [BE][clang-format] make macro PyObject_HEAD_INIT(type) and PyVarObject_HEAD_INIT(type, size) have its own line (#136949)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136949
Approved by: https://github.com/albanD, https://github.com/eqy
ghstack dependencies: #136945
2024-10-02 18:39:22 +00:00
89c37be6b7 [BE][clang-format] make macro PyObject_HEAD have its own line (#136945)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136945
Approved by: https://github.com/albanD
2024-10-02 18:39:21 +00:00
66e3186a48 Revert "Init threadpool with user defined num_threads before default (#136793)"
This reverts commit adbcaee950afa6697c04962096344bf0962a542f.

Reverted https://github.com/pytorch/pytorch/pull/136793 on behalf of https://github.com/janeyx99 due to Caused internal Oculus crash, and internal force landed a diff without exporting to GH =.= ([comment](https://github.com/pytorch/pytorch/pull/136793#issuecomment-2384148132))
2024-09-30 21:10:12 +00:00
adbcaee950 Init threadpool with user defined num_threads before default (#136793)
Fixes #134714 (or attempts to, idk how to test yet)

For posterity, how one can test:
1. make sure you have USE_PTHREADPOOL=1 or pull a packaged binary
2. run gdb --args python, with `r` to enter, `Ctrl-C` to pause, and `c` to get back into Python
3. import torch
4. torch.set_num_threads(1), make sure this does not trigger any additional threads getting created.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136793
Approved by: https://github.com/albanD
2024-09-27 22:22:37 +00:00
564d00f364 Revert "Fix clang-tidy warnings in Caffe2 code (#134935)"
This reverts commit 7cfd23636c8fa6fcbb8bf3ea34e15b847ec9ad9d.

Reverted https://github.com/pytorch/pytorch/pull/134935 on behalf of https://github.com/izaitsevfb due to breaks internal builds, caffe2 is still used internally ([comment](https://github.com/pytorch/pytorch/pull/134935#issuecomment-2349368152))
2024-09-13 16:42:37 +00:00
3de9e474df Revert "Check function declarations of Core ML code (#135467)"
This reverts commit bc1b8f094d24de27432f4c29f0729e85a6b5ba63.

Reverted https://github.com/pytorch/pytorch/pull/135467 on behalf of https://github.com/malfet due to This breaks ios periodic jobs, see https://github.com/pytorch/pytorch/actions/runs/10797026668/job/29947377532 ([comment](https://github.com/pytorch/pytorch/pull/135467#issuecomment-2347322784))
2024-09-12 22:04:35 +00:00
cyy
7cfd23636c Fix clang-tidy warnings in Caffe2 code (#134935)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134935
Approved by: https://github.com/ezyang
2024-09-12 03:27:09 +00:00
cyy
bc1b8f094d Check function declarations of Core ML code (#135467)
Relax the restrictions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135467
Approved by: https://github.com/ezyang
2024-09-10 16:05:22 +00:00
cyy
60e8dc4374 Check function declarations in Caffe2 code (#134925)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134925
Approved by: https://github.com/ezyang
2024-09-09 05:03:29 +00:00
7074de43c0 Porting to GCC 15 (#135188)
uint8_t is found on cstdint header

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135188
Approved by: https://github.com/Skylion007
2024-09-06 16:16:53 +00:00
c044deb9ce Revert "c10d/logging: add C10D_LOCK_GUARD (#134131)"
This reverts commit f33bcbe5fd67e6b18be259ad2f0dc11c74157075.

Reverted https://github.com/pytorch/pytorch/pull/134131 on behalf of https://github.com/kit1980 due to See D61985186 ([comment](https://github.com/pytorch/pytorch/pull/134131#issuecomment-2327556381))
2024-09-03 22:35:14 +00:00