Compare commits

...

564 Commits

Author SHA1 Message Date
4d9de27d56 Amazon Linux 2023: Preload cusparseLt.so (#144493)
Amazon Linux 2023: Preload cusparseLt.so (#144477)

Fixes https://github.com/pytorch/pytorch/issues/144433

Test with some debug statements added:

```
>>> import torch
trying to load libcublas.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cublas/lib/libcublas.so.12']
trying to load libcublas.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cublas/lib/libcublas.so.12
trying to load libcudnn.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cudnn/lib/libcudnn.so.9']
trying to load libcudnn.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cudnn/lib/libcudnn.so.9
trying to load libnvrtc.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cuda_nvrtc/lib/libnvrtc.so.12']
trying to load libnvrtc.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cuda_nvrtc/lib/libnvrtc.so.12
trying to load libcudart.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cuda_runtime/lib/libcudart.so.12']
trying to load libcudart.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cuda_runtime/lib/libcudart.so.12
trying to load libcupti.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cuda_cupti/lib/libcupti.so.12']
trying to load libcupti.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cuda_cupti/lib/libcupti.so.12
trying to load libcufft.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cufft/lib/libcufft.so.11']
trying to load libcufft.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cufft/lib/libcufft.so.11
trying to load libcurand.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/curand/lib/libcurand.so.10']
trying to load libcurand.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/curand/lib/libcurand.so.10
trying to load libnvJitLink.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/nvjitlink/lib/libnvJitLink.so.12']
trying to load libnvJitLink.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/nvjitlink/lib/libnvJitLink.so.12
trying to load libcusparse.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cusparse/lib/libcusparse.so.12']
trying to load libcusparse.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cusparse/lib/libcusparse.so.12
trying to load libcusparseLt.so.*[0-9] from []
trying to load libcusparseLt.so.*[0-9] from /usr/local/lib/python3.9/site-packages/cusparselt/lib/libcusparseLt.so.0
trying to load libcusolver.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cusolver/lib/libcusolver.so.11']
trying to load libcusolver.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cusolver/lib/libcusolver.so.11
trying to load libnccl.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/nccl/lib/libnccl.so.2']
trying to load libnccl.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/nccl/lib/libnccl.so.2
trying to load libnvToolsExt.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/nvtx/lib/libnvToolsExt.so.1']
trying to load libnvToolsExt.so.*[0-9] from /usr/local/lib/python3.9/site-
packages/nvidia/nvtx/lib/libnvToolsExt.so.1
/usr/local/lib64/python3.9/site-packages/torch/_subclasses/functional_tensor.py:275: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:81.)
  cpu = _conversion_method_template(device=torch.device("cpu"))
>>> exit()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144477
Approved by: https://github.com/Skylion007, https://github.com/nWEIdia

(cherry picked from commit 2b241a8206843f43f0568b7b65473ebb593c4740)

Co-authored-by: atalman <atalman@fb.com>
2025-01-10 09:13:52 -05:00
d155d8ad6a [3.13t] use sysconfig to check for Python nogil builds (#144393)
[3.13t] use sysconfig to check for Python nogil builds (#144361)

`sys._is_gil_enabled()` wasn't working in certain cases, according to @atalman

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144361
Approved by: https://github.com/atalman

(cherry picked from commit f7000350905be5073892e0b23df681c0281be0f0)

Co-authored-by: William Wen <williamwen@meta.com>
2025-01-10 09:10:38 -05:00
e1858b614e Fix PythonMod printing (#144335)
* Fix precedence of bitwise and/or printing (#143197)

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143197
Approved by: https://github.com/albanD, https://github.com/williamwen42

(cherry picked from commit 8f404467707ea43860af0f71d8c0867afe047732)

* Fix PythonMod printing (#144078)

Fixes #144075
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144078
Approved by: https://github.com/anijain2305

(cherry picked from commit 301b9c8a90002fa621d93b108e54460066226629)

---------

Co-authored-by: Edward Z. Yang <ezyang@meta.com>
2025-01-10 09:08:47 -05:00
be126bccee [inductor][cpu] Fix bmm b_index for dynamic expressions in inductor autotuner (#144248)
[inductor][cpu] Fix bmm b_index for dynamic expressions in inductor autotuner (#143141)

Fixes #143102

Addresses 2 problems relating to dynamic batch size in BMM autotuner:
1. With dynamic batch size, when the input is a sympy Mult expression such as `s0*8` (which occurs in many dynamo benchmark models), we address this by using `size_hints` to solve for any expressions (see the sketch after this list). This is safe since this section of the code is only called to generate inputs for benchmarking.
2. Some epilogue nodes may use the dynamic batch size as part of the codegen, for example when an input to the epilogue node is transposed and has the dynamic batch size in the strides of other dimensions. When these epilogue nodes exist, if the sizevar is not already present in `kernel.args`, a new sizevar with a new name is created. Subsequent calls to `def_kernel` could overwrite this variable name, so to avoid this we pass all the sizevars as `extra_sizevars` to the `def_kernel` calls for the GEMM functions, so no variable renaming happens later in the BMM definition.
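
A hedged illustration of point 1 (the names below are made up for illustration, not inductor internals): resolving a sympy batch expression with size hints might look like this.

```python
import sympy

# Hypothetical sketch: a dynamic batch-size expression and the hints recorded
# for its free symbols; substituting the hints gives a concrete value that is
# only used to materialize benchmarking inputs.
s0 = sympy.Symbol("s0", positive=True, integer=True)
batch_expr = s0 * 8
size_hints = {s0: 16}  # assumed hint for the dynamic dimension

concrete_batch = int(batch_expr.subs(size_hints))
print(concrete_batch)  # 128
```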

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143141
Approved by: https://github.com/jansel, https://github.com/leslie-fang-intel, https://github.com/jgong5

(cherry picked from commit 51a37a42e0e50df6b199732f2680afa5ed14c94f)

Co-authored-by: Mitchell, Frost <frost.mitchell@intel.com>
2025-01-10 09:05:28 -05:00
8c03454867 Set maximum supported version of Python as 3.13 (#144409)
Set maximum supported version of Python as 3.13 (#144396)

Same as https://github.com/pytorch/pytorch/pull/119743 Required for Release 2.6.0
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144396
Approved by: https://github.com/Skylion007, https://github.com/albanD, https://github.com/malfet

(cherry picked from commit e14c36d3f498e4ec513459209eb95dc392ba9876)

Co-authored-by: atalman <atalman@fb.com>
2025-01-10 09:02:00 -05:00
7092dc521b Link to transformer tutorial in transformer docs (#144482)
Link to transformer tutorial in transformer docs (#144425)

<img width="1045" alt="Screenshot 2025-01-08 at 4 50 20 PM" src="https://github.com/user-attachments/assets/05adfecb-8a23-4c48-9a2c-50c5b3f886b0" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144425
Approved by: https://github.com/albanD

(cherry picked from commit b8f383107eebd9495a0f132d58a970e178e15930)

Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
2025-01-09 14:04:45 -08:00
f35ab0e353 [CD] Aarch64 builds should not override OVERRIDE_PACKAGE_VERSION envvar (#144347)
[CD] Aarch64 builds should not override `OVERRIDE_PACKAGE_VERSION` envvar (#144285)

Currently our nightly aarch64 binaries have the correct suffixes, +cpu or +cu126, but release binaries are missing them. To correct this and make sure our nightly and release binaries are consistent, I propose this change.

I see that override is already set correctly in release workflow:
https://github.com/pytorch/pytorch/actions/runs/12383179841/job/34565381200

For CPU:
```
OVERRIDE_PACKAGE_VERSION="2.6.0+cpu"
```

For CUDA:
```
OVERRIDE_PACKAGE_VERSION="2.6.0+cu126"
```

The removed code would set OVERRIDE_PACKAGE_VERSION="2.6.0" for both CUDA and CPU release binaries.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144285
Approved by: https://github.com/malfet, https://github.com/tinglvv

(cherry picked from commit 8d35333498e9433a379611746c177285fa51c8c5)

Co-authored-by: atalman <atalman@fb.com>
2025-01-07 17:48:28 -08:00
3a3de27475 Fix int8 mm V.ops.mul dispatching (#144336)
Fix int8 mm V.ops.mul dispatching (#143127)

This is sort of subtle: because we were doing `V.ops.mul` at binding time, we don't redispatch later when we invoke the epilogue, and then we run into the assertion checking added in the PR above.
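
A generic Python sketch of the binding-time vs. call-time distinction described above (this is an analogy only, not inductor's `V.ops` machinery):

```python
# Computing the result at binding time captures whatever handler is active at
# that moment; deferring the computation into a closure lets it redispatch
# against the handler active when the epilogue actually runs.
class Ops:
    handler = "analysis"

def bind_eagerly(a, b):
    return f"{Ops.handler}: mul({a}, {b})"          # resolved immediately

def bind_lazily(a, b):
    return lambda: f"{Ops.handler}: mul({a}, {b})"  # resolved when invoked

eager = bind_eagerly(1, 2)
lazy = bind_lazily(1, 2)
Ops.handler = "codegen"  # the active handler changes before the epilogue runs
print(eager)   # analysis: mul(1, 2) -- stale, never redispatched
print(lazy())  # codegen: mul(1, 2)  -- picks up the current handler
```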

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143127
Approved by: https://github.com/drisspg
ghstack dependencies: #143048

(cherry picked from commit 7968732f5b84ac6509d800a54bfb23fb791d3b88)

Co-authored-by: eellison <elias.ellison@gmail.com>
2025-01-07 17:45:58 -08:00
7d3292c0d3 Fix batch-specific attention mod for NJT + Flex (#144330)
Fix batch-specific attention mod for NJT + Flex (#143866)

Fixes #143788
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143866
Approved by: https://github.com/Skylion007, https://github.com/cpuhrsch

(cherry picked from commit 228b228449872d4aa515f2f2ebbd25bb0b8d85bf)

Co-authored-by: Joel Schlosser <jbschlosser@meta.com>
2025-01-07 17:43:10 -08:00
478a99c59b Update torch-xpu-ops commit pin (#144209)
Update torch-xpu-ops commit pin (#143984)

Update the torch-xpu-ops commit to [28cfac20ec662abdb0ac98faf122450013e8f520](28cfac20ec), which includes:

- Disable batch_norm vectorization path to fix accuracy issues.
- Fix the LSTM/RNN implementation error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143984
Approved by: https://github.com/EikanWang, https://github.com/ruidazeng, https://github.com/desertfire, https://github.com/jansel

(cherry picked from commit 1e881ceecfe80532206ca4e0acb64391fab8b935)

Co-authored-by: Yutao Xu <yutao.xu@intel.com>
2025-01-07 16:50:13 -08:00
4e4182dbd0 [ROCm] Add miopen_batch_norm to meta_registrations to fix AOTI issue (#144028)
[ROCm] Add miopen_batch_norm to meta_registrations to fix AOTI issue (#143569)

Currently the upstream example for AOTI usage breaks on ROCm (https://pytorch.org/tutorials/recipes/torch_export_aoti_python.html)

```
File "/root/upstream/torch/_dynamo/exc.py", line 317, in unimplemented
    raise Unsupported(msg, case_name=case_name)
torch._dynamo.exc.Unsupported: unsupported operator: aten.miopen_batch_norm.default (see https://docs.google.com/document/d/1GgvOe7C8_NVOMLOCwDaYV1mXXyHMXY7ExoewHqooxrs/edit#heading=h.64r4npvq0w0 for how to fix)

from user code:
   File "/root/vision/torchvision/models/resnet.py", line 285, in forward
    return self._forward_impl(x)
  File "/root/vision/torchvision/models/resnet.py", line 269, in _forward_impl
    x = self.bn1(x)
```

This PR adds a meta_registration for miopen_batch_norm to resolve this issue

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143569
Approved by: https://github.com/jeffdaily

(cherry picked from commit 27b0d41f0ab45bc281b9e9bb594df3277783017d)

Co-authored-by: Jack Taylor <jack.taylor@amd.com>
2025-01-06 16:09:55 -08:00
929efb4531 [Release/2.6][MPS] Fix crash on CPU scalars (#144096)
* [MPS] Fix fmin/fmax for scalar argument (#143934)

CPU scalar promotion to GPU is allowed for CUDA and should be allowed for MPS as well (at the very least it should not crash).
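
An illustrative repro of the fixed behavior (assumes a machine with the MPS backend available):

```python
import torch

# fmin/fmax with a 0-dim CPU scalar tensor should promote it to MPS
# instead of crashing, mirroring the CUDA behavior.
x = torch.rand(4, device="mps")
y = torch.tensor(0.25)  # CPU scalar
print(torch.fmax(x, y))
print(torch.fmin(x, y))
```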

Fixes https://github.com/pytorch/pytorch/issues/143933 https://github.com/pytorch/pytorch/issues/142203
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143934
Approved by: https://github.com/Skylion007

(cherry picked from commit 3054aae493a5347cf8187b5ce611b9a38aace202)

* [MPS] Handle implicit cpu-scalar-to-gpu transfer (#144055)

Followup after https://github.com/pytorch/pytorch/pull/143934: this check is no longer necessary, and removing it fixes a subset of inductor tests.

Before this change, `pytest test/inductor/test_torchinductor.py -k _mps` reported 463 failed, 291 passed, 32 skipped; after it, 456 failed, 298 passed, 32 skipped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144055
Approved by: https://github.com/Skylion007

(cherry picked from commit a93e75d1e2635ae1bf9a6c24cbe8fb2a6d65bfd9)
2025-01-06 12:16:55 -08:00
f01a678e02 [ROCm] Guard triton backend call around cuda.is_available (#144027)
[ROCm] Guard triton backend call around cuda.is_available (#143570)

To resolve: https://github.com/pytorch/test-infra/issues/6082

Calling into Triton's get_backend_options will initialise CUDA and break CPU-only environments that may have hip installed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143570
Approved by: https://github.com/atalman, https://github.com/jeffdaily

(cherry picked from commit 66172578f918974cca995b0d6f740903a35b1fa5)

Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com>
2025-01-06 12:10:28 -08:00
23e390c711 Respect ROCR_VISIBLE_DEVICES on AMD GPU device discovery (#144026)
Respect ROCR_VISIBLE_DEVICES on AMD GPU device discovery (#142292)

Reland of #140320 after failing test on trunk. Fixes potential environment clobbering in test, makes ROCr+HIP devices (if specified together) more robust to index errors.

Fixes #140318

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142292
Approved by: https://github.com/jataylo, https://github.com/huydhn, https://github.com/jeffdaily

Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
(cherry picked from commit c0d710634fcce172490c3ace0de977829b38bc06)

Co-authored-by: Tal Ben-Nun <tbennun@users.noreply.github.com>
2025-01-06 12:07:00 -08:00
41811ae689 [CD] Remove redundant triton dependency for xpu wheels (#143983)
[CD] Remove redundant triton dependency for xpu wheels (#143839)

Because XPU CD wheels enabled PyPI dependencies in https://github.com/pytorch/pytorch/pull/141135, PYTORCH_EXTRA_INSTALL_REQUIREMENTS already has a value for the XPU CD wheel build.
Works for https://github.com/pytorch/pytorch/issues/139722 and https://github.com/pytorch/pytorch/issues/114850
Fixes #143838

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143839
Approved by: https://github.com/huydhn

(cherry picked from commit 438698b20b585fc13f4439f8a9a93d63079eba37)

Co-authored-by: chuanqiw <chuanqi.wang@intel.com>
2025-01-02 10:44:24 -08:00
d9eeddd49f Remove assert from partitioner.py (#143608)
Remove assert from partitioner.py (#143376)

Remove erroneous assert assuming a dependent (user) node to be in the partition. This partially reverts #136616 by removing the assert.

Tested locally with a failing ExecuTorch Arm test using
```
$ python -m examples.arm.aot_arm_compiler --model_name mv2 --target ethos-u55-128 --delegate --quantize
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143376
Approved by: https://github.com/tarun292

(cherry picked from commit 6829897682b2b46a592a92d84417cb4124c26e88)

Co-authored-by: Digant Desai <digantdesai@meta.com>
2024-12-26 14:16:22 -08:00
5eb54f6ebf torch/accelerator: fix device type comparison (#143541) (#143781)
This was failing without the fix:
```
python -c 'import torch; d=torch.device("xpu:0"); torch.accelerator.current_stream(d)'
```
with:
```
ValueError: xpu doesn't match the current accelerator xpu.
```

CC: @guangyey, @EikanWang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143541
Approved by: https://github.com/guangyey, https://github.com/albanD

(cherry picked from commit 7314cf44ae719dfbc9159496030ce84d152686e4)

Co-authored-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
2024-12-26 14:15:47 -08:00
b1a10ecad9 Use random64 in Fischer-Yates algorithm for large N (#143682) (#143875)
Fixes bug in randperm https://nbsanity.com/static/a4774194938414dedcec7d6e99727d31/Shuffling_20in_20torch_20vs_20numpy-public.html
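
For context, a minimal Fisher-Yates sketch in Python (the actual fix is on the CUDA side, where the index draws need 64 random bits so that large N is not biased by truncated values; Python's `random` is unbounded, so the width issue does not arise here):

```python
import random

def fisher_yates(a):
    # Walk from the end; each position i swaps with a uniformly chosen j in
    # [0, i]. The draw must have enough random bits for large lengths.
    for i in range(len(a) - 1, 0, -1):
        j = random.randrange(i + 1)
        a[i], a[j] = a[j], a[i]
    return a

print(fisher_yates(list(range(10))))
```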

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143682
Approved by: https://github.com/eqy, https://github.com/albanD
2024-12-26 11:53:04 -08:00
31b520a599 Revert "Exclude py 31.3t triton package from PyTorch 3.13t wheel" (#143767)
Revert "Exclude py 31.3t triton package from PyTorch 3.13t wheel (#143244)"

This reverts commit c92f6871e6d879f129103fa18cb7c2477d43d013.
2024-12-24 12:09:13 -05:00
f61bf202b3 [Inductor] Constrain the shape of other tensor for Conv/Linear + broa… (#143617)
[Inductor] Constrain the shape of other tensor for Conv/Linear + broadcast add fusion. (#141759)

Fix https://github.com/pytorch/pytorch/issues/141671.

Summary:
The performance regression of these two timm_models is caused by the Conv/Linear + broadcast add fusion running into the oneDNN reference path. This PR constrains the shape of the other tensor for the Conv/Linear + broadcast add fusion to fix this issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141759
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-12-23 12:19:41 -08:00
4b9b7def3d [Inductor] Fix _can_be_inplace function (#143279) (#143452)
Summary:
Modify _can_be_inplace function: return False if `_other.data` is an instance of `ir.BaseView`.

Fix https://github.com/pytorch/pytorch/issues/143280.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143279
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel, https://github.com/jgong5
2024-12-23 12:15:35 -08:00
9b688182f7 [dynamo, 3.13t] raise error if torch.compile is attempted in 3.13t (nogil) (#143594)
[dynamo, 3.13t] raise error if torch.compile is attempted in 3.13t (nogil) (#143404)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143404
Approved by: https://github.com/colesbury, https://github.com/atalman

(cherry picked from commit e1e83015d24f49cf2ffb0c67a3524cc9ac62463a)
2024-12-19 11:41:10 -08:00
22775e0e8c [Reland 2.6][BE][accelerator] formalize API name {current,set}_device_{idx => index} (#143186)
[BE][accelerator] formalize API name `{current,set}_device_{idx => index}` (#140542)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140542
Approved by: https://github.com/guangyey, https://github.com/albanD
2024-12-18 11:52:49 -08:00
c953e748eb [MPS] Use metal shaders for all view ops (#143520)
[MPS] Use metal shaders for all view ops (#143375)

Before this PR, Metal shaders were used to scatter/gather 1-5 dimensional tensors.
This PR introduces generalized ones that can be used for any dimensionality and, as a result, gets rid of 700+ lines of complex and untested code that might not even work as expected.
The generalized gather shader looks as follows:
```metal
kernel void gather_kernel_n(uint linear_index           [[thread_position_in_grid]],
                            constant void * src_        [[buffer(0)]],
                            device void * dst_          [[buffer(1)]],
                            constant uint32_t * size    [[buffer(2)]],
                            constant uint32_t * stride  [[buffer(3)]],
                            constant uint32_t & numel   [[buffer(4)]],
                            constant int32_t & ndim     [[buffer(5)]]) {{
    if (linear_index >= numel) return;

    constant {0} * src = (constant {0} *)src_;
    device {1} * dst = (device {1} *)dst_;

    uint64_t src_offs = 0;
    auto src_idx = linear_index;
    for(int dim = ndim - 1; dim >= 0; --dim) {{
      src_offs += stride[dim] * (src_idx % size[dim]);
      src_idx /= size[dim];
    }}

    dst[linear_index] = cast<{1}>(src[src_offs]);
}}
```

This, according to the following benchmark,
```python
from timeit import default_timer

import torch
import torch.utils.cpp_extension
from torch.utils.benchmark import Measurement, Timer

t = Timer(
    stmt=f"y.copy_(x);torch.mps.synchronize()",
    setup=f"x=torch.rand(4, 5, 16, 64, 33, 24, dtype=torch.float32, device='mps')[:,:,:,:24,:24,];y=torch.empty(x.shape, device=x.device, dtype=x.dtype)",
    language="python", timer=default_timer
)
print(t.blocked_autorange())
```
is almost twice as fast as the previous implementation (on a MacBook M2 Pro it takes 2.9 ms for the MPS version vs 1.5 ms for the shader one).

On MacOS Sequoia [`gatherWithUpdatesTensor: indicesTensor:...`](https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/gather(withupdatestensor:indicestensor:axis:batchdimensions:name:)?language=objc) crashes if invoked with complex data type, as one can see by running the code below
```swift
import Metal
import MetalPerformanceShadersGraph

func gatherComplexMPS(device: MTLDevice,
                inp_buf: MTLBuffer, idx_buf: MTLBuffer,
                out_buf: MTLBuffer,
                inp_elem: Int, upd_elem: Int) {
  let graph = MPSGraph()
  let inputPlaceholder = graph.placeholder(shape: [inp_elem as NSNumber], dataType: .complexFloat32, name: nil)
  let indicesPlaceholder = graph.placeholder(shape: [upd_elem as NSNumber], dataType: .int64, name: nil)
  let outNode = graph.gather(withUpdatesTensor: inputPlaceholder, indicesTensor: indicesPlaceholder, axis: 0, batchDimensions: 0, name: nil)
  let mpsInputBuffer = MPSGraphTensorData(inp_buf, shape: [inp_elem as NSNumber], dataType: .complexFloat32)
  let mpsIndicesBuffer = MPSGraphTensorData(idx_buf, shape: [upd_elem as NSNumber], dataType: .int64)
  let mpsOutputBuffer = MPSGraphTensorData(out_buf, shape: [inp_elem as NSNumber], dataType: .complexFloat32)
  guard let queue = device.makeCommandQueue() else { fatalError("Can't make queue") }
  graph.run(with: queue, feeds: [inputPlaceholder: mpsInputBuffer,
                               indicesPlaceholder: mpsIndicesBuffer ],
            targetOperations: nil, resultsDictionary: [outNode: mpsOutputBuffer])
}

func makeBufferWithValues<T>(device: MTLDevice, values: [T]) -> MTLBuffer {
  guard let buf = device.makeBuffer(length: values.count * MemoryLayout<T>.size, options: [.storageModeShared]) else { fatalError("Can't alloc") }
  let buf_data = buf.contents().assumingMemoryBound(to: T.self)
  for i in 0..<values.count {
    buf_data[i] = values[i]
  }
  return buf
}

guard let device = MTLCopyAllDevices().first else { fatalError("Not Metal device found") }
print("Using device \(device.name)")

let inp_buf = makeBufferWithValues(device: device, values: [1.0, 2.0 , 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
let idx_buf = makeBufferWithValues(device: device, values: [0, 1, 2, 3])
guard let out_buf = device.makeBuffer(length:8 * MemoryLayout<Float>.size, options: [.storageModeShared]) else { fatalError("Can't alloc") }

gatherComplexMPS(device: device, inp_buf: inp_buf, idx_buf: idx_buf, out_buf: out_buf, inp_elem: 4, upd_elem: 4)
```

Fixes https://github.com/pytorch/pytorch/issues/143140
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143375
Approved by: https://github.com/albanD

(cherry picked from commit 24a18d76c8619c0c4760c94aebef6ae7867fe1e6)

Co-authored-by: Nikita Shulga <nikita.shulga@gmail.com>
2024-12-18 11:46:11 -08:00
6628b70f02 Prevent torch.jit.load path in torch.load when weights_only=True (#143506)
Prevent torch.jit.load path in torch.load when weights_only=True (#143326)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143326
Approved by: https://github.com/albanD

(cherry picked from commit ac8342f8817570494faa85b54d8857d307959a68)

Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
2024-12-18 11:39:41 -08:00
0cdf8b1d09 Enable CPP/CUDAExtension with py_limited_api for python agnosticism (#143448)
Enable CPP/CUDAExtension with py_limited_api for python agnosticism (#138088)

Getting tested with ao, but now there is a real test I added.

## What does this PR do?

We want to allow custom PyTorch extensions to be able to build one wheel for multiple Python versions, in other words, achieve python agnosticism. It turns out that there is such a way that setuptools/Python provides already! Namely, if the user promises to use only the Python limited API in their extension, they can pass in `py_limited_api` to their Extension class and to the bdist_wheel command (with a min python version) in order to build 1 wheel that will suffice across multiple Python versions.

Sounds lovely! Why don't people do that already with PyTorch? Well 2 things. This workflow is hardly documented (even searching for python agnostic specifically does not reveal many answers) so I'd expect that people simply don't know about it. But even if they did, _PyTorch_ custom Extensions would still not work because we always link torch_python, which does not abide by py_limited_api rules.

So this is where this PR comes in! We respect when the user specifies py_limited_api and skip linking torch_python under that condition, allowing users to enroll in the provided functionality I just described.
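
A hedged sketch of what enrolling might look like in a custom extension's setup.py (the package and source names are placeholders, and the exact wiring may differ from the test added in this PR):

```python
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CppExtension

setup(
    name="my_agnostic_ext",  # placeholder package name
    ext_modules=[
        CppExtension(
            "my_agnostic_ext._C",
            ["csrc/ext.cpp"],     # placeholder source that uses only the limited API
            py_limited_api=True,  # with this PR, torch_python is no longer linked
        )
    ],
    cmdclass={"build_ext": BuildExtension},
    # Tag a single wheel as usable on CPython >= 3.9 (abi3).
    options={"bdist_wheel": {"py_limited_api": "cp39"}},
)
```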

## How do I know this PR works?

I manually tested my silly little ultra_norm locally (with `import python_agnostic`) and wrote a test case for the extension showing that
- torch_python doesn't show up in the ldd tree
- no Py- symbols show up
It may be a little confusing that our test case is actually python-free (cleaner than python-agnostic), but it is sufficient (though not necessary) for showing that this change works.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138088
Approved by: https://github.com/ezyang, https://github.com/albanD

(cherry picked from commit be27dbf2b806e5d9c8d63ac6f6f96712299f98c3)

Co-authored-by: Jane Xu <janeyx@meta.com>
2024-12-17 15:28:26 -08:00
46f5510d20 Fix search icon (#143120)
Fix search icon (#142808)

Removing:

.pytorch-left-menu-search input[type=text] {
    background-image: none;
}
so that the search icon correctly appears in the sphinx searchbox

Also, fixing scrolling

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142808
Approved by: https://github.com/albanD

(cherry picked from commit 0f78be5573016e65c0b493b788f40b10a6e18060)

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2024-12-17 15:11:48 -08:00
f9e99fc62f Create build_directory if it does not exist when generating ninja build file (#143345)
Create build_directory if it does not exist when generating ninja build file (#143328)

Fixes: https://github.com/pytorch/vision/issues/8816
I am observing this failure on Windows, Python 3.13 vision builds:
```
Emitting ninja build file C:\actions-runner\_work\vision\vision\pytorch\vision\build\temp.win-amd64-cpython-313\Release\build.ninja...
error: [Errno 2] No such file or directory: 'C:\\actions-runner\\_work\\vision\\vision\\pytorch\\vision\\build\\temp.win-amd64-cpython-313\\Release\\build.ninja'
ERROR conda.cli.main_run:execute(49): `conda run packaging/windows/internal/vc_env_helper.bat python setup.py bdist_wheel` failed. (See above for error)
```

Adding this code fixes it, as confirmed by running `python setup.py bdist_wheel`:
```
building 'torchvision._C' extension
Emitting ninja build file C:\actions-runner\_work\vision\vision\pytorch\vision\build\temp.win-amd64-cpython-313\Release\build.ninja...
Creating build directory C:\actions-runner\_work\vision\vision\pytorch\vision\build\temp.win-amd64-cpython-313\Release
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/26] cl /showIncludes /nologo /O2 /W3 /GL /DNDEBUG /MD /MD /wd4819 /wd4251 /wd4244 /wd4267 /wd4275 /wd4018 /wd4190 /wd4624 /wd4067 /wd4068 /EHsc -Dtorchvision_EXPORTS -IC:\actions-runner\_work\vision\vision\pytorch\vision\torchvision\csrc -IC:\actions-runner\_work\_temp\conda_environment_12361066769\Lib\site-packages\torch\include -IC:\actions-runner\_work\_temp\conda_environment_12361066769\Lib\site-packages\torch\include\torch\csrc\api\include -IC:\actions-runner\_work\_temp\conda_environment_12361066769\Lib\site-packages\torch\include\TH -IC:\actions-runner\_work\_temp\conda_environment_12361066769\Lib\site-packages\torch\include\THC -IC:\actions-runner\_work\_temp\conda_environment_12361066769\include -IC:\actions-runner\_work\_temp\conda_environment_12361066769\Include "-IC:\Pr
```
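
A minimal sketch of the kind of guard this adds (the real change lives in torch.utils.cpp_extension around the ninja-file emission; the path below is illustrative):

```python
import os

build_directory = "build/temp.win-amd64-cpython-313/Release"  # example path

# Create the directory before emitting build.ninja so the open() cannot fail
# with "No such file or directory".
if not os.path.exists(build_directory):
    os.makedirs(build_directory)

with open(os.path.join(build_directory, "build.ninja"), "w") as f:
    f.write("# ninja rules would be emitted here\n")
```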

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143328
Approved by: https://github.com/kit1980, https://github.com/albanD

(cherry picked from commit dd2cd4279e8d46cce4a35dd1a52017a127809640)

Co-authored-by: atalman <atalman@fb.com>
2024-12-17 18:06:37 -05:00
1d3ffeb7ea [CD] Fix XPU linux CD whl test failure (#143292)
[CD] Fix XPU linux CD whl test failure (#143268)

Follow https://github.com/pytorch/pytorch/pull/142482, refer the original fix PR https://github.com/pytorch/pytorch/pull/130742 and new issue in https://github.com/pytorch/pytorch/actions/runs/12323126436/job/34403681230
Works for https://github.com/pytorch/pytorch/issues/114850

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143268
Approved by: https://github.com/atalman

(cherry picked from commit a8cc19bb51931991253315b31c64ca2db2505cd6)

Co-authored-by: chuanqiw <chuanqi.wang@intel.com>
2024-12-16 17:27:49 -05:00
c92f6871e6 Exclude py 31.3t triton package from PyTorch 3.13t wheel (#143244)
Exclude py 31.3t triton package from PyTorch 3.13t wheel (#143218)

Follow up after https://github.com/pytorch/pytorch/pull/143162
Include triton only for 3.13 packages not 3.13t
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143218
Approved by: https://github.com/kit1980

(cherry picked from commit 3bfdf6f0633e6feb067e032009256c740a2a2665)

Co-authored-by: atalman <atalman@fb.com>
2024-12-16 10:20:30 -05:00
2b84debd97 [CD] Test torch.compile on 3.13 (#143243)
[CD] Test torch.compile on 3.13 (#143207)

Follow up after https://github.com/pytorch/pytorch/pull/143162
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143207
Approved by: https://github.com/atalman, https://github.com/ZainRizvi

(cherry picked from commit 625b4edb975da25818eeae27cdbf9ba916973961)

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-12-16 10:18:31 -05:00
5fbc4aa90a Linux Wheels: Remove triton dependency python < 3.13 constraint (#143199)
Linux Wheels: Remove triton dependency python < 3.13 constraint (#143162)

We do build the pytorch-triton package for Python 3.13: https://github.com/pytorch/pytorch/actions/runs/12304476674/job/34344764271
Hence the constraint is no longer needed.
This stack enabled torch.compile for Python 3.13: https://github.com/pytorch/pytorch/pull/141264
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143162
Approved by: https://github.com/kit1980

(cherry picked from commit 04bb82f09760b8336ae6761b9aa51c2c525d12bb)

Co-authored-by: Andrey Talman <atalman@fb.com>
2024-12-13 09:44:24 -08:00
5363f7d9fd [CD] Use Anaconda cmake for Mac builds (#143133)
[CD] Use Anaconda cmake for Mac builds (#143054)

To find the Anaconda-env-installed OpenMP
(as OpenMP from PyPI is looking for it in different places)

For posterity: our build script names are very confusing:
 - [`.ci/wheel/build_wheel.sh`](6cb6e8d790/.ci/wheel/build_wheel.sh) is only used for MacOS wheel/libtorch builds
 - [`.ci/manywheel/build.sh`](6cb6e8d790/.ci/manywheel/build.sh) are used for Linux wheel/libtorch builds
 - [`.ci/pytorch/windows/build_pytorch.bat`](6cb6e8d790/.ci/pytorch/windows/build_pytorch.bat) is used for Windows wheel builds

Fixes https://github.com/pytorch/pytorch/issues/142873
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143054
Approved by: https://github.com/Jack-Khuu, https://github.com/atalman

(cherry picked from commit 4d8357e912ec5a9d60f10b44bb699950e4472488)

Co-authored-by: Nikita Shulga <nshulga@meta.com>
2024-12-12 12:41:42 -08:00
f3c0886c05 Remove Checkout pytorch/builder for Linux Binary Builds (#143125) (#143131)
Follow Up after: https://github.com/pytorch/pytorch/pull/142282

Remove Checkout pytorch/builder for Linux Binary Builds
I believe we were not using builder already, hence remove this checkout.
We should be using scripts from this folder:
```
/pytorch/.ci/${{ inputs.PACKAGE_TYPE }}/build.sh
```

TODO: Will followup with removing BUILDER_ROOT everywhere from PyTorch repo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143125
Approved by: https://github.com/kit1980

Co-authored-by: atalman <atalman@fb.com>
2024-12-12 12:04:25 -08:00
aad1c160a7 Cherry-pick reverts dfe5669 and 1b3f8b7 (#143092)
* Revert "[RELAND] Add device-agnostic runtime Device/Stream C++ API (#138677)"

This reverts commit 734bb01460d59e661e9114e7aa17e04821e4b57a.

Reverted https://github.com/pytorch/pytorch/pull/138677 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the new test is still very flaky on MacOS even when it does not segfault anymore ([comment](https://github.com/pytorch/pytorch/pull/133572#issuecomment-2537256522))

* Revert "[RELAND] Add UTs for accelerator device-agnostic runtime APIs (#133572)"

This reverts commit 209119424922b135fef39aba1f25da3b67f5879a.

Reverted https://github.com/pytorch/pytorch/pull/133572 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the new test is still very flaky on MacOS even when it does not segfault anymore ([comment](https://github.com/pytorch/pytorch/pull/133572#issuecomment-2537256522))

---------

Co-authored-by: PyTorch MergeBot <pytorchmergebot@users.noreply.github.com>
2024-12-11 21:35:44 -08:00
af92bad804 [RELEASE-ONLY CHANGES] Branch Cut for Release 2.6 (#143085)
* Run apply-release-changes.sh

* Use test in docker-release.yml

* Fix spaces in lint.yaml
2024-12-11 17:47:33 -08:00
c69eae32ba Use validate-docker-images workflow from test-infra (#143083)
Use validate-docker-images workflow from test-infra (#143081)

After PR: https://github.com/pytorch/test-infra/pull/6029 use validate-docker-images.yml from test-infra.
Related to: https://github.com/pytorch/builder/issues/2054

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143081
Approved by: https://github.com/huydhn

(cherry picked from commit bd7d81db9e8b54cd7042fcb724b9b445e6641cf9)

Co-authored-by: atalman <atalman@fb.com>
2024-12-11 16:31:29 -08:00
f7e621c3ce [ROCm] TunableOp do not log during exit (#142818)
Depending on the order of static object destruction, the TunableOp logger is not available.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142818
Approved by: https://github.com/jeffdaily
2024-12-11 17:44:29 +00:00
233853a66f Revert "Prologue Fusion (#134532)"
This reverts commit 59ab3825e77451b29c3a118fd24304afcbf52c09.

Reverted https://github.com/pytorch/pytorch/pull/134532 on behalf of https://github.com/clee2000 due to A couple of PRs in this stack are breaking internally on different tests ([comment](https://github.com/pytorch/pytorch/pull/134532#issuecomment-2536643675))
2024-12-11 17:32:26 +00:00
f0b80d014d Revert "Update low prec codegen for div/mod (#142350)"
This reverts commit 1fb3d5a4e35d4ea5691287d4ce77da40578bda4a.

Reverted https://github.com/pytorch/pytorch/pull/142350 on behalf of https://github.com/clee2000 due to A couple of PRs in this stack are breaking internally on different tests ([comment](https://github.com/pytorch/pytorch/pull/134532#issuecomment-2536643675))
2024-12-11 17:32:26 +00:00
829a93562a Revert "[Easy] factor out inductor ophandler decompositions (#142400)"
This reverts commit fa746e3eeb8e1cdcbe3f47ded9e3ca30efac383c.

Reverted https://github.com/pytorch/pytorch/pull/142400 on behalf of https://github.com/clee2000 due to A couple of PRs in this stack are breaking internally on different tests ([comment](https://github.com/pytorch/pytorch/pull/134532#issuecomment-2536643675))
2024-12-11 17:32:26 +00:00
9e88279737 Revert "Add a pass which analyzes whether a prologue preserves zero mask (#142401)"
This reverts commit 1a0bd402436af4c127817a31c76d7ae47d4668b2.

Reverted https://github.com/pytorch/pytorch/pull/142401 on behalf of https://github.com/clee2000 due to A couple of PRs in this stack are breaking internally on different tests ([comment](https://github.com/pytorch/pytorch/pull/134532#issuecomment-2536643675))
2024-12-11 17:32:25 +00:00
b118702a4e Revert "Infer whether prologues can be computed without upcasting to fp32 without changing numerics (#142402)"
This reverts commit f2d8d7b7acf12f079cadc41b9fdd91cbae94daac.

Reverted https://github.com/pytorch/pytorch/pull/142402 on behalf of https://github.com/clee2000 due to A couple of PRs in this stack are breaking internally on different tests ([comment](https://github.com/pytorch/pytorch/pull/134532#issuecomment-2536643675))
2024-12-11 17:32:25 +00:00
2dcba6eac8 Revert "dont attempt to fuse in unaligned accesses to mm (#142435)"
This reverts commit 22683195964398b37ba0d539cb1bb55bff197db6.

Reverted https://github.com/pytorch/pytorch/pull/142435 on behalf of https://github.com/clee2000 due to A couple of PRs in this stack are breaking internally on different tests ([comment](https://github.com/pytorch/pytorch/pull/134532#issuecomment-2536643675))
2024-12-11 17:32:25 +00:00
5c97ac9721 Revert "Remove unused Python variables in torch/[_-a]* (#133492)"
This reverts commit fda975a7b3071a20dab8fc2c4e453479e1bb7cf2.

Reverted https://github.com/pytorch/pytorch/pull/133492 on behalf of https://github.com/clee2000 due to Sorry, I need to revert this in order to revert something else.  The only thing you need to do is rebase and remerge ([comment](https://github.com/pytorch/pytorch/pull/133492#issuecomment-2536635516))
2024-12-11 17:29:12 +00:00
db51308d9c fix output node name (#142506)
Fixes #142227

Differential Revision: [D67043283](https://our.internmc.facebook.com/intern/diff/D67043283/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142506
Approved by: https://github.com/ydwu4
2024-12-11 17:28:28 +00:00
2374d460d0 Revert "filelock: Make waitcounter variant to use (#139816)"
This reverts commit 237c4b559c0f928dd89cf1e773458a1bdcea0b9d.

Reverted https://github.com/pytorch/pytorch/pull/139816 on behalf of https://github.com/clee2000 due to Sorry, I need to revert this in order to revert something else.  The only thing you need to do is rebase and remerge ([comment](https://github.com/pytorch/pytorch/pull/139816#issuecomment-2536616808))
2024-12-11 17:26:46 +00:00
498a7808ff Fix unused Python variables outside torch/ and test/ (#136359)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136359
Approved by: https://github.com/albanD
2024-12-11 17:10:23 +00:00
241bf047b3 [Dynamo] Skip some unresolvable tests (#142508)
Fixes #127738
Fixes #127755

In the discussion in https://github.com/pytorch/pytorch/issues/127738 we
determined that this is not fixable, so we're just going to skip the
test.

Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142508
Approved by: https://github.com/StrongerXi, https://github.com/yanboliang, https://github.com/mlazos
ghstack dependencies: #142502, #142503
2024-12-11 17:00:23 +00:00
00ac4237b2 [Dynamo] stop import third-party astunparse (#142503)
PyTorch's minimum version is 3.9, so we can now use ast.unparse.

Test Plan:
- wait for tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142503
Approved by: https://github.com/StrongerXi, https://github.com/yanboliang, https://github.com/mlazos
ghstack dependencies: #142502
2024-12-11 17:00:23 +00:00
0268abd627 [Dynamo] Stop importing transformers (#142502)
This import was free because transformers should already have been
imported by this time.

Test Plan:
- CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142502
Approved by: https://github.com/StrongerXi, https://github.com/yanboliang, https://github.com/mlazos
2024-12-11 17:00:22 +00:00
8fd4b26504 Revert "[dynamo] Support multiple inheritance for custom dict construction (#142416)"
This reverts commit a45326b6497e47d01527e141cdd16d91fee94c18.

Reverted https://github.com/pytorch/pytorch/pull/142416 on behalf of https://github.com/clee2000 due to The newly added test is faling internally D67056273 ([comment](https://github.com/pytorch/pytorch/pull/142416#issuecomment-2536537693))
2024-12-11 16:56:26 +00:00
c3b30c283f add additional CK BMM instances (#142409)
Summary: stacked changes to keep new codegen-ed instances below 2000 LOC

Reviewed By: zjing14

Differential Revision: D66738746

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142409
Approved by: https://github.com/mxz297, https://github.com/xw285cornell
2024-12-11 16:54:05 +00:00
d622040ab1 [AOTI] Unskipped test_scaled_dot_product_efficient_attention for ROCm (#142138)
The test should no longer fail for ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142138
Approved by: https://github.com/janeyx99
2024-12-11 16:36:04 +00:00
d5e00412c7 sparse_broadcast_to: less memory footprint, fewer kernel launches (#142364)
As per title.

The following implementation removes the usage of `repeat_interleave`, `tile`, and `full_coo_indices` and replaces them with broadcasting. That way we reduce memory traffic (and are likely to hit the cache a lot) and the total number of launched kernels.
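
A generic sketch of the idea (not the actual `sparse_broadcast_to` code): broadcasting an index tensor reproduces what `repeat_interleave` builds, without materializing the intermediate copy up front.

```python
import torch

rows = torch.arange(3)
tiled = rows.repeat_interleave(2)                       # tensor([0, 0, 1, 1, 2, 2])
broadcast = rows.unsqueeze(1).expand(3, 2).reshape(-1)  # same values; expand is a view
assert torch.equal(tiled, broadcast)
```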

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142364
Approved by: https://github.com/amjames, https://github.com/cpuhrsch
2024-12-11 16:09:09 +00:00
eed9bb3a0e allow -E to be in any spot in the compiler command (#142813)
Follow up of TODO in https://github.com/pytorch/pytorch/pull/140614

It was found experimentally that, for one GPU architecture, `sccache` passes `-E` as the 1st, 2nd, or 3rd argument, but it is much better to handle `-E` when it is passed as any argument.

No need to worry about exit or elif chains, as `exec` aborts script execution

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142813
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <nshulga@meta.com>
2024-12-11 15:58:08 +00:00
ee817e8cf3 [ROCm] Second attempt to fix unit test: matmul_small_brute_force_tunableop (#142422)
Fixes #141458
Fixes #141635
Fixes #141636

~~Address OOM issue by clearing PyTorch's caching allocator.~~

Disabling this test on NVIDIA since it doesn't do much on NVIDIA hardware at the moment.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142422
Approved by: https://github.com/jeffdaily, https://github.com/eqy
2024-12-11 15:36:37 +00:00
371bcc1e33 [checkpointing][oss] Throw an error when loading a different size than saved tensor (#141571)
Summary: Fixing issue reported in https://github.com/pytorch/pytorch/issues/126604

Test Plan: buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/distributed/checkpoint:test_planner -- --exact 'caffe2/test/distributed/checkpoint:test_planner - test_planner.TestLoadPlanner: test_strict'

Differential Revision: D66389578

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141571
Approved by: https://github.com/mhorowitz
2024-12-11 15:35:48 +00:00
bacd68107a [inductor] Parenthesize expression in _helper_sqrt (#142352)
Fixes https://github.com/pytorch/pytorch/issues/142328. The implied cast-then-sqrt order matches the behavior of the `halide` backend.

2cc01cc6d3/torch/_inductor/codegen/halide.py (L115-L116)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142352
Approved by: https://github.com/ezyang
2024-12-11 15:30:52 +00:00
86300965b6 Add automatic_dynamic_shapes_mark_as == "oblivious" (#141444)
Fixes https://github.com/pytorch/pytorch/issues/137100

Should also add a mark_oblivious API for manual control.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141444
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #141415
2024-12-11 14:39:13 +00:00
e53696bfdb automatic_dynamic_shapes_mark_as (#141415)
This adds an option to cause automatic dynamic shapes to trigger
unbacked SymInts rather than backed SymInts.  This can potentially
help if you are still seeing recompilations from 0/1 specialization
but it also might just cause your program to fail with
GuardOnDataDependent errors.
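
A hedged usage sketch; it assumes the option lives in `torch._dynamo.config` under the name from the title and accepts "unbacked" as a value, neither of which this log confirms:

```python
import torch

# Assumed config location and value; see the caveat above.
torch._dynamo.config.automatic_dynamic_shapes_mark_as = "unbacked"

@torch.compile
def f(x):
    return x * 2

f(torch.randn(4))
f(torch.randn(8))  # a size change would normally trigger automatic dynamic shapes
```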

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141415
Approved by: https://github.com/bobrenjc93
2024-12-11 14:39:13 +00:00
b576a8c318 Tests Generelization for multiple accelerator devices (#139184)
Motivation: Generalize unit tests so that they can be executed for CUDA and non-CUDA devices.
Dependency: #133209, merged now.
There was a #135242 for these changes, closed due to incorrect commits. I have incorporated the changes as suggested in the comments.
@kwen2501  @zeshengzong Please review the changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139184
Approved by: https://github.com/kwen2501

Co-authored-by: Yu, Guangye <guangye.yu@intel.com>
2024-12-11 13:31:20 +00:00
539c46b6e8 [Dynamo] Add register_hook as in-graph tensor method (#142820)
Fixes https://github.com/pytorch/pytorch/issues/141046
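
A usage sketch of the now-traceable pattern: registering a backward hook on an intermediate tensor inside a compiled function.

```python
import torch

def fn(x):
    y = x * 2
    y.register_hook(lambda g: g * 3)  # previously unsupported inside the compiled region
    return y.sum()

x = torch.randn(4, requires_grad=True)
torch.compile(fn)(x).backward()
print(x.grad)  # each element's gradient is 2 * 3 = 6
```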

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142820
Approved by: https://github.com/StrongerXi, https://github.com/yanboliang
2024-12-11 12:02:03 +00:00
c29b4edbb9 Remove no-op aot_compilation_time (#142490)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142490
Approved by: https://github.com/xuzhao9
2024-12-11 10:37:25 +00:00
30d8b30db7 refactor tensorify restart logic to use sources (#141517)
Differential Revision: [D67066706](https://our.internmc.facebook.com/intern/diff/D67066706)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141517
Approved by: https://github.com/ezyang
2024-12-11 07:15:39 +00:00
bdbdbeeb3d Implements nonzero_static on cuda (#141838)
using blockwide cub primitives.
This adds CUDA functionality for nonzero_static, which was missing in https://github.com/pytorch/pytorch/pull/97417.

For `size` approximately equal to the number of nonzeros, the perf is very close to the regular version; for larger sizes, filling in the padding indices takes additional time.
Disabled for cuda <=11.4
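
Usage sketch (assumes a CUDA device is available):

```python
import torch

x = torch.tensor([0.0, 1.5, 0.0, 3.0], device="cuda")
# The output has a fixed number of rows, so no host synchronization is needed;
# slots beyond the actual number of nonzeros are padded with fill_value.
idx = torch.nonzero_static(x, size=4, fill_value=-1)
print(idx)  # indices 1 and 3, then padding rows of -1
```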

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141838
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-12-11 06:44:48 +00:00
1d3b0108a6 [subclass] Fix unwrap subclass parametrization for Nested subclasses (#142481)
@tugsbayasgalan found a bug for nested subclasses:

E.g. we have

TwoTensor(TwoTensor(t1, t2), t0).
After right_inverse we have:

rebuilt_stack == [(TwoTensor, meta, ["a", "b"]), (TwoTensor, meta, ["a", "b"])]
plain_tensors == [t0, t1, t2]
We will first put plain tensors, and only then the nested TwoTensor.

But when we unflatten:
todo = [t0, t1, t2]
we first create TwoTensor[t1, t2]
put it to todo [t0, TwoTensor[t1, t2]]
And as a result get

 TwoTensor(t0, TwoTensor(t1, t2))
which swaps the original a and b :)

So the fix should be different: we need to preserve the order of elements in the stack for plain tensors/subclasses.
I will think about the fix.

Fix:

Keep the order of inner_tensor_attr_names according to the order in which they were added to the stack (first plain tensor attributes, then subclass attributes).

Test:
```
python test/functorch/test_aotdispatch.py -k test_subclass_parameters
```

Differential Revision: [D67032477](https://our.internmc.facebook.com/intern/diff/D67032477)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142481
Approved by: https://github.com/tugsbayasgalan, https://github.com/bdhirsh
2024-12-11 06:05:48 +00:00
7e92b02e09 add test for module list slice (#142520)
Nothing to fix for #142439

Differential Revision: [D67049962](https://our.internmc.facebook.com/intern/diff/D67049962/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142520
Approved by: https://github.com/ydwu4
2024-12-11 05:11:00 +00:00
256bfd1096 Rename 'cache limit' to 'recompile limit' (#141542)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141542
Approved by: https://github.com/oulgen, https://github.com/jansel
2024-12-11 05:05:11 +00:00
84cf94ee0b Make more of the reshape_symint stride calculation oblivious (#142488)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142488
Approved by: https://github.com/albanD
2024-12-11 05:04:42 +00:00
921ba0a75e Mark torch._library.custom_ops / torch._dynamo.eval_frame as uninteresting (#142492)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142492
Approved by: https://github.com/bobrenjc93
2024-12-11 05:03:30 +00:00
47a571e166 Document that load_inline requires having a compiler installed (#137521)
Prompted by this forum q: https://discuss.pytorch.org/t/are-the-requirements-for-using-torch-utils-cpp-extension-with-cuda-documented-anywhere/211222

Would be curious to know if we could get more precise.
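
A minimal usage sketch illustrating the requirement: `load_inline` JIT-compiles the C++ source, so a working host compiler (e.g. g++, clang, or MSVC) must be installed and on PATH.

```python
import torch
from torch.utils.cpp_extension import load_inline

mod = load_inline(
    name="add_one_ext",
    cpp_sources="torch::Tensor add_one(torch::Tensor x) { return x + 1; }",
    functions=["add_one"],  # auto-generates the Python binding
)
print(mod.add_one(torch.zeros(3)))  # tensor([1., 1., 1.])
```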

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137521
Approved by: https://github.com/zou3519
2024-12-11 03:47:54 +00:00
21833c9642 Added Differentiable per_sample_weights Check to EmbeddingBag.cpp (#142338)
Added a check in aten/src/ATen/native/EmbeddingBag.cpp for whether per_sample_weights needs a gradient, in order to determine whether at::_embedding_bag_forward_only or at::_embedding_bag should run.

Also, added two tests in test_embedding.py that check if the command now works.
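
An illustrative repro of the scenario being checked: per_sample_weights that themselves require a gradient (only supported with mode="sum").

```python
import torch
import torch.nn as nn

bag = nn.EmbeddingBag(10, 3, mode="sum")
inp = torch.tensor([1, 2, 4, 5])
offsets = torch.tensor([0, 2])
w = torch.rand(4, requires_grad=True)  # differentiable per-sample weights

out = bag(inp, offsets, per_sample_weights=w)
out.sum().backward()
print(w.grad.shape)  # torch.Size([4])
```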

Fixes #136457
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142338
Approved by: https://github.com/soulitzer
2024-12-11 03:42:17 +00:00
92cc345683 Implement "torch.mtia.max_memory_allocated" API (#142406)
Summary: This diff implements the inferface of  "torch.mtia.max_memory_allocated" API. The internal implementation will be addressed in a separate diff.

Test Plan:
Passed a local unit test: `buck run //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api`

```
----------------------------------------------------------------------
Ran 15 tests in 16.862s

OK
I1127 11:31:14.613909 2272144 afg_bindings.cpp:943] afg-aten::mul.out-dtype_Float-uqJKuNc0 executable has been unloaded
I1127 11:31:14.615438 2272144 afg_bindings.cpp:943] afg-add-dtype_Float-fa37JncC executable has been unloaded
```

Reviewed By: ttrung149, nautsimon

Differential Revision: D66553954

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142406
Approved by: https://github.com/nautsimon
2024-12-11 03:06:18 +00:00
ed388394d1 add torchrec collectives to enforce global ordering (#141970)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141970
Approved by: https://github.com/yf225
2024-12-11 02:45:24 +00:00
082124a322 [Dynamo] Refactor to use install subgraph method in higher order ops (#141384)
Replaced the function in HOP infra with a method on output graph to make it more general and accessible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141384
Approved by: https://github.com/zou3519
ghstack dependencies: #141381, #141382, #141383
2024-12-11 02:22:21 +00:00
c31543c7ae [Dynamo] Initial deduplication pass impl (#141383)
This PR implements the deduplication pass (blocked by config currently) for dynamo where identical regions from https://github.com/pytorch/pytorch/pull/141381 are replaced with a common subgraph.

The two phases of deduplication are explained below.

**Subgraph creation**:
Subgraph creation works by taking one representative region from each region group and creating a subgraph from it, which will then be used to replace all regions in the group. This is implemented by first copying all nodes of the region to the new subgraph and then finding all inputs which are not within the region and creating placeholders for them. For the outputs, all regions in a region group need to be scanned to ensure the largest set of outputs is found, and then an output node is created which returns a tuple of all outputs.

**Graph replacement**:
To replace each region with the extracted subgraph, the node index in the region and argument index within the node's flattened args and kwargs are recorded once during subgraph creation. This allows us to determine which (external to the region) nodes and in which order these nodes are passed as inputs. For the outputs, getitem nodes are created for each output, and all nodes in the region with external outputs are replaced by the proper getitem node. Finally, all original nodes are erased (there should be no uses of these left in the graph).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141383
Approved by: https://github.com/zou3519
ghstack dependencies: #141381, #141382
2024-12-11 02:22:21 +00:00
49e4307686 [Dynamo] add debug logging for graph region expansion (#141382)
This PR adds debug logging for the region expansion algorithm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141382
Approved by: https://github.com/williamwen42
ghstack dependencies: #141381
2024-12-11 02:22:21 +00:00
96c36a6947 [Dynamo] Implement graph region tracking for deduplication (#141381)
This PR implements graph region tracking for later extraction into common subgraphs. The algorithm is as follows:

`GraphRegionTracker` tracks each node added to the output graph and generates a key based on the source location, instruction pointer, input shapes, and global state at the time the node is inserted into the graph. Nodes with the same key are grouped together in a list of identical nodes.

Once graph capture is complete, these nodes are organized into region groups. A region group looks like this:
[[IdenticalNode1], [IdenticalNode2], [IdenticalNode3]] and each sublist is called a region. For each region group (starting at the topologically latest region group), the inner regions are gradually expanded one node at time from args and kwargs of the node in each region provided that for all regions in the group, the nodes being added are also identical (ie have the same key computed above). The `get_identical_regions` function is the main entry point which will be used by the graph replacement algorithm in #141383

Edge cases to add more testing for in future PRs (in progress):
* ~~multiple nodes on the same line~~ (implemented)
* ~~dynamic shapes checking (need to verify symbolic inputs are the same across subgraphs)~~ (implemented)
* ensure we don't expand regions where it will create a cycle during subgraph replacement
* ensure outputs are always tensors (or tuples of tensors iirc)
* ~~out of order kwargs, unevenly nested kwargs~~ (implemented)
* input aliasing - TBD, we may add support for this in `invoke_subgraph` or reuse the aliasing analysis here to not form regions with these properties
* ~~all global state~~ (implemented)

Other followups:
* consolidate global state checking across all caching infra

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141381
Approved by: https://github.com/zou3519
2024-12-11 02:22:21 +00:00
734bb01460 [RELAND] Add device-agnostic runtime Device/Stream C++ API (#138677)
# Motivation
This PR intends to add C++ accelerator device-agnostic APIs.

# Additional Context
This PR is relanded. It was reverted because `torch.Event` didn't support the mps backend. We have fixed it in https://github.com/pytorch/pytorch/pull/142468. The previous commit is f84e533a2c

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138677
Approved by: https://github.com/albanD, https://github.com/EikanWang
ghstack dependencies: #142468, #133572
2024-12-11 02:04:52 +00:00
2091194249 [RELAND] Add UTs for accelerator device-agnostic runtime APIs (#133572)
# Motivation
This PR intends to add UTs for accelerator device-agnostic APIs.

# Additional Context
This PR is relanded. It was reverted because `torch.Event` didn't support the mps backend. We have fixed it in https://github.com/pytorch/pytorch/pull/142468. The previous commit is 952514f0c8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133572
Approved by: https://github.com/EikanWang, https://github.com/albanD
ghstack dependencies: #142468
2024-12-11 02:04:52 +00:00
88154024b3 [pipelining] Add ZBV schedule (#142084)
Adds ZBV schedule which is explained in https://arxiv.org/pdf/2401.10241, Section 6. Tested it works under the new PipelineScheduleRuntime by fixing a small bug in handling V-shaped schedules. This PR is a replacement for https://github.com/pytorch/pytorch/pull/138444

cc the original authors: @QPHutu @ufotalent https://github.com/pytorch/pytorch/pull/138444#issuecomment-2472684977

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142084
Approved by: https://github.com/kwen2501
2024-12-11 02:00:57 +00:00
95b17f6346 [MPS] Add CompileShader method (#141478)
This allows one to do something like this:
```python
import torch
x = torch.ones(10, device="mps")
m = torch.mps._compile_shader("""
   kernel void foo(device float* x, uint idx [[thread_position_in_grid]]) {
     x[idx] += idx;
   }
")
m.foo(x)
```

And in general this enables writing custom operators using Metal shaders purely in Python.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141478
Approved by: https://github.com/manuelcandales
2024-12-11 02:00:51 +00:00
d2e5e5b1a5 [AOTI] Remove redundant AOTI_TORCH_EXPORT (#142500)
Summary: Remove redundant AOTI_TORCH_EXPORT from shim_common.cpp since these functions are already declared with AOTI_TORCH_EXPORT in the corresponding header file. This is to solve the issue in https://github.com/pytorch/pytorch/pull/140030#issuecomment-2528760716

Differential Revision: D67031626

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142500
Approved by: https://github.com/frank-wei
2024-12-11 01:59:37 +00:00
cyy
7d98b3dcee [3/N] Apply bugprone-unchecked-optional-access (#142442)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142442
Approved by: https://github.com/albanD
2024-12-11 01:39:10 +00:00
2b105de2c1 [Monitor] Enable non-perf linux test monitor (#142168)
# Overview
Enable monitoring for non-perf Linux tests

# Other
- Move the monitoring step to right before the build-artifact step in mac_test.yml; note that monitoring is not yet enabled for this test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142168
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2024-12-11 01:10:43 +00:00
393cf46f42 Revert "[MPS] Add CompileShader method (#141478)"
This reverts commit 0478fee42db16a0477add1d0a644ce713f31a875.

Reverted https://github.com/pytorch/pytorch/pull/141478 on behalf of https://github.com/malfet due to Broke doctests, by trying to run MPS example on Linux ([comment](https://github.com/pytorch/pytorch/pull/141478#issuecomment-2533351909))
2024-12-11 00:37:10 +00:00
b94a206414 [CI] Use sccache-0.8.2 for CUDA builds (#140614)
Instead of an ancient prebuilt binary

This is a followup from https://github.com/pytorch/pytorch/pull/121323
For some reason, newer `sccache` does not work when `gcc` is invoked with the `-E` option, so one has to special-case `-E` in the `/opt/ccache/bin/gcc` wrapper. The wrapper also had to be special-cased to work with `nvcc` by checking whether `-E` is passed not only as the first or second argument but as the third as well (to be followed up by a generic https://github.com/pytorch/pytorch/pull/142813 ), i.e. to generate the following wrapper:
```shell
#!/bin/sh

if [ "$1" = "-E" ] || [ "$2" = "-E" ] || [ "$3" = "-E" ]; then
  exec /usr/bin/gcc "$@"
elif [ $(env -u LD_PRELOAD ps -p $PPID -o comm=) != sccache ]; then
  exec sccache /usr/bin/gcc "$@"
else
  exec /usr/bin/gcc "$@"
fi
```

Without it, `sccache nvcc hello.cu` failed with the non-descriptive error
```
    sccache: error: failed to execute compile
    sccache: caused by: Compiler not supported: ""
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140614
Approved by: https://github.com/wdvr

Co-authored-by: Wouter Devriendt <wouterdevriendt@meta.com>
2024-12-11 00:34:38 +00:00
ea152d2472 [be] better error message for flight recorder status (#142505)
Summary:
Change the log in waitForFutureOrTimeout back to VLOG(2). Instead, print a more user-friendly message if FR completes successfully.
This message is meant for developers only, so don't default to `INFO` in this function.

Also, change one more message from LOG(ERROR) to LOG(INFO).

Tested locally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142505
Approved by: https://github.com/kwen2501
2024-12-11 00:10:59 +00:00
eqy
bd4071f0b0 [Matmul][CUDA][FP8] Skip rowwise scaling tests on non-sm90 (#141596)
Since the current kernel uses sm90-specific features, just preemptively skip the test for any non-sm90 compute capability.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141596
Approved by: https://github.com/drisspg
2024-12-10 23:16:19 +00:00
4a16a60052 [C10D] Add better profiling title for NCCL barrier, nccl:all_reduce to nccl:all_reduce_barrier (#140785)
Fixes [issue](https://github.com/pytorch/pytorch/issues/140257)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140785
Approved by: https://github.com/wconstab
2024-12-10 23:08:15 +00:00
237c4b559c filelock: Make waitcounter variant to use (#139816)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139816
Approved by: https://github.com/ezyang
2024-12-10 23:02:59 +00:00
e36fbbf826 Fix ARM bfloat16 fmsub & improve vec_test_all_types coverage (#142499)
This function was very broken and untested. Now it is tested, and vec_test_all_types is passing internally as well.

Differential Revision: [D67036894](https://our.internmc.facebook.com/intern/diff/D67036894/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142499
Approved by: https://github.com/malfet
2024-12-10 22:51:41 +00:00
0478fee42d [MPS] Add CompileShader method (#141478)
This allows one to do something like this:
```python
import torch
x = torch.ones(10, device="mps")
m = torch.mps._compile_shader("""
   kernel void foo(device float* x, uint idx [[thread_position_in_grid]]) {
     x[idx] += idx;
   }
")
m.foo(x)
```

And in general this enables writing custom operators using Metal shaders purely in Python.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141478
Approved by: https://github.com/manuelcandales
2024-12-10 22:43:17 +00:00
95e7fcf82e inductor: remove duplicate triton configs for autotuning (#142254)
Summary:
# Why

- sampling the same config multiple times is wasteful, especially on exhaustive
- for AMD we rewrite the configs to have a specific number of stages, which might lead to some configs appearing multiple times

# What

cast the configs, already defined as a tuple, through a set to remove duplicates
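
A one-line illustration of the dedup (config values here are made up):

```python
configs = ((128, 128, 32, 0, 4), (128, 128, 32, 0, 4), (64, 64, 16, 0, 2))
deduped = tuple(set(configs))  # exact duplicates collapse; note that order is not preserved
assert len(deduped) == 2
```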

Test Plan:
taken from the `mm_kernel_configs` logic in the same file
```
>>> mm_kernel_configs = [
...     {"config": (BLOCK_M, BLOCK_N, BLOCK_K, num_stages, num_warps), "cond": True}
...     for BLOCK_M, BLOCK_N, BLOCK_K in itertools.product([16, 32, 64, 128, 256], repeat=3)
...     for num_stages in [1, 2, 3, 4, 5]
...     for num_warps in [2, 4, 8]
... ]
>>> configs = [c['config'] for c in mm_kernel_configs]
>>> a = tuple((c[0], c[1], c[2], 0, c[4]) for c in configs)
>>> len(set(a))
375
>>> len(a)
1875
>>>
```

Differential Revision: D66893774

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142254
Approved by: https://github.com/henrylhtsang
2024-12-10 22:19:06 +00:00
da29c13693 [while_loop] data-dependent op in body_fn (#142031)
The idea is the parent hop's fake tensor mode should ignore the newly allocated unbacked symints in subgraph because the bindings of unbacked symbols in the subgraph should already be done when we trace the subgraph. E.g. if there's an operator in subgraph that produces unbacked symints, the track_tensor_tree logic for that operator will take care of it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142031
Approved by: https://github.com/zou3519
ghstack dependencies: #142162
2024-12-10 21:54:28 +00:00
7111cd6ee0 [hop][BE] add util diff_meta with prettier error message. (#142162)
The error message changes from:
```python
-torch._dynamo.exc.Unsupported: Expected branches to return tensors with same metadata. [(tensor_pair, difference)...]:[('pair0:', TensorMetadata(shape=torch.Size([4, 3]), dtype=torch.float32, requires_grad=False, stride=(3, 1), memory_format=None, is_quantized=False, qparams={}), TensorMetadata(shape=torch.Size([2, 3]), dtype=torch.float32, requires_grad=False, stride=(3, 1), memory_format=None, is_quantized=False, qparams={}))]
```
to
```python
+torch._dynamo.exc.Unsupported: Expect branches to return tensors with same metadata but find pair[0] differ in 'shape', where lhs is TensorMetadata(shape=torch.Size([4, 3]), dtype=torch.float32, requires_grad=False, stride=(3, 1), memory_format=None, is_quantized=False, qparams={}) and rhs is TensorMetadata(shape=torch.Size([2, 3]), dtype=torch.float32, requires_grad=False, stride=(3, 1), memory_format=None, is_quantized=False, qparams={})
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142162
Approved by: https://github.com/zou3519
2024-12-10 21:54:28 +00:00
9ced54a51a [hop] lift free symbols in slice (#142385)
Before the change, we get an unfound proxy error when linting the subgraph.

After the change, we have the following dynamo graph for dynamic_shape test.

```python
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]  /data/users/yidi/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]     def forward(self, s0: "Sym(s0)", s1: "Sym(s1)", s2: "Sym(s2)", L_x_: "f32[s0, s1, s2][s1*s2, s2, 1]cpu"):
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]         l_x_ = L_x_
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]          # File: /data/users/yidi/pytorch/test/dynamo/test_higher_order_ops.py:307 in f, code: i = x.size(0) - 2
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]         sub: "Sym(s0 - 2)" = s0 - 2
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]          # File: /data/users/yidi/pytorch/test/dynamo/test_higher_order_ops.py:308 in f, code: j = x.size(1) - 3
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]         sub_1: "Sym(s1 - 3)" = s1 - 3
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]          # File: /data/users/yidi/pytorch/test/dynamo/test_higher_order_ops.py:310 in f, code: return wrap(lambda x: x[:i, :j, k:], x)
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]         wrap_body_0 = self.wrap_body_0
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]         wrap = torch.ops.higher_order.wrap(wrap_body_0, s0, s1, s2, l_x_, sub, sub_1);  wrap_body_0 = s0 = s1 = s2 = l_x_ = sub = sub_1 = None
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]         getitem: "f32[s0 - 2, s1 - 3, 0][s1*s2, s2, 1]cpu" = wrap[0];  wrap = None
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]         return (getitem,)
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]     class wrap_body_0(torch.nn.Module):
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]         def forward(self, s0: "Sym(s0)", s1: "Sym(s1)", s2: "Sym(s2)", l_x_: "f32[s0, s1, s2][s1*s2, s2, 1]cpu", sub: "Sym(s0 - 2)", sub_1: "Sym(s1 - 3)"):
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]              # File: /data/users/yidi/pytorch/test/dynamo/test_higher_order_ops.py:310 in <lambda>, code: return wrap(lambda x: x[:i, :j, k:], x)
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]             getitem: "f32[s0 - 2, s1 - 3, 0][s1*s2, s2, 1]cpu" = l_x_[(slice(None, sub, None), slice(None, sub_1, None), slice(s2, None, None))];  l_x_ = sub = sub_1 = s2 = None
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]             return (getitem,)
```

We lift sub and sub_1 because they're compound expressions that are directly used as arguments of the getitem node. We lift s0, s1, and s2 because they're basic symbols in the tensor input.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142385
Approved by: https://github.com/zou3519
2024-12-10 21:52:30 +00:00
795ff0e9f7 [ROCm] Improve reduce sum calculation for low CU count (#141378)
Improve reduce sum calculation for low CU count by enabling splitting the rows across warps for some 2D tensor shapes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141378
Approved by: https://github.com/jeffdaily
2024-12-10 21:48:56 +00:00
fda975a7b3 Remove unused Python variables in torch/[_-a]* (#133492)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133492
Approved by: https://github.com/albanD
2024-12-10 21:48:44 +00:00
2268319596 don't attempt to fuse in unaligned accesses to mm (#142435)
This isn't profitable: we were trying to fuse the padding of an unaligned mm into the mm itself, which defeats the purpose of padding.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142435
Approved by: https://github.com/jansel
ghstack dependencies: #134532, #142350, #142400, #142401, #142402
2024-12-10 21:35:26 +00:00
f2d8d7b7ac Infer whether prologues can be computed without upcasting to fp32 without changing numerics (#142402)
For prologues that only do loads (such as gathers) or dtype conversions, with no actual arithmetic on lower-precision types, we can codegen them without upcasting to fp32 and without changing numerics.

Prologues that actually do arithmetic will need to use invoke quant. But I would like to support upcasts/gathers out of the box.

We could potentially extend this in the future to avoid upcasting max pooling operations as well, if there were perf benefits to be had (less likely).
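
A simplified sketch of the check this describes (the op names and helper are assumptions for illustration, not the actual inductor pass):

```python
# Ops that are safe to keep in low precision because they perform no arithmetic.
LOW_PRECISION_SAFE_OPS = {"load", "index_expr", "constant", "to_dtype"}  # assumed names

def can_skip_fp32_upcast(prologue_ops):
    """Return True if every op in the prologue is a load or a dtype conversion."""
    return all(op in LOW_PRECISION_SAFE_OPS for op in prologue_ops)
```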

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142402
Approved by: https://github.com/jansel
ghstack dependencies: #134532, #142350, #142400, #142401
2024-12-10 21:26:03 +00:00
bee445c3a3 [MPS] Support torch.Event for MPS (#142468)
# Motivation
Support `torch.Event` on mps backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142468
Approved by: https://github.com/malfet
2024-12-10 21:17:25 +00:00
1a0bd40243 Add a pass which analyzes whether a prologue preserves zero mask (#142401)
We load inputs to prologue fusion with a mask. That mask must still be zero before we run `tl.dot`. Previously, we would always apply the mask:
```
        tmp0 = tl.load(in_ptr1 + (tl.broadcast_to(xindex, xindex.shape)), a_mask, eviction_policy='evict_last')
        tmp1 = tmp0.to(tl.float32)
        a = tl.where(a_mask, tmp1, 0.0)
```
now we no longer need to:
```
        tmp0 = tl.load(in_ptr1 + (tl.broadcast_to(xindex, xindex.shape)), a_mask, eviction_policy='evict_last')
        tmp1 = tmp0.to(tl.float32)
        a = tmp1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142401
Approved by: https://github.com/jansel
ghstack dependencies: #134532, #142350, #142400
2024-12-10 21:16:13 +00:00
c30dd35877 [Device] Add "mps" to torch._utils._get_device_attr (#142447)
Follow up after  https://github.com/pytorch/pytorch/pull/141098
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142447
Approved by: https://github.com/kit1980
2024-12-10 20:57:17 +00:00
6f8751dcc9 Fix timeout check workflow lint job (#142476)
Fixes https://github.com/pytorch/pytorch/issues/142485

The workflow check lint job timed out in trunk, i.e. https://github.com/pytorch/pytorch/actions/runs/12261226178/job/34207762939, and here was what happened:

1. https://github.com/pytorch/pytorch/pull/142294 landed yesterday to build ROCm on 3.13, but the PR had a landrace with https://github.com/pytorch/pytorch/pull/142282 in the generated workflow file
2. The trunk lint check caught that in https://github.com/pytorch/pytorch/blob/main/.github/scripts/report_git_status.sh#L2
3. However, the script also attempted to print the difference with `git diff .github/workflows`.  This command was the one that got stuck because `git diff` uses a pager by default and waits for a key press to display the next page ¯\_(ツ)_/¯

It took so long to debug this because a timed-out Nova GHA job doesn't print any progress.  I'll create an issue for this.

Bonus:

I also fixed the broken print from the test tools lint job that confuses GitHub https://github.com/pytorch/pytorch/actions/runs/12261226178 with an annotation failure: `Credentials could not be loaded, please check your action inputs`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142476
Approved by: https://github.com/wdvr
2024-12-10 20:47:22 +00:00
f57606ab85 Migrate smoke tests to pytorch/pytorch (#142482)
Related to https://github.com/pytorch/builder/issues/2054
This should fix the nightly xpu failure: https://github.com/pytorch/pytorch/actions/runs/12251477588/job/34180135207 and the rocm failure: https://github.com/pytorch/pytorch/actions/runs/12251477588/job/34182185374, both due to the missing ``/builder/check_binary.sh``

Builder Scripts revision: 3468139e81
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142482
Approved by: https://github.com/chuanqi129, https://github.com/kit1980, https://github.com/malfet, https://github.com/jeffdaily, https://github.com/huydhn
2024-12-10 20:43:36 +00:00
117b6c3e2c [Easy][Dynamo][TVM] remove unnecessary prints (#142445)
This PR intends to remove the unnecessary prints in the auto-scheduler of dynamo's TVM backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142445
Approved by: https://github.com/jansel
2024-12-10 19:52:02 +00:00
e95bd337e1 Some workflows to use oidc instead of AWS keys (#142264)
Roles defined in https://github.com/pytorch-labs/pytorch-gha-infra/pull/563

With this, I think we can get rid of the AWS credentials in the upload-stats environment

Untestable because I can't add branches to the upload-stats environment
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142264
Approved by: https://github.com/huydhn
2024-12-10 19:40:23 +00:00
3c03bc2431 [dynamo] Expand support of enum attribute access (#142268)
This patch changes `EnumVariable` to support access to all types of
attributes, not just non-callable literals.

Fixes #142050.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142268
Approved by: https://github.com/jansel
ghstack dependencies: #142267
2024-12-10 19:32:40 +00:00
b117945918 [dynamo] Remove dead code in ConstantVariable.const_getattr (#142267)
This path is no longer reachable after #113390, which also updated
`test_access_class_method_from_user_class` to reflect that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142267
Approved by: https://github.com/jansel
2024-12-10 19:32:40 +00:00
f74ba5d30d [dynamo] Remove special graph break for self-referential list (#142438)
We introduced a special graph break to avoid max-recursion-depth error
in #100296.

After #111415, the original `test_list_self_reference` no longer
triggers the special graph break because we started modeling root frame
free variables with `LazyVariableTracker`.

After #117426, we no longer build the list items eagerly, and they'll hit
`variable_tracker_cache` when they get lazily constructed later.

As a result, this patch updates the `test_list_self_reference` test and
removes the special graph break.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142438
Approved by: https://github.com/jansel
ghstack dependencies: #142437
2024-12-10 19:23:48 +00:00
4f75f1e80d [dynamo] Use proper item source for NamedTupleVariable (#142437)
Dynamo was generating `GetItemSource(tuple_source, index)` for items of
`NamedTupleVariable`, but that stops working when a user supplied named
tuple has a custom `__getitem__` function with different semantics.

This patch
- fixes the aforementioned issue by using `AttrSource` instead.
- handles named tuple outside `wrap_listlike`, by removing the special
  case of named tuple in `BaseListVariable.cls_for_instance`, since the
  semantics of named tuple is different enough.
- makes sure all constructions of `NamedTupleVariable` have items with
  proper sources.

Fixes #142399.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142437
Approved by: https://github.com/jansel
2024-12-10 19:23:48 +00:00
a45326b649 [dynamo] Support multiple inheritance for custom dict construction (#142416)
This patch applies a local and practical workaround for custom dict
construction when multiple inheritance is involved.

Handling multiple inheritance in general could be a lot more involved,
so I created #142414 to track that.

Fixes #141118.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142416
Approved by: https://github.com/jansel
2024-12-10 19:22:15 +00:00
67cf126cf8 Disable PIP version check in collect_env (#142308)
Disables version check which might require users to reach out to PyPI, reference: https://pip.pypa.io/en/latest/cli/pip/#cmdoption-disable-pip-version-check

Switches pip to be used directly as a python module (`python3 -mpip`) instead of relying on `pip3` or `pip`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142308
Approved by: https://github.com/seemethere
2024-12-10 19:16:36 +00:00
3e28da1e06 Revert "skip test dynamo for aot_dispatch tests on ci (#142185)"
This reverts commit 7eda06b36674afa117b28ad807c3421c94e775c1.

Reverted https://github.com/pytorch/pytorch/pull/142185 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but I think it has a landrace in trunk ([comment](https://github.com/pytorch/pytorch/pull/142185#issuecomment-2532605728))
2024-12-10 18:50:17 +00:00
9aefc59649 Revert "[hop][dynamo] support torch.SymInt inputs (#141524)"
This reverts commit 6713b457aee3e36ab2499fb31b733ecd7104c764.

Reverted https://github.com/pytorch/pytorch/pull/141524 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but I think it has a landrace in trunk ([comment](https://github.com/pytorch/pytorch/pull/142185#issuecomment-2532605728))
2024-12-10 18:50:17 +00:00
d102cfa2cb [Profiler] Add CUDA Overhead to Auto-trace (#142271)
Summary: We already have CUDA OVERHEAD events enabled in on-demand so we should also add them to auto-trace

Test Plan: Tested using internal performance suites and found no noticeable performance change

Differential Revision: D66904879

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142271
Approved by: https://github.com/ngimel
2024-12-10 18:39:59 +00:00
bce07deb96 [dtensor][cp][experiment] add CP experimental API to choose rotate method (#142093)
**Summary**
This PR adds a new experimental API `set_rotate_method` for Context Parallel. This API allows users to choose the desired communication method (all-to-all or all-gather) for shard rotation.
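
A hypothetical usage sketch; the import path and the accepted string below are assumptions based on this description, not confirmed API details:

```python
# Assumed location of the experimental API; verify against the actual module.
from torch.distributed.tensor.experimental._attention import set_rotate_method

# Choose all-to-all for KV shard rotation before running the
# context-parallel attention region.
set_rotate_method("alltoall")
```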

**Test**
`pytest test/distributed/_tensor/test_attention.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142093
Approved by: https://github.com/fegin
2024-12-10 18:25:23 +00:00
eb84788fee [fr] change back vlog(2) to LOG(INFO) (#142441)
Summary:
Change log message for future execution back from VLOG(2) to LOG(INFO).
This message is useful for Flight Recorder to verify that flight recorder dumps completed successfully (or not).

Test Plan: Tested manually on a mast job and noted that the INFO message was as expected.
(meta only link: https://fburl.com/mlhub/iui2tpc9)
```
[trainer5]:I1208 10:21:00.772841  7528 ProcessGroupNCCL.cpp:1294] [PG ID 0 PG GUID 0(precheck) Rank 21] future is successfully executed for: Flight recorder dump in heartbeatMonitor
```

Differential Revision: D66996439

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142441
Approved by: https://github.com/fduwjj
2024-12-10 17:43:22 +00:00
6713b457ae [hop][dynamo] support torch.SymInt inputs (#141524)
Fixes https://github.com/pytorch/pytorch/issues/141305.

```python
        class M(torch.nn.Module):
            def forward(self, x, y, z):
                a = y.shape[0]
                b = z.shape[0]

                def true_fn(x):
                    return x + a

                def false_fn(x):
                    return x + b * z

                # When exporting with non-strict: a and b are symints,
                # so torch.compile need to wrap and trace symint inputs.
                return torch.cond(x.shape[0] > 5, true_fn, false_fn, (x,))
```

In non-strict export, when inputs are annotated with dynamic shapes, `a` and `b` in the above example are of torch.SymInt type, so true_fn and false_fn close over torch.SymInt values. The error is triggered because we didn't handle SymInt inputs in dynamo and ended up using a UserDefinedObjectVariable for them, which doesn't have a proxy. We added support by following how SymBool inputs were handled previously.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141524
Approved by: https://github.com/zou3519
ghstack dependencies: #141610, #142185
2024-12-10 17:33:57 +00:00
7eda06b366 skip test dynamo for aot_dispatch tests on ci (#142185)
A lot of tests in test_aotdispatch.py are not meaningful (from a user's perspective) when run with dynamo, so we skip them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142185
Approved by: https://github.com/zou3519
ghstack dependencies: #141610
2024-12-10 17:33:57 +00:00
b838bdd4d4 [dynamo] remove unnecessary set_example_value for SymBool input. (#141610)
These are automatically done in create_graph_input so we can remove them. Code refactoring only.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141610
Approved by: https://github.com/zou3519
2024-12-10 17:33:48 +00:00
1986b46d63 [export] Change Tuple[()] to bool in schema to sync with thrift. (#142257)
Summary:
In the thrift schema, we represent every None value as "True/False", while we represent None as () in the OSS schema. This causes some inconsistency between the type systems, and the simplest thing to do here is to change Tuple[()] to bool in the OSS schema.

This change should NOT cause a version bump, because on the deserializer side we never read the value from as_none fields, as it doesn't carry real meaning. Therefore this schema change should be considered safe.

Test Plan: CI

Reviewed By: SherlockNoMad

Differential Revision: D66888892

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142257
Approved by: https://github.com/yiming0416, https://github.com/hl475
2024-12-10 17:13:35 +00:00
b1b0afb8e8 [BE] Add type annotation to eliminate_dead_code (#142251)
Test Plan: CI

Reviewed By: evanleed

D-ifferential Revision: D66887283

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142251
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2024-12-10 17:09:21 +00:00
09b2232fd1 Make core_aten_decomp to be alias to export table (#140086)
Differential Revision: [D64554098](https://our.internmc.facebook.com/intern/diff/D64554098/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140086
Approved by: https://github.com/bdhirsh
2024-12-10 17:04:59 +00:00
fa746e3eeb [Easy] factor out inductor ophandler decompositions (#142400)
Factor out inductor operator decompositions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142400
Approved by: https://github.com/Chillee, https://github.com/jansel
ghstack dependencies: #134532, #142350
2024-12-10 16:58:36 +00:00
1fb3d5a4e3 Update low prec codegen for div/mod (#142350)
Div/mod in fp16/bf16 requires a downcast to preserve its inputs' dtypes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142350
Approved by: https://github.com/blaine-rister
ghstack dependencies: #134532
2024-12-10 16:50:28 +00:00
59ab3825e7 Prologue Fusion (#134532)
This PR extends our ability to fuse pointwise nodes onto the outputs of triton templates (epilogue fusion) with the ability to fuse pointwise nodes into the inputs of triton templates - prologue fusion.

Similar to the store_output api:
`{{store_output(("idx_m", "idx_n"), "acc", "mask")}}`

And the modification api:

```
{{ modification(
    subgraph_number=0,
    output_name="post_mod_scores",
    score="qk",
    out="qk"
) | indent_except_first(1) }}
```

We have:

```{{load_input("B", "b", ("idx_m", "idx_n"), mask=None if EVEN_K else "b_mask", indent_width=8)}}```

Because we are now loading the input with explicit indices and a mask, I needed to rewrite the mm kernel to no longer update the [pointers by BLOCK_K](bb03ef7aca/torch/_inductor/kernel/mm.py (L110-L111)) on every iteration, and instead compute indices on each iteration from the k_idx of each loop. This did not make any perf difference.

There are a couple main use cases for prologue fusion:

- Fusing dequants into a matmul. particularly for more bandwidth bound scenarios.
- Fusing gather into a matmul. This is useful particularly in MOE. See https://github.com/pytorch/pytorch/issues/134535 for more details.

Prologue fusion is generally much less profitable than epilogue fusion, because it must be applied to an element of an input on each loop of the matmul, compared to only once in the epilogue (gather into matmul is a potential exception). Accordingly, we are much less aggressive in attempting prologue fusion. We only attempt fusion if it does not increase the number of memory bytes read inside the triton template, multiplied by a small factor to allow gathers. This rules out reliably unprofitable fusions like fp32->fp16 casts inside the kernel. In a future PR we could potentially add an API for being more aggressive if we know we are in a bandwidth-bound regime. See: https://github.com/pytorch/pytorch/pull/134532/files#diff-d2539c9c8dc6a3d7e457767a880612e96d3c85752a77ead49a9e4e00a3e4c3c7R3060-R3066
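
A minimal sketch of the profitability check described above (the names and the allowance factor are assumptions, not the actual inductor heuristic):

```python
GATHER_ALLOWANCE = 1.25  # small multiplier so gathers are still allowed to fuse

def prologue_fusion_is_profitable(bytes_read_if_fused: int, bytes_read_unfused: int) -> bool:
    # Only fuse when the fused prologue does not read meaningfully more memory
    # than the unfused template would.
    return bytes_read_if_fused <= bytes_read_unfused * GATHER_ALLOWANCE
```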

Other notes:

By default we will upcast to fp32 inside every kernel. This matches eager numerics. This is fine enough for epilogue because it is only done once (although it is probably unnecessary for say a relu) but tanks perf for prologue. I am currently using the `codegen_upcast_to_fp32` option to avoid it, but that will not work for libdevice calls that require fp32. We will need https://github.com/pytorch/pytorch/pull/136778/ and dtype-aware codegen to upcast fp16 ops into libdevice calls.

With prologue fusion, we now have essentially separate kernels for each input and for the output. I had to increase the number of fields that are swapped out in `set_subgraph_body` by a large number :/ I also updated the fusion logic because the inputs will have a different group than the outputs. Maybe as part of enabling multiple outputs, this could get cleaned up a bit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134532
Approved by: https://github.com/jansel
2024-12-10 16:25:57 +00:00
a751558467 [logging] Fix bug involving missing compilation_metrics fields in tlparse logs (#142423)
Summary: The line of code that compiles the set of compilation_metrics to include in the corresponding tlparse log is missing the "legacy" and "common" fields populated above. The fix is to make sure we consider all fields in the compilation_metrics object.

Test Plan:
Before: https://fburl.com/d6em8csg (e.g, https://fburl.com/c19s7ny0)
After: https://fburl.com/5zr6kbvf (e.g, https://fburl.com/3hp14ht2)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142423
Approved by: https://github.com/ezyang
2024-12-10 15:58:43 +00:00
882b6af219 c10::string_view -> std::string_view in autograd (#142354)
Differential Revision: D66939966

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142354
Approved by: https://github.com/Skylion007
2024-12-10 15:43:41 +00:00
7e41717a26 c10::string_view -> std::string_view in caffe2/jit (#142383)
Test Plan: Sandcastle

Differential Revision: D66939979

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142383
Approved by: https://github.com/malfet
2024-12-10 15:42:28 +00:00
dd2d0c6b80 [FSDP2] Gate PT2 code for torch deploy (#142456)
See diff for internal details

Differential Revision: [D67003832](https://our.internmc.facebook.com/intern/diff/D67003832)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142456
Approved by: https://github.com/yf225, https://github.com/weifengpy, https://github.com/fegin
2024-12-10 14:39:07 +00:00
ff059587c6 support condition branch in ao debug handler (#141516)
This diff introduces support for condition statements in AO debug handler generation.

Most of the code is borrowed from ExecuTorch to avoid a circular dependency issue.

Differential Revision: [D66270691](https://our.internmc.facebook.com/intern/diff/D66270691/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141516
Approved by: https://github.com/jerryzh168
2024-12-10 14:05:12 +00:00
75530885ba Revert "[BE] Add type annotation to eliminate_dead_code (#142251)"
This reverts commit 3d04de6b2f78a78bc28ce82d6e3a4af1867ec7d8.

Reverted https://github.com/pytorch/pytorch/pull/142251 on behalf of https://github.com/jeanschmidt due to checking if reverting will fix 'FAILED [5.0221s] test_dataloader.py::TestIndividualWorkerQueue::test_ind_worker_queue' on windows ([comment](https://github.com/pytorch/pytorch/pull/142251#issuecomment-2531706362))
2024-12-10 13:57:00 +00:00
a3abe1a5ae Add support for bfloat16 atomic adds in fbcode (#141857)
This adds support for bfloat16 atomic add in fbcode (OSS will have to wait until those changes are upstreamed to triton)

Originally I attempted to write inline asm, but the triton API was not flexible enough to support this use case. In the long run the right answer is to implement this properly in OSS triton.

relevant issues:
* https://github.com/pytorch/pytorch/issues/137425 in fbcode only
* https://github.com/pytorch/pytorch/issues/97016

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141857
Approved by: https://github.com/eellison
2024-12-10 11:40:15 +00:00
d51e6fa7f6 [inductor][cpp] Add FlexAttention support for CPU inference (#141453)
This PR brings FlexAttention inference support to the inductor backend in torch.compile (supported precisions: bf16 and fp32) on CPUs.

Based on the existing CPP template, this PR extends it and implements a FlexAttention CPP template that supports a broad range of attention variants while bringing optimized performance on CPUs.

With this, users can transparently extend their FlexAttention usage to CPUs with solid torch.compile support, in both functionality and performance.
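
A minimal usage sketch on CPU (the shapes and the score_mod are illustrative; this uses the standard flex_attention API, with CPU inference support added by this PR):

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

def causal(score, b, h, q_idx, kv_idx):
    # Mask out future positions; a typical score_mod example.
    return torch.where(q_idx >= kv_idx, score, float("-inf"))

# CPU tensors in bf16, one of the precisions supported by this PR.
q, k, v = (torch.randn(1, 4, 128, 64, dtype=torch.bfloat16) for _ in range(3))
compiled_flex = torch.compile(flex_attention)
out = compiled_flex(q, k, v, score_mod=causal)
```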

For UTs, this PR includes a subset of the critical tests for CPUs (inference tests only), as follows:
```
pytest test/inductor/test_flex_attention.py
`TestFlexAttention`
#common functions:
run_test
preprocess_paged_attention
run_paged_attention
run_test_with_paged_attention
run_test_with_call
run_dynamic_test
run_automatic_dynamic_test

#test functions:
test_builtin_score_mods
test_builtin_score_mods_automatic_dynamic
test_builtin_score_mods_different_seqlen
test_builtin_score_mods_different_block_size
test_kv_batch_broadcast
test_GQA
test_cpu_error_message_return_lse
test_validate_cpu_dtype_error_message

`TestPagedAttention`
#test function:
test_paged_builtin_score_mods
```
For the remaining UTs in `test/inductor/test_flex_attention.py` and `test/inductor/test_flex_decoding.py`, we will submit another PR specific to enabling CPU device UTs and refactoring, since the larger set of changes (1500+ LOC) would make this PR hard to review.

Besides, more optimizations are also planned in follow up PRs, including:

- Block sparse computation
- Flash decoding tuning

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141453
Approved by: https://github.com/drisspg, https://github.com/leslie-fang-intel

Co-authored-by: Wu, Chunyuan <chunyuan.wu@intel.com>
2024-12-10 11:11:09 +00:00
5ba61d7fb8 [Fix][Profiler UT] Skip CPU for the UT test/profiler/test_execution_trace.py::test_execution_trace_with_pt2 (#142027)
[Fix] Skip CPU device for the UT `test_execution_trace_with_pt2`
Skip the CPU device because Triton is GPU-only; this UT is designed to test profiling of Triton kernels.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142027
Approved by: https://github.com/aaronenyeshi
2024-12-10 09:29:19 +00:00
3d04de6b2f [BE] Add type annotation to eliminate_dead_code (#142251)
Test Plan: CI

Reviewed By: evanleed

Differential Revision: D66887283

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142251
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2024-12-10 09:27:29 +00:00
539286a67b Inductor annotations (#130429)
Add NVTX annotations around training phases and buffer computations

RFC/discussion: https://dev-discuss.pytorch.org/t/rfc-performance-profiling-at-scale-with-details-nvtx-annotations/2224

<img width="2160" alt="Screenshot 2024-07-10 at 11 48 04" src="https://github.com/pytorch/pytorch/assets/1175576/9ade139c-d393-473f-9b68-6c25da367dc4">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130429
Approved by: https://github.com/aorenste, https://github.com/eellison, https://github.com/albanD

Co-authored-by: Cedric GESTES <cedric.gestes@flex.ai>
2024-12-10 08:53:39 +00:00
24650c3caa Revert "[Inductor][Easy] Fix a test failure in loop_ordering_after_fusion (#142273)"
This reverts commit e4ecb09b3513e0ee53ed87496d8bfdf5d2944042.

Reverted https://github.com/pytorch/pytorch/pull/142273 on behalf of https://github.com/huydhn due to Internal has been ninja unlanded D66906175 ([comment](https://github.com/pytorch/pytorch/pull/142273#issuecomment-2530751665))
2024-12-10 08:16:58 +00:00
d3d1a78774 [AOTInductor] Add standalone test for compilation from ExportedProgram (#142327)
Summary: Provide a standalone path to compile and run a ExportedProgram in C.

Test Plan:
(1) Generate a compiled model from ExportedProgram
```
python generate_lowered_cpu.py --input-path /tmp/$USER/ep.pt --output-path /tmp/$USER/final.pt
```
(2) Compile a standalone test runner
```
TORCH_ROOT_DIR=/data/users/$USER/pytorch sh standalone_compile.sh standalone_test.cpp standalone_test.out
```
(3) Run test for the compiled model in step (1)
```
LD_LIBRARY_PATH=/data/users/$USER/pytorch/build/lib ./standalone_test.out /tmp/$USER/final.pt
```

Differential Revision: D66872380

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142327
Approved by: https://github.com/hl475
2024-12-10 06:50:09 +00:00
b9e253cb72 [inductor] update numbytes_hint for NoneLayout to allow more fusions (#141766)
We found that [this commit](6eca0aee76) caused a ~6% performance drop in ViT INT8. This was due to changes to the `numbytes_hint` for `NoneLayout`. In this PR, we reverted the changes in `numbytes_hint` to allow more fusions.

```
class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.dense = torch.nn.Linear(768, 768)
        self.layernorm = torch.nn.LayerNorm(768, eps=1e-12)
    def forward(self, context_layer, hidden_states):
        attention_output = self.dense(context_layer)
        hidden_states = attention_output + hidden_states
        layer_output = self.layernorm(hidden_states)
        return layer_output
```
The generated code before (left) and after (right) this PR is as follows:
![image](https://github.com/user-attachments/assets/0ec65ae5-103e-4e2c-bf7c-e8bed24fc179)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141766
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jansel
2024-12-10 06:45:07 +00:00
daa27fe59d [DeviceMesh] Call no_dispatch before doing tensor slicing in DeviceMesh (#142287)
Summary:
DeviceMesh's tensor operations are control-plane operations, not data-plane ones, and should not be affected by FakeTensorMode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142287
Approved by: https://github.com/XilunWu
2024-12-10 06:33:01 +00:00
f26b75b7ac [aarch64] add CUDA 12.6 sbsa nightly binary (#142335)
related to #138440

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142335
Approved by: https://github.com/atalman
2024-12-10 06:19:28 +00:00
1cb2ebd740 [AOTI] Fix #140546 and support AOTI package load for Intel GPU. (#140664)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

* #140686
* __->__ #140664
* #140269
* #140268
* #135320
* #135318
* #139026

Fix #140546

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140664
Approved by: https://github.com/desertfire, https://github.com/EikanWang
ghstack dependencies: #140268, #140269

Co-authored-by: Bin Bao <binbao@meta.com>
2024-12-10 05:05:08 +00:00
6680a83e89 [AOTI XPU] Support AOT Inductor for Intel GPU. (#140269)
This PR add XPU support for AOT Inductor, and reuse the corresponding UT.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140269
Approved by: https://github.com/desertfire, https://github.com/EikanWang
ghstack dependencies: #140268

Co-authored-by: Bin Bao <binbao@meta.com>
2024-12-10 05:05:08 +00:00
a1c6cf7e9f Revert "Add UTs for accelerator device-agnostic runtime APIs (#133572)"
This reverts commit 952514f0c8d8ff2e1719e0ca82b0d178a5c5ff45.

Reverted https://github.com/pytorch/pytorch/pull/133572 on behalf of https://github.com/malfet due to Sorry for reverting your PR, but it segfaults on MacOS ([comment](https://github.com/pytorch/pytorch/pull/133572#issuecomment-2530354401))
2024-12-10 04:42:55 +00:00
adbfdbd6a0 Revert "Add device-agnostic runtime Device/Stream C++ API (#138677)"
This reverts commit f84e533a2cb89a42c021dce7d22af7d5bd5f5ac1.

Reverted https://github.com/pytorch/pytorch/pull/138677 on behalf of https://github.com/malfet due to Sorry for reverting your PR, but it segfaults on MacOS ([comment](https://github.com/pytorch/pytorch/pull/133572#issuecomment-2530354401))
2024-12-10 04:42:55 +00:00
08e9ceb0a4 Make sure the benchmark build config is tested on trunk for easy bisect (#142376)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142376
Approved by: https://github.com/atalman
2024-12-10 04:42:52 +00:00
4e7056d94d Fixes in-order test flakiness (#142389)
Fixes #142343

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142389
Approved by: https://github.com/michael-diggin, https://github.com/divyanshk
2024-12-10 04:19:20 +00:00
e3886fb13c misc. fixes to unflatten (#142141)
Combining several fixes to unflatten for bugs revealed by random graph testing.

The fixes target two categories of bugs:
1. Some bugs show up as exponential blowups for largish systems of nn modules. These are fixed by converting lists to sets, using caching, or otherwise rewriting to reuse computation more efficiently.
2. Other bugs were due to missing intermediate modules created when attributes such as submodules and buffers are accessed through longish paths before calling the corresponding intermediate modules, or missing attributes such as buffers and constants in submodules corresponding to multiple calls.

Differential Revision: [D66659795](https://our.internmc.facebook.com/intern/diff/D66659795/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142141
Approved by: https://github.com/ydwu4
2024-12-10 03:45:13 +00:00
e4ecb09b35 [Inductor][Easy] Fix a test failure in loop_ordering_after_fusion (#142273)
**Summary:**
(Since I am trying the other solution for https://github.com/pytorch/pytorch/pull/141082, I moved out the test case fixes from that pr to a separate pr to land first.)

-----
Testing float8 dynamic scaling case with `TORCHINDUCTOR_LOOP_ORDERING_AFTER_FUSION=1` didn't make any difference.

The test case for fp8 (https://github.com/pytorch/pytorch/blob/main/test/inductor/test_loop_ordering.py#L425) is also failing, https://www.internalfb.com/intern/test/844425111960859?ref_report_id=0

-------

The main change here is to modify the condition of calling `loop_reordering` from `shared_data_score == 0` to `shared_data_score < config.score_fusion_memory_threshold`.

Before the change:
`shared_data_score > 0 -> won't loop_reorder -> can't fused because of shared_data_score < config.score_fusion_memory_threshold`
After the change:
`shared_data_score > 0 -> loop_reorder (shared_data_score < config.score_fusion_memory_threshold) -> get a larger shared_data_score -> fused`
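
A minimal sketch of the new decision flow (the function and variable names here are assumptions for illustration, not the actual scheduler code):

```python
def should_fuse(node1, node2, config):
    score = shared_data_score(node1, node2)  # hypothetical helper
    # Before: only re-score with loop reordering when score == 0.
    # Now: re-score whenever the score is below the fusion threshold.
    if score < config.score_fusion_memory_threshold:
        score = score_after_loop_reordering(node1, node2)  # hypothetical helper
    return score >= config.score_fusion_memory_threshold
```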

----
It's the same issue as fixed in https://github.com/pytorch/pytorch/pull/136782. But the condition to call loop_reorder might be changed later, causing the test case to fail again.

**Test Plan:**
```
buck2 test 'fbcode//mode/opt' caffe2/test/inductor:loop_ordering
```
And ran a float8 dynamic scaling training script to verify it e2e

-----

Differential Revision: D66906175

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142273
Approved by: https://github.com/eellison
2024-12-10 02:58:04 +00:00
3291b0a013 [DataParallel] Skip for MPS device (#142448)
As `torch._C._scatter` is only defined for CUDA/ROCm (and maybe XPU?)

This is a regression introduced by https://github.com/pytorch/pytorch/pull/141098 that went unnoticed due to https://github.com/pytorch/pytorch/issues/142206

Test plan:
```
python test_autograd.py -v -k test_dataparallel_saved_tensors_hooks
```

Before this change it failed with
```
ERROR: test_dataparallel_saved_tensors_hooks (__main__.TestMultithreadAutograd.test_dataparallel_saved_tensors_hooks)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/malfet/git/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 3108, in wrapper
    method(*args, **kwargs)
    ~~~~~~^^^^^^^^^^^^^^^^^
  File "/Users/malfet/git/pytorch/pytorch/test/test_autograd.py", line 13074, in test_dataparallel_saved_tensors_hooks
    model = torch.nn.DataParallel(Model())
  File "/Users/malfet/git/pytorch/pytorch/torch/nn/parallel/data_parallel.py", line 153, in __init__
    raise RuntimeError("no available devices were found")
RuntimeError: no available devices were found
```

After this change it passes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142448
Approved by: https://github.com/kit1980
2024-12-10 02:49:23 +00:00
cyy
9a309fb4c6 Remove ConstQuantizerPtr in torchgen (#142375)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142375
Approved by: https://github.com/albanD
2024-12-10 02:37:01 +00:00
41757372c4 Set timeout value for remaining lint jobs (#142444)
Some lint jobs are using the default 30-minute timeout, but the jobs could now wait up to 90 minutes for the Docker image to become available after https://github.com/pytorch/test-infra/pull/6013
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142444
Approved by: https://github.com/wdvr
2024-12-10 02:29:44 +00:00
e83b0fa945 set CUB_VERSION to 200001 for USE_ROCM (#140861)
Summary:
Currently, CUB_VERSION is 0 for USE_ROCM.
CUB_VERSION is used to determine whether to use advanced CUB APIs in some implementations.

Test Plan:
`buck2 build --flagfile fbsource//arvr/mode/win/vs2022/cpp20/cuda12_5/dev --flagfile fbsource//arvr/mode/cuda/rtx30 fbsource//arvr/libraries/eye/apollo_visualizer:unit_test_apollo_hu_module_capability`

`buck2 build --flagfile fbcode//mode/amd-gpu fbcode//aiplatform/modelstore/checkpointing/pyper:tensor_save_load_utils`

Differential Revision: D63054638

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140861
Approved by: https://github.com/eqy, https://github.com/zoranzhao, https://github.com/houseroad
2024-12-10 02:28:48 +00:00
2f1191fb6a Corrected metadata variable names (#142342)
Fixes #142341

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142342
Approved by: https://github.com/janeyx99
2024-12-10 02:24:31 +00:00
5d6acd5a31 Register Intel distributed Backend (XCCL) in PyTorch distributed package (#141856)
### Motivation:

As illustrated in the Intel distributed support RFC https://github.com/pytorch/pytorch/issues/141741, two parts are needed to enable Intel distributed backend (`XCCL`) support in PyTorch.
1. Intel GPU distributed Backend integration in PyTorch `torch-xpu-ops`.
2. **Intel distributed backend registration in the PyTorch distributed package**. This PR contributes the second part.

### Example:
Here is a simple example of using spawn to launch XCCL backend and perform allreduce on XPU tensors.
```
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()

def run_allreduce(rank, world_size):
    setup(rank, world_size)
    device = torch.device('xpu:{}'.format(rank))
    x = torch.randn([2, 2], device=device)
    dist.all_reduce(x)
    cleanup()

if __name__ == '__main__':
    world_size = 2
    mp.spawn(run_allreduce, args=(world_size,), nprocs=world_size, join=True)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141856
Approved by: https://github.com/kwen2501, https://github.com/gujinghui, https://github.com/albanD
2024-12-10 01:58:06 +00:00
b98f40a4d5 remove vulkan sdk installation on executorch build (#142424)
pytorch-linux-jammy-py3-clang12-executorch [started to fail](https://github.com/pytorch/pytorch/actions/runs/12244909721/job/34157668780) today due to a 404 on the Vulkan SDK we use/download (1.2.198.1, 3 years old, URL: https://sdk.lunarg.com/sdk/download/1.2.198.1/linux/vulkansdk-linux-x86_64-1.2.198.1.tar.gz )

The Vulkan SDK is probably no longer needed for building Executorch, and is not used down the line for testing.

This PR tests removing the installation of the SDK

https://github.com/pytorch/executorch/pull/7258
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142424
Approved by: https://github.com/huydhn
2024-12-10 01:50:16 +00:00
a1688d8607 Fix test_indexing on MacOS (#142440)
Where int64_t is long long rather than long

This fixes test regression introduced by https://github.com/pytorch/pytorch/pull/140597 that went undetected due to https://github.com/pytorch/pytorch/issues/142206

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142440
Approved by: https://github.com/kit1980
2024-12-10 01:46:28 +00:00
871b524398 Revert "temporarily turn on keep-going/continue on error for mac (#142421)"
This reverts commit 17202ea8f6fb0eebdb14b346bb2610f08800a7df.

Reverted https://github.com/pytorch/pytorch/pull/142421 on behalf of https://github.com/malfet due to We've collected enough info for now ([comment](https://github.com/pytorch/pytorch/pull/142421#issuecomment-2530010220))
2024-12-10 01:45:21 +00:00
bef103934a [DeviceMesh][ROCm] skip ProcessGroup init test on ROCm because #ranks != #devices in CI (#142386)
**Summary**
Fixes #142361

Skip the DeviceMesh test since the test suite doesn't consider the case where `# ranks != # devices`.

**Test**
CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142386
Approved by: https://github.com/huydhn, https://github.com/fegin
2024-12-10 01:22:21 +00:00
a1b5067297 Enable py3.13 wheels for ROCm (#142294)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142294
Approved by: https://github.com/huydhn
2024-12-10 01:10:24 +00:00
20718cdebb [Fast Packing] Add packing ukernels to gemm config (#142191)
Add file to buck build

Differential Revision: [D66692673](https://our.internmc.facebook.com/intern/diff/D66692673/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D66692673/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142191
Approved by: https://github.com/kirklandsign, https://github.com/digantdesai
2024-12-10 01:06:17 +00:00
0f6bfc58a2 Introduce remote cache key prefix to break cache (#142148)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142148
Approved by: https://github.com/jamesjwu, https://github.com/ezyang
2024-12-10 00:35:50 +00:00
1cb5f38328 [EZ] Skip test_zero_grid_with_backed_symbols on Mac (#142436)
As it expects to load traced module on CUDA, which is not available on Mac bd867d691b/test/inductor/test_aot_inductor.py (L1414)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142436
Approved by: https://github.com/kit1980
2024-12-10 00:25:32 +00:00
ec746f7026 Remove an unused variable from _prims_common/wrappers.py (#138480)
----

* Extracted from https://github.com/pytorch/pytorch/pull/133492
* albanD thinks this is a bug!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138480
Approved by: https://github.com/albanD
2024-12-10 00:12:53 +00:00
5743b11039 Improve messaging of ProcessGroupNCCL destructor (#142297)
And removed some unnecessary conditions for calling `thread.join()` -- `thread.joinable()` should have covered it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142297
Approved by: https://github.com/wconstab
ghstack dependencies: #141510, #141511
2024-12-10 00:02:33 +00:00
bd867d691b [FSDP2] Fix backward-compatible imports (#142419)
Internal only: the previous approach meant that `from torch.distributed._composable.fsdp import fully_shard` was importing `fully_shard.py`, not the function `fully_shard`. For some reason, the resolution order is different from open source.

To fix this, we match the old import as closely as possible. Namely, we import `fully_shard.py` contents from `.fully_shard`. This should force that import to take precedence.

@diff-train-skip-merge

Differential Revision: [D66990327](https://our.internmc.facebook.com/intern/diff/D66990327)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142419
Approved by: https://github.com/weifengpy
2024-12-09 23:56:32 +00:00
bcddae14ec Enhance "from_node" node meta to track source recursively (#142066)
Summary:
Change the "from_node" node meta format to be able to track the provenance of nodes recursively.

The new "from_node" format is a a list node NodeSource:

```
class NodeSource:
	self.node_name: str
	self.target: str
	self.graph_id: int
	self.pass_name: str
	self.action: str
	self.from_node: List[NodeSource]
```
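
A hypothetical, self-contained illustration of the recursive chain this enables (the dataclass stand-in and all values are made up; the real class lives in PyTorch):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class NodeSource:  # stand-in mirroring the fields listed above
    node_name: str
    target: str
    graph_id: int
    pass_name: str
    action: str
    from_node: List["NodeSource"] = field(default_factory=list)

# A replaced node pointing back to the node it came from (made-up values).
from_node_meta = [
    NodeSource("add_1", "aten.add.Tensor", graph_id=1, pass_name="post_grad",
               action="replace",
               from_node=[NodeSource("add", "aten.add.Tensor", graph_id=0,
                                     pass_name="", action="create")])
]
```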

This is in preparation for the inductor provenance tracking. For background, see the inductor provenance tracking doc: https://docs.google.com/document/d/1dGh9myqNhywmbfP0Quzx_f04bghDFlj8cawj8MopiO8/edit?fbclid=IwZXh0bgNhZW0CMTEAAR0jUQ0Tf4ROLDED8Y_eIzrU0KVZVdRmyIQLp-avt-kGRPI_VgYVNyjH_q0_aem_HCQ_pxHDiwOkO9mQyWB2-g&tab=t.0 (internal only).

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r test_unflatten_multiple_graphs_state
buck run mode/dev-nosan caffe2/test:fx -- -r node_source
```

Differential Revision: D66737916

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142066
Approved by: https://github.com/avikchaudhuri
2024-12-09 23:39:15 +00:00
42b222edef [AOTI] Fix an issue when fallback op does not return a value (#142339)
Summary: Refine https://github.com/pytorch/pytorch/pull/137660 to support fallback op without a return value.

Differential Revision: D66939108

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142339
Approved by: https://github.com/henrylhtsang
2024-12-09 23:24:29 +00:00
17202ea8f6 temporarily turn on keep-going/continue on error for mac (#142421)
See https://github.com/pytorch/pytorch/pull/142270 for additional info.

Make all mac default shard tests run with keep going / continue on error so we can see all the test failures.

Red signal will show up later, but you can see failing tests mid run on HUD by clicking the additional test failures button

After the job is finished, searching for "consistently: " in the logs will find the failed tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142421
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-12-09 23:24:10 +00:00
beeffe77e4 Revert "[inductor][cpp] Add FlexAttention support for CPU inference (#141453)"
This reverts commit db379ed1ada58608d4d3c5c35777da051e4e49e5.

Reverted https://github.com/pytorch/pytorch/pull/141453 on behalf of https://github.com/malfet due to This breaks tests on platforms compiled without MKLDNN, namely MacOS, see https://github.com/pytorch/pytorch/actions/runs/12245441371/job/34159967794 ([comment](https://github.com/pytorch/pytorch/pull/141453#issuecomment-2529710573))
2024-12-09 22:57:59 +00:00
8d24eb0c94 [Inductor] Represent size_hints as a dict (#142249)
Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243.

# Feature

Follow up to https://github.com/pytorch/pytorch/pull/141751. Since we now represent `numels` as a dict, it's natural to extend this to `size_hints`. The latter are basically just the former rounded up to the nearest power of 2. This simplifies various heuristics such as the coordinate descent tuner. Where we previously needed to determine which index in `size_hints` corresponds to each dimension, now we can just query by prefix. This will be especially important when we enable 2D reductions, as it becomes harder to keep track of these things when we have multiple reduction dimensions. (See the previous PR for some examples.)
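
A small illustration of the new representation (the dict keys and the rounding rule follow the description above; the exact prefixes are assumptions):

```python
# Iteration sizes keyed by prefix (e.g. "x" for pointwise dims, "r" for reduction).
numels = {"x": 1000, "r": 37}

# size_hints: each numel rounded up to the nearest power of 2, keyed the same way.
size_hints = {k: 1 << max(v - 1, 0).bit_length() for k, v in numels.items()}
assert size_hints == {"x": 1024, "r": 64}
```

Heuristics can then look up a dimension by prefix (e.g. `size_hints["r"]`) instead of by positional index.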

# Test plan

The existing CI provides good coverage. This PR modifies a few tests which explicitly constructed size hints.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142249
Approved by: https://github.com/jansel
2024-12-09 22:31:53 +00:00
4b69a68c7c [Do not revert] Re-enable Mac testing (#142270)
The bash script modification in https://github.com/pytorch/pytorch/pull/135386 results in tests on mac in default shard not running.
This PR is expected to cause test failures, but we need to start getting signal, so landing with known failures
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142270
Approved by: https://github.com/malfet, https://github.com/seemethere, https://github.com/atalman

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-12-09 22:26:26 +00:00
274223d719 Add and use borrow_arrayref_tensor_as_tensor (#142183)
Differential Revision: [D66847773](https://our.internmc.facebook.com/intern/diff/D66847773/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142183
Approved by: https://github.com/desertfire, https://github.com/hl475
ghstack dependencies: #142340, #142182
2024-12-09 22:23:21 +00:00
18d25aa7aa Rename convert_arrayref_tensor_to_tensor to copy_arrayref_tensor_to_tensor (#142182)
Be explicit about what we are doing, in preparation for adding borrow_arrayref_tensor_as_tensor.

Differential Revision: [D66847772](https://our.internmc.facebook.com/intern/diff/D66847772/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142182
Approved by: https://github.com/desertfire
ghstack dependencies: #142340
2024-12-09 22:23:21 +00:00
dc1ef9afb4 Reapply #142091 (Unbreak dynamic shape minimal arrayref interface tests) (#142340)
Simple bug got introduced somewhere.

The original PR was reverted because it broke (caused unexpected successes for) some tests in test_aot_inductor_arrayref.py that still only run internally because #123691 hasn't been fixed. I've fixed those.

Differential Revision: [D66890276](https://our.internmc.facebook.com/intern/diff/D66890276/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142340
Approved by: https://github.com/hl475
2024-12-09 22:23:21 +00:00
cca33d50b9 [PGNCCL] Use long/short wait for different non-blocking calls (#142291)
In nonblocking mode, we always check if the NCCL communicator is ready between issuing commands to it.
Today this is done by the `waitReady()` function.
Unfortunately, the `waitReady()` function is hard-wired to `C10D_NCCL_CHECK_TIMEOUT_SLEEP`, which sleeps for an interval between two consecutive checks.
While this is nice when waiting for comm init or finalize, it degrades performance of collective calls (which would almost certainly return success immediately.)

This PR adds a `bool longInterval` argument to `waitReady` and lets the call site determine whether a long wait is likely; if not, `waitReady` uses `sched_yield()` to check for readiness more eagerly.

Thanks @eqy for reporting the issue that small collectives have a perf impact in nonblocking mode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142291
Approved by: https://github.com/eqy, https://github.com/fduwjj
2024-12-09 22:19:58 +00:00
452e1a7840 [c10d] Update backend arg documentation (#142404)
Update doc to reflect change brought by https://github.com/pytorch/pytorch/pull/142216

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142404
Approved by: https://github.com/XilunWu
2024-12-09 21:53:44 +00:00
12f1989a4a [aoti package] seek 0 after loading buffer (#142204)
Differential Revision: D66855265

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142204
Approved by: https://github.com/chenyang78, https://github.com/angelayi
2024-12-09 21:53:28 +00:00
4c7688ca06 Add pytest support for unittest.subTests to CI env (#142238)
Fixes #142157
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142238
Approved by: https://github.com/malfet, https://github.com/huydhn
ghstack dependencies: #142243
2024-12-09 21:48:20 +00:00
5d3bc633ff [PGNCCL] Rework NCCLComm dtor to avoid clash with CUDA driver shutdown (#141511)
Making CUDA or NCCL calls during object destruction can be dangerous because the CUDA context may have exited before the destructor runs, in which case the CUDA calls would see a "CUDA driver shutting down" error.

This PR takes a destroy call away from the NCCLComm dtor and doesn't add a new one. If users call destroy_process_group or abort_process_group as recommended, then we destroy for them; otherwise we are OK with letting them possibly leak resources (and get a warning).
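
For reference, a minimal sketch of the recommended explicit teardown (assumes the usual env:// rendezvous variables and a CUDA device are available):

```python
import torch.distributed as dist

# Destroy the process group explicitly so NCCL resources are released before
# CUDA driver shutdown, instead of relying on destructors at interpreter exit.
dist.init_process_group("nccl")
try:
    pass  # ... run collectives ...
finally:
    dist.destroy_process_group()
```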

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141511
Approved by: https://github.com/eqy, https://github.com/wconstab
ghstack dependencies: #141510
2024-12-09 21:41:15 +00:00
4dbecf3ba7 Implement CPU pins functions for HPU hooks (#139495)
Link CPU pins function in HPU hooks to the host allocator in tensor_empty

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139495
Approved by: https://github.com/zou3519
2024-12-09 21:37:20 +00:00
e52a534994 [PGNCCL] Deprecate support of onCompletionHook (#142390)
The usage of `onCompletionHook` is mostly similar to what Flight Recorder does today -- for example, measuring how long a collective takes and put it into a profiler's "database".

Since FR already records and can dump info like this, we are considering deprecating the onCompletionHook support to save a side thread. (Each PG runs 3 side threads today, which is resource consuming and complicates the code)

User can file an issue if additional information needs to be recorded.
They can also file an RFC if Flight Recorder needs to accept plugins that customize the recording.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142390
Approved by: https://github.com/fduwjj, https://github.com/fegin
2024-12-09 21:11:33 +00:00
04312293a2 [Inductor] Fix wrong CSEVariable dtype for reduction. Fix #141861 (#142189)
Fix #141861

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142189
Approved by: https://github.com/jansel
2024-12-09 21:07:35 +00:00
a4dedf27b9 [Inductor] Generalize newly introduced device-bias code to align the behavior of XPU unroll reduction with cuda. (#142348)
Fix #141861

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142348
Approved by: https://github.com/desertfire, https://github.com/jansel
2024-12-09 20:58:35 +00:00
8bf28b3613 [EZ] Do not checkout builder for Linux builds (#142282)
All logic should have been migrated to .ci/manywheel folder from builder repo a while back
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142282
Approved by: https://github.com/atalman
ghstack dependencies: #142276, #142277, #142382
2024-12-09 20:52:13 +00:00
cb0a302dde Fix fallthrough behaviour when Meta in TLS include set (#141581)
Fixes https://github.com/pytorch/pytorch/issues/141120

Registering a fallthrough for a backend correctly alters nonFallthroughKeysPerBackend_[backend_idx]. However, the backend_idx calculation does not take into account the local dispatch key set, which is used to temporarily turn on Meta as a backend. This means that makeFallthrough does not behave exactly as if it were a normal function that redispatches, rather than a "fake function" implemented with a key mask.

So e.g. impl::computeDispatchKeySet(ks, nonFallthroughKeysPerBackend_[backend_idx]); will exclude keys like Meta which may be in the TLS include set.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141581
Approved by: https://github.com/bdhirsh
2024-12-09 20:32:44 +00:00
a1bd784ffd add CK BMM instances (#142002)
Summary:
Adds instances of CK DeviceBatchedGemmMultiD_Xdl_CShuffle_V3 for the aten.bmm CK backend. Adds a simple heuristic that will need improving over time.
Adds support for TN, NT, TT and NN layouts.

Differential Revision: D66662554

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142002
Approved by: https://github.com/mxz297, https://github.com/xw285cornell
2024-12-09 20:31:24 +00:00
5c76a2834d Revert "add torchrec collectives to enforce global ordering (#141970)"
This reverts commit ceb94d6a7d38930d662e7eb71b9c7620de8c2997.

Reverted https://github.com/pytorch/pytorch/pull/141970 on behalf of https://github.com/malfet due to Apologies for reverting this change, but it broke MacOS testing, but CI was broken at the time ([comment](https://github.com/pytorch/pytorch/pull/141970#issuecomment-2529367680))
2024-12-09 20:25:04 +00:00
960a81fdcd [EZ] Delete unsued binary_macos_test.sh (#142382)
According to https://github.com/search?type=code&q=binary_macos_test.sh+repo%3Apytorch%2Fpytorch (and grep in the repo) it's not used anywhere

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142382
Approved by: https://github.com/atalman
ghstack dependencies: #142276, #142277
2024-12-09 19:37:56 +00:00
b4c0973b59 [2/N] Apply bugprone-unchecked-optional-access (#141091)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141091
Approved by: https://github.com/Skylion007, https://github.com/albanD

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-12-09 19:30:19 +00:00
005c5694eb Refactor "torch.mtia.memory_stats" API (#141723)
Summary:
This diff refactors the code for the "torch.mtia.memory_stats" API to maintain the same file hierarchy as its CUDA counterpart:
- All device memory APIs are now located under ".../mtia/memory.py".
- Device memory APIs can be accessed using either "torch.mtia.XYZ" or "torch.mtia.memory.XYZ" (see the sketch below).
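
A hedged usage sketch of the two access paths; it assumes an MTIA-enabled build and device, and that the return format mirrors the dict-like output of the CUDA counterpart.

```python
import torch

# Both access paths should resolve to the same memory_stats implementation
# (assumption: an MTIA build/device is available; naming mirrors torch.cuda).
stats_a = torch.mtia.memory_stats()
stats_b = torch.mtia.memory.memory_stats()
assert stats_a.keys() == stats_b.keys()
```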

Test Plan:
Passed a local unit test: `buck run //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api`

```
Ran 14 tests in 16.657s

OK
I1127 11:06:06.505201 2133030 afg_bindings.cpp:943] afg-aten::mul.out-dtype_Float-bBtLGD6Y executable has been unloaded
I1127 11:06:06.506654 2133030 afg_bindings.cpp:943] afg-add-dtype_Float-fa37JncC executable has been unloaded
W1127 11:06:08.731138 2133030 HazptrDomain.h:148] Tagged objects remain. This may indicate a higher-level leak of object(s) that use hazptr_obj_cohort.
```

Differential Revision: D66549179

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141723
Approved by: https://github.com/nautsimon
2024-12-09 19:19:19 +00:00
db379ed1ad [inductor][cpp] Add FlexAttention support for CPU inference (#141453)
This PR brings the FlexAttention inference support for the inductor backend in torch.compile (support precisions: bf16 and fp32) on CPUs.

Based on the existing CPP template, this PR extends and implements a FlexAttention CPP template to support broad attention variants, and meanwhile brings optimized performance on CPUs.

With this, users can transparently extend their Flex Attention usages to CPUs with good and common support from torch.compile, both functionality and performance.

For UTs, this PR includes a subset of the critical tests for CPUs (inference tests only), as follows:
```
pytest test/inductor/test_flex_attention.py
`TestFlexAttention`
#common functions:
run_test
preprocess_paged_attention
run_paged_attention
run_test_with_paged_attention
run_test_with_call
run_dynamic_test
run_automatic_dynamic_test

#test functions:
test_builtin_score_mods
test_builtin_score_mods_automatic_dynamic
test_builtin_score_mods_different_seqlen
test_builtin_score_mods_different_block_size
test_kv_batch_broadcast
test_GQA
test_cpu_error_message_return_lse
test_validate_cpu_dtype_error_message

`TestPagedAttention`
#test function:
test_paged_builtin_score_mods
```
For the remaining UTs in `test/inductor/test_flex_attention.py` and `test/inductor/test_flex_decoding.py`, we will submit another PR dedicated to enabling and refactoring the CPU device UTs, since the larger change (1500+ LOC) would make this PR hard to review.

Besides, more optimizations are also planned in follow up PRs, including:

- Block sparse computation
- Flash decoding tuning

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141453
Approved by: https://github.com/drisspg, https://github.com/leslie-fang-intel

Co-authored-by: Wu, Chunyuan <chunyuan.wu@intel.com>
2024-12-09 18:44:39 +00:00
a0d49dc047 Fix to make GELU on aarch64 preserve COW input tensor (#142366)
Fixes #142365

The itensor_from_tensor call was causing the COW tensor to materialize.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142366
Approved by: https://github.com/malfet
2024-12-09 18:42:06 +00:00
0610b9730e Do not use builder repo for MacOS builds (#142277)
Added c7564f31f7/wheel/build_wheel.sh to `.ci/wheel/` folder

Commented out the call to 39532891a0/run_tests.sh, because since 2018 this script has just checked that the tests folder is there and exited, as there is no way to run all pytorch tests in a single shard; see this logic:
```bash
#!/bin/bash
set -eux -o pipefail

# Essentially runs pytorch/test/run_test.py, but keeps track of which tests to
# skip in a centralized place.
#
# TODO Except for a few tests, this entire file is a giant TODO. Why are these
# tests # failing?
# TODO deal with Windows

# This script expects to be in the pytorch root folder
if [[ ! -d 'test' || ! -f 'test/run_test.py' ]]; then
    echo "builder/test.sh expects to be run from the Pytorch root directory " \
         "but I'm actually in $(pwd)"
    exit 2
fi

# Allow master skip of all tests
if [[ -n "${SKIP_ALL_TESTS:-}" ]]; then
    exit 0
fi
```

https://github.com/pytorch/pytorch/pull/123390 was a misreading of the above-mentioned logic: run_tests will be skipped if `${SKIP_ALL_TESTS}` is a non-empty string.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142277
Approved by: https://github.com/huydhn, https://github.com/atalman
ghstack dependencies: #142276
2024-12-09 18:33:58 +00:00
5e8e1d725a Remove some unused type ignores (round 1) (#142325)
Over time, a large number of the existing type ignores have become irrelevant/unused/dead as a result of improvements in annotations and type checking.

Having these `# type: ignore` linger around is not ideal for two reasons:

- They are syntactically verbose/ugly.
- They could hide genuine bugs in the future, if a refactoring would actually introduce a bug but it gets hidden by the ignore.

I'm counting over 1500 unused ignores already. This is a first PR that removes some of them. Note that I haven't touched type ignores that looked "conditional" like the import challenge mentioned in https://github.com/pytorch/pytorch/pull/60006#issuecomment-2480604728. I will address these at a later point, and eventually would enable `warn_unused_ignores = True` in the mypy configuration as discussed in that comment to prevent accumulating more dead ignores going forward.

This PR should have no effect on runtime at all.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142325
Approved by: https://github.com/Skylion007, https://github.com/janeyx99
2024-12-09 18:23:46 +00:00
a52d9f6f4c Fix torch.lerp RuntimeError when weight is CPU scalar while input & end are CUDA tensor (#141820)
Fixes #141811
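
A minimal repro sketch of the fixed case (CUDA input/end, CPU scalar tensor weight), per the title and linked issue; it requires a CUDA device.

```python
import torch

# input/end on CUDA, weight as a 0-dim CPU tensor; this combination used to
# raise a RuntimeError and should now work.
a = torch.rand(4, device="cuda")
b = torch.rand(4, device="cuda")
w = torch.tensor(0.5)  # CPU scalar tensor
print(torch.lerp(a, b, w))
```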

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141820
Approved by: https://github.com/eqy, https://github.com/janeyx99
2024-12-09 18:14:54 +00:00
d99c9c2acb [PGNCCL] Make sure we do not use split for P2P comm creation (#139013)
Resolve comment https://github.com/pytorch/pytorch/pull/138527#issuecomment-2438613172

There was a split-vs-P2P bug:
When P2P comm creation invokes `getNCCLComm`, it may see a `split_from` option which is meant for the previous PG creation. Then the P2P comm creation may use `ncclCommSplit` and hang, because not all ranks join this call. The bug slipped through previously (and still does today) because there is no CI test with the following recipe: eager init + new group + P2P in that new group.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139013
Approved by: https://github.com/shuqiangzhang
2024-12-09 17:56:03 +00:00
219e9c83a5 Revert "[AOTI XPU] Support AOT Inductor for Intel GPU. (#140269)"
This reverts commit 854d83133bd4b0bca8ba19477c56ef2dd896dfc7.

Reverted https://github.com/pytorch/pytorch/pull/140269 on behalf of https://github.com/clee2000 due to breaks forward compatibility?  D66937097 ([comment](https://github.com/pytorch/pytorch/pull/140269#issuecomment-2528828555))
2024-12-09 17:33:28 +00:00
6fcb294e18 Revert "[AOTI] Fix #140546 and support AOTI package load for Intel GPU. (#140664)"
This reverts commit 91d30546a4338b17f31d31a674662aa53d61b1aa.

Reverted https://github.com/pytorch/pytorch/pull/140664 on behalf of https://github.com/clee2000 due to breaks forward compatibility?  D66937097 ([comment](https://github.com/pytorch/pytorch/pull/140269#issuecomment-2528828555))
2024-12-09 17:33:28 +00:00
90fc2b42e3 Revert "export AOTI_TORCH_EXPORT on Windows. (#140030)"
This reverts commit 82544bd3a2f71e5995e6b035433139fad884e277.

Reverted https://github.com/pytorch/pytorch/pull/140030 on behalf of https://github.com/clee2000 due to still has failures internally when building, D66923759 ([comment](https://github.com/pytorch/pytorch/pull/140030#issuecomment-2528760716))
2024-12-09 17:04:20 +00:00
dd5df002b9 [pt2e][quant] Make move_exported_model_to_train/eval idempotent (#142239)
Summary: Before we would recompile the model unnecessarily even
if the model is already in the desired mode. For training
frameworks that assume `model.train()` is idempotent and calls
this before every single training step, this led to a bunch of
tiny graphs and poor performance. This commit makes these calls
no-ops if we're already in the target train/eval mode.
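
A hedged sketch of the idempotence this enables; the import path is assumed from the pt2e quantization flow, and the exported model itself is not constructed here.

```python
from torch.ao.quantization import move_exported_model_to_eval  # path assumed

def prepare_for_inference(exported_model):
    # Repeated calls are now no-ops instead of triggering another recompile,
    # which matters for frameworks that switch modes every training step.
    m = move_exported_model_to_eval(exported_model)
    m = move_exported_model_to_eval(m)  # no-op after this change
    return m
```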

Test Plan:
python test/test_quantization -k TestQuantizePT2E.test_allow_exported_model_train_eval_idempotent
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142239
Approved by: https://github.com/jerryzh168
2024-12-09 16:50:20 +00:00
d29e0ac9e9 Use set -o pipefail for build.sh (#142377)
This would have made https://github.com/pytorch/pytorch/pull/142359 a
hard failure.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142377
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-12-09 16:30:57 +00:00
9b6ef8abaf Update inductor jobs to use CUDA 12.4 (#142177)
CUDA 12.4 is the default now.  This frees up some resources.  This also fixes the newly added Python 3.13 job from #140733; that PR missed adding the new Docker image `pytorch-linux-focal-cuda12.4-cudnn9-py3.13-gcc9-inductor-benchmarks` to the docker build workflow.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142177
Approved by: https://github.com/atalman
2024-12-09 16:18:38 +00:00
02848c2e14 [cpu/aarch64] fix compilation for Vec:bf16 (128bit) (#142370)
Fix typo causing compilation error on aarch64 architecture with BF16 support. (#139090)

tag: @swolchok

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142370
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-12-09 16:17:32 +00:00
a17ecd8668 Add missing bc CI dependency (#142359)
The [bc](https://www.geeksforgeeks.org/bc-command-linux-examples) command that I use to calculate the MAX_JOBS in https://github.com/pytorch/pytorch/pull/142164 isn't part of the Docker image https://github.com/pytorch/pytorch/actions/runs/12230618287/job/34113698986#step:14:321.  I missed this error when landing https://github.com/pytorch/pytorch/pull/142164.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142359
Approved by: https://github.com/Skylion007
2024-12-09 16:07:40 +00:00
1589c2bc4b [c10d][UCC] Add _reduce_scatter_base to c10d::ProcessGroupUCC (#138021)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138021
Approved by: https://github.com/kwen2501
2024-12-09 16:02:24 +00:00
8d9ac9d94e [aarch64] Fix libcusparselt format for CUDA sbsa docker (#142363)
Corrects https://github.com/pytorch/pytorch/pull/141433/files

Error when building the arm wheel: https://github.com/pytorch/pytorch/actions/runs/12226514901/job/34101913511
`/opt/rh/gcc-toolset-11/root/usr/bin/ld: /usr/local/cuda/lib64/libcusparseLt.so: error adding symbols: file in wrong format`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142363
Approved by: https://github.com/Aidyn-A, https://github.com/Skylion007, https://github.com/atalman
2024-12-09 15:29:36 +00:00
5fc9f419ef [AOTI] Fix multi-kernel codegen when using one-pass (#142333)
Summary: Update multi-kernel codegen to one-pass, following https://github.com/pytorch/pytorch/pull/141980.

Differential Revision: [D66936717](https://our.internmc.facebook.com/intern/diff/D66936717)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142333
Approved by: https://github.com/chenyang78
ghstack dependencies: #141980
2024-12-09 14:49:10 +00:00
4d43ec2189 [AOTI] Switch GPU codegen to one-pass (#141980)
Summary: With autotune_at_compile_time enabled, AOTI can now perform CUDA codegen in one pass. CUDA kernel-related code is generated in a deferred way, after autotuning is done. This one-pass implementation eliminates any issue caused by disparity between passes in the previous two-pass implementation (which caused multiple bug reports in the past). The one-pass implementation also avoids cloning the mutated inputs needed in the two-pass implementation, which reduces GPU memory consumption.

Differential Revision: [D66739414](https://our.internmc.facebook.com/intern/diff/D66739414)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141980
Approved by: https://github.com/chenyang78
2024-12-09 14:40:34 +00:00
f14ce3a923 Update slow tests (#140248)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140248
Approved by: https://github.com/pytorchbot
2024-12-09 11:15:51 +00:00
7101dcfb98 Revert "[inductor][cpp] Add FlexAttention support for CPU inference (#141453)"
This reverts commit 7edbde3334df3223c009769d8226d06071e1fff9.

Reverted https://github.com/pytorch/pytorch/pull/141453 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it is failing periodic NO_AVX2 ([comment](https://github.com/pytorch/pytorch/pull/141453#issuecomment-2527377475))
2024-12-09 09:26:20 +00:00
a108b282ff [4/N] Avoid copy in std::get (#142285)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142285
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-12-09 07:59:35 +00:00
2cc01cc6d3 [Quant][Inductor][X86] add fusion pass for linear_dynamic_fp16 with relu (#141556)
**Description**
Fuse and prepack weight for `linear_dynamic_fp16` with post op relu. In Inductor, the pattern we see is
```
fp32 activation
  |
(reshape)
  |
mm/addmm <- t <- to_fp32 <- to_fp16 <- weight
  |
(reshape) <- relu
```
Or
```
fp32 activation
  |
expand
  |
 bmm <- expand <- t <- to_fp32 <- to_fp16 <- weight
  |
(add) <- relu
```
The second pattern is for x.ndim > 2 and x is not contiguous. The first pattern is for other cases.

Fuse the pattern with weight prepack, and we get
```
fp32 activation
  |
onednn.linear_relu_dynamic_fp16 <- onednn.linear_prepack_fp16 <- weight
```
After freezing, the prepack op is gone.

**Test plan**
```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_linear_relu_dynamic_fp16
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141556
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
ghstack dependencies: #141549
2024-12-09 05:05:11 +00:00
7435f57f60 [BE] Remove unusued channels arg in col2im (#142336)
The number of channels is passed to the col2im kernel/device function, but it is not used during the computation at all.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142336
Approved by: https://github.com/Skylion007, https://github.com/eqy
2024-12-09 01:49:41 +00:00
75e72e1408 Adding lowering to persistent-tma device kernel for _scaled_mm (#142045)
# Summary
This PR adds an alternative triton lowering for _scaled_mm. This uses an updated mm template that utilizes persistent scheduling + TMAs on A and B matrices.

Limitations:
* This implementation does not work with bias values: 0602676c8d/torch/_inductor/kernel/mm_scaled.py (L106). The plan is to remove this workaround and enforce that both scaling + bias are properly done as epilogues on the existing templates
* K dim must be 32 or greater for these to take effect
* Gated by a config flag (currently defaults to off; maybe it should be on)

## Testing
We don't have any tests exercising this code in CI/CD, but I updated the relevant tests in test_fp8 and they are all green:
<img width="1680" alt="Screenshot 2024-12-05 at 7 24 07 PM" src="https://github.com/user-attachments/assets/9c520541-d97a-416f-9af7-e68b366ec90f">

## Follow Ups
* Work to update the base mm triton templates and utilize the same template from mm/addmm/scaled_mm w/ respective epilogues
* Tuning on Persistent kernel configs. I found ones that work for my problem shapes but need to do some more NCU work

### Some profiling code I was using

Code I am using to iterate w/
```Python
import torch
from dataclasses import dataclass
from jsonargparse import CLI
import logging
from pathlib import Path

from transformer_nuggets.utils.benchmark import ProfileConfig, profile_function
from torchao.float8.inference import (
    addmm_float8_unwrapped_inference,
    preprocess_data,
    Float8MMConfig,
)
from transformer_nuggets.fp8.fp8_matmul import (
    matmul_persistent,
    matmul_tma_persistent,
    matmul_device_tma_persistent,
)
from enum import Enum

logging.getLogger("transformer_nuggets").setLevel(logging.INFO)

class FP8Kernel(Enum):
    PERSISTENT = "Persistent"
    PERSISTENT_TMA = "Persistent-TMA"
    DEVICE_TMA = "Device-TMA"
    SCALED_MM = "Scaled-MM"

class ScalingStrategy(Enum):
    PER_TENSOR = "PerTensor"
    PER_ROW = "PerRow"

@dataclass(frozen=True)
class ExperimentConfig:
    M: int
    K: int
    N: int
    scaling_strategy: ScalingStrategy
    fp8_kernel: FP8Kernel
    compile: bool

def get_fp8_matmul(
    A: torch.Tensor,
    B: torch.Tensor,
    scaling_strategy: ScalingStrategy,
    fp8_kernel: FP8Kernel,
):
    A_fp8 = A.to(torch.float8_e4m3fn)
    B_fp8 = B.to(torch.float8_e4m3fn)
    A_fp8, B_fp8 = preprocess_data(A_fp8, B_fp8, Float8MMConfig(use_fast_accum=True))

    if scaling_strategy == ScalingStrategy.PER_TENSOR:
        a_scale = torch.tensor(1, device="cuda", dtype=torch.float32)
        b_scale = torch.tensor(1, device="cuda", dtype=torch.float32)
    elif scaling_strategy == ScalingStrategy.PER_ROW:
        a_scale = torch.ones((A_fp8.size(0), 1), device="cuda", dtype=torch.float32)
        b_scale = torch.ones((B_fp8.size(1), 1), device="cuda", dtype=torch.float32).T
    else:
        raise ValueError(f"Invalid scaling strategy: {scaling_strategy}")

    assert fp8_kernel == FP8Kernel.SCALED_MM
    return lambda: addmm_float8_unwrapped_inference(
        A_fp8, a_scale, B_fp8, b_scale, output_dtype=torch.bfloat16, use_fast_accum=True
    )

def run_matmul(config: ExperimentConfig):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    A = torch.randn(config.M, config.K, device=device, dtype=torch.bfloat16)
    B = torch.randn(config.K, config.N, device=device, dtype=torch.bfloat16)

    fp8_matmul = get_fp8_matmul(A, B, config.scaling_strategy, config.fp8_kernel)

    if config.compile and config.fp8_kernel == FP8Kernel.SCALED_MM:
        fp8_matmul = torch.compile(fp8_matmul, mode="max-autotune-no-cudagraphs")

    _ = fp8_matmul()

    return

def main():
    torch.random.manual_seed(123)

    # Define your experiment configuration here
    config = ExperimentConfig(
        M=8192,
        K=8192,
        N=8192,
        scaling_strategy=ScalingStrategy.PER_TENSOR,
        fp8_kernel=FP8Kernel.SCALED_MM,
        compile=True,
    )

    run_matmul(config)

if __name__ == "__main__":
    CLI(main)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142045
Approved by: https://github.com/eellison
2024-12-09 01:48:40 +00:00
29e985b7b0 [dim_order] raised runtime error when tensor has ambiguous dim order (#141632)
This diff makes tensor.dim_order() raise an error when the tensor's dim order is ambiguous. A detailed discussion can be found at https://fb.workplace.com/groups/894363187646754/permalink/2039987243084337/
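
A small example of the unambiguous case; per this change, a tensor whose dim order cannot be uniquely determined now raises instead of guessing.

```python
import torch

# Strides of a permuted contiguous tensor determine the dim order uniquely.
x = torch.randn(2, 3, 4).permute(0, 2, 1)
print(x.dim_order())  # (0, 2, 1)
```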

Differential Revision: [D65133579](https://our.internmc.facebook.com/intern/diff/D65133579/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141632
Approved by: https://github.com/larryliu0820
2024-12-08 23:16:57 +00:00
e1196dfe51 Deprecate torch._utils.is_compiling() (#127690)
This PR is split from PR #126898.

- #126898

------
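
A hedged migration note: callers of the private helper can generally switch to the public compiler API, e.g.:

```python
import torch

# Prefer the public torch.compiler.is_compiling() over the deprecated
# torch._utils.is_compiling() (replacement assumed from the public API).
def in_compile_region() -> bool:
    return torch.compiler.is_compiling()
```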

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127690
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-12-08 22:55:36 +00:00
869665c44c [torchgen] Fix an unused variable in api/python.py (#142337)
Extracted from https://github.com/pytorch/pytorch/pull/136359

Changes behavior, but the original code seems like it was an obvious oops.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142337
Approved by: https://github.com/Skylion007
2024-12-08 21:48:08 +00:00
ef26f1c57e Migrate windows build scripts from builder to pytorch (#142156)
Move builder Windows build scripts to pytorch/pytorch.
Remove the builder checkout during Windows builds.
Pending: remove the Windows build scripts from https://github.com/pytorch/builder/tree/main/windows

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142156
Approved by: https://github.com/malfet, https://github.com/chuanqi129
2024-12-08 21:43:59 +00:00
05c1f37188 Enable memory swap on Linux docker build (#142293)
Lots of CUDA build jobs are OOM-ing in trunk and the hotspot seems to come from building flash attention, for example https://github.com/pytorch/pytorch/actions/runs/12208390090/job/34061532155#step:14:9369.  There are several options around:

* Mimic the logic from https://github.com/Dao-AILab/flash-attention/blob/main/setup.py#L495-L508.  We are using `linux.2xlarge` for the build with 8 CPU and 16GB.  The current max number of parallel jobs is `(8 - 2)/3 = 2` while the logic from upstream repo has `16 / 9 = 1.7`, so it's very close.
* Upgrade to `linux.2xlarge.memory` with 8 CPU and 64GB for all CUDA build, it could afford up to 7 max parallel jobs according to the above logic.
* Enable swap.

These approaches can work together, so I want to experiment with swapping first as this technique, if working, could be useful in other context too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142293
Approved by: https://github.com/seemethere, https://github.com/malfet
2024-12-08 20:59:43 +00:00
46dc2965de Adding missing space to pybind_utils.h error message (#142258)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142258
Approved by: https://github.com/Skylion007
2024-12-08 20:46:32 +00:00
0c66cee9a2 [Inductor] Expand dtype aware codegen for libdevice and tl.math ops (#140864)
# Feature
Previously, only the codegen for `torch.sqrt` was dtype aware. This PR updates most of the `libdevice`/`tl.math` ops to support dtype-aware codegen as well. This is often necessary to get correct code when `config.triton.codegen_upcast_to_fp32=False`, as most Triton math ops do not support float16/bfloat16.

This PR enables dtype aware codegen via the `maybe_upcast_float32` decorator. This wraps `TritonOverrides` macros to upcast arguments to float32, and downcast the result back to the original dtype. The exception is for ops that return booleans, in which case we set `convert_output=False` and skip the output cast.
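
A toy sketch of the upcast-compute-downcast shape described above; this is not the actual `maybe_upcast_float32`/`TritonOverrides` code, just an eager-mode illustration.

```python
import functools

import torch

# Compute in float32, then cast the result back to the original low-precision
# dtype unless the op returns booleans (convert_output=False).
def upcast_to_fp32(convert_output: bool = True):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(x: torch.Tensor) -> torch.Tensor:
            orig = x.dtype
            out = fn(x.to(torch.float32))
            return out.to(orig) if convert_output else out
        return wrapper
    return decorator

@upcast_to_fp32()
def safe_sqrt(x: torch.Tensor) -> torch.Tensor:
    return torch.sqrt(x)

print(safe_sqrt(torch.tensor([2.0], dtype=torch.float16)).dtype)  # torch.float16
```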

# Test Plan
Added CI tests for all the new ops. The list of ops to test is automatically generated based on uses of the `maybe_upcast_float32` decorator, and stored in the new `OpDtypeSupport` class. In each new test, we search the generated code for upcasts/downcasts using a regex.

Also added a unit test for `OpDtypeSupport` which checks that we have correct dtype info for ops that require upcasts.

This PR also moves some existing tests around, to collect all the dtype aware codegen tests in one file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140864
Approved by: https://github.com/eellison, https://github.com/arui-meta

Co-authored-by: eellison <elias.ellison@gmail.com>
2024-12-08 19:42:48 +00:00
c814dd08aa Fixed installing dependencies instructions in CONTRIBUTING.md (#142334)
In the original code, “pip install -r requirements” was missing the “.txt” suffix, so this PR adds it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142334
Approved by: https://github.com/malfet
2024-12-08 19:35:36 +00:00
e343f46464 [inductor] Refactor is_big_gpu (#142220)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142220
Approved by: https://github.com/yanboliang
ghstack dependencies: #142219, #142033, #142222
2024-12-08 18:51:36 +00:00
dc7461d6f5 docstring_linter finds long classes and functions without docstrings (#140426)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140426
Approved by: https://github.com/eellison
2024-12-08 17:03:57 +00:00
d0b9874603 Teach ruff_linter to report syntax errors (fix #140228) (#142312)
Now syntax errors look like this:

```
torch/_dynamo/variables/base.py:20:

  Error (RUFF) E999
    SyntaxError: Expected ',', found indent.
    See https://beta.ruff.rs/docs/rules/
    >>>  19  |class SourceType(Enum]:
         20  |    """
         21  |    This Enum divides VariableTracker into 2 cases, depending on the variable
         22  |    it represents:

[...more errors...]
```

Note that most syntax errors lead to a cascade of other errors, so the exception is generally wrong, but the location and name are good.

Before they looked like this:

```
>>> General linter failure:

  Error (RUFF) Linter failed
    Linter failed. This a bug, please file an issue against the linter
    maintainer.

    CONTEXT:
    Linter command failed with non-zero exit code.
    STDERR:
    <MainThread:DEBUG> $ /home/rec/.conda/envs/pytorch-dev-constant/
    bin/python3 -m ruff check --exit-zero --quiet --output-format=json
    --config=pyproject.toml /home/rec/git-constant/pytorch/torch/_dynamo/
    variables/base.py
    <MainThread:DEBUG> took 38ms
    Traceback (most recent call last):
      File "/home/rec/git-constant/pytorch/tools/linter/adapters/
    ruff_linter.py", line 465, in <module>
        main()
      File "/home/rec/git-constant/pytorch/tools/linter/adapters/
    ruff_linter.py", line 424, in main
        lint_messages = check_files(
      File "/home/rec/git-constant/pytorch/tools/linter/adapters/
    ruff_linter.py", line 273, in check_files
        return [
      File "/home/rec/git-constant/pytorch/tools/linter/adapters/
    ruff_linter.py", line 288, in <listcomp>
        severity=severities.get(vuln["code"],
    get_issue_severity(vuln["code"])),
      File "/home/rec/git-constant/pytorch/tools/linter/adapters/
    ruff_linter.py", line 172, in get_issue_severity
        if any(
      File "/home/rec/git-constant/pytorch/tools/linter/adapters/
    ruff_linter.py", line 173, in <genexpr>
        code.startswith(x)
    AttributeError: 'NoneType' object has no attribute 'startswith'

    STDOUT:
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142312
Approved by: https://github.com/Skylion007
2024-12-08 16:48:05 +00:00
2c6d094869 [AOTI] Assert misaligned input (#142136)
Summary: Fixes https://github.com/pytorch/pytorch/issues/141891. JIT Inductor relies on copy_misaligned_inputs to fix misaligned inputs. For AOTInductor's use case, this is an unacceptable performance hit, so we codegen an input alignment check at the entry point and throw an error if any misalignment exists.
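
A rough Python-level analogue of the generated check; the real check is code-generated in the AOTI entry point, and the 16-byte boundary here is an assumption for illustration.

```python
import torch

ALIGNMENT = 16  # assumed boundary for illustration

def assert_aligned(t: torch.Tensor, name: str) -> None:
    # Reject misaligned input storage instead of silently copying it.
    if t.data_ptr() % ALIGNMENT != 0:
        raise RuntimeError(f"{name} is misaligned: expected {ALIGNMENT}-byte alignment")

x = torch.randn(33)
assert_aligned(x, "x")            # fresh allocations are typically aligned
# assert_aligned(x[1:], "x[1:]")  # a 4-byte-offset view would fail the check
```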

Differential Revision: [D66881038](https://our.internmc.facebook.com/intern/diff/D66881038)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142136
Approved by: https://github.com/eellison, https://github.com/ezyang
ghstack dependencies: #142133
2024-12-08 15:13:01 +00:00
5035ff0796 [AOTI] Refactor codegen_inputs signature (#142133)
Summary: Since codegen_inputs only writes to self.prefix, drop IndentedBuffer from its parameters, to make the API consistent with other similar functions.

Differential Revision: [D66881040](https://our.internmc.facebook.com/intern/diff/D66881040)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142133
Approved by: https://github.com/chenyang78
2024-12-08 15:05:03 +00:00
32b94644fc Add manylinux_2_28_x86_64 tags to wheel builds (#141988)
Tag the wheels with the appropriate manylinux 2.28 tags.
Initially used auditwheel, but it does much more than just adding tags: it also tries to package multiple libs into the wheel, which we don't want at this point. Hence we just changed the tag and filename. If no libs are repackaged by auditwheel, all it does is tag and rename.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141988
Approved by: https://github.com/kit1980, https://github.com/malfet
2024-12-08 14:36:24 +00:00
0ecba57561 [CI] Add xpu new docker image name into docker builds workflow (#142298)
Add the missing new XPU docker image name to adapt to the new mechanism introduced by https://github.com/pytorch/test-infra/pull/6013
Works for https://github.com/pytorch/pytorch/issues/114850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142298
Approved by: https://github.com/huydhn
2024-12-08 09:34:20 +00:00
7edbde3334 [inductor][cpp] Add FlexAttention support for CPU inference (#141453)
This PR brings the FlexAttention inference support for the inductor backend in torch.compile (support precisions: bf16 and fp32) on CPUs.

Based on the existing CPP template, this PR extends and implements a FlexAttention CPP template to support broad attention variants, and meanwhile brings optimized performance on CPUs.

With this, users can transparently extend their Flex Attention usages to CPUs with good and common support from torch.compile, both functionality and performance.

For UTs, this PR includes a subset of the critical tests for CPUs (inference tests only), as follows:
```
pytest test/inductor/test_flex_attention.py
`TestFlexAttention`
#common functions:
run_test
preprocess_paged_attention
run_paged_attention
run_test_with_paged_attention
run_test_with_call
run_dynamic_test
run_automatic_dynamic_test

#test functions:
test_builtin_score_mods
test_builtin_score_mods_automatic_dynamic
test_builtin_score_mods_different_seqlen
test_builtin_score_mods_different_block_size
test_kv_batch_broadcast
test_GQA
test_cpu_error_message_return_lse
test_validate_cpu_dtype_error_message

`TestPagedAttention`
#test function:
test_paged_builtin_score_mods
```
For the remaining UTs in `test/inductor/test_flex_attention.py` and `test/inductor/test_flex_decoding.py`, we will submit another PR dedicated to enabling and refactoring the CPU device UTs, since the larger change (1500+ LOC) would make this PR hard to review.

Besides, more optimizations are also planned in follow up PRs, including:

- Block sparse computation
- Flash decoding tuning

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141453
Approved by: https://github.com/jgong5, https://github.com/drisspg, https://github.com/leslie-fang-intel

Co-authored-by: Wu, Chunyuan <chunyuan.wu@intel.com>
2024-12-08 07:57:21 +00:00
0bd7b7ae58 Add version check for C++ pytree availability (#142299)
Resolves #142256

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142299
Approved by: https://github.com/jansel, https://github.com/weifengpy
2024-12-08 06:27:32 +00:00
2682e5e0d4 [BE]: Add TypeGuard to is_symbolic (#142304)
Improves type inference for is_symbolic. If it's True, it must be either a SymInt or Torch Tensor currently.
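
A simplified illustration of the TypeGuard narrowing; the real `is_symbolic` lives in the Inductor utilities, and this standalone version exists only to show the narrowing behavior.

```python
from typing import TypeGuard, Union

import torch

def is_symbolic(val: object) -> TypeGuard[Union[torch.SymInt, torch.Tensor]]:
    # Simplified stand-in for illustration of the narrowing behavior.
    return isinstance(val, (torch.SymInt, torch.Tensor))

def describe(val: Union[int, torch.SymInt, torch.Tensor]) -> str:
    if is_symbolic(val):
        return f"symbolic-ish: {type(val).__name__}"  # narrowed to SymInt | Tensor
    return f"static int: {val}"
```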

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142304
Approved by: https://github.com/jansel
2024-12-08 02:18:17 +00:00
2fc8bac091 [ROCm] Fix unit test: matmul_offline_mgpu_gpu_tunableop (#142269)
Fixes #141652

This PR fixes (at least in part) the unit test failure. However, we may also need a separate flush of the untuned results; if this test continues to be flaky, another PR would be needed to do that flush as well.

Tested locally and it seems to be working.

Also fixes code in the unit test that was accidentally commented out in the prior multi-GPU offline tuning PR https://github.com/pytorch/pytorch/pull/139673

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142269
Approved by: https://github.com/jeffdaily
2024-12-08 02:18:00 +00:00
b1bb860d3c c10::string_view -> std::string_view in aten (#141903)
D66560348 passes internally, but won't export, so I'm rebuilding here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141903
Approved by: https://github.com/Skylion007
2024-12-07 23:23:52 +00:00
8cb68b136f Proper modeling of recursive types (#142300)
Currently there are a few type annotations that falsely state that mypy doesn't support recursive types.

Recursive type support is available in mypy for a few years already. It has been officially enabled in [version 0.991](https://mypy-lang.blogspot.com/2022/11/mypy-0990-released.html). Pyright even had support for recursive types earlier (https://github.com/microsoft/pyright/issues/569), so there is probably no reason not to model these types correctly.

This PR models these types properly now. Since this has turned a few implicit `Any` into fully typed variables that are not narrowed cleanly, a small number of type ignores were necessary.

Note that regarding the `Argument` type, it is desirable to model it in a covariant way (i.e. using `Sequence` and `Mapping`) instead of making it invariant unnecessarily (using `List` and `Dict`). If it were modeled as invariant, it would for instance mean that a `List[Node]` would not type check as `Argument`, because invariance would mean that it really has to be a `List[Argument]` (i.e., including all the branches of the union type). Since even the name of the type, "argument", strongly suggests that it is semantically used as an argument, covariance is natural anyway.
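
A small sketch of the covariance point; the alias below is illustrative, not the exact `torch.fx` definition.

```python
from typing import Mapping, Sequence, Union

class Node:
    pass

# Illustrative recursive alias: Sequence/Mapping keep it covariant, so a
# list[Node] is accepted where Argument is expected; List[Argument] would not.
Argument = Union[None, int, float, str, Node,
                 Sequence["Argument"], Mapping[str, "Argument"]]

def takes_argument(arg: Argument) -> None:
    pass

nodes: list[Node] = [Node(), Node()]
takes_argument(nodes)  # type-checks thanks to the covariant Sequence branch
```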

There are no changes in this PR that affect runtime behavior.

CC @Skylion007

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142300
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2024-12-07 21:30:45 +00:00
17f1a42c13 Add missing py::bytes to pybind_utils tryToInferType (#142265)
I'm not sure what the best way to fix this is, but this does unbreak an internal test.

Test Plan: Sandcastle

Reviewed By: itamaro

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142265
Approved by: https://github.com/houseroad
2024-12-07 20:31:57 +00:00
3b531f18c7 [BE] Improve Flight Recorder efficacy (#142178)
Summary:
This is an attempt to improve the flight recorder efficacy.
We have a small subset of jobs that are timing out (i.e. failing to write out FR logs in 1 minute) and some that are throwing a `std::exception - broken promise`.

There are two changes in here.
1. We attempt to write out the FR buffer with stack traces. If this fails, we attempt to capture the FR buffer again, but this time without stack traces. The assumption here is that FR could be locking up when unwinding the stack.
Note, to keep things simple, I'm re-using the same file name for both with/without stack_trace.
2.  Add additional catch statements in the Manifold writer. There might be something going on in here - so we'll get a log statement if this is failing.

TODO:
- there's nothing differentiating in the output that says whether stack traces were omitted purposefully or not.
This info might be useful for the analyzer - so I'll add this in a follow on diff.

Test Plan: Unit tests.

Differential Revision: D66843194

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142178
Approved by: https://github.com/kwen2501
2024-12-07 19:32:28 +00:00
91d30546a4 [AOTI] Fix #140546 and support AOTI package load for Intel GPU. (#140664)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

* #140686
* __->__ #140664
* #140269
* #140268
* #135320
* #135318
* #139026

Fix #140546

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140664
Approved by: https://github.com/desertfire, https://github.com/EikanWang
ghstack dependencies: #140268, #140269
2024-12-07 19:22:04 +00:00
854d83133b [AOTI XPU] Support AOT Inductor for Intel GPU. (#140269)
This PR adds XPU support for AOT Inductor and reuses the corresponding UTs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140269
Approved by: https://github.com/desertfire, https://github.com/EikanWang
ghstack dependencies: #140268
2024-12-07 19:22:04 +00:00
3d227ae315 [Intel GPU] Support getStreamFromExternal for XPU. (#140268)
In the AOT Inductor scenario, the GPU stream can be created outside of the `XPUStream` pool, and we need to create an `XPUStream` that refers to this stream for the common logic of AOTI; for example, a stream guard is a guard for `XPUStream`. So we add getStreamFromExternal, following the design of CUDAStream.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140268
Approved by: https://github.com/desertfire, https://github.com/jansel, https://github.com/EikanWang
2024-12-07 19:22:04 +00:00
843018f407 [inductor] Refactor split factor into V.choices.reduction_split_factor (#142222)
I want to reuse this for cooperative reduction heuristics (in a later PR).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142222
Approved by: https://github.com/eellison
ghstack dependencies: #142219, #142033
2024-12-07 17:48:45 +00:00
81edca08ab [inductor] Refactor some DeviceProperties usage (#142033)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142033
Approved by: https://github.com/eellison
ghstack dependencies: #142219
2024-12-07 17:48:45 +00:00
0367a31401 [inductor] Minor typing changes (#142219)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142219
Approved by: https://github.com/Skylion007, https://github.com/yanboliang
2024-12-07 17:48:37 +00:00
524395edf4 [aarch64] build cuda 12.6 manywheel dockers (#139988)
Add builds of sbsa CUDA 12.6 manywheel dockers to the workflow
Related to #138440

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139988
Approved by: https://github.com/atalman

Co-authored-by: Andrey Talman <atalman@fb.com>
2024-12-07 15:38:41 +00:00
82544bd3a2 export AOTI_TORCH_EXPORT on Windows. (#140030)
Fixes #139954

reproduce UT:
```cmd
pytest test/inductor/test_torchinductor_codegen_dynamic_shapes.py -k test_device_assert_dynamic_shapes_cpu
```
Issue:
<img width="856" alt="image" src="https://github.com/user-attachments/assets/5fc501a9-54e5-45ac-9fb3-509ec11a7abe">

After fixing:
![Image](https://github.com/user-attachments/assets/883846fb-8e92-4b9c-9400-daab32382a3a)

Reland:
1. Declare export on Windows explicitly.
2. Support cpu, cuda and xpu devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140030
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-12-07 15:23:38 +00:00
f84e533a2c Add device-agnostic runtime Device/Stream C++ API (#138677)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138677
Approved by: https://github.com/albanD, https://github.com/EikanWang
ghstack dependencies: #133572
2024-12-07 13:14:10 +00:00
952514f0c8 Add UTs for accelerator device-agnostic runtime APIs (#133572)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133572
Approved by: https://github.com/EikanWang, https://github.com/albanD
2024-12-07 13:14:10 +00:00
b6a64b64de Add ncu profile to final output_code.py (#142259)
This PR adds `--ncu` to the output code benchmark utils to generate ncu profile reports.

Test Plan:
```
% python torch_compile_debug/run_2024_12_05_22_27_59_182730-pid_4112931/torchinductor/model__0_forward_1.0/output_code2.py --ncu
0.000160
Peak GPU memory usage 671.220 MB
==PROF== Connected to process 502514 (python3.10)
==PROF== Connected to process 503187 (python3.10)
==WARNING== Unable to access the following 6 metrics: ctc__rx_bytes_data_user.sum, ctc__rx_bytes_data_user.sum.pct_of_peak_sustained_elapsed, ctc__rx_bytes_data_user.sum.per_second, ctc__tx_bytes_data_user.sum, ctc__tx_bytes_data_user.sum.pct_of_peak_sustained_elapsed, ctc__tx_bytes_data_user.sum.per_second.

==PROF== Profiling "distribution_elementwise_grid..." - 0: 0%....50%....100% - 38 passes
==PROF== Profiling "vectorized_elementwise_kernel" - 1: 0%....50%....100% - 38 passes
==PROF== Profiling "triton_poi_fused_embedding_0" - 2: 0%....50%....100% - 38 passes
6.891588
==PROF== Disconnected from process 502514
==PROF== Disconnected from process 503187
==PROF== Report: /tmp/ncu_output_20241206_131245.ncu-rep

NCU profiling results for benchmark None:
NCU report has been written to /tmp/ncu_output_20241206_131245.ncu-rep
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142259
Approved by: https://github.com/eellison
2024-12-07 07:54:43 +00:00
fc831f76f8 Revert "[Inductor] Represent size_hints as a dict (#142249)"
This reverts commit f870ee2cc4f3dd1babd3043b5291d54f487a2999.

Reverted https://github.com/pytorch/pytorch/pull/142249 on behalf of https://github.com/blaine-rister due to would break internal tests ([comment](https://github.com/pytorch/pytorch/pull/142249#issuecomment-2524991008))
2024-12-07 07:43:51 +00:00
f870ee2cc4 [Inductor] Represent size_hints as a dict (#142249)
Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243.

# Feature

Follow up to https://github.com/pytorch/pytorch/pull/141751. Since we now represent `numels` as a dict, it's natural to extend this to `size_hints`. The latter are basically just the former rounded up to the nearest power of 2. This simplifies various heuristics such as the coordinate descent tuner. Where we previously needed to determine which index in `size_hints` corresponds to each dimension, now we can just query by prefix. This will be especially important when we enable 2D reductions, as it becomes harder to keep track of these things when we have multiple reduction dimensions. (See the previous PR for some examples.)

# Test plan

The existing CI provides good coverage. This PR modifies a few tests which explicitly constructed size hints.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142249
Approved by: https://github.com/jansel
2024-12-07 06:43:05 +00:00
a58d2f14e8 [DTensor] Add a private util for sharding tensor (#142288)
Locally shards a full tensor based on indicated sharding arrangement, and returns a DTensor containing the local shard.

Warning: this is a private API intended to skip the communication otherwise required by `distribute_tensor`. It is only applicable to the case where all ranks have the same `full_tensor`.
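
For contrast, a sketch of the public path that does perform the communication this private util skips; it assumes an initialized process group and CUDA devices, and the private util's exact name is not shown in this summary.

```python
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard, distribute_tensor

# Public path: distribute_tensor communicates to scatter shards, whereas the
# private util builds the local shard directly when every rank already holds
# the same full_tensor.
mesh = init_device_mesh("cuda", (dist.get_world_size(),))
full = torch.randn(8, 8)
dtensor = distribute_tensor(full, mesh, placements=[Shard(0)])
```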

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142288
Approved by: https://github.com/wz337
2024-12-07 05:30:18 +00:00
2d9b081012 Move knapsack algorithms into a separate file (#141451) (#141614)
Summary:

Original Context Doc: https://docs.google.com/document/d/1Gv5ZqN3UY7kTuCUd0JLU3L_onwVIr2vd3AiZyim1Muc/edit?disco=AAABZX_tqdk

### Changes

This diff restructures the Partitioners.py file in the _functorch package.

* Moves the three knapsack problem algorithms (greedy, ilp, dp) into a separate file

Test Plan:
### Unit Testing

```
$ buck test mode/opt //caffe2/test/functorch:test_ac
File changed: fbsource//xplat/caffe2/test/functorch/TARGETS
File changed: fbsource//xplat/caffe2/test/functorch
File changed: fbsource//xplat/caffe2/test
7 additional file change events
Soft Error: source_directory_includes_subpackage: Directory `v2.17.1-1` of package `fbsource//third-party/nccl` may not cover any subpackages, but includes subpackage `v2.17.1-1/src/tests`.
Soft Error: source_directory_includes_subpackage: Directory `v2.18.3-1` of package `fbsource//third-party/nccl` may not cover any subpackages, but includes subpackage `v2.18.3-1/src/tests`.
Soft Error: source_directory_includes_subpackage: Directory `v2.19.3-1` of package `fbsource//third-party/nccl` may not cover any subpackages, but includes subpackage `v2.19.3-1/src/tests`.
Buck UI: https://www.internalfb.com/buck2/a2f91f8a-5326-435e-9075-5af0de930b8b
Test UI: https://www.internalfb.com/intern/testinfra/testrun/7036874660108924
Network: Up: 28MiB  Down: 3.5GiB  (reSessionID-19af3d71-7528-448c-9126-5615d27b3bd7)
Jobs completed: 423656. Time elapsed: 3:57.2s.
Cache hits: 99%. Commands: 146147 (cached: 145758, remote: 317, local: 72)
Tests finished: Pass 8. Fail 0. Fatal 0. Skip 0. Build failure 0

```

### Tested on Local Training Run

```
CUDA_VISIBLE_DEVICES=5,6 AOT_PARTITIONER_DEBUG=1 PARTITIONER_MEMORY_BUDGET_PARETO=0 buck2 run mode/opt //aps_models/ads/icvr:icvr_launcher -- mode=local_fb_fm_v4 launcher.num_workers=2 2>&1 | tee log_2024-11-2320:41:50.757256__bento_trigger.txt
```

Output Summary Paste: P1685697066

Differential Revision: D65800097

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141614
Approved by: https://github.com/jansel, https://github.com/Chillee
2024-12-07 03:21:52 +00:00
c863227be3 [Quant][Inductor][X86] add fusion pass for linear_dynamic_fp16 (#141549)
**Description**
For `linear_dynamic_fp16`, we insert `quantize` and `dequantize` between x/w and linear to have the following pattern:
```
  x
  |
linear <- to_fp32 <- to_fp16 <- w
```
In Inductor, the pattern we finally see will be
```
fp32 activation
  |
(reshape)
  |
mm/addmm <- t <- to_fp32 <- tp_fp16 <- weight
  |
(reshape)
```
Or
```
fp32 activation
  |
expand
  |
 bmm <- expand <- t <- to_fp32 <- tp_fp16 <- weight
  |
(add)
```
The second pattern is for x.ndim > 2 and x is not contiguous. The first pattern is for other cases.

Fuse the pattern with weight prepack, and we get
```
fp32 activation
  |
onednn.linear_dynamic_fp16 <- onednn.linear_prepack_fp16 <- weight
```
After freezing, the prepack op is gone.

**Test plan**
```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_linear_dynamic_fp16
```

Differential Revision: [D66802159](https://our.internmc.facebook.com/intern/diff/D66802159)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141549
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2024-12-07 03:08:08 +00:00
7939b5f5f9 remove sccache from bazel, to go together with #140614 (#142241)
removes sccache from bazel builds. Will move bazel builds to periodic if build succeed

CUDA bazel test succeeded, moving to periodic

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142241
Approved by: https://github.com/malfet
2024-12-07 02:08:06 +00:00
40d1b5f490 Revert "Respect ROCR_VISIBLE_DEVICES on AMD GPU device discovery (#140320)"
This reverts commit add4a42ea2c56f7687a3564aefe9e017cd118936.

Reverted https://github.com/pytorch/pytorch/pull/140320 on behalf of https://github.com/huydhn due to Sorry for reverting your change but test_hip_device_count is failing in trunk after this land ([comment](https://github.com/pytorch/pytorch/pull/140320#issuecomment-2524742845))
2024-12-07 01:28:51 +00:00
db313c87f9 [OSS] Enable Flight Recorder buffer for all (#142260)
Summary: Enable collecting Flight Recorder data for all.

Test Plan: This has been rolled out internally for a while now.

Differential Revision: D66897635

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142260
Approved by: https://github.com/kwen2501, https://github.com/fduwjj, https://github.com/wconstab
2024-12-07 01:28:12 +00:00
78425bff30 [FSDP2] Move to public torch.distributed.fsdp (#141868)
**Overview**
This PR moves `torch/distributed/_composable/fsdp` to `torch/distributed/fsdp/_fully_shard` and makes public APIs available from `torch.distributed.fsdp`, e.g.:
```
from torch.distributed.fsdp import fully_shard
```
This is targeting 2.6 release. I rewrote some of the documentation with (hopefully) improved phrasing.
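
A minimal usage sketch under the new public path; it assumes torch.distributed is already initialized with a GPU backend and follows the usual shard-submodules-then-root pattern.

```python
import torch.nn as nn
from torch.distributed.fsdp import fully_shard

# Shard submodules first, then the root module.
model = nn.Sequential(nn.Linear(16, 16), nn.Linear(16, 16)).cuda()
for layer in model:
    fully_shard(layer)
fully_shard(model)
```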

**Changes for Reland**
- Preserved the public objects from `torch/distributed/_composable/fsdp/fully_shard.py` so that the import path still works internally
- Added a unit test that we can do `from torch.distributed._composable.fsdp.fully_shard import FSDPModule`

Differential Revision: [D66890387](https://our.internmc.facebook.com/intern/diff/D66890387)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141868
Approved by: https://github.com/kwen2501, https://github.com/wconstab, https://github.com/weifengpy, https://github.com/fegin, https://github.com/XilunWu

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2024-12-07 01:24:28 +00:00
868d62552d [aoti] Add load_constants to package api (#142246)
Summary:
With the changes in https://github.com/pytorch/pytorch/pull/140755 and https://github.com/pytorch/pytorch/pull/141997, I added a load_constants function to the packaging API. Currently this doesn't work for cpu.

The workflow is something like:

```
ep = torch.export.export(model, example_inputs)
package = torch._inductor.aoti_compile_and_package(ep, inductor_configs=inductor_configs)
compiled = torch._inductor.aoti_load_package(package)

print(compiled.get_constant_fqns())  # see what are the fqns needed/available

compiled.load_constants(new_state_dict, check_full_update=True)  # update the constants in AOTI
```

You can also use the `aot_inductor.package_constants_in_so` config to stop including the constants in the so:
```
package = torch._inductor.aoti_compile_and_package(ep, inductor_configs={"aot_inductor.package_constants_in_so": False})
compiled = torch._inductor.aoti_load_package(package)
compiled(*inputs)  # segfaults because there are no constants --> we should probably have a better error msg

compiled.load_constants(new_state_dict, check_full_update=True)
compiled(*inputs)
```

Test Plan: `buck2 run @//mode/dev-nosan //caffe2/test/inductor:aot_inductor_package -- -r "test_so_without_weight"  `

Differential Revision: D66796206

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142246
Approved by: https://github.com/henrylhtsang, https://github.com/desertfire
2024-12-07 01:18:42 +00:00
3a3638be50 [BE] Enable Scalar.h compilation on 32-bit system (#142235)
By hiding ambiguous Scalar(long long) constructor behind `std::enable_if_t<sizeof(void *) == 8>`

Followup after https://github.com/pytorch/pytorch/pull/141244

Test Plan: Run `printf "#include <c10/core/Scalar.h>\n c10::Scalar x(3);" | gcc -x c++ -std=c++17 -I. -Ibuild - -c` on ARMv7 system.
Before this change it failed with:
```
In file included from <stdin>:1:
./c10/core/Scalar.h:83:3: error: ‘c10::Scalar::Scalar(long long int)’ cannot be overloaded with ‘c10::Scalar::Scalar(int64_t)’
   83 |   Scalar(long long vv) : Scalar(vv, true) {}
      |   ^~~~~~
./c10/core/Scalar.h:50:3: note: previous declaration ‘c10::Scalar::Scalar(int64_t)’
   50 |   Scalar(type vv) : Scalar(vv, true) {}
      |   ^~~~~~
./c10/core/ScalarType.h:288:3: note: in expansion of macro ‘DEFINE_IMPLICIT_CTOR’
  288 |   _(int64_t, Long)                                \
      |   ^
./c10/core/Scalar.h:52:3: note: in expansion of macro ‘AT_FORALL_SCALAR_TYPES_AND7’
   52 |   AT_FORALL_SCALAR_TYPES_AND7(
      |   ^~~~~~~~~~~~~~~~~~~~~~~~~~~
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142235
Approved by: https://github.com/Skylion007
2024-12-07 01:05:55 +00:00
022cbf2f31 Back out "[Reland][Environment Variable][5/N] Use thread-safe getenv functions (#140594)" (#142226)
Summary: Failed to write the auto-tune result to `PYTORCH_TUNABLEOP_FILENAME` with this change (empty file) , reverting to unblock.

Test Plan: CI

Differential Revision: D66870750

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142226
Approved by: https://github.com/leitian
2024-12-07 00:59:22 +00:00
c6e18a1ed1 [EZ] Remove unused binary_linux_build.sh (#142276)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142276
Approved by: https://github.com/huydhn
2024-12-07 00:48:56 +00:00
716a06d22c Mark async-tp ops as needs_fixed_stride_order (#142252)
Inductor seems to not respect the input striding of these ops, which is required for fp8 async-TP and has performance implications in other cases.
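
For illustration, a sketch of how a custom op can be tagged so that Inductor preserves input strides (hypothetical library/op names; the real async-TP ops are registered inside PyTorch itself, and this assumes `Library.define` accepts a `tags` argument as in recent releases):
```python
import torch
from torch.library import Library

lib = Library("mylib", "FRAGMENT")  # hypothetical namespace
lib.define(
    "my_stride_sensitive_op(Tensor x) -> Tensor",
    tags=(torch.Tag.needs_fixed_stride_order,),
)
lib.impl("my_stride_sensitive_op", lambda x: x.clone(), "CompositeExplicitAutograd")
```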

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142252
Approved by: https://github.com/weifengpy
2024-12-07 00:42:27 +00:00
be16dd678e [BE][MPS] Fix unused parameter warning
Before this change running `xcrun metal -c Indexing.metal -Wall -Wextra -fno-fast-math` resulted in 
```
Indexing.metal:75:10: warning: unused parameter 'thread_index' [-Wunused-parameter]
    uint thread_index [[thread_position_in_grid]]) {
         ^
1 warning generated.
```
After no warnings are generated

Also, remove redundant semicolons
2024-12-06 16:41:27 -08:00
a30bfab224 random dag (#142180)
Utils for creating random dags and generating code for nn modules based on such dags.

In particular, this was used to do fuzz testing for unflatten, where the random dags instructed the generation of calls, const accesses, and buffer mutations in a system of nn modules.

Example of generated test:
```python
    def test_unflatten_random_dag_const_preserving_3(self):
        class N2(torch.nn.Module):
            def __init__(self):
                super().__init__()
                self.const = torch.ones(1)

            def forward(self, x):
                return x + 1

        class N1(torch.nn.Module):
            def __init__(self):
                super().__init__()
                self.const = torch.ones(1)
                self.n2 = N2()

            def forward(self, x):
                x = x + self.n2.const
                x = self.n2(x + 1)
                return x + 1

        class N0(torch.nn.Module):
            def __init__(self):
                super().__init__()
                self.const = torch.ones(1)
                self.n1 = N1()

            def forward(self, x):
                x = x + self.n1.n2.const
                x = self.n1(x + 1)
                x = self.n1.n2(x + 1)
                return x + 1

        inp = (torch.ones(1),)
        eager = N0()(*inp)
        ep = torch.export.export(
            N0(),
            inp,
            strict=False,
            preserve_module_call_signature=(
                "n1",
                "n1.n2",
            ),
        )
        epm = ep.module()
        ufm = torch.export.unflatten(ep)
        assert torch.allclose(epm(*inp), eager)
        assert torch.allclose(ufm(*inp), eager)
```

Differential Revision: [D66838348](https://our.internmc.facebook.com/intern/diff/D66838348/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142180
Approved by: https://github.com/angelayi, https://github.com/ydwu4
2024-12-07 00:39:43 +00:00
0052943bee [SymmetricMemory] reorganize the op registry (#140763)
- Separates the definition and implementation
- Removes the false pt2_compliant flags

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140763
Approved by: https://github.com/weifengpy
2024-12-06 23:59:11 +00:00
6e203ae6de [REFACTOR] Implement AOTDispatchCompiler wrapper (#142205)
This implements a new wrapper class, AOTDispatchCompiler, which is just a wrapper around a callable that returns an OutputCode. We can then use it in AOTDispatch to decide whether or not to use the cache: if fw_compiler, bw_compiler, and inference_compiler are all AOTDispatchCompilers, then we enable caching.

This type is pretty close to _CompiledFxGraphCallable, except it's not allowed to take any kwargs. Not sure how to consolidate the two ideas together just yet: unfortunately, there's no way to properly annotate the types to make them related. But a lot of the time, the input to this function will be a partially applied _CompiledFxGraphCallable.

This allows the PR above this one to enable AOTAutogradCache everywhere, but not increase instruction count or enable the cache on unit tests that use aot_eager or other non-Inductor compilers.
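
A rough sketch of the idea (not the actual PyTorch class), assuming the wrapped compiler callable returns an OutputCode-like object:
```python
from typing import Any, Callable

class AOTDispatchCompilerSketch:
    """Marks a compiler callable as returning OutputCode, i.e. cache-safe."""

    def __init__(self, compiler_fn: Callable[..., Any]) -> None:
        self.compiler_fn = compiler_fn

    def __call__(self, gm, example_inputs):
        # Expected to return an OutputCode (Inductor's serializable result).
        return self.compiler_fn(gm, example_inputs)

def cache_enabled(fw_compiler, bw_compiler, inference_compiler) -> bool:
    # Enable AOTAutogradCache only when all three compilers are wrapped.
    return all(
        isinstance(c, AOTDispatchCompilerSketch)
        for c in (fw_compiler, bw_compiler, inference_compiler)
    )
```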

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142205
Approved by: https://github.com/oulgen, https://github.com/bdhirsh
2024-12-06 23:23:20 +00:00
5663ad99e7 Fix per-sample xfails for NJT tests (#142243)
#140736 fixed some xfails, but these were not properly failing in CI due to #142157. This PR removes the xfails so we can land a fix to that issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142243
Approved by: https://github.com/huydhn
2024-12-06 22:39:35 +00:00
ceb94d6a7d add torchrec collectives to enforce global ordering (#141970)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141970
Approved by: https://github.com/yf225
2024-12-06 22:38:54 +00:00
UV
7597ab6370 Corrected AMSGrad max equation in Adam and AdamW (#142051)
Fixes #142041
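
For reference, the standard AMSGrad formulation tracks a running maximum of the bias-corrected second-moment estimate and uses it in the denominator; the docs fix concerns this max equation (textbook form shown below, not necessarily the exact rendered docs):
```math
\hat{v}^{\,\max}_t = \max\!\left(\hat{v}^{\,\max}_{t-1},\ \hat{v}_t\right), \qquad
\theta_t = \theta_{t-1} - \frac{\gamma\,\hat{m}_t}{\sqrt{\hat{v}^{\,\max}_t} + \epsilon}
```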

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142051
Approved by: https://github.com/janeyx99
2024-12-06 21:55:26 +00:00
424156c26c [ROCm] Update to AOTriton 0.8b (#140172)
Notable new features for SDPA operators on AMD systems from AOTriton 0.8b:

1. Nestedtensor support;
2. MQA/GQA support;
3. Restore Efficient attention support for causal=True and seqlen_q != seqlen_k cases;
    + The kernel should use top-left alignment, bottom right alignment will be added later
4. Move gfx1100 (RX7900/W7800/W7900) out of experimental support status.
   However, users are strongly recommended to update to ROCM 6.2.4, notably for
   its firmware updates.

Related unit tests are enabled as well.

Notable related changes from AOTriton 0.8b:

1. AOTriton 0.8b moves the GPU kernel out of libaotriton.so to a separate directory `aotriton.images`;
2. LZMA replaces ZSTD as the GPU kernel compression algorithm for a better compression ratio: AOTriton 0.8b (.so + aotriton.images) takes 350MB, compared to 800MB for the AOTriton 0.7b .so;
3. The compression cannot be disabled now, and `liblzma` is a hard run-time dependency.
    + This should not be a problem, since `lzma` is part of the Python Standard Library.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140172
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
2024-12-06 21:45:18 +00:00
eqy
0a619a212f [CUDA] Cleanup per-process-memory-fraction in test_cuda.py tests (#140852)
Otherwise certain sequences of tests will fail with OOM e.g.,
```
# python test/test_cuda.py -k max_split_expandable -k test_assigning_back_deleter_fns_to_tensor --repeat 100
..
----------------------------------------------------------------------
Ran 2 tests in 0.311s

OK
E.
======================================================================
ERROR: test_assigning_back_deleter_fns_to_tensor (__main__.TestBlockStateAbsorption.test_assigning_back_deleter_fns_to_tensor)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/workspace/pytorch/torch/testing/_internal/common_utils.py", line 3058, in wrapper
    method(*args, **kwargs)
  File "/workspace/pytorch/test/test_cuda.py", line 4320, in test_assigning_back_deleter_fns_to_tensor
    graph, outputs = cudagraphify(foo, [inp])
                     ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/pytorch/test/test_cuda.py", line 4080, in cudagraphify
    fn(*inputs)
  File "/workspace/pytorch/test/test_cuda.py", line 4316, in foo
    int8_cuda(LARGE_BUFFER) + x,
    ~~~~~~~~~~~~~~~~~~~~~~~~^~~
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 160.00 MiB. GPU 0 has a total capacity of 31.73 GiB of which 31.30 GiB is free. Process 2916661 has 442.00 MiB memory in use. 120.00 MiB allowed; Of the allocated memory 52.00 MiB is allocated by PyTorch, and 6.00 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

To execute this test, run the following from the base repo dir:
    python test/test_cuda.py TestBlockStateAbsorption.test_assigning_back_deleter_fns_to_tensor
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

----------------------------------------------------------------------
Ran 2 tests in 0.136s

FAILED (errors=1)
```
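
A sketch of the cleanup pattern this implies for such tests (unittest-style, fraction values hypothetical):
```python
import unittest
import torch

class MemoryFractionTest(unittest.TestCase):
    def setUp(self):
        # Cap the caching allocator for this test only.
        torch.cuda.set_per_process_memory_fraction(0.005)

    def tearDown(self):
        # Restore the uncapped default so later tests in the same process
        # don't hit spurious OOMs.
        torch.cuda.set_per_process_memory_fraction(1.0)

    def test_small_alloc(self):
        torch.ones(8, device="cuda")
```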

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140852
Approved by: https://github.com/Skylion007
2024-12-06 21:26:54 +00:00
660845a1aa [AOTI] Add deprecation warning for torch._export.aot_load (#142212)
Summary: Add a deprecation warning for torch._export.aot_load, and encourage users to move to the new torch._inductor.aoti_load_package.
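A migration sketch (artifact paths are hypothetical placeholders; the two entry points consume different artifacts):
```python
import torch

# Deprecated: load a raw AOTInductor .so produced by torch._export.aot_compile.
runner = torch._export.aot_load("/tmp/model.so", "cuda")

# Preferred: load a .pt2 package produced by torch._inductor.aoti_compile_and_package.
compiled = torch._inductor.aoti_load_package("/tmp/model.pt2")
```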
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142212
Approved by: https://github.com/angelayi
2024-12-06 21:12:34 +00:00
f36cccba2e Revert "[Inductor] Expand dtype aware codegen for libdevice and tl.math ops (#140864)"
This reverts commit 80ca6dd892613fd4f1dee9040b8273ddeadb1c50.

Reverted https://github.com/pytorch/pytorch/pull/140864 on behalf of https://github.com/atalman due to failing internally ([comment](https://github.com/pytorch/pytorch/pull/140864#issuecomment-2524168602))
2024-12-06 21:03:06 +00:00
cyy
1fa27f6e82 [3/N] Avoid copy in std::get (#141843)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141843
Approved by: https://github.com/Skylion007
2024-12-06 20:13:36 +00:00
add4a42ea2 Respect ROCR_VISIBLE_DEVICES on AMD GPU device discovery (#140320)
Fixes #140318

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140320
Approved by: https://github.com/eqy, https://github.com/jithunnair-amd, https://github.com/jataylo, https://github.com/jeffdaily

Co-authored-by: Jack Taylor <jack.taylor@amd.com>
2024-12-06 20:09:56 +00:00
37c4b19e4d make sure ukernel prod is everywhere XNNPACK is (#142086)
Just double checking that ukernel prod (which should be linked with XNNPACK) is in all the places XNNPACK is
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142086
Approved by: https://github.com/kirklandsign
2024-12-06 20:09:30 +00:00
18ef3a09cd Add option in data loader for out of order data (#141833)
Fixes #105203

Facing a similar problem to the linked issue, where variable-sized input data can mean that a handful of slow-to-process samples holds up smaller, faster-to-process samples from being used. This also leads to lower GPU utilization. In certain cases, e.g. evaluation epochs, inference pipelines, or other cases where reproducibility isn't important, this can bring significant speed-ups.

This PR adds an `allow_out_of_order` bool input to the `DataLoader` class, defaulting to `False`; when set to `True`, the DataLoader returns data from workers in whatever order it is ready/processed in, rather than in strict index order.
Instead of storing data that was returned out of order, it is passed directly to the main thread and the entry in `_task_info` is deleted. The main changes are to check that an entry in `_task_info` exists, and to only increase `self._rcvd_idx` when the lowest remaining index is returned.

Two tests are added to test this for iterable type datasets and index type datasets.
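
A usage sketch based on the flag name described above (the exact keyword may differ in the final API):
```python
import torch
from torch.utils.data import DataLoader

dataset = torch.arange(1000)  # any map-style dataset works here

loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,
    allow_out_of_order=True,  # trade strict sample order for throughput
)
for batch in loader:
    pass
```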

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141833
Approved by: https://github.com/andrewkho
2024-12-06 19:55:58 +00:00
61a7c83c64 [Inductor] fix device error for NopKernelSchedulerNode (#141372)
This PR adds device guard support for NopKernelSchedulerNode, which may create a tensor. Prior to this PR, we did not codegen a device guard for NopKernelSchedulerNode, leading to errors.

Prior to the PR:
```python
def call(args):
    arg0_1, arg1_1, arg2_1, arg3_1, arg4_1, arg5_1, arg6_1, arg7_1, arg8_1, arg9_1, arg10_1 = args
    args.clear()
    assert_size_stride(arg0_1, (1, 1, 2048, 128), (262144, 262144, 128, 1))
    assert_size_stride(arg1_1, (1, 1, 2048, 128), (262144, 262144, 128, 1))
    assert_size_stride(arg2_1, (1, 1, 2048, 128), (262144, 262144, 128, 1))
    assert_size_stride(arg3_1, (1, 1, 16), (16, 16, 1))
    assert_size_stride(arg4_1, (1, 1, 16, 16), (256, 256, 16, 1))
    assert_size_stride(arg5_1, (1, 1, 16), (16, 16, 1))
    assert_size_stride(arg6_1, (1, 1, 16, 16), (256, 256, 16, 1))
    assert_size_stride(arg7_1, (1, 1, 16), (16, 16, 1))
    assert_size_stride(arg8_1, (1, 1, 16, 16), (256, 256, 16, 1))
    assert_size_stride(arg9_1, (1, 1, 16), (16, 16, 1))
    assert_size_stride(arg10_1, (1, 1, 16, 16), (256, 256, 16, 1))
    buf0 = empty_strided_cuda((1, 1, 2048), (2048, 2048, 1), torch.float32) # TODO: ERROR here. Should be cuda:1
    with torch.cuda._DeviceGuard(1):
        torch.cuda.set_device(1)
        buf1 = empty_strided_cuda((1, 1, 2048, 128), (262144, 262144, 128, 1), torch.bfloat16)
        # Topologically Sorted Source Nodes: [flex_attention], Original ATen: []
        stream1 = get_raw_stream(1)
        breakpoint()
        triton_tem_fused_0.run(arg0_1, arg1_1, arg2_1, buf0, arg3_1, arg4_1, arg5_1, arg6_1, buf1, grid=torch._inductor.kernel.flex_attention.flex_attention_grid(1, 1, 2048, 128, meta0), stream=stream1)
        del arg0_1
        del arg1_1
        del arg2_1
        del arg3_1
        del arg4_1
        del arg5_1
        del arg6_1
        del buf0
    return (buf1, )
```

After the PR:
```python
def call(args):
    arg0_1, arg1_1, arg2_1, arg3_1, arg4_1, arg5_1, arg6_1, arg7_1, arg8_1, arg9_1, arg10_1 = args
    args.clear()
    assert_size_stride(arg0_1, (1, 1, 2048, 128), (262144, 262144, 128, 1))
    assert_size_stride(arg1_1, (1, 1, 2048, 128), (262144, 262144, 128, 1))
    assert_size_stride(arg2_1, (1, 1, 2048, 128), (262144, 262144, 128, 1))
    assert_size_stride(arg3_1, (1, 1, 16), (16, 16, 1))
    assert_size_stride(arg4_1, (1, 1, 16, 16), (256, 256, 16, 1))
    assert_size_stride(arg5_1, (1, 1, 16), (16, 16, 1))
    assert_size_stride(arg6_1, (1, 1, 16, 16), (256, 256, 16, 1))
    assert_size_stride(arg7_1, (1, 1, 16), (16, 16, 1))
    assert_size_stride(arg8_1, (1, 1, 16, 16), (256, 256, 16, 1))
    assert_size_stride(arg9_1, (1, 1, 16), (16, 16, 1))
    assert_size_stride(arg10_1, (1, 1, 16, 16), (256, 256, 16, 1))
    with torch.cuda._DeviceGuard(1):
        torch.cuda.set_device(1)
        buf0 = empty_strided_cuda((1, 1, 2048), (2048, 2048, 1), torch.float32) # New: move into device guard
        buf1 = empty_strided_cuda((1, 1, 2048, 128), (262144, 262144, 128, 1), torch.bfloat16)
        # Topologically Sorted Source Nodes: [flex_attention], Original ATen: []
        stream1 = get_raw_stream(1)
        triton_tem_fused_0.run(arg0_1, arg1_1, arg2_1, buf0, arg3_1, arg4_1, arg5_1, arg6_1, buf1, grid=torch._inductor.kernel.flex_attention.flex_attention_grid(1, 1, 2048, 128, meta0), stream=stream1)
        del arg0_1
        del arg1_1
        del arg2_1
        del arg3_1
        del arg4_1
        del arg5_1
        del arg6_1
        del buf0
    return (buf1, )
```

Fixes #141010

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141372
Approved by: https://github.com/eellison
2024-12-06 19:27:50 +00:00
3fd51e079d Revert "[Inductor] Constrain the shape of other tensor for Conv/Linear + broadcast add fusion. (#141759)"
This reverts commit 35752cb1ba8324a00b06d72ed388f6437c82c5e5.

Reverted https://github.com/pytorch/pytorch/pull/141759 on behalf of https://github.com/atalman due to Failing internally ([comment](https://github.com/pytorch/pytorch/pull/141759#issuecomment-2523983558))
2024-12-06 19:14:36 +00:00
db13bd9ac2 Revert "export AOTI_TORCH_EXPORT on Windows. (#140030)"
This reverts commit b8eb4b56d8dbcd07570bec616f7ea58e9dd58fb4.

Reverted https://github.com/pytorch/pytorch/pull/140030 on behalf of https://github.com/atalman due to Break internal tests see errors like: csrc\inductor\aoti_torch\shim_common.cpp(481): error C2491: 'aoti_torch__embedding_bag': definition of dllimport function not allowed ([comment](https://github.com/pytorch/pytorch/pull/140030#issuecomment-2523968128))
2024-12-06 19:04:04 +00:00
cf58de59d7 [cuBLASLt][Memtracker] Allocate temporary cuBLASLt workspaces using tensors rather than going to the caching allocator directly (#139442)
CC @zdevito @janeyx99

This isn't ideal but cuBLASLt workspaces are not currently cached, so this additional untracked allocation will cause `test_cuda_tracker_equivalence` to fail with a large enough workspace size e.g., `CUBLAS_LT_WORKSPACE_SIZE=32768`. One solution is to just use byte-tensors for the workspace instead of going directly to the caching allocator.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139442
Approved by: https://github.com/Aidyn-A, https://github.com/albanD, https://github.com/janeyx99
2024-12-06 19:01:12 +00:00
b7b56576d8 Allow user to manually pass module.name associated with global in {add}_safe_global (#142153)
Fixes #142144

A global x is saved in a checkpoint as `GLOBAL x.__module__ x.__name__`. So, after allowlisting a global, it is expected to match any GLOBAL instruction of the form `GLOBAL x.__module__ x.__name__`. However, there are edge cases where, for the same API from the same module, the value of `__module__` changes between versions, which prevents users from allowlisting the global.

In this case, in numpy < 2.1

```
torch.save("bla", np_array)
# checkpoint has GLOBAL "np.core.multiarray" "_reconstruct"
```
In np version 2.1

```
with safe_globals([np.core.multiarray._reconstruct]):
    torch.load("bla")
```
np.core.multiarray._reconstruct.__module__ gives "np._core.multiarray" (note the extra _ before core); see what was done [here](https://github.com/numpy/numpy/blob/main/numpy/core/multiarray.py)

Since the dictionary used to look up safe globals is keyed on "{foo.__module__}.{foo.__name__}", the `__module__` and `__name__` will no longer match those in the checkpoint, so "np.core.multiarray._reconstruct" can no longer be properly allowlisted (instead, np._core.multiarray._reconstruct is a key in the dict).

We allow `add_safe_globals`/`safe_globals` to optionally take tuples of (global, str of module.name) to work around such (odd, edge-case) situations.
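
A usage sketch of the new tuple form (the module string is illustrative; it should match whatever module path the checkpoint actually recorded):
```python
import numpy as np
import torch
from torch.serialization import safe_globals

# Allowlist _reconstruct under the module path stored in the checkpoint, even
# though numpy >= 2.1 reports its __module__ as "numpy._core.multiarray".
with safe_globals([(np.core.multiarray._reconstruct, "numpy.core.multiarray._reconstruct")]):
    obj = torch.load("checkpoint.pt", weights_only=True)
```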

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142153
Approved by: https://github.com/albanD
2024-12-06 18:56:39 +00:00
1a7da6e7e9 [export] Add test to enforce consistency between synced thrift and generated thrift from schema.py (#141989)
Summary:
In this diff we implement a way to ensure the internal thrift schema from cfgr (configerator/structs/caffe2/torch/export/schema.thrift) and the schema in OSS (torch/_export/serde/schema.thrift) are in sync, by adding a unittest to reflect on the type names and fields from each schema and compare them field by field.

When we detect new fields/types from torch/_export/serde/schema.thrift, there will be a test failure on trunk, and the error message hints that people should add the missing field/type to the thrift schema from cfgr, so that they are always in sync in practice.

Test Plan: buck test mode/opt caffe2/test:test_export -- -r test_thrift_schema_in_sync

Differential Revision: D66716834

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141989
Approved by: https://github.com/yiming0416
2024-12-06 18:42:20 +00:00
bab15df40a Revert "[FSDP2] Move to public torch.distributed.fsdp (#141868)"
This reverts commit 45583a5df907a7948693c047e5fe2c8349622069.

Reverted https://github.com/pytorch/pytorch/pull/141868 on behalf of https://github.com/atalman due to failing internally ([comment](https://github.com/pytorch/pytorch/pull/141868#issuecomment-2523925180))
2024-12-06 18:38:12 +00:00
4af7aa5e64 Revert "E2E composability testing (#141398)"
This reverts commit ad93aa854d2d7837c917ae81cfb8f3bf05ee58c9.

Reverted https://github.com/pytorch/pytorch/pull/141398 on behalf of https://github.com/atalman due to Sorry need to revert https://github.com/pytorch/pytorch/pull/141868, we can try rebase and reland this after ([comment](https://github.com/pytorch/pytorch/pull/141398#issuecomment-2523908998))
2024-12-06 18:28:51 +00:00
683ec42958 Revert "Unbreak dynamic shape minimal arrayref interface tests (#142091)"
This reverts commit 2bfc600644ed59332f9da7b94558b9c4c9562b0d.

Reverted https://github.com/pytorch/pytorch/pull/142091 on behalf of https://github.com/atalman due to Breaks internal changes ([comment](https://github.com/pytorch/pytorch/pull/142091#issuecomment-2523906048))
2024-12-06 18:25:54 +00:00
f2f95ba813 [dynamo] Remove workaround for functools.wraps in functorch (#142014)
This is no longer needed after #142000.

Fixes #123365.

D66838774
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142014
Approved by: https://github.com/zou3519
ghstack dependencies: #142000
2024-12-06 17:34:59 +00:00
c0ffeab02f [dynamo] Simplify handling of functools.wraps (#142000)
Previously, when Dynamo encountered a `functools.wraps(...)` call, it would
check `VariableTracker.can_reconstruct` and graph break if that failed.

That has 2 issues:
1. Implementation of `can_reconstruct` is incorrect, since logic of
   reconstructability isn't necessarily encapsulated in
   `VariableTracker.reconstruct` -- for some VTs like `CellVariable`,
   it's also in `SideEffects.codegen_save_tempvars`. This is exposed by
   #134731.
2. We don't always need to reconstruct the result of
   `functools.wraps(...)`; in those cases we don't want to give up
   tracing because of an early `can_reconstruct` check. Instead we could just
   let it fall through, and graph break in the actual `reconstruct` call
   later, if needed.

This patch removes the `can_reconstruct` check altogether. It was
introduced in #114279, but the added tests pass even without the check
now; this might be because of some recent bug fixing on cells and side
effects.
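
A minimal sketch of the kind of code affected (hypothetical decorator; whether a given case fully traces still depends on whether the wrapped function must be reconstructed):
```python
import functools
import torch

def my_decorator(f):
    @functools.wraps(f)  # previously could trigger an early graph break
    def wrapper(*args, **kwargs):
        return f(*args, **kwargs)
    return wrapper

@torch.compile(backend="eager")
def fn(x):
    @my_decorator
    def inner(y):
        return y + 1
    return inner(x)

fn(torch.ones(3))
```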

Fixes #134731, #141514.

D66838708
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142000
Approved by: https://github.com/zou3519
2024-12-06 17:34:59 +00:00
5872a8c6b0 Use task submitter TLS in gloo working threads (#142184)
Fixes: #86830

CC: @albanD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142184
Approved by: https://github.com/albanD
2024-12-06 17:03:17 +00:00
692b5e75ed [logging] Add triton_compile_time_us column to dynamo_compile (#142068)
Test Plan: See internal diff [D66799565](https://www.internalfb.com/diff/D66799565)

Differential Revision: [D66799565](https://our.internmc.facebook.com/intern/diff/D66799565)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142068
Approved by: https://github.com/c00w
2024-12-06 16:11:57 +00:00
b64a537993 [CD] xpu nightly manylinux whl with cxx11-abi (#142210)
Follow https://github.com/pytorch/pytorch/issues/123649
Works for https://github.com/pytorch/pytorch/issues/114850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142210
Approved by: https://github.com/EikanWang, https://github.com/atalman, https://github.com/malfet
2024-12-06 15:10:47 +00:00
34033cce4d Enable concat support through inductor using pointwise kernels (#141966)
Summary: Add ability to always force pointwise kernel for concat codegen through Inductor.

Differential Revision: D66669372

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141966
Approved by: https://github.com/eellison, https://github.com/blaine-rister, https://github.com/jansel
2024-12-06 14:28:07 +00:00
661d1f0372 [aotd] non-contiguous NestedTensor mutation in compile (#139630)
Allow mutations for subclasses that are non-contiguous.

Changes:

Removing assert in collect_metadata_analysis

Main requested testcase:
Compilation of NJT.index_put()

Adding test in test_nestedtensor.py, that compiles NJT.index_put()

It is decomposed into NJT split/unbind, which needed additional `torch._check`/`torch._check_is_size` for NJT.unbind() and guard_size_oblivious() usage in _meta_registrations and _inductor/lowering.py.

Special case:
If a tangent is mutated outside of the graph, it does not participate in the backward graph. Autograd in this case will set this tangent to a zeros tensor.

We handle it separately in CompiledFunction.backward: we skip any processing for this tangent and broadcast it to the number of expected unwrapped subclass arguments.

Disabling 2 tests for dynamo:
1/ For nested tensor, there is a symbolic-shapes issue on the nested_tensor index operation that does splits [0, 0, 0]: it fails with "pending unbacked symints". This PR does not add more .tolist()/.item() ops than there were before.

2/ As we no longer fail with an exception in collect_metadata_analysis, new paths for dynamo started working, and they started failing with something strange: the set_ handling in storage_offset (because of the test for views) updates the storage "cpu" -> "meta".

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139630
Approved by: https://github.com/bdhirsh
2024-12-06 12:18:46 +00:00
c683839e6e [AOTI] Clean up temporary files generated by AOTI package loader. (#141773)
Fix #141772
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141773
Approved by: https://github.com/desertfire, https://github.com/EikanWang
2024-12-06 11:46:47 +00:00
c8c669ce74 When using a third-party device to test DeviceMesh, the error check for 'test_raises_invalid_device_type' can only print 'GPU' (#142038)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142038
Approved by: https://github.com/kwen2501
2024-12-06 11:14:00 +00:00
cc64ad659d Detect accelerator type when backend is not specified (#142216)
Today, when a user calls `init_process_group()` without a `backend` or `device_id` specification, we auto-translate it into `cuda:nccl,cpu:gloo`. The idea was to initialize all **default** backends to cover whatever the user may do later.

A side effect is increased initialization time and resource usage.

This PR changes it to detect the accelerator type on the machine and initialize only the backend for that accelerator.
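
Usage is unchanged; the difference is only in which backends get initialized (a sketch, assuming the script is launched via torchrun so the env:// rendezvous is set up):
```python
import torch.distributed as dist

# No backend/device_id given: the accelerator type is detected and only that
# backend is initialized (e.g. just NCCL on a CUDA machine), instead of
# eagerly creating "cuda:nccl,cpu:gloo".
dist.init_process_group()
```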

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142216
Approved by: https://github.com/wconstab, https://github.com/XilunWu
2024-12-06 10:55:56 +00:00
cyy
5d3622447d Enable Wtype-limits (#142099)
Since it can detect underflow bugs involving unsigned integers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142099
Approved by: https://github.com/ezyang
2024-12-06 08:14:18 +00:00
01d7644dc9 [dynamo] Undo some jvp old workarounds in functorch (#142082)
This basically undoes most of the workarounds introduced in #123118, the
root causes of which have been fixed by #142078.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142082
Approved by: https://github.com/zou3519
ghstack dependencies: #142078, #142080, #142081
2024-12-06 08:06:53 +00:00
9d54cd1504 [dynamo] Undo some jvp old workarounds in functorch (#142081)
This basically undoes some workarounds introduced in #119926, the
root causes of which have been fixed by #142078 and other changes in
Dynamo.

Now that Dynamo traces the spec comparison code, the test also needs update:
- removing the `_jvp_treespec_compare` calls in fx graph

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142081
Approved by: https://github.com/zou3519
ghstack dependencies: #142078, #142080
2024-12-06 08:06:53 +00:00
59de5e867b [dynamo] Undo some vjp old workarounds in functorch (#142080)
This basically undoes most of the workarounds introduced in #119405, the
root causes of which have been fixed by #142078 and other changes in
Dynamo.

Now that Dynamo traces the spec comparison code, the test also needs update:
1. renaming `o` to `primals_out`
2. removing the `_vjp_treespec_compare` calls in fx graph

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142080
Approved by: https://github.com/zou3519
ghstack dependencies: #142078
2024-12-06 08:06:53 +00:00
aab0f32ea4 [dynamo] Properly handle != under user-defined __eq__ (#142078)
Previously Dynamo modelled `object.__ne__` as just comparison over value
identity; however, in CPython the default `!=` dispatches to `__eq__`,
which might have been overridden by the user. This patch fixes the behavior
divergence.
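
A plain-Python sketch of the semantics in question; under the old modeling, compiled code could report `True` here (identity-based), while eager CPython gives `False`:
```python
class Always:
    def __eq__(self, other):
        return True  # user-defined __eq__

a, b = Always(), Always()
print(a != b)  # False: the default __ne__ negates __eq__, not object identity
```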

Fixes #142055.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142078
Approved by: https://github.com/jansel, https://github.com/zou3519
2024-12-06 08:06:53 +00:00
c5cfc6a4c9 [pipelining] forward fix for _validate_schedule (#142211)
https://github.com/pytorch/pytorch/pull/142009 broke CSV loading since it can no longer handle schedules with `I` and `W`. This was caught in the torchtitan tests which loads a custom CSV file using `I` and `W` https://github.com/pytorch/torchtitan/actions/runs/12188167461/job/34000683921?pr=689.

Follow up would be to test a real custom schedule in PyTorch rather than torchtitan. The custom schedule in titan is here:  https://github.com/pytorch/torchtitan/blob/main/test/assets/custom_schedule.csv

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142211
Approved by: https://github.com/mori360
ghstack dependencies: #142009
2024-12-06 08:04:31 +00:00
eqy
8fc6d3a5d8 [SDPA] Allow user-specified priority order with context manager (#140467)
TODO: docs changes?
For better debuggability of issues like https://github.com/pytorch/pytorch/issues/139298

Better testing, current sketch:

``` Python
import torch
from torch.nn.functional import scaled_dot_product_attention
from torch.nn.attention import SDPBackend, sdpa_kernel

q = torch.randn(64, 1024, 8, 64, dtype=torch.half, device='cuda')
print(torch._C._get_sdp_priority_order())

orders = [[SDPBackend.CUDNN_ATTENTION, SDPBackend.MATH, SDPBackend.EFFICIENT_ATTENTION],
          [SDPBackend.MATH, SDPBackend.CUDNN_ATTENTION, SDPBackend.EFFICIENT_ATTENTION],
          [SDPBackend.EFFICIENT_ATTENTION, SDPBackend.CUDNN_ATTENTION, SDPBackend.MATH]]
import time
times = list()
for order in orders:
    print(order)
    with sdpa_kernel(order, set_priority=True):
        scaled_dot_product_attention(q, q, q)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    with sdpa_kernel(order, set_priority=True):
        scaled_dot_product_attention(q, q, q)
    torch.cuda.synchronize()
    t1 = time.perf_counter()
    times.append(t1 - t0)
print(times)
assert times[0] < times[1]
assert times[0] > times[2]
assert times[1] > times[2]
print(torch._C._get_sdp_priority_order())
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140467
Approved by: https://github.com/drisspg
2024-12-06 07:56:35 +00:00
e7de245ee1 Revert "[reland][dynamo][guards] Consider tensors as immutable for dict tag matches (#141085)"
This reverts commit 8bfc0094e468b0abefe087d671903a1ca738edf0.

Reverted https://github.com/pytorch/pytorch/pull/141085 on behalf of https://github.com/williamwen42 due to internal regression ([comment](https://github.com/pytorch/pytorch/pull/141085#issuecomment-2522403360))
2024-12-06 07:50:10 +00:00
4a6c056466 Revert "[3/N] Avoid copy in std::get (#141843)"
This reverts commit 671e9c7aba2dc72b65391aa4bba1b9e079c2f1b2.

Reverted https://github.com/pytorch/pytorch/pull/141843 on behalf of https://github.com/huydhn due to Sorry fo reverting your change but a bunch of CUDA builds are failing in trunk after this lands due to OOM ([comment](https://github.com/pytorch/pytorch/pull/141843#issuecomment-2522335911))
2024-12-06 07:32:07 +00:00
8bdcdae733 [DTensor] Support matmul in inference_mode (#142197)
Fixes #142190 .

The solution is to add a `decompose_handler` for `aten.matmul`, similar to how we handle `aten.linear`.
With the decomposition, `aten.matmul` becomes `aten.mm`, which has a sharding strategy registered with DTensor.
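
A usage sketch of the now-working case (launch with torchrun; shapes and placements are arbitrary):
```python
# torchrun --nproc_per_node=2 matmul_inference.py
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Replicate, Shard, distribute_tensor

mesh = init_device_mesh("cuda", (2,))
a = distribute_tensor(torch.randn(8, 16), mesh, [Shard(0)])
b = distribute_tensor(torch.randn(16, 4), mesh, [Replicate()])

with torch.inference_mode():
    c = torch.matmul(a, b)  # decomposed to aten.mm, which has a DTensor sharding strategy
```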

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142197
Approved by: https://github.com/XilunWu, https://github.com/wz337
2024-12-06 07:15:05 +00:00
02c509669a Aoti minifier flatten (#141156)
Flatten the inputs to minifier so AOTI Minifier can handle unflattened inputs and kwargs.

- flatten the inputs in minifier
- changed the "load_and_run" part of the minifier verification to run on the flattened inputs.
- refactored code to keep `torch._inductor.__init__.py` clean
- update doc

`python test/inductor/test_minifier.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141156
Approved by: https://github.com/desertfire
2024-12-06 07:12:45 +00:00
23e2f8ab3a [Inductor] add flag for linear binary folding and turn it off by default (#142108)
Fix https://github.com/pytorch/pytorch/issues/141755.

Summary:
Linear binary folding results in an accuracy regression on a timm model (levit_128); this PR adds an `enable_linear_binary_folding` flag for linear binary folding and turns it off by default.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142108
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-12-06 07:12:29 +00:00
67ba79676f [Dynamo] Replace torch._dynamo.optimize() with torch.compile() [7/N] (#140922)
related commits:

- #139706
- #140238
- #140247
- #140253
- #140663
- #140688
- #140922
- #140924
- #140933

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140922
Approved by: https://github.com/williamwen42
2024-12-06 07:07:29 +00:00
52b7f0ba12 [DTensor] fix stride of fake tensor produced by shard_dim_alltoall (#141835)
Currently, DTensor redistributions involving all2all `Shard(n)->Shard(m)` will generate faulty inductor code when compiled:
```python
# torchrun --nproc_per_node=2 crash.py
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard, DTensor

mesh = init_device_mesh('cuda', (2,), mesh_dim_names=('ep',))
dt = DTensor.from_local(torch.randn(2, 4, device='cuda'), mesh, [Shard(0)]).requires_grad_()
def f(dt): return dt.redistribute(placements=[Shard(1)]).to_local()
f(dt).sum().backward() # no crash
f = torch.compile(f)
f(dt).sum().backward() # crash
```
resulting:
```python
[rank1]: Traceback (most recent call last):
[rank1]:   File "/crash.py", line 11, in <module>
[rank1]:     f(dt).sum().backward() # crash
[rank1]:     ^^^^^
...
[rank1]:   File "/tmp/torchinductor_main/gu/cgurkeb7tzx7kfsnooolsjefrgoizzylrldrugc52n4avmgiccas.py", line 41, in call
[rank1]:     assert_size_stride(buf0, (4, 2), (4, 1))
[rank1]: AssertionError: expected size 4==4, stride 2==4 at dim=0
```

This happens because the current [`register_fake` implementation for `shard_dim_alltoall` ops](5deca07c0d/torch/distributed/tensor/_collective_utils.py (L32)) returns an erroneous stride:

```python
import torch
import torch.distributed as dist
from torch._C._distributed_c10d import _register_process_group
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor._collective_utils import _shard_dim_alltoall_meta, _get_group_size_by_name

mesh = init_device_mesh('cuda', (2,), mesh_dim_names=('ep',))
_register_process_group('ep', mesh['ep'].get_group())
x = torch.randn(2, 4, device='meta')
y = _shard_dim_alltoall_meta(x, 0, 1, 'ep')
if dist.get_rank() == 0:
    print(x.shape, x.stride()) # torch.Size([2, 4]) (4, 1)
    print(y.shape, y.stride()) # torch.Size([4, 2]) (4, 1)
```

---

The proposed fix in the pull request causes the provided example code to compile correctly and stop erroring. However, I know very little about torch internals, and expect there to be something wrong with this patch. Any corrections are appreciated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141835
Approved by: https://github.com/awgu, https://github.com/tianyu-l
2024-12-06 06:56:03 +00:00
35752cb1ba [Inductor] Constrain the shape of other tensor for Conv/Linear + broadcast add fusion. (#141759)
Fix https://github.com/pytorch/pytorch/issues/141671.

Summary:
The performance regression of these two timm_models is caused by the Conv/Linear + broadcast add fusion running into the oneDNN ref path. This PR constrains the shape of the other tensor for Conv/Linear + broadcast add fusion to fix this issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141759
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-12-06 06:20:41 +00:00
b8eb4b56d8 export AOTI_TORCH_EXPORT on Windows. (#140030)
Fixes #139954

reproduce UT:
```cmd
pytest test/inductor/test_torchinductor_codegen_dynamic_shapes.py -k test_device_assert_dynamic_shapes_cpu
```
Issue:
<img width="856" alt="image" src="https://github.com/user-attachments/assets/5fc501a9-54e5-45ac-9fb3-509ec11a7abe">

After fixing:
![Image](https://github.com/user-attachments/assets/883846fb-8e92-4b9c-9400-daab32382a3a)

Reland:
1. Declare export on Windows explicitly.
2. Support cpu, cuda and xpu devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140030
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-12-06 04:54:42 +00:00
20f24e3fbd [inductor][cpp] Add BMM kernel template for autotuning (#129772)
This PR adds the Cpp template for BMM, for FP32, FP16, and BF16. See #125683 for more background.

1.  Adds `CppBmmTemplate` class which inherits from `CppPackedGemmTemplate`. Given a number of worker threads `num_threads` and batch size `B`, execute the Gemm kernel. For the first `B - (B % num_threads)` batch inputs, run one sub-gemm problem per thread. Then for the remaining `B % num_threads` sub-gemms, we execute each subproblem using the parallelized Gemm kernel.
To manage this code, the `GEMM_TEMPLATE` from `CppPackedGemmTemplate` is rendered two different times, one with a single thread and one which includes the parallel OMP pragma.
2. Adapts `CppPackedGemmTemplate` to allow for child class. The `GEMM_TEMPLATE` is separated into different strings to allow for rendering by the child class. Slicing/indexing are adapted to allow for 3D BMM inputs. Additional methods `get_options()` and `_get_params_for_choices()` are added to reduce code duplication.

BMM within the `dlrm` benchmark has a single input buffer which is used for both the X and W inputs. This is currently not supported in this PR.

### Performance
On Granite/Sapphire Rapids, the cpp_bmm template code uses AMX, which requires an expensive transpose operation, so the BMM op is rarely selected as faster than the existing external bmm kernel. As a result, speedup on SPR is identical with and without the BMM code. The pass rate matches the rates for main exactly.

#### Test Summary on Granite Rapids
Test Scenario | Comp Item | Config | Compiler | torchbench | huggingface | timm_models
-- | -- | -- | -- | -- | -- | --
Single Socket Multi-Threads | Pass Rate | gemm autotune | inductor | 91%, 73/80 | 100%, 46/46 | 100%, 61/61
 | | bmm + gemm autotune | inductor | 91%, 73/80 | 100%, 46/46 | 100%, 61/61
 | Geomean Speedup | gemm autotune | inductor | 2.15x | 1.91x | 2.52x
 | | bmm + gemm autotune | inductor | 2.15x | 1.96x | 2.53x
Single Core Single-Thread | Pass Rate | gemm autotune | inductor | 91%, 73/80 | 100%, 46/46 | 100%, 61/61
 | | bmm + gemm autotune | inductor | 91%, 73/80 | 100%, 46/46 | 100%, 61/61
 | Geomean Speedup | inductor_locally_benchmark_586 | inductor | 2.43x | 1.56x | 2.60x
 | | inductor_locally_benchmark_585 | inductor | 2.45x | 1.56x | 2.63x

This is not the case on an older Skylake Xeon machine.
For the BMM ops contained in torchbench models, bmm performance improves by 1.10-2.64x.

#### BF16 28-core Skylake Xeon
| Model | Inductor | GemmAutotune | Gemm+BMM Autotune |
|--------|--------|--------|--------|
| BERT_pytorch | 1.233x | 2.597x | 2.608x |
| hf_DistilBert | 1.128x | 2.242x | 2.368x |
| hf_Reformer | 1.124x | 1.419x | 1.590x |
| hf_T5_base | 1.012x | 1.257x | 1.382x |
| hf_T5_large | 1.085x | 2.228x | 2.345x |

## Example BMM Code
```
#include <c10/util/Unroll.h>
#include <torch/csrc/inductor/aoti_torch/c/shim.h>

template <bool accum>
inline void cpp_bmm_micro_gemm_amx_kernel_32_2(
    AMXState& amx_state,
    const bfloat16* __restrict__ A,
    const bfloat16* __restrict__ B,
    float* __restrict__ C,
    int64_t K,
    int64_t lda,
    int64_t ldb,
    int64_t ldc,
    uint8_t tilecfg_rows
) {
    // TODO(jgong5): add prefetch hint for A, B, C
    auto loadconfig = [](const amx_tilecfg& cfg) {
        _tile_loadconfig(&cfg);
    };
    const auto last_k_offset = K / 32 * 32;
    const auto tail_k_size = K - last_k_offset;
    if C10_LIKELY (last_k_offset > 0) {
        amx_state.configure(tilecfg_rows, 64, 32 / 16, 2, loadconfig);
    } else {
        amx_state.configure(tilecfg_rows, tail_k_size * sizeof(bfloat16), 32 / 16, 2, loadconfig);
    }
    auto load_c = [&]() {
        _tile_loadd(0, C + 0 * ldc + 0, ldc * sizeof(float));
        _tile_loadd(1, C + 0 * ldc + 16, ldc * sizeof(float));
        _tile_loadd(2, C + 16 * ldc + 0, ldc * sizeof(float));
        _tile_loadd(3, C + 16 * ldc + 16, ldc * sizeof(float));
    };
    auto zero_c = [&]() {
        _tile_zero(0);
        _tile_zero(1);
        _tile_zero(2);
        _tile_zero(3);
    };

    if constexpr (accum) {
        load_c();
    } else {
        zero_c();
    }

    auto compute = [&](int k) {
        _tile_stream_loadd(4, A + 0 * lda + k, lda * sizeof(bfloat16));
        _tile_loadd(6, B + k * ldb + 0, ldb * 2 * sizeof(bfloat16));
        _tile_dpbf16ps(0, 4, 6);
        _tile_loadd(7, B + k * ldb + 32, ldb * 2 * sizeof(bfloat16));
        _tile_dpbf16ps(1, 4, 7);
        _tile_stream_loadd(5, A + 16 * lda + k, lda * sizeof(bfloat16));
        _tile_dpbf16ps(2, 5, 6);
        _tile_dpbf16ps(3, 5, 7);
    };

    #pragma GCC unroll 4
    for (int k = 0; k < last_k_offset; k += 32) {
        compute(k);
    }

    auto store_c = [&]() {
    // store to C
        _tile_stored(0, C + 0 * ldc + 0, ldc * sizeof(float));
        _tile_stored(1, C + 0 * ldc + 16, ldc * sizeof(float));
        _tile_stored(2, C + 16 * ldc + 0, ldc * sizeof(float));
        _tile_stored(3, C + 16 * ldc + 16, ldc * sizeof(float));
    };

    // TODO(jgong5): move tail k computation to separate loopnest to save tile configuration overhead
    if C10_UNLIKELY (tail_k_size > 0) {
        if C10_LIKELY (last_k_offset > 0) {
            store_c();
            amx_state.configure(tilecfg_rows, tail_k_size * sizeof(bfloat16), 32 / 16, 2, loadconfig);
            load_c();
        }
        compute(last_k_offset);
    }

    store_c();
}

template <bool accum>
inline void cpp_bmm_micro_gemm_amx_kernel_16_2(
    AMXState& amx_state,
    const bfloat16* __restrict__ A,
    const bfloat16* __restrict__ B,
    float* __restrict__ C,
    int64_t K,
    int64_t lda,
    int64_t ldb,
    int64_t ldc,
    uint8_t tilecfg_rows
) {
    // TODO(jgong5): add prefetch hint for A, B, C
    auto loadconfig = [](const amx_tilecfg& cfg) {
        _tile_loadconfig(&cfg);
    };
    const auto last_k_offset = K / 32 * 32;
    const auto tail_k_size = K - last_k_offset;
    if C10_LIKELY (last_k_offset > 0) {
        amx_state.configure(tilecfg_rows, 64, 16 / 16, 2, loadconfig);
    } else {
        amx_state.configure(tilecfg_rows, tail_k_size * sizeof(bfloat16), 16 / 16, 2, loadconfig);
    }
    auto load_c = [&]() {
        _tile_loadd(0, C + 0 * ldc + 0, ldc * sizeof(float));
        _tile_loadd(1, C + 0 * ldc + 16, ldc * sizeof(float));
    };
    auto zero_c = [&]() {
        _tile_zero(0);
        _tile_zero(1);
    };

    if constexpr (accum) {
        load_c();
    } else {
        zero_c();
    }

    auto compute = [&](int k) {
        _tile_stream_loadd(2, A + 0 * lda + k, lda * sizeof(bfloat16));
        _tile_loadd(3, B + k * ldb + 0, ldb * 2 * sizeof(bfloat16));
        _tile_dpbf16ps(0, 2, 3);
        _tile_loadd(4, B + k * ldb + 32, ldb * 2 * sizeof(bfloat16));
        _tile_dpbf16ps(1, 2, 4);
    };

    #pragma GCC unroll 4
    for (int k = 0; k < last_k_offset; k += 32) {
        compute(k);
    }

    auto store_c = [&]() {
    // store to C
        _tile_stored(0, C + 0 * ldc + 0, ldc * sizeof(float));
        _tile_stored(1, C + 0 * ldc + 16, ldc * sizeof(float));
    };

    // TODO(jgong5): move tail k computation to separate loopnest to save tile configuration overhead
    if C10_UNLIKELY (tail_k_size > 0) {
        if C10_LIKELY (last_k_offset > 0) {
            store_c();
            amx_state.configure(tilecfg_rows, tail_k_size * sizeof(bfloat16), 16 / 16, 2, loadconfig);
            load_c();
        }
        compute(last_k_offset);
    }

    store_c();
}

template <bool accum>
inline void cpp_bmm_micro_gemm(
    AMXState& amx_state,
    const bfloat16* __restrict__ A,
    const bfloat16* __restrict__ B,
    float* __restrict__ C,
    int64_t M,
    int64_t N,
    int64_t K,
    int64_t lda,
    int64_t ldb,
    int64_t ldc
) {
    AOTI_TORCH_CHECK(N % 32 == 0, "N dimension must be multiple of 32");
    AOTI_TORCH_CHECK(K % 2 == 0, "K dimension must be multiple of 2");
    // TODO(jgong5): loop unroll for M and N
    for (int64_t n = 0; n < N; n += 32) {
        for (int64_t m = 0; m < M; m += 32) {
            int64_t block_m = std::min<int64_t>(M - m, 32);
            int64_t m_tail = m;
            if (block_m >= 32) {
                cpp_bmm_micro_gemm_amx_kernel_32_2<accum>(
                    amx_state,
                    A + m * lda,
                    B + n,
                    C + m * ldc + n,
                    K,
                    lda,
                    ldb,
                    ldc,
                    16
                );
                block_m -= 32;
                m_tail += 32;
            }
            else
            if (block_m >= 16) {
                cpp_bmm_micro_gemm_amx_kernel_16_2<accum>(
                    amx_state,
                    A + m * lda,
                    B + n,
                    C + m * ldc + n,
                    K,
                    lda,
                    ldb,
                    ldc,
                    16
                );
                block_m -= 16;
                m_tail += 16;
            }
            if (block_m > 0) {
                cpp_bmm_micro_gemm_amx_kernel_16_2<accum>(
                    amx_state,
                    A + m_tail * lda,
                    B + n,
                    C + m_tail * ldc + n,
                    K,
                    lda,
                    ldb,
                    ldc,
                    block_m
                );
            }
        }
    }
}
void threaded_mm(const bfloat16* X, const bfloat16* W, bfloat16* Y, const int64_t ks_b_index)
{

    constexpr int64_t num_threads = 48;
    constexpr int64_t N = 64;
    constexpr int64_t K = 96;
    constexpr int64_t Mr = 32;
    constexpr int64_t Nr = 32;
    constexpr int64_t Kr = 32;
    constexpr int64_t Nr_blocks = (N + Nr - 1) / Nr;
    constexpr int64_t Kr_blocks = (K + Kr - 1) / Kr;
    constexpr int64_t M = static_cast<int64_t>(384L);
    constexpr int64_t Mr_blocks = (M + Mr - 1) / Mr;
    constexpr int64_t Mt_blocks = 1;
    constexpr int64_t Nt_blocks = 1;
    constexpr int64_t Kt_blocks = 3;
    constexpr int64_t Mc_blocks = 1;
    constexpr int64_t Nc_blocks = 1;
    constexpr int64_t Kc_blocks = 3;
    constexpr int64_t num_Mc_blocks = (Mr_blocks + Mc_blocks - 1) / Mc_blocks;
    constexpr int64_t num_Nc_blocks = (Nr_blocks + Nc_blocks - 1) / Nc_blocks;
    constexpr int64_t num_Mt_blocks = (Mr_blocks + Mt_blocks - 1) / Mt_blocks;
    constexpr int64_t num_Nt_blocks = (Nr_blocks + Nt_blocks - 1) / Nt_blocks;
    constexpr int64_t num_Kt_blocks = (Kr_blocks + Kt_blocks - 1) / Kt_blocks;

    // make sure all partitions are assigned
    AOTI_TORCH_CHECK(
        Mt_blocks * Nt_blocks * Kt_blocks * 48 >= Mr_blocks * Nr_blocks * Kr_blocks,
        "Not all partitions are assigned."
    );
    #pragma omp parallel num_threads(48)
    {
        const int tid = omp_get_thread_num();
        const int64_t k_group_id = tid / num_Kt_blocks;
        const int64_t k_slice_id = tid % num_Kt_blocks;
        const int64_t n_group_id = k_group_id / num_Nt_blocks;
        const int64_t n_slice_id = k_group_id % num_Nt_blocks;
        const int64_t k_block_start = k_slice_id * Kt_blocks;
        const int64_t k_block_end = std::min(k_block_start + Kt_blocks, Kr_blocks);
        const int64_t n_block_start = n_slice_id * Nt_blocks;
        const int64_t n_block_end = std::min(n_block_start + Nt_blocks, Nr_blocks);
        const int64_t m_block_start = std::min(n_group_id * Mt_blocks, Mr_blocks);
        const int64_t m_block_end = std::min(m_block_start + Mt_blocks, Mr_blocks);
        const int64_t num_Mc_blocks_per_thread = (m_block_end - m_block_start + Mc_blocks - 1) / Mc_blocks;
        AMXState amx_state;
        auto _local_acc_buf = std::make_unique<float[]>(static_cast<int64_t>(Mc_blocks*Mr*Nc_blocks*Nr)); auto local_acc_buf = _local_acc_buf.get();
        for (int64_t mc_block_id = 0; mc_block_id < num_Mc_blocks_per_thread; mc_block_id++) {
            const int64_t my_mc_block_id = (mc_block_id + n_slice_id) % num_Mc_blocks_per_thread;
            const int64_t mc = m_block_start + my_mc_block_id * Mc_blocks;
            const int64_t m_start = mc * Mr;
            const int64_t m_end = std::min(std::min(mc + Mc_blocks, m_block_end) * Mr, M);
            const int64_t m_size = m_end - m_start;
            for (int64_t nc = n_block_start; nc < n_block_end; nc += Nc_blocks) {
                const int64_t n_start = nc * Nr;
                const int64_t n_end = std::min(std::min(nc + Nc_blocks, n_block_end) * Nr, N);
                const int64_t n_size = n_end - n_start;
                // NB: assume we pad N, nc_block_end won't exceed padded N here.
                const int64_t nc_block_end = std::min(nc + Nc_blocks, n_block_end);
                if (_local_acc_buf == nullptr) { _local_acc_buf = std::make_unique<float[]>(static_cast<int64_t>(Mc_blocks*Mr*Nc_blocks*Nr)); local_acc_buf = _local_acc_buf.get(); }
                for (int64_t kc = k_block_start; kc < k_block_end; kc += Kc_blocks) {
                    int64_t k_start = kc * Kr;
                    int64_t k_end = std::min(std::min(kc + Kc_blocks, k_block_end) * Kr, K);
                    for (int64_t nci = nc; nci < nc_block_end; nci++) {
                        if (kc == k_block_start) {
                            cpp_bmm_micro_gemm<static_cast<bool>(false)>(
                                amx_state,
                                &(X[static_cast<int64_t>(k_start + (96L*m_start) + (36864L*ks_b_index))]),
                                &(W[static_cast<int64_t>((32L*k_start) + (3072L*nci) + (6144L*ks_b_index))]),
                                &(local_acc_buf[static_cast<int64_t>((Nr*nci) + ((-1L)*Nr*nc))]),
                                static_cast<int64_t>(m_end + ((-1L)*m_start)),
                                static_cast<int64_t>(Nr),
                                static_cast<int64_t>(k_end + ((-1L)*k_start)),
                                static_cast<int64_t>(96L),
                                static_cast<int64_t>(32L),
                                static_cast<int64_t>(Nc_blocks*Nr)
                            );

                        } else {
                            cpp_bmm_micro_gemm<static_cast<bool>(true)>(
                                amx_state,
                                &(X[static_cast<int64_t>(k_start + (96L*m_start) + (36864L*ks_b_index))]),
                                &(W[static_cast<int64_t>((32L*k_start) + (3072L*nci) + (6144L*ks_b_index))]),
                                &(local_acc_buf[static_cast<int64_t>((Nr*nci) + ((-1L)*Nr*nc))]),
                                static_cast<int64_t>(m_end + ((-1L)*m_start)),
                                static_cast<int64_t>(Nr),
                                static_cast<int64_t>(k_end + ((-1L)*k_start)),
                                static_cast<int64_t>(96L),
                                static_cast<int64_t>(32L),
                                static_cast<int64_t>(Nc_blocks*Nr)
                            );

                        }
                    }
                }
                {
                    {
                        #pragma GCC ivdep
                        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(m_end + ((-1L)*m_start)); x0+=static_cast<int64_t>(1L))
                        {
                            for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(16L*(c10::div_floor_integer(static_cast<int64_t>((n_end + ((-1L)*n_start))), static_cast<int64_t>(16L)))); x1+=static_cast<int64_t>(16L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(local_acc_buf + static_cast<int64_t>(x1 + (Nc_blocks*Nr*x0)), static_cast<int64_t>(16));
                                auto tmp1 = at::vec::convert<bfloat16>(tmp0);
                                tmp1.store(Y + static_cast<int64_t>(n_start + x1 + (64L*m_start) + (64L*x0) + (24576L*ks_b_index)), static_cast<int64_t>(16));
                            }
                            for(int64_t x1=static_cast<int64_t>(16L*(c10::div_floor_integer(static_cast<int64_t>((n_end + ((-1L)*n_start))), static_cast<int64_t>(16L)))); x1<static_cast<int64_t>(n_end + ((-1L)*n_start)); x1+=(static_cast<int64_t>(n_end + ((-1L)*n_start) + ((-16L)*(c10::div_floor_integer(static_cast<int64_t>((n_end + ((-1L)*n_start))), static_cast<int64_t>(16L))))) == 0 ? 1 : static_cast<int64_t>(n_end + ((-1L)*n_start) + ((-16L)*(c10::div_floor_integer(static_cast<int64_t>((n_end + ((-1L)*n_start))), static_cast<int64_t>(16L)))))))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(local_acc_buf + static_cast<int64_t>(x1 + (Nc_blocks*Nr*x0)), static_cast<int64_t>(n_end + ((-1L)*n_start) + ((-16L)*(c10::div_floor_integer(static_cast<int64_t>((n_end + ((-1L)*n_start))), static_cast<int64_t>(16L))))));
                                auto tmp1 = at::vec::convert<bfloat16>(tmp0);
                                tmp1.store(Y + static_cast<int64_t>(n_start + x1 + (64L*m_start) + (64L*x0) + (24576L*ks_b_index)), static_cast<int64_t>(n_end + ((-1L)*n_start) + ((-16L)*(c10::div_floor_integer(static_cast<int64_t>((n_end + ((-1L)*n_start))), static_cast<int64_t>(16L))))));
                            }
                        }
                    }

                }
            }
        }
        amx_state.release([]() { _tile_release(); });
    }
}
void single_thread_mm(const bfloat16* X, const bfloat16* W, bfloat16* Y, const int64_t ks_b_index)
{

    constexpr int64_t num_threads = 1;
    constexpr int64_t N = 64;
    constexpr int64_t K = 96;
    constexpr int64_t Mr = 32;
    constexpr int64_t Nr = 32;
    constexpr int64_t Kr = 32;
    constexpr int64_t Nr_blocks = (N + Nr - 1) / Nr;
    constexpr int64_t Kr_blocks = (K + Kr - 1) / Kr;
    constexpr int64_t M = static_cast<int64_t>(384L);
    constexpr int64_t Mr_blocks = (M + Mr - 1) / Mr;
    constexpr int64_t Mt_blocks = 12;
    constexpr int64_t Nt_blocks = 2;
    constexpr int64_t Kt_blocks = 3;
    constexpr int64_t Mc_blocks = 12;
    constexpr int64_t Nc_blocks = 1;
    constexpr int64_t Kc_blocks = 3;
    constexpr int64_t num_Mc_blocks = (Mr_blocks + Mc_blocks - 1) / Mc_blocks;
    constexpr int64_t num_Nc_blocks = (Nr_blocks + Nc_blocks - 1) / Nc_blocks;
    constexpr int64_t num_Mt_blocks = (Mr_blocks + Mt_blocks - 1) / Mt_blocks;
    constexpr int64_t num_Nt_blocks = (Nr_blocks + Nt_blocks - 1) / Nt_blocks;
    constexpr int64_t num_Kt_blocks = (Kr_blocks + Kt_blocks - 1) / Kt_blocks;

    // make sure all partitions are assigned
    AOTI_TORCH_CHECK(
        Mt_blocks * Nt_blocks * Kt_blocks * 1 >= Mr_blocks * Nr_blocks * Kr_blocks,
        "Not all partitions are assigned."
    );
    {
        constexpr int tid = 0;
        constexpr int64_t k_group_id = 0;
        constexpr int64_t k_slice_id = 0;
        constexpr int64_t n_group_id = 0;
        constexpr int64_t n_slice_id = 0;
        constexpr int64_t m_block_start = 0;
        constexpr int64_t n_block_start = 0;
        constexpr int64_t n_block_end = Nr_blocks;
        constexpr int64_t k_block_start = 0;
        constexpr int64_t k_block_end = Kr_blocks;
        constexpr int64_t num_Mc_blocks_per_thread = num_Mc_blocks;
        constexpr int64_t m_block_end = Mr_blocks;
        AMXState amx_state;
        auto _local_acc_buf = std::make_unique<float[]>(static_cast<int64_t>(Mc_blocks*Mr*Nc_blocks*Nr)); auto local_acc_buf = _local_acc_buf.get();
        for (int64_t mc_block_id = 0; mc_block_id < num_Mc_blocks_per_thread; mc_block_id++) {
            const int64_t my_mc_block_id = (mc_block_id + n_slice_id) % num_Mc_blocks_per_thread;
            const int64_t mc = m_block_start + my_mc_block_id * Mc_blocks;
            const int64_t m_start = mc * Mr;
            const int64_t m_end = std::min(std::min(mc + Mc_blocks, m_block_end) * Mr, M);
            const int64_t m_size = m_end - m_start;
            for (int64_t nc = n_block_start; nc < n_block_end; nc += Nc_blocks) {
                const int64_t n_start = nc * Nr;
                const int64_t n_end = std::min(std::min(nc + Nc_blocks, n_block_end) * Nr, N);
                const int64_t n_size = n_end - n_start;
                // NB: assume we pad N, nc_block_end won't exceed padded N here.
                const int64_t nc_block_end = std::min(nc + Nc_blocks, n_block_end);
                if (_local_acc_buf == nullptr) { _local_acc_buf = std::make_unique<float[]>(static_cast<int64_t>(Mc_blocks*Mr*Nc_blocks*Nr)); local_acc_buf = _local_acc_buf.get(); }
                for (int64_t kc = k_block_start; kc < k_block_end; kc += Kc_blocks) {
                    int64_t k_start = kc * Kr;
                    int64_t k_end = std::min(std::min(kc + Kc_blocks, k_block_end) * Kr, K);
                    for (int64_t nci = nc; nci < nc_block_end; nci++) {
                        if (kc == k_block_start) {
                            cpp_bmm_micro_gemm<static_cast<bool>(false)>(
                                amx_state,
                                &(X[static_cast<int64_t>(k_start + (96L*m_start) + (36864L*ks_b_index))]),
                                &(W[static_cast<int64_t>((32L*k_start) + (3072L*nci) + (6144L*ks_b_index))]),
                                &(local_acc_buf[static_cast<int64_t>((Nr*nci) + ((-1L)*Nr*nc))]),
                                static_cast<int64_t>(m_end + ((-1L)*m_start)),
                                static_cast<int64_t>(Nr),
                                static_cast<int64_t>(k_end + ((-1L)*k_start)),
                                static_cast<int64_t>(96L),
                                static_cast<int64_t>(32L),
                                static_cast<int64_t>(Nc_blocks*Nr)
                            );

                        } else {
                            cpp_bmm_micro_gemm<static_cast<bool>(true)>(
                                amx_state,
                                &(X[static_cast<int64_t>(k_start + (96L*m_start) + (36864L*ks_b_index))]),
                                &(W[static_cast<int64_t>((32L*k_start) + (3072L*nci) + (6144L*ks_b_index))]),
                                &(local_acc_buf[static_cast<int64_t>((Nr*nci) + ((-1L)*Nr*nc))]),
                                static_cast<int64_t>(m_end + ((-1L)*m_start)),
                                static_cast<int64_t>(Nr),
                                static_cast<int64_t>(k_end + ((-1L)*k_start)),
                                static_cast<int64_t>(96L),
                                static_cast<int64_t>(32L),
                                static_cast<int64_t>(Nc_blocks*Nr)
                            );

                        }
                    }
                }
                {
                    {
                        #pragma GCC ivdep
                        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(m_end + ((-1L)*m_start)); x0+=static_cast<int64_t>(1L))
                        {
                            for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(16L*(c10::div_floor_integer(static_cast<int64_t>((n_end + ((-1L)*n_start))), static_cast<int64_t>(16L)))); x1+=static_cast<int64_t>(16L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(local_acc_buf + static_cast<int64_t>(x1 + (Nc_blocks*Nr*x0)), static_cast<int64_t>(16));
                                auto tmp1 = at::vec::convert<bfloat16>(tmp0);
                                tmp1.store(Y + static_cast<int64_t>(n_start + x1 + (64L*m_start) + (64L*x0) + (24576L*ks_b_index)), static_cast<int64_t>(16));
                            }
                            for(int64_t x1=static_cast<int64_t>(16L*(c10::div_floor_integer(static_cast<int64_t>((n_end + ((-1L)*n_start))), static_cast<int64_t>(16L)))); x1<static_cast<int64_t>(n_end + ((-1L)*n_start)); x1+=(static_cast<int64_t>(n_end + ((-1L)*n_start) + ((-16L)*(c10::div_floor_integer(static_cast<int64_t>((n_end + ((-1L)*n_start))), static_cast<int64_t>(16L))))) == 0 ? 1 : static_cast<int64_t>(n_end + ((-1L)*n_start) + ((-16L)*(c10::div_floor_integer(static_cast<int64_t>((n_end + ((-1L)*n_start))), static_cast<int64_t>(16L)))))))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(local_acc_buf + static_cast<int64_t>(x1 + (Nc_blocks*Nr*x0)), static_cast<int64_t>(n_end + ((-1L)*n_start) + ((-16L)*(c10::div_floor_integer(static_cast<int64_t>((n_end + ((-1L)*n_start))), static_cast<int64_t>(16L))))));
                                auto tmp1 = at::vec::convert<bfloat16>(tmp0);
                                tmp1.store(Y + static_cast<int64_t>(n_start + x1 + (64L*m_start) + (64L*x0) + (24576L*ks_b_index)), static_cast<int64_t>(n_end + ((-1L)*n_start) + ((-16L)*(c10::div_floor_integer(static_cast<int64_t>((n_end + ((-1L)*n_start))), static_cast<int64_t>(16L))))));
                            }
                        }
                    }

                }
            }
        }
        amx_state.release([]() { _tile_release(); });
    }
}
extern "C"
void cpp_bmm(const bfloat16* X, const bfloat16* W, bfloat16* Y)
{
    const int64_t B = static_cast<int64_t>(5L);
    constexpr int64_t num_threads = 48;
    int64_t B_single_thread_block = (B / num_threads) * num_threads;

    #pragma omp parallel for num_threads(48)
    for (int64_t b_start = 0; b_start < B_single_thread_block; ++b_start) {
        single_thread_mm(X, W, Y, b_start);
    }
    for (int64_t b_start = B_single_thread_block; b_start < B; ++b_start) {
        threaded_mm(X, W, Y, b_start);
    }
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129772
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-12-06 04:54:00 +00:00
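For reference, a minimal sketch of how a kernel like the `cpp_bmm` code above can be produced: compile a bfloat16 batched matmul on CPU with max-autotune. The shapes match the constants in the generated kernel (B=5, M=384, K=96, N=64); whether the C++ GEMM template is actually selected depends on the machine (e.g. AMX support) and the autotune results, so this is an illustration rather than an exact repro.
```python
import torch

X = torch.randn(5, 384, 96, dtype=torch.bfloat16)
W = torch.randn(5, 96, 64, dtype=torch.bfloat16)

# mode="max-autotune" lets Inductor consider the C++ GEMM templates on CPU,
# which emit threaded_mm/single_thread_mm kernels like the ones shown above.
bmm = torch.compile(torch.bmm, mode="max-autotune")
Y = bmm(X, W)
print(Y.shape)  # torch.Size([5, 384, 64])
```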
39425feac7 Filter pattern matching tests based on ACL (#141921)
There are a number of cases where pattern matching differs based on the presence of ACL, causing the tests to fail. This adds `TEST_ACL` and `skipIfACL` so that these tests can still run with different values or be entirely skipped if necessary.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141921
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-12-06 04:19:41 +00:00
cyy
671e9c7aba [3/N] Avoid copy in std::get (#141843)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141843
Approved by: https://github.com/Skylion007
2024-12-06 04:15:31 +00:00
cyy
4bc8de334f Remove __ubsan_ignore_undefined__ in some cases (#142120)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142120
Approved by: https://github.com/ezyang
2024-12-06 04:13:57 +00:00
646024e823 Add convnext_base to higher tolerance (#142159)
See https://github.com/pytorch/pytorch/issues/141498 https://github.com/pytorch/pytorch/issues/141703

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142159
Approved by: https://github.com/bertmaher, https://github.com/huydhn
2024-12-06 04:00:13 +00:00
80ca6dd892 [Inductor] Expand dtype aware codegen for libdevice and tl.math ops (#140864)
# Feature
Previously, only the codegen for `torch.sqrt` was dtype aware. This PR updates most of the `libdevice`/`tl.math` ops to support dtype-aware codegen as well. This is often necessary to get correct code when `config.triton.codegen_upcast_to_fp32=False`, as most Triton math ops do not support float16/bfloat16.

This PR enables dtype aware codegen via the `maybe_upcast_float32` decorator. This wraps `TritonOverrides` macros to upcast arguments to float32, and downcast the result back to the original dtype. The exception is for ops that return booleans, in which case we set `convert_output=False` and skip the output cast.

# Test Plan
Added CI tests for all the new ops. The list of ops to test is automatically generated based on uses of the `maybe_upcast_float32` decorator, and stored in the new `OpDtypeSupport` class. In each new test, we search the generated code for upcasts/downcasts using a regex.

Also added a unit test for `OpDtypeSupport` which checks that we have correct dtype info for ops that require upcasts.

This PR also moves some existing tests around, to collect all the dtype aware codegen tests in one file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140864
Approved by: https://github.com/eellison, https://github.com/arui-meta

Co-authored-by: eellison <elias.ellison@gmail.com>
2024-12-06 03:15:20 +00:00
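To make the `maybe_upcast_float32` mechanism above concrete, here is a toy eager-mode analogue (a sketch, not Inductor's actual codegen decorator): upcast fp16/bf16 inputs to float32, run the op, and cast the result back, unless `convert_output=False` as for boolean-returning ops.
```python
import functools
import torch

def maybe_upcast_float32(convert_output: bool = True):
    """Toy analogue of the codegen decorator described above."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args):
            low_prec = {torch.float16, torch.bfloat16}
            dtypes = [a.dtype for a in args if isinstance(a, torch.Tensor)]
            needs_upcast = any(d in low_prec for d in dtypes)
            if needs_upcast:
                args = tuple(
                    a.to(torch.float32)
                    if isinstance(a, torch.Tensor) and a.dtype in low_prec
                    else a
                    for a in args
                )
            out = fn(*args)
            if needs_upcast and convert_output:
                out = out.to(dtypes[0])  # downcast back to the original dtype
            return out
        return wrapper
    return decorator

@maybe_upcast_float32()
def rsqrt(x):
    return torch.rsqrt(x)

@maybe_upcast_float32(convert_output=False)  # boolean outputs are not cast back
def isinf(x):
    return torch.isinf(x)

print(rsqrt(torch.rand(4, dtype=torch.bfloat16)).dtype)  # torch.bfloat16
print(isinf(torch.rand(4, dtype=torch.float16)).dtype)   # torch.bool
```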
0602676c8d [CUTLASS][AOTI] Fixes undefined symbol: cudaLaunchKernelExC (#142094)
Summary:
### Context
* When compiling the object file for a CUTLASS kernel, CUDA RT symbols are left undefined.
* When compiling the final shared object file, we statically link with `libcudart_static.a`.
* One important thing is that ordering matters when specifying the lib search paths (-L).

Test Plan:
```
// before diff
RuntimeError: Failure loading .so: /tmp/tmpqhz_dnza/model.so: undefined symbol: cudaLaunchKernelExC
```

Differential Revision: D66793974

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142094
Approved by: https://github.com/chenyang78, https://github.com/hl475
2024-12-06 02:18:54 +00:00
8bfc0094e4 [reland][dynamo][guards] Consider tensors as immutable for dict tag matches (#141085)
Reland - https://github.com/pytorch/pytorch/pull/139560

As mentioned in https://github.com/pytorch/pytorch/pull/130341, using `static py::object` can lead to segfaults. I suspect this is the reason for the import system error seen internally (https://www.internalfb.com/sevmanager/view/469592). In this PR, I am removing the `static` part. This is fine and also the right thing to do because this will catch if user changes the flag in the same process for compiling two different functions.

Unfortunately, there is no easy way to trigger this segfault, so I can't write a test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141085
Approved by: https://github.com/jansel

Co-authored-by: William Wen <williamwen@meta.com>
2024-12-06 01:49:55 +00:00
ce22a01e11 Add an option for classic search (#142018)
Fixes https://github.com/pytorch/tutorials/issues/3143

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142018
Approved by: https://github.com/albanD
2024-12-06 01:24:52 +00:00
e803a3d83a Fix reductions for NJTs with ragged_idx != 1 (#142173)
**Background:** conversion from outer dim -> inner dim makes the (previously valid) assumption that the ragged dim is immediately next to the batch dim. This is no longer the case after #137125.

This PR:
* Updates the outer dim -> inner dim conversion logic to match the actual ragged_idx. Since ragged_idx tells us where the packed ragged / batch dim is, both ragged and batch outer dims should map to this inner dim. The conversion logic must now take in `ragged_idx` to make this possible, so the PR updates all call-sites to pass this.
* Fixes outputs across keepdim settings when reducing over ragged / batch dims.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142173
Approved by: https://github.com/drisspg
2024-12-06 01:23:17 +00:00
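A minimal sketch of the kind of reduction the fix above targets, assuming a recent build where jagged-layout reductions are supported; the corrected behavior for layouts with `ragged_idx != 1` is what the PR adds on top of this.
```python
import torch

# Two sequences of different lengths packed into a jagged nested tensor
# (batch dim 0, ragged dim 1).
nt = torch.nested.nested_tensor(
    [torch.randn(3, 8), torch.randn(5, 8)], layout=torch.jagged
)

# Reducing over the ragged dim collapses each sequence independently.
per_seq_sum = nt.sum(dim=1)
print(per_seq_sum.shape)  # torch.Size([2, 8])
```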
6b0df2f720 [torch.func] expand stack_module_state's typing (#142169)
Summary:
https://github.com/pytorch/pytorch/pull/141894 made this API actually
typed w.r.t. pyre, which is causing some internal type failures. This PR
expands the typing for stack_module_state to squash those failures.

Test Plan:
- pyre
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142169
Approved by: https://github.com/albanD
2024-12-06 01:08:53 +00:00
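For context on the API whose typing was expanded above, a typical `stack_module_state` use is model ensembling with `vmap`; the tiny Linear models here are just an illustration.
```python
import torch
from torch.func import stack_module_state, functional_call

models = [torch.nn.Linear(4, 2) for _ in range(3)]
params, buffers = stack_module_state(models)  # dicts of stacked parameters/buffers

base = torch.nn.Linear(4, 2).to("meta")  # stateless "template" module

def call_one(p, b, x):
    return functional_call(base, (p, b), (x,))

x = torch.randn(8, 4)
# Run all 3 ensemble members on the same batch in one vmapped call.
out = torch.vmap(call_one, in_dims=(0, 0, None))(params, buffers, x)
print(out.shape)  # torch.Size([3, 8, 2])
```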
93214aad30 [ROCM] Fix unit test: matmul_small_brute_force_tunableop (#142089)
Fixes #141636
Fixes #141635
Fixes #141458

Changes include:

- TunableOp filename that wasn't set properly
- Activate numerical check (see additional test comment)
- Entire test in try-finally clause to avoid OS environment variable leakage (see additional test comment)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142089
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2024-12-06 00:58:10 +00:00
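The try-finally pattern referenced above, sketched as a small helper; the TunableOp environment variable names in the commented usage are assumptions for illustration, and `some_matmul_test` is hypothetical.
```python
import os

def run_with_env(env: dict, fn):
    """Set env vars for the duration of fn(), restoring prior values afterwards."""
    saved = {k: os.environ.get(k) for k in env}
    try:
        os.environ.update(env)
        return fn()
    finally:
        for k, v in saved.items():
            if v is None:
                os.environ.pop(k, None)
            else:
                os.environ[k] = v

# Hypothetical usage mirroring the test: enable TunableOp with a scratch results
# file, and guarantee the variables do not leak into later tests.
# run_with_env({"PYTORCH_TUNABLEOP_ENABLED": "1",
#               "PYTORCH_TUNABLEOP_FILENAME": "/tmp/tunableop_results.csv"},
#              lambda: some_matmul_test())
```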
ad93aa854d E2E composability testing (#141398)
Add a 3D (pp+tp+fsdp) test `test_3d_with_tp_dp_pp` in test_pp_composability.
Currently provides @parametrize on:
- "ScheduleClass" for pp in [ScheduleGPipe, Schedule1F1B, ScheduleInterleaved1F1B, ScheduleLoopedBFS, ScheduleInterleavedZeroBubble]
- "MixedPrecisionParam" for fsdp in [torch.bfloat16, torch.float32]

Future work:
1. add fp8
2. add cp(context parallelism) to enable 4D test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141398
Approved by: https://github.com/wconstab, https://github.com/kwen2501
2024-12-06 00:53:22 +00:00
461bd2c2f7 Update nested tensor warning to recommend layout=torch.jagged (#142140)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142140
Approved by: https://github.com/YuqingJ
2024-12-06 00:40:30 +00:00
90052a8ae2 Revert "[ROCm] unskip hermite_polynomial_h unit tests (#141150)"
This reverts commit 69f8b3e269641fae93ed7afba49b6df8e44ed3c9.

Reverted https://github.com/pytorch/pytorch/pull/141150 on behalf of https://github.com/jeffdaily due to this PR is tied to #141955 and that one was reverted so need to revert this too ([comment](https://github.com/pytorch/pytorch/pull/141150#issuecomment-2521830067))
2024-12-06 00:39:56 +00:00
efab8c433f [subclass] Fix unwrap_subclass_parameters parametrization (#142155)
Parametrization can not be registered for non-direct child parameters of the module.
We have to iterate through all submodules and register parametrization at every level.

Original testcase did not test the nested modules case - adding submodule to the test.

Testing:
```
python test/functorch/test_aotdispatch.py -k test_subclass_parameters
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142155
Approved by: https://github.com/bdhirsh
2024-12-05 23:53:36 +00:00
2bfc600644 Unbreak dynamic shape minimal arrayref interface tests (#142091)
Simple bug got introduced somewhere.

Differential Revision: [D66792420](https://our.internmc.facebook.com/intern/diff/D66792420/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142091
Approved by: https://github.com/desertfire, https://github.com/hl475
2024-12-05 23:26:35 +00:00
ca7be75e0a Reduce the nproc when building FA on 8.9 (#142164)
The newly introduced sm89 build is consistently failing in trunk because of OOM https://github.com/pytorch/pytorch/actions/runs/12186328178/job/33994606556.  I suspect that FlashAttention is the cause.

### Testing

CI to see if the build works.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142164
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-12-05 23:23:40 +00:00
4981bd8355 Make cache keys consistent between OSS and internal (#142147)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142147
Approved by: https://github.com/jamesjwu, https://github.com/masnesral
2024-12-05 22:29:07 +00:00
a9e3281e94 [rfc][be] static assert that nccl version is >= 2.4 (#142023)
Summary:
Static assert that NCCL VERSION is greater than 2.4.
This is in preparation of enabling error checking by default in PyTorch library and removal of some macros.
This is in PR #141914.
The rationale behind this version is:
1. 2.4 was released ~2 years ago, so it's unlikely that someone is still using the old library.
2. Enabling error checking is beneficial to the community as it helps debug subtle bugs in production environments.

Test Plan: unit tests

Differential Revision: D66737055

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142023
Approved by: https://github.com/kwen2501
2024-12-05 22:11:14 +00:00
5513e2ec35 [SymmetricMemory] use the python version of empty() and rendezvous() for tests and library ops (#142154)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142154
Approved by: https://github.com/weifengpy
2024-12-05 22:09:36 +00:00
16ea0ddcdb Ignore logger methods to avoid graph breaks (#139403)
Fixes #132635

Calls to logging.Logger methods cause a graph break. This PR allows the user to avoid these graph breaks (for specific methods) by setting DISABLE_LOGS_WHILE_COMPILING to 1.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139403
Approved by: https://github.com/williamwen42
2024-12-05 20:12:26 +00:00
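A hedged sketch of the opt-in described above, using the environment variable name from the commit message; whether a particular logger call is skipped depends on the methods covered by the PR, so this is illustrative rather than a guaranteed recipe.
```python
import logging
import os

# Per the commit message: opt in before compiling
# (assumption: the flag is read when torch is imported/compiled).
os.environ["DISABLE_LOGS_WHILE_COMPILING"] = "1"

import torch

logger = logging.getLogger(__name__)

@torch.compile
def f(x):
    logger.info("processing a batch")  # would normally cause a graph break
    return x.sin() + 1

print(f(torch.randn(4)))
```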
41952c1876 Revert "export AOTI_TORCH_EXPORT on Windows. (#140030)"
This reverts commit 38e0f72274cfad88e0f2ca40f27c79cd49413f5e.

Reverted https://github.com/pytorch/pytorch/pull/140030 on behalf of https://github.com/malfet due to This broke sm89 builds ([comment](https://github.com/pytorch/pytorch/pull/140030#issuecomment-2521290457))
2024-12-05 20:07:29 +00:00
39482907be [AOTI] Refactor codegen_inputs in wrapper codegen (#141965)
Summary: Fork codegen_inputs for CppWrapperCodegen, because the behavior between python and cpp needs to diverge. On the python side, input backed symbols need to be generated for the autotune block. This is to prepare for one-pass AOTI CUDA codegen.

Differential Revision: [D66718225](https://our.internmc.facebook.com/intern/diff/D66718225)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141965
Approved by: https://github.com/chenyang78
ghstack dependencies: #141388, #141387, #141979
2024-12-05 19:49:34 +00:00
2fd8a7be71 [AOTI] Refactor additional_files generation (#141979)
Summary: https://github.com/pytorch/pytorch/pull/140675 adds logic to collect all the generated cubin file paths into an additional_files list, but the collection should only happen when DeferredGpuKernelLine is materialized. This is to prepare for one-pass AOTI CUDA codegen.

Differential Revision: [D66718227](https://our.internmc.facebook.com/intern/diff/D66718227)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141979
Approved by: https://github.com/chenyang78
ghstack dependencies: #141388, #141387
2024-12-05 19:49:02 +00:00
7e49da6077 DLPack: add support to PyTorch/XLA (#138470)
Taking over: #128176.

In summary, this PR:

- `__dlpack__`: Calls PyTorch/XLA `to_dlpack` function, if the tensor lives in an XLA:CUDA device
- `__dlpack_device__`: Correctly maps PyTorch/XLA tensors to `kDLGPU`, if XLA:CUDA is being used

The tests are introduced in https://github.com/pytorch/xla/pull/7213.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138470
Approved by: https://github.com/albanD

Co-authored-by: iefgnoix <isaacwxf23@gmail.com>
2024-12-05 19:36:36 +00:00
5f28c42746 [AOIT] Remove several overloaded members from WrapperCodegen (#141387)
Summary: Remove several overloaded string members from WrapperCodegen classes, including open_bracket, closed_braket, size, stride. Instead of relying on polymorphism, we explicitly generate different strings for PythonWrapperCodegen and CppWrapperCodegen. This is to prepare for one-pass AOTI CUDA codegen.

Differential Revision: [D66459991](https://our.internmc.facebook.com/intern/diff/D66459991)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141387
Approved by: https://github.com/chenyang78
ghstack dependencies: #141388
2024-12-05 19:29:38 +00:00
4cc0fc2707 [AOTI] Remove WrapperCodegen.expr_printer (#141388)
Summary: Avoid using expr_printer as an overriden class member for WrapperCodegen. Instead, use pexpr and cexpr explicitly for python and cpp expression print respectively. This is to prepare for one-pass AOTI CUDA codegen, where PythonWrapperCodegen is used to generate the autotune block and CppWrapperCodegen is used to generate the model code.

Differential Revision: [D66459992](https://our.internmc.facebook.com/intern/diff/D66459992)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141388
Approved by: https://github.com/chenyang78
2024-12-05 19:20:39 +00:00
12b8c2fd8b Remove lintrunner windows exclusion (#142150)
Lintrunner is now available for Windows (see https://pypi.org/project/lintrunner/0.12.7/#files), so we no longer need to ask developers to install Rust on that platform.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142150
Approved by: https://github.com/wdvr
2024-12-05 19:02:21 +00:00
a9d84875a9 Fix mha torch._check in jit tracing (#142059)
Test Plan: `buck2 run @//mode/dev-nosan //mobile-vision/d2go/projects_oss/detr:tests -- -r test_detr_fbnet_export`

Differential Revision: D66769339

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142059
Approved by: https://github.com/ezyang
2024-12-05 18:38:17 +00:00
540dc0c114 [aoti] Prototype loading from bytes (#142070)
The loader needs an official solution -- I'm pretty sure miniz can do this out of the box, but I haven't had time to look at it yet. For now it just loads the buffer into a file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142070
Approved by: https://github.com/henrylhtsang
2024-12-05 18:38:02 +00:00
a5ec09d0cd Flip specialize_float to default False in fbcode (#142111)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142111
Approved by: https://github.com/ezyang
2024-12-05 18:23:47 +00:00
7ff42f7f04 Use the correct CSV filenames for MPS benchmark (#142034)
After https://github.com/pytorch/pytorch/pull/135386 and https://github.com/pytorch/pytorch/pull/141999, MPS benchmark has been running for a while and the data has been uploaded correctly.  However, the dashboard is still using the old schema that requires the output CSV files to be named in a certain way for its query to work https://github.com/pytorch/test-infra/blob/main/torchci/clickhouse_queries/compilers_benchmark_performance/query.sql#L32-L40.  Specifically, the filename needs to be in the following format `inductor_${backend}_${suite}_${dtype}_${mode}_${device}_${target}.csv`.

The new schema does away with all this hacky setup, but the dashboard hasn't been migrated to it yet. So, this is a quick way to get the data to show up first.

### Testing

https://github.com/pytorch/pytorch/actions/runs/12153886764
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142034
Approved by: https://github.com/skotapati, https://github.com/malfet
2024-12-05 17:53:58 +00:00
cb70d9fd05 Revert "[ATen][Native][Special] Hermite polynomial prematurely return NaN if n is high (#141955)"
This reverts commit 51b7528e274d350c1d5091acc40572d6b43879b8.

Reverted https://github.com/pytorch/pytorch/pull/141955 on behalf of https://github.com/atalman due to Failing internal test ([comment](https://github.com/pytorch/pytorch/pull/141955#issuecomment-2521024701))
2024-12-05 17:39:32 +00:00
ae9cda0221 Add truediv support in export serializer (#136364)
Fixes #136113

- [x] Inital `truediv` coverage
- [ ] Expand/reduce coverage?
- [x] Add tests
- [x] Re-check docstrings
- [ ] Linting

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136364
Approved by: https://github.com/pianpwk

Co-authored-by: Angela Yi <angelayi@meta.com>
Co-authored-by: Pian Pawakapan <pianpwk@meta.com>
2024-12-05 17:33:33 +00:00
07edb2ec4d Update documentation for torch.mean() to note behavior with empty tensors (#142039)
This PR updates the documentation for `torch.mean()` to explicitly mention that computing the mean over an empty tensor returns `nan`. This clarification helps users understand the behavior and handle it appropriately in their code.

Fixes #141057

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142039
Approved by: https://github.com/albanD
2024-12-05 17:21:53 +00:00
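The documented behavior in question, shown directly:
```python
import torch

empty = torch.tensor([])
print(empty.mean())       # tensor(nan) -- mean over zero elements
print(torch.mean(empty))  # tensor(nan)
print(empty.sum())        # tensor(0.) -- sum, by contrast, is 0
```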
5bc09ac5e9 Remove option for fork-based compile pool (#142001)
Summary: This has been set to "subproc" for a while internally and externally, so we can remove and simplify some of the code. Note that there's no pressing need here -- just that since we've had internal outage with the legacy "fork" implementation, it doesn't seem helpful to leave it available. But if people aren't in the mood for this sort of cleanup, I won't be offended to abandon it.

Test Plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142001
Approved by: https://github.com/eellison, https://github.com/jansel
2024-12-05 17:02:08 +00:00
3cdd997f4c Update torch-xpu-ops commit pin (#142113)
Update the torch-xpu-ops commit to [7ecb0b](7ecb0b1a56), which includes:

- Capture rrelu_with_noise noise mutation in compile (resolves https://github.com/pytorch/pytorch/issues/142102)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142113
Approved by: https://github.com/EikanWang
2024-12-05 17:00:29 +00:00
f8c212a925 Transform unbacked int expressions into a fresh unbacked int. (#141917)
Fix: #141419

This PR introduces the `torch.sym_fresh_size` API, which transforms an unbacked int
expression into a fresh unbacked int.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141917
Approved by: https://github.com/ezyang
2024-12-05 16:53:44 +00:00
c376b29c67 [CI] Add more tests to the numpy 2 CI (#141925)
Related to #107302

This PR adds all the tests that previously failed with NumPy 2 (all of which have since been fixed) to the CI, so they run against NumPy 2 and prevent regressions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141925
Approved by: https://github.com/albanD
2024-12-05 16:46:21 +00:00
822e8a01c6 [ROCm][Inductor][CK] Add batched gemms into gemm max autotune with CK backend (#141520)
## Testing
```
TORCH_LOGS=+torch._inductor pytest --capture=no test/inductor/test_ck_backend.py -k bmm
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141520
Approved by: https://github.com/chenyang78
2024-12-05 16:03:12 +00:00
ca9aeedf40 Revert "[dynamo] Simplify handling of functools.wraps (#142000)"
This reverts commit f8cb692d77fb1ab75d6663eb32d71037b82e9107.

Reverted https://github.com/pytorch/pytorch/pull/142000 on behalf of https://github.com/atalman due to Newly added test test_functions.py::DefaultsTests::test_tree_map is failing internally ([comment](https://github.com/pytorch/pytorch/pull/142000#issuecomment-2520611808))
2024-12-05 15:23:53 +00:00
82c140327e Install magma from a tarball (#140417)
Magma is built for specific CUDA versions and stored in the ossci-linux bucket. Install it from there rather than the deprecated conda package.

There are two places where magma is installed today:
- `install_conda.sh`: extract the magma package in the same exact location where conda would install it, using a dedicated `install_magma_conda.sh` script. The new script is included in the relevant Dockerfiles where CUDA+magma is needed
- `install_magma.sh`: this script already uses a tarball. Use the new tarball instead of the tarball from the conda package. The format of the new tarball is compatible with the old one, so changes here are minimal.

Fixes #140538
Test PR: https://github.com/pytorch/pytorch/pull/141584

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140417
Approved by: https://github.com/atalman
2024-12-05 15:20:58 +00:00
86d08f0b4a Revert "[dynamo] Remove workaround for functools.wraps in functorch (#142014)"
This reverts commit ed77901ec521f3516c96f9ac2a48e659816c8905.

Reverted https://github.com/pytorch/pytorch/pull/142014 on behalf of https://github.com/atalman due to Sorry https://github.com/pytorch/pytorch/pull/142000 is failing internally, need to revert this ([comment](https://github.com/pytorch/pytorch/pull/142014#issuecomment-2520601186))
2024-12-05 15:18:56 +00:00
d24b147520 Update dead reference link for triplet margin loss (#142071)
The current link for _Learning local feature descriptors with triplets and shallow convolutional neural networks_ (https://www.bmva.org/bmvc/2016/papers/paper119/index.html) is dead (404). The paper is archived here: https://bmva-archive.org.uk/bmvc/2016/papers/paper119/index.html
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142071
Approved by: https://github.com/albanD
2024-12-05 15:01:10 +00:00
08df79819d Uniformly pass secrets: inherit to all jobs that go to _linux-build/_linux-test (#141995)
There's also a new lint to make sure you did it right.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141995
Approved by: https://github.com/albanD, https://github.com/malfet
2024-12-05 14:52:43 +00:00
c6c45467a3 Use cxx11-abi for Linux CUDA 12.6 builds (#142064)
Manylinux 2.28 and cxx11-abi migration. Please see: https://dev-discuss.pytorch.org/t/pytorch-linux-wheels-switching-to-new-wheel-build-platform-manylinux-2-28-on-november-12-2024/2581
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142064
Approved by: https://github.com/kit1980, https://github.com/malfet
2024-12-05 14:51:50 +00:00
2d1d125d60 Add BFloat16 support and use a new pack method for flash attention forward kernel (#138783)
* Add BFloat16 support for BRGEMM flash attention forward kernel
* Use a new pack method instead of oneDNN pack for flash attention forward kernel to avoid the output leading dimension limitation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138783
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-12-05 14:50:26 +00:00
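A minimal check of the BFloat16 path added above on CPU; the shapes are arbitrary, and whether the flash kernel is actually selected depends on the build and backend dispatch.
```python
import torch
import torch.nn.functional as F

q, k, v = (torch.randn(2, 4, 128, 64, dtype=torch.bfloat16) for _ in range(3))
out = F.scaled_dot_product_attention(q, k, v)
print(out.dtype, out.shape)  # torch.bfloat16 torch.Size([2, 4, 128, 64])
```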
470b775d7a Remove functorch config: _max_aliased_inputs_with_dynamic_shapes_enabled. (#141680)
This PR removes the functorch config that set an upper limit on the number of aliased
inputs with dynamic shapes. After moving them to be run at runtime in C++, the compilation
time and runtime (in true alias cases) improved, rendering the error no longer relevant.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141680
Approved by: https://github.com/bdhirsh
ghstack dependencies: #139554, #139555, #140013
2024-12-05 14:43:58 +00:00
12d28a5929 Move overlapping guards to C++. (#140013)
This PR moves the logic for computing the overlapping relations between input tensors that
share a storage instance to C++.

In summary, this PR:

- Moves both `tensors_definitely_do_not_overlap` and part of `compute_overlapping_tensors`
to C++
- Introduces a `check_overlapping` function that re-runs `compute_overlapping_tensors`,
checking that the result is consistent with what is expected
- Introduces the `StorageOverlapChecker` class
    - Keeps track of overlapping and non-overlapping tensors
    - Actually checks the overlapping relation (call `check_overlapping`) when all tensors
    are collected
- Introduces the `STORAGE_OVERLAPPING` relational guard
    - Has a reference to a `StorageOverlapChecker`
    - Stores the to-be-checked tensors in the checker, and triggers its check
- Introduces `install_storage_overlapping_guard` python function
    - Creates an instance of `StorageOverlapChecker`
    - Creates 2 instances of the `STORAGE_OVERLAPPING` guard (for overlapping and
    non-overlapping tensors), referencing the same `StorageOverlapChecker` instance

**Why is `StorageOverlapChecker` needed?**

The way `GuardManager` is implemented, we have no control over the order in which the
check methods are called, i.e. no control over the order in which the tensors are collected. So, we
can't easily split them into "overlapping" and "non-overlapping" kinds.

Instead, we create 2 instances of `STORAGE_OVERLAPPING` guard, each of which helps
collecting the tensors for one of the kinds mentioned above. They are then used in a
single `StorageOverlapChecker` instance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140013
Approved by: https://github.com/bdhirsh
ghstack dependencies: #139554, #139555
2024-12-05 14:43:58 +00:00
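To illustrate what the `STORAGE_OVERLAPPING` guard reasons about: views of a single storage may or may not overlap, and the overlap facts baked in at compile time must still hold at runtime. A small eager-mode illustration (not the guard implementation):
```python
import torch

base = torch.arange(10.)
a = base[0:6]   # elements 0..5
b = base[4:10]  # elements 4..9 -> overlaps a on elements 4 and 5
c = base[6:10]  # elements 6..9 -> disjoint from a

# All three are views of the same storage.
assert a.untyped_storage().data_ptr() == b.untyped_storage().data_ptr()

# Writing through one overlapping view is visible through the other.
a[5] = 100.0
print(b[1])  # tensor(100.) -- b[1] aliases a[5]
print(c)     # unaffected: c does not overlap a
```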
3a1ded5caa Add tensor overlapping guards. (#139555)
Fix: #118214

This PR replaces the guards introduced by running `_tensors_definitely_do_not_overlap` at
compile-time by a single `___check_overlapping` guard. When evaluated, this function calls
the original `_tensors_definitely_do_not_overlap` so as to check whether the current state
of the inputs are consistent, i.e. tensors that should overlap do overlap, and those that
shouldn't don't.

In summary, the changes are:

- Introduce `StorageOverlap` derived class from `GuardEnvExpr`
- Plumb `AOTConfig` to the `compute_overlapping_inputs` function, so as to have access to
AOTAutograd input sources
- Suppress the guards generated by `_tensors_definitely_do_not_overlap` function at runtime
- Issue a `StorageOverlap` AOTAutograd guard, specifying the sources that should and
shouldn't overlap

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139555
Approved by: https://github.com/bdhirsh
ghstack dependencies: #139554
2024-12-05 14:43:58 +00:00
cbfab8b4de Add tensor._base as a tracked fake for ShapeEnv guards. (#139554)
This PR fixes the issue where AOTAutograd would produce a guard that used a symbolic value
that came from one of the input's base.

```python
@torch.compile(backend="aot_eager", dynamic=True)
def f(a, b):
    a.add_(1)
    b.add_(1)
    return a

x = torch.ones(10)
f(x[1:], x[1:])
```

In the example above, AOTAutograd functionalizes the mutation by making use of
`as_strided_scatter` operation, which produces the guard: `s0 >= s1 + 1`, where:

- `s0`: corresponds to `x.size()[0]`
- `s1`: corresponds to `a.size()[0]`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139554
Approved by: https://github.com/bdhirsh
2024-12-05 14:43:58 +00:00
27bf7d62e7 Enable retry on A100 perf nightly (#142074)
This is a quick mitigation while the investigation on https://github.com/pytorch/pytorch/issues/142069 going.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142074
Approved by: https://github.com/jeanschmidt
2024-12-05 14:35:53 +00:00
6183c90e99 Avoid recursion in FloorDiv constructor (#142057)
Addresses https://github.com/pytorch/pytorch/issues/141215 and the max-recursion issue in the FloorDiv constructor.
This also improves performance by avoiding the construction of many sympy expressions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142057
Approved by: https://github.com/ezyang
2024-12-05 14:25:28 +00:00
895c8ce5b3 MetaTensorDesc changes for reconstructing proper FakeTensors (#141926)
A few changes to MetaTensorDesc and friends:

1. Change view_func from a raw method to an ADT where the common case (FakeTensor._view_func_unsafe) is a simple representation instead.
2. (minor) Remove and fix some `type: ignore`s added by #141839
3. (minor) Fix _UNSERIALIZABLE to be a set instead of a dict which is converted into a set each time it's used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141926
Approved by: https://github.com/ezyang
2024-12-05 14:21:57 +00:00
65c2086d45 fix the lint from D66795414 (#142122)
Summary: this diff is to fix the lint issues from D66457500 / https://github.com/pytorch/pytorch/pull/142056

Test Plan: OSS CI

Reviewed By: houseroad, FulinHuang

Differential Revision: D66795414

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142122
Approved by: https://github.com/houseroad
2024-12-05 12:05:51 +00:00
38e0f72274 export AOTI_TORCH_EXPORT on Windows. (#140030)
Fixes #139954

reproduce UT:
```cmd
pytest test/inductor/test_torchinductor_codegen_dynamic_shapes.py -k test_device_assert_dynamic_shapes_cpu
```
Issue:
<img width="856" alt="image" src="https://github.com/user-attachments/assets/5fc501a9-54e5-45ac-9fb3-509ec11a7abe">

After fixing:
![Image](https://github.com/user-attachments/assets/883846fb-8e92-4b9c-9400-daab32382a3a)

Reland:
1. Declare export on Windows explicitly.
2. Support cpu, cuda and xpu devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140030
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-12-05 11:25:55 +00:00
ad2cc96218 Refactor test_torchinductor_strided_blocks to also support triton CPU (#141587)
This increases test coverage for triton CPU from just test_torchinductor.py to also testing block pointer lowering.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141587
Approved by: https://github.com/jansel
2024-12-05 09:57:08 +00:00
8dd4673cea Support torch.xpu.mem_get_info API (#141230)
# Motivate
Fix https://github.com/pytorch/pytorch/issues/130599
This PR intends to add a new API, `torch.xpu.mem_get_info`, which is widely used in popular model workloads.
For example, [here](403c0714d1/src/accelerate/utils/modeling.py (L721)) we need to get the current GPU memory usage to split or load the model.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141230
Approved by: https://github.com/EikanWang, https://github.com/albanD
2024-12-05 08:17:25 +00:00
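A hedged usage sketch of the new API, assuming it mirrors `torch.cuda.mem_get_info` (returning free and total bytes) and that an XPU device is present:
```python
import torch

if torch.xpu.is_available():
    # Assumption: returns (free_bytes, total_bytes) for the given device,
    # analogous to torch.cuda.mem_get_info.
    free, total = torch.xpu.mem_get_info(0)
    print(f"free: {free / 2**30:.2f} GiB / total: {total / 2**30:.2f} GiB")
else:
    print("No XPU device available")
```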
0be004ff37 Enable fuse_by_partitions to always return output as tuple (#142056)
Summary:
aot_compile only accepts a graph with tuple outputs.
We introduce an option in fuse_by_partitions to always return outputs as a tuple, even if there is only a single entry.

Test Plan: OSS CI

Differential Revision: D66457500

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142056
Approved by: https://github.com/angelayi, https://github.com/hl475
2024-12-05 08:07:41 +00:00
f675f644fd Cleanup between each test in test/test_utils_config_module.py (#142087)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142087
Approved by: https://github.com/ezyang
2024-12-05 07:17:27 +00:00
cyy
aa95618268 [2/N] Apply py39 ruff fixes (#141938)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141938
Approved by: https://github.com/ezyang
2024-12-05 06:26:06 +00:00
cyy
653efe14e4 [3/N] Enable UBSAN tests (#142022)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142022
Approved by: https://github.com/ezyang
2024-12-05 06:06:53 +00:00
b31d3b2f41 Update torch-xpu-ops commit pin (#141949)
Update the torch-xpu-ops commit to [f31219](f312190a92), which includes:

- Add lazy init for empty_xpu
- Fix nan propagation error for soft_shrink

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141949
Approved by: https://github.com/EikanWang
2024-12-05 05:22:38 +00:00
31f2d4eb4e [export] Update docs (#142011)
Summary:
Update export docs, including:
1. Updating the output graph.
2. Misc fixes for examples.

Test Plan: CI

Differential Revision: D66726729

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142011
Approved by: https://github.com/angelayi
2024-12-05 03:44:46 +00:00
471017cbc9 avoid specializing strides with DDPOptimizer + inductor (#140751)
Fixes https://github.com/pytorch/pytorch/issues/140229

Fixes https://github.com/pytorch/pytorch/issues/139474

The issue was that:

(1) DDPOptimizer has some logic to partition the dynamo graph into buckets, and run AOTAutograd/inductor on each bucket

(2) doing so requires knowing the **exact** strides of the outputs of each subgraph, so we can have example inputs (with correct strides) to each of the later subgraphs to compile with

(3) there is some existing logic to do this today: we have a `fakify_first_call` flag in AOTAutograd that lets you run it with fake tensor inputs (to handle the calling convention changes that AOTAutograd performs at runtime). During this process, we query inductor for the output strides that it compiled with

(4) these outputs strides are stored in the FX graph cache as raw strings of sympy expressions. We have a function, `evaluate_symexpr`, which given the sympy string, and the ShapeEnv's `var_to_val` mapping, will evaluate the sympy string to generate concrete strides

(5) evaluating this expression will specialize on the exact values of any variables in our shape env, however. In DDPOptimizer, we want to know what inductor's stride outputs are symbolically. This requires converting the (string) sympy expression into actual `SymInts` that we can return.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140751
Approved by: https://github.com/eellison
2024-12-05 03:41:12 +00:00
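To make step (5) concrete: evaluating the cached sympy string against `var_to_val` bakes in the current hints, whereas DDPOptimizer needs the strides kept symbolic. A pure-sympy illustration (the symbol names and values are placeholders):
```python
import sympy

stride_expr = sympy.sympify("s0*s1")   # a stride as cached, stored as a string
var_to_val = {"s0": 8, "s1": 16}

# Evaluating with concrete hints specializes: any other (s0, s1) now looks wrong.
print(stride_expr.subs(var_to_val))    # 128

# What DDPOptimizer wants instead is the expression itself, mapped back onto the
# graph's SymInts, so later subgraphs are compiled against symbolic strides.
print(stride_expr.free_symbols)        # {s0, s1}
```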
b08bc07cd7 [AOTInductor] Option to not include weight in .so (#141997)
Summary: Add an option in config to not include weights in .so

Test Plan: `test/inductor:test_aot_inductor -- -r test_so_without_weight_cuda`

Reviewed By: desertfire

Differential Revision: D65968885

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141997
Approved by: https://github.com/desertfire
2024-12-05 03:35:54 +00:00
51cbac4e6a [export] Change fx graph _replace_hook to a list of Callable (#142006)
Summary: Change fx graph module's _replace_hook from a single hook, to a list of hooks. This is to prepare to registering more hooks for inductor provenance tracking, where we might need to register multiple hooks for node replacement.

Test Plan:
```
buck run mode/dev-nosan caffe2/test:fx -- -r test_hooks_for_node_update
buck run mode/dev-nosan caffe2/test:test_export -- -r test_replace_hook
```

Differential Revision: D66726724

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142006
Approved by: https://github.com/zhxchen17
2024-12-05 03:26:48 +00:00
45583a5df9 [FSDP2] Move to public torch.distributed.fsdp (#141868)
**Overview**
This PR moves `torch/distributed/_composable/fsdp` to `torch/distributed/fsdp/_fully_shard` and makes public APIs available from `torch.distributed.fsdp`, e.g.:
```
from torch.distributed.fsdp import fully_shard
```
This is targeting 2.6 release. I rewrote some of the documentation with (hopefully) improved phrasing.

**Follow-Ups**
- [x] Add some explanation in the docs about FSDP1 vs. FSDP2
- [ ] Move unit tests from `test/distributed/_composable/fsdp` to `test/distributed/fsdp/fully_shard/`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141868
Approved by: https://github.com/kwen2501, https://github.com/wconstab, https://github.com/weifengpy

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2024-12-05 03:04:01 +00:00
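A minimal sketch of the newly public import path and the usual per-module application order, assuming process-group / device-mesh initialization has already been done elsewhere (omitted here):
```python
import torch
from torch.distributed.fsdp import fully_shard  # new public location

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.Linear(1024, 1024),
)

# Shard each submodule first, then the root, so parameters are grouped per layer.
for layer in model:
    fully_shard(layer)
fully_shard(model)
```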
f9af86de01 [Inductor] Represent tiling as a dict (#141751)
# Summary

Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243. This makes it easier to generalize to multi-dimensional reductions.

This diff refactors `self.numels` from a tuple like `(8,16)` to a dict like `{"x": 8, "r": 16}`.

Note: this is based off of https://github.com/pytorch/pytorch/pull/141738, which enables `tree.is_reduction`. That PR should land first.

# Test plan
The existing CI provides good coverage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141751
Approved by: https://github.com/jansel
2024-12-05 02:28:16 +00:00
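The shape of the refactor described above, shown as a toy conversion (illustrative only; the real change is internal to Inductor's tiling code):
```python
# Before: tiling stored positionally, with the reduction numel last.
numels_tuple = (8, 16)

# After: tiling keyed by tree prefix, which generalizes to more dimensions
# (e.g. {"x": 8, "y": 4, "r": 16}) without relying on position.
numels_dict = {"x": numels_tuple[0], "r": numels_tuple[1]}
print(numels_dict)  # {'x': 8, 'r': 16}
```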
ff0cfec4c0 AsyncCollectiveTensor: fix _are_we_tracing() in dynamo (#142075)
Fixes https://github.com/pytorch/pytorch/issues/142076. Under compile, functional collectives are supposed to **not** return `AsyncCollectiveTensor`, and instead immediately issue calls to `wait_tensor()` (which we rely on the compiler to reorder as necessary).

This is done with a function `_are_we_tracing()`, that tries to detect if we are running from inside of the compiler. One of the checks it performs is `is_torchdynamo_compiling()` ([here](https://github.com/pytorch/pytorch/blob/main/torch/distributed/_functional_collectives.py#L808C8-L808C34)).

Unfortunately, this will always return False, even if dynamo is indeed tracing. The problem is that this function only returns true if dynamo **intercepts** the bytecode for `is_torchdynamo_compiling()`. However, this function is called during fake-tensor propagation, which is run as part of dynamo, but is not actually intercepted by dynamo itself.

One thing that we know is the case during dynamo tracing, however, is that a `FakeTensorMode` is active. So I tweaked the logic to assume that we are tracing if there is an active fake mode.

This could potentially have consequences for anybody running functional collectives with a fake mode directly, without compile in the loop. Although hopefully it's not too unreasonable to issue wait() calls immediately if you are running with fake tensor (presumably you only care about fake tensor propagation, in which case the wait() calls should technically be a no-op).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142075
Approved by: https://github.com/yifuwang, https://github.com/kwen2501
ghstack dependencies: #141725, #141728
2024-12-05 02:01:18 +00:00
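A small sketch of the signal the fix relies on: an active `FakeTensorMode` during fake-tensor propagation. `detect_fake_mode` is an internal helper whose location/behavior may change across versions, so treat this as illustrative.
```python
import torch
from torch._subclasses.fake_tensor import FakeTensorMode
from torch._guards import detect_fake_mode  # internal helper; may change across versions

# No fake mode active: nothing to detect, so collectives keep the
# AsyncCollectiveTensor path in eager mode.
assert detect_fake_mode(None) is None

with FakeTensorMode() as fake_mode:
    # During dynamo's fake-tensor propagation a FakeTensorMode is active,
    # which is the "we are tracing" signal the fix uses.
    assert detect_fake_mode(None) is fake_mode
```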
dbd7b820dd Revert "[ROCm] port CK rowwise F8 from fbgemm (#140856)"
This reverts commit 291626fb22832f9381524be73241b495efa60532.

Reverted https://github.com/pytorch/pytorch/pull/140856 on behalf of https://github.com/atalman due to Failing internal build ([comment](https://github.com/pytorch/pytorch/pull/140856#issuecomment-2518911997))
2024-12-05 01:51:40 +00:00
7e77c5ffba cpp_wrapper: input kwargs to custom ops (#141370)
Fixes a situation where kwargs were being passed to a Python fallback op, but as args rather than kwargs. This does not work for arguments that are kwarg-only.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141370
Approved by: https://github.com/desertfire
ghstack dependencies: #141368, #141580, #141369
2024-12-05 00:58:01 +00:00
dd7debdbe8 cpp_wrapper: rethrow Python exceptions, when present (#141369)
When running fallback operations in `cpp_wrapper` mode, Python errors thrown in the fallback should be propagated up the stack. This PR fixes the current situation, which discards all Python errors thrown in the fallback op in favor of an uninformative `RuntimeError`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141369
Approved by: https://github.com/desertfire
ghstack dependencies: #141368, #141580
2024-12-05 00:58:01 +00:00
4613bd393d cpp_wrapper: Add support for torch.device arguments (#141580)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141580
Approved by: https://github.com/desertfire
ghstack dependencies: #141368
2024-12-05 00:58:01 +00:00
923a778f97 cpp_wrapper: Complete support for Layout arguments (#141368)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141368
Approved by: https://github.com/desertfire
2024-12-05 00:58:01 +00:00
3fdc74ae29 Fix dumb typo (#142079)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142079
Approved by: https://github.com/jainapurva, https://github.com/soulitzer
2024-12-05 00:43:49 +00:00
60a192036b Refactor optional graph module into CompiledFxGraphConstants (#141897)
FXGraphCache supports freezing, but AOTAutogradCache does not. This is due to the fact that when freezing is turned on, instead of using the constants from the graph module that was saved on cache miss, we have to take the constants from the AOTAutograd generated graph module. This PR does two things:

- It bypasses AOTAutogradCache when freezing is turned on. We should have always been doing this.

- It refactors the code to be way more clear about the constants we're using and when we're using them.

Basically, there are two possible sets of constants we can grab from the compiled fx graph.

1. If freezing is turned off, we save the constants directly in CompiledFxGraph.
2. If freezing is turned on, we save the *names* of the constants in CompiledFxGraph, and use the runtime GraphModule's actual constant values: we reconstruct them from the saved names + the new graph module from AOTDispatch.

We implement two different classes for doing just this: one that has access to the post aotdispatch gm, which supports freezing, and one that doesn't have it, which does not support freezing. Then we construct the wrappers and unwrap the result as needed.

This makes it clear that the gm passed to AOTAutogradCache is *not* part of post compile, only the cache key generated from it is.

The whole flow is pretty confusing, but hopefully this gives us better types and static information for understanding what the different codepaths are doing.

Will add a specific AOTAutogradCache test to confirm we bypass freezing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141897
Approved by: https://github.com/ezyang, https://github.com/masnesral
2024-12-05 00:34:14 +00:00
25d9fa84ea [CI, 3.13] enable dynamo_wrapped unittests in 3.13 (#141264)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141264
Approved by: https://github.com/StrongerXi, https://github.com/atalman
ghstack dependencies: #141409, #142003, #141572, #141577, #141605, #141621, #141623, #141673, #141674, #141858, #141862, #139533, #140733, #141859, #141860, #141886, #141887, #141950, #141951
2024-12-05 00:33:26 +00:00
797a347cd0 [ci, 3.13] disable segfaulting dynamo-wrapped profiler test (#141951)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141951
Approved by: https://github.com/sraikund16, https://github.com/atalman
ghstack dependencies: #141409, #142003, #141572, #141577, #141605, #141621, #141623, #141673, #141674, #141858, #141862, #139533, #140733, #141859, #141860, #141886, #141887, #141950
2024-12-05 00:33:26 +00:00
ae71240780 [ci, 3.13] fix/skip failing numpy 2.0+ dynamo-wrapped tests (#141950)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141950
Approved by: https://github.com/StrongerXi, https://github.com/atalman
ghstack dependencies: #141409, #142003, #141572, #141577, #141605, #141621, #141623, #141673, #141674, #141858, #141862, #139533, #140733, #141859, #141860, #141886, #141887
2024-12-05 00:33:26 +00:00
fbd130a41f [ci, 3.13] skip failing module tracker dynamo-wrapped test (#141887)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141887
Approved by: https://github.com/StrongerXi, https://github.com/atalman
ghstack dependencies: #141409, #142003, #141572, #141577, #141605, #141621, #141623, #141673, #141674, #141858, #141862, #139533, #140733, #141859, #141860, #141886
2024-12-05 00:33:26 +00:00
9e474231d7 [ci, 3.13] skip failing torch.package dynamo-wrapped test (#141886)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141886
Approved by: https://github.com/PaliC
ghstack dependencies: #141409, #142003, #141572, #141577, #141605, #141621, #141623, #141673, #141674, #141858, #141862, #139533, #140733, #141859, #141860
2024-12-05 00:33:26 +00:00
408669a559 [dynamo, 3.13] disable 3.13.0 warning in dynamo-wrapped tests (#141860)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141860
Approved by: https://github.com/StrongerXi, https://github.com/atalman
ghstack dependencies: #141409, #142003, #141572, #141577, #141605, #141621, #141623, #141673, #141674, #141858, #141862, #139533, #140733, #141859
2024-12-05 00:33:26 +00:00
d34235a2a3 [dynamo, 3.13] add JUMP_BACKWARD_NO_INTERRUPT to terminal opcodes (#141859)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141859
Approved by: https://github.com/StrongerXi, https://github.com/atalman
ghstack dependencies: #141409, #142003, #141572, #141577, #141605, #141621, #141623, #141673, #141674, #141858, #141862, #139533, #140733
2024-12-05 00:33:26 +00:00
2f45484331 [ci] add 3.13 inductor unittests to CI (#140733)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140733
Approved by: https://github.com/malfet, https://github.com/atalman
ghstack dependencies: #141409, #142003, #141572, #141577, #141605, #141621, #141623, #141673, #141674, #141858, #141862, #139533
2024-12-05 00:33:26 +00:00
3baf8859e6 [Dynamo] Replace torch._dynamo.optimize() with torch.compile() [4/N] (#140253)
related commits:

- #139706
- #140238
- #140247
- #140253
- #140663
- #140688

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140253
Approved by: https://github.com/soulitzer
2024-12-05 00:30:00 +00:00
416f500bfe [CI, 3.13] enable 3.13 CI (#139533)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139533
Approved by: https://github.com/atalman, https://github.com/malfet
ghstack dependencies: #141409, #142003, #141572, #141577, #141605, #141621, #141623, #141673, #141674, #141858, #141862
2024-12-05 00:25:03 +00:00
abc4111348 [ci, 3.13] skip dynamo-xpass'd numpy tests in numpy >= 2.0 (#141862)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141862
Approved by: https://github.com/StrongerXi, https://github.com/atalman
ghstack dependencies: #141409, #142003, #141572, #141577, #141605, #141621, #141623, #141673, #141674, #141858
2024-12-05 00:25:02 +00:00
76d1047629 [dynamo, 3.13] support CONVERT_VALUE (#141858)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141858
Approved by: https://github.com/StrongerXi, https://github.com/atalman
ghstack dependencies: #141409, #142003, #141572, #141577, #141605, #141621, #141623, #141673, #141674
2024-12-05 00:24:55 +00:00
40c959484c [ci, 3.13] disable segfaulting profiler tests in 3.13 (#141674)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141674
Approved by: https://github.com/sraikund16, https://github.com/atalman
ghstack dependencies: #141409, #142003, #141572, #141577, #141605, #141621, #141623, #141673
2024-12-05 00:24:48 +00:00
c946d82077 [ci, 3.13] disable another failing cpp_extension test in 3.13 (#141673)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141673
Approved by: https://github.com/StrongerXi, https://github.com/atalman
ghstack dependencies: #141409, #142003, #141572, #141577, #141605, #141621, #141623
2024-12-05 00:24:42 +00:00
cd56cd30f2 [ci, 3.13] disable failing cpp_extension test due to weights_only error in numpy 2.1 (#141623)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141623
Approved by: https://github.com/mikaylagawarecki, https://github.com/atalman
ghstack dependencies: #141409, #142003, #141572, #141577, #141605, #141621
2024-12-05 00:24:35 +00:00
2be8d16247 [ci, 3.13] disable some quantization tests affected by numpy 2.1 overflow error (#141621)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141621
Approved by: https://github.com/jerryzh168, https://github.com/atalman
ghstack dependencies: #141409, #142003, #141572, #141577, #141605
2024-12-05 00:24:29 +00:00
314e5dd1d1 [ci, 3.13] skip some parts of a failing jit test in 3.13 (#141605)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141605
Approved by: https://github.com/davidberard98, https://github.com/atalman
ghstack dependencies: #141409, #142003, #141572, #141577
2024-12-05 00:24:22 +00:00
1a44f01beb [ci, 3.13] update test_testing.py usage of locals() for 3.13 (#141577)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141577
Approved by: https://github.com/StrongerXi, https://github.com/atalman
ghstack dependencies: #141409, #142003, #141572
2024-12-05 00:24:14 +00:00
9459952175 [ci, 3.13] update tensorboard version for 3.13 to fix broken tests (#141572)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141572
Approved by: https://github.com/StrongerXi, https://github.com/atalman
ghstack dependencies: #141409, #142003
2024-12-05 00:24:07 +00:00
c93dd531d3 format test_monitor.py and test_tensorboard.py (#142003)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142003
Approved by: https://github.com/StrongerXi, https://github.com/atalman
ghstack dependencies: #141409
2024-12-05 00:23:54 +00:00
22ae34af88 [torch.package, 3.13] fixes to torch.package for 3.13 (#141409)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141409
Approved by: https://github.com/PaliC, https://github.com/atalman
2024-12-05 00:23:47 +00:00
e6e75ebd0a Silent TD warnings when there is no td_results.json (#142083)
Despite the fact that we have `continue-on-error: true` there, GH behaves noisily when `td_results.json` doesn't exist.
For example, all benchmark jobs in https://github.com/pytorch/pytorch/actions/runs/12149624686 finished successfully, but they all showed up as errors in the GH UI. To make this worse, the log classifier sometimes picks up the error: https://github.com/pytorch/pytorch/actions/runs/12149624686/job/33882285001#step:16:37
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142083
Approved by: https://github.com/clee2000
2024-12-04 23:43:29 +00:00
UV
0318589e87 Changed 'standard-deviation' to 'variance' in GroupNorm documentation (#141982)
Fixes #141315

Updated the GroupNorm documentation to replace 'standard-deviation' with 'variance' to accurately reflect the calculation
method.

@pytorchbot label "topic: not user facing"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141982
Approved by: https://github.com/mikaylagawarecki
2024-12-04 22:49:45 +00:00
326f487809 Bypass AutogradCache when view replay affects the mutation meta (#141978)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141978
Approved by: https://github.com/bdhirsh
2024-12-04 22:13:12 +00:00
fa2fe9cafb Delete linux-focal-cuda12.4-py3.10-gcc9-sm86 from trunk (#142073)
As it exactly mirrors the the job on pull.yml, see
c83b739f14/.github/workflows/pull.yml (L479-L495)

And also HUD [permalink](53768d67ab/3):

<img width="1201" alt="image" src="https://github.com/user-attachments/assets/d3f0f81c-843b-4f96-82ce-9fd18ebfe2ad">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142073
Approved by: https://github.com/seemethere, https://github.com/huydhn, https://github.com/atalman
2024-12-04 22:04:55 +00:00
69f8b3e269 [ROCm] unskip hermite_polynomial_h unit tests (#141150)
A large n input caused a regression starting in ROCm 6.1: the for loop runs for an excessive number of iterations. The root cause seems to be how static_cast<int64_t> behaves for large float values such as 1e20, which certain unit tests use. The workaround is to break out of the loop once the returned value reaches NaN.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141150
Approved by: https://github.com/eqy, https://github.com/malfet
2024-12-04 22:01:57 +00:00
53768d67ab Fix unit test failures with SciPy 1.13+ (#141986)
Related to #107302

To use `numpy>=2`, we need to upgrade `scipy>=1.13.0` from `1.11.0`.
This PR fixes a failed test caused by the `scipy` upgrade.

The `scipy` implementation of `logsumexp` has changed and deviated from the torch implementation.
So, we replace it with a simple custom implementation as the ground truth.
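
For illustration, a minimal stable logsumexp reference of the kind described (a sketch, not necessarily the exact helper added in the test):

```python
import numpy as np

def ref_logsumexp(a, axis=-1):
    # subtract the max for numerical stability, then log-sum-exp along `axis`
    m = np.max(a, axis=axis, keepdims=True)
    return (m + np.log(np.sum(np.exp(a - m), axis=axis, keepdims=True))).squeeze(axis)
```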

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141986
Approved by: https://github.com/rgommers, https://github.com/albanD
2024-12-04 21:41:38 +00:00
54324fc2d9 [MPS] Release MetalShaderLibrary cached resources (#142053)
By releasing retained `id<MTLFunction>` and `id<MTLComputePipelineState>`
Please note that the `id<MTLLibrary>` associated with the class is currently leaked, which is by design; all dynamic shader allocations should use `DynamicMetalShaderLibrary`.

Test plan: `leaks --atExit -- ./bin/mps_test_metal_library`

Before:
```
STACK OF 1 INSTANCE OF 'ROOT LEAK: <_MTLFunctionInternal>':
18  dyld                                  0x197a94274 start + 2840
17  mps_test_metal_library                0x1002cb420 main + 68
16  mps_test_metal_library                0x1002fa388 testing::UnitTest::Run() + 124
15  mps_test_metal_library                0x1002fa40c bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) + 80
14  mps_test_metal_library                0x1002fac50 testing::internal::UnitTestImpl::RunAllTests() + 1588
13  mps_test_metal_library                0x1002e9934 testing::TestSuite::Run() + 1032
12  mps_test_metal_library                0x1002e8688 testing::TestInfo::Run() + 960
11  mps_test_metal_library                0x1002e715c testing::Test::Run() + 812
10  mps_test_metal_library                0x1002e7200 void testing::internal::HandleExceptionsInMethodIfSupported<testing::TestSuite, void>(testing::TestSuite*, void (testing::TestSuite::*)(), char const*) + 80
9   mps_test_metal_library                0x1002c5518 MPSTestMetalLibrary_ArangeShader_Test::TestBody() + 420
8   libtorch_cpu.dylib                    0x10fdd3804 at::native::mps::MetalShaderLibrary::getKernelFunction(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) + 56
7   libtorch_cpu.dylib                    0x10fdd3394 at::native::mps::MetalShaderLibrary::getLibraryPipelineState(id<MTLLibrary>, std::__1::basic_string<char, id<MTLLibrary>::char_traits<char>, id<MTLLibrary>::allocator<char>> const&) + 268
6   com.apple.Metal                       0x1a2be43b4 -[_MTLLibrary newFunctionWithName:] + 28
5   com.apple.Metal                       0x1a2be4498 -[_MTLLibrary newFunctionWithNameInternal:] + 148
4   com.apple.Metal                       0x1a2be4580 MTLLibraryContainer::functionWithName(NSString*, id<MTLDevice>) + 68
3   com.apple.Metal                       0x1a2be4724 MTLLibraryDataWithArchive::newFunction(NSString*, id<MTLDevice>) + 368
2   libobjc.A.dylib                       0x197a49ddc _objc_rootAllocWithZone + 48
1   libsystem_malloc.dylib                0x197c3baf8 _calloc + 88
0   libsystem_malloc.dylib                0x197c4e9bc _malloc_zone_calloc_instrumented_or_legacy + 128
====
    2 (592 bytes) ROOT LEAK: <_MTLFunctionInternal 0x1325e5550> [448]
       1 (144 bytes) _functionQueue --> <dispatch_queue_t (serial) 0x13254c340> [144]  "function queue" (from Metal)
```
After:
```
Process:         mps_test_metal_library [30687]
Path:            /Users/USER/*/mps_test_metal_library
Load Address:    0x100f74000
Identifier:      mps_test_metal_library
Version:         0
Code Type:       ARM64
Platform:        macOS
Parent Process:  leaks [30686]

Date/Time:       2024-12-04 07:57:01.020 -0800
Launch Time:     2024-12-04 07:56:59.030 -0800
OS Version:      macOS 15.1.1 (24B2091)
Report Version:  7
Analysis Tool:   /usr/bin/leaks

Physical footprint:         177.2M
Physical footprint (peak):  236.5M
Idle exit:                  untracked
----

leaks Report Version: 4.0, multi-line stacks
Process 30687: 40691 nodes malloced for 5575 KB
Process 30687: 0 leaks for 0 total leaked bytes.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142053
Approved by: https://github.com/manuelcandales
ghstack dependencies: #142052
2024-12-04 21:40:50 +00:00
e8200a507d [MPS] Fix memory leak (#142052)
`NSProcessInfo` was allocated inside an autorelease pool, but was not added to the pool.

Test plan: `leaks --atExit -- ./bin/mps_test_print`

Before it reported the leaks as follows
```
leaks Report Version: 4.0, multi-line stacks
Process 30066: 39595 nodes malloced for 5034 KB
Process 30066: 7 leaks for 448 total leaked bytes.

STACK OF 1 INSTANCE OF 'ROOT LEAK: <NSProcessInfo>':
29  dyld                                  0x197a94274 start + 2840
28  mps_test_print                        0x10224440c main + 68
27  mps_test_print                        0x1022733e4 testing::UnitTest::Run() + 124
26  mps_test_print                        0x102273468 bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) + 80
25  mps_test_print                        0x102273cac testing::internal::UnitTestImpl::RunAllTests() + 1588
24  mps_test_print                        0x102262990 testing::TestSuite::Run() + 1032
23  mps_test_print                        0x1022616e4 testing::TestInfo::Run() + 960
22  mps_test_print                        0x1022601b8 testing::Test::Run() + 812
21  mps_test_print                        0x10226025c void testing::internal::HandleExceptionsInMethodIfSupported<testing::TestSuite, void>(testing::TestSuite*, void (testing::TestSuite::*)(), char const*) + 80
20  mps_test_print                        0x102240f88 MPSPrintTest_PrintFloatMatrix_Test::TestBody() + 88
19  mps_test_print                        0x1022414f4 torch::randn(c10::ArrayRef<long long>, c10::TensorOptions) + 72
18  libtorch_cpu.dylib                    0x10de1cb34 at::_ops::randn::call(c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::ScalarType>, std::__1::optional<c10::Layout>, std::__1::optional<c10::Device>, std::__1::optional<bool>) + 280
17  libtorch_cpu.dylib                    0x10de1cf1c at::_ops::randn::redispatch(c10::DispatchKeySet, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::ScalarType>, std::__1::optional<c10::Layout>, std::__1::optional<c10::Device>, std::__1::optional<bool>) + 152
16  libtorch_cpu.dylib                    0x10d9b1078 at::native::randn(c10::ArrayRef<long long>, std::__1::optional<c10::ScalarType>, std::__1::optional<c10::Layout>, std::__1::optional<c10::Device>, std::__1::optional<bool>) + 60
15  libtorch_cpu.dylib                    0x10d9b1220 at::native::randn(c10::ArrayRef<long long>, std::__1::optional<at::Generator>, std::__1::optional<c10::ScalarType>, std::__1::optional<c10::Layout>, std::__1::optional<c10::Device>, std::__1::optional<bool>) + 256
14  libtorch_cpu.dylib                    0x10e0151f8 at::_ops::normal_::call(at::Tensor&, double, double, std::__1::optional<at::Generator>) + 476
13  libtorch_cpu.dylib                    0x10f08ceac c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor& (at::Tensor&, double, double, std::__1::optional<at::Generator>), &at::(anonymous namespace)::(anonymous namespace)::wrapper_MPS__normal_(at::Tensor&, double, double, std::__1::optional<at::Generator>)>, at::Tensor&, c10::guts::typelist::typelist<at::Tensor&, double, double, std::__1::optional<at::Generator>>>, at::Tensor& (at::Tensor&, double, double, std::__1::optional<at::Generator>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor&, double, double, std::__1::optional<at::Generator>) + 84
12  libtorch_cpu.dylib                    0x10f037674 at::(anonymous namespace)::(anonymous namespace)::wrapper_MPS__normal_(at::Tensor&, double, double, std::__1::optional<at::Generator>) + 72
11  libtorch_cpu.dylib                    0x111d8bde8 at::native::normal_mps_(at::Tensor&, double, double, std::__1::optional<at::Generator>) + 132
10  libtorch_cpu.dylib                    0x111d8c334 at::native::mps::normal_mps_impl(at::Tensor&, double, double, std::__1::optional<at::Tensor> const&, std::__1::optional<at::Tensor> const&, std::__1::optional<at::Generator>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>) + 884
9   libtorch_cpu.dylib                    0x111d8b8d8 at::Tensor& at::native::mps::random_mps_impl<double>(at::Tensor&, double, double, std::__1::optional<at::Tensor> const&, std::__1::optional<at::Tensor> const&, MPSGraphRandomDistribution, std::__1::optional<at::Generator>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, MPSGraphTensor* (at::native::mps::RandomCachedGraph*, MPSGraphTensor*) block_pointer) + 2508
8   libtorch_cpu.dylib                    0x111d453bc at::native::mps::Placeholder::Placeholder(MPSGraphTensor*, at::Tensor const&, NSArray<NSNumber*>*, bool, MPSDataType, bool) + 5120
7   libtorch_cpu.dylib                    0x111d2dbc8 at::mps::MPSDevice::isMacOS13Plus(at::mps::MacOSVersion) const + 404
6   libtorch_cpu.dylib                    0x111d2ddf0 at::mps::MPSDevice::isMacOS13Plus(at::mps::MacOSVersion) const::$_0::operator()(int, int) const + 48
5   libobjc.A.dylib                       0x197a7b3f4 objc_alloc_init + 80
4   com.apple.Foundation                  0x19995fbe4 +[NSProcessInfo alloc] + 112
3   com.apple.Foundation                  0x19995faec +[NSProcessInfo allocWithZone:] + 120
2   libobjc.A.dylib                       0x197a49ddc _objc_rootAllocWithZone + 48
1   libsystem_malloc.dylib                0x197c3baf8 _calloc + 88
0   libsystem_malloc.dylib                0x197c4e9bc _malloc_zone_calloc_instrumented_or_legacy + 128
====
    1 (64 bytes) ROOT LEAK: <NSProcessInfo 0x102ce4de0> [64]
```
After test run finishes with no leaks reported
```
Process 29875 is not debuggable. Due to security restrictions, leaks can only show or save contents of readonly memory of restricted processes.

Process:         mps_test_print [29875]
Path:            /Users/USER/*/mps_test_print
Load Address:    0x10223c000
Identifier:      mps_test_print
Version:         0
Code Type:       ARM64
Platform:        macOS
Parent Process:  leaks [29874]

Date/Time:       2024-12-04 07:43:15.287 -0800
Launch Time:     2024-12-04 07:43:14.400 -0800
OS Version:      macOS 15.1.1 (24B2091)
Report Version:  7
Analysis Tool:   /usr/bin/leaks

Physical footprint:         172.0M
Physical footprint (peak):  234.1M
Idle exit:                  untracked
----

leaks Report Version: 4.0, multi-line stacks
Process 29875: 39508 nodes malloced for 5021 KB
Process 29875: 0 leaks for 0 total leaked bytes.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142052
Approved by: https://github.com/manuelcandales
2024-12-04 21:40:50 +00:00
c0c8f41679 [ROCm] add gfx1101 to wheels (#141667)
- Remove older ROCm 5.x build condition

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141667
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2024-12-04 21:21:29 +00:00
760b8ec10a [easy] Log bypass reasons if we're unable to serialize or deserialize a saved graph (#141911)
When we fail to deserialize/serialize a graph, we should alert and log it somewhere so that it's debuggable.

This can happen in OSS if we use view_replay and encounter an output that requires functional tensor to be serialized to work.

Differential Revision: [D66669993](https://our.internmc.facebook.com/intern/diff/D66669993/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141911
Approved by: https://github.com/oulgen, https://github.com/ezyang
2024-12-04 21:03:32 +00:00
f24a9d0755 [PGNCCL] Fix behavior of destroy_process_group (#141510)
Today `destroy_process_group()` is implemented via `ncclCommAbort`.
When a user calls it from the CPU side, the risk is that a healthy NCCL kernel gets preempted, which causes data corruption.

Instead of aborting kernels, we should flush collectives in `destroy_process_group`, i.e. let them complete normally, before we tear down resources.

This PR implements such "flushing" behavior using `ncclCommFinalize`, then reclaims resources via `ncclCommDestroy`.

Expected behaviors:
For a bad program, a hang is expected at `destroy_process_group()`. If the PG uses non-blocking communicators, such a hang is recoverable, because we attach a timeout to the flush behavior.
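
For reference, a typical teardown under the new semantics (sketch; assumes a standard NCCL process group is already initialized):

```python
import torch.distributed as dist

# ... collectives issued earlier in the program ...
# destroy_process_group now flushes outstanding work (ncclCommFinalize)
# before reclaiming resources (ncclCommDestroy), instead of aborting kernels.
dist.destroy_process_group()
```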

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141510
Approved by: https://github.com/wconstab
2024-12-04 20:30:47 +00:00
f7bd0c6b60 [doc] Fix the toctree level (#142008)
Changing this back 1 in order to not expand on the index.html page.
Before:
![Screenshot 2024-12-04 at 11 47 54 AM (2)](https://github.com/user-attachments/assets/40d730ee-61b9-4d60-ab13-9b9075cb3cba)
After:
![Screenshot 2024-12-04 at 11 48 30 AM (2)](https://github.com/user-attachments/assets/5eb711a0-e76c-4573-9fdf-88b6b94b31a9)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142008
Approved by: https://github.com/sekyondaMeta, https://github.com/malfet
2024-12-04 19:52:14 +00:00
d552625920 [xla] Update pin to current xla/master (#142065)
The previous xla pin update pointed to a branch on xla, which I force-pushed to remove .torch_pin from the xla PR, so that commit became unavailable. Updating to the merged xla/master.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142065
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-12-04 19:44:50 +00:00
ed77901ec5 [dynamo] Remove workaround for functools.wraps in functorch (#142014)
This is no longer needed after #142000.

Fixes #123365.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142014
Approved by: https://github.com/zou3519
ghstack dependencies: #142000
2024-12-04 19:10:46 +00:00
f8cb692d77 [dynamo] Simplify handling of functools.wraps (#142000)
Previously, when Dynamo encountered a `functools.wraps(...)` call, it would
check `VariableTracker.can_reconstruct` and graph break on failure.

That has 2 issues:
1. Implementation of `can_reconstruct` is incorrect, since logic of
   reconstructability isn't necessarily encapsulated in
   `VariableTracker.reconstruct` -- for some VTs like `CellVariable`,
   it's also in `SideEffects.codegen_save_tempvars`. This is exposed by
   #134731.
2. We don't always need to reconstruct the result of
   `functools.wraps(...)`; in those cases we don't want to give up
   tracing because of an early `can_reconstruct` check. Instead we can
   just let it fall through and graph break in the actual `reconstruct`
   call later, if needed.

This patch removes the `can_reconstruct` check altogether. It was
introduced in #114279, but the added tests pass even without the check
now; this might be because of some recent bug fixing on cells and side
effects.

Fixes #134731, #141514.
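
As an illustration, the kind of user code this affects (a sketch; the decorated function never needs to be reconstructed, so tracing can simply continue):

```python
import functools

import torch

def my_decorator(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        return fn(*args, **kwargs)
    return wrapper

@torch.compile
def f(x):
    g = my_decorator(torch.sin)  # functools.wraps is traced; result is never reconstructed
    return g(x)
```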

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142000
Approved by: https://github.com/zou3519
2024-12-04 19:10:45 +00:00
51b7528e27 [ATen][Native][Special] Hermite polynomial prematurely return NaN if n is high (#141955)
Hermite polynomials diverge to NaN at high orders due to numerical overflow. The proposal is to return NaN early if it is known that the result at this value will be NaN.

According to my short test
```Python
import torch
device = "cuda"
dtype = torch.float32

x = torch.linspace(-1000, 1000, 100000, device=device, dtype=dtype)

for n in range(1024):
    if torch.special.hermite_polynomial_h(x, n).isnan().sum().item() == x.shape[0]:
        print(f"hermite_polynomial_h: all outputs are nans! n = {n}")
        break

for n in range(1024):
    if torch.special.hermite_polynomial_he(x, n).isnan().sum().item() == x.shape[0]:
        print(f"hermite_polynomial_he: all outputs are nans! n = {n}")
        break
```

The output values become NaNs at these orders:
```
hermite_polynomial_h: all outputs are nans! n = 53, dtype=torch.float32
hermite_polynomial_he: all outputs are nans! n = 61, dtype=torch.float32
hermite_polynomial_h: all outputs are nans! n = 272, dtype=torch.float64
hermite_polynomial_he: all outputs are nans! n = 304, dtype=torch.float64
```

Surely, it makes sense to increase the limit as a safety margin.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141955
Approved by: https://github.com/malfet
2024-12-04 18:21:44 +00:00
c47dae8646 [functional autograd] refactor CopyBackward to be functional (#141719)
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141719
Approved by: https://github.com/soulitzer
ghstack dependencies: #141278, #141348
2024-12-04 18:06:31 +00:00
215f5d77b5 [functional autograd] Refactor validate_outputs into a functional variant (#141348)
Today, validate_outputs is stateful (it depends on the autograd graph).
This PR refactors it into a stateless form that just depends on
InputMetadata.

Test Plan:
- new unittest
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141348
Approved by: https://github.com/soulitzer
ghstack dependencies: #141278
2024-12-04 18:06:31 +00:00
2b4f1f4990 [functional autograd] Refactor built-in autograd nodes into functional variants (#141278)
This PR refactors all builtin autograd nodes (e.g. MulBackward0) from
having a single MulBackward0::apply into having:
- a "pure function variant" `MulBackward0_apply_functional`
- a stateful variant MulBackward0::apply that ends up calling
  `MulBackward0_apply_functional`.

In order to do this we left the stateful pieces in MulBackward0::apply
(like unpacking of saved vars, determining which gradients actually need
computing).
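
A hedged Python sketch of the split (the real code is generated C++; the saved-variable attribute names below are illustrative):

```python
# Pure-function variant: no node state, everything passed in explicitly.
def MulBackward0_apply_functional(grad_outputs, saved_self, saved_other, needs_input_grad):
    (grad_out,) = grad_outputs
    grad_self = grad_out * saved_other if needs_input_grad[0] else None
    grad_other = grad_out * saved_self if needs_input_grad[1] else None
    return grad_self, grad_other

class MulBackward0:
    def apply(self, grad_outputs):
        # Stateful pieces stay here: unpack saved variables and decide
        # which gradients actually need computing.
        return MulBackward0_apply_functional(
            grad_outputs, self.saved_self, self.saved_other, self.needs_input_grad
        )
```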

The motivation is that this will be useful for compiled autograd in a
future PR. We might refactor this more later, but I wanted to get
something reviewed, shipped, and tested in-tree because the entire stack
is going to be big and this change by itself might have subtle perf issues.

The new codegen looks like the following:
- https://gist.github.com/zou3519/84721cfbef71bb640ddf1a64ef8583a3

Here's the old codegen for comparison:
- https://gist.github.com/zou3519/73f925fe6aca6dd3ceb0a6e6fcf5f77d

Test Plan:
- existing tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141278
Approved by: https://github.com/soulitzer
2024-12-04 18:06:31 +00:00
fd35be2fd3 TritonTemplate dtype fixes (#141991)
- Set the dtype of "acc" appropriately so that epilogue fusion will have args with dtype
- Update dtype propagation to use `type_to_dtype` instead of instantiating tensor
- Throw if we have a string arg where we should have a proper CSEVariable, unless we're doing the Modification Subgraph thing which is nyi. everything else is appropriately typed (cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov @drisspg ).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141991
Approved by: https://github.com/drisspg
ghstack dependencies: #139945, #140057, #141495, #141882
2024-12-04 17:24:23 +00:00
920e4364b7 [BE] Remove "$PACKAGE_TYPE" == 'conda' logic from build scripts (#142019)
Please see: https://github.com/pytorch/pytorch/issues/138506
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142019
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-12-04 16:05:43 +00:00
0582b32f6c Enable Extension Support (#142028)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142028
Approved by: https://github.com/ezyang, https://github.com/eqy
2024-12-04 15:54:06 +00:00
38d10a1b17 Revert "[Inductor] Represent tiling as a dict (#141751)"
This reverts commit 5deca07c0dcf1482eba99bf93b805cf1cc41ad6c.

Reverted https://github.com/pytorch/pytorch/pull/141751 on behalf of https://github.com/atalman due to Failing internal builds ([comment](https://github.com/pytorch/pytorch/pull/141751#issuecomment-2517815899))
2024-12-04 15:43:16 +00:00
7830c213d7 [FlexAttention] Fix max-autotune bug with captured buffer grads (#141531)
# Summary
Fix tensor argument ordering for autotuning flex attention, and change how we enable scatter codegen for triton. We used to go through the existing store_output triton codegen, but now we just short-circuit and generate the correct expression earlier on.

This lets us reuse the exact same mutated-buffer infra as we did for multiple outputs before, instead of relying on arg.python_defs to thread arguments through via input_buffers.

Test cases added for both default and max-autotune-no-cudagraphs modes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141531
Approved by: https://github.com/Chillee
2024-12-04 14:56:58 +00:00
6ad422d778 set_linter finds and replaces built-in set in Python code (#138454)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138454
Approved by: https://github.com/eellison
2024-12-04 14:31:24 +00:00
7666c8263a [REFACTOR] Inline FxGraphCache.post_compile into sole call site (#141877)
I am going to break apart the arguments passed to the constituents
to only pass exactly what is needed, so easy access to the insides
is helpful here.

This also moves two helper functions to output_code.py as well.

Also set _boxed_call at constructor.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141877
Approved by: https://github.com/jamesjwu, https://github.com/jansel

Co-authored-by: James Wu <jjwu@meta.com>
2024-12-04 14:21:04 +00:00
f85e238186 [aotd] capture rrelu_with_noise noise mutation in compile (#141867)
Rebase-copy of the long-standing, already-approved PR https://github.com/pytorch/pytorch/pull/138503 that was blocked from landing by xla build issues.

Got a new PR with the same content (ghstack checkout was failing due to changed submodules).

Corresponding xla PR:
https://github.com/pytorch/xla/pull/8363

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141867
Approved by: https://github.com/bdhirsh
2024-12-04 12:18:58 +00:00
61dc5e9c0a Enforce contiguity for alltoall (#141816)
Summary: We cannot relax the alltoall contiguity requirement, as doing so leads to wrong results.

Test Plan: Added a test.

Differential Revision: D66560930

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141816
Approved by: https://github.com/Skylion007, https://github.com/kwen2501, https://github.com/fduwjj, https://github.com/fegin, https://github.com/yoyoyocmu
2024-12-04 10:17:39 +00:00
eff99a4b4b fix linalg.SVD docs typo: wrong V* shape in reduced SVD (#142037)
https://en.wikipedia.org/wiki/Singular_value_decomposition#Reduced_SVDs

In the reduced SVD:
V* shape is (k, n)
V shape is (n, k)

In the docs it was wrong.
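
A quick shape check with torch.linalg.svd (reduced SVD via full_matrices=False), taking m = 3, n = 5, k = min(m, n) = 3:

```python
import torch

A = torch.randn(3, 5)                                # m = 3, n = 5, k = 3
U, S, Vh = torch.linalg.svd(A, full_matrices=False)
print(U.shape, S.shape, Vh.shape)                    # (3, 3), (3,), (3, 5): Vh (= V*) is (k, n)
print(Vh.mH.shape)                                   # (5, 3): V is (n, k)
```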

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142037
Approved by: https://github.com/lezcano
2024-12-04 09:18:33 +00:00
16676fd17b Disable unused ARM SME to reduce android app binary size (#141942)
Summary: ARM SME kernels aren't currently used, so disable their build to reduce binary size.

Reviewed By: digantdesai

Differential Revision: D66336599

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141942
Approved by: https://github.com/digantdesai
2024-12-04 07:24:50 +00:00
9dffd12f90 Upgrade ROCm wheels to manylinux2_28 - 2 of 2 (binaries) (#141423)
Depends on https://github.com/pytorch/pytorch/pull/140681 and https://github.com/pytorch/pytorch/pull/141609

Highlights:
* Upgrade binaries to ROCm6.2.4 to use latest docker images
* Remove pre-cxx11 builds for libtorch on ROCm
* Use manylinux2_28 docker images for ROCm
* Set `DESIRED_DEVTOOLSET=cxx-abi` (and hence `_GLIBCXX_USE_CXX11_ABI=1`) for ROCm manylinux2_28 wheels (ROCm RHEL8 packages also have GCC_ABI=1, so it keeps it consistent)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141423
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Co-authored-by: Pruthvi Madugundu <pruthvigithub@gmail.com>
2024-12-04 07:00:25 +00:00
c0e1fc4919 Avoid casting low precision inputs to high precision for XPU Tensor in torch.linalg.vector_norm (#141954)
Fixes https://github.com/pytorch/pytorch/issues/141953

For mixed-precision cases, tensors on the CPU device cast their inputs to `out_dtype`, while tensors on CUDA devices do not, for computational efficiency. For Intel XPU tensors, low-precision inputs should likewise not be converted to high precision (same as CUDA).
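
An illustrative call (sketch): a low-precision input with an explicit higher-precision output dtype; after this change XPU follows the CUDA path rather than the CPU one.

```python
import torch

x = torch.randn(1024, dtype=torch.float16)  # place on "xpu" / "cuda" if available
norm = torch.linalg.vector_norm(x, ord=2, dtype=torch.float32)
```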

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141954
Approved by: https://github.com/guangyey, https://github.com/ezyang
2024-12-04 06:44:19 +00:00
75d57b04ec [Dynamo] Replace torch._dynamo.optimize() with torch.compile() [9/N] (#140933)
related commits:

- #139706
- #140238
- #140247
- #140253
- #140663
- #140688
- #140922
- #140924
- #140933

> This is the last one

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140933
Approved by: https://github.com/ezyang
2024-12-04 06:28:08 +00:00
d6481333ad [MPS] Add scatter_reduce.two (#141948)
This op, which has been requested 20+ times on https://github.com/pytorch/pytorch/issues/77764, is just a flavor of the out-of-box scatter-reduce, so all it does is redispatch to the existing implementation.
Unsupported dtype/reduction type combinations:
 - min/max for int64
 - min/max for int32 on MacOS-14 or older

The following Swift code demonstrates the problem with the scatterAlongAxis MPS call:
```swift
import Metal
import MetalPerformanceShadersGraph

func scatterMPS(device: MTLDevice,
                inp_buf: MTLBuffer, upd_buf: MTLBuffer,
                idx_buf: MTLBuffer, out_buf: MTLBuffer,
                inp_elem: Int, upd_elem: Int) {
  let graph = MPSGraph()
  let inputPlaceholder = graph.placeholder(shape: [inp_elem as NSNumber], dataType: .int64, name: nil)
  let updatesPlaceholder = graph.placeholder(shape: [upd_elem as NSNumber], dataType: .int64, name: nil)
  let indicesPlaceholder = graph.placeholder(shape: [upd_elem as NSNumber], dataType: .int64, name: nil)
  let outNode = graph.scatterAlongAxis(0, data: inputPlaceholder, updates: updatesPlaceholder, indices: indicesPlaceholder, mode: .min, name: nil)
  let mpsInputBuffer = MPSGraphTensorData(inp_buf, shape: [inp_elem as NSNumber], dataType: .int64)
  let mpsUpdatesBuffer = MPSGraphTensorData(upd_buf, shape: [upd_elem as NSNumber], dataType: .int64)
  let mpsIndicesBuffer = MPSGraphTensorData(idx_buf, shape: [upd_elem as NSNumber], dataType: .int64)
  let mpsOutputBuffer = MPSGraphTensorData(out_buf, shape: [inp_elem as NSNumber], dataType: .int64)
  guard let queue = device.makeCommandQueue() else { fatalError("Can't make queue") }
  graph.run(with: queue, feeds: [inputPlaceholder: mpsInputBuffer,
                               updatesPlaceholder: mpsUpdatesBuffer,
                               indicesPlaceholder: mpsIndicesBuffer ],
            targetOperations: nil, resultsDictionary: [outNode: mpsOutputBuffer])
}

func makeBufferWithValues(device: MTLDevice, values: [Int64]) -> MTLBuffer {
  guard let buf = device.makeBuffer(length: values.count * MemoryLayout<Int64>.size, options: [.storageModeShared]) else { fatalError("Can't alloc") }
  let buf_data = buf.contents().assumingMemoryBound(to: Int64.self)
  for i in 0..<values.count {
    buf_data[i] = values[i]
  }
  return buf
}

guard let device = MTLCopyAllDevices().first else { fatalError("Not Metal device found") }
print("Using device \(device.name)")

let inp_elem = 4
let upd_elem = 4
let inp_buf = makeBufferWithValues(device: device, values: [1, 2, 3, 4])
let upd_buf = makeBufferWithValues(device: device, values: [Int64.max - 1, Int64.max - 2 , Int64.max >> 16 , 11])
let idx_buf = makeBufferWithValues(device: device, values: [0, 1, 2, 3])
guard let out_buf = device.makeBuffer(length:inp_elem * MemoryLayout<Int64>.size, options: [.storageModeShared]) else { fatalError("Can't alloc") }

scatterMPS(device: device,
           inp_buf: inp_buf, upd_buf: upd_buf,
           idx_buf: idx_buf, out_buf: out_buf,
           inp_elem: inp_elem, upd_elem: upd_elem)

let obuf_data = out_buf.contents().assumingMemoryBound(to: Int64.self)
for i in 0..<inp_elem {
    print("out_buf[\(i)] = \(obuf_data[i])")
}
```
which prints `4294967294, 4294967293, 4294967295, 4` instead of the expected `1, 2, 3, 4`,
whereas `torch.tensor([[1, 9223372036854775806], [2, 9223372036854775805], [3, 140737488355327], [4, 11]], dtype=torch.int64, device='mps').max(1)` yields the expected results.
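
For completeness, an example of the newly dispatched op from the Python side (sketch; requires a build with MPS available):

```python
import torch

if torch.backends.mps.is_available():
    src = torch.tensor([1.0, 2.0, 3.0, 4.0], device="mps")
    index = torch.tensor([0, 1, 0, 1], device="mps")
    out = torch.zeros(2, device="mps")
    # scatter_reduce.two overload: reduce along dim 0 with "sum"
    print(out.scatter_reduce(0, index, src, reduce="sum"))  # expected: tensor([4., 6.])
```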
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141948
Approved by: https://github.com/manuelcandales
2024-12-04 04:56:43 +00:00
deffbbdb91 Update submodule ideep for pd cache changes (#141555)
Fixes https://github.com/pytorch/pytorch/issues/141327.
Fixes https://github.com/pytorch/pytorch/issues/141328.
Fixes https://github.com/pytorch/pytorch/issues/141329.
Fixes https://github.com/pytorch/pytorch/issues/141330.
Fixes https://github.com/pytorch/pytorch/issues/141331.

Summary:
1. Modify to_bytes function to include binary_src shape information into the keys of pd cache.
2. Modify inner_product_forward to support broadcast add fusion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141555
Approved by: https://github.com/jgong5
2024-12-04 04:55:33 +00:00
e8e65764d1 [pipelining] Improve schedule csv loading (#142009)
Add small changes based on feedback from Less when testing out https://github.com/pytorch/torchtitan/pull/707
- expose `validate_schedule` as a function
- handle spaces around actions in csv file
- add error arrow to `_format_pipeline_schedule()` to better show where the step errored

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142009
Approved by: https://github.com/lessw2020
2024-12-04 04:15:34 +00:00
86f306b15e _inductor: Add dynamo_timed for async_compile.precompile and turn on (#141920)
waitcounters

This fixes some review comments from https://github.com/pytorch/pytorch/pull/141379
and gives us another dynamo_timed event for local compilation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141920
Approved by: https://github.com/masnesral
2024-12-04 04:03:46 +00:00
30d907c6fb When serializing treespec context, support enum as well (#141525)
Following https://github.com/pytorch/pytorch/pull/102716, per @angelayi's suggestion.

Note that in general enum as an input is not supported.

repro:
```
import enum
from enum import auto

import torch
import torch._inductor

class TestEnum(enum.Enum):
    A = auto()
    B = auto()

    @staticmethod
    def from_string(s):
        return TestEnum[s.upper()]

class M(torch.nn.Module):
    def forward(self, x, en):
        return x.clone()

input1 = (
    torch.rand(10, device="cuda"),
    {TestEnum.A: torch.rand(10, device="cuda")},
)
inputs = [input1]
model = M().cuda()

_ = model(*input1)

ep = torch.export.export(model, input1, strict=False)
path = torch._inductor.aot_compile(ep.module(), input1)
```

Differential Revision: D66269157
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141525
Approved by: https://github.com/angelayi
2024-12-04 03:08:50 +00:00
288b73cb14 [Redo] Set remote cache version and backend type once in compilation metrics (#141967)
(Got reverted due to a silly bug, fixed now.)

This is causing FbFxGraphRemoteCache.init to no longer be idempotent, i.e. only safe to call once per compile. AOTAutogradCache initializes a new remote cache for the forward and the backward.
Technically, we could make AOTAutogradCache smart and globally thread through a single FbFxGraphRemoteCache everywhere. But there's no reason to do so, as this class is just the handle to access the cache. Plus, it's very brittle for FbFxGraphRemoteCache to not be safe to call multiple times.

Differential Revision: [D66701970](https://our.internmc.facebook.com/intern/diff/D66701970/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141967
Approved by: https://github.com/laithsakka
2024-12-04 03:07:53 +00:00
7dfb439a2a Only write predicate once when there are multiple torch.cond (#141528)
Fixes #140606

TEST PLAN:

```
python test/inductor/test_aot_inductor.py -k cond_share
python test/inductor/test_aot_inductor_arrayref.py -k cond_share
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141528
Approved by: https://github.com/desertfire
2024-12-04 01:56:10 +00:00
cyy
bffaddf9ea Format caffe2/serialize (#141850)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141850
Approved by: https://github.com/cpuhrsch
2024-12-04 01:14:24 +00:00
941da90e8a Add macos perf run to the dashboard upload (#141999)
Adjust the inductor workflow to ensure the macOS perf run gets uploaded
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141999
Approved by: https://github.com/huydhn
2024-12-04 01:08:13 +00:00
291626fb22 [ROCm] port CK rowwise F8 from fbgemm (#140856)
This ports (copies) FBGEMM's implementation from @jwfromm.

https://github.com/pytorch/FBGEMM/tree/main/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140856
Approved by: https://github.com/drisspg, https://github.com/atalman
2024-12-04 00:32:24 +00:00
a51a048027 [AOTI][refactor] Move stack allocation related configs (#139093)
Summary: Move allow_stack_allocation and use_minimal_arrayref_interface configs into the aot_inductor subclass.

Test Plan: CI

Differential Revision: D65064301

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139093
Approved by: https://github.com/chenyang78
2024-12-04 00:15:19 +00:00
0190d929f2 [BE] Remove unused argument (#141983)
Summary: As title, the `node_filter` argument is not used.

Test Plan: CI

Differential Revision: D66712599

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141983
Approved by: https://github.com/tugsbayasgalan
2024-12-04 00:07:33 +00:00
9286c21b22 Fix fbcode tests for automatic dynamic unspecialize float (#141975)
Differential Revision: D66708552

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141975
Approved by: https://github.com/bdhirsh, https://github.com/atalman
2024-12-03 23:59:06 +00:00
20912ba582 fix incorrect c10::SymFloat::sqrt (#141728)
Fixes the silent correctness for SDPA in https://github.com/pytorch/pytorch/issues/141710

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141728
Approved by: https://github.com/Skylion007, https://github.com/ezyang, https://github.com/drisspg
ghstack dependencies: #141725
2024-12-03 23:34:16 +00:00
af3e7389ef guard on flash attention SymFloat scale instead of incorrectly casting to float (#141725)
Fixes https://github.com/pytorch/pytorch/issues/141710. Previously, if we called flash attention with a `SymFloat` scale that was properly symbolic, we would unsafely cast its raw `SymFloat._data` into a `float`, which is pretty much guaranteed to give `NaN`.

This avoids the NaNs in the linked issue, but I'm not sure if it's worth landing yet because we'll start specializing and recompiling for every distinct `scale` that is passed in (which in the dynamic shapes case, is some function of `query.size(-1)`).

The real fix would be to ensure that the flash attention (and related) ops all accept a symbolic version of the `scale`. I'm not sure if we should use `SymFloat` or `Scalar` though - more discussion in the issue

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141725
Approved by: https://github.com/ezyang
2024-12-03 23:34:16 +00:00
da5b281f23 Generate op variants for core CIA ops (#141797)
There are four core ATen ops with Composite Implicit Autograd (CIA) dispatch: upsample_bilinear2d.vec, upsample_nearest2d.vec, avg_pool1d, and adaptive_avg_pool1d. Op variant auto-generation is currently skipped for CIA ops. In preparation to disable the decompositions for upsample ops by default in export, we need to generate out variants for these ops.

This change enables autogen for core-tagged CIA ops, which enables generation of upsample_bilinear2d.vec_out and upsample_nearest2d.vec_out.

Test Plan:
Added a new test test_functional_variant_autogen_out_variant_core to cover this case in test_codegen.py.
Confirmed that upsample_bilinear2d.vec_out and upsample_nearest2d.vec_out op overloads are registered (they were previously not available).

Differential Revision: D66590257

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141797
Approved by: https://github.com/larryliu0820
2024-12-03 22:57:46 +00:00
f0b33658f8 Dont use constant mask if ynumel potentially overflows ygrids (#139751)
If (ynumel / YBLOCK)  > get_max_ygrids(), the z dimension will be used if znumel is None. However, if (ynumel / YBLOCK) % get_max_ygrids() != 0, there will be program launches with inputs that require masking, and so this needs to be considered when determining if the y dimension has a constant mask.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139751
Approved by: https://github.com/eellison

Co-authored-by: George White <georgew@graphcore.ai>
2024-12-03 22:56:18 +00:00
cc98a1b599 _inductor: Add WaitCounter for triton.compile calls. (#141379)
_inductor: Add WaitCounter for async_compile.wait calls.

This will start recording how long these async_compile.wait calls take.

Note that we want to just unify dynamo_timed in the long term.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141379
Approved by: https://github.com/oulgen, https://github.com/masnesral
2024-12-03 22:56:04 +00:00
f86a1753d1 Add option to split Linear gates for Quantizable LSTM into separate ops (#141366)

Summary:

Reattempt to land D65283170, adding pyre-fixmes / mypy ignores following D52890934

For LSTM, the input and hidden state are projected with Linear layers to construct the 4 gates. This is typically performed together as a single Linear (for each state) with output channel count `4 * hidden_dim` for efficiency.
https://www.internalfb.com/code/fbsource/[ebef7c4238aa55948b2b444044f2c8ed2040de55]/fbcode/caffe2/torch/ao/nn/quantizable/modules/rnn.py?lines=52-58
The output is then ultimately split into 4:
https://www.internalfb.com/code/fbsource/[ebef7c4238aa55948b2b444044f2c8ed2040de55]/fbcode/caffe2/torch/ao/nn/quantizable/modules/rnn.py?lines=83-87

For on-device latency (and possibly memory) considerations, we want to avoid constructing the intermediate `gates` tensor (which can be relatively large), by splitting `igates` and `hgates` first (as 4x `Linear(hidden_dim, hidden_dim)` each), applying add separately, then proceeding as usual.

This functionality can be enabled by specifying `split_gates=True` (default False is original behavior) at any entry point (directly with `torch.ao.nn.quantizable.LSTM`  or via `_get_lstm_with_individually_observed_parts`).
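
A hedged usage sketch of the new flag via the quantizable LSTM entry point mentioned above (assuming `split_gates` is exposed on the constructor as described):

```python
import torch

lstm = torch.ao.nn.quantizable.LSTM(
    input_size=16, hidden_size=32, num_layers=1, split_gates=True
)
x = torch.randn(8, 4, 16)      # (seq_len, batch, input_size), batch_first=False
out, (h, c) = lstm(x)
```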

Test Plan:
piggy back on existing test to check for correct swap handling, numerics, and jit.script during prepare/convert
```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/quantization:test_quantization -- --exact 'caffe2/test/quantization:test_quantization - test_custom_module_lstm (caffe2.test.quantization.core.test_quantized_op.TestQuantizedOps)'
```
https://www.internalfb.com/intern/testinfra/testrun/4503599884152725

This test is quite long running now (more than double original).

---

shorter test to confirm original `LSTMCell` passes
```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test:quantization_fx -- --exact 'caffe2/test:quantization_fx - test_static_lstm_with_custom_fixed_qparams (quantization.fx.test_quantize_fx.TestQuantizeFx)'
```
https://www.internalfb.com/intern/testinfra/testrun/11258999127933996

Reviewed By: Ninja91

Differential Revision: D66380336
2024-12-03 17:21:44 -05:00
80705d3abf Convert assert to torch._check in MHA (#141918)
Fixes https://github.com/pytorch/pytorch/issues/139610
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141918
Approved by: https://github.com/ezyang
2024-12-03 21:58:02 +00:00
5303af2d27 Structured compile_fx (#141505)
- Turn fx_codegen_and_compile() into a class (FxCompile) so we can override the implementation.
- Pull the current body into an implementation (_InProcessFxCompile) which just performs the existing behavior.
- Add an async interface. (See below)

The intended future behavior of Async Compile will be to allow dynamo functions to start compiling in the background (and on a separate machine) while we continue to run eager in the foreground. As such we'll need to put the compilation behind some kind of Future implementation - it makes sense to reuse the existing python futures for that.  An async function is just a syntactic way to return an asyncio.Future.

Because asyncio.run() adds confusion to the stack traces when the called function isn't actually being used in an asynchronous way we also provide a synchronous interface which can be directly called.
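
A minimal sketch of the shape described above (method names are illustrative; only FxCompile and _InProcessFxCompile come from the PR):

```python
class FxCompile:
    def codegen_and_compile(self, gm, example_inputs, **kwargs):
        raise NotImplementedError

    async def codegen_and_compile_async(self, gm, example_inputs, **kwargs):
        # async interface wrapping the synchronous path; a future backend
        # could instead await work running on another process or machine
        return self.codegen_and_compile(gm, example_inputs, **kwargs)

class _InProcessFxCompile(FxCompile):
    def codegen_and_compile(self, gm, example_inputs, **kwargs):
        ...  # existing in-process fx_codegen_and_compile body
```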

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141505
Approved by: https://github.com/ezyang
ghstack dependencies: #141502
2024-12-03 21:27:32 +00:00
02147fe0f9 codecache: pull out some Graph serialization code into common helpers (#141502)
Moved some code from FxGraphCache.lookup_graph() which dealt with serializing and deserializing CompiledFxGraph into CompiledFxGraph itself so it can be reused later by Async Compile.

Async Compile will need to serialize the compiled CompiledFxGraph from one process and deserialize it in another - so it's very similar to the cache.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141502
Approved by: https://github.com/ezyang
2024-12-03 21:27:32 +00:00
8e9873d0a3 Allow attribute mutation for MutableMappingVariable (#141376)
Fixes #141375

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141376
Approved by: https://github.com/vmoens
2024-12-03 21:00:10 +00:00
b4ea913978 Check /var/lib/jenkins/workspace exists before setting permissions (#141767)
Currently, if you run these CI scripts in a non-jenkins environment, they fail due to the folder not existing. This ensures the CI scripts can be re-used in different runners.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141767
Approved by: https://github.com/seemethere, https://github.com/malfet
2024-12-03 20:56:20 +00:00
cyy
7c1d5db1f3 [2/N] Enable UBSAN tests (#141740)
Apply c10::load in more places. The function was introduced to cast a byte to valid boolean values, thus fixing the UBSAN errors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141740
Approved by: https://github.com/ezyang
2024-12-03 20:52:26 +00:00
28efc17d2c [pytorch/profiler] Honor escape quotes arg in a profiler metadata log formatter (#141527) (#141626)
Summary:

We were ignoring the with_escaped_quotes param in the format_list inline function in utils.cpp in the case where we had to truncate a list of more than kTruncatelength items.

In that case we would truncate the list into a string but always return it wrapped in escaped quotes. This causes issues if the string is meant to be added to other lists which also go through formatting, leading to cases like `"["[a, b, c, ...]"]"`.

Now the above will be correctly formatted as `"[[a, b, c, ...]]"`, as the escaped-quote requests are honored.

Differential Revision: D66521676

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141626
Approved by: https://github.com/sraikund16
2024-12-03 20:42:57 +00:00
78e53a92c3 Remove monkeypatch of has_frozen_params in test/inductor/test_codecache.py (#141898)
Summary: This particular test isn't really needed since the code path is already exercised in `test_freezing`. While I was here, I beefed up testing in that method to consider whether the frozen parameter is inlinable vs. not, since the caching behavior is different.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141898
Approved by: https://github.com/ezyang, https://github.com/jansel
2024-12-03 20:38:10 +00:00
42547f8d48 Add support for blackwell codegen (#141724)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141724
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/eqy
2024-12-03 20:34:43 +00:00
8b0fcad0fd [AOTInductor] Add update_constant_buffer pybind support (#140755)
Summary: We add update_constant_buffer python support for testing purpose.

Test Plan: Included in commit

Differential Revision: D65968613

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140755
Approved by: https://github.com/22quinn
2024-12-03 20:34:25 +00:00
e5f5283ab2 Fix cuda arch full version for 12.6 (#141976)
follow up for https://github.com/pytorch/pytorch/pull/141433/files
The build is still showing up as 12.6.2 in the name; see the latest https://github.com/pytorch/pytorch/actions/runs/12134985224/job/33833276884.

related to https://github.com/pytorch/pytorch/issues/138440

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141976
Approved by: https://github.com/atalman, https://github.com/nWEIdia, https://github.com/Skylion007
2024-12-03 20:33:01 +00:00
f472b3aee1 improve typings around torch.export (#141829)
This is another follow-up to https://github.com/pytorch/pytorch/pull/115074 / https://github.com/pytorch/pytorch/pull/141240 following the strategy discussed there (https://github.com/pytorch/pytorch/pull/115074#issuecomment-2480992230).

This PR improves the type annotations around `torch._export`. Even though the PR introduces a few runtime type asserts, the runtime behavior should stay equivalent, because the failed assertions should have been immediate crashes anyway.

CC @Skylion007 @ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141829
Approved by: https://github.com/ezyang
2024-12-03 19:57:21 +00:00
43c5f59190 flip capture_autograd_function to default to true and warn if false (#141972)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141972
Approved by: https://github.com/zou3519
ghstack dependencies: #141932
2024-12-03 19:50:14 +00:00
96a35716d1 [aoti] Improve OSSProxyExecutor error messages (#141501)
For debugging issues like https://fb.workplace.com/groups/1028545332188949/permalink/1092584242451724/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141501
Approved by: https://github.com/henrylhtsang
2024-12-03 19:32:49 +00:00
6b620423a3 dynamo_timed: Add a log_waitcounter option. (#141402)
This logs a waitcounter of the name pytorch.dynamo_timed.{key}.

Primarily sending this now to make sure everyone likes the API, then
I'll add tests, and migrate one dynamo_timed to use it. (likely starting
with
https://github.com/pytorch/pytorch/pull/141379).

Testing is a bit harder, since we don't normally have any way to read
_WaitCounter state AFAICT. I want to poke around and see if I can figure
out a way to read the state, otherwise I'll just mock it to at least
make sure it's mostly working.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141402
Approved by: https://github.com/jamesjwu, https://github.com/masnesral
2024-12-03 19:24:29 +00:00
d35358b271 [FlexAttention] Remove failing num_warps=8 in bwds (#141653)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141653
Approved by: https://github.com/BoyuanFeng
2024-12-03 19:22:52 +00:00
9125e9119c Fix memory leak in ModuleTracker (#141960)
Thanks @drisspg and @albanD for finding the fix

**TEST PLAN**
```
import gc
import torch
import torch.nn as nn
from torch.utils.module_tracker import ModuleTracker

class MyModel(nn.Module):
    def forward(self, x):
        return x * x

print(f"torch=={torch.__version__}")
m = MyModel()
m.cuda()
m.to(torch.bfloat16)
mt = ModuleTracker()
for i in range(1000):
    if i % 100 == 0:
        gc.collect()
        print("memory_allocated:", torch.cuda.memory_allocated())
    x = torch.randn([128, 256], device="cuda", dtype=torch.bfloat16, requires_grad=True)
    with mt:
        m(x)

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141960
Approved by: https://github.com/albanD
2024-12-03 18:36:15 +00:00
7bb2228ffd Test cpp_wrapper_hipify string comparison (#141353)
Updating the test to match this code that takes device warpsize into account: cf1d95a965/torch/_inductor/codegen/cuda/device_op_overrides.py (L120)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141353
Approved by: https://github.com/desertfire
2024-12-03 18:25:32 +00:00
8b5c26287d Initialize lr as a tensor if it is originally a tensor (#141620)
Fix https://github.com/pytorch/pytorch/issues/139575

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141620
Approved by: https://github.com/kwen2501
2024-12-03 18:10:23 +00:00
314e08eb52 [fr_trace][bugfix] Log missing ranks when provided (#141924)
Summary: For missing ranks issues, `build_collectives` doesn't log any errors (5c2584a14c/tools/flight_recorder/components/builder.py (L293C23-L306C24)), which means that when `EntryState.to_collective` is called [here](5c2584a14c/tools/flight_recorder/components/builder.py (L400C21-L405C22)), errors will be empty and `to_collective` will enter the first if statement. But that codepath doesn't log `missing_ranks`, meaning it will be absent from the `Collective` returned. This diff fixes that oversight.

Test Plan:
eyes

Sandcastle run

Differential Revision: D66679224

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141924
Approved by: https://github.com/c-p-i-o
2024-12-03 17:54:43 +00:00
5c59f4a55a Remove old FSDP1 fully_shard (#141875)
FSDP1's `fully_shard` frontend was an exploration at the end of 2022 H2 as part of the `torch/distributed/_composable` APIs to avoid `nn.Module` wrappers. It calls into the same backend code as FSDP1's `FullyShardedDataParallel`.

The API did not gain traction internally, so we instead reused the name `fully_shard` for FSDP2, which similarly is not an `nn.Module` wrapper and follows similar design principles as FSDP1's `fully_shard`.

To the best of our knowledge, we have removed all instances of FSDP1's `fully_shard` internally, and we put the deprecation warning in open source in 2.4 saying it will be removed after 2.5 (which is now):
4959784dac/torch/distributed/_composable/fully_shard.py (L40-L48)

We are skipping the PR sanity check because this PR is only removing code, not adding new code, and should not require this sanity check.

Differential Revision: [D66664988](https://our.internmc.facebook.com/intern/diff/D66664988)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141875
Approved by: https://github.com/weifengpy
2024-12-03 17:00:47 +00:00
ed4831b93c Improve torch.library.opcheck and register_autograd docs (#141883)
Fixes https://github.com/pytorch/pytorch/issues/141618
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141883
Approved by: https://github.com/albanD
ghstack dependencies: #141894, #141880
2024-12-03 16:28:56 +00:00
827c322290 Make torch.library.triton_op public (#141880)
We've been using it privately for half a year and everything's been
good. This PR:
1. Makes torch.library.triton_op public
2. Renames capture_triton -> wrap_triton. We got feedback that no one
   knew what "capture triton" does.
3. Makes torch.library.wrap_triton public.

triton_op is used to construct a Python custom operator that may call 1+
triton kernels. Each of those triton kernels must be annotated with
wrap_triton.
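
For reference, a minimal sketch of that usage pattern (following the documented `triton_op`/`wrap_triton` API; `mylib::add` and the kernel are illustrative, exact signatures may differ):
```python
import torch
import triton
import triton.language as tl
from torch.library import triton_op, wrap_triton

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    tl.store(out_ptr + offsets,
             tl.load(x_ptr + offsets, mask=mask) + tl.load(y_ptr + offsets, mask=mask),
             mask=mask)

@triton_op("mylib::add", mutates_args={})
def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n_elements = out.numel()
    # Every triton kernel invoked inside a triton_op goes through wrap_triton.
    wrap_triton(add_kernel)[(triton.cdiv(n_elements, 1024),)](x, y, out, n_elements, BLOCK_SIZE=1024)
    return out
```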

Test Plan:
- existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141880
Approved by: https://github.com/albanD
ghstack dependencies: #141894
2024-12-03 16:28:56 +00:00
ac600fdce6 Type exposed_in decorator (#141894)
Test Plan:
- lintrunner
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141894
Approved by: https://github.com/albanD
2024-12-03 16:28:17 +00:00
7a806a839d [FP8] Expand MaskedSelect to float8 (#141928)
Needed for printing float8 tensors.
Though I wonder if a better solution would be to change those ops to use element size rather than the actual dtype (which would extend them automatically to unsigned integral types as well).
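
A small sketch of what this enables (assuming a build with float8 support):
```python
import torch

t = torch.randn(4).to(torch.float8_e4m3fn)
mask = torch.tensor([True, False, True, False])
# Printing float8 tensors goes through masked_select, which now supports them.
print(torch.masked_select(t, mask))
```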

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141928
Approved by: https://github.com/ezyang, https://github.com/jgong5
2024-12-03 15:14:26 +00:00
78543e6002 [dynamo][pytree][1/N] make CXX pytree traceable: tree_iter / tree_leaves (#137397)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137397
Approved by: https://github.com/jansel
2024-12-03 11:17:39 +00:00
9990b47ea3 [inductor][pattern matcher] revise mkldnn pattern matcher UT (#141334)
Fixes #139970, #139812.

Revise mkldnn pattern matcher UTs, to check the relevant specific matched patterns instead of the total matched number.
1) Add the missing specific counters in pattern matchers, e.g. `mkldnn_unary_fusion_matcher_nodes`/`mkldnn_conv_weight_pack_matcher_count`.
2) In UTs, change the general `matcher_count`/`matcher_nodes` checks to the specific ones, e.g. `mkldnn_unary_fusion_matcher_nodes`/`mkldnn_conv_weight_pack_matcher_count`.
3) In UTs, remove the option of `matcher_count`/`matcher_nodes` params in _test_common and make `matcher_check_fn` a necessary param.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141334
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jansel
2024-12-03 09:26:43 +00:00
ff73e2e679 [dynamo] Validate mutation_type and source in VariableTracker.__init__ (#141717)
As title; this also uncovered a few invalid use cases. The cases that
cause errors are fixed in separate patches prior to this one, and the
rest are fixed in this patch.

This patch also moves a few `.source` mutations to variable construction,
to increase the coverage of the validation.

Fixes #133027.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141717
Approved by: https://github.com/jansel
ghstack dependencies: #141713, #141714, #141715, #141902, #141716
2024-12-03 09:18:06 +00:00
0efd184685 [dynamo] Fix side effects for range iterator that escapes the graph (#141716)
`wrap_range_iterator` mistakenly used `ValueMutationNew`, when it
should've used `ValueMutationExisting`, because this code path always
has a source.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141716
Approved by: https://github.com/jansel
ghstack dependencies: #141713, #141714, #141715, #141902
2024-12-03 09:18:06 +00:00
7c3c8a662e [dynamo] Add RANGE_ITERATOR_MATCH to properly guard on range iterators (#141902)
A subsequent patch attempts to fix a side-effect issue for range
iterators, which in turn exposed an existing issue in the guards for range
iterators -- the following test started failing:
```
PYTORCH_TEST_WITH_DYNAMO=1 python test/test_tensor_creation_ops.py TestTensorCreationCPU.test_hstack_column_stack_cpu_int16
```

This patch adds a `RANGE_ITERATOR_MATCH` guard to make sure that we
properly guard on range iterators, and adds a regression test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141902
Approved by: https://github.com/jansel
ghstack dependencies: #141713, #141714, #141715
2024-12-03 09:18:06 +00:00
ff3f4a164c [dynamo] Fix aliasing issue for dict.copy that escapes the graph (#141715)
Dynamo accidentally passed the original `ConstDictVariable.source` to
the result of `dict.copy(...)`, which caused aliasing issue when the
result escapes the graph (e.g., is a return value).

This patch fixes that and adds a regression test.
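
A sketch of the failure mode (not the exact regression test added here): the dict returned from the compiled region must be a distinct object, not an alias of the input.
```python
import torch

@torch.compile(backend="eager")
def f(d):
    c = d.copy()
    c["x"] = 1
    return c

d = {"y": 0}
out = f(d)
assert out is not d      # previously the copy could alias the input dict
assert "x" not in d      # so mutating the copy leaked into the original
```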

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141715
Approved by: https://github.com/jansel
ghstack dependencies: #141713, #141714
2024-12-03 09:18:06 +00:00
9eb0520d75 [dynamo] Fix side-effect handling for pre-existing collections.deque (#141714)
Previously we never replayed side effects to `DequeVariable` with a
source; the bug was already in the `test_deque_input` test, but went
unnoticed because we didn't check the deque objects.

This patch adds limited but practical support for this (see comments in
`side_effects.py` for why limited), and updates the deque tests to check
for this.
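
A sketch of the behavior being covered (not the exact test): mutations to a pre-existing deque inside the compiled region should be replayed onto the original object.
```python
import collections
import torch

@torch.compile(backend="eager")
def f(d, x):
    d.append(x.sin())
    return x.cos()

dq = collections.deque([torch.zeros(1)])
f(dq, torch.ones(1))
assert len(dq) == 2  # the append is visible on the original deque
```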

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141714
Approved by: https://github.com/jansel
ghstack dependencies: #141713
2024-12-03 09:18:06 +00:00
f2ce2d435b [dynamo] Add test for returning a nested recursive function and update documentation (#141713)
Addresses https://github.com/pytorch/pytorch/pull/137905#discussion_r1806923085.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141713
Approved by: https://github.com/jansel
2024-12-03 09:18:06 +00:00
f8a64c324e Broadcast constants on vectorised stores in CppTile2DKernel (#140262)
Currently constants are not broadcasted on vectorised stores in `CppTile2DKernel`. This leads to errors like the following:
```shell
error:: request for member 'store' in 'tmp1', which is of non-class type 'signed char'
   61 |                                 tmp1.store(tmp2 + static_cast<int64_t>(8L*x0_inner), static_cast<int64_t>(8));
      |                                           ^~~~~
```
This PR adds the required broadcasting.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140262
Approved by: https://github.com/jgong5
2024-12-03 09:15:17 +00:00
e1e3bbc2e1 Set capture_autograd_function=False by default (#141932)
https://github.com/pytorch/pytorch/pull/136959 cleaned up the flag and added a warning. @Chillee pointed out that we should really default this flag to false, otherwise we subject all users who go down this code path to log spew.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141932
Approved by: https://github.com/jansel
2024-12-03 07:59:03 +00:00
e499b46465 Speed up half tensors printing (#141927)
This PR removes the copy-cast of reduced-precision types to float before printing, which was added in https://github.com/pytorch/pytorch/pull/14418, probably to unblock printing back when many operations, like `is_nan` and `max`, were not supported for those types on CPU.

(Reusing old test plan) Before the PR:
```python
In [1]: import torch; a = torch.rand(1, 1700, 34, 50, dtype=torch.float16)

In [2]: %timeit str(a)
621 μs ± 5.06 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```

after the PR
```python
In [1]: import torch; a = torch.rand(1, 1700, 34, 50, dtype=torch.float16)

In [2]: %timeit str(a)
449 μs ± 2.34 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```

Also, this allows printing 15GB Metal tensors on a 32GB Mac machine:
```
% python3 -c "import torch;print(torch.empty(72250,72250, device='mps', dtype=torch.float16))"
tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]], device='mps:0', dtype=torch.float16)
```

Before this change it failed with a non-descriptive error:
```
% python3 -c "import torch;print(torch.empty(72250,72250, device='mps', dtype=torch.float16))"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
    import torch;print(torch.empty(72250,72250, device='mps', dtype=torch.float16))
                 ~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/malfet/git/pytorch/pytorch/torch/_tensor.py", line 568, in __repr__
    return torch._tensor_str._str(self, tensor_contents=tensor_contents)
           ~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/malfet/git/pytorch/pytorch/torch/_tensor_str.py", line 708, in _str
    return _str_intern(self, tensor_contents=tensor_contents)
  File "/Users/malfet/git/pytorch/pytorch/torch/_tensor_str.py", line 625, in _str_intern
    tensor_str = _tensor_str(self, indent)
  File "/Users/malfet/git/pytorch/pytorch/torch/_tensor_str.py", line 339, in _tensor_str
    self = self.float()
RuntimeError: Invalid buffer size: 19.45 GB
```

Convert fp8 dtypes to float16, as the full float range is overkill.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141927
Approved by: https://github.com/ezyang
2024-12-03 07:01:49 +00:00
d035db3d86 [AMD] [submodule] aten.bmm CK-backend prototype (#140758)
Summary:
Early prototype of adding CK backend for aten.bmm. Currently, it is very limited in that:

1. BF16 only
2. A single CK instance
3. NT layout only
4. Alpha=1, Beta=0 only

Reviewed By: xw285cornell, zjing14

Differential Revision: D65954695

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140758
Approved by: https://github.com/bradleyhd
2024-12-03 06:54:51 +00:00
6afcec0c58 Assert is GraphModule in compile_fx_aot (#141575)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141575
Approved by: https://github.com/Skylion007, https://github.com/desertfire
2024-12-03 05:39:44 +00:00
ce86119503 Revert "Set remote cache version and backend type once in compilation metrics (#141707)"
This reverts commit d633cf1f55f87e5536f63981357d543ac46e48f7.

Reverted https://github.com/pytorch/pytorch/pull/141707 on behalf of https://github.com/malfet due to It breaks tests by referencing FbRemoteFxGraphCache, but CI was green ([comment](https://github.com/pytorch/pytorch/pull/141707#issuecomment-2513555185))
2024-12-03 05:01:02 +00:00
2999dbfd21 Revert "[REFACTOR] Inline FxGraphCache.post_compile into sole call site (#141877)"
This reverts commit 3ab4a28eaa7dc67d5c46c2016bbfe9932b36de06.

Reverted https://github.com/pytorch/pytorch/pull/141877 on behalf of https://github.com/huydhn due to Job are failing en masse after this lands, so it looks like a land race ([comment](https://github.com/pytorch/pytorch/pull/141877#issuecomment-2513552752))
2024-12-03 04:57:58 +00:00
38bbe37187 Enable CI on SM89 (#140305)
Using EC2 G6 instance, based on NVIDIA L4, added to scale config in https://github.com/pytorch/test-infra/pull/5376

To enable more balanced sharding, we had to push 148ae19935

Added `@xfailIfSM89` to the following tests:
 - test_fp8_pattern_2
 - test_original_aten_preserved_split_addmm
 - test_sparse_semi_structured_scaled_mm
 - test_sparse_semi_structured_scaled_mm_fp8
 - test_sparse_fp8fp8_mm

Increased tolerance to 2e-4 for `RNNTest.BidirectionalMultilayerGRU_CPU_vs_CUDA`

Skipped the following inductor tests (which either OOM flakily or time out):
 - test_reduction_fn_std_float64
 - test_reduction_fn_var_mean_float64
 - test_multi_output_unbacked_custom_op

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140305
Approved by: https://github.com/wdvr, https://github.com/ZainRizvi
2024-12-03 04:49:46 +00:00
af88326250 Ensure that BlockMask length must always exactly match the sequence length in flex_attention (#141625)
Fixes https://github.com/pytorch/pytorch/issues/141435
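
For illustration, a minimal sketch of building a BlockMask whose length exactly matches the sequence length (assumes a CUDA device; shapes and the mask_mod are illustrative):
```python
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

B, H, S, D = 1, 2, 128, 16
q, k, v = (torch.randn(B, H, S, D, device="cuda") for _ in range(3))

# Q_LEN and KV_LEN must now match the actual sequence length S exactly.
block_mask = create_block_mask(causal, B, H, S, S, device="cuda")
out = flex_attention(q, k, v, block_mask=block_mask)
```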

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141625
Approved by: https://github.com/drisspg
ghstack dependencies: #138788
2024-12-03 04:45:05 +00:00
9cfc9e636d [while_loop] change to guard_equals for checking output and carry (#141734)
Inputs with the same shape can be represented with different symbols, e.g.
```python
def body_fn(a, b):
  return b.sin(), a.sin()
```
, where a = torch.randn(3, 4) and b = torch.randn(3, 4). There could be 4 symbols allocated for a and b. So instead of requiring that their shape and stride symbols be identical, we just use guard_equals to enforce the constraint.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141734
Approved by: https://github.com/zou3519, https://github.com/eellison
2024-12-03 04:00:21 +00:00
871b93bc59 [associative_scan] Fixing shape checks (#141698)
This PR fixes the shape checks that are done in the associative_scan operation.
Before, all shapes of the input leaves were required to be the same. With this PR, only the shapes of the combine_fn outputs and the corresponding input leaves need to match; the input leaves no longer need to match each other.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141698
Approved by: https://github.com/ydwu4
2024-12-03 03:49:11 +00:00
3ab4a28eaa [REFACTOR] Inline FxGraphCache.post_compile into sole call site (#141877)
I am going to break apart the arguments passed to the constituents
to only pass exactly what is needed, so easy access to the insides
is helpful here.

This also moves two helper functions to output_code.py as well.

Also set _boxed_call at constructor.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141877
Approved by: https://github.com/jamesjwu, https://github.com/jansel

Co-authored-by: James Wu <jjwu@meta.com>
2024-12-03 03:48:23 +00:00
ecbb8a8800 Mention version of flip in weights_only error message (#141304)
Fixes https://github.com/pytorch/pytorch/issues/141139

How the 3 versions of the error message now look

### Version 1

Old error message:

```
_pickle.UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options, do those steps only if you trust the source of the checkpoint.
        (1) Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
        (2) Alternatively, to load with `weights_only=True` please check the recommended steps in the following error message.
        WeightsUnpickler error: Unsupported global: GLOBAL __main__._rebuild_class_that_uses_build_instruction was not an allowed global by default. Please use `torch.serialization.add_safe_globals([_rebuild_class_that_uses_build_instruction])` or the `torch.serialization.safe_globals([_rebuild_class_that_uses_build_instruction])` context manager to allowlist this global if you trust this class/function.

Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.
```

New error message:

```
_pickle.UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options, do those steps only if you trust the source of the checkpoint.
        (1) In PyTorch 2.6, we changed the default value of the `weights_only` argument in `torch.load` from `False` to `True`. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
        (2) Alternatively, to load with `weights_only=True` please check the recommended steps in the following error message.
        WeightsUnpickler error: Unsupported global: GLOBAL __main__._rebuild_class_that_uses_build_instruction was not an allowed global by default. Please use `torch.serialization.add_safe_globals([_rebuild_class_that_uses_build_instruction])` or the `torch.serialization.safe_globals([_rebuild_class_that_uses_build_instruction])` context manager to allowlist this global if you trust this class/function.

Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.
```
### Version 2

Old error message:

```
_pickle.UnpicklingError: Weights only load failed. ``torch.nested`` and ``torch._dynamo`` must be imported to load nested jagged tensors (NJTs)
```

New error message:
```

_pickle.UnpicklingError: Weights only load failed. ``torch.nested`` and ``torch._dynamo`` must be imported to load nested jagged tensors (NJTs)
 In PyTorch 2.6, we changed the default value of the `weights_only` argument in `torch.load` from `False` to `True`. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.

 ```

 ### Version 3

Old error message
```
_pickle.UnpicklingError: Weights only load failed. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
Trying to load unsupported GLOBAL posix.execv whose module posix is blocked.

Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.
```

New error message
```
_pickle.UnpicklingError: Weights only load failed. In PyTorch 2.6, we changed the default value of the `weights_only` argument in `torch.load` from `False` to `True`. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
Trying to load unsupported GLOBAL posix.execv whose module posix is blocked.

Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.
```
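
For reference, a minimal sketch of the allowlisting workflow these messages point to (class and file name are illustrative):
```python
import torch

class MyConfig:
    def __init__(self, hidden=8):
        self.hidden = hidden

torch.save(MyConfig(), "config.pt")

# Keep weights_only=True (the new 2.6 default) and allowlist the class,
# but only if you trust the source of the checkpoint.
with torch.serialization.safe_globals([MyConfig]):
    cfg = torch.load("config.pt", weights_only=True)
print(cfg.hidden)
```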

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141304
Approved by: https://github.com/zou3519
2024-12-03 03:26:27 +00:00
4cbb3b4bd2 [ROCm] Enable finding HIP and ROCm libraries on Windows (#137279)
This PR introduces support for finding HIP-SDK Libraries on Windows.

Since reading the code changes in the diff view is a bit cumbersome due to the introduced if branch, let me explain what was changed:
- The linux-specific steps to find HIP packages have been dragged into `if(UNIX) block`
- Windows steps follow in the `else()` clause

The separation was needed, because of several factors:
- HIP SDK for Windows typically uses `hip` in its component names (for example: `hip_version.h` instead of `rocm_version.h`, `HIP_VERSION_DEV_MAJOR` instead of `ROCM_VERSION_DEV_MAJOR`, etc.),
- The libraries included in HIP SDK are only a subset of what is available in Linux ROCm (missing hsa-rt, rccl, roctx)
- MIOpen isn't a part of HIP SDK, but can be built separately and as of now requires an additional path to be defined using an env var.
- Windows can only find the hip package (in a version greater than 1.0) and its libraries if the lowercase `find_package(hip ...)` is invoked first. This is because the lowercase `hip` name will cause the mechanism to find hip's packages using [config mode](https://cmake.org/cmake/help/latest/command/find_package.html#search-modes), which is the only mode supported on Windows, assuming we also want to [include its libraries](https://rocm.docs.amd.com/en/latest/conceptual/cmake-packages.html#consuming-the-hip-api-in-c-code). The upper-case, module-mode-searched `find_package(HIP)` is used later to include macros such as `hip_add_library` and related macros.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137279
Approved by: https://github.com/jeffdaily
2024-12-03 03:26:01 +00:00
33573488d0 Make Dtypepropagation singleton (#141882)
Should fix the compile-time regression; it was doing fairly expensive metaprogramming in init and was being instantiated multiple times.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141882
Approved by: https://github.com/ezyang
ghstack dependencies: #139945, #140057, #141495
2024-12-03 03:15:16 +00:00
f911361de1 Correctly specify size of sparse_csr tensors in maskedtensor binary ops (#134335)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134335
Approved by: https://github.com/amjames, https://github.com/cpuhrsch
2024-12-03 02:55:57 +00:00
08db735629 [BE]: Update mypy to 1.13.0 (#140808)
Update mypy to 1.13.0 . Should hopefully reduce linting time. Has support for orjson cache serialization which should improve mypy cache perf if orjson is installed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140808
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-12-03 02:50:10 +00:00
34127fc688 Only reconstruct dict if needed (#141606)
Fixes #141452

This is a follow-up of PR #134876, which optimized dict reconstruct to codegen only if any value changed. In this PR we cover the general case and do not codegen any instruction if the dictionary remains the same.
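
A sketch of the scenario this optimizes: the input dict is only read, so Dynamo no longer needs to emit bytecode that rebuilds it after the graph runs.
```python
import torch

@torch.compile(backend="eager")
def f(cfg, x):
    return x * cfg["scale"]   # cfg is never mutated

f({"scale": 2.0}, torch.ones(3))
```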

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141606
Approved by: https://github.com/zou3519
2024-12-03 02:22:34 +00:00
a6bea3d86d Fix DCe in training IR to reflect correct record function op (#141899)
Summary: The exit function is actually exit._recordFunction, not exit.default.

Test Plan: CI

Differential Revision: D66665359

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141899
Approved by: https://github.com/ydwu4
2024-12-03 01:59:37 +00:00
d633cf1f55 Set remote cache version and backend type once in compilation metrics (#141707)
This is causing FbFxGraphRemoteCache.init to no longer be idempotent, i.e. only safe to call once per compile. AOTAutogradCache initializes a new remote cache for the forward and the backward.
Technically, we could make AOTAutogradCache smart and globally thread through a single FbFxGraphRemoteCache everywhere. But there's no reason to do so, as this class is just the handle to access the cache. Plus, it's very brittle for FbFxGraphRemoteCache to not be safe to call multiple times.

(Same problem, different fix of D66502138)

Differential Revision: [D66508492](https://our.internmc.facebook.com/intern/diff/D66508492/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141707
Approved by: https://github.com/ezyang
2024-12-03 01:49:11 +00:00
77748ed8ec fix c10::Event UT failure on XPU backend (#141800)
# Motivation
Fix the UT failure introduced by https://github.com/pytorch/pytorch/pull/140865; an unrelated failure had been suppressing it.
It started happening once https://github.com/pytorch/pytorch/pull/141546 landed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141800
Approved by: https://github.com/EikanWang
2024-12-03 01:34:42 +00:00
09ce760fef Revert "Add missing data types at torch export serialization (#138561)"
This reverts commit 1ef1b3b39123255483c51fafbd21217d76e140e7.

Reverted https://github.com/pytorch/pytorch/pull/138561 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/138561#issuecomment-2513343401))
2024-12-03 01:32:50 +00:00
4959784dac Add API query for available per-process CUDA memory (#140620)
Certain `cpp_wrapper`-enabled tests were OOM-ing in the CI pipeline, with error messages suggesting that sufficient memory was accessible.  This ultimately resulted from an internal memory limitation that was not queryable in the API.  This PR adds querying for that limit.

Additionally, the failing tests had incorrect memory availability checks, and are updated with measured memory requirements.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140620
Approved by: https://github.com/malfet, https://github.com/eqy
ghstack dependencies: #141367
2024-12-03 00:24:03 +00:00
5c33c9202f Skip test_cpu_repo.py::CPUReproTests::test_auto_zvec_vsx_simd on AArch64 (#141155)
The skipping logic clearly states it shouldn't be running on this architecture. The test then fails due to `VecNEON` returning `128` from `bit_width()`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141155
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/malfet
2024-12-03 00:19:06 +00:00
c17ba69ba5 [submodule] Revert "Adds support for accelerated sorting with x86-simd-sort (#127936) (#141901)
Looks like the original PR caused: https://github.com/pytorch/pytorch/issues/140590

Please see comment: https://github.com/pytorch/pytorch/issues/140590#issuecomment-2508704480

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141901
Approved by: https://github.com/andrewor14, https://github.com/malfet
2024-12-03 00:16:35 +00:00
e41a0b33ec Allow Fakified subclass to have different device for inner and outer tensor (#141839)
Previously if a wrapper tensor subclass is fakified, the inner tensors would end up having the same device as the outer tensor. This PR makes it so that inner and outer tensors can have different devices.

See OffloadTensor PR https://github.com/pytorch/pytorch/pull/141840/files#diff-3bc0cf540b694f4ec0a3749f78b047456657a53a5657e495ffb68e5970c5fdaaR1955 for an application. A simpler test has been added in this PR.

This is technically bc-breaking because now the callback passed to MetaConverter needs to accept an extra argument, but no one external should be using this anyway?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141839
Approved by: https://github.com/bdhirsh
ghstack dependencies: #141166
2024-12-03 00:09:41 +00:00
9830e7b1e4 Update OpenBLAS to 0.3.28 (#137263)
This includes a number of performance improvements, such as threading optimisations and forwarding GEMM calls to GEMV for calls where N=1 or M=1.

See: https://github.com/OpenMathLib/OpenBLAS/releases

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137263
Approved by: https://github.com/malfet
2024-12-03 00:05:34 +00:00
9f9105a67b [MPS] Write/Invoke Metal shaders from C++ (#141547)
By introducing `DynamicMetalShaderLibrary` and `MetalShaderFunction`
Add unittests that also serves as an example of how API works

Using this primitive, one can compile and dispatch any 1D or 2D shader over MPS tensor using the following pattern
```cpp
auto x = torch::empty({8, 16}, at::device(at::kMPS));
DynamicMetalShaderLibrary lib(R"MTL(
  kernel void full(device float* t, constant ulong2& strides, uint2 idx [[thread_position_in_grid]]) {
    t[idx.x*strides.x + idx.y*strides.y] = idx.x + 33.0 * idx.y;
  }
)MTL");
auto func = lib.getKernelFunction("full");
func->runCommandBlock([&] {
   func->startEncoding();
   func->setArg(0, x);
   func->setArg(1, x.strides());
   func->dispatch({8, 16});
});

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141547
Approved by: https://github.com/Skylion007
2024-12-02 23:57:59 +00:00
5c2584a14c [ROCm] Enable inductor GEMM lowering for gfx11 (#141687)
This check doesn't make sense for some AMD GPUs: they have the right number of CUs, but multi_processor_count returns WGPs on RDNA even though those parts still perform adequately. A lot of tests fail on modern archs because this check makes them default to not using the GEMM backend.
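
For context, the property feeding that heuristic can be inspected directly (requires a ROCm/CUDA build with a visible device):
```python
import torch

props = torch.cuda.get_device_properties(0)
# On RDNA parts this reports WGPs rather than CUs, undercounting the hardware
# relative to CDNA/NVIDIA and tripping the old threshold.
print(props.name, props.multi_processor_count)
```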

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141687
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily, https://github.com/malfet

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2024-12-02 22:13:34 +00:00
1f3d8896bc Fix mismatched tensor metadata between FakeTensor and Intel XPU concrete tensor when running F.logsigmoid (#141333)
Fixes https://github.com/pytorch/pytorch/issues/141332
`F.logsigmoid` returns two outputs: `output` and `buffer`.
On the CPU path, `F.logsigmoid` uses `buffer` to store intermediate values that are reused when computing gradients, so it returns a `buffer` tensor of nonzero size. On the CUDA and XPU paths the buffer is unused, so the `buffer` tensor returned by XPU `F.logsigmoid` has zero size, just like CUDA. The root cause of the issue is that the code in `decompositions.py` (ref: https://github.com/pytorch/pytorch/blob/main/torch/_decomp/decompositions.py#L2803) only handles the CUDA case; when a fake tensor on an XPU device reaches it, it takes the CPU path and returns a `buffer` of nonzero size, which conflicts with the implementation of the Intel XPU concrete tensor. This PR therefore adds conditions to handle the XPU case, ensuring the two returned buffer sizes match.
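
The two outputs can be observed directly; a small sketch (CPU shown, where the buffer is non-empty; on CUDA/XPU it is expected to be empty):
```python
import torch

x = torch.randn(4)
output, buffer = torch.ops.aten.log_sigmoid_forward(x)
print(output.shape, buffer.shape)  # on CPU the buffer has the same shape as the input
```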

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141333
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/ezyang
2024-12-02 22:09:20 +00:00
74eb92ed6e fix deep copy of empty graph (#141660)
Differential Revision: D66532131

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141660
Approved by: https://github.com/ezyang
2024-12-02 22:03:13 +00:00
41e59754b4 [CI] Remove inductor-perf-test-nightly-a10g.yml (#141895)
Summary: Deprecate the A10g nightly perf run. The workflow was introduced as an experiment and doesn't seem to be used by developers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141895
Approved by: https://github.com/huydhn
2024-12-02 21:55:20 +00:00
cyy
55250b324d [1/N] Apply py39 ruff fixes (#138578)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138578
Approved by: https://github.com/Skylion007
2024-12-02 21:46:18 +00:00
b47bdb06d8 Revert "[inductor][pattern matcher] revise mkldnn pattern matcher UT (#141334)"
This reverts commit 942a2438e263a2632b8934dd245060c9b237f4be.

Reverted https://github.com/pytorch/pytorch/pull/141334 on behalf of https://github.com/atalman due to Failing internally ([comment](https://github.com/pytorch/pytorch/pull/141334#issuecomment-2512891840))
2024-12-02 21:29:02 +00:00
6b05e31042 Revert "[REFACTOR] Inline FxGraphCache.post_compile into sole call site (#141877)"
This reverts commit 61534391ba8204286f5c9ed15ab636e94bd3daf2.

Reverted https://github.com/pytorch/pytorch/pull/141877 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but a lot of failures shows up after this lands ([comment](https://github.com/pytorch/pytorch/pull/141877#issuecomment-2512890426))
2024-12-02 21:26:13 +00:00
64d44a39a1 remote_cache: Add a waitcounter for gets and sets (#141307)
This adds a basic waitcounter to help show if we're spending a lot of
time doing gets and sets to remote caches

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141307
Approved by: https://github.com/masnesral
2024-12-02 20:48:47 +00:00
daa77f3d9f Revert "[BE]: Update mypy to 1.13.0 (#140808)"
This reverts commit 00134d68af2ce50560fa5a74473665ea229e6c9d.

Reverted https://github.com/pytorch/pytorch/pull/140808 on behalf of https://github.com/huydhn due to This is failing a distributed test in trunk, target determination missed this test and did not run it on PR ([comment](https://github.com/pytorch/pytorch/pull/140808#issuecomment-2512788426))
2024-12-02 20:47:43 +00:00
54adbbf6b8 cpp_wrapper: Add support for MemoryFormat arguments (#141367)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141367
Approved by: https://github.com/desertfire
2024-12-02 20:40:24 +00:00
30574380a3 [REFACTOR] Factor _fx_graph_cache_key and _time_taken_ns to common base class (#141878)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141878
Approved by: https://github.com/jamesjwu
ghstack dependencies: #141877
2024-12-02 20:07:12 +00:00
61534391ba [REFACTOR] Inline FxGraphCache.post_compile into sole call site (#141877)
I am going to break apart the arguments passed to the constituents
to only pass exactly what is needed, so easy access to the insides
is helpful here.

This also moves two helper functions to output_code.py as well.

Also set _boxed_call at constructor.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141877
Approved by: https://github.com/jamesjwu, https://github.com/jansel
2024-12-02 19:48:05 +00:00
fe68f61c59 Migrate micro benchmark results to benchmark database schema v3 (#141745)
Similar to https://github.com/pytorch/pytorch/pull/141087, this uploads the micro benchmark results to benchmark database with its new schema v3. The data can then be queried.

~I'm testing with `inductor-micro-benchmark-x86` which should be sufficient because `inductor-micro-benchmark` is broken atm.  The CSV output stays for now until the dashboard is migrated to schema v3.~ https://github.com/pytorch/pytorch/issues/141747 has been resolved, so inductor-micro-benchmark should work now

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141745
Approved by: https://github.com/yanboliang
2024-12-02 19:45:51 +00:00
cyy
ab5467897a Fix NOLINTNEXTLINE (#141794)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141794
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-12-02 19:22:00 +00:00
161a2340ee Switch to using Python nested int (#141166)
Doesn't seem to noticeably slow down eager - TestNestedTensorSubclass tests with and without the PR finished in similar amounts of time (around 57s, 58s)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141166
Approved by: https://github.com/ezyang
2024-12-02 19:17:30 +00:00
2d708752f0 [dynamo] Remove AutoDerefLocalSource and simplify cell handling (#141629)
This patch
1. removes `AutoDerefLocalSource` in favor of `LocalSource`, thereby
   removing its special handling in guards.
2. introduces a `LocalCellSource` for cells from the root frame, with
   only `reconstruct` implemented, to programmatically enforce that these
   cells should never be used by other components like guards.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141629
Approved by: https://github.com/jansel
ghstack dependencies: #141628
2024-12-02 19:09:30 +00:00
e14d8c980f [dynamo][NFC] Rename NewCellVariable to CellVariable (#141628)
It was named `NewCellVariable` because we originally used it to
represent cells by the code Dynamo is tracing through. However, now we
use it to represent pre-existing cells as well, so this patch renames it
to avoid confusion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141628
Approved by: https://github.com/williamwen42, https://github.com/jansel
2024-12-02 19:09:30 +00:00
0989871ac9 pytorch/feature: Record if parallel compile is enabled (#141074)
This gets a bit messy, but this appears to be the best spot to make a
true / false decision.

Note that since we're looking at whether or not it's used, if the pool
doesn't warm up within the time it takes for a compile, we will mark the
feature use as false.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141074
Approved by: https://github.com/masnesral
ghstack dependencies: #141059
2024-12-02 19:09:11 +00:00
00134d68af [BE]: Update mypy to 1.13.0 (#140808)
Update mypy to 1.13.0 . Should hopefully reduce linting time. Has support for orjson cache serialization which should improve mypy cache perf if orjson is installed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140808
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-12-02 18:47:54 +00:00
9012e7a62f Revert "[dynamo][pytree][1/N] make CXX pytree traceable: tree_iter / tree_leaves (#137397)"
This reverts commit 07850bb2c1771ba3f5578b0aa85792e5cd70de1c.

Reverted https://github.com/pytorch/pytorch/pull/137397 on behalf of https://github.com/atalman due to Failing internal test ([comment](https://github.com/pytorch/pytorch/pull/137397#issuecomment-2511934283))
2024-12-02 16:05:14 +00:00
eb7deb2db5 Revert "Fix NOLINTNEXTLINE (#141794)"
This reverts commit 7dd9b5fc4343d101294dbbab4b4172f2859460bc.

Reverted https://github.com/pytorch/pytorch/pull/141794 on behalf of https://github.com/atalman due to [GH job link](https://github.com/pytorch/pytorch/actions/runs/12087979418/job/33711943084) [HUD commit link](7dd9b5fc43) ([comment](https://github.com/pytorch/pytorch/pull/141794#issuecomment-2511789484))
2024-12-02 15:07:50 +00:00
a34a56f69f Revert "Ensure that BlockMask length must always exactly match the sequence length in flex_attention (#141625)"
This reverts commit 795f28ac552eb61d02ea02fd64637ba814133bd8.

Reverted https://github.com/pytorch/pytorch/pull/141625 on behalf of https://github.com/albanD due to Broken main ([comment](https://github.com/pytorch/pytorch/pull/141625#issuecomment-2511639687))
2024-12-02 14:10:38 +00:00
ec96597e47 Revert "ILP for auto FSDP wrapping (#140298)"
This reverts commit d4cdc098817a0af10b478256b524533ed67285a9.

Reverted https://github.com/pytorch/pytorch/pull/140298 on behalf of https://github.com/xuanzhang816 due to for other PR ([comment](https://github.com/pytorch/pytorch/pull/140298#issuecomment-2511638743))
2024-12-02 14:08:04 +00:00
942a2438e2 [inductor][pattern matcher] revise mkldnn pattern matcher UT (#141334)
Fixes #139970, #139812.

Revise mkldnn pattern matcher UTs, to check the relevant specific matched patterns instead of the total matched number.
1) Add the missing specific counters in pattern matchers, e.g. `mkldnn_unary_fusion_matcher_nodes`/`mkldnn_conv_weight_pack_matcher_count`.
2) In UTs, change the general `matcher_count`/`matcher_nodes` checks to the specific ones, e.g. `mkldnn_unary_fusion_matcher_nodes`/`mkldnn_conv_weight_pack_matcher_count`.
3) In UTs, remove the option of `matcher_count`/`matcher_nodes` params in _test_common and make `matcher_check_fn` a necessary param.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141334
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
2024-12-02 08:42:10 +00:00
96d2a511ce [Inductor][CPP] Fix issue in CPP GEMM Template Prune Tensor (#141798)
**Summary**
When addressing [issue #134998](https://github.com/pytorch/pytorch/issues/134998), we verify whether any node in the current graph shares the same storage as the node we intend to prune. In the implementation, we assumed that when creating the `GraphLowering` in the post-grad phase there would be no `submodules`, and that all `get_attr` nodes would correspond to a `torch.Tensor`. However, this assumption proves incorrect when `FlexAttention` is enabled: in that scenario, `submodules` appear as `get_attr` nodes in the post-grad phase. For example:

```
V1128 23:23:47.071000 1965794 torch/_inductor/compile_fx.py:875] [0/1] [__post_grad_graphs]     class sdpa_score30(torch.nn.Module):
V1128 23:23:47.071000 1965794 torch/_inductor/compile_fx.py:875] [0/1] [__post_grad_graphs]         def forward(self, arg0_1: "bf16[][]cpu", arg1_1: "i32[][]cpu", arg2_1: "i32[][]cpu", arg3_1: "i32[][]cpu", arg4_1: "i32[][]cpu"):
V1128 23:23:47.071000 1965794 torch/_inductor/compile_fx.py:875] [0/1] [__post_grad_graphs]             return arg0_1

V1128 23:23:45.482000 1965794 torch/_inductor/freezing.py:118] [0/1]         sdpa_score30 = self.sdpa_score30
V1128 23:23:45.482000 1965794 torch/_inductor/freezing.py:118] [0/1]         sdpa_mask30 = self.sdpa_mask30
V1128 23:23:45.482000 1965794 torch/_inductor/freezing.py:118] [0/1]         flex_attention_30 = torch.ops.higher_order.flex_attention(add_276, index_put_60, index_put_61, sdpa_score30, (_frozen_param293, _frozen_param295, _frozen_param296, _frozen_param297, _frozen_param298, _frozen_param299, _frozen_param300, _frozen_param301, 64, 64, sdpa_mask30), 0.08838834764831843, {'SKIP_MASK_SCORE': True, 'PRESCALE_QK': False, 'ROWS_GUARANTEED_SAFE': False, 'BLOCKS_ARE_CONTIGUOUS': False, 'OUTPUT_LOGSUMEXP': False}, (), (_frozen_param294,));  add_276 = sdpa_score30 = sdpa_mask30 = None
V1128 23:23:45.482000 1965794 torch/_inductor/freezing.py:118] [0/1]         getitem_60: "bf16[1, 32, 1, 128]" = flex_attention_30[0];  flex_attention_30 = None
```
We added an extra check in the implementation to ensure we only compare `get_attr` nodes that correspond to a `torch.Tensor`. It is difficult to reproduce this issue using pure higher-order operators; adding a unit test after https://github.com/pytorch/pytorch/pull/141453 lands would be more straightforward.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141798
Approved by: https://github.com/jgong5
2024-12-02 07:38:57 +00:00
90f4d60672 Revert "export AOTI_TORCH_EXPORT on Windows. (#140030)"
This reverts commit daed864f7b3ca3b3e64ed13624369fd3007ad47d.

Reverted https://github.com/pytorch/pytorch/pull/140030 on behalf of https://github.com/xuhancn due to need to fix on XPU. ([comment](https://github.com/pytorch/pytorch/pull/140030#issuecomment-2510737212))
2024-12-02 07:10:41 +00:00
cyy
8cada5cbe5 Use std::apply (#141834)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141834
Approved by: https://github.com/Skylion007
2024-12-02 05:49:10 +00:00
f16e08042c [user triton] Fix grid codegen for configs with empty kwargs (#141824)
Fixes #141823 by adding special handling of the codegen `if <config kwargs>: return <grid>` for the cases when there are no kwargs in the config.
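
For illustration, a sketch of what a config with empty kwargs looks like (the kernel is illustrative); these are the configs whose grid codegen previously produced a bare `if`:
```python
import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({}, num_warps=4),   # empty kwargs dict
        triton.Config({}, num_warps=8),
    ],
    key=[],
)
@triton.jit
def copy_kernel(src_ptr, dst_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    offsets = tl.program_id(0) * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    tl.store(dst_ptr + offsets, tl.load(src_ptr + offsets, mask=mask), mask=mask)
```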

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141824
Approved by: https://github.com/Chillee
2024-12-02 04:17:21 +00:00
daed864f7b export AOTI_TORCH_EXPORT on Windows. (#140030)
Fixes #139954

reproduce UT:
```cmd
pytest test/inductor/test_torchinductor_codegen_dynamic_shapes.py -k test_device_assert_dynamic_shapes_cpu
```
Issue:
<img width="856" alt="image" src="https://github.com/user-attachments/assets/5fc501a9-54e5-45ac-9fb3-509ec11a7abe">

After fixing:
![Image](https://github.com/user-attachments/assets/883846fb-8e92-4b9c-9400-daab32382a3a)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140030
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-12-02 03:20:29 +00:00
81ab2cc757 Update torch-xpu-ops commit pin (#141201)
Update the torch-xpu-ops commit to [1e32bbc](1e32bbc3d9), includes:

- Improve XPU aten operator coverage
- Support basic `SparseXPU` operators

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141201
Approved by: https://github.com/EikanWang, https://github.com/jansel
2024-12-02 01:49:07 +00:00
795f28ac55 Ensure that BlockMask length must always exactly match the sequence length in flex_attention (#141625)
Fixes https://github.com/pytorch/pytorch/issues/141435

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141625
Approved by: https://github.com/drisspg
ghstack dependencies: #138788
2024-12-02 00:35:29 +00:00
8eb259fdc3 Added option to control number of kernel options displayed (#138788)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138788
Approved by: https://github.com/drisspg
2024-12-02 00:35:29 +00:00
fc74ec4989 [2/N] Avoid copy in std::get (#141826)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141826
Approved by: https://github.com/Skylion007, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-12-02 00:16:48 +00:00
b2fe1b9409 [inductor] Fix 3d tiling (#141709)
Fixes #141121

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141709
Approved by: https://github.com/eellison
2024-12-01 19:47:41 +00:00
90f19fee8a [MPS] Convert channels_last_3d to contiguous for input tensor in nn.Conv3d (#141780)
When the input tensor to Conv3d is in the channels_last_3d memory format, the Conv3d op generates incorrect output (see the example image in #141471). This PR checks whether the op is 3d and, if so, converts the input tensor to contiguous.

Added a regression test that verifies the output by running the same op on the CPU.

I'm unsure if Conv3d supports the channels last memory format after #128393. If it does, we should consider updating the logic to utilize this as it would be more efficient. Perhaps @DenisVieriu97 knows or has more context?
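
A sketch of the regression check described above (requires an MPS device; tolerances are illustrative):
```python
import torch
import torch.nn as nn

conv = nn.Conv3d(3, 4, kernel_size=3, padding=1)
x = torch.randn(1, 3, 8, 8, 8).to(memory_format=torch.channels_last_3d)

ref = conv(x)                       # CPU reference
out = conv.to("mps")(x.to("mps"))   # previously produced incorrect values
torch.testing.assert_close(out.cpu(), ref, rtol=1e-4, atol=1e-4)
```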

Fixes #141471
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141780
Approved by: https://github.com/malfet
2024-12-01 18:36:53 +00:00
1193 changed files with 38059 additions and 17233 deletions

View File

@ -6,19 +6,6 @@ GPU_ARCH_VERSION=${GPU_ARCH_VERSION:-}
SCRIPTPATH="$( cd -- "$(dirname "$0")" >/dev/null 2>&1 ; pwd -P )"
source $SCRIPTPATH/aarch64_ci_setup.sh
tagged_version() {
GIT_DESCRIBE="git --git-dir /pytorch/.git describe --tags --match v[0-9]*.[0-9]*.[0-9]*"
if ${GIT_DESCRIBE} --exact >/dev/null; then
${GIT_DESCRIBE}
else
return 1
fi
}
if tagged_version >/dev/null; then
export OVERRIDE_PACKAGE_VERSION="$(tagged_version | sed -e 's/^v//' -e 's/-.*$//')"
fi
###############################################################################
# Run aarch64 builder python
###############################################################################

View File

@ -80,7 +80,7 @@ def update_wheel(wheel_path) -> None:
"/usr/local/cuda/lib64/libnvToolsExt.so.1",
"/usr/local/cuda/lib64/libnvJitLink.so.12",
"/usr/local/cuda/lib64/libnvrtc.so.12",
"/usr/local/cuda/lib64/libnvrtc-builtins.so.12.4",
"/usr/local/cuda/lib64/libnvrtc-builtins.so.12.6",
"/usr/local/cuda/lib64/libcudnn_adv.so.9",
"/usr/local/cuda/lib64/libcudnn_cnn.so.9",
"/usr/local/cuda/lib64/libcudnn_graph.so.9",

View File

@ -1,5 +1,5 @@
0.7b
manylinux_2_17
0.8b
manylinux_2_28
rocm6.2
9be04068c3c0857a4cfd17d7e39e71d0423ebac2
3e9e1959d23b93d78a08fcc5f868125dc3854dece32fd9458be9ef4467982291
6f8cbcac8a92775291bb1ba8f514d4beb350baf4
e938def5d32869fe2e00aec0300f354c9f157867bebdf2e104d732b94cb238d8

View File

@ -179,6 +179,21 @@ case "$image" in
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda12.4-cudnn9-py3.13-gcc9-inductor-benchmarks)
CUDA_VERSION=12.4.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.13
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda11.8-cudnn9-py3-gcc9)
CUDA_VERSION=11.8.0
CUDNN_VERSION=9

View File

@ -76,7 +76,8 @@ install_ubuntu() {
vim \
unzip \
gpg-agent \
gdb
gdb \
bc
# Should resolve issues related to various apt package repository cert issues
# see: https://github.com/pytorch/pytorch/issues/65931

View File

@ -9,12 +9,7 @@ install_ubuntu() {
# Instead use lib and headers from OpenSSL1.1 installed in `install_openssl.sh``
apt-get install -y cargo
echo "Checking out sccache repo"
if [ -n "$CUDA_VERSION" ]; then
# TODO: Remove this
git clone https://github.com/pytorch/sccache
else
git clone https://github.com/mozilla/sccache -b v0.8.2
fi
git clone https://github.com/mozilla/sccache -b v0.8.2
cd sccache
echo "Building sccache"
cargo build --release
@ -44,16 +39,7 @@ export PATH="/opt/cache/bin:$PATH"
if [ -n "$ROCM_VERSION" ]; then
curl --retry 3 http://repo.radeon.com/misc/.sccache_amd/sccache -o /opt/cache/bin/sccache
else
ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')
if [ -n "$CUDA_VERSION" ]; then
# TODO: Install the pre-built binary from S3 as building from source
# https://github.com/pytorch/sccache has started failing mysteriously
# in which sccache server couldn't start with the following error:
# sccache: error: Invalid argument (os error 22)
install_binary
else
install_ubuntu
fi
install_ubuntu
fi
chmod a+x /opt/cache/bin/sccache
@ -61,21 +47,26 @@ function write_sccache_stub() {
# Unset LD_PRELOAD for ps because of asan + ps issues
# https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90589
if [ $1 == "gcc" ]; then
# Do not call sccache recursively when dumping preprocessor argument
# For some reason it's very important for the first cached nvcc invocation
cat > "/opt/cache/bin/$1" <<EOF
# Do not call sccache recursively when dumping preprocessor argument
# For some reason it's very important for the first cached nvcc invocation
cat >"/opt/cache/bin/$1" <<EOF
#!/bin/sh
if [ "\$1" = "-E" ] || [ "\$2" = "-E" ]; then
exec $(which $1) "\$@"
elif [ \$(env -u LD_PRELOAD ps -p \$PPID -o comm=) != sccache ]; then
# sccache does not support -E flag, so we need to call the original compiler directly in order to avoid calling this wrapper recursively
for arg in "\$@"; do
if [ "\$arg" = "-E" ]; then
exec $(which $1) "\$@"
fi
done
if [ \$(env -u LD_PRELOAD ps -p \$PPID -o comm=) != sccache ]; then
exec sccache $(which $1) "\$@"
else
exec $(which $1) "\$@"
fi
EOF
else
cat > "/opt/cache/bin/$1" <<EOF
cat >"/opt/cache/bin/$1" <<EOF
#!/bin/sh
if [ \$(env -u LD_PRELOAD ps -p \$PPID -o comm=) != sccache ]; then
@ -125,7 +116,7 @@ if [ -n "$ROCM_VERSION" ]; then
TOPDIR=$(dirname $OLDCOMP)
WRAPPED="$TOPDIR/original/$COMPNAME"
mv "$OLDCOMP" "$WRAPPED"
printf "#!/bin/sh\nexec sccache $WRAPPED \"\$@\"" > "$OLDCOMP"
printf "#!/bin/sh\nexec sccache $WRAPPED \"\$@\"" >"$OLDCOMP"
chmod a+x "$OLDCOMP"
}

View File

@ -25,7 +25,8 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
mkdir -p /opt/conda
chown jenkins:jenkins /opt/conda
source "$(dirname "${BASH_SOURCE[0]}")/common_utils.sh"
SCRIPT_FOLDER="$( cd "$(dirname "$0")" ; pwd -P )"
source "${SCRIPT_FOLDER}/common_utils.sh"
pushd /tmp
wget -q "${BASE_URL}/${CONDA_FILE}"
@ -65,7 +66,7 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
# Install PyTorch conda deps, as per https://github.com/pytorch/pytorch README
if [[ $(uname -m) == "aarch64" ]]; then
conda_install "openblas==0.3.25=*openmp*"
conda_install "openblas==0.3.28=*openmp*"
else
conda_install "mkl=2021.4.0 mkl-include=2021.4.0"
fi
@ -84,8 +85,9 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
# Magma package names are concatenation of CUDA major and minor ignoring revision
# I.e. magma-cuda102 package corresponds to CUDA_VERSION=10.2 and CUDA_VERSION=10.2.89
# Magma is installed from a tarball in the ossci-linux bucket into the conda env
if [ -n "$CUDA_VERSION" ]; then
conda_install magma-cuda$(TMP=${CUDA_VERSION/./};echo ${TMP%.*[0-9]}) -c pytorch
${SCRIPT_FOLDER}/install_magma_conda.sh $(cut -f1-2 -d'.' <<< ${CUDA_VERSION}) ${ANACONDA_PYTHON_VERSION}
fi
# Install some other packages, including those needed for Python test reporting

View File

@ -20,10 +20,10 @@ function install_cusparselt_062 {
function install_cusparselt_063 {
# cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && pushd tmp_cusparselt
wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-x86_64/libcusparse_lt-linux-x86_64-0.6.3.2-archive.tar.xz
tar xf libcusparse_lt-linux-x86_64-0.6.3.2-archive.tar.xz
cp -a libcusparse_lt-linux-x86_64-0.6.3.2-archive/include/* /usr/local/cuda/include/
cp -a libcusparse_lt-linux-x86_64-0.6.3.2-archive/lib/* /usr/local/cuda/lib64/
wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-sbsa/libcusparse_lt-linux-sbsa-0.6.3.2-archive.tar.xz
tar xf libcusparse_lt-linux-sbsa-0.6.3.2-archive.tar.xz
cp -a libcusparse_lt-linux-sbsa-0.6.3.2-archive/include/* /usr/local/cuda/include/
cp -a libcusparse_lt-linux-sbsa-0.6.3.2-archive/lib/* /usr/local/cuda/lib64/
popd
rm -rf tmp_cusparselt
}

View File

@ -43,8 +43,6 @@ install_pip_dependencies() {
setup_executorch() {
pushd executorch
# Setup swiftshader and Vulkan SDK which are required to build the Vulkan delegate
as_jenkins bash .ci/scripts/setup-vulkan-linux-deps.sh
export PYTHON_EXECUTABLE=python
export EXECUTORCH_BUILD_PYBIND=ON

View File

@ -7,14 +7,20 @@ source "$(dirname "${BASH_SOURCE[0]}")/common_utils.sh"
function install_huggingface() {
local version
commit=$(get_pinned_commit huggingface)
pip_install pandas==2.0.3
pip_install "git+https://github.com/huggingface/transformers@${commit}"
}
function install_timm() {
local commit
commit=$(get_pinned_commit timm)
pip_install pandas==2.0.3
# TODO (huydhn): There is no torchvision release on 3.13 when I write this, so
# I'm using nightly here instead. We just need to package to be able to install
# TIMM. Removing this once vision has a release on 3.13
if [[ "${ANACONDA_PYTHON_VERSION}" == "3.13" ]]; then
pip_install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/cu124
fi
pip_install "git+https://github.com/huggingface/pytorch-image-models@${commit}"
# Clean up
conda_run pip uninstall -y cmake torch torchvision triton

View File

@ -3,8 +3,6 @@
set -eou pipefail
MAGMA_VERSION="2.5.2"
function do_install() {
cuda_version=$1
cuda_version_nodot=${1/./}
@ -17,7 +15,7 @@ function do_install() {
set -x
tmp_dir=$(mktemp -d)
pushd ${tmp_dir}
curl -OLs https://anaconda.org/pytorch/magma-cuda${cuda_version_nodot}/${MAGMA_VERSION}/download/linux-64/${magma_archive}
curl -OLs https://ossci-linux.s3.us-east-1.amazonaws.com/${magma_archive}
tar -xvf "${magma_archive}"
mkdir -p "${cuda_dir}/magma"
mv include "${cuda_dir}/magma/include"

View File

@ -0,0 +1,26 @@
#!/usr/bin/env bash
# Script that replaces the magma install from a conda package
set -eou pipefail
function do_install() {
cuda_version_nodot=${1/./}
anaconda_python_version=$2
MAGMA_VERSION="2.6.1"
magma_archive="magma-cuda${cuda_version_nodot}-${MAGMA_VERSION}-1.tar.bz2"
anaconda_dir="/opt/conda/envs/py_${anaconda_python_version}"
(
set -x
tmp_dir=$(mktemp -d)
pushd ${tmp_dir}
curl -OLs https://ossci-linux.s3.us-east-1.amazonaws.com/${magma_archive}
tar -xvf "${magma_archive}"
mv include/* "${anaconda_dir}/include/"
mv lib/* "${anaconda_dir}/lib"
popd
)
}
do_install $1 $2

View File

@ -4,7 +4,7 @@
set -ex
cd /
git clone https://github.com/OpenMathLib/OpenBLAS.git -b v0.3.25 --depth 1 --shallow-submodules
git clone https://github.com/OpenMathLib/OpenBLAS.git -b v0.3.28 --depth 1 --shallow-submodules
OPENBLAS_BUILD_FLAGS="

View File

@ -39,17 +39,7 @@ case ${GPU_ARCH_TYPE} in
BASE_TARGET=rocm
DOCKER_TAG=rocm${GPU_ARCH_VERSION}
GPU_IMAGE=rocm/dev-ubuntu-20.04:${GPU_ARCH_VERSION}-complete
PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx1030;gfx1100"
ROCM_REGEX="([0-9]+)\.([0-9]+)[\.]?([0-9]*)"
if [[ $GPU_ARCH_VERSION =~ $ROCM_REGEX ]]; then
ROCM_VERSION_INT=$((${BASH_REMATCH[1]}*10000 + ${BASH_REMATCH[2]}*100 + ${BASH_REMATCH[3]:-0}))
else
echo "ERROR: rocm regex failed"
exit 1
fi
if [[ $ROCM_VERSION_INT -ge 60000 ]]; then
PYTORCH_ROCM_ARCH+=";gfx942"
fi
PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx1030;gfx1100;gfx1101;gfx942"
DOCKER_GPU_BUILD_ARG="--build-arg PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH}"
;;
*)

View File

@ -25,7 +25,8 @@ ENV PATH /opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/bin:/opt/conda/bin:$PATH
COPY requirements-ci.txt /opt/conda/requirements-ci.txt
COPY ./common/install_conda.sh install_conda.sh
COPY ./common/common_utils.sh common_utils.sh
RUN bash ./install_conda.sh && rm install_conda.sh common_utils.sh /opt/conda/requirements-ci.txt
COPY ./common/install_magma_conda.sh install_magma_conda.sh
RUN bash ./install_conda.sh && rm install_conda.sh install_magma_conda.sh common_utils.sh /opt/conda/requirements-ci.txt
# Install cuda and cudnn
ARG CUDA_VERSION

View File

@ -97,14 +97,7 @@ case ${GPU_ARCH_TYPE} in
DEVTOOLSET_VERSION="11"
GPU_IMAGE=rocm/dev-almalinux-8:${GPU_ARCH_VERSION}-complete
fi
PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100"
ROCM_REGEX="([0-9]+)\.([0-9]+)[\.]?([0-9]*)"
if [[ $GPU_ARCH_VERSION =~ $ROCM_REGEX ]]; then
ROCM_VERSION_INT=$((${BASH_REMATCH[1]}*10000 + ${BASH_REMATCH[2]}*100 + ${BASH_REMATCH[3]:-0}))
else
echo "ERROR: rocm regex failed"
exit 1
fi
PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101"
DOCKER_GPU_BUILD_ARG="--build-arg ROCM_VERSION=${GPU_ARCH_VERSION} --build-arg PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH} --build-arg DEVTOOLSET_VERSION=${DEVTOOLSET_VERSION}"
;;
xpu)

View File

@ -90,7 +90,7 @@ librosa>=0.6.2 ; python_version < "3.11"
#Pinned versions:
#test that import:
mypy==1.11.2
mypy==1.13.0
# Pin MyPy version because new errors are likely to appear with each release
#Description: linter
#Pinned versions: 1.10.0
@ -132,6 +132,9 @@ numpy==1.22.4; python_version == "3.9" or python_version == "3.10"
numpy==1.26.2; python_version == "3.11" or python_version == "3.12"
numpy==2.1.2; python_version >= "3.13"
pandas==2.0.3; python_version < "3.13"
pandas==2.2.3; python_version >= "3.13"
#onnxruntime
#Description: scoring engine for Open Neural Network Exchange (ONNX) models
#Pinned versions: 1.9.0
@ -155,7 +158,7 @@ optree==0.13.0
#test_pointwise_ops.py, test_dtensor_ops.py, test_torchinductor.py, test_fx.py,
#test_fake_tensor.py, test_mps.py
pillow==10.3.0
pillow==11.0.0
#Description: Python Imaging Library fork
#Pinned versions: 10.3.0
#test that import:
@ -190,6 +193,11 @@ pytest-rerunfailures>=10.3
#Pinned versions:
#test that import:
pytest-subtests==0.13.1
#Description: plugin for subtest support
#Pinned versions:
#test that import:
#pytest-benchmark
#Description: fixture for benchmarking code
#Pinned versions: 3.2.3
@ -237,7 +245,7 @@ scikit-image==0.22.0 ; python_version >= "3.10"
#test that import:
scipy==1.10.1 ; python_version <= "3.11"
scipy==1.12.0 ; python_version == "3.12"
scipy==1.14.1 ; python_version >= "3.12"
# Pin SciPy because of failing distribution tests (see #60347)
#Description: scientific python
#Pinned versions: 1.10.1
@ -301,19 +309,20 @@ z3-solver==4.12.2.0
#Pinned versions:
#test that import:
tensorboard==2.13.0
tensorboard==2.13.0 ; python_version < "3.13"
tensorboard==2.18.0 ; python_version >= "3.13"
#Description: Also included in .ci/docker/requirements-docs.txt
#Pinned versions:
#test that import: test_tensorboard
pywavelets==1.4.1 ; python_version < "3.12"
pywavelets==1.5.0 ; python_version >= "3.12"
pywavelets==1.7.0 ; python_version >= "3.12"
#Description: This is a requirement of scikit-image, we need to pin
# it here because 1.5.0 conflicts with numpy 1.21.2 used in CI
#Pinned versions: 1.4.1
#test that import:
lxml==5.0.0
lxml==5.3.0
#Description: This is a requirement of unittest-xml-reporting
# Python-3.9 binaries
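The version pins above gate on the interpreter via PEP 508 environment markers (the `; python_version >= "3.13"` suffixes). A small illustrative sketch, assuming the third-party `packaging` library is available, of how such a marker is evaluated for the running interpreter:

```
from packaging.markers import Marker

# Evaluates the marker against the currently running interpreter's environment
marker = Marker('python_version >= "3.13"')
print(marker.evaluate())   # True on 3.13+, False otherwise
```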

View File

@ -14,7 +14,8 @@ matplotlib==3.5.3
#Description: This is used to generate PyTorch docs
#Pinned versions: 3.5.3
tensorboard==2.13.0
tensorboard==2.13.0 ; python_version < "3.13"
tensorboard==2.18.0 ; python_version >= "3.13"
#Description: This is used to generate PyTorch docs
#Pinned versions: 2.13.0

View File

@ -30,7 +30,8 @@ ARG CONDA_CMAKE
COPY requirements-ci.txt /opt/conda/requirements-ci.txt
COPY ./common/install_conda.sh install_conda.sh
COPY ./common/common_utils.sh common_utils.sh
RUN bash ./install_conda.sh && rm install_conda.sh common_utils.sh /opt/conda/requirements-ci.txt
COPY ./common/install_magma_conda.sh install_magma_conda.sh
RUN bash ./install_conda.sh && rm install_conda.sh install_magma_conda.sh common_utils.sh /opt/conda/requirements-ci.txt
# Install gcc
ARG GCC_VERSION
@ -80,6 +81,8 @@ RUN bash ./install_openssl.sh
ENV OPENSSL_DIR /opt/openssl
ARG INDUCTOR_BENCHMARKS
ARG ANACONDA_PYTHON_VERSION
ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION
COPY ./common/install_inductor_benchmark_deps.sh install_inductor_benchmark_deps.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/huggingface.txt huggingface.txt

View File

@ -36,7 +36,8 @@ ENV DOCS=$DOCS
COPY requirements-ci.txt requirements-docs.txt /opt/conda/
COPY ./common/install_conda.sh install_conda.sh
COPY ./common/common_utils.sh common_utils.sh
RUN bash ./install_conda.sh && rm install_conda.sh common_utils.sh /opt/conda/requirements-ci.txt /opt/conda/requirements-docs.txt
COPY ./common/install_magma_conda.sh install_magma_conda.sh
RUN bash ./install_conda.sh && rm install_conda.sh install_magma_conda.sh common_utils.sh /opt/conda/requirements-ci.txt /opt/conda/requirements-docs.txt
RUN if [ -n "${UNINSTALL_DILL}" ]; then pip uninstall -y dill; fi
# Install gcc

View File

@ -18,12 +18,14 @@ retry () {
$* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
}
PLATFORM="manylinux2014_x86_64"
# TODO move this into the Docker images
OS_NAME=$(awk -F= '/^NAME/{print $2}' /etc/os-release)
if [[ "$OS_NAME" == *"CentOS Linux"* ]]; then
retry yum install -q -y zip openssl
elif [[ "$OS_NAME" == *"AlmaLinux"* ]]; then
retry yum install -q -y zip openssl
PLATFORM="manylinux_2_28_x86_64"
elif [[ "$OS_NAME" == *"Red Hat Enterprise Linux"* ]]; then
retry dnf install -q -y zip openssl
elif [[ "$OS_NAME" == *"Ubuntu"* ]]; then
@ -253,11 +255,11 @@ make_wheel_record() {
FPATH=$1
if echo $FPATH | grep RECORD >/dev/null 2>&1; then
# The RECORD file itself gets an entry with empty hash and size fields
echo "$FPATH,,"
echo "\"$FPATH\",,"
else
HASH=$(openssl dgst -sha256 -binary $FPATH | openssl base64 | sed -e 's/+/-/g' | sed -e 's/\//_/g' | sed -e 's/=//g')
FSIZE=$(ls -nl $FPATH | awk '{print $5}')
echo "$FPATH,sha256=$HASH,$FSIZE"
echo "\"$FPATH\",sha256=$HASH,$FSIZE"
fi
}
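For reference, a minimal Python sketch (not part of the diff) of the RECORD entry format this helper emits once the path is quoted; the digest is the same padding-free urlsafe base64 SHA-256 that the openssl pipeline produces. File names and contents here are illustrative only:

```
import base64
import hashlib

def make_wheel_record(fpath: str, data: bytes) -> str:
    # The RECORD file itself is listed without hash or size
    if fpath.endswith("RECORD"):
        return f'"{fpath}",,'
    digest = hashlib.sha256(data).digest()
    b64 = base64.urlsafe_b64encode(digest).rstrip(b"=").decode()
    return f'"{fpath}",sha256={b64},{len(data)}'

print(make_wheel_record("torch/version.py", b"__version__ = '2.6.0'\n"))
```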
@ -377,6 +379,12 @@ for pkg in /$WHEELHOUSE_DIR/torch_no_python*.whl /$WHEELHOUSE_DIR/torch*linux*.w
$PATCHELF_BIN --print-rpath $sofile
done
# Create the manylinux_2_28 tag; this needs to happen before regenerating the RECORD
if [[ $PLATFORM == "manylinux_2_28_x86_64" && $GPU_ARCH_TYPE != "cpu-s390x" && $GPU_ARCH_TYPE != "xpu" ]]; then
wheel_file=$(echo $(basename $pkg) | sed -e 's/-cp.*$/.dist-info\/WHEEL/g')
sed -i -e s#linux_x86_64#"${PLATFORM}"# $wheel_file;
fi
# regenerate the RECORD file with new hashes
record_file=$(echo $(basename $pkg) | sed -e 's/-cp.*$/.dist-info\/RECORD/g')
if [[ -e $record_file ]]; then
@ -416,12 +424,20 @@ for pkg in /$WHEELHOUSE_DIR/torch_no_python*.whl /$WHEELHOUSE_DIR/torch*linux*.w
popd
fi
# zip up the wheel back
zip -rq $(basename $pkg) $PREIX*
# Rename wheel for Manylinux 2_28
if [[ $PLATFORM == "manylinux_2_28_x86_64" && $GPU_ARCH_TYPE != "cpu-s390x" && $GPU_ARCH_TYPE != "xpu" ]]; then
pkg_name=$(echo $(basename $pkg) | sed -e s#linux_x86_64#"${PLATFORM}"#)
zip -rq $pkg_name $PREIX*
rm -f $pkg
mv $pkg_name $(dirname $pkg)/$pkg_name
else
# zip up the wheel back
zip -rq $(basename $pkg) $PREIX*
# remove original wheel
rm -f $pkg
mv $(basename $pkg) $pkg
fi
# replace original wheel
rm -f $pkg
mv $(basename $pkg) $pkg
cd ..
rm -rf tmp
done
@ -474,9 +490,9 @@ if [[ -z "$BUILD_PYTHONLESS" ]]; then
echo "$(date) :: Running tests"
pushd "$PYTORCH_ROOT"
#TODO: run_tests.sh and check_binary.sh should be moved to pytorch/pytorch project
LD_LIBRARY_PATH=/usr/local/nvidia/lib64 \
"/builder/run_tests.sh" manywheel "${py_majmin}" "$DESIRED_CUDA"
"${PYTORCH_ROOT}/.ci/pytorch/run_tests.sh" manywheel "${py_majmin}" "$DESIRED_CUDA"
popd
echo "$(date) :: Finished tests"
fi

View File

@ -225,11 +225,11 @@ make_wheel_record() {
FPATH=$1
if echo $FPATH | grep RECORD >/dev/null 2>&1; then
# The RECORD file itself gets an entry with empty hash and size fields
echo "$FPATH,,"
echo "\"$FPATH\",,"
else
HASH=$(openssl dgst -sha256 -binary $FPATH | openssl base64 | sed -e 's/+/-/g' | sed -e 's/\//_/g' | sed -e 's/=//g')
FSIZE=$(ls -nl $FPATH | awk '{print $5}')
echo "$FPATH,sha256=$HASH,$FSIZE"
echo "\"$FPATH\",sha256=$HASH,$FSIZE"
fi
}

View File

@ -107,17 +107,29 @@ if [[ $ROCM_INT -ge 60200 ]]; then
fi
OS_NAME=`awk -F= '/^NAME/{print $2}' /etc/os-release`
if [[ "$OS_NAME" == *"CentOS Linux"* ]]; then
if [[ "$OS_NAME" == *"CentOS Linux"* || "$OS_NAME" == *"AlmaLinux"* ]]; then
LIBGOMP_PATH="/usr/lib64/libgomp.so.1"
LIBNUMA_PATH="/usr/lib64/libnuma.so.1"
LIBELF_PATH="/usr/lib64/libelf.so.1"
LIBTINFO_PATH="/usr/lib64/libtinfo.so.5"
if [[ "$OS_NAME" == *"CentOS Linux"* ]]; then
LIBTINFO_PATH="/usr/lib64/libtinfo.so.5"
else
LIBTINFO_PATH="/usr/lib64/libtinfo.so.6"
fi
LIBDRM_PATH="/opt/amdgpu/lib64/libdrm.so.2"
LIBDRM_AMDGPU_PATH="/opt/amdgpu/lib64/libdrm_amdgpu.so.1"
if [[ $ROCM_INT -ge 60100 ]]; then
# Below libs are direct dependencies of libhipsolver
LIBSUITESPARSE_CONFIG_PATH="/lib64/libsuitesparseconfig.so.4"
LIBCHOLMOD_PATH="/lib64/libcholmod.so.2"
if [[ "$OS_NAME" == *"CentOS Linux"* ]]; then
LIBCHOLMOD_PATH="/lib64/libcholmod.so.2"
# Below libs are direct dependencies of libsatlas
LIBGFORTRAN_PATH="/lib64/libgfortran.so.3"
else
LIBCHOLMOD_PATH="/lib64/libcholmod.so.3"
# Below libs are direct dependencies of libsatlas
LIBGFORTRAN_PATH="/lib64/libgfortran.so.5"
fi
# Below libs are direct dependencies of libcholmod
LIBAMD_PATH="/lib64/libamd.so.2"
LIBCAMD_PATH="/lib64/libcamd.so.2"
@ -125,7 +137,6 @@ if [[ "$OS_NAME" == *"CentOS Linux"* ]]; then
LIBCOLAMD_PATH="/lib64/libcolamd.so.2"
LIBSATLAS_PATH="/lib64/atlas/libsatlas.so.3"
# Below libs are direct dependencies of libsatlas
LIBGFORTRAN_PATH="/lib64/libgfortran.so.3"
LIBQUADMATH_PATH="/lib64/libquadmath.so.0"
fi
MAYBE_LIB64=lib64
@ -175,9 +186,12 @@ do
OS_SO_FILES[${#OS_SO_FILES[@]}]=$file_name # Append lib to array
done
# PyTorch-version specific
# AOTriton dependency only for PyTorch >= 2.4
if (( $(echo "${PYTORCH_VERSION} 2.4" | awk '{print ($1 >= $2)}') )); then
# FIXME: Temporary until https://github.com/pytorch/pytorch/pull/137443 lands
# Install AOTriton
if [ -e ${PYTORCH_ROOT}/.ci/docker/aotriton_version.txt ]; then
cp -a ${PYTORCH_ROOT}/.ci/docker/aotriton_version.txt aotriton_version.txt
bash ${PYTORCH_ROOT}/.ci/docker/common/install_aotriton.sh ${ROCM_HOME} && rm aotriton_version.txt
export AOTRITON_INSTALLED_PREFIX=${ROCM_HOME}/aotriton
ROCM_SO_FILES+=("libaotriton_v2.so")
fi
@ -252,6 +266,20 @@ RCCL_SHARE_FILES=($(ls $RCCL_SHARE_SRC))
DEPS_AUX_SRCLIST+=(${RCCL_SHARE_FILES[@]/#/$RCCL_SHARE_SRC/})
DEPS_AUX_DSTLIST+=(${RCCL_SHARE_FILES[@]/#/$RCCL_SHARE_DST/})
# PyTorch 2.6+ (AOTriton 0.8b+)
# AKS = "AOTriton Kernel Storage", a file format to store GPU kernels compactly
if (( $(echo "${PYTORCH_VERSION} 2.6" | awk '{print ($1 >= $2)}') )); then
LIBAOTRITON_DIR=$(find "$ROCM_HOME/lib/" -name "libaotriton_v2.so" -printf '%h\n')
if [[ -z ${LIBAOTRITON_DIR} ]]; then
LIBAOTRITON_DIR=$(find "$ROCM_HOME/" -name "libaotriton_v2.so" -printf '%h\n')
fi
AKS_FILES=($(find "${LIBAOTRITON_DIR}/aotriton.images" -type f -name '*.aks?' -printf '%P\n'))
AKS_SRC="${LIBAOTRITON_DIR}/aotriton.images"
AKS_DST="lib/aotriton.images"
DEPS_AUX_SRCLIST+=(${AKS_FILES[@]/#/${AKS_SRC}/})
DEPS_AUX_DSTLIST+=(${AKS_FILES[@]/#/${AKS_DST}/})
fi
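The `${AKS_FILES[@]/#/${AKS_SRC}/}` expansions above prepend a prefix to every array element to build the source and destination lists. A Python equivalent with assumed file names, purely to illustrate the idiom:

```
aks_files = ["gfx90a/attn_fwd.aks2", "gfx942/attn_fwd.aks2"]   # assumed entries
aks_src = "/opt/rocm/lib/aotriton.images"
aks_dst = "lib/aotriton.images"

deps_aux_srclist = [f"{aks_src}/{f}" for f in aks_files]   # like ${AKS_FILES[@]/#/${AKS_SRC}/}
deps_aux_dstlist = [f"{aks_dst}/{f}" for f in aks_files]   # like ${AKS_FILES[@]/#/${AKS_DST}/}
print(deps_aux_srclist, deps_aux_dstlist, sep="\n")
```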
echo "PYTORCH_ROCM_ARCH: ${PYTORCH_ROCM_ARCH}"
SCRIPTPATH="$( cd "$(dirname "$0")" ; pwd -P )"

View File

@ -1,6 +1,6 @@
#!/bin/bash
set -ex
set -ex -o pipefail
# Required environment variable: $BUILD_ENVIRONMENT
# (This is set by default in the Docker images we build, so you don't
@ -87,7 +87,7 @@ else
# Workaround required for MKL library linkage
# https://github.com/pytorch/pytorch/issues/119557
if [ "$ANACONDA_PYTHON_VERSION" = "3.12" ]; then
if [[ "$ANACONDA_PYTHON_VERSION" = "3.12" || "$ANACONDA_PYTHON_VERSION" = "3.13" ]]; then
export CMAKE_LIBRARY_PATH="/opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/lib/"
export CMAKE_INCLUDE_PATH="/opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/include/"
fi
@ -191,7 +191,7 @@ fi
# We only build FlashAttention files for CUDA 8.0+, and they require large amounts of
# memory to build and will OOM
if [[ "$BUILD_ENVIRONMENT" == *cuda* ]] && [[ "$TORCH_CUDA_ARCH_LIST" == *"8.6"* || "$TORCH_CUDA_ARCH_LIST" == *"8.0"* ]]; then
if [[ "$BUILD_ENVIRONMENT" == *cuda* ]] && [[ 1 -eq $(echo "${TORCH_CUDA_ARCH_LIST} >= 8.0" | bc) ]]; then
echo "WARNING: FlashAttention files require large amounts of memory to build and will OOM"
echo "Setting MAX_JOBS=(nproc-2)/3 to reduce memory usage"
export MAX_JOBS="$(( $(nproc --ignore=2) / 3 ))"
@ -230,7 +230,7 @@ fi
# Do not change workspace permissions for ROCm CI jobs
# as it can leave workspace with bad permissions for cancelled jobs
if [[ "$BUILD_ENVIRONMENT" != *rocm* && "$BUILD_ENVIRONMENT" != *s390x* ]]; then
if [[ "$BUILD_ENVIRONMENT" != *rocm* && "$BUILD_ENVIRONMENT" != *s390x* && -d /var/lib/jenkins/workspace ]]; then
# Workaround for dind-rootless userid mapping (https://github.com/pytorch/ci-infra/issues/96)
WORKSPACE_ORIGINAL_OWNER_ID=$(stat -c '%u' "/var/lib/jenkins/workspace")
cleanup_workspace() {
@ -250,7 +250,6 @@ if [[ "$BUILD_ENVIRONMENT" == *-bazel-* ]]; then
set -e
get_bazel
install_sccache_nvcc_for_bazel
# Leave 1 CPU free and use only up to 80% of memory to reduce the chance of crashing
# the runner
@ -396,7 +395,7 @@ if [[ "$BUILD_ENVIRONMENT" != *libtorch* && "$BUILD_ENVIRONMENT" != *bazel* ]];
# don't do this for libtorch as libtorch is C++ only and thus won't have python tests run on its build
python tools/stats/export_test_times.py
fi
if [[ "$BUILD_ENVIRONMENT" != *s390x* ]]; then
# don't do this for bazel or s390x as they don't use sccache
if [[ "$BUILD_ENVIRONMENT" != *s390x* && "$BUILD_ENVIRONMENT" != *-bazel-* ]]; then
print_sccache_stats
fi

394  .ci/pytorch/check_binary.sh  Executable file
View File

@ -0,0 +1,394 @@
#!/bin/bash
# shellcheck disable=SC2086,SC2006,SC2207,SC2076,SC2155,SC2046,SC1091,SC2143
# TODO: Re-enable shellchecks above
set -eux -o pipefail
# This script checks the following things on binaries
# 1. The gcc abi matches DESIRED_DEVTOOLSET
# 2. MacOS binaries do not link against OpenBLAS
# 3. There are no protobuf symbols of any sort anywhere (turned off, because
# this is currently not true)
# 4. Standard Python imports work
# 5. MKL is available everywhere except for MacOS wheels
# 6. XNNPACK is available everywhere except for MacOS wheels
# 7. CUDA is setup correctly and does not hang
# 8. Magma is available for CUDA builds
# 9. CuDNN is available for CUDA builds
#
# This script needs the env variables DESIRED_PYTHON, DESIRED_CUDA,
# DESIRED_DEVTOOLSET and PACKAGE_TYPE
#
# This script expects PyTorch to be installed into the active Python (the
# Python returned by `which python`). Or, if this is testing a libtorch
# Pythonless binary, then it expects to be in the root folder of the unzipped
# libtorch package.
if [[ -z ${DESIRED_PYTHON:-} ]]; then
export DESIRED_PYTHON=${MATRIX_PYTHON_VERSION:-}
fi
if [[ -z ${DESIRED_CUDA:-} ]]; then
export DESIRED_CUDA=${MATRIX_DESIRED_CUDA:-}
fi
if [[ -z ${DESIRED_DEVTOOLSET:-} ]]; then
export DESIRED_DEVTOOLSET=${MATRIX_DESIRED_DEVTOOLSET:-}
fi
if [[ -z ${PACKAGE_TYPE:-} ]]; then
export PACKAGE_TYPE=${MATRIX_PACKAGE_TYPE:-}
fi
# The install root depends on both the package type and the os
# All MacOS packages use conda, even for the wheel packages.
if [[ "$PACKAGE_TYPE" == libtorch ]]; then
# NOTE: Only $PWD works on both CentOS and Ubuntu
export install_root="$PWD"
else
if [[ $DESIRED_PYTHON =~ ([0-9].[0-9]+)t ]]; then
# For free-threaded python (maj.min followed by 't'), keep the original version
py_dot="$DESIRED_PYTHON"
elif [[ $DESIRED_PYTHON =~ ([0-9].[0-9]+) ]]; then
# Strip everything but major.minor from DESIRED_PYTHON version
py_dot="${BASH_REMATCH[0]}"
else
echo "Unexpected ${DESIRED_PYTHON} format"
exit 1
fi
export install_root="$(dirname $(which python))/../lib/python${py_dot}/site-packages/torch/"
fi
###############################################################################
# Setup XPU ENV
###############################################################################
if [[ "$DESIRED_CUDA" == 'xpu' ]]; then
set +u
# Refer https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpus.html
source /opt/intel/oneapi/compiler/latest/env/vars.sh
source /opt/intel/oneapi/pti/latest/env/vars.sh
fi
###############################################################################
# Check GCC ABI
###############################################################################
# NOTE [ Building libtorch with old vs. new gcc ABI ]
#
# Packages built with one version of ABI could not be linked against by client
# C++ libraries that were compiled using the other version of ABI. Since both
# gcc ABIs are still common in the wild, we need to support both ABIs. Currently:
#
# - All the nightlies built on CentOS 7 + devtoolset7 use the old gcc ABI.
# - All the nightlies built on Ubuntu 16.04 + gcc 5.4 use the new gcc ABI.
echo "Checking that the gcc ABI is what we expect"
if [[ "$(uname)" != 'Darwin' ]]; then
function is_expected() {
if [[ "$DESIRED_DEVTOOLSET" == *"cxx11-abi"* || "$DESIRED_CUDA" == *"rocm"* ]]; then
if [[ "$1" -gt 0 || "$1" == "ON " ]]; then
echo 1
fi
else
if [[ -z "$1" || "$1" == 0 || "$1" == "OFF" ]]; then
echo 1
fi
fi
}
# First we check that the env var in TorchConfig.cmake is correct
# We search for D_GLIBCXX_USE_CXX11_ABI=1 in torch/TorchConfig.cmake
torch_config="${install_root}/share/cmake/Torch/TorchConfig.cmake"
if [[ ! -f "$torch_config" ]]; then
echo "No TorchConfig.cmake found!"
ls -lah "$install_root/share/cmake/Torch"
exit 1
fi
echo "Checking the TorchConfig.cmake"
cat "$torch_config"
# The sed call below is
# don't print lines by default (only print the line we want)
# -n
# execute the following expression
# e
# replace lines that match with the first capture group and print
# s/.*D_GLIBCXX_USE_CXX11_ABI=\(.\)".*/\1/p
# any characters, D_GLIBCXX_USE_CXX11_ABI=, exactly one any character, a
# quote, any characters
# Note the exactly one single character after the '='. In the case that the
# variable is not set the '=' will be followed by a '"' immediately and the
# line will fail the match and nothing will be printed; this is what we
# want. Otherwise it will capture the 0 or 1 after the '='.
# /.*D_GLIBCXX_USE_CXX11_ABI=\(.\)".*/
# replace the matched line with the capture group and print
# /\1/p
actual_gcc_abi="$(sed -ne 's/.*D_GLIBCXX_USE_CXX11_ABI=\(.\)".*/\1/p' < "$torch_config")"
if [[ "$(is_expected "$actual_gcc_abi")" != 1 ]]; then
echo "gcc ABI $actual_gcc_abi not as expected."
exit 1
fi
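The sed expression above pulls out the single character following `D_GLIBCXX_USE_CXX11_ABI=` in TorchConfig.cmake. A small Python sketch of the same extraction on a hypothetical line:

```
import re

# Hypothetical TorchConfig.cmake line; real files set the flag inside TORCH_CXX_FLAGS
sample = 'set(TORCH_CXX_FLAGS "-D_GLIBCXX_USE_CXX11_ABI=1")'
match = re.search(r'D_GLIBCXX_USE_CXX11_ABI=(.)"', sample)
print(match.group(1) if match else "")   # "1" here; empty string when the flag is unset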
# We also check that there are [not] cxx11 symbols in libtorch
#
echo "Checking that symbols in libtorch.so have the right gcc abi"
python3 "$(dirname ${BASH_SOURCE[0]})/smoke_test/check_binary_symbols.py"
echo "cxx11 symbols seem to be in order"
fi # if on Darwin
###############################################################################
# Check for no OpenBLAS
# TODO Check for no Protobuf symbols (not finished)
# Print *all* runtime dependencies
###############################################################################
# We have to loop through all shared libraries for this
if [[ "$(uname)" == 'Darwin' ]]; then
all_dylibs=($(find "$install_root" -name '*.dylib'))
for dylib in "${all_dylibs[@]}"; do
echo "All dependencies of $dylib are $(otool -L $dylib) with rpath $(otool -l $dylib | grep LC_RPATH -A2)"
# Check that OpenBlas is not linked to on Macs
echo "Checking the OpenBLAS is not linked to"
if [[ -n "$(otool -L $dylib | grep -i openblas)" ]]; then
echo "ERROR: Found openblas as a dependency of $dylib"
echo "Full dependencies is: $(otool -L $dylib)"
exit 1
fi
# Check for protobuf symbols
#proto_symbols="$(nm $dylib | grep protobuf)" || true
#if [[ -n "$proto_symbols" ]]; then
# echo "ERROR: Detected protobuf symbols in $dylib"
# echo "Symbols are $proto_symbols"
# exit 1
#fi
done
else
all_libs=($(find "$install_root" -name '*.so'))
for lib in "${all_libs[@]}"; do
echo "All dependencies of $lib are $(ldd $lib) with runpath $(objdump -p $lib | grep RUNPATH)"
# Check for protobuf symbols
#proto_symbols=$(nm $lib | grep protobuf) || true
#if [[ -n "$proto_symbols" ]]; then
# echo "ERROR: Detected protobuf symbols in $lib"
# echo "Symbols are $proto_symbols"
# exit 1
#fi
done
fi
setup_link_flags () {
REF_LIB="-Wl,-R${install_root}/lib"
if [[ "$(uname)" == 'Darwin' ]]; then
REF_LIB="-Wl,-rpath ${install_root}/lib"
fi
ADDITIONAL_LINKER_FLAGS=""
if [[ "$(uname)" == 'Linux' ]]; then
ADDITIONAL_LINKER_FLAGS="-Wl,--no-as-needed"
fi
C10_LINK_FLAGS=""
if [ -f "${install_root}/lib/libc10.so" ] || [ -f "${install_root}/lib/libc10.dylib" ]; then
C10_LINK_FLAGS="-lc10"
fi
TORCH_CPU_LINK_FLAGS=""
if [ -f "${install_root}/lib/libtorch_cpu.so" ] || [ -f "${install_root}/lib/libtorch_cpu.dylib" ]; then
TORCH_CPU_LINK_FLAGS="-ltorch_cpu"
fi
TORCH_CUDA_LINK_FLAGS=""
if [ -f "${install_root}/lib/libtorch_cuda.so" ] || [ -f "${install_root}/lib/libtorch_cuda.dylib" ]; then
TORCH_CUDA_LINK_FLAGS="-ltorch_cuda"
elif [ -f "${install_root}/lib/libtorch_cuda_cpp.so" ] && [ -f "${install_root}/lib/libtorch_cuda_cpp.so" ] || \
[ -f "${install_root}/lib/libtorch_cuda_cu.dylib" ] && [ -f "${install_root}/lib/libtorch_cuda_cu.dylib" ]; then
TORCH_CUDA_LINK_FLAGS="-ltorch_cuda_cpp -ltorch_cuda_cu"
fi
}
TEST_CODE_DIR="$(dirname $(realpath ${BASH_SOURCE[0]}))/test_example_code"
build_and_run_example_cpp () {
if [[ "$DESIRED_DEVTOOLSET" == *"cxx11-abi"* ]]; then
GLIBCXX_USE_CXX11_ABI=1
else
GLIBCXX_USE_CXX11_ABI=0
fi
setup_link_flags
g++ ${TEST_CODE_DIR}/$1.cpp -I${install_root}/include -I${install_root}/include/torch/csrc/api/include -D_GLIBCXX_USE_CXX11_ABI=$GLIBCXX_USE_CXX11_ABI -std=gnu++17 -L${install_root}/lib ${REF_LIB} ${ADDITIONAL_LINKER_FLAGS} -ltorch $TORCH_CPU_LINK_FLAGS $TORCH_CUDA_LINK_FLAGS $C10_LINK_FLAGS -o $1
./$1
}
build_example_cpp_with_incorrect_abi () {
if [[ "$DESIRED_DEVTOOLSET" == *"cxx11-abi"* ]]; then
GLIBCXX_USE_CXX11_ABI=0
else
GLIBCXX_USE_CXX11_ABI=1
fi
set +e
setup_link_flags
g++ ${TEST_CODE_DIR}/$1.cpp -I${install_root}/include -I${install_root}/include/torch/csrc/api/include -D_GLIBCXX_USE_CXX11_ABI=$GLIBCXX_USE_CXX11_ABI -std=gnu++17 -L${install_root}/lib ${REF_LIB} ${ADDITIONAL_LINKER_FLAGS} -ltorch $TORCH_CPU_LINK_FLAGS $TORCH_CUDA_LINK_FLAGS $C10_LINK_FLAGS -o $1
ERRCODE=$?
set -e
if [ "$ERRCODE" -eq "0" ]; then
echo "Building example with incorrect ABI didn't throw error. Aborting."
exit 1
else
echo "Building example with incorrect ABI throws expected error. Proceeding."
fi
}
###############################################################################
# Check simple Python/C++ calls
###############################################################################
if [[ "$PACKAGE_TYPE" == 'libtorch' ]]; then
# NS: Set LD_LIBRARY_PATH for CUDA builds, but perhaps it should be removed
if [[ "$DESIRED_CUDA" == "cu"* ]]; then
export LD_LIBRARY_PATH=/usr/local/cuda/lib64
fi
build_and_run_example_cpp simple-torch-test
# `_GLIBCXX_USE_CXX11_ABI` is always ignored by gcc in devtoolset7, so we test
# the expected failure case for Ubuntu 16.04 + gcc 5.4 only.
if [[ "$DESIRED_DEVTOOLSET" == *"cxx11-abi"* ]]; then
build_example_cpp_with_incorrect_abi simple-torch-test
fi
else
pushd /tmp
python -c 'import torch'
popd
fi
###############################################################################
# Check torch.git_version
###############################################################################
if [[ "$PACKAGE_TYPE" != 'libtorch' ]]; then
pushd /tmp
python -c 'import torch; assert torch.version.git_version != "Unknown"'
python -c 'import torch; assert torch.version.git_version != None'
popd
fi
###############################################################################
# Check for MKL
###############################################################################
if [[ "$PACKAGE_TYPE" == 'libtorch' ]]; then
echo "Checking that MKL is available"
build_and_run_example_cpp check-torch-mkl
elif [[ "$(uname -m)" != "arm64" && "$(uname -m)" != "s390x" ]]; then
if [[ "$(uname)" != 'Darwin' || "$PACKAGE_TYPE" != *wheel ]]; then
if [[ "$(uname -m)" == "aarch64" ]]; then
echo "Checking that MKLDNN is available on aarch64"
pushd /tmp
python -c 'import torch; exit(0 if torch.backends.mkldnn.is_available() else 1)'
popd
else
echo "Checking that MKL is available"
pushd /tmp
python -c 'import torch; exit(0 if torch.backends.mkl.is_available() else 1)'
popd
fi
fi
fi
###############################################################################
# Check for XNNPACK
###############################################################################
if [[ "$PACKAGE_TYPE" == 'libtorch' ]]; then
echo "Checking that XNNPACK is available"
build_and_run_example_cpp check-torch-xnnpack
else
if [[ "$(uname)" != 'Darwin' || "$PACKAGE_TYPE" != *wheel ]] && [[ "$(uname -m)" != "s390x" ]]; then
echo "Checking that XNNPACK is available"
pushd /tmp
python -c 'import torch.backends.xnnpack; exit(0 if torch.backends.xnnpack.enabled else 1)'
popd
fi
fi
###############################################################################
# Check CUDA configured correctly
###############################################################################
# Skip these for Windows machines without GPUs
if [[ "$OSTYPE" == "msys" ]]; then
GPUS=$(wmic path win32_VideoController get name)
if [[ ! "$GPUS" == *NVIDIA* ]]; then
echo "Skip CUDA tests for machines without a Nvidia GPU card"
exit 0
fi
fi
# Test that CUDA builds are setup correctly
if [[ "$DESIRED_CUDA" != 'cpu' && "$DESIRED_CUDA" != 'xpu' && "$DESIRED_CUDA" != 'cpu-cxx11-abi' && "$DESIRED_CUDA" != *"rocm"* && "$(uname -m)" != "s390x" ]]; then
if [[ "$PACKAGE_TYPE" == 'libtorch' ]]; then
build_and_run_example_cpp check-torch-cuda
else
pushd /tmp
echo "Checking that CUDA archs are setup correctly"
timeout 20 python -c 'import torch; torch.randn([3,5]).cuda()'
# These have to run after CUDA is initialized
echo "Checking that magma is available"
python -c 'import torch; torch.rand(1).cuda(); exit(0 if torch.cuda.has_magma else 1)'
echo "Checking that CuDNN is available"
python -c 'import torch; exit(0 if torch.backends.cudnn.is_available() else 1)'
# Validates builds is free of linker regressions reported in https://github.com/pytorch/pytorch/issues/57744
echo "Checking that exception handling works"
python -c "import torch; from unittest import TestCase;TestCase().assertRaises(RuntimeError, lambda:torch.eye(7, 7, device='cuda:7'))"
echo "Checking that basic RNN works"
python ${TEST_CODE_DIR}/rnn_smoke.py
echo "Checking that basic CNN works"
python "${TEST_CODE_DIR}/cnn_smoke.py"
echo "Test that linalg works"
python -c "import torch;x=torch.rand(3,3,device='cuda');print(torch.linalg.svd(torch.mm(x.t(), x)))"
popd
fi # if libtorch
fi # if cuda
##########################
# Run parts of smoke tests
##########################
if [[ "$PACKAGE_TYPE" != 'libtorch' ]]; then
pushd "$(dirname ${BASH_SOURCE[0]})/smoke_test"
python -c "from smoke_test import test_linalg; test_linalg()"
if [[ "$DESIRED_CUDA" == *cuda* ]]; then
python -c "from smoke_test import test_linalg; test_linalg('cuda')"
fi
popd
fi
###############################################################################
# Check PyTorch supports TCP_TLS gloo transport
###############################################################################
if [[ "$(uname)" == 'Linux' && "$PACKAGE_TYPE" != 'libtorch' ]]; then
GLOO_CHECK="import torch.distributed as dist
try:
dist.init_process_group('gloo', rank=0, world_size=1)
except RuntimeError as e:
print(e)
"
RESULT=`GLOO_DEVICE_TRANSPORT=TCP_TLS MASTER_ADDR=localhost MASTER_PORT=63945 python -c "$GLOO_CHECK"`
GLOO_TRANSPORT_IS_NOT_SUPPORTED='gloo transport is not supported'
if [[ "$RESULT" =~ "$GLOO_TRANSPORT_IS_NOT_SUPPORTED" ]]; then
echo "PyTorch doesn't support TLS_TCP transport, please build with USE_GLOO_WITH_OPENSSL=1"
exit 1
fi
fi
###############################################################################
# Check for C++ ABI compatibility between gcc7 and gcc9 compiled binaries
###############################################################################
if [[ "$(uname)" == 'Linux' && ("$PACKAGE_TYPE" == 'conda' || "$PACKAGE_TYPE" == 'manywheel')]]; then
pushd /tmp
python -c "import torch; exit(0 if torch.compiled_with_cxx11_abi() else (0 if torch._C._PYBIND11_BUILD_ABI == '_cxxabi1011' else 1))"
popd
fi

View File

@ -111,26 +111,6 @@ function get_bazel() {
chmod u+x tools/bazel
}
# This function is bazel specific because of the bug
# in the bazel that requires some special paths massaging
# as a workaround. See
# https://github.com/bazelbuild/bazel/issues/10167
function install_sccache_nvcc_for_bazel() {
sudo mv /usr/local/cuda/bin/nvcc /usr/local/cuda/bin/nvcc-real
# Write the `/usr/local/cuda/bin/nvcc`
cat << EOF | sudo tee /usr/local/cuda/bin/nvcc
#!/bin/sh
if [ \$(env -u LD_PRELOAD ps -p \$PPID -o comm=) != sccache ]; then
exec sccache /usr/local/cuda/bin/nvcc "\$@"
else
exec external/local_cuda/cuda/bin/nvcc-real "\$@"
fi
EOF
sudo chmod +x /usr/local/cuda/bin/nvcc
}
function install_monkeytype {
# Install MonkeyType
pip_install MonkeyType
@ -212,7 +192,7 @@ function install_torchrec_and_fbgemm() {
function clone_pytorch_xla() {
if [[ ! -d ./xla ]]; then
git clone --recursive --quiet https://github.com/pytorch/xla.git
git clone --recursive -b r2.6 https://github.com/pytorch/xla.git
pushd xla
# pin the xla hash so that we don't get broken by changes to xla
git checkout "$(cat ../.github/ci_commit_pins/xla.txt)"

View File

@ -190,11 +190,19 @@ test_torchbench_perf() {
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
local backend=eager
local dtype=notset
local device=mps
echo "Setup complete, launching torchbench training performance run"
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py --performance --backend eager --training --devices mps --output "$TEST_REPORTS_DIR/torchbench_training.csv"
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py \
--performance --backend "$backend" --training --devices "$device" \
--output "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_training_${device}_performance.csv"
echo "Launching torchbench inference performance run"
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py --performance --backend eager --inference --devices mps --output "$TEST_REPORTS_DIR/torchbench_training.csv"
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py \
--performance --backend "$backend" --inference --devices "$device" \
--output "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_inference_${device}_performance.csv"
echo "Pytorch benchmark on mps device completed"
}
@ -209,26 +217,27 @@ test_torchbench_smoketest() {
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
touch "$TEST_REPORTS_DIR"/torchbench_training.csv
touch "$TEST_REPORTS_DIR"/torchbench_inference.csv
local backend=eager
local dtype=notset
local device=mps
touch "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_training_${device}_performance.csv"
touch "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_inference_${device}_performance.csv"
echo "Setup complete, launching torchbench training performance run"
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py --performance --only hf_T5 --backend eager --training --devices mps --output "$TEST_REPORTS_DIR/torchbench_training.csv"
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py --performance --only llama --backend eager --training --devices mps --output "$TEST_REPORTS_DIR/torchbench_training.csv"
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py --performance --only BERT_pytorch --backend eager --training --devices mps --output "$TEST_REPORTS_DIR/torchbench_training.csv"
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py --performance --only dcgan --backend eager --training --devices mps --output "$TEST_REPORTS_DIR/torchbench_training.csv"
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py --performance --only hf_GPT2 --backend eager --training --devices mps --output "$TEST_REPORTS_DIR/torchbench_training.csv"
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py --performance --only yolov3 --backend eager --training --devices mps --output "$TEST_REPORTS_DIR/torchbench_training.csv"
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py --performance --only resnet152 --backend eager --training --devices mps --output "$TEST_REPORTS_DIR/torchbench_training.csv"
for model in hf_T5 llama BERT_pytorch dcgan hf_GPT2 yolov3 resnet152; do
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py \
--performance --only "$model" --backend "$backend" --training --devices "$device" \
--output "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_training_${device}_performance.csv"
done
echo "Launching torchbench inference performance run"
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py --performance --only hf_T5 --backend eager --inference --devices mps --output "$TEST_REPORTS_DIR/torchbench_inference.csv"
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py --performance --only llama --backend eager --inference --devices mps --output "$TEST_REPORTS_DIR/torchbench_inference.csv"
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py --performance --only BERT_pytorch --backend eager --inference --devices mps --output "$TEST_REPORTS_DIR/torchbench_inference.csv"
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py --performance --only dcgan --backend eager --inference --devices mps --output "$TEST_REPORTS_DIR/torchbench_inference.csv"
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py --performance --only hf_GPT2 --backend eager --inference --devices mps --output "$TEST_REPORTS_DIR/torchbench_inference.csv"
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py --performance --only yolov3 --backend eager --inference --devices mps --output "$TEST_REPORTS_DIR/torchbench_inference.csv"
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py --performance --only resnet152 --backend eager --inference --devices mps --output "$TEST_REPORTS_DIR/torchbench_inference.csv"
for model in hf_T5 llama BERT_pytorch dcgan hf_GPT2 yolov3 resnet152; do
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py \
--performance --only "$model" --backend "$backend" --inference --devices "$device" \
--output "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_inference_${device}_performance.csv"
done
echo "Pytorch benchmark on mps device completed"
}
@ -267,25 +276,6 @@ test_timm_perf() {
install_tlparse
if [[ $TEST_CONFIG == *"test_mps"* ]]; then
if [[ $NUM_TEST_SHARDS -gt 1 ]]; then
test_python_shard "${SHARD_NUMBER}"
if [[ "${SHARD_NUMBER}" == 1 ]]; then
test_libtorch
test_custom_script_ops
elif [[ "${SHARD_NUMBER}" == 2 ]]; then
test_jit_hooks
test_custom_backend
fi
else
test_python_all
test_libtorch
test_custom_script_ops
test_jit_hooks
test_custom_backend
fi
fi
if [[ $TEST_CONFIG == *"perf_all"* ]]; then
test_torchbench_perf
test_hf_perf
@ -298,4 +288,19 @@ elif [[ $TEST_CONFIG == *"perf_timm"* ]]; then
test_timm_perf
elif [[ $TEST_CONFIG == *"perf_smoketest"* ]]; then
test_torchbench_smoketest
elif [[ $NUM_TEST_SHARDS -gt 1 ]]; then
test_python_shard "${SHARD_NUMBER}"
if [[ "${SHARD_NUMBER}" == 1 ]]; then
test_libtorch
test_custom_script_ops
elif [[ "${SHARD_NUMBER}" == 2 ]]; then
test_jit_hooks
test_custom_backend
fi
else
test_python_all
test_libtorch
test_custom_script_ops
test_jit_hooks
test_custom_backend
fi

436  .ci/pytorch/run_tests.sh  Executable file
View File

@ -0,0 +1,436 @@
#!/bin/bash
# shellcheck disable=SC2086,SC2048,SC2068,SC2145,SC2034,SC2207,SC2143
# TODO: Re-enable shellchecks above
set -eux -o pipefail
# Essentially runs pytorch/test/run_test.py, but keeps track of which tests to
# skip in a centralized place.
#
# TODO Except for a few tests, this entire file is a giant TODO. Why are these
# tests failing?
# TODO deal with Windows
# This script expects to be in the pytorch root folder
if [[ ! -d 'test' || ! -f 'test/run_test.py' ]]; then
echo "builder/test.sh expects to be run from the Pytorch root directory " \
"but I'm actually in $(pwd)"
exit 2
fi
# Allow master skip of all tests
if [[ -n "${SKIP_ALL_TESTS:-}" ]]; then
exit 0
fi
# If given specific test params then just run those
if [[ -n "${RUN_TEST_PARAMS:-}" ]]; then
echo "$(date) :: Calling user-command $(pwd)/test/run_test.py ${RUN_TEST_PARAMS[@]}"
python test/run_test.py ${RUN_TEST_PARAMS[@]}
exit 0
fi
# Function to retry functions that sometimes timeout or have flaky failures
retry () {
$* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
}
# Parameters
##############################################################################
if [[ "$#" != 3 ]]; then
if [[ -z "${DESIRED_PYTHON:-}" || -z "${DESIRED_CUDA:-}" || -z "${PACKAGE_TYPE:-}" ]]; then
echo "USAGE: run_tests.sh PACKAGE_TYPE DESIRED_PYTHON DESIRED_CUDA"
echo "The env variable PACKAGE_TYPE must be set to 'conda' or 'manywheel' or 'libtorch'"
echo "The env variable DESIRED_PYTHON must be set like '2.7mu' or '3.6m' etc"
echo "The env variable DESIRED_CUDA must be set like 'cpu' or 'cu80' etc"
exit 1
fi
package_type="$PACKAGE_TYPE"
py_ver="$DESIRED_PYTHON"
cuda_ver="$DESIRED_CUDA"
else
package_type="$1"
py_ver="$2"
cuda_ver="$3"
fi
if [[ "$cuda_ver" == 'cpu-cxx11-abi' ]]; then
cuda_ver="cpu"
fi
# cu80, cu90, cu100, cpu
if [[ ${#cuda_ver} -eq 4 ]]; then
cuda_ver_majmin="${cuda_ver:2:1}.${cuda_ver:3:1}"
elif [[ ${#cuda_ver} -eq 5 ]]; then
cuda_ver_majmin="${cuda_ver:2:2}.${cuda_ver:4:1}"
fi
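The substring arithmetic above turns a `cuXY`/`cuXYZ` tag into a dotted major.minor version. A quick sketch of the same mapping for a couple of assumed inputs:

```
for cuda_ver in ("cu92", "cu118"):
    digits = cuda_ver[2:]
    cuda_ver_majmin = f"{digits[:-1]}.{digits[-1]}"
    print(cuda_ver, "->", cuda_ver_majmin)   # cu92 -> 9.2, cu118 -> 11.8
```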
NUMPY_PACKAGE=""
if [[ ${py_ver} == "3.10" ]]; then
PROTOBUF_PACKAGE="protobuf>=3.17.2"
NUMPY_PACKAGE="numpy>=1.21.2"
else
PROTOBUF_PACKAGE="protobuf=3.14.0"
fi
# Environment initialization
if [[ "$(uname)" == Darwin ]]; then
# Install the testing dependencies
retry conda install -yq future hypothesis ${NUMPY_PACKAGE} ${PROTOBUF_PACKAGE} pytest setuptools six typing_extensions pyyaml
else
retry pip install -qr requirements.txt || true
retry pip install -q hypothesis protobuf pytest setuptools || true
numpy_ver=1.15
case "$(python --version 2>&1)" in
*2* | *3.5* | *3.6*)
numpy_ver=1.11
;;
esac
retry pip install -q "numpy==${numpy_ver}" || true
fi
echo "Testing with:"
pip freeze
conda list || true
##############################################################################
# Smoke tests
##############################################################################
# TODO use check_binary.sh, which requires making sure it runs on Windows
pushd /
echo "Smoke testing imports"
python -c 'import torch'
# Test that MKL is there
if [[ "$(uname)" == 'Darwin' && "$package_type" == *wheel ]]; then
echo 'Not checking for MKL on Darwin wheel packages'
else
echo "Checking that MKL is available"
python -c 'import torch; exit(0 if torch.backends.mkl.is_available() else 1)'
fi
if [[ "$OSTYPE" == "msys" ]]; then
GPUS=$(wmic path win32_VideoController get name)
if [[ ! "$GPUS" == *NVIDIA* ]]; then
echo "Skip CUDA tests for machines without a Nvidia GPU card"
exit 0
fi
fi
# Test that the version number is consistent during building and testing
if [[ "$PYTORCH_BUILD_NUMBER" -gt 1 ]]; then
expected_version="${PYTORCH_BUILD_VERSION}.post${PYTORCH_BUILD_NUMBER}"
else
expected_version="${PYTORCH_BUILD_VERSION}"
fi
echo "Checking that we are testing the package that is just built"
python -c "import torch; exit(0 if torch.__version__ == '$expected_version' else 1)"
# Test that CUDA builds are setup correctly
if [[ "$cuda_ver" != 'cpu' ]]; then
cuda_installed=1
nvidia-smi || cuda_installed=0
if [[ "$cuda_installed" == 0 ]]; then
echo "Skip CUDA tests for machines without a Nvidia GPU card"
else
# Test CUDA archs
echo "Checking that CUDA archs are setup correctly"
timeout 20 python -c 'import torch; torch.randn([3,5]).cuda()'
# These have to run after CUDA is initialized
echo "Checking that magma is available"
python -c 'import torch; torch.rand(1).cuda(); exit(0 if torch.cuda.has_magma else 1)'
echo "Checking that CuDNN is available"
python -c 'import torch; exit(0 if torch.backends.cudnn.is_available() else 1)'
fi
fi
# Check that OpenBlas is not linked to on MacOS
if [[ "$(uname)" == 'Darwin' ]]; then
echo "Checking the OpenBLAS is not linked to"
all_dylibs=($(find "$(python -c "import site; print(site.getsitepackages()[0])")"/torch -name '*.dylib'))
for dylib in "${all_dylibs[@]}"; do
if [[ -n "$(otool -L $dylib | grep -i openblas)" ]]; then
echo "Found openblas as a dependency of $dylib"
echo "Full dependencies is: $(otool -L $dylib)"
exit 1
fi
done
echo "Checking that OpenMP is available"
python -c "import torch; exit(0 if torch.backends.openmp.is_available() else 1)"
fi
popd
# TODO re-enable the other tests after the nightlies are moved to CI. This is
# because the binaries keep breaking, often from additional tests, that aren't
# real problems. Once these are on circleci and a smoke-binary-build is added
# to PRs then this should stop happening and these can be re-enabled.
echo "Not running unit tests. Hopefully these problems are caught by CI"
exit 0
##############################################################################
# Running unit tests (except not right now)
##############################################################################
echo "$(date) :: Starting tests for $package_type package for python$py_ver and $cuda_ver"
# We keep track of the exact tests to skip, as otherwise we would hardly be running
# any tests. But because of issues working with pytest vs. normal python test runs, and because
# of special snowflake tests in test/run_test.py, we also take special care of
# those
tests_to_skip=()
#
# Entire file exclusions
##############################################################################
entire_file_exclusions=("-x")
# cpp_extensions doesn't work with pytest, so we exclude it from the pytest run
# here and then manually run it later. Note that this is only because this
# entire_file_exclusions flag is only passed to the pytest run
entire_file_exclusions+=("cpp_extensions")
# TODO temporary line to fix next days nightlies, but should be removed when
# issue is fixed
entire_file_exclusions+=('type_info')
if [[ "$cuda_ver" == 'cpu' ]]; then
# test/test_cuda.py exits early if the installed torch is not built with
# CUDA, but the exit doesn't work when running with pytest, so pytest will
# still try to run all the CUDA tests and then fail
entire_file_exclusions+=("cuda")
entire_file_exclusions+=("nccl")
fi
if [[ "$(uname)" == 'Darwin' || "$OSTYPE" == "msys" ]]; then
# pytest on Mac doesn't like the exits in these files
entire_file_exclusions+=('c10d')
entire_file_exclusions+=('distributed')
# pytest doesn't mind the exit but fails the tests. On Mac we run this
# later without pytest
entire_file_exclusions+=('thd_distributed')
fi
#
# Universal flaky tests
##############################################################################
# RendezvousEnvTest sometimes hangs forever
# Otherwise it will fail on CUDA with
# Traceback (most recent call last):
# File "test_c10d.py", line 179, in test_common_errors
# next(gen)
# AssertionError: ValueError not raised
tests_to_skip+=('RendezvousEnvTest and test_common_errors')
# This hung forever once on conda_3.5_cu92
tests_to_skip+=('TestTorch and test_sum_dim')
# test_trace_warn isn't actually flaky, but it doesn't work with pytest so we
# just skip it
tests_to_skip+=('TestJit and test_trace_warn')
#
# Python specific flaky tests
##############################################################################
# test_dataloader.py:721: AssertionError
# looks like a timeout, but interestingly only appears on python 3
if [[ "$py_ver" == 3* ]]; then
tests_to_skip+=('TestDataLoader and test_proper_exit')
fi
#
# CUDA flaky tests, all package types
##############################################################################
if [[ "$cuda_ver" != 'cpu' ]]; then
#
# DistributedDataParallelTest
# All of these seem to fail
tests_to_skip+=('DistributedDataParallelTest')
#
# RendezvousEnvTest
# Traceback (most recent call last):
# File "test_c10d.py", line 201, in test_nominal
# store0, rank0, size0 = next(gen0)
# File "/opt/python/cp36-cp36m/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 131, in _env_rendezvous_handler
# store = TCPStore(master_addr, master_port, start_daemon)
# RuntimeError: Address already in use
tests_to_skip+=('RendezvousEnvTest and test_nominal')
#
# TestCppExtension
#
# Traceback (most recent call last):
# File "test_cpp_extensions.py", line 134, in test_jit_cudnn_extension
# with_cuda=True)
# File "/opt/python/cp35-cp35m/lib/python3.5/site-packages/torch/utils/cpp_extension.py", line 552, in load
# with_cuda)
# File "/opt/python/cp35-cp35m/lib/python3.5/site-packages/torch/utils/cpp_extension.py", line 729, in _jit_compile
# return _import_module_from_library(name, build_directory)
# File "/opt/python/cp35-cp35m/lib/python3.5/site-packages/torch/utils/cpp_extension.py", line 867, in _import_module_from_library
# return imp.load_module(module_name, file, path, description)
# File "/opt/python/cp35-cp35m/lib/python3.5/imp.py", line 243, in load_module
# return load_dynamic(name, filename, file)
# File "/opt/python/cp35-cp35m/lib/python3.5/imp.py", line 343, in load_dynamic
# return _load(spec)
# File "<frozen importlib._bootstrap>", line 693, in _load
# File "<frozen importlib._bootstrap>", line 666, in _load_unlocked
# File "<frozen importlib._bootstrap>", line 577, in module_from_spec
# File "<frozen importlib._bootstrap_external>", line 938, in create_module
# File "<frozen importlib._bootstrap>", line 222, in _call_with_frames_removed
# ImportError: libcudnn.so.7: cannot open shared object file: No such file or directory
tests_to_skip+=('TestCppExtension and test_jit_cudnn_extension')
#
# TestCuda
#
# 3.7_cu80
# RuntimeError: CUDA error: out of memory
tests_to_skip+=('TestCuda and test_arithmetic_large_tensor')
# 3.7_cu80
# RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch-nightly_1538097262541/work/aten/src/THC/THCTensorCopy.cu:205
tests_to_skip+=('TestCuda and test_autogpu')
#
# TestDistBackend
#
# Traceback (most recent call last):
# File "test_thd_distributed.py", line 1046, in wrapper
# self._join_and_reduce(fn)
# File "test_thd_distributed.py", line 1108, in _join_and_reduce
# self.assertEqual(p.exitcode, first_process.exitcode)
# File "/pytorch/test/common.py", line 399, in assertEqual
# super(TestCase, self).assertEqual(x, y, message)
# AssertionError: None != 77 :
tests_to_skip+=('TestDistBackend and test_all_gather_group')
tests_to_skip+=('TestDistBackend and test_all_reduce_group_max')
tests_to_skip+=('TestDistBackend and test_all_reduce_group_min')
tests_to_skip+=('TestDistBackend and test_all_reduce_group_sum')
tests_to_skip+=('TestDistBackend and test_all_reduce_group_product')
tests_to_skip+=('TestDistBackend and test_barrier_group')
tests_to_skip+=('TestDistBackend and test_broadcast_group')
# Traceback (most recent call last):
# File "test_thd_distributed.py", line 1046, in wrapper
# self._join_and_reduce(fn)
# File "test_thd_distributed.py", line 1108, in _join_and_reduce
# self.assertEqual(p.exitcode, first_process.exitcode)
# File "/pytorch/test/common.py", line 397, in assertEqual
# super(TestCase, self).assertLessEqual(abs(x - y), prec, message)
# AssertionError: 12 not less than or equal to 1e-05
tests_to_skip+=('TestDistBackend and test_barrier')
# Traceback (most recent call last):
# File "test_distributed.py", line 1267, in wrapper
# self._join_and_reduce(fn)
# File "test_distributed.py", line 1350, in _join_and_reduce
# self.assertEqual(p.exitcode, first_process.exitcode)
# File "/pytorch/test/common.py", line 399, in assertEqual
# super(TestCase, self).assertEqual(x, y, message)
# AssertionError: None != 1
tests_to_skip+=('TestDistBackend and test_broadcast')
# Memory leak very similar to all the conda ones below, but appears on manywheel
# 3.6m_cu80
# AssertionError: 1605632 not less than or equal to 1e-05 : __main__.TestEndToEndHybridFrontendModels.test_vae_cuda leaked 1605632 bytes CUDA memory on device 0
tests_to_skip+=('TestEndToEndHybridFrontendModels and test_vae_cuda')
# ________________________ TestNN.test_embedding_bag_cuda ________________________
#
# self = <test_nn.TestNN testMethod=test_embedding_bag_cuda>
# dtype = torch.float32
#
# @unittest.skipIf(not TEST_CUDA, "CUDA unavailable")
# @repeat_test_for_types(ALL_TENSORTYPES)
# @skipIfRocm
# def test_embedding_bag_cuda(self, dtype=torch.float):
# self._test_EmbeddingBag(True, 'sum', False, dtype)
# self._test_EmbeddingBag(True, 'mean', False, dtype)
# self._test_EmbeddingBag(True, 'max', False, dtype)
# if dtype != torch.half:
# # torch.cuda.sparse.HalfTensor is not enabled.
# self._test_EmbeddingBag(True, 'sum', True, dtype)
# > self._test_EmbeddingBag(True, 'mean', True, dtype)
#
# test_nn.py:2144:
# _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
# test_nn.py:2062: in _test_EmbeddingBag
# _test_vs_Embedding(N, D, B, L)
# test_nn.py:2059: in _test_vs_Embedding
# self.assertEqual(es_weight_grad, e.weight.grad, needed_prec)
# common.py:373: in assertEqual
# assertTensorsEqual(x, y)
# common.py:365: in assertTensorsEqual
# self.assertLessEqual(max_err, prec, message)
# E AssertionError: tensor(0.0000, device='cuda:0', dtype=torch.float32) not less than or equal to 2e-05 :
# 1 failed, 1202 passed, 19 skipped, 2 xfailed, 796 warnings in 1166.73 seconds =
# Traceback (most recent call last):
# File "test/run_test.py", line 391, in <module>
# main()
# File "test/run_test.py", line 383, in main
# raise RuntimeError(message)
tests_to_skip+=('TestNN and test_embedding_bag_cuda')
fi
##############################################################################
# MacOS specific flaky tests
##############################################################################
if [[ "$(uname)" == 'Darwin' ]]; then
# TestCppExtensions by default uses a temp folder in /tmp. This doesn't
# work for this Mac machine because there is only one machine and /tmp is
# shared. (All the linux builds are on docker so have their own /tmp).
tests_to_skip+=('TestCppExtension')
fi
# Turn the set of tests to skip into an invocation that pytest understands
excluded_tests_logic=''
for exclusion in "${tests_to_skip[@]}"; do
if [[ -z "$excluded_tests_logic" ]]; then
# Only true for i==0
excluded_tests_logic="not ($exclusion)"
else
excluded_tests_logic="$excluded_tests_logic and not ($exclusion)"
fi
done
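For clarity, a short sketch of the `-k` expression that this loop assembles from the skip list (example exclusions taken from the entries above):

```
tests_to_skip = [
    "RendezvousEnvTest and test_common_errors",
    "TestTorch and test_sum_dim",
]
excluded_tests_logic = " and ".join(f"not ({t})" for t in tests_to_skip)
print(excluded_tests_logic)
# not (RendezvousEnvTest and test_common_errors) and not (TestTorch and test_sum_dim)
```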
##############################################################################
# Run the tests
##############################################################################
echo
echo "$(date) :: Calling 'python test/run_test.py -v -p pytest ${entire_file_exclusions[@]} -- --disable-pytest-warnings -k '$excluded_tests_logic'"
python test/run_test.py -v -p pytest ${entire_file_exclusions[@]} -- --disable-pytest-warnings -k "'" "$excluded_tests_logic" "'"
echo
echo "$(date) :: Finished 'python test/run_test.py -v -p pytest ${entire_file_exclusions[@]} -- --disable-pytest-warnings -k '$excluded_tests_logic'"
# cpp_extensions don't work with pytest, so we run them without pytest here,
# except there's a failure on CUDA builds (documented above), and
# cpp_extensions doesn't work on a shared mac machine (also documented above)
if [[ "$cuda_ver" == 'cpu' && "$(uname)" != 'Darwin' ]]; then
echo
echo "$(date) :: Calling 'python test/run_test.py -v -i cpp_extensions'"
python test/run_test.py -v -i cpp_extensions
echo
echo "$(date) :: Finished 'python test/run_test.py -v -i cpp_extensions'"
fi
# thd_distributed can run on Mac but not in pytest
if [[ "$(uname)" == 'Darwin' ]]; then
echo
echo "$(date) :: Calling 'python test/run_test.py -v -i thd_distributed'"
python test/run_test.py -v -i thd_distributed
echo
echo "$(date) :: Finished 'python test/run_test.py -v -i thd_distributed'"
fi

View File

@ -0,0 +1,130 @@
#!/usr/bin/env python3
import concurrent.futures
import distutils.sysconfig
import functools
import itertools
import os
import re
from pathlib import Path
from typing import Any, List, Tuple
# We also check that there are [not] cxx11 symbols in libtorch
#
# To check whether it is using cxx11 ABI, check non-existence of symbol:
PRE_CXX11_SYMBOLS = (
"std::basic_string<",
"std::list",
)
# To check whether it is using pre-cxx11 ABI, check non-existence of symbol:
CXX11_SYMBOLS = (
"std::__cxx11::basic_string",
"std::__cxx11::list",
)
# NOTE: Checking the above symbols in all namespaces doesn't work, because
# devtoolset7 always produces some cxx11 symbols even if we build with old ABI,
# and CuDNN always has pre-cxx11 symbols even if we build with new ABI using gcc 5.4.
# Instead, we *only* check the above symbols in the following namespaces:
LIBTORCH_NAMESPACE_LIST = (
"c10::",
"at::",
"caffe2::",
"torch::",
)
def _apply_libtorch_symbols(symbols):
return [
re.compile(f"{x}.*{y}")
for (x, y) in itertools.product(LIBTORCH_NAMESPACE_LIST, symbols)
]
LIBTORCH_CXX11_PATTERNS = _apply_libtorch_symbols(CXX11_SYMBOLS)
LIBTORCH_PRE_CXX11_PATTERNS = _apply_libtorch_symbols(PRE_CXX11_SYMBOLS)
@functools.lru_cache(100)
def get_symbols(lib: str) -> List[Tuple[str, str, str]]:
from subprocess import check_output
lines = check_output(f'nm "{lib}"|c++filt', shell=True)
return [x.split(" ", 2) for x in lines.decode("latin1").split("\n")[:-1]]
def grep_symbols(lib: str, patterns: List[Any]) -> List[str]:
def _grep_symbols(
symbols: List[Tuple[str, str, str]], patterns: List[Any]
) -> List[str]:
rc = []
for _s_addr, _s_type, s_name in symbols:
for pattern in patterns:
if pattern.match(s_name):
rc.append(s_name)
continue
return rc
all_symbols = get_symbols(lib)
num_workers = 32
chunk_size = (len(all_symbols) + num_workers - 1) // num_workers
def _get_symbols_chunk(i):
return all_symbols[i * chunk_size : (i + 1) * chunk_size]
with concurrent.futures.ThreadPoolExecutor(max_workers=32) as executor:
tasks = [
executor.submit(_grep_symbols, _get_symbols_chunk(i), patterns)
for i in range(num_workers)
]
return functools.reduce(list.__add__, (x.result() for x in tasks), [])
def check_lib_symbols_for_abi_correctness(lib: str, pre_cxx11_abi: bool = True) -> None:
print(f"lib: {lib}")
cxx11_symbols = grep_symbols(lib, LIBTORCH_CXX11_PATTERNS)
pre_cxx11_symbols = grep_symbols(lib, LIBTORCH_PRE_CXX11_PATTERNS)
num_cxx11_symbols = len(cxx11_symbols)
num_pre_cxx11_symbols = len(pre_cxx11_symbols)
print(f"num_cxx11_symbols: {num_cxx11_symbols}")
print(f"num_pre_cxx11_symbols: {num_pre_cxx11_symbols}")
if pre_cxx11_abi:
if num_cxx11_symbols > 0:
raise RuntimeError(
f"Found cxx11 symbols, but there shouldn't be any, see: {cxx11_symbols[:100]}"
)
if num_pre_cxx11_symbols < 1000:
raise RuntimeError("Didn't find enough pre-cxx11 symbols.")
# Check for no recursive iterators, regression test for https://github.com/pytorch/pytorch/issues/133437
rec_iter_symbols = grep_symbols(
lib, [re.compile("std::filesystem::recursive_directory_iterator.*")]
)
if len(rec_iter_symbols) > 0:
raise RuntimeError(
f"recursive_directory_iterator in used pre-CXX11 binaries, see; {rec_iter_symbols}"
)
else:
if num_pre_cxx11_symbols > 0:
raise RuntimeError(
f"Found pre-cxx11 symbols, but there shouldn't be any, see: {pre_cxx11_symbols[:100]}"
)
if num_cxx11_symbols < 100:
raise RuntimeError("Didn't find enought cxx11 symbols")
def main() -> None:
if "install_root" in os.environ:
install_root = Path(os.getenv("install_root")) # noqa: SIM112
else:
if os.getenv("PACKAGE_TYPE") == "libtorch":
install_root = Path(os.getcwd())
else:
install_root = Path(distutils.sysconfig.get_python_lib()) / "torch"
libtorch_cpu_path = install_root / "lib" / "libtorch_cpu.so"
pre_cxx11_abi = "cxx11-abi" not in os.getenv("DESIRED_DEVTOOLSET", "")
check_lib_symbols_for_abi_correctness(libtorch_cpu_path, pre_cxx11_abi)
if __name__ == "__main__":
main()
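To make the namespace-scoped matching above concrete, here is a tiny standalone sketch (the demangled symbol is an assumed example) showing how a cxx11-ABI symbol inside a libtorch namespace matches one of the generated patterns:

```
import itertools
import re

CXX11_SYMBOLS = ("std::__cxx11::basic_string",)
LIBTORCH_NAMESPACE_LIST = ("c10::", "at::", "caffe2::", "torch::")
patterns = [
    re.compile(f"{ns}.*{sym}")
    for ns, sym in itertools.product(LIBTORCH_NAMESPACE_LIST, CXX11_SYMBOLS)
]

demangled = "c10::Error::Error(std::__cxx11::basic_string<char, std::char_traits<char> >)"
print(any(p.match(demangled) for p in patterns))   # True -> counted as a cxx11 symbol
```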

View File

@ -0,0 +1,205 @@
import argparse
from torchvision import datasets, transforms
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR
class Net(nn.Module):
def __init__(self):
super(Net, self).__init__() # noqa: UP008
self.conv1 = nn.Conv2d(1, 32, 3, 1)
self.conv2 = nn.Conv2d(32, 64, 3, 1)
self.dropout1 = nn.Dropout(0.25)
self.dropout2 = nn.Dropout(0.5)
self.fc1 = nn.Linear(9216, 128)
self.fc2 = nn.Linear(128, 10)
def forward(self, x):
x = self.conv1(x)
x = F.relu(x)
x = self.conv2(x)
x = F.relu(x)
x = F.max_pool2d(x, 2)
x = self.dropout1(x)
x = torch.flatten(x, 1)
x = self.fc1(x)
x = F.relu(x)
x = self.dropout2(x)
x = self.fc2(x)
output = F.log_softmax(x, dim=1)
return output
def train(args, model, device, train_loader, optimizer, epoch):
model.train()
for batch_idx, (data, target) in enumerate(train_loader):
data, target = data.to(device), target.to(device)
optimizer.zero_grad()
output = model(data)
loss = F.nll_loss(output, target)
loss.backward()
optimizer.step()
if batch_idx % args.log_interval == 0:
print(
f"Train Epoch: {epoch} [{batch_idx * len(data)}/{len(train_loader.dataset)} ({100. * batch_idx / len(train_loader):.0f}%)]\tLoss: {loss.item():.6f}" # noqa: B950
)
if args.dry_run:
break
def test(model, device, test_loader):
model.eval()
test_loss = 0
correct = 0
with torch.no_grad():
for data, target in test_loader:
data, target = data.to(device), target.to(device)
output = model(data)
test_loss += F.nll_loss(
output, target, reduction="sum"
).item() # sum up batch loss
pred = output.argmax(
dim=1, keepdim=True
) # get the index of the max log-probability
correct += pred.eq(target.view_as(pred)).sum().item()
test_loss /= len(test_loader.dataset)
print(
f"\nTest set: Average loss: {test_loss:.4f}, Accuracy: {correct}/{len(test_loader.dataset)} ({100. * correct / len(test_loader.dataset):.0f}%)\n" # noqa: B950
)
def timed(fn):
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
result = fn()
end.record()
torch.cuda.synchronize()
return result, start.elapsed_time(end) / 1000
def main():
# Training settings
parser = argparse.ArgumentParser(description="PyTorch MNIST Example")
parser.add_argument(
"--batch-size",
type=int,
default=64,
metavar="N",
help="input batch size for training (default: 64)",
)
parser.add_argument(
"--test-batch-size",
type=int,
default=1000,
metavar="N",
help="input batch size for testing (default: 1000)",
)
parser.add_argument(
"--epochs",
type=int,
default=4,
metavar="N",
help="number of epochs to train (default: 14)",
)
parser.add_argument(
"--lr",
type=float,
default=1.0,
metavar="LR",
help="learning rate (default: 1.0)",
)
parser.add_argument(
"--gamma",
type=float,
default=0.7,
metavar="M",
help="Learning rate step gamma (default: 0.7)",
)
parser.add_argument(
"--no-cuda", action="store_true", default=False, help="disables CUDA training"
)
parser.add_argument(
"--no-mps",
action="store_true",
default=False,
help="disables macOS GPU training",
)
parser.add_argument(
"--dry-run",
action="store_true",
default=False,
help="quickly check a single pass",
)
parser.add_argument(
"--seed", type=int, default=1, metavar="S", help="random seed (default: 1)"
)
parser.add_argument(
"--log-interval",
type=int,
default=100,
metavar="N",
help="how many batches to wait before logging training status",
)
parser.add_argument(
"--save-model",
action="store_true",
default=False,
help="For Saving the current Model",
)
args = parser.parse_args()
use_cuda = not args.no_cuda and torch.cuda.is_available()
use_mps = not args.no_mps and torch.backends.mps.is_available()
torch.manual_seed(args.seed)
torch.backends.cuda.matmul.allow_tf32 = True
if use_cuda:
device = torch.device("cuda")
elif use_mps:
device = torch.device("mps")
else:
device = torch.device("cpu")
train_kwargs = {"batch_size": args.batch_size}
test_kwargs = {"batch_size": args.test_batch_size}
if use_cuda:
cuda_kwargs = {"num_workers": 1, "pin_memory": True, "shuffle": True}
train_kwargs.update(cuda_kwargs)
test_kwargs.update(cuda_kwargs)
transform = transforms.Compose(
[transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]
)
dataset1 = datasets.MNIST("../data", train=True, download=True, transform=transform)
dataset2 = datasets.MNIST("../data", train=False, transform=transform)
train_loader = torch.utils.data.DataLoader(dataset1, **train_kwargs)
test_loader = torch.utils.data.DataLoader(dataset2, **test_kwargs)
model = Net().to(device)
opt_model = torch.compile(model, mode="max-autotune")
optimizer = optim.Adadelta(opt_model.parameters(), lr=args.lr)
scheduler = StepLR(optimizer, step_size=1, gamma=args.gamma)
for epoch in range(1, args.epochs + 1):
print(
f"Training Time: {timed(lambda: train(args, opt_model, device, train_loader, optimizer, epoch))[1]}"
)
print(
f"Evaluation Time: {timed(lambda: test(opt_model, device, test_loader))[1]}"
)
scheduler.step()
if args.save_model:
torch.save(opt_model.state_dict(), "mnist_cnn.pt")
if __name__ == "__main__":
main()
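
Note: timed() above relies on torch.cuda.Event, so the reported times are only meaningful on the CUDA path. A device-agnostic variant (an illustrative sketch, not part of the example above) could fall back to a wall-clock timer:

```python
# Sketch of a device-agnostic timing helper; an assumption, not part of the file above.
import time
import torch

def timed_portable(fn):
    if torch.cuda.is_available():
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        result = fn()
        end.record()
        torch.cuda.synchronize()  # event timings are valid only after a sync
        return result, start.elapsed_time(end) / 1000  # ms -> seconds
    t0 = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - t0
```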

View File

@ -0,0 +1,385 @@
import argparse
import importlib
import json
import os
import re
import subprocess
import sys
from pathlib import Path
import torch
import torch._dynamo
import torch.nn as nn
import torch.nn.functional as F
if "MATRIX_GPU_ARCH_VERSION" in os.environ:
gpu_arch_ver = os.getenv("MATRIX_GPU_ARCH_VERSION")
else:
gpu_arch_ver = os.getenv("GPU_ARCH_VERSION") # Use fallback if available
gpu_arch_type = os.getenv("MATRIX_GPU_ARCH_TYPE")
channel = os.getenv("MATRIX_CHANNEL")
package_type = os.getenv("MATRIX_PACKAGE_TYPE")
target_os = os.getenv("TARGET_OS", sys.platform)
BASE_DIR = Path(__file__).parent.parent.parent
is_cuda_system = gpu_arch_type == "cuda"
NIGHTLY_ALLOWED_DELTA = 3
MODULES = [
{
"name": "torchvision",
"repo": "https://github.com/pytorch/vision.git",
"smoke_test": "./vision/test/smoke_test.py",
"extension": "extension",
"repo_name": "vision",
},
{
"name": "torchaudio",
"repo": "https://github.com/pytorch/audio.git",
"smoke_test": "./audio/test/smoke_test/smoke_test.py --no-ffmpeg",
"extension": "_extension",
"repo_name": "audio",
},
]
class Net(nn.Module):
def __init__(self):
super().__init__()
self.conv1 = nn.Conv2d(1, 32, 3, 1)
self.conv2 = nn.Conv2d(32, 64, 3, 1)
self.fc1 = nn.Linear(9216, 1)
def forward(self, x):
x = self.conv1(x)
x = self.conv2(x)
x = F.max_pool2d(x, 2)
x = torch.flatten(x, 1)
output = self.fc1(x)
return output
def load_json_from_basedir(filename: str):
try:
with open(BASE_DIR / filename) as fptr:
return json.load(fptr)
except FileNotFoundError as exc:
raise ImportError(f"File {filename} not found error: {exc.strerror}") from exc
except json.JSONDecodeError as exc:
raise ImportError(f"Invalid JSON {filename}") from exc
def read_release_matrix():
return load_json_from_basedir("release_matrix.json")
def test_numpy():
import numpy as np
x = np.arange(5)
torch.tensor(x)
def check_version(package: str) -> None:
release_version = os.getenv("RELEASE_VERSION")
# if release_version is specified, use it to validate the packages
if release_version:
release_matrix = read_release_matrix()
stable_version = release_matrix["torch"]
else:
stable_version = os.getenv("MATRIX_STABLE_VERSION")
# only makes sense to check nightly package where dates are known
if channel == "nightly":
check_nightly_binaries_date(package)
elif stable_version is not None:
if not torch.__version__.startswith(stable_version):
raise RuntimeError(
f"Torch version mismatch, expected {stable_version} for channel {channel}. But its {torch.__version__}"
)
if release_version and package == "all":
for module in MODULES:
imported_module = importlib.import_module(module["name"])
module_version = imported_module.__version__
if not module_version.startswith(release_matrix[module["name"]]):
raise RuntimeError(
f"{module['name']} version mismatch, expected: \
{release_matrix[module['name']]} for channel {channel}. But it is {module_version}"
)
else:
print(f"{module['name']} version actual: {module_version} expected: \
{release_matrix[module['name']]} for channel {channel}.")
else:
print(f"Skip version check for channel {channel} as stable version is None")
def check_nightly_binaries_date(package: str) -> None:
from datetime import datetime
format_dt = "%Y%m%d"
date_t_str = re.findall("dev\\d+", torch.__version__)
date_t_delta = datetime.now() - datetime.strptime(date_t_str[0][3:], format_dt)
if date_t_delta.days >= NIGHTLY_ALLOWED_DELTA:
raise RuntimeError(
f"the binaries are from {date_t_str} and are more than {NIGHTLY_ALLOWED_DELTA} days old!"
)
if package == "all":
for module in MODULES:
imported_module = importlib.import_module(module["name"])
module_version = imported_module.__version__
date_m_str = re.findall("dev\\d+", module_version)
date_m_delta = datetime.now() - datetime.strptime(
date_m_str[0][3:], format_dt
)
print(f"Nightly date check for {module['name']} version {module_version}")
if date_m_delta.days > NIGHTLY_ALLOWED_DELTA:
raise RuntimeError(
f"Expected {module['name']} to be less then {NIGHTLY_ALLOWED_DELTA} days. But its {date_m_delta}"
)
def test_cuda_runtime_errors_captured() -> None:
cuda_exception_missed = True
try:
print("Testing test_cuda_runtime_errors_captured")
torch._assert_async(torch.tensor(0, device="cuda"))
torch._assert_async(torch.tensor(0 + 0j, device="cuda"))
except RuntimeError as e:
if re.search("CUDA", f"{e}"):
print(f"Caught CUDA exception with success: {e}")
cuda_exception_missed = False
else:
raise e
if cuda_exception_missed:
raise RuntimeError("Expected CUDA RuntimeError but have not received!")
def smoke_test_cuda(
package: str, runtime_error_check: str, torch_compile_check: str
) -> None:
if not torch.cuda.is_available() and is_cuda_system:
raise RuntimeError(f"Expected CUDA {gpu_arch_ver}. However CUDA is not loaded.")
if package == "all" and is_cuda_system:
for module in MODULES:
imported_module = importlib.import_module(module["name"])
# TBD: for vision, move the extension module to private so it will
# be _extension.
version = "N/A"
if module["extension"] == "extension":
version = imported_module.extension._check_cuda_version()
else:
version = imported_module._extension._check_cuda_version()
print(f"{module['name']} CUDA: {version}")
# torch.compile is available on macos-arm64 and Linux for python 3.8-3.13
if (
torch_compile_check == "enabled"
and sys.version_info < (3, 14, 0)
and target_os in ["linux", "linux-aarch64", "macos-arm64", "darwin"]
):
smoke_test_compile("cuda" if torch.cuda.is_available() else "cpu")
if torch.cuda.is_available():
if torch.version.cuda != gpu_arch_ver:
raise RuntimeError(
f"Wrong CUDA version. Loaded: {torch.version.cuda} Expected: {gpu_arch_ver}"
)
print(f"torch cuda: {torch.version.cuda}")
# todo add cudnn version validation
print(f"torch cudnn: {torch.backends.cudnn.version()}")
print(f"cuDNN enabled? {torch.backends.cudnn.enabled}")
torch.cuda.init()
print("CUDA initialized successfully")
print(f"Number of CUDA devices: {torch.cuda.device_count()}")
for i in range(torch.cuda.device_count()):
print(f"Device {i}: {torch.cuda.get_device_name(i)}")
# NCCL is available only on Linux
if sys.platform in ["linux", "linux2"]:
print(f"torch nccl version: {torch.cuda.nccl.version()}")
if runtime_error_check == "enabled":
test_cuda_runtime_errors_captured()
def smoke_test_conv2d() -> None:
import torch.nn as nn
print("Testing smoke_test_conv2d")
# With square kernels and equal stride
m = nn.Conv2d(16, 33, 3, stride=2)
# non-square kernels and unequal stride and with padding
m = nn.Conv2d(16, 33, (3, 5), stride=(2, 1), padding=(4, 2))
assert m is not None
# non-square kernels and unequal stride and with padding and dilation
basic_conv = nn.Conv2d(
16, 33, (3, 5), stride=(2, 1), padding=(4, 2), dilation=(3, 1)
)
input = torch.randn(20, 16, 50, 100)
output = basic_conv(input)
if is_cuda_system:
print("Testing smoke_test_conv2d with cuda")
conv = nn.Conv2d(3, 3, 3).cuda()
x = torch.randn(1, 3, 24, 24, device="cuda")
with torch.cuda.amp.autocast():
out = conv(x)
assert out is not None
supported_dtypes = [torch.float16, torch.float32, torch.float64]
for dtype in supported_dtypes:
print(f"Testing smoke_test_conv2d with cuda for {dtype}")
conv = basic_conv.to(dtype).cuda()
input = torch.randn(20, 16, 50, 100, device="cuda").type(dtype)
output = conv(input)
assert output is not None
def test_linalg(device="cpu") -> None:
print(f"Testing smoke_test_linalg on {device}")
A = torch.randn(5, 3, device=device)
U, S, Vh = torch.linalg.svd(A, full_matrices=False)
assert (
U.shape == A.shape
and S.shape == torch.Size([3])
and Vh.shape == torch.Size([3, 3])
)
torch.dist(A, U @ torch.diag(S) @ Vh)
U, S, Vh = torch.linalg.svd(A)
assert (
U.shape == torch.Size([5, 5])
and S.shape == torch.Size([3])
and Vh.shape == torch.Size([3, 3])
)
torch.dist(A, U[:, :3] @ torch.diag(S) @ Vh)
A = torch.randn(7, 5, 3, device=device)
U, S, Vh = torch.linalg.svd(A, full_matrices=False)
torch.dist(A, U @ torch.diag_embed(S) @ Vh)
if device == "cuda":
supported_dtypes = [torch.float32, torch.float64]
for dtype in supported_dtypes:
print(f"Testing smoke_test_linalg with cuda for {dtype}")
A = torch.randn(20, 16, 50, 100, device=device, dtype=dtype)
torch.linalg.svd(A)
def smoke_test_compile(device: str = "cpu") -> None:
supported_dtypes = [torch.float16, torch.float32, torch.float64]
def foo(x: torch.Tensor) -> torch.Tensor:
return torch.sin(x) + torch.cos(x)
for dtype in supported_dtypes:
print(f"Testing smoke_test_compile for {device} and {dtype}")
x = torch.rand(3, 3, device=device).type(dtype)
x_eager = foo(x)
x_pt2 = torch.compile(foo)(x)
torch.testing.assert_close(x_eager, x_pt2)
# Check that a vectorized SIMD ISA was detected for the architecture
if device == "cpu":
from torch._inductor.codecache import pick_vec_isa
isa = pick_vec_isa()
if not isa:
raise RuntimeError("Can't detect vectorized ISA for CPU")
print(f"Picked CPU ISA {type(isa).__name__} bit width {isa.bit_width()}")
# Reset torch dynamo since we are changing mode
torch._dynamo.reset()
dtype = torch.float32
torch.set_float32_matmul_precision("high")
print(f"Testing smoke_test_compile with mode 'max-autotune' for {dtype}")
x = torch.rand(64, 1, 28, 28, device=device).type(torch.float32)
model = Net().to(device=device)
x_pt2 = torch.compile(model, mode="max-autotune")(x)
def smoke_test_modules():
cwd = os.getcwd()
for module in MODULES:
if module["repo"]:
if not os.path.exists(f"{cwd}/{module['repo_name']}"):
print(f"Path does not exist: {cwd}/{module['repo_name']}")
try:
subprocess.check_output(
f"git clone --depth 1 {module['repo']}",
stderr=subprocess.STDOUT,
shell=True,
)
except subprocess.CalledProcessError as exc:
raise RuntimeError(
f"Cloning {module['repo']} FAIL: {exc.returncode} Output: {exc.output}"
) from exc
try:
smoke_test_command = f"python3 {module['smoke_test']}"
if target_os == "windows":
smoke_test_command = f"python {module['smoke_test']}"
output = subprocess.check_output(
smoke_test_command,
stderr=subprocess.STDOUT,
shell=True,
universal_newlines=True,
)
except subprocess.CalledProcessError as exc:
raise RuntimeError(
f"Module {module['name']} FAIL: {exc.returncode} Output: {exc.output}"
) from exc
else:
print(f"Output: \n{output}\n")
def main() -> None:
parser = argparse.ArgumentParser()
parser.add_argument(
"--package",
help="Package to include in smoke testing",
type=str,
choices=["all", "torchonly"],
default="all",
)
parser.add_argument(
"--runtime-error-check",
help="No Runtime Error check",
type=str,
choices=["enabled", "disabled"],
default="enabled",
)
parser.add_argument(
"--torch-compile-check",
help="Check torch compile",
type=str,
choices=["enabled", "disabled"],
default="enabled",
)
options = parser.parse_args()
print(f"torch: {torch.__version__}")
print(torch.__config__.parallel_info())
check_version(options.package)
smoke_test_conv2d()
test_linalg()
test_numpy()
if is_cuda_system:
test_linalg("cuda")
if options.package == "all":
smoke_test_modules()
smoke_test_cuda(
options.package, options.runtime_error_check, options.torch_compile_check
)
if __name__ == "__main__":
main()
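
Note: check_nightly_binaries_date() above extracts the devYYYYMMDD stamp from the version string and compares it to today's date. A worked example with a hypothetical nightly version string:

```python
# Worked example of the nightly date check; the version string below is hypothetical.
import re
from datetime import datetime

NIGHTLY_ALLOWED_DELTA = 3
version = "2.7.0.dev20250115+cu126"  # hypothetical torch.__version__ of a nightly build

date_str = re.findall(r"dev\d+", version)[0][3:]  # -> "20250115"
age = (datetime.now() - datetime.strptime(date_str, "%Y%m%d")).days
print(f"nightly is {age} days old (allowed delta: {NIGHTLY_ALLOWED_DELTA})")
```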

View File

@ -14,7 +14,7 @@ source "$(dirname "${BASH_SOURCE[0]}")/common.sh"
# Do not change workspace permissions for ROCm CI jobs
# as it can leave workspace with bad permissions for cancelled jobs
if [[ "$BUILD_ENVIRONMENT" != *rocm* ]]; then
if [[ "$BUILD_ENVIRONMENT" != *rocm* && -d /var/lib/jenkins/workspace ]]; then
# Workaround for dind-rootless userid mapping (https://github.com/pytorch/ci-infra/issues/96)
WORKSPACE_ORIGINAL_OWNER_ID=$(stat -c '%u' "/var/lib/jenkins/workspace")
cleanup_workspace() {
@ -1412,7 +1412,7 @@ test_linux_aarch64() {
inductor/test_pattern_matcher inductor/test_perf inductor/test_profiler inductor/test_select_algorithm inductor/test_smoke \
inductor/test_split_cat_fx_passes inductor/test_standalone_compile inductor/test_torchinductor \
inductor/test_torchinductor_codegen_dynamic_shapes inductor/test_torchinductor_dynamic_shapes inductor/test_memory \
inductor/test_triton_cpu_backend inductor/test_triton_extension_backend \
inductor/test_triton_cpu_backend inductor/test_triton_extension_backend inductor/test_mkldnn_pattern_matcher inductor/test_cpu_cpp_wrapper \
--shard "$SHARD_NUMBER" "$NUM_TEST_SHARDS" --verbose
}
@ -1421,9 +1421,9 @@ if ! [[ "${BUILD_ENVIRONMENT}" == *libtorch* || "${BUILD_ENVIRONMENT}" == *-baze
(cd test && python -c "import torch; print(torch.__config__.parallel_info())")
fi
if [[ "${TEST_CONFIG}" == *numpy_2* ]]; then
# Install numpy-2.0.2 and test inductor tracing
python -mpip install --pre numpy==2.0.2
python test/run_test.py --include dynamo/test_unspec.py
# Install numpy-2.0.2 and compatible scipy & numba versions
python -mpip install --pre numpy==2.0.2 scipy==1.13.1 numba==0.60.0
python test/run_test.py --include dynamo/test_functions.py dynamo/test_unspec.py test_binary_ufuncs.py test_fake_tensor.py test_linalg.py test_numpy_interop.py test_tensor_creation_ops.py test_torch.py torch_np/test_basic.py
elif [[ "${BUILD_ENVIRONMENT}" == *aarch64* && "${TEST_CONFIG}" != *perf_cpu_aarch64* ]]; then
test_linux_aarch64
elif [[ "${TEST_CONFIG}" == *backward* ]]; then

View File

@ -0,0 +1,26 @@
cmake_minimum_required(VERSION 3.0 FATAL_ERROR)
project(simple-torch-test)
find_package(Torch REQUIRED)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${TORCH_CXX_FLAGS}")
add_executable(simple-torch-test simple-torch-test.cpp)
target_include_directories(simple-torch-test PRIVATE ${TORCH_INCLUDE_DIRS})
target_link_libraries(simple-torch-test "${TORCH_LIBRARIES}")
set_property(TARGET simple-torch-test PROPERTY CXX_STANDARD 17)
find_package(CUDAToolkit 11.8)
target_link_libraries(simple-torch-test CUDA::cudart CUDA::cufft CUDA::cusparse CUDA::cublas CUDA::cusolver)
find_library(CUDNN_LIBRARY NAMES cudnn)
target_link_libraries(simple-torch-test ${CUDNN_LIBRARY} )
if(MSVC)
file(GLOB TORCH_DLLS "$ENV{CUDA_PATH}/bin/cudnn64_8.dll" "$ENV{NVTOOLSEXT_PATH}/bin/x64/*.dll")
message("dlls to copy " ${TORCH_DLLS})
add_custom_command(TARGET simple-torch-test
POST_BUILD
COMMAND ${CMAKE_COMMAND} -E copy_if_different
${TORCH_DLLS}
$<TARGET_FILE_DIR:simple-torch-test>)
endif(MSVC)

View File

@ -0,0 +1,15 @@
#include <torch/torch.h>
int main(int argc, const char* argv[]) {
std::cout << "Checking that CUDA archs are setup correctly" << std::endl;
TORCH_CHECK(torch::rand({ 3, 5 }, torch::Device(torch::kCUDA)).defined(), "CUDA archs are not setup correctly");
// These have to run after CUDA is initialized
std::cout << "Checking that magma is available" << std::endl;
TORCH_CHECK(torch::hasMAGMA(), "MAGMA is not available");
std::cout << "Checking that CuDNN is available" << std::endl;
TORCH_CHECK(torch::cuda::cudnn_is_available(), "CuDNN is not available");
return 0;
}

View File

@ -0,0 +1,6 @@
#include <torch/torch.h>
int main(int argc, const char* argv[]) {
TORCH_CHECK(torch::hasMKL(), "MKL is not available");
return 0;
}

View File

@ -0,0 +1,7 @@
#include <ATen/ATen.h>
#include <torch/torch.h>
int main(int argc, const char* argv[]) {
TORCH_CHECK(at::globalContext().isXNNPACKAvailable(), "XNNPACK is not available");
return 0;
}

View File

@ -0,0 +1,38 @@
r"""
It's used to check basic cnn features with cuda.
For example, it will throw an exception if some components are missing.
"""
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
class SimpleCNN(nn.Module):
def __init__(self):
super().__init__()
self.conv = nn.Conv2d(1, 1, 3)
self.pool = nn.MaxPool2d(2, 2)
def forward(self, inputs):
output = self.pool(F.relu(self.conv(inputs)))
output = output.view(1)
return output
# Mock one inference pass
device = torch.device("cuda:0")
net = SimpleCNN().to(device)
net_inputs = torch.rand((1, 1, 5, 5), device=device)
outputs = net(net_inputs)
print(outputs)
criterion = nn.MSELoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.1)
# Mock one training step
label = torch.full((1,), 1.0, dtype=torch.float, device=device)
loss = criterion(outputs, label)
loss.backward()
optimizer.step()

View File

@ -0,0 +1,14 @@
r"""
It's used to check basic rnn features with cuda.
For example, it will throw an exception if some components are missing.
"""
import torch
import torch.nn as nn
device = torch.device("cuda:0")
rnn = nn.RNN(10, 20, 2).to(device)
inputs = torch.randn(5, 3, 10).to(device)
h0 = torch.randn(2, 3, 20).to(device)
output, hn = rnn(inputs, h0)

View File

@ -0,0 +1,6 @@
#include <torch/torch.h>
int main(int argc, const char* argv[]) {
TORCH_WARN("Simple test passed!");
return 0;
}

View File

@ -35,7 +35,7 @@ if [[ "$BUILD_ENVIRONMENT" == *cuda* ]]; then
fi
# TODO: Move both of them to Windows AMI
python -m pip install pytest-rerunfailures==10.3 pytest-cpp==2.3.0 tensorboard==2.13.0
python -m pip install pytest-rerunfailures==10.3 pytest-cpp==2.3.0 tensorboard==2.13.0 pytest-subtests==0.13.1
# Install Z3 optional dependency for Windows builds.
python -m pip install z3-solver==4.12.2.0

View File

@ -0,0 +1,133 @@
@echo off
:: This script parses args, installs required libraries (miniconda, MKL,
:: Magma), and then delegates to cpu.bat, cuda80.bat, etc.
if not "%CUDA_VERSION%" == "" if not "%PYTORCH_BUILD_VERSION%" == "" if not "%PYTORCH_BUILD_NUMBER%" == "" goto env_end
if "%~1"=="" goto arg_error
if "%~2"=="" goto arg_error
if "%~3"=="" goto arg_error
if not "%~4"=="" goto arg_error
goto arg_end
:arg_error
echo Illegal number of parameters. Pass cuda version, pytorch version, build number
echo CUDA version should be Mm with no dot, e.g. '80'
echo DESIRED_PYTHON should be M.m, e.g. '2.7'
exit /b 1
:arg_end
set CUDA_VERSION=%~1
set PYTORCH_BUILD_VERSION=%~2
set PYTORCH_BUILD_NUMBER=%~3
:env_end
set CUDA_PREFIX=cuda%CUDA_VERSION%
if "%CUDA_VERSION%" == "cpu" set CUDA_PREFIX=cpu
if "%CUDA_VERSION%" == "xpu" set CUDA_PREFIX=xpu
if "%DESIRED_PYTHON%" == "" set DESIRED_PYTHON=3.5;3.6;3.7
set DESIRED_PYTHON_PREFIX=%DESIRED_PYTHON:.=%
set DESIRED_PYTHON_PREFIX=py%DESIRED_PYTHON_PREFIX:;=;py%
set SRC_DIR=%~dp0
pushd %SRC_DIR%
:: Install Miniconda3
set "CONDA_HOME=%CD%\conda"
set "tmp_conda=%CONDA_HOME%"
set "miniconda_exe=%CD%\miniconda.exe"
rmdir /s /q conda
del miniconda.exe
curl --retry 3 -k https://repo.anaconda.com/miniconda/Miniconda3-py311_23.9.0-0-Windows-x86_64.exe -o "%miniconda_exe%"
start /wait "" "%miniconda_exe%" /S /InstallationType=JustMe /RegisterPython=0 /AddToPath=0 /D=%tmp_conda%
if ERRORLEVEL 1 exit /b 1
set "ORIG_PATH=%PATH%"
set "PATH=%CONDA_HOME%;%CONDA_HOME%\scripts;%CONDA_HOME%\Library\bin;%PATH%"
:: create a new conda environment and install packages
:try
SET /A tries=3
:loop
IF %tries% LEQ 0 GOTO :exception
call condaenv.bat
IF %ERRORLEVEL% EQU 0 GOTO :done
SET /A "tries=%tries%-1"
:exception
echo "Failed to create conda env"
exit /B 1
:done
:: Download MAGMA Files on CUDA builds
set MAGMA_VERSION=2.5.4
if "%DEBUG%" == "1" (
set BUILD_TYPE=debug
) else (
set BUILD_TYPE=release
)
if not "%CUDA_VERSION%" == "cpu" if not "%CUDA_VERSION%" == "xpu" (
rmdir /s /q magma_%CUDA_PREFIX%_%BUILD_TYPE%
del magma_%CUDA_PREFIX%_%BUILD_TYPE%.7z
curl -k https://s3.amazonaws.com/ossci-windows/magma_%MAGMA_VERSION%_%CUDA_PREFIX%_%BUILD_TYPE%.7z -o magma_%CUDA_PREFIX%_%BUILD_TYPE%.7z
7z x -aoa magma_%CUDA_PREFIX%_%BUILD_TYPE%.7z -omagma_%CUDA_PREFIX%_%BUILD_TYPE%
)
:: Install sccache
if "%USE_SCCACHE%" == "1" (
mkdir %CD%\tmp_bin
curl -k https://s3.amazonaws.com/ossci-windows/sccache.exe --output %CD%\tmp_bin\sccache.exe
curl -k https://s3.amazonaws.com/ossci-windows/sccache-cl.exe --output %CD%\tmp_bin\sccache-cl.exe
if not "%CUDA_VERSION%" == "" (
set ADDITIONAL_PATH=%CD%\tmp_bin
set SCCACHE_IDLE_TIMEOUT=1500
:: randomtemp is used to resolve the intermittent build error related to CUDA.
:: code: https://github.com/peterjc123/randomtemp-rust
:: issue: https://github.com/pytorch/pytorch/issues/25393
::
:: CMake requires a single command as CUDA_NVCC_EXECUTABLE, so we push the wrappers
:: randomtemp.exe and sccache.exe into a batch file which CMake invokes.
curl -kL https://github.com/peterjc123/randomtemp-rust/releases/download/v0.4/randomtemp.exe --output %SRC_DIR%\tmp_bin\randomtemp.exe
echo @"%SRC_DIR%\tmp_bin\randomtemp.exe" "%SRC_DIR%\tmp_bin\sccache.exe" "%CUDA_PATH%\bin\nvcc.exe" %%* > "%SRC_DIR%/tmp_bin/nvcc.bat"
cat %SRC_DIR%/tmp_bin/nvcc.bat
set CUDA_NVCC_EXECUTABLE=%SRC_DIR%/tmp_bin/nvcc.bat
:: CMake doesn't accept back-slashes in the path
for /F "usebackq delims=" %%n in (`cygpath -m "%CUDA_PATH%\bin\nvcc.exe"`) do set CMAKE_CUDA_COMPILER=%%n
set CMAKE_CUDA_COMPILER_LAUNCHER=%SRC_DIR%\tmp_bin\randomtemp.exe;%SRC_DIR%\tmp_bin\sccache.exe
)
)
set PYTORCH_BINARY_BUILD=1
set TH_BINARY_BUILD=1
set INSTALL_TEST=0
for %%v in (%DESIRED_PYTHON_PREFIX%) do (
:: Activate Python Environment
set PYTHON_PREFIX=%%v
set "CONDA_LIB_PATH=%CONDA_HOME%\envs\%%v\Library\bin"
if not "%ADDITIONAL_PATH%" == "" (
set "PATH=%ADDITIONAL_PATH%;%CONDA_HOME%\envs\%%v;%CONDA_HOME%\envs\%%v\scripts;%CONDA_HOME%\envs\%%v\Library\bin;%ORIG_PATH%"
) else (
set "PATH=%CONDA_HOME%\envs\%%v;%CONDA_HOME%\envs\%%v\scripts;%CONDA_HOME%\envs\%%v\Library\bin;%ORIG_PATH%"
)
pip install ninja
@setlocal
:: Set Flags
if not "%CUDA_VERSION%"=="cpu" if not "%CUDA_VERSION%" == "xpu" (
set MAGMA_HOME=%cd%\magma_%CUDA_PREFIX%_%BUILD_TYPE%
)
echo "Calling arch build script"
call %CUDA_PREFIX%.bat
if ERRORLEVEL 1 exit /b 1
@endlocal
)
set "PATH=%ORIG_PATH%"
popd
if ERRORLEVEL 1 exit /b 1

View File

@ -0,0 +1,26 @@
IF "%DESIRED_PYTHON%"=="" (
echo DESIRED_PYTHON is NOT defined.
exit /b 1
)
:: Create a new conda environment
setlocal EnableDelayedExpansion
FOR %%v IN (%DESIRED_PYTHON%) DO (
set PYTHON_VERSION_STR=%%v
set PYTHON_VERSION_STR=!PYTHON_VERSION_STR:.=!
conda remove -n py!PYTHON_VERSION_STR! --all -y || rmdir %CONDA_HOME%\envs\py!PYTHON_VERSION_STR! /s
if "%%v" == "3.8" call conda create -n py!PYTHON_VERSION_STR! -y -q numpy=1.11 pyyaml boto3 cmake ninja typing_extensions setuptools=72.1.0 python=%%v
if "%%v" == "3.9" call conda create -n py!PYTHON_VERSION_STR! -y -q numpy=2.0.1 pyyaml boto3 cmake ninja typing_extensions setuptools=72.1.0 python=%%v
if "%%v" == "3.10" call conda create -n py!PYTHON_VERSION_STR! -y -q -c=conda-forge numpy=2.0.1 pyyaml boto3 cmake ninja typing_extensions setuptools=72.1.0 python=%%v
if "%%v" == "3.11" call conda create -n py!PYTHON_VERSION_STR! -y -q -c=conda-forge numpy=2.0.1 pyyaml boto3 cmake ninja typing_extensions setuptools=72.1.0 python=%%v
if "%%v" == "3.12" call conda create -n py!PYTHON_VERSION_STR! -y -q -c=conda-forge numpy=2.0.1 pyyaml boto3 cmake ninja typing_extensions setuptools=72.1.0 python=%%v
if "%%v" == "3.13" call conda create -n py!PYTHON_VERSION_STR! -y -q -c=conda-forge numpy=2.1.2 pyyaml boto3 cmake ninja typing_extensions setuptools=72.1.0 python=%%v
call conda run -n py!PYTHON_VERSION_STR! pip install mkl-include
call conda run -n py!PYTHON_VERSION_STR! pip install mkl-static
)
endlocal
:: Install libuv
conda install -y -q -c conda-forge libuv=1.39
set libuv_ROOT=%CONDA_HOME%\Library
echo libuv_ROOT=%libuv_ROOT%

View File

@ -0,0 +1,30 @@
@echo off
set MODULE_NAME=pytorch
IF NOT EXIST "setup.py" IF NOT EXIST "%MODULE_NAME%" (
call internal\clone.bat
cd %~dp0
) ELSE (
call internal\clean.bat
)
IF ERRORLEVEL 1 goto :eof
call internal\check_deps.bat
IF ERRORLEVEL 1 goto :eof
REM Check for optional components
echo Disabling CUDA
set USE_CUDA=0
call internal\check_opts.bat
IF ERRORLEVEL 1 goto :eof
if exist "%NIGHTLIES_PYTORCH_ROOT%" cd %NIGHTLIES_PYTORCH_ROOT%\..
call %~dp0\internal\copy_cpu.bat
IF ERRORLEVEL 1 goto :eof
call %~dp0\internal\setup.bat
IF ERRORLEVEL 1 goto :eof

View File

@ -0,0 +1,59 @@
@echo off
set MODULE_NAME=pytorch
IF NOT EXIST "setup.py" IF NOT EXIST "%MODULE_NAME%" (
call internal\clone.bat
cd %~dp0
) ELSE (
call internal\clean.bat
)
IF ERRORLEVEL 1 goto :eof
call internal\check_deps.bat
IF ERRORLEVEL 1 goto :eof
REM Check for optional components
set USE_CUDA=
set CMAKE_GENERATOR=Visual Studio 15 2017 Win64
IF "%NVTOOLSEXT_PATH%"=="" (
IF EXIST "C:\Program Files\NVIDIA Corporation\NvToolsExt\lib\x64\nvToolsExt64_1.lib" (
set NVTOOLSEXT_PATH=C:\Program Files\NVIDIA Corporation\NvToolsExt
) ELSE (
echo NVTX ^(Visual Studio Extension ^for CUDA^) ^not installed, failing
exit /b 1
)
)
IF "%CUDA_PATH_V118%"=="" (
IF EXIST "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\bin\nvcc.exe" (
set "CUDA_PATH_V118=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8"
) ELSE (
echo CUDA 11.8 not found, failing
exit /b 1
)
)
IF "%BUILD_VISION%" == "" (
set TORCH_CUDA_ARCH_LIST=3.7+PTX;5.0;6.0;6.1;7.0;7.5;8.0;8.6;9.0
set TORCH_NVCC_FLAGS=-Xfatbin -compress-all
) ELSE (
set NVCC_FLAGS=-D__CUDA_NO_HALF_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_35,code=sm_35 -gencode=arch=compute_50,code=sm_50 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_90,code=compute_90
)
set "CUDA_PATH=%CUDA_PATH_V118%"
set "PATH=%CUDA_PATH_V118%\bin;%PATH%"
:optcheck
call internal\check_opts.bat
IF ERRORLEVEL 1 goto :eof
if exist "%NIGHTLIES_PYTORCH_ROOT%" cd %NIGHTLIES_PYTORCH_ROOT%\..
call %~dp0\internal\copy.bat
IF ERRORLEVEL 1 goto :eof
call %~dp0\internal\setup.bat
IF ERRORLEVEL 1 goto :eof

View File

@ -0,0 +1,59 @@
@echo off
set MODULE_NAME=pytorch
IF NOT EXIST "setup.py" IF NOT EXIST "%MODULE_NAME%" (
call internal\clone.bat
cd %~dp0
) ELSE (
call internal\clean.bat
)
IF ERRORLEVEL 1 goto :eof
call internal\check_deps.bat
IF ERRORLEVEL 1 goto :eof
REM Check for optional components
set USE_CUDA=
set CMAKE_GENERATOR=Visual Studio 15 2017 Win64
IF "%NVTOOLSEXT_PATH%"=="" (
IF EXIST "C:\Program Files\NVIDIA Corporation\NvToolsExt\lib\x64\nvToolsExt64_1.lib" (
set NVTOOLSEXT_PATH=C:\Program Files\NVIDIA Corporation\NvToolsExt
) ELSE (
echo NVTX ^(Visual Studio Extension ^for CUDA^) ^not installed, failing
exit /b 1
)
)
IF "%CUDA_PATH_V124%"=="" (
IF EXIST "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4\bin\nvcc.exe" (
set "CUDA_PATH_V124=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4"
) ELSE (
echo CUDA 12.4 not found, failing
exit /b 1
)
)
IF "%BUILD_VISION%" == "" (
set TORCH_CUDA_ARCH_LIST=5.0;6.0;6.1;7.0;7.5;8.0;8.6;9.0
set TORCH_NVCC_FLAGS=-Xfatbin -compress-all
) ELSE (
set NVCC_FLAGS=-D__CUDA_NO_HALF_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_50,code=sm_50 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_90,code=compute_90
)
set "CUDA_PATH=%CUDA_PATH_V124%"
set "PATH=%CUDA_PATH_V124%\bin;%PATH%"
:optcheck
call internal\check_opts.bat
IF ERRORLEVEL 1 goto :eof
if exist "%NIGHTLIES_PYTORCH_ROOT%" cd %NIGHTLIES_PYTORCH_ROOT%\..
call %~dp0\internal\copy.bat
IF ERRORLEVEL 1 goto :eof
call %~dp0\internal\setup.bat
IF ERRORLEVEL 1 goto :eof

View File

@ -0,0 +1,59 @@
@echo off
set MODULE_NAME=pytorch
IF NOT EXIST "setup.py" IF NOT EXIST "%MODULE_NAME%" (
call internal\clone.bat
cd %~dp0
) ELSE (
call internal\clean.bat
)
IF ERRORLEVEL 1 goto :eof
call internal\check_deps.bat
IF ERRORLEVEL 1 goto :eof
REM Check for optional components
set USE_CUDA=
set CMAKE_GENERATOR=Visual Studio 15 2017 Win64
IF "%NVTOOLSEXT_PATH%"=="" (
IF EXIST "C:\Program Files\NVIDIA Corporation\NvToolsExt\lib\x64\nvToolsExt64_1.lib" (
set NVTOOLSEXT_PATH=C:\Program Files\NVIDIA Corporation\NvToolsExt
) ELSE (
echo NVTX ^(Visual Studio Extension ^for CUDA^) ^not installed, failing
exit /b 1
)
)
IF "%CUDA_PATH_V126%"=="" (
IF EXIST "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\bin\nvcc.exe" (
set "CUDA_PATH_V126=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6"
) ELSE (
echo CUDA 12.6 not found, failing
exit /b 1
)
)
IF "%BUILD_VISION%" == "" (
set TORCH_CUDA_ARCH_LIST=5.0;6.0;6.1;7.0;7.5;8.0;8.6;9.0
set TORCH_NVCC_FLAGS=-Xfatbin -compress-all
) ELSE (
set NVCC_FLAGS=-D__CUDA_NO_HALF_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_50,code=sm_50 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_90,code=compute_90
)
set "CUDA_PATH=%CUDA_PATH_V126%"
set "PATH=%CUDA_PATH_V126%\bin;%PATH%"
:optcheck
call internal\check_opts.bat
IF ERRORLEVEL 1 goto :eof
if exist "%NIGHTLIES_PYTORCH_ROOT%" cd %NIGHTLIES_PYTORCH_ROOT%\..
call %~dp0\internal\copy.bat
IF ERRORLEVEL 1 goto :eof
call %~dp0\internal\setup.bat
IF ERRORLEVEL 1 goto :eof

View File

@ -0,0 +1,9 @@
@echo off
curl -k https://www.7-zip.org/a/7z1805-x64.exe -O
if errorlevel 1 exit /b 1
start /wait 7z1805-x64.exe /S
if errorlevel 1 exit /b 1
set "PATH=%ProgramFiles%\7-Zip;%PATH%"

View File

@ -0,0 +1,11 @@
call windows/internal/vc_install_helper.bat
if errorlevel 1 exit /b 1
call windows/internal/cuda_install.bat
if errorlevel 1 exit /b 1
call windows/internal/xpu_install.bat
if errorlevel 1 exit /b 1
call windows/build_pytorch.bat %CUDA_VERSION% %PYTORCH_BUILD_VERSION% %PYTORCH_BUILD_NUMBER%
if errorlevel 1 exit /b 1

View File

@ -0,0 +1,80 @@
@echo off
REM Check for necessary components
echo "Checking dependencies"
IF NOT "%PROCESSOR_ARCHITECTURE%"=="AMD64" (
echo You should use 64-bit Windows to build and run PyTorch
exit /b 1
)
echo "Cmake check"
IF "%BUILD_VISION%" == "" (
where /q cmake.exe
IF ERRORLEVEL 1 (
echo CMake is required to compile PyTorch on Windows
exit /b 1
)
)
if not exist "%ProgramFiles(x86)%\Microsoft Visual Studio\Installer\vswhere.exe" (
echo Visual Studio %VC_YEAR% C++ BuildTools is required to compile PyTorch on Windows
exit /b 1
)
set VC_VERSION_LOWER=17
set VC_VERSION_UPPER=18
if "%VC_YEAR%" == "2019" (
set VC_VERSION_LOWER=16
set VC_VERSION_UPPER=17
)
if NOT "%VS15INSTALLDIR%" == "" if exist "%VS15INSTALLDIR%\VC\Auxiliary\Build\vcvarsall.bat" (
set "VS15VCVARSALL=%VS15INSTALLDIR%\VC\Auxiliary\Build\vcvarsall.bat"
goto vswhere
)
for /f "usebackq tokens=*" %%i in (`"%ProgramFiles(x86)%\Microsoft Visual Studio\Installer\vswhere.exe" -legacy -products * -version [%VC_VERSION_LOWER%^,%VC_VERSION_UPPER%^) -property installationPath`) do (
if exist "%%i" if exist "%%i\VC\Auxiliary\Build\vcvarsall.bat" (
set "VS15INSTALLDIR=%%i"
set "VS15VCVARSALL=%%i\VC\Auxiliary\Build\vcvarsall.bat"
goto vswhere
)
)
:vswhere
IF "%VS15VCVARSALL%"=="" (
echo Visual Studio %VC_YEAR% C++ BuildTools is required to compile PyTorch on Windows
exit /b 1
)
set MSSdk=1
set DISTUTILS_USE_SDK=1
where /q python.exe
IF ERRORLEVEL 1 (
echo Python x64 3.5 or up is required to compile PyTorch on Windows
exit /b 1
)
for /F "usebackq delims=" %%i in (`python -c "import sys; print('{0[0]}{0[1]}'.format(sys.version_info))"`) do (
set /a PYVER=%%i
)
if %PYVER% LSS 35 (
echo Warning: PyTorch for Python 2 under Windows is experimental.
echo Python x64 3.5 or up is recommended to compile PyTorch on Windows
echo Maybe you can create a virtual environment if you have conda installed:
echo ^> conda create -n test python=3.6 pyyaml numpy
echo ^> activate test
)
for /F "usebackq delims=" %%i in (`python -c "import struct;print( 8 * struct.calcsize('P'))"`) do (
set /a PYSIZE=%%i
)
if %PYSIZE% NEQ 64 (
echo Python x64 3.5 or up is required to compile PyTorch on Windows
exit /b 1
)
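
Note: the two `python -c` probes above compute the interpreter version as an integer (e.g. 311) and its pointer width (64 on x64 builds). The same checks, expanded into plain Python for readability:

```python
# What the inline `python -c` probes above compute; for illustration only.
import struct
import sys

pyver = int("{0[0]}{0[1]}".format(sys.version_info))  # e.g. 311 for Python 3.11
pysize = 8 * struct.calcsize("P")                     # 64 on a 64-bit interpreter
print(pyver, pysize)
```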

View File

@ -0,0 +1,47 @@
@echo off
REM Check for optional components
where /q ninja.exe
IF NOT ERRORLEVEL 1 (
echo Ninja found, using it to speed up builds
set CMAKE_GENERATOR=Ninja
)
IF "%USE_SCCACHE%" == "0" goto sccache_end
where /q clcache.exe
IF NOT ERRORLEVEL 1 (
echo clcache found, using it to speed up builds
set CC=clcache
set CXX=clcache
)
where /q sccache-cl.exe
IF NOT ERRORLEVEL 1 (
echo sccache-cl found, using it to speed up builds
set CC=sccache-cl
set CXX=sccache-cl
)
IF "%CC%" == "sccache-cl" IF "%CXX%" == "sccache-cl" goto sccache_end
where /q sccache.exe
IF NOT ERRORLEVEL 1 (
echo sccache found, using it to speed up builds
set CC=sccache cl
set CXX=sccache cl
)
:sccache_end
IF exist "%MKLProductDir%\mkl\lib\intel64_win" (
echo MKL found, adding it to build
set "LIB=%MKLProductDir%\mkl\lib\intel64_win;%MKLProductDir%\compiler\lib\intel64_win;%LIB%";
)
exit /b 0

View File

@ -0,0 +1,5 @@
@echo off
cd %MODULE_NAME%
python setup.py clean
cd ..

View File

@ -0,0 +1,51 @@
@echo off
:: The conda and wheels jobs are separated on Windows, so we don't need to clone again.
if not exist "%NIGHTLIES_PYTORCH_ROOT%" goto clone_pytorch
echo "Changing to NIGHTLIES_PYTORCH_ROOT"
cd "%NIGHTLIES_PYTORCH_ROOT%"
goto submodule
:clone_pytorch
git clone https://github.com/%PYTORCH_REPO%/%MODULE_NAME%
cd %MODULE_NAME%
IF NOT "%PYTORCH_BRANCH%" == "latest" goto latest_end
:latest_start
if NOT "%NIGHTLIES_DATE%" == "" goto date_end
:date_start
set "DATE_CMD=Get-Date ([System.TimeZoneInfo]::ConvertTimeFromUtc((Get-Date).ToUniversalTime(), [System.TimeZoneInfo]::FindSystemTimeZoneById('Pacific Standard Time'))) -f 'yyyy_MM_dd'"
set "DATE_COMPACT_CMD=Get-Date ([System.TimeZoneInfo]::ConvertTimeFromUtc((Get-Date).ToUniversalTime(), [System.TimeZoneInfo]::FindSystemTimeZoneById('Pacific Standard Time'))) -f 'yyyyMMdd'"
FOR /F "delims=" %%i IN ('powershell -c "%DATE_CMD%"') DO set NIGHTLIES_DATE=%%i
FOR /F "delims=" %%i IN ('powershell -c "%DATE_COMPACT_CMD%"') DO set NIGHTLIES_DATE_COMPACT=%%i
:date_end
if "%NIGHTLIES_DATE_COMPACT%" == "" set NIGHTLIES_DATE_COMPACT=%NIGHTLIES_DATE:~0,4%%NIGHTLIES_DATE:~5,2%%NIGHTLIES_DATE:~8,2%
:: Switch to the latest commit by 11:59 yesterday
echo PYTORCH_BRANCH is set to latest so I will find the last commit
echo before 0:00 midnight on %NIGHTLIES_DATE%
set git_date=%NIGHTLIES_DATE:_=-%
FOR /F "delims=" %%i IN ('git log --before %git_date% -n 1 "--pretty=%%H"') DO set last_commit=%%i
echo Setting PYTORCH_BRANCH to %last_commit% since that was the last
echo commit before %NIGHTLIES_DATE%
set PYTORCH_BRANCH=%last_commit%
:latest_end
IF "%PYTORCH_BRANCH%" == "" set PYTORCH_BRANCH=v%PYTORCH_BUILD_VERSION%
git checkout %PYTORCH_BRANCH%
IF ERRORLEVEL 1 git checkout tags/%PYTORCH_BRANCH%
:submodule
git submodule update --init --recursive
IF ERRORLEVEL 1 exit /b 1
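
Note: the NIGHTLIES_DATE computation above asks PowerShell for the current date in US Pacific time, in both `yyyy_MM_dd` and `yyyyMMdd` forms. An equivalent sketch in Python (illustrative; the script itself uses PowerShell, and `zoneinfo` needs the `tzdata` package on Windows):

```python
# Illustrative Python equivalent of the PowerShell date commands above.
from datetime import datetime
from zoneinfo import ZoneInfo  # on Windows this needs the tzdata package installed

now_pt = datetime.now(ZoneInfo("America/Los_Angeles"))
nightlies_date = now_pt.strftime("%Y_%m_%d")        # e.g. 2025_01_15
nightlies_date_compact = now_pt.strftime("%Y%m%d")  # e.g. 20250115
print(nightlies_date, nightlies_date_compact)
```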

View File

@ -0,0 +1,26 @@
copy "%CUDA_PATH%\bin\cusparse*64_*.dll*" pytorch\torch\lib
copy "%CUDA_PATH%\bin\cublas*64_*.dll*" pytorch\torch\lib
copy "%CUDA_PATH%\bin\cudart*64_*.dll*" pytorch\torch\lib
copy "%CUDA_PATH%\bin\curand*64_*.dll*" pytorch\torch\lib
copy "%CUDA_PATH%\bin\cufft*64_*.dll*" pytorch\torch\lib
copy "%CUDA_PATH%\bin\cusolver*64_*.dll*" pytorch\torch\lib
copy "%CUDA_PATH%\bin\cudnn*64_*.dll*" pytorch\torch\lib
copy "%CUDA_PATH%\bin\nvrtc*64_*.dll*" pytorch\torch\lib
copy "%CUDA_PATH%\extras\CUPTI\lib64\cupti64_*.dll*" pytorch\torch\lib
copy "C:\Program Files\NVIDIA Corporation\NvToolsExt\bin\x64\nvToolsExt64_1.dll*" pytorch\torch\lib
copy "%CONDA_LIB_PATH%\libiomp*5md.dll" pytorch\torch\lib
:: Should be set in build_pytorch.bat
copy "%libuv_ROOT%\bin\uv.dll" pytorch\torch\lib
:: Copy zlib if it exists in windows/system32
if exist "C:\Windows\System32\zlibwapi.dll" (
copy "C:\Windows\System32\zlibwapi.dll" pytorch\torch\lib
)
:: Copy the nvJitLink dll; it is required for CUDA 12+
if exist "%CUDA_PATH%\bin\nvJitLink_*.dll*" (
copy "%CUDA_PATH%\bin\nvJitLink_*.dll*" pytorch\torch\lib
)

View File

@ -0,0 +1,3 @@
copy "%CONDA_LIB_PATH%\libiomp*5md.dll" pytorch\torch\lib
:: Should be set in build_pytorch.bat
copy "%libuv_ROOT%\bin\uv.dll" pytorch\torch\lib

View File

@ -0,0 +1,190 @@
@echo on
if "%CUDA_VERSION%" == "cpu" (
echo Skipping for CPU builds
exit /b 0
)
if "%CUDA_VERSION%" == "xpu" (
echo Skipping for XPU builds
exit /b 0
)
set SRC_DIR=%NIGHTLIES_PYTORCH_ROOT%
if not exist "%SRC_DIR%\temp_build" mkdir "%SRC_DIR%\temp_build"
set /a CUDA_VER=%CUDA_VERSION%
set CUDA_VER_MAJOR=%CUDA_VERSION:~0,-1%
set CUDA_VER_MINOR=%CUDA_VERSION:~-1,1%
set CUDA_VERSION_STR=%CUDA_VER_MAJOR%.%CUDA_VER_MINOR%
set CUDNN_FOLDER="cuda"
set CUDNN_LIB_FOLDER="lib\x64"
:: Skip all of this if we already have cuda installed
if exist "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v%CUDA_VERSION_STR%\bin\nvcc.exe" goto set_cuda_env_vars
if %CUDA_VER% EQU 118 goto cuda118
if %CUDA_VER% EQU 121 goto cuda121
if %CUDA_VER% EQU 124 goto cuda124
if %CUDA_VER% EQU 126 goto cuda126
echo CUDA %CUDA_VERSION_STR% is not supported
exit /b 1
:cuda118
set CUDA_INSTALL_EXE=cuda_11.8.0_522.06_windows.exe
if not exist "%SRC_DIR%\temp_build\%CUDA_INSTALL_EXE%" (
curl -k -L "https://ossci-windows.s3.amazonaws.com/%CUDA_INSTALL_EXE%" --output "%SRC_DIR%\temp_build\%CUDA_INSTALL_EXE%"
if errorlevel 1 exit /b 1
set "CUDA_SETUP_FILE=%SRC_DIR%\temp_build\%CUDA_INSTALL_EXE%"
set "ARGS=cuda_profiler_api_11.8 thrust_11.8 nvcc_11.8 cuobjdump_11.8 nvprune_11.8 nvprof_11.8 cupti_11.8 cublas_11.8 cublas_dev_11.8 cudart_11.8 cufft_11.8 cufft_dev_11.8 curand_11.8 curand_dev_11.8 cusolver_11.8 cusolver_dev_11.8 cusparse_11.8 cusparse_dev_11.8 npp_11.8 npp_dev_11.8 nvrtc_11.8 nvrtc_dev_11.8 nvml_dev_11.8 nvtx_11.8"
)
set CUDNN_FOLDER=cudnn-windows-x86_64-9.5.0.50_cuda11-archive
set CUDNN_LIB_FOLDER="lib"
set "CUDNN_INSTALL_ZIP=%CUDNN_FOLDER%.zip"
if not exist "%SRC_DIR%\temp_build\%CUDNN_INSTALL_ZIP%" (
curl -k -L "http://s3.amazonaws.com/ossci-windows/%CUDNN_INSTALL_ZIP%" --output "%SRC_DIR%\temp_build\%CUDNN_INSTALL_ZIP%"
if errorlevel 1 exit /b 1
set "CUDNN_SETUP_FILE=%SRC_DIR%\temp_build\%CUDNN_INSTALL_ZIP%"
)
@REM cuDNN 8.3+ requires zlib to be installed on the PATH
echo Installing ZLIB dlls
curl -k -L "http://s3.amazonaws.com/ossci-windows/zlib123dllx64.zip" --output "%SRC_DIR%\temp_build\zlib123dllx64.zip"
7z x "%SRC_DIR%\temp_build\zlib123dllx64.zip" -o"%SRC_DIR%\temp_build\zlib"
xcopy /Y "%SRC_DIR%\temp_build\zlib\dll_x64\*.dll" "C:\Windows\System32"
goto cuda_common
:cuda124
set CUDA_INSTALL_EXE=cuda_12.4.0_551.61_windows.exe
if not exist "%SRC_DIR%\temp_build\%CUDA_INSTALL_EXE%" (
curl -k -L "https://ossci-windows.s3.amazonaws.com/%CUDA_INSTALL_EXE%" --output "%SRC_DIR%\temp_build\%CUDA_INSTALL_EXE%"
if errorlevel 1 exit /b 1
set "CUDA_SETUP_FILE=%SRC_DIR%\temp_build\%CUDA_INSTALL_EXE%"
set "ARGS=cuda_profiler_api_12.4 thrust_12.4 nvcc_12.4 cuobjdump_12.4 nvprune_12.4 nvprof_12.4 cupti_12.4 cublas_12.4 cublas_dev_12.4 cudart_12.4 cufft_12.4 cufft_dev_12.4 curand_12.4 curand_dev_12.4 cusolver_12.4 cusolver_dev_12.4 cusparse_12.4 cusparse_dev_12.4 npp_12.4 npp_dev_12.4 nvrtc_12.4 nvrtc_dev_12.4 nvml_dev_12.4 nvjitlink_12.4 nvtx_12.4"
)
set CUDNN_FOLDER=cudnn-windows-x86_64-9.5.0.50_cuda12-archive
set CUDNN_LIB_FOLDER="lib"
set "CUDNN_INSTALL_ZIP=%CUDNN_FOLDER%.zip"
if not exist "%SRC_DIR%\temp_build\%CUDNN_INSTALL_ZIP%" (
curl -k -L "http://s3.amazonaws.com/ossci-windows/%CUDNN_INSTALL_ZIP%" --output "%SRC_DIR%\temp_build\%CUDNN_INSTALL_ZIP%"
if errorlevel 1 exit /b 1
set "CUDNN_SETUP_FILE=%SRC_DIR%\temp_build\%CUDNN_INSTALL_ZIP%"
)
@REM cuDNN 8.3+ requires zlib to be installed on the PATH
echo Installing ZLIB dlls
curl -k -L "http://s3.amazonaws.com/ossci-windows/zlib123dllx64.zip" --output "%SRC_DIR%\temp_build\zlib123dllx64.zip"
7z x "%SRC_DIR%\temp_build\zlib123dllx64.zip" -o"%SRC_DIR%\temp_build\zlib"
xcopy /Y "%SRC_DIR%\temp_build\zlib\dll_x64\*.dll" "C:\Windows\System32"
goto cuda_common
:cuda126
set CUDA_INSTALL_EXE=cuda_12.6.2_560.94_windows.exe
if not exist "%SRC_DIR%\temp_build\%CUDA_INSTALL_EXE%" (
curl -k -L "https://ossci-windows.s3.amazonaws.com/%CUDA_INSTALL_EXE%" --output "%SRC_DIR%\temp_build\%CUDA_INSTALL_EXE%"
if errorlevel 1 exit /b 1
set "CUDA_SETUP_FILE=%SRC_DIR%\temp_build\%CUDA_INSTALL_EXE%"
set "ARGS=cuda_profiler_api_12.6 thrust_12.6 nvcc_12.6 cuobjdump_12.6 nvprune_12.6 nvprof_12.6 cupti_12.6 cublas_12.6 cublas_dev_12.6 cudart_12.6 cufft_12.6 cufft_dev_12.6 curand_12.6 curand_dev_12.6 cusolver_12.6 cusolver_dev_12.6 cusparse_12.6 cusparse_dev_12.6 npp_12.6 npp_dev_12.6 nvrtc_12.6 nvrtc_dev_12.6 nvml_dev_12.6 nvjitlink_12.6 nvtx_12.6"
)
set CUDNN_FOLDER=cudnn-windows-x86_64-9.5.0.50_cuda12-archive
set CUDNN_LIB_FOLDER="lib"
set "CUDNN_INSTALL_ZIP=%CUDNN_FOLDER%.zip"
if not exist "%SRC_DIR%\temp_build\%CUDNN_INSTALL_ZIP%" (
curl -k -L "http://s3.amazonaws.com/ossci-windows/%CUDNN_INSTALL_ZIP%" --output "%SRC_DIR%\temp_build\%CUDNN_INSTALL_ZIP%"
if errorlevel 1 exit /b 1
set "CUDNN_SETUP_FILE=%SRC_DIR%\temp_build\%CUDNN_INSTALL_ZIP%"
)
@REM cuDNN 8.3+ requires zlib to be installed on the PATH
echo Installing ZLIB dlls
curl -k -L "http://s3.amazonaws.com/ossci-windows/zlib123dllx64.zip" --output "%SRC_DIR%\temp_build\zlib123dllx64.zip"
7z x "%SRC_DIR%\temp_build\zlib123dllx64.zip" -o"%SRC_DIR%\temp_build\zlib"
xcopy /Y "%SRC_DIR%\temp_build\zlib\dll_x64\*.dll" "C:\Windows\System32"
goto cuda_common
:cuda_common
:: NOTE: We only install CUDA if we don't have it installed already.
:: With GHA runners these should be pre-installed as part of our AMI process
:: If you cannot find the CUDA version you want to build for here then please
:: add it @ https://github.com/pytorch/test-infra/tree/main/aws/ami/windows
if not exist "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v%CUDA_VERSION_STR%\bin\nvcc.exe" (
if not exist "%SRC_DIR%\temp_build\NvToolsExt.7z" (
curl -k -L https://ossci-windows.s3.us-east-1.amazonaws.com/builder/NvToolsExt.7z --output "%SRC_DIR%\temp_build\NvToolsExt.7z"
if errorlevel 1 exit /b 1
)
if not exist "%SRC_DIR%\temp_build\gpu_driver_dlls.zip" (
curl -k -L "https://ossci-windows.s3.us-east-1.amazonaws.com/builder/additional_dlls.zip" --output "%SRC_DIR%\temp_build\gpu_driver_dlls.zip"
if errorlevel 1 exit /b 1
)
echo Installing CUDA toolkit...
7z x %CUDA_SETUP_FILE% -o"%SRC_DIR%\temp_build\cuda"
pushd "%SRC_DIR%\temp_build\cuda"
sc config wuauserv start= disabled
sc stop wuauserv
sc query wuauserv
start /wait setup.exe -s %ARGS% -loglevel:6 -log:"%cd%/cuda_install_logs"
echo %errorlevel%
popd
echo Installing VS integration...
if "%VC_YEAR%" == "2019" (
xcopy /Y "%SRC_DIR%\temp_build\cuda\CUDAVisualStudioIntegration\extras\visual_studio_integration\MSBuildExtensions\*.*" "C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\MSBuild\Microsoft\VC\v160\BuildCustomizations"
)
if "%VC_YEAR%" == "2022" (
xcopy /Y "%SRC_DIR%\temp_build\cuda\CUDAVisualStudioIntegration\extras\visual_studio_integration\MSBuildExtensions\*.*" "C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\MSBuild\Microsoft\VC\v170\BuildCustomizations"
)
echo Installing NvToolsExt...
7z x %SRC_DIR%\temp_build\NvToolsExt.7z -o"%SRC_DIR%\temp_build\NvToolsExt"
mkdir "%ProgramFiles%\NVIDIA Corporation\NvToolsExt\bin\x64"
mkdir "%ProgramFiles%\NVIDIA Corporation\NvToolsExt\include"
mkdir "%ProgramFiles%\NVIDIA Corporation\NvToolsExt\lib\x64"
xcopy /Y "%SRC_DIR%\temp_build\NvToolsExt\bin\x64\*.*" "%ProgramFiles%\NVIDIA Corporation\NvToolsExt\bin\x64"
xcopy /Y "%SRC_DIR%\temp_build\NvToolsExt\include\*.*" "%ProgramFiles%\NVIDIA Corporation\NvToolsExt\include"
xcopy /Y "%SRC_DIR%\temp_build\NvToolsExt\lib\x64\*.*" "%ProgramFiles%\NVIDIA Corporation\NvToolsExt\lib\x64"
echo Installing cuDNN...
7z x %CUDNN_SETUP_FILE% -o"%SRC_DIR%\temp_build\cudnn"
xcopy /Y "%SRC_DIR%\temp_build\cudnn\%CUDNN_FOLDER%\bin\*.*" "%ProgramFiles%\NVIDIA GPU Computing Toolkit\CUDA\v%CUDA_VERSION_STR%\bin"
xcopy /Y "%SRC_DIR%\temp_build\cudnn\%CUDNN_FOLDER%\%CUDNN_LIB_FOLDER%\*.*" "%ProgramFiles%\NVIDIA GPU Computing Toolkit\CUDA\v%CUDA_VERSION_STR%\lib\x64"
xcopy /Y "%SRC_DIR%\temp_build\cudnn\%CUDNN_FOLDER%\include\*.*" "%ProgramFiles%\NVIDIA GPU Computing Toolkit\CUDA\v%CUDA_VERSION_STR%\include"
echo Installing GPU driver DLLs
7z x %SRC_DIR%\temp_build\gpu_driver_dlls.zip -o"C:\Windows\System32"
echo Cleaning temp files
rd /s /q "%SRC_DIR%\temp_build" || ver > nul
if not exist "%ProgramFiles%\NVIDIA GPU Computing Toolkit\CUDA\v%CUDA_VERSION_STR%\bin\nvcc.exe" (
echo CUDA %CUDA_VERSION_STR% installation failed.
echo --------- setup.exe.log -------
type "%SRC_DIR%\temp_build\cuda\cuda_install_logs\LOG.setup.exe.log"
echo --------- RunDll32.exe.log
type "%SRC_DIR%\temp_build\cuda\cuda_install_logs\LOG.RunDll32.exe.log"
exit /b 1
)
)
goto set_cuda_env_vars
:set_cuda_env_vars
echo Setting up environment...
set "PATH=%ProgramFiles%\NVIDIA GPU Computing Toolkit\CUDA\v%CUDA_VERSION_STR%\bin;%ProgramFiles%\NVIDIA GPU Computing Toolkit\CUDA\v%CUDA_VERSION_STR%\libnvvp;%PATH%"
set "CUDA_PATH=%ProgramFiles%\NVIDIA GPU Computing Toolkit\CUDA\v%CUDA_VERSION_STR%"
set "CUDA_PATH_V%CUDA_VER_MAJOR%_%CUDA_VER_MINOR%=%ProgramFiles%\NVIDIA GPU Computing Toolkit\CUDA\v%CUDA_VERSION_STR%"
set "NVTOOLSEXT_PATH=%ProgramFiles%\NVIDIA Corporation\NvToolsExt"

View File

@ -0,0 +1,9 @@
set WIN_DRIVER_VN=528.89
set "DRIVER_DOWNLOAD_LINK=https://ossci-windows.s3.amazonaws.com/%WIN_DRIVER_VN%-data-center-tesla-desktop-winserver-2016-2019-2022-dch-international.exe"
curl --retry 3 -kL %DRIVER_DOWNLOAD_LINK% --output %WIN_DRIVER_VN%-data-center-tesla-desktop-winserver-2016-2019-2022-dch-international.exe
if errorlevel 1 exit /b 1
start /wait %WIN_DRIVER_VN%-data-center-tesla-desktop-winserver-2016-2019-2022-dch-international.exe -s -noreboot
if errorlevel 1 exit /b 1
del %WIN_DRIVER_VN%-data-center-tesla-desktop-winserver-2016-2019-2022-dch-international.exe || ver > NUL

View File

@ -0,0 +1,38 @@
@echo off
:: Caution: Please don't use this script locally
:: It may destroy your build environment.
setlocal
if not exist "%ProgramFiles(x86)%\Microsoft Visual Studio\Installer\vswhere.exe" (
echo Visual Studio %VC_YEAR% C++ BuildTools is required to compile PyTorch on Windows
exit /b 1
)
set VC_VERSION_LOWER=17
set VC_VERSION_UPPER=18
if "%VC_YEAR%" == "2019" (
set VC_VERSION_LOWER=16
set VC_VERSION_UPPER=17
)
for /f "usebackq tokens=*" %%i in (`"%ProgramFiles(x86)%\Microsoft Visual Studio\Installer\vswhere.exe" -legacy -products * -version [%VC_VERSION_LOWER%^,%VC_VERSION_UPPER%^) -property installationPath`) do (
if exist "%%i" if exist "%%i\VC\Auxiliary\Build\vcvarsall.bat" (
set "VS15INSTALLDIR=%%i"
set "VS15VCVARSALL=%%i\VC\Auxiliary\Build\vcvarsall.bat"
goto vswhere
)
)
:vswhere
if "%VS15VCVARSALL%"=="" (
echo Visual Studio %VC_YEAR% C++ BuildTools is required to compile PyTorch on Windows
exit /b 1
)
call "%VS15VCVARSALL%" x86_amd64
for /f "usebackq tokens=*" %%i in (`where link.exe`) do move "%%i" "%%i.bak"
endlocal

View File

@ -0,0 +1,106 @@
@echo off
echo The flags after configuring:
echo USE_CUDA=%USE_CUDA%
echo CMAKE_GENERATOR=%CMAKE_GENERATOR%
if "%USE_CUDA%"=="" echo CUDA_PATH=%CUDA_PATH%
if NOT "%CC%"=="" echo CC=%CC%
if NOT "%CXX%"=="" echo CXX=%CXX%
if NOT "%DISTUTILS_USE_SDK%"=="" echo DISTUTILS_USE_SDK=%DISTUTILS_USE_SDK%
set SRC_DIR=%NIGHTLIES_PYTORCH_ROOT%
IF "%VSDEVCMD_ARGS%" == "" (
call "%VS15VCVARSALL%" x64
) ELSE (
call "%VS15VCVARSALL%" x64 %VSDEVCMD_ARGS%
)
pushd %SRC_DIR%
IF NOT exist "setup.py" (
cd %MODULE_NAME%
)
if "%CXX%"=="sccache cl" goto sccache_start
if "%CXX%"=="sccache-cl" goto sccache_start
goto sccache_end
:sccache_start
set SCCACHE_IDLE_TIMEOUT=0
sccache --stop-server
sccache --start-server
sccache --zero-stats
:sccache_end
if "%BUILD_PYTHONLESS%" == "" goto pytorch else goto libtorch
:libtorch
set VARIANT=shared-with-deps
mkdir libtorch
mkdir libtorch\bin
mkdir libtorch\cmake
mkdir libtorch\include
mkdir libtorch\lib
mkdir libtorch\share
mkdir libtorch\test
mkdir build
pushd build
python ../tools/build_libtorch.py
popd
IF ERRORLEVEL 1 exit /b 1
IF NOT ERRORLEVEL 0 exit /b 1
move /Y torch\bin\*.* libtorch\bin\
move /Y torch\cmake\*.* libtorch\cmake\
robocopy /move /e torch\include\ libtorch\include\
move /Y torch\lib\*.* libtorch\lib\
robocopy /move /e torch\share\ libtorch\share\
move /Y torch\test\*.* libtorch\test\
move /Y libtorch\bin\*.dll libtorch\lib\
echo %PYTORCH_BUILD_VERSION% > libtorch\build-version
git rev-parse HEAD > libtorch\build-hash
IF "%DEBUG%" == "" (
set LIBTORCH_PREFIX=libtorch-win-%VARIANT%
) ELSE (
set LIBTORCH_PREFIX=libtorch-win-%VARIANT%-debug
)
7z a -tzip "%LIBTORCH_PREFIX%-%PYTORCH_BUILD_VERSION%.zip" libtorch\*
:: Cleanup raw data to save space
rmdir /s /q libtorch
if not exist ..\output mkdir ..\output
copy /Y "%LIBTORCH_PREFIX%-%PYTORCH_BUILD_VERSION%.zip" "%PYTORCH_FINAL_PACKAGE_DIR%\"
copy /Y "%LIBTORCH_PREFIX%-%PYTORCH_BUILD_VERSION%.zip" "%PYTORCH_FINAL_PACKAGE_DIR%\%LIBTORCH_PREFIX%-latest.zip"
goto build_end
:pytorch
python setup.py bdist_wheel -d "%PYTORCH_FINAL_PACKAGE_DIR%"
:build_end
IF ERRORLEVEL 1 exit /b 1
IF NOT ERRORLEVEL 0 exit /b 1
if "%CXX%"=="sccache cl" goto sccache_cleanup
if "%CXX%"=="sccache-cl" goto sccache_cleanup
goto sccache_cleanup_end
:sccache_cleanup
sccache --show-stats
taskkill /im sccache.exe /f /t || ver > nul
taskkill /im nvcc.exe /f /t || ver > nul
:sccache_cleanup_end
cd ..

View File

@ -0,0 +1,242 @@
set SRC_DIR=%~dp0
pushd %SRC_DIR%\..
if not "%CUDA_VERSION%" == "cpu" if not "%CUDA_VERSION%" == "xpu" call internal\driver_update.bat
if errorlevel 1 exit /b 1
if "%CUDA_VERSION%" == "xpu" (
call internal\xpu_install.bat
if errorlevel 1 exit /b 1
call "%ProgramFiles(x86)%\Intel\oneAPI\setvars.bat"
if errorlevel 1 exit /b 1
)
set "ORIG_PATH=%PATH%"
setlocal EnableDelayedExpansion
set NVIDIA_GPU_EXISTS=0
for /F "delims=" %%i in ('wmic path win32_VideoController get name') do (
set GPUS=%%i
if not "x!GPUS:NVIDIA=!" == "x!GPUS!" (
SET NVIDIA_GPU_EXISTS=1
goto gpu_check_end
)
)
:gpu_check_end
endlocal & set NVIDIA_GPU_EXISTS=%NVIDIA_GPU_EXISTS%
if "%PACKAGE_TYPE%" == "wheel" goto wheel
if "%PACKAGE_TYPE%" == "conda" goto conda
if "%PACKAGE_TYPE%" == "libtorch" goto libtorch
echo "unknown package type"
exit /b 1
:wheel
echo "install wheel package"
set PYTHON_INSTALLER_URL=
if "%DESIRED_PYTHON%" == "3.13" set "PYTHON_INSTALLER_URL=https://www.python.org/ftp/python/3.13.0/python-3.13.0-amd64.exe"
if "%DESIRED_PYTHON%" == "3.12" set "PYTHON_INSTALLER_URL=https://www.python.org/ftp/python/3.12.0/python-3.12.0-amd64.exe"
if "%DESIRED_PYTHON%" == "3.11" set "PYTHON_INSTALLER_URL=https://www.python.org/ftp/python/3.11.0/python-3.11.0-amd64.exe"
if "%DESIRED_PYTHON%" == "3.10" set "PYTHON_INSTALLER_URL=https://www.python.org/ftp/python/3.10.0/python-3.10.0-amd64.exe"
if "%DESIRED_PYTHON%" == "3.9" set "PYTHON_INSTALLER_URL=https://www.python.org/ftp/python/3.9.0/python-3.9.0-amd64.exe"
if "%DESIRED_PYTHON%" == "3.8" set "PYTHON_INSTALLER_URL=https://www.python.org/ftp/python/3.8.2/python-3.8.2-amd64.exe"
if "%PYTHON_INSTALLER_URL%" == "" (
echo Python %DESIRED_PYTHON% not supported yet
)
del python-amd64.exe
curl --retry 3 -kL "%PYTHON_INSTALLER_URL%" --output python-amd64.exe
if errorlevel 1 exit /b 1
:: According to https://docs.python.org/3/using/windows.html, setting PrependPath to 1 will prepend
:: the installed Python to PATH system-wide. Even calling set PATH=%ORIG_PATH% later on won't make
:: a change. As the builder directory will be removed after the smoke test, all subsequent non-binary
:: jobs will fail to find any Python executable there
start /wait "" python-amd64.exe /quiet InstallAllUsers=1 PrependPath=0 Include_test=0 TargetDir=%CD%\Python
if errorlevel 1 exit /b 1
set "PATH=%CD%\Python%PYTHON_VERSION%\Scripts;%CD%\Python;%PATH%"
if "%DESIRED_PYTHON%" == "3.13" pip install -q --pre numpy==2.1.0 protobuf
if "%DESIRED_PYTHON%" == "3.12" pip install -q --pre numpy==2.0.2 protobuf
if "%DESIRED_PYTHON%" == "3.11" pip install -q --pre numpy==2.0.2 protobuf
if "%DESIRED_PYTHON%" == "3.10" pip install -q --pre numpy==2.0.2 protobuf
if "%DESIRED_PYTHON%" == "3.9" pip install -q --pre numpy==2.0.2 protobuf
if "%DESIRED_PYTHON%" == "3.8" pip install -q numpy protobuf
if errorlevel 1 exit /b 1
for /F "delims=" %%i in ('where /R "%PYTORCH_FINAL_PACKAGE_DIR:/=\%" *.whl') do pip install "%%i"
if errorlevel 1 exit /b 1
goto smoke_test
:conda
echo "install conda package"
:: Install Miniconda3
set "CONDA_HOME=%CD%\conda"
set "tmp_conda=%CONDA_HOME%"
set "miniconda_exe=%CD%\miniconda.exe"
set "CONDA_EXTRA_ARGS=cpuonly -c pytorch-nightly"
if "%CUDA_VERSION%" == "118" (
set "CONDA_EXTRA_ARGS=pytorch-cuda=11.8 -c nvidia -c pytorch-nightly"
)
if "%CUDA_VERSION%" == "121" (
set "CONDA_EXTRA_ARGS=pytorch-cuda=12.1 -c nvidia -c pytorch-nightly"
)
if "%CUDA_VERSION%" == "124" (
set "CONDA_EXTRA_ARGS=pytorch-cuda=12.4 -c nvidia -c pytorch-nightly"
)
if "%CUDA_VERSION%" == "126" (
set "CONDA_EXTRA_ARGS=pytorch-cuda=12.6 -c nvidia -c pytorch-nightly"
)
rmdir /s /q conda
del miniconda.exe
curl -k https://repo.anaconda.com/miniconda/Miniconda3-latest-Windows-x86_64.exe -o "%miniconda_exe%"
start /wait "" "%miniconda_exe%" /S /InstallationType=JustMe /RegisterPython=0 /AddToPath=0 /D=%tmp_conda%
if ERRORLEVEL 1 exit /b 1
set "PATH=%CONDA_HOME%;%CONDA_HOME%\scripts;%CONDA_HOME%\Library\bin;%PATH%"
conda create -qyn testenv python=%DESIRED_PYTHON%
if errorlevel 1 exit /b 1
call conda install -yq conda-build
if errorlevel 1 exit /b 1
call %CONDA_HOME%\condabin\activate.bat testenv
if errorlevel 1 exit /b 1
set "NO_ARCH_PATH=%PYTORCH_FINAL_PACKAGE_DIR:/=\%\noarch"
mkdir %NO_ARCH_PATH%
for /F "delims=" %%i in ('where /R "%PYTORCH_FINAL_PACKAGE_DIR:/=\%" *') do xcopy "%%i" %NO_ARCH_PATH% /Y
if ERRORLEVEL 1 exit /b 1
call conda index %PYTORCH_FINAL_PACKAGE_DIR%
if errorlevel 1 exit /b 1
call conda install -yq -c "file:///%PYTORCH_FINAL_PACKAGE_DIR%" pytorch==%PYTORCH_BUILD_VERSION% -c pytorch -c numba/label/dev -c nvidia
if ERRORLEVEL 1 exit /b 1
call conda install -yq numpy
if ERRORLEVEL 1 exit /b 1
set /a CUDA_VER=%CUDA_VERSION%
set CUDA_VER_MAJOR=%CUDA_VERSION:~0,-1%
set CUDA_VER_MINOR=%CUDA_VERSION:~-1,1%
set CUDA_VERSION_STR=%CUDA_VER_MAJOR%.%CUDA_VER_MINOR%
:: Install the package we just built
:smoke_test
python -c "import torch"
if ERRORLEVEL 1 exit /b 1
echo Checking that MKL is available
python -c "import torch; exit(0 if torch.backends.mkl.is_available() else 1)"
if ERRORLEVEL 1 exit /b 1
if "%NVIDIA_GPU_EXISTS%" == "0" (
echo "Skip CUDA tests for machines without an NVIDIA GPU card"
goto end
)
echo Checking that CUDA archs are setup correctly
python -c "import torch; torch.randn([3,5]).cuda()"
if ERRORLEVEL 1 exit /b 1
echo Checking that magma is available
python -c "import torch; torch.rand(1).cuda(); exit(0 if torch.cuda.has_magma else 1)"
if ERRORLEVEL 1 exit /b 1
echo Checking that CuDNN is available
python -c "import torch; exit(0 if torch.backends.cudnn.is_available() else 1)"
if ERRORLEVEL 1 exit /b 1
echo Checking that basic RNN works
python %PYTORCH_ROOT%\.ci\pytorch\test_example_code\rnn_smoke.py
if ERRORLEVEL 1 exit /b 1
echo Checking that basic CNN works
python %PYTORCH_ROOT%\.ci\pytorch\test_example_code\cnn_smoke.py
if ERRORLEVEL 1 exit /b 1
goto end
:libtorch
echo "install and test libtorch"
if "%VC_YEAR%" == "2019" powershell internal\vs2019_install.ps1
if "%VC_YEAR%" == "2022" powershell internal\vs2022_install.ps1
if ERRORLEVEL 1 exit /b 1
for /F "delims=" %%i in ('where /R "%PYTORCH_FINAL_PACKAGE_DIR:/=\%" *-latest.zip') do 7z x "%%i" -otmp
if ERRORLEVEL 1 exit /b 1
pushd tmp\libtorch
set VC_VERSION_LOWER=17
set VC_VERSION_UPPER=18
IF "%VC_YEAR%" == "2019" (
set VC_VERSION_LOWER=16
set VC_VERSION_UPPER=17
)
for /f "usebackq tokens=*" %%i in (`"%ProgramFiles(x86)%\Microsoft Visual Studio\Installer\vswhere.exe" -legacy -products * -version [%VC_VERSION_LOWER%^,%VC_VERSION_UPPER%^) -property installationPath`) do (
if exist "%%i" if exist "%%i\VC\Auxiliary\Build\vcvarsall.bat" (
set "VS15INSTALLDIR=%%i"
set "VS15VCVARSALL=%%i\VC\Auxiliary\Build\vcvarsall.bat"
goto vswhere
)
)
:vswhere
IF "%VS15VCVARSALL%"=="" (
echo Visual Studio %VC_YEAR% C++ BuildTools is required to compile the PyTorch tests on Windows
exit /b 1
)
call "%VS15VCVARSALL%" x64
set install_root=%CD%
set INCLUDE=%INCLUDE%;%install_root%\include;%install_root%\include\torch\csrc\api\include
set LIB=%LIB%;%install_root%\lib
set PATH=%PATH%;%install_root%\lib
cl %PYTORCH_ROOT%\.ci\pytorch\test_example_code\simple-torch-test.cpp c10.lib torch_cpu.lib /EHsc /std:c++17
if ERRORLEVEL 1 exit /b 1
.\simple-torch-test.exe
if ERRORLEVEL 1 exit /b 1
cl %PYTORCH_ROOT%\.ci\pytorch\test_example_code\check-torch-mkl.cpp c10.lib torch_cpu.lib /EHsc /std:c++17
if ERRORLEVEL 1 exit /b 1
.\check-torch-mkl.exe
if ERRORLEVEL 1 exit /b 1
if "%NVIDIA_GPU_EXISTS%" == "0" (
echo "Skip CUDA tests for machines without an NVIDIA GPU card"
goto end
)
set BUILD_SPLIT_CUDA=
if exist "%install_root%\lib\torch_cuda_cu.lib" if exist "%install_root%\lib\torch_cuda_cpp.lib" set BUILD_SPLIT_CUDA=ON
if "%BUILD_SPLIT_CUDA%" == "ON" (
cl %PYTORCH_ROOT%\.ci\pytorch\test_example_code\check-torch-cuda.cpp torch_cpu.lib c10.lib torch_cuda_cu.lib torch_cuda_cpp.lib /EHsc /std:c++17 /link /INCLUDE:?warp_size@cuda@at@@YAHXZ /INCLUDE:?_torch_cuda_cu_linker_symbol_op_cuda@native@at@@YA?AVTensor@2@AEBV32@@Z
) else (
cl %PYTORCH_ROOT%\.ci\pytorch\test_example_code\check-torch-cuda.cpp torch_cpu.lib c10.lib torch_cuda.lib /EHsc /std:c++17 /link /INCLUDE:?warp_size@cuda@at@@YAHXZ
)
.\check-torch-cuda.exe
if ERRORLEVEL 1 exit /b 1
popd
echo Cleaning temp files
rd /s /q "tmp" || ver > nul
:end
set "PATH=%ORIG_PATH%"
popd
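
The smoke-test block above is a chain of one-line Python probes: import torch, check MKL, allocate a CUDA tensor, check MAGMA and cuDNN. As a rough illustration only, the same checks can be collected into one script; `run_smoke_checks` and its argument are made-up names, not part of these CI scripts.

```
# Minimal sketch of the probes the batch script runs one by one, assuming
# torch is already installed in the active environment. The helper name
# run_smoke_checks is illustrative and not defined anywhere in the repo.
import torch

def run_smoke_checks(has_nvidia_gpu: bool) -> None:
    assert torch.backends.mkl.is_available(), "MKL backend not available"
    if not has_nvidia_gpu:
        print("Skip CUDA tests for machines without an NVIDIA GPU card")
        return
    torch.randn([3, 5]).cuda()  # verifies the CUDA archs are set up correctly
    assert torch.cuda.has_magma, "MAGMA not available"
    assert torch.backends.cudnn.is_available(), "cuDNN not available"

if __name__ == "__main__":
    run_smoke_checks(has_nvidia_gpu=torch.cuda.is_available())
```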

View File

@ -0,0 +1,132 @@
set SRC_DIR=%~dp0
pushd %SRC_DIR%\..
if "%CUDA_VERSION%" == "cpu" call internal\driver_update.bat
if errorlevel 1 exit /b 1
call internal\cuda_install.bat
set LIB=%CUDA_PATH%\lib\x64;%LIB%
if errorlevel 1 exit /b 1
set "ORIG_PATH=%PATH%"
setlocal EnableDelayedExpansion
set NVIDIA_GPU_EXISTS=0
for /F "delims=" %%i in ('wmic path win32_VideoController get name') do (
set GPUS=%%i
if not "x!GPUS:NVIDIA=!" == "x!GPUS!" (
SET NVIDIA_GPU_EXISTS=1
goto gpu_check_end
)
)
:gpu_check_end
endlocal & set NVIDIA_GPU_EXISTS=%NVIDIA_GPU_EXISTS%
:: Download MAGMA Files on CUDA builds
set MAGMA_VERSION=2.5.4
set CUDA_PREFIX=cuda%CUDA_VERSION%
if "%CUDA_VERSION%" == "92" set MAGMA_VERSION=2.5.2
if "%CUDA_VERSION%" == "100" set MAGMA_VERSION=2.5.2
if "%DEBUG%" == "1" (
set BUILD_TYPE=debug
) else (
set BUILD_TYPE=release
)
if not "%CUDA_VERSION%" == "cpu" (
rmdir /s /q magma_%CUDA_PREFIX%_%BUILD_TYPE%
del magma_%CUDA_PREFIX%_%BUILD_TYPE%.7z
curl -k https://s3.amazonaws.com/ossci-windows/magma_%MAGMA_VERSION%_%CUDA_PREFIX%_%BUILD_TYPE%.7z -o magma_%CUDA_PREFIX%_%BUILD_TYPE%.7z
7z x -aoa magma_%CUDA_PREFIX%_%BUILD_TYPE%.7z -omagma_%CUDA_PREFIX%_%BUILD_TYPE%
set LIB=%CD%\magma_%CUDA_PREFIX%_%BUILD_TYPE%\lib;%LIB%
)
echo "install conda package"
:: Install Miniconda3
set "CONDA_HOME=%CD%\conda"
set "tmp_conda=%CONDA_HOME%"
set "miniconda_exe=%CD%\miniconda.exe"
rmdir /s /q conda
del miniconda.exe
curl -k https://repo.anaconda.com/miniconda/Miniconda3-latest-Windows-x86_64.exe -o "%miniconda_exe%"
start /wait "" "%miniconda_exe%" /S /InstallationType=JustMe /RegisterPython=0 /AddToPath=0 /D=%tmp_conda%
if ERRORLEVEL 1 exit /b 1
set "PATH=%CONDA_HOME%;%CONDA_HOME%\scripts;%CONDA_HOME%\Library\bin;%PATH%"
conda create -qyn testenv python=%DESIRED_PYTHON%
if errorlevel 1 exit /b 1
call %CONDA_HOME%\condabin\activate.bat testenv
if errorlevel 1 exit /b 1
call conda install -y -q -c conda-forge libuv=1.39
call conda install -y -q intel-openmp
echo "install and test libtorch"
echo "installing cmake"
pip install cmake
if "%VC_YEAR%" == "2019" powershell internal\vs2019_install.ps1
if "%VC_YEAR%" == "2022" powershell internal\vs2022_install.ps1
if ERRORLEVEL 1 exit /b 1
for /F "delims=" %%i in ('where /R "%PYTORCH_FINAL_PACKAGE_DIR:/=\%" *-latest.zip') do 7z x "%%i" -otmp
if ERRORLEVEL 1 exit /b 1
pushd tmp\libtorch
set VC_VERSION_LOWER=17
set VC_VERSION_UPPER=18
IF "%VC_YEAR%" == "2019" (
set VC_VERSION_LOWER=16
set VC_VERSION_UPPER=17
)
for /f "usebackq tokens=*" %%i in (`"%ProgramFiles(x86)%\Microsoft Visual Studio\Installer\vswhere.exe" -legacy -products * -version [%VC_VERSION_LOWER%^,%VC_VERSION_UPPER%^) -property installationPath`) do (
if exist "%%i" if exist "%%i\VC\Auxiliary\Build\vcvarsall.bat" (
set "VS15INSTALLDIR=%%i"
set "VS15VCVARSALL=%%i\VC\Auxiliary\Build\vcvarsall.bat"
goto vswhere
)
)
:vswhere
IF "%VS15VCVARSALL%"=="" (
echo Visual Studio %VC_YEAR% C++ BuildTools is required to compile the PyTorch tests on Windows
exit /b 1
)
call "%VS15VCVARSALL%" x64
set install_root=%CD%
set INCLUDE=%INCLUDE%;%install_root%\include;%install_root%\include\torch\csrc\api\include
set LIB=%LIB%;%install_root%\lib\x64
set PATH=%PATH%;%install_root%\lib
cd %PYTORCH_ROOT%\.ci\pytorch\test_example_code\
mkdir build
cd build
cmake -DCMAKE_PREFIX_PATH=%install_root% ..
if ERRORLEVEL 1 exit /b 1
cmake --build . --config Release
.\Release\simple-torch-test.exe
if ERRORLEVEL 1 exit /b 1
popd
echo Cleaning temp files
rd /s /q "tmp" || ver > nul
:end
set "PATH=%ORIG_PATH%"
popd
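
The `wmic`/delayed-expansion loop near the top of this script only asks whether any Win32 video controller name contains "NVIDIA". A rough standalone sketch of the same check, assuming `wmic` is available on the runner; the function name is illustrative.

```
# Rough Python equivalent of the batch GPU probe above: list video controllers
# via wmic and report whether any of them is an NVIDIA card. has_nvidia_gpu is
# an illustrative name, not something the CI scripts define.
import subprocess

def has_nvidia_gpu() -> bool:
    out = subprocess.run(
        ["wmic", "path", "win32_VideoController", "get", "name"],
        capture_output=True, text=True, check=False,
    ).stdout
    return any("NVIDIA" in line for line in out.splitlines())

if __name__ == "__main__":
    print("NVIDIA_GPU_EXISTS=1" if has_nvidia_gpu() else "NVIDIA_GPU_EXISTS=0")
```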

View File

@ -0,0 +1,21 @@
if "%VC_YEAR%" == "2019" powershell windows/internal/vs2019_install.ps1
if "%VC_YEAR%" == "2022" powershell windows/internal/vs2022_install.ps1
set VC_VERSION_LOWER=17
set VC_VERSION_UPPER=18
if "%VC_YEAR%" == "2019" (
set VC_VERSION_LOWER=16
set VC_VERSION_UPPER=17
)
for /f "usebackq tokens=*" %%i in (`"%ProgramFiles(x86)%\Microsoft Visual Studio\Installer\vswhere.exe" -products Microsoft.VisualStudio.Product.BuildTools -version [%VC_VERSION_LOWER%^,%VC_VERSION_UPPER%^) -property installationPath`) do (
if exist "%%i" if exist "%%i\VC\Auxiliary\Build\vcvarsall.bat" (
set "VS15INSTALLDIR=%%i"
goto vswhere
)
)
:vswhere
echo "Setting VSINSTALLDIR to %VS15INSTALLDIR%"
if errorlevel 1 exit /b 1
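
As elsewhere in these scripts, vswhere is driven by a half-open major-version range derived from the Visual Studio year: VS 2019 is major version 16, VS 2022 is 17, so the range `[16,17)` matches 2019 only. A small sketch of that convention; the dictionary and helper below are illustrative, not part of the repo.

```
# Illustration of the VC_YEAR -> vswhere version-range convention used above.
# The range is half-open, so "[16,17)" selects VS 2019 installations only.
VSWHERE_VERSION_RANGE = {
    "2019": "[16,17)",
    "2022": "[17,18)",
}

def vswhere_args(vc_year: str) -> list:
    return ["-products", "Microsoft.VisualStudio.Product.BuildTools",
            "-version", VSWHERE_VERSION_RANGE[vc_year],
            "-property", "installationPath"]

print(vswhere_args("2022"))
```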

View File

@ -0,0 +1,48 @@
# https://developercommunity.visualstudio.com/t/install-specific-version-of-vs-component/1142479
# https://docs.microsoft.com/en-us/visualstudio/releases/2019/history#release-dates-and-build-numbers
# 16.8.6 BuildTools
$VS_DOWNLOAD_LINK = "https://ossci-windows.s3.us-east-1.amazonaws.com/vs16.8.6_BuildTools.exe"
$COLLECT_DOWNLOAD_LINK = "https://aka.ms/vscollect.exe"
$VS_INSTALL_ARGS = @("--nocache","--quiet","--wait", "--add Microsoft.VisualStudio.Workload.VCTools",
"--add Microsoft.Component.MSBuild",
"--add Microsoft.VisualStudio.Component.Roslyn.Compiler",
"--add Microsoft.VisualStudio.Component.TextTemplating",
"--add Microsoft.VisualStudio.Component.VC.CoreIde",
"--add Microsoft.VisualStudio.Component.VC.Redist.14.Latest",
"--add Microsoft.VisualStudio.ComponentGroup.NativeDesktop.Core",
"--add Microsoft.VisualStudio.Component.VC.Tools.x86.x64",
"--add Microsoft.VisualStudio.ComponentGroup.NativeDesktop.Win81")
curl.exe --retry 3 -kL $VS_DOWNLOAD_LINK --output vs_installer.exe
if ($LASTEXITCODE -ne 0) {
echo "Download of the VS 2019 Version 16.8.6 installer failed"
exit 1
}
if (Test-Path "${env:ProgramFiles(x86)}\Microsoft Visual Studio\Installer\vswhere.exe") {
$existingPath = & "${env:ProgramFiles(x86)}\Microsoft Visual Studio\Installer\vswhere.exe" -products "Microsoft.VisualStudio.Product.BuildTools" -version "[16, 17)" -property installationPath
if ($existingPath -ne $null) {
if (!${env:CIRCLECI}) {
echo "Found correctly versioned existing BuildTools installation in $existingPath"
exit 0
}
echo "Found existing BuildTools installation in $existingPath, keeping it"
}
}
$process = Start-Process "${PWD}\vs_installer.exe" -ArgumentList $VS_INSTALL_ARGS -NoNewWindow -Wait -PassThru
Remove-Item -Path vs_installer.exe -Force
$exitCode = $process.ExitCode
if (($exitCode -ne 0) -and ($exitCode -ne 3010)) {
echo "VS 2019 installer exited with code $exitCode, which should be one of [0, 3010]."
curl.exe --retry 3 -kL $COLLECT_DOWNLOAD_LINK --output Collect.exe
if ($LASTEXITCODE -ne 0) {
echo "Download of the VS Collect tool failed."
exit 1
}
Start-Process "${PWD}\Collect.exe" -NoNewWindow -Wait -PassThru
New-Item -Path "C:\w\build-results" -ItemType "directory" -Force
Copy-Item -Path "C:\Users\${env:USERNAME}\AppData\Local\Temp\vslogs.zip" -Destination "C:\w\build-results\"
exit 1
}

View File

@ -0,0 +1,48 @@
# https://developercommunity.visualstudio.com/t/install-specific-version-of-vs-component/1142479
# https://learn.microsoft.com/en-us/visualstudio/releases/2022/release-history#evergreen-bootstrappers
# 17.4.3 BuildTools
$VS_DOWNLOAD_LINK = "https://download.visualstudio.microsoft.com/download/pr/8f480125-28b8-4a2c-847c-c2b02a8cdd1b/64be21d4ada005d7d07896ed0b004c322409bd04d6e8eba4c03c9fa39c928e7a/vs_BuildTools.exe"
$COLLECT_DOWNLOAD_LINK = "https://aka.ms/vscollect.exe"
$VS_INSTALL_ARGS = @("--nocache","--quiet","--wait", "--add Microsoft.VisualStudio.Workload.VCTools",
"--add Microsoft.Component.MSBuild",
"--add Microsoft.VisualStudio.Component.Roslyn.Compiler",
"--add Microsoft.VisualStudio.Component.TextTemplating",
"--add Microsoft.VisualStudio.Component.VC.CoreIde",
"--add Microsoft.VisualStudio.Component.VC.Redist.14.Latest",
"--add Microsoft.VisualStudio.ComponentGroup.NativeDesktop.Core",
"--add Microsoft.VisualStudio.Component.VC.Tools.x86.x64",
"--add Microsoft.VisualStudio.ComponentGroup.NativeDesktop.Win81")
curl.exe --retry 3 -kL $VS_DOWNLOAD_LINK --output vs_installer.exe
if ($LASTEXITCODE -ne 0) {
echo "Download of the VS 2022 Version 17.4.3 installer failed"
exit 1
}
if (Test-Path "${env:ProgramFiles(x86)}\Microsoft Visual Studio\Installer\vswhere.exe") {
$existingPath = & "${env:ProgramFiles(x86)}\Microsoft Visual Studio\Installer\vswhere.exe" -products "Microsoft.VisualStudio.Product.BuildTools" -version "[17, 18)" -property installationPath
if ($existingPath -ne $null) {
if (!${env:CIRCLECI}) {
echo "Found correctly versioned existing BuildTools installation in $existingPath"
exit 0
}
echo "Found existing BuildTools installation in $existingPath, keeping it"
}
}
$process = Start-Process "${PWD}\vs_installer.exe" -ArgumentList $VS_INSTALL_ARGS -NoNewWindow -Wait -PassThru
Remove-Item -Path vs_installer.exe -Force
$exitCode = $process.ExitCode
if (($exitCode -ne 0) -and ($exitCode -ne 3010)) {
echo "VS 2022 installer exited with code $exitCode, which should be one of [0, 3010]."
curl.exe --retry 3 -kL $COLLECT_DOWNLOAD_LINK --output Collect.exe
if ($LASTEXITCODE -ne 0) {
echo "Download of the VS Collect tool failed."
exit 1
}
Start-Process "${PWD}\Collect.exe" -NoNewWindow -Wait -PassThru
New-Item -Path "C:\w\build-results" -ItemType "directory" -Force
Copy-Item -Path "C:\Users\${env:USERNAME}\AppData\Local\Temp\vslogs.zip" -Destination "C:\w\build-results\"
exit 1
}

View File

@ -0,0 +1,28 @@
@echo off
set VS_DOWNLOAD_LINK=https://download.visualstudio.microsoft.com/download/pr/8f480125-28b8-4a2c-847c-c2b02a8cdd1b/64be21d4ada005d7d07896ed0b004c322409bd04d6e8eba4c03c9fa39c928e7a/vs_BuildTools.exe
IF "%VS_LATEST%" == "1" (
set VS_INSTALL_ARGS= --nocache --norestart --quiet --wait --add Microsoft.VisualStudio.Workload.VCTools
set VSDEVCMD_ARGS=
) ELSE (
set VS_INSTALL_ARGS=--nocache --quiet --wait --add Microsoft.VisualStudio.Workload.VCTools ^
--add Microsoft.VisualStudio.Component.VC.Tools.14.34 ^
--add Microsoft.Component.MSBuild ^
--add Microsoft.VisualStudio.Component.Roslyn.Compiler ^
--add Microsoft.VisualStudio.Component.TextTemplating ^
--add Microsoft.VisualStudio.Component.VC.CoreIde ^
--add Microsoft.VisualStudio.Component.VC.Redist.14.Latest ^
--add Microsoft.VisualStudio.ComponentGroup.NativeDesktop.Core ^
--add Microsoft.VisualStudio.Component.VC.Tools.x86.x64 ^
--add Microsoft.VisualStudio.Component.VC.Tools.14.34 ^
--add Microsoft.VisualStudio.ComponentGroup.NativeDesktop.Win81
set VSDEVCMD_ARGS=-vcvars_ver=14.34
)
curl -k -L %VS_DOWNLOAD_LINK% --output vs_installer.exe
if errorlevel 1 exit /b 1
start /wait .\vs_installer.exe %VS_INSTALL_ARGS%
if not errorlevel 0 exit /b 1
if errorlevel 1 if not errorlevel 3010 exit /b 1
if errorlevel 3011 exit /b 1

View File

@ -0,0 +1,119 @@
@echo on
REM Description: Install Intel Support Packages on Windows
REM BKM reference: https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpus.html
if not "%CUDA_VERSION%" == "xpu" (
echo Skipping for non-XPU builds
exit /b 0
)
set XPU_INSTALL_MODE=%~1
if "%XPU_INSTALL_MODE%"=="" goto xpu_bundle_install_start
if "%XPU_INSTALL_MODE%"=="bundle" goto xpu_bundle_install_start
if "%XPU_INSTALL_MODE%"=="driver" goto xpu_driver_install_start
if "%XPU_INSTALL_MODE%"=="all" goto xpu_driver_install_start
:arg_error
echo Illegal XPU installation mode. The value can be "bundle", "driver" or "all"
echo If the value is left empty, the default "bundle" mode will be used
exit /b 1
:xpu_driver_install_start
:: TODO Need more testing for driver installation
set XPU_DRIVER_LINK=https://downloadmirror.intel.com/830975/gfx_win_101.5972.exe
curl -o xpu_driver.exe --retry 3 --retry-all-errors -k %XPU_DRIVER_LINK%
echo "XPU Driver installing..."
start /wait "Intel XPU Driver Installer" "xpu_driver.exe"
if errorlevel 1 exit /b 1
del xpu_driver.exe
if "%XPU_INSTALL_MODE%"=="driver" goto xpu_install_end
:xpu_bundle_install_start
set XPU_BUNDLE_PARENT_DIR=C:\Program Files (x86)\Intel\oneAPI
set XPU_BUNDLE_URL=https://registrationcenter-download.intel.com/akdlm/IRC_NAS/9d1a91e2-e8b8-40a5-8c7f-5db768a6a60c/w_intel-for-pytorch-gpu-dev_p_0.5.3.37_offline.exe
set XPU_BUNDLE_PRODUCT_NAME=intel.oneapi.win.intel-for-pytorch-gpu-dev.product
set XPU_BUNDLE_VERSION=0.5.3+31
set XPU_BUNDLE_INSTALLED=0
set XPU_BUNDLE_UNINSTALL=0
set XPU_EXTRA_URL=https://registrationcenter-download.intel.com/akdlm/IRC_NAS/9d1a91e2-e8b8-40a5-8c7f-5db768a6a60c/w_intel-pti-dev_p_0.9.0.37_offline.exe
set XPU_EXTRA_PRODUCT_NAME=intel.oneapi.win.intel-pti-dev.product
set XPU_EXTRA_VERSION=0.9.0+36
set XPU_EXTRA_INSTALLED=0
set XPU_EXTRA_UNINSTALL=0
if not [%XPU_VERSION%]==[] if [%XPU_VERSION%]==[2025.0] (
set XPU_BUNDLE_URL=https://registrationcenter-download.intel.com/akdlm/IRC_NAS/efc86abd-cb77-452e-a03f-a741895b8ece/intel-deep-learning-essentials-2025.0.0.336_offline.exe
set XPU_BUNDLE_PRODUCT_NAME=intel.oneapi.win.deep-learning-essentials.product
set XPU_BUNDLE_VERSION=2025.0.0+335
set XPU_BUNDLE_INSTALLED=0
set XPU_BUNDLE_UNINSTALL=0
set XPU_EXTRA_URL=NULL
set XPU_EXTRA_PRODUCT_NAME=intel.oneapi.win.compiler.product
set XPU_EXTRA_VERSION=2025.0.1+1226
set XPU_EXTRA_INSTALLED=0
set XPU_EXTRA_UNINSTALL=0
)
:: Check if XPU bundle is target version or already installed
if exist "%XPU_BUNDLE_PARENT_DIR%\Installer\installer.exe" goto xpu_bundle_ver_check
goto xpu_bundle_install
:xpu_bundle_ver_check
"%XPU_BUNDLE_PARENT_DIR%\Installer\installer.exe" --list-products > xpu_bundle_installed_ver.log
for /f "tokens=1,2" %%a in (xpu_bundle_installed_ver.log) do (
if "%%a"=="%XPU_BUNDLE_PRODUCT_NAME%" (
echo %%a Installed Version: %%b
set XPU_BUNDLE_INSTALLED=1
if not "%XPU_BUNDLE_VERSION%"=="%%b" (
start /wait "Installer Title" "%XPU_BUNDLE_PARENT_DIR%\Installer\installer.exe" --action=remove --eula=accept --silent --product-id %%a --product-ver %%b --log-dir uninstall_bundle
set XPU_BUNDLE_UNINSTALL=1
)
)
if "%%a"=="%XPU_EXTRA_PRODUCT_NAME%" (
echo %%a Installed Version: %%b
set XPU_EXTRA_INSTALLED=1
if not "%XPU_EXTRA_VERSION%"=="%%b" (
start /wait "Installer Title" "%XPU_BUNDLE_PARENT_DIR%\Installer\installer.exe" --action=remove --eula=accept --silent --product-id %%a --product-ver %%b --log-dir uninstall_bundle
set XPU_EXTRA_UNINSTALL=1
)
)
if not "%%b" == "Version" if not [%%b]==[] if not "%%a"=="%XPU_BUNDLE_PRODUCT_NAME%" if not "%%a"=="%XPU_EXTRA_PRODUCT_NAME%" (
echo "Uninstalling...."
start /wait "Installer Title" "%XPU_BUNDLE_PARENT_DIR%\Installer\installer.exe" --action=remove --eula=accept --silent --product-id %%a --product-ver %%b --log-dir uninstall_bundle
)
)
if errorlevel 1 exit /b 1
if exist xpu_bundle_installed_ver.log del xpu_bundle_installed_ver.log
if exist uninstall_bundle rmdir /s /q uninstall_bundle
if "%XPU_BUNDLE_INSTALLED%"=="0" goto xpu_bundle_install
if "%XPU_BUNDLE_UNINSTALL%"=="1" goto xpu_bundle_install
:xpu_extra_check
if "%XPU_EXTRA_URL%"=="NULL" goto xpu_install_end
if "%XPU_EXTRA_INSTALLED%"=="0" goto xpu_extra_install
if "%XPU_EXTRA_UNINSTALL%"=="1" goto xpu_extra_install
goto xpu_install_end
:xpu_bundle_install
curl -o xpu_bundle.exe --retry 3 --retry-all-errors -k %XPU_BUNDLE_URL%
echo "XPU Bundle installing..."
start /wait "Intel Pytorch Bundle Installer" "xpu_bundle.exe" --action=install --eula=accept --silent --log-dir install_bundle
if errorlevel 1 exit /b 1
del xpu_bundle.exe
goto xpu_extra_check
:xpu_extra_install
curl -o xpu_extra.exe --retry 3 --retry-all-errors -k %XPU_EXTRA_URL%
echo "Intel XPU EXTRA installing..."
start /wait "Intel XPU EXTRA Installer" "xpu_extra.exe" --action=install --eula=accept --silent --log-dir install_bundle
if errorlevel 1 exit /b 1
del xpu_extra.exe
:xpu_install_end
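
The for-loop above parses the oneAPI installer's `--list-products` output as `product-id version` pairs and, per product, keeps it, uninstalls a mismatched version, or falls through to a fresh install. A minimal sketch of that decision, assuming the same two-column output format; `needs_bundle_install` is an illustrative name.

```
# Sketch of the version check in the XPU bundle script: given the
# "product-id version" lines printed by installer.exe --list-products,
# decide whether the target bundle still needs to be (re)installed.
def needs_bundle_install(listing: str, product_name: str, target_version: str) -> bool:
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == product_name:
            # Installed and already at the target version: nothing to do.
            # Any other version means uninstall first, then reinstall.
            return fields[1] != target_version
    return True  # not installed at all

listing = "intel.oneapi.win.deep-learning-essentials.product 2025.0.0+335"
print(needs_bundle_install(listing,
                           "intel.oneapi.win.deep-learning-essentials.product",
                           "2025.0.0+335"))  # False: correct version already present
```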

View File

@ -0,0 +1,41 @@
@echo on
set MODULE_NAME=pytorch
IF NOT EXIST "setup.py" IF NOT EXIST "%MODULE_NAME%" (
call internal\clone.bat
cd %~dp0
) ELSE (
call internal\clean.bat
)
IF ERRORLEVEL 1 goto :eof
call internal\check_deps.bat
IF ERRORLEVEL 1 goto :eof
REM Check for optional components
echo Disabling CUDA
set USE_CUDA=0
call internal\check_opts.bat
IF ERRORLEVEL 1 goto :eof
echo Activate XPU Bundle env
set VS2022INSTALLDIR=%VS15INSTALLDIR%
set XPU_BUNDLE_ROOT=%ProgramFiles(x86)%\Intel\oneAPI
call "%XPU_BUNDLE_ROOT%\compiler\latest\env\vars.bat"
call "%XPU_BUNDLE_ROOT%\ocloc\latest\env\vars.bat"
IF ERRORLEVEL 1 goto :eof
:: Workaround for https://github.com/pytorch/pytorch/issues/134989
set CMAKE_SHARED_LINKER_FLAGS=/FORCE:MULTIPLE
set CMAKE_MODULE_LINKER_FLAGS=/FORCE:MULTIPLE
set CMAKE_EXE_LINKER_FLAGS=/FORCE:MULTIPLE
if exist "%NIGHTLIES_PYTORCH_ROOT%" cd %NIGHTLIES_PYTORCH_ROOT%\..
call %~dp0\internal\copy_cpu.bat
IF ERRORLEVEL 1 goto :eof
call %~dp0\internal\setup.bat
IF ERRORLEVEL 1 goto :eof

281
.ci/wheel/build_wheel.sh Executable file
View File

@ -0,0 +1,281 @@
#!/usr/bin/env bash
set -ex
SOURCE_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null && pwd )"
# Env variables that should be set:
# DESIRED_PYTHON
# Which Python version to build for, in the format 'Maj.min', e.g. '2.7' or '3.6'
#
# PYTORCH_FINAL_PACKAGE_DIR
# **absolute** path to folder where final whl packages will be stored. The
# default should not be used when calling this from a script. The default
# is 'whl', and corresponds to the default in the wheel/upload.sh script.
#
# MAC_PACKAGE_WORK_DIR
# absolute path to a workdir in which to clone an isolated conda
# installation and pytorch checkout. If the pytorch checkout already exists
# then it will not be overwritten.
# Function to retry functions that sometimes timeout or have flaky failures
retry () {
$* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
}
# Parameters
if [[ -n "$DESIRED_PYTHON" && -n "$PYTORCH_BUILD_VERSION" && -n "$PYTORCH_BUILD_NUMBER" ]]; then
desired_python="$DESIRED_PYTHON"
build_version="$PYTORCH_BUILD_VERSION"
build_number="$PYTORCH_BUILD_NUMBER"
else
if [ "$#" -ne 3 ]; then
echo "illegal number of parameters. Need PY_VERSION BUILD_VERSION BUILD_NUMBER"
echo "for example: build_wheel.sh 2.7 0.1.6 20"
echo "Python version should be in format 'M.m'"
exit 1
fi
desired_python=$1
build_version=$2
build_number=$3
fi
echo "Building for Python: $desired_python Version: $build_version Build: $build_number"
python_nodot="$(echo $desired_python | tr -d m.u)"
# Version: setup.py uses $PYTORCH_BUILD_VERSION.post$PYTORCH_BUILD_NUMBER if
# PYTORCH_BUILD_NUMBER > 1
if [[ -n "$OVERRIDE_PACKAGE_VERSION" ]]; then
# This will be the *exact* version, since build_number<1
build_version="$OVERRIDE_PACKAGE_VERSION"
build_number=0
build_number_prefix=''
else
if [[ $build_number -eq 1 ]]; then
build_number_prefix=""
else
build_number_prefix=".post$build_number"
fi
fi
export PYTORCH_BUILD_VERSION=$build_version
export PYTORCH_BUILD_NUMBER=$build_number
package_type="${PACKAGE_TYPE:-wheel}"
# Fill in empty parameters with defaults
if [[ -z "$TORCH_PACKAGE_NAME" ]]; then
TORCH_PACKAGE_NAME='torch'
fi
TORCH_PACKAGE_NAME="$(echo $TORCH_PACKAGE_NAME | tr '-' '_')"
if [[ -z "$PYTORCH_REPO" ]]; then
PYTORCH_REPO='pytorch'
fi
if [[ -z "$PYTORCH_BRANCH" ]]; then
PYTORCH_BRANCH="v${build_version}"
fi
if [[ -z "$RUN_TEST_PARAMS" ]]; then
RUN_TEST_PARAMS=()
fi
if [[ -z "$PYTORCH_FINAL_PACKAGE_DIR" ]]; then
if [[ -n "$BUILD_PYTHONLESS" ]]; then
PYTORCH_FINAL_PACKAGE_DIR='libtorch'
else
PYTORCH_FINAL_PACKAGE_DIR='whl'
fi
fi
mkdir -p "$PYTORCH_FINAL_PACKAGE_DIR" || true
# Create an isolated directory to store this build's pytorch checkout and conda
# installation
if [[ -z "$MAC_PACKAGE_WORK_DIR" ]]; then
MAC_PACKAGE_WORK_DIR="$(pwd)/tmp_wheel_conda_${DESIRED_PYTHON}_$(date +%H%M%S)"
fi
mkdir -p "$MAC_PACKAGE_WORK_DIR" || true
if [[ -n ${GITHUB_ACTIONS} ]]; then
pytorch_rootdir="${PYTORCH_ROOT:-${MAC_PACKAGE_WORK_DIR}/pytorch}"
else
pytorch_rootdir="${MAC_PACKAGE_WORK_DIR}/pytorch"
fi
whl_tmp_dir="${MAC_PACKAGE_WORK_DIR}/dist"
mkdir -p "$whl_tmp_dir"
mac_version='macosx_11_0_arm64'
libtorch_arch='arm64'
# Create a consistent wheel package name to rename the wheel to
wheel_filename_new="${TORCH_PACKAGE_NAME}-${build_version}${build_number_prefix}-cp${python_nodot}-none-${mac_version}.whl"
###########################################################
# Have a separate Pytorch repo clone
if [[ ! -d "$pytorch_rootdir" ]]; then
git clone "https://github.com/${PYTORCH_REPO}/pytorch" "$pytorch_rootdir"
pushd "$pytorch_rootdir"
if ! git checkout "$PYTORCH_BRANCH" ; then
echo "Could not checkout $PYTORCH_BRANCH, so trying tags/v${build_version}"
git checkout tags/v${build_version}
fi
popd
fi
pushd "$pytorch_rootdir"
git submodule update --init --recursive
popd
##########################
# now build the binary
export TH_BINARY_BUILD=1
export INSTALL_TEST=0 # don't install test binaries into site-packages
export MACOSX_DEPLOYMENT_TARGET=10.15
export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
SETUPTOOLS_PINNED_VERSION="=46.0.0"
PYYAML_PINNED_VERSION="=5.3"
EXTRA_CONDA_INSTALL_FLAGS=""
case $desired_python in
3.13)
echo "Using 3.13 deps"
SETUPTOOLS_PINNED_VERSION=">=68.0.0"
PYYAML_PINNED_VERSION=">=6.0.1"
NUMPY_PINNED_VERSION="=2.1.0"
;;
3.12)
echo "Using 3.12 deps"
SETUPTOOLS_PINNED_VERSION=">=68.0.0"
PYYAML_PINNED_VERSION=">=6.0.1"
NUMPY_PINNED_VERSION="=2.0.2"
;;
3.11)
echo "Using 3.11 deps"
SETUPTOOLS_PINNED_VERSION=">=46.0.0"
PYYAML_PINNED_VERSION=">=5.3"
NUMPY_PINNED_VERSION="=2.0.2"
;;
3.10)
echo "Using 3.10 deps"
SETUPTOOLS_PINNED_VERSION=">=46.0.0"
PYYAML_PINNED_VERSION=">=5.3"
NUMPY_PINNED_VERSION="=2.0.2"
;;
3.9)
echo "Using 3.9 deps"
SETUPTOOLS_PINNED_VERSION=">=46.0.0"
PYYAML_PINNED_VERSION=">=5.3"
NUMPY_PINNED_VERSION="=2.0.2"
;;
*)
echo "Using default deps"
NUMPY_PINNED_VERSION="=1.11.3"
;;
esac
# Install into a fresh env
tmp_env_name="wheel_py$python_nodot"
conda create ${EXTRA_CONDA_INSTALL_FLAGS} -yn "$tmp_env_name" python="$desired_python"
source activate "$tmp_env_name"
pip install -q "numpy=${NUMPY_PINNED_VERSION}" "pyyaml${PYYAML_PINNED_VERSION}" requests
retry pip install -qr "${pytorch_rootdir}/requirements.txt" || true
# TODO: Remove me later (but in the interim, use Anaconda cmake to find Anaconda-installed OpenMP)
retry pip uninstall -y cmake
retry conda install ${EXTRA_CONDA_INSTALL_FLAGS} -yq llvm-openmp=14.0.6 cmake ninja "setuptools${SETUPTOOLS_PINNED_VERSION}" typing_extensions
# For USE_DISTRIBUTED=1 on macOS, need libuv and pkg-config to find libuv.
export USE_DISTRIBUTED=1
retry conda install ${EXTRA_CONDA_INSTALL_FLAGS} -yq libuv pkg-config
if [[ -n "$CROSS_COMPILE_ARM64" ]]; then
export CMAKE_OSX_ARCHITECTURES=arm64
fi
export USE_MKLDNN=OFF
export USE_QNNPACK=OFF
export BUILD_TEST=OFF
pushd "$pytorch_rootdir"
echo "Calling setup.py bdist_wheel at $(date)"
if [[ "$USE_SPLIT_BUILD" == "true" ]]; then
echo "Calling setup.py bdist_wheel for split build (BUILD_LIBTORCH_WHL)"
BUILD_LIBTORCH_WHL=1 BUILD_PYTHON_ONLY=0 python setup.py bdist_wheel -d "$whl_tmp_dir"
echo "Finished setup.py bdist_wheel for split build (BUILD_LIBTORCH_WHL)"
echo "Calling setup.py bdist_wheel for split build (BUILD_PYTHON_ONLY)"
BUILD_PYTHON_ONLY=1 BUILD_LIBTORCH_WHL=0 python setup.py bdist_wheel -d "$whl_tmp_dir" --cmake
echo "Finished setup.py bdist_wheel for split build (BUILD_PYTHON_ONLY)"
else
python setup.py bdist_wheel -d "$whl_tmp_dir"
fi
echo "Finished setup.py bdist_wheel at $(date)"
if [[ $package_type != 'libtorch' ]]; then
echo "delocating wheel dependencies"
retry pip install https://github.com/matthew-brett/delocate/archive/refs/tags/0.10.4.zip
echo "found the following wheels:"
find $whl_tmp_dir -name "*.whl"
echo "running delocate"
find $whl_tmp_dir -name "*.whl" | xargs -I {} delocate-wheel -v {}
find $whl_tmp_dir -name "*.whl"
find $whl_tmp_dir -name "*.whl" | xargs -I {} delocate-listdeps {}
echo "Finished delocating wheels at $(date)"
fi
echo "The wheel is in $(find $whl_tmp_dir -name '*.whl')"
wheel_filename_gen=$(find $whl_tmp_dir -name '*.whl' | head -n1 | xargs -I {} basename {})
popd
if [[ -z "$BUILD_PYTHONLESS" ]]; then
# Copy the whl to a final destination before tests are run
echo "Renaming Wheel file: $wheel_filename_gen to $wheel_filename_new"
cp "$whl_tmp_dir/$wheel_filename_gen" "$PYTORCH_FINAL_PACKAGE_DIR/$wheel_filename_new"
##########################
# now test the binary, unless it's cross-compiled arm64
if [[ -z "$CROSS_COMPILE_ARM64" ]]; then
pip uninstall -y "$TORCH_PACKAGE_NAME" || true
pip uninstall -y "$TORCH_PACKAGE_NAME" || true
# Create new "clean" conda environment for testing
conda create ${EXTRA_CONDA_INSTALL_FLAGS} -yn "test_conda_env" python="$desired_python"
conda activate test_conda_env
pip install "$PYTORCH_FINAL_PACKAGE_DIR/$wheel_filename_new" -v
echo "$(date) :: Running tests"
# TODO: Add real tests, as run_test.sh from builder is a glorified no-op
# pushd "$pytorch_rootdir"
# "${SOURCE_DIR}/../run_tests.sh" 'wheel' "$desired_python" 'cpu'
# popd
echo "$(date) :: Finished tests"
fi
else
pushd "$pytorch_rootdir"
mkdir -p libtorch/{lib,bin,include,share}
cp -r "$(pwd)/build/lib" "$(pwd)/libtorch/"
# for now, the headers for the libtorch package will just be
# copied in from the wheel
unzip -d any_wheel "$whl_tmp_dir/$wheel_filename_gen"
if [[ -d $(pwd)/any_wheel/torch/include ]]; then
cp -r "$(pwd)/any_wheel/torch/include" "$(pwd)/libtorch/"
else
cp -r "$(pwd)/any_wheel/torch/lib/include" "$(pwd)/libtorch/"
fi
cp -r "$(pwd)/any_wheel/torch/share/cmake" "$(pwd)/libtorch/share/"
if [[ "${libtorch_arch}" == "x86_64" ]]; then
if [[ -x "$(pwd)/any_wheel/torch/.dylibs/libiomp5.dylib" ]]; then
cp -r "$(pwd)/any_wheel/torch/.dylibs/libiomp5.dylib" "$(pwd)/libtorch/lib/"
else
cp -r "$(pwd)/any_wheel/torch/lib/libiomp5.dylib" "$(pwd)/libtorch/lib/"
fi
else
cp -r "$(pwd)/any_wheel/torch/lib/libomp.dylib" "$(pwd)/libtorch/lib/"
fi
rm -rf "$(pwd)/any_wheel"
echo $PYTORCH_BUILD_VERSION > libtorch/build-version
echo "$(pushd $pytorch_rootdir && git rev-parse HEAD)" > libtorch/build-hash
zip -rq "$PYTORCH_FINAL_PACKAGE_DIR/libtorch-macos-${libtorch_arch}-$PYTORCH_BUILD_VERSION.zip" libtorch
cp "$PYTORCH_FINAL_PACKAGE_DIR/libtorch-macos-${libtorch_arch}-$PYTORCH_BUILD_VERSION.zip" \
"$PYTORCH_FINAL_PACKAGE_DIR/libtorch-macos-${libtorch_arch}-latest.zip"
fi
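
The script assembles the final wheel name from the package name, the build version, an optional `.postN` suffix (only added when the build number is greater than 1), the `cpXY` tag derived from the Python version, and the macOS platform tag. A sketch of that naming scheme with purely illustrative values:

```
# Sketch of the wheel-name construction in build_wheel.sh: the ".postN" suffix
# appears only for build numbers above 1, and the Python version is collapsed
# to digits for the cpXY tag. The example values are illustrative.
def wheel_filename(package: str, build_version: str, build_number: int,
                   desired_python: str, mac_version: str = "macosx_11_0_arm64") -> str:
    suffix = f".post{build_number}" if build_number > 1 else ""
    python_nodot = desired_python.replace(".", "")
    return f"{package}-{build_version}{suffix}-cp{python_nodot}-none-{mac_version}.whl"

print(wheel_filename("torch", "2.6.0", 1, "3.12"))
# torch-2.6.0-cp312-none-macosx_11_0_arm64.whl
print(wheel_filename("torch", "2.6.0", 3, "3.12"))
# torch-2.6.0.post3-cp312-none-macosx_11_0_arm64.whl
```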

View File

@ -1,34 +0,0 @@
#!/bin/bash
echo "RUNNING ON $(uname -a) WITH $(nproc) CPUS AND $(free -m)"
set -eux -o pipefail
source /env
# Because most Circle executors only have 20 CPUs, using more causes OOMs w/ Ninja and nvcc parallelization
MEMORY_LIMIT_MAX_JOBS=18
NUM_CPUS=$(( $(nproc) - 2 ))
# Defaults here for **binary** linux builds so they can be changed in one place
export MAX_JOBS=${MAX_JOBS:-$(( ${NUM_CPUS} > ${MEMORY_LIMIT_MAX_JOBS} ? ${MEMORY_LIMIT_MAX_JOBS} : ${NUM_CPUS} ))}
if [[ "${DESIRED_CUDA}" =~ cu1[1-2][0-9] ]]; then
export BUILD_SPLIT_CUDA="ON"
fi
# Parse the parameters
if [[ "$PACKAGE_TYPE" == 'conda' ]]; then
build_script='conda/build_pytorch.sh'
elif [[ "$DESIRED_CUDA" == cpu ]]; then
build_script='manywheel/build_cpu.sh'
elif [[ "$DESIRED_CUDA" == *"rocm"* ]]; then
build_script='manywheel/build_rocm.sh'
else
build_script='manywheel/build.sh'
fi
if [[ "$CIRCLE_BRANCH" == "main" ]] || [[ "$CIRCLE_BRANCH" == "master" ]] || [[ "$CIRCLE_BRANCH" == release/* ]]; then
export BUILD_DEBUG_INFO=1
fi
# Build the package
SKIP_ALL_TESTS=1 "/builder/$build_script"
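
The MAX_JOBS default above caps Ninja/nvcc parallelism at the smaller of the CPU count minus two and a memory-driven ceiling of 18 jobs. The same arithmetic as a short sketch:

```
# Same MAX_JOBS arithmetic as the shell default above: leave two CPUs free and
# never exceed the memory-limited ceiling of 18 parallel jobs.
import os

MEMORY_LIMIT_MAX_JOBS = 18
num_cpus = (os.cpu_count() or 2) - 2
max_jobs = min(num_cpus, MEMORY_LIMIT_MAX_JOBS)
print(max_jobs)
```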

View File

@ -22,10 +22,7 @@ fi
python_nodot="\$(echo $DESIRED_PYTHON | tr -d m.u)"
# Set up Python
if [[ "$PACKAGE_TYPE" == conda ]]; then
retry conda create -qyn testenv python="$DESIRED_PYTHON"
source activate testenv >/dev/null
elif [[ "$PACKAGE_TYPE" != libtorch ]]; then
if [[ "$PACKAGE_TYPE" != libtorch ]]; then
python_path="/opt/python/cp\$python_nodot-cp\${python_nodot}"
if [[ "\$python_nodot" = *t ]]; then
python_digits="\$(echo $DESIRED_PYTHON | tr -cd [:digit:])"
@ -58,9 +55,6 @@ mv /final_pkgs/debug-*.zip /tmp/debug_final_pkgs || echo "no debug packages to m
# Install the package
# These network calls should not have 'retry's because they are installing
# locally and aren't actually network calls
# TODO there is duplicated and inconsistent test-python-env setup across this
# file, builder/smoke_test.sh, and builder/run_tests.sh, and also in the
# conda build scripts themselves. These should really be consolidated
# Pick only one package of multiple available (which happens as result of workflow re-runs)
pkg="/final_pkgs/\$(ls -1 /final_pkgs|sort|tail -1)"
if [[ "\$PYTORCH_BUILD_VERSION" == *dev* ]]; then
@ -69,30 +63,7 @@ else
CHANNEL="test"
fi
if [[ "$PACKAGE_TYPE" == conda ]]; then
(
# For some reason conda likes to re-activate the conda environment when attempting this install
# which means that a deactivate is run and some variables might not exist when that happens,
# namely CONDA_MKL_INTERFACE_LAYER_BACKUP from libblas so let's just ignore unbound variables when
# it comes to the conda installation commands
set +u
retry conda install \${EXTRA_CONDA_FLAGS} -yq \
"numpy\${NUMPY_PIN}" \
mkl>=2018 \
ninja \
sympy>=1.12 \
typing-extensions \
${PROTOBUF_PACKAGE}
if [[ "$DESIRED_CUDA" == 'cpu' ]]; then
retry conda install -c pytorch -y cpuonly
else
cu_ver="${DESIRED_CUDA:2:2}.${DESIRED_CUDA:4}"
CUDA_PACKAGE="pytorch-cuda"
retry conda install \${EXTRA_CONDA_FLAGS} -yq -c nvidia -c "pytorch-\${CHANNEL}" "pytorch-cuda=\${cu_ver}"
fi
conda install \${EXTRA_CONDA_FLAGS} -y "\$pkg" --offline
)
elif [[ "$PACKAGE_TYPE" != libtorch ]]; then
if [[ "$PACKAGE_TYPE" != libtorch ]]; then
if [[ "\$BUILD_ENVIRONMENT" != *s390x* ]]; then
if [[ "$USE_SPLIT_BUILD" == "true" ]]; then
pkg_no_python="$(ls -1 /final_pkgs/torch_no_python* | sort |tail -1)"
@ -116,15 +87,15 @@ if [[ "$PACKAGE_TYPE" == libtorch ]]; then
fi
# Test the package
/builder/check_binary.sh
/pytorch/.ci/pytorch/check_binary.sh
if [[ "\$GPU_ARCH_TYPE" != *s390x* && "\$GPU_ARCH_TYPE" != *xpu* && "\$GPU_ARCH_TYPE" != *rocm* && "$PACKAGE_TYPE" != libtorch ]]; then
# Exclude s390, xpu, rocm and libtorch builds from smoke testing
python /builder/test/smoke_test/smoke_test.py --package=torchonly --torch-compile-check disabled
python /pytorch/.ci/pytorch/smoke_test/smoke_test.py --package=torchonly --torch-compile-check disabled
fi
# Clean temp files
cd /builder && git clean -ffdx
cd /pytorch/.ci/pytorch/ && git clean -ffdx
# =================== The above code will be executed inside Docker container ===================
EOL

View File

@ -7,9 +7,5 @@ mkdir -p "$PYTORCH_FINAL_PACKAGE_DIR"
# Build
export USE_PYTORCH_METAL_EXPORT=1
export USE_COREML_DELEGATE=1
if [[ "$PACKAGE_TYPE" == conda ]]; then
"${BUILDER_ROOT}/conda/build_pytorch.sh"
else
export TORCH_PACKAGE_NAME="$(echo $TORCH_PACKAGE_NAME | tr '-' '_')"
"${BUILDER_ROOT}/wheel/build_wheel.sh"
fi
export TORCH_PACKAGE_NAME="$(echo $TORCH_PACKAGE_NAME | tr '-' '_')"
"${PYTORCH_ROOT}/.ci/wheel/build_wheel.sh"

View File

@ -1,34 +0,0 @@
#!/bin/bash
set -eux -o pipefail
source "/Users/distiller/project/env"
export "PATH=$workdir/miniconda/bin:$PATH"
pkg="$workdir/final_pkgs/$(ls $workdir/final_pkgs)"
# Create a new test env
# TODO cut all this out into a separate test job and have an entirely different
# miniconda
if [[ "$PACKAGE_TYPE" != libtorch ]]; then
source deactivate || true
conda create -qyn test python="$DESIRED_PYTHON"
source activate test >/dev/null
fi
# Install the package
if [[ "$PACKAGE_TYPE" == libtorch ]]; then
pkg="$(ls $workdir/final_pkgs/*-latest.zip)"
unzip "$pkg" -d /tmp
cd /tmp/libtorch
elif [[ "$PACKAGE_TYPE" == conda ]]; then
conda install -y "$pkg"
else
pip install "$pkg" -v
fi
# Test
if [[ "$PACKAGE_TYPE" == libtorch ]]; then
$workdir/builder/check_binary.sh
else
pushd "$workdir/pytorch"
$workdir/builder/run_tests.sh "$PACKAGE_TYPE" "$DESIRED_PYTHON" "$DESIRED_CUDA"
fi

View File

@ -75,9 +75,8 @@ export PYTORCH_BUILD_NUMBER=1
TRITON_VERSION=$(cat $PYTORCH_ROOT/.ci/docker/triton_version.txt)
# Here PYTORCH_EXTRA_INSTALL_REQUIREMENTS is already set for all the wheel builds, hence append TRITON_CONSTRAINT
TRITON_CONSTRAINT="platform_system == 'Linux' and platform_machine == 'x86_64' and python_version < '3.13'"
if [[ "$PACKAGE_TYPE" =~ .*wheel.* && -n "${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:-}" ]]; then
# Only linux Python < 3.13 are supported wheels for triton
TRITON_CONSTRAINT="platform_system == 'Linux' and platform_machine == 'x86_64'"
if [[ "$PACKAGE_TYPE" =~ .*wheel.* && -n "${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:-}" && ! "$PYTORCH_BUILD_VERSION" =~ .*xpu.* ]]; then
TRITON_REQUIREMENT="triton==${TRITON_VERSION}; ${TRITON_CONSTRAINT}"
if [[ -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*dev.* ]]; then
TRITON_SHORTHASH=$(cut -c1-8 $PYTORCH_ROOT/.ci/docker/ci_commit_pins/triton.txt)

View File

@ -19,13 +19,9 @@ fi
echo "Free space on filesystem before build:"
df -h
pushd "$BUILDER_ROOT"
if [[ "$PACKAGE_TYPE" == 'conda' ]]; then
./windows/internal/build_conda.bat
elif [[ "$PACKAGE_TYPE" == 'wheel' || "$PACKAGE_TYPE" == 'libtorch' ]]; then
export NIGHTLIES_PYTORCH_ROOT="$PYTORCH_ROOT"
./windows/internal/build_wheels.bat
fi
pushd "$PYTORCH_ROOT/.ci/pytorch/"
export NIGHTLIES_PYTORCH_ROOT="$PYTORCH_ROOT"
./windows/internal/build_wheels.bat
echo "Free space on filesystem after build:"
df -h

View File

@ -11,8 +11,7 @@ if [[ "$DESIRED_CUDA" == 'xpu' ]]; then
export XPU_VERSION=2025.0
fi
pushd "$BUILDER_ROOT"
pushd "$PYTORCH_ROOT/.ci/pytorch/"
./windows/internal/smoke_test.bat
popd

View File

@ -26,4 +26,4 @@ runs:
shell: bash
run: |
mkdir -p .additional_ci_files
mv td_results.json .additional_ci_files/td_results.json
mv td_results.json .additional_ci_files/td_results.json || true

View File

@ -30,7 +30,6 @@ runs:
--tty \
--detach \
-v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \
-v "${GITHUB_WORKSPACE}/builder:/builder" \
-v "${RUNNER_TEMP}/artifacts:/final_pkgs" \
-w / \
"${DOCKER_IMAGE}"

View File

@ -1 +1 @@
2ec22641e390cda25ec7c61fcbce07507727d584
r2.6

View File

@ -10,7 +10,6 @@ ciflow_push_tags:
- ciflow/inductor-perf-compare
- ciflow/inductor-micro-benchmark
- ciflow/inductor-micro-benchmark-cpu-x86
- ciflow/inductor-cu124
- ciflow/inductor-cu126
- ciflow/linux-aarch64
- ciflow/mps
@ -29,6 +28,7 @@ retryable_workflows:
- trunk
- linux-binary
- windows-binary
- inductor-A100-perf-nightly
labeler_config: labeler.yml
label_to_label_config: label_to_label.yml
mergebot: True

View File

@ -17,6 +17,7 @@ pytest==7.3.2
pytest-xdist==3.3.1
pytest-rerunfailures==10.3
pytest-flakefinder==1.1.0
pytest-subtests==0.13.1
scipy==1.10.1
sympy==1.12.1 ; python_version == "3.8"
sympy==1.13.1 ; python_version >= "3.9"

View File

@ -30,9 +30,14 @@ fi
# Remove packaged libs and headers
rm -rf $TRITON_ROCM_DIR/include/*
LIBTINFO_PATH="/usr/lib64/libtinfo.so.5"
LIBNUMA_PATH="/usr/lib64/libnuma.so.1"
LIBELF_PATH="/usr/lib64/libelf.so.1"
OS_NAME=`awk -F= '/^NAME/{print $2}' /etc/os-release`
if [[ "$OS_NAME" == *"CentOS Linux"* ]]; then
LIBTINFO_PATH="/usr/lib64/libtinfo.so.5"
else
LIBTINFO_PATH="/usr/lib64/libtinfo.so.6"
fi
OS_SO_PATHS=(
$LIBELF_PATH

View File

@ -39,9 +39,9 @@ SUPPORTED_PERIODICAL_MODES: Dict[str, Callable[[Optional[str]], bool]] = {
}
# The link to the published list of disabled jobs
DISABLED_JOBS_URL = "https://ossci-metrics.s3.amazonaws.com/disabled-jobs.json"
DISABLED_JOBS_URL = "https://ossci-metrics.s3.amazonaws.com/disabled-jobs.json?versionId=pQg1WJZKNqoisT5kAGG9Wmbuns5zBdBc"
# and unstable jobs
UNSTABLE_JOBS_URL = "https://ossci-metrics.s3.amazonaws.com/unstable-jobs.json"
UNSTABLE_JOBS_URL = "https://ossci-metrics.s3.amazonaws.com/unstable-jobs.json?versionId=ddADM6lf9NqVTA0APn69zl3M7nMda4DH"
# Some constants used to handle disabled and unstable jobs
JOB_NAME_SEP = "/"
@ -332,7 +332,7 @@ def process_jobs(
# The job name from github is in the PLATFORM / JOB (CONFIG) format, so breaking
# it into its two components first
current_platform, _ = (n.strip() for n in job_name.split(JOB_NAME_SEP, 1) if n)
except ValueError as error:
except ValueError:
warnings.warn(f"Invalid job name {job_name}, returning")
return test_matrix

View File

@ -17,10 +17,10 @@ from typing import Dict, List, Optional, Tuple
# NOTE: Also update the CUDA sources in tools/nightly.py when changing this list
CUDA_ARCHES = ["11.8", "12.4", "12.6"]
CUDA_ARCHES_FULL_VERSION = {"11.8": "11.8.0", "12.4": "12.4.1", "12.6": "12.6.2"}
CUDA_ARCHES_FULL_VERSION = {"11.8": "11.8.0", "12.4": "12.4.1", "12.6": "12.6.3"}
CUDA_ARCHES_CUDNN_VERSION = {"11.8": "9", "12.4": "9", "12.6": "9"}
ROCM_ARCHES = ["6.1", "6.2"]
ROCM_ARCHES = ["6.1", "6.2.4"]
XPU_ARCHES = ["xpu"]
@ -162,7 +162,7 @@ WHEEL_CONTAINER_IMAGES = {
"12.4": f"pytorch/manylinux-builder:cuda12.4-{DEFAULT_TAG}",
"12.6": f"pytorch/manylinux2_28-builder:cuda12.6-{DEFAULT_TAG}",
**{
gpu_arch: f"pytorch/manylinux-builder:rocm{gpu_arch}-{DEFAULT_TAG}"
gpu_arch: f"pytorch/manylinux2_28-builder:rocm{gpu_arch}-{DEFAULT_TAG}"
for gpu_arch in ROCM_ARCHES
},
"xpu": f"pytorch/manylinux2_28-builder:xpu-{DEFAULT_TAG}",
@ -170,7 +170,7 @@ WHEEL_CONTAINER_IMAGES = {
"cpu-cxx11-abi": f"pytorch/manylinuxcxx11-abi-builder:cpu-cxx11-abi-{DEFAULT_TAG}",
"cpu-aarch64": f"pytorch/manylinux2_28_aarch64-builder:cpu-aarch64-{DEFAULT_TAG}",
"cpu-s390x": f"pytorch/manylinuxs390x-builder:cpu-s390x-{DEFAULT_TAG}",
"cuda-aarch64": f"pytorch/manylinuxaarch64-builder:cuda12.4-{DEFAULT_TAG}",
"cuda-aarch64": f"pytorch/manylinuxaarch64-builder:cuda12.6-{DEFAULT_TAG}",
}
@ -194,13 +194,6 @@ LIBTORCH_CONTAINER_IMAGES: Dict[Tuple[str, str], str] = {
): f"pytorch/libtorch-cxx11-builder:cuda{gpu_arch}-{DEFAULT_TAG}"
for gpu_arch in CUDA_ARCHES
},
**{
(
gpu_arch,
PRE_CXX11_ABI,
): f"pytorch/manylinux-builder:rocm{gpu_arch}-{DEFAULT_TAG}"
for gpu_arch in ROCM_ARCHES
},
**{
(
gpu_arch,
@ -222,7 +215,7 @@ def translate_desired_cuda(gpu_arch_type: str, gpu_arch_version: str) -> str:
"cpu-cxx11-abi": "cpu-cxx11-abi",
"cpu-s390x": "cpu",
"cuda": f"cu{gpu_arch_version.replace('.', '')}",
"cuda-aarch64": "cu124",
"cuda-aarch64": "cu126",
"rocm": f"rocm{gpu_arch_version}",
"xpu": "xpu",
}.get(gpu_arch_type, gpu_arch_version)
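
The hunk above touches `translate_desired_cuda`, which maps a `(gpu_arch_type, gpu_arch_version)` pair to the DESIRED_CUDA string used downstream; with this change `cuda-aarch64` resolves to `cu126` instead of `cu124`. A condensed, deliberately partial sketch of that mapping:

```
# Condensed sketch of the translate_desired_cuda mapping shown in the hunk:
# CUDA versions collapse to "cuXYZ", ROCm keeps its dotted version, and
# cuda-aarch64 is pinned to a single toolkit ("cu126" after this change).
def translate_desired_cuda(gpu_arch_type: str, gpu_arch_version: str) -> str:
    return {
        "cpu": "cpu",
        "cpu-s390x": "cpu",
        "cuda": f"cu{gpu_arch_version.replace('.', '')}",
        "cuda-aarch64": "cu126",
        "rocm": f"rocm{gpu_arch_version}",
        "xpu": "xpu",
    }.get(gpu_arch_type, gpu_arch_version)

print(translate_desired_cuda("cuda", "12.6"))   # cu126
print(translate_desired_cuda("rocm", "6.2.4"))  # rocm6.2.4
```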
@ -262,7 +255,9 @@ def generate_libtorch_matrix(
gpu_arch_type = arch_type(arch_version)
gpu_arch_version = "" if arch_version == "cpu" else arch_version
# ROCm builds without-deps failed even in ROCm runners; skip for now
if gpu_arch_type == "rocm" and "without-deps" in libtorch_variant:
if gpu_arch_type == "rocm" and (
"without-deps" in libtorch_variant or "pre-cxx11" in abi_version
):
continue
ret.append(
{
@ -333,10 +328,9 @@ def generate_wheels_matrix(
else arch_version
)
# TODO: Enable python 3.13 on rocm, aarch64, windows
# TODO: Enable python 3.13 on aarch64, windows
if (
gpu_arch_type == "rocm"
or os
os
not in [
"linux",
"linux-s390x",
@ -382,7 +376,11 @@ def generate_wheels_matrix(
),
"use_split_build": "True" if use_split_build else "False",
"devtoolset": (
"cxx11-abi" if arch_version == "cuda-aarch64" else ""
"cxx11-abi"
if (
arch_version == "cuda-aarch64" or arch_version == "12.6"
)
else ""
),
"container_image": WHEEL_CONTAINER_IMAGES[arch_version],
"package_type": package_type,
@ -428,7 +426,8 @@ def generate_wheels_matrix(
"use_split_build": "True" if use_split_build else "False",
"devtoolset": (
"cxx11-abi"
if arch_version in ["cpu-cxx11-abi", "cpu-aarch64"]
if (arch_version in ["cpu-cxx11-abi", "cpu-aarch64", "xpu"])
or gpu_arch_type == "rocm"
else ""
),
"container_image": WHEEL_CONTAINER_IMAGES[arch_version],

View File

@ -1,5 +1,9 @@
#!/usr/bin/env bash
set -eux
CHANGES=$(git status --porcelain "$1")
echo "$CHANGES"
git diff "$1"
# NB: Use --no-pager here to avoid git diff asking for a prompt to continue
git --no-pager diff "$1"
[ -z "$CHANGES" ]

View File

@ -258,7 +258,7 @@ def load_yaml(yaml_text: str) -> Any:
try:
data = yaml.safe_load(yaml_text)
return data
except yaml.YAMLError as exc:
except yaml.YAMLError:
log.exception("Error loading YAML")
raise

View File

@ -898,7 +898,7 @@ class TestBypassFailures(TestCase):
repo = DummyGitRepo()
# Check that failure is classified as flaky but still raises exception
with warnings.catch_warnings(record=True) as w, self.assertRaises(RuntimeError):
rule = find_matching_merge_rule(pr, repo)
find_matching_merge_rule(pr, repo)
self.assertEqual(len(w), 1)
self.assertIn(
"1 checks failed but were likely due flakiness or broken trunk",

View File

@ -1747,7 +1747,7 @@ def get_classifications(
try:
print(f"From Dr.CI checkrun summary: {drci_summary}")
drci_classifications = json.loads(str(drci_summary))
except json.JSONDecodeError as error:
except json.JSONDecodeError:
warn("Invalid Dr.CI checkrun summary")
drci_classifications = {}
@ -1918,7 +1918,6 @@ def do_revert_prs(
dry_run: bool = False,
) -> None:
# Prepare and push revert commits
commit_shas: List[str] = []
for commit_sha, pr in shas_and_prs:
revert_msg = f"\nReverted {pr.get_pr_url()} on behalf of {prefix_with_github_url(author_login)}"
revert_msg += extra_msg

View File

@ -8,7 +8,7 @@
# NOTE: If you are testing pytorch/builder changes, you can change this variable to control which pytorch/builder
# reference the binary builds will check out
{%- set builder_repo = "pytorch/builder" -%}
{%- set builder_branch = "main" -%}
{%- set builder_branch = "release/2.6" -%}
{%- macro concurrency(build_environment) -%}
concurrency:
@ -36,7 +36,7 @@ concurrency:
{%- macro setup_ec2_windows() -%}
!{{ display_ec2_information() }}
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
uses: pytorch/test-infra/.github/actions/setup-ssh@main
uses: pytorch/test-infra/.github/actions/setup-ssh@release/2.6
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}

View File

@ -55,7 +55,7 @@ jobs:
get-label-type:
if: github.repository_owner == 'pytorch'
name: get-label-type
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@release/2.6
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
@ -145,10 +145,9 @@ jobs:
with:
name: !{{ config["build_name"] }}
path: "${{ runner.temp }}/artifacts/"
!{{ common.checkout(deep_clone=False, directory="pytorch") }}
!{{ common.checkout(deep_clone=False, directory="builder", repository=common.builder_repo, branch=common.builder_branch) }}
!{{ common.checkout(deep_clone=False, directory="pytorch", checkout_pr_head=False) }}
- name: Pull Docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
uses: pytorch/test-infra/.github/actions/pull-docker-image@release/2.6
with:
docker-image: !{{ config["container_image"] }}
- name: Test Pytorch binary
@ -167,13 +166,12 @@ jobs:
with:
name: !{{ config["build_name"] }}
path: "${{ runner.temp }}/artifacts/"
!{{ common.checkout(deep_clone=False, directory="pytorch") }}
!{{ common.checkout(deep_clone=False, directory="builder", repository=common.builder_repo, branch=common.builder_branch) }}
!{{ common.checkout(deep_clone=False, directory="pytorch", checkout_pr_head=False) }}
- name: ROCm set GPU_FLAG
run: |
echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}"
- name: Pull Docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
uses: pytorch/test-infra/.github/actions/pull-docker-image@release/2.6
with:
docker-image: !{{ config["container_image"] }}
- name: Test Pytorch binary

View File

@ -78,8 +78,7 @@ jobs:
elif [ -d "/Applications/Xcode_13.3.1.app" ]; then
echo "DEVELOPER_DIR=/Applications/Xcode_13.3.1.app/Contents/Developer" >> "${GITHUB_ENV}"
fi
!{{ common.checkout(deep_clone=False, directory="pytorch") }}
!{{ common.checkout(deep_clone=False, directory="builder", repository=common.builder_repo, branch=common.builder_branch) }}
!{{ common.checkout(deep_clone=False, directory="pytorch", checkout_pr_head=False) }}
- name: Install sccache (only for non-forked PRs, and pushes to trunk)
uses: nick-fields/retry@v3.0.0
if: ${{ github.event_name == 'push' || github.event.pull_request.head.repo.full_name == github.repository }}

View File

@ -56,7 +56,7 @@ jobs:
get-label-type:
if: github.repository_owner == 'pytorch'
name: get-label-type
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@release/2.6
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
@ -80,8 +80,7 @@ jobs:
steps:
!{{ common.setup_ec2_windows() }}
!{{ set_runner_specific_vars() }}
!{{ common.checkout(deep_clone=False, directory="pytorch") }}
!{{ common.checkout(deep_clone=False, directory="builder", repository=common.builder_repo, branch=common.builder_branch) }}
!{{ common.checkout(deep_clone=False, directory="pytorch", checkout_pr_head=False) }}
- name: Populate binary env
shell: bash
run: |
@ -122,8 +121,7 @@ jobs:
with:
name: !{{ config["build_name"] }}
path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"
!{{ common.checkout(deep_clone=False, directory="pytorch") }}
!{{ common.checkout(deep_clone=False, directory="builder", repository=common.builder_repo, branch=common.builder_branch) }}
!{{ common.checkout(deep_clone=False, directory="pytorch", checkout_pr_head=False) }}
- name: Populate binary env
shell: bash
run: |

View File

@ -47,7 +47,7 @@ jobs:
reenabled-issues: ${{ steps.filter.outputs.reenabled-issues }}
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.6
with:
fetch-depth: 1
submodules: false
@ -69,25 +69,25 @@ jobs:
runs-on: ${{ matrix.runner }}
steps:
- name: Setup SSH (Click me for login details)
uses: pytorch/test-infra/.github/actions/setup-ssh@main
uses: pytorch/test-infra/.github/actions/setup-ssh@release/2.6
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
# [see note: pytorch repo ref]
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.6
- name: Setup Linux
uses: ./.github/actions/setup-linux
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
uses: pytorch/test-infra/.github/actions/calculate-docker-image@release/2.6
with:
docker-image-name: ${{ inputs.docker-image-name }}
- name: Pull docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
uses: pytorch/test-infra/.github/actions/pull-docker-image@release/2.6
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
@ -97,7 +97,7 @@ jobs:
run: echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT"
- name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG
uses: pytorch/test-infra/.github/actions/setup-nvidia@main
uses: pytorch/test-infra/.github/actions/setup-nvidia@release/2.6
if: ${{ inputs.cuda-version != 'cpu' && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false' }}
- name: Output disk space left
@ -209,5 +209,5 @@ jobs:
file-suffix: bazel-${{ github.job }}_${{ steps.get-job-id.outputs.job-id }}
- name: Teardown Linux
uses: pytorch/test-infra/.github/actions/teardown-linux@main
uses: pytorch/test-infra/.github/actions/teardown-linux@release/2.6
if: always()

View File

@ -159,13 +159,13 @@ jobs:
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
if: inputs.build_environment != 'linux-s390x-binary-manywheel'
uses: pytorch/test-infra/.github/actions/setup-ssh@main
uses: pytorch/test-infra/.github/actions/setup-ssh@release/2.6
continue-on-error: true
with:
github-secret: ${{ secrets.github-token }}
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.6
with:
no-sudo: ${{ inputs.build_environment == 'linux-aarch64-binary-manywheel' || inputs.build_environment == 'linux-s390x-binary-manywheel' }}
@ -195,7 +195,6 @@ jobs:
- name: Checkout PyTorch to pytorch dir
uses: malfet/checkout@silent-checkout
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
quiet-checkout: true
@ -206,21 +205,6 @@ jobs:
git clean -fxd
working-directory: pytorch
- name: Checkout pytorch/builder to builder dir
uses: malfet/checkout@silent-checkout
with:
ref: main
submodules: recursive
repository: pytorch/builder
path: builder
quiet-checkout: true
- name: Clean pytorch/builder checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: builder
- name: Check if the job is disabled
id: filter
uses: ./pytorch/.github/actions/filter-test-configs
@ -235,7 +219,7 @@ jobs:
- name: Pull Docker image
if: ${{ steps.filter.outputs.is-test-matrix-empty == 'False' && inputs.build_environment != 'linux-s390x-binary-manywheel' }}
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
uses: pytorch/test-infra/.github/actions/pull-docker-image@release/2.6
with:
docker-image: ${{ inputs.DOCKER_IMAGE }}
@ -246,7 +230,6 @@ jobs:
mkdir -p artifacts/
container_name=$(docker run \
-e BINARY_ENV_FILE \
-e BUILDER_ROOT \
-e BUILD_ENVIRONMENT \
-e DESIRED_CUDA \
-e DESIRED_DEVTOOLSET \
@ -264,7 +247,6 @@ jobs:
--tty \
--detach \
-v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \
-v "${GITHUB_WORKSPACE}/builder:/builder" \
-v "${RUNNER_TEMP}/artifacts:/artifacts" \
-w / \
"${DOCKER_IMAGE}"
@ -272,10 +254,8 @@ jobs:
docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh"
if [[ ${BUILD_ENVIRONMENT} == *"aarch64"* ]]; then
docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash /pytorch/.ci/aarch64_linux/aarch64_ci_build.sh"
elif [[ ${{ inputs.PACKAGE_TYPE }} == "manywheel" || ${{ inputs.PACKAGE_TYPE }} == "libtorch" ]]; then
docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash /pytorch/.ci/${{ inputs.PACKAGE_TYPE }}/build.sh"
else
docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash /builder/${{ inputs.PACKAGE_TYPE }}/build.sh"
docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash /pytorch/.ci/${{ inputs.PACKAGE_TYPE }}/build.sh"
fi
- name: Chown artifacts
@ -295,7 +275,7 @@ jobs:
- name: Teardown Linux
if: always() && inputs.build_environment != 'linux-s390x-binary-manywheel'
uses: pytorch/test-infra/.github/actions/teardown-linux@main
uses: pytorch/test-infra/.github/actions/teardown-linux@release/2.6
- name: Chown workspace
if: always() && inputs.build_environment != 'linux-s390x-binary-manywheel'

View File

@ -142,14 +142,14 @@ jobs:
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
if: inputs.build_environment != 'linux-s390x-binary-manywheel'
uses: pytorch/test-infra/.github/actions/setup-ssh@main
uses: pytorch/test-infra/.github/actions/setup-ssh@release/2.6
continue-on-error: true
with:
github-secret: ${{ secrets.github-token }}
# Setup the environment
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.6
with:
no-sudo: ${{ inputs.build_environment == 'linux-aarch64-binary-manywheel' || inputs.build_environment == 'linux-s390x-binary-manywheel' }}
@ -172,7 +172,6 @@ jobs:
- name: Checkout PyTorch to pytorch dir
uses: malfet/checkout@silent-checkout
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
@ -182,20 +181,6 @@ jobs:
git clean -fxd
working-directory: pytorch
- name: Checkout pytorch/builder to builder dir
uses: malfet/checkout@silent-checkout
with:
ref: main
submodules: recursive
repository: pytorch/builder
path: builder
- name: Clean pytorch/builder checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: builder
- name: Check if the job is disabled
id: filter
uses: ./pytorch/.github/actions/filter-test-configs
@ -216,12 +201,12 @@ jobs:
path: "${{ runner.temp }}/artifacts/"
- name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG
uses: pytorch/test-infra/.github/actions/setup-nvidia@main
uses: pytorch/test-infra/.github/actions/setup-nvidia@release/2.6
if: ${{ inputs.GPU_ARCH_TYPE == 'cuda' && steps.filter.outputs.is-test-matrix-empty == 'False' }}
- name: Pull Docker image
if: ${{ steps.filter.outputs.is-test-matrix-empty == 'False' && inputs.build_environment != 'linux-s390x-binary-manywheel' }}
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
uses: pytorch/test-infra/.github/actions/pull-docker-image@release/2.6
with:
docker-image: ${{ inputs.DOCKER_IMAGE }}
@ -231,7 +216,7 @@ jobs:
- name: Teardown Linux
if: always() && inputs.build_environment != 'linux-s390x-binary-manywheel'
uses: pytorch/test-infra/.github/actions/teardown-linux@main
uses: pytorch/test-infra/.github/actions/teardown-linux@release/2.6
- name: Chown workspace
if: always() && inputs.build_environment != 'linux-s390x-binary-manywheel'

View File

@ -103,7 +103,7 @@ jobs:
USE_SPLIT_BUILD: ${{ inputs.use_split_build }}
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.6
with:
no-sudo: true

View File

@ -84,7 +84,7 @@ jobs:
name: build-docs-${{ matrix.docs_type }}-${{ inputs.push }}
steps:
- name: Setup SSH (Click me for login details)
uses: pytorch/test-infra/.github/actions/setup-ssh@main
uses: pytorch/test-infra/.github/actions/setup-ssh@release/2.6
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
instructions: |
@ -95,7 +95,7 @@ jobs:
# [see note: pytorch repo ref]
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.6
- name: Setup Linux
uses: ./.github/actions/setup-linux
@@ -110,12 +110,12 @@ jobs:
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
uses: pytorch/test-infra/.github/actions/calculate-docker-image@release/2.6
with:
docker-image-name: ${{ inputs.docker-image }}
- name: Pull docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
uses: pytorch/test-infra/.github/actions/pull-docker-image@release/2.6
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
@@ -222,5 +222,5 @@ jobs:
s3-prefix: pytorch/pytorch/${{ github.event.pull_request.number }}/functorchdocs
- name: Teardown Linux
uses: pytorch/test-infra/.github/actions/teardown-linux@main
uses: pytorch/test-infra/.github/actions/teardown-linux@release/2.6
if: always()


@@ -108,7 +108,7 @@ jobs:
test-matrix: ${{ steps.filter.outputs.test-matrix }}
steps:
- name: Setup SSH (Click me for login details)
uses: pytorch/test-infra/.github/actions/setup-ssh@main
uses: pytorch/test-infra/.github/actions/setup-ssh@release/2.6
if: inputs.build-environment != 'linux-s390x-binary-manywheel'
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
@@ -118,7 +118,7 @@ jobs:
# checkout because when we run this action we don't *have* a local
# checkout. In other cases you should prefer a local checkout.
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.6
with:
no-sudo: true
@@ -136,7 +136,7 @@ jobs:
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
uses: pytorch/test-infra/.github/actions/calculate-docker-image@release/2.6
if: inputs.build-environment != 'linux-s390x-binary-manywheel'
with:
docker-image-name: ${{ inputs.docker-image-name }}
@@ -152,7 +152,7 @@ jobs:
echo "docker pull ghcr.io/pytorch/ci-image:${tag/:/-}"
- name: Pull docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
uses: pytorch/test-infra/.github/actions/pull-docker-image@release/2.6
if: inputs.build-environment != 'linux-s390x-binary-manywheel'
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
@@ -224,6 +224,12 @@ jobs:
JENKINS_USER="--user jenkins"
USED_IMAGE="${DOCKER_IMAGE}"
fi
# Leaving 1GB for the runner and other things
TOTAL_AVAILABLE_MEMORY_IN_GB=$(awk '/MemTotal/ { printf "%.3f \n", $2/1024/1024 - 1 }' /proc/meminfo)
# https://docs.docker.com/engine/containers/resource_constraints/#--memory-swap-details
TOTAL_MEMORY_WITH_SWAP=$(("${TOTAL_AVAILABLE_MEMORY_IN_GB%.*}" * 2))
# detached container should get cleaned up by teardown_ec2_linux
# Used for JENKINS_USER, which can be empty
# shellcheck disable=SC2086
@@ -244,7 +250,10 @@ jobs:
-e PR_LABELS \
-e OUR_GITHUB_JOB_ID \
-e HUGGING_FACE_HUB_TOKEN \
-e SCRIBE_GRAPHQL_ACCESS_TOKEN \
-e USE_SPLIT_BUILD \
--memory="${TOTAL_AVAILABLE_MEMORY_IN_GB%.*}g" \
--memory-swap="${TOTAL_MEMORY_WITH_SWAP}g" \
--env-file="/tmp/github_env_${GITHUB_RUN_ID}" \
--security-opt seccomp=unconfined \
--cap-add=SYS_PTRACE \
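Pulling the new memory-limit pieces from the two hunks above into one place, a minimal sketch of the logic; the image and trailing command are illustrative, everything else mirrors the diff:

```
# Total host memory in GB, leaving 1GB for the runner and other processes.
TOTAL_AVAILABLE_MEMORY_IN_GB=$(awk '/MemTotal/ { printf "%.3f \n", $2/1024/1024 - 1 }' /proc/meminfo)

# --memory-swap is the combined memory+swap ceiling, so doubling the
# integer-truncated memory limit allows swap roughly equal to RAM; see the
# Docker resource-constraints docs linked in the diff.
TOTAL_MEMORY_WITH_SWAP=$(("${TOTAL_AVAILABLE_MEMORY_IN_GB%.*}" * 2))

# e.g. on a host whose MemTotal reads exactly 64GiB this resolves to
# --memory=63g and --memory-swap=126g.
docker run --rm \
  --memory="${TOTAL_AVAILABLE_MEMORY_IN_GB%.*}g" \
  --memory-swap="${TOTAL_MEMORY_WITH_SWAP}g" \
  ubuntu:22.04 true
```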
@@ -311,7 +320,7 @@ jobs:
build-time: ${{ steps.build.outputs.build_time }}
- name: Teardown Linux
uses: pytorch/test-infra/.github/actions/teardown-linux@main
uses: pytorch/test-infra/.github/actions/teardown-linux@release/2.6
if: always() && inputs.build-environment != 'linux-s390x-binary-manywheel'
- name: Cleanup docker


@@ -54,7 +54,7 @@ on:
since we are investigating the behaviour of the monitor script with different tests.
required: false
type: boolean
default: true
default: false
secrets:
HUGGING_FACE_HUB_TOKEN:
required: false
@@ -80,7 +80,7 @@ jobs:
timeout-minutes: ${{ matrix.mem_leak_check == 'mem_leak_check' && 600 || inputs.timeout-minutes }}
steps:
- name: Setup SSH (Click me for login details)
uses: pytorch/test-infra/.github/actions/setup-ssh@main
uses: pytorch/test-infra/.github/actions/setup-ssh@release/2.6
if: ${{ !contains(matrix.runner, 'gcp.a100') }}
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
@@ -89,7 +89,7 @@ jobs:
docker exec -it $(docker container ps --format '{{.ID}}') bash
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.6
with:
no-sudo: true
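The login hint at the top of the hunk above attaches a shell to the running CI container; a slightly more defensive variant (purely illustrative) picks one container ID explicitly in case several are running:

```
# Grab the ID of one running container (here: the first listed) and open an
# interactive shell in it.
CONTAINER_ID=$(docker container ps --format '{{.ID}}' | head -n 1)
docker exec -it "${CONTAINER_ID}" bash
```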
@@ -106,7 +106,7 @@ jobs:
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
uses: pytorch/test-infra/.github/actions/calculate-docker-image@release/2.6
with:
docker-image-name: ${{ inputs.docker-image }}
@@ -120,7 +120,7 @@ jobs:
echo "docker pull ghcr.io/pytorch/ci-image:${tag/:/-}"
- name: Pull docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
uses: pytorch/test-infra/.github/actions/pull-docker-image@release/2.6
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
@@ -131,7 +131,7 @@ jobs:
- name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG
id: install-nvidia-driver
uses: pytorch/test-infra/.github/actions/setup-nvidia@main
uses: pytorch/test-infra/.github/actions/setup-nvidia@release/2.6
if: ${{ contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu') && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false' }}
- name: Setup GPU_FLAG for docker run
@@ -331,7 +331,7 @@ jobs:
job_identifier: ${{ github.workflow }}_${{ inputs.build-environment }}
- name: Upload the benchmark results
uses: pytorch/test-infra/.github/actions/upload-benchmark-results@main
uses: pytorch/test-infra/.github/actions/upload-benchmark-results@release/2.6
with:
benchmark-results-dir: test/test-reports
dry-run: false
@@ -377,7 +377,7 @@ jobs:
path: ./**/core.[1-9]*
- name: Teardown Linux
uses: pytorch/test-infra/.github/actions/teardown-linux@main
uses: pytorch/test-infra/.github/actions/teardown-linux@release/2.6
if: always() && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false'
# NB: We are currently having an intermittent GPU-related issue on G5 runners with


@@ -71,11 +71,11 @@ jobs:
test-matrix: ${{ steps.filter.outputs.test-matrix }}
steps:
- name: Clean up disk space before running MacOS workflow
uses: pytorch/test-infra/.github/actions/check-disk-space@main
uses: pytorch/test-infra/.github/actions/check-disk-space@release/2.6
# [see note: pytorch repo ref]
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.6
- name: Set xcode version
env:
@@ -87,7 +87,7 @@ jobs:
- name: Setup miniconda
if: inputs.environment-file == ''
uses: pytorch/test-infra/.github/actions/setup-miniconda@main
uses: pytorch/test-infra/.github/actions/setup-miniconda@release/2.6
with:
python-version: ${{ inputs.python-version }}
environment-file: .github/requirements/conda-env-${{ runner.os }}-${{ runner.arch }}
@@ -97,7 +97,7 @@ jobs:
# environment even though the arch is x86-64
- name: Setup miniconda using the provided environment file
if: inputs.environment-file != ''
uses: pytorch/test-infra/.github/actions/setup-miniconda@main
uses: pytorch/test-infra/.github/actions/setup-miniconda@release/2.6
with:
python-version: ${{ inputs.python-version }}
environment-file: ${{ inputs.environment-file }}
@@ -207,4 +207,4 @@ jobs:
- name: Clean up disk space
if: always()
continue-on-error: true
uses: pytorch/test-infra/.github/actions/check-disk-space@main
uses: pytorch/test-infra/.github/actions/check-disk-space@release/2.6


@@ -41,7 +41,7 @@ jobs:
reenabled-issues: ${{ steps.filter.outputs.reenabled-issues }}
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.6
with:
submodules: false
@@ -82,7 +82,7 @@ jobs:
use-gha: true
- name: Setup miniconda
uses: pytorch/test-infra/.github/actions/setup-miniconda@main
uses: pytorch/test-infra/.github/actions/setup-miniconda@release/2.6
with:
python-version: ${{ inputs.python-version }}
environment-file: .github/requirements/conda-env-${{ runner.os }}-${{ runner.arch }}
@@ -169,4 +169,4 @@ jobs:
- name: Clean up disk space
if: always()
continue-on-error: true
uses: pytorch/test-infra/.github/actions/check-disk-space@main
uses: pytorch/test-infra/.github/actions/check-disk-space@release/2.6


@@ -82,11 +82,19 @@ jobs:
done
- name: Clean up disk space before running MacOS workflow
uses: pytorch/test-infra/.github/actions/check-disk-space@main
uses: pytorch/test-infra/.github/actions/check-disk-space@release/2.6
# [see note: pytorch repo ref]
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.6
- name: Start monitoring script
id: monitor-script
if: ${{ !inputs.disable-monitor }}
continue-on-error: true
run: |
${CONDA_RUN} python3 -m tools.stats.monitor > usage_log.txt 2>&1 &
echo "monitor-script-pid=${!}" >> "${GITHUB_OUTPUT}"
- name: Download build artifacts
uses: ./.github/actions/download-build-artifacts
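The relocated monitoring step above (moved ahead of the artifact download; the same block is deleted further down) boils down to this pattern, assuming a GitHub Actions bash step where GITHUB_OUTPUT is defined and CONDA_RUN wraps the conda environment:

```
# Start the resource monitor in the background and keep its PID so a later
# step can stop it; ${!} is the PID of the most recent background job.
${CONDA_RUN} python3 -m tools.stats.monitor > usage_log.txt 2>&1 &
echo "monitor-script-pid=${!}" >> "${GITHUB_OUTPUT}"
```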
@@ -101,20 +109,12 @@ jobs:
use-gha: true
- name: Setup miniconda
uses: pytorch/test-infra/.github/actions/setup-miniconda@main
uses: pytorch/test-infra/.github/actions/setup-miniconda@release/2.6
with:
python-version: ${{ inputs.python-version }}
environment-file: .github/requirements/conda-env-${{ runner.os }}-${{ runner.arch }}
pip-requirements-file: .github/requirements/pip-requirements-${{ runner.os }}.txt
- name: Start monitoring script
id: monitor-script
if: ${{ !inputs.disable-monitor }}
continue-on-error: true
run: |
${CONDA_RUN} python3 -m tools.stats.monitor > usage_log.txt 2>&1 &
echo "monitor-script-pid=${!}" >> "${GITHUB_OUTPUT}"
- name: Parse ref
id: parse-ref
run: .github/scripts/parse_ref.py
@@ -224,7 +224,7 @@ jobs:
file-suffix: ${{ github.job }}-${{ matrix.config }}-${{ matrix.shard }}-${{ matrix.num_shards }}-${{ matrix.runner }}_${{ steps.get-job-id.outputs.job-id }}
- name: Upload the benchmark results
uses: pytorch/test-infra/.github/actions/upload-benchmark-results@main
uses: pytorch/test-infra/.github/actions/upload-benchmark-results@release/2.6
with:
benchmark-results-dir: test/test-reports
dry-run: false
@@ -234,4 +234,4 @@ jobs:
- name: Clean up disk space
if: always()
continue-on-error: true
uses: pytorch/test-infra/.github/actions/check-disk-space@main
uses: pytorch/test-infra/.github/actions/check-disk-space@release/2.6

Some files were not shown because too many files have changed in this diff.